feat: add fsst encode #24234

Closed
SkyFan2002 wants to merge 4 commits into apache:master from SkyFan2002:fsst

@SkyFan2002

Proposed changes

Issue Number: close #xxx

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@github-actions
Contributor

sh-checker report

To get the full details, please check in the job output.

shellcheck errors

'shellcheck ' returned error 1 finding the following syntactical issues:

----------

In be/src/fsst/paper/compare.sh line 4:
  fgrep $i $1 | fgrep -v ${i}2 | fgrep -v ${i}pedia | awk '{ printf "% 16s   %1.2f  %1.2f   % 8.2f   % 8.2f   % 8.2f   % 8.2f\n", $1, $7, $2, $8, $3, $11, $6}'
  ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.
        ^-- SC2248 (style): Prefer double quoting even when variables don't contain special characters.
        ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
           ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.
                         ^--^ SC2248 (style): Prefer double quoting even when variables don't contain special characters.
                                 ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.
                                          ^--^ SC2248 (style): Prefer double quoting even when variables don't contain special characters.

Did you mean: 
  fgrep "${i}" "$1" | fgrep -v "${i}"2 | fgrep -v "${i}"pedia | awk '{ printf "% 16s   %1.2f  %1.2f   % 8.2f   % 8.2f   % 8.2f   % 8.2f\n", $1, $7, $2, $8, $3, $11, $6}'


In be/src/fsst/paper/evolution.sh line 7:
(for i in dbtext/*; do (./cw-strncmp $i 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|dynp-matching|strncmp|scalar" }'
                                     ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                                     ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
(for i in dbtext/*; do (./cw-strncmp "${i}" 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|dynp-matching|strncmp|scalar" }'


In be/src/fsst/paper/evolution.sh line 8:
(for i in dbtext/*; do (./cw $i 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|dynp-matching|str-as-long|scalar"}'
                             ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                             ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
(for i in dbtext/*; do (./cw "${i}" 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|dynp-matching|str-as-long|scalar"}'


In be/src/fsst/paper/evolution.sh line 9:
(for i in dbtext/*; do (./cw-greedy $i 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|greedy-match|str-as-long|scalar" }'
                                    ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                                    ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
(for i in dbtext/*; do (./cw-greedy "${i}" 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|greedy-match|str-as-long|scalar" }'


In be/src/fsst/paper/evolution.sh line 10:
(for i in dbtext/*; do (./vcw $i 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|binary-search|greedy-match|str-as-long|scalar" }'
                              ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                              ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                                         ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.

Did you mean: 
(for i in dbtext/*; do (./vcw "${i}" 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|binary-search|greedy-match|str-as-long|scalar" }'


In be/src/fsst/paper/evolution.sh line 11:
(for i in dbtext/*; do (./hcw $i 511 -adaptive 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|lossy-hash|greedy-match|str-as-long|branch-scalar" }'
                              ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                              ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                                                       ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.

Did you mean: 
(for i in dbtext/*; do (./hcw "${i}" 511 -adaptive 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|lossy-hash|greedy-match|str-as-long|branch-scalar" }'


In be/src/fsst/paper/evolution.sh line 13:
(for i in dbtext/*; do (./hcw-opt $i 511 -adaptive 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|lossy-hash|greedy-match|str-as-long|adaptive-scalar|optimized-construction" }'
                                  ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                                  ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                                                           ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.

Did you mean: 
(for i in dbtext/*; do (./hcw-opt "${i}" 511 -adaptive 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|lossy-hash|greedy-match|str-as-long|adaptive-scalar|optimized-construction" }'


In be/src/fsst/paper/evolution.sh line 14:
(for i in dbtext/*; do (./hcw-opt $i 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|lossy-hash|greedy-match|str-as-long|avx512|optimized-construction" }'
                                  ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                                  ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                                             ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.

Did you mean: 
(for i in dbtext/*; do (./hcw-opt "${i}" 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|lossy-hash|greedy-match|str-as-long|avx512|optimized-construction" }'


In be/src/fsst/paper/kernels.sh line 1:
#/bin/bash
 ^-- SC1113 (error): Use #!, not just #, for the shebang.


In be/src/fsst/paper/kernels.sh line 4:
echo $PARAMS | awk "{for(i=1;i<=NF;i++) printf \"{\\\\footnotesize{X%d\$%s\$}}&\",i,\$i}" | sed 's/simd/simd_/g'
     ^-----^ SC2086 (info): Double quote to prevent globbing and word splitting.
     ^-----^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
echo "${PARAMS}" | awk "{for(i=1;i<=NF;i++) printf \"{\\\\footnotesize{X%d\$%s\$}}&\",i,\$i}" | sed 's/simd/simd_/g'


In be/src/fsst/paper/kernels.sh line 5:
echo "\\\\"
     ^----^ SC2028 (info): echo may not expand escape sequences. Use printf.


In be/src/fsst/paper/kernels.sh line 10:
   for m in $PARAMS
            ^-----^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
   for m in ${PARAMS}


In be/src/fsst/paper/kernels.sh line 12:
     (./hcw-opt dbtext/$i 511 -$m 2>&1) | tail -2 | head -1 | awk '{ printf "%f ", $2 }'
                       ^-- SC2248 (style): Prefer double quoting even when variables don't contain special characters.
                       ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                               ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                               ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
     (./hcw-opt dbtext/"${i}" 511 -"${m}" 2>&1) | tail -2 | head -1 | awk '{ printf "%f ", $2 }'


In be/src/fsst/paper/kernels.sh line 14:
   echo $i
        ^-- SC2248 (style): Prefer double quoting even when variables don't contain special characters.
        ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
   echo "${i}"


In be/src/fsst/paper/lz4-smallblocks.sh line 3:
dd if=$1 of=tmpsplit.out bs=$maxsize count=1 2> /dev/null
      ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                            ^------^ SC2248 (style): Prefer double quoting even when variables don't contain special characters.
                            ^------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
dd if="$1" of=tmpsplit.out bs="${maxsize}" count=1 2> /dev/null


In be/src/fsst/paper/lz4-smallblocks.sh line 5:
    mkdir tmpsplit$blocksize
                  ^--------^ SC2086 (info): Double quote to prevent globbing and word splitting.
                  ^--------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
    mkdir tmpsplit"${blocksize}"


In be/src/fsst/paper/lz4-smallblocks.sh line 6:
    split -b $blocksize tmpsplit.out tmpsplit$blocksize/x
             ^--------^ SC2086 (info): Double quote to prevent globbing and word splitting.
             ^--------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                                             ^--------^ SC2086 (info): Double quote to prevent globbing and word splitting.
                                             ^--------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
    split -b "${blocksize}" tmpsplit.out tmpsplit"${blocksize}"/x


In be/src/fsst/paper/lz4-smallblocks.sh line 7:
    echo -n $blocksize ""
            ^--------^ SC2086 (info): Double quote to prevent globbing and word splitting.
            ^--------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
    echo -n "${blocksize}" ""


In be/src/fsst/paper/lz4-smallblocks.sh line 8:
    size=$((for f in tmpsplit$blocksize/x*; do lz4 -c $f | wc -c; done) | awk '{s+=$1} END {print s}')
         ^-- SC1102 (error): Shells disambiguate $(( differently or not at all. For $(command substitution), add space after $( . For $((arithmetics)), fix parsing errors.
                             ^--------^ SC2231 (info): Quote expansions in this for loop glob to prevent wordsplitting, e.g. "$dir"/*.txt .
                             ^--------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                                                      ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                                                      ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
    size=$((for f in tmpsplit${blocksize}/x*; do lz4 -c "${f}" | wc -c; done) | awk '{s+=$1} END {print s}')


In be/src/fsst/paper/lz4-smallblocks.sh line 9:
    echo "$maxsize / $size" | bc -l
          ^------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                     ^---^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
    echo "${maxsize} / ${size}" | bc -l


In be/src/fsst/paper/lz4-smallblocks.sh line 10:
    rm -rf tmpsplit$blocksize/
                   ^--------^ SC2086 (info): Double quote to prevent globbing and word splitting.
                   ^--------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
    rm -rf tmpsplit"${blocksize}"/


In be/src/fsst/paper/sorted.sh line 8:
cd dbtext
^-------^ SC2164 (warning): Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

Did you mean: 
cd dbtext || exit


In be/src/fsst/paper/sorted.sh line 11:
  sort $i > ../.sorted/$i; 
       ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
       ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                       ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                       ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
  sort "${i}" > ../.sorted/"${i}"; 


In be/src/fsst/paper/sorted.sh line 14:
cd ..
^---^ SC2103 (info): Use a ( subshell ) to avoid having to cd back.


In be/src/fsst/paper/sorted.sh line 19:
  ./filtertest compare 1000 dbtext/$i | tail -1 | awk '{ printf "% 16s %1.2f %1.2f ",$1,$2,$7}'
                                   ^-- SC2248 (style): Prefer double quoting even when variables don't contain special characters.
                                   ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
  ./filtertest compare 1000 dbtext/"${i}" | tail -1 | awk '{ printf "% 16s %1.2f %1.2f ",$1,$2,$7}'


In be/src/fsst/paper/sorted.sh line 20:
  ./filtertest compare 1000 .sorted/$i | tail -1 | awk '{ printf "%1.2f %1.2f\n",$2,$7}'
                                    ^-- SC2248 (style): Prefer double quoting even when variables don't contain special characters.
                                    ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
  ./filtertest compare 1000 .sorted/"${i}" | tail -1 | awk '{ printf "%1.2f %1.2f\n",$2,$7}'

For more information:
  https://www.shellcheck.net/wiki/SC1102 -- Shells disambiguate $(( different...
  https://www.shellcheck.net/wiki/SC1113 -- Use #!, not just #, for the sheba...
  https://www.shellcheck.net/wiki/SC2164 -- Use 'cd ... || exit' or 'cd ... |...
----------
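
The three findings linked under "For more information" (SC1102, SC1113, SC2164) are the error- and warning-level ones in this report. A minimal sketch of how each could be fixed follows. The function names are hypothetical and not part of the PR, and `wc -c` alone stands in for `lz4 -c ... | wc -c` so the example runs without lz4 installed.

```shell
#!/bin/bash
# SC1113: the shebang above starts with "#!", not just "#".

# SC1102: the space in "$( (" marks a command substitution wrapping a
# subshell, so the shell no longer tries to parse "$((" as arithmetic.
# Hypothetical helper; `wc -c` stands in for `lz4 -c "${f}" | wc -c`.
total_bytes() {
    local dir="$1"
    local size
    size=$( (for f in "${dir}"/x*; do wc -c <"${f}"; done) | awk '{s+=$1} END {print s}')
    echo "${size}"
}

# SC2164 and SC2103: change directory inside a subshell and stop if cd
# fails, so the caller's working directory is never changed.
list_dir() {
    (
        cd "$1" || exit 1
        ls
    )
}
```

Dropping the inner parentheses entirely, as in `size=$(for ...; done | awk ...)`, would likely avoid the ambiguity as well.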

You can address the above issues in one of three ways:
1. Manually correct the issue in the offending shell script;
2. Disable specific issues by adding the comment:
  # shellcheck disable=NNNN
above the line that contains the issue, where NNNN is the error code;
3. Add '-e NNNN' to the SHELLCHECK_OPTS setting in your .yml action file.
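
As a rough illustration of the first two options, assuming hypothetical function names that are not part of the PR: the first function applies the `grep -F` and quoting fixes suggested for compare.sh, and the second keeps an intentional unquoted expansion but silences SC2086 for that one command.

```shell
#!/bin/bash
# Option 1: correct the script. `grep -F` replaces the deprecated `fgrep`
# (SC2197) and the expansions are quoted and braced (SC2086, SC2248, SC2250).
filter_rows() {
    local name="$1" file="$2"
    grep -F "${name}" "${file}" | grep -F -v "${name}2" | grep -F -v "${name}pedia"
}

# Option 2: disable one code for the next command only, here because the
# word splitting of ${params} is intentional.
count_params() {
    local params='simd1 simd2 simd3 simd4 adaptive'
    # shellcheck disable=SC2086
    set -- ${params}
    echo "$#"
}
```

For the third option, the command-line equivalent is `shellcheck -e SC2086 script.sh`, where `-e` excludes the listed codes.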



shfmt errors

'shfmt ' returned error 1 finding the following formatting issues:

----------
--- be/src/fsst/paper/compare.sh.orig
+++ be/src/fsst/paper/compare.sh
@@ -1,5 +1,4 @@
 #!/bin/bash
-(for i in hex yago email wiki uuid urls2 urls firstname lastname city credentials street movies faust hamlet chinese japanese wikipedia genome location c_name l_commen ps_comment 
- do
-  fgrep $i $1 | fgrep -v ${i}2 | fgrep -v ${i}pedia | awk '{ printf "% 16s   %1.2f  %1.2f   % 8.2f   % 8.2f   % 8.2f   % 8.2f\n", $1, $7, $2, $8, $3, $11, $6}'
- done) | awk '{print$0;k++;for(i=2;i<=NF;i++) r[i]+=$i;}END{printf "% 16s   %1.2f  %1.2f   % 8.2f   % 8.2f   % 8.2f   % 8.2f\n", "AVG",r[2]/k,r[3]/k,r[4]/k,r[5]/k,r[6]/k,r[7]/k,r[8]/k}'
+(for i in hex yago email wiki uuid urls2 urls firstname lastname city credentials street movies faust hamlet chinese japanese wikipedia genome location c_name l_commen ps_comment; do
+    fgrep $i $1 | fgrep -v ${i}2 | fgrep -v ${i}pedia | awk '{ printf "% 16s   %1.2f  %1.2f   % 8.2f   % 8.2f   % 8.2f   % 8.2f\n", $1, $7, $2, $8, $3, $11, $6}'
+done) | awk '{print$0;k++;for(i=2;i<=NF;i++) r[i]+=$i;}END{printf "% 16s   %1.2f  %1.2f   % 8.2f   % 8.2f   % 8.2f   % 8.2f\n", "AVG",r[2]/k,r[3]/k,r[4]/k,r[5]/k,r[6]/k,r[7]/k,r[8]/k}'
--- be/src/fsst/paper/evolution.sh.orig
+++ be/src/fsst/paper/evolution.sh
@@ -1,7 +1,7 @@
 #!/bin/bash
 # output format: STCB CCB CR
 # STCB: symbol table construction cost in cycles-per-compressed byte (constructing a new ST per 8MB text)
-# CCB:  compression speed cycles-per-compressed byte 
+# CCB:  compression speed cycles-per-compressed byte
 # CR:   compression (=size reduction) factor achieved
 
 (for i in dbtext/*; do (./cw-strncmp $i 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|dynp-matching|strncmp|scalar" }'
@@ -16,10 +16,10 @@
 # on Intel SKX CPUs| the results look like:
 #
 # 75.117,160.11,1.97194 iterative|suffix-array|dynp-matching|strncmp|scalar
-#   \--> 160 cycles per byte produces a very slow compression speed (say ~20MB/s on a 3Ghz CPU) 
+#   \--> 160 cycles per byte produces a very slow compression speed (say ~20MB/s on a 3Ghz CPU)
 #
 # 73.6948,81.6404,1.97194 iterative|suffix-array|dynp-matching|str-as-long|scalar
-#   \--> str-as-long (i.e. FSST focusing on 8-byte word symbols) improves compression speed 2x 
+#   \--> str-as-long (i.e. FSST focusing on 8-byte word symbols) improves compression speed 2x
 #
 # 74.4996,37.457,1.94764 iterative|suffix-array|greedy-match|str-as-long|scalar
 #   \--> dynamic programming brought only 3% smaller size. So drop it and gain another 2x compression speed.
@@ -28,7 +28,7 @@
 #   \--> bottom-up is *really* better in terms of compression factor than iterative with suffix array.
 #
 # 1.74783,10.7009,2.28103 bottom-up|lossy-hash|greedy-match|str-as-long|scalar-branch
-#   \--> hashing significantly improves compression speed at only 5% size cost (due to hash collisions) 
+#   \--> hashing significantly improves compression speed at only 5% size cost (due to hash collisions)
 #
 # 1.74783,9.8142,2.28103 bottom-up|lossy-hash|greedy-match|str-as-long|scalar-adaptive
 #   \--> adaptive use of encoding kernels gives compression speed a small bump
@@ -39,4 +39,4 @@
 # optimized construction refers to the combination of three changes:
 # - reducing the amount of bottom-up passes from 10 to 5 (less learning time, but.. slighty worsens CR)
 # - looking at subsamples in early rounds (increasing the sample as the rounds go up). Less compression work.
-# - splitting the counters for less cache pressure and aiding fast skipping over counts-of-0 
+# - splitting the counters for less cache pressure and aiding fast skipping over counts-of-0
--- be/src/fsst/paper/kernels.sh.orig
+++ be/src/fsst/paper/kernels.sh
@@ -1,15 +1,15 @@
 #/bin/bash
 PARAMS='simd1 simd2 simd3 simd4 adaptive'
-(echo | awk '{ print "{\\begin{tabular}{|rrrr|r|l|}\n\\hline"}'
-echo $PARAMS | awk "{for(i=1;i<=NF;i++) printf \"{\\\\footnotesize{X%d\$%s\$}}&\",i,\$i}" | sed 's/simd/simd_/g'
-echo "\\\\"
-echo "\\hline"
-echo "\\hline"
-(for i in hex yago email wiki uuid urls2 urls firstname lastname city credentials street movies faust hamlet chinese japanese wikipedia genome location c_name l_comment ps_comment 
- do 
-   for m in $PARAMS
-   do
-     (./hcw-opt dbtext/$i 511 -$m 2>&1) | tail -2 | head -1 | awk '{ printf "%f ", $2 }'
-   done
-   echo $i
- done) | awk '{for(i=1;i<NF;i++){r[i]+=$i;printf "{\\footnotesize{X%d%5.2f}}& ",i,$i}k++;printf "{\\footnotesize %s}\\\\\n",$NF}END{print "\\hline"; for(j=1;j<i;j++)printf "{\\footnotesize{X%d%5.2f}}& ",j,r[j]/k;print "{\\footnotesize average}\\\\\n\\hline\n\\end{tabular}}"}' | sed 's/_/\\_/g' | sed 's/[0-9]*-//') | sed 's/X[38]/\\bf /g' | sed 's/X[1-9]//g' | sed 's/adaptive/scalar/' 
+(
+    echo | awk '{ print "{\\begin{tabular}{|rrrr|r|l|}\n\\hline"}'
+    echo $PARAMS | awk "{for(i=1;i<=NF;i++) printf \"{\\\\footnotesize{X%d\$%s\$}}&\",i,\$i}" | sed 's/simd/simd_/g'
+    echo "\\\\"
+    echo "\\hline"
+    echo "\\hline"
+    (for i in hex yago email wiki uuid urls2 urls firstname lastname city credentials street movies faust hamlet chinese japanese wikipedia genome location c_name l_comment ps_comment; do
+        for m in $PARAMS; do
+            (./hcw-opt dbtext/$i 511 -$m 2>&1) | tail -2 | head -1 | awk '{ printf "%f ", $2 }'
+        done
+        echo $i
+    done) | awk '{for(i=1;i<NF;i++){r[i]+=$i;printf "{\\footnotesize{X%d%5.2f}}& ",i,$i}k++;printf "{\\footnotesize %s}\\\\\n",$NF}END{print "\\hline"; for(j=1;j<i;j++)printf "{\\footnotesize{X%d%5.2f}}& ",j,r[j]/k;print "{\\footnotesize average}\\\\\n\\hline\n\\end{tabular}}"}' | sed 's/_/\\_/g' | sed 's/[0-9]*-//'
+) | sed 's/X[38]/\\bf /g' | sed 's/X[1-9]//g' | sed 's/adaptive/scalar/'
be/src/fsst/paper/lz4-smallblocks.sh:8:17: not a valid arithmetic operator: f
--- be/src/fsst/paper/sorted.sh.orig
+++ be/src/fsst/paper/sorted.sh
@@ -6,17 +6,15 @@
 rm -rf .sorted 2>/dev/null
 mkdir .sorted
 cd dbtext
-for i in * 
-do 
-  sort $i > ../.sorted/$i; 
+for i in *; do
+    sort $i >../.sorted/$i
 done
 cp chinese japanese faust hamlet ../.sorted/
 cd ..
 
 # note sizes, display stats
-(for i in hex yago email wiki uuid urls2 urls firstname lastname city credentials street movies faust hamlet chinese japanese wikipedia genome location c_name l_comment ps_comment
- do 
-  ./filtertest compare 1000 dbtext/$i | tail -1 | awk '{ printf "% 16s %1.2f %1.2f ",$1,$2,$7}'
-  ./filtertest compare 1000 .sorted/$i | tail -1 | awk '{ printf "%1.2f %1.2f\n",$2,$7}'
- done) | 
-awk '{ s1+=$2; s2+=$3; s3+=$4; s4+=$5; k++; print $0} END {printf "% 16s %1.2f% 1.2f %1.2f %1.2f\n", "avg",s1/k, s2/k, s3/k, s4/k}'
+(for i in hex yago email wiki uuid urls2 urls firstname lastname city credentials street movies faust hamlet chinese japanese wikipedia genome location c_name l_comment ps_comment; do
+    ./filtertest compare 1000 dbtext/$i | tail -1 | awk '{ printf "% 16s %1.2f %1.2f ",$1,$2,$7}'
+    ./filtertest compare 1000 .sorted/$i | tail -1 | awk '{ printf "%1.2f %1.2f\n",$2,$7}'
+done) |
+    awk '{ s1+=$2; s2+=$3; s3+=$4; s4+=$5; k++; print $0} END {printf "% 16s %1.2f% 1.2f %1.2f %1.2f\n", "avg",s1/k, s2/k, s3/k, s4/k}'
----------

You can reformat the above files to meet shfmt's requirements by typing:

  shfmt  -w filename
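
To preview the changes without rewriting the file, `shfmt -d filename` prints the diff instead. The sketch below shows roughly the layout the diff above asks for, using a hypothetical function that is not part of the PR. The four-space indentation is assumed to come from the repository's formatter settings.

```shell
#!/bin/bash
# Layout matching the shfmt diff above: "; do" stays on the for line, bodies
# are indented by four spaces, and a line ending in "|" continues indented.
average_lengths() {
    (for w in "$@"; do
        echo "${#w}"
    done) |
        awk '{ s += $1; k++ } END { printf "%.2f\n", s / k }'
}
```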


@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@SkyFan2002
Author

run buildall

@github-actions
Contributor

sh-checker report

To get the full details, please check in the job output.

shellcheck errors

'shellcheck ' returned error 1 finding the following syntactical issues:

----------

In be/src/fsst/paper/compare.sh line 4:
  fgrep $i $1 | fgrep -v ${i}2 | fgrep -v ${i}pedia | awk '{ printf "% 16s   %1.2f  %1.2f   % 8.2f   % 8.2f   % 8.2f   % 8.2f\n", $1, $7, $2, $8, $3, $11, $6}'
  ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.
        ^-- SC2248 (style): Prefer double quoting even when variables don't contain special characters.
        ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
           ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.
                         ^--^ SC2248 (style): Prefer double quoting even when variables don't contain special characters.
                                 ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.
                                          ^--^ SC2248 (style): Prefer double quoting even when variables don't contain special characters.

Did you mean: 
  fgrep "${i}" "$1" | fgrep -v "${i}"2 | fgrep -v "${i}"pedia | awk '{ printf "% 16s   %1.2f  %1.2f   % 8.2f   % 8.2f   % 8.2f   % 8.2f\n", $1, $7, $2, $8, $3, $11, $6}'


In be/src/fsst/paper/evolution.sh line 7:
(for i in dbtext/*; do (./cw-strncmp $i 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|dynp-matching|strncmp|scalar" }'
                                     ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                                     ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
(for i in dbtext/*; do (./cw-strncmp "${i}" 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|dynp-matching|strncmp|scalar" }'


In be/src/fsst/paper/evolution.sh line 8:
(for i in dbtext/*; do (./cw $i 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|dynp-matching|str-as-long|scalar"}'
                             ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                             ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
(for i in dbtext/*; do (./cw "${i}" 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|dynp-matching|str-as-long|scalar"}'


In be/src/fsst/paper/evolution.sh line 9:
(for i in dbtext/*; do (./cw-greedy $i 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|greedy-match|str-as-long|scalar" }'
                                    ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                                    ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
(for i in dbtext/*; do (./cw-greedy "${i}" 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|greedy-match|str-as-long|scalar" }'


In be/src/fsst/paper/evolution.sh line 10:
(for i in dbtext/*; do (./vcw $i 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|binary-search|greedy-match|str-as-long|scalar" }'
                              ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                              ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                                         ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.

Did you mean: 
(for i in dbtext/*; do (./vcw "${i}" 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|binary-search|greedy-match|str-as-long|scalar" }'


In be/src/fsst/paper/evolution.sh line 11:
(for i in dbtext/*; do (./hcw $i 511 -adaptive 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|lossy-hash|greedy-match|str-as-long|branch-scalar" }'
                              ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                              ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                                                       ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.

Did you mean: 
(for i in dbtext/*; do (./hcw "${i}" 511 -adaptive 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|lossy-hash|greedy-match|str-as-long|branch-scalar" }'


In be/src/fsst/paper/evolution.sh line 13:
(for i in dbtext/*; do (./hcw-opt $i 511 -adaptive 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|lossy-hash|greedy-match|str-as-long|adaptive-scalar|optimized-construction" }'
                                  ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                                  ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                                                           ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.

Did you mean: 
(for i in dbtext/*; do (./hcw-opt "${i}" 511 -adaptive 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|lossy-hash|greedy-match|str-as-long|adaptive-scalar|optimized-construction" }'


In be/src/fsst/paper/evolution.sh line 14:
(for i in dbtext/*; do (./hcw-opt $i 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|lossy-hash|greedy-match|str-as-long|avx512|optimized-construction" }'
                                  ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                                  ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                                             ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.

Did you mean: 
(for i in dbtext/*; do (./hcw-opt "${i}" 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|lossy-hash|greedy-match|str-as-long|avx512|optimized-construction" }'


In be/src/fsst/paper/kernels.sh line 1:
#/bin/bash
 ^-- SC1113 (error): Use #!, not just #, for the shebang.


In be/src/fsst/paper/kernels.sh line 4:
echo $PARAMS | awk "{for(i=1;i<=NF;i++) printf \"{\\\\footnotesize{X%d\$%s\$}}&\",i,\$i}" | sed 's/simd/simd_/g'
     ^-----^ SC2086 (info): Double quote to prevent globbing and word splitting.
     ^-----^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
echo "${PARAMS}" | awk "{for(i=1;i<=NF;i++) printf \"{\\\\footnotesize{X%d\$%s\$}}&\",i,\$i}" | sed 's/simd/simd_/g'


In be/src/fsst/paper/kernels.sh line 5:
echo "\\\\"
     ^----^ SC2028 (info): echo may not expand escape sequences. Use printf.


In be/src/fsst/paper/kernels.sh line 10:
   for m in $PARAMS
            ^-----^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
   for m in ${PARAMS}


In be/src/fsst/paper/kernels.sh line 12:
     (./hcw-opt dbtext/$i 511 -$m 2>&1) | tail -2 | head -1 | awk '{ printf "%f ", $2 }'
                       ^-- SC2248 (style): Prefer double quoting even when variables don't contain special characters.
                       ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                               ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                               ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
     (./hcw-opt dbtext/"${i}" 511 -"${m}" 2>&1) | tail -2 | head -1 | awk '{ printf "%f ", $2 }'


In be/src/fsst/paper/kernels.sh line 14:
   echo $i
        ^-- SC2248 (style): Prefer double quoting even when variables don't contain special characters.
        ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
   echo "${i}"


In be/src/fsst/paper/lz4-smallblocks.sh line 3:
dd if=$1 of=tmpsplit.out bs=$maxsize count=1 2> /dev/null
      ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                            ^------^ SC2248 (style): Prefer double quoting even when variables don't contain special characters.
                            ^------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
dd if="$1" of=tmpsplit.out bs="${maxsize}" count=1 2> /dev/null


In be/src/fsst/paper/lz4-smallblocks.sh line 5:
    mkdir tmpsplit$blocksize
                  ^--------^ SC2086 (info): Double quote to prevent globbing and word splitting.
                  ^--------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
    mkdir tmpsplit"${blocksize}"


In be/src/fsst/paper/lz4-smallblocks.sh line 6:
    split -b $blocksize tmpsplit.out tmpsplit$blocksize/x
             ^--------^ SC2086 (info): Double quote to prevent globbing and word splitting.
             ^--------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                                             ^--------^ SC2086 (info): Double quote to prevent globbing and word splitting.
                                             ^--------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
    split -b "${blocksize}" tmpsplit.out tmpsplit"${blocksize}"/x


In be/src/fsst/paper/lz4-smallblocks.sh line 7:
    echo -n $blocksize ""
            ^--------^ SC2086 (info): Double quote to prevent globbing and word splitting.
            ^--------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
    echo -n "${blocksize}" ""


In be/src/fsst/paper/lz4-smallblocks.sh line 8:
    size=$((for f in tmpsplit$blocksize/x*; do lz4 -c $f | wc -c; done) | awk '{s+=$1} END {print s}')
         ^-- SC1102 (error): Shells disambiguate $(( differently or not at all. For $(command substitution), add space after $( . For $((arithmetics)), fix parsing errors.
                             ^--------^ SC2231 (info): Quote expansions in this for loop glob to prevent wordsplitting, e.g. "$dir"/*.txt .
                             ^--------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                                                      ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                                                      ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
    size=$((for f in tmpsplit${blocksize}/x*; do lz4 -c "${f}" | wc -c; done) | awk '{s+=$1} END {print s}')
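
Note that the suggestion above still starts with `$((`, so the SC1102 error itself remains. A minimal sketch of the fix the message describes (a space after `$(`) follows. To stay runnable without lz4 it sums plain file sizes with wc instead of compressed sizes; the function name sum_sizes and the demo files are assumptions for illustration only.

```shell
# Hypothetical helper: sum the byte counts of all x* files in a directory.
# The space in "$( (" makes the shell parse a command substitution that
# wraps a subshell, instead of trying to read an arithmetic expansion.
sum_sizes() {
    dir=$1
    total=$( (for f in "${dir}"/x*; do wc -c <"${f}"; done) | awk '{s+=$1} END {print s}')
    printf '%s\n' "${total}"
}

# Demo with two small files (3 bytes and 2 bytes).
demo_dir=$(mktemp -d)
printf 'abc' >"${demo_dir}/xaa"
printf 'de' >"${demo_dir}/xab"
sum_sizes "${demo_dir}"
rm -rf "${demo_dir}"
```

In lz4-smallblocks.sh the same shape would presumably apply with `lz4 -c "${f}" | wc -c` inside the loop; only the added space and the quoting change.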


In be/src/fsst/paper/lz4-smallblocks.sh line 9:
    echo "$maxsize / $size" | bc -l
          ^------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                     ^---^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
    echo "${maxsize} / ${size}" | bc -l


In be/src/fsst/paper/lz4-smallblocks.sh line 10:
    rm -rf tmpsplit$blocksize/
                   ^--------^ SC2086 (info): Double quote to prevent globbing and word splitting.
                   ^--------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
    rm -rf tmpsplit"${blocksize}"/


In be/src/fsst/paper/sorted.sh line 8:
cd dbtext
^-------^ SC2164 (warning): Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

Did you mean: 
cd dbtext || exit


In be/src/fsst/paper/sorted.sh line 11:
  sort $i > ../.sorted/$i; 
       ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
       ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                       ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                       ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
  sort "${i}" > ../.sorted/"${i}"; 


In be/src/fsst/paper/sorted.sh line 14:
cd ..
^---^ SC2103 (info): Use a ( subshell ) to avoid having to cd back.
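
The SC2164 and SC2103 findings for sorted.sh can be handled together. The following is only a sketch under assumptions: the function name sort_into and its two arguments are hypothetical, and the destination is assumed to be an absolute path. Running the loop in a subshell means the directory change is undone automatically, and `cd ... || exit` stops the subshell if the directory is missing.

```shell
# Hypothetical helper: write a sorted copy of every regular file in
# src into dst. dst must be an absolute path because the loop runs
# after changing directory.
sort_into() {
    src=$1
    dst=$2
    mkdir -p "${dst}" || return
    (
        # The exit only leaves this subshell; the caller's working
        # directory is never changed.
        cd "${src}" || exit
        for f in *; do
            [ -f "${f}" ] || continue
            sort "${f}" >"${dst}/${f}"
        done
    )
}
```

Because the subshell's exit status becomes the function's return status, a failed cd is still visible to the caller.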


In be/src/fsst/paper/sorted.sh line 19:
  ./filtertest compare 1000 dbtext/$i | tail -1 | awk '{ printf "% 16s %1.2f %1.2f ",$1,$2,$7}'
                                   ^-- SC2248 (style): Prefer double quoting even when variables don't contain special characters.
                                   ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
  ./filtertest compare 1000 dbtext/"${i}" | tail -1 | awk '{ printf "% 16s %1.2f %1.2f ",$1,$2,$7}'


In be/src/fsst/paper/sorted.sh line 20:
  ./filtertest compare 1000 .sorted/$i | tail -1 | awk '{ printf "%1.2f %1.2f\n",$2,$7}'
                                    ^-- SC2248 (style): Prefer double quoting even when variables don't contain special characters.
                                    ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
  ./filtertest compare 1000 .sorted/"${i}" | tail -1 | awk '{ printf "%1.2f %1.2f\n",$2,$7}'

For more information:
  https://www.shellcheck.net/wiki/SC1102 -- Shells disambiguate $(( different...
  https://www.shellcheck.net/wiki/SC1113 -- Use #!, not just #, for the sheba...
  https://www.shellcheck.net/wiki/SC2164 -- Use 'cd ... || exit' or 'cd ... |...
----------

You can address the above issues in one of three ways:
1. Manually correct the issue in the offending shell script;
2. Disable specific issues by adding the comment:
  # shellcheck disable=NNNN
above the line that contains the issue, where NNNN is the error code;
3. Add '-e NNNN' to the SHELLCHECK_OPTS setting in your .yml action file.



shfmt errors

'shfmt ' returned error 1 finding the following formatting issues:

----------
--- be/src/fsst/paper/compare.sh.orig
+++ be/src/fsst/paper/compare.sh
@@ -1,5 +1,4 @@
 #!/bin/bash
-(for i in hex yago email wiki uuid urls2 urls firstname lastname city credentials street movies faust hamlet chinese japanese wikipedia genome location c_name l_commen ps_comment 
- do
-  fgrep $i $1 | fgrep -v ${i}2 | fgrep -v ${i}pedia | awk '{ printf "% 16s   %1.2f  %1.2f   % 8.2f   % 8.2f   % 8.2f   % 8.2f\n", $1, $7, $2, $8, $3, $11, $6}'
- done) | awk '{print$0;k++;for(i=2;i<=NF;i++) r[i]+=$i;}END{printf "% 16s   %1.2f  %1.2f   % 8.2f   % 8.2f   % 8.2f   % 8.2f\n", "AVG",r[2]/k,r[3]/k,r[4]/k,r[5]/k,r[6]/k,r[7]/k,r[8]/k}'
+(for i in hex yago email wiki uuid urls2 urls firstname lastname city credentials street movies faust hamlet chinese japanese wikipedia genome location c_name l_commen ps_comment; do
+    fgrep $i $1 | fgrep -v ${i}2 | fgrep -v ${i}pedia | awk '{ printf "% 16s   %1.2f  %1.2f   % 8.2f   % 8.2f   % 8.2f   % 8.2f\n", $1, $7, $2, $8, $3, $11, $6}'
+done) | awk '{print$0;k++;for(i=2;i<=NF;i++) r[i]+=$i;}END{printf "% 16s   %1.2f  %1.2f   % 8.2f   % 8.2f   % 8.2f   % 8.2f\n", "AVG",r[2]/k,r[3]/k,r[4]/k,r[5]/k,r[6]/k,r[7]/k,r[8]/k}'
--- be/src/fsst/paper/evolution.sh.orig
+++ be/src/fsst/paper/evolution.sh
@@ -1,7 +1,7 @@
 #!/bin/bash
 # output format: STCB CCB CR
 # STCB: symbol table construction cost in cycles-per-compressed byte (constructing a new ST per 8MB text)
-# CCB:  compression speed cycles-per-compressed byte 
+# CCB:  compression speed cycles-per-compressed byte
 # CR:   compression (=size reduction) factor achieved
 
 (for i in dbtext/*; do (./cw-strncmp $i 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|dynp-matching|strncmp|scalar" }'
@@ -16,10 +16,10 @@
 # on Intel SKX CPUs| the results look like:
 #
 # 75.117,160.11,1.97194 iterative|suffix-array|dynp-matching|strncmp|scalar
-#   \--> 160 cycles per byte produces a very slow compression speed (say ~20MB/s on a 3Ghz CPU) 
+#   \--> 160 cycles per byte produces a very slow compression speed (say ~20MB/s on a 3Ghz CPU)
 #
 # 73.6948,81.6404,1.97194 iterative|suffix-array|dynp-matching|str-as-long|scalar
-#   \--> str-as-long (i.e. FSST focusing on 8-byte word symbols) improves compression speed 2x 
+#   \--> str-as-long (i.e. FSST focusing on 8-byte word symbols) improves compression speed 2x
 #
 # 74.4996,37.457,1.94764 iterative|suffix-array|greedy-match|str-as-long|scalar
 #   \--> dynamic programming brought only 3% smaller size. So drop it and gain another 2x compression speed.
@@ -28,7 +28,7 @@
 #   \--> bottom-up is *really* better in terms of compression factor than iterative with suffix array.
 #
 # 1.74783,10.7009,2.28103 bottom-up|lossy-hash|greedy-match|str-as-long|scalar-branch
-#   \--> hashing significantly improves compression speed at only 5% size cost (due to hash collisions) 
+#   \--> hashing significantly improves compression speed at only 5% size cost (due to hash collisions)
 #
 # 1.74783,9.8142,2.28103 bottom-up|lossy-hash|greedy-match|str-as-long|scalar-adaptive
 #   \--> adaptive use of encoding kernels gives compression speed a small bump
@@ -39,4 +39,4 @@
 # optimized construction refers to the combination of three changes:
 # - reducing the amount of bottom-up passes from 10 to 5 (less learning time, but.. slighty worsens CR)
 # - looking at subsamples in early rounds (increasing the sample as the rounds go up). Less compression work.
-# - splitting the counters for less cache pressure and aiding fast skipping over counts-of-0 
+# - splitting the counters for less cache pressure and aiding fast skipping over counts-of-0
--- be/src/fsst/paper/kernels.sh.orig
+++ be/src/fsst/paper/kernels.sh
@@ -1,15 +1,15 @@
 #/bin/bash
 PARAMS='simd1 simd2 simd3 simd4 adaptive'
-(echo | awk '{ print "{\\begin{tabular}{|rrrr|r|l|}\n\\hline"}'
-echo $PARAMS | awk "{for(i=1;i<=NF;i++) printf \"{\\\\footnotesize{X%d\$%s\$}}&\",i,\$i}" | sed 's/simd/simd_/g'
-echo "\\\\"
-echo "\\hline"
-echo "\\hline"
-(for i in hex yago email wiki uuid urls2 urls firstname lastname city credentials street movies faust hamlet chinese japanese wikipedia genome location c_name l_comment ps_comment 
- do 
-   for m in $PARAMS
-   do
-     (./hcw-opt dbtext/$i 511 -$m 2>&1) | tail -2 | head -1 | awk '{ printf "%f ", $2 }'
-   done
-   echo $i
- done) | awk '{for(i=1;i<NF;i++){r[i]+=$i;printf "{\\footnotesize{X%d%5.2f}}& ",i,$i}k++;printf "{\\footnotesize %s}\\\\\n",$NF}END{print "\\hline"; for(j=1;j<i;j++)printf "{\\footnotesize{X%d%5.2f}}& ",j,r[j]/k;print "{\\footnotesize average}\\\\\n\\hline\n\\end{tabular}}"}' | sed 's/_/\\_/g' | sed 's/[0-9]*-//') | sed 's/X[38]/\\bf /g' | sed 's/X[1-9]//g' | sed 's/adaptive/scalar/' 
+(
+    echo | awk '{ print "{\\begin{tabular}{|rrrr|r|l|}\n\\hline"}'
+    echo $PARAMS | awk "{for(i=1;i<=NF;i++) printf \"{\\\\footnotesize{X%d\$%s\$}}&\",i,\$i}" | sed 's/simd/simd_/g'
+    echo "\\\\"
+    echo "\\hline"
+    echo "\\hline"
+    (for i in hex yago email wiki uuid urls2 urls firstname lastname city credentials street movies faust hamlet chinese japanese wikipedia genome location c_name l_comment ps_comment; do
+        for m in $PARAMS; do
+            (./hcw-opt dbtext/$i 511 -$m 2>&1) | tail -2 | head -1 | awk '{ printf "%f ", $2 }'
+        done
+        echo $i
+    done) | awk '{for(i=1;i<NF;i++){r[i]+=$i;printf "{\\footnotesize{X%d%5.2f}}& ",i,$i}k++;printf "{\\footnotesize %s}\\\\\n",$NF}END{print "\\hline"; for(j=1;j<i;j++)printf "{\\footnotesize{X%d%5.2f}}& ",j,r[j]/k;print "{\\footnotesize average}\\\\\n\\hline\n\\end{tabular}}"}' | sed 's/_/\\_/g' | sed 's/[0-9]*-//'
+) | sed 's/X[38]/\\bf /g' | sed 's/X[1-9]//g' | sed 's/adaptive/scalar/'
be/src/fsst/paper/lz4-smallblocks.sh:8:17: not a valid arithmetic operator: f
--- be/src/fsst/paper/sorted.sh.orig
+++ be/src/fsst/paper/sorted.sh
@@ -6,17 +6,15 @@
 rm -rf .sorted 2>/dev/null
 mkdir .sorted
 cd dbtext
-for i in * 
-do 
-  sort $i > ../.sorted/$i; 
+for i in *; do
+    sort $i >../.sorted/$i
 done
 cp chinese japanese faust hamlet ../.sorted/
 cd ..
 
 # note sizes, display stats
-(for i in hex yago email wiki uuid urls2 urls firstname lastname city credentials street movies faust hamlet chinese japanese wikipedia genome location c_name l_comment ps_comment
- do 
-  ./filtertest compare 1000 dbtext/$i | tail -1 | awk '{ printf "% 16s %1.2f %1.2f ",$1,$2,$7}'
-  ./filtertest compare 1000 .sorted/$i | tail -1 | awk '{ printf "%1.2f %1.2f\n",$2,$7}'
- done) | 
-awk '{ s1+=$2; s2+=$3; s3+=$4; s4+=$5; k++; print $0} END {printf "% 16s %1.2f% 1.2f %1.2f %1.2f\n", "avg",s1/k, s2/k, s3/k, s4/k}'
+(for i in hex yago email wiki uuid urls2 urls firstname lastname city credentials street movies faust hamlet chinese japanese wikipedia genome location c_name l_comment ps_comment; do
+    ./filtertest compare 1000 dbtext/$i | tail -1 | awk '{ printf "% 16s %1.2f %1.2f ",$1,$2,$7}'
+    ./filtertest compare 1000 .sorted/$i | tail -1 | awk '{ printf "%1.2f %1.2f\n",$2,$7}'
+done) |
+    awk '{ s1+=$2; s2+=$3; s3+=$4; s4+=$5; k++; print $0} END {printf "% 16s %1.2f% 1.2f %1.2f %1.2f\n", "avg",s1/k, s2/k, s3/k, s4/k}'
----------

You can reformat the above files to meet shfmt's requirements by typing:

  shfmt  -w filename


@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@SkyFan2002
Author

run buildall

@github-actions
Contributor

sh-checker report

To get the full details, please check in the job output.

shellcheck errors

'shellcheck ' returned error 1 finding the following syntactical issues:

----------

In be/src/fsst/paper/compare.sh line 4:
  fgrep $i $1 | fgrep -v ${i}2 | fgrep -v ${i}pedia | awk '{ printf "% 16s   %1.2f  %1.2f   % 8.2f   % 8.2f   % 8.2f   % 8.2f\n", $1, $7, $2, $8, $3, $11, $6}'
  ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.
        ^-- SC2248 (style): Prefer double quoting even when variables don't contain special characters.
        ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
           ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.
                         ^--^ SC2248 (style): Prefer double quoting even when variables don't contain special characters.
                                 ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.
                                          ^--^ SC2248 (style): Prefer double quoting even when variables don't contain special characters.

Did you mean: 
  fgrep "${i}" "$1" | fgrep -v "${i}"2 | fgrep -v "${i}"pedia | awk '{ printf "% 16s   %1.2f  %1.2f   % 8.2f   % 8.2f   % 8.2f   % 8.2f\n", $1, $7, $2, $8, $3, $11, $6}'
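
The suggestion above only adds quoting and keeps fgrep, which SC2197 flags separately. A sketch that applies both hints is shown below; the function name filter_dataset and its arguments are assumptions for illustration, and the awk formatting step of compare.sh is left out.

```shell
# Hypothetical helper: print the lines of a results file that mention
# a dataset name, excluding the "<name>2" and "<name>pedia" variants
# (for example urls vs urls2, wiki vs wikipedia).
filter_dataset() {
    name=$1
    file=$2
    grep -F "${name}" "${file}" |
        grep -F -v "${name}2" |
        grep -F -v "${name}pedia"
}
```

For example, given a file with the lines `urls 1`, `urls2 2`, `wiki 3` and `wikipedia 4`, calling `filter_dataset wiki results.txt` keeps only the `wiki 3` line.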


In be/src/fsst/paper/evolution.sh line 7:
(for i in dbtext/*; do (./cw-strncmp $i 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|dynp-matching|strncmp|scalar" }'
                                     ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                                     ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
(for i in dbtext/*; do (./cw-strncmp "${i}" 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|dynp-matching|strncmp|scalar" }'


In be/src/fsst/paper/evolution.sh line 8:
(for i in dbtext/*; do (./cw $i 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|dynp-matching|str-as-long|scalar"}'
                             ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                             ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
(for i in dbtext/*; do (./cw "${i}" 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|dynp-matching|str-as-long|scalar"}'


In be/src/fsst/paper/evolution.sh line 9:
(for i in dbtext/*; do (./cw-greedy $i 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|greedy-match|str-as-long|scalar" }'
                                    ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                                    ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
(for i in dbtext/*; do (./cw-greedy "${i}" 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|greedy-match|str-as-long|scalar" }'


In be/src/fsst/paper/evolution.sh line 10:
(for i in dbtext/*; do (./vcw $i 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|binary-search|greedy-match|str-as-long|scalar" }'
                              ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                              ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                                         ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.

Did you mean: 
(for i in dbtext/*; do (./vcw "${i}" 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|binary-search|greedy-match|str-as-long|scalar" }'


In be/src/fsst/paper/evolution.sh line 11:
(for i in dbtext/*; do (./hcw $i 511 -adaptive 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|lossy-hash|greedy-match|str-as-long|branch-scalar" }'
                              ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                              ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                                                       ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.

Did you mean: 
(for i in dbtext/*; do (./hcw "${i}" 511 -adaptive 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|lossy-hash|greedy-match|str-as-long|branch-scalar" }'


In be/src/fsst/paper/evolution.sh line 13:
(for i in dbtext/*; do (./hcw-opt $i 511 -adaptive 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|lossy-hash|greedy-match|str-as-long|adaptive-scalar|optimized-construction" }'
                                  ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                                  ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                                                           ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.

Did you mean: 
(for i in dbtext/*; do (./hcw-opt "${i}" 511 -adaptive 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|lossy-hash|greedy-match|str-as-long|adaptive-scalar|optimized-construction" }'


In be/src/fsst/paper/evolution.sh line 14:
(for i in dbtext/*; do (./hcw-opt $i 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|lossy-hash|greedy-match|str-as-long|avx512|optimized-construction" }'
                                  ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                                  ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                                             ^---^ SC2197 (info): fgrep is non-standard and deprecated. Use grep -F instead.

Did you mean: 
(for i in dbtext/*; do (./hcw-opt "${i}" 2>&1) | fgrep -v target | awk '{ l++; if (l==2) t=$2; if (l==4) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " bottom-up|lossy-hash|greedy-match|str-as-long|avx512|optimized-construction" }'


In be/src/fsst/paper/kernels.sh line 1:
#/bin/bash
 ^-- SC1113 (error): Use #!, not just #, for the shebang.


In be/src/fsst/paper/kernels.sh line 4:
echo $PARAMS | awk "{for(i=1;i<=NF;i++) printf \"{\\\\footnotesize{X%d\$%s\$}}&\",i,\$i}" | sed 's/simd/simd_/g'
     ^-----^ SC2086 (info): Double quote to prevent globbing and word splitting.
     ^-----^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
echo "${PARAMS}" | awk "{for(i=1;i<=NF;i++) printf \"{\\\\footnotesize{X%d\$%s\$}}&\",i,\$i}" | sed 's/simd/simd_/g'


In be/src/fsst/paper/kernels.sh line 5:
echo "\\\\"
     ^----^ SC2028 (info): echo may not expand escape sequences. Use printf.


In be/src/fsst/paper/kernels.sh line 10:
   for m in $PARAMS
            ^-----^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
   for m in ${PARAMS}


In be/src/fsst/paper/kernels.sh line 12:
     (./hcw-opt dbtext/$i 511 -$m 2>&1) | tail -2 | head -1 | awk '{ printf "%f ", $2 }'
                       ^-- SC2248 (style): Prefer double quoting even when variables don't contain special characters.
                       ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                               ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                               ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
     (./hcw-opt dbtext/"${i}" 511 -"${m}" 2>&1) | tail -2 | head -1 | awk '{ printf "%f ", $2 }'


In be/src/fsst/paper/kernels.sh line 14:
   echo $i
        ^-- SC2248 (style): Prefer double quoting even when variables don't contain special characters.
        ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
   echo "${i}"


In be/src/fsst/paper/lz4-smallblocks.sh line 3:
dd if=$1 of=tmpsplit.out bs=$maxsize count=1 2> /dev/null
      ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                            ^------^ SC2248 (style): Prefer double quoting even when variables don't contain special characters.
                            ^------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
dd if="$1" of=tmpsplit.out bs="${maxsize}" count=1 2> /dev/null


In be/src/fsst/paper/lz4-smallblocks.sh line 5:
    mkdir tmpsplit$blocksize
                  ^--------^ SC2086 (info): Double quote to prevent globbing and word splitting.
                  ^--------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
    mkdir tmpsplit"${blocksize}"


In be/src/fsst/paper/lz4-smallblocks.sh line 6:
    split -b $blocksize tmpsplit.out tmpsplit$blocksize/x
             ^--------^ SC2086 (info): Double quote to prevent globbing and word splitting.
             ^--------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                                             ^--------^ SC2086 (info): Double quote to prevent globbing and word splitting.
                                             ^--------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
    split -b "${blocksize}" tmpsplit.out tmpsplit"${blocksize}"/x


In be/src/fsst/paper/lz4-smallblocks.sh line 7:
    echo -n $blocksize ""
            ^--------^ SC2086 (info): Double quote to prevent globbing and word splitting.
            ^--------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
    echo -n "${blocksize}" ""


In be/src/fsst/paper/lz4-smallblocks.sh line 8:
    size=$((for f in tmpsplit$blocksize/x*; do lz4 -c $f | wc -c; done) | awk '{s+=$1} END {print s}')
         ^-- SC1102 (error): Shells disambiguate $(( differently or not at all. For $(command substitution), add space after $( . For $((arithmetics)), fix parsing errors.
                             ^--------^ SC2231 (info): Quote expansions in this for loop glob to prevent wordsplitting, e.g. "$dir"/*.txt .
                             ^--------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                                                      ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                                                      ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
    size=$((for f in tmpsplit${blocksize}/x*; do lz4 -c "${f}" | wc -c; done) | awk '{s+=$1} END {print s}')


In be/src/fsst/paper/lz4-smallblocks.sh line 9:
    echo "$maxsize / $size" | bc -l
          ^------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                     ^---^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
    echo "${maxsize} / ${size}" | bc -l


In be/src/fsst/paper/lz4-smallblocks.sh line 10:
    rm -rf tmpsplit$blocksize/
                   ^--------^ SC2086 (info): Double quote to prevent globbing and word splitting.
                   ^--------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
    rm -rf tmpsplit"${blocksize}"/


In be/src/fsst/paper/sorted.sh line 8:
cd dbtext
^-------^ SC2164 (warning): Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

Did you mean: 
cd dbtext || exit


In be/src/fsst/paper/sorted.sh line 11:
  sort $i > ../.sorted/$i; 
       ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
       ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.
                       ^-- SC2086 (info): Double quote to prevent globbing and word splitting.
                       ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
  sort "${i}" > ../.sorted/"${i}"; 


In be/src/fsst/paper/sorted.sh line 14:
cd ..
^---^ SC2103 (info): Use a ( subshell ) to avoid having to cd back.


In be/src/fsst/paper/sorted.sh line 19:
  ./filtertest compare 1000 dbtext/$i | tail -1 | awk '{ printf "% 16s %1.2f %1.2f ",$1,$2,$7}'
                                   ^-- SC2248 (style): Prefer double quoting even when variables don't contain special characters.
                                   ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
  ./filtertest compare 1000 dbtext/"${i}" | tail -1 | awk '{ printf "% 16s %1.2f %1.2f ",$1,$2,$7}'


In be/src/fsst/paper/sorted.sh line 20:
  ./filtertest compare 1000 .sorted/$i | tail -1 | awk '{ printf "%1.2f %1.2f\n",$2,$7}'
                                    ^-- SC2248 (style): Prefer double quoting even when variables don't contain special characters.
                                    ^-- SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
  ./filtertest compare 1000 .sorted/"${i}" | tail -1 | awk '{ printf "%1.2f %1.2f\n",$2,$7}'

For more information:
  https://www.shellcheck.net/wiki/SC1102 -- Shells disambiguate $(( different...
  https://www.shellcheck.net/wiki/SC1113 -- Use #!, not just #, for the sheba...
  https://www.shellcheck.net/wiki/SC2164 -- Use 'cd ... || exit' or 'cd ... |...
----------

You can address the above issues in one of three ways:
1. Manually correct the issue in the offending shell script;
2. Disable specific issues by adding the comment:
  # shellcheck disable=NNNN
above the line that contains the issue, where NNNN is the error code;
3. Add '-e NNNN' to the SHELLCHECK_OPTS setting in your .yml action file.



shfmt errors

'shfmt ' returned error 1 finding the following formatting issues:

----------
--- be/src/fsst/paper/compare.sh.orig
+++ be/src/fsst/paper/compare.sh
@@ -1,5 +1,4 @@
 #!/bin/bash
-(for i in hex yago email wiki uuid urls2 urls firstname lastname city credentials street movies faust hamlet chinese japanese wikipedia genome location c_name l_commen ps_comment 
- do
-  fgrep $i $1 | fgrep -v ${i}2 | fgrep -v ${i}pedia | awk '{ printf "% 16s   %1.2f  %1.2f   % 8.2f   % 8.2f   % 8.2f   % 8.2f\n", $1, $7, $2, $8, $3, $11, $6}'
- done) | awk '{print$0;k++;for(i=2;i<=NF;i++) r[i]+=$i;}END{printf "% 16s   %1.2f  %1.2f   % 8.2f   % 8.2f   % 8.2f   % 8.2f\n", "AVG",r[2]/k,r[3]/k,r[4]/k,r[5]/k,r[6]/k,r[7]/k,r[8]/k}'
+(for i in hex yago email wiki uuid urls2 urls firstname lastname city credentials street movies faust hamlet chinese japanese wikipedia genome location c_name l_commen ps_comment; do
+    fgrep $i $1 | fgrep -v ${i}2 | fgrep -v ${i}pedia | awk '{ printf "% 16s   %1.2f  %1.2f   % 8.2f   % 8.2f   % 8.2f   % 8.2f\n", $1, $7, $2, $8, $3, $11, $6}'
+done) | awk '{print$0;k++;for(i=2;i<=NF;i++) r[i]+=$i;}END{printf "% 16s   %1.2f  %1.2f   % 8.2f   % 8.2f   % 8.2f   % 8.2f\n", "AVG",r[2]/k,r[3]/k,r[4]/k,r[5]/k,r[6]/k,r[7]/k,r[8]/k}'
--- be/src/fsst/paper/evolution.sh.orig
+++ be/src/fsst/paper/evolution.sh
@@ -1,7 +1,7 @@
 #!/bin/bash
 # output format: STCB CCB CR
 # STCB: symbol table construction cost in cycles-per-compressed byte (constructing a new ST per 8MB text)
-# CCB:  compression speed cycles-per-compressed byte 
+# CCB:  compression speed cycles-per-compressed byte
 # CR:   compression (=size reduction) factor achieved
 
 (for i in dbtext/*; do (./cw-strncmp $i 2>&1) | awk '{ l++; if (l==3) t=$2; if (l==6) c=$2; d=$1}END{print t " " c " " d}'; done) | awk '{t+=$1;c+=$2;d+=$3;k++}END{ print (t/k) " " (c/k) " " d/k " iterative|suffix-array|dynp-matching|strncmp|scalar" }'
@@ -16,10 +16,10 @@
 # on Intel SKX CPUs| the results look like:
 #
 # 75.117,160.11,1.97194 iterative|suffix-array|dynp-matching|strncmp|scalar
-#   \--> 160 cycles per byte produces a very slow compression speed (say ~20MB/s on a 3Ghz CPU) 
+#   \--> 160 cycles per byte produces a very slow compression speed (say ~20MB/s on a 3Ghz CPU)
 #
 # 73.6948,81.6404,1.97194 iterative|suffix-array|dynp-matching|str-as-long|scalar
-#   \--> str-as-long (i.e. FSST focusing on 8-byte word symbols) improves compression speed 2x 
+#   \--> str-as-long (i.e. FSST focusing on 8-byte word symbols) improves compression speed 2x
 #
 # 74.4996,37.457,1.94764 iterative|suffix-array|greedy-match|str-as-long|scalar
 #   \--> dynamic programming brought only 3% smaller size. So drop it and gain another 2x compression speed.
@@ -28,7 +28,7 @@
 #   \--> bottom-up is *really* better in terms of compression factor than iterative with suffix array.
 #
 # 1.74783,10.7009,2.28103 bottom-up|lossy-hash|greedy-match|str-as-long|scalar-branch
-#   \--> hashing significantly improves compression speed at only 5% size cost (due to hash collisions) 
+#   \--> hashing significantly improves compression speed at only 5% size cost (due to hash collisions)
 #
 # 1.74783,9.8142,2.28103 bottom-up|lossy-hash|greedy-match|str-as-long|scalar-adaptive
 #   \--> adaptive use of encoding kernels gives compression speed a small bump
@@ -39,4 +39,4 @@
 # optimized construction refers to the combination of three changes:
 # - reducing the amount of bottom-up passes from 10 to 5 (less learning time, but.. slighty worsens CR)
 # - looking at subsamples in early rounds (increasing the sample as the rounds go up). Less compression work.
-# - splitting the counters for less cache pressure and aiding fast skipping over counts-of-0 
+# - splitting the counters for less cache pressure and aiding fast skipping over counts-of-0
--- be/src/fsst/paper/kernels.sh.orig
+++ be/src/fsst/paper/kernels.sh
@@ -1,15 +1,15 @@
 #/bin/bash
 PARAMS='simd1 simd2 simd3 simd4 adaptive'
-(echo | awk '{ print "{\\begin{tabular}{|rrrr|r|l|}\n\\hline"}'
-echo $PARAMS | awk "{for(i=1;i<=NF;i++) printf \"{\\\\footnotesize{X%d\$%s\$}}&\",i,\$i}" | sed 's/simd/simd_/g'
-echo "\\\\"
-echo "\\hline"
-echo "\\hline"
-(for i in hex yago email wiki uuid urls2 urls firstname lastname city credentials street movies faust hamlet chinese japanese wikipedia genome location c_name l_comment ps_comment 
- do 
-   for m in $PARAMS
-   do
-     (./hcw-opt dbtext/$i 511 -$m 2>&1) | tail -2 | head -1 | awk '{ printf "%f ", $2 }'
-   done
-   echo $i
- done) | awk '{for(i=1;i<NF;i++){r[i]+=$i;printf "{\\footnotesize{X%d%5.2f}}& ",i,$i}k++;printf "{\\footnotesize %s}\\\\\n",$NF}END{print "\\hline"; for(j=1;j<i;j++)printf "{\\footnotesize{X%d%5.2f}}& ",j,r[j]/k;print "{\\footnotesize average}\\\\\n\\hline\n\\end{tabular}}"}' | sed 's/_/\\_/g' | sed 's/[0-9]*-//') | sed 's/X[38]/\\bf /g' | sed 's/X[1-9]//g' | sed 's/adaptive/scalar/' 
+(
+    echo | awk '{ print "{\\begin{tabular}{|rrrr|r|l|}\n\\hline"}'
+    echo $PARAMS | awk "{for(i=1;i<=NF;i++) printf \"{\\\\footnotesize{X%d\$%s\$}}&\",i,\$i}" | sed 's/simd/simd_/g'
+    echo "\\\\"
+    echo "\\hline"
+    echo "\\hline"
+    (for i in hex yago email wiki uuid urls2 urls firstname lastname city credentials street movies faust hamlet chinese japanese wikipedia genome location c_name l_comment ps_comment; do
+        for m in $PARAMS; do
+            (./hcw-opt dbtext/$i 511 -$m 2>&1) | tail -2 | head -1 | awk '{ printf "%f ", $2 }'
+        done
+        echo $i
+    done) | awk '{for(i=1;i<NF;i++){r[i]+=$i;printf "{\\footnotesize{X%d%5.2f}}& ",i,$i}k++;printf "{\\footnotesize %s}\\\\\n",$NF}END{print "\\hline"; for(j=1;j<i;j++)printf "{\\footnotesize{X%d%5.2f}}& ",j,r[j]/k;print "{\\footnotesize average}\\\\\n\\hline\n\\end{tabular}}"}' | sed 's/_/\\_/g' | sed 's/[0-9]*-//'
+) | sed 's/X[38]/\\bf /g' | sed 's/X[1-9]//g' | sed 's/adaptive/scalar/'
be/src/fsst/paper/lz4-smallblocks.sh:8:17: not a valid arithmetic operator: f
--- be/src/fsst/paper/sorted.sh.orig
+++ be/src/fsst/paper/sorted.sh
@@ -6,17 +6,15 @@
 rm -rf .sorted 2>/dev/null
 mkdir .sorted
 cd dbtext
-for i in * 
-do 
-  sort $i > ../.sorted/$i; 
+for i in *; do
+    sort $i >../.sorted/$i
 done
 cp chinese japanese faust hamlet ../.sorted/
 cd ..
 
 # note sizes, display stats
-(for i in hex yago email wiki uuid urls2 urls firstname lastname city credentials street movies faust hamlet chinese japanese wikipedia genome location c_name l_comment ps_comment
- do 
-  ./filtertest compare 1000 dbtext/$i | tail -1 | awk '{ printf "% 16s %1.2f %1.2f ",$1,$2,$7}'
-  ./filtertest compare 1000 .sorted/$i | tail -1 | awk '{ printf "%1.2f %1.2f\n",$2,$7}'
- done) | 
-awk '{ s1+=$2; s2+=$3; s3+=$4; s4+=$5; k++; print $0} END {printf "% 16s %1.2f% 1.2f %1.2f %1.2f\n", "avg",s1/k, s2/k, s3/k, s4/k}'
+(for i in hex yago email wiki uuid urls2 urls firstname lastname city credentials street movies faust hamlet chinese japanese wikipedia genome location c_name l_comment ps_comment; do
+    ./filtertest compare 1000 dbtext/$i | tail -1 | awk '{ printf "% 16s %1.2f %1.2f ",$1,$2,$7}'
+    ./filtertest compare 1000 .sorted/$i | tail -1 | awk '{ printf "%1.2f %1.2f\n",$2,$7}'
+done) |
+    awk '{ s1+=$2; s2+=$3; s3+=$4; s4+=$5; k++; print $0} END {printf "% 16s %1.2f% 1.2f %1.2f %1.2f\n", "avg",s1/k, s2/k, s3/k, s4/k}'
----------
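
Note that the shfmt diff only changes layout: the reformatted `compare.sh` lines still call `fgrep` with unquoted variables, so the shellcheck findings (SC2197, SC2086) would remain after reformatting. A minimal sketch of the same filter with `grep -F` and quoting follows; the results file and its contents are hypothetical, since the real input format is not shown in the report.

```shell
#!/bin/bash
# Hypothetical results file; only the first column (dataset name) matters here.
results=$(mktemp)
printf '%s\n' 'urls 1.0' 'urls2 2.0' 'wiki 3.0' 'wikipedia 4.0' > "${results}"

i=wiki
# grep -F is the standard spelling of fgrep (fixed-string matching).
# The two -v filters drop look-alike names such as "wiki2" and "wikipedia".
matched=$(grep -F "${i}" "${results}" | grep -F -v "${i}2" | grep -F -v "${i}pedia")
echo "${matched}"
rm -f "${results}"
```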

You can reformat the above files to meet shfmt's requirements by typing:

  shfmt  -w filename
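
The line `be/src/fsst/paper/lz4-smallblocks.sh:8:17: not a valid arithmetic operator: f` in the output above is a parse error rather than a formatting diff, so `shfmt -w` will most likely keep failing on that file until line 8 is changed. It matches shellcheck's SC1102: `$((for ...` is read as the start of an arithmetic expansion. A sketch of the usual fix, adding a space after `$(`, is shown below. Small temporary files stand in for the split blocks, and plain `wc -c` stands in for the original `lz4 -c $f | wc -c`; both are assumptions made so the snippet runs without lz4.

```shell
#!/bin/bash
# Sketch only: two tiny files play the role of the tmpsplit<blocksize>/x* blocks.
blockdir=$(mktemp -d)
printf 'abc' > "${blockdir}/xaa"
printf 'de' > "${blockdir}/xab"

# "$( (" with a space is a command substitution wrapping a subshell;
# "$((" without the space may be parsed as arithmetic expansion.
# The glob stays outside the quotes so it still expands (SC2231).
size=$( (for f in "${blockdir}"/x*; do wc -c < "${f}"; done) | awk '{s+=$1} END {print s}')
echo "${size}"
rm -rf "${blockdir}"
```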


@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@SkyFan2002 SkyFan2002 closed this Sep 22, 2023