# Why learn AWK?
This notebook is inspired by a very nice series of blog posts by Jonathan Palardy that I stumbled across recently:
* https://blog.jpalardy.com/posts/why-learn-awk/
* https://blog.jpalardy.com/posts/awk-tutorial-part-1/
* https://blog.jpalardy.com/posts/awk-tutorial-part-2/
* https://blog.jpalardy.com/posts/awk-tutorial-part-3/
* https://blog.jpalardy.com/posts/my-best-awk-tricks/

Maybe it makes sense to try awk after all :-)

In this notebook, I'll reproduce the examples given in the blog posts, work on the exercises, and sometimes do a bit more.


## AWK tutorial, part 1
https://blog.jpalardy.com/posts/awk-tutorial-part-1/

### Download the example data
Jonathan Palardy provides a file with historical Netflix stock prices.

In [1]:
DESTINATION=download/2020-01-24-awk-tutorial
mkdir -p $DESTINATION
cd $DESTINATION

if [ ! -f netflix.tsv ]; then
    curl -sO https://blog.jpalardy.com/assets/awk-tutorials/netflix.tsv
fi

### Helper functions

The file `netflix.tsv` and also the processed output that we get for some of the examples in this notebook contain many lines. To enable us to focus on the important parts, we implement a bash function `snip` which can
* read input from a pipe, or
* take a file name as a parameter,

and does the following:
* if the input or the file contains 6 lines or fewer, print it as it is.
* if it contains 7 lines or more, print the first 4 lines, then a line containing the text `# ... line(s) omitted` (where `...` is replaced by the number of omitted lines), and then the last 2 lines.

Some information about the development of this function, along with commented source code, can be found in [a separate notebook](2020-01-26-head-plus-tail-with-awk.ipynb).

In [2]:
snip() {
    awk -v HEAD=4 -v TAIL=2 '
    NR <= HEAD { print; next }

    { 
      if (tail_lines < TAIL)
        tail_lines++
      else
        for (i = 1; i < tail_lines; i++)
          last[i] = last[i+1]
      last[tail_lines] = $0
    }

    END {
      omitted = NR - (HEAD + TAIL)
      if (omitted == 1)
        print "# 1 line omitted"
      else if (omitted > 0)
        print "# " omitted " lines omitted"

      for (i = 1; i <= tail_lines; i++)
        print last[i]
    }' "$@"
} 

This is how it processes data read from a pipe:

In [3]:
cat netflix.tsv | snip

Date	Open	High	Low	Close	Volume	Adj Close
2016-03-24	98.639999	98.849998	97.07	98.360001	10646900	98.360001
2016-03-23	99.75	100.389999	98.809998	99.589996	8292300	99.589996
2016-03-22	100.480003	101.519997	99.199997	99.839996	9039500	99.839996
# 3479 lines omitted
2002-05-24	17.00	17.15	16.76	16.940001	11104800	1.21
2002-05-23	16.19	17.399999	16.04	16.75	104790000	1.196429


It can also read files:

In [4]:
snip netflix.tsv

Date	Open	High	Low	Close	Volume	Adj Close
2016-03-24	98.639999	98.849998	97.07	98.360001	10646900	98.360001
2016-03-23	99.75	100.389999	98.809998	99.589996	8292300	99.589996
2016-03-22	100.480003	101.519997	99.199997	99.839996	9039500	99.839996
# 3479 lines omitted
2002-05-24	17.00	17.15	16.76	16.940001	11104800	1.21
2002-05-23	16.19	17.399999	16.04	16.75	104790000	1.196429


Sometimes, it will be useful to format the output such that columns are aligned.

In [5]:
format-and-snip () {
    column -t "$@" | snip 
}

In [6]:
cat netflix.tsv | format-and-snip

Date        Open        High        Low         Close       Volume     Adj         Close
2016-03-24  98.639999   98.849998   97.07       98.360001   10646900   98.360001
2016-03-23  99.75       100.389999  98.809998   99.589996   8292300    99.589996
2016-03-22  100.480003  101.519997  99.199997   99.839996   9039500    99.839996
# 3479 lines omitted
2002-05-24  17.00       17.15       16.76       16.940001   11104800   1.21
2002-05-23  16.19       17.399999   16.04       16.75       104790000  1.196429


In [7]:
format-and-snip netflix.tsv

Date        Open        High        Low         Close       Volume     Adj         Close
2016-03-24  98.639999   98.849998   97.07       98.360001   10646900   98.360001
2016-03-23  99.75       100.389999  98.809998   99.589996   8292300    99.589996
2016-03-22  100.480003  101.519997  99.199997   99.839996   9039500    99.839996
# 3479 lines omitted
2002-05-24  17.00       17.15       16.76       16.940001   11104800   1.21
2002-05-23  16.19       17.399999   16.04       16.75       104790000  1.196429


### Printing Columns
* `$1` is the content of the first column,
* `$2` is the content of the second column, etc.
* `$0` is the entire line.
* Single quotes tell bash to keep string contents untouched.

In [8]:
cat netflix.tsv | awk '{print $2}' | snip

Open
98.639999
99.75
100.480003
# 3479 lines omitted
17.00
16.19


Alternatively, awk can read the file instead of getting the content through a pipe:

In [9]:
awk '{print $2}' netflix.tsv | snip

Open
98.639999
99.75
100.480003
# 3479 lines omitted
17.00
16.19


### Structure of AWK rules
An AWK program consists of *rules* which look like this:

    condition { one or more statements }

The action inside the curly braces is executed if the condition evaluates to *true*.

Both the condition and the action are optional.

#### Automatic conversions
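Fields that look like numbers are automatically treated as numbers in comparisons. The header field "Open" does not look like a number, so AWK falls back to comparing it with "100" as a string, and "O" sorts after "1". That is why the header line shows up in the output below.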

In [10]:
cat netflix.tsv | awk '$2 > 100 { print $2 }' | snip

Open
100.480003
101.150002
100.50
# 1174 lines omitted
100.440002
100.009999
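
A small experiment with made-up input (the values `Open` and `9` are only for illustration) shows when AWK compares numerically and when it falls back to string comparison. The results are stored in variables first because a bare `>` inside `print` would be read as output redirection.

```shell
# $1 is "Open": not numeric, so it is compared with "100" as a string ("O" sorts after "1").
# $2 is 9: it looks numeric, so 9 > 100 is evaluated numerically and is false.
# $2 "" is a concatenation and therefore a string: "9" sorts after "100".
echo 'Open 9' | awk '{ a = ($1 > 100); b = ($2 > 100); c = ($2 "" > 100); print a, b, c }'
# prints: 1 0 1
```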


#### Missing condition
A missing condition causes the action to be executed for every line:

In [11]:
echo 'A B C' | awk '{print $2}'

B


This is equivalent to

In [12]:
echo 'A B C' | awk '1 {print $2}'

B


#### Missing action
The default action is to print the whole line.

In [13]:
cat netflix.tsv | awk '$2 > 100' | format-and-snip

Date        Open        High        Low         Close       Volume     Adj         Close
2016-03-22  100.480003  101.519997  99.199997   99.839996   9039500    99.839996
2016-03-21  101.150002  102.099998  99.50       101.059998  9562900    101.059998
2016-03-18  100.50      102.410004  100.010002  101.120003  15437300   101.120003
# 1174 lines omitted
2010-04-26  100.440002  109.700001  100.440002  108.169999  47215000   15.452857
2010-04-23  100.009999  100.559999  96.360003   99.73       29817900   14.247143


The default action is thus equivalent to `print` without arguments, which also prints the whole line:

In [14]:
cat netflix.tsv | awk '$2 < 99 { print }' | format-and-snip

2016-03-24  98.639999  98.849998   97.07      98.360001   10646900   98.360001
2016-03-16  97.529999  99.730003   97.50      99.349998   12598600   99.349998
2016-03-15  97.870003  98.510002   96.43      97.860001   9678000    97.860001
2016-03-14  97.199997  99.419998   97.169998  98.129997   11223200   98.129997
# 2283 lines omitted
2002-05-24  17.00      17.15       16.76      16.940001   11104800   1.21
2002-05-23  16.19      17.399999   16.04      16.75       104790000  1.196429


`print $0` also prints the whole line because `$0` is a special variable which contains the current line, before it was split into fields.

In [15]:
cat netflix.tsv | awk '$2 >= 99 && $2 <= 100 { print $0 }' | format-and-snip

2016-03-23  99.75      100.389999  98.809998  99.589996   8292300    99.589996
2016-03-17  99.050003  101.389999  99.00      99.720001   13755000   99.720001
2016-03-11  99.510002  99.599998   96.050003  97.660004   15097900   97.660004
2016-01-26  99.739998  100.550003  94.849998  97.830002   22024700   97.830002
# 10 lines omitted
2010-07-30  99.689999  103.179998  98.539999  102.549997  30704800   14.65
2010-05-03  99.959998  103.809999  99.000003  101.989998  13974100   14.57


### More Printing

#### Printing multiple columns
A comma between `print` arguments inserts the output field separator (`OFS`, a single space by default) in the output:

In [16]:
cat netflix.tsv | awk '{print $1, $6, $5}' | snip

Date Volume Close
2016-03-24 10646900 98.360001
2016-03-23 8292300 99.589996
2016-03-22 9039500 99.839996
# 3479 lines omitted
2002-05-24 11104800 16.940001
2002-05-23 104790000 16.75
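
The separator that the comma inserts is held in the built-in variable `OFS`. A small sketch with made-up input shows the default and one way to change it:

```shell
# Default: the values are separated by a single space.
echo 'A B C' | awk '{ print $1, $3 }'
# prints: A C

# Setting OFS in a BEGIN block produces comma-separated output instead.
echo 'A B C' | awk 'BEGIN { OFS = "," } { print $1, $3 }'
# prints: A,C
```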


#### More formatting power with `printf`
`printf` provides functionality similar to the C function of the same name: https://linux.die.net/man/3/printf.

Note that we use the condition `NR > 1` to remove the first line because the `f` format does not work well with the column headers.

`NR` is a variable that contains the number of the current line.

In [17]:
cat netflix.tsv | awk 'NR > 1 {printf "%s %15s %.1f\n", $1, $6, $5}' | snip

2016-03-24        10646900 98.4
2016-03-23         8292300 99.6
2016-03-22         9039500 99.8
2016-03-21         9562900 101.1
# 3478 lines omitted
2002-05-24        11104800 16.9
2002-05-23       104790000 16.8
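
A few common format features, sketched on a single made-up line built from three values of the example data: `-` left-aligns, a number sets the minimum width, and `.2` limits the number of decimals. The option `-F'\t'` is not used elsewhere in this notebook; it sets the input field separator to a tab.

```shell
# %-12s : string, left-aligned, padded to 12 characters
# %10.2f: floating point, 2 decimals, right-aligned in 10 characters
# %12d  : integer, right-aligned in 12 characters
printf '2016-03-24\t98.360001\t10646900\n' |
    awk -F'\t' '{ printf "%-12s|%10.2f|%12d\n", $1, $2, $3 }'
# prints: 2016-03-24  |     98.36|    10646900
```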


#### String concatenation
Putting two or more values next to each other in a `print` statement concatenates the values:

In [18]:
cat netflix.tsv | awk '{print $1 "," $6}' | snip

Date,Volume
2016-03-24,10646900
2016-03-23,8292300
2016-03-22,9039500
# 3479 lines omitted
2002-05-24,11104800
2002-05-23,104790000


### Exercises

#### Print the "Date", "Volume", "Open", "Close" columns, in that order

In [19]:
cat netflix.tsv | awk '{print $1, $6, $2, $5}' | format-and-snip

Date        Volume     Open        Close
2016-03-24  10646900   98.639999   98.360001
2016-03-23  8292300    99.75       99.589996
2016-03-22  9039500    100.480003  99.839996
# 3479 lines omitted
2002-05-24  11104800   17.00       16.940001
2002-05-23  104790000  16.19       16.75


#### Only print lines where the stock price increased ("Close" > "Open")

In [20]:
cat netflix.tsv | awk '$5 > $2' | format-and-snip

2016-03-18  100.50      102.410004  100.010002  101.120003  15437300   101.120003
2016-03-17  99.050003   101.389999  99.00       99.720001   13755000   99.720001
2016-03-16  97.529999   99.730003   97.50       99.349998   12598600   99.349998
2016-03-14  97.199997   99.419998   97.169998   98.129997   11223200   98.129997
# 1692 lines omitted
2002-06-03  15.120001   16.089999   15.069999   15.799999   3151400    1.128571
2002-05-23  16.19       17.399999   16.04       16.75       104790000  1.196429


Maybe it's useful if we include the column headers:

In [21]:
cat netflix.tsv | awk 'NR == 1 || $5 > $2' | format-and-snip

Date        Open        High        Low         Close       Volume     Adj         Close
2016-03-18  100.50      102.410004  100.010002  101.120003  15437300   101.120003
2016-03-17  99.050003   101.389999  99.00       99.720001   13755000   99.720001
2016-03-16  97.529999   99.730003   97.50       99.349998   12598600   99.349998
# 1693 lines omitted
2002-06-03  15.120001   16.089999   15.069999   15.799999   3151400    1.128571
2002-05-23  16.19       17.399999   16.04       16.75       104790000  1.196429


#### Print the "Date" column and the stock price difference ("Close" - "Open")

In [22]:
cat netflix.tsv | awk 'NR > 1 {print $1, $5 - $2}' | format-and-snip

2016-03-24  -0.279998
2016-03-23  -0.160004
2016-03-22  -0.640007
2016-03-21  -0.090004
# 3478 lines omitted
2002-05-24  -0.059999
2002-05-23  0.56


#### Print an empty line between each line

In [23]:
format-and-snip netflix.tsv | awk '{ print $0 "\n"}'

Date        Open        High        Low         Close       Volume     Adj         Close

2016-03-24  98.639999   98.849998   97.07       98.360001   10646900   98.360001

2016-03-23  99.75       100.389999  98.809998   99.589996   8292300    99.589996

2016-03-22  100.480003  101.519997  99.199997   99.839996   9039500    99.839996

# 3479 lines omitted

2002-05-24  17.00       17.15       16.76       16.940001   11104800   1.21

2002-05-23  16.19       17.399999   16.04       16.75       104790000  1.196429



## AWK tutorial, part 2
https://blog.jpalardy.com/posts/awk-tutorial-part-2/

### Matching with Regular Expressions

Now we want to extract data from 2015. Note that a regular expression, by itself, is a shorthand for the condition `$0 ~ /regex/`.

This means that
    
    awk '/^2015-/'

is equivalent to

    awk '$0 ~ /^2015-/'

In [24]:
cat netflix.tsv | awk '/^2015-/' | format-and-snip

2015-12-31  116.209999  117.459999  114.279999  114.379997  9245000    114.379997
2015-12-30  118.949997  119.019997  116.43      116.709999  8116200    116.709999
2015-12-29  118.190002  119.599998  116.919998  119.120003  8159200    119.120003
2015-12-28  117.260002  117.349998  113.849998  117.110001  8406300    117.110001
# 246 lines omitted
2015-01-05  344.810001  344.810001  330.03001   331.179996  18165000   47.311428
2015-01-02  344.059998  352.32      341.12001   348.940002  13475000   49.848572


We can also match regular expressions on specific columns. This is how we extract lines where the "Open" column contains a whole number of dollars plus 63 cents, that is, a value that ends in `.63`:

In [25]:
cat netflix.tsv | awk '$2 ~ /\.63$/ { print $1 "\t" $2}' | format-and-snip

2010-01-14  52.63
2008-07-24  27.63
2008-05-13  30.63
2006-04-26  31.63
# 2 lines omitted
2003-02-28  16.63
2002-08-22  13.63


### Comparisons and Logic

AWK has the usual comparison operators `==`, `!=`, `>`, `>=`, `<`, `<=`.

Moreover:
* `$2 ~ /^10./` is true if the column matches the regex
* `$2 !~ /^10./` is true if the column does *not* match the regex

Expressions can be combined with the logical operators `&&` and `||`, and negated if prefixed with `!`.
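
The operators above can be tried out on a few lines. This is a minimal sketch; the three lines in `data` are a made-up, shortened sample (date and one price column), not the real file.

```shell
data='2016-03-22 100.48
2016-03-16 97.53
2015-12-31 116.21'

# Regex match on column 2: the value starts with "10" followed by any character.
echo "$data" | awk '$2 ~ /^10./'
# prints: 2016-03-22 100.48

# Negated regex match, combined with a string comparison on the date.
echo "$data" | awk '$2 !~ /^10./ && $1 >= "2016"'
# prints: 2016-03-16 97.53

# Negation of a whole comparison: same result as $2 >= 100.
echo "$data" | awk '!($2 < 100)'
# prints: 2016-03-22 100.48
#         2015-12-31 116.21
```

Note that the `.` in `/^10./` matches any character; `/^10\./` would require a literal dot.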

### Built-in Variables
The most useful built-in variables are:
* `NR`: the number of records (lines) processed since AWK started.
* `NF`: the number of fields (columns) in the current line

When working with multiple files, the following variables can come in handy:
* `FNR`: like `NR`, but resets to 1 when it begins processing a new file
* `FILENAME`: the name of the current file.

More variables are documented at http://www.math.utah.edu/docs/info/gawk_11.html.
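
To see the difference between `NR` and `FNR`, here is a small sketch that creates two throwaway files (the names `one.txt` and `two.txt` are made up for this example) and prints the four variables for every line:

```shell
# Create two small files in a temporary directory.
tmpdir=$(mktemp -d)
printf 'a b\nc d e\n' > "$tmpdir/one.txt"
printf 'f\n' > "$tmpdir/two.txt"

# NR keeps counting across files, FNR restarts at 1 for each file.
(cd "$tmpdir" && awk '{ print FILENAME, NR, FNR, NF }' one.txt two.txt)
# prints: one.txt 1 1 2
#         one.txt 2 2 3
#         two.txt 3 1 1

rm -r "$tmpdir"
```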

### User-Defined Variables
A variable starts to exist when it is first used. A previously undefined variable contains the empty string (see below for the meaning of `BEGIN`):

In [26]:
awk 'BEGIN { print "The value of x is \"" x "\"" }'

The value of x is ""


As we mentioned earlier, strings are converted to numbers if needed. The empty string is converted to zero:

In [27]:
awk 'BEGIN { print "The numerical value of an uninitialized variable is " (x + 0) "."}'

The numerical value of an uninitialized variable is 0.


In [28]:
awk 'BEGIN { x = x + 2; print x }'

2


### Special patterns: `BEGIN` and `END`

These are special conditions that get triggered exactly once (even if there are no input lines):
* `BEGIN` gets triggered *before* processing any line. Common use-cases are:
    * initialization of variables,
    * printing a header.
* `END` gets triggered *after* all lines have been processed. It is often used to calculate a result and print it.

In [29]:
# reproduce the result of wc -l
cat netflix.tsv | awk 'END { print NR }'

3485
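
A small sketch with made-up input that combines both patterns, followed by a check that they also fire when there is no input at all:

```shell
# BEGIN prints a header, the main rule accumulates, END reports the result.
printf '3\n4\n5\n' | awk '
    BEGIN { print "values:" }
    { sum += $1 }
    END { print "sum =", sum, "lines =", NR }'
# prints: values:
#         sum = 12 lines = 3

# Both blocks still run when the input is empty.
awk 'BEGIN { print "start" } END { print "lines =", NR }' < /dev/null
# prints: start
#         lines = 0
```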


### Blocks and Control: `next` and `exit`
If there are multiple condition-block pairs, each of them is applied to every input line (except for the special cases `BEGIN` and `END`):

In [30]:
cat netflix.tsv | awk '/^2016-03-24/; $4 == 96.43'

2016-03-24	98.639999	98.849998	97.07	98.360001	10646900	98.360001
2016-03-15	97.870003	98.510002	96.43	97.860001	9678000	97.860001


What if there is a line that matches both conditions?

In [31]:
cat netflix.tsv | awk '/^2016-03-24/; $4 == 97.07'

2016-03-24	98.639999	98.849998	97.07	98.360001	10646900	98.360001
2016-03-24	98.639999	98.849998	97.07	98.360001	10646900	98.360001


We could use a single block, and combine both conditions into one with `&&`:

In [32]:
cat netflix.tsv | awk '/^2016-03-24/ && $4 == 97.07'

2016-03-24	98.639999	98.849998	97.07	98.360001	10646900	98.360001


However, if two separate condition-block pairs are needed, and it is not easy to make the conditions mutually exclusive, we can use the `next` statement, which makes awk skip the remaining rules and go to the next input line:

In [33]:
cat netflix.tsv | awk '/^2016-03-24/ { print; next } $4 == 97.07'

2016-03-24	98.639999	98.849998	97.07	98.360001	10646900	98.360001


There is also an `exit` statement, which will execute the `END` block if there is one, and then exit the script:

In [34]:
# Find the first line from 2016.
# Note that we avoid cat and pipes here to avoid a 'broken pipe' error.
awk '/^2016-/ { print; exit }' netflix.tsv

2016-03-24	98.639999	98.849998	97.07	98.360001	10646900	98.360001
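
That `exit` still runs the `END` block can be checked with a short sketch. The input is a made-up here-document, so no pipe is involved:

```shell
# Stop at the first line equal to "b"; the END block runs afterwards,
# and NR shows that the third line was never read.
awk '
    $0 == "b" { print "found b on line", NR; exit }
    END { print "END ran after reading", NR, "line(s)" }' <<'EOF'
a
b
c
EOF
# prints: found b on line 2
#         END ran after reading 2 line(s)
```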


### Exercises

#### Only print lines between February 29, 2016 and March 4, 2016

In [35]:
cat netflix.tsv | awk '$1 >= "2016-02-29" && $1 <= "2016-03-04"' | column -t

2016-03-04  98.760002  102.220001  98.32      101.580002  23388400  101.580002
2016-03-03  97.830002  98.349998   95.389999  97.93       15303600  97.93
2016-03-02  98.010002  99.480003   95.900002  97.610001   19088400  97.610001
2016-03-01  94.580002  99.160004   93.610001  98.300003   16997700  98.300003
2016-02-29  94.809998  97.199997   93.339996  93.410004   13157100  93.410004


#### Sum the volumes for all days of January 2016

In [36]:
cat netflix.tsv | awk '
    $1 ~ /^2016-01/ { sum += $6 }
    END { print sum }'

484411300


#### Average the closing price over all days of March 2015

In [37]:
cat netflix.tsv | awk '
    $1 >= "2015-03-01" && $1 <= "2015-03-31" { count++; sum += $5 }
    END { print sum / count }'

437.661
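
If no line matches, `count` stays empty and the division fails. A more defensive variant could guard the division; this sketch uses made-up two-column input (date and closing price) instead of the real file:

```shell
# Average the second column over all lines from March 2015.
printf '2015-03-02 10\n2015-03-03 20\n2015-04-01 99\n' | awk '
    $1 ~ /^2015-03-/ { count++; sum += $2 }
    END {
        if (count > 0)
            print sum / count
        else
            print "no matching lines"
    }'
# prints: 15
```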


#### Check that *all* lines have 7 columns

In [38]:
cat netflix.tsv | awk 'NF != 7 { print "This line has " NF " columns, not 7: " $0 }'

This line has 8 columns, not 7: Date	Open	High	Low	Close	Volume	Adj Close


Check how often each column count occurs:

In [39]:
cat netflix.tsv | awk '{ print "lines have", NF, "columns" }' | sort -n | uniq -c

   3484 lines have 7 columns
      1 lines have 8 columns
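
The header line has 8 fields because AWK splits on any run of spaces and tabs by default, so "Adj Close" counts as two fields. A minimal sketch with a shortened, made-up header shows that splitting on tabs only (option `-F'\t'`, not used elsewhere in this notebook) keeps the name together:

```shell
# Default splitting on spaces and tabs: "Adj Close" becomes two fields.
printf 'Date\tVolume\tAdj Close\n' | awk '{ print NF }'
# prints: 4

# Splitting on tabs only keeps "Adj Close" as one field.
printf 'Date\tVolume\tAdj Close\n' | awk -F'\t' '{ print NF }'
# prints: 3
```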


#### Only print every other line (say, even lines)

In [40]:
cat netflix.tsv | awk 'NR % 2 == 0 { print $0 }' | snip

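The condition can be checked quickly on a small input; here `seq` stands in for the data file. Since printing the line is the default action, the condition alone is enough:

```shell
# Keep only the lines with an even line number.
seq 1 6 | awk 'NR % 2 == 0'
# prints: 2
#         4
#         6
```
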
#### Remove empty lines in a file

In [41]:
printf "first line\n\nlast line\n"

first line

last line


In [42]:
printf "first line\n\nlast line\n" | awk '$0 != ""'

first line
last line


Alternative solution with a regular expression:

In [43]:
printf "first line\n\nlast line\n" | awk '! /^$/'

first line
last line


Even easier: check if there are more than zero columns.

In [44]:
printf "first line\n\nlast line\n" | awk 'NF'

first line
last line


## AWK Tutorial, part 3
https://blog.jpalardy.com/posts/awk-tutorial-part-3/

# TO BE CONTINUED