<div class="alert alert-block alert-info">

# SFS Notebook 1 - mawk (Awk and Gawk)

<b>Created for Edinburgh College 2023
    by: </b> [michael.ferrie@edinburghcollege.ac.uk](mailto:michael.ferrie@edinburghcollege.ac.uk)

![Coding Gif](https://cdn.githubraw.com/michaelferrie/labs/main/code-coding.gif)

# Introduction

Some people will tell you that you don't need to learn awk. Despite the naysayers, I have used it many times, and not many people are any good at it. But not you: you will not be one of those sad people who ask me to write their awks, because you can write your own.

Imagine you have a 2 million line log file and you very quickly need to get out the 3rd and 85th column and you only want lines that end with x and start with y, and you have 10 seconds to achieve this when you are SSH'd into a remote server. That's where awk will be most useful. While other folk are speaking about how they can split the file up to open it in Excel, you'll be off for lunch.

So, what actually is [awk](https://en.wikipedia.org/wiki/AWK)? It was created by three computer scientists, Alfred Aho, Peter Weinberger and Brian Kernighan, and the name comes from the initials of their surnames: A, W and K. The original awk was written for the Unix operating system, but we will be using mawk, the implementation of awk that Debian ships by default. mawk is a fast and efficient implementation written by Mike Brennan, which is where the m comes from.

This lab provides an introduction to the programming language awk. There is an advanced tutorial [here](https://www.grymoire.com/Unix/Awk.html) if you would like to learn more. I would also recommend the [O'Reilly Sed & Awk](https://www.amazon.co.uk/sed-awk-Pocket-Reference-OReilly/dp/0596003528) pocket reference book.

# Structure of awk programs

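Awk programs often follow this structure:

The program can start with an optional begin block, marked with the keyword `BEGIN`, which runs once before any input is read:

BEGIN {commands}

Then there are usually one or more rules. Each rule is a pattern, often a regular expression written between slashes, followed by the commands to run on every line that matches:

/pattern/ {commands}

Finally there can be an optional `END` block, which runs once after the last line has been read:

END {commands}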
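
Here is a minimal sketch of the three parts working together. The three input lines are made up for illustration and are fed in with `printf`, so the cell does not depend on any file:

```shell
# Made-up input: a label and a number on each line
result=$(printf 'a 1\nb 2\na 3\n' | awk '
BEGIN  { print "start" }        # runs once, before any input is read
/^a/   { sum += $2 }            # runs for each line that starts with "a"
END    { print "sum:", sum }    # runs once, after the last line
')
echo "$result"
```

Only the two lines starting with `a` reach the middle rule, so the sum is built from 1 and 3.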

In order to demonstrate awk in action in Jupyter, click File > New > Text File, copy with CTRL+C then paste the following into the text file. Click File > Save Text As, and save it as pings.txt, this is just a basic file we can use to practice. We have some log data on pings:

```
1)  R1   S1  11
2)  R1   S2  20
3)  S3   S4  14
4)  D1   D2  16
5)  A5   S1  26
```
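
If you would rather create the file from a terminal cell than through the menus, a here-document is one way to do it. This is only a sketch of an alternative, assuming you want `pings.txt` in the current working directory:

```shell
# Write the sample ping data to pings.txt in the current directory
cat > pings.txt <<'EOF'
1)  R1   S1  11
2)  R1   S2  20
3)  S3   S4  14
4)  D1   D2  16
5)  A5   S1  26
EOF

# Confirm the file has five lines
wc -l < pings.txt
```

If awk later reports that it cannot open `pings.txt`, check that the file was saved in the same directory the notebook is running in.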

# Getting into awk with Print

First let's add a header line to the output (the file itself is not changed). I will break this command down:

1. First we say `awk` to tell the computer we are using awk - _obviously_ ;)
2. Then open single quotes to start the program and write `BEGIN`
3. Next we open the action block with `{` and print with `printf`
4. In double quotes we say what we want to print, adding `\n` to the end to add a new line
5. Close the block with `}`, then add `{print}` to print every line of the file, close the quotes, and add the filename

In [1]:
awk 'BEGIN{printf "Ping Source Dest Time-ms \n"} {print}' pings.txt

Ping Source Dest Time-ms 

Let's say we only want column 1, we can tell awk to just print that with `$1`, column 2 with `$2`, column 3 with `$3` and so on.

In [None]:
awk '{print $1}' pings.txt
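
You can also ask for several columns at once by separating them with commas. The sketch below uses two lines copied from the sample data and prints the source together with the last column. `NF` is awk's built-in count of fields on the current line, so `$NF` always means the last field:

```shell
# Two lines copied from the sample ping data
result=$(printf '1)  R1   S1  11\n2)  R1   S2  20\n' | awk '{ print $2, $NF }')
echo "$result"
```

The comma in `print $2, $NF` puts a single space between the two values in the output.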

# Built-in functions

Here are some useful built-in functions, we can specify a variable to use in the program with `-v`:

In [None]:
awk -v my_var=5 'BEGIN { print my_var }'

If we want more than one variable, we can repeat `-v` for each one, or assign them inside the program separated by `;`:

In [None]:
awk 'BEGIN { l = 50; b = 20; print "Area = (length * breadth) = ", (l * b) }'
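
The same calculation can be sketched with the values passed in from the command line instead, using one `-v` per variable. The names `l` and `b` are reused from the cell above:

```shell
# Each -v sets one variable before the program starts
result=$(awk -v l=50 -v b=20 'BEGIN { print "Area =", l * b }')
echo "$result"
```

This is handy when the values come from the shell, because the awk program itself does not need to be edited.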

We can use awk as a calculator, here is how to take the square root of a number with `sqrt()`:

In [None]:
awk -v my_int=25 'BEGIN { print sqrt(my_int) }'
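
Going the other way, the `^` operator raises a number to a power. A small sketch, with the variable name `n` chosen just for this example, that squares a number and then takes the square root to get back to where it started:

```shell
# n^2 squares n, sqrt() undoes it
result=$(awk -v n=5 'BEGIN { print n^2, sqrt(n^2) }')
echo "$result"
```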

To improve readability we can split a program over several lines. Try to keep code to around 70 characters or less per line, as long lines are hard to read in a terminal. A `\` at the end of a line tells awk that the statement continues on the next line. Awk doesn't care about tabs or spaces like Python does, so you can use tabs to indent it.

In [None]:
awk 'BEGIN { l = 50; b = 20; \
        print \
            "Area = (length * breadth) = ", (l * b) \
            }'

Here is how to write a loop in awk: first assign a variable, then use the keyword `do`, then print the variable, then increment it with `i+=1`, then use `while ()` to specify the condition for repeating the loop:

In [None]:
awk 'BEGIN { i = 0; \
        do { print i; i+=1; } while(i < 5) \
     }'
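
Awk also has a `for` loop, which puts the starting value, the condition and the increment on one line. This sketch should print the same numbers as the `do` loop above:

```shell
# Start at 0, repeat while i is below 5, add 1 after each pass
result=$(awk 'BEGIN { for (i = 0; i < 5; i++) print i }')
echo "$result"
```

One difference worth knowing: a `do ... while` loop always runs its body at least once, while a `for` loop checks the condition before the first pass.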

# Conditionals 

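The condition of an `if` goes inside `()`. The statement to run when it is true comes next, and the alternative follows the keyword `else`. End each statement with `;`, or group several statements inside `{}`. Look at this example: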

In [None]:
awk 'BEGIN { \

a = 34; print "Number is " a

if (a > 35)
    print a " is greater than 35";
else
    print a " is not greater than 35";
}'

Arguably, this is easier than Python, as you don't need to worry about spaces or indentation. Here is a more elaborate example, with comments inside the program to explain it:

In [None]:
# start program and define variables
awk 'BEGIN { \
score1 = 5 ; score2 = 6 ; score3 = 4 ; \

# print totals and calculate average
print "Total of scores is " score1 + score2 + score3 \
     "\nAverage score is " (score1+score2+score3) / 3

# create conditional, check if more than 35
if (score1+score2+score3 > 35)
    print "Total scores greater than 35";
else
    print "Total scores 35 or less";
}'
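
When there are more than two outcomes you can chain the tests with `else if`. The sketch below wraps awk in a small shell function so it can be tried with different values. The function name `grade` and the thresholds 70 and 40 are made up for this example:

```shell
# Hypothetical helper: classify a score passed as the first argument
grade() {
    awk -v score="$1" 'BEGIN {
        if (score >= 70)
            print "high"
        else if (score >= 40)
            print "medium"
        else
            print "low"
    }'
}

grade 85
grade 55
grade 10
```

Awk checks the conditions from top to bottom and stops at the first one that is true.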

## Download CSV
For this lab I have a CSV file of different countries that we can use to perform some tasks with awk. Download the file and open it to look at the layout.

In [None]:
wget  https://raw.githubusercontent.com/michaelferrie/labs/main/wdicountry.csv

## Check awk version

In [None]:
awk -W version

# Printing

Print the full file with awk

In [None]:
awk '{print}' wdicountry.csv

Print only the first column. The columns are numbered from $1..$n, and $0 represents the full current line.

In [None]:
awk '{print $1}' wdicountry.csv

Notice that awk prints more than just the first column, despite us telling it to print only $1. This is because awk splits fields on whitespace by default, and this file separates its columns with commas. Use -F, then the field separator in quotes, to tell awk how to delimit the columns.

In [None]:
awk -F ',' '{print $1}' wdicountry.csv

In [None]:
# print column 2
awk -F ',' '{print $2}' wdicountry.csv

In [None]:
# print column 3
awk -F ',' '{print $3}' wdicountry.csv

Let's swap two columns round, this is especially handy when we only care about two columns in a file.

In [None]:
awk -F ',' '{print $2,$1}' wdicountry.csv
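
By default awk joins the printed columns with a space. If you want the output to stay comma-separated, you can set the built-in variable `OFS` (output field separator). A sketch using two made-up rows in the same shape as the CSV:

```shell
# Made-up rows: a code in column 1 and a name in column 2
result=$(printf 'AAA,Alpha\nBBB,Beta\n' | awk -F ',' -v OFS=',' '{ print $2, $1 }')
echo "$result"
```

Here `-F` controls how the input is split and `OFS` controls how the output is joined, so the two can be different if you need them to be.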

## Regex with awk

Imagine we could tell awk to look for a pattern in a file and return the matches. This type of pattern is called a regular expression, and they are common in programming languages. To give only those countries that begin with A we need to specify a pattern: the ^ (caret) symbol anchors the match to the start of the text. A pattern on its own is tested against the whole line, while the ~ (tilde) operator tests a single field against the pattern.

In [None]:
awk -F ',' '/^A/ {print $2}' wdicountry.csv

There is a problem here: the output shows the United Arab Emirates as well as the countries that start with A. This is because the pattern is tested against the whole line, which starts with the first column, not the country name. We can tell awk to check the second column instead by adding $2 ~ before the regexp.

In [None]:
awk -F ',' '$2 ~ /^A/ {print $2}' wdicountry.csv
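
To see the difference side by side, here is a sketch on three made-up rows (a code in column 1 and a name in column 2, which is an assumption about the layout made for this example only):

```shell
# Made-up rows: code in column 1, name in column 2
rows='ARE,United Arab Emirates
AUS,Australia
BRA,Brazil'

# Bare pattern: tested against the whole line, so the code "ARE" matches too
whole_line=$(printf '%s\n' "$rows" | awk -F ',' '/^A/ { print $2 }')

# Field match: only column 2 is tested
field_only=$(printf '%s\n' "$rows" | awk -F ',' '$2 ~ /^A/ { print $2 }')

echo "$whole_line"
echo "---"
echo "$field_only"
```

The first command lets the United Arab Emirates through because its line begins with A, while the second only keeps names that begin with A.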

Inside the print statement you can specify multiple columns to print and in which order. Can you find the column that has the currency used by each of the countries that start with A?

In [None]:
awk -F ',' '$2 ~ /^A/ {print $2,$1}' wdicountry.csv

We can daisy-chain the matches using || as OR and && as AND. For example, to print countries that are classified by the World Bank as upper middle income and begin with the letter A, the following would work:

In [None]:
awk -F ',' '$2 ~ /^A/ && $9 ~ /Up/ {print $2,$9}' wdicountry.csv
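
The cell above shows `&&`. For completeness, here is a sketch of `||` on three made-up rows, where a row is printed if either test is true. The column positions and values are invented for this example and do not come from the real CSV:

```shell
# Made-up rows: code, name, income group
rows='AAA,Alpha,High income
BBB,Beta,Low income
CCC,Gamma,High income'

# Keep rows whose name starts with A OR whose income group starts with Low
result=$(printf '%s\n' "$rows" | awk -F ',' '$2 ~ /^A/ || $3 ~ /^Low/ { print $2 }')
echo "$result"
```

Gamma is left out because it fails both tests.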

Now you are an expert in awk. Answer the following eight questions; the answers are useful notes to keep for future reference. Each question builds on the last, so it is sensible to complete them in order. Have a look at the first row, where the column names are shown:

In [None]:
head -n 1 wdicountry.csv
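
Counting commas by eye to find a column number gets tedious. One possible shortcut is to let awk number the column names for you. The sketch below uses a made-up header row so that it runs on its own; to use it on the real file, run the same awk program on `wdicountry.csv` instead of the `printf` input:

```shell
# Made-up header and one data row; the real column names come from wdicountry.csv
result=$(printf 'Code,Name,Region\nAAA,Alpha,North\n' |
    awk -F ',' 'NR == 1 { for (i = 1; i <= NF; i++) print i, $i }')
echo "$result"
```

`NR` is the built-in line number, so `NR == 1` restricts the rule to the header row, and the loop walks through each of its `NF` fields.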

We have a lot of information in the file about each of the countries, we can do some awk-ing, to get the information we want.

# Questions

1. If `/^A/` matches any pattern that starts with A and `/a$/` matches any word that ends in a, can you find any country whose currency ends in o and starts with A?

In [None]:
awk -F ',' '$6 ~ /^A.*o$/ { print $2, $6 }' wdicountry.csv

2. When does the fiscal year end in Pakistan?

In [None]:
awk -F ',' '$2 ~ /Pakistan/ { print "Fiscal year ends:", $15 }' wdicountry.csv

3. What currency is used in Zambia?

In [None]:
awk -F ',' '$2 ~ /Zambia/ { print "Zambia uses:", $6 }' wdicountry.csv

4. How many countries have an official name that starts, ‘Republic of F’?

In [None]:
awk -F, '$4 ~ /^Republic of F/ { count++ } END { print count + 0, "countries found" }' wdicountry.csv

5. According to this CSV, how many countries are high income and start with the letter B?

In [None]:
awk -F ',' '$2 ~ /^B/ && $9 == "High income" { count++ } END { print count + 0, "high-income countries start with B" }' wdicountry.csv

6. Which countries start with the letter K and are classified by the World Bank as upper middle income?

In [None]:
awk -F, '$2 ~ /^K/ && $9 == "Upper middle income" { print $2 }' wdicountry.csv

7. Which countries use the Euro and start with the letter S?

In [None]:
awk -F, '$2 ~ /^S/ && $6 == "Euro" { print $2 }' wdicountry.csv


8. Write an awk statement that prints out the short name for Mexico, the long name, the currency and the 2 letter country abbreviation?

In [None]:
awk -F, '$2 == "Mexico" { print "Short name:", $2, "\nLong name:", $4, "\nCurrency:", $6, "\n2-letter code:", $5 }' wdicountry.csv

<div class="alert alert-block alert-warning">
<b>Challenge Questions:</b> Who needs friends when you have awk
</div>

9. Modify the awk script below so that it will only print out the cube root of the variable `b`?

In [2]:
awk -v b=27 'BEGIN { print b^(1/3) }'

3


10. Given the variable `x`, modify the awk script so that when I change the value of `x` to any whole positive integer, awk will print out the following messages. Either: `x is not divisible by 5` or `x is divisible by 5`?

In [None]:
awk -v x=25 'BEGIN { if (x % 5 == 0) print x " is divisible by 5"; else print x " is not divisible by 5" }'

11. The following script has two numbers, write a program to check if they are coprime integers. 

We will start off easy with this one: 14 and 25 share no common factor other than 1, so they are coprime.

Write a program that prints `x and y are coprime` or `x and y are not coprime`. Then change x to 15, which shares a factor of 5 with 25 (so isn't coprime), to test your program.

In [None]:
awk 'function gcd(a, b) { return b ? gcd(b, a % b) : a } BEGIN { x = 14; y = 25; if (gcd(x, y) == 1) print x " and " y " are coprime"; else print x " and " y " are not coprime" }'

In [None]:
# add awk here
awk 'BEGIN { x = 14; y = 25; print "Integers = " x " and " y }'

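12. Keep writing this great program. If x and y are coprime, we may as well calculate N and PHI, using the RSA-style formulas N = x * y and PHI = (x - 1) * (y - 1), and print them out with the program. Do not print them if x and y are not coprime. Note that this PHI only equals Euler's totient of N when x and y are both prime, which 14 and 25 are not, so treat the values here as practice.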

In [None]:
awk 'function gcd(a, b) { return b ? gcd(b, a % b) : a } 
BEGIN { 
    x = 14; y = 25;
    if (gcd(x, y) == 1) { 
        N = x * y; 
        PHI = (x - 1) * (y - 1);
        print x " and " y " are coprime";
        print "N =", N, "PHI =", PHI;
    } else {
        print x " and " y " are not coprime";
    }
}'