# The AWK Programming Language

Alfred V. Aho  
Brian W. Kernighan  
Peter J. Weinberger

## 1.1 Getting Started

Suppose you have a file called `emp.data` that contains the name, pay rate in
dollars per hour, and number of hours worked for your employees, one employee
record per line:

In [1]:
cat emp.data

Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18


Now you want to print the name and pay (rate times hours) for everyone who
worked more than zero hours:

In [1]:
awk '$3 > 0 { print $1, $2 * $3 }' emp.data

Kathy 40
Mark 100
Mary 121
Susie 76.5


To print the names of those employees who did not work:

In [2]:
awk '$3 == 0 { print $1 }' emp.data

Beth
Dan


## 1.2 Simple Output

**Print Every Line**

In [3]:
awk '{ print }' emp.data

Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18


In [4]:
awk '{ print $0 }' emp.data

Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18


**Print Certain Fields**

In [5]:
awk '{ print $1, $3 }' emp.data

Beth 0
Dan 0
Kathy 10
Mark 20
Mary 22
Susie 18


**NF, the Number of Fields**

In [6]:
awk '{ print NF, $1, $NF }' emp.data

3 Beth 0
3 Dan 0
3 Kathy 10
3 Mark 20
3 Mary 22
3 Susie 18


**Computing and Printing**

In [7]:
awk '{ print $1, $2 * $3 }' emp.data

Beth 0
Dan 0
Kathy 40
Mark 100
Mary 121
Susie 76.5


**Printing Line Numbers**

In [8]:
awk '{ print NR, $0 }' emp.data

1 Beth 4.00 0
2 Dan 3.75 0
3 Kathy 4.00 10
4 Mark 5.00 20
5 Mary 5.50 22
6 Susie 4.25 18


**Putting Text in the Output**

In [10]:
awk '{ print "total pay for", $1, "is", $2 * $3 }' emp.data

total pay for Beth is 0
total pay for Dan is 0
total pay for Kathy is 40
total pay for Mark is 100
total pay for Mary is 121
total pay for Susie is 76.5


## 1.3 Fancier Output

**Lining Up Fields**

The `printf` statement has the form

printf($format, value_1, value_2, ... , value_n$)

where $format$ is a string that contains text to be printed verbatim, interspersed
with specifications of how each of the values is to be printed. A specification is
a `%` followed by a few characters that control the format of a value.

Here's a program that uses `printf` to print the total pay for every
employee:

In [11]:
awk '{ printf("total pay for %s is $%.2f\n", $1, $2 * $3) }' emp.data

total pay for Beth is $0.00
total pay for Dan is $0.00
total pay for Kathy is $40.00
total pay for Mark is $100.00
total pay for Mary is $121.00
total pay for Susie is $76.50


The first specification, `%s`, says to print the first value, `$1`, as a string of characters; the
second, `%.2f`, says to print the second value, `$2 * $3`, as a number with 2 digits
after the decimal point. Everything else in the specification string, including the
dollar sign, is printed verbatim.

Here's another program that prints each employee's name and pay:

In [12]:
awk '{ printf("%-8s $%6.2f\n", $1, $2 * $3) }' emp.data

Beth     $  0.00
Dan      $  0.00
Kathy    $ 40.00
Mark     $100.00
Mary     $121.00
Susie    $ 76.50


The first specification, `%-8s`, prints a name as a string of characters
left-justified in a field 8 characters wide. The second specification, `%6.2f`, prints
the pay as a number with two digits after the decimal point, in a field 6 characters wide.
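
As a further sketch of field widths (the column layout here is an illustrative choice, not one of the original examples), the program below right-justifies the name with `%8s`, prints the hours as an integer with `%3d`, and widens the pay column to 7 characters. The block recreates `emp.data` so that it can be run on its own.

```shell
# Recreate the sample data so the example runs on its own.
cat > emp.data <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF

# %8s   : name, right-justified in a field 8 characters wide
# %3d   : hours, printed as an integer in a field 3 characters wide
# %7.2f : pay, 2 digits after the decimal point, 7 characters in all
prog='{ printf("%8s %3d hours $%7.2f\n", $1, $3, $2 * $3) }'
awk "$prog" emp.data
```

Omitting the `-` in a specification gives right justification, which is usually what you want for columns of numbers.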

**Sorting the Output**

In [16]:
awk '{ printf("%6.2f %s\n", $2 * $3, $0) }' emp.data | sort -n

  0.00 Beth 4.00 0
  0.00 Dan 3.75 0
 40.00 Kathy 4.00 10
 76.50 Susie 4.25 18
100.00 Mark 5.00 20
121.00 Mary 5.50 22


## 1.4 Selection

**Selection by Comparison**

This program uses a comparison pattern to select the records of employees
who earn $5.00 or more per hour, that is, lines in which the second field is
greater than or equal to 5:

In [17]:
awk '$2 >= 5' emp.data

Mark 5.00 20
Mary 5.50 22


**Selection by Computation**

This program prints the pay of those employees whose total pay exceeds $50:

In [19]:
awk '$2 * $3 > 50 { printf("$%.2f for %s\n", $2 * $3, $1) }' emp.data

$100.00 for Mark
$121.00 for Mary
$76.50 for Susie


**Selection by Text Content**

Print all lines in which the first field is Susie:

In [21]:
awk '$1 == "Susie"' emp.data

Susie 4.25 18


The operator `==` tests for equality. You can also look for text containing any of
a set of letters, words, and phrases by using patterns called regular expressions.

This program prints all lines that contain Susie anywhere:

In [22]:
awk '/Susie/' emp.data

Susie 4.25 18
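
A regular expression can also be tested against a single field rather than the whole line. The sketch below uses awk's match operator `~`, which is not introduced in the text above, together with `^`, which anchors the match at the beginning of the string. It selects the lines whose first field begins with `Ma`.

```shell
# Recreate the sample data so the example runs on its own.
cat > emp.data <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF

# Select lines whose first field starts with "Ma".
prog='$1 ~ /^Ma/'
awk "$prog" emp.data
```

With the sample data this selects the lines for Mark and Mary.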


**Combinations of Patterns**

Patterns can be combined with parentheses and the logical operators `&&`, `||` and `!`, which stand for *AND*, *OR*, and *NOT*.

Print the lines where `$2` is at least `4` or `$3` is at least `20`:

In [23]:
awk '$2 >= 4 || $3 >= 20' emp.data

Beth 4.00 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18


Lines that satisfy both conditions are printed only once.

The next one prints lines where it is not true that `$2` is less than `4` and `$3` is less than `20`; this condition is equivalent to the first one above, though perhaps less readable.

In [39]:
awk '!($2 < 4 && $3 < 20)' emp.data

Beth 4.00 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
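
One way to convince yourself that the two patterns are equivalent is to compare their output directly. This is a minimal sketch; the shell variable names `first` and `second` are only for illustration.

```shell
# Recreate the sample data so the example runs on its own.
cat > emp.data <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF

# Capture the output of each form of the condition.
first=$(awk '$2 >= 4 || $3 >= 20' emp.data)
second=$(awk '!($2 < 4 && $3 < 20)' emp.data)

# Report whether the two programs selected the same lines.
if [ "$first" = "$second" ]; then
    echo "same output"
else
    echo "different output"
fi
```

Agreement on one data file is a check rather than a proof, but it is a quick way to catch a mistake when rewriting a condition.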


**Data Validation**

There are always errors in real data. Awk is an excellent tool for checking
that data has reasonable values and is in the right format, a task that is often
called *data validation*.

Data validation is essentially negative: instead of printing lines with desirable
properties, one prints lines that are suspicious. The following programs use comparison patterns to apply five plausibility tests to each line of `emp.data`, one test per program:

In [11]:
awk 'NF != 3 { print $0, "number of fields is not equal to 3" }' emp.data

In [43]:
awk '$2 < 3.35 { print $0, "rate is below minimum wage" }' emp.data

In [8]:
awk '$2 > 10 { print $0, "rate exceeds $10 per hour" }' emp.data

In [7]:
awk '$3 < 0 { print $0, "negative hours worked" }' emp.data

In [6]:
awk '$3 > 60 { print $0, "too many hours worked" }' emp.data

If there are no errors, there's no output.
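
The five tests can also be written as a single awk program containing five pattern-action statements, so that the data is read only once. The sketch below runs that program on a deliberately faulty file; the file name `bad.data` and its contents are made up for illustration.

```shell
# A hypothetical data file with one problem on each line except the first.
cat > bad.data <<'EOF'
Beth 4.00 0
Dan 3.00 10
Kathy 4.00 10 extra
Mark 12.50 20
Mary 5.50 -2
Susie 4.25 75
EOF

# Every pattern is tested against every input line, in order.
prog='NF != 3   { print $0, "number of fields is not equal to 3" }
$2 < 3.35 { print $0, "rate is below minimum wage" }
$2 > 10   { print $0, "rate exceeds $10 per hour" }
$3 < 0    { print $0, "negative hours worked" }
$3 > 60   { print $0, "too many hours worked" }'
awk "$prog" bad.data
```

Each suspicious line is printed followed by the message for the test it failed; the line for Beth passes all five tests and produces no output.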

**BEGIN and END**

The special pattern `BEGIN` matches before the first line of the first input file is read, and `END` matches after the last line of the last file has been processed. This program uses `BEGIN` to print a heading:

In [2]:
awk 'BEGIN { print "NAME RATE HOURS"; print "" }
           { print }' emp.data

NAME RATE HOURS

Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
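
`END` can be used in the same program to print a trailer after the last line. The following sketch adds one to the heading example; the wording of the trailer is an arbitrary choice.

```shell
# Recreate the sample data so the example runs on its own.
cat > emp.data <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF

# BEGIN runs before any input, the middle action runs once per line,
# and END runs after the last line has been processed.
prog='BEGIN { print "NAME RATE HOURS"; print "" }
      { print }
END   { print ""; print NR, "records" }'
awk "$prog" emp.data
```

The last line of output is `6 records`, since `NR` holds the number of lines read.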


## 1.5 Computing with AWK

An action is a sequence of statements separated by newlines or semicolons.
You have already seen examples in which the action was a single `print` statement. This section provides examples of statements for performing simple
numeric and string computations. In these statements you can use not only the
built-in variables like `NF`, but you can create your own variables for performing
calculations, storing data, and the like. In awk, user-created variables are not
declared.

**Counting**

This program uses a variable `emp` to count employees who have worked more
than 15 hours:

In [3]:
awk '$3 > 15 { emp = emp + 1 }
     END     { print emp, "employees worked more than 15 hours" }' emp.data

3 employees worked more than 15 hours


Awk variables used as numbers begin life with the value `0`, so we didn't need to
initialize `emp`.
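
One subtlety is worth a sketch: a variable that is never assigned has the numeric value `0` but the string value `""`, and `print` shows it as the empty string. In the example below no employee worked more than 100 hours (a threshold chosen only so that nothing matches), so `emp` is never assigned; adding `0` forces it to be treated as a number.

```shell
# Recreate the sample data so the example runs on its own.
cat > emp.data <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF

# No line matches, so emp stays uninitialized.
# The first print shows its string value between brackets,
# the second shows its numeric value.
prog='$3 > 100 { emp = emp + 1 }
      END      { print "[" emp "]"; print emp + 0 }'
awk "$prog" emp.data
```

The first line of output is `[]` and the second is `0`.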

**Computing Sums and Averages**

To count the number of employees, we can use the built-in variable `NR`, which holds the number of lines read so far; its value at the end of all input is the total number of lines read.

In [12]:
awk 'END { print NR, "employees" }' emp.data

6 employees


Here is a program that uses `NR` to compute the average pay:

In [13]:
awk '{ pay = pay + $2 * $3 }
 END { print NR, "employees"
       print "total pay is", pay
       print "average pay is", pay/NR }' emp.data

6 employees
total pay is 337.5
average pay is 56.25


Clearly, `printf` could be used to produce neater output. There's also a potential error: in the unlikely case that `NR` is zero, the program will attempt to divide by zero and thus will generate an error message.
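
A minimal sketch of both improvements follows. It uses awk's `if`-`else` statement, which has not been introduced in the text above, to skip the division when no lines were read, and `printf` to format the amounts. The message wording is an arbitrary choice.

```shell
# Recreate the sample data so the example runs on its own.
cat > emp.data <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF

# Accumulate total pay, then report it only if at least one line was read.
prog='    { pay = pay + $2 * $3 }
END { if (NR > 0)
          printf("%d employees, total pay $%.2f, average pay $%.2f\n", NR, pay, pay / NR)
      else
          print "no employees"
    }'

awk "$prog" emp.data     # normal input
awk "$prog" /dev/null    # empty input: NR is 0, so no division is attempted
```

On `emp.data` this prints `6 employees, total pay $337.50, average pay $56.25`; on empty input it prints `no employees`.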

**Handling Text**

One of the strengths of awk is its ability to handle strings of characters as conveniently as most languages handle numbers. Awk variables can hold strings of characters as well as numbers. This program finds the employee who is paid the most per hour:

In [14]:
awk '$2 > maxrate { maxrate = $2; maxemp = $1 }
     END { print "highest hourly rate:", maxrate, "for", maxemp }' emp.data

highest hourly rate: 5.50 for Mary


In this program the variable `maxrate` holds a numeric value, while the variable `maxemp` holds a string. (If there are several employees who all make the same maximum pay, this program finds only the first.)
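
If all the employees who share the maximum rate should be reported, one possible variation is to add a second pattern that appends to `maxemp` on a tie. This is a sketch, not the original program; the file `tie.data` is made up for illustration, and, like the original, the program assumes that rates are positive.

```shell
# A hypothetical data file in which two employees share the highest rate.
cat > tie.data <<'EOF'
Beth 4.00 0
Mary 5.50 22
Pat 5.50 5
EOF

# The tie test comes first, so a line that sets a new maximum
# is not appended to its own name.
prog='$2 == maxrate { maxemp = maxemp " " $1 }
$2 > maxrate  { maxrate = $2; maxemp = $1 }
END { print "highest hourly rate:", maxrate, "for", maxemp }'
awk "$prog" tie.data
```

With this data the program prints `highest hourly rate: 5.50 for Mary Pat`.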

**String Concatenation**

New strings may be created by combining old ones; this operation is called *concatenation*. This program collects all the employee names into a single string, by appending each name and a blank to the previous value in the variable `names`. The value of `names` is printed by the `END` action:

In [15]:
awk ' { names = names $1 " " }
  END { print names }' emp.data

Beth Dan Kathy Mark Mary Susie 


The concatenation operation is represented in an awk program by writing string values one after the other. At every input line, the first statement in the program concatenates three strings: the previous value of `names`, the first field, and a blank; it then assigns the resulting string to `names`. Thus, after all input lines have been read, `names` contains a single string consisting of the names of all the employees, each followed by a blank. Variables used to store strings begin life holding the null string (that is, the string containing no characters), so in this program `names` did not need to be explicitly initialized.
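
The same property of uninitialized variables can be used to avoid the trailing blank. In the sketch below, the separator variable `sep` (a name chosen here for illustration) is the null string for the first name and a blank for every name after it.

```shell
# Recreate the sample data so the example runs on its own.
cat > emp.data <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF

# sep is empty on the first line, then a single blank,
# so the blank goes between names rather than after each one.
prog='    { names = names sep $1; sep = " " }
END { print names }'
awk "$prog" emp.data
```

The output is the six names separated by single blanks, with no blank after `Susie`.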

**Printing the Last Input Line**

Although `NR` retains its value in an `END` action, `$0` does not. This program is one way to print the last input line:

In [16]:
awk '{ last = $0 }
 END { print last }' emp.data

Susie 4.25 18


**Built-in Functions**

We have already seen that awk provides built-in variables that maintain frequently used quantities like the number of fields and the input line number. Similarly, there are built-in functions for computing other useful values. Besides arithmetic functions for square roots, logarithms, random numbers, and the like, there are also functions that manipulate text. One of these is `length`, which counts the number of characters in a string. For example, this program computes the length of each person's name:

In [17]:
awk '{ print $1, length($1) }' emp.data

Beth 4
Dan 3
Kathy 5
Mark 4
Mary 4
Susie 5
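
As a sketch of two other built-in functions, the program below uses `substr(s, m, n)`, which returns the `n`-character substring of `s` that begins at position `m`, and `int(x)`, which truncates `x` to an integer. The choice to abbreviate names to three characters is only for illustration.

```shell
# Recreate the sample data so the example runs on its own.
cat > emp.data <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF

# Print the first three characters of each name and the pay in whole dollars.
prog='{ print substr($1, 1, 3), int($2 * $3) }'
awk "$prog" emp.data
```

For Susie, whose pay is 76.5, the program prints `Sus 76`: `int` truncates rather than rounds.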


**Counting Lines, Words, and Characters**

This program uses `length`, `NF`, and `NR` to count the number of lines, words, and characters in the input. For convenience, we'll treat each field as a word.

In [18]:
awk '{ nc = nc + length($0) + 1
       nw = nw + NF
     }
     END { print NR, "lines,", nw, "words,", nc, "characters" }' emp.data

6 lines, 18 words, 77 characters


We have added one for the newline character at the end of each input line,
since `$0` doesn't include it.
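
The counts can be checked against the standard `wc` command. This is a sketch: it assumes the data is plain ASCII, so that characters and bytes coincide, and that every line ends with a newline. The shell variable names are illustrative.

```shell
# Recreate the sample data so the example runs on its own.
cat > emp.data <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF

# The same counting logic as above, printing only the three numbers.
prog='{ nc = nc + length($0) + 1; nw = nw + NF }
      END { print NR, nw, nc }'
awk_counts=$(awk "$prog" emp.data)

# wc pads its columns differently across systems, so normalize them with awk.
wc_counts=$(wc < emp.data | awk '{ print $1, $2, $3 }')

echo "awk: $awk_counts"
echo "wc:  $wc_counts"
```

Both commands report 6 lines, 18 words, and 77 characters for `emp.data`.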