AWK crashcourse

This AWK crash course aims to explain AWK in 15 minutes, so that you can make friends with an awesome tool despite its given name. The correct pronunciation is [auk], after small seabirds such as parakeet auklets.

General language description

AWK language (is):

  • (mainly) text processing language
  • available on most UNIX-like systems by default; on Windows there is either a native binary or the Cygwin one
  • syntax is influenced by the C and shell programming languages
  • programs range from a single line to multiple library files
  • several implementations available, notably gawk and mawk
  • solves generally the same problems as similar text-processing tools such as sed, grep, wc, tr, cut, printf, tail, head, cat, tac, bc, column, ...

AWK language use-cases are:

  • computing int / floating point math formulas (based on input)
  • general text-processing
    • cutting pieces from input text stream
    • reformatting input text stream
  • (shell) meta-programming generator
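
As a minimal sketch of two of the use-cases above (math on input and generating shell code), the following uses made-up input and hypothetical variable names (`average`, `commands`). The generated `mv` commands are only printed, not executed, and the sketch assumes file names without spaces or shell metacharacters.

```shell
# Math on input: average of a numeric column.
average=$(printf '10\n20\n60\n' | awk '{ sum += $1 } END { if (NR > 0) print sum / NR }')

# Meta-programming: turn a list of file names into shell commands.
commands=$(printf 'a.txt\nb.txt\n' | awk '{ printf("mv %s %s.bak\n", $0, $0) }')

printf '%s\n' "$average" "$commands"
```

The generated text could later be piped to `sh`, which is the usual way such a generator is put to work.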

AWK language capabilities:

  • text-processing functions
  • regular expression support
  • math functions
  • dynamic typing, support for
    • integer / long
    • floats
    • associative arrays (including multi-dimensional array support)
  • external execution support
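
Two of the capabilities above, associative arrays and external execution, can be sketched together. The example assumes an external `sort` command is available; the array name `count` and the shell variable `word_counts` are illustrative only.

```shell
# Count word occurrences; the array index is the word itself.
word_counts=$(printf 'b a b\nc b a\n' | awk '
    { for (i = 1; i <= NF; i++) count[$i]++ }
    END {
        # for-in order is unspecified, so pipe the output through external sort
        for (w in count) print w, count[w] | "sort"
        close("sort")
    }
')
printf '%s\n' "$word_counts"
```

Multi-dimensional arrays use the same mechanism with a comma-separated index, for example `count[i, j]`.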

Processing workflow aka main()

Every AWK execution consists of the following three phases:

  • [1] BEGIN{ ... } actions are performed at the beginning, before the first text character is read
    • multiple blocks allowed (normally single)
  • [2] [condition]{ ... } actions are performed on every AWK record (by default a text line) that satisfies the optional condition
    • every AWK record is automatically split into AWK fields (by default words)
    • multiple blocks allowed
  • [3] END{ ... } actions are performed at the end of the execution, after the last text character is read
    • multiple blocks allowed (normally single)
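
The three phases, including a block guarded by a condition, can be sketched as follows. The input lines and the variable names `phases` and `kept` are invented for illustration.

```shell
phases=$(printf 'alpha\n# note\nbeta\n' | awk '
    BEGIN { print "start" }                 # [1] runs before any input is read
    /^#/  { next }                          # [2] with a condition: skip comment lines
          { kept++; print NR ": " $0 }      # [2] without a condition: every remaining record
    END   { print "kept " kept " of " NR }  # [3] runs after the last record
')
printf '%s\n' "$phases"
```

Blocks in phase [2] are tried in order for each record, which is why `next` in the second block prevents the third block from seeing comment lines.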

AWK process flow

Warm-up basic example

$ echo -e "AWK is still useful\ntext-processing  technology!" | \
>   awk 'BEGIN{wcnt=0;print "lineno/#words/3rd-word:individual words\n"}
>             {printf("% 6d/% 6d/% 8s:%s\n",NR,NF,$3,$0);wcnt+=NF}
>          END{print "\nSummary:", NR, "lines/records,", wcnt, "words/fields"}'
lineno/#words/3rd-word:individual words

     1/     4/   still:AWK is still useful
     2/     2/        :text-processing  technology!

Summary: 2 lines/records, 6 words/fields

Command-line basics

  • Passing text data to AWK:

    • from pipe: cat input-data.txt | awk <app>
    • from file[s] read by awk itself: awk <app> input-data.txt
  • AWK application execution styles (-f):

    • on command-line awk '{ ... }' input-data.txt
    • in separate files awk -f myapp.awk input-data.txt
  • specifying an AWK variable on the command line: -v var=val

  • specifying the AWK field separator: set the FS variable or use the -F <FS> switch
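
The command-line options above can be combined as in this sketch. The colon-separated sample data, the temporary files, and the variable name `min` are assumptions made for the example.

```shell
# Hypothetical sample data: name:score pairs separated by colons.
data=$(mktemp)
printf 'ann:7\nbob:3\ncid:9\n' > "$data"

# Program on the command line; -F sets FS, -v sets a variable before BEGIN runs.
inline=$(awk -F: -v min=5 '$2 >= min { print $1 }' "$data")

# Same program stored in a file and run with -f; input arrives through a pipe.
prog=$(mktemp)
printf '%s\n' '$2 >= min { print $1 }' > "$prog"
from_file=$(cat "$data" | awk -F: -v min=5 -f "$prog")

printf '%s\n' "$inline" "$from_file"
rm -f "$data" "$prog"
```

Both invocations run the same program, so they print the same names.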

Global variables

Global (built-in) variables are described in the AWK documentation; the most common ones are:

  • $0 value of current AWK record (whole line without line-break)
    • $1, $2, ... $NF values of first, second, ... last AWK field (word)
  • FS specifies the input AWK field separator, i.e. how AWK breaks an input record into fields (default: a single space, which means fields are separated by runs of whitespace).
  • RS specifies the input AWK record separator, i.e. how AWK breaks the input stream into records (default: a newline).
  • OFS specifies the output field separator, i.e. what print puts between its comma-separated arguments (default: a single space).
  • ORS specifies the output record separator, i.e. what print appends after each record it writes (default: a newline).
  • FILENAME contains the name of the input file read by awk (read only global variable)
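
A short sketch of how these variables interact, using invented input and hypothetical shell variable names (`reformatted`, `records`, `last`):

```shell
# FS and OFS: read colon-separated fields, write them dash-separated.
# Assigning to a field ($1 = $1) makes awk rebuild $0 using OFS.
reformatted=$(printf 'a:b:c\n' | awk 'BEGIN { FS = ":"; OFS = "-" } { $1 = $1; print }')

# RS and ORS: split the input on ";" and terminate every output record with "|".
records=$(printf 'x;y;z' | awk 'BEGIN { RS = ";"; ORS = "|" } { print NR "=" $0 }')

# NF is the number of fields, so $NF is the last field.
last=$(printf 'one two three\n' | awk '{ print NF, $NF }')

printf '%s\n' "$reformatted" "$records" "$last"
```

Without the `$1 = $1` assignment, `print` would output the record unchanged, because OFS only applies when awk rebuilds the record or prints comma-separated arguments.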

Built-in functions

AWK functions are described in the AWK documentation; the most important ones are:

  • print, printf() and sprintf()
    • printing and string-formatting functions
  • length()
    • length of a string argument
  • substr()
    • extract a substring from a string
  • split()
    • split a string into an array of strings
  • index()
    • find the position of a substring in a string
  • sub() and gsub()
    • (regexp) search and replace (once or globally, respectively)
  • ~ operator and match()
    • regexp search
  • tolower() and toupper()
    • convert text to lowercase or uppercase, respectively
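
The functions listed above can be tried on a single line of input. This is only a sketch; the sample sentence and the variable names (`builtin_demo`, `parts`, `s`, `g`) are made up.

```shell
builtin_demo=$(echo 'Hello, AWK world' | awk '{
    print length($0)                       # number of characters in the record
    print substr($0, 8, 3)                 # 3 characters starting at position 8
    print index($0, "world")               # position of the substring, 0 if absent
    n = split($0, parts, " ")              # parts[1] .. parts[n]
    print n, parts[1]
    s = $0; sub(/o/, "0", s); print s      # replace the first match only
    g = $0; gsub(/o/, "0", g); print g     # replace all matches
    if ($0 ~ /AWK/) print "match at", match($0, /AWK/), RLENGTH
    print toupper($1), tolower($2)
    printf("%-6s|%s\n", "id", sprintf("%3d", 7))
}')
printf '%s\n' "$builtin_demo"
```

sub() and gsub() modify their third argument in place, which is why the record is copied into `s` and `g` first.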

Learn by examples

Best practices

Portability

Prefer generic awk over a specific AWK implementation:

  • use generic awk for portable programs
  • otherwise name the particular implementation explicitly, e.g. gawk

AWK programs extension and readability

A general rule of thumb is to write an AWK program as a *.awk file if the equivalent one-liner is not easily readable.

If you have trouble understanding a one-line awk program, feel free to use GNU AWK's profiling functionality, i.e. the -p option, to obtain pretty-printed AWK code (in awkprof.out).

Code quality

  • comment properly
  • indent as you would in the C/C++ programming languages
  • use functions whenever possible
  • stay explicit and avoid awk's default (implicit) actions, which make an AWK application hard to understand
    • example: length > 80 is better written as 'length($0) > 80 { print }' or 'length($0) > 80 { print $0 }'
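
The points above can be sketched as a small, commented program kept in its own file. The file content, the function name `is_too_long`, and the variable `max` are hypothetical; a temporary file stands in for a real *.awk file.

```shell
prog=$(mktemp)
cat > "$prog" <<'EOF'
# long_lines.awk (hypothetical name): report lines longer than a limit.

# Returns 1 when text is longer than limit characters, 0 otherwise.
function is_too_long(text, limit) {
    return length(text) > limit
}

# Explicit condition and explicit action instead of a bare "length > 80".
is_too_long($0, max) {
    print NR ": " $0
}
EOF

long_lines=$(printf 'short\nthis line is long\n' | awk -v max=10 -f "$prog")
printf '%s\n' "$long_lines"
rm -f "$prog"
```

Passing the limit with -v keeps the program free of magic numbers and makes it reusable.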

Pitfalls

  • always wrap awk one-liners in single quotes ' to avoid shell expansion (for instance of $1)
    • awk "{print $1}" should be awk '{print $1}'
  • use one of the current implementations (e.g. gawk or mawk), as old implementations (old awk or nawk) are quite limited
  • strings and arrays are indexed from 1 (index(), split(), $i, ...)
  • the GNU AWK implementation understands localization and UTF-8/Unicode, so replacing with [g]sub() can lead to unwanted behavior unless you force gawk to drop such support by exporting the environment variable LC_ALL=C
    • other awk implementations may not support UTF-8/Unicode:
# awk implementation versions
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.5, GNU MP 6.1.1)
mawk 1.3.4 20161107
BusyBox v1.22.1 (2016-02-03 18:22:11 UTC) multi-call binary.

$ echo "Zřetelně" | gawk '{print toupper($0)}'
ZŘETELNĚ
$ echo "Zřetelně" | mawk '{print toupper($0)}'
ZřETELNě
$ echo "Zřetelně" | busybox awk '{print toupper($0)}'
ZřETELNě

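  • regular expression interval expressions such as {1,1} are not available in every implementation; of the versions listed above only gawk supports them (and older gawk versions need them explicitly enabled with --re-interval):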
$ ps auxwww | gawk '{if($2~/^[0-9]{1,1}$/){print}}'
root         1  0.0  0.0 197064  4196 ?        Ss   Oct31   2:21 /usr/lib/systemd/systemd --switched-root --system --deserialize 24
root         4  0.0  0.0      0     0 ?        S<   Oct31   0:00 [kworker/0:0H]

$ ps auxwww | gawk --re-interval '{if($2~/^[0-9]{1,1}$/){print}}'
root         1  0.0  0.0 197064  4196 ?        Ss   Oct31   2:21 /usr/lib/systemd/systemd --switched-root --system --deserialize 24
root         4  0.0  0.0      0     0 ?        S<   Oct31   0:00 [kworker/0:0H]

$ ps auxwww | mawk '{if($2~/^[0-9]{1,1}$/){print}}'
$
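
The quoting and indexing pitfalls from the list above can be reproduced with a short sketch. The helper function `unquoted_demo` and the variable names are hypothetical; the function exists only to give the shell a positional parameter `$1` to expand.

```shell
# Single quotes keep $1 away from the shell, so awk sees the field reference.
quoted=$(echo 'first second' | awk '{ print $1 }')

# In double quotes the shell substitutes its own $1 before awk starts.
# Called with the argument $2, the awk program silently becomes { print $2 }.
unquoted_demo() { echo 'first second' | awk "{ print $1 }"; }
broken=$(unquoted_demo '$2')

# Strings and arrays start at 1: the first character, the first match
# position and the first split() element all have index 1; index 0 is unused.
indexing=$(echo 'abc' | awk '{
    n = split("x,y", a, ",")
    has_zero = (0 in a)
    print substr($0, 1, 1), index($0, "a"), a[1], has_zero
}')

printf '%s\n' "$quoted" "$broken" "$indexing"
```

When the shell's `$1` is empty, the double-quoted variant degrades to `{ print }` and prints the whole line, which makes the mistake easy to overlook.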