# awk and sed practice

## 1. Intro

Text file preprocessing is an important step in NLP and text mining.

This work includes replacing particular strings, filtering target records, and inserting fields into text.

Of course, we can use Python for these jobs. Python scripts can be neat and powerful, and their biggest advantage is that they run on different platforms.

However, if Linux is our production environment, its text processing tools such as **awk** and **sed** can be more powerful and productive. We can regard these tools as a text processing framework: by adding only a few lines, we get the result we want.

I will introduce the basic ideas and usage of **awk** and **sed** through some simple but frequently used exercises.

## 2. awk

### 2.1 basic grammar

``` bash
awk [options] 'script' var=value file(s) 
awk [options] -f scriptfile var=value file(s)
```
* -F fs: use fs as the field separator (delimiter)
* -f scriptfile: read the awk program from scriptfile
* -v var=value: pass an outside variable into awk
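
As a small sketch of how `-F` and `-v` work together, the example below filters a colon-separated file. The sample records and the variable name `min_age` are made up for illustration.

```bash
# Hypothetical sample input: two colon-separated records in a temp file.
people=$(mktemp)
printf 'alice:30\nbob:25\n' > "$people"

# -F sets the field separator; -v assigns min_age before the script starts.
result=$(awk -F ':' -v min_age=28 '$2 >= min_age { print $1 }' "$people")
echo "$result"

rm -f "$people"
```

Only the record whose second field is at least `min_age` is printed.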

**basic structure**

``` bash
awk 'BEGIN{ print "start" } pattern{ commands } END{ print "end" }' file 
```

awk first runs the BEGIN block once, then runs pattern{ commands } for every line of the file (the commands run only on lines that match the pattern), and finally runs the END block once.


In [3]:
echo -e "A line 1\nA line 2" | awk 'BEGIN{ print "Start" } { print } END{ print "End" }'

Start
A line 1
A line 2
End
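
The example above has no pattern, so its commands run on every line. A minimal sketch with a pattern, using made-up input lines, could look like this:

```bash
# /^A/ is the pattern: the block runs only on lines that start with "A".
out=$(printf 'A line 1\nB line 2\nA line 3\n' |
    awk '/^A/ { n++; print NR ": " $0 } END { print n " matching lines" }')
echo "$out"
```

The second input line is skipped because it does not match `/^A/`, while `NR` still counts every line that was read.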


### 2.2 operators

![](https://raw.githubusercontent.com/applenob/linux_basic/master/resource/operator.gif)

The operators are almost the same as in other programming languages.
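
To make this concrete, here is a short sketch that exercises a few common awk operators. The values are arbitrary, and it is not a complete list of what the table shows.

```bash
out=$(awk 'BEGIN {
    a = 7; b = 2
    print a + b, a % b, a ^ b                  # arithmetic: sum, remainder, power
    print (a > b ? "bigger" : "not bigger")    # comparison inside a ternary
    s = "awk" "-" "sed"                        # juxtaposition concatenates strings
    print s
    if (s ~ /^awk/ && s !~ /perl/)             # ~ and !~ test a regular expression
        print "regex match"
}')
echo "$out"
```

The main surprises compared with other languages are string concatenation by juxtaposition and the `~` / `!~` regex match operators.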

### 2.3 built-in functions

**arithmetic functions**

- atan2( y, x )	
- cos( x )	
- sin( x ) 
- exp( x )	
- log( x )	
- sqrt( x )	
- int( x )	
- rand( ) : return a random number n, 0 <= n < 1
- srand( [expr] ) : set the seed for rand( )
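
A quick sketch of some of these functions in a BEGIN block (no input file is needed). The seed value 42 is arbitrary.

```bash
out=$(awk 'BEGIN {
    print int(3.9)                   # int truncates toward zero
    print sqrt(16)
    print exp(0), log(1)
    printf "%.4f\n", atan2(0, -1)    # atan2(0, -1) is pi
    srand(42)                        # arbitrary seed for rand
    r = rand()
    print ((r >= 0 && r < 1) ? "in range" : "out of range")
}')
echo "$out"
```

The random number itself is not printed here because its value depends on the awk implementation.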

**String Functions**

- gsub( Ere, Repl, [ In ] )	: substitute string globally
- sub( Ere, Repl, [ In ] )	: substitute string for the first match.
- index( String1, String2 )	
- length [(String)]	
- blength [(String)]	
- substr( String, M, [ N ] )	
- match( String, Ere ) : sets RSTART to the start index of the match and RLENGTH to its length.
- tolower( String )	
- toupper( String )	
- sprintf(Format, Expr, Expr, . . . )	

Ere: extended regular expression
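
The sketch below tries several of the string functions on one sample value, `APP#APP2`, borrowed from the data file used later. The variable names are only illustrative.

```bash
out=$(awk 'BEGIN {
    s = "APP#APP2"
    print length(s)                     # number of characters
    print index(s, "#")                 # position of the first "#"
    print substr(s, 5, 4)               # 4 characters starting at position 5
    print toupper("label"), tolower("LABEL")
    t = s
    n = gsub(/APP/, "app", t)           # gsub returns the number of replacements
    print n, t
    if (match(s, /[0-9]+/))             # match sets RSTART and RLENGTH
        print RSTART, RLENGTH
    print sprintf("%s-%03d", "id", 7)   # sprintf returns the formatted string
}')
echo "$out"
```

Note that `sub` and `gsub` modify their third argument in place, which is why the example works on the copy `t`.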

**common functions**

- close( Expression )	
- system(command )	
- Expression | getline [ Variable ]	
- getline [ Variable ] < Expression	
- getline [ Variable ]
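
As a rough sketch of how these functions fit together, the example below reads the output of a shell command line by line, closes the pipe, and then checks an exit status. The command strings are placeholders chosen for the demo.

```bash
out=$(awk 'BEGIN {
    # Assumed demo command: any shell command line works here.
    cmd = "(echo b; echo a) | sort"
    # Each call to cmd | getline reads one line of the command output.
    while ((cmd | getline line) > 0) { print "read: " line }
    close(cmd)                   # close the pipe so cmd could be run again
    status = system("true")      # system returns the exit status of the command
    print "exit status: " status
}')
echo "$out"
```

`getline` returns 1 while there is input, 0 at end of input and -1 on error, so comparing with `> 0` avoids an endless loop when the command fails.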

**built-in variables**

- \$0 : the whole line
- \$n : the n-th field
- ARGC : number of command-line arguments
- ARGV : array of command-line arguments
- ENVIRON : array of environment variables
- FILENAME : name of the current input file
- FNR : record number within the current file
- FS : input field separator
- RS : input record separator
- NF : number of fields in the current record
- NR : number of records read so far
- OFS : output field separator
- ORS : output record separator
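
A minimal sketch of the most frequently used variables, with made-up comma-separated input:

```bash
# FS splits the input on commas; OFS is used when awk rebuilds $0.
out=$(printf 'a,b,c\nd,e\n' |
    awk 'BEGIN { FS = ","; OFS = "-" }
         { $1 = $1; print "record " NR " has " NF " fields: " $0 }')
echo "$out"
```

The assignment `$1 = $1` changes nothing in the field itself, but it forces awk to rebuild `$0` with `OFS` between the fields.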

### 2.4 practice #1

First, take a look at our sample data file:


In [4]:
cat data.txt

CONTEXT1    APP#APP1    LABEL0
CONTEXT1    APP#APP1    LABEL1
CONTEXT1    APP#APP2    LABEL1
CONTEXT2    APP#APP2    LABEL0
CONTEXT2    APP#APP3    LABEL1
CONTEXT3    APP#APP1    LABEL0
CONTEXT3    APP#APP2    LABEL1

Suppose we want some basic statistics about this text file, in this case **the number of records whose first field is CONTEXT1**.

We can do it like this:

In [10]:
awk 'BEGIN{count=0}{if($1=="CONTEXT1"){count++}}END{print "# of CONTEXT1 is: "count}' data.txt

# of CONTEXT1 is: 3
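
The same idea extends to counting every context at once with an associative array. This is a sketch that recreates the sample data in a temporary file so it can run on its own. The array name `count` is arbitrary, and the output is piped through `sort` because awk does not guarantee the order of `for (key in array)`.

```bash
# Recreate the sample data shown above in a temporary file.
data=$(mktemp)
cat > "$data" <<'EOF'
CONTEXT1    APP#APP1    LABEL0
CONTEXT1    APP#APP1    LABEL1
CONTEXT1    APP#APP2    LABEL1
CONTEXT2    APP#APP2    LABEL0
CONTEXT2    APP#APP3    LABEL1
CONTEXT3    APP#APP1    LABEL0
CONTEXT3    APP#APP2    LABEL1
EOF

# count[$1]++ creates one counter per distinct value of field 1.
counts=$(awk '{ count[$1]++ } END { for (c in count) print c, count[c] }' "$data" | sort)
echo "$counts"

rm -f "$data"
```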


### 2.5 practice #2

Now we want to do something else with this data. Suppose we want to rewrite field #2 of every record, for example from **APP#APP2** to **#APP2%APP2**.

I know it may seem strange to do that, but it is good practice for substitution.

To make this clearer, let's separate the operation into 2 steps.

First, rewrite field #2 from **APP#APP2** to **#APP2**.

In [11]:
awk '{sub(/[A-Z]+#/,"#",$2);print $0}' data.txt

CONTEXT1 #APP1 LABEL0
CONTEXT1 #APP1 LABEL1
CONTEXT1 #APP2 LABEL1
CONTEXT2 #APP2 LABEL0
CONTEXT2 #APP3 LABEL1
CONTEXT3 #APP1 LABEL0
CONTEXT3 #APP2 LABEL1


The second step is more complicated. We need to get the string that matches the regular expression so that we can reuse it in the substitution.

We can do this job by using the **match** function with **RSTART** and **RLENGTH**.

In [27]:
awk '{sub(/[A-Z]+#/,"#",$2);match($2,/#[A-Z]+[0-9]/);app_name=substr($2,RSTART+1,RLENGTH-1);\
sub("#"app_name,"#"app_name"%"app_name,$2); print $0}' data.txt

CONTEXT1 #APP1%APP1 LABEL0
CONTEXT1 #APP1%APP1 LABEL1
CONTEXT1 #APP2%APP2 LABEL1
CONTEXT2 #APP2%APP2 LABEL0
CONTEXT2 #APP3%APP3 LABEL1
CONTEXT3 #APP1%APP1 LABEL0
CONTEXT3 #APP2%APP2 LABEL1


As we can see, the code is not neat enough.

Instead, we can use **sed** for substitution tasks.

Let **awk** do the statistical job, and let **sed** do the substitution job.

### 2.6 practice #3

Suppose we want to know the names of all the apps. We can use **awk** and **uniq** together:

In [29]:
awk '{print $2}' data.txt | uniq

APP#APP1
APP#APP2
APP#APP3
APP#APP1
APP#APP2


It seems that uniq did not work. That is because uniq only removes adjacent duplicate lines, so the input has to be sorted first:

In [30]:
awk '{print $2}' data.txt | sort | uniq

APP#APP1
APP#APP2
APP#APP3


Alternatively, we can use a trickier way (the array name **seen** is arbitrary). Note that this prints the whole first line for each distinct app, not just the app name:

In [42]:
awk '!seen[$2]++' data.txt

CONTEXT1    APP#APP1    LABEL0
CONTEXT1    APP#APP2    LABEL1
CONTEXT2    APP#APP3    LABEL1


This is short for:

In [55]:
awk '!seen[$2]++{print $0}' data.txt

CONTEXT1    APP#APP1    LABEL0
CONTEXT1    APP#APP2    LABEL1
CONTEXT2    APP#APP3    LABEL1


which in turn is short for:

In [53]:
awk '{if (!seen[$2]) {print $0; seen[$2]++} }' data.txt

CONTEXT1    APP#APP1    LABEL0
CONTEXT1    APP#APP2    LABEL1
CONTEXT2    APP#APP3    LABEL1


I find this the most convenient way to get unique results: it needs a single pass and no sorting, it keeps the original order, and it only stores the distinct keys in memory.
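
If only the app names are wanted, as in the `sort | uniq` version, the action can print field #2 instead of the whole line. A small sketch with three of the sample records inlined:

```bash
# The first occurrence of each value in field 2 passes the !seen[$2]++ test.
names=$(printf 'CONTEXT1    APP#APP1    LABEL0\nCONTEXT1    APP#APP1    LABEL1\nCONTEXT1    APP#APP2    LABEL1\nCONTEXT2    APP#APP3    LABEL1\n' |
    awk '!seen[$2]++ { print $2 }')
echo "$names"
```

Unlike `sort | uniq`, the names come out in the order of first appearance rather than in sorted order.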

## 3. sed

Here is the sed solution for practice #2. Since **sed** provides **&** for the whole matched string and **\\n** (\\1, \\2, ...) for the n-th captured group, the sed solution is much simpler.

In [57]:
sed 's/[A-Z]\+#/#/g' data.txt

CONTEXT1    #APP1    LABEL0
CONTEXT1    #APP1    LABEL1
CONTEXT1    #APP2    LABEL1
CONTEXT2    #APP2    LABEL0
CONTEXT2    #APP3    LABEL1
CONTEXT3    #APP1    LABEL0
CONTEXT3    #APP2    LABEL1

In [60]:
sed 's/[A-Z]\+#/#/g' data.txt | sed 's/#\([A-Z]\+[0-9]\)/&%\1/g'

CONTEXT1    #APP1%APP1    LABEL0
CONTEXT1    #APP1%APP1    LABEL1
CONTEXT1    #APP2%APP2    LABEL1
CONTEXT2    #APP2%APP2    LABEL0
CONTEXT2    #APP3%APP3    LABEL1
CONTEXT3    #APP1%APP1    LABEL0
CONTEXT3    #APP2%APP2    LABEL1
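
The two steps can also be folded into a single substitution. The sketch below assumes a sed that supports the `-E` option for extended regular expressions (GNU and BSD sed both do), which removes the need to escape `+` and the parentheses. Two of the sample records are inlined so the snippet runs on its own.

```bash
# ([A-Z]+[0-9]) captures the app name; \1 reuses it twice in the replacement.
out=$(printf 'CONTEXT1    APP#APP1    LABEL0\nCONTEXT2    APP#APP3    LABEL1\n' |
    sed -E 's/[A-Z]+#([A-Z]+[0-9])/#\1%\1/')
echo "$out"
```

Because sed edits the matched text in place, the original spacing between the fields is preserved, unlike in the awk version.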

These sed commands are not easy to read, but they are really convenient to use.