# UNIX Commands for Data Scientists

## How to execute the commands

**Windows**: open Git Bash and paste the command into the terminal, either by right-clicking inside the window or by right-clicking the top window border and selecting Edit -> Paste.

**Mac OS**/**Linux**: you can either execute the commands through this Jupyter Notebook or copy and paste them into a terminal.

## Check file path

List contents of a directory.

In [2]:
!ls ./unix

Icon?           fruits.txt      shakespeare.txt


## Declare Filename

Create a variable to hold the filename of the text file.

This is the only case where the syntax differs between the Jupyter Notebook and a shell.

In the Notebook, each `!` command runs in a separate shell process, so a plain shell variable would not survive from one cell to the next. Instead we store `filename` in an environment variable, which persists across cells. This is done with the `%env` IPython magic function; execute `%env?` to learn more.

In [3]:
%env filename=./unix/shakespeare.txt

env: filename=./unix/shakespeare.txt


If you are instead running in a shell, you can just define a shell variable named filename with this syntax:

    filename=./unix/shakespeare.txt
    
Make sure that there are **no spaces** around the equal sign.
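As a minimal sketch for a plain shell session (not the Notebook), the definition and a first use could look like the following. Quoting the variable as `"$filename"` is an addition to the text above: it is a common habit that keeps the value intact if a path ever contains spaces.

```shell
# Define a shell variable: no spaces around the equals sign.
filename=./unix/shakespeare.txt

# Double quotes keep the value as a single word, even for paths with spaces.
echo "$filename"

# With spaces the shell would instead try to run a command called "filename":
#   filename = ./unix/shakespeare.txt
```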

## Verify the variable is created

We can verify that the variable is now defined by printing it out with `echo`. For the rest of this reading we will use this variable to point to the filename.

In [4]:
!echo $filename

./unix/shakespeare.txt


## head

`head` prints lines from the top of the file; you can specify how many with `-n`. What happens if you don't specify a number of lines?

In [5]:
!head -n 3 $filename

This is the 100th Etext file presented by Project Gutenberg, and
is presented in cooperation with World Library, Inc., from their
Library of the Future and Shakespeare CDROMS.  Project Gutenberg


## tail

`tail` prints lines from the bottom of the file; you can specify how many with `-n`.

In [6]:
!tail -n 10 $filename

PERSONAL USE ONLY, AND (2) ARE NOT DISTRIBUTED OR USED
COMMERCIALLY.  PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY
SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>



End of this Etext of The Complete Works of William Shakespeare
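A related form, not used in this reading, is `tail -n +N`, which prints everything from line N onwards instead of the last N lines. The sketch below uses a small throwaway file of numbers (the `sample` variable is purely illustrative) so that the result is easy to check by eye.

```shell
# Build a temporary file holding the numbers 1..20, one per line.
sample=$(mktemp)
seq 1 20 > "$sample"

# Last 3 lines of the file.
last_3=$(tail -n 3 "$sample")

# Everything from line 18 onwards (note the plus sign).
from_18=$(tail -n +18 "$sample")

echo "$last_3"
echo "$from_18"

rm -f "$sample"
```

On a 20-line file both commands select the same three lines, but `-n +18` keeps working as "skip the first 17 lines" however long the file is.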





## wc

`wc`, which stands for word count, prints the number of lines, words and bytes (in plain ASCII text, one byte is one character):

In [7]:
!wc $filename

  124505  901447 5583442 ./unix/shakespeare.txt


## wc

Specify `-l` to print only the number of lines. To find out how to print only the number of words instead, execute (in Git Bash on Windows or on Linux):

    wc --help

or (on Mac or on Linux):

    man wc

Or guess!

In [8]:
!wc -l $filename

  124505 ./unix/shakespeare.txt


## cat

You can use pipes with `|` to stream the output of one command into the input of another. This is useful for composing many tools to achieve a more complicated result.

For example, `cat` dumps the content of a file, which we can then pipe to `wc`:

In [9]:
!cat $filename | wc -l 

  124505
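When the first command is just `cat` on a single file, input redirection with `<` gives the same count without the extra process. The following sketch compares the two on a small temporary file; the `sample` file and its contents are made up for illustration.

```shell
# Create a three-line temporary file.
sample=$(mktemp)
printf 'one\ntwo\nthree\n' > "$sample"

# Pipe: cat writes the file to its output, wc reads it from its input.
piped=$(cat "$sample" | wc -l)

# Redirect: the shell connects the file to the input of wc directly.
redirected=$(wc -l < "$sample")

# Unquoted on purpose, so any padding spaces printed by wc are dropped.
echo $piped $redirected

rm -f "$sample"
```

In both forms `wc` reads from standard input, which is why the file name no longer appears next to the count.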


## grep

`grep` is an extremely powerful tool for searching text in one or more files. In the next command we look for all the lines that contain a given word. With `-i` we also ask for case-insensitive matching, i.e. uppercase and lowercase letters are treated as equal.

In [10]:
!grep -i 'parchment' $filename

  If the skin were parchment, and the blows you gave were ink,
  Ham. Is not parchment made of sheepskins?
    of the skin of an innocent lamb should be made parchment? That
    parchment, being scribbl'd o'er, should undo a man? Some say the
    Upon a parchment, and against this fire
    But here's a parchment with the seal of Caesar;  
    With inky blots and rotten parchment bonds;
    Nor brass, nor stone, nor parchment, bears not one,


## grep

We can combine `grep` and `wc` to count the number of lines in a file that contain a specific word: 

In [11]:
!grep -i 'liberty' $filename | wc -l

      72
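`grep` can also do this count by itself with `-c`. Note that both approaches count matching lines, not occurrences: a line that contains the word twice counts once. The sketch below illustrates the difference on made-up sample text; it relies on `-o`, which GNU and BSD `grep` support but POSIX does not require.

```shell
# Three lines: two contain the word, one of them twice.
sample=$(mktemp)
printf 'Liberty and law\nno match here\nsweet liberty, liberty!\n' > "$sample"

# -c prints the number of matching lines.
matching_lines=$(grep -ic 'liberty' "$sample")

# -o prints each match on its own line, so wc -l counts occurrences.
occurrences=$(grep -io 'liberty' "$sample" | wc -l)

echo $matching_lines $occurrences

rm -f "$sample"
```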


## sed

`sed` is a powerful stream editor. Like `grep`, it processes its input line by line, but it can also modify the text it outputs. It uses regular expressions, a language for describing text patterns, to select what to match and replace.

### sed

For example:

    s/from/to/g
    
means:

* `s` for substitution
* `from` is the word to match
* `to` is the replacement string
* `g` specifies to apply this to all occurrences on a line, not just the first

In the following we replace all instances of 'parchment' with 'manuscript'.

We also redirect the output to a file with `>`, so instead of being printed to the screen it is saved in the text file `temp.txt`.
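To see what the `g` flag changes, here is a small sketch that runs the substitution with and without it on a single made-up line in which the word appears twice.

```shell
line='parchment on parchment'

# Without g: only the first match on each line is replaced.
first_only=$(echo "$line" | sed -e 's/parchment/manuscript/')

# With g: every match on the line is replaced.
all=$(echo "$line" | sed -e 's/parchment/manuscript/g')

echo "$first_only"   # manuscript on parchment
echo "$all"          # manuscript on manuscript
```

Unlike the `grep -i` searches earlier, this substitution is case sensitive, so a capitalized "Parchment" would be left unchanged.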

## sed

In [12]:
# replace all instances of 'parchment' with 'manuscript'

!sed -e 's/parchment/manuscript/g' $filename > temp.txt

Then we check with `grep` that `temp.txt` contains the word "manuscript":

In [13]:
!grep -i 'manuscript' temp.txt 

  If the skin were manuscript, and the blows you gave were ink,
  Ham. Is not manuscript made of sheepskins?
    of the skin of an innocent lamb should be made manuscript? That
    manuscript, being scribbl'd o'er, should undo a man? Some say the
    Upon a manuscript, and against this fire
    But here's a manuscript with the seal of Caesar;  
    With inky blots and rotten manuscript bonds;
    Nor brass, nor stone, nor manuscript, bears not one,


## sort

In [14]:
!head -n 5 $filename

This is the 100th Etext file presented by Project Gutenberg, and
is presented in cooperation with World Library, Inc., from their
Library of the Future and Shakespeare CDROMS.  Project Gutenberg
often releases Etexts that are NOT placed in the Public Domain!!



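We can sort the first 5 lines of the file in alphabetical order. `sort` compares whole lines character by character; here the first character already decides the order. In this output the empty line comes first and uppercase letters sort before lowercase ones, which depends on the locale settings: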

In [15]:
!head -n 5 $filename | sort


Library of the Future and Shakespeare CDROMS.  Project Gutenberg
This is the 100th Etext file presented by Project Gutenberg, and
is presented in cooperation with World Library, Inc., from their
often releases Etexts that are NOT placed in the Public Domain!!
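If the uppercase-first ordering is not what you want, `sort -f` ignores case when comparing. The sketch below uses the first words of the four lines above as made-up input and sets `LC_ALL=C` so that the result does not depend on the locale of your machine.

```shell
words=$(printf 'often\nLibrary\nis\nThis\n')

# Byte order: every uppercase letter sorts before every lowercase one.
byte_order=$(echo "$words" | LC_ALL=C sort)

# -f folds lowercase to uppercase before comparing.
folded=$(echo "$words" | LC_ALL=C sort -f)

echo "$byte_order"   # Library, This, is, often (one per line)
echo "$folded"       # is, Library, often, This (one per line)
```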


## sort

We can also sort on the second word of each line: `-t' '` sets the delimiter to a space and `-k2` starts the sort key at column 2. Note that `-k2` uses everything from column 2 to the end of the line as the key; `-k2,2` would restrict the key to column 2 alone.

Here the lines are therefore ordered by "is", "of", "presented", "releases".

In [16]:
!head -n 5 $filename | sort -t' ' -k2


This is the 100th Etext file presented by Project Gutenberg, and
Library of the Future and Shakespeare CDROMS.  Project Gutenberg
is presented in cooperation with World Library, Inc., from their
often releases Etexts that are NOT placed in the Public Domain!!
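The difference between a key that runs to the end of the line (`-k2`) and a key limited to one column (`-k2,2`) only shows up when several lines share the same second column. The sketch below uses three made-up rows to make that visible, again with `LC_ALL=C` for a reproducible order.

```shell
rows=$(printf 'b 2 a\na 2 z\nc 1 x\n')

# -k2: the key runs from column 2 to the end of the line ("2 a" < "2 z").
to_end=$(echo "$rows" | LC_ALL=C sort -t' ' -k2)

# -k2,2: the key is column 2 only; ties fall back to comparing whole lines.
column_only=$(echo "$rows" | LC_ALL=C sort -t' ' -k2,2)

echo "$to_end"        # c 1 x, b 2 a, a 2 z (one per line)
echo "$column_only"   # c 1 x, a 2 z, b 2 a (one per line)
```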


## uniq

`sort` is often used in combination with `uniq` to deal with duplicated lines.

`uniq` only compares adjacent lines, so we first use `sort` to place equal lines next to each other. Plain `uniq` then keeps one copy of each line, while `uniq -u`, used below, keeps only the lines that occur exactly once and drops every copy of a duplicated line:

In [17]:
!sort $filename | wc -l

  124505


In [18]:
!sort $filename | uniq -u | wc -l

  110834
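The most common `uniq` variants are easy to confuse, so here is a small sketch that runs them side by side on five made-up lines in which `a` and `b` appear twice and `c` once.

```shell
sorted=$(printf 'a\nb\na\nc\nb\n' | sort)

# uniq: one copy of every distinct line.
distinct=$(echo "$sorted" | uniq)

# uniq -u: only the lines that occur exactly once.
once=$(echo "$sorted" | uniq -u)

# uniq -d: one copy of each line that occurs more than once.
repeated=$(echo "$sorted" | uniq -d)

echo "$distinct"   # a, b, c (one per line)
echo "$once"       # c
echo "$repeated"   # a, b (one per line)

# uniq -c: prefix each distinct line with the number of times it occurs.
echo "$sorted" | uniq -c
```

To count the distinct lines of a file, `sort $filename | uniq | wc -l` (or `sort -u $filename | wc -l`) is therefore the form to use; the `uniq -u` count is smaller because every repeated line is dropped entirely.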
