The Shell Environment
====

So far, we have used Python exclusively to manipulate files, and used the shell only to open `ipython`. However, the shell gives you a range of excellent tools for manipulating (text) files. We will introduce the basics here. Each of these tools was designed to do one specific thing, and to do it really well.

Another advantage of the shell is that you can string different tools together in a pipeline. For example, you can run a Python script on a file, write the output to another file, and pass that one on to an external classification tool (for example **Vowpal Wabbit**). These pipelines can be stored as scripts, and then executed as often as you want. They can take parameters and are a useful tool if you work on a larger project, because you can run the same pipeline with different settings without having to write all commands by hand each time.

As powerful as the shell is, it unfortunately is not available in the Windows command prompt, only on Mac and Linux. If you are using a Windows machine, you can install an environment like `cygwin` that provides a Unix-like shell.

If you want to execute shell commands from `ipython`, you have to prefix them with an exclamation mark `!`, e.g.,

`!head file.txt`

For the exercises, we will give you access to a Linux server, where you can run the commands.

If you have any problems with a command, you can always type `man COMMAND` to get a manual. 

In the following, variables will be spelled in ***UPPERCASE***. Note that the shell and its programs are case-sensitive...


Directories
====

You've already used `cd` to **c**hange a **d**irectory. 

`mkdir DIRECTORY` creates a new directory. 

You can also run it with a flag `-p`, and create several levels at once. In that case, it will create all directories along the path.

`mkdir -p DIRECTORY/SUB_DIR/MOAR_SUB_DIR` creates all directories in the path, unless they already exist. 
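
As a minimal sketch of the difference between the two forms: the directory names below are made up for illustration, and everything happens in a throwaway directory created with `mktemp -d`, so nothing in your own folders is touched.

```shell
# Work in a scratch directory (an assumption of this sketch, not required by mkdir).
scratch=$(mktemp -d)
cd "$scratch"

# Plain mkdir fails when the parent directories are missing.
mkdir results/run1/logs 2> /dev/null || echo "plain mkdir failed"

# With -p, all missing directories along the path are created.
mkdir -p results/run1/logs

# Running it again is not an error, unlike plain mkdir on an existing directory.
mkdir -p results/run1/logs && echo "second call succeeded"

ls results/run1
```

This makes `mkdir -p` a safe first line in scripts that need an output directory to exist.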


In [1]:
!mkdir tmp
!ls

mkdir: tmp: File exists
Exercises.ipynb       data.txt              male.txt              male2.txt
The_Environment.ipynb data.zip              male1.txt             tmp


Files
====

There are several commands to manipulate files, i.e., create, move, or delete them.

`touch FILE` will create a file with the given name, if it does not exist. Useful to make sure a certain file exists before writing to it, for example.

`cp SOURCE TARGET` **c**o**p**ies a file from the source to the target. The source file still exists.

`mv SOURCE TARGET` **m**o**v**es a file from source to target. The source file no longer exists.

`rm FILE` **r**e**m**oves a file. It's gone. Deleted. Kaputt!

`cp` and `rm` also work for directories if you include the parameter `-r` (for **r**ecursively), which makes them include all subdirectories and files. `mv` moves or renames a directory without any extra parameter.
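
The following sketch walks through these commands once. All file and directory names are invented for the example, and it runs in a scratch directory so that `rm` cannot hit anything important. Note that `mv` handles a directory without any flag, while `cp` and `rm` need `-r`.

```shell
scratch=$(mktemp -d)
cd "$scratch"

touch notes.txt                 # create an empty file
cp notes.txt backup.txt         # now both notes.txt and backup.txt exist
mv backup.txt archive.txt       # backup.txt is gone, archive.txt takes its place

mkdir -p project/data
touch project/data/a.txt
cp -r project project_copy      # -r copies the directory and everything below it
mv project_copy old_project     # mv renames a directory without any flag
rm -r old_project               # -r removes the directory and its contents
rm notes.txt

ls                              # archive.txt and project remain
```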

You can also fetch remote files and directories from a URL by using `curl`:

`curl -O URL` will fetch a file from a URL and write it into a file of the same name in the current directory.


Wildcards
----

If you want to operate on several files that differ in only a few characters, wildcards are useful. The `*` matches any sequence of characters, so if you have 10 files named `file000.txt` through `file009.txt`, you can copy all of them with

`cp file0* some_directory/`

If you only want to operate on certain files, you can list the characters to match in square brackets:

`cp file00[2468].txt some_directory/`
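
To see what the two patterns actually match, here is a small sketch that creates the ten files from the example in a scratch directory. The target directories `all` and `even` are names chosen for this illustration.

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir all even

# Create file000.txt through file009.txt.
for i in 0 1 2 3 4 5 6 7 8 9; do
    touch "file00$i.txt"
done

cp file0* all/                 # * matches any run of characters
cp file00[2468].txt even/      # [2468] matches exactly one of the listed characters

ls all | wc -l                 # counts the ten copied files
ls even                        # file002.txt file004.txt file006.txt file008.txt
```

The shell expands the pattern into the list of matching file names before `cp` runs, so the same wildcards work with any other command.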


In [2]:
!cp data.txt tmp/reviews.tsv
!cp data.txt tmp/
!pwd

/Users/dirkhovy/Dropbox/copenhagen/teaching/scientific_programming/scientific-programming-2015/lectures/lecture07


Compressing and uncompressing
====

It can be convenient to pack several files into one and compress it.

`zip TARGET FILE(S)` packs all of the specified files into an archive. You should specify the ending `.zip` for the target.

`unzip FILE.ZIP` reverses that, i.e., it decompresses and unpacks a `.zip` file.

Alternatively, you can use `tar`. It's famously powerful and infamously complicated in its syntax (see http://xkcd.com/1168/). Never mind, you can just remember these two:

`tar czf ARCHIVE.tar.gz FILE(S)` packs the given files or directories into a tar file and gzips it.

`tar zxvf ARCHIVE.tar.gz` unpacks a gzipped tar file.
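
A short round trip may help to make the letters stick. This is only a sketch with invented file names. The `t` flag (list contents) and the `-C` option (unpack into another directory) are not mentioned above, but both exist in GNU and BSD `tar`.

```shell
scratch=$(mktemp -d)
cd "$scratch"

mkdir reports
echo "first"  > reports/a.txt
echo "second" > reports/b.txt

# c = create, z = gzip, f = the archive file name follows
tar czf reports.tar.gz reports

# t = list the contents without unpacking
tar tzf reports.tar.gz

# x = extract; -C unpacks into another directory instead of the current one
mkdir unpacked
tar xzf reports.tar.gz -C unpacked

cat unpacked/reports/a.txt
```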


`head/tail`
====
You have already seen the `head` and `tail` commands, which let you look at the first and last few lines of a file (and you have seen that `pandas` has its own version).
Both commands take a parameter that lets you specify how many lines you would like to display.

`head -100 file.txt`
gets the first 100 lines,
`tail -100 file.txt`
the last 100 lines.

`tail` can also start at a given line instead of counting from the end: `-n +N` prints everything from line $N$ onwards.

`tail -n +2 file.txt`
skips the first line and shows the rest of the file.
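
Here is a small sketch on a toy file with a header row. It uses the `-n N` spelling of the line count, which is the POSIX form and behaves the same as the shorter `-N` where that is supported. The file name and contents are made up.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# A small file with a header row followed by five data rows.
printf 'label\n1\n2\n3\n4\n5\n' > rows.txt

head -n 2 rows.txt       # first two lines: label, 1
tail -n 2 rows.txt       # last two lines: 4, 5
tail -n +2 rows.txt      # everything from line 2 on, i.e., without the header
```

Dropping the header with `tail -n +2` is handy before sorting or counting the rows of a data file.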



In [3]:
!head -10 data.txt

NONE	M	The order was filled quickly and correctly. No problems at all, highly recommended.
NONE	M	I found the parts I was looking for much cheaper than at major part dealer sites. The delivery was fast and the part fit perfectly. Great service. No complaints. You will definitely get more of my business and I will gladly recommend you to friends!
NONE	M	Camicia non conforme rispetto all'ordine  : sbagliate le cuciture dei bottoni.
NONE	F	This is the greatest thing, Once a month Petflow sends a box of "goodies" for your pet . Charlotte my ShihTzu got her first one on December 22nd just in time for Christmas .  She got an adorable cow toy,  some treats shes never tried before but seemed to like all of them. And some teeth cleaner that doesn't involve brushing. I dont know whos more excited about next month's package Charlotte or me...
NONE	M	Honesty and sincerity are very important to have and I think that they have both. I would definitely recommend them.
NONE	F	One Product to big f

Piping and output
====
Two important concepts in the shell are piping and output. Piping (i.e., creating a pipeline) allows you to execute several commands in one go, where each command takes the output of the previous command as its input. You do this with the `|` symbol (appropriately called the "pipe").

`command1 file.txt | command2`

Program output comes in two forms: standard output and standard error, aka STDOUT and STDERR. This is extremely useful to separate progress output (e.g., "Training model..."), which is typically written to STDERR, from actual output (e.g., "Accuracy: 89.56"), which is typically written to STDOUT. If you run a script in the shell, you cannot necessarily see which output is which, but you can redirect them to different files.
In order to write standard output to a file, use the `>` sign after a command.

`head -100 file.txt > file.top100.txt`

Any STDERR output generated during this will still appear in the shell. In order to redirect STDERR, use `2>`. This is very handy for creating log files of the programs you are running.

`command file.txt > output.txt 2> output.log`
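
To make the two streams visible, the sketch below defines a tiny stand-in for a real program. The function `train` is hypothetical and only exists for this illustration. It uses `>&2`, the standard way to send a message to STDERR from within the shell, and `tr` to uppercase its input.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# A stand-in for a real program: progress goes to STDERR, the result to STDOUT.
train() {
    echo "Training model..." >&2
    echo "Accuracy: 89.56"
}

# Without redirection, both lines would show up in the terminal.
# Here the result goes to one file and the progress message to another.
train > output.txt 2> output.log

cat output.txt    # Accuracy: 89.56
cat output.log    # Training model...

# A pipe only passes STDOUT on to the next command.
train 2> /dev/null | tr 'a-z' 'A-Z'    # ACCURACY: 89.56
```

Redirecting STDERR to `/dev/null` simply discards it, which is useful when you only care about the result.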


In [4]:
!head tmp/reviews.tsv > tmp/top10.reviews.tsv
!tail tmp/reviews.tsv > tmp/bottom10.reviews.tsv
!head tmp/bottom10.reviews.tsv

21	M	I've already purchased like 4 or 5 times in PCGameSupply PSN Cards, always they're almost Instant Issued to my account - orders before I notice...
NONE	M	Got my labels quickly, and they look fantastic. Thanks!
NONE	M	I have just sold my I-phone to envirophone and I am glad to say got a good price and excellent service. I will certainly use them again and recommend  them to you if you have a spare phone to dispose of if you have just upgraded. Give them a try what have you got to lose?
NONE	F	Well structured website, helpful with good reviews and videos.
NONE	F	I have 5 dogs and I like to order toys for them.  None of the 5 were interested in your Gumby Toy.  How have other dogs reacted to Gumby?
NONE	F	I have used easytobook for the last 8 years and couldnt be happier with the website and service: the prices are lower than most other hotel websites and the page is easy to navigate. The page layout is extremely important if someone wants to such a large number of hotels and do

`sort` and `uniq`
====

You can sort a file by running `sort` on it. 

`sort FILE`

Several flags let you define how to sort. The most important ones are `-r` for reverse order, and `-n` for numerical sorting (otherwise, the numbers 1 through 10 will be ordered as 1 ***10*** 2 3 4 5 6 7 8 9).


`uniq` removes duplicate lines, but only if they are directly adjacent, which is why it is typically called with a pipe after `sort`.

`sort FILE | uniq`

It has a flag, `-c`, that counts how often each type has occurred in the sorted list.
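
A typical use is a quick frequency table. The sketch below counts the values in a made-up one-column file. The final `sort -rn` is an addition for this example that orders the counts from largest to smallest.

```shell
scratch=$(mktemp -d)
cd "$scratch"

printf 'M\nF\nM\nM\nF\nNONE\n' > labels.txt

# uniq only collapses neighbouring duplicates, so sort first.
sort labels.txt | uniq

# Count each value, then sort the counts numerically, largest first.
sort labels.txt | uniq -c | sort -rn
```

The last pipeline prints one line per distinct value, with the count in front of it.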


In [5]:
!sort tmp/top10.reviews.tsv

50	F	Used this site many times.  I always start mysearch here. Prices change daily and if you want to really research the price continually at many different sites, I have found cheaper cars elsewhere.  However, if you don't have a lot of time to research the price, this site has always been among the top three (e.g., cheapest) of the ten sites I use to reserve a car.
NONE	F	One Product to big for being a med. and the color was not what was expected.
NONE	F	This is the greatest thing, Once a month Petflow sends a box of "goodies" for your pet . Charlotte my ShihTzu got her first one on December 22nd just in time for Christmas .  She got an adorable cow toy,  some treats shes never tried before but seemed to like all of them. And some teeth cleaner that doesn't involve brushing. I dont know whos more excited about next month's package Charlotte or me...
NONE	F	Very unsatisfied with product. It was shipped with multiple wrong parts. It took months to resolve. They stopped answering my

`cat`
====
`cat` simply prints the contents of a file to the shell. 

`cat FILE`

However, you can specify as many files as you want and output them together to create a new, combined file.

`cat FILE1 FILE2 > FILE3`


In [6]:
!cat tmp/top10.reviews.tsv tmp/bottom10.reviews.tsv > tmp/combined.tsv
!wc -l tmp/combined.tsv

      20 tmp/combined.tsv


`echo`
====
`echo` takes a string or a variable and prints it to standard output. 

`echo $VAR` or `echo "my hovercraft is full of eels"`

This is very useful if you want to know the value of an environment variable, or if you want to leave traces in your shell scripts.
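
As a small sketch of both uses: the variable `GREETING` and the file `progress.log` are invented for the example. It also shows the difference between `>`, which overwrites a file, and `>>`, which appends to it.

```shell
scratch=$(mktemp -d)
cd "$scratch"

GREETING="my hovercraft is full of eels"
echo "$GREETING"              # prints the value of the variable
echo "HOME is set to $HOME"   # environment variables work the same way

echo "step 1 done" >  progress.log   # > creates or overwrites the file
echo "step 2 done" >> progress.log   # >> appends to the end instead
cat progress.log
```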

In [7]:
!echo "NONE\tM\ttest"
!echo "NONE\tM\ttest" >> tmp/combined.tsv
!wc -l tmp/combined.tsv

NONE	M	test
      21 tmp/combined.tsv


In [8]:
!wc -l tmp/combined.tsv


      21 tmp/combined.tsv


`tr`
====

`tr` is a simple way to replace characters with others. `tr` has the unusual property that it only reads from standard input, so you need to redirect the file **into** the program with `<`. The syntax is

`tr 'TARGET' 'REPLACEMENT' < FILE`

A common operation is to replace spaces with newlines (`\n`), which puts every word on its own line: `tr ' ' '\n' < FILE`.
The target and replacement can also be groups of characters, such as all uppercase or all lowercase letters, which makes lowercasing a whole file very easy:

`tr 'A-Z' 'a-z' < FILE`
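
Both operations combine nicely in a pipeline. The sketch below, on a made-up one-line file, lowercases the text, puts one word on each line, and counts the words with `sort` and `uniq -c`.

```shell
scratch=$(mktemp -d)
cd "$scratch"

echo "The Order Was Filled Quickly" > review.txt

# Lowercase every letter.
tr 'A-Z' 'a-z' < review.txt

# One word per line: replace each space with a newline.
tr ' ' '\n' < review.txt

# Combine both in a pipeline and count how often each word occurs.
tr 'A-Z' 'a-z' < review.txt | tr ' ' '\n' | sort | uniq -c
```

Only the first `tr` needs the `<`, because the second one gets its input from the pipe.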



In [9]:
!tr 'A-Z' '0-9' < tmp/top10.reviews.tsv

9994	9	9he order was filled quickly and correctly. 9o problems at all, highly recommended.
9994	9	8 found the parts 8 was looking for much cheaper than at major part dealer sites. 9he delivery was fast and the part fit perfectly. 6reat service. 9o complaints. 9ou will definitely get more of my business and 8 will gladly recommend you to friends!
9994	9	2amicia non conforme rispetto all'ordine  : sbagliate le cuciture dei bottoni.
9994	5	9his is the greatest thing, 9nce a month 9etflow sends a box of "goodies" for your pet . 2harlotte my 9hih9zu got her first one on 3ecember 22nd just in time for 2hristmas .  9he got an adorable cow toy,  some treats shes never tried before but seemed to like all of them. 0nd some teeth cleaner that doesn't involve brushing. 8 dont know whos more excited about next month's package 2harlotte or me...
9994	9	7onesty and sincerity are very important to have and 8 think that they have both. 8 would definitely recommend them.
9994	5	9ne 9roduct to big f

`paste`
====
`paste` takes any number of files and concatenates them horizontally, i.e., as tab-separated columns.

`paste FILE1 FILE2 FILE3`

Very useful to generate results files from bits and pieces of other files.
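
A minimal sketch with two invented files shows the line-by-line behavior:

```shell
scratch=$(mktemp -d)
cd "$scratch"

printf 'en\nfr\nde\n'                 > codes.txt
printf 'English\nFrench\nGerman\n'    > names.txt

# Line i of the output is line i of each input file, joined by a tab.
paste codes.txt names.txt > languages.tsv
cat languages.tsv
```

The files are matched purely by line number, so they should have the same number of lines in the same order.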

In [10]:
!echo "1\n2\n3\n4\n5\n6\n7\n8\n9\n10" > tmp/numbers
!paste tmp/numbers tmp/top10.reviews.tsv > tmp/numbered-top10.reviews.tsv

In [11]:
!cat tmp/numbered-top10.reviews.tsv

1	NONE	M	The order was filled quickly and correctly. No problems at all, highly recommended.
2	NONE	M	I found the parts I was looking for much cheaper than at major part dealer sites. The delivery was fast and the part fit perfectly. Great service. No complaints. You will definitely get more of my business and I will gladly recommend you to friends!
3	NONE	M	Camicia non conforme rispetto all'ordine  : sbagliate le cuciture dei bottoni.
4	NONE	F	This is the greatest thing, Once a month Petflow sends a box of "goodies" for your pet . Charlotte my ShihTzu got her first one on December 22nd just in time for Christmas .  She got an adorable cow toy,  some treats shes never tried before but seemed to like all of them. And some teeth cleaner that doesn't involve brushing. I dont know whos more excited about next month's package Charlotte or me...
5	NONE	M	Honesty and sincerity are very important to have and I think that they have both. I would definitely recommend them.
6	NONE	F	One Prod

`grep/egrep`
====

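These commands do basically the same thing, namely search for a specified string in a file and return the lines that contain it. `grep` treats the target as a basic regular expression, while `egrep` (equivalent to `grep -E`) allows extended regular expressions.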

`grep TARGET FILE`

There are four important flags:

* `-i`: ignore case
* `-v`: return lines that ***don't*** contain the search string
* `-c`: return the number of lines that contain the search string rather than the lines themselves
* `-n`: prefix each returned line with its line number
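
The effect of the flags is easiest to see on a tiny file. The four lines below are made up for this sketch. The last command uses `grep -E`, which behaves like `egrep`.

```shell
scratch=$(mktemp -d)
cd "$scratch"

printf 'Great service\nNot great at all\nAll good\nterrible\n' > reviews.txt

grep "great" reviews.txt           # case-sensitive: only "Not great at all"
grep -i "great" reviews.txt        # also matches "Great service"
grep -v "great" reviews.txt        # the three lines without lowercase "great"
grep -c -i "all" reviews.txt       # number of matching lines: 2
grep -E "^(All|Not)" reviews.txt   # extended regex: lines starting with All or Not
```

Note that `-c` counts lines, so a line that contains the string twice still counts once.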

In [12]:
!grep "all" tmp/top10.reviews.tsv
!grep -c "all" tmp/top10.reviews.tsv
!grep -n "all" tmp/top10.reviews.tsv


NONE	M	The order was filled quickly and correctly. No problems at all, highly recommended.
NONE	M	Camicia non conforme rispetto all'ordine  : sbagliate le cuciture dei bottoni.
NONE	F	This is the greatest thing, Once a month Petflow sends a box of "goodies" for your pet . Charlotte my ShihTzu got her first one on December 22nd just in time for Christmas .  She got an adorable cow toy,  some treats shes never tried before but seemed to like all of them. And some teeth cleaner that doesn't involve brushing. I dont know whos more excited about next month's package Charlotte or me...
50	F	Used this site many times.  I always start mysearch here. Prices change daily and if you want to really research the price continually at many different sites, I have found cheaper cars elsewhere.  However, if you don't have a lot of time to research the price, this site has always been among the top three (e.g., cheapest) of the ten sites I use to reserve a car.
NONE	F	Very unsatisfied with product. It w

`sed`
====

`sed` stands for *Stream Editor* and is on the surface similar to `tr`, but more powerful. It can replace whole strings or regular expressions. One of the most common commands is

`sed 's/TARGET/REPLACEMENT/g' FILE`

It replaces all instances of the target in the file and writes the result to STDOUT. Useful to replace misspelled words.
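
The trailing `g` matters: without it, only the first match on each line is replaced. A small sketch with an invented file and a deliberately misspelled word:

```shell
scratch=$(mktemp -d)
cd "$scratch"

echo "I recieve the parcel and I recieve the bill" > note.txt

sed 's/recieve/receive/'  note.txt   # without g: only the first match on each line
sed 's/recieve/receive/g' note.txt   # with g: every match on the line

# The original file is unchanged; redirect STDOUT to keep the result.
sed 's/recieve/receive/g' note.txt > note.fixed.txt
cat note.fixed.txt
```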

In [13]:
!sed 's/ I/ ***YOU***/g' tmp/top10.reviews.tsv

NONE	M	The order was filled quickly and correctly. No problems at all, highly recommended.
NONE	M	I found the parts ***YOU*** was looking for much cheaper than at major part dealer sites. The delivery was fast and the part fit perfectly. Great service. No complaints. You will definitely get more of my business and ***YOU*** will gladly recommend you to friends!
NONE	M	Camicia non conforme rispetto all'ordine  : sbagliate le cuciture dei bottoni.
NONE	F	This is the greatest thing, Once a month Petflow sends a box of "goodies" for your pet . Charlotte my ShihTzu got her first one on December 22nd just in time for Christmas .  She got an adorable cow toy,  some treats shes never tried before but seemed to like all of them. And some teeth cleaner that doesn't involve brushing. ***YOU*** dont know whos more excited about next month's package Charlotte or me...
NONE	M	Honesty and sincerity are very important to have and ***YOU*** think that they have both. ***YOU*** would definitely reco

`awk`
====

`awk` is not just a program, but a small and simple programming language, although it's mostly used for one-liners. It's great to

1. find lines that fulfill a certain condition
2. print only certain columns
3. do arithmetic (summing, averaging, etc.) on counts

It can split lines on a given character (typically "\t" for tabs, ",", or, by default, any whitespace), and indexes each resulting column by a variable from `$1` to `$NF` (the ***N***umber of ***F***ields, i.e., the last column). You can get the number of columns by printing `NF` (without the `$`). Similarly, `NR` gives you the number of rows read so far, i.e., the current line number, or the total number of lines in an `END` block.

There is not enough space here to do `awk` justice in all its glory, but consider using these gems:

* `awk -F "\t" '{if ($1=="STRING") print}' FILE` splits a file on tabs, and prints the line if the first column has a certain value.

* `awk -F "," '{print $1,$4,$8}' FILE` splits a file on commas (hello, CSVs...) and prints the first, fourth, and eighth column.

* `awk '{sum+=$1} END {print sum/NR}' FILE` splits a file on whitespace and prints the average of the first column. Nice if you have a lot of results in one file...
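
The sketch below applies these three patterns to a toy tab-separated file whose rows are invented for the illustration. It uses the short form `CONDITION {ACTION}` instead of an explicit `if`. A condition without an action prints the matching line. Because some rows have `NONE` in the first column, the average is divided by a separate counter `n` instead of `NR`.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Three tab-separated columns: age, gender, text (toy rows, not real data).
printf '20\tM\tgood\n50\tF\tfine\nNONE\tM\tok\n35\tF\tgreat\n' > toy.tsv

# Print the lines whose second column is F.
awk -F "\t" '$2 == "F"' toy.tsv

# Print only the first and third column.
awk -F "\t" '{print $1, $3}' toy.tsv

# Average the first column over the rows that actually have a number there.
awk -F "\t" '$1 != "NONE" {sum += $1; n++} END {print sum / n}' toy.tsv
```

Dividing by `NR` here would count the `NONE` row as well and pull the average down.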


In [14]:
!awk -F "\t" '{print $3}' tmp/top10.reviews.tsv

The order was filled quickly and correctly. No problems at all, highly recommended.
I found the parts I was looking for much cheaper than at major part dealer sites. The delivery was fast and the part fit perfectly. Great service. No complaints. You will definitely get more of my business and I will gladly recommend you to friends!
Camicia non conforme rispetto all'ordine  : sbagliate le cuciture dei bottoni.
This is the greatest thing, Once a month Petflow sends a box of "goodies" for your pet . Charlotte my ShihTzu got her first one on December 22nd just in time for Christmas .  She got an adorable cow toy,  some treats shes never tried before but seemed to like all of them. And some teeth cleaner that doesn't involve brushing. I dont know whos more excited about next month's package Charlotte or me...
Honesty and sincerity are very important to have and I think that they have both. I would definitely recommend them.
One Product to big for being a med. and the color was not what

`for`-loops
====

Just as in Python, you can execute `for`-loops in the shell. This is extremely handy if you want to iterate over a bunch of files that have the same name with different extensions, or are in different directories.
The syntax is

```
for VAR in ITEM1 ITEM2 ITEM3
do
    SOME_COMMAND
done
```

The items to iterate over can either be strings or a range of integers. In the latter case, you can specify a range like so:

`for x in {START..FINISH}`
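
A loop becomes really useful when you combine it with a wildcard and run the same command on every matching file. The sketch below first creates one toy file per language code and then lowercases each of them into a new file. All file names are assumptions made for the example.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# One small input file per language code (toy content).
for lang in en fr de; do
    echo "Some Text In $lang" > "reviews.$lang.txt"
done

# Run the same command on every file and derive the output name from the input.
for f in reviews.*.txt; do
    tr 'A-Z' 'a-z' < "$f" > "$f.lower"
done

cat reviews.fr.txt.lower    # some text in fr
```

Putting the variable in double quotes (`"$f"`) keeps the loop working even if a file name contains spaces.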



In [15]:
!for x in en fr de dk es; do echo $x; done
!echo
!for x in {1..10}; \
do echo $x; \
done

en
fr
de
dk
es

1
2
3
4
5
6
7
8
9
10


Reading
====

* http://web.stanford.edu/class/cs124/kwc-unix-for-poets.pdf