### Drill on UNIX tools ###

>**NOTE: the following Unix commands are meant to be run in a terminal window, not the notebook!** If you had problems spinning up your computer in the Amazon cloud, and you have a Mac, you can issue these commands in a new Terminal window instead. 

As we did in the last class, we will work on [this file](https://github.com/computationaljournalism/columbia2019/raw/master/data/columbia.txt), which is a log file from the columbia.edu web server. To download the file from GitHub to our EC2 instance, we can use a command-line tool called [wget](https://www.gnu.org/software/wget/). Run the following command on the EC2 instance we just created. (If you are using your Terminal window on a Mac, download it from GitHub and put it in your "home directory.")

In [None]:
wget https://github.com/computationaljournalism/columbia2019/raw/master/data/columbia.txt

You should see some output that looks like this:

```
2019-03-13 14:30:03 (47.0 MB/s) - ‘columbia.txt’ saved [1048576/1048576]
```

We've downloaded a file from our course web server. It's called `columbia.txt` and it's now on our cloud computer. We are going to examine it using some simple UNIX commands. It is worthwhile reviewing notebook number 13. Recall, for example, that we "list" the files on our cloud computer via the command `ls`.

In [None]:
ls -l 

To understand what this all means, let's look at the first line (for me).

<pre>
-rw-rw-r--   1 ec2-user ec2-user 1048576 Mar 13 14:33 columbia.txt
</pre>

Recall that the command **head** does what you might expect given our exposure to Pandas. It prints out the first 10 lines of a file, the name of which you pass as an argument. Here we look at the first 10 lines of `columbia.txt`. How do you get the last 10?

In [None]:
head columbia.txt

**Web access logs (as a reminder)**

OK what kind of data do we have? This is the so-called [combined log format](https://httpd.apache.org/docs/1.3/logs.html) from an Apache web server. Whenever you browse a web site (in this case, [www.journalism.columbia.edu](http://www.journalism.columbia.edu)), there is a program responding to your requests. Want the home page? Want information about the Dual Degree? You request the HTML page and that request is recorded as a single line in the log file. Then, to render the page, your browser might need some CSS files or JavaScript files or just some simple images. The subsequent requests for these objects are also recorded, one line each, in the log file. 

So the log file grows with each user's visit. Requests are appended to the bottom of the file as they arrive, so the oldest entries are at the top of the file and the newest at the bottom. If many people are looking at the site at the same time, their requests are interleaved in the file, because it records requests in time order. 

Each line in the log file holds these values:

>IP address<br>
Identity<br>
Userid<br>
date<br>
Request<br>
Status<br>
Bytes<br>
Referrer<br>
Agent

Let's compare this information with the first line (oldest request) in our file. (Notice that these log lines are really long and so "wrap around" the cell and can look like two or more lines.)

In [None]:
head -1 columbia.txt

Recall that we can use the command `cut` to select specific items from the file. Here we pass options that include `-d` (a character to be used as a delimiter defining separate fields in the file) and `-f` (to specify which fields to cut from the file). 

Below we define individual fields as being separated by a blank space character and then ask for just the first field, the IP address.

In [None]:
cut -d" " -f1 columbia.txt 

In [None]:
cut -d" " -f10 columbia.txt 

Look at one of the log lines above and make sure you understand that the 10th field (as defined by spaces) is the number of bytes transferred. 
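
To check the field numbering without scrolling through the real file, here is a small sketch that runs `cut` on a single made-up log line. The line, including its IP address and URLs, is invented for illustration and only mimics the combined log format.

```shell
# A made-up log line in combined log format (not taken from columbia.txt).
line='203.0.113.5 - - [13/Mar/2019:14:30:03 -0400] "GET /index.html HTTP/1.1" 200 5120 "https://www.google.com/" "Mozilla/5.0"'

# Field 1, splitting on spaces: the IP address.
echo "$line" | cut -d" " -f1

# Field 10, splitting on spaces: the number of bytes transferred.
echo "$line" | cut -d" " -f10
```

Splitting on spaces breaks the date and the request into several fields each, which is why the bytes value lands in position 10 rather than 7.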

Below, use another delimiter to pull out the month the request was made.

In [None]:
# your code here



The options for the fields to keep include lists separated by commas and ranges defined by a hyphen. The next two commands keep fields 1 and 10, and then fields 1 through 3.

In [None]:
cut -d" " -f1,10 columbia.txt

In [None]:
cut -d" " -f1-3 columbia.txt

As its name suggests, the command `sort` will order the rows in our file. By default it uses alphabetical order, but the option `-n` lets you sort numerically instead. Below we `cut` out just the IPs, `sort` them, and then pass the output through a pipe to `head`.

In [None]:
cut -d" " -f1 columbia.txt | sort | head -100
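
The difference between alphabetical and numerical order is easiest to see on a few numbers. This is a small sketch using three values typed in by hand with `printf` (they are not from the log file).

```shell
# Alphabetical order compares character by character, so "100" sorts before "9".
printf '10\n9\n100\n' | sort

# Numerical order compares the values: 9, 10, 100.
printf '10\n9\n100\n' | sort -n
```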

Next, recall the command `uniq` will remove repeated adjacent lines in a file, so if your file is sorted, it will return just the unique rows. Piping things together...

In [None]:
cut -d" " -f1 columbia.txt | sort | uniq | head -100
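
Because `uniq` only compares neighboring lines, skipping the `sort` can leave duplicates behind. Here is a minimal sketch on four hand-typed lines (made up for illustration).

```shell
# Unsorted input: no two adjacent lines are equal, so uniq removes nothing.
printf 'b\na\nb\na\n' | uniq

# Sorted input: duplicates sit next to each other, so uniq leaves just a and b.
printf 'b\na\nb\na\n' | sort | uniq
```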

The command `uniq` has an option `-c` that prefixes each distinct row with the number of times it appeared. 

In [None]:
cut -d" " -f1 columbia.txt | sort | uniq -c | head -100

Finally, we can add a second sort to this pipeline to put the `uniq`'d output in reverse numerical order (using options `-r` and `-n`), giving us the most frequently seen IPs first.

In [None]:
cut -d" " -f1 columbia.txt | sort | uniq -c | sort -rn | head -25
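
The whole counting pipeline can be tried on a tiny list first. The sketch below uses six made-up IP addresses stored in a shell variable named `ips` (a name chosen here for illustration) in place of the `cut` step.

```shell
# Six made-up IP addresses, one per line, standing in for the output of cut.
ips='10.0.0.1
10.0.0.2
10.0.0.1
10.0.0.3
10.0.0.1
10.0.0.2'

# Group equal lines, count each group, then list the largest counts first.
echo "$ips" | sort | uniq -c | sort -rn
```

The first line of output is the address that appears most often, here `10.0.0.1` with a count of 3.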

We can also use a filtering command known as `egrep` to pull just the lines that match a regular expression pattern (in quotes). The word "grep" stands for "global regular expression print" and `egrep` is an extended version of the command, equivalent to `grep -E`. You will also often see people using just the command `grep`. 

So, if we want to pull out the rows of the file with IP address 207.46.13.69 we can use the following, where we put the regular expression in quotes.

In [None]:
egrep "207\.46\.13\.69" columbia.txt

And we could see what the pattern is using [regexper.com](https://regexper.com/#207%5C.46%5C.13%5C.69).
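
The backslashes in the pattern matter: an unescaped `.` matches any single character. The sketch below shows the difference on two hand-typed lines, the second of which is a made-up string that is not an IP address at all. It uses `grep -E`, which behaves the same as `egrep`.

```shell
# Unescaped dots match any character, so both lines are printed.
printf '207.46.13.69\n207x46y13z69\n' | grep -E "207.46.13.69"

# Escaped dots match only a literal period, so just the first line is printed.
printf '207.46.13.69\n207x46y13z69\n' | grep -E "207\.46\.13\.69"
```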

We could save these lines in a new file if we wanted to do more work. But for now, we see that they are all running "bingbot" which is the spider (scraper) for the Bing search engine. Let's see how many times "bingbot" is used.

In [None]:
egrep "bingbot" columbia.txt | wc

So 481 out of our 4000 or so requests were from Bing. 
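
With no options, `wc` prints three numbers: lines, words, and bytes. The request count is the first of them. A small sketch, assuming three made-up lines held in a variable named `sample` (not real log entries):

```shell
# Three made-up lines standing in for log entries; two mention "bingbot".
sample='GET /a bingbot
GET /b Mozilla
GET /c bingbot'

# wc alone prints the line, word, and byte counts of the matching lines.
echo "$sample" | grep -E "bingbot" | wc

# wc -l prints only the line count, which is the number of matching requests.
echo "$sample" | grep -E "bingbot" | wc -l
```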

The referrer field is number 11. It records the link someone clicked on to get to the page they're requesting.

In [None]:
cut -d" " -f11 columbia.txt | head -100

And here we look at just referrers that are Google.

In [None]:
cut -d" " -f11 columbia.txt | egrep "google" 

**Your turn**

1. Use `egrep`, `cut`, `sort` and `uniq` to make a list of the number of different Googles people used to access our website -- you'll see `google.com.ph` for the service in the Philippines or `google.com.tr` for Turkey. 

2. Use the basic UNIX tools to find a breakdown of the different Status codes. How many 404's were there?

3. In the old design of the web site, faculty profiles were listed under `/profile`, meaning for example, that a request for James Stewart's page would look like `GET /profile/66-james-stewart/10`. Use your UNIX tools to make a list of the different faculty (or at least people with `/profile` pages) and the number of times they were requested by a visitor. 

4. (Bonus) Do the same as in 3, but eliminate duplicate requests from the same IP address. 

We are now going to work with Jeb Bush's email release. As we mentioned in class, at some point, he released all of his emails to the public and planned to write a book from them. [The Washington Post requested these files as well and wrote about them](https://www.washingtonpost.com/politics/jeb-bush-e-mails-offer-a-look-at-the-republicans-hands-on-style-as-governor/2014/12/23/f61cf6ac-8ae7-11e4-a085-34e9b9f09a58_story.html?noredirect=on&utm_term=.299c676903d3). His email file from his first month in office (January 1999) is [loaded on our GitHub](https://github.com/computationaljournalism/columbia2019/raw/master/data/01Jan). Download it to your home directory or use `wget` to download it to your EC2 computer.

In [None]:
wget https://github.com/computationaljournalism/columbia2019/raw/master/data/01Jan

The data are in Mbox format. [You can read about it here.](https://en.wikipedia.org/wiki/Mbox) The important thing is that the file contains all the mail messages in reverse chronological order (newest first, oldest last). The format is pretty loose. You recognize you've moved from one email message to the next because the word `From` starts the line. If someone actually started a line in the body of an email with `From`, then the mailer would change it to `>From` to avoid confusion. 

So we have a file of emails, each email delineated with a line that begins with the word `From`. Here are the tops of the first two emails in January's file.

>From "brett.rayman@laspbs.state.fl.us" Mon Feb  1 02:20:16 1999
<br><br>
From "jeb@jeb.org" Mon Feb  1 03:12:59 1999
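
To see what the `>From` escaping does, here is a rough sketch that imitates it with `sed` on a made-up two-line message body. The variable name `body` and the text are invented for illustration, and this is only an imitation of the idea, not the mailer's actual code.

```shell
# A made-up message body whose second line happens to start with "From ".
body='Hi Jeb,
From what I hear, the meeting went well.'

# Put ">" in front of any line that begins with "From ", so that it
# can no longer be mistaken for the start of a new message.
echo "$body" | sed 's/^From />From /'
```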

5. Use `more` to have a scroll through the file to look at the different emails. Notice that the blocks of solid text with seemingly random numbers and letters represent encoded attachments. 

6. Count the number of email messages in this file. 

7. From whom did Bush receive the most emails? Second most?

8. Come up with a question on your own about this...

This is a pretty powerful pipeline!
<br><br>

    _     /)---(\          /~~~\
    \\   (/ . . \)        /  .. \
     \\__)-\(*)/         (_,\  |_)
      \_       (_         /   \@/    /^^^\
      (___/-(____) _     /      \   / . . \
                   \\   /  `    |   V\ Y /V
                    \\/  \   | _\    / - \
                     \   /__'|| \\_  |    \
                      \_____)|_).\_).||(__V