Unix, Mongo and a Computer in the Cloud ☁️
----------------------------------------

All of the data gathering and computation we've done up until now has been done on our own laptops. Going forward, there will be many times when we'll need to collect data 24x7. For this, we'll need a computer that runs 24x7 (i.e. not our laptop). We will also want to store some of our data in a database, which gives us a lot more power and flexibility over storing data in, say, CSV files. Before we get to setting up a computer in the cloud and using our first database, we'll talk a bit about the Unix operating system and the power of Unix commands (what we'll use to drive our cloud computer). This notebook may span a few classes, but here's a quick overview of what we'll cover:

1. Unix Introduction
2. Set up a computer in the cloud using Amazon Web Services (AWS)
3. Unix commands (how we drive our cloud computer)
4. twarc - a command-line tool for the Twitter API
5. Mongo database - where we can store our data in the cloud
6. How to talk to our Mongo database from the notebook

Unix 
---------------------------

Today we step back to a simpler time when we interacted with computers through a handful of typed commands. Specifically, we'll deal in some pretty old magic -- UNIX commands. At one level, you can think of these as tools to help you manipulate programmatically the basic **stuff of a computer**. We'll work with files and folders and running jobs. These tools come from a time when computers looked like this.

<img align=center src=http://history-computer.com/ModernComputer/Software/images/Dennis-Ritchie-Ken-Thompson-and-PDP11-UNIX-1972.jpg>

Pictured above are two of the developers of UNIX, Dennis Ritchie and Ken Thompson. To give their work some context, let's define terms. The start of this notebook is a little chatty, but it will also be one of the most practical we'll have all term.

      __
    o-''|\_____/)
     \_/|_)     )
        \  __  /
        (_/ (_/    

**An operating system** 

An operating system is a piece of software (code) that organizes and controls hardware and other software so your computer behaves in a flexible but predictable way.

Most devices that contain a computer of some kind will have an OS. Operating systems appear when the appliance will have to deal with new applications, complex user input and possibly changing requirements of its function. In addition to a laptop or desktop computer, your DVR, smartphone and even your automobile all have operating systems.

The computer you're using to run this notebook probably has a Windows, MacOS or Linux operating system. 

**Your computer**

Let's think about your computer a little more deeply -- it consists of several components. **The Central Processing Unit** (CPU) or microprocessor (a microchip) is a complete computational engine, capable of carrying out a number of basic commands (perform simple arithmetic calculations, store and retrieve information from its memory, and so on). 

The CPU itself has the capacity to store some amount of information, and when it needs more space, it moves data to another kind of chip known as **Random Access Memory** (RAM) — "random access" as opposed to, say, sequential access to memory locations. 

Your computer also has one or more storage devices that can be used to organize and store data — hard disks or drives store data magnetically, while solid state drives again use special chips. (A solid state drive is a larger, more sophisticated version of your traditional thumb drive.)

**Operating systems, again**

Your operating system, then, manages all of your computer’s resources, providing layers of abstraction for both you, the user, as well as developers who are writing new programs for your computer.

With the emergence of so-called **cloud computing**, we imagine a variety of computing resources "out there" on the web that we can execute -- think about the variety of APIs we've encountered that do various smart things for us or to our data. In this model, computations are performed elsewhere, and your own computer might function more as a "browser" receiving results -- Google’s Chrome operating system is "minimalist" in this sense. 

But let’s not get ahead of ourselves. The computers you are probably sitting at are running the Mac OS which is built on a Unix platform. Let’s spend some time talking about Unix.


                /)-_-(\        /)-_-(\
                 (o o)          (o o)
         .-----__/\o/            \o/\__-----.
        /  __      /              \      __  \
    \__/\ /  \_\ |/                \| /_/  \ /\__/
         \\     ||                  ||      \\
         //     ||                  ||      //
         |\     |\                  /|     /|
         
**UNIX history**

In 1964, Bell Labs (the research arm of AT&T) partnered with MIT and GE to create Multics (for Multiplexed Information and Computing Service). Here is the vision they had for computing:

>“Such systems must run continuously and reliably 7 days a week, 24 hours a day in a
way similar to telephone or power systems, and must be capable of meeting wide
service demands: from multiple man-machine interaction to the sequential processing
of absentee-user jobs; from the use of the system with dedicated languages and
subsystems to the programming of the system itself”

After Bell Labs pulled out of the Multics project in 1969, a group of researchers there started work on Unics (Uniplexed information and computing system), so named because initially it could only support one user. As the system matured, it was renamed UNIX, which isn’t an acronym for
anything. Ritchie simply says that UNIX is a "somewhat treacherous pun on Multics."

While this seems like quite a long time ago, consider how Dennis Ritchie described UNIX support for programming.

>Ritchie observes: “What we wanted to preserve was not just a
good environment in which to do programming, but a system
around which a fellowship could form. We knew from
experience that the essence of communal computing, as
supplied by remote-access, time-shared machines, is not just
to type programs into a terminal instead of a keypunch, but to
encourage close communication.” The theme of computers
being viewed not merely as logical devices but as the nuclei of
communities was in the air; 1969 was also the year the
ARPANET (the direct ancestor of today’s Internet) was
invented. The theme of “fellowship” would resonate all through
UNIX’s subsequent history.
<br><br>From ["The Art of Unix Programming"](http://www.catb.org/esr/writings/taoup/) by Raymond

In Multics, we find the first notion of a hierarchical file system -- software for
organizing and storing computer files and the data they contain. In UNIX, files are
arranged in a tree structure that allows separate users to have control of their own
areas. Think of a system of folders or directories -- one folder can contain files and other folders and so on. A tree! UNIX began (more or less) as a file system and then an interactive shell emerged to let you examine its contents and perform basic operations. And these are what we will focus on today.
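
To make the tree idea concrete, here is a small Python sketch (Python because that is what our notebooks use). It builds a throwaway folder tree in a temporary location and lists everything beneath it. The folder and file names (`home`, `alice`, `bob`, `todo.txt`) are made up for illustration.

```python
import tempfile
from pathlib import Path


def list_tree(root):
    """Return every file and folder under root as a sorted list of relative paths."""
    root = Path(root)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


# Build a tiny tree: folders can hold files and other folders.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "home" / "alice" / "notes").mkdir(parents=True)
    (base / "home" / "alice" / "todo.txt").write_text("learn UNIX\n")
    (base / "home" / "bob").mkdir()
    tree = list_tree(base)

for path in tree:
    print(path)
```

Each printed path spells out the route from the top of the tree down to one file or folder, which is exactly how UNIX names things.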

**The UNIX kernel and shell**

The **UNIX kernel** is the part of the operating system that provides other programs
access to the system’s resources (the computer’s CPU or central processing unit, its
memory and various I/O or input/output devices).

The **UNIX shell** is a command-line interface to the kernel — keep in mind that UNIX
was designed by computer scientists for computer scientists and the interface is not
optimized for novices. (The term "shell" is general in that a shell is the outermost
interface to the inner workings of the system it surrounds -- where have we seen this idea before?)

The UNIX shell is a type of program called an interpreter — in this case, think of it as a
text-based interface to the kernel. It operates in a simple loop: It accepts a command, interprets it, executes the command and waits for another. Very obedient. The shell displays a prompt to tell you that it is ready to accept a command. 
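
The accept, interpret, execute, repeat loop can be sketched in a few lines of Python. This is only a toy, not how a real shell is built: the "commands" here are hypothetical Python functions (`echo`, `add`) rather than programs run by the kernel.

```python
def toy_shell(lines, commands):
    """Process each typed line the way a shell loop would and collect the output."""
    outputs = []
    for line in lines:                      # accept a command
        if not line.strip():                # nothing typed: just show the prompt again
            continue
        name, *args = line.split()          # interpret: command name plus arguments
        handler = commands.get(name)
        if handler is None:
            outputs.append(f"{name}: command not found")
        else:
            outputs.append(handler(*args))  # execute, then loop back for the next one
    return outputs


# Hypothetical commands, written as plain Python functions.
commands = {
    "echo": lambda *words: " ".join(words),
    "add": lambda a, b: str(int(a) + int(b)),
}

session = toy_shell(["echo hello world", "add 2 3", "frobnicate"], commands)
for out in session:
    print(out)
```

The last line of the session shows what happens when the interpreter does not recognize a command name, something you will see often at a real prompt.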



On a Mac, you can open the Terminal application and be greeted with a happy UNIX prompt. On a Windows laptop, you can run a Unix shell but you'll need something like [WSL](https://docs.microsoft.com/en-us/windows/wsl/install-win10), [cygwin](https://www.cygwin.com/) or a virtual machine.

              /\___/\
              `)9 9('
              {_:Y:.}_
    ----------( )U-'( )----------

Since not everyone in class is on a Mac, we are going to set up our first cloud computer (using Amazon) where we can practice our Unix commands together. For those on a Mac, you can run Unix commands directly from the notebook using "cell magic" syntax.

One last comment. There are several versions of a UNIX shell. Why might we want different kinds of interfaces to our computer? Well, it turns out that some shells are good for interactive work (allowing you to hit the Tab key and have a command "autocomplete") while others have additional programming support to help you make "scripts" (think of the move from single commands to functions in Python). The **sh**, or [the Bourne Shell](https://en.wikipedia.org/wiki/Bourne_shell), is an old standby, whereas **bash**, or [the Bourne Again Shell](https://www.gnu.org/software/bash/), combines many characteristics of different shells together (bashing them together).



### Our Own Computer in the (Amazon) Cloud

Alright, we are now going to venture into the cloud. Our first task is to create a computer "out there". Something that isn't our laptop. It won't cost us anything as we will rent a **computer equivalent of a 70s sedan**. But the process is the same if we were renting a Porsche. We will use [Amazon's EC2 (Elastic Compute Cloud) service.](https://aws.amazon.com/ec2/) 

```
                .--~~,__
   :-....,-------`~~'._.'
    `-,,,  ,_      ;'~U'
     _,-' ,'`-__; '--.
    (_/'~~      ''''(;
```

If you don't already have an Amazon Web Services account, please [set that up](https://portal.aws.amazon.com/billing/signup) now (it's free!).

Once you have an account, head over to the [EC2 site](https://aws.amazon.com/ec2/). In the upper righthand corner, click on the yellow "Sign In to the Console" button and use your Amazon account. 

Signing in should land us on a page that looks like this.<br><br>

<img src=https://github.com/computationaljournalism/columbia2018/raw/master/images/screen.jpg style="width: 65%; border: #000000 1px outset;"/>
<br>

Select the "EC2" services which, in turn, will take us to a screen that should look a lot like this.<br><br>

<img src=https://github.com/computationaljournalism/columbia2018/raw/master/images/screen2.jpg 
style="width: 65%; border: #000000 1px outset;"/>
<br>

From here, we can "launch an instance", or, rather, start up a computer for our personal use. This computer will remain awake and operational until we decide to take it down. Click on the blue "Launch Instance" button in the middle of the page. 

1. Use the blue "Select" button to choose the first kind of computer you are offered on the page, **"Amazon Linux AMI 2018.03.0 (HVM), SSD Volume Type"**. Don't sweat all the lingo, just know that this is a Linux computer, and the operating system was configured by Amazon. 
2. After you select the type of computer, you will be asked for the "size" specifications. You will select from the "General purpose" family, a **t2.micro** computer. This baby is free! OK, you have 1 GB of RAM and about 8 GB of storage, and not all of your jobs will work with this free choice. *But*, for now, it gets the point across. With the "t2.micro" selected, scroll to the bottom of the page and click the blue **"Review and Launch"** button.
3. Scroll to the bottom of the page and select **"Launch"**. You will be immediately prompted to create a so-called key pair. From the menu, choose to create a new key pair and give it a name. This will cause a file (ending in .pem) to be downloaded to your computer. We'll do more with that in a second. For now, hit **"Launch Instance"** and away we go!
4. Finally, on the landing page, click on the link that looks like **The following instance launches have been initiated: i-097....743ca** to trot over and have a look at your new arrival.


### Our "70s Sedan"

This is the equivalent of the computer we just started:

<img src="http://chrisoncars.com/wp-content/uploads/2010/06/Ford_Fairmont_sedan_2.jpg">

### Let's Connect to Our Computer in the Cloud

The following instructions on how to connect to our cloud computer are for those on a Mac (or Linux). For those using Windows, you can use a program called PuTTY (instructions are [here](http://www.dorusomcutean.com/ssh-ec2-instance-windows-using-putty/)).

First, **open a new Terminal window.** This will be our  portal to the computer we just created. 

Back on the AWS EC2 site, click the button next to your new "instance" in the table. At the top of the EC2 console, you'll see a grey button asking you to **"Connect"**. It will pop up a small window that tells you what to do with your key file. The command `chmod` is a UNIX command you type into the terminal window that changes the "permissions" on the key file. I usually make a folder called Credentials and put the file in there. The 400 says that only you can look at this file and that the other users of your laptop (guests, say) can't see it. 
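
If the 400 looks mysterious, here is a small Python sketch of how those digits translate into permissions. Each digit covers one audience (owner, group, everybody else) and is a sum of 4 for read, 2 for write and 1 for execute. The helper name `describe_mode` is ours; it leans on the standard library's `stat.filemode`.

```python
import stat


def describe_mode(octal_digits):
    """Turn digits such as '400' into the rwx string shown for a regular file."""
    mode = stat.S_IFREG | int(octal_digits, 8)  # mark it as a regular file
    return stat.filemode(mode)


print(describe_mode("400"))  # -r--------
print(describe_mode("664"))  # -rw-rw-r--
```

So 400 leaves the owner with read access and everyone else with nothing, which is what `ssh` expects of a key file.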

If you are on a Mac: in your terminal let's type a few Unix commands to set the proper permissions on the credentials file:
<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`cd Downloads`
<br><br>
`cd` stands for "change directory" and takes us to whatever directory we specify. Here, make sure you use the name of the directory where you downloaded the .pem file.
<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`chmod 400 (yourkey.pem)`
<br><br>
Security! Then, use the `ssh` command they provide, again in the terminal window, with the right path to your key file. It should look something like:
<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`ssh -i (yourkey.pem) ec2-user@(your machine)`


**Copy this into your Terminal window.** The command `ssh` stands for secure shell and is your window to the new computer. To get there you have to provide your key (which is why you want to keep it safe) and the address of the machine. You should be greeted with something like this...<br><br>

```

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2017.09-release-notes/
6 package(s) needed for security, out of 8 available
Run "sudo yum update" to apply all updates.
[ec2-user@ip-172-30-0-108 ~]$
```
<br><br>
With the dollar sign being your very own UNIX prompt out in the cloud! Ha! Now, let's learn some Unix commands so we know how to drive this 70s sedan....

```
         __
        /  \
       / ..|\
      (_\  |_)
      /  \@'
     /     \
 _  /  `   |
\\/  \  | _\
 \   /_ || \\_
  \____)|_) \_)
  ```

To start off, let's download a file and give you a sense of what UNIX commands are capable of.

## NOTE: the following Unix commands are meant to be run in a terminal window, not the notebook!

We will work on [this file](https://github.com/computationaljournalism/columbia2019/raw/master/data/columbia.txt), which is a log file from the columbia.edu web server. To download the file from github to our EC2 instance, we can use a command-line tool called [wget](https://www.gnu.org/software/wget/). Run the following command on our EC2 instance we just created:

In [None]:
wget https://github.com/computationaljournalism/columbia2019/raw/master/data/columbia.txt

You should see some output that looks like this:

```
2019-03-13 14:30:03 (47.0 MB/s) - ‘columbia.txt’ saved [1048576/1048576]
```

We've downloaded a file from our course's GitHub repository. It's called columbia.txt and it's now on our cloud computer. We are going to examine it using some simple UNIX commands. 

A few for exploring your folders: **pwd, ls, cd**<br><br>Making and removing folders (directories): **mkdir, rmdir**<br><br>Copying, renaming and removing files: **cp, mv, rm**<br><br>


         |\_/|                  
         | @ @   Woof! 
         |   <>              _  
         |  _/\------____ ((| |))
         |               `--' |   
     ____|_       ___|   |___.' 
    /_/_____/____/_______|
    
<br><br>First, **pwd** or "print working directory" will tell you which folder you're in. For the notebook, this means the folder your data and notebook file are being stored in.

In [None]:
pwd

The command **ls** lists the contents of a folder. Compare the list below to what you see when you use the Finder to examine the same folder. 

In [None]:
ls

Unix commands can be modified by adding one or more **options**. In the case of
ls, we can add a "-l" for the "long" form of the output and a "-a" to list all entries, including the hidden files and directories whose names
begin with a “.” Another useful option is "-h" for human-readable sizes (used together with "-l"). For color, use "-G" on a Mac or "--color" on Linux
(color will only show up in the Terminal and not the notebook).

The long printout below tells us the size of each file and what we can do to it.

In [None]:
ls -l 

To understand what this all means, let's look at the first line (for me).

<pre>
-rw-rw-r--   1 ec2-user ec2-user 1048576 Mar 13 14:33 columbia.txt
</pre>

Let's split things up a bit. 

<pre>
 f1  f2   f3   f4  f5  f6        f7        f8       f9            f10
  -  rw-  rw-  r--  1  ec2-user  ec2-user  1048576  Mar 13 14:33  columbia.txt
</pre>

Let's walk through the different fields.

**f1**: - for File, d for Directory, l for "Link"

**f2**, **f3** and **f4**: These are "permissions" that mean you can read (r), write (w) and execute (x) a file, or not (-). They come in three different clusters specifying the permissions

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**f2**: The owner has over the file,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**f3**: The group has over the file, and<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**f4**: Everybody else has over the file

**f5**: This field specifies the number of hard links to the file. For a directory, the count grows with the number of directories inside it.

**f6**: This is the user who owns the file or directory.

**f7**: The group that the file belongs to. Any user in that group will have the permissions given in **f3** over that file.

**f8**: The size in bytes. You can change this by using the -h option (human-readable) together with -l, which reports sizes in K, M and G so they're a bit easier to read.

**f9**: The date of last modification

**f10**: The name of the file
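
As a check on our reading of the fields, here is a rough Python sketch that splits the sample line into named pieces. It assumes the exact layout shown above; real `ls -l` output varies between systems, so treat this as an illustration rather than a robust parser. The function name `parse_ls_line` and the dictionary keys are ours.

```python
def parse_ls_line(line):
    """Split one line of `ls -l` output (in the layout shown above) into named fields."""
    # Split on runs of whitespace, at most 8 times, so the name stays in one piece.
    perms, links, owner, group, size, month, day, time, name = line.split(None, 8)
    return {
        "type": perms[0],            # f1
        "owner_perms": perms[1:4],   # f2
        "group_perms": perms[4:7],   # f3
        "other_perms": perms[7:10],  # f4
        "links": int(links),         # f5
        "owner": owner,              # f6
        "group": group,              # f7
        "size_bytes": int(size),     # f8
        "modified": f"{month} {day} {time}",  # f9
        "name": name,                # f10
    }


entry = parse_ls_line("-rw-rw-r--   1 ec2-user ec2-user 1048576 Mar 13 14:33 columbia.txt")
for field, value in entry.items():
    print(field, "=", value)
```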

**Aside: Why a command line?**

While interacting with a computer by typing in commands might seem primitive, it has its advantages (there are reasons why it's hanging around).

**Agile** — It is designed to be very interactive, supporting exploratory
analysis; it is also close to the “filesystem” which means the tools are
close to the data you’re working with

**Scalable** — You are interacting with your computer by typing
commands and not through a graphical user interface (GUI) which
means your instructions can be combined into a file or script and
reused

**Extensible** — New tools are being developed for the command line
on a daily basis, being written in a variety of languages but all usable
in the same way as the original tools that appeared in the 1960s and
1970s

**Ubiquitous** — It is hard to find a computer system that you’ll
purchase (desktop or laptop) that lacks one, and if you soar into “the cloud” you will
likely encounter the various computers you find there through a
command line interface

       _=,_
    o_/6 /#\
    \__ |##/
     ='|--\
       /   #'-.
       \#|_   _'-. /
        |/ \_( # |" 
       C/ ,--___/

**Back to the drill**

The command **man** will provide you with help on any Unix command. You simply
supply the name of the command you are interested in as an argument. 

In [None]:
man ls

In the **man** command above, the string "ls" is passed as an **argument** that tells UNIX which "data" to work with. Here's another example of an argument. 

The command **head** does what you might expect given our exposure to Pandas. It prints out the first 10 lines of a file, the name of which you pass as an argument. Here we look at the first 10 lines of "columbia.txt". How do you get the last 10?

In [None]:
head columbia.txt

In [None]:
man head

A UNIX command  will often involve both arguments and options. Here we tell **head** to only print out the first three lines.

In [None]:
head -3 columbia.txt

And where there's a head, you'll also find a tail!

In [None]:
tail columbia.txt
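
Since we compared **head** to Pandas, here is a plain-Python sketch of both commands. The function names simply mirror the UNIX ones, and the 25 numbered lines are stand-ins for a real file.

```python
from collections import deque
from itertools import islice


def head(lines, n=10):
    """First n lines: stop reading as soon as we have enough."""
    return list(islice(lines, n))


def tail(lines, n=10):
    """Last n lines: read everything but only remember the most recent n."""
    return list(deque(lines, maxlen=n))


lines = [f"line {i}" for i in range(1, 26)]  # made-up stand-in for a file
print(head(lines, 3))
print(tail(lines, 2))
```

Both functions accept any iterable of lines, so an open file object would work in place of the list.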

**Web access logs**

OK what kind of data do we have? This is the so-called [combined log format](https://httpd.apache.org/docs/1.3/logs.html) from an Apache web server. Whenever you browse a web site (in this case, [www.journalism.columbia.edu](http://www.journalism.columbia.edu)), there is a program responding to your requests. Want the home page? Want information about the Dual Degree? You request the HTML page and that request is recorded as a single line in the log file. Then, to render the page, your browser might need some CSS files or JavaScript files or just some simple images. The subsequent requests for these objects are also recorded, one line each, in the log file. 

So the log file is growing with each user's visit. Requests are logged to the bottom of the file in time, so the oldest entries are at the top of the file and the newest at the bottom. If many people are looking at the site at the same time, their requests are interleaved in the file, as it records requests in time order. 

Each line in the log file holds these values:

>IP address<br>
Identity<br>
Userid<br>
date<br>
Request<br>
Status<br>
Bytes<br>
Referrer<br>
Agent

Let's compare this information with the first line (oldest request) in our file. (Notice that these log lines are really long and so "wrap around" the cell and can look like two or more lines.)

In [None]:
head -1 columbia.txt

So the visit was from someone using the address 128.59.40.117 at 17/Apr/2016:06:27:25. The request was for a file called "robots.txt" which describes where automated programs are allowed to scrape on the site. [The 200 means](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) that the transmission was completed and that 469 bytes were sent. The user agent is not a browser but a "crawler" which means an automated scraper that is sucking up our content, presumably because it's feeding a search engine.

Finally, a UNIX command can help us figure out who owns the IP address. The `whois` command does not come pre-installed on our Amazon t2.micro, but we can use it on our Mac, try an [online service](https://www.ultratools.com/tools/ipWhoisLookup) to do it for us, or even install the program on our instance (by typing `sudo yum install jwhois`).

In [None]:
whois 128.59.40.117
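
Stepping back to the structure of a log line, the nine values listed earlier can be pulled apart with one regular expression. This is a sketch under assumptions: the sample line below is made up (its time zone and user agent are invented), it is only modeled on the request described above, and the pattern handles only well-formed combined-format lines.

```python
import re

# One named group per value in the combined log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<identity>\S+) (?P<userid>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of the nine fields, or None if the line does not fit the pattern."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Made-up line for illustration; not copied from columbia.txt.
sample = ('128.59.40.117 - - [17/Apr/2016:06:27:25 -0400] '
          '"GET /robots.txt HTTP/1.1" 200 469 "-" "ExampleCrawler/1.0"')

fields = parse_log_line(sample)
for name, value in fields.items():
    print(name, "=", value)
```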

The newest entries in the file are obtained from the bottom of the file. The last few lines are displayed with **tail**.

In [None]:
tail -1 columbia.txt

This last request has a timestamp of "17/Apr/2016:09:12:53". That means we have captured about 3 hours worth of activity on our site. How many requests is that? The command **wc** tells us how many lines, words and characters are in a file.

In [None]:
wc columbia.txt
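
To see what those three numbers mean, here is a minimal Python sketch of the same counts. It follows the usual default behavior of **wc**: lines are counted by newline characters, words by whitespace, and the last number is bytes. The sample text is made up.

```python
def wc(text):
    """Return (lines, words, bytes) for a string, in the spirit of `wc`."""
    lines = text.count("\n")           # one per newline character
    words = len(text.split())          # runs of non-whitespace
    size = len(text.encode("utf-8"))   # bytes, not characters
    return lines, words, size


sample_text = "a b c\nd e\n"
print(wc(sample_text))  # (2, 5, 10)
```

For a log file with one request per line, the first number is the one we care about.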

So in our three hours we have 4,000 or so requests. What other questions might we ask of the data? We might want to know how many different IP addresses appear in the data set. Or maybe how many different status codes. 

We can use the command **cut** to select specific items from the file. Here we pass options that include "-d" (a character to be used as a delimiter defining separate fields in the file) and "-f" (to specify which fields to cut from the file). 

Below we define individual fields as being separated by a blank space character and then ask for just the first field, the IP address.

In [None]:
cut -d" " -f1 columbia.txt 

In [None]:
cut -d" " -f10 columbia.txt 

Look at one of the log lines above and make sure you understand that the 10th field (as defined by spaces) is the number of bytes transferred. 
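
If counting fields by eye feels error-prone, here is a rough Python stand-in for **cut** that you can experiment with. The two log lines are invented for illustration (they are not from columbia.txt), and the function covers only the "-d" and "-f" behavior we have used so far.

```python
def cut(lines, fields, delimiter=" "):
    """Yield the chosen 1-based fields of each line, mimicking `cut -d -f`."""
    wanted = sorted(fields)  # cut prints fields in the order they appear in the file
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        yield delimiter.join(parts[i - 1] for i in wanted if i <= len(parts))


# Made-up lines in the combined log format.
sample = [
    '10.0.0.1 - - [17/Apr/2016:06:27:25 -0400] "GET /robots.txt HTTP/1.1" 200 469 "-" "ExampleCrawler/1.0"',
    '10.0.0.2 - - [17/Apr/2016:06:27:31 -0400] "GET /index.html HTTP/1.1" 200 5120 "-" "ExampleBrowser/2.0"',
]

ips = list(cut(sample, [1]))
ip_and_bytes = list(cut(sample, [1, 10]))
print(ips)
print(ip_and_bytes)
```

Notice that the date takes up two space-separated fields (4 and 5) and the request takes three (6 through 8), which is why the byte count lands in field 10.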

Below, use another delimiter to pull out the month the request was made.

In [None]:
# your code here



The options for the fields to keep include lists separated by commas and ranges defined by a hyphen. The next two are fields 1 and 10 and then fields 1 through 3.

In [None]:
cut -d" " -f1,10 columbia.txt

In [None]:
cut -d" " -f1-3 columbia.txt

At this point, we're getting tired of seeing 4000 lines of output scroll by. We can catch the output and "pipe" it into the command that restricts us to 10 lines, **head**. The vertical bar "|" is known as a pipe and it takes the output of one command (cut, below) and pipes it as input to the next command (head, below). 

The net result is printing just 10 lines of fields 1 and 10.

In [None]:
cut -d" " -f1,10 columbia.txt | head
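
Python generators give a feel for how a pipe behaves: each stage hands lines to the next one at a time, and the final stage decides when enough is enough. In this sketch the "file" is an endless stream of made-up lines (`fake_log` is hypothetical), yet the pipeline still finishes because `head` stops asking after three.

```python
from itertools import islice


def fake_log():
    """Stand-in for a very large file: an endless stream of made-up lines."""
    n = 0
    while True:
        n += 1
        yield f"10.0.0.{n} - - request-{n}"


def first_field(lines):
    """Like cut -d" " -f1: pass along only the first field of each line."""
    for line in lines:
        yield line.split(" ")[0]


def head(lines, n=10):
    """Like head: take n lines and then stop pulling from upstream."""
    return islice(lines, n)


# Roughly: fake_log | cut -d" " -f1 | head -3
first_three = list(head(first_field(fake_log()), 3))
print(first_three)
```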

As its name suggests, the command **sort** will order the rows in our file. By default it
uses alphabetical order but the option "-n" lets you sort numerically instead. Below we **cut** out just the IPs and then "redirect the output" to a file called "ips.txt". We then sort the IP addresses and put the sorted result in a file called "ips_sorted.txt". 

In [None]:
cut -d" " -f1 columbia.txt > ips.txt
sort ips.txt > ips_sorted.txt
head -100 ips_sorted.txt
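
The difference between alphabetical and numerical order is easy to miss, so here is a tiny Python illustration with made-up values. Sorting strings compares them character by character, which is why "100" lands before "25".

```python
values = ["10", "9", "100", "25"]

alphabetical = sorted(values)         # like plain `sort`
numerical = sorted(values, key=int)   # like `sort -n`

print(alphabetical)  # ['10', '100', '25', '9']
print(numerical)     # ['9', '10', '25', '100']
```

IP addresses are text, so the default alphabetical order is what we get for them; that is fine for our purposes because all we need is for identical addresses to end up next to each other.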

With UNIX pipes, we can avoid the extra files and just get the sorted data directly.

In [None]:
cut -d" " -f1 columbia.txt | sort | head -100

(You will end the display with a red box saying that the **head** command only allowed 100 lines to be printed and not all of the output from **sort**. It's OK.)

Next, the command **uniq** will remove repeated adjacent lines in a file, so if your file is sorted, it will return just the unique rows.

In [None]:
uniq ips_sorted.txt 

Or in one line...

In [None]:
cut -d" " -f1 columbia.txt | sort | uniq | head -100
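
Why does **uniq** need sorted input? Because it only compares each line with the one just before it. A short Python sketch using `itertools.groupby`, which groups adjacent equal items in the same way, makes the point. The addresses are made up.

```python
from itertools import groupby


def uniq(lines):
    """Collapse runs of identical adjacent lines, like `uniq`."""
    return [key for key, _ in groupby(lines)]


ips = ["10.0.0.2", "10.0.0.1", "10.0.0.2", "10.0.0.2"]

print(uniq(ips))          # repeats survive when they are not adjacent
print(uniq(sorted(ips)))  # sorting first brings the repeats together
```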

The command **uniq** has an option "-c" that returns the counts of each row in the file. If we apply it to "ips_sorted.txt", we'll get two columns -- one is how many requests were made in our 3 hour window by the IP address and the second is the IP address.

In [None]:
uniq -c ips_sorted.txt

Or, preferably, in one line...

In [None]:
cut -d" " -f1 columbia.txt | sort | uniq -c | head -100
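
For comparison, the same tally can be sketched in Python with `collections.Counter`, which counts items without needing them sorted first. The addresses below are invented for illustration.

```python
from collections import Counter

# Made-up addresses standing in for the first field of each log line.
ips = ["10.0.0.2", "10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.2", "10.0.0.3"]

counts = Counter(ips)

# Listing the counts in address order resembles the output of `sort | uniq -c`.
by_address = sorted(counts.items())
print(by_address)

# Counter can also rank by frequency, most common first.
top_two = counts.most_common(2)
print(top_two)
```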

Finally, we can add a second sort to this pipeline to sort in reverse numerical order (using options -r and -n) the **uniq**'d file, giving us the most frequently seen IPs first.

In [None]:
cut -d" " -f1 columbia.txt | sort | uniq -c | sort -rn | head -25

So 207.46.13.69 was seen 139 times. What is this address?

In [None]:
whois 207.46.13.69

It's owned by Microsoft. We can use a filtering command known as **egrep** to pull just the lines that match a regular expression pattern (in quotes). So we might do the following.

In [None]:
egrep "207\.46\.13\.69" columbia.txt

And we could see what the pattern is using [regexper.com](https://regexper.com/#207%5C.46%5C.13%5C.69).
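
Those backslashes matter. In a regular expression an unescaped dot matches any character, so leaving them out makes the pattern looser than intended. Python's `re` module treats the dot the same way, which lets us check with a few made-up strings:

```python
import re

escaped = re.compile(r"207\.46\.13\.69")   # dots must be literal dots
unescaped = re.compile(r"207.46.13.69")    # each dot matches any character

candidates = ["207.46.13.69", "207x46x13x69", "207.46.13.70"]

escaped_hits = [c for c in candidates if escaped.search(c)]
unescaped_hits = [c for c in candidates if unescaped.search(c)]

print(escaped_hits)
print(unescaped_hits)
```

The unescaped pattern also accepts the string with the letter x where the dots should be.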

We could save these lines in a new file if we wanted to do more work. But for now, we see that they are all running "bingbot" which is the spider (scraper) for the Bing search engine. Let's see how many times "bingbot" is used.

In [None]:
egrep "bingbot" columbia.txt | wc

So 481 out of our 4000 or so requests were from Bing. 

The referrer field is number 11. It records the link someone clicked on to get to the page they're requesting.

In [None]:
cut -d" " -f11 columbia.txt | head -100

And here we look at just referrers that are Google.

In [None]:
cut -d" " -f11 columbia.txt | egrep "google" 

Finally, we can clean up a little. Let's remove our two files of IP addresses. First, find them...

In [None]:
ls -l ips*.txt

And then use **rm** to remove them, with a follow up **ls** to make sure they're gone.

In [None]:
rm ips.txt
rm ips_sorted.txt
ls -l

**Your turn**

Come up with three questions about the visitors to the site and answer them using simple UNIX commands. Recall we've seen 

>**pwd, ls, rm, mv, cp<br><br> head, tail, wc, <br><br> cut, sort, uniq,<br><br> grep**

This is a pretty powerful pipeline!
<br><br>

    _     /)---(\          /~~~\
    \\   (/ . . \)        /  .. \
     \\__)-\(*)/         (_,\  |_)
      \_       (_         /   \@/    /^^^\
      (___/-(____) _     /      \   / . . \
                   \\   /  `    |   V\ Y /V
                    \\/  \   | _\    / - \
                     \   /__'|| \\_  |    \
                      \_____)|_).\_).||(__V
                      

In [None]:
# your code here




**Twitter in the Cloud**

To monitor Twitter remotely, we can use an application called twarc. It is a command line tool for archiving tweets. It handles all the rate limits and lets you worry about what you're going to do with the data once it's puddled up. First, using your terminal window that is logged into the EC2 t2.micro computer, install twarc. It's a Python application so...

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`sudo pip install twarc`

Ah, `sudo`. That's a command that basically promotes you to administrator while you install twarc. Think about your laptop. When you install software, you have to type in a password because you need to have super powers to put files on certain parts of the computer. Your guests, for example, probably don't have this ability. 

Once you have installed twarc, you should configure it with your keys. Have them ready from Twitter ([go to apps.twitter.com](https://apps.twitter.com/)) and type

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`twarc configure`

OK, that done, we can now start monitoring Twitter! Here's a simple example.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`twarc timeline realDonaldTrump`

OK that had a lot of stuff streaming by. Essentially, you received all of realDonaldTrump's tweets, up to the rate limit. We didn't ask to do anything with them so they just printed out. twarc has a lot of great features that let you do things like follow people and watch their tweets in real time. According to [https://tweeterid.com/](https://tweeterid.com/), realDonaldTrump has a Twitter id of 25073877. Here's how we follow this account.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`twarc filter --follow  25073877`

And now the printout is slower, but it is meant to be printing both realDonaldTrump's tweets and retweets. You end this parade by pressing Ctrl-C to kill the twarc job.

twarc is quite useful. Here is [detailed documentation](https://github.com/DocNow/twarc). I'd advise using it where you can.

**Storage: Moving files back and forth into the cloud**

Now, rather than have the data stream by, we could capture it in a file. Recall our "redirect" that dumps output into a file. Here we store realDonaldTrump's timeline in a file called `trump.json`. 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`twarc timeline realDonaldTrump > trump.json`

To make use of it, let's copy it from our cloud computer back to our desktop. So type 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`exit`

and you should end up with a prompt that looks more like your laptop where you started. Now, hit the "up arrow" key on your keyboard while you are in the terminal window. This will recall your last command. You can then alter it to the following (keeping `yourkey` and `yourmachine` as they are in your terminal window). 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`scp -i (yourkey.pem) ec2-user@(your machine):trump.json .`

This command is "secure copy" -- it uses your credentials or key to move the file `trump.json` from your amazon computer to your laptop. If you want to copy some other file `abc.txt` to the cloud machine you would do this.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`scp -i (yourkey.pem) abc.txt ec2-user@(your machine):`

So the syntax is 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`scp -i (yourkey.pem) from_file to_file`

... make sure you see this by comparing the two lines above. Now, we can read that file in (maybe you have to put `trump.json` into the folder where your notebook is located).

In [None]:
from json import loads

# read in the tweets as strings from the file - one line per tweet
# (the "with" block closes the file for us when we are done)
with open('trump.json') as f:
    tweetstrings = f.readlines()

# for the first 10 strings, load them into python objects (dictionaries)
# and print out the text of the tweet

for t in tweetstrings[:10]:
    tweet = loads(t)
    print(tweet["full_text"])
    print("-------------")
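
The file twarc writes is line-delimited JSON: one complete tweet object per line. As a minimal sketch of that idea, here is the same kind of parsing loop run on a few made-up lines rather than a real file. The sample tweets and the `parse_lines` helper are invented for illustration; the only extra behavior is that blank lines are skipped.

```python
import json

def parse_lines(lines):
    """Turn an iterable of JSON strings (one object per line) into a list of dicts."""
    parsed = []
    for line in lines:
        line = line.strip()
        if not line:          # ignore blank lines, e.g. a stray newline
            continue
        parsed.append(json.loads(line))
    return parsed

# Invented lines standing in for the contents of trump.json
sample = [
    '{"id": 1, "full_text": "first made-up tweet", "retweet_count": 10}\n',
    '\n',
    '{"id": 2, "full_text": "second made-up tweet", "retweet_count": 25}\n',
]

sample_tweets = parse_lines(sample)
for tweet in sample_tweets:
    print(tweet["retweet_count"], tweet["full_text"])
```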

```
            ,/A\,
          .//`_`\\,
        ,//`____-`\\,
      ,//`[_ROVER_]`\\,
    ,//`=  ==  __-  _`\\,
   //|__=  __- == _  __|\\
   ` |  __ .-----.  _  | `
     | - _/       \-   |
     |__  | .-"-. | __=|
     |  _=|/)   (\|    |
     |-__ (/ a a \) -__|
jgs  |___ /`\_Y_/`\____|
          \)8===8(/
```

**Storage: Installing a Mongo database**

That's cool, but we can do way, way better. Let's go back to your computer in the Amazon cloud and install a database. We will use something called MongoDB (Mongo from hu*mongo*us). You can [read about the project here](https://www.mongodb.com/). It is an example of a new-ish breed of databases. They are called NoSQL (for non-SQL or "not only" SQL) and signal a break from the relational model (which, weirdly, we will come back to). According to the Mongo site, some examples of this new breed include

* **Document databases** pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
* **Graph stores** are used to store information about networks of data, such as social connections. Graph stores include Neo4J and Giraph.
* **Key-value stores** are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or 'key'), together with its value. Examples of key-value stores are Riak and Berkeley DB. Some key-value stores, such as Redis, allow each value to have a type, such as 'integer', which adds functionality.
* **Wide-column stores** such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.
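
To make the first bullet concrete, here is a rough sketch of a "document" written as a Python dictionary. The field names mimic a tweet but every value is invented; the point is only that one document can hold plain values, arrays and nested documents, and that it converts cleanly to and from JSON text.

```python
import json

# An invented document: plain values, an array, and a nested document
doc = {
    "full_text": "a made-up tweet",
    "retweet_count": 3,
    "hashtags": ["unix", "mongo"],                 # key-array pair
    "user": {"screen_name": "someone", "id": 42},  # nested document
}

as_json = json.dumps(doc)         # the JSON text form of the document
round_trip = json.loads(as_json)  # and back to a Python dictionary

print(as_json)
print(round_trip["user"]["screen_name"])
```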

Mongo is a document database, where the documents look like JSON objects (behind the scenes Mongo stores them in a binary form of JSON called BSON). This kind of flexibility is perfect for our Twitter data, as a tweet is just a JSON object. To install Mongo on our cloud machine, `ssh` over there and let's do the following. We have [taken these instructions mainly from here](https://github.com/SIB-Colombia/dataportal-explorer/wiki/How-to-install-node-and-mongodb-on-Amazon-EC2).

1. Secure shell over to your Amazon machine using the `ssh` command from above.
2. The Linux counterpart of `pip` (or one of them) is something called `yum`. It stands for "Yellowdog Updater, Modified". It grew out of the updater for Yellow Dog Linux, a Linux distribution that ran on PowerPC Macs. Our first commands make sure the packages `yum` already manages are current.
<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`sudo yum check-update`<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`sudo yum update`
<br><br> Then, `yum` needs a database of projects to look through and our next commands will be to update that list. Don't worry too much about this, it's just adding a "repo" to the places where `yum` looks  for code to install.  
<br>
`echo "[mongodb-org-4.0]
name=MongoDB Repository
baseurl=https://repo.mongodb.org/yum/amazon/2013.03/mongodb-org/4.0/x86_64/
gpgcheck=1
enabled=1
gpgkey=https://www.mongodb.org/static/pgp/server-4.0.asc" |
sudo tee -a /etc/yum.repos.d/mongodb-org-4.0.repo`
<br><br>
3. Next, install MongoDB. It's just like using `pip` except that we have to `sudo` for administrator powers and then use `yum` for a Linux app. 
<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`sudo yum install -y mongodb-org`
<br><br>
4. We are using the /var/lib/mongo folder to save our database data, a log file and so on. These defaults are fine.
<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`sudo mkdir /var/lib/mongo/data`
<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`sudo mkdir /var/lib/mongo/log`
<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`sudo mkdir /var/lib/mongo/journal`
<br><br>
5. Set the storage items (data, log, journal) to be owned by the user (mongod) and group (mongod) that MongoDB will be starting under:
<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`sudo chown mongod:mongod /var/lib/mongo/data`
<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`sudo chown mongod:mongod /var/lib/mongo/log`
<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`sudo chown mongod:mongod /var/lib/mongo/journal`
<br><br>
6. Set the MongoDB service to start at "boot" (if you ever have to reboot your machine) and activate Mongo!
<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`sudo chkconfig mongod on`
<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`sudo /etc/init.d/mongod start`
<br><br>
7. Have a look around! MongoDB has a shell (everything does!) that you can use to look at data, etc. There's not much to do now except maybe create a new database and a user who can read and write into the database.
<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`mongo`
<br><br>
Not much to do just yet. I mean you can ask for help, and maybe `show databases`. So let's add some data. But first, <br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`quit()` 
<br><br>
out of here. 

```
 __/ / \
|    |/\
|_--\   \              /-
     \   \-___________/ /
      \                :
      |                :
      |       ___ \    )
       \|  __/     \  )
        | /         \  \
        |l           ( l
        |l            ll
        |l            |l
       / l           / l
       --/           --
```

**Loading data into Mongo**

Now, let's store data. We'll take realDonaldTrump's timeline and dump it into a database. The instructions [are given here](https://gist.github.com/edsu/ac57715ac0a2fec3bc64).
<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`twarc timeline realDonaldTrump | mongoimport --db tweets --collection realDonaldTrump` 
<br><br>
This command uses twarc, asks for realDonaldTrump's timeline and then pipes the output (pipes!) into  a command `mongoimport` to bring the data into our database. The database is called `tweets` and the particular collection of documents is called `realDonaldTrump`. 

Think of this structure as having one database per project and then multiple collections of documents (JSON data) in each database.
<br><br>
To see what we've done, let's get into Mongo 
<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`mongo`
<br><br>
and then look at the databases
<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`show databases`
<br><br>
to see your `tweets` and then 
<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`use tweets`
<br><br>
to switch to that database. We can then see what collections are available in this database.
<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`show collections`
<br><br>
And we can see what's there. Maybe we count the tweets we've recorded in each collection.
<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`db.realDonaldTrump.count()`
<br>

The structure of this command in the Mongo shell is `db.collectionname.action`. We can do things like find all the tweets from `@realDonaldTrump` that were retweeted over 80,000 times:
<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`db.realDonaldTrump.find({retweet_count:{$gt:80000}})`

This gives us the entire tweet object for each tweet that was RT'd over 80,000 times. We can modify our `find` command to have it return only the retweet_count, and the text of the tweet (as opposed to the whole thing). Here is the code:
<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`db.realDonaldTrump.find({retweet_count:{$gt:80000}},{retweet_count:1,full_text:1})`

In general, the `find()` command searches for JSON documents and uses a dictionary syntax to describe them. Here we searched on the `retweet_count` field, looking for those well-retweeted tweets. \$gt (greater than) and \$lt (less than) are ways to specify ranges. The second argument of `find()` is a dictionary that tells Mongo which fields to return. The value 1 means keep.
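
To get a feel for what the two arguments of `find()` mean, here is a toy, in-memory imitation written in plain Python. It is only a sketch of the semantics for exact matches and `$gt`/`$lt` ranges, not how Mongo is implemented; the helper names (`matches`, `project`, `toy_find`) and the sample tweets are invented.

```python
def matches(doc, query):
    """Return True if doc satisfies a query of exact values or $gt/$lt ranges."""
    for field, condition in query.items():
        value = doc.get(field)
        if isinstance(condition, dict):
            if "$gt" in condition and not (value is not None and value > condition["$gt"]):
                return False
            if "$lt" in condition and not (value is not None and value < condition["$lt"]):
                return False
        elif value != condition:
            return False
    return True

def project(doc, fields):
    """Keep only the fields whose projection value is 1 (or True)."""
    return {name: doc[name] for name, keep in fields.items() if keep and name in doc}

def toy_find(collection, query, fields=None):
    """Yield matching documents, trimmed to the projection when one is given."""
    for doc in collection:
        if matches(doc, query):
            yield project(doc, fields) if fields else doc

toy_tweets = [
    {"full_text": "made-up tweet A", "retweet_count": 90000, "lang": "en"},
    {"full_text": "made-up tweet B", "retweet_count": 1200, "lang": "en"},
]

results = list(toy_find(toy_tweets,
                        {"retweet_count": {"$gt": 80000}},
                        {"retweet_count": 1, "full_text": 1}))
print(results)
```

The real database does the same kind of filtering and trimming on the server, so only the matching, slimmed-down documents travel back to you.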

The Mongo shell is really powerful. The Mongo site [has great documentation on `find()` and other commands](https://docs.mongodb.com/manual/reference/method/db.collection.find/). Now, we are often going to access a database from the comfort of some other computing environment. In this case, Python. 



1. Before we move on, let's set up a few users in our database to allow us to administer it (later) and access it remotely. First, we'll create our admin user. Please change the password here to something you can use later. Copy the following lines into the Mongo shell:
<pre>
use admin
db.createUser(
  {
    user: "admin",
    pwd: "SomeStrongPasswordHere4JustYou",
    roles: [ { role: "userAdminAnyDatabase", db: "admin" }, "readWriteAnyDatabase" ]
  }
)
</pre>

2. To prepare our database for remote access, we'll set up a user (with a password). In mongo you can copy the following:
<pre>
use tweets
db.createUser({
  user: 'journalist',
  pwd: 'secret',
  roles: [{ role: 'readWrite', db:'tweets'}]
})
</pre>
and then get out
<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`quit()`
<br><br>

3. Next, we want to open up Mongo to talk to the outside world. This means changing its configuration file. Here we change the address Mongo listens on, remove the comment from the `security` section and switch on authorization. We are using an old, old UNIX command called `sed`, the Stream Editor. Because the configuration file belongs to the administrator, each command needs `sudo`.
<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`sudo sed -i 's/bindIp.*/bindIp: 0.0.0.0/' /etc/mongod.conf`
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`sudo sed -i 's/^#security/security/' /etc/mongod.conf`
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`sudo sed -i "/^security/a  \ \ \ authorization: 'enabled'" /etc/mongod.conf`
<br><br>
4. Return to your EC2 Console and click on the instance in the upper pane of the console. Below you will see details about your computer; scroll down to "Security Groups". It should probably be "launch-wizard-1". Click on it and look at its security rules. Click on the "Inbound" tab. You see port 22 on the machine is open for `ssh` communication (that includes the secure shell and secure copy). Click "Edit" and then "Add Rule". You will want to select "Custom TCP Rule" (the default) and then Port 27017 and the access IP of 0.0.0.0/0, meaning every computer can connect. If there was just one IP address that needed your data, you could put it there instead. Hit "Save" and go back to your Terminal logged into the Amazon computer.
5. On the Amazon computer restart Mongo with its new user and new network aware self.
<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`sudo service mongod restart`
<br><br>
And there we are. Mongo is up and running and we can now talk to it. Let's! To do this in Python, we need to install PyMongo. Yay!
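
If the `sed` commands in step 3 felt like magic, here is a rough Python sketch of the same three edits using the `re` module. The starting text is an assumed excerpt of `/etc/mongod.conf` (the real file has more in it), and the two-space indent on the added line is also an assumption about what the edited file ends up looking like.

```python
import re

# Assumed excerpt of /etc/mongod.conf before editing
conf = """net:
  port: 27017
  bindIp: 127.0.0.1
#security:
"""

# sed 's/bindIp.*/bindIp: 0.0.0.0/'  -> listen on every network interface
conf = re.sub(r"bindIp.*", "bindIp: 0.0.0.0", conf)

# sed 's/^#security/security/'       -> uncomment the security section
conf = re.sub(r"^#security", "security", conf, flags=re.MULTILINE)

# sed "/^security/a ..."              -> append a line after the security line
# (indent width is an assumption)
conf = re.sub(r"^(security.*)$", r"\1\n  authorization: 'enabled'",
              conf, flags=re.MULTILINE)

print(conf)
```

Each `sed` call reads the file line by line, applies one pattern, and (because of `-i`) writes the result back in place; the Python version does the same thing to a string.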

```
            __
(\,--------'()'--o
 (_    ___    /~"
  (_)_)  (_)_)
  ```

**Python and Mongo - PyMongo, of course!**

PyMongo lets us access a database from the comfort of our notebook. There are a few tutorials online, but [the basic documentation is here.](http://api.mongodb.com/python/current/tutorial.html) For the most part, the structure and commands are similar to those in Mongo itself. The documents stored in a Mongo database can be nested structures and, as we have seen, we often do a little work to get them into regular table format. 

Before we go too far, let's install PyMongo.

In [None]:
%%sh

pip install pymongo

We are going to import the `MongoClient` class. It takes a specification for the location of a Mongo database and returns a client object. Working with the client object is a bit like typing into the `mongo` shell as we did above. Here we create the client and then access the "tweets" database.

**NOTE** you will need to put the IP address of your Amazon EC2 instance in the code below where we connect to our Mongo database. Look back at the EC2 console web page and find the `IPv4 Public IP` for our instance.

In [None]:
from pymongo import MongoClient

client = MongoClient("mongodb://journalist:secret@put_your_instance_IP_here:27017/tweets")
type(client)
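
The string we hand to `MongoClient` packs five things into one line: user, password, host, port and database. Here is a small sketch that assembles it from its parts. The `mongo_uri` helper is made up for this notebook, and the password and IP address are placeholders. Escaping the credentials with `quote_plus` is, as far as I know, what the PyMongo documentation suggests when a password contains characters like `@` or `/`.

```python
from urllib.parse import quote_plus

def mongo_uri(user, password, host, database, port=27017):
    """Assemble a mongodb:// connection string, escaping the credentials."""
    return "mongodb://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, database
    )

# 203.0.113.10 is a documentation-only placeholder address, not a real instance
uri = mongo_uri("journalist", "p@ss/word", "203.0.113.10", "tweets")
print(uri)
```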

In [None]:
db = client.tweets

The expression above matches the Mongo shell expression. We can also use something a bit more Python inspired and ask for the "tweets" database using subscript (square bracket) notation. The expression above is equivalent to the one below. Both return a link to a database.

In [None]:
db = client["tweets"]
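
How can one object answer to both `client.tweets` and `client["tweets"]`? Here is a toy class that shows the general Python trick. It is only an illustration of the idea, not PyMongo's actual code, and `ToyClient` is an invented name.

```python
class ToyClient:
    """A stand-in for a client that hands out 'databases' by name."""

    def __init__(self):
        self._databases = {}

    def __getitem__(self, name):
        # toy["tweets"]: create the entry on first use, then reuse it
        return self._databases.setdefault(name, {"name": name})

    def __getattr__(self, name):
        # toy.tweets: Python only calls this when normal attribute lookup fails
        if name.startswith("_"):
            raise AttributeError(name)
        return self[name]

toy = ToyClient()
print(toy.tweets is toy["tweets"])
```

The square-bracket form is the one to reach for when a name is stored in a variable or is not a valid Python identifier.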

From here, we can ask for the names of the collections we have loaded with `list_collection_names()` (older versions of PyMongo called this `collection_names()`). Remember that a document database consists of different collections of documents. In our case a document is a tweet and our one collection holds @realDonaldTrump's tweets.

In [None]:
db.list_collection_names()

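Recalling our simple Mongo commands, here is the number of documents (tweets) in the collection. In PyMongo we count with `count_documents()`, which takes a query; the empty query `{}` matches everything.

In [None]:
db["realDonaldTrump"].count_documents({})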

And here we create a variable `trump` that represents the `realDonaldTrump` collection of tweets. This keeps us from continually typing out the full expression. From here we can look at one  tweet...

In [None]:
trump = db["realDonaldTrump"]
trump.find_one()

... or maybe iterate through several that match a search criterion. That's one of the reasons to have a database in the first place -- we can make searching very fast. In the expression below, we form a search for all the tweets where the language is "undefined". A search is literally expressed as another document. Here's our query.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`{"lang": "und"}`

Let's see how many tweets match this criterion.

In [None]:
trump.count_documents({"lang": "und"})

Here we use a regular expression to find a pattern in the source and not just a literal match. Here we find how many times he tweeted from his iPhone. For this, we make use of an operator to specify the documents of interest.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`{"source": {"$regex":"iphone"}}`

Let's see how many there are...

In [None]:
trump.count_documents({"source": {"$regex":"iphone"}})
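
A `$regex` query behaves much like Python's `re.search`: the pattern can match anywhere in the field and, by default, upper and lower case are different. The sketch below tries that out locally on some invented `source` strings (real ones may look different). In Mongo, the case-insensitive version is written by adding `"$options": "i"` next to `"$regex"`.

```python
import re

# Invented examples of what a tweet's "source" field might contain
sources = [
    '<a href="http://twitter.com/download/iphone">Twitter for iPhone</a>',
    '<a href="http://twitter.com/download/android">Twitter for Android</a>',
    '<a href="https://example.com">Some Other App for iPhone</a>',
]

# Unanchored and case-sensitive, like {"$regex": "iphone"}
case_sensitive = [s for s in sources if re.search("iphone", s)]

# Ignoring case, like {"$regex": "iphone", "$options": "i"}
ignore_case = [s for s in sources if re.search("iphone", s, re.IGNORECASE)]

print(len(case_sensitive), len(ignore_case))
```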

In the expression below, we form a search for all the tweets with a `retweet_count` larger than 20000. There are special operators \$gt and \$lt for "greater than" and "less than".

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`{"retweet_count":{"$gt":20000}}`

And count...

In [None]:
trump.count_documents({"retweet_count":{"$gt":20000}})

```
    ___
 __/_  `.  .-"""-.
 \_,` | \-'  /   )`-')
  "") `"`    \  ((`"`
 ___Y  ,    .'7 /|
(_,___/...-` (_/_/ sk
```

Rather than just counting, we can iterate through the set to display our results. Here we search for tweets with a retweet count over 20,000 and we only keep the fields `full_text`, `retweet_count` and `user.screen_name` (the "." is how we index into an embedded document, `screen_name` being a key to the `user` dictionary of the tweet). The second dictionary assigns a value of `True` to each key you want to keep. (You will also see 1 and 0 instead -- remember `True` reduces to 1 and `False` to 0.) Listing the fields to keep drops everything else, with one exception: Mongo's `_id` comes along unless you explicitly set it to `False`. This is called **a projection**. The first two arguments to `find()` are `filter=` and `projection=`.

Here we are leaving out Mongo's `_id` variable as it's an internal Mongo index.

In [None]:
for tweet in trump.find({"retweet_count":{"$gt":20000}},{"full_text":True,"retweet_count":True,"user.screen_name":True,"_id":False}):
    print(tweet)
    print("-------------")
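
The dotted name `user.screen_name` is just a path through nested dictionaries. Here is a small sketch of following such a path in plain Python; `get_path` is a made-up helper and the tweet is invented.

```python
def get_path(doc, dotted):
    """Follow a dotted key such as 'user.screen_name' into nested dictionaries."""
    current = doc
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None   # the path does not exist in this document
        current = current[key]
    return current

nested_tweet = {
    "full_text": "a made-up tweet",
    "retweet_count": 21000,
    "user": {"screen_name": "someone", "followers_count": 7},
}

print(get_path(nested_tweet, "user.screen_name"))
print(get_path(nested_tweet, "user.location"))
```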

The database can do quite a bit of work for you before you request any data. You can, for example, look at all the tweets from realDonaldTrump, ordered by retweet count, with the largest retweet count first.

In [None]:
from pymongo import ASCENDING, DESCENDING

for tweet in trump.find().sort("retweet_count",DESCENDING).limit(10):
    print(tweet["retweet_count"],tweet["full_text"])
    print("-------------")

We can filter retweets out using a regular expression (this one uses a negative lookahead for RT) and then sort the results. The point is that all this work is done in the database and we blissfully pull over just the data we want.

In [None]:
for tweet in trump.find({"full_text":{"$regex":"^(?!RT)"}}).sort("retweet_count",DESCENDING).limit(10):
    print(tweet["retweet_count"],tweet["full_text"])
    print("-------------")
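
To see what the filter, sort and limit are doing, here is the same recipe imitated locally on three invented tweets. Mongo does this work on the server; this sketch only mirrors the logic with Python's `re` and `sorted`.

```python
import re

local_tweets = [
    {"full_text": "RT @someone: a retweeted made-up tweet", "retweet_count": 500},
    {"full_text": "an original made-up tweet", "retweet_count": 300},
    {"full_text": "another original made-up tweet", "retweet_count": 900},
]

# Negative lookahead: matches at the start only if the text does not begin with RT
not_a_retweet = re.compile(r"^(?!RT)")

originals = [t for t in local_tweets if not_a_retweet.search(t["full_text"])]

# Like .sort("retweet_count", DESCENDING).limit(2)
top = sorted(originals, key=lambda t: t["retweet_count"], reverse=True)[:2]

for t in top:
    print(t["retweet_count"], t["full_text"])
```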

So, this lets us form subsets of data and creates a "cursor" that lets us walk through the data, processing things as we like. We can also take the data directly into a, you guessed it, Pandas data frame. We'll dive in to more of this in the coming lesson.

**Summary**

We now have a number of options for working with data. We can store data locally as JSON or CSV files, but these are relatively "inert". They have to be read into Python or some other system to make convenient searches, for example. Through a database, you have consistent storage that many people can access and you can use the computational engine to filter, group and compute on data before you bring it into Python, say. So while your data may be humongous, your interest might be in individual people. There is no need to hold gigabytes of data in memory when you only need a small part.
