# Command line and data frames


Spring 2018 - Profs. Foster Provost and Josh Attenberg

Teaching Assistant: Apostolos Filippas


***

## Command Line (Terminal)

One of the most underappreciated tools for dealing with data is the command line of the Unix/Linux operating system. As you probably know, an operating system runs computer-based devices, including servers, laptops, smartphones, and many others, usually "under the hood". "Unix" (including Unix-like systems such as Linux) is the family of operating systems that runs the most devices. If you have a Mac, there's Unix under the hood. An iPhone? Unix under the hood. An Android phone? Guess what ... Unix under the hood. And your cloud instance? You guessed it.

You can get under the hood of a Unix-based computer via the command line, often called the Terminal, and do all manner of crazy things.  We will study a few of them.

We can access the command line of our instances through the IPython notebooks. You can run "shell" commands (such as the ones that follow) by prefixing the line with an exclamation point.

(The "shell" is a technical name for the command line interface you see in a Terminal window.  There are actually different shells with different commands & syntax, even for the same operating system, but we won't delve into that in this class.  We will use the particular shell that AWS provides us by default.  The default is the "bash" shell, for those of you who care.)

*Here we will give you a brief overview, and show some very useful commands.  If you're serious about dealing with data, and you don't have much command-line experience, you should get the book "Data Science at the Command Line" by Janssens. The first half of this (thin) book gives an excellent practical guide to dealing with data at the Unix command line.*


#### Interaction with files and folders

We can navigate the folder structure in which we are working.  Folders are called "directories". You will typically use commands such as `ls` (list directory contents) and `cd` (change to another directory). You can make a directory with `mkdir` or move (`mv`) and copy (`cp`) files. To delete a file you can `rm` (remove) it--careful, there's no getting it back.  To see the contents of a file you can `cat` it to the screen.  (Why `cat`?  That command actually concatenates multiple files and outputs the result.  If you give it one, then it just outputs that single file.  The default is to output it to the screen.  You'll see below that we can send command outputs elsewhere besides to the screen--like into another command!)

Many commands have options you can set when running them. For example, to get a listing of files as a vertical list with extra details, you can pass the `-l` (long format) flag, e.g. `ls -l`. During the normal course of using the command line, you will learn the most useful flags. If you want to see all possible options, you can always read the `man` (manual) page for a command, e.g. `man ls`. When you are done reading the `man` page, you can exit by hitting `q` to quit.
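
The file commands described above can be tried safely in a throwaway directory. The following is a minimal sketch, not part of the original notebook: the directory and file names (`notes`, `a.txt`, and so on) are made up for illustration, and `mktemp -d` is assumed to be available, as it is on typical Linux and macOS systems. In a notebook, a multi-line snippet like this can go in a cell that starts with `%%bash`, or it can be typed into a Terminal.

```shell
# Work in a throwaway directory so that nothing important can be deleted.
scratch=$(mktemp -d)

mkdir "$scratch/notes"                                  # make a directory
printf 'first line\n'  > "$scratch/notes/a.txt"         # create two small files
printf 'second line\n' > "$scratch/notes/b.txt"

cp "$scratch/notes/a.txt" "$scratch/notes/a_copy.txt"   # copy a file
mv "$scratch/notes/b.txt" "$scratch/notes/renamed.txt"  # move (here: rename) a file

ls -l "$scratch/notes"                                  # long listing of the directory

# cat with two files concatenates them, in the order given.
combined=$(cat "$scratch/notes/a.txt" "$scratch/notes/renamed.txt")
echo "$combined"

rm "$scratch/notes/a_copy.txt"                          # delete one file (no undo)
remaining=$(ls "$scratch/notes" | wc -l | tr -d ' ')
echo "$remaining files left"

rm -r "$scratch"                                        # clean up the scratch directory
```

Note that `mv` doubles as the rename command: moving a file to a new name in the same directory renames it.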


In [1]:
!ls

Command_line_and_data_frames_2017.ipynb images
data


In [2]:
!ls -l

total 1056
-rw-r--r--   1 mariazamora  staff  536994 Feb 21 17:33 Command_line_and_data_frames_2017.ipynb
drwxr-xr-x   5 mariazamora  staff     170 Feb 13 23:53 data
drwxr-xr-x  12 mariazamora  staff     408 Feb  6 17:44 images


In [3]:
!mkdir test

In [4]:
!ls

Command_line_and_data_frames_2017.ipynb images
data                                    test


In [5]:
!ls images/

new_notebook.png   notebook.png       terminal.png
new_terminal.png   script.png         terminal_2017.png
new_text.png       selectlanguage.png text.png


In [6]:
!cp images/terminal.png test/some_picture.png

In [7]:
!ls test/

some_picture.png


In [8]:
# WARNING: THIS WILL DELETE THE TEST FOLDER JUST CREATED
!rm -rf test/

In [9]:
!ls

Command_line_and_data_frames_2017.ipynb images
data


#### Data manipulation and exploration
Virtually anything you want to do with a data file can be done at the command line. There are dozens of commands that can be put together to get almost any result! Let's try it.

Let's take a look at the file `data/users.csv`.

Before we do anything, let's look at the first few lines of the file to get an idea of what's in it.

In [10]:
!head data/users.csv

user,variable1,variable2
parallelconcerned,145.391881,-6.081689
driftmvc,145.7887,-5.207083
snowdonevasive,144.295861,-5.826789
cobolglaucous,146.726242,-6.569828
stylishmugs,147.22005,-9.443383
hypergalaxyfibula,143.669186,-3.583828
pipetsrockers,-45.425978,61.160517
bracesworkable,-51.678064,64.190922
spiritedjump,-50.689325,67.016969


By default, `head` prints the first 10 lines. Maybe we want to see a few more lines of the file:

In [11]:
!head -15 data/users.csv

user,variable1,variable2
parallelconcerned,145.391881,-6.081689
driftmvc,145.7887,-5.207083
snowdonevasive,144.295861,-5.826789
cobolglaucous,146.726242,-6.569828
stylishmugs,147.22005,-9.443383
hypergalaxyfibula,143.669186,-3.583828
pipetsrockers,-45.425978,61.160517
bracesworkable,-51.678064,64.190922
spiritedjump,-50.689325,67.016969
barnevidence,-68.703161,76.531203
emeraldclippers,-18.072703,65.659994
maintainwiggly,-14.401389,65.283333
submittedwavelength,-15.227222,64.295556
clucklinnet,-17.425978,65.952328


How about the last few lines of the file?

In [12]:
!tail data/users.csv

troubledseptum,135.521667,-29.716667
troubledseptum,-118.598889,34.256944
organicmajor,-5.435,36.136
cobolglaucous,-123.5,48.85
troubledseptum,-124.016667,49.616667
snaildossier,-124.983333,50.066667
unbalancedprotoplanet,-127.028611,50.575556
badgefields,-126.833333,50.883333
backedammeter,-123.00596,48.618397
clucklinnet,-117.1995,32.7552


We can count how many lines are in the file by using `wc` (word count, which counts more than just words) with the `-l` flag to count lines:

In [13]:
!wc -l data/users.csv

    8104 data/users.csv
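
As a small illustration of what else `wc` can count, and of the fact that a line count of a CSV file includes its header row, here is a sketch on a two-line stand-in file built from rows shown above. The temporary file and the variable names are assumptions made for this example; `tr -d ' '` only strips the padding that some versions of `wc` add.

```shell
# A two-line stand-in for data/users.csv: the header plus one data row.
sample=$(mktemp)
printf 'user,variable1,variable2\ndriftmvc,145.7887,-5.207083\n' > "$sample"

lines=$(wc -l < "$sample" | tr -d ' ')   # number of lines, header included
words=$(wc -w < "$sample" | tr -d ' ')   # whitespace-separated words
bytes=$(wc -c < "$sample" | tr -d ' ')   # bytes

# tail -n +2 starts printing at line 2, so the header is not counted.
data_rows=$(tail -n +2 "$sample" | wc -l | tr -d ' ')

echo "lines=$lines words=$words bytes=$bytes data_rows=$data_rows"
rm "$sample"
```

Because the fields are separated by commas rather than spaces, `wc -w` sees each line as a single word here.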


It looks like there are three columns in this file. Let's take a look at the first one alone. Here, we can `cut` the field (`-f`) we want, as long as we give the proper delimiter to tell `cut` what separates the fields (`-d` defaults to tab).

In [14]:
!cut -f1 -d',' data/users.csv

user
parallelconcerned
driftmvc
snowdonevasive
cobolglaucous
stylishmugs
hypergalaxyfibula
pipetsrockers
bracesworkable
spiritedjump
barnevidence
emeraldclippers
maintainwiggly
submittedwavelength
clucklinnet
bluetailgodwottery
microwavejar
croutonwrack
submittedwavelength
moderatohorn
heaterinert
micaassistant
gaudyfea
turnoverlovesick
amuckpoints
allegatorwafers
expecteffective
mincegaiters
peacefulceaseless
decanterbalance
synonympatisserie
starbucksbluetail
pipeathlete
radicandoceanic
somethingalbedo
craytugofwar
pipetsrockers
unbalancedprotoplanet
emeraldclippers
ischemicfrosted
binomialapathetic
stairsgobsmacked
ledgeindeed
badgefields
synonympatisserie
worldlyventuri
globeshameful
alloweruptions
burritoscarriage
grabbig
dronessomersault
latticelaboratory
ellipticalfabricator
amuckpoints
guavaconfide
fundingticket
croutonwrack
elatedunicorn
freelysociable
loindecorate
micaassistant
dweebspices
latticelaboratory
babyam

That's a lot of output. Let's combine the `cut` command with the `head` command by *piping* the output of one command into another command.  The vertical bar `|` is the *pipe*.

In [15]:
!cut -f1 -d',' data/users.csv | head

user
parallelconcerned
driftmvc
snowdonevasive
cobolglaucous
stylishmugs
hypergalaxyfibula
pipetsrockers
bracesworkable
spiritedjump
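
`cut` can also select several fields at once, and it can sit on either side of a pipe. The sketch below is an illustration under assumptions: it uses a small temporary file, made of a few rows copied from the output above, in place of `data/users.csv`, and the variable names are invented for the example.

```shell
# Small stand-in for data/users.csv, using rows shown earlier.
sample=$(mktemp)
cat > "$sample" <<'EOF'
user,variable1,variable2
parallelconcerned,145.391881,-6.081689
driftmvc,145.7887,-5.207083
pipetsrockers,-45.425978,61.160517
EOF

# -f accepts a comma-separated list of field numbers ...
first_and_third=$(cut -d',' -f1,3 "$sample")
echo "$first_and_third"

# ... or a range. Here tail keeps only the last line before cut runs.
last_numbers=$(tail -n 1 "$sample" | cut -d',' -f2-3)
echo "$last_numbers"

rm "$sample"
```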


We can use pipes (`|`) to string together many commands to create very powerful one-liners. For example, let's figure out the number of _unique_ users in the first column of the data file. We will get all the values from the first column, sort them, reduce that to only the unique values, and then count the number of lines in the result:

In [16]:
!cut -f1 -d',' data/users.csv | sort | uniq | wc -l

     201
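
One caveat about this count: the first line of the file is the header, so the value `user` is counted as if it were a user name. A possible way to exclude it is to drop the first line with `tail -n +2` before cutting. The sketch below shows the difference on a small stand-in file built from rows that appear in the outputs above; the temporary file and variable names are assumptions for illustration. It also uses `sort -u`, which should give the same result as `sort | uniq`.

```shell
# Stand-in for data/users.csv: a header plus four rows, two distinct users.
sample=$(mktemp)
cat > "$sample" <<'EOF'
user,variable1,variable2
cobolglaucous,146.726242,-6.569828
troubledseptum,135.521667,-29.716667
cobolglaucous,-123.5,48.85
troubledseptum,-124.016667,49.616667
EOF

# Same pipeline as above: the header value "user" is counted too.
with_header=$(cut -d',' -f1 "$sample" | sort | uniq | wc -l | tr -d ' ')

# Skip line 1 first, then count distinct values with sort -u.
without_header=$(tail -n +2 "$sample" | cut -d',' -f1 | sort -u | wc -l | tr -d ' ')

echo "with header: $with_header, without header: $without_header"
rm "$sample"
```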


Or, we can get a list of the top-10 most frequently occurring users. If we give `uniq` the `-c` flag, it will return the number of times each value occurs. Since these counts are the first entry in each new line, we can tell `sort` to expect numbers (`-n`) and to give us the results in reverse (`-r`) order. Note that when you want to use two or more single-letter flags, you can just place them one after another.

In [17]:
!cut -f1 -d',' data/users.csv | sort | uniq -c | sort -nr | head

  59 compareas
  56 upbeatodd
  56 burntrifle
  56 binomialapathetic
  54 frequencywould
  54 ellipticalfabricator
  53 globeshameful
  52 badgefields
  52 ashamedmuscles
  51 alloweruptions
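
The `sort` that comes before `uniq` in these pipelines is not optional: `uniq` only collapses duplicate lines that are adjacent, so unsorted input can be overcounted. A minimal sketch of the difference, using three user names from the data as an inline list (the variable names are made up for the example):

```shell
# Three values, two of them identical but not adjacent.
names=$(printf 'cobolglaucous\ntroubledseptum\ncobolglaucous\n')

# Without sort, the two cobolglaucous lines are not neighbors,
# so uniq leaves all three lines in place.
unsorted_count=$(printf '%s\n' "$names" | uniq | wc -l | tr -d ' ')

# With sort, duplicates become adjacent and uniq collapses them.
sorted_count=$(printf '%s\n' "$names" | sort | uniq | wc -l | tr -d ' ')

# Most frequent value: count, sort by count descending, keep the first
# line, and use awk to print the second column (the name).
top_name=$(printf '%s\n' "$names" | sort | uniq -c | sort -nr | head -n 1 | awk '{print $2}')

echo "unsorted: $unsorted_count, sorted: $sorted_count, top: $top_name"
```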


After some exploration we decide we want to keep only part of our data and bring it into a new file. Let's find all the records that have negative values in both the second and third columns and put these results in a file called `data/negative_users.csv`. Searching through files can be done using _[regular expressions](http://www.robelle.com/smugbook/regexpr.html#expression)_ with a tool called `grep` (the name comes from the `ed` editor command `g/re/p`: globally search for a regular expression and print the matching lines).

And remember that we can send the output of a command to the screen or into another command?  You can also direct the output of a command (or a string of commands) into a file using the "redirection" operator `>`.  (NB: the pipe is also a redirection operator.  Another redirection operator is `>>`, which appends the output to the end of the file, rather than overwriting it.)

In [18]:
!grep '.*,-.*,-.*' data/users.csv > data/negative_users.csv
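
The pattern above works because each line has exactly two commas, so two occurrences of `,-` can only mean that both numeric columns start with a minus sign. Note that the header line does not match, so the new file has no header row. The sketch below compares the pattern with two alternatives: a more explicit, anchored pattern in which `[^,]*` matches one field, and an `awk` filter that compares the columns numerically. It runs on a stand-in file; the row `exampleuser,-1.5,-2.5` is hypothetical and added only so that something matches, and the variable names are assumptions.

```shell
# Stand-in for data/users.csv. The last row is a made-up example.
sample=$(mktemp)
cat > "$sample" <<'EOF'
user,variable1,variable2
parallelconcerned,145.391881,-6.081689
pipetsrockers,-45.425978,61.160517
exampleuser,-1.5,-2.5
EOF

# The pattern used in the cell above.
loose=$(grep '.*,-.*,-.*' "$sample")

# Anchored variant: first field, then a second and a third field
# that each begin with a minus sign.
strict=$(grep '^[^,]*,-[^,]*,-' "$sample")

# Numeric variant: split on commas and test the values directly.
numeric=$(awk -F',' '$2 < 0 && $3 < 0' "$sample")

echo "$loose"
echo "$strict"
echo "$numeric"
rm "$sample"
```

On a three-column file all three select the same rows; the anchored and numeric forms are simply harder to misread, and would keep working if more columns were added later.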

In [19]:
!ls data

ds_survey.csv      negative_users.csv users.csv
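
To make the difference between the two file redirection operators concrete, here is a minimal sketch that writes to a temporary file (the file and the variable names are assumptions for the example, not part of the course data):

```shell
out=$(mktemp)

echo "first"  > "$out"     # > creates the file, or overwrites it if it exists
echo "second" > "$out"     # overwrites again: "first" is gone
after_overwrite=$(cat "$out")

echo "third" >> "$out"     # >> appends to the end of the file
after_append=$(cat "$out")
line_count=$(wc -l < "$out" | tr -d ' ')

echo "$after_append"
rm "$out"
```

This is also why rerunning the `grep ... > data/negative_users.csv` cell is safe: each run replaces the file rather than adding duplicate rows to it.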
