# Browsing data

...with ease

In this tutorial, we'll cover two data-browsing methods and one visualization:

 * Browsing files on the command line
 * Visualizing numerical data using the ECDF
 * Building pivot tables

## 1. Intro to *nix command line (for browsing data)

Several commands form a virtual Swiss army knife of tools for initial examination of (text) files:

  * cat
  * head
  * tail
  * cut
  * tr
  * sort
  * uniq
  * wc
  
There are a few fancier commands -- more powerful but more difficult to learn:

  * grep
  * awk
  * sed
  
And, there are a few odds and ends that will help work with files and scripting more generally:

  * xargs
  * find
  * echo
  
Many other commands exist.

We'll glue these commands together using pipes, redirection, and loops.

### Prerequisite:  set up your directory tree

(Normally you would view your directory tree with the ``tree`` program, but it's not installed on syzygy.)

```bash
src             # source code
src/projectA
src/projectB
usr             # optional: or an alternative to using src
bin             # binaries: locally-installed
var             # variable files; often auxiliary to some sort of source program
var/data        # my favourite place to put data
var/www         # this is where I put my test web pages
tmp             # things in here could be deleted at any time (use for downloads, mostly)
```
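One way to create this layout is sketched below. It assumes you want the tree directly under your home directory; ``BASE`` is a hypothetical variable introduced only for this example, and ``projectA``/``projectB`` are the placeholder names from the listing above.

```bash
# Assumption: the tree lives under your home directory; set BASE to put it elsewhere.
BASE="${BASE:-${HOME:-$PWD}}"

# -p creates missing parent directories and is silent if a directory already exists.
mkdir -p "$BASE/src/projectA" "$BASE/src/projectB" \
         "$BASE/usr" "$BASE/bin" \
         "$BASE/var/data" "$BASE/var/www" \
         "$BASE/tmp"

ls -d "$BASE"/src/* "$BASE"/var/*   # confirm the nested directories exist
```

Because of ``-p``, the command is safe to run again later without disturbing anything already in those directories.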

### Prerequisite:  install gdrive

``gdrive`` is a Command Line Interface (CLI) to Google Drive.

```bash
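cd ~/tmp
# Quote the URL: an unquoted & would send wget to the background and cut the URL short
wget -O gdrive "https://docs.google.com/uc?id=0B3X9GlR6EmbnWksyTEtCM0VfaFE&export=download"
chmod a+x gdrive          # make it executable
mv gdrive ~/bin/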
```

Your bin directory is probably not in your PATH, so let's set that up, too:
```bash
cd ~
env | grep PATH                # list your environment variables, keeping only lines that mention PATH
export PATH="$PATH:$(pwd)/bin" # append ~/bin to PATH (for this shell session only)
```

Test it out:
```bash
gdrive list
```

This will prompt you to verify that your installation of gdrive is allowed to access your Google Drive.

Reference:  http://olivermarshall.net/how-to-upload-a-file-to-google-drive-from-the-command-line/

### Get the data:

```bash
cd ~/var/data/ && gdrive download 0B3vTSAOy4zNvR3JCX3pOd0NHN1E --recursive
```

Note the use of the ``&&`` operator.  Every command returns an exit status: zero means success (true) and anything else means failure (false).  ``&&`` runs the second command only if the first one succeeds.  If the first command (the ``cd`` here) fails, the shell already knows that the overall result is false, so it skips the ``gdrive download`` instead of downloading into the wrong directory.
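The short-circuit behaviour can be seen in a small experiment. This is only a sketch: ``/no/such/dir`` is assumed not to exist on your machine, and ``||`` (the "run only on failure" counterpart) is included purely for contrast.

```bash
# Assumption: /no/such/dir does not exist, so the first cd fails.
cd /no/such/dir 2>/dev/null && echo "this is never printed"
echo "exit status of the failed chain: $?"     # non-zero means failure

# A successful first command lets the second one run.
cd /tmp && echo "now in $(pwd)"

# The counterpart || runs the second command only if the first one fails.
cd /no/such/dir 2>/dev/null || echo "cd failed, staying in $(pwd)"
```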

### Examine the data

Examples:

```bash
cd SummerData
pwd                                     # show the directory we're in
cat README.md                           # show an entire file
wc -l D30.csv                           # count lines in a file

for F in *.csv; do wc -l "$F"; done     # iterate through all csv files; count their lines
head D30.csv                            # show the first 10 lines of D30.csv
head -n 1 D30.csv | tr ',' '\n'         # show the header, one column name per line

head -n 1 D30.csv | tr ',' '\n' | nl    # number the column names (handy for cut -f)
head D30.csv | cut -d',' -f5            # view a single column
head -n 1 D30.csv > headers.txt         # send the header line to a text file

cut -d',' -f5 D30.csv | sort | uniq -c  # count each distinct value (a text histogram)
tail -n 10 D30.csv                      # last 10 lines of the file
tail -n +2 D30.csv                      # all but the first line of the file (skip the header)
```

Sed:
  * search and replace tool
  * Sed one-liners:  search for ``sed one liners`` or go to:  https://www.google.ca/search?q=sed+one+liners&oq=sed+one+liners

Awk:
  * line-by-line data extraction
  * Awk one-liners:  search for ``awk one liners`` or go to:  https://www.google.ca/search?q=awk+one+liners&oq=awk+one+liners

Example:

```bash
head -n 50 D30.csv | awk -F',' '{ if ($5=="640x360") print $0 }'  # filter based on a column
```
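For comparison, here is a minimal sketch of sed's search and replace. It runs on a made-up three-column file rather than the real data; the column names ``Camera`` and ``Resolution`` and the replacement strings are assumptions chosen for illustration.

```bash
# Toy stand-in for a few lines of D30.csv (column names other than Index are assumptions).
sample=$(mktemp)
printf '%s\n' \
  'Index,Camera,Resolution' \
  '1,D30,640x360' \
  '2,D30,1280x720' \
  '3,D30,640x360' > "$sample"

# s/old/new/ replaces the first match on each line; the input file is not modified.
sed 's/640x360/small/' "$sample"

# Restrict the substitution to lines 2 onward, so the header is never touched.
sed '2,$ s/D30/camera-30/' "$sample"

# -n with the p flag prints only lines where a substitution happened (a grep-like filter).
sed -n 's/1280x720/large/p' "$sample"
```

sed writes its result to standard output, so redirect it to a new file if you want to keep the changes.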

### Exercises

  * What range of values is in each column?  (Make histograms)
  * Concatenate the files into a single file having only one header
  * Use awk or sed to renumber the Index column
  
Concatenating files:
```bash
mkdir working
for F in *.csv; do cat "$F" >> working/AllData.csv; done
cd working
head AllData.csv

grep -n Index AllData.csv  # show with line numbers
grep -n Index AllData.csv | cut -d':' -f1  # just show line numbers
grep -n Index AllData.csv | cut -d':' -f1 | wc -l # count header lines

ls ../*.csv | wc -l # sure enough...one header from each file
```
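One possible approach to the second and third exercises is sketched below with awk. It runs on two made-up files in a temporary directory, and it assumes that every file starts with the same header line and that Index is the first column; check both against the real data before reusing it.

```bash
# Toy stand-ins for the real CSV files (contents are made up for illustration).
demo=$(mktemp -d)
mkdir "$demo/working"
printf 'Index,TotalBytes\n1,100\n2,250\n' > "$demo/A.csv"
printf 'Index,TotalBytes\n1,300\n2,50\n'  > "$demo/B.csv"

# FNR is the line number within the current file; NR counts lines across all files.
# Keep line 1 of the first file (NR == 1) and every non-header line (FNR > 1).
awk 'NR == 1 || FNR > 1' "$demo"/*.csv > "$demo/working/OneHeader.csv"

grep -c Index "$demo/working/OneHeader.csv"   # count lines containing the header text

# Renumber the first column: leave the header alone, then write 1, 2, 3, ... into $1.
# Assumption: Index is the first column, as in the toy files.
awk -F',' -v OFS=',' 'NR > 1 { $1 = NR - 1 } { print }' "$demo/working/OneHeader.csv" \
  > "$demo/working/Renumbered.csv"

cat "$demo/working/Renumbered.csv"
```

Writing the output into ``working`` keeps it out of the ``*.csv`` glob, so rerunning the commands does not feed the result back into itself.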

## 2. ECDFs are better than histograms

This is the only Python we'll do here today.

An Empirical Cumulative Distribution Function (ECDF) is a plot that shows the distribution of your data *without putting it into arbitrary bins*.

For each value x on the X-axis, the ECDF shows the empirical probability (the fraction of observations) that the variable is smaller than x.  The slope of the ECDF corresponds to the density that you're used to seeing in histograms: the curve is steep where the data are concentrated and flat where they are sparse.  However, the ECDF doesn't require binning in order to work.
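To make the definition concrete, the sketch below computes ECDF points for four made-up numbers on the command line. It sorts the values and pairs the i-th smallest with (i - 1)/n, the fraction of observations sorted before it; this positional convention is an assumption chosen to keep the example short.

```bash
# Toy data, one number per line (made-up values).
ecdf=$(printf '%s\n' 30 10 20 20 |
  sort -n |
  awk '{ v[NR] = $1 }    # remember the sorted values in order
       END {
         for (i = 1; i <= NR; i++) {
           # fraction of observations sorted before position i
           printf "%s %.2f\n", v[i], (i - 1) / NR
         }
       }')
echo "$ecdf"
```

Each output line is one (x, y) point of the curve. No bin width is chosen anywhere, which is the advantage over a histogram.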

### Prerequisite:

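Create a data file (called ``TotalBytes.csv`` below) containing a single column: the TotalBytes column of the ``AllData.csv`` file.  Keep exactly one header line at the top, because the loading code below skips only the first line and converts every other line to a number.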
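A sketch of one way to build that file from the command line follows. It uses a made-up miniature ``AllData.csv`` with a repeated header, as plain concatenation produces, and looks the column up by name rather than assuming its position; ``work`` and ``col`` are names introduced only for this example.

```bash
# Toy stand-in for AllData.csv with a repeated header, as produced by plain concatenation.
work=$(mktemp -d)
printf 'Index,TotalBytes\n1,100\n2,250\nIndex,TotalBytes\n1,300\n' > "$work/AllData.csv"

# Find the column number by name from the first line, so it is not hard-coded.
col=$(head -n 1 "$work/AllData.csv" | tr ',' '\n' | grep -n -x 'TotalBytes' | cut -d':' -f1)

# Print that column; keep the first header line and drop any later repeats of it.
cut -d',' -f"$col" "$work/AllData.csv" |
  awk 'NR == 1 || $0 != "TotalBytes"' > "$work/TotalBytes.csv"

cat "$work/TotalBytes.csv"
```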

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings


def show(_plt):
    """Show the current figure, suppressing any warnings raised while rendering."""
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        _plt.show()
```

```python
import os
path = os.path.expanduser('~/var/data/SummerData/working/TotalBytes.csv')
with open(path) as fh:
    next(fh)                                   # skip the header line
    total_bytes = [float(row) for row in fh]   # one number per remaining line
```

```python
len(total_bytes)   # number of observations loaded
```

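```python
plt.figure(figsize=(12, 8))

# y[i] = i/n is the fraction of observations sorted before the i-th smallest value
y = np.arange(len(total_bytes))/len(total_bytes)
plt.plot(sorted(total_bytes), y, '-b')
plt.xlabel('total_bytes')
plt.ylabel('P(TotalBytes < total_bytes)')

plt.xlim(-1e8, 5e8)   # restrict the x-axis range
show(plt)
```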

### Exercises

  * Make and interpret an ECDF for one of the other numerical columns in the data.
  * Make an ECDF for a numerical column of data for one of the cameras.

## 3. Intro to Pivot Tables

Excel isn't completely sucky... if your data is really well organized, a spreadsheet can actually help...