# 3. Working with data

## 3.0 Preamble

At the top of almost every piece of scientific computing work, we'll import these standard modules.

In [38]:
# Import modules, and give them short aliases so we can write e.g. np.foo rather than numpy.foo
import math, random
import numpy as np
import matplotlib.pyplot as plt
import scipy
import scipy.optimize
import pandas
# The next line is a piece of magic, to let plots appear in our Jupyter notebooks
%matplotlib inline 

## 3.1 What data looks like

Scientific computing is all about the data. You will almost always work with data in the form of a spreadsheet-like table, often referred to as a _data frame_. For example, here are some rows from the classic [Iris](https://en.wikipedia.org/wiki/Iris_flower_data_set) dataset, introduced by [Ronald Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher) in 1936. Fisher was described as a "genius who almost single-handedly created the foundations for modern statistical science".

| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species    |
|--------------|-------------|--------------|-------------|------------|
| 7.7          | 3.8         | 6.7          | 2.2         | virginica  |
| 5.3          | 3.7         | 1.5          | 0.2         | setosa     |
| 5.8          | 2.7         | 5.1          | 1.9         | virginica  |
| 5.5          | 2.4         | 3.7          | 1.0         | versicolor |
| 6.7          | 3.0         | 5.2          | 2.3         | virginica  |
| ...          | ...         | ...          | ...         | ...        |

A data frame is a collection of named columns. Each column has the same length, and all entries in a column have the same type, though different columns may have different types. It is basically the same as a table in a relational database (except that you should think of scientific data tables as permanent logs of observations, so that UPDATE and DELETE database operations are irrelevant).

In Python, we have some options about how to store data frames. A simple choice, which we'll use in this notebook, is to store them as dictionaries of `numpy` vectors, e.g.

In [3]:
iris = {'Sepal.Length': np.array([7.7, 5.3, 5.8, 5.5, 6.7]),
        'Sepal.Width': np.array([3.8, 3.7, 2.7, 2.4, 3.0]),
        'Petal.Length': np.array([6.7, 1.5, 5.1, 3.7, 5.2]),
        'Petal.Width': np.array([2.2, 0.2, 1.9, 1.0, 2.3]),
        'Species': np.array(['virginica', 'setosa', 'virginica', 'versicolor', 'virginica'])}
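
To make the data-frame properties concrete, here is a minimal sketch that rebuilds the five sample rows and checks that every column has the same length while column types differ. The variable names `lengths` and `kinds` are illustrative only, not part of any library.

```python
import numpy as np

# The five sample rows from the table above, stored as a dict of numpy vectors
iris = {'Sepal.Length': np.array([7.7, 5.3, 5.8, 5.5, 6.7]),
        'Sepal.Width': np.array([3.8, 3.7, 2.7, 2.4, 3.0]),
        'Petal.Length': np.array([6.7, 1.5, 5.1, 3.7, 5.2]),
        'Petal.Width': np.array([2.2, 0.2, 1.9, 1.0, 2.3]),
        'Species': np.array(['virginica', 'setosa', 'virginica', 'versicolor', 'virginica'])}

# Every column should have the same number of entries
lengths = {name: len(col) for name, col in iris.items()}

# dtype.kind is 'f' for floating-point columns and 'U' for unicode-string columns
kinds = {name: col.dtype.kind for name, col in iris.items()}

print(lengths)
print(kinds)
```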

* Why not store it as a `numpy` matrix? Because all elements in a `numpy` matrix have to be the same type.
* What's bad about storing data frames as a dictionary of vectors? We might accidentally set some columns to have different lengths. Also, we have to write bothersome code for simple tasks like picking out rows.
* Does it have to be `numpy` vectors rather than plain Python lists? Numpy is great for speed, as we learnt in Section 2, but for small datasets or for custom column types it's fine to use lists. Also, `numpy` doesn't have a standard type for categorical data (like `Species` in the example above), which is a bother.
* Why not use object-oriented design and invent a class for data frames? That's exactly what the [`pandas`](http://pandas.pydata.org/) module does. It is currently the best library for working with data in Python, widely used inside companies like Google. However, it has idiosyncratic syntax which takes some time to learn. Also, I think it's not yet mature, and by the time you graduate there will probably be something better. Even so, if you want to learn data science right now, you should learn to use `pandas`.
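
The "bothersome code" for picking out rows, and a guard against mismatched column lengths, might look like the following sketch. The helper names `check_lengths` and `select_rows` are hypothetical, written for this illustration; they are not part of `numpy` or `pandas`.

```python
import numpy as np

def check_lengths(df):
    """Raise ValueError unless all columns have the same length; return that length."""
    lengths = {len(col) for col in df.values()}
    if len(lengths) > 1:
        raise ValueError(f"columns have different lengths: {sorted(lengths)}")
    return lengths.pop() if lengths else 0

def select_rows(df, mask):
    """Return a new dict of vectors holding only the rows where mask is True."""
    check_lengths(df)
    return {name: col[mask] for name, col in df.items()}

iris = {'Sepal.Length': np.array([7.7, 5.3, 5.8, 5.5, 6.7]),
        'Sepal.Width': np.array([3.8, 3.7, 2.7, 2.4, 3.0]),
        'Petal.Length': np.array([6.7, 1.5, 5.1, 3.7, 5.2]),
        'Petal.Width': np.array([2.2, 0.2, 1.9, 1.0, 2.3]),
        'Species': np.array(['virginica', 'setosa', 'virginica', 'versicolor', 'virginica'])}

# Boolean mask: True for each row whose species is virginica
is_virginica = iris['Species'] == 'virginica'
virginica = select_rows(iris, is_virginica)
print(virginica['Sepal.Length'])
```

With a `pandas.DataFrame` called `df`, the same selection is written `df[df['Species'] == 'virginica']`, which is one reason a dedicated class is attractive.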

## 3.1 Importing and cleaning data
In my experience, around 75% of the time you spend working with data is spent fighting to import it and clean it up. This depends mostly on general-purpose programming skills, but here are some snippets that may be useful.

### 3.1.0 Import from a file or file-like thing
When your data is a simple comma-separated values (CSV) file, it is easy to import. A CSV file looks like this:
```
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.7,3.2,1.3,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"
5,3.6,1.4,0.2,"setosa"
```

In [39]:
df = pandas.read_csv('data/iris.csv')    # read in CSV file as a pandas.DataFrame
df = {col:df[col].values for col in df}  # convert it to a dict of np.array

If your file is nearly a CSV but has some quirks such as comments or a missing header row, experiment with the options in [`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) or [`pandas.read_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html).

You can use the same command to read in file-like things, such as files retrieved over the web, or strings.

In [None]:
# The same file, fetched over the web


Sometimes it's useful to be able to read from a string or string-like object as though it were a file. You can use the same command for this:

In [47]:
import io
f = io.StringIO("""
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.7,3.2,1.3,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"
5,3.6,1.4,0.2,"setosa"
""")
df = pandas.read_csv(f)
f.close()    # free up the memory used to store the string
df = {col:df[col].values for col in df}
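
As an illustration of the `read_csv` options mentioned earlier for nearly-CSV files, here is a sketch that reads a string containing comment lines and no header row. The sample text and the column names are made up for this example; `comment`, `header` and `names` are standard parameters of `pandas.read_csv`.

```python
import io
import pandas

# A nearly-CSV text: it has comment lines starting with '#', and no header row
text = """# measurements in cm
5.1,3.5,1.4,0.2,setosa
# a comment in the middle of the file
4.9,3.0,1.4,0.2,setosa
"""

# Since the file has no header row, we supply the column names ourselves
names = ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']
with io.StringIO(text) as f:
    df = pandas.read_csv(f, comment='#', header=None, names=names)

df = {col: df[col].values for col in df}  # convert to a dict of np.array, as before
print(df['Sepal.Length'])
```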

### 3.1.1 Import from a text log file


## 3.2 Manipulating data

* pandas indexing
* new columns etc.
* statistics
* histograms, plots with error bars

## 3.4 More plotting

* facets