#  An intro to the Python numerical stack

In [None]:
# The %... syntax is an IPython "magic" command; it is not part of the Python language.
# In this case we're just telling the plotting library to draw things on
# the notebook, instead of on a separate window.
%matplotlib inline 
#this line above prepares IPython notebook for working with matplotlib

# See all the "as ..." constructs? They're just aliasing the package names.
# That way we can call methods like plt.plot() instead of matplotlib.pyplot.plot().

import numpy as np # imports a fast numerical programming library
import scipy as sp #imports stats functions, amongst other things
import matplotlib as mpl # this actually imports matplotlib
import matplotlib.cm as cm # allows us easy access to colormaps
import matplotlib.pyplot as plt #sets up plotting under plt
import pandas as pd #lets us handle data as dataframes
#sets up pandas table display
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns #sets up styles and gives us more plotting options

## Pandas 

Here is a cheatsheet for Pandas you might find useful: https://drive.google.com/folderview?id=0ByIrJAE4KMTtaGhRcXkxNHhmY2M&usp=sharing

### Reading data in from a file

A lot of the data out there in the world is in CSV files. If not, it can be put into a CSV file unless it's too big. If it's too big, it's probably in a database where it looks like a CSV file.

What do we mean when we say that it looks like a CSV file? We mean that it's **rectangular**. That is, the features/variables/covariates of the data are in columns, and the observations are in rows. These observations constitute a **sample** of the data, where we generally assume that the sample is drawn from the entire universe of possible observations of this type.

Here we read in some car data (from R) from a CSV file. Note that CSV files can be output by any spreadsheet software and are plain text, so they are a great way to share data.
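
To make "rectangular plain text" concrete, here is a minimal sketch that reads a tiny CSV from a string instead of a file. The three rows and the names `csv_text` and `toy` are made up for illustration and are not part of the course data.

```python
import io

import pandas as pd

# A tiny, made-up CSV: one header line, then one observation per line.
csv_text = """name,mpg,cyl
car_a,21.0,6
car_b,22.8,4
car_c,18.7,8
"""

# read_csv accepts any file-like object, so StringIO stands in for a file on disk.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)  # (rows, columns)
print(toy)
```

Each line of text becomes a row, and each comma-separated field lands in the column named by the header line.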

The documentation for this data is [here](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html) but I have extracted some relevant parts below:

```
Description

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

Usage

mtcars
Format

A data frame with 32 observations on 11 variables.

[, 1]	mpg	Miles/(US) gallon
[, 2]	cyl	Number of cylinders
[, 3]	disp	Displacement (cu.in.)
[, 4]	hp	Gross horsepower
[, 5]	drat	Rear axle ratio
[, 6]	wt	Weight (1000 lbs)
[, 7]	qsec	1/4 mile time
[, 8]	vs	V/S
[, 9]	am	Transmission (0 = automatic, 1 = manual)
[,10]	gear	Number of forward gears
[,11]	carb	Number of carburetors
Source

Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.
```

Let us capture this spreadsheet in **memory**. The structure we capture it in is called a pandas **dataframe**. 

In [None]:
dfcars=pd.read_csv("data/mtcars-course.csv")
dfcars.head()

Notice we have a table! A spreadsheet! And it indexed the rows (the 0,1,2,3,4...to the left in bold font).

`dfcars`, in `python` parlance, is an instance of the `pd.DataFrame` class, created by calling the `pd.read_csv` function, which calls the DataFrame constructor inside of it. If you don't understand this sentence, don't worry, it will become clearer later. What you need to take away is that `dfcars` is a dataframe object, and it has methods, or functions belonging to it, which allow it to do things. For example `dfcars.head()` is a method that shows the first 5 rows of the dataframe.

The model for a `pandas` dataframe is that of a set of columns pasted together into a spreadsheet. This is slightly different from the model that you might be used to from Excel or Google Sheets, where you might make the spreadsheet a row at a time...

![](images/pandastruct.png)

(image stolen from the cheatsheet above)
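
The "columns pasted together" model can be sketched directly. This is an illustration with made-up values, not how `read_csv` builds the dataframe internally; the names `mpg`, `cyl`, and `toy` are hypothetical.

```python
import pandas as pd

# Two hypothetical columns, each a Series with its own name and dtype.
mpg = pd.Series([21.0, 22.8, 18.7], name="mpg")
cyl = pd.Series([6, 4, 8], name="cyl")

# Pasting the columns side by side (axis=1) gives a dataframe.
toy = pd.concat([mpg, cyl], axis=1)

print(toy)
print(toy.dtypes)  # each column keeps its own type
```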

There is an ugly, poorly named column right here. Let's fix that.

In [None]:
dfcars=dfcars.rename(columns={"Unnamed: 0":"name"})
dfcars.head()

Notice that we created a new dataframe but kept the name `dfcars` for it. This is because variables in python are just **bindings**: they are just aliases for a piece of memory. The `rename` method on dataframes creates a new dataframe, and we rebind the variable `dfcars` to point to this new piece of memory. What about the old piece of memory `dfcars` pointed to? It now has no binding and will be destroyed by Python's garbage collector. This is how `python` manages memory on your computer.

Unless you have very limited memory on your computer, this is the recommended style of `python` programming. Don't create a `dfcars2`. If you have very limited memory, many `pandas` methods have an `inplace=True` option. Refer to the documentation on `rename` for example. You can then say:
```
dfcars.rename(columns={"Unnamed: 0":"name"}, inplace=True)
```

without reassigning, and the column will be updated in memory. 
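
The rebinding described above can be checked on a small example. This is a sketch with a made-up two-row dataframe; `toy` and `original` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({"Unnamed: 0": ["car_a", "car_b"], "mpg": [21.0, 22.8]})
original = toy  # a second name bound to the same object

# rename returns a new dataframe; rebinding `toy` leaves the old object untouched.
toy = toy.rename(columns={"Unnamed: 0": "name"})

print(toy is original)         # False: `toy` now points at the new dataframe
print(list(original.columns))  # the old object still has the old column name
print(list(toy.columns))
```

Here the old dataframe survives only because `original` still refers to it. Without that second name, nothing would be bound to it and it could be garbage collected.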

### Properties of dataframes

Here are some of the properties of our dataframe. The `shape` tells us the `rows x columns` we have.

In [None]:
dfcars.shape # 12 columns, each of length 32

The columns may also be obtained as an attribute:

In [None]:
dfcars.columns

### Types of the columns

Columns in a dataframe come with their own types. Some data may be categorical: it takes only a few well-defined values. An example is cylinders (`cyl`). Cars may have 4, 6, or 8 cylinders. There is an ordered interpretation to this (8 cylinders means a more powerful engine) but also a one-of-three-types interpretation.

Sometimes categorical data does not have an ordered interpretation. An example is `am`: a boolean variable which indicates whether the car is an automatic or not.

Other column types are integer, floating-point, and `object`. The last is a catchall for strings and for anything whose type `pandas` could not infer.
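
If you want to make the ordered and unordered interpretations explicit, `pandas` has a dedicated `category` dtype. The following is a minimal sketch on made-up values; the course dataframe itself stores these columns as plain integers or strings, and the names `cyl_ordered` and `am` here are hypothetical.

```python
import pandas as pd

# Ordered categorical: 4 < 6 < 8 cylinders.
cyl = pd.Series([6, 4, 8, 4, 6])
cyl_ordered = cyl.astype(pd.CategoricalDtype(categories=[4, 6, 8], ordered=True))

# Unordered categorical: neither transmission type is "greater" than the other.
am = pd.Series(["automatic", "manual", "manual"]).astype("category")

print(cyl_ordered.cat.ordered, am.cat.ordered)
print((cyl_ordered > 4).sum())  # order comparisons make sense only for ordered categories
```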
 
Let's see the types of the columns.

In [None]:
dfcars.info()

In [None]:
dfcars.dtypes

The `dtypes` attribute tells us what kind of columns we have. Some are floating-point numbers, and these typically have continuous values. The others, which are integers, have integer values like 2 carburetors, but you could consider these as **categorical or factor variables** as well: a number-of-carburetors factor. `am` is a factor coded as an integer where 0 is automatic and 1 is manual.

`dtypes` are useful for debugging. If one of these columns is not the type you expect, it can point to missing or malformed values and you ought to go investigating. `Pandas` assigns these types by inspection of some of the values, and if the types are mixed it will make this an `object`, like the `name` column. This process is called **type inference**.

Consider an example:

In [None]:
different_values = ['a', 1, 2, 3]
different_series = pd.Series(different_values)
different_series

In [None]:
different_series.dtypes # object because type inference fails

In [None]:
similar_values = [2, 3, 4]
similar_series = pd.Series(similar_values)
similar_series

In [None]:
similar_series.dtypes # correctly infers ints
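
For the debugging use of `dtypes` mentioned earlier, here is a hedged sketch of one way to track down a malformed value. The data is made up, and `pd.to_numeric` with `errors="coerce"` is just one possible cleanup strategy, not necessarily the one this course uses.

```python
import pandas as pd

# A column that should be numeric, with one made-up malformed entry.
raw = pd.Series([21.0, 22.8, "n/a", 18.7])
print(raw.dtype)  # object, because the values are of mixed types

# errors="coerce" turns anything unparseable into NaN instead of raising.
cleaned = pd.to_numeric(raw, errors="coerce")
print(cleaned.dtype)
print(cleaned.isna().sum())  # number of entries that failed to parse
```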

In [None]:
# Encode transmission as in the mtcars documentation: 0 = automatic, 1 = manual
trans_map = dict(A=0, M=1)
dfcars['am'] = dfcars['am'].apply(lambda x: trans_map[x])
dfcars.head()

In [None]:
dfcars.dtypes

### Saving

Let's save this out:

In [None]:
dfcars.to_csv("data/mtcars-cleaned.csv", index=False)

Pandas `describe` gives us a nice statistical summary, in its own dataframe:

In [None]:
dfcars.describe()

### Accessing columns

Like in a dictionary, we can get each column. The type of the resulting column is a Pandas **Series**. Indeed a dataframe is formed by pasting together many series. 

In [None]:
type(dfcars['carb'])

In [None]:
dfcars['carb'] #you can also use df.carb (won't work for column names with spaces)

One may also access columns using a "property"-like notation, but this won't work for column names that have spaces in them.

In [None]:
dfcars.carb

## Pandas is built on top of `numpy`

`numpy` is `Python`'s numerical programming library. It provides arrays in multiple dimensions, and operations that work on these arrays. You ought to use `numpy` for scientific programming, because regular `python` lists are too slow for large numerical computations.

You can get the `numpy` arrays corresponding to `pandas` series and dataframes using the `values` attribute.
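
As a small illustration of "operations that work on these arrays", compare a list comprehension with the equivalent array expression. The weights below are made up, and the sketch shows only the style difference; it does not measure speed.

```python
import numpy as np

weights_klb = [2.62, 2.875, 2.32]  # hypothetical weights in units of 1000 lbs

# With a list, the conversion is an explicit Python-level loop.
weights_lb_list = [w * 1000 for w in weights_klb]

# With an array, one expression applies the operation to every element.
weights_lb_array = np.array(weights_klb) * 1000

print(weights_lb_list)
print(weights_lb_array)
```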

In [None]:
dfcars['carb'].values

The `dtype` attribute works for these arrays as well.

In [None]:
dfcars['carb'].values.dtype

You can construct `numpy` arrays yourself.

In [None]:
my_array = np.array([1,2,3], dtype="int64")
my_array

In [None]:
my_array = np.array([1,2,3,4,5], dtype="float64")
my_array

> YOUR TURN NOW

>Create an array of 10 random numbers from the Normal distribution with 0 mean and standard deviation 1

Hint: If you're not sure how to proceed, have a look at the `numpy` random distribution documentation: [https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.randn.html](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.randn.html)

In [None]:
# your code here


## Visualizing your data

You can first see what color palette you are using. If you make multiple curves in a plot, these are the colors that will be sequentially used.

In [None]:
color_palette = sns.color_palette()
color_palette

Seaborn's `palplot` lets you visualize the colormap you are using. We are using matplotlib's default colormap, which looks like so:

In [None]:
sns.palplot(color_palette)

### Histograms

Below we call the `.hist` method of a `Pandas` series. This is an example of how `Pandas` makes visualization easy for us. See the `Pandas` Visualization documentation for more details.

In [None]:
dfcars.mpg.hist()
plt.xlabel("mpg");

We can also call the `matplotlib` `hist` function on a `pandas` series (or for that matter, anything listy, like a `python` list or a `numpy` array). 

Notice that the style is different: in the image above, `pandas` imposed some style settings on us. We'll learn more about plot styles as the course progresses.

In the `hist` function  you can also change the number of bins, and the transparency of the color.

In [None]:
plt.hist(dfcars.mpg, bins=15, alpha=0.5);
plt.xlabel("mpg");
plt.title("Miles per Gallon");

Here are the most commonly used matplotlib plotting routines.

![](images/mpl1.png)

We will convert the "mpg" series to a `numpy` array with the `values` attribute. Then we will set the `x` limits to be between 5 and 40.

In [None]:
plt.hist(dfcars.mpg.values, bins=15, alpha=0.5);
plt.xlim(5, 40)
plt.xlabel("mpg");
plt.title("Miles per Gallon");

Next we will use  `seaborn` to change the axes style. Here we use a dark grid. (See the `seaborn` docs for more styles). The rest is your job.

We also show how to label a plot and obtain a legend from the plot. A vertical line is drawn at the mean, capturing the color used in the histogram and using it again.

One can set bins using a list, and also label the histogram.  Finally one can "normalize" the histogram to put the frequencies on a probability scale.

In [None]:
with sns.axes_style("darkgrid"):
    color = sns.color_palette()[0]
    plt.hist(dfcars.mpg.values, bins=range(3, 40, 3), label="probability", color=color, density=True)
    plt.axvline(dfcars.mpg.mean(), 0, 1.0, color=color, label='Mean')
    plt.xlabel("mpg")
    plt.ylabel("Probability density")
    plt.title("Cars Miles per gallon Probability Graph")
    plt.legend()
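
To see what "normalizing" does to the bar heights, the sketch below uses `np.histogram`, which accepts the same `bins` and `density` arguments as `plt.hist`. The values are made up and merely resemble the mpg range.

```python
import numpy as np

values = np.array([10.4, 15.2, 17.8, 19.2, 21.0, 22.8, 30.4, 33.9])  # made-up data
edges = np.arange(3, 40, 3)  # same bin edges as range(3, 40, 3)

counts, _ = np.histogram(values, bins=edges)
density, _ = np.histogram(values, bins=edges, density=True)

widths = np.diff(edges)
print(counts.sum())              # how many values fall inside the bin range
print((density * widths).sum())  # total area under the normalized histogram
```

With `density=True` each bar height is the count divided by (total count × bin width), so the bars integrate to 1 rather than summing to the number of cars.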

### Plotting features against other features

Sometimes we want to see co-variation amongst our columns. A scatter-plot does this for us. 

In [None]:
with sns.plotting_context('poster'): #temporarily make plot large
    plt.scatter(dfcars.wt, dfcars.mpg)

You can use matplotlib's `plot` instead.

In [None]:
plt.plot(dfcars.wt, dfcars.mpg)

This gave us spaghetti lines. Why? 

One can use markers instead of lines. Also see how the semicolon suppresses text output like `[<matplotlib.lines.Line2D at 0x10ffbd978>]`. In a notebook, a trailing semicolon suppresses the display of the value returned by the last expression in a cell.

In [None]:
plt.plot(dfcars.wt, dfcars.mpg, 'o');

In [None]:
plt.plot(dfcars.wt, dfcars.mpg, 'o')
plt.show()

But what if we want to save our figure to a file? The file extension determines the format in which it is saved. Note that `savefig` needs to be in the same cell as the plotting commands. Go look at the files.

In [None]:
plt.plot(dfcars.wt, dfcars.mpg, 'o')
plt.savefig('foo1.pdf')
plt.savefig('foo2.png', bbox_inches='tight') #less whitespace around image
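
An alternative sketch, assuming you prefer to hold on to an explicit figure object: calling `savefig` on the figure itself makes it clear which plot is written out. The points, the temporary directory, and the file name `scatter.png` are all made up for this example.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([2.62, 2.875, 2.32], [21.0, 21.0, 22.8], "o")  # made-up (wt, mpg) points
ax.set_xlabel("wt")
ax.set_ylabel("mpg")

# Save into a temporary directory; the .png extension selects the format.
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "scatter.png")
fig.savefig(png_path, bbox_inches="tight")
plt.close(fig)

print(os.path.exists(png_path))
```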

## Masks and Queries

A dataframe is useless if you can't slice, dice, and sort it. To do this, one needs to use the concept of a **boolean mask**.

In [None]:
dfcars.mpg < 20

This gives us Trues and Falses. Such a series is called a **mask**.  A mask  is the basis of filtering. We can do:

In [None]:
dfcars[dfcars.mpg < 20]

Notice that the dataframe (spreadsheet) has been filtered down to only include those cars with `mpg < 20`. The rows with `False` in the mask have been eliminated, and those with `True` in the mask have been kept.
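
Masks can also be combined. Here is a minimal sketch on a made-up four-row dataframe (the name `toy` and its values are hypothetical), using `&` for "and", `|` for "or", and `~` for "not". The parentheses around each comparison are needed because these operators bind more tightly than `<` and `==`.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["car_a", "car_b", "car_c", "car_d"],
    "mpg": [21.0, 15.2, 18.7, 30.4],
    "cyl": [6, 8, 8, 4],
})

low_mpg = toy.mpg < 20  # a boolean mask, one entry per row

both = toy[low_mpg & (toy.cyl == 8)]    # low mpg AND 8 cylinders
either = toy[low_mpg | (toy.cyl == 4)]  # low mpg OR 4 cylinders
not_low = toy[~low_mpg]                 # everything the mask rejects

print(both)
print(either)
print(not_low)
```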

In [None]:
np.sum(dfcars.mpg < 20)

Why did that work? The booleans are coerced to integers as below:

In [None]:
1*True, 1*False

If we count the number of Trues, and divide by the total, we'll get the fraction of  cars with `mpg < 20`. Thus you can get probabilities by computing sample means (since you divide by the total number of items, and those not fitting the query give 0).

In [None]:
np.mean(dfcars.mpg < 20)

Or directly in Pandas, which works since `dfcars.mpg` is a `pandas` Series:

In [None]:
(dfcars.mpg < 20).mean()
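
The same idea extends to fractions within a subgroup: filter first, then take the mean of the mask. This is a sketch on made-up data; `toy` and its values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "mpg": [21.0, 15.2, 18.7, 30.4, 21.5],
    "cyl": [6, 8, 8, 4, 8],
})

# Fraction of all cars with mpg < 20.
overall = (toy.mpg < 20).mean()

# Fraction of 8-cylinder cars with mpg < 20: restrict the rows, then average the mask.
eight_cyl = toy[toy.cyl == 8]
among_eight = (eight_cyl.mpg < 20).mean()

print(overall, among_eight)
```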