# Data for Pandas

## Preliminaries

In [None]:
# The usual preamble
import pandas as pd

In [2]:
# Make the graphs a bit prettier, and bigger
import matplotlib.pyplot as plt
# 'display.mpl_style' is no longer a pandas option; use a matplotlib style sheet instead
plt.style.use('ggplot')
pd.set_option('display.width', 5000)
pd.set_option('display.max_columns', 60)

#figsize(15, 5)




## Importing data

Previously we've seen how to read in a CSV file and eyeball the data with the `describe` method of the resulting `DataFrame`.

In [3]:
import pandas as pd
pima = pd.read_csv("pima.csv", index_col=0)
pima.describe()

| | npreg | glu | bp | skin | bmi | ped | age |
|---|---|---|---|---|---|---|---|
| count | 332.0 | 332.0 | 332.0 | 332.0 | 332.0 | 332.0 | 332.0 |
| mean | 3.48494 | 119.259036 | 71.653614 | 29.162651 | 33.239759 | 0.528389 | 31.316265 |
| std | 3.283634 | 30.501138 | 12.799307 | 9.748068 | 7.282901 | 0.363278 | 10.636225 |
| min | 0.0 | 65.0 | 24.0 | 7.0 | 19.4 | 0.085 | 21.0 |
| 25% | 1.0 | 96.0 | 64.0 | 22.0 | 28.175 | 0.266 | 23.0 |
| 50% | 2.0 | 112.0 | 72.0 | 29.0 | 32.9 | 0.44 | 27.0 |
| 75% | 5.0 | 136.25 | 80.0 | 36.0 | 37.2 | 0.67925 | 37.0 |
| max | 17.0 | 197.0 | 110.0 | 63.0 | 67.1 | 2.42 | 81.0 |


Other statistics can be found by looking at the output of `dir`.  For a complex class like `DataFrame`, this generally works best if you know what you're looking for.  For example, if you know a little statistics, you might wonder whether a `DataFrame` has a way of getting a `median` as well as a mean:

In [5]:
dir(pima)

['T',
 '_AXIS_ALIASES',
 '_AXIS_IALIASES',
 '_AXIS_LEN',
 '_AXIS_NAMES',
 '_AXIS_NUMBERS',
 '_AXIS_ORDERS',
 '_AXIS_REVERSED',
 '_AXIS_SLICEMAP',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_wrap__',
 '__bool__',
 '__bytes__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__div__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__idiv__',
 '__imul__',
 '__init__',
 '__invert__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdiv__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__rpow__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__'

In [4]:
# numeric_only=True skips the non-numeric 'type' column
pima.median(numeric_only=True)

npreg      2.00
glu      112.00
bp        72.00
skin      29.00
bmi       32.90
ped        0.44
age       27.00
dtype: float64
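
Scanning the full `dir` listing by eye is tedious. A minimal sketch of narrowing it down by name fragment follows; it assumes a small stand-in frame (`toy` and its values are invented for illustration and are not taken from `pima.csv`):

```python
import pandas as pd

# Hypothetical stand-in for pima; the values are made up for illustration.
toy = pd.DataFrame({'glu': [100.0, 120.0, 140.0], 'bp': [60.0, 70.0, 80.0]})

# Keep only public attribute names that contain the fragment we care about.
matches = [name for name in dir(toy)
           if not name.startswith('_') and 'med' in name]
print(matches)
```

The same filter works with any fragment, so `'std'` or `'quant'` can be tried when hunting for other statistics.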

You can find other ways of reading in data by looking at the `DataFrame` documentation.

In [4]:
pima?

Individual rows of the description above can also be computed directly, each returned as a single `Series`. Again, the `dir` listing of methods is a good guide if you're unsure what the method is called.  For example:

In [8]:
# numeric_only=True skips the non-numeric 'type' column
pima.std(numeric_only=True)

npreg     3.283634
glu      30.501138
bp       12.799307
skin      9.748068
bmi       7.282901
ped       0.363278
age      10.636225
dtype: float64

And:

In [6]:
pima.max()

npreg      17
glu       197
bp        110
skin       63
bmi      67.1
ped      2.42
age        81
type      Yes
dtype: object
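
To make the link between `describe` and the individual methods concrete, here is a small sketch on an invented frame (`toy` is hypothetical; its string column loosely mimics the `type` column seen in the `max` output). It assumes a reasonably recent pandas, where reductions such as `std` accept `numeric_only`:

```python
import pandas as pd

# Hypothetical frame with one non-numeric column; values are made up.
toy = pd.DataFrame({'glu': [100.0, 120.0, 140.0],
                    'bp': [60.0, 70.0, 80.0],
                    'type': ['Yes', 'No', 'Yes']})

summary = toy.describe()        # numeric columns only, by default
std_row = summary.loc['std']    # one row of the summary, as a Series

# numeric_only=True skips the string column, which recent pandas
# versions would otherwise refuse to reduce.
std_direct = toy.std(numeric_only=True)

print(std_row)
print(std_direct)
```

Both Series hold the same numbers; the only difference is that the row pulled from `describe` carries the name `std`.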

## Using your own data

Using your own data to make data frames sometimes requires a little preprocessing.  The basic
idea is to create a dictionary whose keys are column names and whose values
are the sequences of values for each column.  The trick comes when you want the columns to share a
meaningful index, as is often necessary when you want to show meaningful plots.  That index has to be a predefined sequence of values as well.
The following example illustrates the idea.

## From [data science lab](http://datasciencelab.wordpress.com/2013/12/21/beautiful-plots-with-pandas-and-matplotlib/)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
 
countries = ['France','Spain','Sweden','Germany','Finland','Poland','Italy',
             'United Kingdom','Romania','Greece','Bulgaria','Hungary',
             'Portugal','Austria','Czech Republic','Ireland','Lithuania','Latvia',
             'Croatia','Slovakia','Estonia','Denmark','Netherlands','Belgium']
extensions = [547030,504782,450295,357022,338145,312685,301340,243610,238391,
              131940,110879,93028,92090,83871,78867,70273,65300,64589,56594,
              49035,45228,43094,41543,30528]
populations = [63.8,47,9.55,81.8,5.42,38.3,61.1,63.2,21.3,11.4,7.35,
               9.93,10.7,8.44,10.6,4.63,3.28,2.23,4.38,5.49,1.34,5.61,
               16.8,10.8]
life_expectancies = [81.8,82.1,81.8,80.7,80.5,76.4,82.4,80.5,73.8,80.8,73.5,
                    74.6,79.9,81.1,77.7,80.7,72.1,72.2,77,75.4,74.4,79.4,81,80.5]
data = {'extension' : pd.Series(extensions, index=countries),
        'population' : pd.Series(populations, index=countries),
        'life expectancy' : pd.Series(life_expectancies, index=countries)}
 
df = pd.DataFrame(data)
# DataFrame.sort was removed from pandas; sort_values is the replacement
df = df.sort_values('life expectancy')
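
When every column is already a plain list of the same length, a more compact variant is to pass the lists directly and supply the shared row labels once through the `index` argument of `pd.DataFrame`. The sketch below uses three of the countries from the lists above; `small` and `labels` are names introduced here for illustration.

```python
import pandas as pd

labels = ['France', 'Spain', 'Poland']

# Each list lines up position by position with the labels.
small = pd.DataFrame(
    {'extension': [547030, 504782, 312685],
     'population': [63.8, 47, 38.3],
     'life expectancy': [81.8, 82.1, 76.4]},
    index=labels)

# Order the rows by one column, lowest value first.
small = small.sort_values('life expectancy')
print(small)
```

Wrapping each column in its own `pd.Series`, as in the cell above, is more useful when the columns come from different sources and may not list the index values in the same order, since pandas then aligns them by label.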

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=1)
col_axes = []
for i, c in enumerate(df.columns):
    col_axes.append(df[c].plot(kind='bar', ax=axes[i], figsize=(12, 10), title=c))
for axs in col_axes[:2]:
    labels = axs.get_xticklabels()
    for label in labels:
        label.set_rotation(40)
# Add vertical space between the subplots so titles and tick labels don't overlap
fig.subplots_adjust(hspace=.6)
plt.show()

## How to find some practice data sets

Further datasets can be found in `R`, in the `datasets` package, which often comes preloaded.  In other words, in `R` one can look in the Package Manager to check what's loaded; if the `datasets` package is there, one consults the package docs (click on the package in the package manager), sees that the dataset `USAccDeaths` is present, and just types

> USAccDeaths

at the `R` prompt.  The dataset or a summary of it will appear.  One can then write it to a CSV file as
follows:

```
> write.csv(USAccDeaths, "/Users/gawron/USAccDeaths.csv")
```

Similarly, many `R` packages come with data, even if they are not preloaded.  For example, the `MASS` package, which includes support functions and data for Venables and Ripley's Modern Applied Statistics with S, provides the following [`Pima Indians diabetes` set](https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes):

```
> library(MASS)
> write.csv(Pima.te, "/Users/gawron/pima.csv")
```
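
By default, `write.csv` stores R's row names as an unnamed first column, which is presumably why the earlier `read_csv` call passes `index_col=0`. A small sketch of the effect, using an in-memory string in place of the exported file; the two data rows are invented placeholders, not actual `Pima.te` values:

```python
import io
import pandas as pd

# Text shaped like write.csv output: the first header cell is empty and
# each row starts with its R row name. The values are invented placeholders.
csv_text = '''"","npreg","glu","type"
"1",1,100,"Yes"
"2",2,110,"No"
'''

plain = pd.read_csv(io.StringIO(csv_text))
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())    # row names show up as an extra column
print(indexed.columns.tolist())  # row names have become the index
```

Without `index_col=0`, the row names would be treated as data and would show up in the output of `describe`.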