![pandas](pandas.jpg)

# [`pandas`](http://pandas.pydata.org/) Can Make Your Life Easier

## Review

1. Python makes it easier to traverse the whole analytic pipeline.
2. It is open source.
3. It is flexible enough to build your own products and custom solutions.
4. It has an awesome and accessible community/ecosystem.
5. It's free.

## What is `pandas`?

**Website's Definition**

>*`pandas` is an open source BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.*

**My Definition**

>*`pandas` is a critical part of the Python scientific computing stack. It is built on top of [`numpy`](http://www.numpy.org/) and provides tools for working with multi-dimensional data in a tabular arrangement. These data management tools include, but are not limited to, the following:*
+ Reading and writing data from many formats (text files, Excel, [SQL](https://en.wikipedia.org/wiki/SQL) databases, [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format)) and the web;
+ Reshaping of data;
+ Extensive hierarchical indexing/filtering/subsetting capability;
+ Time series and general data alignment;
+ Merging, SQL-type joins, and concatenation;
+ Convenient tools for handling missing data.

>*In general, `pandas` has proven to be so useful that a number of libraries have been retrofitted to ensure compatibility (most notably [`scikit-learn`](http://scikit-learn.org/stable/)), and subsequent libraries have used it as a basis.*
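
To make a few of the capabilities listed above concrete, here is a minimal sketch of a SQL-type join, concatenation, and missing-data handling. The tables `left` and `right` and their column names are made up for illustration; they are not part of the data used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables (not from the notebook's data)
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'y': [10.0, np.nan, 40.0]})

# SQL-style left join on the shared 'key' column
merged = pd.merge(left, right, on='key', how='left')

# Count the missing values in 'y' (unmatched key or missing source value)
n_missing = merged['y'].isnull().sum()

# Replace missing values in 'y' with a default
filled = merged.fillna({'y': 0.0})

# Concatenate the two tables row-wise; columns absent from a table become NaN
stacked = pd.concat([left, right], ignore_index=True)

print(merged)
```

A left join keeps every row of `left`, so keys without a partner in `right` simply end up with `NaN`, which the missing-data tools can then count or fill.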

In [34]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import seaborn as sb
from IPython.display import HTML

## Basic Building Blocks

### Series

In [3]:
#Generate random vector of data
rand_vec=np.random.uniform(size=6)

print('Here is our random vector of data:',rand_vec)

#Capture vector in a Series
ser=Series(rand_vec)

print('\nHere is our random vector housed in a Series object')
ser

Here is our random vector of data: [ 0.85148964  0.98516028  0.60355308  0.79261065  0.87291105  0.91455937]

Here is our random vector housed in a Series object


0    0.851490
1    0.985160
2    0.603553
3    0.792611
4    0.872911
5    0.914559
dtype: float64

Not much functionality over a `numpy` array here, but Series have a few more tricks...

In [4]:
#Generate a copy of the Series (plain assignment would only create an alias)
ser2=ser.copy()

#Generate 'meaningful' row labels
idx=pd.Index(['a','b','c','d','e','f'])
print('This is a standalone index object:',idx)

#Reassign new labels to Series index
ser2.index=idx
ser2.index.name='label'
print('\nSeries with improved row labels from our new index')
print(ser2)

#Use an index label to reassign a value
ser2.loc['c']=.5
print('\nSeries with value in row `c` modified')
print(ser2)

#Capture 3 largest values
print('\nThe 3 largest values in the Series')
print(ser2.nlargest(3))

#Calculate the cumulative sum
print('\nThe cumulative sum of the Series values')
print(ser2.cumsum())

#Capture 2x3 array (reshape the underlying numpy array)
print('\nJust an example of Fortranish (?) ancestry\n')
print(ser2.values.reshape(2,3))

This is a standalone index object: Index([u'a', u'b', u'c', u'd', u'e', u'f'], dtype='object')

Series with improved row labels from our new index
label
a    0.851490
b    0.985160
c    0.603553
d    0.792611
e    0.872911
f    0.914559
dtype: float64

Series with value in row `c` modified
label
a    0.851490
b    0.985160
c    0.500000
d    0.792611
e    0.872911
f    0.914559
dtype: float64

The 3 largest values in the Series
label
b    0.985160
f    0.914559
e    0.872911
dtype: float64

The cumulative sum of the Series values
label
a    0.851490
b    1.836650
c    2.336650
d    3.129261
e    4.002172
f    4.916731
dtype: float64

Just an example of Fortranish (?) ancestry

[[ 0.85148964  0.98516028  0.5       ]
 [ 0.79261065  0.87291105  0.91455937]]


The thing to remember is that a Series behaves like a [dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) on steroids: every index label maps to a value.

In [5]:
ser2.to_dict()

{'a': 0.85148964498636193,
 'b': 0.9851602841006103,
 'c': 0.5,
 'd': 0.79261065436661571,
 'e': 0.87291104616709148,
 'f': 0.91455936903772372}
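
The dictionary analogy can be pushed a little further. The sketch below uses made-up values (the names `s` and `t` are illustrative only) to show dict-style membership and lookup, plus one thing a plain dictionary does not do: arithmetic that aligns on labels.

```python
import pandas as pd

# Hypothetical values, chosen only for illustration
s = pd.Series({'a': 1.0, 'b': 2.0, 'c': 3.0})

has_b = 'b' in s          # membership tests the index labels, like dict keys
b_val = s['b']            # label lookup, like dict access
labels = list(s.keys())   # keys() returns the index

# Unlike a dict, arithmetic aligns on labels; unmatched labels become NaN
t = pd.Series({'b': 10.0, 'c': 20.0, 'd': 30.0})
total = s + t

print(total)
```

Labels `a` and `d` appear in only one of the two Series, so their sums come back as `NaN` rather than raising an error.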

### DataFrames

DataFrame objects are collections of Series objects that share a common index.

In [6]:
#Generate DataFrame
d=DataFrame(np.random.uniform(size=20).reshape(4,5),
            columns='a b c d e'.split(' '),
            index=['one','two','three','four'])

d

              a         b         c         d         e
one    0.753688  0.935552  0.870894  0.782522  0.951156
two    0.288132  0.407981  0.994355  0.121560  0.401432
three  0.303235  0.931048  0.724357  0.643065  0.396567
four   0.321368  0.245475  0.438714  0.059332  0.624612
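
The "collection of Series" idea can be checked directly. This is a small sketch with made-up numbers (the names `a`, `b`, and `df` are illustrative): a dictionary of Series that share an index becomes a DataFrame, and pulling out a column or a row hands back a Series.

```python
import pandas as pd

idx = ['one', 'two', 'three']

# Two hypothetical Series that share an index
a = pd.Series([1.0, 2.0, 3.0], index=idx)
b = pd.Series([4.0, 5.0, 6.0], index=idx)

# A dict of Series becomes a DataFrame with one column per Series
df = pd.DataFrame({'a': a, 'b': b})

col = df['a']         # selecting a column returns a Series
row = df.loc['two']   # selecting a row returns a Series indexed by column name

print(type(col).__name__)
```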


Indexing is super easy!

In [7]:
#Two equivalent ways to select columns `b` and `c` from row `two`
print(d.loc['two'][['b','c']])
print(d.loc['two',['b','c']])

b    0.407981
c    0.994355
Name: two, dtype: float64
b    0.407981
c    0.994355
Name: two, dtype: float64
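
For comparison, here is a hedged sketch of the three most common ways to select data: by label, by position, and by boolean condition. The frame `df` is a made-up 4x3 example rather than the random data above.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x3 frame with the same style of row labels as above
df = pd.DataFrame(np.arange(12.0).reshape(4, 3),
                  columns=['a', 'b', 'c'],
                  index=['one', 'two', 'three', 'four'])

by_label = df.loc['two', ['b', 'c']]   # label-based: row 'two', columns 'b' and 'c'
by_position = df.iloc[1, [1, 2]]       # position-based: the same two cells
subset = df[df['a'] > 3]               # boolean mask: rows where 'a' exceeds 3

print(subset)
```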


Creating new variables is a breeze...

In [8]:
#Sum of two existing columns
d['f']=d['a']+d['c']
#Deviation of `d` from its column mean
d['d_mean']=d['d']-d['d'].mean()

d

              a         b         c         d         e         f    d_mean
one    0.753688  0.935552  0.870894  0.782522  0.951156  1.624582  0.380902
two    0.288132  0.407981  0.994355  0.121560  0.401432  1.282487 -0.280060
three  0.303235  0.931048  0.724357  0.643065  0.396567  1.027592  0.241445
four   0.321368  0.245475  0.438714  0.059332  0.624612  0.760082 -0.342287


...as are element-wise transformations.

In [9]:
print('***PLUS ONE***')
print(d.applymap(lambda x: x+1))

def pos_only(x):
    #Replace negative values with NaN; keep everything else
    if x<0:
        return np.nan
    else:
        return x

print('\n***POSITIVE ONLY***')
print(d.applymap(pos_only))

***PLUS ONE***
              a         b         c         d         e         f    d_mean
one    1.753688  1.935552  1.870894  1.782522  1.951156  2.624582  1.380902
two    1.288132  1.407981  1.994355  1.121560  1.401432  2.282487  0.719940
three  1.303235  1.931048  1.724357  1.643065  1.396567  2.027592  1.241445
four   1.321368  1.245475  1.438714  1.059332  1.624612  1.760082  0.657713

***POSITIVE ONLY***
              a         b         c         d         e         f    d_mean
one    0.753688  0.935552  0.870894  0.782522  0.951156  1.624582  0.380902
two    0.288132  0.407981  0.994355  0.121560  0.401432  1.282487       NaN
three  0.303235  0.931048  0.724357  0.643065  0.396567  1.027592  0.241445
four   0.321368  0.245475  0.438714  0.059332  0.624612  0.760082       NaN
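
Both transformations above can also be written without a Python-level function call per cell. The sketch below is one possible vectorized alternative, shown on a made-up frame `df`; it is not the approach used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical data containing negative values
df = pd.DataFrame({'x': [1.0, -2.0, 3.0], 'y': [-0.5, 0.5, 1.5]})

# Arithmetic broadcasts over every cell
plus_one = df + 1

# where() keeps cells that satisfy the condition and puts NaN elsewhere
positive_only = df.where(df >= 0)

print(positive_only)
```

On large frames the vectorized forms are usually faster, since the work happens inside `numpy` rather than in a Python loop.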


### Multidimensional Data (MultiIndex Functionality)

Suppose we want to add a third dimension to our data.

In [10]:
#Generate container to hold component DFs
df_list=[]

#Generate names for third dimension positions
third_names=['front','middle','back']

#For three positions in the third dimension...
for lab in third_names:
    #...generate the corresponding section of raw data...
    d=DataFrame(np.random.uniform(size=20).reshape(4,5),columns='a b c d e'.split(' '))
    #...name the columns dimension...
    d.columns.name='dim1'
    #...generate second and third dims (to go in index)...
    d['dim2']=['one','two','three','four']
    d['dim3']=lab
    #...set index...
    d.set_index(['dim3','dim2'],inplace=True)
    #...and throw the DF in the container
    df_list.append(d)
    
#Concatenate component DFs together
d3=pd.concat(df_list)

print(d3)

dim1                 a         b         c         d         e
dim3   dim2                                                   
front  one    0.659332  0.719239  0.543840  0.467657  0.093750
       two    0.769227  0.438601  0.992004  0.122398  0.716938
       three  0.147056  0.646780  0.553824  0.552882  0.796306
       four   0.560300  0.136490  0.981936  0.515182  0.284379
middle one    0.168459  0.128259  0.462295  0.478126  0.590974
       two    0.269342  0.990826  0.622359  0.133947  0.067463
       three  0.151369  0.591225  0.389912  0.811437  0.995876
       four   0.664913  0.300043  0.865354  0.774986  0.946232
back   one    0.198127  0.806268  0.237037  0.835868  0.346327
       two    0.280616  0.160595  0.317330  0.075784  0.098964
       three  0.337967  0.430571  0.985119  0.532588  0.395301
       four   0.613907  0.360727  0.129215  0.150569  0.973085
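
The loop above builds the index by hand. A more compact route, sketched here under the assumption that every component frame shares the same row labels, is to let `pd.concat` create the outer level through its `keys` and `names` arguments. The name `d3_alt` is illustrative, and the random values will differ from those printed above.

```python
import numpy as np
import pandas as pd

third_names = ['front', 'middle', 'back']
rows = ['one', 'two', 'three', 'four']

# One 4x5 frame of random data per position in the third dimension
pieces = [pd.DataFrame(np.random.uniform(size=(4, 5)),
                       index=rows, columns=list('abcde'))
          for _ in third_names]

# `keys` adds an outer index level; `names` labels both index levels
d3_alt = pd.concat(pieces, keys=third_names, names=['dim3', 'dim2'])
d3_alt.columns.name = 'dim1'

print(d3_alt.head())
```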


Everyone wants long form for something or other...

In [30]:
#Convert to long form
d3_long=d3.stack().sort_index(level=0)

#Correct ordering of dim2 and dim3 labels
d3_long=d3_long.reindex(['one','two','three','four'],level='dim2')
d3_long=d3_long.reindex(['front','middle','back'],level='dim3')

d3_long

dim3    dim2   dim1
front   one    a       0.659332
               b       0.719239
               c       0.543840
               d       0.467657
               e       0.093750
        two    a       0.769227
               b       0.438601
               c       0.992004
               d       0.122398
               e       0.716938
        three  a       0.147056
               b       0.646780
               c       0.553824
               d       0.552882
               e       0.796306
        four   a       0.560300
               b       0.136490
               c       0.981936
               d       0.515182
               e       0.284379
middle  one    a       0.168459
               b       0.128259
               c       0.462295
               d       0.478126
               e       0.590974
        two    a       0.269342
               b       0.990826
               c       0.622359
               d       0.133947
               e       0.067463
        three  a    

...and sometimes we want only a cross-section....

In [31]:
d3_long.xs('b',level='dim1')

dim3    dim2 
front   one      0.719239
        two      0.438601
        three    0.646780
        four     0.136490
middle  one      0.128259
        two      0.990826
        three    0.591225
        four     0.300043
back    one      0.806268
        two      0.160595
        three    0.430571
        four     0.360727
dtype: float64

...or a more complicated slice of the data.

In [32]:
d3_long.loc[slice('front','middle'),slice('two','four'),['b','d']]

dim3    dim2   dim1
front   two    b       0.438601
               d       0.122398
        three  b       0.646780
               d       0.552882
        four   b       0.136490
               d       0.515182
middle  two    b       0.990826
               d       0.133947
        three  b       0.591225
               d       0.811437
        four   b       0.300043
               d       0.774986
dtype: float64
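
Nested `slice(...)` calls get hard to read quickly. `pd.IndexSlice` allows the same kind of selection with ordinary slice syntax. The sketch below uses a made-up three-level Series `s` whose index is built in sorted order, which label slicing on a MultiIndex generally requires.

```python
import numpy as np
import pandas as pd

# Hypothetical three-level Series with a sorted index
index = pd.MultiIndex.from_product(
    [['back', 'front', 'middle'], [1, 2, 3, 4], ['a', 'b', 'c']],
    names=['dim3', 'dim2', 'dim1'])
s = pd.Series(np.arange(len(index), dtype=float), index=index)

idx = pd.IndexSlice

# dim3 from 'front' through 'middle', dim2 from 2 through 4, dim1 in {'a', 'c'}
sub = s.loc[idx['front':'middle', 2:4, ['a', 'c']]]

print(sub)
```

Label slices include both endpoints, so `2:4` selects 2, 3, and 4.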

We can kick any dimension out to columns.

In [33]:
d3_long.unstack(level='dim3')

dim3           front    middle      back
dim2 dim1
one  a      0.659332  0.168459  0.198127
     b      0.719239  0.128259  0.806268
     c      0.543840  0.462295  0.237037
     d      0.467657  0.478126  0.835868
     e      0.093750  0.590974  0.346327
two  a      0.769227  0.269342  0.280616
     b      0.438601  0.990826  0.160595
     c      0.992004  0.622359  0.317330
     d      0.122398  0.133947  0.075784
     e      0.716938  0.067463  0.098964
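
The reshaping steps in this section are easier to follow on a tiny example. Here is a minimal sketch on a made-up 2x2 frame `wide`: `stack()` moves the columns into the index, `reset_index()` flattens the result into an ordinary long table, and `unstack()` reverses the first step. The column name `value` is an arbitrary choice.

```python
import pandas as pd

# Hypothetical 2x2 frame with named row and column dimensions
wide = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]],
                    index=pd.Index(['one', 'two'], name='dim2'),
                    columns=pd.Index(['a', 'b'], name='dim1'))

# Column labels become the innermost index level
long_series = wide.stack()

# Index levels become ordinary columns
long_table = long_series.reset_index(name='value')

# Kick 'dim1' back out to the columns
wide_again = long_series.unstack(level='dim1')

print(long_table)
```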


## Input/Output

For the most part, I deal with `.csv` files. Consequently, my "go to" input method is `read_csv()` and my output method is `to_csv()`.  However, pandas can handle many data formats.

In [37]:
HTML('<iframe src=http://pandas.pydata.org/pandas-docs/stable/io.html width=1000 height=500></iframe>')
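
As a quick illustration of the `to_csv()`/`read_csv()` pair, here is a sketch of a round trip. To stay self-contained it writes to an in-memory buffer rather than a file on disk, and the table contents are made up.

```python
import io
import pandas as pd

# Hypothetical table, used only to illustrate the round trip
df = pd.DataFrame({'year': [1979, 1980], 'recipients': [10.5, 12.0]})

# Write to an in-memory buffer instead of a file on disk
buf = io.StringIO()
df.to_csv(buf, index=False)

# Rewind the buffer and read the data back
buf.seek(0)
df_back = pd.read_csv(buf)

print(df_back)
```

Passing `index=False` keeps the default integer index out of the file, which avoids an extra unnamed column when the data is read back.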

Maybe we are just really into Medicaid data housed in Stata sets...

In [45]:
#Identify data location
data_dir='O:/Analyst/Nadwa/MEDICAID/Output/'

#Read in Medicaid data
medicaid=pd.read_stata(data_dir+'Medicaid_Risk_Class(1979-2011).dta')

#Set index
medicaid.set_index(['year','medicaid_risk_class'],inplace=True)

#Sort index
medicaid.sort_index(level=0,inplace=True)

medicaid

                                  medicaidR      medicaidPR      medicaidTS  medicaid_risk_adjusted  medicaid_risk_adjusted_ts  caidbenefits  caidbenefits_ts        MSIS
year medicaid_risk_class
1979 Disabled                1766549.602577  1792424.562538  1822183.321922          2654113.142525             2770251.271622  5.889142e+09     6.146838e+09  2771284.25
     Children(non-disabled)  8599666.791954  8687674.821480  8690440.151665          8878543.122070             8851349.501953  2.639321e+09     2.631237e+09  8851419.00
     Adults(non-disabled)    4841204.803772  4879044.803635  4865456.443390          4917335.583862             4930890.444214  2.756057e+09     2.763655e+09  4931041.50
     Elderly(non-disabled)   3349458.301521  3485592.302284  3443745.251274          3349458.301521             3349458.301521  5.978108e+09     5.978108e+09  2269145.50
1980 Disabled                1778188.348282  1778188.348282  1778188.348282          1778188.348282             1778188.348282  4.560940e+09     4.560940e+09  2771284.25
     Children(non-disabled)  8943688.522209  8943688.522209  8943688.522209          8943688.522209             8943688.522209  3.073345e+09     3.073345e+09  8851419.00
     Adults(non-disabled)    5255202.770969  5255202.770969  5255202.770969          5255202.770969             5255202.770969  3.404802e+09     3.404802e+09  4931041.50
     Elderly(non-disabled)   3034660.692047  3034660.692047  3034660.692047          3034660.692047             3034660.692047  6.260995e+09     6.260995e+09  2269145.50
1982 Disabled                1708216.598488  1751541.578285  1809810.918831          2550158.309547             2770356.110580  8.254112e+09     8.966827e+09  2771284.25
     Children(non-disabled)  9103620.038713  9103620.038713  9103620.038713          9103620.038713             9103620.038713  3.947619e+09     3.947619e+09  8851419.00


Maybe we are dealing with a very large data set, and we have reason to read only chunks of it at a time.

In [53]:
#Read in Medicaid data lazily, 10 rows at a time
medicaid=pd.read_stata(data_dir+'Medicaid_Risk_Class(1979-2011).dta',chunksize=10)

#For each chunk, print the sum of Medicaid recipients (MSIS)
for i,chunk in enumerate(medicaid):
    print('Chunk #',i,'|',chunk['MSIS'].sum())

#With `chunksize` set, `medicaid` is a reader object rather than a DataFrame
medicaid

Chunk # 0 | 140465266.0
Chunk # 1 | 153757461.0
Chunk # 2 | 160122532.0
Chunk # 3 | 134729251.0
Chunk # 4 | 99609788.0
Chunk # 5 | 99522526.0
Chunk # 6 | 91359890.0
Chunk # 7 | 48325373.75
Chunk # 8 | 56027947.25
Chunk # 9 | 44845967.5
Chunk # 10 | 78189331.0
Chunk # 11 | 54905561.0
Chunk # 12 | 47546399.0


<pandas.io.stata.StataReader at 0x164b0d30>
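
The same chunking pattern is available for `.csv` files. Since the Stata file above is not something a reader can load, the sketch below substitutes a made-up 25-row CSV held in memory and accumulates a running total chunk by chunk.

```python
import io
import pandas as pd

# Hypothetical CSV with 25 rows, held in memory for illustration
csv_text = 'value\n' + '\n'.join(str(i) for i in range(25))

# With chunksize set, read_csv returns an iterator of DataFrames
reader = pd.read_csv(io.StringIO(csv_text), chunksize=10)

# Accumulate a running total without loading every row at once
chunk_sizes = []
total = 0
for chunk in reader:
    chunk_sizes.append(len(chunk))
    total += chunk['value'].sum()

print(total)
```

The last chunk is simply shorter when the row count is not a multiple of `chunksize`.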

In [58]:
HTML('<iframe src=http://bl.ocks.org/mbostock/e1e1e7e2c360bec054ba width=1100 height=500></iframe>')