# pandas: Python Data Analysis Library

*Data Science Training at Urban*

*Python, class 3, 3/29/2017*

*by Jeff Levy (jlevy@urban.org)*

In [None]:
from IPython.display import Image
Image(filename='pandas.jpg')

-------------------

**What is Pandas?**

In short, Pandas adds a new data type to Python, called a "dataframe", for loading and working with tabular datasets.  It provides functionality similar to that of R, SQL, Stata and SAS.

**What is Pandas good at?**

Pandas runs inside the Python environment, so dataframes can be combined with the simple, powerful built-in elements of the language.  

It allows you to easily integrate data work into a complete project, e.g. web scraping or natural language processing.  

Pandas also works seamlessly with the excellent Python machine learning library, scikit-learn.

**What is Pandas less good at?**

Pandas, and Python in general, lag behind R, SAS and Stata for hypothesis testing, causal inference, and other econometric modeling.

**What is NumPy?**

NumPy is the low-level (i.e. fast) math package for Python.  It provides array types and numerical routines such as matrix algebra.  Pandas is built on top of NumPy, so NumPy functions can very often be applied directly to Pandas objects, and vice versa.
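As a small illustration of how the two libraries interoperate, here is a minimal sketch. The `heights` Series and its values are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical heights in inches, used only for illustration
heights = pd.Series([55.0, 60.0, 61.0, 51.0], name='height')

# A NumPy function applied to a pandas Series returns a Series
log_heights = np.log(heights)
print(type(log_heights))

# A pandas Series can be converted to a plain NumPy array
arr = heights.to_numpy()
print(type(arr))

# NumPy and pandas compute the same simple statistics
print(np.mean(arr), heights.mean())
```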

**What is Matplotlib?**

Matplotlib is the most commonly used library for building graphs in Python.  Pandas uses it "under the hood" to render graphs of data, and the two can also be used together directly.

-----------------

In [None]:
import numpy as np
import pandas as pd

data = {'name'  :['Sue', 'Joe', 'Tom', 'Jen', 'Bob' ],
        'age'   :[14.0,  15.0,  13.0,  13.0,  np.nan],
        'sex'   :['f',   'm',   'm',   'f',   'm'   ],
        'height':['55',  '60',  '61',  '51',  ''    ],
        'nuts?' :[True,  True,  False, False, np.nan],
        'stuff' :[1.11,  None,  True,  '?',   [1,2] ]}

df = pd.DataFrame(data)
df

**Questions:**

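 - Why are the columns out of order?  (Older versions of Pandas sorted dictionary keys alphabetically; recent versions keep the order in which they were written.)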

 - What are the unlabeled numbers down the left-hand side?

 - What is np.nan?  Is it the same as None, or maybe ''?

-------

**The Basics of Exploring a Dataframe**

In [None]:
df.columns

In [None]:
df.index

In [None]:
df['name']

In [None]:
df[['name', 'age']]

In [None]:
df[:2]

In [None]:
df.loc[2]

In [None]:
df.loc[[0,4], ['name', 'age']]
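The older `.ix` indexer was removed in pandas 1.0. Its two roles are now split between `.loc` (select by label) and `.iloc` (select by position). With the default 0, 1, 2, ... index the two look identical, so the sketch below uses a hypothetical `people` dataframe with a non-default index to show the difference.

```python
import pandas as pd

# Hypothetical frame with a non-default index so labels and positions differ
people = pd.DataFrame({'name': ['Sue', 'Joe', 'Tom'],
                       'age':  [14.0, 15.0, 13.0]},
                      index=[10, 20, 30])

by_label    = people.loc[20]     # the row whose index label is 20
by_position = people.iloc[1]     # the second row, counting from 0

print(by_label['name'], by_position['name'])

# .loc also accepts lists of row labels and column names
subset = people.loc[[10, 30], ['name', 'age']]
print(subset)
```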

In [None]:
df['age'] <= 13

In [None]:
df[ df['age'] <= 13 ]

In [None]:
df['name'].str.endswith('e')

------

**A _Quick_ Glance Behind the Curtain: Functions vs Methods**

In [None]:
print( df['age'] )

In [None]:
from numpy import mean

print( mean(df['age'])  )
print( df['age'].mean() )

In [None]:
print( type(mean)    )
print( type(df.mean) )

------

**NaN: Not a Number**

In [None]:
print(1 == 1)
print(1 == 1.0)
print('a' == 'a')
print([1,2] == [1,2])
print({'a':1, 'b':2} == {'a':1, 'b':2})
print(None == None)
print(False == False)

def somefunc(x):
    return x**2

print(somefunc == somefunc)

In [None]:
print(np.nan == np.nan)

In [None]:
print(np.mean(   [5, 10]        ))
print(np.mean(   [5, 10, np.nan]))
print(np.nanmean([5, 10, np.nan]))

In [None]:
print( df['age'].mean()             )
print( df['age'].mean(skipna=False) )
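Because NaN does not compare equal to itself, `==` cannot be used to find missing values. A minimal sketch of the usual alternatives, using a made-up `ages` Series:

```python
import numpy as np
import pandas as pd

value = np.nan

print(value == value)     # False: NaN never equals anything, including itself
print(np.isnan(value))    # True
print(pd.isna(value))     # True
print(pd.isna(None))      # True: pandas also treats None as missing
print(pd.isna(''))        # False: an empty string is not a missing value

# Hypothetical ages, with one missing entry
ages = pd.Series([14.0, 15.0, 13.0, 13.0, np.nan])
print(ages.isna().sum())  # number of missing values in the Series
```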

-------

**Discuss Booleans == True**

In [None]:
bool_test = [1==1, 1==1.0, 'a'=='a', [1,2]==[1,2], {'a':1, 'b':2}=={'a':1, 'b':2}, None==None, False==False, somefunc==somefunc]
print(bool_test)

In [None]:
all(bool_test)

In [None]:
'a' not in ['b', 'c', 'd']

In [None]:
3 % 4 == 0

In [None]:
True and False

In [None]:
True and not False

In [None]:
(True and False) or (True and not False)

In [None]:
df[ (df['age'] > 13) | (df['height'] > 60) ]

----

**Oh No, What's That?!  A Crash Intro to Data Types (and Tracebacks!)**

In [None]:
df

In [None]:
df.dtypes

In [None]:
pd.to_numeric(df['height'])

In [None]:
df['height'] = pd.to_numeric(df['height'])

In [None]:
df.dtypes
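Real data often contains entries that cannot be parsed as numbers at all. By default `pd.to_numeric` raises an error on those. The sketch below, on a hypothetical `raw` Series, shows the `errors='coerce'` option, which converts unparseable entries to NaN instead.

```python
import pandas as pd

# Hypothetical column of heights read in as text
raw = pd.Series(['55', '60', 'unknown', '51'])

# errors='coerce' turns anything unparseable into NaN instead of raising
cleaned = pd.to_numeric(raw, errors='coerce')

print(cleaned)
print(cleaned.dtype)
```

Coercing is convenient, but it hides bad values silently, so it is worth checking how many NaN values appear afterwards.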

-----

**Now Back to Our Boolean Discussion == True**

In [None]:
df[ (df['age'] > 13) | (df['height'] > 60) ]

In [None]:
df[ (df['height'] == df['height'].max()) | (df['height'] == df['height'].min()) ]

In [None]:
df[ (df['sex'] == 'm') & (df['nuts?'] != True) ]

In [None]:
df.isnull()

In [None]:
df.notnull() == ~df.isnull()

In [None]:
df[ df.isnull().any(axis=1) ]

In [None]:
df[ df.columns[df.isnull().any(axis=0)] ]

-----

**Practice: Assigning Dataframe Values**

  1. Remove the column named *stuff* (you may need to look this up, and there are at least three ways to accomplish it)
  2. Add a column named *grade* with the integer value *8* for everyone
  3. Add a column named *color* with the string values *blue*, *red*, *blue*, *red*, *blue*, starting from the top (index 0)
  4. Replace all the NaN values for Bob (index 4) with appropriate dtypes
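As a hint, the same kinds of assignment can be sketched on a separate toy dataframe. The `toy` frame and its values are invented for this example, and this is only one of several possible approaches.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame, unrelated to the class data
toy = pd.DataFrame({'name':  ['Ann', 'Bob'],
                    'age':   [12.0, np.nan],
                    'stuff': ['?', None]})

toy = toy.drop(columns='stuff')      # one of several ways to remove a column
toy['grade'] = 8                     # a scalar is broadcast to every row
toy['color'] = ['blue', 'red']       # a list must match the number of rows
toy.loc[1, 'age'] = 13.0             # set a single cell by row label and column

print(toy)
```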
  
------

**Loading Dataframes from File**

In [None]:
df = pd.read_csv('some_cities.csv')
df

------

**The Dataframe Index**

In [None]:
df.index

In [None]:
df.set_index('state', inplace=True)

In [None]:
df[:5]

In [None]:
df.loc['MI']

In [None]:
df.reset_index(inplace=True)

In [None]:
df[:5]

**Multi-Index**

In [None]:
df.set_index(['state', 'city'], inplace=True)

In [None]:
df[:5]

In [None]:
michigan = df.loc['MI']
michigan

In [None]:
michigan.loc['Detroit']
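The two-step lookup above can also be written in one step. The sketch below uses a hypothetical `cities` dataframe, since the actual columns of `some_cities.csv` are not shown here.

```python
import pandas as pd

# Hypothetical data; the columns in some_cities.csv may differ
cities = pd.DataFrame({'state': ['MI', 'MI', 'OH'],
                       'city':  ['Detroit', 'Lansing', 'Columbus'],
                       'value': [1, 2, 3]})
cities = cities.set_index(['state', 'city'])

# A single outer label returns every row under it, indexed by the inner level
print(cities.loc['MI'])

# A tuple selects one (state, city) combination in a single step
print(cities.loc[('MI', 'Detroit')])

# .xs can select on the inner level alone
print(cities.xs('Lansing', level='city'))
```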

**Datetime Index**

In [None]:
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])

In [None]:
df.set_index('date', inplace=True)

In [None]:
df[:5]

In [None]:
df.index
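One benefit of a datetime index is that rows can be selected with date strings. A minimal sketch, assuming a hypothetical `events` dataframe with `year`, `month` and `day` columns like the ones used above:

```python
import pandas as pd

# Hypothetical year/month/day columns, mirroring the cells above
events = pd.DataFrame({'year':  [2016, 2016, 2017],
                       'month': [1, 7, 3],
                       'day':   [15, 4, 29],
                       'value': [10, 20, 30]})

events['date'] = pd.to_datetime(events[['year', 'month', 'day']])
events = events.set_index('date').sort_index()

# Partial-string indexing: every row from 2016
print(events.loc['2016'])

# A label slice between two dates (both endpoints are included)
print(events.loc['2016-07-01':'2017-12-31'])
```

Sorting the index first is a cautious choice here, since date slicing behaves most predictably on a sorted index.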