# Advanced Pandas

Pandas: "Python Data Analysis Library"

Pandas is _THE_ data science library for Python. It's built on top of several very good and very fast numerical computing libraries, NumPy chief among them. If you're familiar with R, Pandas started out as essentially an R clone, although it has developed and grown over the past decade, and it has the power of Python behind it. Its integration with Jupyter notebooks is where Pandas excels, allowing for documented, repeatable data analysis.

You use Pandas entirely from Python code, but the syntax doesn't always feel Pythonic.

The data for this work can be downloaded from: https://www.kaggle.com/donorschoose/io/data


## Agenda

* SQLAlchemy to read SQL Server
* "SQL Queries" in Pandas
    * select
    * filtering
    * grouping
    * ordering
    * merging
* Cleaning Data
* Intro to building graphs
* Extras:
    * Additional graphs


In [3]:
import pandas as pd
import numpy as np

# Read SQL Server Table with SQLAlchemy

In [None]:
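import sqlalchemy
# Placeholder connection URL: substitute your own credentials and ODBC DSN
eng = sqlalchemy.create_engine('mssql+pyodbc://user:password@my_dsn')
# Placeholder query: `my_table`, `x`, and `y` stand in for your own names
pd.read_sql('select x, y from my_table', eng)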
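
With no SQL Server at hand, the same `pd.read_sql` call can be tried against an in-memory SQLite database. This is only a sketch under assumptions: the table name `points` and its columns are made up for illustration, and a standard-library `sqlite3` connection stands in for the SQLAlchemy engine.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for SQL Server
conn = sqlite3.connect(":memory:")
pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]}).to_sql("points", conn, index=False)

# Parameterized query: the ? placeholder is filled from `params`
result = pd.read_sql(
    "select x, y from points where x > ? order by x", conn, params=(1,)
)
print(result)
conn.close()
```

Passing values through `params` rather than formatting them into the SQL string avoids quoting problems; the placeholder style (`?` here) depends on the database driver.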

# SQL Queries in Pandas

In [2]:
proj_df = pd.read_csv('io/Projects.csv', engine='python')  # main data file
school_df = pd.read_csv('io/Schools.csv')  # secondary, to demonstrate merging
teach_df = pd.read_csv('io/Teachers.csv')  # secondary, to demonstrate merging

In [4]:
proj_df.head()

Unnamed: 0,Project ID,School ID,Teacher ID,Teacher Project Posted Sequence,Project Type,Project Title,Project Essay,Project Short Description,Project Need Statement,Project Subject Category Tree,Project Subject Subcategory Tree,Project Grade Level Category,Project Resource Category,Project Cost,Project Posted Date,Project Expiration Date,Project Current Status,Project Fully Funded Date
0,7685f0265a19d7b52a470ee4bac883ba,e180c7424cb9c68cb49f141b092a988f,4ee5200e89d9e2998ec8baad8a3c5968,25,Teacher-Led,Stand Up to Bullying: Together We Can!,Did you know that 1-7 students in grades K-12 ...,Did you know that 1-7 students in grades K-12 ...,"My students need 25 copies of ""Bullying in Sch...",Applied Learning,"Character Education, Early Development",Grades PreK-2,Technology,361.8,2013-01-01,2013-05-30,Fully Funded,2013-01-11
1,f9f4af7099061fb4bf44642a03e5c331,08b20f1e2125103ed7aa17e8d76c71d4,cca2d1d277fb4adb50147b49cdc3b156,3,Teacher-Led,Learning in Color!,"Help us have a fun, interactive listening cent...","Help us have a fun, interactive listening cent...","My students need a listening center, read alon...","Applied Learning, Literacy & Language","Early Development, Literacy",Grades PreK-2,Technology,512.85,2013-01-01,2013-05-31,Expired,
2,afd99a01739ad5557b51b1ba0174e832,1287f5128b1f36bf8434e5705a7cc04d,6c5bd0d4f20547a001628aefd71de89e,1,Teacher-Led,Help Second Grade ESL Students Develop Languag...,Visiting or moving to a new place can be very ...,Visiting or moving to a new place can be very ...,My students need beginning vocabulary audio ca...,Literacy & Language,ESL,Grades PreK-2,Supplies,435.92,2013-01-01,2013-05-30,Fully Funded,2013-05-22
3,c614a38bb1a5e68e2ae6ad9d94bb2492,900fec9cd7a3188acbc90586a09584ef,8ed6f8181d092a8f4c008b18d18e54ad,40,Teacher-Led,Help Bilingual Students Strengthen Reading Com...,Students at our school are still working hard ...,Students at our school are still working hard ...,My students need one copy of each book in The ...,Literacy & Language,"ESL, Literacy",Grades 3-5,Books,161.26,2013-01-01,2013-05-31,Fully Funded,2013-02-06
4,ec82a697fab916c0db0cdad746338df9,3b200e7fe3e6dde3c169c02e5fb5ae86,893173d62775f8be7c30bf4220ad0c33,2,Teacher-Led,Help Us Make Each Minute Count!,"""Idle hands"" were something that Issac Watts s...","""Idle hands"" were something that Issac Watts s...","My students need items such as Velcro, two pou...",Special Needs,Special Needs,Grades 3-5,Supplies,264.19,2013-01-01,2013-05-30,Fully Funded,2013-01-01


In [7]:
# That's a lot, let's only look at a subset
proj_df[['Project Type', 'Project Title', 'Project Essay']].head()

Unnamed: 0,Project Type,Project Title,Project Essay
0,Teacher-Led,Stand Up to Bullying: Together We Can!,Did you know that 1-7 students in grades K-12 ...
1,Teacher-Led,Learning in Color!,"Help us have a fun, interactive listening cent..."
2,Teacher-Led,Help Second Grade ESL Students Develop Languag...,Visiting or moving to a new place can be very ...
3,Teacher-Led,Help Bilingual Students Strengthen Reading Com...,Students at our school are still working hard ...
4,Teacher-Led,Help Us Make Each Minute Count!,"""Idle hands"" were something that Issac Watts s..."


In [9]:
proj_df[proj_df['Project Type'] != 'Teacher-Led'][['Project Type', 'Project Title', 'Project Essay']].head()

Unnamed: 0,Project Type,Project Title,Project Essay
93059,Professional Development,Table for Teacher Led Centers,Teaching today is all about differentiating in...
292539,Professional Development,Breaking in the Bard,Student involvement is the focus of my classro...
298111,Professional Development,Project Our Development!,"""We are building deeper school community by ha..."
298126,Professional Development,Emerging Teacher Through Emerging Technology L...,Living in a geographically isolated area of th...
298136,Professional Development,Because My Students Deserve the Best,Great teachers help create great students. In ...


In [11]:
teacher_led = proj_df['Project Type'] == 'Teacher-Led'
teacher_led.value_counts()

True     1092163
False      17854
Name: Project Type, dtype: int64

What is `teacher_led`? Series? DataFrame?
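
One way to answer that question is to inspect the object directly. The sketch below uses a three-row toy frame (an assumption, standing in for `proj_df`) to show that a comparison on a column yields a boolean Series, which can then be used as a row filter.

```python
import pandas as pd

# Toy stand-in for proj_df
df = pd.DataFrame(
    {"Project Type": ["Teacher-Led", "Professional Development", "Teacher-Led"]}
)

mask = df["Project Type"] == "Teacher-Led"
print(type(mask).__name__, mask.dtype)  # Series bool

# Indexing with the mask keeps the rows where it is True
filtered = df[mask]
print(filtered)
```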

In [12]:
proj_df[teacher_led].head()

Unnamed: 0,Project ID,School ID,Teacher ID,Teacher Project Posted Sequence,Project Type,Project Title,Project Essay,Project Short Description,Project Need Statement,Project Subject Category Tree,Project Subject Subcategory Tree,Project Grade Level Category,Project Resource Category,Project Cost,Project Posted Date,Project Expiration Date,Project Current Status,Project Fully Funded Date
0,7685f0265a19d7b52a470ee4bac883ba,e180c7424cb9c68cb49f141b092a988f,4ee5200e89d9e2998ec8baad8a3c5968,25,Teacher-Led,Stand Up to Bullying: Together We Can!,Did you know that 1-7 students in grades K-12 ...,Did you know that 1-7 students in grades K-12 ...,"My students need 25 copies of ""Bullying in Sch...",Applied Learning,"Character Education, Early Development",Grades PreK-2,Technology,361.8,2013-01-01,2013-05-30,Fully Funded,2013-01-11
1,f9f4af7099061fb4bf44642a03e5c331,08b20f1e2125103ed7aa17e8d76c71d4,cca2d1d277fb4adb50147b49cdc3b156,3,Teacher-Led,Learning in Color!,"Help us have a fun, interactive listening cent...","Help us have a fun, interactive listening cent...","My students need a listening center, read alon...","Applied Learning, Literacy & Language","Early Development, Literacy",Grades PreK-2,Technology,512.85,2013-01-01,2013-05-31,Expired,
2,afd99a01739ad5557b51b1ba0174e832,1287f5128b1f36bf8434e5705a7cc04d,6c5bd0d4f20547a001628aefd71de89e,1,Teacher-Led,Help Second Grade ESL Students Develop Languag...,Visiting or moving to a new place can be very ...,Visiting or moving to a new place can be very ...,My students need beginning vocabulary audio ca...,Literacy & Language,ESL,Grades PreK-2,Supplies,435.92,2013-01-01,2013-05-30,Fully Funded,2013-05-22
3,c614a38bb1a5e68e2ae6ad9d94bb2492,900fec9cd7a3188acbc90586a09584ef,8ed6f8181d092a8f4c008b18d18e54ad,40,Teacher-Led,Help Bilingual Students Strengthen Reading Com...,Students at our school are still working hard ...,Students at our school are still working hard ...,My students need one copy of each book in The ...,Literacy & Language,"ESL, Literacy",Grades 3-5,Books,161.26,2013-01-01,2013-05-31,Fully Funded,2013-02-06
4,ec82a697fab916c0db0cdad746338df9,3b200e7fe3e6dde3c169c02e5fb5ae86,893173d62775f8be7c30bf4220ad0c33,2,Teacher-Led,Help Us Make Each Minute Count!,"""Idle hands"" were something that Issac Watts s...","""Idle hands"" were something that Issac Watts s...","My students need items such as Velcro, two pou...",Special Needs,Special Needs,Grades 3-5,Supplies,264.19,2013-01-01,2013-05-30,Fully Funded,2013-01-01


In [14]:
# Start with a single filter
proj_df[(proj_df['Project Resource Category'] == 'Books')].head(1)

Unnamed: 0,Project ID,School ID,Teacher ID,Teacher Project Posted Sequence,Project Type,Project Title,Project Essay,Project Short Description,Project Need Statement,Project Subject Category Tree,Project Subject Subcategory Tree,Project Grade Level Category,Project Resource Category,Project Cost,Project Posted Date,Project Expiration Date,Project Current Status,Project Fully Funded Date
3,c614a38bb1a5e68e2ae6ad9d94bb2492,900fec9cd7a3188acbc90586a09584ef,8ed6f8181d092a8f4c008b18d18e54ad,40,Teacher-Led,Help Bilingual Students Strengthen Reading Com...,Students at our school are still working hard ...,Students at our school are still working hard ...,My students need one copy of each book in The ...,Literacy & Language,"ESL, Literacy",Grades 3-5,Books,161.26,2013-01-01,2013-05-31,Fully Funded,2013-02-06


In [16]:
proj_df[((proj_df['Project Resource Category'] == 'Books') & (proj_df['Project Grade Level Category'] == 'Grades 3-5'))].head(5)

Unnamed: 0,Project ID,School ID,Teacher ID,Teacher Project Posted Sequence,Project Type,Project Title,Project Essay,Project Short Description,Project Need Statement,Project Subject Category Tree,Project Subject Subcategory Tree,Project Grade Level Category,Project Resource Category,Project Cost,Project Posted Date,Project Expiration Date,Project Current Status,Project Fully Funded Date
3,c614a38bb1a5e68e2ae6ad9d94bb2492,900fec9cd7a3188acbc90586a09584ef,8ed6f8181d092a8f4c008b18d18e54ad,40,Teacher-Led,Help Bilingual Students Strengthen Reading Com...,Students at our school are still working hard ...,Students at our school are still working hard ...,My students need one copy of each book in The ...,Literacy & Language,"ESL, Literacy",Grades 3-5,Books,161.26,2013-01-01,2013-05-31,Fully Funded,2013-02-06
39,e2465804079e523b9f10aaa883ac5952,c630533eead7bf9518fde62d11acb7e2,6dd71a63f4a6e986452030c8dc885bdf,81,Teacher-Led,The Kid In This Book Is Just Like Me!,Most of the children in our school are first g...,Most of the children in our school are first g...,My students need 14 books about children from ...,"Literacy & Language, History & Civics","Literacy, Social Sciences",Grades 3-5,Books,259.55,2013-01-01,2013-05-31,Fully Funded,2013-02-26
104,d9ec1c9e3fedd1761c891947e68e657a,c757bfcebc9cc149d170c1fe59846ddf,1ad3e90f3e1dbdc61c2dd5d73490c912,1,Teacher-Led,A Blast To The Past...,It's time to take a blast to the past... My st...,It's time to take a blast to the past... My st...,My students need more resources such as biogra...,History & Civics,"Civics & Government, History & Geography",Grades 3-5,Books,575.27,2013-01-01,2013-05-30,Fully Funded,2013-02-26
106,af067944df14500b72bd997d0ddf2252,a5a28ec13de034e2285bb6e1f098d048,d7332fe1edf597bfbea52a5ef92eb759,4,Teacher-Led,We Got The Center. We Need The Books!,We recently got a listening center. Now we nee...,We recently got a listening center. Now we nee...,My students need five fabulous sets of books o...,"Literacy & Language, History & Civics","Literacy, Social Sciences",Grades 3-5,Books,311.92,2013-01-01,2013-05-30,Fully Funded,2013-02-01
131,f3c47a70996b9f79e6a3c0395b100bdb,12f89b5b2de0726ca38952d85dacc36b,6f52cbe3a6466dcd56834d82997778e9,22,Teacher-Led,A Skillful Study of Both Fiction and History,"""If history were taught in the form of stories...","""If history were taught in the form of stories...",My students need 27 copies of Blood on the Riv...,"History & Civics, Literacy & Language","History & Geography, Literature & Writing",Grades 3-5,Books,569.47,2013-01-01,2013-05-31,Fully Funded,2013-01-06
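
Combining filters relies on the element-wise operators `&` (and), `|` (or), and `~` (not), with each condition wrapped in parentheses. A minimal sketch on a toy frame (the rows are invented; only the column names come from the data above):

```python
import pandas as pd

df = pd.DataFrame({
    "Project Resource Category": ["Books", "Technology", "Books", "Supplies"],
    "Project Grade Level Category": [
        "Grades 3-5", "Grades 3-5", "Grades PreK-2", "Grades 6-8",
    ],
})

# Naming each mask keeps long filters readable
is_books = df["Project Resource Category"] == "Books"
is_3_5 = df["Project Grade Level Category"] == "Grades 3-5"

both = df[is_books & is_3_5]     # SQL: WHERE a AND b
either = df[is_books | is_3_5]   # SQL: WHERE a OR b
not_books = df[~is_books]        # SQL: WHERE NOT a
in_list = df[df["Project Resource Category"].isin(["Books", "Supplies"])]  # SQL: IN (...)
```

Python's `and`/`or` keywords do not work here because they try to reduce a whole Series to a single truth value.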


In [None]:
proj_df[
    ((proj_df['Project Resource Category'] == 'Books') & (proj_df['Project Grade Level Category'] == 'Grades 3-5'))
]['Project Cost'].head(20).sum()

In [None]:
proj_df['Project Type'].isnull().value_counts()  # also, .notnull()

## Groupby

In [None]:
# This counts the non-null values in every column, per group
proj_df.groupby('Project Type').count()

In [None]:
# To just get the size of the group by, use `.size()`
proj_df.groupby('Project Type').size()

In [None]:
# ...or specify only one column for count
proj_df.groupby('Project Type')['Project Cost'].count()
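
The difference between `.size()` and `.count()` shows up once a column has missing values. A small sketch with invented rows (the `Student-Led` label and the costs are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Project Type": ["Teacher-Led", "Teacher-Led", "Student-Led"],
    "Project Cost": [100.0, np.nan, 250.0],
})

# size() counts rows per group, missing values included
sizes = df.groupby("Project Type").size()

# count() counts only the non-null values in the chosen column
counts = df.groupby("Project Type")["Project Cost"].count()

print(sizes)
print(counts)
```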

### Multilevel Groupby (Multilevel Index)

In [None]:
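# Passing a renaming dict to .agg() on a single column was removed in pandas 1.0;
# a list of aggregation names gives the same size and mean columns
proj_df.groupby(['Project Grade Level Category', 'Project Resource Category'])['Project Cost'].agg(['size', 'mean'])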
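
If custom column names are wanted in the result, named aggregation (available since pandas 0.25) is one option. The sketch below uses invented rows, and the output names `n_projects` and `mean_cost` are arbitrary; `reset_index()` then turns the multilevel index back into ordinary columns.

```python
import pandas as pd

df = pd.DataFrame({
    "Project Grade Level Category": ["Grades 3-5", "Grades 3-5", "Grades PreK-2"],
    "Project Resource Category": ["Books", "Books", "Supplies"],
    "Project Cost": [100.0, 300.0, 50.0],
})

summary = (
    df.groupby(["Project Grade Level Category", "Project Resource Category"])["Project Cost"]
    .agg(n_projects="size", mean_cost="mean")  # keyword = output column name
)

# Rows are addressed with a tuple, one entry per index level
print(summary.loc[("Grades 3-5", "Books")])

# Flatten the multilevel index into regular columns
flat = summary.reset_index()
```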

## Merging (Joining)

In [None]:
proj_df[['Project ID', 'School ID', 'Teacher ID']].head()

In [None]:
proj_teach_df = pd.merge(proj_df, teach_df, on='Teacher ID')
proj_teach_df.head()

In [None]:
all_df = pd.merge(proj_teach_df, school_df, on='School ID', how='left')
all_df.head()
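
The `how` argument plays the role of SQL's join type. As a rough sketch on two tiny invented frames (the `School Name` values are made up), the default inner join drops unmatched rows, while a left join keeps them and fills the missing side with `NaN`; `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

projects = pd.DataFrame(
    {"Project ID": ["p1", "p2", "p3"], "School ID": ["s1", "s2", "s9"]}
)
schools = pd.DataFrame(
    {"School ID": ["s1", "s2"], "School Name": ["North", "South"]}
)

# Default how='inner': only projects whose school is present
inner = pd.merge(projects, schools, on="School ID")

# how='left': every project is kept, matched or not
left = pd.merge(projects, schools, on="School ID", how="left", indicator=True)
print(left)
```

Comparing row counts before and after a merge is a cheap check that the join did what was intended.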

## Union

In [None]:
df1 = proj_df[proj_df['Project Cost'] >= 100] 
df2 = proj_df[proj_df['Project Cost'] <= 100] 
df1.shape, df2.shape

In [None]:
pd.concat([df1, df2]).shape

In [None]:
# df1 and df2 were both cut from proj_df, so compare against its shape
pd.concat([df1, df2]).drop_duplicates().shape, proj_df.shape
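
The same idea on a three-row toy frame (invented costs) makes the duplicate visible: the row costing exactly 100 satisfies both filters, so plain `pd.concat` behaves like SQL's `UNION ALL`, and adding `drop_duplicates()` behaves like `UNION`.

```python
import pandas as pd

df = pd.DataFrame({"Project Cost": [50.0, 100.0, 150.0]})
df1 = df[df["Project Cost"] >= 100]  # rows 100 and 150
df2 = df[df["Project Cost"] <= 100]  # rows 50 and 100

union_all = pd.concat([df1, df2])      # the 100 row appears twice
union = union_all.drop_duplicates()    # duplicates removed

print(union_all.shape, union.shape)  # (4, 1) (3, 1)
```

Note that `pd.concat` keeps the original index labels, so `union_all` has a repeated label; pass `ignore_index=True` if a fresh index is preferred.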

## Order by

In [None]:
all_df.sort_values('Project Fully Funded Date', ascending=False).head(20)

In [None]:
all_df.sort_values(
    ['Project Fully Funded Date', 'Project Cost'], 
    ascending=[False, True]
    # inplace=True  # if you don't want to create a new dataframe
).head(20)
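
One detail worth knowing when sorting on a column with gaps, such as the funded date of expired projects: missing values are placed last by default, whatever the sort direction. A sketch with invented rows:

```python
import pandas as pd

df = pd.DataFrame({
    "Project Fully Funded Date": ["2013-01-11", None, "2013-05-22", "2013-05-22"],
    "Project Cost": [361.8, 512.85, 435.92, 100.0],
})
df["Project Fully Funded Date"] = pd.to_datetime(df["Project Fully Funded Date"])

# Newest date first; ties on date are broken by ascending cost
ordered = df.sort_values(
    ["Project Fully Funded Date", "Project Cost"],
    ascending=[False, True],
)
print(ordered)

# na_position='first' moves the missing dates to the top instead
na_first = df.sort_values("Project Fully Funded Date", na_position="first")
```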

# A Brief Note About DataFrames

1. `inplace=True`: DataFrame methods usually default to creating (and returning) an entirely new DataFrame. This is not always desirable if, e.g., your data consumes a lot of memory. Many methods accept an option `inplace=True`, which modifies the current DataFrame and returns `None`.
2. Copy vs View: Some operations create a copy of the existing DataFrame, while others only create a view so that the same underlying data is shared. Which one you get depends on the operation and on the pandas version: filtering with a boolean mask generally gives a copy, while simple slicing can give a view. With a view, changes to the data can affect the original DataFrame, whereas this is not the case with a copy. IF YOU WANT TO MAKE SURE YOU HAVE A COPY, call `.copy()` explicitly.
3. Convert SAS date to Python datetime: see https://stackoverflow.com/questions/36500348/convert-a-sas-datetime-in-pandas (NB: you may need to set `unit='d'`).
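
A short sketch of the first two points, using an invented three-row frame: the explicit `.copy()` guarantees that edits to the subset leave the original alone, and `sort_values` without `inplace=True` returns a new object.

```python
import pandas as pd

df = pd.DataFrame({"Project Cost": [50.0, 150.0, 250.0]})

# Explicit copy: safe to modify without touching df
expensive = df[df["Project Cost"] > 100].copy()
expensive["Project Cost"] = 0.0
print(df["Project Cost"].tolist())  # [50.0, 150.0, 250.0]

# Without inplace=True a new DataFrame comes back and df keeps its order
sorted_df = df.sort_values("Project Cost", ascending=False)
```

Skipping the `.copy()` and assigning into a filtered frame is what typically triggers pandas' `SettingWithCopyWarning` in older versions.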


# Cleaning/Modifying Data

* df.loc
* df.iloc
* paradigm of setting new column
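
The three items above can be sketched on a toy frame. The rows, the non-default index, the `Is Funded` column name, and the cap of 500 are all assumptions chosen for illustration: `.loc` selects by label, `.iloc` by integer position, and assigning to a new column name creates the column.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Project Cost": [361.8, 512.85, 435.92],
        "Project Current Status": ["Fully Funded", "Expired", "Fully Funded"],
    },
    index=[10, 20, 30],  # labels deliberately differ from positions
)

by_label = df.loc[20, "Project Cost"]  # row labelled 20
by_position = df.iloc[0, 0]            # first row, first column

# Setting a new column: assign a Series (or scalar) to a new name
df["Is Funded"] = df["Project Current Status"] == "Fully Funded"

# Conditional update: .loc with a boolean mask and a column label
df.loc[df["Project Cost"] > 500, "Project Cost"] = 500.0
print(df)
```

Doing the conditional update through a single `.loc[mask, column]` call, rather than chaining `df[mask][column] = ...`, ensures the assignment reaches the original frame.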

# Plotting

In [None]:
%matplotlib inline
proj_df[['Project Cost', 'Project Type']].plot()
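
A line plot of raw rows is rarely the most readable view of this kind of data. One possible alternative, sketched here on invented rows and assuming matplotlib is installed, is to aggregate first and draw one bar per group:

```python
import pandas as pd

df = pd.DataFrame({
    "Project Type": ["Teacher-Led", "Teacher-Led", "Professional Development"],
    "Project Cost": [100.0, 300.0, 500.0],
})

# One value per project type, then one bar per value
mean_cost = df.groupby("Project Type")["Project Cost"].mean()
ax = mean_cost.plot(kind="bar")  # returns a matplotlib Axes
ax.set_ylabel("Mean Project Cost")
```

In a notebook with inline plotting enabled, the figure appears under the cell; the returned `ax` can be used for further tweaks such as titles and axis limits.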