# Data Science with Python: Pandas

Pandas: "Python Data Analysis Library"

Pandas is _the_ data science library for Python. It is built on top of several fast numerical computing libraries, most notably NumPy. If you're familiar with R, Pandas began largely as a Python take on R's data frames, although it has developed and grown over the past decade, and it has the rest of the Python ecosystem behind it. Its integration with Jupyter notebooks is where Pandas excels, since it allows for documented, repeatable data analysis.

Pandas is used entirely from Python (parts of its core are compiled for speed), but its syntax doesn't always feel Pythonic.


## Agenda

* Series
* DataFrame
* Exploring and visualizing real data
* Extras:
    * `apply` function
    * String Manipulation
    * Categorical values

## How does Pandas compare to...

* SQL: http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html
* SAS: http://pandas.pydata.org/pandas-docs/stable/comparison_with_sas.html
* R: http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html

In [1]:
import pandas as pd  # only import we'll need!

## Series

Series aren't often used on their own, but they're important to understand because every DataFrame column is a Series.

A Series is a single column of data with one dtype. If the values don't share a type, pandas falls back to the generic `object` dtype.

In [2]:
# series
s = pd.Series([1, 2, 3, 4, 5, 6])
s

0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

In [3]:
type(s), s.dtype, len(s), s.shape

(pandas.core.series.Series, dtype('int64'), 6, (6,))

In [4]:
s1 = pd.Series([True, False, False, True])
s1

0     True
1    False
2    False
3     True
dtype: bool

In [5]:
type(s1), s1.dtype, len(s1), s1.shape

(pandas.core.series.Series, dtype('bool'), 4, (4,))

In [6]:
pd.Series([1, True, 's'])

0       1
1    True
2       s
dtype: object
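
The last cell mixes an integer, a boolean, and a string, so pandas reports `object`. As a small sketch of how the dtype is inferred (the variable names are illustrative and not part of the original notebook):

```python
import pandas as pd

# All integers: pandas picks an integer dtype.
ints = pd.Series([1, 2, 3])

# Integers mixed with a float are upcast to float64.
floats = pd.Series([1, 2.5, 3])

# Values with no common numeric type fall back to `object`,
# which stores plain Python objects and is slower to compute on.
mixed = pd.Series([1, True, 's'])

print(ints.dtype, floats.dtype, mixed.dtype)
```

Checking `dtype` early is a cheap way to spot a column that was meant to be numeric but picked up a stray string.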

## Viewing Data

One of the most important things is being able to inspect the data you're working with. Here are a few common methods.

In [7]:
s.head()  # default to length 5

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [8]:
s.tail(1)

5    6
dtype: int64

In [9]:
s.describe()

count    6.000000
mean     3.500000
std      1.870829
min      1.000000
25%      2.250000
50%      3.500000
75%      4.750000
max      6.000000
dtype: float64
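
`describe` bundles several ordinary Series methods. As a quick check, the sketch below recomputes a few of those numbers one at a time; note that `std` is the sample standard deviation (it divides by n - 1).

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6])

mean = s.mean()          # 3.5
std = s.std()            # sample standard deviation (ddof=1), about 1.870829
q25 = s.quantile(0.25)   # 2.25, using linear interpolation
median = s.median()      # 3.5, the same value as the 50% row

print(mean, round(std, 6), q25, median)
```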

## DataFrames

You can combine multiple Series (or plain lists) into a single DataFrame, one per column.

In [10]:
df = pd.DataFrame({
    'A': s,
    'B': ['a', 'b', 'c', 'd', 'e', 'f']
})
df

|   | A | B |
|---|---|---|
| 0 | 1 | a |
| 1 | 2 | b |
| 2 | 3 | c |
| 3 | 4 | d |
| 4 | 5 | e |
| 5 | 6 | f |
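
When several Series are combined, pandas lines them up by index label rather than by position. A minimal sketch with two made-up Series (`left` and `right` are hypothetical names) shows the effect:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=[0, 1, 2])
right = pd.Series(['x', 'y', 'z'], index=[1, 2, 3])

# Rows are matched on index labels; a label missing from one
# Series is filled with NaN in that column.
combined = pd.DataFrame({'left': left, 'right': right})
print(combined)
```

The result has four rows (labels 0 through 3), with `NaN` wherever one side had no value for that label.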


In [11]:
# note the index: row labels that act much like a primary key
# pandas uses it to align rows in calculations and adds a default RangeIndex automatically
df.index

RangeIndex(start=0, stop=6, step=1)

In [12]:
df.columns

Index(['A', 'B'], dtype='object')
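
The index does not have to be the default range of integers. As a sketch (the name `by_letter` is illustrative), any column can be promoted to the index, after which `.loc` looks rows up by label and `.iloc` by position:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6],
    'B': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# Use column B as the index instead of the default RangeIndex.
by_letter = df.set_index('B')

# .loc selects by index label; .iloc selects by integer position.
by_label = by_letter.loc['c', 'A']
by_position = by_letter.iloc[2, 0]

print(by_label, by_position)
```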

In [13]:
df.head()

|   | A | B |
|---|---|---|
| 0 | 1 | a |
| 1 | 2 | b |
| 2 | 3 | c |
| 3 | 4 | d |
| 4 | 5 | e |


In [14]:
df.tail(2)

|   | A | B |
|---|---|---|
| 4 | 5 | e |
| 5 | 6 | f |


In [15]:
df.describe()

|       | A        |
|-------|----------|
| count | 6.0      |
| mean  | 3.5      |
| std   | 1.870829 |
| min   | 1.0      |
| 25%   | 2.25     |
| 50%   | 3.5      |
| 75%   | 4.75     |
| max   | 6.0      |


In [16]:
df.describe(include='all')

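|        | A        | B   |
|--------|----------|-----|
| count  | 6.0      | 6   |
| unique | NaN      | 6   |
| top    | NaN      | f   |
| freq   | NaN      | 1   |
| mean   | 3.5      | NaN |
| std    | 1.870829 | NaN |
| min    | 1.0      | NaN |
| 25%    | 2.25     | NaN |
| 50%    | 3.5      | NaN |
| 75%    | 4.75     | NaN |
| max    | 6.0      | NaN |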


In [17]:
type(df), len(df), df.shape

(pandas.core.frame.DataFrame, 6, (6, 2))

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
A    6 non-null int64
B    6 non-null object
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes


In [19]:
df.T

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| A | 1 | 2 | 3 | 4 | 5 | 6 |
| B | a | b | c | d | e | f |


In [20]:
df.sort_values(by='B', ascending=False)

|   | A | B |
|---|---|---|
| 5 | 6 | f |
| 4 | 5 | e |
| 3 | 4 | d |
| 2 | 3 | c |
| 1 | 2 | b |
| 0 | 1 | a |


In [21]:
# select a column
df.A

0    1
1    2
2    3
3    4
4    5
5    6
Name: A, dtype: int64

In [22]:
df['B']

0    a
1    b
2    c
3    d
4    e
5    f
Name: B, dtype: object
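
The two selection styles are not fully interchangeable. As a sketch on a small made-up frame (one column borrows the name `Project Cost` from the dataset used later), bracket access works for any column name and also accepts a list of names:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c'],
    'Project Cost': [361.80, 512.85, 435.92],
})

# A single label returns a Series.
cost = df['Project Cost']

# A list of labels returns a DataFrame with just those columns.
subset = df[['A', 'Project Cost']]

# Attribute access (df.A) only works for names that are valid
# Python identifiers, so 'Project Cost' needs the bracket form.
print(type(cost).__name__, subset.shape)
```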

## Use Real Data

DataFrames can be read from just about any data source you want: http://pandas.pydata.org/pandas-docs/stable/io.html

Most of these formats can be written as well as read. One exception is the proprietary SAS7BDAT format, which pandas can read but not write.

In [23]:
proj_df = pd.read_csv('io/Projects.csv', engine='python')  # dataset available on kaggle/donorschoose
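
`read_csv` takes many options that matter on a file this size. The sketch below uses a tiny in-memory CSV (made-up rows under three of the real column names) so that it runs without the Kaggle download; `usecols` and `parse_dates` are standard `read_csv` parameters, while `csv_text` and `sample_df` are names invented for this example.

```python
import io
import pandas as pd

# Stand-in for io/Projects.csv so the example is self-contained.
csv_text = """Project ID,Project Cost,Project Posted Date
p1,361.80,2013-01-01
p2,512.85,2013-01-01
p3,3020.59,2013-01-02
"""

sample_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['Project ID', 'Project Cost', 'Project Posted Date'],  # load only these columns
    parse_dates=['Project Posted Date'],  # parse as datetimes instead of strings
)
print(sample_df.dtypes)

# Writing back out works the same way in reverse.
out = sample_df.to_csv(index=False)
```

Loading only the columns you need and parsing dates up front can save both memory and later cleanup on a file with over a million rows.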

In [24]:
proj_df.shape

(1110017, 18)

In [25]:
proj_df.head()

,Project ID,School ID,Teacher ID,Teacher Project Posted Sequence,Project Type,Project Title,Project Essay,Project Short Description,Project Need Statement,Project Subject Category Tree,Project Subject Subcategory Tree,Project Grade Level Category,Project Resource Category,Project Cost,Project Posted Date,Project Expiration Date,Project Current Status,Project Fully Funded Date
0,7685f0265a19d7b52a470ee4bac883ba,e180c7424cb9c68cb49f141b092a988f,4ee5200e89d9e2998ec8baad8a3c5968,25,Teacher-Led,Stand Up to Bullying: Together We Can!,Did you know that 1-7 students in grades K-12 ...,Did you know that 1-7 students in grades K-12 ...,"My students need 25 copies of ""Bullying in Sch...",Applied Learning,"Character Education, Early Development",Grades PreK-2,Technology,361.8,2013-01-01,2013-05-30,Fully Funded,2013-01-11
1,f9f4af7099061fb4bf44642a03e5c331,08b20f1e2125103ed7aa17e8d76c71d4,cca2d1d277fb4adb50147b49cdc3b156,3,Teacher-Led,Learning in Color!,"Help us have a fun, interactive listening cent...","Help us have a fun, interactive listening cent...","My students need a listening center, read alon...","Applied Learning, Literacy & Language","Early Development, Literacy",Grades PreK-2,Technology,512.85,2013-01-01,2013-05-31,Expired,
2,afd99a01739ad5557b51b1ba0174e832,1287f5128b1f36bf8434e5705a7cc04d,6c5bd0d4f20547a001628aefd71de89e,1,Teacher-Led,Help Second Grade ESL Students Develop Languag...,Visiting or moving to a new place can be very ...,Visiting or moving to a new place can be very ...,My students need beginning vocabulary audio ca...,Literacy & Language,ESL,Grades PreK-2,Supplies,435.92,2013-01-01,2013-05-30,Fully Funded,2013-05-22
3,c614a38bb1a5e68e2ae6ad9d94bb2492,900fec9cd7a3188acbc90586a09584ef,8ed6f8181d092a8f4c008b18d18e54ad,40,Teacher-Led,Help Bilingual Students Strengthen Reading Com...,Students at our school are still working hard ...,Students at our school are still working hard ...,My students need one copy of each book in The ...,Literacy & Language,"ESL, Literacy",Grades 3-5,Books,161.26,2013-01-01,2013-05-31,Fully Funded,2013-02-06
4,ec82a697fab916c0db0cdad746338df9,3b200e7fe3e6dde3c169c02e5fb5ae86,893173d62775f8be7c30bf4220ad0c33,2,Teacher-Led,Help Us Make Each Minute Count!,"""Idle hands"" were something that Issac Watts s...","""Idle hands"" were something that Issac Watts s...","My students need items such as Velcro, two pou...",Special Needs,Special Needs,Grades 3-5,Supplies,264.19,2013-01-01,2013-05-30,Fully Funded,2013-01-01


In [26]:
proj_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1110017 entries, 0 to 1110016
Data columns (total 18 columns):
Project ID                          1110017 non-null object
School ID                           1110017 non-null object
Teacher ID                          1110017 non-null object
Teacher Project Posted Sequence     1110017 non-null int64
Project Type                        1110017 non-null object
Project Title                       1110011 non-null object
Project Essay                       1110016 non-null object
Project Short Description           1110014 non-null object
Project Need Statement              1110014 non-null object
Project Subject Category Tree       1109988 non-null object
Project Subject Subcategory Tree    1109988 non-null object
Project Grade Level Category        1110017 non-null object
Project Resource Category           1109981 non-null object
Project Cost                        1110017 non-null float64
Project Posted Date                 1110017 non

In [27]:
# goal: find projects that cost more than $1000; start by selecting the column
proj_df['Project Cost']

0           361.80
1           512.85
2           435.92
3           161.26
4           264.19
5           175.15
6          3020.59
7           566.19
8           339.20
9           566.73
10          400.19
11          314.33
12          580.02
13          294.60
14          494.13
15          897.28
16          605.87
17          861.18
18          452.05
19          696.60
20          231.60
21          662.84
22          193.19
23          980.35
24          322.35
25          490.38
26          559.99
27          327.44
28          300.88
29          175.06
            ...   
1109987     601.00
1109988    1277.08
1109989     197.67
1109990     194.06
1109991     340.02
1109992     326.33
1109993    4811.76
1109994     188.12
1109995     339.20
1109996     211.54
1109997     347.40
1109998     294.58
1109999     654.15
1110000     522.95
1110001     894.15
1110002     935.88
1110003     167.01
1110004     400.67
1110005    4790.60
1110006     445.20
1110007    1184.53
1110008     

In [28]:
proj_df['Project Cost'] > 1000

0          False
1          False
2          False
3          False
4          False
5          False
6           True
7          False
8          False
9          False
10         False
11         False
12         False
13         False
14         False
15         False
16         False
17         False
18         False
19         False
20         False
21         False
22         False
23         False
24         False
25         False
26         False
27         False
28         False
29         False
           ...  
1109987    False
1109988     True
1109989    False
1109990    False
1109991    False
1109992    False
1109993     True
1109994    False
1109995    False
1109996    False
1109997    False
1109998    False
1109999    False
1110000    False
1110001    False
1110002    False
1110003    False
1110004    False
1110005     True
1110006    False
1110007     True
1110008    False
1110009    False
1110010    False
1110011    False
1110012    False
1110013    False
1110014    Fal

In [29]:
# we can use this as a filter
proj_df[proj_df['Project Cost'] > 1000]

,Project ID,School ID,Teacher ID,Teacher Project Posted Sequence,Project Type,Project Title,Project Essay,Project Short Description,Project Need Statement,Project Subject Category Tree,Project Subject Subcategory Tree,Project Grade Level Category,Project Resource Category,Project Cost,Project Posted Date,Project Expiration Date,Project Current Status,Project Fully Funded Date
6,717c7a01215d532d68f6fe9e666c88c3,14a4351014125e8a4376c2aac594526d,dd9c029f01a66f862b4d1c209eaa4451,3,Teacher-Led,Experiencing India,Travel and hands on experiences outside the cl...,Travel and hands on experiences outside the cl...,My students need additional funding to finance...,Applied Learning,"College & Career Prep, Community Service",Grades 9-12,Trips,3020.59,2013-01-01,2013-05-27,Fully Funded,2013-01-05
44,b6be9d285565584e34b1d3d4e2c573c0,d17c80f26e260dddb5854304370df76c,c89df7db975ff3099178fcc5a5193f1c,2,Teacher-Led,Chromebooks Needed to Enhance New FIRST Roboti...,Do you remember your first computer? Those hug...,Do you remember your first computer? Those hug...,My students need 12 Chromebooks for new Roboti...,"Math & Science, Applied Learning","Applied Sciences, College & Career Prep",Grades 6-8,Technology,1552.44,2013-01-01,2013-05-30,Expired,
72,6efd90896f8d7e19a9c57fdc058ffd3b,b7c32bc8478e591d8e2e515da2c8708f,a8c0a573351d6b4ed100d405980dc769,3,Teacher-Led,Document Student Work With Epson,Mathematics in the middle grades has a tendenc...,Mathematics in the middle grades has a tendenc...,"My students need an Epson document scanner, no...",Math & Science,Mathematics,Grades 6-8,Technology,1028.92,2013-01-01,2013-01-30,Fully Funded,2013-01-04
76,15356ab77b93b15e45e05f5764449f9c,a1c8f567724ec219668406d5800bef16,71b15fcb955ebfbc15d614a116c78422,3,Teacher-Led,Experiencing Ionic and Covalent Bonds,How atoms bond with each other and how the res...,How atoms bond with each other and how the res...,My students need models which will help them f...,"Math & Science, Literacy & Language","Applied Sciences, ESL",Grades 9-12,Other,1274.94,2013-01-01,2013-05-31,Expired,
84,4a4e2c669dfabf17de471ccc64a28df7,e46d9f4a069d97684c818d1fd7f56a1b,cda43dabf58e4b2d1a0cea3a3e835e2d,3,Teacher-Led,Hands-On Math,"As a six year old, do you remember learning ab...","As a six year old, do you remember learning ab...",My students need hands-on math manipulatives a...,Math & Science,Mathematics,Grades PreK-2,Supplies,1021.20,2013-01-01,2013-05-30,Fully Funded,2013-01-08
85,e1bb9b68a9fa8d3a51ae9a129f9365e7,799448cf0fbca7d59dfdb9aec6c1f86d,8ce72df217952bb64693da12d848db15,3,Teacher-Led,Let's Get Moving!,When I received the notification from our scho...,When I received the notification from our scho...,My students need a variety of indoor P.E. cent...,Health & Sports,"Gym & Fitness, Health & Wellness",Grades PreK-2,Supplies,1009.44,2013-01-01,2013-05-31,Expired,
87,4bb25472fe73026e610b635cd0642037,81a5ab1f688b70c3f0571a8b7f03588d,453b40cc9c22b7e11a41745e516941b8,1,Teacher-Led,"Getting Our ""Hands-On"" Math And Reading!","Each day, my 24 kindergarteners work diligentl...","Each day, my 24 kindergarteners work diligentl...",My students need games and manipulatives that ...,"Literacy & Language, Math & Science","Literacy, Mathematics",Grades PreK-2,Supplies,1033.53,2013-01-01,2013-05-30,Fully Funded,2013-01-01
94,c76de264e1e2a95be0e93db08a6cfefa,ca49c06a6c3335c8be99f4a289c7102e,a71e8076dbba65ed1176c7f872491d7a,16,Teacher-Led,Serious PLAY,"When I give my 3rd graders a Lego challenge, t...","When I give my 3rd graders a Lego challenge, t...",My students need Legos to express their learni...,"Math & Science, Literacy & Language","Environmental Science, Literacy",Grades 3-5,Other,1023.67,2013-01-01,2013-05-31,Expired,
103,dd0348a5b0980ff6d7db5a39911b183d,fd0ad601764815a21218a811c8a148bb,691c8675017ddfeb3367e5ba901290be,11,Teacher-Led,Special Techies,Think back to when you were in school and the ...,Think back to when you were in school and the ...,My students need 2 Apple IPADs to assist them ...,"Literacy & Language, Special Needs","Literature & Writing, Special Needs",Grades 3-5,Technology,1087.04,2013-01-01,2013-05-31,Fully Funded,2013-05-07
119,109030ad43ccf26a52dd4a110b7b2535,31d9e9d8e769137fb9ea5f96d99ca2c5,2bad8965ce82e963e5079644d7423c1c,10,Teacher-Led,English Language Listeners!,There are so many different ways to learn how ...,There are so many different ways to learn how ...,My students need audio books to read at their ...,Literacy & Language,"ESL, Literacy",Grades PreK-2,Books,1017.40,2013-01-01,2013-05-30,Fully Funded,2013-03-11
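
Boolean masks like the one above can be combined. The sketch below uses a four-row made-up frame (`toy_df`) that reuses two of the real column names, and shows `&`, `|`, and `.loc`:

```python
import pandas as pd

# Made-up rows that reuse two of the real column names.
toy_df = pd.DataFrame({
    'Project Cost': [361.80, 3020.59, 1552.44, 161.26],
    'Project Current Status': ['Fully Funded', 'Fully Funded', 'Expired', 'Fully Funded'],
})

expensive = toy_df['Project Cost'] > 1000
funded = toy_df['Project Current Status'] == 'Fully Funded'

# Combine masks with & (and), | (or), ~ (not). Wrap each comparison
# in parentheses if you write it inline, because & binds tighter than >.
both = toy_df[expensive & funded]
either = toy_df[expensive | funded]

# .loc filters rows and picks a column in one step.
costs = toy_df.loc[expensive & funded, 'Project Cost']
print(len(both), len(either), costs.tolist())
```

Python's `and` and `or` do not work on Series; the element-wise operators are required here.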


In [30]:
proj_df['Project Subject Category Tree'].unique()

array(['Applied Learning', 'Applied Learning, Literacy & Language',
       'Literacy & Language', 'Special Needs',
       'Literacy & Language, History & Civics', 'Math & Science',
       'History & Civics, Math & Science',
       'Literacy & Language, Special Needs',
       'Applied Learning, Special Needs', 'Health & Sports, Special Needs',
       'Math & Science, Literacy & Language',
       'Literacy & Language, Math & Science',
       'Literacy & Language, Music & The Arts',
       'Math & Science, Special Needs', 'Health & Sports',
       'Music & The Arts', 'Math & Science, Applied Learning',
       'Literacy & Language, Applied Learning',
       'Applied Learning, Music & The Arts',
       'History & Civics, Literacy & Language',
       'Applied Learning, Math & Science',
       'Health & Sports, Math & Science',
       'Applied Learning, Health & Sports', 'History & Civics',
       'History & Civics, Music & The Arts',
       'Math & Science, History & Civics',
       'Math & 

In [31]:
proj_df['Project Grade Level Category'].unique()

array(['Grades PreK-2', 'Grades 3-5', 'Grades 9-12', 'Grades 6-8',
       'unknown'], dtype=object)

In [32]:
proj_df['Project Grade Level Category'].value_counts()

Grades PreK-2    432002
Grades 3-5       364257
Grades 6-8       181236
Grades 9-12      132469
unknown              53
Name: Project Grade Level Category, dtype: int64
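
Raw counts are often easier to compare as shares. A sketch on a made-up four-row sample, using the `normalize` parameter of `value_counts`:

```python
import pandas as pd

# Made-up sample; the real column has over a million rows.
grades = pd.Series(
    ['Grades PreK-2', 'Grades PreK-2', 'Grades 3-5', 'Grades 6-8'],
    name='Project Grade Level Category',
)

counts = grades.value_counts()                # absolute counts, largest first
shares = grades.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```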

In [33]:
proj_df['Project Subject Category Tree'].value_counts()

Literacy & Language                           250504
Math & Science                                172517
Literacy & Language, Math & Science           153967
Music & The Arts                               56828
Health & Sports                                48690
Applied Learning                               44992
Literacy & Language, Special Needs             41308
Special Needs                                  35555
Applied Learning, Literacy & Language          31144
Math & Science, Literacy & Language            27956
Literacy & Language, Music & The Arts          21175
History & Civics                               20546
Applied Learning, Special Needs                20478
Math & Science, Special Needs                  18712
History & Civics, Literacy & Language          18639
Math & Science, Music & The Arts               16890
Math & Science, Applied Learning               15407
Applied Learning, Math & Science               12962
Literacy & Language, History & Civics         

# Extra Time
* `apply`
* string manipulation
* categorical values

## `apply` function

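`apply` runs an arbitrary function over the data and collects the results: once per value for a Series, or once per row or column for a DataFrame.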

In [34]:
# how many are `Literacy & Language`?
proj_df['Project Subject Category Tree'].value_counts()

Literacy & Language                           250504
Math & Science                                172517
Literacy & Language, Math & Science           153967
Music & The Arts                               56828
Health & Sports                                48690
Applied Learning                               44992
Literacy & Language, Special Needs             41308
Special Needs                                  35555
Applied Learning, Literacy & Language          31144
Math & Science, Literacy & Language            27956
Literacy & Language, Music & The Arts          21175
History & Civics                               20546
Applied Learning, Special Needs                20478
Math & Science, Special Needs                  18712
History & Civics, Literacy & Language          18639
Math & Science, Music & The Arts               16890
Math & Science, Applied Learning               15407
Applied Learning, Math & Science               12962
Literacy & Language, History & Civics         

In [35]:
def has_literacy_and_language(x):
    if 'Literacy & Language' in x:
        return 1
    else:
        return 0
proj_df['Project Subject Category Tree'].apply(has_literacy_and_language)

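TypeError: argument of type 'float' is not iterable

The error comes from missing values: pandas stores them as `NaN`, which is a `float`, so the `in` test fails on those rows. The next cells look for the missing values and then fill them with empty strings.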
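
The failure can be reproduced in miniature. The sketch below uses a three-row stand-in for the real column (made-up values) and also shows one possible guard, `has_literacy_and_language_safe`, a hypothetical variant that checks the type before searching:

```python
import pandas as pd

categories = pd.Series(
    ['Literacy & Language', float('nan'), 'Math & Science'],
    dtype=object,
)

# The missing entry is a float, not a string.
missing = categories[1]
print(type(missing).__name__)

# `'...' in <float>` raises TypeError, which is what apply ran into.
try:
    'Literacy & Language' in missing
    raised = False
except TypeError:
    raised = True

# One possible guard: only search values that are actually strings.
def has_literacy_and_language_safe(x):
    return 1 if isinstance(x, str) and 'Literacy & Language' in x else 0

flags = categories.apply(has_literacy_and_language_safe)
print(raised, flags.tolist())
```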

In [36]:
proj_df['Project Subject Category Tree'].value_counts(dropna=False)

Literacy & Language                           250504
Math & Science                                172517
Literacy & Language, Math & Science           153967
Music & The Arts                               56828
Health & Sports                                48690
Applied Learning                               44992
Literacy & Language, Special Needs             41308
Special Needs                                  35555
Applied Learning, Literacy & Language          31144
Math & Science, Literacy & Language            27956
Literacy & Language, Music & The Arts          21175
History & Civics                               20546
Applied Learning, Special Needs                20478
Math & Science, Special Needs                  18712
History & Civics, Literacy & Language          18639
Math & Science, Music & The Arts               16890
Math & Science, Applied Learning               15407
Applied Learning, Math & Science               12962
Literacy & Language, History & Civics         

In [37]:
proj_df['Project Subject Category Tree'].fillna('').apply(has_literacy_and_language)

0          0
1          1
2          1
3          1
4          0
5          1
6          0
7          1
8          1
9          1
10         0
11         0
12         1
13         0
14         1
15         1
16         0
17         1
18         1
19         0
20         1
21         0
22         1
23         1
24         1
25         1
26         1
27         1
28         1
29         1
          ..
1109987    1
1109988    0
1109989    1
1109990    0
1109991    1
1109992    1
1109993    0
1109994    0
1109995    1
1109996    0
1109997    0
1109998    1
1109999    1
1110000    0
1110001    0
1110002    1
1110003    0
1110004    0
1110005    0
1110006    0
1110007    1
1110008    1
1110009    0
1110010    1
1110011    1
1110012    0
1110013    1
1110014    1
1110015    1
1110016    1
Name: Project Subject Category Tree, Length: 1110017, dtype: int64

In [38]:
proj_df['lit_and_lang'] = proj_df['Project Subject Category Tree'].fillna('').apply(has_literacy_and_language)
proj_df.lit_and_lang.sum()

570150

What is going on here? `apply` calls the function once for each item in the column (the Series), passes that item as the only argument, and gathers the return values into a new Series.

What if we wanted a second argument?

In [39]:
# generalized version: `name` selects which category to look for
def has_literacy_and_language(x, name='Literacy & Language'):
    if name in x:
        return 1
    else:
        return 0
# a lambda lets us pass the second argument through apply
proj_df['math_science'] = proj_df['Project Subject Category Tree'].fillna('').apply(lambda x: has_literacy_and_language(x, 'Math & Science'))
proj_df.math_science.sum()

436623
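
Wrapping the call in a `lambda` works; `Series.apply` can also forward extra positional arguments itself through its `args` parameter. A sketch on made-up data, where `has_category` is a hypothetical, more neutrally named version of the function above:

```python
import pandas as pd

def has_category(x, name='Literacy & Language'):
    """Return 1 if `name` appears in the category string x, else 0."""
    return 1 if name in x else 0

categories = pd.Series([
    'Math & Science',
    'Literacy & Language, Math & Science',
    'Special Needs',
])

# args is a tuple of extra positional arguments passed after each value.
math_flags = categories.apply(has_category, args=('Math & Science',))
print(math_flags.tolist())
```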

In [40]:
proj_df['math_science'] = proj_df['Project Subject Category Tree'].fillna('').apply(
    lambda x: 1 if 'Math & Science' in x else 0
)
proj_df.math_science.sum()

436623
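
For a simple substring test, the `.str` accessor can do the same work without calling a Python function for every row. A sketch on made-up data: `regex=False` treats the pattern as a literal string and `na=False` counts missing values as non-matches, so no `fillna` is needed.

```python
import pandas as pd

categories = pd.Series(
    ['Math & Science', float('nan'), 'Literacy & Language, Math & Science', 'Special Needs'],
    dtype=object,
)

# Boolean mask: True where the literal substring is present.
is_math = categories.str.contains('Math & Science', regex=False, na=False)

# Convert to 1/0 flags like the apply version.
math_science = is_math.astype(int)

print(math_science.tolist(), int(math_science.sum()))
```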

### Entire Row (or column)

* `axis=1`: the function is called once per row
* `axis=0` (the default): the function is called once per column

In [41]:
proj_df.apply(lambda r: 1 if r.math_science and r.lit_and_lang else 0, axis=1).sum()

181923
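
Row-wise `apply` runs a Python function once per row, which tends to be slow on a frame with over a million rows. Since both flag columns hold 0 or 1, the same count can be sketched with column-wise comparisons (toy data below, same column names as above):

```python
import pandas as pd

flags = pd.DataFrame({
    'math_science': [1, 0, 1, 1],
    'lit_and_lang': [1, 1, 0, 1],
})

# Row-by-row version, as in the cell above.
row_wise = flags.apply(
    lambda r: 1 if r.math_science and r.lit_and_lang else 0, axis=1
).sum()

# Column-wise version: compare whole columns, then combine the masks.
vectorized = ((flags['math_science'] == 1) & (flags['lit_and_lang'] == 1)).sum()

print(row_wise, vectorized)
```

Both approaches give the same count; the column-wise form is usually the one to reach for first.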

## String Manipulation

Manipulate strings within a Series: https://pandas.pydata.org/pandas-docs/stable/api.html#string-handling

In [42]:
proj_df['Project Need Statement']

0          My students need 25 copies of "Bullying in Sch...
1          My students need a listening center, read alon...
2          My students need beginning vocabulary audio ca...
3          My students need one copy of each book in The ...
4          My students need items such as Velcro, two pou...
5          My students need 24 subscriptions to Time for ...
6          My students need additional funding to finance...
7          My students need a class set of high interest,...
8          My students need four grammar match-up games a...
9          My students need tag readers to use during ind...
10         My students need resources to combat bullying ...
11         My students need hands-on math materials to us...
12         My students need a new rug for the library are...
13         My students need a pencil sharpener, 12 packs ...
14         My students need 2 Kindle readers and protecti...
15         My students need a class set of iPod Shuffles ...
16         My students n

How many need statements start with `My students`, and how many with `I need`?

In [43]:
proj_df['Project Need Statement'].str.startswith('My students').value_counts()

True     1090596
False      19418
Name: Project Need Statement, dtype: int64

In [44]:
proj_df['Project Need Statement'].str.startswith('I need').value_counts()

False    1103213
True        6801
Name: Project Need Statement, dtype: int64
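
The `.str` accessor mirrors many Python string methods and passes missing values through as `NaN`, which is why the counts above sum to 1,110,014 rather than 1,110,017 (`value_counts` drops `NaN` by default). A sketch on made-up need statements:

```python
import pandas as pd

needs = pd.Series(
    ['My students need 25 books.', 'I need a projector.', float('nan')],
    dtype=object,
)

# Missing values stay missing unless `na` says otherwise.
starts = needs.str.startswith('My students')
starts_filled = needs.str.startswith('My students', na=False)

lengths = needs.str.len()    # character count per row, NaN where missing
lowered = needs.str.lower()  # lower-cased copy

print(starts.tolist())
print(starts_filled.tolist())
```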

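## Categorical

A categorical column stores each distinct value once and represents every row by a small integer code.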

In [45]:
proj_df['proj_type'] = proj_df['Project Type'].astype('category')
pd.concat((proj_df['proj_type'], proj_df['proj_type'].cat.codes), axis=1)

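|   | proj_type   | 0 |
|---|-------------|---|
| 0 | Teacher-Led | 2 |
| 1 | Teacher-Led | 2 |
| 2 | Teacher-Led | 2 |
| 3 | Teacher-Led | 2 |
| 4 | Teacher-Led | 2 |
| 5 | Teacher-Led | 2 |
| 6 | Teacher-Led | 2 |
| 7 | Teacher-Led | 2 |
| 8 | Teacher-Led | 2 |
| 9 | Teacher-Led | 2 |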
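
The codes shown above are positions in the column's list of categories. In the sketch below, the category names other than `Teacher-Led` are assumed for illustration; with the default sorted category order, `Teacher-Led` comes out as code 2, matching the output above.

```python
import pandas as pd

# Made-up sample; only 'Teacher-Led' is taken from the output above.
project_type = pd.Series(
    ['Teacher-Led', 'Student-Led', 'Teacher-Led', 'Professional Development']
).astype('category')

categories = project_type.cat.categories.tolist()  # distinct values, sorted
codes = project_type.cat.codes.tolist()            # integer code for each row

print(categories)
print(codes)
```

Because each label is stored only once, a categorical column with few distinct values generally takes far less memory than the same data as strings.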
