# DATA 202 @ Calvin University, FA19

## Lab 2: More Pandas!

**Author**: Kenneth C. Arnold

**Adapted with permission from Berkeley DATA 100 Summer 2019**

## Some useful resources


Introductory:

* [Getting started with Python for research](https://github.com/TiesdeKok/LearnPythonforResearch), a gentle introduction to Python in data-intensive research.

* [A Whirlwind Tour of Python](https://jakevdp.github.io/WhirlwindTourOfPython/index.html), by Jake VanderPlas, another quick Python intro (with notebooks).

Core Pandas/Data Science books:

* [The Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/), by Jake VanderPlas.

* [Python for Data Analysis, 2nd Edition](http://proquest.safaribooksonline.com/book/programming/python/9781491957653), by Wes McKinney, creator of Pandas. [Companion Notebooks](https://github.com/wesm/pydata-book)

* [Effective Pandas](https://github.com/TomAugspurger/effective-pandas), a book by Tom Augspurger, core Pandas developer.


Complementary resources:

* [An introduction to "Data Science"](https://github.com/stefanv/ds_intro), a collection of Notebooks by BIDS' [Stéfan Van der Walt](https://bids.berkeley.edu/people/st%C3%A9fan-van-der-walt).

* [Effective Computation in Physics](http://proquest.safaribooksonline.com/book/physics/9781491901564), by Kathryn D. Huff and Anthony Scopatz. [Notebooks to accompany the book](https://github.com/physics-codes/seminar). Don't be fooled by the title: it's a great book on modern computational practices, with very little that's physics-specific.


OK, let's load and configure some of our core libraries (as an aside, you can find a nice visual gallery of available matplotlib styles [here](https://tonysyu.github.io/raw_content/matplotlib-style-gallery/gallery.html)).

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

sns.set()
sns.set_context('notebook')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 10
pd.options.display.max_columns = 8
pd.set_option('display.precision', 2)  # full option name; the bare 'precision' alias fails on newer pandas
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

## Reading in DataFrames from Files

Pandas has a number of very useful file reading tools. You can see them enumerated by typing `pd.re` and pressing tab. We'll be using `read_csv` today.

In [2]:
elections = pd.read_csv("elections.csv")
elections # if we end a cell with an expression or variable name, the result will print

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
...,...,...,...,...,...
18,McCain,Republican,45.7,2008,loss
19,Obama,Democratic,51.1,2012,win
20,Romney,Republican,47.2,2012,loss
21,Clinton,Democratic,48.2,2016,loss


## Basic Inspection
You'll probably want to examine just about every DataFrame you work with using methods like these:

In [3]:
elections.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 5 columns):
Candidate    23 non-null object
Party        23 non-null object
%            23 non-null float64
Year         23 non-null int64
Result       23 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 1000.0+ bytes


In [4]:
elections.head(n=7)  # What is the default value of n?

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss


There is also a `tail` method, which shows the last rows instead of the first.

In [5]:
elections.tail(7)

Unnamed: 0,Candidate,Party,%,Year,Result
16,Bush,Republican,50.7,2004,win
17,Obama,Democratic,52.9,2008,win
18,McCain,Republican,45.7,2008,loss
19,Obama,Democratic,51.1,2012,win
20,Romney,Republican,47.2,2012,loss
21,Clinton,Democratic,48.2,2016,loss
22,Trump,Republican,46.1,2016,win


Note that `describe` by default only looks at the numeric columns.

In [6]:
elections.describe()

Unnamed: 0,%,Year
count,23.0,23.0
mean,42.51,1996.87
std,13.48,11.63
min,6.6,1980.0
25%,40.85,1988.0
50%,47.2,1996.0
75%,49.95,2006.0
max,58.8,2016.0
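To summarize the non-numeric columns as well, `describe` accepts an `include` argument. Here is a minimal sketch; the `demo` frame is a small stand-in built from the first five rows shown above, not the full `elections.csv`.

```python
import pandas as pd

# Stand-in frame built from the first five rows of the elections table.
demo = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale"],
    "Party": ["Republican", "Democratic", "Independent", "Republican", "Democratic"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6],
})

# Default behaviour: only the numeric column is summarized.
numeric_summary = demo.describe()

# include="all" adds count/unique/top/freq rows for the text columns.
full_summary = demo.describe(include="all")
print(full_summary)
```

Statistics that do not apply to a column (for example, the mean of `Party`) show up as `NaN` in the combined table.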


How many unique years were there?

In [7]:
# Your code here:
len(elections["Year"].value_counts())

10

In [8]:
# Your code here
len(elections["Year"].unique())

10
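A third option is `Series.nunique()`, which returns the count directly. The sketch below uses made-up stand-in Series rather than the real `Year` column. It also shows one difference worth knowing: by default `nunique()` skips missing values, while `unique()` keeps them.

```python
import pandas as pd

# Stand-in for a column of election years.
years = pd.Series([1980, 1980, 1980, 1984, 1984, 1988])
n_years = years.nunique()

# With a missing value present, the two approaches disagree.
with_gap = pd.Series([1980.0, 1984.0, None])
n_without_nan = with_gap.nunique()    # the missing value is not counted
n_with_nan = len(with_gap.unique())   # the missing value counts as its own entry

print(n_years, n_without_nan, n_with_nan)
```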

## Exercise: inspect a different DataFrame

In [9]:
mottos = pd.read_csv("mottos.csv", index_col = "State")
mottos.head(5)

State,Motto,Translation,Language,Date Adopted
Alabama,Audemus jura nostra defendere,We dare defend our rights!,Latin,1923
Alaska,North to the future,-,English,1967
Arizona,Ditat Deus,God enriches,Latin,1863
Arkansas,Regnat populus,The people rule,Latin,1907
California,Eureka (Εὕρηκα),I have found it,Greek,1849


In [10]:
# Your code here
mottos.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50 entries, Alabama to Wyoming
Data columns (total 4 columns):
Motto           50 non-null object
Translation     49 non-null object
Language        50 non-null object
Date Adopted    50 non-null object
dtypes: object(4)
memory usage: 2.0+ KB


In [11]:
# Your code here
mottos["Language"].value_counts()

Latin             23
English           21
Hawaiian           1
Spanish            1
Greek              1
Italian            1
French             1
Chinook Jargon     1
Name: Language, dtype: int64

## The [] Operator

**Ground rule**: Before running one of the following cells, tell your neighbor what you think the result will be.

The DataFrame class has an indexing operator `[]` that lets you do a variety of different things.

Getting a column by name:

In [12]:
elections["Candidate"].head(6)

0      Reagan
1      Carter
2    Anderson
3      Reagan
4     Mondale
5        Bush
Name: Candidate, dtype: object

If you actually wanted a one-column DataFrame instead of a Series:

In [13]:
elections["Candidate"].to_frame()

Unnamed: 0,Candidate
0,Reagan
1,Carter
2,Anderson
3,Reagan
4,Mondale
...,...
18,McCain
19,Obama
20,Romney
21,Clinton


The [] operator also accepts a list of strings:

In [14]:
elections[["Candidate", "Party"]].head(6)

Unnamed: 0,Candidate,Party
0,Reagan,Republican
1,Carter,Democratic
2,Anderson,Independent
3,Reagan,Republican
4,Mondale,Democratic
5,Bush,Republican


**Exercise**: How many times did each Party run?

In [15]:
# your code here
elections["Party"].value_counts()

Democratic     10
Republican     10
Independent     3
Name: Party, dtype: int64

## Boolean Array Selection

The `[]` operator also accepts an array of booleans. The array must have exactly one entry per row. The result is a filtered version of the DataFrame containing only the rows where the array is True.

In [16]:
elections[[False, False, False, False, False, 
          False, False, True, False, False,
          True, False, False, False, True,
          False, False, False, False, False,
          False, False, True]]

Unnamed: 0,Candidate,Party,%,Year,Result
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
14,Bush,Republican,47.9,2000,win
22,Trump,Republican,46.1,2016,win


Filtering is one of the most common tasks in data science, and boolean array selection is one way to do it in Pandas. Comparison operators such as `==` can be applied to a Pandas Series to produce a boolean Series. For example, we can compare the 'Result' column to the string 'win':

In [17]:
elections.head(5)

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss


In [18]:
iswin = elections['Result'] == 'win'
iswin.head(5)

0     True
1    False
2    False
3     True
4    False
Name: Result, dtype: bool

The result of applying a comparison operator to a Series is another Series with the same name and index, but with boolean dtype.

In [19]:
iswin.dtype

dtype('bool')

In [20]:
elections[iswin]

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
3,Reagan,Republican,58.8,1984,win
5,Bush,Republican,53.4,1988,win
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
14,Bush,Republican,47.9,2000,win
16,Bush,Republican,50.7,2004,win
17,Obama,Democratic,52.9,2008,win
19,Obama,Democratic,51.1,2012,win
22,Trump,Republican,46.1,2016,win


Above, we assigned the result of the comparison to a new variable called `iswin`. This is uncommon: usually the boolean Series is created and used on the same line. Such code is a little tricky to read at first, but you'll get used to it quickly.

In [21]:
elections[elections['Result'] == 'win']

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
3,Reagan,Republican,58.8,1984,win
5,Bush,Republican,53.4,1988,win
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
14,Bush,Republican,47.9,2000,win
16,Bush,Republican,50.7,2004,win
17,Obama,Democratic,52.9,2008,win
19,Obama,Democratic,51.1,2012,win
22,Trump,Republican,46.1,2016,win


**Exercise**: Get the data for all Independent candidates

In [22]:
# Your code here
elections[elections['Party'] == 'Independent']

Unnamed: 0,Candidate,Party,%,Year,Result
2,Anderson,Independent,6.6,1980,loss
9,Perot,Independent,18.9,1992,loss
12,Perot,Independent,8.4,1996,loss


We can filter on multiple criteria by creating several boolean Series and combining them with the `&` operator. Each comparison needs its own parentheses, because `&` binds more tightly than `==` and `<`.

In [23]:
elections[(elections['Result'] == 'win') & (elections['%'] < 50)]

Unnamed: 0,Candidate,Party,%,Year,Result
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
14,Bush,Republican,47.9,2000,win
22,Trump,Republican,46.1,2016,win
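Besides `&` (and), boolean Series can be combined with `|` (or) and negated with `~` (not). The sketch below uses a six-row stand-in frame copied from rows of the elections table, so the results are easy to check by eye.

```python
import pandas as pd

# Six rows copied from the elections table (1980 and 1992 races).
demo = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Clinton", "Bush", "Perot"],
    "Party": ["Republican", "Democratic", "Independent",
              "Democratic", "Republican", "Independent"],
    "%": [50.7, 41.0, 6.6, 43.0, 37.4, 18.9],
    "Result": ["win", "loss", "loss", "win", "loss", "loss"],
})

# AND: winners with less than half of the vote.
close_wins = demo[(demo["Result"] == "win") & (demo["%"] < 50)]

# OR: independents, or anyone with more than half of the vote.
either = demo[(demo["Party"] == "Independent") | (demo["%"] > 50)]

# NOT: everyone except the independents.
not_independent = demo[~(demo["Party"] == "Independent")]

print(close_wins, either, not_independent, sep="\n\n")
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the symbol operators are used here.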


**Exercise**: What state has a Spanish motto?

In [24]:
# Your code here
mottos[mottos["Language"] == "Spanish"]

State,Motto,Translation,Language,Date Adopted
Montana,Oro y plata,Gold and silver,Spanish,"February 9, 1865"


**Exercise**: How many states have non-English mottos?

In [25]:
# Your code here
len(mottos[mottos["Language"] != "English"])

29
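Because `True` counts as 1 and `False` as 0, the same count can be obtained by summing the boolean Series directly, without building the filtered DataFrame first. A minimal sketch, using the five states shown in the `mottos.head(5)` output as a stand-in for the full table:

```python
import pandas as pd

# Stand-in for mottos["Language"], limited to the first five states.
languages = pd.Series(
    ["Latin", "English", "Latin", "Latin", "Greek"],
    index=["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    name="Language",
)

is_non_english = languages != "English"
n_non_english = is_non_english.sum()  # number of True entries

print(n_non_english)
```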

## `loc` and `iloc`

To go beyond what we can do with `[]` (getting named columns and boolean subsets of rows), we need more general indexers. The most common are `loc` and `iloc`, which are used with square brackets rather than called like methods:

* `loc` looks up by **name** (the row and column labels).
* `iloc` ignores the labels and looks up by integer **place** (position).

In [26]:
elections["Party"].head()

0     Republican
1     Democratic
2    Independent
3     Republican
4     Democratic
Name: Party, dtype: object

In [27]:
elections.loc[:, "Party"].head()

0     Republican
1     Democratic
2    Independent
3     Republican
4     Democratic
Name: Party, dtype: object

In [28]:
mottos.loc["Michigan"]

Motto                  Si quaeris peninsulam amoenam circumspice
Translation     If you seek a pleasant peninsula, look about you
Language                                                   Latin
Date Adopted                                        June 2, 1835
Name: Michigan, dtype: object

In [29]:
mottos.loc["Michigan", "Translation"]

'If you seek a pleasant peninsula, look about you'

In [30]:
mottos.iloc[10]

Motto                           Ua mau ke ea o ka ʻāina i ka pono
Translation     The life of the land is perpetuated in righteo...
Language                                                 Hawaiian
Date Adopted                                        July 31, 1843
Name: Hawaii, dtype: object

In [31]:
mottos.iloc[10, 1]

'The life of the land is perpetuated in righteousness.'

In [32]:
elections.iloc[0]

Candidate        Reagan
Party        Republican
%                    51
Year               1980
Result              win
Name: 0, dtype: object
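With the default 0, 1, 2, ... index, `loc` and `iloc` often return the same rows, which hides the difference between them. The sketch below uses a four-row stand-in frame to show two cases where they diverge: after the rows have been reordered, and when slicing.

```python
import pandas as pd

# Four rows copied from the top of the elections table.
demo = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan"],
    "%": [50.7, 41.0, 6.6, 58.8],
})

# Sorting reorders the rows but keeps their original index labels.
by_pct = demo.sort_values("%")

by_label = by_pct.loc[0]      # the row whose index label is 0
by_position = by_pct.iloc[0]  # the row that is physically first

# Slices differ too: loc includes the end label, iloc excludes the end position.
loc_slice = demo.loc[0:2]
iloc_slice = demo.iloc[0:2]

print(by_label["Candidate"], by_position["Candidate"])
print(len(loc_slice), len(iloc_slice))
```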

## Utility functions

There are also many useful utility methods for DataFrames and Series. For example, we can create a copy of a DataFrame sorted by a specific column using `sort_values`.

In [33]:
elections.sort_values('%')

Unnamed: 0,Candidate,Party,%,Year,Result
2,Anderson,Independent,6.6,1980,loss
12,Perot,Independent,8.4,1996,loss
9,Perot,Independent,18.9,1992,loss
8,Bush,Republican,37.4,1992,loss
4,Mondale,Democratic,37.6,1984,loss
...,...,...,...,...,...
0,Reagan,Republican,50.7,1980,win
19,Obama,Democratic,51.1,2012,win
17,Obama,Democratic,52.9,2008,win
5,Bush,Republican,53.4,1988,win


In [34]:
elections.sort_values('%').iloc[-4:]

Unnamed: 0,Candidate,Party,%,Year,Result
19,Obama,Democratic,51.1,2012,win
17,Obama,Democratic,52.9,2008,win
5,Bush,Republican,53.4,1988,win
3,Reagan,Republican,58.8,1984,win


Like most DataFrame methods, `sort_values` returns a new object and does **not** modify the original, unless you pass `inplace=True`. The original `elections` is still in its original order:

In [35]:
elections.head(5)

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
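The same behaviour can be checked on a small stand-in frame. To keep the sorted version, assign the result to a name; the usual idiom is reassignment rather than `inplace=True`. The variable names below are illustrative only.

```python
import pandas as pd

# Three rows copied from the 1980 race.
demo = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "%": [50.7, 41.0, 6.6],
})

sorted_copy = demo.sort_values("%")

# The original frame keeps its order; only the returned copy is sorted.
original_order = list(demo["Candidate"])
sorted_order = list(sorted_copy["Candidate"])

# To keep the sorted version, bind the result to a name.
demo_sorted = demo.sort_values("%").reset_index(drop=True)

print(original_order, sorted_order)
```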


If we want to sort in reverse order, we can set `ascending=False`.

In [36]:
elections.sort_values('%', ascending=False)

Unnamed: 0,Candidate,Party,%,Year,Result
3,Reagan,Republican,58.8,1984,win
5,Bush,Republican,53.4,1988,win
17,Obama,Democratic,52.9,2008,win
19,Obama,Democratic,51.1,2012,win
0,Reagan,Republican,50.7,1980,win
...,...,...,...,...,...
4,Mondale,Democratic,37.6,1984,loss
8,Bush,Republican,37.4,1992,loss
9,Perot,Independent,18.9,1992,loss
12,Perot,Independent,8.4,1996,loss


For Series, the `value_counts` method is often quite handy.
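A short sketch of `value_counts` on a made-up stand-in Series: the default gives raw counts sorted from most to least frequent, and `normalize=True` gives proportions instead.

```python
import pandas as pd

# Stand-in for a column of election results.
results = pd.Series(["win", "loss", "loss", "win", "loss"], name="Result")

counts = results.value_counts()                # raw counts, most frequent first
shares = results.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```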

## Case Study: What was the most popular name in Michigan last year?

Start by downloading the baby name data from the Social Security Administration.

* [Direct link to ZIP file](https://www.ssa.gov/oact/babynames/state/namesbystate.zip)
* https://www.ssa.gov/OACT/babynames/index.html
* https://www.ssa.gov/data

Extract it to the current directory. Here's a block of code that does all of that in Python; you don't need to understand how it works.

In [37]:
# Download the ZIP file
import requests
from pathlib import Path

namesbystate_path = Path('namesbystate.zip')
data_url = 'https://www.ssa.gov/oact/babynames/state/namesbystate.zip'

if not namesbystate_path.exists():
    print('Downloading...', end=' ')
    resp = requests.get(data_url)
    resp.raise_for_status()  # stop here rather than save an error page as the ZIP
    with namesbystate_path.open('wb') as f:
        f.write(resp.content)
    print('Done!')
else:
    print("File already downloaded.")

File already downloaded.


In [38]:
# Extract the data.
import zipfile
# The with-block closes the archive even if extraction fails.
with zipfile.ZipFile(namesbystate_path, 'r') as zf:
    zf.extractall(path="names_by_state")

Let's have a look at the Michigan data; it should give us an idea of the structure of the whole thing:

In [39]:
mi_filename = "names_by_state/MI.TXT"
with open(mi_filename) as f:
    for i in range(10):
        print(f.readline().rstrip())

MI,F,1910,Helen,368
MI,F,1910,Mary,349
MI,F,1910,Margaret,272
MI,F,1910,Dorothy,265
MI,F,1910,Ruth,212
MI,F,1910,Florence,164
MI,F,1910,Mildred,159
MI,F,1910,Frances,155
MI,F,1910,Anna,143
MI,F,1910,Marie,143


On macOS or Linux, the `head` shell command prints the same first 10 lines of `MI.TXT` (if you're on Windows, don't try to run the cell below):

In [40]:
!head {mi_filename}

MI,F,1910,Helen,368
MI,F,1910,Mary,349
MI,F,1910,Margaret,272
MI,F,1910,Dorothy,265
MI,F,1910,Ruth,212
MI,F,1910,Florence,164
MI,F,1910,Mildred,159
MI,F,1910,Frances,155
MI,F,1910,Anna,143
MI,F,1910,Marie,143


A couple of practical comments:

* The above is using special tricks in Jupyter's "IPython" kernels that let you call operating system commands via `!cmd`, and that expand Python variables in such commands with the `{var}` syntax. You can find more about IPython's special tricks [in this tutorial](https://github.com/ipython/ipython-in-depth/blob/master/examples/IPython%20Kernel/Beyond%20Plain%20Python.ipynb).

* `head` doesn't work on Windows, though there are equivalent Windows commands. But by using Python code, even if it's a little bit more verbose, we have a 100% portable solution.
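As a sketch of the portable approach, the file-reading loop can be wrapped in a small helper. `head_lines` is a hypothetical name, not part of any library, and the demo writes its own temporary file so that it runs without the downloaded data.

```python
from itertools import islice
from pathlib import Path
import tempfile


def head_lines(path, n=10):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path) as f:
        # islice stops after n lines, or earlier if the file is shorter.
        return [line.rstrip("\n") for line in islice(f, n)]


# Demo on a temporary file holding three lines in the MI.TXT layout.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text(
        "MI,F,1910,Helen,368\nMI,F,1910,Mary,349\nMI,F,1910,Margaret,272\n"
    )
    first_two = head_lines(sample, n=2)
    all_lines = head_lines(sample)  # asks for 10, file only has 3

print(first_two)
```

Unlike a loop that calls `readline()` a fixed number of times, this version does not produce empty strings when the file has fewer than `n` lines.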

## Load the data

In [41]:
def load_names(state):
    field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
    return pd.read_csv(f'names_by_state/{state}.TXT', header=None, names=field_names)
mi_names = load_names(state='MI')

In [42]:
mi_names.head()

Unnamed: 0,State,Sex,Year,Name,Count
0,MI,F,1910,Helen,368
1,MI,F,1910,Mary,349
2,MI,F,1910,Margaret,272
3,MI,F,1910,Dorothy,265
4,MI,F,1910,Ruth,212
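The `header=None` and `names=` arguments matter because the SSA files have no header row. The sketch below reads two lines of the file from an in-memory string (an assumption made so the example runs without the download) and compares the result with what `read_csv` does by default.

```python
import io
import pandas as pd

# Two lines in the same layout as MI.TXT, held in memory.
raw = "MI,F,1910,Helen,368\nMI,F,1910,Mary,349\n"
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']

# header=None: every line is data; names= supplies the column labels.
with_names = pd.read_csv(io.StringIO(raw), header=None, names=field_names)

# Default: the first line is consumed as the header, so one record is lost.
default_read = pd.read_csv(io.StringIO(raw))

print(with_names)
print(list(default_read.columns))
```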


## Exploration review

Answer the following questions.

How many rows are in the dataset? How many columns?

In [43]:
# your code here
mi_names.shape

(181882, 5)

*Your answer here*


What is the range of years contained in this dataset?

In [44]:
# your code here
years = mi_names["Year"]
years.describe()

count    181882.00
mean       1975.64
std          29.71
min        1910.00
25%        1954.00
50%        1980.00
75%        2001.00
max        2018.00
Name: Year, dtype: float64

How many different names are there?

In [45]:
# your code here
len(mi_names['Name'].unique())

8886

### Indexing Review

Let's play around a bit with our indexing techniques from earlier today.

**Try the following**:

Extract the first few entries in the Name column, in 3 different ways:

1. Using `mi_names[]`

In [50]:
# Your code here
mi_names['Name'].head()

0       Helen
1        Mary
2    Margaret
3     Dorothy
4        Ruth
Name: Name, dtype: object

2. Using `mi_names.loc[]`

In [51]:
# Your code here
mi_names.loc[:, "Name"].head()

0       Helen
1        Mary
2    Margaret
3     Dorothy
4        Ruth
Name: Name, dtype: object

3. Using `mi_names.iloc[]`

In [53]:
# Your code here
mi_names.iloc[:5, -2]

0       Helen
1        Mary
2    Margaret
3     Dorothy
4        Ruth
Name: Name, dtype: object
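Hard-coding a column position such as `-2` breaks if the columns are ever reordered. One alternative, sketched on a three-row stand-in frame, is to look the position up with `columns.get_loc` and pass it to `iloc`.

```python
import pandas as pd

# Three rows in the same layout as mi_names.
demo = pd.DataFrame({
    "State": ["MI", "MI", "MI"],
    "Sex": ["F", "F", "F"],
    "Year": [1910, 1910, 1910],
    "Name": ["Helen", "Mary", "Margaret"],
    "Count": [368, 349, 272],
})

# Translate the column label into its integer position.
name_pos = demo.columns.get_loc("Name")

# First two rows of that column, selected purely by position.
first_names = demo.iloc[:2, name_pos]

print(name_pos)
print(first_names)
```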

### Sorting

What we've done so far is NOT exploratory data analysis. We were just playing around a bit with the capabilities of the pandas library. Now that we're done, let's turn to the problem at hand: Identifying the most common name in Michigan last year.

**Step 1**: Find the most recent year in the DataFrame and extract only the rows for that year.

In [54]:
# your code here
most_recent_year = mi_names["Year"].max()
print("most recent year is", most_recent_year)
mi_names_for_most_recent_year = mi_names[mi_names["Year"] == most_recent_year]

most recent year is 2018


**Step 2**: Get the top-10 names by birth count.

In [55]:
# your code here
mi_names_for_most_recent_year.sort_values('Count', ascending=False).head(10)

Unnamed: 0,State,Sex,Year,Name,Count
101985,MI,F,2018,Olivia,502
180621,MI,M,2018,Noah,500
101986,MI,F,2018,Ava,493
180622,MI,M,2018,Oliver,486
101987,MI,F,2018,Emma,482
180623,MI,M,2018,Liam,468
101988,MI,F,2018,Charlotte,446
180624,MI,M,2018,Benjamin,442
180625,MI,M,2018,William,419
180626,MI,M,2018,Lucas,417
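For top-N questions like this one, `nlargest` is a shorter alternative to sorting and then taking the head. The sketch below uses five of the 2018 rows, deliberately shuffled, as a stand-in for the filtered DataFrame.

```python
import pandas as pd

# Five 2018 rows from the table above, in shuffled order.
demo = pd.DataFrame({
    "Name": ["Emma", "Noah", "Olivia", "Oliver", "Ava"],
    "Count": [482, 500, 502, 486, 493],
})

top3 = demo.nlargest(3, "Count")
top3_by_sort = demo.sort_values("Count", ascending=False).head(3)

print(top3)
```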


**Bonus**: Break that down by sex. (You may find the `display` function, which is available in Jupyter notebooks, helpful.)

In [56]:
# your code here
for sex in ["M", "F"]:
    by_gender = mi_names_for_most_recent_year[mi_names_for_most_recent_year['Sex'] == sex]
    print("Sex =", sex)
    display(by_gender.sort_values("Count", ascending=False).head(10))

Sex = M


Unnamed: 0,State,Sex,Year,Name,Count
180621,MI,M,2018,Noah,500
180622,MI,M,2018,Oliver,486
180623,MI,M,2018,Liam,468
180624,MI,M,2018,Benjamin,442
180625,MI,M,2018,William,419
180626,MI,M,2018,Lucas,417
180627,MI,M,2018,Henry,401
180628,MI,M,2018,Elijah,393
180629,MI,M,2018,Logan,383
180630,MI,M,2018,Jackson,376


Sex = F


Unnamed: 0,State,Sex,Year,Name,Count
101985,MI,F,2018,Olivia,502
101986,MI,F,2018,Ava,493
101987,MI,F,2018,Emma,482
101988,MI,F,2018,Charlotte,446
101989,MI,F,2018,Amelia,403
101990,MI,F,2018,Harper,396
101991,MI,F,2018,Sophia,376
101992,MI,F,2018,Evelyn,355
101993,MI,F,2018,Isabella,344
101994,MI,F,2018,Ella,256
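The loop above can also be written without an explicit `for`, by sorting once and then taking the first rows of each group with `groupby(...).head(n)`. This is a sketch of an alternative, not the intended lab solution. It uses six of the 2018 rows as a stand-in and takes the top 2 per sex to keep the output short.

```python
import pandas as pd

# Six 2018 rows from the tables above.
demo = pd.DataFrame({
    "Sex": ["F", "F", "F", "M", "M", "M"],
    "Name": ["Olivia", "Ava", "Emma", "Noah", "Oliver", "Liam"],
    "Count": [502, 493, 482, 500, 486, 468],
})

# Sort by count, then keep the first 2 rows seen for each sex.
top2_per_sex = (
    demo.sort_values("Count", ascending=False)
        .groupby("Sex")
        .head(2)
)

print(top2_per_sex)
```

The result is a single DataFrame with the two groups interleaved in sorted order, rather than one table per sex.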


**Bonus**: What is the meaning of this?

In [57]:
(mi_names_for_most_recent_year.Sex == "M").mean()

0.4722846441947566

*your answer here*

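In the most recent year, about 47% of the rows, which is less than half, are for male names. Each row is one distinct (name, sex) pair, so this is the share of *unique* names given to male children, not the share of births. In other words, parents used a wider range of names for female children.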
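The difference between a share of rows and a share of births can be made concrete on a small stand-in frame (five of the 2018 rows, chosen for illustration). Taking the mean of a boolean Series gives the fraction of rows that are `True`; weighting by `Count` gives the fraction of babies instead.

```python
import pandas as pd

# Five 2018 rows from the tables above: three female names, two male names.
demo = pd.DataFrame({
    "Sex": ["F", "F", "F", "M", "M"],
    "Name": ["Olivia", "Isabella", "Ella", "Noah", "Oliver"],
    "Count": [502, 344, 256, 500, 486],
})

is_male = demo["Sex"] == "M"

# Fraction of rows (distinct names) that are male.
share_of_rows = is_male.mean()

# Fraction of births that are male: weight each name by its count.
share_of_births = demo.loc[is_male, "Count"].sum() / demo["Count"].sum()

print(share_of_rows, share_of_births)
```

In this stand-in, the two shares differ because the male names each account for more births than the typical female name.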