# PANDAS

A high-level overview of the [Pandas](https://pandas.pydata.org) library.


## Why `pandas`?

`pandas` is a Python library for data manipulation and analysis. It is an industrial-strength package used in most real-world data analysis projects. Learning pandas also makes your projects easier for other data scientists to understand, which extends the reach your work can have.



In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
#import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('fivethirtyeight')
#sns.set_context("notebook")

## Reading in DataFrames from Files

Pandas has a number of very useful file-reading tools; this link describes many of them: https://realpython.com/pandas-read-write-files/. Today we'll be using `read_csv`. Another very useful one is `read_excel`, which reads Excel files.


A "csv" file is a "comma separated value" file.  It's a nice and simple text format that separates things in the files by commas.  For example:
Participant,ResponseTime
1,0.50
2,.0386

This is a fairly common file format that can be read by almost every program (e.g. Excel, SPSS, Python, R).
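
As a small sketch of how that text becomes a table, the two-row example above can be parsed directly from a string. The variable names `csv_text` and `responses` are made up for this illustration; `io.StringIO` simply stands in for a file on disk.

```python
import io
import pandas as pd

# The same two-row example as above, held in a string instead of a file.
csv_text = "Participant,ResponseTime\n1,0.50\n2,.0386\n"

# read_csv accepts a path or any file-like object, so StringIO works here.
responses = pd.read_csv(io.StringIO(csv_text))

print(responses)         # the first line of the text became the column names
print(responses.dtypes)  # pandas infers a type for each column
```

Note that `.0386` is read as the number 0.0386, so the missing leading zero is not a problem.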


Pandas stores tabular data in an object known as a `DataFrame`.


In [2]:
elections = pd.read_csv("elections.csv")
elections # if we end a cell with an expression or variable name, the result will print

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss
7,Clinton,Democratic,43.0,1992,win
8,Bush,Republican,37.4,1992,loss
9,Perot,Independent,18.9,1992,loss


We can use `shape` to get the number of rows and columns in this dataset.

In [3]:
elections.shape

(23, 5)

We can use the `head` method to return only the first few rows of a DataFrame.

In [4]:
elections.head(10)

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss
7,Clinton,Democratic,43.0,1992,win
8,Bush,Republican,37.4,1992,loss
9,Perot,Independent,18.9,1992,loss


There is also a `tail` method, which returns the last few rows.

In [5]:
elections.tail(7)

Unnamed: 0,Candidate,Party,%,Year,Result
16,Bush,Republican,50.7,2004,win
17,Obama,Democratic,52.9,2008,win
18,McCain,Republican,45.7,2008,loss
19,Obama,Democratic,51.1,2012,win
20,Romney,Republican,47.2,2012,loss
21,Clinton,Democratic,48.2,2016,loss
22,Trump,Republican,46.1,2016,win


Ideally, the column names in a data file are unique. If we read in a file whose column names are not unique, Pandas will automatically rename the duplicates. This is good to know, because many datasets in the wild have duplicate names.

In [6]:
dups = pd.read_csv("duplicate_columns.csv")
dups

Unnamed: 0,name,name.1,flavor
0,john,smith,vanilla
1,zhang,shan,chocolate
2,fulan,alfulani,
3,hong,gildong,banana
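
To see the renaming without needing `duplicate_columns.csv`, here is a minimal sketch that reads a made-up two-row file from a string. The names `csv_text` and `dups_demo` are assumptions for this example only.

```python
import io
import pandas as pd

# A made-up file in which two columns are both called "name".
csv_text = "name,name,flavor\njohn,smith,vanilla\nzhang,shan,chocolate\n"

dups_demo = pd.read_csv(io.StringIO(csv_text))

# The second "name" column gets a numeric suffix so the labels are unique.
print(list(dups_demo.columns))
```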


## Indexing, Slicing, Dicing

After reading in data, the most common operation is selecting data. Pandas DataFrames offer several powerful ways to access data; I'll step through a few now.

The DataFrame class has an indexing operator `[]` that lets you do a variety of different things. If you provide a string to the `[]` operator, you get back a Series corresponding to the requested label.

This is where the syntax starts to get a bit confusing.

### Selection Using Label/Index, with `loc`

**Column Selection** 

To select a column of a `DataFrame` by column label, the safest way is to use the `.loc` [indexer](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html). General usage of `.loc` looks like `df.loc[rowname, colname]`. Remember that the colon `:` means "everything." For example, if we want the `color` column of the `ex` DataFrame, we would use: `ex.loc[:, 'color']`

- You can also slice across columns. For example, `baby_names.loc[:, 'Name':]` would select the column `Name` and all columns after `Name`.

- *Alternative:* While `.loc` is invaluable when writing production code, it may be a little too verbose for interactive use. One recommended alternative is the `[]` operator, which takes the form `df['colname']`.

**Row Selection**

Similarly, if we want to select a row by its label, we can use the same `.loc` indexer. In this case, the "label" of each row refers to the index (i.e. primary key) of the DataFrame.
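
A minimal sketch of both kinds of selection follows. The `ex` DataFrame here is invented for illustration (its columns and row labels are assumptions, not data from this notebook).

```python
import pandas as pd

# `ex` is a made-up DataFrame; the row labels "a", "b", "c" are its index.
ex = pd.DataFrame(
    {
        "color": ["red", "green", "blue"],
        "shape": ["circle", "square", "star"],
        "weight": [1, 2, 3],
    },
    index=["a", "b", "c"],
)

color_col = ex.loc[:, "color"]    # every row, one column -> Series
same_col = ex["color"]            # the shorter [] form gives the same Series
from_shape = ex.loc[:, "shape":]  # "shape" and every column after it
row_b = ex.loc["b", :]            # one row by its label, every column

print(color_col)
print(from_shape)
print(row_b)
```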

### We will go through a bunch of examples now.


In [7]:
#Show the first 6 values. 
...

In [8]:
# Show just the Candidate names for the first 6 values. 


The [] operator also accepts a list of strings. In this case, you get back a DataFrame corresponding to the requested strings.

In [9]:
elections[["Candidate", "Party"]].head()

Unnamed: 0,Candidate,Party
0,Reagan,Republican
1,Carter,Democratic
2,Anderson,Independent
3,Reagan,Republican
4,Mondale,Democratic


The [] operator also accepts numerical slices as arguments. In this case, we are indexing by row, not column!

Which is really, really confusing. 

In [10]:
elections[0:3]

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss


The way to think of this is that the table is fundamentally a table with rows and columns.  Columns have names, rows have numbers by default. What we did above was shorthand.  When we didn't ask for a specific column or row we got all of them back.  

When you start selecting both, there are a lot of `[]` to keep track of.

In [11]:
elections[["Candidate","Party"]][0:3]

Unnamed: 0,Candidate,Party
0,Reagan,Republican
1,Carter,Democratic
2,Anderson,Independent
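
To summarize the three behaviors of `[]` seen so far, here is a sketch on a five-row copy of the table. `elections_small` is built by hand from the first rows shown above so the example runs without `elections.csv`.

```python
import pandas as pd

# First five rows of the elections table, typed in by hand.
elections_small = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale"],
    "Party": ["Republican", "Democratic", "Independent", "Republican", "Democratic"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6],
    "Year": [1980, 1980, 1980, 1984, 1984],
    "Result": ["win", "loss", "loss", "win", "loss"],
})

one_col = elections_small["Candidate"]               # string -> Series
two_cols = elections_small[["Candidate", "Party"]]   # list of strings -> DataFrame
first_rows = elections_small[0:3]                    # numeric slice -> rows

# Chained [] and a single .loc call select the same cells.
chained = elections_small[["Candidate", "Party"]][0:3]
via_loc = elections_small.loc[0:2, ["Candidate", "Party"]]  # loc slices include the end label

print(type(one_col).__name__, type(two_cols).__name__, first_rows.shape)
```

The single `.loc` call is one way to avoid stacking brackets, at the cost of remembering that its slices include the end label.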


If you provide a single value such as a string or an integer to the `[]` operator, it tries to use it as a column name. This is true even if the argument passed to `[]` is an integer. The next cell has an intentional error (commented out so the notebook still runs). You will see these `KeyError` messages often when working with pandas. A `KeyError` just means pandas can't find the label you asked for, usually because of a typo.

In [12]:
#elections[0] #this does not work,  see it fail in action, woo
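
If you would rather see the failure without stopping the notebook, one option is to catch the exception. This is only a sketch; `elections_small` is a hand-typed stand-in for the real table and `caught` is a name made up for this example.

```python
import pandas as pd

# A small stand-in table whose column names are all strings.
elections_small = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Year": [1980, 1980, 1980],
})

caught = False
try:
    elections_small[0]  # there is no column named 0, so this raises
except KeyError as err:
    caught = True
    print("KeyError for label:", err)
```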

The following cells allow you to test your understanding.

In [13]:
weird = pd.DataFrame({
    1:["topdog","botdog"], 
    "1":["topcat","botcat"]
})
weird

Unnamed: 0,1,1
0,topdog,topcat
1,botdog,botcat


In [14]:
weird[1] #try to predict the output

0    topdog
1    botdog
Name: 1, dtype: object

In [15]:
weird["1"] #try to predict the output

0    topcat
1    botcat
Name: 1, dtype: object

In [16]:
weird[1:] #try to predict the output

Unnamed: 0,1,1
1,botdog,botcat


## Boolean Array Selection

Now let's start doing some more interesting things. 

The `[]` operator also supports an array of booleans as input. In this case, the array must be exactly as long as the number of rows. The result is a filtered version of the DataFrame, where only rows corresponding to True appear.

In [17]:
elections

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss
7,Clinton,Democratic,43.0,1992,win
8,Bush,Republican,37.4,1992,loss
9,Perot,Independent,18.9,1992,loss


In [18]:
elections[[False, False, False, False, False, 
          False, False, True, False, False,
          True, False, False, False, True,
          False, False, False, False, False,
          False, False, True]]

Unnamed: 0,Candidate,Party,%,Year,Result
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
14,Bush,Republican,47.9,2000,win
22,Trump,Republican,46.1,2016,win


One very common task in data science is filtering. Boolean array selection is one way to achieve this in Pandas. We start by observing that comparison operators such as `==` can be applied to a Pandas Series to generate a boolean array. For example, we can compare the 'Result' column to the string 'win':

In [19]:
elections

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss
7,Clinton,Democratic,43.0,1992,win
8,Bush,Republican,37.4,1992,loss
9,Perot,Independent,18.9,1992,loss


In [20]:
iswin = elections['Result'] == 'win'
iswin.head(5)

0     True
1    False
2    False
3     True
4    False
Name: Result, dtype: bool

The output of the logical operator applied to the Series is another Series with the same name and index, but of datatype boolean. The entry at row #i represents the result of the application of that operator to the entry of the original Series at row #i.

Such a boolean Series can be used as an argument to the [] operator. For example, the following code creates a DataFrame of all election winners since 1980.

In [21]:
elections[iswin]

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
3,Reagan,Republican,58.8,1984,win
5,Bush,Republican,53.4,1988,win
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
14,Bush,Republican,47.9,2000,win
16,Bush,Republican,50.7,2004,win
17,Obama,Democratic,52.9,2008,win
19,Obama,Democratic,51.1,2012,win
22,Trump,Republican,46.1,2016,win


Above, we've assigned the result of the logical operator to a new variable called `iswin`. This is uncommon. Usually, the series is created and used on the same line. 

This syntax is a little tricky to read at first, but you'll get used to it quickly.

In [22]:
elections[elections['Result'] == 'win']

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
3,Reagan,Republican,58.8,1984,win
5,Bush,Republican,53.4,1988,win
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
14,Bush,Republican,47.9,2000,win
16,Bush,Republican,50.7,2004,win
17,Obama,Democratic,52.9,2008,win
19,Obama,Democratic,51.1,2012,win
22,Trump,Republican,46.1,2016,win


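We can select on multiple criteria by creating multiple boolean Series and combining them using the `&` operator. Each condition needs its own parentheses, because `&` binds more tightly than comparison operators such as `==` and `<`.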

In [23]:
win_under_50 = (elections['Result'] == 'win') & (elections['%'] < 50)

In [24]:
win_under_50.head(5)

0    False
1    False
2    False
3    False
4    False
dtype: bool

In [25]:
elections[(elections['Result'] == 'win') & (elections['%'] < 50)]


Unnamed: 0,Candidate,Party,%,Year,Result
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
14,Bush,Republican,47.9,2000,win
22,Trump,Republican,46.1,2016,win


The `|` operator is the symbol for "or".

In [26]:
elections[(elections['Party'] == 'Republican')
          | (elections['Party'] == "Democratic")]

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss
7,Clinton,Democratic,43.0,1992,win
8,Bush,Republican,37.4,1992,loss
10,Clinton,Democratic,49.2,1996,win
11,Dole,Republican,40.7,1996,loss


If we have multiple conditions on the same column (say Republican or Democratic), we can use the `isin` method to simplify our code.

In [27]:
elections['Party'].isin(["Republican", "Democratic"])

0      True
1      True
2     False
3      True
4      True
5      True
6      True
7      True
8      True
9     False
10     True
11     True
12    False
13     True
14     True
15     True
16     True
17     True
18     True
19     True
20     True
21     True
22     True
Name: Party, dtype: bool

In [28]:
elections[elections['Party'].isin(["Republican", "Democratic"])]

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss
7,Clinton,Democratic,43.0,1992,win
8,Bush,Republican,37.4,1992,loss
10,Clinton,Democratic,49.2,1996,win
11,Dole,Republican,40.7,1996,loss
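
As a quick check that the two spellings agree, here is a sketch on a hand-typed five-row table (`elections_small` is an assumption so the block runs on its own). It also uses `~`, which negates a boolean Series and has not been introduced above.

```python
import pandas as pd

elections_small = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale"],
    "Party": ["Republican", "Democratic", "Independent", "Republican", "Democratic"],
    "Year": [1980, 1980, 1980, 1984, 1984],
})

major = ["Republican", "Democratic"]

# Two conditions joined with | ...
with_or = elections_small[(elections_small["Party"] == "Republican")
                          | (elections_small["Party"] == "Democratic")]
# ... and the same filter written with isin.
with_isin = elections_small[elections_small["Party"].isin(major)]

# ~ flips True and False, selecting everyone NOT in the list.
others = elections_small[~elections_small["Party"].isin(major)]

print(with_or.equals(with_isin))
print(others)
```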


An alternative, often simpler way to get back a specific set of rows is to use the `query` method.

In [29]:
elections.query?

Signature: elections.query(expr: 'str', inplace: 'bool' = False, **kwargs)
Docstring:
Query the columns of a DataFrame with a boolean expression.

Parameters
----------
expr : str
    The query string to evaluate.

    You can refer to variables
    in the environment by prefixing them with an '@' character like
    ``@a + b``.

    You can refer to column names that are not valid Python variable names
    by surrounding them in backticks. Thus, column names containing spaces
    or punctuations (besides underscores) or starting with digits must be
    surrounded by backticks. (For example, a column named "Area (cm^2)" would
    be referenced as ```Area (cm^2)```). Column names which are Python keywords
    (like "list", "for", "import", etc) cannot be used.

    For ex

In [30]:
elections.query("Result == 'win' and Year < 2000")

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
3,Reagan,Republican,58.8,1984,win
5,Bush,Republican,53.4,1988,win
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
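
The docstring above mentions that `@` lets a query string refer to a Python variable. A small sketch, using a hand-typed five-row table and a made-up variable `cutoff`, shows this alongside the equivalent boolean mask.

```python
import pandas as pd

elections_small = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6],
    "Year": [1980, 1980, 1980, 1984, 1984],
    "Result": ["win", "loss", "loss", "win", "loss"],
})

cutoff = 1984  # an ordinary Python variable, referenced below with @

by_query = elections_small.query("Result == 'win' and Year < @cutoff")
by_mask = elections_small[(elections_small["Result"] == "win")
                          & (elections_small["Year"] < cutoff)]

print(by_query)
```

Both forms select the same rows; which one reads better is largely a matter of taste.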


## Label-based access with `loc`

In [31]:
elections.head(5)

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss


In [32]:
elections.loc[[0, 1, 2, 3, 4], ['Candidate','Party', 'Year']]

Unnamed: 0,Candidate,Party,Year
0,Reagan,Republican,1980
1,Carter,Democratic,1980
2,Anderson,Independent,1980
3,Reagan,Republican,1984
4,Mondale,Democratic,1984


## Warning: Row Labels


We didn't do it above, but it's possible to use names for rows as well as columns.

Note: `loc` won't work with numeric row arguments if we're using a DataFrame whose rows are labeled with names instead.
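
Here is a sketch of that situation. `mottos_small` is a three-row table typed in by hand with state names as row labels; the flag `numeric_label_failed` is a name invented for this example.

```python
import pandas as pd

# Rows are labeled by state name rather than by 0, 1, 2.
mottos_small = pd.DataFrame(
    {"Language": ["Latin", "English", "Latin"]},
    index=pd.Index(["Alabama", "Alaska", "Arizona"], name="State"),
)

by_label = mottos_small.loc["Alaska", "Language"]  # works: "Alaska" is a row label
by_position = mottos_small.iloc[1, 0]              # works: second row, first column

numeric_label_failed = False
try:
    mottos_small.loc[1]  # fails: no row is labeled 1
except KeyError:
    numeric_label_failed = True

print(by_label, by_position, numeric_label_failed)
```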


`loc` also supports slicing (for all types, including numeric and string labels!). Note that slicing with `loc` is **inclusive** of the end label, even for numeric slices.

In [33]:
elections.loc[0:4, 'Candidate':'Year']

Unnamed: 0,Candidate,Party,%,Year
0,Reagan,Republican,50.7,1980
1,Carter,Democratic,41.0,1980
2,Anderson,Independent,6.6,1980
3,Reagan,Republican,58.8,1984
4,Mondale,Democratic,37.6,1984


If we omit the column argument altogether, the default behavior is to retrieve all columns. 

In [34]:
elections.loc[[2, 4, 5]]

Unnamed: 0,Candidate,Party,%,Year,Result
2,Anderson,Independent,6.6,1980,loss
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win


`loc` also supports boolean array inputs instead of labels. Each boolean array _must_ be the same length as the axis it filters: the number of rows for the row mask and the number of columns for the column mask. (Versions prior to 0.25 allowed size mismatches and assumed the missing values were all False; [this was changed in 2019](https://github.com/pandas-dev/pandas/pull/26911).)

In [35]:
elections.loc[[True, False, False, True, False, False, True, True, True, False, False, True, 
               True, True, False, True, True, False, False, False, True, False, False], # row mask
              [True, False, False, True, True] # column mask
             ]

Unnamed: 0,Candidate,Year,Result
0,Reagan,1980,win
3,Reagan,1984,win
6,Dukakis,1988,loss
7,Clinton,1992,win
8,Bush,1992,loss
11,Dole,1996,loss
12,Perot,1996,loss
13,Gore,2000,loss
15,Kerry,2004,loss
16,Bush,2004,win


In [36]:
elections.loc[[0, 3], ['Candidate', 'Year']]

Unnamed: 0,Candidate,Year
0,Reagan,1980
3,Reagan,1984


We can use boolean array arguments for one axis of the data, and labels for the other.

In [37]:
elections.loc[[True, False, False, True, False, False, True, True, True, False, False, True, 
               True, True, False, True, True, False, False, False, True, False, False], # row mask
              
              'Candidate':'%' # column label slice
             ]

Unnamed: 0,Candidate,Party,%
0,Reagan,Republican,50.7
3,Reagan,Republican,58.8
6,Dukakis,Democratic,45.6
7,Clinton,Democratic,43.0
8,Bush,Republican,37.4
11,Dole,Republican,40.7
12,Perot,Independent,8.4
13,Gore,Democratic,48.4
15,Kerry,Democratic,48.3
16,Bush,Republican,50.7


What do you think happens if you give a single value for both the requested row AND the requested column?

In [38]:
elections.loc[15, '%']

48.3

## Positional access with `iloc`

`loc`'s cousin `iloc` is very similar, but it accesses data by numerical position instead of label. For example, to access the top 3 rows and first 3 columns of a table, we can use `[0:3, 0:3]`. `iloc` slicing is **exclusive** of the end position, just like standard Python slicing of numerical values.

In [39]:
elections.head(5)

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss


In [40]:
elections.iloc[:3, 2:]

Unnamed: 0,%,Year,Result
0,50.7,1980,win
1,41.0,1980,loss
2,6.6,1980,loss
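
The inclusive versus exclusive difference is easy to trip over, so here is a side-by-side sketch on a hand-typed five-row table (`elections_small` is an assumption so the block is self-contained).

```python
import pandas as pd

elections_small = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale"],
    "Party": ["Republican", "Democratic", "Independent", "Republican", "Democratic"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6],
    "Year": [1980, 1980, 1980, 1984, 1984],
})

# loc: labels, end INCLUDED -> rows 0, 1, 2 and columns Candidate, Party, %
by_label = elections_small.loc[0:2, "Candidate":"%"]

# iloc: positions, end EXCLUDED -> rows 0, 1 and columns 0, 1
by_position = elections_small.iloc[0:2, 0:2]

print(by_label.shape)
print(by_position.shape)
```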


We will use both `loc` and `iloc` in the course. `loc` is generally preferred for a number of reasons, for example:

1. It is harder to make mistakes since you have to literally write out what you want to get.
2. Code is easier to read, because the reader doesn't have to know e.g. what column #31 represents.
3. It is robust against permutations of the data, e.g. the Social Security Administration switches the order of two columns.

However, `iloc` is sometimes more convenient. We'll provide examples of when `iloc` is the superior choice.
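
The third point can be illustrated with a short sketch: after the columns are reordered, a label-based selection still returns the same data while a position-based one silently returns something else. The tables here are hand-typed assumptions.

```python
import pandas as pd

original = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democratic", "Independent"],
    "Year": [1980, 1980, 1980],
})

# The same data with the columns in a different order.
reordered = original[["Year", "Candidate", "Party"]]

# Label-based selection is unaffected by the reordering.
same_by_label = original.loc[:, "Candidate"].equals(reordered.loc[:, "Candidate"])

# Position 0 now means a different column.
first_before = original.iloc[:, 0].name
first_after = reordered.iloc[:, 0].name

print(same_by_label, first_before, first_after)
```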

## Quick Challenge

Which of the following expressions returns a DataFrame with the Candidate and Year of the first 3 candidates that won with more than 50% of the vote?

In [41]:
elections.head(10)

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss
7,Clinton,Democratic,43.0,1992,win
8,Bush,Republican,37.4,1992,loss
9,Perot,Independent,18.9,1992,loss


In [42]:
elections.iloc[[0, 3, 5], [0, 3]]

Unnamed: 0,Candidate,Year
0,Reagan,1980
3,Reagan,1984
5,Bush,1988


In [43]:
elections.loc[[0, 3, 5], "Candidate":"Year"]

Unnamed: 0,Candidate,Party,%,Year
0,Reagan,Republican,50.7,1980
3,Reagan,Republican,58.8,1984
5,Bush,Republican,53.4,1988


In [44]:
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].head(3)

Unnamed: 0,Candidate,Year
0,Reagan,1980
3,Reagan,1984
5,Bush,1988


In [45]:
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].iloc[0:2, :]

Unnamed: 0,Candidate,Year
0,Reagan,1980
3,Reagan,1984


## Sampling

Pandas DataFrames also make it easy to get a sample. We simply use the `sample` method and provide the number of samples that we'd like as the argument. Sampling is done without replacement by default. Set `replace=True` if you want replacement.

In [46]:
elections.sample(10)

Unnamed: 0,Candidate,Party,%,Year,Result
6,Dukakis,Democratic,45.6,1988,loss
7,Clinton,Democratic,43.0,1992,win
8,Bush,Republican,37.4,1992,loss
22,Trump,Republican,46.1,2016,win
2,Anderson,Independent,6.6,1980,loss
10,Clinton,Democratic,49.2,1996,win
5,Bush,Republican,53.4,1988,win
19,Obama,Democratic,51.1,2012,win
4,Mondale,Democratic,37.6,1984,loss
18,McCain,Republican,45.7,2008,loss


In [47]:
elections.query("Year < 1992").sample(50, replace=True)


Unnamed: 0,Candidate,Party,%,Year,Result
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
3,Reagan,Republican,58.8,1984,win
1,Carter,Democratic,41.0,1980,loss
3,Reagan,Republican,58.8,1984,win
3,Reagan,Republican,58.8,1984,win
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
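
Because `sample` is random, the rows above will differ from run to run. If you need repeatable results, one option is the `random_state` parameter, sketched here on a hand-typed five-row table (the seed values are arbitrary choices for this example).

```python
import pandas as pd

elections_small = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale"],
    "Year": [1980, 1980, 1980, 1984, 1984],
})

# The same seed gives the same sample every time.
first = elections_small.sample(3, random_state=42)
second = elections_small.sample(3, random_state=42)

# With replacement we can draw more rows than the table contains,
# so some rows necessarily repeat.
bootstrap = elections_small.sample(50, replace=True, random_state=0)

print(first.equals(second))
print(len(bootstrap), bootstrap.index.nunique())
```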


## Handy Properties and Utility Functions for Series and DataFrames

#### Python Operations on Numerical DataFrames and Series

Consider a series of only the vote percentages of election winners.

In [48]:
winners = elections.query("Result == 'win'")["%"]
winners

0     50.7
3     58.8
5     53.4
7     43.0
10    49.2
14    47.9
16    50.7
17    52.9
19    51.1
22    46.1
Name: %, dtype: float64

We can perform various Python operations (including numpy operations) on DataFrames and Series.

In [49]:
max(winners)

58.8

In [50]:
np.mean(winners)

50.38000000000001
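
Series also provide their own versions of these operations as methods. The sketch below rebuilds the `winners` Series by hand from the output above so that it runs without the CSV file.

```python
import pandas as pd

# Winning vote percentages, copied from the output above with their row labels.
winners = pd.Series(
    [50.7, 58.8, 53.4, 43.0, 49.2, 47.9, 50.7, 52.9, 51.1, 46.1],
    index=[0, 3, 5, 7, 10, 14, 16, 17, 19, 22],
    name="%",
)

largest = winners.max()          # same value as max(winners)
average = winners.mean()         # same value as np.mean(winners), up to rounding
largest_row = winners.idxmax()   # the row LABEL where the maximum occurs

print(largest, round(average, 2), largest_row)
```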

#### Handy Utility Methods

The `head` and `describe` methods and the `shape` and `size` attributes can be used to quickly get a good sense of the data we're working with. Remember when I said above that we can use names for labeling rows? This is a good dataset to demonstrate that.

In [51]:
mottos = pd.read_csv("mottos.csv", index_col="State")

In [52]:
mottos.head(20)

State,Motto,Translation,Language,Date Adopted
Alabama,Audemus jura nostra defendere,We dare defend our rights!,Latin,1923
Alaska,North to the future,—,English,1967
Arizona,Ditat Deus,God enriches,Latin,1863
Arkansas,Regnat populus,The people rule,Latin,1907
California,Eureka (Εὕρηκα),I have found it,Greek,1849
Colorado,Nil sine numine,Nothing without providence.,Latin,"November 6, 1861"
Connecticut,Qui transtulit sustinet,He who transplanted sustains,Latin,"October 9, 1662"
Delaware,Liberty and Independence,—,English,1847
Florida,In God We Trust,—,English,1868
Georgia,"Wisdom, Justice, Moderation",—,English,1798


In [53]:
mottos.size

200

The size counts every entry in the table (rows times columns), so a size of 200 means our data file is relatively small.

In [54]:
mottos.shape

(50, 4)

Since we're looking at data for states, and we see the number 50, we've most likely got a complete dataset that omits Washington D.C. and U.S. territories like Guam and Puerto Rico.
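
The relationship between `shape`, `size`, and `len` can be checked on a tiny example. `mottos_small` is a three-row table typed in by hand from the rows shown above.

```python
import pandas as pd

mottos_small = pd.DataFrame(
    {
        "Motto": ["Audemus jura nostra defendere", "North to the future", "Ditat Deus"],
        "Language": ["Latin", "English", "Latin"],
    },
    index=pd.Index(["Alabama", "Alaska", "Arizona"], name="State"),
)

n_rows, n_cols = mottos_small.shape  # the index is not counted as a column

print(mottos_small.shape)  # (rows, columns)
print(mottos_small.size)   # rows * columns
print(len(mottos_small))   # number of rows only
```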

In [55]:
mottos.describe()

Unnamed: 0,Motto,Translation,Language,Date Adopted
count,50,49,50,50
unique,50,30,8,47
top,Audemus jura nostra defendere,—,Latin,1893
freq,1,20,23,2


Above, we see a quick summary of all the data. For example, the most common language for mottos is Latin, which covers 23 different states. Does anything else seem surprising?

We can get a direct reference to the index using .index.

In [56]:
mottos.index

Index(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado',
       'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho',
       'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
       'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota',
       'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'],
      dtype='object', name='State')

We can also access individual properties of the index, for example, `mottos.index.name`.

In [57]:
mottos.index.name

'State'

This reflects the fact that in our data frame, the index IS the state name!

In [58]:
mottos.head(2)

State,Motto,Translation,Language,Date Adopted
Alabama,Audemus jura nostra defendere,We dare defend our rights!,Latin,1923
Alaska,North to the future,—,English,1967


It turns out the columns also have an Index. We can access this index by using `.columns`.

In [59]:
mottos.columns

Index(['Motto', 'Translation', 'Language', 'Date Adopted'], dtype='object')


There are also a ton of useful utility methods we can use with Data Frames and Series. For example, we can create a copy of a data frame sorted by a specific column using `sort_values`.

In [60]:
elections.sort_values('%')

Unnamed: 0,Candidate,Party,%,Year,Result
2,Anderson,Independent,6.6,1980,loss
12,Perot,Independent,8.4,1996,loss
9,Perot,Independent,18.9,1992,loss
8,Bush,Republican,37.4,1992,loss
4,Mondale,Democratic,37.6,1984,loss
11,Dole,Republican,40.7,1996,loss
1,Carter,Democratic,41.0,1980,loss
7,Clinton,Democratic,43.0,1992,win
6,Dukakis,Democratic,45.6,1988,loss
18,McCain,Republican,45.7,2008,loss

Like most DataFrame methods, `sort_values` returns a copy and does **not** modify the original data structure, unless you set `inplace=True`.

In [61]:
elections.head(5)

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
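
A short sketch makes the copy behavior concrete. `elections_small` is a hand-typed five-row table, and `sorted_copy` is a name chosen for this example.

```python
import pandas as pd

elections_small = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6],
})

sorted_copy = elections_small.sort_values("%", ascending=False)

# The sorted copy keeps the original row labels, just in a new order...
print(list(sorted_copy.index))

# ...while the original table is untouched.
print(list(elections_small.index))
```

To keep the sorted version, assign the result to a variable, as done with `sorted_copy` here.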


If we want to sort in reverse order, we can set `ascending=False`.

In [62]:
elections.sort_values('%', ascending=False)

Unnamed: 0,Candidate,Party,%,Year,Result
3,Reagan,Republican,58.8,1984,win
5,Bush,Republican,53.4,1988,win
17,Obama,Democratic,52.9,2008,win
19,Obama,Democratic,51.1,2012,win
0,Reagan,Republican,50.7,1980,win
16,Bush,Republican,50.7,2004,win
10,Clinton,Democratic,49.2,1996,win
13,Gore,Democratic,48.4,2000,loss
15,Kerry,Democratic,48.3,2004,loss
21,Clinton,Democratic,48.2,2016,loss


We can also use `sort_values` on Series objects.

In [63]:
mottos['Language'].sort_values(ascending=False).head(10)

State
Montana           Spanish
Maine               Latin
West Virginia       Latin
Virginia            Latin
Vermont             Latin
South Carolina      Latin
Oregon              Latin
Oklahoma            Latin
North Dakota        Latin
North Carolina      Latin
Name: Language, dtype: object

For Series, the `value_counts` method is often quite handy.

In [64]:
elections['Party'].value_counts()

Republican     10
Democratic     10
Independent     3
Name: Party, dtype: int64

In [65]:
mottos['Language'].value_counts()

Latin             23
English           21
Greek              1
Hawaiian           1
Italian            1
French             1
Spanish            1
Chinook Jargon     1
Name: Language, dtype: int64

Also commonly used is the `unique` method, which returns all unique values as a numpy array.

In [66]:
mottos['Language'].unique()

array(['Latin', 'English', 'Greek', 'Hawaiian', 'Italian', 'French',
       'Spanish', 'Chinook Jargon'], dtype=object)
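
When only the number of distinct values matters, the `nunique` method gives it directly. The sketch below compares the three related methods on a short Series typed in from the first five states above.

```python
import pandas as pd

languages = pd.Series(
    ["Latin", "English", "Latin", "Latin", "Greek"],
    index=["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    name="Language",
)

distinct = languages.unique()       # the distinct values, in order of first appearance
n_distinct = languages.nunique()    # how many distinct values there are
counts = languages.value_counts()   # how often each value occurs, most common first

print(list(distinct))
print(n_distinct)
print(counts)
```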

## Baby Names Data

Now let's play around a bit with a large baby names dataset that is publicly available. We'll start by loading that dataset from the Social Security Administration's website.

To keep the data small enough to avoid crashing datahub, we're going to look at only California rather than looking at the national dataset.

In [67]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.sample(5)

Unnamed: 0,State,Sex,Year,Name,Count
20807,CA,F,1944,Lucille,84
357434,CA,M,2008,Sidney,25
386226,CA,M,2018,Sterling,59
197488,CA,F,2011,Ysabelle,5
309431,CA,M,1989,Donavon,6


In [68]:
#Note that the babynames dataset includes both numeric and non-numeric information.  describe() defaults to just showing
# numbers for mixed datasets, include='all' will show summaries for all columns.  But look at this.  Just because it can
# doesn't mean an output is useful. 
babynames.describe(include='all')

Unnamed: 0,State,Sex,Year,Name,Count
count,394179,394179,394179.0,394179,394179.0
unique,1,2,,20029,
top,CA,F,,Jean,
freq,394179,232102,,219,
mean,,,1984.534473,,80.361912
std,,,26.638252,,297.136533
min,,,1910.0,,5.0
25%,,,1968.0,,7.0
50%,,,1991.0,,13.0
75%,,,2006.0,,39.0


# Exercises

Here is a list of questions to answer using the babynames dataset.

What was the most popular name given in any year?

What was the most popular male name?

What was the most popular female name?

What was the most popular female and male name in 2018? 

What were the top-10 names for the year you were born?


## Harder goals:

How many unique names exist in the dataset? 

How many different names were given to males compared with females? In 1960? In 2020?


What other questions can we ask?  Give me the questions.  


In [69]:
#what was the most popular name given in any year?

babynames.sort_values('Count').tail(1)

Unnamed: 0,State,Sex,Year,Name,Count
259583,CA,M,1956,Michael,8259


In [70]:
#What was the most popular male name?
babynames[babynames['Sex']=='M'].sort_values('Count').tail(1)

Unnamed: 0,State,Sex,Year,Name,Count
259583,CA,M,1956,Michael,8259


In [71]:
#What was the most popular female name?
babynames[babynames['Sex']=='F'].sort_values('Count').tail(1)

Unnamed: 0,State,Sex,Year,Name,Count
116363,CA,F,1991,Jessica,6951


In [72]:
#What was the most popular female and male name in 2018? 
mostPopularFemale2018 = babynames[ (babynames['Sex']=='F') & (babynames['Year']==2018)].sort_values('Count').tail(1)
mostPopularMale2018 = babynames[ (babynames['Sex']=='M') & (babynames['Year']==2018)].sort_values('Count').tail(1)

print(mostPopularFemale2018)
print(mostPopularMale2018)

       State Sex  Year  Name  Count
221160    CA   F  2018  Emma   2743
       State Sex  Year  Name  Count
385701    CA   M  2018  Noah   2569


In [73]:
#What were the top-10 names for the year you were born?

babynames[ (babynames['Year']==2000)].sort_values('Count').tail(10)


Unnamed: 0,State,Sex,Year,Name,Count
335012,CA,M,2000,Matthew,3254
335011,CA,M,2000,David,3280
335010,CA,M,2000,Christopher,3336
335009,CA,M,2000,Joshua,3356
335008,CA,M,2000,Jacob,3520
335007,CA,M,2000,Michael,3572
335006,CA,M,2000,Andrew,3600
335005,CA,M,2000,Jose,3804
335004,CA,M,2000,Anthony,3839
335003,CA,M,2000,Daniel,4342
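
The `sort_values(...).tail(n)` pattern used in these answers lists the largest value last. As an alternative sketch, `nlargest` returns the top rows with the largest first, and `idxmax` gives the label of the single largest row. The four rows below are copied from the output above; `year2000` is a name made up for this example.

```python
import pandas as pd

year2000 = pd.DataFrame(
    {
        "Name": ["Andrew", "Jose", "Anthony", "Daniel"],
        "Count": [3600, 3804, 3839, 4342],
    },
    index=[335006, 335005, 335004, 335003],
)

top2 = year2000.nlargest(2, "Count")          # largest first
top_label = year2000["Count"].idxmax()        # row label of the largest Count
top_name = year2000.loc[top_label, "Name"]

print(top2)
print(top_name)
```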


In [74]:
#How many unique names exist in the dataset? 
babynames['Name'].unique().shape


(20029,)

In [77]:
#How many different names were given to Males compared with Females? In 1960? in 2020?
FemaleNamesIn1960= babynames[ (babynames['Year']==1960) & (babynames['Sex']=='F') ]
MaleNamesIn1960= babynames[ (babynames['Year']==1960) & (babynames['Sex']=='M') ]

FemaleNamesIn2020= babynames[ (babynames['Year']==2020) & (babynames['Sex']=='F') ]
MaleNamesIn2020= babynames[ (babynames['Year']==2020) & (babynames['Sex']=='M') ]

#Here I am showing how to print values with some extra information.
#You just separate the string in "" from the variable you want to show with a  comma: ,
print("Unique Female Names in 1960: ", FemaleNamesIn1960['Name'].unique().shape[0])
print("Unique Male Names in 1960: ", MaleNamesIn1960['Name'].unique().shape[0])
print("Unique Female Names in 2020: ", FemaleNamesIn2020['Name'].unique().shape[0])
print("Unique Male Names in 2020: ", MaleNamesIn2020['Name'].unique().shape[0])




Unique Female Names in 1960:  1777
Unique Male Names in 1960:  1125
Unique Female Names in 2020:  3593
Unique Male Names in 2020:  2770
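
The four nearly identical filters above could also be wrapped in a small helper. This is only a sketch: `count_unique_names` is a hypothetical function, and `names_demo` is a tiny made-up table with invented counts, not the real babynames data.

```python
import pandas as pd

def count_unique_names(df, year, sex):
    """Number of distinct names given to babies of one sex in one year."""
    mask = (df["Year"] == year) & (df["Sex"] == sex)
    return df.loc[mask, "Name"].nunique()

# A made-up table with the same columns as babynames.
names_demo = pd.DataFrame({
    "State": ["CA"] * 6,
    "Sex": ["F", "F", "F", "M", "M", "F"],
    "Year": [1960, 1960, 1960, 1960, 1960, 2020],
    "Name": ["Mary", "Susan", "Karen", "John", "David", "Emma"],
    "Count": [10, 9, 8, 12, 11, 7],
})

for year in (1960, 2020):
    for sex in ("F", "M"):
        print(year, sex, count_unique_names(names_demo, year, sex))
```

A year and sex with no matching rows simply gives 0.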
