# PANDAS

A high-level overview of the [Pandas](https://pandas.pydata.org) library.


## Why `pandas`?

`pandas` is a Python library for data manipulation and analysis. It is an industrial-strength package used in most real-world data analysis projects. Learning pandas will also make your projects easier for other data scientists to understand, which extends the influence your work can have.



In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('fivethirtyeight')
#sns.set_context("notebook")

![alt text](pandas_illustration.jpeg)

# Series

A "series" is the basic building block of pandas data. It acts a lot like a "dictionary" (a dictionary is a built-in type in Python): each value is stored under a label. In pandas, the collection of labels is called the "index", and it is used to identify entries.


In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

In [None]:
data['b']

In [None]:
'a' in data

In [None]:
data.keys()

Extending/adding to a series:

In [None]:
data['e'] = 1.25
data

We made a series above. Now let's make a simple dataframe with several columns. Each column of a dataframe is itself a series.

In [None]:

# There are many ways to make data frames. Here I use the Python "dictionary" type,
# indicated by the curly braces {}. Each key names a column, and the list []
# stored under that key holds the values of that column.
myDictionary = {
    "Key": [1,2],
    "Author": ["Dr Seuss","Stephen King"], 
    "Book Title": ["cat in the hat","It"]
}

df = pd.DataFrame( myDictionary )

df
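A dataframe can also be assembled directly from series objects. The following is a minimal sketch, assuming two small series (`authors` and `titles` are illustrative names, not part of the notebook's data) that share the same index labels:

```python
import pandas as pd

# Hypothetical example data: two Series that share the same index labels.
authors = pd.Series(["Dr Seuss", "Stephen King"], index=[1, 2])
titles = pd.Series(["cat in the hat", "It"], index=[1, 2])

# Each dictionary key becomes a column name; rows are aligned on the index.
books = pd.DataFrame({"Author": authors, "Book Title": titles})
print(books)
```

Because the rows are matched by index label rather than by position, the two series do not need to be in the same order.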

## Reading in DataFrames from Files

Pandas has a number of very useful file reading tools. This link describes many of them: https://realpython.com/pandas-read-write-files/ Today we'll be using `read_csv`. Another very useful one is the ability to read Excel files.


A "csv" file is a "comma separated value" file.  It's a nice and simple text format that separates things in the files by commas.  For example:
Participant,ResponseTime
1,0.50
2,.0386

This is a fairly common file format that can be read by almost every program (e.g. Excel, SPSS, Python, R).
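To see how `read_csv` turns that text into a table, here is a small sketch that reads the two-row example above from an in-memory string instead of a file. The use of `io.StringIO` and the variable names are assumptions made for illustration; with a real file you would pass the file name.

```python
import io
import pandas as pd

# The same text as the example above, held in a string instead of a file.
csv_text = "Participant,ResponseTime\n1,0.50\n2,.0386\n"

# read_csv accepts a file name or any file-like object.
responses = pd.read_csv(io.StringIO(csv_text))
print(responses)
print(responses.dtypes)  # the column types are inferred from the text
```

The first line of the text becomes the column names, and each later line becomes one row.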


Pandas stores things in something known as a "dataframe". 


In [None]:
elections = pd.read_csv("elections.csv")
elections # if we end a cell with an expression or variable name, the result will print

We can use `shape` to get the number of rows and columns in this dataset.

In [None]:
elections.shape

We can use the head command to return only a few rows of a dataframe.

In [None]:
elections.head(10)

There is also a tail command.

In [None]:
elections.tail(7)

Column names are ideally unique. If we read in a file whose column names are not unique, Pandas will automatically rename the duplicates. This is good to know, because many datasets in the wild have duplicate names.

In [None]:
dups = pd.read_csv("duplicate_columns.csv")
dups
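If `duplicate_columns.csv` is not at hand, the renaming can be reproduced with a tiny in-memory file. This is a sketch with made-up column names; it assumes the default `read_csv` behavior, which appends a numeric suffix to repeated names.

```python
import io
import pandas as pd

# Hypothetical CSV text in which the column name "score" appears twice.
csv_text = "name,score,score\nA,1,2\nB,3,4\n"

dup_example = pd.read_csv(io.StringIO(csv_text))

# The second "score" column is renamed so that every label is unique.
print(dup_example.columns.tolist())
```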

## Indexing, Slicing, Dicing

After reading in data, the most common operation is selecting data. Pandas dataframes offer a bunch of powerful ways to access data. I'll step through a few now.

The DataFrame class has an indexing operator [] that lets you do a variety of different things. If you provide a string to the [] operator, you get back a Series corresponding to the requested label.

This is where the syntax starts to get a bit confusing.

### Selection Using Label/Index, with `loc`

**Column Selection** 

To select a column of a `DataFrame` by column label, the safest and fastest way is to use the `.loc` [method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html). General usage of `.loc` looks like `df.loc[rowname, colname]`. Remember that the colon `:` means "everything." For example, if we want the `color` column of the `ex` DataFrame, we would use: `ex.loc[:, 'color']`

- You can also slice across columns. For example, `baby_names.loc[:, 'Name':]` would select the column `Name` and all columns after `Name`.

- *Alternative:* While `.loc` is invaluable when writing production code, it may be a little too verbose for interactive use. One recommended alternative is the `[]` method, which takes on the form `df['colname']`.

**Row Selection**

Similarly, if we want to select a row by its label, we can use the same `.loc` method. In this case, the "label" of each row refers to the index (i.e. the primary key) of the DataFrame.
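The column and row selections described above can be sketched on a small table. The `ex` DataFrame here is a made-up stand-in with a `color` column and string row labels; none of its values come from the course data.

```python
import pandas as pd

# Hypothetical stand-in for the `ex` DataFrame mentioned above.
ex = pd.DataFrame(
    {"color": ["red", "green", "blue"], "size": [1, 2, 3]},
    index=["apple", "lime", "berry"],
)

color_column = ex.loc[:, "color"]   # all rows, one column: a Series
same_column = ex["color"]           # the shorter [] form gives the same Series
from_color = ex.loc[:, "color":]    # "color" and every column after it
lime_row = ex.loc["lime", :]        # one row, selected by its index label

print(color_column)
print(lime_row)
```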

### We will go through a bunch of examples now. 


In [None]:
#Show the first 6 values. 
...

In [None]:
# Show just the Candidate names for the first 6 values. 


The [] operator also accepts a list of strings. In this case, you get back a DataFrame corresponding to the requested strings.

In [None]:
elections[["Candidate", "Party"]].head()

The [] operator also accepts numerical slices as arguments. In this case, we are indexing by row, not column!

## Which can get really, really confusing!!

In [None]:
elections[0:3]

The way to think of this is that the table is fundamentally a table with rows and columns.  Columns have names, rows have numbers by default. What we did above was shorthand.  When we didn't ask for a specific column or row we got all of them back.  

When you start selecting both rows and columns, there are a lot of [] to keep track of.   

In [None]:
elections[["Candidate","Party"]][0:3]

If you provide a single argument to the [] operator, it tries to use it as a name. This is true even if the argument passed to [] is an integer.  The next cell has an intentional error.   You will see these "KeyError" messages often when working with pandas.   It just means it can't find what you're looking for.  Usually because of a typo. 

In [None]:
elections[0] # this does not work: there is no column named 0, so it raises a KeyError
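The same failure can be reproduced and handled on a small table. This is a sketch with made-up data (`small` is a hypothetical DataFrame); it also shows one way to get the first row by position instead.

```python
import pandas as pd

# Hypothetical two-row table with string column names.
small = pd.DataFrame({"Candidate": ["A", "B"], "Year": [1980, 1984]})

try:
    small[0]  # 0 is looked up as a column label, and no column is named 0
    raised_key_error = False
except KeyError:
    raised_key_error = True
    print("KeyError: no column labeled 0")

# To get the first row by position, use iloc instead.
first_row = small.iloc[0]
print(first_row)
```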

Another common confusion is that the number 1 is treated as **different** from the string "1".

Yes.  This can be annoying.  

In [None]:
weird = pd.DataFrame({
    1:["topdog","botdog"], 
    "1":["topcat","botcat"]
})
weird

In [None]:
weird[1] #try to predict the output

In [None]:
weird["1"] #try to predict the output

In [None]:
weird[1:] #try to predict the output

## Boolean Array Selection

Now let's start doing some more interesting things. 

The `[]` operator also supports an array of booleans as an input. In this case, the array must be exactly as long as the number of rows. The result is a filtered version of the data frame, where only rows corresponding to True appear.

In [None]:
elections

In [None]:
elections[[False, False, False, False, False, 
          False, False, True, False, False,
          True, False, False, False, True,
          False, False, False, False, False,
          False, False, True]]

One very common task in data science is filtering, and Boolean Array Selection is one way to achieve this in Pandas. We start by observing that logical operators like the equality operator can be applied to Pandas Series data to generate a Boolean Array. For example, we can compare the 'Result' column to the string 'win':

In [None]:
elections

In [None]:
iswin = elections['Result'] == 'win'
iswin.head(5)

The output of the logical operator applied to the Series is another Series with the same name and index, but of datatype boolean. The entry at row #i represents the result of the application of that operator to the entry of the original Series at row #i.

Such a boolean Series can be used as an argument to the [] operator. For example, the following code creates a DataFrame of all election winners since 1980.

In [None]:
elections[iswin]

Above, we've assigned the result of the logical operator to a new variable called `iswin`. This is uncommon. Usually, the series is created and used on the same line. 

This syntax is a little tricky to read at first, but you'll get used to it quickly.

In [None]:
elections[elections['Result'] == 'win']

We can select multiple criteria by creating multiple boolean Series and combining them using the `&` operator.

In [None]:
win50plus = (elections['Result'] == 'win') & (elections['%'] < 50)

In [None]:
win50plus.head(5)

In [None]:
elections[(elections['Result'] == 'win') & (elections['%'] < 50)]
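The parentheses around each condition matter: in Python, `&` binds more tightly than `==` and `<`, so leaving them out changes the meaning of the expression. The sketch below uses a made-up four-row table (`votes` is hypothetical) and also shows `~`, which negates a boolean Series.

```python
import pandas as pd

# Hypothetical data with the same column names as the elections table.
votes = pd.DataFrame({
    "Result": ["win", "loss", "win", "loss"],
    "%": [50.7, 41.0, 43.0, 37.4],
})

is_win = votes["Result"] == "win"   # boolean Series, one entry per row
under_50 = votes["%"] < 50          # another boolean Series

combined = is_win & under_50        # True only where both are True
filtered = votes[combined]
losers = votes[~is_win]             # ~ flips True and False

print(filtered)
print(losers)
```

Python's `and`, `or`, and `not` keywords do not work element by element on a Series, which is why `&`, `|`, and `~` are used instead.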


The | operator is the symbol for or.

In [None]:
elections[(elections['Party'] == 'Republican')
          | (elections['Party'] == "Democratic")]

If we have multiple conditions on the same column (say Republican or Democratic), we can use the `isin` method to simplify our code.

In [None]:
elections['Party'].isin(["Republican", "Democratic"])

In [None]:
elections[elections['Party'].isin(["Republican", "Democratic"])]

An alternative, simpler way to get back a specific set of rows is to use the `query` method.

In [None]:
elections.query?

In [None]:
elections.query("Result == 'win' and Year < 2000")
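A `query` string selects the same rows as the equivalent boolean mask. Here is a small sketch on made-up data (`small` and its values are hypothetical) comparing the two forms:

```python
import pandas as pd

# Hypothetical data with a few of the elections column names.
small = pd.DataFrame({
    "Candidate": ["A", "B", "C", "D"],
    "Year": [1992, 1996, 2004, 2008],
    "Result": ["win", "win", "win", "loss"],
})

# The condition is written as a string, using column names directly.
by_query = small.query("Result == 'win' and Year < 2000")

# The same selection written with boolean Series.
by_mask = small[(small["Result"] == "win") & (small["Year"] < 2000)]

print(by_query)
```

Inside the query string, `and` and `or` can be used in place of `&` and `|`.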

## Label-based access with `loc`

In [None]:
elections.head(5)

In [None]:
elections.loc[[0, 1, 2, 3, 4], ['Candidate','Party', 'Year']]

## Warning here.  


We didn't do it above.  But it's possible to use names for rows as well as columns.  

Note: The `loc` command won't work with numeric arguments if we're using a dataframe that has labeled rows instead.
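As an illustration of that warning, here is a sketch with a made-up table whose rows are labeled by name (`capitals` is hypothetical). Asking `loc` for the number 1 fails because no row has that label, while `iloc` still works by position.

```python
import pandas as pd

# Hypothetical table whose rows are labeled with strings instead of numbers.
capitals = pd.DataFrame(
    {"Capital": ["Sacramento", "Austin", "Albany"]},
    index=["California", "Texas", "New York"],
)

by_label = capitals.loc["Texas", "Capital"]

try:
    capitals.loc[1]  # there is no row labeled 1
    numeric_label_failed = False
except KeyError:
    numeric_label_failed = True

# Positional access is still available through iloc.
by_position = capitals.iloc[1]["Capital"]
print(by_label, by_position)
```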


Loc also supports slicing (for all types, including numeric and string labels!). Note that the slicing for loc is **inclusive**, even for numeric slices.

In [None]:
elections.loc[0:4, 'Candidate':'Year']

If we omit the column argument altogether, the default behavior is to retrieve all columns. 

In [None]:
elections.loc[[2, 4, 5]]

Loc also supports boolean array inputs instead of labels. The Boolean arrays _must_ be of the same length as the row/column shape of the dataframe, respectively (in versions prior to 0.25, Pandas used to allow size mismatches and would assume the missing values were all False, [this was changed in 2019](https://github.com/pandas-dev/pandas/pull/26911)).

In [None]:
elections.loc[[True, False, False, True, False, False, True, True, True, False, False, True, 
               True, True, False, True, True, False, False, False, True, False, False], # row mask
              [True, False, False, True, True] # column mask
             ]

In [None]:
elections.loc[[0, 3], ['Candidate', 'Year']]

We can use boolean array arguments for one axis of the data, and labels for the other.

In [None]:
elections.loc[[True, False, False, True, False, False, True, True, True, False, False, True, 
               True, True, False, True, True, False, False, False, True, False, False], # row mask
              
              'Candidate':'%' # column label slice
             ]

What do you think happens if you give a single value for both the requested row AND column?

In [None]:
elections.loc[15, '%']
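The type of the result depends on how the rows and columns are requested. The sketch below uses a made-up three-row table (`small` is hypothetical) to compare a single label, a slice, and a one-element list.

```python
import pandas as pd

# Hypothetical data with two of the elections column names.
small = pd.DataFrame({"Candidate": ["A", "B", "C"], "%": [50.7, 41.0, 58.8]})

single_value = small.loc[2, "%"]    # one row label, one column label: a scalar
one_column = small.loc[:, "%"]      # all rows, one column label: a Series
sub_table = small.loc[[2], ["%"]]   # lists on both axes: a 1x1 DataFrame

print(type(single_value).__name__)
print(type(one_column).__name__)
print(type(sub_table).__name__)
```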

## Positional access with `iloc`

`loc`'s cousin `iloc` is very similar, but it is used to access data by numerical position instead of by label. For example, to access the top 3 rows and the first 3 columns of a table, we can use `.iloc[0:3, 0:3]`. `iloc` slicing is **exclusive**, just like standard Python slicing of numerical values.

In [None]:
elections.head(5)

In [None]:
elections.iloc[:3, 2:]
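The inclusive versus exclusive difference is easy to check on a small table. This is a sketch with made-up data (`letters` is hypothetical) that has the default 0, 1, 2, ... row labels:

```python
import pandas as pd

# Hypothetical five-row table with the default integer index.
letters = pd.DataFrame({"letter": list("abcde"), "n": [0, 1, 2, 3, 4]})

loc_rows = letters.loc[0:3]    # labels 0 through 3, end label included
iloc_rows = letters.iloc[0:3]  # positions 0, 1, 2, end position excluded

print(len(loc_rows), len(iloc_rows))
```

The same slice `0:3` therefore returns four rows with `loc` and three rows with `iloc`.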

We will use both loc and iloc in the course. Loc is generally preferred for a number of reasons, for example: 

1. It is harder to make mistakes since you have to literally write out what you want to get.
2. Code is easier to read, because the reader doesn't have to know e.g. what column #31 represents.
3. It is robust against permutations of the data, e.g. the social security administration switches the order of two columns.

However, iloc is sometimes more convenient. We'll provide examples of when iloc is the superior choice.

## Quick Challenge

Which of the following expressions return a DataFrame containing the Candidate and Year of the first 3 candidates that won with more than 50% of the vote?

In [None]:
elections.head(10)

In [None]:
elections.iloc[[0, 3, 5], [0, 3]]

In [None]:
elections.loc[[0, 3, 5], "Candidate":"Year"]

In [None]:
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].head(3)

In [None]:
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].iloc[0:2, :]

## Sampling

Pandas dataframes also make it easy to get a sample. We simply use the `sample` method and provide the number of samples that we'd like as the argument. Sampling is done without replacement by default. Set `replace=True` if you want replacement.

This is very useful when you have a big dataset and want to get an idea of what is in it without drowning in a huge output or only seeing the top.

In [None]:
elections.sample(10)

In [None]:
elections.query("Year < 1992").sample(50, replace=True)
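Two details of `sample` are worth seeing in action: a seed makes the draw repeatable, and asking for more rows than exist only works with replacement. The sketch below uses a made-up ten-row table (`numbers` is hypothetical); the `random_state` values are arbitrary.

```python
import pandas as pd

# Hypothetical ten-row table.
numbers = pd.DataFrame({"value": list(range(10))})

# The same random_state gives the same sample every time.
first = numbers.sample(3, random_state=42)
second = numbers.sample(3, random_state=42)

# With replacement, rows may repeat, so the sample can exceed the table size.
oversample = numbers.sample(15, replace=True, random_state=0)

# Without replacement, asking for more rows than exist raises a ValueError.
try:
    numbers.sample(15)
    too_large_allowed = True
except ValueError:
    too_large_allowed = False

print(first)
print(len(oversample), too_large_allowed)
```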


## Handy Properties and Utility Functions for Series and DataFrames

#### Python Operations on Numerical DataFrames and Series

Consider a series of only the vote percentages of election winners.

In [None]:
winners = elections.query("Result == 'win'")["%"]
winners

We can perform various Python operations (including numpy operations) on DataFrames and Series.

In [None]:
max(winners)

In [None]:
np.mean(winners)
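Series also have their own methods for these summaries, such as `.max()` and `.mean()`. The sketch below uses made-up numbers (`shares` and `with_gap` are hypothetical) to show that the built-in, numpy, and pandas versions agree on complete data, and that the pandas method skips missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical vote shares.
shares = pd.Series([50.7, 58.8, 53.4, 43.0])

builtin_max = max(shares)      # Python built-in
method_max = shares.max()      # pandas method
numpy_mean = np.mean(shares)   # numpy function
method_mean = shares.mean()    # pandas method

# With a missing value, the pandas method ignores it by default,
# while the mean of the underlying numpy array is NaN.
with_gap = pd.Series([50.7, None, 43.0])
pandas_mean = with_gap.mean()
array_mean = with_gap.to_numpy().mean()

print(builtin_max, method_max, pandas_mean, array_mean)
```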

#### Handy Utility Methods

The `head` and `describe` methods and the `shape` and `size` attributes can be used to quickly get a good sense of the data we're working with. Remember when I said above we can use names for labeling rows?  This is a good dataset to demonstrate that. 

In [None]:
mottos = pd.read_csv("mottos.csv", index_col="State")

In [None]:
mottos.head(20)

In [None]:
mottos.size

The fact that the size is 200 means our data file is relatively small, with only 200 total entries.

In [None]:
mottos.shape

Since we're looking at data for states, and we see the number 50, it looks like we've most likely got a complete dataset that omits Washington D.C. and U.S. territories like Guam and Puerto Rico.

In [None]:
mottos.describe()

Above, we see a quick summary of all the data. For example, the most common language for mottos is Latin, which covers 23 different states. Does anything else seem surprising?

We can get a direct reference to the index using .index.

In [None]:
mottos.index

We can also access individual properties of the index, for example, `mottos.index.name`.

In [None]:
mottos.index.name

This reflects the fact that in our data frame, the index IS the state name!

In [None]:
mottos.head(2)

It turns out the columns also have an Index. We can access this index by using `.columns`.

In [None]:
mottos.columns

There are also a ton of useful utility methods we can use with Data Frames and Series. For example, we can create a copy of a data frame sorted by a specific column using `sort_values`.

In [None]:
elections.sort_values('%')

Note that `sort_values`, like most Data Frame methods, returns a copy and does **not** modify the original data structure, unless you set `inplace=True`.

In [None]:
elections.head(5)
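The copy behavior can be checked directly on a small table. This is a sketch with made-up data (`scores` is hypothetical):

```python
import pandas as pd

# Hypothetical three-row table.
scores = pd.DataFrame({"name": ["a", "b", "c"], "%": [41.0, 58.8, 50.7]})

# sort_values returns a new, sorted DataFrame.
sorted_scores = scores.sort_values("%")

print(scores["%"].tolist())          # original order is unchanged
print(sorted_scores["%"].tolist())   # sorted copy
print(sorted_scores.index.tolist())  # the row labels travel with the rows
```

To keep the sorted version, assign the result to a variable, as done here with `sorted_scores`.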

If we want to sort in reverse order, we can set `ascending=False`.

In [None]:
elections.sort_values('%', ascending=False)

We can also use `sort_values` on Series objects.

In [None]:
mottos['Language'].sort_values(ascending=False).head(10)

For Series, the `value_counts` method is often quite handy.

In [None]:
elections['Party'].value_counts()

In [None]:
mottos['Language'].value_counts()

Also commonly used is the `unique` method, which returns all unique values as a numpy array.

In [None]:
mottos['Language'].unique()
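Here is a small sketch of how `value_counts` and `unique` relate, using a made-up series (`languages` is hypothetical). It also uses `nunique`, a related Series method not covered above, which returns just the number of distinct values.

```python
import pandas as pd

# Hypothetical series of categories.
languages = pd.Series(["Latin", "English", "Latin", "French", "English", "Latin"])

counts = languages.value_counts()  # Series of counts, most frequent first
distinct = languages.unique()      # numpy array, in order of first appearance
n_distinct = languages.nunique()   # number of distinct values

print(counts)
print(distinct, n_distinct)
```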

## Baby Names Data

Now let's play around a bit with a large baby names dataset that is publicly available. We'll start by loading that dataset from the Social Security Administration's website.

To keep the data small enough we're going to look at only California rather than looking at the national dataset.

In [None]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.sample(5)

In [None]:
# Note that the babynames dataset includes both numeric and non-numeric information. describe() defaults to summarizing
# only the numeric columns of a mixed dataset; include='all' shows summaries for all columns. But look at this output:
# just because describe() can summarize a column doesn't mean the summary is useful.
babynames.describe(include='all')

# Exercises

Here is a list of questions to answer using the babynames dataset. 

How many unique names exist in the dataset? 

What was the most popular name in any given year? (i.e. what year had the most people with the same name?)

What was the most popular male name?

What was the most popular female name?

What was the most popular female and male name in 2018? 

What were the top-10 names for the year you were born?


## Harder goals:

How many different names were given to males compared with females? In 1960? In 2020?


What other questions can we ask?  Give me some questions!  
