In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab05.ipynb")

# Lab 5: Introduction to Pandas
Welcome to Lab 5 of DATA 271! In this lab we will practice with the Pandas library. This lab contains small tasks ("appetizers") that help you make sure you understand the examples. The culminating task ("main course") at the end of the document is more complex and uses most of the topics you will have worked through.

## Overview
Pandas is a fast, powerful, and flexible open source data analysis and manipulation tool built on top of the Python programming language.  Pandas is built for tabular data.  Moreover, it displays data in an intuitive and pleasing way (compared to NumPy, for example).

More documentation can be found here: https://pandas.pydata.org.

### In today's lab, we will...
- Use Series and DataFrames to work with data
- Create a DataFrame from a list or dictionary, or by reading a CSV file
- Inspect DataFrames
- Access rows, columns and elements of DataFrames
- Subset or filter data
- Make basic plots (bar and line) with Pandas

In [None]:
# if you ever want to ignore warnings, use the following
import warnings
warnings.filterwarnings("ignore")

## Introduction and Motivation
Python data structures like dictionaries and NumPy arrays are very powerful tools in data science. However, Pandas is better suited to certain tasks.

Let's take a look at one example to see why. The cell below imports a sample dataset related to earthquakes. 

In [None]:
# read short CSV file using numpy
import numpy as np

data = np.genfromtxt(
    'example_data.csv', delimiter=';', 
    names=True, dtype=None, encoding='UTF'
)
data

In [None]:
data.shape

Each of the entries in the array is a row from the CSV file, and in each row, we see several strings, a float and an integer.  
Suppose we want to find the maximum magnitude in the dataset. With the data stored in a 1D array like this, we can use a list comprehension to select the element at index 3 of each row and then take the maximum.  

In [None]:
max([row[3] for row in data])

The step above is a bit awkward and it would be better to have the data separated in a more understandable representation. We can instead create a dictionary where the keys are the column names (time, place, magType, etc.) and the values are NumPy arrays of the data.  Maybe this is an improvement, but it is still pretty difficult to parse visually.

In [None]:
array_dict = {
    col: np.array([row[i] for row in data])
    for i, col in enumerate(data.dtype.names)
}
array_dict

In [None]:
# select the mag key from the dictionary and use the max() method
array_dict['mag'].max()

If we want to select all of the information associated with the earthquake with the maximum magnitude, we would need to find the index of the maximum and then for each of the keys in the dictionary, take that item.

In [None]:
[array_dict[key][array_dict['mag'].argmax()] for key in array_dict.keys()]

Although it was possible to perform this task with NumPy and base Python, it was a little clunky. Let's explore other options.

### Series

The point of the above examples is that NumPy arrays and dictionaries, though perfectly valid, might not be the best way to store and access data in this form.

In Pandas, the Series class is a data structure for arrays of a single type; think of it as a single column in a spreadsheet.  Its shape will always be `(n,)`, where `n` is the number of rows.  When created, it includes an **index** (see the numbers 0 through 4 in the left column below), which enables us to select rows.  Pandas creates a default index, but you can also define your own.
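
To make the idea of an index concrete, here is a minimal sketch of a Series with a custom index. The magnitudes and the station labels (`stn_a`, `stn_b`, `stn_c`) are made up for illustration and do not come from the lab's data file.

```python
import pandas as pd

# Hypothetical magnitudes, labeled with made-up station codes instead of 0, 1, 2.
mags = pd.Series([4.7, 5.2, 3.9], index=["stn_a", "stn_b", "stn_c"], name="mag")

print(mags.shape)           # one dimension: (number of rows,)
print(mags.index.tolist())  # the custom labels
print(mags["stn_b"])        # select a value by its index label
print(mags.iloc[1])         # select the same value by its position
```

Selecting by label and selecting by position return the same value here; the index only changes how rows are addressed, not what they hold.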

We can create a series storing the location of each earthquake by using the dictionary we created.

In [None]:
import pandas as pd

# create a Pandas series called place
place = pd.Series(array_dict['place'], name='place')
place

In [None]:
# shape-- will return (rows,)
place.shape

### DataFrame

The Pandas DataFrame class builds on the Series class, and can have many columns, each with its own data type.  You can think of this as an entire spreadsheet.


In [None]:
# create a dataframe from the file we read
df = pd.DataFrame(data)
df

Notice the index on the left and the pleasing display of the information compared to the NumPy array or the dictionary above.
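
Pandas can also read delimited text directly with `pd.read_csv`, skipping the NumPy step. The sketch below assumes a semicolon-delimited layout like the one used earlier; the text in `csv_text` is made up and stands in for a file on disk, and the name `quakes` is only illustrative.

```python
import io
import pandas as pd

# Made-up, semicolon-delimited text standing in for a CSV file on disk.
csv_text = (
    "time;place;mag\n"
    "2018-10-13;Alaska;1.35\n"
    "2018-10-13;California;1.29\n"
)

# sep=';' tells pandas which delimiter separates the columns.
# With a real file, pass its path instead of the StringIO object.
quakes = pd.read_csv(io.StringIO(csv_text), sep=";")

print(quakes.dtypes)  # one data type per column
print(quakes.shape)   # (rows, columns)
```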

In [None]:
# check type of data
df.dtypes

In [None]:
# find column names
df.columns

In [None]:
# see index
df.index

In [None]:
# see dimensions -- (rows, columns)
df.shape

The above attributes give us a great deal of information about our (small) data set.  We see there are 5 rows and 6 columns, and the columns have titles `'time'`, `'place'`, etc.
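
With the data in a DataFrame, the motivating task from earlier (finding the largest magnitude and the full record that goes with it) becomes much shorter. This is a sketch on made-up values, not the lab's dataset; the column names `place` and `mag` mirror the ones used above.

```python
import pandas as pd

# Hypothetical earthquake records for illustration only.
quakes = pd.DataFrame(
    {
        "place": ["Alaska", "California", "Nevada"],
        "mag": [1.35, 1.29, 3.42],
    }
)

# Largest value in the 'mag' column.
largest = quakes["mag"].max()

# idxmax() returns the index label of the largest value;
# loc then pulls the whole row for that label.
row = quakes.loc[quakes["mag"].idxmax()]

print(largest)
print(row)
```

Compare this with the list comprehension over `array_dict` keys: the column name replaces the positional lookup, and one `loc` call replaces the loop.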

### Creating Series and DataFrames 

As we saw in lecture, we can create a DataFrame from a Python data structure.  See the three examples below to refresh your memory.

In [None]:
# data frame from dictionary
pd.DataFrame(
    {
    'Name' : ['Aaron', 'Luke', 'Kai', 'Casey'],
    'Age' : [23, 21, 22, 21],
    'University' : ['Cal Poly Humboldt', 'Sonoma State', 'UCLA', 'UCD'],
    }, 
   
)

In [None]:
# data frame from list of tuples
list_of_tuples = [(n, n/2, n**4) for n in range(5)]
pd.DataFrame(
    list_of_tuples, 
    columns=['n', 'n/2', 'n^4']
)

In [None]:
# data frame from NumPy array
pd.DataFrame(
    np.array([
        [0, 0, 0],
        [1, .5, 1],
        [2, 1, 16],
        [3, 1.5, 81],
        [4, 2, 256]
    ]), columns=['n', 'n/2', 'n^4']
)

## Appetizers

**Question 1.1:** Create a NumPy array containing
\begin{bmatrix}
5 & 6 \\
7 & 8 
\end{bmatrix}
Then create a Pandas dataframe from that array. Each column in the dataframe should be a column from the array, and the columns should be named `a` and `b`.

In [None]:
array1 = ...
df_from_array1 = ...
df_from_array1

In [None]:
grader.check("q1_1")

**Question 1.2:** Create a list of two tuples. The first tuple should contain `5` and `6`. The second should contain `7` and `8`. 
Then create a Pandas dataframe from the list of tuples. The columns should be named `a` and `b`.

In [None]:
lst = ...
df_from_lst = ...
df_from_lst

In [None]:
grader.check("q1_2")

**Question 1.3:** Create a Python dictionary to generate another dataframe equivalent to the ones from the previous two parts.

In [None]:
dct = ...
df_from_dct = ...
df_from_dct

In [None]:
grader.check("q1_3")

### Working with dataframes 
We can create a data frame from a nested list holding the populations of a few European capital cities.  For example, London, the capital of the UK, had a population of 8.615 million in 2015.

In [None]:
df = pd.DataFrame([[909976, 'Sweden'], [8615246, 'UK'], [2872086, 'Italy'], [2273305, 'France']])
df

We can reindex the data frame with the names of the respective capitals, since this might be more intuitive than a numbered index.

In [None]:
df.index = ['Stockholm', 'London', 'Rome', 'Paris']
df

We can also rename the columns so the data frame is more understandable.

In [None]:
df.columns = ['Population', 'Country']
df
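
The index and column names can also be supplied when the data frame is created, instead of being assigned afterwards. The sketch below reuses the same capital-city values; the variable name `capitals` is only illustrative.

```python
import pandas as pd

# Same values as above, with row and column labels set in the constructor.
capitals = pd.DataFrame(
    [[909976, "Sweden"], [8615246, "UK"], [2872086, "Italy"], [2273305, "France"]],
    index=["Stockholm", "London", "Rome", "Paris"],
    columns=["Population", "Country"],
)

print(capitals)
```

Both approaches give the same result; setting labels up front is convenient when they are known in advance.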

### Accessing Columns
We can access a column using attribute notation or by indexing with the column name.  These are equivalent.  Note that a column of a data frame is just a Series object.

In [None]:
# option 1, attribute notation
df.Population

In [None]:
# option 2 bracket notation
df['Population']
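
The two notations agree whenever the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. The sketch below shows two cases where only bracket notation reaches the column; the `Land area` and `count` columns and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with one awkward column name and one that
# clashes with a DataFrame method.
frame = pd.DataFrame(
    {
        "Population": [909976, 8615246],
        "Land area": [100.0, 200.0],  # contains a space
        "count": [1, 2],              # same name as the DataFrame.count method
    }
)

# For a plain name, both notations return the same Series.
print(frame["Population"].equals(frame.Population))

# A name with a space can only be reached with brackets.
print(frame["Land area"])

# frame.count is the built-in method, not the column.
print(callable(frame.count))
print(frame["count"])
```

Bracket notation always works, so it is the safer default when column names come from an external file.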

### Accessing Rows

We can access rows with the `loc` index attribute.  This will also result in a Series object.

In [None]:
# access the row for Stockholm
df.loc['Stockholm']

We can also access rows by position with the `iloc` index. 

In [None]:
# access the row for Stockholm
df.iloc[0]
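
The difference between `loc` (labels) and `iloc` (positions) also shows up when slicing. The sketch below rebuilds the capitals data under the illustrative name `capitals` so that it runs on its own.

```python
import pandas as pd

capitals = pd.DataFrame(
    [[909976, "Sweden"], [8615246, "UK"], [2872086, "Italy"], [2273305, "France"]],
    index=["Stockholm", "London", "Rome", "Paris"],
    columns=["Population", "Country"],
)

# The label 'Stockholm' and position 0 refer to the same row.
same_row = capitals.loc["Stockholm"].equals(capitals.iloc[0])
print(same_row)

# Label slices include the end label: Stockholm, London, Rome.
by_label = capitals.loc["Stockholm":"Rome"]
print(by_label)

# Position slices exclude the end position, as in base Python: rows 0 and 1.
by_position = capitals.iloc[0:2]
print(by_position)
```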

### Subsetting with `loc`

Passing a list of row labels to `loc` returns a new data frame that is a subset of the rows of the original.  We can also subset on both rows and columns at once.

In [None]:
# create new data frame with the rows for Paris and Rome and all columns
df.loc[['Paris', 'Rome']]

In [None]:
# select the rows for Paris and Rome and only the Population column (a single column label returns a Series)
df.loc[['Paris', 'Rome'], 'Population']
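
Rows can also be selected by a condition rather than by label. Comparing a column to a value produces a Series of `True`/`False` values (a boolean mask), and passing that mask in brackets or to `loc` keeps only the rows marked `True`. The sketch below uses the capitals data under the illustrative name `capitals`; the 2.5 million threshold is arbitrary.

```python
import pandas as pd

capitals = pd.DataFrame(
    [[909976, "Sweden"], [8615246, "UK"], [2872086, "Italy"], [2273305, "France"]],
    index=["Stockholm", "London", "Rome", "Paris"],
    columns=["Population", "Country"],
)

# Boolean mask: True where the population exceeds the threshold.
mask = capitals["Population"] > 2_500_000
large = capitals[mask]
print(large)

# loc accepts a mask for rows and a list of labels for columns;
# a list of column labels keeps the result a DataFrame.
large_pop_only = capitals.loc[mask, ["Population"]]
print(large_pop_only)

# Equality works the same way for text columns.
uk = capitals[capitals["Country"] == "UK"]
print(uk)
```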

### Descriptive Statistics


We can also get summary statistics of all of our numeric data.

In [None]:
df.describe()

In [None]:
# if you don't like the scientific notation
pd.options.display.float_format = '{:10,.2f}'.format
df.describe()

## Basic plotting with Pandas

`Series` and `DataFrame` objects have a `plot()` method that allows us to create several plots. This makes plotting our data much more convenient, as the bulk of the work to create presentable plots is achieved with a single method call. Under the hood, `pandas` is making several calls to `matplotlib` to produce plots. We will learn more about `matplotlib` next week. 

In [None]:
# bar plot
df.plot.bar(y = 'Population'); # uses index as x-axis by default

Note that the bar plot uses the index as the x-axis by default. You can also use a column as your x-axis by specifying an `x=...` input inside your `.plot` call.

In [None]:
# line plot from a Series
pd.Series(np.random.rand(20)).plot.line();

In [None]:
# line plot from a DataFrame 
pd.DataFrame([(i,i**2) for i in np.arange(-10,11)],columns =['n','n^2']).plot.line(x = 'n', y = 'n^2');

**Question 2.1:** Look up census data [here](https://www.census.gov/quickfacts/fact/table) to determine the population of the following cities in Humboldt County: Arcata, Eureka, McKinleyville, Fortuna. Use the population estimates according to the April 1, 2020 Census. 
Put this information into a DataFrame with 4 rows and 2 columns. The left column should contain the name of the city, and the right column contains the population size. 

*NOTE:* For this part, don't give your rows or columns descriptive labels yet. Just get the data into a DataFrame.

In [None]:
humboldt_df = ...
humboldt_df

In [None]:
grader.check("q2_1")

**Question 2.2:** Rename the columns to `City` and `Population` appropriately. 

In [None]:
...
humboldt_df

In [None]:
grader.check("q2_2")

**Question 2.3:** Make a barplot of the population by city. Make sure that your x-axis is labeled with each city name. 

In [None]:
ax = ...

In [None]:
grader.check("q2_3")

**Question 2.4:** Subset the data frame so you have just the cities associated with Northern Humboldt Union High School District (Arcata and McKinleyville). 

In [None]:
humboldt_subset = ...
humboldt_subset

In [None]:
grader.check("q2_4")

## Main Course

The Eviction Lab at Princeton University gathers and provides data on evictions in the United States.  They have constructed a nationwide database of eviction filings, demonstrating that, on average, 2.7 million households are threatened with eviction annually.  The database relied on almost 100 million court records and is available for researchers who want to examine causes and consequences of eviction lawsuits in the United States.  

Having a national perspective is important in understanding how state and local level housing policies relate to eviction risk.  For example, states that require landlords to provide notice to tenants prior to filing an eviction case for nonpayment of rent seem to have lower risk of eviction.  Between 2000 and 2018, almost 7% of renting households faced an eviction lawsuit (Gromis, 2022).

If you are interested in this topic, the book *Evicted: Poverty and Profit in the American City* by Matthew Desmond goes into detail.  This book was a winner of the Pulitzer Prize and was identified as one of the top ten books of 2016 by the New York Times.

To begin, navigate to the Eviction Lab's [data downloads site](https://data-downloads.evictionlab.org/#estimating-eviction-prevalance-across-us/) and download the csv file *state_eviction_estimates_2000_2018.csv*.  Put the file in your working directory.

**Question 3.1:** Read your downloaded file into a Pandas DataFrame.

In [None]:
evictions_df = ...
evictions_df.head()

In [None]:
grader.check("q3_1")

**Question 3.2:** How many rows and columns are in the evictions dataset?

In [None]:
num_rows = ...
num_cols = ...

num_rows,num_cols

In [None]:
grader.check("q3_2")

**Question 3.3:** Create a list containing all the column names in the dataset. 

In [None]:
eviction_columns = ...
eviction_columns

In [None]:
grader.check("q3_3")

**Question 3.4:** Which data types are present in the dataset? Assign `datatypes` to either 1, 2, 3, or 4. 

1. all columns are type `object`
2. all columns are type `float64`
3. `int64` and `float64`
4. `int64` and `float64` and `object`

In [None]:
datatypes = ...

In [None]:
grader.check("q3_4")

**Question 3.5:** Some of the columns will not be needed for this exercise, so subset the data frame and retain the columns `state`, `FIPS_state`, `year`, `renting_hh`, `filings_estimate`, `hh_threat_estimate` (in that order).

In [None]:
evictions_subset = ...
evictions_subset

In [None]:
grader.check("q3_5")

<!-- BEGIN QUESTION -->

**Question 3.6:** Look at the codebook provided on the same site you got the data from, and make sure you understand what each column name you have retained means.  Write a summary describing each column.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 3.7:** Are there any null objects in your `evictions_subset` data? If there are, specify how many.

Assign `null_vals` to the bool `True` if there are null objects, and `False` if there aren't.  
Assign `num_null` to the number of total null values. (Enter `0` if `null_vals` is `False`)

In [None]:
null_vals = ...
num_null = ...

In [None]:
grader.check("q3_7")

**Question 3.8:** Get the summary statistics for the entire `evictions_subset` DataFrame. If you had to remove the summary statistics for one of these columns, which one would you choose?

In [None]:
summary_stats = ...
summary_stats

In [None]:
grader.check("q3_8")

**Question 3.9:** Create a data frame that contains just the rows related to California. Keep the columns `state`, `year`, `renting_hh`, `filings_estimate`, `hh_threat_estimate` only.

In [None]:
cali_df = ...
                              ...
cali_df

In [None]:
grader.check("q3_9")

**Question 3.10:** What year had the minimum number of eviction filings in California between 2000 and 2018 (inclusive)? Assign this to `min_cali_filings_year`.  
How many filings were there that year? Assign this to `min_cali_filings_num`. 

Similarly, find what year had the maximum number of eviction filings in California between 2000 and 2018, and how many filings there were in that year. 

In [None]:
min_cali_filings_year = ...
min_cali_filings_num = ...
max_cali_filings_year = ...
max_cali_filings_num = ...

# Print results
print("Year with minimum number of eviction filings: "+str(min_cali_filings_year))
print("Minimum number of eviction filings: "+str(min_cali_filings_num))
print("Year with maximum number of eviction filings: "+str(max_cali_filings_year))
print("Maximum number of eviction filings: "+str(max_cali_filings_num))

In [None]:
grader.check("q3_10")

<!-- BEGIN QUESTION -->

**Question 3.11:** Make a line plot for the eviction filings in California (y-axis) for each year (x-axis).  
*HINT:* Instead of `plot.bar` in the previous example, use `plot.line`. 

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.12:** What trends do you notice in the plot above?  Write a few sentences describing the number of eviction filings in California over this time period and your interpretation of the trend. For example, what economic factors might have contributed to the rising eviction filings leading up to 2010?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 3.13:** California passed a Just Cause Eviction Act in October 2019.  Prior to a state-wide adoption of this policy, many cities in California had just cause eviction ordinances.  For example, East Palo Alto has had an ordinance since 2010 and San Diego has had one since 2004.  New Jersey has had a state-wide Just Cause policy since 1974 (nearly 50 years!).

Create a data frame from `evictions_subset` that retains just the rows related to New Jersey.  Keep the columns `state`, `year`, `renting_hh`, `filings_estimate`, `hh_threat_estimate` only.

In [None]:
nj_df = ...
nj_df

In [None]:
grader.check("q3_13")

**Question 3.14:** What year had the minimum number of eviction filings in New Jersey between 2000 and 2018 (inclusive)? Assign this to `min_nj_filings_year`.  
How many filings were there that year? Assign this to `min_nj_filings_num`. 

Similarly, find what year had the maximum number of eviction filings in New Jersey between 2000 and 2018, and how many filings there were in that year. 

In [None]:
min_nj_filings_year = ...
min_nj_filings_num = ...
max_nj_filings_year = ...
max_nj_filings_num = ...

# Print results
print("Year with minimum number of eviction filings: "+str(min_nj_filings_year))
print("Minimum number of eviction filings: "+str(min_nj_filings_num))
print("Year with maximum number of eviction filings: "+str(max_nj_filings_year))
print("Maximum number of eviction filings: "+str(max_nj_filings_num))

In [None]:
grader.check("q3_14")

<!-- BEGIN QUESTION -->

**Question 3.15:** Make a line plot for the eviction filings in New Jersey (y-axis) for each year (x-axis).  

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.16:** Write a few sentences describing the number of eviction filings in New Jersey over this time period.  Does the graph for New Jersey have a similar rise and peak as the graph for California?  (Note: comparing the numbers of eviction filings between the two states doesn't account for their different population sizes, but looking at the trends in the graph helps us understand what was similar and what was different over this time period.)

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.17:** There has been some criticism about how the Eviction Lab obtained some of its data.  For example, they purchased California eviction data from American Information Research Services, which offers tenant screening as a service.  Some critics (e.g., Anti-Eviction Mapping Project, Tenants Together) suggest this purchased data vastly undercounts the number of evictions that have been filed. For example, informal evictions (when landlords induce renters to leave through monetary incentives or illegal lockouts) are not accounted for in this data set. Discuss the implications of this undercounting in terms of analysis and policy.  (A few sentences is sufficient.)

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Task (Optional Dessert)

Go to [Kaggle](https://www.kaggle.com/search) and find a dataset that you're interested in. Use Pandas to do an initial exploration of the data. Practice plotting with Pandas and play around with ways to manipulate Pandas plots. For example, explore:

- `subplots`
- `figsize`
- `legend`
- `color`
- `colormap`
- etc.

# You're done!

Congrats on finishing Lab 5! Gus is jumping for joy! Run the cell below to download the zip and submit to Canvas. 

<img src="gus_gets_yelled_at.JPG" alt="drawing" width="300"/>

### References
If you want to read more about the topics in this lab, check out these references.
- Hands on Data Analysis with Pandas by Stefanie Molin
- Gromis, Ashley, Ian Fellows, James R. Hendrickson, Lavar Edmonds, Lillian Leung, Adam Porton, and Matthew Desmond. Estimating Eviction Prevalence across the United States. Princeton University Eviction Lab. https://data-downloads.evictionlab.org/#estimating-eviction-prevalance-across-us/. Deposited May 13, 2022.
- Evicted: Poverty and Profit in the American City by Matthew Desmond.
- Cuellar, Julieta. "Effect of “just cause” eviction ordinances on eviction in four California cities." Journal of Public & International Affairs 30 (2019).  https://jpia.princeton.edu/news/effect-just-cause-eviction-ordinances-eviction-four-california-cities
- California Tenant Protection Act of 2019 (AB1482): https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=201920200AB1482

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)