# Homework 5

### Overview
This homework assignment is divided into two parts: 1) predicting HOLC grades and 2) predicting NYC taxi pick-up locations.

### Deliverables: 
1. Pandas notebook with outputs

In [1]:
# We are going to start by importing the libraries we need,
# all in one cell.

# It is a good practice to keep all the imports in one cell so that
# we can easily see what libraries we are using in the notebook.
import pandas as pd
import geopandas as gpd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np

## There is no need to import libraries more than once!



# 1. Predicting HOLC grades
In this exercise, you are going to explore some historical Census data and understand its relationship with the Home Owners' Loan Corporation (HOLC) risk maps. The HOLC made these maps in the late 1930s and early 1940s to assess neighborhood risk after the agency had bailed out underwater borrowers who were unable to pay their mortgage loans as a consequence of the Great Depression. Every residential neighborhood was given a mortgage-risk grade of A, B, C, or D:

- A = "best"
- B = "still desirable"
- C = "definitely declining"
- D = "hazardous"

While these maps have become known as the "redlining maps," note that the agency didn't use them to decide who or where should get loans. It actually made these maps after its loan activities were already over! However, the maps do provide an interesting window into how the real estate industry viewed different neighborhoods during a period when redlining (by savings and loan banks and the Federal Housing Administration, among others) occurred. In 1968, Congress passed the Fair Housing Act, also known as Title VIII of the Civil Rights Act of 1968, which formally outlawed housing discrimination on the basis of race, color, religion, and national origin; later amendments added sex (1974) and disability and familial status (1988). Many scholars consider this to be the legal end of discriminatory redlining practices (although they still occur, in various ways, today).

There are three time periods to consider when we're thinking about the "impacts" of redlining:
- The pre-redlining period: for now, we'll say that is 1930 and any 1930-1940 trends. 
- The redlining period: 1940 - 1970
- The post-redlining period: 1980 - present

The overall aim of this study is to understand whether there is a relationship between HOLC grades and present-day neighborhood outcomes such as racial composition, education, or median income. The underlying mechanism might be that HOLC grades led to disinvestment in neighborhoods, which in turn led to poor conditions, the concentration of poverty, and low opportunities for the people who live there. This leads us to test two scenarios:

1. Can we use neighborhood conditions during the period of redlining to predict historical HOLC grades?
2. Similarly, if we think that grades were themselves determined by neighborhood socioeconomic and demographic conditions, can we use pre-redlining conditions to predict HOLC grades?

## 1.1 Load in the data
The folder `holc_data` can be downloaded [here](https://www.dropbox.com/scl/fo/efovuq4dg9mmrwhtnvak6/AHew7g8n0jqkuxPGpU3lZBQ?rlkey=g7371rzvg8m2duq86hg3jc6nk&dl=0).

In [None]:
### This is a version of the data with all the years as separate columns
holc_data = gpd.read_file('https://www.dropbox.com/scl/fi/i017f3g4juwtf6rmpijcq/holc_data_1930_2016.geojson?rlkey=scxpkogx5hcj1r1rvdwg297g8&dl=1',driver='GeoJSON')

### This is a version of the data with the years concatenated
holc_data_v2 = []
for y in ['1930','1940','1950','1960','1970','1980','1990','2010', '2016']: 
    df = gpd.read_file(f'holc_data/holc_overlay_{y}')
    df['year'] = y
    holc_data_v2.append(df)

keep_cols = ['city', 'holc_grade', 'population', 'white_perc', 'colored_pe',
             'hispanic_p', 'other_perc', 'college_pe', 'median_inc',
             'unemployed', 'geometry', 'year']
holc_data_v2 = pd.concat(holc_data_v2)[keep_cols]

### Give the truncated column names readable names
holc_data_v2 = holc_data_v2.rename(columns={
    'colored_pe': 'black_perc',
    'hispanic_p': 'hispanic_perc',
    'college_pe': 'college_perc',
    'unemployed': 'unemployed_perc',
})

In [None]:
holc_data.head()

In [None]:
holc_data_v2.head()

## 1.2 Data cleaning and exploration (10 pts)
In addition to your `.describe()` descriptive statistics, in the following cells, create charts and/or maps that will tell us the following:
- Define your characteristics of interest
- What are the historical trends for each characteristic?

Make sure to describe what you find. (Also, do all Census characteristics exist for all years?)

First, remove rows where the HOLC grade is 'E' in both datasets.
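
As a minimal sketch of this filtering step, the cell below uses a small made-up frame (`toy`, with hypothetical values) rather than the real data. The same boolean-mask pattern should carry over to `holc_data` and `holc_data_v2`, assuming both have a `holc_grade` column.

```python
import pandas as pd

# Hypothetical toy frame standing in for holc_data / holc_data_v2
toy = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "holc_grade": ["A", "E", "D", "C"],
    "population": [100, 50, 80, 60],
})

# Keep every row whose grade is not 'E'; .copy() gives an independent frame
# so that later column assignments do not raise chained-assignment warnings
toy_clean = toy[toy["holc_grade"] != "E"].copy()

print(toy_clean["holc_grade"].tolist())
```

Checking `value_counts()` on the grade column before and after is a quick way to confirm that only the 'E' rows were dropped.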

In [None]:
## insert your code here


Next, create charts where 'year' is on the x-axis and the socioeconomic/demographic information is on the y axis. 
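
One possible way to build such a chart is sketched below on made-up numbers (`toy_long` is hypothetical). It assumes the long-format layout of `holc_data_v2`, averages one characteristic per year and grade, and draws one line per grade. Whether to use a mean or a population-weighted mean is a choice left to you.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical long-format data: one row per neighborhood per year
toy_long = pd.DataFrame({
    "year": ["1930", "1930", "1940", "1940", "1930", "1930", "1940", "1940"],
    "holc_grade": ["A", "A", "A", "A", "D", "D", "D", "D"],
    "white_perc": [0.9, 0.8, 0.95, 0.85, 0.6, 0.7, 0.5, 0.6],
})

# Mean of the characteristic for each (year, grade), with grades as columns
trend = (
    toy_long.groupby(["year", "holc_grade"])["white_perc"]
    .mean()
    .unstack("holc_grade")
)

# One line per HOLC grade, year on the x-axis
fig, ax = plt.subplots(figsize=(6, 4))
trend.plot(ax=ax, marker="o")
ax.set_xlabel("year")
ax.set_ylabel("mean white_perc")
ax.set_title("white_perc by HOLC grade over time (toy data)")
```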

In [None]:
### insert your code here


In [None]:
### insert your code here


## 1.3 Predicting HOLC grade using socioeconomic and demographic data during redlining (10 pts)
If HOLC grades shaped the demographic and socioeconomic characteristics of neighborhoods, we should be able to recover the grades from those characteristics. Let's use some of them to predict HOLC grades. Here you will use the `holc_data` dataset.

First, create `X` and `y` arrays that contain our features and targets.
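
A minimal sketch of the feature/target split follows. The column names `white_perc_1950` and `median_inc_1950` are assumptions about how the wide-format `holc_data` labels its year columns; check `holc_data.columns` and substitute the real names and years you want.

```python
import pandas as pd

# Hypothetical wide-format frame; real column names may differ
toy_wide = pd.DataFrame({
    "holc_grade": ["A", "B", "C", "D"],
    "white_perc_1950": [0.95, 0.90, 0.80, 0.60],
    "median_inc_1950": [4200.0, 3500.0, 2800.0, 2100.0],
})

# Features: the columns we think carry signal about the grade
feature_cols = ["white_perc_1950", "median_inc_1950"]
X = toy_wide[feature_cols].to_numpy()

# Target: the HOLC grade itself
y = toy_wide["holc_grade"].to_numpy()

print(X.shape, y.shape)
```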

In [None]:
## insert your code here.


Do we have any `NaN`s? Our ML models will not accept missing data. 

In [None]:
## insert your code here. 


To ensure we do not have any `NaN`s in our data, which our models cannot handle, we'll replace all of our `NaN`s with the median of the column.
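
The check and the median fill can be sketched together on a toy feature table (`toy_X`, hypothetical values). `isna().sum()` counts missing values per column, and `fillna` with a Series of medians fills each column with its own median.

```python
import numpy as np
import pandas as pd

# Hypothetical feature table with one missing value per column
toy_X = pd.DataFrame({
    "white_perc": [0.9, np.nan, 0.5, 0.7],
    "median_inc": [1000.0, 2000.0, np.nan, 4000.0],
})

# How many NaNs does each column have?
nan_counts = toy_X.isna().sum()
print(nan_counts)

# Replace each NaN with the median of its own column
# (DataFrame.median skips NaNs by default)
X_filled = toy_X.fillna(toy_X.median())
print(X_filled)
```

Strictly speaking, computing the medians on the training rows only would avoid leaking information from the test set. The sketch follows the order of the steps in this assignment and imputes before splitting.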

In [None]:
## insert your code here. 


We'll also want to standardize our data.
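
A short sketch of standardization with scikit-learn's `StandardScaler`, again on made-up numbers. Each column is rescaled to zero mean and unit variance. Random forests do not need this step, but many other models you might try later do.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
toy_X = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
])

# Learn the per-column mean and standard deviation, then rescale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(toy_X)

print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```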

In [None]:
## insert your code here. 


Now let's split our data into a train and test set. 
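
The split can be sketched with `train_test_split`. The 25% test share, the fixed `random_state`, and stratifying on the grade are illustrative choices, not requirements of the assignment. Stratifying keeps the A/B/C/D proportions similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 40 neighborhoods, 3 features, 10 of each grade
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(40, 3))
toy_y = np.repeat(["A", "B", "C", "D"], 10)

# Hold out 25% of the rows for testing, preserving grade proportions
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=42
)

print(X_train.shape, X_test.shape)
```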

In [None]:
## insert your code here. 


Using a `RandomForestClassifier` model, let's train our model to predict `y_train` from the input data `X_train`.

In [None]:
## insert your code here. 


How well did our model do? Show the accuracy, F1, ROC AUC, and log loss.
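
The sketch below trains a random forest and computes the four scores on synthetic four-class data from `make_classification`, standing in for the HOLC features. Because the target has four classes, F1 and ROC AUC need an averaging strategy. Macro-averaged F1 and one-vs-rest AUC are assumptions made here, and other choices are defensible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 4-class problem standing in for the HOLC data
X_toy, y_toy = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

# Fit the forest on the training set only
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Hard predictions for accuracy/F1, class probabilities for AUC/log loss
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)

scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "f1_macro": f1_score(y_test, y_pred, average="macro"),
    "roc_auc_ovr": roc_auc_score(y_test, y_proba, multi_class="ovr"),
    "log_loss": log_loss(y_test, y_proba, labels=clf.classes_),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Wrapping the scoring in a small function makes it easy to reuse for the second model and for the tuned forest.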

In [None]:
## insert your code here


Now try a different model and see whether the results are better.

In [None]:
## insert your code here.


Show the classification scores again for this new model. 

In [None]:
## insert your code here.


Since the scores weren't so different, let's go back to the RF Classifier model and tune its hyperparameters, so we can look at feature importances later. Please explain what you chose to tune.
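
One common way to tune is a cross-validated grid search. The sketch below uses `GridSearchCV` on synthetic data. The three hyperparameters and their candidate values are illustrative assumptions, and you should pick and justify your own grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class problem standing in for the training set
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Candidate values (illustrative): forest size, tree depth, leaf size
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}

# 3-fold cross-validation over every combination, scored by macro F1
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="f1_macro",
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(f"best CV f1_macro: {search.best_score_:.3f}")

# The best model, refit on all the data passed to fit()
best_rf = search.best_estimator_
```

Fit the search on the training set only, then score `best_rf` on the held-out test set so the comparison with the untuned model is fair.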

In [None]:
## insert your code here.


Now, how much did your predictions improve?

In [None]:
### insert your code here


## 1.4 Feature importances (3 pts)
Create a plot of the feature importances for each feature. You can use the sample code here: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

- What seem to be the most important features in determining outcomes?
- Is this surprising? What did you expect to be more important? 
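
A compact version of the impurity-based importance plot is sketched below. It is simpler than the linked scikit-learn example, which also draws error bars across trees. The feature names are hypothetical labels attached to synthetic data.

```python
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names attached to synthetic data
feature_names = ["white_perc", "black_perc", "college_perc", "median_inc"]
X_toy, y_toy = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_toy, y_toy)

# Impurity-based importances sum to 1; sort so the largest comes first
importances = pd.Series(
    clf.feature_importances_, index=feature_names
).sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(6, 4))
importances.plot.bar(ax=ax)
ax.set_ylabel("mean decrease in impurity")
ax.set_title("Feature importances (toy data)")
fig.tight_layout()
```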

In [None]:
## insert your code here.   


## 1.5 Predicting with pre-redlining data (10 pts)
We're going to use the 1930 data and the change between 1930 and 1940, since we don't really have a lot of data for 1930.

- How did the scores change? 
- What were the most important features and is this surprising? 

First, create the `population_1940_1930_diff`, `black_perc_1940_1930_diff`, `white_perc_1940_1930_diff` variables that show the change in these three between 1930 and 1940. 
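
The three change variables can be built in a short loop. The sketch assumes the wide-format columns are named like `population_1930` and `population_1940`, which is a guess based on the requested output names. Verify against `holc_data.columns` first.

```python
import pandas as pd

# Hypothetical wide-format frame; real column names may differ
toy_wide = pd.DataFrame({
    "population_1930": [100, 200],
    "population_1940": [150, 180],
    "black_perc_1930": [0.10, 0.30],
    "black_perc_1940": [0.20, 0.25],
    "white_perc_1930": [0.90, 0.70],
    "white_perc_1940": [0.80, 0.75],
})

# Change = 1940 value minus 1930 value, for each of the three variables
for var in ["population", "black_perc", "white_perc"]:
    toy_wide[f"{var}_1940_1930_diff"] = (
        toy_wide[f"{var}_1940"] - toy_wide[f"{var}_1930"]
    )

print(toy_wide.filter(like="_diff"))
```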

In [None]:
## insert your code here.


Now run a classification model again. 

In [None]:
## insert your code here.


And show how well the model did and whether the scores changed. 

In [None]:
## insert your code here.


And what are the feature importances? 

In [None]:
## insert your code here.


## Bonus: Feature creation (5 pts)
Let's say we think that the distance to the center of the city matters in terms of what the grade might be. 
- Create a new column called `dist_center` that is the distance from the centroid of each neighborhood (row) to the centroid of all the rows for each `city`. 
- Include this new column in your model. 
- Did it improve your scores? 
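
A possible sketch of `dist_center` on four toy squares is given below. It interprets "the centroid of all the rows for each city" as the centroid of the dissolved (unioned) city geometry, which is one reading among several. It also assumes a projected CRS so that distances are in meaningful units. If the real data is in latitude/longitude, reproject with `to_crs` before measuring.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical neighborhoods: two per city, in a projected CRS
toy = gpd.GeoDataFrame(
    {"city": ["X", "X", "Y", "Y"]},
    geometry=[
        box(0, 0, 2, 2), box(4, 0, 6, 2),   # city X
        box(0, 0, 2, 2), box(0, 6, 2, 8),   # city Y
    ],
    crs="EPSG:3857",
)

# One center per city: centroid of the union of that city's polygons
city_centers = toy.dissolve(by="city").geometry.centroid

# Look up each row's city center and align it with the row index
centers_per_row = city_centers.loc[toy["city"].to_numpy()].set_axis(toy.index)

# Distance from each neighborhood centroid to its own city's center
toy["dist_center"] = toy.geometry.centroid.distance(centers_per_row)

print(toy[["city", "dist_center"]])
```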



In [None]:
## insert your code here.


# 2. Predicting NYC taxi pick-up and drop-offs (40 pts)
In this exercise, we will be trying to predict where a taxi picked up its passenger based on where the passenger was dropped off.

That is: we want to predict the `PULocationID` using the rest of the data. 

This data is from: 
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

- Please familiarize yourself with the data dictionary and the taxi zones. 
- Would it make sense to add more features by bringing in more data from, for example, the Census?
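
As a starting point, the sketch below sets up features and target on three made-up trips. The column names follow the yellow-taxi data dictionary (`tpep_pickup_datetime`, `PULocationID`, `DOLocationID`, `trip_distance`). Which columns to keep, and the derived `pickup_hour` and `pickup_weekday` features, are illustrative assumptions.

```python
import pandas as pd

# Three hypothetical trips using yellow-taxi column names
toy_taxi = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2024-01-01 08:15:00",
        "2024-01-01 23:40:00",
        "2024-01-06 13:05:00",
    ]),
    "PULocationID": [161, 79, 132],
    "DOLocationID": [237, 148, 230],
    "trip_distance": [1.2, 2.5, 17.8],
})

# Time-of-day and day-of-week often matter for where trips start
toy_taxi["pickup_hour"] = toy_taxi["tpep_pickup_datetime"].dt.hour
toy_taxi["pickup_weekday"] = toy_taxi["tpep_pickup_datetime"].dt.dayofweek  # Monday = 0

# Target is the pickup zone; features are everything we choose to keep
X = toy_taxi[["DOLocationID", "trip_distance", "pickup_hour", "pickup_weekday"]]
y = toy_taxi["PULocationID"]

print(X)
```

Note that `PULocationID` and `DOLocationID` are zone labels rather than quantities, so treating them as plain numbers may mislead some models.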

In [None]:
### You will have to download the data from the link above
### A parquet file is a file format that is very efficient for
### storing dataframes. 

taxi_data = pd.read_parquet('yellow_tripdata_2024-01.parquet')

In [None]:
## insert code here