# 14A: Redlining: Historic Effects of Racism

In [None]:
# This code will load the R packages we will use
suppressPackageStartupMessages({
    library(coursekata)
})

# Pull in data
holc <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTxPfb9hD9EKXtnNuJp9k6SIn5KIbX1J3lGeoys49gS23M_hg79QDSRMHHQywirGHBO_MRAy1dUNLJ5/pub?gid=440145799&single=true&output=csv", header = TRUE)
holc$red_lined <- factor(holc$red_lined, levels = c(0,1), labels = c("not red-lined", "red-lined"))

### History of Redlining

Redlining was a process started in the 1930s in which the Home Owners' Loan Corporation (HOLC), a US federal agency, gave neighborhoods ratings to guide investment. Certain neighborhoods were colored red on HOLC maps and labeled as "hazardous" (see below). These "red-lined" neighborhoods were deemed risky investments, meaning that banks and other institutions were advised not to give loans to people trying to buy property in these areas.

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_p2TH8brH-image.png" alt="map of San Francisco area showing redlined areas" width = 80%>

These red-lined neighborhoods were predominantly home to communities of color, and this is no accident; the “hazardous” rating was in large part based on racial demographics (shown in historic documents below). In other words, redlining was an explicitly discriminatory policy. 

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_SyBmp1Qb-image.png" alt="paperwork from HOLC showing percentages of foreign-born families or black families" width=80%>

This history is not behind us in many ways. Today we will explore how the historic practice of redlining (legally halted with the Fair Housing Act of 1968) is running up against modern issues like global temperature rise.

Many of these historically “hazardous” neighborhoods are currently more impoverished *and* experience more extreme temperatures. Neighborhoods that are poorer and have more residents of color can be 5 to 20 degrees Fahrenheit hotter in summer than wealthier, whiter parts of the same city.


## The Data

The data frame `holc` allows us to take a peek into the continued impact of redlining in several regions of California. There are 812 neighborhoods and 7 variables:

- `neighborhood_id` a code for each neighborhood
- `red_lined` 1 if the neighborhood was red-lined and 0 if not (recoded in the setup code above as a factor with the labels "not red-lined" and "red-lined")
- `holc_grade` A stands for "1st grade/best", B stands for "2nd grade/still desirable", C stands for "3rd grade/definitely declining", and D stands for "4th grade/hazardous".
- `area` Name of the modern neighborhood
- `cal_region` whether the neighborhood is in Northern, Central, or Southern California
- `count_squares` How big the neighborhood is in square units (a standard unit that can be used to measure neighborhoods across both historic redlining maps and modern Google Maps)
- `tree_cov` How much tree coverage the neighborhood has, based on how many pixels in a Google Maps satellite image are "green" (assumed to indicate tree/plant areas); note that these pixels are not the same size as the square units used in `count_squares`

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_921jbfZj-image.png" width=500 alt="legend from HOLC map">

**Acknowledgements:** HOLC data are from a lesson developed as part of UC Berkeley’s Python-based http://data8.org/ curriculum. Information about redlining comes from Chapple and Thomas’ (2020) Urban Displacement Project (https://www.urbandisplacement.org/). Information about the link between redlining and tree cover/climate change is from the New York Times (https://www.nytimes.com/interactive/2020/08/24/climate/racism-redlining-cities-global-warming.html).

In [None]:
# Take a look at the data
str(holc)

# What are the cases in this data frame?


## 1.0: Do red-lined neighborhoods have more/less tree coverage?

**1.1 - Explore variation:** Create a word equation and make a visualization to help you explore this question.

**1.2 - Model variation:** Find the best fitting one-predictor model, express it in GLM notation, and interpret the parameter estimates. Add the predictions of the model to the visualization.

***GLM Notation:***

$Y_i = b_0 + b_1X_i + e_i$

**Note about variables with underscores:** 

The backslash (\\) can be used when you want markdown to ignore the special function of a symbol and display it as is. For example, underscores in math equations generate subscript text, but some variable names contain underscores that you want to keep as they are (i.e., you don't want what follows them to appear as a subscript). Just insert a backslash before the underscore, as seen below (double-click into edit mode to see the backslashes used in `tree_cov` and `red_lined`).

> $tree\_cov_i = b_0 + b_1red\_lined_i + e_i$
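As a rough illustration of what $b_0$ and $b_1$ mean when the predictor is a two-level factor, the sketch below fits the same kind of model to simulated data. The `toy` data frame and every number in it are invented for this example and are not drawn from `holc`; only base R functions are used.

```r
# Simulated toy data: the values are made up, not taken from holc.
set.seed(14)
toy <- data.frame(
  red_lined = factor(rep(c("not red-lined", "red-lined"), each = 20),
                     levels = c("not red-lined", "red-lined"))
)
toy$tree_cov <- ifelse(toy$red_lined == "red-lined", 900, 1000) +
  rnorm(40, sd = 100)

# Fit the one-predictor (two-group) model.
group_model <- lm(tree_cov ~ red_lined, data = toy)
b <- unname(coef(group_model))

# Group means, for comparison with the estimates.
mean_not <- mean(toy$tree_cov[toy$red_lined == "not red-lined"])
mean_red <- mean(toy$tree_cov[toy$red_lined == "red-lined"])

# b[1] (b0) matches the mean of the reference group (the first factor level);
# b[2] (b1) matches the difference between the two group means.
print(b)
print(c(mean_not, mean_red - mean_not))
```

The same reading should carry over to the real data: the reference group is whichever level comes first in the factor, which for `red_lined` is "not red-lined".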

**1.3 - Evaluate models:** Is the model of the DGP that includes red-lining better than the empty model of the DGP? Explain with statistics.
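One way to see where the statistics for this comparison come from is to compute PRE by hand from the two models' sums of squared errors. The sketch below does this on simulated data; the `toy` data frame and its numbers are invented for illustration, and base R's `anova()` is used here as a stand-in for whichever ANOVA table function you prefer.

```r
# Simulated toy data: the values are made up, not taken from holc.
set.seed(14)
toy <- data.frame(
  red_lined = factor(rep(c("not red-lined", "red-lined"), each = 20),
                     levels = c("not red-lined", "red-lined"))
)
toy$tree_cov <- ifelse(toy$red_lined == "red-lined", 900, 1000) +
  rnorm(40, sd = 100)

# Empty model (intercept only; equivalent to tree_cov ~ 1) and group model.
empty_model <- lm(tree_cov ~ NULL, data = toy)
group_model <- lm(tree_cov ~ red_lined, data = toy)

# Sum of squared errors left over by each model.
sse_empty <- sum(resid(empty_model)^2)
sse_group <- sum(resid(group_model)^2)

# PRE: the proportion of the empty model's error removed by the predictor.
pre <- (sse_empty - sse_group) / sse_empty
print(pre)

# F test comparing the two nested models.
print(anova(empty_model, group_model))
```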

## 2.0: The size of the neighborhood

Notice that only 1% of the variation in tree coverage is explained by red-lining. The variation in tree coverage seems to have a lot of "other stuff" going on. Perhaps some of this variation comes from the fact that these neighborhoods are different sizes. Does neighborhood size predict tree coverage on Google Maps?

**2.1 - Explore variation:** Create a word equation and make a visualization to help you explore this question.

**2.2 - Model variation:** Find the best fitting one-predictor model, express it in GLM notation, and interpret the parameter estimates. Add the predictions of the model to the visualization.

***GLM Notation:***

$Y_i = b_0 + b_1X_i + e_i$

**2.3 - Evaluate models:** Is the model of the DGP that includes size of the neighborhood better than the empty model of the DGP? Explain with statistics.

## 3.0: Residuals after controlling for size of the neighborhood

Here we have saved the residuals from the `size_model` as `size_resid`.

In [None]:
size_model <- lm(tree_cov ~ count_squares, data = holc)
holc$size_resid <- resid(size_model)

You can think of these residuals as the **other stuff** from this word equation:

**tree coverage = count squares + other stuff**
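The "other stuff" idea can be checked directly: a residual is just the observed value minus the model's prediction. Here is a minimal sketch on simulated data (the `toy` data frame and its numbers are invented, not taken from `holc`) showing that subtracting by hand gives the same values as `resid()`.

```r
# Simulated toy data: the values are made up, not taken from holc.
set.seed(1)
toy <- data.frame(count_squares = runif(50, min = 5, max = 100))
toy$tree_cov <- 200 + 30 * toy$count_squares + rnorm(50, sd = 150)

# Fit a size-only model, mirroring size_model above.
toy_model <- lm(tree_cov ~ count_squares, data = toy)

# Residual = observed outcome - model prediction.
by_hand <- as.numeric(toy$tree_cov - predict(toy_model))
size_resid <- as.numeric(resid(toy_model))

# TRUE if the hand-computed values match resid().
resid_match <- isTRUE(all.equal(by_hand, size_resid))
print(resid_match)

# Residuals from a model with an intercept sum to (approximately) zero.
print(sum(size_resid))
```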

**3.1:** Can some of that **other stuff** be related to red-lining? Here we have ordered the neighborhoods in `holc` according to the size of the residuals. Use `head()` and `tail()` to examine the neighborhoods that have the largest and smallest residuals. Do you notice any patterns?

In [None]:
holc_ordered <- arrange(holc, size_resid)
head(holc_ordered)
tail(holc_ordered)

## 4.0: Multivariate model

Perhaps once we control for the size of the neighborhood, we can more clearly see the effect of red-lining on tree coverage.

**4.1 - Explore variation:** Create a word equation and make a visualization to help you explore this question.

**4.2 - Model variation:** Find the best fitting multivariate model, express it in GLM notation, and interpret the parameter estimates. Add the predictions of the model to the visualization.

***GLM Notation:***

$Y_i = b_0 + b_1X_{1i} + b_2X_{2i} + e_i$

**4.3 - Evaluate models:** Is the multivariate model of the DGP that includes size and redlining better than the empty model of the DGP? Explain with statistics.

**4.4:** Take a good look at your visualization. If you had to guess, which neighborhoods are generally larger: not red-lined or red-lined? Why do you think that matters?

**4.5:** Why is the PRE for red-lining now 4% (compared to 1% in the one-predictor model)?

**4.6:** What does the p-value in the `red_lined` row of the ANOVA table tell us? What model can we reject?
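To make the last two questions more concrete, the sketch below compares a size-only model with a model that adds `red_lined`, using simulated data. The `toy` data frame and its numbers are invented for illustration. The PRE computed here is relative to the error left over by the size-only model, which should correspond to how a per-predictor PRE is reported in a Type III ANOVA table, though that correspondence is an assumption worth checking against your own output.

```r
# Simulated toy data: the values are made up, not taken from holc.
set.seed(14)
n <- 60
toy <- data.frame(
  count_squares = runif(n, min = 5, max = 100),
  red_lined = factor(rep(c("not red-lined", "red-lined"), each = n / 2),
                     levels = c("not red-lined", "red-lined"))
)
toy$tree_cov <- 200 + 30 * toy$count_squares -
  150 * (toy$red_lined == "red-lined") + rnorm(n, sd = 150)

# Simpler model (size only) and complex model (size plus red-lining).
size_only <- lm(tree_cov ~ count_squares, data = toy)
multi_model <- lm(tree_cov ~ count_squares + red_lined, data = toy)

# Error left by each model.
sse_size <- sum(resid(size_only)^2)
sse_multi <- sum(resid(multi_model)^2)

# Proportion of the size-only model's error removed by adding red_lined.
pre_red_lined <- (sse_size - sse_multi) / sse_size
print(pre_red_lined)

# F test for adding red_lined; the simpler model is the one being tested.
comparison <- anova(size_only, multi_model)
print(comparison)
```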

## 5.0 - Advanced: All four grades for neighborhoods

Although we have been just looking at red-lined neighborhoods versus all other neighborhoods, would including all four grades (A,B,C,D) show us different patterns than just including `red_lined` in our multivariate model?

**5.1 - Explore variation:** Create a word equation and make a visualization to help you explore this question.

**5.2 - Model variation:** Find the best fitting multivariate model, express it in GLM notation, and interpret the parameter estimates. Add the predictions of the model to the visualization.

***GLM Notation:***

$Y_i = b_0 + b_1X_{1i} + b_2X_{2i} + e_i$

*Note: You may need to add additional $b$ estimates.*
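To see why extra $b$ estimates appear, it can help to look at how R codes a four-level factor. In this sketch on simulated data (the `toy` data frame and its numbers are invented, not taken from `holc`), the model gets one estimate for the intercept, one for size, and one for each grade other than the reference grade.

```r
# Simulated toy data: the values are made up, not taken from holc.
set.seed(14)
n <- 80
toy <- data.frame(
  count_squares = runif(n, min = 5, max = 100),
  holc_grade = factor(rep(c("A", "B", "C", "D"), each = n / 4))
)
toy$tree_cov <- 200 + 30 * toy$count_squares + rnorm(n, sd = 150)

# Size plus a four-level categorical predictor.
grade_model <- lm(tree_cov ~ count_squares + holc_grade, data = toy)

# With R's default coding, grade A is the reference level, so the
# grade estimates are differences from A for grades B, C, and D.
coef_names <- names(coef(grade_model))
print(coef_names)
```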

**5.3 - Evaluate models:** Is the multivariate model of the DGP that includes size and HOLC grade better than the empty model of the DGP? Explain with statistics.

## 6.0 - Alternate Advanced: How about a ratio?

**6.1:** Of course the size of the neighborhood probably has a lot to do with the tree coverage. Is there a way to create an outcome variable that already incorporates this information? 

**6.2:** Would a model like that result in a PRE for red-lining that is more like the multivariate model (around 4%) or more like the single-predictor model (around 1%)? Make a prediction -- then find out!

**6.3:** Why don't we get exactly the same PRE as prior models?