<div class="alert alert-block alert-danger">

# 14A: Redlining: Historic Effects of Racism (COMPLETE)

**Use with textbook version 6.0+**


**Lesson assumes students have read up through page: 14.3**


</div>

<div class="alert alert-block alert-warning">

#### Summary of Notebook:

In this notebook, students learn more about what "controlling for" means and how it differs from "proportional". Students explore a data frame that includes information on historically red-lined neighborhoods and the tree coverage of those neighborhoods today. The lesson demonstrates how single-predictor models that use either redlining or neighborhood size to predict tree coverage may not be as predictive as a multivariate model that takes both factors into account.
 
#### Includes:

- Fitting and interpreting simple and multiple regression models
- Exploring the dangers of failing to include an important variable in a model
- Using visualizations to explore which variables should be included in a model

</div>

<div class="alert alert-block alert-success">

## Approximate time to complete Notebook: 75-105 Mins

</div>

In [None]:
# This code will load the R packages we will use
suppressPackageStartupMessages({
    library(coursekata)
})

# Pull in data
holc <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTxPfb9hD9EKXtnNuJp9k6SIn5KIbX1J3lGeoys49gS23M_hg79QDSRMHHQywirGHBO_MRAy1dUNLJ5/pub?gid=440145799&single=true&output=csv", header = TRUE)
holc$red_lined <- factor(holc$red_lined, levels = c(0,1), labels = c("not red-lined", "red-lined"))

### History of Redlining

Redlining was a process started in the 1930s in which the Home Owners' Loan Corporation (HOLC), a US federal agency, gave neighborhoods ratings to guide investment. Certain neighborhoods were colored red on HOLC maps and labeled as "hazardous" (see below). These "red-lined" neighborhoods were deemed risky investments (that is, banks and other institutions should not give loans to people trying to buy property in these areas).

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_p2TH8brH-image.png" alt="map of San Francisco area showing redlined areas" width = 80%>

These red-lined neighborhoods were predominantly home to communities of color, and this is no accident; the “hazardous” rating was in large part based on racial demographics (shown in historic documents below). In other words, redlining was an explicitly discriminatory policy. 

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_SyBmp1Qb-image.png" alt="paperwork from HOLC showing percentages of foreign-born families or black families" width=80%>

This history is not behind us in many ways. Today we will explore how the historic practice of redlining (legally halted with the Fair Housing Act of 1968) is running up against modern issues like global temperature rise.

Many of these historically “hazardous” neighborhoods are currently more impoverished *and* experience more extreme temperatures. Neighborhoods that are poorer and have more residents of color can be 5 to 20 degrees Fahrenheit hotter in summer than wealthier, whiter parts of the same city.


<div class="alert alert-block alert-success">

### 1.0 - Approximate Time:  10-15 mins

</div>

## The Data

The data frame `holc` allows us to take a peek into the continued impact of redlining in several regions of California. There are 812 neighborhoods and 7 variables:

- `neighborhood_id` a code for each neighborhood
- `red_lined` 1 if the neighborhood was red-lined and 0 if not
- `holc_grade` A stands for "1st grade/best", B stands for "2nd grade/still desirable", C stands for "3rd grade/definitely declining", and D stands for "4th grade/hazardous".
- `area` Name of the modern neighborhood
- `cal_region` whether the neighborhood is in Northern, Central, or Southern California
- `count_squares` How big the neighborhood is in square units (a standard unit that can be used to measure neighborhoods across both historic redlining maps and modern Google Maps)
- `tree_cov` How much tree coverage is in that neighborhood, based on how many pixels on a Google Maps satellite image are "green" (assumed to indicate tree/plant areas); note that these pixels are not the same size as the square units used in `count_squares`

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_921jbfZj-image.png" width=500 alt="legend from HOLC map">

**Acknowledgements:** HOLC data are from a lesson developed as part of UC Berkeley’s Python-based http://data8.org/ curriculum. Information about redlining comes from Chapple and Thomas’ (2020) Urban Displacement Project (https://www.urbandisplacement.org/). Information about the link between redlining and tree cover/climate change is from the New York Times (https://www.nytimes.com/interactive/2020/08/24/climate/racism-redlining-cities-global-warming.html).

In [None]:
# Take a look at the data
str(holc)

# What are the cases in this data frame?


## 1.0: Do red-lined neighborhoods have more/less tree coverage?

**1.1 - Explore variation:** Create a word equation and make a visualization to help you explore this question.

<div class="alert alert-block alert-warning">

<b>Sample Responses:</b> 

- tree_cov = red_lined + error

- tree coverage = redlining + other stuff

</div>

In [None]:
# Sample Responses

gf_jitter(tree_cov ~ red_lined, data = holc, alpha = .1) %>%
    gf_boxplot()


gf_histogram(~tree_cov, data = holc) %>%
    gf_facet_grid(red_lined ~ .)

**1.2 - Model variation:** Find the best fitting one-predictor model, express it in GLM notation, and interpret the parameter estimates. Add the predictions of the model to the visualization.

***GLM Notation:***

$Y_i = b_0 + b_1X_i + e_i$

**Note about variables with underscores:** 

The backslash (\\) can be used when you want to ignore the function of a symbol in markdown and just use it as is. For example, the underscores in the math equations serve to generate subscript text, but sometimes the variable names have underscores in them that you want to keep as they are (i.e., you don't want them to appear as subscript text). Just insert a backslash before the underscore symbol, as seen below (double-click into edit mode to see the backslashes used in `tree_cov` and `red_lined`).

> $tree\_cov_i = b_0 + b_1red\_lined_i + e_i$

In [None]:
# Sample Response

# Redlining Model
red_model <- lm(tree_cov ~ red_lined, data = holc)
red_model

# Add Model Predictions to Visualization
gf_jitter(tree_cov ~ red_lined, data = holc, alpha = .1) %>%
    gf_boxplot() %>%
    gf_model(red_model, color = "red")

gf_histogram(~tree_cov, data = holc) %>%
    gf_facet_grid(red_lined ~ .) %>%
    gf_model(red_model, color = "red")

<div class="alert alert-block alert-warning">

<b>Sample Responses:</b> 

***GLM Notation:***

- $tree\_cov_i = 7751 + -4584red\_linedred-lined_i + e_i$
- $Y_i = 7751 + -4584X_i + e_i$

***Interpret Parameter Estimates:***

- $b_0$ = 7751, this is the average tree coverage for a not red-lined neighborhood (when X = 0).

- $b_1$ = -4584, this is the increment (or mean difference): the amount of tree coverage we add to $b_0$ (here a negative amount, so we subtract) to get the average tree coverage for a red-lined neighborhood (when X = 1).


</div>
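The interpretation above can be checked on a small example. The sketch below uses made-up toy numbers (not the `holc` data) and base R only; the data frame `toy` and its values are assumptions chosen so the group means are easy to see. It shows that, for a two-group model, $b_0$ matches the mean of the reference group and $b_1$ matches the difference between the two group means.

```r
# Toy data, made up for illustration (NOT the holc data)
toy <- data.frame(
  tree_cov  = c(10, 12, 14, 4, 6, 8),
  red_lined = factor(rep(c("not red-lined", "red-lined"), each = 3),
                     levels = c("not red-lined", "red-lined"))
)

toy_model <- lm(tree_cov ~ red_lined, data = toy)
b <- unname(coef(toy_model))

# Group means computed directly from the data
mean_not <- mean(toy$tree_cov[toy$red_lined == "not red-lined"])
mean_red <- mean(toy$tree_cov[toy$red_lined == "red-lined"])

# b0 should match the mean of the reference group (not red-lined)
c(b0 = b[1], mean_not = mean_not)

# b1 should match the red-lined mean minus the not red-lined mean
c(b1 = b[2], mean_difference = mean_red - mean_not)
```

The same comparison could be run on `holc` by swapping in the real data frame; the toy version just keeps the arithmetic visible.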

**1.3 - Evaluate models:** Is the model of the DGP that includes red-lining better than the empty model of the DGP? Explain with statistics.

In [None]:
# Sample Response

supernova(red_model)

# Optional: Add argument to condense length of output
supernova(red_model, verbose = FALSE)

<div class="alert alert-block alert-warning">

<b>Sample Responses:</b> 

The model does seem at least somewhat better than the empty model. It has reduced error by about 1.5% (PRE = 0.0148).

</div>
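For readers who want to see where a PRE comes from, here is a minimal sketch in base R using made-up toy numbers (not the `holc` data). It assumes the usual definition of PRE as the proportion of the empty model's sum of squared error that the more complex model removes; the names `toy`, `ss_total`, and `ss_error` are chosen for illustration.

```r
# Toy data, made up for illustration (NOT the holc data)
toy <- data.frame(
  y     = c(10, 12, 14, 4, 6, 8),
  group = factor(rep(c("a", "b"), each = 3))
)

empty_model <- lm(y ~ 1, data = toy)      # predicts the grand mean for everyone
group_model <- lm(y ~ group, data = toy)  # predicts each group's own mean

ss_total <- sum(resid(empty_model)^2)  # error left by the empty model
ss_error <- sum(resid(group_model)^2)  # error left by the group model

# Proportional reduction in error
pre <- (ss_total - ss_error) / ss_total
pre

# For a model fit with lm(), this matches R-squared
summary(group_model)$r.squared
```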

<div class="alert alert-block alert-success">

### 2.0 - Approximate Time:  10-15 mins

</div>

## 2.0: The size of the neighborhood

Notice that only about 1% of the variation in tree coverage is explained by red-lining. The variation in tree coverage seems to have a lot of "other stuff" going on. Perhaps some of this variation comes from the fact that these neighborhoods are different sizes. Does neighborhood size predict tree coverage on Google Maps?

**2.1 - Explore variation:** Create a word equation and make a visualization to help you explore this question.

<div class="alert alert-block alert-warning">

<b>Sample Responses:</b> 

- tree_cov = count_squares + error

- tree coverage = size of neighborhood + other stuff

</div>

In [None]:
# Sample Response

gf_jitter(tree_cov ~ count_squares, data = holc, alpha = .1)

**2.2 - Model variation:** Find the best fitting one-predictor model, express it in GLM notation, and interpret the parameter estimates. Add the predictions of the model to the visualization.

***GLM Notation:***

$Y_i = b_0 + b_1X_i + e_i$

In [None]:
# Sample Response

size_model <- lm(tree_cov ~ count_squares, data = holc)
size_model

gf_jitter(tree_cov ~ count_squares, data = holc, alpha = .1) %>%
    gf_model(size_model, color = "green4")

<div class="alert alert-block alert-warning">

<b>Sample Responses:</b> 

***GLM Notation:***

- $tree\_cov_i = 2380.83 + 2.71count\_squares_i + e_i$
- $Y_i = 2380.83 + 2.71X_i + e_i$


***Interpret Parameter Estimates:***

- $b_0$ = 2380.83, this is the amount of predicted tree coverage when count_squares is zero (when X = 0).

- $b_1$ = 2.71, this is the slope, or the amount of tree coverage we add to $b_0$ for every 1 unit increase in count_squares.


</div>
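To make the slope interpretation concrete, the sketch below plugs the sample estimates above (2380.83 and 2.71) into a small prediction function. The function name `predict_tree_cov` and the neighborhood sizes 1000 and 1001 are hypothetical, chosen only to show a one-unit increase.

```r
# Estimates copied from the sample response above
b0 <- 2380.83
b1 <- 2.71

# Hypothetical helper: the size model's prediction for a given count_squares
predict_tree_cov <- function(count_squares) {
  b0 + b1 * count_squares
}

# Prediction when count_squares is zero is just b0
predict_tree_cov(0)

# Two hypothetical neighborhoods that differ by one square unit
one_unit_change <- predict_tree_cov(1001) - predict_tree_cov(1000)
one_unit_change  # the difference between the two predictions is the slope, b1
```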

**2.3 - Evaluate models:** Is the model of the DGP that includes size of the neighborhood better than the empty model of the DGP? Explain with statistics.

In [None]:
# Sample Responses

#supernova(size_model)
supernova(size_model, verbose = FALSE)

<div class="alert alert-block alert-warning">

***Sample Response:***

The model does seem better than the empty model. It has reduced error by about 18% (PRE = 0.18), and the relatively large F value (178.36) tells us that this single-predictor model is a pretty good model compared to our empty model. We got a lot of reduction in error for the single degree of freedom spent.

</div>

<div class="alert alert-block alert-success">

### 3.0 - Approximate Time:  5-10 mins

</div>

## 3.0: Residuals after controlling for size of the neighborhood

Here we have saved the residuals from the `size_model` as `size_resid`.

In [None]:
size_model <- lm(tree_cov ~ count_squares, data = holc)
holc$size_resid <- resid(size_model)

You can think of these residuals as the **other stuff** from this word equation:

**tree coverage = count squares + other stuff**

**3.1:** Can some of that **other stuff** be related to red-lining? Here we have ordered the neighborhoods in `holc` according to the size of the residuals. Use `head()` and `tail()` to examine the neighborhoods that have the largest and smallest residuals. Do you notice any patterns?

In [None]:
holc_ordered <- arrange(holc, size_resid)
head(holc_ordered)
tail(holc_ordered)

<div class="alert alert-block alert-warning">

***Sample Response:***

*Students may notice a variety of patterns, such as:*

- Those with the smallest (most negative) residuals also tend to:
  - be red-lined
  - have lower tree coverage
  - be mostly C/D grade
  - have larger count_squares

*And, generally, the opposite is true about those with the largest residuals.*

</div>
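A residual is the observed value minus the model's prediction, so comparing residuals across groups is one way to ask whether the "other stuff" is related to red-lining. The sketch below illustrates this with made-up toy numbers (not the `holc` data); the column names mirror the notebook, but the values are assumptions.

```r
# Toy data, made up for illustration (NOT the holc data)
toy <- data.frame(
  count_squares = c(1, 2, 3, 4, 5, 6),
  tree_cov      = c(5, 6, 9, 8, 13, 12),
  red_lined     = factor(c("not", "red", "not", "red", "not", "red"))
)

size_model <- lm(tree_cov ~ count_squares, data = toy)

# Residual = observed tree coverage - tree coverage predicted from size
toy$size_resid <- unname(toy$tree_cov - predict(size_model))

# Same values that resid() returns
all.equal(toy$size_resid, unname(resid(size_model)))

# Average leftover tree coverage in each group, after accounting for size
mean_resid_not <- mean(toy$size_resid[toy$red_lined == "not"])
mean_resid_red <- mean(toy$size_resid[toy$red_lined == "red"])
c(not = mean_resid_not, red = mean_resid_red)
```

In this toy example the red-lined group has a negative average residual, meaning less tree coverage than its size alone would predict.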

<div class="alert alert-block alert-success">

### 4.0 - Approximate Time:  20-25 mins

</div>

## 4.0: Multivariate model

Perhaps once we control for the size of the neighborhood, we can more clearly see the effect of red-lining on tree coverage.

**4.1 - Explore variation:** Create a word equation and make a visualization to help you explore this question.

<div class="alert alert-block alert-warning">

<b>Sample Responses:</b> 

- tree_cov = count_squares + red_lined + error

- tree coverage = size of neighborhood + redlining + other stuff

</div>

In [None]:
# Sample Response

gf_jitter(tree_cov ~ count_squares, color = ~red_lined, data = holc, alpha = .1) 

**4.2 - Model variation:** Find the best fitting multivariate model, express it in GLM notation, and interpret the parameter estimates. Add the predictions of the model to the visualization.

***GLM Notation:***

$Y_i = b_0 + b_1X_{1i} + b_2X_{2i} + e_i$

In [None]:
# Sample Response

multi_model <- lm(tree_cov ~ count_squares + red_lined, data = holc)
multi_model

gf_jitter(tree_cov ~ count_squares, color = ~red_lined, data = holc, alpha = .1) %>%
    gf_model(multi_model)

<div class="alert alert-block alert-warning">

**Sample Responses:**

***GLM Notation:***

- $Y_i = 3376.74 + 2.87X_{1i} + -6865.19X_{2i} + e_i$

- $tree\_cov_i = 3376.74 + 2.87count\_squares_{i} + -6865.19red\_linedred-lined_{i} + e_i$

***Interpret Parameter Estimates:***

- $b_0$ = 3376.74, this is the amount of predicted tree coverage for a not red-lined neighborhood (when X2 = 0) when count_squares is zero (when X1 = 0).

- $b_1$ = 2.87, this is the slope, or the amount of tree coverage we add to the model for every 1 unit increase in count_squares.

- $b_2$ = -6865.19, this is the increment (or group difference), or the amount we subtract from the model to get the predicted tree coverage for a red-lined neighborhood.


</div>
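One way to read "controlling for" in this model: at any fixed neighborhood size, the predictions for the two groups differ by the same amount, $b_2$. The sketch below plugs in the sample estimates above; the helper `predict_tree_cov` and the three sizes are hypothetical and chosen only for illustration.

```r
# Estimates copied from the sample response above
b0 <- 3376.74
b1 <- 2.87
b2 <- -6865.19

# Hypothetical helper: red_lined is coded 0 (not red-lined) or 1 (red-lined)
predict_tree_cov <- function(count_squares, red_lined) {
  b0 + b1 * count_squares + b2 * red_lined
}

# Three hypothetical neighborhood sizes
sizes <- c(500, 1000, 5000)

# Gap between red-lined and not red-lined predictions at each size
gap <- predict_tree_cov(sizes, 1) - predict_tree_cov(sizes, 0)
gap  # the same value at every size, equal to b2
```

This is why the model's predictions appear as two parallel lines in the visualization.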

**4.3 - Evaluate models:** Is the multivariate model of the DGP that includes size and redlining better than the empty model of the DGP? Explain with statistics.

In [None]:
# Sample Response

#supernova(multi_model)
supernova(multi_model, verbose = FALSE)

<div class="alert alert-block alert-warning">

***Sample Response:***

The model does seem better than the empty model. It has reduced error by about 21% (PRE = 0.21), and the relatively large F value (109.55) tells us that this multivariate model is a pretty good model compared to our empty model. We got a lot of reduction in error for the degrees of freedom spent.

</div>

**4.4:** Take a good look at your visualization. If you had to guess, which neighborhoods are generally larger: not red-lined or red-lined? Why do you think that matters?

<div class="alert alert-block alert-warning">

**Sample Response:**

The red-lined neighborhoods are generally larger. 

*Students may come up with a variety of reasons, such as:*

- This could matter because if red-lined neighborhoods are more densely packed with people and buildings, they may have less room for trees.


</div>

In [None]:
# complete version
favstats(count_squares ~ red_lined, data = holc)

**4.5:** Why is the PRE for red-lining now 4% (compared to 1% in the one-predictor model)?

<div class="alert alert-block alert-warning">

**Sample Response:**

If you just look at the x-axis (`count_squares`) you can see that some of the red-lined neighborhoods are quite large. A small neighborhood with a lot of tree coverage could have the same `tree_cov` as a large neighborhood with very little tree coverage. Before we included both `count_squares` and `red_lined` in the model, we couldn't distinguish between those two kinds of neighborhoods. Now that we can, we can see that red-lined neighborhoods have less tree coverage than would be expected for a neighborhood of their size.


</div>
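The shift in PRE can be reproduced on a small scale. The sketch below uses made-up toy numbers (not the `holc` data) in which red-lined neighborhoods are larger but have less tree coverage for their size. It assumes that the PRE for `red_lined` in the multivariate model is read as the proportion of the size-only model's error that adding `red_lined` removes; the helper `sse` is a name introduced here.

```r
# Toy data, made up for illustration (NOT the holc data)
toy <- data.frame(
  count_squares = 1:8,
  red_lined = factor(rep(c("not red-lined", "red-lined"), each = 4),
                     levels = c("not red-lined", "red-lined")),
  tree_cov = c(2.5, 3.5, 5.5, 8.5, 0.5, 1.5, 3.5, 6.5)
)

# Hypothetical helper: sum of squared error for a fitted model
sse <- function(model) sum(resid(model)^2)

empty_model <- lm(tree_cov ~ 1, data = toy)
red_model   <- lm(tree_cov ~ red_lined, data = toy)
size_model  <- lm(tree_cov ~ count_squares, data = toy)
multi_model <- lm(tree_cov ~ count_squares + red_lined, data = toy)

# PRE for red_lined on its own: error removed relative to the empty model
pre_alone <- (sse(empty_model) - sse(red_model)) / sse(empty_model)

# PRE for red_lined after controlling for size:
# error removed relative to the size-only model
pre_controlled <- (sse(size_model) - sse(multi_model)) / sse(size_model)

c(alone = pre_alone, controlled = pre_controlled)
```

In the toy data the effect is exaggerated on purpose: once size is in the model, the difference between groups is no longer blurred by the fact that the red-lined neighborhoods are bigger.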

**4.6:** What does the p-value in the `red_lined` row of the ANOVA table tell us? What model can we reject?

<div class="alert alert-block alert-warning">

**Sample Response:**

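The p-value is the probability of getting an F as large as (or larger than) the one observed for `red_lined` if the simpler model, the one that includes everything except `red_lined`, were the true model of the DGP. Because that p-value is below our alpha criterion, we can reject the model that includes everything except `red_lined` (i.e., the one-predictor size model).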


</div>
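The comparison described above, the full model against the model that leaves out `red_lined`, can also be run directly in base R with `anova()` on two nested models. This is a sketch on made-up toy numbers (not the `holc` data), and it is offered as an alternative view of the same comparison rather than as the notebook's own method.

```r
# Toy data, made up for illustration (NOT the holc data)
toy <- data.frame(
  count_squares = 1:8,
  red_lined = factor(rep(c("not red-lined", "red-lined"), each = 4),
                     levels = c("not red-lined", "red-lined")),
  tree_cov = c(2.5, 3.5, 5.5, 8.5, 0.5, 1.5, 3.5, 6.5)
)

# Simpler model: everything except red_lined
size_model  <- lm(tree_cov ~ count_squares, data = toy)
# Complex model: adds red_lined
multi_model <- lm(tree_cov ~ count_squares + red_lined, data = toy)

# F test comparing the two nested models
comparison <- anova(size_model, multi_model)
comparison

# p-value for adding red_lined (second row of the comparison table)
p_value <- comparison[["Pr(>F)"]][2]
p_value < 0.05  # TRUE means we would reject the simpler (size-only) model
```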

<div class="alert alert-block alert-success">

### 5.0 - Approximate Time:  20-25 mins

</div>

## 5.0 - Advanced: All four grades for neighborhoods

Although we have been just looking at red-lined neighborhoods versus all other neighborhoods, would including all four grades (A,B,C,D) show us different patterns than just including `red_lined` in our multivariate model?

**5.1 - Explore variation:** Create a word equation and make a visualization to help you explore this question.

<div class="alert alert-block alert-warning">

**Sample Responses:** 

- tree_cov = count_squares + holc_grade + error

- tree coverage = size of neighborhood + HOLC grade + other stuff

</div>

In [None]:
# Sample Response

gf_jitter(tree_cov ~ count_squares, color = ~holc_grade, data = holc, alpha = .1)

**5.2 - Model variation:** Find the best fitting multivariate model, express it in GLM notation, and interpret the parameter estimates. Add the predictions of the model to the visualization.

***GLM Notation:***

$Y_i = b_0 + b_1X_{1i} + b_2X_{2i} + e_i$

*Note: You may need to add additional $b$ estimates.*

In [None]:
# Sample Response

multi_model4 <- lm(tree_cov ~ count_squares + holc_grade, data = holc)
multi_model4

gf_jitter(tree_cov ~ count_squares, color = ~holc_grade, data = holc, alpha = .1) %>%
    gf_model(multi_model4)

<div class="alert alert-block alert-warning">

**Sample Responses:**

***GLM Notation:***

- $Y_i = 13927.29 + 2.99X_{1i} + -10697.48X_{2i} + -14531.11X_{3i} + -17685.12X_{4i} + e_i$

- $tree\_cov_i = 13927.29 + 2.99count\_squares_i + -10697.48holc\_gradeB_i + -14531.11holc\_gradeC_i + -17685.12holc\_gradeD_i + e_i$

***Interpret Parameter Estimates:***

- $b_0$ = 13927.29, this is the amount of predicted tree coverage for a grade A neighborhood (when X2, X3, and X4 = 0) when count_squares is zero (when X1 = 0).

- $b_1$ = 2.99, this is the slope, or the amount of tree coverage we add to the model for every 1 unit increase in count_squares.

- $b_2$ = -10697.48, this is the increment (or group difference), or the amount we subtract from the model to get the predicted tree coverage for a grade B neighborhood.

- $b_3$ = -14531.11, this is the increment (or group difference), or the amount we subtract from the model to get the predicted tree coverage for a grade C neighborhood.

- $b_4$ = -17685.12, this is the increment (or group difference), or the amount we subtract from the model to get the predicted tree coverage for a grade D neighborhood.


</div>
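The three increments above exist because R turns a four-level categorical variable into three 0/1 columns, with the first level (grade A) as the reference group. A minimal sketch, using a made-up data frame with one neighborhood per grade, shows the columns R builds behind the scenes.

```r
# One made-up neighborhood per grade, for illustration only
toy <- data.frame(holc_grade = factor(c("A", "B", "C", "D")))

# The design matrix lm() would build for a holc_grade predictor
X <- model.matrix(~ holc_grade, data = toy)
X

# Grade A is the reference group: all three grade columns are 0 in its row
X[1, c("holc_gradeB", "holc_gradeC", "holc_gradeD")]
```

The column names (`holc_gradeB`, `holc_gradeC`, `holc_gradeD`) are the variable name pasted onto the level name, which is also where the labels in the GLM notation above come from.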

**5.3 - Evaluate models:** Is the multivariate model of the DGP that includes size and HOLC grade better than the empty model of the DGP? Explain with statistics.

In [None]:
# Sample Responses

#supernova(multi_model4)
supernova(multi_model4, verbose = FALSE)

<div class="alert alert-block alert-warning">

**Sample Response:**

The model does seem better than the empty model. It has reduced error by about 31% (PRE = 0.31), and the relatively large F value (90.10) tells us that this multivariate model is a pretty good model compared to our empty model. We got a lot of reduction in error for the degrees of freedom spent.

</div>

<div class="alert alert-block alert-success">

### 6.0 - Approximate Time:  10-15 mins

</div>

## 6.0 - Alternate Advanced: How about a ratio?

**6.1:** Of course the size of the neighborhood probably has a lot to do with the tree coverage. Is there a way to create an outcome variable that already incorporates this information? 

In [None]:
# Sample Response
# Create a tree coverage ratio

holc$tree_ratio <- holc$tree_cov / holc$count_squares

# Optional: teach students the `mutate()` function
#holc <- holc %>%
#    mutate(tree_ratio = tree_cov / count_squares)

**6.2:** Would a model like that result in a PRE for red-lining that is more like the multivariate model (around 4%) or more like the single-predictor model (around 1%)? Make a prediction -- then find out!

In [None]:
# Sample Response
    
ratio_red_model <- lm(tree_ratio ~ red_lined, data = holc)
ratio_red_model

gf_jitter(tree_ratio ~ red_lined, data = holc, color = ~red_lined) %>%
    gf_model(ratio_red_model, color = "red")

supernova(ratio_red_model)

**6.3:** Why don't we get exactly the same PRE as prior models?

<div class="alert alert-block alert-warning">

**Sample Response:**

Dividing tree coverage by count squares links the tree coverage to the specific area of the neighborhood. In contrast, putting count squares in the multivariate model takes the best fitting relationship between tree coverage and count squares for the whole data set and uses that best-fitting relationship for every neighborhood.

**Note to Instructors:**

There are reasons to use either the ratio approach or the multivariate approach. The multivariate approach can give you information such as the general relationship between tree coverage and the size of the neighborhood, and how much variation the size of the neighborhood explains. The ratio approach might be good if you want to actually predict the ratio as the outcome.

</div>
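One way to see why the two approaches give different PREs is to write both models with `tree_cov` as the outcome. This is only a sketch: the symbols $a_0$ and $a_1$ are assumed names for the ratio model's estimates, used here to keep them separate from the multivariate model's $b$ estimates.

The ratio model is:

> $\frac{tree\_cov_i}{count\_squares_i} = a_0 + a_1red\_lined_i + e_i$

Multiplying both sides by $count\_squares_i$ gives:

> $tree\_cov_i = (a_0 + a_1red\_lined_i)count\_squares_i + e_icount\_squares_i$

The multivariate model is:

> $tree\_cov_i = b_0 + b_1count\_squares_i + b_2red\_lined_i + e_i$

Written this way, the ratio model implies that tree coverage is proportional to size (no intercept), with a different slope for each group and errors that grow with neighborhood size. The multivariate model instead uses one shared slope for size and lets the groups differ by a constant amount, $b_2$. Because the two models make different assumptions about how size relates to tree coverage, there is no reason to expect the PRE for red-lining to match exactly.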