# CP201A Lecture: Testing for Statistical Significance
Fall 2025

Today, we're going to answer the question: Which California county saw the greatest increase in median rent between 2019 and 2023?

In [1]:
# Import the libraries and modules we need
import pandas as pd
from census import Census
import numpy as np

In [2]:
# Initialize the Census Data API connection with your API key
api_key = 'ab895c1f94c45324d4cfa4f724f1aec7f1a274a4'
c = Census(key=api_key)

# Define the dict of variables to pull and rename
variables_of_interest = {
    'NAME': 'NAME',
    'GEO_ID': 'GEO_ID',
    'B25064_001E': 'med_rent',
    'B25064_001M': 'med_rent_moe',
}

# Pull 2019
df_2019 = pd.DataFrame(
    c.acs1.get(
        list(variables_of_interest.keys()),
        {'for': 'county:*', 'in':'state:06'},
        year=2019
    )
).rename(columns=variables_of_interest)

# Pull 2023
df_2023 = pd.DataFrame(
    c.acs1.get(
        list(variables_of_interest.keys()),
        {'for': 'county:*', 'in':'state:06'},
        year=2023
    )
).rename(columns=variables_of_interest)

In [3]:
df_2019 = df_2019.rename(
    columns={
        "med_rent": "med_rent_2019",
        "med_rent_moe": "med_rent_2019_moe"
    }
)

df_2023 = df_2023.rename(
    columns={
        "med_rent": "med_rent_2023",
        "med_rent_moe": "med_rent_2023_moe"
    }
)

# Merge and keep NAME from 2023
df_merged = pd.merge(
    df_2019[["state", "county", "med_rent_2019", "med_rent_2019_moe"]],
    df_2023[["state", "county", "NAME", "med_rent_2023", "med_rent_2023_moe"]],
    on=["state", "county"],
    how="inner"
)

df = df_merged[[
    "NAME", 
    "med_rent_2019", 
    "med_rent_2019_moe", 
    "med_rent_2023", 
    "med_rent_2023_moe"
]].copy()
df

Unnamed: 0,NAME,med_rent_2019,med_rent_2019_moe,med_rent_2023,med_rent_2023_moe
0,"Lake County, California",992.0,202.0,1544.0,173.0
1,"Yuba County, California",1012.0,80.0,1304.0,91.0
2,"Sonoma County, California",1757.0,57.0,2061.0,75.0
3,"Imperial County, California",810.0,75.0,1022.0,69.0
4,"Alameda County, California",1982.0,41.0,2303.0,37.0
5,"Napa County, California",1835.0,124.0,2239.0,183.0
6,"Yolo County, California",1489.0,74.0,1900.0,83.0
7,"Nevada County, California",1407.0,130.0,1782.0,129.0
8,"Mendocino County, California",1120.0,103.0,1439.0,110.0
9,"Los Angeles County, California",1577.0,9.0,1896.0,11.0


In [4]:
# filter rows where the nominal (unadjusted) 2019 median rent exceeds the 2023 median rent
mask = df['med_rent_2019'] > df['med_rent_2023']

# print the NAME column for those rows
print(df.loc[mask, 'NAME'])

Series([], Name: NAME, dtype: object)


In [5]:
# Inflation factor (2019 -> 2023 dollars)
inflation_factor = 305.109 / 255.657  # ≈ 1.193

# Adjust 2019 median rent
df["med_rent_2019_adj"] = df["med_rent_2019"] * inflation_factor
df["med_rent_2019_adj_moe"] = df["med_rent_2019_moe"] * inflation_factor
df[["NAME", "med_rent_2019", "med_rent_2019_adj", "med_rent_2023"]]

Unnamed: 0,NAME,med_rent_2019,med_rent_2019_adj,med_rent_2023
0,"Lake County, California",992.0,1183.883594,1544.0
1,"Yuba County, California",1012.0,1207.752215,1304.0
2,"Sonoma County, California",1757.0,2096.858341,2061.0
3,"Imperial County, California",810.0,966.679144,1022.0
4,"Alameda County, California",1982.0,2365.380326,2303.0
5,"Napa County, California",1835.0,2189.945963,2239.0
6,"Yolo County, California",1489.0,1777.018822,1900.0
7,"Nevada County, California",1407.0,1679.157477,1782.0
8,"Mendocino County, California",1120.0,1336.642767,1439.0
9,"Los Angeles County, California",1577.0,1882.040754,1896.0


In [6]:
# filter rows where med_rent_2019_adj > med_rent_2023
mask = df['med_rent_2019_adj'] > df['med_rent_2023']

# print the NAME column for those rows
print(df.loc[mask, 'NAME'])

2              Sonoma County, California
4             Alameda County, California
10        Santa Clara County, California
11             Merced County, California
12       Contra Costa County, California
15             Tehama County, California
24         Santa Cruz County, California
33    San Luis Obispo County, California
39          San Mateo County, California
Name: NAME, dtype: object


------------------------
# 2. Testing for statistically significant differences

## 2.1 Calculating standard errors

First we need to convert the 90% confidence level margins of error that come with the ACS data into standard errors. The formula to do so is $SE = \frac{MOE_{ACS}}{1.645},$ where $MOE_{ACS}$ is the 90% margin of error provided for the ACS estimate.

Let's calculate the standard errors for our two median rent estimates: the inflation-adjusted 2019 median rent and the 2023 median rent. Try implementing this formula for each estimate's standard error now.

In [8]:
# Create new standard error columns from the 90% margin of error columns
df['med_rent_2019_adj_se'] = df['med_rent_2019_adj_moe'] / 1.645
df['med_rent_2023_se'] = df['med_rent_2023_moe'] / 1.645
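
As a quick check on the conversion, here is a minimal sketch that wraps the formula in a small helper and applies it to two of the 2023 margins of error shown earlier. The names `moe_to_se`, `Z_90`, and `sample` are hypothetical and used only for illustration; they are not part of the lecture's code.

```python
import pandas as pd

Z_90 = 1.645  # z-score for the 90% confidence level behind ACS margins of error

def moe_to_se(moe, z=Z_90):
    """Convert a 90% ACS margin of error (scalar or Series) to a standard error."""
    return moe / z

# Two rows copied from the table above (hypothetical stand-in for the full df)
sample = pd.DataFrame({
    "NAME": ["Lake County, California", "Los Angeles County, California"],
    "med_rent_2023_moe": [173.0, 11.0],
})
sample["med_rent_2023_se"] = moe_to_se(sample["med_rent_2023_moe"])
print(sample)
```

A larger margin of error (Lake County) translates directly into a larger standard error, which is why estimates for small counties are harder to distinguish from one another.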

## 2.2 Implementing the two-sample z-test

Let's review the formula for testing whether two sample estimates are statistically significantly different from each other:

$$\left|\frac{\hat{X}_1 - \hat{X}_2}{\sqrt{SE_1^2 + SE_2^2}}\right| > Z_{CL},$$
where:
* $\hat{X}_1$ and $\hat{X}_2$ are the estimates we're comparing (the hat over the $X$ just means that the value is an estimate)
* $SE_1$ and $SE_2$ are the corresponding *standard error* values, and
* $Z_{CL}$ is the z-score associated with a given *confidence level* (1.645 for 90%, 1.96 for 95%, 2.576 for 99%).
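
As a worked check, here is the formula with Lake County's values plugged in, rounded to two decimals: a 2023 median rent of 1544 with $SE \approx 105.17$, and an inflation-adjusted 2019 median rent of about 1183.88 with $SE \approx 146.55$. Treat it as an illustration of the arithmetic rather than a new result.

$$\left|\frac{1544 - 1183.88}{\sqrt{105.17^2 + 146.55^2}}\right| \approx \frac{360.12}{180.38} \approx 1.996 > 1.645$$

The test statistic clears the 90% threshold (1.645) and, narrowly, the 95% threshold (1.96), but not the 99% threshold (2.576).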

We have all our "ingredients": the median rent estimates for each county (with 2019 adjusted for inflation), as well as the associated standard errors. Now we just need to implement this formula. It looks complicated, but we already know addition `+`, subtraction `-`, division `/`, and exponentiation `**` in Python. All we really need to complete the picture is how to take the *absolute value* of a number.

The absolute value of a real number $x$ is the non-negative value of $x$, without regard to its sign. In math formulas, $|x|$ denotes an absolute value. In Python, the function `abs(x)` returns the value of `x` if `x` is non-negative, or `-x` if `x` is negative. So `abs(4)` is 4, and `abs(-10)` is 10.
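
The same built-in also works element-wise on a pandas Series, which is what lets us apply it to a whole column at once. The short sketch below uses three 2023 and inflation-adjusted 2019 rent values copied from the tables above; the variable names are hypothetical.

```python
import pandas as pd

# abs() on plain numbers
print(abs(4))    # 4
print(abs(-10))  # 10

# abs() on a pandas Series is applied to every element
rent_2023 = pd.Series([1544.0, 2061.0, 2303.0])
rent_2019_adj = pd.Series([1183.883594, 2096.858341, 2365.380326])

diff = rent_2023 - rent_2019_adj  # some differences are negative
abs_diff = abs(diff)              # every value is now non-negative
print(abs_diff)
```

Note that taking the absolute value discards the direction of the change, so a separate column (such as the signed difference) is needed to tell increases from decreases.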


In [9]:
# Now try to recreate the above formula:
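# Difference between the 2023 estimate and the inflation-adjusted 2019 estimate
diff = df['med_rent_2023'] - df['med_rent_2019_adj']
# Standard error of the difference: square root of the sum of squared SEs
se_diff = (df['med_rent_2023_se']**2 + df['med_rent_2019_adj_se']**2)**0.5
df['Z_score'] = abs(diff / se_diff)
# Change in real (2023-dollar) terms, expressed as a fraction (0.10 = 10%)
df['pct_increase'] = diff / df['med_rent_2019_adj']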
df

Unnamed: 0,NAME,med_rent_2019,med_rent_2019_moe,med_rent_2023,med_rent_2023_moe,med_rent_2019_adj,med_rent_2019_adj_moe,med_rent_2019_adj_se,med_rent_2023_se,Z_score,pct_increase
0,"Lake County, California",992.0,202.0,1544.0,173.0,1183.883594,241.073071,146.548979,105.167173,1.996438,0.304182
1,"Yuba County, California",1012.0,80.0,1304.0,91.0,1207.752215,95.474483,58.0392,55.319149,1.200403,0.079692
2,"Sonoma County, California",1757.0,57.0,2061.0,75.0,2096.858341,68.025569,41.35293,45.592705,0.582561,-0.017101
3,"Imperial County, California",810.0,75.0,1022.0,69.0,966.679144,89.507328,54.41175,41.945289,0.805222,0.057228
4,"Alameda County, California",1982.0,41.0,2303.0,37.0,2365.380326,48.930673,29.74509,22.492401,1.672762,-0.026372
5,"Napa County, California",1835.0,124.0,2239.0,183.0,2189.945963,147.985449,89.960759,111.246201,0.34287,0.0224
6,"Yolo County, California",1489.0,74.0,1900.0,83.0,1777.018822,88.313897,53.68626,50.455927,1.669237,0.069206
7,"Nevada County, California",1407.0,130.0,1782.0,129.0,1679.157477,155.146036,94.313699,78.419453,0.838458,0.061247
8,"Mendocino County, California",1120.0,103.0,1439.0,110.0,1336.642767,122.923397,74.72547,66.869301,1.020749,0.076578
9,"Los Angeles County, California",1577.0,9.0,1896.0,11.0,1882.040754,10.740879,6.52941,6.68693,1.493602,0.007417


In [10]:
df_sorted = df.sort_values(by="pct_increase", ascending=False)
df_sorted[["NAME", "pct_increase", "Z_score"]]

Unnamed: 0,NAME,pct_increase,Z_score
0,"Lake County, California",0.304182,1.996438
23,"Madera County, California",0.203899,2.578913
26,"San Bernardino County, California",0.143243,7.891744
34,"Fresno County, California",0.142618,5.895835
35,"Kern County, California",0.135475,5.305379
17,"El Dorado County, California",0.128117,1.47675
31,"Butte County, California",0.105407,1.900667
40,"San Joaquin County, California",0.100602,3.381804
20,"San Diego County, California",0.094349,12.095881
13,"Stanislaus County, California",0.085689,2.904781
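
One possible way to connect this ranking to the significance test is to keep only the counties whose `Z_score` exceeds 1.645 before ranking. The sketch below rebuilds the ten rows shown above as a small stand-in DataFrame so that it runs on its own. The names `top10`, `Z_90`, and `significant` are hypothetical, and the 90% threshold is an assumed choice rather than a requirement.

```python
import pandas as pd

# The ten rows from the sorted output above (stand-in for the full df)
top10 = pd.DataFrame({
    "NAME": [
        "Lake County, California", "Madera County, California",
        "San Bernardino County, California", "Fresno County, California",
        "Kern County, California", "El Dorado County, California",
        "Butte County, California", "San Joaquin County, California",
        "San Diego County, California", "Stanislaus County, California",
    ],
    "pct_increase": [0.304182, 0.203899, 0.143243, 0.142618, 0.135475,
                     0.128117, 0.105407, 0.100602, 0.094349, 0.085689],
    "Z_score": [1.996438, 2.578913, 7.891744, 5.895835, 5.305379,
                1.47675, 1.900667, 3.381804, 12.095881, 2.904781],
})

Z_90 = 1.645  # critical value for the 90% confidence level

# Keep only changes that are statistically significant, then rank them
significant = (
    top10[top10["Z_score"] > Z_90]
    .sort_values(by="pct_increase", ascending=False)
)
print(significant[["NAME", "pct_increase", "Z_score"]])
```

With these rows, El Dorado County drops out of the ranking because its test statistic falls below 1.645, even though its estimated increase is larger than several counties that remain.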


In [None]:
# filter rows where the change is NOT statistically significant at the 90% confidence level
mask = df['Z_score'] < 1.645

# print the NAME column for those rows
print(df.loc[mask, 'NAME'])

**What does this z-value mean for our analysis?** Can we say the estimates are *statistically significantly different*? If so, at what confidence level?
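
To explore the "at what confidence level" part of the question, the sketch below labels each test statistic with the highest of the three confidence levels it clears, using the critical values listed in section 2.2. The helper `highest_confidence_level` and the `sample` DataFrame are hypothetical names; the four `Z_score` values are copied from the sorted output above.

```python
import pandas as pd

# Critical z-values from section 2.2, in ascending order
Z_CRITICAL = {"90%": 1.645, "95%": 1.96, "99%": 2.576}

def highest_confidence_level(z_score):
    """Return the highest listed confidence level whose critical value |z| exceeds."""
    passed = [level for level, z_crit in Z_CRITICAL.items() if abs(z_score) > z_crit]
    # The dict is ordered from lowest to highest level, so the last match is the strictest
    return passed[-1] if passed else "not significant"

sample = pd.DataFrame({
    "NAME": [
        "Lake County, California", "Madera County, California",
        "San Bernardino County, California", "El Dorado County, California",
    ],
    "Z_score": [1.996438, 2.578913, 7.891744, 1.47675],
})
sample["significant_at"] = sample["Z_score"].apply(highest_confidence_level)
print(sample)
```

A label of "not significant" here means only that the difference cannot be distinguished from sampling error at the 90% level. It does not show that rents were unchanged.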
