# CP201A Lab: Testing for Statistical Significance
Fall 2025

# Learning Objectives
* Learn how to use .loc to select subsets of a dataframe
* Learn how to test ACS-provided estimates for statistical significance
* Optional: Learn how to make graphs with MOE bars

---------------------------------
# 0. Before we begin

In [None]:
# Import the libraries and modules we need
import pandas as pd
from census import Census
import numpy as np

## 0.1 Using `.loc` to select data

`.loc` (aka location) is a Pandas feature that allows us to select very specific subsets of a dataframe. Remember how we used indexes to slice up lists in our first Python lab? `.loc` operates very similarly! It may seem a bit abstract right now, but we'll be using this feature later on in this notebook.

Here is the basic format: `dataframe.loc[row_index, column_name]`. Depending on what we want to select, we do not always have to specify both the row index and the column name.

Let's dive right in and explore its functionality!

In [None]:
# Here's some sample data we'll use for this tutorial
flowers = pd.DataFrame({'Type': ['Orchid', 'Rose', 'Carnation', 'Daffodil'],
                        'Count': [1, 15, 4, 21], 
                        'Color': ['Red', 'White', 'Orange', 'Yellow']})
flowers

#### **A. `.loc` can select very particular slices of a dataframe**

<img src="excel_example_a.png" width="300">


If we think of our dataframe as a table in Excel, then we are just selecting cells A1 through B5. 

The following code is telling Python to extract every row (`:` means every row) for the columns `'Type'` and `'Color'`. 

In [None]:
# Using .loc, we can extract all the rows (:) in a particular column(s) ('Type', 'Color')
flowers.loc[:, ['Type', 'Color']]


In [None]:
# Of course, we can do this in an even simpler way
flowers[['Type', 'Color']]

If we instead want only the first two rows, or cells A1 through B3 in the Excel table, then we change `:` to `0:1` which indicates that we want rows with the index 0 and 1. 

Notice that unlike list slicing, the **second index number is inclusive.** Also note that **column headers are not given a row index** because Pandas knows to treat that "row" of data as a header. 

In [None]:
# If we only wanted to select the first two rows, we would use 0:1 (note that the index is inclusive)
flowers.loc[0:1, ['Type', 'Color']]
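
To see the inclusive endpoint in action, here is a small sketch that applies the same `0:1` slice to a plain list, to `.loc`, and to `.iloc` (Pandas' position-based counterpart, which this lab does not otherwise use). It rebuilds the sample data under the hypothetical name `demo` so that it does not disturb `flowers`.

```python
import pandas as pd

# Rebuild the sample data under a separate (hypothetical) name
demo = pd.DataFrame({'Type': ['Orchid', 'Rose', 'Carnation', 'Daffodil'],
                     'Count': [1, 15, 4, 21],
                     'Color': ['Red', 'White', 'Orange', 'Yellow']})

# List slicing excludes the end position: only the item at position 0
list_slice = list(demo['Type'])[0:1]

# .loc slices by index label and includes the end label: rows 0 and 1
loc_slice = demo.loc[0:1, 'Type']

# .iloc slices by position and, like a list, excludes the end: row 0 only
iloc_slice = demo.iloc[0:1]

print(len(list_slice), len(loc_slice), len(iloc_slice))
```

The three lengths differ only because `.loc` treats `1` as a label to include rather than a position to stop before.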

#### **B. `.loc` can select rows based on their index** 

<img src="excel_example_b.png" width="300">

We don't have to specify a column name either. If we give `.loc` only an index number, then it will give us all columns for the corresponding row. 

In [None]:
# .loc also allows us to select specific row(s) based on their index
flowers.loc[0]

But index numbers are not very useful unless we know the exact index of the row we want to extract. To get around this, we can **change the index of our dataframe to something meaningful.** From there, we can use the values of the new index to subset our dataframe.

In [None]:
flowers = flowers.set_index('Type')
flowers

# See that the index of the table (the bold column on the far left) is now the type of flower

In [None]:
# And now we can use .loc to select by type.
flowers.loc['Orchid']

Previously, we could have used `flowers.loc[0, 'Count']` to extract this specific value. But being able to use `Orchid` as our index is much more intuitive. This feature can be incredibly useful because we may want to pull out a very specific value and then store it in a variable to use later on.

In [None]:
# Here we extract the count for the orchid row.
flowers.loc['Orchid', 'Count']

#### **C. `.loc` can select rows based on one or more conditions** 

<img src="excel_example_c.png" width="400">

Finally, we can use `.loc` to filter data using one or more conditions, just like we can use the filter function in Excel to select specific column values.

In [None]:
# This line of code asks Python to give us only the rows where the value in the 'Count' column is greater than 10
flowers.loc[flowers['Count']>10]

We can also specify multiple conditions. Note that a **double equal sign (`==`) checks for equivalence**. In other words, we are checking that the value in the `'Color'` column is equal to `'Yellow'`. Also note that you must put parentheses around each condition.

In [None]:
# & means AND, meaning both conditions must be satisfied
flowers.loc[(flowers['Count']>10) & (flowers['Color']=='Yellow')]

In [None]:
# | means OR,  meaning either condition can be satisfied
flowers.loc[(flowers['Count']>10) | (flowers['Color']=='Yellow')]
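
It may help to look at what a condition produces by itself. The sketch below (using a hypothetical copy of the sample data named `demo`) shows that a condition is just a column of `True`/`False` values, one per row, and that `.loc` keeps the rows marked `True`. The variable names `is_large` and `is_yellow` are made up for illustration.

```python
import pandas as pd

# Hypothetical copy of the sample data, indexed by flower type
demo = pd.DataFrame({'Type': ['Orchid', 'Rose', 'Carnation', 'Daffodil'],
                     'Count': [1, 15, 4, 21],
                     'Color': ['Red', 'White', 'Orange', 'Yellow']}).set_index('Type')

# A condition evaluates to a boolean Series: one True/False per row
is_large = demo['Count'] > 10
is_yellow = demo['Color'] == 'Yellow'
print(is_large)

# Conditions stored in variables can be combined before being passed to .loc
both = demo.loc[is_large & is_yellow]
either = demo.loc[is_large | is_yellow]
print(both)
print(either)
```

Storing each condition in a named variable is optional, but it can make a long filter easier to read than one line full of parentheses.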

In [None]:
# Exercise: Use .loc to extract the color corresponding to carnation



In [None]:
# Exercise: Use .loc to find the rows where the count of flowers is less than 5 or more than 20



-------------------------
# 1. Data

**There are 2 options for the data we'll use in today's lab:** 
* **Option A**: If you already have a cleaned table with aggregated, neighborhood-level estimates, then go ahead and use that data. Follow the instructions in the Option A section. 
* **Option B**: If you do not yet have a cleaned table with aggregated, neighborhood-level estimates, you can work with the dataset we've prepared for you.

-------------------------
### **Option A**

If you haven't already, go to the .ipynb file where you cleaned your data and export each dataframe as a csv file. Here is some sample code of how to do this: 

`df_out.to_csv('file_name.csv', index=False)`

Then import that csv into this file using the following code block: 

`new_df_name = pd.read_csv('file_name.csv')`

**Then skip down to section 2.**

-------------------------
### **Option B**
To switch it up from last week's labs, we'll use the following data for comparisons today:

* Table: [B25070](https://api.census.gov/data/2022/acs/acs1/groups/B25070.html)
* Table Name: Gross Rent as a Percentage of Household Income in the Past 12 Months
* Geography: The neighborhood of Fruitvale and the city of Oakland
* Universe: Renter-occupied housing units
* Dataset: [2023 ACS 1-year estimates](https://api.census.gov/data/2023/acs/acs1/examples.html)

We downloaded the following variables, and renamed them as shown below:

|     Old Name                                           |     New Name         |
|--------------------------------------------------------|----------------------|
|     Estimate!!Total:                                   |     total            |
|     Margin of Error!!Total:                            |     total_moe        |
|     Estimate!!Total:!!30.0 to 34.9   percent           |     rb_30to34        |
|     Margin of Error!!Total:!!30.0 to   34.9 percent    |     rb_30to34_moe    |
|     Estimate!!Total:!!35.0 to 39.9   percent           |     rb_35to39        |
|     Margin of Error!!Total:!!35.0 to   39.9 percent    |     rb_35to39_moe    |
|     Estimate!!Total:!!40.0 to 49.9   percent           |     rb_40to49        |
|     Margin of Error!!Total:!!40.0 to   49.9 percent    |     rb_40to49_moe    |
|     Estimate!!Total:!!50.0 percent or   more           |     rb_mt50          |
|     Margin of Error!!Total:!!50.0   percent or more    |     rb_mt50_moe      |

We have done all the steps from last week's lab:
- We created a new variable `rent_burdened` by summing the four estimates for households spending 30 percent or more of their income on rent, and a second variable `severely_rent_burdened` for households spending 50 percent or more (note that these are **NOT** mutually exclusive: severely rent-burdened households are also counted as rent burdened)
- We aggregated the MOEs associated with those estimates, using the square root sum of squares formula
- We calculated the percent of rent burdened households
- We calculated the derived proportion MOE
- And we put both the neighborhood and city results into one .csv
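
As a refresher, the steps above can be sketched with plain numbers. Every value below is made up for illustration (these are NOT real ACS estimates). The sketch also assumes last week's lab used the Census Bureau's standard approximation for the MOE of a derived proportion, including its fallback to the ratio formula when the value under the square root is negative.

```python
import math

# Hypothetical estimates and 90% MOEs for one geography (NOT real ACS values)
estimates = {'rb_30to34': 400, 'rb_35to39': 300, 'rb_40to49': 350, 'rb_mt50': 950}
moes = {'rb_30to34': 120, 'rb_35to39': 100, 'rb_40to49': 110, 'rb_mt50': 180}
total, total_moe = 4000, 300

# Step 1: combine the four estimates
rent_burdened = sum(estimates.values())

# Step 2: aggregate the MOEs with the square root of the sum of squares
rent_burdened_moe = math.sqrt(sum(m ** 2 for m in moes.values()))

# Step 3: percent of renter households that are rent burdened
pct = rent_burdened / total

# Step 4: MOE of the derived proportion
under_root = rent_burdened_moe ** 2 - pct ** 2 * total_moe ** 2
if under_root >= 0:
    pct_moe = math.sqrt(under_root) / total
else:
    # Fallback (ratio formula): add instead of subtract under the square root
    pct_moe = math.sqrt(rent_burdened_moe ** 2 + pct ** 2 * total_moe ** 2) / total

print(f'{pct:.1%} +/- {pct_moe:.1%}')
```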

In [None]:
# I'm going to limit the display to three decimal places - it will make things easier to read
pd.options.display.float_format = "{:.3f}".format

In [None]:
#Let's bring in the .csv and look at it
df = pd.read_csv('rentburden_2023.csv')
df

------------------------
# 2. Testing for statistically significant differences

Depending on which data option you selected, follow the built-in exercises in this section based on the following: 
* **Option A**: Use your own data to test whether there are statistically significant differences between two categories in your data.
* **Option B**: Use the data we loaded in and cleaned above to test whether there are statistically significant differences between renter cost burdens in Fruitvale and Oakland.
    * For the exercises, pick something different in our cleaned dataset to compare. For example, you could try testing whether the percentage of severely rent burdened households is different.

## 2.1 Calculating standard errors

First we need to convert the 90% confidence level margins of error that come with the ACS data into standard errors. The formula to do so is $SE = \frac{MOE_{ACS}}{1.645},$ where $MOE_{ACS}$ is the 90% margin of error provided for the ACS estimate.

Let's calculate the standard error for the estimates of the percent of renters who are cost burdened. Try implementing this formula for this estimate's standard error now.

In [None]:
# Exercise: Create a new 'pct_rent_burdened_se' column based on 'pct_rent_burdened_moe'



We have quite a few margins of error in our DataFrame, and it would be nice to handle them all at once, rather than have to repeat ourselves several times. Because we've been diligent in our variable naming conventions, all our MOEs' column names end in `_moe`. Can we exploit this fact to efficiently convert them all to standard errors?

...

Yes. Of course we can.

In [None]:
# Iterate through all column names
for col in df.columns:
    # Check whether each column name ends with '_moe', using a built-in string method
    # `if '_moe' in col:` is another possibility, but what if we had a column named `pct_moebius` or something?
    if col.endswith('_moe'):
        # Replace '_moe' with '_se' but only at the end of the name, again using string subsetting
        # col[:-4] selects all but the last four characters in col
        df[col[:-4] + '_se'] = df[col] / 1.645

df
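
To see why the loop uses `.endswith('_moe')` rather than `'_moe' in col`, here is a small sketch on a made-up one-row table (the name `toy` and its values are hypothetical) that contains the decoy column mentioned in the comment above.

```python
import pandas as pd

# Hypothetical table: one real MOE column and one decoy column
toy = pd.DataFrame({'pct': [0.50], 'pct_moe': [0.05], 'pct_moebius': [7]})

# 'in' matches '_moe' anywhere in the name, so the decoy is caught too
matched_by_in = [col for col in toy.columns if '_moe' in col]

# endswith only matches names that finish with '_moe'
matched_by_endswith = [col for col in toy.columns if col.endswith('_moe')]

print(matched_by_in)
print(matched_by_endswith)
```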

In [None]:
# Exercise for Option A: Try it out with your own data!



## 2.2 Implementing the two-sample z-test

Let's review the formula for testing whether two sample estimates are statistically significantly different from each other:

$$\left|\frac{\hat{X}_1 - \hat{X}_2}{\sqrt{SE_1^2 + SE_2^2}}\right| > Z_{CL},$$
where:
* $\hat{X}_1$ and $\hat{X}_2$ are the estimates we're comparing (the hat over the $X$ just means that the value is an estimate)
* $SE_1$ and $SE_2$ are the corresponding *standard error* values, and
* $Z_{CL}$ is the z-score associated with a given *confidence level* (1.645 for 90%, 1.96 for 95%, 2.576 for 99%).

We have all our “ingredients” – we have the percent of renters who are cost burdened for each geography, as well as the associated standard error. Now we just need to implement this formula. It looks complicated, but we already know addition `+`, subtraction `-`, division `/`, and exponentiation `**` in Python. All we really need to complete the picture is how to take the *absolute value* of a number.

The absolute value of a real number $x$ is the non-negative value of $x$, without regard to its sign. In math formulas, $|x|$ denotes an absolute value. In Python, the function `abs(x)` returns the value of `x` if `x` is non-negative, or `-x` if `x` is negative. So `abs(4)` is 4, and `abs(-10)` is 10.

Try plugging in the numbers for Fruitvale and the City of Oakland into the formula for testing significant differences:

In [None]:
# First, we need to set a meaningful index because we want to use the index to select specific rows in our dataframe. 
# Recall how we set the index to the flower 'Type' in the .loc example above? We're doing the same thing here.
df = df.set_index('NAME')
df

In [None]:
# For simplicity, I'll start by assigning the relevant cells from our DataFrame to variable names matching the formula above

# To do this, we'll use .loc! And since we've set the index as the geography name, we can pass 'Fruitvale' to .loc.
x1 = df.loc['Fruitvale', 'pct_rent_burdened']
x2 = df.loc['Oakland city, California', 'pct_rent_burdened']
se1 = df.loc['Fruitvale', 'pct_rent_burdened_se']
se2 = df.loc['Oakland city, California', 'pct_rent_burdened_se']

# Print x1 and x2, nicely formatted as percentages, to see how different they *appear* to be
print(f'{x1:.2%}, {x2:.2%}')

In [None]:
# Exercise for Option A and B: Calculate the z-score using the variables we just created above (i.e. x1, x2, etc.)

zscore = 'put your equation here'
print('Our z-score is:', zscore)

**What does this z-value mean for our analysis?** Can we say the estimates are *statistically significantly different*? If so, at what confidence level?

Remember: if the absolute value of our z-score is greater than our critical value (in other words, if it falls in the blue tail region of the curve below), we can **reject our null hypothesis**, meaning there **is** a statistically significant difference between our two estimates. Otherwise, we **fail to reject our null hypothesis**, meaning the difference **is not** statistically significant.
<img src="z_test_example.png" width="500">
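
One way to make this comparison concrete is to check a z-score against all three critical values at once. The helper below is a hypothetical sketch (the function name `significant_levels` and the example z-scores are made up, not results from the lab data).

```python
# Critical values for the confidence levels listed in section 2.2
CRITICAL_VALUES = {'90%': 1.645, '95%': 1.96, '99%': 2.576}

def significant_levels(z):
    '''Return the confidence levels at which |z| exceeds the critical value.'''
    return [level for level, z_cl in CRITICAL_VALUES.items() if abs(z) > z_cl]

# Hypothetical z-scores, chosen only to show each possible outcome
for z_example in [1.0, 2.1, -3.0]:
    levels = significant_levels(z_example)
    if levels:
        print(z_example, 'is significant at:', ', '.join(levels))
    else:
        print(z_example, 'is not significant at any of these levels')
```

An empty list means we fail to reject the null hypothesis even at the 90% confidence level.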

In [None]:
# Exercise for Option A: Try it out with your own data!
# Exercise for Option B: Test whether the difference between the percent of households that are severely rent burdened is statistically significant


# Make sure to interpret your z-value. What does it mean for your analysis?

It would be nice if we could quickly check a given variable and pair of geographies for statistically significant differences. Try writing a function that takes in a DataFrame, a variable to check, and two jurisdictions to compare:

In [None]:
def z_statistic(df, col, place_1, place_2):
    '''Write a brief docstring - future you will thank past you.

    Inputs:
    - df (pd.DataFrame): the table of summary statistics and standard errors.
      Columns must contain col and col + '_se'. Index must contain place_1
      and place_2
    - col (string): the column name to be compared across jurisdictions.
    - place_1, place_2 (string): the jurisdictions whose values to compare.

    Output:
    The two-sample z-value (float) of the difference between the values of col
    for place_1 and place_2.
    '''
    # Assign the relevant cells from df to variable names matching the formula
    x1 = df.loc[place_1, col]
    x2 = df.loc[place_2, col]
    se1 = df.loc[place_1, col + '_se']
    se2 = df.loc[place_2, col + '_se']

    # Return the z-value
    return abs((x1 - x2) / (se1**2 + se2**2)**0.5)

# Try out your function on rent burdened households in Fruitvale and Oakland
z_statistic(df, 'pct_rent_burdened', 'Fruitvale', 'Oakland city, California')
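
A quick way to gain confidence in a function like this is to run it on numbers whose answer can be worked out by hand. The sketch below repeats the function so that it runs on its own, then applies it to a made-up table (`toy`, `'Place A'`, and `'Place B'` are hypothetical) where the difference is 0.10 and the combined standard error is 0.05, so the z-value should be 2.

```python
import pandas as pd

def z_statistic(df, col, place_1, place_2):
    '''Same calculation as the function above, repeated so this cell is self-contained.'''
    x1 = df.loc[place_1, col]
    x2 = df.loc[place_2, col]
    se1 = df.loc[place_1, col + '_se']
    se2 = df.loc[place_2, col + '_se']
    return abs((x1 - x2) / (se1**2 + se2**2)**0.5)

# Made-up numbers: sqrt(0.03**2 + 0.04**2) = 0.05, and 0.10 / 0.05 = 2
toy = pd.DataFrame(
    {'pct': [0.60, 0.50], 'pct_se': [0.03, 0.04]},
    index=['Place A', 'Place B'],
)

z = z_statistic(toy, 'pct', 'Place A', 'Place B')
# Swapping the two places should not change the result, thanks to abs()
z_swapped = z_statistic(toy, 'pct', 'Place B', 'Place A')
print(round(z, 3), round(z_swapped, 3))
```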

In [None]:
# Exercise for Option A and B: Try this function out with the same test that you ran above. See if you get the same z-value. 



# 3. OPTIONAL: Plotting proportions with margins of error

We are now going to plot the results of the previous exercise. We want to visualize the proportion of rental units that are occupied by cost-burdened households in these two geographies *AND visualize the margins of error.* 

As is often the case in Python, there are quite a few ways to plot data from a pandas DataFrame:
* Pandas' built-in plotting (`df.plot`) is a quick-and-dirty tool that can often produce surprisingly good graphs with a little coaxing
* `matplotlib.pyplot` is more flexible but also more verbose than `df.plot`
* `seaborn` is great at producing complex statistical plots, including from raw data (rather than summary statistics)

Today we are going to focus on `df.plot`, because it's the simplest and it gets the job done! Pandas' plotting uses `matplotlib` under the hood, so if you get into `matplotlib` later, you'll see some similarities to what we're doing today.  

Note: you may need to install matplotlib in your conda environment. Just open a new conda prompt, activate your environment, and type `conda install matplotlib`.

Then, you can create a bar chart using a remarkably simple command:

In [None]:
df.plot.bar(y='pct_rent_burdened')

This is not half bad, for one short line of code! But because we're committed to high standards for data visualization, let's address a few issues:

1. We need to show margins of error to visually indicate how exact our estimates are.
2. The y-axis should be formatted as percentage points, not plain decimals.
3. We don't need the legend, but we do need a chart title.
4. We need a good *caption* for our chart.

Let's tackle these, point by point.

In [None]:
# 1. Show margins of error: df.plot.bar() accepts a `yerr` argument
# that lets us specify which column to use for the width of margins of error
df.plot.bar(y='pct_rent_burdened', yerr='pct_rent_burdened_moe')

In [None]:
# 2. Format the y-axis as percentage points: we need to store the result of
# df.plot.bar() as an Axes object, then adjust some attributes of the object.
# This is the hardest one, but matplotlib has a PercentFormatter that really helps
# - see here for more details https://stackoverflow.com/a/36319915
import matplotlib.ticker as mtick
ax = df.plot.bar(y='pct_rent_burdened', yerr='pct_rent_burdened_moe')
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1.0, 0))
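
The two arguments passed to `PercentFormatter` above are its `xmax` and `decimals` parameters. As a rough sketch of what they control, the formatter's `format_pct` method can be called directly on a single value, with no chart involved. The example values are made up.

```python
import matplotlib.ticker as mtick

# xmax=1.0: a data value of 1.0 corresponds to 100%; decimals=0: no decimal places
fmt = mtick.PercentFormatter(xmax=1.0, decimals=0)

# If a column held values like 52.3 instead of 0.523, xmax=100 would be the match
fmt_100 = mtick.PercentFormatter(xmax=100, decimals=1)

# format_pct(value, display_range) returns the label text for one tick
label = fmt.format_pct(0.25, 1.0)
label_100 = fmt_100.format_pct(52.3, 100)
print(label, label_100)
```

Our percentages are stored as proportions between 0 and 1, which is why the lab code uses `1.0` for `xmax`.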

In [None]:
# 3. Remove legend, add a title: we can set `legend=False` in df.plot.bar();
# adding a title means again modifying the Axes object as we did for #2
ax = df.plot.bar(
    y='pct_rent_burdened',
    yerr='pct_rent_burdened_moe',
    legend=False
)
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1.0, 0))
ax.set_title('Percent of Rental Units Occupied by Cost-Burdened Households')
# You can use similar syntax to set x and y axis labels, if needed
# I don't think we need them for this plot, though
# ax.set_xlabel('Jurisdiction')
# ax.set_ylabel('Percent of Rental Units')

In [None]:
# Bonus visualization option: sometimes a horizontal bar chart (df.plot.barh()) is nice.
# We need to change yerr to xerr and ax.yaxis.set_major_formatter to ax.xaxis.set_major_formatter
# (but we don't change 'y' to 'x', confusing!)
# Also note I have specified additional keyword arguments `figsize`, `width`, and `color`;
# other arguments also exist: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.barh.html
ax = df.plot.barh(
    y='pct_rent_burdened',
    xerr='pct_rent_burdened_moe',
    legend=False,
    figsize=(6, 2),  # inches at 100 dpi
    width=0.8,
    color='firebrick',  # https://matplotlib.org/stable/gallery/color/named_colors.html
)
ax.xaxis.set_major_formatter(mtick.PercentFormatter(1.0, 0))
ax.set_title('Percent of Rental Units Occupied by Cost-Burdened Households')

As for issue #4, good captions for your graphs, this isn't really a Python question, because captions usually sit alongside the graph in whatever presentation/report preparation software (PowerPoint, Word, Canva, Google Slides, etc.) you use downstream of preparing the chart. But make sure your caption clearly indicates the source of your data, its vintage and universe, and definitions of any concepts or choices you made during your analysis. A good caption for this chart might look something like this:

> Source: American Community Survey 2023 1-year estimates, Table B25070. Universe: Renter-occupied housing units. Notes: Rent-burdened households are defined as those spending 30 percent or more of their income on rent.

In [None]:
# Optional Exercise: Try plotting a different set of data (use your own data for Option A, try plotting severely rent burdened for Option B)



# 4. Save your data

You have this nice plot, and this lovely data table, but they're both "stuck" in this Jupyter notebook. Let's set them free!

The easiest way to export a plot from a Jupyter notebook is to right-click the image, click "Copy Output to Clipboard", and paste the figure into your report/presentation or into an image editing program like Paint (Windows) or Preview (macOS). 
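
If you would rather export the chart with code, matplotlib figures also have a `savefig` method. The sketch below is one possible approach, not the only one: the table `toy`, its values, and the file name `rent_burden_chart.png` are all made up for illustration.

```python
from pathlib import Path
import pandas as pd

# Hypothetical stand-in for the lab's table (made-up values)
toy = pd.DataFrame(
    {'pct_rent_burdened': [0.55, 0.50], 'pct_rent_burdened_moe': [0.06, 0.02]},
    index=['Place A', 'Place B'],
)
ax = toy.plot.bar(y='pct_rent_burdened', yerr='pct_rent_burdened_moe', legend=False)

out_path = Path('rent_burden_chart.png')  # hypothetical file name
# get_figure() returns the Figure that contains the Axes;
# bbox_inches='tight' trims the margins so tick labels are not cut off
ax.get_figure().savefig(out_path, dpi=150, bbox_inches='tight')
```

The image is written to the same folder as the notebook unless the path says otherwise.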

As for your data table, just save it as a CSV:

In [None]:
df.to_csv('lab_4_finished.csv')
# Note no `index=False` this time because we set a meaningful index!
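
When you load a CSV that was saved with its index, `pd.read_csv` accepts an `index_col` argument to restore it. The sketch below demonstrates the round trip on a made-up table, writing to an in-memory buffer rather than a real file. With the lab's file, the equivalent call would presumably be `pd.read_csv('lab_4_finished.csv', index_col='NAME')`.

```python
import io
import pandas as pd

# Hypothetical table with a named, meaningful index (made-up values)
toy = pd.DataFrame(
    {'pct': [0.60, 0.50]},
    index=pd.Index(['Place A', 'Place B'], name='NAME'),
)

# Write to an in-memory buffer instead of a file, just for this demonstration
buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as the first column, 'NAME'
buffer.seek(0)

# index_col tells read_csv to turn that column back into the index
restored = pd.read_csv(buffer, index_col='NAME')
print(restored.loc['Place A', 'pct'])
```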