# Lab 7: Statistical Analysis

November 19, 2025

## Learning Objectives
* Develop hypotheses for research questions from survey data
* Select the appropriate statistical test
* Learn how to do t-tests of means and Z-tests of proportions with Python

In [None]:
import pandas as pd
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest
import researchpy as rp

## 0. Make a .csv file of your cleaned dataframe

The first step in our data analysis process was to clean our data and turn the variables we're interested in into dummy variables where needed. 

Open your data cleaning notebook from Lab 6 (or from the P/NP that you submitted for this week) and write one more line of code at the bottom to write out your updated dataframe to a new .csv file:

```python
cleaned_df.to_csv("survey_data_cleaned.csv")
```

## 1. Bringing our cleaned .csv file into Python


Read in the .csv file with the data we cleaned last week

In [None]:
survey_cleaned=pd.read_csv('survey_data_cleaned.csv')
survey_cleaned.head()

In [None]:
#Let's remind ourselves of the columns we're working with
survey_cleaned.columns

To make the next steps of our analysis a little easier, let's make a smaller dataframe with just the columns we want to analyze. \
If you end up wanting to add other columns later on, you can always just come back and add to this list.

In [None]:
#Change this list to the name of the columns you want for your analysis
analysis_columns=['survey_num', 'housing_supply',
                  'tenure',
                  'housing_supply_grouped', 
                  'days_week_clean',
                  'tenure_dv_rent',
                  'housing_supply_dv_agree']

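#Keep only the columns we want to work with
#.copy() makes this a separate dataframe, so adding new columns later won't trigger a SettingWithCopyWarning
survey_df=survey_cleaned[analysis_columns].copy()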

survey_df.head()

##  2.  Testing Bivariate Relationships

As we discussed in class, we are interested in answering the following two questions:

-	Question 1:  Are people who spend more time in the neighborhood more likely to support new housing supply?
-	Question 2:  Do renters support new housing supply more than owners?

**Checkpoint**: Identify the independent and dependent variable for each of these questions

*When you conduct your own analysis, try to structure your questions in a similar way to identify your dependent and independent variables.*


## 2.1 Question 1

For Question 1, we're going to explore whether there are any observable differences in the support for new housing supply (Y) by the average number of days someone spends in the neighborhood (X).

Our hypothesis is that people who spend **more** time in the neighborhood are **more** likely to agree that the neighborhood needs more housing supply.

### 2.1.1. Quick descriptive overview

As always, we'll start by getting a sense of the data we're working with.

Let's look at the distribution of responses to the housing supply question...

In [None]:
survey_df["housing_supply"].value_counts()

... then look at the average number of days spent in the neighborhood for each of those response groups.


`.groupby()` is really handy for this. It lets you group your data by categories in one column, then apply calculations to the value in other columns within each group.

In this case, we want to:
1. Group the data by response to a housing supply question (like "Agree," "Strongly Disagree")
2. Calculate the average number of days people spent in the neighborhood for each response (i.e., within each group)

<img src="groupby_example_1.jpeg" width="600">

In [None]:
survey_df.groupby("housing_supply")['days_week_clean'].mean()
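
To see the two `.groupby()` steps on something small enough to check by hand, here is a minimal sketch on a made-up five-row dataframe. The values in `toy` are invented for illustration and are not from the class survey.

```python
import pandas as pd

# Toy data, invented for illustration only (not from the class survey)
toy = pd.DataFrame({
    "housing_supply": ["Agree", "Agree", "Disagree", "Disagree", "Neutral"],
    "days_week_clean": [6, 4, 2, 3, 5],
})

# Step 1: group rows by response. Step 2: average the days within each group.
toy_means = toy.groupby("housing_supply")["days_week_clean"].mean()
print(toy_means)
```

Each response category becomes one row of the result, holding the average of `days_week_clean` for the respondents in that category.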

This is helpful, but a table of averages alone can't tell us whether the differences are statistically significant. To test that, we'll want to use dummy variables.

### 2.1.2. Get our dummy variables ready

You likely did this already in the previous notebook, but if you want to add new columns with new dummy variables, you can still do that here.

As an example, here we're going to: \
a) drop the Don't Know/NAs, and\
b) create a dummy variable for Agree and Strongly Agree = 1, others = 0

In [None]:
#.map() will let us translate responses into dummy variables based on the categories in the 'housing_supply' column
survey_df['housing_supply_dv']=survey_df['housing_supply'].map(
    {"Strongly Disagree":0, 
     "Disagree":0, 
     "Neutral":0 ,
     "Agree":1, 
     "Strongly Agree":1, 
     "Don't Know/NA":np.nan})

In [None]:
#Let's use .groupby() again on our dummy variable column
survey_df.groupby("housing_supply_dv")['days_week_clean'].mean()
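
One thing to watch with `.map()`: any response that is not a key in the dictionary becomes `NaN` without a warning, so a typo or an unexpected label gets dropped silently. Below is a minimal sketch of one way to check for that. The responses in `toy` are made up, and `mapping` and `unmapped` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy responses, invented for illustration; "agree" (lowercase) is a deliberate mismatch
toy = pd.DataFrame({"housing_supply": ["Agree", "Disagree", "agree", "Don't Know/NA"]})

mapping = {"Strongly Disagree": 0, "Disagree": 0, "Neutral": 0,
           "Agree": 1, "Strongly Agree": 1, "Don't Know/NA": np.nan}
toy["housing_supply_dv"] = toy["housing_supply"].map(mapping)

# Responses that are not keys in the mapping at all (these turn into NaN silently)
unmapped = sorted(set(toy["housing_supply"].dropna()) - set(mapping))
print(unmapped)

# dropna=False includes the NaNs in the tally of the dummy column
print(toy["housing_supply_dv"].value_counts(dropna=False))
```

If `unmapped` is not empty, you probably want to fix the spelling in the data or add the label to the dictionary before running any tests.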

### 2.1.3. Run our statistical test: t-test of means

Here's where our statistical testing starts!

Python has a ton of packages that let you quickly test the relationship between your dependent and independent variables. Let's use `researchpy`.\
The basic structure of this t-test is: 

`rp.ttest(sample1.independent_variable, sample2.independent_variable)`

This will compare the means of the two groups and tell you whether the difference between them is statistically significant.

You can read more about it here:
https://researchpy.readthedocs.io/en/latest/ttest_documentation.html 

In [None]:
#Step 1: Create slices of my dataframe for each group

#Sample 1 – People who did not agree that the neighborhood should increase its housing supply
sample1=survey_df[survey_df['housing_supply_dv']==0]

#Sample 2 – People who did agree that the neighborhood should increase its housing supply
sample2=survey_df[survey_df['housing_supply_dv']==1]

#Step 2: Run my t-test, comparing the averages of the two samples
rp.ttest((sample1.days_week_clean), #compare the mean of our first sample
         (sample2.days_week_clean)) #to the mean of our second sample

Let's take a second to figure out what we're looking at...

<img src="t-test_results.jpeg" width="600">
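
If you want to cross-check the `researchpy` output, `scipy.stats.ttest_ind` also runs an independent-samples t-test. Here is a minimal sketch on made-up numbers: the two arrays are invented, and you would swap in `sample1.days_week_clean` and `sample2.days_week_clean` for the real data. It assumes equal variances in the two groups (`equal_var=True`, scipy's default), so check the researchpy documentation to confirm which variance assumption your `rp.ttest` call uses before comparing the numbers.

```python
import numpy as np
from scipy import stats

# Toy data, invented for illustration: days per week spent in the neighborhood
days_not_agree = np.array([1, 2, 2, 3, 4, np.nan])  # stands in for sample1.days_week_clean
days_agree = np.array([3, 4, 5, 5, 6, 7])           # stands in for sample2.days_week_clean

# Independent-samples t-test; nan_policy="omit" drops missing values before testing
result = stats.ttest_ind(days_not_agree, days_agree,
                         equal_var=True, nan_policy="omit")
t_stat, p_val = float(result.statistic), float(result.pvalue)

print(f"t-statistic: {t_stat:.3f}")
print(f"Two-sided p-value: {p_val:.4f}")
```

A negative t-statistic here just means the first group's mean is lower than the second's. The p-value is what tells you whether that gap is statistically significant.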

## 2.2 Question 2

For Question 2, I want to see whether renters are more likely to support new housing supply. I'm going to use a **Z-test of proportions**.  

### 2.2.1. Quick descriptive overview

In [None]:
survey_df[['housing_supply', 'housing_supply_dv']]

In [None]:
pd.crosstab(survey_df['tenure'], survey_df['housing_supply_dv'], normalize='index')

### 2.2.2 Get our dummy variables ready

Housing supply is still a dummy variable (with Agree and Strongly Agree=1), but I also need to change tenure into a dummy variable.

In [None]:
survey_df['rent_dv']=survey_df['tenure'].map(
    {"Own":0, "Other":0, "Rent":1})

#Let's check and make sure that worked
survey_df[['tenure','rent_dv']]

In [None]:
pd.crosstab(survey_df['rent_dv'], survey_df['housing_supply_dv'], normalize='index')

### 2.2.3 Run our statistical test: Z-test of proportions
https://www.statsmodels.org/dev/generated/statsmodels.stats.proportion.proportions_ztest.html

Let's take a closer look at the Z-test equation: 

<img src="z_test_equation.jpg" width="400">

- *p1*: proportion of Y in sample group 1
- *p2*: proportion of Y in sample group 2
- *p*: pooled proportion of Y in overall group (sample group 1+sample group 2) 
- *n1*: sample size of sample group 1
- *n2*: sample size of sample group 2
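
For reference, here are the same terms written out as the pooled two-proportion Z statistic, which is the form the manual calculation later in this section follows. The symbols $x_1$ and $x_2$ are notation introduced here for the number of respondents with Y = 1 in each group, so that $p_1 = x_1 / n_1$ and $p_2 = x_2 / n_2$.

$$
z = \frac{p_1 - p_2}{\sqrt{p\,(1 - p)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}},
\qquad
p = \frac{x_1 + x_2}{n_1 + n_2}
$$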

Thanks to some handy packages, we don't have to do this manually. Let's use statsmodels' `proportions_ztest` (documentation linked at the top of this section).

We'll be using `.groupby()` to do two things:
1. Using `.count()`, calculate the **sample size** of each group (renter vs. non-renter)\
&nbsp;&nbsp;&nbsp;&nbsp; *`.count()` works here because it counts every non-missing value, giving us the number of respondents in each group who answered the question*

2. Using `.sum()`, calculate the number of respondents in each sample that meet our Y-variable condition (i.e., support more housing).\
&nbsp;&nbsp;&nbsp;&nbsp;*`.sum()` works here because it adds up 1s and 0s, essentially giving us a count of just the 1s*


<img src="groupby_example_2.jpeg" width="600">
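
To make the difference between these aggregations concrete, here is a minimal sketch on a made-up six-row dataframe in which one renter skipped the housing supply question. The data and the names `successes`, `answered`, and `all_rows` are invented for illustration. `.size()` is included only for comparison, since it counts every row, including missing answers.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: one renter has no answer (NaN)
toy = pd.DataFrame({
    "rent_dv":           [1, 1, 1, 0, 0, 0],
    "housing_supply_dv": [1, 0, np.nan, 1, 1, 0],
})

grouped = toy.groupby("rent_dv")["housing_supply_dv"]

successes = grouped.sum()    # adds up the 1s; NaN is skipped
answered = grouped.count()   # counts non-missing answers only
all_rows = grouped.size()    # counts every row, including NaN answers

print(pd.DataFrame({"successes": successes,
                    "answered": answered,
                    "all_rows": all_rows}))
```

For renters (`rent_dv` = 1), `answered` is one smaller than `all_rows` because of the missing answer. That gap is exactly what separates a `.count()` denominator from a row-count denominator.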

In [None]:
#Calculates the number of "successes", where housing_supply_dv=1 (our y that we're explaining)
counts = survey_df.groupby('rent_dv')['housing_supply_dv'].sum() 

# Calculate the sample size for each group of the explanatory variable (rent_dv)
totals = survey_df.groupby('rent_dv')['housing_supply_dv'].count() #.count() counts the non-missing values in each group

# Perform Z-test of proportions
stat, pval = proportions_ztest(counts, totals, alternative='two-sided')

# Output results
print(f"Z-statistic: {stat}")
print(f"P-value: {pval}")

Here's the manual way to do this.

In [None]:
#P1: Share of renters who support increased housing supply
#First, calculate the number of renters who support increased housing supply
renters_supply_count = survey_df[(survey_df['rent_dv'] == 1) & (survey_df['housing_supply_dv'] == 1)].shape[0]
#Then, calculate the total number of people who are renters
rent_total = survey_df[survey_df['rent_dv'] == 1].shape[0] #This is our n1!
# Calculate share of renters that supports increased housing supply 
p1 = renters_supply_count / rent_total  

#P2: Share of non-renters who support increased housing supply
#First, calculate the number of non-renters who support increased housing supply
non_supply_count = survey_df[(survey_df['rent_dv'] == 0) & (survey_df['housing_supply_dv'] == 1)].shape[0]
#Then, calculate the total number of people who are non-renters
non_rent_total = survey_df[survey_df['rent_dv'] == 0].shape[0] #This is our n2!
# Calculate share of non-renters that supports increased housing supply 
p2 = non_supply_count / non_rent_total  

#P: Calculate the pooled proportion: share of total people who support more housing supply
p_pool = (renters_supply_count + non_supply_count) / (rent_total + non_rent_total)

#Denominator: Calculate the standard error
se = np.sqrt(p_pool * (1 - p_pool) * ((1 / rent_total) + (1 / non_rent_total)))

# Calculate the Z-statistic
z_stat = (p1 - p2) / se

# Calculate the p-value (two-tailed test)
from scipy.stats import norm
p_value = 2 * (1 - norm.cdf(abs(z_stat)))  # Two-tailed p-value

# Output results
print(f"Proportion of Renters who support: {p1}")
print(f"Proportion of Non-renters who support: {p2}")
print(f"Z-statistic: {z_stat}")
print(f"P-value: {p_value}")

Cool, we did it!!! Now how do we interpret these numbers?

#### A note on calculations

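You might notice that the two results (from the manual approach vs. from `proportions_ztest`) are slightly different.

The formula for the z-test itself is the same in either case, but **each approach treats missing `housing_supply_dv` values differently.**

In our manual code, we include people who did not answer the housing supply question in our sample size, while the `.count()` totals we pass to the statsmodels function exclude them. 

This leads to slightly different outputs, and the gap would be larger if you had more NaN values.

The Z-statistics may also have opposite signs. The manual code computes renters minus non-renters, while `.groupby()` lists non-renters (`rent_dv` = 0) first, so `proportions_ztest` computes non-renters minus renters. The two-sided p-value is not affected by the sign.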

1. **Manual code** does this:

Numerators (`renters_supply_count`, `non_supply_count`):\
Count rows where housing_supply_dv == 1 and rent_dv is 1 or 0.\
&nbsp;&nbsp;&nbsp;&nbsp;→ Rows with housing_supply_dv = NaN are not counted (they don’t equal 1).

Denominators (`rent_total`, `non_rent_total`):\
Count all renters / non-renters, including those missing `housing_supply_dv`.

Result: *the share of all renters (including those who didn’t answer the supply question) who support increased housing supply.*

2. **`statsmodels` function** does this:

Numerators (`counts`):\
Sum of housing_supply_dv within each rent_dv group.\
&nbsp;&nbsp;&nbsp;&nbsp;→ If housing_supply_dv is coded 0/1, this gives us the number of 1’s, skipping NaNs.

Denominators (`totals`):\
`.count()` of housing_supply_dv within each rent_dv group.\
&nbsp;&nbsp;&nbsp;&nbsp;`.count()` ignores NaNs, so this denominator is:\
&nbsp;&nbsp;&nbsp;&nbsp;“Number of renters or non-renters who actually answered the supply question.”

Result: *the share of renters **who answered the supply question** who support increased housing supply*
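
If you want the manual calculation to use the same denominators as the `.count()` approach, one option is to drop the rows with a missing answer before counting. Here is a minimal sketch on made-up data. The dataframe `toy` and the names `answered`, `rent_total_all`, `rent_total_answered`, and `rent_total_count` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: one renter and one non-renter have no answer
toy = pd.DataFrame({
    "rent_dv":           [1, 1, 1, 1, 0, 0, 0, 0],
    "housing_supply_dv": [1, 1, np.nan, 0, 1, 0, 0, np.nan],
})

# Manual denominator as written in the lab: every renter, answered or not
rent_total_all = toy[toy["rent_dv"] == 1].shape[0]

# Keep only respondents who answered both questions, then count again
answered = toy.dropna(subset=["rent_dv", "housing_supply_dv"])
rent_total_answered = answered[answered["rent_dv"] == 1].shape[0]

# The .count()-based denominator for renters
rent_total_count = toy.groupby("rent_dv")["housing_supply_dv"].count().loc[1]

# Numerator: renters who support more housing supply
renters_supply_count = answered[(answered["rent_dv"] == 1)
                                & (answered["housing_supply_dv"] == 1)].shape[0]

print(renters_supply_count / rent_total_all)       # share of all renters
print(renters_supply_count / rent_total_answered)  # share of renters who answered
```

Neither denominator is wrong. They answer slightly different questions, so pick the one that matches how you want to describe your sample and say which one you used.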