# Lecture - November 19, Statistical Analysis
November 19, 2025


In [None]:
import pandas as pd
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest
import researchpy as rp

## 1. Bringing our cleaned .csv file into Python


Read in the .csv file with the data we cleaned last week

In [None]:
survey_cleaned=pd.read_csv('survey_data_cleaned.csv')
survey_cleaned.head()

In [None]:
#Let's remind ourselves of the columns we're working with
survey_cleaned.columns

To make the next steps of our analysis a little easier, let's make a smaller dataframe with just the columns we want to analyze. \
If you end up wanting to add other columns later on, you can always just come back and add to this list.

In [None]:
#Change this list to the names of the columns you want for your analysis
analysis_columns=['survey_num', 'housing_supply',
                  'tenure',
                  'housing_supply_grouped', 
                  'tenure_dv_rent',
                  'housing_supply_dv_agree', 'housing_supply_numeric']

#Keep only the columns we want to work with
survey_df=survey_cleaned[analysis_columns]

survey_df.head()

##  2.  Testing Bivariate Relationships

We are interested in answering the following question, but we're going to test it two ways:

-	Question 1:  Do renters support new housing supply more than owners, when housing supply support is coded as a numeric score?
-	Question 2:  Do renters support new housing supply more than owners, when housing supply support is coded as a dummy for "agree" and "strongly agree"?

## 2.1 Question 1

For Question 1, we're going to explore whether support for new housing supply (Y, numeric) differs by tenure (X, dummy).

### 2.1.1. Quick descriptive overview

As always, we'll start by getting a sense of the data we're working with.

Let's look at the distribution of responses to the housing supply question...

In [None]:
survey_df['housing_supply_numeric'].describe()

In [None]:
survey_df['housing_supply_numeric'].value_counts()

In [None]:
survey_df.groupby("tenure_dv_rent")['housing_supply_numeric'].mean()

### 2.1.3. Run our statistical test: t-test of means

Here's where our statistical testing starts!

Python has many packages that let you quickly test the relationship between your dependent and independent variables. Let's use `researchpy`.\
The basic structure of this t-test is:

`rp.ttest(sample1.dependent_variable, sample2.dependent_variable)`

This will compare the means of the two groups and tell you whether the difference between them is statistically significant.

You can read more about it here:
https://researchpy.readthedocs.io/en/latest/ttest_documentation.html 
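To make "comparing the means of two groups" concrete, here is a minimal sketch that computes a t statistic by hand using only the standard library. The scores are invented for illustration and are not from our survey, `welch_t` is a hypothetical helper name, and the formula is Welch's unequal-variance version, which may not match the defaults `rp.ttest` uses.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical 1-5 agreement scores, invented for illustration only
owners = [2, 3, 3, 4, 2, 3, 1, 4]
renters = [4, 5, 3, 4, 5, 4, 3, 5]

def welch_t(a, b):
    """Return the Welch t statistic for the difference mean(a) - mean(b)."""
    # Standard error of the difference, using each group's sample variance
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

t_stat = welch_t(owners, renters)
print(f"Mean (owners): {mean(owners):.2f}")
print(f"Mean (renters): {mean(renters):.2f}")
print(f"t statistic: {t_stat:.2f}")
```

A t statistic far from zero means the gap between the two group means is large relative to the noise in the data. A negative value here means the first group's mean is lower than the second's.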

In [None]:
#Step 1: Create slices of my dataframe for each group

#Sample 1: Respondents who are not renters (tenure_dv_rent == 0)
sample1=survey_df[survey_df['tenure_dv_rent']==0]

#Sample 2: Respondents who are renters (tenure_dv_rent == 1)
sample2=survey_df[survey_df['tenure_dv_rent']==1]

#Step 2: Run my t-test where I compare the averages of the two samples
rp.ttest((sample1.housing_supply_numeric), #compare the mean of our first sample
         (sample2.housing_supply_numeric)) #to the mean of our second sample

In [None]:
import warnings
warnings.filterwarnings("ignore")

## 2.2 Question 2

For Question 2, I want to see whether renters are more likely to support new housing supply, but this time I'm going to use my dummy dependent variable. Because the outcome is now a 0/1 dummy, I'm going to use a z-test of proportions.

### 2.2.1. Quick descriptive overview

In [None]:
survey_df[['housing_supply', 'housing_supply_dv_agree']]

In [None]:
pd.crosstab(survey_df['tenure_dv_rent'], survey_df['housing_supply_dv_agree'], normalize='index')

### 2.2.3 Run our statistical test: Z-test of proportions
https://www.statsmodels.org/dev/generated/statsmodels.stats.proportion.proportions_ztest.html
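As a rough illustration of what a two-sample z-test of proportions computes, here is a hand-rolled sketch using only the standard library. The counts are invented for illustration, `two_proportion_ztest` is a hypothetical helper (not a statsmodels function), and it assumes the pooled-variance form of the test, which is my understanding of the `proportions_ztest` default.

```python
from math import erfc, sqrt

def two_proportion_ztest(count1, n1, count2, n2):
    """Pooled two-sample z-test of proportions.

    Returns (z, two-sided p-value, one-sided p-value for p1 < p2).
    """
    p1, p2 = count1 / n1, count2 / n2
    # Pooled proportion under the null hypothesis that p1 == p2
    pooled = (count1 + count2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_two_sided = erfc(abs(z) / sqrt(2))   # 2 * (1 - Phi(|z|))
    p_smaller = 0.5 * erfc(-z / sqrt(2))   # Phi(z): evidence that p1 < p2
    return z, p_two_sided, p_smaller

# Hypothetical counts: 30 of 100 owners agree, 45 of 100 renters agree
z, p_two_sided, p_smaller = two_proportion_ztest(30, 100, 45, 100)
print(f"Z-statistic: {z:.3f}")
print(f"Two-sided p-value: {p_two_sided:.4f}")
print(f"One-sided p-value (group 1 < group 2): {p_smaller:.4f}")
```

When the z-statistic falls in the direction of the one-sided hypothesis, the one-sided p-value is half the two-sided one.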

In [None]:
# Calculate the number of "successes" per group, where housing_supply_dv_agree == 1 (the y that we're explaining)
counts = survey_df.groupby('tenure_dv_rent')['housing_supply_dv_agree'].sum() #.sum() adds up the 1s in each group

# Calculate the total sample size per group of the explanatory variable (tenure_dv_rent)
totals = survey_df.groupby('tenure_dv_rent')['housing_supply_dv_agree'].count() #.count() counts the non-missing values in each group

# Perform Z-test of proportions
stat, pval = proportions_ztest(counts, totals, alternative='two-sided')

# Output results
print(f"Z-statistic: {stat}")
print(f"P-value: {pval}")

In [None]:
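# Perform a one-sided Z-test of proportions
# groupby sorts the groups as 0 (non-renters) then 1 (renters), so 'smaller' tests
# the alternative that the non-renter proportion is smaller than the renter proportion
stat, pval = proportions_ztest(counts, totals, alternative='smaller')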

# Output results
print(f"Z-statistic: {stat}")
print(f"P-value: {pval}")