## Lecture (November 12, 2025): Cleaning and Recategorizing Survey Data


In [None]:
!pip install seaborn

In [None]:
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
import numpy as np
import re
pd.set_option('display.float_format', '{:.2f}'.format)

## 1. Bringing a .csv file into Python

Python can read data in many formats, but the most common is a .csv file ("comma-separated values"), and pandas makes it easy to import one. In the cell below, the `pd.` prefix tells Python to use the pandas library (this is like vocabulary - something to learn), `read_csv` tells pandas to read the file as a CSV, the first argument is the name of the file, and `delimiter` says that the values are separated by commas. (A delimiter is the character that separates the columns, or array values, from one another.)

(We can pass the name of the CSV file alone, rather than the full file path, because the file is in the same folder as our Python file.)
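
To see what the `delimiter` argument does without needing the survey file, here is a minimal sketch that reads a tiny CSV from a string. The text in `csv_text` is made up for illustration and is not the real survey data.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for a small CSV file (not the real survey data)
csv_text = "Rent or own,Average days spent in neighborhood per week\nRent,5\nOwn,7\n"

# A comma is the default delimiter for read_csv, so both calls parse the text the same way
toy_explicit = pd.read_csv(io.StringIO(csv_text), delimiter=",")
toy_default = pd.read_csv(io.StringIO(csv_text))

print(toy_explicit.equals(toy_default))
print(toy_explicit.shape)
```

Spelling out `delimiter=','` is optional for a true CSV, but it makes the intent clear and is the argument to change for files that use tabs or semicolons instead.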

In [None]:
survey_df = pd.read_csv('CP201Asurveydata.csv', delimiter = ',')

In [None]:
#let's take a quick look at our data
survey_df.head()

In [None]:
#we can get information about our dataset (column names, non-null counts, data types) by calling the info() method
survey_df.info()

#### We're going to rename three of our columns to be more coding friendly, and then limit our dataframe to those three columns.

In [None]:
survey_df.rename(columns={'Average days spent in neighborhood per week': 'days_week',
                        'Support increasing the supply of housing': 'housing_supply',
                        'Rent or own': 'tenure'}, inplace=True)

In [None]:
lecture_df = survey_df[["days_week", "housing_supply", "tenure"]].copy()

In [None]:
lecture_df

## 2. Cleaning Variables

### 2.1 Categorical "nominal" variables

Let's start by looking at one of our nominal categorical variables: tenure.

In [None]:
#A simple way to look at the distribution of a nominal variable is to request the value_counts().
#Note that I include the (dropna=False) option so that I can see whether I have any missing values
lecture_df[['tenure']].value_counts(dropna=False)

In [None]:
pd.crosstab(index=lecture_df['tenure'], columns="Total", dropna=False) 

In [None]:
#let's get the percents.  Here, we normalize by the "Total" column, which returns each category's share as a proportion (multiply by 100 for percents)
pd.crosstab(index=lecture_df['tenure'], columns="Total", normalize='columns', dropna=False) 
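
As a cross-check on the tables above, counts and shares can also be built from `value_counts(dropna=False)`, which lists missing values as their own row. This is a minimal sketch on a made-up series; `toy_tenure` and `summary` are illustrative names, not part of the lecture data.

```python
import numpy as np
import pandas as pd

# Made-up answers, including one missing value
toy_tenure = pd.Series(["Rent", "Own", "Rent", np.nan, "Other", "Rent"], name="tenure")

# Counts per category, keeping the missing value as its own row
counts = toy_tenure.value_counts(dropna=False)

# Put counts and shares side by side
summary = counts.to_frame(name="n")
summary["share"] = summary["n"] / summary["n"].sum()

print(summary)
```

Because the missing row is included in the denominator, the shares here are shares of all respondents, not only of those who answered.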

### 2.2 Categorical "ordinal" variables

We explore categorical ordinal variables--like age or our Likert scale questions--in the same way.

In [None]:
lecture_df[['housing_supply']].value_counts(dropna=False)

In [None]:
#one cool thing that makes looking at ordinal data easier is to assign a category order
from pandas.api.types import CategoricalDtype

# Define a category type with the ordered flag set to True
category_order = CategoricalDtype(["Strongly Disagree", "Disagree", 
    "Neutral", "Agree", "Strongly Agree", "Don't Know/NA"], ordered=True)

lecture_df['housing_supply'] = lecture_df['housing_supply'].astype(category_order)
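
One thing to watch when assigning a category order: any label that is not listed in the `CategoricalDtype` becomes missing without a warning. The sketch below uses a shortened, made-up scale (`likert_order`) and a deliberately misspelled answer to show this, along with what the ordering buys us.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# A shortened, hypothetical scale for illustration
likert_order = CategoricalDtype(["Disagree", "Neutral", "Agree"], ordered=True)

# The last answer is lowercase, so it does not match any listed category
toy = pd.Series(["Agree", "Disagree", "Agree", "Neutral", "agree"]).astype(likert_order)

# Labels that are not in the dtype are converted to NaN
print(toy.isna().sum())

# sort_index() follows the declared order rather than alphabetical order
ordered_counts = toy.value_counts().sort_index()
print(list(ordered_counts.index))

# Ordered categories also support comparisons against a category label
print((toy >= "Neutral").sum())
```

Comparing the number of missing values before and after the `astype` call is a quick way to catch spelling or capitalization mismatches in the category list.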

In [None]:
pd.crosstab(index=lecture_df['housing_supply'], columns="Total", dropna=False) 

In [None]:
pd.crosstab(index=lecture_df['housing_supply'], columns="Total", normalize='columns', dropna=False) 

In [None]:
#It also makes it easier to look at the distribution visually
lecture_df['housing_supply'].value_counts().sort_index().plot(kind='bar')
plt.xlabel('Need to Increase Housing Supply')
plt.ylabel('Number of Responses')
plt.title('Respondent Support for Increasing Housing Supply')
plt.show()

### 2.3 Recoding categorical variables

This is where we truly don our magician hat!  We need to decide how we want to recode these for analysis.

In [None]:
pd.crosstab(lecture_df['tenure'], lecture_df['housing_supply'])

In [None]:
pd.crosstab(lecture_df['tenure'], lecture_df['housing_supply'], normalize="index")

In [None]:
# Combine 'Agree' and 'Disagree' groupings
group_mapping = {
    'Agree': 'Agree',
    'Strongly Agree': 'Agree',
    'Disagree': 'Disagree',
    'Strongly Disagree': 'Disagree',
    "Don't Know/NA" : 'Unknown',
    "Neutral":'Neutral'
}

# Create a new column 'supply_group' for the grouped categories
lecture_df['supply_group'] = lecture_df['housing_supply'].map(group_mapping)
pd.crosstab(index=lecture_df['supply_group'], columns="Total", normalize=True)
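
Any answer that is missing from the mapping dictionary comes back from `.map()` as NaN, so it is worth checking a recode against the original column. Here is a minimal sketch on made-up data; `toy`, `toy_mapping`, and the stray `"agree "` label are hypothetical.

```python
import pandas as pd

# Made-up answers; the last one has different capitalization and a trailing space
toy = pd.DataFrame({"housing_supply": ["Agree", "Strongly Agree", "Disagree", "Neutral", "agree "]})

toy_mapping = {
    "Agree": "Agree",
    "Strongly Agree": "Agree",
    "Disagree": "Disagree",
    "Neutral": "Neutral",
}
toy["supply_group"] = toy["housing_supply"].map(toy_mapping)

# Rows that had an answer but received no group point to a gap in the mapping
unmapped = toy.loc[toy["supply_group"].isna() & toy["housing_supply"].notna(), "housing_supply"]
print(unmapped.unique())

# Original labels (rows) against recoded groups (columns)
check = pd.crosstab(toy["housing_supply"], toy["supply_group"])
print(check)
```

In the crosstab, each original label should have all of its responses in a single recoded column.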

In [None]:
pd.crosstab(lecture_df['tenure'], lecture_df['supply_group'], normalize="index")

## 3. Creating Dummy Variables


In [None]:
# Let's turn our tenure variable into a dummy variable
#pandas can do it automatically (pd.get_dummies), but I highly recommend intentionally coding your dummies.  I am going to create a dummy that
#is equal to 1 for renters (renter_dv = 1) and 0 for owners. What should I do with "Other"?  Here I group it with owners (0).

#Recoding or aggregating variables can be done lots of ways.  When I started, I liked making very clear,
#line by line codes.  For example, 
lecture_df['renter_dv']=lecture_df['tenure'].map({"Rent":1, "Own":0, "Other":0})

In [None]:
#we always need to check our coding - it's easy to make a mistake!
lecture_df[['renter_dv', 'tenure']]
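
Scrolling through the two columns works for a small dataset; a crosstab gives the same check in one compact table. The sketch below also shows one possible alternative answer to the "Other" question, which is to set it to missing so those respondents drop out of the renter/owner comparison. That alternative is an illustration, not the choice made in this lecture, and `toy` and `renter_dv_strict` are made-up names.

```python
import numpy as np
import pandas as pd

# Made-up tenure answers
toy = pd.DataFrame({"tenure": ["Rent", "Own", "Other", "Rent", "Own"]})

# Option A (as in the lecture code): "Other" is grouped with owners
toy["renter_dv"] = toy["tenure"].map({"Rent": 1, "Own": 0, "Other": 0})

# Option B (an alternative, not the lecture's choice): leave "Other" out of the comparison
toy["renter_dv_strict"] = toy["tenure"].map({"Rent": 1, "Own": 0, "Other": np.nan})

# Each tenure label should land in exactly one dummy value
coding_check = pd.crosstab(toy["tenure"], toy["renter_dv"])
print(coding_check)
print(toy["renter_dv_strict"].value_counts(dropna=False))
```

Which option is appropriate depends on who the "Other" respondents are and on the question being asked.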

In [None]:
pd.crosstab(lecture_df['renter_dv'], lecture_df['supply_group'], normalize="index")

## 4. Converting Likert Questions into a Numeric Variable

In [None]:
lecture_df["supply_numeric"] = survey_df["housing_supply"].map({
        "Strongly Agree": 5, 
        "Agree": 4, 
        "Neutral": 3,
        "Disagree": 2, 
        "Strongly Disagree": 1, 
        "Don't Know/NA": np.nan})

In [None]:
lecture_df[["housing_supply", "supply_numeric"]]

In [None]:
lecture_df.groupby("renter_dv")["supply_numeric"].mean()

## 5. Are these different?
Can I say that renters (renter_dv = 1) are more likely to support new supply (mean score of 4.00) than owners (mean score of 3.55)?

Just like with the ACS data, we have to address sampling error and the possibility that these two values are not statistically significantly different from one another.  Because I'm now working with raw data, I can calculate the standard deviation, standard error, and margin of error (MOE) myself!
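
For reference, these are the formulas the cells that follow compute for each group, where $s$ is the sample standard deviation, $n$ is the number of non-missing responses, and $\bar{x}$ is the group mean. The subscripts 1 and 2 are just labels for the two groups being compared.

$$
SE = \frac{s}{\sqrt{n}}, \qquad MOE_{90} = 1.645 \times SE, \qquad CI_{90} = \bar{x} \pm MOE_{90}
$$

$$
z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{SE_1^2 + SE_2^2}}
$$

At the 90% confidence level, the difference between the two means is treated as statistically significant when $|z| > 1.645$.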

In [None]:
# 1. Group data and calculate mean, count, std
group_stats = lecture_df.groupby("renter_dv")["supply_numeric"].agg(["mean", "count", "std"])
print(group_stats)

In [None]:
# 2. Standard error = std / sqrt(n)
group_stats["se"] = group_stats["std"] / np.sqrt(group_stats["count"])
print(group_stats)

In [None]:
# 3. Create confidence intervals (90%)
group_stats["ci_lower"] = group_stats["mean"] - 1.645 * group_stats["se"]
group_stats["ci_upper"] = group_stats["mean"] + 1.645 * group_stats["se"]
print(group_stats)

In [None]:
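# 4. z-score for the difference between the two group means:
# (owner mean - renter mean) / sqrt(sum of the two squared standard errors), with the values typed in by hand
z = (3.55 - 4.0) / np.sqrt(0.16**2 + 0.09**2)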
z
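
Typing the means and standard errors in by hand is easy to get wrong if the data change. As a sketch of one way around that, the z-score can be pulled straight from a grouped table. The scores in `toy` are made up for illustration, and `group_table`, `diff`, `se_diff`, and `p_value` are hypothetical names.

```python
import math
import numpy as np
import pandas as pd

# Made-up scores for illustration only (not the survey responses)
toy = pd.DataFrame({
    "renter_dv": [1, 1, 1, 1, 0, 0, 0, 0],
    "supply_numeric": [5, 4, 4, 3, 4, 3, 3, 2],
})

# Same steps as above: mean, count, std, then standard error per group
group_table = toy.groupby("renter_dv")["supply_numeric"].agg(["mean", "count", "std"])
group_table["se"] = group_table["std"] / np.sqrt(group_table["count"])

# Difference in means (owners minus renters) and its standard error
diff = group_table.loc[0, "mean"] - group_table.loc[1, "mean"]
se_diff = math.sqrt(group_table.loc[0, "se"] ** 2 + group_table.loc[1, "se"] ** 2)
z_toy = diff / se_diff

# Two-sided p-value from the standard normal distribution
p_value = math.erfc(abs(z_toy) / math.sqrt(2))

print(round(z_toy, 2), round(p_value, 3), abs(z_toy) > 1.645)
```

The comparison against 1.645 uses the same 90% critical value as the confidence intervals above.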