# **CP201A Lab 6: Cleaning and Recategorizing Survey Data**

#### November 12, 2025

## Learning Objectives
* Clean a dataset
* Recategorize and analyze nominal, numeric, and Likert scale survey data
* Use dummy variables (0 or 1) for regression analysis

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import math

pd.set_option('display.float_format', '{:.2f}'.format)

## **1. Preparing for Data Analysis**

### **1.1 Bringing a .csv file into Python**

Python can read in multiple forms of data, but the most common is a .csv file ("comma separated values"). 
We can easily import .csv data into Python with `pd.read_csv("file_name.csv")`

The `pd.` tells Python to call up pandas and read a given file as a csv, which pandas will turn into a dataframe.

*Note: We pass just the name of the CSV file as the argument, rather than the full file path, because our file is in the same folder as our Python file.*
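
To get a feel for what `pd.read_csv()` returns without needing the lab file, here is a minimal sketch that reads a tiny CSV from a string instead of from disk. The three rows in `csv_text` and the name `toy_df` are made up for illustration and are not the real survey data.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk (not the real survey data)
csv_text = """Survey Number,Rent or own,Average days spent in neighborhood per week
1,Rent,7
2,Own,4-5
3,,2
"""

# pd.read_csv accepts a file path or any file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)   # (rows, columns)
print(toy_df.dtypes)  # the type pandas inferred for each column
print(toy_df)
```

Notice that the blank "Rent or own" cell in the third row comes back as a missing value, and that the days column is not read as a number because of the "4-5" entry. Both issues come up later in this lab.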

In [None]:
# Let's read our data in as a pandas dataframe
survey_df = pd.read_csv('CP201Asurveydata.csv')

In [None]:
# Let's take a quick look at our data
survey_df.head()

In [None]:
# We can get information about our dataset by calling the "info()" method
survey_df.info()

### **1.2 Codebook**

Your codebook is a record of all the information about your dataset. For instance, this is where you'd keep track of each variable in your dataset and the specific survey question or parameter to which it refers.

Decide how and where you want to develop your codebook. You can do it directly in your notebook as a markdown cell, or do it separately in Word or Excel.  But don't forget to keep a record of the changes you make!!!


Let's rename our columns to something more coding-friendly.

In [None]:
survey_df.columns

In [None]:
survey_df.rename(columns={'Survey Number': 'survey_num', 
                        'Day of the Week': 'day_of_week', 
                        'Time of Day': 'time',
                        'Interviewer or self-administered': 'collection_mode',
                        'Connection to neighborhood': 'connection_nbhd',
                        'Average days spent in neighborhood per week': 'days_week',
                        'Neighborhood recovered from pandemic': 'pandemic_recover',
                        'Safe walking alone at night': 'safe_walking',
                        'New housing has increased gentrification': 'gentrification',
                        'Bikes lanes have increased traffic safety': 'bike_lane',
                        'I feel connected to the people and community in this city.': 'connected',
                        'Support increasing the supply of housing': 'housing_supply',
                        'Support speed enforcement cameras': 'speed_camera',
                        'Support sales tax to fund climate disaster preparedness': 'climate_tax',
                        'Support investment in policing': 'policing', 
                        'Gender identity': "gender_identity",
                        'Rent or own': 'tenure',
                        'Biggest challenge facing city': 'biggest_challenge',
                        'Notes (include data here on "Other" responses, observational notes.)': 'notes'}, inplace=True)

In [None]:
survey_df.columns

## **2. Exploring and Cleaning Variables**

Okay!  Now let's start exploring each of the variables in the dataset.

### **2.0 `NaN` values**

Even though we didn't have an N/A or Don't Know option on our survey, some respondents chose not to answer certain questions. Let's take a look at these "blank" values.

In [None]:
# Note that you must include the dropna=False option in order to be able to see if I have any missing values
survey_df['tenure'].value_counts(dropna=False)

It turns out that missing, undefined, and invalid values get special treatment. `NaN` stands for "Not a Number" and is a special floating-point value.

When we read in our data from the csv, pandas automatically filled every blank cell with `NaN`. Note that `np.nan` and `math.nan` are two names for the same kind of floating-point value, and `pd.isna()` recognizes both.
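
A short sketch of how `NaN` behaves in practice, using a made-up `tenure` Series (the values are hypothetical, with `None` standing in for a blank survey cell):

```python
import math
import numpy as np
import pandas as pd

# Hypothetical responses; None stands in for a blank survey cell
tenure = pd.Series(["Rent", "Own", None, "Rent"])

# NaN never equals anything, including itself, so == cannot find it
print(np.nan == np.nan)

# math.nan and np.nan are both ordinary Python floats
print(type(math.nan), type(np.nan))

# Use isna() to locate missing values instead
print(tenure.isna().sum())

# value_counts() skips missing values unless dropna=False
print(tenure.value_counts())
print(tenure.value_counts(dropna=False))
```

This is why we check for missing values with `pd.isna()` or `.isna()` rather than with `==`.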

### **2.1 Nominal (categorical) variables**

Let's start by looking at one of our simplest questions: Tenure.

An easy way to look at the distribution of a nominal variable is to use `.value_counts()`.  We should always do this first to assess the distribution of variables.


In [None]:
survey_df[['tenure']].value_counts(dropna=False)

In [None]:
survey_df[['gentrification']].value_counts(dropna=False)

Now it's `pd.crosstab()`'s time to shine! 

`pd.crosstab()` is a pandas function that makes a quick table showing how two (or more) variables relate. It basically counts how many times each combination (pair of values) occurs. 

For example, if I want to answer the following question: "Which group believes new housing creates gentrification: people who rent or people who own?" 
`.crosstab()` can help me do that by giving me the following combinations:
* Rent and agree that new housing has increased gentrification
* Rent and don't agree that new housing has increased gentrification
* Own and agree that new housing has increased gentrification
* Own and don't agree that new housing has increased gentrification

I can also use `.crosstab()` on a single variable, and it will count how many times each value appears, like `value_counts()`. `normalize=True` turns those counts into shares of the total (proportions between 0 and 1, which we can read as percentages).

Just like with ```value_counts()```, we need to specify that ```dropna = False``` in order to include our non-responses. This is important because when reporting on these results, we will need to indicate what share of folks didn't answer the question.
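
Here is a minimal sketch of the four combinations listed above, using a made-up six-person survey. The data and the name `toy_df` are hypothetical.

```python
import pandas as pd

# Hypothetical mini-survey (not the real lab data)
toy_df = pd.DataFrame({
    "tenure": ["Rent", "Rent", "Rent", "Own", "Own", "Own"],
    "gentrification": ["Agree", "Agree", "Disagree", "Agree", "Disagree", "Disagree"],
})

# Rows come from the first argument, columns from the second
counts = pd.crosstab(toy_df["tenure"], toy_df["gentrification"])
print(counts)

# Each cell is the number of respondents with that pair of answers
print(counts.loc["Rent", "Agree"])
```

Adding up every cell gives back the number of respondents, which is a handy check that nobody was dropped.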

In [None]:
# Let's get the counts for each tenure category
pd.crosstab(index=survey_df['tenure'], columns="Total", dropna=False) 

In [None]:
# Add normalize=True to convert the table into percentages
pd.crosstab(index=survey_df['tenure'], columns="Total", normalize=True, dropna=False) 

In [None]:
# We can also use crosstab to determine the frequency of responses across multiple variables
pd.crosstab(survey_df['tenure'], survey_df['gentrification'])

In [None]:
# Let's "normalize" the results, aka convert them to percentages
pd.crosstab(survey_df['tenure'], survey_df['gentrification'], normalize=True)

# Hmmm what do you notice about these percentages?

When we "normalize" data (aka create a relative measure by converting to percentages), it's important that we understand what the percentages are **out of**. In other words, what is our denominator?

* If our percentages are taken within each column (i.e. the denominator is the sum of column values), then we are treating each column as a group.
    * Therefore, when discussing our results we are **comparing values within each column**. 
* If our percentages are taken within each row (i.e. the denominator is the sum of row values), then we are treating each row as a group.
    * Therefore, when discussing our results we are **comparing values within each row**.
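
The denominators described above can be checked on a small made-up table. In this sketch the data and the variable names (`toy_df`, `overall`, `by_row`, `by_col`) are hypothetical; only the `normalize` options come from pandas.

```python
import pandas as pd

# Hypothetical mini-survey: 4 renters and 4 owners
toy_df = pd.DataFrame({
    "tenure": ["Rent"] * 4 + ["Own"] * 4,
    "gentrification": ["Agree"] * 3 + ["Disagree"] + ["Agree"] * 2 + ["Disagree"] * 2,
})

# normalize=True: every cell is divided by the grand total
overall = pd.crosstab(toy_df["tenure"], toy_df["gentrification"], normalize=True)

# normalize="index": every cell is divided by its row total
by_row = pd.crosstab(toy_df["tenure"], toy_df["gentrification"], normalize="index")

# normalize="columns": every cell is divided by its column total
by_col = pd.crosstab(toy_df["tenure"], toy_df["gentrification"], normalize="columns")

# The same cell answers three different questions
print(overall.loc["Rent", "Agree"])  # share of all respondents who rent AND agree
print(by_row.loc["Rent", "Agree"])   # share of renters who agree
print(by_col.loc["Rent", "Agree"])   # share of people who agree who are renters
```

Summing `by_row` across each row, or `by_col` down each column, gives 1, which is a quick way to confirm which denominator a table uses.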

##### **QUESTION: Which comparison is more useful for the crosstabulation below?**

<img src="Crosstab_normalize_example.png" width="600">

In [None]:
# Let's normalize across the "index" (aka the rows)
pd.crosstab(survey_df['tenure'], survey_df['gentrification'], normalize='index')

In [None]:
# Let's normalize across columns
pd.crosstab(survey_df['tenure'], survey_df['gentrification'], normalize='columns')


`pd.crosstab()` is one of the most powerful tools we have for exploring and understanding our data!

But wait, our work doesn't stop at crosstabulations!

Because our Likert scale includes different degrees of response (i.e., Agree *and* Strongly Agree), it's hard to quickly communicate interesting data just with the table above. Let's look at different ways to work with Likert data...

### **2.2 Likert Variables**

First, take a look at the data. It's always important to start by examining the data as is.

Once we examine the data, we have several different options for how we can recode Likert scale data.
1. **Group together Agree/Disagree Categories (to work with fewer categories)**\
&nbsp;&nbsp;&nbsp;&nbsp; Combine "Agree" with "Strongly Agree" and "Disagree" with "Strongly Disagree"

2. **Turn it into a numeric scale**\
&nbsp;&nbsp;&nbsp;&nbsp;Where a higher number means stronger agreement

#### **2.2.1. Looking at the Data**

In [None]:
survey_df[['gentrification']].value_counts(dropna=False)

One cool thing that will make graphing easier is to assign a **category order** to ordinal data.

In [None]:
from pandas.api.types import CategoricalDtype

# Define a category type with the ordered flag set to True
category_order = CategoricalDtype(["Strongly Disagree", "Disagree", 
    "Neutral", "Agree", "Strongly Agree", "Don't Know/NA"], ordered=True)

survey_df['gentrification_ordered'] = survey_df['gentrification'].astype(category_order)

In [None]:
survey_df['gentrification_ordered'].value_counts(sort=False, dropna=False) 
# Note: sorting is turned off because value_counts sorts by frequency by default; sort=False keeps our category order

This is great for viewing the data, but not so great for putting it into a regression... We'll get to that with our dummy variables:)
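
Beyond display order, an ordered category type also lets pandas compare and sort responses by their position on the scale. The sketch below uses a made-up `responses` Series and a five-point scale without the "Don't Know/NA" option, purely for illustration.

```python
import pandas as pd

# Hypothetical five-point scale, ordered from least to most agreement
likert_order = pd.CategoricalDtype(
    ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"],
    ordered=True,
)

responses = pd.Series(
    ["Agree", "Strongly Disagree", "Neutral", "Strongly Agree", "Agree"]
).astype(likert_order)

# Ordered categories support comparisons against a category label
at_least_agree = responses >= "Agree"
print(at_least_agree.sum())

# min() and max() follow the category order, not alphabetical order
print(responses.min(), responses.max())

# sort_values() follows the category order too
print(responses.sort_values().tolist())
```

Without the ordered type, these comparisons would fall back on alphabetical order, where "Agree" sorts before "Disagree".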

#### **2.2.2. Combine Agree and Disagree Categories**

Let's use our function-building skills to make a function that will turn:

* "Strongly Agree" into "Agree"
* "Strongly Disagree" into "Disagree"
* And keep everything else the same

In [None]:
def likert_grouped(column_name):
    '''
    Simplifies agreement responses in a specified column of our survey results dataframe
    '''
    survey_df[f"{column_name}_grouped"] = survey_df[column_name].map({ # Creates a new column that copies the original specified column
        # Within the new column, some values are replaced using the following rules
        "Strongly Agree": "Agree", # Replace all "Strongly Agree" responses with "Agree"
        "Strongly Disagree": "Disagree", # Replace all "Strongly Disagree" responses with "Disagree"

        # We want to keep the rest of our values the same, so we have to specify their "map" value
        "Agree": "Agree", 
        "Disagree": "Disagree",
        "Neutral": "Neutral",
        "Don't Know/NA": "Don't Know/NA"
    })

Now try it out! See if it works for the gentrification question...

In [None]:
# Let's use the function
likert_grouped("gentrification")

# Now let's check to see if it worked
survey_df[["gentrification","gentrification_grouped"]].value_counts(dropna=False)

#### **2.2.3 Turn the Likert scale into a numeric scale**

We can do the same thing to translate Likert responses to a 1-5 scale, from Strongly Disagree as a 1 to Strongly Agree as a 5.\
Let's make another function for that: then we can just use our function for whichever approach we prefer on whichever column we want! Cool!

In [None]:
def likert_to_numeric(column_name):
    '''
    Converts Likert responses to numeric values in a specified column of our survey results dataframe
    '''
    # This function works the same as the last example, but the "rules" are different
    survey_df[f"{column_name}_numeric"] = survey_df[column_name].map({
        "Strongly Agree": 5, 
        "Agree": 4, 
        "Neutral": 3,
        "Disagree": 2, 
        "Strongly Disagree": 1, 
        "Don't Know/NA": np.nan})

Sanity check! Let's make sure that worked, too!

In [None]:
# Let's use the function
likert_to_numeric("gentrification")

# Now let's check to see if it worked
survey_df[["gentrification","gentrification_grouped","gentrification_numeric"]].value_counts(dropna=False)

##### **EXERCISE [5 mins]: Now pick a Likert scale variable that you're interested in and clean it.**

Consider how you want to deal with the Neutral and Don't Know/NA responses in particular. Would it be more useful to convert it into a numeric scale or just combine categories?
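
To see how that choice plays out, here is a rough sketch on six made-up responses. It is one possible way to compare the two approaches, not the required answer to the exercise; the data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical responses (not the real lab data)
responses = pd.Series(
    ["Strongly Agree", "Agree", "Neutral", "Don't Know/NA", "Disagree", "Agree"]
)

# Option 1: numeric scale, with "Don't Know/NA" treated as missing
numeric = responses.map({
    "Strongly Agree": 5, "Agree": 4, "Neutral": 3,
    "Disagree": 2, "Strongly Disagree": 1, "Don't Know/NA": np.nan,
})
# mean() skips NaN, so the average covers only the 5 substantive answers
print(numeric.mean())

# Option 2: grouped categories, keeping every response in the denominator
grouped = responses.replace(
    {"Strongly Agree": "Agree", "Strongly Disagree": "Disagree"}
)
print(grouped.value_counts(normalize=True))
```

The numeric version quietly shrinks the denominator, while the grouped version keeps "Neutral" and "Don't Know/NA" visible as their own shares. Either can be reasonable, as long as the codebook records which one was used.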

In [None]:
# Use this cell for the exercise

### **2.3 Numeric variables**

Numeric variables refer to any variable that includes numbers, either integers (1, 4, 300) or floats (1.6, 4.56, 300.1543). When we work with raw numeric data, we want to explore their "distribution" - what is the mean and standard deviation?  What is the smallest value?  What is the largest value?

In [None]:
# The describe() method summarizes the distribution of a single variable
survey_df['days_week'].describe()

In [None]:
# But our values are being read as an object rather than an integer or float. Why is that?
survey_df['days_week'].dtype

In [None]:
# It looks like there are some values that don't conform to the integer or float format (e.g. 0-7, 5-6, etc.)
survey_df[['days_week']].value_counts(dropna=False)

Hmm... the plot thickens...

Even though the "days per week" question is numeric, pandas reads the ranges provided (like "4-5") as strings because of the dash, and that turns the whole column into text. We can't calculate the average or the standard deviation right away...
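
One tempting shortcut is to let pandas convert whatever it can. The sketch below, on four made-up answers, suggests why that is not enough here: `pd.to_numeric(..., errors="coerce")` turns every range into `NaN` instead of keeping the information in it.

```python
import pandas as pd

# Hypothetical answers as pandas would read them from a CSV: all strings
days = pd.Series(["7", "4-5", "2", "0-7"])

# errors="coerce" turns anything that is not a plain number into NaN
converted = pd.to_numeric(days, errors="coerce")
print(converted)

# The two ranges were dropped, so the mean only reflects "7" and "2"
print(converted.isna().sum())
print(converted.mean())
```

Half of the made-up responses vanish from the average, which is the motivation for handling ranges explicitly.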

Here's what we'll do. In our numeric value column, there are four possibilities:
1. The value is null (i.e., non-response)
2. The value is a float or integer
3. The value is a string because it's written as a range (with a `-`)
4. The value is a number but is being read as a string anyway

Let's write a function that runs through all of those possibilities and handles each value accordingly. Depending on what each value of our column is, the function will do something different to clean it.

In [None]:
def clean_range_to_top(x):
    '''
    Converts one "days per week" response to a number, keeping the top of any range
    '''
    # 1. The value is null: keep it that way
    if pd.isna(x):
        return np.nan
    # 2. The value is a float or integer: round it and keep it
    if isinstance(x, (int, float)):
        return round(x)
    # 3 and 4. The value is a string, for whatever reason
    if isinstance(x, str):
        # 3. The value is a range: keep the upper value
        if "-" in x:
            parts = x.split('-')
            return float(parts[-1])
        # 4. The value is a number being read as a string: convert it to a float and round up
        return math.ceil(float(x))

In [None]:
# Let's apply this function to every row in the column using .apply()
survey_df["days_week_clean"] = survey_df["days_week"].apply(clean_range_to_top)

In [None]:
# Now let's check to see if it worked
survey_df["days_week_clean"].value_counts(dropna=False)

Ah, that's better!

##### **QUESTION: What is a limitation of this function?**

## **3. Creating Dummy Variables**

When we’re working with categorical data (like Rent vs. Own) or survey scales, we can’t jump straight into statistical tests. That’s where dummy variables come in — they turn our qualitative data into numbers (usually 0s and 1s) so we can use them in quantitative analysis.

They also let us focus on one category at a time. For example, if we’re looking at TENURE, we can make a dummy variable just for “rent” — marking it as 1 and everything else as 0 — to see how “rent” compares to the rest of the responses.

Here's a visual explanation of how we use dummy variables to turn categorical data into a binary matrix. We'll cover this more in the next lab, but converting these values to 1's and 0's is what allows us to do statistical testing and regression modelling with categorical data!

<img src="dummy_variable.png" width="400">
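
As a rough code version of that picture, pandas can build the whole binary matrix in one step with `pd.get_dummies()`. The `tenure` values below are made up, and this is a sketch of the general idea rather than the approach used in the rest of this lab.

```python
import pandas as pd

# Hypothetical tenure responses (not the real lab data)
tenure = pd.Series(["Rent", "Own", "Rent", "Other"], name="tenure")

# One 0/1 column per category; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(tenure, dtype=int)
print(dummies)

# A single dummy for renters, equivalent to the "Rent" column above
rent_dv = (tenure == "Rent").astype(int)
print(rent_dv.tolist())
```

Note that this shortcut codes "Other" (and any missing value) as 0 in the renter dummy. If those respondents should be left out of a rent-versus-own comparison instead, they need to be mapped to `NaN` explicitly.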

In [None]:
# Let's create a dummy that is equal to 1 for renters and 0 for owners. 

# One way of dealing with "Other" is to convert it to np.nan    
survey_df['rent_dv']=survey_df['tenure'].map({
    "Rent":1, 
    "Own":0, 
    "Other":np.nan})

survey_df['rent_dv'].value_counts(dropna=False)

# This is fine for now, but you'll see later on that it makes it hard to run statistical tests if we have stray NaN values.

In [None]:
# We could also define a reusable mapping dictionary, so we can apply it to lots of variables that have the same coding.

# Note that I could have skipped the step above that grouped Strongly Agree with Agree (and Strongly Disagree with Disagree) and just coded them into dummies here

mapping_agree = {
    'Strongly Agree':1,
    'Agree': 1,
    'Neutral': 0,
    'Disagree': 0,
    'Strongly Disagree':0,
    "Don't Know/NA": 0, # Since this is an "agree" dummy variable we can convert everything that isn't "agree" to a 0
    np.nan: 0 # We know there is a single nan value in the dataset, so let's code this as well
}

# Apply mapping to the 'housing_supply' column
survey_df['housing_supply_dv_agree'] = survey_df['housing_supply'].map(mapping_agree)
survey_df['housing_supply_dv_agree'].value_counts(dropna=False)

In [None]:
# Now I can do the same thing for another variable. 
survey_df['gentrification_dv_agree'] = survey_df['gentrification'].map(mapping_agree)
survey_df['gentrification_dv_agree'].value_counts(dropna=False)

# And we can create a "disagree" map that we can then repeatedly use on variables as well. 

##### **EXERCISE [5 mins]: Now convert a variable of your own choosing into a dummy variable.**

Consider which categories you want to compare, and how the Neutral and Don't Know/NA responses fit into that comparison. 

In [None]:
# Use this cell for the exercise

Eventually, the goal is to repeat this process, perhaps using a function and/or for loop, to achieve something like this table for each variable that you're interested in examining. 

Note that you don't have to convert every single variable in the entire dataset into a dummy variable, just focus on what you're most interested in!

<img src="housing_dummy_variable_example.png" width="800">
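
One possible way to repeat the process with a function and a for loop is sketched below. The helper `add_dummies`, the `mapping_disagree` dictionary, and the toy data are all hypothetical. Missing and unmapped responses are coded as 0 with `.fillna(0)`, which is an assumption that mirrors the "agree" mapping used earlier in this lab.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-survey (not the real lab data)
toy_df = pd.DataFrame({
    "housing_supply": ["Strongly Agree", "Disagree", "Neutral", np.nan],
    "speed_camera": ["Agree", "Strongly Disagree", "Don't Know/NA", "Agree"],
})

# 1 for either level of agreement; every other response is filled with 0 below
mapping_agree = {"Strongly Agree": 1, "Agree": 1}

# Hypothetical "disagree" counterpart, built the same way
mapping_disagree = {"Strongly Disagree": 1, "Disagree": 1}

def add_dummies(df, columns):
    """Adds an agree dummy and a disagree dummy for each listed column."""
    for col in columns:
        # Responses that are not in the mapping become NaN, then 0
        df[f"{col}_dv_agree"] = df[col].map(mapping_agree).fillna(0).astype(int)
        df[f"{col}_dv_disagree"] = df[col].map(mapping_disagree).fillna(0).astype(int)
    return df

toy_df = add_dummies(toy_df, ["housing_supply", "speed_camera"])
print(toy_df)
```

Because "Neutral" and "Don't Know/NA" are 0 in both dummies, each dummy compares one group against everyone else.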

**Woo hoo! Now you're a data cleaning expert!**
