## Lecture November 17: Cleaning and Recategorizing Survey Data Part II


In [None]:
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
import numpy as np
pd.set_option('display.float_format', '{:.2f}'.format)

## 1.0 Bringing a .csv file into Python

We're going to start by bringing in a .csv file of ACS PUMS data (American Community Survey Public Use Microdata Sample). The files can be downloaded from https://www.census.gov/programs-surveys/acs/microdata/access.html

In [None]:
pums_df=pd.read_csv('pums.csv', delimiter = ',')

In [None]:
#preview the data - we can specify how many rows we want to look at with head()
pums_df.head(10)

## 2.0 Cleaning Numeric Data

Numeric variables are variables that hold numbers, either integers (1, 4, 300) or floats (1.6, 4.56, 300.1543). When we work with raw numeric data, we first want to explore its "distribution": what are the mean and standard deviation? What is the smallest value? What is the largest value?

In [None]:
#We are going to ask pandas to "describe" a variable for us (count, mean, std, min, quartiles, max)
pums_df['HINCP'].describe()

In [None]:
pums_df["HINCP"].hist(bins=30)
plt.xlabel("Household Income")
plt.ylabel("Frequency")
plt.title("Distribution of Household Income in San Francisco")
plt.show()

In [None]:
pums_df["HINCP"].quantile([0.01, 0.05, 0.1, 0.5, 0.9, 0.95, 0.99])

In [None]:
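#Trim outliers: find the 1st and 99th percentiles of household income
x = pums_df["HINCP"].dropna()
lower = x.quantile(0.01)
upper = x.quantile(0.99)

#Keep rows inside that range; rows with missing HINCP fail both comparisons and are dropped too
clean_df = pums_df[(pums_df["HINCP"] >= lower) & (pums_df["HINCP"] <= upper)]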

In [None]:
clean_df['HINCP'].describe()
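
To see what trimming at the 1st and 99th percentiles does to the mean, here is a minimal sketch on made-up numbers. The `toy` values are invented for illustration and are not PUMS data; `between()` is used as a shorthand for the two comparisons above and is inclusive on both ends.

```python
import pandas as pd

# Invented incomes with one extreme low value and one extreme high value
toy = pd.DataFrame(
    {"HINCP": [-5000, 20000, 35000, 50000, 65000, 80000, 95000, 2000000]}
)

lower = toy["HINCP"].quantile(0.01)
upper = toy["HINCP"].quantile(0.99)

# Keep only rows inside the percentile band (inclusive on both ends)
trimmed = toy[toy["HINCP"].between(lower, upper)]

print("mean before trimming:", toy["HINCP"].mean())
print("mean after trimming: ", trimmed["HINCP"].mean())
print("rows dropped:", len(toy) - len(trimmed))
```

The two extreme values pull the untrimmed mean far above what a typical household in the toy data earns, which is why we compare `describe()` before and after cleaning.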

In [None]:
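#Alternative rule: keep only households that report positive income (drops zero, negative, and missing values)
clean_0_df = pums_df[pums_df["HINCP"] > 0]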

In [None]:
clean_0_df['HINCP'].describe()

In [None]:
#ok - now your turn - work in a group to figure out the mean income by renters versus owners.
#You're going to have to use the codebook, recategorize Tenure, and then calculate means by group (hint: groupby)
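
If your group gets stuck, here is one possible shape for the recategorize-then-groupby step, shown on invented rows rather than the real file. It assumes the tenure variable is named `TEN` and that codes 1 and 2 are owners and 3 is renters; check both assumptions against the codebook for your data year before using it.

```python
import pandas as pd

# Invented rows; the TEN codes below are an assumption to verify in the codebook
toy = pd.DataFrame({
    "TEN":   [1, 2, 3, 3, 4, 1],
    "HINCP": [120000, 90000, 40000, 60000, 30000, 150000],
})

# Recategorize: map numeric codes to labels; unmapped codes (here 4) become NaN
tenure_labels = {1: "Owner", 2: "Owner", 3: "Renter"}
toy["tenure"] = toy["TEN"].map(tenure_labels)

# groupby drops rows whose group key is NaN by default
mean_income = toy.groupby("tenure")["HINCP"].mean()
print(mean_income)
```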

In [None]:
#how about income by the age of the building where the householder lives?

## 3.0 Cleaning Text Data

In [None]:
df=pd.read_csv('survey_text.csv', delimiter = ',')

In [None]:
df

In [None]:
#Let’s start by taking a look at the contents of the "biggest challenge" column
# Display unique values 
print("\nUnique values:")
print(df['biggest_challenge'].unique())

What patterns do you see?  Are people using different words for the same concept?

### Step 1: Coding
In your group, decide on a few categories that you see in the data. I'm going to work through an example using "homelessness" as a concept, distinct from housing. I want to include a sentence in my report saying what share of people, by tenure, said homelessness is the biggest challenge.

### Step 2: Standardize Text

In [None]:
# Standardize to lowercase
df["lowercase_text"] = df["biggest_challenge"].str.lower()
df[["lowercase_text", "biggest_challenge"]]

In [None]:
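# Flag responses that mention homelessness, including the misspelling "homless"
# na=False treats missing responses as non-matches
df["homeless_dv"] = df["lowercase_text"].str.contains(
    "homeless|houseless|homelessness|homless", na=False
).astype(int)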

In [None]:
df.loc[df["homeless_dv"] == 1, ["homeless_dv", "lowercase_text"]]
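
Here is a small sketch of how this dummy variable behaves, using invented responses rather than the survey file. Note that `str.contains` treats its pattern as a regular expression by default, so `|` means "or", and `na=False` makes a missing response count as a non-match instead of producing a missing value.

```python
import pandas as pd

# Invented responses, including mixed case, a missing value, and a misspelling
responses = pd.Series([
    "Homelessness downtown",
    "rent is too high",
    None,
    "homless encampments",
])

lower = responses.str.lower()

# 1 if the response mentions any of the terms, 0 otherwise (missing counts as 0)
flag = lower.str.contains("homeless|houseless|homless", na=False).astype(int)
print(flag.tolist())
```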

### 3.2 Advanced Text Cleaning - Regex
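Another, more advanced approach is regex (regular expressions), which allows you to have more control over what counts as a match.

What we did above mostly looked for literal substrings, using | to mean "or" (str.contains actually treats its pattern as regex by default). With regex, we can provide more complex decision rules. For example: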

- “find words starting with ‘hom’” → r"\bhom"
- “find rent or rental or renting” → r"\brent(al|ing)?\b" (a plain r"rent" would also match “parent”)
- “find respondents who mention crime or violence or police” → r"crime|violence|police"
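
The rules in the list above can be tried out on a few invented sentences with Python's built-in `re` module. This is a sketch: the example sentences are made up, and the pattern `\brent(al|ing)?\b` is just one possible way to write the rent rule so that it does not also catch words like "parents".

```python
import re

# Invented responses for illustration
examples = [
    "homelessness is growing",
    "my rent keeps rising",
    "parents need childcare",
    "too much crime",
]

patterns = {
    "starts_with_hom": r"\bhom",          # \b marks a word boundary
    "rent_words": r"\brent(al|ing)?\b",   # rent, rental, or renting as whole words
    "safety": r"crime|violence|police",   # | means "or"
}

# For each rule, record whether it matches each example sentence
matches = {
    name: [bool(re.search(pattern, text)) for text in examples]
    for name, pattern in patterns.items()
}

for name, result in matches.items():
    print(name, result)

# A plain substring search for "rent" also hits "parents"
print(bool(re.search(r"rent", "parents need childcare")))
```

The same patterns can be passed to `str.contains` on a pandas column.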

In [None]:
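# Regex version: flag any response containing "afford" (afford, affordable, affordability)
afford_pattern = r"afford"

# lowercase_text is already lowercase, so no second .str.lower() is needed
df["afford_dv_regex"] = df["lowercase_text"].str.contains(
    afford_pattern,
    na=False
).astype(int)

df.loc[df["afford_dv_regex"] == 1, ["lowercase_text", "afford_dv_regex"]].head(15)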

In [None]:
#Can you do a crosstab of your homeless_dv with tenure?
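
One way to get the share of respondents by tenure is `pd.crosstab` with `normalize="index"`, sketched here on invented rows. The column name `tenure` is a placeholder; use whatever the tenure column is called in your survey file.

```python
import pandas as pd

# Invented rows; "tenure" is a hypothetical column name
toy = pd.DataFrame({
    "tenure": ["Renter", "Renter", "Owner", "Owner", "Renter", "Owner"],
    "homeless_dv": [1, 0, 0, 1, 1, 0],
})

# normalize="index" converts counts to row shares, so each tenure row sums to 1
shares = pd.crosstab(toy["tenure"], toy["homeless_dv"], normalize="index")
print(shares)
```

The column labeled 1 gives the share of each tenure group that mentioned homelessness, which is the number the report sentence needs.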


In [None]:
# Code a couple more with your group! 