# Notebook 3: Reclassifying Variables

In Notebook 2, we learned how to import a .csv file, rename our variables, "slice" our dataframe using [[]], and examine the distribution of numeric variables using the .describe() and seaborn plotting functions.  We also learned how to use pd.crosstab to explore categorical variables.

In today's lecture, we're going to focus on:

> Turning our categorical variables into dummies

> Cleaning up numeric variables

As with everything in Python, there are lots of different ways to do the same thing, so we're providing some basic code so you have what you need for Assignment 4.  But you may find that when you work with your own data, you'll need to explore the web for other code.

## 1.0 Reading in our libraries, our dataset, and renaming our variables

The next few cells get us back to where we left off last week.  Remember, you need to run all the cells in order (libraries, read data, rename data); otherwise Python will give you an error message!

In [1]:
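# First, we're going to call in our libraries
# from datascience import *

import numpy as np
import pandas as pd
import math
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt  # plt is the usual alias for the pyplot plotting interface, not the top-level matplotlib package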


pd.options.display.float_format = '{:.2f}'.format

In [2]:
#Show our plots in the Jupyter notebook
%matplotlib inline

In [14]:
#When we start working with nan (missing) values, Python can print RuntimeWarning messages - we're going to ignore them here
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)
#This turns off the pandas SettingWithCopyWarning, which can appear when we add columns to a slice of a dataframe
pd.options.mode.chained_assignment = None  # default='warn'

In [4]:
#Now we're going to read in our data and create a new dataframe

chis_df = pd.read_csv('CHISextract2022.csv')
chis_df

Unnamed: 0,AB1,AJ105,AK23,AK25,AM19,AM20,AM21,AK28,AM45,AM48,...,OCCMAIN2,AHEDC_P1,AK22_P1,AK3_P1V2,HHSIZE_P1,OMBSRR_P1,RACECN_P1,SRAGE_P1,TIMEAD_P1,TIMENEV2_P1
0,2,-1,1,1,3,4,3,3,4,2,...,-1,4,2,-1,3,1,1,55,17,-1
1,2,-1,3,2,3,3,3,2,2,2,...,99,8,8,7,1,1,1,30,13,13
2,2,-1,1,2,2,3,2,1,5,2,...,7,7,7,2,1,2,5,65,13,13
3,3,1,1,1,3,3,1,2,1,2,...,-1,2,6,-1,6,1,1,55,18,-1
4,3,-1,1,1,1,4,1,2,5,2,...,10,6,9,6,1,2,5,55,13,13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24448,2,-1,3,2,3,1,2,1,3,2,...,5,3,1,6,1,3,4,30,15,-1
24449,4,1,3,2,2,2,2,2,4,2,...,5,1,3,6,2,1,5,40,17,-1
24450,4,-1,3,2,2,2,3,3,3,1,...,-1,9,12,-1,2,3,4,60,2,2
24451,4,1,3,2,2,2,2,3,5,2,...,5,3,1,2,2,1,5,60,13,13


In [6]:
chis_df.rename(columns={"AE_VEGI":"ate_veg",
                        "SRSEX": "sex",
                        "OMBSRR_P1": "race_ethnicity",
                        "POVLL" : "pov_cat",
                       "AK22_P1" : "hh_inc",
                       "AM184": "housing_worry"}, inplace=True)

In [7]:
#and create our extract, so things are easier to see/work with
#.copy() makes the extract its own dataframe, so adding new columns to it later is safe
chis_df_small = chis_df[['ate_veg', 'sex', 'race_ethnicity', 'pov_cat', 'hh_inc', 'housing_worry']].copy()

In [8]:
chis_df_small

Unnamed: 0,ate_veg,sex,race_ethnicity,pov_cat,hh_inc,housing_worry
0,1,2,1,1,2,4
1,2,1,1,4,8,3
2,14,2,2,4,7,4
3,3,2,1,4,6,4
4,1,1,2,4,9,3
...,...,...,...,...,...,...
24448,0,1,3,1,1,3
24449,14,2,1,1,3,3
24450,7,1,3,4,12,1
24451,7,2,1,1,1,4
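Before moving on, one thing to watch in the rename step: by default, `rename` quietly skips any key that isn't a column in the dataframe, so a typo in a variable name won't give you an error message. Here is a minimal sketch of a sanity check. It uses a tiny made-up dataframe, and `toy_df`, `rename_map`, and the misspelled `SRSEXX` are just for illustration.

```python
import pandas as pd

# Toy stand-in for chis_df (made-up values, not real CHIS data)
toy_df = pd.DataFrame({"SRSEX": [1, 2, 2], "AM184": [4, 3, 1]})

# "SRSEXX" is a deliberate typo to show what happens
rename_map = {"SRSEXX": "sex", "AM184": "housing_worry"}

# List any keys that don't match an existing column before renaming
missing = [old for old in rename_map if old not in toy_df.columns]
print("Not found:", missing)

# rename runs without complaint, but SRSEX keeps its original name
toy_renamed = toy_df.rename(columns=rename_map)
print(list(toy_renamed.columns))
```

If the "Not found" list isn't empty, check the spelling against the codebook before going any further.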


### Codebook


> AE_VEGI: Number of times respondent eats vegetables per week

> SRSEX: Self-reported Sex (1= Male, 2=Female)

> OMBSRR_P1: Race/ethnicity
(1=Hispanic, 2= White NH, 3=Black NH, 4=AmIndian/Alaska Native NH, 5=Asian NH, 6=Other or two or more)

> POVLL: poverty level
(1 = 0-99% FPL, 2=100-199% FPL, 3=200-299% FPL, 4=300% FPL and above)

> AK22_P1: Household Income

> AM184: How Often Worry about Paying Rent/Mortgage
(1=Very often, 2=Somewhat Often, 3=From Time to Time, 4=Almost Never)
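Since the codes are just numbers, it can help to attach the codebook labels while you're exploring. Below is a rough sketch of one way to do that with `.map` on a small made-up dataframe. The names `toy`, `sex_labels`, and `worry_labels` are assumptions for this example, and the label text is copied from the codebook above.

```python
import pandas as pd

# Made-up rows using the same codes as the codebook
toy = pd.DataFrame({"sex": [1, 2, 2, 1], "housing_worry": [4, 3, 1, 4]})

# Dictionaries that translate each code into its codebook label
sex_labels = {1: "Male", 2: "Female"}
worry_labels = {1: "Very often", 2: "Somewhat often",
                3: "From time to time", 4: "Almost never"}

# Keep the numeric columns and add labeled copies next to them
toy["sex_label"] = toy["sex"].map(sex_labels)
toy["housing_worry_label"] = toy["housing_worry"].map(worry_labels)

# Crosstabs are much easier to read with labels than with codes
labeled_table = pd.crosstab(toy["sex_label"], toy["housing_worry_label"])
print(labeled_table)
```

Keeping the original numeric columns means you can still recode them into dummies later.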

##  2.0  Recoding and Cleaning Variables

### 2.1 Recoding Nominal Binary Variables into Dummies

We  often have to work with "nominal" variables (those with a "Name").  Note that in the CHIS data, the variables that are nominal (e.g. sex, race/ethnicity) are actually assigned number values rather than strings.   

Let's start with the "sex" variable.  It has two possible values, "Male" and "Female".  This is known as a binary or dichotomous variable.  But, even though they are represented by the numbers 1 and 2, we can't treat them as numbers - e.g., adding 2 males together doesn't give us a female.

    SRSEX: Self-reported Sex (1= Male, 2=Female)
    
We need to change this into a ***dummy*** variable. A dummy variable takes only two values - a 0 and a 1 - and in general, we give it a "1" when the condition we're interested in exploring is met.

In [9]:
#The fastest way to make dummies is to use the pandas "get_dummies" function.  Let's try it and see what happens
#notice that the "sex" column is replaced by two new variables
chis_df_1=pd.get_dummies(chis_df_small, columns=['sex'])
chis_df_1

Unnamed: 0,ate_veg,race_ethnicity,pov_cat,hh_inc,housing_worry,sex_1,sex_2
0,1,1,1,2,4,0,1
1,2,1,4,8,3,1,0
2,14,2,4,7,4,0,1
3,3,1,4,6,4,0,1
4,1,2,4,9,3,1,0
...,...,...,...,...,...,...,...
24448,0,3,1,1,3,1,0
24449,14,1,1,3,3,0,1
24450,7,3,4,12,1,1,0
24451,7,1,1,1,4,0,1
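A note if your output looks different: in recent pandas versions (2.0 and later), `get_dummies` returns True/False columns by default rather than 0/1. A small sketch on made-up data, passing `dtype=int` to ask for 0s and 1s explicitly (the `toy` dataframe is an assumption for this example):

```python
import pandas as pd

# Made-up rows with the same coding as the sex variable (1=Male, 2=Female)
toy = pd.DataFrame({"ate_veg": [1, 2, 14], "sex": [2, 1, 2]})

# dtype=int requests 0/1 columns regardless of the pandas version's default
toy_dummies = pd.get_dummies(toy, columns=["sex"], dtype=int)
print(toy_dummies)
```

The columns that aren't recoded come first, followed by one new column per category.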


In [10]:
#Whenever you classify or create a new variable, it's always a good idea to check your work to see if it did what you expected
pd.crosstab(chis_df_1['sex_1'], columns='Total')

col_0,Total
sex_1,Unnamed: 1_level_1
0,13718
1,10735


In [12]:
#Let's check it against the original
pd.crosstab(chis_df_small['sex'], columns='Total')

col_0,Total
sex,Unnamed: 1_level_1
1,10735
2,13718


In [16]:
#That works!  But I much prefer 'controlling' my operations, so I can be sure my code is running correctly.
#Plus, I like naming my dummy variable so I don't get confused as to what is coded a 1 and what is coded a 0
chis_df_small['female_dv']=chis_df_small['sex'].map({1:0, 2:1})
pd.crosstab(chis_df_small['female_dv'], columns='count')

col_0,count
female_dv,Unnamed: 1_level_1
0,10735
1,13718
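One behavior of `.map` worth knowing: any value that isn't a key in the dictionary becomes NaN (missing) in the new column. That is another reason to check your work with a count. A minimal sketch on a made-up series, where the -1 stands in for any code you didn't list in the dictionary:

```python
import pandas as pd

# Made-up codes; -1 is a value that is not in our mapping dictionary
toy_sex = pd.Series([1, 2, -1, 2], name="sex")

# Same recode as above: 1 (Male) -> 0, 2 (Female) -> 1
female_dv = toy_sex.map({1: 0, 2: 1})
print(female_dv)

# Count the rows that did not match any key and were set to NaN
n_unmapped = female_dv.isna().sum()
print("Unmapped rows:", n_unmapped)
```

If the unmapped count is larger than you expect, go back to the codebook and decide how those codes should be handled.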


### 2.2  Recoding Categorical Variables

A more complicated type of categorical variable is one where we have more than 2 categories - we find these all the time in planning surveys!  (And most are ordinal, meaning that the numbers assigned move either up or down in some logical order.)

    AM184: How Often Worry about Paying Rent/Mortgage
    (1=Very often, 2=Somewhat Often, 3=From Time to Time, 4=Almost Never)
    
We renamed the variable to **housing_worry**

In [17]:
#get_dummies works here too!  Remember, get_dummies drops the original column from its result, so save the result as a new dataframe
chis_df_1=pd.get_dummies(chis_df_small, columns=['housing_worry'])
chis_df_1

Unnamed: 0,ate_veg,sex,race_ethnicity,pov_cat,hh_inc,male_dv,female_dv,housing_worry_1,housing_worry_2,housing_worry_3,housing_worry_4
0,1,2,1,1,2,0,1,0,0,0,1
1,2,1,1,4,8,1,0,0,0,1,0
2,14,2,2,4,7,0,1,0,0,0,1
3,3,2,1,4,6,0,1,0,0,0,1
4,1,1,2,4,9,1,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...
24448,0,1,3,1,1,1,0,0,0,1,0
24449,14,2,1,1,3,0,1,0,0,1,0
24450,7,1,3,4,12,1,0,1,0,0,0
24451,7,2,1,1,1,0,1,0,0,0,1


In [18]:
#let's look at our data and be intentional
pd.crosstab(chis_df_small['housing_worry'], columns='count')

col_0,count
housing_worry,Unnamed: 1_level_1
1,1189
2,1971
3,5027
4,16266


In [19]:
#I decide I want to group together anyone who expresses concern,
#so I'm going to assign a 1 to codes 1, 2, and 3, and a 0 to anyone who almost never worries (code 4)
chis_df_small['housing_worry_dv']=chis_df_small['housing_worry'].map({1:1, 2:1, 3:1, 4:0})
pd.crosstab(chis_df_small['housing_worry_dv'], columns='count')

col_0,count
housing_worry_dv,Unnamed: 1_level_1
0,16266
1,8187
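As with everything in Python, there's another way to do this. Here is a sketch of the same grouping using `.isin`, shown on a made-up series (`toy_worry` is an assumption for this example). Note the difference from `.map`: with `.isin`, every value outside the list becomes 0, including any code you didn't plan for.

```python
import pandas as pd

# Made-up responses using the housing_worry codes (1-3 = some worry, 4 = almost never)
toy_worry = pd.Series([4, 3, 1, 2, 4], name="housing_worry")

# isin gives True/False for "is this value in the list?"; astype(int) turns that into 1/0
worry_dv = toy_worry.isin([1, 2, 3]).astype(int)
print(worry_dv.tolist())
```

Because unexpected codes would be counted as 0 here, `.map` is the safer choice when you're not sure every code is accounted for.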


## 3.0 Reclassifying Numeric Variables

While we don't always need to recode numeric variables, sometimes we need to address outliers and/or we want to make our numeric data more meaningful.  For example, with our "ate_veg" variable, we know we need to address the outlier, and we could "smooth" out the fact that folks like to respond in round weekly numbers (e.g., 7 or 14).

### 3.1 Dropping extreme values or outliers

Sometimes, you just need to drop some rows with specific values, or remove all outliers in the dataset.  Let's take a look at it with the ate_veg variable from last week.

In [21]:
chis_df_small['ate_veg'].describe()

count   24453.00
mean        9.77
std        10.19
min         0.00
25%         5.00
50%         7.00
75%        14.00
max       139.00
Name: ate_veg, dtype: float64

In [22]:
#Dropping outliers; this is the easiest way - I like saving the result as a new dataframe just in case the code
#does something I don't want it to
#I am going to make the decision that anyone who says they ate vegetables more than 70 times per week (10 times per day) is an outlier
chis_df_1 = chis_df_small[chis_df_small['ate_veg'] < 71] 
chis_df_1['ate_veg'].describe()

count   24366.00
mean        9.35
std         7.28
min         0.00
25%         5.00
50%         7.00
75%        14.00
max        69.00
Name: ate_veg, dtype: float64
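Comparing the two counts above (24453 before, 24366 after) tells us the filter dropped 87 rows. Below is a small sketch of how you might count dropped rows directly, using made-up values. The names `toy`, `keep`, and `toy_trimmed` are assumptions for this example.

```python
import pandas as pd

# Made-up weekly vegetable counts, including one extreme value
toy = pd.DataFrame({"ate_veg": [0, 5, 7, 14, 70, 139]})

# A boolean "mask": True for the rows we want to keep
keep = toy["ate_veg"] < 71

# ~keep flips True/False, so its sum is the number of rows being dropped
n_dropped = (~keep).sum()

# .copy() makes the filtered result its own dataframe
toy_trimmed = toy[keep].copy()
print(n_dropped, len(toy_trimmed))
```

It's worth looking at how many rows a filter removes before you commit to a cutoff.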

In [23]:
#How is the resulting dataframe different from the original?
chis_df_1.describe()

Unnamed: 0,ate_veg,sex,race_ethnicity,pov_cat,hh_inc,housing_worry,male_dv,female_dv,housing_worry_dv
count,24366.0,24366.0,24366.0,24366.0,24366.0,24366.0,24366.0,24366.0,24366.0
mean,9.35,1.56,2.45,3.33,9.62,3.49,0.44,0.56,0.33
std,7.28,0.5,1.48,1.03,5.95,0.84,0.5,0.5,0.47
min,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0
25%,5.0,1.0,2.0,3.0,4.0,3.0,0.0,0.0,0.0
50%,7.0,2.0,2.0,4.0,8.0,4.0,0.0,1.0,0.0
75%,14.0,2.0,2.0,4.0,15.0,4.0,1.0,1.0,1.0
max,69.0,2.0,6.0,4.0,19.0,4.0,1.0,1.0,1.0


### 3.2  Reclassifying a numeric variable

What if I want to actually turn this into a series of dummies, based on number of times per week?

In [27]:
# create DV for different levels of veggie consumption
# the three groups should not overlap: 0 times, 1 to 3 times, and 4 or more times per week
chis_df_small['no_veg_dv']=np.where((chis_df_small['ate_veg']==0),1,0)
chis_df_small['one_to_three_veg_dv']=np.where(((chis_df_small['ate_veg']>=1) & 
                                    (chis_df_small['ate_veg']<4)),1,0)
chis_df_small['morethan_four_veg_dv']=np.where((chis_df_small['ate_veg']>=4),1,0)

In [28]:
chis_df_small

Unnamed: 0,ate_veg,sex,race_ethnicity,pov_cat,hh_inc,housing_worry,male_dv,female_dv,housing_worry_dv,no_veg_dv,one_to_three_veg_dv,morethan_four_veg_dv
0,1,2,1,1,2,4,0,1,0,0,1,0
1,2,1,1,4,8,3,1,0,1,0,1,0
2,14,2,2,4,7,4,0,1,0,0,0,1
3,3,2,1,4,6,4,0,1,0,0,1,0
4,1,1,2,4,9,3,1,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
24448,0,1,3,1,1,3,1,0,1,1,0,0
24449,14,2,1,1,3,3,0,1,1,0,0,1
24450,7,1,3,4,12,1,1,0,1,0,0,1
24451,7,2,1,1,1,4,0,1,0,0,0,1


## 4.0  That's it for now!

By next Thursday's lab, you should have selected your dataset and 5-6 variables you plan to explore, including 1 outcome variable. Explore, clean, and reclassify each variable, and come with questions!