# Notebook 3: Reclassifying Variables

In Notebook 2, we learned how to import a .csv file, rename our variables, "slice" our dataframe using [[]], and examine the distribution of numeric variables using the .describe() and seaborn plotting functions.

In today's lab, we're going to focus on:

> Exploring categorical variables

> Turning our categorical variables into dummies

> Cleaning up numeric variables

As with everything in Python, there are lots of different ways to do the same thing, so we're providing some basic code so you have what you need for Assignment 4.  But you may find that when you work with your own data, you'll need to explore the web for other code.

## 1.0 Reading in our libraries, our dataset, and renaming our variables

The next few cells get us back to where we left off last week.  Remember, you need to run all the cells in order (libraries, read data, then rename data); otherwise Python will give you an error message!

In [None]:
# First, we're going to call in our libraries

import numpy as np
import pandas as pd
import math
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
from datascience import *

pd.options.display.float_format = '{:.2f}'.format

In [None]:
#Show our plots in the Jupyter notebook
%matplotlib inline

In [None]:
#When we start working with nan (missing) values, we can get RuntimeWarning errors - we're going to ignore them here
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning) 

In [None]:
#Now we're going to read in our data

# Here is my code for reading in the complete CHIS data

# col_list = ['AC47', 'AC42', 'SRSEX', 'AC46', 'POVLL', 'AE_VEGI', 'OMBSRR_P1','POVGWD_P1']
#chis_df=pd.read_csv("CHIS_2018_Adult_All.csv", usecols=col_list)
#chis_df

#today we're going to work with the extract as we did last week

chis_df = pd.read_csv('CHISextract2018.csv')
chis_df

In [None]:
chis_df.rename(columns={'AC47':'drank_water', 
                        'AC42':'nhood_fv', 
                        'AE_VEGI':'ate_fv',
                        'SRSEX': 'sex',
                        'AC46': 'drank_sweet',
                        'OMBSRR_P1': 'race_ethnicity',
                        'POVGWD_P1' : 'pov_ratio',
                       'POVLL' : 'pov_cat'}, inplace=True)
chis_df

### Codebook

> AC46: Number of times respondent drank sweet fruit drinks in past month

> AC47: Number of times respondent drank water yesterday

> AE_VEGI: Number of times respondent eats vegetables per week

> AC42: Number of times respondent was able to find fresh fruits/vegetables in neighborhood
(1=Never, 2=Sometimes, 3=Usually, 4=Always, 5=Doesn't eat f/v, 6=Doesn't shop for f/v, 7=Doesn't shop in neighborhood)

> SRSEX: Self-reported Sex (1= Male, 2=Female)

> OMBSRR_P1: Race/ethnicity
(1=Hispanic, 2= White NH, 3=Black NH, 4=AmIndian/Alaska Native NH, 5=Asian NH, 6=Other or two or more)

> POVLL: poverty level
(1 = 0-99% FPL, 2=100-199% FPL, 3=200-299% FPL, 4=300% FPL and above)

> POVGWD_P1: Family Poverty Threshold Level

##  2.0  Exploring Categorical (or Nominal) Variables

### 2.1 Exploring Nominal Binary Variables

In addition to numeric variables, we also often have to work with "nominal" variables (those with a "Name").  Note that in the CHIS data, the variables that are nominal (e.g. sex, race/ethnicity) are actually assigned number values rather than strings.   

Let's start with looking at the "sex" variable.  It has two possible values, "Male" and "Female".  This is known as a binary or dichotomous variable.  But, even though they are represented by the numbers 1 and 2, we can't treat them as numbers - e.g., adding 2 males together doesn't give us a female.

    SRSEX: Self-reported Sex (1= Male, 2=Female)
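
Because the codes are only labels, arithmetic on them runs without error but means nothing. Here is a minimal sketch on made-up data (the `toy_sex` values and the variable names are illustrative, not taken from CHIS) that contrasts averaging the codes with counting labeled categories:

```python
import pandas as pd

# Made-up responses using the SRSEX coding (1 = Male, 2 = Female)
toy_sex = pd.Series([1, 2, 2, 1, 2])

# pandas will happily average the codes, but the result has no interpretation
meaningless_mean = toy_sex.mean()

# Mapping the codes to labels makes the counts easier to read
toy_sex_label = toy_sex.map({1: 'Male', 2: 'Female'})
label_counts = toy_sex_label.value_counts()
print(label_counts)
```

Counting categories is a sensible summary for a nominal variable; the mean of the raw codes is not.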

In [None]:
chis_df[['sex']].head(5)

In [None]:
#A simple way to look at the distribution of a binary variable is to request the value_counts()
chis_df[['sex']].value_counts()

In [None]:
#another approach is to use the "crosstab" function available in pandas.  We'll be using crosstabs a lot when we do t-tests,
#so let's look at a simple example for now
pd.crosstab(index=chis_df['sex'], columns='Total')

In [None]:
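#we can also use the seaborn countplot function to look at the distribution visually
#(we pass the column as x= because newer seaborn versions treat the first positional argument as data, not x)
sns.countplot(x=chis_df['sex'])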

### 2.2  Creating a dummy out of a binary variable

Okay - we know from Tuesday's lecture that we need to change this into a ***dummy*** variable. A dummy variable takes only two values, 0 and 1, and in general we assign a 1 to the observations where the condition we're interested in exploring is met.
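
One reason the 0/1 coding is convenient: the mean of a dummy variable is the share of observations coded 1. Here is a small sketch on made-up data (the `toy_sex` values are illustrative only) using the same kind of mapping this notebook applies to the `sex` column:

```python
import pandas as pd

# Made-up responses using the SRSEX coding (1 = Male, 2 = Female)
toy_sex = pd.Series([1, 2, 2, 1, 2])

# Dummy for "male": 1 when the condition is met, 0 otherwise
toy_male_dv = toy_sex.map({1: 1, 2: 0})

# The mean of a 0/1 variable is the proportion of 1s
share_male = toy_male_dv.mean()
print(share_male)
```

With two of the five made-up respondents coded male, the mean works out to 2/5.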

In [None]:
#The fastest way to make dummies is to use the pandas "get_dummies" function.  Let's try it and see what happens
#notice that the "sex" column is replaced by two new variables
chis_df_1=pd.get_dummies(chis_df, columns=['sex'])
chis_df_1

In [None]:
#Whenever you classify or create a new variable, it's always a good idea to check your work to see if it did what you expected
pd.crosstab(chis_df_1['sex_1'], columns='Total')

In [None]:
#That works!  But I much prefer 'controlling' my operations, so I can be sure my code is running correctly.
#Plus, I like naming my dummy variable so I don't get confused as to what is coded a 1 and what is coded a 0
chis_df['male_dv']=chis_df['sex'].map({1:1, 2:0})
pd.crosstab(chis_df['male_dv'], columns='count')

### 2.3  Exploring Categorical Variables

A more complicated type of "nominal" variable is one where we have more than 2 categories - we find these all the time in planning surveys!  (And most are ordinal, meaning that the numbers assigned move either up or down in some logical way.)

    nhood_fv: Number of times respondent was able to find fresh fruits/vegetables in neighborhood
    (1=Never, 2=Sometimes, 3=Usually, 4=Always, 5=Doesn't eat f/v, 6=Doesn't shop for f/v, 7=Doesn't shop in neighborhood)

In [None]:
#We can explore a categorical variable in just the same way as we did above - try exploring nhood_fv on your own


In [None]:
#Create a plot for it

In [None]:
#Get dummies works here too!  Remember, get dummies removes the original variable, so create a new dataframe


In [None]:
#But, I actually want to reclassify my variable, so again, I'm going to take a more intentional approach
# here 1 = Never, Sometimes, or Usually able to find fresh f/v, and 0 = Always able to find them
# the np.nan says to assign a 'missing' value for any observation where the person answered 5, 6, or 7
chis_df['nhood_fv_dv']=chis_df['nhood_fv'].map({1:1, 2:1, 3:1, 4:0, 5:np.nan, 6:np.nan, 7:np.nan})
pd.crosstab(chis_df['nhood_fv_dv'], columns='count')
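
One way to check a recode like this is to cross-tabulate the original codes against the new variable, so you can see exactly where each code landed. The sketch below uses made-up data (the `toy` dataframe is illustrative only) and converts the new column to strings first, so the missing values show up as their own `'nan'` column instead of silently disappearing from the table:

```python
import numpy as np
import pandas as pd

# Made-up responses using the AC42 coding (1-4 substantive answers, 5-7 not applicable)
toy = pd.DataFrame({'nhood_fv': [1, 2, 3, 4, 4, 5, 6, 7, 4, 2]})

recode = {1: 1, 2: 1, 3: 1, 4: 0, 5: np.nan, 6: np.nan, 7: np.nan}
toy['nhood_fv_dv'] = toy['nhood_fv'].map(recode)

# Convert to strings so that nan becomes the visible label 'nan'
new_as_text = toy['nhood_fv_dv'].astype(str)

# Rows = original codes, columns = recoded values
check = pd.crosstab(toy['nhood_fv'], new_as_text)
print(check)
```

Each original code should appear in exactly one column. If a code shows up somewhere you didn't expect, the mapping dictionary is the place to look.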

In [None]:
#Here's another approach that works too, this time for the race/ethnicity data, creating a dummy for Hispanic
chis_df['hispanic_dv']=np.where((chis_df['race_ethnicity'] == 1), 1,0)
chis_df

In [None]:
#Create dummies for the rest of your race/ethnicity values

In [None]:
#Check your work - how can you be sure you did it right?

## 3.0 Reclassifying Numeric Variables

While we don't always need to recode numeric variables, sometimes we need to address outliers and/or we want to make our numeric data more meaningful.  For example, with our "drank_water" variable, we know we need to address both the "99" value and perhaps "smooth" out the fact that folks like to respond using even numbers.

### 3.1 Dropping extreme values or outliers

Sometimes you just need to drop rows with specific values, or remove all outliers in the dataset.  Let's take a look at how to do that with the drank_water variable from last week.
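
Before working with the real data, here is a hedged sketch of two common ways to handle a value like 99, assuming it is a placeholder code rather than a real count (that assumption, the `toy` dataframe, and the variable names are all illustrative). The first option filters the rows out; the second keeps every row but marks the value as missing:

```python
import numpy as np
import pandas as pd

# Made-up data with two placeholder 99s
toy = pd.DataFrame({'drank_water': [2, 8, 99, 4, 6, 99]})

# Option 1: keep only rows below 99 (the result is a new, shorter dataframe)
toy_dropped = toy[toy['drank_water'] < 99]

# Option 2: keep every row, but replace 99 with a missing value
toy_marked = toy.copy()
toy_marked['drank_water'] = toy_marked['drank_water'].replace(99, np.nan)

print(len(toy_dropped), len(toy_marked))
print(toy_dropped['drank_water'].mean(), toy_marked['drank_water'].mean())
```

Both options give the same mean for this column, because pandas skips missing values when it averages. The difference is that the second option keeps the respondents' other answers in the dataframe.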

In [None]:
chis_df['drank_water'].describe()

In [None]:
sns.countplot(x=chis_df['drank_water']);

In [None]:
#Dropping outliers; this is the easiest way - I like saving the result to a new dataframe just in case the code
#does something I don't want it to
chis_df_3 = chis_df[chis_df['drank_water'] < 99] 
chis_df_3['drank_water'].describe()

In [None]:
#You can also use a global cutoff, like a threshold for Z-scores, to omit certain observations from the dataframe
#Note that "nan" values can lead to error messages, so we "omit" missing values when calculating the Z-scores
#Caution: a comparison with nan is always False, so any row with a missing value in any column is dropped here too

chis_df_z=chis_df[(np.abs(stats.zscore(chis_df, nan_policy='omit'))<3).all(axis=1)]
chis_df_z.describe()
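
To see what the Z-score filter does, it can help to apply it to a single column by hand. The sketch below uses made-up data (the `toy` dataframe is illustrative only) and computes the Z-score with pandas instead of scipy: subtract the mean, then divide by the population standard deviation (`ddof=0`, which is also the default in `stats.zscore`). pandas skips missing values in both calculations.

```python
import numpy as np
import pandas as pd

# Made-up data: fifteen ordinary answers, one extreme value, and one missing value
toy = pd.DataFrame({'drank_water': [4] * 5 + [6] * 5 + [8] * 5 + [99, np.nan]})

col = toy['drank_water']

# Z-score: how many standard deviations each value sits from the mean
z = (col - col.mean()) / col.std(ddof=0)

# Keep rows within 3 standard deviations; nan < 3 evaluates to False
keep = z.abs() < 3
toy_z = toy[keep]

print(len(toy), len(toy_z))
```

Two rows are removed here: the 99, and the row with the missing value. If you want to keep rows with missing values, you would need to handle them separately, for example by adding `| col.isna()` to the condition.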

In [None]:
#How is the resulting dataframe different from the original?
chis_df.describe()

### 3.2  Reclassifying a numeric variable

Let's create dummy variables to replace our drank water numeric variable.  Again, lots of ways to do this, but here is some helpful code when you want to specify precisely the "bins" you want to put your numeric data into.

In [None]:
# create DVs for three bins of water consumption: 4 or fewer, 5 to 9, and 10 or more times
chis_df['fourorless_wtr_dv']=np.where((chis_df['drank_water']<5),1,0)
chis_df['fivetoten_wtr_dv']=np.where(((chis_df['drank_water']>=5) & 
                                    (chis_df['drank_water']<10)),1,0)
chis_df['tenormore_wtr_dv']=np.where((chis_df['drank_water']>=10),1,0)
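
As one alternative to writing a separate `np.where` for each bin, `pd.cut` can assign every value to a labeled bin in one step, and `pd.get_dummies` can then turn the labels into dummies. This is a sketch on made-up data (the `toy` dataframe and the `wtr_cat` column name are illustrative only), using the same cut points as above. Setting `right=False` makes each bin include its lower edge and exclude its upper edge.

```python
import numpy as np
import pandas as pd

# Made-up data, including a missing value
toy = pd.DataFrame({'drank_water': [0, 4, 5, 9, 10, 20, np.nan]})

# Bins: below 5, 5 up to (not including) 10, and 10 or more
bins = [-np.inf, 5, 10, np.inf]
labels = ['fourorless', 'fivetoten', 'tenormore']
toy['wtr_cat'] = pd.cut(toy['drank_water'], bins=bins, labels=labels, right=False)

# One 0/1 column per bin
dummies = pd.get_dummies(toy['wtr_cat'], dtype=int)

print(toy)
print(dummies)
```

One difference to be aware of: `pd.cut` leaves a missing value as missing in `wtr_cat`, while the `np.where` approach codes a missing value as 0 in every dummy.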

In [None]:
chis_df

## 4.0  That's it for now!

By next Thursday's lab, you should have selected your dataset and the 5-6 variables you plan to explore, including 1 outcome variable. Explore, clean, and reclassify each variable, and come with questions!