# Notebook 3: Reclassifying Variables

In Notebook 2, we learned how to import a .csv file, rename our variables, "slice" our dataframe using [[]], and examine the distribution of numeric variables using the .describe() and seaborn plotting functions.  We also learned how to use pd.crosstab to explore categorical variables.

In today's lecture, we're going to focus on:

> Turning our categorical variables into dummies

> Cleaning up numeric variables

As with everything in Python, there are lots of different ways to do the same thing, so we're providing some basic code so you have what you need for Assignment 4.  But you may find that when you work with your own data, you'll need to explore the web for other code.

## 1.0 Reading in our libraries, our dataset, and renaming our variables

The next few cells get us back to where we left off last week.  Remember, you need to run all the cells in order (libraries, read data, then rename data); otherwise Python will give you an error message!

In [1]:
# First, we're going to call in our libraries
# from datascience import *

import numpy as np
import pandas as pd
import math
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt  # plotting functions live in the pyplot submodule


pd.options.display.float_format = '{:.2f}'.format

In [2]:
#Show our plots in the Jupyter notebook
%matplotlib inline

In [3]:
#When we start working with nan (missing) values, we can get RuntimeWarning errors - we're going to ignore them here
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning) 
pd.options.mode.chained_assignment = None  # default='warn'

In [4]:
#Now we're going to read in our data and create a new dataframe

chis_df = pd.read_csv('chis_teen_extract.csv')
chis_df

Unnamed: 0,TF3,TL25,TL28,TQ15,SRSEX,ACE_T,DOCT_YR,SOCHESS3,OMBSRTN_P1,RACEDFT_P1,TB1_P1
0,2,2,2,1,1,2,1,11,5,4,2
1,1,2,2,2,1,2,1,9,2,6,3
2,2,2,2,1,1,2,1,9,2,6,1
3,2,3,3,2,1,2,1,9,7,8,4
4,2,1,3,3,1,2,1,5,2,6,3
...,...,...,...,...,...,...,...,...,...,...,...
1164,2,2,3,1,1,1,1,9,1,1,4
1165,2,2,2,2,2,1,1,8,1,1,2
1166,2,3,2,2,1,2,1,9,1,1,2
1167,1,1,3,4,2,2,2,8,1,1,2


### From here on out, you will need to change the data to reflect your variables

In [None]:
chis_df.rename(columns={"AE_VEGI": "ate_veg",
                        "SRSEX": "sex",
                        "OMBSRR_P1": "race_ethnicity",
                        "POVLL": "pov_cat",
                        "AK22_P1": "hh_inc",
                        "AM184": "housing_worry"}, inplace=True)

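One thing to watch for when you swap in your own variables: by default, `rename` quietly skips any old column name that isn't actually in the dataframe, so a typo won't raise an error. Here is a minimal sketch of a quick check, using a made-up two-column dataframe (`toy_df` and `rename_map` are hypothetical names, not part of the CHIS extract):

```python
import pandas as pd

# Hypothetical toy dataframe standing in for your own extract
toy_df = pd.DataFrame({"SRSEX": [1, 2], "TF3": [2, 1]})

# Old name -> new name; "AE_VEGI" is deliberately NOT in toy_df
rename_map = {"SRSEX": "sex", "AE_VEGI": "ate_veg"}

# List any old names that don't exist in the dataframe before renaming
missing = [old for old in rename_map if old not in toy_df.columns]
print("Not found in dataframe:", missing)

# rename() skips the missing name without complaining
toy_df = toy_df.rename(columns=rename_map)
print(list(toy_df.columns))
```

If the `missing` list isn't empty, check your spelling against the column names in the dataframe before moving on.
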
### Codebook


PUT IN YOUR OWN CODEBOOK HERE

##  2.0  Recoding and Cleaning Variables

### 2.1 Recoding Nominal Binary Variables into Dummies

We  often have to work with "nominal" variables (those with a "Name").  Note that in the CHIS data, the variables that are nominal (e.g. sex, race/ethnicity) are actually assigned number values rather than strings.   

Let's start with the "sex" variable.  It has two possible values, "Male" and "Female".  This is known as a binary or dichotomous variable.  But, even though they are represented by the numbers 1 and 2, we can't treat them as numbers - e.g., adding 2 males together doesn't give us a female.

    SRSEX: Self-reported Sex (1= Male, 2=Female)
    
We need to change this into a ***dummy*** variable. A dummy variable takes only two values - a 0 and a 1 - and in general, we give a "1" to the variable when the condition we're interested in exploring is met. 

In [None]:
#Let's look at the data
pd.crosstab(chis_df['sex'], columns='Total')

In [None]:
#Now I'm going to create a dummy; I like naming my dummy variable so I don't get confused as to what is coded a 1 and what is coded a 0
#Here 1 (Male) becomes 0 and 2 (Female) becomes 1
chis_df['female_dv']=chis_df['sex'].map({1:0, 2:1})
pd.crosstab(chis_df['female_dv'], columns='count')
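It helps to know what `.map()` does with a value that isn't in your dictionary: it turns it into `NaN` (missing). Here is a small sketch on made-up data (the `toy_df` dataframe and the stray code 9 are assumptions for illustration, not values from the CHIS file):

```python
import pandas as pd

# Toy data (hypothetical): 1 = Male, 2 = Female, plus one stray code (9)
toy_df = pd.DataFrame({"sex": [1, 2, 2, 1, 9]})

# Any value that is not a key in the dictionary becomes NaN
toy_df["female_dv"] = toy_df["sex"].map({1: 0, 2: 1})

# dropna=False keeps the NaN row visible in the counts
counts = toy_df["female_dv"].value_counts(dropna=False)
print(counts)
```

If you see unexpected missing values after recoding, go back to your codebook and make sure every code is covered in the dictionary.
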

### 2.2  Recoding Categorical Variables

A more complicated type of "nominal" variable is one where we have more than 2 categories - we find these all the time in planning surveys!  (And most are ordinal, meaning that the numbers assigned move either up or down in some logical way.)

    AM184: How Often Worry about Paying Rent/Mortgage
    (1=Very often, 2=Somewhat Often, 3=From Time to Time, 4=Almost Never)
    
We renamed the variable to **housing_worry**

In [None]:
#let's look at our data and be intentional
pd.crosstab(chis_df['housing_worry'], columns='count')

In [None]:
# I decide I want to group together anyone that expresses concern, 
#so I'm going to assign a 1 to 1,2,3, and a 0 to anyone who almost never worries (4)
chis_df['housing_worry_dv']=chis_df['housing_worry'].map({1:1, 2:1, 3:1, 4:0})
pd.crosstab(chis_df['housing_worry_dv'], columns='count')
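Collapsing the categories into a single dummy is one choice. Another common option is to make one dummy per category, which `pd.get_dummies` can do for you. This is a rough sketch on made-up data (`toy_df` and the `worry` prefix are hypothetical), not something Assignment 4 requires:

```python
import pandas as pd

# Toy data (hypothetical) using the same 1-4 coding as housing_worry
toy_df = pd.DataFrame({"housing_worry": [1, 2, 3, 4, 4, 2]})

# One 0/1 column per category: worry_1, worry_2, worry_3, worry_4
worry_dummies = pd.get_dummies(toy_df["housing_worry"], prefix="worry", dtype=int)

# Attach the new columns to the original dataframe
toy_df = pd.concat([toy_df, worry_dummies], axis=1)
print(toy_df)
```

Each row gets a 1 in exactly one of the new columns, so you can still see the difference between "Very often" and "From Time to Time" instead of lumping them together.
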

## 3.0 Reclassifying Numeric Variables

While we don't always need to recode numeric variables, sometimes we need to address outliers and/or we want to make our numeric data more meaningful.  For example, with our "ate_veg" variable, we know we need to address the outlier, and we could "smooth" out the fact that folks like to respond using weekly numbers.

### 3.1 Dropping extreme values or outliers

Sometimes, you just need to drop some rows with specific values, or remove all outliers in the dataset.  Let's take a look at it with the ate_veg variable from last week.

In [None]:
chis_df['ate_veg'].describe()

In [None]:
#Dropping outliers; this is the easiest way. It's often safer to give the filtered dataframe a new name 
#in case the code does something you don't want it to; here we overwrite chis_df so the later cells still work
#I am going to make the decision that anyone who says they ate vegetables more than 70 times per week (10 per day) is an outlier
chis_df = chis_df[chis_df['ate_veg'] < 71] 
chis_df['ate_veg'].describe()

In [None]:
#How is the resulting dataframe different from the original?
chis_df.describe()
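If you want to compare before and after directly, one approach is to save the filtered rows under a new name and count how many rows were lost. This is a sketch on made-up data (`toy_df` and `toy_trimmed` are hypothetical names). Note that rows where the value is missing (`NaN`) also fail the `< 71` test, so they get dropped too:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one extreme value (200) and one missing value
toy_df = pd.DataFrame({"ate_veg": [0, 3, 7, 14, 200, np.nan]})

# Keep the original and store the filtered rows under a new name
toy_trimmed = toy_df[toy_df["ate_veg"] < 71].copy()

# Compare row counts before and after the filter
rows_dropped = len(toy_df) - len(toy_trimmed)
print("Rows before:", len(toy_df))
print("Rows after:", len(toy_trimmed))
print("Rows dropped:", rows_dropped)
```

Here two rows are dropped: the outlier and the row with the missing value.
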

### 3.2  Reclassifying a numeric variable

What if I want to actually turn this into a series of dummies, based on number of times per week?

In [None]:
# create DVs for different levels of veggie consumption (times per week)
chis_df['no_veg_dv']=np.where((chis_df['ate_veg']==0),1,0)
# 1 to 3 times: start at 1 (not 0) so this group doesn't overlap with no_veg_dv
chis_df['one_to_three_veg_dv']=np.where(((chis_df['ate_veg']>=1) & 
                                    (chis_df['ate_veg']<4)),1,0)
# 4 or more times
chis_df['morethan_four_veg_dv']=np.where((chis_df['ate_veg']>=4),1,0)

In [None]:
chis_df
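As an alternative to writing one `np.where` line per group, `pd.cut` can sort a numeric variable into labeled bins in a single step. This is a minimal sketch on made-up data (`toy_df`, `veg_group`, and the labels are hypothetical); the bin edges are chosen to match the cutoffs of 0, 1 to 3, and 4 or more:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): times per week eating vegetables
toy_df = pd.DataFrame({"ate_veg": [0, 1, 3, 4, 10]})

# right=False makes each bin include its left edge and exclude its right edge:
# [0, 1) -> none, [1, 4) -> one_to_three, [4, inf) -> four_or_more
toy_df["veg_group"] = pd.cut(
    toy_df["ate_veg"],
    bins=[0, 1, 4, np.inf],
    right=False,
    labels=["none", "one_to_three", "four_or_more"],
)
print(toy_df)
```

The result is a single categorical column rather than three separate dummies; which one you want depends on what you plan to do with the variable later.
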

## 4.0  That's it for now!

By lab next week Thursday, you should have selected your dataset, and 5-6 variables you plan to explore, including 1 outcome variable. Explore, clean and reclassify each variable, and come with questions!