## 1 Working with Disaggregate Data

Today, we're going to play with characterizing and cleaning disaggregate data, and getting used to working with different types of variables.  

In [None]:
#Call our libraries

import numpy as np
import pandas as pd
import math
from datascience import *

pd.options.display.float_format = '{:.2f}'.format

In [None]:
# Here is my code for reading in the ADULT CHIS data - I have provided everyone with an extract of the data 
# so you don't need to run this cell today, but it might be helpful if you decide to work with the CHIS data
# for Assignment 2

#chis_df=pd.read_stata("ADULT.dta", columns = ["ac11","povll", "ab1", "racedf_p1", "ak28", "ak25", "ak10_p", "ak22_p1"])

# This code exports my extract to a csv
#chis_df.to_csv("chis_extract.csv", index=False)

# And this should look familiar - I'm reading the csv file I've created above.
chis_df=pd.read_csv("chis_extract.csv")

In [None]:
# type in the code to look at the data
chis_df

### 1.1 Renaming the variables

When you go to work with a new dataset, you need to spend some time with the codebook, figuring out what variables you want to analyse, and what each "code" means.  Here, I've renamed the variables for you so we can get to the fun stuff!

In [None]:
chis_df.rename(columns={"ac11":"number_sodas","povll":"poverty_line",
"ab1":"health",
"racedf_p1":"race_eth",
"ak28":"feel_safe",
"ak25":"tenure",
"ak10_p":"earnings",
"ak22_p1":"hh_income"}, inplace=True)

### 1.2 Assessing and Changing Data Types

In [None]:
#what data types are our variables?  
chis_df.dtypes

In [None]:
# Let's change the number of sodas from a "category" to a float.

chis_df["number_sodas"]=chis_df["number_sodas"].astype(float)
chis_df.dtypes

In [None]:
#Now try it for earnings

In [None]:
#What happened?  The "inapplicable" entry is a string, which Python refused to convert to a number.  
#We can force it to by telling it to "coerce" the new data type when it finds an error.

chis_df["earnings"]=pd.to_numeric(chis_df["earnings"], errors="coerce")
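
To see the effect of `errors="coerce"` in isolation, here is a minimal sketch on a toy Series. The values and the name `raw` are made up for illustration and are not taken from the CHIS extract.

```python
import pandas as pd

# Toy stand-in for the earnings column (hypothetical values).
raw = pd.Series(["52000", "INAPPLICABLE", "31000"])

# raw.astype(float) would raise a ValueError on the text entry.
# errors="coerce" marks anything that cannot be parsed as missing instead.
clean = pd.to_numeric(raw, errors="coerce")

print(clean)
print(clean.dtype)
```

Run it and compare the second entry before and after the conversion.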

In [None]:
# check your data types and look at your data again.  What has Python entered for the "Inapplicable" missing value? 


In [None]:
# We often want to know which variables have missing data, and how many observations are missing for each.
chis_df.isnull().sum()
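
A few related missing-data checks, sketched on a toy DataFrame. The data and the variable names below are invented for illustration; the same calls should work on `chis_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing values.
toy = pd.DataFrame({
    "earnings": [52000.0, np.nan, 31000.0, np.nan],
    "number_sodas": [0.0, 2.0, np.nan, 1.0],
})

# Number of missing observations in each column.
missing_per_column = toy.isnull().sum()

# Share of observations that are missing in each column.
share_missing = toy.isnull().mean()

# Number of rows with at least one missing value.
rows_with_any_missing = toy.isnull().any(axis=1).sum()

print(missing_per_column)
print(share_missing)
print(rows_with_any_missing)
```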

### 1.3  Understanding Numeric Variables

There are lots of different ways to explore numeric variables.  The two commands below are checks I always run on my variables before I start any analysis.

In [None]:
chis_df["number_sodas"].describe()

In [None]:
chis_df["number_sodas"].quantile([.1, .25, .50, .75, .99])

### 1.4 Exploring Categorical Data

To explore categorical, or binary, data, we use what's called a frequency table instead.  Again, there are lots of different ways to get to the same answer, but here are some useful commands for understanding categorical data.

In [None]:
chis_df["tenure"].value_counts()

In [None]:
chis_df["tenure"].value_counts().sort_index()

In [None]:
chis_df["tenure"].value_counts(normalize=True)
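
One thing to keep in mind with `value_counts` is that it leaves missing values out by default. The sketch below uses a made-up Series (not the CHIS data) to show the `dropna=False` option, which keeps them in the table.

```python
import numpy as np
import pandas as pd

# Hypothetical tenure responses, including one true missing value.
toy_tenure = pd.Series(["OWN", "RENT", "OWN", np.nan, "REFUSED"])

# Default behaviour: NaN is left out of the table.
counts = toy_tenure.value_counts()

# dropna=False gives NaN its own row.
counts_with_missing = toy_tenure.value_counts(dropna=False)

# normalize=True reports shares of the non-missing responses.
shares = toy_tenure.value_counts(normalize=True)

print(counts)
print(counts_with_missing)
print(shares)
```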

The pandas "crosstab" function is also helpful, especially when we want to create a frequency table with two variables.

In [None]:
pd.crosstab(chis_df["tenure"], columns="count")

### 1.5 Cleaning up our Categorical Data

In looking at the tenure variable, it's pretty clear we have a lot of "missing" values coded as other things.  What I really want to understand is the difference between renters and owners; the other categories are not as meaningful.  I don't want to drop these rows, since the people who "REFUSED" or selected "OTHER ARRANGEMENT" may have valuable answers to other questions.  Instead, I'd like to create a new dummy variable that is equal to 1 for owners, 0 for renters, and missing for every other category.  I'm going to use a Python "dictionary" to assign these new values.

In [None]:
chis_df["own_dv"]=chis_df["tenure"].map({"OWN":1, "RENT":0, "REFUSED":np.nan, "NOT ASCERTAINED":np.nan, "DON'T KNOW": np.nan, "OTHER ARRANGEMENT": np.nan})
pd.crosstab(chis_df["own_dv"], columns="count")
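
A side note on how `map` treats categories that are not in the dictionary: any value without a matching key comes back as missing. The sketch below, on a made-up Series, lists only the two categories of interest. Spelling out every category, as in the cell above, is still useful because it documents each decision.

```python
import pandas as pd

# Hypothetical tenure responses.
toy_tenure = pd.Series(["OWN", "RENT", "REFUSED", "OTHER ARRANGEMENT", "OWN"])

# Only OWN and RENT are listed; every other category becomes NaN.
own_dv = toy_tenure.map({"OWN": 1, "RENT": 0})

print(own_dv)
```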

Pandas also has a "get dummies" function.  It is a faster way of creating dummies, especially if you have a variable with lots of categories.  But how does it differ from the code above?  

In [None]:
newdata=pd.get_dummies(chis_df, columns=["tenure"])
newdata.tenure_OWN.value_counts()

Be very careful in how you use functions like this - if you have "messy" real life data, you could be creating your dummies in such a way that will profoundly affect your results.  I highly recommend using the more thoughtful process outlined above, at least until you get more familiar with inferential statistics.
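
To make the warning concrete, here is a small sketch comparing the two approaches on invented data. The DataFrame `toy` and both variable names are hypothetical.

```python
import pandas as pd

# Hypothetical responses: two owners, one renter, one refusal.
toy = pd.DataFrame({"tenure": ["OWN", "RENT", "REFUSED", "OWN"]})

# get_dummies: every row that is not OWN is coded 0, including REFUSED.
dummies = pd.get_dummies(toy, columns=["tenure"])
own_from_dummies = dummies["tenure_OWN"].astype(int)

# map: REFUSED is coded as missing and drops out of the calculation.
own_from_map = toy["tenure"].map({"OWN": 1, "RENT": 0})

# The two versions give different ownership rates.
print(own_from_dummies.mean())
print(own_from_map.mean())
```

The first rate treats the refusal as a non-owner, while the second is computed among owners and renters only.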

### 1.6  Grouping Data by Categories

Often, we want to know differences in outcomes across groups - let's see whether owners or renters drink more sodas.

In [None]:
chis_df["number_sodas"].groupby(chis_df["own_dv"]).mean()

In [None]:
chis_df["number_sodas"].groupby(chis_df["own_dv"]).agg(['min','max','mean', 'median'])

In [None]:
chis_df.groupby(['own_dv']).agg({'number_sodas': 'mean',
                                  'earnings' : 'mean'})
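
When reading these grouped tables, remember that `groupby` leaves out rows whose grouping value is missing by default. A minimal sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last respondent has no ownership status.
toy = pd.DataFrame({
    "own_dv": [1.0, 1.0, 0.0, 0.0, np.nan],
    "number_sodas": [0.0, 2.0, 3.0, 5.0, 10.0],
})

# The row with a missing own_dv does not contribute to either group.
by_tenure = toy.groupby("own_dv")["number_sodas"].mean()

print(by_tenure)
```

So the group means describe owners and renters only, which is what the dummy variable was built for.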

### 1.7 Two Variable Frequency Tables

If we want to explore two categorical variables, we need to rely on the pandas crosstab function.  We specify which variable we want along the rows (our "index" variable) and which variable we want along our columns.

In [None]:
health_poverty=pd.crosstab(index=chis_df["poverty_line"], columns=chis_df["health"])
health_poverty

In [None]:
# If we want Python to give us row and column totals, we specify that using
# the "margins=True" option within the crosstab function.
health_poverty=pd.crosstab(index=chis_df["poverty_line"], columns=chis_df["health"], margins=True)
health_poverty

In [None]:
# We can also get the percents by asking Python to normalize the data
health_poverty=pd.crosstab(index=chis_df["poverty_line"], columns=chis_df["health"], margins=True, normalize="index")
health_poverty
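
A quick way to check that you normalized in the direction you intended is to add up the shares. The sketch below uses a toy table with made-up category labels and confirms that each row sums to 1 when normalizing by "index".

```python
import pandas as pd

# Hypothetical categories, not the actual CHIS codes.
toy = pd.DataFrame({
    "poverty_line": ["LOW", "LOW", "LOW", "HIGH", "HIGH"],
    "health": ["GOOD", "POOR", "POOR", "GOOD", "GOOD"],
})

# Row percentages: within each poverty group, the share in each health category.
row_shares = pd.crosstab(index=toy["poverty_line"], columns=toy["health"], normalize="index")

# Each row should add up to 1.
row_sums = row_shares.sum(axis=1)

print(row_shares)
print(row_sums)
```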

In [None]:
#Try it here, this time "normalizing" (getting the percents for) the columns.

### 1.8 Next Steps

Work on answering the questions in the lab sheet below with a group of your classmates.  The goal for Thursday's lab is to apply these same Python skills to your own dataset, so it's worth practicing and getting comfortable with these commands.  Again, you don't have to memorize them. Just know what each of them does, and feel empowered to copy, paste, and run them to see what happens!