# Notebook 4: Exploring Bivariate Relationships and Working with Weights

In Notebook 3, we learned how to explore our variables, recode categorical variables into dummies, and clean and/or classify our numeric variables.

In today's lab, we're going to focus on

> Exploring relationships between two variables

We will also be introducing the concept of weights.  **USING WEIGHTS THIS SEMESTER IS OPTIONAL.**  But it is helpful to know what they are and how they work if you ever encounter them in a professional setting.

## 1.0 Reading in our libraries, our dataset, and renaming our variables

Just the intro material!  Remember, you need to run all the cells in order - libraries, read data, and rename data, otherwise Python will give you an error message!  Note that there is a new datafile that includes the RAKEDW0 variable - this is our weight variable.

In [None]:
# First, we're going to call in our libraries

import numpy as np
import pandas as pd
import math
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
import scipy 
from datascience import *

pd.options.display.float_format = '{:.2f}'.format

In [None]:
#Show our plots in the Jupyter notebook
%matplotlib inline

In [None]:
#When we start working with nan (missing) values, we can get warnings - we're going to ignore them here
import warnings
warnings.filterwarnings("ignore") 

In [None]:
#Now we're going to read in our data

# Here is my code for reading in the complete CHIS data 2

# col_list = ['AC47', 'AC42', 'SRSEX', 'AC46', 'POVLL', 'AE_VEGI', 'OMBSRR_P1','POVGWD_P1','RAKEDW0']
#chis_df=pd.read_csv("CHIS_2018_Adult_All.csv", usecols=col_list)
#chis_df

#today we're going to work with the extract as we did last week

chis_df = pd.read_csv('chis_extract_2018_weights.csv')
chis_df

In [None]:
chis_df.rename(columns={'AC47':'drank_water', 
                        'AC42':'nhood_fv', 
                        'AE_VEGI':'ate_fv',
                        'SRSEX': 'sex',
                        'AC46': 'drank_sweet',
                        'OMBSRR_P1': 'race_ethnicity',
                        'POVGWD_P1' : 'pov_ratio',
                       'POVLL' : 'pov_cat',
                       'RAKEDW0': 'weight'}, inplace=True)
chis_df

### Codebook

> AC46: Number of times respondent drank sweet fruit drinks in past month

> AC47: Number of times respondent drank water yesterday

> AE_VEGI: Number of times respondent eats vegetables per week

> AC42: Number of times respondent was able to find fresh fruits/vegetables in neighborhood
(1=Never, 2=Sometimes, 3=Usually, 4=Always, 5=Doesn't eat f/v, 6=Doesn't shop for f/v, 7=Doesn't shop in neighborhood)

> SRSEX: Self-reported Sex (1= Male, 2=Female)

> OMBSRR_P1: Race/ethnicity
(1=Hispanic, 2= White NH, 3=Black NH, 4=AmIndian/Alaska Native NH, 5=Asian NH, 6=Other or two or more)

> POVLL: poverty level
(1 = 0-99% FPL, 2=100-199% FPL, 3=200-299% FPL, 4=300% FPL and above)

> POVGWD_P1: Family Poverty Threshold Level

> RAKEDW0: Individual weight

##  2.0  Exploring Bivariate Relationships

### 2.1 Hypothesis

Let's start by reminding ourselves why we're doing all this data cleaning!  I am a city planner interested in the issue of soda taxes.   I am concerned that people in poverty will disproportionately bear the burden of a soda tax.  

**My hypothesis is that people who are poor are more likely to drink sweet fruit drinks/sodas, so this is a regressive tax.**

>  Y Variable: Number of Sodas/Sweet Drinks (AC46 - renamed drank_sweet)

>  X Variable: Ratio of income to poverty line  (POVGWD_P1  - renamed pov_ratio)

>  Alternate X Variable: Categorical poverty level (POVLL - renamed pov_cat)

Below, I've included the code I used to clean each of the variables, including notes about what I did and why!

**I'm also going to create a "text" race/ethnicity variable to explore if there might be differences by race/ethnicity.  This is mostly to show you another option for working with your data in Python!**


### 2.2  Cleaning my variables

#### 2.2a Clean my Y variable (numeric)

In [None]:
#Describe the distribution of my data
chis_df['drank_sweet'].describe()

In [None]:
#Look at it visually with a histogram - clear there's lots of 0's, and 300 is a clear outlier
sns.set(style='whitegrid', palette="deep", font_scale=1.1, rc={"figure.figsize": [8, 5]})
sns.distplot(
    chis_df['drank_sweet'], norm_hist=False, kde=False, bins=20, hist_kws={"alpha": 1}
).set(xlabel='drank_sweet', ylabel='Count');

In [None]:
#drop observations 3 or more StDev from the mean for the "drank_sweet" variable
#(rows where "drank_sweet" is missing are dropped too, because NaN < 3 evaluates to False)
chis_df=chis_df[(np.abs(stats.zscore(chis_df['drank_sweet'], nan_policy='omit'))<3)]
chis_df.describe()
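
To see what the z-score filter above is doing, here is a minimal sketch on made-up data. It computes the z-score by hand in pandas instead of calling `stats.zscore` (using the population standard deviation, `ddof=0`, which I believe is scipy's default). The `toy` data frame and its `drinks` column are assumptions for illustration, not CHIS data.

```python
import numpy as np
import pandas as pd

# Made-up data: twenty small values, one extreme value, and one missing value
toy = pd.DataFrame({"drinks": [0, 1, 2, 3] * 5 + [300, np.nan]})

# z-score by hand: (value - mean) / population standard deviation (ddof=0)
z = (toy["drinks"] - toy["drinks"].mean()) / toy["drinks"].std(ddof=0)

# Keep rows with |z| < 3; NaN < 3 evaluates to False, so the missing row is dropped too
kept = toy[np.abs(z) < 3]

print(len(toy), len(kept))
print(kept["drinks"].max())
```

The filter removes both the extreme value and the missing value, so it is worth comparing the row counts before and after to see how many observations were lost.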

In [None]:
#Check distribution again - looks better! 
sns.set(style='whitegrid', palette="deep", font_scale=1.1, rc={"figure.figsize": [8, 5]})
sns.distplot(
    chis_df['drank_sweet'], norm_hist=False, kde=False, bins=20, hist_kws={"alpha": 1}
).set(xlabel='drank_sweet', ylabel='Count');

#### 2.2b  Clean my X variable (poverty as a ratio)

In [None]:
chis_df['pov_ratio'].describe()

In [None]:
#Look at it visually with a histogram - this looks pretty good so I'm going to leave it as is
sns.set(style='whitegrid', palette="deep", font_scale=1.1, rc={"figure.figsize": [8, 5]})
sns.distplot(
    chis_df['pov_ratio'], norm_hist=False, kde=False, bins=20, hist_kws={"alpha": 1}
).set(xlabel='pov_ratio', ylabel='Count');

#### 2.2c Clean my X variable (poverty as category)

In [None]:
# Look at the distribution of values
pd.crosstab(chis_df['pov_cat'], columns='Total')

In [None]:
#Because I am most concerned about households living under the poverty line, 
#I'm going to create a dummy where 1 = under the poverty line, and 0 is above

chis_df['inpoverty_dv']=chis_df['pov_cat'].map({1:1, 2:0, 3:0, 4:0})

In [None]:
# Remember that it's good to double check when creating new variables
pd.crosstab(chis_df['inpoverty_dv'], columns='Total')
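
One reason the double check matters: `.map()` with a dictionary turns any value that is not a key in the dictionary into a missing value instead of a 0. Here is a minimal sketch on made-up data (the `toy` data frame and the stray code 9 are assumptions for illustration):

```python
import pandas as pd

# Made-up poverty categories, including one unexpected code (9)
toy = pd.DataFrame({"pov_cat": [1, 2, 3, 4, 1, 9]})

# Codes that are not keys in the dictionary become NaN rather than 0
toy["inpoverty_dv"] = toy["pov_cat"].map({1: 1, 2: 0, 3: 0, 4: 0})

# dropna=False keeps the NaN row in the tally so unmapped codes are visible
print(toy["inpoverty_dv"].value_counts(dropna=False))
```

If the tally shows missing values you did not expect, go back and check which codes were left out of the dictionary.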

#### Sometimes, I want a categorical variable with text so I can quickly look at my data - here's one way to create a new categorical variable with the values replaced by text strings.

In [None]:
#This code creates a new column with a text race/ethnicity variable based on the numeric codes
#OMBSRR_P1: Race/ethnicity (1=Hispanic, 2= White NH, 3=Black NH, 4=AmIndian/Alaska Native NH, 5=Asian NH, 6=Other or two or more)
#Note that codes 4 and 6 are combined under a single "Other/Two Races" label
chis_df.loc[(chis_df['race_ethnicity'] == 2), 'race_eth_text'] = 'NHWhite'  
chis_df.loc[(chis_df['race_ethnicity']==5), 'race_eth_text'] = "Asian"
chis_df.loc[(chis_df['race_ethnicity']==3), 'race_eth_text'] = "Black"
chis_df.loc[(chis_df['race_ethnicity']==1), 'race_eth_text'] = "Hispanic"
chis_df.loc[(chis_df['race_ethnicity']==4), 'race_eth_text'] = "Other/Two Races"
chis_df.loc[(chis_df['race_ethnicity']==6), 'race_eth_text'] = "Other/Two Races"
chis_df['race_eth_text'].value_counts()
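
Another option, sketched here on made-up data, is to put the labels in a dictionary and use `.map()`, the same method used for the poverty dummy. The `toy` data frame and the `labels` dictionary are assumptions for illustration; the grouping copies the `.loc` statements above.

```python
import pandas as pd

# Made-up codes using the OMBSRR_P1 scheme from the codebook
toy = pd.DataFrame({"race_ethnicity": [1, 2, 3, 4, 5, 6, 2]})

# Same grouping as the .loc statements: codes 4 and 6 share one label
labels = {
    1: "Hispanic",
    2: "NHWhite",
    3: "Black",
    4: "Other/Two Races",
    5: "Asian",
    6: "Other/Two Races",
}
toy["race_eth_text"] = toy["race_ethnicity"].map(labels)

print(toy["race_eth_text"].value_counts())
```

Both approaches should give the same column. The dictionary version keeps all the labels in one place, which makes them easier to review.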

### 2.3  Exploring relationships

Now that I've cleaned my data, I can start to explore whether or not there are relationships between my Y and X variables.  I'm going to explore whether there are any observable differences in the average number of sweet drinks a person consumes by my poverty variables.  

#### 2.3.1  Exploring a numeric variable grouped by a dummy or categorical variable

In [None]:
#Let's see if the average number of sweet drinks varies by poverty category
chis_df["drank_sweet"].groupby(chis_df["pov_cat"]).mean()

In [None]:
#If I want to look at different metrics, like median, min or max, I can do that too
#What do each of these columns mean?
chis_df["drank_sweet"].groupby(chis_df["pov_cat"]).agg(['count', 'sum','mean', 'median', 'min', 'max', 'std', 'sem'])
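
As a hint for two of the less familiar columns: `std` is the sample standard deviation, and `sem` is the standard error of the mean, which is the standard deviation divided by the square root of the count. Here is a small sketch on made-up data that checks this by hand (the `toy` data frame and the `sem_by_hand` column are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data: two groups of four respondents each
toy = pd.DataFrame({
    "group": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "drinks": [0, 2, 4, 6, 1, 1, 3, 3],
})

summary = toy["drinks"].groupby(toy["group"]).agg(["count", "mean", "std", "sem"])

# 'std' is the sample standard deviation (ddof=1);
# 'sem' is the standard error of the mean: std / sqrt(count)
summary["sem_by_hand"] = summary["std"] / np.sqrt(summary["count"])
print(summary)
```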

In [None]:
#Try it yourself - calculate mean scores for a numeric variable, grouping by a dummy in your dataset


In [None]:
#I can also explore by race/ethnicity - here's where the "text" version benefits me - 
#I don't have to remember what the number values stand for
chis_df["drank_sweet"].groupby(chis_df["race_eth_text"]).agg(['count', 'sum','mean', 'median', 'min', 'max', 'std', 'sem'])

In [None]:
#A useful tip for Assignment 4 - you can assign the analysis above to an object, and then export it as a .csv
results_race_eth=chis_df["drank_sweet"].groupby(chis_df["race_eth_text"]).agg(['count', 'sum','mean', 'median', 'min', 'max', 'std', 'sem'])
results_race_eth.to_csv("results_race_eth.csv", index= True)

#### 2.3.2  Exploring the relationship between two dummy or categorical variables

In [None]:
pd.crosstab(index=chis_df["race_eth_text"], columns=chis_df["inpoverty_dv"])

In [None]:
#If I want to explore the relationship between two dummy or categorical variables, I'm going to use the pandas crosstab function.  
#Take a minute to explore what the margins and normalize code did.  
pd.crosstab(index=chis_df["race_eth_text"], columns=chis_df["inpoverty_dv"], margins=True, normalize='columns')
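
If the effect of `normalize` is hard to see in the full dataset, this minimal sketch on made-up data may help. The `toy` data frame is an assumption for illustration, and `margins` is left out to keep the output small.

```python
import pandas as pd

# Made-up data: group "a" has 3 respondents, group "b" has 5
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b", "b", "b"],
    "inpoverty_dv": [1, 0, 0, 1, 1, 1, 0, 0],
})

# normalize='columns': each column sums to 1
# (of everyone with a given poverty status, what share is in each group?)
by_col = pd.crosstab(index=toy["group"], columns=toy["inpoverty_dv"], normalize="columns")

# normalize='index': each row sums to 1
# (within each group, what share is in poverty?)
by_row = pd.crosstab(index=toy["group"], columns=toy["inpoverty_dv"], normalize="index")

print(by_col)
print(by_row)
```

The two tables answer different questions, so the sentence you write about each one needs a different denominator.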

In [None]:
#Normalize by index

#### Let's write out two sentences that describe these data that we can refer back to.

Replace this with a sentence describing the data normalized by columns.

Replace this with a sentence describing the data normalized by rows (index).

#### 2.3.3  Exploring the relationship between two numeric variables

In [None]:
# Exploring the relationship between two numeric variables is generally done with a scatterplot
# Because both are numeric, there are too many values to use "groupby"
plt.scatter(chis_df["pov_ratio"], chis_df["drank_sweet"], s=5)
plt.xlabel("Poverty Ratio")
plt.ylabel("Drank Sweet/Soda Drinks")

#it's still pretty hard to assess, right? It's because we have so many observations
#and because there are a lot of people who drink soda!
#That's why we'll calculate correlations later on!

In [None]:
#another fun way to visualize!!
sns.relplot(x="pov_ratio", y="drank_sweet", hue="race_eth_text", data=chis_df);

## 3 Survey Weights 

When official agencies run a survey (like CHIS, PUMS, AHS, and NHTS), they often include weights that let the user calculate total population estimates from survey responses and reduce survey bias. Put simply, weighting helps make our sample of survey respondents (more) representative of the population.  The weighting process usually involves three steps: (i) obtain the design weights, which account for sample selection; (ii) adjust these weights to compensate for nonresponse; and (iii) adjust the weights so that the estimates match known population totals, which is called calibration.

The literature on weighting is vast, and those of you who move on to more advanced statistical techniques will learn a lot more about weighting than I can do justice to here.  But it is useful to explore how using weights in descriptive statistics can change your results.
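
To make the idea concrete, here is a minimal sketch with three made-up respondents (not CHIS data). Each weight is the number of people in the population that the respondent stands in for. The `toy` data frame and its values are assumptions for illustration, and `np.average` computes sum(x * w) / sum(w).

```python
import numpy as np
import pandas as pd

# Made-up survey: three respondents, each standing in for a different number of people
toy = pd.DataFrame({
    "drinks": [0, 2, 10],
    "weight": [500, 300, 200],
})

unweighted = toy["drinks"].mean()                            # every respondent counts once
weighted = np.average(toy["drinks"], weights=toy["weight"])  # sum(x * w) / sum(w)
population = toy["weight"].sum()                             # people the sample represents

print(unweighted, weighted, population)
```

In this toy case the unweighted mean is 4.0 and the weighted mean is 2.6. The weighted mean is lower because the respondent who drinks the most represents the fewest people.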

Statistical packages tend to have sophisticated functions to apply weights - I wasn't able to find similarly elegant solutions for Python.  But below are two approaches if you'd like to try using weights in your descriptive results.

**3.1. Weighting: First Approach for Means and Total Counts**

The cell below is a set of code where I define two separate "helper" functions.  The first is to calculate a weighted mean, the second is to create a weighted sum.  You run this cell, and then you use the **w_mean** and **w_sum** functions in subsequent cells to specify the output you want.  

In [None]:
# Sample weights helper function for weighted mean.
def w_mean(frame, mean_var, weight):  #this line of code defines the function w_mean, as having to specify a data frame, a variable, and a weight
    valid = frame[mean_var].notna() & frame[weight].notna()  #only use rows where both the value and the weight are present
    d = frame.loc[valid, mean_var]
    w = frame.loc[valid, weight]
    if w.sum() == 0:  #no usable weights - pandas would return nan or inf here rather than raise an error
        return np.nan
    return (d * w).sum() / w.sum() #this calculates the weighted mean

# Sample weights helper function for weighted sum.
def w_sum(frame, sum_var, weight):
    d = frame[sum_var]
    w = frame[weight]
    return (d * w).sum() #rows with a missing value or weight are skipped by .sum()

In [None]:
#Let's first take a look at the distribution of sweet drinks by race/ethnicity without weights
chis_df["drank_sweet"].groupby(chis_df["race_eth_text"]).mean()

In [None]:
# Now, with weights - what happens to the mean number of sweet drinks by race/ethnicity?
chis_df.groupby('race_eth_text').apply(w_mean, 'drank_sweet', 'weight')

In [None]:
#Again, let's first look at the number of respondents who are under the poverty line
chis_df['inpoverty_dv'].value_counts()

In [None]:
#now, with weights
w_sum(chis_df, 'inpoverty_dv', 'weight')

In [None]:
#The number of respondents by race/ethnicity who are under the poverty line without weights
chis_df["inpoverty_dv"].groupby(chis_df["race_eth_text"]).agg(['sum'])

In [None]:
# Weighted counts of the number of people under the poverty line by race/ethnicity
chis_df.groupby('race_eth_text').apply(w_sum, 'inpoverty_dv', 'weight')
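
If you'd rather not use `.apply()`, the same weighted mean by group can be built from ordinary column arithmetic. This is a sketch on made-up data, assuming there are no missing values in the variable or the weight. The `toy` data frame and the `drinks_x_weight` column are assumptions for illustration.

```python
import pandas as pd

# Made-up data: two groups, two respondents each
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "drinks": [0, 10, 2, 4],
    "weight": [300, 100, 50, 150],
})

# Weighted numerator for each row, then sum the numerator and the weights within each group
toy["drinks_x_weight"] = toy["drinks"] * toy["weight"]
sums = toy.groupby("group")[["drinks_x_weight", "weight"]].sum()

# Weighted mean per group = sum(x * w) / sum(w)
weighted_means = sums["drinks_x_weight"] / sums["weight"]
print(weighted_means)
```

Seeing the numerator and denominator as separate columns can make it easier to check the calculation by hand.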

**3.2. Weighting: Approach 2 - use weightedcalcs library**

I found this cool library on GitHub that does weighted calculations for you.  The example Python notebook can be found here:
https://github.com/jsvine/weightedcalcs/tree/master/examples/notebooks .  This also provides an example of how you sometimes have to install a new library, even in datahub.  The command is pip install --user and then the name of the library.

In [None]:
%pip install --user weightedcalcs

In [None]:
import weightedcalcs as wc

In [None]:
#this line of code assigns which variable will be the "weight" variable in the calculator
calc = wc.Calculator("weight")

In [None]:
chis_df['drank_sweet'].describe()

In [None]:
#here I calculate the weighted mean number of drinks
calc.mean(chis_df, "drank_sweet").round()

In [None]:
#without weights
pd.crosstab(chis_df['inpoverty_dv'], columns='Total', normalize=True)

In [None]:
#with weights
calc.distribution(chis_df, "inpoverty_dv").round(3).sort_values(ascending=False)
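
As I understand it, a weighted distribution is the sum of the weights in each category divided by the total weight. Here is a sketch of that calculation in plain pandas on made-up data. It does not use weightedcalcs, and the `toy` data frame is an assumption for illustration.

```python
import pandas as pd

# Made-up data: five respondents with very different weights
toy = pd.DataFrame({
    "inpoverty_dv": [1, 0, 0, 1, 0],
    "weight": [400, 100, 100, 200, 200],
})

# Unweighted share: proportion of respondents in each category
unweighted = toy["inpoverty_dv"].value_counts(normalize=True)

# Weighted share: sum of weights in each category divided by the total weight
weighted = toy.groupby("inpoverty_dv")["weight"].sum() / toy["weight"].sum()

print(unweighted)
print(weighted)
```

In this toy case 40% of respondents are in poverty, but they represent 60% of the weighted population, because the respondents in poverty carry larger weights.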

In [None]:
#if you want to calculate statistics across a groupby variable, you need to create a new object
grp_race_eth= chis_df.groupby(["race_eth_text"])

In [None]:
calc.mean(grp_race_eth, "drank_sweet")

## 4  Conclusion

If you decide not to apply weights, just be clear that the N in your table and any descriptive statistics are based on the sample data, and not the population.  As with everything, as long as you document your choices, you'll be using data ethically!