# Notebook 4: Exploring Bivariate Relationships 

In previous notebooks, we learned how to explore our variables, recode categorical variables into dummies, and clean and/or classify our numeric variables.

In today's lab, we're going to focus on

> Exploring relationships between two variables

## 1.0 Reading in our libraries, our dataset, and renaming our variables

Just the intro material!  Remember, you need to run all the cells in order (libraries, read data, then rename data); otherwise, Python will give you an error message!

In [None]:
# First, we're going to call in our libraries

import numpy as np
import pandas as pd
import math
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
import scipy 
from datascience import *

pd.options.display.float_format = '{:.2f}'.format

In [None]:
#Show our plots in the Jupyter notebook
%matplotlib inline

In [None]:
#When we start working with nan (missing) values, we can get warnings - we're going to ignore them here
import warnings
warnings.filterwarnings("ignore") 

In [None]:
#Now we're going to read in our data

# Here is my code for reading in the complete CHIS data

# col_list = ['AC47', 'AC42', 'SRSEX', 'AC46', 'POVLL', 'AE_VEGI', 'OMBSRR_P1','POVGWD_P1','RAKEDW0']
#chis_df=pd.read_csv("CHIS_2018_Adult_All.csv", usecols=col_list)
#chis_df

#today we're going to work with the extract as we did last week

chis_df = pd.read_csv('chis_extract_2022_weights.csv')
chis_df

In [None]:
chis_df.rename(columns={"AE_VEGI":"ate_veg",
                        "SRSEX": "sex",
                        "OMBSRR_P1": "race_ethnicity",
                        "POVLL" : "pov_cat",
                       "AK22_P1" : "hh_inc",
                       "AM184": "housing_worry",
                       "CV7_1":"covid_lostjob"}, inplace=True)

In [None]:
chis_df=(chis_df[['ate_veg','sex', 'race_ethnicity', 'pov_cat', 'hh_inc', 'housing_worry', 'covid_lostjob']])

In [None]:
chis_df

### Codebook


> AE_VEGI: Number of times respondent eats vegetables per week

> SRSEX: Self-reported Sex (1=Male, 2=Female)

> OMBSRR_P1: Race/ethnicity
(1=Hispanic, 2=White NH, 3=Black NH, 4=AmIndian/Alaska Native NH, 5=Asian NH, 6=Other or two or more)

> POVLL: poverty level
(1 = 0-99% FPL, 2=100-199% FPL, 3=200-299% FPL, 4=300% FPL and above)

> AK22_P1: Household Income

> AM184: How Often Worry about Paying Rent/Mortgage
(1=Very often, 2=Somewhat Often, 3=From Time to Time, 4=Almost Never)

> CV7_1: Lost Job due to COVID (1=Yes, 2=No)

##  2.0  Exploring Bivariate Relationships

### 2.1 Hypothesis

Let's start by reminding ourselves why we're doing all this data cleaning!  I am interested in understanding whether someone who lost their job due to COVID is more likely to be concerned about paying their rent, which I can use to argue for an extension of eviction moratoria or greater rent relief.

**My hypothesis is that people who lost their job due to COVID are more likely to be concerned about paying their rent.**

>  Y Variable: How Often Worry about Paying Rent/Mortgage  (AM184 - renamed housing_worry)
    (1=Very often, 2=Somewhat Often, 3=From Time to Time, 4=Almost Never)

>  X Variable: lost job due to COVID  (CV7_1  - renamed covid_lostjob)

>  Alternate X Variable: Categorical poverty level (POVLL - renamed pov_cat)

>  I'm also going to look at whether folks in poverty eat fewer vegetables, mostly to demonstrate code and concepts!

        > Y Variable: Eat Vegetables (AE_VEGI - renamed ate_veg)
    
        > X Variable: Categorical poverty level (POVLL - renamed pov_cat)

Below, I've included the code I used to clean each of the variables, including notes about what I did and why!

**I'm also going to create a "text" race/ethnicity variable to explore if there might be differences by race/ethnicity.  This is mostly to show you another option for working with your data in Python!**


### 2.2  Cleaning my variables

#### 2.2a Clean my Y variable (numeric)

In [None]:
#Describe the distribution of my data
chis_df['ate_veg'].describe()

In [None]:
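#drop people who said they ate veggies more than 10 times a day (more than 70 times a week)
#note: rows where ate_veg is missing are dropped too, because NaN < 71 is False
#.copy() gives us an independent dataframe so the new columns we add later are safe
chis_df = chis_df[chis_df['ate_veg'] < 71].copy()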

#### 2.2b Clean my Y variable (categorical)

In [None]:
#let's first look at the data
pd.crosstab(chis_df['housing_worry'], columns='count')

In [None]:
# I decided I want to group together anyone that expresses concern, 
#so I'm going to assign a 1 to 1,2,3, and a 0 to anyone who almost never worries (4)
chis_df['housing_worry_dv']=chis_df['housing_worry'].map({1:1, 2:1, 3:1, 4:0})
pd.crosstab(chis_df['housing_worry_dv'], columns='count')
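
One thing to watch with `.map()`: any code that isn't in the dictionary turns into `NaN` (missing) instead of giving an error. Here's a minimal sketch on a made-up Series. The values are invented for illustration, not CHIS data, and the `-1` is a hypothetical "skipped" code I'm assuming just to show the behavior.

```python
import pandas as pd

# Toy data, invented for illustration (not CHIS data).
# -1 stands in for a hypothetical "skipped / not asked" code.
toy_worry = pd.Series([1, 2, 3, 4, 4, -1])

# Same recode as above: any worry -> 1, almost never -> 0
toy_worry_dv = toy_worry.map({1: 1, 2: 1, 3: 1, 4: 0})

# Codes that are missing from the dictionary become NaN
n_missing = toy_worry_dv.isna().sum()
print(toy_worry_dv)
print("Unmapped rows:", n_missing)
```

If the count of unmapped rows is bigger than you expect, go back to the crosstab of the original variable and check for codes you didn't include in the dictionary.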

#### 2.2c Clean my X variable (poverty as category)

In [None]:
# Look at the distribution of values
pd.crosstab(chis_df['pov_cat'], columns='Total')

In [None]:
#Because I am most concerned about households living under the poverty line, 
#I'm going to create a dummy where 1 = under the poverty line, and 0 is above

chis_df['inpoverty_dv']=chis_df['pov_cat'].map({1:1, 2:0, 3:0, 4:0})
pd.crosstab(chis_df['inpoverty_dv'], columns='Total')

#### 2.2d Clean my X variable (lost job due to COVID)

In [None]:
pd.crosstab(chis_df['covid_lostjob'], columns='count')

In [None]:
chis_df['lostjob_dv']=chis_df['covid_lostjob'].map({1:1, 2:0})
pd.crosstab(chis_df['lostjob_dv'], columns='count')

#### Sometimes, I want a categorical variable with text so I can quickly look at my data - here's one way to create a new categorical variable with the values replaced by text strings.

In [None]:
#This code creates a new column with a text race/ethnicity variable based on the numeric codes
#OMBSRR_P1: Race/ethnicity (1=Hispanic, 2=White NH, 3=Black NH, 4=AmIndian/Alaska Native NH, 5=Asian NH, 6=Other or two or more)
#Note that I'm grouping code 4 together with code 6 in a single "Other/Two Races" category
chis_df.loc[(chis_df['race_ethnicity']==2), 'race_eth_text'] = "NHWhite"
chis_df.loc[(chis_df['race_ethnicity']==5), 'race_eth_text'] = "Asian"
chis_df.loc[(chis_df['race_ethnicity']==3), 'race_eth_text'] = "Black"
chis_df.loc[(chis_df['race_ethnicity']==1), 'race_eth_text'] = "Hispanic"
chis_df.loc[(chis_df['race_ethnicity']==4), 'race_eth_text'] = "Other/Two Races"
chis_df.loc[(chis_df['race_ethnicity']==6), 'race_eth_text'] = "Other/Two Races"
chis_df['race_eth_text'].value_counts()
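
Another option, if you'd rather keep the whole recode in one place, is to use `.map()` with a dictionary of labels. This is just a sketch on a tiny made-up dataframe (`toy_df` and `race_labels` are names I invented for the example); the labels match the ones in the cell above.

```python
import pandas as pd

# Toy data, invented for illustration (not CHIS data)
toy_df = pd.DataFrame({"race_ethnicity": [1, 2, 3, 4, 5, 6]})

# One dictionary holds the whole recode
race_labels = {
    1: "Hispanic",
    2: "NHWhite",
    3: "Black",
    4: "Other/Two Races",
    5: "Asian",
    6: "Other/Two Races",
}

# Each numeric code is looked up in the dictionary and replaced by its label
toy_df["race_eth_text"] = toy_df["race_ethnicity"].map(race_labels)
print(toy_df["race_eth_text"].value_counts())
```

Both approaches should give the same column; the dictionary version is a little easier to read and edit when you have a lot of categories.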

### 2.3  Exploring relationships

Now that I've cleaned my data, I can start to explore whether or not there are relationships between my Y and X variables.  I'm going to explore whether there are any observable differences in the average number of veggies a person consumes by my poverty variables.  

In [None]:
#First, I'm going to look at my original pov_cat variable
chis_df["ate_veg"].groupby(chis_df["pov_cat"]).mean()

In [None]:
#If I want to look at different metrics, like median, min or max, I can do that too
chis_df["ate_veg"].groupby(chis_df["pov_cat"]).agg(['mean', 'median', 'min', 'max'])

In [None]:
#How about when I look at my newly created dummy variable:
chis_df["ate_veg"].groupby(chis_df["inpoverty_dv"]).agg(['mean', 'median'])
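
You'll often see the same group comparison written the other way around: group the dataframe by the X column's name first, then pick the Y column. Here's a small sketch on made-up numbers (not CHIS data) showing that the two styles give the same table.

```python
import pandas as pd

# Toy data, invented for illustration (not CHIS data)
toy_df = pd.DataFrame({
    "ate_veg": [7, 14, 21, 3, 10, 5],
    "inpoverty_dv": [0, 0, 0, 1, 1, 1],
})

# Style used in this notebook: select Y, then group by the X Series
by_series = toy_df["ate_veg"].groupby(toy_df["inpoverty_dv"]).agg(["mean", "median"])

# Equivalent style: group the dataframe by the X column name, then select Y
by_name = toy_df.groupby("inpoverty_dv")["ate_veg"].agg(["mean", "median"])

print(by_name)
same = by_series.equals(by_name)
print("Same result:", same)
```

Use whichever one reads more clearly to you.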

In [None]:
#I can also explore by race/ethnicity - here's where the "text" version benefits me - 
#I don't have to remember what the number values stand for
chis_df["ate_veg"].groupby(chis_df["race_eth_text"]).mean()

In [None]:
#Here's another approach to exploring the relationship between two categorical variables.  
#Take a minute to explore what the normalize option did.  What changes when you set it to 'index'?  To 'all'?
pd.crosstab(index=chis_df["race_eth_text"], columns=chis_df["inpoverty_dv"], margins=True, normalize='columns')
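
If it's hard to see what `normalize` is doing on the full dataset, it can help to try it on a table small enough to check by hand. Here's a sketch with four made-up rows (not CHIS data; the `group` column is invented for the example).

```python
import pandas as pd

# Toy data, invented for illustration (not CHIS data)
toy_df = pd.DataFrame({
    "group": ["A", "A", "A", "B"],
    "inpoverty_dv": [0, 0, 1, 1],
})

# 'index': each ROW sums to 1 (share of each group that is / isn't in poverty)
by_row = pd.crosstab(toy_df["group"], toy_df["inpoverty_dv"], normalize="index")

# 'columns': each COLUMN sums to 1 (how each poverty status splits across groups)
by_col = pd.crosstab(toy_df["group"], toy_df["inpoverty_dv"], normalize="columns")

# 'all': the WHOLE table sums to 1 (share of everyone in each cell)
by_all = pd.crosstab(toy_df["group"], toy_df["inpoverty_dv"], normalize="all")

print(by_row, by_col, by_all, sep="\n\n")
```

Which one you want depends on your question: 'index' answers "what share of each row group falls in each column?", while 'columns' answers "what share of each column group falls in each row?"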


In [None]:
#let's look at whether there is any relationship between housing worry and COVID job loss
pd.crosstab(index=chis_df["housing_worry_dv"], columns=chis_df["lostjob_dv"], margins=True, normalize='index')

In [None]:
pd.crosstab(index=chis_df["housing_worry_dv"], columns=chis_df["lostjob_dv"], margins=True, normalize='columns')
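
A handy shortcut when both variables are 0/1 dummies: the mean of a dummy is the share of 1s, so a `groupby` mean gives the same numbers as the "1" row of the column-normalized crosstab. Here's a sketch on made-up data (the values are invented, not CHIS results).

```python
import pandas as pd

# Toy data, invented for illustration (not CHIS data)
toy_df = pd.DataFrame({
    "lostjob_dv":       [1, 1, 1, 1, 0, 0, 0, 0],
    "housing_worry_dv": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 dummy = proportion of 1s within each group
share_worried = toy_df.groupby("lostjob_dv")["housing_worry_dv"].mean()

# Same information from a column-normalized crosstab
col_pct = pd.crosstab(toy_df["housing_worry_dv"], toy_df["lostjob_dv"], normalize="columns")

print(share_worried)
print(col_pct)
```

In this toy example, the share who worry is higher among people who lost their job, which is the pattern my hypothesis predicts; whether the real data show it is what the crosstabs above are for.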

## 3.0  Conclusion

In lab today, use this notebook to start exploring your variables!