# Notebook 6: Regression Example Code


This notebook provides code for running a linear regression and a logistic regression.  The examples continue to use the CHIS data to explore the relationship between "drank sweet drinks" and a number of explanatory (independent) variables.

### Codebook

> AC46: Number of times respondent drank sweet fruit drinks in past month

> AC47: Number of times respondent drank water yesterday

> AE_VEGI: Number of times respondent eats vegetables per week

> AC42: How often respondent was able to find fresh fruits/vegetables in neighborhood
(1=Never, 2=Sometimes, 3=Usually, 4=Always, 5=Doesn't eat f/v, 6=Doesn't shop for f/v, 7=Doesn't shop in neighborhood)

> SRSEX: Self-reported Sex (1= Male, 2=Female)

> OMBSRR_P1: Race/ethnicity
(1=Hispanic, 2= White NH, 3=Black NH, 4=AmIndian/Alaska Native NH, 5=Asian NH, 6=Other or two or more)

> POVLL: poverty level
(1 = 0-99% FPL, 2=100-199% FPL, 3=200-299% FPL, 4=300% FPL and above)

> POVGWD_P1: Family Poverty Threshold Level

> RAKEDW0: Individual weight

## 1 Libraries

We're going to bring in our libraries. You'll notice some new libraries and functions: "scipy" is a library that includes statistical analysis functions, and we're going to bring in its t and ttest_ind functions.  I'm also going to display numbers with 4 decimal places.

In [32]:
#Call our libraries; note, we are adding some libraries to our notebook
from statsmodels.stats.proportion import proportions_ztest
import numpy as np
import pandas as pd
import math
from scipy import stats
from scipy.stats import t
from scipy.stats import ttest_ind
from scipy.stats import pearsonr
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor ### VIF package
from statsmodels.discrete.discrete_model import Logit
from datascience import *

pd.options.display.float_format = '{:.4f}'.format


In [2]:
import matplotlib
import matplotlib.pyplot as plt
import scipy 

In [3]:
%matplotlib inline

In [4]:
pip install researchpy

Collecting researchpy
  Using cached researchpy-0.2.3-py3-none-any.whl (10 kB)
Installing collected packages: researchpy
Successfully installed researchpy-0.2.3
Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install pingouin

Processing /home/jovyan/.cache/pip/wheels/6d/60/60/30767bba9ffecc666e5feeed793b0e98ec6c8a37b714458a5e/pingouin-0.3.8-py3-none-any.whl
Processing /home/jovyan/.cache/pip/wheels/2d/4f/c9/062da6e68841f60d0c3434980775671daaa07a574110567de6/outdated-0.2.0-py3-none-any.whl
Collecting tabulate
  Using cached tabulate-0.8.7-py3-none-any.whl (24 kB)
Collecting pandas-flavor>=0.1.2
  Using cached pandas_flavor-0.2.0-py2.py3-none-any.whl (6.6 kB)
Processing /home/jovyan/.cache/pip/wheels/6a/33/c4/0ef84d7f5568c2823e3d63a6e08988852fb9e4bc822034870a/littleutils-0.2.2-py3-none-any.whl
Installing collected packages: littleutils, outdated, tabulate, pandas-flavor, pingouin
Successfully installed littleutils-0.2.2 outdated-0.2.0 pandas-flavor-0.2.0 pingouin-0.3.8 tabulate-0.8.7
Note: you may need to restart the kernel to use updated packages.


In [6]:
import researchpy as rp
import pingouin as pg

In [7]:
#When we start working with nan (missing) values, we can get warnings - we're going to ignore them here
import warnings
warnings.filterwarnings("ignore") 

## 2 Bring in data and code variables

In [8]:
chis_df = pd.read_csv('chis_extract_2018_weights.csv')
chis_df

Unnamed: 0,AC47,AC42,SRSEX,AC46,POVLL,AE_VEGI,OMBSRR_P1,POVGWD_P1,RAKEDW0
0,2,4,2,0,4,7,2,5.0000,85.8754
1,0,4,1,0,4,7,2,5.0000,1911.8158
2,3,4,2,0,4,3,2,5.0000,197.0370
3,3,4,2,120,4,7,2,4.1200,1335.0551
4,6,4,1,0,1,14,2,0.1500,938.3114
...,...,...,...,...,...,...,...,...,...
21172,2,4,2,0,1,4,2,0.0000,1601.0253
21173,6,4,1,5,3,4,1,2.3500,8935.2972
21174,2,4,1,60,1,21,1,1.1000,1454.7643
21175,0,4,1,22,4,7,5,5.0000,3184.6443


In [9]:
chis_df.rename(columns={'AC47':'drank_water', 
                        'AC42':'nhood_fv', 
                        'AE_VEGI':'ate_fv',
                        'SRSEX': 'sex',
                        'AC46': 'drank_sweet',
                        'OMBSRR_P1': 'race_ethnicity',
                        'POVGWD_P1' : 'pov_ratio',
                       'POVLL' : 'pov_cat',
                       'RAKEDW0': 'weight'}, inplace=True)
chis_df

Unnamed: 0,drank_water,nhood_fv,sex,drank_sweet,pov_cat,ate_fv,race_ethnicity,pov_ratio,weight
0,2,4,2,0,4,7,2,5.0000,85.8754
1,0,4,1,0,4,7,2,5.0000,1911.8158
2,3,4,2,0,4,3,2,5.0000,197.0370
3,3,4,2,120,4,7,2,4.1200,1335.0551
4,6,4,1,0,1,14,2,0.1500,938.3114
...,...,...,...,...,...,...,...,...,...
21172,2,4,2,0,1,4,2,0.0000,1601.0253
21173,6,4,1,5,3,4,1,2.3500,8935.2972
21174,2,4,1,60,1,21,1,1.1000,1454.7643
21175,0,4,1,22,4,7,5,5.0000,3184.6443


In [10]:
#Drop observations that are 3 or more standard deviations from the mean of the "drank_sweet" variable
chis_df=chis_df[(np.abs(stats.zscore(chis_df['drank_sweet'], nan_policy='omit'))<3)]
chis_df['drank_water'].describe()

count   20582.0000
mean        6.5979
std         9.9055
min         0.0000
25%         3.0000
50%         5.0000
75%         8.0000
max        99.0000
Name: drank_water, dtype: float64
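
To see what the three-standard-deviation filter on "drank_sweet" is doing, here is a minimal sketch on a toy table. The values and the name `toy` are made up for illustration. The z-score is computed by hand with `ddof=0`, which is the default that `scipy.stats.zscore` uses.

```python
import numpy as np
import pandas as pd

# Toy data (made up): twenty typical values plus one extreme value
toy = pd.DataFrame({'drank_sweet': [0, 1, 2, 3, 4] * 4 + [120]})

# z-score = (value - mean) / standard deviation
col = toy['drank_sweet']
z = (col - col.mean()) / col.std(ddof=0)

# Keep only the rows whose z-score is less than 3 in absolute value
filtered = toy[np.abs(z) < 3]

print(len(toy), len(filtered))
print(filtered['drank_sweet'].max())
```

Only the extreme value is dropped here. Keep in mind that the extreme value itself pulls the mean and standard deviation upward, so this kind of filter is a rough screen rather than a precise rule.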

In [11]:
# 99 is the code for a missing response to drank_water; recode it to NaN so it is not counted as a real value
chis_df.loc[(chis_df.drank_water == 99),'drank_water']=np.nan
chis_df['drank_water'].describe()

count   20376.0000
mean        5.6638
std         3.4515
min         0.0000
25%         3.0000
50%         5.0000
75%         8.0000
max        20.0000
Name: drank_water, dtype: float64
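
The change in the mean above comes entirely from removing the 99 codes. A minimal sketch on made-up numbers (the name `toy` and its values are hypothetical) shows the same effect on a small scale:

```python
import numpy as np
import pandas as pd

# Toy data (made up): 99 stands in for "missing"
toy = pd.DataFrame({'drank_water': [2.0, 5.0, 99.0, 8.0, 99.0]})

mean_before = toy['drank_water'].mean()  # inflated by the 99 codes

# Recode 99 to NaN; pandas skips NaN in count() and mean()
toy.loc[toy['drank_water'] == 99, 'drank_water'] = np.nan

count_after = toy['drank_water'].count()
mean_after = toy['drank_water'].mean()

print(mean_before, count_after, mean_after)
```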

In [12]:
#Dummy for whether the respondent is in poverty
chis_df['inpoverty_dv']=chis_df['pov_cat'].map({1:1, 2:0, 3:0, 4:0})

In [45]:
#Dummy for whether the person can never or only sometimes find fresh fruits and vegetables in their neighborhood
chis_df['nofv_dv']=chis_df['nhood_fv'].map({1:1, 2:1, 3:0, 4:0})
chis_df[chis_df['nhood_fv'] <5]

Unnamed: 0,drank_water,nhood_fv,sex,drank_sweet,pov_cat,ate_fv,race_ethnicity,pov_ratio,weight,inpoverty_dv,nofv_dv,race_eth_text,hispanic_dv,white_dv,black_dv,asian_dv,other_dv
0,2.0000,4,2,0,4,7,2,5.0000,85.8754,0,0.0000,NHWhite,0,1,0,0,0
1,0.0000,4,1,0,4,7,2,5.0000,1911.8158,0,0.0000,NHWhite,0,1,0,0,0
2,3.0000,4,2,0,4,3,2,5.0000,197.0370,0,0.0000,NHWhite,0,1,0,0,0
4,6.0000,4,1,0,1,14,2,0.1500,938.3114,1,0.0000,NHWhite,0,1,0,0,0
5,4.0000,4,1,0,4,2,2,3.0400,460.4011,0,0.0000,NHWhite,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21171,4.0000,4,2,4,4,7,2,5.0000,1022.3784,0,0.0000,NHWhite,0,1,0,0,0
21172,2.0000,4,2,0,1,4,2,0.0000,1601.0253,1,0.0000,NHWhite,0,1,0,0,0
21173,6.0000,4,1,5,3,4,1,2.3500,8935.2972,0,0.0000,Hispanic,1,0,0,0,0
21175,0.0000,4,1,22,4,7,5,5.0000,3184.6443,0,0.0000,Asian,0,0,0,1,0
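
Notice that nofv_dv displays as 0.0000 rather than 0. A minimal sketch on made-up codes (the name `toy` is hypothetical) suggests why: any code that is not listed in the dictionary passed to `.map()` becomes NaN, and a column that holds NaN is stored as a float.

```python
import pandas as pd

# Toy data (made up): codes 5 and 7 are not in the mapping below
toy = pd.DataFrame({'nhood_fv': [1, 2, 3, 4, 5, 7]})

# Same mapping as the notebook cell: 1 and 2 become 1, 3 and 4 become 0
toy['nofv_dv'] = toy['nhood_fv'].map({1: 1, 2: 1, 3: 0, 4: 0})

n_missing = toy['nofv_dv'].isna().sum()

print(toy)
print(n_missing)
```

Respondents with codes 5, 6, or 7 therefore have a missing nofv_dv, so any model that uses it as the dependent variable has to handle those missing rows.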


In [14]:
#Text race/ethnicity variable
chis_df.loc[(chis_df['race_ethnicity'] == 2), 'race_eth_text'] = 'NHWhite'  
chis_df.loc[(chis_df['race_ethnicity']==5), 'race_eth_text'] = "Asian"
chis_df.loc[(chis_df['race_ethnicity']==3), 'race_eth_text'] = "Black"
chis_df.loc[(chis_df['race_ethnicity']==1), 'race_eth_text'] = "Hispanic"
chis_df.loc[(chis_df['race_ethnicity']==4), 'race_eth_text'] = "Other/Two Races"
chis_df.loc[(chis_df['race_ethnicity']==6), 'race_eth_text'] = "Other/Two Races"
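
As an alternative sketch, the same labels could be assigned in one step with a lookup dictionary and `.map()`. The names `race_labels` and `toy` are hypothetical, and the toy codes are made up; the labels are the ones used in the cell above.

```python
import pandas as pd

# Hypothetical lookup table; codes 4 and 6 share one label, as in the cell above
race_labels = {
    1: 'Hispanic',
    2: 'NHWhite',
    3: 'Black',
    4: 'Other/Two Races',
    5: 'Asian',
    6: 'Other/Two Races',
}

# Toy data (made up): one row per code
toy = pd.DataFrame({'race_ethnicity': [2, 5, 3, 1, 4, 6]})
toy['race_eth_text'] = toy['race_ethnicity'].map(race_labels)

print(toy)
```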

In [15]:
#A series of dummies for my race ethnicity variable
chis_df['hispanic_dv']=np.where((chis_df['race_ethnicity'] == 1), 1,0)
chis_df['white_dv']=np.where((chis_df['race_ethnicity'] == 2), 1,0)
chis_df['black_dv']=np.where((chis_df['race_ethnicity'] == 3), 1,0)
chis_df['asian_dv']=np.where((chis_df['race_ethnicity'] == 5), 1,0)
chis_df['other_dv']=np.where((chis_df['race_ethnicity'] == 4) | (chis_df['race_ethnicity'] == 6), 1,0)
chis_df

Unnamed: 0,drank_water,nhood_fv,sex,drank_sweet,pov_cat,ate_fv,race_ethnicity,pov_ratio,weight,inpoverty_dv,nofv_dv,race_eth_text,hispanic_dv,white_dv,black_dv,asian_dv,other_dv
0,2.0000,4,2,0,4,7,2,5.0000,85.8754,0,0.0000,NHWhite,0,1,0,0,0
1,0.0000,4,1,0,4,7,2,5.0000,1911.8158,0,0.0000,NHWhite,0,1,0,0,0
2,3.0000,4,2,0,4,3,2,5.0000,197.0370,0,0.0000,NHWhite,0,1,0,0,0
4,6.0000,4,1,0,1,14,2,0.1500,938.3114,1,0.0000,NHWhite,0,1,0,0,0
5,4.0000,4,1,0,4,2,2,3.0400,460.4011,0,0.0000,NHWhite,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21171,4.0000,4,2,4,4,7,2,5.0000,1022.3784,0,0.0000,NHWhite,0,1,0,0,0
21172,2.0000,4,2,0,1,4,2,0.0000,1601.0253,1,0.0000,NHWhite,0,1,0,0,0
21173,6.0000,4,1,5,3,4,1,2.3500,8935.2972,0,0.0000,Hispanic,1,0,0,0,0
21175,0.0000,4,1,22,4,7,5,5.0000,3184.6443,0,0.0000,Asian,0,0,0,1,0
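
For comparison, here is a minimal sketch of building a set of dummies from the text variable with `pd.get_dummies`. This is an alternative to the `np.where` approach above, not what the notebook uses. The toy rows and the name `toy` are made up, and the resulting column names come from the labels rather than the `_dv` names used elsewhere in this notebook.

```python
import pandas as pd

# Toy data (made up), using the labels from race_eth_text
toy = pd.DataFrame({'race_eth_text': ['NHWhite', 'Hispanic', 'Asian',
                                      'Black', 'Other/Two Races', 'NHWhite']})

# One 0/1 column per category
dummies = pd.get_dummies(toy['race_eth_text'], dtype=int)

# Drop one category so that it serves as the base (reference) group in a regression
dummies = dummies.drop(columns='NHWhite')

print(dummies)
```

After the drop, rows for the base group have 0 in every remaining dummy column.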


## 3 Linear Regression


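We use linear regression when our dependent (Y) variable is numeric. Here the dependent variable is drank_sweet, the number of times the respondent drank sweet fruit drinks in the past month.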

In [36]:
#Define Independent Variables of Interest
ind_var = ['hispanic_dv','black_dv', 'asian_dv', 'other_dv', 'inpoverty_dv', 'ate_fv'] 
#Note that race/ethnicity is categorical. We need to exclude one of its dummies to prevent collinearity issues with our model
#We exclude white_dv, so non-Hispanic White is our base (reference) category

x = chis_df[ind_var].assign(Intercept = 1) #Independent Variables
y = chis_df['drank_sweet'] #Dependent Variable

model = sm.OLS(y, x).fit()
### Let's save the results as "model" - this will be useful for other functions below.

model.summary2()

0,1,2,3
Model:,OLS,Adj. R-squared:,0.032
Dependent Variable:,drank_sweet,AIC:,149120.2793
Date:,2020-11-11 11:34,BIC:,149175.8045
No. Observations:,20582,Log-Likelihood:,-74553.0
Df Model:,6,F-statistic:,113.8
Df Residuals:,20575,Prob (F-statistic):,7.86e-142
R-squared:,0.032,Scale:,82.02

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
hispanic_dv,2.7168,0.1608,16.8954,0.0000,2.4016,3.0319
black_dv,2.3768,0.2870,8.2830,0.0000,1.8144,2.9393
asian_dv,0.0028,0.2287,0.0121,0.9904,-0.4455,0.4510
other_dv,1.1710,0.2977,3.9335,0.0001,0.5875,1.7545
inpoverty_dv,0.9541,0.1880,5.0743,0.0000,0.5855,1.3226
ate_fv,-0.1347,0.0098,-13.7250,0.0000,-0.1540,-0.1155
Intercept,5.2897,0.1141,46.3438,0.0000,5.0660,5.5134

0,1,2,3
Omnibus:,6833.639,Durbin-Watson:,1.959
Prob(Omnibus):,0.0,Jarque-Bera (JB):,17392.72
Skew:,1.87,Prob(JB):,0.0
Kurtosis:,5.51,Condition No.:,49.0
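
One way to read the coefficient table is as an equation for the predicted value of drank_sweet. The sketch below copies the coefficients from the table above and plugs in a hypothetical respondent (Hispanic, not in poverty, eats vegetables 7 times per week); the names `coef` and `person` are made up for illustration.

```python
# Coefficients copied from the OLS summary table
coef = {
    'Intercept': 5.2897,
    'hispanic_dv': 2.7168,
    'black_dv': 2.3768,
    'asian_dv': 0.0028,
    'other_dv': 1.1710,
    'inpoverty_dv': 0.9541,
    'ate_fv': -0.1347,
}

# Hypothetical respondent: Hispanic, not in poverty, eats vegetables 7 times per week
person = {
    'Intercept': 1,
    'hispanic_dv': 1,
    'black_dv': 0,
    'asian_dv': 0,
    'other_dv': 0,
    'inpoverty_dv': 0,
    'ate_fv': 7,
}

# Predicted value = sum of (coefficient x value) over all terms
predicted = sum(coef[name] * person[name] for name in coef)
print(round(predicted, 4))
```

The prediction is the intercept, plus the Hispanic coefficient, plus 7 times the ate_fv coefficient. Because white_dv was left out of the model, each race/ethnicity coefficient is a difference relative to non-Hispanic White respondents.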


## 4 Logit Regression

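We use logit regression when our dependent variable is a dummy (0/1) variable! Here it is nofv_dv, which equals 1 when the respondent can never or only sometimes find fresh fruits and vegetables in their neighborhood.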

In [55]:
y = chis_df['nofv_dv'] #Dependent Variable - it's a dummy!
ind_var = ['hispanic_dv','black_dv', 'asian_dv', 'other_dv', 'inpoverty_dv', 'ate_fv']
x = chis_df[ind_var].assign(Intercept = 1) #Independent Variables
y.value_counts()

0.0000    18407
1.0000     2020
Name: nofv_dv, dtype: int64

In [56]:
logit_model = sm.Logit(y, x, missing='drop').fit()
logit_model.summary2()

Optimization terminated successfully.
         Current function value: 0.313238
         Iterations 7


0,1,2,3
Model:,Logit,Pseudo R-squared:,0.029
Dependent Variable:,nofv_dv,AIC:,12811.0122
Date:,2020-11-11 11:46,BIC:,12866.4845
No. Observations:,20427,Log-Likelihood:,-6398.5
Df Model:,6,LL-Null:,-6590.5
Df Residuals:,20420,LLR p-value:,8.1038e-80
Converged:,1.0000,Scale:,1.0
No. Iterations:,7.0000,,

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
hispanic_dv,0.2805,0.0585,4.7989,0.0000,0.1660,0.3951
black_dv,0.4760,0.0945,5.0376,0.0000,0.2908,0.6612
asian_dv,0.2557,0.0842,3.0376,0.0024,0.0907,0.4207
other_dv,0.7148,0.0933,7.6587,0.0000,0.5319,0.8978
inpoverty_dv,0.6146,0.0589,10.4388,0.0000,0.4992,0.7300
ate_fv,-0.0538,0.0053,-10.0993,0.0000,-0.0642,-0.0433
Intercept,-2.1414,0.0490,-43.7440,0.0000,-2.2374,-2.0455
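
Logit coefficients are on the log-odds scale, so they need one more step to become a probability. The sketch below copies the coefficients from the table above and uses the logistic function, p = 1 / (1 + exp(-log_odds)), for a hypothetical respondent (Hispanic, not in poverty, eats vegetables 7 times per week). The names `coef` and `person` are made up for illustration.

```python
import math

# Coefficients copied from the logit summary table
coef = {
    'Intercept': -2.1414,
    'hispanic_dv': 0.2805,
    'black_dv': 0.4760,
    'asian_dv': 0.2557,
    'other_dv': 0.7148,
    'inpoverty_dv': 0.6146,
    'ate_fv': -0.0538,
}

# Hypothetical respondent: Hispanic, not in poverty, eats vegetables 7 times per week
person = {
    'Intercept': 1,
    'hispanic_dv': 1,
    'black_dv': 0,
    'asian_dv': 0,
    'other_dv': 0,
    'inpoverty_dv': 0,
    'ate_fv': 7,
}

# Step 1: the linear part of the model gives the log-odds
log_odds = sum(coef[name] * person[name] for name in coef)

# Step 2: the logistic function converts log-odds to a probability between 0 and 1
probability = 1 / (1 + math.exp(-log_odds))

print(round(log_odds, 4), round(probability, 4))
```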


In [57]:
# Odds Ratios
or_table = np.exp(logit_model.conf_int()) #Exponentiate Confidence Intervals
or_table['Odds Ratio'] = np.exp(logit_model.params) #Exponentiate Coefficients

or_table.columns = ['2.5%', '97.5%', 'Odds Ratio'] #Name Columns
or_table

Unnamed: 0,2.5%,97.5%,Odds Ratio
hispanic_dv,1.1805,1.4845,1.3238
black_dv,1.3375,1.9371,1.6096
asian_dv,1.0949,1.523,1.2913
other_dv,1.7021,2.4541,2.0438
inpoverty_dv,1.6475,2.0752,1.849
ate_fv,0.9378,0.9576,0.9477
Intercept,0.1067,0.1293,0.1175
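
An odds ratio is often easier to talk about as a percent change in the odds, computed as (odds ratio - 1) x 100. The sketch below applies that to two values copied from the table above; the names `odds_ratios` and `pct_change` are made up for illustration.

```python
# Odds ratios copied from the table above
odds_ratios = {'inpoverty_dv': 1.849, 'ate_fv': 0.9477}

# Percent change in the odds for a one-unit increase in each variable
pct_change = {name: (value - 1) * 100 for name, value in odds_ratios.items()}

for name, pct in pct_change.items():
    direction = 'higher' if pct > 0 else 'lower'
    print(f"{name}: odds are {abs(pct):.1f}% {direction}")
```

An odds ratio above 1 means higher odds of nofv_dv = 1 and an odds ratio below 1 means lower odds, holding the other variables constant. Note that these are changes in the odds, not in the probability.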
