In [79]:
# HIDDEN
# This useful nonsense should just go at the top of your notebook.
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
import numpy as np
plots.style.use('fivethirtyeight')
# datascience version number of last run of this notebook
version.__version__

'0.5.12'

<h1>Class 7: Smoking and Weight Over Time</h1>

*Special thanks to David Culler for the idea and some coding and to Shanaaz Deo for coding help.*

Are smoking and weight related? Biologically, nicotine is a stimulant that may reduce appetite and raise metabolism. Smokers sometimes remark that they gain weight when trying to quit smoking.

An obvious way to test this relationship would be a randomized controlled trial (RCT) in which some kind of smoking cessation treatment was made available to the treatment group. But when we only have observational data on smoking and weight, what can we do?

The U.S. Health and Retirement Study is a biennial *panel survey* of Americans aged 50 and over. We can observe smoking status and weight for a set of individuals measured two years apart, in waves 8 and 9 (2006 and 2008).


Let's compute the *average weight* (in kilograms) for several interesting groups and compare them:
<ol>
<li>a smoker in wave 8</li>
<li>a non-smoker in wave 8</li>
<li>a <em>quitter</em> in wave 9</li>
<li>a <em>still-smoker</em> in wave 9</li>
</ol>

Here is the call that loads the data:

In [80]:
#load file
smokeweight = Table.read_table('http://demog.berkeley.edu/~redwards/Courses/LS88/c07_smokeweight.csv')
smokeweight

hhidpn,ragender,r8agey_m,r8weight,r9weight,r8smoken,r9smoken
3010,1,70,71.6672,65.317,0,0
3020,2,67,65.317,68.0385,0,0
10001010,1,66,72.5744,72.5744,0,0
10003030,2,50,58.9667,72.5744,0,0
10004010,1,66,102.511,100.697,0,0
10004040,2,60,77.1103,74.8423,0,0
10013010,1,68,108.862,99.7898,0,0
10013040,2,58,64.4098,63.5026,1,1
10038010,1,70,74.8423,73.4816,0,0
10038040,2,63,64.4098,63.5026,0,0


In [81]:
#Filter the table; only include rows where r8smoken==1. These are current smokers in wave 8
smoker8 = smokeweight.where('r8smoken',1)
smoker8

hhidpn,ragender,r8agey_m,r8weight,r9weight,r8smoken,r9smoken
10013040,2,58,64.4098,63.5026,1,1
10059020,2,70,57.6059,57.1523,1,1
10083010,1,67,122.469,95.7075,1,0
10196010,2,71,58.9667,52.1628,1,0
10433010,2,71,125.644,120.201,1,0
10482010,2,65,61.2347,65.7706,1,1
10577011,1,56,105.233,105.686,1,1
11071010,2,73,81.6462,77.1103,1,1
11230010,2,70,63.5026,65.7706,1,1
11345010,1,66,86.1821,86.1821,1,1


In [82]:
#Compute the average weight of smokers in wave 8
smoker8_weight_avg = smoker8['r8weight'].mean()
smoker8_weight_avg

76.134530202520253

Remember the formula for the standard error of a mean?
$$SEM = \frac{STD}{\sqrt{N}}$$
Here $STD$ is the standard deviation of the sample and $N$ is the number of observations. The 95% "margin of error" is 1.96 times the SEM, and the 95% confidence interval is the sample mean $\pm 1.96 \times SEM$.
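As a rough illustration of the formula, the sketch below applies it to only the ten wave-8 smoker weights visible in the table preview above, so its numbers will not match the full-sample results. The helper name `mean_with_margin` is hypothetical, and it uses the population standard deviation (dividing by $N$) on the assumption that this matches the default behaviour of the `.std()` call on a NumPy array.

```python
import math
import statistics

# Wave-8 weights (kg) of the ten smokers shown in the table preview;
# an illustrative subset only, not the full sample.
weights = [64.4098, 57.6059, 122.469, 58.9667, 125.644,
           61.2347, 105.233, 81.6462, 63.5026, 86.1821]


def mean_with_margin(values, z=1.96):
    """Return (mean, sem, margin_of_error) for a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    # pstdev divides by N (population standard deviation).
    std = statistics.pstdev(values)
    sem = std / math.sqrt(n)
    return mean, sem, z * sem


mean, sem, moe = mean_with_margin(weights)
lower, upper = mean - moe, mean + moe
print(f"mean = {mean:.2f} kg, 95% CI = ({lower:.2f}, {upper:.2f})")
```

With so few observations the interval is much wider than the one for the full sample, which is the point of dividing by $\sqrt{N}$.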

In [83]:
smoker8_weight_std = smoker8['r8weight'].std()
smoker8_weight_std

17.652564168796527

In [84]:
smoker8_weight_sem = smoker8_weight_std/len(smoker8['r8weight'])**0.5
smoker8_weight_sem

0.37448615964871435

In [85]:
smoker8_weight_moe = 1.96*smoker8_weight_sem
smoker8_weight_moe

0.73399287291148008

Let's now calculate the average weight of non-smokers in wave 8, following the same four steps: mean, standard deviation, standard error, and margin of error.

In [86]:
#Filter the table; only include rows where r8smoken==0. These are current non-smokers in wave 8
nonsmoker8 = ...

In [87]:
#Compute the average weight of non-smokers in wave 8
nonsmoker8_weight_avg = ...
nonsmoker8_weight_avg

Ellipsis

In [88]:
nonsmoker8_weight_std = ...
nonsmoker8_weight_std

Ellipsis

In [89]:
nonsmoker8_weight_sem = ...
nonsmoker8_weight_sem

Ellipsis

In [90]:
nonsmoker8_weight_moe = ...
nonsmoker8_weight_moe

Ellipsis

In [92]:
#Now, calculate the difference between the two averages
#UNCOMMENT THE FOLLOWING:
#nonsmoker8_weight_avg - smoker8_weight_avg

Do the confidence intervals overlap?

This is the lower bound on non-smokers' higher average weight

In [93]:
#UNCOMMENT ME:
#(nonsmoker8_weight_avg - nonsmoker8_weight_moe)

And this is the upper bound on smokers' lower average weight

In [94]:
(smoker8_weight_avg + smoker8_weight_moe)

76.868523075431739
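The overlap question can also be answered with a small check, sketched here under stated assumptions. The helper names `confidence_interval` and `intervals_overlap` are hypothetical. The smoker mean and margin of error are the values computed earlier in this notebook; the second interval uses made-up placeholder numbers, not results from the data, and should be replaced with the non-smoker values you compute.

```python
def confidence_interval(mean, moe):
    """Return the (lower, upper) bounds of mean +/- moe."""
    return (mean - moe, mean + moe)


def intervals_overlap(ci_a, ci_b):
    """True when two closed intervals share at least one point."""
    # Each interval must start no later than the other one ends.
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Smoker values as computed earlier in this notebook.
smoker_ci = confidence_interval(76.134530202520253, 0.73399287291148008)

# PLACEHOLDER numbers for illustration only; substitute your own results.
placeholder_ci = confidence_interval(80.0, 0.5)

print(smoker_ci)
print(intervals_overlap(smoker_ci, placeholder_ci))
```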

Remark on what you found comparing smokers and non-smokers in wave 8.

<h2>Change over time</h2>

Let's now look at the weight of a *quitter* in wave 9 versus the weight of a *still-smoker* in wave 9! The nifty thing is that we can reuse our `smoker8` table from above.

In [54]:
quitter9 = smoker8.where('r9smoken',0)
quitter9

hhidpn,ragender,r8agey_m,r8weight,r9weight,r8smoken,r9smoken
10083010,1,67,122.469,95.7075,1,0
10196010,2,71,58.9667,52.1628,1,0
10433010,2,71,125.644,120.201,1,0
11765010,1,67,73.9352,76.2031,1,0
11859010,2,68,72.5744,70.76,1,0
12033011,2,65,45.359,45.8126,1,0
12183010,1,70,66.6777,65.7706,1,0
13137020,1,67,81.6462,83.9142,1,0
13600010,1,72,81.6462,86.1821,1,0
13600020,2,71,58.0595,61.2347,1,0
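The grouping logic behind these filters can be sketched in plain Python, independent of the `datascience` table methods. The function name `smoking_group` and the group labels are hypothetical, the rows are a handful copied from the tables shown in this notebook, and the sketch assumes the smoking flags are coded only as 0 or 1.

```python
# (r8weight, r9weight, r8smoken, r9smoken) for a few rows from the tables
# shown in this notebook; an illustrative subset only.
rows = [
    (64.4098, 63.5026, 1, 1),
    (122.469, 95.7075, 1, 0),
    (71.6672, 65.317, 0, 0),
    (58.9667, 52.1628, 1, 0),
]


def smoking_group(r8smoken, r9smoken):
    """Label a respondent by smoking status in waves 8 and 9."""
    if r8smoken == 1 and r9smoken == 0:
        return "quitter"
    if r8smoken == 1 and r9smoken == 1:
        return "still-smoker"
    return "wave-8 non-smoker"


# Collect wave-9 weights by group.
groups = {}
for r8weight, r9weight, r8smoken, r9smoken in rows:
    label = smoking_group(r8smoken, r9smoken)
    groups.setdefault(label, []).append(r9weight)

for label, weights in groups.items():
    print(label, weights)
```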


In [78]:
stillsmoker9 = ...
stillsmoker9

Ellipsis

Calculate the average (mean) weights of these two groups, then the standard deviations and standard errors of the means, then test for significance.
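Checking whether two 95% confidence intervals overlap is one way to test for significance, and it is a fairly strict one. A common alternative, offered here as a sketch and not necessarily the method this notebook intends, is a z-statistic for the difference in means, which combines the two standard errors as $\sqrt{SEM_a^2 + SEM_b^2}$ and treats the two groups as independent samples. The function name `difference_z` is hypothetical and the inputs below are placeholders, not results from the data.

```python
import math


def difference_z(mean_a, sem_a, mean_b, sem_b):
    """z-statistic for mean_a - mean_b, assuming independent groups."""
    se_diff = math.sqrt(sem_a ** 2 + sem_b ** 2)
    return (mean_a - mean_b) / se_diff


# PLACEHOLDER inputs for illustration only; substitute the values you compute.
z = difference_z(mean_a=80.0, sem_a=1.5, mean_b=76.0, sem_b=0.5)

# |z| > 1.96 corresponds to significance at the 5% level (two-sided).
print(abs(z) > 1.96)
```

If the two 95% intervals do not overlap, then $|z|$ exceeds 1.96 as well, but the reverse does not always hold, so the two checks can disagree when the intervals overlap only slightly.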

In [96]:
quitter9_weight_avg = ...
quitter9_weight_avg

Ellipsis

In [97]:
quitter9_weight_std = ...
quitter9_weight_std

Ellipsis

In [98]:
quitter9_weight_sem = ...
quitter9_weight_sem

Ellipsis

In [99]:
quitter9_weight_moe = ...
quitter9_weight_moe

Ellipsis

In [100]:
stillsmoker9_weight_avg = ...
stillsmoker9_weight_avg

Ellipsis

In [101]:
stillsmoker9_weight_std = ...
stillsmoker9_weight_std

Ellipsis

In [102]:
stillsmoker9_weight_sem = ...
stillsmoker9_weight_sem

Ellipsis

In [103]:
stillsmoker9_weight_moe = ...
stillsmoker9_weight_moe

Ellipsis

In [104]:
#Now, calculate the difference between the two averages
#quitter9_weight_avg - stillsmoker9_weight_avg

Do the confidence intervals overlap?

This is the lower bound on quitters' higher average weight

In [106]:
#quitter9_weight_avg - quitter9_weight_moe

And this is the upper bound on still-smokers' lower average weight

In [108]:
#stillsmoker9_weight_avg + stillsmoker9_weight_moe

Remark on what you found comparing quitters and still-smokers, and how this comparison itself compares to the first comparison you undertook.