# Practice notebook for regression analysis with NHANES

This notebook will give you the opportunity to perform some
regression analyses with the NHANES data that are similar to
the analyses done in the week 2 case study notebook.

You can enter your code into the cells that say "enter your code here",
and you can type responses to the questions into the cells that say "Type Markdown and Latex".

Note that most of the code that you will need to write below is very similar
to code that appears in the case study notebook.  You will need
to edit code from that notebook in small ways to adapt it to the
prompts below.

To get started, we will use the same module imports and
read the data in the same way as we did in the case study:

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import numpy as np

url = "https://raw.githubusercontent.com/kshedden/statswpy/master/NHANES/merged/nhanes_2015_2016.csv"
da = pd.read_csv(url)

# Drop unused columns, drop rows with any missing values.
vars = ["BPXSY1", "RIDAGEYR", "RIAGENDR", "RIDRETH1", "DMDEDUC2", "BMXBMI", "SMQ020"]
da = da[vars].dropna()

## Question 1:

Use linear regression to relate the expected body mass index (BMI) to a person's age.

In [4]:
# Optional first check: the correlation between BMI and age.
# da[['BMXBMI','RIDAGEYR']].corr()
model = sm.OLS.from_formula("BMXBMI ~ RIDAGEYR", data=da)
result = model.fit()
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | BMXBMI | R-squared: | 0.001 |
| Model: | OLS | Adj. R-squared: | 0.0 |
| Method: | Least Squares | F-statistic: | 2.72 |
| Date: | Thu, 15 Jun 2023 | Prob (F-statistic): | 0.0991 |
| Time: | 10:36:30 | Log-Likelihood: | -17149.0 |
| No. Observations: | 5102 | AIC: | 34300.0 |
| Df Residuals: | 5100 | BIC: | 34320.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 29.0564 | 0.290 | 100.143 | 0.000 | 28.488 | 29.625 |
| RIDAGEYR | 0.0091 | 0.006 | 1.649 | 0.099 | -0.002 | 0.020 |

| | | | |
|---|---|---|---|
| Omnibus: | 936.202 | Durbin-Watson: | 2.009 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1857.656 |
| Skew: | 1.105 | Prob(JB): | 0.0 |
| Kurtosis: | 4.964 | Cond. No. | 156.0 |


__Q1a.__ According to your fitted model, do older people tend to have higher or lower BMI than younger people?

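The fitted slope for age is positive (0.0091 BMI units per year of age), so older people tend to have a slightly higher BMI than younger people. The effect is small, though, and it is not statistically significant at the 5% level (p = 0.099).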

__Q1b.__ Based on your analysis, are you confident that there is a relationship between BMI and age in the population that NHANES represents?

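No. The p-value for the age coefficient is 0.099 and its 95% confidence interval (-0.002, 0.020) includes zero, so the data do not give strong evidence of a relationship between BMI and age in the population. The R-squared of 0.001 also shows that age explains almost none of the variation in BMI.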

__Q1c.__ By how much does the average BMI of a 40 year old differ from the average BMI of a 20 year old?

Using the fitted slope, the expected difference is 20 * 0.0091 = 0.182 BMI units, which is small.

__Q1d.__ What fraction of the variation of BMI in this population is explained by age?

The R-squared is 0.001, so age explains about 0.1% of the variation in BMI. The adjusted R-squared rounds to zero.

## Question 2: 

Add gender and ethnicity as additional control variables to your linear model relating BMI to age.  You will need to recode the ethnic groups based
on the values in the codebook entry for [RIDRETH1](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#RIDRETH1).

In [8]:
# Codebook values for RIDRETH1:
#   1  Mexican American
#   2  Other Hispanic
#   3  Non-Hispanic White
#   4  Non-Hispanic Black
#   5  Other Race - Including Multi-Racial
# da.RIDRETH1.value_counts()
da['RIDRETH1x'] = da.RIDRETH1.replace(
    {1: 'Mexican American', 2: 'Other Hispanic', 3: 'Non-Hispanic White', 4: 'Non-Hispanic Black', 5: 'Other Race'})
# da.RIDRETH1x.value_counts()
da.columns

Index(['BPXSY1', 'RIDAGEYR', 'RIAGENDR', 'RIDRETH1', 'DMDEDUC2', 'BMXBMI',
       'SMQ020', 'RIDRETH1x'],
      dtype='object')

In [9]:
model = sm.OLS.from_formula("BMXBMI ~ RIDAGEYR + RIAGENDR + RIDRETH1x", data=da)
result = model.fit()
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | BMXBMI | R-squared: | 0.055 |
| Model: | OLS | Adj. R-squared: | 0.054 |
| Method: | Least Squares | F-statistic: | 49.27 |
| Date: | Thu, 15 Jun 2023 | Prob (F-statistic): | 3.98e-59 |
| Time: | 10:56:26 | Log-Likelihood: | -17007.0 |
| No. Observations: | 5102 | AIC: | 34030.0 |
| Df Residuals: | 5095 | BIC: | 34070.0 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 29.1908 | 0.456 | 63.988 | 0.000 | 28.296 | 30.085 |
| RIDRETH1x[T.Non-Hispanic Black] | -0.4499 | 0.308 | -1.460 | 0.144 | -1.054 | 0.154 |
| RIDRETH1x[T.Non-Hispanic White] | -1.8555 | 0.282 | -6.588 | 0.000 | -2.408 | -1.303 |
| RIDRETH1x[T.Other Hispanic] | -0.9379 | 0.345 | -2.721 | 0.007 | -1.614 | -0.262 |
| RIDRETH1x[T.Other Race] | -4.7799 | 0.334 | -14.318 | 0.000 | -5.434 | -4.125 |
| RIDAGEYR | 0.0065 | 0.005 | 1.196 | 0.232 | -0.004 | 0.017 |
| RIAGENDR | 1.0226 | 0.190 | 5.370 | 0.000 | 0.649 | 1.396 |

| | | | |
|---|---|---|---|
| Omnibus: | 917.09 | Durbin-Watson: | 2.006 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1855.286 |
| Skew: | 1.075 | Prob(JB): | 0.0 |
| Kurtosis: | 5.026 | Cond. No. | 323.0 |


__Q2a.__ How did the mean relationship between BMI and age change when you added additional covariates to the model?

The age coefficient decreased slightly, from 0.0091 to 0.0065, and its p-value rose from 0.099 to 0.232, so it is even further from statistical significance than before.

__Q2b.__ How did the standard error for the regression parameter for age change when you added additional covariates to the model?

The standard error for age decreased slightly, from 0.006 to 0.005, so the age coefficient is estimated a little more precisely.

__Q2c.__ How much additional variation in BMI is explained by age, gender, and ethnicity that is not explained by age alone?

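The R-squared rose from 0.001 with age alone to 0.055 with age, gender, and ethnicity, so the added covariates explain about 5.4 percentage points of additional variation in BMI.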

__Q2d.__ What reference level did the software select for the ethnicity variable?

Mexican American. It is the only ethnicity level that has no coefficient of its own in the table.

__Q2e.__ What is the expected difference between the BMI of a 40 year-old non-Hispanic black man and a 30 year-old non-Hispanic black man?

Gender and ethnicity are the same for both men, so those terms cancel and only the age term differs.
The expected difference is the age coefficient times the age gap: 0.0065 * 10 = 0.065 BMI units,
with the 40 year-old having the higher expected BMI.

__Q2f.__ What is the expected difference between the BMI of a 50 year-old Mexican American woman and a 50 year-old non-Hispanic black man?
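
One way to approach Q2f is to plug the two profiles into the fitted equation and subtract. The sketch below copies the coefficients from the Question 2 summary table; the helper `expected_bmi` is a hypothetical name introduced here, and the gender coding (1 = male, 2 = female) is an assumption taken from the NHANES codebook for `RIAGENDR`, since the model above used the numeric column directly.

```python
# Coefficients copied from the Question 2 summary table.
coef = {
    "Intercept": 29.1908,
    "Non-Hispanic Black": -0.4499,
    "Non-Hispanic White": -1.8555,
    "Other Hispanic": -0.9379,
    "Other Race": -4.7799,
    "RIDAGEYR": 0.0065,
    "RIAGENDR": 1.0226,
}

def expected_bmi(age, gender_code, ethnicity):
    """Fitted mean BMI. gender_code follows RIAGENDR (assumed 1 = male, 2 = female).

    Mexican American is the reference level, so it adds nothing to the mean.
    """
    eth_term = 0.0 if ethnicity == "Mexican American" else coef[ethnicity]
    return (coef["Intercept"]
            + coef["RIDAGEYR"] * age
            + coef["RIAGENDR"] * gender_code
            + eth_term)

# Q2f: 50 year-old Mexican American woman vs 50 year-old non-Hispanic black man.
diff_q2f = (expected_bmi(50, 2, "Mexican American")
            - expected_bmi(50, 1, "Non-Hispanic Black"))

# Q2e, for comparison: same gender and ethnicity, ages 40 and 30.
diff_q2e = (expected_bmi(40, 1, "Non-Hispanic Black")
            - expected_bmi(30, 1, "Non-Hispanic Black"))

print(round(diff_q2f, 4), round(diff_q2e, 4))
```

The intercept and age terms cancel in Q2f, leaving one unit of the gender coefficient plus the negated non-Hispanic black coefficient: 1.0226 + 0.4499 = 1.4725 BMI units.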

## Question 3: 

Randomly sample 25% of the NHANES data, then fit the same model you used in question 2 to this data set.

In [None]:
# enter your code here

__Q3a.__ How do the estimated regression coefficients and their standard errors compare between these two models?  Do you see any systematic relationship between the two sets of results?
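
As a rough illustration of what to look for in Q3a, the sketch below fits the same least squares model to a full data set and to a random 25% subsample, then compares the standard errors. It is written with plain NumPy on synthetic data rather than statsmodels on NHANES, so every number in it is made up; `ols_fit` is a hypothetical helper. In the notebook itself, something like `da.sample(frac=0.25)` followed by the Question 2 formula would play the same role.

```python
import numpy as np

def ols_fit(X, y):
    """Return OLS coefficients and their standard errors for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof               # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)      # covariance of the estimates
    return beta, np.sqrt(np.diag(cov))

# Synthetic stand-in for the NHANES variables (values are illustrative only).
rng = np.random.default_rng(0)
n = 5000
age = rng.uniform(18, 80, size=n)
female = rng.integers(0, 2, size=n)
bmi = 29 + 0.01 * age + 1.0 * female + rng.normal(0, 7, size=n)
X = np.column_stack([np.ones(n), age, female])

beta_full, se_full = ols_fit(X, bmi)

# Random 25% subsample, drawn without replacement.
idx = rng.choice(n, size=n // 4, replace=False)
beta_sub, se_sub = ols_fit(X[idx], bmi[idx])

se_ratio = se_sub / se_full
print(se_ratio)
```

Standard errors scale roughly with one over the square root of the sample size, so with a quarter of the data each ratio should land near 2, while the coefficients themselves move around without any systematic direction.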

## Question 4:

Generate a scatterplot of the residuals against the fitted values for the model you fit in question 2.

In [None]:
# enter your code here

__Q4a.__ What mean/variance relationship do you see?

## Question 5: 

Generate a plot showing the fitted mean BMI as a function of age for Mexican American men.  Include a 95% simultaneous confidence band on your graph.

In [None]:
# enter your code here

__Q5a.__ According to your graph, what is the longest interval starting at year 30 following which the mean BMI could be constant?  *Hint:* What is the longest horizontal line starting at age 30 that remains within the confidence band?

__Q5b.__ Add an additional line and confidence band to the same plot, showing the relationship between age and BMI for Mexican American women.  At what ages do these intervals not overlap?

## Question 6:

Use an added variable plot to assess the linearity of the relationship between BMI and age (when controlling for gender and ethnicity).

In [None]:
# enter your code here

__Q6a.__ What is your interpretation of the added variable plot?

## Question 7: 

Generate a binary variable reflecting whether a person has had at least 12 drinks in their lifetime, based on the [ALQ110](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/ALQ_I.htm#ALQ110) variable in NHANES.  Calculate the marginal probability, odds, and log odds of this variable for women and for men.  Then calculate the odds ratio for females relative to males.

In [None]:
# enter your code here
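
The quantities asked for here are simple transformations of a proportion. The sketch below shows the arithmetic on made-up counts, not on NHANES; in the notebook the counts would come from the binary variable built from `ALQ110`, and `summarize` is a hypothetical helper name.

```python
import math

# Hypothetical counts, NOT taken from NHANES: (drinkers, non-drinkers) by gender.
counts = {"Female": (600, 400), "Male": (750, 250)}

def summarize(yes, no):
    """Marginal probability, odds, and log odds of a binary outcome."""
    p = yes / (yes + no)
    odds = p / (1 - p)
    return {"prob": p, "odds": odds, "log_odds": math.log(odds)}

stats = {gender: summarize(*c) for gender, c in counts.items()}

# Odds ratio for females relative to males.
odds_ratio = stats["Female"]["odds"] / stats["Male"]["odds"]

for gender, s in stats.items():
    print(gender, s)
print("odds ratio (F vs M):", odds_ratio)
```

A log odds above zero corresponds to a probability above 50%, which is the fact Q7a relies on.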

__Q7a.__ Based on the log odds alone, do more than 50% of women drink alcohol?

__Q7b.__ Does there appear to be an important difference between the alcohol use rate of women and men?

## Question 8: 

Use logistic regression to express the log odds that a person drinks (based on the binary drinking variable that you constructed above) in terms of gender.

In [None]:
# enter your code here

__Q8a.__ Is there statistical evidence that the drinking rate differs between women and men?  If so, in what direction is there a difference?

__Q8b.__ Confirm that the log odds ratio between drinking and gender calculated using the logistic regression model matches the log odds ratio calculated directly in question 7.
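
For Q8b, it may help to see why the two numbers should agree: a logistic regression with a single binary covariate reproduces the log odds ratio from the corresponding 2x2 table. The sketch below checks this on made-up counts by writing out Newton's method by hand instead of calling statsmodels. The counts and the 0/1 coding of gender are assumptions for illustration only.

```python
import math

# Hypothetical counts, NOT from NHANES: (drinkers, total) for each group.
# x = 0 codes men and x = 1 codes women (an illustrative coding).
groups = {0: (700, 1000), 1: (550, 1000)}

def logit(p):
    return math.log(p / (1 - p))

def expit(z):
    return 1 / (1 + math.exp(-z))

# Direct calculation from the 2x2 table: log odds for women minus log odds for men.
direct_log_or = (logit(groups[1][0] / groups[1][1])
                 - logit(groups[0][0] / groups[0][1]))

# Newton's method for the model  log odds = b0 + b1 * x.
b0, b1 = 0.0, 0.0
for _ in range(25):
    g0 = g1 = h00 = h01 = 0.0
    for x, (s, n) in groups.items():
        p = expit(b0 + b1 * x)
        w = n * p * (1 - p)
        g0 += s - n * p            # score for the intercept
        g1 += x * (s - n * p)      # score for the slope
        h00 += w                   # information matrix entries
        h01 += x * w
    h11 = h01                      # x is 0/1, so x * x == x
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

print(b1, direct_log_or)
```

The fitted slope `b1` is the log odds ratio, and the intercept `b0` is the log odds in the group coded 0.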

## Question 9: 

Use logistic regression to relate drinking to age, gender, and education.

In [None]:
# enter your code here

__Q9a.__ Which of these predictor variables shows a statistically significant association with drinking?

__Q9b.__ What are the odds of a college educated, 50 year old woman drinking?

__Q9c.__ What is the odds ratio between the drinking status for college graduates and high school graduates (with no college), holding gender and age fixed?

__Q9d.__ Did the regression parameter for gender change to a meaningful degree when age and education were added to the model?
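
Q9b and Q9c both come down to exponentiating sums or differences of coefficients on the log odds scale. The sketch below shows the arithmetic with hypothetical coefficients, which are not fitted from NHANES; the level names, the 0/1 coding of gender, and the choice of reference education level are all assumptions for illustration.

```python
import math

# Hypothetical log-odds coefficients, NOT estimated from NHANES.
# Education reference level is assumed to be "less than high school".
coef = {
    "Intercept": 0.5,
    "age": -0.01,          # per year of age
    "female": -0.6,        # 1 = woman, 0 = man (illustrative coding)
    "educ_HS": 0.4,        # high school graduate, no college
    "educ_College": 0.9,   # college graduate
}

# Q9b-style calculation: odds for a college educated, 50 year old woman.
linear_predictor = (coef["Intercept"]
                    + coef["age"] * 50
                    + coef["female"] * 1
                    + coef["educ_College"])
odds_q9b = math.exp(linear_predictor)

# Q9c-style calculation: odds ratio, college vs high school graduates.
# Age and gender are held fixed, so their terms cancel in the difference.
odds_ratio_q9c = math.exp(coef["educ_College"] - coef["educ_HS"])

print(odds_q9b, odds_ratio_q9c)
```

With a fitted model, the same pattern applies after substituting the estimated parameters for the made-up values.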

## Question 10:

Construct a CERES plot for the relationship between drinking and age (using the model that controls for gender and educational attainment).

In [None]:
# enter your code here

__Q10a.__ Does the plot indicate any major non-linearity in the relationship between age and the log odds for drinking?