
# Practice notebook for univariate analysis using NHANES data

This notebook performs some univariate analyses on the NHANES data set.

To get started, import modules and read the data:

In [39]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import numpy as np

path = '/content/nhanes_2015_2016.csv'
da = pd.read_csv(path)

## Question 1

Marital status variable: [DMDMARTL](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDMARTL)

In [40]:
# inspect current state of DMDMARTL marital status column
print('Unique Values in Marital Status Column:', da.DMDMARTL.value_counts())
print('Null Values in Marital Status Column:', da.DMDMARTL.isnull().sum())

Unique Values in Marital Status Column: 1.0     2780
5.0     1004
3.0      579
6.0      527
2.0      396
4.0      186
77.0       2
Name: DMDMARTL, dtype: int64
Null Values in Marital Status Column: 261


In [42]:
# decode / rename marital status value labels per the NHANES codebook
da['DMDMARTL'] = da.DMDMARTL.replace({
    1: 'Married', 2: 'Widowed', 3: 'Divorced', 4: 'Separated',
    5: 'Never married', 6: 'Living with partner', 77: 'Refused'})
# label missing / NaN values as 'Missing'
da['DMDMARTL'] = da.DMDMARTL.fillna('Missing')

In [43]:
# create a frequency table of marital status for all subjects
x = da.DMDMARTL.value_counts()
print('Frequency Table: Marital Status (All Genders)\n',x/x.sum())

Frequency Table: Marital Status (All Genders)
 Married                0.484743
Never married          0.175065
Divorced               0.100959
Living with partner    0.091892
Widowed                0.069050
Missing                0.045510
Separated              0.032432
Refused                0.000349
Name: DMDMARTL, dtype: float64


In [44]:
# create a frequency table for marital status, females only
f = da[da['RIAGENDR'] == 2]
f = f.DMDMARTL.value_counts()
print('Frequency Table: Marital Status (Females)\n',f/f.sum())
# create a frequency table for marital status, males only
m = da[da['RIAGENDR'] == 1]
m = m.DMDMARTL.value_counts()
print('Frequency Table: Marital Status (Males)\n',m/m.sum())

Frequency Table: Marital Status (Females)
 Married                0.437836
Never married          0.174731
Divorced               0.117608
Widowed                0.099462
Living with partner    0.088038
Missing                0.042339
Separated              0.039651
Refused                0.000336
Name: DMDMARTL, dtype: float64
Frequency Table: Marital Status (Males)
 Married                0.535339
Never married          0.175426
Living with partner    0.096049
Divorced               0.083001
Missing                0.048931
Widowed                0.036245
Separated              0.024647
Refused                0.000362
Name: DMDMARTL, dtype: float64
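
As a possible alternative to filtering and printing one table per gender, the same proportions can be laid out side by side in a single table. The sketch below assumes only the `RIAGENDR` coding used in the filters above (1 = male, 2 = female); the `toy` frame and its values are made up for illustration and stand in for `da`.

```python
import pandas as pd

# Tiny made-up sample standing in for the NHANES frame `da` (illustrative values only).
toy = pd.DataFrame({
    "RIAGENDR": [1, 1, 1, 1, 2, 2, 2, 2],
    "DMDMARTL": ["Married", "Married", "Never married", "Divorced",
                 "Married", "Widowed", "Widowed", "Never married"],
})

# Give the gender codes readable labels (1 = male, 2 = female).
gender = toy["RIAGENDR"].map({1: "Male", 2: "Female"})

# normalize="index" divides each row by its row total, so every row sums to 1.
table = pd.crosstab(gender, toy["DMDMARTL"], normalize="index")
print(table)
```

Reading across a row gives the marital status distribution for one gender, and reading down a column compares the two genders on a single status.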


In [45]:
# marital status for people aged 30 to 40 (inclusive), all genders
a = da[(da['RIDAGEYR'] >= 30) & (da['RIDAGEYR'] <= 40)]
aa = a.DMDMARTL.value_counts()
print('Frequency Table: Marital Status (All Genders, Age: 30 to 40)\n',aa/aa.sum())
# create a frequency table for marital status, females only, age is between 30 and 40
f = a[a['RIAGENDR'] == 2]
f = f.DMDMARTL.value_counts()
print('Frequency Table: Marital Status (Females, Age: 30 to 40)\n',f/f.sum())
# create a frequency table for marital status, males only, age is between 30 and 40
m = a[a['RIAGENDR'] == 1]
m = m.DMDMARTL.value_counts()
print('Frequency Table: Marital Status (Males, Age: 30 to 40)\n',m/m.sum())

Frequency Table: Marital Status (All Genders, Age: 30 to 40)
 Married                0.545809
Never married          0.211501
Living with partner    0.139376
Divorced               0.068226
Separated              0.029240
Widowed                0.004873
Refused                0.000975
Name: DMDMARTL, dtype: float64
Frequency Table: Marital Status (Females, Age: 30 to 40)
 Married                0.535714
Never married          0.218045
Living with partner    0.122180
Divorced               0.086466
Separated              0.033835
Widowed                0.003759
Name: DMDMARTL, dtype: float64
Frequency Table: Marital Status (Males, Age: 30 to 40)
 Married                0.556680
Never married          0.204453
Living with partner    0.157895
Divorced               0.048583
Separated              0.024291
Widowed                0.006073
Refused                0.002024
Name: DMDMARTL, dtype: float64


__Q1a.__ Briefly comment on some of the differences that you observe between the distributions of marital status for women and for men, for people of all ages.

__Q1b.__ Briefly comment on the differences that you observe between the distribution of marital status for women in the overall population and for women between the ages of 30 and 40.

__Q1c.__ Repeat part b for the men.

## Question 2

Restricting to the female population, stratify the subjects into age bands no wider than ten years, and construct the distribution of marital status within each age band.  Within each age band, present the distribution in terms of proportions that must sum to 1.
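
One way to approach the stratification is sketched below, assuming `pd.cut` is used to form the bands. The `toy` frame, the `agegrp` column name, and the bin edges are illustrative assumptions; the real data would need edges that cover the full age range of the sample.

```python
import pandas as pd

# Made-up rows standing in for `da`; RIAGENDR == 2 marks females as in the cells above.
toy = pd.DataFrame({
    "RIAGENDR": [2, 2, 2, 2, 2, 2, 1],
    "RIDAGEYR": [22, 25, 34, 38, 45, 47, 30],
    "DMDMARTL": ["Never married", "Married", "Married", "Married",
                 "Divorced", "Married", "Married"],
})

women = toy[toy["RIAGENDR"] == 2].copy()

# Ten-year bands [20, 30), [30, 40), [40, 50); right=False makes the left edge inclusive.
edges = [20, 30, 40, 50]
women["agegrp"] = pd.cut(women["RIDAGEYR"], bins=edges, right=False)

# One row per age band; normalize="index" makes each row sum to 1.
by_band = pd.crosstab(women["agegrp"], women["DMDMARTL"], normalize="index")
print(by_band)
```

Note that an age falling outside the supplied edges is assigned `NaN` by `pd.cut` and would silently drop out of the table, so it is worth checking the band counts against the number of rows.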

In [None]:
# insert your code here

__Q2a.__ Comment on the trends that you see in this series of marginal distributions.

__Q2b.__ Repeat the construction for males.

In [None]:
# insert your code here

__Q2c.__ Comment on any notable differences that you see when comparing these results for females and for males.

## Question 3

Construct a histogram of the distribution of heights using the BMXHT variable in the NHANES sample.

In [None]:
# insert your code here

__Q3a.__ Use the `bins` argument to [distplot](https://seaborn.pydata.org/generated/seaborn.distplot.html) (deprecated in recent seaborn releases in favor of `histplot`, which accepts the same argument) to produce histograms with different numbers of bins.  Assess whether the default value for this argument gives a meaningful result, and comment on what happens as the number of bins grows excessively large or excessively small.
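
The effect of the bin count can also be inspected numerically, without drawing anything. The sketch below uses `numpy.histogram` on simulated heights (not NHANES values; the mean, spread, and sample size are arbitrary assumptions) and records how the bin width, the tallest bar, and the number of empty bins change.

```python
import numpy as np

# Simulated heights in cm; these are NOT NHANES measurements.
rng = np.random.default_rng(0)
heights = rng.normal(loc=166, scale=10, size=500)

summary = {}
for bins in (3, 20, 400):
    counts, edges = np.histogram(heights, bins=bins)
    summary[bins] = {
        "width": edges[1] - edges[0],            # all bins share one width
        "max_count": int(counts.max()),          # height of the tallest bar
        "empty_bins": int((counts == 0).sum()),  # bars with no observations
    }
    print(bins, summary[bins])
```

With very few bins the whole sample collapses into a handful of wide bars, while with very many bins each bar holds only a few observations and the outline becomes noisy; the same `bins` values can be passed to the plotting call to see this visually.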

__Q3b.__ Make separate histograms for the heights of women and men, then make a side-by-side boxplot showing the heights of women and men.

In [None]:
# insert your code here

__Q3c.__ Comment on what features, if any, are not represented clearly in the boxplots, and what features, if any, are easier to see in the boxplots than in the histograms.

## Question 4

Make a boxplot showing the distribution of within-subject differences between the first and second systolic blood pressure measurements ([BPXSY1](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPX_I.htm#BPXSY1) and [BPXSY2](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPX_I.htm#BPXSY2)).

In [None]:
# insert your code here

__Q4a.__ What proportion of the subjects have a lower SBP on the second reading compared to the first?
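
A minimal sketch of the proportion calculation follows. It assumes the difference is defined as first reading minus second (the opposite sign convention works equally well) and that subjects missing either reading are excluded; the `toy` values are made up.

```python
import numpy as np
import pandas as pd

# Made-up readings standing in for `da`; one subject is missing the first reading.
toy = pd.DataFrame({
    "BPXSY1": [120.0, 130.0, 118.0, np.nan, 140.0],
    "BPXSY2": [118.0, 132.0, 118.0, 110.0, 136.0],
})

# Positive values mean the second reading was lower than the first.
# dropna() removes subjects for whom either reading is missing.
diff = (toy["BPXSY1"] - toy["BPXSY2"]).dropna()

# The mean of a boolean Series is the proportion of True values.
prop_lower_second = (diff > 0).mean()
print(prop_lower_second)
```

Dropping the missing pairs first matters: comparing against `NaN` yields `False`, so leaving those rows in would count them in the denominator and understate the proportion.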

In [None]:
# insert your code here

__Q4b.__ Make side-by-side boxplots of the two systolic blood pressure variables.

In [None]:
# insert your code here

__Q4c.__ Comment on the variation within either the first or second systolic blood pressure measurements, and the variation in the within-subject differences between the first and second systolic blood pressure measurements.

## Question 5

Construct a frequency table of household sizes for people within each educational attainment category (the relevant variable is [DMDEDUC2](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDEDUC2)).  Convert the frequencies to proportions.
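
The sketch below shows one possible construction using `groupby` with `value_counts(normalize=True)`. It assumes the household size column is named `DMDHHSIZ`, which is not stated in this notebook and should be checked against the codebook; the `toy` values are made up.

```python
import pandas as pd

# Made-up rows; DMDHHSIZ as the household size column name is an assumption.
toy = pd.DataFrame({
    "DMDEDUC2": [3, 3, 3, 3, 5, 5, 5, 5],
    "DMDHHSIZ": [2, 2, 4, 5, 1, 2, 2, 2],
})

# Proportion of each household size within each education level.
# unstack() pivots household size into columns; sizes never seen at a level become 0.
props = (
    toy.groupby("DMDEDUC2")["DMDHHSIZ"]
       .value_counts(normalize=True)
       .unstack(fill_value=0)
)
print(props)
```

Each row is the household size distribution for one education level, so rows can be compared directly even when the levels have very different sample sizes.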

In [None]:
# insert your code here

__Q5a.__ Comment on any major differences among the distributions.

__Q5b.__ Restrict the sample to people between 30 and 40 years of age.  Then calculate the median household size for women and men within each level of educational attainment.
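
A possible shape for this calculation is sketched below, again assuming `DMDHHSIZ` holds the household size (an assumption to verify against the codebook) and treating the 30 to 40 range as inclusive at both ends, as in the earlier cells. The `toy` values are made up.

```python
import pandas as pd

# Made-up rows standing in for `da`.
toy = pd.DataFrame({
    "RIDAGEYR": [31, 35, 39, 33, 36, 52, 30, 40],
    "RIAGENDR": [1, 1, 1, 2, 2, 2, 2, 1],
    "DMDEDUC2": [3, 3, 5, 3, 3, 3, 5, 5],
    "DMDHHSIZ": [2, 4, 1, 3, 5, 7, 2, 3],
})

# between() is inclusive on both ends by default, so ages 30 and 40 are kept.
sub = toy[toy["RIDAGEYR"].between(30, 40)]

# Median household size per (education, gender); unstack() puts gender in columns.
medians = sub.groupby(["DMDEDUC2", "RIAGENDR"])["DMDHHSIZ"].median().unstack()
print(medians)
```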

In [None]:
# insert your code here

## Question 6

The participants can be clustered into "masked variance units" (MVU) based on every combination of the variables [SDMVSTRA](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#SDMVSTRA) and [SDMVPSU](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#SDMVPSU).  Calculate the mean age ([RIDAGEYR](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#RIDAGEYR)), height ([BMXHT](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm#BMXHT)), and BMI ([BMXBMI](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm#BMXBMI)) for each gender ([RIAGENDR](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#RIAGENDR)), within each MVU, and report the ratio between the largest and smallest mean (e.g. for height) across the MVUs.
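
The grouping described above might be sketched as follows: group by the two design variables plus gender, take the means, then compare the largest and smallest mean within each gender. The `toy` frame contains made-up values and only two MVUs, and age is omitted to keep the example short.

```python
import pandas as pd

# Made-up rows: two MVUs, (119, 1) and (120, 2), each with two males and two females.
toy = pd.DataFrame({
    "SDMVSTRA": [119, 119, 119, 119, 120, 120, 120, 120],
    "SDMVPSU":  [1, 1, 1, 1, 2, 2, 2, 2],
    "RIAGENDR": [1, 1, 2, 2, 1, 1, 2, 2],
    "BMXHT":    [170.0, 174.0, 160.0, 162.0, 176.0, 180.0, 158.0, 160.0],
    "BMXBMI":   [24.0, 26.0, 28.0, 30.0, 25.0, 27.0, 29.0, 33.0],
})

# One row of means per (stratum, PSU, gender) combination.
keys = ["SDMVSTRA", "SDMVPSU", "RIAGENDR"]
means = toy.groupby(keys)[["BMXHT", "BMXBMI"]].mean()

# Within each gender, ratio of the largest MVU mean to the smallest MVU mean.
by_gender = means.groupby(level="RIAGENDR")
ratio = by_gender.max() / by_gender.min()
print(ratio)
```

A ratio close to 1 indicates that the MVU means are nearly identical for that variable and gender.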

In [None]:
# insert your code here

__Q6a.__ Comment on the extent to which mean age, height, and BMI vary among the MVUs.

__Q6b.__ Calculate the inter-quartile range (IQR) for age, height, and BMI for each gender and each MVU.  Report the ratio between the largest and smallest IQR across the MVUs.
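
The IQR is the 75th percentile minus the 25th percentile, so one way to compute it per group is a small helper passed to `agg`. In the sketch below the helper name `iqr` and the `toy` values are illustrative, and only height is shown; pandas' default linear interpolation is used for the quantiles.

```python
import pandas as pd

def iqr(x):
    """Inter-quartile range: 75th percentile minus 25th percentile."""
    return x.quantile(0.75) - x.quantile(0.25)

# Made-up heights: two MVUs, five males and five females in each.
toy = pd.DataFrame({
    "SDMVSTRA": [119] * 10 + [120] * 10,
    "SDMVPSU":  [1] * 20,
    "RIAGENDR": ([1] * 5 + [2] * 5) * 2,
    "BMXHT": [170, 172, 174, 176, 178,   # MVU (119, 1), males
              160, 161, 162, 163, 164,   # MVU (119, 1), females
              165, 170, 175, 180, 185,   # MVU (120, 1), males
              155, 158, 161, 164, 167],  # MVU (120, 1), females
})

# One IQR per (stratum, PSU, gender) combination.
iqrs = toy.groupby(["SDMVSTRA", "SDMVPSU", "RIAGENDR"])["BMXHT"].agg(iqr)

# Within each gender, ratio of the largest MVU IQR to the smallest.
by_gender = iqrs.groupby(level="RIAGENDR")
ratio = by_gender.max() / by_gender.min()
print(ratio)
```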

In [None]:
# insert your code here

__Q6c.__ Comment on the extent to which the IQRs for age, height, and BMI vary among the MVUs.