# Analyzing Health and Nutrition Data

We'll start by importing the relevant _libraries_ (i.e., prebuilt collections of Python code that provide useful functions).

In [7]:
import pandas as pd                  # data manipulation
import numpy as np                   # mathematical operations

import matplotlib.pyplot as plt      # plotting tools
import geopandas as gpd              # geographic mapping tools
import contextily as ctx             # basemap tiles for map plots

from urllib.request import urlopen   # reading from web URLs
import json                          # JSON reader
import xport                         # reader for SAS XPORT (.XPT) files, the format NHANES distributes

## Exploratory Data Analysis

### Demographics

We'll import demographic data from the [National Health and Nutrition Examination Survey](https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Demographics&CycleBeginYear=2017) (NHANES) into a pandas dataframe. We can start by looking at the underlying demographics of the survey participants.

In [15]:
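# Alternative remote location (not used below):
# download_url = "https://raw.githubusercontent.com/annahaensch/DataAndSocialJustice/main/Data/Health_and_Nutrition/DEMO_I.XPT"

# Path to the local copy of the demographics file.
path = "../Data/Health_and_Nutrition/DEMO_I.XPT"

# XPT files are binary, so open in 'rb' mode.
with open(path, 'rb') as f:
    df = xport.to_dataframe(f)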

In [16]:
df.head()

,SEQN,SDDSRVYR,RIDSTATR,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDRETH1,RIDRETH3,RIDEXMON,RIDEXAGM,...,DMDHREDZ,DMDHRMAZ,DMDHSEDZ,WTINT2YR,WTMEC2YR,SDMVPSU,SDMVSTRA,INDHHIN2,INDFMIN2,INDFMPIR
0,93703.0,10.0,2.0,2.0,2.0,,5.0,6.0,2.0,27.0,...,3.0,1.0,3.0,9246.491865,8539.731348,2.0,145.0,15.0,15.0,5.0
1,93704.0,10.0,2.0,1.0,2.0,,3.0,3.0,1.0,33.0,...,3.0,1.0,2.0,37338.768343,42566.61475,1.0,143.0,15.0,15.0,5.0
2,93705.0,10.0,2.0,2.0,66.0,,4.0,4.0,2.0,,...,1.0,2.0,,8614.571172,8338.419786,2.0,145.0,3.0,3.0,0.82
3,93706.0,10.0,2.0,1.0,18.0,,5.0,6.0,2.0,222.0,...,3.0,1.0,2.0,8548.632619,8723.439814,2.0,134.0,,,
4,93707.0,10.0,2.0,1.0,13.0,,5.0,7.0,2.0,158.0,...,2.0,1.0,3.0,6769.344567,7064.60973,1.0,138.0,10.0,10.0,1.88
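
The column names in this preview are terse variable codes. As a minimal sketch, the cell below rebuilds four columns from the five rows shown above and gives them descriptive names. The new names (`respondent_id`, `gender_code`, `age_years`, `income_to_poverty_ratio`) are our own hypothetical choices, loosely based on the variable labels in the NHANES documentation, and are not part of the dataset.

```python
import pandas as pd

# First five rows of four columns, copied from the preview above.
preview = pd.DataFrame(
    {
        "SEQN": [93703.0, 93704.0, 93705.0, 93706.0, 93707.0],
        "RIAGENDR": [2.0, 1.0, 2.0, 1.0, 1.0],
        "RIDAGEYR": [2.0, 2.0, 66.0, 18.0, 13.0],
        "INDFMPIR": [5.0, 5.0, 0.82, None, 1.88],  # None becomes NaN
    }
)

# Hypothetical descriptive names (an assumption, chosen for readability).
readable_names = {
    "SEQN": "respondent_id",
    "RIAGENDR": "gender_code",
    "RIDAGEYR": "age_years",
    "INDFMPIR": "income_to_poverty_ratio",
}

tidy = preview.rename(columns=readable_names)

# The preview shows the ID as a float; it is a whole number, so cast it.
tidy["respondent_id"] = tidy["respondent_id"].astype(int)

print(tidy)
```

The same `rename` call should work on the full `df`, since it only touches the columns listed in the mapping.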


To understand the column headings, consult the [NHANES documentation](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm). As you read it, consider: who is missing from this data, and why? How should that affect our analysis? We can view summary statistics for the dataframe with:

In [20]:
df.describe()

,SEQN,SDDSRVYR,RIDSTATR,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDRETH1,RIDRETH3,RIDEXMON,RIDEXAGM,...,DMDHREDZ,DMDHRMAZ,DMDHSEDZ,WTINT2YR,WTMEC2YR,SDMVPSU,SDMVSTRA,INDHHIN2,INDFMIN2,INDFMPIR
count,9254.0,9254.0,9254.0,9254.0,9254.0,597.0,9254.0,9254.0,8704.0,3433.0,...,8764.0,9063.0,4751.0,9254.0,9254.0,9254.0,9254.0,8763.0,8780.0,8023.0
mean,98329.5,10.0,1.940566,1.507564,34.334234,10.437186,3.233953,3.49719,1.517348,107.475677,...,2.050776,1.472691,2.110714,34670.706829,34670.706829,1.517614,140.965853,12.500057,12.202506,2.37549
std,2671.544029,0.0,0.236448,0.49997,25.50028,7.09297,1.27765,1.700961,0.499728,70.618237,...,0.652806,0.721168,0.688517,41356.667327,43343.996803,0.499717,4.200801,17.307571,17.155294,1.600291
min,93703.0,10.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,...,1.0,1.0,1.0,2571.068712,0.0,1.0,134.0,1.0,1.0,0.0
25%,96016.25,10.0,2.0,1.0,11.0,4.0,3.0,3.0,1.0,43.0,...,2.0,1.0,2.0,13074.433246,12347.31189,1.0,137.0,6.0,6.0,1.04
50%,98329.5,10.0,2.0,2.0,31.0,10.0,3.0,3.0,2.0,106.0,...,2.0,1.0,2.0,21098.45426,21059.894454,2.0,141.0,8.0,8.0,1.92
75%,100642.75,10.0,2.0,2.0,58.0,17.0,4.0,4.0,2.0,166.0,...,2.0,2.0,3.0,36923.316352,37561.99802,2.0,145.0,14.0,14.0,3.69
max,102956.0,10.0,2.0,2.0,80.0,24.0,5.0,7.0,2.0,239.0,...,3.0,3.0,3.0,433085.005262,419762.836488,2.0,148.0,99.0,99.0,5.0
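
The `count` row is a quick way to see which values are missing, because `describe()` counts only non-missing entries. As a rough sketch, the cell below copies a few of those counts from the summary above and converts them into the share of rows with no value. It assumes that 9254 (the count for `SEQN`) is the total number of rows.

```python
import pandas as pd

n_rows = 9254  # count reported for SEQN; assumed to equal the number of rows

# Non-missing counts copied from the `count` row of the summary above.
non_missing = pd.Series(
    {
        "RIDAGEYR": 9254,
        "RIDAGEMN": 597,
        "RIDEXMON": 8704,
        "RIDEXAGM": 3433,
        "DMDHSEDZ": 4751,
        "INDFMPIR": 8023,
    }
)

# Share of rows with no value, largest first.
missing_share = (1 - non_missing / n_rows).sort_values(ascending=False)

print(missing_share.round(3))
```

On the full dataframe, `df.isna().mean()` should give the same per-column shares directly. `RIDAGEMN` is empty for more than 90% of rows, and its maximum of 24 in the summary suggests it is recorded only for the youngest participants. The documentation is the place to confirm why a column is empty before treating the gap as a data problem.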


For any individual column, we can count how often each value occurs with:

In [22]:
df["RIAGENDR"].value_counts()

2.0    4697
1.0    4557
Name: RIAGENDR, dtype: int64
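
The codes `1.0` and `2.0` are hard to interpret on their own. The sketch below attaches labels to the counts shown above and converts them to shares. The mapping (1 = Male, 2 = Female) is our reading of the NHANES demographics documentation, so treat it as an assumption and check it against the documentation for the cycle you are using.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({2.0: 4697, 1.0: 4557}, name="RIAGENDR")

# Assumed code labels; verify against the NHANES documentation.
gender_labels = {1.0: "Male", 2.0: "Female"}

# Replace the numeric codes in the index with readable labels.
labelled = counts.rename(index=gender_labels)

# Convert raw counts to shares of all respondents.
shares = labelled / labelled.sum()

print(shares.round(4))
```

On the full dataframe, `df["RIAGENDR"].map(gender_labels).value_counts(normalize=True)` should produce the same shares in one step. These are unweighted respondent counts. NHANES also supplies sample weights (the `WTINT2YR` and `WTMEC2YR` columns), which matter if the goal is to estimate population proportions rather than describe the sample.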

### Dietary Data

We'll import dietary survey data from the [National Health and Nutrition Examination Survey](https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Demographics&CycleBeginYear=2017) into a second pandas dataframe. To understand the data, we can consult the accompanying [documentation](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DR2IFF_I.htm).

In [23]:
# Path to the local copy of the dietary file.
path = "../Data/Health_and_Nutrition/DR2IFF_I.XPT"

# XPT files are binary, so open in 'rb' mode.
with open(path, 'rb') as f:
    df_diet = xport.to_dataframe(f)

In [25]:
df_diet

,SEQN,WTDRD1,WTDR2D,DR2ILINE,DR2DRSTZ,DR2EXMER,DRABF,DRDINT,DR2DBIH,DR2DAY,...,DR2IM181,DR2IM201,DR2IM221,DR2IP182,DR2IP183,DR2IP184,DR2IP204,DR2IP205,DR2IP225,DR2IP226
0,83732.0,92670.699919,69945.934107,1.0,1.0,87.0,2.0,2.0,2.0,4.0,...,0.000,0.000,0.000,0.003,0.000,0.0,0.000,0.000,0.000,0.000
1,83732.0,92670.699919,69945.934107,2.0,1.0,87.0,2.0,2.0,2.0,4.0,...,0.000,0.000,0.000,0.000,0.000,0.0,0.000,0.000,0.000,0.000
2,83732.0,92670.699919,69945.934107,3.0,1.0,87.0,2.0,2.0,2.0,4.0,...,0.191,0.000,0.000,0.007,0.001,0.0,0.000,0.000,0.000,0.000
3,83732.0,92670.699919,69945.934107,4.0,1.0,87.0,2.0,2.0,2.0,4.0,...,0.000,0.000,0.000,0.000,0.000,0.0,0.000,0.000,0.000,0.000
4,83732.0,92670.699919,69945.934107,5.0,1.0,87.0,2.0,2.0,2.0,4.0,...,10.223,0.158,0.002,5.934,0.544,0.0,0.178,0.000,0.013,0.044
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100675,93702.0,67560.380806,55166.938286,13.0,1.0,87.0,2.0,2.0,4.0,3.0,...,1.224,0.019,0.001,1.027,0.144,0.0,0.014,0.011,0.004,0.082
100676,93702.0,67560.380806,55166.938286,15.0,1.0,87.0,2.0,2.0,4.0,3.0,...,0.001,0.000,0.000,0.007,0.018,0.0,0.000,0.000,0.000,0.000
100677,93702.0,67560.380806,55166.938286,16.0,1.0,87.0,2.0,2.0,4.0,3.0,...,0.037,0.000,0.000,0.098,0.004,0.0,0.000,0.000,0.000,0.000
100678,93702.0,67560.380806,55166.938286,17.0,1.0,87.0,2.0,2.0,4.0,3.0,...,9.111,0.025,0.000,1.682,0.126,0.0,0.000,0.000,0.000,0.000
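
Unlike the demographics table, this dataframe repeats each `SEQN` across several rows, one per `DR2ILINE` value, so per-participant figures require grouping first. The sketch below illustrates the idea using only the nine rows visible above and a single nutrient column. The output names `n_items` and `total_DR2IM181` are hypothetical, and the totals are partial because the display is truncated.

```python
import pandas as pd

# Nine rows copied from the display above: participant ID, line number,
# and one nutrient column.
items = pd.DataFrame(
    {
        "SEQN": [83732.0] * 5 + [93702.0] * 4,
        "DR2ILINE": [1.0, 2.0, 3.0, 4.0, 5.0, 13.0, 15.0, 16.0, 17.0],
        "DR2IM181": [0.000, 0.000, 0.191, 0.000, 10.223,
                     1.224, 0.001, 0.037, 9.111],
    }
)

# Collapse the item-level rows to one row per participant.
per_person = items.groupby("SEQN").agg(
    n_items=("DR2ILINE", "count"),       # number of rows for this participant
    total_DR2IM181=("DR2IM181", "sum"),  # sum of the nutrient over those rows
)

print(per_person)
```

The same `groupby("SEQN")` pattern should apply to `df_diet` as a whole. One thing to check before joining the result to the demographics table: the `SEQN` values displayed here run from 83732 to 93702, while the demographics summary reports a minimum `SEQN` of 93703, so it is worth confirming that both files come from the same survey cycle.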
