## Project 1 (Due 2/17)

The goal of the first project is to do some wrangling, EDA, and visualization, and generate sequences of values. We will focus on:

- CDC National Health and Nutrition Examination Survey (NHANES, 1999-2000): https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=1999
- CDC Linked Mortality File (LMF, 1999-2000): https://www.cdc.gov/nchs/data-linkage/mortality-public.htm

NHANES is a rich panel dataset on health and behavior, collected in two-year cycles from 1999 to now. We will focus on the 1999--2000 wave, because it has the longest follow-up window and therefore the richest mortality data. The mortality data is provided by the CDC Linked Mortality File.

The purpose of the project is to use $k$-NN to predict who dies (hard or soft classification) and how long they live (regression).

### Day 1: Wrangling and EDA (40/100 pts)

First, go to the NHANES and LMF web sites and familiarize yourself with the data sources. Download the codebooks. Think about what resources are available. The CDC Linked Mortality File is somewhat of a pain to work with, so I have pre-cleaned it for you. It is available at https://github.com/ds4e/undergraduate_ml_assignments in the data folder, as `lmf_parsed.csv`. From the CDC LMF web page, get the SAS program that loads the data; it is the real codebook.

Second, download the demographic data for the 1999--2000 wave from the NHANES page. You can use the following code chunk to merge the LMF and DEMO data:

``` python
import pandas as pd
mdf = pd.read_csv('lmf_parsed.csv') # Load mortality file
print( mdf.head() )
gdf = pd.read_sas("DEMO.xpt", format="xport") # Load demographics file
print( gdf.head() )
df = gdf.merge(mdf, on="SEQN", how="inner") # Merge mortality and demographics on SEQN variable
```
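
Before relying on the merged frame, it can help to check that `SEQN` is unique on both sides and to see how many rows matched. The sketch below uses tiny made-up frames in place of the real files, so the values are illustrative only; it assumes `SEQN` arrives as a float from `read_sas`, which is how XPT numeric columns are typically loaded.

```python
import pandas as pd

# Tiny stand-in frames; the real ones come from DEMO.xpt and lmf_parsed.csv
gdf = pd.DataFrame({"SEQN": [1.0, 2.0, 3.0], "RIDAGEEX": [250.0, 610.0, 130.0]})
mdf = pd.DataFrame({"SEQN": [2, 3, 4], "MORTSTAT": [1.0, None, 0.0]})

# Cast the key so both sides share an integer dtype
gdf["SEQN"] = gdf["SEQN"].astype("int64")

# validate= raises a MergeError if SEQN repeats on either side;
# indicator= adds a _merge column recording where each row came from
check = gdf.merge(mdf, on="SEQN", how="outer", validate="one_to_one", indicator=True)
merge_counts = check["_merge"].value_counts()
print(merge_counts)

# Keep only the rows present in both files (equivalent to how="inner")
df = check[check["_merge"] == "both"].drop(columns="_merge")
print(df)
```

Rows tagged `left_only` or `right_only` are the ones an inner merge silently drops, so the counts are worth recording in the write-up.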

Third, the variables `ELIGSTAT`, `MORTSTAT`, `PERMTH_INT`, and `RIDAGEEX` are particularly important. Look them up in the documentation and clearly describe them. (5/100 pts.)

Fourth, the goal of the project is to use whatever demographic, behavioral, and health data you like to predict mortality (`MORTSTAT`) and life expectancy (`PERMTH_INT`). Go to the NHANES 1999--2000 web page, then select and download your data. Clearly explain your rationale for selecting these data. Use `.merge` to combine your data into one complete dataframe. Document missing values. (5/100 pts)
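
One way to document missing values is a per-column table of counts and shares. This is a minimal sketch on a made-up frame; the column names match the mortality variables above, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the merged dataframe
df = pd.DataFrame({
    "SEQN": [1, 2, 3, 4],
    "MORTSTAT": [0.0, 1.0, np.nan, 0.0],
    "PERMTH_INT": [200.0, 35.0, np.nan, np.nan],
})

# Count and share of missing entries per column, worst columns first
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "share_missing": df.isna().mean(),
}).sort_values("n_missing", ascending=False)
print(missing)
```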

Fifth, do basic EDA and visualization of the key variables. Are any important variables skewed? Are there outliers? How correlated are pairs of variables? Do pairs of categorical variables exhibit interesting patterns in contingency tables? Provide a clear discussion and examination of the data and the variables you are interested in using. (20/100 pts)
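
The questions above map onto a handful of pandas calls. The sketch below runs them on simulated data; the columns `age_years`, `sodium_mg`, and `sex` are hypothetical stand-ins rather than NHANES variable names, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Simulated data; column names other than MORTSTAT are hypothetical
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age_years": rng.integers(18, 85, size=n),
    "sodium_mg": rng.lognormal(mean=8.0, sigma=0.5, size=n),  # right-skewed by construction
    "sex": rng.choice(["male", "female"], size=n),
    "MORTSTAT": rng.integers(0, 2, size=n),
})

numeric = df[["age_years", "sodium_mg"]]
skewness = numeric.skew()   # values well above 0 suggest a long right tail
corr = numeric.corr()       # pairwise Pearson correlations

# Flag outliers with the 1.5 * IQR rule
q1, q3 = df["sodium_mg"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["sodium_mg"] < q1 - 1.5 * iqr) | (df["sodium_mg"] > q3 + 1.5 * iqr)]

# Contingency table: share of each MORTSTAT value within each sex
table = pd.crosstab(df["sex"], df["MORTSTAT"], normalize="index")

print(skewness, corr, len(outliers), table, sep="\n\n")
```

A strongly skewed variable such as the simulated sodium column is often easier to read on a log scale in plots.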


### Day 2: $k$-NN classification/regression, write-up (50/100 pts)

Submit a notebook that clearly addresses the following, using code and markdown chunks:

1. Describe the data, particularly what an observation is and whether there are any missing data that might impact your analysis. Who collected the data and why? What known limitations are there to analysis? (10/100 pts)
2. Describe the variables you selected to predict mortality and life expectancy, and the rationale behind them. Analyze your variables using describe tables, kernel densities, scatter plots, and conditional kernel densities. Are there any patterns of interest to notice? (10/100 pts)
3. Use your variables to predict mortality with a $k$-Nearest Neighbor Classifier. Analyze its performance and explain clearly how you select $k$. (10/100 pts)
4. Use your variables to predict life expectancy with a $k$-Nearest Neighbor Regressor. Analyze its performance and explain clearly how you select $k$. (10/100 pts)
5. Describe how your model could be used for health interventions based on patient characteristics. Are there any limitations or risks to consider? (10/100 pts)
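
For items 3 and 4, one possible way to select $k$ is cross-validation on a training split, with a held-out test split for the final score. The sketch below assumes scikit-learn is available and uses simulated features and outcomes, not NHANES data; the grid of odd $k$ values and the helper name `tune_knn` are illustrative choices. Features are standardized first because $k$-NN distances depend on the scale of each variable.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simulated stand-ins for the predictors and the two outcomes
rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 3))
died = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)  # binary outcome
months = 120 + 30 * X[:, 1] + 10 * rng.normal(size=n)        # continuous outcome


def tune_knn(estimator, X, y, scoring):
    """Pick k by 5-fold cross-validation on a training split, then score on a test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    pipe = Pipeline([("scale", StandardScaler()), ("knn", estimator)])
    grid = {"knn__n_neighbors": list(range(1, 31, 2))}
    search = GridSearchCV(pipe, grid, cv=5, scoring=scoring)
    search.fit(X_train, y_train)
    # search.score applies the same scorer to the held-out test split
    return search.best_params_["knn__n_neighbors"], search.score(X_test, y_test)


k_clf, acc = tune_knn(KNeighborsClassifier(), X, died, "accuracy")
k_reg, r2 = tune_knn(KNeighborsRegressor(), X, months, "r2")
print(f"classifier: k={k_clf}, test accuracy={acc:.3f}")
print(f"regressor:  k={k_reg}, test R^2={r2:.3f}")
```

With the real data, `PERMTH_INT` for participants who are still alive records follow-up time rather than a completed lifespan, so the regression score should be interpreted with that in mind.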

## Submission (10/100 pts)

Submit your work in a well-organized GitHub repo, where the code is appropriately commented and all members of the group have made significant contributions to the commit history. (10/100 pts)

``` python
import pandas as pd
mdf = pd.read_csv('data/linked_mortality_file_1999_2000.csv') # Load mortality file
print( mdf.head() )
gdf = pd.read_sas("data/DEMO.xpt", format="xport") # Load demographics file
print( gdf.head() )
df = gdf.merge(mdf, on="SEQN", how="inner") # Merge mortality and demographics on SEQN variable
gdf = pd.read_sas("data/DSII.xpt", format="xport") # Load the DSII file (reuses the name gdf; df already holds the merge)
print( gdf.head() )
```

``` python
# get our data
import requests
from pathlib import Path

url = "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/1999/DataFiles/DRXIFF.xpt"
out_path = Path("data/diet.xpt")

# make sure directory exists
out_path.parent.mkdir(parents=True, exist_ok=True)

r = requests.get(url, stream=True)
r.raise_for_status()

with open(out_path, "wb") as f:
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)

print("Saved to", out_path)
```


To predict mortality (`MORTSTAT`) and life expectancy (`PERMTH_INT`), we chose the Dietary Interview, Individual Foods data file (`DRXIFF.xpt`), because we want to see how specific dietary components predict these outcomes. The components we evaluate are sodium, caffeine, and fiber. Sodium and caffeine have been linked to health problems, while fiber is generally associated with better digestive health. We want to measure how much each of these components matters.
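
Because the Individual Foods file lists one row per food item, a participant's `SEQN` appears many times, and merging it directly would duplicate the mortality rows. A minimal sketch of one way to handle this is to sum intake per person first. The column names `sodium_mg`, `caffeine_mg`, and `fiber_g` are hypothetical placeholders; the real variable names should be taken from the codebook.

```python
import pandas as pd

# Made-up food-level rows: several foods per participant
foods = pd.DataFrame({
    "SEQN": [1, 1, 1, 2, 2],
    "sodium_mg": [300.0, 1200.0, 50.0, 800.0, 400.0],
    "caffeine_mg": [0.0, 95.0, 0.0, 40.0, 0.0],
    "fiber_g": [2.0, 0.5, 4.0, 1.0, 3.0],
})

# Collapse to one row per participant by summing intake across foods
nutrients = ["sodium_mg", "caffeine_mg", "fiber_g"]
intake = foods.groupby("SEQN", as_index=False)[nutrients].sum()

# Made-up stand-in for the merged demographics and mortality frame
df = pd.DataFrame({"SEQN": [1, 2, 3], "MORTSTAT": [0, 1, 0]})

# A left merge keeps participants without dietary records (as NaN);
# validate= raises if SEQN is still duplicated on either side
merged = df.merge(intake, on="SEQN", how="left", validate="one_to_one")
print(merged)
```

Participants with no dietary rows end up with missing intake values, which belongs in the missing-data documentation.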

The variable `ELIGSTAT` indicates whether an NHANES participant was eligible for linkage to death records, that is, whether the individual could be matched to the National Death Index. This matters because an analysis of mortality outcomes should include only individuals whose `ELIGSTAT` value indicates that they could be linked.

The variable `MORTSTAT` records whether the participant was deceased or presumed alive as of the end of the mortality follow-up period. A value of 0 means the individual is assumed to be alive, and a value of 1 means the individual is deceased. This status is based on the linkage between NHANES survey records and the National Death Index. `MORTSTAT` is important to us because it tells us whether a participant died during the follow-up window.
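
Taken together, the two variables suggest a simple filtering step before any modeling. The sketch below uses made-up rows and assumes that `ELIGSTAT == 1` is the code for "eligible"; that coding should be confirmed against the LMF documentation before use.

```python
import numpy as np
import pandas as pd

# Made-up rows; MORTSTAT is missing for participants who could not be linked
df = pd.DataFrame({
    "SEQN": [1, 2, 3, 4, 5],
    "ELIGSTAT": [1, 1, 2, 1, 3],
    "MORTSTAT": [0.0, 1.0, np.nan, 0.0, np.nan],
})

# Assumption: ELIGSTAT == 1 means eligible for mortality linkage
eligible = df[df["ELIGSTAT"] == 1].copy()

# Among eligible participants, MORTSTAT is 0 (assumed alive) or 1 (deceased)
eligible["died"] = eligible["MORTSTAT"].astype(int)
death_share = eligible["died"].mean()
print(f"{len(eligible)} eligible participants, share deceased: {death_share:.2f}")
```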

The variable `PERMTH_INT` measures how long each participant was followed, in months, from the date of their interview until death or, if they were still alive, until the end of the follow-up period. It is a standard measure of follow-up time in survival analysis: larger values mean more months of follow-up. Missing values often occur for people who were not eligible for mortality linkage, for example children without sufficient identifying information. This variable is important because it is used to calculate rates and hazard estimates, since it accounts for time under observation and not just whether someone died.

The variable `RIDAGEEX` measures a person's age in months at the time of their NHANES examination. It's often used instead of age in years because it is more precise, which is especially useful in growth and developmental studies. This variable is important to us both for that precision and because age can act as a confounding variable in many analyses.