# Instructions

This assignment is provided in the form of a Jupyter notebook. Questions and tasks are posed within this notebook file and you are expected to provide code and/or written answers when prompted. Remember that you can use Markdown cells to format written responses where necessary.

Before submitting your assignment, be sure to do a clean run of your notebook and **verify that your cell outputs (e.g., prints, figures, tables) are correctly shown**. To do a clean run, click *Kernel&#8594;Restart Kernel and Run All Cells...*.

You are required to submit this notebook to Gradescope in two forms:

1. Submit a PDF of the completed notebook. To produce a PDF, you can use *File&#8594;Save and Export Notebook As...&#8594;HTML* and then convert the HTML file to a PDF using your preferred web browser. **Verify that your code, written answers, and cell outputs are visible in the submitted PDF.**
2. Submit a zip file (including the `.ipynb` file) of this assignment to Gradescope.

# Setup and Imports

These cells will import necessary libraries and configure the notebook's visual style.

In [1]:
# Efficient math and data management
import numpy as np
import pandas as pd

# You may import useful modules and functions from the Python Standard Library.
import os
from functools import reduce  

# Visualization libraries
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

In [55]:
# Inline figures. Can swap comments to use interactive figures. Use inline figures for assignment submission.
%matplotlib inline
# %matplotlib notebook

In [56]:
# Set seaborn visual style
sns.set()
sns.set_context('talk')
plt.rcParams["patch.force_edgecolor"] = False  # Turn off histogram borders

# Load Data

In this assignment, we will be using the 2017&ndash;2018 NHANES data set. Every year, the CDC conducts a series of interviews and mobile health examinations. The 2017&ndash;2018 data set contains 9,254 completed interviews and 8,704 health examinations obtained from 30 survey locations. This data has been made publicly available, and a subset of it is included with this assignment. The raw data files can be found in the `nhanes` folder in both CSV and XPORT file formats if you would like to view them. These data files have been processed and combined for you and provided as `NHANES_combined.csv.gz`.

Load the combined data file.
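As a minimal sketch of reading a gzip-compressed CSV such as `NHANES_combined.csv.gz`, the helper below relies on `pd.read_csv` inferring the compression from the `.gz` suffix. The helper name `load_combined` and the tiny stand-in file are assumptions for illustration only; the real file's columns are not shown in this notebook, so the demo writes its own two-row file.

```python
import gzip
import os
import tempfile

import pandas as pd


def load_combined(path):
    """Read a CSV file (optionally gzip-compressed) into a DataFrame."""
    # compression="infer" is the pandas default; it selects gzip for ".gz" paths.
    return pd.read_csv(path, compression="infer")


# Hypothetical stand-in file so the sketch runs without the course data.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_combined.csv.gz")
with gzip.open(demo_path, "wt") as f:
    f.write("SEQN,KIQ022\n1,2\n2,1\n")

demo = load_combined(demo_path)
print(demo.shape)
```

With the course data in place, the same call would presumably be `load_combined('NHANES_combined.csv.gz')`, assuming the file sits next to the notebook.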

In [7]:
data = pd.read_sas('P_KIQ_U.XPT')
print(data)
metal_data = pd.read_sas('P_PBCD.XPT')

          SEQN  KIQ022  KIQ025  KIQ026  KIQ029  KIQ005  KIQ010  KIQ042  \
0     109266.0     2.0     NaN     2.0     NaN     1.0     NaN     2.0   
1     109267.0     2.0     NaN     2.0     NaN     NaN     NaN     NaN   
2     109271.0     2.0     NaN     2.0     NaN     1.0     NaN     2.0   
3     109273.0     2.0     NaN     2.0     NaN     2.0     1.0     2.0   
4     109274.0     2.0     NaN     1.0     2.0     1.0     NaN     2.0   
...        ...     ...     ...     ...     ...     ...     ...     ...   
9227  124815.0     2.0     NaN     2.0     NaN     2.0     1.0     2.0   
9228  124817.0     2.0     NaN     1.0     2.0     5.0     1.0     2.0   
9229  124818.0     2.0     NaN     2.0     NaN     3.0     1.0     2.0   
9230  124821.0     2.0     NaN     2.0     NaN     3.0     1.0     2.0   
9231  124822.0     2.0     NaN     2.0     NaN     2.0     2.0     2.0   

      KIQ430  KIQ044  KIQ450  KIQ046  KIQ470  KIQ050  KIQ052        KIQ480  
0        NaN     2.0     NaN     2

In [19]:
# Keep the non-NaN KIQ022 responses, then count those coded 1
non_null_kidney = data['KIQ022'][pd.notnull(data['KIQ022'])]
with_kidney_disease = non_null_kidney[non_null_kidney == 1]
print(len(with_kidney_disease))
# percent_tested = num_tested / len(data)
# print(f'{round(percent_tested * 100,4)}%')

383
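The count above can also be expressed with a boolean mask, which makes it easy to report a share of answered responses as well. This is only a sketch on made-up values: the `responses` Series is hypothetical, not NHANES data. Because a comparison with `NaN` evaluates to `False`, missing answers drop out of the count without a separate filtering step.

```python
import numpy as np
import pandas as pd

# Hypothetical responses: 1 = yes, 2 = no, 9 = don't know, NaN = not asked.
responses = pd.Series([1.0, 2.0, np.nan, 2.0, 1.0, 9.0])

# NaN == 1 is False, so missing values are excluded automatically.
n_yes = int((responses == 1).sum())

# Denominator: only rows with a recorded answer.
n_answered = int(responses.notna().sum())

share_yes = n_yes / n_answered
print(n_yes, n_answered, f"{share_yes:.1%}")
```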


In [20]:
non_null_lead = metal_data['LBXBPB'][pd.notnull(metal_data['LBXBPB'])]
with_high_lead = non_null_lead[non_null_lead >= 5]
print(len(with_high_lead))

124


In [21]:
non_null_mercury = metal_data['LBXTHG'][pd.notnull(metal_data['LBXTHG'])]
with_high_mercury = non_null_mercury[non_null_mercury >= 5]
print(len(with_high_mercury))

398


In [25]:
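# NOTE: non_null_kidney and non_null_lead come from different files, so `&`
# aligns them by row index, not by participant ID (SEQN).
num_both = np.count_nonzero((non_null_kidney == 1) & (non_null_lead >= 5))
print(num_both)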

1        False
3        False
5        False
6        False
7        False
         ...  
13766    False
13767    False
13769    False
13770    False
13771    False
Name: LBXBPB, Length: 11107, dtype: bool
3
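The comparison in the cell above combines two Series taken from different files, and pandas matches them by row index. Row `i` of one file is not necessarily the same participant as row `i` of the other, so a count of people who meet both conditions is safer when the tables are first joined on the participant ID. The sketch below shows that approach on small made-up frames; the values are hypothetical, and only the column names `SEQN`, `KIQ022`, and `LBXBPB` come from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables standing in for the questionnaire and lab files.
kidney = pd.DataFrame({
    "SEQN": [1, 2, 3, 4],
    "KIQ022": [1.0, 2.0, 1.0, np.nan],
})
metals = pd.DataFrame({
    "SEQN": [3, 1, 5],
    "LBXBPB": [6.2, 0.8, 7.0],
})

# The inner join keeps only participants present in both tables.
merged = kidney.merge(metals, on="SEQN", how="inner")

# Both conditions are now evaluated on the same participant's row.
both = (merged["KIQ022"] == 1) & (merged["LBXBPB"] >= 5)
num_both_by_id = int(both.sum())
print(num_both_by_id)
```

Applied to the real data, this would presumably read `data.merge(metal_data, on='SEQN')`, and the resulting count may differ from the index-aligned one printed above.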
