## COLLECTION OF NATIONAL HEALTH & NUTRITION EXAMINATION SURVEY DATA FOR PROJECT

This worksheet collects, combines, and filters data from the National Health & Nutrition Examination Survey (NHANES).
NHANES data is posted in XPT format in two-year increments for the various survey topics. Data was collected for the 1999-2018 cycles.

*Note on the data: NHANES sample weights are used by analysts to produce estimates of the health-related statistics that would have been obtained if the entire sampling frame (i.e., the noninstitutionalized civilian U.S. population) had been surveyed.*
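As a rough illustration of what the weights do, the sketch below compares an unweighted mean with a weighted one. The frame is made up. `WTMEC2YR` is the NHANES two-year examination weight; it ships in the demographics file rather than in the files pulled in this worksheet, so treating it as the right weight here is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up respondents; real weights would come from the demographics file.
toy = pd.DataFrame({
    "BPXSY1": [110.0, 120.0, 150.0],
    "WTMEC2YR": [30000.0, 10000.0, 10000.0],
})

# Every respondent counts equally.
unweighted_mean = toy["BPXSY1"].mean()

# Each respondent counts in proportion to the population they represent.
weighted_mean = np.average(toy["BPXSY1"], weights=toy["WTMEC2YR"])

print(unweighted_mean, weighted_mean)
```

The first respondent carries three times the weight of the others, so the weighted mean is pulled toward their reading.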

In [1]:
import pandas as pd

#### DELETE THIS FOR FINAL - just using to build key for data collection

A = 1999 - 2000</br>
B = 2001 - 2002</br>
C = 2003 - 2004</br>
D = 2005 - 2006</br>
E = 2007 - 2008</br>
F = 2009 - 2010</br>
G = 2011 - 2012</br>
H = 2013 - 2014</br>
I = 2015 - 2016</br>
J = 2017 - 2018</br>
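The same key can be built in code. This is a minimal sketch: `CYCLES` and `xpt_path` are hypothetical names, and the file naming pattern (no suffix for cycle A, `_<letter>` afterwards) is assumed from the A and B paths used in this worksheet.

```python
import string

# Cycle letters mapped to survey years, following the key above.
CYCLES = {
    letter: f"{1999 + 2 * i}-{2000 + 2 * i}"
    for i, letter in enumerate(string.ascii_uppercase[:10])
}

def xpt_path(letter, folder, label, code):
    """Build a local XPT path; the naming pattern is assumed from cycles A and B."""
    suffix = "" if letter == "A" else f"_{letter}"
    return f"NHANES_files/{folder}/{letter}_{label}_{code}{suffix}.XPT"

print(CYCLES["B"], xpt_path("B", "blood_pressure", "BloodPressure", "BPX"))
```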

#### A: 1999-2000 file pull and merge:

In [112]:
# Pull in XPT files to create dataframes

A_blood_press = pd.read_sas('NHANES_files/blood_pressure/A_BloodPressure_BPX.XPT') #blood pressure (target)

A_body_meas = pd.read_sas('NHANES_files/body_measures/A_BodyMeasures_BMX.XPT') #body measures


In [140]:
# verify both dataframes have the same number of rows
print(len(A_blood_press))
print(len(A_body_meas))

9282
9282
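Equal row counts do not by themselves show that both files cover the same respondents. A possible extra check, sketched on made-up frames (`same_respondents` is a hypothetical helper, not part of pandas):

```python
import pandas as pd

def same_respondents(left, right, key="SEQN"):
    """True when both frames hold exactly the same set of unique key values."""
    return (
        left[key].is_unique
        and right[key].is_unique
        and set(left[key]) == set(right[key])
    )

# Made-up frames: same length, but one respondent differs.
bp_toy = pd.DataFrame({"SEQN": [1.0, 2.0, 3.0]})
bm_toy = pd.DataFrame({"SEQN": [1.0, 2.0, 4.0]})

print(same_respondents(bp_toy, bp_toy))
print(same_respondents(bp_toy, bm_toy))
```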


In [145]:
# merge dataframes on each respondent's sequence number (SEQN)
A_df_merged = pd.merge(A_blood_press, A_body_meas, on='SEQN')
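`pd.merge` defaults to an inner join, so respondents missing from either file are dropped silently. The sketch below, on made-up frames, uses the `validate` and `indicator` options of `pd.merge` to surface that. Whether an outer join is wanted for the real data is a judgment call.

```python
import pandas as pd

# Made-up frames standing in for the blood pressure and body measures files.
bp_toy = pd.DataFrame({"SEQN": [1.0, 2.0, 3.0], "BPXSY1": [None, 106.0, 110.0]})
bm_toy = pd.DataFrame({"SEQN": [1.0, 2.0, 4.0], "BMXWT": [12.5, 75.4, 13.3]})

# validate raises if SEQN repeats on either side;
# indicator adds a _merge column saying where each row came from.
checked = pd.merge(
    bp_toy, bm_toy, on="SEQN", how="outer",
    validate="one_to_one", indicator=True,
)

print(checked["_merge"].value_counts())
```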

In [142]:
#narrow df to only the features & target we want to explore for the model

A_df_narrowed = A_df_merged[["SEQN","BPXSY1","BPXSY2","BPXSY3",
                             "BPXDI1","BPXDI2","BPXDI3",
                             "BMXBMI","BMXWAIST","BMXHT",
                             "BMXWT"]]

In [144]:
A_df_narrowed.head()

Unnamed: 0,SEQN,BPXSY1,BPXSY2,BPXSY3,BPXDI1,BPXDI2,BPXDI3,BMXBMI,BMXWAIST,BMXHT,BMXWT
0,1.0,,,,,,,14.9,45.7,91.6,12.5
1,2.0,106.0,98.0,98.0,58.0,56.0,56.0,24.9,98.0,174.0,75.4
2,3.0,110.0,104.0,112.0,60.0,64.0,62.0,17.63,64.7,136.6,32.9
3,4.0,,,,,,,,,,13.3
4,5.0,122.0,122.0,122.0,82.0,84.0,82.0,29.1,99.9,178.3,92.5
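`SEQN` comes back from `read_sas` as a float (1.0, 2.0, ...). If an integer ID is preferred, a cast along these lines should work, assuming no respondent has a missing `SEQN`. The frame below is made up.

```python
import pandas as pd

# Made-up rows mimicking the float SEQN values shown above.
toy = pd.DataFrame({"SEQN": [1.0, 2.0, 3.0], "BMXWT": [12.5, 75.4, 32.9]})

# astype raises if any SEQN is NaN, which acts as a cheap sanity check.
toy["SEQN"] = toy["SEQN"].astype("int64")

print(toy.dtypes)
```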


---

#### B: 2001-2002 file pull and merge:

In [53]:
# Pull in XPT files to create dataframes

B_blood_press = pd.read_sas('NHANES_files/blood_pressure/B_BloodPressure_BPX_B.XPT') #blood pressure (target)

B_body_meas = pd.read_sas('NHANES_files/body_measures/B_BodyMeasures_BMX_B.XPT') #body measures

In [133]:
B_blood_press

Unnamed: 0,SEQN,PEASCST1,PEASCTM1,PEASCCT1,BPXCHR,BPQ150A,BPQ150B,BPQ150C,BPQ150D,BPAARM,...,BPAEN2,BPXSY3,BPXDI3,BPAEN3,BPXSY4,BPXDI4,BPAEN4,BPXSAR,BPXDAR,Year
0,9966.0,1.0,908.0,5.397605e-79,,2.0,2.0,2.0,2.0,1.0,...,2.0,122.0,78.0,2.0,,,,124.0,7.900000e+01,2001-2002
1,9967.0,1.0,587.0,5.397605e-79,,2.0,2.0,2.0,2.0,1.0,...,2.0,104.0,66.0,2.0,,,,102.0,6.600000e+01,2001-2002
2,9968.0,1.0,626.0,5.397605e-79,,2.0,2.0,2.0,2.0,1.0,...,2.0,,,2.0,,,2.0,120.0,5.397605e-79,2001-2002
3,9969.0,1.0,860.0,5.397605e-79,,2.0,2.0,2.0,2.0,1.0,...,2.0,108.0,70.0,2.0,,,,113.0,7.100000e+01,2001-2002
4,9970.0,1.0,740.0,5.397605e-79,,2.0,2.0,2.0,2.0,1.0,...,2.0,124.0,94.0,2.0,,,,121.0,8.900000e+01,2001-2002
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10472,21000.0,1.0,538.0,5.397605e-79,,2.0,2.0,2.0,2.0,1.0,...,2.0,96.0,48.0,2.0,,,,94.0,4.800000e+01,2001-2002
10473,21001.0,1.0,56.0,5.397605e-79,120.0,,,,,,...,,,,,,,,,,2001-2002
10474,21002.0,1.0,536.0,5.397605e-79,,2.0,2.0,2.0,2.0,1.0,...,2.0,116.0,44.0,2.0,,,,113.0,4.800000e+01,2001-2002
10475,21003.0,1.0,682.0,5.397605e-79,,2.0,2.0,2.0,2.0,1.0,...,2.0,102.0,44.0,2.0,100.0,46.0,2.0,100.0,4.600000e+01,2001-2002
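The `5.397605e-79` entries in the output above appear to be how a stored zero comes through from the XPT file. A minimal sketch for snapping them to 0, on made-up data; `zero_out_sas_artifacts` and its `tol` threshold are hypothetical, not a pandas feature.

```python
import pandas as pd

def zero_out_sas_artifacts(df, tol=1e-70):
    """Return a copy with near-zero numeric values replaced by 0.0."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    # NaN comparisons are False, so missing values are left alone.
    out[num_cols] = out[num_cols].mask(out[num_cols].abs() < tol, 0.0)
    return out

# Made-up rows using the values seen in the output above.
toy = pd.DataFrame({
    "BPXDAR": [5.397605e-79, 79.0, None],
    "Year": ["2001-2002"] * 3,
})

cleaned = zero_out_sas_artifacts(toy)
print(cleaned)
```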


# NOTES FOR HOW TO "CLEAN" UP BLOOD PRESSURE DATA & THEN CREATE Y VALUE

In [55]:
# stack the two cycles; ignore_index=True avoids duplicate row labels
test_result = pd.concat([A_blood_press, B_blood_press], ignore_index=True)
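When more cycles are stacked, it may help to tag each frame with its survey years first, so rows stay traceable after the concat. A sketch on made-up frames: the `Year` column name follows the output shown earlier, and `frames` is a hypothetical dict.

```python
import pandas as pd

# Made-up stand-ins for two cycles of blood pressure data.
a_toy = pd.DataFrame({"SEQN": [1.0, 2.0], "BPXSY1": [106.0, 110.0]})
b_toy = pd.DataFrame({"SEQN": [9966.0, 9967.0], "BPXSY1": [122.0, 104.0]})

frames = {"1999-2000": a_toy, "2001-2002": b_toy}

# assign() returns a tagged copy; ignore_index=True renumbers rows 0..n-1.
stacked = pd.concat(
    [df.assign(Year=year) for year, df in frames.items()],
    ignore_index=True,
)

print(stacked)
```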

In [68]:
#rename the only columns we care about for blood pressure

test_result.rename(columns={"BPXSY1": "Systolic_Rd1",
                            "BPXSY2": "Systolic_Rd2",
                            "BPXSY3": "Systolic_Rd3",
                            "BPXDI1": "Diastolic_Rd1",
                            "BPXDI2": "Diastolic_Rd2",
                            "BPXDI3": "Diastolic_Rd3",
                           }, inplace=True)

In [71]:
#narrow dataframe to just be columns we want for blood pressure

test_blood_pressure = test_result[["Systolic_Rd1","Systolic_Rd2",
                                  "Systolic_Rd3","Diastolic_Rd1",
                                  "Diastolic_Rd2","Diastolic_Rd3"
                                  ]]

In [77]:
#drop all of the rows with null values (.copy() so new columns can be added without a SettingWithCopyWarning)

test_blood_pressure_nonull = test_blood_pressure.dropna().copy()

In [87]:
# Create new column for the mean of the three systolic readings

test_blood_pressure_nonull['Systolic_Avg'] = test_blood_pressure_nonull[
    ['Systolic_Rd1', 'Systolic_Rd2','Systolic_Rd3']].mean(axis=1)

In [89]:
# Create new column for the mean of the three diastolic readings

test_blood_pressure_nonull['Diastolic_Avg'] = test_blood_pressure_nonull[
    ['Diastolic_Rd1', 'Diastolic_Rd2','Diastolic_Rd3']].mean(axis=1)

In [105]:
# function to determine whether person has high blood pressure

def blood_pressure_status(row):
    if row['Systolic_Avg'] > 130 or row['Diastolic_Avg'] > 80:
        return 1
    return 0

In [109]:
# application of above function to run through entire dataframe

test_blood_pressure_nonull['High_Blood_Pressure'] = test_blood_pressure_nonull.apply(
    blood_pressure_status, axis=1)
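A row-wise `apply` gets slow on large frames. The same flag can be sketched as a vectorized comparison, shown here on made-up averages with the same strict thresholds as `blood_pressure_status`:

```python
import pandas as pd

# Made-up averages, including values sitting exactly on the thresholds.
toy = pd.DataFrame({
    "Systolic_Avg": [118.0, 130.0, 131.0, 120.0],
    "Diastolic_Avg": [76.0, 80.0, 70.0, 81.0],
})

# 1 when either average exceeds its threshold, else 0.
toy["High_Blood_Pressure"] = (
    (toy["Systolic_Avg"] > 130) | (toy["Diastolic_Avg"] > 80)
).astype(int)

print(toy["High_Blood_Pressure"].tolist())  # -> [0, 0, 1, 1]
```

Readings exactly at 130 or 80 are flagged 0, matching the `>` comparisons in the function above.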

In [111]:
# See how many have high blood pressure in the data set (1 = high blood pressure)

test_blood_pressure_nonull.High_Blood_Pressure.value_counts()

0    7762
1    3032
Name: High_Blood_Pressure, dtype: int64