# <center> A Study of Factors related to Hypertension in Adults </center>


The objective of this notebook is to show some of the essential steps of a workflow for building predictive models. The notebook provides a few examples of each step and it is only a very thin slice of what a complete analysis would consist of. 

The workflow includes:
1. **Problem Definition**: A clear definition of the problem enables us to identify the appropriate data to gather and the technique(s) to use in order to solve the problem. For many problems this may require background reading, discussion with domain experts, and layered problem specification.
2. **Data Gathering**: We have to know which data to use, where to gather them, and how to make them useful to solve our problem. In many cases, data from multiple sources can provide deeper insights. 
3. **Exploratory Data Analysis**: Exploratory data analysis (EDA) is an approach to performing initial investigations on our data. EDA is normally descriptive in nature and uses statistical graphics to discover patterns, identify anomalies, test hypotheses, and check assumptions about our data.
4. **Data Cleaning and Wrangling**: Raw data are generally incomplete and inconsistent, and they often contain many errors. Thus, we need to prepare the data for further processing. Data wrangling is the process of cleaning, structuring, and enriching raw data into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes, such as analytics.
5. **Data Modelling**: Data modelling involves selecting and optimizing the machine learning models that generate the best predictive performance based on the data we have.
6. **Prediction**: Once we have developed the best predictive model, we can deploy it to make predictions.



# 1.0. Problem Definition

Hypertension is a major public health problem and an important area of research because of its high prevalence and its role as a major risk factor for cardiovascular diseases and other complications. To assess the prevalence of hypertension and its associated factors, this notebook analyzes data from the NHANES datasets (https://www.cdc.gov/nchs/nhanes/index.htm).

We apply the tools of machine learning to predict systolic blood pressure in adults from the factors that are associated with it.



# 2.0. Data Gathering and Import


In [None]:
#Before moving to the next section, we need to import all packages required to do the analysis by calling the following:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## 2.1. Gathering and Importing Data

We import the datasets by calling the following:

In [None]:
import os
import tarfile
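import urllib.request  # urlretrieve lives in the urllib.request submodule; `import urllib` alone does not load it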

DOWNLOAD_ROOT = "https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/"
LOCAL_DATA_PATH = os.path.join("datasets", "nhanes") + "/"
FILE_NAME = "P_DEMO.XPT"

def fetch_nhanes_data(file_name=FILE_NAME, nhanes_url=DOWNLOAD_ROOT, nhanes_path=LOCAL_DATA_PATH):
    """Download one NHANES XPT file into the local data directory."""
    # Create the target directory if it does not exist yet
    os.makedirs(nhanes_path, exist_ok=True)
    xpt_path = os.path.join(nhanes_path, file_name)
    url = nhanes_url + file_name
    urllib.request.urlretrieve(url, xpt_path)

In [None]:
fetch_nhanes_data("P_DEMO.XPT","https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/",LOCAL_DATA_PATH)
fetch_nhanes_data("P_BPXO.XPT","https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/",LOCAL_DATA_PATH)
fetch_nhanes_data("P_BMX.XPT","https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/",LOCAL_DATA_PATH)
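
Re-running the cell above downloads every file again. The following is a minimal sketch of one way to skip files that are already on disk. The helper name `fetch_if_missing`, its `retrieve` parameter, and the `example.invalid` URL are hypothetical and used only for illustration; the stand-in downloader lets the logic run without network access.

```python
import os
import tempfile
import urllib.request


def fetch_if_missing(file_name, base_url, local_dir,
                     retrieve=urllib.request.urlretrieve):
    """Download base_url + file_name into local_dir unless it already exists.

    Hypothetical helper for illustration; `retrieve` can be swapped for a
    stand-in so the logic can be exercised without network access.
    """
    os.makedirs(local_dir, exist_ok=True)
    local_path = os.path.join(local_dir, file_name)
    if not os.path.exists(local_path):
        retrieve(base_url + file_name, local_path)
    return local_path


# Demonstration with a stand-in downloader (no real download happens here).
requested_urls = []


def fake_retrieve(url, path):
    requested_urls.append(url)
    with open(path, "wb") as handle:
        handle.write(b"placeholder")


demo_dir = tempfile.mkdtemp()
first = fetch_if_missing("EXAMPLE.XPT", "https://example.invalid/", demo_dir, fake_retrieve)
second = fetch_if_missing("EXAMPLE.XPT", "https://example.invalid/", demo_dir, fake_retrieve)
print(len(requested_urls))  # the second call finds the file and skips the download
```

With the default `retrieve`, the same function could be called with the NHANES file names and URL used above.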

In [None]:
!ls $LOCAL_DATA_PATH

In [None]:
demo_df = pd.read_sas(LOCAL_DATA_PATH + "P_DEMO.XPT")
bmx_df = pd.read_sas(LOCAL_DATA_PATH + "P_BMX.XPT")
bpxo_df = pd.read_sas(LOCAL_DATA_PATH + "P_BPXO.XPT")

## 2.2. Exploring Data Structure and Features
Before performing data analysis, we often need to know the structure of our data. Therefore, we perform the following:
- Viewing a small part of our datasets
- Viewing data shape
- Describing the features contained in the datasets

In [None]:
bmx_df.info()

In [None]:
demo_df.head()

In [None]:
bpxo_df.describe()
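
The three views above can be complemented by a compact per-column summary. The sketch below builds one on a small made-up table; `toy_df` and its values are assumptions for illustration and are not NHANES data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for an NHANES table (values are made up for illustration).
toy_df = pd.DataFrame({
    "SEQN": [1.0, 2.0, 3.0, 4.0],
    "BMXWT": [70.5, np.nan, 82.1, 64.0],
    "BMXHT": [170.0, 165.5, np.nan, np.nan],
})

# Shape: number of rows and columns
n_rows, n_cols = toy_df.shape

# One row per column: data type, count of missing values, percent missing
summary = pd.DataFrame({
    "dtype": toy_df.dtypes.astype(str),
    "n_missing": toy_df.isna().sum(),
    "pct_missing": toy_df.isna().mean() * 100,
})

print(n_rows, n_cols)
print(summary)
```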

###  More data exploring  (todo)

### Keep only the columns that will be used in the analysis 

In [None]:
keep_columns = ['SEQN','RIAGENDR','RIDAGEYR','DMDEDUC2']
demo_sub_df = demo_df[keep_columns]
demo_sub_df.info()

In [None]:
# Keep the respondent ID (SEQN) and the systolic readings (columns starting with BPXOS)
keep_columns = [col for col in bpxo_df if col.startswith(('BPXOS', 'SEQN'))]
bpxo_sub_df = bpxo_df[keep_columns]
bpxo_sub_df.info()

In [None]:
keep_columns= ['SEQN','BMXWT','BMXHT','BMXBMI']
bmx_sub_df = bmx_df[keep_columns]
bmx_sub_df.info()

### Merge the data tables into a single table

In [None]:
hp_df = demo_sub_df.merge(bpxo_sub_df, how='inner', on='SEQN')
hp_df = hp_df.merge(bmx_sub_df,how="inner", on='SEQN')
hp_df.shape
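
An inner merge silently drops respondents that are missing from either table, and duplicated keys would multiply rows. The sketch below shows, on made-up tables, how the `validate` and `indicator` options of `DataFrame.merge` can be used to check both. The toy values are assumptions for illustration.

```python
import pandas as pd

# Two made-up tables sharing the SEQN key
left = pd.DataFrame({"SEQN": [1, 2, 3], "RIDAGEYR": [34, 51, 67]})
right = pd.DataFrame({"SEQN": [2, 3, 4], "BMXBMI": [27.1, 31.4, 22.8]})

# validate="one_to_one" raises an error if SEQN is duplicated on either side.
# indicator=True adds a _merge column saying where each row came from.
outer = left.merge(right, how="outer", on="SEQN",
                   validate="one_to_one", indicator=True)
counts = outer["_merge"].value_counts()

# The inner merge keeps only the rows marked "both" above
inner = left.merge(right, how="inner", on="SEQN", validate="one_to_one")

print(counts)
print(inner)
```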

In [None]:
# Note the missing values 
hp_df.info()

# 3.0 Exploratory Data Analysis 

In [None]:
hp_df.hist(bins=50, figsize=(20,15)) 
plt.show()

In [None]:
#hp_sub_df = hp_df[['BMXBMI','BPXOSY', 'RIAGENDR','BMXWT','BMXHT','RIDAGEYR']]
corr_matrix = hp_df.corr()
corr_matrix["BPXOSY1"].sort_values(ascending=False)
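
`DataFrame.corr()` computes the Pearson correlation coefficient by default. As a small worked example, the sketch below computes the coefficient by hand for two made-up variables and compares it with NumPy's result. The values are assumptions for illustration.

```python
import numpy as np

# Made-up ages and systolic readings
age = np.array([25.0, 35.0, 45.0, 55.0, 65.0])
systolic = np.array([112.0, 118.0, 121.0, 130.0, 138.0])

# Pearson r: covariance of the centered variables divided by
# the product of their standard deviations
age_c = age - age.mean()
sys_c = systolic - systolic.mean()
r_manual = (age_c * sys_c).sum() / np.sqrt((age_c ** 2).sum() * (sys_c ** 2).sum())

# Same quantity from NumPy
r_numpy = np.corrcoef(age, systolic)[0, 1]

print(r_manual, r_numpy)
```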

In [None]:
hp_df.isnull().sum()

In [None]:
feat_desc = pd.DataFrame({'Description': ['Respondent Sequence Number',
                                          'Gender of the participant',
                                          'Age in years at screening',
                                          'The Education Level Adults 20+',
                                          'Systolic 1st Oscillometric reading',
                                          'Systolic 2nd Oscillometric reading',
                                          'Systolic 3rd Oscillometric reading',
                                          'Weight (Kg)',
                                          'Standing Height (cm)',
                                          'Body Mass Index (Kg/m**2)'], 
                          'Values': [hp_df[i].unique() for i in hp_df.columns],
                          'Number of unique values': [len(hp_df[i].unique()) for i in hp_df.columns]}, 
                          index = hp_df.columns)

feat_desc

In [None]:
import seaborn as sns

plt.figure(figsize=(13,10))

# Plot from a copy so that hp_df itself is left unchanged.
# BPXOSY (the mean of the three systolic readings) is only added to hp_df in Section 4.0,
# so it is computed here for the charts.
plot_df = hp_df.copy()
plot_df['BPXOSY'] = (plot_df['BPXOSY1'] + plot_df['BPXOSY2'] + plot_df['BPXOSY3'])/3
plot_df['age_groups'] = pd.cut(plot_df['RIDAGEYR'], bins=range(20,90,8))

# Bar chart of mean systolic pressure by age group
ax1 = plt.subplot(221)
g1 = sns.barplot(x='age_groups', y='BPXOSY', data=plot_df, color='seagreen')
plt.ylabel('Systolic Pressure')
plt.xlabel('Age')
plt.title('Age and Systolic Pressure', size=13)

# Bar chart of mean systolic pressure by gender (RIAGENDR: 1 = male, 2 = female)
ax2 = plt.subplot(222)
g2 = sns.barplot(x='RIAGENDR', y='BPXOSY', data=plot_df, palette='BuGn_r')
plt.ylabel('Systolic Pressure')
plt.xlabel('Gender')
ax2.set_xticklabels(['Male', 'Female'])
plt.title('Gender and Systolic Pressure', size=13)


plt.subplots_adjust(hspace = 0.4, wspace = 0.3)

plt.show()

The left bar chart shows that age and systolic blood pressure are correlated.

The right bar chart shows the difference between male and female systolic blood pressure.
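
The age groups in the left chart come from `pd.cut`, whose intervals are closed on the right by default. The sketch below, using made-up ages, shows how values are assigned to bins and that an age equal to the lowest edge (20) falls outside every bin unless `include_lowest=True` is passed.

```python
import pandas as pd

# Made-up ages around the bin edges
ages = pd.Series([20, 21, 28, 29, 80, 84, 85])
bin_edges = list(range(20, 90, 8))  # 20, 28, 36, ..., 84

# Default: intervals look like (20, 28], so 20 itself is not in any bin
groups = pd.cut(ages, bins=bin_edges)

# include_lowest=True makes the first interval include its left edge
groups_incl = pd.cut(ages, bins=bin_edges, include_lowest=True)

print(pd.DataFrame({"age": ages, "default": groups, "include_lowest": groups_incl}))
```

Values above the last edge (85 here) are unassigned in both cases.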

### More graph and EDA needed (todo)

# 4.0 Data Cleaning and Wrangling

### We do some of the cleaning and attribute adding now before the split 

In [None]:
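def bp_add_attributes(hp_df):
    # Work on a copy so the caller's dataframe (possibly a filtered slice) is not modified in place
    hp_df = hp_df.copy()
    # Add the mean of the three systolic readings as a new attribute
    if 'BPXOSY' not in hp_df.columns:
        hp_df['BPXOSY'] = (hp_df['BPXOSY1'] + hp_df['BPXOSY2'] + hp_df['BPXOSY3'])/3

    return hp_df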

In [None]:
def bp_trim_rows(hp_df):
    # Remove all rows that do not have the three systolic values
    if ('BPXOSY1' in hp_df.columns) and ('BPXOSY2' in hp_df.columns) and ('BPXOSY3' in hp_df.columns):
        hp_df = hp_df[hp_df['BPXOSY1'].notna() & hp_df['BPXOSY2'].notna() & hp_df['BPXOSY3'].notna()]
    # We are only interested in adults, so let's drop all individuals with an age less than 20
    hp_df = hp_df[hp_df['RIDAGEYR'] >= 20]
    return hp_df    

In [None]:
def bp_drop_columns(hp_df):
    # We do not need the seqn now (only needed for the merge)
    if 'SEQN' in hp_df.columns:
        hp_df = hp_df.drop("SEQN",axis=1)
    # the systolic numbers have been averaged
    if 'BPXOSY1' in hp_df.columns:
        hp_df = hp_df.drop('BPXOSY1',axis=1)
    if 'BPXOSY2' in hp_df.columns:
        hp_df = hp_df.drop('BPXOSY2',axis=1)
    if 'BPXOSY3' in hp_df.columns:
        hp_df = hp_df.drop("BPXOSY3",axis=1)
    
    return hp_df


In [None]:
def bp_add_trim_drop(hp_df):
    hp_df = bp_trim_rows(hp_df)
    hp_df = bp_add_attributes(hp_df)
    hp_df = bp_drop_columns(hp_df)
    
    return hp_df

In [None]:
hp_df = bp_add_trim_drop(hp_df)
hp_df.head()
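
Rows without all three systolic readings are removed before averaging because adding the columns propagates missing values. The sketch below contrasts that behavior with `DataFrame.mean(axis=1)`, which skips missing values, on made-up readings. The values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up readings: complete, one missing, two missing
readings = pd.DataFrame({
    "BPXOSY1": [120.0, 130.0, np.nan],
    "BPXOSY2": [118.0, np.nan, np.nan],
    "BPXOSY3": [122.0, 128.0, 110.0],
})

# Sum-and-divide: any missing reading makes the result NaN
strict_mean = (readings["BPXOSY1"] + readings["BPXOSY2"] + readings["BPXOSY3"]) / 3

# Row-wise mean: averages whatever readings are available
lenient_mean = readings.mean(axis=1)

print(pd.DataFrame({"strict": strict_mean, "lenient": lenient_mean}))
```

Which behavior is appropriate depends on whether an average of fewer than three readings is acceptable for the analysis.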

### Now split into training and test data sets

In [None]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(hp_df, test_size=0.2, random_state=42)

In [None]:
# We have some background information that leads us to believe male and female blood pressures differ,
# so we make sure that gender is split evenly across the train and test sets
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42) 
for train_index, test_index in split.split(hp_df, hp_df['RIAGENDR']):
        strat_train_set = hp_df.iloc[train_index]
        strat_test_set = hp_df.iloc[test_index]
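
One way to check a stratified split is to compare the share of each group in the training and test sets. The sketch below does this on made-up data with a 60/40 split between the two gender codes. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# 100 made-up participants: 60 coded 1 and 40 coded 2
gender = np.array([1] * 60 + [2] * 40)
features = np.arange(100).reshape(-1, 1)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(features, gender))

# Share of code 1 in each part; both should be close to the overall 0.6
train_share = np.mean(gender[train_idx] == 1)
test_share = np.mean(gender[test_idx] == 1)

print(len(train_idx), len(test_idx), train_share, test_share)
```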

In [None]:
bp_train_X = strat_train_set.drop("BPXOSY", axis=1)
bp_train_y = strat_train_set["BPXOSY"].copy()

In [None]:
bp_train_X.head()

### Set missing values of numerical data to the median

In [None]:
# Gender is a categorical field, so we remove it before computing statistics on the numerical columns

bp_num = bp_train_X.drop("RIAGENDR",axis=1)

In [None]:
from sklearn.impute import SimpleImputer 
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer 
from sklearn.preprocessing import OneHotEncoder

In [None]:

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('std_scaler', StandardScaler()),
        ])
#bp_num_tr = num_pipeline.fit_transform(bp_num)
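
To make the two pipeline steps concrete, the sketch below applies median imputation and standardization by hand to a single made-up column and compares the result with the same scikit-learn pipeline. The values are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One made-up numerical column with a missing value
bmi = np.array([[22.0], [np.nan], [30.0], [26.0], [34.0]])

# Step 1: replace missing values with the median of the observed values
median = np.nanmedian(bmi)
filled = np.where(np.isnan(bmi), median, bmi)

# Step 2: subtract the mean and divide by the population standard deviation
# (np.std uses ddof=0 by default, which matches StandardScaler)
manual = (filled - filled.mean()) / filled.std()

# Same two steps as a pipeline
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
via_sklearn = pipeline.fit_transform(bmi)

print(median)
print(np.allclose(manual, via_sklearn))
```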

In [None]:
num_attribs = list(bp_num)
cat_attribs = ["RIAGENDR"]
full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
        ])
bp_prepared = full_pipeline.fit_transform(bp_train_X)

In [None]:
# bp_prepared is in a numpy array. Sometimes it is useful to have the data in a dataframe, so let's build one 
# we do not use this dataframe for the rest of the notebook, but you may find it useful
column_names = num_attribs.copy()
column_names.append('Male')
column_names.append('Female')
bp_prepared_df = pd.DataFrame(bp_prepared, columns=column_names)
bp_prepared_df.head()
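
The column names `Male` and `Female` above rely on the order in which `OneHotEncoder` emits its columns. The sketch below shows how that order can be read from the fitted encoder's `categories_` attribute instead of being assumed. The data are made up, and the label mapping follows the NHANES coding of `RIAGENDR` (1 = male, 2 = female).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up gender codes in arbitrary order
gender = np.array([[2.0], [1.0], [2.0], [1.0]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(gender).toarray()

# categories_ lists the category values in output-column order
categories = encoder.categories_[0]
labels = {1.0: "Male", 2.0: "Female"}
column_names = [labels[value] for value in categories]

print(column_names)
print(encoded)
```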

# 5.0 Data Modeling

### Linear regression

In [None]:
from sklearn.linear_model import LinearRegression 
lin_reg = LinearRegression()
lin_reg.fit(bp_prepared, bp_train_y)

In [None]:
#some_data = X.iloc[:5]
some_data = bp_prepared[:5]
some_labels = bp_train_y.iloc[:5]
print("Predictions:", lin_reg.predict(some_data))
print("Labels:", list(some_labels))


In [None]:
from sklearn.metrics import mean_squared_error
systolic_predictions = lin_reg.predict(bp_prepared)
lin_mse = mean_squared_error(bp_train_y, systolic_predictions) 
lin_rmse = np.sqrt(lin_mse)
lin_rmse
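
The root mean squared error (RMSE) is the square root of the average squared difference between predictions and labels, so it is expressed in the same units as the target. The sketch below computes it by hand on made-up values and compares it with the scikit-learn route used above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up labels and predictions
y_true = np.array([120.0, 135.0, 110.0, 142.0])
y_pred = np.array([118.0, 130.0, 115.0, 150.0])

# RMSE by hand: square the errors, average them, take the square root
errors = y_pred - y_true
rmse_manual = np.sqrt(np.mean(errors ** 2))

# Same quantity via scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```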

### Cross Validation (todo)
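
The RMSE computed on the training data tends to be optimistic. A minimal sketch of k-fold cross-validation with `cross_val_score` follows. It runs on synthetic data, not on the NHANES features, and the coefficients and noise level are assumptions chosen only so that the example runs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (not NHANES data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 120 + 8 * X[:, 0] + 3 * X[:, 1] + rng.normal(scale=5, size=200)

# scikit-learn scorers follow "higher is better", so the MSE is negated
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)
rmse_per_fold = np.sqrt(-scores)

print(rmse_per_fold)
print(rmse_per_fold.mean(), rmse_per_fold.std())
```

In this notebook, the same call could be applied to the prepared training matrix and training labels.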

# 6.0 Prediction

### Evaluate on the test set: done only once, at the end of all modeling

In [None]:
X_test = strat_test_set.drop("BPXOSY", axis=1)
y_test = strat_test_set["BPXOSY"].copy()

In [None]:
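# Use transform (not fit_transform) so the medians, scaling and encoding learned on the training set are reused
bp_prepared = full_pipeline.transform(X_test)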

In [None]:
from sklearn.metrics import mean_squared_error
systolic_predictions = lin_reg.predict(bp_prepared)
lin_mse = mean_squared_error(y_test, systolic_predictions) 
lin_rmse = np.sqrt(lin_mse)
lin_rmse
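
Preprocessing statistics should come from the training data only. The sketch below shows, on made-up numbers, how scaling a test set with a scaler fitted on the training set differs from refitting the scaler on the test set itself.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test values for one feature
train = np.array([[20.0], [40.0], [60.0]])
test = np.array([[80.0], [100.0]])

# Fit on the training data, then reuse its mean and standard deviation
scaler = StandardScaler().fit(train)
test_scaled = scaler.transform(test)

# Refitting on the test data hides how far it lies from the training data
test_refit = StandardScaler().fit_transform(test)

print(test_scaled.ravel())
print(test_refit.ravel())
```

A model trained on features scaled with the training statistics expects test features on that same scale.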