# **Lab One: Visualization and Data Preprocessing**

*Contributors:* Balaji Avvaru, Joshua Eysenbach, Vijay Kaniti, Daniel Turner

Nbviewer link: https://nbviewer.jupyter.org/github/iamsrikanthkv/CardioVascularDisease_Prediction/blob/master/Lab_1_DataViz_Group3-Final.ipynb

## **Business Understanding**

This analysis uses a dataset that records whether each patient has a cardiovascular disease (CVD) diagnosis. It contains 11 attributes that were gathered to identify characteristics of individuals that correlate with heart disease. The primary goal of this analysis is to explore the data through statistical summaries and visualization and to surface any trends that will be useful for building a prediction model that classifies a patient as having a CVD diagnosis or not.

This dataset was procured from Kaggle<sup>(0-1) </sup>, but its original source is unclear. However, the features are described well enough to understand what each of them represents. In a practical "real-world" setting, verifying the source of this data and scrutinizing the methods of its collection would be a vital part of the analysis, but for the purposes of our academic interests in data visualization and eventual prediction modelling, this level of validity is inconsequential as long as the data has realistic characteristics and application.

Cardiovascular disease generally refers to conditions that involve narrowed or blocked blood vessels (Atherosclerosis) that can lead to a heart attack, chest pain (angina) or stroke. Diseases under the heart disease umbrella include blood vessel diseases, such as coronary artery disease, heart rhythm problems (arrhythmias), congenital heart defects, and rheumatic heart disease. Four out of five CVD deaths are due to heart attacks and strokes.<sup>(0-2) </sup>

#### Cardiovascular Disease Prediction

The goal of any prediction algorithm using this data is to determine whether any of these attributes, or a combination of them, can predict a cardiovascular disease diagnosis. These prediction models could provide valuable insight into what conditions or behaviors might be correlated with heart disease and could be used in aiding diagnosis or helping to mitigate the disease through understanding its possible causes. Many of the attributes collected are based on long-standing suppositions about conditions or behaviors associated with heart disease, like high cholesterol, smoking, or alcohol consumption. This analysis and subsequent prediction modeling can hopefully identify whether any of these are more strongly associated with (or perhaps even falsely attributed to) a diagnosis.

Because this is a classification problem, logistic regression would be a good model to start with to predict a positive or negative CVD diagnosis. The effectiveness of a prediction model for classifying CVD patients could be measured in several ways depending on the implementation. For example, the intent of the model could be to identify individuals likely to be diagnosed with cardiovascular disease so they can be given an objective reason to make behavioral changes to reduce their chances of a future diagnosis. In this case, maximizing sensitivity at the expense of accuracy and specificity might be the best option because there would be few downsides to making false positive classifications. AUC (area under the receiver operating characteristic curve) could also be an effective metric under this principle, as it measures the ability of the model to assign a higher score to positive examples than to negative ones. On the other hand, if something like high cholesterol were found to be a highly significant predictor and a decision had to be made about prescribing a drug that could have side effects for the patient, accuracy and a more balanced sensitivity and specificity might be more important, as false positives become more of a concern. For reasons like these, a few different prediction models might be warranted for different specific uses.
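The trade-off described above can be sketched with a few made-up scores. This is a minimal illustration rather than the model evaluated in this lab: the labels, scores, thresholds, and helper names below are all hypothetical, and AUC is computed directly as the fraction of positive/negative pairs in which the positive example receives the higher score.

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),   # share of true CVD cases that are caught
        "specificity": tn / (tn + fp),   # share of non-CVD cases correctly cleared
        "accuracy": (tp + tn) / len(pairs),
    }


def predict(scores, threshold):
    """Turn model scores into 0/1 predictions at a given cut-off."""
    return [1 if s >= threshold else 0 for s in scores]


def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities (not taken from the dataset)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.1]

balanced = classification_metrics(y_true, predict(scores, 0.5))
sensitive = classification_metrics(y_true, predict(scores, 0.25))
ranking_quality = auc(y_true, scores)
```

Lowering the threshold from 0.5 to 0.25 catches every positive case in this toy data but clears far fewer negatives, which is the kind of exchange a screening-oriented model would accept.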

#### Systolic Blood Pressure Prediction

Blood pressure is another attribute included in this data that could be a target variable for prediction. Whether or not blood pressure is correlated with CVD, high blood pressure has many negative health implications of its own. If we build a model to predict systolic blood pressure, we could examine how strongly diastolic pressure correlates with it, and add as many of the other variables as are relevant to predicting systolic pressure. Because systolic blood pressure is a continuous measurement, we would predict it using multiple linear regression (MLR). Root mean square error (RMSE), estimated with cross validation, is a good metric for evaluating the effectiveness of an MLR prediction model for systolic blood pressure.

For validation of models for both predicting CVD and blood pressure, we will use 10-fold cross validation.
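To make the validation scheme concrete, here is a minimal sketch of k-fold cross validation scored by RMSE. It is written in plain Python under several assumptions: the data are synthetic, only a single predictor is used (a simplification of the multiple linear regression discussed above), and the helper names are hypothetical.

```python
import math
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cross_validated_rmse(x, y, k=10):
    """Average held-out RMSE over k folds."""
    fold_scores = []
    for held_out in k_fold_indices(len(x), k):
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        intercept, slope = fit_line([x[i] for i in train], [y[i] for i in train])
        predictions = [intercept + slope * x[i] for i in held_out]
        fold_scores.append(rmse([y[i] for i in held_out], predictions))
    return sum(fold_scores) / len(fold_scores)


# Synthetic data (an assumption for illustration): systolic = 40 + 1.1 * diastolic + noise
rng = random.Random(1)
diastolic = [rng.uniform(60, 100) for _ in range(200)]
systolic = [40 + 1.1 * d + rng.gauss(0, 5) for d in diastolic]
cv_rmse = cross_validated_rmse(diastolic, systolic, k=10)
```

Each observation is held out exactly once, so the averaged RMSE reflects error on data the fitted line did not see.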

## **Data Understanding**

### Data Meaning and Types

The features included are described on the Kaggle page<sup>(0-1) </sup> for this data as being separable into three categories:
* *Objective*: Factual initial information about the patient;
* *Examination*: Information resulting from medical examination;
* *Subjective*: Information given by the patient.

Distinctions between the different attributes are paramount to interpreting and qualifying results of analysis and modeling of this data as they can represent varying degrees of validity and potential biases, so it is important that we keep this in mind as we explore the data and eventually make any recommendations. For example, we would likely assume blood pressure measurements will be reasonably accurate as they were taken by a trained health professional, but should be wary of a patient's proclivity to be honest when asked whether they are a regular smoker or drinker.

The dataset acquired from Kaggle is stored for our use on GitHub<sup>(1-1) </sup>. The code for importing the data is combined with the initial loading of various analysis and visualization packages below.

In [4]:
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image
from scipy import stats
from statsmodels.stats.outliers_influence import variance_inflation_factor
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [5]:
df = pd.read_csv("https://raw.githubusercontent.com/jteysen/MSDS-7331-Machine-Learning-I/master/Data/cardio_train.csv")

The attributes included in the dataset are outlined below per their descriptions on the Kaggle page.

|Attribute Name | Category | Description |
|---------------|----------|-------------|
|age | Objective | Age of the Patient (days) |
|height | Objective | Height of the Patient (cm) |
|weight | Objective | Weight of the Patient (kg) |
|gender | Objective | Gender of the Patient (1:M, 2:F)|
|ap_hi | Examination | Systolic blood pressure (mmHg)|
|ap_lo | Examination | Diastolic blood pressure (mmHg)|
|cholesterol | Examination | Cholesterol level -  1: normal, 2: above normal, 3: well above normal |
|gluc | Examination | Glucose level - 1: normal, 2: above normal, 3: well above normal |
|smoke | Subjective | Patient does (1) or does not (0) describe themselves as a smoker |
|alco | Subjective | Patient does (1) or does not (0) regularly drink alcohol |
|active | Subjective | Patient does (1) or does not (0) regularly exercise |
|cardio | Target Variable | Diagnosis of presence (1) or absence (0) of cardiovascular disease |

Most of these need little explanation, but we researched the attributes in the *Examination* category for a more in-depth understanding of their measurement and their known implications on cardiovascular disease and general heart health.

Blood pressure is typically measured using an inflatable cuff with a gauge that measures mmHg (the pressure exerted by a 1 mm high column of mercury). Measurements are taken that represent the maximum pressure exerted on arteries when the heart beats (systolic - higher) and the pressure between heart beats (diastolic - lower). The condition where either or both of these measurements is high is called hypertension, and this condition has been linked to cardiovascular disease. The American Heart Association outlines blood pressure levels by severity.
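As a rough illustration of those severity levels, the sketch below encodes the category thresholds as we understand them from the American Heart Association chart (2017 guideline values). The thresholds and the function name are assumptions added here, not something defined by the dataset, and should be checked against the AHA source before any real use.

```python
def bp_category(systolic, diastolic):
    """Classify one reading (mmHg) using assumed AHA category thresholds."""
    # Checks run from most to least severe so the worst applicable category wins.
    if systolic > 180 or diastolic > 120:
        return "Hypertensive crisis"
    if systolic >= 140 or diastolic >= 90:
        return "Hypertension stage 2"
    if systolic >= 130 or diastolic >= 80:
        return "Hypertension stage 1"
    if systolic >= 120:
        return "Elevated"
    return "Normal"


readings = [(110, 70), (125, 75), (135, 85), (150, 95), (190, 100)]
categories = [bp_category(hi, lo) for hi, lo in readings]
```

Because either measurement can push a reading into a higher category, the checks use `or` rather than requiring both values to be high.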

In [6]:
# Continuous features
cont_features = ['age', 'height', 'weight', 'ap_hi', 'ap_lo']
# Categorical features
cat_features = ['gender', 'cholesterol', 'gluc', 'smoke', 'alco', 'active']

Since the ID field is simply the ordered numbering of all of the observations, it is not useful to us, so we will drop it from our data frame. This also reduces the size of any pairwise comparisons without additional filtering.

In [7]:
df.drop(['id'], inplace=True, axis=1)

In [8]:
# Rows that exactly repeat an earlier row (the first occurrence is not flagged)
duplicate_observations = df[df.duplicated(keep='first')]

Having 24 individuals with the same values for this relatively small list of attributes is entirely possible in a dataset of 70,000. All of the duplicates appear very near the average or medians for the different attributes, making the likelihood of the duplicates high and their legitimacy believable. For that reason, we will leave them.
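For readers unfamiliar with how `duplicated(keep='first')` counts, the toy frame below (made-up values, not rows from the dataset) shows that only the repeats are flagged, so the count reflects extra copies rather than every member of a duplicated group.

```python
import pandas as pd

# Hypothetical miniature frame: rows 0, 1 and 3 are identical
toy = pd.DataFrame({
    "height": [165, 165, 170, 165],
    "weight": [70.0, 70.0, 80.0, 70.0],
    "cardio": [0, 0, 1, 0],
})

# keep="first" leaves the first occurrence unflagged and marks later repeats
dup_mask = toy.duplicated(keep="first")
n_duplicates = int(dup_mask.sum())

# What dropping them would look like, had we chosen to remove duplicates
deduplicated = toy.drop_duplicates(keep="first")
```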

In [9]:
#convert the age to number of years instead of number of days 
df['age_years'] = (df['age'] / 365).round().astype('int')
# drop the feature 'age'
df.drop('age', axis=1, inplace=True)
# New Continuous features
cont_features = ['age_years', 'height', 'weight', 'ap_hi', 'ap_lo']

In [10]:
# Drop records whose blood pressure readings (mmHg) fall outside the ranges below
df = df[(df["ap_lo"] < 140) & (df["ap_lo"] > 20)]
df = df[(df["ap_hi"] < 220) & (df["ap_hi"] > 40)]

print(df[['ap_lo', 'ap_hi']].describe())

              ap_lo         ap_hi
count  68698.000000  68698.000000
mean      81.305395    126.572753
std        9.439463     16.595382
min       30.000000     60.000000
25%       80.000000    120.000000
50%       80.000000    120.000000
75%       90.000000    140.000000
max      135.000000    215.000000


In [11]:
# swap values for records where ap_lo > ap_hi
temp = df['ap_lo'] > df['ap_hi']
df.loc[temp, ['ap_lo','ap_hi']] = df.loc[temp, ['ap_hi','ap_lo']].values
# Check that swap worked
len(df.query('ap_lo > ap_hi'))

0
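The `.values` call in the swap above matters: when a DataFrame is assigned through `.loc`, pandas aligns on column labels first, so assigning the reordered frame directly would put each value back in its own column. A small sketch on made-up readings shows the raw-values version doing the swap.

```python
import pandas as pd

# Hypothetical readings; row 1 has systolic and diastolic entered the wrong way round
toy = pd.DataFrame({"ap_hi": [120, 80, 140], "ap_lo": [80, 120, 90]})

swapped_rows = toy["ap_lo"] > toy["ap_hi"]

# .values strips the column labels, so assignment is positional:
# ap_lo receives the old ap_hi and ap_hi receives the old ap_lo
toy.loc[swapped_rows, ["ap_lo", "ap_hi"]] = toy.loc[swapped_rows, ["ap_hi", "ap_lo"]].values

remaining = int((toy["ap_lo"] > toy["ap_hi"]).sum())
```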

In [12]:
# Keep heights strictly between 120 cm and 250 cm
df = df[(df["height"] < 250)]
df = df[(df["height"] > 120)]
print(df[['height']].describe())

             height
count  68615.000000
mean     164.433520
std        7.856655
min      122.000000
25%      159.000000
50%      165.000000
75%      170.000000
max      207.000000


In [13]:
# Drop records with a weight of 40 kg or less
df = df[(df["weight"] > 40)]
print(df[['weight']].describe())

             weight
count  68525.000000
mean      74.151076
std       14.240436
min       41.000000
25%       65.000000
50%       72.000000
75%       82.000000
max      200.000000


In [14]:
# Drop records with a weight of 200 kg or more
df = df[(df["weight"] < 200)]
print(df[['weight']].describe())

             weight
count  68523.000000
mean      74.147402
std       14.224404
min       41.000000
25%       65.000000
50%       72.000000
75%       82.000000
max      183.000000


In [15]:
#create a variable that bins age
df2 = df.copy()
# Bins are right-closed: (0, 40], (40, 50], (50, 60], (60, 80]
df2['age_in_groups'] = pd.cut(df['age_years'], bins=[0, 40, 50, 60, 80], labels=['<40', '40-50', '50-60', '>60'])
df2['age_in_groups'].describe()

count     68523
unique        4
top       50-60
freq      34771
Name: age_in_groups, dtype: object
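One detail of the binning is easy to miss: `pd.cut` uses right-closed intervals by default, so a boundary age falls into the lower group (an age of exactly 40 is labelled `<40`, and exactly 50 is labelled `40-50`). The sketch below applies the same bin edges and labels to a few made-up ages.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([39, 40, 41, 50, 60, 61, 65])

groups = pd.cut(
    ages,
    bins=[0, 40, 50, 60, 80],                 # intervals (0,40], (40,50], (50,60], (60,80]
    labels=["<40", "40-50", "50-60", ">60"],
)
group_names = groups.astype(str).tolist()
```

If boundary ages should instead move into the upper group, `right=False` switches the intervals to left-closed.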

In [16]:
# Create and display BMI as an example of a new variable. BMI = weight in kg / (height in m)^2
df2['BMI'] = df['weight'] / ((df['height'] / 100) ** 2)

In [17]:
df2

Unnamed: 0,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age_years,age_in_groups,BMI
0,2,168,62.0,110,80,1,1,0,0,1,0,50,40-50,21.967120
1,1,156,85.0,140,90,3,1,0,0,1,1,55,50-60,34.927761
2,1,165,64.0,130,70,3,1,0,0,0,1,52,50-60,23.507805
3,2,169,82.0,150,100,1,1,0,0,1,1,48,40-50,28.710479
4,1,156,56.0,100,60,1,1,0,0,0,0,48,40-50,23.011177
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69995,2,168,76.0,120,80,1,1,1,0,1,0,53,50-60,26.927438
69996,1,158,126.0,140,90,2,2,0,0,1,1,62,>60,50.472681
69997,2,183,105.0,180,90,3,1,0,1,0,1,52,50-60,31.353579
69998,1,163,72.0,135,80,1,2,0,0,0,1,61,>60,27.099251
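As a hand check of the BMI definition (weight in kilograms divided by the square of height in metres), the small helper below works through one patient with a weight of 62 kg and a height of 168 cm. The function name is hypothetical and exists only for this illustration.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight (kg) divided by height (m) squared."""
    height_m = height_cm / 100      # the dataset stores height in centimetres
    return weight_kg / height_m ** 2


example = round(bmi(62.0, 168), 2)  # 62 / 1.68**2
```

Note that the height must be squared, not rounded; rounding 1.68 m to 2 would divide the weight by 2 and give a value well outside the usual BMI range.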


In [20]:
df2.to_csv(r'CVD.csv', index = False)

### Logistic Regression Modeling

In [None]:
# Separate the data into two parts: X (features) and Y (target)
features = ["gender", "height", "weight", "ap_hi", "ap_lo", "cholesterol", "gluc", "smoke", "alco", "active", "age_years", "BMI"]
# Separating out the features
X = df2.loc[:, features].values

# Separating out the target (taken from df2 so X and Y come from the same frame)
Y = df2.loc[:, 'cardio'].values
Y.shape