### Project - Predict heart attack based on physiological and physical parameters of the patient

### Action plan

1. Read, explore and clean all the data sets.
2. See what the data sets have in common and how they relate to each other.
3. Combine or merge the relevant columns from data sets 1 and 3 to create a richer data set.
4. Plot the main relations between variables to better understand the behaviour of the data.
5. Run machine learning algorithms to train and test. The target is 0 or 1, depending on whether or not the patient had a stroke.
6. Test all ML models and measure the error.
7. Take the best model and run it on the data set stroke_predictorSet2.
8. Measure the error. If it is too high, retrain the ML models with different parameters.
9. Return the best possible ML model to predict stroke with this data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### First data set analysis and cleaning

1. age
2. sex 1=male, 0=female
3. chest pain type (4 values)
4. resting blood pressure
5. serum cholesterol in mg/dl (100-130 normal, 130-160 above normal, more than 160 well above normal)
6. fasting blood sugar > 120 mg/dl
7. resting electrocardiographic results (values 0,1,2)
8. maximum heart rate achieved
9. exercise induced angina
10. oldpeak = ST depression induced by exercise relative to rest
11. the slope of the peak exercise ST segment
12. number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

In [116]:
df_stroke = pd.read_csv('stroke_predictorSet.csv') #heart measurements taken in India

In [117]:
df_stroke.sample(5)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
205,52,1,0,128,255,0,1,161,1,0.0,2,1,3,0
110,64,0,0,180,325,0,1,154,1,0.0,2,0,2,1
198,62,1,0,120,267,0,1,99,1,1.8,1,2,3,0
265,66,1,0,112,212,0,0,132,1,0.1,2,1,2,0
170,56,1,2,130,256,1,0,142,1,0.6,1,1,1,0


In [118]:
df_stroke[df_stroke.duplicated()] #Check for duplicates

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
164,38,1,2,138,175,0,1,173,0,0.0,2,4,2,1


In [119]:
df_stroke.drop_duplicates(inplace=True) #Remove duplicates and check size of the data frame
df_stroke.shape

(302, 14)
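
As a small illustration of the duplicate check above, the sketch below builds a toy frame (the values are made up, not read from stroke_predictorSet.csv). It shows that `duplicated()` flags only the second and later copies of a row, so `drop_duplicates()` keeps one copy of each.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [38, 52, 38],
    "sex": [1, 1, 1],
    "chol": [175, 255, 175],
})

# duplicated() marks a row True only if an identical row appeared earlier
dup_mask = toy.duplicated()
print(dup_mask.tolist())

# drop_duplicates() keeps the first occurrence and drops the repeats
deduped = toy.drop_duplicates()
print(deduped.shape)
```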

In [120]:
#Set appropriate names for the columns
columns = ['Age', 'Sex', 'ChestPain', 'SystolicPressure', 'Cholesterol', 'Diabetes', 'ecgStatus', 'MaxHeartBeat', 'InducedAngina',
           'STdepression','STslope', 'fluorVessels', 'Thalassemia', 'Target']
df_stroke.columns=columns
df_stroke.sample(5)

Unnamed: 0,Age,Sex,ChestPain,SystolicPressure,Cholesterol,Diabetes,ecgStatus,MaxHeartBeat,InducedAngina,STdepression,STslope,fluorVessels,Thalassemia,Target
107,45,0,0,138,236,0,0,152,1,0.2,1,0,2,1
57,45,1,0,115,260,0,0,185,0,0.0,2,0,2,1
185,44,1,0,112,290,0,0,153,0,0.0,2,1,2,0
263,63,0,0,108,269,0,1,169,1,1.8,1,2,2,0
132,42,1,1,120,295,0,1,162,0,0.0,2,0,2,1


In [172]:
df_stroke.describe()

Unnamed: 0,Age,Sex,SystolicPressure,Diabetes,Target
count,302.0,302.0,302.0,302.0,302.0
mean,54.42053,0.682119,131.602649,0.149007,0.543046
std,9.04797,0.466426,17.563394,0.356686,0.49897
min,29.0,0.0,94.0,0.0,0.0
25%,48.0,0.0,120.0,0.0,0.0
50%,55.5,1.0,130.0,0.0,1.0
75%,61.0,1.0,140.0,0.0,1.0
max,77.0,1.0,200.0,1.0,1.0


In [121]:
#Let's select the columns common to the other datasets in order to compare them
df_stroke.drop(columns=['ChestPain', 'ecgStatus', 'MaxHeartBeat', 'InducedAngina', 'STdepression', 'STslope', 'fluorVessels', 'Thalassemia'], inplace=True)

In [122]:
df_stroke.head()

Unnamed: 0,Age,Sex,SystolicPressure,Cholesterol,Diabetes,Target
0,63,1,145,233,1,1
1,37,1,130,250,0,1
2,41,0,130,204,0,1
3,56,1,120,236,0,1
4,57,0,120,354,0,1


In [123]:
#Classify the Cholesterol column into three categories (100-130 normal, 130-160 above normal, more than 160 well above normal)
df_stroke['Cholesterol']=pd.cut(df_stroke['Cholesterol'], bins=[0,130,160,1000],right=False, labels=[1,2,3])

In [124]:
df_stroke.sample(5) #df_stroke clean and features ready for modeling

Unnamed: 0,Age,Sex,SystolicPressure,Cholesterol,Diabetes,Target
74,43,0,122,3,0,1
131,49,0,134,3,0,1
243,57,1,152,3,0,0
96,62,0,140,3,0,1
111,57,1,150,1,1,1
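
The effect of `right=False` in the `pd.cut` call above can be checked on a few hand-picked values (illustrative, not taken from the data set). Each bin includes its left edge and excludes its right edge, so a value of exactly 130 lands in category 2 and a value of exactly 160 in category 3.

```python
import pandas as pd

# Hand-picked cholesterol values in mg/dl, chosen to sit on and around the bin edges
chol = pd.Series([129, 130, 159, 160, 354])

# Same bins and labels as the notebook cell:
# [0, 130) -> 1, [130, 160) -> 2, [160, 1000) -> 3
chol_cat = pd.cut(chol, bins=[0, 130, 160, 1000], right=False, labels=[1, 2, 3])
print(chol_cat.tolist())
```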


### Second data set analysis and cleaning
1. Age | Objective Feature | age | int (days)
2. Height | Objective Feature | height | int (cm) |
3. Weight | Objective Feature | weight | float (kg) |
4. Gender | Objective Feature | gender | categorical code | 1 woman, 2 man
5. Systolic blood pressure | Examination Feature | ap_hi | int |
6. Diastolic blood pressure | Examination Feature | ap_lo | int |
7. Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
8. Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
9. Smoking | Subjective Feature | smoke | binary |
10. Alcohol intake | Subjective Feature | alco | binary |
11. Physical activity | Subjective Feature | active | binary |
12. Presence or absence of cardiovascular disease | Target Variable | cardio | binary |

In [156]:
df_stroke2 = pd.read_csv('stroke_predictorSet2.csv', sep=';')

In [157]:
df_stroke2.head() #This data set is independent of the first one and also has a target; we will use it to test our model.

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0


In [158]:
df_stroke2 = df_stroke2.dropna(axis=0) #Remove rows with null values (dropna returns a new frame, so the result is assigned)
df_stroke2

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
69995,99993,19240,2,168,76.0,120,80,1,1,1,0,1,0
69996,99995,22601,1,158,126.0,140,90,2,2,0,0,1,1
69997,99996,19066,2,183,105.0,180,90,3,1,0,1,0,1
69998,99998,22431,1,163,72.0,135,80,1,2,0,0,0,1
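
Note that `dropna` returns a new frame rather than modifying the original, so the result has to be assigned (or `inplace=True` passed) for the rows to actually be removed. A toy example with invented values:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value, invented for illustration
toy = pd.DataFrame({"age": [18393, np.nan, 18857], "ap_hi": [110, 140, 130]})

# Calling dropna() without assignment leaves the original untouched
toy.dropna(axis=0)
print(len(toy))

# Assigning the result keeps only the rows without NaN
cleaned = toy.dropna(axis=0)
print(len(cleaned))
```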


In [159]:
df_stroke2[df_stroke2.duplicated()] #Check for duplicates. There are no duplicates in this case

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio


In [160]:
df_stroke2['age']=df_stroke2['age']/365 #Age was given in days, transform to years

In [161]:
#Set appropriate column names for the second dataset, matching those used in df_stroke
columns2 = ['Id','Age', 'Sex', 'Height', 'Weight','SystolicPressure', 'DiastolicPressure', 'Cholesterol', 'Diabetes', 'Smoker', 'Alcoholic',
           'ActiveSport','Target']
df_stroke2.columns=columns2
df_stroke2['Age']=df_stroke2['Age'].astype(int) #Truncate all ages to whole years (astype(int) drops the fractional part, it does not round)

In [162]:
df_stroke2.head(5)

Unnamed: 0,Id,Age,Sex,Height,Weight,SystolicPressure,DiastolicPressure,Cholesterol,Diabetes,Smoker,Alcoholic,ActiveSport,Target
0,0,50,2,168,62.0,110,80,1,1,0,0,1,0
1,1,55,1,156,85.0,140,90,3,1,0,0,1,1
2,2,51,1,165,64.0,130,70,3,1,0,0,0,1
3,3,48,2,169,82.0,150,100,1,1,0,0,1,1
4,4,47,1,156,56.0,100,60,1,1,0,0,0,0
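
`astype(int)` drops the fractional part instead of rounding, which makes a difference for patients more than half a year past their last birthday. The sketch below uses the first three raw `age` values from the table further up; the `round()` variant is shown only for comparison and is not what the notebook does.

```python
import pandas as pd

# First three rows of the raw 'age' column, in days
age_days = pd.Series([18393, 20228, 18857])

age_years = age_days / 365                # 365-day years, ignoring leap days
truncated = age_years.astype(int)         # notebook behaviour: drop the fraction
rounded = age_years.round().astype(int)   # alternative: nearest whole year

print(truncated.tolist())
print(rounded.tolist())
```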


In [163]:
#Let's select the columns common to the other datasets in order to compare them
df_stroke2.drop(columns=['Id', 'Height','Weight', 'DiastolicPressure', 'Smoker', 'Alcoholic', 'ActiveSport'], inplace=True)

In [164]:
df_stroke2.head(5)

Unnamed: 0,Age,Sex,SystolicPressure,Cholesterol,Diabetes,Target
0,50,2,110,1,1,0
1,55,1,140,3,1,1
2,51,1,130,3,1,1
3,48,2,150,1,1,1
4,47,1,100,1,1,0
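
One thing to watch when comparing the data sets: this one codes `Sex` as 1 = woman, 2 = man, while the first data set uses 1 = male, 0 = female. A possible way to align them is sketched below on a toy frame; the mapping step is a suggestion and is not part of the original notebook.

```python
import pandas as pd

# Toy column using the coding of the second data set (1 = woman, 2 = man)
toy = pd.DataFrame({"Sex": [2, 1, 1, 2, 1]})

# Recode to the convention of the first data set: 0 = female, 1 = male
toy["Sex"] = toy["Sex"].map({1: 0, 2: 1})
print(toy["Sex"].tolist())
```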


In [170]:
#Categorize the Diabetes column into only two categories: 0 means normal and 1 means above normal or patient with diabetes (>100mg/dl blood sugar)
#With right=False the bins are [1,2) and [2,4), so the code 3 falls in the second bin instead of becoming NaN
df_stroke2['Diabetes'] = pd.cut(df_stroke2['Diabetes'], bins=[1,2,4],labels=[0,1], right=False)

In [171]:
df_stroke2.sample(5)

Unnamed: 0,Age,Sex,SystolicPressure,Cholesterol,Diabetes,Target
41378,39,2,120,1,0,0
60704,56,1,120,1,0,1
33830,57,2,130,3,1,1
42238,58,2,120,1,0,0
31927,59,1,140,1,0,1
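
Because the glucose column only takes the codes 1, 2 and 3, the same recoding can also be written as an explicit mapping, which avoids having to reason about bin edges. With `right=False`, a last bin edge of 3 leaves the code 3 outside every interval and turns it into NaN. A sketch on toy values:

```python
import pandas as pd

# 1: normal, 2: above normal, 3: well above normal
gluc = pd.Series([1, 2, 3, 1])

# Bins [1, 2) and [2, 3): the value 3 is not covered and becomes NaN
narrow = pd.cut(gluc, bins=[1, 2, 3], labels=[0, 1], right=False)
print(narrow.isna().tolist())

# Explicit mapping: 1 -> 0 (normal), 2 and 3 -> 1 (above normal)
binary = gluc.map({1: 0, 2: 1, 3: 1})
print(binary.tolist())
```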


### Third data set analysis and cleaning


Sex coding in this data set: 1 = male, 0 = female

In [181]:
df_stroke3 = pd.read_csv('stroke_predictoSet3.csv')

In [182]:
df_stroke3[df_stroke3.duplicated()] #Check for duplicates. There are no duplicates in this case

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD


In [183]:
columns3 = ['Sex', 'Age', 'Education', 'Smoker','CigarretesPerDay', 'BloodPressureMedicines', 'PreviousStroke', 'HyperSentitive', 'Diabetes', 'Cholesterol',
           'SystolicPressure','DiastolicPressure','BodyMassIndex','HeartRate','FastingBloodSugar', 'MoreThanTenYearDisease']
df_stroke3.columns=columns3

In [184]:
df_stroke3.head()

Unnamed: 0,Sex,Age,Education,Smoker,CigarretesPerDay,BloodPressureMedicines,PreviousStroke,HyperSentitive,Diabetes,Cholesterol,SystolicPressure,DiastolicPressure,BodyMassIndex,HeartRate,FastingBloodSugar,MoreThanTenYearDisease
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0


In [185]:
df_stroke3.drop(axis=1, columns= ['Education', 'Smoker','CigarretesPerDay', 'BloodPressureMedicines', 'PreviousStroke', 'HyperSentitive','DiastolicPressure','BodyMassIndex','HeartRate', 'FastingBloodSugar', 'MoreThanTenYearDisease'], inplace=True)

In [186]:
df_stroke3.sample(5)

Unnamed: 0,Sex,Age,Diabetes,Cholesterol,SystolicPressure
3953,0,44,0,174.0,174.0
2761,0,44,0,205.0,109.0
1545,0,48,0,193.0,138.5
2374,1,45,0,218.0,133.0
755,0,45,0,222.0,121.0
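
Before this frame can be combined with the first one (step 3 of the action plan), its `Cholesterol` column, still in mg/dl here, would need the same three-level coding. A minimal sketch, assuming the thresholds used for the first data set also apply here; the toy rows are copied from the sample above and `common_cols` is a hypothetical name.

```python
import pandas as pd

# Three rows copied from the df_stroke3 sample shown above
toy3 = pd.DataFrame({
    "Sex": [0, 0, 1],
    "Age": [44, 48, 45],
    "Diabetes": [0, 0, 0],
    "Cholesterol": [174.0, 193.0, 218.0],
    "SystolicPressure": [174.0, 138.5, 133.0],
})

# Same bins as the first data set: [0, 130) -> 1, [130, 160) -> 2, [160, 1000) -> 3
toy3["Cholesterol"] = pd.cut(toy3["Cholesterol"], bins=[0, 130, 160, 1000],
                             right=False, labels=[1, 2, 3])

# Put the shared columns in the same order as the first data set (hypothetical list)
common_cols = ["Age", "Sex", "SystolicPressure", "Cholesterol", "Diabetes"]
toy3 = toy3[common_cols]
print(toy3.columns.tolist())
print(toy3["Cholesterol"].tolist())
```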


In [None]:
# Choose target and features (column names as renamed during cleaning)
y = df_stroke['Target']
stroke_features = ['Age', 'Sex', 'Cholesterol', 'Diabetes']
X = df_stroke[stroke_features]

from sklearn.model_selection import train_test_split

# Split data into training and validation data, for both features and target.
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
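
Steps 5 and 6 of the action plan (train a model, then measure the error) could look roughly like the sketch below. It runs on a synthetic stand-in for the cleaned frame, so the numbers it prints say nothing about the real data. The choice of `LogisticRegression` and of accuracy as the metric are assumptions; the notebook does not specify either.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned frame; values are random, not real patients
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Age": rng.integers(29, 78, size=n),
    "Sex": rng.integers(0, 2, size=n),
    "Cholesterol": rng.integers(1, 4, size=n),
    "Diabetes": rng.integers(0, 2, size=n),
})
demo["Target"] = rng.integers(0, 2, size=n)

X = demo[["Age", "Sex", "Cholesterol", "Diabetes"]]
y = demo["Target"]

# Default split: 75% of the rows for training, 25% for validation
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Fit one candidate model and score it on the held-out rows
model = LogisticRegression(max_iter=1000)
model.fit(train_X, train_y)
val_accuracy = accuracy_score(val_y, model.predict(val_X))
val_error = 1 - val_accuracy
print(f"validation error: {val_error:.3f}")
```

The same fit-and-score pattern can be repeated for each candidate model so that their validation errors can be compared side by side.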