### Project - Predict heart attack based on physiological and physical parameters of the patient

### Action plan

1. Read, explore, and clean all the data sets.
2. Identify what the data sets have in common and how they relate to each other.
3. Combine or merge the relevant columns from data sets 1 and 3 to create a richer data set.
4. Plot the main relationships between variables to better understand the behaviour of the data.
5. Train and test machine learning algorithms. The target is 0 or 1, depending on whether or not the patient had a stroke.
6. Test all ML models and measure the error.
7. Take the best model and run it on the data set stroke_predictorSet2.
8. Measure the error. If it is not low enough, retrain the ML models with different parameters.
9. Return the best possible ML model for predicting stroke with this data.
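
Steps 5 and 6 can be sketched roughly as follows. This is a minimal illustration, not the project's final pipeline: it assumes scikit-learn is installed, uses synthetic data from `make_classification` as a stand-in for the cleaned data set, and the two candidate models and the helper name `compare_models` are arbitrary choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, random_state=0):
    """Fit each candidate on a training split and return test accuracy per model."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state
    )
    # Hypothetical candidates; any classifier with fit/predict could be added here.
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=random_state),
    }
    scores = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in: 300 rows, 13 features, binary target (NOT the real data).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
scores = compare_models(X, y)
best_name = max(scores, key=scores.get)
print(scores, best_name)
```

Holding out a test split before comparing models keeps the reported accuracy from being inflated by rows the models have already seen during training.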

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### First data set analysis and cleaning

1. age
2. sex
3. chest pain type (4 values)
4. resting blood pressure
5. serum cholesterol in mg/dl
6. fasting blood sugar > 120 mg/dl
7. resting electrocardiographic results (values 0,1,2)
8. maximum heart rate achieved
9. exercise induced angina
10. oldpeak = ST depression induced by exercise relative to rest
11. the slope of the peak exercise ST segment
12. number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

In [2]:
df_stroke = pd.read_csv('stroke_predictorSet.csv') #heart measurements taken in India

In [13]:
df_stroke.sample(5)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
251,43,1,0,132,247,1,0,143,1,0.1,1,4,3,0
282,59,1,2,126,218,1,1,134,0,2.2,1,1,1,0
163,38,1,2,138,175,0,1,173,0,0.0,2,4,2,1
57,45,1,0,115,260,0,0,185,0,0.0,2,0,2,1
65,35,0,0,138,183,0,1,182,0,1.4,2,0,2,1
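
The sampled rows include `ca = 4`, while the feature list documents `ca` as 0-3, so it may be worth checking each coded column against its documented values before modelling. The sketch below is one possible way to do that. The `expected` dictionary and the helper `out_of_range` are assumptions made for illustration (the allowed values for `ca` and `restecg` come from the feature list), and the small frame reuses three of the sampled rows with a subset of their columns.

```python
import pandas as pd

# Documented value sets, taken from the feature list (ca: 0-3, restecg: 0,1,2).
expected = {"ca": {0, 1, 2, 3}, "restecg": {0, 1, 2}}


def out_of_range(df, expected):
    """Return {column: sorted unexpected values} for columns that violate `expected`."""
    problems = {}
    for col, allowed in expected.items():
        unexpected = set(df[col].unique().tolist()) - allowed
        if unexpected:
            problems[col] = sorted(unexpected)
    return problems


# Three of the rows shown in the sample output (subset of columns).
sample = pd.DataFrame(
    {"age": [43, 59, 45], "restecg": [0, 1, 0], "ca": [4, 1, 0]},
    index=[251, 282, 57],
)
print(out_of_range(sample, expected))
```

Whether an unexpected code such as `ca = 4` marks a missing value or a different encoding is not stated in the source, so the check only reports it and leaves the decision to the analyst.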


In [17]:
df_stroke[df_stroke.duplicated()]  # Check for duplicates

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
164,38,1,2,138,175,0,1,173,0,0.0,2,4,2,1


In [26]:
df_stroke.drop_duplicates(inplace=True) #Remove duplicates and check size of the data frame
df_stroke.shape

(302, 14)
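
`duplicated()` flags a row only when every column matches an earlier row, and `drop_duplicates()` keeps the original index labels, so the labels have a gap where the duplicate was removed. A minimal sketch with a toy frame illustrates both points. The values reuse rows 163, 164, and 57 from the output above (subset of columns); the name `toy` and the optional `reset_index` step are illustrative assumptions, not something this notebook does.

```python
import pandas as pd

toy = pd.DataFrame(
    {"age": [38, 38, 45], "chol": [175, 175, 260], "target": [1, 1, 1]},
    index=[163, 164, 57],
)

dup_mask = toy.duplicated()        # True for the second of two identical rows
deduped = toy.drop_duplicates()    # returns a new frame; index labels are kept
renumbered = deduped.reset_index(drop=True)  # optional: contiguous 0..n-1 labels

print(dup_mask.tolist())
print(list(deduped.index), list(renumbered.index))
```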

In [33]:
df_stroke.sample(5)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
296,63,0,0,124,197,0,1,136,1,0.0,1,0,2,0
109,50,0,0,110,254,0,0,159,0,0.0,2,0,2,1
261,52,1,0,112,230,0,1,160,0,0.0,2,1,2,0
151,71,0,0,112,149,0,1,125,0,1.6,1,0,2,1
23,61,1,2,150,243,1,1,137,1,1.0,1,0,2,1


### Second data set analysis and cleaning

| # | Feature | Feature type | Column | Data type / values |
|---|---------|--------------|--------|--------------------|
| 1 | Age | Objective Feature | age | int (days) |
| 2 | Height | Objective Feature | height | int (cm) |
| 3 | Weight | Objective Feature | weight | float (kg) |
| 4 | Gender | Objective Feature | gender | categorical code (1: woman, 2: man) |
| 5 | Systolic blood pressure | Examination Feature | ap_hi | int |
| 6 | Diastolic blood pressure | Examination Feature | ap_lo | int |
| 7 | Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| 8 | Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| 9 | Smoking | Subjective Feature | smoke | binary |
| 10 | Alcohol intake | Subjective Feature | alco | binary |
| 11 | Physical activity | Subjective Feature | active | binary |
| 12 | Presence or absence of cardiovascular disease | Target Variable | cardio | binary |

In [46]:
df_stroke2 = pd.read_csv('stroke_predictorSet2.csv', sep=';')

In [47]:
df_stroke2.head()  # This data set is independent of the first one and also has a target; we will use it to test our model.

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0


In [48]:
df_stroke2[df_stroke2.duplicated()]  # Check for duplicates. There are no duplicates in this case

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio


In [49]:
df_stroke2['age'] = df_stroke2['age'] / 365  # Age was given in days; convert to years
df_stroke2.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,50.391781,2,168,62.0,110,80,1,1,0,0,1,0
1,1,55.419178,1,156,85.0,140,90,3,1,0,0,1,1
2,2,51.663014,1,165,64.0,130,70,3,1,0,0,0,1
3,3,48.282192,2,169,82.0,150,100,1,1,0,0,1,1
4,4,47.873973,1,156,56.0,100,60,1,1,0,0,0,0
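
Because the cell above overwrites `age` in place, running it a second time would divide the values by 365 again. One hedged alternative is to write the result to a new column instead. The column name `age_years` and the helper `add_age_years` are assumptions for this sketch, and the three ages in days are taken from the first `head()` output of this data set.

```python
import pandas as pd

DAYS_PER_YEAR = 365  # same divisor as the notebook; 365.25 would account for leap years


def add_age_years(df, days_col="age", years_col="age_years"):
    """Return a copy of df with age in years added; the source column is untouched."""
    out = df.copy()
    out[years_col] = out[days_col] / DAYS_PER_YEAR
    return out


raw = pd.DataFrame({"id": [0, 1, 2], "age": [18393, 20228, 18857]})
converted = add_age_years(raw)
converted = add_age_years(converted)  # re-running is harmless: same result
print(converted)
```

Keeping the original column in days also makes it possible to switch the divisor later without reloading the file.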


### Third data set analysis and cleaning


In [37]:
df_stroke3 = pd.read_csv('stroke_predictoSet3.csv')

In [38]:
df_stroke3.head()

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0


In [40]:
df_stroke3[df_stroke3.duplicated()]  # Check for duplicates. There are no duplicates in this case

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
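
Step 3 of the action plan combines columns from data sets 1 and 3. One way to sketch it is to rename the overlapping columns of the third set to the first set's names and stack the rows. The column mapping below is an assumption made for illustration: it treats `male` as equivalent to `sex`, `sysBP` as comparable to `trestbps`, and `TenYearCHD` as comparable to `target`, none of which is confirmed by the data set descriptions here. The two small frames reuse rows printed earlier in this notebook.

```python
import pandas as pd

# Assumed correspondence: data set 3 column -> data set 1 column.
COLUMN_MAP = {
    "male": "sex",
    "age": "age",
    "totChol": "chol",
    "sysBP": "trestbps",
    "TenYearCHD": "target",
}

set1 = pd.DataFrame(
    {"age": [43, 59], "sex": [1, 1], "trestbps": [132, 126],
     "chol": [247, 218], "target": [0, 0]}
)
set3 = pd.DataFrame(
    {"male": [1, 0], "age": [39, 46], "totChol": [195.0, 250.0],
     "sysBP": [106.0, 121.0], "TenYearCHD": [0, 0]}
)

shared = list(COLUMN_MAP.values())
set3_renamed = set3.rename(columns=COLUMN_MAP)[shared]

# Record where each row came from so the two sources can be compared later.
combined = pd.concat(
    [set1[shared].assign(source="set1"), set3_renamed.assign(source="set3")],
    ignore_index=True,
)
print(combined)
```

The `source` column is a deliberate choice: if the two targets turn out to measure different outcomes, the rows can still be separated and modelled independently.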
