# **Brain Stroke Prediction Using Machine Learning and Data Science.**



### **Importing Necessary Libraries.**

In [1]:
%pip install autoviz








In [2]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('dark_background')
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,classification_report, accuracy_score, precision_recall_curve,precision_score,recall_score,f1_score, roc_auc_score, roc_curve, auc
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor
from imblearn.over_sampling import RandomOverSampler,SMOTE
from imblearn.combine import SMOTETomek
from collections import Counter
import warnings
warnings.filterwarnings(action='ignore')
import time
from numpy import arange
import autoviz
from autoviz.AutoViz_Class import AutoViz_Class
#for interactive console
import ipywidgets
import ipywidgets as widgets
from ipywidgets import interact
from ipywidgets import interact_manual

Imported AutoViz_Class version: 0.0.81. Call using:
    from autoviz.AutoViz_Class import AutoViz_Class
    AV = AutoViz_Class()
    AV.AutoViz(filename, sep=',', depVar='', dfte=None, header=0, verbose=0,
                            lowess=False,chart_format='svg',max_rows_analyzed=150000,max_cols_analyzed=30)
Note: verbose=0 or 1 generates charts and displays them in your local Jupyter notebook.
      verbose=2 saves plots in your local machine under AutoViz_Plots directory and does not display charts.


### Import the Data Set from the necessary location.
The data set consists of 43,400 patient records related to brain stroke risk factors.
There are 12 columns in total, including the target column.
1. id
2. gender
3. age
4. hypertension
5. heart_disease
6. ever_married
7. work_type
8. Residence_type
9. avg_glucose_level
10. bmi
11. smoking_status
12. stroke(target_column)

In [3]:
dodge = pd.read_csv('train_strokes.csv')

In [4]:
# head() helps us to view the first 5 entries in our dataset.

dodge.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,30669,Male,3.0,0,0,No,children,Rural,95.12,18.0,,0
1,30468,Male,58.0,1,0,Yes,Private,Urban,87.96,39.2,never smoked,0
2,16523,Female,8.0,0,0,No,Private,Urban,110.89,17.6,,0
3,56543,Female,70.0,0,0,Yes,Private,Rural,69.04,35.9,formerly smoked,0
4,46136,Male,14.0,0,0,No,Never_worked,Rural,161.28,19.1,,0


In [5]:
# info() gives us the count and dtype, also helps us to identify whether there are any null values or not.

dodge.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43400 entries, 0 to 43399
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 43400 non-null  int64  
 1   gender             43400 non-null  object 
 2   age                43400 non-null  float64
 3   hypertension       43400 non-null  int64  
 4   heart_disease      43400 non-null  int64  
 5   ever_married       43400 non-null  object 
 6   work_type          43400 non-null  object 
 7   Residence_type     43400 non-null  object 
 8   avg_glucose_level  43400 non-null  float64
 9   bmi                41938 non-null  float64
 10  smoking_status     30108 non-null  object 
 11  stroke             43400 non-null  int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 4.0+ MB


In [6]:
# describe() gives a brief summary of the numeric columns (count, mean, std, min, quartiles, max).

dodge.describe()

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
count,43400.0,43400.0,43400.0,43400.0,43400.0,41938.0,43400.0
mean,36326.14235,42.217894,0.093571,0.047512,104.48275,28.605038,0.018041
std,21072.134879,22.519649,0.291235,0.212733,43.111751,7.77002,0.133103
min,1.0,0.08,0.0,0.0,55.0,10.1,0.0
25%,18038.5,24.0,0.0,0.0,77.54,23.2,0.0
50%,36351.5,44.0,0.0,0.0,91.58,27.7,0.0
75%,54514.25,60.0,0.0,0.0,112.07,32.9,0.0
max,72943.0,82.0,1.0,1.0,291.05,97.6,1.0


In [7]:
# In the case of object columns we get(count, unique values, top, freq)
dodge.describe(include = 'object')

Unnamed: 0,gender,ever_married,work_type,Residence_type,smoking_status
count,43400,43400,43400,43400,30108
unique,3,2,5,2,3
top,Female,Yes,Private,Urban,never smoked
freq,25665,27938,24834,21756,16053


### **Exploring Target Variable.**

In [8]:
dodge['stroke'].value_counts()

0    42617
1      783
Name: stroke, dtype: int64

In [9]:
# The target column has no null values.
dodge['stroke'].isnull().sum()

0

In [10]:
# This plot shows how the target classes are distributed.
# The classes are highly imbalanced (0 -> 42617, 1 -> 783), so they need to be balanced, which is handled in a later part.
# countplot() visualizes the number of rows in each class.

plt.figure(figsize = (6,4), dpi = 100)
sns.countplot(x = dodge['stroke'])
plt.xlabel('Stroke Status')
plt.ylabel('Count')
plt.title('Distribution of Target Classes')
plt.show()

### **Exploring Independent Numerical Columns.**


1. Cleaning
2. Treating Missing values
3. Anomaly Detection and Reduction



In [11]:
numerical = ['age', 'hypertension', 'heart_disease', 'avg_glucose_level', 'bmi']
#dodge[numerical[0]]

##### Treating missing values in dodge['bmi']; no other numerical column has missing values.





In [12]:
dodge['bmi'].isnull().sum()

1462

In [13]:
dodge['bmi'] = dodge['bmi'].fillna(dodge['bmi'].mean())

In [14]:
dodge['bmi'].isnull().sum()

0

##### Exploring each numerical column using describe()





In [15]:
for i in numerical:
  print(dodge[i].describe())

count    43400.000000
mean        42.217894
std         22.519649
min          0.080000
25%         24.000000
50%         44.000000
75%         60.000000
max         82.000000
Name: age, dtype: float64
count    43400.000000
mean         0.093571
std          0.291235
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: hypertension, dtype: float64
count    43400.000000
mean         0.047512
std          0.212733
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: heart_disease, dtype: float64
count    43400.000000
mean       104.482750
std         43.111751
min         55.000000
25%         77.540000
50%         91.580000
75%        112.070000
max        291.050000
Name: avg_glucose_level, dtype: float64
count    43400.000000
mean        28.605038
std          7.638023
min         10.100000
25%         23.400000
50%         28.100000
75%         32.600000
max         97.600000
Name: bmi, dtype: float64

##### Anomaly Detection and Reduction in Numerical Columns.

1. age

In [16]:
dodge['age'].describe()

count    43400.000000
mean        42.217894
std         22.519649
min          0.080000
25%         24.000000
50%         44.000000
75%         60.000000
max         82.000000
Name: age, dtype: float64

In [17]:
dodge['age'].value_counts()

51.00    738
52.00    721
53.00    701
78.00    698
50.00    694
        ... 
0.48      37
0.40      35
1.00      34
0.16      26
0.08      17
Name: age, Length: 104, dtype: int64

##### Function to check for anomalies in a column using upper_limit and lower_limit.


1. If the upper_limit > max(df['col']), then we replace the upper_limit with the max value.
2. Similarly, if the lower_limit < min(df['col']), we replace the lower_limit with the min value.
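
The two clipping rules above can be sketched as a small reusable helper. This is only an illustration under stated assumptions: the names `three_sigma_limits` and `find_anomalies` are hypothetical and do not appear in the notebook, the limits are taken as mean ± 3 population standard deviations, and the input is assumed to contain no missing values.

```python
import numpy as np

def three_sigma_limits(values):
    """Return (lower, upper): mean +/- 3 std, clipped to the observed min and max."""
    arr = np.asarray(values, dtype=float)
    mean, std = arr.mean(), arr.std()  # population std (ddof=0)
    lower = max(mean - 3 * std, arr.min())  # rule 2: never go below the observed minimum
    upper = min(mean + 3 * std, arr.max())  # rule 1: never go above the observed maximum
    return lower, upper

def find_anomalies(values):
    """Return the values that fall outside the clipped three-sigma limits."""
    arr = np.asarray(values, dtype=float)
    lower, upper = three_sigma_limits(arr)
    return arr[(arr < lower) | (arr > upper)]

# Toy data (not from the stroke dataset): one value sits far above the rest.
sample = np.array([10.0] * 20 + [12.0] * 20 + [100.0])
lower, upper = three_sigma_limits(sample)
outliers_found = find_anomalies(sample)
print(lower, upper, outliers_found)
```

A single helper like this could be called once per column instead of redefining the function for each feature.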



In [18]:
anamolies = []
def outliers(data):
  random_state_mean = np.mean(data)
  random_state_std = np.std(data)
  anamoly = random_state_std * 3

  upper_limit = random_state_mean + anamoly
  lower_limit = random_state_mean - anamoly
  lp_lower_limit = 1.00
  up_upper_limit = max(dodge['age'])
  print(upper_limit)
  print(lower_limit)

  print(lp_lower_limit)
  print(up_upper_limit)

  for i in data:
      if i < lp_lower_limit or i > up_upper_limit:
        anamolies.append(i)

In [19]:
outliers(dodge['age'])
print(len(anamolies))

109.7760617173718
-25.340273698938617
1.0
82.0
496


In [20]:
dodge.shape

(43400, 12)

Here all ages below 1 are treated as outliers, although in rare cases a stroke can occur in an unborn child in the womb (intrauterine stroke).

In this project we drop those rows; the cells below additionally drop ages strictly between 1 and 2. These values could be studied in future work.



In [21]:
dodge[dodge['age'] < 1.00]

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
116,7559,Female,0.64,0,0,No,children,Urban,83.82,24.9,,0
129,22706,Female,0.88,0,0,No,children,Rural,88.11,15.5,,0
323,61511,Female,0.32,0,0,No,children,Rural,73.71,16.2,,0
746,54747,Male,0.88,0,0,No,children,Rural,157.57,19.2,,0
761,53279,Male,0.24,0,0,No,children,Rural,118.87,16.3,,0
...,...,...,...,...,...,...,...,...,...,...,...,...
43031,2698,Female,0.32,0,0,No,children,Urban,91.86,17.6,,0
43106,51999,Male,0.32,0,0,No,children,Urban,90.38,16.1,,0
43220,36634,Female,0.08,0,0,No,children,Rural,125.11,12.1,,0
43296,52578,Male,0.56,0,0,No,children,Rural,78.07,21.9,,0


In [22]:
dodge[dodge['age'] < 1.00].index

Int64Index([  116,   129,   323,   746,   761,   861,   975,  1087,  1375,
             1389,
            ...
            42637, 42862, 42880, 42881, 42982, 43031, 43106, 43220, 43296,
            43330],
           dtype='int64', length=496)

In [23]:
# drop() with inplace=True returns None, so the result is not assigned to a variable.
dodge.drop(index = dodge[dodge['age'] < 1.00].index, inplace = True)

In [24]:
dodge.drop(index = dodge[(dodge.age > 1.0) & (dodge.age < 2.0)].index, axis = 0, inplace = True)

In [25]:
dodge.shape

(42309, 12)

2. avg_glucose_level (Average Glucose Level)

In [26]:
anamolies = []
def outliers(data):
  random_state_mean = np.mean(data)
  random_state_std = np.std(data)
  anamoly = random_state_std * 3

  upper_limit = random_state_mean + anamoly
  lower_limit = random_state_mean - anamoly
  ll_p = min(dodge['avg_glucose_level'])

  print(upper_limit)
  print(lower_limit)
  print(ll_p)
  for i in data:
    if i < ll_p or i > upper_limit:
      anamolies.append(i)

In [27]:
outliers(dodge['avg_glucose_level'])
print(len(anamolies))

235.13454455171652
-25.55617162869774
55.0
575


In [28]:
dodge['avg_glucose_level'].describe()

count    42309.000000
mean       104.789186
std         43.448966
min         55.000000
25%         77.570000
50%         91.650000
75%        112.260000
max        291.050000
Name: avg_glucose_level, dtype: float64

In [29]:
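# Note: this hard-coded cutoff (234.41) differs slightly from the upper limit printed above (235.13),
# so it selects 617 rows rather than the 575 values flagged by outliers().
dodge[dodge['avg_glucose_level'] > 234.40827023316058 ]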

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
7,41413,Female,75.0,0,1,Yes,Self-employed,Rural,243.53,27.000000,never smoked,0
54,18518,Male,66.0,0,0,Yes,Private,Rural,242.30,35.300000,smokes,0
77,4480,Male,76.0,0,0,Yes,Private,Rural,234.58,34.300000,formerly smoked,0
78,2982,Female,57.0,1,0,Yes,Private,Rural,235.85,40.100000,never smoked,0
83,59368,Female,78.0,0,0,Yes,Private,Urban,243.50,26.100000,never smoked,0
...,...,...,...,...,...,...,...,...,...,...,...,...
43228,27207,Male,41.0,1,0,Yes,Private,Rural,271.01,25.800000,,0
43279,49997,Male,67.0,0,0,Yes,Self-employed,Rural,242.61,47.000000,,0
43283,29575,Female,30.0,0,0,No,Self-employed,Urban,258.24,28.605038,never smoked,0
43287,22198,Male,66.0,0,0,Yes,Private,Rural,238.23,33.300000,formerly smoked,0


In [30]:
dodge[dodge['avg_glucose_level'] > 234.40827023316058].index

Int64Index([    7,    54,    77,    78,    83,    96,   139,   310,   322,
              469,
            ...
            43140, 43144, 43155, 43175, 43188, 43228, 43279, 43283, 43287,
            43358],
           dtype='int64', length=617)

In [31]:
dodge.drop(index = dodge[dodge['avg_glucose_level'] > 234.40827023316058].index, axis = 0, inplace = True)

In [32]:
dodge.shape

(41692, 12)

3. bmi (Body Mass Index)

In [33]:
anamolies = []
def outliers(data):
  random_state_mean = np.mean(data)
  random_state_std = np.std(data)
  anamoly = random_state_std * 3

  upper_limit = random_state_mean + anamoly
  lower_limit = random_state_mean - anamoly
  lll_p = min(dodge['bmi'])

  print(upper_limit)
  print(lower_limit)
  print(lll_p)
  for i in data:
    if i < lll_p or i > upper_limit:
      anamolies.append(i)

In [34]:
outliers(dodge['bmi'])
print(len(anamolies))

51.34051370653113
6.268167446955189
10.1
431


In [35]:
dodge['bmi'].describe()

count    41692.000000
mean        28.804341
std          7.512148
min         10.100000
25%         23.700000
50%         28.200000
75%         32.700000
max         97.600000
Name: bmi, dtype: float64

In [36]:
dodge[dodge['bmi'] > 51.35486554902225]

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
9,28674,Female,74.0,1,0,Yes,Self-employed,Urban,205.84,54.6,never smoked,0
21,72911,Female,57.0,1,0,Yes,Private,Rural,129.54,60.9,smokes,0
86,1703,Female,52.0,0,0,Yes,Private,Urban,82.24,54.7,formerly smoked,0
111,66333,Male,52.0,0,0,Yes,Self-employed,Urban,78.40,64.8,never smoked,0
184,53144,Female,52.0,0,1,Yes,Private,Urban,72.79,54.7,never smoked,0
...,...,...,...,...,...,...,...,...,...,...,...,...
43025,14846,Male,50.0,1,0,Yes,Govt_job,Rural,75.29,52.0,never smoked,0
43087,70198,Male,78.0,1,0,Yes,Private,Rural,135.73,89.0,never smoked,0
43239,36167,Male,21.0,0,0,No,Private,Urban,83.78,54.9,never smoked,0
43355,57237,Female,46.0,0,0,Yes,Private,Rural,99.81,53.2,,0


In [37]:
dodge[dodge['bmi'] > 51.35486554902225].index

Int64Index([    9,    21,    86,   111,   184,   220,   297,   302,   396,
              422,
            ...
            42560, 42589, 42604, 42831, 42977, 43025, 43087, 43239, 43355,
            43396],
           dtype='int64', length=431)

In [38]:
dodge.drop(index = dodge[dodge['bmi'] > 51.35486554902225].index, axis = 0, inplace = True)

In [39]:
dodge.shape

(41261, 12)

Checking Null Values in the Dataset.

Except smoking_status, every column is free from null values.



In [40]:
dodge.isnull().sum()

id                       0
gender                   0
age                      0
hypertension             0
heart_disease            0
ever_married             0
work_type                0
Residence_type           0
avg_glucose_level        0
bmi                      0
smoking_status       12015
stroke                   0
dtype: int64

### **Exploring Independent Categorical(Object/String) Columns.**


1. Cleaning
2. Treating Missing values



In [41]:
categorical = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']

In [42]:
for i in categorical:
  print(dodge[i].describe())

count      41261
unique         3
top       Female
freq       24498
Name: gender, dtype: object
count     41261
unique        2
top         Yes
freq      27051
Name: ever_married, dtype: object
count       41261
unique          5
top       Private
freq        24195
Name: work_type, dtype: object
count     41261
unique        2
top       Urban
freq      20664
Name: Residence_type, dtype: object
count            29246
unique               3
top       never smoked
freq             15655
Name: smoking_status, dtype: object


In [43]:
dodge.describe(include = 'object')

Unnamed: 0,gender,ever_married,work_type,Residence_type,smoking_status
count,41261,41261,41261,41261,29246
unique,3,2,5,2,3
top,Female,Yes,Private,Urban,never smoked
freq,24498,27051,24195,20664,15655


In [44]:
dodge['smoking_status'].value_counts()

never smoked       15655
formerly smoked     7222
smokes              6369
Name: smoking_status, dtype: int64

In [45]:
dodge.describe(include = 'all')

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
count,41261.0,41261,41261.0,41261.0,41261.0,41261,41261,41261,41261.0,41261.0,29246,41261.0
unique,,3,,,,2,5,2,,,3,
top,,Female,,,,Yes,Private,Urban,,,never smoked,
freq,,24498,,,,27051,24195,20664,,,15655,
mean,36315.893992,,42.998134,0.092533,0.046945,,,,102.504529,28.509461,,0.018032
std,21080.388177,,21.848829,0.28978,0.211524,,,,39.968402,6.94245,,0.133067
min,1.0,,1.0,0.0,0.0,,,,55.0,10.1,,0.0
25%,18007.0,,25.0,0.0,0.0,,,,77.37,23.7,,0.0
50%,36315.0,,44.0,0.0,0.0,,,,91.17,28.1,,0.0
75%,54539.0,,60.0,0.0,0.0,,,,110.77,32.5,,0.0


Missing values in object columns can be treated using:
1. The mode (mean and median apply only to numerical columns).
2. The frequency distribution of the observed values.
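
The two options can be compared on a toy column. This is a minimal sketch, not the notebook's method: the helper names `fill_with_mode` and `fill_by_frequency` are hypothetical, and the frequency-based variant is assumed here to mean drawing each missing value at random in proportion to the observed category frequencies.

```python
import numpy as np
import pandas as pd

def fill_with_mode(series):
    """Replace missing values with the most frequent observed category."""
    return series.fillna(series.mode().iloc[0])

def fill_by_frequency(series, seed=0):
    """Replace missing values with random draws that follow the observed category shares."""
    rng = np.random.default_rng(seed)
    freqs = series.value_counts(normalize=True)  # shares of the non-missing values
    missing = series.isna()
    draws = rng.choice(freqs.index.to_numpy(), size=int(missing.sum()), p=freqs.to_numpy())
    filled = series.copy()
    filled[missing] = draws
    return filled

# Toy column (not the real data) with two missing entries.
toy = pd.Series(['never smoked', 'smokes', None, 'never smoked', None, 'formerly smoked'])
mode_filled = fill_with_mode(toy)
freq_filled = fill_by_frequency(toy)
print(mode_filled.value_counts())
print(freq_filled.value_counts())
```

Mode imputation inflates the largest category, while the frequency-based draw roughly preserves the original proportions; which one is preferable depends on how the values came to be missing.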

In [46]:
dodge['smoking_status'].mode()

0    never smoked
dtype: object

In [47]:
dodge['smoking_status'] = dodge['smoking_status'].fillna('never smoked')

In [48]:
dodge['smoking_status'].isnull().sum()

0

All missing values have now been treated, and the data is ready for analysis.

In [49]:
dodge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41261 entries, 0 to 43399
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 41261 non-null  int64  
 1   gender             41261 non-null  object 
 2   age                41261 non-null  float64
 3   hypertension       41261 non-null  int64  
 4   heart_disease      41261 non-null  int64  
 5   ever_married       41261 non-null  object 
 6   work_type          41261 non-null  object 
 7   Residence_type     41261 non-null  object 
 8   avg_glucose_level  41261 non-null  float64
 9   bmi                41261 non-null  float64
 10  smoking_status     41261 non-null  object 
 11  stroke             41261 non-null  int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 4.1+ MB


In [50]:
dodge.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,30669,Male,3.0,0,0,No,children,Rural,95.12,18.0,never smoked,0
1,30468,Male,58.0,1,0,Yes,Private,Urban,87.96,39.2,never smoked,0
2,16523,Female,8.0,0,0,No,Private,Urban,110.89,17.6,never smoked,0
3,56543,Female,70.0,0,0,Yes,Private,Rural,69.04,35.9,formerly smoked,0
4,46136,Male,14.0,0,0,No,Never_worked,Rural,161.28,19.1,never smoked,0


### **Exploratory Data Analysis.** 
Exploratory Data Analysis helps us understand the dataset and extract patterns from it, which can help explain the problem statement to our clients.
The same information can be obtained with plain Python code, but visualizing the data is easier to read than tables of numbers and letters.
We therefore use various plots and graphs from the seaborn and matplotlib.pyplot libraries:
1. bar
2. countplot
3. piechart
4. hist
5. box
6. scatterplot

Apart from these, we have also used an automatic visualization tool, "autoviz".

In [51]:
dodge.isnull().sum()

id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

In [52]:
dodge.drop(columns = 'id', inplace=True)

In [53]:
dodge.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,3.0,0,0,No,children,Rural,95.12,18.0,never smoked,0
1,Male,58.0,1,0,Yes,Private,Urban,87.96,39.2,never smoked,0
2,Female,8.0,0,0,No,Private,Urban,110.89,17.6,never smoked,0
3,Female,70.0,0,0,Yes,Private,Rural,69.04,35.9,formerly smoked,0
4,Male,14.0,0,0,No,Never_worked,Rural,161.28,19.1,never smoked,0



-> pd.crosstab() is a very useful pandas function: it cross-tabulates two variables, which lets us plot how one is distributed across the other.
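
As a quick illustration of what crosstab() returns, the sketch below builds a tiny made-up frame (the values are not from the stroke dataset) and shows both raw counts and row-wise shares via the `normalize='index'` argument. Row shares can be easier to compare than counts when the groups differ a lot in size.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Female'],
    'stroke': [0, 0, 1, 1, 0, 0],
})

# Raw counts: one row per gender, one column per stroke value.
counts = pd.crosstab(toy['gender'], toy['stroke'])

# Row-wise shares: each row sums to 1.
shares = pd.crosstab(toy['gender'], toy['stroke'], normalize='index')

print(counts)
print(shares)
```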

1. Bar plot for crosstab distribution between gender and stroke.

In [54]:
x = pd.crosstab(dodge['gender'], dodge['stroke'])
# Pass figsize to DataFrame.plot(); a separate plt.figure() call would leave an empty figure behind.
x.plot(kind = 'bar', figsize = (8,6))
#x.div(x.sum(1).astype(float), axis = 0).plot(kind='bar', stacked = False)
plt.xlabel('Gender_distribution')
plt.ylabel('Count')
plt.title('Gender Distribution over Target Class')
plt.show()



2. Pie Chart for distribution of gender.

In [55]:
# PIE CHART for dodge['gender'] column.

plt.figure(figsize = (8,6), dpi = 90)
labels = dodge['gender'].value_counts().index
sizes = dodge['gender'].value_counts()
explode = [0,0,0.1]
colors = plt.cm.autumn(np.linspace(0,1,3))
plt.pie(sizes, colors=colors, labels=labels, explode=explode, shadow =True, startangle=90, autopct = '%.2f%%' )
plt.title('Gender',fontsize=12)
plt.legend()
plt.show()

3. Bar chart for gender-hypertension distribution.

In [56]:
plt.figure(figsize = (8,6), dpi = 90)
x = pd.crosstab(dodge['gender'],dodge['hypertension'])
x.plot(kind = 'bar')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.title("Gender_Hypertension_Distribution")
plt.show()

4. Bar Chart for age-hypertension distribution 

In [57]:
plt.rcParams['figure.figsize'] = (20,12)
x = pd.crosstab(dodge['age'], dodge['hypertension'])
x.plot(kind = 'bar')
plt.xlabel('Age')
plt.ylabel('Count')
plt.title("Age Hypertension Distribution")
plt.show()

5. Bar Chart for gender-heart_disease distribution

In [58]:
plt.figure(figsize=(12,10))
ab = pd.crosstab(dodge['gender'], dodge['heart_disease'])
ab.plot(kind = 'bar')
plt.show()

6. age-stroke distribution

In [59]:
plt.rcParams['figure.figsize'] = (20,12)
x = pd.crosstab(dodge['age'], dodge['stroke'])
x.plot(kind = 'bar')
plt.xlabel('Age')
plt.ylabel('Count')
plt.title("Age_Stroke Distribution")
plt.show()

7. age-heart_disease distribution.

In [60]:
plt.rcParams['figure.figsize'] = (20,12)
#plt.figure(figsize =(13,6))
x = pd.crosstab(dodge['age'], dodge['heart_disease'])
x.plot(kind = 'bar')
plt.xlabel('Age')
plt.ylabel('Count')
plt.title("Age Heart_Disease Distribution")
plt.show()

8. Distribution of people getting stroke with respect to whether they are married or not.

In [61]:
plt.rcParams['figure.figsize'] = (8,6)
h = pd.crosstab(dodge['ever_married'], dodge['stroke'])
h.plot(kind ='bar')
plt.show()

9. Scatterplot of avg_glucose_level against bmi with stroke as hue; hue is an additional parameter that separates the values using different colors.

In [62]:
# relplot() is a figure-level function, so its size is set with height/aspect rather than rcParams.
sns.relplot(data = dodge, x = 'avg_glucose_level', y = 'bmi', hue = 'stroke', kind = 'scatter', height = 6, aspect = 1.6)
plt.xlabel('Avg_Glucose_Level')
plt.ylabel('BMI')
plt.show()

10. Countplot() for checking distribution of work_type.

In [63]:
plt.figure(figsize = (12,10))
sns.countplot(x = dodge['work_type'], color ='red')
plt.xlabel("Work Type")
plt.ylabel('Count')
plt.title("Distribution of Work_type")
plt.show()

11. Distribution of work_type with respect to stroke occurrence.

In [64]:
plt.rcParams['figure.figsize'] = (8,6)
h = pd.crosstab(dodge['work_type'], dodge['stroke'])
h.plot(kind ='bar')
plt.xlabel("Work_type")
plt.ylabel("Count")
plt.title("Distribution of Work_type and stroke")
plt.show()

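# Instantiate AutoViz_Class before calling AutoViz(); AV was not defined earlier in the notebook.
AV = AutoViz_Class()
autovis = AV.AutoViz(filename = 'train_strokes.csv', sep=',')
autovis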

### **Exploring data using traditional python code, with the help of interactive widgets.**

In [65]:
abg = dodge[['hypertension', 'heart_disease']].groupby(['hypertension']).count().style.background_gradient(cmap = 'viridis')

Count of rows for each hypertension value (this is the count() aggregation, not a sum). The same breakdown is easier to read with crosstab().

In [66]:
abg

Unnamed: 0_level_0,heart_disease
hypertension,Unnamed: 1_level_1
0,37443
1,3818


In [67]:
dre = pd.crosstab(dodge['hypertension'], dodge['heart_disease'])
dre

heart_disease,0,1
hypertension,Unnamed: 1_level_1,Unnamed: 2_level_1
0,35984,1459
1,3340,478


@interact -> The interact function (ipywidgets.interact) automatically creates user interface (UI) controls for exploring code and data interactively.

The decorated function is called again each time a slider is moved.

In [68]:
@interact
def abc(x = 50):
  y = dodge[dodge['avg_glucose_level'] > x]
  return y['stroke'].value_counts()
abc()

interactive(children=(IntSlider(value=50, description='x', max=150, min=-50), Output()), _dom_classes=('widget…

0    40517
1      744
Name: stroke, dtype: int64

In [69]:
@interact
def hyp_heart(x=0, y=0):
  g = dodge[(dodge['hypertension'] == x) & (dodge['heart_disease'] == y)]
  return g['stroke'].value_counts()
hyp_heart()

interactive(children=(IntSlider(value=0, description='x', max=1), IntSlider(value=0, description='y', max=1), …

0    35541
1      443
Name: stroke, dtype: int64

In [70]:
@interact
def hy_he_eve(x=0,y=0,z='No'):
  j = dodge[(dodge['hypertension'] == x) & (dodge['heart_disease'] == y) & (dodge['ever_married'] == z)]
  return j['stroke'].value_counts(), j['smoking_status'].value_counts()
hy_he_eve()

interactive(children=(IntSlider(value=0, description='x', max=1), IntSlider(value=0, description='y', max=1), …

(0    13690
 1       44
 Name: stroke, dtype: int64,
 never smoked       11124
 smokes              1437
 formerly smoked     1173
 Name: smoking_status, dtype: int64)

### **Feature Transformation.**
Feature Transformation is the technique of converting a variable into another form, such as strings -> numeric values, or splitting a date column into its components.

Types of encoding.
1. Nominal Encoding.
  *   one hot encoding -> Creating Dummy variables.
  *   one hot encoding with multi categories(more than 20 categories)
  *   mean encoding


2. Ordinal Encoding.
  * Label Encoder
  * target_guided_encoding

-> For columns with fewer than 5 categories, we can perform the encoding manually using map().

-> For columns with more than 20 categories, we can perform one hot encoding with multi categories, where we keep only the top categories based on their value_counts().
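
The second rule, keeping only the most frequent categories, could look roughly like the sketch below. The helper name `one_hot_top_k` and the toy `city` column are assumptions for illustration; none of the columns in this dataset actually has more than 20 categories.

```python
import pandas as pd

def one_hot_top_k(series, k):
    """One-hot encode only the k most frequent categories; all other values get zeros everywhere."""
    top = series.value_counts().head(k).index
    return pd.DataFrame(
        {f"{series.name}_{cat}": (series == cat).astype(int) for cat in top},
        index=series.index,
    )

# Toy column: 'a' and 'b' are the two most frequent categories.
toy = pd.Series(['a', 'b', 'a', 'c', 'a', 'b', 'd'], name='city')
encoded = one_hot_top_k(toy, k=2)
print(encoded)
```

Rows whose category falls outside the top k end up as all zeros, which effectively groups the rare categories into one implicit "other" level.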


In [71]:
dodge.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,3.0,0,0,No,children,Rural,95.12,18.0,never smoked,0
1,Male,58.0,1,0,Yes,Private,Urban,87.96,39.2,never smoked,0
2,Female,8.0,0,0,No,Private,Urban,110.89,17.6,never smoked,0
3,Female,70.0,0,0,Yes,Private,Rural,69.04,35.9,formerly smoked,0
4,Male,14.0,0,0,No,Never_worked,Rural,161.28,19.1,never smoked,0


In [72]:
dodge['smoking_status'].unique()

array(['never smoked', 'formerly smoked', 'smokes'], dtype=object)

In [73]:
mapping = {'Male':2, 'Female':1, 'Other':0}
mapping1 = {'No':0, 'Yes':1}
mapping2 = {'never smoked':0, 'formerly smoked':1, 'smokes':2}

In [74]:
dodge['gender'] = dodge['gender'].map(mapping)

In [75]:
dodge['ever_married'] = dodge['ever_married'].map(mapping1)

In [76]:
dodge['smoking_status'] = dodge['smoking_status'].map(mapping2)

In [77]:
dodge[['gender', 'smoking_status', 'ever_married']].head()

Unnamed: 0,gender,smoking_status,ever_married
0,2,0,0
1,2,0,1
2,1,0,0
3,1,1,1
4,2,0,0


In [78]:
dodge['work_type'].unique()

array(['children', 'Private', 'Never_worked', 'Govt_job', 'Self-employed'],
      dtype=object)

In [79]:
dodge['Residence_type'].unique()

array(['Rural', 'Urban'], dtype=object)

In [80]:
# drop_first=True leaves a single 'Urban' indicator column (1 = Urban, 0 = Rural).
dodge['home_town'] = pd.get_dummies(dodge['Residence_type'], drop_first = True)['Urban']

Creating a new DataFrame of dummy variables for work_type.

In [81]:
f150 = pd.get_dummies(dodge['work_type'], drop_first = True)

Concatenating the 2 DataFrames (dodge, f150) column-wise, using the default outer join on the index.

In [82]:
camero = pd.concat([dodge,f150], axis = 1)

In [83]:
camero.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke,home_town,Never_worked,Private,Self-employed,children
0,2,3.0,0,0,0,children,Rural,95.12,18.0,0,0,0,0,0,0,1
1,2,58.0,1,0,1,Private,Urban,87.96,39.2,0,0,1,0,1,0,0
2,1,8.0,0,0,0,Private,Urban,110.89,17.6,0,0,1,0,1,0,0
3,1,70.0,0,0,1,Private,Rural,69.04,35.9,1,0,0,0,1,0,0
4,2,14.0,0,0,0,Never_worked,Rural,161.28,19.1,0,0,0,1,0,0,0


In [84]:
camero.rename(columns = {'Never_worked':'w_t_n_w', 'Private':'w_t_p', 'Self-employed':'w_t_s_e', 'children':'w_t_c'}, inplace = True)

Dropping the columns ['work_type', 'Residence_type'], as we have already created dummy variables for them.

In [85]:
camero.drop(columns = ['work_type','Residence_type'], inplace = True)

In [86]:
camero.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,avg_glucose_level,bmi,smoking_status,stroke,home_town,w_t_n_w,w_t_p,w_t_s_e,w_t_c
0,2,3.0,0,0,0,95.12,18.0,0,0,0,0,0,0,1
1,2,58.0,1,0,1,87.96,39.2,0,0,1,0,1,0,0
2,1,8.0,0,0,0,110.89,17.6,0,0,1,0,1,0,0
3,1,70.0,0,0,1,69.04,35.9,1,0,0,0,1,0,0
4,2,14.0,0,0,0,161.28,19.1,0,0,0,1,0,0,0


In [87]:
camero.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41261 entries, 0 to 43399
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             41261 non-null  int64  
 1   age                41261 non-null  float64
 2   hypertension       41261 non-null  int64  
 3   heart_disease      41261 non-null  int64  
 4   ever_married       41261 non-null  int64  
 5   avg_glucose_level  41261 non-null  float64
 6   bmi                41261 non-null  float64
 7   smoking_status     41261 non-null  int64  
 8   stroke             41261 non-null  int64  
 9   home_town          41261 non-null  uint8  
 10  w_t_n_w            41261 non-null  uint8  
 11  w_t_p              41261 non-null  uint8  
 12  w_t_s_e            41261 non-null  uint8  
 13  w_t_c              41261 non-null  uint8  
dtypes: float64(3), int64(6), uint8(5)
memory usage: 4.6 MB


### **Feature Scaling**
Feature Scaling is the technique of bringing all the values in the dataset onto a comparable scale, so that no feature dominates during training purely because of its magnitude (for example, bmi -> 56 outweighing heart_disease -> 0).

Feature Scaling Tools.
1. Standardisation (values are centered around the mean with unit standard deviation.)
2. Normalisation/min_max scaling.(values range from 0 to 1)
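
The two scaling rules listed above can be written out directly with NumPy. This is a sketch on made-up numbers, not the notebook's code: standardisation computes (x - mean) / std per column using the population standard deviation, and min-max scaling computes (x - min) / (max - min) per column.

```python
import numpy as np

# Toy matrix for illustration: columns stand in for [bmi, heart_disease].
X = np.array([[18.0, 0.0],
              [25.0, 1.0],
              [32.0, 0.0],
              [39.0, 1.0]])

# Standardisation: each column ends up with mean 0 and standard deviation 1.
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalisation: each column ends up in the range [0, 1].
normalized = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(standardized)
print(normalized)
```

After either transformation the two columns live on the same scale, even though the raw bmi values are much larger than the 0/1 flag.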

Here we use StandardScaler(), which is a standardization tool.

In [88]:
se = StandardScaler()
abh = se.fit_transform(camero.drop(columns=['stroke']))
mercury = pd.DataFrame(data = abh, columns = camero.drop(columns = ['stroke']).columns)
mercury.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,avg_glucose_level,bmi,smoking_status,home_town,w_t_n_w,w_t_p,w_t_s_e,w_t_c
0,1.208899,-1.830699,-0.319325,-0.22194,-1.379732,-0.184761,-1.513815,-0.647332,-1.001625,-0.065637,-1.190685,-0.435028,2.675967
1,1.208899,0.686629,3.131608,-0.22194,0.724779,-0.363905,1.539898,-0.647332,0.998378,-0.065637,0.839853,-0.435028,-0.373697
2,-0.825374,-1.601851,-0.319325,-0.22194,-1.379732,0.209805,-1.571433,-0.647332,0.998378,-0.065637,0.839853,-0.435028,-0.373697
3,-0.825374,1.235865,-0.319325,-0.22194,0.724779,-0.837285,1.064556,0.690823,-1.001625,-0.065637,0.839853,-0.435028,-0.373697
4,1.208899,-1.327233,-0.319325,-0.22194,-1.379732,1.470566,-1.355368,-0.647332,-1.001625,15.235255,-1.190685,-0.435028,-0.373697


### **Feature Selection**
Selecting the features that contribute most to our model.

In [89]:
plt.rcParams['figure.figsize'] = (20,12)
corr = mercury.corr()
sns.heatmap(corr, annot=True, cmap='autumn')
plt.show()

Function to list the pairs of features whose correlation exceeds a threshold value.

In [90]:
def correlation(dataset,threshold):
    corr_list = []
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i,j]) > threshold:
                column = corr_matrix.columns[[i,j]]
                corr_list.append(column)
    print(len(corr_list))
    return corr_list
correlation(mercury,0.6)

1


[Index(['ever_married', 'age'], dtype='object')]

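Although ['ever_married', 'age'] are somewhat correlated, a pairwise check cannot tell us whether a feature is explained by a combination of the other features. The Variance Inflation Factor (VIF) is used to check for that kind of multicollinearity.

variance_inflation_factor -> measures how much the variance of a feature's regression coefficient is inflated by its linear dependence on the other features: VIF = 1 / (1 - R²), where R² comes from regressing that feature on all the others. Features with a high VIF can then be dropped one at a time, so that multicollinearity is removed while discarding as few variables as possible.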
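
As a rough illustration of what the VIF measures, the sketch below computes it by hand as 1 / (1 - R²) using NumPy least squares. The function name `vif_by_hand` and the synthetic columns are assumptions made for this example. The columns are centred first, which matches the situation here, where the features have already been standardised.

```python
import numpy as np

def vif_by_hand(X):
    """VIF for each column of X, computed as 1 / (1 - R^2), where R^2 comes
    from regressing that column on all the other columns."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)  # centre the columns so no intercept term is needed
    vifs = []
    for i in range(X.shape[1]):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r_squared = 1.0 - (residual @ residual) / (target @ target)
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic demo data (hypothetical): `c` is nearly a copy of `a`, `b` is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
vifs = vif_by_hand(np.column_stack([a, b, c]))
print(vifs.round(2))
```

In this demo the two nearly identical columns receive a very large VIF, while the independent column stays close to 1.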

In [91]:
vif = variance_inflation_factor 
earth1 = pd.Series([vif(mercury.values, i) for i in range(mercury.shape[1])], index = mercury.columns)
earth1

gender               1.022118
age                  2.637361
hypertension         1.098485
heart_disease        1.096197
ever_married         1.950928
avg_glucose_level    1.081189
bmi                  1.287565
smoking_status       1.069008
home_town            1.000213
w_t_n_w              1.051573
w_t_p                2.336830
w_t_s_e              1.949642
w_t_c                2.712860
dtype: float64

Function to check for multicollinearity between the independent variables and drop the feature with the highest VIF when that VIF exceeds 6.

In [92]:
def mc(data):
  # Compute the VIF of every column in the current frame.
  earth = pd.Series([vif(data.values, i) for i in range(data.shape[1])], index = data.columns)
  if earth.max() > 6:
    # Drop only the single column with the largest VIF.
    worst = earth.idxmax()
    print(worst, 'Has Been Removed.')
    data = data.drop(columns = worst)
  else:
    print("MultiCollinearity Has Been Removed.")
  # Return the frame in both branches so repeated calls never receive None.
  return data

In [93]:
for i in range(5):
  mercury = mc(mercury)
mercury.head()

MultiCollinearity Has Been Removed.
MultiCollinearity Has Been Removed.
MultiCollinearity Has Been Removed.
MultiCollinearity Has Been Removed.
MultiCollinearity Has Been Removed.


Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,avg_glucose_level,bmi,smoking_status,home_town,w_t_n_w,w_t_p,w_t_s_e,w_t_c
0,1.208899,-1.830699,-0.319325,-0.22194,-1.379732,-0.184761,-1.513815,-0.647332,-1.001625,-0.065637,-1.190685,-0.435028,2.675967
1,1.208899,0.686629,3.131608,-0.22194,0.724779,-0.363905,1.539898,-0.647332,0.998378,-0.065637,0.839853,-0.435028,-0.373697
2,-0.825374,-1.601851,-0.319325,-0.22194,-1.379732,0.209805,-1.571433,-0.647332,0.998378,-0.065637,0.839853,-0.435028,-0.373697
3,-0.825374,1.235865,-0.319325,-0.22194,0.724779,-0.837285,1.064556,0.690823,-1.001625,-0.065637,0.839853,-0.435028,-0.373697
4,1.208899,-1.327233,-0.319325,-0.22194,-1.379732,1.470566,-1.355368,-0.647332,-1.001625,15.235255,-1.190685,-0.435028,-0.373697


### **Splitting Data**
Split the dataset into:

1. target_var (the 'stroke' column)
2. inde_vars (all the remaining, independent columns)

In [94]:
target_var = camero['stroke']
inde_vars = camero.drop(columns=['stroke'], axis = 1)

### **Handling Imbalanced Dataset.**

In [95]:
camero.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,avg_glucose_level,bmi,smoking_status,stroke,home_town,w_t_n_w,w_t_p,w_t_s_e,w_t_c
0,2,3.0,0,0,0,95.12,18.0,0,0,0,0,0,0,1
1,2,58.0,1,0,1,87.96,39.2,0,0,1,0,1,0,0
2,1,8.0,0,0,0,110.89,17.6,0,0,1,0,1,0,0
3,1,70.0,0,0,1,69.04,35.9,1,0,0,0,1,0,0
4,2,14.0,0,0,0,161.28,19.1,0,0,0,1,0,0,0


In [96]:
## Using SMOTE

In [97]:
so = SMOTE()
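# fit_resample replaces the older fit_sample, which newer imbalanced-learn releases no longer provide
x_resample,y_resample = so.fit_resample(inde_vars, target_var.values.ravel())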
brad = pd.DataFrame(data=x_resample, columns = inde_vars.columns)

In [98]:
#Before resampling
print("Before Resampling Target_Variable: ")
print(target_var.value_counts())

# After resampling
y_resample = pd.DataFrame(y_resample)
print("After Resampling Target_Variable:")
print(y_resample[0].value_counts())

Before Resampling Target_Variable: 
0    40517
1      744
Name: stroke, dtype: int64
After Resampling Target_Variable:
1    40517
0    40517
Name: 0, dtype: int64


### **Train Test Split.**

In [99]:
# Split the SMOTE-resampled data
x_train,x_test,y_train,y_test = train_test_split(x_resample, y_resample, test_size = 0.3, random_state = 0)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(56723, 13)
(56723, 1)
(24311, 13)
(24311, 1)
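
In the cells above, the data is resampled before train_test_split, so synthetic minority rows interpolated from a real row can land in the test partition, which may inflate the test scores. One common alternative is to split first and oversample only the training partition. The sketch below shows that ordering on hypothetical random data. It uses plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE, so it illustrates the order of operations rather than the notebook's sampler.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: roughly 5% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.05).astype(int)

# 1. Split first, stratifying so both partitions keep the original class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# 2. Oversample the minority class in the training partition only.
minority = X_tr[y_tr == 1]
majority = X_tr[y_tr == 0]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)

X_bal = np.vstack([majority, minority_up])
y_bal = np.concatenate([
    np.zeros(len(majority), dtype=int),
    np.ones(len(minority_up), dtype=int),
])

print(np.bincount(y_bal))  # training classes are now balanced
print(np.bincount(y_te))   # the test set keeps the original imbalance
```

With this ordering the test metrics describe performance on the real class distribution, which is usually what matters for a screening model.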


### **Feature Scaling Balanced Data.**



In [100]:
x_train_ss = se.fit_transform(x_train)
x_test_ss = se.transform(x_test)
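
The cell above fits the scaler on the training split and only transforms the test split. The sketch below, on hypothetical random data, shows what that means in practice: `transform` reuses the mean and scale learned from the training rows instead of recomputing them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data standing in for a column such as avg_glucose_level.
rng = np.random.default_rng(0)
train = rng.normal(loc=100.0, scale=40.0, size=(500, 1))
test = rng.normal(loc=100.0, scale=40.0, size=(200, 1))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from the training rows
test_scaled = scaler.transform(test)        # reuses those statistics unchanged

# The test rows are scaled with the training statistics, not their own.
manual = (test - scaler.mean_) / scaler.scale_
print(np.allclose(test_scaled, manual))
```

Because the test rows use the training statistics, their scaled mean is close to, but generally not exactly, zero.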

## **Building Predictive Models.**
1. Decision Tree
2. Random Forest
3. Logistic Regression
4. Naive Bayes
5. XG Boost 

In [101]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(x_train_ss,y_train)
predictions = model.predict(x_test_ss)

print('The Training Accuracy of x_train and y_train is', model.score(x_train_ss,y_train))
print("The Testing Accuracy of x_test and y_test is", model.score(x_test_ss,y_test))

The Training Accuracy of x_train and y_train is 1.0
The Testing Accuracy of x_test and y_test is 0.9518736374480687


In [102]:
# scikit-learn expects (y_true, y_pred): rows are the true classes, columns the predictions
print(confusion_matrix(y_test, predictions))

[[11242   748]
 [  422 11899]]


In [103]:
# scikit-learn expects (y_true, y_pred), so that support counts the true labels
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.96      0.94      0.95     11990
           1       0.94      0.97      0.95     12321

    accuracy                           0.95     24311
   macro avg       0.95      0.95      0.95     24311
weighted avg       0.95      0.95      0.95     24311



In [104]:
print(accuracy_score(predictions, y_test))

0.9518736374480687


In [105]:
x = pd.DataFrame(data = x_train_ss, columns = inde_vars.columns)
y = y_train
from sklearn.model_selection import  StratifiedKFold
accuracy = []
skf = StratifiedKFold(n_splits = 10, random_state = None)
skf.get_n_splits(x,y)
for train_index, test_index in skf.split(x,y):
  print('Train:', train_index, 'Validation',test_index)
  x1_train,x1_test = x.iloc[train_index],x.iloc[test_index]
  y1_train,y1_test = y.iloc[train_index],y.iloc[test_index]
  model.fit(x1_train,y1_train)
  pred = model.predict(x1_test)
  score = accuracy_score(pred,y1_test)
  accuracy.append(score)
print(accuracy)

Train: [ 5646  5651  5653 ... 56720 56721 56722] Validation [   0    1    2 ... 5689 5691 5693]
Train: [    0     1     2 ... 56720 56721 56722] Validation [ 5646  5651  5653 ... 11348 11352 11353]
Train: [    0     1     2 ... 56720 56721 56722] Validation [11338 11343 11344 ... 17100 17101 17102]
Train: [    0     1     2 ... 56720 56721 56722] Validation [16944 16945 16948 ... 22755 22758 22759]
Train: [    0     1     2 ... 56720 56721 56722] Validation [22614 22616 22617 ... 28391 28392 28394]
Train: [    0     1     2 ... 56720 56721 56722] Validation [28336 28337 28341 ... 34094 34095 34097]
Train: [    0     1     2 ... 56720 56721 56722] Validation [33968 33969 33970 ... 39745 39746 39749]
Train: [    0     1     2 ... 56720 56721 56722] Validation [39661 39664 39668 ... 45394 45395 45397]
Train: [    0     1     2 ... 56720 56721 56722] Validation [45362 45364 45365 ... 51056 51058 51063]
Train: [    0     1     2 ... 51056 51058 51063] Validation [51043 51044 51045 ... 56720

In [106]:
arr = np.array(accuracy)

In [107]:
np.mean(arr)

0.9482397411388964
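
The manual fold loop above can also be expressed with scikit-learn's `cross_val_score`, which fits a fresh clone of the estimator on each fold. The sketch below runs on synthetic data from `make_classification`, so the scores it prints are not comparable to the notebook's. The names `X_demo` and `y_demo` are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 13 features, like the scaled training frame.
X_demo, y_demo = make_classification(n_samples=1000, n_features=13, random_state=0)

# Shuffling with a fixed seed keeps the folds reproducible.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=cv, scoring="accuracy"
)

print(scores.mean(), scores.std())
```

Reporting the standard deviation next to the mean gives a sense of how stable the accuracy is across folds.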

In [108]:
## Hyperparameter tuning to reduce overfitting.

In [109]:
def cal_score(model, x1,y1,x2,y2):
  model.fit(x1,y1)
  p = model.predict(x1)
  f1 = f1_score(y1, p)
  p1 = model.predict(x2)
  f2 = f1_score(y2,p1)
  return f1,f2

In [110]:
def effect(train, test, x_axis, title):
  # Plot train and test scores against the hyperparameter named in `title`.
  plt.figure(figsize = (12,10), dpi = 100)
  plt.plot(x_axis, train, color = 'red', label = 'train_score')
  plt.plot(x_axis, test, color = 'blue', label = 'test_score')
  plt.xlabel(title)
  plt.title(title)
  plt.legend()
  plt.show()

In [111]:
max_depth = [i for i in range(1,50)]
train = []
test = []
for i in max_depth:
  model =DecisionTreeClassifier(max_depth=i, random_state=50)
  f1,f2 = cal_score(model, x_train, y_train, x_test, y_test)
  train.append(f1)
  test.append(f2)
effect(train,test, range(1,50), 'Max_Depth')

In [112]:
min_samples = [i for i in range(2,5000,25)]
train = []
test = []
for i in min_samples:
  model =DecisionTreeClassifier(max_depth=20, random_state=50, min_samples_split=i)
  f1,f2 = cal_score(model, x_train, y_train, x_test, y_test)
  train.append(f1)
  test.append(f2)
effect(train,test, range(2,5000,25), 'Min_Samples_Split')

In [113]:
max_leaf = [i for i in range(2,200,10)]
train = []
test = []
for i in max_leaf:
  model =DecisionTreeClassifier(max_depth=20,min_samples_split=4250,max_leaf_nodes=i, random_state=50)
  f1,f2 = cal_score(model, x_train, y_train, x_test, y_test)
  train.append(f1)
  test.append(f2)
effect(train,test, range(2,200,10), 'Max_Leaf_Nodes')

In [114]:
def cal_score1(model, x1,y1,x2,y2):
  model.fit(x1,y1)
  p = model.predict(x1)
  false_positive_rate, true_positive_rate, thresholds = roc_curve(y1, p)
  roc_auc_1 = auc(false_positive_rate, true_positive_rate)
  p1 = model.predict(x2)
  false_positive_rate, true_positive_rate, thresholds = roc_curve(y2, p1)
  roc_auc_2 = auc(false_positive_rate, true_positive_rate)
  return roc_auc_1,roc_auc_2
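
cal_score1 passes hard class predictions to roc_curve, which yields a curve with a single operating point. AUC is more commonly computed from predicted probabilities, so that the full ranking of the samples is used. The sketch below compares the two on synthetic data. The dataset and the variable names are assumptions for illustration, and the two numbers will usually differ.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, not the stroke dataset).
X_demo, y_demo = make_classification(n_samples=2000, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

# AUC from hard 0/1 labels: only one threshold is represented.
auc_from_labels = roc_auc_score(y_te, tree.predict(X_te))
# AUC from class-1 probabilities: every distinct score acts as a threshold.
auc_from_scores = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])

print(auc_from_labels, auc_from_scores)
```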

In [115]:
def effect1(train, test, x_axis, title):
  # Plot train and test ROC AUC against the hyperparameter named in `title`.
  plt.figure(figsize = (12,10), dpi = 100)
  plt.plot(x_axis, train, color = 'red', label = 'train_score')
  plt.plot(x_axis, test, color = 'blue', label = 'test_score')
  plt.xlabel(title)
  plt.title(title)
  plt.legend()
  plt.show()

In [116]:
max_depth = [i for i in range(1,100)]
train = []
test = []
for i in max_depth:
  model =DecisionTreeClassifier(max_depth=i, random_state=50)
  roc_auc_1,roc_auc_2 = cal_score1(model, x_train, y_train, x_test, y_test)
  train.append(roc_auc_1)
  test.append(roc_auc_2)
effect1(train,test, range(1,100), 'Max_Depth')

In [117]:
min_sample_leaff = [i for i in range(25,4000,25)]
train = []
test = []
for i in min_sample_leaff:
  model =DecisionTreeClassifier(max_depth=20, min_samples_leaf=i, random_state=50)
  roc_auc_1,roc_auc_2 = cal_score1(model, x_train, y_train, x_test, y_test)
  train.append(roc_auc_1)
  test.append(roc_auc_2)
effect1(train,test, range(25,4000,25), 'Min_Samples_Leaf')

In [118]:
max_leaf_node = [i for i in range(2,200,10)]
train = []
test = []
for i in max_leaf_node:
  model =DecisionTreeClassifier(max_depth=20,max_leaf_nodes=i, min_samples_leaf=3700, random_state=50)
  roc_auc_1,roc_auc_2 = cal_score1(model, x_train, y_train, x_test, y_test)
  train.append(roc_auc_1)
  test.append(roc_auc_2)
effect1(train,test, range(2,200,10), 'Max_Leaf_Nodes')

In [119]:
modified_model = DecisionTreeClassifier(max_depth = 18, min_samples_split=4250, min_samples_leaf=3700, max_leaf_nodes=21)
modified_model.fit(x_train_ss, y_train)
pr = modified_model.predict(x_test_ss)

In [120]:
print(modified_model.score(x_train_ss,y_train))
print(modified_model.score(x_test_ss, y_test))
print(accuracy_score(pr,y_test))

0.7843731819544101
0.7823207601497265
0.7823207601497265


In [121]:
## Logistic Regression

In [122]:
from sklearn.linear_model import LogisticRegression
lg = LogisticRegression()
model1 = lg.fit(x_train_ss, y_train)
y_predict = model1.predict(x_test_ss)
predicted_values = model1.predict_proba(x_test_ss)

In [123]:
recall_score(y_test, y_predict)

0.8233909585260937

In [124]:
precision_score(y_test,y_predict)

0.8149903598971723

In [125]:
f1_score(y_test,y_predict)

0.8191691226936898

In [126]:
y_testt = y_test.squeeze()

In [127]:
precision_points, recall_points, threshold_points = precision_recall_curve(y_testt, predicted_values[:,1]) 

In [128]:
precision_points.shape, recall_points.shape, threshold_points.shape

((22630,), (22630,), (22629,))

In [129]:
precision_points

array([0.54447832, 0.54445819, 0.54448226, ..., 1.        , 1.        ,
       1.        ])

In [130]:
recall_points

array([1.00000000e+00, 9.99918838e-01, 9.99918838e-01, ...,
       1.62324487e-04, 8.11622433e-05, 0.00000000e+00])

In [131]:
threshold_points

array([0.00919393, 0.00922398, 0.00922728, ..., 0.99547383, 0.99551362,
       0.99560247])

In [132]:
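plt.figure(figsize = (12,10), dpi = 100)
# precision_recall_curve returns one more precision/recall point than thresholds, hence [:-1]
plt.plot(threshold_points, recall_points[:-1], color = 'red', label = 'recall')
plt.plot(threshold_points, precision_points[:-1], color = 'blue', label = 'precision')
plt.xlabel('threshold')
plt.legend()
plt.show()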

In [133]:
# roc_curve returns (fpr, tpr, thresholds), in that order
fpr, tpr, threshold = roc_curve(y_testt, predicted_values[:,1])
fpr.shape, tpr.shape, threshold.shape

((5894,), (5894,), (5894,))

In [134]:
plt.figure(figsize = (12,10), dpi = 100)
plt.plot(fpr, tpr, color = 'red')
plt.plot([0,1],[0,1], color = 'blue')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("roc_curve")
plt.show()

print(roc_auc_score(y_test, predicted_values[:,1]))

0.9053927809196839


In [135]:
print("Training Accuracy ", model1.score(x_train_ss,y_train))
print("Testing Accuracy ", model1.score(x_test_ss,y_test))

Training Accuracy  0.8209368333832837
Testing Accuracy  0.8157624120768376


In [136]:
# scikit-learn expects (y_true, y_pred), so that support counts the true labels
print(classification_report(y_test, y_predict))

              precision    recall  f1-score   support

           0       0.82      0.81      0.81     11990
           1       0.81      0.82      0.82     12321

    accuracy                           0.82     24311
   macro avg       0.82      0.82      0.82     24311
weighted avg       0.82      0.82      0.82     24311



In [137]:
# scikit-learn expects (y_true, y_pred): rows are the true classes, columns the predictions
print(confusion_matrix(y_test, y_predict))

[[ 9687  2303]
 [ 2176 10145]]


In [138]:
print(accuracy_score(y_predict,y_test))

0.8157624120768376


In [139]:
x = pd.DataFrame(data = x_train_ss, columns = inde_vars.columns)
y = y_train
from sklearn.model_selection import  StratifiedKFold
accuracy1 = []
skf = StratifiedKFold(n_splits = 10, random_state = None)
skf.get_n_splits(x,y)
for train_index, test_index in skf.split(x,y):
  print('Train:', train_index, 'Validation',test_index)
  x1_train,x1_test = x.iloc[train_index],x.iloc[test_index]
  y1_train,y1_test = y.iloc[train_index],y.iloc[test_index]
  model1.fit(x1_train,y1_train)
  pred = model1.predict(x1_test)
  score = accuracy_score(pred,y1_test)
  accuracy1.append(score)
print(accuracy1)

Train: [ 5646  5651  5653 ... 56720 56721 56722] Validation [   0    1    2 ... 5689 5691 5693]
Train: [    0     1     2 ... 56720 56721 56722] Validation [ 5646  5651  5653 ... 11348 11352 11353]
Train: [    0     1     2 ... 56720 56721 56722] Validation [11338 11343 11344 ... 17100 17101 17102]
Train: [    0     1     2 ... 56720 56721 56722] Validation [16944 16945 16948 ... 22755 22758 22759]
Train: [    0     1     2 ... 56720 56721 56722] Validation [22614 22616 22617 ... 28391 28392 28394]
Train: [    0     1     2 ... 56720 56721 56722] Validation [28336 28337 28341 ... 34094 34095 34097]
Train: [    0     1     2 ... 56720 56721 56722] Validation [33968 33969 33970 ... 39745 39746 39749]
Train: [    0     1     2 ... 56720 56721 56722] Validation [39661 39664 39668 ... 45394 45395 45397]
Train: [    0     1     2 ... 56720 56721 56722] Validation [45362 45364 45365 ... 51056 51058 51063]
Train: [    0     1     2 ... 51056 51058 51063] Validation [51043 51044 51045 ... 56720

#### Hyperparameter Tuning of LogisticRegression() Using GridSearchCV()
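
The notebook does not include a cell for this step. As a hedged illustration only, the sketch below shows how a grid search over the regularisation strength `C` could be set up. The grid values, the scoring choice, the fold count and the synthetic data are all assumptions and not the author's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (hypothetical, not the stroke dataset).
X_demo, y_demo = make_classification(n_samples=1000, n_features=13, random_state=0)

# Hypothetical grid over the inverse regularisation strength.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

After fitting, `search.best_estimator_` holds a model refit on all the supplied data with the best parameters, which could then be scored on the held-out test split.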

In [140]:
## Random Forest

In [141]:
from sklearn.ensemble import RandomForestClassifier
re = RandomForestClassifier()
model3 = re.fit(x_train_ss, y_train)
re_pred = model3.predict(x_test_ss)

In [142]:
print("Training Accuracy is", model3.score(x_train_ss, y_train))
print('Testing Accuracy is', model3.score(x_test_ss, y_test))
print(accuracy_score(y_test, re_pred))

Training Accuracy is 1.0
Testing Accuracy is 0.970054707745465
0.970054707745465


In [143]:
print(confusion_matrix(y_test, re_pred))

[[11526   464]
 [  264 12057]]


In [144]:
print(classification_report(y_test, re_pred))

              precision    recall  f1-score   support

           0       0.98      0.96      0.97     11990
           1       0.96      0.98      0.97     12321

    accuracy                           0.97     24311
   macro avg       0.97      0.97      0.97     24311
weighted avg       0.97      0.97      0.97     24311



In [145]:
recall_score(y_test, re_pred)

0.9785731677623569

In [146]:
precision_score(y_test, re_pred)

0.9629422570082262

In [147]:
f1_score(y_test, re_pred)

0.9706947910796233

In [148]:
x = pd.DataFrame(data = x_train_ss, columns = inde_vars.columns)
y = y_train
from sklearn.model_selection import  StratifiedKFold
accuracy2 = []
skf = StratifiedKFold(n_splits = 10, random_state = None)
skf.get_n_splits(x,y)
for train_index, test_index in skf.split(x,y):
  print('Train:', train_index, 'Validation',test_index)
  x1_train,x1_test = x.iloc[train_index],x.iloc[test_index]
  y1_train,y1_test = y.iloc[train_index],y.iloc[test_index]
  model3.fit(x1_train,y1_train)
  pred = model3.predict(x1_test)
  score = accuracy_score(pred,y1_test)
  accuracy2.append(score)
print(accuracy2)

Train: [ 5646  5651  5653 ... 56720 56721 56722] Validation [   0    1    2 ... 5689 5691 5693]
Train: [    0     1     2 ... 56720 56721 56722] Validation [ 5646  5651  5653 ... 11348 11352 11353]
Train: [    0     1     2 ... 56720 56721 56722] Validation [11338 11343 11344 ... 17100 17101 17102]
Train: [    0     1     2 ... 56720 56721 56722] Validation [16944 16945 16948 ... 22755 22758 22759]
Train: [    0     1     2 ... 56720 56721 56722] Validation [22614 22616 22617 ... 28391 28392 28394]
Train: [    0     1     2 ... 56720 56721 56722] Validation [28336 28337 28341 ... 34094 34095 34097]
Train: [    0     1     2 ... 56720 56721 56722] Validation [33968 33969 33970 ... 39745 39746 39749]
Train: [    0     1     2 ... 56720 56721 56722] Validation [39661 39664 39668 ... 45394 45395 45397]
Train: [    0     1     2 ... 56720 56721 56722] Validation [45362 45364 45365 ... 51056 51058 51063]
Train: [    0     1     2 ... 51056 51058 51063] Validation [51043 51044 51045 ... 56720

### **Testing the Model**

In [149]:
ford = pd.read_csv("healthcare-dataset-stroke-data.csv")

In [150]:
ford.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [151]:
ford.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [152]:
ford.drop(index = ford[(ford.age > 1.0) & (ford.age < 2.0)].index, axis = 0, inplace = True)

In [153]:
ford.shape

(5038, 12)

In [154]:
anamolies = []
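def outliers(data):
  # Three-sigma limits; in this version they are only printed for reference.
  random_state_mean = np.mean(data)
  random_state_std = np.std(data)
  anamoly = random_state_std * 3

  upper_limit = random_state_mean + anamoly
  lower_limit = random_state_mean - anamoly
  # Observed range of the column. Values are compared against this range,
  # not the three-sigma limits, so no glucose reading can be flagged.
  uu =  max(ford['avg_glucose_level'])
  ll = min(ford['avg_glucose_level'])

  print(upper_limit)
  print(lower_limit)
  for i in data:
    if i < ll or i > uu:
      anamolies.append(i)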

In [155]:
outliers(ford['avg_glucose_level'])
print(len(anamolies))

242.64307272917313
-30.033838906227473
0


In [156]:
dodge['avg_glucose_level'].describe()

count    41261.000000
mean       102.504529
std         39.968402
min         55.000000
25%         77.370000
50%         91.170000
75%        110.770000
max        234.380000
Name: avg_glucose_level, dtype: float64

In [157]:
anamolies = []
def outliers(data):
  random_state_mean = np.mean(data)
  random_state_std = np.std(data)
  anamoly = random_state_std * 3

  upper_limit = random_state_mean + anamoly
  lower_limit = random_state_mean - anamoly
  ll = min(ford['bmi'])

  print(upper_limit)
  print(lower_limit)
  for i in data:
    if i < ll or i > upper_limit:
      anamolies.append(i)

In [158]:
outliers(ford['bmi'])
print(len(anamolies))

52.45615973942819
5.616289661645759
58
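
The two outliers() versions above rely on the global `ford` frame and a shared list. A reusable variant can take any column and return the flagged values together with the limits. The sketch below is one way to write it. The function name `three_sigma_outliers` and the demo series are assumptions for illustration. Unlike the bmi cell above, it applies the three-sigma limit on both sides.

```python
import numpy as np
import pandas as pd

def three_sigma_outliers(series, n_sigma=3.0):
    """Return (flagged values, lower limit, upper limit) using mean +/- n_sigma * std."""
    values = series.dropna()  # ignore missing entries such as unrecorded bmi
    mean, std = values.mean(), values.std(ddof=0)
    lower, upper = mean - n_sigma * std, mean + n_sigma * std
    mask = (values < lower) | (values > upper)
    return values[mask], lower, upper

# Hypothetical bmi-like data with one extreme value and one missing value.
rng = np.random.default_rng(0)
demo = pd.Series(np.append(rng.normal(28.0, 5.0, 500), [75.0, np.nan]))

flagged, lower, upper = three_sigma_outliers(demo)
print(lower, upper)
print(len(flagged))
```

The index of `flagged` can be passed directly to `DataFrame.drop(index=...)`, which avoids hard-coding the printed limit in a later cell.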


In [159]:
ford[ford['bmi'] > 52.45615973942819]

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
113,41069,Female,45.0,0,0,Yes,Private,Rural,224.1,56.6,never smoked,1
258,28674,Female,74.0,1,0,Yes,Self-employed,Urban,205.84,54.6,never smoked,0
270,72911,Female,57.0,1,0,Yes,Private,Rural,129.54,60.9,smokes,0
333,1703,Female,52.0,0,0,Yes,Private,Urban,82.24,54.7,formerly smoked,0
358,66333,Male,52.0,0,0,Yes,Self-employed,Urban,78.4,64.8,never smoked,0
430,53144,Female,52.0,0,1,Yes,Private,Urban,72.79,54.7,never smoked,0
466,1307,Female,61.0,1,0,Yes,Private,Rural,170.05,60.2,smokes,0
544,545,Male,42.0,0,0,Yes,Private,Rural,210.48,71.9,never smoked,0
637,3130,Female,56.0,0,0,Yes,Private,Rural,112.43,54.6,never smoked,0
662,23551,Male,28.0,0,0,Yes,Private,Urban,87.43,55.7,Unknown,0


In [160]:
ford.drop(index = ford[ford['bmi'] > 52.45615973942819].index, axis = 0, inplace = True)

In [161]:
ford.shape

(4980, 12)

In [162]:
ford.isnull().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  196
smoking_status         0
stroke                 0
dtype: int64

In [163]:
ford['bmi'].mean()

28.681291806020063

In [164]:
# Assign the result back instead of using inplace on a column selection
ford['bmi'] = ford['bmi'].fillna(ford['bmi'].mean())

In [165]:
ford['bmi'].isnull().sum()

0

In [166]:
ford['smoking_status'].replace('Unknown', 'never smoked')

0       formerly smoked
1          never smoked
2          never smoked
3                smokes
4          never smoked
             ...       
5105       never smoked
5106       never smoked
5107       never smoked
5108    formerly smoked
5109       never smoked
Name: smoking_status, Length: 4980, dtype: object

In [167]:
ford.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4980 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 4980 non-null   int64  
 1   gender             4980 non-null   object 
 2   age                4980 non-null   float64
 3   hypertension       4980 non-null   int64  
 4   heart_disease      4980 non-null   int64  
 5   ever_married       4980 non-null   object 
 6   work_type          4980 non-null   object 
 7   Residence_type     4980 non-null   object 
 8   avg_glucose_level  4980 non-null   float64
 9   bmi                4980 non-null   float64
 10  smoking_status     4980 non-null   object 
 11  stroke             4980 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 505.8+ KB


In [168]:
ford.drop(columns = ['id'], axis=1, inplace = True)

In [169]:
# Assign the result back instead of using inplace on a column selection
ford['smoking_status'] = ford['smoking_status'].replace({'Unknown':'never smoked'})

In [170]:
ford['gender'] = ford['gender'].map(mapping)

In [171]:
ford['ever_married'] = ford['ever_married'].map(mapping1)

In [172]:
ford['smoking_status'] = ford['smoking_status'].map(mapping2)

In [173]:
ford[['gender', 'smoking_status', 'ever_married']].head()

Unnamed: 0,gender,smoking_status,ever_married
0,2,1,1
1,1,0,1
2,2,0,1
3,1,2,1
4,1,0,1


In [174]:
ford['work_type'].unique()

array(['Private', 'Self-employed', 'Govt_job', 'children', 'Never_worked'],
      dtype=object)

In [175]:
ford['Residence_type'].unique()

array(['Urban', 'Rural'], dtype=object)

In [176]:
ford['home_town'] = pd.get_dummies(ford['Residence_type'], drop_first = True)

In [177]:
rap = pd.get_dummies(ford['work_type'], drop_first = True)

In [178]:
cam = pd.concat([ford,rap], axis = 1)

In [179]:
cam.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke,home_town,Never_worked,Private,Self-employed,children
0,2,67.0,0,1,1,Private,Urban,228.69,36.6,1,1,1,0,1,0,0
1,1,61.0,0,0,1,Self-employed,Rural,202.21,28.681292,0,1,0,0,0,1,0
2,2,80.0,0,1,1,Private,Rural,105.92,32.5,0,1,0,0,1,0,0
3,1,49.0,0,0,1,Private,Urban,171.23,34.4,2,1,1,0,1,0,0
4,1,79.0,1,0,1,Self-employed,Rural,174.12,24.0,0,1,0,0,0,1,0


In [180]:
cam.rename(columns = {'Never_worked':'w_t_n_w', 'Private':'w_t_p', 'Self-employed':'w_t_s_e', 'children':'w_t_c'}, inplace = True)

In [181]:
cam.drop(columns = ['work_type','Residence_type'], inplace = True)

In [182]:
cam.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,avg_glucose_level,bmi,smoking_status,stroke,home_town,w_t_n_w,w_t_p,w_t_s_e,w_t_c
0,2,67.0,0,1,1,228.69,36.6,1,1,1,0,1,0,0
1,1,61.0,0,0,1,202.21,28.681292,0,1,0,0,0,1,0
2,2,80.0,0,1,1,105.92,32.5,0,1,0,0,1,0,0
3,1,49.0,0,0,1,171.23,34.4,2,1,1,0,1,0,0
4,1,79.0,1,0,1,174.12,24.0,0,1,0,0,0,1,0


In [183]:
target = cam['stroke']
original = cam.drop(columns = ['stroke'])

In [184]:
resampled_x,resampled_y = so.fit_resample(original,target.values.ravel())
pitt = pd.DataFrame(data = resampled_x, columns=original.columns)

In [185]:
#Before resampling
print("Before Resampling Target_Variable: ")
print(target.value_counts())

# After resampling
resampled_y = pd.DataFrame(resampled_y)
print("After Resampling Target_Variable:")
print(resampled_y[0].value_counts())

Before Resampling Target_Variable: 
0    4733
1     247
Name: stroke, dtype: int64
After Resampling Target_Variable:
1    4733
0    4733
Name: 0, dtype: int64


In [186]:
fish = se.fit_transform(resampled_x)
lucas = pd.DataFrame(data = fish, columns = original.columns)

In [187]:
lucas.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,avg_glucose_level,bmi,smoking_status,home_town,w_t_n_w,w_t_p,w_t_s_e,w_t_c
0,1.446183,0.524426,-0.30611,4.360354,0.609823,2.003666,1.225259,0.875899,1.200133,-0.048265,1.035929,-0.373158,-0.263598
1,-0.690808,0.248866,-0.30611,-0.229339,0.609823,1.52144,-0.08396,-0.625597,-0.833241,-0.048265,-0.965317,2.679833,-0.263598
2,1.446183,1.121474,-0.30611,4.360354,0.609823,-0.232094,0.547396,-0.625597,-0.833241,-0.048265,1.035929,-0.373158,-0.263598
3,-0.690808,-0.302255,-0.30611,-0.229339,0.609823,0.957264,0.861528,2.377394,1.200133,-0.048265,1.035929,-0.373158,-0.263598
4,-0.690808,1.075548,3.266804,-0.229339,0.609823,1.009894,-0.857928,-0.625597,-0.833241,-0.048265,-0.965317,2.679833,-0.263598


In [188]:
lucas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9466 entries, 0 to 9465
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             9466 non-null   float64
 1   age                9466 non-null   float64
 2   hypertension       9466 non-null   float64
 3   heart_disease      9466 non-null   float64
 4   ever_married       9466 non-null   float64
 5   avg_glucose_level  9466 non-null   float64
 6   bmi                9466 non-null   float64
 7   smoking_status     9466 non-null   float64
 8   home_town          9466 non-null   float64
 9   w_t_n_w            9466 non-null   float64
 10  w_t_p              9466 non-null   float64
 11  w_t_s_e            9466 non-null   float64
 12  w_t_c              9466 non-null   float64
dtypes: float64(13)
memory usage: 961.5 KB


#### Testing the Tuned Decision Tree Model on the Test Data

In [189]:
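# Renamed from `hash` so the Python built-in is not shadowed
tree_test_pred = modified_model.predict(lucas)

In [190]:
print(accuracy_score(resampled_y, tree_test_pred))

0.7800549334460173


In [191]:
# scikit-learn expects (y_true, y_pred), so that support counts the true labels
print(classification_report(resampled_y, tree_test_pred))

              precision    recall  f1-score   support

           0       0.80      0.75      0.77      4733
           1       0.76      0.81      0.79      4733

    accuracy                           0.78      9466
   macro avg       0.78      0.78      0.78      9466
weighted avg       0.78      0.78      0.78      9466



In [192]:
# Rows are the true classes, columns the predictions
print(confusion_matrix(resampled_y, tree_test_pred))

[[3541 1192]
 [ 890 3843]]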


#### Testing the Tuned Logistic Regression Model on the Test Data

In [193]:
pip install jupyterlab_legos_ui

