# Brain Stroke Prediction - Logistic Regression

## Background

A stroke is a medical condition in which poor blood flow to the brain causes cell death. There are two main types of stroke: ischemic, due to lack of blood flow, and hemorrhagic, due to bleeding. Both cause parts of the brain to stop functioning properly. Signs and symptoms of a stroke may include an inability to move or feel on one side of the body, problems understanding or speaking, dizziness, or loss of vision to one side. Signs and symptoms often appear soon after the stroke has occurred. If symptoms last less than one or two hours, the stroke is a transient ischemic attack (TIA), also called a mini-stroke. A hemorrhagic stroke may also be associated with a severe headache. The symptoms of a stroke can be permanent. Long-term complications may include pneumonia and loss of bladder control.

The **main risk factor** for stroke is **high blood pressure**. ***Other risk factors*** include ***high blood cholesterol, tobacco smoking, obesity, diabetes mellitus, a previous TIA, end-stage kidney disease, and atrial fibrillation***. 

**Prevention** includes ***decreasing risk factors***, surgery to open up the arteries to the brain in those with problematic carotid narrowing, and warfarin in people with atrial fibrillation. Aspirin or statins may be recommended by physicians for prevention.

### We want to analyze whether our patients are at risk for stroke given the identified factors.

### Attribute Information

    1) gender: "Male", "Female" or "Other"
    2) age: age of the patient
    3) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
    4) heart_disease: 0 if the patient doesn't have any heart disease, 1 if the patient has a heart disease
    5) ever_married: "No" or "Yes"
    6) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
    7) Residence_type: "Rural" or "Urban"
    8) avg_glucose_level: average glucose level in blood
    9) bmi: body mass index
    10) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
    11) stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient

### Initial Sample Size
4,981

### Filtered Sample Size (removed patients under 18)
4,158 (about 17% decrease)

In [100]:
# Import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


In [101]:
# Read data into dataframe

stroke_data = pd.read_csv('./Resources/full_data.csv')
stroke_data.tail(20)

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
4961,Male,59.0,1,0,Yes,Govt_job,Rural,253.93,32.1,formerly smoked,0
4962,Male,3.0,0,0,No,children,Rural,194.75,20.1,Unknown,0
4963,Female,20.0,0,0,No,Govt_job,Rural,79.53,26.5,never smoked,0
4964,Female,78.0,0,0,Yes,Govt_job,Urban,101.76,27.3,smokes,0
4965,Male,52.0,1,0,Yes,Govt_job,Rural,116.62,31.7,smokes,0
...,...,...,...,...,...,...,...,...,...,...,...
4976,Male,41.0,0,0,No,Private,Rural,70.15,29.8,formerly smoked,0
4977,Male,40.0,0,0,Yes,Private,Urban,191.15,31.1,smokes,0
4978,Female,45.0,1,0,Yes,Govt_job,Rural,95.02,31.8,smokes,0
4979,Male,40.0,0,0,Yes,Private,Rural,83.94,30.0,smokes,0


In [102]:
# Review summary statistics - with this many rows, 'age', 'bmi', and 'avg_glucose_level' are likely better viewed in bins
stroke_data.describe()

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
count,4981.0,4981.0,4981.0,4981.0,4981.0,4981.0
mean,43.419859,0.096165,0.05521,105.943562,28.498173,0.049789
std,22.662755,0.294848,0.228412,45.075373,6.790464,0.217531
min,0.08,0.0,0.0,55.12,14.0,0.0
25%,25.0,0.0,0.0,77.23,23.7,0.0
50%,45.0,0.0,0.0,91.85,28.1,0.0
75%,61.0,0.0,0.0,113.86,32.6,0.0
max,82.0,1.0,1.0,271.74,48.9,1.0
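
The comment in the cell above suggests viewing `age`, `bmi`, and glucose in bins. A minimal sketch of fixed-edge binning with `pd.cut`, using a small made-up sample rather than the real data; the bin edges and labels are illustrative assumptions, not values from this analysis:

```python
import pandas as pd

# Hypothetical sample ages; the notebook itself would use stroke_data['age']
sample = pd.DataFrame({"age": [3.0, 20.0, 41.0, 59.0, 78.0, 82.0]})

# Assumed bin edges; pd.cut intervals are right-inclusive by default:
# (0, 18], (18, 40], (40, 65], (65, 100]
edges = [0, 18, 40, 65, 100]
labels = ["0-18", "19-40", "41-65", "66+"]

sample["age_bin"] = pd.cut(sample["age"], bins=edges, labels=labels)

# Count patients per bin, keeping bin order instead of sorting by frequency
bin_counts = sample["age_bin"].value_counts(sort=False)
print(bin_counts)
```

The same call could be applied to `bmi` or `avg_glucose_level` with edges suited to those columns.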


In [103]:
# Review df info to check Dtypes - the object columns will need encoding before logistic regression (LR)
stroke_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4981 entries, 0 to 4980
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             4981 non-null   object 
 1   age                4981 non-null   float64
 2   hypertension       4981 non-null   int64  
 3   heart_disease      4981 non-null   int64  
 4   ever_married       4981 non-null   object 
 5   work_type          4981 non-null   object 
 6   Residence_type     4981 non-null   object 
 7   avg_glucose_level  4981 non-null   float64
 8   bmi                4981 non-null   float64
 9   smoking_status     4981 non-null   object 
 10  stroke             4981 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 428.2+ KB
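
Five of the columns have dtype `object`, and logistic regression implementations generally expect numeric inputs. One common way to prepare such columns is one-hot encoding with `pd.get_dummies`. The sketch below uses a tiny made-up frame with some of the same column names; `drop_first=True` is an assumed choice (it drops one level per category so the remaining columns are not redundant), not something this notebook does yet:

```python
import pandas as pd

# Made-up rows that reuse a few of the dataset's column names
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "ever_married": ["Yes", "No", "Yes"],
    "age": [67.0, 20.0, 49.0],
    "stroke": [1, 0, 1],
})

# Text columns to encode, listed explicitly for clarity
categorical_cols = ["gender", "ever_married"]

# Replace each text column with 0/1 indicator columns
encoded = pd.get_dummies(toy, columns=categorical_cols, drop_first=True, dtype=int)
print(encoded.columns.tolist())
```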


In [104]:
# Check for NaN values
stroke_data.isnull().values.any()

False
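
`isnull().values.any()` only reports whether any missing value exists. Had it returned `True`, a per-column count would show where the gaps are. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing bmi value
toy = pd.DataFrame({"bmi": [36.6, np.nan, 34.4], "age": [67.0, 80.0, 49.0]})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```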

In [105]:
# Sample (row count)
len(stroke_data)

4981

In [106]:
# Filter out children (age < 18), since they are unlikely to have strokes.
# Initially filtered by work_type, but filtering on age captures all minors; child_filter keeps adults only.

child_filter = stroke_data[stroke_data['age'] >= 18]
child_filter.head(20)

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
2,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
3,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
4,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1
...,...,...,...,...,...,...,...,...,...,...,...
15,Female,60.0,0,0,No,Private,Urban,89.22,37.8,never smoked,1
16,Female,71.0,0,0,Yes,Govt_job,Rural,193.94,22.4,smokes,1
17,Female,52.0,1,0,Yes,Self-employed,Urban,233.29,48.9,never smoked,1
18,Female,79.0,0,0,Yes,Self-employed,Urban,228.70,26.6,never smoked,1


In [107]:
child_filter.count()

gender               4158
age                  4158
hypertension         4158
heart_disease        4158
ever_married         4158
work_type            4158
Residence_type       4158
avg_glucose_level    4158
bmi                  4158
smoking_status       4158
stroke               4158
dtype: int64

In [108]:
child_filter.shape

(4158, 11)

In [109]:
stroke_data_filtered = stroke_data.drop(stroke_data[stroke_data['age']< 18].index)
stroke_data_filtered.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
2,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
3,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
4,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1


In [110]:
stroke_data_filtered.shape

(4158, 11)

In [111]:
# Validation check: a boolean-mask filter on age should match the shape of stroke_data_filtered

age_filtered = stroke_data[stroke_data['age'] >= 18]
age_filtered.shape

(4158, 11)
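
Comparing shapes confirms that the row counts agree, but not that the rows themselves are identical. A sketch of checking that directly with `DataFrame.equals`, and of computing the relative reduction in sample size, on a small made-up frame (the filtering expressions mirror the notebook; the data does not):

```python
import pandas as pd

# Made-up data: two minors and three adults
toy = pd.DataFrame({"age": [3.0, 17.0, 18.0, 45.0, 82.0], "stroke": [0, 0, 0, 0, 1]})

# Approach 1: keep rows that satisfy a boolean mask
kept_by_mask = toy[toy["age"] >= 18]

# Approach 2: drop the rows that fail the condition
kept_by_drop = toy.drop(toy[toy["age"] < 18].index)

# True only if values, index, and dtypes all match
same_rows = kept_by_mask.equals(kept_by_drop)

# Percentage of rows removed by the filter
pct_decrease = 100 * (len(toy) - len(kept_by_mask)) / len(toy)
print(same_rows, pct_decrease)
```

Applied to the counts reported in this notebook, (4981 - 4158) / 4981 is roughly a 16.5% decrease.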

In [112]:
# shape: 11 columns and 4158 rows (after removing patients under 18)
shape = stroke_data_filtered.shape
print('\nDataFrame Shape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])


DataFrame Shape : (4158, 11)

Number of rows : 4158

Number of columns : 11


In [113]:
# Determine the number of unique values in each column.
stroke_data_filtered.nunique()

gender                  2
age                    65
hypertension            2
heart_disease           2
ever_married            2
work_type               3
Residence_type          2
avg_glucose_level    3416
bmi                   317
smoking_status          4
stroke                  2
dtype: int64
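
The `describe()` output earlier shows a mean of about 0.05 for `stroke`, so positive cases are rare, which matters when fitting and evaluating a classifier. A minimal sketch of checking class balance with `value_counts(normalize=True)`, using made-up labels rather than the real column:

```python
import pandas as pd

# Made-up labels: 19 patients without a stroke, 1 with
labels = pd.Series([0] * 19 + [1], name="stroke")

# Share of each class instead of raw counts
class_share = labels.value_counts(normalize=True)
print(class_share)
```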

In [114]:
pd.set_option('display.max_rows', None)

In [124]:
age_count = stroke_data_filtered.age.value_counts()
age_sorted = age_count.sort_index()
age_sorted

18.0     57
19.0     49
20.0     59
21.0     46
22.0     45
23.0     61
24.0     53
25.0     56
26.0     61
27.0     53
28.0     52
29.0     51
30.0     54
31.0     75
32.0     70
33.0     56
34.0     67
35.0     53
36.0     52
37.0     73
38.0     70
39.0     70
40.0     72
41.0     71
42.0     66
43.0     67
44.0     74
45.0     82
46.0     58
47.0     74
48.0     62
49.0     76
50.0     80
51.0     84
52.0     83
53.0     82
54.0     85
55.0     82
56.0     75
57.0     92
58.0     66
59.0     78
60.0     72
61.0     75
62.0     73
63.0     73
64.0     53
65.0     61
66.0     58
67.0     49
68.0     46
69.0     54
70.0     45
71.0     61
72.0     45
73.0     45
74.0     39
75.0     53
76.0     50
77.0     42
78.0    102
79.0     84
80.0     70
81.0     60
82.0     56
Name: age, dtype: int64

In [143]:
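# Label each patient as 'Under 40' or '40+' without modifying the original column.
# Assumes the intended split is at age 40; inplace=True is avoided because it makes where() return None.
age_df = pd.Series(np.where(stroke_data_filtered["age"] >= 40, "40+", "Under 40"),
                   index=stroke_data_filtered.index, name="age_group")
age_df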

In [134]:
# Collect the distinct ages below 40 from the sorted age index
# (assumes the goal is to group these ages under a single 'Under 40' label)
ages_to_replace_2 = age_sorted.index[age_sorted.index < 40].tolist()

print(ages_to_replace_2)

# Alternatively:
# for x in range(application_counts.size):
#   if application_counts[x] < application_counts[7]:
#     application_types_to_replace.append(application_counts.index[x])

# # Replace in DataFrame

# for age in ages_to_replace:
#     stroke_data_filtered['age'] = stroke_data_filtered['age'].replace(age,"40+")
    
# stroke_data_filtered.tail(25)

[18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0]


In [75]:
avg_glucose_level_bin = stroke_data_filtered['avg_glucose_level'].value_counts()
avg_glucose_level_bin

91.68     5
73.00     5
67.92     4
86.06     4
72.49     4
         ..
68.43     1
229.94    1
131.23    1
93.90     1
64.02     1
Name: avg_glucose_level, Length: 3416, dtype: int64
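
With 3,416 distinct glucose readings, raw `value_counts()` says little about the distribution. Quantile bins are one alternative. The sketch below uses `pd.qcut` on a handful of made-up readings; the choice of four bins and the `Q1` to `Q4` labels are assumptions for illustration:

```python
import pandas as pd

# Made-up glucose readings; the notebook would use stroke_data_filtered['avg_glucose_level']
glucose = pd.Series(
    [55.12, 70.15, 79.53, 83.94, 91.85, 95.02, 101.76, 116.62],
    name="avg_glucose_level",
)

# Split into four bins holding roughly equal numbers of readings
quartile = pd.qcut(glucose, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Readings per quartile, in quartile order
quartile_counts = quartile.value_counts(sort=False)
print(quartile_counts)
```

Unlike fixed-edge bins, quantile bins adapt to skewed data, at the cost of edges that change whenever the sample changes.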

40+     4632
74.0      39
9.0       38
11.0      36
10.0      34
4.0       32
7.0       32
6.0       24
1.8        9
1.32       8
1.64       8
1.24       7
1.88       7
1.08       7
1.48       6
1.72       6
1.0        5
0.88       5
0.24       5
0.72       5
0.32       5
0.56       5
0.8        4
0.64       4
1.56       4
0.48       3
1.4        3
1.16       3
0.4        2
0.08       2
0.16       1
Name: age, dtype: int64