# Gradient Boosting Machine Learning Model

## Reason this model was chosen:
Scikit-learn's gradient boosting classifier (GradientBoostingClassifier) was chosen for this binary classification task. Gradient boosting is a supervised machine learning technique based on ensemble learning. The "boosting" part refers to how the decision trees (weak learners) are built sequentially, each new tree correcting the errors of the trees before it, so that the combined ensemble becomes a strong learner. Boosting mainly reduces bias; variance is kept in check with settings such as a small learning rate, shallow trees, and a limited number of boosting stages.

## How model will be trained:

The following steps will be taken to train the model:

1) The preprocessed data is split into training and testing datasets using scikit-learn's train_test_split function.

2) The data will be scaled to have a mean of 0 and a standard deviation of 1 using scikit-learn's StandardScaler.

3) An instance of scikit-learn's gradient boosting model will be created and trained with the training features and target.
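The three steps above can be sketched end to end as follows. This is a minimal illustration, not the notebook's actual run: the data is a synthetic stand-in generated with `make_classification` (an assumption, roughly 5% positives), and the model uses scikit-learn's default hyperparameters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed stroke data (assumption: ~5% positives).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# 1) Split into training and testing sets, keeping the class ratio in both.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 2) Fit the scaler on the training features only, then transform both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3) Create and train the gradient boosting classifier.
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train_scaled, y_train)

test_accuracy = model.score(X_test_scaled, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Fitting the scaler on the training split only keeps information from the test set out of the training process.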

## To improve the model's accuracy, the following steps can/will be taken:

* Oversampling the minority class in the training data using imbalanced-learn's RandomOverSampler, since there are many more non-stroke patients in the dataset than stroke patients.

* Testing different learning rates for the gradient boosting model.

* Testing different numbers of boosting stages (n_estimators).

* Altering other parameters of the model.

* Altering the dataset by binning features or removing features.
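Testing several learning rates and numbers of boosting stages can be automated with a cross-validated grid search. The sketch below is one possible approach, not what the notebook does: the data is synthetic, the grid values are illustrative, and F1 is chosen as the scoring metric on the assumption that it is more informative than accuracy for an imbalanced target.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (assumption, not the stroke dataset).
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Illustrative grid: 3 learning rates x 2 numbers of boosting stages.
param_grid = {
    "learning_rate": [0.05, 0.1, 0.5],
    "n_estimators": [20, 50],
}

search = GridSearchCV(
    GradientBoostingClassifier(max_depth=3, random_state=0),
    param_grid,
    scoring="f1",  # scores the positive (minority) class
    cv=3,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Because each candidate is scored on held-out folds of the training data, the test set stays untouched until the final evaluation.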

## How the gradient boosting model works:

Gradient boosting is a supervised machine learning technique based on ensemble learning. The ensemble is built one decision tree (weak learner) at a time: each new tree is fit to the errors of the current ensemble (the negative gradient of the loss), and its predictions are added to the ensemble, scaled by the learning rate. Repeating this for each boosting stage (n_estimators) turns many weak learners into one stronger learner.
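The sequential idea can be illustrated with a small from-scratch sketch. To keep it short, it uses regression with squared error, where the negative gradient is simply the residual; scikit-learn's classifier applies the same idea with a classification loss. The data, tree depth, learning rate, and number of stages are all illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: a noisy sine curve (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_stages = 50

# Stage 0: start from a constant prediction (the mean of the target).
prediction = np.full_like(y, y.mean())
mse_per_stage = [np.mean((y - prediction) ** 2)]

trees = []
for _ in range(n_stages):
    # For squared error, the negative gradient is the residual.
    residuals = y - prediction
    # Fit a shallow tree (weak learner) to the current residuals.
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Add the tree's correction, shrunk by the learning rate.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_per_stage.append(np.mean((y - prediction) ** 2))

print(f"Training MSE: {mse_per_stage[0]:.4f} -> {mse_per_stage[-1]:.4f}")
```

Each stage lowers the training error a little; a smaller learning rate makes each correction more cautious and usually calls for more stages.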

## Explanation of the gradient boosting model's limitations and benefits:

Benefits: gradient boosting is a powerful technique for building predictive models for binary classification. Because the decision trees (weak learners) are built sequentially, with each tree correcting the errors of the previous ones, the ensemble reduces bias well.

Limitations: the gradient boosting model can be sensitive to outliers, since later trees concentrate on the samples that earlier trees predicted poorly, and it is prone to overfitting. Different parameters (for example the learning rate, the number of boosting stages, and the tree depth) must therefore be tested to create a reliable model.
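One way to watch for overfitting is to compare training and testing accuracy after every boosting stage using the classifier's `staged_predict` method. The sketch below uses synthetic data with deliberately noisy labels and an aggressive learning rate; both are assumptions chosen to make the effect easy to see, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with 10% of labels flipped to simulate noise (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=1.0, max_depth=3, random_state=0
)
model.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after each boosting stage.
train_acc = [accuracy_score(y_train, p) for p in model.staged_predict(X_train)]
test_acc = [accuracy_score(y_test, p) for p in model.staged_predict(X_test)]

best_stage = test_acc.index(max(test_acc)) + 1
print(f"Final training accuracy: {train_acc[-1]:.3f}")
print(f"Final testing accuracy:  {test_acc[-1]:.3f}")
print(f"Best testing accuracy reached at stage {best_stage}")
```

A training accuracy that keeps climbing while the testing accuracy stalls or drops is the usual sign that later stages are fitting noise.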

In [1]:
# import necessary dependencies
from pathlib import Path
import pandas as pd
import numpy as np
from collections import Counter

# import dependencies for training, testing, scaling, and model creation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

# import dependencies for model evaluation
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

In [2]:
# import cleaned and preprocessed dataset
path = Path("../resources/healthcare-dataset-stroke-data.csv")
stroke_df = pd.read_csv(path)
stroke_df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [3]:
#  Get a list of all columns within the DataFrame.
stroke_df.columns

Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke'],
      dtype='object')

In [4]:
# Check the datatypes of your columns and other dataframe features.
stroke_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [5]:
# remove unnecessary columns that would not contribute to machine learning model
stroke_df = stroke_df.drop(labels="id", axis=1)
stroke_df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [6]:
# Drop all rows with missing information
stroke_df = stroke_df.dropna(how='any')

In [7]:
# Verify dropped rows
stroke_df.count()

gender               4909
age                  4909
hypertension         4909
heart_disease        4909
ever_married         4909
work_type            4909
Residence_type       4909
avg_glucose_level    4909
bmi                  4909
smoking_status       4909
stroke               4909
dtype: int64

In [8]:
# number of unique values in each column
print(stroke_df.nunique())
print('--------------------')
print(stroke_df.dtypes)

gender                  3
age                   104
hypertension            2
heart_disease           2
ever_married            2
work_type               5
Residence_type          2
avg_glucose_level    3852
bmi                   418
smoking_status          4
stroke                  2
dtype: int64
--------------------
gender                object
age                  float64
hypertension           int64
heart_disease          int64
ever_married          object
work_type             object
Residence_type        object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                 int64
dtype: object


In [9]:
# Convert these values into numerical ones with the pd.get_dummies() method
stroke_encoded = pd.get_dummies(stroke_df, columns=["gender", "ever_married", "work_type", "Residence_type", "smoking_status"])
stroke_encoded.head()

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke,gender_Female,gender_Male,gender_Other,ever_married_No,...,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Rural,Residence_type_Urban,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
0,67.0,0,1,228.69,36.6,1,0,1,0,0,...,0,1,0,0,0,1,0,1,0,0
2,80.0,0,1,105.92,32.5,1,0,1,0,0,...,0,1,0,0,1,0,0,0,1,0
3,49.0,0,0,171.23,34.4,1,1,0,0,0,...,0,1,0,0,0,1,0,0,0,1
4,79.0,1,0,174.12,24.0,1,1,0,0,0,...,0,0,1,0,1,0,0,0,1,0
5,81.0,0,0,186.21,29.0,1,0,1,0,0,...,0,1,0,0,0,1,0,1,0,0


In [10]:
# Check the datatypes of your columns.
stroke_encoded.dtypes

age                               float64
hypertension                        int64
heart_disease                       int64
avg_glucose_level                 float64
bmi                               float64
stroke                              int64
gender_Female                       uint8
gender_Male                         uint8
gender_Other                        uint8
ever_married_No                     uint8
ever_married_Yes                    uint8
work_type_Govt_job                  uint8
work_type_Never_worked              uint8
work_type_Private                   uint8
work_type_Self-employed             uint8
work_type_children                  uint8
Residence_type_Rural                uint8
Residence_type_Urban                uint8
smoking_status_Unknown              uint8
smoking_status_formerly smoked      uint8
smoking_status_never smoked         uint8
smoking_status_smokes               uint8
dtype: object

In [11]:
# convert all columns to the int64 datatype
# NOTE: float columns (age, avg_glucose_level, bmi) are truncated, e.g. 36.6 becomes 36
stroke_encoded = stroke_encoded.astype('int64')
print(stroke_encoded.dtypes)

age                               int64
hypertension                      int64
heart_disease                     int64
avg_glucose_level                 int64
bmi                               int64
stroke                            int64
gender_Female                     int64
gender_Male                       int64
gender_Other                      int64
ever_married_No                   int64
ever_married_Yes                  int64
work_type_Govt_job                int64
work_type_Never_worked            int64
work_type_Private                 int64
work_type_Self-employed           int64
work_type_children                int64
Residence_type_Rural              int64
Residence_type_Urban              int64
smoking_status_Unknown            int64
smoking_status_formerly smoked    int64
smoking_status_never smoked       int64
smoking_status_smokes             int64
dtype: object


In [12]:
# save stroke_encoded as a csv file.
stroke_encoded.to_csv('../resources/clean_stroke_data.csv', index=False)

## Description of preliminary data preprocessing:

### Steps taken:

1) The csv dataset was read into a Pandas dataframe.

2) The "id" column was dropped since the machine learning model cannot learn anything valuable from it.

3) Rows with null values were removed.

4) Categorical columns were converted to binary numerical columns using the Pandas get_dummies() function.

5) Columns of the "float" data type were converted to integers, which truncates the decimal part.

6) A csv containing the data prepared for the machine learning model was exported to the resources folder.
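The same sequence of steps can be traced on a tiny, made-up dataframe. The rows below are hypothetical and only mimic a few of the real columns; the export step is left out so the sketch does not write any files.

```python
import pandas as pd

# Hypothetical miniature of the raw data (values are made up).
raw = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": ["Male", "Female", "Female", "Male"],
    "age": [67.0, 61.0, 49.5, 80.0],
    "bmi": [36.6, None, 34.4, 32.5],
    "stroke": [1, 1, 0, 1],
})

# Drop the identifier column, then drop rows with any missing value.
clean = raw.drop(columns="id").dropna(how="any")

# One-hot encode the categorical column into binary columns.
clean = pd.get_dummies(clean, columns=["gender"])

# Cast every column to int64; floats are truncated (36.6 -> 36).
clean = clean.astype("int64")

print(clean)
```

Note that casting to integers discards the decimal part of columns such as bmi, which is a small loss of precision.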

In [13]:
# Display a statistical overview of the DataFrame
stroke_encoded.describe()

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke,gender_Female,gender_Male,gender_Other,ever_married_No,...,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Rural,Residence_type_Urban,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
count,4909.0,4909.0,4909.0,4909.0,4909.0,4909.0,4909.0,4909.0,4909.0,4909.0,...,4909.0,4909.0,4909.0,4909.0,4909.0,4909.0,4909.0,4909.0,4909.0,4909.0
mean,42.854145,0.091872,0.049501,104.816052,28.448564,0.042575,0.590141,0.409656,0.000204,0.347321,...,0.004482,0.572622,0.157873,0.136688,0.492768,0.507232,0.302098,0.170503,0.377266,0.150132
std,22.575986,0.288875,0.216934,44.417384,7.843834,0.201917,0.491858,0.49182,0.014273,0.476167,...,0.066801,0.494748,0.364659,0.343552,0.499999,0.499999,0.459214,0.376113,0.484752,0.357238
min,0.0,0.0,0.0,55.0,10.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,25.0,0.0,0.0,77.0,23.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,44.0,0.0,0.0,91.0,28.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
75%,60.0,0.0,0.0,113.0,33.0,0.0,1.0,1.0,0.0,1.0,...,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0
max,82.0,1.0,1.0,271.0,97.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [14]:
# check for discrepancy between target vector categories 
stroke_encoded["stroke"].value_counts()

0    4700
1     209
Name: stroke, dtype: int64

In [15]:
# create feature and target vector objects, check shapes of both
X = stroke_encoded.drop(columns=["stroke"]).values
y = stroke_encoded["stroke"].values
print(X.shape, y.shape)

(4909, 21) (4909,)


In [16]:
X[:5]

array([[ 67,   0,   1, 228,  36,   0,   1,   0,   0,   1,   0,   0,   1,
          0,   0,   0,   1,   0,   1,   0,   0],
       [ 80,   0,   1, 105,  32,   0,   1,   0,   0,   1,   0,   0,   1,
          0,   0,   1,   0,   0,   0,   1,   0],
       [ 49,   0,   0, 171,  34,   1,   0,   0,   0,   1,   0,   0,   1,
          0,   0,   0,   1,   0,   0,   0,   1],
       [ 79,   1,   0, 174,  24,   1,   0,   0,   0,   1,   0,   0,   0,
          1,   0,   1,   0,   0,   0,   1,   0],
       [ 81,   0,   0, 186,  29,   0,   1,   0,   0,   1,   0,   0,   1,
          0,   0,   0,   1,   0,   1,   0,   0]], dtype=int64)

In [17]:
y[:5]

array([1, 1, 1, 1, 1], dtype=int64)

## Description of preliminary feature engineering and preliminary feature selection, including the decision-making process:

* Features with no predictive value (the "id" column) were removed so that they would not skew the model. Categorical features were converted to numerical data by spreading each one out into binary columns using the Pandas get_dummies() function.

### The decision-making process was conducted as follows:

* Selecting features that contribute to a more accurate model and contain valuable information.

* Checking if categorical features have many unique values that can be binned (this was not the case since all categorical features had fewer than 10 unique values).

* Checking the datatype of each feature to see if any are not numerical, since the machine learning model requires numerical features as inputs.
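These checks can be expressed in a few lines of pandas. The sketch below uses a hypothetical miniature dataframe, and the threshold of 10 unique values is taken from the bullet above rather than from any fixed rule.

```python
import pandas as pd

# Hypothetical miniature dataframe (values are made up).
df = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Female"],
    "work_type": ["Private", "Self-employed", "children", "Private"],
    "age": [67.0, 61.0, 4.0, 49.0],
    "stroke": [1, 1, 0, 0],
})

# Columns that are not numeric and therefore need encoding.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()

# Number of unique values in each non-numeric column.
unique_counts = df[non_numeric].nunique()

# Columns with 10 or more categories would be candidates for binning.
needs_binning = unique_counts[unique_counts >= 10].index.tolist()

print("Needs encoding:", non_numeric)
print("Needs binning:", needs_binning)
```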

In [18]:
# split data into training and testing data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(3681, 21) (1228, 21) (3681,) (1228,)


In [19]:
# over-sample the data since patients without a stroke far outnumber patients with a stroke
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=1)

# NOTE: only oversample training data
X_train_resampled, y_train_resampled = ros.fit_resample(X_train, y_train)


print(Counter(y_train_resampled))

Counter({0: 3524, 1: 3524})


In [20]:
# create an instance of the StandardScaler
scaler = StandardScaler()

X_scaler = scaler.fit(X_train_resampled)

X_train_scaled = X_scaler.transform(X_train_resampled)
X_test_scaled = X_scaler.transform(X_test)

#print first two rows of scaled training and testing features
print(
f'{X_train_scaled[:2]}\n \
----------\n \
{X_test_scaled[:2]}'
)

[[ 0.11953641 -0.51996894  2.82865288  1.82993353 -0.45444952 -1.19597025
   1.19632068 -0.01191236 -0.56227496  0.56227496 -0.36660008 -0.04461314
   0.82225293 -0.51864759 -0.27633167  1.01199015 -1.01199015 -0.52436993
  -0.5316207  -0.80970737  2.22546685]
 [-1.23227286 -0.51996894 -0.35352517 -0.23775606  1.22814569  0.8361412
  -0.83589628 -0.01191236  1.77848931 -1.77848931 -0.36660008 -0.04461314
   0.82225293 -0.51864759 -0.27633167 -0.98815191  0.98815191 -0.52436993
  -0.5316207   1.23501407 -0.44934392]]
 ----------
 [[-1.14215225 -0.51996894 -0.35352517 -0.65830309 -0.87509832  0.8361412
  -0.83589628 -0.01191236 -0.56227496  0.56227496  2.72776806 -0.04461314
  -1.2161708  -0.51864759 -0.27633167 -0.98815191  0.98815191  1.90705063
  -0.5316207  -0.80970737 -0.44934392]
 [-0.28600637 -0.51996894 -0.35352517 -0.57068913  0.38684809 -1.19597025
   1.19632068 -0.01191236 -0.56227496  0.56227496 -0.36660008 -0.04461314
   0.82225293 -0.51864759 -0.27633167  1.01199015 -1.0119

## Description of how data was split into training and testing sets:

### Steps taken:

* Separate objects were created for the features (X) and the target vector (y).

* sklearn's train_test_split function was used to separate the data into training and testing datasets with a split of 75% training data and 25% testing data. Stratification was applied so that the ratio of the target categories was the same in both the training and testing data.

* The training data was over-sampled using imbalanced-learn's RandomOverSampler since there was a discrepancy between the target categories, i.e. patients that had a stroke made up only 4.3% of the dataset. The testing data was left untouched.

* Data was scaled using sklearn's StandardScaler so that each feature has a mean of 0 and a standard deviation of 1. The scaler was fit on the (over-sampled) training data and then applied to both the training and testing data. Tree-based models such as gradient boosting are largely insensitive to feature scaling, so this step mainly keeps the pipeline consistent with other models.
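The effect of stratification and of over-sampling only the training data can be checked on a small synthetic example. This sketch is an illustration under stated assumptions: the data is made up (4% positives), and `sklearn.utils.resample` is used as a stand-in for imbalanced-learn's RandomOverSampler so the example depends only on scikit-learn and NumPy.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic target with 4% positives (assumption) and a dummy feature.
y = np.array([0] * 960 + [1] * 40)
X = np.arange(1000).reshape(-1, 1)

# Stratified 75/25 split keeps the 4% ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print("Train:", Counter(y_train.tolist()), "Test:", Counter(y_test.tolist()))

# Over-sample the minority class in the training data only,
# drawing with replacement until both classes are the same size.
minority = y_train == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=n_majority, random_state=1,
)
X_train_resampled = np.vstack([X_train[~minority], X_min_up])
y_train_resampled = np.concatenate([y_train[~minority], y_min_up])

print("Resampled train:", Counter(y_train_resampled.tolist()))
```

Leaving the testing data at its original class ratio means the evaluation reflects how rare strokes are in the real data.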

In [21]:
# create a function that takes the scaled training and testing data and tests which learning rate
# gives the best accuracy for the gradient boosting model.

# model parameter selection e.g. max_depth, max_features, random_state: 
# https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/

def gb_learning_rates(X_train_scaled, y_train_resampled, X_test_scaled, y_test):
    
    learning_rates = [0.05, 0.1, 0.25, 0.5, 0.75, 1]
    
    for learning_rate in learning_rates:
        
        # create instance of model
        gb_model = GradientBoostingClassifier(n_estimators=21,
                                             learning_rate=learning_rate,
                                             max_depth=3,
                                             max_features=6,
                                             random_state=None)
                                              
        # train model with training data
        gb_model.fit(X_train_scaled, y_train_resampled)
        
        print(f'Learning Rate: {learning_rate}')
        
        print(f'Mean Accuracy for Training Data: {gb_model.score(X_train_scaled, y_train_resampled):.3f}')
        
        print(f'Mean Accuracy for Testing Data: {gb_model.score(X_test_scaled, y_test):.3f}')
        
        print('--------------------------------')
    
    return

In [22]:
# call the function with the intended training and testing data as arguments

gb_learning_rates(X_train_scaled, y_train_resampled, X_test_scaled, y_test)

Learning Rate: 0.05
Mean Accuracy for Training Data: 0.801
Mean Accuracy for Testing Data: 0.700
--------------------------------
Learning Rate: 0.1
Mean Accuracy for Training Data: 0.818
Mean Accuracy for Testing Data: 0.781
--------------------------------
Learning Rate: 0.25
Mean Accuracy for Training Data: 0.848
Mean Accuracy for Testing Data: 0.777
--------------------------------
Learning Rate: 0.5
Mean Accuracy for Training Data: 0.876
Mean Accuracy for Testing Data: 0.792
--------------------------------
Learning Rate: 0.75
Mean Accuracy for Training Data: 0.888
Mean Accuracy for Testing Data: 0.795
--------------------------------
Learning Rate: 1
Mean Accuracy for Training Data: 0.909
Mean Accuracy for Testing Data: 0.812
--------------------------------


In [23]:
# create a new model using the learning rate that gave the highest testing accuracy

# create instance of model
classifier = GradientBoostingClassifier(n_estimators=21,
                                             learning_rate=1,
                                             max_depth=3,
                                             max_features=6,
                                             random_state=None)

#train model with training features and target
classifier.fit(X_train_scaled, y_train_resampled)

# predict target using testing features and make dataframe to compare predictions with actual results
predictions = classifier.predict(X_test_scaled)

predictions_df = pd.DataFrame({"Prediction": predictions, "Actual": y_test})
predictions_df.sample(50)

Unnamed: 0,Prediction,Actual
491,0,0
193,0,0
12,1,1
66,0,0
402,0,0
1042,1,0
767,0,0
495,1,0
381,0,0
709,0,0


In [24]:
# Calculating the accuracy score
acc_score = accuracy_score(y_test, predictions)
print(f"Accuracy Score : {acc_score}")

Accuracy Score : 0.8143322475570033


In [25]:
# Generate the confusion matrix
cm = confusion_matrix(y_test, predictions)
cm_df = pd.DataFrame(
    cm, index=["Actual No Stroke", "Actual Stroke"],
    columns=["Predicted No Stroke", "Predicted Stroke"]
)

# Displaying results
display(cm_df)

Unnamed: 0,Predicted No Stroke,Predicted Stroke
Actual No Stroke,979,197
Actual Stroke,31,21


In [26]:
# Generate classification report
target_names = ["No stroke (0)", "Stroke (1)"]
# store the result under a new name so the imported classification_report function is not overwritten
report = classification_report(y_test, predictions, target_names=target_names)
print(report)

               precision    recall  f1-score   support

No stroke (0)       0.97      0.83      0.90      1176
   Stroke (1)       0.10      0.40      0.16        52

     accuracy                           0.81      1228
    macro avg       0.53      0.62      0.53      1228
 weighted avg       0.93      0.81      0.86      1228
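The figures in the classification report can be reproduced by hand from the confusion matrix above. The short sketch below does the arithmetic for the "Stroke (1)" class and also computes, as a point of comparison added here, the accuracy of a hypothetical baseline that always predicts "no stroke".

```python
# Counts taken from the confusion matrix above.
tn, fp = 979, 197   # actual no stroke: predicted no stroke / predicted stroke
fn, tp = 31, 21     # actual stroke:    predicted no stroke / predicted stroke

total = tn + fp + fn + tp

# Overall accuracy: share of all predictions that were correct.
accuracy = (tp + tn) / total

# Metrics for the stroke class.
precision = tp / (tp + fp)   # of predicted strokes, how many were real
recall = tp / (tp + fn)      # of real strokes, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Hypothetical baseline that predicts "no stroke" for every patient.
baseline_accuracy = (tn + fp) / total

print(f"accuracy:  {accuracy:.2f}")
print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
print(f"f1-score:  {f1:.2f}")
print(f"always-no-stroke baseline accuracy: {baseline_accuracy:.2f}")
```

Because only 52 of the 1228 test patients had a stroke, the always-negative baseline reaches roughly 0.96 accuracy, which is higher than the model's 0.81. Accuracy alone is therefore a weak measure here; the recall and precision for the stroke class say more about how useful the model is.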

