## Gradient Boosting Modeling of Blood Transfusion Data

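Gradient boosting is a powerful ensemble machine learning algorithm, popular for structured predictive modeling problems such as classification and regression on tabular data. It builds decision trees sequentially: each new tree is fitted to the gradient of the loss function with respect to the current ensemble's predictions, so every tree corrects the errors left by the trees before it.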
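To make the idea concrete, here is a minimal sketch of boosting for binary classification under log loss, where the negative gradient is simply `y - sigmoid(score)`. The helper names (`fit_boosted_trees`, `predict_proba`, `log_loss`) and the synthetic data are assumptions made for illustration; this is a simplified version, and library implementations add refinements such as per-leaf value updates, subsampling and regularization.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_boosted_trees(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    """Hypothetical helper: fit shallow trees to the negative gradient of log loss."""
    # Start from the log-odds of the positive class (a constant model).
    p = np.clip(y.mean(), 1e-6, 1 - 1e-6)
    f0 = np.log(p / (1 - p))
    scores = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        # Negative gradient of log loss with respect to the raw score.
        residuals = y - sigmoid(scores)
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        # Take a small step in the direction the tree predicts.
        scores += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees


def predict_proba(X, f0, trees, learning_rate=0.1):
    """Sum the constant start value and all scaled tree outputs, then squash."""
    scores = np.full(X.shape[0], f0)
    for tree in trees:
        scores += learning_rate * tree.predict(X)
    return sigmoid(scores)


def log_loss(y, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


# Synthetic stand-in data with four features (not the transfusion data).
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
f0, trees = fit_boosted_trees(X, y)

baseline_loss = log_loss(y, np.full(len(y), sigmoid(f0)))
boosted_loss = log_loss(y, predict_proba(X, f0, trees))
print(f"training log loss: prior only = {baseline_loss:.3f}, boosted = {boosted_loss:.3f}")
```

The training loss after boosting should be lower than the loss of the constant starting model, which is the behaviour the libraries used in this project build on.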

In this project, I implement the most widely used version, scikit-learn's `GradientBoostingClassifier`, along with computationally efficient alternatives from the XGBoost, LightGBM and CatBoost libraries.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

In [2]:
data = pd.read_csv("transfusion.data", sep = ",")
data.head()

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


In [3]:
inputs = data.rename(columns={'Recency (months)': 'Recency', 'Frequency (times)': 'Frequency', 
                                'Monetary (c.c. blood)':'Monetary', 'Time (months)': 'Time', 
                               'whether he/she donated blood in March 2007': 'Target'})
print(inputs)

     Recency  Frequency  Monetary  Time  Target
0          2         50     12500    98       1
1          0         13      3250    28       1
2          1         16      4000    35       1
3          2         20      5000    45       1
4          1         24      6000    77       0
..       ...        ...       ...   ...     ...
743       23          2       500    38       0
744       21          2       500    52       0
745       23          3       750    62       0
746       39          1       250    39       0
747       72          1       250    72       0

[748 rows x 5 columns]


In [4]:
# separate the features from the target
data_new = inputs.drop('Target', axis='columns')
Target = inputs['Target']

# hold out 33% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(data_new, Target, test_size=0.33)
print(len(X_test))
print(len(X_train))

247
501
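The split above is unseeded and unstratified, so the rows in each partition change on every run. As a hedged sketch of one alternative, the example below fixes `random_state` and passes `stratify` so both partitions keep the same class share. The `demo` frame is synthetic stand-in data that only borrows the column names and row count from this notebook; the seed value and the 25% positive share are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same column names and row count (not the real data).
rng = np.random.default_rng(0)
n = 748
demo = pd.DataFrame({
    "Recency": rng.integers(0, 75, size=n),
    "Frequency": rng.integers(1, 51, size=n),
    "Monetary": rng.integers(1, 51, size=n) * 250,
    "Time": rng.integers(2, 99, size=n),
    "Target": (rng.random(n) < 0.25).astype(int),  # arbitrary imbalanced labels
})

features = demo.drop(columns="Target")
target = demo["Target"]

# random_state makes the split repeatable; stratify preserves the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.33, random_state=42, stratify=target
)

print(len(X_test), len(X_train))
print(f"positive share: train = {y_train.mean():.3f}, test = {y_test.mean():.3f}")
```

With 748 rows and `test_size=0.33`, the partition sizes match the 247 and 501 rows printed above; only the row assignment becomes reproducible.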


#### Gradient Boosting Model Using Repeated Stratified K-Fold Cross-Validation from scikit-learn

In [5]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from matplotlib import pyplot

# evaluating model
model = GradientBoostingClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scoring = cross_val_score(model, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy = %.3f (%.3f)' % (np.mean(scoring), np.std(scoring)))

# model fitting
model = GradientBoostingClassifier()
model.fit(X_train, y_train)

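# make a prediction for row 743, passed as a DataFrame so the feature names match the training data
row743 = pd.DataFrame([[23, 2, 500, 38]], columns=X_train.columns)
y_cap = model.predict(row743)
print('Prediction = %d' % y_cap[0])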

Accuracy = 0.759 (0.047)
Prediction = 0


The GBM above reaches a mean cross-validated accuracy of 75.9% (standard deviation 4.7%). I also predict whether the donor in row 743 donated blood in March 2007, and the model correctly predicts that this person did not. Because row 743 may have landed in the training split, this is a sanity check rather than a test of generalization. Compared with the mean accuracy of the random forest model, this is already an improvement in predictive performance.
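An accuracy figure is easier to interpret next to a trivial baseline, especially when the classes are imbalanced. The sketch below runs the same `RepeatedStratifiedKFold` scheme on a majority-class `DummyClassifier` and on a `GradientBoostingClassifier`. It uses synthetic data from `make_classification` with an assumed 75/25 class split, so the numbers it prints say nothing about the transfusion data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (assumption: roughly 75% negatives).
X, y = make_classification(n_samples=500, n_features=4, weights=[0.75], random_state=1)

# 10 folds repeated 3 times yields 30 accuracy scores per model.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, scoring="accuracy", cv=cv
)
gbm_scores = cross_val_score(
    GradientBoostingClassifier(random_state=1), X, y, scoring="accuracy", cv=cv
)

print("number of scores per model:", len(gbm_scores))
print("Baseline accuracy = %.3f (%.3f)" % (baseline_scores.mean(), baseline_scores.std()))
print("GBM accuracy      = %.3f (%.3f)" % (gbm_scores.mean(), gbm_scores.std()))
```

The baseline accuracy lands at the share of the majority class, which is the bar any model has to clear before its accuracy counts as an improvement.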

#### Gradient Boosting Model Using Repeated Stratified K-Fold Cross-Validation and the XGBoost Classifier

In [6]:
import xgboost
print(xgboost.__version__)

1.5.2


In [1]:
from numpy import asarray
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from matplotlib import pyplot
import numpy as np

model = XGBClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
score = cross_val_score(model, X_train, y_train, scoring='accuracy', n_jobs=-1, cv=cv, error_score='raise')
print('Accuracy = %.3f (%.3f)' % (np.mean(score), np.std(score)))

# fit model
# model = XGBClassifier()
# model.fit(X_train, y_train)

# make a prediction
# row = [2, 20, 5000, 45]
# row3 = asarray(row).reshape((1, len(row)))
# y_cap = model.predict(row3)  # pass the reshaped 2-D array, not the 1-D list
# print('Prediction = %d' % y_cap[0])

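NameError: name 'X_train' is not defined

This cell fails most likely because the kernel was restarted before it ran (the execution counter reset to In [1]), so `X_train` and `y_train` from the train/test split cell no longer exist. Rerunning the data loading and split cells first should resolve the error.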
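The commented-out prediction step in the cell above reshapes a single row into a 2-D array before calling `predict`. The sketch below shows why that reshape matters, using scikit-learn's `GradientBoostingClassifier` on synthetic data as a stand-in for `XGBClassifier`, which exposes the same `fit`/`predict` interface. The model here is not trained on the transfusion data, so the predicted label itself is meaningless; only the array shapes are the point.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in model fitted on synthetic data with four features (assumption).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

row = [2, 20, 5000, 45]  # one observation as a flat list

# Estimators expect shape (n_samples, n_features), so a 1-D row is rejected.
try:
    model.predict(row)
except ValueError as err:
    print("1-D input rejected:", type(err).__name__)

# Reshape to one sample with four features, then predict.
row_2d = np.asarray(row, dtype=float).reshape(1, -1)
y_cap = model.predict(row_2d)

print("input shape:", row_2d.shape)
print("Prediction = %d" % y_cap[0])
```

Using `reshape(1, -1)` avoids hard-coding the number of features, which is equivalent to `reshape((1, len(row)))` in the commented-out code.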