# 1. Introduction

Blood transfusion saves lives, from replacing lost blood during major surgery or after a serious injury to treating various illnesses and blood disorders. Ensuring that enough blood is in supply whenever it is needed is a serious challenge for health professionals. According to [WebMD](https://www.webmd.com/a-to-z-guides/blood-transfusion-what-to-know#1), "about 5 million Americans need a blood transfusion every year".

This dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive. I want to predict whether or not a donor will give blood the next time the vehicle comes to campus.

The data is structured according to the RFMTC marketing model (a variation of RFM). Let's start exploring the data.

In [38]:
import pandas as pd
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score

transfusion = pd.read_csv("../input/donations.csv")
transfusion.head()

Unnamed: 0,V1,V2,V3,V4,Class
0,2,50,12500,98,2
1,0,13,3250,28,2
2,1,16,4000,35,2
3,2,20,5000,45,2
4,1,24,6000,77,1


The RFM model stands for Recency, Frequency and Monetary Value and it is commonly used in marketing for identifying the best customers. In this case, the customers are blood donors.

RFMTC is a variation of the RFM model. Below is a description of what each column means in the dataset:
*     R (Recency - months since the last donation)
*     F (Frequency - total number of donations)
*     M (Monetary - total blood donated in c.c.)
*     T (Time - months since the first donation)
*     a binary variable representing whether the donor gave blood in March 2007 (2 stands for donating blood; 1 stands for not donating blood)
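
In the five rows displayed by `head()`, the `V3` (Monetary) value is exactly 250 times the `V2` (Frequency) value, which would be consistent with a fixed 250 c.c. per donation. The sketch below checks this on those five rows only; whether it holds for the whole dataset is an assumption that would need to be verified on the full file.

```python
# (V2 = Frequency, V3 = Monetary) pairs copied from the head() output.
sample_rows = [(50, 12500), (13, 3250), (16, 4000), (20, 5000), (24, 6000)]

# Volume per donation implied by each row.
cc_per_donation = [monetary / frequency for frequency, monetary in sample_rows]
print(cc_per_donation)

# True when every sampled row implies the same volume per donation.
constant_volume = len(set(cc_per_donation)) == 1
print(constant_volume)
```

If the relation holds everywhere, Frequency and Monetary carry the same information up to a constant factor.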

It will be helpful to rename these columns accordingly, except for the last column, which will become the <code>Target</code> column, since the aim is to predict whether someone donated blood in March 2007.


In [39]:
transfusion.rename(
    columns={'V1':'Recency (months)',
             'V2':'Frequency(times)',
             'V3':'Monetary (c.c. blood)',
             'V4':'Time (months)',
             'Class':'Target'},
    inplace=True
)

transfusion.head()

Unnamed: 0,Recency (months),Frequency(times),Monetary (c.c. blood),Time (months),Target
0,2,50,12500,98,2
1,0,13,3250,28,2
2,1,16,4000,35,2
3,2,20,5000,45,2
4,1,24,6000,77,1


It looks like every column in this DataFrame is numeric, which is exactly what is required when building a machine learning model. Let's verify this.

In [40]:
transfusion.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
Recency (months)         748 non-null int64
Frequency(times)         748 non-null int64
Monetary (c.c. blood)    748 non-null int64
Time (months)            748 non-null int64
Target                   748 non-null int64
dtypes: int64(5)
memory usage: 29.3 KB


I want to predict whether or not the same donor will give blood the next time the vehicle comes to campus. The model for this is a binary classifier, meaning that there are only 2 possible outcomes:

*     <code>1</code> - the donor will not give blood
*     <code>2</code> - the donor will give blood

Target incidence is defined as the number (or proportion) of cases of each individual target value in a dataset. That is, how many 1s are there in the target column compared to how many 2s? Target incidence gives us an idea of how balanced (or imbalanced) our dataset is.

It will also be useful later to convert the (1, 2) values of Target to (0, 1).

In [41]:
transfusion['Target'] = transfusion['Target'].replace([1, 2], [0, 1])

display(transfusion['Target'].value_counts(normalize = True))

0    0.762032
1    0.237968
Name: Target, dtype: float64

Target incidence indicates that about 76% of the time an individual does not give blood.
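
To put these proportions in context, the sketch below rebuilds a label vector with the same class balance (570 zeros and 178 ones, which is what 76.2% and 23.8% of 748 rows work out to) and scores a hypothetical model that always predicts 0. The `pairwise_auc` helper is written here for illustration and is not part of the notebook; it computes AUC as the share of (positive, negative) pairs that are ranked correctly, counting ties as half.

```python
# Labels with the same class balance as the Target column
# (assumed counts: 570 non-donors and 178 donors out of 748 rows).
y_true = [0] * 570 + [1] * 178

# A hypothetical baseline that gives every donor the same score.
constant_scores = [0.0] * len(y_true)


def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Accuracy of always predicting class 0.
baseline_accuracy = sum(1 for y in y_true if y == 0) / len(y_true)
baseline_auc = pairwise_auc(y_true, constant_scores)

print(f"accuracy: {baseline_accuracy:.4f}")
print(f"AUC: {baseline_auc:.4f}")
```

Accuracy looks respectable at about 76% while AUC sits at 0.5, so accuracy alone would be a misleading yardstick on a dataset this imbalanced.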

I will now split this dataframe into train and test datasets, with the test set making up 25% of the total data. In doing so, I will also take care to keep the target incidence the same in both these datasets, i.e. they should both have roughly 76% 0s in their <code>Target</code> columns.

In [44]:
X_train, X_test, y_train, y_test = train_test_split(
    transfusion.drop(columns = 'Target'),
    transfusion.Target,
    test_size = 0.25,
    random_state = 42,
    stratify = transfusion.Target
)
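
As a quick sanity check on what `stratify` does, the sketch below runs the same kind of split on stand-in labels with the dataset's class balance (assumed counts: 570 zeros and 178 ones) and compares the donor rate in each part. The `features` column is a placeholder, not real data.

```python
from sklearn.model_selection import train_test_split

# Stand-in labels with the same balance as Target (assumed: 570 zeros, 178 ones).
labels = [0] * 570 + [1] * 178
features = [[i] for i in range(len(labels))]  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.25, random_state=42, stratify=labels
)

# Share of donors (label 1) in each part of the split.
train_rate = sum(y_tr) / len(y_tr)
test_rate = sum(y_te) / len(y_te)
print(f"train donors: {train_rate:.3f}, test donors: {test_rate:.3f}")
```

In the notebook itself, comparing `y_train.mean()` with `y_test.mean()` would give the same check on the real split.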

# 2. Using TPOT to select model

[TPOT](https://github.com/EpistasisLab/tpot) is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming. It automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for given data.

![](https://raw.githubusercontent.com/EpistasisLab/tpot/master/images/tpot-ml-pipeline.png)

TPOT is built on top of scikit-learn, so all of the code it generates will be a scikit-learn pipeline, meaning it will include any pre-processing steps as well as the model.

I am using TPOT to help zero in on one model that can then be explored and optimized further.

In [45]:
tpot = TPOTClassifier(
    generations = 5,
    population_size = 20,
    verbosity = 2,
    scoring = 'roc_auc',
    random_state = 42,
    disable_update_check = True,
    config_dict = 'TPOT light'
)
tpot.fit(X_train, y_train)

tpot_auc_score = roc_auc_score(y_test, tpot.predict_proba(X_test)[:, 1])
print(f'\nAUC score: {tpot_auc_score:.4f}')

print('\nBest pipeline steps:', end='\n')
for idx, (name, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):
    print(f'{idx}. {transform}')

Generation 1 - Current best internal CV score: 0.7433977184592779
Generation 2 - Current best internal CV score: 0.7433977184592779
Generation 3 - Current best internal CV score: 0.7433977184592779
Generation 4 - Current best internal CV score: 0.7433977184592779
Generation 5 - Current best internal CV score: 0.7433977184592779

Best pipeline: LogisticRegression(input_matrix, C=0.5, dual=False, penalty=l2)

AUC score: 0.7850

Best pipeline steps:
1. LogisticRegression(C=0.5, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)


TPOT has picked <code>LogisticRegression</code> as the best pipeline for this data, with an AUC score of 0.7850 on the test set. However, one of the assumptions of linear models such as logistic regression is that the data and the features we give them are related in a linear fashion, or can be measured with a linear distance metric. If a feature's variance is an order of magnitude or more greater than that of the other features, this could impair the model's ability to learn from the other features in the dataset.
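
To make the scale concern concrete, the sketch below computes the linear score that a logistic regression passes through the sigmoid, using the first row of the `head()` output and hypothetical, equal weights (these are not fitted values). With equal weights, the Monetary term accounts for almost the entire weighted sum.

```python
import math

# First row of the head() output: Recency, Frequency, Monetary, Time.
row = {"Recency": 2, "Frequency": 50, "Monetary": 12500, "Time": 98}

# Hypothetical weights and intercept, chosen only for illustration.
weights = {name: 0.001 for name in row}
intercept = -5.0

# Each feature's contribution to the linear score w . x + b.
contributions = {name: weights[name] * value for name, value in row.items()}
score = sum(contributions.values()) + intercept

# The sigmoid maps the score to a probability between 0 and 1.
probability = 1.0 / (1.0 + math.exp(-score))

# Fraction of the weighted sum that comes from the Monetary column alone.
monetary_share = contributions["Monetary"] / sum(contributions.values())
print(f"score={score:.2f}, probability={probability:.4f}")
print(f"Monetary share of weighted sum: {monetary_share:.1%}")
```

A fitted model can shrink the weight on a large-scale feature, but an L2 penalty treats all weights on a common scale, so the units of a feature still influence the fit.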

So before fitting a logistic regression ourselves, we need to check the variance of our data and normalize if needed.

In [46]:
display(X_train.var())

Recency (months)         6.692902e+01
Frequency(times)         3.382982e+01
Monetary (c.c. blood)    2.114364e+06
Time (months)            6.111466e+02
dtype: float64

The variance of <code>Monetary (c.c. blood)</code> is very high in comparison to any other column in the dataset. This means that, unless accounted for, this feature may be given more weight by the model (i.e., be seen as more important) than any other feature.
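
Variance scales with the square of a multiplier, so if Monetary were always 250 times Frequency (an assumption suggested by the sample rows in the `head()` output, not verified here on every row), its variance would be 250² = 62,500 times larger. The sketch below checks that against the two variances printed by `X_train.var()`.

```python
# Variances copied from the X_train.var() output.
var_frequency = 3.382982e+01
var_monetary = 2.114364e+06

# If Monetary = 250 * Frequency, then Var(Monetary) = 250**2 * Var(Frequency).
expected_var_monetary = 250 ** 2 * var_frequency
ratio = var_monetary / var_frequency

print(f"expected Var(Monetary): {expected_var_monetary:.0f}")
print(f"observed ratio: {ratio:.1f}")
```

The printed variances match the 62,500 factor to within rounding, which suggests the large variance reflects the unit (c.c. rather than donations) more than any extra spread in the data.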

One way to correct for high variance is to use log normalization.

In [47]:
import numpy as np

X_train_normed, X_test_normed = X_train.copy(), X_test.copy()
col_to_normalize = "Monetary (c.c. blood)"

# Log normalization
for df_ in [X_train_normed, X_test_normed]:
    df_['Monetary_log'] = np.log(df_[col_to_normalize])
    df_.drop(columns = col_to_normalize, inplace=True)

display(X_train_normed.var())

Recency (months)     66.929017
Frequency(times)     33.829819
Time (months)       611.146588
Monetary_log          0.837458
dtype: float64

The variance now looks much better. While the variance of <code>Time (months)</code> is still high, it no longer differs from the others by several orders of magnitude, which means that the data is now ready for logistic regression.
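
As a small illustration of why the log transform compresses the spread, the sketch below applies it to the five Monetary values visible in the `head()` output. With only five rows the numbers will differ from the full training set; the point is the change in magnitude.

```python
import math
import statistics

# Monetary values from the five rows shown in the head() output.
# math.log needs positive inputs; every value here is positive.
monetary = [12500, 3250, 4000, 5000, 6000]

monetary_log = [math.log(value) for value in monetary]

# statistics.variance is the sample variance (n - 1), like pandas .var().
var_raw = statistics.variance(monetary)
var_log = statistics.variance(monetary_log)

print(f"raw variance: {var_raw:.1f}")
print(f"log variance: {var_log:.4f}")
```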

# 3. Logistic Regression

In [48]:
from sklearn import linear_model

logreg = linear_model.LogisticRegression(
    solver='liblinear',
    random_state=42
)

logreg.fit(X_train_normed, y_train)

logreg_auc_score = roc_auc_score(y_test, logreg.predict_proba(X_test_normed)[:, 1])
print(f'\nAUC score: {logreg_auc_score:.4f}')


AUC score: 0.7891


In this notebook, I explored automatic model selection using TPOT, and the AUC score we got was 0.7850. For reference, a model that simply chose 0 all the time would be correct about 76% of the time (per the target incidence), but it could not rank donors at all and would therefore score an AUC of only 0.5. We then log normalized the <code>Monetary (c.c. blood)</code> column in both the training and test data, and the AUC score rose to 0.7891, a relative improvement of about 0.5%. In the field of machine learning, even small improvements in a metric can be important, depending on the purpose.
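
The size of the improvement can be worked out directly from the two reported AUC scores; the sketch below shows both the absolute and the relative gain.

```python
# AUC scores reported in the notebook output.
tpot_auc = 0.7850
logreg_auc = 0.7891

absolute_gain = logreg_auc - tpot_auc
relative_gain = absolute_gain / tpot_auc

print(f"absolute gain: {absolute_gain:.4f}")
print(f"relative gain: {relative_gain:.2%}")
```

Both scores come from a single train/test split, so a difference this small may fall within the variation one would see across different splits.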