## 1. RFMTC Model on Blood Donation
<p><img src="blood_donation.png" style="float: right;" alt="A pictogram of a blood bag with blood donation written in it" width="200"></p>
<p>The original problem concerns a bus that collects blood donations at a single institution (a university). Working with 748 donor data points, the researchers built an RFMTC model that predicted whether or not an individual would donate blood on the next visit to the university with about 79% accuracy.</p>
<p>DataCamp used the case above to develop a project exercise for its students. The project begins with exploratory data analysis and ends with model building.
</p>
<p>
The repository at hand continues that project by providing an API that other applications can consume. The API is currently hosted on my local machine; the next step is to host it in the cloud or on other Linux machines.
</p>

In [1]:
# Imports
import pandas as pd
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt


In [2]:
# Read in dataset
transfusion = pd.read_csv('datasets/transfusion.data')

# Rename target column as 'target' for brevity 
transfusion.rename(
    columns={'whether he/she donated blood in March 2007': "target"},
    inplace=True
)

### The dataset is split into a training set and a test set. The first teaches the model the patterns it needs to learn, and the second evaluates the model's performance. The stratify parameter ensures that the training and test sets keep the same class proportions as the `target` column.

In [3]:
# Split transfusion DataFrame into
# X_train, X_test, y_train and y_test datasets,
# stratifying on the `target` column
X_train, X_test, y_train, y_test = train_test_split(
    transfusion.drop(columns='target'),
    transfusion.target,
    test_size=0.25,
    random_state=42,
    stratify=transfusion.target
)
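
To see what `stratify` does in practice, here is a minimal sketch that runs the same kind of split on synthetic labels and compares the positive-class rate in each part. The data is made up (1,000 samples, 24% positives), and the names `X_tr`, `X_te`, `y_tr` and `y_te` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 1,000 samples, 24% positive class
y = np.array([1] * 240 + [0] * 760)
X = np.arange(len(y)).reshape(-1, 1)  # dummy single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# With stratify, both splits keep roughly the overall positive rate
print(f"overall: {y.mean():.3f}")
print(f"train:   {y_tr.mean():.3f}")
print(f"test:    {y_te.mean():.3f}")
```

Without `stratify`, the positive rate in a small test set can drift away from the overall rate purely by chance, which makes the evaluation less representative.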

### The `Monetary (c.c. blood)` feature has a much higher variance than the other features, so it needs to be normalized. The technique chosen was log normalization.

In [4]:
# Copy X_train and X_test into X_train_normed and X_test_normed
X_train_normed, X_test_normed = X_train.copy(), X_test.copy()

# Specify which column to normalize
col_to_normalize = 'Monetary (c.c. blood)'

# Log normalization
for df_ in [X_train_normed, X_test_normed]:
    # Add log normalized column
    df_['monetary_log'] = np.log(df_[col_to_normalize])
    # Drop the original column
    df_.drop(columns=col_to_normalize, inplace=True)
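
The effect of the log transform on variance can be illustrated with a few made-up values. This is only a sketch: the frequencies are invented, and the 250 c.c. per donation used to derive `monetary` is an assumption rather than a value read from the dataset.

```python
import numpy as np

# Illustrative values only; 250 c.c. per donation is an assumption
frequency = np.array([1, 2, 5, 8, 50])
monetary = frequency * 250

# Natural log, as in the normalization step (inputs must be positive)
monetary_log = np.log(monetary)

var_raw = monetary.var()
var_log = monetary_log.var()
print(f"variance before log: {var_raw:.1f}")
print(f"variance after log:  {var_log:.3f}")
```

The log compresses large values far more than small ones, so a single heavy donor no longer dominates the scale of the feature.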


### TPOT's analysis indicated that logistic regression is the best model for this dataset.

In [5]:
# Instantiate LogisticRegression
logreg = linear_model.LogisticRegression(
    solver='liblinear',
    random_state=42
)

# Train the model
logreg.fit(X_train_normed, y_train)

# AUC score for the logistic regression model
logreg_auc_score = roc_auc_score(y_test, logreg.predict_proba(X_test_normed)[:, 1])
print(f'\nAUC score: {logreg_auc_score:.4f}')


AUC score: 0.7891
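
One way to read an AUC score: it is the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes that fraction by brute force on made-up labels and scores. `pairwise_auc` is a hypothetical helper written for illustration and is not part of scikit-learn.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and scores, not taken from the model above
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, scores))  # 3 of 4 pairs ordered correctly -> 0.75
```

Under this reading, a score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.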


In [6]:
# Checking the predictions, their probabilities, and the actual values.
y_pred_proba = logreg.predict_proba(X_test_normed)[:, 1]

y_pred = logreg.predict(X_test_normed)

# Build the comparison table directly on the test-set index
df_pred = pd.DataFrame(
    {
        "Actual Values": y_test,
        "Prediction Probability": y_pred_proba,
        "Predictions": y_pred,
    },
    index=y_test.index,
)
df_pred.head()

Unnamed: 0,Actual Values,Prediction Probability,Predictions
41,0,0.452876,0
682,0,0.142197,0
532,0,0.434499,0
538,1,0.414502,0
153,1,0.339023,0
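
All five rows above are predicted as 0 because, for a binary logistic regression, `predict` amounts to cutting the positive-class probability at 0.5. The sketch below replays that rule on the five probabilities shown and then tries a lower cutoff. The 0.4 cutoff and the `apply_threshold` helper are hypothetical, chosen only to illustrate the trade-off.

```python
# Positive-class probabilities and actual values copied from the five rows above
probabilities = [0.452876, 0.142197, 0.434499, 0.414502, 0.339023]
actual = [0, 0, 0, 1, 1]

def apply_threshold(probs, threshold):
    """Label a row 1 when its positive-class probability exceeds the threshold."""
    return [int(p > threshold) for p in probs]

default_preds = apply_threshold(probabilities, 0.5)  # same rule as predict() here
lower_preds = apply_threshold(probabilities, 0.4)    # hypothetical cutoff

print(default_preds)  # [0, 0, 0, 0, 0]
print(lower_preds)    # [1, 0, 1, 1, 0]
```

With the lower cutoff, one of the two actual donors (row 538) is caught, at the cost of flagging two non-donors. Whether that trade is worthwhile depends on how the API's predictions will be used.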


In [7]:
# Inspecting the test-set data points next to their actual targets.
# Copy first so that X_test_normed itself is not modified.
df_explorador = X_test_normed.copy()
df_explorador['target'] = y_test
df_explorador.head(5)

Unnamed: 0,Recency (months),Frequency (times),Time (months),monetary_log,target
41,2,5,16,7.130899,0
682,11,2,25,6.214608,0
532,4,8,28,7.600902,0
538,2,8,38,7.600902,1
153,2,1,2,5.521461,1


In [8]:
# Testing a single arbitrary individual.
# Feature order: Recency, Frequency, Time, monetary_log
registro_unico = np.array([2, 15, 30, 90.134544]).reshape(1, -1)  # Input
previsao_registro_unico = logreg.predict(registro_unico)  # Prediction
print(previsao_registro_unico)  # Output

[1]
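
Because the model was trained on `monetary_log` rather than the raw monetary value, a record coming from outside (for example through the API) needs the same transform and the same column order before it reaches `predict`. The sketch below shows one way to do that. `build_record` is a hypothetical helper, and the raw value of 1250 c.c. is an assumption chosen because its logarithm matches the `monetary_log` shown for row 41 of the test set.

```python
import math

def build_record(recency, frequency, time_months, monetary_cc):
    """Return features in the training column order, with Monetary log-transformed."""
    if monetary_cc <= 0:
        raise ValueError("monetary_cc must be positive to take its logarithm")
    return [recency, frequency, time_months, math.log(monetary_cc)]

# Row 41 of the test set: Recency 2, Frequency 5, Time 16.
# A raw Monetary of 1250 c.c. is assumed; its log matches that row.
record = build_record(2, 5, 16, 1250)
print(record)
```

The resulting list can be reshaped to a single row, as in the cell above, before being passed to the model.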


In [9]:
# Exporting the model to be used in the API.
import pickle

# The context manager closes the file once the model is written
with open('modelo_logreg.pickle', 'wb') as f:
    pickle.dump(logreg, f)
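
On the consuming side, the API would load the pickled model back and call `predict` on it. The following sketch checks that round trip on a tiny synthetic model, so it can run without the transfusion data. The file name `model_demo.pickle` and the toy training data are assumptions made for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic stand-in for the trained model (not the transfusion data)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression(solver="liblinear", random_state=42).fit(X, y)

# Write the model to a temporary file, then read it back
path = os.path.join(tempfile.mkdtemp(), "model_demo.pickle")  # hypothetical file name
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should give the same predictions as the original
same = bool((restored.predict(X) == model.predict(X)).all())
print(same)
```

Two practical cautions apply when moving the file to another machine: only unpickle files from a trusted source, and keep the scikit-learn version on the server aligned with the one used for training.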