# Data Science 3 (prediction)

## End to end example

**Teachers:**

* Supervised learning: Bart Barnard (BABA)

**Data files:**

* heart_failure_clinical_records_dataset.csv
* data_description.csv


This notebook demonstrates the steps to build a supervised machine learning model. The following steps are performed:

1. Data is loaded
2. Data is inspected
3. Data is visualized to gain more understanding of the data
4. Data is prepared for training the model
5. Model is trained on the prepared data
6. Model is used to predict on unseen test data
7. Model is evaluated
8. Model is improved




## Supervised learning

Cardiovascular diseases kill approximately 17 million people globally every year, and they mainly manifest as myocardial infarctions and heart failures. Heart failure occurs when the heart cannot pump enough blood to meet the needs of the body. Available electronic medical records of patients quantify symptoms, body features, and clinical laboratory test values, which can be used to perform biostatistical analysis aimed at highlighting patterns and correlations otherwise undetectable by medical doctors. Machine learning can predict patients' survival from their data and can identify the most important features among those included in their medical records [1]. In this notebook we build a machine learning classifier to predict a patient's survival. The goal is to select the most important features for predicting the patient's survival. Data for the analysis is available in `heart_failure_clinical_records_dataset.csv`. The data description can be found in the table `data_description.csv`.


In [None]:
#given code for data description
import pandas as pd
import numpy as np

md = pd.read_csv('data/data_description.csv', sep=';')
md


### Inspect the data
 

Data is loaded. The `time` feature is dropped, since it is not a biometric. After loading the data, the following questions are answered with the help of meaningful visualizations and overviews.

1. Is the dataset balanced? How might that affect the confusion matrix?
2. Based on the datatype of each feature, are there any features that cannot be used in a logistic regression classifier? 
3. Is imputation needed?

In [None]:
#Load the data. Drop the time feature, since this is not a biometric
import pandas as pd
import numpy as np

df = pd.read_csv('data/heart_failure_clinical_records_dataset.csv')
df = df.drop(['time'],axis = 1) 
df.head()


In [None]:
# Are there any features that cannot be used in a logistic regression classifier?
df.info()
# All datatypes need to be numeric to do the calculations.
# All features are floats or integers and can therefore be used in a classifier.

In [None]:
counts = df['DEATH_EVENT'].value_counts()
print(counts)
# Because of the imbalance of the dataset, classifiers tend to obtain better
# prediction scores on the true negative rate (correctly predicting survival)
# than on the true positive rate (correctly predicting death)

# Look the counts up by label (0 = survived, 1 = died) instead of relying on their order
survived, dead = counts.loc[0], counts.loc[1]
print('Number of survived: ', survived)
print('Number of deaths: ', dead)

print('\n')
print('% of survived', round(survived / len(df) * 100, 1), '%')
print('% of death', round(dead / len(df) * 100, 1), '%')


df.DEATH_EVENT.value_counts().plot(kind='barh')
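
The effect of imbalance on the confusion matrix can be illustrated with a majority-class baseline. The sketch below is not part of the original analysis: the labels are hypothetical (70 survivors and 30 deaths, not the real counts) and `DummyClassifier` stands in for any model that leans towards the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 70 survivors (0) and 30 deaths (1); not the real counts
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for this baseline

# A baseline that always predicts the most frequent class (survived)
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_base = baseline.predict(X_toy)

cm_base = confusion_matrix(y_toy, y_base)  # rows: true class, columns: predicted class
print(cm_base)
print("accuracy:", accuracy_score(y_toy, y_base))
print("recall for deaths:", recall_score(y_toy, y_base))
```

The accuracy looks acceptable even though not a single death is detected, which is why accuracy alone is a weak metric on an imbalanced dataset.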

In [None]:
# missing data
df.isnull().sum() 
# no missing data, no imputation needed
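
Had the check above reported missing values, a simple imputation step would be one option. The following is a minimal sketch on a hypothetical toy frame (the values are made up); median imputation is an assumption, not a choice made in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with gaps; the real dataset has no missing values
toy_missing = pd.DataFrame({
    "age": [60.0, np.nan, 75.0],
    "serum_sodium": [137.0, 130.0, np.nan],
})

# Replace every missing value with the median of its column
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy_missing), columns=toy_missing.columns)
print(toy_imputed)
```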

### Feature selection

We create a meaningful visualization to estimate the three most important features. 

In [None]:
import seaborn as sns
c = df.corr().abs()
sns.heatmap(c)
# age, ejection_fraction and serum_creatinine show the lightest colors in the DEATH_EVENT row -> strongest contributors?
# A model based on biometrics is meant for medical use. The data is often expensive to acquire, so
# a simple model with acceptable performance metrics is the most practical. Feature
# selection might also reduce overfitting.
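
Reading colors off a heatmap is imprecise, so it can help to rank the features numerically as well. The helper `rank_by_target_correlation` below is a hypothetical name introduced for illustration, and the toy frame is made up; in the notebook the call would presumably be `rank_by_target_correlation(df, 'DEATH_EVENT')`.

```python
import pandas as pd

def rank_by_target_correlation(frame, target):
    """Return the features sorted by absolute Pearson correlation with the target."""
    corr = frame.corr()[target].drop(target).abs()
    return corr.sort_values(ascending=False)

# Hypothetical toy data, only to show the call
toy_corr = pd.DataFrame({
    "strong": [1, 2, 3, 4, 5, 6],
    "weak": [3, 1, 3, 1, 3, 2],
    "target": [0, 0, 0, 1, 1, 1],
})
ranking = rank_by_target_correlation(toy_corr, "target")
print(ranking)
```

Correlation only captures linear, pairwise relations, so this ranking is a first estimate rather than a final feature selection.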

In [None]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
sns.displot(df, x="serum_creatinine", hue="DEATH_EVENT")
plt.show()

### Build a classifier


We will build a logistic regression classifier. Next we will evaluate the logistic regression classifier using the confusion matrix. Lastly we create a new patient and predict if the patient will survive. 


The df dataframe is the dataframe containing all the raw data (numbers). We will use this to fill the feature matrix $X$ and the vector $y$ with the labeled class. 


\begin{equation}
X =
\begin{bmatrix}
x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & \cdots & x_n^{(1)} \\
x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & \cdots & x_n^{(2)} \\
x_1^{(3)} & x_2^{(3)} & x_3^{(3)} & \cdots & x_n^{(3)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_1^{(m)} & x_2^{(m)} & x_3^{(m)} & \cdots & x_n^{(m)}
\end{bmatrix}
\qquad
y =
\begin{bmatrix}
y^{(1)} \\
y^{(2)} \\
y^{(3)} \\
\vdots \\
y^{(m)}
\end{bmatrix}
\end{equation}
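
In code, $X$ corresponds to an array with $m$ rows (patients) and $n$ columns (features), and $y$ to a vector of length $m$. The sketch below shows this on a hypothetical three-patient frame with made-up values; in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical three-patient frame; the values are made up
toy_patients = pd.DataFrame({
    "age": [75, 55, 65],
    "ejection_fraction": [20, 38, 20],
    "DEATH_EVENT": [1, 0, 1],
})

X_toy = toy_patients.drop(columns="DEATH_EVENT").to_numpy()  # feature matrix, shape (m, n)
y_toy = toy_patients["DEATH_EVENT"].to_numpy()               # label vector, shape (m,)
print(X_toy.shape, y_toy.shape)
```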



In [None]:
# Build model
# Preprocess data
y = np.array(df['DEATH_EVENT'])
X = df.iloc[:, 0:11]  # all 11 features, without feature selection

# Normalise
# Note: the scaler is fitted on all rows before the split, so the test rows influence the scaling
from sklearn.preprocessing import StandardScaler

def normalize(X):
    """Standardize every feature to zero mean and unit variance."""
    scaler = StandardScaler()
    scaler = scaler.fit(X)
    X = scaler.transform(X)
    return X

X = normalize(X)
print(X.shape)

# Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
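
The cell above scales the full dataset before splitting, which lets the test rows influence the scaling. A common alternative is to split first and wrap scaler and classifier in a pipeline, so the scaler is fitted on the training rows only. The sketch below assumes synthetic data from `make_classification` as a stand-in for the real features; the stratified split and the fixed `random_state` are additions for reproducibility, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 11 features and the imbalanced label (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, weights=[0.7, 0.3], random_state=0
)

# Split first; stratify keeps the class ratio equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# The scaler inside the pipeline only ever sees the training rows
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
print("test accuracy:", pipe.score(X_te, y_te))
```

A side benefit is that `pipe.predict` applies the same scaling to any new sample automatically.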

In [None]:
# Optional: verify feature weights
for name, weight in zip(df.columns[:11], model.coef_[0]):
    print(f"{weight:.4}  {name}")
# You will see that age, ejection_fraction and serum_creatinine drive the outcome the most

In [None]:
# evaluate the model
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)  # a separate name avoids shadowing the imported function
print(cm)
print(classification_report(y_test, y_pred))
# Accuracy is not that high. Indeed, negatives are predicted better than positives.
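
To make the comparison between negatives and positives explicit, the four cells of the confusion matrix can be unpacked into sensitivity (true positive rate) and specificity (true negative rate). The labels and predictions below are hypothetical and only illustrate the layout that scikit-learn uses.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (1 = death, 0 = survived)
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels the matrix is [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

sensitivity = tp / (tp + fn)  # share of deaths that were detected
specificity = tn / (tn + fp)  # share of survivors that were recognized
print(f"sensitivity: {sensitivity:.2f}, specificity: {specificity:.2f}")
```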



In [None]:
# Function to evaluate
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

def plot_learning_curves(model, X_train, y_train, X_val, y_val):
    """
    Plot training and validation RMSE against the training set size.

    input:
        model: estimator or pipeline object
        X_train, y_train: training data
        X_val, y_val: test data
    """
    sizes = range(30, len(X_train) + 1)
    train_errors, val_errors = [], []
    for m in sizes:
        # Refit on the first m training samples and record both errors
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train_predict, y_train[:m]))
        val_errors.append(mean_squared_error(y_val_predict, y_val))

    # Plot against the actual training set sizes so the x-axis starts at 30
    plt.plot(sizes, np.sqrt(train_errors), "r-+", linewidth=1, label="training data")
    plt.plot(sizes, np.sqrt(val_errors), "b-", linewidth=1, label="validation data")
    plt.legend(loc="upper right", fontsize=10)
    plt.xlabel("Training set size", fontsize=10)
    plt.ylabel("RMSE", fontsize=10)
    # Compare train versus test accuracy to assess overfitting
    # (after the loop the model is fitted on the full training set)
    print(f'test  acc: {model.score(X_val, y_val)}')
    print(f'train acc: {model.score(X_train, y_train)}')


plot_learning_curves(model, X_train, y_train, X_test, y_test)
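
scikit-learn also ships a `learning_curve` helper that computes cross-validated scores for a range of training set sizes, which gives a less noisy picture than a single split. The sketch below runs it on synthetic data from `make_classification` (an assumption, standing in for the real features) and reports accuracy rather than RMSE.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the heart failure data (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, weights=[0.7, 0.3], random_state=0
)

# Five training set sizes, each evaluated with 5-fold cross-validation
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="accuracy",
)

# Average over the folds to get one score per training set size
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:3d}  train acc={tr:.3f}  validation acc={va:.3f}")
```

A large, persistent gap between the training and validation scores would point to overfitting.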

In [None]:
# New model with feature selection: age, ejection_fraction and serum_creatinine
y = np.array(df['DEATH_EVENT'])
X = df.iloc[:, [0, 4, 7]]

# Normalise
from sklearn.preprocessing import StandardScaler

def normalize(X):
    """Standardize every feature to zero mean and unit variance."""
    scaler = StandardScaler()
    scaler = scaler.fit(X)
    X = scaler.transform(X)
    return X

X = normalize(X)

# Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train LogisticRegression
from sklearn.linear_model import LogisticRegression
new_model = LogisticRegression()
new_model.fit(X_train, y_train)

# Evaluate the model. Is your assumption about the contributing features correct based on the outcome?
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

y_pred = new_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)  # a separate name avoids shadowing the imported function
print(cm)
print(classification_report(y_test, y_pred))

# Optional: inspect the feature weights
print(new_model.coef_)

In [None]:
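# The model was trained on standardized features, so the new patient
# (age, ejection_fraction, serum_creatinine) is scaled the same way first
new_patient = StandardScaler().fit(df.iloc[:, [0, 4, 7]].to_numpy()).transform([[25, 25, 2.0]])
print(f"survival probability: {new_model.predict_proba(new_patient)[0][0]:.3f}")  # column 0 = class 0 (survived)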

### Improve the model

What might be another classifier suitable for the dataset and the problem? 

In [None]:
# 1. Decision trees often perform well on imbalanced datasets because their hierarchical structure allows
#    them to learn signals from both classes.
# 2. Cross-validation is advisable because of the small sample size.
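
Both points can be tried out together. The sketch below compares a logistic regression pipeline with a tree-based random forest using stratified 5-fold cross-validation. It runs on synthetic data from `make_classification` (an assumption), and scoring on recall is a choice made here because missed deaths are the costly error, not something prescribed by the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart failure data (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, weights=[0.7, 0.3], random_state=0
)

# Stratified folds keep the class ratio the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
candidates = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, estimator in candidates.items():
    # One recall score per fold for the positive class
    results[name] = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="recall")
    print(f"{name}: recall {results[name].mean():.3f} +/- {results[name].std():.3f}")
```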

In [None]:
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier, VotingClassifier

In [None]:
# Improve the model, for instance with AdaBoost (applicable to binary classes)
adb = AdaBoostClassifier(LogisticRegression(), n_estimators=10, learning_rate=1.0)
adb.fit(X_train, y_train)

y_pred = adb.predict(X_test)

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)  # a separate name avoids shadowing the imported function
print(cm)
print(classification_report(y_test, y_pred))
