# UCLAIS Tutorial Series Challenge 1

We are proud to present you with the first challenge of the 2022-23 UCLAIS tutorial series: brain stroke prediction. You will be introduced to a variety of core concepts in machine learning and their implementation using `scikit-learn`. 

This Jupyter notebook will guide you through the various general stages involved in end-to-end machine learning projects, including data visualisation, data preprocessing, model selection, model training and model evaluation. Finally, you will get the chance to submit your results to [DOXA](https://doxaai.com/).

If you do not already have a DOXA account, you will want to [sign up](https://doxaai.com/sign-up) before proceeding.


## Background & Motivation

![title](https://www.cdc.gov/stroke/images/Stroke-Medical-Illustration.jpg?_=77303?noicon)

**Background**: A stroke occurs when the blood supply to the brain is blocked or when a blood vessel within the brain bursts. Brain cells can begin to die within minutes, so a stroke is a medical emergency. A stroke can cause lasting brain damage, long-term disability and even death, so prompt treatment is crucial to minimise brain damage and other complications.

**Objective**: Our objective is to predict whether a person has had a stroke, given some information about them.

**Dataset**: The dataset is based on the following [stroke prediction dataset](https://www.kaggle.com/datasets/zzettrkalpakbal/full-filled-brain-stroke-dataset).

## Machine Learning Workflow

![title](https://miro.medium.com/max/1400/0*V0GyOt3LoDVfY7y5.png)

As you already know, the machine learning process covers a wide set of steps. As you go through this notebook, try to keep in mind which stage we are dealing with at that moment and what we are trying to achieve.

As you reach the end of the notebook, you will notice that the sixth step (parameter tuning) from the figure above is missing; this is a challenge for you! Be creative and try to learn something new as you implement your ideas. 

There are a lot of helpful resources online you can use, such as the excellent `scikit-learn` [documentation](https://scikit-learn.org/stable/getting_started.html). This will hopefully allow you to improve your score on the DOXA scoreboard!

## Installing and Importing Useful Packages

To get started, we will install a number of common machine learning packages.

In [None]:
%pip install -U numpy pandas matplotlib seaborn scikit-learn doxa-cli

In [None]:
# Import relevant libraries
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Import relevant sklearn classes/functions related to data preprocessing
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OrdinalEncoder

# Import relevant sklearn classes related to machine learning models
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso
from sklearn.svm import SVC, SVR, NuSVC
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier

# Import relevant sklearn class/function related to evaluation
from sklearn.metrics import accuracy_score, f1_score, ConfusionMatrixDisplay         

%matplotlib inline

## Data Loading

In [None]:
# Import the training dataset
data_original = pd.read_csv("./data/train.csv")  # Change the path accordingly

# We then make a deep copy of the dataset that we can manipulate
# and process while leaving the original intact
data = data_original.copy()

## Data Understanding & Visualisation

Before we start training a machine learning model, it is important to look at the dataset we will be using and understand it. This will give us some insight into which models, hyperparameters and loss functions are suitable for the problem we are dealing with.

In [None]:
# Let's see the first 15 entries of our dataset
data.head(15)

In [None]:
# View the size and shape of our training data
print(f"Shape: {data.shape}\n")

# Display the list of features we have
print(f"List of features: {data.columns}\n")

# Check for any missing values
print("Missing values: ")
print(data.isna().sum())

From the dataframe and simple analysis above, there are several things we can observe:

- There are 10 features (excluding `stroke`, which we are trying to predict) and 4500 samples
- The features in our dataset include both numerical and categorical values
- The range of the numerical features in our dataset varies significantly
- There are no missing data values in our dataset
- We are dealing with a binary classification problem, where the output is either 0 or 1

One of the most important findings from listing even just the first 15 values of our dataset is that we are dealing with an imbalanced classification problem, where the output (whether a person has had a stroke or not) is heavily skewed towards 0 (i.e. not having had a stroke).

In [None]:
num_stroke_false = len(data[data["stroke"] == 0])
num_stroke_true = len(data[data["stroke"] == 1])

print(f"Number of people who have not had a stroke: {num_stroke_false}")
print(f"Number of people who have had a stroke: {num_stroke_true}")

In fact, almost 95% of our sample has a label of 0, which indicates that most people in our dataset have not had a stroke.
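A quick way to quantify the imbalance is `value_counts(normalize=True)`. The sketch below uses a small made-up label column (not the real dataset) purely to show the call; in the notebook, the same call would be made on `data["stroke"]`.

```python
import pandas as pd

# Toy labels standing in for data["stroke"]: 19 zeros and 1 one (illustrative only)
toy_labels = pd.Series([0] * 19 + [1], name="stroke")

# Absolute number of samples in each class
class_counts = toy_labels.value_counts()

# Proportion of samples in each class
class_proportions = toy_labels.value_counts(normalize=True)

print(class_counts)
print(class_proportions)
```

In this toy example, 95% of the labels are 0, which mirrors the kind of skew described above.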

## Data Visualisation

In general, we know that as the age of a person increases, the chance of that person having a stroke also increases. 

But, is this true? And does it apply to this dataset? We can verify this correlation by producing a plot of the rate of having a stroke against age.

In [None]:
fig = plt.figure(figsize=(10, 5))
ax0 = fig.add_subplot()

# Work with an integer copy of the ages so that the original column is left untouched
ages = data["age"].astype(int)

# Calculate the cumulative stroke rate: for each age i, the proportion of
# people younger than i who have had a stroke
rate = []
for i in range(ages.min() + 1, ages.max() + 2):
    younger = data.loc[ages < i, "stroke"]
    rate.append(younger.sum() / len(younger))

# Draw a line plot
sns.lineplot(data=rate, ax=ax0)

# Remove the top, right, and left surrounding line for aesthetic purposes
for s in ["top", "right", "left"]:
    ax0.spines[s].set_visible(False)

# Adjust the tick appearance for aesthetic purposes
ax0.tick_params(axis="both", which="major", labelsize=8)
ax0.tick_params(axis="both", which="both", length=0)

# Add some text on the figure
ax0.text(
    -3,
    0.055,
    "Stroke Risk by Age",
    fontsize=18,
    fontfamily="serif",
    fontweight="bold",
)
ax0.text(
    -3,
    0.047,
    "As people age, the risk of having a stroke increases",
    fontsize=14,
    fontfamily="serif",
)

plt.show()


Yep - as expected, the higher the age, the higher the chance of having a stroke. 
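A complementary view is the stroke rate within each age band, rather than a single curve. The sketch below shows one possible way to do this with `pd.cut` and `groupby`; the ages, labels and band edges are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up ages and labels (illustrative only, not the real dataset)
toy = pd.DataFrame(
    {
        "age": [5, 12, 25, 33, 41, 48, 55, 62, 70, 79],
        "stroke": [0, 0, 0, 0, 0, 1, 0, 1, 1, 1],
    }
)

# Bucket the ages into 20-year bands: [0, 20), [20, 40), [40, 60), [60, 80)
toy["age_band"] = pd.cut(toy["age"], bins=[0, 20, 40, 60, 80], right=False)

# The mean of a 0/1 label within each band is the stroke rate for that band
rate_by_band = toy.groupby("age_band", observed=True)["stroke"].mean()

print(rate_by_band)
```

On the real data, the band edges would need to be chosen so that each band contains enough samples for the rate to be meaningful.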

## Data Preprocessing

There are a few more things we need to do before we can start training a machine learning model. Among them are the following:
- Converting categorical data into numerical data
- Standardising the numerical features so that they are on a comparable scale

In [None]:
# Find categorical features, along with their values
# We do this by exploiting the fact that categorical features have a data type of 'object'

for col in data.columns:
    if data[col].dtype == "object":
        print(col, data[col].unique())

In [None]:
# Encode categorical values
data["gender"] = data["gender"].replace({"Male": 0, "Female": 1}).astype(np.uint8)
data["ever_married"] = (
    data["ever_married"].replace({"No": 0, "Yes": 1}).astype(np.uint8)
)
data["work_type"] = (
    data["work_type"]
    .replace(
        {
            "Private": 0,
            "Self-employed": 1,
            "Govt_job": 2,
            "children": 3,
            "Never_worked": 4,
        }
    )
    .astype(np.uint8)
)
data["Residence_type"] = (
    data["Residence_type"].replace({"Rural": 0, "Urban": 1}).astype(np.uint8)
)
data["smoking_status"] = (
    data["smoking_status"]
    .replace({"formerly smoked": 0, "smokes": 1, "never smoked": 2, "Unknown": 3})
    .astype(np.uint8)
)


In [None]:
# Let's check our dataset to see whether our categorical data has correctly been changed into numerical data
data.head()

Cool! Now, we have converted the categorical features in our dataset into numerical features. A faster way of doing this is by using the [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder) class provided by scikit-learn.
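As a rough sketch of the `OrdinalEncoder` approach, the example below encodes two of the categorical columns on a few made-up rows (not the real dataset). The explicit `categories` argument is there to reproduce the manual mapping used above; without it, `OrdinalEncoder` orders the categories alphabetically, so `"Female"` would become 0 rather than 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up rows using two of the dataset's categorical columns (illustrative only)
toy = pd.DataFrame(
    {
        "gender": ["Male", "Female", "Female", "Male"],
        "ever_married": ["No", "Yes", "No", "Yes"],
    }
)

# Passing explicit category orders reproduces the manual mapping
# (Male -> 0, Female -> 1; No -> 0, Yes -> 1)
encoder = OrdinalEncoder(categories=[["Male", "Female"], ["No", "Yes"]])
encoded = encoder.fit_transform(toy[["gender", "ever_married"]])

print(encoded)
```

A fitted encoder can later be applied to new data with `encoder.transform(...)`, which avoids repeating the mapping by hand.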

**Challenge**: have a think about the way we have encoded our categorical data - what are the potential consequences of encoding categorical data in this way? What might be a better way of encoding this type of data? We'll discuss this further later on in the notebook.

Next, let's standardise the numerical features by subtracting the mean and scaling the data to unit variance.

In [None]:
scaler = StandardScaler()
data[["age", "avg_glucose_level", "bmi"]] = scaler.fit_transform(
    data[["age", "avg_glucose_level", "bmi"]]
)

# Verify that our features have been standardised
data.head()
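To make the standardisation step concrete: each value $x$ is replaced by $z = (x - \mu) / \sigma$, where $\mu$ is the mean of the feature and $\sigma$ is its standard deviation. The sketch below checks this against `StandardScaler` on a handful of made-up values (not taken from the dataset).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up glucose-like values (illustrative only)
values = np.array([[80.0], [95.0], [110.0], [200.0]])

# Standardise with scikit-learn
scaled = StandardScaler().fit_transform(values)

# The same computation by hand: subtract the mean, then divide by the
# population standard deviation (ddof=0, which is what StandardScaler uses)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
```

After standardisation, the feature has a mean of (approximately) 0 and a standard deviation of 1.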


Now that our data has been standardised and all our features are numerical, we are very close to training our first machine learning model.

All that is left to do is the following:

1. **Separate the input features and the output label**: this is an important requirement when training our models - we don't want to train our `scikit-learn` models on the data we are trying to predict!

2. **Split our data into training and test sets**: the training set is the dataset on which our models will be trained. After training our models, we then test them on our newly created test set.

We will use the **empirical error** from evaluating our models on the test set as a proxy for the **generalisation error**: a measure of how accurately an algorithm can predict outcomes for unseen data (which is what we are trying to do eventually!). It will also provide us with a useful tool for comparing the different models we have trained so that we can decide which model to use for our submission to DOXA. Bam!

In [None]:
# Separate our data into X, which contains all the features in our dataset and y, which contains only the output/label (stroke)
X = data.drop(columns=['stroke'])
y = data['stroke']

In [None]:
# Verify that we have correctly separated the features and the output by looking at the shape of X and y
print(X.shape)
print(y.shape)

In [None]:
# Split our features and output into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# In this case, the test_size parameter is equal to 0.2, so our test set will
# have 20% of the data, while the training set will have the other 80% of the data

# TODO: try changing the test_size parameter and see whether it impacts the performance of our model
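Because the labels are heavily imbalanced, a purely random split can leave the test set with very few positive samples. One option worth trying is the `stratify` argument of `train_test_split`, sketched below on made-up data with 5% positives (the array names here are hypothetical and are not the notebook's `X` and `y`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 5% positives (illustrative only)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 95 + [1] * 5)

# stratify=y_toy keeps the class proportions the same in both splits;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

print(y_tr.mean(), y_te.mean())
```

With stratification, both splits keep the same 5% share of positive samples as the full toy dataset.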

In [None]:
# Verify that the operation ran as intended by checking the shape of the split dataset
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

## Model Training

This is where the magic begins. As an example, we will train three different models on our dataset and choose the best one for submission later. The models we will be testing out are [logistic regression models](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), [support vector machines](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) and [decision trees](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

Bear in mind that each of the different types of model has its own set of hyperparameters that you can tune to improve performance. Do check out the documentation for each type of model!

In [None]:
clf_lr = LogisticRegression()
clf_lr.fit(X=X_train, y=y_train)

clf_svm = SVC()
clf_svm.fit(X=X_train, y=y_train)

clf_tree = DecisionTreeClassifier()
clf_tree.fit(X=X_train, y=y_train)

## Model Testing
Now that we have trained our machine learning models, we can test them on our test set.

In [None]:
# Use the .predict() method to predict output values for our test set
lr_predicted = clf_lr.predict(X_test)
svm_predicted = clf_svm.predict(X_test)
tree_predicted = clf_tree.predict(X_test)

In [None]:
# We will be using the accuracy_score() function as our evaluation metric, which simply calculates 
# the number of predictions that are correct and divides it by the total number of predictions.
# Note that scikit-learn metrics expect the true labels first, followed by the predictions.
lr_accuracy = accuracy_score(y_test, lr_predicted)
svm_accuracy = accuracy_score(y_test, svm_predicted)
tree_accuracy = accuracy_score(y_test, tree_predicted)

print("Accuracy (Logistic Regression): ", lr_accuracy)
print("Accuracy (SVM): ", svm_accuracy)
print("Accuracy (Decision Tree): ", tree_accuracy)

Neat! We can see that the logistic regression model and the SVM performed equally well (with about 95% accuracy), while our decision tree has slightly worse performance.

Let's put things back into perspective. Right now, we are dealing with an imbalanced classification problem in which only about 5% of our outputs have a value of 1; thus, we could easily achieve 95% accuracy just by always outputting 0! There is definitely a long way to go with regard to producing a model that is able to predict accurately whether a person has had a stroke or not.
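To see how misleading accuracy can be here, the sketch below fits scikit-learn's `DummyClassifier`, which ignores its inputs and always predicts the majority class, on made-up labels with 5% positives (not the real dataset), and compares accuracy with recall and the F1 score.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Made-up labels with 5% positives (illustrative only)
X_toy = np.zeros((100, 1))
y_toy = np.array([0] * 95 + [1] * 5)

# A "model" that ignores its input and always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
y_pred = baseline.predict(X_toy)

baseline_accuracy = accuracy_score(y_toy, y_pred)
baseline_recall = recall_score(y_toy, y_pred)
# zero_division=0 avoids a warning when no positive predictions are made
baseline_f1 = f1_score(y_toy, y_pred, zero_division=0)

print(baseline_accuracy, baseline_recall, baseline_f1)
```

The baseline reaches 95% accuracy while finding none of the positive cases, so metrics such as recall or F1 are a more honest yardstick for this problem.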

In [None]:
# We can analyse the predictions further by plotting a confusion matrix for each model
disp = ConfusionMatrixDisplay.from_predictions(y_true=y_test, y_pred=lr_predicted)
disp.ax_.set_title("Confusion Matrix (Logistic Regression)")

disp = ConfusionMatrixDisplay.from_predictions(y_true=y_test, y_pred=svm_predicted)
disp.ax_.set_title("Confusion Matrix (SVM)")

disp = ConfusionMatrixDisplay.from_predictions(y_true=y_test, y_pred=tree_predicted)
disp.ax_.set_title("Confusion Matrix (Decision Tree)")

plt.show()

## Preparing our DOXA Submission

Once we are confident with the performance of our model, we can start deploying it on the real test dataset for submission to DOXA! 

In [None]:
# First, let's import our test dataset and save it in a variable called data_test
data_test = pd.read_csv("./data/test.csv")          # Change the path accordingly 

Then, we must preprocess the dataset before feeding it into the trained model. The preprocessing steps include: 
1. Converting categorical data into numerical data
2. Standardising numerical data that has a large range

In [None]:
# Encoding categorical values as we did before
data_test["gender"] = (
    data_test["gender"].replace({"Male": 0, "Female": 1}).astype(np.uint8)
)
data_test["ever_married"] = (
    data_test["ever_married"].replace({"No": 0, "Yes": 1}).astype(np.uint8)
)
data_test["work_type"] = (
    data_test["work_type"]
    .replace(
        {
            "Private": 0,
            "Self-employed": 1,
            "Govt_job": 2,
            "children": 3,
            "Never_worked": 4,
        }
    )
    .astype(np.uint8)
)
data_test["Residence_type"] = (
    data_test["Residence_type"].replace({"Rural": 0, "Urban": 1}).astype(np.uint8)
)
data_test["smoking_status"] = (
    data_test["smoking_status"]
    .replace({"formerly smoked": 0, "smokes": 1, "never smoked": 2, "Unknown": 3})
    .astype(np.uint8)
)


In [None]:
# Standardise numerical features, reusing the scaler that was fitted on the
# training data so that both datasets are scaled in exactly the same way
data_test[["age", "avg_glucose_level", "bmi"]] = scaler.transform(
    data_test[["age", "avg_glucose_level", "bmi"]]
)

# Output the shape of our submission dataset
print(f"Shape: {data_test.shape}")

# Verify that our features have been standardised
data_test.head()


Once we have redone all the preprocessing stages, we can proceed to do inference on the DOXA submission test dataset.

In [None]:
# We will choose the logistic regression model
predictions = clf_lr.predict(data_test)

predictions

In [None]:
predictions.shape

It seems that the output has the shape it should have, with 481 entries, so we are now ready to submit our predictions.

In [None]:
os.makedirs("submission", exist_ok=True)

with open("submission/y.txt", "w") as f:
    f.writelines([f"{prediction}\n" for prediction in predictions])

with open("submission/doxa.yaml", "w") as f:
    f.write("competition: uclais-1\nenvironment: cpu\nlanguage: python\nentrypoint: run.py")

with open("submission/run.py", "w") as f:
    f.write("with open('y.txt', 'r') as f: print(f.read().strip())")


## Submitting to DOXA

Before you can submit to DOXA, you must first ensure that you are enrolled for the challenge on the DOXA website. Visit [the challenge page](https://doxaai.com/competition/uclais-1) and click "Enrol" in the top-right corner.

You can then log in using the DOXA CLI by running the following command:

In [None]:
!doxa login

You can then submit your results to DOXA by running the following command:

In [None]:
!doxa upload submission

Yay! You have (probably) just uploaded your first submission to DOXA! Take a moment to see where you are on the [scoreboard](https://doxaai.com/competition/uclais-1)!

## Possible Improvements

Our model is not that good at predicting stroke since it mainly just outputs 0, so there is definitely scope for improvement! Here are a few ways we could improve the process:

**1. Data Visualisation**
- Visualise other features as well (rather than just age) to see what other features correlate with a person having a stroke. We could potentially produce a correlation matrix.
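As a starting point for the correlation matrix idea, the sketch below computes pairwise Pearson correlations with `DataFrame.corr()` on a few made-up rows (not the real dataset) and ranks the features by their correlation with the label.

```python
import pandas as pd

# Made-up numerical columns (illustrative only, not the real dataset)
toy = pd.DataFrame(
    {
        "age": [20, 35, 50, 65, 80],
        "avg_glucose_level": [85.0, 90.0, 110.0, 130.0, 150.0],
        "stroke": [0, 0, 0, 1, 1],
    }
)

# Pearson correlation between every pair of columns
corr = toy.corr()

# Correlation of each feature with the label, strongest first
corr_with_stroke = corr["stroke"].drop("stroke").sort_values(ascending=False)

print(corr_with_stroke)
```

In the notebook, the resulting matrix could be drawn with `sns.heatmap(corr, annot=True)`. Bear in mind that Pearson correlation only captures linear relationships and is hard to interpret for ordinally encoded categories.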

**2. Data Preprocessing**
- Apply the [PCA algorithm](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), which projects the input features onto a smaller number of orthogonal directions (principal components) that capture as much of the variance in the data as possible. The number of components `n` to keep is a hyperparameter, so tuning needs to be done to get the best performance.

- Perhaps an ordinal encoding is not the most appropriate for our categorical data. We could, for example, try using a [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) instead!

**3. Dealing with Imbalanced Dataset**
- The challenge of working with imbalanced datasets is that many ML models will just ignore the minority class (as you can see if you look at the confusion matrices for the SVM and logistic regression models from earlier).
- One approach to address this is to oversample the minority class. The simplest approach involves duplicating examples in the minority class, although these will not create any new meaningful information for the model. Instead, new examples can be synthesised from the existing examples. This type of data augmentation is referred to as the [Synthetic Minority Oversampling Technique](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/) (SMOTE).
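The simplest variant, duplicating minority examples at random, can be sketched with `sklearn.utils.resample` as below. This is plain random oversampling on made-up data, not SMOTE; SMOTE itself is provided by the separate `imbalanced-learn` package, which is not among the packages installed at the start of this notebook.

```python
import numpy as np
from sklearn.utils import resample

# Made-up training data with 5% positives (illustrative only)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 95 + [1] * 5)

# Separate the minority and majority classes
X_min, y_min = X_toy[y_toy == 1], y_toy[y_toy == 1]
X_maj, y_maj = X_toy[y_toy == 0], y_toy[y_toy == 0]

# Sample the minority class with replacement until it matches the majority class
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=len(y_maj), random_state=0
)

# Recombine into a balanced training set
X_balanced = np.vstack([X_maj, X_min_up])
y_balanced = np.concatenate([y_maj, y_min_up])

print(np.bincount(y_balanced))
```

Oversampling should only ever be applied to the training split; the test split must keep its original class balance so that the evaluation remains realistic.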

**4. Model Selection**
- In our example, we have looked at implementing [logistic regression models](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression), [support vector machines](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) and [decision trees](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). Each of these models has its own set of hyperparameters that you can tune to improve model performance. The links above will take you to the relevant `scikit-learn` documentation pages, where you can discover more about the hyperparameters of each type of model.
- On top of that, there are many more machine learning model types that you can try out and see whether accuracy improves or not. Indeed, there are even ensemble methods that use multiple machine learning models under the hood! 
- If you look at the different machine learning models being imported at the start of the notebook, you will notice that there are quite a few which have not been used. This might be a good starting point!
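One possible way to approach hyperparameter tuning is a cross-validated grid search with `GridSearchCV`, sketched below. The data is synthetic (generated with `make_classification` as a stand-in for `X_train` and `y_train`), and the grid values are arbitrary choices for illustration rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for X_train / y_train (illustrative only)
X_toy, y_toy = make_classification(
    n_samples=400, n_features=6, weights=[0.9, 0.1], random_state=0
)

# Candidate hyperparameter values (chosen arbitrarily for this sketch)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "class_weight": [None, "balanced"]}

# 5-fold cross-validation, scored by F1 rather than accuracy
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring="f1", cv=5
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

Scoring by F1 instead of accuracy is a deliberate choice here, since accuracy rewards always predicting the majority class. After fitting, `search.best_estimator_` is a model refitted on all of the supplied data with the best parameters found.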

And perhaps, many more...