# **RAAN UK F2F Guided Data Challenge --- No Code Worksheet** 

Welcome to the Roche UK DSC guided data challenge! If you are completely new to the world of data science, we hope this challenge will give you the guidance and support you need to kick-start your data science journey. If you consider yourself a guru, we hope this challenge provides some prompts to push your knowledge to the next level and gets you thinking about ways to advance your skill set. 

This challenge has been written with the expectation of _some_ prior Python experience, though by no means extensive. If you are completely new to Python, the `completed_code.ipynb` notebook will walk you directly through all the code you will need, or else provide links to where you can find it. If you get stuck with any programming, please don't hesitate to ask your team members and supervisors for help. We're all learning together! 

For our challenge today, imagine you have been handed a large dataset of digitized images of FNAs (fine needle aspirates) of breast masses and tasked with predicting whether an image shows benign or malignant cells. Being the savvy data analysts you all are, you decide to employ a machine learning solution.

By the time you finish this challenge, you will be able to:

* Load the data, understand what is in it and clean it.
* Visualise data.
* Split data into training and test sets.
* Establish a baseline model using a random forest (an ensemble of decision trees) or an SVM.
* Improve a baseline machine learning model using feature engineering and hyperparameter optimisation.
* Program in Python!
 





### Load Packages 

This Python 3 environment comes with many helpful analytics libraries installed. It is defined by the [kaggle/python docker image](https://github.com/kaggle/docker-python). The cell below loads several packages that we will use later on. 
Run the code below (by clicking run or pressing Shift+Enter).
 

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # data visualization library  
import matplotlib.pyplot as plt
import time
from subprocess import check_output
import warnings # import warnings library
warnings.filterwarnings('ignore') # ignore all warnings
# Any results you write to the current directory are saved as output.

### Loading the Data
To get started we first need to access our dataset, which you should have been sent as a CSV file. Upload this file to Colab by opening the Files tab on the left, pressing the "upload" button and selecting `data.csv`. Then load it into a pandas dataframe below using `pd.read_csv()`.


Here are some websites that can help you import a csv file into python: 

[How to import a csv into python using pandas](https://datatofish.com/import-csv-file-python-using-pandas/)

[How to read and write csv files with python](https://stackoverflow.com/questions/41585078/how-do-i-read-and-write-csv-files-with-python)


In [2]:
# Load the data into a pandas dataframe called `data`
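
If you have not used `pd.read_csv()` before, here is a minimal sketch on a made-up three-row CSV. The rows and the `toy` name are invented for illustration; your real `data.csv` has many more rows and columns.

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv (values invented for illustration)
csv_text = """id,diagnosis,radius_mean,texture_mean
101,M,17.99,10.38
102,B,13.54,14.36
103,B,12.45,15.70
"""

# pd.read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))
print(toy.shape)   # (number of rows, number of columns)
print(toy.dtypes)  # diagnosis is read in as text, the other columns as numbers
```

With the real file you would pass the file name (for example `"data.csv"`) instead of the `io.StringIO` object.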

### Understanding and viewing the data
Now that we have access to the data, we want to see what is inside it to check that everything is in order and ready to work with. Since it is quite a large dataset, we will save some time by printing only the first and last 5 observations. Once you manage to do this, look through the data and note down anything that looks suspicious. Note that indexing in Python starts from 0 (we start counting from 0, not 1). We can also look at the statistics of each column in the data by calling the `describe` function.

Helpful links:

[How to view a portion of observations in a python dataset.](https://appdividend.com/2020/05/26/pandas-dataframe-head-method-in-python/) 


In [3]:
# Look at the top 5 or ten elements in your dataframe.
# This can be done by using the dataframe.head() function! How would you look at the last 5?

In [4]:
# Look at a statistical summary of your dataframe using dataframe.describe
# Use the argument include="all" to summarise every column, not just the numeric ones

### A brief aside: working with pandas dataframes

It's worthwhile trying to become comfortable with pandas dataframes, both in terms of selecting data, setting data, and various operations that you can perform on the data. In this way you can, as we shall soon see, remove rows or entire columns with missing values, create new features, and many other things. A link to a quick overview can be found in the official pandas docs: [10 minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html). Have a read, and use the below code box to experiment with selecting and setting data.

In [5]:
# You can make a copy of your dataframe using
# temp_data = data.copy()
# This way you can experiment without risking losing your data, though any future work done on `temp_data`
# will not affect `data` unless you do it to `data` too!

# There are two ways to select data: by label (.loc) and by integer position (.iloc). Try them out here

# How about selecting data by boolean values? Try selecting all rows with a radius_mean over 13.37

# Can you create a new column called area_estimated where you estimate the area based off of the radius_mean
# (assuming the area follows the equation for the area of a circle)?
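
The selection styles mentioned above can be sketched on a tiny made-up dataframe. The values and row labels below are invented; only the column names mirror the real data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"radius_mean": [10.0, 14.0, 20.0], "texture_mean": [15.0, 18.0, 22.0]},
    index=["a", "b", "c"],
)

by_label = toy.loc["b", "radius_mean"]  # row label "b", column label "radius_mean"
by_position = toy.iloc[1, 0]            # second row, first column (counting from 0)

# A boolean mask keeps only the rows where the condition is True
large = toy[toy["radius_mean"] > 13.37]

# New column built from an existing one: area of a circle, pi * r^2
toy["area_estimated"] = np.pi * toy["radius_mean"] ** 2
```

`.loc` and `.iloc` pick out the same cell here because row `"b"` happens to be the second row.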



### 5 things you hopefully should have noticed from the summaries:

1)  <b>id</b> cannot be used for classification, so it should be removed (discuss why with your team).

2)  <b>diagnosis</b> is the class label, and it is stored as a string.

3)  The <b>Unnamed: 32</b> feature contains NaN values and is not needed.

4)  We currently have no other knowledge about which features are necessary to keep.

5)  Data exists at different scales; some features have a mean of less than 1, some are over 1000!
 

### What do we do with this knowledge? 

Hint 1: Let's make our life easier by making the diagnosis a binary variable (1 for Malignant, 0 for Benign)

[Hint 2: Some columns will need to be dropped](https://www.geeksforgeeks.org/how-to-drop-one-or-multiple-columns-in-pandas-dataframe/) 

Hint 3: One of these columns will need to be saved 
e.g. y = data.(column you want to keep)

Hint 4: Some models really don't like it when input features are different orders of magnitude (others work perfectly well!)



In [6]:
# Remap the diagnosis column.
# There are a few ways of doing this.
# One "pythonic" way of doing it is using a dictionary and the map function
# d = {"M": 1, "B": 0}
# data["diagnosis"].map(d)
# How would you ensure this change is saved?


# Make a new variable `y` containing just the diagnoses

# Remove the "id" and "Unnamed: 32" columns

# We will do the rescaling later on, once we've split the data!
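
As a hedged illustration of the hints above, here is how remapping and dropping columns behave on a made-up three-row frame. The key point is that both `map` and `drop` return new objects, so the result has to be assigned back to be kept.

```python
import numpy as np
import pandas as pd

# Invented rows; only the column names mirror the real data
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# map returns a new Series, so assign it back to keep the change
toy["diagnosis"] = toy["diagnosis"].map({"M": 1, "B": 0})

y = toy["diagnosis"]                           # class labels saved for later
toy = toy.drop(columns=["id", "Unnamed: 32"])  # drop also returns a new DataFrame
```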

### Compare the number of benign vs malignant cases 

Next we would like you to compare the number of benign vs malignant cancer cases using your saved variable and plot the result.

[Working out index value using pandas](https://www.geeksforgeeks.org/python-pandas-index-value_counts/?ref=lbp)

[How to create a seaborn.countplot](https://seaborn.pydata.org/generated/seaborn.countplot.html)



In [7]:
# Use seaborn countplot to plot a histogram of the different cases

# You can see how many of each value there are by using Series.value_counts()
# (where series simply means you are referencing a particular column instead of a whole dataframe)

## Visualising the data

The ground work is done for the data exploration!!

To understand the hidden patterns, it's always good to see the big picture first and then dive deep into the data. So let's begin by visualising patterns across the whole dataset and then move to feature-level understanding. This will give us a better idea of which features are important to us. In these next few steps you will be making box plots, heatmaps, violin plots and swarm plots!

This data has 3 separate feature sections: x_mean, x_se and x_worst

Try to separate the data into these 3 sections below. [Hint here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) 


In [8]:
# splitting into X_mean, X_se and X_worst
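
Two possible ways to slice columns, sketched on an invented six-column frame: by position with `iloc`, or by column-name suffix. The `toy_*` names are placeholders, and the column positions in your real dataframe will differ.

```python
import pandas as pd

toy = pd.DataFrame({
    "radius_mean": [1.0, 2.0], "texture_mean": [3.0, 4.0],
    "radius_se": [0.1, 0.2], "texture_se": [0.3, 0.4],
    "radius_worst": [5.0, 6.0], "texture_worst": [7.0, 8.0],
})

# By position: all rows (":"), columns 0 and 1 (the end of a slice is excluded)
toy_mean = toy.iloc[:, 0:2]

# By name: keep the columns whose name ends with a given suffix
toy_se = toy.loc[:, toy.columns.str.endswith("_se")]
toy_worst = toy.loc[:, toy.columns.str.endswith("_worst")]
```

Selecting by suffix is less fragile if columns have been added or removed, whereas `iloc` relies on the column order being exactly what you expect.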

### Correlation of features

First we will look at the correlation (the mutual relationship between two or more things) of features with one another.
For this we will use box plots and heatmaps. This matters because we don't want to include two features that are strongly correlated
with each other, as this can unbalance the machine learning algorithm that we will use later. To learn more about this click
[here](https://towardsdatascience.com/why-exclude-highly-correlated-features-when-building-regression-model-34d77a90ea8e). Later we will select just one feature from each strongly correlated group. This next activity may be quite tricky, so try to follow the given documentation and work out the steps within your team. If all else fails, your supervisor can help you!


### Making Boxplots 

Try to create a boxplot to compare two variables. For example radius_mean and texture_mean. [Here is the documentation for creating boxplots.](https://seaborn.pydata.org/generated/seaborn.boxplot.html)


### Making Heatmaps 

Try to create heatmaps of the 3 different feature sections you have separated. [Here is the documentation for creating heatmaps.](https://seaborn.pydata.org/generated/seaborn.heatmap.html)


#### Heatmap for x_mean 

#### Heatmap for x_se 

#### Heatmap for x_worst 

Let's identify the 10 features most correlated with the diagnosis label from the whole dataset (not the x_mean, x_worst, etc). You can use the [DataFrame.corr](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) function for this. Remember that we want the correlations with the diagnosis, which is the first column! Also bear in mind that correlation can be positive or negative. If we want both, it might be useful to take the absolute value...

In [9]:
# Remember how to index variables, and also be aware that you can sort a dataframe or series using Series.sort_values
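
A small sketch of ranking features by the strength of their correlation with a label. The numbers and the `feature_a` to `feature_c` names are invented so that one feature rises with the label, one falls and one is unrelated.

```python
import pandas as pd

toy = pd.DataFrame({
    "diagnosis": [1, 1, 0, 0, 1, 0],
    "feature_a": [9.0, 8.0, 2.0, 1.0, 7.0, 3.0],  # rises with diagnosis
    "feature_b": [1.0, 2.0, 8.0, 9.0, 1.5, 7.0],  # falls with diagnosis
    "feature_c": [5.0, 1.0, 5.0, 1.0, 3.0, 3.0],  # unrelated to diagnosis
})

# Take the "diagnosis" column of the correlation matrix and remove its self-correlation
corr_with_label = toy.corr()["diagnosis"].drop("diagnosis")

# Absolute value so strong negative correlations rank alongside strong positive ones
ranked = corr_with_label.abs().sort_values(ascending=False)
top_two = ranked.head(2)
print(top_two)
```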

### Violin plots

A [violin plot](https://mode.com/blog/violin-plot-examples/) is a hybrid of a box plot and a kernel density plot, which shows peaks in the data. It's used to visualize the distribution of
numerical data. Unlike a box plot, which can only show summary statistics, a violin plot depicts both summary statistics and the density of each variable.
Using these plots we can determine which features may be most important for distinguishing benign from malignant cells. Our data will
need to be standardized (rescaled so that every feature has a comparable scale) for this next step. 

[Hint here for standardization](https://www.askpython.com/python/examples/standardize-data-in-python)

[Hint here for  violin plots](https://seaborn.pydata.org/generated/seaborn.violinplot.html)


#### Violin plot for mean variables 

In [None]:
# Might be easier to display results on a standardised dataset
# data_dia = y
# data = x_mean
# data_standarized = (data - data.mean()) / (data.std()) 
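
One possible way to prepare data for a violin or swarm plot, sketched on invented values: standardise each column, then reshape to "long" format with `pd.melt`. The names `features`, `labels` and `long_form` are placeholders, not names from the completed notebook.

```python
import pandas as pd

features = pd.DataFrame({
    "radius_mean": [10.0, 14.0, 18.0, 22.0],
    "area_mean": [300.0, 600.0, 1000.0, 1500.0],
})
labels = pd.Series([0, 0, 1, 1], name="diagnosis")

# z-score: subtract each column's mean and divide by its standard deviation
standardised = (features - features.mean()) / features.std()

# Long format: one row per (diagnosis, feature name, value)
long_form = pd.melt(
    pd.concat([labels, standardised], axis=1),
    id_vars="diagnosis",
    var_name="features",
    value_name="value",
)
print(long_form)
```

A frame in this shape can then be handed to seaborn with something like `x="features"`, `y="value"` and `hue="diagnosis"`, so all features share one set of axes.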

#### Violin plot for se variables 

#### Violin plot for worst variables 

### Swarm plots 

The last visualization we will look at is the [swarm plot](https://prvnk10.medium.com/swarm-plot-4728f52b688e). A swarm plot shows every data point, which makes the distribution easier to understand. It also shows how the data is distributed across a categorical attribute and how a continuous variable varies within each category. This can be used to highlight differences between features, and hopefully by the end of this step you can start to pick out our most important features. Again you will need to standardize the data for the best results.

[Click here for seaborn documentation on swarm plots](https://seaborn.pydata.org/generated/seaborn.swarmplot.html) 



#### Swarm plot for mean variables 

#### Swarm plot for se variables 

#### Swarm plot for worst variables 

## Feature selection

Feature selection is the process of reducing the number of input variables when developing a predictive model.
It is desirable to reduce the number of input variables both to reduce the computational cost of modelling and, in some cases,
to improve the performance of the model. From our visualisations we now need to pick some features to run our models on.
For more information on feature selection click [here](https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/).
We will first run feature selection for our mean cases.

We need to look at features that have strong correlations and group them together. 

For example radius_mean, perimeter_mean, area_mean are all highly correlated. Having grouped these variables together we only need to use perimeter_mean instead of all three variables.

Try and group the other remaining variables and pick 4 other variables that you can use as prediction variables. Set up an array of the variables you have chosen as your prediction variables.



In [10]:
# prediction_var = ['texture_mean','perimeter_mean','smoothness_mean','compactness_mean','symmetry_mean','concavity_mean',
# 'texture_se','perimeter_se','smoothness_se','compactness_se','symmetry_se','concavity_se',
# 'texture_worst','perimeter_worst','smoothness_worst','compactness_worst','symmetry_worst','concavity_worst']
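
If reading groups off a heatmap by eye feels error-prone, strongly correlated pairs can also be listed programmatically. This is only a sketch on synthetic data: the 0.9 cut-off is an assumed threshold, and the columns are generated so that perimeter depends on radius.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
radius = rng.normal(14.0, 3.0, size=200)
toy = pd.DataFrame({
    "radius_mean": radius,
    "perimeter_mean": 2 * np.pi * radius + rng.normal(0.0, 0.5, size=200),
    "texture_mean": rng.normal(19.0, 4.0, size=200),
})

corr = toy.corr().abs()
threshold = 0.9  # assumed cut-off for "strongly correlated"

# Keep only the upper triangle so each pair of features appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
correlated_pairs = pairs[pairs > threshold]
print(correlated_pairs)
```

From each listed pair you would then keep one feature and drop the other.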

## Making predictions

“Prediction” refers to the output of an algorithm after it has been trained on a historical dataset and applied to new data to forecast
the likelihood of a particular outcome. In our case we will split our data into training and testing sets, then use our prediction
variables to judge whether an image shows benign or malignant cells.
We split the data into train and test sets so that we can detect overfitting, where a machine learning model performs really well on data it has seen 
but fails when shown new data. The opposite problem, underfitting, occurs when a model is too simple to capture the patterns in the training data, so it performs poorly even there.
To read more about machine learning and predictions click [here](https://www.datarobot.com/wiki/prediction/).
To read more on train/test click 
[here](https://towardsdatascience.com/how-to-split-a-dataset-into-training-and-testing-sets-b146b1649830#:~:text=The%20simplest%20way%20to%20split,the%20performance%20of%20our%20model.).

Before we get started with splitting the data we need to import a few packages.

In [11]:
from sklearn.linear_model import LogisticRegression # logistic regression classifier
from sklearn.model_selection import train_test_split # to split the data into two parts
from sklearn.model_selection import GridSearchCV # for tuning hyperparameters
from sklearn.ensemble import RandomForestClassifier # random forest classifier
from sklearn import svm # Support Vector Machines
from sklearn import metrics # to check the error and accuracy of the model

### Splitting into train/test 

Let's now split our dataset up into train and test datasets. A good ratio to split into is 80:20 train to test. The training dataset will 
be used to train the model to predict expected diagnosis from our selected features.
The test dataset will be used to test how well the trained model can predict the correct diagnosis.

[Follow this guide on how to split the data](https://towardsdatascience.com/splitting-a-dataset-e328dab2760a)

In [22]:
# now split our data into train and test

# we can check their dimension
# print(train.shape)
# print(test.shape)

(455, 31)
(114, 31)


We need to split again into train_X and train_y as well as test_X and test_y. The X variables are the prediction variables and 
the y variable is the variable we want to predict.

In [12]:
# Split the training data

# Split the test data

# n_features = train_X.shape[1]
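
The two splitting steps can be sketched end to end on random data. The `feature_a` and `feature_b` columns are invented, and `random_state=42` is an arbitrary choice that simply makes the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "diagnosis": rng.integers(0, 2, size=100),
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
})

# 80:20 split of the rows
train, test = train_test_split(toy, test_size=0.2, random_state=42)

# Separate the prediction variables (X) from the label (y)
prediction_var = ["feature_a", "feature_b"]
train_X, train_y = train[prediction_var], train["diagnosis"]
test_X, test_y = test[prediction_var], test["diagnosis"]

print(train.shape, test.shape)
```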

Now we can do the rescaling. We will use scikit-learn's [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) for this purpose.

The training data will be rescaled using `fit_transform`, while we will use `transform` for the test data to prevent data "leakage". `fit_transform` computes the mean and variance for the training dataset, then rescales the data according to these parameters. `transform` meanwhile, _reuses those parameters, instead of refitting them on the test dataset_.

Why is this important? If we use `fit_transform` again, we will calculate a new mean and variance, and let our model learn about the new test dataset. We will therefore not get a good understanding of how the model is behaving on completely unseen data!

In [14]:
from sklearn.preprocessing import StandardScaler
# Create a standardscaler object
# Call fit_transform on the training data to create a standardised training dataset

# Call transform on the test dataset
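
To see the difference between `fit_transform` and `transform` concretely, here is a tiny sketch with invented numbers. The test row is scaled using the training mean and standard deviation, not its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test_X = np.array([[2.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_X)  # learns mean and std from the training data only
test_scaled = scaler.transform(test_X)        # reuses the training mean and std

print(scaler.mean_)  # per-column means learned from train_X
print(test_scaled)
```

The first test value equals the training mean of its column, so it scales to 0. The second lies above anything seen in training, so it scales to a value greater than any in `train_scaled`.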

## Machine learning models 

### Random Forest Classifiers 

Now that our data has been split, we need to select some models we can train and use to predict. We will start off by looking at a random 
forest classifier. 

If the words machine learning scare you, don't worry: it might not be as difficult as you first think. We will use the 
[sklearn library](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) to 
access these models, so very little programming will be needed!

A random forest classifier, as its name implies, consists of a large number of individual decision trees that operate as an ensemble. 
An error made by an individual decision tree tends to be outvoted when the results of all the trees are combined.
Decision trees are very sensitive to the data they are trained on: small changes to the training set can result in significantly 
different tree structures. Random forest takes advantage of this by allowing each individual tree to randomly sample from the dataset
with replacement, resulting in different trees. This process is known as bagging. Please take some time to read 
[this post](https://towardsdatascience.com/understanding-random-forest-58381e0602d2) for a more in depth look at random forest classifiers.


In [15]:
# a simple random forest model
# now fit our model on the training data
# predict for the test data

# accuracy = metrics.accuracy_score(test_y, prediction)
# err = 1 - accuracy
# print("Prediction Accuracy = {0:4.2f}%".format(accuracy*100))
# print("Prediction Error = {0:4.2f}%".format(err*100))
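
The fit, predict and score pattern is the same for most scikit-learn models. Here is a sketch on synthetic data from `make_classification`, which is a stand-in for the real features; `n_estimators=100` and the random seeds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the real features
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train_X, train_y)          # learn from the training data
prediction = model.predict(test_X)   # predict labels for unseen data

accuracy = accuracy_score(test_y, prediction)  # arguments: true labels, predicted labels
print("Prediction Accuracy = {0:4.2f}%".format(accuracy * 100))
```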

### Feature importance
Random forests also provide feature importances in the fitted model's `feature_importances_` array. Print the features with an importance over some threshold.

In [16]:
#importance_threshold = 0.02

We can then plot these feature importances

In [17]:
# def get_feature_importance(n_features, importances, vars):
#     fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(8, 4))

#     # Identify the important features
#     importance_threshold = 0.02
#     idx = np.array(range(n_features))
#     imp = np.where(importances >= importance_threshold)  # important features
#     rest = np.setdiff1d(idx, imp)  # remaining features

#     # Plot the important features and the rest on a bar chart
#     plt.bar(idx[imp], importances[imp], alpha=0.65)
#     plt.bar(idx[rest], importances[rest], alpha=0.45)

#     # Print feature names on the bar chart
#     for i, (feature, importance) in enumerate(zip(vars, importances)):
#         if importance > importance_threshold:
#             plt.text(i, 0.015, feature, ha='center', va='bottom', rotation='vertical', fontsize=16, fontweight='bold')
#             print('[{0}] {1} (score={2:4.3f})'.format(i, feature, importance))
#         else:
#             plt.text(i, 0.01, feature, ha='center', va='bottom', rotation='vertical', fontsize=16, color='gray')
        
#     # Finish the plot    
#     fig.axes[0].get_xaxis().set_visible(False)
#     plt.ylabel('Feature Importance Score', fontsize=16)
#     plt.xlabel('Features for Breast Cancer Diagnosis', fontsize=16)
#     plt.show()

# Use the above function to plot feature importances for our random forest model

### Individual Prediction 

Now let's try making an individual prediction with the model we have just set up. In the space below, pick an individual example 
from your test data for both the X and y variables. They should come from the same index in your data.
Then use your model to make a prediction and see if it is correct! Try lots of different individual records and see how many predictions 
are correct.

### SVM (Support Vector Machine)

An SVM, or Support Vector Machine, finds a hyperplane (the generalisation of a line to higher dimensions) in an N-dimensional space (N being the number of features) that 
separates the classes of data points as cleanly as possible. For a high level overview of SVM please read this 
[article](https://towardsdatascience.com/https-medium-com-pupalerushikesh-svm-f4b42800e989).  

[For the relevant code for an SVM click here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
As before, try setting up the model and running it on individual observations in your test data. 
Is the number of correct predictions higher or lower for this model?


In [18]:
# Set up an SVM with a *linear kernel*. This will allow us to look at feature importances.


# print("Prediction Accuracy = {0:4.2f}%".format(accuracy*100))
# print("Prediction Error = {0:4.2f}%".format(err*100))

In [19]:
# Our prediction on a particular individual

In [20]:
# Feature importances in SVMs using linear kernels can be obtained from the fitted model's coef_ attribute
# Note that you might have to change the dimensions of the coefficients using array.flatten()
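
A sketch of reading coefficients from a linear-kernel SVM, again on synthetic data. For brevity the whole synthetic set is scaled and fitted at once; in the challenge you would fit on your scaled training data only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
X = StandardScaler().fit_transform(X)  # SVMs are sensitive to feature scale

model = SVC(kernel="linear")
model.fit(X, y)

# coef_ has shape (1, n_features) for a two-class problem; flatten it to 1-D
coefficients = model.coef_.flatten()

# Feature indices ordered from largest to smallest absolute coefficient
ranking = np.argsort(np.abs(coefficients))[::-1]
print(ranking)
```

Comparing coefficient sizes like this is only meaningful when the features are on a common scale, which is one more reason to standardise first.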

## Improving Prediction Accuracy 
 

Now that you have hopefully got a working machine learning model up and running, you may have noticed that some predictions have
been inaccurate. If we had to use this in a real medical scenario this would be a disaster! Thankfully there are many ways we can 
improve our machine learning models to make their diagnoses more accurate. Here are a couple:

* Using a larger and more varied training set. 
* Testing multiple algorithms. (You have already done this!) 

### [Hyperparameter tuning](https://www.jeremyjordan.me/hyperparameter-tuning/)
 

Hyperparameters are the settings of a machine learning model that are chosen before training and control how it learns. They include things like 
the number of layers in a deep neural network, or how many trees there should be in an ensemble model. You usually need to adjust
these hyperparameters yourself because they are not learned automatically when you train your model. In true machine learning fashion, 
we would ideally ask the machine to perform this exploration and select the best configuration automatically; this 
process of searching for the ideal configuration is referred to as hyperparameter tuning.

### [Oversampling](https://towardsdatascience.com/oversampling-and-undersampling-5e2bbaf56dcf)

Random oversampling selects random examples from the minority class with replacement and adds them to the training data, 
so a single example may be duplicated several times. For machine learning 
algorithms that are affected by a skewed class distribution, such as artificial neural networks and SVMs, this can be an effective technique.
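
Random oversampling can be sketched with `sklearn.utils.resample` on an invented, imbalanced training frame (7 examples of class 0 and 3 of class 1). This is one of several ways to do it.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

train = pd.DataFrame({
    "feature_a": np.arange(10, dtype=float),
    "diagnosis": [0] * 7 + [1] * 3,
})

majority = train[train["diagnosis"] == 0]
minority = train[train["diagnosis"] == 1]

# Draw minority rows with replacement until both classes are the same size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
balanced = pd.concat([majority, minority_upsampled])
print(balanced["diagnosis"].value_counts())
```

Oversampling should only ever be applied to the training data. Duplicating rows in the test set would make the evaluation misleading.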

###  [Ensemble methods.](https://towardsdatascience.com/ensemble-models-5a62d4f4cb0c)
 

Another approach is to use an ensemble method, which combines two or more algorithms together into one model. Ensembles are often 
more accurate than any individual algorithm because they leverage the strengths of each and compensate for their weaknesses.

In other words, if you combine multiple weak learners (i.e., models that perform poorly on their own) into one ensemble, you can get 
a stronger learner (i.e., a model that performs well as a whole).

### [Cross validation.](https://machinelearningmastery.com/k-fold-cross-validation/) 
 

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. 
 As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k 
 in the reference to the model, such as k=10 becoming 10-fold cross-validation.

Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. 
That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions 
 on data not used during the training of the model.

It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic 
estimate of the model skill than other methods, such as a simple train/test split.
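
As a sketch, 10-fold cross-validation takes a single call in scikit-learn. The data here is synthetic, and logistic regression is used only because it is quick to fit.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=0)

# cv=10: split into 10 folds, train on 9 and score on the held-out fold, 10 times over
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=10, scoring="accuracy"
)
print(scores.mean(), scores.std())
```

The mean of the 10 scores is the performance estimate, and the standard deviation gives a feel for how much it varies from fold to fold.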


## Congratulations! You have completed the DSC data challenge 2022!

We hope you have been able to learn something new today that can potentially be used in hackathons that you may join in future and 
has opened up the world of data science to you! We have outlined a basic pipeline for building and applying machine learning models, but this is just the basics --- much more can be done. We've also included some extra things for you to try!


## Extra: Model selection

We've tried out a random forest classifier, as well as an SVM... what about another kind of model? What is the best way of figuring out what model to use? Well, one way is just to try lots out!


In [21]:
!pip install xgboost
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier



In [22]:
# Most of these are from SKLearn, however xgboost is also really good!
# classifiers = {
#     "XGBClassifier": XGBClassifier(),
#     "RandomForestClassifier": RandomForestClassifier(),
#     "DecisionTreeClassifier": DecisionTreeClassifier(),
#     "GaussianProcessClassifier": GaussianProcessClassifier(),
#     "LogisticRegression": LogisticRegression(),
#     "PassiveAggressiveClassifier": PassiveAggressiveClassifier(),
#     "GaussianNB": GaussianNB(),
#     "KNeighborsClassifier": KNeighborsClassifier(),
# }

Next, we will loop over each model, fit it using `train_X` and `train_y`, generate predictions from `test_X` and compare them with `test_y`, then calculate the accuracy across multiple rounds of cross-validation to get a more reliable estimate of each model's performance.

In [23]:
from sklearn.model_selection import cross_val_score

# df_models = pd.DataFrame(columns=['model', 'run_time', 'accuracy', 'accuracy_cv'])

# Loop over all models in `classifiers`, train your model, then predict diagnoses on the test set.
# Then run 10-fold cross validation using cross_val_score with scoring = "accuracy"
# Save all the results to df_models so we can see them afterwards!

Let's have a look at the performance!

## Extra: Hyperparameter tuning

To finish our challenge why not try some hyperparameter tuning yourself. 
 [Check out this article for a guide on this.](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)

Our best performing models were the logistic regression and the XGBoost classifier. These are just the models with default parameters though. Do we have any more room to improve if we optimise them?

The first step is to understand what parameters are available, and which can be optimised. You can do this by printing `model.get_params()`, however you may need to look at the documentation for your particular model to understand how it can be tuned.


In [24]:
# classifier = XGBClassifier()
# Fit the classifier here


# Look at the parameters


Next, take a subset of the hyperparameters and add them to a dictionary and assign this to a param_grid.

We can then define the model and configure [`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) to test a random sample of hyperparameter combinations (the number of combinations tried is set by `n_iter`) and record the cross-validated accuracy of each. Once the search has finished, the best parameters found can be printed.

In [25]:
# param_grid = dict(
#     n_jobs = [16],
#     learning_rate = [0.1, 0.5],
#     objective = ['binary:logistic'],
#     max_depth = [int(x) for x in np.linspace(1, 21, num = 11)], 
#     n_estimators = [int(x) for x in np.linspace(start = 100, stop = 2000, num = 10)],
#     subsample = [0.2, 0.8, 1.0],
#     gamma = [0, 0.05, 0.5],
#     scale_pos_weight = [0, 1],
#     reg_alpha = [0, 0.5],
#     reg_lambda = [1, 0],
# )

# model = XGBClassifier(random_state=1, verbosity=1)

from sklearn.model_selection import RandomizedSearchCV
# Call RandomizedSearchCV on model using param_grid

# best_model = random_search.fit(train_X, train_y)
# print('Optimum parameters', best_model.best_params_)
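
The same search pattern is sketched below with a random forest on synthetic data, so it runs quickly and without xgboost. The grid values, `n_iter=5` and `cv=3` are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=0)

param_grid = {
    "n_estimators": [10, 25, 50],
    "max_depth": [2, 4, None],
    "min_samples_leaf": [1, 3],
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    param_distributions=param_grid,
    n_iter=5,            # number of random combinations to try
    cv=3,                # 3-fold cross-validation for each combination
    scoring="accuracy",
    random_state=1,
)
best_model = random_search.fit(X, y)
print("Optimum parameters", best_model.best_params_)
```

Only 5 of the 18 possible combinations are evaluated here, which is what makes a randomized search cheaper than an exhaustive grid search.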

In [26]:
# classifier = XGBClassifier(
#     # Set all the parameters we found from above here
# )

# Now fit the model and compare the performance with our earlier model
# accuracy = metrics.accuracy_score(test_y, y_pred)
# err = 1 - accuracy

# print("Prediction Accuracy = {0:4.2f}%".format(accuracy*100))
# print("Prediction Error = {0:4.2f}%".format(err*100))

We've gotten our prediction accuracy up to 98%! That's definitely an improvement on the default parameters!

## Extra: Ensemble methods

We've tried out using a _single_ model, but what happens when multiple models join forces?

In [40]:
# classifiers = {
#     "XGBClassifier": XGBClassifier(),
#     "RandomForestClassifier": RandomForestClassifier(),
#     "DecisionTreeClassifier": DecisionTreeClassifier(),
#     "GaussianProcessClassifier": GaussianProcessClassifier(),
#     "LogisticRegression": LogisticRegression(),
#     "PassiveAggressiveClassifier": PassiveAggressiveClassifier(),
#     "GaussianNB": GaussianNB(),
#     "KNeighborsClassifier": KNeighborsClassifier(),
# }

# def fit(classifiers, X, y):
#     for model, classifier in classifiers.items():
#         classifier.fit(X, y)
#     return classifiers

Let's also set aside some data for validation

In [27]:
# new_train_X, val_X, new_train_y, val_y = train_test_split(train_X, train_y, test_size=0.25)        # Set aside 25% of data for validation

In [28]:
# Call the function we defined above to fit all of our classifiers

Given we have 8 individual "base" classifiers, each test example will end up with 8 different predictions, each one corresponding to an associated base classifier.

We are not just interested in the diagnosis itself, but also the probabilities for the diagnosis. Let's make a function that returns the predictions for each classifier, and allows us to specify whether we want the class label or the probability as an output.

In [29]:
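def predict_individual(X, classifiers, prob=False):
    """Return one column of predictions per fitted classifier in `classifiers`."""
    n_classifiers = len(classifiers)
    n_samples = X.shape[0]

    y = np.zeros((n_samples, n_classifiers))
    for i, classifier in enumerate(classifiers.values()):
        if prob:
            # Probability of class 1; only works for classifiers that implement predict_proba
            y[:, i] = classifier.predict_proba(X)[:, 1]
        else:
            y[:, i] = classifier.predict(X)  # predicted class label (0 or 1)
    return y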

# Test with prob=False

# Sanity check that the output has the same number of examples as the test dataset, and 8 estimates, and is 1 or 0

Now there are a few different ways of combining these. We will try a few out here:

1) Majority vote: simply pick the most common label from each set of the predictions

2) Accuracy weighting: higher performing classifiers are given greater weight in the ensemble

3) Entropy weighting: similar to above, except entropy is used in deciding the weighting (where entropy is a form of uncertainty)

We can write some helper functions to apply these methods. Have a read of the code below and make sure you understand what they are doing!

In [30]:
# from scipy.stats import mode

# def combine_using_majority_vote(X, classifiers):
#     y_individual = predict_individual(X, classifiers, prob=False)
#     y_final = mode(y_individual, axis=1)
#     return y_final[0].reshape(-1, )

# def combine_using_accuracy_weighting(X, classifiers, Xval, yval):
#     n_classifiers = len(classifiers)
#     yval_individual = predict_individual(Xval, classifiers, prob=False)
    
#     wts = [metrics.accuracy_score(yval, yval_individual[:, i]) 
#        for i in range(n_classifiers)] 
#     wts /= np.sum(wts)

#     ypred_individual = predict_individual(X, classifiers, prob=False)
#     y_final = np.dot(ypred_individual, wts) 

#     return np.round(y_final)

# # For entropy weighting first define how to calculate entropy
# def entropy(y):
#     _, counts = np.unique(y, return_counts=True) 
#     p = np.array(counts.astype('float') / len(y))
#     ent = -p.T @ np.log2(p) # @ is the matrix multiplication operator

#     return ent

# def combine_using_entropy_weighting(X, classifiers, Xval, yval):
#     n_classifiers = len(classifiers)
#     yval_individual = predict_individual(Xval, classifiers, prob=False)
    
#     wts = [1/entropy(yval_individual[:, i]) 
#            for i in range(n_classifiers)]
#     wts /= np.sum(wts)

#     ypred_individual = predict_individual(X, classifiers, prob=False)
#     y_final = np.dot(ypred_individual, wts)
    
#     return np.round(y_final)
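
To see what the three rules do numerically, here is a small worked sketch on made-up numbers. The prediction matrix `y_individual` and the labels `y_true` are hypothetical, and for brevity the weights are computed on the same four samples rather than on a separate validation set as the helpers above do.

```python
import numpy as np

# Hypothetical predictions from 3 classifiers (columns) on 4 samples (rows)
y_individual = np.array([
    [1, 1, 0],
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
])
y_true = np.array([1, 0, 1, 0])  # made-up true labels

# 1) Majority vote: with 0/1 labels, predict 1 when more than half the classifiers say 1
majority = (y_individual.mean(axis=1) > 0.5).astype(int)

# 2) Accuracy weighting: weight each classifier by its accuracy, normalised to sum to 1
accuracies = (y_individual == y_true[:, None]).mean(axis=0)   # [1.0, 0.75, 0.5]
acc_wts = accuracies / accuracies.sum()
acc_weighted = np.round(y_individual @ acc_wts).astype(int)

# 3) Entropy weighting: weight each classifier by the inverse entropy of its predictions
def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return -(p * np.log2(p)).sum()

entropies = np.array([entropy(y_individual[:, i]) for i in range(y_individual.shape[1])])
ent_wts = (1 / entropies) / (1 / entropies).sum()
ent_weighted = np.round(y_individual @ ent_wts).astype(int)

print(majority, acc_weighted, ent_weighted)
```

On this toy example all three rules agree with each other. Note that the inverse-entropy weight is undefined when a classifier predicts the same label for every sample, because its entropy is then zero.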

In [31]:
# Call the three different functions to compare their performance

We can also do something called meta-learning: instead of carefully designing a function to combine the predictions, we learn one. The predictions of the base estimators are given as inputs to a second-level learning algorithm, which learns the meta-classification function for us!

Stacking is the most common meta-learning method and gets its name because it stacks a second classifier on top of its base estimators. The general stacking procedure has two steps:

1) level 1: fit base estimators on the training data; this step is the same as before and aims to create a diverse, heterogeneous set of base classifiers.
2) level 2: construct a new data set from the outputs of the base classifiers, which become meta-features (either the predicted labels or the predicted probabilities), and fit the second-level classifier on it.

Let's again define our models first:

In [32]:
# classifiers = {
#     "XGBClassifier": XGBClassifier(),
#     "RandomForestClassifier": RandomForestClassifier(),
#     "DecisionTreeClassifier": DecisionTreeClassifier(),
#     "GaussianProcessClassifier": GaussianProcessClassifier(),
#     "LogisticRegression": LogisticRegression(),
#     "GaussianNB": GaussianNB(),
#     "KNeighborsClassifier": KNeighborsClassifier(),
# }

And then let's define some helper functions to allow us to use a second level.

In [34]:
# Fit our classifiers
# def fit_stacking(level1_classifiers, level2_classifier, X, y, use_probabilities=False):

#     fit(level1_classifiers, X, y)
    
#     X_meta = predict_individual(X, classifiers=level1_classifiers, prob=use_probabilities)
    
#     level2_classifier.fit(X_meta, y)

#     final_model = {'level-1': level1_classifiers, 
#                    'level-2': level2_classifier, 
#                    'use-prob': use_probabilities}
    
#     return final_model

# # Predict using the classifiers
# def predict_stacking(X, stacked_model):
#     level1_classifiers = stacked_model['level-1']
#     use_probabilities = stacked_model['use-prob']

#     X_meta = predict_individual(X, classifiers=level1_classifiers, prob=use_probabilities)

#     level2_classifier = stacked_model['level-2']
#     y = level2_classifier.predict(X_meta)
    
#     return y

Now let's try it out! Logistic Regression is a common second-level estimator, but feel free to experiment with different ones yourself.

In [35]:
# Define your meta classifier here

# Then call the fit_stacking function and the predict_stacking function to check their performance

As we can see, the accuracy is actually worse than with our previous methods. However, this is just a basic attempt: we could combine it with parameter optimisation, incorporate cross-validation into the stacking procedure, and much more to try to increase the accuracy!
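
One way to bring cross-validation into stacking is scikit-learn's built-in `StackingClassifier`, which is a swapped-in alternative to the hand-written helpers above, not the method used in this worksheet. The `fit_stacking` helper builds its meta-features from the same data the base estimators were fitted on, whereas `StackingClassifier` trains the second level on out-of-fold predictions. The sketch below assumes scikit-learn's bundled breast cancer dataset as a stand-in for the challenge data, and the choice of base models and parameters is purely illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (an assumption): swap in your own train_X / train_y here
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

stack = StackingClassifier(
    estimators=[                      # level 1: base classifiers
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("nb", GaussianNB()),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # level 2: meta-classifier
    stack_method="predict_proba",     # use probabilities as meta-features
    cv=5,                             # meta-features come from 5-fold out-of-fold predictions
)
stack.fit(X_tr, y_tr)

accuracy = stack.score(X_te, y_te)
print(f"Stacked accuracy on held-out data: {accuracy:.3f}")
```

Whether this beats the simpler combination rules on your data is something to check for yourself rather than assume.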

All in all, ensemble learning can be a powerful tool to augment your machine learning toolbox!