# Comparing and Contrasting the Basic Principles and Characteristics of a range of Contemporary Machine Learning Algorithms
### *Arthur Milner - 21035478*


# Diabetes Dataset

To prepare the dataset for training, I first imported the diabetes.csv file into a pandas DataFrame. A DataFrame greatly improves the readability of the data in the notebook, making it easier to grasp the structure of the dataset and where each detail is stored. DataFrames also include built-in features such as .head(x)/.tail(x) and .index to aid in analysing that structure.

In [None]:
#importing libraries needed

import pandas as pd
import numpy as np
import seaborn as sns #for statistical plotting
import matplotlib.pyplot as plt
import math
import sklearn
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from scipy.stats import pearsonr
from sklearn import svm
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
%matplotlib inline

In [None]:
diabetesData = pd.read_csv('./datasets/diabetes.csv') # Reads dataset into a DataFrame from the diabetes.csv file
diabetesData.head(5) #First 5 rows of the DataFrame, checking it has imported correctly

In [None]:
diabetesData.tail(5) #Last 5 rows of the DataFrame, checking it has imported correctly

In [None]:
diabetesData.index #Shows number of rows in the dataset

Upon viewing the columns and values within the DataFrame, we can see the problem involves two classes, 0 and 1, representing a negative (0) or a positive (1) diagnosis of diabetes. Also noticeable is that some columns, such as SkinThickness, contain many 0 values that are really missing values. To get a better picture of the severity of this, I will replace the relevant 0 values with null and use pandas' built-in functions to detect null values. Each column also appears valuable in predicting diabetes, as they can all have an effect on the development of the condition.

## Analysing the Dataset Before Processing:
Firstly I will replace the 0 values in the DataFrame with null. I will then decide how to manage these null values once I have gathered their frequency in each column using diabetesData.isnull().sum().

In [None]:
#Code to replace the 0 values with NaN in every listed column (including the last one, Age)
columnsToChange = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin','BMI','Age']
#np.nan keeps the columns numeric, and assigning the result back avoids chained inplace replacement
diabetesData[columnsToChange] = diabetesData[columnsToChange].replace(0, np.nan)
    

In [None]:
diabetesData.isnull().sum()

In [None]:
diabetesData.head(20) #Verifying replacement

Following the result of diabetesData.isnull().sum() we can see there are many missing values; .head(20) shows that even the first 20 rows contain multiple null values.

## Pre-Processing the Data:
Now that I have established that all columns are needed and identified the rows with missing values, I have some pre-processing to do to ensure those missing values do not harm the effectiveness of the model. For rows containing null values I have decided on the following: if a row contains more than one null value I will delete it, and if it contains just one null value I will replace that value with an appropriate estimate for that row.

In [None]:
#Keep only the rows that contain at most one null value
nullsPerRow = diabetesData.isna().sum(axis=1) #Gets number of null values for each row
diabetesData = diabetesData[nullsPerRow <= 1]

In [None]:
diabetesData.head(20) #Verify rows with multiple null values have been removed

In [None]:
diabetesData.isnull().sum() #Checking what values are still null

Reviewing which rows still contain null values after removing those with multiple nulls, almost all of the remaining nulls are in Insulin, the exceptions being two rows missing BMI or Glucose. Considering this, it makes sense to drop those two rows and then predict the insulin for the remaining 140 rows using a linear regression line.

In [None]:
diabetesData = diabetesData.dropna(subset=['Glucose'])
diabetesData = diabetesData.dropna(subset=['BMI'])
diabetesData.isnull().sum() #Checking what values are still null

#### Using Graphs to Identify Relationships Between Variables
Having identified Insulin as the column missing the most values after removing rows with multiple null values, I will use matplotlib to visualise how the other variables might affect the insulin value. Drawing upon real-world knowledge, I expect columns such as BMI and Glucose to be the greatest indicators. Once I have identified the best variable, I can fit a linear regression line that returns a prediction for each null value for insertion into the dataset. The goal is to give more accurate values than simply inserting the average insulin value into every null.
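
To illustrate why a regression line can beat a plain average for filling gaps, here is a minimal sketch on synthetic data. The `glucose` and `insulin` arrays are made-up stand-ins with an assumed linear relationship, not values from diabetes.csv, so the size of the improvement on the real data may well be smaller.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for Glucose and Insulin (hypothetical values, not the real dataset)
glucose = rng.uniform(70, 200, size=200)
insulin = 1.5 * glucose - 40 + rng.normal(0, 20, size=200)

# Pretend the last 50 insulin readings are missing
known, missing = slice(0, 150), slice(150, 200)

# Option 1: fill every gap with the mean of the known insulin values
mean_fill = np.full(50, insulin[known].mean())

# Option 2: fit a straight line on the known rows and predict the gaps from glucose
slope, intercept = np.polyfit(glucose[known], insulin[known], deg=1)
regression_fill = slope * glucose[missing] + intercept

def mse(estimate):
    """Mean squared error of an imputation against the held-back true values."""
    return float(np.mean((estimate - insulin[missing]) ** 2))

print("mean imputation MSE:      ", round(mse(mean_fill), 1))
print("regression imputation MSE:", round(mse(regression_fill), 1))
```

The regression only helps to the extent that the predictor really is correlated with the missing column, which is what the plots in this section are meant to check.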

In [None]:
#Creating a copy of the current DataFrame without any null insulin values
diabetesData1 = diabetesData.dropna()

In [None]:
diabetesData1.isnull().sum() #Checking what values are still null

A seaborn pairplot allows us to see the relationship between all variables in the dataset, including insulin.

In [None]:
sns.pairplot(diabetesData1, hue="Outcome")

Upon viewing the graphs above, glucose appears to be the best variable for predicting insulin levels. With this knowledge I will now fit a linear regression line to predict the values that will fill in the nulls present in the dataset. This also makes sense as, in real life, those with high glucose may be given insulin injections. Also notable is that some measures appear to be much better indicators of diabetes than others; for instance, glucose has a clear link with the outcome.

In [None]:
glucoseX = diabetesData1['Glucose'].to_numpy()
insulinY = diabetesData1['Insulin'].to_numpy()

regr = LinearRegression().fit(glucoseX.reshape(-1, 1), insulinY)

insulinYHat = regr.predict(glucoseX.reshape(-1, 1))

plt.scatter(glucoseX, insulinY, c='b', label='Data')
plt.plot(glucoseX, insulinYHat, c='g', label='Linear regression fit')

plt.legend(loc='best')

#Axis labels match the plotted variables
plt.xlabel('Glucose')
plt.ylabel('Insulin')
plt.show()

With the linear regression line fitted, I must now use it to fill in the null values.

In [None]:
for index, row in diabetesData.iterrows():
    countOfNull = diabetesData.loc[[index]].isna().sum().sum() #Gets number of null values for current row
    if countOfNull == 1: #Only Insulin can still be null at this point
        glucoseNewX = np.array([[diabetesData.loc[index, 'Glucose']]], dtype=float) #Shape (1, 1), as predict expects
        insulinNewYHat = int(regr.predict(glucoseNewX)[0]) #predict returns an array, so take its single element
        diabetesData.loc[index, 'Insulin'] = insulinNewYHat

In [None]:
diabetesData.head(5) #Verifying values have changed

In [None]:
diabetesData.isnull().sum() #Checking what values are still null

## Feature Selection

Using the Pearson correlation below, I am checking how relevant each feature is in detecting diabetes.

In [None]:
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
for i in columns:
    correlation, na = pearsonr(diabetesData[i], diabetesData["Outcome"])
    print(i, correlation)

The above results show Glucose, BMI, Age and Insulin to be the strongest indicators. As blood pressure has a low correlation (under 0.2 is considered weak), it may be worth removing it from the dataset entirely.
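
For reference, the Pearson coefficient can be computed by hand as the covariance of the two variables divided by the product of their standard deviations. The sketch below does this with NumPy; the `feature` and `outcome` lists are hypothetical values chosen only for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: sum of centred products over the root of the summed squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float((x_centered * y_centered).sum() / denominator)

# Hypothetical feature values and binary outcomes, for illustration only
feature = [85, 90, 110, 130, 150, 170]
outcome = [0, 0, 0, 1, 1, 1]

r = pearson_r(feature, outcome)
print(round(r, 3))
```

Because the outcome is binary, the coefficient only measures how linearly the feature separates the two classes, so a weak value does not by itself prove a feature is useless.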

In [None]:
diabetesData = diabetesData.drop(columns=['BloodPressure'])
diabetesData.head(5)

## Scaling the Dataset

Because some of the algorithms I will be using are affected by whether or not the data is scaled, it is important that I scale the dataset. Features are often stored in different units and ranges, which can severely alter results when using something such as Euclidean distance.
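
A small sketch of the problem, using two made-up patients described by a small-range feature and a large-range feature. The values and the `spread` used for scaling are assumptions for illustration, not statistics taken from the dataset.

```python
import numpy as np

# Two hypothetical patients: [DiabetesPedigreeFunction, Insulin]
a = np.array([0.2, 100.0])
b = np.array([1.2, 110.0])

raw_distance = np.linalg.norm(a - b)

# Divide each feature by an assumed typical spread so both contribute comparably
spread = np.array([0.3, 100.0])  # illustrative values only
scaled_distance = np.linalg.norm((a - b) / spread)

# Share of the squared distance contributed by the first (small-range) feature
raw_share = (a[0] - b[0]) ** 2 / raw_distance ** 2
scaled_share = ((a[0] - b[0]) / spread[0]) ** 2 / scaled_distance ** 2

print("first feature's share before scaling:", round(raw_share, 3))
print("first feature's share after scaling: ", round(scaled_share, 3))
```

Before scaling, the large difference in the first feature is almost invisible next to a modest difference in the second, simply because of the units involved.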

In [None]:
#Checking the outliers within the dataset, if there are a large amount min max scaler
# might not be viable as it is sensitive to outliers
for column in diabetesData.columns:
    plt.figure()
    sns.boxplot(y = column, data = diabetesData, orient = "v")



Looking at the box plots, it is clear the data includes quite a few outliers. Consequently a robust scaler might be the best option: it centres each feature on its median and scales by the interquartile range, statistics that are barely moved by extreme values, so it is much less sensitive to outliers than a min-max scaler.
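
The difference can be sketched by re-implementing both transformations by hand on a toy column with one extreme value. This is a manual illustration of the formulas (min-max: subtract the minimum and divide by the range; robust: subtract the median and divide by the 25th to 75th percentile range), not a call into scikit-learn, and the numbers are hypothetical.

```python
import numpy as np

# Five ordinary readings plus one hypothetical outlier
values = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 500.0])

# Min-max scaling: (x - min) / (max - min), so the outlier defines the range
min_max = (values - values.min()) / (values.max() - values.min())

# Robust scaling: (x - median) / IQR, so the outlier barely affects the statistics
q1, median, q3 = np.percentile(values, [25, 50, 75])
robust = (values - median) / (q3 - q1)

print("min-max:", np.round(min_max, 3))
print("robust: ", np.round(robust, 3))
```

With min-max scaling the five ordinary readings are squeezed into a tiny interval near 0, whereas the robust version keeps them spread out and leaves the outlier far away instead of letting it set the scale.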

In [None]:
#Get dataset into x and y variables, y for the outcome
X = diabetesData.iloc[:,:-1].values
y = diabetesData.iloc[:,-1].values

In [None]:
#Splitting the dataset into train and test, 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [None]:
#Showing the amount in each array
print('X_train: ', X_train.shape)
print('X_test: ', X_test.shape)
print('y_train: ', y_train.shape)
print('y_test: ', y_test.shape)

<strong>Note: For the grid searches below I often did experimentation in order to decide on a nice range of values.</strong>

# SVM Classifier

The first model I will train will be an SVM classifier, the hyper-parameters I will need to tune will be:
- Kernel
- C 
- Degree (Polynomial)
- Gamma (RBF)

I have chosen SVM because it is widely used for classification problems and generally performs well on a wide variety of datasets. The size of the dataset is a further reason: SVM can be very time-consuming on large datasets, which should not be a problem here. The many kernels available for SVM also appeal to me, as they make it quite versatile across many dimensions and various datasets.

To do this I will use grid search CV and justify why certain parameters perform better than others as I go along. I decided on grid search because it tests multiple parameters and values with minimal code, meaning I can try low, medium and high values all at once. I am also using cross-validation with 5 folds, which makes the scores more reliable by averaging performance across folds.

- Perhaps the most important hyper-parameter for SVM is the kernel, so it is important that I spend the majority of my time deciding on the kernel most appropriate for the dataset. I predict the polynomial/RBF kernels will outperform the linear kernel, as I expect this data is not linearly separable.
- For values of C I will base my range around the recommendation of "trying exponentially growing sequences of C" to identify good parameters. <strong>(Hsu, C., Chang, C. and Lin, C., 2016)</strong>
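
An exponentially growing sequence of this kind can be generated rather than typed out. The sketch below uses NumPy's `logspace`; the chosen exponents (from -1 to 3) are my assumption for a reasonable starting range.

```python
import numpy as np

# Five values spaced evenly in the exponent: 10^-1, 10^0, 10^1, 10^2, 10^3
c_values = np.logspace(-1, 3, num=5)
print(c_values.tolist())
```

Searching on a logarithmic scale first covers several orders of magnitude cheaply, leaving finer linear steps for once a promising region has been found.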

### Linear Kernel
Below are the scores returned by the linear kernel using a wide range of values for C.

In [None]:
linearKernel = Pipeline([("scaler", RobustScaler()), ("svc", svm.SVC(kernel="linear"))])

params = {
    'svc__C': [0.1, 1, 10, 100, 1000]
}

linearGrid = GridSearchCV(linearKernel, params, cv=5,verbose=True,n_jobs=-1,scoring='f1')
linearGrid.fit(X_train,y_train)

In [None]:
print("The parameters for the best score the grid search returned were: " + str(linearGrid.best_params_))
print("Best score the grid search returned was: "+ str(linearGrid.best_score_))
print("\n\n")
print("Scores on training data:")
y_pred = linearGrid.predict(X_train)
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
print("\n\n")
print("Scores on testing data:")
y_pred = linearGrid.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test,y_pred))

### Polynomial Kernel
Below are the scores returned by the polynomial kernel using a wide range of values for C and degree. The degree is searched between 2 and 10, which gives a good range of values allowing for both straighter and more flexible decision boundaries. A degree of 1 is left out because it would make the kernel behave essentially like the linear kernel.
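
It helps to know how much work a grid search of this size involves. The sketch below simply enumerates every combination of five C values and nine degrees and multiplies by the 5 cross-validation folds; it is a counting illustration, not how scikit-learn schedules its fits internally.

```python
from itertools import product

param_grid = {
    "svc__C": [0.1, 1, 10, 100, 1000],
    "svc__degree": [2, 3, 4, 5, 6, 7, 8, 9, 10],
}
n_folds = 5

# Every combination of one value per parameter is a candidate model
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]
total_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", total_fits, "fits")
```

This gives 45 candidates and 225 fits, so each extra parameter added to the grid multiplies the cost, which is worth keeping in mind when choosing how many values to try.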

In [None]:
polynomialKernel = Pipeline([("scaler", RobustScaler()), ("svc", svm.SVC(kernel="poly"))])

params = {
    'svc__C': [0.1, 1, 10, 100, 1000],
    'svc__degree' : [2,3,4,5,6,7,8,9,10]
}

polynomialGrid = GridSearchCV(polynomialKernel, params, cv=5,verbose=True,n_jobs=-1,scoring='f1')

In [None]:
polynomialGrid.fit(X_train,y_train)

In [None]:
print("Best parameters the grid search returned were: " + str(polynomialGrid.best_params_))
print("Best score the grid search returned was: "+ str(polynomialGrid.best_score_))
print("\n\n")
print("Scores on training data:")
y_pred = polynomialGrid.predict(X_train)
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
print("\n\n")
print("Scores on testing data:")
y_pred = polynomialGrid.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test,y_pred))

### RBF Kernel
Below are the scores returned by the RBF kernel using a wide range of values for C and gamma.

In [None]:
RBFKernel = Pipeline([("scaler", RobustScaler()), ("svc", svm.SVC(kernel="rbf"))])

params = {
    'svc__C': [0.1, 1, 10, 100, 1000],
    'svc__gamma' : [0.001, 0.01, 0.1, 0.5, 1.0, 10.0, 50, 100, 'scale', 'auto']
}

RBFGrid = GridSearchCV(RBFKernel, params, cv=5,verbose=True,n_jobs=-1,scoring='f1')

In [None]:
RBFGrid.fit(X_train,y_train)

In [None]:
print("Best parameters the grid search returned were: " + str(RBFGrid.best_params_))
print("Best score the grid search returned was: "+ str(RBFGrid.best_score_))
print("\n\n")
print("Scores on training data:")
y_pred = RBFGrid.predict(X_train)
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
print("\n\n")
print("Scores on testing data:")
y_pred = RBFGrid.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test,y_pred))

## Kernel Comparison and Selection:

### Comparing the C Values and Considering the Degree of the Polynomial kernel:
 - Linear and RBF:
    - The grid search returned a C of 10 as the best performing value for both the linear and RBF kernels. The C value controls the penalty for misclassification, so a C of 10 gave the best trade-off between margin size and misclassification out of the values I fed into the grid search. This suggests the higher C values potentially overfit the training folds because the margin was too small, leading to poorer scores on the held-out folds. It also suggests the lower values of C left the margin too large, allowing too many misclassifications.

 - Polynomial:
    - Interestingly, the grid search found a C of 100 to be most appropriate for the polynomial kernel. This is perhaps because the degree it returned was 3: the decision boundary can be a little more flexible, so the balance between margin size and misclassification may sit at a different C than it does for a straight boundary. Note that a larger C penalises misclassification more heavily and so tends to give a narrower margin rather than a wider one. It is also worth noting that when I allowed a degree of 1 in the grid search it chose that degree, meaning the kernel was essentially acting as a linear kernel. This suggests the polynomial kernel is not a good fit for the problem because the data is already somewhat linearly separable before the kernel trick implicitly maps it into a higher-dimensional feature space.
    
- Considering the gamma for RBF:
    - The ideal gamma value chosen by the grid search for RBF was 0.001, a very low value, meaning each training point's influence reaches far and the kernel considers points both close and very far away when forming its decision boundary. This suggests the ideal decision boundary is somewhat straight, which further explains why the linear kernel performs so well on the dataset, contrary to my predictions.
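
The effect of gamma can be seen directly from the RBF kernel formula, K(x, y) = exp(-gamma * ||x - y||^2). The sketch below evaluates it for a nearby and a distant pair of made-up points at a low and a high gamma.

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

near = ([0.0, 0.0], [0.5, 0.0])  # squared distance 0.25
far = ([0.0, 0.0], [5.0, 0.0])   # squared distance 25

for gamma in (0.001, 1.0):
    print("gamma", gamma,
          "near:", round(rbf_kernel(*near, gamma), 4),
          "far:", round(rbf_kernel(*far, gamma), 4))
```

With gamma at 0.001 even the distant pair is still treated as highly similar, so every point influences the boundary broadly and the result is smooth. With gamma at 1 the distant pair has essentially no similarity, allowing a much more localised, wiggly boundary.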

### Comparison and Analysis of the Kernels Performance:
After utilising grid search with a wide range of values and comparing the results, I think the optimal kernel for this dataset is, to my surprise, the linear kernel. The precision, recall and f1-score of the linear kernel outperform both the RBF and polynomial kernels in every category. Because of this I believe the linear kernel is the best choice for the problem: it has the lowest computational cost, the fastest training, and delivers consistently better results than the polynomial and RBF kernels with the tested parameters.

Comparing the scores of the kernels on the train and test data, it is clear the polynomial kernel suffers from overfitting to a much higher degree than the other two, which perform fairly evenly across train and test. The polynomial kernel generally has an f1-score over 20 points higher on the train data than on the test data, a clear indicator that the model has overfit the training data.

Taking a more in-depth look at the scores of the three kernels, predicting a positive outcome is considerably less accurate than predicting a negative one. All three kernels manage an f1-score of at least 79 for the negative outcome, whereas for the positive outcome the f1-score can drop into the low 50s for the polynomial kernel and the mid 60s for the RBF/linear kernels. In real-world terms, this is perhaps because diabetes can be heavily influenced by genetics: type 2 diabetes, for example, is linked to family history, which the dataset captures only through the single DiabetesPedigreeFunction column.
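
The per-class figures in a classification report can be derived from the confusion matrix. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and uses hypothetical counts rather than any result from this notebook.

```python
import numpy as np

def per_class_scores(cm, label):
    """Precision, recall and f1 for one class of a confusion matrix (rows = true, columns = predicted)."""
    cm = np.asarray(cm, dtype=float)
    true_positives = cm[label, label]
    precision = true_positives / cm[:, label].sum()  # of everything predicted as this class
    recall = true_positives / cm[label, :].sum()     # of everything truly in this class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: [[TN, FP], [FN, TP]]
cm = [[90, 10],
      [20, 30]]

negative = per_class_scores(cm, 0)
positive = per_class_scores(cm, 1)
print("negative class:", [round(v, 3) for v in negative])
print("positive class:", [round(v, 3) for v in positive])
```

In this made-up example the negative class scores noticeably higher than the positive class even though the same model produced both, which is the same pattern seen in the kernel results.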

To conclude, I will be choosing the linear kernel for the SVC on this dataset due to its low computational cost and relatively high performance metrics.

## Further Optimisation with Linear Kernel:

I will now quickly run the linear kernel with closer ranges of values in the hope of narrowing down the ideal C value, starting from the current best C of 10.

In [None]:
linearKernel = Pipeline([("scaler", RobustScaler()), ("svc", svm.SVC(kernel="linear"))])

params = {
    'svc__C': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
}

linearGrid = GridSearchCV(linearKernel, params, cv=5,verbose=True,n_jobs=-1,scoring='f1')
linearGrid.fit(X_train,y_train)

In [None]:
print("The parameters for the best score the grid search returned were: " + str(linearGrid.best_params_))
print("Best score the grid search returned was: "+ str(linearGrid.best_score_))
print("\n\n")
print("Scores on training data:")
y_pred = linearGrid.predict(X_train)
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
print("\n\n")
print("Scores on testing data:")
y_pred = linearGrid.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test,y_pred))

In [None]:
linearKernel = Pipeline([("scaler", RobustScaler()), ("svc", svm.SVC(kernel="linear"))])

params = {
    'svc__C': [4, 4.25, 4.5, 4.75, 5, 5.25, 5.5, 5.75, 6]
}

linearGrid = GridSearchCV(linearKernel, params, cv=5,verbose=True,n_jobs=-1,scoring='f1')
linearGrid.fit(X_train,y_train)

In [None]:
print("The parameters for the best score the grid search returned were: " + str(linearGrid.best_params_))
print("Best score the grid search returned was: "+ str(linearGrid.best_score_))
print("\n\n")
print("Scores on training data:")
y_pred = linearGrid.predict(X_train)
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
print("\n\n")
print("Scores on testing data:")
y_pred = linearGrid.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test,y_pred))

In [None]:
linearKernel = Pipeline([("scaler", RobustScaler()), ("svc", svm.SVC(kernel="linear"))])

params = {
    'svc__C': [4.10, 4.15, 4.2, 4.25, 4.3]
}

linearGrid = GridSearchCV(linearKernel, params, cv=5,verbose=True,n_jobs=-1,scoring='f1')
linearGrid.fit(X_train,y_train)

In [None]:
print("The parameters for the best score the grid search returned were: " + str(linearGrid.best_params_))
print("Best score the grid search returned was: "+ str(linearGrid.best_score_))
print("\n\n")
print("Scores on training data:")
y_pred = linearGrid.predict(X_train)
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
print("\n\n")
print("Scores on testing data:")
y_pred = linearGrid.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test,y_pred))

After a few quick grid searches closing in on a more specific C value, I arrived at a C value of 4.1.
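
The coarse-to-fine narrowing I did by hand could be automated. The sketch below is one possible way to do it, not the procedure used in this notebook: `toy_score` is a hypothetical stand-in for the mean cross-validated f1-score (given a single peak at 4.1 purely for illustration), and `refine_grid` is a helper I have made up that zooms in between the neighbours of the current best value.

```python
import numpy as np

def toy_score(c):
    """Hypothetical stand-in for a mean cross-validated score; peaks at c = 4.1."""
    return -(c - 4.1) ** 2

def refine_grid(grid, scores, points=11):
    """Return a finer, evenly spaced grid spanning the neighbours of the best value."""
    best = int(np.argmax(scores))
    low = grid[max(best - 1, 0)]
    high = grid[min(best + 1, len(grid) - 1)]
    return np.linspace(low, high, points)

# Start from a coarse exponential grid, then zoom in three times
grid = np.array([0.1, 1.0, 10.0, 100.0, 1000.0])
for _ in range(3):
    grid = refine_grid(grid, toy_score(grid))

best_c = grid[np.argmax(toy_score(grid))]
print(round(float(best_c), 2))
```

A real score surface is noisier than this toy function, so very fine steps in C may reflect fold-to-fold variation more than a genuine improvement.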

## Final SVM Model


Below is the final model I arrived at through my experimentation with SVM on the diabetes dataset:

In [None]:
optimalSVMModel = Pipeline([("scaler", RobustScaler()), ("svc", svm.SVC(kernel="linear", C=4.1))])
optimalSVMModel.fit(X_train,y_train)
print("Scores on training data:")
y_pred = optimalSVMModel.predict(X_train)
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
y_pred = optimalSVMModel.predict(X_test)
print("Scores on testing data:")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Random Forest (Ensemble)

The second model I will train will be random forest, the hyper-parameters I will need to tune will be:
- Max Features (The number of features to consider)
- Max Depth (The maximum height the trees within the random forest can grow)
- Number of Estimators (The number of trees inside the random forest)
- Min Samples Split (The number of samples in a node that allow it to split into other nodes)
- Max Samples (The maximum ratio of samples each tree can use)

I have decided upon Random Forest for a similar reason to SVM: it is a widely used model in classification problems and generally produces good results due to its versatility. For the criterion I am using gini, as I am familiar with how it works and it is commonly used. I will again use cross-validation with 5 folds.
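
As a reminder of how the gini criterion works, the impurity of a node is 1 minus the sum of the squared class proportions, and a split is scored by the size-weighted impurity of its two children. The sketch below uses small made-up label lists.

```python
from collections import Counter

def gini_impurity(labels):
    """1 - sum of squared class proportions; 0 means the node is pure."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted gini impurity of the two children produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

print(gini_impurity([0, 0, 1, 1]))                         # evenly mixed node
print(gini_impurity([0, 0, 0, 0]))                         # pure node
print(round(split_impurity([0, 0, 1], [0, 1, 1]), 3))      # a weak split
print(split_impurity([0, 0, 0], [1, 1, 1]))                # a perfect split
```

A tree picks, at each node, the split with the lowest weighted impurity among the features it is allowed to consider.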

Also notable is the fact Random Forest does not require feature scaling as it is a tree based model.

In [None]:
randomForestModel = Pipeline([("randomForest", RandomForestClassifier(criterion="gini", random_state=1, max_samples=0.6))])

params = {
    'randomForest__max_features': ['sqrt', 'log2', 2],
    'randomForest__max_depth' : [1, 2, 3, 4, 5],
    'randomForest__n_estimators': [25, 50, 75, 100, 150, 200]
}

randomForestGrid = GridSearchCV(randomForestModel, params, cv=5,verbose=True,n_jobs=-1,scoring='f1')
randomForestGrid.fit(X_train,y_train)

In [None]:
print("The parameters for the best score the grid search returned were: " + str(randomForestGrid.best_params_))
print("Best score the grid search returned was: "+ str(randomForestGrid.best_score_))
print("\n\n")
print("Scores on training data:")
y_pred = randomForestGrid.predict(X_train)
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
print("\n\n")
print("Scores on testing data:")
y_pred = randomForestGrid.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test,y_pred))

Straight away I can see the Random Forest is delivering a decent performance with the current values. I experimented with max_samples and decided on 0.6, as it gives each tree enough samples to produce results reflective of the dataset.

The grid search returning the highest available max depth is perhaps the reason for the model overfitting: the greater the depth, the greater the chance of overfitting, since the model becomes highly tuned to the boundaries of the training data. Max depth might be considered one of the most important parameters in a Random Forest, as it has the potential to vastly improve performance but also to overfit the model, so a good balance is important. The model performs worse on the testing data than on the training data, which indicates it is overfitting.

Using sqrt (√number of features) for max features is common in Random Forest classifiers, so it is not surprising it has been selected here. Considering a higher number of features at each split could require the trees to grow deeper to ensure leaves maintain a low gini index, and this high depth can in turn lead to overfitting.
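
It is worth checking what the three max features options actually amount to here. The sketch below assumes seven feature columns remain once BloodPressure has been dropped, and that scikit-learn truncates the square root and base-2 logarithm down to a whole number (with a minimum of 1), which is my understanding of its implementation.

```python
import math

n_features = 7  # assumption: the eight original features minus BloodPressure

# Number of features considered at each split under each option
options = {
    "sqrt": max(1, int(math.sqrt(n_features))),
    "log2": max(1, int(math.log2(n_features))),
    "2": 2,
}

for name, count in options.items():
    print(name, "->", count)
```

Under these assumptions all three options resolve to 2 features per split, so the grid search would have no real choice to make between them on this dataset and whichever it reports is effectively a tie.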

For number of estimators I chose higher numbers, as the resulting increase in computation time, which would matter for real-time classification, is not important for what I am doing. It does appear, however, that the model needs only a small number of estimators and any more offer little improvement.

<strong>I will now further tune max depth and number of estimators, along with the min samples split parameter.</strong> 

Tuning the min samples split parameter can aid the performance of the model because, much like max depth, it manages overfitting by preventing too many splits from occurring. I will search between 2 and 12: too high a value can cause underfitting, as the final nodes can still contain a considerable number of samples from multiple classes, while a low value can overfit by creating too many splits, in a similar fashion to a large max depth. Most of the values I am searching are higher than the default of 2 because the model is currently overfitting.

In [None]:
randomForestModel = Pipeline([("randomForest", RandomForestClassifier(criterion="gini", random_state=1, max_samples=0.6, max_features="sqrt"))])

params = {
    'randomForest__max_depth' : [2,3,4,5],
    'randomForest__n_estimators': [25, 50, 75, 100],
    'randomForest__min_samples_split' : [2, 5, 8, 10, 12]
}

randomForestGrid = GridSearchCV(randomForestModel, params, cv=5,verbose=True,n_jobs=-1,scoring='f1')
randomForestGrid.fit(X_train,y_train)

In [None]:
print("The parameters for the best score the grid search returned were: " + str(randomForestGrid.best_params_))
print("Best score the grid search returned was: "+ str(randomForestGrid.best_score_))
print("\n\n")
print("Scores on training data:")
y_pred = randomForestGrid.predict(X_train)
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
print("\n\n")
print("Scores on testing data:")
y_pred = randomForestGrid.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test,y_pred))

Following these results, the performance does not appear to have improved, and the grid search still favours the default min samples split. This is unexpected, as a higher value could help combat overfitting by reducing the number of splits in some trees and consequently their depth. It is possible the Random Forest's weak classifiers are causing this overfitting: since all the trees in a Random Forest have the same say in the final prediction, they hold just as much weight as the strong classifiers.

I will take a quick look at AdaBoost in the hope it can deliver improved performance, as I have tried many parameters of the Random Forest classifier with little luck in improving accuracy.

## Final Random Forest Model
Below is the final model I arrived at through my experimentation with Random Forest on the diabetes dataset:

In [None]:
finalRandomForestModel = Pipeline([("randomForest", RandomForestClassifier(criterion="gini", max_depth=5, min_samples_split=2,n_estimators=25,random_state=1, max_samples=0.6, max_features="sqrt"))])
finalRandomForestModel.fit(X_train,y_train)
y_pred = finalRandomForestModel.predict(X_train)
print("Scores on training data:")
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
y_pred = finalRandomForestModel.predict(X_test)
print("Scores on testing data:")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# AdaBoost

I hope AdaBoost can improve on the scores of my Random Forest model. The two models are similar; however, AdaBoost allows different decision trees to have a different amount of influence based on their effectiveness. The parameters I will be tuning here are the learning rate and the number of estimators. The learning rate decides how much each new version of the model changes from the previous version, and the number of estimators is the same as seen in Random Forest. I will simply use the default base estimator, a decision tree of depth 1 (a stump), because AdaBoost works best with many weak learners, which a depth of 1 would theoretically achieve.
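
To make the idea of influence concrete, here is a sketch of a single boosting round in the textbook (discrete) form of AdaBoost. It is a simplified illustration under my own assumptions, and scikit-learn's implementation differs in its details; the `adaboost_round` helper and the ten-sample example are hypothetical.

```python
import numpy as np

def adaboost_round(weights, misclassified, learning_rate=1.0):
    """One textbook AdaBoost round: returns the learner's say and the new sample weights."""
    weights = np.asarray(weights, dtype=float)
    misclassified = np.asarray(misclassified, dtype=bool)
    error = weights[misclassified].sum() / weights.sum()
    # A learner with lower weighted error gets a larger say in the final vote
    alpha = learning_rate * 0.5 * np.log((1.0 - error) / error)
    # Misclassified samples gain weight, correctly classified samples lose weight
    new_weights = weights * np.exp(np.where(misclassified, alpha, -alpha))
    return alpha, new_weights / new_weights.sum()

# Ten equally weighted samples, of which the stump gets the first two wrong
weights = np.full(10, 0.1)
misclassified = np.array([True, True] + [False] * 8)

alpha, updated = adaboost_round(weights, misclassified)
slow_alpha, slow_updated = adaboost_round(weights, misclassified, learning_rate=0.1)

print("say of this learner:", round(float(alpha), 3))
print("weight now on the two mistakes:", round(float(updated[misclassified].sum()), 3))
print("same with learning rate 0.1:   ", round(float(slow_updated[misclassified].sum()), 3))
```

With a learning rate of 1 the two mistakes end up carrying half of the total weight, so the next stump is pushed hard towards them. A smaller learning rate shifts the weights more gently, which is why it usually needs more estimators to reach a similar fit.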

In [None]:
adaBoostModel = Pipeline([("adaBoost", AdaBoostClassifier(random_state=1))])

params = {
    'adaBoost__learning_rate' : [0.0001, 0.001, 0.01, 0.1, 1, 2],
    'adaBoost__n_estimators' : [10, 50, 100, 200, 500, 1000]
}

adaBoostGrid = GridSearchCV(adaBoostModel, params, cv=5,verbose=True,n_jobs=-1,scoring='f1')
adaBoostGrid.fit(X_train,y_train)

In [None]:
print("The parameters for the best score the grid search returned were: " + str(adaBoostGrid.best_params_))
print("Best score the grid search returned was: "+ str(adaBoostGrid.best_score_))
print("\n\n")
print("Scores on training data:")
y_pred = adaBoostGrid.predict(X_train)
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
print("\n\n")
print("Scores on testing data:")
y_pred = adaBoostGrid.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test,y_pred))

The model is overfitting badly, which perhaps suggests the dataset has a lot of noise and the model is focusing on those noisy features/values over the important parts of the dataset. The model may not be appropriate for the problem without applying further performance-improving techniques/algorithms or modifications to the dataset. A potential cause of this severe overfitting is that a negative is generally much easier to predict than a positive using this dataset, as has been the case for all 3 models I have trained thus far. Consequently, some weak learners may actually perform well on predicting the negative cases, which could harm AdaBoost's effectiveness. This could be further evidenced by the almost equal number of false negatives and true negatives within the confusion matrix for the test data: the model is essentially as accurate as a 50/50 guess when predicting negative outcomes.

I chose mostly low candidate learning rates with one risk in mind: a model that changes too greatly at each boosting round can become inaccurate. Judging by the results this did not help, as the best learning rate returned by the grid search was 1, which is a somewhat standard learning rate. This is a reasonable middle ground between too small and too large, letting the model adjust at a comfortable rate; too low a learning rate needs more estimators, and so more computing power, to reach a similar outcome.

The 500 estimators being selected suggests this is around the cut off point where any added estimators add very little to the model, in a similar fashion to Random Forest.


The final scores on the test data from the model are the lowest seen yet, whilst the scores on the training data are the highest. Again, this shows an extreme case of overfitting and suggests AdaBoost might not be optimal for the dataset.

# Comparison of the SVM and Random Forest Models

## SVM Final Model's Performance on Train and Test Data:

In [None]:
print("Scores on training data:")
y_pred = optimalSVMModel.predict(X_train)
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
y_pred = optimalSVMModel.predict(X_test)
print("\n\n")
print("Scores on testing data:")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

## Random Forest Final Model's Performance on Train and Test Data:

In [None]:
y_pred = finalRandomForestModel.predict(X_train)
print("Scores on training data:")
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
y_pred = finalRandomForestModel.predict(X_test)
print("\n\n")
print("Scores on testing data:")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

## Comparison:

<strong>Comparing the Scores of the Two Models:</strong>

I will begin my comparison by looking at the performance of the two tuned models. The linear SVM has a better average performance on the test data in precision, recall and f1-score, so it is fair to say it classifies unseen data from this dataset more accurately. The performance of both models, however, is somewhat unimpressive, as they both score around 80 for all 3 performance metrics of precision, recall and f1-score. It is interesting to note that the Random Forest suffers severely from overfitting: its scores on the training data come in at a very acceptable 90 or so but drop to the 70s on the test data. If I had managed to reduce this overfitting, it is very possible the Random Forest would outperform the SVM, so it is a shame I was unable to do so.
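
The train/test comparison described here can be summarised as a single gap per model. The sketch below assumes synthetic data and untuned stand-in models (not `optimalSVMModel` or `finalRandomForestModel`). The helper name `macro_scores` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def macro_scores(model, X, y):
    """Return macro-averaged precision, recall and f1 for a fitted model."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y, model.predict(X), average="macro", zero_division=0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Synthetic, noisy stand-in data (assumption: NOT the diabetes dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Untuned stand-ins for the two final models
models = {
    "linear SVM": SVC(kernel="linear", C=1.0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

gaps = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    train = macro_scores(model, X_tr, y_tr)
    test = macro_scores(model, X_te, y_te)
    gaps[name] = train["f1"] - test["f1"]  # a large positive gap hints at overfitting
    print(f"{name}: train f1={train['f1']:.2f}, test f1={test['f1']:.2f}")
```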

It could be valid to say that the SVM model is capturing more of a macro perspective on the problem whilst the Random Forest is capturing more of a micro perspective by overfitting on the training data.

<strong>Tuning Process of Both Models:</strong>

One benefit of the SVM model is that the linear kernel only requires one parameter to be tuned, so it is much easier to reach the optimal performance of that model. Random Forest, on the other hand, has many parameters that affect model performance, meaning the search for the optimal parameters can be far more exhaustive. Another perspective, however, is that the many parameters of Random Forest make it a much more adaptable algorithm for different datasets. The many kernels available for SVM give it this adaptability too, but with the linear kernel it can be considered somewhat restricted. It is worth noting that even with a polynomial or RBF kernel, SVM has far fewer important parameters to tune for performance improvement than Random Forest.
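
The difference in search effort can be illustrated by counting grid combinations with `ParameterGrid`. Both grids below are hypothetical examples, not the grids used earlier in this notebook. The single linear SVM parameter is assumed to be the regularisation parameter `C`.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid for a linear SVM: a single parameter
svm_grid = {"C": [0.01, 0.1, 1, 10, 100]}

# Hypothetical grid for a Random Forest: several interacting parameters
rf_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5, 10],
    "max_features": ["sqrt", "log2"],
}

n_svm = len(ParameterGrid(svm_grid))  # 5 combinations
n_rf = len(ParameterGrid(rf_grid))    # 3 * 3 * 3 * 2 = 54 combinations

cv_folds = 5  # each combination is fitted once per fold
print(f"Linear SVM: {n_svm} combinations, {n_svm * cv_folds} fits")
print(f"Random Forest: {n_rf} combinations, {n_rf * cv_folds} fits")
```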

<strong>Advantages of SVM:</strong>
- The model is very versatile and can work well with a lot of datasets.
- Relatively simple model to understand.
- Easy to tune due to the small number of important parameters.
- Linear SVM is fast to train, especially compared to other kernel options.
- SVM can handle many dimensions of data relatively well depending on the kernel. (For high dimensions, RBF is generally considered.)

<strong>Disadvantages of SVM:</strong>
- Struggles with large datasets for reasons such as the time complexity.
- Sensitive to noise within the dataset. (Such as in the diabetes dataset where positive and negative outcomes often overlap.)

<strong>Advantages of Random Forest:</strong>
- There are ways to measure how much each feature contributes to reducing impurity across the forest, meaning you can effectively discard irrelevant features, which helps with feature selection.
- The model is very versatile and can work well with a lot of datasets.
- Relatively simple model to understand.
- Can work for both classification and regression.
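
The feature-importance advantage listed above can be sketched with the `feature_importances_` attribute of a fitted forest. The example assumes synthetic data in which only the first three columns are informative. The column names are placeholders rather than the diabetes columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the first 3 columns are informative,
# the remaining 3 are pure noise (assumption: NOT the diabetes dataset)
X_demo, y_demo = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=3,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
feature_names = [f"feature_{i}" for i in range(6)]  # placeholder names

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances, normalised so that they sum to 1
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Features with importances close to zero are candidates for removal, although impurity-based importances are only one view and could be cross-checked with another method.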

<strong>Disadvantages of Random Forest:</strong>
- If the model requires a lot of estimators to be effective, it can become ineffective for real time classification due to the time complexity.
- With the diabetes dataset it appears to suffer from overfitting; perhaps this can be overcome with better feature selection.
- Sensitive to noise within the dataset.


<strong>Conclusion:</strong>

To conclude, both models have their uses, and many of these uses overlap; for example, both are often used to gain better insight into the data rather than as a final model. In this case the SVM outperformed the Random Forest and should theoretically run faster because it uses the linear kernel. That being said, the Random Forest was perhaps focusing on the noise in the dataset, so with some further pre-processing it could very well outperform the SVM. One example of such pre-processing would be making sure there are equal numbers of rows with negative and positive outcomes.
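
Balancing the outcome classes could be sketched as downsampling the majority class with `sklearn.utils.resample`. The tiny DataFrame below is made up for illustration, and the label column name `Outcome` is an assumption about the dataset.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up rows for illustration; "Outcome" is the assumed label column
demo = pd.DataFrame(
    {
        "Glucose": [85, 90, 99, 105, 110, 95, 150, 168, 183],
        "Outcome": [0, 0, 0, 0, 0, 0, 1, 1, 1],
    }
)

majority = demo[demo["Outcome"] == 0]
minority = demo[demo["Outcome"] == 1]

# Randomly keep only as many majority rows as there are minority rows
majority_down = resample(
    majority, replace=False, n_samples=len(minority), random_state=0
)

# Recombine and shuffle so the classes are interleaved
balanced = (
    pd.concat([majority_down, minority])
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)
print(balanced["Outcome"].value_counts())
```

Balancing would normally be applied to the training split only, so the test data still reflects the real class proportions. An alternative that keeps every row is passing `class_weight="balanced"` to `SVC` or `RandomForestClassifier`.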

# Bibliography:

- Hsu, C.-W., Chang, C.-C. and Lin, C.-J. (2016) A Practical Guide to Support Vector Classification. *Department of Computer Science, National Taiwan University* <strong>[online]</strong>. [Accessed 15/03/2023].