# Discussion of Results

## Performance metric choice

Given the highly imbalanced nature of our dataset, we decided against using accuracy as a performance metric, because we could achieve a very high accuracy score simply by predicting that every observation belongs to the majority class (no heart disease). However, since we care more about correctly predicting the heart disease class (1), such a model would not be useful.
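To make the point concrete, the sketch below scores a model that always predicts the majority class. The 95/5 class split is a hypothetical illustration, not the ratio in our dataset, and the helper names are ours.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model recovers."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical labels: 95 patients without heart disease (0), 5 with (1).
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_majority = [0] * len(y_true)

majority_accuracy = accuracy(y_true, y_majority)
majority_recall = recall(y_true, y_majority)
print(f"accuracy={majority_accuracy:.2f}, recall={majority_recall:.2f}")
```

The accuracy looks excellent while the recall on the positive class is zero, which is why accuracy alone would mislead us here.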

Furthermore, in the context of our problem, we have been given a budget of £X million for heart disease screening, and we must decide what threshold (the predicted probability above which our model classifies a patient as positive for heart disease) is appropriate, given that the estimated cost of heart disease screening (a cholesterol test) is £Y per patient. We decided to prioritise minimising false negatives and to focus on the positive class, as we did not want to miss anyone who has heart disease.
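One way to link the budget to a threshold is to pick the lowest threshold at which the number of flagged patients is still affordable, since a lower threshold means fewer false negatives. The following is a minimal sketch under that assumption; the function name, the example scores, and the budget and cost figures are hypothetical placeholders, not the values of £X and £Y.

```python
def threshold_for_budget(scores, budget, cost_per_patient):
    """Return the lowest threshold whose flagged patients fit within the budget.

    A patient is flagged for screening when score >= threshold.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    max_patients = int(budget // cost_per_patient)
    if max_patients <= 0:
        return float("inf")  # cannot afford to screen anyone
    if max_patients >= len(scores):
        return min(scores)  # can afford to screen every patient
    ranked = sorted(scores, reverse=True)
    threshold = ranked[max_patients - 1]
    # Ties at the cut-off could push us over budget; step above them if so.
    if sum(s >= threshold for s in scores) > max_patients:
        higher = [s for s in ranked if s > threshold]
        threshold = min(higher) if higher else float("inf")
    return threshold


# Hypothetical predicted probabilities for five patients.
example_scores = [0.9, 0.8, 0.35, 0.3, 0.1]
# Hypothetical budget of 300 and cost of 100 per test: three tests affordable.
example_threshold = threshold_for_budget(example_scores, budget=300, cost_per_patient=100)
flagged = sum(s >= example_threshold for s in example_scores)
print(example_threshold, flagged)
```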

We initially investigated the ROC curve, which shows sensitivity and specificity at every possible threshold. If a point on the curve represents the right trade-off, we can choose the threshold that corresponds to it. However, some literature argues that models trained on imbalanced datasets may appear to perform well on a ROC curve yet perform poorly on the precision-recall curve [The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets, 2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4349800/).
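As an illustration of how the two views can differ, the sketch below computes sensitivity, specificity and precision at each threshold for a small, hypothetical imbalanced sample. The labels and scores are made up for the example and the function names are ours.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, FN, TN when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 0 and predicted == 1:
            fp += 1
        elif label == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def curve_points(y_true, scores):
    """Metrics at every distinct score, used as a threshold (highest first)."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp, fp, fn, tn = confusion_counts(y_true, scores, threshold)
        points.append({
            "threshold": threshold,
            "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # recall / TPR
            "specificity": tn / (tn + fp) if tn + fp else 0.0,  # 1 - FPR
            "precision": tp / (tp + fp) if tp + fp else 1.0,
        })
    return points


# Hypothetical sample: 2 positives among 10 patients.
y_true = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
points = curve_points(y_true, scores)
at_07 = next(p for p in points if p["threshold"] == 0.7)
print(at_07)
```

At the 0.7 threshold the ROC view looks strong (sensitivity 1.0, specificity 0.875), while the precision is only 2/3 because a single false positive weighs heavily against two true positives.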

That being said, when evaluating by the ROC curve alone, the linear regression model (baseline) was the clear winner, as every part of its curve lies above the curves of all the other models. It therefore also had the highest AUC score. [The Relationship Between Precision-Recall and ROC Curves, 2006](https://www.biostat.wisc.edu/~page/rocpr.pdf) states that a curve dominates in ROC space if and only if it dominates in PR space. Therefore, despite the class imbalance, there was no need to investigate the precision-recall curve, as the linear model dominates at every threshold.
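The dominance argument can be checked numerically. The sketch below is one possible approach, assuming each ROC curve is available as a list of (FPR, TPR) points: it computes the AUC with the trapezoidal rule and compares two curves on a grid of FPR values. The curves are hypothetical, and a grid check only approximates dominance along the whole curve.

```python
def auc(points):
    """Trapezoidal area under a curve given as (fpr, tpr) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def tpr_at(points, fpr):
    """Linearly interpolate the TPR of a ROC curve at a given FPR."""
    pts = sorted(points)
    candidates = []
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= fpr <= x1:
            if x1 == x0:  # vertical segment: take its top
                candidates.append(max(y0, y1))
            else:
                candidates.append(y0 + (y1 - y0) * (fpr - x0) / (x1 - x0))
    if not candidates:
        raise ValueError("fpr outside the range covered by the curve")
    return max(candidates)


def dominates(curve_a, curve_b, grid_size=101, tol=1e-9):
    """True if curve_a is at or above curve_b at every grid FPR value."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return all(tpr_at(curve_a, f) >= tpr_at(curve_b, f) - tol for f in grid)


# Hypothetical ROC curves as (fpr, tpr) points.
curve_a = [(0.0, 0.0), (0.1, 0.6), (0.4, 0.9), (1.0, 1.0)]
curve_b = [(0.0, 0.0), (0.2, 0.4), (0.6, 0.8), (1.0, 1.0)]
print(auc(curve_a), auc(curve_b), dominates(curve_a, curve_b))
```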

This surprised us, as the literature we read indicated that linear regression models generally perform badly when the target variable is binary [Why Linear Regression is not suitable for classification?, 2021](https://medium.com/analytics-vidhya/why-linear-regression-is-not-suitable-for-classification-cd724dd61cb8#:~:text=There%20are%20two%20things%20that,new%20data%20points%20are%20added.). Given the high dimensionality and imbalance of the data, we expected the boosted decision tree to outperform the other models.

It is possible that the dataset has an underlying linear relationship; however, we would then expect logistic regression to outperform the multiple linear regression model.
We hypothesise that logistic regression performed worse because it was trained on a smaller training set (due to memory restrictions), whereas the linear regression was trained on the full dataset.

We had planned to investigate a stacking model, in which the probabilities output by each of the models are fed as features to a second-level model, to see whether it performed better. However, as the linear regression model dominated every other model, stacking would be unlikely to improve performance. We were also unsure whether the predictions made by the models, or the errors in those predictions, were uncorrelated, so stacking might not have been appropriate.
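A simple way to check the second concern would be to measure the correlation between the prediction errors of two models before stacking them. The sketch below uses the Pearson correlation; the labels and predicted probabilities are hypothetical and the function names are ours.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def prediction_errors(y_true, probabilities):
    """Signed error of each predicted probability against the true label."""
    return [p - t for t, p in zip(y_true, probabilities)]


# Hypothetical labels and predicted probabilities from two models.
y_true = [1, 0, 1, 0]
model_a = [0.8, 0.3, 0.6, 0.1]
model_b = [0.7, 0.4, 0.5, 0.2]

error_correlation = pearson(
    prediction_errors(y_true, model_a),
    prediction_errors(y_true, model_b),
)
print(round(error_correlation, 3))
```

A correlation close to 1, as in this made-up example, would suggest that the two models make the same mistakes, so a stacked model would have little new information to combine.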


### Boosted decision tree (imputed vs missing data training sets)

..

### KNN (equal ratio vs preserved ratio downsampling)

The model which was trained on the dataset with equal class ratios outperformed the model trained on the dataset with the preserved ratios. According to the ROC curve, it dominated at thresholds above 0.1.

We also plotted recall against the threshold. This curve showed that the equal ratio model also outperformed when judged strictly on recall, supporting our conclusion that the equal ratio model is better suited to our data.

The training data with a 50/50 class split (KNN Equal) was produced by downsampling, that is, by reducing the number of training samples in the majority class. The risk of doing this is that removing collected data can discard a lot of valuable information.
As the train-test split was done using stratified sampling, the class distributions in the training set and the test set are equal. We would therefore expect the KNN model with the preserved class distribution to outperform. Our results are a little surprising; however, the model with the equal class ratio was exposed to more positive cases, so it could predict positive cases in the test set better.
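The difference in exposure to positive cases can be seen in a small sketch of the two downsampling strategies. It assumes both training sets have the same total size, which may not match our actual setup, and the class counts and function name are hypothetical.

```python
import random


def downsample(labels, n_total, equal_ratio, seed=0):
    """Return indices of a downsampled training set of size n_total.

    equal_ratio=True gives a 50/50 class split; equal_ratio=False keeps
    the original class proportions.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if equal_ratio:
        n_pos = n_total // 2
    else:
        n_pos = round(n_total * len(positives) / len(labels))
    n_neg = n_total - n_pos
    if n_pos > len(positives) or n_neg > len(negatives):
        raise ValueError("not enough samples in one of the classes")
    return rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)


# Hypothetical data: 100 positives and 900 negatives.
labels = [1] * 100 + [0] * 900

equal_idx = downsample(labels, n_total=100, equal_ratio=True)
preserved_idx = downsample(labels, n_total=100, equal_ratio=False)

equal_positives = sum(labels[i] for i in equal_idx)
preserved_positives = sum(labels[i] for i in preserved_idx)
print(equal_positives, preserved_positives)
```

With these made-up counts the equal ratio set contains 50 positive cases against 10 in the preserved ratio set, which is consistent with the explanation above.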

## Limitations of our project

We recognise that our project is not fully polished and there are limitations to the conclusions we have drawn.
* Some of our models (KNN and logistic regression) could not be trained on the full dataset, so we cannot predict how well they will generalise to the full population.
* For the logistic regression model, there was a trade off between information loss and dimensionality reduction.
* The selected KNN model used a tree-based neighbour search (KD Tree) to save time. However, as the Mahalanobis distance metric is incompatible with KD Trees, the Euclidean distance was used. This is a limitation, as the Euclidean distance does not account for correlation between variables.
* The logistic regression model is not easy to interpret and is sensitive to the scale of
* We allowed models to train on different datasets (dataset with missing values and the imputed dataset). This did not allow us to make fair comparisons.
* Our imputation method only achieved a root mean squared error of 0.33, which is not very good considering that the standardised data took values between 0 and 1. However, some columns had less than 5% missingness, so the effect of the imputation on them may be negligible.
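For context on the imputation error in the last point, the sketch below shows one way such an RMSE can be measured: hide some known values, impute them, and compare against the truth. Mean imputation is used purely as a stand-in and is not necessarily the method we applied; the column values are hypothetical.

```python
import math


def rmse(true_values, imputed_values):
    """Root mean squared error between true and imputed values."""
    squared = [(t - p) ** 2 for t, p in zip(true_values, imputed_values)]
    return math.sqrt(sum(squared) / len(squared))


def mean_impute(column):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


# Hypothetical column scaled to [0, 1], with two known values hidden.
full_column = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
hidden = [1, 4]
masked = [None if i in hidden else v for i, v in enumerate(full_column)]

imputed = mean_impute(masked)
imputation_rmse = rmse(
    [full_column[i] for i in hidden],
    [imputed[i] for i in hidden],
)
print(round(imputation_rmse, 3))
```

On a 0 to 1 scale, an error of this size means a typical imputed value is off by roughly a third of the full range.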

## Future works

Given the time constraints and difficulties we encountered, there are numerous avenues for further exploration with our data. Some of them are listed below:
* Using Bluecrystal HPC would have massively reduced computing time, in particular for the boosted decision tree. Furthermore, for the KNN and logistic regression models, using Bluecrystal HPC could allow us to train on the whole dataset rather than downsampling.
* Implementing cross validation and hyperparameter tuning on all methods to try to improve performance.
* Investigating different hyperparameter tuning methods, such as Bayesian optimisation.
* Investigating other imputation methods and comparing their performance (RMSE).
* For future iterations of the logistic regression model, we would like to perform PCA on the dataset and then reduce its dimensionality by selecting the number of components using cross validation.