# Discussion of Results

## 1. Performance Metrics for Classifier

Given the highly imbalanced nature of our dataset, we decided against using accuracy as a performance metric. A very high accuracy score could be achieved simply by predicting that every observation belongs to the majority class (positive reviews). Since we cared equally about our topic models' ability to detect positive and negative sentiment, accuracy would have been a misleading measure.
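
The accuracy problem described above can be illustrated with a minimal sketch. The 90/10 class split below is an assumption chosen purely for illustration and is not the ratio in our dataset; the helper functions are likewise hypothetical, written in plain Python so the example is self-contained.

```python
# Illustrative labels only (assumed 90/10 split, not our real data):
# 1 = positive review (majority class), 0 = negative review.
y_true = [1] * 90 + [0] * 10

# A "classifier" that ignores its input and always predicts the majority class.
y_pred = [1] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of observations whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of observations with the given true label that were predicted as that label."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


baseline_accuracy = accuracy(y_true, y_pred)
positive_recall = recall(y_true, y_pred, label=1)
negative_recall = recall(y_true, y_pred, label=0)

print(f"accuracy:        {baseline_accuracy:.2f}")
print(f"positive recall: {positive_recall:.2f}")
print(f"negative recall: {negative_recall:.2f}")
```

In this toy setting the accuracy is high even though not a single negative review is detected, which is the behaviour we wanted our metric to penalise.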

Initially, we investigated the ROC curve, which shows sensitivity and specificity at all possible thresholds. If a point on the curve represents the right tradeoff, the threshold corresponding to that point can be chosen. However, some literature argues that models trained on imbalanced datasets may appear to perform well on an ROC curve, yet perform poorly when judged by the precision-recall curve [The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets, 2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4349800/).
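
As a rough sketch of how both curves are built from the same predictions, the snippet below computes sensitivity, specificity and precision at a few thresholds. The labels, scores and thresholds are made-up toy values and the function names are hypothetical; in practice a library routine such as scikit-learn's `roc_curve` or `precision_recall_curve` could be used instead.

```python
def confusion_counts(y_true, scores, threshold):
    """Return (tp, fp, tn, fn) when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def curve_points(y_true, scores, thresholds):
    """Compute the quantities behind the ROC and precision-recall curves."""
    points = []
    for threshold in thresholds:
        tp, fp, tn, fn = confusion_counts(y_true, scores, threshold)
        points.append({
            "threshold": threshold,
            # Sensitivity is the same quantity as recall (true positive rate).
            "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
            "specificity": tn / (tn + fp) if tn + fp else 0.0,
            # Convention (an assumption here): precision is 1.0 if nothing is predicted positive.
            "precision": tp / (tp + fp) if tp + fp else 1.0,
        })
    return points


# Toy data: 3 positive cases and 7 negative cases with hypothetical model scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.3, 0.2, 0.2, 0.1, 0.1]

points = curve_points(y_true, scores, thresholds=[0.25, 0.5, 0.75])
for point in points:
    print(point)
```

The ROC curve plots sensitivity against 1 - specificity, whereas the precision-recall curve plots precision against sensitivity, so the two can tell different stories when one class is much rarer than the other.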

When evaluating by the ROC curve alone, ................... was almost a winner at all thresholds, as most of its curve lies above the curves of all the other models. It therefore also had the highest AUC score. [The Relationship Between Precision-Recall and ROC Curves, 2006](https://www.biostat.wisc.edu/~page/rocpr.pdf) states that if a curve dominates in ROC space, it also dominates in PR space, which was also the case when we plotted it. Due to the .............., we expected the Word2Vec model to outperform LDA.

This prompted us to investigate the results further by looking more closely at recall and precision individually, where we found results closer to what we expected.

In conclusion, we decided to evaluate our models on their ......... curves. The aim of this was to ................... 

The F-measure is defined as the harmonic mean of precision and recall, so it summarises a model's performance while giving equal weight to precision and recall. Based on our results from the Precision-Recall curve, plotting the F-measure against threshold would give the BDT as the winner over KNN. [A Gentle Introduction to the Fbeta-Measure for Machine Learning](https://machinelearningmastery.com/fbeta-measure-for-machine-learning/) introduces a variation on the F-measure, the Fbeta-measure. The Fbeta-measure is the weighted harmonic mean of precision and recall, where beta controls the balance: a beta of 2 treats recall as twice as important as precision. Therefore, by increasing beta and plotting the Fbeta-measure against threshold, we will eventually approach the result of the Recall-Threshold curves. Thus, the choice of beta can give us the key to deciding our winner out of KNN and BDT.
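
To make the role of beta concrete, here is a minimal sketch of the Fbeta-measure, computed as `(1 + beta^2) * P * R / (beta^2 * P + R)` for precision `P` and recall `R`. The two precision/recall pairs below are hypothetical values chosen for illustration and are not results from our KNN or BDT models; scikit-learn's `fbeta_score` offers an equivalent calculation from labels and predictions.

```python
def fbeta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision more
    heavily, and beta = 1 gives the ordinary F-measure (F1).
    """
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Hypothetical operating points (not our experimental results).
high_precision_model = {"precision": 0.9, "recall": 0.5}
high_recall_model = {"precision": 0.5, "recall": 0.9}

for beta in (0.5, 1.0, 2.0, 100.0):
    score_a = fbeta(beta=beta, **high_precision_model)
    score_b = fbeta(beta=beta, **high_recall_model)
    print(f"beta={beta:>5}: high-precision={score_a:.3f}, high-recall={score_b:.3f}")
```

With beta = 1 the two hypothetical models tie, with beta = 2 the high-recall model scores higher, and for very large beta each score approaches that model's recall, which is why increasing beta moves the comparison towards the Recall-Threshold result.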

We hypothesise that logistic regression performed worse because it was trained on a downsampled training set (due to memory restrictions), where it did not get to see all of the positive cases.



## 2. LDA Model (N-grams vs no n-grams)

In the LDA model development, two models and corresponding sets of predictions were produced. The aim was to see whether adapting the text to concatenate common bigrams would improve the topic model's performance. In the performance evaluation, the model which did not include bigrams performed better. However, due to the time constraints of the project, hyperparameter tuning was only carried out for the no-bigram model, so it is not possible to attribute the result to either hyperparameter tuning or the absence of bigram concatenation alone.

We can conclude that at least one of these two factors improved the model, and that the tuned no-bigram model would give the better result if the predictions were to be used in the real world, but more investigation would be required to establish the true cause.

## 3. Review Summaries vs Review Summaries & Text

The model which was trained on the dataset with equal class ratios outperformed the model trained on the dataset with the preserved ratios. According to the ROC curve, it dominated at thresholds above 0.1.

We also looked at the recall plotted against the thresholds. This curve showed that the equal ratio model also outperformed when looking strictly at recall, supporting our conclusion that the equal ratio model is better suited to our data.

The training data with a 50/50 class split (KNN Equal) was produced by downsampling, i.e. reducing the number of training samples in the majority class. The risk of doing this is that discarding collected data can lose a lot of valuable information.
As the train-test split was done using stratified sampling, the class distributions in the training set and test set are the same. We would therefore expect the KNN model with the preserved class distribution to outperform. Our results are a little surprising; however, one possible explanation is that the model with the equal class ratio was exposed to a higher proportion of positive cases, so it could predict positive cases in the test set better.
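
The downsampling step described above can be sketched as follows. This is an illustrative stand-in rather than our actual preprocessing code: the function name, the toy rows and the 80/20 split are all assumptions.

```python
import random


def downsample_to_equal(rows, label_of, seed=0):
    """Randomly drop rows from larger classes until every class matches the smallest one."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)

    # Every class is reduced to the size of the smallest class.
    target_size = min(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target_size))
    rng.shuffle(balanced)
    return balanced


# Toy training set (assumed 80/20 split): (text, label) pairs.
rows = [(f"majority review {i}", 1) for i in range(80)]
rows += [(f"minority review {i}", 0) for i in range(20)]

balanced = downsample_to_equal(rows, label_of=lambda row: row[1])
discarded = len(rows) - len(balanced)

print(f"kept {len(balanced)} rows, discarded {discarded}")
```

In this toy example 60 of the 100 rows are thrown away to reach a 50/50 split, which shows how much information downsampling can cost.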

## 4. Limitations of Our Project

We recognise that our project is not fully polished and there are limitations to the conclusions we have drawn.
* Our models could not train on the full dataset hence we cannot predict how well they will generalise to the full population.
* For the logistic regression model, there was a trade off between information loss and dimensionality reduction when implementing PCA.
* The main limitation of logistic regression is the assumption of linearity between the dependent and independent variables.
* The selected KNN model used an approximate search algorithm (KD Tree) to save time. However, as the Mahalanobis distance metric is incompatible with KD Trees, the Euclidean distance was used. This is a limitation because the Euclidean distance does not account for correlation between variables.
* In the BDT models, the class imbalance weighting correction used was incorrect and, due to time constraints, could not be rectified. This means that the results are probably worse than they could have been.
* We allowed models to train on different datasets (the dataset with missing values and the imputed dataset), which prevented us from making fair comparisons.
* Our imputation method only achieved a root mean squared error of 0.33, which is not very good considering that the standardised data took values between 0 and 1. However, some columns had less than 5% missingness, so the effect of the imputation may be negligible.
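
To illustrate the distance-metric limitation noted in the KNN point above, the sketch below compares the Euclidean and Mahalanobis distances for two strongly correlated features. The covariance matrix and points are made-up values, and the two-dimensional Mahalanobis helper is a hypothetical function written out by hand to keep the example self-contained.

```python
import math


def euclidean(a, b):
    """Straight-line distance, which treats every feature as independent."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mahalanobis_2d(a, b, cov):
    """Mahalanobis distance for 2-D points given a 2x2 covariance matrix."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    dx, dy = a[0] - b[0], a[1] - b[1]
    # d^T * inverse(cov) * d, with the 2x2 matrix inverse written out explicitly.
    quadratic_form = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(quadratic_form)


# Assumed covariance: two unit-variance features with correlation 0.9.
cov = [[1.0, 0.9], [0.9, 1.0]]

origin = (0.0, 0.0)
along_trend = (1.0, 1.0)     # moves with the correlation
against_trend = (1.0, -1.0)  # moves against the correlation

for name, point in (("along", along_trend), ("against", against_trend)):
    print(
        f"{name:>7}: euclidean={euclidean(origin, point):.3f}, "
        f"mahalanobis={mahalanobis_2d(origin, point, cov):.3f}"
    )
```

Both points are equally far from the origin under the Euclidean distance, but the Mahalanobis distance treats the point that goes against the correlation as much further away, which is the information a Euclidean KNN search cannot use.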

## 5. Future Work

Given the time constraints and difficulties we encountered, there are numerous avenues for further exploration with our data. Some of them are listed below:
* Using Bluecrystal HPC would have massively reduced computing time and would have allowed us to train our models on a larger dataset. We hypothesise that our models would perform better on larger datasets due to the nature of how topic modelling works.
* Implementing cross validation when using grid search for the LDA model. This may have allowed us to choose a more suitable number of topics.
* Investigating different hyperparameter tuning methods such as Bayesian optimisation.
* Investigating other negation detection methods to improve our sentiment analysis for LDA.
* Incorporating other features of the data to further improve our classification model, such as the review helpfulness score, which could act as a weighting for each review.