# Discussion of Results

## 1. Performance Metrics for Classifier

Given the highly imbalanced nature of our dataset, we decided against using accuracy as a performance metric: a classifier could achieve a very high accuracy score simply by predicting that every observation belongs to the majority class (positive reviews). Since we cared equally about our topic models' ability to detect positive and negative sentiment, accuracy alone would have been misleading.
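
The effect can be illustrated with a minimal sketch. The 90/10 class split below is a hypothetical figure chosen for illustration, not the actual proportion in our dataset, and the helper names are our own.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_for_class(y_true, y_pred, cls):
    """Fraction of the observations in class `cls` that were predicted as `cls`."""
    predictions_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
    return sum(p == cls for p in predictions_for_cls) / len(predictions_for_cls)


# Hypothetical labels: 1 = positive review (majority), 0 = negative review.
y_true = [1] * 90 + [0] * 10

# A "classifier" that always predicts the majority class.
y_pred = [1] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)           # high, despite learning nothing
negative_recall = recall_for_class(y_true, y_pred, 0)  # no negative review is detected

print(baseline_accuracy, negative_recall)
```

The baseline scores 90% accuracy here while detecting none of the negative reviews, which is the failure mode that accuracy hides.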

Initially, we investigated the ROC curve, which shows sensitivity and specificity at all possible thresholds. If a point on the curve represents the right trade-off, the threshold that corresponds to that point can be chosen. However, some literature argues that models trained on imbalanced datasets may appear to perform well on an ROC curve yet perform poorly on the precision-recall curve [The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets, 2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4349800/).
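
As a rough sketch of how a threshold can be read off an ROC curve, the snippet below computes one ROC point per candidate threshold on toy data. The labels, scores and function names are hypothetical. Using Youden's J (TPR minus FPR) to pick the point is an assumption made for illustration; it is only one possible definition of "the right trade-off" and not necessarily the criterion used in this project.

```python
def roc_points(y_true, scores):
    """Return (threshold, fpr, tpr) for each distinct score used as a threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p)
        fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def best_threshold(points):
    """Pick the threshold that maximises Youden's J = TPR - FPR (assumed criterion)."""
    return max(points, key=lambda point: point[2] - point[1])[0]


# Toy data: 1 = positive, 0 = negative; scores are predicted probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.5, 0.7, 0.9]

points = roc_points(y_true, scores)
chosen = best_threshold(points)

for threshold, fpr, tpr in points:
    print(f"threshold={threshold:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
print("chosen threshold:", chosen)
```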

When evaluated by the ROC curve alone, the Text and Summaries Word2Vec model was the winner at all thresholds, as its curve lies entirely above the curves of all the other models. It therefore also had the highest AUC score. [The Relationship Between Precision-Recall and ROC Curves, 2006](https://www.biostat.wisc.edu/~page/rocpr.pdf) states that if a curve dominates in ROC space, it also dominates in PR space, so we did not plot the precision-recall curve. Given the dimensionality of the features (Word2Vec had 100 whereas LDA had 58), we expected the Word2Vec model to outperform LDA.
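
A minimal sketch of what "dominates" means here: one ROC curve dominates another if its TPR is at least as high at every FPR, which in turn means its AUC cannot be lower. The two curves below are made-up points, not our models' results, and checking on a fixed FPR grid with linear interpolation is a simplification.

```python
def tpr_at(curve, fpr):
    """Linearly interpolate the TPR of `curve` (list of (fpr, tpr), sorted by fpr) at `fpr`."""
    candidates = []
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if x0 <= fpr <= x1:
            if x1 == x0:  # vertical segment: take its upper end
                candidates.append(max(y0, y1))
            else:
                candidates.append(y0 + (y1 - y0) * (fpr - x0) / (x1 - x0))
    return max(candidates)


def dominates(curve_a, curve_b, grid_size=101):
    """True if curve_a is at or above curve_b at every FPR on an evenly spaced grid."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return all(tpr_at(curve_a, x) >= tpr_at(curve_b, x) for x in grid)


def auc(curve):
    """Area under a piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(curve, curve[1:]))


# Hypothetical ROC curves as (FPR, TPR) points from (0, 0) to (1, 1).
curve_a = [(0.0, 0.0), (0.1, 0.6), (0.5, 0.9), (1.0, 1.0)]
curve_b = [(0.0, 0.0), (0.2, 0.4), (0.6, 0.8), (1.0, 1.0)]

print(dominates(curve_a, curve_b), auc(curve_a), auc(curve_b))
```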


## 2. LDA Model (N-grams vs no n-grams)

During LDA model development, two models and corresponding sets of predictions were produced. The aim was to see whether adapting the text to concatenate common bigrams would improve the topic model's performance. In the performance evaluation, the model that did not include bigrams performed better. However, due to the project's time constraints, hyperparameter tuning was only carried out for the no-bigram model, so it is not possible to attribute the difference specifically to either hyperparameter tuning or the exclusion of bigrams.

We can conclude that at least one of these changes (hyperparameter tuning or not including bigrams) improved the model, and that the tuned no-bigram model would give the better result if its predictions were used in the real world. More investigation would be required to establish which change was responsible.

## 3. LDA Model (200K vs 3 million on text and summaries)

Interestingly, although the ROC curve for the LDA model run on the full dataset (~3 million) lay above the curve for LDA on 200K documents at all thresholds, the two were not that dissimilar in performance. We expected a greater difference, given that the full-dataset model was trained on over ten times as many documents. This surprising result may indicate that the performance gain from a larger training set levels off somewhere between 200K and 3 million documents for LDA trained on the texts and summaries.

## 4. Review Summaries vs Review Summaries & Text

For the word2vec models, the ROC curve of the review summaries and text model lay above the review summaries curve at all thresholds, so we can conclude that it dominates and therefore performs better. This makes sense, as the model is provided with more words to help analyse the sentiment.

On the other hand, for the LDA models, the ROC curve of the review summaries and text model lay above the review summaries curve for most thresholds. However, when zooming in on the corners of the ROC graph, we see some overlap: at very high and very low thresholds, the model trained on just the summaries outperforms the text and summaries model. We therefore decided to investigate the F1 score, which is the harmonic mean of precision and recall and therefore weights them equally. On the F1 graph, the curve for the LDA model trained on just the summaries lay above the curve for the LDA model trained on both summaries and review text. As we wanted to weight precision and recall equally, this implies that for the LDA model (200k samples), sentiment analysis on just the review summaries performs better.
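
For reference, the F1 score at a given threshold can be computed as in the sketch below, and evaluating it over a range of thresholds gives an F1 curve of the kind compared above. The toy labels and scores are hypothetical and the function name is our own. Returning 0.0 when there are no true positives is a convention assumed here to avoid division by zero.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 = harmonic mean of precision and recall, predicting positive when score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    if tp == 0:
        return 0.0  # assumed convention when precision or recall is undefined or zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: 1 = positive, 0 = negative; scores are predicted probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.5, 0.7, 0.9]

# One F1 value per candidate threshold.
f1_curve = [(t, f1_at_threshold(y_true, scores, t)) for t in sorted(set(scores))]
best_threshold, best_f1 = max(f1_curve, key=lambda pair: pair[1])

print(f1_curve)
print(best_threshold, best_f1)
```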

## 5. Limitations of Our Project

We recognise that our project is not fully polished and there are limitations to the conclusions we have drawn.
* The word2vec model could not be trained on the full dataset, so we cannot conclude how well it would generalise to the full population. That said, as the LDA model trained on the HPC on the full dataset performed slightly better than on the sample dataset, we would expect a similar trend for word2vec.
* As a consequence of setting the min_word_count to three for the word2vec model, words that appear fewer than three times in the training set are discarded. Hence, if every word in a review/summary is that rare, all of them are discarded and we are unable to apply word2vec to that review/summary.
* We did not successfully address negation: a word such as 'good' has its meaning reversed when it is preceded by 'not', which can drastically alter the sentiment of a piece of text.
* We noticed a few spelling errors in the reviews, as well as unusual spellings (e.g., 'bo-o-o-o-ring'), which will have affected our models' performance. We did not manage to find a way to address this during the data cleaning step. As this is a common issue in many datasets (such as Tweets and customer support chats), established solutions are likely to exist.
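
The minimum word count limitation can be reproduced with a small sketch. This mimics the idea of a frequency cut-off in plain Python and is not the word2vec library's own implementation. The toy reviews and the helper names `build_vocab` and `filter_docs` are hypothetical.

```python
from collections import Counter


def build_vocab(tokenised_docs, min_count):
    """Keep only words that occur at least `min_count` times across all documents."""
    counts = Counter(token for doc in tokenised_docs for token in doc)
    return {token for token, count in counts.items() if count >= min_count}


def filter_docs(tokenised_docs, vocab):
    """Drop out-of-vocabulary words from every document."""
    return [[token for token in doc if token in vocab] for doc in tokenised_docs]


# Toy tokenised reviews; the last one contains only a rare, unusually spelled word.
docs = [
    ["great", "book"],
    ["great", "read"],
    ["great", "book", "loved", "it"],
    ["book", "was", "fine"],
    ["bo-o-o-o-ring"],
]

vocab = build_vocab(docs, min_count=3)
filtered = filter_docs(docs, vocab)

# Reviews with no remaining words cannot be given a vector representation.
empty_ids = [i for i, doc in enumerate(filtered) if not doc]

print(sorted(vocab), empty_ids)
```

Counting the empty documents for several cut-off values would show how many reviews are lost as the minimum count increases.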

## 6. Future Work

Given the time constraints and difficulties we encountered, there are numerous avenues for further exploration with our data. Some of them are listed below:
* Using Bluecrystal HPC would have massively reduced computing time and would have allowed us to train our models on a larger dataset. We hypothesise that our models would perform better on larger datasets due to the nature of how topic modelling works.
* We could try combining the word 'not' with the adjective that follows it, and check whether this improves sentiment analysis and leads to better predictions.
* Implementing cross validation when using grid search for the LDA model. This may have allowed us to choose a more suitable number of topics.
* Investigate different hyperparameter tuning methods such as Bayesian optimisation.
* Investigate other negation detection methods to improve our sentiment analysis for LDA.
* To further improve our classification model, we could incorporate other features of the data, such as the review helpfulness score, which could act as a weighting for each review.
* We could tune the hyperparameters of the classification model in two stages: first use random search over a large range of values, then use grid search on the reduced parameter grid to find an accurate optimum combination.
* Investigating the word_min_count to see how it affects the performance and the number of reviews that the word2vec model can't predict. We expect a trade-off here.
* Another avenue for exploration would be investigating the trade-off between runtime and performance, as both of our models were time-intensive to run.
* Finally, it could have been interesting to explore different genres of books and see whether certain genres elicit more polarised reviews than others.
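
As a starting point for the negation idea above, a minimal sketch of merging 'not' with the following token is given below. It is an untested suggestion rather than something we implemented. The function name `merge_negations` is hypothetical, and it merges 'not' with whatever token follows, since restricting the merge to adjectives would additionally require part-of-speech tagging.

```python
def merge_negations(tokens, negators=("not",)):
    """Join each negator with the token that follows it, e.g. ['not', 'good'] -> ['not_good']."""
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] in negators and i + 1 < len(tokens):
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip the token that was absorbed into the merged form
        else:
            merged.append(tokens[i])
            i += 1
    return merged


example = merge_negations("this book was not good".split())
print(example)
```

The merged token (here `not_good`) would then be treated as a separate vocabulary item from `good` by both the LDA and word2vec models.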