<img src="GA_logo.svg" style="float: left; margin: 20px">

#  Book Reviews Capstone Project: Overview


*Delphine Defforey*

___


## Executive Summary 
_______
### Project Goals

<font color=navy>
    Book ratings give an incomplete picture of what consumers think of books, which can make it difficult for publishers and retailers to improve their marketing strategies. The main goal of this project is to identify common topics in text reviews of books using topic modelling, and to determine whether promoting books with distinct, genre-specific marketing strategies would be worthwhile. A secondary goal of the project is to train and test a topic model capable of predicting the main topics of text reviews.<br><br>
    The findings of this project may be of business interest to publishers and retailers: should distinct topics be identified across book genres, they could enable more effective marketing strategies for different books. For example, fantasy readers may care more about character personalities than sci-fi readers, and marketing materials for fantasy books could be adjusted accordingly. Conversely, if review topics do not reflect book genres, this could indicate that investing resources in genre-specific promotional materials is an unnecessary expense.
    </font>

### Datasets

<font color=navy>
     This project combines data from three sources:<br>(1) a dataset scraped from the LibraryThing website (librarything.com) by Prof. Julian McAuley (University of California, San Diego) and colleagues<sup>1, 2</sup>, which can be downloaded from Prof. McAuley's <a href="https://cseweb.ucsd.edu/~jmcauley/datasets.html#social_data" target="_blank">website</a>.<br>(2) additional information (book titles, authors, ISBNs) that I scraped from the <a href="https://www.librarything.com/" target="_blank">LibraryThing website</a>.<br>(3) book genre information that I collected using the <a href="https://www.goodreads.com/api/index" target="_blank">Goodreads API</a>.<br><br> The notebooks are organised accordingly: the first imports the LibraryThing dataset, the second scrapes the LibraryThing website, the third collects book genre information using the Goodreads API, the fourth combines the datasets and covers exploratory data analysis, and the fifth covers modelling.
</font>

### EDA, Modelling and Analysis Overview

<font color=navy>
     I use three main approaches to visualise my data. First, I use a bar chart to show the number of reviews for each book genre. Second, I use a separate wordcloud for each book genre to get an overview of the words commonly found in its reviews. Third, I generate bar plots of individual word counts after cleaning, tokenization and lemmatization.<br>
    
   The topic model used for my analysis is Latent Dirichlet Allocation (LDA)<sup>3</sup>, and I visualise model outputs using the pyLDAvis library. The analysis also includes a sentiment polarity analysis using VADER (Valence Aware Dictionary and sEntiment Reasoner) to gauge how positive or negative the book reviews are.<br>
   
   More detailed information on the EDA can be found in notebook #4 (Data cleaning and EDA), and additional information on the analysis can be found in notebook #5 (Modelling and analysis).
</font>
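
The word-count step described above can be sketched with the standard library alone. This is a minimal illustration, not the code from notebook #4: the stop-word list, the function names and the sample reviews are all hypothetical, and lemmatization is left out for brevity.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption, not the project's list).
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "was", "it", "this", "i", "to", "in"}


def tokenize(review):
    """Lowercase a review and keep alphabetic tokens that are not stop words."""
    tokens = re.findall(r"[a-z]+", review.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def word_counts_by_genre(reviews):
    """Map each genre to a Counter of token frequencies.

    `reviews` is an iterable of (genre, review_text) pairs.
    """
    counts = {}
    for genre, text in reviews:
        counts.setdefault(genre, Counter()).update(tokenize(text))
    return counts


# Made-up reviews, for illustration only.
sample_reviews = [
    ("fantasy", "The magic system is wonderful and the characters feel alive."),
    ("fantasy", "Magic, dragons and a quest: I loved it."),
    ("romance", "A slow courtship and a charming love story."),
]

genre_counts = word_counts_by_genre(sample_reviews)
print(genre_counts["fantasy"].most_common(3))
```

The per-genre `Counter` objects can be passed straight to a plotting library to produce bar plots of the most frequent words.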

### Metrics for Model Selection 

<font color=navy>
    Given that this project uses unsupervised learning, success metrics are difficult to define. I assessed the LDA models mainly by inspecting the topics they produced (whether they were sensible) and the degree of overlap between topics, judged visually using LDAvis<sup>4</sup>. To compare different LDA models and determine which performed best, I also used perplexity and coherence scores. Additional information on these metrics can be found in notebook #5.
    </font>
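
To make the perplexity metric concrete, the sketch below computes per-word perplexity from point estimates of the document-topic and topic-word distributions. It is a simplified illustration under stated assumptions: the function name and the toy numbers are hypothetical, and library implementations typically report a variational bound instead, so their values are not directly comparable to this calculation.

```python
import math


def perplexity(docs, doc_topic, topic_word):
    """Per-word perplexity of a topic model on tokenized documents.

    docs:       list of documents, each a list of word ids
    doc_topic:  doc_topic[d][k] = P(topic k | document d)
    topic_word: topic_word[k][w] = P(word w | topic k)
    """
    log_likelihood = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Marginalise over topics to get P(word | document).
            p_word = sum(theta * phi[w] for theta, phi in zip(doc_topic[d], topic_word))
            log_likelihood += math.log(p_word)
            n_tokens += 1
    return math.exp(-log_likelihood / n_tokens)


# Toy corpus: two documents over a four-word vocabulary (made-up values).
toy_docs = [[0, 1, 0], [2, 3, 3]]
toy_doc_topic = [[0.9, 0.1], [0.1, 0.9]]

# A model that has learned nothing: every word is equally likely in every topic.
uniform_topics = [[0.25] * 4, [0.25] * 4]
# A model whose topics match the word usage in the toy corpus.
sharp_topics = [[0.45, 0.45, 0.05, 0.05], [0.05, 0.05, 0.45, 0.45]]

uniform_perplexity = perplexity(toy_docs, toy_doc_topic, uniform_topics)
sharp_perplexity = perplexity(toy_docs, toy_doc_topic, sharp_topics)
```

The uniform model has a perplexity equal to the vocabulary size, while the model whose topics fit the corpus scores lower, which is why lower perplexity is read as a better fit.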

### Key Findings and their Limitations

#### [Link to interactive LDA topic visualisation](./Resources/lda_model5.html)

<font color=navy>
    Of the 10 models tested, the best used bigrams, 20 topics and 10 passes over the corpus. This model produced interpretable topics, with some overlap between them but overall low perplexity and acceptable topic coherence. It can now be used to predict topics in book reviews.<br><br>
     The key finding of this modelling exercise is that some themes common to specific book genres do emerge as topics in the topic model, for example magic, love and courtship, story plots or adventure. This indicates that it may be worth developing distinct marketing strategies for fantasy, romance and thriller books in particular. Another finding is that target audiences also emerge as topics, in particular children and young adults, indicating that books for these demographics should be promoted in a way that targets them specifically. An important limitation to bear in mind when interpreting these findings is that the topics were labelled manually, so the labels may reflect the bias of the person who assigned them.
</font>
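
As an illustration of what using bigrams means for the input to a topic model, the sketch below joins adjacent token pairs that occur often enough into single tokens. It is a simplified stand-in, not the project's implementation: phrase-detection tools usually score candidate pairs rather than applying a raw count threshold, and the function name, threshold and sample documents here are hypothetical.

```python
from collections import Counter


def merge_frequent_bigrams(docs, min_count=2):
    """Join adjacent token pairs seen at least `min_count` times into one token."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            pair = tuple(doc[i:i + 2])
            if len(pair) == 2 and pair_counts[pair] >= min_count:
                merged.append("_".join(pair))
                i += 2  # skip both halves of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


# Made-up tokenized reviews, for illustration only.
docs = [
    ["young", "adult", "fantasy", "novel"],
    ["great", "young", "adult", "read"],
    ["adult", "themes", "throughout"],
]
bigram_docs = merge_frequent_bigrams(docs)
```

Here "young adult" becomes the single token `young_adult` wherever the two words appear together, so a topic model can treat the phrase differently from "adult" on its own.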

### Future Work and Considerations for Real World Production Environment

<font color=navy>
    1. Manually label books with incorrect or missing ISBNs and add them to the dataset to increase the size of the corpus and improve model performance.<br>
    2. Create a web interface (e.g. with Flask) where users can input reviews, then retrieve dominant topics.
    </font>
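
For the second item, the step that turns a model's output into something a user can read might look like the sketch below. It assumes the model returns a list of (topic id, probability) pairs for a review and that the manually assigned topic labels are stored in a dictionary. The function name, labels and probabilities are hypothetical.

```python
def dominant_topics(topic_probs, labels, top_n=2):
    """Return the `top_n` most probable topics as (label, probability) pairs.

    topic_probs: list of (topic_id, probability) pairs for one review
    labels:      dict mapping topic_id to a human-readable label
    """
    ranked = sorted(topic_probs, key=lambda pair: pair[1], reverse=True)
    return [(labels.get(topic_id, f"topic {topic_id}"), prob)
            for topic_id, prob in ranked[:top_n]]


# Made-up labels and probabilities, for illustration only.
example_labels = {0: "magic", 1: "love and courtship", 2: "adventure"}
example_probs = [(0, 0.62), (1, 0.08), (2, 0.30)]
top_topics = dominant_topics(example_probs, example_labels)
```

A web endpoint would call a helper like this after cleaning the submitted review and running it through the trained model.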

### References

<font color=navy>
1. <b>SPMC: Socially-aware personalized Markov chains for sparse sequential recommendation</b><br>
Chenwei Cai, Ruining He, Julian McAuley<br>
<i>IJCAI</i>, 2017<br><br>
2. <b>Improving latent factor models via personalized feature projection for one-class recommendation</b><br>
Tong Zhao, Julian McAuley, Irwin King<br>
<i>Conference on Information and Knowledge Management (CIKM)</i>, 2015<br><br>
3. <b>Latent Dirichlet Allocation</b><br>
David M. Blei, Andrew Y. Ng, Michael I. Jordan<br>
<i>Journal of Machine Learning Research</i> 3, 993-1022, 2003<br><br>
4. <b>LDAvis: A method for visualizing and interpreting topics</b><br>
Carson Sievert, Kenneth E. Shirley<br>
<i>Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces</i>, 63-70, 2014
</font>
____