For introduction and problem statement, please refer to notebook 1

## Content 

**Notebook 1: 1_cellphones_reviews_data_cleaning_and_eda**
- Data Import and Cleaning
- Exploratory Data Analysis
- Text Data Pre-processing

**Notebook 2: 2_cellphones_reviews_topic modelling**
- Data Import
- Topic Modelling with Gensim

**Notebook 3: 3_cellphones_reviews_topic_analysis_and_visualizations**
- Findings and Analysis of Topic Modelling

**Notebook 4: 4_features_extractions_and_sentiment_analysis**
- Data Import
- Sentiment Analysis with VADER
- Sentiment Analysis with Logistic Regression (Multi-Class Classification)
- Evaluation of Sentiment Analysis with BERT (Multi-Class Classification)
Please refer to notebook 5 for the fine-tuning process of the pre-trained BERT model.


**Notebook 5: 5_fine_tuning_of_BERT_model**   
This notebook is kept separate from notebook 4, which contains the evaluation of the BERT model, because fine-tuning BERT requires a GPU. The model was therefore fine-tuned on Google Colaboratory and loaded back into notebook 4 for evaluation.


**Notebook 6: 6_analysis_and_findings**
- [Data Import](#Data-Import)
- [Comparison of the 3 Methods](#Evaluating-and-Comparing-the-3-Models)
- [Deployment](#Deployment)
- [Conclusion and Future Steps](#Conclusion-and-Future-Steps)

## Data Import

In [1]:
import re
import spacy
import pickle
import math
import torch
import pandas as pd 
import numpy as np
from nltk import tokenize
import matplotlib.pyplot as plt
from nltk.corpus import stopwords 
from bs4 import BeautifulSoup
from transformers import BertTokenizer
from torch.utils.data import TensorDataset
from nltk.stem import WordNetLemmatizer
from transformers import BertForSequenceClassification
from torch.utils.data import DataLoader, SequentialSampler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

In [4]:
new_reviews  = pickle.load(open('../data/reviews_with_feature_sentiments.pkl', 'rb'))

## Defining functions for Predictions

In this section, we copy the functions that were defined in the earlier notebook so that we can make predictions on toy examples and compare the 3 models.

In [5]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#instantiate vader sentiment analyzer
analyser = SentimentIntensityAnalyzer()

#adding a new word to the lexicon
new_words = {
    'new': 3.0
}

analyser.lexicon.update(new_words)

#defining the stop words
stop_words = stopwords.words('english')


#remove negation words from stop words as they are useful context for sentiment predictions
negation_words = ['ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', 
"hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 
'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 
'wouldn', "wouldn't","not","no",'don',"don't"]

for word in negation_words:
    stop_words.remove(word)

#confirm that the negation words have been removed
len(stop_words)

def summarise_reviews (reviews):
    """
    Find sentences with keywords, followed by cleaning by removing html, non letter words, stopwords and lemmatizing

    Parameter
    ----------
    reviews:a string of words 

    Returns
    -------
    summary: processed sentences that contained keywords defined below
    """
    list_of_keywords = ['camera','screen','battery','simcard','touchscreen',
                        'fingerprint','fingerprints','ringtones','charger']
    summary = set()
    texts = tokenize.sent_tokenize(reviews)
    for sentence in texts:
        sentence = sentence.lower()
        for word in list_of_keywords:
            if word in sentence:
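                # Remove HTML (the parser is named explicitly to avoid bs4's parser-guessing warning).
                post_text = BeautifulSoup(sentence, "html.parser").get_text()

                # Remove non-letters. [A-Za-z] is used because [A-z] also matches [ \ ] ^ _ and `.
                letters_only = ' '.join(re.findall(r"[A-Za-z’]+",post_text))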

                # Convert to lower case, split into individual words.
                words = letters_only.lower().split()

                #convert the stopwords to a set.
                stops = set(stop_words)

                # Remove stopwords.
                meaningful_words = [w for w in words if w not in stops]

                # Stemming 
                #p_stemmer = PorterStemmer()
                #meaningful_words = [p_stemmer.stem(w) for w in meaningful_words]

                #Lemmatize
                lemmatizer = WordNetLemmatizer()
                meaningful_words = [lemmatizer.lemmatize(word) for word in meaningful_words]

                cleaned_sentence = (" ".join(meaningful_words))
                
                summary.add(cleaned_sentence)
                
    return list(summary)

In [6]:
def vader_sentiments (reviews):
    """
    extract features by searching for sentences with keywords defined 
    and predicting sentiments of each sentence using vader 

    Parameter
    ----------
    reviews: a string of words (the raw review text; it is cleaned inside this function)

    Returns
    -------
    (sentiment score, keyword)
    """
    #list down the keywords
    list_of_keywords = ['camera','screen','battery','simcard','touchscreen','fingerprint',
                        'fingerprints','ringtones','charger']
    summary = set()
    
    summarised_reviews = summarise_reviews (reviews)
    #loop through each sentence to make predictions
    for cleaned_sentence in summarised_reviews:
        
        #only predict and keep sentences with keywords
        for word in list_of_keywords:
            if word in cleaned_sentence:
                #predict sentiment with vader
                score = analyser.polarity_scores(cleaned_sentence)
                compound = score['compound']
                #assign negative sentiment to 1, 
                #neutral to 3, positive to 5
                if compound >= 0.05:
                    sentiment_score = 5
                elif compound >= -0.05:
                    sentiment_score = 3
                else:
                    sentiment_score = 1

                summary.add((sentiment_score,word))
    return list(summary)

In [7]:
logreg_model = pickle.load(open('../data/logreg_3classes.pkl', 'rb'))

def logreg_sentiments(reviews):
    """
    extract features by searching for sentences with keywords defined 
    and predicting sentiments of each sentence using logistic regression

    Parameter
    ----------
    reviews: a string of words (the raw review text; it is cleaned inside this function)

    Returns
    -------
    (sentiment score, keyword)
    
    """
    reviews = summarise_reviews (reviews)
    
    list_of_keywords = ['camera','screen','battery','simcard',
                        'touchscreen','fingerprint','fingerprints','ringtones','charger']
    summary = set()
    #predict with logistic regression model
    pred = logreg_model.predict(reviews)
    
    predicted_ratings= []
    #convert class 0,1,2 to 1,3,5 
    for score in pred:
        if float(score) == 2.0:
            rating = 5
            predicted_ratings.append(rating)
        elif float(score) == 1.0:
            rating = 3
            predicted_ratings.append(rating)
        else:
            rating = 1
            predicted_ratings.append(rating)
    
    #loop through each clean sentence
    for i,cleaned_sentence in enumerate(reviews):        
        for word in list_of_keywords:
            if word in cleaned_sentence:
                summary.add((predicted_ratings[i],word))
                
    
    return list(summary)

In [8]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=3,
                                                      output_attentions=False,
                                                      output_hidden_states=False)


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

model.load_state_dict(torch.load('../data/finetuned_BERT_epoch_2_3classes.model', map_location=torch.device('cpu')))


### Loading Tokenizer and Encoding Data by Sentences

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                          do_lower_case=True)

batch_size=32

def sentences_with_keywords (reviews):
    list_of_keywords = ['camera','screen','battery','simcard','touchscreen','fingerprint','fingerprints',
                        'ringtones','charger']
    summarised_reviews = set()
    texts = tokenize.sent_tokenize(reviews)
    for sentence in texts:
        sentence = sentence.lower()
        for word in list_of_keywords:
            if word in sentence:
                summarised_reviews.add(sentence)
    
    summarised_reviews = list(summarised_reviews)
    
    return summarised_reviews

def bert_sentiments (reviews):
    list_of_keywords = ['camera','screen','battery','simcard','touchscreen','fingerprint','fingerprints',
                        'ringtones','charger']
    
    summary = set()
    
    summarised_reviews = sentences_with_keywords (reviews)
    
    encoded_data_features = tokenizer.batch_encode_plus(
    summarised_reviews, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=256, 
    return_tensors='pt'
)

    input_ids_features = encoded_data_features['input_ids']
    attention_masks_features = encoded_data_features['attention_mask']
    #labels_features = torch.tensor(df[df.data_type=='val'].label.values)

    dataset_features = TensorDataset(input_ids_features, attention_masks_features)

    dataloader_features = DataLoader(dataset_features , 
                                       sampler=SequentialSampler(dataset_features ), 
                                       batch_size=batch_size)

    #activate evaluation mode
    model.eval()

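    #loop through the data that is fed into the function, collecting the logits of every batch
    all_logits = []
    for batch in dataloader_features:

        batch = tuple(b.to(device) for b in batch)

        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                     }
        #deactivate gradient calculation; this must run inside the loop,
        #otherwise only the last batch is ever predicted
        with torch.no_grad():
            outputs = model(**inputs)
        all_logits.append(outputs[0])

    #calculate the predicted class by finding the index of the highest logit
    rating_score = torch.argmax(torch.cat(all_logits), dim=1) if all_logits else []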

    #convert class 0,1,2 to 1,3,5 
    try:
        
        predicted_ratings = []

        for score in rating_score:
            if float(score) == 2.0:
                rating = 5
                predicted_ratings.append(rating)
            elif float(score) == 1.0:
                rating = 3
                predicted_ratings.append(rating)
            else:
                rating = 1
                predicted_ratings.append(rating)
        
        #loop through each sentence with keyword
        for i,cleaned_sentence in enumerate(summarised_reviews):        
            for word in list_of_keywords:
                if word in cleaned_sentence:
                    summary.add((float(predicted_ratings[i]),word))
    except:
        summary.add(np.nan)
        
    return list(summary)

## Evaluating and Comparing the 3 Models

In [9]:
def mean_ratings (feature_ratings):
    """
    calculate mean ratings for each feature if there are two features with different sentiments in a review

    Parameter
    ----------
    [(sentiment score 1,feature 1),(sentiment score 2,feature 2),...] (multiple tuples grouped in a list)

    Returns
    -------
    dictionary with feature as key and mean sentiment as value
    
    """
    #define the dictionary
    all_features = {'camera':[],'battery':[],'fingerprint':[],'screen':[],'charger':[],'simcard':[],'ringtones':[]}
    
    #iterate through the list of feature,sentiments tuple within a review
    try:
        for i,feature in enumerate(feature_ratings):

                if feature[1]  == 'camera':
                    all_features ['camera'].append(feature[0])

                elif feature[1] =='battery':
                    all_features ['battery'].append(feature[0])

                elif feature[1] == 'fingerprint':
                    all_features ['fingerprint'].append(feature[0])

                elif feature[1] == 'fingerprints':
                    all_features ['fingerprint'].append(feature[0])

                elif feature[1] == 'screen':
                    all_features ['screen'].append(feature[0])

                elif feature[1]  == 'charger':
                    all_features ['charger'].append(feature[0])

                elif feature[1] == 'touchscreen':
                    all_features ['screen'].append(feature[0])

                elif feature[1] == 'simcard':
                    all_features ['simcard'].append(feature[0])

                elif feature[1] == 'ringtones':
                    all_features ['ringtones'].append(feature[0])
    
    
        #calculate the mean value 
        try: 
            all_features_mean = {key:np.mean(value) for key,value in all_features.items()}

        except:
            
            all_features_mean = {key:np.nan for key,value in all_features.items()}
        
        #for keys with nan value, remove the key and value pair
        new_dict = {key:val for key, val in all_features_mean.items() if not math.isnan(val)}
    
    except:
        new_dict = np.nan
    
    return new_dict 

In [10]:
#calculate mean ratings - if there are features that are mentioned twice with different sentiments, the mean 
#rating will be displayed

import warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    new_reviews['vader_analysis'] = new_reviews['vader_analysis'].apply(mean_ratings)
    new_reviews['logreg_pred'] = new_reviews['logreg_pred'].apply(mean_ratings)
    new_reviews['bert_analysis'] = new_reviews['bert_analysis'].map(mean_ratings)

In [11]:
#sanity check to confirm that the mean rating was implemented correctly 
pd.set_option('display.max_colwidth',None)
new_reviews[['vader_analysis','logreg_pred','bert_analysis']].sample(10)

| | vader_analysis | logreg_pred | bert_analysis |
|---|---|---|---|
| 1617 | {'battery': 5.0, 'screen': 5.0} | {'battery': 5.0, 'screen': 5.0} | {'battery': 5.0, 'screen': 5.0} |
| 10551 | {'camera': 5.0, 'screen': 3.0} | {'camera': 5.0, 'screen': 3.0} | {'camera': 5.0, 'screen': 3.0} |
| 12233 | {'camera': 3.0} | {'camera': 1.0} | {'camera': 1.0} |
| 477 | {'screen': 5.0} | {'screen': 5.0} | {'screen': 5.0} |
| 20801 | {'fingerprint': 3.0} | {'fingerprint': 3.0} | {'fingerprint': 2.0} |
| 18885 | {'battery': 5.0} | {'battery': 5.0} | {'battery': 5.0} |
| 20107 | {'camera': 3.0, 'battery': 5.0, 'fingerprint': 4.0, 'screen': 4.0} | {'camera': 5.0, 'battery': 5.0, 'fingerprint': 5.0, 'screen': 5.0} | {'camera': 3.0, 'battery': 5.0, 'fingerprint': 5.0, 'screen': 5.0} |
| 3576 | {'camera': 5.0} | {'camera': 5.0} | {'camera': 5.0} |
| 5196 | {'battery': 3.0} | {'battery': 5.0} | {'battery': 5.0} |
| 14257 | {'battery': 3.0} | {'battery': 5.0} | {'battery': 5.0} |


As mentioned in the comments above, we apply the `mean_ratings` function to the output of every model **because some reviews mention a feature more than once with different sentiments.** In that case, the mean rating of that feature is displayed instead.

An example of how the mean rating works is shown below. The VADER prediction shows that **camera has a rating of 4. Our initial model output only had 3 classes (0, 1 and 2), which were subsequently converted to 1, 3 and 5 (negative, neutral and positive). The predicted rating is shown as 4 here because VADER picked up two different sentiments (3 and 5) in the two sentences that talk about the camera.**

In [12]:
example = new_reviews.loc[19551,"sentences_with_keywords"]
pred = new_reviews.loc[19551,"vader_analysis"]
print(f"Example: {example}")
print(f"VADER predictions: {pred}")

Example: ['the only con was the camera is not a $800.', 'i think the camera does a good job.']
VADER predictions: {'camera': 4.0}
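
The averaging step can also be illustrated in isolation. The snippet below is a minimal sketch rather than the notebook's `mean_ratings` function: `average_by_feature` is a hypothetical helper that groups (score, keyword) tuples by keyword and averages the scores, which reproduces the camera example above.

```python
from collections import defaultdict


def average_by_feature(pairs):
    """Group (score, feature) tuples by feature and average the scores."""
    grouped = defaultdict(list)
    for score, feature in pairs:
        grouped[feature].append(score)
    # sum/len always returns a float, matching the 4.0 shown in the output above
    return {feature: sum(scores) / len(scores) for feature, scores in grouped.items()}


# Two camera sentences rated 3 (neutral) and 5 (positive) average to 4.0
print(average_by_feature([(3, "camera"), (5, "camera")]))  # {'camera': 4.0}
```

Unlike `mean_ratings`, this sketch does not merge keyword variants such as 'fingerprints' into 'fingerprint' or 'touchscreen' into 'screen'.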


As we do not have sentiment labels at feature level, I decided to do a manual accuracy test. I extracted the 40 reviews with the most helpful votes to evaluate and compare the accuracy of the 3 models at feature level.

In [13]:
#get top 40 most helpful reviews 
helpful_reviews_indexes = new_reviews['helpfulVotes'].sort_values(ascending=False).head(40).index

In [14]:
helpful_reviews = new_reviews.loc[helpful_reviews_indexes ]

In [15]:
helpful_reviews.to_csv("../data/helpful_reviews.csv",index=False)

Each review was scored as the fraction of its features whose sentiment the model predicted correctly. For example, if a review mentions 4 features and the model predicts the sentiment of one of them correctly, the score for that review is 0.25; if the sentiments of all features are predicted correctly, the score for that review is 1. The scores for the 40 reviews were then added up and the sum was divided by 40 to get the accuracy as a percentage. The evaluation of each row was done in a Google spreadsheet, which has also been uploaded to the data folder. It is **named "helpful_reviews_evaluated.csv"**.
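
The scoring scheme described above can be sketched in a few lines. This is only an illustration of the arithmetic: the actual evaluation was done by hand in the spreadsheet, `review_score` and `manual_accuracy` are hypothetical helper names, the labels below are made up, and treating a prediction as correct only when it equals the hand label exactly is an assumption (the notebook does not state how averaged ratings such as 4.0 were judged).

```python
def review_score(predicted, labelled):
    """Fraction of the hand-labelled features whose predicted sentiment matches."""
    if not labelled:
        return 0.0
    correct = sum(1 for feature, rating in labelled.items()
                  if predicted.get(feature) == rating)
    return correct / len(labelled)


def manual_accuracy(pairs):
    """Average the per-review scores over all (predicted, labelled) pairs."""
    return sum(review_score(p, l) for p, l in pairs) / len(pairs)


# Hypothetical predictions and hand labels for two reviews (not real data)
pairs = [
    ({"camera": 5, "screen": 5, "battery": 5, "charger": 5},
     {"camera": 5, "screen": 1, "battery": 1, "charger": 3}),  # 1 of 4 correct -> 0.25
    ({"battery": 1}, {"battery": 1}),                          # 1 of 1 correct -> 1.0
]
print(manual_accuracy(pairs))  # 0.625
```

Because each review contributes a fraction rather than 0 or 1, the totals reported for the 40 reviews are not whole numbers.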

Here is the result:

VADER's score was 28.27/40 = **70.7%**  
Logistic Regression's score was 28.57/40 = **71.4%**  
BERT's score was 34.52/40 = **86.3%**  


From the results above, BERT scored roughly 15 percentage points higher than the other two models at feature level. In notebook 4, BERT was also shown to be superior to Logistic Regression in predicting the overall rating.

Also, as mentioned in the introduction of this project, we want a model that can detect the effect of negation in a sentence. BERT seems to do this well, as seen in the examples below:

In [16]:
example= "I don't like the screen. It is different from the description. Please don't buy it."
print(f"Example: \n{example}")
print("\n")
print(f"VADER rating: {vader_sentiments(example)}")
print(f"Logistic Regression rating: {logreg_sentiments(example)}")
print(f"BERT rating: {bert_sentiments(example)}")

Example: 
I don't like the screen. It is different from the description. Please don't buy it.


VADER rating: [(5, 'screen')]
Logistic Regression rating: [(5, 'screen')]
BERT rating: [(1.0, 'screen')]


Here **BERT correctly detects the negation** while the other two models do not. Another observation is that BERT reads the context of a sentence better than the other two. Let's look at another example, taken from an Amazon review.

In [17]:
example= new_reviews.loc[20317,"reviews"]


print(f"Amazon example: \n{example}")
print("\n")
print(f"VADER rating: {vader_sentiments(example)}")
print(f"Logistic Regression rating: {logreg_sentiments(example)}")
print(f"BERT rating: {bert_sentiments(example)}")

Amazon example: 
Eh wouldn’t buy again One - It comes in a weird box Two it had more scuffs and scratches than I’d like for the price 3 - I had to return it because of how over all lame it came and how it showed up/ battery was pretty warn/scratches/ and I didn’t get the awesome feeling of unboxing it..


VADER rating: [(5, 'battery')]
Logistic Regression rating: [(5, 'battery')]
BERT rating: [(1.0, 'battery')]


In the review above, the battery is rated negatively: it is described as "pretty warn/scratches". However, VADER and Logistic Regression both predicted a positive sentiment. This could be due to the word "pretty", which is used here as an intensifier (as in "quite") but may be read by these two models as a positive adjective.

## Deployment

I have deployed the feature extraction and sentiment analysis portion of this project here: http://ec2-3-22-98-206.us-east-2.compute.amazonaws.com:5000/predict-review

The current purpose of the deployment is to showcase how the model works, especially to non-technical people who are not equipped to download this notebook and try the model. However, the model could be optimised and deployed in any company that receives a large volume of user-generated reviews so that it can **automatically extract the features along with the ratings of the features.** Note that the current model has been fine-tuned on cell phone reviews, so it would have to be fine-tuned again before being adopted for other types of reviews.

As the BERT model outperformed VADER and Logistic Regression, I deployed the BERT model to AWS. I wanted to deploy it to Heroku as it was free; however, the fine-tuned BERT model file is too big to be deployed to Heroku.

## Conclusion and Future Steps

To conclude, the BERT model performed relatively well in terms of accuracy (86.3% on the manual feature-level test). However, I found some limitations while testing the model with many variations of sentences with different sentiments. The limitations are elaborated below.

**Limitations #1 (sentiment predictions):**

The model predicts words like "okay" as 5 stars (I would rate them as 3 or 4 stars). This could be because many reviewers who think that certain features are "okay" have given a rating of 4 (a slightly above-average rating). As we only fine-tuned the model with 3 classes (negative, neutral, positive), 4-star ratings were categorised as positive. This explains why the model gives 5 stars to a feature that is described as "okay".

**Solution #1:** 

I will consider fine-tuning the model with 5 classes instead as a way to improve the model. Also, I would like to consolidate all the frequently used words/sentences that are not accurately being predicted, label them with the right ratings and add them to the train dataset to fine-tune the model. 


**Limitations #2 (feature extractions):**

Currently, the model only searches for features using the keywords that I have defined, which is pretty limited. It is unable to search for synonyms of the keywords. For example, "this phone takes nice picture" will not be picked up by the model currently.

**Solution #2:** 

A simple way to improve this is to generate all the synonyms of the features and fine-tune the model further. As fine-tuning of BERT model is quite time-consuming, I will work on this improvement in the future.
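
The extraction side of this idea can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the project: `FEATURE_SYNONYMS` and `find_features` are hypothetical names, the synonym lists are made up for the example, and the sketch covers only keyword matching, not the further fine-tuning mentioned above.

```python
import re

# Hypothetical synonym map: each canonical feature lists the words that should
# count as a mention of it. The entries are illustrative, not a curated vocabulary.
FEATURE_SYNONYMS = {
    "camera": ["camera", "picture", "pictures", "photo", "photos"],
    "screen": ["screen", "touchscreen", "display"],
    "battery": ["battery", "batteries"],
}


def find_features(sentence):
    """Return the canonical features mentioned in a sentence, matching whole words."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return sorted(feature for feature, synonyms in FEATURE_SYNONYMS.items()
                  if words.intersection(synonyms))


print(find_features("This phone takes nice picture"))  # ['camera']
```

Note that this sketch matches whole words, whereas the functions in this notebook use a substring check (`if word in sentence`), so plural or compound forms need their own entries in the map.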


Recapping on the business problem and problem statement that we were trying to address: 

Business agenda: Improve user interface and platform experience by segregating reviews into topics or summarising long reviews into just the main points/features and their corresponding sentiments.

Problem Statement:
1. Automatically segregate reviews by topics
2. Summarise each review by features and sentiments


So far, we have addressed both problem statements. **The topic modelling done in notebooks 2 and 3 has identified clear and logical topic clusters**, which is useful for segregating reviews on e-commerce platforms (e.g. Amazon, Lazada, Expedia, Airbnb), especially on popular listings with hundreds or thousands of reviews. This would greatly **enhance user experience.** The **extraction of features and sentiments would also help users pull important information out of extremely long reviews, as we saw in the EDA that some reviews are about 1,000 words long.**


With the features and ratings that we currently output for each review, we can also produce an aggregated rating for each feature within a listing. For example, for the iPhone XR, the aggregated rating for camera is x, the aggregated rating for screen is y and the aggregated rating for battery is z (calculated from the thousands of reviews that the listing has).
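
A minimal sketch of such an aggregation is shown below. It assumes the per-review output is a dictionary of feature ratings, as in the `vader_analysis`, `logreg_pred` and `bert_analysis` columns, with NaN for reviews where no feature was found. `aggregate_feature_ratings` is a hypothetical helper and the sample listing is made up.

```python
from collections import defaultdict


def aggregate_feature_ratings(review_ratings):
    """Average each feature's rating across the per-review dictionaries of one listing."""
    collected = defaultdict(list)
    for ratings in review_ratings:
        # Reviews without any detected feature may be stored as NaN rather than a dict
        if not isinstance(ratings, dict):
            continue
        for feature, rating in ratings.items():
            collected[feature].append(rating)
    return {feature: round(sum(values) / len(values), 2)
            for feature, values in collected.items()}


# Made-up per-review outputs for a single listing
listing = [
    {"camera": 5.0, "screen": 3.0},
    {"camera": 1.0},
    float("nan"),
    {"battery": 5.0, "screen": 5.0},
]
print(aggregate_feature_ratings(listing))
# {'camera': 3.0, 'screen': 4.0, 'battery': 5.0}
```

A plain mean weights every review equally; weighting by helpful votes would be one possible variation.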