For the introduction and problem statement, please refer to Notebook 1.

## Content 

**Notebook 1: 1_cellphones_reviews_data_cleaning_and_eda**
- Data Import and Cleaning
- Exploratory Data Analysis
- Text Data Pre-processing

**Notebook 2: 2_cellphones_reviews_topic modelling**
- Data Import
- Topic Modelling with Gensim

**Notebook 3: 3_cellphones_reviews_topic_analysis_and_visualizations**
- Findings and Analysis of Topic Modelling

**Notebook 4: 4_features_extractions_and_sentiment_analysis**
- Data Import
- Sentiment Analysis with VADER
- Sentiment Analysis with Logistic Regression (Multi-Class Classification)
- Evaluation of Sentiment Analysis with BERT (Multi-Class Classification)

Please refer to Notebook 5 for the fine-tuning process of the pre-trained BERT model.


**Notebook 5: 5_fine_tuning_of_BERT_model**   
This notebook is separate from Notebook 4, which contains the evaluation of the BERT model, because fine-tuning BERT requires a GPU. The model was therefore fine-tuned on Google Colaboratory and loaded back into Notebook 4 for evaluation.


**Notebook 6: 6_analysis_and_findings**
- [Data Import](#Data-Import)
- [Comparison of the 3 Methods](#Evaluating-and-Comparing-the-3-Models)
- Recommendation and Conclusion 
- Future Steps

## Data Import

In [1]:
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np
import spacy
from nltk import tokenize
from nltk.corpus import stopwords 
import re
from bs4 import BeautifulSoup
from nltk.stem import WordNetLemmatizer
import pickle
import math

In [2]:
# Use a context manager so the file handle is closed after loading
with open('../data/reviews_with_feature_sentiments.pkl', 'rb') as f:
    new_reviews = pickle.load(f)

## Evaluating and Comparing the 3 Models

In [45]:
samples[['vader_analysis','logreg_pred','bert_analysis']][:10]

# Note: row 9 is a good example for showcasing BERT

| | vader_analysis | logreg_pred | bert_analysis |
|---|---|---|---|
| 0 | [(5, battery), (5, camera)] | [(5, battery), (5, camera)] | [(5.0, battery), (5.0, camera)] |
| 1 | [(5, battery)] | [(5, battery)] | [(5.0, battery)] |
| 2 | [(5, screen)] | [(5, screen)] | [(5.0, screen)] |
| 3 | [(5, screen)] | [(5, screen)] | [(5.0, screen)] |
| 4 | [(5, screen)] | [(5, screen)] | [(1.0, screen)] |
| 5 | [(5, camera)] | [(1, camera), (5, camera)] | [(5.0, camera)] |
| 6 | [(1, battery)] | [(1, battery)] | [(1.0, battery)] |
| 7 | [(5, screen)] | [(1, screen)] | [(1.0, screen)] |
| 8 | [(5, screen), (3, battery), (1, screen), (5, camera), (3, screen)] | [(5, screen), (1, screen), (5, battery), (5, camera)] | [(5.0, screen), (1.0, screen), (1.0, battery), (5.0, camera)] |
| 9 | [(5, battery)] | [(5, battery)] | [(1.0, battery)] |
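
The three methods can disagree on the same review, as rows 4, 7 and 9 show. One possible way to quantify this is a pairwise agreement rate: the share of reviews on which two methods give identical ratings for every feature. The sketch below is not part of the original analysis; the helper names `to_feature_map` and `agreement_rate` are hypothetical, and the toy data is copied from rows 0, 4 and 7 of the table.

```python
from itertools import combinations

def to_feature_map(pairs):
    """Collapse [(rating, feature), ...] into {feature: sorted list of int ratings}."""
    feature_map = {}
    for rating, feature in pairs:
        # Cast to int so that BERT's 5.0 compares equal to VADER's 5
        feature_map.setdefault(feature, []).append(int(rating))
    return {feature: sorted(ratings) for feature, ratings in feature_map.items()}

def agreement_rate(rows_a, rows_b):
    """Share of reviews on which two methods give identical feature ratings."""
    matches = sum(to_feature_map(a) == to_feature_map(b) for a, b in zip(rows_a, rows_b))
    return matches / len(rows_a)

# Toy data: rows 0, 4 and 7 of the comparison table
predictions = {
    'vader_analysis': [[(5, 'battery'), (5, 'camera')], [(5, 'screen')], [(5, 'screen')]],
    'logreg_pred':    [[(5, 'battery'), (5, 'camera')], [(5, 'screen')], [(1, 'screen')]],
    'bert_analysis':  [[(5.0, 'battery'), (5.0, 'camera')], [(1.0, 'screen')], [(1.0, 'screen')]],
}

# Agreement for every pair of methods
pairwise_agreement = {
    (a, b): agreement_rate(predictions[a], predictions[b])
    for a, b in combinations(predictions, 2)
}
for pair, rate in pairwise_agreement.items():
    print(pair, round(rate, 2))
```

Agreement only measures how often two methods coincide; it says nothing about which of them is correct, so it complements rather than replaces a manual evaluation.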


In [78]:
#index 20317, asin B07RT1X4FJ
iphone_xs = new_reviews[new_reviews['asin'] == 'B07RT1X4FJ']

In [81]:
pd.set_option('display.max_colwidth',None)
iphone_xs['reviews'][:5]

20316    Honestly, it was worth it I was very hesitant about buying an iPhone off of Amazon, but I did it and was not disappointed. It came with 100% battery life, so it basically was brand new. It came with a charger and cube (plug), but that's it. Sorry to disappoint those who into that airpod life. No scratches and shipped fairly quickly. The only complaint I have for the iPhone XS, in general, is that I really liked the fingerprint scan, sometimes the face recognition doesn't work when you're looking down. But what it lacks in that feature, the iPhone delivers in the portrait camera mode, I seriously am in love with the blur feature. Sincerely, A very satisfied Amazon customer
20317                                                                                                                                                                                                                                                                                                                 

In [6]:
# Convert the whole column at once instead of element by element
new_reviews['helpfulVotes'] = pd.to_numeric(new_reviews['helpfulVotes'])

In [18]:
helpful_reviews_indexes = new_reviews['helpfulVotes'].sort_values(ascending=False).head(50).index

In [19]:
helpful_reviews = new_reviews.loc[helpful_reviews_indexes]


In [20]:
helpful_reviews.to_csv("../data/helpful_reviews.csv",index=False)

## Comparing VADER vs Logistic Regression vs BERT sentiment analysis

In [4]:
new_reviews.columns

Index(['asin', 'name', 'rating', 'date', 'verified', 'review_title', 'body',
       'helpfulVotes', 'brand', 'item_title', 'url', 'image', 'reviewUrl',
       'totalReviews', 'price', 'originalPrice', 'reviews', 'word_count',
       'cleaned_reviews', 'multi_class_sentiment', 'tokens', 'summary',
       'sentences_with_keywords', 'features_and_sentiments', 'filter summary',
       'logreg_pred', 'data_type', 'bert_analysis', 'vader_analysis'],
      dtype='object')

In [5]:
type(new_reviews['logreg_pred'][0])

list

In [4]:
all_features=set()

for idx in new_reviews.index:
    for feature in new_reviews.loc[idx,'features_and_sentiments']:
        all_features.add(feature[1])

In [5]:
all_features #unique features

{'battery',
 'camera',
 'charger',
 'fingerprint',
 'ringtones',
 'screen',
 'simcard',
 'touchscreen'}

## Accuracy of analysis 

In [11]:
evaluated_sample = pd.read_csv("../data/samples_for_evaluation_updated.csv")

In [41]:
samples = new_reviews.sample(50,random_state=42)
samples.reset_index(inplace=True,drop=True)

In [46]:
samples.to_csv("../data/samples_to_be_evaluated.csv")

pd.set_option('display.max_colwidth',None)
samples['sentences_with_keywords'][:10]
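
The notebook does not show how the 50 sampled reviews were scored once they were exported. One possible approach, sketched below under assumptions, is to mark each method's output per review as correct (1) or incorrect (0) by hand and take the column means as accuracies. The column names `vader_correct`, `logreg_correct` and `bert_correct` and all values are hypothetical; they are not taken from `samples_for_evaluation_updated.csv`.

```python
import pandas as pd

# Hypothetical hand-labelled sample: 1 means the annotator judged the
# method's feature sentiments for that review to be correct, 0 otherwise.
evaluated = pd.DataFrame({
    'vader_correct':  [1, 0, 1, 1],
    'logreg_correct': [1, 1, 0, 1],
    'bert_correct':   [1, 1, 1, 0],
})

# The mean of a 0/1 column is the fraction of reviews judged correct
accuracy = evaluated.mean()
print(accuracy)
```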

In [None]:
math.isnan(float('nan'))  # True; this check is used to filter out features without ratings

In [67]:
dummy = {'battery':[]}
for feature in new_reviews['logreg_pred'][0]:
    if feature[1] == 'battery':
        dummy['battery'].append(feature[0])

dummy

{'battery': [5]}

In [6]:
def mean_ratings(feature_ratings):
    """Return the mean rating per feature from a list of (rating, feature) tuples."""
    # Map feature variants onto one canonical feature name
    aliases = {'fingerprints': 'fingerprint', 'touchscreen': 'screen'}
    all_features = {'camera': [], 'battery': [], 'fingerprint': [], 'screen': [],
                    'charger': [], 'simcard': [], 'ringtones': []}

    for rating, feature in feature_ratings:
        feature = aliases.get(feature, feature)
        # Ignore features that are not tracked
        if feature in all_features:
            all_features[feature].append(rating)

    # np.mean of an empty list returns NaN with a warning instead of raising,
    # so skip features without ratings rather than wrapping the call in try/except
    all_features_mean = {key: np.mean(value) for key, value in all_features.items() if value}

    # Drop any feature whose mean is still NaN (e.g. a NaN rating in the input)
    new_dict = {key: val for key, val in all_features_mean.items() if not math.isnan(val)}

    return new_dict
    

In [7]:
new_reviews['logreg_pred'][22036]

[(5, 'screen'),
 (1, 'screen'),
 (5, 'camera'),
 (5, 'fingerprint'),
 (5, 'battery')]

In [9]:
dummy = mean_ratings (new_reviews['logreg_pred'][22036])
dummy

{'camera': 5.0, 'battery': 5.0, 'fingerprint': 5.0, 'screen': 3.0}

In [71]:
new_dict = {key: val for key, val in dummy.items() if not math.isnan(val)}
new_dict

{'camera': 5.0, 'battery': 5.0, 'fingerprint': 5.0, 'screen': 3.0}

In [47]:
dummy.pop('charger')

nan

In [48]:
dummy

{'camera': 5.0, 'battery': 5.0, 'fingerprint': 5.0, 'screen': 3.0}

In [28]:
# Iterate over a copy of the keys: deleting from a dict while iterating over it raises RuntimeError
for key in list(dummy):
    if math.isnan(dummy[key]):
        del dummy[key]
    
    

In [None]:
all_products = {}
unique_asins = new_reviews['asin'].unique()
feature_names = ['camera', 'battery', 'fingerprint', 'screen', 'charger',
                 'touchscreen', 'simcard', 'ringtones']

for product in unique_asins:
    all_products[product] = {name: [] for name in feature_names}
    # Only look at the reviews that belong to this product
    product_preds = new_reviews.loc[new_reviews['asin'] == product, 'logreg_pred']
    for feature_ratings in product_preds:
        # Each entry is a list of (rating, feature) tuples, not a dict
        for rating, feature in feature_ratings:
            if feature in all_products[product]:
                all_products[product][feature].append(rating)

## Mean ratings by features of each unique product

In [None]:
new_reviews.reset_index(inplace=True,drop=True)

In [None]:
unique_asins = new_reviews['asin'].unique()

In [None]:
new_reviews.loc[1,'features_and_sentiments']

In [None]:
all_products = {}

#for cell in new_review['features_and_sentiments']: 
all_features=set()

for product in unique_asins:
    all_products[product] = {'camera':[],'battery':[],'fingerprint':[],'screen':[],'charger':[]}
    for idx in new_reviews.index:
        if new_reviews.loc[idx,'asin'] == product:
            for feature in new_reviews.loc[idx,'features_and_sentiments']:
                all_features.add(feature[1])
                if feature[1] =='battery':
                    all_products[product]['battery'].append(feature[0])
                elif feature[1]  == 'camera':
                    all_products[product]['camera'].append(feature[0])
                elif feature[1]  == 'charger':
                    all_products[product]['charger'].append(feature[0])
                elif feature[1] == 'screen':
                    all_products[product]['screen'].append(feature[0])
                elif feature[1] == 'fingerprint':
                    all_products[product]['fingerprint'].append(feature[0])
        

In [None]:
for product, features in all_products.items():
    for feature, ratings in features.items():
        # np.mean of an empty list returns NaN with a warning rather than raising,
        # so handle the "no ratings" case explicitly
        features[feature] = round(np.mean(ratings), 1) if ratings else np.nan

In [None]:
mean_ratings = pd.DataFrame(all_products).T

In [None]:
mean_ratings.reset_index(inplace=True)
mean_ratings

In [None]:
mean_ratings.rename(columns={'index':'asin'},inplace=True)

In [None]:
updated_mean_ratings = pd.merge(mean_ratings,new_reviews[['asin','item_title']],on='asin',how='inner')
updated_mean_ratings.drop_duplicates(subset=['asin'],keep='first',inplace=True)

In [None]:
updated_mean_ratings.reset_index(inplace=True,drop=True)

In [None]:
updated_mean_ratings.tail(20)
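
The nested loops above scan every review once per product, which gets slow as the number of products grows. A possible alternative, sketched here on made-up data, is to let pandas do the grouping: `explode` turns each `(rating, feature)` tuple into its own row, and `groupby` plus `unstack` produces one row per `asin` with one column per feature. The `toy_reviews` frame, its asins and its ratings are hypothetical stand-ins for `new_reviews`.

```python
import pandas as pd

# Toy stand-in for new_reviews; the asins and ratings here are made up
toy_reviews = pd.DataFrame({
    'asin': ['A1', 'A1', 'B2'],
    'features_and_sentiments': [
        [(5, 'battery'), (1, 'screen')],
        [(3, 'battery')],
        [(5, 'camera')],
    ],
})

# One row per (rating, feature) tuple; reviews with an empty list become NaN and are dropped
exploded = (
    toy_reviews.explode('features_and_sentiments')
    .dropna(subset=['features_and_sentiments'])
    .reset_index(drop=True)
)
exploded['rating'] = [float(pair[0]) for pair in exploded['features_and_sentiments']]
exploded['feature'] = [pair[1] for pair in exploded['features_and_sentiments']]

# Mean rating per product and feature, reshaped to one row per asin
toy_mean_ratings = (
    exploded.groupby(['asin', 'feature'])['rating']
    .mean()
    .round(1)
    .unstack()
    .reset_index()
)
print(toy_mean_ratings)
```

Features that a product's reviews never mention show up as NaN, which matches the behaviour of the loop-based version.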
