# Review Rating Predictor for Amazon
by Allen Wang

*Amazon is one of the largest e-commerce companies and one of the Big Five of the US information technology industry. It offers a huge variety of products across many categories. Products are rated with 5-star ratings and written reviews, and a major objective of this project was to identify and label customer satisfaction among purchasers of Amazon products. I built a model that sorts review text into three rating categories: High (4-5 stars), Medium (3 stars), and Low (1-2 stars). The reasoning for using three categories is that a 4-star review is hardly distinguishable from a 5-star review for a human, let alone a computer, and the same goes for 1-star and 2-star reviews. The model takes in text data and outputs the customer's overall satisfaction grade: High, Medium, or Low.*

![Amazon Products](amazon.jpg)


### 1. Data

The data comes from a dataset on Kaggle.com published by Datafiniti. Datafiniti maintains a product database covering Amazon products that range from the Kindle Fire to Amazon Essentials. The dataset has 24 columns and 28,332 reviews with ratings, along with other information such as product ID and product category. To view the dataset and the Datafiniti homepage, click on the links below:

* [Datafiniti](https://datafiniti.co/)
* [Kaggle](https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products?select=Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv)

### 2. Method

For this type of project, it is imperative to properly sample the data, vectorize the text, and apply the TF-IDF (term frequency-inverse document frequency) transformer. TF-IDF is a statistical method for weighting keywords in documents: a term scores highly when it is frequent within a document but rare across the whole collection. We use it here to turn keyword frequencies into features. The formula for TF-IDF is shown below.

![tfidf](tfidf1.png)



What matters is that the TF-IDF transformer turns raw text into numerical, frequency-weighted features that can be fitted to different models.
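To make the weighting concrete, here is a minimal, library-free sketch of the textbook definition, tf(t, d) × log(N / df(t)). The choice of variant is an assumption: the formula in the figure above and scikit-learn's `TfidfTransformer` (which smooths the IDF term and normalizes each row by default) may differ in detail. The toy reviews and the `tf_idf` helper are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy reviews, not taken from the dataset.
reviews = [
    "great tablet great price",
    "battery died fast",
    "great battery",
]
scores = tf_idf(reviews)
```

In the first review, "tablet" ends up with a higher weight than "great" even though "great" occurs twice, because "great" also appears in another review and is therefore less distinctive.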

After the TF-IDF transformer is applied, the data is run through several candidate algorithms: Logistic Regression, Multinomial Naive Bayes, Random Forest, Linear Support Vector Classifier, and XGBoost.
Logistic Regression is mostly used as a binary classifier and is modeled by a sigmoid function. Multinomial Naive Bayes is a classifier often used for discrete features such as, in our case, word counts, so it might seem like the ideal algorithm for this project. Random Forest is an ensemble of decision trees and is a relatively good algorithm for most scenarios; it can also compensate for imbalanced categories when class weights are enabled. Linear Support Vector Classifier draws hyperplanes that separate data points into different classes. XGBoost is a gradient boosting classifier, meaning that new models are added to correct the errors made by previous models. At this stage it is still not clear which algorithm to use, but I'm leaning towards Random Forest, Multinomial Naive Bayes, and XGBoost, since all are reliable algorithms that can handle a wide variety of data. XGBoost in particular has a strong track record in Kaggle competitions.


### 3. Data Cleaning

Just from looking at the dataset, a couple of problems stand out. Two columns, 'reviews.id' and 'reviews.didPurchase', consist almost entirely of missing values, so I dropped them. Another problem was that the date columns needed to be converted to the datetime format to allow any data analysis or feature engineering on them. I converted three columns to datetime objects: 'reviews.date', 'dateAdded', and 'dateUpdated'. I called value_counts() on the 'reviews.rating' column and discovered that the dataset was heavily imbalanced, so later in the machine learning process I downsampled the 5-star and 4-star reviews.
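The cleaning steps above could look roughly like the following pandas sketch. The small DataFrame is a hypothetical stand-in for the Kaggle CSV, and the rating column is assumed to be named `reviews.rating`; the project's actual notebook code may differ.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle CSV (the real file has 24 columns).
df = pd.DataFrame({
    "reviews.id": [None, None, None, None],
    "reviews.didPurchase": [None, None, None, None],
    "reviews.date": [
        "2017-03-02T00:00:00Z", "2018-05-10T00:00:00Z",
        "2018-06-01T00:00:00Z", "2018-07-15T00:00:00Z",
    ],
    "dateAdded": ["2017-03-03T10:00:00Z"] * 4,
    "dateUpdated": ["2019-02-25T02:00:00Z"] * 4,
    "reviews.rating": [5, 5, 4, 1],
})

# Drop the two columns that are almost entirely missing.
df = df.drop(columns=["reviews.id", "reviews.didPurchase"])

# Parse the date columns so they support datetime operations.
for col in ["reviews.date", "dateAdded", "dateUpdated"]:
    df[col] = pd.to_datetime(df[col], errors="coerce", utc=True)

# Count reviews per star rating to inspect the class balance.
rating_counts = df["reviews.rating"].value_counts()
```

Passing `errors="coerce"` turns unparseable dates into `NaT` instead of raising, which is a design choice made here for robustness rather than something the source specifies.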

![ratinghistogram](download.png)

### 4. Exploratory Data Analysis

I decided to examine the most popular product categories and the products with the most reviews. The most reviewed category was Electronics, with Health and Beauty coming in a close second. The most popular products were the Amazon AAA and AA batteries, followed by the Fire HD 8 Tablet and the Kids Tablet.

Most Popular Categories    |  Most Popular Products
:-------------------------:|:-------------------------:
![](eda1.png)  |  ![](eda4.png)

I then created a scatterplot of review length against review rating. The plot shows a positive correlation between review rating and review length, although the longest reviews are concentrated in the 4-star category.

![ratinghistogram](eda3.png)

I also realized that it is extremely difficult to distinguish 4 stars from 5 stars, or 1 star from 2 stars, so I decided to create three classes of ratings: 'High', 'Medium', and 'Low'. High consists of 4 or 5 stars, Medium consists of 3 stars, and Low consists of 1 or 2 stars. Now the data is ready to be used for machine learning.
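The three-class mapping is simple enough to express as a small function. This is a sketch of the rule described above; the helper name `rating_to_class` is hypothetical.

```python
def rating_to_class(stars):
    """Map a 1-5 star rating to 'High', 'Medium', or 'Low'."""
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError(f"expected a rating from 1 to 5, got {stars!r}")
    if stars >= 4:
        return "High"
    if stars == 3:
        return "Medium"
    return "Low"


labels = [rating_to_class(stars) for stars in [5, 4, 3, 2, 1]]
```

With pandas, the same function can be applied to a whole ratings column through `Series.map`.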

### 5. Algorithms and Machine Learning

To fit a machine learning model to text data, we first need to apply the count vectorizer to the review text column of the dataset. After applying the count vectorizer, I looked at the most common words and bigrams across the reviews.

![ratinghistogram](eda5.png)

Here are word clouds of the most common words in the reviews.text column of the dataset, one per rating class.

High Rating   |  Medium Rating   | Low Rating
:-------------------------:|:-------------------------:|:-------------------------:
![](wordcloud3.png)  |  ![](wordcloud2.png) | ![](wordcloud1.png)

From the earlier bar graph of rating counts, I realized the dataset was heavily imbalanced, so I needed to downsample the number of 5-star ratings, and the 4-star ratings as well. The imbalance is not surprising: Amazon-branded products tend to be reliable, given that they are made by Amazon itself, so most reviews are positive. I downsampled the 5-star reviews to a 2:1 ratio relative to the other ratings. Even though the data is still imbalanced, it is a fair number to work with.
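One way to downsample with pandas is sketched below. The exact rule is an assumption: here the 5-star class is capped at twice the size of the next-largest rating class, and only the 5-star step is shown. The rating counts are invented for illustration, and the column is assumed to be named `reviews.rating`.

```python
import pandas as pd

# Hypothetical rating distribution, heavily skewed toward 5 stars.
df = pd.DataFrame(
    {"reviews.rating": [5] * 20 + [4] * 6 + [3] * 2 + [2] * 1 + [1] * 1}
)

counts = df["reviews.rating"].value_counts()

# Assumption: cap the 5-star class at twice the next-largest class.
cap = int(min(2 * counts.drop(5).max(), counts[5]))

five_star = df[df["reviews.rating"] == 5].sample(n=cap, random_state=42)
rest = df[df["reviews.rating"] != 5]

# Recombine and shuffle so the classes are not grouped together.
balanced = (
    pd.concat([five_star, rest])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
```

Fixing `random_state` keeps the sampled subset reproducible between runs.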

I then called train_test_split on the dataset, splitting the data 80/20 into training and testing sets, and ran the data through every algorithm listed above. For the scoring metric, I decided against accuracy, which can be misleading on imbalanced datasets. The Amazon dataset is still somewhat imbalanced after the downsampling, as there are twice as many 5-star reviews as reviews of any other rating. Therefore, I decided to use the F1 score, specifically micro-averaged F1 (f1_micro). One caveat: in a single-label multiclass problem, micro-averaged F1 is numerically equal to accuracy, so macro or weighted averaging is the variant that gives minority classes more influence. The F1 score is the harmonic mean of precision and recall: precision rewards minimizing false positives, and recall rewards minimizing false negatives. Since neither kind of error is markedly more costly for this problem, I went with the F1 score, which balances the two.
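The comparison loop might look something like the sketch below. It is not the project's actual function: the toy corpus is invented, default hyperparameters are assumed, and XGBoost is left out to keep the example dependent on scikit-learn only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; the real input is the review text column.
high = ["love it works great", "excellent tablet very happy",
        "great value highly recommend"]
medium = ["it is okay nothing special", "average product does the job",
          "decent but could be better"]
low = ["terrible stopped working", "waste of money very disappointed",
       "awful battery died quickly"]
texts = (high + medium + low) * 5
labels = (["High"] * 3 + ["Medium"] * 3 + ["Low"] * 3) * 5

# 80/20 split, stratified so each class appears in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "MultinomialNB": MultinomialNB(),
    "RandomForest": RandomForestClassifier(random_state=0),
    "LinearSVC": LinearSVC(),
}

results = {}
for name, model in models.items():
    # Vectorize inside the pipeline so TF-IDF is fitted on training data only.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(X_test)
    results[name] = {
        "f1_micro": f1_score(y_test, predictions, average="micro"),
        "accuracy": accuracy_score(y_test, predictions),
    }
```

In this single-label setting the two recorded numbers coincide for every model, so switching to `average="macro"` is a quick way to see how the smaller classes fare.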

![ratinghistogram](eda6.png)

The best-performing models were the Linear Support Vector Classifier and the XGBoost classifier.

XGBClassifier is a highly reliable and efficient implementation of gradient boosting. I decided to tune its hyperparameters to see if I could improve the model's results, and chose a random search for the job. A grid search would be too expensive in computation time and CPU usage, whereas a random search typically gives similar results with significantly less computation. Surprisingly, the random search gave a worse score than the default parameters.
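For reference, a random search in scikit-learn can be set up as in the sketch below. To avoid an extra dependency it uses `GradientBoostingClassifier` as a stand-in for XGBoost, on synthetic data; the parameter ranges, `n_iter`, and `cv` values are illustrative assumptions, not the settings used in this project.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic three-class data standing in for the TF-IDF features.
X, y = make_classification(
    n_samples=150, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Distributions to sample from; uniform(loc, scale) covers [loc, loc + scale].
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(2, 5),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),  # stand-in for XGBClassifier
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled parameter combinations
    scoring="f1_micro",
    cv=3,
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

Because only `n_iter` combinations are tried, a random search can miss the default configuration entirely, which is one plausible reason a tuned model scores below the defaults.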

![ratinghistogram](classreport1.png)

Classification report of the random-searched XGBoost model:

![ratinghistogram](xgboostparams.png)

![ratinghistogram](classreport2.png)

Classification report of the Linear SVC model:

![ratinghistogram](classreport3.png)

After further consideration, I realized it was better to use the ROC-AUC score as the scoring metric, because it evaluates how well the model's class scores rank the reviews across all decision thresholds, rather than judging predictions at a single fixed threshold as the F1 score does. It is a better scoring metric for our type of data.
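How the multiclass ROC-AUC was computed is not stated here, so the following is a sketch under one common assumption: a one-vs-rest, macro-averaged AUC. Since `LinearSVC` has no `predict_proba`, the example scores each class with `decision_function` and binarizes the true labels. The data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.svm import LinearSVC

# Synthetic three-class data standing in for the TF-IDF features.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LinearSVC(max_iter=10000).fit(X_train, y_train)

# Per-class decision scores, one column per class.
scores = model.decision_function(X_test)

# One-vs-rest: binarize the labels, then average the per-class AUCs.
y_test_bin = label_binarize(y_test, classes=model.classes_)
auc = roc_auc_score(y_test_bin, scores, average="macro")
```

For models that do expose `predict_proba`, `roc_auc_score(y_test, probabilities, multi_class="ovr")` is the more direct route.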

![ratinghistogram](rocaucscore.jpg)

Despite the extra computation spent on random-searching parameters for XGBoost, Linear SVC still had the highest ROC-AUC score, which is the metric I'm using to evaluate my model. Therefore, Linear SVC is the final model.

![ratinghistogram](finalmodel.png)