# Predicting Sales Rank for a Product on Amazon using Product Description

#### DATA 200, Spring 2021
#### Final Project
#### Project members:
- Alexander Wu, 3032676584, alexwu68@berkeley.edu
- Jonathan Kupfer, 25718319, jkupfer@berkeley.edu
- Utkarsh Yadav, 3035277597, utkarsh_yadav@berkeley.edu



## Introduction
This project aims to use Amazon product metadata (e.g., product description, price, brand) to predict a product's Amazon sales rank within a specific category. The categories this project focuses on are `software` and `all beauty` products.

## Motivation
Our target audience for this project is new vendors who are interested in adding a product to Amazon. The model helps them understand what information (e.g., keywords in the description, specific brands or brand recognition, price, etc.) can be used to improve a product's sales rank within its category.

## Research Question
How can the metadata of Amazon products for a given category be used to predict a new product's sales rank in that category?

## Literature Review
While we have not found this specific research question answered in existing literature, other projects use similar datasets to make sales-related predictions. A blog post from AIReview (`https://www.aihello.com/resources/blog/2020/04/13/predicting-amazon-sales-using-deep-learning/`) used deep learning to predict a product's sales given the product's sales rank. Because the authors focused on top sellers and top-selling categories, they were not able to make predictions for lower-selling categories. Numerous academic papers have also researched sales predictions at Amazon. One paper by Singh et al., titled 'Sales Forecast for Amazon Sales with Time Series Modeling' [citation: https://ieeexplore.ieee.org/document/9071463], uses neural networks to forecast future sales at Amazon from historical sales data. These predictions were not category-specific. This project adds to the field by showing how a product's sales rank within its own category can be predicted from the product's metadata, specifically using many features that the vendor has control over.



## Methodology

### Description of the data:
This report studies two datasets: the reviews dataset and the metadata dataset. The study covers three categories of products on Amazon: Musical Instruments, Software, and Beauty. The attributes of the metadata dataset, along with a short description of each, are listed below:
- `asin` - ID of the product, e.g. 0000031852
- `title` - name of the product
- `feature` - bullet-point format features of the product
- `description` - description of the product
- `price` - price in US dollars (at time of crawl)
- `image` - url of the product image
- `related` - related products (also bought, also viewed, bought together, buy after viewing)
- `salesRank` - sales rank information
- `brand` - brand name
- `categories` - list of categories the product belongs to
- `tech1` - the first technical detail table of the product
- `tech2` - the second technical detail table of the product
- `similar` - similar product table

Similarly, the attributes of the reviews dataset are:
- `reviewerID` - ID of the reviewer, e.g. A2SUAM1J3GNN3B
- `asin` - ID of the product, e.g. 0000013714
- `reviewerName` - name of the reviewer
- `vote` - helpful votes of the review
- `style` - a dictionary of the product metadata, e.g., "Format" is "Hardcover"
- `reviewText` - text of the review
- `overall` - rating of the product
- `summary` - summary of the review
- `unixReviewTime` - time of the review (unix time)
- `reviewTime` - time of the review (raw)
- `image` - images that users post after they have received the product

For the reviews dataset, we specifically use the 5-core reviews. A k-core review dataset is a subset of the entire reviews dataset in which every user and every item has at least k reviews (here, k = 5). This subset is smaller and requires less memory to process.
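
To make the k-core definition concrete, here is a minimal sketch of how such a subset could be computed with `pandas`. The function name `k_core` and the toy data are hypothetical, and this is only an illustration of the definition; it does not claim to reproduce how the published 5-core files were built.

```python
import pandas as pd


def k_core(reviews: pd.DataFrame, k: int = 5,
           user_col: str = "reviewerID", item_col: str = "asin") -> pd.DataFrame:
    """Return the rows where every user and every item has at least k reviews."""
    core = reviews
    while True:
        # Number of reviews per user / per item, aligned to each row.
        user_counts = core[user_col].map(core[user_col].value_counts())
        item_counts = core[item_col].map(core[item_col].value_counts())
        keep = (user_counts >= k) & (item_counts >= k)
        if keep.all():
            return core
        # Dropping rows can push other users or items below k, so repeat.
        core = core[keep]


# Toy data (made up): u1 and u2 each review items a and b; u3 reviews only c.
toy = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u2", "u2", "u3"],
    "asin":       ["a",  "b",  "a",  "b",  "c"],
})
core2 = k_core(toy, k=2)  # u3 and item c are dropped
```

The loop is needed because a single pass is not enough in general: removing a sparse user can leave an item with fewer than k reviews, and vice versa.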

### Description of methods:
#### Feature Engineering
- One Hot Encoding: We chose to use a one hot encoding of the descriptions because... 
- Price: Price is used as a feature in our model. The vendor can set the price of a product, and the price will affect a buyer's willingness to pay for the product.
- Price$^2$: Price squared is used as a feature in our model. Squaring the price weights higher-priced items more heavily; a high price could negatively affect sales rank.
- Brand Counts: This feature represents the number of total products of the same brand in the dataset. This could be a proxy for brand recognition, but also for products that are more ubiquitous. 
- Brand$^2$: Brand squared is used as a feature in our model to highlight more ubiquitous brands. 

By including the squared price and squared brand count, we allow the linear model to capture non-linear relationships between these features and the label.



The square root of the sales rank was used as the label. The distribution of the sales rank is more symmetric after this transformation than in its original form or after a logarithmic transformation, so we believed it would be easier for a linear model to predict.
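
One way to compare candidate label transformations is to compute the sample skewness of each, where a value near zero indicates a symmetric distribution. The sketch below is illustrative only: the helper `skewness` is hypothetical, and the toy ranks are perfect squares chosen so that the square-root scale is symmetric by construction. They are not the project's data.

```python
import numpy as np


def skewness(x) -> float:
    """Sample skewness: third central moment divided by variance ** 1.5."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)


# Toy sales ranks (made up): 1, 4, 9, ..., 10000.
rank = np.arange(1, 101) ** 2

skews = {
    "raw": skewness(rank),            # long right tail, positive skew
    "sqrt": skewness(np.sqrt(rank)),  # symmetric here by construction
    "log": skewness(np.log(rank)),    # long left tail, negative skew
}

# A model trained on sqrt(rank) predicts on the sqrt scale;
# squaring maps a prediction back to a sales rank.
rank_back = np.sqrt(rank) ** 2
```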

#### Modeling

The above features

- Ridge Regression: the `sklearn` class `RidgeCV` was used to perform $l_2$ regularized linear regression.
    - 
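
A minimal sketch of fitting `RidgeCV` on a square-root label is shown below. The synthetic data, the `alphas` grid, and the `StandardScaler` step are assumptions made for illustration; the report does not state which regularization grid or preprocessing the project used.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and sqrt(sales rank) label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
true_coef = np.array([3.0, -2.0, 0.0, 1.0])
y_sqrt = 50 + X @ true_coef + rng.normal(scale=0.5, size=200)

# Assumed grid of regularization strengths; RidgeCV picks one by cross-validation.
alphas = np.logspace(-3, 3, 13)

# Scaling first so the l2 penalty treats all features comparably (an assumption).
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
model.fit(X, y_sqrt)

chosen_alpha = model.named_steps["ridgecv"].alpha_

# Predictions are on the sqrt scale; clip at zero, then square to get a rank.
pred_rank = np.clip(model.predict(X[:5]), 0, None) ** 2
```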

## Summary of results:

### Missing prices:
The metadata dataset for the Software category has many products with missing prices: about 75% of the products have no price. This initially came as a surprise, and it is concerning because price is an important feature in our model.
Most of these turned out to be old products that were once sold on Amazon but have since been replaced by improved versions. The image below shows one such product. Notice how the product landing page says **Currently Unavailable**.

**** Insert figure from Amazon ****

Certain visualizations can help us explore this further. Fig. aaa shows a scatter plot of the proportion of a brand's products that are missing prices against the total number of products from that brand. It is important to note that the x-axis is not total sales but the total number of the brand's products listed on Amazon.


**** Insert scatter ****


However, this probably includes many brands that only ever sold a few products and then shut down (notice the many points at y = 1 close to x = 0). If we include only brands that have at least a few products with active prices and a minimum of 100 products listed on Amazon, most of the points lie above y = 0.5.
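
The per-brand quantities behind these scatter plots could be computed along the following lines. This is a sketch, assuming missing prices are stored as `NaN`; the function name `brand_price_summary`, the threshold parameters `min_products` and `min_priced`, and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd


def brand_price_summary(meta: pd.DataFrame, min_products: int = 100,
                        min_priced: int = 1) -> pd.DataFrame:
    """Per brand: total products, number and proportion missing a price."""
    summary = (
        meta.assign(price_missing=meta["price"].isna())
            .groupby("brand")
            .agg(total_products=("price_missing", "size"),
                 n_missing=("price_missing", "sum"))
    )
    summary["prop_missing"] = summary["n_missing"] / summary["total_products"]
    n_priced = summary["total_products"] - summary["n_missing"]
    # Keep brands that are large enough and still have some priced products.
    keep = (summary["total_products"] >= min_products) & (n_priced >= min_priced)
    return summary[keep]


# Toy metadata (made up); the thresholds are lowered to suit the tiny example.
meta = pd.DataFrame({
    "brand": ["A", "A", "A", "A", "B", "B", "C"],
    "price": [9.99, np.nan, np.nan, np.nan, 5.0, 6.0, np.nan],
})
filtered = brand_price_summary(meta, min_products=2, min_priced=1)
```

Plotting `total_products` on the x-axis against `prop_missing` on the y-axis gives the kind of scatter described above; brand C is dropped here because it has a single product with no active price.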


**** Insert filtered scatter ****

This shows how big brands, especially in the tech industry, keep improving and releasing new or revised products to survive in the Amazon marketplace. It is astonishing how quickly the inventory has to evolve. The rightmost point is Microsoft: of its roughly 1200 products in the dataset, only about 175 are actively sold.

### Sentiment of reviews:
We again look at the Software category; however, this analysis can be performed for any of the categories in this study. Sentiment itself is not a feature of the model, but this analysis was done to see how sentiment varies with product price.
VADER was used to get the polarity of the review text.
As part of the data cleaning process, all punctuation was removed and the text was converted to lower case. Once the polarity is extracted, we first look at polarity versus the overall rating of each review.
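
The cleaning step described above can be sketched with the Python standard library alone. The function name `clean_review` and the example sentence are hypothetical; the project's own cleaning code may differ in its details.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_review(text: str) -> str:
    """Remove punctuation and convert the text to lower case."""
    return text.translate(_PUNCT_TABLE).lower()


cleaned = clean_review("Great product!!! Works well, 10/10.")
```

The cleaned string could then be scored with a VADER implementation, for example `SentimentIntensityAnalyzer().polarity_scores(cleaned)["compound"]` from the `vaderSentiment` package. Which VADER implementation the project used is not stated, so that call is an assumption.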

**** Insert box plot ****

Overall, there appears to be a positive correlation between polarity and rating, as one would expect. However, as the rating goes from 4.0 to 5.0, we observe a dip in the box plot statistics. We explore this further by separating the reviews from verified and unverified users.

**** Insert box plot with hue****

The dip appears to be driven largely by reviews from unverified users. This sheds some light on how seriously reviews by unverified users should be taken. However, one could also argue that a reviewer reaches a saturation point in praising a product at a 4.0 rating. Note also that many reviews are bound to contain spelling mistakes, which VADER would not recognize.




## Discussion:
Analysis of your findings to answer your research question(s). Include visualizations and specific results. If your research questions contain a modeling component, you must compare the results using different inference or prediction methods (e.g., linear regression, logistic regression, or classification and regression trees). Can you explain why some methods performed better than others?


## Limitations:
An evaluation of your approach and discuss any limitations of the methods you used.

## Surprises and future work:
Describe any surprising discoveries that you made and future work.