# HW5 - Rating prediction using Amazon's Reviews
    
In this exercise, you'll train a text classifier on a **subset** of the Amazon Reviews dataset.

The Amazon Reviews dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 to July 2014.


We will focus on the Home and Kitchen segment, which contains ~550k reviews and can be downloaded here: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Home_and_Kitchen_5.json.gz

You will predict the rating that was given to a product from the review.

The dataset contains the following fields for each review, in JSON format:
1. "reviewerID": "A11N155CW1UV02"
1. "asin": "B000H00VBQ"
1. "reviewerName": "AdrianaM"
1. "helpful": [0, 0]
1. "reviewText": "I had big expectations because I love English TV, in particular Investigative and detective stuff but this guy is really boring. It didn't appeal to me at all."
1. "overall": 2.0
1. "summary": "A little bit boring for me"
1. "unixReviewTime": 1399075200
1. "reviewTime": "05 3, 2014"




Please note that the **only** two fields you are allowed to use in this exercise are "reviewText", which contains the review, and "overall", which contains the rating. In addition, you have the **option** to use the "asin" field, which is a unique product identifier. You may (or may not :) ) find this field useful.



In [1]:
!wget -nc http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Home_and_Kitchen_5.json.gz

--2020-06-23 15:16:57--  http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Home_and_Kitchen_5.json.gz
Resolving snap.stanford.edu (snap.stanford.edu)...171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80...connected.
HTTP request sent, awaiting response...200 OK
Length: 138126598 (132M) [application/x-gzip]
Saving to: ‘reviews_Home_and_Kitchen_5.json.gz’


2020-06-23 15:20:16 (679 KB/s) - ‘reviews_Home_and_Kitchen_5.json.gz’ saved [138126598/138126598]



## General guidelines

1. You are required to implement at least two models.
1. The first should be a CNN or an RNN (or a combination) and should include the use of GloVe embeddings.
1. The second model should be implemented using the transformers package and include the transfer learning concepts that were mentioned in the lecture.
1. Pay attention to any preprocessing steps that are needed.
1. Feel free to be creative and use any method that was mentioned in the lectures (e.g., tf-idf, POS, ...); extra points will be given for creativity.
1. The main criterion for evaluation is not the overall score but rather the entire process (preprocessing, efficient training, ...).





In [2]:
import pandas as pd
import zipfile
import json
import gzip
from tqdm import tqdm

In [3]:
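# Open the GloVe 840B 300-d archive (the path is machine-specific; adjust as needed).
z = zipfile.ZipFile("/Users/Adam/workspace/yandex/Y-Data/2nd Semester/NLP/Assignment 4/glove.840B.300d.zip")
# quoting=3 (csv.QUOTE_NONE) stops quote characters inside tokens from being parsed as CSV quotes.
glove_pd = pd.read_csv(z.open('glove.840B.300d.txt'), sep=" ", quoting=3, header=None, index_col=0)
# Map each token to its embedding vector as a NumPy array.
glove = {key: val.values for key, val in glove_pd.T.items()}
del glove_pd  # free the DataFrame; only the dict is needed from here on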

First we open the reviews file and turn it into a mappable object. The file holds one review per line, so we split the text on '\n' and drop the last entry: the file ends with '\n', so the final element of the list is an empty string.

In [4]:
with gzip.open('reviews_Home_and_Kitchen_5.json.gz', 'rt') as f:
    content = f.read()
content = content.split('\n')[:-1]
print(content[0])
print(content[-1])

FileNotFoundError: [Errno 2] No such file or directory: 'reviews_Home_and_Kitchen_5.json.gz'
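As an alternative to reading the whole file into memory and splitting it, the reviews can be parsed line by line while keeping only the permitted fields. The following is a minimal sketch, assuming the file holds one JSON object per line; `iter_reviews`, `demo_path`, and the two sample rows are hypothetical names and made-up data used only so the example runs without the real download.

```python
import gzip
import json
import os
import tempfile


def iter_reviews(path, fields=("reviewText", "overall", "asin")):
    """Yield one dict per review, keeping only the requested fields."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines, e.g. after a trailing newline
                continue
            record = json.loads(line)
            yield {k: record.get(k) for k in fields}


# Tiny made-up file standing in for reviews_Home_and_Kitchen_5.json.gz.
sample = [
    {"asin": "A1", "overall": 5.0, "reviewText": "Great pan.", "summary": "x"},
    {"asin": "A2", "overall": 2.0, "reviewText": "Broke fast.", "summary": "y"},
]
demo_path = os.path.join(tempfile.mkdtemp(), "demo_reviews.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    for row in sample:
        f.write(json.dumps(row) + "\n")

demo_reviews = list(iter_reviews(demo_path))
print(demo_reviews[0])
```

Because the function is a generator, the same call on the real file could be passed directly to `pd.DataFrame(...)` without first holding the raw text in memory.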

Now that we have a list of JSON strings, we can turn our reviews into a mapping in which each review has a unique ID.

In [None]:
# Keep a plain dict (ID -> parsed review) so that reviews[0] and reviews.values() work below.
reviews = {i: json.loads(r) for i, r in enumerate(content)}

Looking at the fields, "asin" could be used to build an average-rating-per-product feature, which could help predict a specific rating further down the line. Let's do it.

In [69]:
from collections import Counter

In [77]:
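# One record per review with fixed keys, so from_records yields two columns: asin and overall.
product_ids = [{'asin': r['asin'], 'overall': r['overall']} for r in reviews.values()]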

In [1]:
pd.DataFrame.from_records(product_ids)

NameError: name 'product_ids' is not defined
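The average-rating-per-product idea can be sketched with a pandas `groupby`. This is an illustration on made-up rows, not the notebook's actual result; `demo`, `product_stats`, `avg_rating`, and `n_reviews` are hypothetical names. In a real pipeline the averages would presumably be computed on the training split only, since an average that includes a review's own rating leaks the label.

```python
import pandas as pd

# Made-up rows standing in for the parsed reviews.
demo = pd.DataFrame({
    "asin": ["A1", "A1", "A2", "A3"],
    "overall": [5.0, 3.0, 2.0, 4.0],
})

# Mean rating and number of reviews per product.
product_stats = (
    demo.groupby("asin")["overall"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "avg_rating", "count": "n_reviews"})
)

# Attach the per-product statistics to every review row (row order is preserved).
demo = demo.join(product_stats, on="asin")
print(demo)
```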

Now we need to decide how to preprocess the review texts.

In [63]:
review_text = reviews[0]['reviewText']
review_text

'My daughter wanted this book and the price on Amazon was the best.  She has already tried one recipe a day after receiving the book.  She seems happy with it.'
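One possible preprocessing route for the CNN/RNN model is to tokenize each review, map tokens to integer ids, pad to a fixed length, and build an embedding matrix from the GloVe vectors. The sketch below is only one option under stated assumptions: the regex tokenizer, the reserved ids (0 for padding, 1 for unknown), and `toy_glove` (a 3-dimensional stand-in for the real 300-dimensional `glove` dict) are all illustrative choices. Case is kept, since the 840B GloVe vocabulary is case-sensitive.

```python
import re

import numpy as np

# Words (with an optional apostrophe suffix such as "don't") or single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]")


def tokenize(text):
    """Split text into word and punctuation tokens, keeping the original case."""
    return TOKEN_RE.findall(text)


def encode(tokens, vocab, max_len):
    """Map tokens to ids (1 = unknown), then truncate or pad with 0 to max_len."""
    ids = [vocab.get(t, 1) for t in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


# Hypothetical stand-in for the real `glove` dict (3-d instead of 300-d).
toy_glove = {"She": np.ones(3), "happy": np.full(3, 2.0), ".": np.zeros(3)}

text = "She seems happy with it."
tokens = tokenize(text)

# Ids 0 and 1 are reserved for padding and unknown tokens.
vocab = {tok: i + 2 for i, tok in enumerate(toy_glove)}
ids = encode(tokens, vocab, max_len=8)

# Embedding matrix: row i holds the vector for id i; rows 0 and 1 stay zero.
emb = np.zeros((len(vocab) + 2, 3))
for tok, i in vocab.items():
    emb[i] = toy_glove[tok]

print(tokens)
print(ids)
```

With the real `glove` dict, the vocabulary would more likely be built from the tokens seen in the training reviews rather than from every GloVe entry, which keeps the embedding matrix small.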