This notebook builds a Bag-of-Words model that classifies Amazon customer reviews by rating and turns those ratings into sentiment labels.

Dataset: http://jmcauley.ucsd.edu/data/amazon/ - reviews_Video_Games_5.json
Reference: https://medium.com/@qempsil0914/machine-learning-nlp-text-classification-with-amazon-review-data-using-python3-step-by-step-3fb0cc0cecc1

This notebook performs the following steps:

1. Data conversion from json to csv
2. Data preprocessing and labeling
3. Splitting the dataset into training and test sets
4. Constructing a Bag-of-Words model

In [1]:
# 1. Data conversion
# The dataset is in JSON format (one JSON object per line), so it is converted to CSV to make it easier to work with.
# Reference: https://github.com/Amber0914/NLP-Text_Classification/blob/master/convert_json_to_csv.py

import json
import csv
import os

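srcPath = os.path.expanduser("~/Desktop/Bag-of-Words_model/reviews_Video_Games_5.json")

# Open both files in one with-statement so both get closed; the csv module expects newline=""
with open(srcPath, "r", encoding="utf-8") as fromJson, \
        open("reviews_Video_Games_5.csv", "w", encoding="utf-8", newline="") as toCsv:
    csvFile = None

    for line in fromJson:
        tempCsv = json.loads(line)
        if csvFile is None:
            # The keys of the first record become the header row
            csvFile = csv.DictWriter(toCsv, fieldnames=list(tempCsv), restval="", extrasaction="ignore")
            csvFile.writeheader()
        # Write values by column name so a record with a missing field stays aligned
        csvFile.writerow(tempCsv)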
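
To try the conversion step without downloading the dataset, the following is a minimal sketch on two made-up records held in memory. The sample records and the helper name `json_lines_to_csv` are assumptions for illustration and are not part of the dataset or the referenced script. The second record deliberately lacks one field.

```python
import csv
import io
import json

# Two made-up records, one JSON object per line; the second has no "reviewerName".
SAMPLE = (
    '{"reviewerName": "A", "reviewText": "Great game", "overall": 5.0}\n'
    '{"reviewText": "Did not work", "overall": 1.0}\n'
)

def json_lines_to_csv(source, target):
    """Hypothetical helper: copy one-JSON-object-per-line text into target as CSV."""
    writer = None
    for line in source:
        record = json.loads(line)
        if writer is None:
            # The first record defines the header; missing fields are written as "".
            writer = csv.DictWriter(target, fieldnames=list(record),
                                    restval="", extrasaction="ignore")
            writer.writeheader()
        writer.writerow(record)

out = io.StringIO(newline="")
json_lines_to_csv(io.StringIO(SAMPLE), out)

# Read the CSV text back to inspect the rows.
rows = list(csv.reader(io.StringIO(out.getvalue(), newline="")))
print(rows)
```

Because each value is written under its column name, the row for the second record keeps 'reviewText' and 'overall' in their own columns and leaves 'reviewerName' empty.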

In [16]:
# 2. Preprocessing and Labeling
# The goal is sentiment analysis rather than plain rating prediction, so each review gets one of two labels: 1 for positive and 2 for negative.
# Reviews rated above 3 are labelled 1 (positive); the remaining reviews are labelled 2 (negative).
import pandas as pd

convertedCsv = pd.read_csv("reviews_Video_Games_5.csv")
convertedCsv['overall'] = convertedCsv['overall'].astype(object)
convertedCsv['reviewText'] = convertedCsv['reviewText'].astype(object)

In [17]:
# The model only uses the review text and the rating, so keep only the 'reviewText' and 'overall' columns
amazonDS = {'rating': convertedCsv['overall'], 'review': convertedCsv['reviewText']}
amazonDS = pd.DataFrame(data = amazonDS)
amazonDS = amazonDS.dropna()

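# Label data: compare ratings as numbers, not strings (as strings, '3.0' > '3' is True)
ratings = pd.to_numeric(amazonDS['rating'], errors='coerce')
amazonDS['label'] = ratings.apply(lambda x: 1 if x > 3 else 2)
amazonDS = amazonDS.dropna()
print(amazonDS['label'])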

0         2
1         1
2         2
3         1
4         1
         ..
231775    1
231776    2
231777    2
231778    2
231779    2
Name: label, Length: 231736, dtype: int64
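
Rating thresholds are easiest to reason about when the ratings are compared as numbers. The following is a small sketch with made-up rating values (the helper name `label_rating` is hypothetical) that shows how a string comparison and a numeric comparison can disagree at the boundary value:

```python
def label_rating(rating):
    """Hypothetical helper: 1 (positive) if the rating is above 3, else 2 (negative)."""
    return 1 if float(rating) > 3 else 2

for raw in ["1.0", "3", "3.0", "4.0", "5.0"]:
    as_string = 1 if raw > "3" else 2   # lexicographic (character-by-character) comparison
    as_number = label_rating(raw)       # numeric comparison
    print(raw, as_string, as_number)
```

The two columns differ only for "3.0": as a string it sorts after "3" because it is longer with the same first character, while as a number it is not greater than 3.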


In [18]:
# 3. Test and Train dataset
from sklearn.model_selection import train_test_split

amazonReview = pd.DataFrame(amazonDS, columns=['review'])
amazonLabel = pd.DataFrame(amazonDS, columns=['label'])

# train_test_split returns the splits in this order: X_train, X_test, y_train, y_test
trainX, testX, trainY, testY = train_test_split(amazonReview, amazonLabel, random_state=42)
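
For each array passed in, scikit-learn's `train_test_split` returns the training part followed by the test part. The following is a minimal sketch on made-up data (the names `reviews` and `labels` are illustrative only) that shows the resulting order and that each review stays paired with its label:

```python
from sklearn.model_selection import train_test_split

# Made-up data: review i carries label i, so the pairing is easy to check.
reviews = [f"review {i}" for i in range(8)]
labels = list(range(8))

# Order of the returned values: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(reviews, labels, random_state=42)

print(len(X_train), len(X_test), len(y_train), len(y_test))
print(list(zip(X_train, y_train)))
```

With the default test size of 25%, the eight samples are split into six training and two test samples.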

In [21]:
# 4. Model Construction
# Bag-of-Words (BoW) model with CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

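# Treat every run of word characters as a token, including one-character words such as "I"
# (the default token pattern only keeps tokens of two or more characters)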
BOG_Model = CountVectorizer(token_pattern=r'\b\w+\b')
train_vec = BOG_Model.fit_transform(trainX['review'])
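
The effect of the token pattern can be seen on a single sentence. The following is a minimal sketch on a made-up sentence (the names `default_bow` and `custom_bow` are illustrative) that compares the default `CountVectorizer` settings with the pattern used here:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = ["I love this game"]

# Default pattern: only tokens of two or more word characters are kept.
default_bow = CountVectorizer()
# Custom pattern: one-character tokens such as "i" are kept as well.
custom_bow = CountVectorizer(token_pattern=r'\b\w+\b')

default_bow.fit(sentence)
custom_bow.fit(sentence)

# vocabulary_ maps each token to its column index; sorting the dict lists the tokens.
print(sorted(default_bow.vocabulary_))
print(sorted(custom_bow.vocabulary_))
```

Both vectorizers lowercase the text by default, so the only difference between the two vocabularies is the token "i".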


In [22]:
tempTest = ['I love this game. This is GOAT!']
test_vec = BOG_Model.transform(tempTest)
print(test_vec)

  (0, 67890)	1
  (0, 71494)	1
  (0, 81037)	1
  (0, 86304)	1
  (0, 96011)	1
  (0, 158440)	2
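
Each printed line has the form (row, column index) followed by a count, so the indices above are positions in the fitted vocabulary. The following is a minimal sketch of mapping such indices back to words through the `vocabulary_` attribute. It uses a made-up three-sentence corpus instead of the review data, so the indices it produces differ from the ones printed above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus standing in for the review text.
corpus = ["I love this game", "This is the GOAT", "Game over"]

vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
vectorizer.fit(corpus)

# vocabulary_ maps word -> column index; invert it to look words up by index.
index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}

# COO format exposes the column indices (.col) and counts (.data) directly.
vec = vectorizer.transform(['I love this game. This is GOAT!']).tocoo()
word_counts = {index_to_word[int(col)]: int(count)
               for col, count in zip(vec.col, vec.data)}
print(word_counts)
```

Read this way, the entry with a count of 2 in the output corresponds to "this", which appears twice in the sentence once the text is lowercased.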
