# Video Games NLP Exploratory Analysis

Data from
> Justifying recommendations using distantly-labeled reviews and fine-grained aspects
Jianmo Ni, Jiacheng Li, Julian McAuley
Empirical Methods in Natural Language Processing (EMNLP), 2019
[https://nijianmo.github.io/amazon/index.html#files](https://nijianmo.github.io/amazon/index.html#files)

## Purpose

This notebook reads in and explores text reviews of video games from Amazon.com. The reviews were collected between 1996 and 2018. Each review of a purchased video game is labeled with a star rating from 1 to 5 and includes the review text.

This analysis explores the data, looks for interesting features of the review text, and describes its basic properties.

In [1]:
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# read in data
vg = pd.read_json('../Amazon_Data/Video_Games_5.json.gz', lines=True, compression='gzip')
vg.head()

| | overall | verified | reviewTime | reviewerID | asin | reviewerName | reviewText | summary | unixReviewTime | vote | style | image |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | True | 10 17, 2015 | A1HP7NVNPFMA4N | 700026657 | Ambrosia075 | This game is a bit hard to get the hang of, bu... | but when you do it's great. | 1445040000 | | | |
| 1 | 4 | False | 07 27, 2015 | A1JGAP0185YJI6 | 700026657 | travis | I played it a while but it was alright. The st... | But in spite of that it was fun, I liked it | 1437955200 | | | |
| 2 | 3 | True | 02 23, 2015 | A1YJWEXHQBWK2B | 700026657 | Vincent G. Mezera | ok game. | Three Stars | 1424649600 | | | |
| 3 | 2 | True | 02 20, 2015 | A2204E1TH211HT | 700026657 | Grandma KR | found the game a bit too complicated, not what... | Two Stars | 1424390400 | | | |
| 4 | 5 | True | 12 25, 2014 | A2RF5B5H74JLPE | 700026657 | jon | great game, I love it and have played it since... | love this game | 1419465600 | | | |


In [3]:
# choose only select columns and clean up the datatypes and missing values
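vg = vg.loc[:, ['overall', 'reviewText']]
vg = vg.dropna(how='any')
# assign the converted column directly; setting it through .loc[:, col] may keep the old dtype in newer pandas
vg['overall'] = vg['overall'].astype('int16')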

In [4]:
# inspect the df info
vg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 497419 entries, 0 to 497576
Data columns (total 2 columns):
overall       497419 non-null int16
reviewText    497419 non-null object
dtypes: int16(1), object(1)
memory usage: 8.5+ MB


In [5]:
# check out the distribution of ratings
vg.overall.value_counts()

5    299623
4     93644
3     49140
1     30879
2     24133
Name: overall, dtype: int64

The classes are heavily imbalanced, with far more positive ratings than negative ones. Because this analysis does not seek to predict sentiment, the data will not be upsampled or downsampled at this stage.
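
To put a number on the imbalance, the counts can be turned into shares. The sketch below is a minimal illustration that re-enters the counts printed above rather than reloading the data. Grouping 4-5 stars as "positive" and 1-2 stars as "negative" is an assumption made here for illustration, not a definition from the dataset.

```python
import pandas as pd

# Rating counts copied from the value_counts() output above.
counts = pd.Series({5: 299623, 4: 93644, 3: 49140, 1: 30879, 2: 24133}, name="overall")

# Share of each star rating among all reviews.
shares = counts / counts.sum()

# Assumed grouping for illustration: 4-5 stars positive, 1-2 stars negative.
positive_share = shares.loc[[4, 5]].sum()
negative_share = shares.loc[[1, 2]].sum()

print(shares.round(3))
print(f"positive (4-5 stars): {positive_share:.1%}")
print(f"negative (1-2 stars): {negative_share:.1%}")
```

On the full DataFrame, `vg.overall.value_counts(normalize=True)` should give the same per-rating shares directly. Under the grouping above, roughly 79% of reviews are positive and roughly 11% are negative.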

The full set of review texts will be analyzed for interesting properties.
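
One simple starting point for describing text is review length. The sketch below is an illustration only: `toy` is a hypothetical three-row stand-in for `vg`, and its review strings are made up rather than taken from the dataset. Splitting on whitespace is a crude tokenizer chosen to keep the example dependency-free.

```python
import pandas as pd

# Hypothetical toy reviews standing in for vg; not rows from the dataset.
toy = pd.DataFrame({
    "overall": [5, 3, 1],
    "reviewText": [
        "Great game, I love it.",
        "ok game.",
        "Broke after two days. Would not buy again.",
    ],
})

# Character count per review.
toy["n_chars"] = toy["reviewText"].str.len()

# Whitespace-delimited word count per review (a crude tokenizer).
toy["n_words"] = toy["reviewText"].str.split().str.len()

# Summary statistics of review length, grouped by star rating.
print(toy.groupby("overall")["n_words"].describe())
```

The same two column assignments could be applied to `vg` itself, assuming the `overall` and `reviewText` columns kept in the cleaning step.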

### Text Exploration

In [7]:
# import stop words from spacy
from spacy.lang.en.stop_words import STOP_WORDS
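
A stop-word list like this is typically used to drop very common tokens before counting words. The sketch below shows one way that could look. To stay runnable without spaCy, it uses `stop_subset`, a small hand-picked subset of the words this notebook prints from `STOP_WORDS`. The helper `remove_stops` and the choice to keep "never" are hypothetical and are not the edits made in this notebook.

```python
# A hand-picked subset of the spaCy stop words printed in this notebook;
# the real STOP_WORDS set is much larger.
stop_subset = {"the", "this", "very", "really", "of", "and", "in", "for", "too", "never"}

# Hypothetical edit: keep "never" as a content word because it can flip sentiment.
custom_stops = stop_subset - {"never"}


def remove_stops(text, stops):
    """Lowercase, split on whitespace, strip edge punctuation, drop stop words."""
    tokens = (tok.strip(".,!?;:\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in stops]


filtered = remove_stops("This game never gets old, and the story is really good!", custom_stops)
print(filtered)
```

With the full spaCy list, `set(STOP_WORDS)` would take the place of `stop_subset`, and more tokens would be removed than in this reduced example.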

In [8]:
# generate and edit stop words
stops = list(STOP_WORDS)
print(stops)

['amongst', 'take', 'then', 'again', 'anyhow', 'much', 'out', 'first', 'make', 'after', 'move', 'front', 'over', 'thereby', 'throughout', "'s", 'being', 'through', 'three', 'below', 'in', '’ll', 'made', 'part', 'quite', 'really', 'becomes', 'this', "n't", '‘ve', 'for', 'regarding', 'the', 'themselves', 'my', 'while', 'within', 'hereafter', 'somewhere', 'seemed', 'too', 'they', 'would', 'serious', 'further', 'became', 'though', 'above', 'amount', 'eight', 'yet', 'hundred', 'get', 'done', 'almost', 'must', 'of', 'nothing', 'once', 'doing', '’s', 'do', 'us', 'could', 'does', 'because', 'such', 'twelve', "'m", 'two', 'n’t', 'afterwards', 'had', 'fifty', 'twenty', 'yours', 'her', 'using', 'seems', 'few', 'ours', 'those', 'me', 'put', 'enough', 'them', 'eleven', 'meanwhile', 'thus', 'else', 'via', 'when', 'whereafter', 'why', 'whenever', 'besides', 'among', 'five', 'that', 'by', 'whoever', 'someone', 'hereupon', 'whom', 'ten', 'never', 'who', 'off', 'several', 'very', '’m', 'and', 'between',