<a href="https://colab.research.google.com/github/chuktuk/Amazon_Customer_Data/blob/master/Colab_TF_Keras_VG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Video Games Analysis with TensorFlow and Keras

## Notes
- My tf_env contains the necessary packages and dependencies for this notebook

Data from:
> Justifying recommendations using distantly-labeled reviews and fined-grained aspects
> Jianmo Ni, Jiacheng Li, Julian McAuley
> Empirical Methods in Natural Language Processing (EMNLP), 2019

In [0]:
# import packages
import numpy as np
import pandas as pd
from keras.preprocessing.text import Tokenizer, one_hot, text_to_word_sequence
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM

Using TensorFlow backend.


In [0]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 8938033469856109088
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 14305127488740761068
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 13787518889324494464
physical_device_desc: "device: XLA_GPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 15956161332
locality {
  bus_id: 1
  links {
  }
}
incarnation: 10860756586248535120
physical_device_desc: "device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0"
]


In [0]:
from keras import backend as K
K.tensorflow_backend._get_available_gpus()

['/job:localhost/replica:0/task:0/device:GPU:0']

In [0]:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

Num GPUs Available:  1


In [0]:
# max_features = 20000
# # cut texts after this number of words (among top max_features most common words)
# maxlen = 80
# batch_size = 32

In [0]:
!wget http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Video_Games_5.json.gz

--2020-03-02 16:46:51--  http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Video_Games_5.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 154050105 (147M) [application/octet-stream]
Saving to: ‘Video_Games_5.json.gz’


2020-03-02 16:47:06 (10.6 MB/s) - ‘Video_Games_5.json.gz’ saved [154050105/154050105]



In [0]:
vg = pd.read_json('Video_Games_5.json.gz', lines=True, compression='gzip')

In [0]:
vg.head()

,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image
0,5,True,"10 17, 2015",A1HP7NVNPFMA4N,700026657,Ambrosia075,"This game is a bit hard to get the hang of, bu...",but when you do it's great.,1445040000,,,
1,4,False,"07 27, 2015",A1JGAP0185YJI6,700026657,travis,I played it a while but it was alright. The st...,"But in spite of that it was fun, I liked it",1437955200,,,
2,3,True,"02 23, 2015",A1YJWEXHQBWK2B,700026657,Vincent G. Mezera,ok game.,Three Stars,1424649600,,,
3,2,True,"02 20, 2015",A2204E1TH211HT,700026657,Grandma KR,"found the game a bit too complicated, not what...",Two Stars,1424390400,,,
4,5,True,"12 25, 2014",A2RF5B5H74JLPE,700026657,jon,"great game, I love it and have played it since...",love this game,1419465600,,,


In [0]:
vg = vg.loc[:,['overall', 'reviewText']]

In [0]:
vg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 497577 entries, 0 to 497576
Data columns (total 2 columns):
overall       497577 non-null int64
reviewText    497419 non-null object
dtypes: int64(1), object(1)
memory usage: 7.6+ MB


In [0]:
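# drop rows with missing values and downcast the rating column
# (plain column assignment replaces the column, so the int16 dtype sticks)
vg = vg.dropna(how='any').copy()
vg['overall'] = vg['overall'].astype('int16')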

In [0]:
vg.overall.value_counts()

5    299623
4     93644
3     49140
1     30879
2     24133
Name: overall, dtype: int64

In [0]:
# map star ratings to two classes for the pretrained model:
# 1-2 stars -> 0 (negative), 3-5 stars -> 1 (positive)
vg.loc[:,'pt_sentiment'] = vg.overall.map({1: 0, 2: 0, 3: 1, 4: 1, 5: 1}).astype('int16')

In [0]:
vg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 497419 entries, 0 to 497576
Data columns (total 3 columns):
overall         497419 non-null int16
reviewText      497419 non-null object
pt_sentiment    497419 non-null int16
dtypes: int16(2), object(1)
memory usage: 9.5+ MB


In [0]:
vg.pt_sentiment.value_counts()

1    442407
0     55012
Name: pt_sentiment, dtype: int64

In [0]:
# import resample
from sklearn.utils import resample

# down-sample to balance classes
vg_class1 = vg[vg.pt_sentiment == 1]
vg_class0 = vg[vg.pt_sentiment == 0]

# downsample majority class
vg_class1_down = resample(vg_class1, replace=False, n_samples=vg_class0.shape[0], random_state=42)

In [0]:
# concat the dfs back together
vg_down = pd.concat([vg_class1_down, vg_class0])
vg_down.pt_sentiment.value_counts()

1    55012
0    55012
Name: pt_sentiment, dtype: int64
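
For readers who want to see what the downsampling step does without pandas or scikit-learn, here is a minimal standard-library sketch. The helper name `downsample_majority` and the toy labels are hypothetical and not part of this notebook. The idea matches the `resample(..., replace=False)` call above: sample the larger class without replacement until it is the same size as the smaller one.

```python
import random
from collections import Counter

def downsample_majority(labels, seed=42):
    """Return sorted row indices with every class cut to the minority class size."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(rows) for rows in by_class.values())
    kept = []
    for rows in by_class.values():
        # sample without replacement, mirroring replace=False
        kept.extend(rng.sample(rows, n_min))
    return sorted(kept)

# Toy data: 8 positive labels and 2 negative labels (illustrative only)
toy_labels = [1] * 8 + [0] * 2
kept = downsample_majority(toy_labels)
balanced_counts = Counter(toy_labels[i] for i in kept)
print(balanced_counts)
```

Fixing the seed keeps the selected rows reproducible between runs, which is the role `random_state=42` plays in the cell above.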

In [0]:
X = vg_down.reviewText.values
y = vg_down.pt_sentiment.values

In [0]:
X[0]

'Excellent protection for your New Nintendo 3DS. You can buy a skin for your 3ds and put this baby on, to protect it.'

In [0]:
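# NOTE: document_count=2 seeds the tokenizer's document counter at 2, so the
# count reported after fit_on_texts is 2 higher than the number of texts fitted
t = Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True,
              split=' ', char_level=False, oov_token=None, document_count=2)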

In [0]:
type(t)

keras_preprocessing.text.Tokenizer

In [0]:
help(t)

Help on Tokenizer in module keras_preprocessing.text object:

class Tokenizer(builtins.object)
 |  Text tokenization utility class.
 |  
 |  This class allows to vectorize a text corpus, by turning each
 |  text into either a sequence of integers (each integer being the index
 |  of a token in a dictionary) or into a vector where the coefficient
 |  for each token could be binary, based on word count, based on tf-idf...
 |  
 |  # Arguments
 |      num_words: the maximum number of words to keep, based
 |          on word frequency. Only the most common `num_words-1` words will
 |          be kept.
 |      filters: a string where each element is a character that will be
 |          filtered from the texts. The default is all punctuation, plus
 |          tabs and line breaks, minus the `'` character.
 |      lower: boolean. Whether to convert the texts to lowercase.
 |      split: str. Separator for word splitting.
 |      char_level: if True, every character will be treated as a token.

In [0]:
# split the data into train/test

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42, stratify=y)

In [0]:
len(X_train)

88019

In [0]:
t.fit_on_texts(X_train)

In [0]:
# summarize what was learned
print('Number of training documents {}'.format(t.document_count))

Number of training documents 88021


In [0]:
# bag-of-words counts: one row per review, one column per word index
encoded_docs = t.texts_to_matrix(X_train, mode='count')
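
A count-mode matrix holds, for each document, how many times each word index occurs. The sketch below shows that layout in plain Python and estimates how much memory a dense version needs. The helper names are hypothetical, the 50,000-word vocabulary is a made-up figure for illustration, and the 8 bytes per value assumes a dense float64 array, which is my understanding of what `texts_to_matrix` returns. With `num_words=None` the matrix has one column per word seen in training, so it is worth checking `len(t.word_index)` before building it.

```python
def count_matrix(sequences, vocab_size):
    """Dense bag-of-words counts: one row per document, one column per word index."""
    n_cols = vocab_size + 1  # column 0 stays empty because word indices start at 1
    matrix = [[0] * n_cols for _ in sequences]
    for row, seq in zip(matrix, sequences):
        for idx in seq:
            row[idx] += 1
    return matrix

def dense_size_gb(n_docs, vocab_size, bytes_per_value=8):
    """Approximate memory in GB for a dense matrix of that shape."""
    return n_docs * (vocab_size + 1) * bytes_per_value / 1e9

# Toy integer sequences, as if encoded with a word index {great: 1, game: 2, fun: 3}
toy_sequences = [[1, 2, 1, 3], [1, 2]]
toy_matrix = count_matrix(toy_sequences, vocab_size=3)
print(toy_matrix)  # [[0, 2, 1, 1], [0, 1, 1, 0]]

# 88019 training reviews with an assumed 50,000-word vocabulary
estimate = dense_size_gb(88019, 50_000)
print(round(estimate, 1))
```

If the estimate exceeds available RAM, capping the vocabulary with `num_words` when constructing the `Tokenizer` is one way to shrink the matrix.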