## Siraj's 03 Challenge
#### This is a response to the Coding Challenge in: https://youtu.be/si8zZHkufRY
The challenge for this video is to train a model on this dataset of video game reviews from IGN.com. Then, given some new video game title it should be able to classify it. You can use pandas to parse this dataset. Right now each review has a label that's either Amazing, Great, Good, Mediocre, Painful, or Awful. These are the emotions. Using the existing labels is extra credit. The baseline is that you can just convert the labels so that there are only 2 emotions (positive or negative). Ideally you can use an RNN via TFLearn like the one in this example, but I'll accept other types of ML models as well. You'll learn how to parse data, select appropriate features, and use a neural net on an IRL problem.
### Sentiment Labels to be Predicted
- Great
- Good
- Okay
- Mediocre
- Amazing
- Bad
- Awful
- Painful
- Unbearable
- Masterpiece
### Accuracy Results
- Dummy Classifier (i.e. select the most frequent class): 0.25631 (25.6%)
- Multinomial Naive Bayes: 0.32355 (32.4%)
- RNN (using tflearn):

#### Import the required libraries and check the versions of Python and TensorFlow

In [16]:
import pandas as pd
import tensorflow as tf
import sys
import sklearn

print('Sklearn:', sklearn.__version__)
print('Python Version:', sys.version.split()[0])
print('TensorFlow Ver:', tf.__version__)

Sklearn: 0.18.1
Python Version: 3.5.2
TensorFlow Ver: 1.0.0


Read the number of RNN training epochs from the user

In [2]:
n_epoch = int(input('Enter no. of epochs for RNN training: '))

Enter no. of epochs for RNN training: 100
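
The `input()` prompt blocks when the notebook is run non-interactively (for example through a batch converter). As a minimal sketch of one alternative, the epoch count could be read from an environment variable with a fallback default. The helper name `read_n_epoch` and the variable name `IGN_N_EPOCH` are assumptions made for illustration, not part of the original notebook.

```python
import os


def read_n_epoch(default=100, env_var="IGN_N_EPOCH"):
    """Return the epoch count from an environment variable, else a default.

    Both the function name and the variable name are hypothetical.
    """
    raw = os.environ.get(env_var)
    if raw is None:
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 1:
        raise ValueError("number of epochs must be a positive integer")
    return value


n_epoch = read_n_epoch()
print("Training for", n_epoch, "epochs")
```

The default of 100 mirrors the value entered in the cell above.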


In [3]:
pd.set_option('display.max_colwidth', 1000)

In [4]:
original_ign = pd.read_csv('input/ign.csv')
original_ign.head(10)

| | Unnamed: 0 | score_phrase | title | url | platform | score | genre | editors_choice | release_year | release_month | release_day |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Amazing | LittleBigPlanet PS Vita | /games/littlebigplanet-vita/vita-98907 | PlayStation Vita | 9.0 | Platformer | Y | 2012 | 9 | 12 |
| 1 | 1 | Amazing | LittleBigPlanet PS Vita -- Marvel Super Hero Edition | /games/littlebigplanet-ps-vita-marvel-super-hero-edition/vita-20027059 | PlayStation Vita | 9.0 | Platformer | Y | 2012 | 9 | 12 |
| 2 | 2 | Great | Splice: Tree of Life | /games/splice/ipad-141070 | iPad | 8.5 | Puzzle | N | 2012 | 9 | 12 |
| 3 | 3 | Great | NHL 13 | /games/nhl-13/xbox-360-128182 | Xbox 360 | 8.5 | Sports | N | 2012 | 9 | 11 |
| 4 | 4 | Great | NHL 13 | /games/nhl-13/ps3-128181 | PlayStation 3 | 8.5 | Sports | N | 2012 | 9 | 11 |
| 5 | 5 | Good | Total War Battles: Shogun | /games/total-war-battles-shogun/mac-142565 | Macintosh | 7.0 | Strategy | N | 2012 | 9 | 11 |
| 6 | 6 | Awful | Double Dragon: Neon | /games/double-dragon-neon/xbox-360-131320 | Xbox 360 | 3.0 | Fighting | N | 2012 | 9 | 11 |
| 7 | 7 | Amazing | Guild Wars 2 | /games/guild-wars-2/pc-896298 | PC | 9.0 | RPG | Y | 2012 | 9 | 11 |
| 8 | 8 | Awful | Double Dragon: Neon | /games/double-dragon-neon/ps3-131321 | PlayStation 3 | 3.0 | Fighting | N | 2012 | 9 | 11 |
| 9 | 9 | Good | Total War Battles: Shogun | /games/total-war-battles-shogun/pc-142564 | PC | 7.0 | Strategy | N | 2012 | 9 | 11 |


#### Checking out the shape of the IGN Dataset

In [5]:
print('Shape of the Dataset:',original_ign.shape)

Shape of the Dataset: (18625, 11)


#### Print each unique score_phrase value and its count

In [6]:
original_ign.score_phrase.value_counts()

Great          4773
Good           4741
Okay           2945
Mediocre       1959
Amazing        1804
Bad            1269
Awful           664
Painful         340
Unbearable       72
Masterpiece      55
Disaster          3
Name: score_phrase, dtype: int64

### Data Preprocessing
Before training our model(s), we derive a binary sentiment label, handle missing values, and build the text feature from the remaining columns.
#### Convert score_phrase to 2 emotions (positive or negative)

In [7]:
bad_phrases = ['Bad', 'Awful', 'Painful', 'Unbearable', 'Disaster']
original_ign['sentiment'] = original_ign.score_phrase.isin(bad_phrases).map({True: 'Negative', False: 'Positive'})
original_ign.head()

| | Unnamed: 0 | score_phrase | title | url | platform | score | genre | editors_choice | release_year | release_month | release_day | sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Amazing | LittleBigPlanet PS Vita | /games/littlebigplanet-vita/vita-98907 | PlayStation Vita | 9.0 | Platformer | Y | 2012 | 9 | 12 | Positive |
| 1 | 1 | Amazing | LittleBigPlanet PS Vita -- Marvel Super Hero Edition | /games/littlebigplanet-ps-vita-marvel-super-hero-edition/vita-20027059 | PlayStation Vita | 9.0 | Platformer | Y | 2012 | 9 | 12 | Positive |
| 2 | 2 | Great | Splice: Tree of Life | /games/splice/ipad-141070 | iPad | 8.5 | Puzzle | N | 2012 | 9 | 12 | Positive |
| 3 | 3 | Great | NHL 13 | /games/nhl-13/xbox-360-128182 | Xbox 360 | 8.5 | Sports | N | 2012 | 9 | 11 | Positive |
| 4 | 4 | Great | NHL 13 | /games/nhl-13/ps3-128181 | PlayStation 3 | 8.5 | Sports | N | 2012 | 9 | 11 | Positive |
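
To make the label conversion concrete, here is a small sketch of the same `isin(...).map(...)` step on a toy frame. The `toy` DataFrame is made up for illustration; the `bad_phrases` list is the one used above. Note that under this rule "Okay" and "Mediocre" count as Positive, since only the five listed phrases are treated as Negative.

```python
import pandas as pd

# Toy frame standing in for original_ign (illustrative data only).
toy = pd.DataFrame(
    {"score_phrase": ["Amazing", "Okay", "Bad", "Disaster", "Mediocre"]}
)

bad_phrases = ["Bad", "Awful", "Painful", "Unbearable", "Disaster"]

# isin() yields a boolean Series; map() turns it into string labels.
toy["sentiment"] = (
    toy["score_phrase"].isin(bad_phrases).map({True: "Negative", False: "Positive"})
)
print(toy)
```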


### Proportion of Positive and Negative Sentiments

In [8]:
original_ign.sentiment.value_counts(normalize=True)

Positive    0.873933
Negative    0.126067
Name: sentiment, dtype: float64

#### Check for null elements

In [9]:
original_ign.isnull().sum()

Unnamed: 0         0
score_phrase       0
title              0
url                0
platform           0
score              0
genre             36
editors_choice     0
release_year       0
release_month      0
release_day        0
sentiment          0
dtype: int64

#### Replace the null elements (the 36 missing genre values) with empty strings

In [10]:
original_ign.fillna(value='', inplace=True)
original_ign.head()

| | Unnamed: 0 | score_phrase | title | url | platform | score | genre | editors_choice | release_year | release_month | release_day | sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Amazing | LittleBigPlanet PS Vita | /games/littlebigplanet-vita/vita-98907 | PlayStation Vita | 9.0 | Platformer | Y | 2012 | 9 | 12 | Positive |
| 1 | 1 | Amazing | LittleBigPlanet PS Vita -- Marvel Super Hero Edition | /games/littlebigplanet-ps-vita-marvel-super-hero-edition/vita-20027059 | PlayStation Vita | 9.0 | Platformer | Y | 2012 | 9 | 12 | Positive |
| 2 | 2 | Great | Splice: Tree of Life | /games/splice/ipad-141070 | iPad | 8.5 | Puzzle | N | 2012 | 9 | 12 | Positive |
| 3 | 3 | Great | NHL 13 | /games/nhl-13/xbox-360-128182 | Xbox 360 | 8.5 | Sports | N | 2012 | 9 | 11 | Positive |
| 4 | 4 | Great | NHL 13 | /games/nhl-13/ps3-128181 | PlayStation 3 | 8.5 | Sports | N | 2012 | 9 | 11 | Positive |
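
Calling `fillna('')` on the whole frame is harmless here because only `genre` contains nulls, but on a frame with missing numeric values it would also put empty strings into those columns. As a hedged alternative, the fill can be restricted to the one column that needs it. The `toy` frame below is illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing genre (illustrative data only).
toy = pd.DataFrame(
    {
        "title": ["Guild Wars 2", "NHL 13"],
        "genre": ["RPG", np.nan],
        "score": [9.0, 8.5],
    }
)

# Fill only the text column; numeric columns keep their dtype.
toy["genre"] = toy["genre"].fillna("")

print(toy.isnull().sum())
print(toy.dtypes)
```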


### Create a new DataFrame called ign

In [11]:
ign = original_ign[['sentiment', 'score_phrase', 'title', 'platform','genre', 'editors_choice', 'score']].copy()
ign.head()

| | sentiment | score_phrase | title | platform | genre | editors_choice | score |
|---|---|---|---|---|---|---|---|
| 0 | Positive | Amazing | LittleBigPlanet PS Vita | PlayStation Vita | Platformer | Y | 9.0 |
| 1 | Positive | Amazing | LittleBigPlanet PS Vita -- Marvel Super Hero Edition | PlayStation Vita | Platformer | Y | 9.0 |
| 2 | Positive | Great | Splice: Tree of Life | iPad | Puzzle | N | 8.5 |
| 3 | Positive | Great | NHL 13 | Xbox 360 | Sports | N | 8.5 |
| 4 | Positive | Great | NHL 13 | PlayStation 3 | Sports | N | 8.5 |


### Create a new column called is_editors_choice

In [12]:
ign['is_editors_choice'] = ign['editors_choice'].map({'Y': 'editors_choice', 'N': ''})
ign.head()

| | sentiment | score_phrase | title | platform | genre | editors_choice | score | is_editors_choice |
|---|---|---|---|---|---|---|---|---|
| 0 | Positive | Amazing | LittleBigPlanet PS Vita | PlayStation Vita | Platformer | Y | 9.0 | editors_choice |
| 1 | Positive | Amazing | LittleBigPlanet PS Vita -- Marvel Super Hero Edition | PlayStation Vita | Platformer | Y | 9.0 | editors_choice |
| 2 | Positive | Great | Splice: Tree of Life | iPad | Puzzle | N | 8.5 | |
| 3 | Positive | Great | NHL 13 | Xbox 360 | Sports | N | 8.5 | |
| 4 | Positive | Great | NHL 13 | PlayStation 3 | Sports | N | 8.5 | |


In [13]:
ign['text'] = ign['title'].str.cat(ign['platform'], sep=' ').str.cat(ign['genre'], sep=' ').str.cat(ign['is_editors_choice'], sep=' ')
X = ign.text
y = ign.score_phrase
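
The cell above joins four columns into one text feature by chaining `str.cat`. A minimal sketch of the same idea on toy rows, passing the remaining columns to `str.cat` in a single call. The `toy` frame and the trailing `str.strip()` are additions for illustration. Without `na_rep`, `str.cat` returns a missing value for any row where one of the inputs is missing, which is why the nulls in `genre` had to be filled first.

```python
import pandas as pd

# Toy rows in the same layout as the ign frame (illustrative data only).
toy = pd.DataFrame(
    {
        "title": ["NHL 13", "Guild Wars 2"],
        "platform": ["Xbox 360", "PC"],
        "genre": ["Sports", "RPG"],
        "is_editors_choice": ["", "editors_choice"],
    }
)

# str.cat accepts a DataFrame of additional columns to append.
others = toy[["platform", "genre", "is_editors_choice"]]
toy["text"] = toy["title"].str.cat(others, sep=" ").str.strip()

print(toy["text"].tolist())
```

The `str.strip()` call only removes the trailing space left when `is_editors_choice` is empty; the vectorizer would ignore that whitespace anyway.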

## Model #0: The Dummy Classifier (Always Choose the Most Frequent Class)
Import the required classes

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

In [17]:
vect = TfidfVectorizer(stop_words='english',
                       token_pattern=r'\b\w{2,}\b')
dummy = DummyClassifier(strategy='most_frequent', random_state=0)

dummy_pipeline = make_pipeline(vect, dummy)

dummy_pipeline.named_steps

{'dummyclassifier': DummyClassifier(constant=None, random_state=0, strategy='most_frequent'),
 'tfidfvectorizer': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=1.0, max_features=None, min_df=1,
         ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
         stop_words='english', strip_accents=None, sublinear_tf=False,
         token_pattern='\\b\\w{2,}\\b', tokenizer=None, use_idf=True,
         vocabulary=None)}
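
The `token_pattern` argument keeps only tokens of two or more word characters. The sketch below applies the same regular expression with Python's `re` module to show which tokens survive. It is only an approximation of the vectorizer's behaviour: `TfidfVectorizer` also removes English stop words, which this helper (a hypothetical `tokenize` function) does not.

```python
import re

# Same pattern as passed to TfidfVectorizer above.
TOKEN_PATTERN = r"\b\w{2,}\b"


def tokenize(text):
    """Lowercase the text and keep tokens of at least two word characters."""
    return re.findall(TOKEN_PATTERN, text.lower())


print(tokenize("Guild Wars 2 PC RPG editors_choice"))
print(tokenize("NHL 13 Xbox 360"))
```

The single-character token `2` is dropped, while `editors_choice` stays one token because `\w` matches underscores.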

### Cross Validation

In [19]:
# Cross Validation
cv = cross_val_score(dummy_pipeline, X, y, scoring='accuracy', cv=10, n_jobs=-1)
print('\nDummy Classifier\'s Accuracy: %0.5f\n' % cv.mean())




Dummy Classifier's Accuracy: 0.25627
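
As a sanity check, a most-frequent-class baseline should score roughly the share of the largest class. The sketch below recomputes that share from the `value_counts()` output shown earlier; the dictionary simply restates those counts.

```python
# Class counts copied from the score_phrase value_counts() output above.
counts = {
    "Great": 4773, "Good": 4741, "Okay": 2945, "Mediocre": 1959,
    "Amazing": 1804, "Bad": 1269, "Awful": 664, "Painful": 340,
    "Unbearable": 72, "Masterpiece": 55, "Disaster": 3,
}

total = sum(counts.values())
most_frequent = max(counts, key=counts.get)
baseline = counts[most_frequent] / total

print("Rows:", total)
print("Most frequent class:", most_frequent)
print("Expected baseline accuracy: %0.5f" % baseline)
```

The total of 18625 matches the dataset shape, and the share of "Great" agrees with the cross-validated accuracy reported above.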



## Model #1: MultinomialNB Classifier

In [21]:
from sklearn.naive_bayes import MultinomialNB
vect = TfidfVectorizer(stop_words='english',
                       token_pattern=r'\b\w{2,}\b',
                       min_df=1, max_df=0.1,
                       ngram_range=(1,2))

mnb = MultinomialNB(alpha=2)
mnb_pipeline = make_pipeline(vect, mnb)

mnb_pipeline.named_steps

{'multinomialnb': MultinomialNB(alpha=2, class_prior=None, fit_prior=True),
 'tfidfvectorizer': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=0.1, max_features=None, min_df=1,
         ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
         stop_words='english', strip_accents=None, sublinear_tf=False,
         token_pattern='\\b\\w{2,}\\b', tokenizer=None, use_idf=True,
         vocabulary=None)}

In [23]:
# Cross Validation
cv = cross_val_score(mnb_pipeline, X, y, scoring='accuracy', cv=10, n_jobs=-1)
print('\nMultinomialNB Classifier\'s Accuracy: %0.5f\n' % cv.mean())




MultinomialNB Classifier's Accuracy: 0.32350
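
Once cross-validation has given an accuracy estimate, the pipeline can be fitted and used to classify a new title. The following is a small sketch under stated assumptions: it trains on eight rows taken from the `head(10)` output rather than the full dataset, and it omits `max_df=0.1` because that threshold would prune every term in such a tiny corpus. The query string is made up, so the predicted label is not meaningful.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Text built as title + platform + genre + editors_choice flag,
# using rows from the head(10) output shown earlier.
texts = [
    "LittleBigPlanet PS Vita PlayStation Vita Platformer editors_choice",
    "Splice: Tree of Life iPad Puzzle",
    "NHL 13 Xbox 360 Sports",
    "NHL 13 PlayStation 3 Sports",
    "Total War Battles: Shogun Macintosh Strategy",
    "Double Dragon: Neon Xbox 360 Fighting",
    "Guild Wars 2 PC RPG editors_choice",
    "Double Dragon: Neon PlayStation 3 Fighting",
]
labels = ["Amazing", "Great", "Great", "Great", "Good", "Awful", "Amazing", "Awful"]

# Same settings as mnb_pipeline, minus max_df (see note above).
toy_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english", token_pattern=r"\b\w{2,}\b", ngram_range=(1, 2)),
    MultinomialNB(alpha=2.0),
)
toy_pipeline.fit(texts, labels)

# Hypothetical unseen title in the same text layout.
query = ["NHL 14 PlayStation 3 Sports"]
predicted = toy_pipeline.predict(query)
probabilities = toy_pipeline.predict_proba(query)

print("Predicted label:", predicted[0])
print(dict(zip(toy_pipeline.classes_, probabilities[0].round(3))))
```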



## Model #2: RNN Classifier using TFLearn
#### Import the required libraries

In [24]:
import tflearn
from tflearn.data_utils import to_categorical, pad_sequences
from tflearn.datasets import imdb