Sentiment Analysis (Google's word2vec package for sentiment analysis)

https://www.kaggle.com/c/word2vec-nlp-tutorial/data


The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of each review is binary: an IMDB rating < 5 gives a sentiment score of 0, and a rating >= 7 gives a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, another 50,000 IMDB reviews are provided without any rating labels.
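The labeling rule above can be written as a small function. This is only an illustrative sketch: `rating_to_sentiment` and `rating_from_id` are hypothetical helpers, and the second one assumes (it is not stated above) that the part of each `id` after the underscore is the IMDB rating, which is consistent with ids such as "5814_8" in the training file.

```python
from typing import Optional


def rating_to_sentiment(rating: int) -> Optional[int]:
    """Map an IMDB rating (1-10) to the binary label described above."""
    if rating < 5:
        return 0
    if rating >= 7:
        return 1
    return None  # ratings 5 and 6 are covered by neither rule


def rating_from_id(review_id: str) -> int:
    # Assumption: ids look like "5814_8", with the rating after the underscore.
    return int(review_id.strip('"').rsplit("_", 1)[1])


for rid in ['"5814_8"', '"7759_3"']:
    print(rid, rating_to_sentiment(rating_from_id(rid)))
```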

In [1]:
import tensorflow as tf
from tensorflow import keras
import pandas as pd
import numpy as np
tf.__version__

'2.2.0'

In [0]:
dataset = pd.read_csv("https://raw.githubusercontent.com/atulpatelDS/Data_Files/master/Bag_of_Words/word2vec_nlp/labeledTrainData.tsv.zip",
                      header=0,delimiter="\t",quoting=3)
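In the call above, `quoting=3` is the numeric value of `csv.QUOTE_NONE`, which tells pandas to treat double quotes as ordinary characters rather than field delimiters. The sketch below uses a made-up two-row sample (not taken from the dataset) to show the effect: the quotes stay inside the parsed values.

```python
import csv
import io

import pandas as pd

# Made-up sample in the same tab-separated layout as labeledTrainData.tsv.
sample = 'id\tsentiment\treview\n"1_8"\t1\t"A "great" film"\n"2_3"\t0\t"Dull."\n'

df = pd.read_csv(io.StringIO(sample), header=0, delimiter="\t",
                 quoting=csv.QUOTE_NONE)

print(csv.QUOTE_NONE)        # the constant behind quoting=3
print(df.loc[0, "review"])   # quotes are kept as part of the text
```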

In [3]:
dataset.shape

(25000, 3)

In [4]:
dataset.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [0]:
## Extract the review and sentiment
sentiment = dataset["sentiment"]
review = dataset["review"]

In [6]:
type(sentiment),type(review)

(pandas.core.series.Series, pandas.core.series.Series)

In [0]:
## Convert both to lists, because the Keras Tokenizer takes a list of texts as input
sentiment = sentiment.tolist()
review = review.tolist()

In [8]:
type(sentiment),type(review)

(list, list)

In [9]:
## Let's display one of the reviews
review[3]

'"It must be assumed that those who praised this film (\\"the greatest filmed opera ever,\\" didn\'t I read somewhere?) either don\'t care for opera, don\'t care for Wagner, or don\'t care about anything except their desire to appear Cultured. Either as a representation of Wagner\'s swan-song, or as a movie, this strikes me as an unmitigated disaster, with a leaden reading of the score matched to a tricksy, lugubrious realisation of the text.<br /><br />It\'s questionable that people with ideas as to what an opera (or, for that matter, a play, especially one by Shakespeare) is \\"about\\" should be allowed anywhere near a theatre or film studio; Syberberg, very fashionably, but without the smallest justification from Wagner\'s text, decided that Parsifal is \\"about\\" bisexual integration, so that the title character, in the latter stages, transmutes into a kind of beatnik babe, though one who continues to sing high tenor -- few if any of the actors in the film are the singers, and we

In [0]:
## Now convert the text into numbers using the Keras Tokenizer (the NLTK library is an alternative)
tokenizer = keras.preprocessing.text.Tokenizer(num_words=6000)
## Fit the tokenizer on all the reviews
tokenizer.fit_on_texts(review)

In [11]:
## Display the total number of unique words in the index
len(tokenizer.index_word) 

88582
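The count above covers every distinct word seen during fitting; `num_words` only limits which words are used later when texts are converted. The pure-Python sketch below is a simplified stand-in, not Keras code: it lowercases and splits on whitespace only (the real tokenizer also strips punctuation), and `build_word_index` is a hypothetical helper.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Ties are broken alphabetically here, which is a choice of this sketch.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


texts = ["good movie", "bad movie", "good good plot"]
word_index = build_word_index(texts)

num_words = 3  # plays the role of num_words=6000 in the notebook
kept = {w: i for w, i in word_index.items() if i < num_words}

print(len(word_index))  # size of the full vocabulary
print(kept)             # only the most frequent words survive the cut
```

Index 0 is reserved, so a cut of `num_words` keeps the `num_words - 1` most frequent words.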

In [12]:
## Convert the text into numbers using TF-IDF
review_feature = tokenizer.texts_to_matrix(review,mode="tfidf")
review_feature.shape
## We get only 6000 columns out of 88582 words because we used num_words=6000 during tokenization

(25000, 6000)
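`mode="tfidf"` weights each word count by how rare the word is across reviews. The NumPy sketch below shows one smoothed TF-IDF variant. The exact formula, `1 + log(count)` times `log(1 + N / (1 + df))`, is assumed here to match what Keras computes and should be checked against the library source; `tfidf_matrix` is a hypothetical helper.

```python
import numpy as np


def tfidf_matrix(sequences, num_words):
    """sequences: list of lists of word indices (1-based, index 0 unused)."""
    n_docs = len(sequences)
    counts = np.zeros((n_docs, num_words))
    for row, seq in enumerate(sequences):
        for idx in seq:
            if idx < num_words:        # drop words beyond the vocabulary cut
                counts[row, idx] += 1
    doc_freq = (counts > 0).sum(axis=0)          # reviews containing each word
    idf = np.log(1 + n_docs / (1 + doc_freq))    # rarer words get a larger weight
    tf = np.zeros_like(counts)
    present = counts > 0
    tf[present] = 1 + np.log(counts[present])    # dampened term frequency
    return tf * idf


seqs = [[1, 2], [3, 2], [1, 1, 4]]   # toy word indices, one list per review
features = tfidf_matrix(seqs, num_words=5)
print(features.shape)
print(features.round(3))
```

As in the notebook output, column 0 is always zero because no word is assigned index 0.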

In [13]:
type(review_feature)

numpy.ndarray

In [14]:
review_feature[0:1,:]

array([[0.        , 2.75042893, 2.34574918, ..., 0.        , 0.        ,
        0.        ]])

In [0]:
## We also need to convert the sentiment into a NumPy array
sentiment = np.array(sentiment)

In [16]:
type(sentiment)

numpy.ndarray

In [0]:
## Let's build the model using a fully connected (dense) neural network
model = keras.models.Sequential()
## Normalize the dataset
model.add(keras.layers.BatchNormalization(input_shape=(6000,)))

## Add dense layers
model.add(keras.layers.Dense(1000,activation="relu"))
model.add(keras.layers.Dropout(0.8))
model.add(keras.layers.Dense(500,activation="relu"))
model.add(keras.layers.Dropout(0.7))
model.add(keras.layers.Dense(300,activation="relu"))
model.add(keras.layers.Dropout(0.6))
model.add(keras.layers.Dense(100,activation="relu"))
model.add(keras.layers.Dropout(0.6))


In [0]:
## Add output layer
model.add(keras.layers.Dense(1,activation="sigmoid"))  ## Because we need only 1 output
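To get a feel for the size of this network, the parameter count can be worked out by hand: a Dense layer has `inputs * units + units` weights, and BatchNormalization keeps four values per feature (two trainable, two moving statistics). The sketch below only does that arithmetic; `model.summary()` would report the authoritative numbers.

```python
layer_sizes = [6000, 1000, 500, 300, 100, 1]  # input width, then Dense units


def dense_params(n_in, n_out):
    return n_in * n_out + n_out   # weight matrix plus bias vector


dense_total = sum(dense_params(a, b)
                  for a, b in zip(layer_sizes, layer_sizes[1:]))
batchnorm_total = 4 * layer_sizes[0]  # gamma, beta, moving mean, moving variance

total = dense_total + batchnorm_total
print(dense_total, batchnorm_total, total)
```

The first Dense layer dominates the count, so the model has far more weights than the 25,000 labeled reviews.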


In [0]:
## Compile the model
model.compile(optimizer="adam",loss="binary_crossentropy",metrics=["accuracy"])
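`binary_crossentropy` scores the sigmoid output `p` against the 0/1 label `y` as `-(y*log(p) + (1-y)*log(1-p))`, averaged over the batch. A minimal NumPy sketch of that formula follows, with made-up predictions; the clipping constant `eps` is an assumption used to avoid `log(0)`.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1 - eps)   # keep log() finite
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))


y_true = np.array([1, 0, 1, 0])
confident = np.array([0.9, 0.1, 0.8, 0.2])   # mostly right, low loss
unsure = np.array([0.5, 0.5, 0.5, 0.5])      # loss equals ln(2)

print(binary_crossentropy(y_true, confident))
print(binary_crossentropy(y_true, unsure))
```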


In [21]:
## Train the model
model.fit(x=review_feature,y=sentiment,validation_split=0.2,batch_size=32,epochs=30)

Epoch 1/30
...
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x7f0756bad128>
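With `validation_split=0.2` in the `model.fit` call above, Keras holds out the last 20% of the rows (taken before any shuffling) for validation and trains on the rest. The arithmetic below reproduces that split for the 25,000 reviews; the index array is a stand-in, not the real features.

```python
import math

import numpy as np

n_samples, validation_split, batch_size = 25000, 0.2, 32

n_val = round(n_samples * validation_split)
split_at = n_samples - n_val

indices = np.arange(n_samples)           # stand-in for the rows of review_feature
train_idx, val_idx = indices[:split_at], indices[split_at:]

steps_per_epoch = math.ceil(len(train_idx) / batch_size)
print(len(train_idx), len(val_idx), steps_per_epoch)
```

Because the held-out rows are simply the tail of the data, a file sorted by label would need shuffling before calling `fit`.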

The accuracy above shows overfitting, so we need to tune our parameters. The accuracy is still not good, so we will try some other methods to improve it.
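One way to make the overfitting claim concrete is to compare the training and validation curves stored in the `History` object returned by `model.fit`. The sketch below assumes the usual TF 2.x keys `accuracy` and `val_accuracy`; `overfitting_report` is a hypothetical helper, and the numbers are invented for illustration, not results from this run.

```python
def overfitting_report(history):
    """history: dict shaped like History.history (one value per epoch)."""
    train_acc = history["accuracy"]
    val_acc = history["val_accuracy"]
    best_epoch = max(range(len(val_acc)), key=lambda i: val_acc[i])
    return {
        "best_val_epoch": best_epoch + 1,           # 1-based, as in the Keras log
        "final_gap": train_acc[-1] - val_acc[-1],   # a large gap suggests overfitting
    }


# Invented values, NOT taken from the training run above.
fake_history = {
    "accuracy":     [0.70, 0.85, 0.93, 0.97, 0.99],
    "val_accuracy": [0.72, 0.84, 0.86, 0.85, 0.84],
}
print(overfitting_report(fake_history))
```

On a real run the same dict is available as `history.history` if the return value of `model.fit` is stored in a variable.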