## Sentiment Analysis using TFIDF and LSTM

In [1]:
import tensorflow as tf
from tensorflow import keras
import pandas as pd
import numpy as np
tf.__version__

'2.2.0'

In [0]:
## Import dataset
dataset = pd.read_csv("https://raw.githubusercontent.com/atulpatelDS/Data_Files/master/Bag_of_Words/word2vec_nlp/labeledTrainData.tsv.zip",
                      header=0,delimiter="\t",quoting=3)

In [3]:
dataset.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [0]:
reviews = dataset["review"].tolist()
sentiment = np.array(dataset["sentiment"].tolist())

In [5]:
len(reviews),len(sentiment)

(25000, 25000)

In [6]:
reviews[0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

In [0]:
## Convert text to numbers using Keras
tokenizer = keras.preprocessing.text.Tokenizer(num_words=5000)
tokenizer.fit_on_texts(reviews)

In [8]:
len(tokenizer.word_index)  ## total number of unique words; with num_words=5000 only the most frequent ones are used for TF-IDF

88582
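To make the vocabulary cut-off concrete, here is a minimal sketch in plain Python of how a frequency-ranked word index can be built and then limited. The toy corpus, the alphabetical tie-break and the names `ranked` and `kept` are assumptions for illustration only. The rule that only indices strictly below `num_words` are kept follows the Keras documentation, which says the most common `num_words-1` words are used.

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from the IMDB data).
docs = ["the movie was good", "the movie was bad", "the plot was thin"]

# Count how often each word occurs across all documents.
counts = Counter(word for doc in docs for word in doc.split())

# Rank by descending frequency; ties are broken alphabetically here (an assumption).
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Index 0 is reserved, so the most frequent word gets index 1.
word_index = {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

num_words = 4
# Keep only indices strictly below num_words, i.e. num_words - 1 words.
kept = {word: i for word, i in word_index.items() if i < num_words}

print(len(word_index))  # size of the full vocabulary
print(kept)             # the words that survive the cut-off
```

The full `word_index` still contains every word, which is why `len(tokenizer.word_index)` reports the complete vocabulary size even though only the top words are used later.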

In [0]:
## convert text to TFIDF
input_data = tokenizer.texts_to_matrix(reviews,mode="tfidf")

In [10]:
input_data.shape

(25000, 5000)

In [11]:
input_data[0]

array([0.        , 2.75042893, 2.34574918, ..., 0.        , 0.        ,
       0.        ])
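The values above are TF-IDF weights, one per word index. As a rough sketch of where such numbers come from, the NumPy snippet below applies a smoothed log weighting to a toy count matrix. The formula used here (`tf = 1 + log(count)`, `idf = log(1 + N / (1 + df))`) is an assumption about the weighting scheme; check the Keras source before relying on it matching `texts_to_matrix` exactly. The count matrix is made up.

```python
import numpy as np

# Toy term counts (hypothetical): rows = documents, columns = word indices.
# Column 0 stays empty, mirroring the reserved index 0.
counts = np.array([
    [0, 2, 1, 0],
    [0, 1, 0, 3],
    [0, 1, 0, 0],
], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each word

# Assumed weighting: tf = 1 + log(count) for words that occur, 0 otherwise.
tf = np.zeros_like(counts)
present = counts > 0
tf[present] = 1.0 + np.log(counts[present])

# Assumed smoothed inverse document frequency.
idf = np.log(1.0 + n_docs / (1.0 + doc_freq))

tfidf = tf * idf
print(tfidf.round(3))
```

Word 1 occurs in every document, so its IDF is lower than that of word 2, which occurs in only one. Words absent from a document get exactly 0, which is why most of the 5000 entries per review are zero.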

In [0]:
## Build the LSTM model
## Initialize the model
model = keras.models.Sequential()
## Reshape each review from (5000,) to (5000, 1): the LSTM expects (time steps, features) per sample
model.add(keras.layers.Reshape((5000,1),input_shape=(5000,))) ## input shape: one review (row) of 5000 TF-IDF values
## Normalize the input data
model.add(keras.layers.BatchNormalization())
## Add the LSTM layer
model.add(keras.layers.LSTM(128)) ## memory size = 128 for both the cell state and the hidden state
## Number of time steps: the LSTM walks through 5000 word slots, each represented by a single number, so there are 5000 time steps.
## We only need the final output. The target is 0 or 1, so the output layer uses a sigmoid to produce a value between 0 and 1.

In [0]:
## Output layer
model.add(keras.layers.Dense(1,activation="sigmoid"))


In [0]:
## compile the model
model.compile(optimizer="adam",loss="binary_crossentropy",metrics=["accuracy"])
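To see what the sigmoid output and the `binary_crossentropy` loss compute, here is a small NumPy sketch. The logits and labels are made-up numbers, and the helper functions are illustrative stand-ins rather than the Keras implementations. The 0.5 threshold for accuracy is the usual default for binary accuracy.

```python
import numpy as np

def sigmoid(z):
    # Squash any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # Clip to avoid log(0), then average the per-sample losses.
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(losses))

logits = np.array([2.0, -1.0, 0.0])  # hypothetical pre-activation outputs
labels = np.array([1.0, 0.0, 1.0])   # hypothetical sentiment labels

probs = sigmoid(logits)
loss = binary_crossentropy(labels, probs)

# A prediction counts as positive when the probability exceeds 0.5.
accuracy = float(np.mean((probs > 0.5) == (labels == 1.0)))

print(probs.round(4), round(loss, 4), round(accuracy, 4))
```

The third sample has a probability of exactly 0.5, so it is classified as negative and counted as an error.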

In [15]:
## Train the model
model.fit(x=input_data,y=sentiment,validation_split=0.2,epochs=5,batch_size=128)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f93290a28d0>

The LSTM is fed all 5000 TF-IDF values of each review one at a time, and it emits its output (h_t) only after the last value of the review. RNNs are designed for sequence data, but TF-IDF does not produce a sequence: the values are ordered by word index, not by the order in which the words appear in a particular document (sentence). With 5000 time steps per review, training is also very slow.
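The loss of word order is easy to demonstrate with a toy example. The sketch below uses a hypothetical four-word vocabulary and two helper functions written only for this illustration. It shows that a bag-of-words vector (the basis of TF-IDF) is identical for two sentences with opposite meaning, while an index sequence is not.

```python
import numpy as np

# Hypothetical toy vocabulary; index 0 is reserved.
vocab = {"not": 1, "good": 2, "bad": 3, "but": 4}

def bag_of_words(text, vocab):
    # One slot per word index; the position in the sentence is discarded.
    vec = np.zeros(len(vocab) + 1)
    for word in text.split():
        if word in vocab:
            vec[vocab[word]] += 1
    return vec

def to_sequence(text, vocab):
    # Word indices in the order in which the words appear.
    return [vocab[word] for word in text.split() if word in vocab]

a = "not good but bad"
b = "not bad but good"

print(np.array_equal(bag_of_words(a, vocab), bag_of_words(b, vocab)))  # True
print(to_sequence(a, vocab), to_sequence(b, vocab))
```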

## Sentiment Analysis using Pretrained word2vec and LSTM

In [16]:
dataset.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [0]:
## We reuse the tokenizer fitted above.
## Instead of converting text to TF-IDF, we convert each review to a sequence of word indices.
reviews_seq = tokenizer.texts_to_sequences(reviews)
## This keeps track of the word order within each review

In [0]:
#reviews[0]

In [0]:
#reviews_seq[0] 

In [0]:
#tokenizer.word_index

In [21]:
## Let's check the length of the first two reviews
len(reviews_seq[0]),len(reviews_seq[1])

(403, 148)

As we can see, the reviews do not all have the same length. The model needs inputs of equal length, so we pad the short reviews and truncate the long ones to a fixed length.

In [0]:
reviews_seq = keras.preprocessing.sequence.pad_sequences(reviews_seq,
                                                         maxlen=300,  ## max review length; tune it for your dataset
                                                         padding="pre")

In [23]:
len(reviews_seq[0]),len(reviews_seq[1])

(300, 300)
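Both reviews now have length 300, although the first one originally had 403 tokens. With the default settings, `pad_sequences` truncates from the front as well, so only the last 300 tokens of a long review are kept. A minimal NumPy sketch of this behaviour for a single sequence follows; `pad_pre` is a hypothetical helper, not part of Keras.

```python
import numpy as np

def pad_pre(seq, maxlen, value=0):
    """Left-pad short sequences; keep only the last maxlen items of long ones."""
    seq = list(seq)[-maxlen:]                 # "pre" truncation drops the start
    padding = [value] * (maxlen - len(seq))   # "pre" padding goes in front
    return np.array(padding + seq)

print(pad_pre([5, 8, 2], maxlen=5))              # [0 0 5 8 2]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3 4 5 6 7]
```

Padding in front keeps the real words next to the end of the sequence, which is where the LSTM produces the output used for classification.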

In [24]:
reviews_seq[1]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

In [25]:
reviews_seq[0]

array([2006, 1156,   18,    4,  261,   11,    6,   29,   41,  485, 1878,
         35,  891,   22, 2588,   37,    8,  550,   92,   22,   23,  167,
          5,  780,   11,    2,  166,    9,  354,   46,  200,  680,   32,
         15,    5,    1,  228,    4,   11,   17,   18,    2,   88,    4,
         24,  448,   59,  132,   12,   26,   90,    9,   15,    1,  448,
         60,   45,  280,    6,   63,  324,    4,   87,    7,    7,    1,
        776,  788,   19,  224,   51,    9,  414,  514,    6,   61,   20,
         15,  888,  231,   39,   35,    1, 3537, 1670,  717,    2,  911,
          6, 1075,   14,    3,   29,  972, 1389, 1631,  135,   26,  490,
        348,   35,   75,    6,  721,   69,   85,   24, 2454,  911,  106,
         12,   26,  470,   81,    5,  121,    9,    6,   26,   34,    6,
       1664,  520,   35,   10,  276,   26,   40, 4138,  225,    7,    7,
        772,    4,  643,  180,    8,   11,   37, 1583,   80,    3,  516,
          2,    3, 2353,    2,    1,  223, 2119, 27

In [0]:
## Let's use a pretrained word2vec model
## See this notebook for details on how the word2vec model was trained:
## https://github.com/atulpatelDS/NLP/blob/master/Word2vec.ipynb
from gensim.models import Word2Vec

In [27]:
word2vec_model=Word2Vec.load("sample_data/word2vec-movie-IMDB")


In [28]:
## Let's check the shape of the loaded model
word2vec_model.wv.vectors.shape

(28322, 50)

In TF-IDF each word is represented by 1 number; in word2vec each word is represented by 50 numbers.
We need to build an array holding the word2vec embedding of each of the 5000 words kept by the tokenizer.

In [29]:
embedding_vector_length = word2vec_model.wv.vectors.shape[1]
embedding_vector_length

50

In [30]:
max_words = 5000
## +1 because index 0 is reserved for padding
embedding_matrix = np.zeros((max_words+1,embedding_vector_length))
embedding_matrix.shape ## matrix initialized with zeros

(5001, 50)

In [0]:
#tokenizer.word_index.items()
#word2vec_model.wv.vocab
#word2vec_model.wv["hard"]

In [0]:
## Let's fill these zero rows with the actual word vectors
for word,i in sorted(tokenizer.word_index.items(),key=lambda x:x[1]):
  if i > max_words:  ## ignore all words with an index greater than 5000
    break
  if word in word2vec_model.wv.vocab:  ## gensim 3.x API; on gensim >= 4 use word2vec_model.wv.key_to_index
    embedding_vector=word2vec_model.wv[word]
    embedding_matrix[i] = embedding_vector

In [33]:
#embedding_matrix[1]
embedding_matrix[0] ## all zeros: index 0 is the padding value and has no word2vec embedding in the pretrained model

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
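Row 0 is not the only row that can stay at zero: any word in the top 5000 that is missing from the pretrained vocabulary keeps its zero vector too. The toy sketch below mimics the fill loop with a plain dictionary standing in for the word2vec model, so that the coverage can be measured. All names and vectors here are hypothetical.

```python
import numpy as np

# Hypothetical tokenizer index and stand-in "pretrained" vectors.
word_index = {"the": 1, "film": 2, "zzz": 3, "rare": 4}
pretrained = {
    "the": np.ones(3),
    "film": np.full(3, 2.0),
    "rare": np.full(3, 9.0),
}

max_words, dim = 3, 3
matrix = np.zeros((max_words + 1, dim))  # row 0 is reserved for padding

for word, i in word_index.items():
    if i > max_words:
        continue                 # outside the kept vocabulary
    vec = pretrained.get(word)
    if vec is not None:
        matrix[i] = vec          # words without a vector keep their zero row

# Fraction of kept words (rows 1..max_words) that received a vector.
coverage = np.any(matrix[1:] != 0, axis=1).mean()
print(matrix)
print(coverage)
```

A low coverage value would suggest that the tokenizer and the word2vec model were built with different preprocessing.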

In [0]:
## Now we have word2vec vectors for the 5000 words
## Let's build the model
model_wv = keras.models.Sequential()

The Embedding layer is used to create word vectors for incoming words. It sits between the input and the LSTM layer, i.e. the output of the Embedding layer is the input to the LSTM layer.

The weights for the Embedding layer can either be initialized with random values or, more commonly, with third-party word embeddings such as word2vec, GloVe or fastText (or others). These weights can optionally be fine-tuned during training. Using third-party embeddings to build word vectors is a form of transfer learning, since you transfer the semantic information between words that was learned during the embedding process.
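Conceptually, a frozen Embedding layer is just a table lookup: each word index selects one row of the weight matrix. The NumPy sketch below illustrates this with toy sizes and random weights (all assumed for illustration). It is not the Keras implementation, only the same indexing idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, much smaller than the notebook's (5001, 50) matrix.
vocab_size, embedding_dim = 6, 4
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))
embedding_matrix[0] = 0.0  # row 0 = padding

# Two padded "reviews" of length 5, given as word indices.
batch = np.array([[0, 0, 3, 1, 5],
                  [2, 4, 4, 1, 3]])          # (batch_size, seq_len)

# Fancy indexing replaces every index by its row of the matrix.
embedded = embedding_matrix[batch]            # (batch_size, seq_len, embedding_dim)
print(embedded.shape)                         # (2, 5, 4)
```

The resulting three-dimensional array has the shape an LSTM expects, so no extra Reshape layer is needed in this model.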

In [0]:
## Add the Embedding layer
max_review_length = 300
## input to the Embedding layer: (batch_size, max_review_length), with max_review_length = 300
## output of the Embedding layer: (batch_size, 300, 50), where 50 is the word2vec embedding size
model_wv.add(keras.layers.Embedding(5001,embedding_vector_length,input_length=max_review_length,
                                 weights=[embedding_matrix],trainable = False))
## Without pretrained embeddings, set trainable = True
## and remove weights=[embedding_matrix]

In [36]:
## Add the LSTM layer
model_wv.add(keras.layers.LSTM(128,dropout=0.2,recurrent_dropout=0.2))
## dropout is applied to the layer inputs; recurrent_dropout is applied to the recurrent (hidden) state




In [0]:
## Add the output layer
model_wv.add(keras.layers.Dense(1,activation="sigmoid"))
## Compile the model
model_wv.compile(optimizer="adam",metrics=["accuracy"],loss="binary_crossentropy")

In [38]:
## Train the model
model_wv.fit(reviews_seq,sentiment,validation_split=0.2,
          epochs=5,
          batch_size=128)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f919c8967f0>

1. Memory size = 128: the same for TF-IDF and word2vec.
2. Time steps: TF-IDF 5000, word2vec 300. word2vec needs about 17 times fewer steps; the sequence length can be changed as required.
3. Input size at each time step: TF-IDF 1, word2vec 50. TF-IDF has the smaller input.
4. Word order within a document: TF-IDF discards it, while the word2vec sequences preserve it.
5. Accuracy: TFIDF very
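The first three points of this comparison can be checked with a little arithmetic. The sketch below uses the standard LSTM parameter count (four gates, each with input weights, recurrent weights and a bias). `lstm_params` is a hypothetical helper written for this note, and the numbers are theoretical counts, not values read from the trained models.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and one bias vector.
    return 4 * (input_dim * units + units * units + units)

tfidf_params = lstm_params(input_dim=1, units=128)   # one TF-IDF value per step
w2v_params = lstm_params(input_dim=50, units=128)    # one 50-d vector per step

steps_ratio = 5000 / 300  # time steps per review: TF-IDF vs word2vec

print(tfidf_params, w2v_params, round(steps_ratio, 1))
```

The word2vec LSTM has somewhat more weights because of its wider input, but it unrolls over far fewer time steps, and the number of sequential steps is what dominates the training time of a recurrent layer.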

In [0]:
## Let's use a GRU layer instead of the LSTM


In [0]:
## Now we have word2vec vectors for the 5000 words
## Let's build the model
model_gru = keras.models.Sequential()

In [0]:
## Add the Embedding layer
max_review_length = 300
## input to the Embedding layer: (batch_size, max_review_length), with max_review_length = 300
## output of the Embedding layer: (batch_size, 300, 50), where 50 is the word2vec embedding size
model_gru.add(keras.layers.Embedding(5001,embedding_vector_length,input_length=max_review_length,
                                 weights=[embedding_matrix],trainable = False))
## Without pretrained embeddings, set trainable = True
## and remove weights=[embedding_matrix]

In [41]:
## Add the GRU layer
model_gru.add(keras.layers.GRU(128,dropout=0.2,recurrent_dropout=0.2))
## dropout is applied to the layer inputs; recurrent_dropout is applied to the recurrent (hidden) state
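A GRU has three gate-like blocks instead of the four in an LSTM, so with the same memory size it has fewer weights. The sketch below compares the theoretical parameter counts. Both helpers are hypothetical, and the two bias vectors per block assume the TensorFlow 2 default of `reset_after=True`; verify with `model_gru.summary()` on your own version.

```python
def lstm_params(input_dim, units):
    # 4 blocks: input weights, recurrent weights and one bias vector each.
    return 4 * (input_dim * units + units * units + units)

def gru_params(input_dim, units, reset_after=True):
    # 3 blocks; reset_after=True (assumed default) keeps separate input and recurrent biases.
    bias = 2 * units if reset_after else units
    return 3 * (input_dim * units + units * units + bias)

lstm_count = lstm_params(input_dim=50, units=128)
gru_count = gru_params(input_dim=50, units=128)

print(lstm_count, gru_count, round(gru_count / lstm_count, 2))
```

Under these assumptions the GRU layer has roughly three quarters of the LSTM's parameters for the same 50-dimensional input and 128 units.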




In [0]:
## Add the output layer
model_gru.add(keras.layers.Dense(1,activation="sigmoid"))
## Compile the model
model_gru.compile(optimizer="adam",metrics=["accuracy"],loss="binary_crossentropy")

In [43]:
## Train the model
model_gru.fit(reviews_seq,sentiment,validation_split=0.2,
          epochs=5,
          batch_size=128)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f919b24eba8>