# Text classification 
## Sentiment analysis
Sentiment analysis is a natural language processing task in which a model reads a piece of text and predicts the sentiment it expresses. Here, you will predict whether movie reviews are positive or negative, using Python and the Keras deep learning library.

## Data description
The [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) (often referred to as the IMDB dataset) contains 25,000 highly polar movie reviews (good or bad) for training and the same number again for testing. The problem is to determine whether a given movie review has a positive or negative sentiment. Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).

## Loading dataset
First, we will load the complete dataset and analyze some of its properties.


In [11]:
import numpy as np
from matplotlib import pyplot
import pandas as pd
import keras
from keras import regularizers,layers
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence

In [75]:
data = pd.read_csv('IMDB_Dataset.csv')

In [76]:
data

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


In [77]:
# The first column holds the review text, which is the model input
X = data.iloc[:, 0].values

# The second column holds the sentiment labels ('positive' / 'negative')
y = data.iloc[:, 1].values

In [78]:
Y = []
for i in range(len(y)):
    if y[i] == 'negative':
        Y.append(0)
    if y[i] == 'positive':
        Y.append(1) 

In [79]:
Y_oh = []
for i in range(len(y)):
    if Y[i] == 0:
        Y_oh.append([1, 0])
    if Y[i] == 1:
        Y_oh.append([0, 1])    

In [80]:
Y_oh = np.array(Y_oh)
Y_oh

array([[0, 1],
       [0, 1],
       [0, 1],
       ...,
       [1, 0],
       [1, 0],
       [1, 0]])
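
The two labelling loops above can also be written as one small helper. This is a minimal sketch, assuming the only labels are 'negative' and 'positive'; the names `LABEL_TO_ID` and `encode_labels` are illustrative and do not appear elsewhere in this notebook.

```python
# Hypothetical helper names; only the 0/1 and [1, 0]/[0, 1] conventions
# come from the notebook cells above.
LABEL_TO_ID = {"negative": 0, "positive": 1}

def encode_labels(labels):
    """Return (integer ids, one-hot rows) for a list of sentiment strings."""
    ids = [LABEL_TO_ID[label] for label in labels]  # KeyError on an unknown label
    one_hot = [[1, 0] if i == 0 else [0, 1] for i in ids]
    return ids, one_hot

ids, one_hot = encode_labels(["positive", "negative", "positive"])
print(ids)      # [1, 0, 1]
print(one_hot)  # [[0, 1], [1, 0], [0, 1]]
```

Unlike the `if` chains, the dictionary lookup fails loudly on an unexpected label instead of silently skipping it, which keeps the label list aligned with the reviews.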

In [81]:
# Text preprocessing
import nltk 
import string 
import re 

In [82]:
for i in range(len(X)):
    X[i] = X[i].lower().replace('<br>', '').replace('<br />', '')

In [83]:
import string
for i in range(len(X)):
    for c in string.punctuation:
        X[i]= X[i].replace(c,"")

In [84]:
X

array(['one of the other reviewers has mentioned that after watching just 1 oz episode youll be hooked they are right as this is exactly what happened with methe first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the wordit is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to manyaryans muslims gangstas latinos christians italians irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awayi would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare forget pretty

In [85]:
# remove whitespace from text 
for i in range(len(X)):
    X[i] = " ".join(X[i].split())
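
The cleaning steps above (lowercasing, removing `<br>` tags, removing punctuation, collapsing whitespace) can be gathered into a single function. The sketch below is one possible arrangement, not the notebook's exact behaviour: it replaces each tag with a space rather than an empty string, which is an assumption made here to avoid fused tokens such as 'methe' seen in the output above. The names `BR_TAG`, `PUNCT_TABLE` and `clean_review` are illustrative.

```python
import re
import string

# Matches <br>, <br/> and <br />
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_review(text):
    """Lowercase, drop <br> tags and punctuation, and collapse whitespace."""
    text = BR_TAG.sub(" ", text.lower())  # a space keeps neighbouring words apart
    text = text.translate(PUNCT_TABLE)    # delete every punctuation character
    return " ".join(text.split())         # collapse runs of whitespace

cleaned = clean_review("A wonderful little production. <br /><br />The filming is great!")
print(cleaned)  # a wonderful little production the filming is great
```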

In [86]:
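# Import from the public `text` module; the `stop_words` module path is not available in newer scikit-learn releases
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS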

In [87]:
ENGLISH_STOP_WORDS = set(ENGLISH_STOP_WORDS)

In [88]:
# One-time downloads of the NLTK resources used below (uncomment on first run):
# nltk.download('stopwords')
# nltk.download('punkt')

In [89]:
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

In [90]:
stop_words = set(stopwords.words('english')) 

In [91]:
stop_words.update(ENGLISH_STOP_WORDS)

In [92]:
def remove_stopwords(text):  
    word_tokens = word_tokenize(text) 
    filtered_text = [word for word in word_tokens if word not in stop_words]
    return filtered_text 

In [93]:
# calling of remove_stopwords function
for i in range(len(X)):
    X[i] = remove_stopwords(X[i])

In [94]:
X

array([list(['reviewers', 'mentioned', 'watching', '1', 'oz', 'episode', 'youll', 'hooked', 'right', 'exactly', 'happened', 'methe', 'thing', 'struck', 'oz', 'brutality', 'unflinching', 'scenes', 'violence', 'set', 'right', 'word', 'trust', 'faint', 'hearted', 'timid', 'pulls', 'punches', 'regards', 'drugs', 'sex', 'violence', 'hardcore', 'classic', 'use', 'wordit', 'called', 'oz', 'nickname', 'given', 'oswald', 'maximum', 'security', 'state', 'penitentary', 'focuses', 'mainly', 'emerald', 'city', 'experimental', 'section', 'prison', 'cells', 'glass', 'fronts', 'face', 'inwards', 'privacy', 'high', 'agenda', 'em', 'city', 'home', 'manyaryans', 'muslims', 'gangstas', 'latinos', 'christians', 'italians', 'irish', 'moreso', 'scuffles', 'death', 'stares', 'dodgy', 'dealings', 'shady', 'agreements', 'far', 'awayi', 'say', 'main', 'appeal', 'fact', 'goes', 'shows', 'wouldnt', 'dare', 'forget', 'pretty', 'pictures', 'painted', 'mainstream', 'audiences', 'forget', 'charm', 'forget', 'romanceoz

In [102]:
# Read the whole lexicon file; the context manager closes it afterwards
with open("sentiment_lexicons.txt", "r") as t:
    d = t.read()

In [110]:
type(d)

str

In [111]:
d = d.split()

In [142]:
d[5267]

'perplexity'

In [113]:
len(d)

6789

In [114]:
id_to_word = dict(enumerate(d))

In [115]:
id_to_word

{0: 'a+',
 1: 'abound',
 2: 'abounds',
 3: 'abundance',
 4: 'abundant',
 5: 'accessable',
 6: 'accessible',
 7: 'acclaim',
 8: 'acclaimed',
 9: 'acclamation',
 10: 'accolade',
 11: 'accolades',
 12: 'accommodative',
 13: 'accomodative',
 14: 'accomplish',
 15: 'accomplished',
 16: 'accomplishment',
 17: 'accomplishments',
 18: 'accurate',
 19: 'accurately',
 20: 'achievable',
 21: 'achievement',
 22: 'achievements',
 23: 'achievible',
 24: 'acumen',
 25: 'adaptable',
 26: 'adaptive',
 27: 'adequate',
 28: 'adjustable',
 29: 'admirable',
 30: 'admirably',
 31: 'admiration',
 32: 'admire',
 33: 'admirer',
 34: 'admiring',
 35: 'admiringly',
 36: 'adorable',
 37: 'adore',
 38: 'adored',
 39: 'adorer',
 40: 'adoring',
 41: 'adoringly',
 42: 'adroit',
 43: 'adroitly',
 44: 'adulate',
 45: 'adulation',
 46: 'adulatory',
 47: 'advanced',
 48: 'advantage',
 49: 'advantageous',
 50: 'advantageously',
 51: 'advantages',
 52: 'adventuresome',
 53: 'adventurous',
 54: 'advocate',
 55: 'advocated',

In [117]:
word_to_id = {b:a for a, b in id_to_word.items()}
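
One property of this inversion is worth knowing: if the word list repeats an entry, the inverted dictionary keeps only one key for it, so `word_to_id` can end up smaller than `id_to_word`. The toy list below is a stand-in for the lexicon file; whether the real file contains repeats is not shown in this notebook.

```python
# Toy lexicon standing in for the contents of sentiment_lexicons.txt
words = ["abound", "acclaim", "adore", "acclaim"]

id_to_word = dict(enumerate(words))
word_to_id = {word: idx for idx, word in id_to_word.items()}

print(len(id_to_word))        # 4
print(len(word_to_id))        # 3, the repeated word collapses to one key
print(word_to_id["acclaim"])  # 3, the last index wins
```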

In [118]:
word_to_id

{'a+': 0,
 'abound': 1,
 'abounds': 2,
 'abundance': 3,
 'abundant': 4,
 'accessable': 5,
 'accessible': 6,
 'acclaim': 7,
 'acclaimed': 8,
 'acclamation': 9,
 'accolade': 10,
 'accolades': 11,
 'accommodative': 12,
 'accomodative': 13,
 'accomplish': 14,
 'accomplished': 15,
 'accomplishment': 16,
 'accomplishments': 17,
 'accurate': 18,
 'accurately': 19,
 'achievable': 20,
 'achievement': 21,
 'achievements': 22,
 'achievible': 23,
 'acumen': 24,
 'adaptable': 25,
 'adaptive': 26,
 'adequate': 27,
 'adjustable': 28,
 'admirable': 29,
 'admirably': 30,
 'admiration': 31,
 'admire': 32,
 'admirer': 33,
 'admiring': 34,
 'admiringly': 35,
 'adorable': 36,
 'adore': 37,
 'adored': 38,
 'adorer': 39,
 'adoring': 40,
 'adoringly': 41,
 'adroit': 42,
 'adroitly': 43,
 'adulate': 44,
 'adulation': 45,
 'adulatory': 46,
 'advanced': 47,
 'advantage': 48,
 'advantageous': 49,
 'advantageously': 50,
 'advantages': 51,
 'adventuresome': 52,
 'adventurous': 53,
 'advocate': 54,
 'advocated': 55,

In [119]:
b = []
for i in X:
    c = []
    for j in i:
        try:
            c.append(word_to_id[j])
        except KeyError:
            # Skip words that are not in the sentiment lexicon
            pass
    b.append(c)

In [120]:
b = np.array(b)

In [121]:
b

array([list([1532, 6086, 2457, 1532, 1843, 3576, 6266, 286, 5382, 2851, 5773, 110, 1345, 268, 4856, 6086, 5007, 1758, 1424, 4437, 2779, 4631, 5382, 4643, 5382, 308, 6391, 2834]),
       list([1979, 310, 1575, 1991, 1149, 856, 1153, 1642, 6234]),
       list([1979, 926, 5322, 5832, 1976, 1086, 4633, 3118, 5610, 965, 1594, 1532, 1661, 3052, 1013, 856]),
       ...,
       list([831, 2251, 4632, 3671, 3579, 1072, 2338, 672, 6299, 4761, 2015, 6299, 3349, 1097, 6299, 672, 6777]),
       list([3106, 3522, 6637, 2752, 6724, 1686, 1087, 4650, 268, 4730, 2411, 4287, 2511, 5462, 5007, 5578, 1087, 4707, 119, 1185, 1366]),
       list([684, 831, 195, 6454, 4194, 5322, 6752, 3875, 1991, 1991])],
      dtype=object)
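
The lookup loop above keeps only the tokens that appear in the lexicon. A compact way to express the same idea is a membership test inside a list comprehension. The sketch below uses a made-up three-word vocabulary in place of `word_to_id`, and the name `encode_tokens` is illustrative.

```python
def encode_tokens(tokens, vocab):
    """Map tokens to ids, dropping tokens that are not in vocab."""
    return [vocab[token] for token in tokens if token in vocab]

toy_vocab = {"good": 0, "great": 1, "awful": 2}  # hypothetical stand-in for word_to_id
encoded = encode_tokens(["great", "movie", "awful", "plot", "great"], toy_vocab)
print(encoded)  # [1, 2, 1]
```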

In [122]:
#test train split
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(b, Y_oh, test_size = 0.20, random_state = 0)

In [123]:
print(len(X_train))
print(len(X_test))
print(len(Y_train))
print(len(Y_test))

40000
10000
40000
10000


In [124]:
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(40000,)
(10000,)
(40000, 2)
(10000, 2)


In [125]:
def multi_hot_encode(sequences, dimension):
    
    results = np.zeros((len(sequences), dimension))
    for i in range(len(sequences)):
        for j in range(len(sequences[i])):
            results[i][sequences[i][j]] = 1
    return results
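
To see what the encoding produces, it helps to run it on a tiny input. The sketch below is a variant of the function above that sets each row with a single indexed assignment instead of an inner loop; the name `multi_hot_encode_vectorized` and the toy sequences are assumptions made for illustration.

```python
import numpy as np

def multi_hot_encode_vectorized(sequences, dimension):
    """Same output as multi_hot_encode, with one indexed assignment per row."""
    results = np.zeros((len(sequences), dimension))
    for row, indices in enumerate(sequences):
        # Repeated indices still yield a single 1; an empty sequence leaves the row at 0
        results[row, np.asarray(indices, dtype=int)] = 1
    return results

toy = multi_hot_encode_vectorized([[0, 2, 2], [], [3]], 4)
print(toy)
```

Each row has one column per vocabulary entry, so word order and word counts are discarded: only the presence of a word is recorded.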


In [126]:
x_train = multi_hot_encode(X_train, 6789)
x_test = multi_hot_encode(X_test, 6789)

In [137]:
x_train.shape

(40000, 6789)

In [134]:
x_train

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
def summarize_data():
  """
  Output:
                    classes: list, list of unique classes in y  
                no_of_words: int, number of unique words in dataset x 
     list_of_review_lengths: list,  list of lengths of each review 
         mean_review_length: float, mean(list_of_review_lengths), a single floating point value
          std_review_length: float, standard_deviation(list_of_review_lengths), a single floating point value
  """

  import statistics
  classes = np.unique(y)
  no_of_words = len(np.unique(np.concatenate(X)))
  list_of_review_lengths = [len(i) for i in X]
  mean_review_length = statistics.mean(list_of_review_lengths)
  std_review_length = statistics.stdev(list_of_review_lengths)
  return classes, no_of_words, list_of_review_lengths, mean_review_length, std_review_length


classes, no_of_words, list_of_review_lengths, mean_review_length, std_review_length = summarize_data()
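
The same summary statistics can be checked by hand on a toy corpus. This is a small sketch with made-up reviews, not data from the notebook. Note that `statistics.stdev` computes the sample standard deviation (dividing by n - 1), whereas `np.std` divides by n by default, so the two give different values on the same list.

```python
import statistics

# Made-up tokenized reviews for illustration
toy_reviews = [["great", "film"], ["awful"], ["great", "cast", "awful", "plot"]]

lengths = [len(review) for review in toy_reviews]
vocabulary = {word for review in toy_reviews for word in review}

print(lengths)          # [2, 1, 4]
print(len(vocabulary))  # 5
print(statistics.mean(lengths))
print(statistics.stdev(lengths))  # sample standard deviation (n - 1 denominator)
```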


In [None]:
def one_hot(y):
  """
  Inputs:
    y: numpy array with class labels
  Outputs:
    y_oh: numpy array with corresponding one-hot encodings
  """

  #y_oh = np.zeros(len(classes))
  oh = []
  for i in range(0, len(y)):
    if y[i] == 0:
      oh.append([1, 0])
    else:
      oh.append([0, 1]) 
  y_oh = np.array(oh)
  return y_oh
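# Y_train and Y_test already hold one-hot labels (they were split from Y_oh above),
# so they are reused here instead of being encoded a second time.
y_train, y_test = Y_train, Y_test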

In [None]:
def multi_hot_encode(sequences, dimension):
  """
    Input:
          sequences: list of integer index sequences (X_train or X_test)
          dimension: int, size of the vocabulary (number of output columns)

    Output:
          results: numpy matrix of shape (len(sequences), dimension) in which
                   results[i][j] is 1 if index j occurs in sequences[i], else 0
  """
  results = np.zeros((len(sequences), dimension))
  for i in range(len(sequences)):
    for j in range(len(sequences[i])):
      results[i][sequences[i][j]] = 1
  return results


## Build Model
Build a multi-layer feed-forward network in Keras.

### Create the model

In [129]:
def create_model():
    """
    Output:
        model: A compiled keras model
    """
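    model = Sequential()
    # Note: Embedding layers expect integer word indices, whereas
    # multi_hot_encode() produces 0/1 indicator vectors of length 6789.
    model.add(Embedding(6789, 32, input_length = 6789))
    model.add(Flatten())
    model.add(Dense(32, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    # Two softmax outputs with one-hot labels pair with categorical cross-entropy
    model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
    return model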
  
model = create_model()
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 6789, 32)          217248    
_________________________________________________________________
flatten_1 (Flatten)          (None, 217248)            0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                6951968   
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 66        
Total params: 7,169,282
Trainable params: 7,169,282
Non-trainable params: 0
_________________________________________________________________
None
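
The parameter counts in the summary can be reproduced with a few lines of arithmetic. The sketch below only uses the layer sizes passed to `create_model` above; the variable names are illustrative. The last line is a side calculation, under the assumption that a `Dense(32)` layer were applied directly to the 6789-wide multi-hot vector instead of going through the embedding.

```python
vocab_size, embed_dim, seq_len = 6789, 32, 6789
hidden_units, n_classes = 32, 2

embedding_params = vocab_size * embed_dim   # one 32-d vector per word id
flattened_size = seq_len * embed_dim        # Flatten itself has no weights
dense_1_params = flattened_size * hidden_units + hidden_units  # weights + biases
dense_2_params = hidden_units * n_classes + n_classes

total = embedding_params + dense_1_params + dense_2_params
print(embedding_params, flattened_size, dense_1_params, dense_2_params, total)
# 217248 217248 6951968 66 7169282

# For comparison: Dense(32) fed directly with the multi-hot vector
direct_dense_params = vocab_size * hidden_units + hidden_units
print(direct_dense_params)  # 217280
```

Almost all of the 7,169,282 parameters sit in the first dense layer, because flattening a 6789 x 32 embedding output produces 217,248 inputs.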


### Fit the Model

In [136]:
#x_strat.shape
X_train.shape

(40000,)

In [139]:
import matplotlib.pyplot as plt
def fit(model):
    """
    Action:
        Fit the model created above on x_train and Y_train, using x_test and Y_test
        as validation data (10 epochs, batch size 128, verbose=1), and store the
        result in the 'history' variable.
        
        Evaluate the model on x_test, Y_test with verbose=0 and store the result in 'scores'.
    Output:
        scores: list of length 2 (loss, accuracy)
        history_dict: output of history.history where history is output of model.fit()
    """

    history = model.fit(x_train, Y_train, validation_data=(x_test, Y_test), epochs=10, batch_size=128, verbose=1)
    scores =  model.evaluate(x_test, Y_test, verbose=0)
    history_dict = history.history
    return scores,history_dict
    
scores,history_dict = fit(model)    


Train on 40000 samples, validate on 10000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


ValueError: Error when checking input: expected embedding_1_input to have shape (6789,) but got array with shape (1,)

In [145]:
Accuracy = scores[1]*100
print('Accuracy of your model is')
print(Accuracy)

NameError: name 'scores' is not defined

In [None]:
history_dict['loss']

### Verify whether training has converged or not

In [None]:
import matplotlib.pyplot as plt
plt.clf()
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
epochs = range(1, (len(history_dict['loss']) + 1))
plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

In [None]:
plt.clf()
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']
epochs = range(1, (len(history_dict['acc']) + 1))
plt.plot(epochs, acc_values, 'bo', label='Training acc')
plt.plot(epochs, val_acc_values, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
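
Besides reading the plots, convergence can be checked numerically from the history dictionary. The sketch below finds the epoch with the lowest validation loss. The loss values are made up for illustration and are not results from this notebook, and `best_epoch` is a hypothetical helper name.

```python
# Made-up loss values for illustration only; not results from this notebook.
example_history = {
    "loss":     [0.69, 0.60, 0.52, 0.46, 0.41],
    "val_loss": [0.68, 0.61, 0.55, 0.54, 0.56],
}

def best_epoch(history, key="val_loss"):
    """Return the 1-based epoch with the lowest value for key."""
    values = history[key]
    return values.index(min(values)) + 1

epoch = best_epoch(example_history)
print(epoch)  # 4

# If the best epoch is the last one, validation loss was still falling
still_improving = epoch == len(example_history["val_loss"])
print(still_improving)  # False
```

A training loss that keeps falling while the validation loss turns upward, as in these toy numbers, is the usual sign that further epochs would overfit.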

### Advanced
Write 5 reviews of your own, each at least 20 words long, and check whether your model predicts their sentiment correctly.


In [None]:
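def accuracy(x_strat, y_strat, model):
    # model.evaluate returns [loss, accuracy]; keep only the accuracy
    acc = model.evaluate(x_strat, y_strat)[1]
    return acc

# Y_test holds the one-hot test labels produced by train_test_split above
acc = accuracy(x_test, Y_test, model)
print('Test accuracy is, ', acc*100, '%')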

In [None]:
a = ['this movie is disaster in the history of Bollywood. i never expected this from such a great actor and movie maker.', 'this movie is filled with awful moments and i am very depressed after watching this movie.', 'what a great movie i loved it so much acting is great by a great actor and the makers have done a good job. they have put a lot of effort in this.', 'this movie is total worth watching, good comedy movie i have seen ever some laughing scenes are not that good but over all the movie is very good.', 'these kinds of movies comes in once in ceturies. a must watch movie, full of enjoyment, very much loved by teeagers']

In [None]:
print(len(a))

In [None]:
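b = []
for i in a:
  c = []
  # Lowercase and strip punctuation so tokens match the training preprocessing
  for j in i.lower().translate(str.maketrans('', '', string.punctuation)).split():
    try:
      c.append(word_to_id[j])
    except KeyError:
      # Skip words that are not in the sentiment lexicon
      pass
  b.append(c)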

In [None]:
b = np.array(b)

In [None]:
b

In [None]:
# The dimension must match the 6789-word lexicon used to encode x_train
validation = multi_hot_encode(b, 6789)

In [None]:
print(validation)

In [None]:
print('Negative , positive')
print(model.predict(validation))
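
The raw output of `model.predict` is one row of two probabilities per review, in the column order negative, positive. A small helper can turn each row into a readable label. The sketch below assumes that column order and uses made-up probabilities rather than actual model output; `CLASS_NAMES` and `decode_predictions` are illustrative names.

```python
CLASS_NAMES = ["negative", "positive"]  # column order of the one-hot labels

def decode_predictions(probabilities):
    """Return the class name with the highest probability for each row."""
    labels = []
    for row in probabilities:
        best = max(range(len(row)), key=lambda k: row[k])  # index of the largest value
        labels.append(CLASS_NAMES[best])
    return labels

# Made-up probabilities for illustration; not actual model output.
decoded = decode_predictions([[0.8, 0.2], [0.3, 0.7]])
print(decoded)  # ['negative', 'positive']
```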