## Sentiment Analysis on IMDB Reviews
This is a simple project created to put into practice knowledge of Machine Learning, Natural Language Processing, and Sentiment Analysis.

## Dataset
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing.

We do not need to download the dataset manually: `imdb.load_data` from Keras downloads it from within the Jupyter notebook.

In [None]:
from keras.datasets import imdb

***Data Preparation***

In [None]:
# XT, YT: training set; Xt, Yt: test set. Only the 10,000 most frequent words are kept.
(XT, YT), (Xt, Yt) = imdb.load_data(num_words=10000)

In [None]:
len(Xt),len(XT)

In [None]:
print(XT[0])

In [None]:
word_idx = imdb.get_word_index()

In [None]:
# print(word_idx.items())    # uncomment and run this cell to see the output

In [None]:
# Reverse mapping: index -> word
idx_word = {value: key for (key, value) in word_idx.items()}

In [None]:
# print(idx_word.items())    # uncomment and run this cell to see the output

In [None]:
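# Indices are offset by 3: 0, 1 and 2 are reserved (padding, start of review, unknown word),
# so they have no entry in idx_word and are shown as '#'
actual_review = ' '.join([idx_word.get(idx-3,'#') for idx in XT[0]])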

In [None]:
print(actual_review)
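
The `idx-3` offset used when decoding can be checked on a small example. This is a minimal sketch, assuming the default settings of `imdb.load_data` (0 for padding, 1 for the start of a review, 2 for an unknown word, and every real word stored as its rank plus 3). `toy_word_idx` and `decode_review` are hypothetical names used only for illustration, not the real IMDB vocabulary.

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank (1 = most frequent).
toy_word_idx = {"the": 1, "movie": 2, "was": 3, "great": 4}

# Reverse mapping: rank -> word.
toy_idx_word = {rank: word for word, rank in toy_word_idx.items()}

# A review as it would be stored: start marker (1), three words,
# an unknown word (2), and one more word. Real words are rank + 3.
encoded_review = [1, 4, 5, 6, 2, 7]


def decode_review(encoded, idx_word, offset=3, unknown="#"):
    """Map encoded indices back to words; reserved indices become `unknown`."""
    return " ".join(idx_word.get(idx - offset, unknown) for idx in encoded)


decoded = decode_review(encoded_review, toy_idx_word)
print(decoded)
```

The start marker and the unknown word both come out as `#`, which is why decoded IMDB reviews begin with `#`.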

In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt

In [None]:
# Next step: vectorize the data.
# The vocabulary size is 10,000, so every review is represented by a
# multi-hot vector of length 10,000, e.g. [0000010001001011...]


def vectorize_sentences(sentences, dim=10000):
  outputs = np.zeros((len(sentences), dim))

  # idx is the list of word indices of review i; set all of those columns to 1
  for i, idx in enumerate(sentences):
    outputs[i, idx] = 1

  return outputs

In [None]:
X_train  = vectorize_sentences(XT)
X_test = vectorize_sentences(Xt)

In [None]:
print(X_train.shape)
print(X_test.shape)

In [None]:
print(X_train[0])
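
To see what the vectorization does, here is a small sketch of the same multi-hot encoding on toy data. `multi_hot` and `toy_reviews` are illustrative names; the logic is intended to mirror `vectorize_sentences` with a much smaller `dim`.

```python
import numpy as np


def multi_hot(sequences, dim):
    """Return a (len(sequences), dim) array with 1.0 at every index that occurs."""
    out = np.zeros((len(sequences), dim))
    for row, indices in enumerate(sequences):
        out[row, indices] = 1.0  # sets all listed columns of this row at once
    return out


# Two toy "reviews"; index 3 appears twice in the first one.
toy_reviews = [[1, 3, 3, 5], [0, 2]]
encoded = multi_hot(toy_reviews, dim=6)
print(encoded)
```

A repeated index still gives a single 1, so this representation discards both word order and word counts. At the notebook's real sizes (25,000 reviews by 10,000 columns) the default `float64` array takes about 2 GB; passing `dtype='float32'` to `np.zeros` would halve that.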

In [None]:
Y_train  = np.asarray(YT).astype('float32')
Y_test = np.asarray(Yt).astype('float32')

## Build a network
### Define our model architecture
1. Use fully connected (dense) layers with ReLU activation.

2. Two hidden layers with 16 units each.

3. One output layer with 1 unit (sigmoid activation function).


In [None]:
from keras import models
from keras.layers import Dense

In [None]:
# define the model
model  = models.Sequential()
model.add(Dense(16,activation = 'relu' , input_shape = (10000,)))
model.add(Dense(16,activation = 'relu'))
model.add(Dense(1,activation = 'sigmoid'))

In [None]:
# Compile the model; 'adam' can be used instead of 'rmsprop'
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['accuracy'])
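
The `binary_crossentropy` loss fits a single sigmoid output because it compares a predicted probability with a 0/1 label. The following is a plain-Python sketch of the formula, not the Keras implementation (which is vectorized and may clip differently); the `eps` value is an assumption chosen only to avoid `log(0)`.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log() never receives 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Same labels, once with confident correct predictions, once with confident wrong ones.
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

Confidently wrong predictions are penalized far more heavily than confidently right ones are rewarded, which is the behaviour the training loss curve reflects.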

In [None]:
model.summary()
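
The parameter counts reported by `model.summary()` can be verified by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A short sketch of that arithmetic for the architecture above (`dense_params` is a hypothetical helper, not a Keras function):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Input dimension followed by the number of units in each Dense layer.
layer_sizes = [10000, 16, 16, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)
```

Almost all of the parameters sit in the first layer, because it connects the 10,000-dimensional input to the 16 hidden units.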

## Training and validation

In [None]:
x_val = X_train[:5000]
x_train_new = X_train[5000:]

y_val = Y_train[:5000]
y_train_new = Y_train[5000:]

In [None]:
hist = model.fit(x_train_new,y_train_new,epochs = 4,batch_size=512,validation_data =(x_val,y_val))

## Visualize

In [None]:
h = hist.history

In [None]:
# Set the style before plotting; on Matplotlib older than 3.6 the name is 'seaborn'
plt.style.use('seaborn-v0_8')
plt.plot(h['val_loss'],label = 'validation loss')
plt.plot(h['loss'],label = 'training loss')
plt.xlabel('epochs')
plt.ylabel('loss')
plt.legend()
plt.show()

In [None]:
plt.plot(h['val_accuracy'],label = 'validation Acc')
plt.plot(h['accuracy'],label = 'training Acc')
plt.xlabel('epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
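
The curves are mainly useful for choosing how many epochs to train: the validation loss usually stops improving before the training loss does. Below is a sketch of reading that point from a history dictionary. The numbers in `toy_history` are made up for illustration and are not results from this notebook; `best_epoch` is a hypothetical helper.

```python
# Hypothetical values, shaped like hist.history returned by model.fit.
toy_history = {
    "loss": [0.52, 0.31, 0.23, 0.18],
    "val_loss": [0.40, 0.31, 0.29, 0.33],
}


def best_epoch(history, key="val_loss"):
    """Return the 1-based epoch with the lowest value of `key`."""
    values = history[key]
    return values.index(min(values)) + 1


epoch = best_epoch(toy_history)
print(epoch)
```

With the real `h` from the notebook, `best_epoch(h)` would give the epoch after which further training is likely to overfit.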

In [None]:
# Accuracy on the held-out test set (evaluate returns [loss, accuracy])
model.evaluate(X_test,Y_test)[1]

In [None]:
# Accuracy on the full training set, for comparison
model.evaluate(X_train,Y_train)[1]
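
To try the trained model on a new review, the text first has to be converted into the same index format as the dataset. The sketch below assumes the default `imdb.load_data` conventions (start marker 1, unknown word 2, ranks shifted by 3) and a naive lowercase-and-split tokenizer, which is a simplification of how the dataset was actually tokenized. `encode_review` and `toy_word_idx` are hypothetical names.

```python
def encode_review(text, word_idx, num_words=10000, offset=3, oov=2, start=1):
    """Turn raw text into a list of indices in the imdb.load_data format."""
    encoded = [start]
    for word in text.lower().split():
        rank = word_idx.get(word)
        # Words missing from the vocabulary, or too rare to fit within
        # num_words, are replaced by the out-of-vocabulary index.
        if rank is None or rank + offset >= num_words:
            encoded.append(oov)
        else:
            encoded.append(rank + offset)
    return encoded


# Hypothetical vocabulary; in the notebook this would be word_idx.
toy_word_idx = {"the": 1, "movie": 2, "was": 3, "great": 4}
new_review = encode_review("The movie was surprisingly great", toy_word_idx)
print(new_review)
```

In the notebook, `model.predict(vectorize_sentences([encode_review(text, word_idx)]))` would then return the predicted probability that the review is positive.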