### Connect to Kaggle

This exercise uses the movie review data from the Kaggle competition at https://www.kaggle.com/c/word2vec-nlp-tutorial/data. We first connect Colab to Kaggle; instructions for downloading Kaggle data to Colab can be found [in this post](https://towardsdatascience.com/setting-up-kaggle-in-google-colab-ebb281b61463).

In [0]:
!pip install kaggle --quiet

In [0]:
#Make a directory for Kaggle (-p avoids an error if the cell is run twice)
!mkdir -p .kaggle

In [0]:
#Connect Google drive to colab
from google.colab import drive
drive.mount('/gdrive')

In [0]:
#Copy the kaggle.json file. Change the gdrive folder to wherever you saved the json file downloaded from Kaggle
!cp '/gdrive/My Drive/AI-ML/Machine-Learning/Code/Utilities/kaggle.json' /content/.kaggle/kaggle.json

In [0]:
#Check if json file is there
!ls -l /content/.kaggle

In [0]:
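#Put the token where the Kaggle CLI looks for it and restrict its permissions
!mkdir -p ~/.kaggle
!cp /content/.kaggle/kaggle.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
#Set the default download folder
!kaggle config set -n path -v /content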

Verify Kaggle connection

In [0]:
!kaggle datasets list
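
If the listing above fails, a quick look at the token file can narrow down the cause. The following is a minimal sketch, not part of the Kaggle tooling: `check_kaggle_credentials` is a hypothetical helper that assumes the token is a JSON object with `username` and `key` fields and that the file should be accessible to its owner only.

```python
import json
import os
import stat


def check_kaggle_credentials(path):
    """Return a list of problems found with a kaggle.json file (empty if none)."""
    if not os.path.isfile(path):
        return [f"{path} does not exist"]

    problems = []
    # The token should not be accessible to group or other users (mode 600).
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append(f"permissions are {oct(mode)}, expected 0o600")

    # Assumption: the token is a JSON object with a username and an API key.
    try:
        with open(path) as f:
            token = json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        return problems + [f"could not parse file: {exc}"]
    if not isinstance(token, dict):
        return problems + ["file does not contain a JSON object"]
    for field in ("username", "key"):
        if not token.get(field):
            problems.append(f"missing field: {field}")
    return problems
```

In Colab this could be called as `check_kaggle_credentials(os.path.expanduser('~/.kaggle/kaggle.json'))`; an empty list means none of these particular problems were found.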

#### Download Movie Reviews data

In [0]:
!kaggle competitions download -c word2vec-nlp-tutorial -p /content

In [0]:
#Confirm data has been downloaded
!ls -l

Import the dataset as a pandas DataFrame

In [0]:
import numpy as np
import pandas as pd

In [0]:
#quoting=3 is csv.QUOTE_NONE: quote characters inside reviews are kept as plain text
df = pd.read_csv('labeledTrainData.tsv.zip', header=0, delimiter="\t", quoting=3)

In [0]:
df.shape

In [0]:
df.head()

In [0]:
df.loc[0, 'review']

Split Data into Training and Test Data

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X_train, X_test, y_train, y_test = train_test_split(
    df['review'],
    df['sentiment'],
    test_size=0.2, 
    random_state=42
)

In [0]:
X_train.shape, X_test.shape

### Build the Tokenizer

In [0]:
import tensorflow as tf

In [0]:
desired_vocab_size = 10000 #Vocabulary size
#num_words -> vocabulary size; oov_token -> placeholder string for out-of-vocabulary words
t = tf.keras.preprocessing.text.Tokenizer(num_words=desired_vocab_size, oov_token='<OOV>')

In [0]:
#Fit tokenizer with actual training data
t.fit_on_texts(X_train.tolist())

In [0]:
#Vocabulary
t.word_index
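
To make the role of `num_words` and the out-of-vocabulary token concrete, here is a simplified pure-Python sketch of frequency-ranked indexing. It is a stand-in for illustration, not the Keras implementation: it splits on whitespace only and does no punctuation filtering, and the names `build_word_index` and `texts_to_sequences` below are hypothetical helpers. As in Keras, the word index holds every word seen, and the `num_words` cap is applied only when texts are converted to sequences.

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common sorts by descending count; ties keep first-seen order.
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate([oov_token] + ranked, start=1)}


def texts_to_sequences(texts, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices; unknown or too-rare words get the OOV index."""
    oov_index = word_index[oov_token]
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            i = word_index.get(word)
            # Only indices below num_words are kept, as with the cap above.
            seq.append(i if i is not None and i < num_words else oov_index)
        sequences.append(seq)
    return sequences


toy_index = build_word_index(["the movie was great", "the movie was bad", "the plot"])
print(toy_index)
# {'<OOV>': 1, 'the': 2, 'movie': 3, 'was': 4, 'great': 5, 'bad': 6, 'plot': 7}
print(texts_to_sequences(["the movie was awful", "great plot"], toy_index, num_words=5))
# [[2, 3, 4, 1], [1, 1]]
```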

### Prepare Training and Test Data

Get the word index for each word in the review

In [0]:
X_train[0:1]

In [0]:
X_train = t.texts_to_sequences(X_train.tolist())

In [0]:
print(X_train[0:1])

In [0]:
X_test = t.texts_to_sequences(X_test.tolist())

How many words are in a review?

In [0]:
len(X_train[200])
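
Checking a single review says little about the whole set. The sketch below summarises the length distribution and reports what share of reviews would fit under a given cutoff, which can help when choosing a maximum review length. `length_summary` and its `max_len` parameter are hypothetical names, and the function expects the list of index lists as it exists before any padding.

```python
import statistics


def length_summary(sequences, max_len):
    """Summarise sequence lengths and the share that fits within max_len."""
    lengths = [len(s) for s in sequences]
    return {
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
        "share_within_max_len": sum(n <= max_len for n in lengths) / len(lengths),
    }
```

For example, `length_summary(X_train, 300)` would show how many training reviews are kept whole at a 300-word cutoff.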

### Pad Sequences - Important

In [0]:
#Define maximum number of words to consider in each review
max_review_length = 300

In [0]:
#Pad training and test reviews
X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train,
                                                        maxlen=max_review_length,
                                                        padding='pre')
X_test = tf.keras.preprocessing.sequence.pad_sequences(X_test, 
                                                       maxlen=max_review_length, 
                                                       padding='pre')

In [0]:
X_train.shape

In [0]:
X_test.shape

In [0]:
X_train[200]
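
The effect of `padding='pre'` on one review can be sketched in plain Python. The call above leaves `truncating` at its default, which is also `'pre'`, so short reviews get zeros added at the front and long reviews lose their earliest words. `pad_pre` is a hypothetical helper for illustration and assumes `maxlen` is greater than zero.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    trimmed = sequence[-maxlen:]  # truncating='pre' drops items from the front
    return [value] * (maxlen - len(trimmed)) + trimmed


print(pad_pre([5, 6], maxlen=4))              # [0, 0, 5, 6]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```

Padding at the front keeps the actual words next to the final LSTM step, so the last hidden state is computed right after real input rather than after a run of zeros.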

### Download Google Word2Vec model

In [0]:
!pip install googledrivedownloader

In [0]:
from google_drive_downloader import GoogleDriveDownloader as gdd

#unzip=False: the file is gzip-compressed, not a zip archive; it is decompressed with gzip below
gdd.download_file_from_google_drive(file_id='0B7XkCwpI5KDYNlNUTTlSS21pQmM',
                                    dest_path='./GoogleNews-vectors-negative300.bin.gz',
                                    unzip=False)

In [0]:
import gzip
import shutil

In [0]:
with gzip.open('GoogleNews-vectors-negative300.bin.gz', 'rb') as f_in:
    with open('GoogleNews-vectors-negative300.bin', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

In [0]:
!ls -l

### Get Pre-trained Embeddings

In [0]:
import gensim

In [0]:
from gensim.models import Word2Vec, KeyedVectors

# Load pretrained model
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [0]:
#Alternatively, load a Word2Vec model you trained yourself
#word2vec = gensim.models.Word2Vec.load('word2vec-movie-50')

In [0]:
embedding_vector_length = 300

In [0]:
#Initialize embedding matrix
embedding_matrix = np.zeros((desired_vocab_size + 1, embedding_vector_length))

In [0]:
#Load word vectors for each word from Google Word2Vec model
for word, i in sorted(t.word_index.items(), key=lambda x: x[1]):
    if i >= desired_vocab_size:
        break #The tokenizer only emits indices below num_words
    #Skip non-string keys and words missing from Word2Vec; their rows stay zero
    if isinstance(word, str) and word in model:
        embedding_matrix[i] = model[word] #Reading word's embedding from Google Word2Vec

In [0]:
#embedding_matrix[2]
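
Words without a pre-trained vector keep an all-zero row, so it is worth knowing how much of the vocabulary is actually covered. The sketch below is one way to measure that; `embedding_coverage` is a hypothetical helper, and `known_words` can be any container that supports `in`, such as a set or the loaded `KeyedVectors` object.

```python
def embedding_coverage(word_index, known_words, vocab_size):
    """Share of indexed words below vocab_size that have a pre-trained vector."""
    candidates = [
        w for w, i in word_index.items()
        if i < vocab_size and isinstance(w, str)
    ]
    if not candidates:
        return 0.0
    found = sum(1 for w in candidates if w in known_words)
    return found / len(candidates)


toy_index = {"<OOV>": 1, "the": 2, "film": 3, "zzz": 4, "rare": 5}
print(embedding_coverage(toy_index, {"the", "film"}, vocab_size=5))  # 0.5
```

In this notebook the call would be `embedding_coverage(t.word_index, model, desired_vocab_size)`, run before `model` is reassigned to the Keras network in the next section.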

### Build Model

In [0]:
#Initialize model
tf.keras.backend.clear_session()
model = tf.keras.Sequential()

Add Embedding layer
 - Embedding Layer Input = [Batch_Size , Review Length]

In [0]:
model.add(tf.keras.layers.Embedding(desired_vocab_size + 1, #Vocabulary size
                                    embedding_vector_length, #Embedding size
                                    weights=[embedding_matrix], #Pre-trained Word2Vec vectors
                                    trainable=False, #Keep the pre-trained vectors frozen
                                    input_length=max_review_length) #Number of words in each review
          )

In [0]:
model.output

Embedding Layer Output - 
[Batch_Size , Review Length , Embedding_Size]

Add LSTM Layer with 256 as RNN state size

In [0]:
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.LSTM(256)) #RNN State - size of cell state and hidden state
model.add(tf.keras.layers.Dropout(0.2))

In [0]:
model.output

Use Dense layer for output layer

In [0]:
model.add(tf.keras.layers.Dense(1,activation='sigmoid'))

In [0]:
#Compile the model
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

In [0]:
model.summary()
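
The parameter counts in the summary can be cross-checked by hand. The sketch below assumes the layer sizes used above (10,001 embedding rows, 300-dimensional vectors, 256 LSTM units, one sigmoid output) and a standard LSTM with bias terms; the function names are illustrative only.

```python
def embedding_params(vocab_rows, embed_dim):
    # One vector of size embed_dim per vocabulary row.
    return vocab_rows * embed_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # One weight per input per unit, plus one bias per unit.
    return input_dim * units + units


frozen = embedding_params(10000 + 1, 300)
trainable = lstm_params(300, 256) + dense_params(256, 1)
print(frozen, trainable)  # 3000300 570625
```

Because the embedding layer is frozen, only the LSTM and Dense weights are updated during training; the Dropout layers add no parameters.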

### Train Model

In [0]:
model.fit(X_train,y_train,
          epochs=20,
          batch_size=32,          
          validation_data=(X_test, y_test))
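
With 20 epochs, validation accuracy may peak well before the last epoch. If the return value of `model.fit` is captured, for example as `history = model.fit(...)`, the per-epoch metrics are available in `history.history`. The sketch below picks the best epoch from such a dictionary; `best_epoch` is a hypothetical helper, and the `val_accuracy` key is an assumption that holds when the model is compiled with `metrics=['accuracy']` in TensorFlow 2.x (older versions may use `val_acc`).

```python
def best_epoch(history, metric="val_accuracy"):
    """Return (1-based epoch number, value) of the highest metric value."""
    values = history[metric]
    best = max(range(len(values)), key=values.__getitem__)
    return best + 1, values[best]


print(best_epoch({"val_accuracy": [0.80, 0.86, 0.85]}))  # (2, 0.86)
```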