<a href="https://colab.research.google.com/github/alexsuakim/Natural-Language-Processing/blob/main/Neural_Probabilistic_Language_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We will focus on the IMDb dataset, a binary sentiment classification dataset with 25,000 highly polar movie reviews for training and 25,000 for testing. You can load the dataset in any of the following ways:

1. Refer to Kaggle (https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). You can load the dataset from the CSV file.

2. Refer to Torchtext (https://pytorch.org/text/stable/index.html). There is a built-in function named `torchtext.datasets.IMDB` (https://pytorch.org/text/stable/datasets.html#torchtext.datasets.IMDB). Torchtext versions differ considerably in features and functions, so you may need to use an earlier version.

3. Refer to Huggingface (https://huggingface.co/datasets/imdb). You can load this dataset directly with the datasets library.

4. Refer to Stanford (https://ai.stanford.edu/~amaas/data/sentiment/). This is the raw dataset.

If you choose method 4, you can download the dataset by adding this line to your Colab notebook:

```
! wget http://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz
```
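Once the archive is downloaded, the raw reviews still have to be unpacked and read. The following is a minimal sketch, assuming the archive extracts to an `aclImdb` folder with `train/pos`, `train/neg`, `test/pos` and `test/neg` subfolders of `.txt` files; the helper names `extract_archive` and `load_split` are made up for illustration.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest="."):
    """Unpack the .tar.gz archive into dest (expected to create dest/aclImdb)."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)


def load_split(root, split):
    """Read every review of one split; label 1 = positive, 0 = negative."""
    texts, labels = [], []
    for label_name, label in (("pos", 1), ("neg", 0)):
        folder = Path(root) / split / label_name
        for path in sorted(folder.glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels


# Usage once the archive has been downloaded:
# extract_archive("aclImdb_v1.tar.gz")
# train_texts, train_labels = load_split("aclImdb", "train")
# test_texts, test_labels = load_split("aclImdb", "test")
```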

###After loading the dataset, you will need to preprocess the text (e.g. tokenization, building the vocabulary, etc.). We will set the minimum token frequency threshold to 10. Then print out the size of your vocabulary.

- Special notes: you may need some special tokens like `<UNK>`, `<PAD>`, `<BOS>`, `<EOS>`. `<UNK>` represents tokens that cannot be found in our vocabulary. (Why do we need it?) `<PAD>` means padding, and `<BOS>` and `<EOS>` represent beginning-of-sentence and end-of-sentence, respectively.
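To make the role of the threshold and the special tokens concrete, here is a small library-free sketch. The helper names `build_vocab` and `encode`, the ordering of the special tokens, and the toy threshold of 2 are assumptions for illustration, not part of the assignment. Any token below the threshold, or never seen at all, is mapped to `<UNK>`, which is why that token is needed.

```python
from collections import Counter

# Assumed ordering: <PAD> gets index 0 so that padded positions are zeros.
SPECIALS = ["<PAD>", "<UNK>", "<BOS>", "<EOS>"]


def build_vocab(tokenized_texts, min_freq=10):
    """Map each token that occurs at least min_freq times to an integer index."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    kept = sorted(tok for tok, count in counts.items() if count >= min_freq)
    return {tok: idx for idx, tok in enumerate(SPECIALS + kept)}


def encode(tokens, vocab):
    """Convert tokens to indices, wrapping with <BOS>/<EOS> and using <UNK> for unknowns."""
    unk = vocab["<UNK>"]
    ids = [vocab["<BOS>"]]
    ids += [vocab.get(tok, unk) for tok in tokens]
    ids.append(vocab["<EOS>"])
    return ids


toy = [["good", "film"], ["good", "plot"], ["bad", "film"]]
vocab = build_vocab(toy, min_freq=2)     # only "film" and "good" occur twice
print(vocab)
print(encode(["good", "acting"], vocab))  # "acting" is unseen, so it becomes <UNK>
```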

###Build an appropriate embedding matrix based on your vocabulary and print out the size of this matrix.

###Finally, to prepare your data, build PyTorch dataloaders for model training and print out one batch of training data.

- To check whether your dataloader works, you can use `next(iter(train_dataloader))`. See https://pytorch.org/tutorials/beginner/basics/data_tutorial.html for details.
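One possible way to build such a dataloader is sketched below. It assumes reviews are already encoded as lists of integer indices and that index 0 is the padding index; `ReviewDataset` and `collate` are hypothetical names. Each batch is padded only to the length of its longest review, and the indices are kept as `torch.long` so they can be fed to an embedding layer.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

PAD_IDX = 0  # assumption: index reserved for <PAD>


class ReviewDataset(Dataset):
    """Holds variable-length index sequences and their integer labels."""

    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, i):
        return torch.tensor(self.sequences[i], dtype=torch.long), self.labels[i]


def collate(batch):
    """Right-pad the sequences of one batch to a common length."""
    seqs, labels = zip(*batch)
    padded = pad_sequence(seqs, batch_first=True, padding_value=PAD_IDX)
    return padded, torch.tensor(labels, dtype=torch.long)


toy_sequences = [[5, 8, 2], [7, 3], [4, 9, 6, 1]]
toy_labels = [1, 0, 1]
loader = DataLoader(ReviewDataset(toy_sequences, toy_labels),
                    batch_size=2, shuffle=False, collate_fn=collate)

x, y = next(iter(loader))
print(x)        # first two reviews, the shorter one padded with PAD_IDX
print(x.dtype)  # integer indices, suitable for an embedding lookup
print(y)
```

Note that `pad_sequence` pads at the end of each sequence, whereas Keras `pad_sequences` pads at the front by default.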

###We choose a bidirectional LSTM (BiLSTM) as the model. Train the model for 5 epochs with the embedding matrix you obtained earlier, and for each epoch, print out the training loss, training accuracy, testing loss and testing accuracy. You may choose any appropriate loss function and hyperparameter values.

- This is a challenge question. If you have difficulty understanding the structure of a BiLSTM, refer to the supplementary note named *notes_on_lstm* in tutorial 9 for details.

- You will want to use a GPU for this Colab notebook. Go to Edit > Notebook settings and select “GPU”.
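For readers who prefer to stay in PyTorch, the structure of a BiLSTM classifier can be sketched as follows. This is an illustrative sketch, not the reference solution: the class name `BiLSTMClassifier` and all layer sizes are assumptions. The final hidden states of the forward and backward directions are concatenated and passed to a linear layer that outputs one logit per class.

```python
import torch
from torch import nn


class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=64,
                 num_classes=2, pad_idx=0):
        super().__init__()
        # The embedding matrix has shape (vocab_size, embed_dim).
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)           # h_n: (2, batch, hidden_dim)
        final = torch.cat([h_n[0], h_n[1]], dim=1)  # forward and backward states
        return self.fc(final)                       # unnormalized class scores


clf = BiLSTMClassifier(vocab_size=1000)
batch = torch.randint(1, 1000, (4, 20))             # 4 fake reviews of 20 tokens
logits = clf(batch)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1, 0, 1, 0]))
print(logits.shape, loss.item())
```

With padded batches, the forward direction also reads the padding positions unless the sequences are packed (for example with `torch.nn.utils.rnn.pack_padded_sequence`), which this sketch omits for brevity.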

In [None]:
import numpy as np
import pandas as pd
import io
import operator
import re
import string
import torch
import tensorflow as tf
from google.colab import files
from string import punctuation
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader, random_split
from keras.utils import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential, Model
from keras.layers import Input, Dense, Embedding, LSTM, SpatialDropout1D, Bidirectional, Activation, Concatenate, Flatten

In [None]:
#download IMDB dataset and upload to Google Colab
uploaded = files.upload()


Saving IMDB.csv to IMDB.csv


In [None]:
#save IMDB dataset as a dataframe
df = pd.read_csv(io.BytesIO(uploaded['IMDB.csv']))
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [None]:
#perform preprocessing (e.g. tokenization, build up vocabulary, etc.) on the text

#preprocessing
for idx, review in enumerate(df['review']):
  #convert to lowercase
  review = review.lower()

  #remove punctuation (iterating over a string yields characters)
  review = ''.join([char for char in review if char not in punctuation])

  #replace review with clean text (.loc avoids pandas chained assignment)
  df.loc[idx, 'review'] = review


#building vocabulary
#create a list with words in all the reviews combined
reviews_combined = ' '.join(df['review'])
words_list = reviews_combined.split()

#count how often each word occurs in the list
all_words = Counter(words_list)

#leave only the words with (count >= 10)
vocabulary = {word:count for (word,count) in all_words.items() if count >= 10}

#print out the size of your vocabulary
vocab_len = len(vocabulary)
print(f'the size of the vocabulary is: {vocab_len}')


the size of the vocabulary is: 30767


In [None]:
#Build an appropriate embedding matrix based on your vocabulary and print out the size of this matrix.
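#the Tokenizer below builds the mapping from words to integer indices

#NOTE: this loop does not modify df: it only rebinds the loop variable, and its
#condition keeps the words that are NOT in vocabulary. The frequency threshold
#is therefore not applied, and the Tokenizer below indexes every word.
for review in df['review']:
  review = ' '.join([word for word in review.split() if not word in vocabulary])

#create a tokenizer that maps each word to an integer index
tokenizer = Tokenizer(num_words=None, split=' ')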

#fit on texts
tokenizer.fit_on_texts(df['review'].values)

#convert texts to sequences of integers
embeddings = tokenizer.texts_to_sequences(df['review'].values)

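#pad shorter reviews with 0's (pad_sequences pads at the front by default)
embeddings = pad_sequences(embeddings)

#print out the shape of the padded index matrix (number of reviews x longest review);
#the embedding vectors themselves are learned by the Embedding layer of the model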
print(f'the size of the embeddings matrix is: {embeddings.shape}')
print(embeddings)


the size of the embeddings matrix is: (50000, 2469)
[[    0     0     0 ...   122  4018   501]
 [    0     0     0 ...  1900    73   223]
 [    0     0     0 ...    64    15   333]
 ...
 [    0     0     0 ... 23649     2  6058]
 [    0     0     0 ...    68   711    42]
 [    0     0     0 ...   782    10    17]]


In [None]:
#build up Pytorch dataloaders for model training and print out one batch of training data.

#change sentiment from str to int
#sentiment = [1 if (sentiment == 'positive') else 0 for sentiment in df['sentiment']]
sentiment = pd.get_dummies(df['sentiment']).values

#divide dataset into training and testing data
X_train, X_test, Y_train, Y_test = train_test_split(embeddings,sentiment, test_size = 0.2, random_state = 42)

#typecast to Tensor (note: torch.Tensor(...) produces float32 values)
X_train_torch = torch.Tensor(X_train)
Y_train_torch = torch.Tensor(Y_train)
X_test_torch = torch.Tensor(X_test)
Y_test_torch = torch.Tensor(Y_test)

#build TensorDataset
train_dataset = TensorDataset(X_train_torch, Y_train_torch)
test_dataset = TensorDataset(X_test_torch, Y_test_torch)

#DataLoader
train_dataloader = DataLoader(dataset = train_dataset, batch_size=1, shuffle = False)

#print out one batch of training data
next(iter(train_dataloader))

[tensor([[   0.,    0.,    0.,  ...,  209.,  342., 3894.]]),
 tensor([[1., 0.]])]

In [None]:
#Build a bidirectional LSTM model, train and test.

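#apply Bidirectional LSTM
model = Sequential()
#input_dim must exceed the largest token index; Tokenizer indices start at 1
model.add(Embedding(len(tokenizer.word_index) + 1, 128, input_length = embeddings.shape[1]))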
model.add(SpatialDropout1D(0.4))
model.add(Bidirectional(LSTM(10, dropout=0.2, recurrent_dropout=0.2)))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])

#Train the model for 5 epochs with the embedding matrix you obtained earlier.
#print training loss, training accuracy, testing loss, and testing accuracy for each epoch
model.fit(X_train, Y_train, epochs=5, batch_size=128, verbose=1, validation_data=(X_test,Y_test))



Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fdce08bad90>


###Implement the idea in the paper ***A Neural Probabilistic Language Model*** (https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf) to train a trigram model. We will use the Brown corpus from the nltk package as the dataset. Train the model for 5 epochs and print out the training loss, training accuracy, testing loss, and testing accuracy. You can use this code to download the corpus:

```
import nltk
nltk.download("brown")
from nltk.corpus import brown
```
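For reference, the scoring function from the paper can be written as follows for the trigram case, where the context is the two preceding words (notation follows Bengio et al., 2003):

$$
x = \big(C(w_{t-1}),\, C(w_{t-2})\big), \qquad
y = b + Wx + U \tanh(d + Hx), \qquad
P(w_t = i \mid w_{t-1}, w_{t-2}) = \frac{e^{y_i}}{\sum_{j} e^{y_j}}
$$

Here $C$ is the shared word embedding matrix, $H$ and $d$ are the hidden layer weights and bias, $U$ and $b$ are the output layer weights and bias, and $W$ holds the optional direct connections from the embeddings to the output, which the paper allows to be set to zero.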

In [None]:
#download Brown corpus
import nltk
nltk.download("brown")
from nltk.corpus import brown

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


In [None]:
#preprocessing
corpus = brown.sents()
words = [word.lower() for sentence in corpus
                      for word in sentence]
corpus_size = len(words)

#count frequency
frequency = Counter(words)

#sort by highest frequency
vocabulary = {word:idx for idx, (word, _) in enumerate(frequency.most_common())}

#input: the two preceding words (bigram context)
X = [(vocabulary[words[i]], vocabulary[words[i + 1]]) for i in range(corpus_size - 2)]
#output: the word that follows each bigram
Y = [vocabulary[words[i + 2]] for i in range(corpus_size - 2)]

#typecast to numpy array
X = np.array(X)
Y = np.array(Y)
#print out size of X & Y
print(X.shape)
print(Y.shape)

#split training and testing data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

(1161190, 2)
(1161190,)
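The shapes above follow from sliding a window of three words over the corpus: a sequence of N words yields N - 2 (context, target) pairs. A toy sketch of the same construction, with a made-up six-word sentence and a hypothetical helper `make_trigram_pairs`:

```python
def make_trigram_pairs(token_ids):
    """Pair every two consecutive tokens with the token that follows them."""
    n_pairs = len(token_ids) - 2
    contexts = [(token_ids[i], token_ids[i + 1]) for i in range(n_pairs)]
    targets = [token_ids[i + 2] for i in range(n_pairs)]
    return contexts, targets


toy_words = ["the", "cat", "sat", "on", "the", "mat"]
# Index words in order of first appearance (the notebook orders by frequency instead).
toy_vocab = {word: idx for idx, word in enumerate(dict.fromkeys(toy_words))}
ids = [toy_vocab[word] for word in toy_words]

contexts, targets = make_trigram_pairs(ids)
print(ids)       # [0, 1, 2, 3, 0, 4]
print(contexts)  # [(0, 1), (1, 2), (2, 3), (3, 0)]
print(targets)   # [2, 3, 0, 4]
```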


In [None]:
vocab_size = len(vocabulary)
embed_dim = 30        # embedding vector dimension
n = 2                # context size: number of preceding words (2 for a trigram model)
hidden_dim = 50       # num of hidden units

# input is a vector of integers
inputs = Input(shape=(n,), name="input")
embedding = Embedding(input_dim=vocab_size,
                      output_dim=embed_dim,
                      mask_zero=False, # index 0 is the most frequent word here, not padding
                      input_length=n,
                      name="embed")(inputs)
flatten = Flatten()(embedding)
hidden = Dense(hidden_dim, activation="tanh", name="hidden")(flatten)
outputs = Dense(vocab_size, activation="softmax", name="prob")(hidden)
model = Model(inputs=inputs, outputs=outputs, name="NPLM")

#fit model
model.compile(loss = 'sparse_categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
model.fit(X_train, Y_train, epochs=5, batch_size=128, verbose=1, validation_data=(X_test,Y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f35539b70a0>
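To get a feel for where the parameters of this architecture sit, the count can be worked out by hand. The sketch below uses the layer sizes from the cell above and a hypothetical vocabulary size of 50,000 (the actual value depends on `len(vocabulary)`); `nplm_param_count` is a made-up helper.

```python
def nplm_param_count(vocab_size, embed_dim, context_size, hidden_dim):
    """Count the weights and biases of the embedding, hidden and output layers."""
    embedding = vocab_size * embed_dim
    hidden = (context_size * embed_dim) * hidden_dim + hidden_dim
    output = hidden_dim * vocab_size + vocab_size
    return embedding, hidden, output


# vocab_size=50_000 is an assumed round number for illustration only.
embedding, hidden, output = nplm_param_count(
    vocab_size=50_000, embed_dim=30, context_size=2, hidden_dim=50)
print(embedding, hidden, output, embedding + hidden + output)
# 1500000 3050 2550000 4053050
```

Under these assumptions almost all parameters are in the embedding table and the softmax output layer, both of which scale with the vocabulary size.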