# TASK
## Deadline: March 31, 23:59.

Form for submitting the assignment: https://forms.gle/Bznaciv2MTy4kVL47

Using the entire dataset above (IMDb reviews), implement the following requirements:
1. Split the dataset into 80% train, 10% validation and 10% test.
2. Tokenize the texts and determine the vocabulary (in this task we work with word-level representations, NOT character-level); since the vocabulary can be very large, try to apply one of the techniques mentioned in the lab (10K-20K words would be a reasonable vocabulary size).
3. Turn the texts into vectors of the same length using the vocabulary index (choose a maximum length of about 500-1000 tokens).
4. Implement the following architecture:
    * an Embedding layer for the determined vocabulary, containing vectors of dimension 100
    * a dropout layer with probability 0.4
    * a 1D convolutional layer with 100 input channels and 128 output channels, kernel size 3 and padding 1; apply a [BatchNormalization](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html) layer with 128 features to the result; then apply the ReLU activation function, and finally a 1D max-pooling layer with kernel size 2.
    * a 1D convolutional layer with 128 input channels and 128 output channels, kernel size 5 and padding 2; apply a BatchNormalization layer with 128 features to the result; then apply the ReLU activation function, and finally a 1D max-pooling layer with kernel size 2.
    * a 1D convolutional layer with 128 input channels and 128 output channels, kernel size 5 and padding 2; apply a BatchNormalization layer with 128 features to the result; then apply the ReLU activation function, and finally a 1D max-pooling layer with kernel size 2.
    * on the output of the last layer, apply 1D average-pooling, obtaining for each channel the mean of all values in its corresponding vector
    * a feed-forward (linear) layer with input dimension 128 and 2 output nodes (for 0/1 classification)
5. Train the architecture using cross-entropy as the loss function and an optimizer of your choice. At the end of each epoch, evaluate the model on the validation data and save the weights of the best model found this way.
6. Evaluate the best model obtained on the test data.
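
The architecture in step 4 fixes how the sequence length shrinks from layer to layer, which in turn determines the window the final average-pooling has to cover. The sketch below is only an illustration: the helper names (`conv1d_out_len`, `maxpool1d_out_len`, `trace_lengths`) are hypothetical, and it assumes the default stride of 1 for the convolutions and a pooling stride equal to the kernel size.

```python
def conv1d_out_len(length, kernel_size, padding=0, stride=1, dilation=1):
    # Output-length formula for a 1D convolution.
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def maxpool1d_out_len(length, kernel_size):
    # Max-pooling with stride equal to the kernel size and no padding.
    return (length - kernel_size) // kernel_size + 1


def trace_lengths(seq_len):
    # (kernel_size, padding) for the three convolutional blocks of step 4.
    lengths = [seq_len]
    for kernel_size, padding in [(3, 1), (5, 2), (5, 2)]:
        seq_len = conv1d_out_len(seq_len, kernel_size, padding)
        seq_len = maxpool1d_out_len(seq_len, 2)
        lengths.append(seq_len)
    return lengths


print(trace_lengths(1000))  # [1000, 500, 250, 125]
print(trace_lengths(500))   # [500, 250, 125, 62]
```

With the chosen paddings each convolution keeps the length unchanged, so only the three pooling layers halve it. A maximum length of 1000 tokens leaves 125 positions per channel before the average-pooling step.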


In [1]:
from collections import Counter
import torch
from torchsummary import summary
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd
from tqdm import tqdm
import string
import re
from num2words import num2words
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
import nltk
from nltk import word_tokenize
from unidecode import unidecode
from pprint import pprint
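nltk.download('punkt')
nltk.download('stopwords')  # corpus used by stopwords.words('english') below
nltk.download('wordnet')    # corpus used by WordNetLemmatizer below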


[nltk_data] Downloading package punkt to /home/alhiris/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
from urllib.request import urlretrieve
urlretrieve('https://raw.githubusercontent.com/LawrenceDuan/IMDb-Review-Analysis/master/IMDb_Reviews.csv', 'IMDB_Dataset.csv')

('IMDB_Dataset.csv', <http.client.HTTPMessage at 0x7fbb382ecc10>)

In [3]:
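# 1. Split the data: 80% train, 10% test, 10% validation
data = pd.read_csv('IMDB_Dataset.csv')
data = data.dropna()

# First hold out 20% of the rows, then split that part in half.
train_df, test_df = train_test_split(data, test_size=0.20, random_state=42)
test_df, val_df = train_test_split(test_df, test_size=0.5, random_state=42)
data[:10]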


Unnamed: 0,review,sentiment
0,My family and I normally do not watch local mo...,1
1,"Believe it or not, this was at one time the wo...",0
2,"After some internet surfing, I found the ""Home...",0
3,One of the most unheralded great works of anim...,1
4,"It was the Sixties, and anyone with long hair ...",0
5,"For my humanities quarter project for school, ...",1
6,Arguebly Al Pacino's best role. He plays Tony ...,1
7,Being a big fan of Stanley Kubrick's Clockwork...,1
8,I reached the end of this and I was almost sho...,1
9,There is no doubt that Halloween is by far one...,1
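
The two-stage split above (hold out 20%, then halve it) yields an 80/10/10 partition. As a rough cross-check that needs no third-party library, the same proportions can be reproduced on plain row indices. The function `split_indices` is a hypothetical helper, and the row count of 50000 is taken from the progress bars printed further down (40000 + 5000 + 5000).

```python
import random


def split_indices(n_rows, seed=42):
    # Shuffle the row indices once, then cut them into 80% / 10% / 10%.
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = n_rows * 8 // 10
    n_val = n_rows // 10
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(50000)
print(len(train_idx), len(val_idx), len(test_idx))  # 40000 5000 5000
```

This does not reproduce the exact rows chosen by `train_test_split`; it only illustrates the sizes and the fact that the three parts are disjoint.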


In [4]:
# 2. Normalize the text, then tokenize it at word level
def preprocess_review(review):
    review_lower = review.lower()
    # Spell out numbers so that they become ordinary words.
    review_numbers = re.sub(r"((\d+\.)?\d+)", lambda x: num2words(x.group(0), lang="english") , review_lower)
    # Drop punctuation and collapse runs of whitespace.
    review_punctuation = review_numbers.translate(str.maketrans('', '', string.punctuation))
    review_punctuation = re.sub(r"\s+", ' ', review_punctuation)
    return review_punctuation

# Built once instead of on every call to tokenize_review.
LEMMATIZER = WordNetLemmatizer()
STOP_WORDS = set(stopwords.words('english'))

def tokenize_review(review):
    # Remove English stop words and lemmatize the remaining tokens.
    review_tokenized = word_tokenize(review)
    final_review = [LEMMATIZER.lemmatize(word) for word in review_tokenized if word not in STOP_WORDS]
    return final_review

def process_data(data):
    final_data = []
    for i in tqdm(range(len(data))):
        review = data[i]
        preprocessed_review = preprocess_review(review)
        tokenized_review = tokenize_review(preprocessed_review)
        final_data.append(tokenized_review)
    return final_data

x_train, y_train = process_data(train_df.review.tolist()), train_df.sentiment.tolist()
x_test, y_test = process_data(test_df.review.tolist()), test_df.sentiment.tolist()
x_val, y_val = process_data(val_df.review.tolist()), val_df.sentiment.tolist()


100%|██████████| 40000/40000 [01:01<00:00, 651.18it/s]
100%|██████████| 5000/5000 [00:07<00:00, 671.87it/s]
100%|██████████| 5000/5000 [00:07<00:00, 655.12it/s]
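
One side effect of stripping punctuation directly from the raw text is visible in the vocabulary printed below, where `br` is the most frequent token. A plausible cause, stated here as an assumption, is that the reviews contain HTML line breaks such as `<br />`, whose angle brackets and slash are removed while the letters survive. The sketch below shows one possible fix using only the standard library. `strip_html_then_punctuation` and `TAG_PATTERN` are hypothetical names and not part of the notebook.

```python
import re
import string

# Matches simple HTML tags such as <br /> or <i>.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_html_then_punctuation(text):
    text = text.lower()
    # Replace tags with a space so neighbouring words are not glued together.
    text = TAG_PATTERN.sub(" ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


sample = "Great film.<br /><br />Loved it!"
print(strip_html_then_punctuation(sample))  # great film loved it
```

Without the tag-removal step, the same sample would become `great filmbr br loved it`, which also hints at how merged tokens can inflate the raw vocabulary.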


In [10]:
def get_vocab(data):
    units = set([unit for review in data for unit in review])
    return units

def word_frequency(data, min_apparitions):
    all_words = [word for reviews in data for word in reviews]
    sorted_vocab = sorted(dict(Counter(all_words)).items(), key=lambda pair: pair[1], reverse=True)
    final_vocab = [k for k,v in sorted_vocab if v > min_apparitions]

    return final_vocab

total_words = get_vocab(x_train)
print(f'Total vocabulary size in data: {len(total_words)}')
print(list(total_words)[:10])

vocabulary = word_frequency(x_train, min_apparitions=16)
print(f'Vocabulary size in data: {len(vocabulary)}')
print(vocabulary[:10])

Total vocabulary size in data: 147956
['neelix', 'thiefturnedspy', 'subvert', 'eyea', 'influenced', 'barbershop', 'frommost', 'performingin', 'pengiun', 'witch']
Vocabulary size in data: 17683
['br', 'movie', 'film', 'one', 'like', 'time', 'good', 'character', 'even', 'get']
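
The frequency threshold used above keeps every word that appears more than `min_apparitions` times, so the final vocabulary size depends on the data. An alternative, sketched here only as an illustration, is to cap the vocabulary at a fixed number of words with `Counter.most_common`. The helper `top_k_vocabulary` and the toy reviews are hypothetical.

```python
from collections import Counter


def top_k_vocabulary(tokenized_reviews, max_size):
    # Count every token across all reviews and keep the max_size most frequent.
    counts = Counter(token for review in tokenized_reviews for token in review)
    return [token for token, _ in counts.most_common(max_size)]


toy_reviews = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
print(top_k_vocabulary(toy_reviews, max_size=2))  # ['good', 'movie']
```

Both approaches drop rare words. The cap gives direct control over the embedding table size, while the threshold gives direct control over how rare a kept word can be.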


In [11]:
# 3. Map tokens to vocabulary indices and pad/truncate to a fixed length
def vectorize_data(data, word_indices, one_hot=False):
    vectorized = []
    for sentence in data:
        indexed_sentence = [word_indices[word] if word in word_indices.keys() else word_indices['UNK'] for word in sentence]

        if one_hot:
            indexed_sentence = np.eye(len(word_indices))[indexed_sentence]

        vectorized.append(indexed_sentence)

    return vectorized

def make_padding(data, max_length=500):
    return torch.tensor([
        sentence[:max_length] + [1] * max(0, max_length - len(sentence))
        for sentence in data
    ])


word_indices = dict((word, index + 2) for index, word in enumerate(vocabulary))
indices_word = dict((index + 2, word) for index, word in enumerate(vocabulary))

indices_word[0] = 'UNK'
word_indices['UNK'] = 0

indices_word[1] = 'PAD'
word_indices['PAD'] = 1

vocabulary_size = len(indices_word)

x_train_vectorized = make_padding(vectorize_data(x_train, word_indices), max_length=1000)
x_test_vectorized = make_padding(vectorize_data(x_test, word_indices), max_length=1000)
x_val_vectorized = make_padding(vectorize_data(x_val, word_indices), max_length=1000)
print(x_train_vectorized.shape)

torch.Size([40000, 1000])
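
To make the indexing and padding scheme concrete, here is a small stand-alone sketch on a toy vocabulary. It follows the same convention as the cell above (index 0 for unknown words, index 1 for padding, real words starting at 2), but `build_index` and `encode` are hypothetical helpers rather than the notebook's functions.

```python
UNK, PAD = 0, 1  # same reserved indices as in the notebook


def build_index(vocabulary):
    # Real words start at index 2 because 0 and 1 are reserved.
    return {word: i + 2 for i, word in enumerate(vocabulary)}


def encode(tokens, word_to_index, max_length):
    # Unknown words map to UNK; long inputs are truncated, short ones padded.
    ids = [word_to_index.get(token, UNK) for token in tokens][:max_length]
    return ids + [PAD] * (max_length - len(ids))


index = build_index(["movie", "good"])
print(encode(["good", "movie", "zzz"], index, max_length=5))  # [3, 2, 0, 1, 1]
print(encode(["good"] * 7, index, max_length=5))              # [3, 3, 3, 3, 3]
```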


In [12]:
# 4. Dataset wrapper and the convolutional architecture
class Dataset(torch.utils.data.Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __getitem__(self, item):
        return self.data[item], self.labels[item]

    def __len__(self):
        return len(self.labels)

class Model(torch.nn.Module):
    def __init__(self, vocabulary_size):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocabulary_size, 100, padding_idx=1)
        self.dropout1 = torch.nn.Dropout(0.4)
        self.conv1 = torch.nn.Sequential(
            torch.nn.Conv1d(in_channels=100, out_channels=128, kernel_size=3, padding=1),
            torch.nn.BatchNorm1d(num_features=128),
            torch.nn.ReLU(),
            torch.nn.MaxPool1d(kernel_size=2)
        ) # 1000 -> 500
        self.conv2 = torch.nn.Sequential(
            torch.nn.Conv1d(in_channels=128, out_channels=128, kernel_size=5, padding=2),
            torch.nn.BatchNorm1d(num_features=128),
            torch.nn.ReLU(),
            torch.nn.MaxPool1d(kernel_size=2)
        ) # 500 -> 250
        self.conv3 = torch.nn.Sequential(
            torch.nn.Conv1d(in_channels=128, out_channels=128, kernel_size=5, padding=2),
            torch.nn.BatchNorm1d(num_features=128),
            torch.nn.ReLU(),
            torch.nn.MaxPool1d(kernel_size=2)
        ) # 250 -> 125
        self.average_layer = torch.nn.AvgPool1d(kernel_size=125) # 125 -> 1
        self.convolutions = torch.nn.Sequential(
            self.conv1,
            self.conv2,
            self.conv3,
            self.average_layer
        )
        self.flatten = torch.nn.Flatten()
        self.linear = torch.nn.Linear(in_features=128, out_features=2)

        self.classifier = torch.nn.Sequential(
            self.flatten,
            self.linear
        )

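    def forward(self, x):
        embeddings = self.embedding(x)            # (batch, seq_len, 100)
        embeddings = self.dropout1(embeddings)    # dropout with p=0.4, as required by the task
        embeddings = embeddings.permute(0, 2, 1)  # (batch, 100, seq_len), the layout Conv1d expects
        x = self.convolutions(embeddings)
        output = self.classifier(x)
        return output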

DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = Model(vocabulary_size=vocabulary_size).to(DEVICE)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss_fn = torch.nn.CrossEntropyLoss()
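
As a sanity check on the model size, the number of trainable parameters can be worked out by hand from the layer definitions above. The sketch below is plain arithmetic with hypothetical helper names. It assumes the vocabulary size printed earlier (17683 words plus the `UNK` and `PAD` entries) and counts only learnable weights, so the running statistics of the batch-norm layers are excluded.

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    # One kernel per (input channel, output channel) pair plus one bias per output channel.
    return in_channels * out_channels * kernel_size + out_channels


def batchnorm_params(num_features):
    # Learnable scale and shift per feature.
    return 2 * num_features


def model_params(vocabulary_size, embedding_dim=100, channels=128, n_classes=2):
    total = vocabulary_size * embedding_dim                                            # embedding table
    total += conv1d_params(embedding_dim, channels, 3) + batchnorm_params(channels)    # first block
    total += 2 * (conv1d_params(channels, channels, 5) + batchnorm_params(channels))   # blocks 2 and 3
    total += channels * n_classes + n_classes                                          # linear classifier
    return total


print(model_params(17683 + 2))  # 1972150
```

Roughly 90% of the parameters sit in the embedding table, which is one reason the vocabulary cut-off matters for this architecture.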



In [13]:
# 5. Training loop with per-epoch evaluation and best-model checkpointing
def validation_fn(net: torch.nn.Module, test_loader: DataLoader):
    net.eval()
    all_predictions = torch.tensor([])
    all_targets = torch.tensor([])
    for batch in test_loader:
        inputs, targets = batch
        inputs = inputs.long().to(DEVICE)
        targets = targets.to(DEVICE)

        with torch.no_grad():
            output = net(inputs)

        predictions = output.argmax(1)
        all_targets = torch.cat([all_targets, targets.detach().cpu()])
        all_predictions = torch.cat([all_predictions, predictions.detach().cpu()])

    val_acc = (all_predictions == all_targets).float().mean().numpy()
    return val_acc



def train_fn(epochs: int, train_loader: DataLoader, test_loader: DataLoader,
             net: torch.nn.Module, loss_fn: torch.nn.Module, optimizer: torch.optim.Optimizer):
    best_val_acc = 0
    for epoch_n in range(epochs):
        print(f"Epoch #{epoch_n + 1}")
        net.train()
        with tqdm(train_loader, unit='batch') as t_loader:
            for batch in t_loader:
                net.zero_grad()

                inputs, targets = batch
                inputs = inputs.long().to(DEVICE)
                targets = targets.to(DEVICE)

                output = net(inputs)
                loss = loss_fn(output, targets)

                loss.backward()
                optimizer.step()

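        # Validation: evaluate after every epoch and keep the best weights so far.
        # The loader passed in as `test_loader` plays the role of the validation set here.
        val_acc = validation_fn(net, test_loader)
        print(f'Epoch {epoch_n + 1} has accuracy {val_acc}')
        if val_acc > best_val_acc:
            torch.save(net.state_dict(), "./model")
            best_val_acc = val_acc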

    print("Best validation accuracy ", best_val_acc)

train_dataset = Dataset(x_train_vectorized, y_train)
test_dataset = Dataset(x_test_vectorized, y_test)
val_dataset = Dataset(x_val_vectorized, y_val)

train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=64, shuffle=False)
val_dataloader = DataLoader(val_dataset, batch_size=64, shuffle=False)


In [14]:
train_fn(
    epochs=20,
    train_loader=train_dataloader,
    test_loader=test_dataloader,
    net=model,
    loss_fn=loss_fn,
    optimizer=optimizer
)

Epoch #1


100%|██████████| 625/625 [00:14<00:00, 43.72batch/s]


Epoch 1 has accuracy 0.6776000261306763
Epoch #2


100%|██████████| 625/625 [00:14<00:00, 43.38batch/s]


Epoch 2 has accuracy 0.8740000128746033
Epoch #3


100%|██████████| 625/625 [00:14<00:00, 44.10batch/s]


Epoch 3 has accuracy 0.8808000087738037
Epoch #4


100%|██████████| 625/625 [00:14<00:00, 44.22batch/s]


Epoch 4 has accuracy 0.8697999715805054
Epoch #5


100%|██████████| 625/625 [00:14<00:00, 44.14batch/s]


Epoch 5 has accuracy 0.8745999932289124
Epoch #6


100%|██████████| 625/625 [00:14<00:00, 43.60batch/s]


Epoch 6 has accuracy 0.8777999877929688
Epoch #7


100%|██████████| 625/625 [00:14<00:00, 43.74batch/s]


Epoch 7 has accuracy 0.8557999730110168
Epoch #8


100%|██████████| 625/625 [00:14<00:00, 43.49batch/s]


Epoch 8 has accuracy 0.8781999945640564
Epoch #9


100%|██████████| 625/625 [00:14<00:00, 43.93batch/s]


Epoch 9 has accuracy 0.8614000082015991
Epoch #10


100%|██████████| 625/625 [00:14<00:00, 43.72batch/s]


Epoch 10 has accuracy 0.8546000123023987
Epoch #11


100%|██████████| 625/625 [00:14<00:00, 44.18batch/s]


Epoch 11 has accuracy 0.8772000074386597
Epoch #12


100%|██████████| 625/625 [00:14<00:00, 44.03batch/s]


Epoch 12 has accuracy 0.8791999816894531
Epoch #13


100%|██████████| 625/625 [00:14<00:00, 43.95batch/s]


Epoch 13 has accuracy 0.871999979019165
Epoch #14


100%|██████████| 625/625 [00:14<00:00, 43.89batch/s]


Epoch 14 has accuracy 0.8772000074386597
Epoch #15


100%|██████████| 625/625 [00:14<00:00, 43.20batch/s]


Epoch 15 has accuracy 0.8736000061035156
Epoch #16


100%|██████████| 625/625 [00:14<00:00, 44.03batch/s]


Epoch 16 has accuracy 0.8763999938964844
Epoch #17


100%|██████████| 625/625 [00:14<00:00, 43.62batch/s]


Epoch 17 has accuracy 0.8799999952316284
Epoch #18


100%|██████████| 625/625 [00:14<00:00, 44.16batch/s]


Epoch 18 has accuracy 0.8754000067710876
Epoch #19


100%|██████████| 625/625 [00:14<00:00, 43.66batch/s]


Epoch 19 has accuracy 0.8521999716758728
Epoch #20


100%|██████████| 625/625 [00:14<00:00, 43.62batch/s]


Epoch 20 has accuracy 0.8758000135421753
Best validation accuracy  0.8808
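
The checkpointing rule in `train_fn` overwrites the saved weights only when the accuracy strictly improves. Replaying that rule on the accuracies logged above (rounded to four decimals) shows which epochs actually triggered a save. `checkpoint_epochs` is a hypothetical helper written just for this illustration.

```python
# Per-epoch accuracies copied from the training log, rounded to four decimals.
val_accuracies = [
    0.6776, 0.8740, 0.8808, 0.8698, 0.8746, 0.8778, 0.8558, 0.8782, 0.8614, 0.8546,
    0.8772, 0.8792, 0.8720, 0.8772, 0.8736, 0.8764, 0.8800, 0.8754, 0.8522, 0.8758,
]


def checkpoint_epochs(accuracies):
    # Mirror the `if val_acc > best_val_acc` rule: save only on strict improvement.
    best, saved = 0.0, []
    for epoch, acc in enumerate(accuracies, start=1):
        if acc > best:
            best = acc
            saved.append(epoch)
    return saved, best


print(checkpoint_epochs(val_accuracies))  # ([1, 2, 3], 0.8808)
```

The saved model therefore comes from epoch 3, and the remaining 17 epochs never improved on it, which suggests that early stopping could shorten training here without changing the selected checkpoint.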


In [16]:
best_model = Model(vocabulary_size=vocabulary_size).to(DEVICE)
best_model.load_state_dict(torch.load("./model", map_location=DEVICE))
best_model.eval()

# 6. The split named "test" was used for model selection during training,
# so the split named "val" serves as the held-out test set here.
acc = validation_fn(best_model, val_dataloader)
print(f'Accuracy obtained on test data: {acc}')


Accuracy obtained on test data: 0.8848000168800354
