# TASK
## Deadline: March 31, 23:59.

Homework submission form: https://forms.gle/Bznaciv2MTy4kVL47

Using the entire dataset above (IMDb reviews), implement the following requirements:
1. Split the dataset into 80% train, 10% validation, and 10% test.
2. Tokenize the texts and build the vocabulary (in this task we work with word-level representations, NOT character-level ones); since the vocabulary can be very large, apply one of the techniques mentioned in the lab (10K-20K words would be a reasonable vocabulary size).
3. Convert the texts into vectors of equal length using the vocabulary index (choose a maximum length of roughly 500-1000 tokens).
4. Implement the following architecture:
    * an Embedding layer for the resulting vocabulary, holding vectors of dimension 100
    * a dropout layer with probability 0.4
    * a 1D convolutional layer with 100 input channels and 128 output channels, kernel size 3, and padding 1; apply a [BatchNormalization](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html) layer with 128 features to the result, then the ReLU activation function, and finally a 1D max-pooling layer with kernel size 2.
    * a 1D convolutional layer with 128 input channels and 128 output channels, kernel size 5, and padding 2; apply a BatchNormalization layer with 128 features to the result, then the ReLU activation function, and finally a 1D max-pooling layer with kernel size 2.
    * a 1D convolutional layer with 128 input channels and 128 output channels, kernel size 5, and padding 2; apply a BatchNormalization layer with 128 features to the result, then the ReLU activation function, and finally a 1D max-pooling layer with kernel size 2.
    * on the output of the last layer, apply 1D average pooling, obtaining for each channel the mean of all values in its corresponding vector
    * a feed-forward (linear) layer with input size 128 and 2 output nodes (for 0/1 classification)
5. Train the architecture using cross-entropy as the loss function and an optimizer of your choice. At the end of each epoch, evaluate the model on the validation data and save the weights of the best model found this way.
6. Evaluate the best model obtained on the test data.
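
The architecture in step 4 and the training procedure in step 5 could be sketched in PyTorch roughly as follows. This is a minimal sketch, not a reference solution: the names `ReviewCNN` and `train_and_select`, the use of Adam, the learning rate, the choice of validation accuracy as the selection metric, and the assumption that index 0 is the padding token are all illustrative choices that the task does not prescribe.

```python
import torch
from torch import nn


class ReviewCNN(nn.Module):
    """Sketch of the step-4 architecture; class and argument names are illustrative."""

    def __init__(self, vocab_size, embed_dim=100, channels=128, num_classes=2, pad_idx=0):
        super().__init__()
        # Assumption: token id `pad_idx` (0 here) is reserved for padding
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.dropout = nn.Dropout(p=0.4)

        def conv_block(in_ch, kernel_size, padding):
            # Conv1d -> BatchNorm1d -> ReLU -> MaxPool1d(2), in the order listed in the task
            return nn.Sequential(
                nn.Conv1d(in_ch, channels, kernel_size=kernel_size, padding=padding),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
                nn.MaxPool1d(kernel_size=2),
            )

        self.convs = nn.Sequential(
            conv_block(embed_dim, kernel_size=3, padding=1),
            conv_block(channels, kernel_size=5, padding=2),
            conv_block(channels, kernel_size=5, padding=2),
        )
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, token_ids):
        x = self.dropout(self.embedding(token_ids))  # (batch, seq_len, 100)
        x = x.transpose(1, 2)                        # (batch, 100, seq_len), as Conv1d expects
        x = self.convs(x)                            # (batch, 128, seq_len // 8)
        x = x.mean(dim=2)                            # average over positions, per channel
        return self.classifier(x)                    # (batch, 2) logits


def train_and_select(model, train_batches, val_batches, epochs=3, lr=1e-3):
    """Train with cross-entropy and keep the weights with the best validation accuracy."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer choice is an assumption
    loss_fn = nn.CrossEntropyLoss()
    best_acc, best_state = -1.0, None
    for _ in range(epochs):
        model.train()
        for x, y in train_batches:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        # Validation pass at the end of each epoch
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in val_batches:
                correct += (model(x).argmax(dim=1) == y).sum().item()
                total += y.numel()
        acc = correct / total
        if acc > best_acc:
            best_acc = acc
            # Copy the tensors so later epochs do not overwrite the snapshot
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    return best_state, best_acc
```

Here `train_batches` and `val_batches` are assumed to be iterables of `(token_ids, labels)` tensor pairs, for example a `DataLoader`. For step 6, the returned snapshot can be restored with `model.load_state_dict(best_state)` before scoring the test set, and written to disk with `torch.save(best_state, path)` for a path of your choosing.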


In [11]:
import operator

import torch
import pandas as pd
from tqdm import tqdm
import string
import re
from num2words import num2words
from nltk.corpus import stopwords
from pprint import pprint
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from unidecode import unidecode
from collections import Counter
import nltk
from nltk import word_tokenize
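nltk.download('stopwords', quiet=True)  # corpus required by stopwords.words('english')
nltk.download('wordnet', quiet=True)    # corpus required by WordNetLemmatizer
nltk.download('punkt')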

[nltk_data] Downloading package punkt to /home/alhiris/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [8]:
from urllib.request import urlretrieve
urlretrieve('https://raw.githubusercontent.com/LawrenceDuan/IMDb-Review-Analysis/master/IMDb_Reviews.csv', 'IMDB_Dataset.csv')

('IMDB_Dataset.csv', <http.client.HTTPMessage at 0x7f55d7cf6fa0>)

In [30]:
# 1: 80% train; the remaining 20% is split evenly into test and validation
data = pd.read_csv('IMDB_Dataset.csv')
data = data.dropna()

train_df, holdout_df = train_test_split(data, test_size=0.20, random_state=42)
test_df, val_df = train_test_split(holdout_df, test_size=0.5, random_state=42)
data[:10]


50000
50000


Unnamed: 0,review,sentiment
0,My family and I normally do not watch local mo...,1
1,"Believe it or not, this was at one time the wo...",0
2,"After some internet surfing, I found the ""Home...",0
3,One of the most unheralded great works of anim...,1
4,"It was the Sixties, and anyone with long hair ...",0
5,"For my humanities quarter project for school, ...",1
6,Arguebly Al Pacino's best role. He plays Tony ...,1
7,Being a big fan of Stanley Kubrick's Clockwork...,1
8,I reached the end of this and I was almost sho...,1
9,There is no doubt that Halloween is by far one...,1


In [42]:
# 2
# Built once instead of on every call to tokenize_review / preprocess_review
LEMMATIZER = WordNetLemmatizer()
STOP_WORDS = set(stopwords.words('english'))
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def preprocess_review(review):
    """Lowercase, spell out numbers, strip punctuation, and collapse whitespace."""
    review_lower = review.lower()
    # Replace integers and decimals with their English words
    review_numbers = re.sub(r"((\d+\.)?\d+)", lambda x: num2words(x.group(0), lang="english") , review_lower)
    # Note: punctuation is deleted, not replaced by a space, so "<br />" becomes "br"
    review_punctuation = review_numbers.translate(PUNCTUATION_TABLE)
    review_punctuation = re.sub(r"\s+", ' ', review_punctuation)
    return review_punctuation

def tokenize_review(review):
    """Tokenize, drop English stop words, and lemmatize the remaining tokens."""
    review_tokenized = word_tokenize(review)
    return [LEMMATIZER.lemmatize(word) for word in review_tokenized if word not in STOP_WORDS]

def process_data(data):
    """Preprocess and tokenize every review, showing a progress bar."""
    final_data = []
    for review in tqdm(data):
        final_data.append(tokenize_review(preprocess_review(review)))
    return final_data

x_train, y_train = process_data(train_df.review.tolist()), train_df.sentiment.tolist()
x_test, y_test = process_data(test_df.review.tolist()), test_df.sentiment.tolist()
x_val, y_val = process_data(val_df.review.tolist()), val_df.sentiment.tolist()


100%|██████████| 40000/40000 [01:01<00:00, 648.60it/s]
100%|██████████| 5000/5000 [00:07<00:00, 652.53it/s]
100%|██████████| 5000/5000 [00:07<00:00, 647.27it/s]


In [57]:
def get_vocab(data):
    """Return the set of distinct tokens across all reviews."""
    return {unit for review in data for unit in review}

def word_frequency(data, min_apparitions):
    """Return words occurring more than min_apparitions times, most frequent first."""
    counts = Counter(word for review in data for word in review)
    # most_common() sorts by descending count; the threshold is strict (count > min_apparitions)
    return [word for word, count in counts.most_common() if count > min_apparitions]

total_words = get_vocab(x_train)
print(f'Total vocabulary size in data: {len(total_words)}')
print(list(total_words)[:10])

vocabulary = word_frequency(x_train, min_apparitions=16)
print(f'Vocabulary size in data: {len(vocabulary)}')
print(vocabulary[:10])

Total vocabulary size in data: 147956
['lionsgate', 'hackes', 'pulpshould', 'hernandez', 'brommells', 'assetbr', 'smallvillebr', 'koppikar', 'earthtwo', 'mariska']
Vocabulary size in data: 17683
['br', 'movie', 'film', 'one', 'like', 'time', 'good', 'character', 'even', 'get']
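
The output above shows `'br'` as the most frequent token, along with fused tokens such as `'assetbr'` and `'pulpshould'`. These come from HTML `<br />` tags and from punctuation being deleted without leaving a space behind. One possible cleanup, sketched below with a hypothetical helper `clean_review` that is not part of the notebook, removes the tags first and then replaces punctuation with spaces. It leaves out the number-to-words step, and the vocabulary sizes printed above would change if it were used.

```python
import re
import string

BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
# Map every punctuation character to a space instead of deleting it
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_review(review):
    """Lowercase, drop <br> tags, and turn punctuation into spaces (hypothetical helper)."""
    text = BR_TAG.sub(" ", review.lower())
    text = text.translate(PUNCT_TO_SPACE)
    # Collapse the runs of spaces created by the two substitutions above
    return re.sub(r"\s+", " ", text).strip()


print(clean_review("Great asset.<br /><br />Pulp...should win!"))
```

A trade-off of this variant is that apostrophes also become spaces, so a contraction like "don't" is split into two tokens; whether that matters depends on the stop-word list and tokenizer used afterwards.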


In [None]:
# 3
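
Step 3 is left empty in the notebook. One way to approach it, as a minimal sketch under stated assumptions, is to map every word in `vocabulary` to an integer id and then truncate or right-pad each review to a fixed length. The special tokens `<pad>` and `<unk>`, the helper names `build_index` and `encode`, and the default `max_len=500` are illustrative choices (the task only asks for roughly 500-1000 tokens).

```python
PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens, not part of the original notebook


def build_index(vocabulary):
    """Map each word to an integer id, reserving 0 for padding and 1 for unknown words."""
    word2idx = {PAD: 0, UNK: 1}
    for word in vocabulary:
        # setdefault keeps the first id if a word appears twice in the list
        word2idx.setdefault(word, len(word2idx))
    return word2idx


def encode(tokens, word2idx, max_len=500):
    """Truncate or right-pad a token list to exactly max_len ids."""
    ids = [word2idx.get(tok, word2idx[UNK]) for tok in tokens[:max_len]]
    return ids + [word2idx[PAD]] * (max_len - len(ids))


word2idx = build_index(["movie", "film"])
print(encode(["film", "zzz", "movie"], word2idx, max_len=5))  # [3, 1, 2, 0, 0]
```

Applied to the notebook's data, this would look like `[encode(review, word2idx) for review in x_train]`, which can then be wrapped in `torch.tensor(...)`. With this scheme the embedding layer needs `len(word2idx)` rows, two more than the vocabulary size.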


