**Assignment 5**

Build CNN model for sentiment analysis (binary classification) of IMDB Reviews (https://www.kaggle.com/utathya/imdb-review-dataset). You can use data with label="unsup" for pretraining of embeddings. Here you are forbidden to use test dataset for pretraining of embeddings.

Your quality metric is accuracy score on test dataset. Look at "type" column for train/test split.

You can use pretrained embeddings from external sources.

You have to provide data for trials with different hyperparameter values.

You have to beat following baselines:

[3 points] acc = 0.75

[5 points] acc = 0.8

[8 points] acc = 0.9

[2 points] for using unsupervised data

In [1]:
import pandas as pd
import numpy as np
from sklearn.externals import joblib
import nltk
import gensim
import spacy

from sklearn import metrics

import torch as tt
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from torchtext.data import Field, LabelField, BucketIterator, TabularDataset, Iterator, Dataset, RawField

from tqdm import tqdm

SEED = 42
np.random.seed(SEED)



In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


In [0]:
import os
os.chdir('gdrive/My Drive/Colab Notebooks')

In [5]:
!head imdb.csv

,type,review,label,file
0,test,"Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.",neg,0_2.txt
1,test,"This is an example of why the majority of ac

In [10]:
data = pd.read_csv('imdb.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,type,review,label,file
0,0,test,Once again Mr. Costner has dragged out a movie...,neg,0_2.txt
1,1,test,This is an example of why the majority of acti...,neg,10000_4.txt
2,2,test,"First of all I hate those moronic rappers, who...",neg,10001_1.txt
3,3,test,Not even the Beatles could write songs everyon...,neg,10002_3.txt
4,4,test,Brass pictures (movies is not a fitting word f...,neg,10003_3.txt


In [0]:
data = data.drop(["file"], axis=1)
data = data.drop(["Unnamed: 0"], axis=1)

In [12]:
data.head()

Unnamed: 0,type,review,label
0,test,Once again Mr. Costner has dragged out a movie...,neg
1,test,This is an example of why the majority of acti...,neg
2,test,"First of all I hate those moronic rappers, who...",neg
3,test,Not even the Beatles could write songs everyon...,neg
4,test,Brass pictures (movies is not a fitting word f...,neg


In [0]:
test_data = data.loc[data["type"] == "test"].drop(["type"], axis=1)
train_data = data.loc[data["type"] == "train"].drop(["type"], axis=1)
train_data = train_data.loc[train_data["label"] != "unsup"]

In [0]:
train_data.to_csv("imdb_train.csv", encoding="utf-8")
test_data.to_csv("imdb_test.csv", encoding="utf-8")

In [0]:
import spacy


spacy_en = spacy.load('en')

def tokenizer(text): # create a tokenizer function
    return [tok.text for tok in spacy_en.tokenizer(text) if tok.text.isalpha()]            

In [23]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
classes={
    'neg': 0,
    'pos': 1
}

TEXT = Field(include_lengths=True, batch_first=True, 
             tokenize=tokenizer,
             eos_token='<eos>',
             lower=True,
             stop_words=nltk.corpus.stopwords.words('english'))

LABEL = LabelField(dtype=tt.int64, use_vocab=True, preprocessing=lambda x: classes[x])

train, test = TabularDataset.splits(path=".", train='imdb_train.csv', test="imdb_test.csv", format='csv',
               fields=[('id', None), ('review', TEXT), ('label', LABEL)], 
               skip_header=True)

In [0]:
TEXT.build_vocab(train,
                 max_size = 25000,
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train)

In [0]:
state = np.random.seed(SEED)
train, valid = train.split(0.7, stratified=True, random_state=state)

In [0]:
import random, torch
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

#stuff

In [0]:
classes={
    'neg':0,
    'unsup':1,
    'pos':2
}

TEXT = Field(include_lengths=True, batch_first=True, 
             tokenize=tokenizer,
             eos_token='<eos>',
             lower=True,
             stop_words=nltk.corpus.stopwords.words('english'))
LABEL = LabelField(dtype=tt.int64, use_vocab=True, preprocessing=lambda x: classes[x])

dataset = TabularDataset('imdb_fixed.csv', format='csv', 
                         fields=[(None, None), ('type', RawField()), ('review', TEXT),('label', LABEL), (None, None)], 
                         skip_header=True)

KeyboardInterrupt: ignored

In [0]:
# TEXT.build_vocab(dataset, min_freq=10, vectors="glove.6B.100d")
TEXT.build_vocab(dataset, min_freq=5)
len(TEXT.vocab.itos)

52275

In [0]:
LABEL.build_vocab(dataset)

In [0]:
LABEL.vocab.itos[:10]

[1, 0, 2]

In [0]:
train = Dataset(dataset.examples, dataset.fields, filter_pred= lambda x: x.type == 'train')

In [0]:
test = Dataset(dataset.examples, dataset.fields, filter_pred= lambda x: x.type == 'test')

In [0]:
train, valid = train.split(0.7, stratified=True, random_state=np.random.seed(SEED))

ValueError: ignored

In [30]:
np.unique([x.label for x in train.examples], return_counts=True)

(array([0, 1]), array([8750, 8750]))

In [31]:
np.unique([x.label for x in valid.examples], return_counts=True)

(array([0, 1]), array([3750, 3750]))

In [32]:
np.unique([x.label for x in test.examples], return_counts=True)

(array([0, 1]), array([12500, 12500]))

#continue

In [0]:
class MyModel(nn.Module):
    
    def __init__(self, vocab_size, embed_size, hidden_size, kernels):

        super(MyModel, self).__init__()
        
        self.embedding = nn.Embedding(vocab_size, embed_size)
        
        self.convs = nn.ModuleList([nn.Conv1d(embed_size, hidden_size, k, padding=5) for k in kernels])
        
        self.fc = nn.Linear(hidden_size * len(kernels), 3)
        
    def forward(self, x):
        
        x = self.embedding(x)
        x = x.transpose(1,2)
        
        concatenated = []
        for conv in self.convs:
            z = conv(x)
            z = F.avg_pool1d(z, kernel_size=z.size(2))
            z = z.squeeze(2)
            concatenated.append(z)
            
        x = tt.cat(concatenated, 1)
        x = self.fc(x)
        return x

In [0]:
tt.cuda.empty_cache()

batch_size = 32

model = MyModel(len(TEXT.vocab.itos),
                embed_size=100,
                hidden_size=128,
                kernels=[2,3,4,5]
               )

train_loader, valid_loader, test_loader = BucketIterator.splits(
    (train, valid, test),
    batch_sizes=(batch_size, batch_size, batch_size),
    shuffle=True,
    sort_key=lambda x: len(x.review),
#     sort_within_batch=True,
)

optimizer = optim.Adam(model.parameters())
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5, verbose=True, cooldown=5)
criterion = nn.CrossEntropyLoss()

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = model.to(device)
criterion = criterion.to(device)

In [0]:
def train_func(model, train_iterator, valid_iterator, criterion, optimizer, epochs):
  best_valid_loss = 10.
  acc_scores = []

  for n_epoch in range(epochs):
      
      train_losses = []
      valid_losses = []
      valid_targets = []
      valid_pred_class = []
      
      model.train()
      
      for xy in train_loader:

          x = xy.review[0]
          y = xy.label

          x = x.to(device)
          y = y.to(device)
          
          optimizer.zero_grad()
          
          pred = model(x)
          loss = criterion(pred, y)
          
          loss.backward()
          
          optimizer.step()
          
          train_losses.append(loss.item())

      model.eval()
      
      for xy in valid_loader:

          x = xy.review[0]
          y = xy.label

          x = x.to(device)

          with torch.no_grad():

              pred = model(x)

              pred = pred.cpu()

              valid_targets.append(y.numpy())
              valid_pred_class.append(np.argmax(pred, axis=1))

              loss = criterion(pred, y)

              valid_losses.append(loss.item())
          
      mean_valid_loss = np.mean(valid_losses)

      valid_targets = np.concatenate(valid_targets).squeeze()
      valid_pred_class = np.concatenate(valid_pred_class).squeeze()

      acc_score = metrics.accuracy_score(valid_targets, valid_pred_class)

      acc_scores.append(acc_score)
      
      print('Losses: train - {:.3f}, test - {:.3f}'.format(np.mean(train_losses), mean_valid_loss))
      print('Accuracy score - {:.3f}'.format(acc_score))
          
      # Early stopping:
      if mean_valid_loss < best_valid_loss:
          best_valid_loss = mean_valid_loss
      else:
          print('Early stopping')
          break
      return train_losses, acc_scores

In [109]:
losses_1, acc_scores_1 = train_func(model, train_loader, valid_loader, criterion, optimizer, epochs=3)

Losses: train - 0.659, test - 0.540
Accuracy score - 0.799


Теперь посмотрим на результаты тестовой выборки

In [0]:
def test_func(model, test_iterator):
  model.eval()

  test_targets = []
  test_pred_class = []

  for xy in test_iterator:
      x = xy.review[0]
      y = xy.label

      x = x.to(device)

      with tt.no_grad():

          pred = model(x)

          pred = pred.cpu()

          test_targets.append(y.numpy())
          test_pred_class.append(np.argmax(pred, axis=1))

  test_targets = np.concatenate(test_targets).squeeze()
  test_pred_class = np.concatenate(test_pred_class).squeeze()

  acc = metrics.accuracy_score(test_targets, test_pred_class)

  return acc

In [111]:
test_res_1 = test_func(model, test_loader)
print("Test accuracy: ", np.mean(test_res_1))

Test accuracy:  0.79112


Добавим Dropout = 0.3, накинем на результат фильтров два линейных слоя, вместо одного с активационной функцией ReLU

Также изменим следующие гиперпараметры: embed_size=300, hidden_size=150, 
kernels=[3,4,5], batch_size = 64

In [0]:
classes={
    'neg': 0,
    'pos': 1
}

TEXT = Field(include_lengths=True, batch_first=True, 
             tokenize=tokenizer,
             eos_token='<eos>',
             lower=True,
             stop_words=nltk.corpus.stopwords.words('english'))

LABEL = LabelField(dtype=tt.int64, use_vocab=True, preprocessing=lambda x: classes[x])

train, test = TabularDataset.splits(path=".", train='imdb_train.csv', test="imdb_test.csv", format='csv',
               fields=[('id', None), ('review', TEXT), ('label', LABEL)], 
               skip_header=True)

TEXT.build_vocab(train,
                 max_size = 25000,
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train)

state = np.random.seed(SEED)
train, valid = train.split(0.7, stratified=True, random_state=state)

In [0]:
class MyModel_2(nn.Module):
    
    def __init__(self, vocab_size, embed_size, hidden_size, kernels, dropout_rate):

        super(MyModel_2, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embed_size)

        self.out = nn.Dropout(dropout_rate)

        self.convs = nn.ModuleList([nn.Conv1d(embed_size, hidden_size, k, padding=5) for k in kernels])
        
        self.linear_1 = torch.nn.Linear(hidden_size * len(kernels), round((hidden_size * len(kernels))/2))
        self.relu = torch.nn.ReLU()
        self.linear_2 = torch.nn.Linear(round((hidden_size * len(kernels))/2), 2) 
        
    def forward(self, x):
        
        x = self.embedding(x)
        x = self.out(x)
        x = x.transpose(1,2)
        
        concatenated = []
        for conv in self.convs:
            z = conv(x)
            z = F.avg_pool1d(z, kernel_size=z.size(2))
            z = z.squeeze(2)
            concatenated.append(z)
            
        x = tt.cat(concatenated, 1)

        x = self.linear_1(x)
        x = self.relu(x)    
        x = self.linear_2(x)

        return x

In [0]:
tt.cuda.empty_cache()

batch_size = 64

model = MyModel_2(len(TEXT.vocab.itos),
                embed_size=300,
                hidden_size=150,
                kernels=[3,4,5],
                dropout_rate = 0.3
                )

train_loader, valid_loader, test_loader = BucketIterator.splits(
    (train, valid, test),
    batch_sizes=(batch_size, batch_size, batch_size),
    shuffle=True,
    sort_key=lambda x: len(x.review),
#     sort_within_batch=True,
)

optimizer = optim.Adam(model.parameters())
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5, verbose=True, cooldown=5)
criterion = nn.CrossEntropyLoss()

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = model.to(device)
criterion = criterion.to(device)

In [121]:
losses_2, acc_scores_2 = train_func(model, train_loader, valid_loader, criterion, optimizer, epochs = 3)

Losses: train - 0.711, test - 0.546
Accuracy score - 0.752


In [122]:
test_res_2 = test_func(model, test_loader)
print("Test accuracy: ", np.mean(test_res_2))

Test accuracy:  0.75564


Результаты ухудшились попробуем уменьшить количество обучаемых признаков, но увеличить количество эпох. Дополнительный линейный слой тоже уберем, dropout оставим, но уменьшим

In [0]:
class MyModel_3(nn.Module):
    
    def __init__(self, vocab_size, embed_size, hidden_size, kernels, dropout_rate):

        super(MyModel_3, self).__init__()
        
        self.embedding = nn.Embedding(vocab_size, embed_size)
        
        self.out = nn.Dropout(dropout_rate)
        
        self.convs = nn.ModuleList([nn.Conv1d(embed_size, hidden_size, k, padding=5) for k in kernels])
        
        self.fc = nn.Linear(hidden_size * len(kernels), 3)
        
    def forward(self, x):
        
        x = self.embedding(x)
        x = self.out(x)
        x = x.transpose(1,2)
        
        concatenated = []
        for conv in self.convs:
            z = conv(x)
            z = F.avg_pool1d(z, kernel_size=z.size(2))
            z = z.squeeze(2)
            concatenated.append(z)
            
        x = tt.cat(concatenated, 1)
        x = self.fc(x)
        return x

In [0]:
classes={
    'neg': 0,
    'pos': 1
}

TEXT = Field(include_lengths=True, batch_first=True, 
             tokenize=tokenizer,
             eos_token='<eos>',
             lower=True,
             stop_words=nltk.corpus.stopwords.words('english'))

LABEL = LabelField(dtype=tt.int64, use_vocab=True, preprocessing=lambda x: classes[x])

train, test = TabularDataset.splits(path=".", train='imdb_train.csv', test="imdb_test.csv", format='csv',
               fields=[('id', None), ('review', TEXT), ('label', LABEL)], 
               skip_header=True)

TEXT.build_vocab(train,
                 max_size = 25000,
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train)

state = np.random.seed(SEED)
train, valid = train.split(0.7, stratified=True, random_state=state)

In [0]:
tt.cuda.empty_cache()

batch_size = 32

model = MyModel_3(len(TEXT.vocab.itos),
                embed_size=100,
                hidden_size=128,
                kernels=[2,3,4,5],
                dropout_rate = 0.2
                )

train_loader, valid_loader, test_loader = BucketIterator.splits(
    (train, valid, test),
    batch_sizes=(batch_size, batch_size, batch_size),
    shuffle=True,
    sort_key=lambda x: len(x.review),
#     sort_within_batch=True,
)

optimizer = optim.Adam(model.parameters())
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5, verbose=True, cooldown=5)
criterion = nn.CrossEntropyLoss()

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = model.to(device)
criterion = criterion.to(device)

In [132]:
losses_4, acc_scores_4 = train_func(model, train_loader, valid_loader, criterion, optimizer, epochs = 10)

Losses: train - 0.667, test - 0.514
Accuracy score - 0.791


In [133]:
test_res_4 = test_func(model, test_loader)
print("Test accuracy: ", np.mean(test_res_4))

Test accuracy:  0.7844


Результат стал лучше, но все равно ниже, чем у первой модели

Теперь попробуем изменить ограничение у словаря, добавить еще один кернел = 6, увеличить размер эмбеддинга до 200

In [0]:
classes={
    'neg': 0,
    'pos': 1
}

TEXT = Field(include_lengths=True, batch_first=True, 
             tokenize=tokenizer,
             eos_token='<eos>',
             lower=True,
             stop_words=nltk.corpus.stopwords.words('english'))

LABEL = LabelField(dtype=tt.int64, use_vocab=True, preprocessing=lambda x: classes[x])

train, test = TabularDataset.splits(path=".", train='imdb_train.csv', test="imdb_test.csv", format='csv',
               fields=[('id', None), ('review', TEXT), ('label', LABEL)], 
               skip_header=True)

TEXT.build_vocab(train,
                 min_freq=10,
                 vectors="glove.6B.100d")

LABEL.build_vocab(train)

state = np.random.seed(SEED)
train, valid = train.split(0.7, stratified=True, random_state=state)

In [0]:
tt.cuda.empty_cache()

batch_size = 32

model = MyModel_3(len(TEXT.vocab.itos),
                embed_size=200,
                hidden_size=128,
                kernels=[2,3,4,5,6],
                dropout_rate = 0.2
                )

train_loader, valid_loader, test_loader = BucketIterator.splits(
    (train, valid, test),
    batch_sizes=(batch_size, batch_size, batch_size),
    shuffle=True,
    sort_key=lambda x: len(x.review),
#     sort_within_batch=True,
)

optimizer = optim.Adam(model.parameters())
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5, verbose=True, cooldown=5)
criterion = nn.CrossEntropyLoss()

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = model.to(device)
criterion = criterion.to(device)

In [141]:
losses_5, acc_scores_5 = train_func(model, train_loader, valid_loader, criterion, optimizer, epochs = 10)

Losses: train - 0.660, test - 0.474
Accuracy score - 0.824


In [142]:
test_res_5 = test_func(model, test_loader)
print("Test accuracy: ", np.mean(test_res_5))

Test accuracy:  0.81564


Данный результат - самый лучший из всех полученных