### Assignment 2: Text Classification

In [1]:
! ls

Dockerfile		       image.tar
README.MD		       jigsaw-incredibly-simple-naive-bayes-0-768.ipynb
ReplaceDataset.ipynb	       model.py
__pycache__		       project
baseline_model.ckpt	       requirements.txt
baseline_solution.ipynb        run.sh
classify_text_with_bert.ipynb  trainer.py
data			       utils.py
data_preprocessing.py	       venv
gensim-data


In [2]:
# %load_ext autoreload
# %autoreload 2
import os
import csv
from random import seed
from pathlib import Path
from itertools import chain
import torch
from tqdm import tqdm
from IPython.display import HTML, display
import gensim.downloader as api
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pack_sequence
from data_preprocessing import (read_data, read_test, Tokenizer, TextDataset,
                                Vocab, train_test_split)
from utils import show_example
from model import prepare_emb_matrix, RecurrentClassifier
from trainer import Trainer
# from google.colab import files

Toxicity in online conversations, such as insults, threats and hate speech, is a real threat to the productive sharing of opinions. An automatic comment filtering system can help mitigate this problem.
In this assignment you are provided with data collected by [Jigsaw](https://jigsaw.google.com/) from Wikipedia's talk page edits. Each comment is labeled with a toxicity rating from 0 to 5. Here are some examples of the least toxic comments.

In [3]:
data = read_data()
seed(4)

# change this at your own risk 
ratings_to_show = (0, 1, 2)
display(HTML(show_example(data, ratings=ratings_to_show)))       

Toxicity,Comment
😇,legal action wikipedia has a policy wp nlt and i agree with it completely however they do add the following if you must take legal action we cannot prevent you from doing so however it is required that you do not edit wikipedia until the legal matter has been resolved to ensure that all legal processes happen via proper legal channels you should instead contact the person or people involv...
😐,don t be a douche towards me
😧,ok now you re pissing me off stop accusing me of things that aren t real what in the his noddlinesses name are you talking about i repeat in vulgar and block worthy language because your irrationality is making you impossible i don t give a shit about the reference to affirmative action i ve said it three times repeated it in the post you are refering to and still you went on to make...


The task is to build a classifier. Let's create a baseline recurrent model, starting with building a vocabulary.

In [4]:
# Press shift-tab to check docstrings
tok = Tokenizer()
tok_texts = [tok.tokenize(t) for t in chain(*data.values())]
vocab = Vocab(tok_texts, max_vocab_size=30000)
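
The internals of `Vocab` live in `data_preprocessing.py` and are not shown here. As a rough illustration of what a size-capped vocabulary typically does, here is a minimal sketch. The names `build_vocab`, `vectorize`, `PAD` and `UNK` are hypothetical and the real class may behave differently.

```python
from collections import Counter

# Hypothetical special tokens; the real Vocab may use other names or ids.
PAD, UNK = "<pad>", "<unk>"


def build_vocab(tokenized_texts, max_vocab_size):
    """Map the most frequent tokens to integer ids, reserving ids for PAD and UNK."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    keep = max_vocab_size - 2  # leave room for the two special tokens
    itos = [PAD, UNK] + [tok for tok, _ in counts.most_common(keep)]
    return {tok: idx for idx, tok in enumerate(itos)}


def vectorize(tokens, stoi):
    """Replace each token with its id; unseen tokens fall back to UNK."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]


toy_texts = [["you", "are", "great"], ["you", "are", "wrong"], ["you", "rock"]]
stoi = build_vocab(toy_texts, max_vocab_size=4)
ids = vectorize(["you", "are", "awful"], stoi)
print(stoi)  # {'<pad>': 0, '<unk>': 1, 'you': 2, 'are': 3}
print(ids)   # [2, 3, 1]
```

Capping the vocabulary keeps the embedding matrix small and sends rare tokens to a single unknown id, which is a common trade-off for noisy user-generated text.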

Then the data is split into train and validation parts.

In [4]:
train_texts, train_labels, val_texts, val_labels = train_test_split(data)
train_dataset = TextDataset([tok.tokenize(t) for t in train_texts], train_labels, vocab)
val_dataset = TextDataset([tok.tokenize(t) for t in val_texts], val_labels, vocab)
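
How `train_test_split` divides the data is defined in `data_preprocessing.py`. The sketch below shows one plausible approach, assuming `data` is a dict that maps each rating to a list of comments (suggested by the `data.values()` call above, but not confirmed). The function `split_by_rating` and its parameters are hypothetical.

```python
import random


def split_by_rating(data, val_fraction=0.2, rng_seed=0):
    """Split each rating's comments separately so both parts keep the class balance."""
    rng = random.Random(rng_seed)
    train_texts, train_labels, val_texts, val_labels = [], [], [], []
    for rating, texts in data.items():
        texts = list(texts)  # copy so the caller's data is not shuffled in place
        rng.shuffle(texts)
        n_val = int(len(texts) * val_fraction)
        val_texts += texts[:n_val]
        val_labels += [rating] * n_val
        train_texts += texts[n_val:]
        train_labels += [rating] * (len(texts) - n_val)
    return train_texts, train_labels, val_texts, val_labels


toy = {0: [f"clean {i}" for i in range(10)], 5: [f"toxic {i}" for i in range(5)]}
tr_x, tr_y, va_x, va_y = split_by_rating(toy)
print(len(tr_x), len(va_x))  # 12 3
```

Splitting per rating matters when some ratings are rare, since a purely random split could leave the validation part without any examples of them.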

Then pretrained embeddings are obtained with Gensim, which will automatically download them for you. [Here](https://github.com/RaRe-Technologies/gensim-data#models) you can see other pretrained embeddings.

In [None]:
# store embeddings in current directory
os.environ["GENSIM_DATA_DIR"] = str(Path.cwd())
# will download embeddings or load them from disk
gensim_model = api.load("glove-wiki-gigaword-100")
emb_matrix = prepare_emb_matrix(gensim_model, vocab)
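
`prepare_emb_matrix` is defined in `model.py`. Its exact behaviour is not shown, but the usual idea is to align pretrained vectors with the vocabulary ids. The sketch below uses plain Python lists and a toy dict in place of the Gensim model. `build_emb_matrix` and its handling of missing tokens are assumptions for illustration only.

```python
import random


def build_emb_matrix(pretrained, stoi, dim, rng_seed=0):
    """Return (matrix, n_found): row i holds the vector for the token with id i."""
    rng = random.Random(rng_seed)
    matrix = [[0.0] * dim for _ in range(len(stoi))]
    n_found = 0
    for tok, idx in stoi.items():
        if tok in pretrained:
            # Copy the pretrained vector into the row of this token id.
            matrix[idx] = list(pretrained[tok])
            n_found += 1
        elif tok != "<pad>":
            # Assumption: tokens without a pretrained vector get small random values.
            matrix[idx] = [rng.gauss(0.0, 0.1) for _ in range(dim)]
    return matrix, n_found


toy_pretrained = {"you": [0.1, 0.2, 0.3], "are": [0.4, 0.5, 0.6]}
toy_stoi = {"<pad>": 0, "<unk>": 1, "you": 2, "are": 3, "douche": 4}
emb, n_found = build_emb_matrix(toy_pretrained, toy_stoi, dim=3)
print(n_found)  # 2
print(emb[2])   # [0.1, 0.2, 0.3]
```

Checking how many vocabulary tokens were actually found in the pretrained model is a cheap sanity check: a low hit rate usually points to a tokenization mismatch.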



Now let's define hyperparameters for our baseline model. It'll be a 2-layer bidirectional LSTM.

In [6]:
config = {
    "freeze": True,
    "cell_type": "LSTM",
    "cell_dropout": 0.3,
    "num_layers": 2,
    "hidden_size": 128,
    "out_activation": "relu",
    "bidirectional": True,
    "out_dropout": 0.2,
    "out_sizes": [200],
}

trainer_config = {
    "lr": 3e-4,
    "n_epochs": 10,
    "weight_decay": 1e-6,
    "batch_size": 128,
    "device": "cuda" if torch.cuda.is_available() else "cpu"
}
clf_model = RecurrentClassifier(config, vocab, emb_matrix)

In [7]:
train_dataloader = DataLoader(train_dataset, 
                              batch_size=trainer_config["batch_size"],
                              shuffle=True,
                              num_workers=2,
                              collate_fn=train_dataset.collate_fn)
val_dataloader = DataLoader(val_dataset, 
                            batch_size=trainer_config["batch_size"],
                            shuffle=False,
                            num_workers=2,
                            collate_fn=val_dataset.collate_fn)
t = Trainer(trainer_config)


In [None]:
t.fit(clf_model, train_dataloader, val_dataloader)

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=231.0), HTML(value='')))

Let's save the model, load it from the checkpoint and check it on some comments.

In [11]:
t.save("baseline_model.ckpt")

In [12]:
t = Trainer.load("baseline_model.ckpt")

In [13]:
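def predict_toxicity(model, comment):
    """Print the predicted toxicity rating for a single raw comment."""
    model.eval()  # disable dropout so the prediction is deterministic
    tok_text = tok.tokenize(comment)
    indexed_text = torch.tensor(vocab.vectorize(tok_text)).to(t.device)
    with torch.no_grad():  # no gradients are needed at inference time
        rating = model(pack_sequence([indexed_text])).argmax().item()
    print(f"Toxicity rating for \"{comment}\" is: {rating}")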


In [14]:
predict_toxicity(t.model, "Please sir do not delete my edits")
predict_toxicity(t.model, "They are nazi pal, forget it")
predict_toxicity(t.model, "You suck")

Toxicity rating for "Please sir do not delete my edits" is: 0
Toxicity rating for "They are nazi pal, forget it" is: 0
Toxicity rating for "You suck" is: 0


Now let's prepare a submission file

In [15]:
test_uuids, test_texts = read_test("data/test.csv")
test_dataloader = DataLoader( TextDataset([tok.tokenize(t) for t in test_texts], [-1] * len(test_texts), vocab), 
                            batch_size=trainer_config["batch_size"],
                            shuffle=False,
                            num_workers=2,
                            collate_fn=val_dataset.collate_fn)

predictions = t.predict(test_dataloader)

In [16]:
def save_test_predictions(test_predictions, path):
    assert len(test_predictions) == len(test_texts)
    with open(path, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')
        writer.writerow(["uuid","comment_text","toxicity"])
        for uuid, text, pred in zip(test_uuids, test_texts, test_predictions):
            writer.writerow([uuid, text, pred])
        

In [17]:
save_test_predictions(predictions, "./best_results.csv")

# Evaluation

Note that generally not all errors are equal: for example, predicting score 0 while the target is 5 is seemingly worse than predicting 3 instead of 4. That's why your model will be evaluated by [MSE](https://en.wikipedia.org/wiki/Mean_squared_error) loss used for regression tasks. Also note that your model is not required to predict integers, scores like 0.31, 2.718 etc. are fine too. Good luck!
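
To make the point about unequal errors concrete, here is a small sketch of the MSE computation. It also shows one possible way to turn class scores into a fractional prediction by taking the expected rating under a softmax. The helper `expected_rating` and the assumption of six output classes (ratings 0 to 5) are illustrative and not part of the provided code.

```python
import math


def mse(targets, predictions):
    """Mean squared error between two equally long sequences of numbers."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def expected_rating(logits):
    """Softmax the class scores and return the probability-weighted mean rating."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return sum(k * e / total for k, e in enumerate(exps))


print(mse([5], [0]))  # 25.0
print(mse([4], [3]))  # 1.0

# A model that is torn between ratings 4 and 5 yields a prediction between them.
print(expected_rating([-9.0, -9.0, -9.0, -9.0, 2.0, 2.0]))
```

Predicting 0 for a target of 5 costs 25 times more than predicting 3 for a target of 4, so under MSE a hedged fractional score can beat a confident but wrong integer.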

### Optional part: automatic hyperparameter tuning with Optuna

Optuna is a framework which lets you easily tune the hyperparameters of your model. In this part we'll use it to improve quality on validation data: we'll select the model with the best accuracy.

In [16]:
from optuna import create_study
from pprint import pprint


BEST_ACC = 0.0

def objective(trial):
    global BEST_ACC
    
    n_hidden_layers = trial.suggest_int("n_hidden_layers", 0, 3)
    hidden_layer_size = trial.suggest_int("hidden_layer_size", 10, 1000)
    
    config = {
        "freeze": True,
        "cell_type": trial.suggest_categorical("cell_type", ["RNN", "LSTM", "GRU"]),
        "cell_dropout": trial.suggest_loguniform("cell_dropout", 1e-9, 0.9),
        "num_layers": trial.suggest_int("num_layers", 1, 3),
        "hidden_size": trial.suggest_int("hidden_size", 10, 1000),
        "out_activation": trial.suggest_categorical("out_activation", 
                                                    ["sigmoid", "tanh", "relu", "elu"]),
        "bidirectional": trial.suggest_categorical("bidirectional", [True, False]),
        "out_dropout": trial.suggest_loguniform("out_dropout", 1e-9, 0.9),
        "out_sizes": [hidden_layer_size] * n_hidden_layers,
    }

    trainer_config = {
        "lr": trial.suggest_loguniform("lr", 1e-5, 1e-3),
        "n_epochs": 10,
        "weight_decay": trial.suggest_loguniform("weight_decay", 1e-9, 1e-1),
        "batch_size": 128,
        "device": "cuda" if torch.cuda.is_available() else "cpu",
        "verbose": False,
    }
    
    pprint({**config, **trainer_config})
        
    clf_model = RecurrentClassifier(config, vocab, emb_matrix)
    t = Trainer(trainer_config)
    t.fit(clf_model, train_dataloader, val_dataloader)
    val_acc =  t.history["val_acc"][-1]
    if val_acc > BEST_ACC:
        BEST_ACC = val_acc
        t.save("optuna_model.ckpt")
    return val_acc

In [None]:
study = create_study(direction="maximize")
# you can set more trials
study.optimize(objective, n_trials=10)

[I 2022-03-19 23:26:02,500] A new study created in memory with name: no-name-dc881e23-bbac-4108-b200-840bf137bf79


{'batch_size': 128,
 'bidirectional': True,
 'cell_dropout': 1.363200811191809e-09,
 'cell_type': 'LSTM',
 'device': 'cuda',
 'freeze': True,
 'hidden_size': 945,
 'lr': 0.0002709704518031192,
 'n_epochs': 10,
 'num_layers': 1,
 'out_activation': 'sigmoid',
 'out_dropout': 0.0001898671521172181,
 'out_sizes': [],
 'verbose': False,
 'weight_decay': 2.659101704286755e-09}


[I 2022-03-20 00:09:08,492] Trial 0 finished with value: 0.6824389100074768 and parameters: {'n_hidden_layers': 0, 'hidden_layer_size': 311, 'cell_type': 'LSTM', 'cell_dropout': 1.363200811191809e-09, 'num_layers': 1, 'hidden_size': 945, 'out_activation': 'sigmoid', 'bidirectional': True, 'out_dropout': 0.0001898671521172181, 'lr': 0.0002709704518031192, 'weight_decay': 2.659101704286755e-09}. Best is trial 0 with value: 0.6824389100074768.


{'batch_size': 128,
 'bidirectional': True,
 'cell_dropout': 1.737630799239919e-07,
 'cell_type': 'LSTM',
 'device': 'cuda',
 'freeze': True,
 'hidden_size': 403,
 'lr': 0.00012744432100486,
 'n_epochs': 10,
 'num_layers': 1,
 'out_activation': 'elu',
 'out_dropout': 1.4000305680633312e-09,
 'out_sizes': [],
 'verbose': False,
 'weight_decay': 0.09296283365311885}


[I 2022-03-20 00:21:12,200] Trial 1 finished with value: 0.5399115085601807 and parameters: {'n_hidden_layers': 0, 'hidden_layer_size': 30, 'cell_type': 'LSTM', 'cell_dropout': 1.737630799239919e-07, 'num_layers': 1, 'hidden_size': 403, 'out_activation': 'elu', 'bidirectional': True, 'out_dropout': 1.4000305680633312e-09, 'lr': 0.00012744432100486, 'weight_decay': 0.09296283365311885}. Best is trial 0 with value: 0.6824389100074768.


{'batch_size': 128,
 'bidirectional': True,
 'cell_dropout': 7.095218214633027e-05,
 'cell_type': 'LSTM',
 'device': 'cuda',
 'freeze': True,
 'hidden_size': 500,
 'lr': 7.329019208213213e-05,
 'n_epochs': 10,
 'num_layers': 2,
 'out_activation': 'elu',
 'out_dropout': 0.0001300590300630621,
 'out_sizes': [],
 'verbose': False,
 'weight_decay': 3.411667708190876e-06}


[I 2022-03-20 01:10:00,920] Trial 2 finished with value: 0.666089653968811 and parameters: {'n_hidden_layers': 0, 'hidden_layer_size': 813, 'cell_type': 'LSTM', 'cell_dropout': 7.095218214633027e-05, 'num_layers': 2, 'hidden_size': 500, 'out_activation': 'elu', 'bidirectional': True, 'out_dropout': 0.0001300590300630621, 'lr': 7.329019208213213e-05, 'weight_decay': 3.411667708190876e-06}. Best is trial 0 with value: 0.6824389100074768.


{'batch_size': 128,
 'bidirectional': True,
 'cell_dropout': 0.0002312210196735422,
 'cell_type': 'LSTM',
 'device': 'cuda',
 'freeze': True,
 'hidden_size': 640,
 'lr': 3.9504684560636326e-05,
 'n_epochs': 10,
 'num_layers': 2,
 'out_activation': 'relu',
 'out_dropout': 8.066887477400993e-06,
 'out_sizes': [297, 297],
 'verbose': False,
 'weight_decay': 1.6439079357128654e-06}


### Final prediction

In [None]:
t = Trainer.load("optuna_model.ckpt")

In [None]:
save_test_predictions(t.predict(test_dataloader), "./best_results_hparam_search.csv")