# Multilabel Classification
In multilabel classification, each text sequence is assigned a subset of a predefined set of labels. The subset may be empty or may contain every label. We use the Toxic Comments dataset, in which each text can carry any subset of the labels `toxic`, `severe_toxic`, `obscene`, `threat`, `insult`, and `identity_hate`.

## 1. Mounting The Drive And Navigating To The Resource Folder

The Toxic Comments dataset is stored at the path `data/multilabel_classification`.

In [3]:
cd /content/drive/MyDrive/Colab Notebooks/T5_Multilabel

/content/drive/MyDrive/Colab Notebooks/T5_Multilabel


In [4]:
import pandas as pd
import json
from sklearn.model_selection import train_test_split

Before you proceed, place the dataset in the expected location using the following steps:
1. Download the [Toxic Comments dataset](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/).
2. Extract the CSV files to `data/multilabel_classification`.

## 2. Preprocessing The Data

The inputs and outputs of a T5 model are always text. A particular task is specified by a prefix text that tells the model what it should do with the input. The input data format for a T5 model in Simple Transformers reflects this: the input is a Pandas dataframe with three columns: `prefix`, `input_text`, and `target_text`.
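As a rough illustration of this three-field format, the sketch below builds two records and shows the string a text-to-text model would receive. The rows are hypothetical, and the `prefix + ": " + input_text` join mirrors what the prediction code later in this notebook does by hand; `to_model_input` is an illustrative helper, not part of Simple Transformers.

```python
# Hypothetical example rows in the three-column format described above.
rows = [
    {
        "prefix": "multilabel classification",
        "input_text": "Thanks for fixing the citation.",
        "target_text": "clean",
    },
    {
        "prefix": "multilabel classification",
        "input_text": "You are a complete idiot.",
        "target_text": "toxic,insult",
    },
]


def to_model_input(row):
    """Join the task prefix and the raw text into one input string."""
    return f"{row['prefix']}: {row['input_text']}"


model_inputs = [to_model_input(row) for row in rows]
targets = [row["target_text"] for row in rows]

for source, target in zip(model_inputs, targets):
    print(f"{source!r} -> {target!r}")
```

Because both sides are plain strings, the same format can carry other tasks by changing only the prefix.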

In the following cell, we convert our data to training and evaluation dataframes with the `prefix` set to `multilabel classification`. The evaluation-to-training ratio is 1:9 (`test_size=0.1`). Once the dataframes are created, we cast every column to string so that all of the data in the dataframes is in text format.

In [4]:
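# Folder holding the extracted Kaggle CSV files (not the T5 task prefix)
data_dir = "data/multilabel_classification/"

multi_train_df = pd.read_csv(data_dir + 'train.csv')
# Replace newlines and tabs so that each comment fits on one line of the TSV file
multi_train_df["comment_text"] = multi_train_df["comment_text"].str.replace('\n', ' ').str.replace('\t', ' ')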

# Replace each binary label column with the label name (flag set) or "" (flag unset)
for col in multi_train_df.columns:
    if col not in ["id", "comment_text"]:
        multi_train_df[col] = multi_train_df[col].apply(lambda x: col if x else "")

# Join the six label columns with commas, drop the empty entries, and fall back to "clean"
multi_train_df["target_text"] = multi_train_df['toxic'].str.cat(multi_train_df[[col for col in multi_train_df.columns if col not in ["id", "comment_text", "toxic"]]], sep=',')
multi_train_df["target_text"] = multi_train_df["target_text"].apply(lambda x: ",".join(word for word in x.split(",") if word)).apply(lambda x: x if x else "clean")
multi_train_df["input_text"] = multi_train_df["comment_text"].str.replace('\n', ' ')
multi_train_df["prefix"] = "multilabel classification"
# Keep only the three columns that Simple Transformers expects
multi_train_df = multi_train_df[["prefix", "input_text", "target_text"]]

multi_train_df, multi_eval_df = train_test_split(multi_train_df, test_size=0.1)

multi_train_df.head()

| | prefix | input_text | target_text |
|---|---|---|---|
| 38130 | multilabel classification | I was mentioning surnames for comparison. In o... | clean |
| 134794 | multilabel classification | " Hi No, I am a Khalsa. I feel sorry for you... | clean |
| 106762 | multilabel classification | warnings????!?!?! I have numerous warnings f... | toxic |
| 156885 | multilabel classification | polar bears are completely purple and have sca... | clean |
| 103087 | multilabel classification | Sorted... i is gonna get u.... | clean |
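The conversion from six binary label columns to a single `target_text` string can be illustrated on one row at a time. The following is a minimal plain-Python sketch of the same idea, not the notebook's pandas code; `labels_to_target` and the example rows are hypothetical.

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def labels_to_target(row):
    """Return the active label names joined by commas, or "clean" if none is set."""
    active = [label for label in LABELS if row.get(label)]
    return ",".join(active) if active else "clean"


# Hypothetical rows; 1 marks a label as present, 0 as absent.
flagged = {
    "toxic": 1,
    "severe_toxic": 0,
    "obscene": 1,
    "threat": 0,
    "insult": 1,
    "identity_hate": 0,
}
unflagged = dict.fromkeys(LABELS, 0)

print(labels_to_target(flagged))    # toxic,obscene,insult
print(labels_to_target(unflagged))  # clean
```

Keeping the labels in a fixed order means that the same label set always maps to the same target string, which matters when predictions are later compared to the truth as whole strings.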


In [5]:
train_df = pd.concat([multi_train_df]).astype(str)
eval_df = pd.concat([multi_eval_df]).astype(str)

In [6]:
# Pass the separator by keyword; recent pandas versions reject it as a positional argument
train_df.to_csv("data/train.tsv", sep="\t")
eval_df.to_csv("data/eval.tsv", sep="\t")

## 3. Creating Pretrained Instance of T5 Model

We will use the [Simple Transformers library](https://github.com/ThilinaRajapakse/simpletransformers), which is built on [Hugging Face Transformers](https://github.com/huggingface/transformers), to train the T5 model.
The steps below install all the requirements.
- Install the Anaconda or Miniconda package manager from [here](https://www.anaconda.com/products/individual).
- Create a new virtual environment and install the packages.
  - `conda create -n simpletransformers python`
  - `conda activate simpletransformers`
  - `conda install pytorch cudatoolkit=10.1 -c pytorch`
- Install simpletransformers.
  - `pip install simpletransformers`

**NOTE** - The first two steps are necessary only if you run the files on your local system.


In [7]:
!pip install simpletransformers

Collecting simpletransformers
  Downloading https://files.pythonhosted.org/packages/35/ef/0b70ae95138064d665d9298c4d96afba2edf4b86dc44f762807ceb12668e/simpletransformers-0.61.4-py3-none-any.whl (213kB)
Collecting streamlit
  Downloading https://files.pythonhosted.org/packages/d9/99/a8913c21bd07a14f72658a01784414ffecb380ddd0f9a127257314fea697/streamlit-0.80.0-py2.py3-none-any.whl (8.2MB)
Collecting datasets
  Downloading https://files.pythonhosted.org/packages/54/90/43b396481a8298c6010afb93b3c1e71d4ba6f8c10797a7da8eb005e45081/datasets-1.5.0-py3-none-any.whl (192kB)
Collecting tensorboardx
  Downloading https://files.pythonhosted.org/packages/07/84/46421bd3e0e89a92682b1a38b40efc22dafb6d8e3d947e4ceefd4a5fabc7/tensorboardX-2.2-py2.py3-none-any.whl (120kB)

## 4. Training The T5 Model (t5-small)
Some important model arguments are:
- `max_seq_length`: Chosen such that most samples are not truncated. Increasing the sequence length significantly increases the memory consumption of the model, so it’s usually best to keep it as short as possible.
- `evaluate_during_training`: We’ll periodically evaluate the model on the evaluation data to see how it’s learning.
- `evaluate_during_training_steps`: The number of training steps between these evaluations.
- `evaluate_during_training_verbose`: Print the results each time an evaluation is done.
- `fp16`: FP16 or mixed-precision training reduces the memory consumption of training the models (meaning larger batch sizes can be trained effectively).
- `save_eval_checkpoints`: By default, a model checkpoint will be saved when an evaluation is performed during training. 
- `reprocess_input_data`: Controls whether the features are loaded from cache (saved to disk) or whether tokenization is done again on the input sequences. It only really matters when doing multiple runs.
- `overwrite_output_dir`: This will overwrite any previously saved models if they are in the same output directory.
- `wandb_project`: Used for visualization of training progress. When run, a session link is created where all the necessary plots are shown in a dashboard.
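To see how `evaluate_during_training_steps` interacts with the batch size, it helps to work out the number of optimizer steps in one epoch. The sketch below uses the values from this notebook's training cell and output (143613 training samples, `train_batch_size` of 16, evaluation every 15000 steps) and assumes a single device with no gradient accumulation.

```python
import math

# Values taken from this notebook: the training output shows 143613 training
# samples, and model_args sets train_batch_size=16, num_train_epochs=1 and
# evaluate_during_training_steps=15000.
num_train_samples = 143613
train_batch_size = 16
num_train_epochs = 1
evaluate_every = 15000

# Assumption: one optimizer step per batch (no gradient accumulation).
steps_per_epoch = math.ceil(num_train_samples / train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Number of times the step counter reaches a multiple of evaluate_every.
step_based_evals = total_steps // evaluate_every

print(steps_per_epoch)   # 8976
print(step_based_evals)  # 0
```

Under these assumptions the step interval is longer than the whole run, so the step-based evaluation would not trigger mid-epoch. The single evaluation reported at global step 8976 in the training output presumably comes from an end-of-epoch evaluation instead. A smaller interval, such as 2000, would give several intermediate evaluation points.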

In [None]:
import pandas as pd
from simpletransformers.t5 import T5Model


train_df = pd.read_csv("data/train.tsv", sep="\t").astype(str)
eval_df = pd.read_csv("data/eval.tsv", sep="\t").astype(str)

model_args = {
    "max_seq_length": 196,
    "train_batch_size": 16,
    "eval_batch_size": 64,
    "num_train_epochs": 1,
    "evaluate_during_training": True,
    "evaluate_during_training_steps": 15000,
    "evaluate_during_training_verbose": True,
    
    "use_multiprocessing": False,
    "fp16": False,

    "save_steps": -1,
    "save_eval_checkpoints": True,
    "save_model_every_epoch": False,

    "reprocess_input_data": True,
    "overwrite_output_dir": True,

    "wandb_project": "T5 - Multi-Label",
}

model = T5Model("t5", "t5-small", args=model_args)

model.train_model(train_df, eval_data=eval_df)

  0%|          | 0/143613 [00:00<?, ?it/s]



Using Adafactor for T5


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/8976 [00:00<?, ?it/s]



  0%|          | 0/15958 [00:00<?, ?it/s]

(8976,
 {'eval_loss': [0.09889124576747417],
  'global_step': [8976],
  'train_loss': [0.195659339427948]})

## 5. Testing The Model

To test the model, we report a weighted F1 score, precision, and recall, together with the exact-match rate. The scores are computed with `sklearn.metrics` and the SQuAD metric helpers from Transformers. Each label string (for example `toxic,insult`) is compared as a whole, so the sklearn scores treat every distinct label combination as one class. The model fine-tuned in this experiment can be found in the `best_model` folder inside the repository's `outputs` folder.

In [43]:
import json
from datetime import datetime
from statistics import mean

import pandas as pd
from simpletransformers.t5 import T5Model
from sklearn.metrics import f1_score, precision_score, recall_score
from transformers.data.metrics.squad_metrics import compute_exact, compute_f1


def f1(truths, preds):
    # Mean SQuAD-style token-overlap F1 over all (truth, prediction) pairs
    return mean([compute_f1(truth, pred) for truth, pred in zip(truths, preds)])

def exact(truths, preds):
    # Fraction of predictions that match the truth after SQuAD normalization
    return mean([compute_exact(truth, pred) for truth, pred in zip(truths, preds)])

model_args = {
    "overwrite_output_dir": True,
    "max_seq_length": 196,
    "eval_batch_size": 32,
    "num_train_epochs": 1,
    "use_multiprocessing": False,
    "num_beams": None,
    "do_sample": True,
    "max_length": 50,
    "top_k": 50,
    "top_p": 0.95,
    "num_return_sequences": 3,
}

# Load the trained model
model = T5Model("t5", "outputs/best_model", args=model_args)

# Load the evaluation data
df = pd.read_csv("data/eval.tsv", sep="\t").astype(str)

# Prepare the data for testing
to_predict = [
    prefix + ": " + str(input_text)
    for prefix, input_text in zip(df["prefix"].tolist(), df["input_text"].tolist())
]
truth = df["target_text"].tolist()
tasks = df["prefix"].tolist()

# Get the model predictions
preds = model.predict(to_predict)

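# Saving the predictions if needed
import os
os.makedirs("predictions", exist_ok=True)  # the folder must exist before writing
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # avoids spaces and colons in the filename
with open(f"predictions/predictions_{timestamp}.txt", "w") as f: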
    for i, text in enumerate(df["input_text"].tolist()):
        f.write(str(text) + "\n\n")

        f.write("Truth:\n")
        f.write(truth[i] + "\n\n")

        f.write("Prediction:\n")
        for pred in preds[i]:
            f.write(str(pred) + "\n")
        f.write(
            "________________________________________________________________________________\n"
        )

# Taking only the first prediction
preds = [pred[0] for pred in preds]
df["predicted"] = preds

# Evaluating the tasks separately
output_dict = {
    "multilabel classification": {"truth": [], "preds": [],}
}

results_dict = {}

for task, truth_value, pred in zip(tasks, truth, preds):
    output_dict[task]["truth"].append(truth_value)
    output_dict[task]["preds"].append(pred)

print("-----------------------------------")
print("Results: ")
for task, outputs in output_dict.items():
    if task == "multilabel classification":
        try:
            task_truth = outputs["truth"]
            task_preds = outputs["preds"]
            # The sklearn scores treat each full label string as one class
            results_dict[task] = {
                "F1 Score": f1_score(task_truth, task_preds, average='weighted'),
                "Exact matches": exact(task_truth, task_preds),
                "Precision": precision_score(task_truth, task_preds, average='weighted'),
                "Recall": recall_score(task_truth, task_preds, average='weighted'),
            }
            print(f"Scores for {task}:")
            # Note: the printed F1 uses the SQuAD-style helper, not the sklearn value stored above
            print(f"F1 score: {f1(task_truth, task_preds)}")
            print(f"Exact matches: {results_dict[task]['Exact matches']}")
            print(f"Precision: {results_dict[task]['Precision']}")
            print(f"Recall: {results_dict[task]['Recall']}")
            print()
        except Exception as err:
            # Report the failure instead of silently skipping the task
            print(f"Could not score {task}: {err}")

Generating outputs:   0%|          | 0/499 [00:00<?, ?it/s]



Decoding outputs:   0%|          | 0/47874 [00:00<?, ?it/s]

-----------------------------------
Results: 


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Scores for multilabel classification:
F1 score: 0.9085725028199022
Exact matches: 0.9085725028199022
Precision: 0.9085989470177637
Recall: 0.9085725028199022
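The recall and exact-match values above are identical, which is what one would expect when every label string is scored as a single class: support-weighted recall then reduces to plain accuracy. The sketch below demonstrates this with plain-Python versions of both quantities on hypothetical data. The helper functions are illustrative and are not the sklearn implementations.

```python
from collections import Counter


def accuracy(truths, preds):
    """Fraction of samples whose predicted string equals the true string."""
    return sum(t == p for t, p in zip(truths, preds)) / len(truths)


def weighted_recall(truths, preds):
    """Per-class recall averaged with weights proportional to class support."""
    support = Counter(truths)
    correct = Counter(t for t, p in zip(truths, preds) if t == p)
    total = len(truths)
    return sum((support[c] / total) * (correct[c] / support[c]) for c in support)


# Hypothetical label strings, each treated as one class.
truths = ["clean", "clean", "toxic", "toxic,insult", "clean"]
preds = ["clean", "toxic", "toxic", "toxic", "clean"]

print(accuracy(truths, preds))
print(weighted_recall(truths, preds))
```

The two printed values agree up to floating-point rounding, because the support weights cancel the per-class denominators. A score that gives partial credit for overlapping label sets requires splitting the strings into individual labels first.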



In [48]:
# Saving the output to a file (the results folder must already exist)
with open("results/result.json", "w") as f:
    json.dump(results_dict, f)

## 6. Model Predictions

***Sample 1***

- ***Statement -*** Can you explain your changes to various numbers in the Hurricane Wilma and Conversion of Units articles?

- Truth:
    - clean

- Prediction:
    - clean

***Sample 2***

- ***Statement -*** Herb for life, you shit!

- Truth:
    - toxic,obscene,insult

- Prediction:
    - toxic,severe_toxic,obscene,insult
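Sample 2 is a partial match: the prediction contains every true label plus one extra. An exact-match score counts it as wrong, whereas a label-level score gives partial credit. The following is a minimal sketch of such a per-sample score, assuming labels are separated by commas and `clean` means no labels. The helpers `label_set` and `sample_scores` are hypothetical and are not part of the evaluation code in this notebook.

```python
def label_set(text):
    """Split a comma-separated label string into a set; "clean" means no labels."""
    return {label for label in text.split(",") if label and label != "clean"}


def sample_scores(truth, pred):
    """Label-level precision, recall and F1 for a single sample."""
    truth_set, pred_set = label_set(truth), label_set(pred)
    if not truth_set and not pred_set:
        return 1.0, 1.0, 1.0  # both clean: counted as a perfect match here
    overlap = len(truth_set & pred_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(truth_set) if truth_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


precision, recall, f1 = sample_scores(
    "toxic,obscene,insult", "toxic,severe_toxic,obscene,insult"
)
print(precision, recall, round(f1, 3))  # 0.75 1.0 0.857
```

Averaging these per-sample values over the evaluation set would give a sample-averaged multilabel score, which complements the stricter exact-match rate.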

## 7. Conclusion

We successfully fine-tuned the T5-small model for multilabel classification. Despite training for only a single epoch, the model performed well (an exact-match rate of about 0.91 on the evaluation set), which we attribute to its pretrained weights. The results can be found in the `results` folder, and the training and validation loss curves can be found in the `plots` folder.