# Binary Classification
The goal of binary classification in NLP is to classify a given text sequence into one of two classes. Binary classification of text sequences is useful for advancing AI's understanding of natural language, for example by determining the sentiment of a text from its context.

In our task, we use the Yelp Review Polarity dataset to classify the sentiment of a text as either positive ("1") or negative ("0"). The Yelp reviews polarity dataset is constructed by considering stars 1 and 2 negative, and 3 and 4 positive. For each polarity, 280,000 training samples and 19,000 testing samples are taken randomly. In total there are 560,000 training samples and 38,000 testing samples.

## 1. Mounting the drive and navigating to the resource folder.

The Yelp Review Polarity dataset is stored at the path `data/binary_classification`.

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [10]:
cd /content/drive/MyDrive/Colab Notebooks/T5_FineTune

/content/drive/MyDrive/Colab Notebooks/T5_FineTune


In [11]:
import pandas as pd
import json
from sklearn.model_selection import train_test_split
import csv

## 2. Preprocessing The Data

The inputs and outputs of a T5 model are always text. A particular task is specified by using a prefix text that lets the model know what it should do with the input. The input data format for a T5 model in Simple Transformers reflects this fact. The input is a Pandas dataframe with three columns: `prefix`, `input_text`, and `target_text`.
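
As a minimal sketch of this layout, the snippet below builds a toy dataframe with the three columns and joins prefix and input with `": "`, the same way the testing cell later in this notebook does. The two reviews and the helper name `to_model_input` are made up for illustration.

```python
import pandas as pd

# Two made-up reviews, used only to illustrate the three-column layout.
toy_df = pd.DataFrame({
    "prefix": ["binary classification", "binary classification"],
    "input_text": ["Great food and friendly staff.", "Waited an hour and left hungry."],
    "target_text": ["1", "0"],
})


def to_model_input(prefix: str, input_text: str) -> str:
    """Join prefix and input text (hypothetical helper, mirrors the testing cell)."""
    return prefix + ": " + str(input_text)


model_inputs = [
    to_model_input(p, t) for p, t in zip(toy_df["prefix"], toy_df["input_text"])
]
print(model_inputs[0])
```

Note that the target is stored as the string `"1"` or `"0"`, not as an integer, because the model both reads and writes text.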

In the following cell, we convert our data into train and test dataframes with the `prefix` set to `binary classification`. The test-to-train ratio chosen is 3:20 (30,000 test rows to 200,000 training rows). Once the dataframes are created, we run a sanity check to ensure that all of the data in the dataframes is in text format.

Before you proceed, move the dataset to the expected location using the following steps, in case it isn't already there:
1. Download the [Yelp Review Polarity Dataset](https://www.tensorflow.org/datasets/catalog/yelp_polarity_reviews/).
2. Extract `train.csv` and `test.csv` to `data/binary_classification`.

In [8]:

prefix = 'data/binary_classification/'

binary_train_df = pd.read_csv(prefix + 'train.csv',header=None,nrows=200000)
binary_train_df.head()

binary_eval_df = pd.read_csv(prefix + 'test.csv',header=None,nrows=30000)
print(binary_eval_df.head())

binary_train_df[0] = (binary_train_df[0] == 2).astype(int)
binary_eval_df[0] = (binary_eval_df[0] == 2).astype(int)

binary_train_df = pd.DataFrame({
    'prefix': ["binary classification" for i in range(len(binary_train_df))],
    'input_text': binary_train_df[1].str.replace('\n', ' '),
    'target_text': binary_train_df[0].astype(str),
})

print(binary_train_df.head())

binary_eval_df = pd.DataFrame({
    'prefix': ["binary classification" for i in range(len(binary_eval_df))],
    'input_text': binary_eval_df[1].str.replace('\n', ' '),
    'target_text': binary_eval_df[0].astype(str),
})


print(binary_eval_df.head())

   0                                                  1
0  2  Contrary to other reviews, I have zero complai...
1  1  Last summer I had an appointment to get new ti...
2  2  Friendly staff, same starbucks fair you get an...
3  1  The food is good. Unfortunately the service is...
4  2  Even when we didn't have a car Filene's Baseme...
                  prefix  ... target_text
0  binary classification  ...           0
1  binary classification  ...           1
2  binary classification  ...           0
3  binary classification  ...           0
4  binary classification  ...           1

[5 rows x 3 columns]
                  prefix  ... target_text
0  binary classification  ...           1
1  binary classification  ...           0
2  binary classification  ...           1
3  binary classification  ...           0
4  binary classification  ...           1

[5 rows x 3 columns]


In [9]:
train_df = pd.concat([binary_train_df]).astype(str)
eval_df = pd.concat([binary_eval_df]).astype(str)

In [10]:
# Pass the separator by keyword; newer pandas versions no longer accept it positionally.
train_df.to_csv("data/train.tsv", sep="\t")
eval_df.to_csv("data/eval.tsv", sep="\t")
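
The sanity check mentioned earlier in this section could look roughly like the sketch below. The helper `all_text` and the one-row dataframe are hypothetical; the idea is only to confirm that `astype(str)` left no non-string values behind.

```python
import pandas as pd


def all_text(df: pd.DataFrame, columns=("prefix", "input_text", "target_text")) -> bool:
    """Return True if every value in the given columns is a Python str."""
    return all(
        isinstance(value, str)
        for column in columns
        for value in df[column]
    )


# Made-up row; the integer label stands in for an unconverted target.
mixed_df = pd.DataFrame({
    "prefix": ["binary classification"],
    "input_text": ["Good coffee."],
    "target_text": [1],
})
text_df = mixed_df.astype(str)

print(all_text(mixed_df))  # False: the label is still an integer
print(all_text(text_df))   # True: astype(str) converted every column
```

On the real data, the same check would be called as `all_text(train_df)` and `all_text(eval_df)`.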

## 3. Creating Pretrained Instance of T5 Model

We will be using the [Simple Transformers library](https://github.com/ThilinaRajapakse/simpletransformers), which is built on [Hugging Face Transformers](https://github.com/huggingface/transformers), to train the T5 model.
The instructions below install all the requirements.
- Install the Anaconda or Miniconda package manager from [here](https://www.anaconda.com/products/individual).
- Create a new virtual environment and install packages.
  - `conda create -n simpletransformers python`
  - `conda activate simpletransformers`
  - `conda install pytorch cudatoolkit=10.1 -c pytorch`
- Install simpletransformers.
  - `pip install simpletransformers`

**NOTE** - The first two steps are necessary only if you choose to run the files on your local system.


In [7]:
pip install simpletransformers

Collecting simpletransformers
  Downloading https://files.pythonhosted.org/packages/35/ef/0b70ae95138064d665d9298c4d96afba2edf4b86dc44f762807ceb12668e/simpletransformers-0.61.4-py3-none-any.whl (213kB)

## 4. Training The T5 Model (t5-small)
Some important model arguments are:
- `max_seq_length`: Chosen such that most samples are not truncated. Increasing the sequence length significantly increases the memory consumption of the model, so it’s usually best to keep it as short as possible.
- `evaluate_during_training`: We’ll periodically test the model against the test data to see how it’s learning.
- `evaluate_during_training_steps`: The aforementioned period at which the model is tested.
- `evaluate_during_training_verbose`: Show us the results when a test is done.
- `fp16`: FP16 or mixed-precision training reduces the memory consumption of training the models (meaning larger batch sizes can be trained effectively).
- `save_eval_checkpoints`: By default, a model checkpoint is saved whenever an evaluation is performed during training; this run sets it to `False`.
- `reprocess_input_data`: Controls whether the features are loaded from cache (saved to disk) or whether tokenization is done again on the input sequences. It only really matters when doing multiple runs.
- `overwrite_output_dir`: This will overwrite any previously saved models if they are in the same output directory.
- `wandb_project`: Used for visualization of training progress. When run, a session link is created where all the necessary plots are shown in a dashboard.
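
One way to judge a candidate `max_seq_length` is to measure what fraction of samples would fit without truncation. The sketch below is a rough proxy only: it counts whitespace-separated words, whereas T5 counts subword tokens, so real token counts will be higher. The function name `coverage` and the sample reviews are made up; a closer check would pass the model tokenizer's tokenization function as `tokenize`.

```python
def coverage(texts, max_len, tokenize=str.split):
    """Fraction of texts whose token count fits within max_len."""
    if not texts:
        raise ValueError("texts must not be empty")
    fits = sum(1 for text in texts if len(tokenize(text)) <= max_len)
    return fits / len(texts)


# Made-up reviews with 3, 5 and 14 whitespace-separated words.
reviews = [
    "Great place overall.",
    "Service was slow but friendly.",
    "We waited forty minutes for a table and then the order came out wrong.",
]

print(coverage(reviews, 5))   # two of the three reviews fit
print(coverage(reviews, 14))  # all reviews fit
```

Raising the limit until the coverage is close to 1.0, and no further, keeps memory use down while truncating few samples.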

*NOTE - The T5 model is trained with the Adafactor optimizer.*

In [None]:
import pandas as pd

from simpletransformers.t5 import T5Model

train_df = pd.read_csv("data/train.tsv", sep="\t").astype(str)
eval_df = pd.read_csv("data/eval.tsv", sep="\t").astype(str)

model_args = {
    "max_seq_length": 196,
    "train_batch_size": 16,
    "eval_batch_size": 64,
    "num_train_epochs": 1,
    "evaluate_during_training": True,
    "evaluate_during_training_steps": 15000,
    "evaluate_during_training_verbose": True,
    "use_multiprocessing": False,
    "fp16": False,
    "save_steps": -1,
    "save_eval_checkpoints": False,
    "save_model_every_epoch": True,
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "wandb_project": "T5 Binary Classification",
}

model = T5Model("t5", "t5-small", args=model_args)

model.train_model(train_df, eval_data=eval_df)

  0%|          | 0/200000 [00:00<?, ?it/s]



Using Adafactor for T5


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

wandb: Currently logged in as: they_way_shh (use `wandb login --relogin` to force relogin)


Running Epoch 0 of 1:   0%|          | 0/12500 [00:00<?, ?it/s]

	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:1005.)
  exp_avg_sq_row.mul_(beta2t).add_(1.0 - beta2t, update.mean(dim=-1))


  0%|          | 0/30000 [00:00<?, ?it/s]

Exception in thread Thread-18:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 470, in _handle_results
    task = get()
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/reductions.py", line 287, in rebuild_storage_fd
    storage = cls._new_shared_fd(fd, size)
RuntimeError: unable to mmap 1568 bytes from file <filename not specified>: Cannot allocate memory (12)

Process ForkPoolWorker-2:
Process ForkPoolWorker-2:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multipr

AssertionError: ignored

AssertionError: ignored
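
The step counts in the progress bars can be reproduced with a little arithmetic. The sketch below assumes one optimizer step per batch (no gradient accumulation); `steps_per_epoch` is a hypothetical helper, and the numbers are the row counts and batch sizes used in this notebook.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


train_steps = steps_per_epoch(200_000, 16)  # training rows, train_batch_size
eval_interval = 15_000                      # evaluate_during_training_steps

print(train_steps)                  # 12500
print(eval_interval > train_steps)  # True
```

The 12,500 steps match the `Running Epoch 0 of 1` progress bar and the `checkpoint-12500-epoch-1` folder loaded in the next section. Because the evaluation interval of 15,000 steps is larger than one epoch, this suggests that a step-based evaluation would not trigger within this single-epoch run. The same helper gives `steps_per_epoch(30_000, 32) == 938`, matching the `Generating outputs` progress bar in the testing cell.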

## 5. Testing The Model

To test the model, we report the F1 score, accuracy, precision, and recall. The metrics are computed with the `sklearn.metrics` module: precision and recall use weighted averaging, while the F1 score uses scikit-learn's default binary averaging. The model fine-tuned in this experiment can be found in the "best_model" folder inside the repository's outputs folder.
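
To make the weighted averaging concrete, here is a pure-Python sketch of what these metrics compute. It is not the code used in this notebook, which relies on `sklearn.metrics`; the function names and the five toy labels are made up for illustration.

```python
from collections import Counter


def per_class_scores(truths, preds, label):
    """Precision, recall and F1 for one class treated as the positive class."""
    tp = sum(1 for t, p in zip(truths, preds) if t == label and p == label)
    fp = sum(1 for t, p in zip(truths, preds) if t != label and p == label)
    fn = sum(1 for t, p in zip(truths, preds) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def weighted_scores(truths, preds):
    """Average the per-class scores, weighting each class by its support."""
    support = Counter(truths)
    total = len(truths)
    sums = [0.0, 0.0, 0.0]
    for label, count in support.items():
        for i, score in enumerate(per_class_scores(truths, preds, label)):
            sums[i] += score * count / total
    return tuple(sums)


# Made-up labels: three positives, two negatives, one mistake.
truths = [1, 1, 1, 0, 0]
preds = [1, 1, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(truths, preds)) / len(truths)
precision, recall, f1 = weighted_scores(truths, preds)
print(accuracy, precision, recall, f1)
```

Weighted recall always equals accuracy, since the class weights cancel the per-class denominators. This is why the `Recall` and `Accuracy Score` values coincide in the results printed by the testing cell.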

In [12]:
import json
from datetime import datetime
from pprint import pprint
from statistics import mean

import numpy as np
import pandas as pd
from simpletransformers.t5 import T5Model
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from transformers.data.metrics.squad_metrics import compute_exact, compute_f1


def f1(truths, preds):
    return mean([compute_f1(truth, pred) for truth, pred in zip(truths, preds)])


def exact(truths, preds):
    return mean([compute_exact(truth, pred) for truth, pred in zip(truths, preds)])

model_args = {
    "overwrite_output_dir": True,
    "max_seq_length": 196,
    "eval_batch_size": 32,
    "num_train_epochs": 1,
    "use_multiprocessing": False,
    "num_beams": None,
    "do_sample": True,
    "max_length": 50,
    "top_k": 50,
    "top_p": 0.95,
    "num_return_sequences": 3,
}

# Load the trained model
model = T5Model("t5", "outputs/checkpoint-12500-epoch-1", args=model_args)

# Load the evaluation data
df = pd.read_csv("data/eval.tsv", sep="\t").astype(str)

# Prepare the data for testing
to_predict = [
    prefix + ": " + str(input_text)
    for prefix, input_text in zip(df["prefix"].tolist(), df["input_text"].tolist())
]
truth = df["target_text"].tolist()
tasks = df["prefix"].tolist()

# Get the model predictions
preds = model.predict(to_predict)

# Saving the predictions if needed
with open(f"predictions/predictions_{datetime.now()}.txt", "w") as f:
    for i, text in enumerate(df["input_text"].tolist()):
        f.write(str(text) + "\n\n")

        f.write("Truth:\n")
        f.write(truth[i] + "\n\n")

        f.write("Prediction:\n")
        for pred in preds[i]:
            f.write(str(pred) + "\n")
        f.write(
            "________________________________________________________________________________\n"
        )

# Taking only the first prediction
preds = [pred[0] for pred in preds]
df["predicted"] = preds

output_dict = {
    "binary classification": {"truth": [], "preds": [],}
}

results_dict = {}

for task, truth_value, pred in zip(tasks, truth, preds):
    output_dict[task]["truth"].append(truth_value)
    output_dict[task]["preds"].append(pred)

print("-----------------------------------")
print("Results: ")
for task, outputs in output_dict.items():
    if task == "binary classification":
        try:
            task_truth = [int(t) for t in output_dict[task]["truth"]]
            task_preds = [int(p) for p in output_dict[task]["preds"]]
            results_dict[task] = {
                "F1 Score": f1_score(task_truth, task_preds),
                "Accuracy Score": accuracy_score(task_truth, task_preds),
                "Precision": precision_score(task_truth,task_preds,average='weighted'),
                "Recall": recall_score(task_truth,task_preds,average='weighted')
            }
            print(f"Scores for {task}:")
            print(f"F1 score: {results_dict[task]['F1 Score']}")
            print(f"Accuracy Score: {results_dict[task]['Accuracy Score']}")
            print(f"Precision: {results_dict[task]['Precision']}")
            print(f"Recall: {results_dict[task]['Recall']}")
            print()
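        except ValueError as err:
            # A prediction that is not a number cannot be cast to int; report it instead of hiding it.
            print(f"Could not score {task}: {err}")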

Generating outputs:   0%|          | 0/938 [00:00<?, ?it/s]



Decoding outputs:   0%|          | 0/90000 [00:00<?, ?it/s]

-----------------------------------
Results: 
Scores for binary classification:
F1 score: 0.9281883584041857
Accuracy Score: 0.9268
Precision: 0.9281770564184022
Recall: 0.9268



In [13]:
with open("result.json", "w") as f:
    json.dump(results_dict, f)
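
Because the testing cell samples its outputs (`do_sample` is `True`), a prediction is not guaranteed to be exactly `"0"` or `"1"`. The sketch below shows one possible way to convert predictions to labels while counting the ones that are invalid. The helper `to_label` and the sample outputs are hypothetical and are not part of the notebook's pipeline.

```python
def to_label(pred):
    """Return 0 or 1 for a clean prediction, or None if it is anything else."""
    text = str(pred).strip()
    return int(text) if text in ("0", "1") else None


# Made-up model outputs, including one that is not a valid label.
raw_preds = ["1", " 0", "positive", "1"]
labels = [to_label(p) for p in raw_preds]
invalid = sum(label is None for label in labels)

print(labels)   # [1, 0, None, 1]
print(invalid)  # 1
```

Counting invalid outputs separately keeps them visible, instead of letting a single bad prediction prevent the scores from being computed.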

## 6. Results of Simulation

***Sample 1***

- **_Review -_** Last summer I had an appointment to get new tires and had to wait a super long time. I also went in this week for them to fix a minor problem with a tire they put on. They "fixed" it for free, and the very next morning I had the same issue. I called to complain, and the "manager" didn't even apologize!!! So frustrated. Never going back. They seem overpriced, too.

- ***Prediction -*** 0 (Negative Review)

***Sample 2***

- **_Review -_** Contrary to other reviews, I have zero complaints about the service or the prices. I have been getting tire service here for the past 5 years now, and compared to my experience with places like Pep Boys, these guys are experienced and know what they're doing. Also, this is one place that I do not feel like I am being taken advantage of, just because of my gender. Other auto mechanics have been notorious for capitalizing on my ignorance of cars, and have sucked my bank account dry. But here, my service and road coverage has all been well explained - and let up to me to decide. And they just renovated the waiting room. It looks a lot better than it did in previous years.

- ***Prediction -*** 1 (Positive Review)



## 7. Conclusion

We successfully fine-tuned the T5-small model for binary classification. Despite training for only a single epoch, the model performed well, reaching an accuracy of 0.9268 and an F1 score of about 0.928 on 30,000 test reviews, which is likely owing to its pretrained parameters. The results can be found in the `result.json` file in the home directory, and the model's training and validation loss can be found in the plots folder. Such strong results after minimal training, together with its applicability across many NLP tasks, make T5 a compelling model for natural language processing.