## Textual Entailment ##
### Fine-tuning with multiple inputs ###

Applications:

* Multiple-choice questions: we need to input the question and answers and ask the model to pick the right answer.
* Chatbots with past context
* Question-answering

How can we build a transformer that can handle multiple sentences as inputs?

We do not need to change the number of inputs. We can train the existing transformer to understand multiple input sentences concatenated into the same input.

BERT is pretrained on two tasks: masked language modeling and next sentence prediction (NSP). NSP already takes a two-sentence input:

* Input: two sentences from the corpus
* Target: whether or not the second sentence follows the first (binary classification)
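As a rough illustration of how NSP training pairs could be assembled, here is a minimal sketch. It assumes the corpus is already split into an ordered list of sentences; the helper name `make_nsp_pairs` and the toy corpus are hypothetical, and the 50/50 split between true and random next sentences follows the general recipe rather than any specific implementation.

```python
import random

def make_nsp_pairs(sentences, rng):
    """Build (sentence_a, sentence_b, is_next) examples from ordered sentences.

    Hypothetical helper: roughly half of the pairs keep the true next
    sentence (label 1), the rest use a random other sentence (label 0).
    """
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the sentence that actually follows.
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            # Negative example: any sentence except the current and the true next one.
            candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
            pairs.append((sentences[i], sentences[rng.choice(candidates)], 0))
    return pairs

# Toy corpus, made up for illustration only.
corpus = [
    "Bob went to the dealership.",
    "He bought a car.",
    "Then he drove home.",
    "The weather was nice.",
]
pairs = make_nsp_pairs(corpus, random.Random(0))
for sentence_a, sentence_b, is_next in pairs:
    print(is_next, "|", sentence_a, "->", sentence_b)
```

Each tuple is one binary classification example: both sentences go into the same input sequence, and the label says whether the second one really follows the first.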

Input format:
"[CLS] This is sentence one. [SEP] This is sentence two. [SEP]"
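The format above can be sketched in plain Python. This is only an illustration: `pack_pair` is a hypothetical helper, and whitespace splitting stands in for BERT's actual WordPiece tokenization. The segment IDs mark which sentence each token belongs to.

```python
def pack_pair(sentence_a, sentence_b):
    """Concatenate two sentences into one BERT-style token sequence.

    Whitespace splitting is an assumption made for readability; a real
    BERT tokenizer splits text into WordPiece sub-tokens instead.
    """
    tokens_a = sentence_a.split()
    tokens_b = sentence_b.split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = pack_pair("This is sentence one.", "This is sentence two.")
print(" ".join(tokens))
print(segment_ids)
```

In practice there is no need to build this string by hand: calling a BERT tokenizer from Hugging Face with two text arguments, as in `tokenizer(sentence_a, sentence_b)`, inserts the special tokens and returns the segment IDs as `token_type_ids`.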

Example:

ENTAILMENT: "Bob buys a car" entails "Bob owns a car"

NO ENTAILMENT: "Bob buys cheese" does not entail "Bob doesn't have cheese"
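Framed as a classification problem, each entailment example is a (premise, hypothesis) pair with a binary label. The sketch below shows one possible way to lay this out; the field names, the label names and the `to_model_input` helper are assumptions for illustration, not part of any dataset or library.

```python
# The two examples from the text, written as labeled sentence pairs.
examples = [
    {"premise": "Bob buys a car", "hypothesis": "Bob owns a car", "label": "entailment"},
    {"premise": "Bob buys cheese", "hypothesis": "Bob doesn't have cheese", "label": "no_entailment"},
]

# Hypothetical label mapping for binary classification.
label2id = {"no_entailment": 0, "entailment": 1}

def to_model_input(example):
    """Join premise and hypothesis into a single sequence and encode the label."""
    text = f"[CLS] {example['premise']} [SEP] {example['hypothesis']} [SEP]"
    return {"text": text, "label_id": label2id[example["label"]]}

encoded = [to_model_input(example) for example in examples]
for row in encoded:
    print(row)
```

The model still sees a single input sequence per example; only the classification head has to learn what the relationship between the two segments means.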

In [4]:
import os
import json
import textwrap
from pathlib import Path
from pprint import pprint

import numpy as np
import pandas as pd
import seaborn as sn
from matplotlib import pyplot as plt

# Appearance of the Notebook
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# PyTorch
import torch
from torchinfo import summary

# Hugging Face 
from transformers import pipeline, set_seed, AutoTokenizer
from transformers import Trainer, TrainingArguments
from transformers import AutoModelForSequenceClassification

# The Hugging Face community-driven open-source library of datasets
from datasets import load_dataset, load_metric

from sklearn.metrics import f1_score, accuracy_score, confusion_matrix

# Import this module with autoreload
%load_ext autoreload
%autoreload 2
import transformermodels as tm
print(f'Package version: {tm.__version__}')
print(f'PyTorch version: {torch.__version__}')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Package version: 0.0.post1.dev8+gd983d5a.d20240721
PyTorch version: 2.3.1+cu121


In [5]:
# GPU checks
is_cuda = torch.cuda.is_available()
print(f'CUDA available: {is_cuda}')
print(f'Number of GPUs found:  {torch.cuda.device_count()}')

if is_cuda:
    print(f'Current device ID:     {torch.cuda.current_device()}')
    print(f'GPU device name:       {torch.cuda.get_device_name(0)}')
    print(f'CUDNN version:         {torch.backends.cudnn.version()}')
    device_str = 'cuda:0'
    torch.cuda.empty_cache() 
else:
    device_str = 'cpu'
device = torch.device(device_str)
print()
print(f'Device for model training/inference: {device}')

CUDA available: True
Number of GPUs found:  1
Current device ID:     0
GPU device name:       NVIDIA GeForce GTX 1080 with Max-Q Design
CUDNN version:         8902

Device for model training/inference: cuda:0


In [3]:
# Helper functions and parameters
def wrap(x):
    return textwrap.fill(x, replace_whitespace=False, fix_sentence_endings=True)

# Directories
# Path.home() avoids a TypeError when the HOME environment variable is not set
data_dir = os.path.join(Path.home(), 'data', 'transformers')
model_dir = os.path.join(data_dir, 'model_trained')
Path(model_dir).mkdir(parents=True, exist_ok=True)
# Load the HuggingFace datasets
# Full list of datasets
# https://huggingface.co/datasets
# dataset = load_dataset('amazon_polarity')

# The custom data set
#!wget -nc https://lazyprogrammer.me/course_files/AirlineTweets.csv
data_file_name = 'AirlineTweets.csv'
data_file = os.path.join(data_dir, data_file_name)
df_ = pd.read_csv(data_file)
display(df_.head(2))

# We only need the text and the labels
df = df_[['text', 'airline_sentiment']]
display(df.head())

| | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0 | | | Virgin America | | cairdin | | 0 | @VirginAmerica What @dhepburn said. | | 2015-02-24 11:35:52 -0800 | | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 | | 0.0 | Virgin America | | jnardino | | 0 | @VirginAmerica plus you've added commercials t... | | 2015-02-24 11:15:59 -0800 | | Pacific Time (US & Canada) |


| | text | airline_sentiment |
|---|---|---|
| 0 | @VirginAmerica What @dhepburn said. | neutral |
| 1 | @VirginAmerica plus you've added commercials t... | positive |
| 2 | @VirginAmerica I didn't today... Must mean I n... | neutral |
| 3 | @VirginAmerica it's really aggressive to blast... | negative |
| 4 | @VirginAmerica and it's a really big bad thing... | negative |
