### <font color="blue">***Fine-tuning the Pre-trained T5-small Model in Hugging Face for Information Extraction Task***</font>
T5 is a sequence-to-sequence (seq2seq) model based on the Transformer architecture. We fine-tune it to extract location mentions from microblogs.

### <font color="blue"> ***INTRODUCTION 🥇*** </font>
**T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, each of which is converted into a text-to-text format**. T5 works well on a variety of tasks out of the box: a task-specific prefix is prepended to the input, e.g., for translation: translate...
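
To make the text-to-text idea concrete, here is a minimal sketch in plain Python (no model involved). The helper `to_text_to_text` and the dictionary `TASK_PREFIXES` are hypothetical names introduced for illustration; the translation and summarization prefixes follow the style of T5's original training mixture, and the extraction prefix is the one defined further down in this notebook.

```python
# Illustrative task prefixes (assumed names; not part of any library).
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
    "extract_locations": "Extract all location mentions (LMs) from this microblog: ",
}

def to_text_to_text(task: str, text: str) -> str:
    """Hypothetical helper: turn a (task, input) pair into a single input string."""
    return TASK_PREFIXES[task] + text

example = to_text_to_text("extract_locations", "Flooding reported in Ellicott City, Maryland.")
print(example)
```

Every task, whatever its nature, ends up as "string in, string out", which is why a single model and a single loss can cover all of them.
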
**T5 comes in different sizes:**
*   *t5-small:* 60M parameters, lightweight; suited to quick experiments and limited hardware
*   *t5-base:* 220M parameters, general-purpose language processing
*   *t5-large:* 770M parameters, higher accuracy on more complex NLP tasks
*   *t5-3b:* 3B parameters, high accuracy on complex and large-scale NLP tasks
*   *t5-11b:* 11B parameters, best accuracy on complex NLP tasks

In this notebook, I aim to **test the potency 🚀 of fine-tuning the T5-small model** for an **Information Extraction 📄** task. Specifically, the goal is to customize the model to recognize and extract **location mentions 📍 in a microblog**.






#### <font color="#D6B4FC"> ***Import libraries ✍*** </font> : Install Transformers and Datasets from Hugging Face


In [None]:
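%%bash
pip install nltk
pip install datasets
pip install transformers[torch]
pip install huggingface_hub
# evaluate and rouge_score are needed by `import evaluate` and the ROUGE metric below
pip install evaluate rouge_score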

In [2]:
import nltk
import evaluate
import numpy as np
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

In [3]:
# In Colab, set the runtime type to T4 GPU so that CUDA is available
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

#### <font color="blue">**1. Load a Pre-trained Model Checkpoint 🔄**</font>
- **Purpose**: We load the `t5-small` model from Hugging Face and configure it for fine-tuning on the Information Extraction task.

#### <font color="blue">**2. Load a Pre-trained Tokenizer 🧩**</font>
- **Purpose**: Tokenizers are essential in NLP, translating text into data that the model can process. Models operate only on numbers, so tokenizers convert text inputs into numerical data.
  
  The **tokenization pipeline** consists of four main steps:
  - **Normalization 🔧**: Cleans up text by stripping extra whitespace, handling Unicode, etc.
  - **Pre-tokenization ✂️**: Splits text into smaller units such as words.
  - **Modeling 🤖**: Applies the model's sub-word tokenization algorithm to each unit.
  - **Post-processing 🔍**: Adds special tokens required by the model.

  **Outputs of the tokenizer**:
  - **input_ids 🔢**: Numerical indices for each token in the sentence.
  - **attention_mask 👀**: Flags indicating which tokens to attend to.
  - **token_type_ids 🧩**: Identifies the sequence a token belongs to if there's more than one sentence. The T5 tokenizer does not return this field; it is used by models such as BERT.
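
The four pipeline steps can be sketched with a toy tokenizer in plain Python. Everything here is an assumption made for illustration: the tiny vocabulary, the lowercasing normalizer, and the greedy longest-match split, which stands in for the SentencePiece algorithm that T5 really uses. Only the special-token IDs (pad = 0, end-of-sequence = 1, unknown = 2) mirror T5.

```python
import unicodedata

# Toy vocabulary (assumption); real T5 uses a SentencePiece vocabulary of ~32k pieces.
VOCAB = {"<pad>": 0, "</s>": 1, "<unk>": 2,
         "flood": 3, "ing": 4, "in": 5, "mary": 6, "land": 7}

def normalize(text: str) -> str:
    """Normalization: Unicode NFKC, lowercase, collapse repeated whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).lower().split())

def pre_tokenize(text: str) -> list[str]:
    """Pre-tokenization: split on whitespace into word-level chunks."""
    return text.split()

def subword_split(word: str) -> list[str]:
    """Modeling: greedy longest-match split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in VOCAB:
                pieces.append(word[start:end])
                start = end
                break
        else:
            return ["<unk>"]  # no piece matches: treat the whole word as unknown
    return pieces

def encode(text: str) -> dict:
    """Run all four steps and return the two fields a T5-style tokenizer produces."""
    pieces = [p for w in pre_tokenize(normalize(text)) for p in subword_split(w)]
    pieces.append("</s>")  # Post-processing: append the end-of-sequence token
    ids = [VOCAB[p] for p in pieces]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}

print(encode("Flooding  in Maryland"))
```

The trailing `1` in `input_ids` plays the same role as the final `1` in the real tensor printed in the next cells: it is the end-of-sequence marker added during post-processing.
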



In [6]:
# Load the model and tokenizer from Hugging Face
checkpoint = "t5-small"
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [8]:
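# `input` holds a sample prompt string defined in an earlier cell (not shown here);
# note that this name shadows Python's built-in `input`.
inputs = tokenizer(input, return_tensors="pt").input_ids
inputs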

tensor([[18742,    66,  1128,  2652,     7,    41, 11160,     7,    61,    45,
            48,  2179, 12139,    10,     3,    31, 10991,    23, 10405,   896,
             6, 10271,     6, 10056,    57, 26926, 18368,   117,   209,  3586,
             5,    31,     1]])

### <font color="#D6B4FC">Data Preparation 📖</font>

In [11]:
# Load the dataset from Train_1.csv
import pandas as pd

data = pd.read_csv("/content/Train_1.csv")

def data_preprocessing(data):
    data = data.dropna()                         # remove rows with missing data
    data = data.drop(["tweet_id"], axis=1)       # remove the "tweet_id" column
    data.columns = ["microblog", "location"]     # rename the remaining columns
    return data

data = data_preprocessing(data)  # pandas DataFrame

In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11849 entries, 1 to 73071
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   microblog  11849 non-null  object
 1   location   11849 non-null  object
dtypes: object(2)
memory usage: 277.7+ KB


In [13]:
#convert the pandas dataframe to dataset
from datasets import Dataset
dataset=Dataset.from_pandas(data,preserve_index=False)

In [14]:
# Split the dataset into training and evaluation sets with an 80/20 ratio
dataset=dataset.train_test_split(test_size=0.2)

In [15]:
dataset

DatasetDict({
    train: Dataset({
        features: ['microblog', 'location'],
        num_rows: 9479
    })
    test: Dataset({
        features: ['microblog', 'location'],
        num_rows: 2370
    })
})
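
The row counts above follow directly from the 80/20 ratio. The sketch below is a plain-Python illustration of a shuffled split, not the `datasets` implementation; the function name `train_test_split_indices` is hypothetical, and rounding the test share up is an assumption chosen because it reproduces the counts shown. A fixed seed makes the split repeatable; `Dataset.train_test_split` accepts a `seed` argument for the same purpose.

```python
import math
import random

def train_test_split_indices(n_rows: int, test_size: float, seed: int = 0):
    """Shuffle row indices and cut them into train/test parts (illustrative only)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)     # fixed seed -> same split on every run
    n_test = math.ceil(n_rows * test_size)   # round the test share up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(11849, test_size=0.2)
print(len(train_idx), len(test_idx))  # 9479 2370
```
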

In [16]:
# Print a sample entry: shows the location associated with a microblog
example = dataset["train"][0]
for key in example:
    print("{}: \"{}\"".format(key, example[key]))

microblog: "Here is a live look of the damage that was caused by the flooding in Ellicott City, Maryland. Live video courtesy of Fox 5 DC."
location: "Ellicott City Maryland"


We will create a function to preprocess the training and test data in batches. The preprocessing function will perform the following actions:
- Prepend the task prefix to each microblog to tell the T5 model that the task at hand is location-mention extraction.
- Convert the input microblogs and location labels into a tokenized format that can be processed by the T5 model.
- Set the `max_length` parameter to ensure that the tokenized inputs and labels do not exceed a certain length, truncating any text that is too long.
- Assign the tokenized labels to the `labels` field of `model_inputs`, which will be used during training to calculate the loss and optimize the model's parameters.
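
Before the real implementation, here is a rough sketch of the batch contract these steps imply: the function receives a dict of columns (lists) and returns a dict of new columns. The whitespace "tokenizer" `toy_tokenize` and the function `toy_preprocess` are stand-ins invented for this illustration, and `MAX_LENGTH` is set to a tiny value only so that truncation is visible.

```python
PREFIX = "Extract all location mentions (LMs) from this microblog: "
MAX_LENGTH = 10  # deliberately tiny so truncation is visible; the notebook uses 512

def toy_tokenize(texts, max_length):
    """Stand-in tokenizer (assumption): one 'token' per whitespace-separated word."""
    return {"input_ids": [text.split()[:max_length] for text in texts]}

def toy_preprocess(batch):
    """Takes a dict of columns (lists) and returns a dict of new columns."""
    inputs = [PREFIX + doc for doc in batch["microblog"]]
    model_inputs = toy_tokenize(inputs, MAX_LENGTH)
    model_inputs["labels"] = toy_tokenize(batch["location"], MAX_LENGTH)["input_ids"]
    return model_inputs

batch = {
    "microblog": ["Flooding in Ellicott City, Maryland.", "Roads closed near Houston"],
    "location": ["Ellicott City Maryland", "Houston"],
}
out = toy_preprocess(batch)
print(out["input_ids"][0])  # the 8 prefix words plus the first 2 words of the microblog
print(out["labels"])
```

With such a short limit the prefix consumes most of the budget, which is a useful reminder that the prefix counts toward `max_length` in the real tokenizer too.
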

In [17]:
# Task prefix that tells T5 which task to perform
prefix = "Extract all location mentions (LMs) from this microblog: "

# Define the preprocessing function

def preprocess_function(examples):
   """Add prefix to the sentences, tokenize the text, and set the labels"""
   # The "inputs" are the prefixed microblogs, tokenized and truncated to max_length:
   inputs = [prefix + doc for doc in examples["microblog"]]
   model_inputs = tokenizer(inputs, max_length=512, truncation=True)

   # The "labels" are the tokenized target locations:
   labels = tokenizer(text_target=examples["location"],
                      max_length=512,
                      truncation=True)

   model_inputs["labels"] = labels["input_ids"]
   # Drop the raw text columns; only the tokenized fields are needed for training
   del examples["microblog"]
   del examples["location"]
   return model_inputs

In [18]:
# Map the preprocessing function across our dataset
dataset = dataset.map(preprocess_function, batched=True)

In [19]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 9479
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2370
    })
})

#### <font color="#D6B4FC"> ***Evaluation metrics ✍*** </font>

We will use the ROUGE metric to evaluate the model during training. We will load it from the Hugging Face Evaluate library.
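
As a rough intuition for what ROUGE-1 measures, the sketch below computes an F1 score over unigram overlap in plain Python. It is a simplified illustration under stated assumptions (lowercasing and whitespace splitting only); the `rouge` metric loaded below applies its own tokenization and also reports ROUGE-2 and ROUGE-L, so its numbers can differ. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1: F1 over lowercased, whitespace-split unigrams."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Exact match, then a prediction that misses one of three reference words.
print(rouge1_f1("Ellicott City Maryland", "Ellicott City Maryland"))
print(rouge1_f1("Ellicott City", "Ellicott City Maryland"))
```

In the second call precision is 2/2 and recall is 2/3, so a prediction that drops one location word is penalized through recall only.
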

In [31]:
! pip install -q evaluate rouge_score

  Preparing metadata (setup.py) ... done
  Building wheel for rouge_score (setup.py) ... done


In [32]:
import evaluate
rouge = evaluate.load("rouge")

In [30]:
def compute_metrics(eval_pred):
    # Unpack the evaluation output into generated token IDs and reference label IDs.
    predictions, labels = eval_pred

    # Replace any -100 values with the tokenizer's pad_token_id so the arrays can be decoded.
    # -100 marks padded positions that the loss function ignores during training;
    # in some library versions the generated predictions can contain it as well.
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # Decode the predictions and labels back to text, skipping any special tokens.
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compute the ROUGE metric between the decoded predictions and decoded labels.
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels)

    # Mean number of non-padding tokens per prediction, stored under the key "gen_len".
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    # Round each value to 4 decimal places for cleaner output.
    return {k: round(v, 4) for k, v in result.items()}

#### <font color="blue">**Define a Data Collator 📦**</font>
- **Purpose**: The data collator assembles the pre-processed examples into batches of PyTorch tensors with uniform dimensions, ready for the model.
  - **Dynamic padding 📏**: Uses the tokenizer's `.pad` method (with `return_tensors='pt'`) to pad `input_ids` and `attention_mask` to the longest sequence within each batch.
  - **Label padding ✂️**: Labels are padded separately with `-100` instead of the pad token, so padded positions are ignored in the loss computation.
  - **Truncation ✂️**: The collator itself does not truncate; sequences exceeding the maximum length are truncated earlier, during tokenization.
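
The padding behaviour can be sketched in plain Python. This is an illustration under stated assumptions, not the `DataCollatorForSeq2Seq` source: the function name `pad_batch` is hypothetical, the token IDs in `features` are made up, and the conversion to tensors is omitted. The pad ID `0` is T5's pad token, and `-100` is the label value the loss ignores.

```python
PAD_TOKEN_ID = 0           # T5's pad token id
LABEL_PAD_TOKEN_ID = -100  # positions with this label are skipped by the loss

def pad_batch(features):
    """Pad every sequence in the batch to the longest one in that batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * (max_in - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [LABEL_PAD_TOKEN_ID] * (max_lab - n_lab))
    return batch

# Two examples of different lengths (made-up token IDs).
features = [
    {"input_ids": [18742, 66, 1128, 1], "labels": [896, 1]},
    {"input_ids": [18742, 1], "labels": [896, 6, 10271, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["labels"])
```

Because padding is computed per batch, short batches stay short, which is cheaper than padding every example to a global maximum length.
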

In [24]:
# Pass the model object (not its name) so the collator can prepare decoder inputs from the labels
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

#### <font color="#D6B4FC"> ***Training ✍*** </font>

In [None]:
# Global Parameters
L_RATE = 3e-4
BATCH_SIZE = 8
PER_DEVICE_EVAL_BATCH = 4
WEIGHT_DECAY = 0.01
SAVE_TOTAL_LIM = 3
NUM_EPOCHS = 3

# Set up training arguments
training_args = Seq2SeqTrainingArguments(
   output_dir="T5-small-ie",
   evaluation_strategy="epoch",  # named `eval_strategy` in newer transformers releases
   learning_rate=L_RATE,
   per_device_train_batch_size=BATCH_SIZE,
   per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH,
   weight_decay=WEIGHT_DECAY,
   save_total_limit=SAVE_TOTAL_LIM,
   num_train_epochs=NUM_EPOCHS,
   predict_with_generate=True,
   push_to_hub=False
)

In [34]:
trainer = Seq2SeqTrainer(
   model=model,
   args=training_args,
   train_dataset=dataset["train"],
   eval_dataset=dataset["test"],
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics
)

In [35]:
trainer.train()

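| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|-------|---------------|-----------------|--------|--------|--------|-----------|---------|
| 1 | 0.2134 | 0.483003 | 0.9034 | 0.2781 | 0.891 | 0.8907 | 3.8966 |
| 2 | 0.2625 | 0.423954 | 0.9145 | 0.2984 | 0.9019 | 0.9016 | 4.003 |
| 3 | 0.1867 | 0.444164 | 0.9185 | 0.3031 | 0.9049 | 0.9047 | 3.9916 |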




TrainOutput(global_step=3555, training_loss=0.23284243517954809, metrics={'train_runtime': 606.1146, 'train_samples_per_second': 46.917, 'train_steps_per_second': 5.865, 'total_flos': 650961286987776.0, 'train_loss': 0.23284243517954809, 'epoch': 3.0})
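
The `global_step` value reported above can be checked with a little arithmetic. The sketch assumes a single device and no gradient accumulation, which is consistent with the training arguments used here.

```python
import math

n_train = 9479   # training rows reported by the split above
batch_size = 8   # per_device_train_batch_size
epochs = 3       # num_train_epochs

# Assumes a single device and no gradient accumulation.
steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch is smaller
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 1185 3555
```
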

In [39]:
from huggingface_hub import login
login()

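Finally, save the model and tokenizer, then push them to the Hugging Face Hub.

In [None]:
new_model = "t5-small-ie"
trainer.model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)  # keep the tokenizer alongside the weights
trainer.model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model)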