# ML: Summarization from abstract to title

In this notebook, we fine-tune a T5 model to summarize a research abstract into a title.

T5 model: https://huggingface.co/docs/transformers/en/model_doc/t5

In [2]:
!pip install -U transformers
!pip install -U simpletransformers
# !pip install wandb

In [3]:
import os
import re

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import T5Tokenizer, T5ForConditionalGeneration

## Preprocessing Data

In [5]:
df = pd.read_csv("/kaggle/input/data-science-project-merged-researches/merged_researches.csv", index_col=0)

In [6]:
df.head()

Unnamed: 0,id,doi,eid,cover_date,title,abstract,subject_areas,auth_keywords,authors_id,citedby_count,ref_count,ref_ids,published_year,published_month,published_day
0,85077976956,10.1007/978-3-319-98485-8_15,2-s2.0-85077976956,2018-12-31,Public health and international epidemiology f...,,MEDI,,36729660500|14720203700,1.0,76,0002667983|33750367977|85013970385|77953026614...,2018,12,31
1,85060936020,10.23919/PIERS.2018.8597669,2-s2.0-85060936020,2018-12-31,Flexible Printed Active Antenna for Digital Te...,"© 2018 The Institute of Electronics, Informati...",MATE|ENGI,,57192376216|6507497381,1.0,4,85006043726|85046336244|85060914424|85046368249,2018,12,31
2,85052201238,10.1016/j.ces.2018.08.042,2-s2.0-85052201238,2018-12-31,Parametric study of hydrogen production via so...,© 2018 Elsevier LtdComputational fluid dynamic...,ENGI|CHEM|CENG,,57202924002|7004487886|25923304100|50662017700,21.0,42,2942655685|84908055658|85052199786|84859716773...,2018,12,31
3,85051498032,10.1016/j.apsusc.2018.08.059,2-s2.0-85051498032,2018-12-31,Superhydrophobic coating from fluoroalkylsilan...,© 2018 Elsevier B.V. A superhydrophobic/supero...,MATE|CHEM|PHYS,,24074703800|57190429582|7403123085|7401969567|...,37.0,45,78349312523|53249093621|84980335769|8486252720...,2018,12,31
4,85050678366,10.1016/j.aca.2018.07.045,2-s2.0-85050678366,2018-12-31,Electrochemical impedance-based DNA sensor usi...,© 2018 Elsevier B.V. A label-free electrochemi...,BIOC|CHEM|ENVI,,56524669400|6506927536|6602082849|8532633300|6...,68.0,55,56249149272|33749077283|0037834610|73949151264...,2018,12,31


In [7]:
df = df[['title', 'abstract', 'published_year']]
df.head()

Unnamed: 0,title,abstract,published_year
0,Public health and international epidemiology f...,,2018
1,Flexible Printed Active Antenna for Digital Te...,"© 2018 The Institute of Electronics, Informati...",2018
2,Parametric study of hydrogen production via so...,© 2018 Elsevier LtdComputational fluid dynamic...,2018
3,Superhydrophobic coating from fluoroalkylsilan...,© 2018 Elsevier B.V. A superhydrophobic/supero...,2018
4,Electrochemical impedance-based DNA sensor usi...,© 2018 Elsevier B.V. A label-free electrochemi...,2018


In [8]:
df.shape

(45167, 3)

### Cleaning Data

- Remove every record that has a missing value in any of the three columns

In [9]:
df = df.dropna()

In [10]:
df.shape

(43644, 3)

- Preprocess the abstracts

In [11]:
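def text_preprocessing(s):
    """Lowercase an abstract and strip punctuation and special characters."""
    s = s.lower()
    # Replace "'t" with " not" (e.g. "isn't" -> "isn not")
    s = re.sub(r"\'t", " not", s)
    # Remove @mentions
    s = re.sub(r'(@.*?)[\s]', ' ', s)
    # Put spaces around common punctuation so adjacent words stay separated
    s = re.sub(r'([\'\"\.\(\)\!\?\\\/\,])', r' \1 ', s)
    # Replace everything except word characters, whitespace and '?' with a space
    s = re.sub(r'[^\w\s\?]', ' ', s)
    # Remove some remaining special characters
    s = re.sub(r'([\;\:\|•«\n])', ' ', s)
    # Collapse runs of whitespace and trim both ends
    s = re.sub(r'\s+', ' ', s).strip()

    return s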

In [12]:
df['abstract'] = df['abstract'].apply(text_preprocessing)

In [13]:
df.head()

Unnamed: 0,title,abstract,published_year
1,Flexible Printed Active Antenna for Digital Te...,2018 the institute of electronics information ...,2018
2,Parametric study of hydrogen production via so...,2018 elsevier ltdcomputational fluid dynamics ...,2018
3,Superhydrophobic coating from fluoroalkylsilan...,2018 elsevier b v a superhydrophobic superoleo...,2018
4,Electrochemical impedance-based DNA sensor usi...,2018 elsevier b v a label free electrochemical...,2018
5,Evaluation of outsourcing transportation contr...,2018 czestochowa university of technology all ...,2018
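
To see what the cleaning does to a single abstract, the sketch below restates `text_preprocessing` so that it runs on its own and applies it to one string. The sample sentence is made up for illustration and is not a row from the dataset.

```python
import re

def text_preprocessing(s):
    """Same steps as the notebook's cleaning function, restated so this runs alone."""
    s = s.lower()
    s = re.sub(r"\'t", " not", s)                          # "'t" -> " not"
    s = re.sub(r'(@.*?)[\s]', ' ', s)                      # drop @mentions
    s = re.sub(r'([\'\"\.\(\)\!\?\\\/\,])', r' \1 ', s)    # space out punctuation
    s = re.sub(r'[^\w\s\?]', ' ', s)                       # keep word chars, spaces, '?'
    s = re.sub(r'([\;\:\|•«\n])', ' ', s)                  # leftover special characters
    s = re.sub(r'\s+', ' ', s).strip()                     # collapse whitespace
    return s

# Hypothetical sample, loosely modeled on the abstracts shown above
sample = "© 2018 Elsevier B.V. A label-free DNA sensor; it doesn't fail!"
cleaned = text_preprocessing(sample)
print(cleaned)
# 2018 elsevier b v a label free dna sensor it doesn not fail
```

Two side effects are visible here: the publisher copyright prefix survives as ordinary words ("2018 elsevier b v"), and "doesn't" becomes "doesn not" rather than "does not".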


## Naming Columns for Training

In [14]:
df.columns = ['target_text', 'input_text', 'year']

In [15]:
df['input_text'] = "summarize: " + df['input_text']

In [16]:
df.head()

Unnamed: 0,target_text,input_text,year
1,Flexible Printed Active Antenna for Digital Te...,summarize: 2018 the institute of electronics i...,2018
2,Parametric study of hydrogen production via so...,summarize: 2018 elsevier ltdcomputational flui...,2018
3,Superhydrophobic coating from fluoroalkylsilan...,summarize: 2018 elsevier b v a superhydrophobi...,2018
4,Electrochemical impedance-based DNA sensor usi...,summarize: 2018 elsevier b v a label free elec...,2018
5,Evaluation of outsourcing transportation contr...,summarize: 2018 czestochowa university of tech...,2018


## Splitting Data into Training and Testing Data

- Papers published in 2023 are used as the testing set; all other papers are used as the training set

In [17]:
train_df = df[df['year'] != 2023][['target_text', 'input_text']].reset_index(drop=True)
eval_df = df[df['year'] == 2023][['target_text', 'input_text']].reset_index(drop=True)

In [18]:
train_df.shape

(38395, 2)

In [19]:
eval_df.shape

(5249, 2)

## Tokenizer and Dataset

In [20]:
# model and training parameters specific to T5
model_params = {
    "MODEL": "t5-small",  # checkpoint name; t5-base and t5-large are larger alternatives
    "TRAIN_BATCH_SIZE": 8,  # training batch size
    "VALID_BATCH_SIZE": 8,  # validation batch size
    "TRAIN_EPOCHS": 4,  # number of training epochs
    "VAL_EPOCHS": 4,  # number of validation epochs
    "LEARNING_RATE": 1e-4,  # learning rate
    "MAX_SOURCE_TEXT_LENGTH": 512,  # max length of source text
    "MAX_TARGET_TEXT_LENGTH": 50,  # max length of target text
    "SEED": 2024,  # set seed for reproducibility
}

In [21]:
class SummarizingDataset(Dataset):
    """
    Creating a custom dataset for reading the dataset and
    loading it into the dataloader to pass it to the
    transformer for finetuning the model
    """
    
    def __init__(
        self, dataframe, tokenizer, source_len, target_len, source_text, target_text
    ):
        """
        Initializes a Dataset class

        Args:
            dataframe (pandas.DataFrame): Input dataframe
            tokenizer (transformers.tokenizer): Transformers tokenizer
            source_len (int): Max length of source text
            target_len (int): Max length of target text
            source_text (str): column name of source text
            target_text (str): column name of target text
        """
        self.tokenizer = tokenizer
        self.data = dataframe
        self.source_len = source_len
        self.summ_len = target_len
        self.target_text = self.data[target_text]
        self.source_text = self.data[source_text]

    def __len__(self):
        """returns the length of dataframe"""

        return len(self.target_text)

    def __getitem__(self, index):
        """return the input ids, attention masks and target ids"""

        source_text = str(self.source_text[index])
        target_text = str(self.target_text[index])

        # normalize whitespace in both texts
        source_text = " ".join(source_text.split())
        target_text = " ".join(target_text.split())

        # pad or truncate to a fixed length; `padding="max_length"` replaces
        # the deprecated `pad_to_max_length=True`
        source = self.tokenizer.batch_encode_plus(
            [source_text],
            max_length=self.source_len,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        target = self.tokenizer.batch_encode_plus(
            [target_text],
            max_length=self.summ_len,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )

        source_ids = source["input_ids"].squeeze()
        source_mask = source["attention_mask"].squeeze()
        target_ids = target["input_ids"].squeeze()
        target_mask = target["attention_mask"].squeeze()

        return {
            "source_ids": source_ids.to(dtype=torch.long),
            "source_mask": source_mask.to(dtype=torch.long),
            "target_ids": target_ids.to(dtype=torch.long),
            "target_ids_y": target_ids.to(dtype=torch.long),
        }

In [22]:
# tokenizer for encoding the text
tokenizer = T5Tokenizer.from_pretrained(model_params["MODEL"])

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [23]:
training_set = SummarizingDataset(
    train_df,
    tokenizer,
    model_params["MAX_SOURCE_TEXT_LENGTH"],
    model_params["MAX_TARGET_TEXT_LENGTH"],
    "input_text",
    "target_text",
)

val_set = SummarizingDataset(
    eval_df,
    tokenizer,
    model_params["MAX_SOURCE_TEXT_LENGTH"],
    model_params["MAX_TARGET_TEXT_LENGTH"],
    "input_text",
    "target_text",
)

In [24]:
train_params = {
    "batch_size": model_params["TRAIN_BATCH_SIZE"],
    "shuffle": True,
    "num_workers": 0,
}

val_params = {
    "batch_size": model_params["VALID_BATCH_SIZE"],
    "shuffle": False,
    "num_workers": 0,
}

# Create the DataLoaders for training and validation. They are used below in the training and validation stages.
training_loader = DataLoader(training_set, **train_params)
val_loader = DataLoader(val_set, **val_params)
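
As a quick sanity check on the loader sizes, the arithmetic below uses the row counts and batch sizes reported earlier in this notebook. It assumes the `DataLoader` default of keeping the last, smaller batch (`drop_last=False`), so the batch count is rounded up.

```python
import math

# Values taken from the shapes and model_params shown above
n_train, n_eval = 38395, 5249
train_batch_size, valid_batch_size = 8, 8
train_epochs = 4

# The last partial batch is kept, hence the ceiling
steps_per_epoch = math.ceil(n_train / train_batch_size)
eval_batches = math.ceil(n_eval / valid_batch_size)
total_steps = steps_per_epoch * train_epochs

print(steps_per_epoch, eval_batches, total_steps)
# 4800 657 19200
```

With a loss row logged every 100 steps, each training epoch therefore adds 48 rows to the logging table.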

## Training

In [25]:
# Setting up the device for GPU usage
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
device

'cuda'

In [26]:
from rich.table import Column, Table
from rich import box
from rich.console import Console

In [27]:
# define a rich console logger
console = Console(record=True)

training_logger = Table(
    Column("Epoch", justify="center"),
    Column("Steps", justify="center"),
    Column("Loss", justify="center"),
    title="Training Status",
    pad_edge=False,
    box=box.ASCII,
)

In [28]:
def train(epoch, tokenizer, model, device, loader, optimizer):
    """
    Run one training epoch over `loader` and log the loss every 100 steps.
    """

    model.train()
    for step, data in enumerate(loader, 0):
        y = data["target_ids"].to(device, dtype=torch.long)
        # teacher forcing: the decoder reads tokens 0..n-2 and predicts tokens 1..n-1
        y_ids = y[:, :-1].contiguous()
        lm_labels = y[:, 1:].clone().detach()
        # padding positions are set to -100 so the loss ignores them
        lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
        ids = data["source_ids"].to(device, dtype=torch.long)
        mask = data["source_mask"].to(device, dtype=torch.long)

        outputs = model(
            input_ids=ids,
            attention_mask=mask,
            decoder_input_ids=y_ids,
            labels=lm_labels,
        )
        loss = outputs[0]

        if step % 100 == 0:
            training_logger.add_row(str(epoch), str(step), f"{loss.item():.4f}")
            console.print(training_logger)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
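
The slicing in `train` can be hard to picture on tensors, so here is a minimal plain-Python sketch of the same idea on one padded target sequence. The helper name `shift_targets` and the token ids are hypothetical; the pad id 0 and end-of-sequence id 1 are assumed to match the T5 tokenizer, and -100 is the label value that PyTorch's cross-entropy loss ignores by default.

```python
PAD_ID = 0      # assumed pad token id
IGNORE = -100   # label value skipped by the loss

def shift_targets(target_ids, pad_id=PAD_ID):
    """Split one padded target sequence into decoder inputs and labels."""
    # Decoder input: everything except the last token
    decoder_input_ids = target_ids[:-1]
    # Labels: everything except the first token, with padding masked out
    labels = [IGNORE if t == pad_id else t for t in target_ids[1:]]
    return decoder_input_ids, labels

# Toy target: three content tokens, end-of-sequence id 1, then padding
target = [71, 3421, 9, 1, 0, 0]
dec_in, labels = shift_targets(target)
print(dec_in)   # [71, 3421, 9, 1, 0]
print(labels)   # [3421, 9, 1, -100, -100]
```

At every position the label is the token that follows the decoder input, and the padded tail contributes nothing to the loss.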

In [29]:
def validate(epoch, tokenizer, model, device, loader):
    """
    Generate a title for every example in `loader` and return
    the decoded predictions together with the decoded reference titles.
    """
    model.eval()
    predictions = []
    actuals = []
    with torch.no_grad():
        for step, data in enumerate(loader, 0):
            y = data["target_ids"].to(device, dtype=torch.long)
            ids = data["source_ids"].to(device, dtype=torch.long)
            mask = data["source_mask"].to(device, dtype=torch.long)

            # beam search with a repetition penalty
            generated_ids = model.generate(
                input_ids=ids,
                attention_mask=mask,
                max_length=150,
                num_beams=2,
                repetition_penalty=2.5,
                length_penalty=1.0,
                early_stopping=True,
            )
            preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
            target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True) for t in y]
            if step % 10 == 0:
                console.print(f"Completed {step}")

            predictions.extend(preds)
            actuals.extend(target)
    return predictions, actuals

In [30]:
# Set random seeds and deterministic pytorch for reproducibility
torch.manual_seed(model_params["SEED"])  # pytorch random seed
np.random.seed(model_params["SEED"])  # numpy random seed
torch.backends.cudnn.deterministic = True

# logging
console.log(f"""[Model]: Loading {model_params["MODEL"]}...\n""")

# Defining the model. T5ForConditionalGeneration is T5 with a language-modeling head for text generation;
# the checkpoint comes from model_params["MODEL"] (t5-small here).
# The model is then moved to the selected device (GPU if available, otherwise CPU).
model = T5ForConditionalGeneration.from_pretrained(model_params["MODEL"])
model = model.to(device)

# Defining the optimizer that will be used to tune the weights of the network in the training session.
optimizer = torch.optim.Adam(
    params=model.parameters(), lr=model_params["LEARNING_RATE"]
)

In [51]:
output_dir = "/kaggle/working/"

# Training loop
console.log(f"[Initiating Fine Tuning]...\n")

for epoch in range(model_params["TRAIN_EPOCHS"]):
    train(epoch, tokenizer, model, device, training_loader, optimizer)

print(f"[Saving Model]...\n")
# Saving the model after training
path = os.path.join(output_dir, "model_files")
model.save_pretrained(path)
tokenizer.save_pretrained(path)

# evaluating test dataset
# The weights no longer change after training, so a single pass over the validation loader is enough.
console.log("[Initiating Validation]...\n")
predictions, actuals = validate(0, tokenizer, model, device, val_loader)
final_df = pd.DataFrame({"Generated Text": predictions, "Actual Text": actuals})
final_df.to_csv(os.path.join(output_dir, "predictions.csv"))

[Saving Model]...



## Load Model and Predict

In [36]:
loaded_model = T5ForConditionalGeneration.from_pretrained("/kaggle/input/datasci-summarizer-t5/transformers/version-1.0/1").to(device)
loaded_tokenizer = T5Tokenizer.from_pretrained("/kaggle/input/datasci-summarizer-t5/transformers/version-1.0/1")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [37]:
val_set = SummarizingDataset(
    eval_df,
    loaded_tokenizer,
    model_params["MAX_SOURCE_TEXT_LENGTH"],
    model_params["MAX_TARGET_TEXT_LENGTH"],
    "input_text",
    "target_text",
)

val_loader = DataLoader(val_set, **val_params)

In [38]:
predictions, actuals = validate(0, loaded_tokenizer, loaded_model, device, val_loader)
final_df = pd.DataFrame({"Generated Text": predictions, "Actual Text": actuals})
final_df.to_csv(os.path.join("/kaggle/working/", "predictions.csv"))

In [40]:
final_df.head()

Unnamed: 0,Generated Text,Actual Text
0,pyrocatechol violet copper ion graphene oxide ...,Graphene oxide-alginate hydrogel-based indicat...
1,Cu(H3Tea) ligand in copper(II)-carboxylate com...,Rare coordination behavior of triethanolamine ...
2,Ammonia nitrogen removal and oxygen net primar...,Total ammonia nitrogen removal and microbial c...
3,anaerobic baffled biofilm membrane bioreactor ...,Effects of microaeration and sludge recirculat...
4,the bioaccumulation of heavy metals in edible ...,Bioaccumulation of heavy metals in commerciall...
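
The table above compares generated and actual titles only by eye. As a rough, hedged way to put a number on that comparison, the sketch below computes a unigram-overlap F1 score between two strings. The function name `unigram_f1` and both example titles are made up for illustration; this is not an official ROUGE implementation and is not part of the original notebook.

```python
from collections import Counter

def unigram_f1(prediction, reference):
    """F1 over lowercased, whitespace-split tokens shared by both strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical title pair, not taken from predictions.csv
pred = "bioaccumulation of heavy metals in edible fish"
ref = "Bioaccumulation of heavy metals in commercially important fish"
score = unigram_f1(pred, ref)
print(round(score, 2))
# 0.8
```

Here 6 tokens are shared, giving precision 6/7 and recall 6/8. Applied row by row to `final_df`, a score like this could be averaged over the 2023 test set.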
