# T5 pretrained transformer model - News Headline Extraction
* Notebook by Adam Lang
* Date: 10/1/2024

# Overview
* In this notebook we will use a pre-trained T5 transformer model to generate newspaper headlines from article text.
* T5 is the Text-to-Text Transfer Transformer from Google, available through Hugging Face.

# Model and Task
* We will use a T5 model that has been trained specifically for newspaper headline generation: it takes article text as input and generates a headline.
* model card: https://huggingface.co/Michau/t5-base-en-generate-headline

## Imports

In [1]:
##data sci standard imports
import pandas as pd
import numpy as np

## NLP imports
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')
import gensim
from gensim.models import Word2Vec

## pytorch and ML imports
import torch
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, RandomSampler
from sklearn.model_selection import train_test_split

## huggingface imports - T5 model
from transformers import T5ForConditionalGeneration, T5Tokenizer

##other imports
from collections import Counter
from tqdm import tqdm

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [2]:
## PyTorch device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

## Load Dataset

In [3]:
## data path
data_path = '/content/drive/MyDrive/Colab Notebooks/Deep Learning Notebooks/NLP_deep_learning/BERT_transformers/news_summary.csv'
## load csv
dataset = pd.read_csv(data_path, encoding='latin-1')
dataset.head()

| Unnamed: 0 | headlines | text |
|---|---|---|
| 0 | upGrad learner switches to career in ML & AI w... | Saurav Kant, an alumnus of upGrad and IIIT-B's... |
| 1 | Delhi techie wins free food from Swiggy for on... | Kunal Shah's credit card bill payment platform... |
| 2 | New Zealand end Rohit Sharma-led India's 12-ma... | New Zealand defeated India by 8 wickets in the... |
| 3 | Aegon life iTerm insurance plan helps customer... | With Aegon Life iTerm Insurance plan, customer... |
| 4 | Have known Hirani for yrs, what if MeToo claim... | Speaking about the sexual harassment allegatio... |


## Train, Test, and Validation Datasets

In [4]:
## hold out 20% of the data as the test set
train_dataset, test_dataset = train_test_split(dataset, shuffle=True, test_size=0.2, random_state=42)
## carve the validation set out of the remaining training data (10% of it)
train_dataset, val_dataset = train_test_split(train_dataset, shuffle=True, test_size=0.1, random_state=42)


## print size of datasets
print(f"Train set size {len(train_dataset)}")
print(f"Validation set size: {len(val_dataset)}")
print(f"Test set size: {len(test_dataset)}")

Train set size 70848
Validation set size: 7872
Test set size: 19681
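
The printed sizes can be reproduced with a little arithmetic. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows and that the training split receives the remainder, which is consistent with the sizes printed above. The helper name `split_sizes` is hypothetical and is not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the train split gets the rest.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# Total number of rows, recovered by summing the three printed sizes.
n_total = 70848 + 7872 + 19681

# First split: 20% of all rows go to the test set.
n_train_val, n_test = split_sizes(n_total, 0.2)

# Second split: 10% of the *remaining* rows go to the validation set.
n_train, n_val = split_sizes(n_train_val, 0.1)

print(n_train, n_val, n_test)  # 70848 7872 19681
```

Because the second split is applied to the 80% that remains, the validation set is roughly 8% of the full dataset, so the effective split is about 72/8/20 rather than 70/10/20.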


## Load T5 model from Huggingface
* This model was trained on 500k news articles to generate news headlines.

In [6]:
## load model
model = T5ForConditionalGeneration.from_pretrained('Michau/t5-base-en-generate-headline')

## load tokenizer
tokenizer = T5Tokenizer.from_pretrained('Michau/t5-base-en-generate-headline',
                                        clean_up_tokenization_spaces=True,
                                        legacy=False)

## move model to device
model = model.to(device)

## Function to Generate Headlines
* The function wraps tokenization, generation, and decoding in a single call.

In [27]:
## function to generate headlines
def gen_headlines(text):
  """Generate a headline for one article using beam search."""
  ## encode text; prepend the "headline: " task prefix and truncate long inputs
  encoding = tokenizer.encode_plus("headline: " + text, max_length=1024, return_tensors='pt',
                                   truncation=True)
  ## input ids
  input_ids = encoding['input_ids'].to(device) ## to device
  ## attention masks
  attention_masks = encoding['attention_mask'].to(device) ## to device
  ## outputs; note that min_length=50 forces every headline to be at least 50 tokens long
  outputs = model.generate(input_ids = input_ids, attention_mask = attention_masks,
                           max_length=100, min_length=50,length_penalty=2.0,
                           num_beams=3,
                           early_stopping=True)

  ## decode the best beam back to text
  return tokenizer.decode(outputs[0], skip_special_tokens=True)
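
The reference headlines in this dataset are short, while `min_length=50` requires at least 50 generated tokens, which may contribute to long, repetitive predictions. The variant below is a minimal sketch of how the length settings could be exposed as parameters for experimentation. The function name `gen_headline_short` and the default values (`max_length=32`, `min_length=5`, `length_penalty=1.0`) are illustrative assumptions and have not been tuned on this dataset. The model, tokenizer, and device are passed in explicitly so the function does not rely on notebook globals.

```python
def gen_headline_short(text, model, tokenizer, device="cpu",
                       max_input_length=1024, max_length=32, min_length=5,
                       length_penalty=1.0, num_beams=3):
    """Generate a headline with shorter, headline-sized length limits.

    All default values are illustrative assumptions, not tuned settings.
    """
    # Tokenize with the same "headline: " task prefix used in the notebook.
    encoding = tokenizer("headline: " + text, max_length=max_input_length,
                         truncation=True, return_tensors="pt")
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)

    # Beam search with a low minimum length so short headlines are allowed.
    outputs = model.generate(input_ids=input_ids,
                             attention_mask=attention_mask,
                             max_length=max_length,
                             min_length=min_length,
                             length_penalty=length_penalty,
                             num_beams=num_beams,
                             early_stopping=True)

    # Decode the highest-scoring beam back to text.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

With the objects defined in this notebook, it could be called as `gen_headline_short(article_text, model, tokenizer, device)`, where `article_text` is any string from the `text` column. Comparing METEOR scores for a few settings on the validation set would show whether the shorter limits actually help.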

## Evaluation Function
* A function that evaluates the first n rows of the test set (the rows were already shuffled by the split above).
* We will use the METEOR score from NLTK to evaluate the generated text.
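
To make the metric more concrete, here is a simplified sketch of the core METEOR computation. It is not the NLTK implementation: it matches lowercased tokens exactly only (NLTK also uses stemming and WordNet synonyms) and uses a simple greedy alignment. The function name `simple_meteor` is hypothetical, and the defaults `alpha=0.9`, `beta=3.0`, `gamma=0.5` are taken to mirror NLTK's defaults.

```python
def simple_meteor(reference_tokens, hypothesis_tokens,
                  alpha=0.9, beta=3.0, gamma=0.5):
    """Exact-match-only METEOR sketch (no stemming, no synonyms)."""
    ref = [t.lower() for t in reference_tokens]
    hyp = [t.lower() for t in hypothesis_tokens]

    # Greedy alignment: pair each hypothesis token with the first
    # still-unused identical reference token.
    used = [False] * len(ref)
    matches = []  # (hypothesis_index, reference_index), ordered by hypothesis index
    for h_idx, h_tok in enumerate(hyp):
        for r_idx, r_tok in enumerate(ref):
            if not used[r_idx] and h_tok == r_tok:
                used[r_idx] = True
                matches.append((h_idx, r_idx))
                break

    m = len(matches)
    if m == 0:
        return 0.0

    # Recall-weighted harmonic mean of unigram precision and recall.
    precision = m / len(hyp)
    recall = m / len(ref)
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)

    # A chunk is a run of matches that is contiguous in both sentences;
    # fewer chunks means the word order is better preserved.
    chunks = 1
    for (h_prev, r_prev), (h_cur, r_cur) in zip(matches, matches[1:]):
        if h_cur != h_prev + 1 or r_cur != r_prev + 1:
            chunks += 1

    penalty = gamma * (chunks / m) ** beta
    return f_mean * (1 - penalty)


tokens = "india wins the match".split()
print(simple_meteor(tokens, tokens))           # 0.9921875
print(simple_meteor(tokens, "rain expected tomorrow".split()))  # 0.0
```

Note that even an identical four-word hypothesis scores slightly below 1 under this formula, because the fragmentation penalty is never exactly zero. Since recall is weighted much more heavily than precision, the score rewards covering the reference words, while the penalty discourages scattered matches.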

In [28]:
## eval function
def eval_random_test(n=10):
  """Generate headlines for the first n rows of the shuffled test set and print the METEOR score for each."""
  for i in range(n):
    print(i)
    ## sample to evaluate
    eval_sample = test_dataset.iloc[i:i+1, :]
    print('news_article > ', eval_sample['text'].iloc[0])
    ## headline
    headline = eval_sample['headlines'].iloc[0]
    print('original_headline = ', headline)
    ## output sentence - use gen_headlines function
    output_sentence = gen_headlines(eval_sample['text'].iloc[0])
    print('predicted_headline < ', output_sentence)
    ## compute meteor score on text generated
    print(f"meteor score: {nltk.translate.meteor_score.single_meteor_score(headline.split(), output_sentence.split())}")


In [29]:
## test results
eval_random_test(n=10)

0
news_article >  Students in Karnataka will get extra marks if their parents cast votes in the upcoming assembly elections, the Associated Management of Primary and Secondary Schools has announced. The "Encouraging Marks" will be added in the 2018-19 academic year. The association said, "After casting their votes, parents can visit member schools...and confirm that they voted by showing the indelible ink mark."
original_headline =  K'taka students to get extra marks if parents vote in polls
predicted_headline <  "Encouraging Marks" to be added in 2018-19 Academic Year - Associated Management of Primary and Secondary Schools, Karnataka, Says Associated Management of Primary and Secondary Schools (AMSS) & AMS.
meteor score: 0.078125
1
news_article >  Syrian anti-aircraft defences on Monday shot down missiles over two air bases, Syria's state media said. The missiles targeted Shayrat air base in the Homs province and another base northeast of the capital Damascus. This comes days after t

# Summary
* METEOR scores range from 0 to 1; scores closer to 1 indicate generated text that is closer to the reference.
* A few headlines here reach METEOR scores of 0.5 or 0.6, which is not bad for an out-of-the-box run of this pre-trained model.
  * Adjusting the generation hyperparameters (e.g. num_beams, max_length, min_length) does improve the outputs somewhat, but the effect may vary with the dataset.