# Language model

In this notebook you will learn how to fine-tune a large language model for a specific task.
We will take advantage of the T5 model and we will fine tune to summarize daily news.

T5 is designed to treat all NLP (Natural language processing) tasks as text-to-text tasks. This means tasks like translation, summarization, classification, and question answering are all framed as converting input text to output text.

In [1]:
# imports
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers import BartTokenizer, BartForConditionalGeneration
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForSeq2Seq
from datasets import load_dataset, DatasetDict

# Load the CNN/Daily Mail dataset
# The CNN / DailyMail Dataset is an English-language dataset containing just over 300k 
# unique news articles as written by journalists at CNN and the Daily Mail
# Ref: https://huggingface.co/datasets/abisee/cnn_dailymail
# Load dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")


# Subsampling factor
# In order to train large language model like the one we wll use n the notebook we need to use gpu
# otherwise training will take too much, since we don't have it here we will train for just one epoch
# a very small subsample
factor = 500

# Function to subsample the dataset by the specified factor
def subsample_by_factor(dataset_split, factor):
    num_samples = len(dataset_split) // factor
    return dataset_split.shuffle(seed=42).select(range(num_samples))

# Apply subsampling to each split and keep the DatasetDict structure
dataset = DatasetDict({
    'train': subsample_by_factor(dataset['train'], factor),
    'validation': subsample_by_factor(dataset['validation'], factor),
    'test': subsample_by_factor(dataset['test'], factor)
})

In [2]:
# Load the model and tokenizer
model_name = "t5-small"  # You can use larger models like "t5-small" or "t5-large"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


The T5 model uses a standard encoder-decoder transformer architecture.

Encoder: Takes an input sequence and processes it to generate a sequence of hidden representations.
Decoder: Takes those hidden representations from the encoder and generates the output sequence (which could be a translation, summary, or any other text output depending on the task).

The T5 tokenizer is responsible for converting input text into a format that the T5 model can process. Specifically, it transforms text into token IDs (integers) that represent the individual subword units or tokens in the model's vocabulary. Likewise, it can convert the model's output (a sequence of token IDs) back into human-readable text. T5 uses a SentencePiece tokenizer as its base, which is a subword tokenization method.

# Dataset preparation

In [3]:
# Define the device (use 'cuda' if GPU is available)
# Note: training on cpu will take long time, just for demonstrative purposes here we will train on cpu
# but them we will use a model that we've previously trained on a gpu for more epochs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [4]:
# let's have a look to the dataset
train_data = dataset['train']

In [10]:
# Print the first few articles and their summaries
for i in range(3):  # Adjust the range to print more examples
    i = i + 60
    print(f"Article {i}:")
    print(train_data[i]['article'])
    print("\nSummary:")
    print(train_data[i]['highlights'])
    print("-" * 80)

Article 60:
By . Nick Harris for MailOnline . Stevan Jovetic lit up the Etihad Stadium on Monday evening, giving an eye-catching all-round performance for Manchester City that merited his two goals in the 3-1 win over Liverpool and easily deserved his man of the match award. But if you think the 24-year-old Montenegro striker is now destined to be a leading scorer as City romp to the Premier League title, his career to date suggests you better think again. Any strike force is only as strong as its weakest link, and there is no doubt Jovetic remains City‚Äôs weakest link among the first-team squad‚Äôs four strikers: Sergio Aguero, Edin Dzeko and Alvaro Negredo being the others. VIDEO Scroll down to watch Pellegrini hail Jovetic . Weakest link: Stevan Jovetic is a decent player but won't fire Manchester City to the title . Superstar: Sergio Aguero slots the ball home seconds after coming on against Liverpool . The real deal: Aguero is the main man at City and theit title hopes will hinge

In [11]:
# Tokenization function
def tokenize_function(examples):
    inputs = examples["article"]  # Input texts
    outputs = examples["highlights"]  # Target summaries

    # Tokenize and truncate inputs
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True, padding="max_length")

    # Tokenize and truncate outputs (labels)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(outputs, max_length=128, truncation=True, padding="max_length")

    # Set the labels
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [12]:
# Tokenize the dataset in a format compatible with our model
tokenized_dataset = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/574 [00:00<?, ? examples/s]



Map:   0%|          | 0/26 [00:00<?, ? examples/s]

Map:   0%|          | 0/22 [00:00<?, ? examples/s]

# Training

In [18]:
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=10,
    weight_decay=0.01,
    logging_dir="./logs",
    report_to="none",  # üëà THIS disables HF / Trackio / WandB / etc
)


### DataCollatorForSeq2Seq

The DataCollatorForSeq2Seq is a utility class provided by Hugging Face's transformers library, designed specifically for preparing data when fine-tuning or evaluating sequence-to-sequence (seq2seq) models, such as T5 or BART.

Its main role is to process a batch of input data (e.g., text pairs for translation, summarization, etc.) and prepare it for training or evaluation by handling tasks like padding and attention mask creation. This is crucial when using variable-length sequences in a model that requires fixed-size inputs for batched processing.

Key Functions of DataCollatorForSeq2Seq:
Padding:

In most NLP datasets, sentences have varying lengths. However, models expect all sequences in a batch to have the same length.
DataCollatorForSeq2Seq ensures that all input and target sequences are padded to the maximum length in the batch. Padding ensures that each sequence in the batch has the same length, which is necessary for batching inputs together efficiently.
Padding can be done dynamically (for each batch) rather than to a fixed length for the entire dataset, improving memory efficiency.
Attention Masks:

It generates attention masks that tell the model which tokens are actual input tokens and which are padding tokens. The attention mask is typically a binary mask where:
1 represents the actual tokens.
0 represents padding tokens.
This helps the model distinguish between real input data and padding so it can focus on relevant parts of the sequence during training.
Label Padding:

In sequence-to-sequence tasks, both the input sequence and the output sequence (target or labels) might have different lengths. DataCollatorForSeq2Seq ensures the labels (target sequences) are also padded to the same length in a batch.
It pads the labels with -100 instead of 0, which is the default token ID for padding. This tells the loss function (typically Cross Entropy Loss) to ignore the padded tokens when calculating the loss during training.
Optional Use with Mixed Precision:

DataCollatorForSeq2Seq can be configured to support mixed precision training by properly handling floating-point formats for improved training efficiency.
Support for Teacher Forcing:

In sequence-to-sequence models, the decoder often uses a technique called teacher forcing, where the ground truth output tokens are fed into the model during training, instead of using the model's own predictions. DataCollatorForSeq2Seq ensures that the target sequences (labels) are appropriately prepared for this during training.
Model-Specific Features:

For models like T5 or BART, which have an encoder-decoder structure, DataCollatorForSeq2Seq prepares both the input sequence (for the encoder) and the target sequence (for the decoder).
When generating text (e.g., during evaluation), it can be configured to work with specific decoding strategies such as beam search or greedy decoding.

In [19]:
# Define the data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)
                                       
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,  # Pass the tokenizer to the Trainer
    data_collator=data_collator,
)

# Train the model
trainer.train()

# Save the model
model.save_pretrained("./fine-tuned-bart-example")
tokenizer.save_pretrained("./fine-tuned-bart-example")

  trainer = Trainer(


Epoch,Training Loss,Validation Loss
1,No log,1.176687
2,1.130800,1.172639
3,1.130800,1.174018
4,1.043900,1.181456
5,1.043900,1.18317
6,0.971700,1.186799
7,0.938800,1.187496
8,0.938800,1.191059
9,0.912700,1.193481
10,0.912700,1.194493


('./fine-tuned-bart-example/tokenizer_config.json',
 './fine-tuned-bart-example/special_tokens_map.json',
 './fine-tuned-bart-example/spiece.model',
 './fine-tuned-bart-example/added_tokens.json')

# Testing

In this section we will test the model and see how it has improved in summarizing news

In [20]:
# Let's define some news as string and let's observe the different output of the original vs trained model

text1 = """

In a thrilling match last night, Thunderstrike FC secured a dramatic 3-2 win over their fierce rivals, Ironclad United, thanks to a stunning last-minute goal. The tense showdown took place at the packed Horizon Stadium, where fans were treated to an intense, back-and-forth contest.

Thunderstrike FC took the lead early in the first half with a header from striker Marco Alvarez in the 12th minute. However, Ironclad United quickly equalized, with midfielder Jordan Blake slotting home a powerful shot from outside the box just before halftime.

The second half saw both teams battling for dominance. Ironclad United went ahead in the 65th minute when defender Alan Knight converted a corner kick with a towering header. But Thunderstrike FC fought back, and their persistence paid off when winger Leon Hart leveled the score with a curling free kick in the 78th minute.

Just as the match seemed destined for a draw, Thunderstrike‚Äôs young sensation, Ethan Morales, delivered the decisive blow in stoppage time, calmly slotting the ball into the bottom corner after a brilliant solo run.

The win pushes Thunderstrike FC to the top of the league table, while Ironclad United faces a tough battle to recover in the coming weeks.

---

This fictional news story portrays an exciting football match without using real team or player names.
"""

In [21]:
# Example text to summarize
text2 = """
Powerful Storm Slams California, Bringing Heavy Rain and Flooding Concerns

A powerful storm has swept across California, unleashing torrential rains and strong winds, leaving thousands without power and prompting evacuation orders in several areas. Meteorologists have warned that the storm, which developed rapidly off the Pacific coast, could be one of the most intense in recent years.

The storm hit Northern California early Friday morning, with rainfall totals reaching over 4 inches in some regions. Coastal towns have been particularly hard-hit, with waves crashing over sea walls and flooding streets. Wind gusts exceeding 60 mph have downed trees and power lines, leaving many residents in the dark.

Inland areas are also experiencing severe flooding, especially in low-lying regions and areas near rivers. Emergency responders are on high alert as rising waters threaten homes and businesses. Evacuation orders have been issued for parts of Sonoma County, where floodwaters are nearing critical levels.

‚ÄúWe‚Äôre asking everyone to stay indoors and avoid any unnecessary travel,‚Äù said a state emergency spokesperson. ‚ÄúOur teams are working around the clock to restore power and assist those in need.‚Äù

Forecasters predict the storm will continue into the weekend, bringing additional rain and the potential for mudslides in areas recently affected by wildfires.
"""

# Print the string
print(text2)


Powerful Storm Slams California, Bringing Heavy Rain and Flooding Concerns

A powerful storm has swept across California, unleashing torrential rains and strong winds, leaving thousands without power and prompting evacuation orders in several areas. Meteorologists have warned that the storm, which developed rapidly off the Pacific coast, could be one of the most intense in recent years.

The storm hit Northern California early Friday morning, with rainfall totals reaching over 4 inches in some regions. Coastal towns have been particularly hard-hit, with waves crashing over sea walls and flooding streets. Wind gusts exceeding 60 mph have downed trees and power lines, leaving many residents in the dark.

Inland areas are also experiencing severe flooding, especially in low-lying regions and areas near rivers. Emergency responders are on high alert as rising waters threaten homes and businesses. Evacuation orders have been issued for parts of Sonoma County, where floodwaters are nearing 

### Original model

In [22]:
# Load the model and tokenizer
model_name = "t5-small"  # You can use larger models like "t5-small" or "t5-large"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

In [23]:
# Tokenize and summarize
inputs = tokenizer(text1, return_tensors="pt", max_length=1024, truncation=True)

In [24]:
summary_ids = model.generate(inputs["input_ids"], max_length=150, num_beams=4, early_stopping=True)

# Decode and print the summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)

in the 78th minute. Thunderstrike FC secured a dramatic 3-2 win over their fierce rivals, Ironclad United. the win pushes Thunderstrike FC to the top of the league table, while Ironclad United faces a tough battle to recover.


In [25]:
# Tokenize and summarize
inputs = tokenizer(text2, return_tensors="pt", max_length=1024, truncation=True)

summary_ids = model.generate(inputs["input_ids"], max_length=150, num_beams=4, early_stopping=True)

# Decode and print the summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)

, with rainfall totals reaching over 4 inches in some regions. Coastal towns have been particularly hard-hit, with rain totals reaching over 4 inches in some regions. Coastal towns have been particularly hard-hit, with waves crashing over sea walls and flooding streets.


### Trained model

In [27]:
# Load the fine-tuned model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("./fine-tuned-bart-example")
tokenizer = T5Tokenizer.from_pretrained("./fine-tuned-bart-example")

In [28]:
# Tokenize and summarize
inputs = tokenizer(text1, return_tensors="pt", max_length=1024, truncation=True)

summary_ids = model.generate(inputs["input_ids"], max_length=150, num_beams=4, early_stopping=True)

# Decode and print the summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)

Thunderstrike FC secured a dramatic 3-2 win over their fierce rivals, Ironclad United. The win pushes Thunderstrike FC to the top of the league table.


In [29]:
# Tokenize and summarize
inputs = tokenizer(text2, return_tensors="pt", max_length=1024, truncation=True)

summary_ids = model.generate(inputs["input_ids"], max_length=150, num_beams=4, early_stopping=True)

# Decode and print the summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)

Meteorologists warn the storm could be one of the most intense in recent years. The storm hit Northern California early Friday morning, with rainfall totals reaching over 4 inches. Inland areas are also experiencing severe flooding, especially in low-lying regions. Emergency responders are on high alert as rising waters threaten homes and businesses.


The model has greatly improved, it has learned to create finite sentences and report main concepts of the articles.

# Summarizing News from the web

In this section we will lean how to use the llm model to summarize news from the web

In [33]:
# fetching news from some urls and ask the model to summarize
from newspaper import Article
import requests

# Load the fine-tuned model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("./fine-tuned-bart-example")
tokenizer = T5Tokenizer.from_pretrained("./fine-tuned-bart-example")

def fetch_news(url):
    """Fetches a news article from a given URL."""
    article = Article(url)
    article.download()
    article.parse()
    return article.text

def summarize_article(article_text):
    """Summarizes the given article text using the fine-tuned model."""
    inputs = tokenizer(article_text, return_tensors="pt", max_length=1024, truncation=True)
    summary_ids = model.generate(inputs["input_ids"], max_length=150, num_beams=4, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# List of news URLs to summarize (you can modify this list)
news_urls = [
    "https://www.nature.com/articles/d41586-024-03327-z",
    "http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/",
    #"https://www.cnn.com/2024/10/06/world/your-news-article-url/index.html",  # Example URL
    # Add more news article URLs as needed
]

# Loop through URLs, fetch and summarize each article
for url in news_urls:
    try:
        print(f"Fetching news from: {url}")
        print("---------------------------------")
        article_text = fetch_news(url)
        print(f"Article:\n{article_text}\n")
        print("---------------------------------")
        summary = summarize_article(article_text)
        print(f"Summary:\n{summary}\n")
        print("----------------------------------------------------------------------------------------------")
    except Exception as e:
        print(f"Error processing {url}: {e}")

Fetching news from: https://www.nature.com/articles/d41586-024-03327-z
---------------------------------
Article:
Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

---------------------------------
Summary:
Thank you for visiting nature.com. You are using a browser version with limited support for CSS.

----------------------------------------------------------------------------------------------
Fetching news from: http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/
---------------------------------
Article:
By Leigh Ann Caldwell

WASHINGTON (CNN) ‚Äî Not everyone subscribes to a New Year‚Äôs resolution, but Americans will be required to follow new laws in 2014.

Some 40,000 me

In the previous section we have passed some urls containing news to the model to summarize them.
In the next section we will do a live query by combaning known libraries as Beautiful Soup that is a library that makes it easy to scrape information from web pages and GoogleNews that allows to search for google news about a specific subject.

In [34]:
import requests
from bs4 import BeautifulSoup
from GoogleNews import GoogleNews
from urllib.parse import urlparse, parse_qs

# Function to clean the URL
def clean_url2(url):
    parsed_url = urlparse(url)
    cleaned_url = parsed_url.scheme + "://" + parsed_url.netloc + parsed_url.path
    return cleaned_url

def clean_url(url):
    # First, extract the part of the URL that contains the actual article link
    parsed_url = urlparse(url)
    query_params = parse_qs(parsed_url.query)

    # Check if the URL contains a 'url' or 'q' query parameter (common in Google News links)
    if 'url' in query_params:
        cleaned_url = query_params['url'][0]
    elif 'q' in query_params:
        cleaned_url = query_params['q'][0]
    else:
        # If no such query parameters, use the original URL
        cleaned_url = url

    # Now, truncate the URL at the first '&' symbol, if present
    cleaned_url = cleaned_url.split('&')[0]  # Truncate at the first '&'

    return cleaned_url

# Function to search Google News
def search_google_news(query, language='en', pages=1):
    googlenews = GoogleNews(lang=language)
    googlenews.search(query)

    all_articles = []
    for page in range(1, pages + 1):
        googlenews.getpage(page)
        all_articles.extend(googlenews.result())

    return all_articles

def get_link_text(url):
    return clean_url(url)
    
# Function to fetch article text from a URL
def get_article_text(url):
    url = get_link_text(url)  # Clean the URL before making the request
    try:
        # Send a GET request to the URL
        response = requests.get(url)
        response.raise_for_status()  # Check for HTTP errors

        # Parse the page content with BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract the main content (this can vary by website, so we target common tags)
        paragraphs = soup.find_all('p')  # Get all <p> tags, common for article text
        article_text = ' '.join([para.get_text() for para in paragraphs])

        return article_text
    except Exception as e:
        print(f"Failed to fetch the article text: {e}")
        return None

# Example usage
query = "AI technology"
news_articles = search_google_news(query)

urls = []
# Display results
for idx, article in enumerate(news_articles, 1):
    print(f"{idx}. {article['title']}")
    print(f"Source: {article['media']}")
    print(f"Published: {article['date']}")
    print(f"Link: {article['link']}\n")
    urls.append(get_link_text(article['link']))
    # print(f"Text: {get_article_text(article['link'])}\n")

1. Matter announces India‚Äôs first AI platform for next-gen electric two-wheeler motorcycles: All details
Source: Times of India
Published: 5 minutes ago
Link: https://timesofindia.indiatimes.com/technology/tech-news/matter-announces-indias-first-ai-platform-for-next-gen-electric-two-wheeler-motorcycles-all-details/articleshow/127003232.cms&ved=2ahUKEwjKrd3c8JySAxVWOPsDHXZ8G20QxfQBegQIAhAC&usg=AOvVaw3zNw-E9obfhqJhDCKWWjFn

2. Politically toxic and‚Ä¶: Why space has become an essential frontier in the AI race for Elon Musk and other
Source: Times of India
Published: 7 minutes ago
Link: https://timesofindia.indiatimes.com/technology/tech-news/politically-toxic-and-why-space-has-become-an-essential-frontier-in-the-ai-race-for-elon-musk-and-other-tech-ceos/articleshow/127003400.cms&ved=2ahUKEwjKrd3c8JySAxVWOPsDHXZ8G20QxfQBegQIBhAC&usg=AOvVaw3vnREynY4Z7Vuwu0rg1QcO

3. From AI to chips: Ashwini Vaishnaw unveils India‚Äôs big tech playbook at Davos
Source: Northeast Herald
Published: 0 minut

In [35]:
# Loop through URLs, fetch and summarize each article
for article in news_articles[:5]:
    try:
        print(f"Fetching news from: {article['link']}")
        print("---------------------------------")
        article_text = get_article_text(article['link'])
        print(f"Article:\n{article['title']}\n")
        print(f"Published: {article['date']}")
        print("---------------------------------")
        summary = summarize_article(article_text)
        print(f"Summary:\n{summary}\n")
        print("----------------------------------------------------------------------------------------------")
    except Exception as e:
        print(f"Error processing {url}: {e}")
    print("---------------------------------------------------------------------------------")
    print("---------------------------------------------------------------------------------")
    print("---------------------------------------------------------------------------------")
    print("---------------------------------------------------------------------------------")

Fetching news from: https://timesofindia.indiatimes.com/technology/tech-news/matter-announces-indias-first-ai-platform-for-next-gen-electric-two-wheeler-motorcycles-all-details/articleshow/127003232.cms&ved=2ahUKEwjKrd3c8JySAxVWOPsDHXZ8G20QxfQBegQIAhAC&usg=AOvVaw3zNw-E9obfhqJhDCKWWjFn
---------------------------------
Article:
Matter announces India‚Äôs first AI platform for next-gen electric two-wheeler motorcycles: All details

Published: 5 minutes ago
---------------------------------
Summary:
17,999 8,999 17,434 68,999 74,999 9,249

----------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------
---------------------------------------------------------------------------------
---------------------------------------------------------------------------------
---------------------------------------------------------------------------------
Fetching news from: https://timeso