
# Summarizing articles using T5

In this notebook, we use a pre-trained [T5 model](https://huggingface.co/t5-small) from Hugging Face to generate summaries of long-form text data (like newspaper articles). We apply it to a DataFrame containing enhanced metadata with the scraped content and export the resulting summaries. You can choose this summary over the NLP-based one for future analysis or categorisation of the collection content.

Running this notebook takes time. Please make sure your computer is charging, your internet connection is stable, and sleep mode is disabled.



##  Import required libraries

We begin by importing the libraries we need:
- `pandas` for handling tabular data
- `transformers` from Hugging Face to load and use the T5 model
- `tqdm` for progress bars during processing
- `torch` as the backend framework for the T5 model


In [None]:

import pandas as pd
from transformers import T5Tokenizer, T5ForConditionalGeneration
from tqdm.notebook import tqdm
import torch



## Load the T5 Model and Tokenizer

We initialise the T5 model and tokenizer. 
We use the `"t5-small"` version for speed and simplicity, though larger versions (like `"t5-base"` or `"t5-large"`) could give better results at the cost of more memory.


In [None]:

# Initialise T5 model and tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")



##  Load the Dataset

We load the DataFrame from a CSV file (`data/AoT5_enhanced.csv` in the cell below),
which is expected to contain a column named `full_text` with the raw article text
and a `url` column for reference. The dataset also contains columns with the parsed title, keywords, and summaries.


In [None]:

# Load the DataFrame with pre-parsed full text
df = pd.read_csv('data/AoT5_enhanced.csv')
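
Since the later cells rely on the `full_text` and `url` columns, it can help to check for them right after loading. The following is a minimal sketch on a made-up DataFrame; the helper name `missing_columns` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def missing_columns(frame, required):
    """Return the required column names that are absent from frame."""
    return [name for name in required if name not in frame.columns]


# Made-up stand-in for the loaded CSV; it deliberately lacks a 'url' column.
sample = pd.DataFrame({"full_text": ["Some article text."], "title": ["A title"]})

absent = missing_columns(sample, ["full_text", "url"])
if absent:
    print(f"Missing expected columns: {absent}")
```

Running such a check on `df` before the summarisation loop would surface a wrong file or a renamed column early, rather than partway through a long run.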



## Ensure text column is string type

Before processing, we make sure that the `full_text` column is properly cast to string type,
since the T5 tokenizer expects text input.
Missing or NaN values are first replaced with empty strings, because a plain cast can turn them into the string `"nan"`,
which the emptiness check in our loop would not recognise. Empty strings are skipped later in the loop.


In [None]:

# Replace missing values with empty strings, then cast 'full_text' to string type
df['full_text'] = df['full_text'].fillna('').astype(str)
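
To gauge how many rows carry no usable text before starting the long-running step, missing values can be replaced with empty strings and then counted. This is a minimal sketch on made-up data; the variable names are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value and one whitespace-only value.
texts = pd.Series(["An article.", np.nan, "   "], dtype=object)

# Fill first, then cast, so a missing value ends up as an empty string.
cleaned = texts.fillna("").astype(str)

# A row counts as empty when nothing remains after stripping whitespace.
empty_mask = cleaned.str.strip() == ""
print(f"{int(empty_mask.sum())} of {len(cleaned)} rows have no usable text")
```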



##  Define function to process text with T5

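We define a function `process_with_t5()` that:
1. Adds the `"summarize: "` prefix expected by T5.
2. Tokenizes the input and truncates it to 1024 tokens.
3. Generates a summary of 40 to 150 tokens with beam search decoding (4 beams).
4. Returns the summary together with a placeholder string for keywords.

Note: T5 is not designed for keyword extraction, so the second return value is only a placeholder.
If an error occurs, the function prints it and returns the string `"Error in T5 processing"` for both values.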


In [None]:

# Function to generate a summary (plus a keyword placeholder) using T5
def process_with_t5(text):
    try:
        # Summarization
        input_text = "summarize: " + text
        inputs = tokenizer.encode(input_text, return_tensors="pt", max_length=1024, truncation=True)
        outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
        summary = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Keyword Extraction
        keywords = "Keywords extraction not implemented for T5"

        return summary, keywords
    except Exception as e:
        print(f"Error processing text with T5: {e}")
        return "Error in T5 processing", "Error in T5 processing"



##  Loop through the dataset and generate summaries

We loop through each row of the DataFrame and:
- Skip empty or invalid text values
- Use the `process_with_t5()` function to generate a summary
- Append the results (summary and placeholder keywords) to new lists

We track progress with `tqdm` for a visual progress bar.


In [None]:

# Apply T5 on the pre-parsed full text
summaries = []
keywords = []
urls = []

with tqdm(total=len(df), desc="Processing with T5", unit="text") as pbar:
    for index, row in df.iterrows():
        full_text = row['full_text']
        
        if pd.isna(full_text) or not isinstance(full_text, str) or full_text.strip() == "":
            summary = "No text available"
            keyword = "No keywords available"
        else:
            summary, keyword = process_with_t5(full_text)
        
        summaries.append(summary)
        keywords.append(keyword)
        urls.append(row['url'])
        pbar.update(1)
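
The skip condition in the loop can also be expressed as a small standalone helper, which makes it easier to test. This is a sketch only: the name `is_usable_text` is hypothetical, and treating the literal string `"nan"` as missing is an assumption (it presumes that no real article consists solely of that word).

```python
import math


def is_usable_text(value):
    """Return True only for non-empty strings that are not the literal 'nan'."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    if not isinstance(value, str):
        return False
    stripped = value.strip()
    # Assumption: a cell containing only "nan" came from a missing value.
    return stripped != "" and stripped.lower() != "nan"


for candidate in ["A real article.", "", "   ", "nan", None, float("nan")]:
    print(repr(candidate), is_usable_text(candidate))
```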



##  Add generated summaries to the DataFrame

We add one new column to the original DataFrame:
- `LLM summary`: the generated summary

The 'LLM' prefix indicates the values were generated using a language model (T5).


In [None]:

df['LLM summary'] = summaries
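
Because skipped rows and failed calls are recorded as fixed strings, it is possible to count them once the column exists. The sketch below uses a made-up DataFrame; the constant name `PLACEHOLDERS` is illustrative, while the two strings are the ones used in the cells above.

```python
import pandas as pd

# Sentinel strings written by the loop and by process_with_t5() on failure.
PLACEHOLDERS = ["No text available", "Error in T5 processing"]

# Made-up stand-in for the real results.
sample = pd.DataFrame(
    {"LLM summary": ["A short summary.", "No text available", "Error in T5 processing"]}
)

# Flag rows whose summary is one of the sentinel strings.
is_placeholder = sample["LLM summary"].isin(PLACEHOLDERS)
print(sample.loc[is_placeholder, "LLM summary"].value_counts())
print(f"{int((~is_placeholder).sum())} rows have a generated summary")
```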


##  Ensure title and keyword columns exist

As a safety check, we ensure the `title` and `keywords` columns exist.
If either is missing, we add it as an empty column so that the exported file keeps a consistent set of columns.


In [None]:

# Add an empty column for any expected column that is missing
if 'title' not in df.columns:
    df['title'] = pd.NA
if 'keywords' not in df.columns:
    df['keywords'] = pd.NA



## 📊 Preview the DataFrame

We print the first few rows and the shape to inspect the result, then display the full DataFrame with all columns and rows visible. For a large collection, displaying every row can be slow.


In [None]:

print(df.head())
print(df.shape)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
df
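
Display options set with `pd.set_option` stay in force for every later cell. A scoped alternative is pandas' `option_context`, sketched here on a made-up DataFrame; the row limit of 20 is an arbitrary choice for illustration.

```python
import pandas as pd

sample = pd.DataFrame({"title": ["A", "B", "C"], "LLM summary": ["x", "y", "z"]})

rows_before = pd.get_option("display.max_rows")

# Inside the block all columns are shown; frames longer than 20 rows are truncated.
with pd.option_context("display.max_columns", None, "display.max_rows", 20):
    print(sample)

# Outside the block the previous setting is back in force.
rows_after = pd.get_option("display.max_rows")
```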



## 💾 Export the final data

Finally, we export the updated DataFrame, which now includes the generated summaries, to a CSV file.
Change the path in the cell below to your own destination folder and file name.

In [None]:

df.to_csv('Aot_enhanced_with_LLM_summaries.csv', index=False)
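
When exporting to a folder of your own, the folder has to exist before `to_csv` is called. The sketch below creates it with `pathlib` and reads the file back as a check; the folder name `output`, the file name `summaries.csv`, and the sample data are all hypothetical, and a temporary directory stands in for a real destination.

```python
import tempfile
from pathlib import Path

import pandas as pd

sample = pd.DataFrame(
    {"url": ["https://example.org/a"], "LLM summary": ["A short summary."]}
)

# A temporary directory stands in for your own destination folder.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "output"
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    out_path = out_dir / "summaries.csv"
    sample.to_csv(out_path, index=False)

    # Read the file back to confirm that it was written as expected.
    restored = pd.read_csv(out_path)
    print(restored.shape)
```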
