<font size="5" color="red"><b>News Summarization Project</b></font>


<font size="4">Welcome to the News Summarization project! In this project, we aim to automatically generate concise and informative summaries of news articles using Natural Language Processing (NLP), by fine-tuning a pre-trained Large Language Model (LLM).</font>

Project Overview:

- Objective: Create an automatic news summarization tool using the T5 model.
- Steps: Data Preparation, Preprocessing, Model Fine-Tuning, and Model Deployment.
- Benefits: Save time by getting key points from news articles without reading the entire content.

Usage:

1. Load news articles dataset.
2. Preprocess dataset and create a custom dataset.
3. Fine-tune T5 model for news summarization.
4. Deploy the model using Gradio for interactive summarization.

Feel free to customize hyperparameters and experiment for the best results.
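
One way to make that experimentation easier is to collect the tunable values in a single place. The sketch below is only an illustration: `TrainConfig` and its field names are hypothetical and not part of this notebook, while the default values mirror the ones hard-coded in the cells that follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container; defaults mirror the values used in this notebook."""
    model_name: str = "t5-base"
    sample_size: int = 1000       # rows sampled from the full dataset
    seed: int = 42                # random_state used for sampling
    source_len: int = 500         # max tokens kept from each article
    summ_len: int = 125           # max tokens kept from each summary
    batch_size: int = 10
    learning_rate: float = 3e-5
    epochs: int = 3


# Baseline configuration, matching the notebook
config = TrainConfig()

# Override only the values under test; everything else keeps its default
experiment = TrainConfig(learning_rate=1e-4, epochs=5)
print(config)
print(experiment)
```

Freezing the dataclass keeps a run's settings from being modified by accident halfway through the notebook.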

Let's start building an automated news summarization tool!

In [1]:
#Install required libraries
! pip install -q transformers accelerate sentencepiece gradio

In [2]:
# Importing stock libraries
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

# Importing the T5 modules from huggingface/transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration




In [3]:
# Import the 'cuda' module from the 'torch' library to check if GPU (CUDA) support is available
from torch import cuda
# The 'device' variable now indicates whether the code will run on GPU ('cuda') or CPU ('cpu')
# You can use this 'device' variable to move tensors and models to the appropriate device for computation
device = 'cuda' if cuda.is_available() else 'cpu'

In [4]:
# Read the CSV file 'news_summary.csv' from the specified path and use 'latin-1' encoding
df = pd.read_csv('/kaggle/input/news-summary/news_summary.csv',encoding='latin-1')
# Select only the 'text' and 'ctext' columns from the DataFrame
df = df[['text','ctext']]
# Prepend 'summarize: ' to the 'ctext' column in the DataFrame
df.ctext = 'summarize: ' + df.ctext

In [5]:
# Create a training dataset by randomly sampling 1000 rows from the DataFrame 'df'
# The 'random_state' parameter ensures reproducibility of the random sampling
# Reset the index of the sampled dataset and drop the previous index column
train_dataset=df.sample(1000, random_state = 42).reset_index().drop('index', axis=1)
# Drop any rows containing missing values (NaN) from the training dataset
train_dataset = train_dataset.dropna()

In [6]:
print("FULL Dataset: {}".format(df.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))

FULL Dataset: (4514, 2)
TRAIN Dataset: (976, 2)
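
The training set has 976 rows rather than 1000 because rows with missing values are dropped after sampling. The toy sketch below uses plain Python lists instead of pandas to show how the order of the two steps changes the final size. The helper names and the toy rows are hypothetical.

```python
import random


def is_complete(row):
    """A row is usable only if both the summary and the article are present."""
    return row["text"] is not None and row["ctext"] is not None


def sample_then_drop(rows, n, seed):
    """Order used in this notebook: sample n rows, then drop incomplete ones."""
    sampled = random.Random(seed).sample(rows, n)
    return [row for row in sampled if is_complete(row)]


def drop_then_sample(rows, n, seed):
    """Alternative order: drop incomplete rows first, then sample exactly n."""
    complete = [row for row in rows if is_complete(row)]
    return random.Random(seed).sample(complete, n)


# Toy data: every 10th row is missing its article body
rows = [
    {"text": f"summary {i}", "ctext": None if i % 10 == 0 else f"article {i}"}
    for i in range(200)
]

after_sampling = sample_then_drop(rows, 50, seed=42)
before_sampling = drop_then_sample(rows, 50, seed=42)
print(len(after_sampling), len(before_sampling))
```

Dropping first always yields exactly the requested number of rows, whereas sampling first can yield fewer.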


In [7]:
# Create a tokenizer instance using the pre-trained 't5-base' tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-base")


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [8]:
# Custom dataset class that tokenizes each DataFrame row so a DataLoader can feed it to the model for fine-tuning

class CustomDataset(Dataset):

    def __init__(self, dataframe, tokenizer, source_len, summ_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.source_len = source_len
        self.summ_len = summ_len

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, index):
        ctext = str(self.data.iloc[index]['ctext'])
        
        text = str(self.data.iloc[index]['text'])

        # padding='max_length' replaces the deprecated pad_to_max_length=True
        source = self.tokenizer.batch_encode_plus([ctext], max_length=self.source_len, padding='max_length', return_tensors='pt', truncation=True)
        target = self.tokenizer.batch_encode_plus([text], max_length=self.summ_len, padding='max_length', return_tensors='pt', truncation=True)

        source_ids = source['input_ids'].squeeze()
        source_mask = source['attention_mask'].squeeze()
        target_ids = target['input_ids'].squeeze()
        target_mask = target['attention_mask'].squeeze()

        return {
            'source_ids': source_ids.to(dtype=torch.long), 
            'source_mask': source_mask.to(dtype=torch.long), 
            'target_ids': target_ids.to(dtype=torch.long),
            'target_ids_y': target_ids.to(dtype=torch.long)
        }
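
To make the effect of `max_length`, padding and truncation concrete, here is a simplified pure-Python sketch of what happens to a single sequence of token ids. It is not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper and the token ids are made up. The only values taken from T5 are the pad id (0) and the end-of-sequence id (1).

```python
PAD_ID = 0  # T5's pad token id
EOS_ID = 1  # T5's end-of-sequence token id


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID, eos_id=EOS_ID):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    # Keep at most max_length - 1 tokens so there is always room for EOS
    ids = list(token_ids[: max_length - 1]) + [eos_id]
    mask = [1] * len(ids)            # 1 marks real tokens
    n_pad = max_length - len(ids)    # number of padding slots still to fill
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# A short sequence is padded up to max_length
short_ids, short_mask = pad_or_truncate([100, 200, 300], max_length=6)
print(short_ids, short_mask)  # [100, 200, 300, 1, 0, 0] [1, 1, 1, 1, 0, 0]

# A long sequence is cut down to max_length
long_ids, long_mask = pad_or_truncate(list(range(10, 20)), max_length=6)
print(long_ids, long_mask)    # [10, 11, 12, 13, 14, 1] [1, 1, 1, 1, 1, 1]
```

Because every item has the same length, the DataLoader can stack items into a batch tensor without a custom collate function.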

In [9]:
# Create a training dataset using the CustomDataset class
training_set = CustomDataset(train_dataset, tokenizer, 500, 125)

In [10]:
# Create a training data loader using the DataLoader class
# Each batch will contain 10 samples, and the order of samples is shuffled
# The 'num_workers' parameter controls the number of parallel data loading processes (set to 0 for single-process loading)
training_loader = DataLoader(training_set, batch_size = 10, shuffle =  True, num_workers = 0)

In [11]:
# Create a T5 model instance using the pre-trained 't5-base' model
model = T5ForConditionalGeneration.from_pretrained("t5-base")
# Move the model to the specified device (GPU if available, else CPU)
model = model.to(device)


In [12]:
# Create an optimizer instance using the Adam optimizer
optimizer = torch.optim.Adam(params =  model.parameters(), lr= 3e-5)

In [13]:
# Define the training function. It is called once per epoch.
# The model is put into train mode, then each batch from the training loader is passed through the network.

def train(epoch, tokenizer, model, device, loader, optimizer):
    model.train()
    for _,data in enumerate(loader, 0):
        y = data['target_ids'].to(device, dtype = torch.long)
        y_ids = y[:, :-1].contiguous()
        lm_labels = y[:, 1:].clone().detach()
        lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
        ids = data['source_ids'].to(device, dtype = torch.long)
        mask = data['source_mask'].to(device, dtype = torch.long)

        outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, labels=lm_labels)
        loss = outputs[0]
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f'training of epoch {epoch} ended with loss = {loss.item()}')
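
The slicing in `train` implements teacher forcing: the decoder receives the target without its last token and is trained to predict the target without its first token. Padding positions in the labels are set to -100, the index that PyTorch's cross-entropy loss ignores by default. A minimal pure-Python sketch of that shift follows. `shift_for_teacher_forcing` is a hypothetical helper and the token ids are made up.

```python
PAD_ID = 0           # T5's pad token id
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def shift_for_teacher_forcing(target_ids, pad_id=PAD_ID):
    """Split one target sequence into decoder inputs and next-token labels."""
    decoder_input_ids = target_ids[:-1]  # mirrors y[:, :-1]
    labels = [
        IGNORE_INDEX if token == pad_id else token
        for token in target_ids[1:]      # mirrors y[:, 1:] with padding masked
    ]
    return decoder_input_ids, labels


# Three content tokens, EOS (1), then two padding tokens
decoder_input_ids, labels = shift_for_teacher_forcing([7, 8, 9, 1, 0, 0])
print(decoder_input_ids)  # [7, 8, 9, 1, 0]
print(labels)             # [8, 9, 1, -100, -100]
```

At each position the label is the token that follows the corresponding decoder input, and padded positions contribute nothing to the loss.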

In [14]:
# Iterate through a specified number of epochs for training
for epoch in range(3):
    train(epoch, tokenizer, model, device, training_loader, optimizer)



training of epoch 0 ended with loss = 1.5711437463760376
training of epoch 1 ended with loss = 1.767512321472168
training of epoch 2 ended with loss = 1.7535284757614136
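
Note that each printed value is the loss of the last batch in the epoch only, so it is noisy and can rise from one epoch to the next even while training progresses. Averaging the per-batch losses gives a steadier signal. The sketch below shows the idea with hypothetical loss values. `mean_epoch_loss` is not part of the notebook.

```python
def mean_epoch_loss(batch_losses):
    """Average the per-batch losses collected during one epoch."""
    if not batch_losses:
        raise ValueError("no batch losses were recorded")
    return sum(batch_losses) / len(batch_losses)


# Hypothetical per-batch losses for one epoch (in train, append loss.item() per batch)
batch_losses = [2.5, 1.5, 1.0, 2.0]

print("last batch:", batch_losses[-1])                # 2.0
print("epoch mean:", mean_epoch_loss(batch_losses))   # 1.75
```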


In [15]:
# Specify the path to save the model's weights and configuration
PATH = '/kaggle/working/model_weights'
# Use the 'torch.save' function to save the model's state dictionary
# The state dictionary contains the trained parameters of the model
torch.save({
    'model_state_dict': model.state_dict()
}, PATH)

In [16]:
# Load the saved checkpoint from the specified path
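# map_location keeps loading from failing when the checkpoint was saved on a GPU but is loaded on a CPU
ckp = torch.load(PATH, map_location=device)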
# Access the keys of the loaded checkpoint
# These keys correspond to the elements saved in the checkpoint
# For example, 'model_state_dict' holds the trained model's parameters
ckp.keys()

dict_keys(['model_state_dict'])

In [17]:
# Load a new instance of T5 model from the base pre-trained version
# The model will initially have the architecture of the base T5 model
saved_model = T5ForConditionalGeneration.from_pretrained("t5-base").to(device)
# Load the trained model's state dictionary from the loaded checkpoint
# This will update the model's parameters with the trained values
saved_model.load_state_dict(ckp['model_state_dict'])
# Set the model's mode to evaluation
# This disables dropout and other training-specific behaviors
saved_model.eval()

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=768, out_features=3072, bias=False)
              (wo): Linear(in_features=3072, out_features=768, bias=False)
              (dropout): Dro

In [18]:
def generate(input_text):
    # Tokenize the input text and move the input IDs to the same device as the model
    input_ids = tokenizer(input_text, return_tensors="pt", max_length=500, truncation=True).input_ids.to(device)
    # Generate a summary with beam search using the saved model
    output = saved_model.generate(input_ids, max_length=125, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    # Decode the generated token IDs into human-readable text
    return tokenizer.decode(output[0], skip_special_tokens=True)
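
The `length_penalty=2.0` argument affects how beam search ranks finished candidates. According to the transformers documentation, the penalty is used as an exponent on the sequence length, which then divides the sequence's summed log-probability. Since log-probabilities are negative, values above 0 favour longer outputs. The sketch below reproduces only that ranking formula with made-up numbers. `beam_score` is a hypothetical helper, not a library function.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Rank score for a finished hypothesis: higher (closer to 0) is better."""
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical candidates as (summed log-probability, length in tokens)
short = (-4.0, 10)
long = (-6.0, 20)

for penalty in (0.0, 2.0):
    short_score = beam_score(*short, penalty)
    long_score = beam_score(*long, penalty)
    winner = "short" if short_score > long_score else "long"
    print(f"length_penalty={penalty}: short={short_score}, long={long_score}, winner={winner}")
```

With a penalty of 0.0 the raw log-probabilities decide and the short candidate wins. With 2.0 the longer candidate ranks higher.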

In [19]:
# Define the input text for testing
Text = '''Lashkar-e-Taiba's Kashmir commander Abu Dujana was killed in an encounter in a village in Pulwama district of Jammu and Kashmir earlier this week. Dujana, who had managed to give the security forces a slip several times in the past, carried a bounty of Rs 15 lakh on his head.Reports say that Dujana had come to meet his wife when he was trapped inside a house in Hakripora village. Security officials involved in the encounter tried their best to convince Dujana to surrender but he refused, reports say.According to reports, Dujana rejected call for surrender from an Army officer. The Army had commissioned a local to start a telephonic conversation with Dujana. After initiating the talk, the local villager handed over the phone to the army officer."Kya haal hai? Maine kaha, kya haal hai (How are you. I asked, how are you)?" Dujana is heard asking the officer. The officer replies: "Humara haal chhor Dujana. Surrender kyun nahi kar deta. Tu galat kar rha hai (Why don't you surrender? You have married this girl. What you are doing isn't right.)"When told that he is being used by Pakistani agencies as a pawn, Dujana, who sounded calm and unperturbed of the situation, said "Hum nikley they shaheed hone. Main kya karu. Jisko game khelna hai, khelo. Kabhi hum aage, kabhi aap, aaj aapne pakad liya, mubarak ho aapko. Jisko jo karna hai karlo (I had left home for martyrdom. What can I do? Today you caught me. Congratulations. "Surrender nahi kar sakta. Jo meri kismat may likha hoga, Allah wahi karega, theek hai? (I won't surrender. Allaah would do whatever is there in my fate)" Dujana went on to say. Dujana, who belonged to Pakistan, was Lashkar-e-Taiba's divisional commander in south Kashmir. 
He was among the top 10 terrorists identified by the Indian Army in Jammu and Kashmir.With a Rs 15 lakh bounty on his head, Dujana was labelled an 'A++' terrorist - the top grade which was also given to Burhan Wani.Security forces received inputs that during the last few days he was frequenting the houses of his wife Rukaiya and girlfriend Shazia. Police was keeping a watch on both the houses. when it was confirmed he was present in his wife's house, security forces moved in to trap him.ALSO READ:After Abu Dujana, security forces prepare new hitlist of most wanted terroristsAbu Dujana encounter: Jilted lover turned police informer led security forces to LeT commander'''

In [20]:
# Generate the summarized version using the generate function
generate(Text)

'Dujana, who had managed to give the security forces a slip several times in the past, was killed in an encounter in a village in Pulwama district of Jammu and Kashmir earlier this week. "Why don\'t you surrender? You have married this girl. What you are doing isn\'t right."'

In [21]:
# Import the Gradio library for creating interactive interfaces
import gradio as gr
# Create a Gradio interface
iface = gr.Interface(
    fn=generate,
    # gr.inputs / gr.outputs are deprecated; use the top-level components instead
    inputs = gr.Textbox(lines=10, label="Input News Article"),
    outputs = gr.Textbox(label="Summarized News"),
    title="News Summarization: This website is made by Vivek Kumar",
    description="Enter a news article to get a summarized version."
)

# Deploy and share the interface
iface.launch(share=True)



Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://2ea38289e3ac961dcf.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
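
Since the interface is publicly shared, it may be worth guarding the summarization function against empty or very short input before it reaches the model. The wrapper below is a sketch of that idea, not part of the original notebook. `make_safe_summarizer`, the `min_words` threshold and the stub function are all hypothetical. In the notebook, the wrapped function would be `generate`.

```python
def make_safe_summarizer(summarize_fn, min_words=20):
    """Wrap a summarization function with a basic input-length check."""
    def safe_summarize(text):
        text = (text or "").strip()
        if len(text.split()) < min_words:
            return f"Please enter an article of at least {min_words} words."
        return summarize_fn(text)
    return safe_summarize


# Stub standing in for the notebook's generate function
def fake_generate(text):
    return f"summary of {len(text.split())} words"


safe_generate = make_safe_summarizer(fake_generate, min_words=3)
print(safe_generate(""))                    # falls back to the prompt message
print(safe_generate("one two three four"))  # reaches the summarizer
```

Passing the wrapped function as `fn` to `gr.Interface` would leave the rest of the interface definition unchanged.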


