## Loading the CNN/Daily Mail Dataset (Training and Validation Sets)

In [1]:
import datasets

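# Load version 3.0.0 of the CNN/Daily Mail summarization dataset
dataset = datasets.load_dataset('cnn_dailymail', '3.0.0')

# Keep a reproducible 0.1% sample of each split: shuffle with a fixed seed, then take the first rows
train_data = dataset['train'].shuffle(seed=42).select(range(int(0.001 * len(dataset['train']))))
val_data = dataset['validation'].shuffle(seed=42).select(range(int(0.001 * len(dataset['validation']))))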
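
The `select(range(int(0.001 * len(split))))` pattern keeps 0.1% of the shuffled rows, rounded down. A small sketch of that arithmetic follows. `subset_size` is a hypothetical helper, and the split sizes are assumed from the dataset card for version 3.0.0, so they are worth re-checking against `len(dataset[...])`.

```python
def subset_size(total_rows, fraction=0.001):
    """Number of rows kept by select(range(int(fraction * total_rows)))."""
    return int(fraction * total_rows)

# Split sizes assumed from the cnn_dailymail 3.0.0 dataset card
split_sizes = {"train": 287113, "validation": 13368, "test": 11490}

for name, rows in split_sizes.items():
    print(name, subset_size(rows))
```

Under these assumed sizes the subsets come to 287, 13, and 11 rows, which agrees with the `Num examples` lines in the training and evaluation logs further down.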


## Data Exploration

In [2]:
# Inspect one raw example
print(train_data[0])

# Average word count (whitespace-split) in articles and summaries
article_lengths = [len(sample['article'].split()) for sample in train_data]
summary_lengths = [len(sample['highlights'].split()) for sample in train_data]

print("Average article length:", sum(article_lengths) / len(article_lengths))
print("Average summary length:", sum(summary_lengths) / len(summary_lengths))


{'article': "By . Anthony Bond . PUBLISHED: . 07:03 EST, 2 March 2013 . | . UPDATED: . 08:07 EST, 2 March 2013 . Three members of the same family who died in a static caravan from carbon monoxide poisoning would have been unconscious 'within minutes', investigators said today. The bodies of married couple John and Audrey Cook were discovered alongside their daughter, Maureen, at the mobile home they shared on Tremarle Home Park in Camborne, west Cornwall. The inquests have now opened into the deaths last Saturday, with investigators saying the three died along with the family's pet dog, of carbon monoxide poisoning from a cooker. Tragic: The inquests have opened into the deaths of three members of the same family who were found in their static caravan last weekend. John and Audrey Cook are pictured . Awful: The family died following carbon monoxide poisoning at this caravan at the Tremarle Home Park in Camborne, Cornwall . It is also believed there was no working carbon monoxide detect
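
An average alone can hide a long tail of very long articles, which matters once inputs are truncated to a fixed length. The following is a minimal sketch that adds the median and maximum using the standard `statistics` module. `length_stats` and the toy strings are hypothetical and are not rows from the dataset.

```python
import statistics

def length_stats(texts):
    """Whitespace-split word-count statistics for a list of strings."""
    lengths = [len(text.split()) for text in texts]
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }

# Toy stand-ins for sample['article'] values (not real dataset rows)
toy_articles = ["one two three", "one two three four five", "one"]
print(length_stats(toy_articles))
```

In the notebook, the same helper could be called as `length_stats([sample['article'] for sample in train_data])`. Word counts are only a rough proxy for token counts, since the tokenizer splits words into subword units.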

## Data Cleaning and Tokenizing

In [3]:
from transformers import BartTokenizer
import re

# Initialize the tokenizer
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

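# Function to clean and tokenize a batch of examples
def preprocess_data(examples):
    # Collapse runs of whitespace (including newlines) into single spaces
    examples['article'] = [re.sub(r'\s+', ' ', article) for article in examples['article']]
    examples['highlights'] = [re.sub(r'\s+', ' ', summary) for summary in examples['highlights']]

    # Tokenize articles and summaries, truncating and padding to fixed lengths
    inputs = tokenizer(examples['article'], truncation=True, padding='max_length', max_length=1024)
    targets = tokenizer(examples['highlights'], truncation=True, padding='max_length', max_length=150)

    # Replace padding ids in the labels with -100 so the loss ignores padded positions
    labels = [[tok if tok != tokenizer.pad_token_id else -100 for tok in seq] for seq in targets['input_ids']]
    return {'input_ids': inputs['input_ids'], 'attention_mask': inputs['attention_mask'], 'labels': labels}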

train_data = train_data.map(preprocess_data, batched=True)
val_data = val_data.map(preprocess_data, batched=True)
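
The tokenizer arguments `truncation=True`, `padding='max_length'`, and `max_length=N` force every sequence to exactly `N` ids, with an attention mask marking which positions hold real tokens. A simplified pure-Python sketch of that behavior follows. `pad_or_truncate` is a hypothetical helper, the pad id of 1 is an assumption about BART's vocabulary, and the real tokenizer additionally keeps the end-of-sequence token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=1):
    """Return (ids, attention_mask), each exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncation: drop anything past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)
    ids += [pad_id] * padding            # right-pad with the pad id
    mask += [0] * padding                # 0 marks padding, which attention skips
    return ids, mask

# A short sequence is padded, a long one is cut (token ids are made up)
print(pad_or_truncate([0, 713, 16, 2], max_length=6))
print(pad_or_truncate([0, 713, 16, 10, 1296, 2], max_length=4))
```

Padding every article to 1024 ids keeps batches uniform at the cost of extra memory and compute for short inputs.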


## Model Training with BART

In [4]:
from transformers import BartForConditionalGeneration, Trainer, TrainingArguments

# Loading BART model for summarization
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

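# Set training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",      # evaluate on the validation set after every epoch
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    weight_decay=0.01,
    num_train_epochs=3,
    logging_dir='./logs',
    logging_steps=10,                 # log the training loss every 10 optimizer steps
    save_steps=10,                    # write a checkpoint every 10 optimizer steps
    save_total_limit=1,               # keep only the most recent checkpoint
    gradient_accumulation_steps=16,   # effective batch size: 2 * 16 = 32
    fp16=False                        # train in full precision
)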

# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
    tokenizer=tokenizer
)

# Train the model
trainer.train()


Some weights of BartForConditionalGeneration were not initialized from the model checkpoint at facebook/bart-large-cnn and are newly initialized: ['model.shared.weight', 'model.encoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The following columns in the training set don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: id, highlights, article. If id, highlights, article are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 287
  Num Epochs = 3
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 16
  Total optimization steps = 27
  Number of trainable parameters = 406290432


Epoch,Training Loss,Validation Loss
1,No log,1.722975
2,4.158800,1.138577
3,1.172300,1.051491
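
The logged figures (total batch size 32, 27 optimization steps) follow from the training arguments. The sketch below reproduces them under the assumption that the Trainer counts `batches_per_epoch // gradient_accumulation_steps` updates per epoch. That assumption matches this log, but other library versions may round differently. `training_schedule` is a hypothetical helper.

```python
import math

def training_schedule(num_examples, batch_size, accumulation_steps, epochs):
    """Effective batch size and optimizer-step counts for a single-device run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # Assumption: leftover micro-batches at the end of an epoch do not count as an extra update
    updates_per_epoch = max(batches_per_epoch // accumulation_steps, 1)
    return {
        "effective_batch_size": batch_size * accumulation_steps,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": updates_per_epoch * epochs,
    }

schedule = training_schedule(num_examples=287, batch_size=2, accumulation_steps=16, epochs=3)
print(schedule)
```

With 9 updates per epoch and `logging_steps=10`, the first training loss is logged at step 10, after the first epoch has already ended. This is consistent with the `No log` entry in the first row of the table.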


The following columns in the evaluation set don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: id, highlights, article. If id, highlights, article are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 13
  Batch size = 2
Saving model checkpoint to ./results\checkpoint-10
Configuration saved in ./results\checkpoint-10\config.json
Model weights saved in ./results\checkpoint-10\pytorch_model.bin
tokenizer config file saved in ./results\checkpoint-10\tokenizer_config.json
Special tokens file saved in ./results\checkpoint-10\special_tokens_map.json

TrainOutput(global_step=27, training_loss=2.1936102266664856, metrics={'train_runtime': 17138.5169, 'train_samples_per_second': 0.05, 'train_steps_per_second': 0.002, 'total_flos': 1865877062418432.0, 'train_loss': 2.1936102266664856, 'epoch': 3.0})

## Model Testing and Performance evaluation using BLEU

In [16]:
import nltk
from nltk.translate.bleu_score import corpus_bleu
from datasets import load_dataset
from transformers import BartForConditionalGeneration, Trainer, TrainingArguments, BartTokenizer
import torch

In [17]:
# Loading the test dataset
test_data = dataset['test'].shuffle(seed=42).select(range(int(0.001 * len(dataset['test']))))
test_data = test_data.map(preprocess_data, batched=True)


In [18]:
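# Corpus-level BLEU over whitespace-tokenized text, with one reference per prediction
def compute_bleu(predictions, references):
    # corpus_bleu expects a list of references for each hypothesis, hence the extra brackets
    reference_lists = [[ref.split()] for ref in references]
    prediction_lists = [pred.split() for pred in predictions]
    return corpus_bleu(reference_lists, prediction_lists)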
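
BLEU combines clipped n-gram precisions (for n = 1 to 4 with NLTK's default weights) with a brevity penalty. The sketch below covers only the clipped-precision component, to show what the score rewards. It is not NLTK's implementation, and `ngrams` and `clipped_precision` are hypothetical helper names.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams in a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(reference, hypothesis, n):
    """Share of hypothesis n-grams found in the reference, with counts clipped."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0

reference = "the cat sat on the mat".split()
hypothesis = "the the the cat".split()
print(clipped_precision(reference, hypothesis, 1))  # 3 of 4 unigrams credited
print(clipped_precision(reference, hypothesis, 2))  # 1 of 3 bigrams credited
```

Clipping is what stops a hypothesis from scoring well by repeating a common word: "the" appears three times in the hypothesis but is credited only twice.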

In [19]:
# Define the Trainer for evaluation
trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=test_data,
    tokenizer=tokenizer,
)

In [24]:
# Evaluate on the test subset; this reports loss and throughput only (compute_bleu is not called here)
results = trainer.evaluate()
# Display the results
print("Evaluation Results:")
for key, value in results.items():
    print(f"{key}: {value}")


The following columns in the evaluation set don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: id, highlights, article. If id, highlights, article are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 11
  Batch size = 2


Evaluation Results:
eval_loss: 0.9898132085800171
eval_runtime: 90.9034
eval_samples_per_second: 0.121
eval_steps_per_second: 0.066
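
`eval_loss` is a mean token-level cross-entropy, so exponentiating it gives a perplexity, which can be easier to interpret. A short sketch using the value reported above:

```python
import math

eval_loss = 0.9898132085800171  # eval_loss reported in the output above
perplexity = math.exp(eval_loss)
print(f"perplexity: {perplexity:.2f}")
```

The figure is only comparable between runs that treat padded label positions the same way, because padded positions change the average if they are included in the loss.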


## Data Exploration after training and testing

In [26]:
import random

sample_idx = random.randint(0, len(test_data) - 1)
sample = test_data[sample_idx]

print("Article:")
print(sample['article'])
print("\nTrue Summary:")
print(sample['highlights'])

inputs = tokenizer(sample['article'], return_tensors='pt', truncation=True, padding='max_length', max_length=1024).to(model.device)

output = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=150)

predicted_summary = tokenizer.decode(output[0], skip_special_tokens=True)

print("\nPredicted Summary:")
print(predicted_summary)


Article:
Arsene Wenger admits he is concerned Theo Walcott’s confidence is plummeting after his struggles with England this week. The Arsenal manager will have a heart-to-heart chat with the forward ahead of Saturday’s crunch top-four clash against Liverpool. Walcott was hauled off after 55 minutes of England’s 1-1 draw in Italy on Tuesday night. Theo Walcott struggled for England and Arsene Wenger admits he is concerned by the winger's confidence . Walcott was replaced by Ross Barkley after just 55 minutes of England's 1-1 draw against Italy on Tuesday . 2 - Premier League goals for Walcott this season - his average haul per season during his time at Arsenal is 5.6. It was the latest disappointment in a difficult season for the 26-year-old, who has struggled for game time since returning from a long-term lay-off due to a serious knee injury. With Alex Oxlade-Chamberlain out of Liverpool’s visit due to a hamstring strain, and Danny Welbeck a major doubt after sustaining a knee problem 

In [27]:
# Function to generate summary for a user input
def generate_summary(user_input):
    inputs = tokenizer(user_input, return_tensors='pt', truncation=True, padding='max_length', max_length=1024).to(model.device)

    summary_ids = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=150)

    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

user_input = input("Enter a paragraph to summarize: ")

summary = generate_summary(user_input)
print("\nGenerated Summary:")
print(summary)


Enter a paragraph to summarize: Artificial intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning, reasoning, and self-correction. AI can be classified into narrow AI, which is specialized for specific tasks, and general AI, which aims to perform any intellectual task that a human can do. As AI technology advances, it has the potential to revolutionize various industries, including healthcare, finance, and transportation.

Generated Summary:
Artificial intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning, reasoning, and self-correction. As AI technology advances, it has the potential to revolutionize various industries, including healthcare, finance, and transportation.
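
The same helper can be applied across a dataset split to gather the predictions and references that `compute_bleu` expects. A minimal sketch follows. `collect_pairs` is hypothetical, and the stub summarizer merely stands in for `generate_summary` so that the snippet runs without a model.

```python
def collect_pairs(samples, summarize):
    """Run `summarize` on each article and pair the output with its reference."""
    predictions, references = [], []
    for sample in samples:
        predictions.append(summarize(sample["article"]))
        references.append(sample["highlights"])
    return predictions, references

# Stub summarizer for illustration only: keeps the first sentence
def first_sentence(text):
    return text.split(". ")[0]

# Toy row with the same keys as the dataset (not a real example)
toy_samples = [
    {"article": "Rain fell all day. Roads flooded.", "highlights": "Heavy rain flooded roads"},
]
predictions, references = collect_pairs(toy_samples, first_sentence)
print(predictions, references)
```

In the notebook this would become `collect_pairs(test_data, generate_summary)`, followed by `compute_bleu(predictions, references)`.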


## Saving the model

In [28]:

model.save_pretrained("C:/Users/91862/Desktop/Text_Sum_Infosys/saved_model")
tokenizer.save_pretrained("C:/Users/91862/Desktop/Text_Sum_Infosys/saved_model")


Configuration saved in C:/Users/91862/Desktop/Text_Sum_Infosys/saved_model\config.json
Model weights saved in C:/Users/91862/Desktop/Text_Sum_Infosys/saved_model\pytorch_model.bin
tokenizer config file saved in C:/Users/91862/Desktop/Text_Sum_Infosys/saved_model\tokenizer_config.json
Special tokens file saved in C:/Users/91862/Desktop/Text_Sum_Infosys/saved_model\special_tokens_map.json


('C:/Users/91862/Desktop/Text_Sum_Infosys/saved_model\\tokenizer_config.json',
 'C:/Users/91862/Desktop/Text_Sum_Infosys/saved_model\\special_tokens_map.json',
 'C:/Users/91862/Desktop/Text_Sum_Infosys/saved_model\\vocab.json',
 'C:/Users/91862/Desktop/Text_Sum_Infosys/saved_model\\merges.txt',
 'C:/Users/91862/Desktop/Text_Sum_Infosys/saved_model\\added_tokens.json')
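
The paths in the log mix `/` and `\` because a hand-written string was joined with the platform separator. A sketch using `pathlib` avoids the mix and the hard-coded user folder. `make_save_dir` is a hypothetical helper, and a temporary directory stands in for the real project folder.

```python
import tempfile
from pathlib import Path

def make_save_dir(base_dir, name="saved_model"):
    """Create (if needed) and return a directory for model and tokenizer files."""
    save_dir = Path(base_dir) / name
    save_dir.mkdir(parents=True, exist_ok=True)
    return save_dir

# A temporary directory stands in for the project folder in this sketch
save_dir = make_save_dir(tempfile.mkdtemp())
print(save_dir)

# In the notebook, the same directory would serve both saving and reloading:
# model.save_pretrained(save_dir)
# tokenizer.save_pretrained(save_dir)
# model = BartForConditionalGeneration.from_pretrained(save_dir)
```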