# Text Summarization using Hugging Face

### Summarization
Summarization is the task of producing a short summary of a long document, such as a news article or a research paper. There are two main types: extractive and abstractive summarization.



### Extractive Summarization

Extractive summarization shortens a large document, such as a news article, medical publication or research paper, by extracting its most important sentences and copying them verbatim. Because the sentences are lifted out without regard to the surrounding context, the result may not read as a coherent text.
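To make the idea concrete, here is a minimal sketch of an extractive summarizer. It is not the method used in this notebook; it assumes a very simple scoring rule (average word frequency per sentence), and the function name `extractive_summary` and the toy document are made up for illustration.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    # Word frequencies over the whole document (lowercased, letters only)
    freqs = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        # Average document-level frequency of the words in this sentence
        words = re.findall(r'[a-z]+', sentence.lower())
        return sum(freqs[w] for w in words) / max(len(words), 1)

    # Rank sentence indices by score, keep the best ones, restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return ' '.join(sentences[i] for i in chosen)

doc = ("The cat sat on the mat. The cat chased the mouse around the house. "
       "Rain is expected tomorrow.")
summary = extractive_summary(doc, num_sentences=1)
print(summary)
```

Every sentence in the output is copied unchanged from the input, which is the defining property of the extractive approach.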


### Abstractive Summarization

Abstractive summarization differs from the extractive technique above. An extractive summary may or may not be meaningful, because it is only a selection of important sentences from the long document. An abstractive summary instead tries to take the context of the whole document into account and then rewrites it, so its wording may not match the source document exactly.





### Imports

In [None]:
# Install and import the modules
!pip install torch
!pip install transformers

import json
import torch
from torch.utils.data import DataLoader, Dataset



## Dataset used
https://huggingface.co/datasets/cnn_dailymail


We will use the CNN/DailyMail news summarization dataset to fine-tune a pretrained T5 model for summarization.

## Load Dataset

Notes: The dataset is in the form of a dict with the fields

- id: a string containing the hexadecimal-formatted SHA1 hash of the URL the story was retrieved from
- article: a string containing the body of the news article
- highlights: a string containing the highlight of the article as written by the article author

The initial load also returns three subsets: train, validation and test.


### SKIP the load below if the dataset has already been loaded once and a subset saved to local storage or Google Drive

In [None]:
!pip install datasets
#load cnn dataset
import datasets
from datasets import load_dataset
dataset = load_dataset("cnn_dailymail",'3.0.0')

Collecting datasets
  Downloading datasets-2.16.0-py3-none-any.whl (507 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: pyarrow-hotfix, dill, multiprocess, datasets
Successfully installed datasets-2.16.0 dill-0.3.7 multiprocess-0.70.15 pyarrow-hotfix-0.6


Downloading data:   0%|          | 0.00/313M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/304M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/155M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/34.7M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

In [None]:
# check dataset
dataset.keys()

dict_keys(['train', 'validation', 'test'])

## The dataset is too large, so for now we will use just 4000 rows for training and 200 rows for validation

Steps
- subset the original dataset into portions for train and validation
- write the subsets to local storage or Google Drive
- SKIP if already done once

In [None]:
import json
train_dataset = dataset['train'][:4000]
val_dataset = dataset['validation'][:200]

# Save cnn train dataset
with open('drive/MyDrive/LLM_data/cnn_train_4000_data.json','w') as f:
   json.dump(train_dataset, f)

# Save cnn validation dataset
with open('drive/MyDrive/LLM_data/cnn_val_200_data.json','w') as f:
   json.dump(val_dataset, f)
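Slicing a split as above returns a dict with one list per column rather than a list of rows, and that column-oriented layout is what ends up in the JSON file. The sketch below illustrates the round trip with a made-up two-row subset and a temporary file; the toy values and the file name `toy_subset.json` are assumptions for illustration only.

```python
import json
import os
import tempfile

# Toy stand-in for dataset['train'][:2]: one list per column
subset = {
    "id": ["a1", "b2"],
    "article": ["First article body.", "Second article body."],
    "highlights": ["First summary.", "Second summary."],
}

# Write the subset to a temporary JSON file and read it back
path = os.path.join(tempfile.mkdtemp(), "toy_subset.json")
with open(path, "w") as f:
    json.dump(subset, f)

with open(path) as f:
    restored = json.load(f)

# A row is spread across the columns, so index the column first, then the row
first_row = {key: column[0] for key, column in restored.items()}
print(first_row["article"])
print(len(restored["article"]))
```

This is why later cells read a sample as `train_data["article"][0]` rather than `train_data[0]["article"]`.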

## Read the train and validation from local memory or drive

In [None]:
## Read train data from saved dump

# Open and read the cnn train data file (the file is closed automatically)
with open('drive/MyDrive/LLM_data/cnn_train_4000_data.json') as f:
    train_data = json.load(f)

# select first entry of article from the train data
sample_text = train_data["article"][0]
print("       Sample Text      ")
print(sample_text, '\n')

# select first entry of highlight  from the train data
sample_highlight = train_data["highlights"][0]
print("        Sample Highlight     ")
print(sample_highlight)

       Sample Text      
LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box off

In [None]:
## Read val data from saved dump

# Open and read the cnn validation data file (the file is closed automatically)
with open('drive/MyDrive/LLM_data/cnn_val_200_data.json') as f:
    val_data = json.load(f)

# select first entry of each type from the val data set
val_sample_text = val_data["article"][0]
print("       Sample Text      ")
print(val_sample_text, '\n')
#
val_sample_highlight = val_data["highlights"][0]
print("        Sample Highlight     ")
print(val_sample_highlight)



       Sample Text      
(CNN)Share, and your gift will be multiplied. That may sound like an esoteric adage, but when Zully Broussard selflessly decided to give one of her kidneys to a stranger, her generosity paired up with big data. It resulted in six patients receiving transplants. That surprised and wowed her. "I thought I was going to help this one person who I don't know, but the fact that so many people can have a life extension, that's pretty big," Broussard told CNN affiliate KGO. She may feel guided in her generosity by a higher power. "Thanks for all the support and prayers," a comment on a Facebook page in her name read. "I know this entire journey is much bigger than all of us. I also know I'm just the messenger." CNN cannot verify the authenticity of the page. But the power that multiplied Broussard's gift was data processing of genetic profiles from donor-recipient pairs. It works on a simple swapping principle but takes it to a much higher level, according to Californi

## Data Cleaning

- We will remove '--' from the text
- We will remove any text within parentheses, such as the source name (Reuters)

In [None]:
# Pre process sample text
import re

sample_text = train_data["article"][0]
# Check before pre process
print("BEFORE Pre Process : ", sample_text)

# Remove any text within parentheses (non-greedy match)
sample_text = re.sub(r'\(.*?\)', '', sample_text)

# Remove the '--' separators
sample_text = sample_text.replace('--', '')

# Check after pre process
print("AFTER Pre Process : ", sample_text)


BEFORE Pre Process :  LONDON, England   Harry Potter star Daniel Radcliffe gains access to a reported £20 million  fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds  books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details of how he'll

## Tokenizer

We will be using T5TokenizerFast in this example

Tokenize sample text

In [None]:
# Prepare sample text for tokenization

# select first entry of article from the train data
#sample_text = train_data["article"][0]

sample_text = str(sample_text)
sample_text = ' '.join(sample_text.split())
# Set Max lengths for padding
max_txt_len = 250

# Prepare sample highlight  for tokenization
sample_highlight = str(sample_highlight)
sample_highlight = ' '.join(sample_highlight.split())



max_summ_len = 150


In [None]:
# Set parameters for text and summary for truncation
max_txt_len = 250
max_summ_len = 150

from transformers import T5Model, T5TokenizerFast, T5Config, T5ForConditionalGeneration
# The AdamW shipped with transformers is deprecated; the PyTorch optimizer is used instead
from torch.optim import AdamW

# Invoke tokenizer
tokenizer = T5TokenizerFast.from_pretrained('t5-base')

# Tokenize sample text
# The source sequence is truncated to max_txt_len; padding=True pads to the longest sequence in the batch
source = tokenizer.batch_encode_plus([sample_text], truncation=True, max_length=max_txt_len, return_tensors='pt', padding=True)

# Tokenize sample highlight
# The target sequence is truncated to max_summ_len; padding=True pads to the longest sequence in the batch
target = tokenizer.batch_encode_plus([sample_highlight], truncation=True, max_length=max_summ_len, return_tensors='pt', padding=True)

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.




In [None]:
# print encoding for source
for key, value in source.items():
    print( '{} : {}'.format( key, value ) )

input_ids : tensor([[  301, 24796,  4170,     6,  2789,  8929, 16023,  2213,  4173,  6324,
         12591,    15, 11391,   592,    12,     3,     9,  2196,  3996,  1755,
           770, 13462,    38,     3,    88,  5050,   507,    30,  2089,     6,
            68,     3,    88, 10419,     7,     8,   540,   751,    31,    17,
          4061,     3,     9, 10783,    30,   376,     5,  4173,  6324, 12591,
            15,    38,  8929, 16023,    16,    96, 15537,   651, 16023,    11,
             8,  5197,    13,     8, 12308,   121,   304,     8, 19142,    13,
         29517,  6710,   343,     7,   300,     8,   296,     6,     8,  1021,
          7556,   845,     3,    88,    65,   150,  1390,    12,  9030,    17,
           449,   112,  1723,   550,    30,  1006,  2948,     6,  3281,    11,
         17086,  2251,     5,    96,   196,   278,    31,    17,   515,    12,
            36,    80,    13,   273,   151,   113,     6,    38,  1116,    38,
            79,   919, 14985,  8247,   8

In [None]:
# print encoding for target
for key, value in target.items():
    print( '{} : {}'.format( key, value ) )

input_ids : tensor([[ 8929, 16023,  2213,  4173,  6324, 12591,    15,  2347,  3996,  1755,
           329, 13462,    38,     3,    88,  5050,   507,  2089,     3,     5,
          5209,  7556,   845,     3,    88,    65,   150,  1390,    12,  9030,
            17,   449,   112,  1723,   550,     3,     5,  6324, 12591,    15,
            31,     7,  8783,    45,   166,   874, 16023,  4852,    43,   118,
          1213,    16,  2019,  3069,     3,     5,     1]])
attention_mask : tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1]])


In [None]:
# Extract input ids for source
source_ids = source['input_ids'].squeeze()
# Extract attention Mask for source
source_masks = source['attention_mask'].squeeze()
# Extract input ids for target
target_ids = target['input_ids'].squeeze()
# Extract attention Mask for target
target_masks = target['attention_mask'].squeeze()

# print and check  target ids
print("target ids")
print(target_ids)
# print and check target masks
print("target mask")
print(target_masks)

target ids
tensor([ 8929, 16023,  2213,  4173,  6324, 12591,    15,  2347,  3996,  1755,
          329, 13462,    38,     3,    88,  5050,   507,  2089,     3,     5,
         5209,  7556,   845,     3,    88,    65,   150,  1390,    12,  9030,
           17,   449,   112,  1723,   550,     3,     5,  6324, 12591,    15,
           31,     7,  8783,    45,   166,   874, 16023,  4852,    43,   118,
         1213,    16,  2019,  3069,     3,     5,     1])
target mask
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1])


In [None]:
# display tokens and ids for the source
tokens = tokenizer.convert_ids_to_tokens(source_ids)
for token, token_id in zip(tokens, source_ids.tolist()):
    print('{:8}{:8,}'.format(token, token_id))

▁L           301
OND       24,796
ON         4,170
,              6
▁England   2,789
▁Harry     8,929
▁Potter   16,023
▁star      2,213
▁Daniel    4,173
▁Rad       6,324
cliff     12,591
e             15
▁gains    11,391
▁access      592
▁to           12
▁              3
a              9
▁reported   2,196
▁£         3,996
20         1,755
▁million     770
▁fortune  13,462
▁as           38
▁              3
he            88
▁turns     5,050
▁18          507
▁on           30
▁Monday    2,089
,              6
▁but          68
▁              3
he            88
▁insist   10,419
s              7
▁the           8
▁money       540
▁won         751
'             31
t             17
▁cast      4,061
▁              3
a              9
▁spell    10,783
▁on           30
▁him         376
.              5
▁Daniel    4,173
▁Rad       6,324
cliff     12,591
e             15
▁as           38
▁Harry     8,929
▁Potter   16,023
▁in           16
▁"            96
Har       15,537
ry           651
▁Potter   16,

## Datasets and Data Loading

REFERENCE
https://pytorch.org/tutorials/beginner/basics/data_tutorial.html

### Background

- We ideally want our dataset code to be decoupled from our model training code for better readability and modularity.

- PyTorch provides two data primitives: torch.utils.data.DataLoader and torch.utils.data.Dataset that allow you to use pre-loaded datasets as well as your own data.

- Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.


### Creating a Custom Dataset Class
A custom Dataset class must implement three functions: `__init__`, `__len__`, and `__getitem__`.


### Explanation of each of the three functions

**__init__**

- The __init__ function is run once when instantiating the Dataset object.
- We store the tokenizer and the maximum lengths, and extract the 'article' and 'highlights' columns from the supplied data.

**__len__**

- The __len__ function returns the number of samples in our dataset.

**__getitem__**

- The __getitem__ function loads and returns a sample from the dataset at the given index idx.

- Based on the index, it extracts a sample text and the corresponding highlight.

- It tokenizes the text and the summary and extracts the input ids and the attention masks.

- It then returns the input ids and the attention masks for the text and the summary.



In [None]:
class CustomDataset(Dataset):
  def __init__(self,dataset,tokenizer,source_len,summ_len):
    self.dataset = dataset
    self.tokenizer = tokenizer
    self.text_len = source_len
    self.summ_len = summ_len
    self.text = self.dataset['article']
    self.summary = self.dataset['highlights']


  def __len__(self):
    return len(self.text)


  def __getitem__(self,i):
    summary = str(self.summary[i])
    summary = ' '.join(summary.split())
    text = str(self.text[i])
    text = ' '.join(text.split())
    # Each source sequence is truncated or padded to exactly text_len tokens
    source = self.tokenizer.batch_encode_plus([text],max_length=self.text_len,return_tensors='pt',padding='max_length',truncation=True)
    # Each target sequence is truncated or padded to exactly summ_len tokens
    target = self.tokenizer.batch_encode_plus([summary],max_length=self.summ_len,return_tensors='pt',padding='max_length',truncation=True)


    source_ids = source['input_ids'].squeeze()
    source_masks = source['attention_mask'].squeeze()
    target_ids = target['input_ids'].squeeze()
    target_masks = target['attention_mask'].squeeze()


    return {
        'source_ids':source_ids.to(torch.long),
        'source_masks':source_masks.to(torch.long),
        'target_ids':target_ids.to(torch.long),
        'target_masks':target_masks.to(torch.long)
    }


### Dataloader

The Dataset retrieves our dataset’s features and labels one sample at a time. While training a model, we typically want to pass samples in “minibatches”, reshuffle the data at every epoch to reduce model overfitting, and use Python’s multiprocessing to speed up data retrieval.

DataLoader is an iterable that abstracts this complexity for us in an easy API.
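A minimal sketch of that batching behaviour, using a made-up `ToyDataset` that returns a dict of tensors in the same style as the custom dataset above. The class name, sizes and values are assumptions for illustration; only `Dataset` and `DataLoader` come from PyTorch.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Hypothetical dataset: sample i is a dict holding one fixed-length tensor."""
    def __init__(self, n_samples, seq_len):
        self.data = torch.arange(n_samples * seq_len).reshape(n_samples, seq_len)

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, i):
        return {"source_ids": self.data[i]}

# 5 samples with batch_size=2 gives batches of 2, 2 and 1 samples
toy_loader = DataLoader(ToyDataset(n_samples=5, seq_len=4), batch_size=2, shuffle=True)

# The default collate function stacks the per-sample tensors along a new batch dimension
batch_shapes = [tuple(batch["source_ids"].shape) for batch in toy_loader]
print(batch_shapes)
```

Stacking only works because every sample has the same length, which is one reason the custom dataset pads each sequence to a fixed maximum length.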

### Training

We define a function for the train process as coded below

**Input Arguments:**
- epoch
- transformer model
- data loader
- optimizer
- device : cuda or cpu

**Key Process Steps**


## Let us Check some of the key process steps outside the function definition

- The function loops through the data loader
- loads the target ids to the device (CPU or GPU)
- y_ids: all ids in the target sequence except the last one; this is the decoder input
- lm_labels: all ids in the target sequence except the first one; this is the label for the loss function
- if a padding token exists in the labels, it is replaced with -100, which the loss computation ignores
- moves the source ids and source masks to the device
- invokes the model
- prints the loss at every 100th step (every 100 batches)
- updates the weights by backpropagating the loss and stepping the optimizer



In [None]:
### Training
def train(epoch,model,tokenizer,loader,optimizer,device):
  model.train()
  print(loader)
  for step,data in enumerate(loader,0):
    y = data['target_ids'].to(device)
    y_ids = y[:,:-1].contiguous() # all ids except the last one (decoder input)
    lm_labels = y[:,1:].clone().detach() # independent copy of all ids except the first one (labels)
    lm_labels[y[:,1:]==tokenizer.pad_token_id] = -100 # padding positions are set to -100 so the loss ignores them
    source_ids = data['source_ids'].to(device)
    masks = data['source_masks'].to(device)
    outputs = model(input_ids = source_ids,attention_mask = masks,decoder_input_ids=y_ids,labels=lm_labels)
    loss  = outputs[0]
    if step%100==0:
      print('Epoch:{} | Loss:{}'.format(epoch,loss))
    optimizer.zero_grad() # clear gradients from the previous step
    loss.backward() # backpropagate the loss to compute gradients
    optimizer.step() # update the weights



## Notes on the T5 model

- T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format.

- It is trained using teacher forcing. This means that for training we always need an input sequence and a target sequence.

- The input sequence is fed to the model using input_ids.

- The target sequence is shifted to the right, i.e. prepended by a start-sequence token and fed to the decoder using the decoder_input_ids.

- In teacher-forcing style, the target sequence is then appended by the EOS token and corresponds to the labels.

- The PAD token is hereby used as the start-sequence token. T5 can be trained / fine-tuned both in a supervised and unsupervised fashion.
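The right shift described in these notes can be sketched in a few lines. This is a simplified stand-in, not the library's code: `shift_right` is a hypothetical helper, and it assumes that both the pad token and the decoder start token have id 0, as is the case for the standard T5 checkpoints. To my understanding, `T5ForConditionalGeneration` performs an equivalent shift internally when it receives `labels` without `decoder_input_ids`.

```python
import torch

def shift_right(labels, decoder_start_token_id=0, pad_token_id=0):
    """Build decoder inputs by prepending the start token and dropping the last label."""
    shifted = labels.new_zeros(labels.shape)
    shifted[:, 1:] = labels[:, :-1].clone()   # move every label one position to the right
    shifted[:, 0] = decoder_start_token_id    # the PAD id doubles as the start token
    # -100 is only meaningful to the loss, so turn it back into PAD for the decoder
    return shifted.masked_fill(shifted == -100, pad_token_id)

# Labels ending in the EOS id (1), followed by two ignored padding positions
labels = torch.tensor([[8929, 16023, 1, -100, -100]])
decoder_input_ids = shift_right(labels)
print(decoder_input_ids)
```

At each position the decoder sees the previous target token and is trained to predict the current one, which is the teacher-forcing setup described above.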



## Notes on the model input format

- encodings.labels represent the desired output and have two uses: as decoder_input_ids and as labels for the loss function.

- These two are identical except labels do not include the right-shift token at the start. Therefore, we create two copies of encodings.labels: one for decoder input and one for loss labels.

- We remove the starting right-shift token from labels as this token is not part of the expected output.

- We then remove the last token from decoder_input_ids to equalize tensor sizes.

- In the code below, y_ids represents the decoder input and lm_labels represents the labels for the loss function.

## Note on handling padding for loss functions

- Frequently, model inputs are padded to some maximum length to ensure consistent tensor sizes.
- This is accomplished by appending padding tokens to the inputs.
- These tokens need to be excluded from loss calculations.
- Hugging Face models compute the loss with PyTorch's cross-entropy, which by default ignores label positions whose ID is -100.
- Therefore, we need to convert all padding token IDs in labels to -100.
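The effect of the -100 convention can be checked directly with PyTorch's cross-entropy, whose `ignore_index` defaults to -100. The sketch below uses random logits and a made-up label sequence, and assumes a pad id of 0 as in T5.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # 4 positions, vocabulary of 10 tokens
labels = torch.tensor([3, 7, 0, 0])    # last two positions are padding (pad id 0)

# Replace padding ids with -100 so the loss skips those positions
masked = labels.clone()
masked[labels == 0] = -100

loss_masked = F.cross_entropy(logits, masked)               # ignore_index defaults to -100
loss_first_two = F.cross_entropy(logits[:2], labels[:2])    # loss on the real tokens only

print(torch.allclose(loss_masked, loss_first_two))
```

The two losses agree because ignored positions contribute neither to the sum nor to the count used for averaging.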

In [None]:
y = target['input_ids']
print(y)

##################################################################################################
# tensor.contiguous(memory_format=torch.contiguous_format) → Tensor
# Returns a contiguous in memory tensor containing the same data as self tensor.
# If self tensor is already in the specified memory format, this function returns the self tensor.
##################################################################################################

y_ids = y[:,:-1].contiguous()
print(y_ids)


lm_labels = y[:,1:].clone().detach() # independent copy of the ids, detached from the computation graph
print(lm_labels)


lm_labels[y[:,1:]==tokenizer.pad_token_id] = -100
print(lm_labels)
#################################################################################################
# tensor.detach() creates a tensor that shares storage with tensor that does not require gradient.
# tensor.clone() creates a copy of tensor that imitates the original tensor's requires_grad field.
# You should use detach() when attempting to remove a tensor from a computation graph,
# and clone as a way to copy the tensor while still keeping the copy as a part of the computation graph it came from.
##################################################################################################

tensor([[ 8929, 16023,  2213,  4173,  6324, 12591,    15,  2347,  3996,  1755,
           329, 13462,    38,     3,    88,  5050,   507,  2089,     3,     5,
          5209,  7556,   845,     3,    88,    65,   150,  1390,    12,  9030,
            17,   449,   112,  1723,   550,     3,     5,  6324, 12591,    15,
            31,     7,  8783,    45,   166,   874, 16023,  4852,    43,   118,
          1213,    16,  2019,  3069,     3,     5,     1]])
tensor([[ 8929, 16023,  2213,  4173,  6324, 12591,    15,  2347,  3996,  1755,
           329, 13462,    38,     3,    88,  5050,   507,  2089,     3,     5,
          5209,  7556,   845,     3,    88,    65,   150,  1390,    12,  9030,
            17,   449,   112,  1723,   550,     3,     5,  6324, 12591,    15,
            31,     7,  8783,    45,   166,   874, 16023,  4852,    43,   118,
          1213,    16,  2019,  3069,     3,     5]])
tensor([[16023,  2213,  4173,  6324, 12591,    15,  2347,  3996,  1755,   329,
         13462,   

### Evaluation

**The Steps are as follows:**

- Put the model in eval mode
- Loop through the validation data loader
- Extract the source ids, source masks and target ids
- Generate predictions using the parameters described below
- Decode the predictions and the labels using the parameters described below
- Predictions and labels are appended to their lists at each iteration
- The extended lists are finally returned

**The prediction / generate parameters are as follows**

- input_ids: the validation set input token ids

- attention_mask: attention mask for the input tokens

- max_length (int, optional, defaults to model.config.max_length): The maximum length of the sequence to be generated

- num_beams (int, optional, defaults to 1): Number of beams for beam search. 1 means no beam search.

- repetition_penalty (float, optional, defaults to 1.0): The parameter for repetition penalty. 1.0 means no penalty.

- length_penalty (float, optional, defaults to 1.0): Exponential penalty to the length, used with beam search. It is applied as an exponent to the sequence length, which then divides the score of the sequence. Since the score is a log-likelihood (i.e. negative), values > 0.0 encourage longer sequences and values < 0.0 encourage shorter ones; 0.0 means no length normalization.
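The arithmetic behind the length penalty is easy to check by hand. The sketch below assumes the commonly used beam scoring rule, score = sum of log-probabilities / length ** length_penalty; the helper `beam_score` and the two candidate hypotheses are made up for illustration.

```python
def beam_score(sum_logprobs, length, length_penalty=1.0):
    """Length-normalized beam score: higher (closer to zero) is better."""
    return sum_logprobs / (length ** length_penalty)

# Two hypothetical finished beams: (sum of token log-probabilities, length in tokens)
short_hyp = (-4.0, 4)
long_hyp = (-7.0, 10)

winners = {}
for lp in (0.0, 1.0, 2.0):
    s = beam_score(*short_hyp, length_penalty=lp)
    l = beam_score(*long_hyp, length_penalty=lp)
    winners[lp] = "long" if l > s else "short"
    print(f"length_penalty={lp}: short={s:.3f} long={l:.3f} -> {winners[lp]}")
```

With no normalization (0.0) the shorter hypothesis wins simply because it accumulates fewer negative log-probabilities; with a positive penalty the longer hypothesis is preferred.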


**The decode sequence has the parameters**

- skip_special_tokens (bool, optional, defaults to False): Whether or not to remove special tokens in the decoding.

- clean_up_tokenization_spaces (bool, optional, defaults to True): Whether or not to clean up the tokenization spaces.

In [None]:
def validation(tokenizer,model,device,loader):
  model.eval()
  predictions = []
  actual = []
  with torch.no_grad():
    for step,data in enumerate(loader,0):
      ids = data['source_ids'].to(device)
      mask = data['source_masks'].to(device)
      y_id = data['target_ids'].to(device)
      prediction = model.generate(input_ids=ids,attention_mask = mask,num_beams=2,max_length=170,repetition_penalty=2.5,early_stopping=True,length_penalty=1.0)


      # Decode y_id and prediction #
      preds = [tokenizer.decode(p,skip_special_tokens=True,clean_up_tokenization_spaces=False) for p in prediction]
      target = [tokenizer.decode(t,skip_special_tokens=True,clean_up_tokenization_spaces=False) for t in y_id]


      if step%20==0:
        print('block of 20 steps Completed')
      #print('predictions',preds)
      #print('actual',target)
      predictions.extend(preds)
      actual.extend(target)
  return predictions,actual


## Main Driver code




### Define Model and parameters

In [None]:
# define number of epochs
epochs = 1

# define device
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# define Tokenizer
tokenizer = T5TokenizerFast.from_pretrained('t5-base')




## Prepare Dataset ##
  ##  We will use cnn_dailymail summarization dataset for abstractive summarization #

###  SKIP BELOW if already loaded from local storage



In [None]:
#dataset = load_dataset('cnn_dailymail','3.0.0')


# As we can observe, dataset is too large so for now we will consider just 8k rows for training and 4k rows for validation
train_dataset = dataset['train'][:8000]
val_dataset = dataset['validation'][:4000]

### Data cleaning applied to the train and val datasets to remove

- text in parentheses
- '--'



In [None]:
# check - number of entries in train
print("nos of train data entries", len(train_data['article']))

# check - number of entries in validation
print("nos of val data entries", len(val_data['article']))

nos of train data entries 4000
nos of val data entries 200


In [None]:
# define pre process function (re is part of the standard library, so no install is needed)
import re
def preprocess(dataset):
    # remove any text within parentheses, then the '--' separators
    dataset['article'] = [re.sub(r'\(.*?\)','',t) for t in dataset['article']]
    dataset['article'] = [t.replace('--','') for t in dataset['article']]
    return dataset


# Pre process the data set
train_dataset = preprocess(train_data)
val_dataset = preprocess(val_data)



In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Tokenize Input data

We will use the CustomDataset Class for this

In [None]:
### pass train and validation data sets through the CustomDataset class
train_dataset = CustomDataset(train_dataset,tokenizer,270,160)
val_dataset = CustomDataset(val_dataset,tokenizer,270,160)


In [None]:
# check number of entries in train
print("After Tokenization of train " , len(train_dataset))

# check number of entries in validation
print("After Tokenization of val " , len(val_dataset))

# check first entry
#print(train_dataset[0])

After Tokenization of train  4000
After Tokenization of val  200


### Use Data Loader to get batch feed

In [None]:
train_loader = DataLoader(dataset=train_dataset,batch_size=2,shuffle=True,num_workers=0)
val_loader = DataLoader(dataset = val_dataset,batch_size=2,num_workers=0)


### Fine Tune Model


#### Step 1  Instantiate model

In [None]:
# Define model
model = T5ForConditionalGeneration.from_pretrained('t5-base').to(device)
optimizer = AdamW(model.parameters(),lr=3e-4)

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]



#### Step 2 call Train function

In [None]:

  # Call train function
  for epoch in range(epochs):
      train(epoch,model,tokenizer,train_loader,optimizer,device)



### Save the fine tuned model

In [None]:
# save to gdrive (from_pt is a loading option, not a saving option, so it is dropped here)
model.save_pretrained("drive/MyDrive/LLM_data/t5small_4000_cnn")

### Load the fine tuned model from local / drive storage

In [None]:
# load from gdrive and move the model to the same device as the data
model = T5ForConditionalGeneration.from_pretrained("drive/MyDrive/LLM_data/t5small_4000_cnn").to(device)

## Run Validation

In [None]:

# Call validation function (a single pass over the validation loader)
pred,target = validation(tokenizer,model,device,val_loader)
#print(pred,target)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


block of 20 steps Completed


KeyboardInterrupt: ignored

## Save Validation Results

- We create a DataFrame with the actual and predicted validation summaries and save it as a CSV file to local storage or Google Drive

- Read from saved file and check a few records for comparison


In [None]:
import pandas as pd
import numpy as np
# convert pred list to numpy array
arr_pred = np.array(pred)
# convert target list to numpy array
arr_target = np.array(target)
# convert to Dataframe
txt_sum_df = pd.DataFrame({'pred':arr_pred, 'actual':arr_target})
# Write CSV
txt_sum_df.to_csv('drive/MyDrive/LLM_data/txt_sum_preds.csv', index=False)


In [None]:
import pandas as pd
# Read from saved location
txt_sum_preds  = pd.read_csv('drive/MyDrive/LLM_data/txt_sum_preds.csv')

# Check data
txt_sum_preds.head()

| | pred | actual |
|---|---|---|
| 0 | a stranger. "I know I'm just the messenger," s... | Zully Broussard decided to give a kidney to a ... |
| 1 | Major League Soccer was born. the San Jose Cla... | The 20th MLS season begins this weekend . Leag... |
| 2 | spent the night in hospital as a precaution . ... | Bafetimbi Gomis collapses within 10 minutes of... |
| 3 | McIlroy launched the club used to play the off... | Rory McIlroy throws club into water at WGC Cad... |
| 4 | Cayman Naib, 13, was last seen wearing a gray ... | Cayman Naib, 13, hasn't been heard from since ... |


### Randomly check pred vs actual summaries from the validation output


In [None]:
import textwrap
import numpy as np
idx = np.random.randint(0,txt_sum_preds.shape[0])
# check Actual
print(" ---    Actual --- \n    ")
print(textwrap.fill(txt_sum_preds["actual"][idx], 40))
# Check predicted
print(" \n ---    Predicted --- \n    ")
print(textwrap.fill(txt_sum_preds["pred"][idx],40))

 ---    Actual --- 
    
It will be a first time for the tour
stateside . First show will be in
Louisville, Kentucky .
 
 ---    Predicted --- 
    
Prince and 3rdEyeGirl are bringing the
Hit & Run Tour to the U.S. for the first
time.
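
Eyeballing single records is useful, but a rough numeric score makes it easier to compare many pairs. The sketch below computes a unigram-overlap F1 between a predicted and an actual summary. It is only in the spirit of ROUGE-1, not the official metric: it lowercases and splits on whitespace, with no stemming or punctuation handling. The function name `unigram_f1` is hypothetical and not part of the original notebook.

```python
from collections import Counter


def unigram_f1(pred: str, actual: str) -> float:
    """Rough ROUGE-1-style F1 based on whitespace-separated, lowercased words."""
    pred_counts = Counter(pred.lower().split())
    actual_counts = Counter(actual.lower().split())

    # Multiset intersection: each word counts at most as often as it
    # appears in both texts.
    overlap = sum((pred_counts & actual_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(actual_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Assuming the `txt_sum_preds` DataFrame from the cell above, the function could be applied row by row to the `pred` and `actual` columns and averaged to get one number per validation run.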


## Inference with the Fine-Tuned Model



### Step 1 Load BBC news txt file

We use this article:

https://www.bbc.com/news/world-asia-india-67657873


In [None]:
# Read the BBC news file in .txt format
path = "drive/MyDrive/LLM_data/bbcnews.txt"
with open(path, 'r') as bbc_file:
    text = bbc_file.read()
print(textwrap.fill(text, 80))

While AI has already disrupted Hollywood with writers going on a strike, the
debate around the contentious issue is not widespread in the Indian film
industry which employs tens of thousands of people. Some Indian film industry
creators are underplaying the threat of AI for now, while others feel it needs
to be taken very seriously. Director Shekhar Kapur's debut Indian film, Masoom
(1983), followed a woman's journey towards accepting a child born out of her
husband's extramarital affair. For the sequel to this emotional film, which had
delicately handled the complexities around infidelity and social diktats, Kapur
decided to experiment with AI tool ChatGPT. The award-winning director was
amazed at "how intuitively AI understood the moral conflict in the plot" and
gave him a script in seconds. The AI-generated script depicted the child growing
up to resent his father, shifting the gears of their relationship from the first
film. The future with AI will be "chaotic", Kapur says, as mach

### Step 2 Preprocess text

In [None]:
# Clean the data
import re
# Remove parenthesized text and double dashes
text = re.sub(r'\(.*?\)', '', text)
text = text.replace('--', '')

# Collapse whitespace so the text is ready for the tokenizer
text = str(text)
text = ' '.join(text.split())

# Instantiate the tokenizer
tokenizer = T5TokenizerFast.from_pretrained('t5-base')

# Tokenize text
# The single source text is encoded as a batch of one; padding=True pads to the longest sequence in the batch
source = tokenizer.batch_encode_plus([text], return_tensors='pt', padding=True)


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
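
The cleaning steps in Step 2 can be collected into one small function so that the same preprocessing is applied to any article before tokenization. This is a minimal sketch using only the standard library; the name `clean_text` is hypothetical and not part of the original notebook.

```python
import re


def clean_text(text: str) -> str:
    """Apply the same cleaning as Step 2 to a raw article string."""
    # Drop parenthesized asides such as "(1983)"
    text = re.sub(r'\(.*?\)', '', text)
    # Drop double dashes
    text = text.replace('--', '')
    # Collapse newlines and repeated spaces into single spaces
    return ' '.join(text.split())
```

Note that removing a parenthesized span leaves the surrounding spaces in place, which is why the cleaned article reads "Masoom ," where the original had "Masoom (1983),".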


In [None]:

# Extract input Ids and attention Masks
source_ids = source['input_ids']
source_masks = source['attention_mask']

# cast the tensors to torch.long
ids = source_ids.to(torch.long)
mask = source_masks.to(torch.long)


### Step 3 Run inference with the fine-tuned model (trained on 4000 data rows) to generate a summary

In [None]:
# Generate summary
generate_ids = model.generate(input_ids=ids,attention_mask = mask,max_length=170)

# Decode Summary
summary_decoded = [tokenizer.decode(gen_id, skip_special_tokens=True) for gen_id in generate_ids]

# Create Output Text
output_txt = "".join(summary_decoded)


### Step 4 Compare and Check

In [None]:
# print actual text
print("\n Actual text  \n ")
print(textwrap.fill(text, 80))

# print generated summary
print("\n Summary \n ")
print(textwrap.fill(output_txt, 80))



 Actual text  
 
While AI has already disrupted Hollywood with writers going on a strike, the
debate around the contentious issue is not widespread in the Indian film
industry which employs tens of thousands of people. Some Indian film industry
creators are underplaying the threat of AI for now, while others feel it needs
to be taken very seriously. Director Shekhar Kapur's debut Indian film, Masoom ,
followed a woman's journey towards accepting a child born out of her husband's
extramarital affair. For the sequel to this emotional film, which had delicately
handled the complexities around infidelity and social diktats, Kapur decided to
experiment with AI tool ChatGPT. The award-winning director was amazed at "how
intuitively AI understood the moral conflict in the plot" and gave him a script
in seconds. The AI-generated script depicted the child growing up to resent his
father, shifting the gears of their relationship from the first film. The future
with AI will be "chaotic", Kapur s

### Step 5 Load the model trained on 400 data rows, then generate and check the summary

In [None]:
# load from gdrive
model_400 = T5ForConditionalGeneration.from_pretrained("drive/MyDrive/LLM_data/t5small_400_cnn")

In [None]:
# Generate summary
generate_ids = model_400.generate(input_ids=ids,attention_mask = mask,max_length=170)

# Decode Summary
summary_decoded = [tokenizer.decode(gen_id, skip_special_tokens=True) for gen_id in generate_ids]

# Create Output Text
output_txt_400 = "".join(summary_decoded)

In [None]:
# print generated summary
print("\n Summary \n ")
print(textwrap.fill(output_txt_400, 80))


 Summary 
 
AI has already disrupted Hollywood with writers going on a strike. But the
debate around the contentious issue is not widespread in the Indian film
industry. Some creators are underplaying the threat of AI for now.


### Load the base model and check summary generation

In [None]:
# Load Base Model

# define device
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

model_base = T5ForConditionalGeneration.from_pretrained('t5-base').to(device)

In [None]:
# Generate summary (inputs must be on the same device as the model)
generate_ids = model_base.generate(input_ids=ids.to(device), attention_mask=mask.to(device), max_length=170)

# Decode Summary
summary_decoded = [tokenizer.decode(gen_id, skip_special_tokens=True) for gen_id in generate_ids]

# Create Output Text
output_txt_base = "".join(summary_decoded)

In [None]:
# print generated summary
print("\n Summary \n ")
print(textwrap.fill(output_txt_base, 80))


 Summary 
 
the debate around AI is not widespread in the Indian film industry. he was
amazed at "how intuitively AI understood the moral conflict in the plot" and
gave him a script in seconds. The script depicted the child growing up to resent
his father, shifting the gears from the first film.
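
The generate, decode, and join steps above are repeated unchanged for each of the three models. One way to avoid the repetition is a small helper, sketched below. The name `generate_summary` is hypothetical; it assumes only that the model exposes `generate(...)` and the tokenizer exposes `decode(...)` in the way they are used in the cells above.

```python
def generate_summary(model, tokenizer, input_ids, attention_mask, max_length=170):
    """Generate a summary with `model` and return it as a single string."""
    # Generate output token ids for the encoded article
    generated_ids = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_length=max_length,
    )
    # Decode each generated sequence, dropping special tokens
    decoded = [
        tokenizer.decode(gen_id, skip_special_tokens=True)
        for gen_id in generated_ids
    ]
    # Join the decoded sequences, as in the original cells
    return "".join(decoded)
```

With this in place, each comparison reduces to one call per model, for example `generate_summary(model_400, tokenizer, ids, mask)`, using the `ids` and `mask` tensors built in Step 2.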
