# Abstractive summarization
Generate new text that captures the most relevant information in the source document.


In [1]:
#!pip install transformers datasets

In [2]:
#!pip install transformers datasets evaluate rouge_score

In [3]:
#!pip install --upgrade pip

In [4]:
from huggingface_hub import notebook_login

notebook_login()


In [5]:
# Load the "ca_test" split of the BillSum dataset
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

Found cached dataset billsum (C:/Users/vishal567795/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc)


In [6]:
#Split the dataset into a train and test set with the train_test_split method:
billsum = billsum.train_test_split(test_size=0.2)
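The split sizes follow from `test_size=0.2`. Here is a minimal sketch of the arithmetic, assuming the test split is rounded up and the train split takes the remainder. That assumption matches the 989 and 248 example counts reported by the map step later in this notebook (989 + 248 = 1237). `split_sizes` is a hypothetical helper, not part of 🤗 Datasets:

```python
import math


def split_sizes(n_examples, test_size=0.2):
    # Hypothetical helper: the test split is rounded up,
    # the train split receives whatever is left.
    n_test = math.ceil(n_examples * test_size)
    n_train = n_examples - n_test
    return n_train, n_test


print(split_sizes(1237))  # (989, 248)
```

Note that train_test_split shuffles by default, so the first training example differs between runs unless you pass a seed argument.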

In [7]:
billsum["train"][0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 23102 of the\nRevenue and Taxation Code\nis amended to read:\n23102.\nAny corporation or limited liability company holding or organized to hold stock or bonds of any other corporation or corporations, and not trading in stock or bonds or other securities held, and engaging in no activities other than the receipt and disbursement of dividends from stock or interest from bonds, and no activities other than those exempted under subdivision (c) of Section 191 of the Corporations Code, is not a corporation or limited liability company doing business in this State for the purposes of this chapter or Chapter 10.6.\nSECTION 1.\nSection 17941 of the Revenue and Taxation Code is amended to read:\n17941.\n(a) For each taxable year beginning on or after January 1, 1997, a limited liability company doing business in this\nstate (as\nstate, as\ndefined in Section\n23101)\n23101,\nshall pay annually to this 

There are two fields that you’ll want to use:

text: the text of the bill, which will be the input to the model.
summary: a condensed version of text, which will be the model target.

# Preprocessing

In [8]:
# The next step is to load a T5 tokenizer to process text and summary:
from transformers import AutoTokenizer

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The preprocessing function you want to create needs to:

1. Prefix the input with a prompt so T5 knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.
2. Use the text_target keyword argument when tokenizing labels.
3. Truncate sequences to be no longer than the maximum length set by the max_length parameter.

In [9]:
prefix = "summarize: "


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
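To see how the prefix and truncation interact, here is a toy sketch. It assumes a whitespace tokenizer where each word is one token, which is far simpler than the real T5 tokenizer. `toy_encode` is a hypothetical stand-in, not a Transformers function:

```python
def toy_encode(text, max_length):
    # Hypothetical whitespace "tokenizer": one word is one token.
    # Truncation keeps only the first max_length tokens.
    return text.split()[:max_length]


prefix = "summarize: "
doc = "The people of the State of California do enact as follows"

tokens = toy_encode(prefix + doc, max_length=6)
print(tokens)  # ['summarize:', 'The', 'people', 'of', 'the', 'State']
```

The prefix is part of the input, so it counts toward the max_length budget, and anything past the limit is dropped from the end of the document.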

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once:

In [10]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)


Map:   0%|          | 0/989 [00:00<?, ? examples/s]

Map:   0%|          | 0/248 [00:00<?, ? examples/s]
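With batched=True, the function receives a dictionary of lists (one list per column) rather than a single example, which is why preprocess_function iterates over examples["text"]. The following pure-Python sketch mimics that calling convention. `toy_batched_map` and `add_prefix` are hypothetical names, and the real map method does considerably more (caching, Arrow storage, progress bars):

```python
def toy_batched_map(rows, fn, batch_size=2):
    # rows: list of dicts. fn receives a dict of lists and returns a dict of lists.
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of rows into a dict of columns.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        merged = {**batch, **fn(batch)}
        # Turn the columns back into individual rows.
        for i in range(len(chunk)):
            out.append({key: values[i] for key, values in merged.items()})
    return out


def add_prefix(batch):
    return {"prefixed": ["summarize: " + doc for doc in batch["text"]]}


rows = [{"text": "bill A"}, {"text": "bill B"}, {"text": "bill C"}]
mapped = toy_batched_map(rows, add_prefix)
print(mapped[2])  # {'text': 'bill C', 'prefixed': 'summarize: bill C'}
```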

Now create a batch of examples using DataCollatorForSeq2Seq. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [11]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors="tf")
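Dynamic padding can be illustrated with a small NumPy sketch. It assumes the T5 pad token id of 0 for inputs and the value -100 for labels, which the loss ignores. `pad_batch` is a hypothetical helper and the token ids are made up for illustration:

```python
import numpy as np


def pad_batch(sequences, pad_value):
    # Pad every sequence to the longest one in this batch only.
    longest = max(len(seq) for seq in sequences)
    return np.array([seq + [pad_value] * (longest - len(seq)) for seq in sequences])


input_ids = [[21, 7, 1], [21, 7, 9, 4, 1]]  # made-up token ids
labels = [[5, 1], [5, 6, 8, 1]]

padded_inputs = pad_batch(input_ids, pad_value=0)   # 0 is T5's pad token id
padded_labels = pad_batch(labels, pad_value=-100)   # -100 is ignored by the loss

print(padded_inputs.shape, padded_labels.shape)  # (2, 5) (2, 4)
```

A batch of short documents is padded only to its own longest sequence, so it never pays the cost of the longest document in the whole dataset.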

# Evaluate

Including a metric during training is often helpful for evaluating your model’s performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the ROUGE metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):

In [12]:
import evaluate

rouge = evaluate.load("rouge")
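ROUGE scores measure n-gram overlap between a generated summary and a reference. As a rough intuition, here is a simplified ROUGE-1 F1 computed over lowercase whitespace tokens. This is a sketch only: the real metric uses its own tokenization and optional stemming, so its numbers will differ. `rouge1_f1` is a hypothetical function:

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    # Simplified unigram overlap, not the official ROUGE implementation.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "the bill amends the tax code",
    "the bill amends the revenue and taxation code",
)
print(round(score, 4))  # 0.7143
```

Five of the six predicted words appear in the eight-word reference, giving a precision of 5/6, a recall of 5/8, and an F1 of 5/7.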

Then create a function that passes your predictions and labels to rouge.compute to calculate the ROUGE metric:

In [13]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
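Two steps in compute_metrics are easy to miss: labels padded with -100 must be mapped back to a real token id before decoding, and gen_len counts only non-padding tokens. Here is a small NumPy sketch with made-up token ids, assuming the T5 pad token id of 0:

```python
import numpy as np

pad_token_id = 0  # T5's pad token id

labels = np.array([[5, 6, 1, -100, -100],
                   [7, 1, -100, -100, -100]])
# -100 is not a valid vocabulary id, so swap it for the pad id before decoding.
restored = np.where(labels != -100, labels, pad_token_id)

# T5 generations start with the pad id as the decoder start token.
predictions = np.array([[0, 5, 6, 1, 0],
                        [0, 7, 8, 9, 1]])
prediction_lens = [np.count_nonzero(pred != pad_token_id) for pred in predictions]

print(restored[0].tolist())      # [5, 6, 1, 0, 0]
print(np.mean(prediction_lens))  # 3.5
```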

Your compute_metrics function is ready to go now, and you’ll return to it when you set up your training.

# Train

You’re ready to start training your model now! Set up an optimizer first, then load T5 with TFAutoModelForSeq2SeqLM:

In [14]:
from transformers import create_optimizer, AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

In [15]:
#pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [16]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_billsum["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
    tokenized_billsum["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Old behaviour: columns=['a'], labels=['labels'] -> (tf.Tensor, tf.Tensor)  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor)  
New behaviour: columns=['a'],labels=['labels'] -> ({'a': tf.Tensor}, {'labels': tf.Tensor})  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor) 


In [17]:
import tensorflow as tf

model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [18]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_test_set)

In [19]:
from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(
    output_dir="my_awesome_billsum_model",
    tokenizer=tokenizer,
)

C:\Users\vishal567795\Desktop\Text_Summarizer\my_awesome_billsum_model is already a clone of https://huggingface.co/vishal567795/my_awesome_billsum_model. Make sure you pull the latest changes with `repo.git_pull()`.


In [20]:
callbacks = [metric_callback, push_to_hub_callback]

In [21]:
#model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=callbacks)      

In [22]:
text = """summarize: Ram is a name that carries immense significance in various cultures, religions, and mythologies across the world. In Hinduism, Ram is considered one of the most revered and influential deities, known as the seventh avatar of Lord Vishnu. His story, beautifully depicted in the ancient Indian epic, the Ramayana, is a timeless tale that continues to inspire millions.Ram, also known as Rama or Shri Ram, is portrayed as an ideal human being, a paragon of virtue, and a true embodiment of righteousness. Born in Ayodhya to King Dasharatha and Queen Kaushalya, Ram was destined to fulfill a great purpose. As a young prince, he was adored by all, possessing a charming personality, unmatched skills, and an unwavering commitment to justice.However, Ram's life took a dramatic turn when his father was coerced into exiling him from the kingdom for fourteen years, due to the plotting of his stepmother, Kaikeyi. Despite the injustice he faced, Ram accepted his exile with humility and embraced the hardships that awaited him. Accompanied by his devoted wife, Sita, and loyal brother, Lakshmana, Ram embarked on an arduous journey through forests, facing numerous challenges and encountering mythical creatures.

Throughout his exile, Ram demonstrated exceptional qualities of leadership, compassion, and unwavering devotion to duty. He valiantly protected sages, defeated formidable demons, and upheld righteousness in every situation. His unwavering commitment to truth and justice won him the unwavering loyalty of his followers and earned him the respect of divine beings.

One of the most significant episodes in Ram's life is his encounter with the demon king Ravana. Ravana, with his ten heads and extraordinary powers, had abducted Sita, Ram's beloved wife. Driven by love and righteousness, Ram gathered an army of allies, including the brave monkey warrior Hanuman, to wage a fierce battle against Ravana and his forces. The epic confrontation culminated in Ram's triumph over evil, symbolizing the victory of good over darkness.

Ram's unwavering love for Sita and his relentless pursuit to rescue her exemplify his devotion and commitment to his loved ones. He is seen as the ideal husband, respecting and cherishing his wife throughout their journey together. Their reunion after the arduous trials and tribulations is celebrated as an epitome of love, trust, and unwavering loyalty.

The story of Ram has transcended religious boundaries and has become a universal symbol of morality, righteousness, and the triumph of good over evil. It teaches valuable lessons about the importance of adhering to one's duty, the power of unwavering faith, and the strength of familial and social bonds. Ram's virtuous character continues to inspire millions of people to strive for righteousness and embody the values he represents.

In addition to his divine persona, Ram's legacy is also seen in the architectural marvels he left behind. Ayodhya, his birthplace, is revered as a holy site and is believed to be the location of his kingdom. The magnificent temples dedicated to Ram, such as the iconic Ram Janmabhoomi temple, attract devotees from all corners of the globe, fostering a sense of spirituality and devotion.

"""

In [23]:
from transformers import pipeline

summarizer = pipeline("summarization", model="stevhliu/my_awesome_billsum_model")
summarizer(text)

Token indices sequence length is longer than the specified maximum sequence length for this model (779 > 512). Running this sequence through the model will result in indexing errors


[{'summary_text': 'Ram is considered one of the most revered and influential deities, known as the seventh avatar of Lord Vishnu. His story, beautifully depicted in the ancient Indian epic, the Ramayana, continues to inspire millions of people to strive for righteousness and embody the values he represents. Ram is portrayed as an ideal human being, a paragon of virtue, and a true embodiment of righteousness.'}]

In [24]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_billsum_model")
inputs = tokenizer(text, return_tensors="tf").input_ids

Token indices sequence length is longer than the specified maximum sequence length for this model (777 > 512). Running this sequence through the model will result in indexing errors


In [25]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained("stevhliu/my_awesome_billsum_model",from_pt=True)
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFT5ForConditionalGeneration: ['encoder.embed_tokens.weight', 'lm_head.weight', 'decoder.embed_tokens.weight']
- This IS expected if you are initializing TFT5ForConditionalGeneration from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFT5ForConditionalGeneration from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [26]:
tokenizer.decode(outputs[0], skip_special_tokens=True)

'Ramayana is a name that carries immense significance in various cultures, religions, and mythologies across the world. Ram is considered one of the most revered and influential deities, known as the seventh avatar of Lord Vishnu. His story, beautifully depicted in the ancient Indian epic, the Ramayana, is a timeless tale that continues to inspire millions. Ram is portrayed as an ideal human being, a paragon of virtue, and'
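Calling generate with do_sample=False selects the highest-scoring token at each step (greedy decoding when no beam search is configured), and max_new_tokens caps the output length. That cap is likely why the decoded summary above stops mid-sentence. The sketch below shows the idea with a toy scoring function. `greedy_decode` and `toy_logits` are hypothetical names, not Transformers APIs:

```python
import numpy as np


def greedy_decode(next_token_logits, start_id, eos_id, max_new_tokens):
    tokens = [start_id]
    for _ in range(max_new_tokens):
        # Always pick the single most likely next token.
        next_id = int(np.argmax(next_token_logits(tokens)))
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens[1:]  # drop the start token


def toy_logits(tokens):
    # Hypothetical "model" over a 6-token vocabulary:
    # it always prefers the id following the last token.
    logits = np.zeros(6)
    logits[(tokens[-1] + 1) % 6] = 1.0
    return logits


full = greedy_decode(toy_logits, start_id=0, eos_id=5, max_new_tokens=10)
cut = greedy_decode(toy_logits, start_id=0, eos_id=5, max_new_tokens=3)
print(full)  # [1, 2, 3, 4, 5]
print(cut)   # [1, 2, 3]
```

The second call never reaches the end-of-sequence id because the token budget runs out first, which mirrors a summary that is cut off before the model finishes its sentence.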