# Abstractive summarization
Abstractive summarization generates new text that captures the most relevant information in the source document.


In [1]:
#!pip install transformers datasets evaluate rouge_score



In [2]:
#!pip install --upgrade pip



In [3]:
# Load the ca_test split of the billsum dataset
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

Found cached dataset billsum (C:/Users/vishal567795/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc)


In [4]:
# Split the dataset into a train and test set with the train_test_split method:
billsum = billsum.train_test_split(test_size=0.2)
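
As a rough illustration of what an 80/20 split does, the sketch below shuffles row indices and cuts them in two. It is not the 🤗 Datasets implementation: `split_indices` is a hypothetical helper, the rounding rule (test size rounded up) is an assumption, and 1237 is the row count implied by the two Map progress lines later in this notebook (989 and 248).

```python
import math
import random


def split_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices and cut them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = math.ceil(n_rows * test_size)  # assumed rounding: test size rounds up
    return indices[n_test:], indices[:n_test]


# 1237 rows is an assumption taken from the Map output (989 + 248).
train_idx, test_idx = split_indices(1237, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx))
```

Because the split is random, the first training example differs from run to run unless a seed is fixed; `train_test_split` accepts a `seed` argument for the same purpose.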

In [5]:
billsum["train"][0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nThe Legislature finds and declares all of the following:\n(a) According to United States Census Bureau, California has a poverty rate of 23.5 percent, the highest rate of any state in the country.\n(b) Children born into poverty are at higher risk of health and developmental disparities, including, but not limited to, premature birth, low birth weight, infant mortality, crime, domestic violence, developmental delays, dropping out of high school, substance abuse, unemployment, and child abuse and neglect.\n(c) In 2014, the Legislature passed Assembly Concurrent Resolution No. 155 by Assembly Member Raul Bocanegra, recognizing that research over the last two decades in the evolving fields of neuroscience, molecular biology, public health, genomics, and epigenetics reveals that experiences in the first few years of life build changes into the biology of the human body that, in turn, influence the person’

There are two fields that you’ll want to use:

text: the text of the bill, which will be the input to the model.
summary: a condensed version of text, which will be the model target.

# Preprocessing

In [6]:
# The next step is to load a T5 tokenizer to process text and summary:
from transformers import AutoTokenizer

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The preprocessing function you want to create needs to:

1. Prefix the input with a prompt so T5 knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.
2. Use the text_target keyword argument when tokenizing labels.
3. Truncate sequences to be no longer than the maximum length set by the max_length parameter.

In [7]:
prefix = "summarize: "


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
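
To see the three steps without downloading a tokenizer, here is a minimal sketch that swaps in a whitespace splitter. `toy_tokenize` and `toy_preprocess` are hypothetical stand-ins, the short length limits are chosen only to make truncation visible, and the summary string is invented for the example; the real function above returns integer token ids rather than words.

```python
PREFIX = "summarize: "


def toy_tokenize(text, max_length):
    """Hypothetical tokenizer: one token per whitespace-separated word."""
    return text.split()[:max_length]  # truncation keeps the first max_length tokens


def toy_preprocess(examples, max_input=8, max_target=4):
    """Mirror the three steps: prefix inputs, tokenize targets, truncate both."""
    inputs = [PREFIX + doc for doc in examples["text"]]
    return {
        "input_ids": [toy_tokenize(doc, max_input) for doc in inputs],
        "labels": [toy_tokenize(s, max_target) for s in examples["summary"]],
    }


# A batched call receives a dict of lists, one list per column.
batch = {
    "text": ["The people of the State of California do enact as follows"],
    "summary": ["A bill about early childhood programs"],  # invented example
}
out = toy_preprocess(batch)
print(out["input_ids"][0])
print(out["labels"][0])
```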

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once:

In [8]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)


Map:   0%|          | 0/989 [00:00<?, ? examples/s]

Map:   0%|          | 0/248 [00:00<?, ? examples/s]

Now create a batch of examples using DataCollatorForSeq2Seq. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [9]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
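
The dynamic padding described above can be sketched in plain Python. This is an illustration of the idea, not the collator's actual code: `pad_batch` is a hypothetical helper, the token ids are arbitrary, and the two constants assume T5's pad id of 0 and the usual label padding value of -100.

```python
PAD_ID = 0           # assumed pad token id (T5 uses 0)
LABEL_PAD_ID = -100  # assumed label padding value, ignored by the loss


def pad_batch(features, pad_id=PAD_ID, label_pad_id=LABEL_PAD_ID):
    """Pad every sequence to the longest one in this batch only."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * (max_in - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [label_pad_id] * (max_lab - n_lab))
    return batch


features = [
    {"input_ids": [21, 13, 7, 1], "labels": [5, 1]},
    {"input_ids": [21, 9, 1], "labels": [8, 4, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])  # rows are padded to length 4, not to the 1024 maximum
print(batch["labels"])
```

Padding per batch keeps tensors as small as the longest example in that batch, which matters here because bill texts vary widely in length.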

# Evaluate

Including a metric during training is often helpful for evaluating your model’s performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the ROUGE metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):

In [10]:
import evaluate

rouge = evaluate.load("rouge")
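
For intuition about what ROUGE measures, the sketch below computes a simplified ROUGE-1 F1 score from unigram overlap. It is an assumption-laden stand-in, not the rouge_score implementation: `rouge1_f1` is a hypothetical helper that splits on whitespace, applies no stemming, and ignores the ROUGE-2 and ROUGE-L variants that the loaded metric also reports. The two sentences are invented.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection clips each word to its smaller count.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the bill funds child care", "the bill expands child care funding")
print(round(score, 4))  # 4 shared words: precision 4/5, recall 4/6
```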


Then create a function that passes your predictions and labels to rouge.compute to calculate the ROUGE metric:

In [11]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
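
Two steps in compute_metrics are easy to miss: label positions filled with -100 (the value the loss ignores) are replaced by the pad token id before decoding, and gen_len counts only non-pad tokens. A minimal pure-Python sketch of both, assuming a pad id of 0 and using arbitrary token ids; `unmask_labels` and `mean_generated_length` are hypothetical names:

```python
PAD_ID = 0  # assumed pad token id, hard-coded for this sketch


def unmask_labels(labels, pad_id=PAD_ID, ignore_id=-100):
    """Swap the loss-ignore marker for the pad id so the rows can be decoded."""
    return [[pad_id if tok == ignore_id else tok for tok in row] for row in labels]


def mean_generated_length(predictions, pad_id=PAD_ID):
    """Average number of non-pad tokens per prediction (the gen_len value)."""
    lengths = [sum(tok != pad_id for tok in row) for row in predictions]
    return sum(lengths) / len(lengths)


labels = [[8, 4, 1, -100], [5, 1, -100, -100]]
predictions = [[0, 8, 4, 1], [0, 5, 1, 0]]  # pad ids are not counted
print(unmask_labels(labels))
print(mean_generated_length(predictions))
```

Without the replacement step, decoding would fail or produce garbage, since -100 is not a valid token id.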

Your compute_metrics function is ready to go now, and you’ll return to it when you set up your training.

# Train

You’re ready to start training your model now! Load T5 with AutoModelForSeq2SeqLM:

In [12]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

ImportError: 
AutoModelForSeq2SeqLM requires the PyTorch library but it was not found in your environment.
However, we were able to find a TensorFlow installation. TensorFlow classes begin
with "TF", but are otherwise identically named to our PyTorch classes. This
means that the TF equivalent of the class you tried to import would be "TFAutoModelForSeq2SeqLM".
If you want to use TensorFlow, please use TF classes instead!

If you really do want to use PyTorch please go to
https://pytorch.org/get-started/locally/ and follow the instructions that
match your environment.
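
The error above means the environment has TensorFlow but not PyTorch. One way to check this before importing any model class is to ask Python whether each package can be found; the sketch below uses only the standard library, and `available_backends` is a hypothetical helper name.

```python
import importlib.util


def available_backends(names=("torch", "tensorflow")):
    """Report which packages can be found, without importing them."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


backends = available_backends()
print(backends)
if not backends["torch"]:
    print("PyTorch is missing: install it, or switch to the TF* classes.")
```

If only TensorFlow is present, the options are the ones the error message lists: install PyTorch, or use the TF-prefixed classes such as TFAutoModelForSeq2SeqLM.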


In [13]:
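# Shell commands need a leading "!" (or the %pip magic) inside a notebook cell.
# You may need to restart the kernel afterwards so that transformers picks up PyTorch.
!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117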