# Text Summarization

Summarization involves producing a condensed version of a document or article that retains the key information. There are two main types of summarization:

- **Extractive**: Selecting the most important sentences or phrases and copying them verbatim from the text.
- **Abstractive**: Generating new text that conveys the essential information in a more concise form.
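
To make the distinction concrete, here is a minimal sketch of the extractive approach, assuming a simple word-frequency score. The function `extractive_summary` and the sample text are hypothetical illustrations, not part of this project's pipeline; the rest of the notebook uses the abstractive approach instead.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split on '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole document (lowercased).
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freqs[w] for w in words) / max(len(words), 1)

    # Rank sentence indices by score, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


doc = (
    "The bill funds road repairs. "
    "The bill also funds bridge repairs across the state. "
    "Opponents raised concerns."
)
print(extractive_summary(doc, num_sentences=1))
```

Every sentence in the output is copied from the input unchanged, which is what separates extractive from abstractive summarization.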

In this project we will:

- Fine-tune the `t5-small` checkpoint for abstractive summarization on the California state bills split (`ca_test`) of the BillSum dataset.
- Run inference using our fine-tuned model.

## Installing Required Libraries

In [1]:
!pip install transformers datasets evaluate rouge_score


## Loading The Dataset

In [2]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

The download generates three splits: `train` (18,949 examples), `test` (3,269 examples) and `ca_test` (1,237 examples).

In [3]:
billsum = billsum.train_test_split(test_size=0.2)

In [4]:
billsum["train"][0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 54964.5 of the Government Code is amended to read:\n54964.5.\n(a) A nonprofit organization or an officer, employee, or agent of a nonprofit organization shall not use, or permit another to use, public resources received from any local agency to make a contribution or expenditure not authorized by law.\n(b) As used in this section and Section 84222.5, the following terms have the following meanings:\n(1) “Local agency” has the same meaning as that term is defined in paragraph (4) of subdivision (b) of Section 54964 and shall also include a public entity created pursuant to the Joint Exercise of Powers Act (Chapter 5 (commencing with Section 6500) of Division 7 of Title 1) by one or more entities described in Section 54964.\n(2) “Nonprofit organization” means an entity incorporated under the Nonprofit Corporation Law (Division 2 (commencing with Section 5000) of Title 1 of the Corporations Code)

We will focus on two key fields:

- **text**: the full content of the bill, serving as the input for the model.
- **summary**: a shortened version of the text, which will be the target output for the model.

## Data Preprocessing

In [5]:
from transformers import AutoTokenizer

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


In [6]:
prefix = "summarize: "


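def preprocess_function(examples):
    # Prepend the task prefix so T5 knows it should summarize.
    inputs = [prefix + doc for doc in examples["text"]]
    # Tokenize the bills, truncating anything beyond 1024 tokens.
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # Tokenize the reference summaries as targets, capped at 128 tokens.
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs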
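
The effect of the prefix and of truncation can be seen in a toy sketch. It assumes a hypothetical whitespace tokenizer (`toy_tokenize`) in place of the real T5 tokenizer, which produces integer IDs from subword pieces and appends an end-of-sequence token; the small length limits are chosen only to keep the output readable.

```python
def toy_tokenize(text: str, max_length: int) -> list:
    # Hypothetical stand-in for the real tokenizer: split on whitespace,
    # then keep at most max_length tokens (the role of truncation=True).
    return text.split()[:max_length]


def toy_preprocess(batch: dict, prefix: str = "summarize: ",
                   max_input: int = 8, max_target: int = 4) -> dict:
    # Same shape as preprocess_function: prefixed inputs plus target labels.
    inputs = [prefix + doc for doc in batch["text"]]
    return {
        "input_ids": [toy_tokenize(x, max_input) for x in inputs],
        "labels": [toy_tokenize(y, max_target) for y in batch["summary"]],
    }


batch = {
    "text": ["This bill amends the Government Code to restrict use of public resources."],
    "summary": ["Restricts nonprofit use of public resources."],
}
out = toy_preprocess(batch)
print(out["input_ids"][0])
print(out["labels"][0])
```

Note that the prefix counts toward the input length limit, so it takes up part of the token budget for every document.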

In [7]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)

The map runs over both splits: 989 training examples and 248 test examples.

In [8]:
from transformers import DataCollatorForSeq2Seq

# Pads inputs and labels dynamically to the longest sequence in each batch.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
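
What the collator does to a batch can be sketched in plain Python. This is a simplified illustration, not the library's implementation: `pad_batch` is a hypothetical helper, and it assumes a pad token ID of 0 (as in T5) and a label padding value of -100, which the loss function ignores.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad every sequence in the batch to the length of the longest one."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Labels are padded with -100 so the loss skips those positions.
        l_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * l_pad)
    return batch


features = [
    {"input_ids": [21, 13, 7, 1], "labels": [5, 1]},
    {"input_ids": [21, 9, 1], "labels": [8, 4, 1]},
]
padded = pad_batch(features)
print(padded["input_ids"])
print(padded["labels"])
```

Padding per batch rather than to a fixed global length keeps batches of short documents small, which saves memory and compute.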

## Evaluate

In [9]:
import evaluate

rouge = evaluate.load("rouge")


In [10]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Turn the generated token IDs back into text.
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # -100 marks ignored label positions; swap in the pad token so decoding works.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Average number of non-padding tokens per generated summary.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
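
For intuition about the ROUGE-1 score reported by `compute_metrics`, here is a simplified sketch of the unigram-overlap F1 it is based on. The function `rouge1_f1` is a hypothetical illustration: it assumes lowercase whitespace tokenization and no stemming, so its numbers will not match the `rouge` metric exactly.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "the bill funds road repairs",
    "the bill funds road and bridge repairs",
)
print(round(score, 4))
```

Here all five predicted words appear in the seven-word reference, so precision is 1.0 and recall is 5/7. ROUGE-2 applies the same idea to pairs of adjacent words, and ROUGE-L uses the longest common subsequence.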

In [11]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)


## Training The Model

In [12]:
training_args = Seq2SeqTrainingArguments(
    output_dir="output",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=10,
    predict_with_generate=True,
    fp16=True,
    logging_steps=62
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()



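| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum | Gen Len |
|-------|---------------|-----------------|--------|--------|--------|-----------|---------|
| 1 | 3.6322 | 2.809929 | 0.1263 | 0.0367 | 0.1058 | 0.1062 | 19.0 |
| 2 | 2.881 | 2.582771 | 0.1377 | 0.0484 | 0.1141 | 0.1142 | 19.0 |
| 3 | 2.7187 | 2.492684 | 0.1494 | 0.0567 | 0.1231 | 0.1231 | 19.0 |
| 4 | 2.6404 | 2.441141 | 0.1795 | 0.0802 | 0.1498 | 0.1495 | 19.0 |
| 5 | 2.5926 | 2.410806 | 0.1909 | 0.0908 | 0.1611 | 0.1611 | 19.0 |
| 6 | 2.5527 | 2.386251 | 0.1954 | 0.0954 | 0.1658 | 0.1658 | 19.0 |
| 7 | 2.5286 | 2.372561 | 0.1963 | 0.0963 | 0.1663 | 0.1663 | 19.0 |
| 8 | 2.5131 | 2.362059 | 0.1961 | 0.0963 | 0.166 | 0.1662 | 19.0 |
| 9 | 2.4998 | 2.357382 | 0.1967 | 0.0975 | 0.1664 | 0.1667 | 19.0 |
| 10 | 2.4911 | 2.355185 | 0.1967 | 0.097 | 0.1662 | 0.1664 | 19.0 |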

TrainOutput(global_step=620, training_loss=2.705018984886908, metrics={'train_runtime': 706.6332, 'train_samples_per_second': 13.996, 'train_steps_per_second': 0.877, 'total_flos': 2677060833116160.0, 'train_loss': 2.705018984886908, 'epoch': 10.0})
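
The `global_step=620` in the training output can be checked by hand. The arithmetic below is a sketch that assumes a single device, no gradient accumulation, and a test split size rounded up to a whole example; the dataset size and hyperparameters are the ones used in this notebook.

```python
import math

total_examples = 1237   # size of the ca_test split
test_fraction = 0.2     # test_size passed to train_test_split
batch_size = 16         # per_device_train_batch_size
epochs = 10             # num_train_epochs

# Assumption: the test split is rounded up, the rest goes to training.
n_test = math.ceil(total_examples * test_fraction)
n_train = total_examples - n_test

# The last batch of each epoch is smaller but still counts as one step.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
print(n_train, n_test, steps_per_epoch, total_steps)
```

With 62 optimizer steps per epoch, `logging_steps=62` logs the training loss once per epoch, which is why each row of the results has its own training loss value.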

## Saving The Model

In [13]:
model.save_pretrained('./model')
tokenizer.save_pretrained('./model')

('./model/tokenizer_config.json',
 './model/special_tokens_map.json',
 './model/spiece.model',
 './model/added_tokens.json',
 './model/tokenizer.json')

## Inference

In [14]:
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

In [15]:
saved_model_path = "./model"

loaded_tokenizer = AutoTokenizer.from_pretrained(saved_model_path)
inputs = loaded_tokenizer(text, return_tensors="pt").input_ids

In [16]:
loaded_model = AutoModelForSeq2SeqLM.from_pretrained(saved_model_path)
outputs = loaded_model.generate(inputs, max_new_tokens=100, do_sample=False)

In [17]:
loaded_tokenizer.decode(outputs[0], skip_special_tokens=True)

"The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll ask the ultra-wealthy and corporations to pay their fair share."

In [18]:
! zip -r model.zip model

  adding: model/ (stored 0%)
  adding: model/tokenizer.json (deflated 74%)
  adding: model/tokenizer_config.json (deflated 95%)
  adding: model/config.json (deflated 62%)
  adding: model/generation_config.json (deflated 29%)
  adding: model/spiece.model (deflated 48%)
  adding: model/special_tokens_map.json (deflated 85%)
  adding: model/model.safetensors (deflated 11%)
