<a href="https://colab.research.google.com/github/anyuanay/medium/blob/main/src/working_huggingface/Working_with_HuggingFace_ch3_Text_Summarization_by_Pre_Trained_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: Working with Hugging Face Models and Datasets
## Chapter 3: Text Summarization Using Models in Hugging Face
### Lesson 3.1: Utilizing ChatGPT to Navigate the Usage of Hugging Face for Text Summarization

In this lesson, we use ChatGPT as a technical assistant to work out how to summarize text with models from Hugging Face.

# Install Transformers and Datasets from Hugging Face

In [1]:
# Install Transformers (with PyTorch support) and Datasets
! pip install -q transformers[torch] datasets
# To install from source instead of the latest release, comment out the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

# Load the BillSum Summarization Dataset

In [2]:
from datasets import load_dataset

# Load the California test split ("ca_test") of the BillSum dataset
billsum = load_dataset("billsum", split="ca_test")

This split has three features (text, summary, and title) and 1,237 rows:

In [5]:
billsum

Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 1237
})

Let us take a look at an example from the dataset:

In [6]:
rec = billsum[1]
for key in rec:
    print(key, ":", rec[key])

text : The people of the State of California do enact as follows:


SECTION 1.
Section 1170.02 is added to the Penal Code, to read:
1170.02.
A prisoner is not eligible for resentence or recall pursuant to subdivision (e) of Section 1170 if he or she was convicted of first-degree murder if the victim was a peace officer, as defined in Section 830.1, 830.2, 830.3, 830.31, 830.32, 830.33, 830.34, 830.35, 830.36, 830.37, 830.4, 830.5, 830.6, 830.10, 830.11, or 830.12, who was killed while engaged in the performance of his or her duties, and the individual knew, or reasonably should have known, that the victim was a peace officer engaged in the performance of his or her duties, or the victim was a peace officer or a former peace officer under any of the above-enumerated sections, and was intentionally killed in retaliation for the performance of his or her official duties.
SEC. 2.
Section 3550 of the Penal Code is amended to read:
3550.
(a) Notwithstanding any other law, except as provided 

# Load the pre-trained PEGASUS model

In [3]:
# PegasusTokenizer requires the SentencePiece library
! pip install sentencepiece



In [4]:
import torch

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# Load the pre-trained PEGASUS model and tokenizer
model_name = 'google/pegasus-large'  # Base checkpoint; 'google/pegasus-billsum' is a variant fine-tuned on BillSum

tokenizer = PegasusTokenizer.from_pretrained(model_name)

model = PegasusForConditionalGeneration.from_pretrained(model_name)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

This warning can likely be ignored here: PEGASUS uses fixed sinusoidal position embeddings, which are computed rather than learned, so they are not stored in the checkpoint.



# Summarize a Text

In [5]:
# Use the 'text' field of the first record in the dataset as the text to summarize
text_to_summarize = billsum[0]['text']

In [6]:
text_to_summarize

'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nThe Legislature finds and declares all of the following:\n(a) (1) Since 1899 congressionally chartered veterans’ organizations have provided a valuable service to our nation’s returning service members. These organizations help preserve the memories and incidents of the great hostilities fought by our nation, and preserve and strengthen comradeship among members.\n(2) These veterans’ organizations also own and manage various properties including lodges, posts, and fraternal halls. These properties act as a safe haven where veterans of all ages and their families can gather together to find camaraderie and fellowship, share stories, and seek support from people who understand their unique experiences. This aids in the healing process for these returning veterans, and ensures their health and happiness.\n(b) As a result of congressional chartering of these veterans’ organizations, the United States Internal Reve

In [7]:
# Tokenize the text, truncating it to at most 512 tokens
inputs = tokenizer(text_to_summarize, truncation=True, return_tensors="pt", max_length=512)
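
The `truncation=True, max_length=512` arguments mean the model only sees the beginning of a long bill. The following is a minimal sketch of that effect, not the tokenizer's actual implementation: it uses whitespace-separated words as a stand-in for SentencePiece subword tokens, and `truncate_tokens` is a hypothetical helper introduced only for illustration.

```python
def truncate_tokens(tokens, max_length):
    """Keep the first max_length tokens and report how many were dropped."""
    kept = tokens[:max_length]
    dropped = max(len(tokens) - max_length, 0)
    return kept, dropped


# Assumption: whitespace words stand in for the subword tokens a real tokenizer produces.
toy_tokens = "SECTION 1. The Legislature finds and declares all of the following".split()
kept, dropped = truncate_tokens(toy_tokens, max_length=5)

print("kept:", kept)
print("dropped:", dropped, "tokens")
```

The real tokenizer drops anything beyond the limit in a similar way, so content late in a bill cannot influence the generated summary.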

In [8]:
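# Generate the summary with beam search (4 beams), between 40 and 150 tokens long.
# Passing **inputs forwards both input_ids and attention_mask to the model.
summary_ids = model.generate(**inputs, num_beams=4, length_penalty=2.0, max_length=150, min_length=40)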
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
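
The `length_penalty` argument is easy to misread. According to the Transformers documentation, beam search divides a hypothesis's summed log-probability by its length raised to the power `length_penalty`; because log-probabilities are negative, values above 0 favor longer outputs. The sketch below works through that arithmetic with made-up numbers. `beam_score` is a hypothetical helper written for illustration, not a library function.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized score of a finished beam hypothesis (higher is better)."""
    return sum_logprobs / (length ** length_penalty)


# Two made-up hypotheses with the same average log-probability per token (-0.6).
short_hyp = {"sum_logprobs": -6.0, "length": 10}
long_hyp = {"sum_logprobs": -12.0, "length": 20}

for penalty in (0.0, 1.0, 2.0):
    short_score = beam_score(short_hyp["sum_logprobs"], short_hyp["length"], penalty)
    long_score = beam_score(long_hyp["sum_logprobs"], long_hyp["length"], penalty)
    print(f"length_penalty={penalty}: short={short_score:.3f}, long={long_score:.3f}")
```

In this toy case the shorter hypothesis wins at `length_penalty=0.0`, the two tie at `1.0`, and the longer one wins at `2.0`, the value used in the cell above.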

In [9]:
print(summary)

(b) As a result of congressional chartering of these veterans’ organizations, the United States Internal Revenue Service created a special tax exemption for these organizations under Section 501(c)(19) of the Internal Revenue Code. (c) Section 501(c)(19) of the Internal Revenue Code and related federal regulations provide for the exemption for posts or organizations of war veterans, or an auxiliary unit or society of, or a trust or foundation for, any such post or organization that, among other attributes, carries on programs to perpetuate the memory of deceased veterans and members of the Armed Forces and to comfort their survivors, conducts programs for religious, charitable, scientific, literary, or educational purposes, sponsors or participates in activities of a patriotic nature, and provides social and


# Evaluate the Summary

In [10]:
given_summary = billsum[0]['summary']
given_summary

'Existing property tax law establishes a veterans’ organization exemption under which property is exempt from taxation if, among other things, that property is used exclusively for charitable purposes and is owned by a veterans’ organization.\nThis bill would provide that the veterans’ organization exemption shall not be denied to a property on the basis that the property is used for fraternal, lodge, or social club purposes, and would make specific findings and declarations in that regard. The bill would also provide that the exemption shall not apply to any portion of a property that consists of a bar where alcoholic beverages are served.\nSection 2229 of the Revenue and Taxation Code requires the Legislature to reimburse local agencies annually for certain property tax revenues lost as a result of any exemption or classification of property for purposes of ad valorem property taxation.\nThis bill would provide that, notwithstanding Section 2229 of the Revenue and Taxation Code, no a

In [14]:
! pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [15]:
from rouge import Rouge

In [16]:
# Evaluate the generated summary (hypothesis, first argument) against the given summary (reference, second argument) using the ROUGE metric
rouge = Rouge()
scores = rouge.get_scores(summary, given_summary, avg=True)

In [17]:
print(scores)

{'rouge-1': {'r': 0.25773195876288657, 'p': 0.30864197530864196, 'f': 0.28089887144489334}, 'rouge-2': {'r': 0.03896103896103896, 'p': 0.05357142857142857, 'f': 0.04511277707954149}, 'rouge-l': {'r': 0.20618556701030927, 'p': 0.24691358024691357, 'f': 0.22471909616399455}}
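
In this output, `r`, `p`, and `f` are recall, precision, and F1 score. To make those quantities concrete, here is a minimal sketch of ROUGE-N computed on whitespace tokens. `rouge_n` is a hypothetical helper written for illustration; the `rouge` package tokenizes and counts n-grams in its own way, so its numbers need not match this sketch.

```python
from collections import Counter


def rouge_n(hypothesis: str, reference: str, n: int = 1) -> dict:
    """Toy ROUGE-N: clipped n-gram overlap between two whitespace-tokenized strings."""

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    hyp, ref = ngrams(hypothesis), ngrams(reference)
    overlap = sum((hyp & ref).values())          # n-grams shared by both texts
    p = overlap / max(sum(hyp.values()), 1)      # precision: share of the hypothesis that matches
    r = overlap / max(sum(ref.values()), 1)      # recall: share of the reference that is covered
    f = 2 * p * r / (p + r) if p + r else 0.0    # F1: harmonic mean of precision and recall
    return {"r": r, "p": p, "f": f}


toy_scores = rouge_n(
    "the bill exempts veterans halls",
    "the bill exempts property owned by veterans",
)
print(toy_scores)
```

Here four of the five hypothesis words appear in the seven-word reference, giving a precision of 4/5, a recall of 4/7, and an F1 of 2/3. Read the same way, the ROUGE-1 recall of about 0.26 above indicates that the generated summary covers roughly a quarter of the reference summary's unigrams.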
