# Summarization Hugging Face Pipeline
* Notebook by Adam Lang
* Date: 12/3/2024

# Overview
* In this notebook I demonstrate how to implement a Hugging Face summarization pipeline.
* We have to install `sacremoses`, a Python library that provides a port of the Moses tokenizer, truecaser, and other text normalization tools used in natural language processing (NLP).
  * link: https://pypi.org/project/sacremoses/

# Install Dependencies

In [1]:
!pip install -U transformers #upgrades
!pip install -U sentencepiece #upgrades
!pip install -U sacremoses #upgrades

Collecting transformers
  Downloading transformers-4.46.3-py3-none-any.whl.metadata (44 kB)
Downloading transformers-4.46.3-py3-none-any.whl (10.0 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.46.2
    Uninstalling transformers-4.46.2:
      Successfully uninstalled transformers-4.46.2
Successfully installed transformers-4.46.3
Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
Installing collected packages: sacremoses
Successfully installed sac

In [2]:
## imports
from transformers import pipeline
import pandas as pd

# Summarization Pipeline with Hugging Face
* The default summarization model is `sshleifer/distilbart-cnn-12-6` (revision `a4f8f3e`).
  * model card: https://huggingface.co/sshleifer/distilbart-cnn-12-6
  * This is a distilled BART model.

In [3]:
## let's get some text -- an example HPI (History of Present Illness) from a medical demo note
text = """
History of Present Illness:  Patient is a 48 year-old well-nourished Hispanic male with a 2-month history of Rheumatoid Arthritis and strong family history of autoimmune diseases presenting after an episode of lightheadedness and muscle weakness. Patient began experiencing symptoms 4 months ago (November 2017). At that time he experienced fatigue and joint pain in the knees and hands. He was diagnosed with Rheumatoid Arthritis. He was given a short course of corticosteroids at that time that alleviated his symptoms. He was also started on methotrexate at that time. However, he felt that the medication was ineffective and stopped after 2 weeks.  For the past two months, the patient has been experiencing worsening symptoms. He has been experiencing progressively worsening headaches accompanied with lightheadedness, light and sound sensitivity, nausea, and vomiting. He reports no loss of consciousness associated with the headaches.  No convulsion, change of vision, or loss of continence. When the headaches began 2 months ago, they would last about half of a day and occur approximately once per week. They increased in frequency and duration and over the last month have been almost daily and lasted most of the day. He is unable to eat during headaches. Concurrently, the patient is experiencing worsening joint pain in the knees and hands.  The pain is constant, accompanied by swollen and hot joints, and not alleviated by NSAIDS.  Also in the last two months, he has experienced a dry mouth that makes swallowing food difficult and a burning sensation in his eyes.   In the last month, the patient has been experiencing night sweats, chills, and subjective fevers almost every night. This has impacted his sleep significantly, and he has not been able to sleep more than 4 consecutive hours in over one month.  Three days ago, the patient was at work when a headache came on, he felt particularly light headed and weak. His left work early on that day.  
In the last three days the patient has had a constant headache and lightheadedness, and felt unable to eat.  When he has tried to eat, he has vomited immediately after eating.  He has had no changes to his bowel movements. No blood in the stool or urine. The joint pain has returned to a 10/10 in severity in the past 3 days. The patient has felt too weak to walk or leave the bedroom.  He was brought to the hospital by his sister, a nurse, after two days being unable to leave bed. At this time, his sister noticed a facial rash in the pre-auricular area that extended over the eyelids and bridge of the nose as well as cervical lymphadenopathy.  The patient was unaware of these findings and did not know how long the rash or lymphadenopathy had been present for.
At the time of the physical exam, the rash was limited to the pre-auricular area.

"""

In [4]:
## setup pipeline
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
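
The two warnings above suggest naming the model and revision explicitly and passing a `device` argument. A minimal sketch of how that could look follows. The helper name `summarization_pipeline_kwargs` is hypothetical, and the sketch assumes the caller decides whether a GPU is available (for example with `torch.cuda.is_available()`).

```python
def summarization_pipeline_kwargs(use_gpu: bool) -> dict:
    """Build keyword arguments for transformers.pipeline (hypothetical helper)."""
    return {
        "task": "summarization",
        # model name and revision copied from the warning printed above
        "model": "sshleifer/distilbart-cnn-12-6",
        "revision": "a4f8f3e",
        # device=0 selects the first GPU; device=-1 keeps the model on CPU
        "device": 0 if use_gpu else -1,
    }


kwargs = summarization_pipeline_kwargs(use_gpu=False)
print(kwargs)

# In the notebook this could be used as follows (not run here, it downloads the model):
# import torch
# from transformers import pipeline
# summarizer = pipeline(**summarization_pipeline_kwargs(torch.cuda.is_available()))
```

Pinning the revision keeps results reproducible if the model repository is updated later.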


In [5]:
## get outputs from the summarizer pipeline
outputs = summarizer(text)
outputs

[{'summary_text': ' A 48 year-old Hispanic male with a 2-month history of Rheumatoid Arthritis and strong family history of autoimmune diseases presenting after an episode of lightheadedness and muscle weakness . Patient began experiencing symptoms 4 months ago (November 2017). At that time he experienced fatigue and joint pain in the knees and hands . For the past two months, the patient has been experiencing worsening symptoms . The headaches began 2 months ago, they would last about half of a day and occur approximately once per week .'}]

In [9]:
## to get the actual summary text, index into the returned list and read the "summary_text" key
summary_output = outputs[0]["summary_text"]
summary_output

' A 48 year-old Hispanic male with a 2-month history of Rheumatoid Arthritis and strong family history of autoimmune diseases presenting after an episode of lightheadedness and muscle weakness . Patient began experiencing symptoms 4 months ago (November 2017). At that time he experienced fatigue and joint pain in the knees and hands . For the past two months, the patient has been experiencing worsening symptoms . The headaches began 2 months ago, they would last about half of a day and occur approximately once per week .'
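
The indexing step above can be wrapped in a small helper so that both models are called the same way. This is only a sketch: `extract_summary` and `fake_summarizer` are hypothetical names, and the stand-in callable simply mimics the list-of-dicts shape shown in the output above so the example runs without downloading a model.

```python
def extract_summary(summarizer, text, **generation_kwargs):
    """Run a summarization callable and return the summary as a plain string."""
    outputs = summarizer(text, **generation_kwargs)  # list with one dict per input
    # take the first result and strip the leading space seen in the raw output
    return outputs[0]["summary_text"].strip()


# Stand-in for the real pipeline: returns the same structure as the output above
def fake_summarizer(text, **kwargs):
    return [{"summary_text": " " + text[:20]}]


print(extract_summary(fake_summarizer, "Patient is a 48 year-old male with RA."))
```

With the real pipeline, a call such as `extract_summary(summarizer, text, max_length=130, min_length=30, do_sample=False)` would pass the length limits through to generation; the specific values here are illustrative assumptions.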

In [10]:
## let's get the length of summary_output (in characters, not tokens)
len(summary_output)

525
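
Note that `len` on a string counts characters. To compare summaries from different models, word counts and a compression ratio may be easier to interpret. The following is a rough sketch; `summary_stats` is a hypothetical helper, and whitespace splitting is an assumed stand-in for real tokenization.

```python
def summary_stats(source: str, summary: str) -> dict:
    """Return simple length statistics for a summary relative to its source."""
    source_words = len(source.split())
    summary_words = len(summary.split())
    return {
        "summary_chars": len(summary),
        "summary_words": summary_words,
        # fraction of the source length (in words) that the summary keeps
        "compression_ratio": summary_words / source_words if source_words else 0.0,
    }


example_source = "one two three four five six seven eight"
example_summary = "one two three four"
stats = summary_stats(example_source, example_summary)
print(stats)
```

In the notebook, `summary_stats(text, summary_output)` would report the same figures for the clinical note.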

# Let's try a Medical Specific Summarization Model
* We can use this model: `Falconsai/medical_summarization`
  * model card: https://huggingface.co/Falconsai/medical_summarization
* Let's see if it is better at summarizing medical-specific text than the `distilbart` model, which is fine-tuned on CNN/DailyMail news articles.

In [11]:
## setup pipeline
med_summarizer = pipeline("summarization",
                          model="Falconsai/medical_summarization")

## get output
med_output = med_summarizer(text)
med_output

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Token indices sequence length is longer than the specified maximum sequence length for this model (664 > 512). Running this sequence through the model will result in indexing errors


[{'summary_text': 'history of present Illness: Patient is a 48 year-old well-nourished Hispanic male with a 2-month history of Rheumatoid Arthritis and strong family history of autoimmune diseases presenting after an episode of lightheadedness and muscle weakness . he began experiencing symptoms 4 months ago (November 2017 ) . at that time he experienced fatigue and joint pain in the knees and hands .'}]
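
The warning above reports that the note is 664 tokens long while the model's stated maximum is 512. One possible way to handle this is to split the note into sentence-aligned chunks and summarize each chunk separately. The sketch below is an illustration, not the notebook's method: `chunk_text` is a hypothetical helper, word counts are used as a rough proxy for tokens, and the default budget of 350 words is an arbitrary assumption.

```python
import re


def chunk_text(text: str, max_words: int = 350) -> list:
    """Greedily pack whole sentences into chunks of at most max_words words."""
    # split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n = len(sentence.split())
        # start a new chunk when adding this sentence would exceed the budget
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


demo_chunks = chunk_text("A b c. D e f. G h i.", max_words=6)
print(demo_chunks)
```

Each chunk could then be passed to `med_summarizer` and the partial summaries joined. A single sentence longer than the budget stays as one oversized chunk. A simpler alternative is to pass `truncation=True` when calling the pipeline, which drops everything past the limit.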

In [12]:
med_summary_output = med_output[0]["summary_text"]
med_summary_output

'history of present Illness: Patient is a 48 year-old well-nourished Hispanic male with a 2-month history of Rheumatoid Arthritis and strong family history of autoimmune diseases presenting after an episode of lightheadedness and muscle weakness . he began experiencing symptoms 4 months ago (November 2017 ) . at that time he experienced fatigue and joint pain in the knees and hands .'

In [13]:
## get len of med_summary_output
len(med_summary_output)

385

# Summary
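* The medical model's summary is shorter (385 characters vs. 525), but length alone does not show that it is more precise. Note that the input (664 tokens) exceeded the medical model's 512-token limit, as the warning above reports.
* To compare the two summaries more rigorously, we could check the overlap of clinical entities extracted by an NER model, or compute a similarity metric such as cosine similarity between their embeddings.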
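
As a rough illustration of the cosine similarity idea, the sketch below compares two texts using bag-of-words count vectors. This is a simplifying assumption: word counts only capture lexical overlap, so a sentence-embedding model would be needed for true semantic similarity. The function name `cosine_similarity` is defined here and does not come from a library.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    counts_a = Counter(re.findall(r"[a-z0-9]+", text_a.lower()))
    counts_b = Counter(re.findall(r"[a-z0-9]+", text_b.lower()))
    # dot product over the words both texts share
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


score = cosine_similarity("joint pain", "joint pain in knees")
print(round(score, 4))
```

In the notebook, `cosine_similarity(summary_output, med_summary_output)` would give a first indication of how much wording the two summaries share.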