## ***CLINICAL TEXT SUMMARIZATION LAB***

## Libraries

Here are the most important libraries we'll be using for this lab:

*   **SpaCy** - Efficient text processing
*   **Tika** - Extract text from PDF files
*   **Rouge_Score** - Summary quality metric (ROUGE)
*   **en_core_web_lg** - Large English language model for SpaCy; its pipeline handles "tokenization" and "lemmatization"
*   **Transformers** - Load the pre-trained summarization model and its tokenizer

In [None]:
%%capture

%pip install spacy
!python -m spacy download en_core_web_lg
%pip install tika
%pip install rouge_score

In [None]:
# ML Libraries
import spacy
import re
from tika import parser
from rouge_score import rouge_scorer
from transformers import pipeline, AutoTokenizer

# Visualization
import plotly.express as px
import pandas as pd

In this next cell, we retrieve an access key for the ML platform **HuggingFace**. The key is stored in Colab's secrets under the name `NDL`.

In [None]:
%%capture

from google.colab import userdata

hf_token = userdata.get('NDL') # Grab the key and keep it in a variable so it can be used later

## Data Preprocessing & Extraction

To summarize clinical documents, we first need the raw text. Here, we use the `parser` module from the `tika` library and call `parser.from_file` to extract the raw text from a PDF file.

In [None]:
%%capture

raw_text = parser.from_file("ENGERIX-B.pdf")

Note: `raw_text` is a dictionary. To access the extracted text, we read `raw_text['content']`.



In [None]:
content_text = raw_text['content']
content_text
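
As a rough sketch of the dictionary access described above, the snippet below uses a hand-built stand-in for Tika's output. The `fake_parsed` and `empty_parsed` dictionaries and the `get_content` helper are hypothetical and not part of `tika`. The guard against a missing or `None` value under `'content'` is an assumption about what may happen with PDFs that have no extractable text.

```python
def get_content(parsed):
    """Return the extracted text from a Tika-style result, or '' if there is none."""
    # .get() avoids a KeyError; `or ""` turns a None value into an empty string.
    return parsed.get("content") or ""

# Hypothetical stand-ins for the dictionary returned by parser.from_file(...)
fake_parsed = {"metadata": {"Content-Type": "application/pdf"}, "content": "\n\nSample text\n"}
empty_parsed = {"metadata": {}, "content": None}

print(repr(get_content(fake_parsed)))
print(repr(get_content(empty_parsed)))
```

Returning an empty string keeps later steps such as `re.sub` from failing on `None`.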

These next few steps are important. We need to **clean** our document before processing it. PDF extraction leaves behind runs of blank lines and repeated spaces (for example `'  '`), and we don't want our model to treat that layout noise as meaningful text.

In [None]:
def clean_text(text):
  """Cleans extra lines and extra whitespace."""
  text = re.sub(r'\n+', ' ', text)
  text = re.sub(r'\s+', ' ', text).strip()
  return text

cleaned_text = clean_text(content_text)
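
To see what this kind of cleaning does, here is a minimal sketch on a made-up string. The `collapse_whitespace` name and the `sample` text are illustrative only. It relies on the fact that `\s` in Python's `re` module matches spaces, tabs, and newlines alike.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample imitating the blank lines and stray spaces left by PDF extraction
sample = "\n\n  First   line\n\n\nSecond\tline  \n"
print(collapse_whitespace(sample))
```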

Then, we pass our document to the `en_core_web_lg` model, which performs **tokenization** and **lemmatization** automatically.

In [None]:
nlp = spacy.load("en_core_web_lg") # Model w/ auto tokenization and lemmatization
doc = nlp(cleaned_text)

doc

Let's test it 😎 We split the document into sentences and keep only the non-empty ones.

In [None]:
sentences = []
for sent in doc.sents:
  cleaned_sentence = sent.text.strip()
  if cleaned_sentence:
    sentences.append(cleaned_sentence)

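Tokenization and lemmatization are related but distinct. Tokenization splits text into individual tokens (words and punctuation), while lemmatization reduces each token to its base form (for example, *vaccines* becomes *vaccine*). Let's look at them :)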

In [None]:
# Example of what tokenization looks like vs Lemmatization
tokens = []
lemmas = []

for token in doc:
  tokens.append(token.text)
  lemmas.append(token.lemma_)

print("First 10 tokens:", tokens[:10])
print("First 10 lemmas:", lemmas[:10])
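
The difference can also be sketched without any model. The toy version below is not how SpaCy works internally: `toy_tokenize` is a simple regular expression, and `TOY_LEMMAS` is a tiny hand-written lookup table invented for this example.

```python
import re

# Hand-written lookup table; a real lemmatizer uses rules and a much larger vocabulary
TOY_LEMMAS = {"vaccines": "vaccine", "were": "be", "administered": "administer", "doses": "dose"}

def toy_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def toy_lemmatize(tokens):
    """Lowercase each token and map it to its base form when the table knows it."""
    return [TOY_LEMMAS.get(token.lower(), token.lower()) for token in tokens]

toy_tokens = toy_tokenize("Vaccines were administered in three doses.")
toy_lemmas = toy_lemmatize(toy_tokens)
print("Tokens:", toy_tokens)
print("Lemmas:", toy_lemmas)
```

Tokens keep the surface form of each word, while lemmas collapse inflected forms onto a shared base form.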

## Summarization Model

The moment you've ALL been waiting for. Our text is loaded and preprocessed. Now we pass it into a **pre-trained** model, `t5-large`, to do the summarization for us.

In [None]:
model_name = "t5-large"

summarizer = pipeline("summarization", model=model_name) # Load model
tokenizer = AutoTokenizer.from_pretrained(model_name) # Count tokens accurately

In [None]:
def chunk_text_by_sentence(sentences, max_tokens=512):
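  """
  Splits a list of sentences into chunks, ensuring each chunk stays within max_tokens
  (counted with the model's tokenizer, not in words).
  For large texts, this step keeps each chunk within the model's input length limit.
  Note: a single sentence longer than max_tokens still becomes its own chunk.
  """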

  chunks, current_chunk = [], []
  current_tokens = 0

  for sentence in sentences:
    sentence_tokens = len(tokenizer.encode(sentence, add_special_tokens=False))

    # If adding this sentence exceeds max_tokens, save the current chunk and start a new one
    if current_tokens + sentence_tokens > max_tokens:
      if current_chunk:
        chunks.append(" ".join(current_chunk))
      current_chunk, current_tokens = [sentence], sentence_tokens
    else:
      current_chunk.append(sentence)
      current_tokens += sentence_tokens

  # Add the last chunk if it contains any sentences
  if current_chunk:
    chunks.append(" ".join(current_chunk))

  return chunks

In [None]:
text_chunks = chunk_text_by_sentence(sentences, max_tokens=300)

len(text_chunks)
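
The greedy packing idea is easier to follow on a tiny input. The sketch below mirrors the logic above but counts tokens by splitting on whitespace instead of using the T5 tokenizer, so it runs without downloading a model. The `pack_sentences` name and the toy sentences are made up for illustration.

```python
def pack_sentences(sentences, max_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily pack sentences into chunks of at most max_tokens tokens each."""
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current chunk when this sentence would push it over the limit
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    # Keep whatever is left over as the final chunk
    if current:
        chunks.append(" ".join(current))
    return chunks

toy_sentences = ["a b c", "d e", "f g h i", "j"]
toy_chunks = pack_sentences(toy_sentences, max_tokens=5)
print(toy_chunks)
```

Sentences are never split, so chunk boundaries always fall between sentences. The trade-off is that one very long sentence can still exceed the limit on its own.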

In [None]:
def abstractive_summary(chunks):
  """
  Summarizes each chunk individually, then summarizes the combined
  chunk summaries to produce one final summary.
  """

  chunk_summaries = []
  for chunk in chunks:
    # Stage 1: summarize each chunk on its own (20 to 50 tokens per chunk)
    summary = summarizer(chunk, max_length=50, min_length=20, do_sample=False)[0]['summary_text']
    chunk_summaries.append(summary)

  # Join all the chunk summaries into a single string
  combined_summary = " ".join(chunk_summaries)

  # Stage 2: summarize the joined summaries into the final summary (50 to 100 tokens)
  final_summary = summarizer(combined_summary, max_length=100, min_length=50, do_sample=False)[0]['summary_text']

  return final_summary

In [None]:
%%capture

final_summary = abstractive_summary(text_chunks)
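
The two-stage structure (summarize each chunk, then summarize the joined summaries) can be sketched independently of the model. In the snippet below, `first_words` is a hypothetical stand-in for the summarizer that simply keeps the first few words. It is not a real summarization method and is only there to make the data flow visible.

```python
def first_words(text, max_words):
    """Stand-in 'summarizer': keep only the first max_words words."""
    return " ".join(text.split()[:max_words])

def two_stage_summary(chunks, summarize_fn, chunk_words=5, final_words=8):
    """Summarize each chunk, join the partial summaries, then summarize the result."""
    partial_summaries = [summarize_fn(chunk, chunk_words) for chunk in chunks]
    combined = " ".join(partial_summaries)
    return summarize_fn(combined, final_words)

toy_chunks_in = [
    "one two three four five six seven",
    "alpha beta gamma delta epsilon zeta",
]
toy_summary = two_stage_summary(toy_chunks_in, first_words)
print(toy_summary)
```

Passing the summarizer in as `summarize_fn` makes it easy to swap the stand-in for a real model call later.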

In [None]:
# Clean the summary further. These patterns are specific to this document and will differ from dataset to dataset.
def clean_summary(summary):
  summary = re.sub(r'\b(glaxosmithkline|vaers|wwwvaershhsgov|dosage and administration|contact.*?vaers)\b', '', summary, flags=re.IGNORECASE)
  summary = re.sub(r'\s+', ' ', summary).strip()
  return summary

final_summary = clean_summary(final_summary)
print("\nFinal Concise Summary:", final_summary)

## Model Evaluation

In [None]:
# Reference summary for ROUGE evaluation
reference_summary = """
ENGERIX-B is a vaccine indicated for immunization against infection caused by all known subtypes of hepatitis B virus.
It is administered intramuscularly in a three-dose schedule for most individuals and a four-dose schedule for adults on hemodialysis.
Available as a sterile suspension in prefilled syringes and vials.
Contraindicated in individuals with severe allergic reactions to any hepatitis B vaccine component.
Common adverse reactions include injection-site soreness (22%) and fatigue (14%).
Fainting can occur after administration, and precautions should be taken.
Infants with low birth weight and premature infants require special considerations.
Should not be mixed with other vaccines in the same syringe.
Lower antibody responses observed in individuals over 60 years of age.
"""

In [None]:
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference_summary, final_summary)
print("\nROUGE scores:", scores)
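
To make the numbers above less of a black box, here is a simplified sketch of how ROUGE-1 precision, recall, and F1 come out of unigram overlap. The `simple_rouge1` function is written for illustration. It splits on whitespace and does no stemming, so its values will not exactly match those from `rouge_score`.

```python
from collections import Counter

def simple_rouge1(reference, candidate):
    """Return (precision, recall, f1) based on overlapping unigram counts."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

p, r, f = simple_rouge1("the cat sat on the mat", "the cat sat")
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Precision asks how much of the generated summary appears in the reference, and recall asks how much of the reference is covered by the generated summary. A short summary can therefore score high precision but low recall.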

In [None]:
# Visualizing ROUGE scores using Plotly
df = pd.DataFrame({
    "Metric": ["Precision", "Recall", "F1-Score"],
    "ROUGE-1": [scores["rouge1"].precision, scores["rouge1"].recall, scores["rouge1"].fmeasure],
    "ROUGE-2": [scores["rouge2"].precision, scores["rouge2"].recall, scores["rouge2"].fmeasure],
    "ROUGE-L": [scores["rougeL"].precision, scores["rougeL"].recall, scores["rougeL"].fmeasure],
})

fig = px.bar(df, x="Metric", y=["ROUGE-1", "ROUGE-2", "ROUGE-L"],
             barmode="group", title="ROUGE Score Comparison",
             labels={"value": "Score", "variable": "ROUGE Metric"})
fig.show()