<a href="https://colab.research.google.com/github/camillan/llm-learning/blob/main/summarization_of_microplastics_articles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%pip install transformers beautifulsoup4 requests biopython




In [2]:
import requests
from bs4 import BeautifulSoup

def get_article_text(url):
    """Fetch a web page and return the text of its <p> tags as one string."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Treat HTTP 4xx/5xx responses as errors
        soup = BeautifulSoup(response.content, "html.parser")
        # Combine all paragraph tags into one string
        paragraphs = [p.get_text() for p in soup.find_all("p")]
        return " ".join(paragraphs)
    except requests.RequestException as e:
        print(f"❌ Error fetching {url}: {e}")
        return ""
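
Joining every `<p>` tag as `get_article_text` does can pull in stray whitespace and short fragments such as button labels. The sketch below shows one possible cleanup step. The helper name `clean_paragraphs`, the `min_words` threshold, and the sample strings are illustrative assumptions and are not part of the notebook.

```python
import re

def clean_paragraphs(paragraphs, min_words=5):
    """Normalize whitespace and drop very short fragments (hypothetical helper)."""
    cleaned = []
    for p in paragraphs:
        # Collapse runs of whitespace (newlines, tabs, repeated spaces) into one space
        text = re.sub(r"\s+", " ", p).strip()
        # Skip fragments with fewer than min_words words; these are often page residue
        if len(text.split()) >= min_words:
            cleaned.append(text)
    return " ".join(cleaned)

# Made-up paragraph strings standing in for the output of p.get_text()
sample = [
    "  Microplastics are\n plastic particles smaller than 5 mm.  ",
    "Share",
    "",
    "They have been detected\tin water, soil, and air samples.",
]
print(clean_paragraphs(sample))
```

A word-count threshold is a blunt filter and may drop legitimate short paragraphs, so the cutoff would need tuning per site.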


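The next cell caps each input at 4000 characters inside `summarize_text`, which silently discards anything beyond that point. One alternative, sketched here under the assumption that splitting on whitespace is acceptable, is to break long text into several pieces and summarize each one. The helper name `chunk_text` is hypothetical and does not appear in the notebook.

```python
def chunk_text(text, max_chars=4000):
    """Split text into whitespace-delimited chunks of at most max_chars (hypothetical helper).

    A single word longer than max_chars is kept whole as its own oversized chunk.
    """
    chunks = []
    current = ""
    for word in text.split():
        # Try to append the next word to the chunk being built
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The chunk is full: store it and start a new one with this word
            if current:
                chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks

demo = "one two three four five six"
print(chunk_text(demo, max_chars=13))  # ['one two three', 'four five six']
```

Each chunk could then be passed to `summarize_text` and the partial summaries joined, at the cost of one model call per chunk.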
In [6]:
from transformers import pipeline
from Bio import Entrez

# ===== SETUP =====
Entrez.email = "cjn250@gmail.com"  # Required by NCBI
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# ===== FUNCTIONS =====

def fetch_pubmed_abstract(pmid):
    """Fetch abstract text from PubMed using a PMID (e.g., '22375028')."""
    try:
        handle = Entrez.efetch(db="pubmed", id=pmid, rettype="abstract", retmode="text")
        try:
            return handle.read()
        finally:
            handle.close()  # Release the network connection once the text is read
    except Exception as e:
        print(f"❌ Error fetching {pmid}: {e}")
        return ""

def summarize_text(text, max_len=100, min_len=30):
    """Summarize a given block of text using a pre-trained model."""
    if not text.strip():
        return ""
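    text = text[:4000]  # Rough character cap; the model's input limit is in tokens, not characters
    try:
        # truncation=True lets the tokenizer cut any input that still exceeds the token limit
        summary = summarizer(text, max_length=max_len, min_length=min_len, do_sample=False, truncation=True)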
        return summary[0]['summary_text']
    except Exception as e:
        print(f"❌ Error summarizing: {e}")
        return ""

def summarize_pubmed_articles(pmids):
    """Summarize multiple PubMed articles and return both individual and combined summaries."""
    all_summaries = []

    for pmid in pmids:
        print(f"🧬 Fetching and summarizing PubMed ID {pmid}")
        abstract = fetch_pubmed_abstract(pmid)
        summary = summarize_text(abstract)
        if summary:
            all_summaries.append(f"From PMID {pmid}:\n{summary}\n")

    combined_text = " ".join(all_summaries)
    print("\n🧠 Generating meta-summary of all articles...")
    final_summary = summarize_text(combined_text, max_len=250, min_len=80)

    return final_summary, all_summaries

# ===== INPUT: PMIDs instead of PMCIDs =====
pmids = [
    "32193409",
    "38226412",
    "38142809",
    "39669275",
    "38967482",
    "32513186"
]

# ===== RUN =====
meta_summary, summaries = summarize_pubmed_articles(pmids)

# ===== OUTPUT =====
print("\n📄 INDIVIDUAL SUMMARIES:")
for s in summaries:
    print(s)

print("\n🧾 FINAL META-SUMMARY:")
print(meta_summary)


Device set to use cuda:0


🧬 Fetching and summarizing PubMed ID 32193409
🧬 Fetching and summarizing PubMed ID 38226412
🧬 Fetching and summarizing PubMed ID 38142809
🧬 Fetching and summarizing PubMed ID 39669275
🧬 Fetching and summarizing PubMed ID 38967482
🧬 Fetching and summarizing PubMed ID 32513186

🧠 Generating meta-summary of all articles...

📄 INDIVIDUAL SUMMARIES:
From PMID 32193409:
Microplastics generated when opening plastic packaging. Millions of tonnes of plastics have been released into the environment. Although the risk of plastics to humans is not yet resolved, microplastics have entered our bodies.

From PMID 38226412:
Microplastics (MPs) and nanoplastics (NPs) have become a growing concern in dermatology. The study delves into their capacity to breach the cutaneous barrier. Evidence suggests that MPs and NPs may indeed incite cutaneous alterations, provoke inflammatory responses, and disturb the homeostasis of the skin.

From PMID 38142809:
Microplastic pollution has emerged as a new environment