# Information Extraction & Summarization

**Objective:** Extract meaningful entities or features from text and generate document summaries.


## Dataset loading

In [2]:
import pandas as pd

sample_df = pd.read_csv('sample_df.csv')
display(sample_df.head())

| Unnamed: 0 | title | abstract | processed_abstract |
|---|---|---|---|
| 0 | SVD Perspectives for Augmenting DeepONet Flexi... | Deep operator networks (DeepONets) are power... | deep operator network deeponet powerful archit... |
| 1 | Towards robust audio spoofing detection: a det... | Automatic speaker verification, like every o... | automatic speaker verification like every biom... |
| 2 | Guided Random Forest in the RRF Package | Random Forest (RF) is a powerful supervised ... | random forest rf powerful supervise learner po... |
| 3 | Best Arm Identification in Generalized Linear ... | Motivated by drug design, we consider the be... | motivated drug design consider good arm identi... |
| 4 | Conditional Affordance Learning for Driving in... | Most existing approaches to autonomous drivi... | exist approach autonomous driving fall one two... |


## Entity & Information Extraction

In [4]:
!pip install spacy --quiet
!python -m spacy download en_core_web_sm --quiet

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [5]:
import re
import pandas as pd
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def regex_extraction(text):
    """
    Extracts:
    - Dates in formats like 2023-05-10, 05/10/2023, May 2023
    - Numbers/metrics (including decimals)
    - Email addresses
    """
    # Date patterns
    date_pattern = r"\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4}|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{4})\b"

    # Numeric/metric patterns
    number_pattern = r"\b\d+(?:\.\d+)?\b"

    # Email pattern
    email_pattern = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"

    dates = re.findall(date_pattern, text)
    numbers = re.findall(number_pattern, text)
    emails = re.findall(email_pattern, text)

    return {
        "dates": dates,
        "numbers": numbers,
        "emails": emails
    }

def spacy_ner_extraction(texts, batch_size=20):
    """
    Runs spaCy NER on a list of texts and returns entity label counts.
    """
    entity_counter = Counter()
    entity_examples = {}

    for doc in nlp.pipe(texts, batch_size=batch_size):
        for ent in doc.ents:
            entity_counter[ent.label_] += 1
            # Store a small sample of examples per entity type
            if ent.label_ not in entity_examples:
                entity_examples[ent.label_] = set()
            if len(entity_examples[ent.label_]) < 5:
                entity_examples[ent.label_].add(ent.text)

    # Convert examples sets to lists
    entity_examples = {k: list(v) for k, v in entity_examples.items()}

    return entity_counter, entity_examples
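
To see what the three patterns in `regex_extraction` pick up, it helps to run them on a short piece of raw text. The following is a minimal sketch, not part of the original notebook: the patterns are copied verbatim from the function, while `demo_text` and `demo_result` are hypothetical names and the sentence is made up for illustration.

```python
import re

# Patterns copied verbatim from regex_extraction so this sketch runs on its own.
DATE_PATTERN = r"\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4}|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{4})\b"
NUMBER_PATTERN = r"\b\d+(?:\.\d+)?\b"
EMAIL_PATTERN = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"

# Hypothetical raw-text example; it is not taken from sample_df.
demo_text = (
    "Submitted 2023-05-10, revised May 2023. "
    "Accuracy rose to 0.93. Contact: jane.doe@example.org"
)

# Each pattern is applied independently to the same text.
demo_result = {
    "dates": re.findall(DATE_PATTERN, demo_text),
    "numbers": re.findall(NUMBER_PATTERN, demo_text),
    "emails": re.findall(EMAIL_PATTERN, demo_text),
}
print(demo_result)
```

On this input the dates are `['2023-05-10', 'May 2023']`, the emails are `['jane.doe@example.org']`, and the numbers are `['2023', '05', '10', '2023', '0.93']`. Because the patterns run independently, the components of each date are also reported as numbers. The email pattern's final character class includes `.`, so an address followed directly by a full stop would be captured with the trailing dot.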

In [6]:
sample = sample_df.sample(n=50, random_state=42)
# Note: processed_abstract is lowercased and lemmatized. spaCy's NER relies on
# casing and natural word order, so labels on this text are noisy (see the
# entity examples in the output).
sample_texts = sample["processed_abstract"].tolist()

# Rule-based extraction on the first sample text
print("🔹 Rule-Based Extraction Example")
print(regex_extraction(sample_texts[0]))

# NER extraction on all sample texts
entity_counts, entity_samples = spacy_ner_extraction(sample_texts)
print("\n🔹 Entity Counts")
print(entity_counts)
print("\n🔹 Entity Examples")
print(entity_samples)

🔹 Rule-Based Extraction Example
{'dates': [], 'numbers': ['1', '2', '3'], 'emails': []}

🔹 Entity Counts
Counter({'CARDINAL': 77, 'ORG': 24, 'ORDINAL': 19, 'DATE': 17, 'PERSON': 11, 'GPE': 4, 'NORP': 4, 'TIME': 1, 'PERCENT': 1})

🔹 Entity Examples
{'DATE': ['vector', 'come year', 'come decade', 'covid-19', 'today'], 'CARDINAL': ['3', '2', '1', 'three', 'one'], 'ORDINAL': ['second', 'third', 'first'], 'ORG': ['cnn', 'linear', 'mt simon sandstone', 'algorithm estimate comfort level function', 'modifie'], 'TIME': ['overnight'], 'GPE': ['automaton', 'santa barbara county', 'california', 'cleveland'], 'PERSON': ['ipn achieve state', 'gan generate', 'max', 'gan fbgan', 'baye algorithm'], 'PERCENT': ['2d 3d'], 'NORP': ['aviris', 'boolean', 'pfaffian', 'english']}
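
The raw label counts are easier to interpret as shares of the total. The sketch below is an illustration only: the counts are copied by hand from the printed `Counter` above, and `label_counts` and `total` are names introduced here, not variables from the notebook.

```python
from collections import Counter

# Counts copied by hand from the "Entity Counts" output above.
label_counts = Counter({
    "CARDINAL": 77, "ORG": 24, "ORDINAL": 19, "DATE": 17, "PERSON": 11,
    "GPE": 4, "NORP": 4, "TIME": 1, "PERCENT": 1,
})

total = sum(label_counts.values())

# Print each label with its count and its share of all detected entities.
for label, count in label_counts.most_common():
    print(f"{label:<9} {count:>3}  {count / total:6.1%}")
```

With these counts, `CARDINAL` accounts for 77 of 158 entities (about 48.7%), and `CARDINAL` plus `ORDINAL` together make up about 60.8%. Judging by the printed examples, many of the remaining labels (for instance `'vector'` as `DATE` or `'gan generate'` as `PERSON`) look unreliable on this preprocessed text.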


## Summarization

In [12]:
from transformers import pipeline

# 1. Load pre-trained BART summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def abstractive_summarization(texts, max_length=130, min_length=50):
    """
    Summarizes a list of texts using BART.
    Parameters:
        texts (list[str]): List of documents to summarize
        max_length (int): Maximum number of tokens in the generated summary
        min_length (int): Minimum number of tokens in the generated summary
    Returns:
        list[str]: List of summaries
    """
    summaries = []
    for text in texts:
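        # Truncate to the first 1024 characters as a rough length guard.
        # BART's 1024 limit is counted in tokens, not characters, so this cut is
        # usually stricter than the model requires and drops the end of longer abstracts.
        input_text = text[:1024]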
        summary = summarizer(input_text, max_length=max_length, min_length=min_length, do_sample=False)
        summaries.append(summary[0]['summary_text'])
    return summaries


sample_texts = sample_df["abstract"].sample(n=3, random_state=42).tolist()

summaries = abstractive_summarization(sample_texts)

for i, (orig, summ) in enumerate(zip(sample_texts, summaries), start=1):
    print(f"\n🔹 Example {i}")
    print(f"Original Text ({len(orig)} characters):\n", orig[:500], "...")
    print(f"\nGenerated Summary ({len(summ)} characters):\n", summ)


Device set to use cpu



🔹 Example 1
Original Text (1496 characters):
   Tree-based machine learning models such as random forests, decision trees,
and gradient boosted trees are the most popular non-linear predictive models
used in practice today, yet comparatively little attention has been paid to
explaining their predictions. Here we significantly improve the
interpretability of tree-based models through three main contributions: 1) The
first polynomial time algorithm to compute optimal explanations based on game
theory. 2) A new type of explanation that directl ...

Generated Summary (376 characters):
 Tree-based machine learning models are the most popular non-linear predictive models used in practice today. Here we significantly improve theinterpretability of tree-based models through three main contributions. We show how combining many high-quality local explanations allows us to represent global model structure while retaining local faithfulness to the original model.

🔹 Example 2
Original Text (1036 