# Deep NLP Project - Phase 1

## Prompt-Based Abstractive Summarization with Semantic Coverage Control

### Phase 1: Data Loading & Semantic Extraction Pipeline

This notebook implements:
1. Dataset loading and exploration (CNN/DailyMail)
2. Ground truth coverage analysis
3. SigExt-based phrase extraction
4. Semantic grouping (WHO, WHAT, WHEN, WHERE, NUMERIC)
5. Improved WHAT extraction
6. Extraction statistics and gap analysis

In [1]:
# -*- coding: utf-8 -*-
"""
Deep NLP Project - Phase 1: Data Loading & Semantic Extraction Pipeline

This notebook implements:
1. Dataset loading and exploration (CNN/DailyMail)
2. Ground truth coverage analysis
3. SigExt-based phrase extraction
4. Semantic grouping (WHO, WHAT, WHEN, WHERE, NUMERIC)
5. Improved WHAT extraction with better verb phrase capture
6. Extraction statistics and gap analysis
"""



## SETUP & DEPENDENCIES

In [2]:
!pip install -q datasets transformers spacy scikit-learn rouge-score tqdm
!python -m spacy download en_core_web_sm -q

import os
import json
import re
import statistics
from collections import defaultdict
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

os.makedirs('/content/data', exist_ok=True)


  Preparing metadata (setup.py) ... done
  Building wheel for rouge-score (setup.py) ... done
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


## CONFIGURATION

In [3]:
SUBSET_SIZE = 200  # Number of samples to process

print("✅ Setup complete!")
print(f"   SUBSET_SIZE = {SUBSET_SIZE}")


✅ Setup complete!
   SUBSET_SIZE = 200


## PHASE 1.1: LOAD DATASET

In [4]:
from datasets import load_dataset

print("Loading CNN/DailyMail dataset...")
dataset = load_dataset("cnn_dailymail", "3.0.0")

samples = []
for ex in dataset['validation'].select(range(SUBSET_SIZE)):
    samples.append({
        'id': ex['id'],
        'article': ex['article'],
        'highlights': ex['highlights']
    })

with open('/content/data/validation_samples.json', 'w') as f:
    json.dump(samples, f)

print(f"✅ Loaded {len(samples)} samples")

# Dataset statistics
article_lengths = [len(s['article']) for s in samples]
highlight_lengths = [len(s['highlights']) for s in samples]

print(f"\n📊 Dataset Statistics:")
print(f"   Articles:   avg {statistics.mean(article_lengths):.0f} chars, "
      f"min {min(article_lengths)}, max {max(article_lengths)}")
print(f"   Highlights: avg {statistics.mean(highlight_lengths):.0f} chars, "
      f"min {min(highlight_lengths)}, max {max(highlight_lengths)}")


Loading CNN/DailyMail dataset...



✅ Loaded 200 samples

📊 Dataset Statistics:
   Articles:   avg 3346 chars, min 534, max 9259
   Highlights: avg 193 chars, min 77, max 339


## PHASE 1.2: GROUND TRUTH COVERAGE ANALYSIS

In [5]:
print("\n" + "="*60)
print("GROUND TRUTH COVERAGE ANALYSIS")
print("="*60)

# Define patterns for each semantic category
PATTERNS = {
    'who': [
        re.compile(r'\b[A-Z][a-z]+\s+[A-Z][a-z]+\b'),  # Proper names
        re.compile(r'\b(president|ceo|minister|police|officials|doctor|judge)\b', re.I)
    ],
    'what': [
        re.compile(r'\b(said|announced|reported|killed|arrested|won|lost|died)\b', re.I),
        re.compile(r'\b(launched|signed|passed|approved|released|claimed)\b', re.I)
    ],
    'when': [
        re.compile(r'\b(monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b', re.I),
        re.compile(r'\b\d{4}\b'),  # Years
        re.compile(r'\b(yesterday|today|last|next)\s+\w+', re.I)
    ],
    'where': [
        re.compile(r'\bin\s+[A-Z][a-z]+'),  # "in Location"
        re.compile(r'\b(city|country|hospital|court|school|building)\b', re.I)
    ],
    'numeric': [
        re.compile(r'\$[\d,]+'),  # Money
        re.compile(r'\b\d+%'),    # Percentages
        re.compile(r'\b\d{2,}\b') # Numbers with 2+ digits
    ]
}

def check_coverage(text):
    """Check which semantic categories are present in text."""
    return {cat: any(p.search(text) for p in patterns)
            for cat, patterns in PATTERNS.items()}

# Analyze ground truth summaries
coverage_counts = defaultdict(int)
for sample in samples:
    coverage = check_coverage(sample['highlights'])
    for cat, present in coverage.items():
        if present:
            coverage_counts[cat] += 1

print("\nCategory presence in REFERENCE summaries:\n")
gt_analysis = {}
for cat in ['who', 'what', 'when', 'where', 'numeric']:
    pct = coverage_counts[cat] / len(samples) * 100
    gt_analysis[cat] = pct
    bar = '█' * int(pct / 2) + '░' * (50 - int(pct / 2))
    print(f"  {cat.upper():<8} {bar} {pct:.1f}%")

with open('/content/data/ground_truth_analysis.json', 'w') as f:
    json.dump(gt_analysis, f, indent=2)

print("\n✅ Ground truth analysis saved")



GROUND TRUTH COVERAGE ANALYSIS

Category presence in REFERENCE summaries:

  WHO      ███████████████████████████████████████░░░░░░░░░░░ 79.0%
  WHAT     ██████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 20.0%
  WHEN     █████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 27.5%
  WHERE    █████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 26.0%
  NUMERIC  ███████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 38.5%

✅ Ground truth analysis saved
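
The check above is a pure regex test, so a category counts as present only when one of its listed patterns matches. A minimal sketch of that behaviour on two made-up sentences (not from the dataset), using a reduced copy of the `who` and `what` patterns; `DEMO_PATTERNS` and `demo_coverage` are illustrative names, not part of the notebook:

```python
import re

# Reduced copy of the notebook's patterns (only 'who' and 'what' kept for brevity).
DEMO_PATTERNS = {
    "who": [re.compile(r"\b[A-Z][a-z]+\s+[A-Z][a-z]+\b")],
    "what": [re.compile(r"\b(said|announced|reported|killed|arrested|won|lost|died)\b", re.I)],
}

def demo_coverage(text):
    """Return {category: bool} for the reduced pattern set."""
    return {cat: any(p.search(text) for p in pats) for cat, pats in DEMO_PATTERNS.items()}

# Hypothetical sentences, chosen only to illustrate a hit and a miss.
hit = demo_coverage("Jane Doe announced a new policy.")
miss = demo_coverage("Jane Doe donates a kidney to a stranger.")
print(hit)
print(miss)
```

The second sentence clearly describes an action, yet `what` is reported as absent because "donates" is not in the closed verb list. This may be part of why WHAT shows only 20.0% presence in the reference summaries.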


## PHASE 1.3: PHRASE EXTRACTION (SigExt Baseline)

In [6]:
print("\n" + "="*60)
print("PHRASE EXTRACTION (SigExt)")
print("="*60)

import spacy
nlp = spacy.load('en_core_web_sm')

# Entity types to extract
ENTITY_TYPES = {
    'PERSON', 'ORG', 'GPE', 'LOC', 'DATE', 'TIME',
    'MONEY', 'PERCENT', 'CARDINAL', 'NORP', 'EVENT'
}

def extract_phrases(text, doc_id):
    """Extract significant phrases using spaCy NER and dependency parsing."""
    doc = nlp(text[:10000])  # Limit for efficiency
    phrases, seen = [], set()

    # 1. Named Entity Recognition
    for ent in doc.ents:
        if ent.label_ in ENTITY_TYPES and ent.text.lower() not in seen:
            seen.add(ent.text.lower())
            phrases.append({
                'text': ent.text.strip(),
                'type': 'entity',
                'entity_label': ent.label_
            })

    # 2. Noun Chunks (multi-word expressions)
    for chunk in doc.noun_chunks:
        if len(chunk.text.split()) >= 2 and chunk.text.lower() not in seen:
            seen.add(chunk.text.lower())
            phrases.append({
                'text': chunk.text.strip(),
                'type': 'noun_phrase',
                'entity_label': ''
            })

    # 3. Verb Phrases (ROOT verb + direct object)
    for token in doc:
        if token.pos_ == 'VERB' and token.dep_ == 'ROOT':
            for child in token.children:
                if child.dep_ == 'dobj':
                    vp = f"{token.lemma_} {child.text}"
                    if vp.lower() not in seen:
                        seen.add(vp.lower())
                        phrases.append({
                            'text': vp,
                            'type': 'verb_phrase',
                            'entity_label': ''
                        })

    return {'doc_id': doc_id, 'phrases': phrases[:30]}

print("\nExtracting phrases from articles...")
extracted = [extract_phrases(s['article'], s['id']) for s in tqdm(samples)]

with open('/content/data/extracted_phrases.json', 'w') as f:
    json.dump(extracted, f)

# Statistics
total_phrases = sum(len(e['phrases']) for e in extracted)
avg_phrases = total_phrases / len(extracted)
print(f"\n✅ Extracted {total_phrases} phrases from {len(extracted)} documents")
print(f"   Average: {avg_phrases:.1f} phrases/document")



PHRASE EXTRACTION (SigExt)

Extracting phrases from articles...


100%|██████████| 200/200 [00:25<00:00,  7.71it/s]


✅ Extracted 5959 phrases from 200 documents
   Average: 29.8 phrases/document
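
An average of 29.8 phrases per document suggests that nearly every article hits the `phrases[:30]` cap. Because `extract_phrases` appends entities first, then noun chunks, then verb phrases, the cap tends to discard verb phrases. The sketch below illustrates that ordering effect with hypothetical counts; `capped_type_counts` is an illustrative helper, not part of the notebook.

```python
def capped_type_counts(n_entities, n_noun_chunks, n_verb_phrases, cap=30):
    """Count how many phrases of each type survive a head-of-list cap."""
    ordered = (
        ["entity"] * n_entities
        + ["noun_phrase"] * n_noun_chunks
        + ["verb_phrase"] * n_verb_phrases
    )
    kept = ordered[:cap]  # same truncation style as phrases[:30]
    return {t: kept.count(t) for t in ("entity", "noun_phrase", "verb_phrase")}

# Hypothetical per-article counts, chosen only for illustration.
counts = capped_type_counts(n_entities=25, n_noun_chunks=40, n_verb_phrases=12)
print(counts)
```

With these assumed counts no verb phrase survives, even though twelve were found. A per-type cap or score-based selection would avoid this, at the cost of changing the baseline.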





In [7]:
# show one extracted example (structure)
print(type(extracted), len(extracted))
print(extracted[0].keys())
print("doc_id:", extracted[0]["doc_id"])
print("num phrases:", len(extracted[0]["phrases"]))
print("first phrase item:", extracted[0]["phrases"][0])


<class 'list'> 200
dict_keys(['doc_id', 'phrases'])
doc_id: a4942dd663020ca54575471657a0af38d82897d6
num phrases: 30
first phrase item: {'text': 'Zully Broussard', 'type': 'entity', 'entity_label': 'PERSON'}


## GIVING SCORES TO PHRASES

In [8]:
!pip -q install sentence-transformers scikit-learn rouge-score

import re
import numpy as np
from tqdm import tqdm
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# map doc_id -> highlights
highlights_map = {s["id"]: s["highlights"] for s in samples}

# sentence split for highlights
def split_to_sents(text):
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

rouge = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def phrase_salience_rouge(phrase_text, highlights):
    hs = split_to_sents(highlights)
    best = 0.0
    for h in hs:
        score = rouge.score(h, phrase_text)["rouge1"].fmeasure
        if score > best:
            best = score
    return best


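# Build training dataset (ROUGE scores only; binary labels are assigned after the threshold sweep)

THRESH = 0.22          # provisional value; overridden once the score distribution is inspected
MAX_PHRASES_PER_DOC = 30

X_texts, y = [], []
meta = []  # per-phrase records (doc_id, rouge_score, phrase fields) used for labeling and debugging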

print("Building ROUGE-labeled phrase training set...")
for doc in tqdm(extracted):
    doc_id = doc["doc_id"]
    hl = highlights_map.get(doc_id)
    if hl is None:
        continue

    for p in doc["phrases"][:MAX_PHRASES_PER_DOC]:
        txt = p["text"]
        s = phrase_salience_rouge(txt, hl)
        X_texts.append(txt)

        meta.append({"doc_id": doc_id, "rouge_score": s, **p})



Building ROUGE-labeled phrase training set...


100%|██████████| 200/200 [00:04<00:00, 49.96it/s]
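
For intuition about the salience score, here is a simplified sketch of a ROUGE-1-style F1 computed from clipped unigram overlap. It assumes lowercased whitespace tokens and no stemming, so its numbers can differ from `rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)`; `unigram_f1` is an illustrative name.

```python
from collections import Counter

def unigram_f1(reference, candidate):
    """ROUGE-1-style F1 from clipped unigram overlap (no stemming)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference sentence and phrase.
score = unigram_f1("the mayor opened the new bridge", "new bridge")
print(round(score, 3))
```

Both phrase tokens occur in the six-token sentence, so precision is 1 and recall is 2/6. Short phrases are therefore limited mainly by recall, which is why the notebook takes the best score over individual highlight sentences.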


## FINDING THRESHOLD

In [9]:
import numpy as np

scores = np.array([m["rouge_score"] for m in meta])

print("ROUGE score distribution check:")
for t in [0.15, 0.20, 0.25, 0.30, 0.35]:
    pos_ratio = (scores >= t).mean()
    print(f"THRESH = {t:.2f} → positive ratio = {pos_ratio:.3f}")


ROUGE score distribution check:
THRESH = 0.15 → positive ratio = 0.190
THRESH = 0.20 → positive ratio = 0.115
THRESH = 0.25 → positive ratio = 0.083
THRESH = 0.30 → positive ratio = 0.043
THRESH = 0.35 → positive ratio = 0.026
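
Instead of scanning a fixed grid, one could pick the threshold that yields roughly a desired positive ratio. The sketch below is one possible way to do that, not what the notebook does; `threshold_for_ratio` and the demo scores are hypothetical.

```python
def threshold_for_ratio(scores, target_ratio):
    """Return the k-th highest score, where k is about target_ratio * len(scores)."""
    ranked = sorted(scores, reverse=True)
    k = max(1, round(target_ratio * len(ranked)))
    return ranked[k - 1]

# Made-up scores for illustration only.
demo_scores = [0.0, 0.0, 0.05, 0.1, 0.1, 0.2, 0.25, 0.3, 0.4, 0.5]
t = threshold_for_ratio(demo_scores, 0.3)
positive_ratio = sum(s >= t for s in demo_scores) / len(demo_scores)
print(t, positive_ratio)
```

When many phrases share the same score, the realised ratio can exceed the target, so the result is worth checking the way the sweep above does.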


In [10]:
THRESH = 0.20
y = []

for m in meta:
    y.append(1 if m["rouge_score"] >= THRESH else 0)

y = np.array(y)


In [11]:


X_texts = np.array(X_texts)


print("Total phrase examples:", len(X_texts))
print("Positive ratio:", y.mean())


# Encode phrases + train classifier

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

print("Encoding phrases...")
X_emb = encoder.encode(
    X_texts.tolist(),
    batch_size=64,
    show_progress_bar=True,
    normalize_embeddings=True
)

X_train, X_test, y_train, y_test = train_test_split(
    X_emb, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=2000, class_weight="balanced")
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print(classification_report(y_test, pred, digits=4))


# Inference: score a list of phrase dicts with the trained classifier

def score_phrase_list(phrase_dicts):
    texts = [p["text"] for p in phrase_dicts]
    emb = encoder.encode(texts, batch_size=64, show_progress_bar=False, normalize_embeddings=True)
    probs = clf.predict_proba(emb)[:, 1]
    out = []
    for p, s in zip(phrase_dicts, probs):
        out.append({**p, "salience_score": float(s)})
    return out




Total phrase examples: 5959
Positive ratio: 0.11528779996643732


Encoding phrases...



              precision    recall  f1-score   support

           0     0.9469    0.7100    0.8115      1055
           1     0.2369    0.6934    0.3532       137

    accuracy                         0.7081      1192
   macro avg     0.5919    0.7017    0.5823      1192
weighted avg     0.8653    0.7081    0.7588      1192
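
The report can be read more easily as raw counts. The confusion counts below are not printed by the notebook; they are inferred from the supports and recalls above (for example, 0.6934 × 137 ≈ 95), so treat them as a reconstruction.

```python
# Counts inferred from the classification report (support x recall), not logged by the notebook.
tp, fn = 95, 42      # class 1: 137 positive phrases in the test split
tn, fp = 749, 306    # class 0: 1055 negative phrases in the test split

precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tp + fn + tn + fp)

print(f"precision={precision_1:.4f} recall={recall_1:.4f} "
      f"f1={f1_1:.4f} accuracy={accuracy:.4f}")
```

Under this reconstruction, about three of every four phrases flagged as salient are false positives (306 of 401). That is the cost of `class_weight="balanced"` on a set with roughly 11.5% positives.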



## PHASE 1.4: SEMANTIC GROUPING

In [12]:
print("\n" + "="*60)
print("SEMANTIC GROUPING")
print("="*60)

# Map entity labels to semantic categories
CAT_MAP = {
    'PERSON': 'who', 'ORG': 'who', 'NORP': 'who',
    'GPE': 'where', 'LOC': 'where', 'FAC': 'where',
    'DATE': 'when', 'TIME': 'when',
    'MONEY': 'numeric', 'PERCENT': 'numeric', 'CARDINAL': 'numeric',
    'EVENT': 'what'
}

def group_phrases(doc):
    """Group extracted phrases into semantic categories."""
    grouped = {
        'doc_id': doc['doc_id'],
        'who': [], 'what': [], 'when': [],
        'where': [], 'numeric': [], 'other': []
    }

    for p in doc['phrases']:
        label = p.get('entity_label', '')
        # Map to category based on entity label or phrase type
        if label in CAT_MAP:
            cat = CAT_MAP[label]
        elif p['type'] == 'verb_phrase':
            cat = 'what'
        else:
            cat = 'other'

        grouped[cat].append({
            'text': p['text'],
            'confidence': 0.85
        })

    return grouped

print("\nGrouping phrases into semantic categories...")
grouped_data = [group_phrases(doc) for doc in tqdm(extracted, desc="Grouping")]

with open('/content/data/grouped_phrases.json', 'w') as f:
    json.dump(grouped_data, f)

grouped_map = {g['doc_id']: g for g in grouped_data}
print(f"✅ Grouped {len(grouped_data)} documents")



SEMANTIC GROUPING

Grouping phrases into semantic categories...


Grouping: 100%|██████████| 200/200 [00:00<00:00, 12088.90it/s]

✅ Grouped 200 documents
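
One detail worth checking is whether every label in `CAT_MAP` can actually reach `group_phrases`. The baseline extractor keeps only entities whose label is in `ENTITY_TYPES`, so a quick set difference shows which mapped labels are filtered out beforehand. The two `_DEMO` names are copies made for this sketch.

```python
# Copies of the notebook's two label collections, duplicated here so the check is self-contained.
ENTITY_TYPES_DEMO = {
    "PERSON", "ORG", "GPE", "LOC", "DATE", "TIME",
    "MONEY", "PERCENT", "CARDINAL", "NORP", "EVENT",
}
CAT_MAP_DEMO = {
    "PERSON": "who", "ORG": "who", "NORP": "who",
    "GPE": "where", "LOC": "where", "FAC": "where",
    "DATE": "when", "TIME": "when",
    "MONEY": "numeric", "PERCENT": "numeric", "CARDINAL": "numeric",
    "EVENT": "what",
}

# Labels that are mapped to a category but never extracted by the baseline.
unreachable = set(CAT_MAP_DEMO) - ENTITY_TYPES_DEMO
print(unreachable)
```

In the baseline, `FAC` entities (buildings, airports and similar) therefore never contribute to WHERE unless `FAC` is also added to `ENTITY_TYPES`.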





## PHASE 1.5: IMPROVED WHAT EXTRACTION

In [14]:
print("\n" + "="*60)
print("IMPROVED WHAT EXTRACTION")
print("="*60)
from rouge_score import rouge_scorer
rouge = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def phrase_score_rouge1_f(phrase_text, reference_summary):
    # ROUGE-1 F1 between phrase and summary
    return rouge.score(reference_summary, phrase_text)["rouge1"].fmeasure

def extract_and_group_improved(text, doc_id, highlights):
    """
    Improved extraction with better WHAT (verb/event) capture.
    Addresses the low WHAT extraction rate in baseline SigExt.
    """
    doc = nlp(text[:10000])
    grouped = {
        'doc_id': doc_id,
        'who': [], 'what': [], 'when': [],
        'where': [], 'numeric': [], 'other': []
    }
    seen = set()

    # 1. Named Entity Recognition
    for ent in doc.ents:
        if ent.text.lower() not in seen:
            seen.add(ent.text.lower())
            cat = CAT_MAP.get(ent.label_, 'other')
            t = ent.text.strip()
            grouped[cat].append({'text': t, 'score': phrase_score_rouge1_f(t, highlights)})


    # 2. IMPROVED: Better verb phrase extraction
    # Skip common light verbs that don't carry meaning
    LIGHT_VERBS = {'be', 'have', 'do', 'say', 'get', 'make', 'go', 'know', 'take', 'see'}

    for token in doc:
        if token.pos_ == 'VERB' and token.lemma_ not in LIGHT_VERBS:

            # Method A: Verb + direct object / attribute
            # (in spaCy's English parses 'pobj' normally attaches to the preposition, so it rarely matches here)
            for child in token.children:
                if child.dep_ in ('dobj', 'pobj', 'attr'):
                    vp = f"{token.lemma_} {child.text}"
                    if vp.lower() not in seen and len(vp) > 5:
                        seen.add(vp.lower())
                        grouped['what'].append({'text': vp, 'score': phrase_score_rouge1_f(vp, highlights)})


            # Method B: Verb + Particle (phrasal verbs)
            particles = [c for c in token.children if c.dep_ == 'prt']
            if particles:
                vp = f"{token.lemma_} {particles[0].text}"
                if vp.lower() not in seen:
                    seen.add(vp.lower())
                    grouped['what'].append({'text': vp, 'score': phrase_score_rouge1_f(vp, highlights)})


            # Method C: Passive constructions
            if token.dep_ == 'ROOT' and any(c.dep_ == 'auxpass' for c in token.children):
                vp = token.lemma_
                if vp.lower() not in seen and len(vp) > 3:
                    seen.add(vp.lower())
                    grouped['what'].append({'text': vp, 'score': phrase_score_rouge1_f(vp, highlights)})


    # 3. EVENT-related noun phrases
    EVENT_KEYWORDS = {
        'attack', 'election', 'investigation', 'trial', 'crash', 'shooting',
        'murder', 'death', 'fire', 'explosion', 'protest', 'vote', 'debate',
        'announcement', 'decision', 'agreement', 'deal', 'war', 'conflict'
    }

    for chunk in doc.noun_chunks:
        chunk_lower = chunk.text.lower()
        if any(kw in chunk_lower for kw in EVENT_KEYWORDS):
            if chunk_lower not in seen:
                seen.add(chunk_lower)
                t = chunk.text.strip()
                grouped['what'].append({'text': t, 'score': phrase_score_rouge1_f(t, highlights)})


    # Limit phrases per category
    for cat in grouped:
        if cat != 'doc_id':
            grouped[cat] = grouped[cat][:10]

    return grouped

print("\nRe-extracting with improved WHAT detection...")
grouped_data_improved = [
    extract_and_group_improved(s['article'], s['id'], s['highlights'])
    for s in tqdm(samples)
]
#  CATEGORY-AWARE TOP-K SELECTION

TOPK_BUDGET = {
    "what": 8,
    "who": 6,
    "when": 2,
    "where": 3,
    "numeric": 1,
    "other": 0,
}
FINAL_K = sum(TOPK_BUDGET.values())

def select_topk_per_category(grouped_doc, budgets):
    out = {"doc_id": grouped_doc["doc_id"]}
    for cat in ["who", "what", "when", "where", "numeric", "other"]:
        phrases = grouped_doc.get(cat, [])

        # sort by score high->low (if missing score, treat as very low)
        phrases_sorted = sorted(
            phrases,
            key=lambda p: p.get("score", -1.0),
            reverse=True
        )

        k = budgets.get(cat, 0)
        out[cat] = phrases_sorted[:k]
    return out

grouped_data_improved_topk = [select_topk_per_category(doc, TOPK_BUDGET) for doc in grouped_data_improved]

print(f"✅ Fix2 done: selected up to {FINAL_K} phrases per doc with budgets {TOPK_BUDGET}")
with open('/content/data/grouped_phrases_improved_topk.json', 'w') as f:
    json.dump(grouped_data_improved_topk, f, indent=2)





IMPROVED WHAT EXTRACTION

Re-extracting with improved WHAT detection...


100%|██████████| 200/200 [00:28<00:00,  6.94it/s]

✅ Fix2 done: selected up to 20 phrases per doc with budgets {'what': 8, 'who': 6, 'when': 2, 'where': 3, 'numeric': 1, 'other': 0}
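
Note the order of operations: `extract_and_group_improved` keeps the first 10 phrases per category in extraction order, and only afterwards does `select_topk_per_category` sort by score. The sketch below contrasts that with sorting first, using made-up scores and a smaller cap; both helper names are hypothetical.

```python
def cap_then_sort(phrases, cap, k):
    """Truncate in extraction order, then keep the k best of what is left."""
    return sorted(phrases[:cap], key=lambda p: p["score"], reverse=True)[:k]

def sort_then_cap(phrases, k):
    """Keep the k best phrases over the full candidate list."""
    return sorted(phrases, key=lambda p: p["score"], reverse=True)[:k]

# Made-up candidates: the best phrase appears late in the article.
candidates = [
    {"text": "a", "score": 0.0},
    {"text": "b", "score": 0.1},
    {"text": "c", "score": 0.0},
    {"text": "d", "score": 0.5},
]
early = [p["text"] for p in cap_then_sort(candidates, cap=3, k=2)]
full = [p["text"] for p in sort_then_cap(candidates, k=2)]
print(early, full)
```

With the cap applied first, the highest-scoring phrase is lost because it occurs after the cut. Whether this matters here depends on how often salient phrases appear late in an article.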






In [15]:
# 1) Check one document
doc0 = grouped_data_improved_topk[0]
print(doc0["doc_id"])
for cat in ["what","who","when","where","numeric","other"]:
    print(cat, len(doc0[cat]))

# 2) Confirm max total is 20 for random docs
import random
for _ in range(5):
    d = random.choice(grouped_data_improved_topk)
    total = sum(len(d[c]) for c in ["what","who","when","where","numeric","other"])
    print("total phrases:", total)

# 3) Check that phrases are sorted by score (top one should have >= last one)
d = grouped_data_improved_topk[0]
if len(d["what"]) >= 2:
    print("WHAT top score:", d["what"][0].get("score"))
    print("WHAT last score:", d["what"][-1].get("score"))

# 4) Print a few selected phrases (so you can visually sanity check)
d = grouped_data_improved_topk[0]
print("\nTop WHAT phrases:")
for p in d["what"][:5]:
    print(p["score"], "-", p["text"])

print("\nTop WHO phrases:")
for p in d["who"][:5]:
    print(p["score"], "-", p["text"])


a4942dd663020ca54575471657a0af38d82897d6
what 8
who 6
when 2
where 1
numeric 1
other 0
total phrases: 18
total phrases: 19
total phrases: 18
total phrases: 20
total phrases: 20
WHAT top score: 0.08
WHAT last score: 0.0

Top WHAT phrases:
0.08 - give one
0.08 - receive transplants
0.08 - wow her
0.08 - help person
0.08 - extract kidneys

Top WHO phrases:
0.16 - Zully Broussard
0.08333333333333333 - Broussard
0.0 - CNN
0.0 - KGO
0.0 - California Pacific Medical Center
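
The scores look small because each phrase is compared against the entire reference summary, so recall is bounded by phrase length over summary length. The printed values are consistent with a reference of 23 tokens. That length is inferred from the scores, not read from the data, and is an assumption of this sketch.

```python
def f1_from_counts(matched, phrase_len, summary_len):
    """ROUGE-1 F1 from token counts: matched tokens, phrase length, reference length."""
    if matched == 0:
        return 0.0
    precision = matched / phrase_len
    recall = matched / summary_len
    return 2 * precision * recall / (precision + recall)

SUMMARY_LEN = 23  # assumption: inferred so that the scores printed above are reproduced

full_name = f1_from_counts(matched=2, phrase_len=2, summary_len=SUMMARY_LEN)    # "Zully Broussard"
surname = f1_from_counts(matched=1, phrase_len=1, summary_len=SUMMARY_LEN)      # "Broussard"
half_match = f1_from_counts(matched=1, phrase_len=2, summary_len=SUMMARY_LEN)   # e.g. "receive transplants"
print(round(full_name, 4), round(surname, 4), round(half_match, 4))
```

Under that assumption, even a perfectly matching two-word phrase cannot exceed 0.16. The scores are therefore useful for ranking within a document but are not comparable across documents with different summary lengths.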


## PHASE 1.6: EXTRACTION STATISTICS & GAP ANALYSIS

In [16]:
print("\n" + "="*60)
print("EXTRACTION CATEGORY PRESENCE ANALYSIS")
print("="*60)

categories = ['who', 'what', 'when', 'where', 'numeric']

# Original extraction stats
extraction_stats = {}
for cat in categories:
    docs_with_cat = sum(1 for g in grouped_data if len(g.get(cat, [])) >= 1)
    pct = docs_with_cat / len(grouped_data) * 100
    extraction_stats[cat] = {
        'docs_with_extraction': docs_with_cat,
        'percentage': pct,
        'avg_phrases_per_doc': sum(len(g.get(cat, [])) for g in grouped_data) / len(grouped_data)
    }

print("\n📊 ORIGINAL Extraction (% of docs with ≥1 phrase):\n")
for cat in categories:
    pct = extraction_stats[cat]['percentage']
    avg = extraction_stats[cat]['avg_phrases_per_doc']
    bar = '█' * int(pct / 2) + '░' * (50 - int(pct / 2))
    print(f"  {cat.upper():<8} {bar} {pct:.1f}%  (avg: {avg:.1f}/doc)")

# Improved extraction stats
extraction_stats_improved = {}
for cat in categories:
    docs_with = sum(1 for g in grouped_data_improved if len(g.get(cat, [])) >= 1)
    extraction_stats_improved[cat] = {
        'percentage': docs_with / len(grouped_data_improved) * 100,
        'avg_phrases_per_doc': sum(len(g.get(cat, [])) for g in grouped_data_improved) / len(grouped_data_improved)
    }

print("\n📊 IMPROVED Extraction:\n")
for cat in categories:
    old_pct = extraction_stats[cat]['percentage']
    new_pct = extraction_stats_improved[cat]['percentage']
    change = new_pct - old_pct
    status = "✅" if change > 5 else ("⚠️" if change > 0 else "")
    print(f"  {cat.upper():<8}: {old_pct:.1f}% → {new_pct:.1f}% ({change:+.1f}%) {status}")

# Save stats
with open('/content/data/extraction_stats.json', 'w') as f:
    json.dump(extraction_stats, f, indent=2)
with open('/content/data/extraction_stats_improved.json', 'w') as f:
    json.dump(extraction_stats_improved, f, indent=2)

# Gap Analysis
print("\n" + "-"*60)
print("EXTRACTION GAP ANALYSIS:")
print("-"*60)

for cat in categories:
    pct = extraction_stats_improved[cat]['percentage']
    if pct < 50:
        print(f"  ⚠️  {cat.upper()}: Only {pct:.1f}% coverage - SIGNIFICANT GAP")
    elif pct < 80:
        print(f"  📊 {cat.upper()}: {pct:.1f}% coverage - moderate")
    else:
        print(f"  ✅ {cat.upper()}: {pct:.1f}% coverage - good")

print("\n✅ All extraction statistics saved")



EXTRACTION CATEGORY PRESENCE ANALYSIS

📊 ORIGINAL Extraction (% of docs with ≥1 phrase):

  WHO      █████████████████████████████████████████████████░ 99.5%  (avg: 12.1/doc)
  WHAT     █████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 18.0%  (avg: 0.2/doc)
  WHEN     █████████████████████████████████████████████████░ 98.0%  (avg: 5.9/doc)
  WHERE    ██████████████████████████████████████████████░░░░ 93.0%  (avg: 3.7/doc)
  NUMERIC  ████████████████████████████████████████████░░░░░░ 89.0%  (avg: 3.4/doc)

📊 IMPROVED Extraction:

  WHO     : 99.5% → 99.5% (+0.0%) 
  WHAT    : 18.0% → 100.0% (+82.0%) ✅
  WHEN    : 98.0% → 98.0% (+0.0%) 
  WHERE   : 93.0% → 93.5% (+0.5%) ⚠️
  NUMERIC : 89.0% → 90.0% (+1.0%) ⚠️

------------------------------------------------------------
EXTRACTION GAP ANALYSIS:
------------------------------------------------------------
  ✅ WHO: 99.5% coverage - good
  ✅ WHAT: 100.0% coverage - good
  ✅ WHEN: 98.0% coverage - good
  ✅ WHERE: 93.5% coverage - good
  ✅ NUMERIC: 90.0% coverage - good

✅ All extraction statistics saved
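
The gap analysis above only thresholds the extraction rates. A complementary view is to set them beside the reference-summary presence rates from Phase 1.2. The sketch below copies both sets of percentages from the outputs in this notebook. Since the reference side uses regex patterns and the extraction side uses spaCy, the differences are indicative only.

```python
# Percentages copied from the Phase 1.2 and Phase 1.6 outputs.
reference_pct = {"who": 79.0, "what": 20.0, "when": 27.5, "where": 26.0, "numeric": 38.5}
improved_pct = {"who": 99.5, "what": 100.0, "when": 98.0, "where": 93.5, "numeric": 90.0}

# Positive values mean the extractor finds the category more often than references contain it.
surplus = {cat: round(improved_pct[cat] - reference_pct[cat], 1) for cat in reference_pct}
for cat, diff in surplus.items():
    print(f"{cat.upper():<8} {diff:+.1f} points")
```

Every category is extracted far more often than it appears in the references. The open question for later prompting is therefore likely which phrases to select, not whether any phrase is available.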

## PHASE 1 SUMMARY

In [17]:
print("\n" + "="*60)
print("PHASE 1 COMPLETE - SUMMARY")
print("="*60)

print(f"""
📁 Files Generated:
   • validation_samples.json      - {len(samples)} articles
   • ground_truth_analysis.json   - Reference coverage stats
   • extracted_phrases.json       - SigExt baseline extraction
   • grouped_phrases.json         - Semantic grouping (original)
   • grouped_phrases_improved_topk.json - Top-k phrases per category (improved)
   • extraction_stats.json        - Original extraction rates
   • extraction_stats_improved.json - Improved extraction rates

📊 Key Findings:
   • Dataset: {len(samples)} CNN/DailyMail validation samples
   • WHAT extraction improved: {extraction_stats['what']['percentage']:.1f}% → {extraction_stats_improved['what']['percentage']:.1f}%
   • All categories now have good extraction coverage

🔜 Next Steps (Phase 2):
   • Build coverage-aware prompts
   • Generate summaries with GPT-3.5 and BART
   • Evaluate with ROUGE and beyond-ROUGE metrics
""")

# Download data
!cd /content && zip -r phase1_results.zip data/
from google.colab import files
files.download('/content/phase1_results.zip')

print("✅ Phase 1 data downloaded!")



PHASE 1 COMPLETE - SUMMARY

📁 Files Generated:
   • validation_samples.json      - 200 articles
   • ground_truth_analysis.json   - Reference coverage stats
   • extracted_phrases.json       - SigExt baseline extraction
   • grouped_phrases.json         - Semantic grouping (original)
   • grouped_phrases_improved_topk.json - Top-k phrases per category (improved)
   • extraction_stats.json        - Original extraction rates
   • extraction_stats_improved.json - Improved extraction rates

📊 Key Findings:
   • Dataset: 200 CNN/DailyMail validation samples
   • WHAT extraction improved: 18.0% → 100.0%
   • All categories now have good extraction coverage

🔜 Next Steps (Phase 2):
   • Build coverage-aware prompts
   • Generate summaries with GPT-3.5 and BART
   • Evaluate with ROUGE and beyond-ROUGE metrics

  adding: data/ (stored 0%)
  adding: data/extracted_phrases.json (deflated 87%)
  adding: data/validation_samples.json (deflated 61%)
  adding: data/grouped_phrases_improved_topk.json (deflated 86%


✅ Phase 1 data downloaded!
