# Entity Extraction & Normalization - Deep Dive

This notebook demonstrates how biomedical entity extraction and normalization work.

## Why Entity Processing?

**Problem**: Users phrase the same question in different ways:
- "What is diabete?" (typo)
- "Tell me about DM" (abbreviation)
- "Diabetes mellitus information" (formal name)

**Solution**: Extract the entities from each question, then normalize them to canonical forms.
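
Fuzzy string matching copes with typos, but an abbreviation such as "DM" shares almost no characters with its expansion, so it usually needs an explicit lookup. A minimal sketch, assuming a hand-maintained alias table (`ALIASES` and `to_canonical` are hypothetical names, not part of the agents used below):

```python
# Hypothetical alias table for illustration only; a real system would load
# abbreviations from a curated vocabulary.
ALIASES = {
    "dm": "diabetes mellitus",
    "htn": "hypertension",
}


def to_canonical(term: str) -> str:
    """Lowercase and trim a term, then expand it if it is a known alias."""
    key = term.strip().lower()
    return ALIASES.get(key, key)


print(to_canonical("DM"))        # diabetes mellitus
print(to_canonical(" Asthma "))  # asthma
```

Terms that are not in the table pass through unchanged (apart from trimming and lowercasing), so a fuzzy matcher can still process them afterwards.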

## Setup

In [1]:
import sys
sys.path.append('..')
from agents.question_agent import QuestionAgent
from agents.normalize_agent import NormalizeAgent



## Part 1: Entity Extraction with SciSpaCy

In [2]:
# Initialize QuestionAgent
question_agent = QuestionAgent()

[QuestionAgent] Loading SciSpaCy biomedical NER...




### Example 1: Simple Question

In [3]:
question = "What is asthma?"

state = {"question": question}
result = question_agent.run(state)

print(f"Question: {question}")
print(f"Extracted entities: {result['entities']}")
print(f"\nEntity types recognized: DISEASE, CHEMICAL")

[QuestionAgent] Extracted raw entities: ['asthma']
Question: What is asthma?
Extracted entities: ['asthma']

Entity types recognized: DISEASE, CHEMICAL


### Example 2: Complex Question with Multiple Entities

In [4]:
question = "Can metformin treat diabetes?"

state = {"question": question}
result = question_agent.run(state)

print(f"Question: {question}")
print(f"Extracted entities: {result['entities']}")
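# Note: the labels below are hardcoded for illustration, not model output;
# compare them with result['entities'] printed above.
print(f"\nIdentified:")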
print(f"  - 'metformin' → CHEMICAL (drug)")
print(f"  - 'diabetes' → DISEASE")

[QuestionAgent] Extracted raw entities: ['metformin']
Question: Can metformin treat diabetes?
Extracted entities: ['metformin']

Identified:
  - 'metformin' → CHEMICAL (drug)
  - 'diabetes' → DISEASE


### Example 3: Medical Terminology

In [5]:
questions = [
    "What causes hypertension?",
    "How does aspirin work?",
    "What are symptoms of COVID-19?",
    "Can insulin treat type 2 diabetes?"
]

for q in questions:
    state = {"question": q}
    result = question_agent.run(state)
    print(f"Q: {q}")
    print(f"Entities: {result['entities']}\n")

[QuestionAgent] Extracted raw entities: ['hypertension']
Q: What causes hypertension?
Entities: ['hypertension']

[QuestionAgent] Extracted raw entities: ['aspirin']
Q: How does aspirin work?
Entities: ['aspirin']

[QuestionAgent] Extracted raw entities: []
Q: What are symptoms of COVID-19?
Entities: []

[QuestionAgent] Extracted raw entities: []
Q: Can insulin treat type 2 diabetes?
Entities: []



## Part 2: Entity Normalization with Fuzzy Matching

In [6]:
# Initialize NormalizeAgent
normalize_agent = NormalizeAgent()

[NormalizeAgent] Loading canonical entity mappings...
[NormalizeAgent] Loaded 15723 canonical biomedical entities.



### Example 1: Handling Typos

In [7]:
# Simulate typos
typos = ["diabete", "diabetis", "diebetes"]

for typo in typos:
    state = {"entities": [typo]}
    result = normalize_agent.run(state)
    
    print(f"Input:  '{typo}'")
    print(f"Output: '{result['normalized_entities'][0]}'")
    print()

[NormalizeAgent] Fuzzy-normalizing entities:
   Raw extracted entities: ['diabete']
   → 'diabete' → 'diabetes' (score=93.33333333333333)
Input:  'diabete'
Output: 'diabetes'

[NormalizeAgent] Fuzzy-normalizing entities:
   Raw extracted entities: ['diabetis']
   → 'diabetis' → 'diabetesis' (score=88.88888888888889)
Input:  'diabetis'
Output: 'diabetesis'

[NormalizeAgent] Fuzzy-normalizing entities:
   Raw extracted entities: ['diebetes']
   → 'diebetes' → 'diabetes' (score=87.5)
Input:  'diebetes'
Output: 'diabetes'
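
The scores in this output are consistent with a similarity of the form 2 × M / T × 100, where M is the number of matched characters and T is the combined length of both strings. The sketch below reproduces them with the standard library's `difflib`, used here only as a stand-in. It is not the scorer `NormalizeAgent` uses, and the two may disagree on other inputs.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return 2 * matches / total length, scaled to 0-100."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


pairs = [
    ("diabete", "diabetes"),     # 7 matched characters, 15 in total
    ("diabetis", "diabetesis"),  # 8 matched characters, 18 in total
    ("diebetes", "diabetes"),    # 7 matched characters, 16 in total
]
for raw, canonical in pairs:
    print(f"{raw!r} vs {canonical!r}: {similarity(raw, canonical):.2f}")
```

Under this measure 'diabetis' scores 88.89 against 'diabetesis' but only 87.50 against 'diabetes', which is why the second typo lands on the unexpected vocabulary entry instead of the intended one.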



### Example 2: Handling Variations

In [None]:
# Different ways to say the same thing
variations = [
    "asthma",
    "asma",  # Common misspelling
    "asthmatic"
]

for var in variations:
    state = {"entities": [var]}
    result = normalize_agent.run(state)
    
    print(f"'{var}' → '{result['normalized_entities'][0]}'")

[NormalizeAgent] Fuzzy-normalizing entities:
   Raw extracted entities: ['asthma']
   → 'asthma' → 'asthma' (score=100.0)
'asthma' → 'asthma'
[NormalizeAgent] Fuzzy-normalizing entities:
   Raw extracted entities: ['asma']
   → 'asma' → 'anaplasma' (score=90.0)
'asma' → 'anaplasma'
[NormalizeAgent] Fuzzy-normalizing entities:
   Raw extracted entities: ['asthmatic']
   → 'asthmatic' → 'asthma' (score=90.0)
'asthmatic' → 'asthma'
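
The 'asma' → 'anaplasma' result looks wrong, but it follows from how weighted scorers treat substrings: 'asma' appears verbatim inside 'anaplasma'. The sketch below is a simplified stand-in written with `difflib`, not RapidFuzz's actual implementation. It assumes that when one string is at least 1.5 times longer than the other, the best substring score is scaled by 0.9 and compared with the full-string score. The function names and both constants are assumptions, chosen because they are consistent with the 90.0 scores above.

```python
from difflib import SequenceMatcher


def full_ratio(a: str, b: str) -> float:
    """Whole-string similarity on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def partial_ratio(a: str, b: str) -> float:
    """Best similarity between the shorter string and any equally long
    window of the longer string."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0.0
    windows = range(len(long_) - len(short) + 1)
    return max(full_ratio(short, long_[i:i + len(short)]) for i in windows)


def weighted_score(a: str, b: str) -> float:
    """Simplified weighted score (assumed constants: 1.5 and 0.9)."""
    score = full_ratio(a, b)
    if not a or not b:
        return score
    length_ratio = max(len(a), len(b)) / min(len(a), len(b))
    if length_ratio >= 1.5:
        # Long/short pair: let a scaled substring match compete.
        score = max(score, 0.9 * partial_ratio(a, b))
    return score


for candidate in ("asthma", "anaplasma"):
    print(candidate, round(weighted_score("asma", candidate), 1))
```

With this scoring 'anaplasma' (90.0) outranks 'asthma' (80.0), so a short misspelling can be captured by any longer vocabulary entry that happens to contain it. Limiting the allowed length difference between input and candidate is one possible mitigation.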


### Example 3: Complete Pipeline

In [10]:
# Question with typo
question = "What treats diabete?"  # Typo: "diabete"

# Step 1: Extract entities
state = {"question": question}
state = question_agent.run(state)

print(f"Question: {question}")
print(f"\nStep 1 - Entity Extraction:")
print(f"  Raw entities: {state['entities']}")

# Step 2: Normalize entities
state = normalize_agent.run(state)

print(f"\nStep 2 - Entity Normalization:")
print(f"  Normalized: {state['normalized_entities']}")

if state['normalized_entities']:
    print(f"\n Typo corrected: 'diabete' -> '{state['normalized_entities'][0]}'")
else:
    print("\n Note: 'diabete' was not recognized as a biomedical entity by SciSpaCy")
    print("  Reason: The typo is too different from known medical terms")
    print("  Solution: Use a question with a recognized entity, like 'diabetes'")


[QuestionAgent] Extracted raw entities: []
Question: What treats diabete?

Step 1 - Entity Extraction:
  Raw entities: []
[NormalizeAgent] Fuzzy-normalizing entities:
   Raw extracted entities: []

Step 2 - Entity Normalization:
  Normalized: []

 Note: 'diabete' was not recognized as a biomedical entity by SciSpaCy
  Reason: The typo is too different from known medical terms
  Solution: Use a question with a recognized entity, like 'diabetes'
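
One way to recover from an empty extraction is to fall back to a word-level lookup against the canonical vocabulary. This is a hedged sketch, not part of `QuestionAgent`: `fallback_entities`, the tiny `VOCAB` list, and the 0.72 cutoff (mirroring the 72% threshold described in this notebook) are all illustrative assumptions, and `difflib` stands in for the fuzzy matcher.

```python
import re
from difflib import get_close_matches

# Tiny illustrative vocabulary; the real agent loads a much larger one.
VOCAB = ["diabetes", "asthma", "metformin", "aspirin"]


def fallback_entities(question: str, vocabulary: list[str],
                      cutoff: float = 0.72) -> list[str]:
    """Fuzzy-match each word of the question against the vocabulary."""
    found = []
    for token in re.findall(r"[a-z0-9-]+", question.lower()):
        # Keep only the single best candidate at or above the cutoff.
        match = get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        if match and match[0] not in found:
            found.append(match[0])
    return found


print(fallback_entities("What treats diabete?", VOCAB))  # ['diabetes']
```

With a vocabulary of thousands of entries, ordinary words may also clear the cutoff, so a stopword filter or a stricter cutoff would likely be needed before using this in the pipeline.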


## Understanding the Technology

### SciSpaCy NER Model

**Model**: `en_ner_bc5cdr_md`

**Training Data**:
- BC5CDR corpus (BioCreative V), a set of PubMed abstracts annotated with chemical and disease mentions

**Entity Types**:
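- **DISEASE**: e.g., diabetes, asthma, cancer
- **CHEMICAL**: e.g., metformin, aspirin, acetaminophen

Coverage is not complete. In the examples above, the agent returned no entities for the COVID-19 and insulin questions, and it missed "diabetes" in "Can metformin treat diabetes?".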

**Why SciSpaCy?**
- Specialized for biomedical text
- Labels diseases and chemicals, which general-purpose NER models are not trained to recognize
- Pre-trained (no training needed)

### Fuzzy Matching with RapidFuzz

**Algorithm**: Weighted Ratio (WRatio)

**Canonical Vocabulary**: 15,723 biomedical entities (the count `NormalizeAgent` reports when it loads)

**Similarity Threshold**: 72 (scores range from 0 to 100)

**Why Fuzzy Matching?**
- Handles typos (e.g., 'diabete' → 'diabetes')
- Handles variations (e.g., 'asthmatic' → 'asthma')
- Fast (optimized C++ implementation)
- Configurable threshold
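
The combination of a scorer, a vocabulary, and a threshold can be sketched as a best-match lookup that declines to normalize when nothing scores high enough. This sketch uses `difflib` from the standard library as a stand-in for RapidFuzz, so its scores can differ from the WRatio values shown earlier. `normalize_entity`, `VOCAB`, and the default threshold of 72 are illustrative assumptions rather than `NormalizeAgent` internals.

```python
from difflib import SequenceMatcher
from typing import Optional

# Tiny illustrative vocabulary; the real one is far larger.
VOCAB = ["diabetes", "asthma", "metformin", "aspirin"]


def normalize_entity(entity: str, vocabulary: list[str],
                     threshold: float = 72.0) -> Optional[tuple[str, float]]:
    """Return (best_match, score), or None if no candidate reaches the threshold."""
    best, best_score = None, 0.0
    for candidate in vocabulary:
        score = 100.0 * SequenceMatcher(None, entity.lower(), candidate).ratio()
        if score > best_score:
            best, best_score = candidate, score
    if best is None or best_score < threshold:
        return None
    return best, best_score


print(normalize_entity("asprin", VOCAB))  # best match is 'aspirin'
print(normalize_entity("xyzzy", VOCAB))   # None: nothing clears the threshold
```

Returning `None` instead of the nearest low-scoring entry lets the caller keep the raw entity or skip it, rather than silently substituting an unrelated term.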

## Interactive Testing

In [11]:
def process_question(question):
    """Complete entity extraction and normalization"""
    
    print(f"Question: {question}")
    print("="*60)
    
    # Extract
    state = {"question": question}
    state = question_agent.run(state)
    
    print(f"\n1. Extracted Entities:")
    for ent in state['entities']:
        print(f"   - {ent}")
    
    # Normalize
    state = normalize_agent.run(state)
    
    print(f"\n2. Normalized Entities:")
    for ent in state['normalized_entities']:
        print(f"   - {ent}")
    
    print("\n" + "="*60)

# Try your own questions
process_question("What is asthma?")
# process_question("Can asprin treat headaches?")  # Typo: asprin
# process_question("What causes diabetis?")  # Typo: diabetis

Question: What is asthma?
[QuestionAgent] Extracted raw entities: ['asthma']

1. Extracted Entities:
   - asthma
[NormalizeAgent] Fuzzy-normalizing entities:
   Raw extracted entities: ['asthma']
   → 'asthma' → 'asthma' (score=100.0)

2. Normalized Entities:
   - asthma



## Why This Matters for RAG

### Without Entity Processing:
```
User: "What treats diabete?"
Search: "diabete" (typo)
Results: Poor (no exact match)
Answer: Low quality
```

### With Entity Processing:
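```
User: "What treats diabete?"
Extract: ["diabete"]
Normalize: ["diabetes"]
Search: "diabetes" (correct)
Results: Excellent
Answer: High quality
```

This is the intended flow, and it assumes the extractor returns the misspelled term. In the Complete Pipeline example above, SciSpaCy returned no entities for this question, so normalization had nothing to correct.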

**Impact**: Better retrieval → better answers