# ICD-10 Dataset Preprocessing

This notebook generates pre-computed artifacts for the ICD-10 Coding Assistant:

1. Download ICD-10 dataset from Hugging Face
2. Apply chapter enrichment using mapping logic
3. Generate synonym variations from descriptions
4. Generate embeddings for all descriptions
5. Build FAISS index
6. Save artifacts to `data/cache/`

**Expected runtime:** 2-3 minutes (one-time)

**Output artifacts:**
- `data/cache/enriched_dataset.pkl` - DataFrame with code/description/chapter/synonyms
- `data/cache/icd10_index.faiss` - Pre-built FAISS index
- `data/cache/metadata.json` - Dataset metadata

## 1. Setup and Imports

In [5]:
import sys
import os

# Add parent directory to path to import app utilities
sys.path.insert(0, os.path.abspath('..'))

import pandas as pd
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import faiss
import pickle
import json
from datetime import datetime
from tqdm.auto import tqdm

# Import utilities from app
from app.utils.chapter_mapping import get_chapter_from_code
from app.utils.code_formatter import CodeFormatter

print("All imports successful")

All imports successful


## 2. Load Dataset from Hugging Face

In [6]:
# Load dataset
print("Loading ICD-10 dataset from Hugging Face...")
dataset = load_dataset("awacke1/ICD10-Clinical-Terminology")
df = dataset['train'].to_pandas()

print(f"Loaded {len(df)} ICD-10 codes")
print(f"\nOriginal columns: {df.columns.tolist()}")

# Rename columns to lowercase for consistency
df.columns = df.columns.str.lower()
print(f"Renamed columns: {df.columns.tolist()}")

print(f"\nFirst few rows:")
df.head()

Loading ICD-10 dataset from Hugging Face...
Loaded 72750 ICD-10 codes

Original columns: ['Code', 'Description']
Renamed columns: ['code', 'description']

First few rows:


| | code | description |
|---|---|---|
| 0 | A000 | Cholera due to Vibrio cholerae 01, biovar chol... |
| 1 | A001 | Cholera due to Vibrio cholerae 01, biovar eltor |
| 2 | A009 | Cholera, unspecified |
| 3 | A0100 | Typhoid fever, unspecified |
| 4 | A0101 | Typhoid meningitis |


## 3. Data Exploration

In [7]:
# Check for missing values
print("Missing values:")
print(df.isnull().sum())

# Check data types
print("\nData types:")
print(df.dtypes)

# Sample codes
print("\nSample codes:")
print(df['code'].head(10).tolist())

Missing values:
code           0
description    0
dtype: int64

Data types:
code           str
description    str
dtype: object

Sample codes:
['A000', 'A001', 'A009', 'A0100', 'A0101', 'A0102', 'A0103', 'A0104', 'A0105', 'A0109']
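
The sample codes are stored without the decimal point that ICD-10-CM uses for display (`A0100` rather than `A01.00`). A minimal sketch of adding it, assuming the usual convention of a dot after the third character. `format_icd10` is a hypothetical helper for illustration and is not the app's `CodeFormatter`, whose behavior is not shown in this notebook:

```python
def format_icd10(code: str) -> str:
    """Insert the display decimal point after the third character.

    Hypothetical helper for illustration only; the app's CodeFormatter
    may behave differently.
    """
    code = code.strip().upper().replace(".", "")
    # Three-character category codes (e.g. "I10") have no decimal part.
    if len(code) <= 3:
        return code
    return f"{code[:3]}.{code[3:]}"


for raw in ["A000", "A0100", "E11621", "I10"]:
    print(raw, "->", format_icd10(raw))
```

Stripping any existing dot first makes the helper idempotent, so it can be applied to codes that are already formatted.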


## 4. Chapter Enrichment

Apply the ICD-10 chapter mapping to every code using `get_chapter_from_code` from `app.utils.chapter_mapping`.

In [8]:
print("Applying chapter enrichment...")

# Apply chapter mapping to all codes
df['chapter'] = df['code'].apply(get_chapter_from_code)

print(f"Chapter enrichment complete")
print(f"\nChapter distribution:")
print(df['chapter'].value_counts())

# Show sample enriched rows
print("\nSample enriched data:")
df[['code', 'description', 'chapter']].head(10)

Applying chapter enrichment...
Chapter enrichment complete

Chapter distribution:
chapter
XIX. Injury, poisoning and certain other consequences of external causes      40786
XX. External causes of morbidity                                               7067
XIII. Diseases of the musculoskeletal system and connective tissue             6563
VII. Diseases of the eye and adnexa                                            2628
XV. Pregnancy, childbirth and the puerperium                                   2271
II. Neoplasms                                                                  1642
IX. Diseases of the circulatory system                                         1379
XXI. Factors influencing health status and contact with health services        1289
I. Certain infectious and parasitic diseases                                   1064
IV. Endocrine, nutritional and metabolic diseases                               913
XII. Diseases of the skin and subcutaneous tissue                     

| | code | description | chapter |
|---|---|---|---|
| 0 | A000 | Cholera due to Vibrio cholerae 01, biovar chol... | I. Certain infectious and parasitic diseases |
| 1 | A001 | Cholera due to Vibrio cholerae 01, biovar eltor | I. Certain infectious and parasitic diseases |
| 2 | A009 | Cholera, unspecified | I. Certain infectious and parasitic diseases |
| 3 | A0100 | Typhoid fever, unspecified | I. Certain infectious and parasitic diseases |
| 4 | A0101 | Typhoid meningitis | I. Certain infectious and parasitic diseases |
| 5 | A0102 | Typhoid fever with heart involvement | I. Certain infectious and parasitic diseases |
| 6 | A0103 | Typhoid pneumonia | I. Certain infectious and parasitic diseases |
| 7 | A0104 | Typhoid arthritis | I. Certain infectious and parasitic diseases |
| 8 | A0105 | Typhoid osteomyelitis | I. Certain infectious and parasitic diseases |
| 9 | A0109 | Typhoid fever with other complications | I. Certain infectious and parasitic diseases |
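
The implementation of `get_chapter_from_code` is not shown in this notebook. As a rough sketch of how such a mapping could work, the snippet below looks up the three-character category of a code in a table of chapter ranges. The table is an illustrative subset, and both `CHAPTER_RANGES` and `chapter_for` are hypothetical names, not the app's actual code:

```python
# Illustrative subset of ICD-10-CM chapter ranges, keyed by the first three
# characters of the code. This table is an assumption for the sketch and is
# not copied from app.utils.chapter_mapping.
CHAPTER_RANGES = [
    ("A00", "B99", "I. Certain infectious and parasitic diseases"),
    ("C00", "D49", "II. Neoplasms"),
    ("E00", "E89", "IV. Endocrine, nutritional and metabolic diseases"),
    ("S00", "T88", "XIX. Injury, poisoning and certain other consequences of external causes"),
]


def chapter_for(code: str) -> str:
    """Return the chapter whose range contains the code's 3-character category."""
    category = code.strip().upper()[:3]
    for start, end, chapter in CHAPTER_RANGES:
        # Plain string comparison works because every category is one letter
        # followed by two fixed-width characters.
        if start <= category <= end:
            return chapter
    return "Unknown"


print(chapter_for("A000"))
print(chapter_for("E11621"))
```

A range table keeps chapters that span several letters (such as S00-T88) or split a letter (such as C00-D49) in a single place, which a lookup on the first letter alone could not express.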


## 5. Synonym Generation

Generate simple synonym variations from descriptions.
This can be enhanced later with NLP techniques.

In [9]:
import re

def generate_synonyms(description: str) -> list:
    """
    Generate simple synonym variations from description.

    This is a basic implementation. Can be enhanced with:
    - Medical terminology databases (UMLS, SNOMED CT)
    - NLP-based synonym extraction
    - Abbreviation expansion
    """
    if not description or not isinstance(description, str):
        return []

    synonyms = []

    # Common medical abbreviations and variations
    replacements = [
        ('type 2 diabetes mellitus', 'diabetes mellitus type 2'),
        ('type 1 diabetes mellitus', 'diabetes mellitus type 1'),
        ('diabetes mellitus', 'DM'),
        ('myocardial infarction', 'MI'),
        ('congestive heart failure', 'CHF'),
        ('chronic obstructive pulmonary disease', 'COPD'),
        ('hypertension', 'high blood pressure'),
    ]

    for original, replacement in replacements:
        # Replace case-insensitively. Descriptions are capitalized
        # ("Type 2 diabetes mellitus ..."), so a plain str.replace with the
        # lowercase pattern would silently miss them.
        synonym = re.sub(re.escape(original), replacement, description,
                         flags=re.IGNORECASE)
        if synonym != description:
            synonyms.append(synonym)

    # Limit to 3 synonyms to keep data size manageable
    return synonyms[:3]

print("Generating synonyms...")
df['synonyms'] = df['description'].apply(generate_synonyms)

print(f"Synonym generation complete")
print(f"\nSample synonyms:")
for idx in range(min(5, len(df))):
    if df.iloc[idx]['synonyms']:
        print(f"\nCode: {df.iloc[idx]['code']}")
        print(f"Description: {df.iloc[idx]['description']}")
        print(f"Synonyms: {df.iloc[idx]['synonyms']}")

Generating synonyms...
Synonym generation complete

Sample synonyms:
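
The preview loop above inspects only the first five rows, none of which contain a replacement term, so it prints nothing. A small sketch of previewing rows that do carry synonyms, using a toy frame in place of the real dataset (the `demo` rows and their synonym values are made up for illustration):

```python
import pandas as pd

# Toy frame standing in for the enriched dataset; values are illustrative.
demo = pd.DataFrame({
    "code": ["A000", "E119", "I10"],
    "description": [
        "Cholera due to Vibrio cholerae 01, biovar cholerae",
        "Type 2 diabetes mellitus without complications",
        "Essential (primary) hypertension",
    ],
    "synonyms": [
        [],
        ["diabetes mellitus type 2 without complications"],
        ["Essential (primary) high blood pressure"],
    ],
})

# Keep only rows whose synonym list is non-empty, then preview a few.
has_synonyms = demo[demo["synonyms"].map(len) > 0]

for _, row in has_synonyms.head(5).iterrows():
    print(f"{row['code']}: {row['synonyms']}")
```

Filtering before calling `head` means the preview shows matches wherever they occur in the dataset, not just at the top.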


## 6. Generate Embeddings

Generate embeddings for all descriptions using sentence-transformers.
This is the most time-consuming step (60-90 seconds for 72k codes).

In [10]:
# Load embedding model
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
print(f"Loading embedding model: {model_name}")
model = SentenceTransformer(model_name)
print(f"Model loaded")

# Generate embeddings
print(f"\nGenerating embeddings for {len(df)} descriptions...")
print("This may take 60-90 seconds...")

embeddings = model.encode(
    df['description'].tolist(),
    show_progress_bar=True,
    batch_size=32
)

print(f"\nGenerated embeddings: {embeddings.shape}")
print(f"Embedding dimension: {embeddings.shape[1]}")
print(f"Total vectors: {embeddings.shape[0]}")



Loading embedding model: sentence-transformers/all-MiniLM-L6-v2


Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Model loaded

Generating embeddings for 72750 descriptions...
This may take 60-90 seconds...


Batches:   0%|          | 0/2274 [00:00<?, ?it/s]


Generated embeddings: (72750, 384)
Embedding dimension: 384
Total vectors: 72750
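
As a quick sanity check on the numbers above, the batch count and the raw size of the embedding matrix can be worked out by hand. This sketch assumes the embeddings are stored as 32-bit floats, which is the dtype they are cast to before being added to the index:

```python
import math

n_rows = 72_750          # descriptions encoded above
batch_size = 32          # value passed to model.encode
dim = 384                # embedding dimension reported above
bytes_per_float32 = 4    # assumption: vectors are stored as float32

n_batches = math.ceil(n_rows / batch_size)
raw_mb = n_rows * dim * bytes_per_float32 / (1024 * 1024)

print(f"Batches: {n_batches}")
print(f"Raw float32 size: {raw_mb:.1f} MB")
```

This gives 2274 batches, matching the total in the progress output, and about 106.6 MB of raw vector data. A flat index stores the vectors uncompressed, so its file size should be close to that figure.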


## 7. Build FAISS Index

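Build a flat FAISS index for similarity search. `IndexFlatL2` compares the query against every stored vector, so results are exact rather than approximate.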

In [11]:
print("Building FAISS index...")

# Create FAISS index (IndexFlatL2 for exact search)
dimension = embeddings.shape[1]  # 384 for all-MiniLM-L6-v2
index = faiss.IndexFlatL2(dimension)

# Add embeddings to index
index.add(embeddings.astype('float32'))

print(f"FAISS index built: {index.ntotal} vectors")
print(f"Index type: IndexFlatL2 (exact search)")
print(f"Dimension: {dimension}")

Building FAISS index...
FAISS index built: 72750 vectors
Index type: IndexFlatL2 (exact search)
Dimension: 384
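
`IndexFlatL2` reports squared Euclidean distances. If the embeddings are unit-normalized (an assumption here, which can be checked with `np.linalg.norm(embeddings, axis=1)`), ranking by squared L2 distance is equivalent to ranking by cosine similarity. The sketch below verifies the identity on two random unit vectors with NumPy only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random unit-length vectors standing in for embeddings.
a = rng.normal(size=384)
b = rng.normal(size=384)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

squared_l2 = float(np.sum((a - b) ** 2))
cosine = float(np.dot(a, b))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.isclose(squared_l2, 2 - 2 * cosine))
```

Under that assumption, a reported distance `d` corresponds to a cosine similarity of `1 - d / 2`.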


## 8. Save Artifacts

Save all pre-computed artifacts to `data/cache/`.

In [12]:
# Create cache directory
cache_dir = '../data/cache'
os.makedirs(cache_dir, exist_ok=True)

print(f"Saving artifacts to {cache_dir}/...")

# 1. Save enriched dataset
pkl_path = os.path.join(cache_dir, 'enriched_dataset.pkl')
df.to_pickle(pkl_path)
pkl_size_mb = os.path.getsize(pkl_path) / (1024 * 1024)
print(f"Saved enriched dataset: {pkl_path} ({pkl_size_mb:.1f} MB)")

# 2. Save FAISS index
index_path = os.path.join(cache_dir, 'icd10_index.faiss')
faiss.write_index(index, index_path)
index_size_mb = os.path.getsize(index_path) / (1024 * 1024)
print(f"Saved FAISS index: {index_path} ({index_size_mb:.1f} MB)")

# 3. Save metadata
# datetime.utcnow() is deprecated since Python 3.12; use a timezone-aware UTC timestamp.
from datetime import timezone

metadata = {
    'version': '1.0',
    'dataset': 'awacke1/ICD10-Clinical-Terminology',
    'row_count': len(df),
    'embedding_model': model_name,
    'embedding_dimension': dimension,
    'created_at': datetime.now(timezone.utc).isoformat(),
    'chapter_count': df['chapter'].nunique()
}

meta_path = os.path.join(cache_dir, 'metadata.json')
with open(meta_path, 'w') as f:
    json.dump(metadata, f, indent=2)
print(f"Saved metadata: {meta_path}")

print("\n" + "="*60)
print("ALL ARTIFACTS SAVED SUCCESSFULLY!")
print("="*60)
print(f"\nTotal artifacts size: {pkl_size_mb + index_size_mb:.1f} MB")
print(f"\nYou can now start the FastAPI server with:")
print(f"  uvicorn app.main:app --reload")

Saving artifacts to ../data/cache/...
Saved enriched dataset: ../data/cache\enriched_dataset.pkl (11.7 MB)
Saved FAISS index: ../data/cache\icd10_index.faiss (106.6 MB)
Saved metadata: ../data/cache\metadata.json

ALL ARTIFACTS SAVED SUCCESSFULLY!

Total artifacts size: 118.3 MB

You can now start the FastAPI server with:
  uvicorn app.main:app --reload



## 9. Verification

Verify that artifacts can be loaded correctly.

In [13]:
print("Verifying artifacts...")

# Load enriched dataset
df_loaded = pd.read_pickle(pkl_path)
print(f"Loaded dataset: {len(df_loaded)} rows")

# Load FAISS index
index_loaded = faiss.read_index(index_path)
print(f"Loaded FAISS index: {index_loaded.ntotal} vectors")

# Load metadata
with open(meta_path, 'r') as f:
    metadata_loaded = json.load(f)
print(f"Loaded metadata: version {metadata_loaded['version']}")

# Test search
query = "diabetes with foot ulcer"
query_embedding = model.encode([query])
distances, indices = index_loaded.search(query_embedding.astype('float32'), k=3)

print(f"\nTest search for: '{query}'")
for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
    row = df_loaded.iloc[idx]
    print(f"\n{i+1}. Code: {row['code']}")
    print(f"   Description: {row['description']}")
    print(f"   Chapter: {row['chapter']}")
    print(f"   Distance: {dist:.4f}")

print("\n" + "="*60)
print("VERIFICATION COMPLETE!")
print("All artifacts are working correctly.")
print("="*60)

Verifying artifacts...
Loaded dataset: 72750 rows
Loaded FAISS index: 72750 vectors
Loaded metadata: version 1.0

Test search for: 'diabetes with foot ulcer'

1. Code: E13621
   Description: Other specified diabetes mellitus with foot ulcer
   Chapter: IV. Endocrine, nutritional and metabolic diseases
   Distance: 0.0933

2. Code: E08621
   Description: Diabetes mellitus due to underlying condition with foot ulcer
   Chapter: IV. Endocrine, nutritional and metabolic diseases
   Distance: 0.0977

3. Code: E11621
   Description: Type 2 diabetes mellitus with foot ulcer
   Chapter: IV. Endocrine, nutritional and metabolic diseases
   Distance: 0.1556

VERIFICATION COMPLETE!
All artifacts are working correctly.
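
The verification above prints counts for visual inspection. A stricter variant could compare the artifacts against the metadata and fail loudly on a mismatch. The sketch below is one way to do that; `check_artifacts` is a hypothetical helper, and it takes plain values so it can be exercised without loading the real files:

```python
def check_artifacts(metadata: dict, n_rows: int, n_vectors: int, index_dim: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if metadata.get("row_count") != n_rows:
        problems.append(
            f"metadata row_count={metadata.get('row_count')} but dataset has {n_rows} rows"
        )
    if n_vectors != n_rows:
        problems.append(f"index holds {n_vectors} vectors but dataset has {n_rows} rows")
    if metadata.get("embedding_dimension") != index_dim:
        problems.append(
            f"metadata embedding_dimension={metadata.get('embedding_dimension')} "
            f"but index dimension is {index_dim}"
        )
    return problems


# Values taken from the outputs in this notebook.
example_metadata = {"row_count": 72750, "embedding_dimension": 384}
issues = check_artifacts(example_metadata, n_rows=72750, n_vectors=72750, index_dim=384)
print(issues or "Artifacts are consistent")
```

In the notebook this could be called as `check_artifacts(metadata_loaded, len(df_loaded), index_loaded.ntotal, index_loaded.d)`, followed by an `assert not issues`.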
