# MedPix Data Cleaning and Preprocessing
## Preparing Cases for the Web Application

**GOAL**: Clean and structure the raw MedPix **`Cases.json`** data for the frontend application. The notebook extracts key metadata from titles, normalizes fields, and writes two JSON files that the backend uses:

- **`cases_cleaned*.json`** — full, detailed cases
- **`cases_summary*.json`** — lightweight list for the gallery (includes `thumbnail` and `url`)

> **Why**: Our original analysis leaked metadata (age, gender, modality, region) into the title.

Here we:

1. extract that metadata into its own fields
2. clean the title so that it shows only the clinical description (no duplicated metadata)
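
As a rough illustration of the two steps above, here is a minimal sketch that pulls a leading age/gender phrase out of a title and returns the remaining clinical description. It assumes titles of the form "45-year-old female with ..."; the names `DEMOGRAPHICS` and `split_title` are hypothetical and are not functions defined in this notebook.

```python
import re

# Hypothetical pattern: "<age>-year-old <gender>" plus an optional connector.
DEMOGRAPHICS = re.compile(
    r'\b(?P<age>\d{1,3})\s*-?\s*(?:year|yr)s?\s*-?\s*old\s+'
    r'(?P<gender>male|female|man|woman)\b\s*'
    r'(?:(?:presenting\s+)?with\b|[,:])?\s*',
    re.IGNORECASE,
)

def split_title(title: str) -> dict:
    """Return age, gender, and the title with the demographic phrase removed."""
    match = DEMOGRAPHICS.search(title)
    if not match:
        # Nothing to extract: keep the title as it is
        return {'age': None, 'gender': None, 'title': title.strip()}
    is_female = match.group('gender').lower() in ('female', 'woman')
    cleaned = (title[:match.start()] + title[match.end():]).strip(' ,:-')
    return {
        'age': match.group('age'),
        'gender': 'Female' if is_female else 'Male',
        'title': cleaned,
    }

example = split_title('45-year-old female with abdominal aortic aneurysm')
print(example)
```

Titles without a demographic phrase pass through unchanged, so the helper could be applied to every case without special-casing.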


### Used libraries
- json: For reading/writing JSON files
- pandas: For data analysis and statistics
- numpy: For numerical operations
- pathlib: For cross-platform file path handling
- re: For regex pattern matching
- collections: For counting values (`Counter`)
- typing: For type hints

In [43]:
# Import required libraries
import json
import pandas as pd
import numpy as np
from pathlib import Path
import re
from collections import Counter
from typing import Dict, List, Optional, Tuple, Any

print("✅ Libraries loaded successfully!")

✅ Libraries loaded successfully!


In [None]:
CONFIG = {
    'data_paths': {
        'input': '../data/archive/Cases.json',
        'output_dir': '../data/processed',
        'output_files': {
            'cleaned': 'cases_cleaned.json',
            'summary': 'cases_summary.json'
        }
    },
    'metadata_patterns': {
        'age': [
            r'(\d{1,3})\s*y/o\b',
            r'(\d{1,3})\s*yo\b',
            r'(\d{1,3})\s*-\s*(?:year|yr)\s*-\s*old',
            r'(\d{1,3})\s*(?:year|yr)\s*old',
            r'age\s*(\d{1,3})\b',
            r'\b(\d{1,3})\s*(?:F|M)\b',
            r'(\d{1,3})\s*year',
            r'(\d{1,3})\s*yr',
        ],
        'gender': {
            'patterns': [
                r'(\d+)\s*(?:year|yr)[^.]*?\b(female|woman)\b',
                r'(\d+)\s*(?:year|yr)[^.]*?\b(male|man)\b',
                r'\b(female|woman)\b[^.]*?\d+\s*(?:year|yr)',
                r'\b(male|man)\b[^.]*?\d+\s*(?:year|yr)',
                r'\b(\d+)\s*(M)\b',
                r'\b(\d+)\s*(F)\b',
                r'\b(Male)\b',
                r'\b(Female)\b',
            ],
            'mapping': {
                'M': 'Male', 'MALE': 'Male', 'MAN': 'Male',
                'F': 'Female', 'FEMALE': 'Female', 'WOMAN': 'Female'
            }
        },
        'modality': {
            'CT': ['ct', 'computed tomography'],
            'MRI': ['mri', 'mr', 'magnetic resonance'],
            'X-ray': ['x-ray', 'xray', 'radiograph'],
            'Ultrasound': ['ultrasound', 'us', 'sonography'],
            'PET': ['pet', 'positron'],
            'Nuclear': ['nuclear', 'spect']
        },
        'body_regions': {
            'head': [
                'head', 'skull', 'face', 'cranial', 'brain', 'cerebral', 
                'neurological', 'neuro', 'cerebellar', 'cortical', 'meningeal', 
                'cns', 'seizure', 'stroke', 'headache', 'migraine'
            ],
            'neck': [
                'neck', 'cervical', 'thyroid', 'trachea', 'larynx', 'pharynx'
            ],
            'chest': [
                'chest', 'thorax', 'rib', 'lung', 'pulmonary', 'respiratory', 
                'breathing', 'copd', 'asthma', 'pneumonia', 'bronchial', 'pleural',
                'heart', 'cardiac', 'coronary', 'myocardial', 'aortic', 'aorta', 
                'aneurysm', 'vascular', 'artery', 'venous', 'vein', 'blood pressure', 
                'hypertension', 'hypotension', 'infarction', 'cabg', 'cardiovascular'
            ],
            'abdomen': [
                'abdominal', 'abdomen', 'liver', 'hepatic', 'kidney', 'renal', 
                'spleen', 'pancreas', 'pancreatic', 'gallbladder', 'biliary',
                'gastrointestinal', 'gi', 'stomach', 'intestinal', 'colon',
                'esophageal', 'duodenal', 'gastric', 'appendix', 'hernia'
            ],
            'pelvis': [
                'pelvis', 'pelvic', 'bladder', 'prostate', 'uterus', 'uterine',
                'ovary', 'ovarian', 'gynecologic', 'reproductive', 'testicular',
                'rectal', 'anal'
            ],
            'spine': [
                'spine', 'spinal', 'vertebral', 'disc', 'herniation', 'scoliosis'
            ],
            'extremity': [
                'arm', 'leg', 'hand', 'foot', 'wrist', 'ankle', 'shoulder', 'hip', 
                'knee', 'elbow', 'extremity', 'bone', 'skeletal', 'joint', 'muscle', 
                'tendon', 'ligament', 'fracture', 'arthritis', 'carpal', 'subluxation',
                'musculoskeletal'
            ]
        }
    },
    'parsing': {
        'section_headers': [
            'History', 'Exam', 'Findings', 'Case Diagnosis', 'Diagnosis By',
            'Treatment', 'Discussion', 'Differential Diagnosis'
        ],
        'max_images_preview': 5,
        'reasonable_age_range': (1, 120),
    }
}

print(" Configuration loaded")


 Configuration loaded
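
To make the role of the ordered `age` patterns and `reasonable_age_range` more concrete, the sketch below applies three of the configured patterns, in the same relative order, to a few made-up strings. The sample strings and the names `AGE_PATTERNS`, `AGE_RANGE`, and `first_age` are illustrative assumptions, not part of the notebook.

```python
import re

# Three of the age patterns from CONFIG, kept in the same relative order
AGE_PATTERNS = [
    r'(\d{1,3})\s*y/o\b',
    r'(\d{1,3})\s*(?:year|yr)\s*old',
    r'\b(\d{1,3})\s*(?:F|M)\b',
]
AGE_RANGE = (1, 120)  # mirrors CONFIG['parsing']['reasonable_age_range']

def first_age(text):
    """Return the first in-range age, trying the patterns in list order."""
    for pattern in AGE_PATTERNS:
        for match in re.finditer(pattern, text, re.IGNORECASE):
            if AGE_RANGE[0] <= int(match.group(1)) <= AGE_RANGE[1]:
                return match.group(1)
    return None

samples = [
    '67 y/o with chest pain',
    '34 year old woman',
    '52F, cough',
    '150 y/o (typo)',          # rejected by the range check
    'fell 3 m from a ladder',  # false positive: lowercase "m" matches (?:F|M)
]
results = {sample: first_age(sample) for sample in samples}
for sample, age in results.items():
    print(f'{sample!r}: {age}')
```

The last sample suggests one thing to watch for: because matching is case-insensitive, the `(?:F|M)` pattern also accepts a distance such as "3 m".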


## Text Processing

In [45]:
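def clean_text(text: str) -> str:
    """Collapse whitespace runs into single spaces and strip both ends"""
    if not isinstance(text, str):
        return ""
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

def camel_case(text: str) -> str:
    """Convert text to camelCase, e.g. 'Case Diagnosis' -> 'caseDiagnosis'"""
    words = text.lower().split()
    if not words:
        # Empty or whitespace-only input would otherwise raise IndexError
        return ''
    return words[0] + ''.join(word.title() for word in words[1:])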

def fix_image_paths(image_paths: List[str]) -> List[str]:
    """Convert Windows paths to Unix paths"""
    if not isinstance(image_paths, list):
        return []
    return [path.replace('\\', '/') for path in image_paths]

print(" Text processing functions loaded")


 Text processing functions loaded
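
A short usage sketch of the three helpers may help here, since `camel_case` determines the dictionary keys used later for the parsed sections. The block repeats the helpers' logic so that it runs on its own (the empty-input guard in `camel_case` is included to keep the sketch safe in isolation); the sample inputs are made up.

```python
import re

def clean_text(text):
    """Collapse whitespace runs and strip both ends; non-strings become ''."""
    if not isinstance(text, str):
        return ''
    return re.sub(r'\s+', ' ', text).strip()

def camel_case(text):
    """'Case Diagnosis' -> 'caseDiagnosis'."""
    words = text.lower().split()
    if not words:
        return ''
    return words[0] + ''.join(word.title() for word in words[1:])

def fix_image_paths(image_paths):
    """Replace backslashes with forward slashes in every path."""
    if not isinstance(image_paths, list):
        return []
    return [path.replace('\\', '/') for path in image_paths]

headers = ['History', 'Case Diagnosis', 'Diagnosis By', 'Differential Diagnosis']
keys = [camel_case(header) for header in headers]
cleaned = clean_text('  Abdominal \n  Aortic\tAneurysm ')
paths = fix_image_paths(['images\\case_1\\img_0.png'])

print(keys)
print(cleaned)
print(paths)
```

The resulting keys (`history`, `caseDiagnosis`, `diagnosisBy`, `differentialDiagnosis`) are the names under which section text appears in the cleaned cases.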


## Metadata Extraction Functions

In [46]:

def extract_age(text: str) -> Optional[str]:
    """Extract age from text"""
    if not text:
        return None
    
    age_range = CONFIG['parsing']['reasonable_age_range']
    
    for pattern in CONFIG['metadata_patterns']['age']:
        matches = re.finditer(pattern, text, re.IGNORECASE)
        for match in matches:
            age = match.group(1)
            try:
                if age_range[0] <= int(age) <= age_range[1]:
                    return age
            except (ValueError, TypeError):
                continue
    return None

def extract_gender(text: str) -> Optional[str]:
    """Extract gender from text"""
    if not text:
        return None
    
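    text_lower = text.lower()
    
    # Fast path: plain substring checks. 'female'/'woman' are tested first
    # because 'female' contains 'male' and 'woman' contains 'man'.
    # Caveat: these are not whole-word matches, so 'man' also hits words
    # such as 'management' or 'human'.
    if 'female' in text_lower or 'woman' in text_lower:
        return 'Female'
    elif 'male' in text_lower or 'man' in text_lower:
        return 'Male'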
    
    # Pattern matching
    gender_config = CONFIG['metadata_patterns']['gender']
    for pattern in gender_config['patterns']:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            if match.lastindex == 2:
                gender_key = match.group(2).upper()
            else:
                gender_key = match.group(1).upper()
            
            if gender_key in gender_config['mapping']:
                return gender_config['mapping'][gender_key]
            elif gender_key in ['M', 'F']:
                return 'Male' if gender_key == 'M' else 'Female'
    
    return None

def extract_modality(text: str) -> Optional[str]:
    """Extract imaging modality from text"""
    if not text:
        return None
    
    for modality, keywords in CONFIG['metadata_patterns']['modality'].items():
        for keyword in keywords:
            if re.search(r'\b' + re.escape(keyword) + r'\b', text, re.IGNORECASE):
                return modality
    return None

def extract_body_region(diagnosis: str, history: str, findings: str) -> Optional[str]:
    """Extract body region using medical context analysis"""
    if not any([diagnosis, history, findings]):
        return None
    
    combined_text = f"{diagnosis or ''} {history or ''} {findings or ''}".lower()
    region_scores = {}
    
    for region, keywords in CONFIG['metadata_patterns']['body_regions'].items():
        score = 0
        for keyword in keywords:
            if re.search(r'\b' + re.escape(keyword) + r'\b', combined_text, re.IGNORECASE):
                score += 1
        if score > 0:
            region_scores[region] = score
    
    if region_scores:
        best_region = max(region_scores.items(), key=lambda x: x[1])
        return best_region[0].title()
    
    return None

print("Metadata extraction functions loaded")


Metadata extraction functions loaded
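
The first step of `extract_gender` uses plain substring checks. The sketch below compares that behaviour with a whole-word variant on a made-up sentence. `gender_word_boundary` is a hypothetical alternative shown for comparison only; it is not what the notebook runs, and switching to it would change the counts reported further down.

```python
import re

def gender_substring(text):
    """Substring check, as in the first step of extract_gender."""
    lowered = text.lower()
    if 'female' in lowered or 'woman' in lowered:
        return 'Female'
    if 'male' in lowered or 'man' in lowered:
        return 'Male'
    return None

def gender_word_boundary(text):
    """Hypothetical stricter variant: only whole words count."""
    if re.search(r'\b(?:female|woman)\b', text, re.IGNORECASE):
        return 'Female'
    if re.search(r'\b(?:male|man)\b', text, re.IGNORECASE):
        return 'Male'
    return None

sentence = 'Conservative management of a mandible fracture'
loose = gender_substring(sentence)       # 'man' occurs inside 'management'
strict = gender_word_boundary(sentence)  # no standalone gender word
print(loose, strict)

both = (gender_substring('34 year old woman'),
        gender_word_boundary('34 year old woman'))
print(both)
```

On text that really names a gender, both variants agree; they differ only where "man" or "male" appears inside a longer word.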


### Case Parsing

In [47]:
def parse_case_sections(case_title: str) -> Dict[str, str]:
    """Parse case title into structured sections"""
    if not isinstance(case_title, str):
        return get_empty_sections()
    
    sections = get_empty_sections()
    
    # Extract main title
    headers_pattern = '|'.join(CONFIG['parsing']['section_headers'])
    title_match = re.search(r'^(.*?)(?=\s*(?:' + headers_pattern + r'))', 
                           case_title, re.IGNORECASE | re.DOTALL)
    if title_match:
        sections['title'] = clean_text(title_match.group(1))
    else:
        sections['title'] = clean_text(case_title)
    
    # Extract diagnosis
    diagnosis_match = re.search(r'CASE\s*[:\-]?\s*(.*?)(?=\s*(?:History|$))', 
                                case_title, re.IGNORECASE)
    if diagnosis_match:
        sections['diagnosis'] = clean_text(diagnosis_match.group(1))
    else:
        sections['diagnosis'] = sections['title']
    
    # Extract other sections
    section_patterns = {
        'history': r'History\s*[:\-]?\s*(.*?)(?=\s*(?:Exam|Findings|Case Diagnosis|Discussion|Treatment|Differential Diagnosis|$))',
        'exam': r'Exam\s*[:\-]?\s*(.*?)(?=\s*(?:Findings|Case Diagnosis|Discussion|Treatment|Differential Diagnosis|$))',
        'findings': r'Findings\s*[:\-]?\s*(.*?)(?=\s*(?:Case Diagnosis|Discussion|Treatment|Differential Diagnosis|$))',
        'caseDiagnosis': r'Case Diagnosis\s*[:\-]?\s*(.*?)(?=\s*(?:Diagnosis By|Discussion|Treatment|$))',
        'diagnosisBy': r'Diagnosis By\s*[:\-]?\s*(.*?)(?=\s*(?:Discussion|Treatment|$))',
        'treatment': r'Treatment\s*[:\-]?\s*(.*?)(?=\s*(?:Discussion|$))',
        'discussion': r'Discussion\s*[:\-]?\s*(.*?)$',
    }
    
    for section, pattern in section_patterns.items():
        match = re.search(pattern, case_title, re.IGNORECASE | re.DOTALL)
        if match:
            sections[section] = clean_text(match.group(1))
    
    return sections

def get_empty_sections() -> Dict[str, str]:
    """Return template for empty sections"""
    sections = {'title': 'Unknown', 'diagnosis': 'Unknown'}
    for header in CONFIG['parsing']['section_headers']:
        sections[camel_case(header)] = ''
    return sections

def extract_case_id(case_folder: str) -> Optional[str]:
    """Extract case ID from folder path"""
    if not case_folder:
        return None
    match = re.search(r'case_(-?\d+)', case_folder)
    return match.group(1) if match else None

print("Case parsing functions loaded")

Case parsing functions loaded
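
To show the idea behind section parsing on a small input, here is a simplified sketch. It uses a different technique from `parse_case_sections`: a single `re.split` on the header names instead of one lookahead regex per section. It also assumes each header is followed by a colon or dash. The sample string and the names `HEADERS` and `split_sections` are made up for illustration; real `Case Title` strings may be laid out differently.

```python
import re

HEADERS = ['History', 'Exam', 'Findings', 'Discussion']  # subset of the configured headers

def split_sections(text):
    """Split text at header words; text before the first header becomes 'title'."""
    pattern = r'\b(' + '|'.join(HEADERS) + r')\s*[:\-]\s*'
    # With one capture group, re.split yields [before, header, body, header, body, ...]
    parts = re.split(pattern, text)
    sections = {'title': parts[0].strip()}
    for header, body in zip(parts[1::2], parts[2::2]):
        sections[header.lower()] = ' '.join(body.split())
    return sections

raw = ('Abdominal Aortic Aneurysm History: 72 y/o man with back pain. '
       'Exam: Pulsatile mass. Findings: CT shows 6 cm infrarenal aneurysm.')
parsed = split_sections(raw)
for name, value in parsed.items():
    print(f'{name}: {value}')
```

Requiring the colon or dash makes the split less likely to trigger on words such as "examination", at the cost of missing headers that are written without punctuation.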


### Main Processing Function

In [48]:
def process_case(raw_case: Dict, index: int) -> Optional[Dict]:
    """Process a single raw case"""
    try:
        case_id = extract_case_id(raw_case.get('Case Folder', '')) or f"case_{index}"
        
        # Parse sections
        original_title = raw_case.get('Case Title', '')
        sections = parse_case_sections(original_title)
        
        # Extract metadata from sections
        priority_sections = ['history', 'exam', 'title']
        
        patient_age = None
        for section in priority_sections:
            if not patient_age and sections.get(section):
                patient_age = extract_age(sections[section])
                if patient_age:
                    break
        
        gender = None
        for section in priority_sections:
            if not gender and sections.get(section):
                gender = extract_gender(sections[section])
                if gender:
                    break
        
        modality = None
        for section in ['findings', 'history', 'exam', 'title']:
            if not modality and sections.get(section):
                modality = extract_modality(sections[section])
                if modality:
                    break
        
        body_region = extract_body_region(
            sections.get('diagnosis', ''),
            sections.get('history', ''),
            sections.get('findings', '')
        )
        
        # Process images
        image_paths = fix_image_paths(raw_case.get('Image Paths', []))
        
        # Build clean case
        clean_case = {
            'id': case_id,
            'url': raw_case.get('URL', ''),
            'title': sections['title'],
            'diagnosis': sections['diagnosis'],
            'patient_age': patient_age,
            'gender': gender,
            'modality_guess': modality,
            'body_region': body_region,
            'imageCount': len(image_paths),
            'imagePaths': image_paths[:CONFIG['parsing']['max_images_preview']],
            'caseFolder': raw_case.get('Case Folder', '').replace('\\', '/')
        }
        
        # Add parsed sections
        clean_case.update({k: v for k, v in sections.items() 
                          if k not in ['title', 'diagnosis']})
        
        return clean_case
        
    except Exception as e:
        print(f"Error processing case {index}: {e}")
        return None

print("Main processing function loaded")


Main processing function loaded
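
`process_case` repeats the same pattern three times: try an extractor on each section in priority order and keep the first hit. A minimal sketch of that pattern as a reusable helper follows; `first_extracted` and `toy_age` are hypothetical names and the sample sections are made up.

```python
import re
from typing import Callable, Dict, Optional, Sequence

def first_extracted(extractor: Callable[[str], Optional[str]],
                    sections: Dict[str, str],
                    order: Sequence[str]) -> Optional[str]:
    """Return the first non-None result of extractor over sections, in the given order."""
    for name in order:
        text = sections.get(name)
        if text:  # skip missing or empty sections
            value = extractor(text)
            if value is not None:
                return value
    return None

def toy_age(text: str) -> Optional[str]:
    """Stand-in extractor for the demo: digits followed by 'y/o'."""
    match = re.search(r'(\d{1,3})\s*y/o', text)
    return match.group(1) if match else None

sections = {
    'title': 'Chest pain in a 40 y/o',
    'history': '41 y/o male, smoker',
    'exam': '',
}
age = first_extracted(toy_age, sections, ['history', 'exam', 'title'])
print(age)
```

Because `history` comes first in the priority list, its value wins over the one in `title`, which matches the order used for age and gender in `process_case`.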


### Load and Process Data

In [49]:
data_path = Path(CONFIG['data_paths']['input'])
with open(data_path, 'r', encoding='utf-8') as f:
    raw_cases = json.load(f)

print(f"Total cases loaded: {len(raw_cases)}")

# Process all cases
print("Processing cases...")
processed_cases = []

for idx, raw_case in enumerate(raw_cases):
    clean_case = process_case(raw_case, idx)
    if clean_case:
        processed_cases.append(clean_case)

print(f"Successfully processed {len(processed_cases)} cases!")
print(f"Failed to process {len(raw_cases) - len(processed_cases)} cases")


Total cases loaded: 7432
Processing cases...
Successfully processed 7432 cases!
Failed to process 0 cases


### Analyze Results

In [50]:
df = pd.DataFrame(processed_cases)

print("Processed Data Summary:")
print("=" * 50)
print(f"Total cases: {len(df)}")

print(f"\nMetadata Extraction Results:")
print(f"Cases with age extracted: {df['patient_age'].notna().sum()}")
print(f"Cases with gender extracted: {df['gender'].notna().sum()}")
print(f"Cases with modality guess: {df['modality_guess'].notna().sum()}")
print(f"Cases with body region: {df['body_region'].notna().sum()}")

print(f"\nGender distribution:")
gender_counts = df['gender'].value_counts()
for gender, count in gender_counts.items():
    if gender is not None:
        print(f"  {gender}: {count} cases")
print(f"  Unknown: {df['gender'].isna().sum()} cases")

print(f"\nBody Region distribution:")
body_region_counts = df['body_region'].value_counts()
for region, count in body_region_counts.items():
    if region is not None:
        print(f"  {region}: {count} cases")
print(f"  Unknown: {df['body_region'].isna().sum()} cases")


Processed Data Summary:
==================================================
Total cases: 7432

Metadata Extraction Results:
Cases with age extracted: 4717
Cases with gender extracted: 4920
Cases with modality guess: 3321
Cases with body region: 6337

Gender distribution:
  Male: 2660 cases
  Female: 2260 cases
  Unknown: 2512 cases

Body Region distribution:
  Extremity: 1612 cases
  Head: 1356 cases
  Chest: 1219 cases
  Abdomen: 1178 cases
  Pelvis: 407 cases
  Neck: 299 cases
  Spine: 266 cases
  Unknown: 1095 cases
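
When reading the body-region counts, it helps to keep in mind how the keyword score is computed. The sketch below reruns the scoring idea on a small subset of the configured keywords for a diagnosis text alone; `REGION_KEYWORDS` and `score_regions` are illustrative names, and the real function also looks at the history and findings text, which can change the outcome.

```python
import re

# Small subset of CONFIG['metadata_patterns']['body_regions']
REGION_KEYWORDS = {
    'chest': ['chest', 'lung', 'aortic', 'aorta', 'aneurysm'],
    'abdomen': ['abdominal', 'abdomen', 'liver', 'renal'],
}

def score_regions(text):
    """Count, per region, how many keywords occur as whole words in text."""
    lowered = text.lower()
    scores = {}
    for region, keywords in REGION_KEYWORDS.items():
        hits = sum(
            1 for keyword in keywords
            if re.search(r'\b' + re.escape(keyword) + r'\b', lowered)
        )
        if hits:
            scores[region] = hits
    return scores

scores = score_regions('Abdominal Aortic Aneurysm')
best = max(scores, key=scores.get).title()
print(scores, best)
```

For this text the chest keywords ("aortic", "aneurysm") outnumber the single abdomen keyword ("abdominal"), so the diagnosis on its own would be labelled Chest. Vascular terms listed under `chest` can therefore pull abdominal cases into that bucket.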


### Save Results

In [None]:
output_dir = Path(CONFIG['data_paths']['output_dir'])
output_dir.mkdir(parents=True, exist_ok=True)

# Save full processed cases
output_file = output_dir / CONFIG['data_paths']['output_files']['cleaned']
with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(processed_cases, f, indent=2, ensure_ascii=False)
print(f"Saved {len(processed_cases)} cases to {output_file}")

# Save summary
summary_file = output_dir / CONFIG['data_paths']['output_files']['summary']
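cases_summary = [
    {
        'id': case['id'],
        'url': case['url'],
        # Assumption: the first image path doubles as the gallery thumbnail
        'thumbnail': case['imagePaths'][0] if case['imagePaths'] else None,
        'title': case['title'],
        'diagnosis': case['diagnosis'],
        'imageCount': case['imageCount'],
        'patient_age': case['patient_age'],
        'gender': case['gender'],
        'modality_guess': case['modality_guess'],
        'body_region': case['body_region']
    }
    for case in processed_cases
]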

with open(summary_file, 'w', encoding='utf-8') as f:
    json.dump(cases_summary, f, indent=2, ensure_ascii=False)
print(f"Saved summary to {summary_file}")


Saved 7432 cases to /Users/lk8545/4DT911-Project-In-Visualization-and-Data-Analysis/data/processed/cases_cleaned.json
Saved summary to /Users/lk8545/4DT911-Project-In-Visualization-and-Data-Analysis/data/processed/cases_summary.json
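
A quick way to sanity-check files written this way is to read them back. The sketch below does a round trip in a temporary directory with one made-up record, using the same `json.dump` arguments as the cell above. It shows that `ensure_ascii=False` keeps non-ASCII characters readable in the file and that `None` is stored as JSON `null`.

```python
import json
import tempfile
from pathlib import Path

# One made-up record; real records carry more fields
cases = [{'id': '1', 'title': 'Sjögren syndrome', 'patient_age': None}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'cases_summary.json'
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(cases, f, indent=2, ensure_ascii=False)

    raw_text = path.read_text(encoding='utf-8')
    with open(path, 'r', encoding='utf-8') as f:
        reloaded = json.load(f)

print(raw_text)
print(reloaded == cases)
```

The same read-back check could be pointed at the real output files once they exist.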


### Sample

In [56]:
print("\n Sample Processed Cases:")
print(df[['id', 'diagnosis', 'patient_age', 'gender', 'body_region']].head(10))

# Show specific examples
print(f"\nSpecific Examples:")
aaa_cases = df[df['diagnosis'].str.contains('abdominal aortic aneurysm', case=False, na=False)]
if not aaa_cases.empty:
    case = aaa_cases.iloc[0]
    print(f"\nAbdominal Aortic Aneurysm Case:")
    print(f"  Diagnosis: {case['diagnosis']}")
    print(f"  Body Region: {case['body_region']}")
    print(f"  Age: {case['patient_age']}")
    print(f"  Gender: {case['gender']}")



 Sample Processed Cases:
                     id                                          diagnosis  \
0   8892378009084536600  A Neck And Wrist Pain: Bilateral Carpal Tunnel...   
1    -16278608286148448  AAST Grade IV Renal Laceration Of Ectopic Righ...   
2  -9029866025949687595                                  Abdominal Abscess   
3  -7564063985596765026                                  Abdominal Abscess   
4  -2941954293882397787                          Abdominal Aortic Aneurysm   
5   -433199887417431746                          Abdominal Aortic Aneurysm   
6   -336466849161172575                          Abdominal Aortic Aneurysm   
7   8373696802045254037                          Abdominal Aortic Aneurysm   
8    110238475138700915                          Abdominal Aortic Aneurysm   
9   4756220041667792244                          Abdominal Aortic Aneurysm   

  patient_age  gender body_region  
0          53  Female   Extremity  
1          27    Male     Abdomen  
2      