![electronic_medical_records](electronic_medical_records.png)

Medical professionals often summarize patient encounters in natural-language transcripts that include details about symptoms, diagnoses, and treatments. These transcripts can be reused for other medical documentation, such as insurance claims, but because they are densely packed with medical information, extracting the key data accurately can be challenging.

You and your team at Lakeside Healthcare Network have decided to use the OpenAI API to extract medical information from these transcripts automatically and to match each case with the appropriate ICD-10 code. ICD-10 is a standardized coding system used worldwide for diagnosis and billing purposes, such as insurance claims processing.

## The Data
The dataset contains anonymized medical transcriptions categorized by specialty.

## transcriptions.csv
| Column     | Description              |
|------------|--------------------------|
| `"medical_specialty"` | The medical specialty associated with each transcription.  |
| `"transcription"` | Detailed medical transcription texts, with insights into the medical case. |
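
Before any rows are sent to the API, it can help to confirm that the file really has the two documented columns and to drop rows whose transcription is empty. The following is a minimal sketch under that assumption; the helper name `validate_transcriptions` and the sample rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

REQUIRED_COLUMNS = ["medical_specialty", "transcription"]


def validate_transcriptions(df: pd.DataFrame) -> pd.DataFrame:
    """Check the documented columns exist and drop rows without transcription text."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    # Treat NaN and whitespace-only strings as empty transcriptions.
    has_text = df["transcription"].fillna("").astype(str).str.strip() != ""
    return df[has_text].reset_index(drop=True)


# Hypothetical rows for illustration only
sample = pd.DataFrame({
    "medical_specialty": ["Urology", "Orthopedic", "Bariatrics"],
    "transcription": ["CHIEF COMPLAINT:, Example.", None, "   "],
})
clean = validate_transcriptions(sample)
print(len(clean))
```

Only the first sample row survives, because the other two have no usable transcription text.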

In [7]:
# Import the necessary libraries
import pandas as pd
from openai import OpenAI
import json

In [None]:
# Load the data
df = pd.read_csv("data/transcriptions.csv")
df.head() # Display the first few rows of the dataframe to get the overview and structure of the data

Unnamed: 0,medical_specialty,transcription
0,Allergy / Immunology,"SUBJECTIVE:, This 23-year-old white female pr..."
1,Orthopedic,"CHIEF COMPLAINT:, Achilles ruptured tendon.,H..."
2,Bariatrics,"PREOPERATIVE DIAGNOSIS: , Morbid obesity.,POST..."
3,Cardiovascular / Pulmonary,"PREOPERATIVE DIAGNOSES,Airway obstruction seco..."
4,Urology,"CHIEF COMPLAINT:, Urinary retention.,HISTORY ..."
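
The `Unnamed: 0` column in the preview is most likely the row index that was written out when the CSV was saved; pandas assigns that name to a leading column with an empty header. A minimal sketch, assuming that is the case here, of reading such a file with `index_col=0` so the saved index is not loaded as a data column. The two-row CSV below is an inline stand-in with hypothetical rows, not the real dataset.

```python
import io
import pandas as pd

# Inline stand-in for data/transcriptions.csv (hypothetical rows, same layout:
# an unnamed leading index column followed by the two documented columns).
csv_text = (
    ",medical_specialty,transcription\n"
    '0,Orthopedic,"CHIEF COMPLAINT:, Example text."\n'
    '1,Urology,"CHIEF COMPLAINT:, Another example."\n'
)

# Default read: the header-less first column is loaded as "Unnamed: 0".
df_default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column becomes the index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

The rest of the notebook only reads `medical_specialty` and `transcription`, so either form of the DataFrame works with the cells that follow.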


In [None]:
# Import the remaining libraries used in this cell
import json
import time
import pandas as pd
from typing import Optional, Dict

# Initialize the OpenAI client. OpenAI() reads the API key from the
# OPENAI_API_KEY environment variable, so set it in your shell beforehand:
#   export OPENAI_API_KEY='your_api_key_here'
client = OpenAI()

# Define the system prompt that guides the model's response.
# The JSON keys listed here must match the keys read in create_structured_data.
system_prompt = """
You are a medical information extraction specialist.
Extract the following information from the medical transcript:
1. Patient age (extract as an integer; look for phrases like 'years old' or 'year-old')
2. Medical specialty (use the provided specialty)
3. Recommended treatment (extract the main recommended treatment, procedure, or medication)
4. Primary diagnosis or condition (the main issue being addressed)

Return the information as JSON using the following exact keys:
- age
- medical_specialty
- recommended_treatment
- primary_diagnosis

Be precise and extract only what is provided. If a value is missing,
return null for it; do not assume.
"""
def extract_medical_info(transcription: str, specialty: str) -> Optional[Dict]:
    """
    Extract medical information from a transcription using the OpenAI API.
    """
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # Using 3.5-turbo model for cost efficiency
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"medical specialty: {specialty}\n\ntranscription: {transcription}"}
            ],
            temperature=0.1,
            max_tokens=500,
            response_format={"type": "json_object"}
        )
        
        # Parse the JSON response
        result = response.choices[0].message.content.strip()
        return json.loads(result)
        
    except Exception as e:
        print(f"Error processing transcription: {e}")
        return None

# Enhanced ICD-10 code mapping function
def get_icd10_code(diagnosis: str, specialty: str) -> str:
    """
    Map diagnosis to ICD-10 code with specialty-specific mappings
    """
    # Treat None, NaN, non-string, and blank values as an unknown diagnosis
    if not isinstance(diagnosis, str) or not diagnosis.strip():
        return "R69"  # Unknown diagnosis
    
    diagnosis_lower = diagnosis.lower()
    
    # Specialty-specific ICD-10 mappings
    icd_mappings = {
        "Orthopedic": {
            "fracture": "S52.501A", "sprain": "S93.409A", "arthritis": "M19.90",
            "back pain": "M54.9", "tendonitis": "M65.9", "rupture": "S86.009A"
        },
        "Bariatrics": {
            "obesity": "E66.9", "morbid obesity": "E66.01", "weight": "E66.9",
            "bariatric": "E66.9", "overweight": "E66.3"
        },
        "Cardiovascular / Pulmonary": {
            "hypertension": "I10", "heart failure": "I50.9", "pneumonia": "J18.9",
            "asthma": "J45.909", "copd": "J44.9", "chest pain": "R07.9"
        },
        "Urology": {
            "uti": "N39.0", "infection": "N39.0", "retention": "R33.9",
            "kidney stone": "N20.0", "prostate": "N40", "incontinence": "N39.498"
        },
        "Allergy / Immunology": {
            "allergy": "T78.40XA", "asthma": "J45.909", "anaphylaxis": "T78.2XXA",
            "rhinitis": "J30.9", "immune deficiency": "D84.9"
        }
    }
    
    # Check specialty-specific mappings first
    if specialty in icd_mappings:
        for condition, code in icd_mappings[specialty].items():
            if condition in diagnosis_lower:
                return code
    
    # General mappings as fallback
    general_mapping = {
        "diabetes": "E11.9", "depression": "F32.9", "anxiety": "F41.9",
        "migraine": "G43.909", "headache": "R51.9", "pain": "R52.9"
    }
    
    for condition, code in general_mapping.items():
        if condition in diagnosis_lower:
            return code
    
    return "R69"  # Unknown diagnosis

# Process transcriptions with robust error handling
def create_structured_data(df: pd.DataFrame, sample_size: Optional[int] = None) -> pd.DataFrame:
    """
    Create structured medical data from transcriptions
    """
    if sample_size and sample_size < len(df):
        df_processed = df.head(sample_size).copy()
    else:
        df_processed = df.copy()
    
    structured_data = []
    
    for idx, row in df_processed.iterrows():
        print(f"Processing row {idx + 1}/{len(df_processed)} - Specialty: {row['medical_specialty']}")
        
        # Extract information using OpenAI API
        extracted_info = extract_medical_info(row['transcription'], row['medical_specialty'])
        
        if extracted_info:
            # Get ICD-10 code
            icd_code = get_icd10_code(extracted_info.get('primary_diagnosis'), row['medical_specialty'])
            
            result = {
                "age": extracted_info.get('age'),
                "medical_specialty": extracted_info.get('medical_specialty', row['medical_specialty']),
                "recommended_treatment": extracted_info.get('recommended_treatment'),
                "primary_diagnosis": extracted_info.get('primary_diagnosis'),
                "icd_code": icd_code,
                "original_specialty": row['medical_specialty']
            }
        else:
            # fallback if api call fails
            result = {
                "age": None,
                "medical_specialty": row['medical_specialty'],
                "recommended_treatment": None,
                "primary_diagnosis": None,
                "icd_code": "R69",
                "original_specialty": row['medical_specialty']
            }
        
        structured_data.append(result)
        
        # Add delay to avoid rate limiting
        time.sleep(1.5)
    
    # Create the final DataFrame
    df_structured = pd.DataFrame(structured_data)
    return df_structured

# process the data -- start with a small sample for testing
print("Starting data extraction...")
df_structured = create_structured_data(df, sample_size=5)  # adjust the sample_size as needed

# display results
print("\nExtraction completed!")
print(f"Structured data shape: {df_structured.shape}")
print("\nFirst few rows of structured data:")
print(df_structured.head())

# Save to CSV
df_structured.to_csv('structured_medical_data.csv', index=False)
print("\nStructured data saved to 'structured_medical_data.csv'")

# Display some summary statistics
print("\nSummary Statistics:")
print(f"Total records processed: {len(df_structured)}")
print(f"Records with age extracted: {df_structured['age'].notna().sum()}")
print(f"Records with treatment extracted: {df_structured['recommended_treatment'].notna().sum()}")
print(f"Records with diagnosis extracted: {df_structured['primary_diagnosis'].notna().sum()}")
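
One caveat with the substring lookup in `get_icd10_code`: it returns the code for the first key that matches in dictionary order. In the `Bariatrics` mapping, a diagnosis containing "morbid obesity" therefore matches the shorter key "obesity" first and receives `E66.9` rather than `E66.01`, and "overweight" is shadowed by "weight" in the same way. A minimal sketch of one possible remedy, checking longer keys before shorter ones; the helper name `match_longest_key` is hypothetical, and the mapping is copied from the cell above.

```python
from typing import Dict, Optional


def match_longest_key(diagnosis: str, mapping: Dict[str, str]) -> Optional[str]:
    """Return the code of the longest mapping key found in the diagnosis text."""
    text = diagnosis.lower()
    # Sort keys by length, longest first, so "morbid obesity" wins over "obesity".
    for condition in sorted(mapping, key=len, reverse=True):
        if condition in text:
            return mapping[condition]
    return None  # No key matched; the caller decides on a fallback


# Bariatrics mapping copied from get_icd10_code
bariatrics = {
    "obesity": "E66.9", "morbid obesity": "E66.01", "weight": "E66.9",
    "bariatric": "E66.9", "overweight": "E66.3",
}

print(match_longest_key("Morbid obesity", bariatrics))
print(match_longest_key("Obesity", bariatrics))
```

Sorting by key length is a simple heuristic, not a substitute for a validated ICD-10 lookup; it only ensures that a more specific phrase is preferred over a substring of it.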

Starting data extraction...
Processing row 1/5 - Specialty: Allergy / Immunology
Processing row 2/5 - Specialty: Orthopedic
Processing row 3/5 - Specialty: Bariatrics
Processing row 4/5 - Specialty: Cardiovascular / Pulmonary
Processing row 5/5 - Specialty: Urology

Extraction completed!
Structured data shape: (5, 6)

First few rows of structured data:
   age           medical_specialty  ...  icd_code          original_specialty
0   23        Allergy / Immunology  ...     J30.9        Allergy / Immunology
1   41                  Orthopedic  ...  S86.009A                  Orthopedic
2   30                  Bariatrics  ...     E66.9                  Bariatrics
3   50  Cardiovascular / Pulmonary  ...       R69  Cardiovascular / Pulmonary
4   66                     Urology  ...       R69                     Urology

[5 rows x 6 columns]

Structured data saved to 'structured_medical_data.csv'

Summary Statistics:
Total records processed: 5
Records with age extracted: 5
Records with treatmen