# Medical Report Data Preprocessing

This notebook preprocesses medical report data for the LabLens project. It loads the data, cleans the report text, and prepares features for the NLP models that will be used for medical report simplification and translation.

## 1. Loading Required Libraries and Data Files

First, we'll import all necessary libraries and load our CSV files.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
import re
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Read the CSV files
discharge_labs = pd.read_csv('mimic_discharge_labs.csv')
demographics = pd.read_csv('mimic_complete_with_demographics.csv')

# Display basic information about the datasets
print("Discharge Labs Dataset Info:")
print(discharge_labs.info())
print("\nFirst few rows of Discharge Labs:")
print(discharge_labs.head())

print("\nDemographics Dataset Info:")
print(demographics.info())
print("\nFirst few rows of Demographics:")
print(demographics.head())
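
The later cells rely on a few specific columns: `subject_id`, `hadm_id`, and `cleaned_text` in the discharge labs table, and `subject_id`, `hadm_id`, `age`, and `gender` in the demographics table. A minimal sketch of an early check is shown below. The helper name `missing_columns` is hypothetical (it is not part of pandas), and a toy DataFrame stands in for the real data so the example runs without the CSV files.

```python
import pandas as pd

def missing_columns(df, required):
    """Return the required column names that are absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]

# Toy stand-in for discharge_labs; the real table comes from pd.read_csv
toy_labs = pd.DataFrame({'subject_id': [1], 'hadm_id': [10]})

absent = missing_columns(toy_labs, ['subject_id', 'hadm_id', 'cleaned_text'])
print(absent)  # ['cleaned_text']
```

Running the same check on the real tables right after loading would surface a missing column before the text-cleaning or merge cells fail with a `KeyError`.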

## 2. Data Exploration and Missing Values Analysis

Let's analyze the data quality and identify missing values in both datasets.

In [None]:
# Check for missing values in both datasets
print("Missing values in Discharge Labs Dataset:")
print(discharge_labs.isnull().sum())
print("\nMissing values in Demographics Dataset:")
print(demographics.isnull().sum())

# Get basic statistics for numerical columns
print("\nBasic statistics for Discharge Labs Dataset:")
print(discharge_labs.describe())
print("\nBasic statistics for Demographics Dataset:")
print(demographics.describe())

# Show the five most frequent values in each text (object) column
for column in discharge_labs.select_dtypes(include=['object']).columns:
    print(f"\nUnique values in {column} (Discharge Labs):")
    print(discharge_labs[column].value_counts().head())

for column in demographics.select_dtypes(include=['object']).columns:
    print(f"\nUnique values in {column} (Demographics):")
    print(demographics[column].value_counts().head())
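
Raw missing-value counts are easier to compare across tables when expressed as percentages. The sketch below shows one possible way to do that; `missing_report` is a hypothetical helper and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Summarize missing values per column as counts and percentages (hypothetical helper)."""
    report = pd.DataFrame({
        'missing_count': df.isnull().sum(),
        'missing_pct': (df.isnull().mean() * 100).round(1),
    })
    # Keep only columns with at least one missing value, worst first
    return report[report['missing_count'] > 0].sort_values('missing_pct', ascending=False)

# Made-up values standing in for a real table
toy = pd.DataFrame({
    'age': [65, np.nan, 72, 58],
    'gender': ['M', 'F', None, None],
    'hadm_id': [1, 2, 3, 4],
})
report = missing_report(toy)
print(report)
```

Columns without missing values (here `hadm_id`) are left out of the report, which keeps the output short for wide tables.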

## 3. Text Data Preprocessing

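Now we'll clean the text data and extract structured values from it by:
1. Removing special characters while keeping the punctuation needed for lab values
2. Collapsing repeated whitespace
3. Extracting lab name/value pairs (e.g., "WBC-9.1") with a regular expression
4. Storing the extracted pairs as a dictionary for each report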

In [None]:
# Function to clean and preprocess text
def clean_text(text):
    """Return a cleaned string; missing values become an empty string."""
    if pd.isna(text):
        return ""

    # Convert to string if not already
    text = str(text)

    # Remove special characters but keep necessary punctuation
    text = re.sub(r'[^a-zA-Z0-9\s\.,:\-/]', '', text)

    # Collapse whitespace after the removal, so deleted characters do not leave double spaces
    text = re.sub(r'\s+', ' ', text)

    # Trim leading and trailing whitespace
    return text.strip()

# Apply text cleaning to the cleaned_text column in discharge_labs
discharge_labs['cleaned_text'] = discharge_labs['cleaned_text'].apply(clean_text)

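# Function to extract lab values
def extract_lab_values(text):
    # Pattern to match common lab value formats (e.g., "WBC-9.1", "HGB-12.3").
    # Digits are allowed after the first letter so that names like "CO2" match.
    lab_pattern = r'([A-Za-z][A-Za-z0-9]*)-(\d+\.?\d*)'
    matches = re.findall(lab_pattern, text)
    
    # Convert matches to a dictionary keyed by upper-case lab name, so keys match
    # the upper-case names in common_labs; a repeated lab keeps its last value
    lab_values = {lab.upper(): float(value) for lab, value in matches}
    return lab_values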

# Extract lab values and create new columns
discharge_labs['lab_values'] = discharge_labs['cleaned_text'].apply(extract_lab_values)

# Display sample of processed text
print("Sample of processed text:")
print(discharge_labs['cleaned_text'].head())
print("\nSample of extracted lab values:")
print(discharge_labs['lab_values'].head())
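
To see how the name part of the pattern affects what is extracted, the sketch below compares two patterns on a made-up report line: one that allows only letters before the hyphen, and one that also allows digits and upper-cases the name. The sample string is an assumption for illustration, not data from the CSV files.

```python
import re

# Made-up report fragment
sample = "WBC-9.1 Hgb-12.3 CO2-24 Glucose-105"

# Letters only before the hyphen
letters_only = r'([A-Za-z]+)-(\d+\.?\d*)'
# Letters followed by optional digits, so names such as "CO2" can match
with_digits = r'([A-Za-z][A-Za-z0-9]*)-(\d+\.?\d*)'

strict = {lab: float(v) for lab, v in re.findall(letters_only, sample)}
relaxed = {lab.upper(): float(v) for lab, v in re.findall(with_digits, sample)}

print(strict)   # {'WBC': 9.1, 'Hgb': 12.3, 'Glucose': 105.0}
print(relaxed)  # {'WBC': 9.1, 'HGB': 12.3, 'CO2': 24.0, 'GLUCOSE': 105.0}
```

The letters-only pattern misses `CO2`, and without case normalization keys such as `Hgb` would not match an upper-case lookup list like `['WBC', 'HGB', ...]`.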

## 4. Feature Engineering and Data Transformation

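We'll now:
1. Create one feature column per common lab from the extracted values
2. Standardize the numerical lab values
3. Add text-length features (word and sentence counts)

The categorical variables (`age_group`, `gender`) only become available after the merge in Section 5, and this notebook saves them unencoded.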

In [None]:
# Create features from lab values
common_labs = ['WBC', 'HGB', 'HCT', 'PLT', 'NA', 'K', 'CL', 'CO2', 'BUN', 'CREAT', 'GLUCOSE']

# Create one column per lab; reports without a value for that lab get NaN
for lab in common_labs:
    discharge_labs[f'lab_{lab}'] = discharge_labs['lab_values'].apply(
        lambda lab_dict, lab=lab: lab_dict.get(lab, np.nan)
    )

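# Standardize numerical features. StandardScaler ignores NaN when fitting and keeps
# NaN in the output, so missing labs do not distort the mean and variance.
scaler = StandardScaler()
lab_columns = [f'lab_{lab}' for lab in common_labs]
discharge_labs[lab_columns] = scaler.fit_transform(discharge_labs[lab_columns])
# Fill missing labs with 0, which is the column mean after standardization
discharge_labs[lab_columns] = discharge_labs[lab_columns].fillna(0)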

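# Create text length features
discharge_labs['text_word_count'] = discharge_labs['cleaned_text'].str.split().str.len()
# Approximate sentence count: terminal punctuation followed by whitespace or end of text,
# so decimal points inside numbers (e.g., "9.1") are not counted as sentence ends
discharge_labs['text_sentence_count'] = discharge_labs['cleaned_text'].str.count(r'[.!?]+(?:\s|$)')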

# Display the transformed data
print("Sample of transformed data:")
print(discharge_labs[lab_columns + ['text_word_count', 'text_sentence_count']].head())
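
The order of imputation and scaling changes the result. The sketch below uses made-up numbers to compare filling missing values with 0 before scaling against scaling first and filling afterwards. It relies on the documented behavior of scikit-learn's `StandardScaler`, which disregards NaN when fitting and keeps it in the transformed output.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy lab column with one missing value (made-up numbers)
toy = pd.DataFrame({'lab_WBC': [6.0, 8.0, 10.0, np.nan]})

# Option A: fill with 0 first, so the 0 is treated as a real measurement
filled_first = StandardScaler().fit_transform(toy.fillna(0))

# Option B: scale with NaN in place, then fill the gaps with 0,
# which is the column mean after standardization
scaled = StandardScaler().fit_transform(toy)
filled_after = np.nan_to_num(scaled, nan=0.0)

print(filled_first.ravel())
print(filled_after.ravel())
```

In option A the artificial 0 pulls the mean down to 6, so the true value 6.0 is scaled to 0 and the missing entry becomes a strong negative outlier. In option B the mean stays at 8 and the missing entry sits at the center of the distribution.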

## 5. Merging Datasets and Final Preparation

Now we'll merge the processed discharge labs data with demographics and prepare the final dataset for the model.

In [None]:
# Merge datasets on subject_id and hadm_id
final_dataset = pd.merge(discharge_labs, demographics, 
                        on=['subject_id', 'hadm_id'], 
                        how='inner')

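# Create age groups; include_lowest keeps age 0 and the open-ended last bin keeps ages above 100
final_dataset['age_group'] = pd.cut(final_dataset['age'], 
                                  bins=[0, 18, 30, 50, 70, np.inf],
                                  labels=['0-18', '19-30', '31-50', '51-70', '71+'],
                                  include_lowest=True)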

# Prepare final features for the model
model_features = lab_columns + ['text_word_count', 'text_sentence_count', 'age_group', 'gender']

# Create final preprocessed dataset (copy so later edits do not touch final_dataset)
final_preprocessed = final_dataset[model_features + ['cleaned_text', 'abnormal_count']].copy()

# Save the preprocessed data
final_preprocessed.to_csv('preprocessed_medical_data.csv', index=False)

print("Final dataset shape:", final_preprocessed.shape)
print("\nSample of final preprocessed data:")
print(final_preprocessed.head())

# Display summary statistics of the final dataset
print("\nSummary statistics of final dataset:")
print(final_preprocessed.describe())
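
An inner merge silently drops rows whose `subject_id`/`hadm_id` pair appears in only one table. The sketch below, built on made-up toy tables, shows how the `indicator` option of `pd.merge` can reveal how many rows would be lost before committing to the inner join.

```python
import pandas as pd

# Toy stand-ins for the two tables (made-up IDs)
labs = pd.DataFrame({'subject_id': [1, 2, 3], 'hadm_id': [10, 20, 30]})
demo = pd.DataFrame({'subject_id': [1, 2, 4], 'hadm_id': [10, 20, 40],
                     'age': [65, 47, 80]})

# An outer merge with indicator=True labels each row by where its key was found
check = pd.merge(labs, demo, on=['subject_id', 'hadm_id'],
                 how='outer', indicator=True)
counts = check['_merge'].value_counts()
print(counts)
```

Rows labeled `left_only` or `right_only` are the ones an inner merge would discard; only the `both` rows survive.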

## Summary of Preprocessing Steps

The preprocessing pipeline has:
1. Loaded and cleaned the discharge labs and demographics data
2. Extracted and normalized lab values
3. Created additional features from text data
4. Merged datasets and created age groups
5. Prepared final dataset for model input

The preprocessed data has been saved to 'preprocessed_medical_data.csv' and is ready for the next phase of model development.
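
If the next phase needs purely numeric input, the `age_group` and `gender` columns would still have to be encoded. One possible approach, shown here on a made-up frame rather than the saved CSV, is one-hot encoding with `pd.get_dummies`. This is a sketch of one option, not a step the pipeline above performs.

```python
import pandas as pd

# Made-up rows with the same categorical columns as the final dataset
toy = pd.DataFrame({
    'age_group': ['31-50', '51-70', '31-50'],
    'gender': ['F', 'M', 'F'],
    'text_word_count': [120, 340, 95],
})

# One indicator column per category; other columns pass through unchanged
encoded = pd.get_dummies(toy, columns=['age_group', 'gender'], dtype=int)
print(encoded.columns.tolist())
```

Note that a CSV round trip stores `age_group` as plain strings, so any category ordering would need to be re-established after reloading.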