# Standardized EDA: Clinical NLP Analysis

This notebook performs a comprehensive **Exploratory Data Analysis (EDA)** on the Clinical Notes Dataset.
**Goal**: Analyze unstructured text properties to inform NLP model selection.

## Objectives
1. **Inspection**: Load text data and review structure.
2. **Cleaning**: Normalize text and handle nulls.
3. **Analysis**: Distribution of note lengths and entity frequencies.
4. **Modeling**: Abstractive summarization demo.


In [None]:
# Install dependencies if missing
!pip install datasets transformers

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datasets import load_dataset
from transformers import pipeline

# Visualization Settings
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

def add_annotations(ax):
    """Add integer value labels just above each bar in a bar plot."""
    for p in ax.patches:
        if p.get_height() > 0:
            # va='bottom' keeps the label above the bar instead of overlapping its top edge
            ax.annotate(f'{int(p.get_height())}',
                        (p.get_x() + p.get_width() / 2., p.get_height()),
                        ha='center', va='bottom',
                        fontsize=10, color='black', xytext=(0, 5),
                        textcoords='offset points')


## 1. Data Loading & Inspection

In [None]:
print("Loading dataset...")
try:
    # Using Hugging Face dataset as primary source
    dataset = load_dataset("AGBonnet/augmented-clinical-notes", split="train", streaming=True)
    data_head = list(dataset.take(200)) # Take 200 samples for EDA
    df = pd.DataFrame(data_head)
    print(f"Loaded {len(df)} records.")
    print("Columns:", df.columns.tolist())
    display(df.head(2))
except Exception as e:
    print(f"Error loading dataset: {e}")
    # Fallback mock data (90 short placeholder notes) so the later cells still run
    print("Falling back to mock data.")
    df = pd.DataFrame({'note': ['Patient presented with...', 'Follow up visit...', 'Emergency admission...']*30})


## 2. Column Renaming & Cleaning

In [None]:
if not df.empty:
    # 2.1 Snake Case
    df.columns = [c.lower().replace(' ', '_') for c in df.columns]
    print("Renamed Columns:", df.columns.tolist())
    
    # 2.2 Identify Text Column
    # Heuristic: look for 'note', 'text', 'transcription'
    text_col = next((c for c in df.columns if c in ['note', 'transcription', 'text']), None)
    if not text_col:
        text_col = df.columns[0]
    print(f"Target Text Column: '{text_col}'")

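    # 2.3 Handle Missing
    df = df.dropna(subset=[text_col]).reset_index(drop=True)
else:
    # Later cells rely on `text_col`, so stop here instead of failing with a NameError
    raise ValueError("DataFrame is empty; nothing to analyze.")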
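
The objectives list text normalization, but the cleaning cell above only renames columns and drops nulls. A minimal sketch of what a light normalization step could look like follows; `normalize_note` is a hypothetical helper, not part of the original notebook, and the choice of NFKC folding plus whitespace collapsing is an assumption about what "normalize" is meant to cover.

```python
import re
import unicodedata


def normalize_note(text):
    """Hypothetical helper: light normalization for a single clinical note."""
    # Coerce non-string values (e.g. numbers) to text
    text = str(text)
    # NFKC folds compatibility characters such as non-breaking spaces
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace (newlines, tabs, repeated spaces) into one space
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(normalize_note("  Patient\u00a0presented\n\nwith   chest pain. "))
# prints: Patient presented with chest pain.
```

In the notebook this could be applied with `df[text_col] = df[text_col].apply(normalize_note)` before the length features are computed. Case and punctuation are deliberately left untouched here, since they can carry meaning in clinical text.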

## 3. Feature Engineering (Text Properties)

In [None]:
# Calculate basic text metrics
text = df[text_col].astype(str)
df['char_count'] = text.str.len()
df['word_count'] = text.str.split().str.len()
# Rough sentence count: runs of terminal punctuation (., !, ?).
# Abbreviations and decimals (e.g. "Dr.", "2.5 mg") inflate this estimate.
df['sentence_count'] = text.str.count(r'[.!?]+')
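
With a word count per note available, unusually long or short notes can also be flagged numerically. The sketch below applies the 1.5 × IQR rule commonly used for boxplot whiskers; `iqr_bounds` is a hypothetical helper and the word counts are made-up illustrative values, not figures from the dataset.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences using the k * IQR rule."""
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


word_counts = [120, 135, 150, 160, 170, 180, 200, 950]  # illustrative values only
lower, upper = iqr_bounds(word_counts)
outliers = [w for w in word_counts if w < lower or w > upper]
print(outliers)  # prints: [950]
```

In the notebook the same function could be called on `df['word_count'].tolist()`.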

## 4. Univariate Analysis

In [None]:
# 4.1 Word Count Distribution
plt.figure(figsize=(10, 5))
sns.histplot(df['word_count'], bins=30, kde=True, color='purple')
plt.title('Distribution of Note Length (Words)')
plt.xlabel('Word Count')
plt.show()

# 4.2 Boxplot for Outliers
plt.figure(figsize=(10, 2))
sns.boxplot(x=df['word_count'], color='lavender')
plt.title('Word Count Boxplot')
plt.show()
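
The objectives mention entity frequencies, which the cells above do not compute. True entity extraction needs a clinical NER model; as a crude stand-in, the sketch below counts the most frequent non-stopword terms. `top_terms`, the tiny stopword list, and the sample notes are all hypothetical and only for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real analysis would use a fuller one
STOPWORDS = {"the", "a", "an", "and", "of", "with", "to", "for", "in", "was", "is"}


def top_terms(notes, n=10):
    """Count lowercase alphabetic tokens across notes, skipping stopwords."""
    counts = Counter()
    for note in notes:
        tokens = re.findall(r"[a-z]+", str(note).lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)


notes = [
    "Patient presented with chest pain.",
    "Follow up visit for chest pain and fatigue.",
]
print(top_terms(notes, n=2))
```

In the notebook, `top_terms(df[text_col])` would give a first look at the dominant vocabulary. Term counts are only a proxy: they do not group multi-word entities or distinguish a symptom from a negated mention of it.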

## 5. NLP Modeling: Abstractive Summarization

In [None]:
# Load Model
try:
    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
    
    # Summarize Sample
    sample_text = str(df[text_col].iloc[0])  # Cast to str, matching the feature cells
    if len(sample_text.split()) > 200: # Truncate to the first 200 words for demo speed
        sample_text = " ".join(sample_text.split()[:200])
        
    summary = summarizer(sample_text, max_length=60, min_length=10, do_sample=False)
    
    print("--- ORIGINAL ---")
    print(sample_text)
    print("\n--- SUMMARY ---")
    print(summary[0]['summary_text'])
    
except Exception as e:
    print(f"Summarization skipped: {e}")

## 6. Insights

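*   **Length**: Note length varies across the sample; the word-count histogram and boxplot in Section 4 show how wide the spread is and whether there is a long tail of verbose notes.
*   **Processing**: Notes longer than the summarizer's input limit (1,024 tokens for BART-based models such as DistilBART) need truncation, chunking, or a long-context model (like Longformer) for full note coverage.
*   **Compression**: The DistilBART demo condenses the first 200 words of one note into at most 60 tokens. This is a single qualitative example, not an evaluation of summary quality.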
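
One lightweight alternative to a long-context model is to split a long note into overlapping word windows and summarize each window separately. A minimal sketch, assuming a 200-word window to match the demo's truncation length; `chunk_words` and the 20-word overlap are hypothetical choices, not something the notebook evaluates.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into overlapping windows of at most chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = str(text).split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the note
    return chunks


demo = " ".join(f"w{i}" for i in range(450))
print([len(c.split()) for c in chunk_words(demo)])  # prints: [200, 200, 90]
```

Each chunk could then be passed to `summarizer` in turn and the partial summaries concatenated. Whether that preserves clinically important details across chunk boundaries would need to be checked on real notes.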