# Symptom Classifier Training - Clinical-Grade AI Doctor

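This notebook fine-tunes Bio_ClinicalBERT to classify six medical conditions from short symptom descriptions. The descriptions are generated synthetically from each record's condition in Step 3, so the results measure how well the model recovers those templates rather than performance on real clinical text.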

**Dataset**: Healthcare CSV with 55,500 patient records

**Conditions**: Arthritis, Diabetes, Hypertension, Obesity, Cancer, Asthma

**Runtime**: ~30-45 minutes with GPU

## Step 1: Setup Environment

In [None]:
# Install required packages
!pip install -q transformers datasets torch scikit-learn pandas numpy matplotlib seaborn

In [None]:
# Check GPU availability
import torch
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## Step 2: Upload Your Dataset

**Instructions**:
1. Click the folder icon on the left sidebar
2. Click "Upload to session storage"
3. Upload `healthcare_dataset.csv` from `/Users/anaslari/Desktop/doctor_online/datasets/`

In [None]:
# Verify file upload
import os
if os.path.exists('healthcare_dataset.csv'):
    print("✅ Dataset uploaded successfully!")
else:
    print("❌ Please upload healthcare_dataset.csv")
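
Beyond confirming that the file exists, it can help to check that the columns the later cells read are actually present. The following is a minimal standard-library sketch; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced here, and the required set is simply the three columns that Step 3 accesses.

```python
import csv

# Columns that the preprocessing cell in Step 3 reads from each row
REQUIRED_COLUMNS = {"Medical Condition", "Age", "Gender"}

def missing_columns(csv_path, required=REQUIRED_COLUMNS):
    """Return the sorted list of required columns absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])  # empty file -> empty header
    return sorted(set(required) - {name.strip() for name in header})
```

Calling `missing_columns('healthcare_dataset.csv')` should return an empty list when the header is complete.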

## Step 3: Load and Preprocess Data

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load dataset
df = pd.read_csv('healthcare_dataset.csv')
print(f"Loaded {len(df)} records")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nMedical Conditions:\n{df['Medical Condition'].value_counts()}")

In [None]:
# Create synthetic symptom descriptions from medical conditions
# In a real scenario, you'd have actual symptom text
symptom_templates = {
    'Arthritis': ['joint pain', 'stiffness', 'swelling in joints', 'reduced range of motion'],
    'Diabetes': ['increased thirst', 'frequent urination', 'fatigue', 'blurred vision'],
    'Hypertension': ['headaches', 'shortness of breath', 'nosebleeds', 'chest pain'],
    'Obesity': ['excessive weight gain', 'difficulty breathing', 'joint pain', 'fatigue'],
    'Cancer': ['unexplained weight loss', 'fatigue', 'pain', 'skin changes'],
    'Asthma': ['shortness of breath', 'chest tightness', 'wheezing', 'coughing']
}

# Fixed-seed generator so the synthetic texts are reproducible across runs
rng = np.random.default_rng(42)

def create_symptom_text(row):
    condition = row['Medical Condition']
    age = row['Age']
    gender = row['Gender']
    
    # Sample up to 3 distinct symptoms for this condition
    symptoms = symptom_templates.get(condition, ['general discomfort'])
    symptom_list = rng.choice(symptoms, size=min(3, len(symptoms)), replace=False)
    
    # Create natural language description
    text = f"Patient is a {age} year old {gender.lower()} presenting with {', '.join(symptom_list)}."
    return text

# Create symptom descriptions
df['symptom_text'] = df.apply(create_symptom_text, axis=1)
print("\nExample symptom texts:")
print(df[['symptom_text', 'Medical Condition']].head())
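
A few symptoms appear under more than one condition, which affects how hard the synthetic task really is. The sketch below copies the templates from the cell above and lists the shared symptoms, then works out the minimum number of condition-specific symptoms any generated text must contain, given that each text draws 3 of the 4 template entries. The names `shared_symptoms` and `min_unique_per_text` are introduced here for illustration.

```python
from collections import Counter

# Templates copied from the preprocessing cell
symptom_templates = {
    'Arthritis': ['joint pain', 'stiffness', 'swelling in joints', 'reduced range of motion'],
    'Diabetes': ['increased thirst', 'frequent urination', 'fatigue', 'blurred vision'],
    'Hypertension': ['headaches', 'shortness of breath', 'nosebleeds', 'chest pain'],
    'Obesity': ['excessive weight gain', 'difficulty breathing', 'joint pain', 'fatigue'],
    'Cancer': ['unexplained weight loss', 'fatigue', 'pain', 'skin changes'],
    'Asthma': ['shortness of breath', 'chest tightness', 'wheezing', 'coughing'],
}

# Count how many conditions list each symptom
counts = Counter(s for symptoms in symptom_templates.values() for s in symptoms)
shared_symptoms = sorted(s for s, n in counts.items() if n > 1)
print(shared_symptoms)  # ['fatigue', 'joint pain', 'shortness of breath']

# A text draws 3 symptoms without replacement, so in the worst case it
# contains every shared symptom from its list plus (3 - shared) unique ones
min_unique_per_text = {
    cond: max(0, 3 - sum(counts[s] > 1 for s in symptoms))
    for cond, symptoms in symptom_templates.items()
}
print(min_unique_per_text)
```

Every condition comes out with at least one guaranteed condition-specific symptom per text, so high scores on this data mostly indicate that the model has learned the templates.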

In [None]:
# Encode labels
label_encoder = LabelEncoder()
df['label'] = label_encoder.fit_transform(df['Medical Condition'])

# Split data
train_df, temp_df = train_test_split(df, test_size=0.3, random_state=42, stratify=df['label'])
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42, stratify=temp_df['label'])

print(f"Train: {len(train_df)} | Val: {len(val_df)} | Test: {len(test_df)}")
print(f"\nClass distribution in train set:\n{train_df['Medical Condition'].value_counts()}")
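
The two calls above produce a 70/15/15 split. As a quick sanity check, the expected sizes can be worked out by hand. This sketch assumes the file really holds the 55,500 rows stated at the top of the notebook, and that scikit-learn rounds a fractional `test_size` up to the next whole sample.

```python
import math

n_total = 55_500  # record count stated at the top of the notebook (assumption)

# First split: test_size=0.3 holds out 30% (rounded up) as temp_df
n_temp = math.ceil(0.3 * n_total)
n_train = n_total - n_temp

# Second split: test_size=0.5 sends half of temp_df (rounded up) to test_df
n_test = math.ceil(0.5 * n_temp)
n_val = n_temp - n_test

print(n_train, n_val, n_test)  # 38850 8325 8325
```

If the printed `Train | Val | Test` counts differ from these, the loaded file probably has a different number of rows.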

## Step 4: Prepare Dataset for Training

In [None]:
from transformers import AutoTokenizer
from datasets import Dataset

# Load tokenizer
model_name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Convert to HuggingFace datasets
train_dataset = Dataset.from_pandas(train_df[['symptom_text', 'label']])
val_dataset = Dataset.from_pandas(val_df[['symptom_text', 'label']])
test_dataset = Dataset.from_pandas(test_df[['symptom_text', 'label']])

# Tokenize
def tokenize_function(examples):
    return tokenizer(examples['symptom_text'], padding='max_length', truncation=True, max_length=128)

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Set format
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
val_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

print("✅ Datasets prepared for training")

## Step 5: Train Model

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Load model
num_labels = len(label_encoder.classes_)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels
)

# Define metrics
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1': f1_score(labels, predictions, average='weighted')
    }

# Training arguments
training_args = TrainingArguments(
    output_dir='./symptom_classifier',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    eval_strategy='epoch',  # older transformers releases name this `evaluation_strategy`
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='f1',
)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

print("🚀 Starting training...")
trainer.train()
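
The `compute_metrics` function reports F1 with `average='weighted'`, meaning each class's F1 is weighted by its share of the true labels. A hand-rolled sketch of that calculation, for intuition only (the notebook itself uses scikit-learn's `f1_score`); `weighted_f1` and the toy labels are made up for illustration.

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    total = 0.0
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        support = tp + fn  # number of true samples in class c
        total += f1 * support / len(y_true)
    return total

# Toy example: class 0 has F1 = 0.8 (support 3), class 1 has F1 = 2/3 (support 1)
score = weighted_f1([0, 0, 0, 1], [0, 0, 1, 1])
print(round(score, 4))  # 0.7667
```

With roughly balanced classes, as in a stratified split, weighted and macro F1 tend to be close; they diverge when some classes are rare.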


## Step 6: Evaluate Model

In [None]:
# Evaluate on test set
test_results = trainer.evaluate(test_dataset)
print("\n📊 Test Set Results:")
print(f"Accuracy: {test_results['eval_accuracy']:.4f}")
print(f"F1 Score: {test_results['eval_f1']:.4f}")

In [None]:
# Generate predictions for detailed analysis
predictions = trainer.predict(test_dataset)
pred_labels = np.argmax(predictions.predictions, axis=1)
true_labels = predictions.label_ids

# Classification report
print("\n📋 Classification Report:")
print(classification_report(
    true_labels,
    pred_labels,
    target_names=label_encoder.classes_
))

In [None]:
# Confusion matrix
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(true_labels, pred_labels)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=label_encoder.classes_,
            yticklabels=label_encoder.classes_)
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()
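
In the confusion matrix, rows are true labels and columns are predicted labels, so per-class recall is the diagonal entry divided by its row total. A small sketch with a made-up 3-class matrix (the numbers are illustrative, not results from this notebook); `per_class_recall` is a hypothetical helper.

```python
def per_class_recall(cm):
    """Recall for each true class: diagonal count divided by the row total."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]

# Made-up matrix for illustration (rows = true, columns = predicted)
toy_cm = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
print(per_class_recall(toy_cm))  # [0.8, 0.9, 1.0]
```

The same function should work on the notebook's `cm` after converting it with `cm.tolist()`.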

## Step 7: Save Model

In [None]:
# Save model and tokenizer; with load_best_model_at_end=True, `model` now holds
# the checkpoint with the best validation F1
model.save_pretrained('./final_symptom_classifier')
tokenizer.save_pretrained('./final_symptom_classifier')

# Save label encoder
import pickle
with open('./final_symptom_classifier/label_encoder.pkl', 'wb') as f:
    pickle.dump(label_encoder, f)

print("✅ Model saved to ./final_symptom_classifier")
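
A pickled `LabelEncoder` ties the saved artifact to the scikit-learn version that wrote it, and unpickling executes code from the file. One possible alternative is to store the class names as plain JSON next to the model. This is a sketch under assumptions: `save_label_map`, `load_label_map`, and the file name `labels.json` are all hypothetical, and the backend would need to read this file instead of the pickle.

```python
import json
from pathlib import Path

def save_label_map(classes, out_dir):
    """Write {"0": "<class name>", ...} to <out_dir>/labels.json and return the path."""
    out_path = Path(out_dir) / "labels.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    mapping = {str(i): str(name) for i, name in enumerate(classes)}
    out_path.write_text(json.dumps(mapping, indent=2), encoding="utf-8")
    return out_path

def load_label_map(path):
    """Read the mapping back with integer keys, e.g. {0: "<class name>", ...}."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    return {int(i): name for i, name in raw.items()}
```

In this notebook the call would be `save_label_map(label_encoder.classes_, './final_symptom_classifier')`; the index order matches the integer labels the model was trained on.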

## Step 8: Test Inference

In [None]:
# Test with custom examples
test_examples = [
    "Patient is a 45 year old male presenting with joint pain, stiffness, swelling in joints.",
    "Patient is a 60 year old female presenting with increased thirst, frequent urination, fatigue.",
    "Patient is a 55 year old male presenting with headaches, shortness of breath, chest pain.",
]

# Tokenize
inputs = tokenizer(test_examples, padding=True, truncation=True, return_tensors='pt')

# Move to GPU if available
if torch.cuda.is_available():
    inputs = {k: v.cuda() for k, v in inputs.items()}
    model = model.cuda()

# Predict
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=1)

# Print results
print("\n🔍 Test Predictions:")
for text, pred in zip(test_examples, predictions):
    condition = label_encoder.inverse_transform([pred.cpu().item()])[0]
    print(f"\nText: {text}")
    print(f"Predicted: {condition}")
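
The cell above reports only the top class. To see how confident the model is, the logits can be converted to probabilities with a softmax. A dependency-free sketch; `softmax` and `top_prediction` are helper names introduced here, and softmax scores from a fine-tuned classifier are not necessarily well calibrated.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_prediction(logits, class_names):
    """Return (class_name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]
```

In the notebook this could be called as `top_prediction(outputs.logits[0].tolist(), list(label_encoder.classes_))`.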

## Step 9: Download Model

**Instructions**:
1. Right-click on `final_symptom_classifier` folder in the file browser
2. Click "Download"
3. Extract the zip file on your Mac
4. Copy contents to `/Users/anaslari/Desktop/doctor_online/mm-hie-backend/app/modules/nlp/models/symptom_classifier/`

**Or use this code to create a zip:**

In [None]:
# Create zip file for download
!zip -r symptom_classifier_model.zip final_symptom_classifier/
print("✅ Model zipped! Download 'symptom_classifier_model.zip' from the file browser.")
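
If the `zip` command is not available in the runtime, the same archive can be built with Python's standard library. A minimal sketch; `zip_model_dir` is a hypothetical helper, and its defaults mirror the folder and archive names used above.

```python
import shutil
from pathlib import Path

def zip_model_dir(model_dir="final_symptom_classifier",
                  archive_name="symptom_classifier_model"):
    """Create <archive_name>.zip containing the model folder; return the archive path."""
    model_path = Path(model_dir).resolve()
    return shutil.make_archive(
        archive_name, "zip",
        root_dir=str(model_path.parent),  # archive paths are relative to this folder
        base_dir=model_path.name,         # so entries start with the model folder name
    )
```

Calling `zip_model_dir()` from the notebook's working directory should produce `symptom_classifier_model.zip` with the same layout as the shell command.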

## 🎉 Training Complete!

### Next Steps:
1. Download `symptom_classifier_model.zip`
2. Extract on your Mac
3. Copy to backend: `mm-hie-backend/app/modules/nlp/models/symptom_classifier/`
4. Update `symptom_model.py` to load this model
5. Test in your application!

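### Expected Performance:
Rough expectations for the synthetic test split, not measured results:
- Accuracy: ~85-95% (depending on synthetic symptom quality)
- F1 Score: ~0.85-0.95 (weighted)
- Inference Time: <100ms per prediction (hardware-dependent)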
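
The inference-time figure depends on hardware and batch size, so it is worth measuring on the target machine. A small timing sketch using only the standard library; `median_latency_ms` is a hypothetical helper, and `predict_fn` stands for whatever callable wraps tokenization plus the model forward pass.

```python
import statistics
import time

def median_latency_ms(predict_fn, example, warmup=3, runs=20):
    """Median wall-clock time of predict_fn(example), in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not timed
        predict_fn(example)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(example)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)
```

On a GPU, CUDA kernels run asynchronously, so `predict_fn` should end with `torch.cuda.synchronize()` or move the result to the CPU; otherwise the measured time can be too low.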

### Model Info:
- Base: Bio_ClinicalBERT (`emilyalsentzer/Bio_ClinicalBERT`)
- Classes: 6 (Arthritis, Diabetes, Hypertension, Obesity, Cancer, Asthma)
- Parameters: ~110M
- Size: ~440MB (float32 weights)
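
The size figure follows from the parameter count. A back-of-the-envelope check, assuming the weights are stored as 32-bit floats (4 bytes each) and ignoring the tokenizer and config files:

```python
n_params = 110_000_000   # approximate parameter count listed above
bytes_per_param = 4      # assumes float32 weights

size_mb = n_params * bytes_per_param / 1e6
print(f"~{size_mb:.0f} MB")  # ~440 MB
```

Under the same arithmetic, storing the weights in 16-bit precision would roughly halve the size.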