# PDX Analysis Tutorial - Data Exploration

This notebook explores Patient-Derived Xenograft (PDX) datasets: tumor volumes, gene expression, and genomic variants.

## Learning Objectives
- Load and examine PDX datasets
- Perform quality control checks
- Generate descriptive statistics
- Create initial visualizations
- Identify data patterns and potential issues

## Prerequisites
- Python 3.7+
- pandas, numpy, matplotlib, seaborn
- Basic understanding of PDX research

## 🚨 Environment Setup Issues?

**If you ran into dependency conflicts or Jupyter issues, try one of the options below first:**

### Option 1: Quick Fix - Use Virtual Environment
```bash
# Create and activate a virtual environment
python3 -m venv pdx_env
source pdx_env/bin/activate  # Linux/macOS; on Windows run: pdx_env\Scripts\activate

# Install packages
pip install --upgrade pip
pip install pandas numpy matplotlib seaborn scipy scikit-learn jupyter

# Start Jupyter
jupyter notebook
```

### Option 2: Use Conda (Recommended)
```bash
conda create -n pdx_analysis python=3.9
conda activate pdx_analysis
conda install pandas numpy matplotlib seaborn scipy scikit-learn jupyter -c conda-forge
jupyter notebook
```

### Option 3: Fix Jupyter Path Issue
```bash
# Check current Python
which python3
python3 --version

# Reinstall Jupyter for the current Python
# (python3 -m pip guarantees pip targets the same interpreter checked above)
python3 -m pip install --force-reinstall jupyter

# Or use python -m to run Jupyter
python3 -m jupyter notebook
```

In [None]:
# Environment Diagnostic - Run this cell first
import sys
import subprocess

print("=== ENVIRONMENT DIAGNOSTIC ===")
print(f"Python executable: {sys.executable}")
print(f"Python version: {sys.version}")
print(f"Python path: {sys.path[:3]}...")  # Show first 3 paths

# Check if we can import required packages
required_packages = ['pandas', 'numpy', 'matplotlib', 'seaborn']
missing_packages = []

for package in required_packages:
    try:
        __import__(package)
        print(f"✅ {package}: Available")
    except ImportError:
        print(f"❌ {package}: Missing")
        missing_packages.append(package)

if missing_packages:
    print(f"\n⚠️  Missing packages: {missing_packages}")
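    # Install with the interpreter this kernel is running, not whichever pip is first on PATH
    print(f"Run: {sys.executable} -m pip install " + " ".join(missing_packages))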
else:
    print("\n🎉 All required packages are available!")
    
# Check Jupyter installation
try:
    result = subprocess.run([sys.executable, '-m', 'jupyter', '--version'], 
                          capture_output=True, text=True)
    if result.returncode == 0:
        print("✅ Jupyter: Available")
        print(f"   Version info: {result.stdout.strip()}")
    else:
        print("❌ Jupyter: Issue detected")
except Exception as e:
    print(f"❌ Jupyter: {e}")

## 1. Load Required Libraries

First, let's import the necessary Python libraries for data analysis and visualization.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
# Suppress all warnings to keep tutorial output readable; remove this line when debugging
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette("Set2")

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

print("✅ Libraries loaded successfully!")
print(f"📊 Pandas version: {pd.__version__}")
print(f"🔢 NumPy version: {np.__version__}")
print(f"📈 Matplotlib version: {plt.matplotlib.__version__}")
print(f"🎨 Seaborn version: {sns.__version__}")

## 2. Load PDX Datasets

Now let's load the Patient-Derived Xenograft datasets. We'll examine three key data types:
- **Tumor volumes**: Growth measurements over time
- **Gene expression**: RNA-seq data (TPM values)
- **Genomic variants**: Mutation and copy number data
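
As a rough orientation, the tumor volume table explored later in this notebook is in long format, one row per measurement, with the columns `Model`, `Day`, `Volume_mm3`, and optionally `Arm`. The sketch below builds a tiny hand-made table with that layout. The model names, arm labels, and volumes are invented for illustration and are not taken from the mock files.

```python
import pandas as pd

# Hypothetical rows illustrating the long-format layout; all values are made up.
example_tumor = pd.DataFrame(
    {
        "Model": ["PDX1", "PDX1", "PDX2", "PDX2"],
        "Arm": ["control", "control", "treatment", "treatment"],
        "Day": [0, 7, 0, 7],
        "Volume_mm3": [100.0, 180.0, 110.0, 125.0],
    }
)

# Each row is one volume measurement for a model/arm at a given day
print(example_tumor)
print(example_tumor.dtypes)
```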

In [None]:
# Define data paths
data_dir = Path("../data")

# Load tumor volume data
tumor_file = data_dir / "tumor_volumes_mock.csv"
if tumor_file.exists():
    tumor_data = pd.read_csv(tumor_file)
    print(f"✅ Loaded tumor volume data: {tumor_data.shape}")
else:
    print(f"❌ Tumor volume file not found: {tumor_file}")
    tumor_data = None

# Load gene expression data
expression_file = data_dir / "expression_tpm_mock.csv"
if expression_file.exists():
    expression_data = pd.read_csv(expression_file, index_col=0)
    print(f"✅ Loaded expression data: {expression_data.shape}")
else:
    print(f"❌ Expression file not found: {expression_file}")
    expression_data = None

# Load variants data
variants_file = data_dir / "variants_mock.csv"
if variants_file.exists():
    variants_data = pd.read_csv(variants_file)
    print(f"✅ Loaded variants data: {variants_data.shape}")
else:
    print(f"❌ Variants file not found: {variants_file}")
    variants_data = None

# Load enhanced datasets if available
enhanced_tumor_file = data_dir / "tumor_volumes_enhanced.csv"
if enhanced_tumor_file.exists():
    enhanced_tumor_data = pd.read_csv(enhanced_tumor_file)
    print(f"✅ Loaded enhanced tumor data: {enhanced_tumor_data.shape}")
    # Use enhanced data if available
    tumor_data = enhanced_tumor_data
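elif tumor_data is not None:
    print("ℹ️ Enhanced tumor data not available, using basic mock data")
else:
    print("ℹ️ Enhanced tumor data not available, and no basic mock data was loaded")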

print("\n📁 Data directory contents:")
if data_dir.exists():
    # Sort so the listing order is stable across platforms
    for csv_file in sorted(data_dir.glob("*.csv")):
        size_mb = csv_file.stat().st_size / (1024*1024)
        print(f"  - {csv_file.name}: {size_mb:.2f} MB")
else:
    print("  Data directory not found")
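
The cell above repeats the same check-then-read pattern for each file. One way to fold it into a single function is sketched below. `load_csv_if_exists` and its `label` parameter are hypothetical names introduced here, not part of the original notebook, and the messages are simplified.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def load_csv_if_exists(path: Path, label: str, **read_kwargs) -> Optional[pd.DataFrame]:
    """Return the CSV at `path` as a DataFrame, or None if the file is missing.

    Hypothetical helper: extra keyword arguments (e.g. index_col=0)
    are passed straight through to pandas.read_csv.
    """
    path = Path(path)
    if not path.exists():
        print(f"{label} file not found: {path}")
        return None
    df = pd.read_csv(path, **read_kwargs)
    print(f"Loaded {label} data: {df.shape}")
    return df
```

With this helper, the expression table could be loaded as `load_csv_if_exists(data_dir / "expression_tpm_mock.csv", "expression", index_col=0)`, keeping the `None` fallback that later cells check for.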

## 3. Tumor Volume Data Exploration

Let's examine the tumor growth data in detail. Tumor volume over time is the core measurement in PDX studies.

In [None]:
if tumor_data is not None:
    print("=== TUMOR VOLUME DATA OVERVIEW ===")
    print(f"📊 Dataset shape: {tumor_data.shape}")
    print(f"📅 Timepoints: {tumor_data['Day'].nunique()} unique days")
    print(f"🔬 Models: {tumor_data['Model'].nunique()} PDX models")
    
    # Display first few rows
    print("\n📋 First 10 rows:")
    display(tumor_data.head(10))
    
    # Check for missing values
    print(f"\n❓ Missing values:")
    missing_summary = tumor_data.isnull().sum()
    for col, missing in missing_summary.items():
        if missing > 0:
            print(f"  - {col}: {missing} ({missing/len(tumor_data)*100:.1f}%)")
        else:
            print(f"  - {col}: None")
    
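    # Basic statistics
    print("\n📈 Volume statistics:")
    volume_stats = tumor_data['Volume_mm3'].describe()
    for stat, value in volume_stats.items():
        if stat == 'count':
            # count is a number of measurements, not a volume
            print(f"  - {stat}: {int(value)}")
        else:
            print(f"  - {stat}: {value:.2f} mm³")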
    
    # Check data types
    print(f"\n🔍 Data types:")
    for col, dtype in tumor_data.dtypes.items():
        print(f"  - {col}: {dtype}")
        
    # Treatment groups
    if 'Arm' in tumor_data.columns:
        print(f"\n💊 Treatment groups:")
        arm_counts = tumor_data['Arm'].value_counts()
        for arm, count in arm_counts.items():
            print(f"  - {arm}: {count} measurements")
else:
    print("❌ No tumor volume data available for exploration")
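
The overview above reports volume statistics pooled over all arms and days. A per-group summary is often more informative for growth data. The sketch below assumes the `Arm`, `Day`, and `Volume_mm3` columns used above. `summarize_by_arm_and_day` is a hypothetical helper, and the toy values are invented so the example runs without the data files.

```python
import pandas as pd


def summarize_by_arm_and_day(df: pd.DataFrame) -> pd.DataFrame:
    """Mean, standard deviation, and count of Volume_mm3 per (Arm, Day) pair."""
    return (
        df.groupby(["Arm", "Day"])["Volume_mm3"]
        .agg(["mean", "std", "count"])
        .reset_index()
    )


# Toy data (invented values) to show the shape of the result
toy = pd.DataFrame(
    {
        "Arm": ["control", "control", "treatment", "treatment"],
        "Day": [0, 0, 0, 0],
        "Volume_mm3": [100.0, 120.0, 90.0, 110.0],
    }
)
summary = summarize_by_arm_and_day(toy)
print(summary)
```

On the real data this would be called as `summarize_by_arm_and_day(tumor_data)`, after confirming that the `Arm` column is present.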