# PDX Analysis Tutorial - Data Exploration

This notebook explores Patient-Derived Xenograft (PDX) datasets: tumor volumes, gene expression, and genomic variants.

## Learning Objectives
- Load and examine PDX datasets
- Perform quality control checks
- Generate descriptive statistics
- Create initial visualizations
- Identify data patterns and potential issues

## Prerequisites
- Python 3.7+
- pandas, numpy, matplotlib, seaborn
- Basic understanding of PDX research

## 🚨 Environment Setup Issues?

**If you ran into dependency conflicts or Jupyter issues, try one of these options first:**

### Option 1: Quick Fix - Use a Virtual Environment
```bash
# Create virtual environment
python3 -m venv pdx_env
source pdx_env/bin/activate

# Install packages
pip install --upgrade pip
pip install pandas numpy matplotlib seaborn scipy scikit-learn jupyter

# Start Jupyter
jupyter notebook
```

### Option 2: Use Conda (Recommended)
```bash
conda create -n pdx_analysis python=3.9
conda activate pdx_analysis
conda install pandas numpy matplotlib seaborn scipy scikit-learn jupyter -c conda-forge
jupyter notebook
```

### Option 3: Fix Jupyter Path Issue
```bash
# Check current Python
which python3
python3 --version

# Reinstall Jupyter with current Python
pip3 install --force-reinstall jupyter

# Or use python -m to run Jupyter
python3 -m jupyter notebook
```
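
A common cause of the path issue above is that the `jupyter` launcher found on `PATH` belongs to a different Python installation than the one the packages were installed into. The sketch below uses only the standard library to report both locations so they can be compared. The helper names `describe_interpreters` and `same_environment` are hypothetical, and the "same directory" test is a heuristic that assumes a POSIX-style virtual environment layout where `python` and `jupyter` both live in `bin/`.

```python
import shutil
import sys
from pathlib import Path


def describe_interpreters():
    """Return the running interpreter plus the first python3/jupyter found on PATH."""
    return {
        "running_python": sys.executable,
        "python3_on_path": shutil.which("python3"),
        "jupyter_on_path": shutil.which("jupyter"),
    }


def same_environment(info):
    """True when the jupyter launcher sits in the same directory as the running Python."""
    if info["jupyter_on_path"] is None:
        return False
    return Path(info["jupyter_on_path"]).parent == Path(info["running_python"]).parent


info = describe_interpreters()
for name, location in info.items():
    print(f"{name}: {location}")
print(f"jupyter shares the interpreter's directory: {same_environment(info)}")
```

If the last line prints `False`, launching with `python3 -m jupyter notebook` as in Option 3 ties Jupyter to the interpreter you are actually running.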

In [1]:
# Environment Diagnostic - Run this cell first
import sys
import subprocess

print("=== ENVIRONMENT DIAGNOSTIC ===")
print(f"Python executable: {sys.executable}")
print(f"Python version: {sys.version}")
print(f"Python path: {sys.path[:3]}...")  # Show first 3 paths

# Check if we can import required packages
required_packages = ['pandas', 'numpy', 'matplotlib', 'seaborn']
missing_packages = []

for package in required_packages:
    try:
        __import__(package)
        print(f"✅ {package}: Available")
    except ImportError:
        print(f"❌ {package}: Missing")
        missing_packages.append(package)

if missing_packages:
    print(f"\n⚠️  Missing packages: {missing_packages}")
    print("Run: pip install " + " ".join(missing_packages))
else:
    print("\n🎉 All required packages are available!")
    
# Check Jupyter installation
try:
    result = subprocess.run([sys.executable, '-m', 'jupyter', '--version'], 
                          capture_output=True, text=True)
    if result.returncode == 0:
        print("✅ Jupyter: Available")
        print(f"   Version info: {result.stdout.strip()}")
    else:
        print("❌ Jupyter: Issue detected")
except Exception as e:
    print(f"❌ Jupyter: {e}")

=== ENVIRONMENT DIAGNOSTIC ===
Python executable: /Users/minluzhang/projects/2025/git/pdx_analysis_tutorial/pdx_env/bin/python3.12
Python version: 3.12.9 (v3.12.9:fdb81425a9a, Feb  4 2025, 12:21:36) [Clang 13.0.0 (clang-1300.0.29.30)]
Python path: ['/Library/Frameworks/Python.framework/Versions/3.12/lib/python312.zip', '/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12', '/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/lib-dynload']...
✅ pandas: Available
✅ numpy: Available
✅ matplotlib: Available
✅ seaborn: Available

🎉 All required packages are available!
✅ Jupyter: Available
   Version info: Selected Jupyter core packages...
IPython          : 9.5.0
ipykernel        : 6.30.1
ipywidgets       : 8.1.7
jupyter_client   : 8.6.3
jupyter_core     : 5.8.1
jupyter_server   : 2.17.0
jupyterlab       : 4.4.7
nbclient         : 0.10.2
nbconvert        : 7.16.6
nbformat         : 5.10.4
notebook         : 7.4.5
qtconsole        : not installed
trai

## 1. Load Required Libraries

First, let's import the necessary Python libraries for data analysis and visualization.

In [2]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette("Set2")

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

print("✅ Libraries loaded successfully!")
print(f"📊 Pandas version: {pd.__version__}")
print(f"🔢 NumPy version: {np.__version__}")
print(f"📈 Matplotlib version: {plt.matplotlib.__version__}")
print(f"🎨 Seaborn version: {sns.__version__}")

✅ Libraries loaded successfully!
📊 Pandas version: 2.3.2
🔢 NumPy version: 2.3.3
📈 Matplotlib version: 3.10.6
🎨 Seaborn version: 0.13.2
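
The import cell silences every warning with `warnings.filterwarnings('ignore')`, which can also hide messages that point to real data problems. A narrower filter is one alternative, sketched here as an option rather than a required change. It ignores a single category and leaves the rest visible; `FutureWarning` is an assumed choice for illustration.

```python
import warnings

# record=True collects warnings in a list so we can inspect what got through.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")                              # show everything by default
    warnings.filterwarnings("ignore", category=FutureWarning)    # except this one category

    warnings.warn("deprecated call", FutureWarning)      # suppressed
    warnings.warn("suspicious value", RuntimeWarning)    # still recorded

print([w.category.__name__ for w in caught])
```

Only the `RuntimeWarning` is recorded. In the notebook, the single `filterwarnings("ignore", category=...)` call would stand in for the blanket `'ignore'`.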


## 2. Load PDX Datasets

Now let's load the Patient-Derived Xenograft datasets. We'll examine three key data types:
- **Tumor volumes**: Growth measurements over time
- **Gene expression**: RNA-seq data (TPM values)
- **Genomic variants**: Mutation and copy number data
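
For reference, TPM (transcripts per million) divides read counts by gene length and then rescales each sample so that its values sum to one million. The sketch below computes TPM for a small made-up count matrix and confirms the column sums. The gene names, lengths, and counts are illustrative and are not taken from the tutorial data.

```python
import numpy as np
import pandas as pd

# Illustrative inputs: raw read counts (genes x samples) and gene lengths in kilobases.
counts = pd.DataFrame(
    {"sample_A": [100, 200, 300], "sample_B": [50, 400, 150]},
    index=["GENE1", "GENE2", "GENE3"],
)
length_kb = pd.Series([1.0, 2.0, 3.0], index=counts.index)

# Step 1: reads per kilobase. Step 2: scale each sample to sum to one million.
rpk = counts.div(length_kb, axis=0)
tpm = rpk.div(rpk.sum(axis=0), axis=1) * 1e6

print(tpm.round(1))
print(np.allclose(tpm.sum(axis=0), 1e6))  # True when every sample totals one million
```

The same column-sum check can be run on a genes × samples expression table as a quick sanity test. A table that was filtered to a subset of genes after normalization will sum to less than one million per sample.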

In [None]:
# Load PDX Realistic Datasets
from pathlib import Path

def load_realistic_data():
    """Load the realistic PDX datasets"""
    data_dir = Path("../data")
    
    print("Loading realistic PDX datasets...")
    print(f"Data directory: {data_dir}")
    
    # Check what files exist
    if data_dir.exists():
        print(f"\nAvailable data files:")
        for file in sorted(data_dir.glob("*.csv")):
            size_mb = file.stat().st_size / (1024*1024)
            print(f"  - {file.name}: {size_mb:.2f} MB")
    else:
        raise FileNotFoundError(f"Data directory not found: {data_dir}")
    
    # Define realistic data files
    file_paths = {
        'tumor_volumes': data_dir / 'tumor_volumes_realistic.csv',
        'expression_tpm': data_dir / 'expression_tpm_realistic.csv',
        'variants': data_dir / 'variants_realistic.csv'
    }
    
    print(f"\nLoading realistic datasets...")
    results = {}
    
    # Load each dataset
    for data_type, filepath in file_paths.items():
        if filepath.exists():
            try:
                if data_type == 'expression_tpm':
                    data = pd.read_csv(filepath, index_col=0)
                else:
                    data = pd.read_csv(filepath)
                
                print(f"✅ {data_type}: Loaded realistic dataset ({data.shape[0]} × {data.shape[1]})")
                results[data_type] = data
            except Exception as e:
                print(f"❌ Failed to load {filepath.name}: {e}")
                results[data_type] = None
        else:
            print(f"❌ {data_type}: File not found ({filepath.name})")
            print("   Please run: python ../src/python/generate_realistic_pdx_data.py")
            results[data_type] = None
    
    return results.get('tumor_volumes'), results.get('expression_tpm'), results.get('variants')

# Load realistic data
tumor_data, expression_data, variants_data = load_realistic_data()

# Validation summary
print(f"\nFINAL DATA SUMMARY:")
datasets = [
    ("Tumor Volumes", tumor_data),
    ("Gene Expression", expression_data), 
    ("Variants", variants_data)
]

for name, data in datasets:
    if data is not None:
        print(f"✅ {name}: {data.shape[0]} × {data.shape[1]}")
    else:
        print(f"❌ {name}: Not available")

# Check if all datasets loaded successfully
missing_datasets = [name for name, data in datasets if data is None]
if missing_datasets:
    print(f"\n⚠️  Missing datasets: {missing_datasets}")
    print("Please ensure all realistic data files are generated before proceeding.")
else:
    print("\n🎉 All realistic datasets loaded successfully!")

Smart data loading - checking available datasets...
Data directory: ../data

Available data files:
  - expression_tpm_effective.csv: 10.43 MB
  - expression_tpm_realistic.csv: 10.43 MB
  - metadata_effective.csv: 0.00 MB
  - metadata_realistic.csv: 0.00 MB
  - tumor_volumes_effective.csv: 0.00 MB
  - tumor_volumes_realistic.csv: 0.00 MB
  - variants_effective.csv: 0.03 MB
  - variants_realistic.csv: 0.04 MB

Loading datasets with priority: realistic → effective → mock
✅ tumor_volumes: Loaded realistic dataset (150 × 4)
✅ expression_tpm: Loaded realistic dataset (20000 × 30)
✅ variants: Loaded realistic dataset (750 × 8)

FINAL DATA SUMMARY:
✅ Tumor Volumes: 150 × 4
✅ Gene Expression: 20000 × 30
✅ Variants: 750 × 8
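
Before exploring the tables, it can help to confirm that each one has the columns that later cells rely on. The sketch below shows one way to do that. The helper `check_columns` is hypothetical, and the required names are the ones the tumor volume cell in Section 3 reads (`Model`, `Day`, `Volume_mm3`).

```python
import pandas as pd


def check_columns(df, required, name="dataset"):
    """Raise a ValueError listing any required columns that are absent from df."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"{name} is missing required columns: {missing}")
    return True


# Small stand-in frame; in the notebook this would be `tumor_data`.
demo = pd.DataFrame({"Model": ["PDX1"], "Day": [0], "Volume_mm3": [59.59]})
check_columns(demo, ["Model", "Day", "Volume_mm3"], name="tumor_volumes")

try:
    check_columns(demo, ["Model", "Arm"], name="tumor_volumes")
except ValueError as err:
    print(err)
```

Failing early with the list of missing columns tends to be easier to diagnose than a `KeyError` raised several cells later.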


## 3. Tumor Volume Data Exploration

Let's examine the tumor growth data in detail - this is the core measurement in PDX studies.

In [4]:
if tumor_data is not None:
    print("=== TUMOR VOLUME DATA OVERVIEW ===")
    print(f"📊 Dataset shape: {tumor_data.shape}")
    print(f"📅 Timepoints: {tumor_data['Day'].nunique()} unique days")
    print(f"🔬 Models: {tumor_data['Model'].nunique()} PDX models")

    # Display first few rows
    print("\n📋 First 10 rows:")
    display(tumor_data.head(10))

    # Check for missing values
    print("\n❓ Missing values:")
    missing_summary = tumor_data.isnull().sum()
    for col, missing in missing_summary.items():
        if missing > 0:
            print(f"  - {col}: {missing} ({missing/len(tumor_data)*100:.1f}%)")
        else:
            print(f"  - {col}: None")
    
    # Basic statistics
    print("\n📈 Volume statistics:")
    volume_stats = tumor_data['Volume_mm3'].describe()
    for stat, value in volume_stats.items():
        print(f"  - {stat}: {value:.2f} mm³")

    # Check data types
    print("\n🔍 Data types:")
    for col, dtype in tumor_data.dtypes.items():
        print(f"  - {col}: {dtype}")
        
    # Treatment groups
    if 'Arm' in tumor_data.columns:
        print("\n💊 Treatment groups:")
        arm_counts = tumor_data['Arm'].value_counts()
        for arm, count in arm_counts.items():
            print(f"  - {arm}: {count} measurements")
else:
    print("❌ No tumor volume data available for exploration")

=== TUMOR VOLUME DATA OVERVIEW ===
📊 Dataset shape: (62, 6)
📅 Timepoints: 8 unique days
🔬 Models: 8 PDX models

📋 First 10 rows:


|   | Model | Arm | Day | Volume_mm3 | Cancer_Type | Measurement_Date |
|---|-------|---------|-----|------------|-------------|------------------|
| 0 | PDX1 | control | 0 | 59.586875 | BRCA | 2024-01-07 |
| 1 | PDX1 | control | 4 | 78.388986 | BRCA | 2024-01-11 |
| 2 | PDX1 | control | 8 | 94.464729 | BRCA | 2024-01-15 |
| 3 | PDX1 | control | 12 | 122.681151 | BRCA | 2024-01-19 |
| 4 | PDX1 | control | 16 | 168.217173 | BRCA | 2024-01-23 |
| 5 | PDX1 | control | 20 | 208.995318 | BRCA | 2024-01-27 |
| 6 | PDX1 | control | 24 | 264.107366 | BRCA | 2024-01-31 |
| 7 | PDX1 | control | 28 | 352.24085 | BRCA | 2024-02-04 |
| 8 | PDX2 | control | 0 | 111.222587 | NSCLC | 2024-01-14 |
| 9 | PDX2 | control | 4 | 136.890052 | NSCLC | 2024-01-18 |



❓ Missing values:
  - Model: None
  - Arm: None
  - Day: None
  - Volume_mm3: None
  - Cancer_Type: None
  - Measurement_Date: None

📈 Volume statistics:
  - count: 62.00 mm³
  - mean: 276.53 mm³
  - std: 236.41 mm³
  - min: 59.59 mm³
  - 25%: 142.91 mm³
  - 50%: 191.04 mm³
  - 75%: 311.59 mm³
  - max: 1429.40 mm³

🔍 Data types:
  - Model: object
  - Arm: object
  - Day: int64
  - Volume_mm3: float64
  - Cancer_Type: object
  - Measurement_Date: object

💊 Treatment groups:
  - control: 31 measurements
  - treatment: 31 measurements
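
Three further checks fit the quality-control goal of this notebook: each (`Model`, `Arm`, `Day`) combination should appear once, volumes should be positive, and `Measurement_Date`, which loads as `object`, should advance in step with `Day`. The sketch below runs all three on a stand-in frame built from the first PDX1 control rows shown above. The helper name `qc_tumor_volumes` is hypothetical. The assumption that `Day` counts calendar days from each model's first measurement is not stated in the tutorial and should be confirmed against the data generator.

```python
import pandas as pd


def qc_tumor_volumes(df):
    """Return simple QC counts for a long-format tumor volume table."""
    out = {}
    # Repeated measurements for the same model, arm and day.
    out["duplicate_keys"] = int(df.duplicated(subset=["Model", "Arm", "Day"]).sum())
    # Volumes should be strictly positive.
    out["non_positive_volumes"] = int((df["Volume_mm3"] <= 0).sum())
    # Compare elapsed calendar days with the Day column, per model and arm.
    dates = pd.to_datetime(df["Measurement_Date"])
    first_date = dates.groupby([df["Model"], df["Arm"]]).transform("min")
    elapsed_days = (dates - first_date).dt.days
    first_day = df.groupby(["Model", "Arm"])["Day"].transform("min")
    out["date_day_mismatches"] = int((elapsed_days != df["Day"] - first_day).sum())
    return out


# Stand-in rows copied from the PDX1 control output above.
demo = pd.DataFrame({
    "Model": ["PDX1"] * 4,
    "Arm": ["control"] * 4,
    "Day": [0, 4, 8, 12],
    "Volume_mm3": [59.586875, 78.388986, 94.464729, 122.681151],
    "Measurement_Date": ["2024-01-07", "2024-01-11", "2024-01-15", "2024-01-19"],
})
print(qc_tumor_volumes(demo))
```

In the notebook, `qc_tumor_volumes(tumor_data)` would report the same three counts for the full table; any non-zero value is a prompt to inspect the affected rows before modelling growth.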
