# 🎯 Visualization Skills Assessment

## Assessment Overview
This notebook tests your mastery of data visualization and analysis techniques for digital pathology. Complete all exercises to demonstrate your skills in:

- Data manipulation with pandas
- Statistical visualization with matplotlib/seaborn
- Heatmap analysis and correlation studies
- Dimensionality reduction with UMAP

## Instructions
1. Read each exercise carefully
2. Implement the required solutions
3. Run the validation cells to check your work
4. Score at least 80% overall to pass (the final validation cell raises an error below 60%)

Let's test your visualization expertise!
</VSCode Cell>

<VSCode.Cell language="python">
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import umap
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

print("✅ Assessment environment ready!")
print("🎯 Complete all exercises to pass the visualization assessment")

In [None]:
# Generate assessment dataset
np.random.seed(42)

# Create synthetic pathology metadata
n_samples = 500
patient_data = {
    'patient_id': [f'P{i:04d}' for i in range(n_samples)],
    'age': np.random.normal(65, 15, n_samples).clip(18, 100).astype(int),  # keep ages in a plausible adult range
    'gender': np.random.choice(['M', 'F'], n_samples, p=[0.45, 0.55]),
    'diagnosis': np.random.choice(['Normal', 'Benign', 'Malignant'], n_samples, p=[0.4, 0.35, 0.25]),
    'tumor_size': np.random.exponential(2.5, n_samples),
    'grade': np.random.choice([1, 2, 3], n_samples, p=[0.3, 0.5, 0.2]),
    'ki67_score': np.random.beta(2, 5, n_samples) * 100,
    'her2_status': np.random.choice(['Negative', 'Positive'], n_samples, p=[0.75, 0.25]),
    'er_status': np.random.choice(['Negative', 'Positive'], n_samples, p=[0.3, 0.7]),
    'pr_status': np.random.choice(['Negative', 'Positive'], n_samples, p=[0.4, 0.6])
}

# Create DataFrame
df_pathology = pd.DataFrame(patient_data)

# Add some correlations to make it realistic
mask_malignant = df_pathology['diagnosis'] == 'Malignant'
df_pathology.loc[mask_malignant, 'tumor_size'] *= 1.5
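# Ki-67 is a percentage, so cap the scaled score at 100
df_pathology.loc[mask_malignant, 'ki67_score'] = (
    df_pathology.loc[mask_malignant, 'ki67_score'] * 1.3
).clip(upper=100)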
df_pathology.loc[mask_malignant, 'grade'] = np.random.choice([2, 3], mask_malignant.sum(), p=[0.3, 0.7])

print(f"📊 Assessment dataset created: {len(df_pathology)} samples")
print("\n🔍 Dataset preview:")
print(df_pathology.head())

## 🎯 Exercise 1: Data Exploration (20 points)

**Task**: Create a comprehensive data exploration analysis that includes:
1. Summary statistics for numerical variables
2. Value counts for categorical variables  
3. Missing data analysis
4. Basic data quality checks

**Requirements**:
- Use pandas methods effectively
- Present results clearly
- Identify any data quality issues
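
The four task items above can be sketched on a small made-up table. This is only an illustration, not the expected solution: the `toy` frame, its values, and the 0 to 120 age range used for the plausibility check are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_pathology; all values are made up.
toy = pd.DataFrame({
    "age": [54, 61, 47, 130, 72],                 # 130 is an implausible age
    "tumor_size": [1.2, np.nan, 3.4, 0.8, 2.1],   # one missing value
    "diagnosis": ["Benign", "Malignant", "Normal", "Benign", "Benign"],
})

# 1. Summary statistics for numerical variables
numeric_summary = toy.describe()

# 2. Value counts for each non-numeric column
categorical_cols = toy.select_dtypes(exclude=[np.number]).columns
value_counts = {col: toy[col].value_counts() for col in categorical_cols}

# 3. Missing values per column
missing_per_column = toy.isnull().sum()

# 4. Basic quality checks: duplicate rows and out-of-range ages
n_duplicates = int(toy.duplicated().sum())
implausible_age = toy[(toy["age"] < 0) | (toy["age"] > 120)]

print(numeric_summary)
print(value_counts["diagnosis"])
print(missing_per_column)
print(f"Duplicate rows: {n_duplicates}, implausible ages: {len(implausible_age)}")
```

Range checks like the one on `age` depend on domain knowledge, so the thresholds should be adapted to each variable rather than copied.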

In [None]:
# 🎯 EXERCISE 1: Implement your data exploration here

def explore_pathology_data(df):
    """
    Comprehensive data exploration function
    
    Args:
        df: pandas DataFrame with pathology data
    
    Returns:
        dict: Summary of exploration findings
    """
    
    # TODO: Implement comprehensive data exploration
    # Hint: Use df.describe(), df.info(), df.isnull().sum(), etc.
    
    print("📊 COMPREHENSIVE DATA EXPLORATION")
    print("=" * 50)
    
    # Basic info
    print(f"Dataset Shape: {df.shape}")
    print(f"Memory Usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")
    
    # TODO: Add your exploration code here
    # Example structure:
    # - Numerical summaries
    # - Categorical summaries  
    # - Missing data analysis
    # - Data quality checks
    
    exploration_results = {
        'total_samples': len(df),
        'missing_data': df.isnull().sum().sum(),
        'numerical_vars': df.select_dtypes(include=[np.number]).columns.tolist(),
        'categorical_vars': df.select_dtypes(exclude=[np.number]).columns.tolist()  # all non-numeric columns
    }
    
    return exploration_results

# Test your implementation
results = explore_pathology_data(df_pathology)
print(f"\n✅ Exploration completed: {results}")

## 🎯 Exercise 2: Statistical Visualization (25 points)

**Task**: Create a comprehensive statistical visualization dashboard with:
1. Distribution plots for key numerical variables
2. Box plots comparing groups
3. Bar plots for categorical variables
4. A subplot layout with at least 2x2 grid

**Requirements**:
- Use matplotlib and seaborn effectively
- Include proper titles, labels, and legends
- Choose appropriate plot types for each variable
- Create publication-ready figures
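
As a rough sketch of two of the plot types listed above (a grouped box plot and a stacked bar chart), the example below uses plain matplotlib and pandas on made-up data. The `toy` frame and the axis labels are assumptions for illustration; seaborn's `boxplot` and `countplot` would be equally reasonable choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_pathology
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "diagnosis": rng.choice(["Normal", "Benign", "Malignant"], size=90),
    "gender": rng.choice(["M", "F"], size=90),
    "tumor_size": rng.exponential(2.5, size=90),
})

fig, (ax_box, ax_bar) = plt.subplots(1, 2, figsize=(12, 5))

# Box plot: one box per diagnosis group
groups = sorted(toy["diagnosis"].unique())
samples = [toy.loc[toy["diagnosis"] == g, "tumor_size"].to_numpy() for g in groups]
ax_box.boxplot(samples)
ax_box.set_xticks(range(1, len(groups) + 1))
ax_box.set_xticklabels(groups)
ax_box.set_title("Tumor Size by Diagnosis")
ax_box.set_xlabel("Diagnosis")
ax_box.set_ylabel("Tumor size")

# Stacked bar: diagnosis counts within each gender
counts = pd.crosstab(toy["gender"], toy["diagnosis"])
counts.plot(kind="bar", stacked=True, ax=ax_bar)
ax_bar.set_title("Diagnosis Counts by Gender")
ax_bar.set_xlabel("Gender")
ax_bar.set_ylabel("Number of patients")
ax_bar.legend(title="Diagnosis")

fig.tight_layout()
```

Building the count table with `pd.crosstab` first keeps the plotted numbers available for inspection, which helps when checking that a stacked bar shows what you expect.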

In [None]:
# 🎯 EXERCISE 2: Implement your statistical visualization here

def create_statistical_dashboard(df):
    """
    Create comprehensive statistical visualization dashboard
    
    Args:
        df: pandas DataFrame with pathology data
    """
    
    # TODO: Create a 2x3 subplot dashboard
    # Suggested plots:
    # 1. Age distribution histogram
    # 2. Tumor size by diagnosis (box plot)
    # 3. Ki67 score distribution by diagnosis
    # 4. Grade distribution (bar plot)
    # 5. Gender vs diagnosis (stacked bar)
    # 6. Correlation heatmap subset
    
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    fig.suptitle('Pathology Data Statistical Dashboard', fontsize=16, fontweight='bold')
    
    # Plot 1: Age distribution
    axes[0, 0].hist(df['age'], bins=20, alpha=0.7, color='skyblue', edgecolor='black')
    axes[0, 0].set_title('Age Distribution')
    axes[0, 0].set_xlabel('Age (years)')
    axes[0, 0].set_ylabel('Frequency')
    
    # TODO: Complete the remaining plots
    # Plot 2: Tumor size by diagnosis
    # Plot 3: Ki67 score distribution  
    # Plot 4: Grade distribution
    # Plot 5: Gender vs diagnosis
    # Plot 6: Correlation analysis
    
    plt.tight_layout()
    plt.show()
    
    return fig

# Test your implementation
dashboard = create_statistical_dashboard(df_pathology)

## 🎯 Exercise 3: Advanced Heatmap Analysis (25 points)

**Task**: Create an advanced correlation heatmap analysis including:
1. Correlation matrix of numerical variables
2. Hierarchical clustering of features
3. Annotated correlation values
4. Custom color scheme appropriate for medical data

**Requirements**:
- Use seaborn's clustermap or heatmap
- Include statistical significance indicators
- Proper color scale and annotations
- Interpret key correlations in comments
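
One possible way to approach the significance indicators and the search for strong correlations is sketched below. It assumes SciPy is available and uses `scipy.stats.pearsonr` for p-values; the `toy` data, the `significance_stars` helper, and the star thresholds are illustrative assumptions, and no correction for multiple comparisons is applied.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical toy data: feature_b is built to correlate with feature_a
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
toy = pd.DataFrame({
    "feature_a": base,
    "feature_b": 0.8 * base + rng.normal(scale=0.5, size=n),
    "feature_c": rng.normal(size=n),  # independent noise
})


def significance_stars(p_value):
    """Map a p-value to a conventional star label (hypothetical helper)."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""


corr = toy.corr()

# String labels such as "0.85***" to use as heatmap annotations
annotations = pd.DataFrame("", index=corr.index, columns=corr.columns)
for col in toy.columns:
    annotations.loc[col, col] = "1.00"

strong_pairs = []
for col_a, col_b in combinations(toy.columns, 2):
    r, p = pearsonr(toy[col_a], toy[col_b])
    label = f"{r:.2f}{significance_stars(p)}"
    annotations.loc[col_a, col_b] = label
    annotations.loc[col_b, col_a] = label
    if abs(r) > 0.3:  # same threshold as suggested in the exercise hint
        strong_pairs.append((col_a, col_b, r, p))

for col_a, col_b, r, p in strong_pairs:
    print(f"{col_a} vs {col_b}: r={r:.2f}, p={p:.2g}")
```

The `annotations` frame can then be passed to `sns.heatmap(corr, annot=annotations, fmt="")` so that each cell shows the coefficient together with its stars.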

In [None]:
# 🎯 EXERCISE 3: Implement your heatmap analysis here

def advanced_correlation_analysis(df):
    """
    Advanced correlation and heatmap analysis
    
    Args:
        df: pandas DataFrame with pathology data
    
    Returns:
        pandas.DataFrame: pairwise correlation matrix of the numerical variables
    """
    
    # Select numerical variables for correlation analysis
    numerical_vars = df.select_dtypes(include=[np.number]).columns
    print(f"📊 Analyzing correlations for: {list(numerical_vars)}")
    
    # TODO: Calculate correlation matrix
    corr_matrix = df[numerical_vars].corr()
    
    # TODO: Create advanced heatmap
    plt.figure(figsize=(10, 8))
    
    # Create heatmap with clustering
    # Hint: Use seaborn.clustermap or seaborn.heatmap
    # Include: annot=True, cmap='RdBu_r', center=0
    
    sns.heatmap(corr_matrix, 
                annot=True, 
                cmap='RdBu_r', 
                center=0,
                square=True,
                fmt='.2f',
                cbar_kws={"shrink": .8})
    
    plt.title('Feature Correlation Heatmap', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    # TODO: Identify and print strongest correlations
    # Find pairs with |correlation| > 0.3
    
    return corr_matrix

# Test your implementation  
corr_results = advanced_correlation_analysis(df_pathology)

## 🎯 Exercise 4: UMAP Dimensionality Reduction (30 points)

**Task**: Implement UMAP analysis for pathology data including:
1. Data preprocessing and scaling
2. UMAP embedding with optimized parameters
3. Visualization colored by different variables
4. Interpretation of clustering patterns

**Requirements**:
- Proper data preprocessing
- Multiple UMAP visualizations with different colorings
- Parameter optimization
- Clear interpretation of results
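
Coloring an embedding by a continuous variable such as age or tumor size usually works better after binning it into a few groups. The sketch below shows one way to do that with `pd.cut` and `pd.qcut`; the `toy` data and the age cut points are assumptions for illustration and are not clinically motivated.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_pathology
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(30, 95, size=200),
    "tumor_size": rng.exponential(2.5, size=200),
})

# Fixed age bands; right=False makes each bin include its lower edge
toy["age_group"] = pd.cut(
    toy["age"],
    bins=[0, 50, 65, 80, 120],
    labels=["<50", "50-64", "65-79", "80+"],
    right=False,
)

# Equal-frequency bins: each quartile holds about 25% of the samples
toy["size_quartile"] = pd.qcut(
    toy["tumor_size"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)

print(toy["age_group"].value_counts().sort_index())
print(toy["size_quartile"].value_counts().sort_index())
```

Each resulting category can then be plotted the same way as diagnosis: build a boolean mask per category and scatter the matching rows of the embedding with their own label.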

In [None]:
# 🎯 EXERCISE 4: Implement your UMAP analysis here

def umap_pathology_analysis(df):
    """
    UMAP dimensionality reduction analysis for pathology data
    
    Args:
        df: pandas DataFrame with pathology data
    
    Returns:
        umap_embedding: numpy array with 2D embedding coordinates
    """
    
    # TODO: Prepare data for UMAP
    # Select numerical features
    numerical_features = df.select_dtypes(include=[np.number]).columns
    X = df[numerical_features].copy()
    
    # TODO: Scale the features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # TODO: Apply UMAP
    # Use appropriate parameters: n_neighbors, min_dist, n_components=2
    umap_model = umap.UMAP(
        n_neighbors=15,
        min_dist=0.1,
        n_components=2,
        random_state=42
    )
    
    embedding = umap_model.fit_transform(X_scaled)
    
    # TODO: Create visualization dashboard
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('UMAP Analysis of Pathology Data', fontsize=16, fontweight='bold')
    
    # Plot 1: Color by diagnosis
    for diagnosis in df['diagnosis'].unique():
        mask = df['diagnosis'] == diagnosis
        axes[0, 0].scatter(embedding[mask, 0], embedding[mask, 1], 
                          label=diagnosis, alpha=0.7, s=30)
    axes[0, 0].set_title('UMAP colored by Diagnosis')
    axes[0, 0].legend()
    axes[0, 0].set_xlabel('UMAP 1')
    axes[0, 0].set_ylabel('UMAP 2')
    
    # TODO: Complete remaining plots
    # Plot 2: Color by grade
    # Plot 3: Color by age groups
    # Plot 4: Color by tumor size quartiles
    
    plt.tight_layout()
    plt.show()
    
    return embedding

# Test your implementation
umap_embedding = umap_pathology_analysis(df_pathology)
print(f"✅ UMAP embedding shape: {umap_embedding.shape}")

In [None]:
# 🎯 FINAL ASSESSMENT: Validation and Scoring

def validate_visualization_skills():
    """Validate all visualization exercises and calculate score"""
    
    print("🧪 VISUALIZATION SKILLS ASSESSMENT")
    print("=" * 50)
    
    total_score = 0
    max_score = 100
    
    # Check 1: Data exploration completeness
    try:
        results = explore_pathology_data(df_pathology)
        if results['total_samples'] == 500 and results['missing_data'] == 0:
            exploration_score = 20
            print("✅ Exercise 1 (Data Exploration): 20/20 points")
        else:
            exploration_score = 10
            print("⚠️ Exercise 1 (Data Exploration): 10/20 points - incomplete")
        total_score += exploration_score
    except Exception as exc:
        print(f"❌ Exercise 1 (Data Exploration): 0/20 points - error: {exc}")
    
    # Check 2: Statistical visualization
    try:
        dashboard = create_statistical_dashboard(df_pathology)
        if dashboard is not None:
            viz_score = 25
            print("✅ Exercise 2 (Statistical Visualization): 25/25 points")
        else:
            viz_score = 15
            print("⚠️ Exercise 2 (Statistical Visualization): 15/25 points - incomplete")
        total_score += viz_score
    except Exception as exc:
        print(f"❌ Exercise 2 (Statistical Visualization): 0/25 points - error: {exc}")
    
    # Check 3: Heatmap analysis  
    try:
        corr_matrix = advanced_correlation_analysis(df_pathology)
        if corr_matrix is not None and corr_matrix.shape[0] > 0:
            heatmap_score = 25
            print("✅ Exercise 3 (Heatmap Analysis): 25/25 points")
        else:
            heatmap_score = 15
            print("⚠️ Exercise 3 (Heatmap Analysis): 15/25 points - incomplete")
        total_score += heatmap_score
    except Exception as exc:
        print(f"❌ Exercise 3 (Heatmap Analysis): 0/25 points - error: {exc}")
    
    # Check 4: UMAP analysis
    try:
        embedding = umap_pathology_analysis(df_pathology)
        if embedding.shape == (500, 2):
            umap_score = 30
            print("✅ Exercise 4 (UMAP Analysis): 30/30 points")
        else:
            umap_score = 20
            print("⚠️ Exercise 4 (UMAP Analysis): 20/30 points - incorrect shape")
        total_score += umap_score
    except Exception as exc:
        print(f"❌ Exercise 4 (UMAP Analysis): 0/30 points - error: {exc}")
    
    # Final assessment
    percentage = (total_score / max_score) * 100
    
    print(f"\n🏆 FINAL SCORE: {total_score}/{max_score} ({percentage:.1f}%)")
    
    if percentage >= 80:
        print("🎉 CONGRATULATIONS! You've mastered visualization skills!")
        print("🚀 Ready to advance to Machine Learning tutorials!")
        grade = "PASS"
    elif percentage >= 60:
        print("📚 Good progress! Review areas for improvement and retry.")
        grade = "DEVELOPING"
    else:
        print("📖 Keep practicing! Review the tutorial materials and try again.")
        grade = "NEEDS IMPROVEMENT"
    
    assert percentage >= 60, f"Assessment score too low: {percentage:.1f}%. Minimum 60% required."
    
    return {
        'total_score': total_score,
        'percentage': percentage,
        'grade': grade
    }

# Run final assessment
assessment_results = validate_visualization_skills()

## 📚 Assessment Summary

### Skills Evaluated
1. **Data Exploration** (20%): Pandas proficiency and data understanding
2. **Statistical Visualization** (25%): Matplotlib/seaborn mastery and plot design
3. **Heatmap Analysis** (25%): Correlation analysis and advanced heatmaps
4. **UMAP Analysis** (30%): Dimensionality reduction and pattern recognition

### Scoring Criteria
- **90-100%**: Expert level - Ready for advanced machine learning
- **80-89%**: Proficient - Strong foundation with minor gaps
- **60-79%**: Developing - Good understanding, needs more practice
- **<60%**: Needs improvement - Review tutorial materials
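
The bands above can be expressed as a small lookup. This is a minimal sketch: `score_band` is a hypothetical helper, and treating each band's lower bound as inclusive (so that 89.5% counts as Proficient) is an assumption, since the list does not say how fractional scores are handled.

```python
def score_band(percentage: float) -> str:
    """Return the band name for a percentage score (hypothetical helper)."""
    if percentage >= 90:
        return "Expert"
    if percentage >= 80:
        return "Proficient"
    if percentage >= 60:
        return "Developing"
    return "Needs improvement"


for score in (95, 80, 79.9, 45):
    print(f"{score}% -> {score_band(score)}")
```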

### Key Learning Outcomes
✅ **Data Manipulation**: Effective use of pandas for medical data analysis  
✅ **Statistical Thinking**: Appropriate plot selection and interpretation  
✅ **Visual Design**: Publication-ready figures with proper formatting  
✅ **Advanced Techniques**: UMAP for high-dimensional pathology data  

### Next Steps
Upon successful completion, you're ready for:
- **Machine Learning Classification** tutorials
- **Deep Learning with CNNs** for pathology images
- **Advanced spatial analysis** techniques

🎓 **Well done!** You've demonstrated strong visualization skills for digital pathology!