# 🔭 Exoplanet Classification - Quick Start Tutorial

This notebook demonstrates how to use the exoplanet classification system to:
1. Load and explore exoplanet data
2. Train a machine learning model
3. Make predictions and evaluate performance
4. Analyze feature importance and model explanations

Perfect for NASA Space Apps Challenge participants!

In [None]:
# Import required libraries
import sys
import os
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import our custom modules
from data import load_dataset, create_sample_data
from model import ExoplanetClassifier
from explain import ModelExplainer

# Set style
plt.style.use('default')
sns.set_palette("husl")

print("✅ All imports successful!")

## 📊 Step 1: Create and Explore Sample Data

First, let's create synthetic sample data that mimics the structure of real exoplanet catalogs.

In [None]:
# Create sample data
print("Creating sample exoplanet data...")
sample_df = create_sample_data("../data/sample_kepler.csv", n_samples=1000)

print(f"✅ Created {len(sample_df)} samples with {len(sample_df.columns)} features")
print(f"\nDataset shape: {sample_df.shape}")
print(f"\nColumns: {list(sample_df.columns)}")
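
`create_sample_data` lives in the project's `data` module, so its internals aren't shown here. As a rough idea of how such a generator might work, here is a minimal sketch. The function name `make_toy_catalog`, the chosen distributions, the class proportions, and the exact label strings are all assumptions for illustration, not the project's implementation.

```python
import numpy as np
import pandas as pd


def make_toy_catalog(n_samples=1000, seed=42):
    """Hypothetical stand-in for create_sample_data (illustration only)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        # Skewed, strictly positive quantities drawn from assumed distributions
        "orbital_period": rng.lognormal(mean=2.0, sigma=1.5, size=n_samples),
        "transit_depth": rng.lognormal(mean=6.5, sigma=1.0, size=n_samples),
        "snr": rng.gamma(shape=2.0, scale=8.0, size=n_samples),
    })
    # Labels are sampled independently of the features here, so this toy
    # data has no learnable signal; a real generator would tie them together.
    df["disposition"] = rng.choice(
        ["CONFIRMED", "CANDIDATE", "FALSE POSITIVE"],  # assumed label strings
        size=n_samples,
        p=[0.3, 0.4, 0.3],
    )
    return df


toy_df = make_toy_catalog(n_samples=500)
print(toy_df.shape)
print(toy_df["disposition"].value_counts(normalize=True).round(2))
```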

In [None]:
# Explore the data
print("📊 Dataset Overview")
print("=" * 50)

# Basic info
display(sample_df.head())

print("\n📈 Basic Statistics:")
display(sample_df.describe())

In [None]:
# Analyze class distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Class counts
class_counts = sample_df['disposition'].value_counts()
axes[0].bar(class_counts.index, class_counts.values, 
           color=['green' if x == 'CONFIRMED' else 'orange' if x == 'CANDIDATE' else 'red' 
                  for x in class_counts.index])
axes[0].set_title('Class Distribution (Counts)')
axes[0].set_ylabel('Number of Samples')
plt.setp(axes[0].xaxis.get_majorticklabels(), rotation=45)

# Class proportions
class_props = sample_df['disposition'].value_counts(normalize=True)
axes[1].pie(class_props.values, labels=class_props.index, autopct='%1.1f%%',
           colors=['green' if x == 'CONFIRMED' else 'orange' if x == 'CANDIDATE' else 'red' 
                   for x in class_props.index])
axes[1].set_title('Class Distribution (Proportions)')

plt.tight_layout()
plt.show()

print("\n🎯 Class Summary:")
for cls, count in class_counts.items():
    pct = count / len(sample_df) * 100
    emoji = "🟢" if cls == "CONFIRMED" else "🟡" if cls == "CANDIDATE" else "🔴"
    print(f"{emoji} {cls}: {count} samples ({pct:.1f}%)")
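
If the classes turn out to be imbalanced, one common response is to weight each class inversely to its frequency. The sketch below computes such weights on a small made-up label series; the counts are invented, and whether `ExoplanetClassifier` accepts class weights is not something this notebook shows.

```python
import pandas as pd

# Made-up labels for illustration: 50 / 30 / 20 split
toy_labels = pd.Series(
    ["CONFIRMED"] * 50 + ["CANDIDATE"] * 30 + ["FALSE POSITIVE"] * 20
)

counts = toy_labels.value_counts()

# weight = n_samples / (n_classes * class_count), the same heuristic
# scikit-learn uses for class_weight="balanced"
class_weights = (len(toy_labels) / (counts.size * counts)).to_dict()

for cls, weight in class_weights.items():
    print(f"{cls}: {weight:.3f}")
```

Rarer classes receive larger weights, so mistakes on them cost more during training.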

## 🔬 Step 2: Load and Preprocess Data

Now let's use our data loading pipeline to properly preprocess the data for machine learning.

In [None]:
# Load data using our pipeline
X, y, numeric_cols, cat_cols = load_dataset("../data/sample_kepler.csv")

print(f"✅ Data loaded successfully!")
print(f"Features (X): {X.shape}")
print(f"Labels (y): {y.shape}")
print(f"\nNumeric features ({len(numeric_cols)}): {numeric_cols}")
print(f"\nCategorical features ({len(cat_cols)}): {cat_cols}")

# Check for missing values
missing_values = X.isnull().sum()
print(f"\n🔍 Missing values:")
if missing_values.sum() == 0:
    print("✅ No missing values found!")
else:
    print(missing_values[missing_values > 0])
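
If the check above does report missing values, median imputation is a simple way to handle them. The following is a minimal sketch on a toy frame with invented values; the project's pipeline may already impute internally, which this notebook does not show.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column (values invented for illustration)
toy_X = pd.DataFrame({
    "orbital_period": [5.2, np.nan, 1.5, 88.0],
    "snr": [15.2, 8.5, np.nan, 12.1],
})

# Fraction of missing entries per column
missing_frac = toy_X.isnull().mean()
print(missing_frac)

# Fill each column with its own median (NaNs are skipped when computing it)
medians = toy_X.median()
toy_X_filled = toy_X.fillna(medians)
print(toy_X_filled)
```

In a real workflow the medians should be computed on the training split only and then reused for the test split, so no information leaks from test to train.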

In [None]:
# Visualize feature distributions
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

# Plot the first 6 numeric features (in column order, not ranked by importance)
top_features = numeric_cols[:6]

for i, feature in enumerate(top_features):
    if feature in X.columns:
        axes[i].hist(X[feature].dropna(), bins=30, alpha=0.7, edgecolor='black')
        axes[i].set_title(f'{feature}')
        axes[i].set_xlabel('Value')
        axes[i].set_ylabel('Frequency')
    else:
        axes[i].text(0.5, 0.5, f'{feature}\n(not available)', 
                    ha='center', va='center', transform=axes[i].transAxes)

plt.tight_layout()
plt.suptitle('Feature Distributions', y=1.02, fontsize=16)
plt.show()

## 🤖 Step 3: Train the Machine Learning Model

Now let's train our XGBoost classifier to distinguish between confirmed planets, candidates, and false positives.

In [None]:
# Split the data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"📊 Data Split:")
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"\n🎯 Training set class distribution:")
print(y_train.value_counts())
print(f"\n🎯 Test set class distribution:")
print(y_test.value_counts())
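
To see what `stratify=y` buys us, here is a small self-contained sketch on made-up labels. It checks that the class proportions in the test split stay close to those of the full dataset; the toy data and variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up labels: 30% / 50% / 20%
toy_y = pd.Series(
    ["CONFIRMED"] * 30 + ["CANDIDATE"] * 50 + ["FALSE POSITIVE"] * 20
)
toy_X = pd.DataFrame({"feature": range(len(toy_y))})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42, stratify=toy_y
)

# Compare class proportions in the full data and in each split
comparison = pd.DataFrame({
    "full": toy_y.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(comparison.round(3))
```

Without stratification, a small test set can end up with too few samples of a rare class, which makes per-class metrics noisy.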

In [None]:
# Initialize and build the classifier
print("🚀 Initializing ExoplanetClassifier...")
classifier = ExoplanetClassifier(random_state=42)
classifier.build_pipeline(numeric_cols, cat_cols)

print("✅ Pipeline built successfully!")
print("\n🔧 Pipeline steps:")
for step_name in classifier.pipeline.named_steps.keys():
    print(f"  • {step_name}")
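
`build_pipeline` is defined in the project's `model` module, so the exact steps may differ from what follows. As a hedged sketch of what a pipeline taking `numeric_cols` and `cat_cols` could look like, the code below combines a `ColumnTransformer` with a classifier. The step name `preprocessor`, the column `mission`, and the imputation and scaling choices are assumptions. scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost so the sketch depends only on scikit-learn.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_toy_pipeline(numeric_cols, cat_cols, random_state=42):
    """Hypothetical pipeline builder (not the project's build_pipeline)."""
    # Numeric columns: fill gaps with the median, then standardize
    numeric = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])
    # Categorical columns: fill gaps with the mode, then one-hot encode
    categorical = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocessor = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, cat_cols),
    ])
    return Pipeline([
        ("preprocessor", preprocessor),
        ("classifier", GradientBoostingClassifier(random_state=random_state)),
    ])


toy_pipeline = build_toy_pipeline(["orbital_period", "snr"], ["mission"])
print(list(toy_pipeline.named_steps))
```

Keeping preprocessing inside the pipeline means the same transformations are applied at training and prediction time.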

In [None]:
# Train the model (quick training without hyperparameter tuning)
print("🎯 Training the model...")
print("(Using quick training for this demo - set tune_params=True for better results)")

classifier.train(X_train, y_train, tune_params=False)

print("✅ Training completed!")

## 📈 Step 4: Evaluate Model Performance

Let's see how well our model performs on the test set.

In [None]:
# Evaluate on test set
print("📊 Evaluating model performance...")
metrics = classifier.evaluate(X_test, y_test, detailed=True)

print("\n🎯 Key Metrics:")
print(f"Overall Accuracy: {metrics['accuracy']:.3f}")
print(f"Macro Average Precision: {metrics['macro_average_precision']:.3f}")

print("\n📋 Per-Class Average Precision:")
for cls, ap in metrics['average_precision_scores'].items():
    emoji = "🟢" if cls == "CONFIRMED" else "🟡" if cls == "CANDIDATE" else "🔴"
    print(f"{emoji} {cls}: {ap:.3f}")
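
For readers curious how a macro average precision can be derived, here is a minimal one-vs-rest sketch using scikit-learn. The labels and scores are invented, and the project's `evaluate` method may compute its metrics differently.

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import label_binarize

toy_classes = ["CANDIDATE", "CONFIRMED", "FALSE POSITIVE"]
y_true = np.array([
    "CONFIRMED", "CANDIDATE", "FALSE POSITIVE",
    "CONFIRMED", "CANDIDATE", "FALSE POSITIVE",
])
# Invented class probabilities; columns follow the order of toy_classes
y_score = np.array([
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.1, 0.7],
    [0.5, 0.4, 0.1],
    [0.4, 0.2, 0.4],
    [0.3, 0.1, 0.6],
])

# One binary indicator column per class (one-vs-rest)
y_onehot = label_binarize(y_true, classes=toy_classes)

per_class_ap = {
    cls: average_precision_score(y_onehot[:, i], y_score[:, i])
    for i, cls in enumerate(toy_classes)
}
# Macro average: unweighted mean over classes
macro_ap = float(np.mean(list(per_class_ap.values())))

for cls, ap in per_class_ap.items():
    print(f"{cls}: {ap:.3f}")
print(f"Macro average precision: {macro_ap:.3f}")
```

Because the macro average weights every class equally, a weak score on a rare class pulls it down more than it would pull down overall accuracy.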

In [None]:
# Visualize confusion matrix (matplotlib was already imported in the first cell)
from sklearn.metrics import ConfusionMatrixDisplay

fig, ax = plt.subplots(figsize=(8, 6))

# Create confusion matrix display
cm_display = ConfusionMatrixDisplay(
    confusion_matrix=metrics['confusion_matrix'],
    display_labels=classifier.pipeline.classes_
)

cm_display.plot(ax=ax, cmap='Blues', values_format='d')
ax.set_title('Confusion Matrix', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()
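
The confusion matrix also gives per-class precision and recall directly. The sketch below works on an invented 3x3 matrix; with real results you could pass `metrics['confusion_matrix']` instead, assuming it is a NumPy array.

```python
import numpy as np

# Invented confusion matrix: rows are true labels, columns are predictions
# (scikit-learn's convention)
toy_cm = np.array([
    [50, 5, 5],
    [4, 40, 6],
    [6, 10, 74],
])

correct = np.diag(toy_cm)

# Recall: correct predictions divided by the number of true samples per class
recall = correct / toy_cm.sum(axis=1)
# Precision: correct predictions divided by the number of predictions per class
precision = correct / toy_cm.sum(axis=0)
# Accuracy: all correct predictions divided by all samples
accuracy = correct.sum() / toy_cm.sum()

print("Recall:   ", recall.round(3))
print("Precision:", precision.round(3))
print(f"Accuracy:  {accuracy:.3f}")
```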

In [None]:
# Prediction probabilities analysis
test_probs = classifier.predict_proba(X_test)
test_preds = classifier.predict(X_test)

# Plot prediction confidence distribution
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

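classes = classifier.pipeline.classes_
# Map colors by class name so they match the earlier plots regardless of class order
color_map = {'CONFIRMED': 'green', 'CANDIDATE': 'orange'}
colors = [color_map.get(cls, 'red') for cls in classes]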

for i, (cls, color) in enumerate(zip(classes, colors)):
    class_probs = test_probs[:, i]
    
    axes[i].hist(class_probs, bins=20, alpha=0.7, color=color, edgecolor='black')
    axes[i].set_title(f'{cls} Probabilities')
    axes[i].set_xlabel('Probability')
    axes[i].set_ylabel('Frequency')
    axes[i].axvline(class_probs.mean(), color='black', linestyle='--', 
                   label=f'Mean: {class_probs.mean():.3f}')
    axes[i].legend()

plt.tight_layout()
plt.suptitle('Prediction Confidence Distributions', y=1.02, fontsize=16)
plt.show()

# Show confidence statistics
max_probs = np.max(test_probs, axis=1)
print(f"\n🎯 Prediction Confidence Statistics:")
print(f"Mean confidence: {max_probs.mean():.3f}")
print(f"Median confidence: {np.median(max_probs):.3f}")
print(f"Min confidence: {max_probs.min():.3f}")
print(f"Max confidence: {max_probs.max():.3f}")
print(f"High confidence predictions (>0.8): {(max_probs > 0.8).sum()} ({(max_probs > 0.8).mean()*100:.1f}%)")

## 🔍 Step 5: Feature Importance and Model Explanation

Let's understand which features are most important for the model's decisions.

In [None]:
# Get feature importance from XGBoost
xgb_classifier = classifier.pipeline.named_steps['classifier']

if hasattr(xgb_classifier, 'feature_importances_'):
    # Create feature importance DataFrame
    feature_names = numeric_cols + cat_cols  # Simplified feature names
    
    # Note: After preprocessing, feature names might be different
    # This is a simplified version for demonstration
    if len(feature_names) == len(xgb_classifier.feature_importances_):
        importance_df = pd.DataFrame({
            'feature': feature_names,
            'importance': xgb_classifier.feature_importances_
        }).sort_values('importance', ascending=False)
    else:
        # Fallback if feature names don't match
        importance_df = pd.DataFrame({
            'feature': [f'feature_{i}' for i in range(len(xgb_classifier.feature_importances_))],
            'importance': xgb_classifier.feature_importances_
        }).sort_values('importance', ascending=False)
    
    # Plot top 15 features
    top_features = importance_df.head(15)
    
    plt.figure(figsize=(10, 8))
    plt.barh(range(len(top_features)), top_features['importance'], color='skyblue')
    plt.yticks(range(len(top_features)), top_features['feature'])
    plt.xlabel('Feature Importance')
    plt.title('Top 15 Most Important Features')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()
    
    print("🔝 Top 10 Most Important Features:")
    for i, (_, row) in enumerate(top_features.head(10).iterrows(), 1):
        print(f"{i:2d}. {row['feature']}: {row['importance']:.4f}")
        
else:
    print("❌ Feature importance not available for this model type")

In [None]:
# Initialize explainer for more detailed analysis
print("🔬 Initializing model explainer...")

try:
    explainer = ModelExplainer(classifier.pipeline, X_train)
    
    # Compute permutation importance
    print("📊 Computing permutation importance...")
    perm_importance = explainer.compute_permutation_importance(X_test, y_test, n_repeats=5)
    
    print("\n🎯 Top 10 Features by Permutation Importance:")
    for i, (_, row) in enumerate(perm_importance.head(10).iterrows(), 1):
        print(f"{i:2d}. {row['feature']}: {row['importance_mean']:.4f} (±{row['importance_std']:.4f})")
        
    # Plot permutation importance
    explainer.plot_permutation_importance(top_n=15)
    
except Exception as e:
    print(f"⚠️ Model explanation step failed: {e}")
    print("This can happen if optional dependencies are missing.")
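
If `ModelExplainer` is unavailable, permutation importance can also be computed directly with scikit-learn. The sketch below uses toy data and a random forest as stand-ins. The column names `importance_mean` and `importance_std` mirror the ones printed above, but the rest is an assumption about how the explainer works.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Toy data: the label depends only on the "informative" column
rng = np.random.default_rng(0)
n = 300
toy_X = pd.DataFrame({
    "informative": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
toy_y = (toy_X["informative"] > 0).astype(int)

# Fit on the first 200 rows, measure importance on the held-out 100
X_fit, X_eval = toy_X.iloc[:200], toy_X.iloc[200:]
y_fit, y_eval = toy_y.iloc[:200], toy_y.iloc[200:]
toy_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# Shuffle each column in turn and record the drop in score
result = permutation_importance(
    toy_model, X_eval, y_eval, n_repeats=5, random_state=0
)

perm_df = pd.DataFrame({
    "feature": toy_X.columns,
    "importance_mean": result.importances_mean,
    "importance_std": result.importances_std,
}).sort_values("importance_mean", ascending=False)
print(perm_df)
```

Unlike the built-in tree importances, this measures how much held-out performance depends on each input column, so it works with any fitted estimator.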

## 🎯 Step 6: Make Predictions on New Data

Let's see how to use the trained model to classify new exoplanet candidates.

In [None]:
# Create a few example candidates
new_candidates = pd.DataFrame({
    'orbital_period': [5.2, 365.25, 1.5, 88.0],
    'transit_depth': [2000, 800, 5000, 1200],
    'transit_duration': [2.5, 8.0, 1.2, 3.0],
    'planet_radius': [1.1, 1.8, 0.8, 2.5],
    'stellar_teff': [5800, 4200, 6500, 5400],
    'stellar_logg': [4.5, 4.8, 4.2, 4.6],
    'stellar_metallicity': [0.1, -0.2, 0.3, 0.0],
    'stellar_radius': [1.0, 0.8, 1.2, 0.9],
    'stellar_mass': [1.0, 0.7, 1.3, 0.8],
    'snr': [15.2, 8.5, 25.0, 12.1],
    'impact_parameter': [0.3, 0.7, 0.1, 0.5],
    'disposition_score': [0.85, 0.45, 0.92, 0.65]
})

print("🔮 Making predictions for new candidates...")
print("\n📋 New Candidates:")
display(new_candidates)

# Make predictions
predictions = classifier.predict(new_candidates)
probabilities = classifier.predict_proba(new_candidates)

# Create results DataFrame
results_df = new_candidates.copy()
results_df['Prediction'] = predictions
results_df['Confidence'] = np.max(probabilities, axis=1)

# Add probability columns
for i, cls in enumerate(classifier.pipeline.classes_):
    results_df[f'P({cls})'] = probabilities[:, i]

print("\n🎯 Prediction Results:")
display(results_df[['orbital_period', 'transit_depth', 'snr', 'Prediction', 'Confidence']])

print("\n📊 Detailed Probabilities:")
for i, pred in enumerate(predictions):
    emoji = "🟢" if pred == "CONFIRMED" else "🟡" if pred == "CANDIDATE" else "🔴"
    print(f"\nCandidate {i+1}: {emoji} {pred} (Confidence: {np.max(probabilities[i]):.3f})")
    for j, cls in enumerate(classifier.pipeline.classes_):
        print(f"  P({cls}): {probabilities[i,j]:.3f}")

In [None]:
# Visualize prediction probabilities
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()

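classes = classifier.pipeline.classes_
# Map colors by class name so they match the earlier plots regardless of class order
color_map = {'CONFIRMED': 'green', 'CANDIDATE': 'orange'}
colors = [color_map.get(cls, 'red') for cls in classes]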

for i in range(len(new_candidates)):
    ax = axes[i]
    probs = probabilities[i]
    
    bars = ax.bar(classes, probs, color=colors)
    ax.set_title(f'Candidate {i+1}: {predictions[i]}')
    ax.set_ylabel('Probability')
    ax.set_ylim(0, 1)
    
    # Add value labels on bars
    for bar, prob in zip(bars, probs):
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height + 0.01,
                f'{prob:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.suptitle('Prediction Probabilities for New Candidates', y=1.02, fontsize=16)
plt.show()

## 💾 Step 7: Save the Model

Let's save our trained model so we can use it later or deploy it in the web app.

In [None]:
# Save the model
print("💾 Saving trained model...")

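model_path = "../models/exoplanet_classifier.joblib"
metadata_path = "../models/metadata.json"

# Make sure the output directory exists before saving
os.makedirs(os.path.dirname(model_path), exist_ok=True)

classifier.save(model_path, metadata_path)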

print(f"✅ Model saved successfully!")
print(f"Model file: {model_path}")
print(f"Metadata file: {metadata_path}")

# Verify we can load it back
print("\n🔄 Testing model loading...")
loaded_classifier = ExoplanetClassifier.load(model_path, metadata_path)
print("✅ Model loaded successfully!")

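# Quick test: the reloaded model should reproduce the original prediction
test_pred = loaded_classifier.predict(new_candidates.iloc[:1])
print(f"\n🧪 Quick test - Prediction for first candidate: {test_pred[0]}")
if test_pred[0] == predictions[0]:
    print("✅ Loaded model matches the original prediction!")
else:
    print("⚠️ Loaded model disagrees with the original prediction - check the saved files.")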
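
The file names above suggest the model is stored with joblib and its metadata as JSON. Assuming that is the case, a save and load round trip could look roughly like the sketch below. It uses a toy model and a temporary directory, and the metadata fields are invented; `ExoplanetClassifier.save` may store different information.

```python
import json
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression

# Toy model standing in for the trained pipeline
toy_model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
toy_metadata = {
    "feature_names": ["toy_feature"],                 # invented field
    "classes": [int(c) for c in toy_model.classes_],  # invented field
}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_model_path = os.path.join(tmp_dir, "model.joblib")
    toy_metadata_path = os.path.join(tmp_dir, "metadata.json")

    # Save the estimator with joblib and the metadata as plain JSON
    joblib.dump(toy_model, toy_model_path)
    with open(toy_metadata_path, "w") as f:
        json.dump(toy_metadata, f)

    # Load both back before the temporary directory is removed
    restored_model = joblib.load(toy_model_path)
    with open(toy_metadata_path) as f:
        restored_metadata = json.load(f)

# The restored model should give the same predictions as the original
probe = [[0.5], [2.5]]
same_predictions = bool((restored_model.predict(probe) == toy_model.predict(probe)).all())
print(same_predictions, restored_metadata)
```

Joblib files are pickle-based, so only load model files from sources you trust.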

## 🎉 Congratulations!

You've successfully:

1. ✅ **Created sample exoplanet data** with realistic features
2. ✅ **Trained an XGBoost classifier** to distinguish between confirmed planets, candidates, and false positives
3. ✅ **Evaluated model performance** with accuracy, a confusion matrix, and per-class average precision
4. ✅ **Analyzed feature importance** to understand what drives predictions
5. ✅ **Made predictions on new candidates** with confidence scores
6. ✅ **Saved the model** for later use

## 🚀 Next Steps

1. **Launch the web app**: Run `streamlit run ../app/streamlit_app.py` to try the interactive interface

2. **Try real data**: Download actual Kepler or TESS data and train on real exoplanet catalogs

3. **Improve the model**: 
   - Enable hyperparameter tuning with `tune_params=True`
   - Try SMOTE for class imbalance with `use_smote=True`
   - Add more sophisticated feature engineering

4. **Add explainability**: Install SHAP (`pip install shap`) for detailed prediction explanations

5. **Deploy for the challenge**: Use this as your base system for the NASA Space Apps Challenge!
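
To give a feel for the hyperparameter tuning suggested in step 3 above, here is a minimal randomized search sketch with scikit-learn. What `tune_params=True` does inside `ExoplanetClassifier` is not shown in this notebook, so the search space, the toy dataset, and the use of `GradientBoostingClassifier` in place of XGBoost are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Toy three-class dataset standing in for the exoplanet features
toy_X, toy_y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)

# Small, assumed search space; real searches are usually wider
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1, 0.2],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=4,            # number of sampled parameter combinations
    cv=3,                # 3-fold cross-validation per combination
    scoring="accuracy",
    random_state=42,
)
search.fit(toy_X, toy_y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

Randomized search samples a fixed number of combinations, which keeps the cost predictable when time is limited.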

---

**Happy exoplanet hunting!** 🔭✨