# 🚀 Exoplanet Detection - Google Colab Training (2025)

**Optimized for Google Colab with latest packages (October 2025)**

This notebook:
- Downloads pre-extracted features from a GitHub Release
- Trains an XGBoost model on a balanced dataset (500 samples per class)
- Runs on Colab Free/Pro, using a GPU when one is available
- Targets recent package versions (XGBoost 2.1.x, scikit-learn 1.5.x)

## 📋 Step 1: Environment Setup

Install latest packages compatible with Python 3.10+ (Colab default in 2025)

In [None]:
# Check Python version
import sys
print(f"Python version: {sys.version}")

# Install/upgrade packages (minimum versions; quoted so the shell does not treat ">" as a redirect)
!pip install -q --upgrade pip
!pip install -q "xgboost>=2.1.0" "scikit-learn>=1.5.0" "pandas>=2.2.0" "numpy>=1.26.0" "matplotlib>=3.9.0" "seaborn>=0.13.0"

print("✅ Packages installed successfully!")

In [None]:
# Verify installations
import xgboost as xgb
import sklearn
import pandas as pd
import numpy as np

print(f"XGBoost: {xgb.__version__}")
print(f"scikit-learn: {sklearn.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"NumPy: {np.__version__}")

## 📥 Step 2: Download Pre-extracted Features

Download `balanced_features.csv` from GitHub Release

In [None]:
import os
import urllib.request
from pathlib import Path

# GitHub Release URL (update with your actual release)
REPO_OWNER = "exoplanet-spaceapps"
REPO_NAME = "exoplanet-starter"
RELEASE_TAG = "v1.0-features"  # Update this!
ASSET_NAME = "balanced_features.csv"

# Download URL
download_url = f"https://github.com/{REPO_OWNER}/{REPO_NAME}/releases/download/{RELEASE_TAG}/{ASSET_NAME}"

# Create data directory
os.makedirs('data', exist_ok=True)
features_path = Path('data/balanced_features.csv')

# Download if not exists
if not features_path.exists():
    print(f"📥 Downloading features from: {download_url}")
    urllib.request.urlretrieve(download_url, features_path)
    print(f"✅ Downloaded: {features_path}")
else:
    print(f"✅ Features already exist: {features_path}")

# Verify file
file_size = features_path.stat().st_size / 1024 / 1024
print(f"📊 File size: {file_size:.2f} MB")
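
Before loading the file, it can help to confirm that the download is really the expected CSV rather than an error page. The following is a minimal sketch using only the standard library; the helper name `missing_columns` is hypothetical, and the required columns are assumed from the ones this notebook reads later (`status`, `label`, and the feature columns).

```python
import csv
from pathlib import Path

def missing_columns(csv_path, required):
    """Return the required column names that are absent from the CSV header."""
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        # Only the first row (the header) is read; an empty file yields []
        header = next(csv.reader(f), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]
```

For example, `missing_columns('data/balanced_features.csv', ['status', 'label', 'bls_period'])` should return an empty list if the file is intact.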

## 🔍 Step 3: Load and Explore Data

In [None]:
# Load features
df = pd.read_csv('data/balanced_features.csv')

print(f"Total samples: {len(df)}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nFirst few rows:")
df.head()

In [None]:
# Filter successful extractions
df_success = df[df['status'] == 'success'].copy()

print(f"Successful extractions: {len(df_success)} ({len(df_success)/len(df)*100:.1f}%)")

# Label distribution
label_counts = df_success['label'].value_counts()
print(f"\nLabel distribution:")
print(f"  True (label=1): {label_counts.get(1, 0)}")
print(f"  False (label=0): {label_counts.get(0, 0)}")

# Data info
df_success.info()

## 📊 Step 4: Data Preparation

In [None]:
from sklearn.model_selection import train_test_split

# Feature columns
feature_columns = [
    'flux_mean', 'flux_std', 'flux_median', 'flux_mad',
    'flux_skew', 'flux_kurt',
    'bls_period', 'bls_duration', 'bls_depth', 'bls_power', 'bls_snr'
]

# Prepare X and y
X = df_success[feature_columns].copy()
y = df_success['label'].copy()

# Handle NaN values
nan_counts = X.isnull().sum()
if nan_counts.sum() > 0:
    print("⚠️ NaN values detected, filling with median:")
    for col in feature_columns:
        if X[col].isnull().sum() > 0:
            median_val = X[col].median()
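            # Assign back rather than calling fillna(inplace=True) on a column
            # selection, which pandas 2.2+ warns about
            X[col] = X[col].fillna(median_val)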
            print(f"  {col}: {nan_counts[col]} NaNs filled with {median_val:.4f}")

print(f"\n✅ Features shape: {X.shape}")
print(f"✅ Labels shape: {y.shape}")
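
The median fill used in this step can be illustrated on a plain list. This is a small sketch with a hypothetical helper, `fill_nan_with_median`; like the pandas `median()` call, it ignores NaN entries when computing the fill value.

```python
import math
from statistics import median

def fill_nan_with_median(values):
    """Replace NaN entries with the median of the non-NaN entries."""
    observed = [v for v in values if not math.isnan(v)]
    if not observed:
        return list(values)  # nothing to compute a median from
    fill = median(observed)
    return [fill if math.isnan(v) else v for v in values]

# The median of the observed values [1.0, 3.0, 10.0] is used for the gap
print(fill_nan_with_median([1.0, float("nan"), 3.0, 10.0]))
```

A stricter variant would compute the median on the training rows only, after the split, so that no test-set information influences the fill value.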

In [None]:
# Train-test split (stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"\nTrain labels: {dict(y_train.value_counts())}")
print(f"Test labels: {dict(y_test.value_counts())}")
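
To make the effect of `stratify=y` concrete, here is a rough pure-Python sketch of a stratified split: indices are grouped by label, each group is shuffled, and the same fraction of every group goes to the test set. The function name `stratified_split_indices` is hypothetical, and this is not scikit-learn's implementation, only an illustration of the idea.

```python
import random
from collections import defaultdict

def stratified_split_indices(labels, test_size=0.2, seed=42):
    """Split row indices so each class keeps roughly the same test fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within the class only
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Assumed layout: 500 negatives followed by 500 positives
labels = [0] * 500 + [1] * 500
train_idx, test_idx = stratified_split_indices(labels)
print(len(train_idx), len(test_idx))
```

With 500 samples per class and `test_size=0.2`, this yields 800 training and 200 test indices, with 100 samples of each class in the test set.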

## 🤖 Step 5: Train XGBoost Model

Uses the XGBoost 2.x scikit-learn API (`XGBClassifier` with the `device` parameter and the `hist` tree method).

In [None]:
import shutil

# Heuristic GPU check: assume a CUDA device is usable when `nvidia-smi`
# is on PATH (the case on Colab GPU runtimes)
gpu_available = shutil.which('nvidia-smi') is not None

# XGBoost parameters
xgb_params = {
    'max_depth': 6,
    'learning_rate': 0.1,
    'n_estimators': 100,
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'random_state': 42,
    'tree_method': 'hist',  # Histogram-based trees (default since XGBoost 2.0)
    'device': 'cuda' if gpu_available else 'cpu'  # `device` parameter (XGBoost 2.0+)
}

print(f"Training device: {xgb_params['device']}")
print(f"\nParameters:")
for key, val in xgb_params.items():
    print(f"  {key}: {val}")

In [None]:
%%time
# Train model
model = xgb.XGBClassifier(**xgb_params)

model.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    verbose=10  # Show progress every 10 iterations
)

print("\n✅ Training complete!")

## 📈 Step 6: Model Evaluation

In [None]:
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report, roc_curve
)
import matplotlib.pyplot as plt
import seaborn as sns

# Predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

y_pred_proba_train = model.predict_proba(X_train)[:, 1]
y_pred_proba_test = model.predict_proba(X_test)[:, 1]

# Metrics
metrics = {
    'Train': {
        'Accuracy': accuracy_score(y_train, y_pred_train),
        'Precision': precision_score(y_train, y_pred_train),
        'Recall': recall_score(y_train, y_pred_train),
        'F1': f1_score(y_train, y_pred_train),
        'ROC-AUC': roc_auc_score(y_train, y_pred_proba_train)
    },
    'Test': {
        'Accuracy': accuracy_score(y_test, y_pred_test),
        'Precision': precision_score(y_test, y_pred_test),
        'Recall': recall_score(y_test, y_pred_test),
        'F1': f1_score(y_test, y_pred_test),
        'ROC-AUC': roc_auc_score(y_test, y_pred_proba_test)
    }
}

# Display metrics
metrics_df = pd.DataFrame(metrics).T
print("\n📊 Model Performance:")
print(metrics_df.round(4))

# Highlight test metrics
print("\n🎯 Test Set Performance:")
print(f"  Accuracy:  {metrics['Test']['Accuracy']:.2%}")
print(f"  Precision: {metrics['Test']['Precision']:.2%}")
print(f"  Recall:    {metrics['Test']['Recall']:.2%}")
print(f"  F1:        {metrics['Test']['F1']:.2%}")
print(f"  ROC-AUC:   {metrics['Test']['ROC-AUC']:.2%}")

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_test)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['No Exoplanet', 'Exoplanet'],
            yticklabels=['No Exoplanet', 'Exoplanet'])
plt.title('Confusion Matrix (Test Set)', fontsize=14, fontweight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

print(f"\nConfusion Matrix Breakdown:")
print(f"  True Negatives:  {cm[0, 0]} (correctly predicted no exoplanet)")
print(f"  False Positives: {cm[0, 1]} (incorrectly predicted exoplanet)")
print(f"  False Negatives: {cm[1, 0]} (missed exoplanet)")
print(f"  True Positives:  {cm[1, 1]} (correctly predicted exoplanet)")
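
The accuracy, precision, recall, and F1 scores reported in Step 6 follow directly from these four counts. As a worked example, the sketch below recomputes them in plain Python; the function name and the counts are made up for illustration and are not results from this notebook.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"Accuracy": accuracy, "Precision": precision,
            "Recall": recall, "F1": f1}

# Hypothetical counts for a 200-sample test set (not actual results)
example = metrics_from_counts(tn=80, fp=20, fn=10, tp=90)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Passing `cm[0, 0]`, `cm[0, 1]`, `cm[1, 0]`, and `cm[1, 1]` in that order should reproduce the test-set values computed with scikit-learn.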

In [None]:
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba_test)
roc_auc = roc_auc_score(y_test, y_pred_proba_test)

plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curve - Exoplanet Detection', fontsize=14, fontweight='bold')
plt.legend(loc="lower right", fontsize=12)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
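
ROC-AUC has a useful interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes that quantity directly by comparing every positive/negative pair. `pairwise_auc` is a hypothetical helper written for clarity rather than speed, and it is not how scikit-learn computes the value.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

# Three of the four positive/negative pairs are ordered correctly
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

On the test split, `pairwise_auc(list(y_test), list(y_pred_proba_test))` should agree with `roc_auc_score` up to floating-point error.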

## 🔍 Step 7: Feature Importance

In [None]:
# Feature importance
feature_importance = pd.DataFrame({
    'feature': feature_columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'], feature_importance['importance'])
plt.xlabel('Importance', fontsize=12)
plt.title('Feature Importance - Exoplanet Detection', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("\nFeature Importance Ranking:")
for idx, row in feature_importance.iterrows():
    print(f"  {row['feature']}: {row['importance']:.4f}")

## 💾 Step 8: Save Model

In [None]:
import json
from datetime import datetime

# Create models directory
os.makedirs('models', exist_ok=True)

# Save model in XGBoost's JSON format
model_path = 'models/xgboost_model.json'
model.save_model(model_path)
print(f"✅ Model saved: {model_path}")

# Save training report
report = {
    'timestamp': datetime.now().isoformat(),
    'environment': 'Google Colab',
    'xgboost_version': xgb.__version__,
    'sklearn_version': sklearn.__version__,
    'dataset': {
        'total_samples': len(df_success),
        'train_samples': len(X_train),
        'test_samples': len(X_test),
        'features': feature_columns
    },
    'model': {
        'type': 'XGBClassifier',
        'parameters': xgb_params,
        'device': xgb_params['device']
    },
    'metrics': {
        'train': {k: float(v) for k, v in metrics['Train'].items()},
        'test': {k: float(v) for k, v in metrics['Test'].items()}
    },
    'confusion_matrix': {
        'true_negatives': int(cm[0, 0]),
        'false_positives': int(cm[0, 1]),
        'false_negatives': int(cm[1, 0]),
        'true_positives': int(cm[1, 1])
    },
    'feature_importance': feature_importance.to_dict('records')
}

report_path = 'models/colab_training_report.json'
with open(report_path, 'w') as f:
    json.dump(report, f, indent=2)

print(f"✅ Report saved: {report_path}")
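
To confirm that the report round-trips cleanly, it can be read back with the standard library. This is a small sketch, assuming the key layout written in this step (`report["metrics"]["test"]`); the helper name `load_test_metrics` is hypothetical.

```python
import json
from pathlib import Path

def load_test_metrics(report_path):
    """Read a training report and return its test-set metrics as a dict."""
    with Path(report_path).open(encoding="utf-8") as f:
        report = json.load(f)
    return report["metrics"]["test"]
```

For example, `load_test_metrics('models/colab_training_report.json')["ROC-AUC"]` should return the same test ROC-AUC printed in Step 6.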

## 📤 Step 9: Download Model (Optional)

Download the trained model and the training report to your local machine.

In [None]:
try:
    from google.colab import files
    
    print("📥 Downloading model files...")
    files.download('models/xgboost_model.json')
    files.download('models/colab_training_report.json')
    print("✅ Download complete!")
except ImportError:
    print("ℹ️ Not running in Colab. Files saved locally.")

## 🎉 Summary

**Training Complete!**

Your exoplanet detection model is ready for inference:
- Model: `models/xgboost_model.json`
- Report: `models/colab_training_report.json`

**Next Steps:**
1. Upload the model to a GitHub Release
2. Use the model for inference on new light curves
3. Deploy it as a web service or integrate it with a frontend

**Optimization Tips for Colab Pro:**
- Enable GPU: Runtime → Change runtime type → GPU
- Use High-RAM runtime for larger datasets
- Consider hyperparameter tuning with cross-validation
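
The last tip can be sketched without committing to a particular tuning library: enumerate candidate parameter combinations and keep the one with the best cross-validated score. In the sketch below, `grid_search` and `score_fn` are hypothetical names. In this notebook, `score_fn` would typically wrap something like scikit-learn's `cross_val_score` on `XGBClassifier(**params)`, and the grid values are illustrative guesses rather than recommendations.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Return (best_params, best_score) over all combinations in param_grid."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)  # e.g. mean cross-validated ROC-AUC
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Illustrative grid (values are assumptions, not tuned recommendations)
param_grid = {"max_depth": [3, 6, 9], "learning_rate": [0.05, 0.1, 0.3]}

# Stand-in scorer so the sketch runs on its own; replace with real cross-validation
def toy_score(params):
    return -abs(params["max_depth"] - 6) - abs(params["learning_rate"] - 0.1)

print(grid_search(param_grid, toy_score))
```

Scoring on cross-validation folds of the training set, rather than on the held-out test set, keeps the final test metrics an honest estimate.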