# Numerai Crypto Prediction Pipeline

This notebook implements the complete pipeline for generating cryptocurrency predictions for the Numerai Crypto competition. The pipeline includes data retrieval, feature engineering, model training, and submission generation.

## Pipeline Overview

1. **Data Retrieval**: Download data from Numerai and Yiedl APIs
2. **Data Processing**: Clean and prepare data for feature generation
3. **Feature Engineering**: Generate predictive features using GPU acceleration
4. **Model Training**: Train multiple GPU-accelerated models (LightGBM, XGBoost, etc.)
5. **Ensemble Creation**: Combine model predictions for optimal performance
6. **Submission Generation**: Format predictions for Numerai submission
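
The six stages above run strictly in order, and the notebook times each one. A minimal sketch of that control flow follows; `run_stages` and the placeholder stage functions are hypothetical names used only for illustration, not part of the repository.

```python
import time
from typing import Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[], None]]


def run_stages(stages: List[Stage]) -> Dict[str, float]:
    """Run each stage in order and return elapsed seconds per stage."""
    timings: Dict[str, float] = {}
    for name, stage_fn in stages:
        start = time.perf_counter()
        stage_fn()  # an exception here stops the pipeline at this stage
        timings[name] = time.perf_counter() - start
    return timings


# Placeholder stages: in the notebook, each would wrap one numbered section.
executed: List[str] = []
stage_names = [
    "data_retrieval",
    "data_processing",
    "feature_engineering",
    "model_training",
    "ensemble_creation",
    "submission_generation",
]
stages = [(name, lambda name=name: executed.append(name)) for name in stage_names]

timings = run_stages(stages)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}s")
```

Because a failing stage raises instead of being skipped, later stages never run on missing inputs.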

## Hardware Requirements

For optimal performance:
- RAM: 600GB+ (minimum 16GB for simple pipeline)
- GPU: 3x GPUs with CUDA support (minimum 1 for simple pipeline)
- CPU: 96 threads (minimum 8 for simple pipeline)
- Storage: 500GB+ free space

This notebook has two execution modes:
- **Quick Mode**: 15-30 minutes, minimal resource usage
- **Optimal Mode**: 4-8 hours, high resource usage for best predictions

## 1. Setup and Configuration

First, we'll set up our environment and import necessary dependencies. We'll also set some configuration parameters that will control the pipeline's behavior.
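
One way to keep the mode-dependent parameters together is a small table of presets. The sketch below is an illustration under stated assumptions: `FeatureSettings`, `MODE_PRESETS`, and `settings_for` are hypothetical names, while the numeric values mirror the `quick` and `optimal` branches of the configuration cell in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSettings:
    max_iterations: int
    features_per_iteration: int


# Values copied from the notebook's 'quick' and 'optimal' branches.
MODE_PRESETS = {
    "quick": FeatureSettings(max_iterations=1, features_per_iteration=1000),
    "optimal": FeatureSettings(max_iterations=3, features_per_iteration=7500),
}


def settings_for(mode: str) -> FeatureSettings:
    """Look up the preset for a mode, rejecting unknown mode names."""
    try:
        return MODE_PRESETS[mode]
    except KeyError:
        valid = ", ".join(sorted(MODE_PRESETS))
        raise ValueError(f"Unknown execution mode {mode!r}; expected one of: {valid}") from None


quick = settings_for("quick")
optimal = settings_for("optimal")
print(quick, optimal)
```

Unlike a plain `if`/`else`, where any string other than `'quick'` silently selects the optimal settings, the lookup raises on a typo such as `'optmal'`.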

In [None]:
# Standard imports
import os
import sys
import time
import logging
import numpy as np
import pandas as pd
from pathlib import Path
from datetime import datetime

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.StreamHandler(sys.stdout)
    ]
)
logger = logging.getLogger()

In [None]:
# Add repository root to path
# If running from the notebook directory, move up one level to find the root
repo_root = Path.cwd().parent if Path.cwd().name == 'notebook' else Path.cwd()
sys.path.append(str(repo_root))

# Import project configuration
from config.settings import (
    EXTERNAL_DATA_DIR, DATA_DIR, MODELS_DIR, SUBMISSION_DIR, 
    RAW_DATA_DIR, PROCESSED_DATA_DIR, FEATURES_DIR, CHECKPOINTS_DIR
)
from config.tournament_config import TOURNAMENT_NAME, get_tournament_endpoint

# Create necessary directories if they don't exist
for directory in (
    EXTERNAL_DATA_DIR, DATA_DIR, MODELS_DIR, SUBMISSION_DIR,
    RAW_DATA_DIR, PROCESSED_DATA_DIR, FEATURES_DIR, CHECKPOINTS_DIR
):
    os.makedirs(directory, exist_ok=True)

# Print environment information
logger.info(f"Python version: {sys.version}")
logger.info(f"Numerai tournament: {TOURNAMENT_NAME}")
logger.info(f"Data directory: {DATA_DIR}")
logger.info(f"Models directory: {MODELS_DIR}")
logger.info(f"Submission directory: {SUBMISSION_DIR}")

In [None]:
# Pipeline Configuration Parameters
# These parameters control the pipeline's behavior

# Execution mode: 'quick' or 'optimal'
EXECUTION_MODE = 'quick'  # Change to 'optimal' for best performance

# GPU usage
USE_GPU = True  # Set to False to use CPU only

# Data settings
SKIP_DOWNLOAD = False  # Set to True to skip data download and use existing data
SKIP_YIEDL = False  # Set to True to use only Numerai data (no Yiedl)
INCLUDE_HISTORICAL = True  # Set to False to use only the latest data

# Feature generation settings
if EXECUTION_MODE == 'quick':
    MAX_ITERATIONS = 1
    FEATURES_PER_ITERATION = 1000
else:  # optimal mode
    MAX_ITERATIONS = 3
    FEATURES_PER_ITERATION = 7500

# Model training settings
GPU_MEMORY_LIMIT = 8  # Set based on your GPU memory (in GB)
USE_AZURE_SYNAPSE = True  # Use Azure Synapse LightGBM for faster training

# Display configuration
logger.info(f"Execution mode: {EXECUTION_MODE}")
logger.info(f"GPU usage: {USE_GPU}")
logger.info(f"Maximum iterations: {MAX_ITERATIONS}")
logger.info(f"Features per iteration: {FEATURES_PER_ITERATION}")
logger.info(f"GPU memory limit: {GPU_MEMORY_LIMIT}GB")

## 2. GPU Detection and Optimization

Now we'll check for available GPUs and configure them for optimal performance. This is critical for the high-performance pipeline.
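
As a standalone illustration of the memory check, the sketch below reads device-wide free and total memory with `torch.cuda.mem_get_info`, which reports what the driver sees rather than what PyTorch's caching allocator has reserved. The helper name `gpu_memory_gb` is hypothetical. It returns `None` when PyTorch is not installed or no CUDA device is usable.

```python
from typing import Optional, Tuple


def gpu_memory_gb(device: int = 0) -> Optional[Tuple[float, float]]:
    """Return (free_gb, total_gb) for a CUDA device, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # mem_get_info returns (free_bytes, total_bytes) for the whole device.
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    gib = 1024 ** 3
    return free_bytes / gib, total_bytes / gib


memory = gpu_memory_gb()
if memory is None:
    print("No usable CUDA device detected")
else:
    print(f"{memory[0]:.2f} GB free / {memory[1]:.2f} GB total")
```

When `CUDA_VISIBLE_DEVICES` restricts the process to a single GPU, that GPU is always device `0` inside the process, which is why the default index is `0`.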

In [None]:
# Import GPU detection utilities
from utils.gpu.detection import get_available_gpus, select_best_gpu
from utils.gpu.optimization import optimize_cuda_memory_usage
from utils.memory_utils import log_memory_usage, clear_memory

# Detect available GPUs
available_gpus = get_available_gpus()
logger.info(f"Available GPUs: {available_gpus}")

# Select best GPU if available
if USE_GPU and available_gpus:
    gpu_id = select_best_gpu()
    logger.info(f"Selected GPU {gpu_id} for primary processing")
    
    # Configure environment for GPU usage
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    
    # Optimize CUDA memory usage
    optimize_cuda_memory_usage(reserve_memory_fraction=0.1)
    
    try:
        # Try to import torch for GPU memory info
        import torch
        if torch.cuda.is_available():
            # Get device-wide free and total memory in bytes
            # (memory_reserved - memory_allocated only measures PyTorch's own cache)
            free_bytes, total_bytes = torch.cuda.mem_get_info(0)
            logger.info(f"GPU memory: {free_bytes/(1024**3):.2f} GB free / {total_bytes/(1024**3):.2f} GB total")
            logger.info(f"CUDA version: {torch.version.cuda}")
            
            # Run test to verify GPU acceleration
            test_tensor = torch.ones((10, 10), device='cuda')
            test_result = test_tensor + test_tensor
            logger.info("PyTorch GPU test successful - tensor operations working")
        else:
            logger.warning("PyTorch is installed but CUDA is not available")
    except ImportError:
        logger.warning("PyTorch not available, skipping GPU memory check")
else:
    logger.warning("No GPUs available or GPU usage disabled")
    USE_GPU = False

## 3. Data Retrieval

Now we'll download data from the Numerai and Yiedl APIs. This includes training data, validation data, and the live universe to predict.
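
Network downloads can fail transiently, and the download cell in this section only logs an error when that happens. A hedged sketch of one way to retry with exponential backoff follows. `download_with_retry` is a hypothetical helper, and it assumes the wrapped function returns a truthy value on success, as the `if numerai_result:` check in the notebook suggests.

```python
import time
from typing import Callable, List


def download_with_retry(
    download_fn: Callable[[], bool],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call download_fn until it succeeds or attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            if download_fn():
                return True
        except Exception as exc:  # treat any error as a failed attempt
            print(f"Attempt {attempt} raised {exc!r}")
        if attempt < attempts:
            # Wait 1x, 2x, 4x, ... the base delay between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    return False


# Demo with a stand-in download that fails twice, then succeeds.
calls = {"n": 0}
delays: List[float] = []


def flaky_download() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


# delays.append replaces time.sleep so the demo does not actually wait.
succeeded = download_with_retry(flaky_download, sleep=delays.append)
print(succeeded, calls["n"], delays)
```

In the notebook, the wrapped callable could be something like `lambda: download_numerai_data(include_historical=INCLUDE_HISTORICAL, force=True)`.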

In [None]:
# Skip data download if SKIP_DOWNLOAD is True
if SKIP_DOWNLOAD:
    logger.info("Skipping data download (SKIP_DOWNLOAD=True)")
else:
    # Import data retrieval module
    from scripts.data.download_data import download_numerai_data, download_yiedl_data
    
    logger.info("Starting data download...")
    start_time = time.time()
    
    # Download Numerai data
    logger.info("Downloading Numerai data...")
    numerai_result = download_numerai_data(
        include_historical=INCLUDE_HISTORICAL,
        force=True  # Force download to ensure we have the latest data
    )
    
    if numerai_result:
        logger.info("Numerai data downloaded successfully")
    else:
        logger.error("Failed to download Numerai data")
    
    # Download Yiedl data if not skipped
    if not SKIP_YIEDL:
        logger.info("Downloading Yiedl data...")
        yiedl_result = download_yiedl_data(
            include_historical=INCLUDE_HISTORICAL
        )
        
        if yiedl_result:
            logger.info("Yiedl data downloaded successfully")
        else:
            logger.error("Failed to download Yiedl data")
    
    download_time = time.time() - start_time
    logger.info(f"Data download completed in {download_time:.2f} seconds")
    
    # Display available data files
    numerai_files = [f for f in os.listdir(RAW_DATA_DIR) if f.startswith('numerai')]
    logger.info(f"Available Numerai files: {numerai_files}")
    
    if not SKIP_YIEDL:
        yiedl_files = [f for f in os.listdir(RAW_DATA_DIR) if f.startswith('yiedl')]
        logger.info(f"Available Yiedl files: {yiedl_files}")

## 4. Data Processing

Now we'll process the raw data to prepare it for feature generation and model training.
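
The merge step joins the two sources on shared keys. The sketch below shows one plausible shape of such a join with pandas; it is not the implementation of `merge_datasets`, and the column names (`symbol`, `date`, `target`, `feat_a`) and values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the processed Numerai and Yiedl frames.
numerai = pd.DataFrame({
    "symbol": ["BTC", "ETH", "SOL"],
    "date": ["2024-01-01"] * 3,
    "target": [0.6, 0.4, 0.5],
})
yiedl = pd.DataFrame({
    "symbol": ["BTC", "ETH", "ADA"],
    "date": ["2024-01-01"] * 3,
    "feat_a": [1.2, -0.3, 0.8],
})

# Left join keeps every Numerai row; validate guards against duplicate keys,
# and indicator records whether each row found a match.
merged = numerai.merge(
    yiedl,
    on=["symbol", "date"],
    how="left",
    validate="one_to_one",
    indicator=True,
)
unmatched = merged.loc[merged["_merge"] == "left_only", "symbol"].tolist()
merged = merged.drop(columns="_merge")

print(f"Rows: {len(merged)}, symbols without Yiedl features: {unmatched}")
```

Rows without a match keep `NaN` in the Yiedl columns, so it is worth logging `unmatched` before any later fill step hides the gap.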

In [None]:
# Import data processing module
from scripts.data.process_data import process_numerai_data, process_yiedl_data, merge_datasets

logger.info("Starting data processing...")
start_time = time.time()

# Determine raw data file paths
numerai_train_file = os.path.join(RAW_DATA_DIR, "numerai_train.parquet")
numerai_targets_file = os.path.join(RAW_DATA_DIR, "numerai_targets.parquet")
numerai_live_file = os.path.join(RAW_DATA_DIR, "numerai_live.parquet")

# Process Numerai data
logger.info("Processing Numerai data...")
numerai_processed = process_numerai_data(
    train_file=numerai_train_file,
    targets_file=numerai_targets_file,
    live_file=numerai_live_file,
    output_dir=PROCESSED_DATA_DIR
)

if not SKIP_YIEDL:
    # Determine Yiedl file paths
    yiedl_latest_file = os.path.join(RAW_DATA_DIR, "yiedl_latest.parquet")
    yiedl_historical_file = os.path.join(RAW_DATA_DIR, "yiedl_historical.parquet") if INCLUDE_HISTORICAL else None
    
    # Process Yiedl data
    logger.info("Processing Yiedl data...")
    yiedl_processed = process_yiedl_data(
        latest_file=yiedl_latest_file,
        historical_file=yiedl_historical_file,
        output_dir=PROCESSED_DATA_DIR
    )
    
    # Merge Numerai and Yiedl data
    logger.info("Merging Numerai and Yiedl data...")
    merged_data = merge_datasets(
        numerai_data=numerai_processed['train'],
        yiedl_data=yiedl_processed['processed'],
        output_dir=PROCESSED_DATA_DIR
    )
else:
    # If skipping Yiedl, just use Numerai data
    logger.info("Skipping Yiedl data, using only Numerai data")
    merged_data = numerai_processed

processing_time = time.time() - start_time
logger.info(f"Data processing completed in {processing_time:.2f} seconds")

# Display processed data files
processed_files = os.listdir(PROCESSED_DATA_DIR)
logger.info(f"Available processed files: {processed_files}")

## 5. Feature Generation

Now we'll generate features from the processed data using GPU acceleration.
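
To make the feature families concrete, here is a CPU-only pandas sketch of per-symbol lag, rolling-mean, and exponentially weighted features. It is an illustration, not the `GPUFeatureAccelerator` implementation: the function `add_basic_features`, the column names, and the tiny window sizes are assumptions chosen so the toy frame stays readable.

```python
import pandas as pd


def add_basic_features(df, group_col, value_col, date_col="date",
                       lags=(1,), windows=(2,), spans=(3,)):
    """Add lag, rolling-mean and EWM features computed within each group."""
    out = df.sort_values([group_col, date_col]).copy()
    grouped = out.groupby(group_col)[value_col]
    for lag in lags:
        # Previous values within the same symbol; first rows become NaN.
        out[f"{value_col}_lag_{lag}"] = grouped.shift(lag)
    for window in windows:
        out[f"{value_col}_roll_mean_{window}"] = grouped.transform(
            lambda s, w=window: s.rolling(w, min_periods=1).mean()
        )
    for span in spans:
        out[f"{value_col}_ewm_{span}"] = grouped.transform(
            lambda s, sp=span: s.ewm(span=sp, adjust=False).mean()
        )
    return out


prices = pd.DataFrame({
    "symbol": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-01", "2024-01-02"]
    ),
    "close": [1.0, 2.0, 3.0, 10.0, 20.0],
})
features = add_basic_features(prices, group_col="symbol", value_col="close")
print(features)
```

Grouping by symbol before shifting or rolling keeps one asset's history from leaking into another's features.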

In [None]:
# Import feature generation module
from scripts.features.gpu_accelerator import GPUFeatureAccelerator
from scripts.run_fast_iterative_evolution import FeatureEvolver

logger.info("Starting feature generation...")
start_time = time.time()

# Determine input file for feature generation
input_file = os.path.join(PROCESSED_DATA_DIR, "crypto_train.parquet")
if not os.path.exists(input_file):
    logger.warning(f"{input_file} not found, looking for alternatives...")
    alternatives = [f for f in os.listdir(PROCESSED_DATA_DIR) if f.endswith('.parquet')]
    if alternatives:
        input_file = os.path.join(PROCESSED_DATA_DIR, alternatives[0])
        logger.info(f"Using alternative input file: {input_file}")
    else:
        logger.error("No suitable input file found for feature generation")
        raise FileNotFoundError("No suitable input file found for feature generation")

# Load data
logger.info(f"Loading data from {input_file}...")
try:
    # Try using polars first (faster)
    import polars as pl
    df = pl.read_parquet(input_file)
    logger.info(f"Loaded data with polars: {df.shape}")
except ImportError:
    # Fall back to pandas if polars is not available
    import pandas as pd
    df = pd.read_parquet(input_file)
    logger.info(f"Loaded data with pandas: {df.shape}")

# Generate features using fast iterative evolution
if EXECUTION_MODE == 'optimal':
    logger.info("Using Fast Iterative Feature Evolution for optimal feature generation...")
    
    # Initialize feature evolver
    evolver = FeatureEvolver(
        input_file=input_file,
        output_dir=FEATURES_DIR,
        max_iterations=MAX_ITERATIONS,
        features_per_iteration=FEATURES_PER_ITERATION,
        use_gpu=USE_GPU,
        memory_limit_gb=GPU_MEMORY_LIMIT * 2  # Double GPU memory for system RAM
    )
    
    # Run evolution
    evolved_features = evolver.run()
    
    # Save final feature file path
    feature_file = evolved_features.get('output_file') if evolved_features else None
else:
    # For quick mode, use simple GPU acceleration without evolution
    logger.info("Using simple GPU acceleration for quick feature generation...")
    
    # Initialize GPU accelerator
    accelerator = GPUFeatureAccelerator(output_dir=FEATURES_DIR, force_gpu=USE_GPU)
    
    # Get numeric columns for feature generation
    excluded_cols = ['target', 'Symbol', 'symbol', 'Prediction', 'prediction', 
                     'date', 'era', 'id', 'asset', '__index_level_0__']
    
    if isinstance(df, pd.DataFrame):
        numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
    else:  # polars DataFrame
        numeric_cols = [col for col in df.columns if df[col].dtype in 
                        [pl.Float32, pl.Float64, pl.Int32, pl.Int64, pl.Int16, pl.Int8]]
    
    # Filter out non-feature columns
    numeric_cols = [col for col in numeric_cols if col not in excluded_cols]
    
    # Limit number of columns for quick mode
    if len(numeric_cols) > 20:
        numeric_cols = numeric_cols[:20]
    
    logger.info(f"Generating features for {len(numeric_cols)} numeric columns...")
    
    # Generate features with simplified parameters for quick mode
    result_df = accelerator.generate_all_features(
        df,
        group_col='symbol',
        numeric_cols=numeric_cols,
        rolling_windows=[7, 14],  # Simplified windows
        lag_periods=[1, 3, 7],    # Simplified lags
        ewm_spans=[5, 10],        # Simplified EWM spans
        date_col='date' if 'date' in df.columns else None
    )
    
    # Save features
    feature_file = os.path.join(FEATURES_DIR, f"quick_features_{datetime.now().strftime('%Y%m%d_%H%M')}.parquet")
    
    if isinstance(result_df, pd.DataFrame):
        result_df.to_parquet(feature_file)
    else:  # polars DataFrame
        result_df.write_parquet(feature_file)
    
    logger.info(f"Features saved to {feature_file}")

feature_generation_time = time.time() - start_time
logger.info(f"Feature generation completed in {feature_generation_time:.2f} seconds")

# Clean up memory
clear_memory()
log_memory_usage()

## 6. Model Training

Now we'll train multiple models using the generated features.
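
The training cell in this section validates on a random 80/20 split. For time-ordered data such as crypto prices, a chronological split is a common alternative, because a random split can place rows from later dates in the training set. The sketch below is one such alternative, not what the notebook does; `chronological_split` and the `date` column are assumptions.

```python
import pandas as pd


def chronological_split(df, date_col="date", val_fraction=0.2):
    """Hold out the most recent dates for validation."""
    dates = df[date_col].drop_duplicates().sort_values().tolist()
    if len(dates) < 2:
        raise ValueError("Need at least two distinct dates to split")
    n_val = max(1, int(round(len(dates) * val_fraction)))
    cutoff = dates[-n_val]  # first date that belongs to validation
    train = df[df[date_col] < cutoff]
    val = df[df[date_col] >= cutoff]
    return train, val


# Toy frame: 10 dates x 2 symbols.
dates = pd.date_range("2024-01-01", periods=10, freq="D")
frame = pd.DataFrame({
    "date": list(dates) * 2,
    "symbol": ["BTC"] * 10 + ["ETH"] * 10,
    "target": [0.5] * 20,
})

train_df, val_df = chronological_split(frame, val_fraction=0.2)
print(len(train_df), len(val_df))
```

Every validation row is dated after every training row, so validation RMSE reflects how the model handles unseen periods.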

In [None]:
# Import model training modules
from scripts.models.lightgbm_model import LightGBMModel
from scripts.models.xgboost_model import XGBoostModel
from scripts.models.ensemble import ModelEnsemble

logger.info("Starting model training...")
start_time = time.time()

# Load features data
if feature_file and os.path.exists(feature_file):
    logger.info(f"Loading features from {feature_file}...")
    try:
        # Try using polars first (faster)
        import polars as pl
        features_df = pl.read_parquet(feature_file)
        # Convert to pandas for compatibility with models
        features_df = features_df.to_pandas()
    except ImportError:
        # Fall back to pandas
        features_df = pd.read_parquet(feature_file)
    
    logger.info(f"Loaded features with shape: {features_df.shape}")
else:
    # Look for any feature file if the specified one doesn't exist
    feature_paths = [os.path.join(FEATURES_DIR, f) for f in os.listdir(FEATURES_DIR) if f.endswith('.parquet')]
    if feature_paths:
        # os.listdir order is arbitrary, so pick the most recently modified file
        latest_feature_file = max(feature_paths, key=os.path.getmtime)
        logger.info(f"Using latest feature file: {latest_feature_file}")
        features_df = pd.read_parquet(latest_feature_file)
    else:
        logger.error("No feature files found. Cannot train models.")
        raise FileNotFoundError("No feature files found for model training")

# Clean up features data
logger.info("Cleaning features data...")

# Drop non-feature columns for training
excluded_cols = ['target', 'Symbol', 'symbol', 'Prediction', 'prediction', 
                 'date', 'era', 'id', 'asset', '__index_level_0__']

# Keep track of target column
target_col = 'target'
if target_col not in features_df.columns:
    logger.error(f"Target column '{target_col}' not found in features data")
    raise ValueError(f"Target column '{target_col}' not found in features data")

# Drop rows with NaN in target
features_df = features_df.dropna(subset=[target_col])
logger.info(f"Data shape after dropping NaN targets: {features_df.shape}")

# Prepare features (X) and target (y)
feature_cols = [col for col in features_df.columns 
                if col not in excluded_cols and col != target_col]
X = features_df[feature_cols]
y = features_df[target_col]

# Fill NaN values in features
X = X.fillna(0)

logger.info(f"Prepared {len(feature_cols)} features for training")

# Create a train/validation split
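from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error  # imported here so every model block below can use it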
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

logger.info(f"Training set: {X_train.shape}, Validation set: {X_val.shape}")

# Initialize models
models = []
model_performance = {}

# Train LightGBM model
try:
    logger.info("Training LightGBM model...")
    lgb_model = LightGBMModel(
        use_gpu=USE_GPU,
        gpu_device_id=0,
        name="lightgbm_model"
    )
    
    # Configure Azure Synapse LightGBM if enabled
    if USE_AZURE_SYNAPSE:
        lgb_model.params.update({
            'synapse_mode': True
        })
    
    # Train model
    lgb_result = lgb_model.train(
        X_train, y_train,
        X_val, y_val,
        num_boost_round=500 if EXECUTION_MODE == 'quick' else 1000,
        early_stopping_rounds=50
    )
    
    logger.info(f"LightGBM training completed: Best iteration: {lgb_result['best_iteration']}")
    
    # Save model
    lgb_model_path = lgb_model.save_model(MODELS_DIR)
    logger.info(f"LightGBM model saved to {lgb_model_path}")
    
    # Evaluate on validation set
    from sklearn.metrics import mean_squared_error
    lgb_preds = lgb_model.predict(X_val)
    lgb_rmse = np.sqrt(mean_squared_error(y_val, lgb_preds))
    logger.info(f"LightGBM validation RMSE: {lgb_rmse:.6f}")
    
    models.append(lgb_model)
    model_performance['lightgbm'] = lgb_rmse
except Exception as e:
    logger.error(f"Error training LightGBM model: {e}")

# Train XGBoost model
try:
    logger.info("Training XGBoost model...")
    xgb_model = XGBoostModel(
        use_gpu=USE_GPU,
        gpu_id=0,
        name="xgboost_model"
    )
    
    # Train model
    xgb_result = xgb_model.train(
        X_train, y_train,
        X_val, y_val,
        num_boost_round=500 if EXECUTION_MODE == 'quick' else 1000,
        early_stopping_rounds=50
    )
    
    logger.info(f"XGBoost training completed: Best iteration: {xgb_result['best_iteration']}")
    
    # Save model
    xgb_model_path = xgb_model.save_model(MODELS_DIR)
    logger.info(f"XGBoost model saved to {xgb_model_path}")
    
    # Evaluate on validation set
    xgb_preds = xgb_model.predict(X_val)
    xgb_rmse = np.sqrt(mean_squared_error(y_val, xgb_preds))
    logger.info(f"XGBoost validation RMSE: {xgb_rmse:.6f}")
    
    models.append(xgb_model)
    model_performance['xgboost'] = xgb_rmse
except Exception as e:
    logger.error(f"Error training XGBoost model: {e}")

# Create ensemble model
if len(models) > 0:
    logger.info("Creating ensemble model...")
    ensemble = ModelEnsemble(name="ensemble_model", weights_strategy="performance")
    
    # Add models to ensemble with performance metrics
    for model, model_name in zip(models, model_performance.keys()):
        ensemble.add_model(model, performance_metric=model_performance[model_name])
    
    # Evaluate ensemble on validation set
    ensemble_preds = ensemble.predict(X_val)
    ensemble_rmse = np.sqrt(mean_squared_error(y_val, ensemble_preds))
    logger.info(f"Ensemble validation RMSE: {ensemble_rmse:.6f}")
    
    # Save ensemble metadata
    ensemble_path = ensemble.save(MODELS_DIR)
    logger.info(f"Ensemble model metadata saved to {ensemble_path}")
    
    model_performance['ensemble'] = ensemble_rmse
else:
    logger.warning("No models trained successfully, skipping ensemble creation")

# Display model performance summary
logger.info("Model performance summary:")
for model_name, rmse in model_performance.items():
    logger.info(f"  {model_name}: RMSE = {rmse:.6f}")

training_time = time.time() - start_time
logger.info(f"Model training completed in {training_time:.2f} seconds")

# Clean up memory
clear_memory()
log_memory_usage()

## 7. Prediction Generation

Now we'll generate predictions for the live universe using our trained models.
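
Two steps in this section are easy to get wrong: aligning live columns to the training columns, and checking the submission before upload. The sketch below illustrates both with pandas. `align_to_training` and `submission_problems` are hypothetical helpers, and the checks are assumptions about a sensible format; the authoritative rules are in `check_submission_format` and the Numerai documentation.

```python
import pandas as pd


def align_to_training(live: pd.DataFrame, training_columns) -> pd.DataFrame:
    """Return live features with exactly the training columns, in training order."""
    # Missing columns are created and filled with 0; extra columns are dropped.
    aligned = live.reindex(columns=list(training_columns), fill_value=0)
    return aligned.fillna(0)


def submission_problems(submission: pd.DataFrame) -> list:
    """Return human-readable problems; an empty list means the checks passed."""
    if list(submission.columns) != ["symbol", "prediction"]:
        return ["columns must be exactly ['symbol', 'prediction']"]
    problems = []
    if submission["symbol"].duplicated().any():
        problems.append("duplicate symbols")
    if submission["prediction"].isna().any():
        problems.append("missing predictions")
    elif not submission["prediction"].between(0, 1).all():
        problems.append("predictions outside [0, 1]")
    return problems


live = pd.DataFrame({
    "symbol": ["BTC", "ETH"],
    "feat_a": [0.5, None],
    "feat_c": [1.0, 2.0],  # not used in training, so it is dropped
})
live_X = align_to_training(live, ["feat_a", "feat_b"])

raw = pd.DataFrame({"symbol": live["symbol"], "prediction": [0.3, 1.4]})
clipped = raw.assign(prediction=raw["prediction"].clip(0, 1))

print(live_X)
print(submission_problems(raw), submission_problems(clipped))
```

Filling absent columns with zeros keeps the model runnable, but a large number of filled columns usually means the live and training feature pipelines have diverged.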

In [None]:
# Import prediction utilities
from utils.model.predict import generate_predictions
from scripts.python_utils.submission_utils import check_submission_format

logger.info("Starting prediction generation...")
start_time = time.time()

# Load live data for prediction
live_file = os.path.join(PROCESSED_DATA_DIR, "crypto_live.parquet")
if not os.path.exists(live_file):
    logger.warning(f"{live_file} not found, looking for alternatives...")
    alternatives = [f for f in os.listdir(PROCESSED_DATA_DIR) if 'live' in f and f.endswith('.parquet')]
    if alternatives:
        live_file = os.path.join(PROCESSED_DATA_DIR, alternatives[0])
        logger.info(f"Using alternative live file: {live_file}")
    else:
        logger.error("No suitable live file found for prediction")
        raise FileNotFoundError("No live data file found for prediction")

logger.info(f"Loading live data from {live_file}...")
try:
    # Try using polars first (faster)
    import polars as pl
    live_df = pl.read_parquet(live_file)
    # Convert to pandas for compatibility with models
    live_df = live_df.to_pandas()
except ImportError:
    # Fall back to pandas
    live_df = pd.read_parquet(live_file)

logger.info(f"Loaded live data with shape: {live_df.shape}")

# Ensure 'symbol' column is present for submission
symbol_col = None
for col_name in ['symbol', 'Symbol', 'asset']:
    if col_name in live_df.columns:
        symbol_col = col_name
        break

if symbol_col is None:
    logger.error("No symbol column found in live data")
    raise ValueError("No symbol column found in live data")

# Generate features for live data (same feature set as training)
logger.info("Generating features for live data...")

# If we have access to the same feature engineering pipeline
if 'GPUFeatureAccelerator' in globals():
    accelerator = GPUFeatureAccelerator(output_dir=FEATURES_DIR, force_gpu=USE_GPU)
    
    # Use the same feature generation parameters as training
    # This is crucial for model predictions to work correctly
    excluded_cols = ['target', 'Symbol', 'symbol', 'Prediction', 'prediction', 
                    'date', 'era', 'id', 'asset', '__index_level_0__']
    
    numeric_cols = live_df.select_dtypes(include=['number']).columns.tolist()
    numeric_cols = [col for col in numeric_cols if col not in excluded_cols]
    
    # Limit number of columns to match training
    if len(numeric_cols) > 20 and EXECUTION_MODE == 'quick':
        numeric_cols = numeric_cols[:20]
    
    # Generate the same features as training
    live_features_df = accelerator.generate_all_features(
        live_df,
        group_col=symbol_col,
        numeric_cols=numeric_cols,
        rolling_windows=[7, 14] if EXECUTION_MODE == 'quick' else [3, 7, 14, 28, 56],
        lag_periods=[1, 3, 7] if EXECUTION_MODE == 'quick' else [1, 2, 3, 5, 7, 14, 28],
        ewm_spans=[5, 10] if EXECUTION_MODE == 'quick' else [5, 10, 20, 40],
        date_col='date' if 'date' in live_df.columns else None
    )
    
    # Convert to pandas if needed
    if not isinstance(live_features_df, pd.DataFrame):
        live_features_df = live_features_df.to_pandas()
else:
    # If we don't have access to the feature engineering pipeline,
    # we need to ensure the live data has the same features as the training data
    logger.warning("Feature generation unavailable, using live data as-is")
    live_features_df = live_df

logger.info(f"Live features shape: {live_features_df.shape}")

# Filter live features to match training features
if set(X.columns) - set(live_features_df.columns):
    missing_cols = set(X.columns) - set(live_features_df.columns)
    logger.warning(f"Missing {len(missing_cols)} columns in live data")
    
    # Add missing columns with zeros
    for col in missing_cols:
        live_features_df[col] = 0
    
    logger.info("Added missing columns with zeros")

# Select only the features used in training
live_X = live_features_df[X.columns]
live_X = live_X.fillna(0)

logger.info(f"Prepared live features with shape: {live_X.shape}")

# Generate predictions using each model
predictions = {}

for model_name in model_performance.keys():
    if model_name == 'ensemble' and 'ensemble' in locals():
        logger.info("Generating predictions with ensemble model...")
        pred = ensemble.predict(live_X)
        predictions['ensemble'] = pred
    elif model_name == 'lightgbm' and 'lgb_model' in locals():
        logger.info("Generating predictions with LightGBM model...")
        pred = lgb_model.predict(live_X)
        predictions['lightgbm'] = pred
    elif model_name == 'xgboost' and 'xgb_model' in locals():
        logger.info("Generating predictions with XGBoost model...")
        pred = xgb_model.predict(live_X)
        predictions['xgboost'] = pred
    else:
        logger.warning(f"Model {model_name} not available for prediction")

# Use ensemble predictions if available, otherwise use best single model
if 'ensemble' in predictions:
    final_predictions = predictions['ensemble']
    model_used = 'ensemble'
elif predictions:
    # Find best model based on validation performance
    best_model = min(model_performance.items(), key=lambda x: x[1])[0]
    if best_model in predictions:
        final_predictions = predictions[best_model]
        model_used = best_model
    else:
        # Use first available model
        model_used = list(predictions.keys())[0]
        final_predictions = predictions[model_used]
else:
    logger.error("No predictions generated")
    raise RuntimeError("No predictions generated")

logger.info(f"Using predictions from {model_used} model")

# Create submission DataFrame
submission_df = pd.DataFrame({
    'symbol': live_features_df[symbol_col],
    'prediction': final_predictions
})

# Ensure lowercase 'symbol' column
if 'symbol' not in submission_df.columns and symbol_col in submission_df.columns:
    submission_df = submission_df.rename(columns={symbol_col: 'symbol'})

# Ensure values are in [0, 1] range
submission_df['prediction'] = submission_df['prediction'].clip(0, 1)

# Save submission file
os.makedirs(SUBMISSION_DIR, exist_ok=True)
submission_file = os.path.join(SUBMISSION_DIR, f"submission_{datetime.now().strftime('%Y%m%d_%H%M')}.csv")
submission_df.to_csv(submission_file, index=False)

logger.info(f"Saved submission with {len(submission_df)} predictions to {submission_file}")

# Check submission format
if check_submission_format(submission_file):
    logger.info("Submission format is valid")
else:
    logger.warning("Submission format is invalid, please check the file")

prediction_time = time.time() - start_time
logger.info(f"Prediction generation completed in {prediction_time:.2f} seconds")

# Preview submission
submission_df.head(10)

## 8. Performance Summary

Now we'll summarize the performance of the pipeline and the generated models.
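
The total-time arithmetic can be sketched with `divmod`. In this illustration the stage timings live in a dictionary, which avoids checking `locals()` for each variable; `format_duration` and the timing values are hypothetical.

```python
def format_duration(total_seconds: float) -> str:
    """Format a duration in seconds as 'Xh Ym Zs'."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes}m {seconds}s"


# Made-up timings; stages that did not run are simply absent from the dict.
stage_seconds = {"download": 125.4, "processing": 61.0, "training": 3600.2}
total = sum(stage_seconds.values())
summary = format_duration(total)
print(f"Total pipeline execution time: {summary}")  # 1h 3m 6s
```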

In [None]:
# Calculate total execution time
total_time = 0
if 'download_time' in locals():
    total_time += download_time
if 'processing_time' in locals():
    total_time += processing_time
if 'feature_generation_time' in locals():
    total_time += feature_generation_time
if 'training_time' in locals():
    total_time += training_time
if 'prediction_time' in locals():
    total_time += prediction_time

# Format as hours, minutes, seconds
hours = int(total_time // 3600)
minutes = int((total_time % 3600) // 60)
seconds = int(total_time % 60)

logger.info(f"Total pipeline execution time: {hours}h {minutes}m {seconds}s")

# Display final model performance summary
if 'model_performance' in locals() and model_performance:
    logger.info("\nModel Performance Summary:")
    for model_name, rmse in sorted(model_performance.items(), key=lambda x: x[1]):
        logger.info(f"  {model_name}: RMSE = {rmse:.6f}")
    
    # Find best model
    best_model = min(model_performance.items(), key=lambda x: x[1])[0]
    best_rmse = model_performance[best_model]
    logger.info(f"\nBest model: {best_model} with RMSE = {best_rmse:.6f}")

# Display submission information
if 'submission_file' in locals() and os.path.exists(submission_file):
    submission_size = os.path.getsize(submission_file) / 1024  # KB
    logger.info(f"\nSubmission file: {submission_file}")
    logger.info(f"Submission size: {submission_size:.2f} KB")
    logger.info(f"Number of predictions: {len(submission_df)}")
    
    # Display submission statistics
    logger.info(f"Prediction range: [{submission_df['prediction'].min():.4f}, {submission_df['prediction'].max():.4f}]")
    logger.info(f"Prediction mean: {submission_df['prediction'].mean():.4f}")
    logger.info(f"Prediction std: {submission_df['prediction'].std():.4f}")

## 9. Submitting to Numerai

Finally, we'll show how to submit the predictions to the Numerai competition. You need to set your Numerai API credentials for this to work.
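
Credentials are read from the `NUMERAI_PUBLIC_ID` and `NUMERAI_SECRET_KEY` environment variables. A small sketch of failing early with a clear message when either is missing follows; `load_numerai_credentials` is a hypothetical helper, and the demo values are placeholders, not real keys.

```python
import os
from typing import Mapping, Tuple


def load_numerai_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (public_id, secret_key) or raise if either variable is unset or empty."""
    names = ("NUMERAI_PUBLIC_ID", "NUMERAI_SECRET_KEY")
    values = [env.get(name) for name in names]
    missing = [name for name, value in zip(names, values) if not value]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return values[0], values[1]


# Demo with an explicit mapping instead of the real environment.
demo_env = {"NUMERAI_PUBLIC_ID": "example-id", "NUMERAI_SECRET_KEY": "example-secret"}
public_id, secret_key = load_numerai_credentials(demo_env)
print(f"Loaded credentials for public id {public_id}")
```

Passing the mapping as a parameter makes the check testable without touching the real environment, and keeps secrets out of the notebook source.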

In [None]:
# Set this to True to submit to Numerai
SUBMIT_TO_NUMERAI = False

if SUBMIT_TO_NUMERAI:
    try:
        # Try to import numerapi
        import numerapi
        
        # Set your Numerai API credentials
        # You can get these from https://numer.ai/account
        public_id = os.environ.get('NUMERAI_PUBLIC_ID', None)
        secret_key = os.environ.get('NUMERAI_SECRET_KEY', None)
        
        if not public_id or not secret_key:
            logger.warning("Numerai API credentials not set. Set NUMERAI_PUBLIC_ID and NUMERAI_SECRET_KEY environment variables.")
        else:
            # Initialize Numerai API client
            napi = numerapi.NumerAPI(public_id=public_id, secret_key=secret_key)
            
            # Check if we have a valid submission file
            if 'submission_file' in locals() and os.path.exists(submission_file):
                # Submit predictions
                logger.info("Submitting predictions to Numerai...")
                submission_id = napi.upload_predictions(submission_file, tournament=TOURNAMENT_NAME)
                logger.info(f"Submission successful! Submission ID: {submission_id}")
            else:
                logger.error("No valid submission file found")
    except ImportError:
        logger.error("numerapi package not installed. Install it with: pip install numerapi")
    except Exception as e:
        logger.error(f"Error submitting to Numerai: {e}")
else:
    logger.info("Skipping submission to Numerai (SUBMIT_TO_NUMERAI=False)")
    logger.info("To submit manually, upload the submission file at: https://numer.ai/tournament")

## 10. Conclusion

We've successfully run the complete Numerai Crypto prediction pipeline. Here's a summary of what we've accomplished:

1. Downloaded and processed data from Numerai and Yiedl (if enabled)
2. Generated features using GPU-accelerated algorithms
3. Trained multiple models (LightGBM, XGBoost) and created an ensemble
4. Generated predictions for the live universe
5. Created a valid submission file ready for Numerai

### Next Steps

- Experiment with different feature generation parameters
- Try different model architectures and hyperparameters
- Run the pipeline in 'optimal' mode for best performance
- Submit your predictions to the Numerai Crypto tournament

### Performance Strategies

The repository implements multiple prediction strategies:

1. **Mean Reversion Strategy**: Assumes prices will revert to historical average (RMSE: 0.0893)
2. **Momentum Strategy**: Assumes price trends will continue (RMSE: 0.1117)
3. **Trend Following Strategy**: Follows established price trends (RMSE: 0.1079)
4. **Ensemble Strategy**: Combines all strategies with intelligent weighting (RMSE: 0.1050)

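The optimal pipeline combines all of these strategies. Note that on the RMSE figures listed above, the Mean Reversion Strategy alone scores lowest (0.0893), ahead of the Ensemble Strategy (0.1050), so the ensemble weighting is worth validating on your own data.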
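
The weighting scheme itself is not described here. One common choice is to weight each strategy by the inverse of its validation RMSE, sketched below as an assumption rather than the repository's method. The RMSE values come from the list above; the function names and the per-strategy predictions are made up for illustration.

```python
from typing import Dict, List


def inverse_rmse_weights(rmse_by_model: Dict[str, float]) -> Dict[str, float]:
    """Give lower-RMSE models proportionally larger weights that sum to 1."""
    if any(rmse <= 0 for rmse in rmse_by_model.values()):
        raise ValueError("RMSE values must be positive")
    inverse = {name: 1.0 / rmse for name, rmse in rmse_by_model.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def blend(predictions: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Weighted average of per-model predictions, row by row."""
    n_rows = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * preds[i] for name, preds in predictions.items())
        for i in range(n_rows)
    ]


rmse = {"mean_reversion": 0.0893, "momentum": 0.1117, "trend_following": 0.1079}
weights = inverse_rmse_weights(rmse)

# Hypothetical predictions for two assets from each strategy.
predictions = {
    "mean_reversion": [0.40, 0.70],
    "momentum": [0.60, 0.50],
    "trend_following": [0.50, 0.60],
}
blended = blend(predictions, weights)
print(weights)
print(blended)
```

A weighted average can never score better than the best single model on every row, so comparing the blend's validation RMSE with each component is a useful sanity check.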