# Notebook 2: Model Training, Evaluation, and Prediction Pipeline

This notebook demonstrates:
1. Training a simple model using `model_training.py`.
2. Evaluating the model using `model_evaluation.py`.
3. Running the end-to-end daily prediction pipeline using `prediction_pipeline.py`.

**Note:** Feature engineering is not fully implemented yet, so this notebook trains and predicts on dummy features. The prediction pipeline also needs the `FOOTBALL_DATA_API_KEY` environment variable to fetch live matches.

## 1. Setup and Imports

In [None]:
import os
import sys
import pandas as pd
import numpy as np
from datetime import datetime

# Add src directory to Python path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
src_path = os.path.join(project_root, 'src')
if src_path not in sys.path:
    sys.path.append(src_path)

try:
    from model_training import split_data, train_random_forest, load_model  # Using Random Forest as example
    from model_evaluation import get_classification_metrics, plot_confusion_matrix
    from prediction_pipeline import predict_daily_matches, generate_features_for_prediction  # For prediction part
    # from data_preprocessing import preprocess_match_data  # If we need to re-process some data
except ImportError as e:
    print(f"Error importing modules: {e}")
    print("Make sure 'src' is in sys.path and __init__.py files are present.")
    raise  # Stop here: the cells below cannot run without these imports

## 2. Model Training (Example with Dummy Data)

We'll create a synthetic dataset and train a Random Forest model. In a real scenario, this data would come from `01_data_collection_and_preprocessing.ipynb` after extensive feature engineering.

In [None]:
# Create dummy data for training
# Seeded generator so the dummy data is identical on every run
rng = np.random.default_rng(42)
num_samples = 200
data = {
    'feature1': rng.random(num_samples),
    'feature2': rng.random(num_samples) * 10,
    'feature3': rng.integers(0, 5, num_samples),
    # Target: 0 for Home Win, 1 for Draw, 2 for Away Win
    'result_label': rng.choice([0, 1, 2], num_samples, p=[0.45, 0.25, 0.30])
}
training_df = pd.DataFrame(data)

print("Dummy Training Data Head:")
print(training_df.head())
print("\nTarget Distribution:")
print(training_df['result_label'].value_counts(normalize=True))

In [None]:
# Defined before the branch because the prediction cell below uses it in either case
rf_model_filename = "example_rf_model.pkl"

# Split data
if not training_df.empty:
    X_train, X_test, y_train, y_test = split_data(training_df, target_column='result_label', test_size=0.25)
    print(f"\nTraining set size: {X_train.shape[0]}, Test set size: {X_test.shape[0]}")

    # Train a Random Forest model
    # train_random_forest saves the model in the ../models/ directory
    trained_model = train_random_forest(X_train, y_train, model_filename=rf_model_filename)
    print(f"\nModel trained: {trained_model}")
else:
    print("Training DataFrame is empty. Skipping training and evaluation.")
    # Initialize to avoid errors if cells below are run
    X_test, y_test, trained_model = pd.DataFrame(), pd.Series(dtype='float64'), None

## 3. Model Evaluation

In [None]:
if trained_model is not None and not X_test.empty:
    print("\nEvaluating model on the test set...")
    y_pred = trained_model.predict(X_test)
    y_prob = trained_model.predict_proba(X_test)

    # Define class labels for report (consistent with dummy data: 0, 1, 2)
    class_labels = sorted(training_df['result_label'].unique())
    target_names = [f"Class {label}" for label in class_labels]  # e.g., "Class 0", "Class 1", "Class 2"
    # Or more descriptive: target_names = ["Home Win", "Draw", "Away Win"]

    metrics = get_classification_metrics(y_test, y_pred, y_prob, average='weighted', labels=class_labels, target_names=target_names)
    # print("\nEvaluation Metrics:", metrics)

    # Plot confusion matrix
    # Ensure evaluation_reports directory exists (model_evaluation.py should create it)
    plot_confusion_matrix(y_test, y_pred, labels=class_labels, display_labels=target_names, filename="example_notebook_cm.png")
else:
    print("Model not trained or test data is empty. Skipping evaluation.")
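Because the dummy features are generated independently of `result_label`, the model has no real signal to learn, so the metrics above are best read against a trivial baseline. The sketch below computes the accuracy of always predicting the most frequent training label. `majority_baseline_accuracy` is a hypothetical helper, not part of `model_evaluation.py`, and the label lists are made up for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label.

    Hypothetical helper for illustration; not part of model_evaluation.py.
    """
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)


# Made-up labels for illustration: 0 = Home Win, 1 = Draw, 2 = Away Win.
label, accuracy = majority_baseline_accuracy(
    train_labels=[0, 0, 1, 2, 0, 2, 0, 1],
    test_labels=[0, 2, 0, 1],
)
print(label, accuracy)  # 0 0.5
```

With this notebook's variables, and assuming `split_data` returns pandas Series for the targets, the call would be `majority_baseline_accuracy(y_train.tolist(), y_test.tolist())`. A Random Forest that scores close to this baseline on the dummy data is behaving as expected.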

## 4. Running the Daily Prediction Pipeline

This section calls `predict_daily_matches` from `prediction_pipeline.py`, which fetches today's matches, preprocesses them (minimally, for now), generates dummy features, loads the trained model, and predicts outcomes.

**Requires:**
- A trained model file (e.g., `example_rf_model.pkl` saved from the step above or any model in `models/`).
- `FOOTBALL_DATA_API_KEY` environment variable for fetching actual matches.
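The API-key requirement can be made explicit with a small guard that fails early instead of proceeding with a placeholder. This is only a sketch: `require_env` is a hypothetical helper that does not exist in `prediction_pipeline.py`, and treating `YOUR_API_TOKEN` as "not configured" is an assumption based on the placeholder this notebook uses.

```python
import os


def require_env(name, placeholder="YOUR_API_TOKEN", environ=None):
    """Return the value of an environment variable, or raise if it is unusable.

    Hypothetical helper for illustration; not part of prediction_pipeline.py.
    `environ` defaults to os.environ and can be swapped for a dict in tests.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    # An empty value or the notebook's placeholder both count as "not configured".
    if not value or value == placeholder:
        raise RuntimeError(f"{name} is not set; export it before fetching live matches.")
    return value
```

A call such as `api_key = require_env("FOOTBALL_DATA_API_KEY")` would then stop the run when the key is missing. This is stricter than the cell below, which continues without real data when no key is available.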

In [None]:
today_str = datetime.now().strftime("%Y-%m-%d")
api_key_notebook = os.getenv("FOOTBALL_DATA_API_KEY", "YOUR_API_TOKEN")

print(f"Running prediction pipeline for date: {today_str}")
print(f"Using model: {rf_model_filename}")
print(f"API Key available: {'Yes' if api_key_notebook != 'YOUR_API_TOKEN' else 'No (will not fetch real data)'}")

# Ensure the model used here is the one trained or a valid one from the models folder
model_to_use_in_pipeline = rf_model_filename 
# Check if the model exists
model_path_pipeline = os.path.join(project_root, 'models', model_to_use_in_pipeline)
if not os.path.exists(model_path_pipeline):
    print(f"ERROR: Model {model_to_use_in_pipeline} not found at {model_path_pipeline}.")
    print("Please ensure the model from training step was saved correctly or use an existing model file.")
else:
    # The prediction_pipeline.py uses its own feature generation logic.
    # The `generate_features_for_prediction` in that script creates dummy features ('feature1', 'feature2', 'feature3').
    # Our dummy training_df also used these feature names, so it should be compatible for this example.
    predict_daily_matches(date_str=today_str, model_filename=model_to_use_in_pipeline, api_key=api_key_notebook)

### Important Considerations for Real-World Usage

1.  **Feature Consistency**: The features generated by `generate_features_for_prediction` in `prediction_pipeline.py` *must* exactly match the features used to train the loaded model: same names, same order, same dtypes. A mismatch can raise an error at prediction time or, worse, silently produce meaningless predictions.
2.  **Data Availability**: Fetching historical data for feature engineering (e.g., team form, H2H) can be complex and API-dependent.
3.  **Model Retraining**: Models should be periodically retrained with new data to maintain performance.
4.  **API Key Management**: Keep API keys out of source code and notebooks; load them from environment variables or an untracked config file.
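The feature-consistency point can be checked mechanically before calling `predict`. The following is a minimal sketch under stated assumptions: `compare_feature_columns` is a hypothetical helper, not part of this project, and it compares column names and order only (not dtypes).

```python
def compare_feature_columns(expected, actual):
    """Describe how the `actual` column names differ from the `expected` ones.

    Hypothetical helper for illustration; both arguments are sequences of names.
    """
    expected, actual = list(expected), list(actual)
    missing = [name for name in expected if name not in actual]
    unexpected = [name for name in actual if name not in expected]
    # The same names in a different order still count as a mismatch,
    # because models fed positional arrays rely on column order.
    reordered = not missing and not unexpected and expected != actual
    return {"missing": missing, "unexpected": unexpected, "reordered": reordered}


report = compare_feature_columns(
    ["feature1", "feature2", "feature3"],  # names used for training in this notebook
    ["feature2", "feature1", "feature3"],  # hypothetical pipeline output
)
print(report)  # {'missing': [], 'unexpected': [], 'reordered': True}
```

For a scikit-learn estimator fitted on a DataFrame (scikit-learn 1.0 or later), the training column names are stored in `feature_names_in_`, so one option is `compare_feature_columns(model.feature_names_in_, features_df.columns)`, where `features_df` stands for whatever DataFrame the pipeline produces.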