# Notebook 2: Model Training, Evaluation, and Prediction Pipeline

This notebook demonstrates:
1. Training a simple model using `model_training.py`.
2. Evaluating the model using `model_evaluation.py`.
3. Running the end-to-end daily prediction pipeline using `prediction_pipeline.py`.

**Note:** Feature engineering is not fully implemented yet, so this notebook trains and predicts on dummy data and features. The prediction pipeline also needs the `FOOTBALL_DATA_API_KEY` environment variable to fetch live matches.

## 1. Setup and Imports

In [None]:
import os
import sys
import pandas as pd
import numpy as np
from datetime import datetime

# Add src directory to Python path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
src_path = os.path.join(project_root, 'src')
if src_path not in sys.path:
    sys.path.append(src_path)

try:
    from model_training import split_data, train_random_forest, load_model # Using Random Forest as example
    from model_evaluation import get_classification_metrics, plot_confusion_matrix
    from prediction_pipeline import predict_daily_matches, generate_features_for_prediction # For prediction part
    # from data_preprocessing import preprocess_match_data # If we need to re-process some data
except ImportError as e:
    print(f"Error importing modules: {e}")
    print("Make sure 'src' is in sys.path and __init__.py files are present.")

## 2. Model Training (Example with Dummy Data)

We'll create a synthetic dataset and train a Random Forest model. In a real scenario, this data would come from `01_data_collection_and_preprocessing.ipynb` after extensive feature engineering.

In [None]:
# Create dummy data for training (fixed seed so the printed outputs are reproducible)
np.random.seed(42)
num_samples = 200
data = {
    'feature1': np.random.rand(num_samples),
    'feature2': np.random.rand(num_samples) * 10,
    'feature3': np.random.randint(0, 5, num_samples),
    # New form features (dummy values)
    'home_form_W': np.random.randint(0, 3, num_samples), # Wins in last 5 games (0-2 typical for dummy)
    'home_form_D': np.random.randint(0, 2, num_samples), # Draws in last 5 games (0-1)
    'home_form_L': np.random.randint(0, 3, num_samples), # Losses (placeholder; recomputed below as games played - W - D)
    'home_form_games_played': np.full(num_samples, 5),
    'away_form_W': np.random.randint(0, 3, num_samples),
    'away_form_D': np.random.randint(0, 2, num_samples),
    'away_form_L': np.random.randint(0, 3, num_samples),
    'away_form_games_played': np.full(num_samples, 5),
    # Target: 0 for Home Win, 1 for Draw, 2 for Away Win
    'result_label': np.random.choice([0, 1, 2], num_samples, p=[0.45, 0.25, 0.30]) 
}
training_df = pd.DataFrame(data)

# Recompute losses so that W + D + L equals games played, for both home and away teams.
# Vectorized column arithmetic keeps the integer dtype, unlike a row-wise apply.
training_df['home_form_L'] = (training_df['home_form_games_played'] - training_df['home_form_W'] - training_df['home_form_D']).clip(lower=0)
training_df['away_form_L'] = (training_df['away_form_games_played'] - training_df['away_form_W'] - training_df['away_form_D']).clip(lower=0)

print("Dummy Training Data Head (with form features):")
print(training_df.head())
print("\nTarget Distribution:")
print(training_df['result_label'].value_counts(normalize=True))

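# Listed for reference only: split_data uses every non-target column, so this list is not passed to it.
feature_columns = [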
    'feature1', 'feature2', 'feature3',
    'home_form_W', 'home_form_D', 'home_form_L', 'home_form_games_played',
    'away_form_W', 'away_form_D', 'away_form_L', 'away_form_games_played'
]

In [None]:
# Split data
if not training_df.empty:
    # The split_data function in model_training.py uses all columns except target as features.
    # So, X_train will automatically include the new form features added to training_df.
    X_train, X_test, y_train, y_test = split_data(training_df, target_column='result_label', test_size=0.25)
    print(f"\nTraining set size: {X_train.shape[0]}, Test set size: {X_test.shape[0]}")
    print(f"Features in X_train: {X_train.columns.tolist()}")

    # Train a Random Forest model
    rf_model_filename = "example_rf_model_with_form.pkl" # New model name
    trained_model = train_random_forest(X_train, y_train, model_filename=rf_model_filename)
    print(f"\nModel trained: {trained_model}")
else:
    print("Training DataFrame is empty. Skipping training and evaluation.")
    X_test, y_test, trained_model = pd.DataFrame(), pd.Series(dtype='float64'), None
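
The comments in the cell above say that `split_data` treats every column except the target as a feature. As a rough illustration of that behaviour, here is a minimal sketch. It is not the code in `model_training.py`: the helper name `split_features_target`, its `seed` parameter, and the `demo_*` variables are hypothetical, and the real function may shuffle or stratify differently.

```python
import numpy as np
import pandas as pd


def split_features_target(df, target_column, test_size=0.25, seed=0):
    """Hypothetical stand-in for split_data: every non-target column is a feature."""
    # Sample the test rows first, then treat the remaining rows as the training set.
    test_df = df.sample(frac=test_size, random_state=seed)
    train_df = df.drop(index=test_df.index)
    X_train = train_df.drop(columns=[target_column])
    X_test = test_df.drop(columns=[target_column])
    return X_train, X_test, train_df[target_column], test_df[target_column]


# Tiny demo frame; the demo_* names avoid overwriting the notebook's own variables.
demo = pd.DataFrame({
    "feature1": np.arange(8, dtype=float),
    "home_form_W": [0, 1, 2, 0, 1, 2, 0, 1],
    "result_label": [0, 1, 2, 0, 1, 2, 0, 1],
})
demo_X_train, demo_X_test, demo_y_train, demo_y_test = split_features_target(demo, "result_label")
print(demo_X_train.columns.tolist())
print(len(demo_X_train), len(demo_X_test))
```

Because the feature set is derived from the frame itself, any column added to `training_df` before the split becomes a model input, which is why the form features need no extra wiring here.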

## 3. Model Evaluation

In [None]:
if trained_model is not None and not X_test.empty:
    print("\nEvaluating model on the test set...")
    # Ensure X_test has the same features as X_train (split_data handles this)
    y_pred = trained_model.predict(X_test)
    y_prob = trained_model.predict_proba(X_test)

    class_labels = sorted(training_df['result_label'].unique())
    target_names = ["Home Win (0)", "Draw (1)", "Away Win (2)"] # More descriptive

    metrics = get_classification_metrics(y_test, y_pred, y_prob, average='weighted', labels=class_labels, target_names=target_names)

    plot_confusion_matrix(y_test, y_pred, labels=class_labels, display_labels=target_names, filename="example_notebook_cm_with_form.png")
else:
    print("Model not trained or test data is empty. Skipping evaluation.")

## 4. Running the Daily Prediction Pipeline

This part uses the `prediction_pipeline.py` script. Key changes to be aware of:
- The pipeline now incorporates **team form features**.
- This means `generate_features_for_prediction` (called by `predict_daily_matches`) uses `engineer_form_features`.
- `engineer_form_features` requires **historical match data**. The pipeline attempts to load this from `data/historical_matches_sample.csv`. If not found, it uses an internal minimal dummy dataset.
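
To make the role of the historical data concrete, the sketch below counts wins, draws, and losses over a team's most recent matches. It only illustrates the idea behind form features and is not the implementation of `engineer_form_features`: the function `team_form` and the column names `date`, `home_team`, `away_team`, `home_goals`, and `away_goals` are assumptions made for this example.

```python
import pandas as pd


def team_form(history, team, before_date, n=5):
    """Count wins, draws and losses over a team's last n matches before a date."""
    played = history[
        ((history["home_team"] == team) | (history["away_team"] == team))
        & (history["date"] < before_date)
    ].sort_values("date").tail(n)

    # Goal difference from the team's point of view, whichever side it played on.
    is_home = played["home_team"] == team
    diff = (played["home_goals"] - played["away_goals"]).where(
        is_home, played["away_goals"] - played["home_goals"]
    )
    return {
        "W": int((diff > 0).sum()),
        "D": int((diff == 0).sum()),
        "L": int((diff < 0).sum()),
        "games_played": int(len(played)),
    }


# Made-up results, for illustration only.
history = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-15"]),
    "home_team": ["A", "B", "A"],
    "away_team": ["B", "A", "C"],
    "home_goals": [2, 1, 0],
    "away_goals": [0, 1, 3],
})
form_a = team_form(history, "A", pd.Timestamp("2024-02-01"))
print(form_a)
```

The strict `<` comparison on the date keeps the match being predicted out of its own form window. With only a handful of historical rows, `games_played` will often be below 5, which is one reason a very small history file yields weak form features.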

**Requires:**
- A trained model file (e.g., `example_rf_model_with_form.pkl`, saved by the training step above). The model must have been trained on the same features the pipeline generates: the form features plus any base features such as `feature1`, `feature2`, and `feature3`, if the pipeline's dummy data includes them.
- `FOOTBALL_DATA_API_KEY` environment variable for fetching actual daily matches.
- Optionally, `APISPORTS_API_KEY` if you want to test with that data source via the pipeline script (though the notebook call below uses football-data.org by default).
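
Since the model requirement above depends on the training and prediction features matching, a small guard can catch a mismatch before `predict` is called. The helper below is a hypothetical sketch (the name `check_feature_alignment` is not part of the project) and works on plain lists of column names.

```python
def check_feature_alignment(expected, actual):
    """Raise ValueError unless `actual` has exactly the names in `expected`, in order."""
    expected, actual = list(expected), list(actual)
    missing = [name for name in expected if name not in actual]
    extra = [name for name in actual if name not in expected]
    if missing or extra:
        raise ValueError(f"Feature mismatch: missing={missing}, unexpected={extra}")
    if expected != actual:
        raise ValueError("Same features but in a different order; reindex before predicting.")


trained_on = ["feature1", "home_form_W", "away_form_W"]
check_feature_alignment(trained_on, ["feature1", "home_form_W", "away_form_W"])  # passes silently

try:
    check_feature_alignment(trained_on, ["feature1", "away_form_W"])
except ValueError as err:
    mismatch_message = str(err)
    print(mismatch_message)
```

For a scikit-learn estimator fitted on a DataFrame with string column names, the training columns are available as `model.feature_names_in_` (scikit-learn 1.0 and later), so a call such as `check_feature_alignment(model.feature_names_in_, X_new.columns)` is one way to apply this check. Whether the project's pipeline exposes the feature frame at that point is an assumption.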

In [None]:
today_str = datetime.now().strftime("%Y-%m-%d")
api_key_notebook = os.getenv("FOOTBALL_DATA_API_KEY", "YOUR_API_TOKEN")

print(f"Running prediction pipeline for date: {today_str}")
print(f"Using model: {rf_model_filename}") # This should be the model trained with form features
print(f"API Key for football-data.org available: {'Yes' if api_key_notebook != 'YOUR_API_TOKEN' else 'No (will not fetch real data)'}")

model_to_use_in_pipeline = rf_model_filename 
model_path_pipeline = os.path.join(project_root, 'models', model_to_use_in_pipeline)

if not os.path.exists(model_path_pipeline):
    print(f"ERROR: Model {model_to_use_in_pipeline} not found at {model_path_pipeline}.")
    print("Please ensure the model from training step was saved correctly or use an existing model file.")
else:
    # The prediction_pipeline.py's generate_features_for_prediction now calls engineer_form_features.
    # It also adds dummy 'feature1', 'feature2', 'feature3' for consistency with model_training.py.
    # Our trained model (example_rf_model_with_form.pkl) includes these features.
    print("\nPipeline will attempt to load historical data from '../data/historical_matches_sample.csv' or use a fallback.")
    predict_daily_matches(date_str=today_str, model_filename=model_to_use_in_pipeline, api_key=api_key_notebook, source_api='football-data')


### Important Considerations for Real-World Usage

1.  **Feature Consistency**: The features generated by `generate_features_for_prediction` in `prediction_pipeline.py` (now including form features and potentially other base features) *must* exactly match, in name and order, the features used to train the loaded model. The examples in this notebook are aligned, but custom models need careful handling.
2.  **Historical Data**: The quality and availability of historical data (like `data/historical_matches_sample.csv`) are crucial for meaningful form features. The current sample CSV is very small.
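3.  **Model Retraining**: Retrain the model periodically as new match results become available, and re-run the evaluation step before switching the pipeline to the new model file.
4.  **API Key Management**: Keep API keys out of the notebook and out of version control. Read them from environment variables such as `FOOTBALL_DATA_API_KEY`, as the prediction cell does.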
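
For the API key point, one option is to fail fast when the key is missing rather than falling back to a placeholder, and to mask the key whenever it is printed. This is a minimal sketch using only the standard library. The helper names `require_api_key` and `mask_key` and the `DEMO_API_KEY` variable are hypothetical, and the placeholder string simply mirrors the default used in the prediction cell.

```python
import os


def require_api_key(env_var, placeholder="YOUR_API_TOKEN"):
    """Return the key stored in `env_var`, or raise if it is unset or still the placeholder."""
    key = os.environ.get(env_var, "").strip()
    if not key or key == placeholder:
        raise RuntimeError(f"Set the {env_var} environment variable before fetching live data.")
    return key


def mask_key(key, visible=4):
    """Show only the last few characters so the key can appear in logs."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]


os.environ["DEMO_API_KEY"] = "abcd1234efgh"  # demo value only, not a real key
masked = mask_key(require_api_key("DEMO_API_KEY"))
print(masked)
```

Raising early makes a missing key obvious at the top of a run. The notebook cell instead passes the placeholder through and reports that real data will not be fetched, which is a reasonable choice for a demo.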