# Notebook 2: Model Training, Evaluation, and Prediction Pipeline

This notebook demonstrates:
1. Training a simple model using `model_training.py`.
2. Evaluating the model using `model_evaluation.py`.
3. Running the end-to-end daily prediction pipeline using `prediction_pipeline.py`.

**Note:** This notebook uses dummy data and features for model training and prediction because feature engineering is not yet fully implemented. The prediction pipeline also requires the `FOOTBALL_DATA_API_KEY` environment variable to fetch live matches.

### Cross-Validation for Robust Evaluation

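A single train/test split is useful, but cross-validation gives a more robust estimate of model performance by training and testing on several different subsets of the data.
The `get_cross_val_metrics` function from `model_evaluation.py` can be used for this. The cross-validation cell that follows relies on `trained_model`, `X_train`, and `y_train`, so run it only after the training cell in Section 2.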
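
As a rough illustration of what such a helper might do internally, the sketch below uses scikit-learn's `cross_validate` with stratified folds. The implementation of `get_cross_val_metrics` is not shown in this notebook, so the function name `cross_val_summary`, the scoring list, and the synthetic data are all assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate


def cross_val_summary(model, X, y, cv=3):
    """Return the mean of each test metric across stratified folds (hypothetical helper)."""
    scoring = ["accuracy", "f1_weighted"]
    folds = StratifiedKFold(n_splits=cv, shuffle=True, random_state=0)
    results = cross_validate(model, X, y, cv=folds, scoring=scoring)
    return {name: float(np.mean(results[f"test_{name}"])) for name in scoring}


# Synthetic three-class data standing in for real match features.
rng = np.random.default_rng(0)
X_demo = rng.random((90, 4))
y_demo = rng.integers(0, 3, size=90)

cv_summary = cross_val_summary(
    RandomForestClassifier(n_estimators=20, random_state=0), X_demo, y_demo
)
for metric, value in cv_summary.items():
    print(f"{metric}: {value:.4f}")
```

Stratified folds keep the class proportions roughly equal in every fold, which matters when one outcome (such as draws) is rarer than the others.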

In [None]:
if trained_model is not None and not X_train.empty:
    print("\n--- Performing Cross-Validation on the Training Data ---")
    # X_train and y_train come from the model training step (Section 2).
    # get_cross_val_metrics is imported in the Setup and Imports cell (Section 1).
    cv_metrics = get_cross_val_metrics(trained_model, X_train, y_train, cv=3)  # cv=3 keeps this fast
    print("\nCross-Validation Metrics (on training data):")
    for metric, value in cv_metrics.items():
        print(f"  {metric}: {value:.4f}")
else:
    print("\nSkipping cross-validation as model was not trained or X_train is not available.")

#### Note on Random Forest Training with GridSearchCV:
The `train_random_forest` function called below now incorporates `GridSearchCV` for hyperparameter optimization.
It searches a predefined set of hyperparameter combinations and selects the best-performing Random Forest model based on cross-validation.
As a result, this training step may take noticeably longer than a single model fit. The best model found is saved.
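
To make the cost of the search concrete, here is a minimal sketch of a grid search over a Random Forest with scikit-learn. The grid values and the synthetic data are assumptions for illustration; the grid actually used inside `train_random_forest` is not shown in this notebook and may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
X_demo = rng.random((120, 5))
y_demo = rng.integers(0, 3, size=120)

# A deliberately small, hypothetical grid: 2 x 2 = 4 candidates, each fitted on 3 folds.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.4f}")
best_model = search.best_estimator_  # refit on the full data because refit=True by default
```

With 4 candidates and 3 folds, this runs 12 cross-validation fits plus one final refit, which is why a grid search takes longer than a single call to `.fit()`.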

## 1. Setup and Imports

In [None]:
import os
import sys
import pandas as pd
import numpy as np
from datetime import datetime

# Add src directory to Python path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
src_path = os.path.join(project_root, 'src')
if src_path not in sys.path:
    sys.path.append(src_path)

try:
    from model_training import split_data, train_random_forest, load_model # Using Random Forest as example
    from model_evaluation import get_classification_metrics, plot_confusion_matrix, get_cross_val_metrics # Added get_cross_val_metrics
    # Ensure prediction_pipeline can be imported to get its functions
    from prediction_pipeline import predict_daily_matches, generate_features_for_prediction 
    # from data_collection import get_matches_with_fallback # Not directly used here, but good to ensure it's available
except ImportError as e:
    print(f"Error importing modules: {e}")
    print("Make sure 'src' is in sys.path and __init__.py files are present.")

## 2. Model Training (Example with Dummy Data)

We'll create a synthetic dataset and train a Random Forest model. In a real scenario, this data would come from `01_data_collection_and_preprocessing.ipynb` after extensive feature engineering.

In [None]:
# Create dummy data for training
num_samples = 200
data = {
    'feature1': np.random.rand(num_samples),
    'feature2': np.random.rand(num_samples) * 10,
    'feature3': np.random.randint(0, 5, num_samples),
    
    # New comprehensive form features (dummy values)
    # Home team
    'home_form_overall_W': np.random.randint(0, 4, num_samples), 
    'home_form_overall_D': np.random.randint(0, 2, num_samples),
    'home_form_overall_games_played': np.full(num_samples, 5),
    
    'home_form_home_W': np.random.randint(0, 3, num_samples),
    'home_form_home_D': np.random.randint(0, 2, num_samples),
    'home_form_home_games_played': np.random.randint(2, 6, num_samples), # Random home games played (2-5)

    'home_form_away_W': np.random.randint(0, 3, num_samples),
    'home_form_away_D': np.random.randint(0, 2, num_samples),
    'home_form_away_games_played': np.random.randint(2, 6, num_samples), # Random away games played (2-5)

    # Away team
    'away_form_overall_W': np.random.randint(0, 4, num_samples),
    'away_form_overall_D': np.random.randint(0, 2, num_samples),
    'away_form_overall_games_played': np.full(num_samples, 5),

    'away_form_home_W': np.random.randint(0, 3, num_samples),
    'away_form_home_D': np.random.randint(0, 2, num_samples),
    'away_form_home_games_played': np.random.randint(2, 6, num_samples),

    'away_form_away_W': np.random.randint(0, 3, num_samples),
    'away_form_away_D': np.random.randint(0, 2, num_samples),
    'away_form_away_games_played': np.random.randint(2, 6, num_samples),
    
    'result_label': np.random.choice([0, 1, 2], num_samples, p=[0.45, 0.25, 0.30]) 
}
training_df = pd.DataFrame(data)

# Ensure consistent L (Losses) based on W, D, and games_played for all form categories
form_categories_notebook = [
    'home_form_overall', 'home_form_home', 'home_form_away',
    'away_form_overall', 'away_form_home', 'away_form_away'
]
for cat_prefix in form_categories_notebook:
    w_col = f'{cat_prefix}_W'
    d_col = f'{cat_prefix}_D'
    gp_col = f'{cat_prefix}_games_played'
    l_col = f'{cat_prefix}_L'
    # Clip W to games_played and D to the remaining games, so W + D never exceeds games_played and L is never negative
    training_df[w_col] = np.minimum(training_df[w_col], training_df[gp_col])
    training_df[d_col] = np.minimum(training_df[d_col], training_df[gp_col] - training_df[w_col])
    training_df[l_col] = training_df[gp_col] - training_df[w_col] - training_df[d_col]

print("Dummy Training Data Head (with comprehensive form features):")
print(training_df.head())
print("\nTarget Distribution:")
print(training_df['result_label'].value_counts(normalize=True))

feature_columns = [
    'feature1', 'feature2', 'feature3',
    'home_form_overall_W', 'home_form_overall_D', 'home_form_overall_L', 'home_form_overall_games_played',
    'home_form_home_W', 'home_form_home_D', 'home_form_home_L', 'home_form_home_games_played',
    'home_form_away_W', 'home_form_away_D', 'home_form_away_L', 'home_form_away_games_played',
    'away_form_overall_W', 'away_form_overall_D', 'away_form_overall_L', 'away_form_overall_games_played',
    'away_form_home_W', 'away_form_home_D', 'away_form_home_L', 'away_form_home_games_played',
    'away_form_away_W', 'away_form_away_D', 'away_form_away_L', 'away_form_away_games_played'
]

#### Note on XGBoost Training (Optional):
While this notebook primarily trains and uses a Random Forest model, the `model_training.py` script also contains a `train_xgboost` function.
This function has been enhanced to use **early stopping**. It internally splits its training data to create a validation set,
so it can stop training if performance on the validation set does not improve for a set number of rounds (e.g., 10).
This helps prevent overfitting and can find a better number of boosting rounds. The `verbose=False` parameter keeps the output clean.
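
The stopping rule described above can be sketched independently of XGBoost. The function below is a hypothetical illustration of patience-based early stopping applied to a list of validation losses. It is not the code used by `train_xgboost`, and the loss values are made up for the example.

```python
def early_stopping_round(val_losses, patience=10):
    """Return (best_round, stop_round) for a patience-based early-stopping rule.

    Training stops once `patience` consecutive rounds pass without a new
    lowest validation loss, or when the losses run out.
    """
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    return best_round, len(val_losses) - 1


# Hypothetical validation losses: improving until round 3, then drifting upward.
losses = [1.10, 1.02, 0.97, 0.95, 0.96, 0.97, 0.96, 0.98, 0.99]
best, stopped = early_stopping_round(losses, patience=3)
print(f"best round: {best}, stopped at round: {stopped}")
# prints "best round: 3, stopped at round: 6"
```

The model from the best round, rather than the last round trained, is the one worth keeping, since the later rounds did not improve validation loss.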

In [None]:
# Split data
if not training_df.empty:
    # The split_data function in model_training.py uses all columns except target as features.
    # So, X_train will automatically include the new form features added to training_df.
    X_train, X_test, y_train, y_test = split_data(training_df, target_column='result_label', test_size=0.25)
    print(f"\nTraining set size: {X_train.shape[0]}, Test set size: {X_test.shape[0]}")
    print(f"Features in X_train: {X_train.columns.tolist()}")

    # Train a Random Forest model
    rf_model_filename = "example_rf_model_with_form.pkl" # New model name
    trained_model = train_random_forest(X_train, y_train, model_filename=rf_model_filename)
    print(f"\nModel trained: {trained_model}")
else:
    print("Training DataFrame is empty. Skipping training and evaluation.")
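    # Define every variable used by later cells so they can skip cleanly instead of raising NameError
    X_train, X_test = pd.DataFrame(), pd.DataFrame()
    y_train, y_test = pd.Series(dtype='float64'), pd.Series(dtype='float64')
    trained_model = None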

### 📝 Verifying Training Performance (Checklist Item)

According to the checklist (`🔍 Verificar se Está Treinando Certo`, "Check Whether It Is Training Correctly"):
*   **Confirm that the loss function decreases during training.**
*   **Check that accuracy (or another metric) increases on the validation data.**

**How to check:**
*   **Loss Decreasing:** scikit-learn models such as RandomForest do not report a per-iteration loss during `.fit()`, so learning has to be inferred indirectly, for example from better validation metrics after tuning (more trees, a different depth). With XGBoost, passing an `eval_set` together with `verbose=True` (or a callback) prints the validation loss for each round (e.g., `validation_0-mlogloss`), which you can monitor directly.
*   **Accuracy Increasing (on Validation):** The "Model Evaluation" section that follows calculates metrics on the test set, which is the key indicator here. If you held out a separate validation set before training, run the same metrics on it. For iterative models or during hyperparameter tuning, look for the validation metric to improve.
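
One way to approximate a training curve for a Random Forest is to grow the forest in stages and record validation metrics after each stage. The sketch below does this with scikit-learn's `warm_start` option. The synthetic data, the stage sizes, and the variable names are assumptions for illustration and are not part of `model_training.py`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X_demo = rng.random((300, 4))
# Make the label depend on the first feature so there is something to learn.
y_demo = np.digitize(X_demo[:, 0], bins=[0.33, 0.66])

X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=1
)

# warm_start=True keeps the existing trees and only adds new ones on each fit call.
model = RandomForestClassifier(n_estimators=5, warm_start=True, random_state=1)
history = []
for n_trees in (5, 20, 50, 100):
    model.set_params(n_estimators=n_trees)
    model.fit(X_fit, y_fit)
    val_prob = model.predict_proba(X_val)
    history.append({
        "n_trees": n_trees,
        "val_log_loss": log_loss(y_val, val_prob, labels=model.classes_),
        "val_accuracy": accuracy_score(y_val, model.predict(X_val)),
    })

for row in history:
    print(row)
```

If the validation log loss stops falling as trees are added, more trees are unlikely to help and other changes (features, depth, more data) are worth trying instead.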

The cell below evaluates the model on the test set. Pay attention to the accuracy and other metrics reported.

## 3. Model Evaluation

In [None]:
if trained_model is not None and not X_test.empty:
    print("\nEvaluating model on the test set...")
    # Ensure X_test has the same features as X_train (split_data handles this)
    y_pred = trained_model.predict(X_test)
    y_prob = trained_model.predict_proba(X_test)

    class_labels = sorted(training_df['result_label'].unique())
    target_names = ["Home Win (0)", "Draw (1)", "Away Win (2)"] # More descriptive

    metrics = get_classification_metrics(y_test, y_pred, y_prob, average='weighted', labels=class_labels, target_names=target_names)

    plot_confusion_matrix(y_test, y_pred, labels=class_labels, display_labels=target_names, filename="example_notebook_cm_with_form.png")
else:
    print("Model not trained or test data is empty. Skipping evaluation.")

### 🩺 Checking for Overfitting (Checklist Item)

According to the checklist (`🔍 Verificar se Está Treinando Certo`, "Check Whether It Is Training Correctly"):
*   **Check that the model is not just memorizing the data (overfitting).**

**How to check:**
*   **Compare Training vs. Test Performance:** Overfitting occurs when a model performs exceptionally well on the training data but poorly on unseen data (like the test set or a validation set).
    *   To do this thoroughly, you would need to:
        1.  Train your model.
        2.  Evaluate it on the **training data** (e.g., `trained_model.predict(X_train)` and then use `get_classification_metrics`).
        3.  Compare these training metrics with the **test set metrics** calculated above.
*   **Signs of Overfitting:**
    *   Training accuracy is very high (e.g., 95-100%).
    *   Test accuracy is significantly lower than training accuracy.
    *   The model might be too complex (e.g., too many features, too deep trees in Random Forest).
*   **Addressing Overfitting (General Techniques mentioned in checklist):**
    *   Use simpler models.
    *   Reduce feature dimensionality.
    *   Regularization (e.g., L1/L2 for Logistic Regression, `alpha`/`lambda` in XGBoost).
    *   For tree-based models: prune trees, limit max depth, increase minimum samples per leaf.
    *   Use techniques like Dropout (for Neural Networks).
    *   Use Cross-Validation for more robust evaluation.
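
The train-versus-test comparison can be sketched as follows. The labels here are pure noise, much like the random `result_label` in this notebook's dummy data, so any high training accuracy is memorization. The helper name `train_test_gap` and the hyperparameter values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X_demo = rng.random((200, 6))
y_demo = rng.integers(0, 3, size=200)  # pure noise: labels are unrelated to the features

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=7)


def train_test_gap(model):
    """Fit the model and return (train accuracy, test accuracy, gap)."""
    model.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    test_acc = accuracy_score(y_te, model.predict(X_te))
    return train_acc, test_acc, train_acc - test_acc


deep = RandomForestClassifier(n_estimators=50, random_state=7)
shallow = RandomForestClassifier(
    n_estimators=50, max_depth=2, min_samples_leaf=10, random_state=7
)

for name, model in [("unconstrained", deep), ("constrained", shallow)]:
    train_acc, test_acc, gap = train_test_gap(model)
    print(f"{name:>13}: train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
```

A large gap for the unconstrained forest is the overfitting signature described above. Limiting depth and leaf size narrows the gap, although with random labels neither model can do meaningfully better than chance on the test set.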

## 4. Running the Daily Prediction Pipeline

This part uses the `prediction_pipeline.py` script. Key changes to be aware of:
- The pipeline now incorporates **team form features**.
- It uses `get_matches_with_fallback`, which can fall back to mock data if the live request to football-data.org fails.
- `generate_features_for_prediction` (called by `predict_daily_matches`) uses `engineer_form_features`.
- `engineer_form_features` requires **historical match data**. The pipeline attempts to load this from `data/historical_matches_sample.csv`. If not found, it uses an internal minimal dummy dataset.
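
To show why historical data is needed, here is a minimal sketch of how win/draw/loss form counts could be derived from past results with pandas. This is not the implementation of `engineer_form_features`. The function name `recent_form` and the column names (`date`, `home_team`, `away_team`, `home_goals`, `away_goals`) are hypothetical and may not match `data/historical_matches_sample.csv`.

```python
import pandas as pd


def recent_form(history, team, before_date, n=5):
    """Count W/D/L over the team's last `n` matches played strictly before `before_date`."""
    played = history[
        ((history["home_team"] == team) | (history["away_team"] == team))
        & (history["date"] < pd.Timestamp(before_date))
    ].sort_values("date").tail(n)

    # Express every match from the team's point of view, whether home or away.
    is_home = played["home_team"] == team
    goals_for = played["home_goals"].where(is_home, played["away_goals"])
    goals_against = played["away_goals"].where(is_home, played["home_goals"])

    return {
        "W": int((goals_for > goals_against).sum()),
        "D": int((goals_for == goals_against).sum()),
        "L": int((goals_for < goals_against).sum()),
        "games_played": int(len(played)),
    }


# Made-up results for illustration only.
history = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-15", "2024-01-22"]),
    "home_team": ["A", "B", "A", "C"],
    "away_team": ["B", "A", "C", "A"],
    "home_goals": [2, 1, 0, 3],
    "away_goals": [0, 1, 1, 0],
})

form = recent_form(history, "A", "2024-01-22")
print(form)  # {'W': 1, 'D': 1, 'L': 1, 'games_played': 3}
```

Counting only matches played strictly before the prediction date keeps the result of the match being predicted out of its own features.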

**Requires:**
- A trained model file (e.g., `example_rf_model_with_form.pkl` saved from the step above). The model must be trained with the same features the pipeline generates (including form features and any other base features like 'feature1', 'feature2', 'feature3' if the pipeline's dummy data includes them).
- `FOOTBALL_DATA_API_KEY` environment variable for fetching actual daily matches.
- Optionally, `APISPORTS_API_KEY` if you want to test with that data source via the pipeline script (though the notebook call below uses football-data.org by default).
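
A small helper can make the API key requirement explicit before the pipeline runs. The sketch below reads the key from the environment and treats the `YOUR_API_TOKEN` placeholder as missing. The function name `get_api_key` is an assumption for this example and does not exist in `prediction_pipeline.py`.

```python
import os


def get_api_key(env_var="FOOTBALL_DATA_API_KEY", placeholder="YOUR_API_TOKEN"):
    """Return the key from the environment, or None if it is unset or still the placeholder."""
    value = os.environ.get(env_var, "").strip()
    if not value or value == placeholder:
        return None
    return value


api_key = get_api_key()
if api_key is None:
    print("FOOTBALL_DATA_API_KEY is not set; live matches cannot be fetched.")
else:
    # Avoid printing the key itself; show only that it is present.
    print(f"FOOTBALL_DATA_API_KEY found ({len(api_key)} characters).")
```

Reading the key from an environment variable keeps it out of the notebook source and its saved outputs.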

In [None]:
today_str = datetime.now().strftime("%Y-%m-%d")
api_key_notebook = os.getenv("FOOTBALL_DATA_API_KEY", "YOUR_API_TOKEN")

print(f"Running prediction pipeline for date: {today_str}")
print(f"Using model: {rf_model_filename}") # This should be the model trained with form features
print(f"API Key for football-data.org available: {'Yes' if api_key_notebook != 'YOUR_API_TOKEN' else 'No (will not fetch real data, mock fallback might be used)'}")

model_to_use_in_pipeline = rf_model_filename 
model_path_pipeline = os.path.join(project_root, 'models', model_to_use_in_pipeline)

if not os.path.exists(model_path_pipeline):
    print(f"ERROR: Model {model_to_use_in_pipeline} not found at {model_path_pipeline}.")
    print("Please ensure the model from training step was saved correctly or use an existing model file.")
else:
    # The prediction_pipeline.py's generate_features_for_prediction now calls engineer_form_features.
    # It also adds dummy 'feature1', 'feature2', 'feature3' for consistency with model_training.py.
    # Our trained model (example_rf_model_with_form.pkl) includes these features.
    print("\nPipeline will attempt to load historical data from '../data/historical_matches_sample.csv' or use a fallback.")
    
    # Call predict_daily_matches, setting use_mock_data_if_unavailable to False by default for this notebook example.
    # Set to True if you want to demonstrate/test the fallback mechanism when live data might fail.
    predict_daily_matches(date_str=today_str, 
                          model_filename=model_to_use_in_pipeline, 
                          api_key=api_key_notebook, 
                          source_api='football-data', 
                          use_mock_data_if_unavailable=False)
    
    # Example of how to call it if you want to enable the mock data fallback:
    # print("\n--- Example: Running pipeline with mock data fallback enabled (if API fails/returns no data) ---")
    # predict_daily_matches(date_str=today_str, 
    #                       model_filename=model_to_use_in_pipeline, 
    #                       api_key=api_key_notebook, 
    #                       source_api='football-data', 
    #                       use_mock_data_if_unavailable=True)


### Important Considerations for Real-World Usage:

1.  **Feature Consistency**: The features generated by `generate_features_for_prediction` in `prediction_pipeline.py` (now including form features and potentially other base features) *must* exactly match, in name and order, the features used to train the loaded model. The examples here are aligned, but custom models need careful handling.
2.  **Historical Data**: The quality and availability of historical data (like `data/historical_matches_sample.csv`) are crucial for meaningful form features. The current sample CSV is very small.
3.  **Model Retraining**: Models should be periodically retrained with new data.
4.  **API Key Management**: Securely manage API keys.
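
For the feature consistency point, a lightweight guard before calling `predict` can catch mismatches early. The sketch below is one possible approach with pandas. The helper name `check_feature_alignment` and the example columns are assumptions. For scikit-learn estimators (version 1.0 or later) fitted on a DataFrame, the expected column list is typically available as `model.feature_names_in_`.

```python
import pandas as pd


def check_feature_alignment(features_df, expected_columns):
    """Return features_df with columns in the training order, or raise if any are missing."""
    missing = [col for col in expected_columns if col not in features_df.columns]
    extra = [col for col in features_df.columns if col not in expected_columns]
    if missing:
        raise ValueError(f"Missing features required by the model: {missing}")
    if extra:
        print(f"Dropping columns the model was not trained on: {extra}")
    return features_df[list(expected_columns)]


# Hypothetical example: the pipeline produced the columns in a different order plus one extra.
expected = ["feature1", "feature2", "home_form_overall_W"]
generated = pd.DataFrame({
    "home_form_overall_W": [3],
    "match_id": [101],
    "feature2": [4.2],
    "feature1": [0.7],
})

aligned = check_feature_alignment(generated, expected)
print(aligned.columns.tolist())  # ['feature1', 'feature2', 'home_form_overall_W']
```

Raising on missing features, rather than filling them with defaults, makes a mismatch between the training code and the pipeline visible immediately.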