# Time Series Forecasting for Capilano University Enrollment Data

This notebook processes enrollment data from Capilano University, converts it into time series format,
and applies three forecasting models (Seasonal Naive, ETS, ARIMA). The results are logged to
Weights & Biases for visualization and comparison.

Original Work: Jiaqi Li (jiaqili@capilanou.ca)  
Author: Eric Drechsler (dr.eric.drechsler@gmail.com)  
Version: 250301

## Environment Setup and Module Imports

This cell sets up the Python environment and imports the modules and libraries needed for the time series analysis.

First, it imports three standard-library modules:
- `logging`: Records information, warnings, and errors while the notebook runs, which helps with debugging and monitoring.
- `os`: Provides operating-system functionality such as file system access.
- `typing`: Supplies the type hints `Any`, `Dict`, `List`, `Optional`, and `Tuple` used in the function signatures, which improve readability and help catch type-related errors early.

Next, it imports external libraries:
- `hydra`: A configuration management library that supports structured configuration through YAML files and command-line overrides, as mentioned in the README.
- `pandas`: The data manipulation library used for structured data such as the CSV files mentioned in the `data_loader.py` description.
- `pyrootutils`: A utility library for locating the project root and managing paths, as described in the README's project structure section.
- `omegaconf`: The configuration library that Hydra builds on; `DictConfig` and `OmegaConf` are imported from it.

The code then calls `pyrootutils.setup_root()` to establish the project's root directory, so the notebook can find files and modules regardless of where it is launched. Starting from the current working directory, the call searches upward for a `.project_root` indicator file, sets the `PROJECT_ROOT` environment variable to the directory it finds, loads a `.env` file if one exists there, adds the root to the Python path, and changes the working directory to the root. The resulting path is stored in `ROOT_PATH`.
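The root lookup can be illustrated with the standard library alone. The sketch below is not the `pyrootutils` implementation; `find_root` is a hypothetical helper that walks up from a starting directory until it meets the indicator file, and it covers only the search step (no environment variable, `.env` loading, or path changes).

```python
import tempfile
from pathlib import Path


def find_root(search_from: Path, indicator: str = ".project_root") -> Path:
    """Return the nearest directory at or above search_from that contains indicator."""
    start = search_from.resolve()
    for candidate in (start, *start.parents):
        if (candidate / indicator).exists():
            return candidate
    raise FileNotFoundError(f"No '{indicator}' found at or above {start}")


# Demonstrate on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / ".project_root").touch()  # marker file at the project root
    nested = root / "notebooks" / "drafts"
    nested.mkdir(parents=True)
    found = find_root(nested)  # search starts two levels below the root
    print(found == root)
```

Because the search starts from the current working directory, the notebook can live in any subdirectory of the project as long as a `.project_root` file exists somewhere above it.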

Following this, the script imports specific functions from the modules within the `capu_time_series_analysis` package, as outlined in the "Modules Explained" section of the README:
- From `data_loader.py`: `add_to_consolidated_df`, `load_data`, and `prepare_time_series`, which are responsible for loading, preparing, and structuring the enrollment data.
- From `evaluation.py`: `analyze_residuals` and `evaluate_forecasts`, which provide tools for assessing the performance and validity of the forecasting models.
- From `models.py`: `fit_models`, which implements the time series forecasting models as described in the README.
- From `visualization.py`: `log_results_to_wandb`, which handles logging results and visualizations to Weights & Biases, as mentioned in the "Enhanced Visualization" section of the README.

Finally, the cell configures the logging system. It sets the logging level to `INFO`, specifies the format of the log messages (including timestamp, logger name, level, and message), and sets the date format. It then gets the root logger, which will be used to record events during the script's execution.
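To see what that format string produces, the following sketch attaches the same `format` and `datefmt` to an in-memory handler. The logger name `demo` and the message text are made up for illustration; the notebook itself logs through the root logger.

```python
import io
import logging

# Capture log output in memory instead of writing to stderr.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter(
        fmt="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )
)

demo_logger = logging.getLogger("demo")  # hypothetical logger name
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo isolated from the root logger
demo_logger.addHandler(handler)

demo_logger.debug("hidden: below the INFO threshold")
demo_logger.info("Processing 8 combinations")

output = stream.getvalue()
print(output, end="")
```

Only the `INFO` record is emitted, as a single line of the form `<timestamp> - demo - INFO - Processing 8 combinations`.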

In [None]:
# This cell sets up the environment, imports necessary modules, and configures
# logging

import logging
import os
from typing import Any, Dict, List, Optional, Tuple

import hydra
import pandas as pd
import pyrootutils
from omegaconf import DictConfig, OmegaConf

ROOT_PATH = pyrootutils.setup_root(
    search_from=os.getcwd(),
    indicator=".project_root",
    project_root_env_var=True,
    dotenv=True,
    pythonpath=True,
    cwd=True,
)

# Import utility modules
from capu_time_series_analysis.data_loader import (
    add_to_consolidated_df,
    load_data,
    prepare_time_series,
)
from capu_time_series_analysis.evaluation import analyze_residuals, evaluate_forecasts
from capu_time_series_analysis.models import fit_models
from capu_time_series_analysis.visualization import log_results_to_wandb

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger = logging.getLogger()

## Helper Functions
### Calculate and Analyze Model Residuals

This cell defines `calculate_residuals`, which takes the three fitted models (Seasonal Naive, ETS, ARIMA), the training time series, a list that accumulates residual diagnostics, and three metadata strings (`mt` for metric type, `resd` for residency, and `level`). It analyzes the residuals of each fitted model and stores the diagnostic results.

The function starts by calling the `analyze_residuals` function (imported from `capu_time_series_analysis.evaluation`) for each of the three models: Seasonal Naive, ETS, and ARIMA. The `analyze_residuals` function likely performs various tests and generates plots to assess the quality of the model fit, as described in the "Understanding Model Diagnostics" section of the README. This includes checking for autocorrelation using the Ljung-Box test, assessing normality, and determining if the residuals resemble white noise.
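For intuition about the Ljung-Box test mentioned above, here is a minimal pure-Python sketch of its statistic, $Q = n(n+2)\sum_{k=1}^{h} r_k^2/(n-k)$, where $r_k$ is the sample autocorrelation at lag $k$. The helper names and the two toy series are assumptions for illustration; `analyze_residuals` may well compute the test differently, for example through a statistics library.

```python
import random
from typing import Sequence


def autocorrelation(x: Sequence[float], lag: int) -> float:
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return num / denom


def ljung_box_stat(residuals: Sequence[float], max_lag: int) -> float:
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorrelation(residuals, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(100)]  # residuals with no structure
alternating = [1.0 if t % 2 == 0 else -1.0 for t in range(100)]  # strong lag-1 pattern

q_noise = ljung_box_stat(noise, max_lag=1)
q_alternating = ljung_box_stat(alternating, max_lag=1)
print(f"noise: {q_noise:.2f}, alternating: {q_alternating:.2f}")
```

Under the null hypothesis of no autocorrelation, Q is compared against a chi-square distribution. With a single lag the 5% critical value is about 3.84, so the alternating series (Q of roughly 101) is clearly not white noise. A large Q, and hence a small p-value, signals structure the model failed to capture.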

The results from `analyze_residuals` for each model (including mean residual, standard deviation of residuals, Ljung-Box statistic and p-value, normality p-value, white noise status, residuals plot, and actual vs. fitted plot) are then structured into dictionaries. Each dictionary represents the residual diagnostics for a specific model and includes the provided metadata (metric type, residency, and level) to help identify the context of the analysis.

Finally, these dictionaries are appended to the `residual_diagnostics` list that was passed in. The list is modified in place and also returned, so the caller's list accumulates residual analysis results across models and data slices.

In essence, this function automates the process of evaluating how well each model fits the training data by examining its residuals, which is a crucial step in model selection and validation.

In [None]:
def calculate_residuals(
    fit_snaive: Any,
    fit_ets: Any,
    fit_arima: Any,
    train_series: pd.Series,
    residual_diagnostics: List[Dict[str, Any]],
    mt: str,
    resd: str,
    level: str,
) -> List[Dict[str, Any]]:
    snaive_residuals = analyze_residuals(fit_snaive, train_series, "Seasonal Naive")
    ets_residuals = analyze_residuals(fit_ets, train_series, "ETS")
    arima_residuals = analyze_residuals(fit_arima, train_series, "ARIMA")

    for model_name, diag in [
        ("Seasonal Naive", snaive_residuals),
        ("ETS", ets_residuals),
        ("ARIMA", arima_residuals),
    ]:
        residual_diagnostics.append(
            {
                "Analysis_Type": mt,
                "Residency": resd,
                "Level": level,
                "Model": model_name,
                "Mean_Residual": diag["mean_residual"],
                "Std_Residual": diag["std_residual"],
                "Ljung_Box_Stat": diag["ljung_box_stat"],
                "Ljung_Box_pvalue": diag["ljung_box_pvalue"],
                "Normality_pvalue": diag["residuals_normal_pvalue"],
                "White_Noise": diag["white_noise"],
                "Residuals_Plot": diag["residuals_plot"],
                "Actual_Fitted_Plot": diag["actual_fitted_plot"],
            }
        )
    return residual_diagnostics

### Function to Process Time Series Data for Multiple Combinations

This cell defines the core function `process_timeseries`, which is responsible for iterating through different combinations of metrics, residencies, and levels to perform time series analysis on the enrollment data.

The function takes several arguments:
- `input_file`: The path to the input data file (likely a CSV file as mentioned in the README).
- `metrics`: A list of metrics to analyze (e.g., Headcount, CourseEnrolment).
- `residencies`: A list of residency types to filter the data by.
- `levels`: A list of levels (e.g., CapU) to filter the data by.
- `forecast_steps`: The number of steps to forecast into the future (default is 9).
- `model_params`: An optional dictionary to specify parameters for the models.

The function first handles the case where `model_params` is not provided, initializing it as an empty dictionary.
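The `None` default is the usual way to avoid Python's shared mutable default argument. The sketch below contrasts the two patterns with hypothetical functions that are not part of the notebook.

```python
from typing import Any, Dict, Optional


def shared_default(params: Dict[str, Any] = {}) -> Dict[str, Any]:
    # The default dict is created once, at function definition time,
    # so every call without an argument mutates the same object.
    params["calls"] = params.get("calls", 0) + 1
    return params


def fresh_default(params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    # A new dict is created on each call that omits the argument.
    if params is None:
        params = {}
    params["calls"] = params.get("calls", 0) + 1
    return params


shared_default()
second_shared = shared_default()
fresh_default()
second_fresh = fresh_default()

print(second_shared["calls"], second_fresh["calls"])
```

The second call to `shared_default` sees state left over from the first, while `fresh_default` starts from an empty dictionary each time.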

It then computes the total number of combinations to process (the product of the lengths of `levels`, `residencies`, and `metrics`) and logs it using the `logger` set up in the first cell.

The function initializes an empty DataFrame `consolidated_df` with the columns `Analysis_Type`, `Residency`, `Level`, `Timestamp`, `Model`, `Entry_Type`, and `Entry`, which store the results in a standardized format, as suggested by the description of `data_loader.py` in the README. It also initializes two empty lists, `evaluation_results` and `residual_diagnostics`, which hold the evaluation metrics and residual analysis results, respectively.

The core logic consists of three nested loops that iterate through every combination of metric, residency, and level. Inside these loops:
- The function increments a counter for the current combination and logs its progress.
- The comment `...existing code...` marks where the per-combination data processing and model fitting logic belongs; it is not shown in this notebook. That logic will likely involve loading the data, preparing the time series, fitting the models, evaluating the forecasts, and analyzing the residuals. As written, the loop body only logs progress, so the returned containers stay empty.
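The iteration order and progress messages can be previewed without any data. This sketch uses `itertools.product` as an equivalent to the three nested loops, with the example values from the configuration cell; it only builds the log lines and does not run any models.

```python
from itertools import product

metrics = ["Headcount", "FTE"]
residencies = ["Domestic", "International"]
levels = ["Undergraduate", "Graduate"]

# product() varies the last iterable fastest, matching loops nested as
# metrics (outer) -> residencies -> levels (inner).
combinations = list(product(metrics, residencies, levels))
total = len(combinations)

progress_lines = [
    f"Processing {level} - {resd} - {mt} ({i}/{total})"
    for i, (mt, resd, level) in enumerate(combinations, start=1)
]

for line in progress_lines:
    print(line)
```

With two values per dimension there are 2 × 2 × 2 = 8 combinations, starting with `Undergraduate - Domestic - Headcount` and ending with `Graduate - International - FTE`.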

Finally, the function returns the `consolidated_df`, `evaluation_results`, and `residual_diagnostics`. These outputs will contain all the results of the time series analysis for all the processed combinations.

This function orchestrates the entire analysis process, ensuring that all specified combinations are processed and their results are collected for further analysis and visualization.

In [None]:
def process_timeseries(
    input_file: str,
    metrics: List[str],
    residencies: List[str],
    levels: List[str],
    forecast_steps: int = 9,
    model_params: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Tuple[pd.DataFrame, List[Dict[str, Any]], List[Dict[str, Any]]]:
    if model_params is None:
        model_params = {}

    total_combinations = len(levels) * len(residencies) * len(metrics)
    logger.info(
        f"Processing {total_combinations} combinations of levels, residencies, and metrics"
    )

    current_combination = 0
    consolidated_df = pd.DataFrame(
        columns=[
            "Analysis_Type",
            "Residency",
            "Level",
            "Timestamp",
            "Model",
            "Entry_Type",
            "Entry",
        ]
    )
    evaluation_results: List[Dict[str, Any]] = []
    residual_diagnostics: List[Dict[str, Any]] = []

    # Process all combinations
    for mt in metrics:
        for resd in residencies:
            for level in levels:
                current_combination += 1
                logger.info(
                    f"Processing {level} - {resd} - {mt} ({current_combination}/{total_combinations})"
                )

                # ...existing code...

    return consolidated_df, evaluation_results, residual_diagnostics

## Main Execution

In [None]:
# Configuration
cfg = {
    "input_file": "path/to/your/input.csv",
    "plot_dir": "plots",
    "levels": ["Undergraduate", "Graduate"],
    "residencies": ["Domestic", "International"],
    "metrics": ["Headcount", "FTE"],
    "forecast_steps": 9,
    "models": {"seasonal_naive_params": {}, "ets_params": {}, "arima_params": {}},
}

# Create plot directory if it doesn't exist
if not os.path.exists(cfg["plot_dir"]):
    os.makedirs(cfg["plot_dir"])
    logger.info(f"Created plot directory: {cfg['plot_dir']}")

# Process time series
consolidated_df, evaluation_results, residual_diagnostics = process_timeseries(
    cfg["input_file"],
    cfg["metrics"],
    cfg["residencies"],
    cfg["levels"],
    forecast_steps=cfg["forecast_steps"],
    model_params=cfg["models"],
)

# Create evaluation dataframe
evaluation_df = pd.DataFrame(evaluation_results)
logger.info(f"Created evaluation dataframe with {len(evaluation_df)} entries")

# Create residual diagnostics dataframe
residual_df = pd.DataFrame(residual_diagnostics)
logger.info(f"Created residual diagnostics dataframe with {len(residual_df)} entries")

# Log results to wandb
logger.info("Logging results to Weights & Biases")
log_results_to_wandb(
    consolidated_df, evaluation_df, residual_df, cfg["metrics"], cfg["levels"], cfg["residencies"]
)