# Time Series Forecasting for Capilano University Enrollment Data

This notebook processes enrollment data from Capilano University, converts it into time series format,
and applies various forecasting models (Seasonal Naive, ETS, ARIMA). The results are logged to
Weights & Biases for visualization and comparison.

Original Work: Jiaqi Li (jiaqili@capilanou.ca)  
Author: Eric Drechsler (dr.eric.drechsler@gmail.com)  
Version: 250301

# Environment Setup and Module Imports

This cell sets up the Python environment and imports the modules and libraries needed for the time series analysis.

First, it imports standard Python libraries:
- `logging`: Used for logging information, warnings, and errors during the execution of the script. This is crucial for debugging and monitoring the process.
- `os`: Provides a way of using operating system dependent functionality, such as interacting with the file system.
- `typing`: Used for type hinting, which improves code readability and helps catch type-related errors early on. Specifically, it imports `Any`, `Dict`, `List`, `Optional`, and `Tuple` for defining the types of variables and function arguments.

Next, it imports external libraries:
- `hydra`: A powerful configuration management library that allows for structured configuration through YAML files and command-line overrides. This aligns with the README's mention of using Hydra for configuration management.
- `pandas`: A fundamental library for data manipulation and analysis, particularly for working with structured data like CSV files, which are mentioned in the `data_loader.py` description.
- `pyrootutils`: A utility library for managing project structure and paths, as described in the README's project structure section.
- `omegaconf`: The configuration library that Hydra builds on. The cell imports `DictConfig` and `OmegaConf` from it for working with structured configurations.

The code then calls `pyrootutils.setup_root()` to establish the project's root directory, so that the notebook can find files and modules regardless of where it is launched. With the arguments used here, the function searches upward from the current working directory for a `.project_root` indicator file, sets the `PROJECT_ROOT` environment variable, loads variables from a `.env` file if one exists, adds the project root to the Python path, and changes the working directory to the project root.
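
To make the root lookup concrete, the following is a minimal standard-library sketch of the idea behind an indicator-file search: walk upward from a starting directory until a directory containing `.project_root` is found. The helper `find_project_root` is hypothetical and written only for illustration. It is not how `pyrootutils` is implemented, and it leaves out the environment variable, `.env`, and Python path handling.

```python
from pathlib import Path
import tempfile


def find_project_root(start: Path, indicator: str = ".project_root") -> Path:
    """Return the nearest ancestor of `start` (inclusive) containing `indicator`.

    Hypothetical helper for illustration only; not part of pyrootutils.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / indicator).exists():
            return candidate
    raise FileNotFoundError(f"No '{indicator}' found at or above {start}")


# Build a throwaway project tree to demonstrate the lookup.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve() / "project"
    nested = project / "notebooks" / "drafts"
    nested.mkdir(parents=True)
    (project / ".project_root").touch()

    # Searching from a nested folder should land on the project directory.
    found_root = find_project_root(nested)
    found_from_nested = found_root == project
    print(found_from_nested)
```

Because the search starts from a nested directory and still resolves to the same root, a notebook stored in a subfolder can locate project files without hard-coded relative paths.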

Following this, the cell imports two functions from the `capu_time_series_analysis` package:
- From the `utils` module: `process_timeseries`, the main entry point for processing the time series data. It returns the consolidated dataframe, the evaluation results, and the residual diagnostics used later in this notebook.
- From `visualization.py`: `log_results_to_wandb`, which handles logging results and visualizations to Weights & Biases, as mentioned in the "Enhanced Visualization" section of the README.

The lower-level functions outlined in the "Modules Explained" section of the README are not imported directly in this cell: `add_to_consolidated_df`, `load_data`, and `prepare_time_series` from `data_loader.py`, `analyze_residuals` and `evaluate_forecasts` from `evaluation.py`, and `fit_models` from `models.py`.

Finally, the cell configures the logging system. It sets the logging level to `INFO`, specifies the format of the log messages (including timestamp, logger name, level, and message), and sets the date format. It then gets the root logger, which will be used to record events during the script's execution.
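
As a rough illustration of what this format produces, the sketch below applies the same format and date format strings to a handler that writes into an in-memory buffer. The logger name `capu_demo` and the messages are placeholders chosen for this example.

```python
import io
import logging

LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"

# Write to an in-memory buffer so the formatted record can be inspected.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter(fmt=LOG_FORMAT, datefmt=DATE_FORMAT))

demo_logger = logging.getLogger("capu_demo")  # placeholder name
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(handler)
demo_logger.propagate = False  # keep the demo output away from the root logger

demo_logger.debug("hidden: below the INFO threshold")
demo_logger.info("Loaded enrollment data")

formatted_line = buffer.getvalue().strip()
print(formatted_line)
```

The printed line has the form `<timestamp> - capu_demo - INFO - Loaded enrollment data`. The debug message does not appear because it falls below the `INFO` level.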

In [None]:
# This cell sets up the environment, imports necessary modules, and configures
# logging

import logging
import os
from typing import Any, Dict, List, Optional, Tuple

import hydra
import pandas as pd
import pyrootutils
from omegaconf import DictConfig, OmegaConf

ROOT_PATH = pyrootutils.setup_root(
    search_from=os.getcwd(),
    indicator=".project_root",
    project_root_env_var=True,
    dotenv=True,
    pythonpath=True,
    cwd=True,
)

# This is the main function for processing time series data
from capu_time_series_analysis.utils import process_timeseries
from capu_time_series_analysis.visualization import log_results_to_wandb

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger = logging.getLogger()

## Main Execution
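
The configuration in the next cell sets `seasonal_periods` to 3 and `forecast_steps` to 9. For intuition about how those two values interact, here is a minimal sketch of a seasonal naive forecast, which repeats the last observed season over the forecast horizon. The function `seasonal_naive_forecast` and the sample values are made up for illustration. The project's actual models are applied through `process_timeseries` and may differ.

```python
from typing import List, Sequence


def seasonal_naive_forecast(
    history: Sequence[float], seasonal_periods: int, steps: int
) -> List[float]:
    """Repeat the last full season of `history` for `steps` future periods.

    Illustrative helper only; not the project's implementation.
    """
    if seasonal_periods <= 0:
        raise ValueError("seasonal_periods must be positive")
    if len(history) < seasonal_periods:
        raise ValueError("need at least one full season of history")
    last_season = list(history[-seasonal_periods:])
    # Step h (0-based) reuses the value from the same position in the last season.
    return [last_season[h % seasonal_periods] for h in range(steps)]


# Made-up values: two years of history with three periods per year.
sample_history = [100.0, 60.0, 120.0, 105.0, 62.0, 125.0]
forecast = seasonal_naive_forecast(sample_history, seasonal_periods=3, steps=9)
print(forecast)
```

With three periods per season, nine steps cover three full seasonal cycles, so the three values of the last observed season each appear three times in the forecast.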

In [None]:
# Configuration
cfg = {
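    # Resolve paths against ROOT_PATH: the ${hydra:runtime.cwd} interpolation is
    # only resolved inside a Hydra app, not in a plain Python dict.
    "input_file": str(
        ROOT_PATH / "data/input/capu_enrolment_data_200820-202310_v20230213.csv"
    ),
    "plot_dir": str(ROOT_PATH / "output/plots"),  # TODO exclude plot storage in ipynb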
    "levels": ["CapU", "AS"],
    # "levels": ["CapU", "AS", "BPS", "EHHD", "FAA", "GCS"],
    "residencies": ["Domestic", "International"],
    "metrics": ["Headcount"],
    # "metrics": ["Headcount", "CourseEnrolment", "AttemptedCredits"],
    "forecast_steps": 9,
    "models": {
        "seasonal_naive_params": {"seasonal_periods": 3},
        "ets_params": {
            "seasonal": "add",
            "damped_trend": False,
        },
        "arima_params": {
            "seasonal": True,
            "m": 3,
            "d": 0,
            "D": 0,
        },
    },
}

In [None]:
# Process time series
consolidated_df, evaluation_results, residual_diagnostics = process_timeseries(
    cfg["input_file"],
    cfg["metrics"],
    cfg["residencies"],
    cfg["levels"],
    forecast_steps=cfg["forecast_steps"],
    model_params=cfg["models"],
)

# Create evaluation dataframe
evaluation_df = pd.DataFrame(evaluation_results)
logger.info("Created evaluation dataframe with %d entries", len(evaluation_df))

# Create residual diagnostics dataframe
residual_df = pd.DataFrame(residual_diagnostics)
logger.info("Created residual diagnostics dataframe with %d entries", len(residual_df))

In [None]:
# Log results to wandb
logger.info("Logging results to Weights & Biases")
log_results_to_wandb(
    consolidated_df, evaluation_df, residual_df, cfg["metrics"], cfg["levels"], cfg["residencies"]
)