# 1. Time Series

In previous chapters, we worked with datasets where samples were assumed to be independent. In contrast, time series data consists of observations ordered in time, where each data point can be related to its neighbors. This temporal dependency introduces specific challenges for modeling and evaluation.

Most time series models assume that the data exhibits some form of regularity, meaning patterns repeat over time. A single signal may contain multiple overlapping patterns. For example, the electricity consumption of a household depends on:

- the time of day,
- the day of the week (e.g., weekdays vs. weekends),
- holidays or vacation periods,
- and other seasonal or behavioral factors.

In such cases, a model attempts to learn the contribution of each pattern in order to forecast future values. More complex models, especially deep-learning-based ones, aim to model time series with minimal assumptions. However, accurately forecasting irregular or chaotic time series, such as stock prices, remains extremely difficult.

This notebook introduces the core concepts of time series modeling. We begin with a basic trend and seasonality decomposition and proceed to build predictive models using `scikit-learn`, `statsmodels`, and `sktime`, a unified framework for time series modeling in Python.

## 1.1 Seasonality and Trend

Many time series contain recurring patterns and long-term changes. Two key components of such series are:

- **Trend**: a long-term increase or decrease in the data,
- **Seasonality**: a periodic fluctuation occurring at regular intervals (e.g., daily, weekly, monthly).

To illustrate these concepts, we begin by analyzing a dataset on atmospheric CO₂ concentrations. This data is available in the `statsmodels` package.


In [None]:
from statsmodels.datasets import co2
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset
data = co2.load().data

# Drop missing values for cleaner visualization
data.dropna(inplace=True)

# Plot the time series
sns.set(style="whitegrid")
plt.figure(figsize=(12, 4))
sns.lineplot(x=data.index, y=data["co2"])
plt.title(r"Atmospheric $\mathregular{CO}_2$ Concentration Over Time")  # raw string avoids an invalid "\m" escape
plt.xlabel("Date")
plt.ylabel(r"$\mathregular{CO}_2$ (ppm)")
plt.tight_layout()
plt.show()

We observe that the CO₂ time series exhibits both a long-term upward trend and short-term oscillations. This suggests that the signal can be decomposed into three components:

- $T(t)$: the trend component (long-term progression),
- $S(t)$: the seasonal component (repeating patterns),
- $e(t)$: the residual or error component (random noise).

In the **additive model**, the observed data is expressed as:

$$
D(t) = T(t) + S(t) + e(t)
$$

In some cases, especially when seasonal variations grow with the trend, a **multiplicative model** is more appropriate:

$$
D(t) = T(t) \cdot S(t) \cdot e(t)
$$
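
The relationship between the two models can be checked numerically. The sketch below builds a small synthetic series (all component values are made up for illustration and are not taken from the CO₂ data) and shows that taking logarithms turns a multiplicative model into an additive one in the log domain. This is one common reason to log-transform a series before applying additive methods.

```python
import numpy as np

# Synthetic components (illustrative values only)
rng = np.random.default_rng(0)
t = np.arange(48)
trend = 100.0 + 2.0 * t                               # T(t): linear growth
seasonal = 1.0 + 0.2 * np.sin(2 * np.pi * t / 12)     # S(t): cycle fluctuating around 1
noise = np.exp(rng.normal(scale=0.01, size=t.size))   # e(t): positive noise around 1

# Multiplicative model: D(t) = T(t) * S(t) * e(t)
d_mult = trend * seasonal * noise

# log D(t) = log T(t) + log S(t) + log e(t), i.e. additive in the log domain
log_sum = np.log(trend) + np.log(seasonal) + np.log(noise)
print(np.allclose(np.log(d_mult), log_sum))
```

Note that the multiplicative form requires strictly positive components, otherwise the logarithm is undefined.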

We now prepare the data for modeling by extracting time-related features.

In [None]:
# Check the structure of the dataset
data.head(3)

The date is currently stored as the index of the DataFrame. To extract time-based features, we first convert the index into a regular column. We then extract the year and month from each date to use them as model features.

In [None]:
# Convert the datetime index into a regular column
data.reset_index(inplace=True)  # The unnamed datetime index becomes a column called 'index'
data.rename(columns={"index": "date"}, inplace=True)

# Extract year and month as separate columns
data["year"] = data["date"].dt.year
data["month"] = data["date"].dt.month

# Check the result
data.head(3)

Missing values were already dropped when the data was loaded, so the data is ready for modeling. We fit a simple linear regression model using `year` and `month` as input features to predict the CO₂ concentration.

In [None]:
from sklearn.linear_model import LinearRegression

# Define and train the linear regression model
model = LinearRegression()
model.fit(X=data[["year", "month"]], y=data["co2"])

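We now use the trained model to make predictions over the entire dataset and compare the predicted trend with the actual CO₂ values. This model captures the overall trend but fails to represent the monthly oscillations accurately: a linear term in `month` can only rise or fall steadily within a year, so it cannot describe a curve that goes up and comes back down.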

In [None]:
# Make predictions
y_pred = model.predict(data[["year", "month"]])

# Plot actual vs. predicted values
plt.figure(figsize=(12, 4))
sns.lineplot(x=data["date"], y=data["co2"], label=r"Actual $\mathregular{CO}_2$")
sns.lineplot(x=data["date"], y=y_pred, label=r"Predicted $\mathregular{CO}_2$ (Trend Only)")
plt.title("Trend Prediction with Linear Regression")
plt.xlabel("Date")
plt.ylabel(r"$\mathregular{CO}_2$ (ppm)")
plt.legend()
plt.tight_layout()
plt.show()

To capture the monthly seasonality, we can transform the `month` feature using sine and cosine functions. This is useful because months are cyclical: January is as close to December as it is to February. Sine and cosine transformations help represent this circular nature.

We define transformers that map each month to a point on the unit circle. This allows the model to learn smooth, periodic patterns.


<div style="display: flex; justify-content: center;">
    <img src="../illustrations/month_unit_circle.png" width="400px" />
</div>
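
To make the circular argument concrete, the following sketch compares distances between months before and after the mapping. The helper name `month_to_circle` is hypothetical and only used for this illustration.

```python
import numpy as np


def month_to_circle(month, period=12):
    """Map a month number to (sin, cos) coordinates on the unit circle."""
    angle = 2 * np.pi * month / period
    return np.array([np.sin(angle), np.cos(angle)])


jan, feb, dec = (month_to_circle(m) for m in (1, 2, 12))

# On the raw month scale, December looks far away from January
print(abs(1 - 2), abs(1 - 12))

# On the unit circle, both neighbors of January are equally close
dist_jan_feb = np.linalg.norm(jan - feb)
dist_jan_dec = np.linalg.norm(jan - dec)
print(round(dist_jan_feb, 4), round(dist_jan_dec, 4))
```

Both distances equal the chord length between two adjacent months, whereas the raw month numbers differ by 1 and 11.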

In [None]:
from sklearn.preprocessing import FunctionTransformer


# Define cyclic feature transformations
def sin_transformer(period):
    return FunctionTransformer(lambda x: np.sin(x / period * 2 * np.pi), validate=True)


def cos_transformer(period):
    return FunctionTransformer(lambda x: np.cos(x / period * 2 * np.pi), validate=True)


# Apply sine and cosine transformations to the month column
data["month_sin"] = sin_transformer(12).fit_transform(data[["month"]])
data["month_cos"] = cos_transformer(12).fit_transform(data[["month"]])

# Check the result
data.head(3)

In [None]:
# Re-train the model with cyclical features
model = LinearRegression()
model.fit(X=data[["year", "month", "month_sin", "month_cos"]], y=data["co2"])

# Make predictions
y_pred = model.predict(data[["year", "month", "month_sin", "month_cos"]])

# Plot actual vs. predicted values
plt.figure(figsize=(12, 4))
sns.lineplot(x=data["date"], y=data["co2"], label=r"Actual $\mathregular{CO}_2$")
sns.lineplot(x=data["date"], y=y_pred, label=r"Predicted $\mathregular{CO}_2$ (Trend + Seasonality)")
plt.title(r"Improved $\mathregular{CO}_2$ Prediction with Seasonality Features")
plt.xlabel("Date")
plt.ylabel(r"$\mathregular{CO}_2$ (ppm)")
plt.legend()
plt.tight_layout()
plt.show()
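
One way to see why a single sine and cosine pair is enough for a smooth yearly cycle: a weighted sum `a * sin(θ) + b * cos(θ)` is itself a sine wave with amplitude `sqrt(a² + b²)` and phase `atan2(b, a)`, so a linear model can place the seasonal peak in any month. The sketch below checks this on a synthetic target; the amplitude of 3.0 and phase of 0.7 are arbitrary illustrative values, and plain least squares stands in for `LinearRegression`.

```python
import numpy as np

months = np.arange(1, 13)
theta = 2 * np.pi * months / 12

# Synthetic seasonal signal with an arbitrary amplitude and phase shift
target = 3.0 * np.sin(theta + 0.7)

# Least-squares fit on the two cyclical features
X = np.column_stack([np.sin(theta), np.cos(theta)])
(a, b), *_ = np.linalg.lstsq(X, target, rcond=None)

# Recover amplitude and phase from the two coefficients
amplitude = np.hypot(a, b)
phase = np.arctan2(b, a)
print(round(amplitude, 3), round(phase, 3))
```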

Instead of manually engineering features to capture trend and seasonality, we can use automated decomposition methods. The `seasonal_decompose` function from `statsmodels` separates the time series into its components based on a given periodicity.

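In our case, the CO₂ data has **weekly** resolution and the seasonal cycle is annual. We therefore specify a seasonal period of 52 observations, which corresponds to roughly one year. Because rows with missing values are dropped, the spacing between observations is not perfectly regular, so this period is an approximation.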

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

# Reload and clean the original dataset
data = co2.load().data
data.dropna(inplace=True)

# Perform an additive decomposition with a yearly period (52 weekly observations)
result_add = seasonal_decompose(data["co2"], model="additive", period=52)

# Plot the decomposed components
result_add.plot()
plt.suptitle(r"Additive Seasonal Decomposition of $\mathregular{CO}_2$ Time Series", y=1.02)
plt.tight_layout()
plt.show()
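
To build intuition for what such a decomposition does, here is a simplified sketch of the classical moving-average idea on a noise-free synthetic series. It is not the `statsmodels` implementation, which differs in details; an odd period of 7 is assumed here so that a plain centered window can be used, and all numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

period = 7
n = 10 * period
t = np.arange(n)

# Synthetic additive series: linear trend plus a repeating pattern that sums to zero
true_trend = 0.5 * t + 10.0
pattern = np.array([3.0, 1.0, -2.0, -4.0, 0.0, 1.0, 1.0])
series = pd.Series(true_trend + np.tile(pattern, n // period))

# 1. Trend: centered moving average over one full period (NaN at both edges)
trend_est = series.rolling(window=period, center=True).mean()

# 2. Seasonality: average the detrended values for each position in the cycle
detrended = series - trend_est
seasonal_means = detrended.groupby(t % period).mean()
seasonal_means -= seasonal_means.mean()          # center the pattern around zero
seasonal_est = seasonal_means.to_numpy()[t % period]

# 3. Residual: whatever is left after removing trend and seasonality
resid_est = series - trend_est - seasonal_est
print(seasonal_means.round(2).tolist())
```

Averaging over a full period cancels the seasonal pattern, which is why the moving average isolates the trend.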

## 1.3 Working with `sktime`

While we can implement time series models manually, libraries like `sktime` provide a unified framework for time series analysis. It supports a wide range of models and integrates well with `scikit-learn`.

In this section, we explore a few key features of `sktime`, starting with data import and formatting.

In [None]:
from sktime.datasets import load_airline
import numpy as np
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

# Load example dataset: number of airline passengers per month
y = load_airline()
y.head()

The `airline` dataset is a univariate time series represented as a `pandas.Series`. Each entry corresponds to the number of airline passengers in a given month. Unlike standard series, its index is a `PeriodIndex`, which explicitly encodes time intervals.

This special index allows `pandas` and `sktime` to handle regular time intervals (e.g., monthly, weekly) with built-in functionality.

In [None]:
# Check the index type
y.index

A `PeriodIndex` allows for efficient handling of fixed-interval time data. We can also create such indices manually using `pandas`. The `period_range` function generates a sequence of periods between a start and end date, with a specified frequency.

Below is an example of a monthly period range.

In [None]:
import pandas as pd

# Create a custom range of monthly periods
pd.period_range(start="2020-05-01", end="2024-08-01", freq="M")
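
A short sketch of what the resulting `PeriodIndex` offers, using the same range as above: each element represents a whole month rather than a single instant, supports simple arithmetic, and can be converted to timestamps when a library expects a `DatetimeIndex`.

```python
import pandas as pd

periods = pd.period_range(start="2020-05-01", end="2024-08-01", freq="M")

# Number of monthly periods from May 2020 through August 2024
print(len(periods))

# Each period spans an interval with a start and an end
first = periods[0]
print(first, first.start_time.date(), first.end_time.date())

# Adding an integer moves forward by that many periods
print(first + 1)

# Convert to timestamps located at the start of each period
timestamps = periods.to_timestamp()
print(type(timestamps).__name__)
```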

We can visualize the time series using either `matplotlib` or a helper function from `sktime`. The built-in `plot_series` function automatically formats time labels and supports multiple series in one plot.

In [None]:
import matplotlib.pyplot as plt

# Simple plot using matplotlib
fig, ax = plt.subplots(figsize=(10, 4))

ax.plot(y.values)
ax.set_title("Monthly Airline Passengers")
ax.set_xlabel("Time Index")
ax.set_ylabel("Passengers")

plt.tight_layout()
plt.show()

In [None]:
from sktime.utils.plotting import plot_series

# More informative plot using sktime's utility function
plot_series(y, labels=["Number of Airline Passengers"]);

## 1.5 Forecasting

Forecasting involves predicting future values of a time series based on past observations. Before building a model, it's helpful to visually inspect the data for patterns.

The airline passenger data exhibits two key patterns:

- A **seasonal pattern** that repeats each year (monthly peaks and troughs),
- A **positive trend**, with a steady increase over time.

We begin with a simple baseline forecaster that uses the most recent values to predict future ones. This gives us a point of comparison for more sophisticated models.

In [None]:
from sktime.forecasting.naive import NaiveForecaster

# Initialize a naive forecaster that repeats the last observed season
forecaster = NaiveForecaster(strategy="last", sp=12)  # sp = seasonal period

We fit the model to the entire time series and forecast the next 36 months. Combined with `sp=12`, the `"last"` strategy repeats the last observed season, assuming it will continue identically into the future.

The seasonal period `sp=12` tells the model to treat the data as repeating every 12 time steps, which corresponds to 12 months in this case.

In [None]:
# Fit the forecaster on the complete series
forecaster.fit(y)

# Forecast the next 36 months
fh = np.arange(1, 37)  # Forecast horizon
y_pred = forecaster.predict(fh)

# Plot the forecast alongside the original data
plot_series(y, y_pred, labels=["Observed", "Forecast (Naive - Last Season)"]);
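
The logic behind this forecast is simple enough to write by hand. The following is a minimal sketch of the idea, not the `sktime` implementation; the function name `seasonal_naive_last` and the toy data are assumptions made for illustration.

```python
import numpy as np


def seasonal_naive_last(history, sp, horizon):
    """Forecast by cycling through the last `sp` observed values."""
    history = np.asarray(history, dtype=float)
    last_season = history[-sp:]
    # Step h of the forecast reuses the value from the same position in the last season
    positions = np.arange(horizon) % sp
    return last_season[positions]


# Toy series: two "years" of monthly values 1..24
toy = np.arange(1, 25)
forecast_last = seasonal_naive_last(toy, sp=12, horizon=15)
print(forecast_last)
```

The first 12 forecast values reproduce the last year (13 to 24), and the forecast then starts over at 13. Any trend in the data is ignored.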

Instead of repeating only the most recent season, we can also average over all previous seasons. The `"mean"` strategy uses the mean of all previous values for the same season (e.g., all Januaries, all Februaries, etc.).

This method often performs worse when earlier data has a different level or trend, which is the case here: averaging over all years pulls the forecast below the most recent values.

In [None]:
# Use the mean of all previous seasons for forecasting
forecaster = NaiveForecaster(strategy="mean", sp=12)
forecaster.fit(y)

# Forecast the same 36-month horizon
y_pred = forecaster.predict(fh)

# Plot comparison
plot_series(y, y_pred, labels=["Observed", "Forecast (Naive - Mean)"]);
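
The effect of averaging on trending data can be reproduced with a hand-written sketch. As before, this only illustrates the idea and is not the `sktime` implementation; `seasonal_naive_mean` and the toy series are hypothetical.

```python
import numpy as np


def seasonal_naive_mean(history, sp, horizon):
    """Forecast each season position with the mean of all past values at that position."""
    history = np.asarray(history, dtype=float)
    n = len(history)
    positions = np.arange(n) % sp
    means = np.array([history[positions == p].mean() for p in range(sp)])
    # Continue the position counter beyond the end of the history
    future_positions = (n + np.arange(horizon)) % sp
    return means[future_positions]


# Toy series with an upward trend: two "years" of monthly values 1..24
toy = np.arange(1, 25)
forecast_mean = seasonal_naive_mean(toy, sp=12, horizon=12)
print(forecast_mean)
```

For this toy series the forecast runs from 7 to 18, whereas the most recent year ran from 13 to 24. The older, lower values drag the averages down, which is the same effect visible in the airline plot.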

### 1.5.1 Forecasting with ETS

Naive models provide a useful baseline, but more sophisticated methods are available. One popular class of models is **ETS** (Error, Trend, Seasonality), which explicitly models each component.

`sktime` provides access to ETS models from the `statsmodels` package via a unified interface.

In [None]:
from sktime.forecasting.ets import AutoETS

# Initialize the model with automatic configuration and using all available CPU cores
forecaster = AutoETS(auto=True, sp=12, n_jobs=-1)

# Fit the model to the data
forecaster.fit(y)

# Forecast 36 months ahead
y_pred = forecaster.predict(fh)

# Plot results
plot_series(y, y_pred, labels=["Observed", "Forecast (ETS)"]);
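
ETS models belong to the exponential smoothing family. To give a feel for the mechanism, the sketch below implements the simplest member by hand: a single smoothed level with no trend and no seasonal component. It is an illustration under stated assumptions, not what `AutoETS` does internally; the function name, the initialization with the first observation, and the hand-picked `alpha` are choices made for this example, whereas `AutoETS` estimates its parameters from the data.

```python
import numpy as np


def simple_exponential_smoothing(values, alpha, horizon):
    """Smooth a series into a single level and forecast that level as a flat line."""
    values = np.asarray(values, dtype=float)
    level = values[0]  # assumption: initialize the level with the first observation
    for observation in values[1:]:
        # New level = weighted average of the new observation and the previous level
        level = alpha * observation + (1 - alpha) * level
    return np.full(horizon, level)


forecast_ses = simple_exponential_smoothing([10, 20, 30], alpha=0.5, horizon=3)
print(forecast_ses)
```

A larger `alpha` makes the level react faster to recent observations, while a smaller one averages over a longer history. Full ETS models add analogous update rules for the trend and seasonal components.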

## 1.6 Evaluating Forecasting Models

Until now, we evaluated forecasts visually. However, to objectively assess model performance, we need a proper evaluation setup.

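For time series, we cannot use random train-test splits, because shuffling would let the model train on observations that lie after the ones it is tested on. Instead, we split the series chronologically, training on earlier data and testing on later values.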

`sktime` provides a helper function to split data along the time axis.

In [None]:
from sktime.split import temporal_train_test_split

# Split the time series: last 36 months reserved for testing
y_train, y_test = temporal_train_test_split(y, test_size=36)

# Visualize the split
plot_series(y_train, y_test, labels=["Training Set", "Test Set"]);
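
For a single series, a chronological split amounts to slicing off the tail. The sketch below reproduces the idea with plain `pandas` on a synthetic stand-in series (144 monthly values, invented for illustration), which can help when the helper function is not available.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 144 monthly observations with a PeriodIndex
index = pd.period_range(start="1949-01", periods=144, freq="M")
series = pd.Series(np.arange(144, dtype=float), index=index)

# Keep the last 36 observations for testing, everything before for training
test_size = 36
train, test = series.iloc[:-test_size], series.iloc[-test_size:]

print(len(train), len(test))
print(train.index.max(), test.index.min())
```

Every training period precedes every test period, so no future information reaches the model during fitting.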

We now train the `AutoETS` model on the training portion of the data and forecast the values in the test set. This lets us compare the model's predictions to actual observations and compute an error metric.

In [None]:
# Re-initialize and fit the forecaster on training data only
forecaster = AutoETS(auto=True, sp=12, n_jobs=-1)
forecaster.fit(y_train)

# Predict for the time points in the test set
y_pred = forecaster.predict(fh=y_test.index)

# Plot actual vs. predicted values
plot_series(y_train, y_test, y_pred, labels=["Training", "Test", "Forecast"]);

To quantify the accuracy of our forecast, we use the **Mean Absolute Percentage Error (MAPE)**. This metric expresses prediction errors as a percentage of the true values, making it easy to interpret.

Lower values indicate better performance.

In [None]:
from sktime.performance_metrics.forecasting import mean_absolute_percentage_error

# Evaluate the forecast (symmetric=False explicitly selects the standard, non-symmetric MAPE)
mape = mean_absolute_percentage_error(y_true=y_test, y_pred=y_pred, symmetric=False)
print(f"Mean Absolute Percentage Error (MAPE): {mape:.2%}")
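
The metric itself is the mean of the absolute errors divided by the absolute true values. A minimal hand-written version, with the hypothetical name `mape_manual` and made-up numbers, looks as follows.

```python
import numpy as np


def mape_manual(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.1 means 10 %)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))


# Both predictions are off by 10 % of the true value
example_mape = mape_manual([100, 200], [110, 180])
print(f"{example_mape:.2%}")
```

Because each error is divided by the true value, the metric is undefined when a true value is zero and becomes unstable when true values are close to zero.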

## 1.7 Forecasting with Prophet

[`Prophet`](https://facebook.github.io/prophet/) is a forecasting model developed by Facebook/Meta. It fits a model with components for trend, seasonality, and holiday effects, and is particularly effective when:

- the time series has strong seasonal patterns,
- multiple years of historical data are available,
- interpretability is important.

Prophet is designed to be robust to missing values and outliers, and it can model holiday effects when a holiday calendar is provided, making it a popular choice for business time series forecasting.

By default, Prophet uses an additive model. However, when the magnitude of seasonal fluctuations increases with the level of the trend, a multiplicative seasonality model is more appropriate. This is the case for the airline passenger data, so we set `seasonality_mode="multiplicative"` when initializing the model.

Note: Prophet is available through the `sktime` interface as shown below. The underlying `prophet` package must be installed separately.

In [None]:
from sktime.forecasting.fbprophet import Prophet

# Initialize the Prophet model
forecaster = Prophet(seasonality_mode="multiplicative")  # or "additive"

# Fit on training data
forecaster.fit(y_train)

# Predict on test dates
y_pred = forecaster.predict(fh=y_test.index)

# Plot forecast
plot_series(y_train, y_test, y_pred, labels=["Training", "Test", "Forecast (Prophet)"]);

## 1.8 Exercise

In this exercise, you will apply what you have learned to a new dataset containing daily climate data from Delhi, India. The goal is to forecast the mean daily temperature.

The dataset is split into a training and a test file.

### Tasks:

1. Load the file `DailyDelhiClimateTrain.csv` from the `datasets` folder.
2. Use `pd.to_datetime` to convert the `date` column to datetime format, and set it as the index.
3. Keep only the `meantemp` column.
4. Train a `Prophet` model on the training data.
5. Use the model to forecast 1000 future time points.
6. Plot the forecasted values.
7. Load the file `DailyDelhiClimateTest.csv` and format it in the same way.
8. Use the model to forecast the dates in the test dataset.
9. Plot the predictions alongside the true test data.