# Module 1.8: Timeseries Diagnostics

> **Goal:** Explore characteristics of the M5 dataset using tsfeatures + tsforge.

This module teaches you to:
1. Load the cleaned weekly data.
2. Compute diagnostics at the most granular `unique_id` level.
3. Motivate the focus on the "Lie Detector Six" metric set.
    * The aim is to get a feel for the forecastability, quality, and characteristics of the data BEFORE we start forecasting.


## 1. Setup

In [56]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from itertools import combinations
from pathlib import Path
from tsforge import load_m5
import tsforge as tsf
import seaborn as sns

# Configuration
import warnings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

Notebook agenda 

* Reconfirm the unique_id definition
* Compute tsfeatures + tsforge diagnostics per unique_id
* Highlight the “Lie Detector Six”

In [57]:
# read in data 
weekly_df = pd.read_parquet(
    "/Users/jackrodenberg/Desktop/real-world-forecasting-foundations/modules/output/m5_weekly_clean.parquet",
)

In [58]:
weekly_df.head()

Unnamed: 0,unique_id,ds,y
0,FOODS_1_001_CA_1,2011-01-29,3.0
1,FOODS_1_001_CA_1,2011-02-05,9.0
2,FOODS_1_001_CA_1,2011-02-12,7.0
3,FOODS_1_001_CA_1,2011-02-19,8.0
4,FOODS_1_001_CA_1,2011-02-26,14.0


In [59]:
from tsforge.eda.ts_features_extension import permutation_entropy,MI_top_k_lags,ADI
from tsfeatures import tsfeatures,lumpiness,stl_features,statistics
# Using Nixtla's tsfeatures
id_lvl_feats = tsfeatures(
    ts=weekly_df,
    # The data is weekly, so the seasonal period is 52
    freq=52,
    # Compute the "Lie Detector Six" (plus summary statistics)
    features=[
        statistics,
        lumpiness,  # variance of variances
        permutation_entropy,  # permutation entropy
        MI_top_k_lags,  # sum of MI over the top 5 lags
        stl_features,  # STL decomposition features (trend, seasonal strength)
        ADI,  # average demand interval
        # pacf_features,
    ],
    # Keep scale=False so the statistics are reported in the original units;
    # with scale=True each series is mean/std scaled before features are computed.
    scale=False,
)

* Taking a closer look at the table, we focus on the "Lie Detector Six":

    - Lumpiness: variance of variances
    - Entropy (permutation entropy)
    - Seasonal strength
    - Trend strength
    - MI Top K Lags: mutual information of the top K lags (K = 5)
        - that is, the sum of the mutual information of the 5 most informative lags among lags 1 to freq
    - ADI: average demand interval (time between demands)
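
To make two of these metrics concrete, here is a minimal sketch of ADI and lumpiness on toy data. The function names `adi_sketch` and `lumpiness_sketch` are hypothetical, and the definitions are assumptions (ADI as periods per non-zero demand period, lumpiness as the variance of the variances of non-overlapping windows); the tsforge and tsfeatures implementations may differ in details.

```python
import numpy as np


def adi_sketch(y):
    """Average demand interval: number of periods per non-zero demand period (assumed definition)."""
    y = np.asarray(y, dtype=float)
    n_demand = np.count_nonzero(y)
    return np.inf if n_demand == 0 else len(y) / n_demand


def lumpiness_sketch(y, width):
    """Variance of the variances of non-overlapping windows of length `width`."""
    y = np.asarray(y, dtype=float)
    n_windows = len(y) // width
    if n_windows < 2:
        return 0.0
    # Drop any trailing partial window, then compute one variance per window
    windows = y[: n_windows * width].reshape(n_windows, width)
    return float(np.var(windows.var(axis=1, ddof=1), ddof=1))


steady = np.array([5, 6, 5, 6, 5, 6, 5, 6], dtype=float)  # demand in every period
bursty = np.array([0, 0, 9, 0, 0, 0, 40, 0], dtype=float)  # sparse, uneven demand

print("ADI:", adi_sketch(steady), adi_sketch(bursty))
print("Lumpiness:", lumpiness_sketch(steady, width=4), lumpiness_sketch(bursty, width=4))
```

The bursty series scores higher on both metrics: demand arrives less often (higher ADI) and the variance changes sharply from one window to the next (higher lumpiness).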

In [60]:
id_lvl_feats[["unique_id","lumpiness", "permutation_entropy", "seasonal_strength", "trend", "MI_top_k_lags", "adi"]].head()

Unnamed: 0,unique_id,lumpiness,permutation_entropy,seasonal_strength,trend,MI_top_k_lags,adi
0,FOODS_1_001_CA_1,87.235596,0.969347,0.376623,0.20445,0.270401,1.105469
1,FOODS_1_001_CA_2,230.382385,0.981118,0.439298,0.22328,0.153054,1.105469
2,FOODS_1_001_CA_3,116.775986,0.984305,0.384099,0.162804,0.150131,1.118577
3,FOODS_1_001_CA_4,0.956493,0.952965,0.479389,0.110839,0.28407,1.276018
4,FOODS_1_001_TX_1,20.594612,0.962183,0.376637,0.260977,0.168827,1.200855


* Add a few useful descriptors to help us understand the data more intuitively:
    - what share of total demand does each item account for?
    - where does an item rank in terms of total sales?
    - skewness and kurtosis (to understand the shape of the distribution)
        - kurtosis: how heavy are the tails of the distribution? (`scipy.stats.kurtosis` reports excess kurtosis by default, so a normal distribution scores about 0)
        - skewness: how asymmetric is the distribution?
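
Because kurtosis comes in two conventions, a quick check on simulated data can help when choosing a "heavy-tailed" threshold. This is a minimal sketch on synthetic samples (the sample sizes and seed are arbitrary choices, not part of the M5 analysis).

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=100_000)
skewed_sample = rng.exponential(size=100_000)

# Fisher definition (scipy default): a normal distribution has excess kurtosis 0
excess = st.kurtosis(normal_sample)
# Pearson definition: a normal distribution has kurtosis 3
pearson = st.kurtosis(normal_sample, fisher=False)

print(f"normal: excess={excess:.3f}, pearson={pearson:.3f}")
print(
    f"exponential: skew={st.skew(skewed_sample):.3f}, "
    f"excess kurtosis={st.kurtosis(skewed_sample):.3f}"
)
```

The two conventions differ by exactly 3, so a cutoff of 3 on excess kurtosis is considerably stricter than a cutoff of 3 on Pearson kurtosis.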

In [61]:
# add some additional useful descriptors
id_lvl_feats = id_lvl_feats.assign(
    pct_of_demand=id_lvl_feats["total_sum"] / id_lvl_feats["total_sum"].sum(),
)

import scipy.stats as st

# merge with skew, kurtosis of demand!
id_lvl_feats = id_lvl_feats.merge(
    weekly_df.groupby("unique_id").agg(
        skew=("y", "skew"),
        kurtosis=("y", st.kurtosis),
    ),
    on="unique_id",
)


In [62]:
id_lvl_feats.columns

Index(['unique_id', 'adi', 'nperiods', 'seasonal_period', 'trend', 'spike',
       'linearity', 'curvature', 'e_acf1', 'e_acf10', 'seasonal_strength',
       'peak', 'trough', 'MI_top_k_lags', 'permutation_entropy', 'lumpiness',
       'total_sum', 'mean', 'variance', 'median', 'p2point5', 'p5', 'p25',
       'p75', 'p95', 'p97point5', 'max', 'min', 'pct_of_demand', 'skew',
       'kurtosis'],
      dtype='object')

In [63]:
ld_six = [
        "lumpiness",
        "permutation_entropy",
        "seasonal_strength",
        "trend",
        "MI_top_k_lags",
        "adi",
    ]

descriptors = ['unique_id','skew','kurtosis','pct_of_demand']

cols = descriptors + ld_six

In [74]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from joblib import Parallel, delayed


def find_top_mi_lags(y, freq, top_n=3, random_state=42):
    """Compute MI for lags 1-freq, return top N as (lag, mi_score) arrays."""
    y = y.values if isinstance(y, pd.Series) else y
    y = y[~np.isnan(y)]

    if len(y) <= freq:
        raise ValueError(f"Length {len(y)} must be > freq {freq}")

    # Fast lag matrix using stride tricks
    from numpy.lib.stride_tricks import as_strided

    n = len(y) - freq
    X_lags = as_strided(y, shape=(n, freq), strides=(y.itemsize, y.itemsize))[
        :, ::-1
    ]  # Reverse columns for lag order 1,2,3...

    # Compute all MI scores at once
    mi_scores = mutual_info_regression(X_lags, y[freq:], random_state=random_state, n_neighbors=3)

    # Return top N as arrays (avoid DataFrame until final concat)
    top_idx = np.argpartition(mi_scores, -top_n)[-top_n:]
    top_idx = top_idx[np.argsort(mi_scores[top_idx])[::-1]]

    return np.arange(1, freq + 1)[top_idx], mi_scores[top_idx]


def _process_group(y, group_keys, freq, top_n, random_state):
    """Process single series, return dict or None."""
    if len(y) <= freq * 2:
        return None
    try:
        lags, scores = find_top_mi_lags(y, freq, top_n, random_state)
        return {"keys": group_keys, "lags": lags, "scores": scores}
    except Exception:  # skip series where MI estimation fails, without swallowing KeyboardInterrupt
        return None


def find_top_mi_lags_batch(df, value_col, group_cols, freq, top_n=3, random_state=42, n_jobs=-1):
    """
    Find top MI lags for multiple time series with parallel processing.

    Returns DataFrame with group_cols + ['lag', 'mi_score', 'rank'].
    """
    # Extract groups as arrays for faster processing
    grouped = df.groupby(group_cols)[value_col].apply(np.array)

    # Parallel processing
    results = Parallel(n_jobs=n_jobs, backend="loky", batch_size="auto")(
        delayed(_process_group)(y, keys, freq, top_n, random_state) for keys, y in grouped.items()
    )

    # Build final DataFrame efficiently
    results = [r for r in results if r is not None]
    if not results:
        return pd.DataFrame()

    # Flatten results into one row per (series, lag), ranked by MI score
    data = []
    for r in results:
        keys = r["keys"] if isinstance(r["keys"], tuple) else (r["keys"],)
        for i, (lag, score) in enumerate(zip(r["lags"], r["scores"]), 1):
            data.append((*keys, lag, score, i))

    cols = group_cols + ["lag", "mi_score", "rank"]
    return pd.DataFrame(data, columns=cols)


In [75]:
most_common_lags = find_top_mi_lags_batch(weekly_df, value_col='y', group_cols=['unique_id'], freq=52, top_n=3, random_state=42)


In [None]:
most_common_lags['lag'].value_counts()[:10] # no surprise top predictive lags are 1-9, and lag 52

lag
1     14244
2     10292
3      6643
4      4681
5      2655
6      1875
7      1475
8      1463
9      1284
52     1224
Name: count, dtype: int64
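
The counts above depend on the lag matrix lining up correctly with the target. As a sanity check, here is a minimal sketch that builds the same kind of lag matrix with NumPy's `sliding_window_view`, a bounds-checked alternative to `as_strided`, on a toy series where `y[t] = t` (the toy series and the small `freq` are illustrative assumptions).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

y = np.arange(10, dtype=float)  # toy series: y[t] = t
freq = 3  # small stand-in for freq=52

# Each window holds `freq` consecutive values; the last window is dropped
# because no target value follows it.
windows = sliding_window_view(y, freq)[:-1]
X_lags = windows[:, ::-1]  # reverse so that column k holds lag k + 1
target = y[freq:]

print(X_lags[:2])
print(target[:2])
```

Since `y[t] = t`, subtracting each column from the target recovers the lag that column represents, which confirms that column 0 is lag 1, column 1 is lag 2, and so on.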

In [94]:
# Simple heuristic to mark a series as short- or long-range dependent based on MI
def count_longrange_dependent(series):
    """Flag a series as long-range dependent if 2+ of its top 3 lags exceed 26 weeks (about 6 months)."""
    return series[series > 26].count() > 1

long_range_df = most_common_lags.assign(long_range = most_common_lags.groupby("unique_id")['lag'].transform(count_longrange_dependent)).query("long_range == True")

dep_mapping = long_range_df[['unique_id','long_range']].drop_duplicates("unique_id")
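
To see the heuristic on a small case, here is a minimal sketch with two hypothetical series. It uses an equivalent formulation (count lags above 26 per series, then require at least 2) rather than the `transform` call above.

```python
import pandas as pd

# Hypothetical top-3 MI lags for two series
toy_lags = pd.DataFrame(
    {
        "unique_id": ["A", "A", "A", "B", "B", "B"],
        "lag": [1, 30, 52, 1, 2, 52],
    }
)

# Count how many of each series' top lags fall beyond 26 weeks
n_long = (toy_lags["lag"] > 26).groupby(toy_lags["unique_id"]).sum()
long_range = n_long >= 2

print(long_range)
```

Series A has two lags beyond 26 (30 and 52) and is flagged; series B has only one (52) and is not.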

In [96]:
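id_lvl_feats = id_lvl_feats.merge(dep_mapping, on='unique_id', how='left').assign(
    # Series absent from dep_mapping are not long-range dependent; cast back to a clean bool column
    long_range=lambda df: df['long_range'].fillna(False).astype(bool)
)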

In [98]:


for detector in ld_six:
    id_lvl_feats[f"{detector}_flag"] = id_lvl_feats[detector] > id_lvl_feats[detector].quantile(
        0.75
    )

# Build labeled dataset with prominent flags + detector details
flag_cols = [f"{d}_flag" for d in ld_six]

id_lvl_feats_labeled = id_lvl_feats.assign(
    # Prominent characteristic flags
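    intermittent=id_lvl_feats["adi"] >= 1.34,  # close to the commonly cited ADI cutoff of 1.32
    heavy_tailed=id_lvl_feats["kurtosis"].abs() > 3,  # kurtosis here is excess kurtosis (scipy default)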
    non_zero_min=id_lvl_feats["min"] > 0,
    # Detector summary flags
    n_flags=id_lvl_feats[flag_cols].sum(axis=1),
    single_flag=id_lvl_feats[flag_cols].any(axis=1),
    double_flag=lambda df: df["n_flags"] >= 2,
    # Which detectors are flagging
    flagged_detectors=id_lvl_feats[flag_cols].apply(
        lambda row: [ld_six[i] for i, val in enumerate(row) if val], axis=1
    ),
    # Compact string representation
    flag_pattern=lambda df: df["flagged_detectors"].apply(
        lambda x: "|".join([d[:4].upper() for d in x]) if x else "CLEAN"
    ),
)

# Quick summary
print("Characteristic flags:")
print(f"  Intermittent: {id_lvl_feats_labeled['intermittent'].sum()}")
print(f"  Heavy-tailed: {id_lvl_feats_labeled['heavy_tailed'].sum()}")
print(f"  Non-zero min: {id_lvl_feats_labeled['non_zero_min'].sum()}")

print("\nLie detector flags:")
print(f"  Suspect (1+ detectors): {id_lvl_feats_labeled['single_flag'].sum()}")
print(f"  Highly suspect (2+ detectors): {id_lvl_feats_labeled['double_flag'].sum()}")

print("\nMost common flag patterns --> Clean == No Flags:")
print(id_lvl_feats_labeled["flag_pattern"].value_counts().head(10))


Characteristic flags:
  Intermittent: 11924
  Heavy-tailed: 6394
  Non-zero min: 752

Lie detector flags:
  Suspect (1+ detectors): 24698
  Highly suspect (2+ detectors): 14898

Most common flag patterns --> Clean == No Flags:
flag_pattern
CLEAN        5792
MI_T|ADI     3199
PERM         2793
LUMP|PERM    1955
LUMP         1870
MI_T         1735
ADI          1389
SEAS|TREN    1328
SEAS         1197
LUMP|TREN     975
Name: count, dtype: int64
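
The flagging logic is easier to follow on a tiny frame. This is a minimal sketch with made-up values and only two detectors; it mirrors the top-quartile rule and the 4-letter pattern codes used above.

```python
import pandas as pd

# Made-up detector values for four series
toy = pd.DataFrame(
    {
        "lumpiness": [1.0, 2.0, 3.0, 10.0],
        "adi": [1.0, 1.1, 5.0, 1.2],
    }
)
detectors = ["lumpiness", "adi"]

# A series is flagged on a detector when it sits above that detector's 75th percentile
flags = toy[detectors] > toy[detectors].quantile(0.75)

# Join the 4-letter codes of the flagged detectors, or label the series CLEAN
patterns = flags.apply(
    lambda row: "|".join(d[:4].upper() for d in detectors if row[d]) or "CLEAN",
    axis=1,
)

print(patterns.tolist())
```

Because the threshold is a quantile, each detector flags roughly a quarter of the series by construction. The flags are therefore relative to this dataset, not absolute statements about data quality.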


In [109]:
# Create a new flag pattern that calls out SKUs with long-range dependencies
id_lvl_feats_labeled['new_flag_pattern'] = np.where(
    id_lvl_feats_labeled['long_range'] == True,
    id_lvl_feats_labeled['flag_pattern'] + '|' + 'LONG',
    id_lvl_feats_labeled['flag_pattern'],
)


dmd_by_pattern = (
    id_lvl_feats_labeled.groupby("new_flag_pattern")["pct_of_demand"].sum().sort_values(ascending=False)
)


# Answer: how much of our volume falls under each pattern?
for marker in ["LUMP", "SEAS", "TREN", "ADI", "MI_T", "PERM", "LONG"]:
    if marker in ["SEAS", "TREN"]:
        # Exclude high-entropy series from the trend/seasonality totals.
        # Note: this is a substring match on "PERM|<marker>", so a pattern such as
        # "PERM|SEAS|TREN" is not subtracted from the TREN total.
        subtractor = dmd_by_pattern.filter(like="PERM|" + marker).sum()
    else:
        subtractor = 0
    share = (dmd_by_pattern.filter(like=marker).sum() - subtractor) * 100
    print(f"Percentage of Total Volume with High {marker}: {share:.2f}%")
# Roughly 71% of our volume comes from series flagged as lumpy!


Percentage of Total Volume with High LUMP: 71.39%
Percentage of Total Volume with High SEAS: 6.67%
Percentage of Total Volume with High TREN: 19.65%
Percentage of Total Volume with High ADI: 6.14%
Percentage of Total Volume with High MI_T: 10.70%
Percentage of Total Volume with High PERM: 49.96%
Percentage of Total Volume with High LONG: 14.85%
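
The volume shares rely on `Series.filter(like=...)`, which does substring matching on the index labels. Here is a minimal sketch with made-up shares showing what that matching does and does not catch; the token-based variant at the end is an assumed alternative, not the method used above.

```python
import pandas as pd

# Made-up volume shares per flag pattern
toy_share = pd.Series(
    {
        "CLEAN": 0.40,
        "LUMP|TREN": 0.25,
        "PERM|TREN": 0.15,
        "PERM|SEAS|TREN": 0.20,
    }
)

# Substring match: every pattern whose label contains "TREN"
tren_total = toy_share.filter(like="TREN").sum()

# Substring match on "PERM|TREN" misses "PERM|SEAS|TREN"
tren_high_entropy = toy_share.filter(like="PERM|TREN").sum()

# Token-based alternative: patterns that contain both PERM and TREN anywhere
has_both = [{"PERM", "TREN"} <= set(p.split("|")) for p in toy_share.index]
tren_high_entropy_tokens = toy_share[has_both].sum()

print(round(tren_total, 2), round(tren_high_entropy, 2), round(tren_high_entropy_tokens, 2))
```

Whether the gap matters depends on how much volume sits in patterns where another flag falls between `PERM` and the marker of interest.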


Some insights:

* Series with high ADI often also show high dependence on past values: `MI_T|ADI` is the most common non-clean flag pattern (3,199 series). We may therefore find usable temporal patterns in these intermittent series. Note that "high" here means the top quartile of ADI, while the `intermittent` flag uses a fixed cutoff of 1.34.

* Roughly 71% of our volume comes from lumpy time series. This is a big clue: the variance is unstable in many of our series, so any ML or DL approach will likely need robust loss functions, and we could also benefit greatly from variance-stabilizing transformations.

* High-seasonality series account for far less volume (6.67%) than high-trend series (19.65%). Series with significant trend or seasonality and low complexity make up only a small share of total demand.

* Intermittent series (ADI, 6.14%) and series with strong serial dependence (MI_T, 10.70%) make up a small share of total volume. In the ADI case this makes sense and is good news: we can focus advanced methods on the bulk of the volume, which sits in more complex series, and likely use intermittent-demand methods to get a "good enough" forecast for the high-ADI series.

* Series with long-range dependence (LONG) make up about 15% of total volume. A series counts as long-range dependent when 2 or more of its top 3 MI lags are beyond lag 26.
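
As an illustration of the variance-stabilizing idea mentioned in the lumpiness insight, here is a minimal sketch using `log1p`, one common choice for non-negative demand with zeros. The toy series is made up, and whether this transform suits a given M5 series is an assumption to validate, not a recommendation from the analysis above.

```python
import numpy as np

y = np.array([0.0, 3.0, 2.0, 150.0, 4.0, 0.0, 90.0, 5.0])  # toy lumpy demand

y_log = np.log1p(y)  # log(1 + y), defined at zero demand
y_back = np.expm1(y_log)  # inverse transform back to the original scale

print(f"variance before: {y.var(ddof=1):.1f}, after: {y_log.var(ddof=1):.2f}")
```

Forecasts produced on the transformed scale need to be mapped back with `expm1` before they are evaluated against actual demand.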