# Feature Extraction Benchmarks
---

This walkthrough benchmarks the feature extraction functions of `functime` against those of `tsfresh`. We first measure the speed of feature extraction on time series of four sizes: 10K, 100K, 1M, and 9M rows. We then measure the speed in a group-by and aggregation context, comparing functime with polars against tsfresh with pandas.

In [15]:
%%capture
%pip install perfplot
%pip install pandas
%pip install tsfresh
%pip install functime

In [35]:
from typing import Callable
import pandas as pd
import perfplot
import polars as pl
from tsfresh.feature_extraction import feature_calculators as tsfresh
from functime import feature_extractors as fe

In [36]:
pl.Config.set_tbl_rows(100)
pl.Config.set_fmt_str_lengths(60)
pl.Config.set_tbl_hide_column_data_types(True)

polars.config.Config

## 1. Setup for the comparison
---
We use the M4 dataset. We create a `pd.DataFrame` and a `pl.DataFrame`, and we define a list of tuples with the following structure:
<br>
(<br>
&emsp;  `<functime_function>`,<br>
&emsp;  `<tsfresh_function>`,<br>
&emsp;  `<functime_parameters>`,<br>
&emsp;   `<tsfresh_parameters>`<br>
)<br>
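
Each entry pairs two implementations of the same feature with the keyword arguments each one expects, since the two libraries often name the same parameter differently (for example `n_lags` versus `lag`). A minimal sketch of how one entry is unpacked and called, using two hypothetical stand-in functions (`toy_feature_a`, `toy_feature_b`) rather than the real libraries:

```python
# Hypothetical stand-ins: two callables computing the same quantity
# (number of values above a threshold) under different keyword names.
# They are NOT the functime or tsfresh implementations.
def toy_feature_a(values, threshold):
    return sum(v > threshold for v in values)


def toy_feature_b(values, t):
    return sum(v > t for v in values)


# Same layout as one entry of the parameter list:
# (first function, second function, first kwargs, second kwargs)
entry = (toy_feature_a, toy_feature_b, {"threshold": 0.0}, {"t": 0.0})

f_feat, ts_feat, f_params, ts_params = entry
values = [-1.0, 0.5, 2.0]

# Each function receives its own keyword arguments via ** unpacking.
result_a = f_feat(values, **f_params)
result_b = ts_feat(values, **ts_params)
print(result_a, result_b)  # 2 2
```

Keeping the two keyword dictionaries separate lets a single loop drive both libraries without a per-feature adapter.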

In [37]:
_M4_DATASET = "../../data/m4_1d_train.parquet"

DF_PANDAS = (
    pd.melt(pd.read_parquet(_M4_DATASET))
    .drop(columns=["variable"])
    .dropna()
    .reset_index(drop=True)
)
DF_PL_EAGER = (
    pl.read_parquet(_M4_DATASET).drop("V1").melt().drop("variable").drop_nulls()
)

In [38]:
FUNC_PARAMS_BENCH = [
    (fe.absolute_energy, tsfresh.abs_energy, {}, {}),
    (fe.absolute_maximum, tsfresh.absolute_maximum, {}, {}),
    (fe.absolute_sum_of_changes, tsfresh.absolute_sum_of_changes, {}, {}),
    (fe.lempel_ziv_complexity, tsfresh.lempel_ziv_complexity, {"threshold": (pl.col("value").max() - pl.col("value").min()) / 2}, {"bins": 2}),
    (
        fe.approximate_entropy,
        tsfresh.approximate_entropy,
        {"run_length": 2, "filtering_level": 0.5},
        {"m": 2, "r": 0.5},
    ),
    # (fe.augmented_dickey_fuller, tsfresh.augmented_dickey_fuller, "param")
    (fe.autocorrelation, tsfresh.autocorrelation, {"n_lags": 4}, {"lag": 4}),
    (
        fe.autoregressive_coefficients,
        tsfresh.ar_coefficient,
        {"n_lags": 4},
        {"param": [{"coeff": i, "k": 4} for i in range(5)]},
    ),
    (fe.benford_correlation, tsfresh.benford_correlation, {}, {}),
    (fe.binned_entropy, tsfresh.binned_entropy, {"bin_count": 10}, {"max_bins": 10}),
    (fe.c3, tsfresh.c3, {"n_lags": 10}, {"lag": 10}),
    (
        fe.change_quantiles,
        tsfresh.change_quantiles,
        {"q_low": 0.1, "q_high": 0.9, "is_abs": True},
        {"ql": 0.1, "qh": 0.9, "isabs": True, "f_agg": "mean"},
    ),
    (fe.cid_ce, tsfresh.cid_ce, {"normalize": True}, {"normalize": True}),
    (fe.count_above, tsfresh.count_above, {"threshold": 0.0}, {"t": 0.0}),
    (fe.count_above_mean, tsfresh.count_above_mean, {}, {}),
    (fe.count_below, tsfresh.count_below, {"threshold": 0.0}, {"t": 0.0}),
    (fe.count_below_mean, tsfresh.count_below_mean, {}, {}),
    # (fe.cwt_coefficients, tsfresh.cwt_coefficients, {"widths": (1, 2, 3), "n_coefficients": 2},{"param": {"widths": (1, 2, 3), "coeff": 2, "w": 1}}),
    (
        fe.energy_ratios,
        tsfresh.energy_ratio_by_chunks,
        {"n_chunks": 6},
        {"param": [{"num_segments": 6, "segment_focus": i} for i in range(6)]},
    ),
    (fe.first_location_of_maximum, tsfresh.first_location_of_maximum, {}, {}),
    (fe.first_location_of_minimum, tsfresh.first_location_of_minimum, {}, {}),
    # (fe.fourier_entropy, tsfresh.fourier_entropy, {"n_bins": 10}, {"bins": 10}),
    # (fe.friedrich_coefficients, tsfresh.friedrich_coefficients, {"polynomial_order": 3, "n_quantiles": 30}, {"params": [{"m": 3, "r": 30}]}),
    (fe.has_duplicate, tsfresh.has_duplicate, {}, {}),
    (fe.has_duplicate_max, tsfresh.has_duplicate_max, {}, {}),
    (fe.has_duplicate_min, tsfresh.has_duplicate_min, {}, {}),
    (
        fe.index_mass_quantile,
        tsfresh.index_mass_quantile,
        {"q": 0.5},
        {"param": [{"q": 0.5}]},
    ),
    (
        fe.large_standard_deviation,
        tsfresh.large_standard_deviation,
        {"ratio": 0.25},
        {"r": 0.25},
    ),
    (fe.last_location_of_maximum, tsfresh.last_location_of_maximum, {}, {}),
    (fe.last_location_of_minimum, tsfresh.last_location_of_minimum, {}, {}),
    # (fe.lempel_ziv_complexity, tsfresh.lempel_ziv_complexity, {"n_bins": 5}, {"bins": 5}),
    (
        fe.linear_trend,
        tsfresh.linear_trend,
        {},
        {
            "param": [
                {"attr": "pvalue"},
                {"attr": "rvalue"},
                {"attr": "intercept"},
                {"attr": "slope"},
                {"attr": "stderr"},
            ]
        },
    ),
    (fe.longest_streak_above_mean, tsfresh.longest_strike_above_mean, {}, {}),
    (fe.longest_streak_below_mean, tsfresh.longest_strike_below_mean, {}, {}),
    (fe.mean_abs_change, tsfresh.mean_abs_change, {}, {}),
    (fe.mean_change, tsfresh.mean_change, {}, {}),
    (
        fe.mean_n_absolute_max,
        tsfresh.mean_n_absolute_max,
        {"n_maxima": 20},
        {"number_of_maxima": 20},
    ),
    (
        fe.mean_second_derivative_central,
        tsfresh.mean_second_derivative_central,
        {},
        {},
    ),
    (
        fe.number_crossings,
        tsfresh.number_crossing_m,
        {"crossing_value": 0.0},
        {"m": 0.0},
    ),
    (fe.number_cwt_peaks, tsfresh.number_cwt_peaks, {"max_width": 5}, {"n": 5}),
    (fe.number_peaks, tsfresh.number_peaks, {"support": 5}, {"n": 5}),
    # (fe.partial_autocorrelation, tsfresh.partial_autocorrelation, "param"),
    (
        fe.percent_reoccurring_values,
        tsfresh.percentage_of_reoccurring_values_to_all_values,
        {},
        {},
    ),
    (
        fe.percent_reoccurring_points,
        tsfresh.percentage_of_reoccurring_datapoints_to_all_datapoints,
        {},
        {},
    ),
    (
        fe.permutation_entropy,
        tsfresh.permutation_entropy,
        {"tau": 1, "n_dims": 3},
        {"tau": 1, "dimension": 3},
    ),
    (
        fe.range_count,
        tsfresh.range_count,
        {"lower": 0, "upper": 9, "closed": "none"},
        {"min": 0, "max": 9},
    ),
    (fe.ratio_beyond_r_sigma, tsfresh.ratio_beyond_r_sigma, {"ratio": 2}, {"r": 2}),
    (
        fe.ratio_n_unique_to_length,
        tsfresh.ratio_value_number_to_time_series_length,
        {},
        {},
    ),
    (fe.root_mean_square, tsfresh.root_mean_square, {}, {}),
    (fe.sample_entropy, tsfresh.sample_entropy, {}, {}),
    (
        fe.spkt_welch_density,
        tsfresh.spkt_welch_density,
        {"n_coeffs": 10},
        {"param": [{"coeff": i} for i in range(10)]},
    ),
    (fe.sum_reoccurring_points, tsfresh.sum_of_reoccurring_data_points, {}, {}),
    (fe.sum_reoccurring_values, tsfresh.sum_of_reoccurring_values, {}, {}),
    (
        fe.symmetry_looking,
        tsfresh.symmetry_looking,
        {"ratio": 0.25},
        {"param": [{"r": 0.25}]},
    ),
    (
        fe.time_reversal_asymmetry_statistic,
        tsfresh.time_reversal_asymmetry_statistic,
        {"n_lags": 3},
        {"lag": 3},
    ),
    (fe.variation_coefficient, tsfresh.variation_coefficient, {}, {}),
    (fe.var_gt_std, tsfresh.variance_larger_than_standard_deviation, {}, {}),
]

## 2. Benchmark core functions
---
We benchmark each core function on time series of length 10_000, 100_000, 1_000_000 and 9_000_000. The slowest features are capped at smaller sizes: 10_000 only for `approximate_entropy`, and 10_000 and 100_000 for `number_cwt_peaks`, `sample_entropy` and `lempel_ziv_complexity`. `all_benchmarks()` iterates over the entries of the `FUNC_PARAMS_BENCH` list and invokes `benchmark()` for each feature.
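
The timing itself is delegated to `perfplot.bench` in the cells that follow. As a rough illustration of the idea only, the sketch below uses the standard library's `timeit` as a stand-in: it picks the input sizes per feature the same way `benchmark()` does, then times each kernel on the first `n` elements. The helper names `pick_n_range` and `time_kernels` are hypothetical and the toy kernels (`sum`, `max`) are placeholders for the real feature functions.

```python
import timeit


def pick_n_range(feature_name):
    """Mirror the size caps applied in benchmark() to the slowest features."""
    if feature_name == "approximate_entropy":
        return [10_000]
    if feature_name in ("number_cwt_peaks", "sample_entropy", "lempel_ziv_complexity"):
        return [10_000, 100_000]
    return [10_000, 100_000, 1_000_000, 9_000_000]


def time_kernels(kernels, data, n_range, repeat=3):
    """Return, for each kernel, the best wall-clock time (seconds) per size n."""
    timings = []
    for kernel in kernels:
        per_n = []
        for n in n_range:
            sample = data[:n]  # same idea as DataFrame.head(n) in the notebook
            runs = timeit.repeat(lambda: kernel(sample), number=1, repeat=repeat)
            per_n.append(min(runs))
        timings.append(per_n)
    return timings


# Toy usage with placeholder kernels and small sizes.
data = list(range(1_000))
timings = time_kernels([sum, max], data, n_range=[10, 100, 1_000])
print(timings)
```

Taking the minimum over a few repeats is one common way to reduce noise; it is an assumption of this sketch, not a statement about how `perfplot` aggregates its measurements.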

In [39]:
def benchmark(
    f_feat: Callable, ts_feat: Callable, f_params: dict, ts_params: dict, is_expr: bool
):
    if f_feat.__name__ == "approximate_entropy":
        n_range = [10_000]
    elif f_feat.__name__ in ("number_cwt_peaks", "sample_entropy", "lempel_ziv_complexity"):
        n_range = [10_000, 100_000]
    else:
        n_range = [10_000, 100_000, 1_000_000, 9_000_000]
    benchmark = perfplot.bench(
        setup=lambda n: (DF_PL_EAGER.head(n), DF_PANDAS.head(n)),
        kernels=[
            lambda x, _y: f_feat(x["value"], **f_params)
            if not is_expr
            else x.select(f_feat(pl.col("value"), **f_params)),
            lambda _x, y: ts_feat(y["value"], **ts_params),
        ],
        n_range=n_range,
        equality_check=False,
        labels=["functime", "tsfresh"],
    )
    return benchmark

In [40]:
def all_benchmarks(params: list[tuple], is_expr: bool) -> pl.DataFrame:
    bench_df = pl.DataFrame(
        schema={
            "Feature name": pl.Utf8,
            "n": pl.Int64,
            "functime (ms)": pl.Float64,
            "tfresh (ms)": pl.Float64,
            "diff (ms)": pl.Float64,
            "diff %": pl.Float64,
            "speedup": pl.Float64,
        }
    )
    for x in params:
        try:
            f_feat = x[0]
            print(f"Feature: {f_feat.__name__}")
            bench = benchmark(
                f_feat=f_feat,
                ts_feat=x[1],
                f_params=x[2],
                ts_params=x[3],
                is_expr=is_expr,
            )
            bench_df = pl.concat(
                [
                    pl.DataFrame(
                        {
                            "Feature name": [x[0].__name__] * len(bench.n_range),
                            "n": bench.n_range,
                            "functime (ms)": bench.timings_s[0] * 1_000,
                            "tfresh (ms)": bench.timings_s[1] * 1_000,
                            "diff (ms)": (bench.timings_s[0] - bench.timings_s[1])
                            * 1_000,
                            "diff %": 100
                            * (bench.timings_s[0] - bench.timings_s[1])
                            / bench.timings_s[1],
                            "speedup": bench.timings_s[1] / bench.timings_s[0],
                        }
                    ),
                    bench_df,
                ]
            )
        except ValueError:
            print(f"Failed to compute feature {x[0].__name__}")
        except ImportError:
            print(f"Failed to import feature {x[0].__name__}")
        except TypeError:
            print(f"Feature {x[0].__name__} not implemented for pl.Expr")
    return bench_df

## 3. Run benchmarks
---

In [41]:
# Code to prettify benchmark results
def table_prettifier(df: pl.DataFrame, n: int):
    table = (
        df.filter(pl.col("n") == n)
        .drop("n")
        .sort("speedup", descending=True)
        .with_columns(
            pl.when(pl.exclude("Feature name").abs() < 0.1)
            .then(pl.exclude("Feature name").round(4))
            .when(pl.exclude("Feature name").abs() < 1)
            .then(pl.exclude("Feature name").round(2))
            .when(pl.exclude("Feature name").abs() < 30)
            .then(pl.exclude("Feature name").round(1))
            .otherwise(pl.exclude("Feature name").round(1))
        )
        .with_columns(speedup="x " + pl.col("speedup").cast(pl.Utf8))
    )
    return table

In [42]:
%%capture
bench_expr = all_benchmarks(params = FUNC_PARAMS_BENCH, is_expr = True)
bench_series = all_benchmarks(params = FUNC_PARAMS_BENCH, is_expr = False)

# Expression (pl.Expr) benchmarks
df_expr_10k = table_prettifier(bench_expr, n=10_000)
df_expr_100k = table_prettifier(bench_expr, n=100_000)
df_expr_1m = table_prettifier(bench_expr, n=1_000_000)
df_expr_9m = table_prettifier(bench_expr, n=9_000_000)

# Series (pl.Series) benchmarks
df_series_10k = table_prettifier(bench_series, n=10_000)
df_series_100k = table_prettifier(bench_series, n=100_000)
df_series_1m = table_prettifier(bench_series, n=1_000_000)
df_series_9m = table_prettifier(bench_series, n=9_000_000)

INFO:functime.feature_extractors:Expression version of approximate_entropy is not yet implemented due to technical difficulty regarding Polars Expression Plugins.
INFO:functime.feature_extractors:Expression version of autoregressive_coefficients is not yet implemented due to technical difficulty regarding Polars Expression Plugins.
INFO:functime.feature_extractors:Expression version of number_cwt_peaks is not yet implemented due to technical difficulty regarding Polars Expression Plugins.
INFO:functime.feature_extractors:Expression version of sample_entropy is not yet implemented due to technical difficulty regarding Polars Expression Plugins.
INFO:functime.feature_extractors:Expression version of spkt_welch_density is not yet implemented due to technical difficulty regarding Polars Expression Plugins.


## 4. Benchmark results
---

Display 8 tables:
- For `pl.Series`: 10k, 100k, 1M and 9M rows
- For `pl.Expr`: 10k, 100k, 1M and 9M rows

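Each table reports the execution time (ms) for functime and tsfresh, the difference (functime minus tsfresh, so a negative value means functime is faster), the difference as a percentage of the tsfresh time, and the speedup (tsfresh time divided by functime time).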
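
The derived columns follow the formulas used in `all_benchmarks()`. A small worked example with made-up timings (2 ms and 10 ms are illustrative values, not measurements from this notebook):

```python
def compare(functime_ms, tsfresh_ms):
    """Compute the derived benchmark columns from two timings in milliseconds."""
    diff_ms = functime_ms - tsfresh_ms      # negative when functime is faster
    diff_pct = 100 * diff_ms / tsfresh_ms   # relative to the tsfresh timing
    speedup = tsfresh_ms / functime_ms      # > 1 when functime is faster
    return diff_ms, diff_pct, speedup


diff_ms, diff_pct, speedup = compare(functime_ms=2.0, tsfresh_ms=10.0)
print(diff_ms, diff_pct, speedup)  # -8.0 -80.0 5.0
```

Because the tables display rounded timings, recomputing `diff %` or `speedup` from the printed values can differ slightly from the printed columns.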

### 4.1 Results for `pl.Expr`

#### 10k expr

In [43]:
df_expr_10k

| Feature name | functime (ms) | tfresh (ms) | diff (ms) | diff % | speedup |
|---|---|---|---|---|---|
| lempel_ziv_complexity | 1.6 | 49.2 | -47.6 | -96.8 | x 31.1 |
| benford_correlation | 1.4 | 15.1 | -13.7 | -91.0 | x 11.1 |
| energy_ratios | 0.47 | 3.4 | -2.9 | -86.2 | x 7.2 |
| mean_n_absolute_max | 0.13 | 0.65 | -0.52 | -79.4 | x 4.8 |
| longest_streak_below_mean | 0.44 | 1.7 | -1.3 | -73.9 | x 3.8 |
| range_count | 0.0771 | 0.25 | -0.18 | -69.7 | x 3.3 |
| change_quantiles | 0.45 | 1.3 | -0.9 | -67.0 | x 3.0 |
| longest_streak_above_mean | 0.59 | 1.7 | -1.1 | -65.2 | x 2.9 |
| number_peaks | 0.65 | 1.7 | -1.1 | -62.0 | x 2.6 |
| ratio_beyond_r_sigma | 0.18 | 0.42 | -0.25 | -58.4 | x 2.4 |


#### 100k expr

In [44]:
df_expr_100k

| Feature name | functime (ms) | tfresh (ms) | diff (ms) | diff % | speedup |
|---|---|---|---|---|---|
| lempel_ziv_complexity | 37.2 | 2403.7 | -2366.5 | -98.5 | x 64.6 |
| mean_n_absolute_max | 0.55 | 7.3 | -6.7 | -92.5 | x 13.3 |
| benford_correlation | 11.6 | 144.6 | -133.0 | -92.0 | x 12.5 |
| longest_streak_above_mean | 2.5 | 15.4 | -12.9 | -84.1 | x 6.3 |
| longest_streak_below_mean | 2.4 | 15.3 | -12.9 | -84.0 | x 6.3 |
| energy_ratios | 1.0 | 5.0 | -3.9 | -78.9 | x 4.7 |
| ratio_n_unique_to_length | 2.8 | 8.2 | -5.4 | -66.3 | x 3.0 |
| change_quantiles | 2.3 | 6.4 | -4.1 | -64.6 | x 2.8 |
| absolute_maximum | 0.15 | 0.37 | -0.21 | -58.5 | x 2.4 |
| count_above_mean | 0.18 | 0.42 | -0.23 | -56.0 | x 2.3 |


#### 1M expr

In [45]:
df_expr_1m

| Feature name | functime (ms) | tfresh (ms) | diff (ms) | diff % | speedup |
|---|---|---|---|---|---|
| benford_correlation | 150.9 | 3102.5 | -2951.6 | -95.1 | x 20.6 |
| mean_n_absolute_max | 7.4 | 93.5 | -86.1 | -92.1 | x 12.6 |
| longest_streak_below_mean | 22.4 | 157.0 | -134.6 | -85.7 | x 7.0 |
| longest_streak_above_mean | 22.5 | 154.0 | -131.6 | -85.4 | x 6.9 |
| absolute_maximum | 0.82 | 4.9 | -4.1 | -83.3 | x 6.0 |
| energy_ratios | 8.3 | 36.4 | -28.1 | -77.3 | x 4.4 |
| count_below_mean | 1.3 | 4.7 | -3.3 | -71.8 | x 3.5 |
| number_peaks | 22.2 | 76.5 | -54.3 | -70.9 | x 3.4 |
| ratio_n_unique_to_length | 30.6 | 96.0 | -65.3 | -68.1 | x 3.1 |
| has_duplicate_max | 1.3 | 4.1 | -2.7 | -67.0 | x 3.0 |


#### 9M expr

In [46]:
df_expr_9m

| Feature name | functime (ms) | tfresh (ms) | diff (ms) | diff % | speedup |
|---|---|---|---|---|---|
| mean_n_absolute_max | 84.1 | 1466.7 | -1382.6 | -94.3 | x 17.4 |
| benford_correlation | 1907.6 | 19368.9 | -17461.4 | -90.2 | x 10.2 |
| longest_streak_above_mean | 197.0 | 1437.9 | -1240.9 | -86.3 | x 7.3 |
| longest_streak_below_mean | 197.4 | 1403.2 | -1205.8 | -85.9 | x 7.1 |
| energy_ratios | 86.7 | 530.6 | -444.0 | -83.7 | x 6.1 |
| absolute_maximum | 8.0 | 48.0 | -40.0 | -83.3 | x 6.0 |
| time_reversal_asymmetry_statistic | 146.3 | 722.4 | -576.1 | -79.7 | x 4.9 |
| change_quantiles | 298.0 | 1146.8 | -848.8 | -74.0 | x 3.8 |
| count_above_mean | 10.3 | 39.4 | -29.1 | -73.8 | x 3.8 |
| count_below_mean | 10.0 | 38.0 | -28.1 | -73.8 | x 3.8 |


### 4.2 Results for `pl.Series`

#### 10k series

In [47]:
df_series_10k

| Feature name | functime (ms) | tfresh (ms) | diff (ms) | diff % | speedup |
|---|---|---|---|---|---|
| approximate_entropy | 256.8 | 37776.4 | -37519.6 | -99.3 | x 147.1 |
| lempel_ziv_complexity | 1.6 | 46.2 | -44.6 | -96.6 | x 29.3 |
| count_below_mean | 0.0177 | 0.17 | -0.15 | -89.3 | x 9.3 |
| count_above_mean | 0.0178 | 0.16 | -0.14 | -89.0 | x 9.1 |
| has_duplicate_max | 0.0203 | 0.18 | -0.16 | -88.8 | x 8.9 |
| has_duplicate_min | 0.0204 | 0.18 | -0.16 | -88.6 | x 8.8 |
| energy_ratios | 0.41 | 3.3 | -2.9 | -87.8 | x 8.2 |
| sample_entropy | 1268.6 | 10169.0 | -8900.4 | -87.5 | x 8.0 |
| count_above | 0.0164 | 0.12 | -0.1 | -86.5 | x 7.4 |
| benford_correlation | 1.8 | 13.5 | -11.6 | -86.5 | x 7.4 |


#### 100k series

In [48]:
df_series_100k

| Feature name | functime (ms) | tfresh (ms) | diff (ms) | diff % | speedup |
|---|---|---|---|---|---|
| sample_entropy | 5101.4 | 1385099.1 | -1380000.0 | -99.6 | x 271.5 |
| lempel_ziv_complexity | 38.8 | 2457.4 | -2418.6 | -98.4 | x 63.3 |
| mean_n_absolute_max | 0.59 | 7.4 | -6.8 | -92.0 | x 12.6 |
| benford_correlation | 11.3 | 136.0 | -124.7 | -91.7 | x 12.1 |
| energy_ratios | 0.61 | 5.1 | -4.5 | -88.0 | x 8.3 |
| longest_streak_below_mean | 2.4 | 15.8 | -13.4 | -84.9 | x 6.6 |
| longest_streak_above_mean | 2.4 | 15.5 | -13.1 | -84.7 | x 6.5 |
| has_duplicate_min | 0.0815 | 0.5 | -0.42 | -83.6 | x 6.1 |
| has_duplicate_max | 0.0808 | 0.49 | -0.41 | -83.4 | x 6.0 |
| count_below_mean | 0.0725 | 0.43 | -0.36 | -83.1 | x 5.9 |


#### 1M series

In [49]:
df_series_1m

| Feature name | functime (ms) | tfresh (ms) | diff (ms) | diff % | speedup |
|---|---|---|---|---|---|
| benford_correlation | 114.4 | 1435.3 | -1320.9 | -92.0 | x 12.5 |
| mean_n_absolute_max | 7.6 | 87.6 | -80.0 | -91.3 | x 11.6 |
| energy_ratios | 4.3 | 35.8 | -31.5 | -88.0 | x 8.3 |
| root_mean_square | 0.55 | 4.3 | -3.7 | -87.1 | x 7.8 |
| longest_streak_below_mean | 22.3 | 164.4 | -142.0 | -86.4 | x 7.4 |
| autoregressive_coefficients | 86.0 | 619.3 | -533.3 | -86.1 | x 7.2 |
| longest_streak_above_mean | 22.5 | 161.5 | -139.0 | -86.1 | x 7.2 |
| linear_trend | 20.5 | 104.6 | -84.1 | -80.4 | x 5.1 |
| absolute_maximum | 1.1 | 4.8 | -3.7 | -77.7 | x 4.5 |
| count_below_mean | 1.1 | 4.7 | -3.6 | -76.5 | x 4.3 |


#### 9M series

In [50]:
df_series_9m

| Feature name | functime (ms) | tfresh (ms) | diff (ms) | diff % | speedup |
|---|---|---|---|---|---|
| mean_n_absolute_max | 47.4 | 1059.4 | -1012.0 | -95.5 | x 22.4 |
| root_mean_square | 3.8 | 37.7 | -33.9 | -89.9 | x 9.9 |
| energy_ratios | 39.9 | 374.9 | -335.0 | -89.4 | x 9.4 |
| longest_streak_above_mean | 188.5 | 1676.1 | -1487.6 | -88.8 | x 8.9 |
| benford_correlation | 1658.2 | 13073.8 | -11415.6 | -87.3 | x 7.9 |
| longest_streak_below_mean | 206.5 | 1459.7 | -1253.3 | -85.9 | x 7.1 |
| linear_trend | 185.2 | 1224.0 | -1038.8 | -84.9 | x 6.6 |
| absolute_maximum | 9.5 | 43.0 | -33.5 | -77.8 | x 4.5 |
| autoregressive_coefficients | 1259.4 | 5526.1 | -4266.7 | -77.2 | x 4.4 |
| count_above_mean | 9.6 | 34.7 | -25.1 | -72.3 | x 3.6 |


## 5. Benchmark `Group by / Aggregation` context

This section benchmarks functime's feature extraction inside polars' `Group by / Aggregation` context against tsfresh applied through a pandas `groupby`.
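
To make the group-by setting concrete: the data is in long format (one row per ticker and date), and the feature is computed once per group. The pure-Python sketch below illustrates this with a simple peak counter, assuming a peak of support `n` is a value strictly greater than its `n` neighbours on each side. It is an illustration on made-up prices, not the implementation used by either library.

```python
from collections import defaultdict


def number_peaks(values, support):
    """Count values strictly greater than their `support` neighbours on both sides."""
    count = 0
    for j in range(support, len(values) - support):
        if all(
            values[j] > values[j - i] and values[j] > values[j + i]
            for i in range(1, support + 1)
        ):
            count += 1
    return count


# Toy long-format rows (ticker, price); the prices are made up.
rows = [
    ("A", 1.0), ("A", 3.0), ("A", 1.0), ("A", 4.0), ("A", 1.0),
    ("B", 2.0), ("B", 2.0), ("B", 5.0), ("B", 2.0),
]

# Group: collect the prices of each ticker, preserving row order.
groups = defaultdict(list)
for ticker, price in rows:
    groups[ticker].append(price)

# Aggregate: one feature value per ticker.
features = {ticker: number_peaks(prices, support=1) for ticker, prices in groups.items()}
print(features)  # {'A': 2, 'B': 1}
```

The benchmark that follows measures the same group-then-aggregate pattern, with the per-group work done by each library's own feature function.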

In [51]:
_SP500_DATASET = "../../data/sp500.parquet"

SP500_PANDAS = pd.read_parquet(_SP500_DATASET)
SP500_PL_EAGER = pl.read_parquet(_SP500_DATASET)

In [52]:
SP500_PANDAS

|  | ticker | time | price |
|---|---|---|---|
| 0 | A | 2022-06-01 | 122.278214 |
| 1 | A | 2022-06-02 | 128.248581 |
| 2 | A | 2022-06-03 | 127.642609 |
| 3 | A | 2022-06-06 | 126.788277 |
| 4 | A | 2022-06-07 | 128.049881 |
| ... | ... | ... | ... |
| 126248 | ZTS | 2023-05-24 | 169.139999 |
| 126249 | ZTS | 2023-05-25 | 165.240005 |
| 126250 | ZTS | 2023-05-26 | 164.740005 |
| 126251 | ZTS | 2023-05-30 | 160.940002 |


We want to compare `tsfresh` with pandas' `groupby` against `functime` with polars' `group_by`, for example:

In [53]:
%%timeit
SP500_PANDAS.groupby(
    by = "ticker"
)["price"].agg(
    tsfresh.number_peaks,
    n = 5
)

908 ms ± 33.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [54]:
%%timeit
SP500_PL_EAGER.group_by(
    pl.col("ticker")
).agg(
    pl.col("price").ts.number_peaks(support = 5)
)

52.8 ms ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In the section 4 benchmarks, the `number_peaks` operation is approximately **2.5** times faster with `functime` than with `tsfresh`.

In the group-by context above, it is about **17** times faster (908 ms versus 52.8 ms)!

In [60]:
def benchmark_groupby_context(
    f_feat: Callable, ts_feat: Callable, f_params: dict, ts_params: dict
):
    if f_feat.__name__ == "lempel_ziv_complexity":
        f_params = {"threshold": (pl.col("price").max() - pl.col("price").min()) / 2}
    benchmark = perfplot.bench(
        setup=lambda _n: (SP500_PL_EAGER, SP500_PANDAS),
        kernels=[
            lambda x, _y: x.group_by(pl.col("ticker")).agg(
                f_feat(pl.col("price"), **f_params)
            ),  # functime + polars groupby
            lambda _x, y: y.groupby("ticker")["price"].agg(
                ts_feat, **ts_params
            ),  # tsfresh + pandas groupby
        ],
        n_range=[1],
        equality_check=False,
        labels=["functime", "tsfresh"],
    )
    return benchmark

In [61]:
def all_benchmarks_groupby(params: list[tuple]) -> pl.DataFrame:
    bench_df = pl.DataFrame(
        schema={
            "Feature name": pl.Utf8,
            "n": pl.Int64,
            "functime + pl groupby (ms)": pl.Float64,
            "tfresh + pd groupby (ms)": pl.Float64,
            "diff (ms)": pl.Float64,
            "diff %": pl.Float64,
            "speedup": pl.Float64,
        }
    )
    for x in params:
        try:
            print(f"Feature: {x[0].__name__}")
            bench = benchmark_groupby_context(
                f_feat=x[0], ts_feat=x[1], f_params=x[2], ts_params=x[3]
            )
            bench_df = pl.concat(
                [
                    pl.DataFrame(
                        {
                            "Feature name": [x[0].__name__] * len(bench.n_range),
                            "n": bench.n_range,
                            "functime + pl groupby (ms)": bench.timings_s[0] * 1_000,
                            "tfresh + pd groupby (ms)": bench.timings_s[1] * 1_000,
                            "diff (ms)": (bench.timings_s[0] - bench.timings_s[1])
                            * 1_000,
                            "diff %": 100
                            * (bench.timings_s[0] - bench.timings_s[1])
                            / bench.timings_s[1],
                            "speedup": bench.timings_s[1] / bench.timings_s[0],
                        }
                    ),
                    bench_df,
                ]
            )
        except ValueError:
            print(f"Failed to compute feature {x[0].__name__}")
        except ImportError:
            print(f"Failed to import feature {x[0].__name__}")
        except TypeError:
            print(f"Feature {x[0].__name__} not implemented for pl.Expr")
    return bench_df

In [65]:
%%capture
bench_groupby = all_benchmarks_groupby(params=FUNC_PARAMS_BENCH)
df_groupby = table_prettifier(df=bench_groupby, n=1)

INFO:functime.feature_extractors:Expression version of approximate_entropy is not yet implemented due to technical difficulty regarding Polars Expression Plugins.
INFO:functime.feature_extractors:Expression version of autoregressive_coefficients is not yet implemented due to technical difficulty regarding Polars Expression Plugins.
INFO:functime.feature_extractors:Expression version of number_cwt_peaks is not yet implemented due to technical difficulty regarding Polars Expression Plugins.
INFO:functime.feature_extractors:Expression version of sample_entropy is not yet implemented due to technical difficulty regarding Polars Expression Plugins.
INFO:functime.feature_extractors:Expression version of spkt_welch_density is not yet implemented due to technical difficulty regarding Polars Expression Plugins.


#### S&P500 groupby

In [66]:
df_groupby

| Feature name | functime + pl groupby (ms) | tfresh + pd groupby (ms) | diff (ms) | diff % | speedup |
|---|---|---|---|---|---|
| energy_ratios | 9.1 | 2024.5 | -2015.4 | -99.5 | x 222.1 |
| index_mass_quantile | 7.2 | 544.3 | -537.1 | -98.7 | x 75.8 |
| range_count | 2.8 | 154.3 | -151.6 | -98.2 | x 56.0 |
| symmetry_looking | 3.1 | 114.4 | -111.4 | -97.3 | x 37.3 |
| percent_reoccurring_points | 6.8 | 246.0 | -239.2 | -97.2 | x 36.2 |
| ratio_beyond_r_sigma | 6.0 | 215.1 | -209.1 | -97.2 | x 35.8 |
| root_mean_square | 2.4 | 83.8 | -81.4 | -97.1 | x 34.3 |
| count_above | 2.7 | 80.0 | -77.4 | -96.6 | x 29.7 |
| count_below | 2.7 | 78.5 | -75.8 | -96.5 | x 28.9 |
| lempel_ziv_complexity | 10.5 | 293.2 | -282.8 | -96.4 | x 28.0 |
