Speed Improvement: polars backends #77

mdancho84 · 2023-10-03T12:02:59Z

Running checklist of backends: #77 (comment)

iamjakkie · 2023-10-03T16:50:28Z

I will check what's possible there

JustinKurland · 2023-10-15T23:44:22Z

GTimothee · 2023-10-26T12:47:44Z

Will do augment_fourier (discussed with Justin Kurland) :)

mdancho84 · 2023-10-26T13:25:28Z

Awesome that's much appreciated!

GTimothee · 2023-10-26T19:34:53Z

I will do ts_summary at the same time because I need it.

I updated checks.py like this (not yet pushed):

from typing import Optional, Union

import pandas as pd
import polars as pl


def check_data_type(data, authorized_dtypes: list, error_str: Optional[str] = None) -> None:
    # Raise a TypeError unless `data` is an instance of one of `authorized_dtypes`.
    if not error_str:
        error_str = f'Input type must be one of {authorized_dtypes}'
    # isinstance accepts a tuple of types, so no map/sum is needed.
    if not isinstance(data, tuple(authorized_dtypes)):
        raise TypeError(error_str)


def check_dataframe_or_groupby(data: Union[pd.DataFrame, pd.core.groupby.generic.DataFrameGroupBy]) -> None:
    check_data_type(
        data,
        authorized_dtypes=[pd.DataFrame, pd.core.groupby.generic.DataFrameGroupBy],
        error_str='`data` is not a Pandas DataFrame or GroupBy object.',
    )


def check_dataframe_or_groupby_polar(data: Union[pl.DataFrame, pd.DataFrame, pd.core.groupby.generic.DataFrameGroupBy]) -> None:
    check_data_type(
        data,
        authorized_dtypes=[pl.DataFrame, pd.DataFrame, pd.core.groupby.generic.DataFrameGroupBy],
    )

It seems more Pythonic to me, if you agree with it. I ran the tests and they pass :)
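To illustrate how such a check behaves, here is a minimal, self-contained sketch. It redefines check_data_type with a single isinstance call (an equivalent formulation, not necessarily what was pushed) and uses only pandas; the sample inputs are assumptions made up for illustration.

```python
import pandas as pd


def check_data_type(data, authorized_dtypes: list, error_str=None) -> None:
    # isinstance accepts a tuple of types, which replaces the map/sum construct.
    if not error_str:
        error_str = f"Input type must be one of {authorized_dtypes}"
    if not isinstance(data, tuple(authorized_dtypes)):
        raise TypeError(error_str)


allowed = [pd.DataFrame, pd.core.groupby.generic.DataFrameGroupBy]
df = pd.DataFrame({"id": ["a", "a", "b"], "value": [1, 2, 3]})

# Both a DataFrame and a GroupBy object pass silently.
check_data_type(df, allowed)
check_data_type(df.groupby("id"), allowed)

# Anything else raises TypeError with the default message.
try:
    check_data_type([1, 2, 3], allowed)
    rejected = False
except TypeError as err:
    rejected = True
    message = str(err)
```

Passing a custom error_str, as check_dataframe_or_groupby does, only changes the message, not the check itself.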

I am writing a polars version of augment_fourier. Then, if possible, I plan to merge the polars version with augment_fourier_v2: convert the pandas dtypes to polars dtypes, do the computations, then convert back. Is that what you intended?

mdancho84 · 2023-10-26T19:45:08Z

As long as it works as intended I'm Ok. Thanks!

JustinKurland · 2023-10-27T11:39:08Z

> I am writing a polars version of augment_fourier. Then, if possible, I plan to merge the polars version with augment_fourier_v2: convert the pandas dtypes to polars dtypes, do the computations, then convert back. Is that what you intended?

@GTimothee yes, that is correct: pandas -> polars -> pandas, with the conversions happening inside the function. Some functions currently accept polars dataframes directly. Do not follow that pattern; those functions have to be refactored to accept only pandas.

GTimothee · 2023-10-28T07:32:51Z

Understood :) Sorry, I am a little short on time, but I am on it!

GTimothee · 2023-11-02T18:26:14Z

I think we can check off augment_fourier, no?
I am now starting to add polars support to ts_summary.
As for the speed improvement on calc_fourier, I found a bug in my new implementation, so I will have to experiment a bit more and check again that my idea is sound. I will be in touch with Justin K about this.

mdancho84 · 2023-11-02T18:43:20Z

Ok sounds good. I plan to release 0.2.0 tomorrow. Let me know if there is anything I can do to help.

GTimothee · 2023-11-02T18:57:03Z

Actually, the main problem I have is checking my results. I am using %timeit in a notebook cell, but every time I run it, it gives me different results. There is also a difference between running my experiments notebook locally and in Colab: the output is not the same. I am not sure what I am doing wrong.

But I guess my experimental function is not good enough anyway because, in general, even with the variations, the current implementation is faster. I had an implementation leveraging itertools.permutations that was faster, but I found that it does not give correct results. I switched to itertools.product and now it is slower :/

GTimothee · 2023-11-02T19:09:43Z

In this function: https://github.com/business-science/pytimetk/blob/master/src/pytimetk/core/ts_summary.py#L398 why is there the comment # "America/New_York"?

mdancho84 · 2023-11-02T19:25:20Z

I think that's just an example of the time zone

GTimothee · 2023-11-02T19:27:13Z

I was wondering whether you were expecting this particular time zone.

mdancho84 · 2023-11-02T19:45:33Z

No I believe it can be different time zones. That comment is just an example.

JustinKurland · 2023-11-02T19:52:47Z

> Actually, the main problem I have is checking my results. I am using %timeit in a notebook cell, but every time I run it, it gives me different results. There is also a difference between running my experiments notebook locally and in Colab: the output is not the same. I am not sure what I am doing wrong.

There are many reasons why running something, even just locally, could produce different timings; I would not expect them to be identical. You may even see the time go down between runs because of caching. Do not get thrown off by this. Relatedly, I would not expect your results in Colab to match your local ones. I do not know what your Colab setup is, but there you can choose to take advantage of GPUs. You can check disk information with a command like !df -h, CPU specs with !cat /proc/cpuinfo, and memory with !cat /proc/meminfo.
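One common way to make noisy timings easier to compare is to repeat the measurement several times and look at the minimum and the spread rather than a single number. The sketch below uses only the standard library; the work function is a made-up stand-in for whatever is being benchmarked.

```python
import statistics
import timeit


def work() -> int:
    # Stand-in workload; replace with the function under test.
    return sum(i * i for i in range(10_000))


number = 50  # calls per measurement
repeat = 5   # independent measurements

# timeit.repeat returns one total time per measurement.
totals = timeit.repeat(work, number=number, repeat=repeat)
per_call = [t / number for t in totals]

# The minimum is usually the least disturbed by other activity on the machine.
best = min(per_call)
spread = statistics.stdev(per_call)
print(f"best: {best:.2e} s per call, stdev: {spread:.2e} s over {repeat} runs")
```

Comparing the best time of two implementations on the same machine is generally more stable than comparing single runs, and timings from different machines (local vs. Colab) are not directly comparable at all.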

> But I guess my experimental function is not good enough anyway because, in general, even with the variations, the current implementation is faster. I had an implementation leveraging itertools.permutations that was faster, but I found that it does not give correct results. I switched to itertools.product and now it is slower :/

Maybe we can connect. I am not sure why you would be using itertools for pretty much anything we are doing, so I am deeply curious how you are using it.

GTimothee · 2023-11-02T20:05:06Z

Yes, I will submit my experiments to you asap to get some feedback :) I was using itertools to generate permutations of order x period. That is how I would replace the loops.
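For reference, here is a small sketch of why itertools.product, rather than itertools.permutations, matches a nested loop over orders and periods. The values are made up, and how permutations was actually used in the experiment is not shown in this thread, so the last line is only one hypothetical way it could produce wrong pairs.

```python
from itertools import permutations, product

orders = [1, 2]
periods = [7, 30]

# Nested loops, as in a straightforward implementation.
looped = [(order, period) for order in orders for period in periods]

# product yields exactly the same (order, period) pairs, in the same sequence.
pairs = list(product(orders, periods))

# permutations works on ONE pool, so it also emits pairs such as (1, 2) or
# (30, 7) that are not an (order, period) combination at all.
mixed = list(permutations(orders + periods, 2))

print(pairs)
print(len(mixed))
```

Note that product only replaces the loop bookkeeping; it does not vectorize the computation done for each pair, which may be why it was not faster.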

seyf97 · 2023-11-03T02:23:55Z

Can I take ceil_date? @JustinKurland

JustinKurland · 2023-11-03T12:14:51Z

> Can I take ceil_date? @JustinKurland

Absolutely @seyf97. I had begun working on this to figure out what it looks like for polars dataframes and series. I finished figuring this out for most dates, but did not start on datetimes. This code should help you start quickly.

Dataframes

import polars as pl

# Create a DataFrame with a date column stored as strings
df = pl.DataFrame({
    'date': ['2023-10-01', '2023-10-02', '2023-10-03', '2023-10-04', '2024-02-26'],
    'value': [1, 2, 3, 4, 5]
})
# Parse the date strings into pl.Date (datetimes are not covered yet)
df = df.with_columns(pl.col('date').str.strptime(pl.Date, format="%Y-%m-%d"))

# week
df.with_columns(
    pl.col('date')
    .dt.offset_by('1w')
    .dt.truncate('1w')
    .dt.offset_by('-1d')
    .alias('ceil_W')
)

# month
df.with_columns(
    pl.col('date')
    .dt.offset_by('1mo')
    .dt.truncate('1mo')
    .dt.offset_by('-1d')
    .alias('ceil_M')
)
# Alternatively, month_end() works too, but given the pattern used for the
# other frequencies it probably makes more sense to stick with the pattern.
df.with_columns(pl.col("date").dt.month_end())

# quarter
df.with_columns(
    pl.col('date')
    .dt.offset_by('1q')
    .dt.truncate('1q')
    .dt.offset_by('-1d')
    .alias('ceil_Q')
)

# year
df.with_columns(
    pl.col('date')
    .dt.offset_by('1y')
    .dt.truncate('1y')
    .dt.offset_by('-1d')
    .alias('ceil_Y')
)

# Still missing for the dataframe pattern: ceilings for the time components (hour, minute, second)
# and any other `pandas` frequencies we have included, to ensure alignment.

Series

# Build a Series of date strings and parse it into pl.Date
pl_series = pl.Series('date', ['2023-10-01', '2023-10-02', '2023-10-03', '2023-10-04', '2024-02-26'])

pl_series = pl_series.str.strptime(pl.Date, format="%Y-%m-%d")

# Note: these offsets were originally written with the '_saturating' suffix (e.g. '1w_saturating').
# Newer polars releases saturate by default and deprecate the suffix, so it is omitted here.

# Week
pl_series.dt.offset_by('1w').dt.truncate('1w').dt.offset_by('-1d')

# Month - for the month I recommend month_end(), because the offset pattern does not give
# consistent results but .month_end() does
pl_series.dt.month_end()

# Quarter
pl_series.dt.offset_by('1q').dt.truncate('1q').dt.offset_by('-1d')

# Year
pl_series.dt.offset_by('1y').dt.truncate('1y').dt.offset_by('-1d')

# Still missing for the series pattern, as with the dataframes: ceilings for the time components
# (hour, minute, second) and any other `pandas` frequencies we have included, to ensure alignment.

Hopefully this helps jump start your effort quickly!

GTimothee · 2023-11-08T18:05:44Z

Will do get_frequency_summary

mdancho84 added the help wanted Extra attention is needed label Oct 3, 2023

rabadzhiyski added this to the v0.2.0 milestone Oct 4, 2023

mdancho84 assigned JustinKurland Oct 6, 2023

mdancho84 changed the title ~~Speed Improvement: polars and tidypolars backend~~ Speed Improvement: polars backends Oct 19, 2023

mdancho84 mentioned this issue Oct 24, 2023

Polars Backend Tests for pytimetk v0.2.0 #177

Open

JustinKurland assigned GTimothee Oct 26, 2023

GTimothee mentioned this issue Oct 29, 2023

Polars backend for Augment fourier #205

Merged

mdancho84 mentioned this issue Oct 31, 2023

Tracking Table: Which Functions Have Polars / Parallel Processing #114

Closed

mdancho84 pushed a commit that referenced this issue Nov 1, 2023

Merge pull request #77 from JustinKurland/JustinKurland-future-forwar…

3f6c744

…d-fill-internal Update future.py forward fill internal

GTimothee mentioned this issue Nov 2, 2023

Ts summary - polars backend #211

Merged

GTimothee mentioned this issue Nov 15, 2023

add polars backend to get_frequency_summary - first draft #271

Merged

JustinKurland commented Oct 15, 2023 • edited

Polars Backend Functions

Wrangling Pandas Time Series DataFrames

Anomaly Detection

Adding Features to Time Series DataFrames (Augmenting)

TS Features

Finance Module

Time Series for Pandas Series

Date Utilities

Extra Pandas Helpers

13 Datasets
