
Tabularize dtype draft #316

Closed
wants to merge 2 commits

Conversation

hamzahiqb
Contributor

Reference Issues/PRs

Initial draft for #311

What does this implement/fix? Explain your changes.

Saves each column label as a (column name, time index) tuple instead of the current string-based implementation.

Does your contribution introduce a new dependency? If yes, which one?

Yes. I currently need to ensure that it does not break the existing implementation. There are other secondary issues to consider.

PR checklist for new estimators

Any other comments?

@hamzahiqb
Contributor Author

hamzahiqb commented Jun 22, 2020

@mloning This is a simplified change. There are some behaviours I noticed and wanted your opinion on:

Starting with a nested df:

import numpy as np

from sktime.utils.data_container import tabularize, from_nested_to_long, detabularize
from sktime.utils._testing import generate_df_from_array

n_obs_X = 20
n_cols_X = 3
X = generate_df_from_array(np.random.normal(size=n_obs_X), n_rows=10, n_cols=n_cols_X)
X.head()  # <- returns 3 columns with nested series

# Tabularize (new implementation)
Xt = tabularize(X)
Xt.head()  # <- each column label is now a tuple of (column name, time index)
('col0', 0) ('col0', 1) ('col0', 2) ('col0', 3) ('col0', 4) ...
0 2.91718 0.341364 0.663797 0.350589 0.0169558
1 2.91718 0.341364 0.663797 0.350589 0.0169558
2 2.91718 0.341364 0.663797 0.350589 0.0169558
3 2.91718 0.341364 0.663797 0.350589 0.0169558
...

Remarks:

  1. Is this an acceptable change?
  2. I still need to implement the tests and make sure all dependencies handle the new column change.

Further remarks:
When I run detabularize, I expect X back but get a single column:

Xd = detabularize(Xt)
Xd.head() # <- single column where each element is a series of series. Behaviour is the same in the old implementation of tabularize

tabularize(detabularize(tabularize(X))) # <- The time index for 3 columns is merged (we lose information about the time index)

A few questions I have about this behavior:

  1. Are we ok with losing information about column names and time_index name?
  2. I see that we currently accept duplicated column names, e.g. test_tabularize makes a df with duplicate column names. Is this a supported feature? Would this not fail any of the models?
  3. By extension, do we accept duplicates over any of the indexes (time, column name or observation index)?
  4. Do we require the time index to be sorted?

@hamzahiqb
Contributor Author

hamzahiqb commented Jun 22, 2020

I tried one implementation with a MultiIndex setup for tabularize that would return consistent transformations (more extensive testing than visual checks is needed):

import numpy as np
import pandas as pd

def explode_series(s):
    s = s if isinstance(s, pd.Series) else pd.Series(s)
    df = s.explode()
    # Use the cell's own index as the time index, falling back to a positional range.
    idx = lambda x: x.index if hasattr(x, "index") else np.arange(x.shape[0])
    time = s.apply(idx).explode()
    df_idx = pd.DataFrame([df.index, time]).T
    df.index = pd.MultiIndex.from_frame(df_idx)
    return df

# Need to add return_array
def tabularize(X):
    Xt = (
        X
        .apply(explode_series)
        .unstack(-1)
    )
    Xt.columns.names = ["column", "time_index"]
    Xt.index.name = "index"
    return Xt

def detabularize(Xt):
    def _detabularize_series(s):
        df = pd.Series(s.values, index=s.index.values)
        return pd.Series([df,], name=s.name)

    def _detabularize_group(df):
        temp = (
            df.droplevel("column")
            .apply(_detabularize_series).T
        )
        return temp

    Xd = (
        Xt.T
        .groupby("column")
        .apply(lambda x: _detabularize_group(x))
        .unstack(0)
        .droplevel(level=0, axis=1)
    )
    Xd.index.name = None
    Xd.columns.name = None
    return Xd

Test that everything is the same:

_assert_almost_equal(X, detabularize(tabularize(X))) # passes

Note: This implementation will (potentially) allow consistency between transformations ([de]tabularize - nested_to_long - long_to_nested) but it requires all indexes to be unique.

@mloning
Contributor

mloning commented Jun 23, 2020

Thanks @hiqbal2 for your work on this!

  • We're currently losing information about the time index and column names, no? So I'm happy about any improvement where we manage to keep more information through transformations.
  • Does using tuples as column names break anything, e.g. df.rename(columns={...})?
  • I've not considered multi-indexes before, I always find them a bit unwieldy, but this seems like the perfect use case for them. What's your preference among:
    • (<column name>, <time point>) tuples,
    • <column name>__<time point> double-underscore convention, and
    • multi-index?
  • pandas seems to allow non-unique column names, so I think we should simply accept them too
  • We use tabularize and detabularize frequently throughout sktime, so the efficiency of the implementation is also an important consideration; we should compare some timings before merging major changes
  • Another complication: ideally tabularize should work with data frames that have both time series columns (series in cells) and regular columns (primitives in cells) (i.e. mixed panel data, or panel data with time constant variables)
  • Another request: all tests should be added to test_data_container.py and existing tests moved there from test_utils_forecasting.py
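On the rename question: pandas does accept tuples as flat column labels, and `df.rename` then works as long as the mapping keys are the same tuples. A minimal check (the labels here are made up for illustration):

```python
import pandas as pd

# Flat (non-MultiIndex) columns whose labels are tuples; tupleize_cols=False
# stops pandas from promoting the tuples to a MultiIndex.
cols = pd.Index([("col0", 0), ("col0", 1)], tupleize_cols=False)
df = pd.DataFrame([[1.0, 2.0]], columns=cols)

# Renaming works, but the mapping keys must be the full tuples:
renamed = df.rename(columns={("col0", 0): ("col0", "t0")})
print(list(renamed.columns))  # [('col0', 't0'), ('col0', 1)]
```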

Thanks again @hiqbal2 I really appreciate your work on this! 👍

@hamzahiqb
Contributor Author

hamzahiqb commented Jun 23, 2020

@mloning, will try to answer some of the questions you raised.

  1. We're currently losing information about the time index and column names, no?
     Yes. Duplicate column names seem to be one reason.

  2. Does using tuples as column names break anything, e.g. df.rename(columns={...})?
    Not with core pandas. The renaming would require passing tuples.

  3. I've not considered multi-indexes before, I always find them a bit unwieldy, but it seems like the perfect use case for them, what's your preference over:

    1. (<column name>, <time point>) tuples:
       1. Acceptable,
       2. but can be hard to work with.
       3. Always requires unpacking - this would just be a speed slowdown.
    2. <column name>__<time point> double-underscore convention:
       1. Personally don't prefer this, because
       2. it always needs unpacking (speed),
       3. we can lose time index type information, and
       4. how do we convert datetime, date or weird indexes (like 'a__', 'b__' for a niche case)?
    3. multi-index:
       1. Recently started preferring this, because
       2. it's fast to filter and use, and
       3. you don't touch the underlying data types; they just pass around.
       4. Downside: you need to learn some multi-index slicing commands.

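To illustrate the multi-index option, a sketch with made-up data, assuming the ("column", "time_index") column levels from the draft above:

```python
import numpy as np
import pandas as pd

# Tabular frame with a ("column", "time_index") column MultiIndex.
cols = pd.MultiIndex.from_product(
    [["col0", "col1"], range(3)], names=["column", "time_index"]
)
Xt = pd.DataFrame(np.arange(12).reshape(2, 6), columns=cols)

# Selecting one original column keeps the time index (and its dtype) intact:
Xt["col0"]  # 2x3 frame whose columns are the time points 0..2

# Cross-sections across the time level need .xs (the slicing to learn):
Xt.xs(1, axis=1, level="time_index")  # value at time point 1 for each column
```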
Given the multiple ways the data can be transformed, I would go for a method that does not transform data types too much.

  1. pandas seems to allow non-unique column names, so I think we should simply accept them too

Does this mean that our TS models accept them too? As a simple example, if I run a simple regression model with duplicate column names, wouldn't it be confusing to return betas with a common name but different values? Shouldn't we raise a warning or error?

Similarly, how do we go from X to long format to X again if we have duplicate column names?

One potential method could be to 1) raise a warning, and 2) rename with distinguishing names: (col0, col0) -> (col0_0, col0_1).
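That renaming step could be as simple as the following sketch (`dedup_columns` is a hypothetical helper, not existing sktime code):

```python
import warnings

import pandas as pd

def dedup_columns(X: pd.DataFrame) -> pd.DataFrame:
    """Rename duplicated column names, e.g. (col0, col0) -> (col0_0, col0_1)."""
    duplicated = set(X.columns[X.columns.duplicated(keep=False)])
    counts = {}
    new_cols = []
    for col in X.columns:
        if col in duplicated:
            # Append a running suffix per duplicated name.
            k = counts.get(col, 0)
            new_cols.append(f"{col}_{k}")
            counts[col] = k + 1
        else:
            new_cols.append(col)
    if duplicated:
        warnings.warn(f"Duplicate column names were renamed: {sorted(duplicated)}")
    X = X.copy()
    X.columns = new_cols
    return X
```

For example, `dedup_columns(pd.DataFrame([[1, 2, 3]], columns=["col0", "col0", "col1"]))` returns a frame with columns `col0_0, col0_1, col1`.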

  1. We use tabularize and detabularize frequently throughout sktime, so efficiency of the implementation is also an important consideration, we should compare some timings before merging major changes

Definitely. I can benchmark the computations.

I also need to provide an ability to return numpy arrays too.
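For the timings, something along these lines should be enough to compare candidate implementations; `tabularize_stack` below is a naive stand-in baseline, not the sktime function, and it also sketches the `return_array` option:

```python
import timeit

import numpy as np
import pandas as pd

# Toy stand-in for the nested format: one column whose cells hold pd.Series.
X = pd.DataFrame(
    {"col0": [pd.Series(np.random.normal(size=20)) for _ in range(100)]}
)

def tabularize_stack(X, return_array=False):
    # Naive baseline: vertically stack each cell's values into a 2D block.
    values = np.vstack([s.to_numpy() for s in X["col0"]])
    return values if return_array else pd.DataFrame(values, index=X.index)

n = 100
t = timeit.timeit(lambda: tabularize_stack(X), number=n)
print(f"{t / n * 1e3:.3f} ms per call")
```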

  1. Another complication: ideally tabularize should work with data frames that have both time series columns (series in cells) and regular columns (primitives in cells) (i.e. mixed panel data, or panel data with time constant variables)

It should be possible to add in checks for dtypes of cells.

However, since I'm new: is the tabular format mostly used within the sktime API? From how I understand it, a user will likely provide data in a normal pandas DataFrame with a time column and a group/stock id/observation column.

Another request: all tests should be added to test_data_container.py and existing tests moved there from test_utils_forecasting.py

Looks like I will be finally getting a crash course in testing :)

Some general points/questions I had in mind:

  1. If you think we should move to building in the consistency, it could be a relatively large PR. It might be good to write down the requirements and expectations for the transformations, especially since I'm new to the codebase.
  2. A few tests that exist/can be added:
    1. Index, column and cell dtype consistency
    2. For nested data, consistency of index dtypes across all cells
    3. Also, exact length of nested series/array
    4. Values match
    5. No Duplicate indexes?
    6. Time indexes are sorted?
    7. Only numeric dtypes for values?
  3. If the data conversions are for internal use only, do we need to have numpy implementations?
  4. Probably the biggest issue I find is allowing duplicate column names without raising any warnings or errors. It directly causes inconsistencies in how the data transformations work.
  5. I'm wondering if the data transformations should be provided as a method instead of a function? This is because the transformation functions expect a pretty rigid data structure (e.g. long_to_nested requires time and column). We can possibly ensure this structure is available if we can provide it as a method of Estimator or Data classes.
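The round-trip checks in point 2 could start from a cell-by-cell comparison along these lines (`assert_nested_equal` is a hypothetical sketch, not existing sktime test code):

```python
import pandas as pd

def assert_nested_equal(X, Y):
    """Cell-by-cell round-trip check for two nested DataFrames."""
    assert X.shape == Y.shape
    assert list(X.columns) == list(Y.columns)
    for col in X.columns:
        for a, b in zip(X[col], Y[col]):
            # Values, exact length, and index dtype must all survive the round trip.
            pd.testing.assert_series_equal(
                pd.Series(a), pd.Series(b), check_names=False
            )
```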

Completely off-topic but:

  1. Have you looked into xarray? I have never used it, but given that sktime is working in 3D space, it might be easier to handle the differing cases (mixed panel vs panel) while maintaining labels. Disclaimer: I've never worked with xarray, so I am only guessing here. However, given the large community and use cases, it could be interesting to look into.
     The reason I bring it up is that nesting/un-nesting is in general an expensive operation. It could be interesting to find ways to avoid having to do the transformations.

@mloning
Contributor

mloning commented Jun 24, 2020

Hi @hiqbal2 thanks for your comments.

We've had a more in-depth discussion about our choice of data container, for more details see our wiki entry and speak to @prockenschaub who prototyped an extension of pandas for our use case. Another idea we're currently considering is adding support for 3d numpy arrays in addition to the nested pandas DataFrame.

  • I'm happy to go ahead with multi-index and see if that breaks anything of the existing code base.
  • Enforcing unique column names inside the tabularize function is also okay for me (we could even consider enforcing this inside check_X but that should be on a separate PR).

I'd suggest starting with these two changes and then seeing what else we may have to change.

@mloning mloning closed this Mar 23, 2021
@mloning
Contributor

mloning commented Mar 23, 2021

@hamzahiqb please feel free to reopen if you start working on this again!
