Column transformer for segmented data #9

Closed
wants to merge 3 commits into from

Conversation

qtux
Contributor

@qtux qtux commented Nov 28, 2018

Hi David,

I wrote a simple wrapper for using the sklearn ColumnTransformer on segmented data, which is useful when dealing with heterogeneous (multivariate) time series data.
I looked into supporting contextual data but did not find an easy way to make the current code work with the TS_Data class. Copying and adapting the whole ColumnTransformer code, instead of patching parts of it, might lead to a proper solution that supports both.
Nevertheless, I hope you find the SegmentedColumnTransformer useful.

Cheers,
Matthias

The main use case for this transformer is to enable the application of specified
groups of feature functions to specified columns of data, e.g. when dealing with
heterogeneous data.

The SegmentedColumnTransformer is derived from the sklearn ColumnTransformer
and adapted to be used inside a Pype object after a segment transformation.

The adaptation mainly consists of:
- adapting the notion of a column (ColumnTransformer iterates over the second
  dimension, whereas segmented data must be iterated over the third dimension)
- disabling the "drop" and "passthrough" transform options for simplicity and
  dropping non-specified columns by default

Note: SegmentedColumnTransformer does not support contextual data.
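
The idea of selecting variables along the third dimension can be illustrated with plain NumPy. This is only a minimal sketch and not the code from this pull request: the helper `mixed_features`, the `spec` format and the two feature functions are hypothetical names chosen for illustration, and contextual data is ignored.

```python
import numpy as np


def mixed_features(X, spec):
    """Apply groups of feature functions to selected variables (hypothetical helper).

    X    : array of shape (n_segments, segment_width, n_variables)
    spec : list of (functions, columns) pairs; each function maps an array of
           shape (n_segments, segment_width, k) to shape (n_segments, k)
    """
    X = np.atleast_3d(X)
    blocks = []
    for functions, columns in spec:
        selected = X[:, :, columns]  # variables live on the third axis
        for func in functions:
            blocks.append(func(selected))
    # variables that are not named in spec are simply dropped
    return np.column_stack(blocks)


def seg_mean(S):
    return S.mean(axis=1)  # average over the segment width


def seg_max(S):
    return S.max(axis=1)  # maximum over the segment width


# 4 segments of width 5 with 3 variables
X = np.arange(60, dtype=float).reshape(4, 5, 3)
spec = [([seg_mean], [0, 1]), ([seg_max], [2])]
features = mixed_features(X, spec)
print(features.shape)  # (4, 3)
```

Here the mean is computed only for variables 0 and 1 and the maximum only for variable 2, so each segment ends up with three features.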
@coveralls

coveralls commented Nov 28, 2018

Pull Request Test Coverage Report for Build 124

  • 33 of 33 (100.0%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 93.715%

Totals Coverage Status
Change from base Build 122: 0.2%
Covered Lines: 1178
Relevant Lines: 1257

💛 - Coveralls

@dmbee
Owner

dmbee commented Nov 28, 2018

Thank you Matthias for your work on this. If I understand your aim correctly - you want to have the API support specifying which time series variables each feature is computed for?

It is true that currently the API supports only feature representations where each feature is computed for all variables. I agree that this is a limitation of the current code.

I would be happy to include this capability, but it needs to work with TS_Data. Did you look at FeatureRep.transform? It should be pretty easy to merge the context data back with the feature data using np.column_stack. Also, maybe pick a better name, e.g. FeatureRepMix, since "columns" makes sense primarily in the context of 2D data. We should also implement f_labels so that the user can retrieve the mapping of features after the transform. Finally, it would be nice if the unit tests checked the computed values, not just the shape of the returned data.
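
As a rough sketch of the points above, assuming the features form a 2D array with one row per segment and the context data has one row per segment as well: the helpers `merge_features_with_context` and `feature_labels` are hypothetical, and the label format is an assumption made for illustration rather than the actual f_labels implementation.

```python
import numpy as np


def merge_features_with_context(fts, Xc=None):
    """Append context columns to the feature matrix (hypothetical helper)."""
    if Xc is None:
        return fts
    return np.column_stack([fts, Xc])


def feature_labels(func_names, columns):
    """One label per (feature function, variable) pair; the format is an assumption."""
    return [f"{name}_{col}" for name in func_names for col in columns]


# 2 segments of width 3 with 2 time series variables, plus 1 context variable
Xt = np.array([[[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]],
               [[4.0, 40.0], [5.0, 50.0], [6.0, 60.0]]])
Xc = np.array([[0.5], [1.5]])

fts = Xt.mean(axis=1)  # mean of each variable within each segment
merged = merge_features_with_context(fts, Xc)
labels = feature_labels(["mean"], [0, 1])

# check the computed values, not just the shape
expected = np.array([[2.0, 20.0, 0.5], [5.0, 50.0, 1.5]])
assert np.allclose(merged, expected)
```

The final assertion is the kind of check meant by testing the calculation: it compares against values worked out by hand instead of only comparing shapes.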

Let me know if you need any help with this. Once you are done, please submit the pull request on the dev branch so I can thoroughly test it before releasing it to master.

Thanks again
David

@dmbee dmbee closed this Nov 28, 2018
@qtux
Contributor Author

qtux commented Nov 29, 2018

Hi David,

> If I understand your aim correctly - you want to have the API support specifying which time series variables each feature is computed for?

Yes, that is correct. Furthermore I wanted to have the same functionality that ColumnTransformer offers: Parallel processing of transformers not only restricted to the FeatureRep (hence the naming).

I will look into whether I can integrate TS_Data into the SegmentedColumnTransformer or go with your proposal of implementing a FeatureRepMix that only applies different FeatureRep transforms to different time series variables. In the latter case, we could use the sklearn ColumnTransformer on the output of the FeatureRep transformers.
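
For the second option, the features are already a 2D array, so the stock sklearn ColumnTransformer applies directly. A minimal sketch, where the feature matrix and the choice of scalers are made up for illustration:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# stand-in for the 2D output of a feature representation:
# one row per segment, one column per feature (values are made up)
features = np.array([[2.0, 20.0, 0.5],
                     [5.0, 50.0, 1.5],
                     [8.0, 80.0, 2.5]])

ct = ColumnTransformer(
    transformers=[
        ("standardize", StandardScaler(), [0, 1]),  # first two feature columns
        ("rescale", MinMaxScaler(), [2]),           # third feature column
    ],
    remainder="drop",  # columns that are not listed are dropped
)
scaled = ct.fit_transform(features)
print(scaled.shape)  # (3, 3)
```

The output columns follow the order of the transformers list, so here the two standardized columns come first and the rescaled one last.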

Cheers,
Matthias

@dmbee
Owner

dmbee commented Nov 29, 2018

Thanks Matthias,

I think that would be a great addition to seglearn. I think you can use (inherit from) the sklearn ColumnTransformer to do the processing on the time series data, as you did in your previous pull request. I was just suggesting that you name the seglearn class something like FeatureRepMix to avoid confusion.

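The context data doesn't need to go to ColumnTransformer, so the implementation would look like the current FeatureRep:

Xt, Xc = get_ts_data_parts(X)

# apply each transformer to its own time series variables (third axis of Xt)
fts = Parallel(n_jobs=self.n_jobs)(
    delayed(func)(
        clone(trans) if not fitted else trans, np.atleast_3d(Xt)[:, :, column], y, weight
    ) for _, trans, column, weight in self._iter(fitted=fitted, replace_strings=False)
)

# Parallel returns a list with one result per transformer; stack the results
# column-wise (assuming func returns only the transformed feature array)
fts = np.column_stack(fts)

# merge the context data back in, if there is any
if Xc is not None:
    fts = np.column_stack([fts, Xc])

return fts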

good luck

@qtux qtux mentioned this pull request Dec 4, 2018