
Passing data to temporal_split and other functions #45

Closed
adalseno opened this issue May 26, 2020 · 7 comments

adalseno commented May 26, 2020

Hi, I was following your example code (simple regression), but I'm stuck. I have a DataFrame of shape (1017, 15). The last column is the target, so I created two DataFrames, one for X (1017, 14) and one for y (1017). I tried to pass those values to temporal_split, but I always get an error no matter what I do (passing the DataFrames, passing them as lists). For example, passing them as lists gives:

KeyError: "None of [Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
            ...
            991,  992,  993,  994,  995,  996,  997,  998,  999, 1000],
           dtype='int64', length=1001)] are in the [columns]"

If, on the other hand, I pass them as df I get:

AttributeError: 'DataFrame' object has no attribute 'ts_data'

The same holds true if I manually split the DataFrames and pass them to seg.fit_transform(X_train, y_train).
I tried putting the date column in the DataFrame as well as in the index, but the error is still there.
What am I doing wrong?

Info of the Dataframe:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1017 entries, 896 to 1912
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date          1017 non-null   datetime64[ns]
 1   id            1017 non-null   object        
 2   price         1017 non-null   float64       
 3   month         1017 non-null   int64         
 4   year          1017 non-null   int64         
 5   event_name_1  1017 non-null   int64         
 6   event_type_1  1017 non-null   int64         
 7   event_name_2  1017 non-null   int64         
 8   event_type_2  1017 non-null   int64         
 9   snap_CA       1017 non-null   int64         
 10  dow           1017 non-null   int64         
 11  is_weekend    1017 non-null   int64         
 12  is_holiday    1017 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int64(10), object(1)
memory usage: 111.2+ KB

I tried passing X with the date as a column, as the index, and as a list. The same for y: I tried it as a Series, as a DataFrame (with the date as a column or as the index), and as a list, both with and without the date column. As you can see, there are no NaN values.


dmbee commented May 26, 2020

The length of the dataframe or array-like object must correspond to the number of time series in the data set (not the number of samples in a single time series). So if you are working with a single time series, you can just put [ ] around it to make it a length-1 list.

Most data sets have many time series, and this is the reason for the convention. These details are explained in the user guide:

https://dmbee.github.io/seglearn/user_guide.html

Let me know if this fixes your problem.

D


dmbee commented May 26, 2020

e.g. a typical X_train with three time series would contain arrays shaped like this: [(100, 5), (150, 5), (200, 5)]

an X_train with one time series would look like this: [(100, 5)]

I usually use lists or numpy object arrays. If using pandas, again, you'll want each sample (row) to correspond to a whole time series.
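The convention above can be sketched in plain numpy. This is an illustration, not seglearn's own code: the manual split below mimics what seglearn.split.temporal_split does for a single series (cutting along the time axis), using a hypothetical test_size value.

```python
import numpy as np

# seglearn expects X to be a list (or numpy object array) of time series,
# one entry per series -- not a single 2D array of samples.
# Three series, each with 5 channels:
X_multi = [np.zeros((100, 5)), np.zeros((150, 5)), np.zeros((200, 5))]

# A single series must still be wrapped in a length-1 list:
X_single = [np.zeros((100, 5))]
y_single = [np.zeros(100)]

# A temporal split on one series cuts along the time axis
# (a hand-rolled sketch of the idea behind temporal_split):
test_size = 0.25
n = len(X_single[0])
cut = int(n * (1 - test_size))
X_train = [X_single[0][:cut]]
X_test = [X_single[0][cut:]]

print(len(X_multi), X_train[0].shape, X_test[0].shape)
# -> 3 (75, 5) (25, 5)
```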

@adalseno
Author

Thank you very much for your kind and prompt reply.
Doing:
X_train, X_test, y_train, y_test = temporal_split([x_train.values], [y_train.values], test_size=0.02)
did the trick (simply using [x_train] instead does not work).
The pipe went fine and fit too, but now I get a new error. If I try:
score = pipe.score(X_test, y_test)
I get:

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
----> 1 score = pipe.score(X_test, y_test)

~/opt/anaconda3/envs/joseml/lib/python3.7/site-packages/seglearn/pipe.py in score(self, X, y, sample_weight)
279 """
280
--> 281 Xt, yt, swt = self._transform(X, y, sample_weight)
282
283 self.N_test = len(yt)

~/opt/anaconda3/envs/joseml/lib/python3.7/site-packages/seglearn/pipe.py in _transform(self, X, y, sample_weight)
139 Xt, yt, swt = transformer.transform(Xt, yt, swt)
140 else:
--> 141 Xt = transformer.transform(Xt)
142
143 return Xt, yt, swt

~/opt/anaconda3/envs/joseml/lib/python3.7/site-packages/seglearn/transform.py in transform(self, X)
1072 self._check_if_fitted()
1073 Xt, Xc = get_ts_data_parts(X)
-> 1074 check_array(Xt, dtype='numeric', ensure_2d=False, allow_nd=True)
1075
   1076         fts = np.column_stack([self.features[f](Xt) for f in self.features])

~/opt/anaconda3/envs/joseml/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
584 " minimum of %d is required%s."
585 % (n_samples, array.shape, ensure_min_samples,
--> 586 context))
587
588 if ensure_min_features > 0 and array.ndim == 2:

ValueError: Found array with 0 sample(s) (shape=(0, 200, 11)) while a minimum of 1 is required.

Pipe is:

pipe = Pype([('seg', Segment(width=200, overlap=0.5, y_func=last)),
             ('features', FeatureRep()),
             ('lin', LinearRegression())])

as in your example and X_test and y_test come from temporal_split and obviously they are not empty!
For example y_test is:
[array([0., 4., 2., 3., 0., 1., 2., 0., 0., 0., 1., 1., 3., 0., 1., 1., 1., 3., 0., 1., 1.])]


dmbee commented May 27, 2020

I assume your X_test, y_test are too small to segment with width 200. I should probably add a check in the transformer for that. I'll put it on the todo list.
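The arithmetic behind this is easy to check. The helper below is a sketch, not seglearn's code: it assumes the sliding-window step is width * (1 - overlap), which explains why a 2% test split of 1017 samples (about 21 points) yields zero windows of width 200 and the "Found array with 0 sample(s)" error.

```python
# Rough segment-count arithmetic for a sliding window with overlap
# (assumed step formula: step = int(width * (1 - overlap))).
def n_segments(n_samples, width, overlap):
    step = max(1, int(width * (1 - overlap)))
    if n_samples < width:
        return 0  # series shorter than one window -> no segments at all
    return (n_samples - width) // step + 1

# test_size=0.02 on 1017 samples leaves ~21 test points, far below width=200:
print(n_segments(21, width=200, overlap=0.5))   # -> 0
# the ~996-sample training split still segments fine:
print(n_segments(996, width=200, overlap=0.5))  # -> 8
```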


dmbee commented May 27, 2020

It may be that this package is not right for your application, though, based on the data you are looking at. Generally you should have hundreds or thousands of segments to train and test on. It's hard to know without more detail, but you may want to look at methods like ARIMA if you only have one series.

@dmbee dmbee self-assigned this May 27, 2020
@adalseno
Author

Thanks again. Actually I have several thousand series to analyse, all with the same characteristics, so I can easily melt them into a single DataFrame even though they are mostly independent. Since they are independent and all different, ARIMA would require calculating p and q for each one, which would take too much time; moreover, since they have seasonality the model would be a SARIMA, with even more calculations. Before training on the whole dataset, though, I wanted to test it with just one series. I reduced the segment size and now I don't get any more errors, but the prediction quality, at least for this one, is not exciting. I will test it on a bunch of series, but in the meantime, what can I do to improve?


dmbee commented May 27, 2020

I wouldn't expect good results with one series. Sliding-window segmentation doesn't make sense for every problem. It's great for things like earthquakes and activity recognition, where there is little or no time dependency outside the window. Generally, you need to make sure the window length is long enough to capture enough of the dynamics for a sensible prediction. It's important to interpolate the samples (if they are not regularly sampled) to a fixed sampling rate, so that the window time is constant. Setting a high overlap is a good data augmentation strategy. Concatenating any available heuristics (e.g. season) to the calculated features is also very helpful.
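The interpolation step mentioned above can be sketched with pandas. The dates and values here are made up for illustration: an irregularly sampled series is resampled to a fixed daily rate and the gaps filled linearly, so every window of fixed width spans the same amount of time.

```python
import pandas as pd

# Hypothetical irregularly sampled series (gaps of 1, 3, and 4 days):
idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05", "2020-01-09"])
s = pd.Series([1.0, 2.0, 5.0, 9.0], index=idx)

# Resample to a fixed daily rate, then fill the introduced NaNs linearly:
regular = s.resample("D").mean().interpolate(method="linear")
print(len(regular))  # -> 9 daily points from Jan 1 to Jan 9
```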

Just a few thoughts. Good luck.

@dmbee dmbee closed this as completed May 27, 2020