
Passing data to temporal_split and other functions #45

Closed
adalseno opened this issue May 26, 2020 · 7 comments

adalseno commented May 26, 2020

Hi, I was following your example code (simple regression), but I'm stuck. I have a DataFrame of shape (1017, 15). The last column is the target, so I created two DataFrames, one for X (1017, 14) and one for y (1017). I tried to pass those values to temporal_split, but I always get an error no matter what I do (passing the DataFrames, passing them as lists). For example, passing them as lists gives:

KeyError: "None of [Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
            ...
            991,  992,  993,  994,  995,  996,  997,  998,  999, 1000],
           dtype='int64', length=1001)] are in the [columns]"

If, on the other hand, I pass them as df I get:

AttributeError: 'DataFrame' object has no attribute 'ts_data'

The same holds true if I manually split the DataFrames and pass them to seg.fit_transform(X_train, y_train).
I tried putting the date column in the DataFrame as well as in the index, but the error is still there.
What am I doing wrong?

Info of the Dataframe:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1017 entries, 896 to 1912
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date          1017 non-null   datetime64[ns]
 1   id            1017 non-null   object        
 2   price         1017 non-null   float64       
 3   month         1017 non-null   int64         
 4   year          1017 non-null   int64         
 5   event_name_1  1017 non-null   int64         
 6   event_type_1  1017 non-null   int64         
 7   event_name_2  1017 non-null   int64         
 8   event_type_2  1017 non-null   int64         
 9   snap_CA       1017 non-null   int64         
 10  dow           1017 non-null   int64         
 11  is_weekend    1017 non-null   int64         
 12  is_holiday    1017 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int64(10), object(1)
memory usage: 111.2+ KB

I tried passing X with the date as a column, as the index, and as a list. The same for y: I tried it as a Series, as a DataFrame (with the date as a column or as the index), and as a list, both with and without the date column. As you can see, there are no NaN values.


dmbee commented May 26, 2020

The length of the dataframe or array-like object must correspond to the number of time series in the data set (not the number of samples in a single time series). So if you are working with a single time series, you can just put [ ] around it to make it a length-1 list.

Most data sets have many time series, and this is the reason for the convention. These details are explained in the user guide:

https://dmbee.github.io/seglearn/user_guide.html

Let me know if this fixes your problem.

D


dmbee commented May 26, 2020

e.g. a typical X_train with three time series would contain arrays shaped like this: [(100, 5), (150, 5), (200, 5)]

an X_train with one time series would look like this: [(100, 5)]

I usually use lists or numpy object arrays. If using pandas, again, you'll want each sample (row) to correspond to a whole time series.
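The convention above can be sketched in plain numpy. This is an illustration, not seglearn's own code: the manual split below mimics what seglearn.split.temporal_split does for a single series (cutting along the time axis), using a hypothetical test_size value.

```python
import numpy as np

# seglearn expects X to be a list (or numpy object array) of time series,
# one entry per series -- not a single 2D array of samples.
# Three series, each with 5 channels:
X_multi = [np.zeros((100, 5)), np.zeros((150, 5)), np.zeros((200, 5))]

# A single series must still be wrapped in a length-1 list:
X_single = [np.zeros((100, 5))]
y_single = [np.zeros(100)]

# A temporal split on one series cuts along the time axis
# (a hand-rolled sketch of the idea behind temporal_split):
test_size = 0.25
n = len(X_single[0])
cut = int(n * (1 - test_size))
X_train = [X_single[0][:cut]]
X_test = [X_single[0][cut:]]

print(len(X_multi), X_train[0].shape, X_test[0].shape)
# -> 3 (75, 5) (25, 5)
```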

@adalseno
Author

Thank you very much for your kind and prompt reply.
Doing:
X_train, X_test, y_train, y_test = temporal_split([x_train.values], [y_train.values], test_size=0.02)
did the trick (simply using [x_train] instead does not work).
The pipe went fine and fit too, but now I get a new error. If I try:
score = pipe.score(X_test, y_test)
I get:

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
----> 1 score = pipe.score(X_test, y_test)

~/opt/anaconda3/envs/joseml/lib/python3.7/site-packages/seglearn/pipe.py in score(self, X, y, sample_weight)
279 """
280
--> 281 Xt, yt, swt = self._transform(X, y, sample_weight)
282
283 self.N_test = len(yt)

~/opt/anaconda3/envs/joseml/lib/python3.7/site-packages/seglearn/pipe.py in _transform(self, X, y, sample_weight)
139 Xt, yt, swt = transformer.transform(Xt, yt, swt)
140 else:
--> 141 Xt = transformer.transform(Xt)
142
143 return Xt, yt, swt

~/opt/anaconda3/envs/joseml/lib/python3.7/site-packages/seglearn/transform.py in transform(self, X)
1072 self._check_if_fitted()
1073 Xt, Xc = get_ts_data_parts(X)
-> 1074 check_array(Xt, dtype='numeric', ensure_2d=False, allow_nd=True)
1075
   1076         fts = np.column_stack([self.features[f](Xt) for f in self.features])

~/opt/anaconda3/envs/joseml/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
584 " minimum of %d is required%s."
585 % (n_samples, array.shape, ensure_min_samples,
--> 586 context))
587
588 if ensure_min_features > 0 and array.ndim == 2:

ValueError: Found array with 0 sample(s) (shape=(0, 200, 11)) while a minimum of 1 is required.

Pipe is:

pipe = Pype([('seg', Segment(width=200, overlap=0.5, y_func=last)),
             ('features', FeatureRep()),
             ('lin', LinearRegression())])

as in your example and X_test and y_test come from temporal_split and obviously they are not empty!
For example y_test is:
[array([0., 4., 2., 3., 0., 1., 2., 0., 0., 0., 1., 1., 3., 0., 1., 1., 1., 3., 0., 1., 1.])]


dmbee commented May 27, 2020

I assume your X_test, y_test are too small to segment with width 200. I should probably add a check in the transformer for that. I'll put it on the todo list.
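The arithmetic behind this is easy to check. The helper below is a sketch, not seglearn's code: it assumes the sliding-window step is width * (1 - overlap), which explains why a 2% test split of 1017 samples (about 21 points) yields zero windows of width 200 and the "Found array with 0 sample(s)" error.

```python
# Rough segment-count arithmetic for a sliding window with overlap
# (assumed step formula: step = int(width * (1 - overlap))).
def n_segments(n_samples, width, overlap):
    step = max(1, int(width * (1 - overlap)))
    if n_samples < width:
        return 0  # series shorter than one window -> no segments at all
    return (n_samples - width) // step + 1

# test_size=0.02 on 1017 samples leaves ~21 test points, far below width=200:
print(n_segments(21, width=200, overlap=0.5))   # -> 0
# the ~996-sample training split still segments fine:
print(n_segments(996, width=200, overlap=0.5))  # -> 8
```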


dmbee commented May 27, 2020

It may be that this package is not right for your application, though, based on the data you are looking at. Generally you should have hundreds or thousands of segments to train and test on. It's hard to know without more detail, but you may want to look at methods like ARIMA if you only have one series.

@dmbee dmbee self-assigned this May 27, 2020
@adalseno
Author

Thanks again. Actually I have several thousand series to analyse, all with the same characteristics, so I can easily melt them into a single DataFrame even though they are mostly independent. Since they are independent and all different, ARIMA would require calculating p and q for each one, which would take too much time; moreover, since they have seasonality the model would be a SARIMA, with even more calculations. Before training on the whole dataset, though, I wanted to test it with just one series. I reduced the segment size and now I don't get any more errors, but the prediction quality, at least for this one, is not exciting. I will test it on a bunch of series, but in the meantime, what can I do to improve?


dmbee commented May 27, 2020

I wouldn't expect good results with one series. Sliding-window segmentation doesn't make sense for every problem. It's great for things like earthquakes and activity recognition, where there is little or no time dependency outside the window. Generally, you need to make sure the window length is long enough to capture enough of the dynamics for a sensible prediction. It's important to interpolate the samples (if they are not regularly sampled) to a fixed sampling rate, so that the window time is constant. Setting a high overlap is a good data augmentation strategy. Concatenating any available heuristics (e.g. season) to the calculated features is also very helpful.
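The interpolation step mentioned above can be sketched with pandas. The dates and values here are made up for illustration: an irregularly sampled series is resampled to a fixed daily rate and the gaps filled linearly, so every window of fixed width spans the same amount of time.

```python
import pandas as pd

# Hypothetical irregularly sampled series (gaps of 1, 3, and 4 days):
idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05", "2020-01-09"])
s = pd.Series([1.0, 2.0, 5.0, 9.0], index=idx)

# Resample to a fixed daily rate, then fill the introduced NaNs linearly:
regular = s.resample("D").mean().interpolate(method="linear")
print(len(regular))  # -> 9 daily points from Jan 1 to Jan 9
```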

Just a few thoughts. Good luck.

@dmbee dmbee closed this as completed May 27, 2020