Does it support Dataframe as input? #100

Closed
26345211 opened this issue Aug 26, 2018 · 10 comments

Comments

@26345211

The estimator I am trying to fit accepts a pandas DataFrame as input to its fit method and uses the column labels. However, when I use the SuperLearner, the data is converted to a numpy.ndarray before it is passed to the estimator's fit method. Is there a way to preserve the column labels?

@flennerhag
Owner

Yes, if you set array_check=0 the ensemble will not perform any checks on the input, so the DataFrame will be passed to the estimator. As long as the DataFrame supports array indexing (e.g. X[start:stop]), it should work fine.

One caveat is that you can't use backend='multiprocessing', since this requires memmapping a NumPy ndarray.

In the next release array_checks will be removed so this issue will not arise.

@26345211
Author

Thanks for your help, but an error still occurs with array_check=0; the code still tries to cast the input to an ndarray:
...........................................................................
/Users/anaconda3/lib/python3.6/site-packages/mlens/parallel/_base_functions.py in slice_array(x= Age Race Rating Publ...1667 8 37694

[3000 rows x 6 columns], y= Res_Final_Position
0 ...2999 0

[3000 rows x 1 columns], idx=None, r=0)
170 x = x[slice(idx[0] - r, idx[1] - r)]
171 y = y[slice(idx[0] - r, idx[1] - r)] if y is not None else y
172
173 # Cast as ndarray to avoid passing memmaps to estimators
174 if y is not None:
--> 175 y = y.view(type=np.ndarray)
y = Res_Final_Position
0 ...2999 0

[3000 rows x 1 columns]
y.view = undefined
176 if not issparse(x):
177 x = x.view(type=np.ndarray)
178
179 return x, y

...........................................................................
/Users/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self= Final
0 ...2999 0

[3000 rows x 1 columns], name='view')
4371 name in self._accessors):
4372 return object.__getattribute__(self, name)
4373 else:
4374 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4375 return self[name]
-> 4376 return object.__getattribute__(self, name)
self = Res_Final_Position
0 ...2999 0

[3000 rows x 1 columns]
name = 'view'
4377
4378 def __setattr__(self, name, value):
4379 """After regular attribute access, try setting the name
4380 This allows simpler access to columns for interactive use.

AttributeError: 'DataFrame' object has no attribute 'view'
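The error above can be reproduced in isolation. The following is a minimal sketch, using a made-up three-row frame that only borrows the column name from the traceback: a DataFrame has no `view` method, and converting it with `to_numpy()` works but discards the column labels the estimator needs.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the y frame shown in the traceback.
y = pd.DataFrame({"Res_Final_Position": [0, 1, 0]})

# ndarray (and memmap) objects have .view; DataFrame does not,
# which is why y.view(type=np.ndarray) raises AttributeError.
has_view = hasattr(y, "view")

# Converting explicitly works, but the column labels are lost.
y_arr = y.to_numpy()
print(has_view, type(y_arr).__name__, y_arr.shape)
```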


@flennerhag
Owner

Aha! The offending line is a legacy from v0.1 when we only did multiproc. I've pushed a mmap branch that solves this by only casting to ndarray if the data is a mmap. I ran a simple test and a DataFrame as input passes through the ensemble now.

This works as long as 'threading' is used as the backend. For multiprocessing, things are thornier since we use memmapping. A fix for that may take a while, if it happens at all, but I'm guessing you're OK with 'threading'?

Btw, to install the necessary update, do

pip uninstall mlens;
git clone https://github.com/flennerhag/mlens; cd mlens;
git fetch; git checkout mmap;
pip install .

This will uninstall the version you currently have and install the mmap branch of the bleeding edge build (to be 0.2.4). Let me know if you run into any more problems!
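The fix described in this comment (cast to ndarray only when the data is a memmap) can be sketched roughly as follows. This is an illustration of the idea, not the code from the mmap branch; the helper name `cast_if_memmap` and the sample data are assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def cast_if_memmap(arr):
    """Return a plain ndarray view for memmaps; pass anything else through."""
    if isinstance(arr, np.memmap):
        return arr.view(type=np.ndarray)
    return arr


# A DataFrame is returned untouched, so its column labels survive.
df = pd.DataFrame({"Age": [3, 4], "Rating": [80, 95]})
df_out = cast_if_memmap(df)

# A memmap is exposed to the estimator as a plain ndarray.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "x.dat")
    mm = np.memmap(path, dtype="float64", mode="w+", shape=(2, 2))
    mm_out_type = type(cast_if_memmap(mm))
    del mm  # release the file before the directory is removed

print(df_out is df, mm_out_type.__name__)
```

With a guard like this, DataFrame inputs skip the `.view` call that raised the AttributeError, while memmapped arrays are still converted.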

@flennerhag
Owner

By the way, in the bleeding edge version array_checks has been deprecated, so you can ignore it.

@26345211
Author

Thanks for the help. When I tried passing in a DataFrame, it ran into an index error. I used the simple example from the getting started page and only wrapped the data with pd.DataFrame(X) and pd.DataFrame(y) to change the type. Indexing that DataFrame in the [start:stop] format works, but the ensemble fails here:
/Users/anaconda3/lib/python3.6/site-packages/mlens/parallel/_base_functions.py in slice_array(x= 0 1 2 3
0 6.8 3.2 5.9 2.3
...1
149 5.2 2.7 3.9 1.4

[150 rows x 4 columns], y= 0
0 2
1 2
2 2
3 1
4 1
5 2...6 0
147 2
148 0
149 1

[150 rows x 1 columns], idx=array([ 75, 76, 77, 78, 79, 80, 81, 82, ...40, 141, 142, 143, 144, 145, 146, 147, 148, 149]), r=0)
151 if len(idx[0]) > 1:
152 # Advanced indexing is required. This will trigger a copy
153 # of the slice in question to be made
154 simple_slice = False
155 idx = np.hstack([np.arange(t0 - r, t1 - r) for t0, t1 in idx])
--> 156 x = x[idx]

@flennerhag
Owner

Sorry for the delay; the issue has been fixed. It turns out that plain indexing with an index array (X[idx]) on a DataFrame selects columns rather than rows.

You should be able to run with dataframes both as X and y now.
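The indexing behavior behind the failing `x = x[idx]` line can be sketched as follows, assuming current pandas semantics. Plain `[]` with an integer array is treated as a column-label lookup, a `[start:stop]` slice selects rows, and `.iloc` selects rows by position in both cases. The helper `take_rows` is hypothetical and is not the fix that was merged.

```python
import numpy as np
import pandas as pd


def take_rows(data, idx):
    """Select rows by position for DataFrames and ndarrays alike."""
    if hasattr(data, "iloc"):
        return data.iloc[idx]
    return data[idx]


X = pd.DataFrame(np.arange(24).reshape(6, 4))  # columns are labelled 0..3
idx = np.arange(3, 6)                          # positions of the rows we want

# Plain [] with an integer array looks the values up as column labels.
try:
    X[idx]
    bracket_raises = False
except KeyError:
    bracket_raises = True

rows_df = take_rows(X, idx)             # rows 3..5, labels preserved
rows_np = take_rows(X.to_numpy(), idx)  # same rows from a plain ndarray
rows_slice = X[0:2]                     # a slice selects rows

print(bracket_raises, rows_df.shape, rows_np.shape, rows_slice.shape)
```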

@flennerhag
Owner

In master branch as of #101.

@26345211
Author

When the predicted and actual y are passed to the accuracy scorer, can the DataFrame be passed along with its index, or will the model simply pass an ndarray?

@flennerhag
Owner

If y is a DataFrame, then the input to the scorer should be a DataFrame, since we do no data type conversion on the input. The prediction will be in whatever format the estimator produces, presumably a NumPy array.
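A scorer therefore has to cope with a DataFrame for the true labels and, most likely, an ndarray for the predictions. The following is a minimal sketch of such a scorer; the function name, labels and index values are made up for illustration. It flattens both inputs before comparing, so the index is ignored for the score itself but remains available on `y_true` if the scorer needs it.

```python
import numpy as np
import pandas as pd


def accuracy(y_true, y_pred):
    """Fraction of matching labels; accepts DataFrames or ndarrays."""
    true_arr = np.asarray(y_true).ravel()
    pred_arr = np.asarray(y_pred).ravel()
    return float(np.mean(true_arr == pred_arr))


# True labels as a one-column DataFrame with a custom index.
y = pd.DataFrame({"label": [2, 2, 1, 0]}, index=[10, 11, 12, 13])
# Predictions as an estimator would typically return them.
pred = np.array([2, 1, 1, 0])

score = accuracy(y, pred)
print(score, list(y.index))
```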

@Matthew-A-epi

It seems that this still does not work. It looks like an indexing issue: some index calls use [] instead of .loc[].
