Add sklearn example #7

Open
jcklie opened this issue Jul 9, 2019 · 3 comments
Labels
documentation Improvements or additions to documentation enhancement New feature or request

Comments

@jcklie
Collaborator

jcklie commented Jul 9, 2019

Is your feature request related to a problem? Please describe.
The API of this tool should be compatible with sklearn. It would be nice to document how to use these together.

Describe the solution you'd like
Add an example using e.g. cross validation, parameter grid search or pipelining.

@jcklie jcklie added documentation Improvements or additions to documentation enhancement New feature or request labels Jul 9, 2019
@jcklie jcklie added this to the Ready for pypi milestone Jul 9, 2019
@AMarkard
Collaborator

Implementing the sklearn API for the neural networks turned out to be more difficult than expected, because "fit" requires X and y separately rather than the torchtext.data.Iterator that is currently in use.
skorch provides a nice solution: it wraps the PyTorch network, and its SliceDataset solves the Iterator issue. So I wrote a complete wrapper class that uses skorch to wrap the neural networks and adapts them to the sklearn API as well as to the project structure.
After that, the next issue came up. Because the sentences are averaged inside the neural network, the data must not be padded, and this causes problems with sklearn. In addition, the data types we use are not supported by sklearn.
As a collaborator of the skorch project states, the problem lies in our data structure and the way sklearn handles the data:
"Getting pytorch Datasets to work with GridSearchCV is not trivially possible. The problem is that eventually, the Dataset leaves the skorch domain and is handled directly by sklearn. sklearn only works with a couple of data types (ndarray, scipy sparse, pandas DataFrame), so you will encounter an error sooner or later." (skorch-dev/skorch#212)
In conclusion, in order to use sklearn, the data handling needs to be completely restructured.
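
To illustrate what such a restructuring could look like, here is a minimal sketch. It is not the project's code: the toy data, EMBEDDING_DIM, and the helper average_sentence are made up for illustration, and a LogisticRegression stands in for the neural network. The idea is that averaging each sentence before it reaches the estimator yields one fixed-size row per sentence, so no padding is needed and X is a plain ndarray that GridSearchCV can split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

EMBEDDING_DIM = 8  # hypothetical embedding size, for illustration only
rng = np.random.default_rng(0)


def average_sentence(token_vectors):
    """Collapse a (num_tokens, dim) array into a single (dim,) vector."""
    return np.asarray(token_vectors).mean(axis=0)


# Toy stand-in for embedded sentences: variable length, random vectors.
labels = np.array([i % 2 for i in range(40)])
sentences = [
    rng.normal(size=(rng.integers(3, 10), EMBEDDING_DIM)) + label
    for label in labels
]

# Averaging outside the model gives a dense (n_samples, dim) ndarray,
# which is one of the data types sklearn handles natively.
X = np.stack([average_sentence(s) for s in sentences])
y = labels

# Stand-in estimator; the real model would take the place of "clf".
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=4)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The trade-off is that the averaging step moves out of the network into preprocessing, so it can no longer be learned or changed per model.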

@jcklie
Collaborator Author

jcklie commented Oct 28, 2019

What happens if you replace torchtext with flair for the embeddings and just use PyTorch datasets?

@AMarkard
Collaborator

Flair sadly comes with other drawbacks for our system.
Flair seems to be very slow as well, at least for such large amounts of data. Nvidia Apex does NOT yield the improvement needed. After some investigation, it seems that the bottleneck lies in the structure itself: our system needs pairs of (embedded) sentences and frames, but Flair requires wrapping each sentence in a "Sentence" object. The usage therefore has to be as follows: sentence -> Flair's Sentence object -> embedding -> taking the embeddings out of the Sentence object -> dropping the object. (Compare old repo #25)
