Support training on (non-in-memory) datasets for VectorModel-derived classes #40
Comments
This issue will likely require many significant changes. The requirements I see off the top of my head:
There will be many more difficulties that I cannot anticipate now. The issue should certainly be split into multiple issues/PRs when the time comes to address it.
It might be worthwhile to take a look at existing libraries such as NVTabular, which supports feature engineering and preprocessing with a particular (exclusive) focus on neural networks and which can be integrated with fastai. If we go ahead with all this, it might make sense to restrict ourselves to being compatible only with finite datasets, e.g. support only |
A reasonable approach to handle this using current mechanisms (for torch models) is to have DataFrames which contain only meta-data (e.g. filenames/paths or other references to the actual data) and which do fully fit in memory and to make the |
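As a rough illustration of the meta-data idea in this comment, here is a minimal, framework-free sketch. It is not sensAI's or torch's API: the class `LazyFileDataset`, the helper `load_item`, and the one-row-per-file CSV layout are all hypothetical, chosen only to show that the in-memory structure holds references while the actual data is read on access.

```python
import csv
from typing import Iterator, List, Sequence


def load_item(path: str) -> List[float]:
    """Hypothetical loader: reads the first comma-separated row of floats from a file."""
    with open(path, newline="") as f:
        return [float(value) for value in next(csv.reader(f))]


class LazyFileDataset:
    """Keeps only file paths (the meta-data) in memory and loads data on access."""

    def __init__(self, paths: Sequence[str]):
        self.paths = list(paths)

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, index: int) -> List[float]:
        # The file is read here, not at construction time.
        return load_item(self.paths[index])

    def iter_batches(self, batch_size: int) -> Iterator[List[List[float]]]:
        """Yields lists of loaded items; only one batch is held in memory at a time."""
        for start in range(0, len(self), batch_size):
            stop = min(start + batch_size, len(self))
            yield [self[i] for i in range(start, stop)]
```

In a torch setting, the same `__len__`/`__getitem__` pair is what a map-style dataset would expose, so a sketch like this could presumably be adapted to a data loader; that adaptation is an assumption and is not shown here.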
Yes, this sounds like a reasonable approach for many applications where normalizers and feature extractors don't need to be fitted on the non-loaded data. What I originally had in mind was support for training on generators of data frames (or arrays). Several libraries help build such generators, augmenting data on the way, which can come in pretty handy. It might well be that these tools for data augmentation can also easily be used within an implementation of
We could either close this issue or put it on ice until one of us actually uses sensAI for such data sets and shares hands-on experience (I would prefer the latter option).
The approach I described above has recently been added to sensAI with class |
This is crucial for datasets that don't fit in RAM. Special care must be taken with featuregens and dft transformers, since they typically cannot be trained batch-wise.
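To make the batch-wise fitting concern concrete: a transformer can only be trained on a stream of batches if its fitted state can be updated incrementally. Standardisation is one case where this is possible. The sketch below uses Welford's online algorithm to accumulate mean and standard deviation over batches; the class `RunningStandardizer` is hypothetical and not part of sensAI, and transformers whose state cannot be updated this way would still need a full pass over all data or a different strategy.

```python
import math
from typing import Iterable, List


class RunningStandardizer:
    """Accumulates mean and (population) variance one batch at a time (Welford's algorithm)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the current mean

    def partial_fit(self, batch: Iterable[float]) -> "RunningStandardizer":
        for x in batch:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self._m2 += delta * (x - self.mean)
        return self

    @property
    def std(self) -> float:
        return math.sqrt(self._m2 / self.count) if self.count else 0.0

    def transform(self, batch: Iterable[float]) -> List[float]:
        # Fall back to a scale of 1.0 if the data seen so far is constant.
        scale = self.std or 1.0
        return [(x - self.mean) / scale for x in batch]
```

A typical use would be one pass over the batch generator calling `partial_fit`, followed by a second pass in which each batch is transformed just before it is handed to the model.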