Improve hub #122
Should the datasets be part of this library, or should they live in a separate repo? (HuggingFace does the latter.)

I'm not sure yet... We can start with them being part of this library and change it afterwards if needed.
Which one is more intuitive?

Option 1:
```
source = hub.pets.dataset()
parser = hub.pets.parser(source)
```

Option 2:
```
source = hub.pets.get_dataset()
parser = hub.pets.get_parser(source)
```

Or Option 3:
```
source = datasets.pets.load()
parser = datasets.pets.parser(source)
```

If we start to also distribute pre-trained models for datasets, does it become awkward to do so?
```
model = datasets.pets.model(...)
model = hub.pets.model(...)
```
User can do … Methods that we might provide: … These would be useful for augmentation.

Also yes, … as proposed.
Can you elaborate more? Using autocompletion … So if I understood correctly …
If we use …, what about this: using a URLs class will trigger auto-completion. fastai uses that, and I use it in my …
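For reference, a fastai-style URLs registry is just a class of string constants, which is exactly why it triggers auto-completion. A minimal sketch (the links below are placeholders, not real download locations):

```python
class URLs:
    """Central registry of dataset locations. Because these are class
    attributes, typing `URLs.<TAB>` in an IDE or notebook lists all
    available datasets (URLs here are illustrative placeholders)."""
    PETS = "https://example.com/datasets/pets.tgz"
    COCO = "https://example.com/datasets/coco.tgz"
```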
May I suggest:
```
source = datasets.pets.source()
parser = datasets.pets.parser(source)
```
or even …
It only works for datasets that are in hub, similar to … In reality we might need much more processing to get a dataset to work, like downloading from multiple sources, then combining everything into a single directory, etc. So:

First condition: collaborators need the ability to write their own … Hugging Face also does …
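To illustrate why contributors may need custom loading logic, here is a minimal sketch of the "combine multiple sources into a single directory" step (the helper name is hypothetical, and network download is omitted):

```python
import shutil
import tempfile
from pathlib import Path


def combine_sources(sources, dest=None):
    """Merge files from several already-downloaded source directories
    into one dataset directory (hypothetical helper; the real download
    step from multiple URLs is omitted for brevity)."""
    dest = Path(dest or tempfile.mkdtemp(prefix="dataset_"))
    dest.mkdir(parents=True, exist_ok=True)
    for src in map(Path, sources):
        for f in src.rglob("*"):
            if f.is_file():
                target = dest / f.relative_to(src)
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(f, target)
    return dest
```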
We need something self-contained. Let's say you want to contribute a new dataset: you would have to create a new file in … This is why:

Second condition: the collaborator needs a self-contained directory to integrate his dataset, parsers, models, etc., so that all of his changes would be …
Maybe the problem is having …

So we can have …
Absolutely, that gets the best of both worlds. Users can simply do dataset.model, parser, viz, etc., and people can rely on mantisshrimp for model building. HuggingFace did this too; they separated transformers from the data.
The only thing is that a …
Alright, what about this:
```
from mantisshrimp.datasets import pets

data_dir = pets.load()
parser = pets.parser(data_dir)
```
All functions can have dataset-specific arguments; for instance, if you also want to train a segmentation model:
```
parser = pets.parser(data_dir, mask=True)
```
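A sketch of how such a dataset-specific argument could look inside the contributor's module (the function body and the returned stand-in objects are hypothetical, not the library's actual parser classes):

```python
def parser(data_dir, mask=False):
    """Return the annotation parser for this dataset.

    `mask=True` would additionally parse segmentation masks; the
    dictionaries below are illustrative stand-ins for real parser objects.
    """
    if mask:
        return {"type": "mask_parser", "data_dir": data_dir}
    return {"type": "bbox_parser", "data_dir": data_dir}
```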
The downside of … I think we should decouple … The way I see it:

At the high level, datasets can be loaded in a transparent way regardless of their location. Why not have a … Parser locations: …

For the models, it seems to be similar to the datasets: …
This is only going to work as a proxy for the original function that loads the dataset, and there is the classic problem with …
```
parser = pets.parser(data_dir, mask=True)
```
If we instead went with: … Also, as I said in the OP, this design locks the API. Let's say the user also wants to provide a method for visualizing the class distribution of the data: how would he make that available to other users?
In the proposed solution there is no datasets object; it's only a namespace, a folder structure like so:
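For illustration, such a namespace could be laid out like this (directory and file names are hypothetical, not the repo's actual layout):

```
mantisshrimp/
└── datasets/
    ├── __init__.py        # imports each dataset subpackage
    └── pets/
        ├── __init__.py    # exposes load(), parser(), model(), ...
        ├── data.py        # download / load logic
        └── parsers.py     # dataset-specific parsers
```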
The problem with the confusing name is easily solvable with:
```
from mantisshrimp import datasets

data_dir = datasets.pets.load()
```
Maybe there is a conflict between …
The confusion happening here is what I mean by model. I'm not talking about Ross models or anything like that at all, but instead about pre-trained weights for that dataset (a discussion on how to handle new models is happening at #75). The thing is that it's impossible to decouple the model definition from its weights file. The general idea is that the pre-trained weights provided should be from a model already present in the library. Let's say I already trained a model for this dataset, and I want future users to be able to start from that; in this way they can simply do:
```
model = datasets.pets.model()
```
And all I (as the implementor) have to do is:
```
def model():
    pets_model = MantisFasterRCNN(num_classes=2)
    pets_model.load('weight_file')
    return pets_model
```
You can argue that we just need to provide the weights file together with the model class, but that would break very simple modifications we would like to support, for example:
```
def model():
    backbone = MantisFasterRCNN.get_backbone_by_name('resnet18')
    pets_model = MantisFasterRCNN(num_classes=2, backbone=backbone)
    pets_model.load('weight_file')
    return pets_model
```
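The factory pattern above can be sketched in runnable form with a stand-in class (`DummyModel` replaces `MantisFasterRCNN` and weight loading is simulated; all names are illustrative, not the library's actual API):

```python
class DummyModel:
    """Stand-in for MantisFasterRCNN, just to illustrate the pattern."""
    def __init__(self, num_classes, backbone="resnet50"):
        self.num_classes = num_classes
        self.backbone = backbone
        self.weights = None

    def load(self, weight_file):
        # A real implementation would load a state dict from disk here.
        self.weights = weight_file


def model():
    """Dataset-specific factory: rebuilds the exact architecture the
    weights were trained with, then loads them."""
    pets_model = DummyModel(num_classes=2, backbone="resnet18")
    pets_model.load("pets_weights.pth")
    return pets_model
```

The point of the pattern: the factory is the only place that knows both the architecture details (number of classes, backbone) and the matching weights file, so users never have to reconstruct that pairing themselves.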
There is a bit of confusion here regarding the use of terms, and this is going to create out-of-sync discussions. I propose that we create some sketches, diagrams, or other illustrations to avoid the whole misunderstanding process. We can also have a glossary starting with the most confusing and overloaded terms. These documents will save us a lot of time and clear the path for new contributors/users: if we have to explain to every contributor and/or new user what those terms really mean, it isn't going to be a very productive process.
As such, datasets and model architectures are orthogonal concepts and can thus have their own APIs to communicate between them and with the outside world. Providing default weights in a single …
Hub was successfully replaced by datasets |
🚀 Feature
Hub went through a lot of reformulations, but its real essence is becoming clear: a place to share datasets (and more)!
For a specific dataset, the following functions can be present:
A great place to draw inspiration is huggingface/nlp, where they maintain a collection of datasets.
While they provide a very easy-to-use interface, I think there are some points we can improve.
huggingface/nlp locks the interface and provides fixed methods for getting the dataset, e.g. `load_dataset`. This works very well for simple (predicted) use cases, but means that custom (unpredicted) functionality is very hard (or impossible) to implement.

Instead of implementing the interface ourselves, it's a better idea to leave the implementation to the contributor; the only thing we need to do is provide "blueprints" for the obligatory and common optional methods.
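One way such "blueprints" could work without locking the interface is a minimal check for the obligatory functions, leaving everything else open-ended (a sketch; the function names `load`/`parser` and the validator are hypothetical):

```python
import types

# Hypothetical set of obligatory blueprint functions.
OBLIGATORY = ("load", "parser")


def validate_dataset_module(module):
    """Check that a contributed dataset module exposes the obligatory
    functions; any extra functions remain allowed and discoverable."""
    missing = [name for name in OBLIGATORY
               if not callable(getattr(module, name, None))]
    if missing:
        raise TypeError(f"dataset module missing: {missing}")
    return True


# Example: a contributor module with the obligatory functions
# plus a custom one of their own design.
pets = types.ModuleType("pets")
pets.load = lambda: "data_dir"
pets.parser = lambda source: f"parser({source})"
pets.show_class_distribution = lambda: "custom viz"  # optional extra
```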
For example, in huggingface, for getting the pets dataset we would do: …

What I'm proposing would be: …
All datasets on hub would have at least the obligatory functions, plus any optional functions. Discoverability of custom functions would be easy because of autocompletion: `hub.pets.<TAB>` would show everything available.

What is really cool about huggingface/nlp is that they figured out a way of automatically testing users' datasets: they automatically generate a sample of data from the contributor's dataset and test on that. Awesome.
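A sketch of that automatic-testing idea: each dataset module could additionally ship a tiny `sample()` fixture, and the library runs one generic test over it (all names and the record format are hypothetical):

```python
def sample():
    """Tiny fixture a contributor ships with the dataset: a couple of
    fake records in the dataset's raw format (format is illustrative)."""
    return [
        {"image": "img1.jpg", "label": "cat"},
        {"image": "img2.jpg", "label": "dog"},
    ]


def parser(records):
    """Minimal stand-in parser: turns raw records into (image, label) pairs."""
    return [(r["image"], r["label"]) for r in records]


def generic_dataset_test(sample_fn, parser_fn):
    """Generic test the library could run against every contributed
    dataset: parse the shipped sample and sanity-check the result."""
    records = sample_fn()
    parsed = parser_fn(records)
    assert len(parsed) == len(records), "parser dropped records"
    return True
```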
@ai-fast-track, @oke-aditya, what do you think?