Improve hub #122

Closed
lgvaz opened this issue Jun 27, 2020 · 22 comments
Labels: discussion · enhancement (New feature or request) · help wanted (Extra attention is needed)

@lgvaz (Collaborator) commented Jun 27, 2020

🚀 Feature

Hub has gone through a lot of reformulations, but its real essence is becoming clear: a place to share datasets (and more)!

For a specific dataset, the following functions can be present:

  • Load dataset (obligatory)
  • Implement parser (obligatory)
  • Implement baseline model (optional)
  • Implement metrics (optional)
  • Additional functionality we cannot predict; for example, the user may want to include visualization functions, an exploratory data-analysis tool, etc. (optional)

A great place to draw inspiration is huggingface/nlp, where they maintain a collection of datasets.

While they maintain a very easy to use interface, I think there are some points we can improve.

huggingface/nlp locks the interface and provides fixed methods for getting the dataset, e.g. load_dataset. This works very well for simple (predicted) use cases, but it means that custom (unpredicted) functionality is very hard (or impossible) to implement.

Instead of implementing the interface ourselves, it's a better idea to leave the implementation to the contributor; the only thing we need to do is provide "blueprints" for the obligatory and common optional methods.

For example, in huggingface/nlp, to get the pets dataset we would do:

dataset = load_dataset('pets')  # additional arguments can be passed

What I'm proposing would be:

dataset = hub.pets.load_dataset()  # additional arguments can be passed

All datasets on hub would have at least the obligatory functions, plus any optional functions. Discoverability of custom functions would be easy because of autocompletion: hub.pets.<TAB> would show everything available.
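
To make the blueprint idea concrete, here is a minimal sketch of what a contributor's self-contained hub/pets module might look like under this proposal. The URL, PetsParser, and show_class_distribution are hypothetical placeholders, not existing mantisshrimp APIs:

# hub/pets/__init__.py -- hypothetical layout for a contributor module
import tarfile
import urllib.request
from pathlib import Path

PETS_URL = 'https://example.com/pets.tar.gz'  # placeholder URL

def load_dataset(data_dir=Path('data/pets')):
  # Obligatory blueprint: download and extract the data, return its directory
  if not data_dir.exists():
    data_dir.mkdir(parents=True)
    archive, _ = urllib.request.urlretrieve(PETS_URL)
    with tarfile.open(archive) as tar:
      tar.extractall(data_dir)
  return data_dir

def parser(data_dir):
  # Obligatory blueprint: build the dataset-specific parser
  return PetsParser(data_dir)  # PetsParser is a hypothetical Parser subclass

def show_class_distribution(data_dir):
  # Optional, contributor-defined extra; discoverable via hub.pets.<TAB>
  ...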


What is really cool about huggingface/nlp is that they figured out a way of automatically testing users' datasets: they automatically generate a sample of data from the contributor's dataset and test on that. Awesome.
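
We could do something similar. A minimal sketch of such an automated smoke test, assuming a hypothetical sample flag on load_dataset and a parser that exposes a parse() method:

# test_pets.py -- hypothetical auto-generated smoke test
from hub import pets

def test_pets_obligatory_functions():
  # Run the obligatory functions on a tiny sample of the contributor's data
  data_dir = pets.load_dataset(sample=True)  # 'sample' is a hypothetical flag
  parser = pets.parser(data_dir)
  records = parser.parse()
  assert len(records) > 0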

@ai-fast-track, @oke-aditya, what do you think?

lgvaz added the enhancement, help wanted, and discussion labels Jun 27, 2020
lgvaz added this to To do in First release 0.1.0 Jun 27, 2020
@oke-aditya (Contributor)

Should the datasets be part of this library, or should they live in a separate repo (HuggingFace does this)?

@lgvaz (Collaborator, Author) commented Jun 27, 2020

I'm not sure yet... We can start with them being part of this library and change that afterwards if needed.

@lgvaz (Collaborator, Author) commented Jun 28, 2020

Which one is more intuitive?

Option 1

source = hub.pets.dataset()
parser = hub.pets.parser(source)

Option 2

source = hub.pets.get_dataset()
parser = hub.pets.get_parser(source)

But get is like a virus: everything will end up being get_<something>.

Option 3

source = datasets.pets.load()
parser = datasets.pets.parser(source)

If we also start to distribute pre-trained models for datasets, does it become awkward to do the following:

model = datasets.pets.model(...)

hub feels more natural

model = hub.pets.model(...)

@oke-aditya (Contributor) commented Jun 28, 2020

hub.pets.dataset() sounds strange; the user will have to remember the name of the dataset in their IDE too.

The user can type hub.dataset. and simply get a list of all available datasets.

Parameters that we might provide (these would be useful for augmentation; a signature sketch follows the list):

  • image_set (string, optional) – Select the image_set to use: train, trainval or val
  • download (bool, optional) – If true, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g. transforms.RandomCrop
  • target_transform (callable, optional) – A function/transform that takes in the target and transforms it.
  • transforms (callable, optional) – A function/transform that takes an input sample and its target as entry and returns a transformed version.
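
For illustration, a hypothetical load signature combining the parameters above, in the spirit of torchvision.datasets, might look like:

from typing import Callable, Optional

# Hypothetical signature sketch mirroring torchvision.datasets conventions
def load(root: str,
         image_set: str = 'train',                     # 'train', 'trainval' or 'val'
         download: bool = False,                       # fetch from the internet if missing
         transform: Optional[Callable] = None,         # applied to the PIL image
         target_transform: Optional[Callable] = None,  # applied to the target
         transforms: Optional[Callable] = None):       # applied to (sample, target) pairs
  ...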

Also, yes to the proposed datasets.pets.parser(source).

datasets.pets.model(...) should return the baseline model, if implemented.

@oke-aditya (Contributor) commented Jun 28, 2020

dataset. will allow us to quickly move this to a separate repo in the future if required.

@lgvaz (Collaborator, Author) commented Jun 28, 2020

> hub.pets.dataset() sounds strange; the user will have to remember the name of the dataset in their IDE too.

Can you elaborate more? Using autocompletion, hub.<TAB> should list all datasets.

So, if I understood correctly, datasets.pets.model sounds reasonable to you?

@oke-aditya (Contributor)

hub.dataset sounds good because later we would have hub.models. and maybe hub.parsers.

@ai-fast-track (Collaborator)

If we use hub.dataset, what happens when the dataset isn't in Hub?

What about this:
source = datasets.load(URLs.PETS)

Using a URLs class triggers auto-completion. fastai uses that, and I use it in my timeseries package here.
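
For reference, a minimal sketch of such a URLs class (the constants are placeholders, following the fastai pattern):

class URLs:
  # Typing URLs.<TAB> in an IDE lists all available datasets
  PETS = 'https://example.com/datasets/pets.tgz'
  COCO = 'https://example.com/datasets/coco.zip'

source = datasets.load(URLs.PETS)  # 'datasets.load' as proposed above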

@philtrade

> Option 3
>
> source = datasets.pets.load()
> parser = datasets.pets.parser(source)

May I suggest:

source = datasets.pets.source()
parser = datasets.pets.parser(source)

or even

parser = datasets.pets.parser.from_source() # or datasets.pets.parser(from_source=True)

@lgvaz (Collaborator, Author) commented Jun 28, 2020

> If we use hub.dataset, what happens when the dataset isn't in Hub?

It only works for datasets that are in hub, similar to huggingface/nlp. Datasets come in a varied set of formats; the fastai way (of having a URL string) only works because they expect the data to always be in the same format (a single compressed file that just needs to be uncompressed).

In reality we might need much more processing to get a dataset to work, like downloading from multiple sources, then combining everything into a single directory, etc.

So:

First condition

Collaborators need the ability to write their own load method.

HuggingFace also does load_dataset(string) and still allows collaborators to write custom functions, but I explained the pitfalls of that in the OP.

> hub.dataset sounds good because later we would have hub.models. and maybe hub.parsers.

We need something self-contained. Let's say you want to contribute a new dataset: you would have to create a new file in hub.datasets, hub.models, and hub.parsers. And let's say you would also like to include a visualization function, where would that go? Not clear.

This is why:

Second condition

The collaborator needs a self-contained directory to integrate their dataset, parsers, models, etc.

So all of their changes would live in hub/pets (or whatever naming we end up using). This way we can clearly associate a folder with a single person.

@lgvaz (Collaborator, Author) commented Jun 28, 2020

> source = datasets.pets.source()

Maybe the problem is having source = something; datasets.pets.source does not make it clear that this method can download the dataset from the internet and extract it.

@oke-aditya (Contributor)

hub.pets would be good if it were a separate repo like HuggingFace's nlp, exclusively for datasets. Hub becomes ambiguous: does hub. give you datasets? Models? Implementations of Ross's models?

@lgvaz (Collaborator, Author) commented Jun 28, 2020

So we can have datasets (which can be a separate repo, yes), and that will give you everything related to a specific dataset: downloading, parsers, visualization tools, metrics, etc.

@oke-aditya (Contributor) commented Jun 28, 2020

Absolutely. That gets the best of both worlds. hub is the place for models; it should include models/implementations of supporting CNNs, layers, etc. By decoupling data from models, both stay independent.

Users can simply do dataset.model or parser or viz, etc. And people rely on mantisshrimp for model building.

HuggingFace did this too; they separated transformers from the data.

@lgvaz (Collaborator, Author) commented Jun 28, 2020

The only thing is that a datasets repo will have a dependency on mantisshrimp (because of Parser). Is that ok?

@lgvaz (Collaborator, Author) commented Jun 28, 2020

Alright, what about this:

from mantisshrimp.datasets import pets

data_dir = pets.load()
parser = pets.parser(data_dir)

datasets can be moved to a separate repo later on if required

@lgvaz (Collaborator, Author) commented Jun 28, 2020

All functions can have specific arguments; for instance, if you also want to train a segmentation model:

parser = pets.parser(data_dir, mask=True)
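
On the contributor side, that flag could be implemented with an explicit signature, which keeps it discoverable via autocompletion (PetsParser and PetsMaskParser are hypothetical names):

# datasets/pets/__init__.py -- hypothetical sketch
def parser(data_dir, mask=False):
  if mask:
    return PetsMaskParser(data_dir)  # also parses segmentation masks
  return PetsParser(data_dir)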

@ai-fast-track (Collaborator)

The downside of pets.load() is the fact that pets is, in reality, a dataset. Outsiders (new users) won't know that pets is a dataset object in isolation, unless they have already seen the import statement beforehand (from mantisshrimp.datasets import pets).

I think we should decouple datasets and models. A dataset can be associated with a model, or not at all. Its initial model can be swapped for a new and improved model.

The way I see it:
A dataset object is a dataset. It can be located in any of these locations:

  • Local machine
  • Somewhere in the cloud (urls)
  • Our Datasets Hub

At the high level, datasets can be loaded in a transparent way regardless of their location. Why not have a Datasets.load method with different kwargs that are linked to the locations mentioned above? (A sketch follows the lists below.)

Parser locations:

  • Local machine
  • Our Parsers Hub

For the models, it seems to be similar to the datasets:

  • Local machine
  • Somewhere in the cloud (Ross urls for example if they work out of the box)
  • Our Models Hub
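
A minimal sketch of what such a transparent Datasets.load could look like (download_and_extract and fetch_from_hub are hypothetical helpers):

from pathlib import Path

class Datasets:
  @staticmethod
  def load(name=None, path=None, url=None):
    if path is not None:                # local machine
      return Path(path)
    if url is not None:                 # somewhere in the cloud
      return download_and_extract(url)  # hypothetical helper
    return fetch_from_hub(name)         # hypothetical helper: our Datasets Hub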

@lgvaz (Collaborator, Author) commented Jun 29, 2020

> Why not have a Datasets.load method

This is only going to work as a proxy for the original function that loads the dataset. And there is the classic problem with **kwargs; take my previous example:

parser = pets.parser(data_dir, mask=True)

If we instead went with Datasets.load, the signature would always look like (data_dir, **kwargs); this way, there is no way of using autocompletion to discover what parameters should be passed.

Also, as I said in the OP, this design locks the API. Let's say the user also wants to provide a method for visualizing the class distribution of the data; how would they make that available to other users?
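
To make the contrast concrete, a sketch of the two designs side by side (the registry is hypothetical):

# Locked proxy: the real per-dataset parameters hide behind **kwargs
def load(name, data_dir, **kwargs):
  return _REGISTRY[name](data_dir, **kwargs)  # mask=True is undiscoverable here

# Per-dataset function: the signature itself documents the options
def parser(data_dir, mask=False):
  ...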


> A dataset object is a dataset

In the proposed solution there is no datasets object; it's only a namespace, a folder structure like so:

datasets
| -------> pets
           | ---> __init__.py
           | ---> parser.py
            ...

The problem with the confusing name is easily solvable with:

from mantisshrimp import datasets

data_dir = datasets.pets.load()

Maybe there is a conflict between the datasets concept that we're discussing and the Dataset object from the library. We can work on that and clear up the names.


> I think we should decouple datasets and models. A dataset can be associated with a model, or not at all.

The confusion happening here is about what I mean by model. I'm not talking about Ross's models or anything like that at all, but instead about pre-trained weights for that dataset (a discussion on how to handle new models is happening at #75). The thing is that it's impossible to decouple the model definition from its weights file.

The general idea is that the pretrained weights provided should be from a model already present in the library.

Let's say I already trained a model for this dataset, and I want future users to be able to start from that, in this way they can simply do:

model = datasets.pets.model()

And all I (as the implementor) have to do is:

def model():
  pets_model = MantisFasterRCNN(num_classes=2)
  pets_model.load('weight_file')
  return pets_model

You can argue that we just need to provide the weights file together with the model class, but that would break very simple modifications we would like to support. As an example:

def model():
  backbone = MantisFasterRCNN.get_backbone_by_name('resnet18')
  pets_model = MantisFasterRCNN(num_classes=2, backbone=backbone)
  pets_model.load('weight_file')
  return pets_model

@ai-fast-track (Collaborator)

There is a bit of confusion here regarding the use of terms, and it is going to create disjointed discussions. I propose that we create some sketches, diagrams, or other illustrations to avoid this whole misunderstanding process. We can also have a glossary, starting with the most confusing and overloaded terms. These documents will save us a lot of time and clear the path for new contributors/users.

If we have to explain to every contributor and/or new user what those terms really mean, it isn't going to be a very productive process.

@philtrade commented Jun 29, 2020

  • Datasets -- raw materials, manipulated via load or download or parse, or visualize, augment, compress, shuffle blah blah.

  • Models -- an implementation of an architecture with some backbone, without any weights.

  • Weights (and optimizer states) -- the snapshot of a model state, when banging the above two together. In other words: model_X.train(dataset_Y) produces weights_Z.

As such, datasets and model architectures are orthogonal concepts, and thus can have their own APIs to communicate between themselves and with the outside world. Weights are a derivative of these two underlyings; they can have their own API too, pointing to what model they belong to and what dataset they were trained on, or a detailed recipe of how they were cooked up.

Providing default weights in a single model() call is fair usage, but perhaps with a parameter such as use_default_weights: bool = True, and if use_default_weights: pet_model.load(pet_model.default_weights)... something along those lines.
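
Concretely, something like this sketch (default_weights as an attribute is hypothetical; MantisFasterRCNN as in the earlier example):

def model(use_default_weights=True):
  pets_model = MantisFasterRCNN(num_classes=2)
  if use_default_weights:
    pets_model.load(pets_model.default_weights)  # hypothetical attribute
  return pets_model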

lgvaz mentioned this issue Jun 29, 2020
@lgvaz (Collaborator, Author) commented Jul 19, 2020

Hub was successfully replaced by datasets.

lgvaz closed this as completed Jul 19, 2020
First release 0.1.0 automation moved this from To do to Done Jul 19, 2020