Improve hub #122

Closed
lgvaz opened this issue Jun 27, 2020 · 22 comments
Labels: discussion · enhancement (New feature or request) · help wanted (Extra attention is needed)

@lgvaz (Collaborator) commented Jun 27, 2020

🚀 Feature

Hub has gone through a lot of reformulations, but its real essence is becoming clear: a place to share datasets (and more)!

For a specific dataset, the following functions can be present:

  • Load dataset (obligatory)
  • Implement parser (obligatory)
  • Implement baseline model (optional)
  • Implement metrics (optional)
  • Additional functionality we cannot predict; for example, the user may want to include visualization functions, an exploratory data-analysis tool, etc. (optional)

A great place to draw inspiration is huggingface/nlp, where they maintain a collection of datasets.

While they maintain a very easy to use interface, I think there are some points we can improve.

huggingface/nlp locks the interface and provides fixed methods for getting the dataset, e.g. load_dataset. This works very well for simple (predicted) use cases, but it means that custom (unpredicted) functionality is very hard (or impossible) to implement.

Instead of implementing the interface ourselves, it's a better idea to leave the implementation to the contributor; the only thing we need to do is provide "blueprints" for the obligatory and common optional methods.

For example, in huggingface/nlp, to get the pets dataset we would do:

dataset = load_dataset('pets')  # additional arguments can be passed

What I'm proposing would be:

dataset = hub.pets.load_dataset()  # additional arguments can be passed

All datasets on hub would have at least the obligatory functions, plus any optional functions. Discoverability of custom functions would be easy because of autocompletion: hub.pets.<TAB> would show everything available.
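
To make the blueprint idea concrete, here is a minimal sketch of what a contributor's self-contained hub/pets module might look like under this proposal. The URL, PetsParser, and show_class_distribution are hypothetical placeholders, not existing mantisshrimp APIs:

# hub/pets/__init__.py -- hypothetical layout for a contributor module
import tarfile
import urllib.request
from pathlib import Path

PETS_URL = 'https://example.com/pets.tar.gz'  # placeholder URL

def load_dataset(data_dir=Path('data/pets')):
  # Obligatory blueprint: download and extract the data, return its directory
  if not data_dir.exists():
    data_dir.mkdir(parents=True)
    archive, _ = urllib.request.urlretrieve(PETS_URL)
    with tarfile.open(archive) as tar:
      tar.extractall(data_dir)
  return data_dir

def parser(data_dir):
  # Obligatory blueprint: build the dataset-specific parser
  return PetsParser(data_dir)  # PetsParser is a hypothetical Parser subclass

def show_class_distribution(data_dir):
  # Optional, contributor-defined extra; discoverable via hub.pets.<TAB>
  ...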


What is really cool about huggingface/nlp is that they figured out a way of automatically testing users' datasets: they automatically generate a sample of data from the contributor's dataset and test on that. Awesome.
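
We could do something similar. A minimal sketch of such an automated smoke test, assuming a hypothetical sample flag on load_dataset and a parser that exposes a parse() method:

# test_pets.py -- hypothetical auto-generated smoke test
from hub import pets

def test_pets_obligatory_functions():
  # Run the obligatory functions on a tiny sample of the contributor's data
  data_dir = pets.load_dataset(sample=True)  # 'sample' is a hypothetical flag
  parser = pets.parser(data_dir)
  records = parser.parse()
  assert len(records) > 0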

@ai-fast-track, @oke-aditya, what do you think?

lgvaz added the enhancement, help wanted, and discussion labels Jun 27, 2020
lgvaz added this to To do in First release 0.1.0 Jun 27, 2020
@oke-aditya (Contributor)

Should the datasets be part of this library, or should they live in a separate repo (HuggingFace does this)?

@lgvaz (Collaborator, Author) commented Jun 27, 2020

I'm not sure yet... We can start with them being part of this library and change that afterwards if needed.

@lgvaz (Collaborator, Author) commented Jun 28, 2020

Which one is more intuitive?

Option 1

source = hub.pets.dataset()
parser = hub.pets.parser(source)

Option 2

source = hub.pets.get_dataset()
parser = hub.pets.get_parser(source)

But get is like a virus: everything will end up being get_<something>.

Option 3

source = datasets.pets.load()
parser = datasets.pets.parser(source)

If we also start to distribute pre-trained models for datasets, does it become awkward to do the following:

model = datasets.pets.model(...)

hub feels more natural

model = hub.pets.model(...)

@oke-aditya (Contributor) commented Jun 28, 2020

hub.pets.dataset() sounds strange; the user will have to remember the name of the dataset in their IDE too.

The user can type hub.dataset. and simply get a list of all available datasets.

Parameters that we might provide (these would be useful for augmentation; a signature sketch follows the list):

  • image_set (string, optional) – Select the image_set to use: train, trainval or val
  • download (bool, optional) – If true, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g. transforms.RandomCrop
  • target_transform (callable, optional) – A function/transform that takes in the target and transforms it.
  • transforms (callable, optional) – A function/transform that takes an input sample and its target as entry and returns a transformed version.
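
For illustration, a hypothetical load signature combining the parameters above, in the spirit of torchvision.datasets, might look like:

from typing import Callable, Optional

# Hypothetical signature sketch mirroring torchvision.datasets conventions
def load(root: str,
         image_set: str = 'train',                     # 'train', 'trainval' or 'val'
         download: bool = False,                       # fetch from the internet if missing
         transform: Optional[Callable] = None,         # applied to the PIL image
         target_transform: Optional[Callable] = None,  # applied to the target
         transforms: Optional[Callable] = None):       # applied to (sample, target) pairs
  ...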

Also, yes to the proposed datasets.pets.parser(source).

datasets.pets.model(...) should return the baseline model, if implemented.

@oke-aditya (Contributor) commented Jun 28, 2020

dataset. will allow us to quickly move this to a separate repo in the future if required.

@lgvaz (Collaborator, Author) commented Jun 28, 2020

> hub.pets.dataset() sounds strange; the user will have to remember the name of the dataset in their IDE too.

Can you elaborate more? Using autocompletion, hub.<TAB> should list all datasets.

So, if I understood correctly, datasets.pets.model sounds reasonable to you?

@oke-aditya (Contributor)

hub.dataset sounds good because later we would have hub.models. and maybe hub.parsers.

@ai-fast-track (Collaborator)

If we use hub.dataset, what happens when the dataset isn't in Hub?

What about this:
source = datasets.load(URLs.PETS)

Using a URLs class triggers auto-completion. fastai uses that, and I use it in my timeseries package here.
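
For reference, a minimal sketch of such a URLs class (the constants are placeholders, following the fastai pattern):

class URLs:
  # Typing URLs.<TAB> in an IDE lists all available datasets
  PETS = 'https://example.com/datasets/pets.tgz'
  COCO = 'https://example.com/datasets/coco.zip'

source = datasets.load(URLs.PETS)  # 'datasets.load' as proposed above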

@philtrade

> Option 3
>
> source = datasets.pets.load()
> parser = datasets.pets.parser(source)

May I suggest:

source = datasets.pets.source()
parser = datasets.pets.parser(source)

or even

parser = datasets.pets.parser.from_source() # or datasets.pets.parser(from_source=True)

@lgvaz (Collaborator, Author) commented Jun 28, 2020

> If we use hub.dataset, what happens when the dataset isn't in Hub?

It only works for datasets that are in hub, similar to huggingface/nlp. Datasets come in a varied set of formats; the fastai way (of having a URL string) only works because they expect the data to always be in the same format (a single compressed file that just needs to be uncompressed).

In reality we might need much more processing to get a dataset to work, like downloading from multiple sources, then combining everything into a single directory, etc.

So:

First condition

Collaborators need the ability to write their own load method.

HuggingFace also does load_dataset(string) and still allows collaborators to write custom functions, but I explained the pitfalls of that in the OP.

> hub.dataset sounds good because later we would have hub.models. and maybe hub.parsers.

We need something self-contained. Let's say you want to contribute a new dataset: you would have to create a new file in hub.datasets, hub.models, and hub.parsers. And let's say you would also like to include a visualization function, where would that go? Not clear.

This is why:

Second condition

The collaborator needs a self-contained directory to integrate their dataset, parsers, models, etc.

So all of their changes would live in hub/pets (or whatever naming we end up using). This way we can clearly associate a folder with a single person.

@lgvaz (Collaborator, Author) commented Jun 28, 2020

> source = datasets.pets.source()

Maybe the problem is having source = something; datasets.pets.source does not make it clear that this method can download the dataset from the internet and extract it.

@oke-aditya (Contributor)

hub.pets would be good if it were a separate repo like HuggingFace's nlp, exclusively for datasets. Hub becomes ambiguous: does hub. give you datasets? Models? Implementations of Ross's models?

@lgvaz (Collaborator, Author) commented Jun 28, 2020

So we can have datasets (which can be a separate repo, yes), and that will give you everything related to a specific dataset: downloading, parsers, visualization tools, metrics, etc.

@oke-aditya (Contributor) commented Jun 28, 2020

Absolutely. That gets the best of both worlds. hub is the place for models; it should include models/implementations of supporting CNNs, layers, etc. By decoupling data from models, both stay independent.

Users can simply do dataset.model or parser or viz, etc. And people rely on mantisshrimp for model building.

HuggingFace did this too; they separated transformers from the data.

@lgvaz (Collaborator, Author) commented Jun 28, 2020

The only thing is that a datasets repo will have a dependency on mantisshrimp (because of Parser). Is that ok?

@lgvaz (Collaborator, Author) commented Jun 28, 2020

Alright, what about this:

from mantisshrimp.datasets import pets

data_dir = pets.load()
parser = pets.parser(data_dir)

datasets can be moved to a separate repo later on if required

@lgvaz (Collaborator, Author) commented Jun 28, 2020

All functions can have specific arguments; for instance, if you also want to train a segmentation model:

parser = pets.parser(data_dir, mask=True)
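
On the contributor side, that flag could be implemented with an explicit signature, which keeps it discoverable via autocompletion (PetsParser and PetsMaskParser are hypothetical names):

# datasets/pets/__init__.py -- hypothetical sketch
def parser(data_dir, mask=False):
  if mask:
    return PetsMaskParser(data_dir)  # also parses segmentation masks
  return PetsParser(data_dir)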

@ai-fast-track (Collaborator)

The downside of pets.load() is the fact that pets is, in reality, a dataset. Outsiders (new users) won't know that pets is a dataset object in isolation, unless they have already seen the import statement beforehand (from mantisshrimp.datasets import pets).

I think we should decouple datasets and models. A dataset can be associated with a model, or not at all. Its initial model can be swapped for a new and improved model.

The way I see it:
A dataset object is a dataset. It can be located in any of these locations:

  • Local machine
  • Somewhere in the cloud (urls)
  • Our Datasets Hub

At the high level, datasets can be loaded in a transparent way regardless of their location. Why not have a Datasets.load method with different kwargs that are linked to the locations mentioned above? (A sketch follows the lists below.)

Parser locations:

  • Local machine
  • Our Parsers Hub

For the models, it seems to be similar to the datasets:

  • Local machine
  • Somewhere in the cloud (Ross urls for example if they work out of the box)
  • Our Models Hub
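
A minimal sketch of what such a transparent Datasets.load could look like (download_and_extract and fetch_from_hub are hypothetical helpers):

from pathlib import Path

class Datasets:
  @staticmethod
  def load(name=None, path=None, url=None):
    if path is not None:                # local machine
      return Path(path)
    if url is not None:                 # somewhere in the cloud
      return download_and_extract(url)  # hypothetical helper
    return fetch_from_hub(name)         # hypothetical helper: our Datasets Hub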

@lgvaz (Collaborator, Author) commented Jun 29, 2020

> Why not have a Datasets.load method

This is only going to work as a proxy for the original function that loads the dataset. And there is the classic problem with **kwargs; take my previous example:

parser = pets.parser(data_dir, mask=True)

If we instead went with Datasets.load, the signature would always look like (data_dir, **kwargs); this way, there is no way of using autocompletion to discover what parameters should be passed.

Also, as I said in the OP, this design locks the API. Let's say the user also wants to provide a method for visualizing the class distribution of the data; how would they make that available to other users?
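
To make the contrast concrete, a sketch of the two designs side by side (the registry is hypothetical):

# Locked proxy: the real per-dataset parameters hide behind **kwargs
def load(name, data_dir, **kwargs):
  return _REGISTRY[name](data_dir, **kwargs)  # mask=True is undiscoverable here

# Per-dataset function: the signature itself documents the options
def parser(data_dir, mask=False):
  ...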


> A dataset object is a dataset

In the proposed solution there is no datasets object; it's only a namespace, a folder structure like so:

datasets
| -------> pets
           | ---> __init__.py
           | ---> parser.py
            ...

The problem with the confusing name is easily solvable with:

from mantisshrimp import datasets

data_dir = datasets.pets.load()

Maybe there is a conflict between the datasets concept that we're discussing and the Dataset object from the library. We can work on that and clear up the names.


> I think we should decouple datasets and models. A dataset can be associated with a model, or not at all.

The confusion happening here is about what I mean by model. I'm not talking about Ross's models or anything like that at all, but instead about pre-trained weights for that dataset (a discussion on how to handle new models is happening at #75). The thing is that it's impossible to decouple the model definition from its weights file.

The general idea is that the pretrained weights provided should be from a model already present in the library.

Let's say I already trained a model for this dataset, and I want future users to be able to start from that, in this way they can simply do:

model = datasets.pets.model()

And all I (as the implementor) have to do is:

def model():
  pets_model = MantisFasterRCNN(num_classes=2)
  pets_model.load('weight_file')
  return pets_model

You can argue that we just need to provide the weights file together with the model class, but that would break very simple modifications we would like to support. As an example:

def model():
  backbone = MantisFasterRCNN.get_backbone_by_name('resnet18')
  pets_model = MantisFasterRCNN(num_classes=2, backbone=backbone)
  pets_model.load('weight_file')
  return pets_model

@ai-fast-track (Collaborator)

There is a bit of confusion here regarding the use of terms, and it is going to create disjointed discussions. I propose that we create some sketches, diagrams, or other illustrations to avoid this whole misunderstanding process. We can also have a glossary, starting with the most confusing and overloaded terms. These documents will save us a lot of time and clear the path for new contributors/users.

If we have to explain to every contributor and/or new user what those terms really mean, it isn't going to be a very productive process.

@philtrade commented Jun 29, 2020

  • Datasets -- raw materials, manipulated via load or download or parse, or visualize, augment, compress, shuffle blah blah.

  • Models -- an implementation of an architecture with some backbone, without any weights.

  • Weights (and optimizer states) -- the snapshot of a model state, when banging the above two together. In other words: model_X.train(dataset_Y) produces weights_Z.

As such, datasets and model architectures are orthogonal concepts, and thus can have their own APIs to communicate between themselves and with the outside world. Weights are a derivative of these two underlyings; they can have their own API too, pointing to what model they belong to and what dataset they were trained on, or a detailed recipe of how they were cooked up.

Providing default weights in a single model() call is fair usage, but perhaps with a parameter such as use_default_weights: bool = True, and if use_default_weights: pet_model.load(pet_model.default_weights)... something along those lines.
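
Concretely, something like this sketch (default_weights as an attribute is hypothetical; MantisFasterRCNN as in the earlier example):

def model(use_default_weights=True):
  pets_model = MantisFasterRCNN(num_classes=2)
  if use_default_weights:
    pets_model.load(pets_model.default_weights)  # hypothetical attribute
  return pets_model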

lgvaz mentioned this issue Jun 29, 2020
@lgvaz (Collaborator, Author) commented Jul 19, 2020

Hub was successfully replaced by datasets.

lgvaz closed this as completed Jul 19, 2020
First release 0.1.0 automation moved this from To do to Done Jul 19, 2020