Overall loop for training a deep net for molecules here #3

Open · 2 of 4 tasks
karalets opened this issue Mar 27, 2020 · 29 comments
Labels: good first issue (Good for newcomers)
@karalets (Collaborator) commented Mar 27, 2020

We have two desiderata:

  1. We want to be able to learn a network which regresses to measurements given a structure as input.
  2. We may want to pretrain parts of that network (i.e. the molecular representation part) with existing molecular data in order to get some knowledge into the model about what molecular structures exist.

We can decompose those two tasks as follows:

We want to have a model of a representation P(h|x) which predicts hidden features h from a molecular graph x, ideally a joint model P(h, x), i.e. a joint density that supports arbitrary conditioning.

We furthermore want a model of the measurements m we care about given a molecular representation h, expressed as P(m|h). In simple regression this could be a probabilistic linear model on top of the outputs of the representation model P(h|x).

We may want to train them both jointly, separately, or in phases.
If we pre-train P(h|x), that is called semi-supervised learning.

In a training loop to solve the task of regressing m from a training set D_t = {X_t, M_t}, we may want to account for having access to a background dataset D_b = {X_b} without measurements but with molecular graphs.

The desired training loop now allows us to potentially pre-train or jointly train a model which can learn from both sources of data.

Our targeted output is a model P(m|x) = ∫_h P(m|h) P(h|x) dh that is applied to a test set and works much better after having ingested all data available to us.
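As a minimal sketch of this decomposition (hypothetical class names; a plain feed-forward encoder on pre-featurized inputs stands in for a graph net, and a deterministic encoder collapses the integral over h to a point estimate):

import torch
import torch.nn as nn

class Representation(nn.Module):
    """Stand-in for P(h|x): maps a (featurized) molecule x to hidden features h."""
    def __init__(self, in_dim=16, hidden_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())

    def forward(self, x):
        return self.encoder(x)

class MeasurementHead(nn.Module):
    """Stand-in for P(m|h): a probabilistic linear model on top of h."""
    def __init__(self, hidden_dim=32):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, 1)
        self.log_sigma = nn.Parameter(torch.zeros(1))

    def condition(self, h):
        return torch.distributions.Normal(self.mean(h), self.log_sigma.exp())

class SupervisedNet(nn.Module):
    """Composes the two parts into P(m|x); they can be trained jointly or in phases."""
    def __init__(self, representation, measurement):
        super().__init__()
        self.representation = representation
        self.measurement = measurement

    def loss(self, x, m):
        h = self.representation(x)
        return -self.measurement.condition(h).log_prob(m).mean()

# toy usage with random features in place of molecular graphs
net = SupervisedNet(Representation(), MeasurementHead())
x, m = torch.randn(8, 16), torch.randn(8, 1)
print(net.loss(x, m))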

In this issue/thread, I suggest we link to the code and discuss how to create this loop concretely, based on an application example with molecules.

Missing pieces:

  • details of the API for training
  • details of the API for testing
  • metrics for testing
  • ....
@karalets (Collaborator, Author) commented Mar 27, 2020

@yuanqing-wang Can you please comment on how this matches your thoughts and if not what we should change in the overall desiderata?

Then we can talk about how you intend to or already have structured this.

@yuanqing-wang (Member) commented

@karalets

I like your idea of the overall structure. Separating x -> h and h -> m sounds like a reasonable thing to do since we can play with each part afterwards.

Meanwhile for training and testing I guess we only need something as simple as functions that take models, weights, and return metrics. I'll put stuff here: https://github.com/choderalab/pinot/blob/master/pinot/app/utils.py to incorporate both x -> h and h -> m.

@karalets (Collaborator, Author) commented Mar 27, 2020

@yuanqing-wang what do you mean by weights?
I would abstract away from weights and think about model parameters or model posteriors quite independently; to a first approximation, it should be up to the model class to decide what it wants to serialize.

And in utils I don't see the concrete link to dataset creation, i.e. splits of 20% or whatnot.
I suggest also accounting for: a validation set, data loaders, an interface for passing an arbitrary model class with a defined API into the trainer/tester, ...

I would prefer that function to become a component with a specified API for this problem: we pass in models that conform to the API, push a button, and get back a few metrics, e.g. test log-likelihood.

Could you build a loop and an experiment file which, for the simplest off-the-shelf model, does exactly that and runs the entire thing?

I.e. it would be good if we get to an abstraction that allows us to define an experiment as follows (or similar); my main point is to modularize heavily.

def experiment_1(args):
    model = Model1
    dataset_train = ...
    dataset_background = ....
    hyperparameters = args....
    out_path = args.experiment_path

    # if this is semi-supervised do this;
    # if it were not semi-supervised there could also be a run_experiment(...)
    # that only does the other stuff or so

    results = run_ss_experiment(...)
    plot_figures(results, out_path)

And in order to test this, there should be, from the beginning, a concrete instance of such an experiment that one can run.

@yuanqing-wang (Member) commented

The data utils are supplied separately here: https://github.com/choderalab/pinot/blob/master/pinot/data/utils.py

@karalets (Collaborator, Author) commented

Cool, can we have an experiment file that brings everything together and executes a full example of it all, similar to what I described above?

@yuanqing-wang (Member) commented

Working on it

@karalets (Collaborator, Author) commented

@karalets

I like your idea of the overall structure. Separating x -> h and h -> m sounds like a reasonable thing to do since we can play with each part afterwards.

Meanwhile for training and testing I guess we only need something as simple as functions that take models, weights, and return metrics. I'll put stuff here: https://github.com/choderalab/pinot/blob/master/pinot/app/utils.py to incorporate both x -> h and h -> m.

Just to clarify: I do suggest separating them, not necessarily in the model, but rather by accounting for the existence of both, so that maybe they are trained separately, maybe jointly; in any case they need to have a consistent API for the data each part needs to see.
In fact, I believe building both into a joint model will work best, but we still need to have datasets in there that can supervise each aspect.

Consider this an instance of data oriented programming, rather than a deep learning model with different phases.

@yuanqing-wang (Member) commented

@karalets

I incorporated your idea here: https://github.com/choderalab/pinot/blob/master/pinot/net.py

and the training pipeline now looks like https://github.com/choderalab/pinot/blob/master/pinot/app/train.py

Let me know your thoughts on this.

@karalets (Collaborator, Author) commented

@karalets

I incorporated your idea here: https://github.com/choderalab/pinot/blob/master/pinot/net.py

and the training pipeline now looks like https://github.com/choderalab/pinot/blob/master/pinot/app/train.py

Let me know your thoughts on this.

Great start!

In the Net class I would make the representation and parametrization objects concrete.
I.e. you can play the inheritance game and create explicit classes that inherit from Net and have a concrete form; otherwise you do not gain much here.
I would also suggest not calling the top layer parametrization, but rather something like regressor_layer or measurement_model, as opposed to the other component, the more lucid representation_layer or representation that you currently use; parametrization is a pretty misleading name.

Regarding the loop:
I would still recommend factoring out an experiment class which has some more modularity.

I.e. in your current loop you do a lot of things in one larger script:
defining the model layers, building the model, training, etc.
In a better universe, training and experiment setup are factored out.

Also, unlike the suggestion above, you currently do not have the potential for semi-supervised learning in there, even if you wanted to use it.

Think about wanting to define an experiment which can differ in the following ways:

  • use 20% more training data, but the same settings otherwise
  • use or do not use semi-supervised data, same otherwise
  • use a particular semi-supervised background dataset or another one, but the same main training set
  • try the same data settings but different models
  • play with hyperparameter selection for each experiment
  • get new metrics for all of the versions of the above when you have pre-trained models lying around
  • have new test data that may arrive
  • think about a joint model over representation and regression vs a disjoint model, how can you still do all you want?
  • ...

Your experiment runner and trainer etc. should make such changes easy and clear; I suggest you think backwards from the results you anticipate wanting, to the structure needed here.

As I said, I recommend factoring things out a bit more than you have, but this is surely a good direction.
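One illustrative way to expose the variations above as arguments, so that each bullet maps to a flag (a hedged argparse sketch with made-up flag names, not the pinot CLI):

import argparse

parser = argparse.ArgumentParser(description="illustrative experiment arguments")
parser.add_argument("--train-fraction", type=float, default=0.8)  # e.g. bump this to use 20% more training data
parser.add_argument("--background-data", type=str, default=None)  # path to an unlabeled background set, or None for no semi-supervision
parser.add_argument("--model", type=str, default="gcn")           # same data settings, different models
parser.add_argument("--hidden-dim", type=int, default=128)        # hyperparameter selection per experiment
parser.add_argument("--pretrained", type=str, default=None)       # re-evaluate a pre-trained model lying around
parser.add_argument("--test-data", type=str, default=None)        # new test data that may arrive
args = parser.parse_args()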

@karalets (Collaborator, Author) commented Mar 28, 2020

One can also factor the loop out into:

  • experiment files contain all the settings (model settings, data settings, model hyper-parameters, storage paths and names for relevant output files) and receive inputs from args
  • trainer files receive experiment files as args and produce trained objects according to the settings
  • tester files run pre-trained objects on test data and run eval methods
  • eval methods receive metrics and predictions according to some API and do the stuff that generates numbers
  • plotting methods visualize the eval output

We can improve on this I am sure, but I would imagine making this modular will very quickly yield benefits.
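A rough sketch of what that factoring could look like (all names are hypothetical, not the current pinot layout; the settings live in one config object and each stage is its own function):

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ExperimentConfig:
    """All settings for one experiment; typically filled in from argparse args."""
    model_name: str = "gcn_baseline"
    train_path: str = "data/train.csv"
    background_path: Optional[str] = None   # unlabeled molecules; None means no semi-supervision
    hyperparameters: dict = field(default_factory=dict)
    out_path: str = "results/exp_001"

def train(config: ExperimentConfig):
    """Trainer: builds model and data from the config and returns a trained object."""
    ...

def test(trained_model, config: ExperimentConfig):
    """Tester: runs a pre-trained object on held-out data and returns predictions."""
    ...

def evaluate(predictions, targets):
    """Eval: turns predictions and targets into a dict of metrics (e.g. test log-likelihood)."""
    ...

def plot(metrics, out_path):
    """Plotting: visualizes the eval output."""
    ...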

@karalets (Collaborator, Author) commented

One cool example of factorization is the dataloaders etc. in PyTorch:

https://pytorch.org/docs/stable/data.html

You can define in separate classes things like:

  • the dataset
  • the normalization/preprocessing strategy
  • the properties of a dataloader which receives dataset class and preprocessing class as inputs

Then objects like this dataloader are passed to trainer classes which tie this to models and deliver batches for training. The dataloader class can be kept invariant to compare all kinds of models while having an auditable 'version' of the training data and pre-processing.
In our case, I would like the experimental setup and choices to be auditable by being stored in some experiment definition whose attributes can be changed to compare different experiments.

If you prefer not to use that much bespoke PyTorch, that is fine; I am just suggesting looking at examples of how modern ML software handles separation of concerns.
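For concreteness, a toy illustration of that separation with torch.utils.data (random tensors stand in for molecular graphs):

import torch
from torch.utils.data import Dataset, DataLoader

class MoleculeDataset(Dataset):
    """The dataset: holds raw inputs and (optionally) measurements."""
    def __init__(self, xs, ys, transform=None):
        self.xs, self.ys = xs, ys
        self.transform = transform      # the normalization/preprocessing strategy

    def __len__(self):
        return len(self.xs)

    def __getitem__(self, i):
        x = self.xs[i]
        if self.transform is not None:
            x = self.transform(x)
        return x, self.ys[i]

# toy data: 100 feature vectors and measurements
xs, ys = torch.randn(100, 16), torch.randn(100, 1)
normalize = lambda x: (x - x.mean()) / (x.std() + 1e-8)

dataset = MoleculeDataset(xs, ys, transform=normalize)
loader = DataLoader(dataset, batch_size=32, shuffle=True)  # the loader ties dataset and preprocessing together

for x_batch, y_batch in loader:
    pass  # a trainer class would consume these batches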

@yuanqing-wang (Member) commented Mar 28, 2020

In the Net class I would make the representation and parametrization objects concrete.
I.e. you can play the inheritance game and create explicit classes that inherit from Net and have a concrete form; otherwise you do not gain much here.

Not sure I follow; the objects are taken as parameters here.

@yuanqing-wang (Member) commented

I'll further factorize the experiment

@karalets (Collaborator, Author) commented

In the Net class I would make the representation and parametrization objects concrete.
I.e. you can play the inheritance game and create explicit classes that inherit from Net and have a concrete form; otherwise you do not gain much here.

Not sure I follow; the objects are taken as parameters here.

Yes, the objects are parameters, which is very nice and would already suffice if the experiment file factors things sufficiently. An option would be to create, for each combination of objects, a particular subclass, since other things may also change.

But that is unnecessary for now, as we can do all of that later; I am OK with it.

@yuanqing-wang (Member) commented

@karalets

Would something like this be a bit better?
https://github.com/choderalab/pinot/blob/master/pinot/app/train.py

@karalets (Collaborator, Author) commented

I am still unsure if you can do the cases described below.

Think about wanting to define an experiment which can differ in the following ways:

  • use 20% more training data, but the same settings otherwise
  • use or do not use semi-supervised data, same otherwise
  • use a particular semi-supervised background dataset or another one, but the same main training set
  • try the same data settings but different models
  • play with hyperparameter selection for each experiment
  • get new metrics for all of the versions of the above when you have pre-trained models lying around
  • have new test data that may arrive
  • think about a joint model over representation and regression vs a disjoint model, how can you still do all you want?
  • ...

@yuanqing-wang (Member) commented

These could be done by simply changing some args in the script:

  • use 20% more training data, but the same settings otherwise
  • try the same data settings but different models
  • play with hyperparameter selection for each experiment

@yuanqing-wang (Member) commented

The rest can be done by using the APIs, but with small twists in the scripts.

@karalets (Collaborator, Author) commented Mar 29, 2020

Ok, could you run a test-playthrough with an off-the-shelf semi-supervised model, i.e. the one from the paper?

@yuanqing-wang (Member) commented

Semi-supervised learning has not been implemented yet. Should that be our next step?

@karalets (Collaborator, Author) commented Mar 29, 2020

I believe it serves to make the pipeline more complete, and step 1 should be to have a robust skeleton of the pipeline plus examples of the types of workflows we may need.

I think you will understand my asks for more modularization a bit better once you build semi-supervised learning in there.

Thus: yes, let's proceed to having an example of SS training.

Ideally you could make two examples: one with and one without SS aspects, both using the same training data and as much of the same infrastructure as possible.
I.e. ideally the differences only live in the arguments passed to the experiment code.

@karalets (Collaborator, Author) commented Apr 27, 2020

Hey @yuanqing-wang, do we have at this point a little toy/sandbox example that one could test and run on a laptop in a closed loop? I'd like to play with some of the problems with NN training in a toy example that is easy to re-run.

@jchodera (Member) commented

Not quite yet, I think. We have the beginnings of this, but I think we're hoping @dnguyen1196 can dive in and get this part going!

@karalets added the good first issue (Good for newcomers) label May 13, 2020
@karalets (Collaborator, Author) commented

I am tagging @dnguyen1196 here to read through the beginning as this issue explains a lot of what is going on here.

@dnguyen1196 (Collaborator) commented

@karalets @yuanqing-wang

To recap (and please correct me), it seems that the goals when this issue was created were:

  • Implement functionality for testing with detailed specifications (see #3 (comment)); the current implementation covers some basic requirements.
  • Per this issue, one major change we might want relative to the current implementation is the ability to more deeply separate parameterization and representation (at the moment they appear to be trained jointly). We want functionality such as pre-training representations, combining fixed representations with a trainable parameterization, pretrained representations with a trainable parameterization, etc.

So within this issue, perhaps two subtasks remain:

  • Add more fine-grained testing capabilities to the current experiment infrastructure
  • More cleanly separate parameterization and representation.

@karalets (Collaborator, Author) commented

Hey,

So you understand the issue here quite well.
There are some subtleties with respect to how to specify the remaining subtasks.

So within this issue, perhaps two subtasks remain:

  • Add more fine-grained testing capabilities to the current experiment [infrastructure]

Absolutely correct. We need to be able, as I have described above, to add "background" data to inform the representation and train the whole thing nicely together.
In addition, I would argue, as mentioned in issue #26, that we should also first individually test components that do unsupervised or self-supervised representation learning, so we can target a reasonable set of things to plug in here.
However, in the literature this is oftentimes also treated as a joint training process over a graphical model that has more or less evidence at some of the involved variables; see for instance https://arxiv.org/abs/1406.5298 and newer literature along those lines, https://arxiv.org/abs/1706.00400 .

(https://github.com/choderalab/pinot/blob/master/pinot/app/experiment.py)

  • More cleanly separate parameterization and representation.

I would not go that route quite yet; I would prefer to be agnostic as to whether the model makes these things communicate uncertainty or not. There may be model classes that have their own way of incorporating one or more of the variables.
Imagine you have a net class with a method net.train(X, Y); when you set Y=None it just updates the parts it needs.
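A minimal sketch of that idea (hypothetical names; it assumes the representation part exposes some unsupervised objective, e.g. a reconstruction or self-supervised term):

class SemiSupervisedNet:
    """Hypothetical net that updates only the parts it can, given the data it sees."""
    def __init__(self, representation, regressor, optimizer):
        self.representation = representation
        self.regressor = regressor
        self.optimizer = optimizer

    def train_step(self, X, Y=None):
        self.optimizer.zero_grad()
        if Y is None:
            # background data: only the representation gets a training signal
            loss = self.representation.unsupervised_loss(X)
        else:
            # labeled data: supervise the full pipeline X -> h -> m
            h = self.representation(X)
            loss = -self.regressor.condition(h).log_prob(Y).mean()
        loss.backward()
        self.optimizer.step()
        return loss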

Another model may really be to hackily pretrain two separate objectives, one just for the representation and one for the measurement term, which are then plugged together correctly according to the degree of supervision in the observed tuple.

The shared API in the infrastructure should make both types of workflows usable, so I would focus on that API and infrastructure first, with a concrete example on real data.

I envision that first pre-training some representation based on background data and then finetuning it on labeled data is ok as a start, but keep in mind we may want to train jointly later with a more rigorous treatment of semi-supervised learning.

We should discuss and iterate on a concrete version of this more, but we also need a separate process to just evaluate the different unsupervised models as mentioned in #26 .

@dnguyen1196 (Collaborator) commented May 13, 2020

Absolutely correct. We need to be able, as I have described above, to add "background" data to inform the representation and train the whole thing nicely together.

What do you mean by this @karalets ? Is the following interpretation correct? For example, say we have 1000 compounds with their associated properties. We actually first use this as "background" data where we train, for example, an unsupervised representation so that we get a "reasonable" representation first (and not touch the parameterization). And then after we have obtained this reasonable representation, we train both the representation and parameterization jointly on the prediction task (supervised).

@karalets (Collaborator, Author) commented May 14, 2020

Absolutely correct. We need to be able, as I have described above, to add "background" data to inform the representation and train the whole thing nicely together.

What do you mean by this @karalets ? Is the following interpretation correct? For example, say we have 1000 compounds with their associated properties. We actually first use this as "background" data where we train, for example, an unsupervised representation so that we get a "reasonable" representation first (and not touch the parameterization). And then after we have obtained this reasonable representation, we train both the representation and parameterization jointly on the prediction task (supervised).

Sorry, to be precise:
By "background data" I mean data for which we only have graphs, not the measurements/properties, i.e. background molecules that are not the data we are collecting measurements for, but we know exist as molecules.

Intuitively: we need graphs to train "representations", and matched 'measurements' to train likelihoods/observation terms ("parametrizations", although I prefer to fade this term out).

In my world we could consider all of this to be training data, but sometimes we only observe X and sometimes we observe the tuple (X, Y) to train our models, and we want to make the best of both.
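A toy illustration of the two kinds of training data (random feature vectors standing in for graphs, with made-up sizes):

import torch

# labeled data D_t = {X_t, M_t}: graphs with measured properties
X_t = torch.randn(500, 16)     # 500 molecules we assayed
M_t = torch.randn(500, 1)      # their measurements

# background data D_b = {X_b}: graphs only, no measurements
X_b = torch.randn(10000, 16)   # molecules we merely know exist

# one way to mix them in a single loop: yield (x, y) with y=None for background molecules
def training_stream():
    for x, m in zip(X_t, M_t):
        yield x, m
    for x in X_b:
        yield x, None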

@dnguyen1196 (Collaborator) commented May 14, 2020

@karalets @yuanqing-wang

Intuitively: we need graphs to train "representations", and matched 'measurements' to train likelihoods/observation terms ("parametrizations", although I prefer to fade this term out).
In my world we could consider all of this to be training data, but sometimes we only observe X and sometimes we observe the tuple (X, Y) to train our models, and we want to make the best of both.

OK, I see your point now. In that regard, I think we might need to modify two interfaces; let me know what you think and whether I should start a new issue/discussion on this.

  1. Net
     Right now net.loss(g, y) takes in two arguments:

     def loss(self, g, y):
         distribution = self.condition(g)
         return -distribution.log_prob(y)

So we can modify this function so that when y = None, we only compute a "loss" for the representation layer (see the sketch at the end of this comment).

  2. For the experiment.py interface, I think we have two options:

2a. Add TrainUnsupervised, TestUnsupervised, etc. (basically, for every current supervised training/testing class we need a corresponding class for unsupervised training). This will probably repeat a lot of code, but supervised and unsupervised training will involve different optimizers and potentially very different choices of hyperparameters. And if we have separate unsupervised and supervised classes, we can have another class that combines the supervised and unsupervised components.

2b. Modify the current Train and Test classes so that they accommodate both unsupervised and supervised training. This will involve modifying the current constructor to take more arguments (an optimizer for unsupervised vs. supervised training, hyperparameters for unsupervised training). Within the class implementation, more care is needed to make sure the training/testing steps happen in the right order.

I think 2a is better; although we repeat more code, the modularity allows us to do more fine-grained training/testing.
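For reference, a minimal sketch of the y = None branch for the loss above (assuming the net holds a representation layer that exposes some unsupervised objective, e.g. a graph reconstruction term; the method name unsupervised_loss is hypothetical):

def loss(self, g, y=None):
    if y is None:
        # background data: only a representation-level ("unsupervised") loss
        return self.representation.unsupervised_loss(g)
    # labeled data: the usual negative log-likelihood of the measurement
    distribution = self.condition(g)
    return -distribution.log_prob(y)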
