Overall loop for training a deep net for molecules here #3
Comments
@yuanqing-wang Can you please comment on how this matches your thoughts and, if not, what we should change in the overall desiderata? Then we can talk about how you intend to, or already have, structured this.
I like your idea of the overall structure. Separating x -> h and h -> m sounds like a reasonable thing to do since we can play with each part afterwards. Meanwhile for training and testing I guess we only need something as simple as functions that take models, weights, and return metrics. I'll put stuff here: https://github.com/choderalab/pinot/blob/master/pinot/app/utils.py to incorporate both x -> h and h -> m.
@yuanqing-wang what do you mean by weights? And in utils I don't see the concrete link to dataset creation, i.e. the 20% test split or whatnot. I would prefer if that function became a thing with a specified API for this problem, into which we can pass models that conform to an API, and then push a button and get a few metrics, i.e. test log likelihood. Could you build a loop and an experiment file which, for the simplest off-the-shelf model, does that and does the entire thing? I.e. it would be good if we get to an abstraction that allows us to define an experiment as follows or similar; my main point is to modularize heavily.
And in order to test that, there should be from the beginning a concrete instance of such an experiment that one can run.
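A minimal sketch of what such a push-a-button experiment could look like; all names here (`run_experiment`, `model.loss`, `model.log_likelihood`) are hypothetical placeholders for discussion and are not taken from the pinot codebase:

```python
# Hypothetical "pass a model in, get metrics out" experiment function.
import random
import torch

def run_experiment(model, dataset, test_fraction=0.2, n_epochs=100, lr=1e-3):
    """Split the data, train the model, and return test metrics."""
    data = list(dataset)
    random.shuffle(data)
    n_test = int(len(data) * test_fraction)
    test, train = data[:n_test], data[n_test:]

    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_epochs):
        for x, m in train:
            optimizer.zero_grad()
            loss = model.loss(x, m)       # assumed: model exposes loss(x, m)
            loss.backward()
            optimizer.step()

    with torch.no_grad():
        test_ll = sum(model.log_likelihood(x, m) for x, m in test) / len(test)
    return {"test_log_likelihood": float(test_ll)}
```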
The data utils are supplied separately here: https://github.com/choderalab/pinot/blob/master/pinot/data/utils.py
Cool, can we have an experiment file that brings everything together and executes a full example of it all, similarly to what I described above?
Working on it.
Just to clarify: I do suggest separating them, not necessarily in the model, but rather accounting for the existence of both, so that maybe they are trained separately, maybe jointly, but in any case they need to have a consistent API for the data each part needs to see. Consider this an instance of data-oriented programming, rather than a deep learning model with different phases.
I incorporated your idea here: https://github.com/choderalab/pinot/blob/master/pinot/net.py and the training pipeline now looks like this: https://github.com/choderalab/pinot/blob/master/pinot/app/train.py. Let me know your thoughts on this.
Great start! In the Net class I would make the representation and parametrization objects concrete. Regarding the loop: in your current loop you do a lot of things in one larger script. Currently, also, unlike the suggestion above, you do not have the potential for semi-supervised learning in there, even if you wanted to use it. Think about wanting to define an experiment which can differ in the following ways:
Your experiment runner and trainer etc. should make such changes easy and clear. I suggest you think backwards from the results you anticipate wanting, to be able to get to the structure here. As I said, I recommend factoring things out a bit more than you have, but this is surely a good direction.
One can also factor the loop out into:

- experiment files contain all the settings (model setting, data settings, model hyper-parameters, storage paths and names for relevant output files) and receive inputs from args
- trainer files receive experiment files as args and produce trained objects according to settings
- tester files run pre-trained objects on test-data and run eval methods
- eval methods receive metrics and predictions according to some API and do stuff that generates numbers
- plotting methods visualize eval

We can improve on this I am sure, but I would imagine making this modular will very quickly yield benefits.
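A rough sketch of that split in one file, with stand-in `Trainer`/`Tester` stubs (these names and the argument set are illustrative assumptions, not existing pinot modules):

```python
# Sketch: experiment settings -> trainer -> tester -> eval/plotting.
import argparse

class Trainer:
    """Consumes experiment settings, produces a trained object."""
    def __init__(self, settings):
        self.settings = settings
    def run(self):
        print("training with", vars(self.settings))
        return {"weights": None}          # stand-in for a trained model

class Tester:
    """Runs a pre-trained object on test data and returns metrics."""
    def __init__(self, trained, settings):
        self.trained, self.settings = trained, settings
    def run(self):
        return {"rmse": float("nan")}     # stand-in for real eval methods

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="experiment settings")
    parser.add_argument("--data", default="esol")
    parser.add_argument("--model", default="gcn")
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--out", default="results.csv")
    settings = parser.parse_args()

    trained = Trainer(settings).run()
    metrics = Tester(trained, settings).run()
    print(metrics)                        # plotting/eval methods would consume this
```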
One cool example of factorization is the dataloaders etc. in pytorch: https://pytorch.org/docs/stable/data.html. You can define in separate classes things like:
Then objects like this dataloader are passed to trainer classes which tie this to models and deliver batches for training. The dataloader class can be kept invariant to compare all kinds of models while having an auditable 'version' of the training data and pre-processing. If you prefer not to use as much bespoke pytorch that is fine, I am just suggesting looking at examples of how modern ML software works on separation of concerns.
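A small example of that PyTorch-style separation: the dataset and batching logic live in their own classes, independent of any model. `MoleculeDataset` and the toy tensors below are illustrative, not pinot classes or real data:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class MoleculeDataset(Dataset):
    """Holds (graph, measurement) pairs; knows nothing about any model."""
    def __init__(self, graphs, measurements):
        self.graphs = graphs
        self.measurements = measurements

    def __len__(self):
        return len(self.graphs)

    def __getitem__(self, idx):
        return self.graphs[idx], self.measurements[idx]

# Toy stand-in data so the example runs; in practice graphs would be DGL graphs etc.
graphs = [torch.randn(5, 8) for _ in range(100)]
measurements = torch.randn(100)
loader = DataLoader(MoleculeDataset(graphs, measurements),
                    batch_size=32, shuffle=True)

# A trainer only needs to iterate over batches, independent of the model:
for x, m in loader:
    pass
```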
Not sure if I followed. The objects are taken as parameters here.
I'll further factorize the experiment.
Yes, the objects are parameters, and that is very nice and would already do if the experiment file factors things sufficiently. An option would be to just create, for each combination of objects, a particular subclass, as other things may also change. But that is unnecessary for now as we can do all of that later; I am ok with it.
Would something like this be a bit better?
I am still unsure if you can do the cases described below.
These could be done by simply changing some args in the script.
The rest can be done by using the APIs, but with small twists in the scripts.
Ok, could you run a test-playthrough with an off-the-shelf semi-supervised model, i.e. the one from the paper?
Semi-supervised learning has not been implemented yet. Should that be our next step?
I believe it serves the utility of making the pipeline more complete, and step 1 should be to have a robust skeleton of the pipeline and examples of the types of workflows we may need. I think you will understand my asks for more modularization a bit better when you build semi-supervised learning in there. Thus: yes, let's proceed to having an example of SS training. Ideally you could make two examples: one with and one without SS aspects, both using the same training data and as much of the same infrastructure as possible.
Hey @yuanqing-wang, do we have at this point a little toy/sandbox example that one could test and run on a laptop in a closed loop? I'd like to play with some of the problems with NN training in a toy example that is easy to re-run.
Not quite yet, I think. We have the beginnings of this, but I think we're hoping @dnguyen1196 can dive in and get this part going!
I am tagging @dnguyen1196 here to read through from the beginning, as this issue explains a lot of what is going on here.
To recap and please correct me, it seems that the goals when this issue was created were:
So within this issue, perhaps two subtasks remain:
Hey, so you understand the issue here quite well.
Absolutely correct. We need to be able, as I have described above, to add "background" data to inform the representation and train the whole thing nicely together. (https://github.com/choderalab/pinot/blob/master/pinot/app/experiment.py)
I would not go that route quite yet; I would prefer to be agnostic about whether the model makes these things communicate uncertainty or not. There may be model classes that have their own way of incorporating one or more variables. Another approach may really be to hackily pretrain two separate objectives, one just for representation and one for the measurement term, which are then plugged together correctly according to the degree of supervision in the observed tuple. The shared API in the infrastructure should make both types of workflows usable, so I would focus on that API and infrastructure first with a concrete example with real data. I envision that first pre-training some representation based on background data and then finetuning it on labeled data is ok as a start, but keep in mind we may want to train jointly later with a more rigorous treatment of semi-supervised learning. We should discuss and iterate on a concrete version of this more, but we also need a separate process to just evaluate the different unsupervised models, as mentioned in #26.
What do you mean by this @karalets? Is the following interpretation correct? For example, say we have 1000 compounds with their associated properties. We actually first use this as "background" data where we train, for example, an unsupervised representation so that we get a "reasonable" representation first (and not touch the parameterization). And then, after we have obtained this reasonable representation, we train both the representation and parameterization jointly on the prediction task (supervised).
Sorry, to be precise: intuitively, we need graphs to train "representations", and matched 'measurements' to train likelihoods/observation terms ("parametrizations", although I prefer to fade this term out). In my world we could consider all this to be training data, but sometimes we only observe the graphs without matching measurements.
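One possible data convention that would keep the API consistent for both parts (this is an assumption for discussion, not the current pinot format): every record is a (graph, measurement) tuple, with the measurement set to None for background data that only has the graph.

```python
# Toy placeholder records; real graphs would be DGL graphs etc.
labeled = [("graph_1", 0.42), ("graph_2", -1.3)]
background = [("graph_3", None), ("graph_4", None)]

def split_by_supervision(records):
    """Route each record to the representation and/or measurement objectives."""
    supervised = [(x, m) for x, m in records if m is not None]
    unsupervised = [x for x, _ in records]      # graphs are always usable
    return supervised, unsupervised

supervised, unsupervised = split_by_supervision(labeled + background)
```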
Ok I see your point now. In that regard, I think we might need to modify two interfaces, let me know what you think and if I should start a new issue/discussion on this.
So we can modify this function so that for the case when
2a. Add TrainUnsupervised, TestUnsupervised, etc. (basically, for every current supervised training/testing class, we need a corresponding class for unsupervised training). This will probably repeat a lot of code, but supervised and unsupervised training will involve different optimizers and potentially very different choices of hyperparameters. If we have separate unsupervised and supervised classes, we can then have another class that combines the supervised and unsupervised components together.

2b. Modify the current Train and Test class so that it accommodates both unsupervised and supervised training. This will involve modifying the current constructor to take in more arguments (optimizer for unsupervised training vs. supervised training, hyperparameters for unsupervised training). And within the class implementation, more care is needed to make sure the training/testing steps are in the right order.

I think 2a is better; although we repeat more code, the modularity allows us to do more fine-grained training/testing.
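A rough sketch of what option 2a could look like; all class names and the `net.unsupervised_loss` / `net.loss` interface are placeholders for discussion, not existing pinot classes:

```python
import torch

class TrainUnsupervised:
    """Owns its own optimizer/hyperparameters for the unsupervised phase."""
    def __init__(self, net, unlabeled_data, lr=1e-3, n_epochs=100):
        self.net, self.data, self.n_epochs = net, unlabeled_data, n_epochs
        self.opt = torch.optim.Adam(net.representation.parameters(), lr=lr)

    def train(self):
        for _ in range(self.n_epochs):
            for x in self.data:
                self.opt.zero_grad()
                self.net.unsupervised_loss(x).backward()  # assumed interface
                self.opt.step()
        return self.net

class Train:
    """Supervised counterpart with its own optimizer/hyperparameters."""
    def __init__(self, net, labeled_data, lr=1e-3, n_epochs=100):
        self.net, self.data, self.n_epochs = net, labeled_data, n_epochs
        self.opt = torch.optim.Adam(net.parameters(), lr=lr)

    def train(self):
        for _ in range(self.n_epochs):
            for x, m in self.data:
                self.opt.zero_grad()
                self.net.loss(x, m).backward()            # assumed interface
                self.opt.step()
        return self.net

class TrainSemiSupervised:
    """Combines the two components: unsupervised phase first, then supervised."""
    def __init__(self, net, unlabeled_data, labeled_data):
        self.stages = [TrainUnsupervised(net, unlabeled_data),
                       Train(net, labeled_data)]

    def train(self):
        for stage in self.stages:
            net = stage.train()
        return net
```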
We have two desiderata:
We can decompose those two tasks as follows:
We want to have a model of a representation `P(h|x)` which predicts hidden features `h` from a molecular graph `x`, ideally a joint model `P(h,x)` that is a joint density with arbitrary conditioning.

We furthermore want a model of the measurements `m` we care about given a molecular representation `h`, expressed as `P(m|h)`. In simple regression this could be a probabilistic linear model on top of the outputs of the representation model `P(h|x)`.

We may want to train them both jointly, separately, or in phases. If we pre-train `P(h|x)`, that is called semi-supervised learning.

In a training loop to solve the task of regressing `m` from a training set `D_t = {X_t, M_t}`, we may want to account for having access to a background dataset `D_b = {X_b}` without measurements but with molecular graphs. The desired training loop now allows us to potentially pre-train or jointly train a model which can learn from both sources of data.

Our targeted output is a model `P(m|x) = int_h P(m|h) P(h|x) dh` that is applied to a test set and works much better after having ingested all data available to us.

In this issue/thread, I suggest we link to the code and discuss how to create this loop concretely based on a concrete application example with molecules.
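To make "learn from both sources of data" concrete, one possible joint objective is sketched below (an assumption for discussion, not a commitment to a particular model): the supervised marginal likelihood on `D_t` plus an unsupervised term on `D_b`, weighted by a hypothetical hyperparameter lambda. The unsupervised term assumes the representation side also defines a density over graphs, i.e. the "joint model P(h,x)" case above.

```latex
\mathcal{L}(\theta)
  = \sum_{(x, m) \in D_t} \log \int_h P_\theta(m \mid h)\, P_\theta(h \mid x)\, dh
  \; + \; \lambda \sum_{x \in D_b} \log P_\theta(x)
```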
Missing pieces: