
VAMPnet partial fit #130

Closed
pl992 opened this issue Mar 11, 2021 · 14 comments

Comments

@pl992

pl992 commented Mar 11, 2021

I'd like to use VAMPNet on a large amount of data. I'm coming from PyEMMA, where managing this is easy thanks to pyemma.coordinates.source. I see that deeptime lacks this function, but I do see a partial_fit method on almost all estimators. My question is how this can be used with VAMPNet. The fit and partial_fit methods seem to do different things: the former, for instance, also asks for validation data and a number of epochs, while the latter is satisfied with just the training data.
Another thing is whether I should fetch the model at the end. Right now I'm trying to loop over my data in the following way:

import torch
import torch.nn as nn
import numpy as np
from deeptime.data import TimeLaggedDataset
from deeptime.util.torch import MLP
from torch.utils.data import DataLoader
from deeptime.decomposition.deep import VAMPNet

lobe = MLP(units=ns, nonlinearity=nn.ReLU)
vampnet = VAMPNet(lobe=lobe, learning_rate=3)
# paths is just a list of strings containing the path to .npy data
for path in paths:
    data = np.load(path)
    dataset = TimeLaggedDataset.from_trajectory(lagtime=500, data=data.astype(np.float32))
    lobe = MLP(units=ns, nonlinearity=nn.ReLU)
    vampnet = VAMPNet(lobe=lobe, learning_rate=1e-4)
    vampnet.partial_fit((dataset.data, dataset.data_lagged))
model = vampnet.fetch_model()  # I'm pretty sure this is not right at all as most of the code before

Note that I'm not using train_data and val_data as you did in the documentation, since partial_fit doesn't require them, but I'm pretty sure I should somehow.
I think the documentation doesn't make it clear how to deal with this kind of problem.
Thank you very much for your time.

@clonker
Member

clonker commented Mar 11, 2021

Hi, we are soon going to add an MD example for VAMPNets (tagging @amardt), then hopefully things become clearer. fit basically loops over chunks of your data and calls partial_fit for each chunk (where a chunk is a block of instantaneous data with the corresponding time-lagged data). So in that regard your code isn't all that wrong.
However, it will probably be very hard to train, since you are using an entire file as one training batch. One of the ingredients that make deep learning work is so-called minibatching: you take your dataset and divide it into chunks which are then used to train the network. This is what happens in loader_train = DataLoader(train_data, batch_size=512, shuffle=True) in the documentation. It takes chunks of 512 frames with the corresponding 512 frames of time-lagged data from a shuffled dataset.
In principle it would be good to also shuffle between the numpy files, but that would probably kill performance due to IO (repeated opening/closing/reading of files). One might be able to work around this with memmaps if it becomes an issue downstream.
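
A minimal sketch of what the memmap idea could look like, assuming paths is the list of .npy trajectory files from your snippet (untested, just to illustrate the direction):

import numpy as np
from random import randrange

lagtime = 500
# mmap_mode="r" maps each file instead of reading it into memory up front,
# so drawing random frames only touches the parts of the files that are accessed.
trajs = [np.load(p, mmap_mode="r") for p in paths]

def random_pair():
    traj = trajs[randrange(len(trajs))]
    i = randrange(len(traj) - lagtime)
    return (np.asarray(traj[i], dtype=np.float32),
            np.asarray(traj[i + lagtime], dtype=np.float32))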

Here is some boilerplate code for how you could work with your large data:

paths_train = paths[:-1]
path_val = paths[1:]

data_val = TimeLaggedDataset.from_trajectory(lagtime=500, data=np.load(path_val).astype(np.float32))
loader_val = DataLoader(data_val, batch_size=len(data_val), shuffle=False)

lobe = MLP(units=ns, nonlinearity=nn.ReLU)
vampnet = VAMPNet(lobe=lobe, learning_rate=1e-4)

from random import shuffle

for _ in range(n_rounds):
    for path in shuffle(paths):
        data = np.load(path)
        dataset = TimeLaggedDataset.from_trajectory(lagtime=500, data=data.astype(np.float32))
        loader_train = DataLoader(dataset, batch_size=512, shuffle=True)
        
        vampnet.fit(loader_train, n_epochs=80, validation_loader=loader_val)

model = vampnet.fetch_model()

Note that I haven't actually run this, so there might be some typos in there. Also, with regard to validation, you might need to be more careful depending on what your data looks like.

@pl992
Author

pl992 commented Mar 11, 2021

Thank you for the quick answer! So every time the fit function is called it updates the network rather than overwriting it, right?
I'm just a little confused by your first two lines:

paths_train = paths[:-1]
path_val = paths[1:]

Is it a typo, or did you want to provide a list of trajectories? It looks like a typo, although it's quite consistent between the two lines, so I'm wondering whether I misunderstood something.
Thank you again, that is really useful!

@clonker
Member

clonker commented Mar 11, 2021

Oops yes, it should be

paths_train = paths[:-1]
path_val = paths[-1]

so this would take all but the last file for training and validate on the last file. fit indeed doesn't overwrite the model (which it would in any other estimator, so there is a small inconsistency here...). Eventually I want to provide an adapter between the PyEMMA source object and PyTorch datasets (or even implement the PyEMMA source as such), then everything should work seamlessly, i.e., create the source with PyEMMA and then fit with deeptime.

Also, it should be for path in shuffle(paths_train), sorry for the inconsistencies 🙂
The idea is just to create a non-overlapping split of the data so that you perform training on one portion and use the other for validation. That way you can detect overfitting (by comparing training and validation curves).
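
Putting those corrections together, the outer loop would look roughly like this (a sketch; note that random.shuffle shuffles the list in place and returns None, so it has to be called on its own line rather than inside the for statement):

from random import shuffle

paths_train = list(paths[:-1])   # all but the last file for training
path_val = paths[-1]             # last file held out for validation

for _ in range(n_rounds):
    shuffle(paths_train)         # in-place shuffle of the training files for this round
    for path in paths_train:
        ...                      # build loader_train from path and call vampnet.fit as above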

@pl992
Author

pl992 commented Mar 11, 2021

Ok, thanks! No problem at all! But then, if I wanted to use multiple trajectories for validation (and not just one), would it be a good idea to put loader_val inside the loop and change its path randomly (disjoint from paths_train, of course)? Because I see you cannot pass a list of trajectories to TimeLaggedDataset.

@clonker
Member

clonker commented Mar 11, 2021

For validation you don't have to split it up, I would always validate against the full validation set. In the case of multiple trajectories I suggest doing something like this (outside the loop):

class ValidationSet:

    def __init__(self, files, lagtime):
        # load each validation trajectory; cast to float32 to match the network's dtype
        self.data = [np.load(f).astype(np.float32) for f in files]
        self.lagtime = lagtime

    def __getitem__(self, item):
        # instantaneous and time-lagged view of the item-th trajectory
        return self.data[item][:-self.lagtime], self.data[item][self.lagtime:]

    def __len__(self):
        return len(self.data)

class ValidationLoader:

    def __init__(self, val_data):
        self.val_data = val_data
        # batch_size=1 yields one whole trajectory at a time
        self.loader_val_internal = DataLoader(val_data, batch_size=1, shuffle=False)

    def __len__(self):
        return len(self.val_data)

    def __iter__(self):
        for X, Y in self.loader_val_internal:
            yield X.squeeze(), Y.squeeze()

val_data = ValidationSet(paths_val, 200)
loader_val = ValidationLoader(val_data)

The squeeze bit is important to get the right shape out of the loader.
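
Roughly, assuming a single validation trajectory with 10000 frames, 3 features and lagtime 200 (numbers only for illustration), the shapes behave like this:

for X, Y in DataLoader(val_data, batch_size=1, shuffle=False):
    print(X.shape)            # torch.Size([1, 9800, 3]): batch_size=1 adds a leading dimension
    print(X.squeeze().shape)  # torch.Size([9800, 3]): the shape the lobe expects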

@clonker
Member

clonker commented Mar 11, 2021

That being said, I think it would be a useful addition for TimeLaggedDataset to also accept lists of trajectories so this kind of thing can be handled internally! I will keep it in mind.

@pl992
Author

pl992 commented Mar 12, 2021

Many thanks! It seems to be working perfectly! I do have another question, though. The script doesn't seem to use the GPU at all, and it's actually pretty slow. I can see that PyTorch detects the CUDA installation and the GPU correctly:

torch.cuda.is_available() : True

but when I regularly check the memory used by the script, for instance with torch.cuda.memory_allocated(0) or torch.cuda.memory_reserved(0), I get 0. Also, the GPU temperature reported by nvidia-smi stays stuck at 34-35 degrees.
I've followed your documentation and put this at the beginning of the Python script:

device = torch.device("cuda")
torch.backends.cudnn.benchmark = True
torch.set_num_threads(12)

and the device does seem to be set to cuda, but it isn't being used.

@clonker
Member

clonker commented Mar 12, 2021

Glad it did the trick! For the device, you actually discovered a bug in the documentation. When you set it in the VAMPNet constructor (as in estimator = VAMPNet(..., device=device)), then it should use the GPU.

@pl992
Author

pl992 commented Mar 12, 2021

I'm glad that this way I've helped a little bit. I've set this on the estimator, but I get this error:

Traceback (most recent call last):
  File "VAMPnet.py", line 72, in <module>
    vampnet.fit(loader_train,n_epochs=10,validation_loader=loader_val)
  File "/home/plongo/miniconda3/lib/python3.7/site-packages/deeptime/decomposition/deep/_vampnet.py", line 604, in fit
    self.partial_fit((batch_0.to(device=self.device), batch_t.to(device=self.device)),
RuntimeError: CUDA error: out of memory

Note that if I do the same on the CPU I don't get any error; the data is loaded as we discussed, so it fits comfortably in memory (~200 MB / 32 GB).

@clonker
Member

clonker commented Mar 12, 2021

This is indeed strange, does the error persist if you restart your machine? What does your nvidia-smi output look like? You could also try decreasing the batch size.

@pl992
Author

pl992 commented Mar 12, 2021

Sorry, you're absolutely right, it was my fault: without realizing it I was using the GPU memory elsewhere... although unfortunately now I get this problem:

Traceback (most recent call last):
  File "VAMPnet.py", line 72, in <module>
    vampnet.fit(loader_train,n_epochs=10,validation_loader=loader_val)
  File "/home/plongo/miniconda3/lib/python3.7/site-packages/deeptime/decomposition/deep/_vampnet.py", line 605, in fit
    train_score_callback=train_score_callback)
  File "/home/plongo/miniconda3/lib/python3.7/site-packages/deeptime/decomposition/deep/_vampnet.py", line 529, in partial_fit
    x_0 = self.lobe(batch_0)
  File "/home/plongo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/plongo/miniconda3/lib/python3.7/site-packages/deeptime/util/torch.py", line 91, in forward
    return self._sequential(inputs)
  File "/home/plongo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/plongo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/plongo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/plongo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 107, in forward
    exponential_average_factor, self.eps)
  File "/home/plongo/miniconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1670, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: Tensor for argument #2 'weight' is on CPU, but expected it to be on GPU (while checking arguments for cudnn_batch_norm)

I don't understand where I should load the data onto the GPU; from the documentation I thought it would be done automatically.

@clonker
Member

clonker commented Mar 12, 2021

Try this before creating the VAMPNet estimator: lobe = lobe.to(device=device)

Clearly the documentation is still lacking in this regard, sorry for that!
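
In other words, roughly (with ns as in your snippets above):

device = torch.device("cuda")
lobe = MLP(units=ns, nonlinearity=nn.ReLU).to(device=device)     # move the network weights to the GPU
vampnet = VAMPNet(lobe=lobe, learning_rate=1e-4, device=device)  # and have the estimator move the batches there too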

@pl992
Author

pl992 commented Mar 12, 2021

Thank you very much! That does the trick! I'm happy I could help a bit with debugging/checking the documentation while you helped me get things working.

@clonker
Member

clonker commented Mar 12, 2021

Perfect, I am glad it's resolved! Let me know if you have other issues downstream, I'm happy to help, and in the end it also helps to improve the library 🙂 For now I'll close the issue and make an update to the docs soon.

clonker closed this as completed Mar 12, 2021