
VAMPnet partial fit #130

Closed
pl992 opened this issue Mar 11, 2021 · 14 comments

Comments

@pl992

pl992 commented Mar 11, 2021

I'd like to use VAMPNet on a large amount of data. I'm coming from PyEMMA, where managing this is easy thanks to pyemma.coordinates.source. I see that deeptime lacks this function, but I do see a partial_fit method on almost all estimators. My question is how this can be used with VAMPNet. The fit and partial_fit methods seem to do different things: the former, for instance, also asks for validation data and a number of epochs, while the latter is satisfied with just the training data.
Another thing is whether I should fetch the model at the end. Right now I'm trying to loop over my data in the following way:

import torch
import torch.nn as nn
import numpy as np
from deeptime.data import TimeLaggedDataset
from deeptime.util.torch import MLP
from torch.utils.data import DataLoader
from deeptime.decomposition.deep import VAMPNet

lobe = MLP(units=ns, nonlinearity=nn.ReLU)
vampnet = VAMPNet(lobe=lobe, learning_rate=3)
# paths is just a list of strings containing the path to .npy data
for path in paths:
    data = np.load(path)
    dataset = TimeLaggedDataset.from_trajectory(lagtime=500, data=data.astype(np.float32))
    lobe = MLP(units=ns, nonlinearity=nn.ReLU)
    vampnet = VAMPNet(lobe=lobe, learning_rate=1e-4)
    vampnet.partial_fit((dataset.data, dataset.data_lagged))
model = vampnet.fetch_model()  # I'm pretty sure this is not right at all as most of the code before

Note that I'm not using train_data and val_data as you did in the documentation, since partial_fit doesn't require them, but I'm pretty sure I should somehow.
I think the documentation doesn't make it clear how to deal with this kind of problem.
Thank you very much for your time.

@clonker
Member

clonker commented Mar 11, 2021

Hi, we are soon going to add an MD example for VAMPNets (tagging @amardt), then hopefully things become clearer. fit basically loops over chunks of your data and calls partial_fit for each chunk (where a chunk is a block of instantaneous data with the corresponding time-lagged data). So in that regard your code isn't all that wrong.
However, it will probably be very hard to train, since you are using an entire file as one training batch. One of the ingredients that make deep learning work is so-called minibatching: you take your dataset and divide it into chunks which are then used to train the network. This is what happens in loader_train = DataLoader(train_data, batch_size=512, shuffle=True) in the documentation. It takes chunks of 512 frames with the corresponding 512 frames of time-lagged data from a shuffled dataset.
In principle it would be good to also shuffle between the numpy files, but that would probably kill performance due to IO (repeated opening/closing/reading of files). One might be able to work around this with memmaps if it becomes an issue downstream.
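
A minimal sketch of what the memmap idea could look like, assuming paths is the list of .npy trajectory files from your snippet (untested, just to illustrate the direction):

import numpy as np
from random import randrange

lagtime = 500
# mmap_mode="r" maps each file instead of reading it into memory up front,
# so drawing random frames only touches the parts of the files that are accessed.
trajs = [np.load(p, mmap_mode="r") for p in paths]

def random_pair():
    traj = trajs[randrange(len(trajs))]
    i = randrange(len(traj) - lagtime)
    return (np.asarray(traj[i], dtype=np.float32),
            np.asarray(traj[i + lagtime], dtype=np.float32))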

Here is some boilerplate code for how you could work with your large data:

paths_train = paths[:-1]
path_val = paths[1:]

data_val = TimeLaggedDataset.from_trajectory(lagtime=500, data=np.load(path_val).astype(np.float32))
loader_val = DataLoader(data_val, batch_size=len(data_val), shuffle=False)

lobe = MLP(units=ns, nonlinearity=nn.ReLU)
vampnet = VAMPNet(lobe=lobe, learning_rate=1e-4)

from random import shuffle

for _ in range(n_rounds):
    for path in shuffle(paths):
        data = np.load(path)
        dataset = TimeLaggedDataset.from_trajectory(lagtime=500, data=data.astype(np.float32))
        loader_train = DataLoader(dataset, batch_size=512, shuffle=True)
        
        vampnet.fit(loader_train, n_epochs=80, validation_loader=loader_val)

model = vampnet.fetch_model()

Note that I haven't actually run this, so there might be some typos in there. Also, with regard to validation, you might need to be more careful depending on what your data looks like.

@pl992
Author

pl992 commented Mar 11, 2021

Thank you for the quick answer! So every time the fit function is called it updates the network rather than overwriting it, right?
I'm just a little confused by your first two lines:

paths_train = paths[:-1]
path_val = paths[1:]

Is it a typo, or did you want to provide a list of trajectories? It looks like a typo, although it's quite consistent between the two lines, so I'm wondering whether I misunderstood something.
Thank you again, that is really useful!

@clonker
Member

clonker commented Mar 11, 2021

Oops yes, it should be

paths_train = paths[:-1]
path_val = paths[-1]

so this would take all but the last file for training and validate on the last file. fit indeed doesn't overwrite the model (which it would in any other estimator, so there is a small inconsistency here...). Eventually I want to provide an adapter between the PyEMMA source object and PyTorch datasets (or even implement the PyEMMA source as such), then everything should work seamlessly, i.e., create the source with PyEMMA and then fit with deeptime.

Also, it should be for path in shuffle(paths_train), sorry for the inconsistencies 🙂
The idea is just to create a non-overlapping split of the data so that you perform training on one portion and use the other for validation. That way you can detect overfitting (by comparing training and validation curves).
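
Putting those corrections together, the outer loop would look roughly like this (a sketch; note that random.shuffle shuffles the list in place and returns None, so it has to be called on its own line rather than inside the for statement):

from random import shuffle

paths_train = list(paths[:-1])   # all but the last file for training
path_val = paths[-1]             # last file held out for validation

for _ in range(n_rounds):
    shuffle(paths_train)         # in-place shuffle of the training files for this round
    for path in paths_train:
        ...                      # build loader_train from path and call vampnet.fit as above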

@pl992
Author

pl992 commented Mar 11, 2021

Ok, thanks! No problem at all! But then, if I wanted to use multiple trajectories for validation (and not just one), would it be a good idea to put loader_val inside the loop and change its path randomly (disjoint from paths_train, of course)? Because I see you cannot pass a list of trajectories to TimeLaggedDataset.

@clonker
Member

clonker commented Mar 11, 2021

For validation you don't have to split it up, I would always validate against the full validation set. In the case of multiple trajectories I suggest doing something like this (outside the loop):

class ValidationSet:

    def __init__(self, files, lagtime):
        # load each validation trajectory; cast to float32 to match the network's dtype
        self.data = [np.load(f).astype(np.float32) for f in files]
        self.lagtime = lagtime

    def __getitem__(self, item):
        # instantaneous and time-lagged view of the item-th trajectory
        return self.data[item][:-self.lagtime], self.data[item][self.lagtime:]

    def __len__(self):
        return len(self.data)

class ValidationLoader:

    def __init__(self, val_data):
        self.val_data = val_data
        # batch_size=1 yields one whole trajectory at a time
        self.loader_val_internal = DataLoader(val_data, batch_size=1, shuffle=False)

    def __len__(self):
        return len(self.val_data)

    def __iter__(self):
        for X, Y in self.loader_val_internal:
            yield X.squeeze(), Y.squeeze()

val_data = ValidationSet(paths_val, 200)
loader_val = ValidationLoader(val_data)

The squeeze bit is important to get the right shape out of the loader.
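
Roughly, assuming a single validation trajectory with 10000 frames, 3 features and lagtime 200 (numbers only for illustration), the shapes behave like this:

for X, Y in DataLoader(val_data, batch_size=1, shuffle=False):
    print(X.shape)            # torch.Size([1, 9800, 3]): batch_size=1 adds a leading dimension
    print(X.squeeze().shape)  # torch.Size([9800, 3]): the shape the lobe expects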

@clonker
Member

clonker commented Mar 11, 2021

That being said, I think it would be a useful addition for TimeLaggedDataset to also accept lists of trajectories so this kind of thing can be handled internally! I will keep it in mind.

@pl992
Author

pl992 commented Mar 12, 2021

Many thanks! It seems to be working perfectly! I do have another question, though. The script doesn't seem to use the GPU at all, and it's actually pretty slow. I can see that PyTorch detects the CUDA installation and the GPU correctly:

torch.cuda.is_available() : True

but when I regularly check the memory used by the script, for instance with torch.cuda.memory_allocated(0) or torch.cuda.memory_reserved(0), I get 0. Also, the GPU temperature reported by nvidia-smi stays stuck at 34-35 degrees.
I've followed your documentation and put this at the beginning of the Python script:

device = torch.device("cuda")
torch.backends.cudnn.benchmark = True
torch.set_num_threads(12)

and the device does seem to be set to cuda, but it isn't being used.

@clonker
Member

clonker commented Mar 12, 2021

Glad it did the trick! For the device, you actually discovered a bug in the documentation. When you set it in the VAMPNet constructor (as in estimator = VAMPNet(..., device=device)), then it should use the GPU.

@pl992
Author

pl992 commented Mar 12, 2021

I'm glad that this way I've helped a little bit. I've set this on the estimator, but I get this error:

Traceback (most recent call last):
  File "VAMPnet.py", line 72, in <module>
    vampnet.fit(loader_train,n_epochs=10,validation_loader=loader_val)
  File "/home/plongo/miniconda3/lib/python3.7/site-packages/deeptime/decomposition/deep/_vampnet.py", line 604, in fit
    self.partial_fit((batch_0.to(device=self.device), batch_t.to(device=self.device)),
RuntimeError: CUDA error: out of memory

Note that if I do the same on the CPU I don't get any error; the data is loaded as we discussed, so it fits comfortably in memory (~200 MB / 32 GB).

@clonker
Member

clonker commented Mar 12, 2021

This is indeed strange, does the error persist if you restart your machine? What does your nvidia-smi output look like? You could also try decreasing the batch size.

@pl992
Author

pl992 commented Mar 12, 2021

Sorry, you're absolutely right, it was my fault: without realizing it I was using the GPU memory elsewhere... although unfortunately now I get this problem:

Traceback (most recent call last):
  File "VAMPnet.py", line 72, in <module>
    vampnet.fit(loader_train,n_epochs=10,validation_loader=loader_val)
  File "/home/plongo/miniconda3/lib/python3.7/site-packages/deeptime/decomposition/deep/_vampnet.py", line 605, in fit
    train_score_callback=train_score_callback)
  File "/home/plongo/miniconda3/lib/python3.7/site-packages/deeptime/decomposition/deep/_vampnet.py", line 529, in partial_fit
    x_0 = self.lobe(batch_0)
  File "/home/plongo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/plongo/miniconda3/lib/python3.7/site-packages/deeptime/util/torch.py", line 91, in forward
    return self._sequential(inputs)
  File "/home/plongo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/plongo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/plongo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/plongo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 107, in forward
    exponential_average_factor, self.eps)
  File "/home/plongo/miniconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1670, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: Tensor for argument #2 'weight' is on CPU, but expected it to be on GPU (while checking arguments for cudnn_batch_norm)

I don't understand where I should load the data onto the GPU; from the documentation I thought it would be done automatically.

@clonker
Member

clonker commented Mar 12, 2021

Try this before creating the VAMPNet estimator: lobe = lobe.to(device=device)

Clearly the documentation is still lacking in this regard, sorry for that!
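
In other words, roughly (with ns as in your snippets above):

device = torch.device("cuda")
lobe = MLP(units=ns, nonlinearity=nn.ReLU).to(device=device)     # move the network weights to the GPU
vampnet = VAMPNet(lobe=lobe, learning_rate=1e-4, device=device)  # and have the estimator move the batches there too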

@pl992
Author

pl992 commented Mar 12, 2021

Thank you very much! That does the trick! I'm happy I could help a bit with debugging/checking the documentation while you helped me get things working.

@clonker
Member

clonker commented Mar 12, 2021

Perfect, I am glad it's resolved! Let me know if you have other issues downstream, I'm happy to help, and in the end it also helps to improve the library 🙂 For now I'll close the issue and make an update to the docs soon.

clonker closed this as completed Mar 12, 2021