Issue training with any dataset #3

Closed

rafaelorozco opened this issue Nov 19, 2022 · 7 comments

@rafaelorozco

Hello! I am very interested in this work. I would like to get some examples running; any help would be appreciated.

I haven't had any luck with any of the examples, either evaluating or training. This is the current error I run into:

For the thin spiral:

(ml) rorozcom3@dgx1:~/denoising-normalizing-flow/experiments$ python3 train.py -c configs/train_dnf_thin_spiral.config
21:55 __main__             INFO    Hi!
21:55 __main__             INFO    Training model dnf_1_thin_spiral_paper with algorithm dnf on data set thin_spiral
Traceback (most recent call last):
  File "/nethome/rorozcom3/denoising-normalizing-flow/experiments/train.py", line 682, in <module>
    dataset = simulator.load_dataset(train=True, dataset_dir=create_filename("dataset", None, args), limit_samplesize=args.samplesize, joint_score=args.scandal is not None)
  File "/nethome/rorozcom3/denoising-normalizing-flow/experiments/datasets/base.py", line 54, in load_dataset
    x = np.load("{}/x_{}{}{}.npy".format(dataset_dir, tag, param_label, run_label))
  File "/nethome/rorozcom3/miniconda3/envs/ml/lib/python3.10/site-packages/numpy/lib/npyio.py", line 390, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/nethome/rorozcom3/denoising-normalizing-flow/experiments/data/samples/thin_spiral/x_train.npy'

For celeba:

(ml) rorozcom3@dgx1:~/denoising-normalizing-flow/experiments$ python3 train.py -c configs/train_dnf_celeba.config
21:53 __main__             INFO    Hi!
21:53 __main__             INFO    Training model dnf_512_celeba_paper with algorithm dnf on data set celeba
Traceback (most recent call last):
  File "/nethome/rorozcom3/denoising-normalizing-flow/experiments/train.py", line 682, in <module>
    dataset = simulator.load_dataset(train=True, dataset_dir=create_filename("dataset", None, args), limit_samplesize=args.samplesize, joint_score=args.scandal is not None)
  File "/nethome/rorozcom3/denoising-normalizing-flow/experiments/datasets/images.py", line 44, in load_dataset
    x = np.load("{}/{}.npy".format(dataset_dir, "train" if train else "test"))
  File "/nethome/rorozcom3/miniconda3/envs/ml/lib/python3.10/site-packages/numpy/lib/npyio.py", line 418, in load
    raise ValueError("Cannot load file containing pickled data "
ValueError: Cannot load file containing pickled data when allow_pickle=False

After adding allow_pickle=True to that np.load call, I get this:

(ml) rorozcom3@dgx1:~/denoising-normalizing-flow/experiments$ python3 train.py -c configs/train_dnf_celeba.config
21:52 __main__             INFO    Hi!
21:52 __main__             INFO    Training model dnf_512_celeba_paper with algorithm dnf on data set celeba
Traceback (most recent call last):
  File "/nethome/rorozcom3/miniconda3/envs/ml/lib/python3.10/site-packages/numpy/lib/npyio.py", line 421, in load
    return pickle.load(fid, **pickle_kwargs)
_pickle.UnpicklingError: invalid load key, '<'.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nethome/rorozcom3/denoising-normalizing-flow/experiments/train.py", line 682, in <module>
    dataset = simulator.load_dataset(train=True, dataset_dir=create_filename("dataset", None, args), limit_samplesize=args.samplesize, joint_score=args.scandal is not None)
  File "/nethome/rorozcom3/denoising-normalizing-flow/experiments/datasets/images.py", line 44, in load_dataset
    x = np.load("{}/{}.npy".format(dataset_dir, "train" if train else "test"), allow_pickle=True)
  File "/nethome/rorozcom3/miniconda3/envs/ml/lib/python3.10/site-packages/numpy/lib/npyio.py", line 423, in load
    raise pickle.UnpicklingError(
_pickle.UnpicklingError: Failed to interpret file '/nethome/rorozcom3/denoising-normalizing-flow/experiments/data/samples/celeba/train.npy' as a pickle

Thank you for your attention! I look forward to getting this running!

@chrvt
Owner

chrvt commented Nov 21, 2022

Hi there,

thanks a lot for raising this issue.

It seems to me that there is some issue with loading the dataset. For the thin_spiral, the dataset is never created ("[Errno 2] No such file or directory").

A quick and ugly fix would be to add the dataset yourself in the expected folder, i.e. create some train and test samples, save them as .npy files, and place them there (see the sketch below). However, if I remember right, this should happen automatically. I will have a look at it as soon as I have some spare time and keep you updated!
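
For illustration, a minimal sketch of that manual fix; the spiral parameterization and constants here are assumptions for illustration only, the repository's simulator in experiments/datasets/thin_spiral.py defines the real one:

import os
import numpy as np

# Hypothetical thin-spiral generator; the prior scale and angle constants are made up.
def make_thin_spiral(n, rng):
    z = rng.exponential(scale=0.3, size=n)    # 1D latent (assumed prior)
    angle = 3.0 * np.pi * np.sqrt(z)          # unroll the latent into an angle
    x = np.stack([angle * np.cos(angle), angle * np.sin(angle)], axis=1) / 3.0
    return x.astype(np.float32)

# Paths follow the FileNotFoundError above, relative to experiments/.
rng = np.random.default_rng(1357)
os.makedirs("data/samples/thin_spiral", exist_ok=True)
np.save("data/samples/thin_spiral/x_train.npy", make_thin_spiral(10000, rng))
np.save("data/samples/thin_spiral/x_test.npy", make_thin_spiral(10000, rng))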

Regarding celeba, it looks like the dataset was indeed downloaded. However, this error usually occurs when the dataset was not downloaded properly. Have you tried to open the dataset manually on your local machine? Do you get the same issue? If yes, then something indeed went wrong in the downloading process (which should not happen, but knowing it would help me localize the error).
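
A quick way to check this without loading the array: a valid .npy file starts with the magic bytes \x93NUMPY, while the UnpicklingError above ("invalid load key, '<'") suggests your file starts with '<', i.e. an HTML page (for example a Google Drive warning page) may have been saved instead. A minimal check:

# Inspect the first bytes of the downloaded file.
with open("data/samples/celeba/train.npy", "rb") as f:
    head = f.read(8)
print(head)  # b'\x93NUMPY...' for a real .npy; b'<!DOCTYP' or b'<html...' means an HTML page was saved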

Thanks again!

@chrvt
Copy link
Owner

chrvt commented Nov 23, 2022

Update: I fixed the downloading issue for the thin_spiral dataset.
Regarding celeba, it seems the Google Drive link provided by Manifold Flow is not working anymore. I will try to provide an alternative soon. Until then, you need to download "celeba" manually. You can download the files here:

"train.npy" => https://drive.google.com/file/d/1xDlVwFjFpts7gcJglPpQg8zu9g4PhuGG/view?usp=share_link
"test.npy" => https://drive.google.com/file/d/1ihhlOel2rriGrzRGdYsfWsVIYB-q_cKa/view?usp=sharing

Let me know if it works now :-)

@rafaelorozco
Author

Hello, thank you for all of the help.

I think I can now successfully train on both the celeba and spiral datasets thanks to your help. If you have time, I would appreciate help evaluating the models.
For the spiral dataset, after training for 5 epochs:

(ml) rorozcom3@dgx1:~/denoising-normalizing-flow/experiments$ CUDA_VISIBLE_DEVICES=3 python3 train.py -c configs/train_dnf_thin_spiral.config &
[2] 51593
(ml) rorozcom3@dgx1:~/denoising-normalizing-flow/experiments$ 11:29 __main__             INFO    Hi!
11:29 __main__             DEBUG   Starting train.py with arguments Namespace(modelname='paper', algorithm='dnf', dataset='thin_spiral', i=0, truelatentdim=1, datadim=2, epsilon=0.01, modellatentdim=1, specified=False, outertransform='rq-coupling', innertransform='affine-autoregressive', lineartransform='lu', outerlayers=9, innerlayers=1, conditionalouter=False, dropout=0.0, pieepsilon=0.01, pieclip=None, encoderblocks=5, encoderhidden=100, splinerange=3.0, splinebins=5, levels=3, actnorm=False, batchnorm=False, linlayers=2, linchannelfactor=2, intermediatensf=False, decoderblocks=5, decoderhidden=100, sig2=0.01, alternate=False, sequential=False, load=None, startepoch=0, samplesize=10000, epochs=5, subsets=1, batchsize=100, genbatchsize=1000, lr=0.0003, msefactor=1.0, addnllfactor=0.1, nllfactor=1.0, sinkhornfactor=10.0, weightdecay=0.0001, clip=5.0, nopretraining=False, noposttraining=False, validationsplit=0.1, scandal=None, l1=False, uvl2reg=0.0, seed=1357, resume=None, scheduler_restart=50, noise_type='gaussian', sampling_train=False, c='configs/train_dnf_thin_spiral.config', dir='/nethome/rorozcom3/denoising-normalizing-flow', debug=True)
11:29 __main__             INFO    Training model dnf_1_thin_spiral_paper with algorithm dnf on data set thin_spiral
11:29 datasets.base        INFO    Only using 10000 of 10000 available samples
11:29 architectures.create INFO    Creating manifold flow for vector data with 1 latent dimensions, 9 + 1 layers, transforms rq-coupling / affine-autoregressive, None context features
11:29 manifold_flow.transf DEBUG   Set up projection from vector with dimension 2 to vector with dimension 1
11:29 manifold_flow.flows. INFO    Model has 0.4 M parameters (0.4 M trainable) with an estimated size of 1.7 MB
11:29 manifold_flow.flows. INFO      Outer transform: 0.4 M parameters
11:29 manifold_flow.flows. INFO      Inner transform: 0.0 M parameters

(ml) rorozcom3@dgx1:~/denoising-normalizing-flow/experiments$ 11:29 training.trainer     INFO    Training on GPU with single precision
11:29 __main__             INFO    Starting training denoising flow on NLL
11:29 training.trainer     DEBUG   Initialising training data
11:29 training.trainer     DEBUG   Setting up dataloaders with 4 workers
11:29 training.trainer     DEBUG   Training partition indices: [5876, 8244, 9527, 865, 3702, 282, 5619, 761, 7718, 7830]...
11:29 training.trainer     DEBUG   Validation partition indices: [2769, 8974, 9207, 340, 2179, 5595, 1942, 5053, 749, 5504]...
11:29 training.trainer     DEBUG   Setting up optimizer
11:29 training.trainer     DEBUG   Setting up LR scheduler
11:29 training.trainer     DEBUG   Using early stopping with infinite patience
11:29 training.trainer     DEBUG   Beginning main training loop
11:29 training.trainer     DEBUG   Training epoch 1 / 5
11:29 training.trainer     DEBUG   Learning rate: [0.0003]
/nethome/rorozcom3/denoising-normalizing-flow/experiments/../manifold_flow/transforms/lu.py:85: UserWarning: torch.triangular_solve is deprecated in favor of torch.linalg.solve_triangularand will be removed in a future PyTorch release.
torch.linalg.solve_triangular has its arguments reversed and does not return a copy of one of the inputs.
X = torch.triangular_solve(B, A).solution
should be replaced with
X = torch.linalg.solve_triangular(A, B). (Triggered internally at ../aten/src/ATen/native/BatchLinearAlgebra.cpp:2115.)
  outputs, _ = torch.triangular_solve(outputs.t(), lower, upper=False, unitriangular=True)
11:29 training.trainer     INFO    Epoch   1: train loss 17.21824 (NLL: 16.338, MSE:  0.881, L2_lat:  0.285)
11:29 training.trainer     INFO               val. loss   4.37498 (NLL:  3.652, MSE:  0.723, L2_lat:  0.033)
11:29 training.trainer     DEBUG   Training epoch 2 / 5
11:29 training.trainer     DEBUG   Learning rate: [0.0002997040092642407]
11:30 training.trainer     INFO    Epoch   2: train loss  4.54401 (NLL:  3.869, MSE:  0.675, L2_lat:  0.039)
11:30 training.trainer     INFO               val. loss   3.50567 (NLL:  2.906, MSE:  0.600, L2_lat:  0.025)
11:30 training.trainer     DEBUG   Training epoch 3 / 5
11:30 training.trainer     DEBUG   Learning rate: [0.0002988172051971717]
11:30 training.trainer     INFO    Epoch   3: train loss  3.69675 (NLL:  3.101, MSE:  0.596, L2_lat:  0.028)
11:30 training.trainer     INFO               val. loss   2.78211 (NLL:  2.277, MSE:  0.505, L2_lat:  0.015)
11:30 training.trainer     DEBUG   Training epoch 4 / 5
11:30 training.trainer     DEBUG   Learning rate: [0.0002973430876093033]
11:31 training.trainer     INFO    Epoch   4: train loss  2.81677 (NLL:  2.305, MSE:  0.512, L2_lat:  0.013)
11:31 training.trainer     INFO               val. loss   2.27601 (NLL:  1.860, MSE:  0.416, L2_lat:  0.009)
11:31 training.trainer     DEBUG   Training epoch 5 / 5
11:31 training.trainer     DEBUG   Learning rate: [0.00029528747416929463]
11:32 training.trainer     INFO    Epoch   5: train loss  2.59067 (NLL:  2.166, MSE:  0.425, L2_lat:  0.012)
11:32 training.trainer     INFO               val. loss   2.28360 (NLL:  1.837, MSE:  0.447, L2_lat:  0.011)
11:32 training.trainer     INFO    Early stopping after epoch 4, with loss  2.27601 compared to final loss  2.28360
11:32 training.trainer     DEBUG   Training finished
11:32 __main__             INFO    Saving model
11:32 __main__             INFO    All done! Have a nice day!

I get this error when trying to evaluate:

(ml) rorozcom3@dgx1:~/denoising-normalizing-flow/experiments$ python3 evaluate.py -c configs/evaluate_dnf_thin_spiral.config

Could not import fid_score, make sure that pytorch-fid is in the Python path
11:27 __main__             INFO    Hi!
11:27 __main__             INFO    Evaluating model dnf_1_thin_spiral_paper
11:27 architectures.create INFO    Creating manifold flow for vector data with 1 latent dimensions, 9 + 1 layers, transforms rq-coupling / affine-autoregressive, None context features
11:27 manifold_flow.flows. INFO    Model has 0.4 M parameters (0.4 M trainable) with an estimated size of 1.7 MB
11:27 manifold_flow.flows. INFO      Outer transform: 0.4 M parameters
11:27 manifold_flow.flows. INFO      Inner transform: 0.0 M parameters
11:27 __main__             INFO    Evaluating grid for density compariison.
/nethome/rorozcom3/miniconda3/envs/ml/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/nethome/rorozcom3/denoising-normalizing-flow/experiments/../manifold_flow/transforms/lu.py:85: UserWarning: torch.triangular_solve is deprecated in favor of torch.linalg.solve_triangularand will be removed in a future PyTorch release.
torch.linalg.solve_triangular has its arguments reversed and does not return a copy of one of the inputs.
X = torch.triangular_solve(B, A).solution
should be replaced with
X = torch.linalg.solve_triangular(A, B). (Triggered internally at ../aten/src/ATen/native/BatchLinearAlgebra.cpp:2115.)
  outputs, _ = torch.triangular_solve(outputs.t(), lower, upper=False, unitriangular=True)
11:27 __main__             INFO    Time needed for batch 1: 1.7188756465911865 sec
11:27 __main__             INFO    Time needed for batch 2: 0.7976391315460205 sec
11:27 __main__             INFO    Time needed for batch 3: 0.8050322532653809 sec
11:27 __main__             INFO    Time needed for batch 4: 0.7919723987579346 sec
11:27 __main__             INFO    Time needed for batch 5: 0.8677799701690674 sec
11:27 __main__             INFO    Time needed for batch 6: 0.8129651546478271 sec
11:27 __main__             INFO    Time needed for batch 7: 0.8065090179443359 sec
11:27 __main__             INFO    Time needed for batch 8: 0.7708215713500977 sec
11:27 __main__             INFO    Time needed for batch 9: 0.8067355155944824 sec
11:27 __main__             INFO    Time needed for batch 10: 0.7868742942810059 sec
11:27 __main__             INFO    Time needed for batch 11: 0.8096134662628174 sec
11:27 __main__             INFO    Time needed for batch 12: 0.8246207237243652 sec
11:27 __main__             INFO    Time needed for batch 13: 0.8709676265716553 sec
11:27 __main__             INFO    Time needed for batch 14: 0.78993821144104 sec
11:27 __main__             INFO    Time needed for batch 15: 0.768791675567627 sec
11:27 __main__             INFO    Time needed for batch 16: 0.7577681541442871 sec
11:27 __main__             INFO    Time needed for batch 17: 0.7451643943786621 sec
11:27 __main__             INFO    Time needed for batch 18: 0.7610089778900146 sec
11:27 __main__             INFO    Time needed for batch 19: 0.7358658313751221 sec
11:27 __main__             INFO    Time needed for batch 20: 0.7000815868377686 sec
11:27 __main__             INFO    Time needed to evaluate model samples: 16.782846212387085 sec
11:27 __main__             INFO    Start calculating KS statistics
Traceback (most recent call last):
  File "/nethome/rorozcom3/denoising-normalizing-flow/experiments/evaluate.py", line 784, in <module>
    calculate_KS_stats(args,model,simulator) #calculate KS if possible
  File "/nethome/rorozcom3/denoising-normalizing-flow/experiments/evaluate.py", line 200, in calculate_KS_stats
    latent_test = simulator.load_latent(train=False,dataset_dir=create_filename("dataset", None, args))
  File "/nethome/rorozcom3/denoising-normalizing-flow/experiments/datasets/thin_spiral.py", line 77, in load_latent
    latents = np.load("{}/x_{}_latent.npy".format(dataset_dir, tag))
  File "/nethome/rorozcom3/miniconda3/envs/ml/lib/python3.10/site-packages/numpy/lib/npyio.py", line 390, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/nethome/rorozcom3/denoising-normalizing-flow/experiments/data/samples/thin_spiral/x_test_latent.npy'

I was also having trouble understanding how to produce the file flow_1_thin_spiral_paper_log_grid_likelihood.npy so that I could reproduce the upper half of Figure 2 of the paper:
log_prob = np.load(os.path.join(data_path,'flow_1_thin_spiral_paper_log_grid_likelihood.npy'))
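
For reference, here is a minimal sketch of how I would plot such a file once it exists, assuming it holds log-likelihoods evaluated on a square grid (data_path and the grid shape are assumptions):

import os
import numpy as np
import matplotlib.pyplot as plt

data_path = "data/results"  # assumption: wherever evaluate.py writes its outputs
log_prob = np.load(os.path.join(data_path, "flow_1_thin_spiral_paper_log_grid_likelihood.npy"))
n = int(np.sqrt(log_prob.size))                     # assume a square n x n grid
plt.imshow(log_prob.reshape(n, n), origin="lower")  # heatmap of the log-density
plt.colorbar(label="log likelihood")
plt.show()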

@chrvt
Owner

chrvt commented Nov 29, 2022

I fixed it - at least it works for me now. Please let me know if there is still something wrong or you can't reproduce. Thanks :-)

@rafaelorozco
Author

Great, thank you. I can now run the evaluation script!

For reproducing the plot, did you add the .npy files that I need, or do I need to produce them myself? I currently couldn't find a file that looks like this

flow_1_thin_spiral_paper_log_grid_likelihood.npy

in the repo.

@chrvt
Owner

chrvt commented Dec 2, 2022

It should be produced automatically when evaluating, does it not?

@chrvt chrvt closed this as completed Dec 23, 2022
@chrvt
Owner

chrvt commented Dec 23, 2022

The training dataset issues were solved. In case there are still problems with the evaluation, please open a new issue.
