Incomplete training #30

Closed
fjxmlzn opened this issue Jul 25, 2022 · 7 comments

Comments

@fjxmlzn
Owner

fjxmlzn commented Jul 25, 2022

Recently, I ran into another problem.
I ran main.py in the example_training folder and main_generate_data.py in the example_generating_data folder. However, only a folder named results was created, and its sub-folders contained nothing but a worker_*.log.txt file.
Q1: Why were no synthetic datasets for web/google/FCC_MBA generated?
[Screenshot: Snipaste_2022-07-24_23-13-48]
I looked for a place in the code to specify the dataset path, but found nothing.

Q2: Given that I know the attributes and features of my dataset, how do I generate the four files data_attribute_output.pkl, data_feature_output.pkl, data_test.npz, and data_train.npz? Does additional code need to be written for this?

Finally, thank you for your continued patient answers.

Originally posted by @chameleonzz in #3 (comment)

@fjxmlzn fjxmlzn mentioned this issue Jul 25, 2022
@chameleonzz

Based on the earlier worker.log, I suspected that my TensorFlow installation was broken (multiple versions were installed). I therefore reinstalled TensorFlow and ran example_training/main.py again. The output is now as follows.
[Screenshot: worklog]
In the folder aux_disc-False,dataset-FCC_MBA,epoch-17000,epoch_checkpoint_freq-70,extra_checkpoint_freq-850,run-0,sample_len-1,self_norm-False, the content of worker.log is as follows.
[Screenshots: 1, 2, 3]
The last line shows: FileNotFoundError: [Errno 2] No such file or directory: '../results/aux_disc-False,dataset-FCC_MBA,epoch-17000,epoch_checkpoint_freq-70,extra_checkpoint_freq-850,run-0,sample_len-1,self_norm-False,\sample\epoch_id-69,batch_id--1,global_id-419,type-free,feature,output-0,dim-0.png'

I think it is close to running successfully. I am currently debugging based on what worker.log reports.
Thank you very much for your continued help.

@chameleonzz

I think I can run DoppelGANger correctly now.
To solve the above problem, I debugged main.py in example_training(without_GPUTaskScheduler).
Then I modified doppelganger.py in gan: I deleted checkpoint_dir in the last line, and the code ran properly. It took about 22 hours, as shown below. (My machine has an i7-10750H CPU, an NVIDIA GeForce RTX 2060 GPU, and 32GB.)
[Screenshot: results-example_training(without_GPUTaskScheduler)]
In example_training(without_GPUTaskScheduler)/test, there are three items: checkpoint, sample, and time.txt.
The checkpoint folder contains many files, as follows.
[Screenshot: 2-1-checkpoint]
In addition, the sample folder contains a large number of image files, as follows.
There are around 19,000 images and several npz files.
[Screenshot: 2-2-sample]

Is this the correct result of running example_training(without_GPUTaskScheduler)/main.py? If so, how do I generate synthetic data for web/google/FCC_MBA?

@fjxmlzn
Owner Author

fjxmlzn commented Aug 1, 2022

Yes, it is the right result with this code.

Regarding the FileNotFoundError you posted in #30 (comment), it should have already been fixed in c2f4bfb in June 2022. Please re-clone the repo and rerun and check if that works.

Regarding data generation for web, you can use https://github.com/fjxmlzn/DoppelGANger/tree/master/example_generating_data(without_GPUTaskScheduler) (before re-running the above training code).

The above "without_GPUTaskScheduler" versions of the training and generation code are only for the web dataset. For the other datasets (google, FCC_MBA), you can either modify the hyper-parameters according to the config file https://github.com/fjxmlzn/DoppelGANger/blob/master/example_training/config.py, or directly use the versions with GPUTaskScheduler (https://github.com/fjxmlzn/DoppelGANger/tree/master/example_training and https://github.com/fjxmlzn/DoppelGANger/tree/master/example_generating_data).

Let me know if you run into any issues with the code.

@chameleonzz

chameleonzz commented Aug 15, 2022

After modifying example_training/config.py and the other config*.py files according to c2f4bfb, I re-ran the code and got the same error as shown in #30 (comment).

In 'aux_disc-False,dataset-FCC_MBA,epoch-17000,epoch_checkpoint_freq-70,extra_checkpoint_freq-850,run-,sample_len-,self_norm-False,\sample', there was only one npz file, named 'epoch_id-69,batch_id--1,global_id-419,type-free,samples.npz'.
The situation was the same in 'aux_disc-False,dataset-google,epoch-400,epoch_checkpoint_freq-1,extra_checkpoint_freq-5,run-0,sample_len-1,self_norm-False,\sample'.
However, in 'aux_disc-True,dataset-web,epoch-400,epoch_checkpoint_freq-1,extra_checkpoint_freq-5,run-0,sample_len-1,self_norm-True,\sample', there were many files, including lots of images and two npz files. But its worker.log also contained a similar error: "FileNotFoundError: [Errno 2] No such file or directory: '..\results\aux_disc-True,dataset-web,epoch-400,epoch_checkpoint_freq-1,extra_checkpoint_freq-5,run-0,sample_len-1,self_norm-True,\sample\epoch_id-0,batch_id-199,global_id-199,type-teacher,attribute,output-3,dim-0.png'".

@fjxmlzn
Owner Author

fjxmlzn commented Aug 15, 2022

This looks weird. Could you please attach the worker.log files from these three folders here? Thank you!

@chameleonzz

OK, I sent you an email.

@fjxmlzn
Owner Author

fjxmlzn commented Aug 16, 2022

Thank you. Since I believe we found the root cause of this issue, I am closing this issue now.

For future readers of this thread: the issue is that Windows has a maximum path length limit, and a FileNotFoundError is raised when writing to a path that exceeds it.
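
A quick way to check whether an output path is at risk is sketched below. It assumes the classic Windows MAX_PATH limit of 260 characters (which counts the terminating null character); the helper name is hypothetical and not part of DoppelGANger. Systems with long-path support enabled may not hit this limit at all.

```python
import os

# Classic Windows MAX_PATH: 260 characters, including the terminating NUL.
# Assumption: long-path support is NOT enabled on the machine.
WINDOWS_MAX_PATH = 260


def exceeds_windows_max_path(path, limit=WINDOWS_MAX_PATH):
    """Return True if the absolute form of `path` would not fit in `limit`.

    Hypothetical helper for illustration only.
    """
    absolute = os.path.abspath(path)
    # The limit includes the terminating NUL, so limit - 1 characters are usable.
    return len(absolute) > limit - 1


if __name__ == "__main__":
    # Example: a sample image path similar to the ones in the error messages.
    candidate = os.path.join(
        "..", "results",
        "aux_disc-False,dataset-FCC_MBA,epoch-17000,epoch_checkpoint_freq-70,"
        "extra_checkpoint_freq-850,run-0,sample_len-1,self_norm-False,",
        "sample",
        "epoch_id-69,batch_id--1,global_id-419,type-free,feature,output-0,dim-0.png",
    )
    print(len(os.path.abspath(candidate)), exceeds_windows_max_path(candidate))
```

Whether the limit is actually exceeded depends on how deep the repository sits on disk, since the check uses the absolute path.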

To shorten the paths, we can add some keys to ignored_keys_for_folder_name in the config file so that they do not appear in the folder name. For example, we can change the top part of https://github.com/fjxmlzn/DoppelGANger/blob/master/example_training/config.py to

config = {
    "scheduler_config": {
        "gpu": ["0"],
        "config_string_value_maxlen": 1000,
        "result_root_folder": os.path.join("..", "results"),
        # Keys listed here are left out of the result folder name.
        "ignored_keys_for_folder_name": ['extra_checkpoint_freq', 'epoch_checkpoint_freq', 'aux_disc', 'self_norm']
    },

See https://github.com/fjxmlzn/GPUTaskScheduler for more details of the config options of GPUTaskScheduler. Alternatively, we can try moving the entire folder of DoppelGANger to a path that is shorter.
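
To illustrate how much ignoring keys can shorten a folder name, here is a minimal sketch. It only mimics the naming pattern visible in the paths quoted in this thread (sorted `key-value` pairs joined by commas, with a trailing comma); it is not GPUTaskScheduler's actual implementation, and `folder_name` is a hypothetical helper.

```python
def folder_name(config, ignored_keys=()):
    """Build a folder name from config entries, skipping ignored keys.

    Hypothetical illustration of the naming pattern seen in this thread.
    """
    parts = [
        f"{key}-{value}"
        for key, value in sorted(config.items())
        if key not in ignored_keys
    ]
    return ",".join(parts) + ","


# Values taken from the FCC_MBA folder name quoted earlier in the thread.
run_config = {
    "aux_disc": False,
    "dataset": "FCC_MBA",
    "epoch": 17000,
    "epoch_checkpoint_freq": 70,
    "extra_checkpoint_freq": 850,
    "run": 0,
    "sample_len": 1,
    "self_norm": False,
}

full = folder_name(run_config)
short = folder_name(
    run_config,
    ignored_keys=["extra_checkpoint_freq", "epoch_checkpoint_freq",
                  "aux_disc", "self_norm"],
)

print(len(full), full)
print(len(short), short)
```

One trade-off to keep in mind: if two runs differ only in an ignored key, a scheme like this one would give them the same folder name.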
