Unable to run main.py in example_training #48
Which Python version are you using, and what command are you executing?
Dear @fjxmlzn - The version of my Python is:
Have you made any changes to the code in the repo?
Yes @fjxmlzn, we have made changes to the repo; I downloaded it locally and added it to another GitHub repo. I think it is a problem if you have multiple DoppelGANger folders on your local machine, or it may be a problem with the Python version. I can run it on my end, but I needed to modify load_data.py: you have not reported the way you created the .npz and .pkl files using pickle, so I changed it to dill, and I also changed some tf calls to work with tf>=2.0, to make it usable for most people. Can you report how you create those files, or let us know how to fix the error?
Hi @meiyor @ArianKhorasani, I just got the time to try it with the same Python version, but I did not get this error. This error should not be due to pickle or tf; it is simply an error from Python when it tries to locate and import the libraries. Upon a closer look at the error message, it seems like you have modified
whereas the original code https://github.com/fjxmlzn/DoppelGANger/blob/05f36ec6c3850863751d4f3f88d180e9b12cb3eb/example_training/gan_task.py#L12C13-L12C27 was
Why did you make this change? Just to make it clear, the original code works like this: the code first adds the parent folder to the system path (
`from gan.network import XXX` means we will import from the gan/network.py file. The syntax `from .op import XXX` is a relative import; it implies that we want to import from op.py, which is in the same folder as network.py.
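To illustrate the import mechanics described above, here is a minimal, self-contained sketch: a throwaway package mimicking the repo layout (gan/op.py and gan/network.py side by side). The module contents and the `build`/`linear` names are stand-ins, not the actual repo code.

```python
import os
import sys
import tempfile

# Build a throwaway package that mimics the repo layout:
# gan/op.py and gan/network.py live in the same folder.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "gan")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "op.py"), "w") as f:
    f.write("def linear():\n    return 'linear op'\n")
with open(os.path.join(pkg, "network.py"), "w") as f:
    # relative import: ".op" resolves to op.py next to network.py
    f.write("from .op import linear\n\ndef build():\n    return linear()\n")

# The absolute import works once the parent folder is on sys.path,
# which is what example_training/gan_task.py does for the repo root.
sys.path.append(root)
from gan.network import build
```

Removing the leading dot from `.op` inside network.py would make Python look for a top-level `op` module instead, which is why changing that import breaks the lookup.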
We only made some changes to make the code compatible with tf>=2.0. We have not made any change to any import or any file from the original code. I definitely think the problem appears when we created the .npz and .pkl files associated with the output modules. As a preliminary solution for this issue, we substituted pickle.dump with dill.dump to save the output class environment and each value in the saved .pkl files. We also changed pickle.load to dill.load in the load_data.py code. For now the code is working, but if you believe some function from tf>=2.0 will affect the main code's functionality, please let us know. We are now evaluating the generated data; if we obtain results similar to your paper's, we can share the code working for tf>=2.0. Let us know what you think.
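As a hedged sketch of that substitution (assuming the third-party `dill` package is installed via `pip install dill`; the dictionary below is a stand-in for the repo's actual output objects):

```python
import os
import tempfile

import dill  # third-party drop-in replacement for pickle

# Stand-in for the saved output objects; dill can also serialize
# class instances and closures that the standard pickle rejects.
data = {"dim": 1, "normalization": "ZERO_ONE"}

path = os.path.join(tempfile.mkdtemp(), "data_feature_output.pkl")
with open(path, "wb") as f:
    dill.dump(data, f)       # was: pickle.dump(data, f)
with open(path, "rb") as f:
    restored = dill.load(f)  # was: pickle.load(f)
```

Because dill mirrors pickle's `dump`/`load` interface, the swap is a one-line change on each side.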
I see! I am sorry that I misunderstood what you said. It's great that it's working! @yzion previously helped us create a TF2 branch here: https://github.com/fjxmlzn/DoppelGANger/tree/TF2, you can see the changes they made there. But that was a long time ago. If you have your TF2 code ready (which is very helpful to the community and us), feel free to make a pull request to the TF2 branch, or host it in your own repo and I can add a link in the readme file to your code.
Hi @fjxmlzn, before sharing the new code with you, we want to know how you measured the autocorrelation and MSE for the generated data. Do you have code you can share with us, so we can check whether we obtain results similar to your evaluations? Or let us know some pseudocode for how, given the generated data, we can measure your metrics. We would be really glad if you can help us with that. Thank you again!
Please see here the code for computing autocorrelation: #20 (comment) Thanks. |
btw, @fjxmlzn, is the autocorrelation calculated for the entire 50000 samples of the generated data for the Google and Web datasets, for instance? Or is it only calculated for the first samples representing the first 500 days of lag? How many samples are associated with the number of lags in days? Which feature index from the generated dataset is used for this autocorrelation? For the Web dataset we have 550 features and for Google we have 2500; is the feature selected here the one with the minimum MSE in comparison with the original? I hope you can clarify this for me asap! Thank you!
We only computed autocorrelation for the Wiki Web dataset (Figure 1 of https://arxiv.org/pdf/1909.13403). We computed the autocorrelation for the entire 50000 samples of the generated data. In the Wiki Web dataset, there is only one feature, which refers to the daily page view (see Table 7 of https://arxiv.org/pdf/1909.13403; "measurements" means "features" there). The number 550 is the number of days, not the number of features. Hope that this is helpful! |
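A minimal sketch of one way to compute the average autocorrelation across samples, assuming the generated data is a [samples x days] array with a single feature per timestep. The normalization below (centering, dividing by the lag-0 sum of squares) is a common convention, not necessarily the paper's exact code; the random array is a stand-in for the 50000 x 550 generated data.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Autocorrelation of a 1-D series for lags 0..max_lag-1."""
    x = x - x.mean()
    var = np.sum(x * x)
    return np.array([np.sum(x[: len(x) - k] * x[k:]) / var
                     for k in range(max_lag)])

rng = np.random.default_rng(0)
generated = rng.normal(size=(100, 550))  # stand-in for the generated samples

# average the per-sample autocorrelation curves over all samples
avg_ac = np.mean([autocorrelation(s, 50) for s in generated], axis=0)
# lag-0 autocorrelation is 1 by construction
```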
I'm really confused by the way you explain it on GitHub. When you say: "data_feature: Training features, in numpy float32 array format. The size is [(number of training samples) x (maximum length) x (total dimension of features)]", the first axis is the number of samples of the time series; what is the second axis? And is the third axis the number of features? It says it is the feature dimensionality, which are two different things. In our approach we add our own dataset like this: number_of_samples (>1M) x number_of_features (34) x features_dimensions (1). Will that work? If not, please let us know as soon as possible; this would clarify a lot.
I see where the confusion is. It would be helpful to read the data formulation section (Section 3.1) of the paper https://arxiv.org/pdf/1909.13403. I explain it below. Each dataset contains many samples; each sample is a list of features ordered by time (features are called measurements in the paper). Let's take the Wiki Web Traffic dataset as an example (please take a look at Section 5.1 and Section A, where we had detailed discussions):
I am confused when you say your dataset is of shape number_of_samples (>1M) x number_of_features (34) x features_dimensions (1). Which samples correspond to one time series? Are the 1M samples independent?
@fjxmlzn - this is what you wrote in Section 3.1: "We abstract the scope of our datasets as follows: A dataset D = {O1, O2, ..., On} is defined as a set of samples Oi (e.g., the clients)." This is extremely confusing, because in this case you are relating measurements to feature dimensions, and following your last explanation they are the features themselves. You are also referring to samples, but in fact the samples are not related to the time-series sequence at all; they are independent variables. So, following your last explanation, the samples are independent measurements/variables, right? The max-length is the maximum sequence length in the time series and it represents the sequence itself, right? If so, please clarify on GitHub that this axis is related to the time-series sequence itself, and that features_dimensions is in fact the number of features, right? Can you confirm that for us, so we can be sure we are doing our experiments right?
I am sorry, but I am not sure I fully understand your descriptions; for example, what do "samples are not related to the time-series sequence at all", "it represents the sequence itself", etc. mean? I know the terminologies can be ambiguous, so let's make it clearer using concrete examples. For the wiki web traffic, the shape is as described above. What is your data? After knowing that, I can provide suggestions on how to format the data.
Our data is a clinical dataset; we now have 40k subjects x 270 hours x 34 features (biological signals). I think we are trying to map our dataset the way you mapped the web traffic dataset in your experiments. We are assuming each subject is an independent variable, as you clarified before. Let us know if this is valid; of course, I know the potential convergence of our data generation will depend on our data distribution anyway. If you can suggest something to make our dataset more similar to your structure, we will be glad. Thank you!
I see! Re: "We are assuming each subject is an independent variable as you clarified before": you are right! Re: format: if the features have one value per hour, and the features are numerical (instead of categorical), then making the data in the shape of 40k x 270 x 34 should work.
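A minimal sketch of laying out such a dataset in the [samples x max length x feature dimension] shape discussed above. The array names follow the repo's README conventions; the random values and the small subject count are stand-ins for real signals.

```python
import numpy as np

n_subjects, n_hours, n_features = 40, 270, 34  # small stand-in for 40k subjects
rng = np.random.default_rng(0)

# one numerical value per feature per hour, float32 as required
data_feature = rng.normal(
    size=(n_subjects, n_hours, n_features)
).astype(np.float32)

# data_gen_flag marks real timesteps (1) vs padding (0); here every
# subject's series spans the full 270 hours, so no padding is needed
data_gen_flag = np.ones((n_subjects, n_hours), dtype=np.float32)
```

If some subjects had shorter recordings, their series would be zero-padded up to the maximum length and the corresponding `data_gen_flag` entries set to 0.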
ok perfect! Thank you very much! |
btw @fjxmlzn, we wanted to ask: do you have the code for plotting the distribution or histogram comparison for the generated and real datasets you evaluated? The Google and Web ones don't have max/min values reported, but FCC_MBA has them here: https://drive.google.com/drive/folders/19hnyG8lN9_WWIac998rT6RtBB9Zit70X. If you can share the max and min for the Google and Web datasets, we would be glad.
For web: #27 |
Closing the issue for now. Feel free to reopen it if it is still not resolved.
Dear @fjxmlzn - I have created all the necessary files from my dataset (e.g., the .pkl files) and want to train DoppelGANger on it. Based on the description, I use the main.py file in example_training; it finishes very quickly, but after checking the results directory, in worker.log I get the following error:
I think the issue is related to the network.py file, which has the following imports. But after removing the `.` I still get that error, so I'm not sure how to solve it. Any clue or feedback would be appreciated!
Thank you!