
Unable to run main.py in example_training #48

Closed · ArianKhorasani opened this issue May 21, 2024 · 21 comments

@ArianKhorasani commented May 21, 2024

Dear @fjxmlzn - I have created all the necessary files from my dataset (e.g. the .pkl files) and want to train DoppelGANger on it. Following the description, I run the main.py file in example_training; it finishes very quickly, but when I check worker.log in the results directory, I see the following error:

Traceback (most recent call last):
  File "/home/arian.khorasani/scratch/Generative_models/generative-env/bin/start_gpu_task", line 33, in <module>
    sys.exit(load_entry_point('GPUTaskScheduler', 'console_scripts', 'start_gpu_task')())
  File "/home/arian.khorasani/GPUTaskScheduler/gpu_task_scheduler/start_gpu_task.py", line 23, in main
    worker.main()
  File "/home/arian.khorasani/DoppelGANger/DoppelGANGer/example_training/gan_task.py", line 12, in main
    from network import DoppelGANgerGenerator, Discriminator, \
  File "/home/arian.khorasani/DoppelGANger/DoppelGANGer/gan/network.py", line 2, in <module>
    from .op import linear, batch_norm, flatten
ImportError: attempted relative import with no known parent package

I think the issue is related to the network.py file, which has the following imports:

from .op import linear, batch_norm, flatten
from .output import OutputType, Normalization

but even after removing the leading . I still get that error, so I'm not sure how to solve it. Any clue or feedback would be appreciated!
Thank you!

@fjxmlzn (Owner) commented May 21, 2024

Which python version are you using and what command are you executing?

@ArianKhorasani (Author)

> Which python version are you using and what command are you executing?

Dear @fjxmlzn - My Python version is 3.10.14, and I am running the following commands:

cd example_training
python main.py

@fjxmlzn (Owner) commented May 22, 2024

Have you made any changes to the code in the repo?

@meiyor commented May 22, 2024

Yes @fjxmlzn, we have made changes to the repo: I downloaded it locally and added it to another GitHub repo. I think the problem may appear if you have multiple DoppelGANger folders on your local machine, or it may be an issue with the Python version. I can run it on my end, but I had to modify load_data.py: you have not documented how you created the .npz and .pkl files using pickle, so I switched to dill, and I also changed some tf calls for tf > 2.0 to make the code usable for most people. Can you report how you create the files? Or can you let us know how to fix the error?

@fjxmlzn (Owner) commented May 26, 2024

Hi @meiyor @ArianKhorasani, I just got the time to try it with the same Python version, but I did not get this error. This error should not be due to pickle or tf; it is simply an error Python raises when it tries to locate and import the libraries.

Upon a closer look at the error message, it seems you have modified gan_task.py at line 12. The error message shows that you changed the line to

from network import DoppelGANgerGenerator, Discriminator

whereas the original code (https://github.com/fjxmlzn/DoppelGANger/blob/05f36ec6c3850863751d4f3f88d180e9b12cb3eb/example_training/gan_task.py#L12C13-L12C27) was

from gan.network import DoppelGANgerGenerator, Discriminator, \

Why did you make this change?

Just to make it clear, the original code works like this: it first adds the parent folder to the system path via sys.path.append("..") so that we can import any module relative to the repo root. For example, from gan.network import XXX means we import from the gan/network.py file. The syntax from .op import XXX is a relative import; it means we want to import from op.py, which is in the same folder as network.py.
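
To make the two import styles concrete, here is a minimal sketch of the layout this relies on (paths abbreviated; the comments are ours):

```python
# Repo layout (simplified):
#
#   DoppelGANger/
#   ├── gan/
#   │   ├── network.py        # contains: from .op import linear, batch_norm, flatten
#   │   └── op.py
#   └── example_training/
#       └── gan_task.py

# Inside example_training/gan_task.py:
import sys
sys.path.append("..")  # make the repo root importable

# Absolute import rooted at the repo: network.py is loaded as part of the
# `gan` package, so its own relative import `from .op import ...` works.
from gan.network import DoppelGANgerGenerator, Discriminator

# By contrast, `from network import ...` would load network.py as a top-level
# module with no parent package, and the relative import inside it would then
# fail with "attempted relative import with no known parent package".
```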

@meiyor commented May 30, 2024

We only made some changes to make the code compatible with tf >= 2.0. We have not made any change to any import or to any file from the original code. I think the problem appears when we create the .npz and .pkl files associated with the output modules. As a preliminary fix for this issue, we replaced pickle.dump with dill.dump to save the output class environment and each value in the saved .pkl files, and we likewise replaced pickle.load with dill.load in load_data.py. For now the code is working, but if you believe some function from tf >= 2.0 will affect the main code's functionality, please let us know. We are now evaluating the generated data; if we obtain results similar to your paper's, we can share the working code for tf >= 2.0. Let us know what you think.
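
For reference, a minimal sketch of the pickle-to-dill substitution described above; the helper names are ours, and outputs stands for whatever output-description objects the repo's data format expects:

```python
import dill  # used as a drop-in replacement for pickle here

def save_outputs(outputs, path):
    # Workaround: dill.dump instead of pickle.dump, so that the output
    # class environment serializes cleanly.
    with open(path, "wb") as f:
        dill.dump(outputs, f)

def load_outputs(path):
    # Matching change in load_data.py: dill.load instead of pickle.load.
    with open(path, "rb") as f:
        return dill.load(f)
```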

@fjxmlzn (Owner) commented May 31, 2024

I see! I am sorry that I misunderstood what you said. It's great that it's working!

@yzion previously helped us create a TF2 branch here: https://github.com/fjxmlzn/DoppelGANger/tree/TF2; you can see the changes they made there. But that was a long time ago. If you have your TF2 code ready (which would be very helpful to the community and to us), feel free to make a pull request to the TF2 branch, or host it in your own repo and I can add a link to your code in the README file.

@meiyor commented Jun 5, 2024

Hi @fjxmlzn, before sharing the new code with you, we want to know how you measured the autocorrelation and MSE for the generated data. Do you have code you can share with us, so we can check whether we obtain results similar to your evaluations? Or could you give us some pseudocode for how to compute your metrics once we have the generated data? We would be really glad if you could help us with that. Thank you again!

@fjxmlzn (Owner) commented Jun 6, 2024

Please see the code for computing autocorrelation here: #20 (comment)

Thanks.
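
For readers who cannot follow the link, here is a minimal sketch of one common way to compute a sample autocorrelation with numpy; it is not necessarily identical to the code in #20:

```python
import numpy as np

def autocorr(x, max_lag):
    # Sample autocorrelation of a 1-D series for lags 0..max_lag-1.
    x = np.asarray(x, dtype=np.float64)
    x = x - x.mean()
    # Autocovariance via correlation, normalized by the lag-0 variance.
    acov = np.correlate(x, x, mode="full")[len(x) - 1:]
    return acov[:max_lag] / acov[0]

# E.g., averaged over all samples of a (num_samples, length, 1) data_feature:
# acf = np.mean([autocorr(s[:, 0], 500) for s in data_feature], axis=0)
```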

@meiyor commented Jun 6, 2024

btw, @fjxmlzn, is the autocorrelation calculated over the entire 50000 samples of the generated data for the Google and Web datasets, for instance? Or is it calculated only for the first samples, representing the first 500 day-lags? How many samples correspond to the number of lags in days?

Which feature index from the generated dataset is used for this autocorrelation? For the Web dataset we have 550 features and for Google we have 2500; is the feature selected here the one with the minimum MSE compared to the original?

I hope you can clarify this for me as soon as possible! Thank you!

@fjxmlzn (Owner) commented Jun 6, 2024

We only computed autocorrelation for the Wiki Web dataset (Figure 1 of https://arxiv.org/pdf/1909.13403).

We computed the autocorrelation over all 50000 samples of the generated data.

In the Wiki Web dataset, there is only one feature, which refers to the daily page view (see Table 7 of https://arxiv.org/pdf/1909.13403; "measurements" means "features" there). The number 550 is the number of days, not the number of features.

Hope that this is helpful!

@meiyor commented Jun 6, 2024

I'm really confused by the way you explain it on GitHub. You write: "data_feature: Training features, in numpy float32 array format. The size is [(number of training samples) x (maximum length) x (total dimension of features)]". The first axis is the number of samples of the time series; what is the second axis? And is the third axis the number of features? The description calls it the feature dimensionality, which is a different thing. In our approach we arranged our own dataset as number_of_samples (>1M) x number_of_features (34) x features_dimensions (1). Will that work? If not, please let us know as soon as possible; that would clarify a lot.

@fjxmlzn (Owner) commented Jun 6, 2024

I see where the confusion is. It would be helpful to read the data formulation section (Section 3.1) of the paper https://arxiv.org/pdf/1909.13403. I explain it below.

Each dataset contains many samples; each sample is a list of features ordered by time (features are called measurements in the paper).

Let's take the Wiki Web Traffic dataset as an example (please take a look at Section 5.1 and Section A, where we had detailed discussions):

  • Each sample is the daily page views of a website (different samples correspond to different websites).
  • The daily page views of a website contain 550 numbers; each number corresponds to the page view of one day. In this case, we say there is only 1 feature (and total dimension of features=1), which is the daily page view. The length of the time series is 550 (i.e., maximum length=550).

I am confused when you say your dataset is of shape number_of_samples (>1M) x number_of_features (34) x features_dimensions (1). Which samples correspond to one time series? Are the samples (more than 1M of them) independent?

@meiyor commented Jun 6, 2024

@fjxmlzn - this is what you wrote in Section 3.1:

"We abstract the scope of our datasets as follows: A dataset D = {O^1, O^2, ..., O^n} is defined as a set of samples O^i (e.g., the clients). Each sample O^i = (A^i, R^i) contains m metadata A^i = [A^i_1, A^i_2, ..., A^i_m]. For example, metadata A^i_1 could represent client i's physical location, and A^i_2 the client's ISP. Note that we can support datasets in which multiple samples have the same set of metadata. The second component of each sample is a time series of records R^i = [R^i_1, R^i_2, ..., R^i_{T^i}], where R^i_j means the j-th measurement of the i-th client. Different samples may contain a different number of measurements. The number of records for sample O^i is given by T^i. Each record R^i_j = (t^i_j, f^i_j) contains a timestamp t^i_j and K measurements f^i_j = [f^i_{j,1}, f^i_{j,2}, ..., f^i_{j,K}]. For example, t^i_j represents the time when the measurement f^i_j is taken, and f^i_{j,1}, f^i_{j,2} represent the ping loss rate and traffic byte counter at this timestamp, respectively. Note that the timestamps are sorted, i.e. t^i_j < t^i_{j+1} for all 1 <= j < T^i."

This is extremely confusing, because here you relate measurements to feature dimensions, whereas following your last explanation they are the features themselves. You also refer to samples, but the samples are not related to the time-series sequence at all; they are independent variables.

So, following your last explanation: the samples are not related to the sequence at all and are independent measurements/variables, right? The max length is the maximum sequence length of the time series, i.e., that axis represents the sequence itself, right? If so, please clarify on GitHub that this axis corresponds to the time-series sequence itself, and that features_dimensions is in fact the number of features. Can you confirm this for us, so we can be sure we are setting up our experiments correctly?

@fjxmlzn (Owner) commented Jun 6, 2024

I am sorry, but I am not sure I fully understand your description; for example, what do "samples are not relating with the time-series sequence at all" and "it represents the sequence itself" mean? I know the terminology can be ambiguous, so let's make it clearer using a concrete example.

For the Wiki web traffic dataset, the shape of data_feature in data_train.npz is 50000x550x1. The [i, j, 0] element of this tensor is the total page views of the i-th website on the j-th day.

What is your data? Once I know that, I can suggest how to format it.
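
A quick way to check this is a sketch like the following; the data_gen_flag key below follows the repo's documented dataset format, so treat it as an assumption here:

```python
import numpy as np

data = np.load("data_train.npz")
data_feature = data["data_feature"]    # (50000, 550, 1) for Wiki web traffic
data_gen_flag = data["data_gen_flag"]  # (50000, 550): 1 while the series is active

# data_feature[i, j, 0] = total page views of the i-th website on the j-th day
print(data_feature.shape, data_gen_flag.shape)
```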

@meiyor commented Jun 6, 2024

Our data is a clinical dataset: we currently have 40k subjects x 270 hours x 34 features (biological signals). We are trying to map our dataset the way you mapped the web traffic dataset in your experiments, assuming each subject is an independent variable, as you clarified before. Let us know if this is valid; of course, I know that whether our data generation converges will depend on our data distribution anyway. If you can suggest anything to make our dataset match your structure more closely, we would be glad. Thank you!

@fjxmlzn (Owner) commented Jun 7, 2024

I see!

Re: "We are assuming each subject is an independent variable as you clarified before". You are right!

Re: format. If the features have one value per hour, and the features are numerical (instead of categorical), then making the data in the shape of 40k x 270 x 34 for data_feature should be good!
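
As an illustration only, here is a sketch of how such a dataset could be packed; the variable names, the fixed-length assumption, and the random placeholder values are ours, and the repo's full format also expects metadata (data_attribute) and the output-description .pkl files, which are omitted here:

```python
import numpy as np

num_subjects, num_hours, num_features = 40000, 270, 34

# Hourly numerical measurements per subject (placeholder random values).
data_feature = np.random.rand(num_subjects, num_hours, num_features).astype(np.float32)

# If every subject has the full 270 hours, the generation flags are all ones;
# shorter series would instead have trailing zeros.
data_gen_flag = np.ones((num_subjects, num_hours), dtype=np.float32)

np.savez("data_train.npz", data_feature=data_feature, data_gen_flag=data_gen_flag)
```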

@meiyor commented Jun 7, 2024

OK, perfect! Thank you very much!

@meiyor commented Jun 7, 2024

By the way, @fjxmlzn, we wanted to ask: do you have the code for plotting the distribution or histogram comparison between the generated and real datasets you evaluated? The Google and Web datasets don't have max/min values reported, but FCC_MBA has them here: https://drive.google.com/drive/folders/19hnyG8lN9_WWIac998rT6RtBB9Zit70X. If you can share the max and min for the Google and Web datasets, we would be glad.

@fjxmlzn (Owner) commented Jun 8, 2024

For web: #27
For Google: there should be "data_feature_max" and "data_feature_min" embedded in the npz files. We used those values to linearly normalize the original data.
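
A sketch of how those embedded values can be used to undo the normalization; whether the data were scaled to [0, 1] or [-1, 1] depends on the output configuration, so treat the exact formula as an assumption:

```python
import numpy as np

data = np.load("data_train.npz")
feature_min = data["data_feature_min"]
feature_max = data["data_feature_max"]
normalized = data["data_feature"]

# Invert a linear [0, 1] normalization; for data scaled to [-1, 1],
# map with (normalized + 1) / 2 first.
original = normalized * (feature_max - feature_min) + feature_min
```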

@fjxmlzn (Owner) commented Jun 17, 2024

Closing the issue for now. Feel free to reopen it if it is still not resolved.

fjxmlzn closed this as completed Jun 17, 2024