Questions regarding the dataset and best hyperparameters #1

Open
felixgwu opened this issue Nov 13, 2018 · 5 comments

Comments

@felixgwu

Hi,

Thank you for sharing the code of this impressive work.
I have two questions regarding how to reproduce the results in the paper.

  1. Based on the paper, it seems that the TWITTER-WORLD dataset is larger than TWITTER-US; however, when I downloaded the data from this link, I found that the files in the na folder are larger than the ones in the world folder, which confuses me. I wonder if there is a naming typo here.

  2. I tried the following command to get the GCN results with the default hyperparameters on GEOTEXT:
    python gcnmain.py -save -i cmu -d data/cmu -enc latin1
    Unfortunately, I only get:
    PM dev results:
    PM Mean: 565 Median: 103 Acc@161: 54
    PM test results:
    PM Mean: 578 Median: 99 Acc@161: 53
    This is a lot worse than the Mean: 546 Median: 45 Acc@161: 60 reported in the paper.
    Could you please share the commands you used to produce the amazing GCN results on all three datasets in Table 1 of the paper?

@afshinrahimi
Owner

afshinrahimi commented Nov 13, 2018

Hi Felix,

I should have added the hyperparameters to the readme file (which I will do very soon).

CMU:
THEANO_FLAGS='device=cuda0,floatX=float32' nice -n 9 python -u gcnmain.py -hid 300 300 300 -bucket 50 -batch 500 -d ./data/cmu/ -enc latin1 -mindf 10 -reg 0.0 -dropout 0.5 -cel 5 -highway

NA:
THEANO_FLAGS='device=cpu,floatX=float32' python -u gcnmain.py -hid 600 600 600 -bucket 2400 -batch 500 -d ~/data/na/ -mindf 10 -reg 0.0 -dropout 0.5 -cel 15 -highway

WORLD:
THEANO_FLAGS='device=cpu,floatX=float32' python -u gcnmain.py -hid 900 900 900 -bucket 2400 -batch 500 -d ~/data/world/ -mindf 10 -reg 0.0 -dropout 0.5 -cel 5 -highway

Why is Twitter-WORLD smaller in size than Twitter-US?
WORLD has a higher number of users but fewer tweets per user, so NA is actually bigger in dataset size.

Also note that the random seeds have changed, so you might not get the exact same results (unfortunately); over several runs they may come out a little better or worse, but in general comparable.

Don't hesitate to contact me if there are more issues.

Afshin

@felixgwu
Author

felixgwu commented Nov 14, 2018

Hi Afshin,

Thank you for your quick response.

I ran the first command:
THEANO_FLAGS='device=cuda0,floatX=float32' nice -n 9 python -u gcnmain.py -hid 300 300 300 -bucket 50 -batch 500 -d ./data/cmu/ -enc latin1 -mindf 10 -reg 0.0 -dropout 0.5 -cel 5 -silent -highway
However, I only got
dev: Mean: 536 Median: 100 Acc@161: 55
test: Mean: 561 Median: 96 Acc@161: 54

I did 10 more runs with different seeds (from 1 to 10), but I still can't get an Acc@161 higher than 55.
Here is the log I got from these 10 runs:
https://gist.github.com/felixgwu/8ae4c6e7a887092ae30c82fea6d6db40

I wonder if I made some mistakes.

Here is how I created the environment.
I first created a new environment using the requirements.txt file:

conda create --name geo --file requirements.txt

However, I got an error when it tried to import lasagne (version 0.1).
The error occurred at:
from theano.tensor.signal import downsample
It seems that theano doesn't have downsample in theano.tensor.signal, so I upgraded both theano and lasagne to the newest versions with these commands:

pip install --upgrade https://github.com/Theano/Theano/archive/master.zip
pip install --upgrade https://github.com/Lasagne/Lasagne/archive/master.zip
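To double-check which versions ended up installed (just a sanity check on my side, not something from the repo), a one-liner like this prints them:
python -c "import theano, lasagne; print(theano.__version__, lasagne.__version__)"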

After that I can run the gcnmain.py script without errors.
BTW, I use CUDA 8.0.
I only have limited experience with Theano. Maybe there is something wrong here.

-Felix

@afshinrahimi
Owner

afshinrahimi commented Nov 14, 2018

Hi Felix,

Regarding the Lasagne and Theano update, you're right, they should be upgraded.

Regarding running with the default hyperparameters: we shouldn't do that, because the defaults are not suitable for all three datasets; they're just there so that the code runs (e.g. the default hidden layer size is only 100 and the bucket size is 300, which should be 300 300 300 and 50, respectively, for cmu).

Regarding when you run with the command
THEANO_FLAGS='device=cuda0,floatX=float32' nice -n 9 python -u gcnmain.py -hid 300 300 300 -bucket 50 -batch 500 -d ./data/cmu/ -enc latin1 -mindf 10 -reg 0.0 -dropout 0.5 -cel 5 -silent -highway

I made the mistake of including -silent in the earlier command I posted here. If you remove it, you'll see the correct results, like in this log: https://gist.github.com/felixgwu/8ae4c6e7a887092ae30c82fea6d6db40
That's why you only got the bare dev: and test: lines.

Don't hesitate to send me feedback if something is still wrong; I'd love to fix errors and help.

Thanks Felix.

Afshin

@felixgwu
Author

Hi Afshin,

Thank you so much!
I can finally reproduce your results in the paper on GEOTEXT.
At first I couldn't reproduce it even with the correct command, but after I removed the data/cmu/dump.pkl and data/cmu/vocab.pkl and ran the script again, I got the correct results.
I wonder if something is cached here, so that I have to delete these files every time I use a different set of hyperparameters.
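In case it helps anyone else, the cleanup I ran was simply removing those two cached files:
rm data/cmu/dump.pkl data/cmu/vocab.pkl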

Here is the log in case someone else also wants to reproduce it.
https://gist.github.com/felixgwu/8c74a28b040b95635fcf28c5c1e3e078

I'll try the other two larger datasets and hopefully, I can reproduce them.
BTW, there are two typos in the commands for the NA and WORLD datasets in the README file:
~/data/na/ should be ./data/na/ and ~/data/world/ should be ./data/world/.

-Felix

@afshinrahimi
Owner

Hi Felix,

Great news!

After the first run, the code saves the preprocessed dataset in dump.pkl in the dataset directory and loads it by default on subsequent runs. If that file was built with incorrect hyperparameters, it will still be loaded even if the new hyperparameters (e.g. bucket size) are correct. To prevent that, use the -builddata option to force it to rebuild dump.pkl.
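For example, the CMU command from above with -builddata appended (only the flag is new; everything else is unchanged):
THEANO_FLAGS='device=cuda0,floatX=float32' python -u gcnmain.py -hid 300 300 300 -bucket 50 -batch 500 -d ./data/cmu/ -enc latin1 -mindf 10 -reg 0.0 -dropout 0.5 -cel 5 -highway -builddata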

Thanks a lot, Felix, for the typo fixes and all the other help (I'll add them to the repo ASAP). They make it easier for everyone else to reproduce the results.

Afshin
