Don't download dump files that are not done yet #63

Closed
vrandezo opened this issue Apr 21, 2014 · 5 comments

@vrandezo
Member

E.g., with wdtk-example: it is already downloading the April 20 dump right now, although that dump has not been fully generated yet. Can we first check that a dump is complete and available before starting to download it?

@mkroetzsch
Member

I thought we were already doing this. The script always fetches the maxrevid before attempting a download, and it only starts if this id can be found. There are also some checks for the "done" status that are independent of this, but maybe there is a gap in one case. What kind of dump are you talking about: daily, current, or full?

@vrandezo
Member Author

It was the current one.

@guenthermi
Member

I ran into the same problem while testing my json-serializer example code. At first I got an error saying that there is no maximal revision id. Later it downloaded the incomplete dump and reported that processing had finished after downloading and processing only 160 MB of the dump file (wikidatawiki-20140420-pages-meta-current).

@mkroetzsch mkroetzsch added the bug label Apr 23, 2014
@mkroetzsch
Member

Confirmed. We have code that checks the md5sums to see whether a dump is done, but the download code does not use it and relies on the maxrevid alone. I had assumed that the maxrevid is not published before the dump is done, but that seems to be wrong.
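
To illustrate the gap, here is a minimal sketch of a download gate that requires both signals. It is not the actual Wikidata Toolkit code: the function names, the `.xml.bz2` file name, and the checksum listing format (`<md5 digest>  <file name>` per line) are assumptions made for the example.

```python
from typing import Optional


def dump_file_is_listed(md5sums_text: str, dump_file_name: str) -> bool:
    """Return True if dump_file_name has an entry in the checksum listing.

    Assumption: each line looks like "<md5 hex digest>  <file name>", and a
    file only appears in the listing once it has been fully written.
    """
    for line in md5sums_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == dump_file_name:
            return True
    return False


def should_download(max_revision_id: Optional[int],
                    md5sums_text: str,
                    dump_file_name: str) -> bool:
    """Hypothetical gate: require a maxrevid AND a checksum entry.

    Relying on max_revision_id alone is the behaviour described above as
    insufficient, because the id can be published before the dump is done.
    """
    if max_revision_id is None:
        return False
    return dump_file_is_listed(md5sums_text, dump_file_name)


# Made-up listing: only the stub file is finished so far.
SAMPLE_MD5SUMS = (
    "0123456789abcdef0123456789abcdef  "
    "wikidatawiki-20140420-stub-meta-current.xml.gz\n"
)

if __name__ == "__main__":
    name = "wikidatawiki-20140420-pages-meta-current.xml.bz2"
    # The maxrevid is known, but the file is not listed yet.
    print(should_download(12345, SAMPLE_MD5SUMS, name))
```

With the sample listing, the gate rejects the pages-meta-current file even though a maxrevid is available, which is the case that previously slipped through.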

@mkroetzsch mkroetzsch self-assigned this Apr 23, 2014
mkroetzsch added a commit that referenced this issue Apr 23, 2014
* Fix issue #63 by checking availability explicitly
* Finding most recent dump of some type now checks availability
* Better logging output to show which dumps are found/processed
* Updated tests to work with new code
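
The behaviour described in the commit message could look roughly like the sketch below: walk the known dump dates from newest to oldest, skip any that are not yet available, and log the decision. This is an illustration only, not the committed code; `find_most_recent_available` and the `is_available` callback are hypothetical names, and dates are assumed to be `YYYYMMDD` strings.

```python
import logging
from typing import Callable, Iterable, Optional

logger = logging.getLogger("dump_selection_sketch")


def find_most_recent_available(
        dump_dates: Iterable[str],
        is_available: Callable[[str], bool]) -> Optional[str]:
    """Return the newest dump date for which is_available(date) is true.

    YYYYMMDD strings sort chronologically, so a reverse lexicographic sort
    yields newest first. Returns None if no dump is available.
    """
    for date in sorted(dump_dates, reverse=True):
        if is_available(date):
            logger.info("Using dump %s", date)
            return date
        logger.info("Skipping dump %s: not available yet", date)
    return None


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Made-up state: the 20140420 dump is still being generated.
    finished = {"20140406", "20140413"}
    dates = ["20140406", "20140420", "20140413"]
    print(find_most_recent_available(dates, lambda d: d in finished))
```

Passing the availability check in as a callback keeps the selection logic independent of how "done" is detected, so it can be tested without network access.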
vrandezo referenced this issue Apr 23, 2014
Improve dump downloading behaviour
@vrandezo
Member Author

Thx!
