Don't download dump files that are not done yet #63
Comments
I thought that we are already doing this. The script always fetches the maxrevid before attempting a download, and it only starts if this id can be found. There are also some checks for the "done" status independent of this, but maybe there is a gap in one case. What kind of dump are you talking about: daily, current, or full?
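The maxrevid gate described above can be sketched as follows. This is a hypothetical illustration, not the actual Wikidata Toolkit API: the assumption is that the downloader reads a small maxrevid file for the dump date and proceeds only if a valid revision id can be parsed from it.

```java
import java.util.Optional;

// Hypothetical sketch of the maxrevid check: parse the content of a
// maxrevid file and only start the download if a valid id is found.
// Method and class names here are illustrative.
public class MaxRevIdCheck {

    /** Parses the content of a maxrevid file; empty if absent or malformed. */
    static Optional<Long> parseMaxRevId(String fileContent) {
        if (fileContent == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(Long.parseLong(fileContent.trim()));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // A present, well-formed maxrevid file allows the download to start.
        System.out.println(parseMaxRevId("123456789\n").isPresent());
        // A missing or empty file blocks it.
        System.out.println(parseMaxRevId("").isPresent());
    }
}
```

As the rest of the thread shows, this gate alone is not sufficient, because the maxrevid can be published before the dump is complete.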
It was the current one.
I got the same problem when testing my json-serializer example code. At first I got an error that there is no maximal revision id. Later it downloaded the incomplete dump and reported that processing had finished after downloading and processing only 160 MB of the dump file (wikidatawiki-20140420-pages-meta-current).
Confirmed. We have code that checks the md5sums to see whether a dump is done, but the download does not use this and relies on the maxrevid alone. My assumption was that the maxrevid is not published before the dump is done, but that seems to be wrong.
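Given that diagnosis, the fix amounts to requiring both completion markers before downloading, rather than the maxrevid alone. A minimal sketch of that decision logic, with hypothetical names (the real Wikidata Toolkit code checks these markers differently):

```java
// Hypothetical sketch: treat a dump as available only if both the
// maxrevid and the md5sums file exist, since the maxrevid alone may be
// published before the dump has finished generating (the bug in #63).
public class DumpAvailability {

    /** Returns true only if both completion markers are present. */
    static boolean isDumpDone(boolean hasMaxRevId, boolean hasMd5Sums) {
        return hasMaxRevId && hasMd5Sums;
    }

    public static void main(String[] args) {
        // maxrevid published early, md5sums not yet written: do not download.
        System.out.println(isDumpDone(true, false));
        // Both markers present: safe to download.
        System.out.println(isDumpDone(true, true));
    }
}
```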
* Fix issue #63 by checking availability explicitly
* Finding most recent dump of some type now checks availability
* Better logging output to show which dumps are found/processed
* Updated tests to work with new code
Thx!
For example, using wdtk-example, it is right now already downloading the April 20 dump, but the dump is not fully generated yet. Can we first check whether a dump is complete and available before starting to download it?