delivering broken data #210
Comments
|
Hi, thanks for reporting this issue. It's already documented on the Overpass API status wiki page as a Won't fix.
Here's how to find out what happened (in case you're interested in the details): Way 340431930 refers to some nodes which are not available in the Overpass API DB, e.g. node 3488830107. Both overpass-api.de and api.openstreetmap.fr/oapi/ are affected at this time, so most likely this is a diff inconsistency caused by an upstream glitch.

Let's take a look at the diffs now. The missing nodes were introduced in changeset 30659805:

<changeset id="30659805" user="claireralph" uid="2876298" created_at="2015-04-30T13:29:47Z" closed_at="2015-04-30T13:29:49Z"/>

Based on the timestamp of this changeset, this leads us to the following diff file in the directory http://planet.openstreetmap.org/replication/minute/001/374/ Note that there's a large gap of about 45 minutes between sequence numbers 686 and 687, and that 687.osc.gz is unusually large. We compared this with the corresponding files on the dev instance, which uses the same mechanism to pull minutely diff files (and has no reload mechanism), i.e. the files that were actually applied back at that time.

Result: file 001/374/687.osc.gz was created twice on the upstream planet.openstreetmap.org server with different file contents, causing the inconsistency you've noticed. Most of the changes from the updated 687.osc.gz were never applied.

<way id="340431930">
<bounds minlat="28.0257255" minlon="85.1107004" maxlat="28.0399973" maxlon="85.1200248"/>
[...]
<nd ref="3490601202" lat="28.0399973" lon="85.1200248"/>
<nd ref="3490601193" lat="28.0396881" lon="85.1199142"/>
<nd ref="3476578017" lat="28.0275874" lon="85.1131419"/>
<nd ref="3476578013" lat="28.0274558" lon="85.1131526"/>
<nd ref="3488830107"/>
<nd ref="3488830106"/>
<nd ref="3488830105"/>
<nd ref="3488830104"/>
<nd ref="3488830103"/>
<nd ref="3477698777" lat="28.0272279" lon="85.1128781"/>
<nd ref="3488836288"/>
<nd ref="3488830108"/>
<nd ref="3477698776" lat="28.0270524" lon="85.1124452"/>
<nd ref="3488830102"/>
<nd ref="3488014975" lat="28.0267980" lon="85.1121412"/>
[...]
<tag k="highway" v="path"/>
</way> |
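As an aside, the symptom above (`<nd>` entries without `lat`/`lon`) is easy to spot programmatically. A minimal sketch using Python's standard library, with the abbreviated XML taken from the quote above:

```python
import xml.etree.ElementTree as ET

# Abbreviated excerpt of the Overpass output quoted above.
way_xml = """
<way id="340431930">
  <nd ref="3477698777" lat="28.0272279" lon="85.1128781"/>
  <nd ref="3488836288"/>
  <nd ref="3488830108"/>
  <nd ref="3477698776" lat="28.0270524" lon="85.1124452"/>
</way>
"""

way = ET.fromstring(way_xml)
# References the way contains but for which the DB returned no coordinates:
missing = [nd.get("ref") for nd in way.findall("nd") if nd.get("lat") is None]
print(missing)  # ['3488836288', '3488830108']
```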
|
So it seems we can only hope and pray there won't be too many
diff "glitches"…
|
|
Thanks mmd, that was quite some detective work! The api.openstreetmap.fr instance was indeed affected by this glitch and missed what was in that diff. However, the true nature of that "corrupt glitch" is still unknown to me, and it is quite likely to reappear. The only clue I have is that during the import of the supposedly corrupted diff file, the update_database process ran into an out-of-memory condition. Next time I'll try to check in more detail what was in that diff. |
It's all about the replication process on the OSM main server. Overpass API just processes what's in the minutely diffs directory at that time. Those files were ok and not corrupted in any way, and there was no out-of-memory condition whatsoever.

The real trouble starts a bit later: from time to time the replication process on the main server seems to hang and somehow needs to be restarted. During that restart, some already written .osc.gz files seem to be written a second time, this time with many more changesets in them. Overpass API won't pick up those newly written .osc.gz files, and that's where you lose your changesets.

I don't know if writing the same diff files multiple times is considered ok. Right now it just seems to occur once in a while. Maybe @tomhughes has some explanation of what is really going on here, or of how this situation could be avoided (or at least detected) in the first place. Whatever makes most sense. |
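A data consumer could at least detect such a rewrite by remembering a checksum per replication sequence number. A minimal sketch with hypothetical helper names (not part of Overpass API), where `seen` stands in for whatever persistent bookkeeping the consumer keeps:

```python
import hashlib

def record_diff(seen: dict, sequence: str, payload: bytes) -> bool:
    """Remember the checksum of a replication diff; return True if this
    sequence number was seen before with *different* contents."""
    digest = hashlib.sha256(payload).hexdigest()
    rewritten = sequence in seen and seen[sequence] != digest
    seen[sequence] = digest
    return rewritten

seen = {}
record_diff(seen, "001/374/687", b"first version")   # False: new sequence
record_diff(seen, "001/374/687", b"first version")   # False: identical re-download
print(record_diff(seen, "001/374/687", b"second version"))  # True: rewritten upstream
```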
|
It's pretty simple really: if you get a zero-byte file, ignore it and keep retrying until you get a proper file. |
|
@tomhughes: the first time, the file size was 48375 bytes. The very same file was later written with a size of 2.2 MB. Both versions look perfectly valid and can be gunzipped without any issue, so checking for a zero-byte file won't help here. |
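To compare the two versions, one could tally the OsmChange actions contained in each file. A rough sketch using only the standard library (the file names in the usage comment are placeholders for the two archived versions):

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter

def osc_stats(path):
    """Count nodes/ways/relations per <create>/<modify>/<delete> block
    in an OsmChange (.osc.gz) file."""
    with gzip.open(path, "rb") as f:
        root = ET.parse(f).getroot()
    counts = Counter()
    for action in root:        # <create>, <modify>, <delete>
        for elem in action:    # <node>, <way>, <relation>
            counts[(action.tag, elem.tag)] += 1
    return counts

# Hypothetical usage against the two archived versions of the diff:
# print(osc_stats("687.first.osc.gz"))
# print(osc_stats("687.second.osc.gz"))
```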
|
Oh I know nothing about that then. I assumed you were talking about the other day when the server crashed and left a zero byte file that hadn't been flushed, so I had to reset the state. |
|
Ah, I guess if you managed to download it before the server crashed then it might not have been zero-sized... but after the server came back up it was, because it had never been flushed. The real fix is to make osmosis sync the files to disk properly, so that's the place to look if you want to improve things. |
|
Just as a reference: I put both versions on the dev server now: http://dev.overpass-api.de/issue210/defect/ |
@tomhughes: I'm not exactly sure how to get in touch with the right people. There's an osmosis-dev list around with very little traffic, and there's the OSM trac. As I don't have access to the main OSM instance to collect further evidence (such as log files documenting the exact sequence of events before/during/after the hiccup), would you mind creating a new ticket on trac with further details and linking it here? That would really be terrific. I will of course add any insight I have from a data consumer point of view, along with the implications for follow-on processes. |
|
Well I don't think osmosis has any real active maintenance at the moment and I certainly don't have any special evidence to offer. |
|
Thanks for digging into this issue so painstakingly.
|
|
What the hell is that supposed to mean? I've told you what I think happened. There are no logs that will, or ever could, confirm that theory. What exactly do you want me to do? |
|
If it makes you all happy I have raised https://trac.openstreetmap.org/ticket/5312 for you now. |
|
I'm sorry for the late reply. I think I would also like to add protection mechanisms to the download process:

The first: the update process should become resilient against applying the same update twice. After talking with lonvia at the FOSSGIS conference, I would realize this by ignoring an element if a version of that element with a strictly newer timestamp already exists.

The second protection should be to suspend updates and raise an alert if no new diffs become available within a certain time window (I'm thinking of half an hour or so).

The third protection mechanism is to make regular backups of the mirror database. Currently it is possible to catch up at about 30-fold real time, hence even a backup once a week would mean that under the worst possible circumstances the mirror could catch up within 8 hours or so.

Please note that these protections are useful on their own. For example, the third one will mitigate any kind of accident, e.g. also a bit flip due to hardware problems. The first one will help when starting minutely updates from a fresh planet, because you can then safely start with some overlap without the risk of damaged data.

On the other hand: I appreciate Tom's fast response very much. But controlling exactly what happens in a crash-reboot-recover situation on a nontrivial system is close to impossible, because systems on other layers (the file system, the DBMS, etc.) may have their own idea of consistent state and impose it at an uncoordinated point in time in an unexpected way. This is most likely a really tough bug there, but a manageable bug here. Hence we should preferably fix it downstream. |
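The first protection could be sketched as follows, as a pure-Python illustration with a dict standing in for the database (the actual Overpass implementation is in C++ and differs; all names here are made up):

```python
def should_apply(db: dict, elem_id: int, version: int, timestamp: str) -> bool:
    """Skip an incoming element if the database already holds a strictly
    newer timestamp for it, so applying the same diff twice is harmless.
    ISO 8601 timestamps compare correctly as plain strings."""
    current = db.get(elem_id)
    if current is not None and current["timestamp"] > timestamp:
        return False
    db[elem_id] = {"version": version, "timestamp": timestamp}
    return True

db = {}
should_apply(db, 3488830107, 1, "2015-04-30T13:29:47Z")  # True: first import
should_apply(db, 3488830107, 1, "2015-04-30T13:29:47Z")  # True: replay, harmless
print(should_apply(db, 3488830107, 1, "2015-04-30T12:00:00Z"))  # False: stale data ignored
```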
|
malenki wrote:
Nothing.

Roland Olbricht wrote:
For which planned downtimes should also be taken into regard. |
I second that.
The main reason why I was asking for this ticket is to raise some awareness that this kind of issue in fact exists. I'm perfectly fine with having additional checks on Overpass API. However, there are probably other applications out there, which might be impacted in a similar way. If the osmosis developers later come back with "sorry, there's nothing we can do about it", that's ok. If they could isolate the issue and even fix it, every data consumer in the OSM ecosystem would benefit. Win for everyone. |
|
I'm currently reviewing which bugs should be fixed in the next version. This bug won't be; it is too complex. Let me briefly explain the details: the first protection is already effective for non-attic data provided that:
For attic data, the ramifications are much more difficult: until now, no attic data has ever been deleted once it has been written. This cannot be upheld anymore if we want to be resilient in the way described, so this has to be checked extra carefully to ensure that no data is lost. While it is technically possible, it is beyond the kind of little thing that can be fixed on the fly.

The second protection will be implemented in the course of this release preparation, see commit fd804d9.

The third is less a question of software and more a question of organization. In short: we don't have enough disk space to automate this in a safe way, hence I will not promise it right now. |
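The second protection (the stale-diff watchdog) boils down to a simple timestamp check. A sketch with the half-hour window taken from the proposal above; this is an illustration, not the actual code from commit fd804d9:

```python
import time

STALE_AFTER = 30 * 60  # seconds; the half-hour window proposed above

def diffs_stale(last_diff_epoch: float, now: float) -> bool:
    """True if updates should be suspended and an alert raised because
    no new diff has arrived within the window."""
    return now - last_diff_epoch > STALE_AFTER

# e.g. in the update loop (hypothetical hook):
# if diffs_stale(last_seen, time.time()):
#     suspend_updates_and_alert()
```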
|
The last part of the issue is solved by an automatic daily cloning. |
When downloading this area via Overpass API, way 340431930 isn't displayed in JOSM, though one can select and edit it.
After downloading the area from OSM via download area or download object, it is displayed normally.
In JOSM's bugtracker the bug got closed as [othersoftware] (image and example data there, too), since the downloaded data is missing a node. Thus I assume I should file the bug against the API…