I'm running a custom harvester on my CKAN 2.5.2 with datapusher's stable branch. During a harvest, that harvester uploads many CSV resources to CKAN, in the sense that the resource data is not linked but actually uploaded to CKAN's file store.
When datapusher is enabled, I'm getting lots of HTTPError messages in datapusher's log (see #69), which turn out to be 404s for either the metadata or the actual data of resources just created by the harvester.
From my logs I can see that my harvester creates the resource in question and adds a view to it (at that point the resource must have existed, otherwise trying to add a view to it would have failed). The resource is also submitted to the datapusher, which then ends up receiving a 404. At that point the resource does in fact not exist anymore. This means that not only is there a problem with the upload to the datastore via datapusher, but resources are actually lost. Interestingly, during the same harvesting operation other resources are created just fine, including their addition to the datastore via the datapusher.
If I disable the datapusher plugin, the harvest runs without losing any resources. Once I enable it again, resources start getting lost. Another of my harvesters, which also creates many resources in a dataset, works fine, but that one simply links the resource data instead of uploading it.
During my experiments, the problem only occurs after some resources have been added to the package. That is, the package gets created, some resources are added to it without problems and then suddenly all remaining resources of the package fail. The number of resources after which the error starts isn't deterministic, but it's not totally random either (i.e. it's usually the same resource but sometimes it's the one before or the one after).
In addition, during the same harvest operation I often get multiple 404s for the same resource, since existing resources are apparently re-submitted to the datapusher when new resources are added to the same dataset. Indeed, the datapusher plugin's notify handler is called again and again for the same resource (the first time with operation 'new' and then with None). This might be a problem with IResourceUrlChange, since I'm not sure why the resource's URL should change again and again.
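One way to confirm the repeated calls is to count them per resource. The following is a minimal sketch, not the datapusher plugin's code: `record_notify` and the resource IDs are hypothetical stand-ins for a logging line placed inside the real handler.

```python
from collections import Counter

# Hypothetical stand-in for a logging hook placed inside the plugin's
# notify handler; it only counts how often each resource is seen.
notify_calls = Counter()


def record_notify(resource_id, operation=None):
    """Record one notify call, keyed by resource ID and operation."""
    notify_calls[(resource_id, operation)] += 1


# Simulated sequence resembling the pattern described above: one 'new'
# call, followed by repeated calls without an operation for the same resource.
record_notify("res-1", "new")
record_notify("res-1")
record_notify("res-2", "new")
record_notify("res-1")

# Keep only the (resource, operation) pairs that were seen more than once.
repeated = {key: n for key, n in notify_calls.items() if n > 1}
print(repeated)
```

Any entry in `repeated` points at a resource that was submitted more than once with the same operation.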
I found out that this is some kind of timing issue: If I put a time.sleep(20) at the beginning of push_to_datastore so that my harvest has completed before datapusher runs then I do not get any errors and all resources are imported correctly. Similarly, if I put time.sleep(20) pauses between the resource creations of my harvester then it also works.
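As a variation on the fixed 20-second pause, the harvester could wait only as long as needed before creating the next resource. This is a sketch under assumptions: `wait_until` and `previous_push_finished` are hypothetical names, and the condition here is simulated. What a real harvester would have to check is not something I have worked out.

```python
import time


def wait_until(predicate, timeout=20.0, interval=0.5):
    """Poll predicate() until it returns True or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical condition: pretend the previous resource's push finishes
# after three polls. A real harvester would need its own check here.
polls = {"count": 0}


def previous_push_finished():
    polls["count"] += 1
    return polls["count"] >= 3


done = wait_until(previous_push_finished, timeout=5.0, interval=0.01)
print(done, polls["count"])
```

Like the `time.sleep(20)` pauses, this only works around the timing issue; it does not explain why resources disappear.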
I couldn't find code in the involved parties (CKAN, datapusher, harvester) that actively deletes resources, so my guess is that this is a synchronization problem where one of the parties retrieves metadata, modifies it and then re-submits it while another party has modified the metadata in between. However, the information flow is quite complex and timing-sensitive, so I haven't been able to figure this out completely yet.
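The kind of lost update I have in mind can be illustrated with a small in-memory simulation. This is only a sketch of the suspected mechanism, not a confirmed diagnosis: `store`, `package_show` and `package_update` are local stand-ins named after the CKAN actions, and the `datastore_active` flag is just an example of a modification.

```python
import copy

# Hypothetical in-memory stand-in for CKAN's package storage.
store = {"pkg": {"resources": [{"id": "r1"}]}}


def package_show(name):
    """Return a copy of the package, like reading its metadata via the API."""
    return copy.deepcopy(store[name])


def package_update(pkg_dict, name="pkg"):
    """Overwrite the whole package, including its full resource list."""
    store[name] = copy.deepcopy(pkg_dict)


# Party A (e.g. a background job) reads the package while only r1 exists.
stale = package_show("pkg")

# Party B (e.g. the harvester) adds r2 and saves the package.
fresh = package_show("pkg")
fresh["resources"].append({"id": "r2"})
package_update(fresh)

# Party A modifies its stale copy and writes it back, silently dropping r2.
stale["resources"][0]["datastore_active"] = True
package_update(stale)

remaining_ids = [r["id"] for r in store["pkg"]["resources"]]
print(remaining_ids)  # ['r1']
```

Nobody deletes `r2` explicitly here, which would match the fact that I couldn't find any code that actively deletes resources.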