Uploaded resources are lost when DataPusher is enabled #100

torfsen opened this issue Jul 1, 2016 · 0 comments

torfsen commented Jul 1, 2016

I'm running a custom harvester on my CKAN 2.5.2 with datapusher's stable branch. During a harvest, that harvester uploads many CSV resources to CKAN, in the sense that the resource data is not linked but actually uploaded to CKAN's file store.

When datapusher is enabled, I'm getting lots of HTTPError messages in datapusher's log (see #69), which turn out to be 404s for either the metadata or the actual data of resources just created by the harvester.

From my logs I can see that my harvester creates the resource in question and adds a view to it (at that point the resource must have existed, otherwise trying to add a view to it would have failed). The resource is also submitted to the datapusher, which then ends up receiving a 404. At that point the resource does in fact not exist anymore. This means that not only is there a problem with the upload to the datastore via datapusher, but resources are actually lost. Interestingly, during the same harvesting operation other resources are created just fine, including their addition to the datastore via the datapusher.

If I disable the datapusher plugin, the harvest runs without losing any resources. Once I enable it again, resources start getting lost. Another of my harvesters, which also creates many resources in a dataset, works fine, but that one simply links the resource data instead of uploading it.

During my experiments, the problem only occurs after some resources have been added to the package. That is, the package gets created, some resources are added to it without problems and then suddenly all remaining resources of the package fail. The number of resources after which the error starts isn't deterministic, but it's not totally random either (i.e. it's usually the same resource but sometimes it's the one before or the one after).

In addition, during the same harvest operation I often get multiple 404s for the same resource, since existing resources are apparently re-submitted to the datapusher when new resources are added to the same dataset. Indeed, the datapusher plugin's notify handler is called again and again for the same resource (first time with operation 'new' and then None). This might be a problem with IResourceUrlChange, since I'm not sure why the resource's URL should change again and again.
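One way to make the repeated notifications visible is to record every call per resource. The following is only a minimal sketch: `NotifyRecorder` and its `notify(resource_id, operation)` signature are hypothetical stand-ins for illustration, not the datapusher plugin's actual handler, and the resource IDs are made up.

```python
from collections import defaultdict


class NotifyRecorder:
    """Hypothetical helper that records which operations were seen per resource."""

    def __init__(self):
        # Maps a resource ID to the list of operations it was notified with.
        self.calls = defaultdict(list)

    def notify(self, resource_id, operation=None):
        # Stand-in for a notify handler: only records the call.
        self.calls[resource_id].append(operation)

    def resubmitted(self):
        # Resources that were notified more than once.
        return {rid: ops for rid, ops in self.calls.items() if len(ops) > 1}


recorder = NotifyRecorder()
recorder.notify("res-1", "new")   # resource created
recorder.notify("res-2", "new")   # another resource added to the same dataset
recorder.notify("res-1", None)    # res-1 notified again without an operation
recorder.notify("res-1", None)
print(recorder.resubmitted())
```

With the made-up calls above, only `res-1` shows up as resubmitted, first with `'new'` and then twice with `None`, which mirrors the pattern described in the logs.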

I found out that this is some kind of timing issue: if I put a time.sleep(20) at the beginning of push_to_datastore, so that my harvest has completed before datapusher runs, then I do not get any errors and all resources are imported correctly. Similarly, if I put time.sleep(20) pauses between the resource creations of my harvester, it also works.
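The delay experiment can be sketched as a small wrapper. This is an illustration only: `delayed` and `push_stub` are hypothetical names, and `push_stub` is a stand-in for the real job, whose signature is not reproduced here. The `sleep` parameter exists solely so the delay can be replaced in tests.

```python
import time
from functools import wraps


def delayed(seconds, sleep=time.sleep):
    """Return a decorator that pauses for `seconds` before each call."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            sleep(seconds)  # give the harvest time to finish first
            return func(*args, **kwargs)
        return wrapper
    return decorator


def push_stub(resource_id):
    # Hypothetical stand-in for the job that pushes a resource to the datastore.
    return "pushed " + resource_id


# Use a fake sleep here so the example runs instantly; the experiment
# described above used a real 20 second pause.
recorded_delays = []
slow_push = delayed(20, sleep=recorded_delays.append)(push_stub)
result = slow_push("res-1")
```

This only hides the race by serializing the two parties in time; it is a diagnostic aid, not a fix.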

I couldn't find code in the involved parties (CKAN, datapusher, harvester) that actively deletes resources, so my guess is that this is a synchronization problem where one of the parties retrieves metadata, modifies it and then re-submits it while another party has modified the metadata in between. However, the information flow is quite complex and timing-sensitive, so I haven't been able to figure this out completely yet.
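The suspected read-modify-write race can be illustrated with a deterministic, in-memory sketch. Everything here is an assumption for illustration: `store`, `show_package`, and `update_package` are hypothetical stand-ins and do not call CKAN, and which party actually holds the stale copy has not been confirmed.

```python
import copy

# Hypothetical in-memory stand-in for the dataset metadata store.
store = {"pkg": {"resources": ["a.csv"]}}


def show_package(name):
    # Return a private copy, like fetching metadata over an API.
    return copy.deepcopy(store[name])


def update_package(name, pkg):
    # Overwrite the whole package, including its resource list.
    store[name] = copy.deepcopy(pkg)


# Party A reads the package while it has only one resource.
stale = show_package("pkg")

# Party B adds a second resource and saves it.
fresh = show_package("pkg")
fresh["resources"].append("b.csv")
update_package("pkg", fresh)

# Party A now saves an unrelated change based on its stale copy.
stale["some_flag"] = True
update_package("pkg", stale)

print(store["pkg"]["resources"])  # ['a.csv'] -- 'b.csv' has been lost
```

No party deletes anything explicitly, yet the second resource disappears because the stale copy overwrites the newer resource list. That would be consistent with the 404s seen for resources that demonstrably existed moments earlier.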
