DataPusher called multiple times when creating a dataset #2856
And for the record, before 2.5 the DataPusher always got called two times when creating a dataset, because the wonderful […]
Hi, may I know if this problem is solved? My CKAN is installed on Ubuntu 12.10, with CKAN version 2.5.1. Just like amercader said, I checked the log: most of the API actions get called twice when I upload a dataset, such as datastore_create, datastore_delete, datastore_search, etc. Please let me know if there is any solution for this. Thank you
@TkTech I'm working on a patch for this, so grabbing it from you
The datapusher plugin uses the notify extension point to check if a resource was created or updated. `datapusher_submit` will create a new task with status pending when first called and `datapusher_hook` will mark it as complete when necessary. We can use that to avoid resending jobs unnecessarily.
Rather than always updating the resource, just do it when it is actually needed, i.e.:

* On datastore_create, when the resource doesn't already have a datastore_active=True extra
* On datastore_delete, only if the resource doesn't already have a datastore_active=False extra
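A minimal sketch of the pending-task check described above (the helper name `should_submit` and the exact state values are assumptions for illustration, not the actual ckanext-datapusher code):

```python
import ckan.plugins.toolkit as toolkit


def should_submit(context, resource_id):
    """Return False if a DataPusher job for this resource is already
    pending, so the notify hook does not resubmit it unnecessarily."""
    try:
        task = toolkit.get_action('task_status_show')(context, {
            'entity_id': resource_id,
            'task_type': 'datapusher',
            'key': 'datapusher',
        })
        # datapusher_submit sets the state to 'pending'; datapusher_hook
        # later flips it to 'complete' (or 'error') when the job finishes
        return task['state'] not in ('pending', 'submitting')
    except toolkit.ObjectNotFound:
        # No task recorded yet: this is the first submission
        return True
```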
Added a new `datastore_resource_exists` function. Even though this does not reduce the number of requests, it helps with ckan/ckan#2856
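For context, a rough sketch of what such an existence check can look like, assuming direct SQL access to the DataStore database (where each resource is stored in a table named after its id); this is an illustration, not the actual ckanext-datastore code:

```python
import sqlalchemy


def datastore_resource_exists(resource_id, connection):
    """Return True if a DataStore table exists for this resource.

    `connection` is assumed to be a SQLAlchemy connection to the
    DataStore (PostgreSQL) database.
    """
    result = connection.execute(
        sqlalchemy.text('SELECT 1 FROM pg_tables WHERE tablename = :id'),
        {'id': resource_id},
    )
    return result.fetchone() is not None
```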
Previous commits prevent new DataPusher jobs from being created if there is an ongoing one. There are cases, though, where we actually want to resubmit: basically, when the data source has changed since the first DataPusher job started. This can be because a new file was uploaded or because the URL was changed.

* For file uploads we check the `last_modified` resource property (which is set after an upload) against the task creation date.
* For URLs we check whether the original URL is different from the current one.

We store both the task creation date and the original URL in the metadata sent to (and returned from) the DataPusher.
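A sketch of those two checks (parameter names are illustrative; `task_created` and `original_url` stand in for the metadata stored with the ongoing job):

```python
import datetime


def needs_resubmission(last_modified, current_url, task_created, original_url):
    """Resubmit only if the data source changed after the ongoing
    DataPusher job was created. All datetimes are assumed to use the
    same timezone (see the UTC discussion further down this thread)."""
    # A new file was uploaded after the ongoing job started
    if last_modified is not None and last_modified > task_created:
        return True
    # The URL was changed since the ongoing job started
    return current_url != original_url


# Example: a file uploaded an hour after the job started triggers a resubmit
print(needs_resubmission(
    datetime.datetime(2016, 3, 1, 12, 0), 'http://example.com/data.csv',
    datetime.datetime(2016, 3, 1, 11, 0), 'http://example.com/data.csv'))
# -> True
```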
Because of the way actions are accessed via `get_action`, it is a pain to mock them. This decorator adds a mock object for the provided action as a parameter to the test function. The mock is discarded at the end of the function, even if an exception is raised. This mocks the action both when it is called directly via `ckan.logic.get_action` and via `ckan.plugins.toolkit.get_action`.
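A sketch of such a decorator (illustrative, not the exact CKAN test helper; it assumes `unittest.mock`, which at the time was the standalone `mock` package):

```python
import functools
from unittest import mock

import ckan.logic


def mock_action(action_name):
    """Pass a MagicMock for `action_name` into the decorated test,
    while lookups of any other action keep working normally."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            mocked = mock.MagicMock()
            original_get_action = ckan.logic.get_action

            def side_effect(name):
                return mocked if name == action_name else original_get_action(name)

            # Patch both lookup paths: ckan.logic.get_action and
            # ckan.plugins.toolkit.get_action
            with mock.patch('ckan.logic.get_action', side_effect=side_effect), \
                    mock.patch('ckan.plugins.toolkit.get_action',
                               side_effect=side_effect):
                # The mock is passed as an extra argument; patching is
                # undone when the `with` block exits, even on exceptions
                return func(*(args + (mocked,)), **kwargs)
        return wrapper
    return decorator
```

Usage would look like `@mock_action('datapusher_submit')` on a test function, which then receives the mock as its last argument.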
Test basic behaviour and resubmissions if updates are performed during an ongoing job.
[#2856] Limit calls between DataPusher -> DataStore -> CKAN
@inghamn I'm not using SSL, just plain HTTP. And this only happens with .xls or .csv files.
I have also been able to trigger it without SSL, but using a CSV file in ISO 8859-1 encoding. When the file includes non-UTF-8 special characters, the infinite loop is triggered. If I convert the file to UTF-8 (converting or deleting any special characters), datapusher considers the file upload a success and does not retry. I speculate that this is also part of jobs.py: datapusher downloads the newly uploaded file to confirm everything worked. I believe CKAN converts the file to UTF-8 when it is received. Thus, when datapusher compares the newly downloaded file against the hash of the original ISO 8859-1 file, the hashes are different, and a retry is triggered.
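For illustration, whatever hash function is actually used, re-encoding the file changes its bytes, so the comparison can never succeed (a self-contained example, not DataPusher code):

```python
import hashlib

text = u'año café'                    # contains non-ASCII characters
latin1 = text.encode('iso-8859-1')    # bytes as originally uploaded
utf8 = text.encode('utf-8')           # bytes after a UTF-8 re-encode

# Same text, different bytes: any hash comparison fails
print(hashlib.md5(latin1).hexdigest() == hashlib.md5(utf8).hexdigest())  # False
```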
@inghamn uploaded a UTF-8 encoded CSV file without special chars and it still happens.
Perhaps you need to ignore package notifications that don't involve a change in resource.url? I recently wrote some code for that which you might wish to reuse. See:
I have the same problem in my CKAN instance. I'm using the latest CKAN v2.5.2, and when I finish adding a dataset with a CSV file, the datapusher begins to push this CSV "quasi" infinitely. Due to this, the CKAN log grows a lot and the resource preview remains inaccessible for a while. In my humble opinion, the logic of the datapusher extension is good; I think the "problem" is in the CKAN source code. Viewing the CKAN log and reading the CKAN base code, I've noticed an inconsistency in the way the "now" datetime is requested. In some places in the code, the "now" time is calculated using the datetime.utcnow function, and in other places using the datetime.now function. This means that, if your local datetime differs from UTC (for example, UTC-3 in my case), the "now" datetime differs depending on how it is requested. In particular, considering this issue, a resource is resubmitted to the datastore (at this line, if I'm not wrong) when the resource 'last_modified_datetime' is greater than the 'task_created_datetime'. When a resource is uploaded to a dataset, ckan/lib/uploader.py sets the resource['last_modified'] date using the datetime.utcnow function. To solve this particular issue, I've modified (and compiled) uploader.py at this line with the following:
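The code from this comment did not survive in the thread, but from the surrounding discussion the local workaround was presumably along these lines (a reconstruction, not the actual patch):

```python
# ckan/lib/uploader.py — local workaround (reconstructed from context;
# `resource` is the dict being processed by the resource uploader)
import datetime

# Before: resource['last_modified'] was set in UTC while the DataPusher
# task timestamps used local time, so on a UTC-3 server the resource
# always looked newer than the task and kept being resubmitted:
#
#     resource['last_modified'] = datetime.datetime.utcnow()
#
# Workaround: use local time so both timestamps agree:
resource['last_modified'] = datetime.datetime.now()
```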
With this change, the datapusher stops this odd behaviour. In my case, it works.
@FacundoAdorno thank you very much! That solved the issue. The only thing is that on the site the dataset still says "last modified: 6 hours ago" for every new upload. Maybe we should use datetime.now() instead of datetime.utcnow() on both sides?
Internally, times should always be in UTC, using `datetime.utcnow()`.
@TkTech does this mean the task_created date should be UTC, then?

```python
task = {
    'entity_id': res_id,
    'entity_type': 'resource',
    'task_type': 'datapusher',
    'last_updated': str(datetime.datetime.utcnow()),
    'state': 'submitting',
    'key': 'datapusher',
    'value': '{}',
    'error': '{}',
}
```
In a perfect world, yes: you don't want any internally stored timestamps to be tz-dependent. Even if it's user-generated, you store a UTC timestamp and the timezone separately.
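For example (illustrative field names only):

```python
import datetime

# Store the instant itself in UTC...
created_at_utc = datetime.datetime.utcnow()

# ...and keep the user's timezone as separate data (hypothetical field,
# e.g. an IANA zone name), so local time can be derived when displaying
user_timezone = 'America/Argentina/Buenos_Aires'
```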
@linuxitux @FacundoAdorno @inghamn @TkTech

```diff
diff --git a/ckanext/datapusher/logic/action.py b/ckanext/datapusher/logic/action.py
index 8ce97c2..5afb4b0 100644
--- a/ckanext/datapusher/logic/action.py
+++ b/ckanext/datapusher/logic/action.py
@@ -74,7 +74,7 @@ def datapusher_submit(context, data_dict):
         'entity_id': res_id,
         'entity_type': 'resource',
         'task_type': 'datapusher',
-        'last_updated': str(datetime.datetime.now()),
+        'last_updated': str(datetime.datetime.utcnow()),
         'state': 'submitting',
         'key': 'datapusher',
         'value': '{}',
@@ -119,7 +119,7 @@ def datapusher_submit(context, data_dict):
                  'details': str(e)}
         task['error'] = json.dumps(error)
         task['state'] = 'error'
-        task['last_updated'] = str(datetime.datetime.now()),
+        task['last_updated'] = str(datetime.datetime.utcnow()),
         p.toolkit.get_action('task_status_update')(context, task)
         raise p.toolkit.ValidationError(error)
@@ -134,7 +134,7 @@ def datapusher_submit(context, data_dict):
                  'status_code': r.status_code}
         task['error'] = json.dumps(error)
         task['state'] = 'error'
-        task['last_updated'] = str(datetime.datetime.now()),
+        task['last_updated'] = str(datetime.datetime.utcnow()),
         p.toolkit.get_action('task_status_update')(context, task)
         raise p.toolkit.ValidationError(error)
@@ -143,7 +143,7 @@ def datapusher_submit(context, data_dict):
     task['value'] = value
     task['state'] = 'pending'
-    task['last_updated'] = str(datetime.datetime.now()),
+    task['last_updated'] = str(datetime.datetime.utcnow()),
     p.toolkit.get_action('task_status_update')(context, task)

     return True
@@ -175,7 +175,7 @@ def datapusher_hook(context, data_dict):
     })

     task['state'] = status
-    task['last_updated'] = str(datetime.datetime.now())
+    task['last_updated'] = str(datetime.datetime.utcnow())

     resubmit = False
```

In any case, I also want to check the encoding issues you mention @inghamn, as I haven't been able to replicate them. Can you paste a link or DM one of the files that is failing for you? Thanks
Success! After updating action.py with the patch, I no longer get the infinite resubmission. I tried with my previous ISO-8859 file and do receive the character-set mismatch error, but there is no longer a resubmit attempt. I've attached my CSV files, just in case you still want a copy.
@inghamn great, I'll send a PR with the patch
@inghamn or feel free to do it yourself if you have the time!
When an upload to the DataPusher finishes, we check `resource['last_modified']` (in UTC) against `task['last_updated']` to see whether the resource changed during the previous upload and, if so, resubmit it. As `task['last_updated']` was created using `datetime.now()`, this could lead to infinite upload loops. Discussion is here: #2856 (comment)
PR here: #3051
[#2856] Create DataPusher task timestamps using UTC
Hi @amercader, we encountered the same issue while using CKAN 2.5.3 (source install). "Fetching", "deleting", "determining headers and types", "saving chunk", "successfully saved" repeat continuously (an endless loop). A screenshot: It is strange that the date returns -1. The `last_modified` field in package_show (from the Action API) looks like the following: However, if I open the details (from the screenshot above) I get the following: So, here the timestamp is correct. Could it be a problem with the date? Does anyone have a suggestion for solving it?
@amercader: No, unfortunately! We figured out the following:
Hi, we have the same problem with the endless loop, appearing very often when we try to delete a resource (other than the last one in the list of package resources) via the web interface. CKAN/DataPusher starts looping and the resource is never actually deleted. We are using CKAN 2.5.3 with #3331 applied, and DataPusher 0.0.10.
When you upload a CSV the DataPusher gets called to do its thing. On #2234 we added a couple of calls to `resource_patch` on `datastore_create` and `datastore_delete` to set the `datastore_active` extra on the resource. This is all very well, but this update on the resource triggers the DataPusher again, which pings `datastore_delete` and `datastore_create`, which trigger... Not sure how we can handle this on the `notify` extension point, as we only get the model object there.