
DataPusher called multiple times when creating a dataset #2856

Closed
amercader opened this issue Jan 28, 2016 · 31 comments

@amercader
Member

commented Jan 28, 2016

When you upload a CSV the DataPusher gets called to do its thing.
On #2234 we added a couple of calls to resource_patch on datastore_create and datastore_delete to set the datastore_active extra on the resource.

This is all very well, but this update on the resource triggers the DataPusher again, which pings datastore_delete and datastore_create, which trigger...

Not sure how we can handle this on the notify extension point, as we only get the model object there.
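The shape of the problem can be sketched without CKAN: the notify hook only receives the model object and an operation, so the plugin cannot tell a user edit apart from the DataPusher's own resource_patch. Everything below (`Resource`, `SUBMITTED`, the format check) is an illustrative stand-in, not actual CKAN code:

```python
# Sketch of the notify extension point feedback loop. The plugin only
# sees the model object and the operation, so it cannot distinguish a
# user edit from the resource_patch that the DataPusher itself issues.

class Resource:
    """Stand-in for ckan.model.Resource."""
    def __init__(self, id, format):
        self.id = id
        self.format = format

SUBMITTED = []

def notify(entity, operation):
    # operation is 'new' or 'changed'; on either one the plugin submits
    # the resource to the DataPusher -- which later patches the same
    # resource, firing notify() again.
    if isinstance(entity, Resource) and entity.format.lower() == 'csv':
        SUBMITTED.append((entity.id, operation))

notify(Resource('abc', 'CSV'), 'new')       # initial upload
notify(Resource('abc', 'CSV'), 'changed')   # datastore_active patch re-triggers it
```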

@amercader

Member Author

commented Jan 28, 2016

And for the record, before 2.5 the DataPusher always got called twice when creating a dataset, because the wonderful new_resource action from the package controller first calls resource_create and then package_update.

@amercader amercader self-assigned this Jan 28, 2016

@Xenros


commented Mar 10, 2016

Hi, may I know if this problem is solved? My CKAN (version 2.5.1) is installed on Ubuntu 12.10.
I have successfully installed DataPusher and the DataStore, but sometimes when I upload a dataset the DataPusher gets called twice. This happens randomly: sometimes it pushes normally, but most of the time it pushes twice.

Just like amercader said, when I check the log most of the API actions get called twice when I upload a dataset, such as datastore_create, datastore_delete, datastore_search etc. Please let me know if there is a solution for this. Thank you

@amercader

Member Author

commented Mar 16, 2016

@TkTech I'm working on a patch for this, so grabbing it from you

amercader added a commit that referenced this issue Mar 17, 2016

[#2856] Avoid submitting resources to the DataPusher multiple times
The datapusher plugin uses the notify extension point to check if a
resource was created or updated. `datapusher_submit` will create a new
task with status pending when first called and `datapusher_hook` will
mark it as complete when necessary. We can use that to avoid resending
jobs unnecessarily.
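A minimal sketch of that guard, with a plain dict standing in for the task status table and hypothetical helper names (`maybe_submit`, `task_store`):

```python
# Sketch of the resubmission guard: skip submitting while a job for the
# same resource is still in flight. datapusher_hook would later set the
# task state to 'complete', re-enabling submissions.

PENDING_STATES = ('pending', 'submitting')

def maybe_submit(task_store, resource_id, submit):
    """Submit the resource unless an existing task is still pending."""
    existing = task_store.get(resource_id)
    if existing and existing['state'] in PENDING_STATES:
        return False          # a job is already running; don't resend
    task_store[resource_id] = {'state': 'pending'}
    submit(resource_id)
    return True

sent = []
tasks = {}
assert maybe_submit(tasks, 'res-1', sent.append) is True   # first call submits
assert maybe_submit(tasks, 'res-1', sent.append) is False  # re-notify is ignored
tasks['res-1']['state'] = 'complete'                       # hook marks it done
assert maybe_submit(tasks, 'res-1', sent.append) is True   # later changes resubmit
```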

amercader added a commit that referenced this issue Mar 17, 2016

[#2856] Only set datastore_active when necessary
Rather than always updating the resource, only do it when it is actually needed, i.e.:

* On datastore_create, when the resource doesn't already have a
  datastore_active=True extra

* On datastore_delete, only if the resource has datastore_active=False
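In code the check amounts to something like this (a sketch; the dict and `patch_calls` list are stand-ins for the resource and the resource_patch call):

```python
# Sketch: only patch the resource when the flag actually changes, so
# the update (and the notify event it fires) is skipped otherwise.

def set_datastore_active(resource, active, patch_calls):
    if resource.get('datastore_active') == active:
        return                      # nothing to do: no patch, no notify loop
    resource['datastore_active'] = active
    patch_calls.append(resource['id'])

calls = []
res = {'id': 'res-1'}
set_datastore_active(res, True, calls)   # datastore_create: flag not set yet
set_datastore_active(res, True, calls)   # second create: already True, skipped
set_datastore_active(res, False, calls)  # datastore_delete: flips it once
```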

amercader added a commit that referenced this issue Mar 17, 2016

amercader added a commit to ckan/datapusher that referenced this issue Mar 17, 2016

Only delete existing DataStore resource if it exists
Added a new `datastore_resource_exists` function.

Even though this does not reduce the number of requests, it helps with
ckan/ckan#2856

amercader added a commit that referenced this issue Mar 17, 2016

amercader added a commit that referenced this issue Mar 23, 2016

[#2856] Resubmit to DataPusher if file is reuploaded or URL changes
Previous commits prevent new DataPusher jobs from being created if there
is an ongoing one. There are cases though where we actually want to
resubmit, basically if the data source has changed since the first
DataPusher job started. This can be because a new file was uploaded or
because the URL was changed.

* For file uploads we check the `last_modified` resource property (which
  is set after an upload) against the task creation date.
* For URLs we check if the original URL is different from the current
  one.

We store both the task creation date and the original URL in the
metadata sent to (and returned from) the DataPusher.
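The two checks can be sketched as follows (simplified; `needs_resubmit` and the field names `created`/`original_url` are hypothetical stand-ins for the task metadata described above):

```python
# Sketch of the resubmit decision: only resend if the data source
# changed after the current DataPusher job started.
from datetime import datetime

def needs_resubmit(resource, task):
    # File upload: resource['last_modified'] is set on upload; compare
    # it against the task creation date kept in the task metadata.
    last_modified = resource.get('last_modified')
    if last_modified and last_modified > task['created']:
        return True
    # URL change: compare the URL stored with the task against the
    # resource's current URL.
    return resource['url'] != task['original_url']

task = {'created': datetime(2016, 3, 23, 10, 0),
        'original_url': 'http://x/data.csv'}
res = {'url': 'http://x/data.csv',
       'last_modified': datetime(2016, 3, 23, 9, 0)}
assert not needs_resubmit(res, task)                  # nothing changed
res['last_modified'] = datetime(2016, 3, 23, 11, 0)   # file re-uploaded mid-job
assert needs_resubmit(res, task)
```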

amercader added a commit that referenced this issue Mar 23, 2016

amercader added a commit that referenced this issue Mar 24, 2016

amercader added a commit that referenced this issue Mar 25, 2016

amercader added a commit that referenced this issue Mar 25, 2016

[#2856] Add decorator to mock a CKAN action in tests
Because of the way actions are accessed via `get_action`, it is a pain to mock
them. This decorator adds a mock object for the provided action as a parameter
to the test function. The mock is discarded at the end of the function, even
if there is an exception raised.

This mocks the action both when it's called directly via
`ckan.logic.get_action` and via `ckan.plugins.toolkit.get_action`

amercader added a commit that referenced this issue Mar 25, 2016

[#2856] Added tests for DataPusher submissions
Test basic behaviour and resubmissions if updates are performed during
an ongoing job.

@wardi wardi closed this in #2921 Mar 30, 2016

wardi added a commit that referenced this issue Mar 30, 2016

Merge pull request #2921 from ckan/2856-fix-datapusher-multiple-calls
[#2856] Limit calls between DataPusher -> DataStore -> CKAN

amercader added a commit that referenced this issue Mar 30, 2016

[#2856] Avoid submitting resources to the DataPusher multiple times
The datapusher plugin uses the notify extension point to check if a
resource was created or updated. `datapusher_submit` will create a new
task with status pending when first called and `datapusher_hook` will
mark it as complete when necessary. We can use that to avoid resending
jobs unnecessarily.

amercader added a commit that referenced this issue Mar 30, 2016

[#2856] Only set datastore_active when necessary
Rather than always updating the resource, only do it when it is actually needed, i.e.:

* On datastore_create, when the resource doesn't already have a
  datastore_active=True extra

* On datastore_delete, only if the resource has datastore_active=False
@linuxitux


commented May 19, 2016

@inghamn I'm not using SSL, just plain HTTP. And this only happens with .xls or .csv files.

@inghamn


commented May 19, 2016

I have also been able to trigger it without SSL, but using a CSV file in ISO 8859-1 encoding. When the file includes non-UTF-8 special characters, the infinite loop is triggered. If I convert the file to UTF-8 encoding (converting or deleting any special characters), datapusher considers the file upload a success and does not retry.

I speculate that this is also part of jobs.py. Datapusher tries to download the newly uploaded file to confirm everything worked. I believe CKAN is converting the file to UTF-8 when the file is received. Thus, when datapusher compares the hash of the newly downloaded file with the hash of the original ISO 8859-1 file, the hashes are different, and a retry is triggered.
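That speculation is easy to illustrate: hashing the same text in ISO 8859-1 and in UTF-8 gives different digests as soon as a non-ASCII character is present:

```python
# Same text, two encodings: the bytes (and therefore the hashes)
# differ as soon as a non-ASCII character is involved.
import hashlib

text = "café"  # one non-ASCII character is enough

latin1_hash = hashlib.md5(text.encode("iso-8859-1")).hexdigest()
utf8_hash = hashlib.md5(text.encode("utf-8")).hexdigest()
assert latin1_hash != utf8_hash   # re-encoded file no longer matches

ascii_only = "cafe"               # pure ASCII encodes identically
assert (hashlib.md5(ascii_only.encode("iso-8859-1")).hexdigest()
        == hashlib.md5(ascii_only.encode("utf-8")).hexdigest())
```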

@linuxitux


commented May 19, 2016

@inghamn I uploaded a UTF-8 encoded CSV file without special chars and it still happens.

@davidread

Contributor

commented May 19, 2016

Perhaps you need to ignore package notifications that don't involve a change in resource.url? I recently wrote some code for that which you might wish to reuse. See:

https://github.com/ckan/ckanext-archiver/pull/28/files

@FacundoAdorno


commented May 19, 2016

I have the same problem in my CKAN instance. I'm using the latest CKAN v2.5.2, and when I finish adding a dataset with a CSV file, the datapusher begins to push this CSV almost infinitely. Because of this, the CKAN log grows a lot and the resource preview remains inaccessible for a while.

In my humble opinion, the logic of the datapusher extension is fine. I think the problem is in the CKAN source code. Looking at the CKAN log and reading the CKAN code base, I've noticed an inconsistency in the way the "now" datetime is obtained. In some places the "now" time is calculated using the datetime.utcnow function, and in other places using the datetime.now function. As a result, if your local time differs from UTC (for example, UTC-3 in my case), the "now" datetime differs depending on how it is requested.

In particular, considering this issue, a resource is resubmitted to the datastore (at this line, if I'm not wrong) when the resource's 'last_modified_datetime' is greater than the 'task_created_datetime'. When a resource is uploaded to a dataset, ckan/lib/uploader.py sets resource['last_modified'] using datetime.utcnow (at this line). When a datapusher task is created and submitted, it sends the 'task_created' date in its data, using the task['last_updated'] date as value; this date is initialized using datetime.now (by default in task_status.py, or explicitly at this line). When comparing these dates, the datapusher keeps pushing the resource to the datastore until 'last_modified_datetime' is less than 'task_created_datetime'.
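The mismatch is easy to reproduce with fixed timestamps. Assuming a server at UTC-3, a resource stamped with utcnow() looks newer than a task stamped with now() for a full three hours, so the resubmit check keeps firing:

```python
# Simulate a server running at UTC-3 to show why mixing utcnow() and
# now() makes the resubmit check fire for hours after an upload.
from datetime import datetime, timedelta

utc_now = datetime(2016, 5, 19, 12, 0, 0)     # what utcnow() returns
local_now = utc_now - timedelta(hours=3)      # what now() returns at UTC-3

last_modified = utc_now    # uploader.py stamps the resource with utcnow()
task_created = local_now   # the task is stamped with now()

# The resubmit check asks "did the resource change after the task was
# created?" -- with mixed clocks the answer is wrongly yes for 3 hours.
assert last_modified > task_created
```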

To solve this particular issue, I've modified uploader.py at this line with the following:

resource['last_modified'] = datetime.datetime.now()

With this change, the datapusher stops this odd behaviour. In my case, it works.

@linuxitux


commented May 19, 2016

@FacundoAdorno thank you very much! That solved the issue. The only thing is that on the site the dataset still says "last modified: 6 hours ago" for every new upload. Maybe we should use datetime.now() instead of datetime.utcnow() on both sides?

@TkTech

Member

commented May 19, 2016

Internally, times should always be in UTC using utcnow(); the template should be doing the conversion to the local tz.
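A minimal sketch of that convention with timezone-aware datetimes (the UTC-3 offset is purely illustrative):

```python
# Store timestamps in UTC; convert to the viewer's local timezone only
# at render time. Both values denote the same instant.
from datetime import datetime, timedelta, timezone

stored = datetime.now(timezone.utc)           # what gets persisted

local_tz = timezone(timedelta(hours=-3))      # e.g. UTC-3, illustrative
displayed = stored.astimezone(local_tz)       # what the template shows

assert displayed == stored                    # same instant, different display
assert displayed.utcoffset() == timedelta(hours=-3)
```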

@inghamn


commented May 20, 2016

@TkTech does this mean the task_created data should be UTC, then?

    task = {
        'entity_id': res_id,
        'entity_type': 'resource',
        'task_type': 'datapusher',
        'last_updated': str(datetime.datetime.utcnow()),
        'state': 'submitting',
        'key': 'datapusher',
        'value': '{}',
        'error': '{}',
    }
@TkTech

Member

commented May 20, 2016

In a perfect world, yes; you don't want any internally stored timestamps to be tz-dependent. Even if it's user-generated, you store a UTC timestamp and the timezone separately.

@amercader

Member Author

commented May 24, 2016

@linuxitux @FacundoAdorno @inghamn @TkTech
Apologies for the late reply on this. Without having investigated it in depth, the UTC timestamp discrepancy seems to be the most likely culprit. We should really be using UTC everywhere internally, so rather than @FacundoAdorno's approach of changing the resource last_modified date to use .now(), I'd like to try using .utcnow() in the datapusher's action.py. Can you check if the patch below fixes the issue when uploading new resources?

diff --git a/ckanext/datapusher/logic/action.py b/ckanext/datapusher/logic/action.py
index 8ce97c2..5afb4b0 100644
--- a/ckanext/datapusher/logic/action.py
+++ b/ckanext/datapusher/logic/action.py
@@ -74,7 +74,7 @@ def datapusher_submit(context, data_dict):
         'entity_id': res_id,
         'entity_type': 'resource',
         'task_type': 'datapusher',
-        'last_updated': str(datetime.datetime.now()),
+        'last_updated': str(datetime.datetime.utcnow()),
         'state': 'submitting',
         'key': 'datapusher',
         'value': '{}',
@@ -119,7 +119,7 @@ def datapusher_submit(context, data_dict):
                  'details': str(e)}
         task['error'] = json.dumps(error)
         task['state'] = 'error'
-        task['last_updated'] = str(datetime.datetime.now()),
+        task['last_updated'] = str(datetime.datetime.utcnow()),
         p.toolkit.get_action('task_status_update')(context, task)
         raise p.toolkit.ValidationError(error)

@@ -134,7 +134,7 @@ def datapusher_submit(context, data_dict):
                  'status_code': r.status_code}
         task['error'] = json.dumps(error)
         task['state'] = 'error'
-        task['last_updated'] = str(datetime.datetime.now()),
+        task['last_updated'] = str(datetime.datetime.utcnow()),
         p.toolkit.get_action('task_status_update')(context, task)
         raise p.toolkit.ValidationError(error)

@@ -143,7 +143,7 @@ def datapusher_submit(context, data_dict):

     task['value'] = value
     task['state'] = 'pending'
-    task['last_updated'] = str(datetime.datetime.now()),
+    task['last_updated'] = str(datetime.datetime.utcnow()),
     p.toolkit.get_action('task_status_update')(context, task)

     return True
@@ -175,7 +175,7 @@ def datapusher_hook(context, data_dict):
     })

     task['state'] = status
-    task['last_updated'] = str(datetime.datetime.now())
+    task['last_updated'] = str(datetime.datetime.utcnow())

     resubmit = False

In any case I also want to check the encoding issues you mention, @inghamn, as I haven't been able to replicate them. Can you paste a link or DM me one of the files that is failing for you? Thanks

@inghamn


commented May 24, 2016

Success! After updating action.py with the patch, I am no longer seeing the infinite resubmission.

I have tried with my previous ISO-8859 file and do receive the character set mismatch error, but there is no longer a resubmit attempt. I've attached my CSV files, just in case you still want a copy.

ckan.zip

@amercader

Member Author

commented May 24, 2016

@inghamn great, I'll send a PR with the patch

@amercader

Member Author

commented May 24, 2016

@inghamn or feel free to do it yourself if you have the time!

@amercader amercader reopened this May 25, 2016

amercader added a commit that referenced this issue May 25, 2016

[#2856] Create DataPusher task timestamps using UTC
When an upload to the DataPusher finishes we check
`resource['last_modified']` (in UTC) against `task['last_updated']`
to check whether it changed during the previous upload and resubmit it.
As `task['last_updated']` was created using `datetime.now()` this could
lead to infinite upload loops.

Discussion is here:

#2856 (comment)
@amercader

Member Author

commented May 25, 2016

PR here #3051

davidread added a commit that referenced this issue May 25, 2016

Merge pull request #3051 from ckan/2856-datapusher-tasks-utc
[#2856] Create DataPusher task timestamps using UTC

alex-perfilov-reisys added a commit to GSA/ckan that referenced this issue May 31, 2016

[ckan#2856] Create DataPusher task timestamps using UTC
When an upload to the DataPusher finishes we check
`resource['last_modified']` (in UTC) against `task['last_updated']`
to check whether it changed during the previous upload and resubmit it.
As `task['last_updated']` was created using `datetime.now()` this could
lead to infinite upload loops.

Discussion is here:

ckan#2856 (comment)

amercader added a commit that referenced this issue Aug 26, 2016

[#2856] Create DataPusher task timestamps using UTC
When an upload to the DataPusher finishes we check
`resource['last_modified']` (in UTC) against `task['last_updated']`
to check whether it changed during the previous upload and resubmit it.
As `task['last_updated']` was created using `datetime.now()` this could
lead to infinite upload loops.

Discussion is here:

#2856 (comment)

@amercader amercader closed this Aug 26, 2016

@m1h


commented Nov 21, 2016

Hi @amercader,

we encountered the same issue while using CKAN 2.5.3 (source install). "Fetching", "deleting", "determining headers and types", "saving chunk", "successfully saved" repeat continuously (endless loop). A screenshot:

[screenshot: ckan-error]

It is strange that the date shows -1. The 'last_modified' field in package_show (from the Action API) looks like the following:
"last_modified": "2016-11-21T17:16:11.247765"

However, if I open the details (from the screenshot above) I get the following:

[screenshot: ckan_error_info]

So here the timestamp is correct. Could it be a problem with the date? Does anyone have a suggestion for solving it?

@wardi

Contributor

commented Dec 7, 2016

@m1h try the fix in #3331

@amercader

Member Author

commented Jan 3, 2017

@m1h did #3331 help?

@m1h


commented Jan 11, 2017

@amercader: No, unfortunately!

We figured out the following:
The first time a resource is uploaded to the datastore, it works. So when a new resource is created, it is automatically uploaded to the datastore - perfect. However, when one re-uploads it to the datastore (via a button click in the web frontend), it results in an endless loop.

gvonness pushed a commit to datagovau/ckan that referenced this issue Feb 9, 2017

[ckan#2856] Create DataPusher task timestamps using UTC
When an upload to the DataPusher finishes we check
`resource['last_modified']` (in UTC) against `task['last_updated']`
to check whether it changed during the previous upload and resubmit it.
As `task['last_updated']` was created using `datetime.now()` this could
lead to infinite upload loops.

Discussion is here:

ckan#2856 (comment)
@janaslo

Contributor

commented Mar 7, 2017

Hi, we have the same problem with the endless loop, appearing very often when we try to delete a resource (other than the last one in the list of package resources) via the web interface. CKAN/DataPusher starts looping and the resource is never actually deleted.

We are using CKAN 2.5.3 with #3331 applied, and DataPusher 0.0.10.
