
DataPusher called multiple times when creating a dataset #2856

Closed
amercader opened this issue Jan 28, 2016 · 31 comments
@amercader (Member) commented Jan 28, 2016

When you upload a CSV the DataPusher gets called to do its thing.
On #2234 we added a couple of calls to resource_patch on datastore_create and datastore_delete to set the datastore_active extra on the resource.

This is all very well but this update on the resource triggers again the DataPusher, which pings datastore_delete and datastore_create, which trigger...

Not sure how we can handle this on the notify extension point, as we only get the model object there.
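The feedback loop described above can be shown with a minimal sketch. All function names here are placeholders standing in for the real CKAN/DataPusher hooks, not actual CKAN code:

```python
# Minimal illustration of the feedback loop described in this issue.
# All names are placeholders, not the real CKAN/DataPusher code.

calls = {'submits': 0}

def notify(resource):
    # The notify extension point fires on every resource change, with no
    # way to tell a user edit apart from our own resource_patch below.
    datapusher_submit(resource)

def datapusher_submit(resource):
    calls['submits'] += 1
    datastore_create(resource)

def datastore_create(resource):
    # Since #2234 this sets the datastore_active extra on the resource...
    resource_patch(resource, {'datastore_active': True})

def resource_patch(resource, updates):
    resource.update(updates)
    notify(resource)  # ...which re-triggers the DataPusher, and so on

try:
    notify({'id': 'res-1'})
except RecursionError:
    print('loop detected after', calls['submits'], 'submissions')
```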

@amercader (Member Author) commented Jan 28, 2016

And for the record, before 2.5, DataPusher always got called 2 times when creating a dataset because the wonderful new_resource action from the package controller first calls resource_create and then package_update.

@amercader amercader self-assigned this Jan 28, 2016
@Xenros commented Mar 10, 2016

Hi, may I know if this problem has been solved? My CKAN (version 2.5.1) is installed on Ubuntu 12.10.
I have successfully installed DataPusher and DataStore, but sometimes when I upload a dataset, DataPusher gets called twice. This happens randomly: sometimes it pushes normally, but most of the time it pushes twice.

Just like amercader said, when I check the log, most of the API actions get called twice when I upload a dataset, such as datastore_create, datastore_delete, datastore_search, etc. Please let me know if there is any solution for this. Thank you

@amercader (Member Author) commented Mar 16, 2016

@TkTech I'm working on a patch for this, so grabbing it from you

amercader added a commit that referenced this issue Mar 17, 2016
The datapusher plugin uses the notify extension point to check if a
resource was created or updated. `datapusher_submit` will create a new
task with status pending when first called and `datapusher_hook` will
mark it as complete when necessary. We can use that to avoid resending
jobs unnecessarily.
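A simplified sketch of the guard this commit describes (the real patch uses CKAN's task_status table; these names are illustrative):

```python
# Illustrative sketch of the pending-task guard, not the actual patch.
task_status = {}  # resource_id -> job state

def datapusher_submit(resource_id):
    # Don't enqueue a new job while one is already in flight
    if task_status.get(resource_id) == 'pending':
        return False
    task_status[resource_id] = 'pending'
    # ... enqueue the job with the DataPusher service here ...
    return True

def datapusher_hook(resource_id, status):
    # Callback from the DataPusher service when the job finishes
    if status in ('complete', 'error'):
        task_status[resource_id] = status
```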
amercader added a commit that referenced this issue Mar 17, 2016
Rather than always updating the resource, only do it when it is actually
needed, i.e.:

* On datastore_create, when the resource doesn't already have a
  datastore_active=True extra

* On datastore_delete, only if the resource has datastore_active=False
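The idea in this commit can be sketched with a hypothetical helper (not the actual CKAN code):

```python
def set_datastore_active(resource, active, resource_patch):
    # Hypothetical helper: only patch when the flag actually changes,
    # so the patch itself cannot keep re-triggering the DataPusher.
    if resource.get('datastore_active') == active:
        return False
    resource_patch({'id': resource['id'], 'datastore_active': active})
    resource['datastore_active'] = active
    return True
```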
amercader added a commit that referenced this issue Mar 17, 2016
amercader added a commit to ckan/datapusher that referenced this issue Mar 17, 2016
Added a new `datastore_resource_exists` function.

Even though this does not reduce the number of requests, it helps with
ckan/ckan#2856
amercader added a commit that referenced this issue Mar 17, 2016
amercader added a commit that referenced this issue Mar 23, 2016
Previous commits prevent new DataPusher jobs from being created if there
is an ongoing one. There are cases though where we actually want to
resubmit, basically if the data source has changed since the first
DataPusher job started. This can be because a new file was uploaded or
because the URL was changed.

* For file uploads we check the `last_modified` resource property (which
  is set after an upload) against the task creation date.
* For URLs we check if the original URL is different from the current
  one.

We store both the task creation date and the original URL in the
metadata sent to (and returned from) the DataPusher.
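The two checks described in this commit can be sketched roughly as follows (field names such as `task['created']` and `task['original_url']` are illustrative):

```python
from datetime import datetime

def needs_resubmit(resource, task):
    # Illustrative sketch of the resubmission checks described above.
    # A new file was uploaded after the task was created...
    last_modified = resource.get('last_modified')
    if last_modified is not None and last_modified > task['created']:
        return True
    # ...or the URL now points somewhere else.
    return resource['url'] != task['original_url']
```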
amercader added a commit that referenced this issue Mar 24, 2016
amercader added a commit that referenced this issue Mar 25, 2016
amercader added a commit that referenced this issue Mar 25, 2016
Because of the way actions are accessed via `get_action`, it is a pain to mock
them. This decorator adds a mock object for the provided action as a parameter
to the test function. The mock is discarded at the end of the function, even
if there is an exception raised.

This mocks the action both when it's called directly via
`ckan.logic.get_action` and via `ckan.plugins.toolkit.get_action`
amercader added a commit that referenced this issue Mar 25, 2016
Test basic behaviour and resubmissions if updates are performed during
an ongoing job.
@wardi wardi closed this in #2921 Mar 30, 2016
wardi added a commit that referenced this issue Mar 30, 2016
[#2856] Limit calls between DataPusher -> DataStore -> CKAN
amercader added a commit that referenced this issue Mar 30, 2016
The datapusher plugin uses the notify extension point to check if a
resource was created or updated. `datapusher_submit` will create a new
task with status pending when first called and `datapusher_hook` will
mark it as complete when necessary. We can use that to avoid resending
jobs unnecessarily.
amercader added a commit that referenced this issue Mar 30, 2016
Rather than updating the resource always, just do it when is actually
needed, ie:

* On datastore_create, when the resource doesn't already has a
  datastore_active=True extra

* On datastore_delete, only if the resource has datastore_active=False
@inghamn commented May 18, 2016

I believe something in the datapusher upload process is altering the resource URL mid-process, causing a new upload process to start.

@inghamn commented May 19, 2016

It turns out, it's not the URL changing, but rather unexpected response codes when running CKAN behind SSL. This might be something for the datapusher project, but the issue is documented here.

We also have the InsecurePlatformWarning when enabling SSL for CKAN.
ckan/datapusher#83

The SSL warning causes a response that this CKAN patch considers a failure state, resulting in infinite reattempts. DataPusher does successfully clear and save the resource in the datastore. However, under SSL, the response is not the hard-coded, expected 200 or 201.

Should we expand the range of what is considered a success response?
The code is in /datapusher/jobs.py

Is there a way we can add testing for this situation, using SSL-only and a self-signed certificate for CKAN?

@linuxitux commented May 19, 2016

@inghamn I'm not using SSL, just plain HTTP. And this only happens with .xls or .csv files.

@inghamn commented May 19, 2016

I have also been able to trigger it without SSL, but using a CSV file in ISO-8859-1 encoding. When the file includes non-UTF-8 special characters, the infinite loop is triggered. If I convert the file to UTF-8 (converting or deleting any special characters), DataPusher considers the upload a success and does not retry.

I speculate that this is also part of jobs.py. DataPusher tries to download the newly uploaded file to confirm everything worked. I believe CKAN converts the file to UTF-8 when it is received. Thus, when DataPusher compares the hash of the newly downloaded file with the hash of the original ISO-8859-1 file, the hashes are different and a retry is triggered.
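The re-encoding hypothesis is easy to demonstrate in isolation. This only shows that transcoding changes the bytes (and thus any hash over them), not that CKAN actually performs the conversion:

```python
import hashlib

# The original file content, in ISO-8859-1 (Latin-1)
original = 'caf\u00e9,1\n'.encode('iso-8859-1')
# What you would get back if the content were re-served as UTF-8
reencoded = original.decode('iso-8859-1').encode('utf-8')

h1 = hashlib.md5(original).hexdigest()
h2 = hashlib.md5(reencoded).hexdigest()
# The bytes differ, so the hashes differ and a retry would be triggered
print(h1 == h2)
```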

@linuxitux commented May 19, 2016

@inghamn I uploaded a UTF-8 encoded CSV file without special characters and it still happens.

@davidread (Contributor) commented May 19, 2016

Perhaps you need to ignore package notifications that don't involve a change in resource.url? I recently wrote some code for that which you might wish to reuse. See:

https://github.com/ckan/ckanext-archiver/pull/28/files

@FacundoAdorno commented May 19, 2016

I have the same problem in my CKAN instance. I'm using the latest CKAN v2.5.2, and when I finish adding a dataset with a CSV file, the DataPusher starts pushing this CSV almost infinitely. Because of this, the CKAN log grows a lot and the resource preview remains inaccessible for a while.

In my humble opinion, the logic of the datapusher extension is fine; the "problem" is in the CKAN source code. Looking at the CKAN log and the CKAN base code, I've noticed an inconsistency in the way the "now" datetime is obtained. In some places the "now" time is calculated using the datetime.utcnow function, and in other places using the datetime.now function. This means that if your local time is different from UTC (for example, UTC-3 in my case), the "now" datetime differs depending on how it is requested.

In particular, considering this issue, a resource is resubmitted to the datastore (at this line, if I'm not wrong) when the resource 'last_modified_datetime' is greater than the 'task_created_datetime'. When a resource is uploaded to a dataset, ckan/lib/uploader.py sets resource['last_modified'] using datetime.utcnow (at this line). When a datapusher task is created and submitted, it sends the 'task_created' date in its data, using the task['last_updated'] date as the value; this date is initialized using datetime.now (by default in task_status.py, or explicitly at this line). When comparing these dates, the DataPusher keeps pushing the resource to the datastore until 'last_modified_datetime' becomes less than 'task_created_datetime'.

To work around this particular issue, I've modified uploader.py at this line with the following:

resource['last_modified'] = datetime.datetime.now()

With this change, the DataPusher stops this odd behaviour. In my case, it works.
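The effect described above can be reproduced with fixed values, assuming a machine at UTC-3 (the clocks here are frozen for determinism):

```python
from datetime import datetime, timedelta

# Freeze the two clocks at the same real instant on a UTC-3 machine
utc_clock = datetime(2016, 5, 19, 12, 0, 0)    # what datetime.utcnow() returns
local_clock = utc_clock - timedelta(hours=3)   # what datetime.now() returns

last_modified = utc_clock    # uploader.py used utcnow()
task_created = local_clock   # the DataPusher task used now()

# Even though nothing changed, last_modified looks newer than the task,
# so the resource keeps being resubmitted for ~3 hours of wall time
print(last_modified > task_created)
```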

@linuxitux commented May 19, 2016

@FacundoAdorno thank you very much! That solved the issue. The only thing is that on the site the dataset still says "last modified: 6 hours ago" for every new upload. Maybe we should use datetime.now() instead of datetime.utcnow() on both sides?

@TkTech (Member) commented May 19, 2016

Internally, times should always be in UTC using utcnow(); the template should be doing the conversion to the local tz.

@inghamn commented May 20, 2016

@TkTech does this mean the task_created data should be UTC, then?

    task = {
        'entity_id': res_id,
        'entity_type': 'resource',
        'task_type': 'datapusher',
        'last_updated': str(datetime.datetime.utcnow()),
        'state': 'submitting',
        'key': 'datapusher',
        'value': '{}',
        'error': '{}',
    }

@TkTech (Member) commented May 20, 2016

In a perfect world, yes: you don't want any internally stored timestamps to be tz-dependent. Even if it's user-generated, you store a UTC timestamp and the timezone separately.
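A minimal sketch of that convention, using Python's timezone-aware datetimes (the specific times are made up for illustration):

```python
from datetime import datetime, timezone, timedelta

# Store the timestamp in UTC; keep the user's zone separately
created_utc = datetime(2016, 5, 20, 15, 0, tzinfo=timezone.utc)
user_tz = timezone(timedelta(hours=-3))  # e.g. the uploader's zone

# The template layer converts for display only
display = created_utc.astimezone(user_tz)
print(display.isoformat())
```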

@amercader (Member Author) commented May 24, 2016

@linuxitux @FacundoAdorno @inghamn @TkTech
Apologies for the late reply on this. Without having investigated it in depth, the UTC timestamp discrepancy seems to be the most likely culprit. We should really be using UTC everywhere internally, so rather than @FacundoAdorno's approach of changing the resource last_modified date to use .now(), I'd like to try using .utcnow() in the datapusher's action.py. Can you check if the patch below fixes the issue when uploading new resources?

diff --git a/ckanext/datapusher/logic/action.py b/ckanext/datapusher/logic/action.py
index 8ce97c2..5afb4b0 100644
--- a/ckanext/datapusher/logic/action.py
+++ b/ckanext/datapusher/logic/action.py
@@ -74,7 +74,7 @@ def datapusher_submit(context, data_dict):
         'entity_id': res_id,
         'entity_type': 'resource',
         'task_type': 'datapusher',
-        'last_updated': str(datetime.datetime.now()),
+        'last_updated': str(datetime.datetime.utcnow()),
         'state': 'submitting',
         'key': 'datapusher',
         'value': '{}',
@@ -119,7 +119,7 @@ def datapusher_submit(context, data_dict):
                  'details': str(e)}
         task['error'] = json.dumps(error)
         task['state'] = 'error'
-        task['last_updated'] = str(datetime.datetime.now()),
+        task['last_updated'] = str(datetime.datetime.utcnow()),
         p.toolkit.get_action('task_status_update')(context, task)
         raise p.toolkit.ValidationError(error)

@@ -134,7 +134,7 @@ def datapusher_submit(context, data_dict):
                  'status_code': r.status_code}
         task['error'] = json.dumps(error)
         task['state'] = 'error'
-        task['last_updated'] = str(datetime.datetime.now()),
+        task['last_updated'] = str(datetime.datetime.utcnow()),
         p.toolkit.get_action('task_status_update')(context, task)
         raise p.toolkit.ValidationError(error)

@@ -143,7 +143,7 @@ def datapusher_submit(context, data_dict):

     task['value'] = value
     task['state'] = 'pending'
-    task['last_updated'] = str(datetime.datetime.now()),
+    task['last_updated'] = str(datetime.datetime.utcnow()),
     p.toolkit.get_action('task_status_update')(context, task)

     return True
@@ -175,7 +175,7 @@ def datapusher_hook(context, data_dict):
     })

     task['state'] = status
-    task['last_updated'] = str(datetime.datetime.now())
+    task['last_updated'] = str(datetime.datetime.utcnow())

     resubmit = False

In any case I also want to check the encoding issues you mention, @inghamn, as I haven't been able to replicate them. Can you paste a link or DM one of the files that is failing for you? Thanks

@inghamn commented May 24, 2016

Success! After updating action.py with the patch, I no longer get the infinite resubmission.

I have tried with my previous ISO-8859 file and do receive the character set mismatch error, but there is no longer a resubmit attempt. I've attached my CSV files, just in case you still want a copy.

ckan.zip

@amercader (Member Author) commented May 24, 2016

@inghamn great, I'll send a PR with the patch

@amercader (Member Author) commented May 24, 2016

@inghamn or feel free to do it yourself if you have the time!

@amercader amercader reopened this May 25, 2016
amercader added a commit that referenced this issue May 25, 2016
When an upload to the DataPusher finishes we check
`resource['last_modified']` (in UTC) against `task['last_updated']`
to check whether it changed during the previous upload and resubmit it.
As `task['last_updated']` was created using `datetime.now()` this could
lead to infinite upload loops.

Discussion is here:

#2856 (comment)
@amercader (Member Author) commented May 25, 2016

PR here #3051

davidread added a commit that referenced this issue May 25, 2016
[#2856] Create DataPusher task timestamps using UTC
vasili4 added a commit to GSA/ckan that referenced this issue May 31, 2016
When an upload to the DataPusher finishes we check
`resource['last_modified']` (in UTC) against `task['last_updated']`
to check whether it changed during the previous upload and resubmit it.
As `task['last_updated']` was created using `datetime.now()` this could
lead to infinite upload loops.

Discussion is here:

ckan#2856 (comment)
amercader added a commit that referenced this issue Aug 26, 2016
When an upload to the DataPusher finishes we check
`resource['last_modified']` (in UTC) against `task['last_updated']`
to check whether it changed during the previous upload and resubmit it.
As `task['last_updated']` was created using `datetime.now()` this could
lead to infinite upload loops.

Discussion is here:

#2856 (comment)
@amercader amercader closed this Aug 26, 2016
@m1h commented Nov 21, 2016

Hi @amercader,

we encountered the same issue while using CKAN 2.5.3 (source install). "Fetching", "deleting", "determining headers and types", "saving chunk", "successfully saved" repeat continuously (endless loop). A screenshot:

[screenshot: ckan-error]

It is strange that the date returns -1. The 'last_modified' field in package_show (from the Action API) looks like the following:
"last_modified": "2016-11-21T17:16:11.247765"

However, if I open the details (from the screenshot above) I get the following:

[screenshot: ckan_error_info]

So, here the timestamp is correct. Could it be a problem with the date? Does anyone have any suggestion for solving it?

@wardi (Contributor) commented Dec 7, 2016

@m1h try the fix in #3331

@amercader (Member Author) commented Jan 3, 2017

@m1h did #3331 help?

@m1h commented Jan 11, 2017

@amercader : No, unfortunately!

We figured out the following:
The first time a resource is uploaded to the datastore, it works. So when a new resource is created, it is automatically uploaded to the datastore, which is perfect. However, when one re-uploads it to the datastore (via a button click in the web frontend), it results in an endless loop.

gvonness pushed a commit to datagovau/ckan that referenced this issue Feb 9, 2017
When an upload to the DataPusher finishes we check
`resource['last_modified']` (in UTC) against `task['last_updated']`
to check whether it changed during the previous upload and resubmit it.
As `task['last_updated']` was created using `datetime.now()` this could
lead to infinite upload loops.

Discussion is here:

ckan#2856 (comment)
@janaslo (Contributor) commented Mar 7, 2017

Hi, we have the same problem with the endless loop, which appears very often when we try to delete a resource (other than the last one in the list of package resources) via the web interface. CKAN/DataPusher starts looping and the resource is never actually deleted.

We are using CKAN 2.5.3 with #3331 applied, and DataPusher 0.0.10.
