
Add clearsource history command #268

Merged
merged 13 commits into ckan:master on Nov 23, 2016

Conversation

seitenbau-govdata
Member

Adds a clearsource history command / operation.


job_history_clear_results = []
# We assume that the maximum of 1000 (hard limit) rows should be enough
result = logic.get_action('package_search')(context, {'fq': '+type:"harvest"', 'rows': 1000})
Member

The Solr index field that holds the dataset type is called dataset_type, so this query does not return anything. You should change it to {'fq': '+dataset_type:harvest', 'rows': 1000}. Note that if you want to wrap the value in double quotes, we need to make this change in plugin.py:

diff --git a/ckanext/harvest/plugin.py b/ckanext/harvest/plugin.py
index 55af1c7..fa57103 100644
--- a/ckanext/harvest/plugin.py
+++ b/ckanext/harvest/plugin.py
@@ -92,7 +92,7 @@ class Harvest(p.SingletonPlugin, DefaultDatasetForm, DefaultTranslation):
         '''Prevents the harvesters being shown in dataset search results.'''

         fq = search_params.get('fq', '')
-        if 'dataset_type:harvest' not in fq:
+        if 'dataset_type:harvest' not in fq and 'dataset_type:"harvest"' not in fq:
             fq = u"{0} -dataset_type:harvest".format(search_params.get('fq', ''))
             search_params.update({'fq': fq})

Which is an excellent idea anyway.
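For illustration, the patched check behaves like this small, self-contained sketch (plain Python, no CKAN imports; `filter_harvest_sources` is a made-up name standing in for the plugin's `before_search` hook):

```python
def filter_harvest_sources(search_params):
    """Append -dataset_type:harvest to the filter query unless the
    caller already filters on dataset_type:harvest (quoted or not)."""
    fq = search_params.get('fq', '')
    if 'dataset_type:harvest' not in fq and 'dataset_type:"harvest"' not in fq:
        fq = u"{0} -dataset_type:harvest".format(search_params.get('fq', ''))
        search_params.update({'fq': fq})
    return search_params

# A plain search excludes harvest sources...
print(filter_harvest_sources({'fq': ''}))
# ...but an explicit dataset_type:"harvest" filter is left untouched.
print(filter_harvest_sources({'fq': '+dataset_type:"harvest"'}))
```

Note that the quoted form `dataset_type:"harvest"` does not contain the unquoted substring `dataset_type:harvest`, which is why the second condition in the diff is needed.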

Member Author

Thanks for the hint, amercader. The pull request is based on the branch "release-v2.0", and in that branch the additional filter query isn't included. We are currently upgrading the ckanext-harvest dependency of our custom CKAN extension to version "0.0.5". So I have modified the code according to the code in /ckanext/harvest/plugin.py and updated the pull request.

Member

@amercader amercader left a comment

This looks great @seitenbau-govdata. Check the comment about the filter query.
And also could you write a small test on test_action.py that covers the new functionality?

Thanks!

@amercader
Member

@seitenbau-govdata if you merge master to solve the conflicts and add a small test we can merge this one

raphaelstolt and others added 2 commits November 15, 2016 15:04
Changed filter query for reading harvest sources according to the code in /ckanext/harvest/plugin.py.
Contributor

@davidread davidread left a comment

Aside from the limit, it looks great and super useful.


job_history_clear_results = []
# We assume that the maximum of 1000 (hard limit) rows should be enough
result = logic.get_action('package_search')(context, {'fq': '+dataset_type:harvest', 'rows': 1000})
Contributor

Why add the limit? e.g. in data.gov.uk we have approximately 20,000 harvested datasets, so this would be no good.

Member

This is searching for harvest sources, not harvested datasets. It could potentially be an issue on larger instances but I think it's good enough for a first version.

Contributor

ah, good point @amercader. We have 400 of those. Still, why not remove the limit?

Member Author

The limit is set to 1000 because without defining the 'rows' parameter the default is only 10, and 1000 is the hard-coded maximum within package_search. To really get all harvest sources you would have to read the harvest source packages in blocks.
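Reading in blocks could look like the following sketch (plain Python; `fetch_all` and the stub are hypothetical, while the real lookup would go through `logic.get_action('package_search')` with its `rows`/`start` parameters):

```python
def fetch_all(search, fq, page_size=1000):
    """Page through search results using 'rows' and 'start' until the
    reported total 'count' is exhausted."""
    results = []
    start = 0
    while True:
        resp = search({'fq': fq, 'rows': page_size, 'start': start})
        results.extend(resp['results'])
        start += page_size
        if start >= resp['count']:
            break
    return results

# Stub emulating package_search's response shape, with 2500 sources,
# i.e. more than the 1000-row hard limit.
DATASETS = [{'id': 'source-%d' % i} for i in range(2500)]

def fake_package_search(data_dict):
    start, rows = data_dict['start'], data_dict['rows']
    return {'count': len(DATASETS), 'results': DATASETS[start:start + rows]}

sources = fetch_all(fake_package_search, '+dataset_type:harvest')
print(len(sources))  # 2500
```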

Contributor

Fair enough then - thanks for explaining. Perhaps add a note in the documentation about the limit, just in case?

Member Author

@davidread That's a good idea. I have added a note about the limit right now.

Fixed harvest_sources_job_history_clear test by creating different harvest sources.
Fix creating different harvest sources; different harvest sources can't be created with the factory.
Using a test-class-wide unique harvest source URL, because objects created in one test are still present in the following tests.
…clear

Ignoring non-existent harvest sources in harvest_sources_job_history_clear because of a possibly corrupt search index.
@seitenbau-govdata
Member Author

Does anybody have an idea why, with CKAN 2.2, there is only one harvest source in the result of the test "test_harvest_sources_job_history_clear" instead of two? This is the reason why the Travis CI build is failing on CKAN 2.2.
To me it seems like the Solr index isn't updated with the second harvest source yet when the action "harvest_sources_job_history_clear" reads the existing harvest sources with "package_search".

@davidread
Contributor

@seitenbau-govdata I did a little tidy of the use of test fixtures, including the one you added in test_harvest_sources_job_history_clear:

117f037

Perhaps you'd like to add that to your branch? Otherwise I can add it as a separate PR.

I created a PR anyway #271 to see if it happens to fix the ckan 2.2 problem.

BTW I had a quick look at the CKAN 2.2 test failure, but I couldn't see why it occurs. If @amercader can't see why either, then I guess someone needs to set up CKAN 2.2 and debug with that?

@davidread
Contributor

Interesting - I ran the tests again and they pass! Both for you and on the alternative PR with my test tidies.

Perhaps let me know if you're happy with the test tidies and we can choose which branch for @amercader to review.

…e" by copying the SOURCE_DICT each time, rather than letting tests edit the master copy.
@seitenbau-govdata
Member Author

@davidread Really strange that the tests are now successful. Your test tidies look good to me. Thanks!
I have added your commit to our branch, too. So it would be nice if @amercader could use our branch for the review.

@davidread
Contributor

Great, lgtm

Added a note about the limit of 1000 harvest sources
@amercader amercader merged commit 3836fcf into ckan:master Nov 23, 2016
@amercader
Member

Nice work @seitenbau-govdata, thanks for the PR and thanks @davidread for your help reviewing

@seitenbau-govdata seitenbau-govdata deleted the clearsource-history-command branch November 23, 2016 21:05
@davidread
Contributor

I was thinking this is a useful command to clear out unused info about old harvests - it builds up. But a problem has been pointed out - on the next harvest it'll create duplicate datasets, because there are no HarvestObjects any more to link up a source dataset's GUID to the local dataset. (This might not be the case for all harvesters - some may use another method to identify datasets.)

(I guess for the usage I'm thinking of, we should keep the most recent HarvestObject for every GUID, to prevent the duplicates.)
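That idea of keeping only the newest HarvestObject per GUID can be sketched like this (plain dicts stand in for HarvestObject rows; `latest_per_guid` is a hypothetical helper, not part of ckanext-harvest):

```python
def latest_per_guid(harvest_objects):
    """Return only the newest object for each GUID (by 'gathered'
    timestamp), so re-harvests can still match remote GUIDs to
    existing local datasets instead of creating duplicates."""
    latest = {}
    for obj in harvest_objects:
        guid = obj['guid']
        if guid not in latest or obj['gathered'] > latest[guid]['gathered']:
            latest[guid] = obj
    return list(latest.values())

objects = [
    {'id': 'a', 'guid': 'guid-1', 'gathered': '2016-01-01'},
    {'id': 'b', 'guid': 'guid-1', 'gathered': '2016-06-01'},
    {'id': 'c', 'guid': 'guid-2', 'gathered': '2016-03-01'},
]

keep_ids = sorted(o['id'] for o in latest_per_guid(objects))
print(keep_ids)  # ['b', 'c']
```

Everything not returned by such a helper would be safe to delete when clearing the history.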

So is this a bug, or maybe @seitenbau-govdata's harvester doesn't have this problem, or is there some other intended purpose for this command? It's just that leaving the harvest source implies you want to reharvest, so why keep the datasets if you're going to get duplicates?

At the least I've added a PR to warn about the duplicates problem:

@seitenbau-govdata
Member Author

Hi @davidread, sorry for the late response. Your assumption is right. We just need the paster command to get rid of the old harvest jobs, which over time reserve a big amount of data (in this case just waste) in the database.

Maybe you can show me the piece of code which causes the problems when the harvest object is no longer present?

We don't have the problem with duplicates after re-harvesting when all harvest jobs are deleted, but we only use the DCAT harvester from ckanext-dcat as a base. The only thing is that in the stats after the first re-harvest, all updated datasets are shown as "created".

As you suggested, I also think it's much better to preserve the current harvest jobs instead of deleting them all.
