
Add clearsource history command #268

Merged
merged 13 commits into ckan:master on Nov 23, 2016

Conversation

seitenbau-govdata
Member

Adds a clearsource history command / operation.


job_history_clear_results = []
# We assume that the maximum of 1000 (hard limit) rows should be enough
result = logic.get_action('package_search')(context, {'fq': '+type:"harvest"', 'rows': 1000})
Member

The Solr index field that holds the dataset type is called dataset_type, so this query does not return anything. You should change it to {'fq': '+dataset_type:harvest', 'rows': 1000}. Note that if you want to wrap the value in double quotes, we need to make this change in plugin.py:

diff --git a/ckanext/harvest/plugin.py b/ckanext/harvest/plugin.py
index 55af1c7..fa57103 100644
--- a/ckanext/harvest/plugin.py
+++ b/ckanext/harvest/plugin.py
@@ -92,7 +92,7 @@ class Harvest(p.SingletonPlugin, DefaultDatasetForm, DefaultTranslation):
         '''Prevents the harvesters being shown in dataset search results.'''

         fq = search_params.get('fq', '')
-        if 'dataset_type:harvest' not in fq:
+        if 'dataset_type:harvest' not in fq and 'dataset_type:"harvest"' not in fq:
             fq = u"{0} -dataset_type:harvest".format(search_params.get('fq', ''))
             search_params.update({'fq': fq})

Which is an excellent idea anyway.
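For illustration, the patched check behaves like this small, self-contained sketch (plain Python, no CKAN imports; `filter_harvest_sources` is a made-up name standing in for the plugin's `before_search` hook):

```python
def filter_harvest_sources(search_params):
    """Append -dataset_type:harvest to the filter query unless the
    caller already filters on dataset_type:harvest (quoted or not)."""
    fq = search_params.get('fq', '')
    if 'dataset_type:harvest' not in fq and 'dataset_type:"harvest"' not in fq:
        fq = u"{0} -dataset_type:harvest".format(search_params.get('fq', ''))
        search_params.update({'fq': fq})
    return search_params

# A plain search excludes harvest sources...
print(filter_harvest_sources({'fq': ''}))
# ...but an explicit dataset_type:"harvest" filter is left untouched.
print(filter_harvest_sources({'fq': '+dataset_type:"harvest"'}))
```

Note that the quoted form `dataset_type:"harvest"` does not contain the unquoted substring `dataset_type:harvest`, which is why the second condition in the diff is needed.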

Member Author

Thanks for the hint, amercader. The pull request is based on the branch "release-v2.0", and in that branch the additional filter query isn't included. We are currently upgrading the ckanext-harvest dependency of our custom CKAN extension to version "0.0.5". So I have modified the code according to the code in /ckanext/harvest/plugin.py and updated the pull request.

Member

@amercader amercader left a comment

This looks great @seitenbau-govdata. Check the comment about the filter query.
And also could you write a small test on test_action.py that covers the new functionality?

Thanks!

@amercader
Member

@seitenbau-govdata if you merge master to solve the conflicts and add a small test we can merge this one

raphaelstolt and others added 2 commits November 15, 2016 15:04
Changed filter query for reading harvest sources according to the code in /ckanext/harvest/plugin.py.
Contributor

@davidread davidread left a comment

Aside from the limit, it looks great and super useful.


job_history_clear_results = []
# We assume that the maximum of 1000 (hard limit) rows should be enough
result = logic.get_action('package_search')(context, {'fq': '+dataset_type:harvest', 'rows': 1000})
Contributor

Why add the limit? e.g. in data.gov.uk we have approximately 20,000 harvested datasets, so this would be no good.

Member

This is searching for harvest sources, not harvested datasets. It could potentially be an issue on larger instances but I think it's good enough for a first version.

Contributor

ah, good point @amercader. We have 400 of those. Still, why not remove the limit?

Member Author

The limit is set to 1000 because without defining the 'rows' parameter the default is only 10, and 1000 is the hard-coded maximum within package_search. To really get all harvest sources you would have to read the harvest source packages in blocks.
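Reading in blocks could look like the following sketch (plain Python; `fetch_all` and the stub are hypothetical, while the real lookup would go through `logic.get_action('package_search')` with its `rows`/`start` parameters):

```python
def fetch_all(search, fq, page_size=1000):
    """Page through search results using 'rows' and 'start' until the
    reported total 'count' is exhausted."""
    results = []
    start = 0
    while True:
        resp = search({'fq': fq, 'rows': page_size, 'start': start})
        results.extend(resp['results'])
        start += page_size
        if start >= resp['count']:
            break
    return results

# Stub emulating package_search's response shape, with 2500 sources,
# i.e. more than the 1000-row hard limit.
DATASETS = [{'id': 'source-%d' % i} for i in range(2500)]

def fake_package_search(data_dict):
    start, rows = data_dict['start'], data_dict['rows']
    return {'count': len(DATASETS), 'results': DATASETS[start:start + rows]}

sources = fetch_all(fake_package_search, '+dataset_type:harvest')
print(len(sources))  # 2500
```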

Contributor

Fair enough then - thanks for explaining. Perhaps add a note in the documentation about the limit, just in case?

Member Author

@davidread That's a good idea. I have added a note about the limit right now.

Fixed harvest_sources_job_history_clear test by creating different harvest sources.
Fix creating different harvest sources; different harvest sources can't be created with the factory.
Using a test-class-wide unique harvest source URL, because objects created in one test are still present in the following tests.
…clear

Ignoring non-existent harvest sources in harvest_sources_job_history_clear because of a possibly corrupt search index.
@seitenbau-govdata
Member Author

Does anybody have an idea why, with CKAN 2.2, there is only one harvest source in the result of the test "test_harvest_sources_job_history_clear" instead of two? This is the reason why the Travis CI build is failing on CKAN 2.2.
To me it seems like the Solr index isn't updated with the second harvest source yet when the action "harvest_sources_job_history_clear" reads the existing harvest sources with "package_search".

@davidread
Contributor

@seitenbau-govdata I did a little tidy of the use of test fixtures, including the one you added in test_harvest_sources_job_history_clear:

117f037

Perhaps you'd like to add that to your branch? Otherwise I can add it as a separate PR.

I created a PR anyway #271 to see if it happens to fix the ckan 2.2 problem.

BTW I had a quick look at the CKAN 2.2 test failure, but I couldn't see why it occurs. If @amercader can't see why either, then I guess someone needs to set up CKAN 2.2 and debug with that?

@davidread
Contributor

Interesting - I ran the tests again and they pass! Both for you and on the alternative PR with my test tidies.

Perhaps let me know if you're happy with the test tidies and we can choose which branch for @amercader to review.

…e" by copying the SOURCE_DICT each time, rather than letting tests edit the master copy.
@seitenbau-govdata
Member Author

@davidread Really strange that the tests are now successful. Your test tidies look good to me. Thanks!
I have added your commit to our branch, too. So it would be nice if @amercader could use our branch for the review.

@davidread
Contributor

Great, lgtm

Added a note about the limit of 1000 harvest sources
@amercader amercader merged commit 3836fcf into ckan:master Nov 23, 2016
@amercader
Member

Nice work @seitenbau-govdata, thanks for the PR and thanks @davidread for your help reviewing

@seitenbau-govdata seitenbau-govdata deleted the clearsource-history-command branch November 23, 2016 21:05
@davidread
Contributor

I was thinking this is a useful command to clear out unused info about old harvests - it builds up. But a problem has been pointed out - on the next harvest it'll create duplicate datasets, because there are no HarvestObjects any more to link up a source dataset's GUID to the local dataset. (This might not be the case for all harvesters - some may use another method to identify datasets.)

(I guess for the usage I'm thinking of, we should keep the most recent HarvestObject for every GUID, to prevent the duplicates.)
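That idea of keeping only the newest HarvestObject per GUID can be sketched like this (plain dicts stand in for HarvestObject rows; `latest_per_guid` is a hypothetical helper, not part of ckanext-harvest):

```python
def latest_per_guid(harvest_objects):
    """Return only the newest object for each GUID (by 'gathered'
    timestamp), so re-harvests can still match remote GUIDs to
    existing local datasets instead of creating duplicates."""
    latest = {}
    for obj in harvest_objects:
        guid = obj['guid']
        if guid not in latest or obj['gathered'] > latest[guid]['gathered']:
            latest[guid] = obj
    return list(latest.values())

objects = [
    {'id': 'a', 'guid': 'guid-1', 'gathered': '2016-01-01'},
    {'id': 'b', 'guid': 'guid-1', 'gathered': '2016-06-01'},
    {'id': 'c', 'guid': 'guid-2', 'gathered': '2016-03-01'},
]

keep_ids = sorted(o['id'] for o in latest_per_guid(objects))
print(keep_ids)  # ['b', 'c']
```

Everything not returned by such a helper would be safe to delete when clearing the history.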

So is this a bug, or maybe @seitenbau-govdata's harvester doesn't have this problem, or is there some other intended purpose for this command? It's just that leaving the harvest source implies you want to reharvest, so why keep the datasets if you're going to get duplicates?

At the least I've added a PR to warn about the duplicates problem:

@seitenbau-govdata
Member Author

Hi @davidread, sorry for the late response. Your assumption is right. We just need the paster command to get rid of the old harvest jobs, which over time reserve a big amount of data (in this case just waste) in the database.

Maybe you can show me the piece of code which causes the problems when the harvest object is no longer present?

We don't have the problem with duplicates after re-harvesting when all harvest jobs are deleted, but we only use the DCAT harvester from ckanext-dcat as a base. The only thing is that in the stats after the first re-harvest, all updated datasets are shown as "created".

As you suggested, I also think it's much better to preserve the current harvest jobs instead of deleting them all.
