Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index packages of all status (except 'deleted') not just the 'active' ones #2261

Closed
clementmouchet opened this issue Feb 3, 2015 · 9 comments
Assignees

Comments

@clementmouchet
Copy link

Hi all,

I wonder if there is a reason for indexing only active packages?
I understand that there might be a performance overhead from indexing drafts & custom states but I think this would be very handy in many cases. (@benjaminlaot, @andreivicentiuhincu and I are using custom states for packages to enable a sort of validation workflow)

Having the ability to do package search using the index to get drafts & custom states would be better than just active ones.

I played with it very briefly and changed the state == 'active' to state != 'deleted' in the two places below.

diff --git a/ckan/lib/search/__init__.py b/ckan/lib/search/__init__.py
index d4cb59e..ea28b04 100644
--- a/ckan/lib/search/__init__.py
+++ b/ckan/lib/search/__init__.py
@@ -165,7 +165,7 @@ def rebuild(package_id=None, only_missing=False, force=False, refresh=False, def
             package_index.update_dict(pkg_dict, True)
     else:
         package_ids = [r[0] for r in model.Session.query(model.Package.id).
-                       filter(model.Package.state == 'active').all()]
+                       filter(model.Package.state != 'deleted').all()]
         if only_missing:
             log.info('Indexing only missing packages...')
             package_query = query_for(model.Package)
diff --git a/ckan/lib/search/index.py b/ckan/lib/search/index.py
index 3acea3c..f4f7506 100644
--- a/ckan/lib/search/index.py
+++ b/ckan/lib/search/index.py
@@ -126,7 +126,7 @@ class PackageSearchIndex(SearchIndex):
         if title:
             pkg_dict['title_string'] = title

-        if (not pkg_dict.get('state')) or ('active' not in pkg_dict.get('state')):
+        if (not pkg_dict.get('state')) or ('deleted' in pkg_dict.get('state')):
             return self.delete_package(pkg_dict)

         index_fields = RESERVED_FIELDS + pkg_dict.keys()

It doesn't break existing behaviour because all the places using package search filter the index by state == 'active' anyway.

What do you think?
How does that fit with your road map?
Would you take a PR that allows indexing other states as above, or using config options?

@wardi
Copy link
Contributor

wardi commented Feb 3, 2015

We were just discussing this one on the developer call. My biggest concern is this changes the behaviour of draft datasets. Currently draft datasets aren't indexed in solr and are visible only to their owner. My second concern is that your custom states might overlap with the way CKAN is already using the state field - why wouldn't you want draft or deleted versions of datasets in custom states?

I suggest using an extra field to store your state value instead of the one ckan uses.

Would you tell us what your state values and their meanings are?

This might also be related to the workflow ideas tickets like ckan/ideas#108 consider contributing your use cases to the discussion there.

@clementmouchet
Copy link
Author

@wardi what you are discussing in ckan/ideas#108 is really what I wanted to do originally. Instead we have added a pending state that comes in after the draft state.
We used the existing stage field deliberately as we considered that the existing draft active and deleted behaviours would be unchanged.
Essentially we're playing the card that CKAN considers that active is a final stage so we jumped in between the creation and that final stage.

Pending datasets are visible to their owners, org admins & sys admins through a management page inside the organisation, and we have approve them to (change the status to active) from there.

You are right in saying that adding a custom stage has a significant impact on some functions like the resource views, where we have to override the default behaviour to allow users to create views.
On the plus side it means that pending datasets are not harvestable, because they are not active.

I've tested the change of visibility and draft datasets are still only showed through the user_show. All the other places (I'm aware of) using package search and the index are filtering by 'active' already so I don't think this would introduce a security risk...

We're too far down the line to go back and didn't have enough time to develop something as powerful as what's suggested in idea 108 unfortunately...

@wardi
Copy link
Contributor

wardi commented Feb 3, 2015

@clementmouchet yes, I don't know when or if I'll have time to implement anything like I've written up in ckan/ideas#108 either. For now we're using two ckan sites and one extra date field that indicates when the dataset is to be "published" (copied from the internal site to the external one) That solution might be good enough for our needs. We have to have separate internal and external sites regardless.

The problem with a single "pending" state is that edits to an approved dataset don't go through the same approval process, and sometimes clients want that.

The old moderated-edits feature was written to solve this problem but was never finished, was slowing down the code and has now been removed. A colleague is considering using organizations to implement these workflows: datasets move from one org responsible for the stage to the next org responsible for the next stage etc..

This seems like a really common problem. I'd really like to find a general solution. Probably something much simpler than what I suggested in ckan/ideas#108

@davidread you're interested in supporting using different values for 'state', do you use 'state' it the same way as @clementmouchet ?

@davidread
Copy link
Contributor

@wardi actually my project doesn't have much need for draft changes, so I'll keep out of this discussion.

@rbvictor
Copy link
Contributor

@clementmouchet have you implemented the pending state idea? If so, have you published it as a ckan extension? I have been looking to implement the same kind of functionality, something simpler than the workflow suggested in ckan/ideas#108. If you'd already done something like that, it would be awesome to have it shared with us.

@clementmouchet
Copy link
Author

hi @rbvictor yes we did write an extension supporting a pending state and a basic custom workflow. I have moved to another company since and I don't have access to the source code. Unfortunately I don't think it has been released as an Open Source extension.

@benjaminlaot, you might be able to help?

@amercader
Copy link
Member

I have moved to another company since and I don't have access to the source code. Unfortunately I don't think it has been released as an Open Source extension.

ಠ_ಠ

@clementmouchet
Copy link
Author

Agreed. Not calling the shots here.

@wardi
Copy link
Contributor

wardi commented Jun 27, 2017

lots of work done on private and draft dataset visibility recently including #3191 and #3192 If these don't cover your use case, please consider submitting a new PR

@wardi wardi closed this as completed Jun 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants