Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for new version of lookup-by-base-path #812

Closed
MatMoore opened this issue Mar 2, 2017 · 5 comments
Closed

Proposal for new version of lookup-by-base-path #812

MatMoore opened this issue Mar 2, 2017 · 5 comments

Comments

@MatMoore
Copy link
Contributor

MatMoore commented Mar 2, 2017

Background

In regular publishing apps, we refer to documents by content ID, and using base paths as an identifier is discouraged. GET /v2/content can be used to search for editions.

In content tagger, things work a bit differently. We work at the document level rather than with specific editions, and use PATCH /v2/links to add links via the LinkSet.

For the basic tagging workflow, users enter a URL or base path of anything on GOV.UK, and we lookup the content id with POST /lookup-by-base-path.

This request may take multiple content ids to lookup simultaneously, which we use for bulk tagging workflows.

Problem

Currently if a document has not been published yet, we can't look up the item in content tagger.

We want to allow content tagger to work with any draft content, while continuing to support live content (which may be unpublished and withdrawn). We should distinguish between the two so the user can either view their changes immediately (live) or preview them on the draft-origin (draft).

For error handling, it would also be useful to distinguish between content that doesn't exist, and content that used to exist, but has been redirected/removed.

This is something we noticed when bulk tagging to the education taxonomy, because while identifying content to tag to the taxonomy, education publishers also improved the content and consolidated a lot of pages. In this case the bulk imports triggered a lot of "Content not on GOV.UK" errors.

Proposal

I'd like to create a new version of /lookup-by-base-path, that is less restrictive and returns more context about what content store it's in and whether it's been unpublished.

Request

POST /v2/lookup-by-base-path

base_path[]=/foo&base_path[]=/bar

Response

In the response there is now an object representing the current state of each content store.

{
  "/foo" => {
    {
     draft: {
      content_id: "123123-12312-12312",
     }
     live: {
       content_id: "1213123-12312-12552",
       unpublishing: {
         type: "redirect"
         alternative_path: "/foo",
       },
      },
    }
  }
  "/bar" => {
    {
     draft: {
      content_id: "123123-12312-12312",
     }
     live: {
       content_id: "1213123-12312-12552",
     }
    }
  }
}
  • If the content id never existed, the base path is not returned in the response.
  • draft returns details of the content item currently in the draft content store. There should always be a draft object.
  • live returns details of the content item currently in the live content store. There will only be a live object for published and unpublished content.
  • unpublishing is returned for the live content item if the live content item
    is an unpublished document.
  • anything with a nil value is omitted from the response

Usage

We'd add a new gds-api-adapter to return the response.

In content tagger we'd use it like this:

if response.include?('live')
  live = response['live']
  unpublishing_type = response.dig('unpublishing', 'type')

  if unpublishing_type.nil? || unpublishing_type == 'withdrawn'
    state = 'live'
    content_id = live['content_id']
  else 
    # error: you can't tag this page any more
  end

else
  state = 'draft'
  content_id = response['draft']['content_id']
end

Rejected Alternatives

  • We could combine the two objects into a single one, with the current state as a field.

    I don't like this solution as much because a state field is easily confusable with the publishing state of an edition. I think for this request, we are thinking more in terms of content store content items than publishing api editions. I'm also not convinced this is flexible enough, as it assumes the content id will never change for a base path, and we'll never be interested in additional fields that differ between draft and live.

  • We could use edition states instead of content store states

    This would make the responses more difficult to work with.

  • We could keep using v1 and use a query parameter to toggle the new response

    I'd prefer to indicate the version more explicitly but not sure if there's other considerations here.

POST vs GET

In the proposal above I've left it as POST (matching the existing request), although I think GET would convey the semantics better.

IIRC, the request was POST rather than GET to avoid hitting problems with limits on the length of a URI due to a long query string.

Alternatively we could use GET with a body. Happy to go with whichever approach is preferred.

@kevindew
Copy link
Member

kevindew commented Mar 2, 2017

GET does seem the semantic choice for this. How many of these do you look up with a single request? Do we know if we have a limit since it's just an internal request?

Longest base_path we have is 799 characters percent encoded.

Re: "draft returns details of the content item currently in the draft content store. There should always be a draft object."

I like that this matches up with the "unpublishing" implementation on get content 👍

it would be closer to the semantics we use in other places if draft was nil when there isn't an active draft. Eg if you run document.draft you get nil for instance, and if you requested something that is in draft we'd return nothing rather than live one. This scenario could well make sense as an exception to that as it's base_path specific but thought I'd make you aware that it could be an inconsistency so it can be considered.

Also.. would you want locale returned too? for content that is in multiple locales the content id is only half the picture.

@MatMoore
Copy link
Contributor Author

MatMoore commented Mar 3, 2017

Sounds like an interesting URL we've got there 🤔

We're not normally going to be sending a massive amount of base paths, but I think we could still hit limits like this one. The multi-base-path form of the request is only used in a few places:
https://github.com/search?q=org%3Aalphagov+lookup_content_ids&ref=searchresults&type=Code&utf8=%E2%9C%93

The biggest usage seems to be for translating related item tags into content ids. I don't have good data on this but the largest amount of ordered_related_items (the link type we use in content tagger) tagged to a page is 11. We also have pages that have loads of related things with different link types, eg https://www.gov.uk/api/content/guidance/beginners-guide-to-export-controls.

I was a bit unsure about always returning drafts (I assume you mean "if you requested something that is not in draft we'd return nothing rather than live one"). I don't have a good reason for differing from the standard semantics though, so I'll make it return nil unless there's an active draft.

Good point about locale. I'm not sure we need it, since link-set-tagging affects all editions of a document regardless of locale. But I'm happy to add it if it's useful more generally.

@kevindew
Copy link
Member

kevindew commented Mar 6, 2017

yes - it's loaded up with utf8 characters percent encoded (a whole other problem)

hmm maybe we can make it respond to both GET and POST and make a decision on which one of those based off number of base_paths in API Adapters?

yeah locale sounds like it'd be good for general purpose.

Less of a risk than linkables but what do you think about adding a limit to the amount of base_paths we accept? Just something big like 1000?

@MatMoore
Copy link
Contributor Author

MatMoore commented Mar 6, 2017

👍 to both of those suggestions. I'll start implementing something today.

MatMoore pushed a commit that referenced this issue Mar 6, 2017
This allows draft and unpublished documents to be looked up in addition
to live content.

See #812
@MatMoore
Copy link
Contributor Author

MatMoore commented Mar 6, 2017

Have started the work in #822

@MatMoore MatMoore closed this as completed Mar 6, 2017
MatMoore pushed a commit that referenced this issue Mar 8, 2017
This allows draft and unpublished documents to be looked up in addition
to live content.

See #812

Locale and content id are returned for each successful lookup, as well as
unpublishing details if applicable.

I've also included document type, because there are some document types that
the application may not want. For example it makes no sense to tag something to
something with document type "redirect".
MatMoore pushed a commit that referenced this issue Mar 13, 2017
This allows draft and unpublished documents to be looked up in addition
to live content.

See #812

Locale and content id are returned for each successful lookup, as well as
unpublishing details if applicable.

I've also included document type, because there are some document types that
the application may not want. For example it makes no sense to tag something to
something with document type "redirect".
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants