Proposal for new version of lookup-by-base-path #812

MatMoore · 2017-03-02T11:55:04Z

Background

In regular publishing apps, we refer to documents by content ID, and using base paths as an identifier is discouraged. GET /v2/content can be used to search for editions.

In content tagger, things work a bit differently. We work at the document level rather than with specific editions, and use PATCH /v2/links to add links via the LinkSet.

For the basic tagging workflow, users enter a URL or base path of anything on GOV.UK, and we lookup the content id with POST /lookup-by-base-path.

This request may take multiple content ids to lookup simultaneously, which we use for bulk tagging workflows.

Problem

Currently if a document has not been published yet, we can't look up the item in content tagger.

We want to allow content tagger to work with any draft content, while continuing to support live content (which may be unpublished and withdrawn). We should distinguish between the two so the user can either view their changes immediately (live) or preview them on the draft-origin (draft).

For error handling, it would also be useful to distinguish between content that doesn't exist, and content that used to exist, but has been redirected/removed.

This is something we noticed when bulk tagging to the education taxonomy, because while identifying content to tag to the taxonomy, education publishers also improved the content and consolidated a lot of pages. In this case the bulk imports triggered a lot of "Content not on GOV.UK" errors.

Proposal

I'd like to create a new version of /lookup-by-base-path, that is less restrictive and returns more context about what content store it's in and whether it's been unpublished.

Request

POST /v2/lookup-by-base-path

base_path[]=/foo&base_path[]=/bar

Response

In the response there is now an object representing the current state of each content store.

{
  "/foo" => {
    {
     draft: {
      content_id: "123123-12312-12312",
     }
     live: {
       content_id: "1213123-12312-12552",
       unpublishing: {
         type: "redirect"
         alternative_path: "/foo",
       },
      },
    }
  }
  "/bar" => {
    {
     draft: {
      content_id: "123123-12312-12312",
     }
     live: {
       content_id: "1213123-12312-12552",
     }
    }
  }
}

If the content id never existed, the base path is not returned in the response.
draft returns details of the content item currently in the draft content store. There should always be a draft object.
live returns details of the content item currently in the live content store. There will only be a live object for published and unpublished content.
unpublishing is returned for the live content item if the live content item
is an unpublished document.
anything with a nil value is omitted from the response

Usage

We'd add a new gds-api-adapter to return the response.

In content tagger we'd use it like this:

if response.include?('live')
  live = response['live']
  unpublishing_type = response.dig('unpublishing', 'type')

  if unpublishing_type.nil? || unpublishing_type == 'withdrawn'
    state = 'live'
    content_id = live['content_id']
  else 
    # error: you can't tag this page any more
  end

else
  state = 'draft'
  content_id = response['draft']['content_id']
end

Rejected Alternatives

We could combine the two objects into a single one, with the current state as a field.

I don't like this solution as much because a state field is easily confusable with the publishing state of an edition. I think for this request, we are thinking more in terms of content store content items than publishing api editions. I'm also not convinced this is flexible enough, as it assumes the content id will never change for a base path, and we'll never be interested in additional fields that differ between draft and live.
We could use edition states instead of content store states

This would make the responses more difficult to work with.
We could keep using v1 and use a query parameter to toggle the new response

I'd prefer to indicate the version more explicitly but not sure if there's other considerations here.

POST vs GET

In the proposal above I've left it as POST (matching the existing request), although I think GET would convey the semantics better.

IIRC, the request was POST rather than GET to avoid hitting problems with limits on the length of a URI due to a long query string.

Alternatively we could use GET with a body. Happy to go with whichever approach is preferred.

The text was updated successfully, but these errors were encountered:

kevindew · 2017-03-02T18:25:24Z

GET does seem the semantic choice for this. How many of these do you look up with a single request? Do we know if we have a limit since it's just an internal request?

Longest base_path we have is 799 characters percent encoded.

Re: "draft returns details of the content item currently in the draft content store. There should always be a draft object."

I like that this matches up with the "unpublishing" implementation on get content 👍

it would be closer to the semantics we use in other places if draft was nil when there isn't an active draft. Eg if you run document.draft you get nil for instance, and if you requested something that is in draft we'd return nothing rather than live one. This scenario could well make sense as an exception to that as it's base_path specific but thought I'd make you aware that it could be an inconsistency so it can be considered.

Also.. would you want locale returned too? for content that is in multiple locales the content id is only half the picture.

MatMoore · 2017-03-03T11:39:55Z

Sounds like an interesting URL we've got there 🤔

We're not normally going to be sending a massive amount of base paths, but I think we could still hit limits like this one. The multi-base-path form of the request is only used in a few places:
https://github.com/search?q=org%3Aalphagov+lookup_content_ids&ref=searchresults&type=Code&utf8=%E2%9C%93

The biggest usage seems to be for translating related item tags into content ids. I don't have good data on this but the largest amount of ordered_related_items (the link type we use in content tagger) tagged to a page is 11. We also have pages that have loads of related things with different link types, eg https://www.gov.uk/api/content/guidance/beginners-guide-to-export-controls.

I was a bit unsure about always returning drafts (I assume you mean "if you requested something that is not in draft we'd return nothing rather than live one"). I don't have a good reason for differing from the standard semantics though, so I'll make it return nil unless there's an active draft.

Good point about locale. I'm not sure we need it, since link-set-tagging affects all editions of a document regardless of locale. But I'm happy to add it if it's useful more generally.

kevindew · 2017-03-06T09:26:10Z

yes - it's loaded up with utf8 characters percent encoded (a whole other problem)

hmm maybe we can make it respond to both GET and POST and make a decision on which one of those based off number of base_paths in API Adapters?

yeah locale sounds like it'd be good for general purpose.

Less of a risk than linkables but what do you think about adding a limit to the amount of base_paths we accept? Just something big like 1000?

MatMoore · 2017-03-06T10:08:04Z

👍 to both of those suggestions. I'll start implementing something today.

This allows draft and unpublished documents to be looked up in addition to live content. See #812

MatMoore · 2017-03-06T16:35:15Z

Have started the work in #822

This allows draft and unpublished documents to be looked up in addition to live content. See #812 Locale and content id are returned for each successful lookup, as well as unpublishing details if applicable. I've also included document type, because there are some document types that the application may not want. For example it makes no sense to tag something to something with document type "redirect".

MatMoore mentioned this issue Mar 2, 2017

[DO NOT MERGE] Allow lookup requests to return draft content #806

Closed

MatMoore pushed a commit that referenced this issue Mar 6, 2017

Add v2 lookup-by-base-path request

e36ca0c

This allows draft and unpublished documents to be looked up in addition to live content. See #812

MatMoore mentioned this issue Mar 6, 2017

Lookup by base paths v2 #822

Closed

MatMoore closed this as completed Mar 6, 2017

kevindew mentioned this issue Dec 7, 2017

Add 'with_draft' option to '/lookup-by-base-path' endpoint #1076

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal for new version of lookup-by-base-path #812

Proposal for new version of lookup-by-base-path #812

MatMoore commented Mar 2, 2017

kevindew commented Mar 2, 2017

MatMoore commented Mar 3, 2017 •

edited

Loading

kevindew commented Mar 6, 2017

MatMoore commented Mar 6, 2017

MatMoore commented Mar 6, 2017

Proposal for new version of lookup-by-base-path #812

Proposal for new version of lookup-by-base-path #812

Comments

MatMoore commented Mar 2, 2017

Background

Problem

Proposal

Request

Response

Usage

Rejected Alternatives

POST vs GET

kevindew commented Mar 2, 2017

MatMoore commented Mar 3, 2017 • edited Loading

kevindew commented Mar 6, 2017

MatMoore commented Mar 6, 2017

MatMoore commented Mar 6, 2017

MatMoore commented Mar 3, 2017 •

edited

Loading