Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC 116: store attachment data in content items #116

Merged
merged 1 commit into from Jan 31, 2020
Merged

RFC 116: store attachment data in content items #116

merged 1 commit into from Jan 31, 2020

Conversation

@barrucadu
Copy link
Contributor

barrucadu commented Jan 3, 2020

Deadline for discussion: 2020-01-31
Deadline for discussion: 2020-01-24

Rendered

Copy link
Contributor

ChrisBAshton left a comment

Looks like a great proposal to me 👍 would have made the migration of assets from Whitehall to Content Publisher a bit easier! I have only one question (and a related suggestion) - see below.

rfc-116-store-attachment-data-in-content-items.md Outdated Show resolved Hide resolved
rfc-116-store-attachment-data-in-content-items.md Outdated Show resolved Hide resolved
@kevindew

This comment has been minimized.

Copy link
Member

kevindew commented Jan 3, 2020

Thanks for raising this Michael 👍 This RFC is of interest to us on the Publishing Workflow team as we intend to work with the publications document type and want to get past the situation where Whitehall is currently rendering an array of HTML blobs from a content item for attachments (see: the documents section of https://www.gov.uk/api/content/government/publications/help-with-budgeting-your-universal-credit as an example). I know this was also a problem for the Content Data team when trying to extract attachment data.

We also want to have Publishing API be able to not only render attachments in govspeak but also images. We'd like this as we hope that if Publishing API can render Content Publisher govspeak then we could set it up to eventually automatically update embedded content (e.g. organisational contact details).

One of the things where I think you should consider more is how this data can be consolidated with the data needed to render govspeak assets from govspeak. In Govspeak we've already got informal definitions of expected data structure of attachment files (and there'll be a much richer one in Whitehall with their extra data). If we go with a new data structure that can meet the needs for Govspeak then we've got a bunch of migration headaches to restructure everything for that. If we go with a data structure that doesn't meet govspeak needs then we're going to end up having the same data in a content item twice. My view would be that it'd be better if we could be backwards compatible with existing structures of data and actually formalise that rather than the informal approach we have.

An area not mentioned in the RFC is image files since they're treated slightly differently and are similar data that Publishing API is lacking.

I also think we should be very cautious about adding Asset Manager communication to Publishing API. I think it could easily become complex and lead to undesirable situations - for example they can be superseded by newer versions, they can be setup with their own redirects, and their removal needs managing.

@barrucadu barrucadu force-pushed the msw/rfc-116 branch from 0619502 to 035d4ea Jan 6, 2020
@barrucadu

This comment has been minimized.

Copy link
Contributor Author

barrucadu commented Jan 6, 2020

Thanks Kevin, I've removed the Asset Manager bits (admittedly those were a bit of a stretch) and written about the Govspeak attachments stuff. Do you think we should go straight to adding all of Whitehall's attachment model to Govspeak, or add it as we discover things are needed?

Another thing I'm not certain on is what exactly the relationship between the asset_link schema and the Govspeak attachment metadata should be. They need to be compatible: field names matching and all fields required in one required in the other; but I'm not sure they need to be exactly the same. I'm leaning towards asset_link being a supertype, so we can have metadata in the content item which isn't used in the HTML output.

@kevindew

This comment has been minimized.

Copy link
Member

kevindew commented Jan 6, 2020

Thanks Kevin, I've removed the Asset Manager bits (admittedly those were a bit of a stretch) and written about the Govspeak attachments stuff. Do you think we should go straight to adding all of Whitehall's attachment model to Govspeak, or add it as we discover things are needed?

Another thing I'm not certain on is what exactly the relationship between the asset_link schema and the Govspeak attachment metadata should be. They need to be compatible: field names matching and all fields required in one required in the other; but I'm not sure they need to be exactly the same. I'm leaning towards asset_link being a supertype, so we can have metadata in the content item which isn't used in the HTML output.

Great thanks. Do you want to grab a chat about the asset side of things and then we can summarise it as a comment here? it strikes me as something that might be easier to work through face-to-face.

@kevindew

This comment has been minimized.

Copy link
Member

kevindew commented Jan 8, 2020

Thanks for the chat yesterday Michael. I'll summarise the bits we covered:

Could the microformats schema.org data be exposed through HTML attributes on the attachments rather than needing to be JSON+LD data? You have an expectation that Google (and co) would get confused by a mixture of JSON+LD and embedded microformats and felt we also needed the attachment data to be available to search API which doesn't work easily if it's embedded in HTML.

We discussed what the roll out might be. You indicated that the search team would be responsible for doing this I suggested we have some info of that in the RFC.

I learnt that the asset_link schema already exists. We both concluded it wouldn't be suitable for this to contain both images and attachment handling given the way the data stored for each of these diverges, however felt changing to a new schema for those could be backwards compatible.

Finally we talked about the intention that the attachment data can eventually replace the need to render the documents in a publishing app and that we can intend for this new field to eventually lead to the deprecation of documents.

@barrucadu

This comment has been minimized.

Copy link
Contributor Author

barrucadu commented Jan 9, 2020

@kevindew I updated the draft based on our chat, let me know if I've missed anything

@kevindew

This comment has been minimized.

Copy link
Member

kevindew commented Jan 10, 2020

Thanks Michael - this is looking great 👍

Few comments from me are:

  • I'd have thought we'd still need type to know whether something is file/html/external attachment?
  • I'd have thought we needed external_url to know the URL of an external attachment
  • Probably very minor but I'm not sure why file_size is marked as from Govspeak
  • I'd suggest rather than using link in the naming you maybe use asset so image_asset and attachment_asset, this feels to have more semantic meaning than link in the context
  • I'm not sure we should be setting created_at and updated_at as they may reveal information that is misleading or inaccurate (particularly if they're straight from the db) - I also don't think they're used in presentation

For compatibility with Content Publisher we'd need more fields in both image and attachment, though I think it'd be fine that we add them later when we get to actually, hopefully using it. But seems worth telling you:

For image these are: id, alt_text, caption and credit
For attachment these are: id and filename

I also notice that content_type is not used for images in Content Publisher and thus not needed for govspeak so I'd suggest we drop that being a required field for an image (It seems mostly unnecessary).

Finally, it'd be good to set a deadline on this RFC if there's not one already?

@benthorner are the print_meta_data_contact_address and web_isbn fields the same fields you identified as potentially removable from Whitehall?

@barrucadu

This comment has been minimized.

Copy link
Contributor Author

barrucadu commented Jan 13, 2020

I'd have thought we'd still need type to know whether something is file/html/external attachment?

I thought we could get that from the URL: if it's a URL on GOV.UK that ends in .html it's an HTML attachment, if it's some other url URL on GOV.UK it's a file attachment, if it' s a non-GOV.UK URL then it's an external attachment. We can keep that field if you think it's useful though.

I'd have thought we needed external_url to know the URL of an external attachment

url is already a required field, and does the same job.

Probably very minor but I'm not sure why file_size is marked as from Govspeak

It's in the metadata here: https://github.com/alphagov/govspeak#attachments but it's not in the asset_link type, so it's a new field from Govspeak (unlike all the other fields which are from Whitehall).

I'd suggest rather than using link in the naming you maybe use asset so image_asset and attachment_asset, this feels to have more semantic meaning than link in the context

👍

I'm not sure we should be setting created_at and updated_at as they may reveal information that is misleading or inaccurate (particularly if they're straight from the db) - I also don't think they're used in presentation

Yeah, we can remove those.

For compatibility with Content Publisher we'd need more fields in both image and attachment, though I think it'd be fine that we add them later when we get to actually, hopefully using it. But seems worth telling you:

For image these are: id, alt_text, caption and credit
For attachment these are: id and filename

Would the attachment filename be in place of the url field (in which case I see your point about adding an external_url field), or in addition to?

I also notice that content_type is not used for images in Content Publisher and thus not needed for govspeak so I'd suggest we drop that being a required field for an image (It seems mostly unnecessary).

I'll remove that too. Content type for an image isn't too useful.

Finally, it'd be good to set a deadline on this RFC if there's not one already?

How about end of Friday this week?

@barrucadu

This comment has been minimized.

Copy link
Contributor Author

barrucadu commented Jan 14, 2020

I've set Friday next week as the deadline.

@kevindew

This comment has been minimized.

Copy link
Member

kevindew commented Jan 14, 2020

I'd have thought we'd still need type to know whether something is file/html/external attachment?

I thought we could get that from the URL: if it's a URL on GOV.UK that ends in .html it's an HTML attachment, if it's some other url URL on GOV.UK it's a file attachment, if it' s a non-GOV.UK URL then it's an external attachment. We can keep that field if you think it's useful though.

Yeah I think it'll be useful as it'd be really nuanced to work out which ones are which based on the URL. Html Attachments don't have a suffix of .html for instance, for assets we'd probably have to assert on the hostname (which is under some discussion), and although we expect an external attachment to be a different hostname there are likely no guarantees of that.

I'd have thought we needed external_url to know the URL of an external attachment

url is already a required field, and does the same job.

Super

Probably very minor but I'm not sure why file_size is marked as from Govspeak

It's in the metadata here: https://github.com/alphagov/govspeak#attachments but it's not in the asset_link type, so it's a new field from Govspeak (unlike all the other fields which are from Whitehall).

Ah - so Whitehall does also have that field as well: https://github.com/alphagov/whitehall/blob/6c53d01361b83fa816dd2a05892e6f93f4cf602a/app/models/file_attachment.rb#L11

For compatibility with Content Publisher we'd need more fields in both image and attachment, though I think it'd be fine that we add them later when we get to actually, hopefully using it. But seems worth telling you:
For image these are: id, alt_text, caption and credit
For attachment these are: id and filename

Would the attachment filename be in place of the url field (in which case I see your point about adding an external_url field), or in addition to?

yes, filename should essentially be the basename of the url (however there are circumstances with govspeak when the URL doesn't have the basename so it's independent). The filename is used to assist in working out the mime-type to display the correct metadata: https://github.com/alphagov/govuk_publishing_components/blob/ecb19eae8f421ade7b650bd1cc02493768d323eb/lib/govuk_publishing_components/presenters/attachment.rb#L23-L28

Finally, it'd be good to set a deadline on this RFC if there's not one already?

How about end of Friday this week?

Sounds good 👍


I've thought of a rather annoying issue we may well hit with this approach due to the attachment list having potentially two meanings that may collide.

In Specialist Publisher, and future apps that may want to allow attachments to be embedded into govspeak, we have this list of attachments and govspeak can reference them. This list doesn't have a particular ordering.

In Publications and similar documents example we have a list of attachments that is ordered and should be rendered specially on the page with extra metadata.

The complication seems to arise when we consider what might happen when we want to have both inline attachments that govspeak can render and these publication attachments on the same document. A key difference between these types of attachments is the amount of metadata, the publication ones have all the official document and similar information, whereas inline ones that are only used by govspeak would lack this.

So I'm just wondering what should be done to make them distinct. Is it some form of two different lists? or flags in a singular list, I'm not sure myself. It's worth tagging @benthorner on this too as he's working in this area.

@barrucadu

This comment has been minimized.

Copy link
Contributor Author

barrucadu commented Jan 15, 2020

Yeah I think it'll be useful as it'd be really nuanced to work out which ones are which based on the URL. Html Attachments don't have a suffix of .html for instance, for assets we'd probably have to assert on the hostname (which is under some discussion), and although we expect an external attachment to be a different hostname there are likely no guarantees of that.

Ok, I've added that.

Sounds good 👍

I actually went for Friday of next week, as this week might be a bit short if we still have issues to resolve. But next week should be fine.

The complication seems to arise when we consider what might happen when we want to have both inline attachments that govspeak can render and these publication attachments on the same document. A key difference between these types of attachments is the amount of metadata, the publication ones have all the official document and similar information, whereas inline ones that are only used by govspeak would lack this.

That's a good point, I'd not considered that. Mentally I was grouping all formats into "has an attachment block but no inline attachments" and "has inline attachments but no attachment block". But that's not necessarily the case.

I'm leaning towards the "two list" solution, as the two types of attachment are very different, so I'm not sure putting them in the same list makes sense. There are two issues with mixing them in the same list that I can see:

  1. Many formats can't have publication-style attachments, but if we just have one list we can't really enforce that the publishing app sends valid data (unless our schemas support refinement types?)

  2. When resolving attachment references in the page body, does the nth attachment mean the nth one in the list or the nth inline attachment in the list?

But it'll be good to get Ben's thoughts on this.

@benthorner

This comment has been minimized.

Copy link

benthorner commented Jan 16, 2020

@barrucadu thanks for asking! I've read through the RFC again, and the comments above, so hopefully what follows makes sense.

Images

It would be good to clarify the intended use of the image_asset definition in the proposed schema, since it's tangential to the problem statement. I can see we currently use it in formats like travel_advice. We should look to replace title and include the other metadata fields (alt_text, etc.) @kevindew suggested, to promote their use - I'm surprised we excluded them in the past.

It's seriously weird that government-frontend actually thinks the image blob will have an alt_text field already. Apparently it was a dream that might actually happen now.

I'm guessing that for the other places that appear to use asset_links, we'll want to convert them to use attachment_assets, so I don't see this being a problem.

Attachments

With the list of attachments, I think we should separate the rendering concern from the data / metadata concern, noting that Govspeak doesn't rely on ordering to fish them out. A "two list" approach, like you say, but different in that:

  • One list would be to store the metadata for all attachments. Different attachments could have different metadata, depending on what they are (external / file / HTML), but also on their intended use (inline or document-level), which could be left up to the publishing app. We will need to cope with all variations anyway, so I don't see this as a problem.

  • One list would be to replicate the current function of documents: to represent a curated/ordered list of "document-level" attachments. Instead of adding (yet) another key, we could stick with documents and modify it's definition to support a 'Govspeak hash' - the way the Publishing API stores the rendered Govspeak is a little weird, but whatever.

Other minor comments

@benthorner are the print_meta_data_contact_address and web_isbn fields the same fields you identified as potentially removable from Whitehall?

To confirm, these fields are currently redundant and we expect to remove them shortly. https://trello.com/c/vub2SU2P/1330-decide-what-to-do-with-redundant-html-attachment-fields

We also expect to remove the order_url and price_in_pence fields soon, as we've established these too are redundant.

For image these are: id, alt_text, caption and credit
For attachment these are: id and filename

The id field is notably missing in both schemas, but is required by Govspeak for the 'new' extensions used by Content Publisher.

id is a bit weird if we also have content_id, so a compromise might be filename, which is what we use in Content Publisher to enable more 'friendly' snippets like [Attachment:file-1.pdf].

I'd have thought we'd still need type to know whether something is file/html/external attachment?

The Whitehall rendering of attachments only seems to care about the external aspect - to modify the class and the link rel attribute - which we could just infer from the domain. Nothing about the rendering seems to depend on the difference between the file vs. html types.

If we do want to pass the type, I think we should make the name more precise: attachment_type springs to mind.

@barrucadu

This comment has been minimized.

Copy link
Contributor Author

barrucadu commented Jan 16, 2020

@benthorner I've updated based on our chat. The only thing I wasn't sure about was the id field - do you think that should be replaced with filename?

@benthorner

This comment has been minimized.

Copy link

benthorner commented Jan 16, 2020

@barrucadu the approach is looking better to me, and hopefully consistent with @kevindew's ideas. The migration comments were food for thought:

  • Should we also have a step to migrate the (one) use of asset_links for images?
  • Should we have a step (perhaps Step 4) to remove title for images?

The migration pathway might be a bit fiddly for Specialist Publisher, as suggested in one of the comments. Not a biggie, but might help to provide some explicit consideration:

  • Renaming content_id to id. We could do this by changing the app to get data from either field, and only save it in the new field.
  • Moving details.attachments to attachments. Ditto.

We seem to have got lost with content_id vs. filename vs. id. I think we only need one of these, but it'll need to be a string, since Content Publisher uses a filename/slug as the id, whereas (e.g.) a manual section uses a UUID. It's irrelevant really, as long as the resolution mechanisms in Govspeak all work. So, in summary, I suggest just having id but making it more permissive.

@kevindew

This comment has been minimized.

Copy link
Member

kevindew commented Jan 21, 2020

I've been thinking a bit more about this and have some fresh ideas - sorry to be giving lots of staggered inputs, it's an area I haven't thought about in detail before and I keep learning/realising new things from our team chats exploring the area.

On Publishing Workflow we've been talking about attachments a lot. We're currently using terminology of featured attachments to reference attachments that are ordered and shown on publications and consultations (currently in the documents array) and in-body attachments to reference ones that are rendered through govspeak. This terminology isn't without flaws but it seems to work in the absence of something better.

So one thing in particular to think about, and this goes right back to the problem you're working on Michael, is whether the Search API is interested on the data for all the attachments (in-body and featured) or just the featured attachments that publications and consultations have?

I do wonder if we really want to have all the attachment fields available for all document types, since these fields are only used by publication and consultation types. Perhaps we want to only apply this at a schema level for publication and consultation schemas? What do you think? I'm not sure if we should consider this the base set of all attachments or just special extra fields for these types. I also wonder a bit what we'd want to do if a document type wanted extra metadata on an attachment, would that be a new type in schemas or adding to base (the former seems more appealing)

Another thing I wondered is that we probably want to use the Publishing API link system to pull in data for HTML attachments. So rather than those having fields like title and url, we may instead want content_id and have a link type that we can pull these in from the links of the document. This seems a bit more fiddly but more consistent with the Publishing API and reduces duplicated data.

On the existing side of things

I think we do need attachment_type, URL alone isn't going to be enough to differentiate between attachment types, and it'd be good to not have to do any regexing of a URL. I expect we should use a oneOf in schema to decide between them as I think some fields won't be valid for all (for example content type should only matter on file attachment).

With the id field can we do anything at a schema level to enforce uniqueness across a collection? from a quick google it doesn't seem so.

Whats the logic for moving attachments to root level? This strikes me as something that increases the migration costs substantially and I'm not sure what we gain from it? I imagine it opens the door straight away to wanting to move images also out of details otherwise those things become inconsistent.

Changing details.documents seems quite risky for a roll-out because we'd not be able to have the old field and the new field at the same time. I feel like we'd do better seeing if we could go with a new name and then deprecate documents. Can you elaborate at all by what is meant by govspeak_hashes? would this be the same data that is in attachments or would it be references to data that is referenced there and pulled in? In Content Publisher we've been talking about having an array where we reference id and types to flag the featured attachments and their ordering. Something like:

[{ id: "file.jpg", attachment_type: "file" }, { id: "my-html-attachment", attachment_type: "html" }]

On the id, content_id and filename front. I think we'd want to use id as the means to embed an attachment in govspeak, content_id should become surplus to requirements (aside from a usage linking html_attachments) and I think we'd still want filename even if some apps (Content Publisher in particular) have the same value for id.

Copy link
Contributor

ChrisBAshton left a comment

I've had a re-read in light of the recent changes. Looking great, just a couple of suggestions for improving the clarity of the proposal (and checking my own understanding).

rfc-116-store-attachment-data-in-content-items.md Outdated Show resolved Hide resolved
rfc-116-store-attachment-data-in-content-items.md Outdated Show resolved Hide resolved
rfc-116-store-attachment-data-in-content-items.md Outdated Show resolved Hide resolved
rfc-116-store-attachment-data-in-content-items.md Outdated Show resolved Hide resolved
@barrucadu

This comment has been minimized.

Copy link
Contributor Author

barrucadu commented Jan 22, 2020

So one thing in particular to think about, and this goes right back to the problem you're working on Michael, is whether the Search API is interested on the data for all the attachments (in-body and featured) or just the featured attachments that publications and consultations have?

Statistical data sets, which is one place thing we'd want to use attachments from, don't have featured attachments, they just link to in-body attachments.

I do wonder if we really want to have all the attachment fields available for all document types, since these fields are only used by publication and consultation types. Perhaps we want to only apply this at a schema level for publication and consultation schemas? What do you think? I'm not sure if we should consider this the base set of all attachments or just special extra fields for these types. I also wonder a bit what we'd want to do if a document type wanted extra metadata on an attachment, would that be a new type in schemas or adding to base (the former seems more appealing)

Another thing I wondered is that we probably want to use the Publishing API link system to pull in data for HTML attachments. So rather than those having fields like title and url, we may instead want content_id and have a link type that we can pull these in from the links of the document. This seems a bit more fiddly but more consistent with the Publishing API and reduces duplicated data.

A way of solving both these issues would be to have multiple attachment schemas, and document schemas could use a union of whatever types makes sense. eg, a "PublicationesqueAttachment", and "HTMLAttachment".

What do you think should be in the basic attachment type and what should be in the publication attachment type?

I think we do need attachment_type, URL alone isn't going to be enough to differentiate between attachment types, and it'd be good to not have to do any regexing of a URL. I expect we should use a oneOf in schema to decide between them as I think some fields won't be valid for all (for example content type should only matter on file attachment).

Ok, I can bring that back.

With the id field can we do anything at a schema level to enforce uniqueness across a collection? from a quick google it doesn't seem so.

We could still do that after validating the schema and reject documents with duplicate IDs.

Whats the logic for moving attachments to root level? This strikes me as something that increases the migration costs substantially and I'm not sure what we gain from it? I imagine it opens the door straight away to wanting to move images also out of details otherwise those things become inconsistent.

This was Ben's suggestion, the motivation being that attachments are basically metadata. Here's a slack excerpt: "top-level attachments with ordered (backwards compatible) rendering specified in a details.documents section. We can use the former to replace details.attachments for special/manual docs, since those are only rendered inline - it’s not a curated list."

I'm kind of on-the-fence about this change, I'd be happy to keep details.attachments.

Changing details.documents seems quite risky for a roll-out because we'd not be able to have the old field and the new field at the same time. I feel like we'd do better seeing if we could go with a new name and then deprecate documents. Can you elaborate at all by what is meant by govspeak_hashes? would this be the same data that is in attachments or would it be references to data that is referenced there and pulled in?

This would be an array of { content_type: "text/govspeak or text/html", body: "govspeak or html to render attachment" } values. Publishing apps could put either govspeak or HTML in there while migration happens, and publishing-api can render the govspeak.

On the id, content_id and filename front. I think we'd want to use id as the means to embed an attachment in govspeak, content_id should become surplus to requirements (aside from a usage linking html_attachments) and I think we'd still want filename even if some apps (Content Publisher in particular) have the same value for id.

👍

There's a lot of feedback so I'll leave you to respond to this before I update the RFC.

@kevindew

This comment has been minimized.

Copy link
Member

kevindew commented Jan 22, 2020

So one thing in particular to think about, and this goes right back to the problem you're working on Michael, is whether the Search API is interested on the data for all the attachments (in-body and featured) or just the featured attachments that publications and consultations have?

Statistical data sets, which is one place thing we'd want to use attachments from, don't have featured attachments, they just link to in-body attachments.

Ah cool - so Search API will have an interest in all of them. Would it be only pulling in ones for particular document_types or will it pull in from any? - I ask based on the impact for Search API if schema rules for attachments differ.

I do wonder if we really want to have all the attachment fields available for all document types, since these fields are only used by publication and consultation types. Perhaps we want to only apply this at a schema level for publication and consultation schemas? What do you think? I'm not sure if we should consider this the base set of all attachments or just special extra fields for these types. I also wonder a bit what we'd want to do if a document type wanted extra metadata on an attachment, would that be a new type in schemas or adding to base (the former seems more appealing)
Another thing I wondered is that we probably want to use the Publishing API link system to pull in data for HTML attachments. So rather than those having fields like title and url, we may instead want content_id and have a link type that we can pull these in from the links of the document. This seems a bit more fiddly but more consistent with the Publishing API and reduces duplicated data.

A way of solving both these issues would be to have multiple attachment schemas, and document schemas could use a union of whatever types makes sense. eg, a "PublicationesqueAttachment", and "HTMLAttachment".

What do you think should be in the basic attachment type and what should be in the publication attachment type?

In a basic attachment we should have:

  • accessible
  • alternative_format_contact_email
  • attachment_type
  • content_type
  • filename
  • file_size
  • hoc_paper_number
  • id
  • isbn
  • locale (not 100% sure of this as no app can populate it accurately yet but seems good thing to aim for)
  • number_of_pages
  • preview_url
  • title
  • url

Then a publication file should have those with:

  • command_paper_number
  • hoc_paper_number
  • isbn
  • parliamentary_session
  • unique_reference
  • unnumbered_command_paper
  • unnumbered_hoc_paper

Then we'd have to deal with attachment_type and other subtle differences between HTML and external attachments.

I find myself a bit torn on what route is best to proceed with here. It feels slightly odd to permit all the various attachment fields and types to all documents when we only know of two schemas that use them. On the other hand they're somewhat coupled because they're a product of Whitehall where they were considered the same thing.

The pros to separating them seem to be:

  • doesn't create a convention where any future attachment metadata has to be global by default
  • schemas are more realistic to what the data actually is for content
  • makes it a more conscious decision on whether to expand either HTML attachments or external attachments which are both somewhat dubious concepts

The cons seem to be:

  • It's more complex having two sets of attachment types
  • there's already an element of coupling and hard to avoid more with file attachments containing so many similar fields
  • unclear what to name the publication types

I think we do need attachment_type, URL alone isn't going to be enough to differentiate between attachment types, and it'd be good to not have to do any regexing of a URL. I expect we should use a oneOf in schema to decide between them as I think some fields won't be valid for all (for example content type should only matter on file attachment).

Ok, I can bring that back.

👍

With the id field can we do anything at a schema level to enforce uniqueness across a collection? from a quick google it doesn't seem so.

We could still do that after validating the schema and reject documents with duplicate IDs.

Whats the logic for moving attachments to root level? This strikes me as something that increases the migration costs substantially and I'm not sure what we gain from it? I imagine it opens the door straight away to wanting to move images also out of details otherwise those things become inconsistent.

This was Ben's suggestion, the motivation being that attachments are basically metadata. Here's a slack excerpt: "top-level attachments with ordered (backwards compatible) rendering specified in a details.documents section. We can use the former to replace details.attachments for special/manual docs, since those are only rendered inline - it’s not a curated list."

I'm kind of on-the-fence about this change, I'd be happy to keep details.attachments.

Yes - I ended up having a chat with Ben about it. What seems to have been lost to history is what exactly the intended purpose of details was. My understanding is that it's meant to be used for any data that isn't consistent across all document types. There's already numerous examples of metadata being present in details for example: https://www.gov.uk/api/content/aaib-reports,

I'd suggest that unless we're really clear that this change is beneficial we should avoid it as I think it opens up a bunch of fresh questions about what else should not be in details. Whereas in details, where it is metadata that fulfils the needs of of some but not all formats, it is consistent with existing usages and avoids the needs for adding additional columns to Pub API and Content Store.

Changing details.documents seems quite risky for a roll-out because we'd not be able to have the old field and the new field at the same time. I feel like we'd do better seeing if we could go with a new name and then deprecate documents. Can you elaborate at all by what is meant by govspeak_hashes? would this be the same data that is in attachments or would it be references to data that is referenced there and pulled in?

This would be an array of { content_type: "text/govspeak or text/html", body: "govspeak or html to render attachment" } values. Publishing apps could put either govspeak or HTML in there while migration happens, and publishing-api can render the govspeak.

I think the risk we have with this is that we need to have updated the frontend apps before any publishing apps can send this data, and that means that frontend apps wouldn't have examples to work off while being developed or a means to rollback if there are problems.

On the id, content_id and filename front. I think we'd want to use id as the means to embed an attachment in govspeak, content_id should become surplus to requirements (aside from a usage linking html_attachments) and I think we'd still want filename even if some apps (Content Publisher in particular) have the same value for id.

👍

There's a lot of feedback so I'll leave you to respond to this before I update the RFC.

Thanks Michael 👍

@barrucadu

This comment has been minimized.

Copy link
Contributor Author

barrucadu commented Jan 23, 2020

Ah cool - so Search API will have an interest in all of them. Would it be only pulling in ones for particular document_types or will it pull in from any? - I ask based on the impact for Search API if schema rules for attachments differ.

I think we'd only pull in the data for certain document types.

I find myself a bit torn on what route is best to proceed with here. It feels slightly odd to permit all the various attachment fields and types to all documents when we only know of two schemas that use them. On the other hand they're somewhat coupled because they're a product of Whitehall where they were considered the same thing.

I think we have some degree of the same complexity even with different types. With one type, we set the precedent of metadata fields being global by default. So a consumer might have to deal with unexpected fields. With multiple types, we solve that problem, but unless we make the new fields mandatory, a consumer might have to deal with missing expected fields.

I'm not sure what's best either.

I'd suggest that unless we're really clear that this change is beneficial we should avoid it as I think it opens up a bunch of fresh questions about what else should not be in details.

👍

I think the risk we have with this is that we need to have updated the frontend apps before any publishing apps can send this data, and that means that frontend apps wouldn't have examples to work off while being developed or a means to rollback if there are problems.

Yeah, that's a good point. Though the frontend logic would be:

get_attachment_html(attachment)
    if attachment.is_a? Hash
        attachment[:body]
    else
        attachment
    end
end

Which is fairly straightforward. I'd feel pretty safe making that change, then we can try the publishing app & schema change in govuk-docker or integration before committing to it.

@barrucadu barrucadu force-pushed the msw/rfc-116 branch from 0e4c23e to a81af5b Jan 23, 2020
@stevenjmesser

This comment has been minimized.

Copy link

stevenjmesser commented Jan 23, 2020

This RFC looks really interesting. I had a conversation last year with the National Archives (representing other web archiving organisations) who said it would be useful to have JSON for attachments in the API responses in content items. It would allow them to collect attachments from pages.

I'll send the notes of the conversation to Michael, who can circulate it. (Apologies if this has already been mentioned in the thread above.)

@barrucadu

This comment has been minimized.

Copy link
Contributor Author

barrucadu commented Jan 23, 2020

The relevant bit of the notes is that they mostly use the json content items, but have to scrape the html to get "A more specific title for the specific PDF attachment we found.", "The ISSN, if any.", and "If this appears to be Command or Act paper (Cm/HC)." And that they would like "The machine readable metadata to include explicit fields for attachments, including title, ISSN, HC/Cm status".

There's some other stuff in the notes, but that's the bit about attachments. So this RFC solves at least one external problem, which is good!

@barrucadu barrucadu force-pushed the msw/rfc-116 branch from 67d9248 to e0a3658 Jan 24, 2020
@barrucadu

This comment has been minimized.

Copy link
Contributor Author

barrucadu commented Jan 24, 2020

I've updated the RFC with the current feedback and set the deadline to next Friday.

@barrucadu

This comment has been minimized.

Copy link
Contributor Author

barrucadu commented Jan 28, 2020

@kevindew @benthorner Just a reminder that the deadline for this RFC is Friday

@chao-xian

This comment has been minimized.

Copy link
Member

chao-xian commented Jan 28, 2020

Is there room to add a "lang" field? We currently have the problem of translated pages pointing to documents that are not of that language, and that's an accessibility issue. We don't currently have a way to flag this up. This could be a neat solution to that.

@barrucadu

This comment has been minimized.

Copy link
Contributor Author

barrucadu commented Jan 28, 2020

@maxgds

This comment has been minimized.

Copy link

maxgds commented Jan 28, 2020

@chao-xian Language was discussed here: #116 (comment) :)

@kevindew

This comment has been minimized.

Copy link
Member

kevindew commented Jan 29, 2020

Thanks for the update @barrucadu - I think this is looking really good and not far off now.

A snag I can see is the required content_type field on base_attachment that wouldn't make sense with html and external attachments, and thinking about there a few more fields there that wouldn't make sense in those contexts (file_size / filename)

I wonder if we should make base_attachment have less fields so those don't have to cascade down. Also base_attachment won't actually be a real thing just a way we share data in schemas (right?) so it may not actually need to be as highlighted here in the RFC which might make life easier if we don't need to think as much in terms of a hierarchy and the minor pain points that can give. It'd also make it easier to use a oneOf (file, html, external) schema for publication attachments to cover any divergences between these types.

Other thing I thought was about the removal of order_url and price. Depending on how quickly work will begin on implementing this we may get caught where Publishing Workflow haven't finished or come to full conclusions on its removal. This may not really be a problem going forward as it can just be dealt with post the RFC depending on context/status but thought I'd flag.

Copy link

benthorner left a comment

@kevindew beat me to the comment button ⌨️. This is looking good and thanks for the effort and consideration you're putting in to it ⭐️.

I agree with @kevindew's comments and see we've both got some thoughts about the base_attachment_asset schema. Maybe we should meet and discuss those together.

I've also added a few other suggestions: one around the naming of document_attachments; and the others are more minor things about the RFC document itself.

rfc-116-store-attachment-data-in-content-items.md Outdated Show resolved Hide resolved
rfc-116-store-attachment-data-in-content-items.md Outdated Show resolved Hide resolved
rfc-116-store-attachment-data-in-content-items.md Outdated Show resolved Hide resolved
rfc-116-store-attachment-data-in-content-items.md Outdated Show resolved Hide resolved
@barrucadu

This comment has been minimized.

Copy link
Contributor Author

barrucadu commented Jan 29, 2020

A snag I can see is the required content_type field on base_attachment that wouldn't make sense with html and external attachments, and thinking about there a few more fields there that wouldn't make sense in those contexts (file_size / filename)
[...]
It'd also make it easier to use a oneOf (file, html, external) schema for publication attachments to cover any divergences between these types.

Do you think publication_attachment_asset should be split into those three types?

Also base_attachment won't actually be a real thing just a way we share data in schemas (right?)

I was imagining it to be a real thing.

@barrucadu barrucadu force-pushed the msw/rfc-116 branch from 9341112 to 80470a3 Jan 30, 2020
Copy link

benthorner left a comment

Yay 🎉 🎉 ⭐️ 🎆

@barrucadu barrucadu force-pushed the msw/rfc-116 branch from 93c3db7 to eb96e9e Jan 30, 2020
@barrucadu

This comment has been minimized.

Copy link
Contributor Author

barrucadu commented Jan 31, 2020

I'm leaning towards removing the html_attachment_asset and external_attachment_asset schemas, as they're no longer used in the oneOf in publication_attachment_asset. Other than that, I think I'm now pretty happy with this.

@barrucadu barrucadu requested a review from kevindew Jan 31, 2020
Copy link
Member

kevindew left a comment

Nice one Michael - this looks good to me 👍

@barrucadu barrucadu force-pushed the msw/rfc-116 branch from a34e2d4 to c34027f Jan 31, 2020
@barrucadu barrucadu merged commit 10059f5 into master Jan 31, 2020
@barrucadu barrucadu deleted the msw/rfc-116 branch Jan 31, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

8 participants
You can’t perform that action at this time.