Join GitHub today
GitHub is home to over 40 million developers working together to host and review code, manage projects, and build software together.Sign up
RFC 116: store attachment data in content items #116
ChrisBAshton left a comment
Looks like a great proposal to me
Thanks for raising this Michael
We also want to have Publishing API be able to not only render attachments in govspeak but also images. We'd like this as we hope that if Publishing API can render Content Publisher govspeak then we could set it up to eventually automatically update embedded content (e.g. organisational contact details).
One of the things where I think you should consider more is how this data can be consolidated with the data needed to render govspeak assets from govspeak. In Govspeak we've already got informal definitions of expected data structure of attachment files (and there'll be a much richer one in Whitehall with their extra data). If we go with a new data structure that can meet the needs for Govspeak then we've got a bunch of migration headaches to restructure everything for that. If we go with a data structure that doesn't meet govspeak needs then we're going to end up having the same data in a content item twice. My view would be that it'd be better if we could be backwards compatible with existing structures of data and actually formalise that rather than the informal approach we have.
An area not mentioned in the RFC is image files since they're treated slightly differently and are similar data that Publishing API is lacking.
I also think we should be very cautious about adding Asset Manager communication to Publishing API. I think it could easily become complex and lead to undesirable situations - for example they can be superseded by newer versions, they can be setup with their own redirects, and their removal needs managing.
Thanks Kevin, I've removed the Asset Manager bits (admittedly those were a bit of a stretch) and written about the Govspeak attachments stuff. Do you think we should go straight to adding all of Whitehall's attachment model to Govspeak, or add it as we discover things are needed?
Another thing I'm not certain on is what exactly the relationship between the
Great thanks. Do you want to grab a chat about the asset side of things and then we can summarise it as a comment here? it strikes me as something that might be easier to work through face-to-face.
Thanks for the chat yesterday Michael. I'll summarise the bits we covered:
Could the microformats schema.org data be exposed through HTML attributes on the attachments rather than needing to be JSON+LD data? You have an expectation that Google (and co) would get confused by a mixture of JSON+LD and embedded microformats and felt we also needed the attachment data to be available to search API which doesn't work easily if it's embedded in HTML.
We discussed what the roll out might be. You indicated that the search team would be responsible for doing this I suggested we have some info of that in the RFC.
I learnt that the asset_link schema already exists. We both concluded it wouldn't be suitable for this to contain both images and attachment handling given the way the data stored for each of these diverges, however felt changing to a new schema for those could be backwards compatible.
Finally we talked about the intention that the attachment data can eventually replace the need to render the documents in a publishing app and that we can intend for this new field to eventually lead to the deprecation of documents.
Thanks Michael - this is looking great
Few comments from me are:
For compatibility with Content Publisher we'd need more fields in both image and attachment, though I think it'd be fine that we add them later when we get to actually, hopefully using it. But seems worth telling you:
I also notice that
Finally, it'd be good to set a deadline on this RFC if there's not one already?
@benthorner are the
I thought we could get that from the URL: if it's a URL on GOV.UK that ends in
It's in the metadata here: https://github.com/alphagov/govspeak#attachments but it's not in the
Yeah, we can remove those.
Would the attachment
I'll remove that too. Content type for an image isn't too useful.
How about end of Friday this week?
Yeah I think it'll be useful as it'd be really nuanced to work out which ones are which based on the URL. Html Attachments don't have a suffix of .html for instance, for assets we'd probably have to assert on the hostname (which is under some discussion), and although we expect an external attachment to be a different hostname there are likely no guarantees of that.
Ah - so Whitehall does also have that field as well: https://github.com/alphagov/whitehall/blob/6c53d01361b83fa816dd2a05892e6f93f4cf602a/app/models/file_attachment.rb#L11
yes, filename should essentially be the basename of the url (however there are circumstances with govspeak when the URL doesn't have the basename so it's independent). The filename is used to assist in working out the mime-type to display the correct metadata: https://github.com/alphagov/govuk_publishing_components/blob/ecb19eae8f421ade7b650bd1cc02493768d323eb/lib/govuk_publishing_components/presenters/attachment.rb#L23-L28
I've thought of a rather annoying issue we may well hit with this approach due to the attachment list having potentially two meanings that may collide.
In Specialist Publisher, and future apps that may want to allow attachments to be embedded into govspeak, we have this list of attachments and govspeak can reference them. This list doesn't have a particular ordering.
In Publications and similar documents example we have a list of attachments that is ordered and should be rendered specially on the page with extra metadata.
The complication seems to arise when we consider what might happen when we want to have both inline attachments that govspeak can render and these publication attachments on the same document. A key difference between these types of attachments is the amount of metadata, the publication ones have all the official document and similar information, whereas inline ones that are only used by govspeak would lack this.
So I'm just wondering what should be done to make them distinct. Is it some form of two different lists? or flags in a singular list, I'm not sure myself. It's worth tagging @benthorner on this too as he's working in this area.
Ok, I've added that.
I actually went for Friday of next week, as this week might be a bit short if we still have issues to resolve. But next week should be fine.
That's a good point, I'd not considered that. Mentally I was grouping all formats into "has an attachment block but no inline attachments" and "has inline attachments but no attachment block". But that's not necessarily the case.
I'm leaning towards the "two list" solution, as the two types of attachment are very different, so I'm not sure putting them in the same list makes sense. There are two issues with mixing them in the same list that I can see:
But it'll be good to get Ben's thoughts on this.
@barrucadu thanks for asking! I've read through the RFC again, and the comments above, so hopefully what follows makes sense.
It would be good to clarify the intended use of the
It's seriously weird that government-frontend actually thinks the image blob will have an
I'm guessing that for the other places that appear to use
With the list of attachments, I think we should separate the rendering concern from the data / metadata concern, noting that Govspeak doesn't rely on ordering to fish them out. A "two list" approach, like you say, but different in that:
Other minor comments
To confirm, these fields are currently redundant and we expect to remove them shortly. https://trello.com/c/vub2SU2P/1330-decide-what-to-do-with-redundant-html-attachment-fields
We also expect to remove the
The Whitehall rendering of attachments only seems to care about the
If we do want to pass the
The migration pathway might be a bit fiddly for Specialist Publisher, as suggested in one of the comments. Not a biggie, but might help to provide some explicit consideration:
We seem to have got lost with
I've been thinking a bit more about this and have some fresh ideas - sorry to be giving lots of staggered inputs, it's an area I haven't thought about in detail before and I keep learning/realising new things from our team chats exploring the area.
On Publishing Workflow we've been talking about attachments a lot. We're currently using terminology of featured attachments to reference attachments that are ordered and shown on publications and consultations (currently in the documents array) and in-body attachments to reference ones that are rendered through govspeak. This terminology isn't without flaws but it seems to work in the absence of something better.
So one thing in particular to think about, and this goes right back to the problem you're working on Michael, is whether the Search API is interested on the data for all the attachments (in-body and featured) or just the featured attachments that publications and consultations have?
I do wonder if we really want to have all the attachment fields available for all document types, since these fields are only used by publication and consultation types. Perhaps we want to only apply this at a schema level for publication and consultation schemas? What do you think? I'm not sure if we should consider this the base set of all attachments or just special extra fields for these types. I also wonder a bit what we'd want to do if a document type wanted extra metadata on an attachment, would that be a new type in schemas or adding to base (the former seems more appealing)
Another thing I wondered is that we probably want to use the Publishing API link system to pull in data for HTML attachments. So rather than those having fields like title and url, we may instead want content_id and have a link type that we can pull these in from the links of the document. This seems a bit more fiddly but more consistent with the Publishing API and reduces duplicated data.
On the existing side of things
I think we do need attachment_type, URL alone isn't going to be enough to differentiate between attachment types, and it'd be good to not have to do any regexing of a URL. I expect we should use a oneOf in schema to decide between them as I think some fields won't be valid for all (for example content type should only matter on file attachment).
With the id field can we do anything at a schema level to enforce uniqueness across a collection? from a quick google it doesn't seem so.
Whats the logic for moving attachments to root level? This strikes me as something that increases the migration costs substantially and I'm not sure what we gain from it? I imagine it opens the door straight away to wanting to move images also out of details otherwise those things become inconsistent.
Changing details.documents seems quite risky for a roll-out because we'd not be able to have the old field and the new field at the same time. I feel like we'd do better seeing if we could go with a new name and then deprecate documents. Can you elaborate at all by what is meant by govspeak_hashes? would this be the same data that is in attachments or would it be references to data that is referenced there and pulled in? In Content Publisher we've been talking about having an array where we reference id and types to flag the featured attachments and their ordering. Something like:
On the id, content_id and filename front. I think we'd want to use id as the means to embed an attachment in govspeak, content_id should become surplus to requirements (aside from a usage linking html_attachments) and I think we'd still want filename even if some apps (Content Publisher in particular) have the same value for id.
ChrisBAshton left a comment
I've had a re-read in light of the recent changes. Looking great, just a couple of suggestions for improving the clarity of the proposal (and checking my own understanding).
Statistical data sets, which is one place thing we'd want to use attachments from, don't have featured attachments, they just link to in-body attachments.
A way of solving both these issues would be to have multiple attachment schemas, and document schemas could use a union of whatever types makes sense. eg, a "PublicationesqueAttachment", and "HTMLAttachment".
What do you think should be in the basic attachment type and what should be in the publication attachment type?
Ok, I can bring that back.
We could still do that after validating the schema and reject documents with duplicate IDs.
This was Ben's suggestion, the motivation being that attachments are basically metadata. Here's a slack excerpt: "top-level
I'm kind of on-the-fence about this change, I'd be happy to keep
This would be an array of
There's a lot of feedback so I'll leave you to respond to this before I update the RFC.
Ah cool - so Search API will have an interest in all of them. Would it be only pulling in ones for particular document_types or will it pull in from any? - I ask based on the impact for Search API if schema rules for attachments differ.
In a basic attachment we should have:
Then a publication file should have those with:
Then we'd have to deal with attachment_type and other subtle differences between HTML and external attachments.
I find myself a bit torn on what route is best to proceed with here. It feels slightly odd to permit all the various attachment fields and types to all documents when we only know of two schemas that use them. On the other hand they're somewhat coupled because they're a product of Whitehall where they were considered the same thing.
The pros to separating them seem to be:
The cons seem to be:
Yes - I ended up having a chat with Ben about it. What seems to have been lost to history is what exactly the intended purpose of details was. My understanding is that it's meant to be used for any data that isn't consistent across all document types. There's already numerous examples of metadata being present in details for example: https://www.gov.uk/api/content/aaib-reports,
I'd suggest that unless we're really clear that this change is beneficial we should avoid it as I think it opens up a bunch of fresh questions about what else should not be in details. Whereas in details, where it is metadata that fulfils the needs of of some but not all formats, it is consistent with existing usages and avoids the needs for adding additional columns to Pub API and Content Store.
I think the risk we have with this is that we need to have updated the frontend apps before any publishing apps can send this data, and that means that frontend apps wouldn't have examples to work off while being developed or a means to rollback if there are problems.
I think we'd only pull in the data for certain document types.
I think we have some degree of the same complexity even with different types. With one type, we set the precedent of metadata fields being global by default. So a consumer might have to deal with unexpected fields. With multiple types, we solve that problem, but unless we make the new fields mandatory, a consumer might have to deal with missing expected fields.
I'm not sure what's best either.
Yeah, that's a good point. Though the frontend logic would be:
Which is fairly straightforward. I'd feel pretty safe making that change, then we can try the publishing app & schema change in govuk-docker or integration before committing to it.
This RFC looks really interesting. I had a conversation last year with the National Archives (representing other web archiving organisations) who said it would be useful to have JSON for attachments in the API responses in content items. It would allow them to collect attachments from pages.
I'll send the notes of the conversation to Michael, who can circulate it. (Apologies if this has already been mentioned in the thread above.)
The relevant bit of the notes is that they mostly use the json content items, but have to scrape the html to get "A more specific title for the specific PDF attachment we found.", "The ISSN, if any.", and "If this appears to be Command or Act paper (Cm/HC)." And that they would like "The machine readable metadata to include explicit fields for attachments, including title, ISSN, HC/Cm status".
There's some other stuff in the notes, but that's the bit about attachments. So this RFC solves at least one external problem, which is good!
Is there room to add a "lang" field? We currently have the problem of translated pages pointing to documents that are not of that language, and that's an accessibility issue. We don't currently have a way to flag this up. This could be a neat solution to that.
Yes, there's a
Thanks for the update @barrucadu - I think this is looking really good and not far off now.
A snag I can see is the required content_type field on base_attachment that wouldn't make sense with html and external attachments, and thinking about there a few more fields there that wouldn't make sense in those contexts (file_size / filename)
I wonder if we should make base_attachment have less fields so those don't have to cascade down. Also base_attachment won't actually be a real thing just a way we share data in schemas (right?) so it may not actually need to be as highlighted here in the RFC which might make life easier if we don't need to think as much in terms of a hierarchy and the minor pain points that can give. It'd also make it easier to use a oneOf (file, html, external) schema for publication attachments to cover any divergences between these types.
Other thing I thought was about the removal of order_url and price. Depending on how quickly work will begin on implementing this we may get caught where Publishing Workflow haven't finished or come to full conclusions on its removal. This may not really be a problem going forward as it can just be dealt with post the RFC depending on context/status but thought I'd flag.
benthorner left a comment
@kevindew beat me to the comment button
I agree with @kevindew's comments and see we've both got some thoughts about the
I've also added a few other suggestions: one around the naming of
Do you think
I was imagining it to be a real thing.