
Provide a unique value (version identifier) when replacing existing files/images #22606

Closed
that1matt opened this issue May 29, 2024 · 33 comments · Fixed by #23035

Comments

@that1matt
Contributor

that1matt commented May 29, 2024

Describe the Improvement

Currently, when using the Replace File option in Directus, there is no indication that the file was replaced. This causes issues with external CDN providers, where images are not updated to reflect the new content: the ID stays the same and the new file overwrites the original.

Changing the name of the file, or any other field on directus_files, does update the modified_on value.

Ideally, there would be a version number, or another dedicated value stored on directus_files, indicating that the file has been replaced via Replace File. That value could be fetched through the API to signal that the cache for the specific file needs to be invalidated.

Using the modified_on value doesn't narrow the scope to replaced files, since it also changes when other fields are edited.
Using a lower cache TTL defeats the purpose of caching the image.

@that1matt that1matt changed the title Provide a unique value when using the Replace File option on files/images. Provide a unique value (version identifier) when using the Replace File option on files/images. May 30, 2024
@that1matt
Contributor Author

that1matt commented Jun 4, 2024

Potential Solution:

  • Create a new field on directus_files named replaced_on that takes the same timestamp/datetime value as other datetime fields. This might not work, as the timestamp value would be updated on every insert/update.

  • Then, when calling the Replace File functionality, it would update this field's timestamp. Only calling Replace File through the API or UI would ever update this field.

  • Alternatively, create a version integer column on directus_files that is incremented by one when an image is replaced using Replace File. The current example uses version, but it might be a good idea to use a more descriptive name such as cache_version, file_version, file_revision, or replaced_version.

This has the benefit of recording a version number when the image was replaced, and would allow users to build their own logic for handling caching. It can be as simple as checking the field: if it is not null, append a query param to bust the previously cached version.
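As a rough illustration of that query-param idea (the assetUrl helper and the v parameter are hypothetical client-side code, not part of Directus):

```typescript
// Hypothetical client-side helper: append the version (or replaced_on) value
// as a query parameter so a CDN treats the replaced file as a new URL.
function assetUrl(baseUrl: string, fileId: string, version?: string | number | null): string {
  const url = new URL(`/assets/${fileId}`, baseUrl);
  // Only bust the cache when the file has actually been replaced (field not null).
  if (version != null) url.searchParams.set('v', String(version));
  return url.toString();
}
```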

I am happy to take on solving this, with whatever solution the team thinks would be ideal.

@paescuj
Member

paescuj commented Jun 27, 2024

I'd like to suggest another potential solution:

When replacing a file, we could update the filename_disk field, using a new random UUID.
That way we can also ensure that the file extension present in filename_disk is updated - currently this is not the case (for example, when replacing a jpeg with a png file).
This would possibly eliminate the need for a new version field and solve #22556 at the same time 🤔

If we take this path, we might want to use an arbitrary UUID for filename_disk on the initial upload as well, instead of the primary key, so that there's no coupling from the beginning.
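A minimal sketch of that idea, assuming a helper along these lines (newFilenameDisk is a hypothetical name, not actual Directus code):

```typescript
import { randomUUID } from 'node:crypto';
import { extname } from 'node:path';

// Hypothetical helper: derive a fresh filename_disk for a replaced (or newly
// uploaded) file, decoupled from the primary key and carrying the new file's
// extension, so a replacement always yields a new on-disk name.
function newFilenameDisk(filenameDownload: string): string {
  const ext = extname(filenameDownload); // includes the leading '.', or '' if none
  return randomUUID() + ext;
}
```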

@that1matt
Contributor Author

that1matt commented Jun 27, 2024

@paescuj I didn't even think about the file extension, that is a great point. I think we would also want to update the filename_download to make sure it includes the file extension as well. We have had issues where the filename_download doesn't have the file extension, and users cannot open the file. Would we want to overwrite the filename_disk/filename_download to keep them both in "sync"?

Would the primary key stay the same, just the filename_disk/filename_download change?

I did see that this change did get added, #22848 which allows a custom filename_disk.

@that1matt that1matt changed the title Provide a unique value (version identifier) when using the Replace File option on files/images. Provide a unique value (version identifier) when replacing existing files/images Jun 27, 2024
@that1matt
Contributor Author

@paescuj another option would be to check the data passed in from the new file, and replace the filename_disk extension here https://github.com/directus/directus/blob/main/api/src/services/files.ts#L89

@that1matt
Contributor Author

@paescuj I have a change within files.ts that solves changing the filename_disk extension, along with making sure that the filename_download extension is present. I added some tests to make sure that when changing an existing file, only the extension is updated for both filename_download and filename_disk.

if (!path.extname(payload.filename_download!) && payload?.type) {
  payload.filename_download = payload.filename_download + '.' + extension(payload.type);
}

// The filename_disk is the FINAL filename on disk
payload.filename_disk ||= primaryKey + (fileExtension || '');

if (path.extname(payload.filename_disk!) !== fileExtension) {
  payload.filename_disk = primaryKey + (fileExtension || '');
}

@paescuj
Member

paescuj commented Jun 28, 2024

@paescuj I didn't even think about the file extension, that is a great point. I think we would also want to update the filename_download to make sure it includes the file extension as well. We have had issues where the filename_download doesn't have the file extension, and users cannot open the file. Would we want to overwrite the filename_disk/filename_download to keep them both in "sync"?

Would the primary key stay the same, just the filename_disk/filename_download change?

I did see that this change did get added, #22848 which allows a custom filename_disk.

Thanks for your prompt feedback! ❤️

I think we would also want to update the filename_download [...]

Oh good point! Yeah, I think consequently we should also update that - as well as the type field!

Would the primary key stay the same [...]

Right, I think the primary key always stays the same, as after a replacement, it's still the same file "resource/container/..." that is now just pointing to a different "asset".


With those changes, do you think there's still a need for the version field?
Not that I am fundamentally against it, but in the end it's a new field that we would have to maintain.
As it seems, it would primarily serve cache invalidation, so I wonder how useful it really is beyond that.
For example, the revisions already indicate the "version" of a file.

If it turns out that such a field is necessary:

  • I'd consider using a replacement date instead, as in my opinion this is a bit more meaningful than a simple version counter. It seems like for the current use case we just need a value that changes, but we're not really interested in the actual "version number". At the same time, a date field would contain information that could be useful in other ways...
  • For a version field, as mentioned by you, I'd choose a more descriptive name like file_version to make it clearer that the value is only about the file itself.
  • Seems pretty irrelevant at the moment, but still (🤓😇): What happens when the version counter reaches the max integer value? [1] Also, would we want to block direct modification of that value?

Footnotes

  1. https://arstechnica.com/information-technology/2014/12/gangnam-style-overflows-int_max-forces-youtube-to-go-64-bit/

@that1matt
Contributor Author

Ideally I wanted to have a replaced_on field, a timestamp that would only be updated when the existing file was replaced. I wasn't able to figure out how to get that to work in my testing; PostgreSQL would always update the field on any change. If you have thoughts on this, I would love to replace the version with a replaced_on timestamp value, as it makes more sense than a version field.

I do like that the primary key is the same as the asset's filename_disk; it saves some logic when we are pulling files down from Directus.

My goal with the additional field was to keep things simple without introducing a potential breaking change to how users expect file assets to work (only because I am not as familiar with the depths of Directus and how such a change can affect different areas). Having just a replaced_on or version field would mean one simple additional field is returned: if it is not null, the file was replaced, and the date/integer value at which it was replaced could be used as a query string to bust the cache when pulling the asset from Directus.

@that1matt
Contributor Author

@paescuj what do you think about this implementation change on #22751 ?

@that1matt
Contributor Author

that1matt commented Jun 28, 2024

Also would we want to block direct modification of that value?

Yes, this field should only change when an existing file is being replaced. It should not be modifiable via the API or the app.

I'd consider using a replacement date instead,

For a version field, as mentioned by you, I'd choose a more descriptive name like file_version to make it clearer that the value is only about the file itself.

Your comment here about the date field solved the issue I was having when using a timestamp field; the insert/update problem no longer happens when any field is changed. So I created a new field, replaced_on, to match the convention of modified_on & uploaded_on.

@that1matt
Contributor Author

This would possibly eliminate the need for a new version field and solve #22556 at the same time 🤔

I do not have a Cloudinary account at the moment, but I will create one and see if I can test this change against that issue.

@nickrum
Member

nickrum commented Jun 28, 2024

I'm probably missing something, but wouldn't it make sense to simply update modified_on when a file is replaced? I feel like I would expect a replacement to also be a kind of modification 🤔

@that1matt
Contributor Author

@nickrum the issue right now is that modified_on gets updated when you make any change to the file, which doesn't narrow down when a file has been replaced. Relying on modified_on would defeat cache busting with an external cache provider, since every edit would invalidate the cache. The replaced_on field will only update when the file is replaced, not when, say, the filename is renamed.

@nickrum
Member

nickrum commented Jun 28, 2024

@that1matt Got it, thanks! This sounds like something that is usually solved by storing a checksum/hash of the file (I thought we already had something like this), which has the added benefit that the cache is only invalidated if the file is actually different.

@that1matt
Contributor Author

@nickrum I wasn't able to find anything that informed me when a file was replaced. Nothing that returned through the API.

@paescuj
Member

paescuj commented Jun 28, 2024

@that1matt Thanks again!

Some final remarks:

  • My initial idea in #22606 (comment) was that an additional field could be avoided by using the updated filename_disk field as a cache invalidator. But I think we both agree that one does not necessarily want to disclose that information for this use case, which is why an additional dedicated field makes more sense. Do you feel the same way?
  • If we're actually going to update the filename_disk field et al., the modified_on field will be updated as well when replacing a file. But that also seems to make sense to me.

@that1matt Got it, thanks! This sounds like something that is usually solved by storing a checksum/hash of the file (I thought we already had something like this), which has the added benefit that the cache is only invalidated if the file is actually different.

Awesome input ❤️ I think I'd prefer this to a replaced_on date field, as we could, as mentioned above, now get this information from the modified_on field if needed. Implementation wise, we would need to see how we can calculate the checksum, though.

@that1matt
Contributor Author

that1matt commented Jun 28, 2024

But I think we are both of the opinion that one does not necessarily want to disclose that information for that use case, which is why an additional dedicated field makes more sense. Do you feel the same way?

Yes, along with keeping the file id and filename_disk in sync as they currently are. Which was the case before that patch went into place to allow a custom filename_disk.

If we're actually going to update the filename_disk field et al., the modified_on field will be updated as well when replacing a file. But that also seems to make sense to me.

That makes sense to me as well. The goal of this change is to provide a definitive way of identifying files that have been replaced, so cache can be invalidated without every modification invalidating it. (This PR could also fix areas within the app where modified_on is used as a way to invalidate cache.)

Got it, thanks! This sounds like something that is usually solved by storing a checksum/hash of the file (I thought we already had something like this), which has the added benefit that the cache is only invalidated if the file is actually different.

I agree, having a hash/checksum would solve the problem, as long as this field is returned through the API. I am happy to work on changing the implementation. I also like this better than the replaced_on. My goal is to cause as little disruption as possible.

@that1matt
Contributor Author

@paescuj I believe we will still need a new field returned through the API when replacing a file, as there is currently no way to determine through the API whether a file has been replaced.

Just to make sure I am understanding what to implement, we would have a new hash field, that by default would be null. Starting off null indicates that the original file hasn't been changed. Then when replacing the file, we would calculate the hash of the image, and update the hash field with this value, along with updating the filename_disk & filename_download extension only, as well as the type field if the mime-type has changed.
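Roughly, the update described above could be sketched like this (a sketch only; the field shapes and the extFor mime-to-extension helper are illustrative, not the actual files service):

```typescript
// Hypothetical shape of the relevant directus_files fields.
interface FilePayload {
  filename_disk: string;
  filename_download: string;
  type: string;
  hash: string | null; // null until the file is replaced for the first time
}

// Sketch of the replacement flow: swap the extension on both filename fields,
// update the mime type, and record the new file's hash.
function applyReplacement(
  existing: FilePayload,
  newType: string,
  newHash: string,
  extFor: (mime: string) => string, // e.g. 'image/png' -> 'png'
): FilePayload {
  const ext = '.' + extFor(newType);
  const stripExt = (name: string) => name.replace(/\.[^.]+$/, '');
  return {
    filename_disk: stripExt(existing.filename_disk) + ext,
    filename_download: stripExt(existing.filename_download) + ext,
    type: newType,
    hash: newHash,
  };
}
```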

@paescuj
Member

paescuj commented Jun 28, 2024

Correct, while already creating a hash for the initial upload 👍

@that1matt
Contributor Author

that1matt commented Jun 28, 2024

I am having some trouble getting a hash value, hopefully I am just missing something that someone can point out.

Here is the snippet of code

import { readableStreamToString } from '@directus/utils/node';
import { createHash } from 'node:crypto';

const generateChecksum = (data: string) => createHash('md5').update(data, 'utf8').digest('hex');

const string = await readableStreamToString(stream);
const checksum = generateChecksum(string);

While this does give me a hash, I am now getting an error.

Input buffer contains unsupported image format at Sharp.metadata

Any insights would be much appreciated.

It seems calling the stream before line 123 will throw the error, calling it after will not.

@that1matt
Contributor Author

@paescuj @nickrum made the first round of changes, let me know what you think.

@rijkvanzanten
Member

I'm out one day for a wedding and we've changed the plot! 😄 Good shoutout to use a checksum instead @nickrum. Feels more appropriate than a version number indeed.

const string = await readableStreamToString(stream);
const checksum = generateChecksum(string);

@that1matt This feels dangerous to me. That readable stream could be gigabytes worth of file, which is then all read into a single string. This is an easy way to crash the server with an out of memory issue. I believe you can pipe a readable stream to the hasher, so hopefully we can change that to something like (pseudo code):

import { pipeline } from 'node:stream/promises';
import { createHash } from 'node:crypto';

const hash = await pipeline(stream, createHash('md5'));

@that1matt
Contributor Author

that1matt commented Jul 1, 2024

@rijkvanzanten thanks for the feedback. I appreciate all the insights into Directus that I miss.

I have a working implementation with the changes you suggested. I am still figuring out a typescript issue.

Property 'toArray' does not exist on type AsyncIterable

/**
* Extract file hash from stream content
*/
async getHash(stream: Readable): Promise<string> {
  let hash = '';

  await pipeline(stream, createHash('md5').setEncoding('hex'), async function (source: AsyncIterable<Hash>) {
    for await (const chunk of source) {
      hash += chunk;
    }
  });

  return hash;
}

I haven't played with streams much, so hopefully this is on the right track.
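For reference, a simpler event-based variant avoids the pipeline/AsyncIterable typing issue entirely while still hashing in constant memory (a sketch under the same md5 assumption, not necessarily how the PR ends up):

```typescript
import { createHash } from 'node:crypto';
import { Readable } from 'node:stream';

// Event-based streaming hash: the file is consumed chunk by chunk, so memory
// use stays constant no matter how large the upload is.
function getHash(stream: Readable): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash('md5');
    stream.on('error', reject);
    stream.on('data', (chunk) => hash.update(chunk));
    stream.on('end', () => resolve(hash.digest('hex')));
  });
}
```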

@hanneskuettner
Member

hanneskuettner commented Jul 2, 2024

Just a consideration here: since we will support uploading multi-gig files (in chunks) in the next version, hashing does seem like a performance pitfall if the file has to be downloaded from the provider just to do local hashing in the Directus instance.

@nickrum
Member

nickrum commented Jul 2, 2024

@hanneskuettner Provided the chunked upload still goes through the Directus instance, couldn't we just calculate the hash on the fly and update it as new chunks become available?
Also, we might want to use a very fast hashing algorithm that is intended for this kind of operation like xxhash.
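The on-the-fly idea can be sketched with Node's built-in hasher (md5 standing in here for xxhash, which isn't in the standard library): updating the hasher per chunk yields the same digest as hashing the whole payload at once.

```typescript
import { createHash } from 'node:crypto';

// Incremental hashing: feed each chunk as it becomes available instead of
// buffering the full file. (md5 for illustration; a fast non-cryptographic
// algorithm like xxhash would follow the same update/digest pattern.)
function hashChunks(chunks: string[]): string {
  const hasher = createHash('md5');
  for (const chunk of chunks) hasher.update(chunk); // update as each chunk arrives
  return hasher.digest('hex');
}
```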

@paescuj
Member

paescuj commented Jul 2, 2024

@hanneskuettner Provided the chunked upload still goes through the Directus instance, couldn't we just calculate the hash on the fly and update it as new chunks become available? Also, we might want to use a very fast hashing algorithm that is intended for this kind of operation like xxhash.

Problem is that we cannot continue hashing once the upload has paused, as we cannot access and store the hash state... See e.g. nodejs/node#25411 (comment)
Probably the same with xxhash?

@nickrum
Member

nickrum commented Jul 2, 2024

While we can't manually save and restore the hasher instance, couldn't we just keep it around in memory? I haven't checked, but I would assume that the hasher instance isn't much bigger than the resulting hash.

@hanneskuettner
Member

While we can't manually save and restore the hasher instance, couldn't we just keep it around in memory? I haven't checked, but I would assume that the hasher instance isn't much bigger than the resulting hash.

Not possible, since the upload can happen across multiple horizontally scaled instances and does not have to be pinned to one instance only

@that1matt
Contributor Author

@paescuj I actually like a lot of your changes in #22900 for the file hashing; they solved the issue I was having with trying to clone the stream. Given your PR, should I stop working on #22751 in favor of yours, or could we combine them?

@nickrum
Member

nickrum commented Jul 2, 2024

Not possible, since the upload can happen across multiple horizontally scaled instances and does not have to be pinned to one instance only

It's never easy when multiple horizontally scaled instances come into play...
Looks like this xxhash implementation allows you to access and modify the internal state, which means we could store the state somewhere where it can be accessed by all instances.

@paescuj
Member

paescuj commented Jul 15, 2024

Thanks @that1matt for your active participation in this matter!

We've now decided within the team to continue with the date approach. Hashing currently raises too many questions, especially when it comes to the new "resumable uploads" feature.

The reason for the separate exploratory pull request from my side (#22900) was that if we were to add hashing, it would most likely have to happen directly during the upload (as opposed to loading the file again from storage after upload).
The solution in #22900 seems to work so far; however, as mentioned above, we chose the date approach, which is much simpler but still meets the original requirements.

Nevertheless, I'd like to keep your pull request (#22751) open for a bit, as it contains other useful work that we can look at at a later point. We'll open an additional pull request shortly, which will exclusively cover the date field change.

@rijkvanzanten
Member

The hashing approach felt like the theoretical "correct" way to do it, but while testing it out in those PRs it does raise too many performance concerns and stability questions. The main goal of this implementation is to have the flag to indicate whether the file was changed, to solve that we don't need full checksumming. Juice ain't worth the squeeze 🙂

@that1matt
Contributor Author

@paescuj let me know if you want me to adjust the PR I have open, I can change it back to use a replaced_on date, with the other updates to make sure that the file extension gets updated as well.

@that1matt
Contributor Author

I have updated the linked PR to use a replaced_on field; please feel free to modify it as needed. I also kept the logic for updating file extensions and mime types when replacing the image.
