
[Feature] Better artifact preview + use well-known file suffixes for uploaded artifacts for a better preview #418

Open
idantene opened this issue Aug 10, 2021 · 16 comments


@idantene
Contributor

idantene commented Aug 10, 2021

It would be great if, in task.upload_artifact, the ClearML agent could look at the file suffix and identify whether it's one of the supported types (e.g. suffixes npz, png, jpg, json, csv, csv.gz, etc.). If it is, it could automatically generate a preview.

For example, for image files it could load the image with Pillow's Image.open(...), for csv[.gz] it could load it with pandas (limiting the number of rows for preview and efficiency), etc. Of course, if it fails, it could still show the same "preview" it has now.

EDIT: Similarly, it would be quite a nice feature if images in the artifacts section could also be presented in the preview as-is. Otherwise, one currently has to go to the Debug Samples for that, even if those images are not necessarily data (nor are they for debugging purposes).
In other words, why is the preview limited only to text? It seems to me that this could be expanded to:

  1. Images -> show an image preview (perhaps a downscaled version? + some metadata on dimensions, size, etc.)
  2. Text files -> show a text preview (perhaps only the first N characters + some metadata on file size, number of lines/characters, etc.)
  3. JSON/CSV files -> show a table preview (perhaps only the first N rows/keys? + some metadata on file size, number of rows/keys, etc.)
  4. Zipped/container files -> show a list of contained files + some metadata
  etc.
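
For illustration, a rough sketch of the kind of suffix-based dispatch meant here (hypothetical helper, not existing ClearML behaviour; the function name, the suffix list, and the pandas/Pillow calls are all assumptions):

from pathlib import Path

# Hypothetical suffix-to-text-preview dispatch, not part of the ClearML SDK
def make_text_preview(file_path, max_rows=10):
    suffix = "".join(Path(file_path).suffixes).lower()  # e.g. ".csv.gz"
    if suffix.endswith((".png", ".jpg", ".jpeg")):
        from PIL import Image  # assumes Pillow is installed
        with Image.open(file_path) as im:
            return f"{im.format} image, {im.size[0]}x{im.size[1]} px, mode={im.mode}"
    if suffix.endswith((".csv", ".csv.gz")):
        import pandas as pd  # assumes pandas is installed; .gz is decompressed automatically
        return pd.read_csv(file_path, nrows=max_rows).to_string()
    if suffix.endswith(".json"):
        return Path(file_path).read_text()[:2048]  # first N characters only
    return None  # unknown suffix: fall back to the current default preview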
@idantene idantene changed the title [Feature] Use well-known file suffixes for uploaded artifacts for a better preview [Feature] Better artifact preview + use well-known file suffixes for uploaded artifacts for a better preview Aug 10, 2021
@jkhenning
Member

Hi @idantene ,

That's a nice idea, we'll add that to the list 🙂

@bmartinn
Member

It would be great if, in task.upload_artifact, the ClearML agent could look at the file suffix and identify whether it's one of the supported types (e.g. suffixes npz, png, jpg, json, csv, csv.gz, etc.). If it is, it could automatically generate a preview.

Actually it does, but it looks at the type of the object it serializes (e.g. numpy, Image, etc.) and then decides on the correct preview. We could probably extend it to do the same when uploading a single file (i.e. suffix based).
The main limitation is that we can only support text as a preview, which would eliminate the option for more complex things (BTW, if this is an actual image, pressing on the link will preview it, no?).

In other words, why is the preview limited only to text? It seems to me that this can be expanded to:

Actually this is exactly what you will get if you upload the objects rather than the serialized files (e.g. pass the numpy object instead of the .npz): upload_artifact will pick the best format for the object and will also create a preview based on the content. WDYT?
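
For example (a minimal sketch; the array and the "data.npz" file name are placeholders, while both upload_artifact calls are the existing API):

import numpy as np
from clearml import Task

task = Task.current_task()  # assumes a task was already initialized with Task.init

data = np.random.rand(100, 4)
# Passing the object: ClearML serializes it (to .npz) and generates a content-based preview
task.upload_artifact(name="as_object", artifact_object=data)
# Passing a file path: the file is uploaded as-is, with only textual metadata as the preview
task.upload_artifact(name="as_file", artifact_object="data.npz")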

@idantene
Contributor Author

Actually it does, but it looks at the type of the object it serializes (e.g. numpy, Image, etc.) and then decides on the correct preview. We could probably extend it to do the same when uploading a single file (i.e. suffix based).

I see that it does this when you pass the object itself, but it feels somewhat redundant having to open the file and pass the actual object when the framework could identify the file type in the same way it detects the object type. Ultimately, I don't think there should be a difference in preview from the user's perspective when passing an object or a path to a file, when those are effectively identical - wouldn't you agree?

The main limitation is that we can only support text as preview, which would eliminate the option for more complex things (btw if this is an actual image, pressing on the link will preview it, no?!)

Unfortunately not. There's no clickable action (maybe because the file path is local..?), just a description.

Actually this is exactly what you will get if you upload the objects rather than the serialized files

I admit I have not tested this fully in-depth, but:

  1. Why would I load a zip file just to pass it along to ClearML? How should it be passed in such a case? As zipfile.ZipFile? Passing along the path to the zip file just creates a "preview" of the filename and file size.
  2. At least for images - I still only get a text preview:
    [screenshot: the artifact still shows only a textual preview]
    (Following the example guide that recommends loading the image using Pillow and then passing that object to ClearML.)

@bmartinn
Member

it feels somewhat redundant having to open the file and pass the actual object ...

You are correct. I was actually pointing to the fact that, in your code, when you are actually creating the object itself, from anywhere inside the repo, you can always upload it (note that the upload itself happens in a background process):

from clearml import Task
Task.current_task().upload_artifact(name, artifact_object)

I don't think there should be a difference in preview from the user perspective when passing an object or a path to a file

I'm not sure I agree. When I pass an object, it means the function is responsible for serializing it; when I'm providing a file, I just want it uploaded (i.e. the content of the file is opaque to the function).

There's no clickable action (maybe because the file path is local..?)

Yes, that's exactly why in most cases the artifact is uploaded to a server (e.g. the clearml files server, an S3 bucket, etc.); this makes the URL link clickable.

Why would I load a zip file just to pass it along to ClearML?

Sorry, I meant passing a folder/list-of-files; then the upload_artifact function zips them for you (and updates the preview).
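
For example (a sketch; "./results" is a placeholder folder path):

from clearml import Task

task = Task.current_task()  # assumes a task was already initialized
# Passing a folder path: upload_artifact zips the folder contents, uploads the
# archive, and updates the preview accordingly
task.upload_artifact(name="results_folder", artifact_object="./results")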

At least for images - I still only get a text preview

Correct, the preview is limited to text; anything else is actually the artifact itself, stored on the remote server. This connects with my point about remote servers: it seems that in your case you are storing the artifacts on a shared folder, which means the data is not available to the browser (including serving the images).
Wouldn't it make sense to upload the artifacts to the files server? Then the images are viewable, and the data is always accessible, no?

A more general note: the distinction I'm trying to make is that parsing the content of a file seems too invasive (and might slow things down). We were actually trying to provide two levels of access: files as a black-box upload, and objects as a more opinionated interface with visibility into the content.
WDYT?

@idantene
Contributor Author

in your code, when you are actually creating the object itself, from anywhere inside the repo, you can always upload it

I can see where that assumption is coming from, but we don't always create the object itself, and in my specific case we don't have an intermediate object that's compatible with ClearML. Specifically, we use pygraphviz, so the end result is strictly a PNG file saved to the file system.

When I pass an object, it means the function is responsible for serializing, when I'm providing a file, I just want it uploaded

If the ClearML SDK can serialize an object, surely it can be extended to also attempt deserialization (for preview-specific purposes)? I meant that from the user's point of view, the preview should not be affected by whether you pass an object or a path to a file. Obviously the serialization itself (and who is responsible for it) is a different question, and I very much agree with you on that.

I understand the points for keeping things as they are -- and perhaps the decision to deserialize (for preview purposes) should be left to the user. In our use case, we have our own on-premise hardware and we'd like to offer minimal setup to new team members. Setting up e.g. MinIO to mimic S3 just means more redundant environment variables (secret keys, bucket names, etc.), whereas using our NVMe drives is much simpler.
As it is right now, we're forced to upload a task-descriptive image to the "Debug Samples" section (with an iteration as well), where it really does not fit.

I think a bigger picture question is why limit the preview to text only? What if I want to control the serialization but still have a proper preview (e.g. have low-level control of the compression level for zipped files, the DPI used for pngs, etc)? Why does ClearML limit me in those cases?

Would it be instead possible to offer e.g. task.upload_object_with_preview or similar, indicating "this is a serialized object, please attempt to deserialize it for preview purposes, I'm okay with the invasiveness/time costs it may incur"?

@bmartinn
Member

... we have our own on-premise hardware and we'd like to offer minimal setup to new team members. Setting up e.g. MinIO to mimic S3 ...

But later you mention the debug samples pointing to the files-server, so why would you need to set up MinIO as well? I'm a bit confused about the setup, see the next point ...

... As it is right now, we're forced to upload a task-descriptive image to the "Debug Samples" ...

Following on the previous point, if you have the files-server setup, why not use it to store all the artifacts ? Is it because the artifacts are often accessed from different Tasks, and the download speed is the issue ?

I think a bigger picture question is why limit the preview to text only?

Let me try to explain: if we need to store an image, for example, this means someone needs to serve it, which by default is the files server... I'm sure you see my point :)
What I'm getting from your setup (and please, if you can elaborate on your current setup and maybe on the decision making behind the design, that would be great) is that we might want to have the option of mixing multiple target storages for the artifacts?! Is that it?

What if I want to control the serialization but still have a proper preview (e.g. have low-level control of the compression level for zipped files, the DPI used for pngs, etc)? Why does ClearML limit me in those cases?

Custom preview is fully supported, as long as this is text. You can provide it when uploading the artifact:

task.upload_artifact(name, artifact_object, metadata={'key': 'value'}, preview='lots of stuff\n on the content here')
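
For example, for an otherwise opaque archive (a sketch; the "archive.zip" file name is a placeholder, and zipfile is just one way to build the preview text):

import zipfile

# Build a text preview listing the archive contents, then attach it to the upload
with zipfile.ZipFile("archive.zip") as zf:
    listing = "\n".join(zf.namelist()[:50])

task.upload_artifact(name="archive", artifact_object="archive.zip", preview=listing)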

Would it be instead possible to offer e.g. task.upload_object_with_preview or similar,

Are you suggesting we add an additional flag to upload_artifact something like guess_file_preview, and if True, offer some insight into the content of the file ?
If this is it, can you post code that would produce such a preview?

@idantene
Contributor Author

Following on the previous point, if you have the files-server setup, why not use it to store all the artifacts ? Is it because the artifacts are often accessed from different Tasks, and the download speed is the issue ?

Maybe I'm missing something from my setup. For the Debug Samples I'm simply using the get_logger().report_image(...), and in any case, it's not a good solution for us (seeing as it's not a debug sample, but a task-specific artifact). I'm not providing any specific output_uri to Task.init. I have edited the clearml.conf so that it will use another directory instead of the default /tmp for caching, but that's it.

Our setup (or well, the setup I'm trying to build for my team) is to use the local file storage and minimal setup steps on their end. We like to use the terminal for simple tasks and it's extremely convenient for us to have e.g. ls /data/clearml/<experiment>/....

I still don't understand why the preview is limited to text only?

Custom preview is fully supported, as long as this is text

I've noticed that as well (and the documentation could use some cleaning in this respect, as it mentions Any but does not specify what it actually does -- call str(obj) for preview). I can use that for the zip file example (which would be yet another workaround..?), but not for the image preview obviously.

Are you suggesting we add an additional flag to upload_artifact something like guess_file_preview, and if True, offer some insight into the content of the file ?

Personally, I'm not a fan of boolean arguments unless absolutely necessary, so I suggest a different method for that with a similar flow:

  1. The user explicitly specifies that they want ClearML to attempt to parse and deserialize the given file path (the user accepts the "terms and conditions" of potential overhead in terms of resource usage and runtime).
  2. ClearML attempts to deserialize based on the file path (given some supported list of extensions).
  3. On failure/unsupported extension, the regular upload_artifact is used.
  4. On success, a more elaborate preview is available; ClearML does not need to serialize the file again (unless required by the backend, in which case it can be cached).

I don't mind providing a code example, but the question still remains about constraining the preview to a textual one. It would be incredibly helpful to add support for non-textual previews there.
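
That said, a rough sketch of the flow above (hypothetical wrapper, not an existing ClearML API; only the CSV case is shown, other suffixes would follow the same pattern; upload_artifact and its preview parameter are the existing API):

# Hypothetical opt-in wrapper around the existing upload_artifact
def upload_file_with_preview(task, name, file_path, max_rows=10):
    preview = None  # step 1: the user called this explicitly, accepting the extra cost
    try:
        if str(file_path).lower().endswith((".csv", ".csv.gz")):
            import pandas as pd  # assumes pandas is installed
            preview = pd.read_csv(file_path, nrows=max_rows).to_string()  # step 2
    except Exception:
        preview = None  # step 3: on failure, fall back to the regular behaviour
    # step 4 (or the fallback): a single upload either way, with or without the richer preview
    return task.upload_artifact(name=name, artifact_object=file_path, preview=preview)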

@bmartinn
Member

so that it will use another directory instead of the default /tmp for caching,

Are you referring to default_output_uri? This is not caching; it is the upload destination of all artifacts and models. BTW, if you set default_output_uri: True, it will automatically upload all artifacts to the files-server (nothing to set up), and you will be able to access them (images included) from the UI or programmatically (example).
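
For reference, a minimal example of doing the same from code (the project/task names are placeholders):

from clearml import Task

# output_uri=True uploads artifacts and models to the ClearML files server so the
# web UI can serve them; a specific URI (e.g. an S3 bucket) also works
task = Task.init(project_name="examples", task_name="files-server output", output_uri=True)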

... We like to use the terminal for simple tasks

I'm intrigued! Could you elaborate? What is the use case, and why is the UI not a good interface (or the Python API)?

the question still remains about constraining the preview to a textual one

Let's assume an image is stored as a preview - where is it stored?
If it is stored on any local file system (/tmp in your example), the browser cannot display it, as browsers are not allowed to access the local file system (just imagine any random site reading your files). This means the image file needs to be stored somewhere the browser can access. One option is the clearml files-server, where the image is uploaded and linked as an HTTP link for the browser to fetch; the other is an object-storage solution (like S3), which the browser can likewise fetch from.
Now back to my point: if we store the image file preview on the clearml files-server, we are actually storing the artifact itself on it - does that make sense?
(BTW: the only solution I can see is having "two" artifacts, one on the local filesystem and one on the files-server?!)

@idantene
Contributor Author

Are you referring to default_output_uri? This is not caching; it is the upload destination of all artifacts and models. BTW, if you set default_output_uri: True, it will automatically upload all artifacts to the files-server (nothing to set up), and you will be able to access them (images included) from the UI or programmatically (example).

Yes, I was referring to that. I may have misconfigured it in that case, but I do prefer to use the local storage over MinIO for the reasons mentioned earlier.

I'm intrigued! Could you elaborate? What is the use case, and why is the UI not a good interface (or the Python API)?

Sure. Most of us are quite tech-savvy and we're more than happy to use the terminal over any UI. It's faster and more efficient for us, especially when it comes to moving files around. It's not impossible, of course, to use the UI or SDK -- it's just slower for us.
The UI, on the other hand, is extremely useful for our team leads when showing results to other parties.
It will probably be more useful for us (the devs) once the other feature requests I made are addressed 😅 We might move to MinIO in the future as it seems some of ClearML's functionality depends on it, but for now I'd like to introduce as few changes to our flow as possible.

Let's assume an image is stored as a preview - where is it stored?
If it is stored on any local file system (/tmp in your example), the browser cannot display it, as browsers are not allowed to access the local file system (just imagine any random site reading your files). This means the image file needs to be stored somewhere the browser can access. One option is the clearml files-server, where the image is uploaded and linked as an HTTP link for the browser to fetch; the other is an object-storage solution (like S3), which the browser can likewise fetch from.
Now back to my point: if we store the image file preview on the clearml files-server, we are actually storing the artifact itself on it - does that make sense?
(BTW: the only solution I can see is having "two" artifacts, one on the local filesystem and one on the files-server?!)

I understand how storing and serving an image works - this still doesn't answer my question.
Your original suggestion was to store the artifacts on S3 so that I could click on a link and open it in the web browser. My question is still the same - why do I need to click anything? Why not add a built-in image preview, over the textual one which offers very little information, if any?

It goes without saying that for our use case this would indeed mean storing the image twice (once on the local storage, as we'd like, and once more with the files-server for preview purposes). For this edge case, this can of course be made very clear to the user (even by e.g. assuming that by default such duplication is not allowed, but that it can be tweaked via the configuration file).

@bmartinn
Member

Why not add a built-in image preview, over the textual one which offers very little information, if any?

Currently the artifact object itself only contains a text preview, because it is stored with all the rest of the artifact's properties inside the back-end DB. If there are enough use cases, we could add a specific field for an additional preview link (for example, pointing to an external image).

To summarize the feature request: let's add an additional flag to the upload_artifact function, specifically requesting that the preview field be filled based on the file's content.
Supported file extensions: .zip, .tgz/.tar.gz, .csv/.csv.gz, .npz/.npy, .json/.yaml.
Anything else I'm missing here?

@idantene
Contributor Author

Anything else I'm missing here?

Yes, the image file extensions. At least .png, .jpg, .jpeg should also be supported.

If there are enough use cases, we could add a specific field for additional preview link

This should perhaps then be a different feature request (would you like me to open one?). In any case, the idea would be to avoid the link altogether, and instead replace the text-only preview with a smarter text/image/media preview, depending on the artifact being displayed.

@bmartinn
Member

bmartinn commented Aug 24, 2021

Yes, the image file extensions. At least .png, .jpg, .jpeg should also be supported.

Looks like this is really something that bothers you 😄
Point being, if the artifacts are not stored on a local file system, there is no need for an image preview, as the artifact link in the browser will display the actual image (again, note that here "preview" means text only).

Which brings me to your second point: the main reason this issue seems to resonate with you is that the artifacts are stored on a local file system. Again, if the artifact is stored on the clearml files-server, which is an HTTP file server, there is no need for a preview - the artifact link is the preview, and the UI just needs to present it (as opposed to putting up a link and letting the user click on it to actually see the artifact, e.g. the image, in the browser).
Back to the feature request: I "think" you are requesting another feature, which is an artifact upload + a preview link stored on the file server (the creation of the preview link is actually not the challenge; it's the triage support (SDK/UI/Server) that makes it more complex).
Is that correct?

@idantene
Contributor Author

idantene commented Aug 24, 2021

Looks like this is really something that bothers you 😄

Of course, it's the basis for my request 😅
The tl;dr for the feature request (on my end) is that I'd like to be able to call:
task.upload_artifact("my_graph", artifact_object="my_graph.png")
and get something like this under Artifacts/my_graph:
[screenshot: an inline image preview of the uploaded graph]

Since this happens "naturally" for the Debug Samples (regardless of whether this is stored on the local storage or S3 or wherever), I just don't see why this can't happen for the Artifacts.

@ainoam
Collaborator

ainoam commented Aug 24, 2021

@idantene

Since this happens "naturally" for the Debug Samples (regardless of whether this is stored on the local storage or S3 or wherever),

Actually, this only happens "naturally" for Debug Samples when they are in a location able to serve them (e.g. the ClearML fileserver).

Would it, then, be fair to summarize your suggestion as follows: when an artifact is an image (obviously, this can later be extended to other types the UI can render) AND its location is accessible to the UI (e.g. an HTTP/S URI), the UI should simply show the resource instead of listing its metadata (i.e. location, size, hash, and textual preview)?

I'd think we might want to make that a configurable choice since for some use cases such default behaviour could be problematic...

@idantene
Contributor Author

Would it, then, be fair to summarize your suggestion as follows: when an artifact is an image (obviously, this can later be extended to other types the UI can render) AND its location is accessible to the UI (e.g. an HTTP/S URI), the UI should simply show the resource instead of listing its metadata (i.e. location, size, hash, and textual preview)?

Sure 👍🏻 The location of artifacts is not something I brought up and I don't consider it part of the request 🤷🏻‍♂️ (IMO it is transparent wrt this feature request).

@ainoam
Collaborator

ainoam commented Aug 24, 2021

@idantene The artifacts' location is only relevant to the extent that, as @bmartinn pointed out earlier, the webapp running in a browser cannot automatically access locally stored files due to security concerns.

This means that whichever way we take this request forward, it will only have value when artifacts are stored in a file-serving network location (a locally running file server included).
