
How to combine ClearML with a feature store like Feast #717

Open
Make42 opened this issue Jul 12, 2022 · 9 comments

Comments

@Make42
Contributor

Make42 commented Jul 12, 2022

We are currently product-hunting for our MLOps infrastructure, and ClearML, Kedro, and MLRun are on our short list. We are considering combining ClearML with Feast, since ClearML has no feature store. By feature store, I mean the following:

We get data from our customers, which we put into a "big pile of data", the raw data store. Then we have a set of processes (workflows), such that each process takes some of the data (for which it is responsible), cleans it, re-formats it, and calculates features from it. The data from all those processes is put into a harmonized table in the feature store (or a set of tables if necessary) and is available for both exploration and modelling/training. When new training data comes in, the feature store is updated.

When new data comes in for which we want to infer something using our models, it must be prepared similarly to the training data and then be brought to the model (all of which would somehow happen in model serving, I guess). However, we also want that (even if unlabeled) data in the feature store - available for exploration. A typical supervised training/evaluation process would only grab all the labeled data from the feature store.

The feature store might also be the right place for data validation tools to detect issues like drift. Probably EvidentlyAI and great-expectations would be candidates to address those issues.
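To make the intended flow concrete, here is a minimal stdlib sketch of what I mean - all names and the feature logic are hypothetical, just to illustrate the shape of the setup (raw records, per-source processing, one harmonized table that serves both exploration and supervised training):

```python
# Hypothetical sketch of the flow described above: raw records ->
# per-source processing -> one harmonized feature table.

def process_source_a(record):
    """Clean/re-format one raw record and compute its features (made-up logic)."""
    return {
        "entity_id": record["id"],
        "feature_x": float(record["value"]) * 2.0,  # example feature
        "label": record.get("label"),               # None for unlabeled inference data
    }

class FeatureStore:
    """Harmonized feature table, queryable for training or exploration."""
    def __init__(self):
        self.rows = []

    def ingest(self, raw_records, processor):
        """One 'process' pushes its cleaned, featurized records into the table."""
        self.rows.extend(processor(r) for r in raw_records)

    def labeled(self):
        """What a supervised training job would pull: labeled rows only."""
        return [r for r in self.rows if r["label"] is not None]

store = FeatureStore()
store.ingest([{"id": 1, "value": "3", "label": 1},
              {"id": 2, "value": "4"}], process_source_a)
print(len(store.rows), len(store.labeled()))  # -> 2 1 (unlabeled row kept for exploration)
```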

  1. How could we implement those ideas with ClearML? How should we approach this setup?
  2. How can we integrate Feast with ClearML to serve as the features store?

One (pragmatic) aspect which is particularly unclear to me: To implement the above, we would need to put data into the feature store and get it out again.

  1. Where do we install the feature store? On the same server/VM/machine as the ClearML server? If we self-host, then I don't see why not, but if we get managed ClearML, I am not sure what Allegro AI provides.
  2. It might be possible to install the feature store anywhere, e.g., on a different machine than the ClearML server and/or the ClearML agents, and then do the plumbing via database calls inside the ClearML tasks. But wouldn't that cost a lot of transport time (as in bandwidth)? Also, while ClearML does caching of tasks' outputs, I do not see how we could leverage this for the described feature store setup.
@Make42 Make42 changed the title How-to combine ClearML with a feature store like Feast How to combine ClearML with a feature store like Feast Jul 12, 2022
@thepycoder
Contributor

Hi @Make42 ,

This is quite the question :)
First of all thank you for being clear and writing down context for your use-case.

  1. How could we implement those ideas with ClearML? How should we approach this setup?

This is how I personally would implement this workflow with all-clearml tooling:

The raw data store would be a dataset versioned with clearml-data. That allows you to keep track of different versions, store a lineage, and have a quick overview of the contents of the dataset itself.

Then each of your processes would be either a clearml task (in the experiment manager) or a clearml-pipeline if it consists of multiple steps. Having these processes as experiments in the experiment manager has multiple benefits, such as being able to easily execute them remotely, keep a version history of them, and track outputs like plots, sample data, and artefacts (such as saved transforms, for example).

This pipeline / task process could be triggered by a clearml trigger whenever a new raw dataset is detected and automatically create a new version of the processed dataset: what you could call the feature store. This is just another dataset versioned with clearml-data. You can share this dataset, and people can pull it in locally to inspect/explore it or use it to train their models. If you're looking for an extensive on-line server-side exploration tool, that is something clearml does not have.
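The trigger pattern itself can be sketched generically like this - ClearML's actual trigger machinery does this against the server backend for you, so all names below are illustrative stand-ins, not ClearML API:

```python
# Generic sketch of the trigger pattern described above (illustrative only):
# poll for dataset versions, launch the processing pipeline once per new one.

def run_trigger(list_dataset_versions, launch_processing, seen):
    """One polling round: fire the pipeline for every not-yet-seen version."""
    launched = []
    for version in list_dataset_versions():
        if version not in seen:
            seen.add(version)
            launched.append(launch_processing(version))
    return launched

# Two polling rounds with stand-in callables: the second round finds
# nothing new, so nothing is launched again.
versions = ["raw-v1", "raw-v2"]
seen = set()
first_round = run_trigger(lambda: versions, lambda v: v + "-features", seen)
second_round = run_trigger(lambda: versions, lambda v: v + "-features", seen)
print(first_round, second_round)  # -> ['raw-v1-features', 'raw-v2-features'] []
```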

  1. How can we integrate Feast with ClearML to serve as the features store?

ClearML is modular, so you don't have to use clearml-data as stated above if you don't want to. The connection to your feature store is through code in any case, so you can simply set up Feast somewhere that the training/exploration code can reach, and you're good to go (I think).

  1. Where do we install the feature store? On the same server/VM/machine as the ClearML server? If we self-host, then I don't see why not, but if we get managed ClearML, I am not sure what Allegro AI provides.

The feature store (clearml-data/Feast/any other) will have to be accessible from the code (of the preprocessing/exploration/training) itself. So it depends on where the code itself runs, not on which machine the clearml-server is running. The server doesn't execute the code; that is done either by the data scientist's machine or by a remote machine (possibly running a clearml agent). In either case, it's these machines that should be able to contact the feature store, so the clearml hosted version is not an issue.

  1. It might be possible to install the feature store anywhere, e.g., on a different machine than the ClearML server and/or the ClearML agents, and then do the plumbing via database calls inside the ClearML tasks. But wouldn't that cost a lot of transport time (as in bandwidth)? Also, while ClearML does caching of tasks' outputs, I do not see how we could leverage this for the described feature store setup.

How could it be different using other tools? In the end, it would always be the code itself that gets the data, no? Or am I misunderstanding?

@Make42
Contributor Author

Make42 commented Jul 12, 2022

@thepycoder: Thanks for the helpful answer.

One important piece of information cleared up my misunderstanding of where code is run. Probably, I should think of the ClearML server as a centralized management platform - similar to GitHub - whereas I made the mistake of thinking of it as an execution platform - like GitHub Actions (or so). This clarifies the aspect of access to the feature store, as well as the question of transport.

Regarding the raw data store: Yes, I was thinking of using clearml-data; it seems to be a fitting implementation of a data catalog. I think it is not as useful as a feature store (in the sense I described it), but that is what my question was for in the first place. This opens up a follow-up question though:

  1. Do I understand correctly that the ClearML server does not save the raw data, but only metadata about it - thus it only tracks/manages the data, if you will, but the data itself is stored in a remote file system or object store (e.g., S3) on one of my machines?
  2. Can this remote machine be my own laptop plus the laptop of my colleague - and ClearML brings it all together - like decentralized storage? (Of course, both our machines need to be on for this to work.)

This pipeline / task process could be triggered by a clearml trigger whenever a new raw dataset is detected and automatically create a new version of the processed dataset: what you could call the feature store. This is just another dataset versioned with clearml-data. You can share this dataset and people can pull them in locally and inspect/explore them. or use it to train their models.

  1. I understand this trigger works automatically, right? Someone uploads new data and the cleaning pipeline gets activated - without me, the data scientist, having to do anything. Then, when I later look at the "feature store" (as you describe it), I see the clean data already prepared for me. Is that right?
  2. For this to work: Does the ClearML server monitor the object storage and then call the appropriate ClearML agent? This does not work if there is no agent, and usually I only use my laptop for processing data, right?

If you're looking for an extensive on-line server-side exploration tool, that is something clearml does not have.

  1. To provide this to the data scientists, I would need to approach this topic similar to integrating Feast with ClearML, right?
  2. I think you indirectly answered my previous fourth question: I was thinking along the principles of "data locality", meaning "bring your code to the data, not the data to your code". This should be possible if I set up an agent on the same machine on which the feature store is running. Would that be correct?

@thepycoder
Contributor

One important piece of information cleared up my misunderstanding of where code is run. Probably, I should think of the ClearML server as a centralized management platform - similar to GitHub - whereas I made the mistake of thinking of it as an execution platform - like GitHub Actions (or so). This clarifies the aspect of access to the feature store, as well as the question of transport.

Correct! The clearml server is the management plane and the clearml agents form the execution plane; glad I was able to clear it up.

  1. Do I understand correctly that the ClearML server does not save the raw data, but only metadata about it - thus it only tracks/manages the data, if you will, but the data itself is stored in a remote file system or object store (e.g., S3) on one of my machines?

You can choose, actually! The clearml server (both the hosted version and the open source) has a fileserver to save raw files if you want to, but it can also be configured to use a 3rd party storage backend. Check this documentation page for more info on what 3rd party storage is supported: https://clear.ml/docs/latest/docs/integrations/storage and you will find an architectural overview of what the server is here: https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server

Can this remote machine be my own laptop plus the laptop of my colleague - and ClearML brings it all together - like decentralized storage? (Of course, both our machines need to be on for this to work.)

Not really; a clearml-agent is essentially just a single native Linux process. It has its benefits in being simple to set up and maintain, but the downside is that there is no "clustering", as I think you mean with this. However, you CAN set up kubernetes and then run the agent on top of that, which means it's kubernetes handling the clustering and scheduling.

That said, you can have an agent running on both your and your colleague's machines, and the clearml queue will make sure both machines are put to use at the same time; they just won't "share" resources - each machine works independently of the other.

I understand this trigger works automatically, right? Someone uploads new data and the cleaning pipeline gets activated - without me, the data scientist, having to do anything. Then, when I later look at the "feature store" (as you describe it), I see the clean data already prepared for me. Is that right?

Correct! You will have to set up the trigger once, and from then on, it's all automatic. (It works by monitoring events such as a newly created dataset on the server backend.)

For this to work: Does the ClearML server monitor the object storage and then call the appropriate ClearML agent? This does not work if there is no agent, and usually I only use my laptop for processing data, right?

As stated above, it monitors the server backend for newly incoming "events". So, if the server is using a 3rd party storage backend, the trigger will still work.

To provide this to the data scientists, I would need to approach this topic similar to integrating Feast with ClearML, right?

Depends on what tool you end up using. If you have a 3rd party dataset exploration tool that you'd like to use AND it is capable of analysing a dataset that is present on disk, then yeah, you could download the data using the clearml SDK and then analyse it with this tool. But I doubt there are many tools that allow you to analyse the features/data straight from clearml, without having to download the data first. That said, if your clearml-data storage backend is something like S3, chances are the 3rd party tool will integrate easily with that, skipping clearml entirely. It's pretty difficult to give a definitive answer here, sorry!

  1. I think you indirectly answered my previous fourth question: I was thinking along the principles of "data locality", meaning "bring your code to the data, not the data to your code". This should be possible if I set up an agent on the same machine on which the feature store is running. Would that be correct?

Hmm, no, clearml-data very much works in the way of "get your data to the code". If you have your dataset version stored either on the server itself or somewhere in S3, for example, "getting" the dataset using the SDK essentially means it is downloaded to the machine that is requesting access to it (e.g. the clearml agent machine). This is done via a cache, so if another task at a later date also requests the same data, the worker will still have it available and not re-download it.

That said, if you were to have a clearml-agent running on the same machine as the clearml server and the agent asks for a local copy of the dataset, it will still "download" it from the server to a local cache, even if both are on the same machine.
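The caching behavior can be illustrated with a toy sketch - this is not ClearML's implementation, and all names are made up; it only shows the "download on miss, reuse on hit" behavior described above:

```python
# Toy sketch of the dataset cache behavior described above (illustrative only):
# the first "get" downloads; a later "get" of the same version hits the cache.
import tempfile
from pathlib import Path

CACHE = Path(tempfile.mkdtemp()) / "dataset_cache"

def get_local_copy(dataset_id, download):
    """Return a local path for the dataset, downloading only on a cache miss."""
    local = CACHE / dataset_id
    if not local.exists():
        local.mkdir(parents=True)
        download(local)        # only reached on the first request
        return local, "downloaded"
    return local, "cached"

downloads = []
def fake_download(dest):
    """Stand-in for the real network transfer."""
    downloads.append(dest)
    (dest / "data.csv").write_text("a,b\n1,2\n")

_, first = get_local_copy("dataset-123", fake_download)
_, second = get_local_copy("dataset-123", fake_download)
print(first, second, len(downloads))  # -> downloaded cached 1
```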

@Make42
Contributor Author

Make42 commented Jul 12, 2022

@thepycoder

Can this remote machine be my own laptop plus the laptop of my colleague - and ClearML brings it all together - like decentralized storage? (Of course, both our machines need to be on for this to work.)

Not really; a clearml-agent is essentially just a single native Linux process. It has its benefits in being simple to set up and maintain, but the downside is that there is no "clustering", as I think you mean with this. However, you CAN set up kubernetes and then run the agent on top of that, which means it's kubernetes handling the clustering and scheduling.

No, that was a misunderstanding: Here, I do not want to use those two laptops as agents for processing, but for data storage - as an alternative to S3.

Hmm, no, clearml-data very much works in the way of "get your data to the code". If you have your dataset version stored either on the server itself or somewhere in S3, for example, "getting" the dataset using the SDK essentially means it is downloaded to the machine that is requesting access to it (e.g. the clearml agent machine). This is done via a cache, so if another task at a later date also requests the same data, the worker will still have it available and not re-download it.

That said, if you were to have a clearml-agent running on the same machine as the clearml server and the agent asks for a local copy of the dataset, it will still "download" it from the server to a local cache, even if both are on the same machine.

What I meant was that, instead of using S3, I would store the data on my on-prem server/VM. On the same machine/on-prem I run pip install clearml-agent, additionally turning the VM into a ClearML agent. I think this case is similar to what you mean in the second paragraph. The difference is that, in my case, the ClearML server is still allowed to be on a different machine. Sure, we need to copy the data into the cache of the agent process, but the data does not have to be sent over any network. Am I correct?

@thepycoder
Contributor

No, that was a misunderstanding: Here, I do not want to use those two laptops as agents for processing, but for data storage - as an alternative to S3.

This will not work; the agent does not provide distributed storage, sorry! Again, though, you can set up a 3rd party tool like Ceph or MinIO to create that "storage pool" first and then just instruct clearml to use it as the backend. But Ceph can be pretty hard to set up, depending on skill levels.

What I meant was that, instead of using S3, I would store the data on my on-prem server/VM. On the same machine/on-prem I run pip install clearml-agent, additionally turning the VM into a ClearML agent. I think this case is similar to what you mean in the second paragraph. The difference is that, in my case, the ClearML server is still allowed to be on a different machine. Sure, we need to copy the data into the cache of the agent process, but the data does not have to be sent over any network. Am I correct?

Oh ok, that makes sense. Right, so in that case this page might be interesting for you! It would allow you to reference the local files directly, skipping the cache altogether. However, I have not used this yet myself, so I'm going to add @bmartinn here; he will have some answers for you.

The main difficulty, I think, is where the data is stored. If the server is not on the same machine as where the data is stored, the server will keep track of the remote locations of the files instead and only store metadata. But the server has to support your remote locations! It supports S3 and Ceph, but I don't know about just files on a server that are not "hosted".

@Make42
Contributor Author

Make42 commented Jul 13, 2022

Thank you already - super helpful! I am also looking forward to @bmartinn 's answer. He can probably say something regarding your second paragraph as well.

Ceph looks pretty bad-ass. Maybe too bad-ass :-). Maybe we start a bit smaller during a spike and a subsequent evaluation phase and have everything (data, agent, server) on the same machine, and later "upgrade" to Ceph only if necessary. Maybe we can even get managed ClearML - but first I need to build a strong case for my superiors.

@bmartinn
Member

What I meant was that, instead of using S3, I would store the data on my on-prem server/VM. ...

@Make42 if I understand you correctly, you want to eliminate the need to upload/download data from cloud storage (e.g. S3), because most of the processing happens on-prem.
If this is correct, then as @thepycoder suggested, you can basically set a shared folder as the target for any artifact/storage.
To set a shared folder for storage, either set the files_host setting (or the env var CLEARML_FILES_HOST) to "/mnt/shared/folder", or do it programmatically with task.output_uri = "/mnt/shared/folder".
Notice that the full link to the artifacts is stored, which means that the mount point should be the same on any machine that needs to access the data.
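Spelled out as configuration, that might look like the sketch below - treat the exact key name as an assumption to verify against your clearml.conf template; it is inferred from the files_host setting mentioned above:

```
# clearml.conf sketch - assuming the same mount point on every machine
api {
    files_server: "/mnt/shared/folder"
}

# equivalently, as an environment variable on each machine:
#   export CLEARML_FILES_HOST=/mnt/shared/folder
```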

Ceph looks pretty bad-ass. Maybe too bad-ass :-). Maybe we start a bit smaller ...

Yes, Ceph might be a bit much; you can definitely start with MinIO, it is very easy to spin up (basically download the binary and run it).

@Make42
Contributor Author

Make42 commented Jul 15, 2022

@bmartinn The OS of our development machine is Windows, while the servers are Ubuntu. What would we need to consider in this setup, regarding the mount points of the shared folder for storage?

@bmartinn
Member

The OS of our development machine is Windows, while the servers are Ubuntu.

This might be tricky; the implementation assumes the full link (i.e. the path to the file) is valid. The issue is that a Windows full path always starts with "C:" (or another drive letter), and this would break on Linux, and vice versa ...
I "think" the enterprise version has some solution for that (basically replacing prefixes), which would solve this issue.

The only hack I can think of is trying to register the data on Windows, which would end up with links like "Z:\\shared\\folder\\some\\file\\here.bin", then on a Linux machine create the root folder "/Z:/" and mount the shared folder to it (i.e. /Z:/shared/folder). It might work... 🤞
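If one had to do that remapping in code, a small helper along these lines might work - just a sketch of the prefix replacement, nothing ClearML-specific, and the mount layout is the hypothetical one from above:

```python
# Sketch of the hack described above: data registered on Windows stores paths
# like "Z:\shared\folder\...", and on Linux we map the "Z:" drive letter to a
# "/Z:/" mount point where the shared folder is mounted.
from pathlib import PureWindowsPath

def windows_path_to_linux(stored_path):
    """Rewrite a stored Windows path to the corresponding Linux mount path."""
    p = PureWindowsPath(stored_path)
    drive = p.drive.rstrip(":")              # "Z:" -> "Z"
    return "/" + drive + ":/" + "/".join(p.parts[1:])

print(windows_path_to_linux(r"Z:\shared\folder\some\file\here.bin"))
# -> /Z:/shared/folder/some/file/here.bin
```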
