[Epic] - DataScientist - MlOperator collaboration flow #37

alfsuse opened this issue Mar 18, 2021

Following Google's definition of the collaboration between:

  • Data Engineer (responsible for data preparation and ingestion)
  • Data Scientist (responsible for developing the model to be trained and for verifying that the model satisfies project requirements such as precision, accuracy, and performance)
  • ML Operator (responsible for gluing together the data and the model and creating an end-to-end workflow that trains the model and serves it for inference to the final application)

(Source: Google's MLOps documentation)

We may define the ideal flow as consisting of the following steps:

  1. Data extraction: You select and integrate the relevant data from various data sources for the ML task.
  2. Data analysis: You perform exploratory data analysis (EDA) to understand the available data for building the ML model. This process leads to the following:
    • Understanding the data schema and characteristics that are expected by the model.
    • Identifying the data preparation and feature engineering that are needed for the model.
  3. Data preparation: The data is prepared for the ML task. This preparation involves data cleaning, where you split the data into training, validation, and test sets. You also apply data transformations and feature engineering to the model that solves the target task. The output of this step is the data splits in the prepared format.
  4. Model training: The data scientist implements different algorithms with the prepared data to train various ML models. In addition, you subject the implemented algorithms to hyperparameter tuning to get the best performing ML model. The output of this step is a trained model.
  5. Model evaluation: The model is evaluated on a holdout test set to assess its quality. The output of this step is a set of metrics describing the quality of the model (see the training/evaluation sketch after this list).
  6. Model validation: The model is confirmed to be adequate for deployment, i.e. its predictive performance is better than a certain baseline.
  7. Model serving: The validated model is deployed to a target environment to serve predictions (see the serving sketch after this list). This deployment can be one of the following:
    • Microservices with a REST API to serve online predictions.
    • An embedded model on an edge or mobile device.
    • Part of a batch prediction system.
  8. Model monitoring: The model's predictive performance is monitored to potentially trigger a new iteration in the ML process.
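
To make the prepare/train/evaluate steps concrete, here is a minimal sketch assuming scikit-learn for the model and MLflow for experiment tracking; the dataset, algorithm, and hyperparameters are purely illustrative:

```python
# Minimal prepare/train/evaluate sketch (steps 3-5), assuming scikit-learn and MLflow.
# Dataset, algorithm, and hyperparameters are illustrative only.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Data preparation: split the prepared data into training and holdout test sets.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run():
    # Model training (these hyperparameters would normally come from a tuning step).
    params = {"n_estimators": 100, "max_depth": 5}
    model = RandomForestClassifier(**params).fit(X_train, y_train)

    # Model evaluation: compute metrics on the holdout test set.
    accuracy = accuracy_score(y_test, model.predict(X_test))

    # Track parameters, metrics, and the trained model for later validation/serving.
    mlflow.log_params(params)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "model")
```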
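
For the "microservices with a REST API" serving option, a minimal sketch could look like this, assuming FastAPI and a model previously exported with joblib (the file name and request schema are illustrative assumptions):

```python
# Minimal REST-serving sketch, assuming FastAPI and a joblib-exported model.
# The model path and request schema are illustrative assumptions.
from typing import List

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical path to the validated model


class PredictRequest(BaseModel):
    features: List[float]


@app.post("/predict")
def predict(request: PredictRequest):
    # Serve an online prediction for a single feature vector.
    prediction = model.predict([request.features])
    return {"prediction": prediction.tolist()}

# Run with e.g.: uvicorn serve:app --port 8080 (assuming this file is serve.py)
```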

We may assume that FuseML, in its first releases, will have to provide a simple way for those three personas to collaborate while removing most of the friction and complexity.

To do so, in the initial phase we will have to focus on a subset of the points above.

We may describe this subset as follows:

  1. We assume the DE knows how to prepare the data and simply exposes the datasets to the DS in some way (e.g. an S3 store, a remote URL, etc.).
  2. The DS codes in their preferred tool/IDE and, once ready, pushes the files to FuseML as a branch of the final repo.
  3. We will embed these coding artifacts in a super simple pipeline that has only 2 or 3 steps:
    • data ingestion/preparation
    • training
    • outcome
  4. The simple pipeline will deploy everything the DS needs (e.g. an MLflow instance for the experimentation phase); a sketch of such a pushed training step follows this list.
  5. Once the DS is satisfied with the training outcome, they submit a request to the MLOp (in Git terms, they open a PR) that notifies the MLOp that the training code is ready to be picked up.
  6. The MLOp injects the code into a more complex pipeline (merge) and starts the end-to-end workflow.
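
As an illustration of points 3 and 4, the training step the DS pushes could look roughly like the sketch below. `DATASET_URL` and `MLFLOW_TRACKING_URI` are hypothetical names for values the simple pipeline would inject; they are not an existing FuseML contract, and the `target` column is likewise just an example:

```python
# Hypothetical sketch of a training script a DS might push to FuseML.
# DATASET_URL and MLFLOW_TRACKING_URI are assumed to be injected by the simple
# pipeline (ingestion -> training -> outcome); the names are illustrative only.
import os

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Data ingestion/preparation: read the dataset exposed by the DE
# (e.g. a plain remote URL; an S3 URL would additionally need s3fs).
data = pd.read_csv(os.environ["DATASET_URL"])
X, y = data.drop(columns=["target"]), data["target"]  # "target" is an example column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Training: log the experiment to the MLflow instance deployed by the pipeline.
mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
with mlflow.start_run():
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # Outcome: record the metric the DS reviews before opening the PR to the MLOp.
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")
```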