This repository contains tutorials and resources for reproducing the Microsoft Build 2020 session: Building an End-to-End ML Pipeline for Big Data

adipolak/ms-build-e2e-ml-bigdata

MS-Build 2020: Building an End-to-End ML Pipeline for Big Data

This repo holds the information and resources you need to recreate the demo from the Microsoft Build 2020 session, Building an End-to-End ML Pipeline for Big Data.

Prerequisites:

  1. Azure account
  2. Azure Event Hubs
  3. Azure Databricks
  4. Azure Machine Learning
  5. Azure Key Vault
  6. Kubernetes environment or Azure Container Instances

Data Flow

  1. Ingest streaming data into Azure Blob storage with Event Hubs and Azure Databricks.
  2. Preprocess the data with Apache Spark so that it fits our schema.
  3. Save the data in Parquet format in the raw storage directory.
  4. Merge batch (historical) and stream (new) data with Apache Spark, and save the result in the preprocessed storage directory.
  5. Create multiple Azure ML (AML) Datasets from the Azure Databricks environment, and save them in the refined storage directory.
  6. Use an Azure Machine Learning compute cluster to run multiple experiments on the AML Datasets from VS Code.
  7. Log ML models and algorithm parameters with MLflow.
  8. Serve the chosen ML model through a Dockerized REST API service on Kubernetes.
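
To make steps 2 to 4 and the raw / preprocessed / refined directory layout more concrete, here is a minimal plain-Python sketch. It is not the Spark code from the session. The storage root `/mnt/data`, the field names `id`, `text` and `ts`, and the "keep the newest record per id" merge rule are all assumptions made for illustration; the actual tutorials may use a different schema and merge policy.

```python
from pathlib import PurePosixPath

# Hypothetical mount point; the README names only the three stage directories.
STORAGE_ROOT = PurePosixPath("/mnt/data")
STAGES = ("raw", "preprocessed", "refined")


def stage_path(stage: str, dataset: str) -> str:
    """Build the directory for a dataset in one of the three storage stages."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return str(STORAGE_ROOT / stage / dataset)


def preprocess(event: dict) -> dict:
    """Step 2 stand-in: coerce a raw event to a fixed (assumed) schema."""
    return {
        "id": str(event["id"]),
        "text": str(event.get("text", "")).strip(),
        "ts": int(event["ts"]),
    }


def merge_batch_and_stream(batch: list, stream: list) -> list:
    """Step 4 stand-in: union historical and new records.

    Assumed policy: when the same id appears more than once,
    keep the record with the largest timestamp.
    """
    latest = {}
    for record in list(batch) + list(stream):
        current = latest.get(record["id"])
        if current is None or record["ts"] >= current["ts"]:
            latest[record["id"]] = record
    return sorted(latest.values(), key=lambda r: r["id"])


if __name__ == "__main__":
    batch = [preprocess({"id": 1, "text": " old ", "ts": 10})]
    stream = [preprocess({"id": 1, "text": "new", "ts": 20})]
    merged = merge_batch_and_stream(batch, stream)
    print(stage_path("preprocessed", "events"), merged)
```

In the Databricks notebooks the same shape would typically be expressed with Spark DataFrames reading from and writing Parquet to these directories; the plain-Python version only illustrates how data moves between stages.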

Tutorials:

Q&A

If you have questions/concerns or would like to chat, contact us:
