# Get Started with Databricks for Machine Learning

In this course, you will learn basic skills that will allow you to use the Databricks Data Intelligence Platform to perform a simple data science and machine learning workflow. You will be given a tour of the workspace, and you will be shown how to work with notebooks. You will train a baseline model with AutoML and transition the best model to production. Finally, the course will also introduce you to MLflow, feature store, and workflows, and demonstrate how to train and manage an end-to-end machine learning lifecycle.

---

## Prerequisites
The content was developed for participants with these skills/knowledge/abilities:  
- A beginner-level understanding of Python.
- Basic understanding of DS/ML concepts (e.g. classification and regression models), common model metrics (e.g. F1-score), and Python libraries (e.g. scikit-learn and XGBoost). 

---


## Setup

### Install required libraries

In [0]:
%pip install -U databricks-sdk -qqq --upgrade
%pip install -U databricks-feature-engineering -qqq --upgrade
%pip install -U scikit-learn -qqq --upgrade
%pip install -U hyperopt -qqq --upgrade
%pip install -U mlflow==2.20.0 -qqq --upgrade
%pip install -U pmdarima==2.0.4 -qqq --upgrade
%pip install -U pandas==1.5.3 -qqq --upgrade
%pip install -U category_encoders==2.6.3 -qqq --upgrade
%pip install -U databricks-automl-runtime==0.2.20.11 -qqq --upgrade
%restart_python

### Import libraries and functions


In [0]:
# Import required libraries
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
import sklearn.ensemble

import matplotlib.pyplot as plt

from hyperopt import fmin, tpe, hp, SparkTrials, Trials, STATUS_OK
from hyperopt.pyll import scope

import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature
from mlflow.tracking.client import MlflowClient

from pyspark.sql import functions as F
from pyspark.sql import Row

from databricks.sdk import WorkspaceClient
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()
w = WorkspaceClient()

mlflow.set_registry_uri("databricks-uc")

### Set default values

In [0]:
current_catalog = spark.sql("SELECT current_catalog()").collect()[0][0]
current_schema = spark.sql("SELECT current_schema()").collect()[0][0]
current_username = spark.sql("SELECT current_user()").collect()[0][0]

workspace_url = f"https://{spark.conf.get('spark.databricks.workspaceUrl')}"
table_name = "wine_quality"

### Create the dataset

This dataset includes chemical properties, such as pH levels and alcohol content, that are used to predict wine quality.

The dataset is available in `databricks-datasets`. 

In the following cell, you read the data in from `.csv` files into Spark DataFrames. 

You then write the DataFrames to tables in Unity Catalog.

In [0]:
from pyspark.sql.functions import monotonically_increasing_id, col, lit, concat, to_date

wine_quality_white = spark.read.csv("/databricks-datasets/wine-quality/winequality-white.csv", sep=';', header=True, inferSchema=True)
wine_quality_red = spark.read.csv("/databricks-datasets/wine-quality/winequality-red.csv", sep=';', header=True, inferSchema=True)

# Remove the spaces from the column names
for c in wine_quality_white.columns:
    wine_quality_white = wine_quality_white.withColumnRenamed(c, c.replace(" ", "_"))
for c in wine_quality_red.columns:
    wine_quality_red = wine_quality_red.withColumnRenamed(c, c.replace(" ", "_"))

wine_quality_white = wine_quality_white.withColumn("is_red", lit(0.0))
wine_quality_white = wine_quality_white.withColumn("id", concat(lit('white_'), monotonically_increasing_id().cast("string")))

wine_quality_red = wine_quality_red.withColumn("is_red", lit(1.0))
wine_quality_red = wine_quality_red.withColumn("id", concat(lit('red_'), monotonically_increasing_id().cast("string")))

# Write to tables in Unity Catalog
spark.sql(f"DROP TABLE IF EXISTS {table_name}")
spark.sql(f"DROP TABLE IF EXISTS {table_name}_red")
spark.sql(f"DROP TABLE IF EXISTS {table_name}_white")

wine_quality_red.write.saveAsTable(f"{table_name}_red")
wine_quality_white.write.saveAsTable(f"{table_name}_white")

wine_quality = wine_quality_red.unionAll(wine_quality_white)
wine_quality.write.saveAsTable(table_name)

spark.table(table_name).printSchema()

# Exploratory Data Analysis and Feature Engineering

In this lab, we’ll walk you through basic exploratory data analysis and the process of creating and storing a feature table in the Feature Store. We’ll begin by demonstrating how to load data into a Spark DataFrame, view essential statistical information, and perform visual analysis using both built-in tools and code. Next, we’ll create a feature table, showing you how to store and explore it within the Feature Store UI. By the end of this lab, you should have a foundational understanding of the key steps involved in creating a feature table for Feature Engineering.

## **Learning Objectives**:

_By the end of this lab, you will be able to:_


1. **Perform Basic Exploratory Data Analysis (EDA):**
    - Utilize Spark and Pandas to store our data as a DataFrame.
    - Use built-in functionality to analyze data from a statistical perspective. Additionally, we will visualize the summary statistics. 


2. **Introduction to Feature Engineering with Databricks:**
    - Create a feature table and store it in Feature Store from a PySpark DataFrame.
    - Inspect the Feature Store table using the UI and from the notebook.

## Perform Basic Exploratory Data Analysis (EDA)

In this section, we will show how you can use Databricks Notebooks for exploratory analysis. This will be presented in two flavors: built-in tools and demonstrative custom code.

### Read and Inspect the Dataset

In this section, we will use a fictional dataset from a wine rating company, which includes various measurements, from acidity to pH levels. Ideally, a data scientist or machine learning practitioner would take this dataset and perform various feature engineering tasks in order to predict the `quality` rating of the wine. 

The next cell reads the `wine_quality` table created during setup. Let's create two different DataFrames from it, one using Spark and another using pandas.  


In [0]:
df = spark.read.table(table_name)
pdf = df.toPandas()

display(df)

### Inspect Statistics: Numerical Values and Visuals

Here we will exhibit different ways in which you can display and visualize descriptive statistics. 
- `dbutils.data.summarize(<spark_or_pandas_dataframe>)` - This method will separate out numerical and categorical features within your Spark or Pandas DataFrame. It also displays histograms and quartile estimates. There are various options available in the generated profile such as resizing and feature search. You can consider this a more managed approach for summarizing statistics. 
- `<spark_or_pandas_dataframe>.describe()` - This method returns only a table of summary statistics. To get a generated profile like the one from the `dbutils` approach, add a data profile to the displayed result. 
    - Click on the **+** icon and select **data profile**. 
- `display(<spark_or_pandas_dataframe>)` - This will return the table. From this, we can build a visual to inspect the feature variables. 
- Custom code - We can use the Pandas Dataframe along with other Python libraries to build custom visualizations.

In [0]:
dbutils.data.summarize(pdf)

In [0]:
# Spark DataFrame
display(df.describe())

In [0]:
# Pandas DataFrame
display(pdf.describe())

In [0]:
# Built-in Visuals
display(df)

In [0]:
# Extract additional statistics
# Let's find Q1, median, and Q3 of pH grouped by the quality ranking
column_stats = 'pH'

display(df.groupBy('quality').agg(
    F.min(f'{column_stats}').alias('min'),
    F.expr(f'percentile({column_stats}, 0.25)').alias('Q1'),
    F.expr(f'percentile({column_stats}, 0.5)').alias('median'),
    F.expr(f'percentile({column_stats}, 0.75)').alias('Q3'),
    F.max(f'{column_stats}').alias('max'),
    F.count(f'{column_stats}').alias('count')
))
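
To make the grouped statistics above concrete, here is a minimal sketch in plain Python that computes the same summary on a handful of made-up `(quality, pH)` pairs. The sample values and the `summarize` helper are assumptions for illustration only. `statistics.quantiles` with `method="inclusive"` interpolates linearly between data points, which should behave similarly to the exact `percentile` aggregate used in the Spark cell.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical (quality, pH) pairs standing in for the wine_quality table
samples = [
    (5, 3.10), (5, 3.20), (5, 3.30), (5, 3.40), (5, 3.50),
    (6, 3.00), (6, 3.20), (6, 3.40),
]

# Group pH values by quality, like df.groupBy('quality')
ph_by_quality = defaultdict(list)
for quality, ph in samples:
    ph_by_quality[quality].append(ph)

def summarize(values):
    """Return min, Q1, median, Q3, max, and count for a list of numbers."""
    # method="inclusive" interpolates linearly between data points
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "Q1": q1, "median": median,
            "Q3": q3, "max": max(values), "count": len(values)}

stats_by_quality = {q: summarize(v) for q, v in sorted(ph_by_quality.items())}
for quality, stats in stats_by_quality.items():
    print(quality, stats)
```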


### Bubble Chart Using GUI Visualization Editor

We can now use the **Visualization Editor** in the Databricks UI to build a bubble chart using our grouped summary statistics.

**Steps:**
- Create the grouped DataFrame in the following cell.
- In the output result cell:
   - Click the **+**  dropdown next to Table (top-right of the table display).
   - Select **Visualization**.

- In the **Visualization Editor**:
   - Select **Bubble** as the visualization type.
   - Under **X column**, select `quality`.
   - Under **Y columns**, select `median_pH`.
   - Under **Group by**, select `count`.
   - Under **Bubble size column**, select `count`.
   - Under **Bubble size coefficient**, confirm that the value is `1`.
   - Leave **Bubble size proportional to** as `Diameter`.

- Click **Save** to render the chart.

This creates a bubble chart that shows:
- Wine **quality** on the x-axis.
- **Median pH** level on the y-axis.
- **Bubble size** proportional to the number of samples.


In [0]:
from pyspark.sql.functions import expr, count

grouped_df = df.groupBy("quality").agg(expr("percentile(pH, 0.5)").alias("median_pH"), count("pH").alias("count"))
display(grouped_df)

## Introduction to Feature Engineering on Databricks

After exploring our data for a bit, we see that it would be beneficial to be able to predict `quality`. There are many things we can do to this dataset, such as outlier analysis, etc. Instead, since this is just an introductory lab, let's keep it simple and add an additional feature that separates out low, average, and high `pH`. This will add an additional feature to the data we already have. 



### Business Logic

Based on our analysis above, suppose business stakeholders give you the following guidelines for pH levels.

- Low pH: <= Q1
- Average pH: > Q1 and < Q3
- High pH: >= Q3

Let's take this business logic and create a new **feature** and store it in a feature table in our Feature Store.

In [0]:
feature_variables = ['fixed_acidity',
                     'volatile_acidity',
                     'citric_acid',
                     'pH',
                     'sulphates',
                     'alcohol',
                     'quality',
                     'is_red']
prediction_variable = 'quality'
primary_key = ['id']

In [0]:
feature_df = df.select(primary_key + feature_variables)
display(feature_df)

In [0]:
from pyspark.sql.functions import col, expr, when

quantiles = feature_df.approxQuantile("pH", [0.25, 0.75], 0.0)

Q1, Q3 = quantiles

wine_feature_df = feature_df.withColumn(
    "pHCategory",
    when(col("pH") <= Q1, "Low")
    .when((col("pH") > Q1) & (col("pH") < Q3), "Average")
    .otherwise("High")
)

display(wine_feature_df)
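
The bucketing performed by the `when` chain above can be checked in plain Python. The sketch below mirrors its boundary behavior: values at or below Q1 are `Low`, values strictly between Q1 and Q3 are `Average`, and everything else is `High`. The `ph_category` helper and the threshold values are hypothetical; in the notebook, Q1 and Q3 come from `approxQuantile`.

```python
def ph_category(ph: float, q1: float, q3: float) -> str:
    """Bucket a pH value using the quartile thresholds."""
    if ph <= q1:
        return "Low"
    if ph < q3:
        return "Average"
    return "High"

# Hypothetical thresholds; the real Q1 and Q3 are computed from the data
q1, q3 = 3.11, 3.32
labels = [ph_category(ph, q1, q3) for ph in (3.00, 3.11, 3.20, 3.32, 3.50)]
print(labels)
```

Note that a value exactly equal to Q3 falls into `High`, because the `Average` condition uses a strict comparison.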

### Save features to feature table

Now that we have our feature DataFrame created, let's store it as a feature table within Feature Store. We have all the ingredients we need to do this within Databricks Unity Catalog: 
- Feature table (Spark DataFrame)
- Primary key (designated feature)

In [0]:
# Set the feature table name for storage in UC
feature_table_name = f'{table_name}_features'

# drop the table if it exists
try:
    fe.drop_table(
        name = feature_table_name,
    )
except Exception:
    # Ignore the error raised when the table does not exist yet
    pass

# Create the feature table
fe.create_table(
    name = feature_table_name,
    primary_keys = primary_key,
    df = wine_feature_df, 
    description="Wine quality features", 
    tags = {"source": "bronze", "format": "delta"}
)

print(f"The name of the feature table: {feature_table_name} url: {workspace_url}/explore/data/{current_catalog}/{current_schema}/{feature_table_name}")

Now, go inspect your feature table using the UI with the URL provided in the previous cell output!

# Build your first machine learning model

## Binary Classification

This example illustrates how to train a machine learning classification model on Databricks. Databricks Runtime for Machine Learning comes with many libraries pre-installed, including scikit-learn for training and pre-processing algorithms, MLflow to track the model development process, and Hyperopt with SparkTrials to scale hyperparameter tuning.

With the Databricks Free Edition, only Serverless compute is available, so it is not possible to use `SparkTrials`. You will use the `Trials` class instead.

In this notebook, you create a classification model to predict whether a wine is red or white. The dataset[1] consists of 11 features of different wines (for example, alcohol content, acidity, and residual sugar) and a quality ranking between 1 and 10. 

This tutorial covers:
- Train a classification model with MLflow tracking
- Hyperparameter tuning to improve model performance
- Save results and models to Unity Catalog

For more details on productionizing machine learning on Databricks including model lifecycle management and model inference, see the ML End to End Example ([AWS](https://docs.databricks.com/mlflow/end-to-end-example.html) | [Azure](https://learn.microsoft.com/azure/databricks/mlflow/end-to-end-example) | [GCP](https://docs.gcp.databricks.com/mlflow/end-to-end-example.html)).

By default, the MLflow Python client creates models in the Databricks workspace model registry. To save models in Unity Catalog, configure the MLflow client as done in the setup cell with `mlflow.set_registry_uri("databricks-uc")`.

[1] The example uses a dataset from the UCI Machine Learning Repository, presented in [*Modeling wine preferences by data mining from physicochemical properties*](https://www.sciencedirect.com/science/article/pii/S0167923609001377?via%3Dihub) [Cortez et al., 2009].


### Preprocess data

Here we will build a dataset to predict if the wine is 'red' or 'white'.
We also need to map the 'pHCategory' attribute to a numerical value.

In [0]:
# Load data from Unity Catalog as Pandas dataframes
training_df = spark.read.format('delta').table(f'{table_name}_features').toPandas()

# Define the pHCategory mapping to map it back to numbers instead of strings
mapping = {'Low': 0.0, 'Average': 1.0, 'High': 2.0}

# Apply the mapping to the 'pHCategory' column
training_df['pHCategory'] = training_df['pHCategory'].map(mapping)

# Use the training dataset to store variables X, the features, and y, the target variable. 
X = training_df.drop(columns = ['id', "is_red"])
y = training_df["is_red"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
  X, 
  y, 
  test_size=0.2, 
  random_state=42
)


### Train with gradient boosting

In [0]:
# Enable MLflow autologging for this notebook
mlflow.autolog()

Next, train a classifier within the context of an MLflow run, which automatically logs the trained model and many associated metrics and parameters. 

You can supplement the logging with additional metrics such as the model's AUC score on the test dataset.

In [0]:
with mlflow.start_run(run_name='gradient_boost') as run:
    # Initialize the GradientBoosting classifier
    model = sklearn.ensemble.GradientBoostingClassifier(random_state=42)

    # Fit the model on the training data. Models, parameters, and training metrics are tracked automatically
    model.fit(X_train, y_train)

    predicted_probs = model.predict_proba(X_test)
    roc_auc = sklearn.metrics.roc_auc_score(y_test, predicted_probs[:,1])
    roc_curve = sklearn.metrics.RocCurveDisplay.from_estimator(model, X_test, y_test)
    
    # Save the ROC curve plot to a file
    roc_curve.figure_.savefig("roc_curve.png")
    
    # The AUC score on test data is not automatically logged, so log it manually
    mlflow.log_metric("test_auc", roc_auc)
    
    # Log the ROC curve image file as an artifact
    mlflow.log_artifact("roc_curve.png")
    
    print("Test AUC of: {}".format(roc_auc))
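
As a reminder of what the logged `test_auc` measures, the following sketch computes AUC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score. This is a plain-Python illustration on made-up values, not the scikit-learn implementation, and the `roc_auc` helper name is an assumption.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = red) and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 3 of the 4 pairs are ranked correctly: 0.75
```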


#### View MLflow runs

To view the logged training run, click the **Experiment** icon <img src="https://docs.databricks.com/_static/images/mlflow/quickstart/experiment-icon.png"/> at the upper right of the notebook to display the experiment sidebar. If necessary, click the refresh icon to fetch and monitor the latest runs. 

<img src="https://docs.databricks.com/_static/images/mlflow/quickstart/experiment-sidebar-icons.png"/>

To display the more detailed MLflow experiment page, click the experiment page icon. This page allows you to compare runs and view details for specific runs.

### Load a model and make predictions
You can also access the results for a specific run using the MLflow API. The code in the following cell illustrates how to load the model trained in a given MLflow run and use it to make predictions. You can also find code snippets for loading specific models on the MLflow run page.

In [0]:
# After a model has been logged, you can load it in different notebooks or jobs
# mlflow.pyfunc.load_model makes model prediction available under a common API
model_loaded = mlflow.pyfunc.load_model(
  'runs:/{run_id}/model'.format(
    run_id=run.info.run_id
  )
)

predictions_loaded = model_loaded.predict(X_test)
predictions_original = model.predict(X_test)

# The loaded model should match the original
assert(np.array_equal(predictions_loaded, predictions_original))

In [0]:
from pyspark.sql import Row
test_df = spark.createDataFrame([
    Row(
        fixed_acidity=6.2, 
        volatile_acidity=0.66, 
        citric_acid=0.48, 
        pH=3.33, 
        sulphates=0.39, 
        alcohol=12.8, 
        quality=8, 
        pHCategory=0.0
        # it's a white wine
    ),
    Row(
        fixed_acidity=6.6, 
        volatile_acidity=0.725, 
        citric_acid=0.2, 
        pH=3.29, 
        sulphates=0.54, 
        alcohol=9.2, 
        quality=6, 
        pHCategory=1.0
        # it's a red wine
    )
]).toPandas()
test_df = test_df.astype({col: 'int32' for col in test_df.select_dtypes('int64').columns})

test_predictions = model_loaded.predict(test_df)
display(test_predictions)


### Hyperparameter tuning with Hyperopt
At this point, you have trained a simple model and used the MLflow tracking service to organize your work. Next, you can perform more sophisticated tuning using Hyperopt.

[Hyperopt](http://hyperopt.github.io/hyperopt/) is a Python library for hyperparameter tuning. For more information about using Hyperopt in Databricks, see the documentation ([AWS](https://docs.databricks.com/applications/machine-learning/automl-hyperparam-tuning/index.html#hyperparameter-tuning-with-hyperopt) | [Azure](https://docs.microsoft.com/azure/databricks/applications/machine-learning/automl-hyperparam-tuning/index#hyperparameter-tuning-with-hyperopt) | [GCP](https://docs.gcp.databricks.com/applications/machine-learning/automl-hyperparam-tuning/index.html#hyperparameter-tuning-with-hyperopt)).

You can use Hyperopt to run hyperparameter sweeps and train multiple models in parallel. This reduces the time required to optimize model performance. MLflow tracking is integrated with Hyperopt to automatically log models and parameters.

In [0]:
# Define the search space to explore
search_space = {
  'n_estimators': scope.int(hp.quniform('n_estimators', 20, 1000, 1)),
  'learning_rate': hp.loguniform('learning_rate', -3, 0),
  'max_depth': scope.int(hp.quniform('max_depth', 2, 5, 1)),
}

def train_model(params):
  # Enable autologging on each worker
  mlflow.autolog()
  with mlflow.start_run(nested=True):
    model_hp = sklearn.ensemble.GradientBoostingClassifier(
      random_state=0,
      **params
    )
    model_hp.fit(X_train, y_train)
    predicted_probs = model_hp.predict_proba(X_test)
    # Tune based on the test AUC
    # In production, you could use a separate validation set instead
    roc_auc = sklearn.metrics.roc_auc_score(y_test, predicted_probs[:,1])
    mlflow.log_metric('test_auc', roc_auc)
    
    # Set the loss to -1*auc_score so fmin maximizes the auc_score
    return {'status': STATUS_OK, 'loss': -1*roc_auc}

# SparkTrials distributes the tuning using Spark workers
# Greater parallelism speeds processing, but each hyperparameter trial has less information from other trials
# On smaller clusters try setting parallelism=2
# spark_trials = SparkTrials(
#   parallelism=1
# )
# As we only have access to Serverless compute, we will leverage the Trials class instead
spark_trials = Trials(
  # parallelism=1
)

with mlflow.start_run(run_name='gb_hyperopt') as run:
  # Use hyperopt to find the parameters yielding the highest AUC
  best_params = fmin(
    fn=train_model, 
    space=search_space, 
    algo=tpe.suggest, 
    max_evals=32,
    trials=spark_trials)


#### Search runs to retrieve the best model

Because all of the runs are tracked by MLflow, you can use the MLflow search runs API to find the tuning run with the highest test AUC and retrieve its metrics and parameters.

This tuned model should perform better than the simpler models trained in the earlier section. 
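
The selection rule used in the following cell (highest `test_auc` first, most recent run on ties) can be pictured with a small plain-Python sort. The run records below are made up for illustration and are not real MLflow output.

```python
# Hypothetical run records; real ones come from mlflow.search_runs
runs = [
    {"run_id": "run_a", "test_auc": 0.991, "start_time": 100},
    {"run_id": "run_b", "test_auc": 0.995, "start_time": 200},
    {"run_id": "run_c", "test_auc": 0.995, "start_time": 300},
]

# Sort by metric descending, then by start time descending to break ties
ranked = sorted(runs, key=lambda r: (r["test_auc"], r["start_time"]), reverse=True)
best = ranked[0]
print(best["run_id"])
```

Here `run_b` and `run_c` share the top metric, so the later `start_time` decides.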

In [0]:
# Sort runs by their test auc. In case of ties, use the most recent run.
best_run = mlflow.search_runs(
  order_by=['metrics.test_auc DESC', 'start_time DESC'],
  max_results=10,
).iloc[0]
print('Best Run')
print('AUC: {}'.format(best_run["metrics.test_auc"]))
print('Num Estimators: {}'.format(best_run["params.n_estimators"]))
print('Max Depth: {}'.format(best_run["params.max_depth"]))
print('Learning Rate: {}'.format(best_run["params.learning_rate"]))

best_model_pyfunc = mlflow.pyfunc.load_model(
  'runs:/{run_id}/model'.format(
    run_id=best_run.run_id
  )
)

# Make a dataset with all predictions (copy so that X_test itself is left unchanged)
best_model_predictions = X_test.copy()
best_model_predictions["prediction"] = best_model_pyfunc.predict(X_test)


#### Get the results and save the models to Unity Catalog

In [0]:
predictions_table = f"{table_name}_predictions"
spark.sql(f"DROP TABLE IF EXISTS {predictions_table}")

results = spark.createDataFrame(best_model_predictions)

# Write results back to Unity Catalog from Python
results.write.saveAsTable(f"{predictions_table}")

In [0]:
model_uri = 'runs:/{run_id}/model'.format(
    run_id=best_run.run_id
)

mlflow.register_model(model_uri, f"{table_name}_model_is_red")

## Multi-class Classification

### Preprocess data

Here, you will predict the wine quality, which is a discrete numerical value, using multi-class classification.

In [0]:
# Load the feature table
training_df = spark.read.format('delta').table(f'{table_name}_features').toPandas()

# Define the mapping
mapping = {'Low': 0.0, 'Average': 1.0, 'High': 2.0}

# Apply the mapping to the 'pHCategory' column
training_df['pHCategory'] = training_df['pHCategory'].map(mapping)
# Use the training dataset to store variables X, the features, and y, the target variable. 
X = training_df.drop(columns = ["id", "quality"])
y = training_df["quality"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
  X, 
  y, 
  test_size=0.2, 
  random_state=42
)


### Train with Random Forest

Now that we have our training set ready to go, the next step is to train a model using the `sklearn` library. We will build a random forest classification model, tracking the F1-score for a single run. We will initiate the tracking before creating the model using `mlflow.start_run()` as the context manager. Within this manager we will:

- Initialize the random forest classifier
- Fit the model
- Make a prediction using our test set
- Log the F1-score metric as `test_f1`
- Capture the artifacts for model tracking and management using the flavor `mlflow.sklearn`. *Flavor* in this context simply means that MLflow will package our scikit-learn model in a consistent and standardized way. If we wished to use a different ML library, we would use a different *flavor*.

Finally, we will register the model to Unity Catalog. Note, Databricks does not recommend registering your model at the Workspace level.


In [0]:
# set the path for mlflow experiment
exp = mlflow.set_experiment(f"/Users/{w.current_user.me().user_name}/get-started-with-ml-flow-experiment")

print(f"The experiment {exp.experiment_id} is accessible at the url: {workspace_url}/ml/experiments/{exp.experiment_id}")

with mlflow.start_run(run_name = 'get-started-with-ml-flow-run') as run:  
    # Initialize the Random Forest classifier
    rf_classifier = sklearn.ensemble.RandomForestClassifier(n_estimators=100, random_state=42)

    # Fit the model on the training data
    rf_classifier.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred = rf_classifier.predict(X_test)

    # Enable automatic logging of input samples, metrics, parameters, and models
    mlflow.sklearn.autolog(
        log_input_examples = True,
        silent = True
    )
    # Calculate F1 score with 'macro' averaging for multiclass
    mlflow.log_metric("test_f1", sklearn.metrics.f1_score(y_test, y_pred, average='macro'))
    # mlflow.log_metric("test_f1", sklearn.metrics.f1_score(y_test, y_pred))

    print(f"The experiment run {run.info.run_id} is accessible at the url: {workspace_url}/ml/experiments/{exp.experiment_id}/runs/{run.info.run_id}")

    print("\n\n")
    mlflow.sklearn.log_model(
        rf_classifier,
        artifact_path = "model-artifacts", 
        input_example=X_train[:3],
        signature=infer_signature(X_train, y_train)
    )

    model_uri = f"runs:/{run.info.run_id}/model-artifacts"
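
The `average='macro'` option used for `test_f1` computes an F1 score per class and then takes their unweighted mean, so rare quality grades count as much as common ones. Here is a minimal plain-Python sketch of that calculation on made-up labels; the `macro_f1` helper is an illustration, not the scikit-learn implementation.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # Each class appears in y_true or y_pred, so the denominator is non-zero
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(f1_scores) / len(f1_scores)

# Hypothetical quality labels and predictions
y_true = [5, 5, 6, 6, 7, 7]
y_pred = [5, 6, 6, 6, 7, 5]
print(macro_f1(y_true, y_pred))
```

In this example the per-class scores are 0.5, 0.8, and 2/3, and the macro F1 is their mean.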


In [0]:
# Modify the registry uri to point to Unity Catalog
mlflow.set_registry_uri("databricks-uc")

# Define the model name 
model_name = f"{table_name}_model_quality"

# Register the model in the model registry
registered_model = mlflow.register_model(model_uri=model_uri, name=model_name)

Notice that you will now have an additional version under your model. 

Navigate to your model in Catalog Explorer. You will find version 1 (created during the classroom setup with alias **staging**) and version 2, which you just created.

In [0]:
# Initialize an MLflow Client
client = MlflowClient()

# Assign a "dev" alias to the model version registered above
client.set_registered_model_alias(
    name= registered_model.name,  # The registered model name
    alias="dev",  # The alias representing the dev environment
    version=registered_model.version  # The version of the model you want to move to "dev"
)
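
Once an alias is set, the model can be referenced by alias instead of by version number. MLflow accepts model URIs of the form `models:/<name>@<alias>`, and Unity Catalog model names have three levels (`catalog.schema.model`). The sketch below only builds such a URI; the helper name and the catalog and schema values are hypothetical. The resulting string could then be passed to `mlflow.pyfunc.load_model`.

```python
def model_uri_for_alias(catalog: str, schema: str, model: str, alias: str) -> str:
    """Build a models:/<catalog>.<schema>.<model>@<alias> URI."""
    return f"models:/{catalog}.{schema}.{model}@{alias}"

# Hypothetical catalog and schema names; substitute your own
uri = model_uri_for_alias("main", "default", "wine_quality_model_quality", "dev")
print(uri)
```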


# Build your first machine learning model - Lab

In this lab, we will construct a comprehensive ML model pipeline using Databricks. Initially, we will train and monitor our model using MLflow. Subsequently, we will register the model and advance it to the next stage. In the latter part of the lab, we will use Model Serving to deploy the registered model. Following deployment, we will interact with the model via a REST endpoint and examine its behavior through an integrated monitoring dashboard.



## Data Ingestion

The first step in this lab is to ingest data from .csv files and save them as delta tables. 

Navigate to the Catalog explorer and locate the datasets under shared and find `databricks_airbnb_sample_data`. 

Expand `v01` and locate `airbnb-cleaned-mlflow.csv` located in the volume `sf-listings`. 

Second, we grab a few relevant features to help train our model to predict the target variable for this dataset, `price`.

In [0]:
## Copy and paste the location of the airbnb dataset
file_path = '<FILL_IN>'

In [0]:
## Read in the csv file and store it in Unity Catalog within the catalog and schema shown in cell 8. 
## Name your delta table "airbnb_lab"
my_table = <FILL_IN>
df = spark.read.format(<FILL_IN>).option("header", "true").load(<FILL_IN>)

Let's preprocess this dataset, since the schema shows all variables as type `string`.

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType, IntegerType, StringType

## Specify columns that should be treated as categorical (e.g., integers in categorical context)
categorical_columns = ['neighbourhood_cleansed', 'zipcode', 'property_type', 'room_type', 'bed_type']
for col in categorical_columns:
    df = df.withColumn(col, df[col].cast(StringType()))

## Specify columns that should remain as floats for machine learning
numerical_columns = ['host_total_listings_count', 'latitude', 'longitude', 'accommodates', 'bathrooms', 
                 'bedrooms', 'beds', 'minimum_nights', 'number_of_reviews', 'review_scores_rating',
                 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin',
                 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'price']
for col in numerical_columns:
    df = df.withColumn(col, df[col].cast(FloatType()))

df = df.withColumn("airbnb_id", F.monotonically_increasing_id()).select(['airbnb_id'] + numerical_columns + categorical_columns)

## Check the schema to confirm data type changes
df.printSchema()

In [0]:
df.write.format('delta').mode('overwrite').saveAsTable('airbnb_lab')

## Feature Engineering

Next, using PySpark, create a DataFrame called `feature_df` that is the feature table. Recall that the feature table must contain a primary key and does not contain the target variable, which is `price` in our case.

In [0]:
feature_df = df.select(<FILL_IN>)

## Find rooms with a score of at least 6.0 and 80 reviews
feature_df = feature_df.filter((<FILL_IN>) & (<FILL_IN>))
display(feature_df)

Write to Databricks Feature Store. Remember, we do not include our target variable.

In [0]:
## Write feature_df to Databricks Feature Store. 
## Reuse the FeatureEngineeringClient (`fe`) created during setup, which targets Unity Catalog

feature_df = feature_df.drop(<FILL_IN>)

fe.create_table(
    name="airbnb_features",
    primary_keys = <FILL_IN>, 
    df = <FILL_IN>,
    description = "This is the airbnb feature table",
    tags = {"source": "bronze", "format": "delta"}
    )

## Train a Model
To summarize what you have accomplished so far:
1. You have created a table that is a snapshot of the original dataset (Airbnb csv file) called `airbnb_lab`.
1. You have created a feature table and stored it in Databricks Feature Store called `airbnb_features`.

Next, we will simulate the process of reading in these Delta tables and training a model. We will train a machine learning model and register it to Unity Catalog.

In [0]:
## Read in the feature table airbnb_features from Unity Catalog using PySpark and store it as training_df
prediction_df = spark.read.format('delta').table(<FILL_IN>).select(<FILL_IN>)
features_df = spark.read.format('delta').table(<FILL_IN>)  

## Join these two dataframes on airbnb_id
training_df = prediction_df.join(<FILL_IN>, on='airbnb_id').toPandas()

## Perform train-test split
X = training_df.drop(columns = [<FILL_IN>])
y = training_df[<FILL_IN>]

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(<FILL_IN>, <FILL_IN>, test_size=0.2, random_state=42)

In [0]:
## Set the path for mlflow experiment
mlflow.set_experiment(f"/Users/{current_username}/<FILL_IN>")

In [0]:
## Start the MLflow run
with mlflow.start_run(run_name=<FILL_IN>) as run:
    ## Initialize the Random Forest classifier
    rf_classifier = sklearn.ensemble.RandomForestClassifier(n_estimators=<FILL_IN>, random_state=42)

    ## Fit the model on the training data
    rf_classifier.fit(<FILL_IN>, <FILL_IN>)

    ## Make predictions on the test data
    y_pred = rf_classifier.predict(<FILL_IN>)

    ## Enable automatic logging of input samples, metrics, parameters, and models
    mlflow.sklearn.autolog(log_input_examples=<FILL_IN>, silent=True)
    ## Calculate F1 score with 'macro' averaging for multiclass
    mlflow.log_metric("test_f1", sklearn.metrics.f1_score(<FILL_IN>, <FILL_IN>, average="macro"))
    ## mlflow.log_metric("test_f1", sklearn.metrics.f1_score(y_test, y_pred))

    mlflow.sklearn.log_model(
        rf_classifier,
        artifact_path="model-artifacts",
        input_example=X_train[:3],
        signature=infer_signature(<FILL_IN>, <FILL_IN>),
    )

    model_uri = f"runs:/{run.info.run_id}/model-artifacts"


## Conclusion

In this lab, we walked through an end-to-end machine learning workflow on the Databricks Data Intelligence Platform. 

From data ingestion to model deployment, we covered essential steps such as data preparation, model training, tracking, registration, and serving. 

By using MLflow for model tracking and management, and Model Serving for deployment, we demonstrated how Databricks offers a seamless workflow for building and deploying ML models. 

Through this comprehensive lab, users can gain a solid understanding of Databricks capabilities for ML tasks and streamline their development process effectively.


# Make predictions with your models


## Local Model Serving

As you did earlier, you can load a model and start predicting locally, as demonstrated below.


In [0]:
from pyspark.sql import Row

model_loaded = mlflow.pyfunc.load_model(
  'runs:/{run_id}/model'.format(
    run_id=run.info.run_id
  )
)
test_df = spark.createDataFrame([
    Row(
        fixed_acidity=6.2, 
        volatile_acidity=0.66, 
        citric_acid=0.48, 
        pH=3.33, 
        sulphates=0.39, 
        alcohol=12.8, 
        pHCategory=0.0,
        is_red=0.0
        # it's a quality=8
    ),
    Row(
        fixed_acidity=6.6, 
        volatile_acidity=0.725, 
        citric_acid=0.2, 
        pH=3.29, 
        sulphates=0.54, 
        alcohol=9.2, 
        pHCategory=1.0,
        is_red=1.0
        # it's a quality=6
    )
]).toPandas()
test_df = test_df.astype({col: 'int32' for col in test_df.select_dtypes('int64').columns})

test_predictions = model_loaded.predict(test_df)
display(test_predictions)


## Mosaic AI Model Serving

In this lesson, we will focus on how to serve a registered model using **Mosaic AI Model Serving** for real-time inferencing. We’ll also introduce **Databricks Workflows** as a way to automate ML pipelines.



### Setting Up Model Serving

We can create Model Serving endpoints with the Databricks Machine Learning API or the Databricks Machine Learning UI. 

An endpoint can serve any registered Python MLflow model in the **Model Registry**.

To keep things simple, this demo uses the Model Serving UI to create, manage, and query endpoints. You can create a model serving endpoint from the **"Serving"** page or directly from the registered **"Models"** page.  

Let's go through the steps of creating a model serving endpoint from the **Models** page. **You will not actually create the endpoint.**

- Go to **Models**. 

- Select **Owned by me** at the top.

- Select the model you want to serve under the **Name** column. Notice this will take you to the Catalog menu. 

- Click the **Serve this model** button on the top right. This will take you to the **Serving endpoints** screen.

- Next, in **General**, enter a name such as **wine_model_quality**.

- Select **workspace.default.wine_quality_model_quality** as **Entity**.

- There are several configurations under **Served entities** that we will not discuss here. 

- For **Compute scale-out**, select **small**. You can select **Scale to zero** for this lesson as well. We will be deleting the endpoint at the end of this lesson, so this doesn't matter too much for our purposes. 

- Click on **Create** at the bottom right. 
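
The UI choices above can also be expressed as a request to the serving endpoints REST API. The sketch below only builds the JSON body and does not send anything. The field names reflect our understanding of the `POST /api/2.0/serving-endpoints` API and should be checked against the current API reference. The helper name `build_endpoint_request` and the model version `1` are placeholder assumptions.

```python
import json


def build_endpoint_request(endpoint_name, model_name, model_version,
                           workload_size="Small", scale_to_zero=True):
    """Build the JSON body for creating a serving endpoint (nothing is sent)."""
    return {
        "name": endpoint_name,
        "config": {
            "served_entities": [
                {
                    "entity_name": model_name,
                    "entity_version": str(model_version),
                    "workload_size": workload_size,
                    "scale_to_zero_enabled": scale_to_zero,
                }
            ]
        },
    }


body = build_endpoint_request(
    endpoint_name="wine_model_quality",
    model_name="workspace.default.wine_quality_model_quality",
    model_version=1,  # placeholder: use the version you actually registered
)
print(json.dumps(body, indent=2))
```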


### Query Serving Endpoint

Let's use the deployed model for real-time inference. Here’s a step-by-step guide for querying an endpoint in Databricks Model Serving:

- Go to the **Serving** endpoints page and select the endpoint you want to query.

- Click the **Use** button in the top right corner.

There are four methods for querying an endpoint: **browser**, **CURL**, **Python**, and **SQL**. 

For now, let's use the easiest method: querying right in the **browser** window. 

In this method, we need to provide the input parameters in JSON format. 

Since we used `mlflow.sklearn.autolog()` with `log_input_examples = True`, an input example was logged with the model in MLflow, and it appears automatically when you select **browser**.

- Input the following request:

```json
{
  "dataframe_split": {
    "columns": [
      "fixed_acidity",
      "volatile_acidity",
      "citric_acid",
      "pH",
      "sulphates",
      "alcohol",
      "pHCategory",
      "is_red"
    ],
    "data": [
      [
        6.2, 
        0.66, 
        0.48, 
        3.33, 
        0.39, 
        12.8, 
        0.0,
        0.0
      ],
      [
        6.6, 
        0.725, 
        0.2, 
        3.29, 
        0.54, 
        9.2, 
        1.0,
        1.0
      ]
    ]
  }
}
```

- Click **Send request**.

- The **Response** field in the right panel shows the result of the inference.
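
The **Python** method sends the same JSON over HTTPS. As a rough sketch, the snippet below converts a list of records into the `dataframe_split` format and prepares the request without sending it. The helper names, workspace URL, token, and endpoint name are placeholders. The `/serving-endpoints/<name>/invocations` path is our understanding of the invocation route and should be verified against your workspace.

```python
import json
import urllib.request


def to_dataframe_split(records):
    """Convert a list of row dicts into the `dataframe_split` payload."""
    columns = list(records[0].keys())
    data = [[row[col] for col in columns] for row in records]
    return {"dataframe_split": {"columns": columns, "data": data}}


def build_invocation_request(workspace_url, endpoint_name, token, payload):
    """Prepare a POST request for the endpoint without sending it."""
    return urllib.request.Request(
        url=f"{workspace_url}/serving-endpoints/{endpoint_name}/invocations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


records = [
    {"fixed_acidity": 6.2, "volatile_acidity": 0.66, "citric_acid": 0.48, "pH": 3.33,
     "sulphates": 0.39, "alcohol": 12.8, "pHCategory": 0.0, "is_red": 0.0},
    {"fixed_acidity": 6.6, "volatile_acidity": 0.725, "citric_acid": 0.2, "pH": 3.29,
     "sulphates": 0.54, "alcohol": 9.2, "pHCategory": 1.0, "is_red": 1.0},
]
payload = to_dataframe_split(records)
request = build_invocation_request(
    "https://example.cloud.databricks.com",  # placeholder workspace URL
    "wine_model_quality",                    # placeholder endpoint name
    "YOUR_TOKEN",                            # placeholder access token
    payload,
)
print(request.get_method(), request.full_url)
# To actually query the endpoint: urllib.request.urlopen(request).read()
```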

### Delete Your Serving Endpoint

**🚨 : Please delete your serving endpoint after completing the above steps.**
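
If you prefer to clean up programmatically, an endpoint can also be deleted through the REST API. This minimal sketch only prepares the `DELETE` request. The workspace URL, token, endpoint name, and the helper name `build_delete_request` are placeholders, and the `/api/2.0/serving-endpoints/<name>` path should be checked against the current API reference.

```python
import urllib.request


def build_delete_request(workspace_url, endpoint_name, token):
    """Prepare (but do not send) a DELETE request for a serving endpoint."""
    return urllib.request.Request(
        url=f"{workspace_url}/api/2.0/serving-endpoints/{endpoint_name}",
        headers={"Authorization": f"Bearer {token}"},
        method="DELETE",
    )


delete_request = build_delete_request(
    "https://example.cloud.databricks.com",  # placeholder workspace URL
    "wine_model_quality",                    # placeholder endpoint name
    "YOUR_TOKEN",                            # placeholder access token
)
print(delete_request.get_method(), delete_request.full_url)
# To actually delete the endpoint: urllib.request.urlopen(delete_request)
```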



# Getting Started with Mosaic AI AutoML

In this lab, we will explore how **Mosaic AI AutoML** automates the process of model training, selection, and registration. 

_In the previous labs, you manually tracked and registered models using MLflow. In this one, you’ll see how AutoML automates those steps while still leveraging the same tracking infrastructure._

AutoML allows you to build, train, and evaluate models with minimal code. We will create an AutoML experiment, inspect the results, register the best model, and transition it to the **Staging** stage.

However, Databricks Free Edition provides only serverless compute, so for now only _Forecasting_ is available, and only through the UI.
With non-serverless compute, other AutoML problem types and features are available. 

![automl-create-experiment](./images/automl-create-experiment.png)


## Preprocess the data



In [0]:
dbutils.fs.ls("/databricks-datasets/COVID/covid-19-data")

In [0]:
covid = spark.read.csv("/databricks-datasets/COVID/covid-19-data/us.csv", sep=',', header=True, inferSchema=True)
covid = covid.withColumn("date", F.col("date").try_cast("date"))
covid = covid.withColumn("cases", F.col("cases").cast("int"))
display(covid)

spark.sql("DROP TABLE IF EXISTS covid")

covid.write.saveAsTable("covid")

spark.table("covid").printSchema()

## Create and Run an AutoML Experiment



Let's create an AutoML experiment to build a baseline forecasting model for COVID-19 case counts. The target for this forecast is the `cases` column of the `covid` table.

Follow these step-by-step instructions to create an AutoML experiment:

- Navigate to **Experiments** under **Machine Learning** in the left sidebar menu.

- Click on **Forecasting - Preview**.

  ![automl-create-experiment-serverless](./images/automl-create-experiment-serverless.png)

- Under the **Training data** section:

  - To select the `covid` table as the input training data, select `Browse` under `Input training dataset` and navigate to the catalog and the same database we've been using (see **Classroom Setup** in this notebook). 

  - Specify **`date`** as the **Time Column**.

  - Specify **`Daily`** as the **Forecasting frequency**.

  - Specify **`7`** as the **Forecast horizon**.

- Under the **Prediction** section:

  - Specify **`cases`** as the **Target column**.

  - Specify **`workspace.default`** as the **Prediction data path**

  - Specify **`covid_forecast`** as the **Table name**

- Under the **Model registration** section:

  - Specify **`workspace.default`** as the **Register to location**

  - Specify **`covid_model`** as the **Model name**

- Under the **Advanced options** section:

  - Specify **`SMAPE`** as the **Primary metric**.

  - Specify **`10`** as the **Timeout (minutes)**

- Click on **Start training**. 
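
Later in this notebook, the runs of this experiment are ranked by `val_smape`, the symmetric mean absolute percentage error (SMAPE). The sketch below uses one common definition: the mean of `2 * |forecast - actual| / (|actual| + |forecast|)`, expressed as a percentage. Definitions vary (some omit the factor of 2 or report a fraction), so treat this as an illustration rather than the exact formula AutoML uses. The function name and the sample values are assumptions.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if not actual:
        raise ValueError("inputs must not be empty")
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = abs(a) + abs(f)
        # Treat a 0/0 term as a perfect prediction to avoid division by zero.
        total += 0.0 if denom == 0 else 2.0 * abs(f - a) / denom
    return 100.0 * total / len(actual)


# Made-up daily case counts and forecasts, for illustration only.
actual = [100, 200, 300]
forecast = [110, 190, 330]
print(round(smape(actual, forecast), 2))
```

Lower values are better, which is why the best run is selected by sorting on this metric in ascending order.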


## Inspection of the Experiment

Once the experiment is finished, it's time to examine the best run:

- Access the completed experiment in the **Experiments** section.

- Identify the best model run by evaluating the displayed **metrics**. 

Metrics might not be displayed by default on the screen.

The runs table includes columns such as the model type used, evaluation metrics (e.g., SMAPE), and links to the corresponding notebooks for each model. This allows you to make informed decisions about selecting the best model for your specific use case.



In [0]:
experiments = mlflow.search_experiments(
    view_type = mlflow.entities.ViewType.ACTIVE_ONLY, 
    filter_string = f"name LIKE '/Users/{current_username}/databricks_automl/cases_covid_%'", 
    order_by = ["creation_time DESC"],
    max_results = 1,
)
experiment = experiments[0]

print(f"Experiment URL: {workspace_url}/ml/experiments/{experiment.experiment_id}")

print(f"Experiment id: {experiment.experiment_id}")
print(f"Experiment Artifact Location: {experiment.artifact_location}")
# print(f"Tags: {experiment.tags}")
print(f"Experiment Lifecycle stage: {experiment.lifecycle_stage}")
print(f"Experiment Creation timestamp: {experiment.creation_time}")

print(f"Experiment Registered Model URL: {workspace_url}/explore/data/models/{experiment.tags.get('_databricks_automl.output_model_name').replace('.', '/')}")

display(experiments)

## Inspection of the "Best" Experiment Run

In [0]:
runs = mlflow.search_runs(
    experiment_ids= [experiment.experiment_id], 
    run_view_type = mlflow.entities.ViewType.ACTIVE_ONLY, 
    order_by = ["metrics.val_smape ASC"], 
    max_results = 1,
    output_format = 'list'
)
run = runs[0]

print(f"Best Experiment Run URL: {workspace_url}/ml/experiments/{experiment.experiment_id}/runs/{run.info.run_id}")
print(f"Best Experiment Run id: {run.info.run_id}")
print(f"Best Experiment Run SMAPE: {run.data.metrics['val_smape']}")


## Inspect the generated inference notebook

AutoML also generates a batch inference notebook, which you can open using the URL printed by the next cell:

In [0]:
print(f"Experiment Batch Inference Notebook URL: {workspace_url}/editor/notebooks/{experiment.tags.get('_databricks_automl.batch_inference_notebook_id')}")


## Use the model for forecasting
You can use the commands in this section with Databricks Runtime for Machine Learning 10.0 or above.

### Load the model with MLflow

MLflow allows you to easily import models back into Python by using the AutoML run_id.

You should check and install the model requirements using the following code:

```
requirements = mlflow.pyfunc.get_model_dependencies(model_uri)
%pip install -r {requirements}
dbutils.library.restartPython()
```
However, this requires restarting the notebook, so we installed the dependencies at the beginning instead.

In [0]:
model_uri = "runs:/{run_id}/model".format(run_id=run.info.run_id)
pyfunc_model = mlflow.pyfunc.load_model(model_uri)


### Use the model to make forecasts

Call the `predict_timeseries` model method to generate forecasts.    
In Databricks Runtime for Machine Learning 10.5 or above, you can set `include_history=False` to get the predicted data only.

In [0]:
import matplotlib.pyplot as plt

time_column = "date"
history_table = "covid"
target_column = "cases"

history_df = spark.table(history_table)
history_df = history_df.withColumn(time_column, F.to_timestamp(time_column))
history_df = history_df.orderBy(time_column, ascending=False).limit(30).toPandas()

In [0]:
forecasts = pyfunc_model._model_impl.python_model.predict_timeseries()
forecasts = forecasts.rename(columns={'yhat': target_column})
forecasts = forecasts.rename(columns={'ds': time_column})
forecasts = forecasts.sort_values(by=time_column, ascending=False).head(30)
display(forecasts)

### Plot the forecasted points

In the plot below, the solid line with square markers shows the 30 most recent historical values, and the dashed line with circle markers shows the forecast created by the model.


In [0]:


# fig = plt.figure(facecolor='w', figsize=(10, 6))
# ax = fig.add_subplot(111)
# forecasts = pyfunc_model._model_impl.python_model.predict_timeseries(include_history=True)



# Code for plotting
plt.figure(figsize=(10, 6))

# Plot the solid line for the historical datapoints.
plt.plot(history_df[time_column], history_df[target_column], linestyle='-', marker='s', label='Historical values')

# Plot the dashed line for the forecasted values
plt.plot(forecasts[time_column], forecasts[target_column], linestyle='--', marker='o', label='Forecasts')

# Set proper ticks on x-axis.
tick_positions = pd.date_range(start=history_df[time_column].min(), end=forecasts[time_column].max(), periods=10)
plt.xticks(tick_positions, labels=[date.strftime("%Y-%m-%d %H:%M") for date in tick_positions], rotation=45)

# Adding labels and legend
plt.title('Recent Historical Data and Forecasts')
plt.xlabel(time_column)
plt.ylabel(target_column)
plt.legend()

# Display the plot
plt.show()