
# Getting Started with Databricks for Machine Learning

In this lab, we build an end-to-end ML model pipeline on Databricks. First, we train and track a model with MLflow. Next, we register the model and promote it to the next stage. In the second half of the lab, we deploy the registered model with Model Serving, query it through a REST endpoint, and examine its behavior in the integrated monitoring dashboard.


## Install required libraries

In [0]:
%pip install -qqq --upgrade databricks-feature-engineering
%restart_python

## Imports and default values

In [0]:
import mlflow
from mlflow.models.signature import infer_signature

import sklearn.model_selection
import sklearn.ensemble
import sklearn.metrics

current_catalog = spark.sql("SELECT current_catalog()").collect()[0][0]
current_schema = spark.sql("SELECT current_schema()").collect()[0][0]
current_username = spark.sql("SELECT current_user()").collect()[0][0]


## Data Ingestion
 - The first step in this lab is to ingest data from .csv files and save it as Delta tables. In Catalog Explorer, look under the shared datasets and find `dbacademy_airbnb`. Expand `v01` and locate `airbnb-cleaned-mlflow.csv` in the volume `sf-listings`.
 - Second, we select a few relevant features to train a model that predicts this dataset's target variable, `price`.

In [0]:
file_path = '/Volumes/databricks_airbnb_sample_data/v01/sf-listings/airbnb-cleaned-mlflow.csv'
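Files in Unity Catalog volumes are addressed as `/Volumes/<catalog>/<schema>/<volume>/<path>`. As a small illustration, the sketch below splits the path above into those parts. `parse_volume_path` is a hypothetical helper written for this example, not a Databricks API.

```python
from pathlib import PurePosixPath


def parse_volume_path(path: str) -> dict:
    """Split a /Volumes/<catalog>/<schema>/<volume>/<file path> string into its parts."""
    parts = PurePosixPath(path).parts  # ('/', 'Volumes', catalog, schema, volume, ...)
    if len(parts) < 5 or parts[0] != "/" or parts[1] != "Volumes":
        raise ValueError(f"not a Unity Catalog volume path: {path!r}")
    return {
        "catalog": parts[2],
        "schema": parts[3],
        "volume": parts[4],
        "relative_path": "/".join(parts[5:]),
    }


example_path = "/Volumes/databricks_airbnb_sample_data/v01/sf-listings/airbnb-cleaned-mlflow.csv"
path_parts = parse_volume_path(example_path)
print(path_parts)
```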

In [0]:
my_table = "airbnb_lab"
df = spark.read.format("csv").option("header", "true").load(file_path)

display(df)

Let's preprocess this dataset, since the schema shows every variable as type `string`.
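The all-`string` schema comes from reading the CSV without a schema or type inference (Spark's `inferSchema` option is off by default). The same effect shows up with Python's standard `csv` module. The sketch below uses made-up rows and a hypothetical `cast_row` helper to mirror, in plain Python, the casts that the next cell performs in Spark.

```python
import csv
import io

# Made-up sample rows for illustration; not taken from the Airbnb dataset.
raw_csv = "room_type,bedrooms,price\nEntire home/apt,2,250\nPrivate room,1,99\n"

rows = list(csv.DictReader(io.StringIO(raw_csv)))
# Without an explicit schema every field arrives as a string.
print(rows[0])

numeric_fields = ["bedrooms", "price"]


def cast_row(row: dict) -> dict:
    """Return a copy of the row with the numeric fields converted to float."""
    return {k: float(v) if k in numeric_fields else v for k, v in row.items()}


typed_rows = [cast_row(r) for r in rows]
print(typed_rows[0])
```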

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType, IntegerType, StringType

## Specify columns that should be treated as categorical (e.g., integers in categorical context)
categorical_columns = ['neighbourhood_cleansed', 'zipcode', 'property_type', 'room_type', 'bed_type']
for col in categorical_columns:
    df = df.withColumn(col, df[col].cast(StringType()))

## Specify columns that should be cast to floats for machine learning
numerical_columns = ['host_total_listings_count', 'latitude', 'longitude', 'accommodates', 'bathrooms', 
                 'bedrooms', 'beds', 'minimum_nights', 'number_of_reviews', 'review_scores_rating',
                 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin',
                 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'price']
for col in numerical_columns:
    df = df.withColumn(col, df[col].cast(FloatType()))

df = df.withColumn("airbnb_id", F.monotonically_increasing_id()).select(['airbnb_id'] + numerical_columns + categorical_columns)

## Check the schema to confirm data type changes
df.printSchema()

In [0]:
df.write.format('delta').mode('overwrite').saveAsTable(my_table)

## Feature Engineering

Next, using PySpark, create a DataFrame called `feature_df` to serve as the feature table. Recall that a feature table must contain a primary key and must not contain the target variable, which is `price` in our case.

In [0]:
feature_df = df.select(['airbnb_id'] + numerical_columns)

## Keep listings with a rating of at least 6.0 and at least 80 reviews
feature_df = feature_df.filter((F.col("review_scores_rating") >= 6.0) & (F.col("number_of_reviews") >= 80))
display(feature_df)

Write the feature table to Databricks Feature Store. Remember to exclude the target variable.

In [0]:
## Write feature_df to Databricks Feature Store.
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

feature_df = feature_df.drop("price")

fs.create_table(
    name="airbnb_features",
    primary_keys=["airbnb_id"],
    df=feature_df,
    description="This is the airbnb feature table",
    tags={"source": "bronze", "format": "delta"},
)
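The two rules for a feature table (a usable primary key, no target column) are easy to check before writing. The sketch below does so on plain Python dictionaries with made-up rows. `validate_feature_rows` is a hypothetical helper, not part of the Feature Store client.

```python
def validate_feature_rows(rows, primary_key, target):
    """Return a list of problems found in candidate feature-table rows."""
    problems = []
    keys = [row.get(primary_key) for row in rows]
    if any(k is None for k in keys):
        problems.append(f"missing primary key '{primary_key}'")
    if len(set(keys)) != len(keys):
        problems.append(f"duplicate values in primary key '{primary_key}'")
    if any(target in row for row in rows):
        problems.append(f"target column '{target}' must not be stored as a feature")
    return problems


# Made-up rows for illustration.
good_rows = [{"airbnb_id": 0, "bedrooms": 2.0}, {"airbnb_id": 1, "bedrooms": 1.0}]
bad_rows = [{"airbnb_id": 0, "price": 250.0}, {"airbnb_id": 0, "price": 99.0}]

print(validate_feature_rows(good_rows, "airbnb_id", "price"))
print(validate_feature_rows(bad_rows, "airbnb_id", "price"))
```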

## Train a Model
To summarize what you have accomplished so far:
1. You have created a table called `airbnb_lab` that is a snapshot of the original dataset (the Airbnb csv file).
1. You have created a feature table called `airbnb_features` and stored it in Databricks Feature Store.

Next, we will simulate the process of reading in these Delta tables and training a model. We will train a machine learning model and register it to Unity Catalog.
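Once a run has logged a model, registration needs two identifiers: the URI of the logged model (`runs:/<run_id>/<artifact_path>`) and, in Unity Catalog, a three-level model name (`<catalog>.<schema>.<model>`). The sketch below only builds those strings. The run ID, the catalog and schema, and the model name `airbnb_price_model` are placeholders, and the commented lines show roughly where MLflow's `mlflow.set_registry_uri` and `mlflow.register_model` would be called in a notebook.

```python
def runs_uri(run_id: str, artifact_path: str) -> str:
    """Build the URI of a model logged under a given MLflow run."""
    return f"runs:/{run_id}/{artifact_path}"


def uc_model_name(catalog: str, schema: str, model: str) -> str:
    """Build a three-level Unity Catalog model name."""
    return f"{catalog}.{schema}.{model}"


# Placeholder values for illustration only.
example_uri = runs_uri("0123456789abcdef", "model-artifacts")
example_name = uc_model_name("main", "default", "airbnb_price_model")
print(example_uri, example_name)

# In a Databricks notebook, registration would then look roughly like:
#   mlflow.set_registry_uri("databricks-uc")
#   mlflow.register_model(model_uri=example_uri, name=example_name)
```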

In [0]:
## Read the labels (airbnb_id, price) from airbnb_lab and the features from the feature table airbnb_features
prediction_df = spark.read.format('delta').table('airbnb_lab').select('airbnb_id','price')
features_df = spark.read.format('delta').table('airbnb_features')

## Join these two dataframes on airbnb_id
training_df = prediction_df.join(features_df, on='airbnb_id').drop('airbnb_id')
training_pdf = training_df.toPandas()

## Perform train-test split
X = training_pdf.drop(columns = ['price'])
y = training_pdf['price']

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.2, random_state=42)
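Fixing `random_state` makes the split reproducible: the same rows land in the test set on every run. The pure-Python sketch below illustrates that idea with a seeded shuffle. It is a simplified stand-in, not scikit-learn's implementation, and `seeded_split` is a hypothetical helper.

```python
import math
import random


def seeded_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of items with a fixed seed and split off a test fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


train_a, test_a = seeded_split(range(10))
train_b, test_b = seeded_split(range(10))

# Same seed, same split; every item ends up in exactly one of the two parts.
print(len(train_a), len(test_a), test_a == test_b)
```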

In [0]:
## Set the path for the MLflow experiment
mlflow.set_experiment(f"/Users/{current_username}/model-serving-experiment")

In [0]:
## Enable automatic logging of input samples, metrics, parameters, and models.
## Autologging must be enabled before fit() so that the training call is captured.
mlflow.sklearn.autolog(log_input_examples=True, silent=True)

## Start the MLflow run
with mlflow.start_run(run_name="model-serving-run") as run:
    ## Initialize the Random Forest classifier (each distinct price value is treated as a class label)
    rf_classifier = sklearn.ensemble.RandomForestClassifier(n_estimators=100, random_state=42)

    ## Fit the model on the training data
    rf_classifier.fit(X_train, y_train)

    ## Make predictions on the test data
    y_pred = rf_classifier.predict(X_test)

    ## Calculate F1 score with 'macro' averaging for multiclass
    mlflow.log_metric("test_f1", sklearn.metrics.f1_score(y_test, y_pred, average="macro"))

    ## Log the model with an input example and an inferred signature
    mlflow.sklearn.log_model(
        rf_classifier,
        artifact_path="model-artifacts",
        input_example=X_train[:3],
        signature=infer_signature(X_train, y_train),
    )

    ## URI of the logged model, used to refer to it after the run
    model_uri = f"runs:/{run.info.run_id}/model-artifacts"
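Macro averaging computes F1 separately for each class and then takes the unweighted mean, so rare classes count as much as common ones. The sketch below spells that out in plain Python on made-up labels. `macro_f1` is a hypothetical helper written for illustration, and a class with no correct predictions scores 0.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for illustration.
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true_demo, y_pred_demo), 4))
```

When the target has many distinct values, as `price` does, most classes have few examples, so a macro-averaged score tends to be low. For a continuous target, a regression metric such as RMSE is the more common choice.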


## Conclusion

In this lab, we walked through an end-to-end machine learning workflow on the Databricks Data Intelligence Platform. From data ingestion to model deployment, we covered data preparation, model training, tracking, registration, and serving. Using MLflow for model tracking and management and Model Serving for deployment, we showed how Databricks supports a single, continuous workflow for building and deploying ML models. You should now have a working understanding of the Databricks capabilities used for ML tasks and how they fit together.