<img src="https://cdn.oreillystatic.com/images/sitewide-headers/oreilly_logo_mark_red.svg"/>&nbsp;&nbsp;<font size="16"><b>AI, ML and GenAI in the Lakehouse</b></font>
<img style="float: left; margin: 0px 15px 15px 0px; width:30%; height: auto;" src="https://i.imgur.com/pQvJTVf.jpeg"   />  


 
  
   Name:          chapter 06-01-Iris Feature Store Example
 
   Author:    Bennie Haelen
   Date:      02-09-2025

   Purpose:   This notebook loads the "lending club" dataset, and converts it into a set of features, stored in the Feature Store. It then creates an ML model against those features.
                 
      An outline of the different sections in this notebook:
        1 - Initialize the Feature Store Client
        2 - Load and pre-process the Lending Club Dataset
              2-1 Load the Dataset
              2-2 Add a unique ID to the Dataset
              2-3 Perform Data Cleansing
        3 - Perform Feature Engineering
              3-1 Select our first sub-set of features
              3-2 Make sure that our Feature Store Schema Exists.
              3-3 Create the name of our Feature Table in the Unity Catalog
              3-4 Write the initial set of features to the Feature Store.
              3-5 Select another sub-set of Features
              3-6 Merge in the Additional Features
        4 - Prepare the Features for Model Generation
              4-1 Read the table into a Pandas Dataframe.
              4-2 Load the Target Variable
              4-3 Merge the Features with the Labels
              4-4 Drop rows with missing Target Values
              4-5 Prepare Features for Model Training
              4-6 Ensure all Features are Numeric
              4-7 Impute Missing Values
        5 - Build the model with the Feature Store Table Values
              5-1 Split the dataset into Train and Test Datasets
              5-2 Train a Random Forest Classifier
              5-3 Make Predictions and Calculate the Model Accuracy
        6 - Create Visualizations on the model
              6-1 Create a Feature Importance Plot
              6-2 Plot the Confusion Matrix
              6-3 Visualize the distribution of the Target Variable
              6-4 Plot the ROC Curve
              6-5 Plot the Precision Recall curve
        7 - Register the model in the Feature Store with MLflow


%md
# Handle Pre-Requisites

## Make sure to run the notebook with our constants

In [0]:
# Import necessary libraries
from databricks.feature_store import FeatureStoreClient, FeatureLookup
from pyspark.sql.functions import col, when, row_number, monotonically_increasing_id, regexp_replace, trim
from pyspark.sql.window import Window
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.impute import SimpleImputer

In [0]:
%run "../common/Constants"

# Initialize the Feature Store Client

In [0]:
# Initialize Feature Store client
fs = FeatureStoreClient()

# Load and preprocess the Lending Club Dataset

## Load the Dataset

In [0]:
# Load the Lending Club Loan dataset
# The dataset is stored in Parquet format in Databricks' public datasets.
# This dataset contains information on issued loans, including features
# like loan amount, interest rate, and borrower details.
lending_club_path = "/databricks-datasets/samples/lending_club/parquet"
df = spark.read.parquet(lending_club_path)

## Add a Unique ID to the Dataset

In [0]:
# Generate a unique ID for each row
# The monotonically_increasing_id() function generates a unique,
# monotonically increasing 64-bit integer for each row. The values are
# guaranteed to be unique, but they are not consecutive.
# This ID is used as a primary key to ensure that each record is
# uniquely identifiable in the Feature Store.
df = df.withColumn("unique_id", monotonically_increasing_id())
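
For intuition on why these IDs are unique but not consecutive: the Spark documentation describes the current implementation as placing the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The pure-Python sketch below mimics that layout. `sketch_monotonic_id` is a hypothetical helper written for illustration, not a Spark API, and the partition sizes are made up.

```python
def sketch_monotonic_id(partition_id: int, row_in_partition: int) -> int:
    """Mimic the documented bit layout of monotonically_increasing_id()."""
    if not 0 <= partition_id < 2**31:
        raise ValueError("partition_id must fit in 31 bits")
    if not 0 <= row_in_partition < 2**33:
        raise ValueError("row_in_partition must fit in 33 bits")
    # Upper bits carry the partition, lower 33 bits carry the row counter
    return (partition_id << 33) | row_in_partition


# Two partitions with three rows each (illustrative sizes only)
ids = [sketch_monotonic_id(p, r) for p in range(2) for r in range(3)]
print(ids)  # [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

One consequence of this layout is that the value assigned to a row depends on how the data is partitioned when the column is evaluated.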

## Data Cleansing

In [0]:
# Clean percentage fields by removing '%' and converting to float
# The 'int_rate' and 'revol_util' columns contain percentage values 
# stored as strings (e.g., '13.56%').
# The regexp_replace() function removes the '%' symbol, and 
# the resulting string is cast to a double for numerical analysis.
# The trim() function is used to remove any leading or 
# trailing whitespace before conversion.
df = df.withColumn("int_rate", regexp_replace(trim(col("int_rate")), "%", "").cast("double"))
df = df.withColumn("revol_util", regexp_replace(trim(col("revol_util")), "%", "").cast("double"))
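
The same trim, strip, and cast sequence can be sketched in plain Python to check the expected behavior on a few values. `clean_percent` is a hypothetical helper and the sample strings are invented. Returning `None` for unparseable input is meant to resemble a failed Spark cast producing null, which is an assumption about the cluster running with ANSI mode off.

```python
from typing import Optional


def clean_percent(raw: Optional[str]) -> Optional[float]:
    """Trim whitespace, drop '%' characters, and parse the rest as a float.

    Returns None for missing or unparseable input.
    """
    if raw is None:
        return None
    cleaned = raw.strip().replace("%", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


samples = [" 13.56%", "7.9%", "", None, "n/a"]
print([clean_percent(s) for s in samples])  # [13.56, 7.9, None, None, None]
```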

# Perform Feature Engineering

## Select our first sub-set of features

In [0]:
# Feature Engineering
# Selecting relevant features for the model.
# The 'unique_id' column serves as the primary key for uniquely 
# identifying each row in the Feature Store.
initial_features_df = df.select(
    col("unique_id"),   # Unique identifier for each loan record
    col("loan_amnt"),   # The amount of money requested by the borrower
    col("int_rate"),    # Interest rate of the loan, converted to numeric format
    col("annual_inc"),  # The Borrower's annual income
    col("dti"),         # Debt-to-income ratio, a key metric for credit risk
    col("revol_util"),  # Revolving credit line utilization rate
    col("open_acc")     # Number of open credit lines in the borrower's credit file
)

## Make sure that our Feature Store Schema Exists

In [0]:
%sql
CREATE SCHEMA IF NOT EXISTS book_ai_ml_lakehouse.feature_store_db;

## Create the name of our Feature Table in the Unity Catalog

In [0]:
# We will be using Unity Catalog to store our Feature Table.
# With Unity Catalog, we specify the fully qualified name to ensure
# consistent governance and access control across all workspaces.
# The format for the name is: 'catalog.schema.table_name'. 
FEATURE_TABLE_NAME = "lending_club_loan"
feature_table_name = f"{CATALOG_NAME}.{FEATURE_STORE_DB}.{FEATURE_TABLE_NAME}"
print(f"Our feature table will be named: `{feature_table_name}`")

## Write the initial set of features to the Feature Store

In [0]:
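# Drop any earlier copy of the feature table so the notebook can be re-run (this call may fail if the table does not exist yet)
fs.drop_table(feature_table_name)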

In [0]:
# Write the initial subset of features to the Databricks Feature Store
fs.create_table(
    name=feature_table_name,     # Fully qualified name recommended if using Unity Catalog for centralized governance
    primary_keys=["unique_id"],  # 'unique_id' ensures each record in the feature store is uniquely identifiable
    df=initial_features_df,      # DataFrame containing the initial set of engineered features
    description="Initial subset of engineered features for Lending Club Loan dataset"  # Descriptive metadata to help with dataset management and discovery
)

## Select another sub-set of Features

In [0]:
# Additional feature engineering
# Select additional features that describe the borrower's credit history
# and financial health, keyed by the same 'unique_id' as the initial set.
# A plain select() already returns a PySpark DataFrame, so no round trip
# through an RDD is needed.
additional_features_df = df.select(
    col("unique_id"),    # Unique identifier to maintain consistency with the primary key in the feature store
    col("delinq_2yrs"),  # Number of 30+ days past-due incidences in the borrower's credit file in the last 2 years
    col("pub_rec"),      # Number of derogatory public records (e.g., bankruptcies, liens)
    col("total_acc"),    # Total number of credit lines in the borrower's credit file
    col("mort_acc")      # Number of mortgage accounts the borrower has
)

## Merge in the Additional Features

In [0]:
# Merge the additional features into the existing feature table.
# In 'merge' mode, rows are matched on the primary key ('unique_id'):
# matching rows are updated with the incoming columns, and rows with new
# keys are inserted. Columns that are not yet in the table are added as
# new features, so the existing feature values are preserved.
# Because the table lives in Unity Catalog, the write is subject to the
# same access control and lineage tracking as any other table there.
fs.write_table(
    name=feature_table_name,    # Fully qualified feature table name
    df=additional_features_df,  # DataFrame containing the additional features
    mode="merge"                # Upsert by primary key instead of overwriting the table
)
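
As a conceptual model only, the upsert behavior of a merge keyed on `unique_id` can be sketched with plain dictionaries. This is not how the Feature Store implements the write; `sketch_merge` and the sample rows are hypothetical and exist purely to show which rows are updated and which are inserted.

```python
from typing import Dict, List

Row = Dict[str, object]


def sketch_merge(table: Dict[int, Row], updates: List[Row], key: str) -> Dict[int, Row]:
    """Upsert rows by primary key: update matching rows, insert new ones."""
    merged = {k: dict(row) for k, row in table.items()}  # copy, leave the input untouched
    for row in updates:
        pk = row[key]
        # Start from the existing row if there is one, otherwise from a new row
        merged.setdefault(pk, {key: pk}).update(row)
    return merged


existing = {
    1: {"unique_id": 1, "loan_amnt": 5000},
    2: {"unique_id": 2, "loan_amnt": 12000},
}
new_features = [
    {"unique_id": 1, "pub_rec": 0},  # key already present: the column is added
    {"unique_id": 3, "pub_rec": 2},  # new key: the row is inserted
]

result = sketch_merge(existing, new_features, key="unique_id")
print(result[1])  # {'unique_id': 1, 'loan_amnt': 5000, 'pub_rec': 0}
print(result[3])  # {'unique_id': 3, 'pub_rec': 2}
```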

# Prepare the Features for Model Generation

## Reading Features from the Feature Store

In [0]:
# Reading features from the Feature Store
# Features are read from the Feature Store and converted into a Pandas DataFrame for local processing.
features = fs.read_table(feature_table_name).toPandas()

## Load the Target Variable

In [0]:
# Load target variable (e.g., predicting loan default)
# The target variable 'defaulted' is created based on the 'loan_status' column.
# If the loan status is 'Charged Off', it indicates a default, and the value 
# is set to True; otherwise, it is False.
# The 'unique_id' ensures the target variable aligns with the features 
# in the Feature Store for proper model training and evaluation.
# The labels stay in a PySpark DataFrame here; the conversion to a Pandas
# DataFrame happens in the merge step that follows.
labels = df.select("unique_id", (col("loan_status") == "Charged Off").alias("defaulted"))
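
One detail worth checking is what happens to rows whose `loan_status` is missing. In SQL semantics, comparing NULL with a string yields NULL rather than False, which is why a later step drops rows with a missing target. The sketch below mirrors that rule in plain Python; `sketch_defaulted` is a hypothetical helper and the status values are illustrative.

```python
from typing import Optional


def sketch_defaulted(loan_status: Optional[str]) -> Optional[bool]:
    """Mirror SQL three-valued logic: a NULL status gives a NULL label."""
    if loan_status is None:
        return None
    return loan_status == "Charged Off"


statuses = ["Charged Off", "Fully Paid", "Current", None]
print([sketch_defaulted(s) for s in statuses])  # [True, False, False, None]
```

Note that every status other than 'Charged Off' maps to False, so the negative class is broader than fully paid loans.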

## Merging the Features with the target variable

In [0]:
# Merge features with labels
# Combines the engineered features from the Feature Store 
# with the target variable 'defaulted'.
# The merge is performed using 'unique_id' to ensure each feature set 
# is accurately linked to its corresponding target label.
data = pd.merge(features, labels.toPandas(), on="unique_id")

In [0]:
print(data.shape)

## Drop rows with missing Target Values

In [0]:
# Drop rows with missing target values
# Ensures that any records without a defined 'defaulted' status are removed.
# This step is crucial because machine learning models cannot train on missing target labels.
data = data.dropna(subset=["defaulted"])

## Separating the Features from the Target Values

In [0]:
# Prepare features for model training
# Drops 'unique_id' (as it's not useful for prediction) and 'defaulted' (the target variable).
# This ensures the model only trains on relevant feature columns.
X = data.drop(columns=["unique_id", "defaulted"])

## Ensure all Features are Numeric

In [0]:
# Ensure all features are numeric
# Converts all feature columns to numeric types, coercing errors to NaN if necessary.
# This is essential because machine learning algorithms require numeric input.
X = X.apply(pd.to_numeric, errors='coerce')

In [0]:
print(X.shape)

## Impute Missing Values

In [0]:
# Impute missing values using mean strategy
# The SimpleImputer replaces missing values (NaNs) in the feature set with the mean of each column.
# This is a common preprocessing step to handle incomplete data, ensuring that machine learning algorithms
# can be trained without errors due to missing values.
# Note: the imputer is fitted here on all rows, before the train/test split, so the column means
# include test rows. For a stricter evaluation, fit it on the training split only.
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
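
Mean imputation itself is simple enough to sketch in plain Python. The version below learns the column means from one set of rows and applies them to another, which is how one would keep held-out rows out of the fill values. The helper names and the tiny data set are hypothetical, and the sketch assumes every column has at least one observed value.

```python
import math
from statistics import fmean
from typing import List


def fit_column_means(rows: List[List[float]]) -> List[float]:
    """Compute each column's mean, ignoring NaN entries."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        means.append(fmean(observed))
    return means


def apply_means(rows: List[List[float]], means: List[float]) -> List[List[float]]:
    """Replace each NaN with the learned mean of its column."""
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(r)] for r in rows]


nan = float("nan")
train = [[1.0, 10.0], [3.0, nan], [nan, 30.0]]
test = [[nan, 50.0]]

means = fit_column_means(train)    # learned from the training rows only
print(means)                       # [2.0, 20.0]
print(apply_means(test, means))    # [[2.0, 50.0]]
```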

In [0]:
# Set the Target variable
y = data["defaulted"]

# Build the model with the Features from the Feature Store Table

## Split the dataset into Train and Test Datasets

In [0]:
# Split the dataset
# Divides the dataset into training and testing sets to evaluate model performance.
# 80% of the data is used for training, and 20% is reserved for testing.
# The 'random_state' ensures reproducibility by setting a seed for the random number generator.
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.2, random_state=42)
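
To illustrate why a fixed seed makes the split reproducible, here is a minimal sketch using only the standard library. `sketch_split` is a hypothetical helper; it does not reproduce scikit-learn's exact shuffling, and rounding the test count up is my understanding of how `train_test_split` sizes a fractional `test_size`.

```python
import math
import random
from typing import List, Tuple


def sketch_split(n_samples: int, test_size: float, seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle row indices with a seeded generator and cut off a test share."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # same seed, same order
    n_test = math.ceil(test_size * n_samples)  # test count rounded up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = sketch_split(n_samples=10, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))  # 8 2
```

Calling the helper again with the same seed returns the same index lists, while a different seed generally yields a different partition.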

## Train a Random Forest Classifier

In [0]:
# Train a Random Forest classifier
# Initializes the Random Forest model with 100 decision trees (estimators).
# The 'random_state' parameter ensures reproducibility by fixing the random number generation.
# The model is then trained using the training feature set (X_train) and target labels (y_train).
# The target variable is explicitly converted to integers to ensure compatibility with the classifier.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train.astype('int'))

## Make Predictions and Calculate the Model Accuracy

In [0]:
# Predict and evaluate
# Ensure target labels are binary and consistent
# Converts the target labels in y_test to integers if they are in boolean format.
# This step is crucial because scikit-learn's classification metrics expect numerical labels (0 and 1).
y_test = y_test.astype(int)

# Make predictions on the test dataset
# Uses the trained Random Forest model to predict loan default outcomes on the test set.
y_pred = model.predict(X_test)

# Check for unexpected values
# Prints the unique values in y_test and y_pred to ensure both contain only binary values (0 and 1).
# This step helps identify any anomalies or data inconsistencies that could affect evaluation.
print("Unique values in y_test:", y_test.unique())
print("Unique values in y_pred:", pd.Series(y_pred).unique())

# Evaluate model accuracy
# Calculates the accuracy of the model by comparing predicted values (y_pred) with actual values (y_test).
# Accuracy represents the proportion of correct predictions out of all predictions made.
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy}")
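
Accuracy alone can look strong when one class dominates, as is common for default prediction. The sketch below computes accuracy by hand and compares it with a baseline that always predicts the majority class. The label lists are invented for illustration and are not results from this notebook.

```python
from collections import Counter
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true: Sequence[int]) -> float:
    """Accuracy of always predicting the most frequent class."""
    majority = Counter(y_true).most_common(1)[0][0]
    return accuracy(y_true, [majority] * len(y_true))


y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(accuracy(y_true, y_hat))    # 0.8
print(majority_baseline(y_true))  # 0.8
```

In this made-up case the classifier is no more accurate than the trivial baseline, which is why precision and recall are worth reporting alongside accuracy.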

## Infer the signature for later use in model registration

In [0]:
from mlflow.models import infer_signature

# Infer model signature for Unity Catalog
signature = infer_signature(X_test, y_pred)

# Create Visualizations on the model

## Create a Feature Importance Plot

In [0]:
# Extract feature importances from the trained model
feature_importances = model.feature_importances_

# Create a DataFrame with feature names and their corresponding importance scores
importance_df = pd.DataFrame({
    'Feature': X.columns,  # X.columns gives the names of your features
    'Importance': feature_importances
})

# Sort the features by importance (highest to lowest) for better visualization
importance_df = importance_df.sort_values(by='Importance', ascending=False)


In [0]:
import matplotlib.pyplot as plt
import seaborn as sns

# Ensure 'importance_df' exists and is correctly structured
# Example: importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': model.feature_importances_})

# Visualize feature importance
plt.figure(figsize=(10, 6))

# Apply a clean, modern style for publication
sns.set_style("whitegrid")  # Add a white grid background
sns.set_context("notebook", font_scale=1.2)  # Increase font size for better visibility

# Plot feature importances
ax = sns.barplot(x='Importance', y='Feature', data=importance_df, palette='viridis', edgecolor='black')

# Enhance plot aesthetics
ax.set_title('Feature Importances in Random Forest Model', fontsize=16, weight='bold')
ax.set_xlabel('Importance Score', fontsize=12)
ax.set_ylabel('Feature', fontsize=12)

# Add data labels on bars for clarity
for p in ax.patches:
    width = p.get_width()
    # Add a small offset to prevent overlap with the bar edge
    ax.annotate(f'{width:.3f}', (width + 0.01, p.get_y() + p.get_height() / 2),
                ha='left', va='center', fontsize=10, color='black', weight='bold')

# Remove top and right spines for a cleaner look
sns.despine()

# Ensure layout fits well
plt.tight_layout()
plt.show()


## Plot the Confusion Matrix

In [0]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, ConfusionMatrixDisplay
import pandas as pd

# Reuse the Random Forest trained on the Feature Store data instead of
# training a new model on synthetic data, so that the plot and the metrics
# describe the lending club model and the earlier variables are not overwritten.
clf = model

# Make predictions on the held-out test set from the train/test split
y_pred = clf.predict(X_test)

# Visualize predictions against actuals

# Apply a clean, modern style for the plot
sns.set_style("whitegrid")
sns.set_context("notebook", font_scale=1.2)

# Create a confusion matrix to compare predictions and actuals
cm = confusion_matrix(y_test, y_pred)
cmd = ConfusionMatrixDisplay(cm, display_labels=["Fully Paid", "Defaulted"])

# Plot the confusion matrix
fig, ax = plt.subplots(figsize=(8, 5))
cmd.plot(cmap='coolwarm', values_format='d', ax=ax, colorbar=False)

# Enhance aesthetics by removing grid lines and adding custom formatting
ax.set_title('Confusion Matrix: Predictions vs Actuals', fontsize=16, weight='bold')
ax.set_xlabel('Predicted Label', fontsize=12)
ax.set_ylabel('True Label', fontsize=12)

# Remove the grid lines for a cleaner look
ax.grid(False)

# Render the cell counts in black for better contrast
for text in ax.texts:
    text.set_color('black')

# Adjust tick labels for clarity
ax.xaxis.set_tick_params(width=0)
ax.yaxis.set_tick_params(width=0)

# Remove spines for a more minimalistic appearance
sns.despine(left=True, bottom=True)

plt.tight_layout()
plt.show()

# Calculate performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Create a DataFrame for metrics
metrics_df = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score'],
    'Score': [accuracy, precision, recall, f1]
})

# Display the metrics table
plt.figure(figsize=(6, 2))
plt.axis('off')
table = plt.table(cellText=metrics_df.values, colLabels=metrics_df.columns, cellLoc='center', loc='center')
table.auto_set_font_size(False)
table.set_fontsize(12)
table.scale(1, 1.5)
plt.title('Model Performance Metrics')
plt.show()
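
The four scores in the table can all be derived from the cells of a binary confusion matrix. scikit-learn lays the matrix out as `[[tn, fp], [fn, tp]]`, with true labels on the rows. The sketch below recomputes the metrics from those counts; `binary_metrics` is a hypothetical helper and the counts are invented for illustration.

```python
from typing import Dict


def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> Dict[str, float]:
    """Derive accuracy, precision, recall and F1 from confusion matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


scores = binary_metrics(tn=80, fp=10, fn=5, tp=5)
print({name: round(value, 3) for name, value in scores.items()})
# {'accuracy': 0.85, 'precision': 0.333, 'recall': 0.5, 'f1': 0.4}
```

Returning 0.0 when a denominator is zero is a choice made for this sketch, for the case where the model never predicts the positive class.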


## Visualize the distribution of the Target Variable

In [0]:
# Visualize the distribution of the target variable
plt.figure(figsize=(8, 5))

# Create a count plot with a modern, clean style
sns.set_style("whitegrid")  # Apply a white grid background for better readability
sns.set_context("notebook", font_scale=1.2)  # Increase font size for better visibility in print

# Plot the distribution of loan defaults
ax = sns.countplot(x='defaulted', data=data, palette='coolwarm', edgecolor='black')

# Enhance plot aesthetics
ax.set_title('Distribution of Loan Defaults', fontsize=16, weight='bold')
ax.set_xlabel('Defaulted (False = Not Charged Off, True = Charged Off)', fontsize=12)
ax.set_ylabel('Number of Loans', fontsize=12)

# Add data labels on bars for clarity
for p in ax.patches:
    height = p.get_height()
    # Bar heights are floats; show them as whole loan counts
    ax.annotate(f'{int(height)}', (p.get_x() + p.get_width() / 2., height),
                ha='center', va='bottom', fontsize=10, color='black', weight='bold')

# Remove top and right spines for a cleaner look
sns.despine()

plt.tight_layout()
plt.show()

# Log the model in the Feature Store

In [0]:
# The unique identifier of an MLflow run.
# MLflow assigns a unique 'run_id' to each run. This ID identifies the run's
# parameters, metrics, and artifacts.
# Note: this hard-coded value is not used below; the registration cell takes
# the ID from the run it starts (run.info.run_id).
run_id = "3cb01eb8044e47feb5d9dcaedbfc9fcd"

# The name assigned to the model when registering it in MLflow.
# This 'model_name' serves as a reference for the registered model in both MLflow 
# and Unity Catalog, enabling consistent access across different environments.
model_name = "lending_club_default_model"

# The catalog name in Unity Catalog, representing the top-level namespace.
# Unity Catalog uses 'catalog_name' as the primary organizational layer 
# for managing data and models across the lakehouse.
catalog_name = CATALOG_NAME  # Ensure CATALOG_NAME is defined earlier in your code.

# The schema name within the catalog, used for organizing registered models.
# The 'schema_name' helps categorize models within a specific catalog, 
# promoting better governance and discoverability.
schema_name = FEATURE_STORE_DB  # Ensure FEATURE_STORE_DB is defined earlier in your code.

# Construct the fully qualified Unity Catalog model name.
# The format 'catalog.schema.model_name' ensures that the model is uniquely 
# identifiable and governed across workspaces in Databricks.
unity_model_name = f"{catalog_name}.{schema_name}.{model_name}"

# Display the full model name in Unity Catalog for verification.
# Printing it before registration makes it easy to confirm the catalog
# and schema the model will be registered under.
print(f"Unity Model Name: {unity_model_name}")

In [0]:
# Set the registry URI to Databricks Unity Catalog.
# By default, MLflow uses its internal model registry for tracking models. 
# This command overrides the default and instructs MLflow to use Databricks Unity Catalog 
# as the backend for model management.
#
# Unity Catalog provides centralized governance, lineage tracking, and access control 
# for models across all Databricks workspaces. By integrating MLflow with Unity Catalog, 
# models benefit from enhanced security, consistent data governance, and 
# cross-workspace discoverability.
mlflow.set_registry_uri("databricks-uc")

In [0]:
from mlflow import MlflowClient 

# Start an MLflow run to log model artifacts and metrics
# This context manager initializes a new MLflow run, allowing you to log models, metrics, 
# and other artifacts. The run provides a unique run ID that helps track and reference 
# the model within the MLflow and Unity Catalog ecosystems.
with mlflow.start_run() as run:
    
    # Log the trained Random Forest model along with its signature
    # The model is logged using MLflow's sklearn integration. The 'signature' ensures 
    # that the input-output schema of the model is recorded, maintaining consistency 
    # between training and inference. 'model' is the classifier trained on the
    # Feature Store data, which is the model the signature was inferred from.
    mlflow.sklearn.log_model(model, "model", signature=signature)
    
    # Log the model's accuracy metric
    # This captures the model's accuracy as a performance metric in MLflow, 
    # allowing easy tracking, comparison, and evaluation across different experiments.
    mlflow.log_metric("accuracy", accuracy)

    # Register the model in Unity Catalog
    # This step registers the logged model in Unity Catalog for centralized governance 
    # and cross-workspace accessibility. The 'run.info.run_id' links this registration 
    # to the specific MLflow run, ensuring full traceability.
    model_version = mlflow.register_model(f"runs:/{run.info.run_id}/model", unity_model_name)

    # Define feature lookups for model deployment
    # The FeatureLookup specifies which features the model requires at inference time. 
    # It maps the 'unique_id' key to features stored in the Feature Store, ensuring 
    # consistency between training and serving.
    feature_lookups = [
        FeatureLookup(
            table_name=feature_table_name,  # Name of the feature table in Unity Catalog
            # All ten features the model was trained on: the initial subset plus the merged-in additional features
            feature_names=["loan_amnt", "int_rate", "annual_inc", "dti", "revol_util", "open_acc",
                           "delinq_2yrs", "pub_rec", "total_acc", "mort_acc"],
            lookup_key="unique_id"  # Primary key used for joining features during inference
        )
    ]

    # Link the model to the Feature Store for feature consistency during inference
    # This step logs the model in both MLflow and the Feature Store, associating it 
    # with the features used during training. The 'training_set' ensures that the model 
    # remains consistent with the feature data, enabling reproducibility and 
    # simplifying deployment workflows.
    fs.log_model(
        model=model,  # The Random Forest trained on the Feature Store data
        artifact_path="model",  # Path within the MLflow run where the model is stored
        flavor=mlflow.sklearn,  # Specifies the MLflow flavor for scikit-learn models
        training_set=fs.create_training_set(
            df=df.select("unique_id", (col("loan_status") == "Charged Off").alias("defaulted")),  # Dataset with labels
            feature_lookups=feature_lookups,  # Defined feature lookups for inference
            label="defaulted"  # Target variable used during training
        ),
        registered_model_name=unity_model_name  # Fully qualified model name in Unity Catalog
    )

# Confirm successful model training, logging, and registration
print("Model training and logging to Unity Catalog completed.")


# End of Notebook