# Oil Extraction Production Forecasting
<br/>
<img src="https://www.nsenergybusiness.com/wp-content/uploads/sites/4/2022/07/refinery-ga56d4972f_640.jpg" />

## Building our model using MLflow
Training and tuning a model using MLflow in Databricks provides a structured and efficient way to manage experiments, track performance, and optimize hyperparameters. With MLflow Tracking, you can log metrics, parameters, and artifacts, allowing you to compare different runs systematically. When tuning a model, you can integrate Hyperopt, Optuna, or custom grid/random search within MLflow to automate hyperparameter optimization. Once the best model is identified, MLflow Models enables seamless registration, versioning, and deployment, ensuring that the optimized model can be easily used for inference in production environments. This approach enhances reproducibility, collaboration, and operationalization of ML workflows.

In this notebook we'll be taking a look at how to construct an experiment to ensure that the best and most reliable models are compiled, logged, and registered for production inference.

### Loading custom libraries
For this notebook, we'll need to load some custom libraries. Specifically, we need joblib and joblibspark for distributed tuning. I've also included hyperopt as an alternative to Optuna. Using the `%pip` install command ensures all nodes in the cluster receive the package install.

_Note:_ After installing Python packages across the cluster, it's recommended to restart the Python interpreter.

In [0]:
%pip install hyperopt joblib pyspark joblibspark
dbutils.library.restartPython()

### Initialization
Below is an initialization block to help us out. It is designed so that each user gets their own set of unique names. Don't worry too much about what it's doing - this is mostly because we have several users doing the same lab with the same parameters in a shared workspace and don't want any collisions. For enterprise work this is largely unnecessary.

In [0]:
import hashlib, base64

#IMPORTANT! DO NOT CHANGE THESE VALUES!!!!
catalog = "workshop"
db = "default"
current_user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().get("user").get()
hash_object = hashlib.sha256(current_user.encode())
hash_user_id = base64.b32encode(hash_object.digest()).decode("utf-8").rstrip("=")[:12]  #Trim to 12 chars for readability
initials = "".join([x[0] for x in current_user.split("@")[0].split(".")])
short_hash = hashlib.md5(current_user.encode()).hexdigest()[:8]  #Short 8-char hash
safe_user_id = f"{initials.upper()}_{short_hash}"
src_table = f"{safe_user_id}_oil_yield"
model_name = f"{safe_user_id}_oil_yield_forecast"
model_uri = f"{catalog}.{db}.{model_name}"
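To see what the initialization block produces without a Databricks context, the naming logic can be sketched as a plain function. The email address below is hypothetical, and the `if piece` guard is an addition that the original cell does not have.

```python
import hashlib

def make_safe_user_id(user_email: str) -> str:
    """Build '<INITIALS>_<8-char md5 prefix>' from an email address."""
    local_part = user_email.split("@")[0]
    # First letter of each dot-separated piece of the local part
    initials = "".join(piece[0] for piece in local_part.split(".") if piece)
    short_hash = hashlib.md5(user_email.encode()).hexdigest()[:8]
    return f"{initials.upper()}_{short_hash}"

# Hypothetical user, for illustration only
example_user = "jane.doe@example.com"
safe_id = make_safe_user_id(example_user)

# The same ID is reused for the table name and the three-level model name
example_table = f"{safe_id}_oil_yield"
example_model = f"workshop.default.{safe_id}_oil_yield_forecast"
print(safe_id, example_table, example_model)
```

Because the hash is derived only from the email, the same user always gets the same names, which is what keeps shared-workspace runs from colliding.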

#### Setting our experiment and loading our features
Just like before, we'll use the common experiment and read our latest feature table with our transformations applied using the feature engineering client.

In [0]:
import mlflow

# Set a named experiment
mlflow.set_experiment(f"/Users/{current_user}/Oil Extraction Production Forecasting")

In [0]:
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

df = fe.read_table(
  name=f'{catalog}.{db}.{src_table}_features_transformed'
)

### Training and hyperparameter tuning
This is one of the most important parts of the experiment. Now that we've identified our features and performed the necessary transformations, it's time to build trial runs to find the best way to tune a model. Most training libraries expose settings that control how the algorithm learns. These settings are called hyperparameters, and they affect how the training pipeline behaves. Since we're using MLflow, we can log each of these runs and see how adjusting hyperparameters affects model performance.

**From Google:**
Hyperparameter tuning is the process of optimizing the configuration settings that control how a machine learning model learns, rather than the parameters it learns from data. These hyperparameters impact model performance, complexity, and generalization.

In XGBoost, hyperparameter tuning is crucial for maximizing predictive accuracy and avoiding overfitting. Key hyperparameters include:
- Learning rate (eta) – Controls step size during training.
- Max depth – Limits tree depth to balance the bias-variance tradeoff.
- Min child weight – Prevents overly complex trees.
- Subsample & colsample_bytree – Improve generalization by using random subsets of data and features.
- Gamma, lambda & alpha – Gamma sets the minimum loss reduction needed to make a split (pruning); lambda and alpha are the L2 and L1 regularization weights.

Defining the search space is essential to ensure tuning focuses on meaningful hyperparameter ranges rather than blindly testing values. Grid search, random search, and Bayesian optimization (Hyperopt) refine the selection process, balancing computational efficiency with finding the best model configuration. Properly tuning XGBoost leads to higher accuracy, better generalization, and reduced overfitting, making it a key step in model optimization.
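As a rough illustration of what "defining a search space" means, here is a minimal random-search sketch in plain Python. The ranges and the `toy_loss` function are made up for the example; in a real run the loss would come from training a model and scoring it on held-out data.

```python
import random

# Search space: name -> (low, high, kind). Ranges are illustrative only.
search_space = {
    "learning_rate": (0.01, 0.3, "float"),
    "max_depth": (3, 10, "int"),
    "subsample": (0.5, 1.0, "float"),
}

def sample_params(rng):
    """Draw one configuration uniformly from the search space."""
    params = {}
    for name, (low, high, kind) in search_space.items():
        params[name] = rng.randint(low, high) if kind == "int" else rng.uniform(low, high)
    return params

def toy_loss(params):
    """Stand-in for 'train a model and return its validation error'."""
    return (
        (params["learning_rate"] - 0.1) ** 2
        + 0.01 * (params["max_depth"] - 6) ** 2
        + (params["subsample"] - 0.8) ** 2
    )

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        params = sample_params(rng)
        trials.append((toy_loss(params), params))
    best_loss, best = min(trials, key=lambda t: t[0])
    return best_loss, best, trials

best_loss, best_params_demo, all_trials = random_search(n_trials=50)
print(f"best loss {best_loss:.4f} with {best_params_demo}")
```

Bayesian methods such as TPE differ from this sketch in one respect: they use the results of earlier trials to decide where to sample next, rather than drawing every configuration independently.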

#### A quick note on algorithms
##### XGBoost
XGBoost, which stands for Extreme Gradient Boosting, is a gradient boosting algorithm that uses decision trees as its base learners. It is an ensemble method: many decision trees are combined so that the ensemble predicts more accurately than any single tree. It is a powerful, scalable algorithm used for both classification and regression tasks.

A Gradient Boosted Regressor in XGBoost is a powerful machine learning algorithm that builds an ensemble of decision trees sequentially, where each new tree corrects the errors of the previous ones. It uses gradient descent to minimize the loss function, making it highly effective for forecasting tasks where capturing complex, nonlinear relationships is crucial.
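The "each new tree corrects the errors of the previous ones" idea can be sketched with one-split trees (stumps) in plain Python. This is a toy version of gradient boosting for squared error, not XGBoost's actual implementation; the data and the learning rate are made up.

```python
def fit_stump(xs, residuals):
    """Find the single threshold on a 1-D feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds, learning_rate=0.3):
    """Fit stumps sequentially; each one is trained on the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):
        # For squared error, the residual is the negative gradient of the loss
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history

def predict_boosted(base, stumps, x, learning_rate=0.3):
    pred = base
    for stump in stumps:
        pred += learning_rate * predict_stump(stump, x)
    return pred

# Made-up nonlinear data
xs_demo = list(range(10))
ys_demo = [x ** 2 for x in xs_demo]
base_demo, stumps_demo, mse_history = boost(xs_demo, ys_demo, n_rounds=20)
print(f"training MSE after round 1: {mse_history[0]:.2f}, after round 20: {mse_history[-1]:.2f}")
```

The training error shrinks round by round because every stump is fitted to what the ensemble still gets wrong; the learning rate controls how much of each correction is applied.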

In the context of forecasting, XGBoost’s gradient boosting approach helps:
- Capture trends and interactions in time-series and structured data.
- Handle missing values and outliers more robustly than traditional regressors.
- Improve predictive accuracy by reducing bias and variance through iterative learning.

With hyperparameter tuning and feature engineering, XGBoost can be a highly efficient choice for forecasting oil yield, energy prices, or demand trends, leveraging both structured time-series features and external variables for improved performance.

##### Prophet
Prophet is a time series forecasting algorithm based on an additive regression model: it decomposes a time series into components such as trend, seasonality, and holidays, then sums them to make predictions. This makes it particularly useful for data with strong seasonal patterns and non-linear trends. Prophet comes from Meta (the Facebook guys) and is a fairly lightweight, easy-to-use algorithm. We will see it later when we use it to do a short-term prediction of our boosting features.
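The additive structure can be illustrated without Prophet itself. The sketch below only shows how trend, seasonality, and holiday components sum into a forecast; the component shapes and numbers are invented, and Prophet fits its components from data rather than hard-coding them like this.

```python
import math

def trend(t):
    """Growth component g(t); slope and intercept are made up."""
    return 100.0 + 0.5 * t

def seasonality(t, period=7):
    """Periodic component s(t), here a single weekly sine wave."""
    return 10.0 * math.sin(2 * math.pi * t / period)

def holiday_effect(t, holidays=(10, 25)):
    """Holiday component h(t): a fixed bump on the listed days."""
    return 15.0 if t in holidays else 0.0

def additive_forecast(t):
    # The forecast is simply the sum of the three components
    return trend(t) + seasonality(t) + holiday_effect(t)

series = [additive_forecast(t) for t in range(28)]
print([round(v, 1) for v in series[:7]])
```

Because the components are added rather than multiplied, each one can be inspected on its own, which is a large part of why this kind of model is easy to interpret.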

##### LSTM
LSTM stands for "Long Short-Term Memory" and is a type of Recurrent Neural Network (RNN) designed to handle long-term dependencies in sequential data by mitigating the vanishing gradient problem common to traditional RNNs. This makes it effective for tasks like time series prediction and natural language processing.

Key points about LSTM:
- **Special architecture** – LSTMs use specialized "gates" to control the flow of information through the network, allowing them to selectively remember relevant information from the past while forgetting irrelevant details.
- **Applications** – Used in tasks like language modeling, machine translation, speech recognition, time series forecasting, and handwriting recognition.
- **Advantage over standard RNNs** – An LSTM's ability to learn long-term dependencies makes it a better fit than standard RNNs, which often struggle to remember information from distant past inputs.
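To make the gating idea concrete, here is a single step of a scalar LSTM cell in plain Python. The weights are hand-picked for illustration, not learned, and real LSTMs use vectors and matrices rather than single numbers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell.

    `w` maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much of the new candidate to write
    g = math.tanh(pre("candidate"))  # candidate values to write
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Hand-picked weights: forget gate wide open, input gate nearly shut
remember = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}
# Same cell, but with the forget gate nearly shut
forget = dict(remember, forget=(0.0, 0.0, -10.0))

h, c_kept = 0.0, 0.7
for x in [0.5, -1.0, 2.0, 0.1] * 5:  # 20 arbitrary inputs
    h, c_kept = lstm_step(x, h, c_kept, remember)

_, c_dropped = lstm_step(0.5, 0.0, 0.7, forget)
print(f"cell state kept: {c_kept:.4f}, cell state dropped: {c_dropped:.6f}")
```

With the forget gate open, the stored value survives 20 steps almost unchanged; with it shut, the value is wiped in a single step. Training an LSTM amounts to learning when each gate should open.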

#### Parallelizing our hyperparameter tuning
Using Joblib-Spark on Databricks lets you parallelize hyperparameter tuning with Optuna by distributing trial evaluations across a Spark cluster. By default Optuna runs trials one after another in a single process, so pairing it with Joblib-Spark allows multiple trials to run concurrently on different nodes, which can significantly speed up the search for optimal hyperparameters.

The usual pattern is to call `joblibspark.register_spark()` and then run `study.optimize()` inside a `joblib.parallel_backend("spark")` context, so that Spark's distributed computing is used to scale the optimization and reduce tuning time. This approach is particularly useful for training compute-intensive models like XGBoost, making better use of Databricks' cluster resources.

##### joblibspark vs. joblib with Loky

Both joblibspark and loky (joblib.parallel_backend("loky", n_jobs=-1)) enable parallelization for hyperparameter tuning with Optuna, but they differ in how they distribute computations:

1. Joblib-Spark (joblibspark.register_spark())
- Best for distributed environments (Databricks, Spark clusters)
- Runs Optuna trials across multiple Spark workers, allowing efficient use of a Databricks cluster.
- Scales well for large datasets and compute-intensive models (e.g., XGBoost, deep learning).
- Reduces the memory overhead on a single machine by distributing the workload.
- Requires Databricks or a Spark cluster to be effective.

2. Loky (joblib.parallel_backend("loky", n_jobs=-1))
- Best for single-node, multi-core parallelization (local CPU-based execution)
- Uses Python’s Loky multiprocessing backend to distribute trials across multiple CPU cores on a single machine.
- Ideal for smaller-scale tuning tasks where Spark overhead isn’t justified.
- May run into memory bottlenecks if too many parallel trials are executed on a machine with limited RAM/CPU.
- Does not leverage distributed computing beyond a single node.

**When to Use Each?**
Use joblibspark on Databricks when working with large datasets, long-running experiments, or when leveraging a Spark cluster for hyperparameter tuning. Use loky for local tuning on a single machine when running experiments on a laptop or a non-distributed environment where Spark isn’t available.

If you’re on Databricks with access to a Spark cluster, joblibspark is the clear choice for faster, scalable hyperparameter tuning.
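Both backends rest on the same observation: individual trials are independent, so they can be evaluated concurrently. The sketch below shows that idea using the standard library's thread pool as a stand-in; it does not use joblib, Loky, or Spark, and `evaluate_trial` is a made-up placeholder for a real train-and-score step.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_trial(learning_rate):
    """Stand-in for an expensive 'train a model and score it' step."""
    return (learning_rate - 0.1) ** 2

candidates = [0.01, 0.05, 0.1, 0.2, 0.3]

# Sequential baseline: one trial after another
sequential_scores = [evaluate_trial(lr) for lr in candidates]

# Same trials, handed to a pool of 4 workers
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_scores = list(pool.map(evaluate_trial, candidates))

best_lr = candidates[parallel_scores.index(min(parallel_scores))]
print(f"best learning rate: {best_lr}")
```

The results are identical either way; only the wall-clock time changes. Threads are used here purely to keep the example dependency-free. Process-based backends such as Loky, or cluster backends such as Spark, are what actually buy speed for CPU-heavy training.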

In [0]:
import optuna
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
import joblib

#Load dataset from our feature table
pdf = df.toPandas()

#Select features & target
features = ["temperature", "precipitation_transformed"]
target = "yield_bbl"

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(pdf[features], pdf[target], test_size=0.2, random_state=42)

#Define objective function
def objective(trial):
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.05, 1.0)
    }
    
    model = xgb.XGBRegressor(objective="reg:squarederror", **params)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    mae = mean_absolute_error(y_test, y_pred)
    
    #Log trial as a nested MLflow run
    with mlflow.start_run(nested=True):
        mlflow.log_params(params)
        mlflow.log_metric("trial_MAE", mae)
    
    return mae

with mlflow.start_run(run_name="XGBoost Optuna with Nested Runs") as parent_run:

    #Create the study (TPE is Optuna's default sampler)
    study = optuna.create_study(direction="minimize", study_name="xgboost_optuna", sampler=optuna.samplers.TPESampler())

    #Set parallelism
    with joblib.parallel_backend("loky", n_jobs=-1):
        study.optimize(objective, n_trials=50, timeout=600, n_jobs=4) #Run up to 4 trials concurrently

    #Get best parameters
    best_params = study.best_params

    #Log best parameters to parent run (must happen while the parent run is still active)
    mlflow.log_params(best_params)

print(f"\n✅ Best Hyperparameters Found: {best_params}")

### Training with the best hyperparameters

After a series of runs with Optuna, we've captured and logged the best parameters with our `best_params` object. By using MLflow's `log_params()` function we can store them like any other object in our experiment for later recall and evaluation.

Let's initiate a final training run using just the best parameters.

In [0]:
#Train optimized XGBoost model
best_xgb = xgb.XGBRegressor(objective="reg:squarederror", **best_params)
best_xgb.fit(X_train, y_train)

#Predict and evaluate
y_pred = best_xgb.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred) ** 0.5  #Square root taken manually; squared=False is deprecated in recent scikit-learn

print(f"\n✅ Optimized XGBoost Model Performance:")
print(f"  MAE: {mae:.2f}, RMSE: {rmse:.2f}")

mlflow.end_run()
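For reference, the two metrics printed above can be computed by hand. The values below are made up and are not taken from the oil yield table.

```python
import math

def mean_absolute_error_manual(y_true, y_pred):
    """Average size of the errors, in the target's own units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error_manual(y_true, y_pred):
    """Square root of the average squared error; penalizes large misses more."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

# Made-up values for illustration
actual = [3.0, 5.0, 2.0]
predicted = [2.0, 5.0, 4.0]

demo_mae = mean_absolute_error_manual(actual, predicted)
demo_rmse = root_mean_squared_error_manual(actual, predicted)
print(f"MAE: {demo_mae:.2f}, RMSE: {demo_rmse:.2f}")
```

RMSE is never smaller than MAE on the same predictions, and the gap widens when a few errors are much larger than the rest, so comparing the two gives a quick read on whether outliers dominate the error.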

### Logging and registering our model for use
Unity Catalog on Databricks is an important tool to help manage ML artifacts and models.

Logging and registering your model in Unity Catalog on Databricks ensures centralized governance, versioning, and secure access control across teams and workspaces. By leveraging Unity Catalog, you can:
- **Ensure Model Lineage & Reproducibility** – Track model metadata, parameters, and performance metrics alongside datasets and features, ensuring full traceability.
- **Enable Cross-Workspace Collaboration** – Share and manage models across multiple Databricks workspaces securely.
- **Streamline Deployment & Monitoring** – Seamlessly integrate models with MLflow Serving, batch inference, or real-time applications.
- **Enforce Access Controls & Compliance** – Use fine-grained permissions to control who can read, modify, and deploy models, ensuring governance in production.

By registering models in Unity Catalog, you create a scalable, production-ready ML lifecycle with governed access, auditing, and operational efficiency, making it ideal for enterprise-wide AI deployments.

As long as we set our registry URI to `databricks-uc`, MLflow understands that our model name will be a three-level Unity Catalog namespace (`catalog.schema.model`). There's a lot to cover here, which is largely out of scope for this lab, but more information can be found [here](https://docs.databricks.com/aws/en/mlflow/models).
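As a small sketch of how the three-level name is put together and later referenced, the helpers below build the strings only; they do not call MLflow. The helper names and the example model name are hypothetical, while the `models:/<name>/<version>` and `models:/<name>@<alias>` forms are the URI shapes MLflow uses for registered models.

```python
def uc_model_name(catalog, schema, model):
    """Three-level Unity Catalog name: <catalog>.<schema>.<model>."""
    return f"{catalog}.{schema}.{model}"

def models_uri_by_version(full_name, version):
    """URI that pins one specific registered version."""
    return f"models:/{full_name}/{version}"

def models_uri_by_alias(full_name, alias):
    """URI that follows whichever version currently holds the alias."""
    return f"models:/{full_name}@{alias}"

# Hypothetical model name, for illustration only
demo_name = uc_model_name("workshop", "default", "JD_1a2b3c4d_oil_yield_forecast")
print(models_uri_by_version(demo_name, 1))
print(models_uri_by_alias(demo_name, "Champion"))
```

Referencing a model by alias rather than by version number means downstream jobs pick up a newly promoted version without any code change.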

In [0]:
#If we want to use the UC registry rather than the local mlflow registry, set databricks-uc as the registry uri
mlflow.set_registry_uri("databricks-uc")

In [0]:
from mlflow.models.signature import infer_signature
from mlflow import MlflowClient

#Create an instance of the MLFlowClient class
client = MlflowClient()

#Prepare a sample input for signature inference
sample_input = X_train.iloc[:5]  # Take 5 rows from training data
sample_output = best_xgb.predict(sample_input)  # Predict output

#Infer signature
signature = infer_signature(sample_input, sample_output)

#Start MLflow run
with mlflow.start_run(run_name="XGBoost Optuna with joblibspark") as run:

    #Log best hyperparameters
    mlflow.log_params(best_params)

    #Log model performance
    mlflow.log_metric("MAE", mae)
    mlflow.log_metric("RMSE", rmse)

    #Log model to the run (it is registered in Unity Catalog just below)
    mlflow.xgboost.log_model(
        best_xgb,
        artifact_path=model_uri,
        signature=signature,
        input_example=X_test.sample(n=10)
    )

    run_id = run.info.run_id
    model_path = f'runs:/{run_id}/{catalog}.{db}.{model_name}'

    #Register a single new version in the UC Model Registry and keep its version number
    #(passing registered_model_name to log_model as well would create a second, duplicate version)
    model_version = mlflow.register_model(
        model_uri=model_path,
        name=f"{catalog}.{db}.{model_name}"
    ).version

    print("✅ Logged optimized model to MLflow")

#Update alias
alias_name = "Champion"
client.set_registered_model_alias(name=f"{catalog}.{db}.{model_name}", alias=alias_name, version=model_version)

### Testing our model
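Testing our model is as simple as loading it back and applying it to our original feature table. The cell below loads it by run URI; a model registered in Unity Catalog can also be loaded by alias with a URI of the form `models:/<catalog>.<schema>.<model>@Champion`.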

In [0]:
import mlflow
from pyspark.sql.functions import struct, col
logged_model = f'runs:/{run_id}/{catalog}.{db}.{model_name}'

#Load model as a Spark UDF. Override result_type if the model does not return double values.
loaded_model = mlflow.pyfunc.spark_udf(spark, model_uri=logged_model)

#Predict on a Spark DataFrame.
df2 = df.withColumn('predictions', loaded_model(struct(*map(col, df.columns))))

#Preview our sample forecast
display(df2)

Predictions are poor, as expected - we were tuning this for forward-looking forecasting.
Also, there are likely better algorithms for this specific dataset since we don't have a lot of valuable ancillary data.

This experiment serves as an example for a good starting point. There are many things we can do to iterate and build upon this experiment to improve on it.

### Lab Challenge: