# Oil Extraction Production Forecasting
<br/>
<img src="https://www.nsenergybusiness.com/wp-content/uploads/sites/4/2022/07/refinery-ga56d4972f_640.jpg" />

## Advanced feature engineering
In this notebook we'll take what we learned in 01_Data_Exploration and build transformed feature sets to use in training our model. We will use the Databricks Feature Engineering Client to manage and store these values as feature tables, and we'll kick off the experiment that logs all of our work.

### Initialization
Below is an initialization block to help us out. It gives each user their own set of unique names. Don't worry too much about what it's doing; it exists mostly because several users are running the same lab with the same parameters in a shared workspace and we don't want any collisions. For enterprise work this is largely unnecessary.

In [0]:
import hashlib, base64

#IMPORTANT! DO NOT CHANGE THESE VALUES!!!!
catalog = "workshop"
db = "default"
current_user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().get("user").get()
hash_object = hashlib.sha256(current_user.encode())
hash_user_id = base64.b32encode(hash_object.digest()).decode("utf-8").rstrip("=")[:12]  #Trim to 12 chars for readability
initials = "".join([x[0] for x in current_user.split("@")[0].split(".")])
short_hash = hashlib.md5(current_user.encode()).hexdigest()[:8]  #Short 8-char hash
safe_user_id = f"{initials.upper()}_{short_hash}"
src_table = f"{safe_user_id}_oil_yield"
model_name = f"{safe_user_id}_oil_yield_forecast"
model_uri = f"{catalog}.{db}.{model_name}"
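
The naming logic above can also be read as a small pure function. The sketch below is a restatement for illustration only: `make_safe_user_id` and the example address are hypothetical, and it skips the `dbutils` lookup so that it runs outside Databricks.

```python
import hashlib


def make_safe_user_id(user_email: str) -> str:
    """Build a short, table-name-safe prefix from an email address."""
    # Initials come from the dot-separated local part, e.g. "jane.doe" -> "JD".
    local_part = user_email.split("@")[0]
    initials = "".join(part[0] for part in local_part.split(".") if part)
    # The first 8 hex characters of an MD5 digest keep the suffix short.
    # The hash is used for uniqueness only, not for security.
    short_hash = hashlib.md5(user_email.encode()).hexdigest()[:8]
    return f"{initials.upper()}_{short_hash}"


# Hypothetical address used only for illustration.
user_id = make_safe_user_id("jane.doe@example.com")
src_table_example = f"{user_id}_oil_yield"
```

Because the hash depends on the full address, two users who share the same initials should still end up with different table and model names.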

### Feature Engineering Client
The Databricks Feature Engineering Client simplifies the process of creating, managing, and serving features for machine learning models. It provides a unified API to ingest, transform, and store features in a feature table, ensuring consistency between training and inference. With built-in support for feature versioning, lineage tracking, and real-time feature serving, it helps streamline MLOps workflows and reduce data leakage risks. By leveraging Delta Lake and MLflow, the Feature Engineering Client enables efficient, scalable, and reproducible feature engineering in Databricks.

In the cell below we will use it to load the (untransformed) features that we created in the last notebook. This lets us preserve the integrity of the features, promote discovery of where they are used, and manage the lineage of our data.

In [0]:
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

df = fe.read_table(
  name=f'{catalog}.{db}.{src_table}_features'
)

### Where do we normalize & transform our data?
We can handle our data normalization in one of two ways. We can compute the transformed values as the data lands in the feature tables, which we would normally do as part of the ingestion pipeline, or we can process them late, in a wrapper function around the compiled model. Both have benefits and drawbacks, but for this lab we'll simulate pre-processing the features as though it were part of the ingestion pipeline.
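
To make the second option more concrete, here is a minimal sketch of late-stage processing in plain Python. Everything in it is an assumption for illustration: `PreprocessingWrapper`, `shifted_boxcox`, the stand-in model (`sum`) and the lambda value are hypothetical and are not part of this lab's pipeline.

```python
import math
from typing import Callable, Dict, List


def shifted_boxcox(value: float, lam: float) -> float:
    """Box-Cox of (value + 1); the +1 shift avoids taking the log of zero."""
    shifted = value + 1.0
    if lam == 0:
        return math.log(shifted)
    return (shifted ** lam - 1.0) / lam


class PreprocessingWrapper:
    """Hypothetical wrapper: transforms raw features, then calls the model."""

    def __init__(self, model: Callable[[List[float]], float], lambdas: Dict[str, float]):
        self.model = model
        self.lambdas = lambdas  # feature name -> stored lambda value

    def predict(self, raw_row: Dict[str, float]) -> float:
        # Transform only the features that have a stored lambda and pass the
        # rest through unchanged. Features are ordered by name for stability.
        features = [
            shifted_boxcox(value, self.lambdas[name]) if name in self.lambdas else value
            for name, value in sorted(raw_row.items())
        ]
        return self.model(features)


# Stand-in model (an assumption): it simply sums its inputs.
wrapper = PreprocessingWrapper(model=sum, lambdas={"precipitation": 0.5})
prediction = wrapper.predict({"precipitation": 3.0, "temperature": 20.0})
```

In this arrangement callers send raw values and the transformation travels with the model, at the cost of recomputing it on every prediction. Pre-processing in the ingestion pipeline does the work once, but every consumer then has to read from the transformed table.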

### Figuring out our lambda values
Calculating and storing our lambda values once is a good idea.

Storing lambda values for Box-Cox transformations in a dedicated Delta table ensures consistency, reproducibility, and efficiency across our ML pipeline. Since the Box-Cox transformation is parametric, meaning it depends on the estimated lambda to stabilize variance and normalize data, computing these values once prevents discrepancies between training and inference. By storing them in a Delta table, you:
- Ensure Consistency – The same lambda values are used across different pipeline stages, avoiding mismatches that could degrade model performance.
- Improve Reproducibility – Enables easy retrieval of precomputed lambda values for future transformations, keeping experiments and deployments aligned.
- Enhance Efficiency – Avoids redundant recomputation, reducing computational overhead when processing large datasets in Databricks.
- Enable Auditing & Versioning – Delta Lake’s built-in version control allows tracking changes to lambda values, making it easier to debug and maintain transformations over time.

This structured approach supports scalable and production-ready ML workflows, ensuring that your transformed features remain stable and reliable throughout model development and deployment.

In simpler terms, doing it once means it's done the same way every time and can be applied in the future with expected (idempotent) results.

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
import pandas as pd
from scipy.stats import boxcox, yeojohnson

#Convert PySpark DF to Pandas for Box-Cox Calculation
df_pd = df.select("yield_bbl", "precipitation", "temperature").toPandas()

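#Fit Box-Cox (yield, precipitation) and Yeo-Johnson (temperature) transformations & keep the fitted lambda values
#Box-Cox requires strictly positive input, so shift by 1 to avoid zeros
df_pd["yield_bbl"], lambda_yield = boxcox(df_pd["yield_bbl"] + 1)
df_pd["precipitation"], lambda_precip = boxcox(df_pd["precipitation"] + 1)
#Yeo-Johnson also accepts zero and negative values, so temperature needs no shift
df_pd["temperature"], lambda_temp = yeojohnson(df_pd["temperature"])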

#Convert back to Spark DF
df_transformed = spark.createDataFrame(pd.DataFrame(df_pd))

#Define schema for lambda values DataFrame
schema = StructType([
    StructField("feature_name", StringType(), True),
    StructField("lambda_value", DoubleType(), True)
])

#Convert numpy.float64 to native Python float
lambda_yield = float(lambda_yield)
lambda_precip = float(lambda_precip)
lambda_temp = float(lambda_temp)

#Add lambda values as feature metadata
df_lambdas = spark.createDataFrame([
    ("lambda_yield", lambda_yield),
    ("lambda_precipitation", lambda_precip),
    ("lambda_temp", lambda_temp)
], schema)

#Store lambda values in Delta table (feature metadata)
df_lambdas.write.mode("overwrite").format("delta").saveAsTable(f"{catalog}.{db}.{src_table}_lambdas")

print(f"Stored Box-Cox lambdas: {lambda_yield}, {lambda_precip}, {lambda_temp}")
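
One practical reason to keep the lambdas around: if a model ends up predicting `yield_bbl` on the transformed scale, the prediction has to be mapped back to barrels with the same lambda. The sketch below shows the inverse of the shifted Box-Cox transform in plain Python. The helper names and the numbers are illustrative assumptions, not values from this dataset.

```python
import math


def shifted_boxcox(value: float, lam: float) -> float:
    """Forward transform: Box-Cox of (value + 1)."""
    shifted = value + 1.0
    return math.log(shifted) if lam == 0 else (shifted ** lam - 1.0) / lam


def inverse_shifted_boxcox(transformed: float, lam: float) -> float:
    """Undo shifted_boxcox, returning a value on the original scale."""
    if lam == 0:
        return math.exp(transformed) - 1.0
    return (lam * transformed + 1.0) ** (1.0 / lam) - 1.0


# Illustrative numbers only; a real lambda would come from the stored table.
example_lambda = 0.5
original_yield = 1200.0
round_trip = inverse_shifted_boxcox(
    shifted_boxcox(original_yield, example_lambda), example_lambda
)
```

The round trip only recovers the original value when both directions use the same lambda, which is the consistency argument made above.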

### Applying the lambda values
Applying the lambda values is fairly straightforward. For each feature value `x` (shifted by 1, as when we fitted the lambdas) the Box-Cox transform computes `((x + 1)^lambda - 1) / lambda`, or `ln(x + 1)` when lambda is 0. Temperature's lambda was fitted with Yeo-Johnson, which reduces to the same formula for non-negative values. We apply this to every row and store the results as new columns with the `_transformed` suffix to preserve the original values (we'll see later that we'll be using an un-transformed version of temperature in our prediction).

As long as we apply this consistently, even new data coming in can be transformed in the same way. New data can be either actual or inferred data to make a prediction decision.

In [0]:
from pyspark.sql.functions import col, log, when, lit
import math

#Load lambda values from feature store
df_lambdas = spark.read.table(f"{catalog}.{db}.{src_table}_lambdas")
lambda_dict = {row["feature_name"]: row["lambda_value"] for row in df_lambdas.collect()}

lambda_yield = lambda_dict["lambda_yield"]
lambda_precip = lambda_dict["lambda_precipitation"]
lambda_temp = lambda_dict["lambda_temp"]

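#Define Box-Cox transformation function in PySpark (applies the same +1 shift used when fitting)
def boxcox_pyspark(column, lambda_value):
    return when(lambda_value == 0, log(col(column) + 1)).otherwise(
        ((col(column) + 1) ** lambda_value - 1) / lambda_value
    )

#Define Yeo-Johnson transformation function in PySpark
#For values >= 0 it equals Box-Cox of (value + 1); negative values use the mirrored branch with (2 - lambda)
def yeojohnson_pyspark(column, lambda_value):
    negative_branch = when(lambda_value == 2, -log(1 - col(column))).otherwise(
        -((1 - col(column)) ** (2 - lambda_value) - 1) / (2 - lambda_value)
    )
    return when(col(column) >= 0, boxcox_pyspark(column, lambda_value)).otherwise(negative_branch)

#Apply transformations in PySpark
df = df.withColumn("yield_bbl_transformed", boxcox_pyspark("yield_bbl", lit(lambda_yield)))
df = df.withColumn("precipitation_transformed", boxcox_pyspark("precipitation", lit(lambda_precip)))
#Temperature's lambda was fitted with yeojohnson, so apply the matching transformation
df = df.withColumn("temperature_transformed", yeojohnson_pyspark("temperature", lit(lambda_temp)))

print("Box-Cox and Yeo-Johnson transformed columns added to the DataFrame.")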
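
Because the temperature lambda was fitted with `yeojohnson`, it helps to see how that transform relates to the shifted Box-Cox formula. The plain-Python sketch below uses hypothetical helper names and an illustrative lambda. It follows the standard Yeo-Johnson definition, which matches Box-Cox of `x + 1` for non-negative inputs and switches to a mirrored branch for negative ones.

```python
import math


def shifted_boxcox(value: float, lam: float) -> float:
    """Box-Cox applied to (value + 1); only meaningful for value > -1."""
    shifted = value + 1.0
    return math.log(shifted) if lam == 0 else (shifted ** lam - 1.0) / lam


def yeo_johnson(value: float, lam: float) -> float:
    """Yeo-Johnson transform, defined for any real value."""
    if value >= 0:
        # Non-negative inputs: identical to Box-Cox of (value + 1).
        return shifted_boxcox(value, lam)
    # Negative inputs: mirrored branch using (2 - lambda).
    mirrored = 1.0 - value
    if lam == 2:
        return -math.log(mirrored)
    return -((mirrored ** (2.0 - lam)) - 1.0) / (2.0 - lam)


example_lambda = 0.5  # illustrative value, not a fitted lambda
warm = yeo_johnson(15.0, example_lambda)      # same result as shifted_boxcox(15.0, 0.5)
freezing = yeo_johnson(-3.0, example_lambda)  # shifted Box-Cox has no real value here
```

For a dataset whose temperatures never drop below zero the two formulas give the same numbers. The difference only matters once negative readings appear.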

### Committing our transformed features
Much like our raw features we stored in the last notebook, we'll create a new feature table with the transformed data. Using the Feature Engineering Client we can store this information in a new feature table.

#### New v. Old feature tables
Why not just update the original feature table with the transformed values? This is a valid debate and one that can go either way. Personally, I'm a fan of decoupling responsibilities. The first feature table (sans transformations) can be used for a lot more. This feature table contains transformations that are specific to our use case, so it's clearer to me to have a separate feature table dedicated to this use case. I have to assume that later I'll want to fork my original features, and I don't want recursive logic muddying those waters.

In [0]:
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

#Create feature table with `id` and `date` as the primary keys (`date` is also the timeseries column).
customer_feature_table = fe.create_table(
  name=f'{catalog}.{db}.{src_table}_features_transformed',
  primary_keys=['id', 'date'],
  schema=df.schema,
  description='oil yield features - transformed',
  df=df,
  timeseries_columns='date'
)

#### Adding our lambda values to a new experiment
This is the first step in setting up the experiment we will use for the rest of this project. Logging our lambda values as parameters in the experiment lets us trace back the actual values we used when computing our transformed data. Remember, transformed and normalized data is what we will train our model on. This associates our lambda values (stored in a dedicated Delta table) with the pipeline required for our model evaluation. Keeping all of this information together in an MLflow experiment is a precursor to an MLOps pipeline and 'true' agile ML engineering, where we can tune, re-train and modify future versions of our model with far less risk.

In [0]:
import mlflow

#Set a named experiment
mlflow.set_experiment(f"/Users/{current_user}/Oil Extraction Production Forecasting")

#Start MLflow run
with mlflow.start_run(run_name=f"{src_table} BoxCox Transformation"):

    #Log transformation parameters
    mlflow.log_param("lambda_yield", lambda_yield)
    mlflow.log_param("lambda_precipitation", lambda_precip)
    mlflow.log_param("lambda_temp", lambda_temp)

    #Log feature table paths
    mlflow.log_param("transformed_feature_table", f"{catalog}.{db}.{src_table}_features_transformed")
    mlflow.log_param("lambda_values_table", f"{catalog}.{db}.{src_table}_lambdas")

    print("Logged Box-Cox transformation details to MLflow.")
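
As a possible variation, the parameters could be assembled into one dictionary first, so that the values logged to the run are built from the same rows that live in the lambdas table. The sketch below is plain Python with a hypothetical helper (`build_run_params`) and made-up lambda values. The only MLflow call it refers to, `mlflow.log_params`, is left as a comment so that the block runs without a tracking server.

```python
from typing import Dict, List, Tuple


def build_run_params(
    lambda_rows: List[Tuple[str, float]], catalog: str, db: str, src_table: str
) -> Dict[str, object]:
    """Collect everything the run should record into a single dictionary."""
    # Rows mirror the (feature_name, lambda_value) layout of the lambdas table.
    params: Dict[str, object] = {name: float(value) for name, value in lambda_rows}
    params["transformed_feature_table"] = f"{catalog}.{db}.{src_table}_features_transformed"
    params["lambda_values_table"] = f"{catalog}.{db}.{src_table}_lambdas"
    return params


# Illustrative values only; real lambdas would be read from the stored table.
run_params = build_run_params(
    [("lambda_yield", 0.31), ("lambda_precipitation", 0.12), ("lambda_temp", 1.05)],
    catalog="workshop",
    db="default",
    src_table="JD_0a1b2c3d_oil_yield",
)
# Inside an active run, the whole dictionary could be logged in one call:
# mlflow.log_params(run_params)
```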

### Lab challenge
What would be the best way to handle pre-processing?
- What would be the costs & benefits of pre-processing in a pipeline?
- What would be the costs & benefits of pre-processing in a model function wrapper?
- How could we decouple pre-processing further?