# Oil Extraction Production Forecasting
<br/>
<img src="https://www.nsenergybusiness.com/wp-content/uploads/sites/4/2022/07/refinery-ga56d4972f_640.jpg" />

## Evaluating our transformations
Transformations and normalization are only worthwhile insofar as they have a net-positive impact on our model. In this notebook we'll do a sample training run and compare mean absolute error (MAE) and root mean squared error (RMSE) to see what impact our transformations have. This is an important step, and one we would need to keep if we were to productionize this work into a workflow pipeline. Once we have a good idea of the effect our transformations have, we can make an informed decision on how best to train our model.

### Initialization
Below is an initialization block to help us out. It is designed so that each user gets their own set of unique names. Don't worry too much about what it's doing; it exists mostly because we have several users doing the same lab with the same parameters in a shared workspace, and we don't want any collisions. For enterprise work this is largely unnecessary.

In [0]:
import hashlib, base64

#IMPORTANT! DO NOT CHANGE THESE VALUES!!!!
catalog = "workshop"
db = "default"
current_user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().get("user").get()
hash_object = hashlib.sha256(current_user.encode())
hash_user_id = base64.b32encode(hash_object.digest()).decode("utf-8").rstrip("=")[:12]  #Trim to 12 chars for readability
initials = "".join([x[0] for x in current_user.split("@")[0].split(".")])
short_hash = hashlib.md5(current_user.encode()).hexdigest()[:8]  #Short 8-char hash
safe_user_id = f"{initials.upper()}_{short_hash}"
src_table = f"{safe_user_id}_oil_yield"
model_name = f"{safe_user_id}_oil_yield_forecast"
model_uri = f"{catalog}.{db}.{model_name}"
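
To make the naming scheme concrete, here is a minimal sketch of how the initials-plus-hash identifier is derived. It mirrors the logic of the cell above but runs outside Databricks; the helper name `make_safe_user_id` and the email address are hypothetical and used only for illustration.

```python
import hashlib

def make_safe_user_id(user_email: str) -> str:
    """Build an 'INITIALS_hash' identifier from an email-style user name."""
    # Initials come from the dot-separated parts before the '@'
    local_part = user_email.split("@")[0]
    initials = "".join(part[0] for part in local_part.split("."))
    # A short MD5 prefix keeps names unique per user; it is not used for security
    short_hash = hashlib.md5(user_email.encode()).hexdigest()[:8]
    return f"{initials.upper()}_{short_hash}"

# Hypothetical user, for illustration only
example_id = make_safe_user_id("jane.doe@example.com")
print(example_id)
print(f"workshop.default.{example_id}_oil_yield_forecast")
```

Because the hash is computed from the full user name, two users who share the same initials still end up with different table and model names.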

### Loading our experiment context
Since we've already set up an experiment in the previous notebook, we'll load that one and use it going forward. We want to track everything in a common place to make it easy to tie our results back to our work.

In [0]:
import mlflow

#Set a named experiment. We want to use the same experiment where we logged our feature artifacts
mlflow.set_experiment(f"/Users/{current_user}/Oil Extraction Production Forecasting")

### Loading our transformed features
Just like last time, we'll use the feature engineering client to load our transformed features table. This will preserve the lineage of our work. Since we want equal access to the transformed and pre-transformed data we'll load all features for evaluation.

In [0]:
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

#Read in our feature table with normalized & transformed features for model training
df = fe.read_table(
  name=f'{catalog}.{db}.{src_table}_features_transformed'
)

### Evaluating and comparing transformations
Since we have our full feature set, we can look at the effect of our transformations and compare them.

A Box-Cox transformation is a power transformation that can stabilize variance, reduce skewness, and make data more normally distributed, which often helps statistical models and machine learning algorithms that assume roughly normal inputs. When it works well:
- Skewness is reduced, making the data more symmetrical and a better match for models that rely on normality.
- Kurtosis moves closer to that of a normal distribution, pulling in heavy tails so extreme values have less influence.
- Heteroscedasticity (unequal variance) is reduced, improving model stability.

By comparing pre- and post-transformation results, we can assess how well the transformation has normalized the distribution and whether it enhances model performance by making statistical properties more suitable for predictive algorithms.
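
As a rough illustration of what the transformation computes, the sketch below applies Box-Cox to synthetic, strictly positive data and checks SciPy's result against the textbook formula. The data and variable names are assumptions made for this example, not values from our feature table. It also shows Yeo-Johnson, a related transform that accepts zero and negative input, which Box-Cox does not.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic, right-skewed, strictly positive sample (stand-in for a real feature)
x = rng.lognormal(mean=0.0, sigma=1.0, size=2_000)

# With no lambda given, SciPy estimates it by maximum likelihood
x_bc, lam = stats.boxcox(x)

def boxcox_manual(values, lam):
    """Box-Cox: log(x) when lambda == 0, otherwise (x**lambda - 1) / lambda."""
    if lam == 0:
        return np.log(values)
    return (values ** lam - 1.0) / lam

# The manual formula should reproduce SciPy's output
assert np.allclose(x_bc, boxcox_manual(x, lam))

print(f"fitted lambda: {lam:.3f}")
print(f"skewness before: {stats.skew(x):.2f}, after: {stats.skew(x_bc):.2f}")

# Box-Cox needs strictly positive input; Yeo-Johnson also handles zero and negatives
x_shifted = x - 1.0  # shifted sample that now contains negative values
x_yj, lam_yj = stats.yeojohnson(x_shifted)
print(f"Yeo-Johnson lambda: {lam_yj:.3f}, skewness after: {stats.skew(x_yj):.2f}")
```

Note that the Box-Cox output is negative wherever the input is below 1, so negative values in a transformed column are expected rather than a sign of a problem.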

Let's have a look at the distributions side by side. We'll print skewness and kurtosis for each feature, then use seaborn histograms drawn on a matplotlib grid to compare the distributions before and after transformation.

In [0]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import skew, kurtosis

# Load transformed data
df_transformed = df.toPandas()

# Define function to print skewness & kurtosis
def check_distribution(df, feature):
    print(f"\nFeature: {feature}")
    print(f"  Skewness: {skew(df[feature]):.2f}")
    print(f"  Kurtosis: {kurtosis(df[feature]):.2f}")

# Compare original vs. transformed features
for feature in ["yield_bbl", "precipitation", "temperature"]:
    print("\n🔹 BEFORE Transformation:")
    check_distribution(df_transformed, feature)
    
    print("\n✅ AFTER Transformation:")
    check_distribution(df_transformed, f"{feature}_transformed")

# Plot distributions
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

for i, feature in enumerate(["yield_bbl", "precipitation", "temperature"]):
    sns.histplot(df_transformed[feature], bins=30, kde=True, ax=axes[0, i], color="red")
    axes[0, i].set_title(f"Before Box-Cox: {feature}")

    sns.histplot(df_transformed[f"{feature}_transformed"], bins=30, kde=True, ax=axes[1, i], color="blue")
    axes[1, i].set_title(f"After Box-Cox: {feature}")

plt.tight_layout()
plt.show()

We can see that precipitation clearly benefited from the transformation: skewness and kurtosis were both drastically reduced. Although a large share of the transformed values are negative (which could potentially affect our model), we can still justify its use here.
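
As a reference point for reading the skewness and kurtosis figures, the sketch below computes both on synthetic samples (an assumption for illustration, not our data). By default `scipy.stats.kurtosis` reports excess kurtosis, so a normal distribution scores close to 0 rather than 3.

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(42)
normal_sample = rng.normal(size=10_000)               # symmetric, light tails
skewed_sample = rng.lognormal(sigma=1.0, size=10_000)  # long right tail

for name, sample in [("normal", normal_sample), ("lognormal", skewed_sample)]:
    # kurtosis() uses the Fisher definition by default (normal distribution -> 0)
    print(f"{name:>9}: skewness={skew(sample):.2f}, excess kurtosis={kurtosis(sample):.2f}")

# Pearson's definition (normal distribution -> 3) is available via fisher=False
print(f"Pearson kurtosis of normal sample: {kurtosis(normal_sample, fisher=False):.2f}")
```

Values near zero for both statistics suggest a roughly normal shape, while large positive values point to a long right tail and heavy tails respectively.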

### Performing a trial run
Next, we'll do a trial training run. All we're doing here is checking how the Box-Cox or Yeo-Johnson transforms affect the accuracy of the trained model, measured with more tangible metrics such as mean absolute error (MAE) and root mean squared error (RMSE). What we're really looking for is an improvement (a reduction) in both. If little or no effect is noticed, we can conclude that the transformation didn't add much value. What we want to avoid is making the scores worse.

In [0]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

#Select features & target
features_original = ["temperature", "precipitation"]
features_transformed = ["temperature_transformed", "precipitation_transformed"]
target = "yield_bbl"

#Load datasets
df_transformed = df.toPandas()

#Train-test split
X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(df_transformed[features_original], df_transformed[target], test_size=0.2, random_state=42)
X_train_trans, X_test_trans, y_train_trans, y_test_trans = train_test_split(df_transformed[features_transformed], df_transformed[target], test_size=0.2, random_state=42)

#Train XGBoost models
model_orig = xgb.XGBRegressor(objective="reg:squarederror", n_estimators=100, learning_rate=0.1)
model_trans = xgb.XGBRegressor(objective="reg:squarederror", n_estimators=100, learning_rate=0.1)

model_orig.fit(X_train_orig, y_train_orig)
model_trans.fit(X_train_trans, y_train_trans)

#Predictions
y_pred_orig = model_orig.predict(X_test_orig)
y_pred_trans = model_trans.predict(X_test_trans)

#Compute errors
#Take the square root of MSE ourselves: the squared=False argument was deprecated and later removed in newer scikit-learn releases
mae_orig = mean_absolute_error(y_test_orig, y_pred_orig)
rmse_orig = mean_squared_error(y_test_orig, y_pred_orig) ** 0.5

mae_trans = mean_absolute_error(y_test_trans, y_pred_trans)
rmse_trans = mean_squared_error(y_test_trans, y_pred_trans) ** 0.5

#Print Results
print("\n🔹 Model Performance (Without Box-Cox):")
print(f"  MAE: {mae_orig:.2f}, RMSE: {rmse_orig:.2f}")

print("\n✅ Model Performance (With Box-Cox):")
print(f"  MAE: {mae_trans:.2f}, RMSE: {rmse_trans:.2f}")
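
For readers who want to see what the two metrics measure, here is a small hand-rolled sketch using NumPy. The helper names `mae` and `rmse` and the sample values are made up for illustration; the notebook itself relies on scikit-learn for these calculations.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors, in the target's units."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(errors)))

def rmse(y_true, y_pred):
    """Root mean squared error: like MAE, but large errors are penalized more."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

# Made-up values: the errors are 1, 0, -2 and -1
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

print(f"MAE:  {mae(y_true, y_pred):.4f}")   # (1 + 0 + 2 + 1) / 4 = 1.0
print(f"RMSE: {rmse(y_true, y_pred):.4f}")  # sqrt((1 + 0 + 4 + 1) / 4) ~ 1.2247
```

RMSE is never smaller than MAE, and a wide gap between the two usually means a few large errors dominate.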

### Analyzing our results
We didn't see much of a positive change when using our transformed temperature and precipitation features. Precipitation is interesting: it was the feature most improved by the Box-Cox power transform in terms of reduced skewness and kurtosis, yet it also had the least effect on the predictability of yield. Temperature, which had a much higher correlation with yield, was actually made slightly worse by the Box-Cox transformation. Despite being bimodal, the temperature feature was the best predictor of yield other than yield seasonality itself.
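
One plausible reason the transformed features changed so little is that tree-based models split on feature thresholds, so a monotonic transform of an input mostly preserves the splits the model can find. The sketch below illustrates this with a single scikit-learn decision tree on synthetic data. This is an assumption-laden toy example, not our XGBoost model or our dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic feature on an evenly spaced grid, with a noisy nonlinear target
x = np.arange(1, 201, dtype=float)
y = np.sqrt(x) + rng.normal(0.0, 0.5, size=x.size)

X_raw = x.reshape(-1, 1)
X_log = np.log(x).reshape(-1, 1)  # monotonic transform of the same feature

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_raw, y)
tree_log = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_log, y)

# Both trees partition the training rows identically, so their fitted values match
pred_raw = tree_raw.predict(X_raw)
pred_log = tree_log.predict(X_log)
print("Same fitted values:", np.allclose(pred_raw, pred_log))
```

On unseen data the two models can still differ slightly, because split thresholds fall between training points and those midpoints shift under the transform. Small movements in MAE and RMSE in either direction are therefore not surprising.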

Lab Challenge: How could we further improve MAE and RMSE?
- Further adjustments to features?
- Hyperparameter tuning?
- Yeo-Johnson?
- What other algorithms might be better? An LSTM or another deep neural network?
- What's causing noise?

## Tuning and managing our experiment
MLflow is key for tracking, and libraries such as Hyperopt or Optuna are good options for distributed hyperparameter tuning. In the next notebook, we'll set up an MLflow experiment for training and tuning our model.