# Oil Extraction Production Forecasting
<br/>
<img src="https://www.nsenergybusiness.com/wp-content/uploads/sites/4/2022/07/refinery-ga56d4972f_640.jpg" />

## Forecasting yield based on projected temperature and precipitation
In this notebook we'll be using our Unity Catalog managed model for prediction of new data. We'll forecast temperature and precipitation and apply our model based on those new estimates.

**Note:** This notebook deliberately omits applying a Box-Cox transformation to the forecasted precipitation. Because Prophet is fitted on historical records that already have the transformation applied, it predicts the transformed value directly. In general, as new data arrives (or is predicted on the raw scale), it should be run through the same transformation using the lambda values we saved in the feature engineering section of this lab (02_Advanced_Feature_Engineering).
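
To make the note concrete, here is a minimal sketch of re-applying a Box-Cox transform with an already-fitted lambda, written with only the standard library. The lambda value and the sample readings are hypothetical placeholders, not values from this lab. Box-Cox requires strictly positive inputs, so if notebook 02 added an offset to handle zero precipitation, the same offset would need to be applied here first.

```python
import math

def boxcox_with_lambda(values, lmbda):
    """Apply the Box-Cox transform using an already-fitted lambda."""
    out = []
    for x in values:
        if x <= 0:
            raise ValueError("Box-Cox requires strictly positive inputs")
        # lambda == 0 is the log limit of (x**lmbda - 1) / lmbda
        out.append(math.log(x) if lmbda == 0 else (x ** lmbda - 1.0) / lmbda)
    return out

def inverse_boxcox(values, lmbda):
    """Map transformed values back to the original scale."""
    if lmbda == 0:
        return [math.exp(y) for y in values]
    return [(lmbda * y + 1.0) ** (1.0 / lmbda) for y in values]

stored_lambda = 0.5           # hypothetical; load the value saved in notebook 02
new_precip = [1.0, 4.0, 9.0]  # hypothetical raw precipitation readings

transformed = boxcox_with_lambda(new_precip, stored_lambda)
restored = inverse_boxcox(transformed, stored_lambda)
print(transformed, restored)
```

If SciPy is available, `scipy.stats.boxcox(x, lmbda=stored_lambda)` applies the same formula to an array when a fixed lambda is passed.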

### A quick note on Prophet
Prophet is an easy-to-use time series forecasting library. We'll be using it to forecast weather, which is much easier than forecasting yield directly. With the forecasted weather data, we'll infer the oil yield in barrels using our trained model.

In [0]:
%pip install prophet
dbutils.library.restartPython()

### Initialization
Below is an initialization block to help us out. It is designed so that each user gets their own set of unique table and model names. Don't worry too much about what it's doing - this is mostly because we have several users doing the same lab with the same parameters in a shared workspace and don't want any collisions. For enterprise work this is largely unnecessary.

In [0]:
import hashlib, base64

#IMPORTANT! DO NOT CHANGE THESE VALUES!!!!
catalog = "workshop"
db = "default"
current_user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().get("user").get()
hash_object = hashlib.sha256(current_user.encode())
hash_user_id = base64.b32encode(hash_object.digest()).decode("utf-8").rstrip("=")[:12]  #Trim to 12 chars for readability
initials = "".join([x[0] for x in current_user.split("@")[0].split(".")])
short_hash = hashlib.md5(current_user.encode()).hexdigest()[:8]  #Short 8-char hash
safe_user_id = f"{initials.upper()}_{short_hash}"
src_table = f"{safe_user_id}_oil_yield"
model_name = f"{safe_user_id}_oil_yield_forecast"
model_uri = f"{catalog}.{db}.{model_name}"

In [0]:
import mlflow

# Set a named experiment
mlflow.set_experiment(f"/Users/{current_user}/Oil Extraction Production Forecasting")

In [0]:
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

df = fe.read_table(
  name=f'{catalog}.{db}.{src_table}_features_transformed'
)

In [0]:
#If we want to use the UC registry rather than the local mlflow registry, set databricks-uc as the registry uri
mlflow.set_registry_uri("databricks-uc")

### Loading our best model from Unity Catalog
Loading our best-performing model is easy with Unity Catalog model aliases. Although MLOps is out of scope for this lab, it's worth noting that the latest model may not be the best performing one. Generally speaking, having a champion-challenger paradigm when training is a good idea. Models only get promoted if they outperform their predecessor. For the sake of this lab, we'll be assuming the latest version of the model is the best performing one.

In [0]:
import mlflow
from mlflow import MlflowClient

# Define Unity Catalog Model URI with alias
model_alias = "Champion"
model_uri = f"models:/{catalog}.{db}.{model_name}@{model_alias}"

# Load the trained model
loaded_model = mlflow.xgboost.load_model(model_uri)

print(f"✅ Model Loaded from Unity Catalog. Loaded {model_name}")
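
The champion-challenger idea mentioned above can be sketched as a small comparison rule. This is only an illustration under assumptions: the metric name, the version numbers, and the RMSE values are hypothetical, and both models are assumed to have been evaluated on the same holdout set.

```python
def pick_champion(champion, challenger, metric="rmse", lower_is_better=True):
    """Return the model version that should hold the Champion alias."""
    if champion is None:
        # No incumbent yet, so the challenger is promoted by default
        return challenger["version"]
    incumbent_score, challenger_score = champion[metric], challenger[metric]
    if lower_is_better:
        challenger_wins = challenger_score < incumbent_score
    else:
        challenger_wins = challenger_score > incumbent_score
    return challenger["version"] if challenger_wins else champion["version"]

# Hypothetical evaluation results on the same holdout set
champion = {"version": 3, "rmse": 41.2}
challenger = {"version": 4, "rmse": 44.7}

winner = pick_champion(champion, challenger)
print(f"Champion alias should point at version {winner}")
```

In this notebook's setup, moving the alias should then be a matter of calling `MlflowClient().set_registered_model_alias(f"{catalog}.{db}.{model_name}", "Champion", winner)`.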

### Loading our sample window
We're going to sub-sample our data for Prophet forecasting. Since we want the last 30 days, we'll sort our data by date in descending order. This should give us a good estimate of short-term weather forecasting based on historical seasonality. Let's see how this behaves.

In [0]:
from databricks.feature_engineering import FeatureEngineeringClient
from pyspark.sql.functions import col, date_add
import pandas as pd

fe = FeatureEngineeringClient()

df = fe.read_table(
  name=f'{catalog}.{db}.{src_table}_features'
).orderBy(col("date").desc()).toPandas()

Let's quickly preview our pandas dataframe to see the top and bottom rows.

In [0]:
df

### Basic forecasting using mean
Let's make our first attempt at forecasting using a simple mean() calculation.

In [0]:
from databricks.feature_engineering import FeatureEngineeringClient
from pyspark.sql.functions import col, date_add
import pandas as pd

fe = FeatureEngineeringClient()

df_latest_features = fe.read_table(
  name=f'{catalog}.{db}.{src_table}_features_transformed'
).orderBy(col("date").desc()).limit(30).toPandas()

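#Generate future dates, starting the day after the last observed date
last_date = pd.to_datetime(df_latest_features["date"]).max()
future_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=30, freq="D")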

#Estimate future temperature & precipitation based on past seasonality
df_future_features = df_latest_features.copy()
df_future_features["date"] = future_dates
df_future_features["temperature"] = df_latest_features["temperature"].mean()  #Replace with seasonal estimate
df_future_features["precipitation_transformed"] = df_latest_features["precipitation_transformed"].mean()  #Replace with seasonal estimate

print("✅ Generated Future Feature Data")
print(df_future_features.head())

#### Previewing our dataset
As expected, each row has the same `temperature` and `precipitation_transformed` value. Since our prediction is based on these two fields, it probably won't give us a useful result once we run these values through our model (in other words, all of the forecasts will be the same).

In [0]:
df_future_features
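
The `#Replace with seasonal estimate` comments in the earlier cell hint at a better baseline than a flat mean. One simple option, sketched here with only the standard library, is to estimate each future day from the mean of past values in the same calendar month. The dates and temperatures below are made up for illustration; in the notebook the same idea would be a group-by on the month of the `date` column.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

def seasonal_estimate(history, future_dates):
    """history: list of (date, value) pairs.
    Returns one estimate per future date, using the mean of past values from
    the same calendar month and falling back to the overall mean when that
    month has no history."""
    by_month = defaultdict(list)
    for d, v in history:
        by_month[d.month].append(v)
    overall = mean(v for _, v in history)
    return [mean(by_month[d.month]) if by_month[d.month] else overall
            for d in future_dates]

# Hypothetical temperature history spanning two Januaries and one February
history = [
    (date(2023, 1, 30), 2.0),
    (date(2023, 2, 1), 8.0),
    (date(2024, 1, 29), 4.0),
    (date(2024, 1, 30), 6.0),
]
last_date = max(d for d, _ in history)
future = [last_date + timedelta(days=i) for i in range(1, 4)]

estimates = seasonal_estimate(history, future)
for d, est in zip(future, estimates):
    print(d, est)
```

Unlike the flat mean, the estimate changes when the forecast window crosses a month boundary, so the downstream yield predictions are no longer identical for every row.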

#### Predicting on the forecast dataset
Running the predictions is as simple as selecting the two feature columns the model expects for each row and assigning the output to a new column. Once we have that pandas dataframe with the predicted column, we can plot it out in matplotlib again and see what we get.

In [0]:
# Select input features for prediction
X_future = df_future_features[["temperature", "precipitation_transformed"]]

# Run predictions
df_future_features["predicted_yield"] = loaded_model.predict(X_future)

# Display results
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 5))
plt.plot(df_future_features["date"], df_future_features["predicted_yield"], marker="o", linestyle="dashed", color="red")
plt.xlabel("Date")
plt.ylabel("Predicted Yield (BBL)")
plt.title("Predicted Oil Yield for Next 30 Days")
plt.xticks(rotation=45)
plt.grid()
plt.show()

print("✅ Predictions Complete")
print(df_future_features.head())

#### Saving our predicted values
Saving our predicted data is easy. All we need to do is convert the pandas dataframe back to a PySpark dataframe and write it back to a Delta table.

In [0]:
# Convert Pandas DataFrame to Spark DataFrame
df_future_spark = spark.createDataFrame(df_future_features)

# Save predictions to a Delta Table
df_future_spark.write.mode("overwrite").format("delta").saveAsTable(f"{catalog}.{db}.{src_table}_predictions")

print("✅ Saved Predictions to a Delta Table in Unity Catalog")

### Improving our forecast
We can do better. Since we know that the predicted temperature and precipitation values were the same for each prediction row, they're not doing anything useful. Since we're ultimately trying to forecast using temperature and precipitation, we can use prophet to 'extend' out those two timeseries, and use those extensions with our tuned XGBoost model for a better yield forecast.

In [0]:
from prophet import Prophet
import pandas as pd

#Load historical feature data (this is the same data we transformed in notebook 02_Advanced_Feature_Engineering)
df_features = spark.read.table(f"{catalog}.{db}.{src_table}_features_transformed").toPandas()

#Ensure date format. Dates aren't always parsed properly.
df_features["date"] = pd.to_datetime(df_features["date"])

#Prepare data for Prophet. Set our target (y) and our date/time series (ds)
temp_df = df_features[["date", "temperature"]].rename(columns={"date": "ds", "temperature": "y"})
precip_df = df_features[["date", "precipitation_transformed"]].rename(columns={"date": "ds", "precipitation_transformed": "y"})

#Train Prophet models for temperature & precipitation
temp_model = Prophet()
temp_model.fit(temp_df)

precip_model = Prophet()
precip_model.fit(precip_df)

#Forecast next 30 days (make_future_dataframe returns the historical dates plus 30 future days)
future_dates = temp_model.make_future_dataframe(periods=30)
temp_forecast = temp_model.predict(future_dates)
precip_forecast = precip_model.predict(future_dates)

#Extract predictions
df_predicted_env = future_dates.copy()
df_predicted_env["temperature"] = temp_forecast["yhat"]
df_predicted_env["precipitation_transformed"] = precip_forecast["yhat"]

print("✅ Forecasted Temperature & Precipitation for Next 30 Days")
print(df_predicted_env.head())

#### Using our tuned model on forecasted features
Previewing the first five rows looks promising. We see that both temperature and precipitation have a degree of seasonality to them. This means that each row will have a distinct forecast when we apply it to our XGBoost model. Let's go ahead and try it out.

In [0]:
import mlflow

#Select input features for prediction
X_future = df_predicted_env[["temperature", "precipitation_transformed"]]

#Run predictions
df_predicted_env["predicted_yield"] = loaded_model.predict(X_future)

print("✅ Oil Yield Predictions for Next 30 Days")
print(df_predicted_env.head())

#### Plotting our predictions
This looks promising. Now we'll go ahead and plot out the last 30 records in our dataset. Since our max lookahead was defined when we asked Prophet to forecast the upcoming window, we'll use that window for our plot of future values. Because the forecast dataframe also contains the historical dates, we can sort by date descending and take the top 30 records to isolate the forecast.

In [0]:
predict_df = df_predicted_env.sort_values(by='ds', ascending=False).head(30)

In [0]:
predict_df

#### Viewing our predicted data
Plotting out the forecast, we can see the predicted yield fluctuating slightly at first, with the fluctuations becoming more frequent as we extend the forecast further into the future. This has a direct implication for the confidence interval of the forecast. Generally, predictions for shorter, near-term windows are more accurate than predictions for longer windows further out.

In [0]:
import matplotlib.pyplot as plt

#Plot results
plt.figure(figsize=(12, 5))

plt.plot(predict_df["ds"], predict_df["predicted_yield"], marker="o", linestyle="dashed", color="red")
plt.xlabel("Date")
plt.ylabel("Predicted Yield (BBL)")
plt.title("Predicted Oil Yield for Next 30 Days (Using Forecasted Features)")
plt.xticks(rotation=45)
plt.grid()
plt.show()
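
To illustrate why near-term forecasts tend to be more reliable, here is a small sketch using a plain random-walk model, which is not Prophet's own uncertainty method. For a random walk with daily step standard deviation `sigma`, the forecast error standard deviation after `h` days is `sigma * sqrt(h)`, so the interval widens with the horizon. The `sigma` value is a hypothetical placeholder.

```python
from statistics import NormalDist

def random_walk_interval_width(sigma, horizon, level=0.95):
    """Width of the prediction interval for a random walk, `horizon` steps ahead."""
    # Two-sided critical value, about 1.96 for a 95% interval
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return 2 * z * sigma * horizon ** 0.5

sigma = 10.0  # hypothetical day-to-day standard deviation, in barrels
widths = {h: random_walk_interval_width(sigma, h) for h in (1, 7, 30)}
for h, w in widths.items():
    print(f"day {h:>2}: interval width ~ {w:.1f}")
```

Prophet reports its own interval bounds in the `yhat_lower` and `yhat_upper` columns of the forecast dataframe, which can be inspected the same way to see how uncertainty grows across the 30-day window.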

### Saving our predicted values (again)
Since our first version of predicted values wasn't great, let's go ahead and overwrite them. As new predictions are made, we may want to merge or append them to the current prediction dataset instead. Also, as actual data backfills, we'll be able to further tune and adjust the model while tracking prediction drift.

In [0]:
# Convert Pandas DataFrame to Spark DataFrame
df_future_spark = spark.createDataFrame(df_predicted_env.rename(columns={"ds": "date"}))

# Save predictions to a Delta Table in Unity Catalog
df_future_spark.write.mode("overwrite").format("delta").saveAsTable(f"{catalog}.{db}.{src_table}_predictions")

print("✅ Saved Forecasted Oil Yield to a Delta Table in Unity Catalog")
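
As an alternative to overwriting, the merge behaviour mentioned above can be sketched in plain Python as an upsert keyed on date: rows for dates that already exist are replaced and new dates are appended. The rows below are hypothetical and only show the semantics.

```python
from datetime import date

def upsert_predictions(existing, incoming, key="date"):
    """Merge incoming rows into existing rows, matching on `key`.
    Matching rows are replaced, new rows are added, result is sorted by key."""
    merged = {row[key]: row for row in existing}
    merged.update({row[key]: row for row in incoming})
    return [merged[k] for k in sorted(merged)]

# Hypothetical prediction rows
existing = [
    {"date": date(2024, 1, 1), "predicted_yield": 100.0},
    {"date": date(2024, 1, 2), "predicted_yield": 110.0},
]
incoming = [
    {"date": date(2024, 1, 2), "predicted_yield": 105.0},  # revised forecast
    {"date": date(2024, 1, 3), "predicted_yield": 120.0},  # new forecast day
]

table = upsert_predictions(existing, incoming)
for row in table:
    print(row["date"], row["predicted_yield"])
```

On a Delta table the same semantics are usually expressed with a `MERGE INTO ... WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *` statement, while `.mode("append")` covers the append-only case.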

### Lab challenge

How could we improve our forecasting?
- Right now, we're doing a global forecast based on _all wells_. What about running a forecast for each well?
- Instead of predicting weather (temperature and precipitation) what would be some alternative methods to get forecast data for those values in the near-term?

What would be the best way to carry these predictions going forward?
- How often should we re-run the prediction?
- How should we be treating and updating the predicted vs. actual data?
- Would we want online or offline inference?
- How would model serving benefit us?
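
As a starting point for the predicted vs. actual question, here is a minimal sketch that scores forecasts once actual values have backfilled. It assumes both series are keyed by date and only scores dates present in both. The numbers are made up for illustration.

```python
from datetime import date

def forecast_errors(predicted, actual):
    """predicted / actual: dicts mapping date -> yield.
    Returns the count, mean absolute error, and mean bias over shared dates."""
    common = sorted(set(predicted) & set(actual))
    if not common:
        raise ValueError("no overlapping dates to score")
    errors = [predicted[d] - actual[d] for d in common]
    mae = sum(abs(e) for e in errors) / len(errors)
    bias = sum(errors) / len(errors)  # positive means we over-forecast
    return {"n": len(errors), "mae": mae, "bias": bias}

# Hypothetical values: actuals have only backfilled for the first two days
predicted = {date(2024, 1, 1): 100.0, date(2024, 1, 2): 110.0, date(2024, 1, 3): 120.0}
actual = {date(2024, 1, 1): 90.0, date(2024, 1, 2): 115.0}

report = forecast_errors(predicted, actual)
print(report)
```

Tracking these numbers per prediction run is one simple way to notice drift: a growing error or a bias that stays on one side suggests the model or the weather forecasts need retraining.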