# Oil Extraction Production Forecasting
<br/>
<img src="https://www.nsenergybusiness.com/wp-content/uploads/sites/4/2022/07/refinery-ga56d4972f_640.jpg" />
<br/>

## Introduction: Forecasting Oil Yield with Databricks, MLflow, and XGBoost

In this hands-on workshop, we will explore how to leverage Databricks, MLflow, and XGBoost to forecast oil yield (measured in barrels) based on environmental and operational features. Accurate yield predictions are critical in the energy sector, helping optimize production planning, reduce operational risks, and improve overall efficiency. By applying machine learning techniques, we can uncover complex relationships between geological, weather, and drilling parameters that traditional models may struggle to capture.

Throughout the session, we will guide you through the end-to-end workflow—from data ingestion and feature engineering to model training, hyperparameter tuning, and tracking experiments with MLflow. You will learn best practices for working with structured time-series data in a distributed environment, handling missing values, and selecting key features that drive predictive accuracy. By the end of this lab, you will have a practical understanding of how to deploy and scale predictive models for oil yield forecasting using XGBoost on Databricks—enabling data-driven decision-making for more efficient and sustainable operations.

## Source Data
For this lab, we generated some data for you. We've uploaded the .csv data to a Unity Catalog volume. Volumes are logical storage areas for files such as source data and images. Generally speaking, data is organized in a 3-tier model:
1. **Catalog:** The top level, used to group related databases. We recommend that business units or project teams share a catalog.
1. **Database (also called schema):** Databases are collections of artifacts, typically with related data. Databases generally contain a few types of objects, such as *Tables, Volumes, Models and Functions*.
1. **Objects:** These are the things that a database or schema can hold:
    - **Tables:** Contain all of our structured data.
    - **Volumes:** Used to store raw data. Generally mapped to some type of cloud storage location. Normally this is where we land data from outside systems.
    - **Models:** We use this location to store models that we build for easy tracking in production. Many people and applications can consume them from here.
    - **Functions:** This is a newer object type; advanced functions for transforming data can be stored here. We're not using this for our lab.
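
The three levels combine into a dotted name, `catalog.schema.object`, which is how tables and models are referenced in code. Files inside a volume are addressed by a path instead. Below is a minimal sketch of both conventions. The helper names `qualified_name` and `volume_path` are hypothetical and exist only for illustration, as are the `my_table` and `example.csv` values.

```python
def qualified_name(catalog: str, schema: str, obj: str) -> str:
    """Build the three-level name used for tables, models, and functions."""
    return f"{catalog}.{schema}.{obj}"


def volume_path(catalog: str, schema: str, volume: str, relative_path: str) -> str:
    """Build the path of a file stored in a Unity Catalog volume."""
    return f"/Volumes/{catalog}/{schema}/{volume}/{relative_path.lstrip('/')}"


# Illustrative values only; substitute your own object and file names
table_name = qualified_name("workshop", "default", "my_table")
file_path = volume_path("workshop", "default", "data", "example.csv")

print(table_name)  # workshop.default.my_table
print(file_path)   # /Volumes/workshop/default/data/example.csv
```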

### Initialization
Below is an initialization block to help us out. It is designed so that each user gets their own set of unique names. Don't worry too much about what it's doing: we have several users doing the same lab with the same parameters in a shared workspace, and we don't want any collisions. For enterprise work this is largely unnecessary.

In [0]:
import hashlib, base64

#IMPORTANT! DO NOT CHANGE THESE VALUES!!!!
catalog = "workshop"
db = "default"
current_user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().get("user").get()
hash_object = hashlib.sha256(current_user.encode())
hash_user_id = base64.b32encode(hash_object.digest()).decode("utf-8").rstrip("=")[:12]  #Trim to 12 chars for readability
initials = "".join([x[0] for x in current_user.split("@")[0].split(".")])
short_hash = hashlib.md5(current_user.encode()).hexdigest()[:8]  #Short 8-char hash
safe_user_id = f"{initials.upper()}_{short_hash}"
src_table = f"{safe_user_id}_oil_yield"
model_name = f"{safe_user_id}_oil_yield_forecast"
model_uri = f"{catalog}.{db}.{model_name}"

## Creating the source data table
There are many ways to ingest data. In a production case, we'd build some type of ingestion pipeline. This gives us the flexibility to ingest data in-flight or at-rest at a variety of different points.

### Using Databricks Volumes
The simplest way to ingest data is using Databricks Volumes. These are mount points within the context of a database. Data can either be dragged and dropped there using the Databricks Web UI, or added by an external application. This will be the starting point of our data journey. For this lab, we've included a file called `data/synthetic_oil_yield_csv_files/synthetic_oil_yield_20250227_181101.csv`. We've uploaded this to the workshop catalog, under the default database. Let's go ahead and create our first table with this data.

#### 1. Create a unique id
First, we'll create a unique id to use so we don't run into any collisions with other users. Normally in an organization we'd decide on a unique name or naming convention. Run the code cell below to generate your unique id.

#### 2. Copy the unique id
Next, copy your unique id. It should be something like `AC_5fae8e4d_oil_yield`. Store this unique code somewhere safe so we can re-use it in a minute.
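
For the curious, the shape of that id (initials, an underscore, then eight hex characters) comes from the initialization block. The sketch below restates that logic as a standalone function so it can be tried outside the notebook. The function name `make_safe_user_id` and the email address are made up for illustration; the cell you actually need to run is the one that follows.

```python
import hashlib


def make_safe_user_id(email: str) -> str:
    """Derive a short, user-specific id from an email address."""
    # Initials come from the dot-separated parts before the "@"
    local_part = email.split("@")[0]
    initials = "".join(part[0] for part in local_part.split(".") if part)
    # The first 8 hex characters of an MD5 digest keep the id short but distinct
    short_hash = hashlib.md5(email.encode()).hexdigest()[:8]
    return f"{initials.upper()}_{short_hash}"


# "ada.colby@example.com" is a fictional address used only for illustration
example_id = make_safe_user_id("ada.colby@example.com")
print(f"{example_id}_oil_yield")
```

Because the id is derived from a hash of the email address, the same user always gets the same id, and two users are very unlikely to collide.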



In [0]:
#THIS IS UNIQUE FOR EVERY USER
#Use this value as the name of a new delta table under workshop.default!
print(src_table)

#### 3. Find the storage volume
Using the Catalog explorer, navigate to the `workshop` catalog, under the `default` database. There you should see a group called `volumes` with a volume called `data`. `data` is where we uploaded the data file. If it does not exist (you are running this in your own environment for example) you can create your own storage volume and upload the file there. More info on how to do that can be found [here](https://docs.databricks.com/aws/en/volumes).
#### 4. Create the table from the volume
Select the three vertical dots to the right of the file and select `create table`.
<img src="https://raw.githubusercontent.com/andrijdemianczuk/jolly-jackalope/refs/heads/main/src/Screenshot%202025-03-02%20at%2022.14.52.png" width="750" />
#### 5. Configure the table
In the dialog that comes up, set the following parameters:
- Catalog: `workshop`
- Database: `default`
- Table Name: `{Your Unique ID from above}`
<img src="https://raw.githubusercontent.com/andrijdemianczuk/jolly-jackalope/refs/heads/main/src/Screenshot%202025-03-02%20at%2022.15.15.png" width="750" />
#### 6. Create the table
Click the `Create table` button and wait a few minutes. You just created your first delta table!
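
The same table could also be created in code rather than through the UI. The following is a sketch only, not the method used in this lab: the function name `create_table_from_csv` is hypothetical, and the file path in the commented-out call is a placeholder you would need to replace with the real location shown in Catalog Explorer.

```python
def create_table_from_csv(spark, csv_path: str, table_name: str) -> None:
    """Read a CSV file from a volume and save it as a Delta table."""
    (
        spark.read.format("csv")
        .option("header", "true")       # first row holds the column names
        .option("inferSchema", "true")  # let Spark guess the column types
        .load(csv_path)
        .write.format("delta")
        .mode("overwrite")              # replace the table if it already exists
        .saveAsTable(table_name)
    )


# In a Databricks notebook `spark` is predefined, so the call would look like:
# create_table_from_csv(
#     spark,
#     "/Volumes/workshop/default/data/<path-to-your-file>.csv",
#     f"{catalog}.{db}.{src_table}",
# )
```

Note that schema inference may pick different column types than the UI does, so it is worth checking the result with `printSchema()`.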

## Using PySpark dataframes
Next we'll load the Delta table we just created into a PySpark DataFrame. Most of our work will be done in PySpark and pandas DataFrames, and converting between the two is straightforward.

### A bit about PySpark dataframes
PySpark DataFrames are distributed, schema-aware data structures optimized for big data processing in Apache Spark, making them a powerful choice for handling large-scale datasets in Databricks. Unlike traditional Pandas DataFrames, PySpark DataFrames distribute computations across multiple nodes, enabling efficient parallel processing and scalability. They integrate seamlessly with Delta Lake, MLflow, and Databricks’ managed compute environment, allowing users to perform ETL, feature engineering, and machine learning workflows at scale. With built-in support for SQL queries, transformations, and optimizations via Catalyst and Tungsten, PySpark DataFrames provide a flexible and efficient way to process structured and semi-structured data in Databricks.

In [0]:
#Load the delta table into a PySpark dataframe
df = spark.table(f"{catalog}.{db}.{src_table}")

Let's quickly preview the schema of our dataframe. This helps us understand what fields are available to us and how we interact with them. Previewing the first few rows may also help.

In [0]:
df.printSchema()
df.show(5, truncate=False)  # Display first 5 rows

#### Describing data
Running `df.describe()` gives us a quick sense of the shape of our data. Immediately, we can see a brief outline of our numeric features. Most of our fields in this example are numeric, with the exception of the `well_id` and `_rescued_data` columns.

#### A quick note on `_rescued_data`
A `_rescued_data` column in Databricks appears when using Auto Loader or schema evolution to ingest data with fields that don’t match the expected schema. Instead of failing the ingestion, Databricks captures these unexpected or unrecognized fields in a single `_rescued_data` column (in JSON format) to preserve the data and allow for later inspection. This feature helps maintain data integrity and enables debugging schema mismatches without data loss, making it especially useful when working with evolving or semi-structured datasets.

In [0]:
display(df.describe())

## Analyzing our data
Understanding our data is essential to knowing how to handle it for our use case. Missing data is usually the most obvious issue to address. Let's quickly check how often values are missing in our dataframe. Missing data can typically be handled in a number of ways:

- **Remove Missing Data** – Drop rows or columns with excessive missing values if they don’t contribute useful information.
- **Imputation** – Fill missing values using statistical methods:
    1. _Mean/Median/Mode_ for numerical data
    1. _Forward/Backward Fill_ for time-series data
    1. _Constant Value_ (e.g., “Unknown” for categorical features)
- **Predictive Imputation** – Use machine learning models (e.g., k-NN, regression) to estimate missing values.
- **Flagging Missingness** – Create a binary indicator column (is_null) to capture missing data patterns.
- **Leverage Domain Knowledge** – Replace missing values with domain-specific estimates when applicable.
- **Use Databricks Functions** – Utilize `fillna()`, `dropna()`, `replace()` in PySpark DataFrames for efficient handling.
- **Consider Feature Removal** – If a feature has excessive missing values and imputation isn’t viable, consider dropping it.
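
A few of these options can be sketched on a toy pandas DataFrame. The values below are made up for illustration and are not the workshop data. The sketch shows flagging missingness, median imputation, and forward fill side by side.

```python
import numpy as np
import pandas as pd

# Toy daily readings with two gaps (illustrative values only)
toy = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=5, freq="D"),
    "temperature": [10.0, np.nan, 14.0, np.nan, 18.0],
})

# Flag missingness first, so the signal survives any later imputation
toy["temperature_is_null"] = toy["temperature"].isna().astype(int)

# Median imputation: every gap receives the same central value
toy["temperature_median"] = toy["temperature"].fillna(toy["temperature"].median())

# Forward fill: each gap receives the last observed value, so row order matters
toy["temperature_ffill"] = toy.sort_values("date")["temperature"].ffill()

print(toy)
```

Forward fill tends to suit time-series data because it respects the order of observations, whereas median imputation ignores when a reading was taken.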

In [0]:
# If any columns have missing values, we need to decide whether to fill, drop, or interpolate them. Sometimes empty or missing values may be valuable though.

#Import the functions module under an alias so we don't shadow Python's built-in sum()
from pyspark.sql import functions as F

#Count the null values in each column
df.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]).show()

### Previewing seasonality
Looking for trends is the easiest way to understand timeseries data. Since all of our observations share the same calendar, understanding the ebb and flow of the data lets us account for it when forecasting future values in the series.

In [0]:
#Let's look for some seasonality based on the timeseries plot
import matplotlib.pyplot as plt
import pandas as pd

#Convert PySpark DataFrame to Pandas for plotting
df_pd = df.select("date", "yield_bbl").groupby("date").avg("yield_bbl").orderBy("date").toPandas()

#Plot a time series
plt.figure(figsize=(12, 5))
plt.plot(df_pd["date"], df_pd["avg(yield_bbl)"], marker="o", linestyle="-")
plt.xlabel("Date")
plt.ylabel("Average Yield (BBL)")
plt.title("Oil Yield Trend Over Time")
plt.xticks(rotation=45)
plt.show()

#### Downsampling data
First we're going to downsample our data to get a better view of otherwise overly granular or noisy data. Downsampling seasonal data is crucial for improving model efficiency, reducing noise, and capturing long-term trends without unnecessary granularity. High-frequency seasonal data can introduce redundancy and volatility, making it harder for machine learning models to generalize patterns effectively. By aggregating data at a lower frequency (such as daily to weekly, or hourly to daily) we can smooth out short-term fluctuations while preserving the underlying seasonality. This not only enhances computational efficiency but also prevents models from overfitting to minor variations, leading to more stable and interpretable forecasts in time-series analysis.
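
To make the mechanics concrete, here is a small sketch of daily-to-weekly downsampling on made-up values (not the workshop data). With pandas, the `"W"` frequency groups rows into weeks ending on Sunday.

```python
import pandas as pd

# Fourteen daily readings starting on a Monday (illustrative values only)
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=14, freq="D"),
    "temperature": [float(v) for v in range(1, 15)],
})

# "W" groups rows into weeks ending on Sunday and averages each week
weekly = daily.set_index("date").resample("W").mean().reset_index()

print(weekly)
```

Fourteen rows collapse into two, one per week, each holding that week's average.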

In [0]:
import matplotlib.pyplot as plt
import pandas as pd

#Convert PySpark DataFrame to Pandas
df_pd = df.select("date", "temperature", "precipitation").toPandas()

#Convert date to datetime
df_pd["date"] = pd.to_datetime(df_pd["date"])

#Resample to weekly average to reduce data size
df_resampled = df_pd.set_index("date").resample("W").mean().reset_index()

#Create figure and axes
fig, ax1 = plt.subplots(figsize=(12, 5))

#Plot temperature on primary y-axis
ax1.plot(df_resampled["date"], df_resampled["temperature"], color="red", marker="o", linestyle="-", label="Temperature (°C)")
ax1.set_xlabel("Date")
ax1.set_ylabel("Temperature (°C)", color="red")
ax1.tick_params(axis="y", labelcolor="red")

#Create secondary y-axis for precipitation
ax2 = ax1.twinx()
ax2.bar(df_resampled["date"], df_resampled["precipitation"], color="blue", alpha=0.5, label="Precipitation (mm)")
ax2.set_ylabel("Precipitation (mm)", color="blue")
ax2.tick_params(axis="y", labelcolor="blue")

#Title and layout
plt.title("Temperature and Precipitation Over Time (Weekly Avg)")
fig.tight_layout()
plt.show()

#### Previewing outliers
Looking for outliers is essential in feature engineering and model performance because outliers can skew statistical measures, distort model predictions, and impact overall data integrity. By detecting, analyzing, and handling outliers appropriately (removing, capping, or transforming them), we ensure that models are trained on reliable and representative data, leading to better predictive performance. Left unhandled, outliers can hurt a model's ability to generalize and encourage overfitting, or 'learning the data, not the relationships'.
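
One common way to detect and cap outliers is the interquartile range (IQR) rule, which is also what the whiskers of a box plot are based on. The sketch below uses made-up readings and the conventional 1.5 × IQR fences; both the data and the choice of rule are assumptions for illustration, not a recommendation specific to this dataset.

```python
import pandas as pd

# Illustrative readings with one obvious outlier
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])

# Fences sit 1.5 IQRs below the first quartile and above the third
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Detect: anything outside the fences
outliers = values[(values < lower) | (values > upper)]

# Cap: pull extreme values back to the nearest fence instead of dropping them
capped = values.clip(lower=lower, upper=upper)

print(outliers.tolist())
print(capped.tolist())
```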

In [0]:
#We need to check for abnormally high or low values in oil yield (barrels produced), well pressure and oil price.
#Convert to Pandas for visualization
df_outliers = df.select(["yield_bbl", "temperature", "well_pressure", "oil_price"]).toPandas()

#Plot boxplots
df_outliers.plot(kind="box", subplots=True, layout=(2, 2), figsize=(10, 8), sharex=False, sharey=False)
plt.suptitle("Box Plot of Key Features")
plt.show()

#### Understanding feature relationships
Feature relationships are important to forecasting models. Often environmental factors have an impact on the effectiveness of being able to predict or forecast numeric or timeseries data. Using a heatmap helps us quickly identify highly correlated fields.

A feature correlation heatmap helps identify which features are highly correlated so you can remove redundancy, improve model interpretability, and optimize training performance. For XGBoost, pruning redundant features can help limit overfitting and keeps feature importances easier to read, while for LSTMs, it helps the model learn meaningful sequential patterns rather than noise or redundant signals.

Using adjacent or ancillary data is useful for boosting predictability.
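
Beyond eyeballing a heatmap, highly correlated pairs can be listed programmatically. The sketch below uses a toy DataFrame; the helper name `highly_correlated_pairs` and the 0.9 threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Toy features (illustrative values only)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [3.0, 5.0, 7.0, 9.0, 11.0],  # b = 2a + 1, so it moves in lockstep with a
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})


def highly_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.9):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    return [
        (cols[i], cols[j])
        for i in range(len(cols))
        for j in range(i + 1, len(cols))  # upper triangle only, skip self-pairs
        if corr.iloc[i, j] > threshold
    ]


pairs = highly_correlated_pairs(toy)
print(pairs)
```

Any pair returned here is a candidate for dropping one of the two features, since the second adds little new information.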

In [0]:
#Let's look for some field correlation
import seaborn as sns

#Convert PySpark DF to Pandas
df_corr = df.select(["yield_bbl", "temperature", "precipitation", "humidity", "wind_speed", "well_pressure", "sand_quality", "drilling_efficiency", "oil_price"]).toPandas()

#Plot a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df_corr.corr(), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Feature Correlation Heatmap")
plt.show()

#### Skew & Kurtosis

**Skewness**

Skewness measures the asymmetry of a dataset’s distribution. A highly skewed feature (left or right) can hurt models that are sensitive to the scale and distribution of their inputs, such as linear regression and neural networks, where it can bias weight updates during optimization. Tree-based models like XGBoost are largely insensitive to skew in input features, but a heavily skewed target can still dominate the loss. Addressing skewness through log, square root, or Box-Cox transformations can improve model stability and predictive performance.

**Kurtosis**

Kurtosis measures the tailedness of a distribution—how often extreme values (outliers) occur. High kurtosis (leptokurtic) means more outliers, which can destabilize models and lead to overfitting, as the model may focus too much on rare events. Low kurtosis (platykurtic) suggests a lack of significant deviations, which can make a model less sensitive to important rare occurrences. Managing kurtosis through outlier detection and robust feature scaling helps improve model generalization and reliability.
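
Both measures are ratios of central moments, which a few lines of plain Python can make explicit. The sketch below uses the population (biased) formulas, which correspond to the defaults of `scipy.stats.skew` and `scipy.stats.kurtosis`; the function names and sample lists are made up for illustration.

```python
def central_moment(values, k):
    """k-th central moment: the average of (value - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)


def skewness(values):
    """Third central moment, scaled by the variance to the power 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3 (normal scores 0)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]      # mirror image around its mean
right_skewed = [1, 1, 1, 2, 10]  # one large value drags the tail to the right

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

The symmetric sample has zero skewness, while the sample with a long right tail has positive skewness.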

In [0]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

#Convert PySpark DataFrame to Pandas
df_pd = df.select("yield_bbl", "temperature", "precipitation").toPandas()

#Create the KDE plot (bell curve)
plt.figure(figsize=(10, 6))

#Plot yield distribution
sns.kdeplot(df_pd["yield_bbl"], label="Yield (BBL)", color="red", linewidth=2)

#Labels and title
plt.xlabel("Value")
plt.ylabel("Density")
plt.title("Bell Curve of Yield")
plt.legend()
plt.grid(True)

#Show plot
plt.show()

In [0]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

#Convert PySpark DataFrame to Pandas
df_pd = df.select("yield_bbl", "temperature", "precipitation").toPandas()

#Create the KDE plot (bell curve)
plt.figure(figsize=(10, 6))

#Plot temperature distribution
sns.kdeplot(df_pd["temperature"], label="Temperature", color="green", linewidth=2)

#Labels and title
plt.xlabel("Value")
plt.ylabel("Density")
plt.title("Bell Curve of temperature")
plt.legend()
plt.grid(True)

#Show plot
plt.show()

In [0]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

#Convert PySpark DataFrame to Pandas
df_pd = df.select("yield_bbl", "temperature", "precipitation").toPandas()

#Create the KDE plot (bell curve)
plt.figure(figsize=(10, 6))

#Plot precipitation distribution
sns.kdeplot(df_pd["precipitation"], label="Precipitation (mm)", color="blue", linewidth=2)

#Labels and title
plt.xlabel("Value")
plt.ylabel("Density")
plt.title("Bell Curve of precipitation")
plt.legend()
plt.grid(True)

#Show plot
plt.show()

#### What next?
At this point we know that temperature and precipitation are our biggest contributing features to yield. We'll create a feature table with those values for forecasting.

Given that we want to boost our forecast with predicting temperature and precipitation, we'll likely also have to forecast those two features as well (here's a hint as to where we're going!). The predictions of those fields will contribute to the prediction of yield. This means that we'll have to fix the distributions of yield and precipitation. The temperature looks okay since we know that seasonally there's a change between summer and winter with a degree of outliers.

### Evaluate Skew & Kurtosis
Now that we have a pretty good understanding of our data, what do we do about it? Let's start by getting some numbers to quantify our skew and kurtosis values for our target (`yield_bbl`) and related features (`precipitation` and `temperature`)

In [0]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

#Convert PySpark DataFrame to Pandas
df_pd = df.select("date","yield_bbl", "precipitation", "temperature").toPandas()

#Plot distributions
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
sns.histplot(df_pd["yield_bbl"], kde=True, bins=30, color="red")
plt.title("Yield Distribution")

plt.subplot(1, 3, 2)
sns.histplot(df_pd["precipitation"], kde=True, bins=30, color="blue")
plt.title("Precipitation Distribution")

plt.subplot(1, 3, 3)  
sns.histplot(df_pd["temperature"], kde=True, bins=30, color="green")
plt.title("Temperature Distribution")

plt.show()

In [0]:
from scipy.stats import skew, kurtosis

print(f"Yield Skewness: {skew(df_pd['yield_bbl'])}, Kurtosis: {kurtosis(df_pd['yield_bbl'])}")
print(f"Precipitation Skewness: {skew(df_pd['precipitation'])}, Kurtosis: {kurtosis(df_pd['precipitation'])}")
print(f"Temperature Skewness: {skew(df_pd['temperature'])}, Kurtosis: {kurtosis(df_pd['temperature'])}")

Temperature isn't really that skewed (close to zero) and its kurtosis is low. Note that `scipy.stats.kurtosis` reports excess kurtosis by default, so a normal distribution scores 0 rather than 3. However, temperature is bimodal, which can have a negative effect on our model's precision. Precipitation is the most aberrant and also has the least impact on yield.

## Applying algorithmic transformations
We know that some of our features will require transformation. Since yield and precipitation are non-negative, we can try to normalize them with a Box-Cox transformation once we shift them by 1 (Box-Cox requires strictly positive values). Since temperature is both positive and negative, we could either shift all values above zero or try to normalize it with a Yeo-Johnson transformation, which accepts negative values directly.

We're going to do our initial investigation to see the effect of a Box-Cox transformation; however, we'll be capturing our features and creating our feature tables pre-transformation. We want to encapsulate this kind of transformation in the model itself so we don't have to tightly couple the transformation with the feature engineering and ingestion process.

We will do a sample transformation below to see the effect of the Box-Cox transformation on our data set.

#### Creating lambda values
Lambda is the power parameter of the Box-Cox (or Yeo-Johnson) transformation. It is estimated once from the data and then applied identically to every value of a field, which changes the shape of the distribution while preserving the ordering of the values. To make sure the same transformation is applied to future data, we'll need to persist these lambda values (we'll be doing this in the next notebook).
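
The reason lambda has to be persisted is easiest to see in a small round trip: fit lambda on one set of values, reuse it on new values, then invert the transform. The sketch below uses SciPy's `boxcox` and `inv_boxcox` on made-up numbers; the arrays and variable names are illustrative and are not the workshop data.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox

# Illustrative positive, right-skewed values
train = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 20.0, 60.0, 200.0])
new = np.array([4.0, 30.0])

# Fit: with no lambda supplied, SciPy estimates it by maximum likelihood
train_t, lam = boxcox(train)

# Reuse: passing the stored lambda applies the identical transform to new data
new_t = boxcox(new, lmbda=lam)

# Invert: map transformed values (for example, predictions) back to original units
restored = inv_boxcox(new_t, lam)

print(f"lambda = {lam}")
print(restored)
```

If lambda were re-estimated on the new data instead of reused, the two sets of values would end up on different scales and could not be compared.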

In [0]:
from scipy.stats import boxcox, yeojohnson

df_pd["yield_bbl"], lambda_yield = boxcox(df_pd["yield_bbl"] + 1)  # Shift to avoid zero
df_pd["precipitation"], lambda_precip = boxcox(df_pd["precipitation"] + 1)
df_pd["temperature"], lambda_temp = yeojohnson(df_pd["temperature"] + 1)

print(f"Box-Cox Lambda for Yield: {lambda_yield}")
print(f"Box-Cox Lambda for Precipitation: {lambda_precip}")
print(f"Yeo-Johnson Lambda for Temperature: {lambda_temp}")

#### Normalizing data
Scaling puts our features on a common footing so that no feature dominates simply because of its units. Min-max normalization maps values into a fixed range (usually 0 to 1), while standardization rescales them to zero mean and unit variance. Since we don't need a 0 to 1 range and want to preserve the relative distances in our data, we'll use a `StandardScaler`.
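
The difference between the two approaches is clearest on a tiny example. The sketch below spells out the arithmetic with NumPy rather than calling scikit-learn; these formulas match what `StandardScaler` and `MinMaxScaler` compute with their default settings. The input values are made up for illustration.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0])

# Standardization: subtract the mean, divide by the population
# standard deviation (ddof=0), giving zero mean and unit variance
standardized = (values - values.mean()) / values.std()

# Min-max scaling: squeeze the values into the range [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

print(standardized)
print(min_max)
```

Both are linear rescalings, so neither changes the shape of a distribution; a skewed feature stays skewed after scaling.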

In [0]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

#Use MinMaxScaler() if you prefer [0,1] range
scaler = StandardScaler()  

df_pd[["yield_bbl", "precipitation", "temperature"]] = scaler.fit_transform(df_pd[["yield_bbl", "precipitation", "temperature"]])

In [0]:
#Let's preview our transformations
df_pd

#### Previewing the effect of our transformations
Now let's see whether our transformations improved anything. We may have actually made temperature worse with the Yeo-Johnson transformation, but we improved the shape of precipitation and yield. Since yield is our target feature, we won't be using the transformation there. Precipitation remains heavily skewed and still has heavy outliers, which can affect our model's precision. Let's consider that later: this feature might be introducing more noise than signal, compromising our predictability.

In [0]:
#Plot distributions
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
sns.histplot(df_pd["yield_bbl"], kde=True, bins=30, color="red")
plt.title("Yield Distribution")

plt.subplot(1, 3, 2)
sns.histplot(df_pd["precipitation"], kde=True, bins=30, color="blue")
plt.title("Precipitation Distribution")

plt.subplot(1, 3, 3)  
sns.histplot(df_pd["temperature"], kde=True, bins=30, color="green")
plt.title("Temperature Distribution")

plt.show()

When we plot out our heatmap, we'll also notice that we had a small improvement of temperature and a slight worsening of precipitation as boosting features.

In [0]:
#Plot the heatmap with the transformed variables
plt.figure(figsize=(10, 6))
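#numeric_only=True skips the non-numeric date column, which would otherwise raise an error on recent pandas versions
sns.heatmap(df_pd.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)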
plt.title("Feature Correlation Heatmap")
plt.show()

#### Reviewing outliers of our transformed data
It looks like we did get rid of the outliers for our target feature. Precipitation still looks skewed, but temperature remains fairly well balanced.

In [0]:
#Plot boxplots
df_pd.plot(kind="box", subplots=True, layout=(2, 2), figsize=(10, 8), sharex=False, sharey=False)
plt.suptitle("Box Plot of Key Features")
plt.show()

Let's also have a quick look at the normalized relationship of temperature and precipitation. Here we see that after the normalization process we have a much closer relationship. This means that while precipitation may not have a direct, close relationship with yield, it may be a good booster when we need to predict temperature in the future based on seasonal timing.

In [0]:
#Convert date to datetime
df_pd["date"] = pd.to_datetime(df_pd["date"])

#Resample to weekly average to reduce data size
df_resampled = df_pd.set_index("date").resample("W").mean().reset_index()

#Create figure and axes
fig, ax1 = plt.subplots(figsize=(12, 5))

#Plot temperature on primary y-axis
ax1.plot(df_resampled["date"], df_resampled["temperature"], color="red", marker="o", linestyle="-", label="Temperature (°C)")
ax1.set_xlabel("Date")
ax1.set_ylabel("Temperature (°C)", color="red")
ax1.tick_params(axis="y", labelcolor="red")

#Create secondary y-axis for precipitation
ax2 = ax1.twinx()
ax2.bar(df_resampled["date"], df_resampled["precipitation"], color="blue", alpha=0.5, label="Precipitation (mm)")
ax2.set_ylabel("Precipitation (mm)", color="blue")
ax2.tick_params(axis="y", labelcolor="blue")

#Title and layout
plt.title("Temperature and Precipitation Over Time (Weekly Avg)")
fig.tight_layout()
plt.show()

Based on the downsampled, transformed, and normalized data, I'm pretty satisfied with temperature and precipitation as _potentially_ relevant features to predict yield. We'll go back to our 'raw' state and create a feature table based on those two numeric fields along with our date. Let's keep in mind that although precipitation is still correlated, its correlation value is quite low, so it may introduce more noise than value.

In [0]:
from pyspark.sql.functions import monotonically_increasing_id

#Load our desired features, along with a monotonically increasing id. Feature tables require a unique identifier.
df_features = df.select("date", "temperature", "precipitation", "yield_bbl").withColumn("id", monotonically_increasing_id())

## Creating the feature table
Now let's create our feature table. We'll be using this feature table for our experiment and training our model. Creating a feature table ensures discoverability and consistency when using this data for modelling and training.

Feature tables in Databricks provide a centralized, governed, and scalable way to store and serve features for machine learning models. By leveraging Delta Lake and MLflow, feature tables ensure consistent, high-quality features across training and inference, reducing data drift and improving model reliability. They support real-time and batch feature serving, enable feature reuse across multiple models, and integrate seamlessly with Databricks Feature Store, making it easier to manage feature lineage, versioning, and operationalization at scale.

We will source our data in the next notebooks from our newly generated feature table, creating a lineage of events.

In [0]:
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

#Create feature table with `id` and `date` as the primary keys; `date` is also the timeseries column.
customer_feature_table = fe.create_table(
  name=f'{catalog}.{db}.{src_table}_features',
  primary_keys=['id', 'date'],
  schema=df_features.schema,
  description='oil yield features',
  df = df_features,
  timeseries_columns='date'
)

### Lab challenge
We've largely been focusing on distribution of data.
- What other types of exploratory data analysis (EDA) would be useful?
- What other metrics or values should we consider at this investigation stage?
- How could we further normalize the data for better standardization?