In [33]:
# Dependencies
import xgboost as xgb
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

## Introduction

The primary objective of this predictive model is to forecast monthly WTI crude oil futures prices and thereby improve the understanding of risk exposure. Accurate predictions of crude oil price movements help financial analysts make more informed decisions about hedging strategies and risk management in volatile commodities markets.

## Data Selection

In the initial model development phase, monthly crude oil price data (January 2020 to May 2023) was chosen as the foundation for the analysis. This decision was based on several considerations:

1. **Data Availability**: Monthly data is widely available and easy to access, which makes it a practical starting point. It also lines up with the publication frequency of many economic indices.
   - Examples: the Export/Import Price Index (North American Industry Classification System, NAICS) for Petroleum and Coal Products Manufacturing, the Consumer Price Index, and the effective federal funds rate.

2. **End of the export ban in 2015**: The U.S. ban on crude oil exports, a significant event in the petroleum industry, was lifted in December 2015. This milestone marks a fundamental shift in the dynamics of the crude oil market, and the sample period (2020 onward) falls entirely after it.

3. **Feature Analysis**: Beyond predicting monthly prices accurately, a secondary goal was to understand which features have the strongest influence on crude oil prices.

In [43]:
# Loading data
data = pd.read_csv("Outputs/oil_model.csv")
data.tail()

Unnamed: 0,Date,montly_value,sum_export,sum_import,sum_stock,sum_production,cpi,interest,gas_Henry,crude_brent,vehicle_sales,personal_income,export_index,import_index
36,2023-01,78.123,524624,46496,4467045,818731,299.17,4.33,3.27,82.5,16.455,22432.0,163.0,145.2
37,2023-02,76.832632,514197,39620,4497009,739817,300.84,4.57,2.38,82.59,15.438,22520.6,163.1,145.1
38,2023-03,73.277826,668335,40232,4441135,831218,301.836,4.65,2.31,78.43,15.404,22605.1,164.6,137.6
39,2023-04,79.446316,551776,47896,4393584,797765,303.363,4.83,2.16,84.64,16.587,22669.7,161.6,142.0
40,2023-05,71.578182,551385,46108,4389598,826179,304.127,5.06,2.15,75.47,15.63,22761.5,141.0,132.6


# Preprocessing Data

In [3]:
# Split the original "Date" column (format YYYY-MM) into "year" and "month"
data[['year', 'month']] = data['Date'].str.split('-',expand=True)

# Review the DataFrame
data.head()

Unnamed: 0,Date,montly_value,sum_export,sum_import,sum_stock,sum_production,cpi,interest,gas_Henry,crude_brent,vehicle_sales,personal_income,export_index,import_index,year,month
0,2020-01,57.519048,512114,33860,5471999,837842,257.971,1.55,2.02,63.65,17.314,19065.2,110.0,101.3,2020,1
1,2020-02,50.542632,499910,25852,5494183,784968,258.678,1.58,1.91,55.66,17.061,19197.0,97.4,96.6,2020,2
2,2020-03,29.207727,536736,17012,5609773,833917,258.115,0.65,1.79,32.01,11.68,18838.8,93.3,84.1,2020,3
3,2020-04,16.547619,437670,12052,5881927,752800,256.389,0.05,1.74,18.38,8.923,21050.3,61.3,61.5,2020,4
4,2020-05,28.5625,430916,26004,5920453,634626,256.394,0.05,1.75,29.38,12.328,20216.6,59.4,67.1,2020,5


In [4]:
# Convert the "year" and "month" columns to an integer data type
data[['year', 'month']] = data[['year', 'month']].astype('int')

# Drop the original "Date" column
data = data.drop('Date', axis=1)

# Review the DataFrame
data.head()

Unnamed: 0,montly_value,sum_export,sum_import,sum_stock,sum_production,cpi,interest,gas_Henry,crude_brent,vehicle_sales,personal_income,export_index,import_index,year,month
0,57.519048,512114,33860,5471999,837842,257.971,1.55,2.02,63.65,17.314,19065.2,110.0,101.3,2020,1
1,50.542632,499910,25852,5494183,784968,258.678,1.58,1.91,55.66,17.061,19197.0,97.4,96.6,2020,2
2,29.207727,536736,17012,5609773,833917,258.115,0.65,1.79,32.01,11.68,18838.8,93.3,84.1,2020,3
3,16.547619,437670,12052,5881927,752800,256.389,0.05,1.74,18.38,8.923,21050.3,61.3,61.5,2020,4
4,28.5625,430916,26004,5920453,634626,256.394,0.05,1.75,29.38,12.328,20216.6,59.4,67.1,2020,5
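
An alternative to string splitting, sketched here on three rows copied from the preview above, is to parse `Date` with `pd.to_datetime` and read the year and month from the `.dt` accessor, which yields integer columns without a separate type conversion. The small `sample` frame is a stand-in for `data` and is illustrative only.

```python
import pandas as pd

# Three rows copied from the preview above; `sample` is a stand-in for `data`.
sample = pd.DataFrame({
    "Date": ["2020-01", "2020-02", "2020-03"],
    "montly_value": [57.519048, 50.542632, 29.207727],
})

# Parse "YYYY-MM" strings into timestamps (the day defaults to the 1st).
parsed = pd.to_datetime(sample["Date"], format="%Y-%m")

# Extract integer year and month, then drop the original string column.
sample["year"] = parsed.dt.year
sample["month"] = parsed.dt.month
sample = sample.drop(columns="Date")

print(sample)
```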


In [5]:
# Print out the names of columns
data.columns

Index(['montly_value', 'sum_export', 'sum_import', 'sum_stock',
       'sum_production', 'cpi', 'interest', 'gas_Henry', 'crude_brent',
       'vehicle_sales', 'personal_income', 'export_index', 'import_index',
       'year', 'month'],
      dtype='object')

In [6]:
# Use `StandardScaler()` from scikit-learn to standardize every feature column
# (all remaining columns except the target variable "montly_value")
data_scaled = StandardScaler().fit_transform(data[['sum_export', 'sum_import', 'sum_stock',
       'sum_production', 'cpi', 'interest', 'gas_Henry', 'crude_brent',
       'vehicle_sales', 'personal_income', 'export_index', 'import_index',
       'year', 'month']])

In [7]:
# Create a DataFrame with the scaled data
data_scaled_df = pd.DataFrame(data_scaled,columns =['sum_export', 'sum_import', 'sum_stock',
       'sum_production', 'cpi', 'interest', 'gas_Henry', 'crude_brent',
       'vehicle_sales', 'personal_income', 'export_index', 'import_index',
       'year', 'month'])

# Review the dataframe
data_scaled_df.head()

Unnamed: 0,sum_export,sum_import,sum_stock,sum_production,cpi,interest,gas_Henry,crude_brent,vehicle_sales,personal_income,export_index,import_index,year,month
0,0.352589,-0.625748,0.469667,1.721716,-1.182179,0.210813,-0.934219,-0.344392,1.217122,-1.835994,-0.4885,-0.515031,-1.235479,-1.463338
1,0.135693,-1.254298,0.50971,0.741029,-1.139438,0.22877,-0.988381,-0.664402,1.08523,-1.720854,-0.752446,-0.63352,-1.235479,-1.174892
2,0.790185,-1.948153,0.718355,1.648917,-1.173474,-0.327899,-1.047468,-1.611614,-1.719946,-2.033776,-0.838333,-0.948651,-1.235479,-0.886445
3,-0.97047,-2.337465,1.209605,0.144388,-1.277818,-0.68704,-1.072088,-2.157512,-3.157201,-0.101823,-1.508673,-1.518407,-1.235479,-0.597999
4,-1.090505,-1.242368,1.279146,-2.04746,-1.277515,-0.68704,-1.067164,-1.716949,-1.382136,-0.830138,-1.548475,-1.377229,-1.235479,-0.309552


# Splitting Data for Training and Testing

In [8]:
# Separate the features from the target
y = data["montly_value"]
X = data_scaled_df

In [9]:
# Split the data into training and testing sets
# Set the test size as 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
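
Note that the scaler in the preprocessing step was fitted on every row before this split, so the test rows contribute to the scaling statistics. Tree-based models are largely insensitive to feature scaling, but a leakage-free ordering can be sketched as follows: split first, fit `StandardScaler` on the training rows, and reuse those statistics for the test rows. The synthetic `X_demo` and `y_demo` are placeholders, not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature table (41 monthly rows, 3 features).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(41, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series(rng.normal(size=41), name="target")

# Split first so the test rows stay unseen during preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = pd.DataFrame(
    scaler.transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
X_te_scaled = pd.DataFrame(
    scaler.transform(X_te), columns=X_te.columns, index=X_te.index
)

# Training columns have mean ~0 by construction; test columns generally do not.
print(X_tr_scaled.mean().round(6))
print(X_te_scaled.mean().round(3))
```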

# Building XGBoost Model and making predictions

In [10]:
# Create an XGBoost Random Forest Regressor model
xgb_model = xgb.XGBRFRegressor(random_state=42)

In [11]:
# Train the model on the training data
xgb_model.fit(X_train, y_train)

In [12]:
# Make predictions on the testing data
y_pred = xgb_model.predict(X_test)

# Evaluating Model Performance

In [42]:
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = mse**0.5
# score() follows the scikit-learn regressor API and returns the
# coefficient of determination (R-squared) of the prediction
score = xgb_model.score(X_test, y_test)

print("Model Performance Result")
print("-------------------------------------")
print(f"Mean Squared Error: {round(mse,3)}")
print(f"Root Mean Squared Error (RMSE): {round(rmse,3)}")
print(f"R-squared: {round(score,3)}")

Model Performance Result
-------------------------------------
Mean Squared Error: 33.014
Root Mean Squared Error (RMSE): 5.746
R-squared: 0.946


In [17]:
# Get the feature importance array
importances = xgb_model.feature_importances_

In [41]:
# List the top five most important features
feature_importance_list = sorted(zip(xgb_model.feature_importances_, X.columns), reverse=True)

for importance, feature_name in feature_importance_list[:5]:
    print(f"Feature: {feature_name}, Importance: {importance}")

Feature: cpi, Importance: 0.4327486455440521
Feature: sum_stock, Importance: 0.13244210183620453
Feature: export_index, Importance: 0.10596395283937454
Feature: crude_brent, Importance: 0.08285009860992432
Feature: import_index, Importance: 0.08249511569738388


In [29]:
# Check each column's correlation with the monthly crude oil price ("montly_value"), sorted in descending order
correlation = data.corr().sort_values('montly_value',ascending = False).iloc[:,[0]]
correlation

Unnamed: 0,montly_value
montly_value,1.0
crude_brent,0.997987
export_index,0.952818
import_index,0.933143
cpi,0.803724
gas_Henry,0.795831
year,0.765163
sum_import,0.632127
personal_income,0.566913
sum_export,0.397303


## Model Evaluation Report
---
### Statistical Analysis

- **Mean Squared Error (MSE)**: 33.014
- **Root Mean Squared Error (RMSE)**: 5.746
  - RMSE is the square root of MSE and is expressed in the same units as the target, so the model's typical prediction error on the test set is about 5.75 price units (USD per barrel) relative to the actual monthly crude oil prices.
- **R-squared (R2)**: 0.946
  - An R-squared value of 0.946 suggests that the model explains approximately 94.6% of the variance in crude oil prices in the held-out test set.
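
For reference, the three metrics relate as follows: MSE is the mean of the squared residuals, RMSE is its square root, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the observed values. A minimal sketch on made-up numbers (not the notebook's predictions) checks the hand computation against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up observed and predicted prices, for illustration only.
y_true = np.array([70.0, 80.0, 90.0, 100.0])
y_hat = np.array([72.0, 78.0, 93.0, 97.0])

# MSE and RMSE from the residuals.
residuals = y_true - y_hat
mse_manual = np.mean(residuals ** 2)
rmse_manual = np.sqrt(mse_manual)

# R-squared = 1 - SS_res / SS_tot.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# The hand-computed values should match scikit-learn's.
print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```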
  
### Features & Correlation Analysis
- **Consumer Price Index (CPI)**: According to the XGBoost feature importance analysis, CPI is the most influential feature, with an importance score of approximately 0.433. It also has a substantial correlation of 0.804 with monthly WTI crude oil prices. This suggests that the general price level, and the broader economic conditions it reflects, plays an important role in WTI price dynamics. A more specific CPI category, such as CPI for vehicles, could potentially yield an even higher correlation.

- **Export/Import Price Index (NAICS): Petroleum and Coal Products Manufacturing**: The Export Price Index and the Import Price Index for the NAICS category Petroleum and Coal Products Manufacturing show strong correlations of 0.95 and 0.93, respectively, with monthly WTI crude oil prices. These correlations point to a close link between international trade prices in the petroleum and coal products sector and fluctuations in crude oil prices.

- **Crude Oil Brent Price**: Brent crude, sourced primarily from the North Sea in Europe, is another leading global benchmark. It has the highest correlation with the target among all features, at 0.998, and ranks fourth in the XGBoost feature importance analysis. (**Note**: Because Brent is so highly correlated with the target variable, it may need to be excluded as an independent variable when constructing a weekly model for predicting WTI crude oil prices. Removing such highly correlated features gives a more rigorous test of the model's real-world performance.)
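
One way to act on the note above is to flag features whose absolute correlation with the target exceeds a chosen cutoff and drop them before refitting. The sketch below uses synthetic data; the 0.99 cutoff and the column names `near_copy` and `noisy_driver` are arbitrary assumptions for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature that nearly duplicates the target, one that does not.
rng = np.random.default_rng(0)
n = 200
target = rng.normal(loc=70.0, scale=15.0, size=n)
demo = pd.DataFrame({
    "target": target,
    "near_copy": target + rng.normal(scale=0.5, size=n),
    "noisy_driver": 0.5 * target + rng.normal(scale=15.0, size=n),
})

# Hypothetical cutoff for "too correlated with the target".
threshold = 0.99

# Correlation of every feature with the target (the target itself is excluded).
corr_with_target = demo.corr()["target"].drop("target")
too_correlated = corr_with_target[corr_with_target.abs() > threshold].index.tolist()

# Feature table with the flagged columns removed.
features = demo.drop(columns=["target"] + too_correlated)

print(corr_with_target.round(3))
print("Dropped:", too_correlated)
```

Comparing test-set metrics with and without the flagged columns would show how much of the fit depends on them.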

### Conclusion
These metrics are strong indications that the model fits the data well and captures a significant portion of the price fluctuations. Even so, it would be prudent to refine the temporal granularity. Given the unique dynamics of the oligopolistic crude oil market, moving to weekly or daily data could yield more meaningful R-squared (R2) scores, with greater statistical and financial significance, and would improve the ability to derive actionable insights for crude oil trading and risk management.
