# TIØ4317 Empirical Mini Project
## Introduction
TODO: Describe what this project is, its purpose, etc.

## Data Preprocessing

In [None]:
# Imports
import pandas as pd
import numpy as np
import os
from matplotlib import pyplot as plt
import seaborn as sns
from statsmodels.tsa.stattools import adfuller
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey, het_arch
from statsmodels.graphics.tsaplots import plot_pacf
from sklearn.metrics import mean_absolute_error, mean_squared_error

We start by reading the different csv-files that contain the data.

In [None]:
# Define the folder path
data_path = "data/"

# Load datasets
zero_coupon = pd.read_csv(os.path.join(data_path, "zero_coupon_rates.csv"), delimiter=";")
exchange_rates = pd.read_csv(os.path.join(data_path, "usd_nok.csv"), delimiter=";")
inflation = pd.read_csv(os.path.join(data_path, "kpi.csv"), delimiter=";")
osebx = pd.read_csv(os.path.join(data_path, "osebx_prices.csv"), delimiter=";")

# Replace commas with dots in the KPI column and convert to float
inflation["kpi"] = inflation["kpi"].str.replace(",", ".").astype(float)


Then, we check for missing values in the data.

In [None]:
datasets = {
    "OSEBX": osebx,
    "Zero Coupon": zero_coupon,
    "Exchange Rates": exchange_rates,
    "Inflation": inflation
}

for name, df in datasets.items():
    print(f"\nMissing Values in {name} Dataset:")
    print(df.isnull().sum())

From the output, we can see that the data does not contain any null values.

Next, we convert the dates to datetime format. Since inflation is only available as monthly data, we expand it to daily frequency by linearly interpolating between the monthly observations.
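
Forward-fill and linear interpolation are the two usual ways of expanding monthly observations to a daily calendar, and they behave differently. The sketch below compares them on made-up values (the numbers and dates are illustrative assumptions, not project data). Forward-fill only repeats values that were already published, while interpolation uses the next month's value before it is known.

```python
import pandas as pd

# Hypothetical monthly index values (illustrative only, not project data)
monthly = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
    "kpi": [100.0, 103.1, 106.0],
})

# Expand to a daily calendar; days without an observation become NaN
daily = monthly.set_index("Date").reindex(
    pd.date_range("2020-01-01", "2020-03-01", freq="D")
)

# Option 1: forward-fill repeats the last published value (no look-ahead)
ffilled = daily["kpi"].ffill()

# Option 2: linear interpolation draws a straight line between observations,
# so mid-month values depend on the following month's figure
interpolated = daily["kpi"].interpolate(method="linear")

mid_january = pd.Timestamp("2020-01-16")
print(ffilled.loc[mid_january], interpolated.loc[mid_january])
```

Which option is appropriate depends on whether look-ahead in the explanatory variable is acceptable for the forecasting exercise.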

In [None]:
# Convert Date to Datetime format
osebx["Date"] = pd.to_datetime(osebx["Date"])
zero_coupon["TIME_PERIOD"] = pd.to_datetime(zero_coupon["TIME_PERIOD"])
exchange_rates["TIME_PERIOD"] = pd.to_datetime(exchange_rates["TIME_PERIOD"])
inflation["Date"] = pd.to_datetime(inflation["Date"], format="%YM%m")

# Create a full date range from the first to last available date in your dataset
full_date_range = pd.date_range(start=inflation["Date"].min(), end=inflation["Date"].max(), freq="D")

# Create a DataFrame with daily dates
inflation_daily = pd.DataFrame({"Date": full_date_range})

# Merge with the original inflation data (left join) and linearly interpolate the missing days
inflation_daily = inflation_daily.merge(inflation, on="Date", how="left")
inflation_daily["kpi"] = inflation_daily["kpi"].interpolate(method="linear")

Compute log returns for OSEBX. Also, rename columns to ensure that all dataframes have a column named "Date", so that we can merge all datasets on "Date". After having merged all the datasets to one dataframe, we drop all the columns we are not interested in. Consequently, the columns we are left with are "Date", "kpi", "zero_coupon_rate", "usd_nok_exchange_rate", and "log_return".
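
As a small illustration of the log-return definition r_t = ln(P_t / P_{t-1}), the sketch below uses hypothetical closing prices (not project data). It also shows why log returns are convenient: they add up over time to the log of the total price ratio.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices (illustrative only)
close = pd.Series([100.0, 102.0, 99.0, 101.0], name="Close")

# Log return r_t = ln(P_t / P_{t-1}); the first value is NaN and is dropped
log_return = np.log(close / close.shift(1)).dropna()

# Log returns are additive: their sum equals the log of the total price ratio
total = np.log(close.iloc[-1] / close.iloc[0])
print(np.isclose(log_return.sum(), total))
```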

In [None]:
# Compute log returns
osebx["log_return"] = np.log(osebx["Close"] / osebx["Close"].shift(1))
osebx.dropna(inplace=True)  # Drop the first row where return cannot be calculated

# Rename columns to "Date"
zero_coupon.rename(columns={"TIME_PERIOD": "Date"}, inplace=True)
exchange_rates.rename(columns={"TIME_PERIOD": "Date"}, inplace=True)

# Merge all datasets on 'Date'
df = inflation_daily.merge(zero_coupon, on="Date", how="inner")
df = df.merge(exchange_rates, on="Date", how="inner")
df = df.merge(osebx, on="Date", how="inner") 

df = df.drop(columns=["Close", "High", "Low", "Open", "Volume",
       "FREQ_x", "Frequency_x", "TENOR_x", "Tenor_x", "DECIMALS_x",
       "FREQ_y", "Frequency_y", "BASE_CUR", "Base Currency",
       "QUOTE_CUR", "Quote Currency", "TENOR_y", "Tenor_y", "DECIMALS_y",
       "CALCULATED", "UNIT_MULT", "Unit Multiplier", "COLLECTION",
       "Collection Indicator"])

df.rename(columns={"OBS_VALUE_x": "zero_coupon_rate"}, inplace=True)
df.rename(columns={"OBS_VALUE_y": "usd_nok_exchange_rate"}, inplace=True)

print(df.head())

Check for multicollinearity by computing the correlation matrix.

In [None]:
# Compute the correlation matrix
selected_columns = ["kpi", "usd_nok_exchange_rate", "zero_coupon_rate"]
corr_matrix = df[selected_columns].corr()

# Visualize the correlation matrix as a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix")
plt.show()

We treat two variables as highly correlated if the absolute value of their correlation coefficient exceeds 0.9. Since no pair of variables exceeds this threshold, we keep all of the explanatory variables. Note, however, that the correlations between usd_nok_exchange_rate and kpi and between zero_coupon_rate and kpi are close to the threshold.
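
The threshold rule can also be applied programmatically instead of reading the heatmap. The helper below is a hypothetical sketch (the function name `high_correlation_pairs` and the toy columns are assumptions made for illustration) that lists every pair whose absolute correlation exceeds the threshold.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(frame: pd.DataFrame, threshold: float = 0.9):
    """Return (col_a, col_b, corr) for each pair with |corr| above threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return pairs

# Toy data: "b" is almost a copy of "a", "c" is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

flagged = high_correlation_pairs(toy)
print(flagged)
```

In the notebook, the same helper could be called on `df[selected_columns]`.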

After having checked for correlation, we check for stationarity by applying the Augmented Dickey-Fuller (ADF) test.

In [None]:
# Check for stationarity 
def adf_test(series, var_name):
    result = adfuller(series.dropna())  
    print("="*40)  # Separator line
    print(f"ADF Test for {var_name}:")
    print(f"ADF Statistic: {result[0]:.4f}")
    print(f"p-value: {result[1]:.4f}")
    print("Stationary" if result[1] < 0.05 else "Non-stationary")
    print("="*40 + "\n")  # End separator

# Run ADF tests on all relevant variables
adf_test(df["log_return"], "Log Return")
adf_test(df["zero_coupon_rate"], "Zero Coupon Rate")  
adf_test(df["usd_nok_exchange_rate"], "USD/NOK Exchange Rate")  
adf_test(df["kpi"], "CPI")

From the output, we can see that the variables "zero coupon rate", "USD/NOK exchange rate", and "CPI" are non-stationary. To make them stationary, we take first differences of these columns.
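
To see why first differencing helps, consider a simulated random walk (an assumption made purely for illustration, not a model of the project data). Its level is non-stationary, but differencing recovers the underlying independent shocks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
shocks = rng.normal(size=1000)

# Random walk: level_t = level_{t-1} + shock_t (non-stationary in levels)
level = pd.Series(np.cumsum(shocks))

# First difference: level_t - level_{t-1} gives back the shocks
diffed = level.diff().dropna()

# The level is highly persistent, the differenced series is not
print(f"lag-1 autocorrelation, level:      {level.autocorr(lag=1):.3f}")
print(f"lag-1 autocorrelation, difference: {diffed.autocorr(lag=1):.3f}")
```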

In [None]:
# Transform to stationary variables by applying first differences
df["d_kpi"] = df["kpi"].diff()
df["d_zero_coupon_rate"] = df["zero_coupon_rate"].diff()
df["d_usd_nok_exchange_rate"] = df["usd_nok_exchange_rate"].diff()

df = df.dropna()

adf_test(df["d_zero_coupon_rate"], "Differenced Zero Coupon Rate") 
adf_test(df["d_usd_nok_exchange_rate"], "Differenced USD/NOK Exchange Rate") 
adf_test(df["d_kpi"], "Differenced CPI") 

From the output, we can see that all the variables are now stationary.

### Split into training and testing data
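
The split is chronological rather than random, so the test set always lies after the training set in time. A minimal sketch of this idea on toy data (the helper name `chronological_split` is hypothetical):

```python
import pandas as pd

def chronological_split(frame: pd.DataFrame, train_fraction: float = 0.8):
    """Split a time-ordered frame into an early training part and a late test part."""
    split_idx = int(len(frame) * train_fraction)
    return frame.iloc[:split_idx].copy(), frame.iloc[split_idx:].copy()

# Toy frame with ten consecutive days
toy = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "value": range(10),
})

train, test = chronological_split(toy)

# Every training date precedes every test date, so no future data leaks into training
print(train["Date"].max() < test["Date"].min())
```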

In [None]:
# Define the split percentage (e.g., 80% train, 20% test)
split_idx = int(len(df) * 0.8)

# Split the data
train_df = df.iloc[:split_idx].copy()  # First 80% for training
test_df = df.iloc[split_idx:].copy()   # Last 20% for testing

# Define training and testing sets
X_train, y_train = train_df[['d_kpi', 'd_zero_coupon_rate', 'd_usd_nok_exchange_rate']], train_df['log_return']
X_test, y_test = test_df[['d_kpi', 'd_zero_coupon_rate', 'd_usd_nok_exchange_rate']], test_df['log_return']


## Multiple Linear Regression (MLR)
### Autocorrelation

In [None]:
# Add constant for intercept
X_train_ols = sm.add_constant(X_train)

# Train OLS model
ols_model = sm.OLS(y_train, X_train_ols).fit()

# Print summary
print(ols_model.summary())

We check for autocorrelation in the residuals using the Breusch-Godfrey test.

NOTE: Look into what LM, F, etc. are. Is it necessary to include them?
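
For context on the LM statistic, the Breusch-Godfrey test regresses the OLS residuals on the original regressors plus lagged residuals. The LM statistic is n times the R-squared of that auxiliary regression and is approximately chi-squared with `nlags` degrees of freedom under the null of no autocorrelation. The F variant tests the joint significance of the lagged residuals in the same auxiliary regression, so both statistics address the same null. The from-scratch sketch below is an illustrative assumption (simulated data, pre-sample residuals set to zero, a constant included in `X`), not a description of the library's internals.

```python
import numpy as np

def breusch_godfrey_lm(y, X, nlags=1):
    """LM statistic: n * R^2 from regressing OLS residuals on X and lagged residuals."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta

    # Lagged residuals, with pre-sample values set to zero
    lagged = np.column_stack([
        np.concatenate([np.zeros(k), resid[:-k]]) for k in range(1, nlags + 1)
    ])
    Z = np.column_stack([X, lagged])

    # Auxiliary regression of residuals on Z; residuals have zero mean
    # because X contains a constant, so this R^2 is the centered one
    gamma, *_ = np.linalg.lstsq(Z, resid, rcond=None)
    aux_resid = resid - Z @ gamma
    r_squared = 1.0 - (aux_resid @ aux_resid) / (resid @ resid)
    return n * r_squared

# Simulated regression with AR(1) errors (coefficient 0.6)
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
shocks = rng.normal(size=n)
errors = np.zeros(n)
for t in range(1, n):
    errors[t] = 0.6 * errors[t - 1] + shocks[t]
y = 1.0 + 2.0 * x + errors
X = np.column_stack([np.ones(n), x])

lm = breusch_godfrey_lm(y, X, nlags=1)
# 3.84 is the 5% critical value of the chi-squared distribution with 1 degree of freedom
print(lm > 3.84)
```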

In [None]:
def breusch_godfrey_test(model, nlags=1):
   
    # Perform the test
    lm_stat, p_value, f_stat, f_p_value = acorr_breusch_godfrey(model, nlags=nlags)

    # Print results
    print(f"Breusch-Godfrey Test (nlags={nlags})")
    print(f"LM Statistic: {lm_stat:.4f}")
    print(f"P-Value: {p_value:.4f}")
    print(f"F-Statistic: {f_stat:.4f}")
    print(f"F-Test P-Value: {f_p_value:.4f}")

    # Interpretation
    if p_value < 0.05:
        print("Autocorrelation detected in residuals (reject H0).")
    else:
        print("No significant autocorrelation detected (fail to reject H0).")

    # Return results as a dictionary
    return {
        "LM Statistic": lm_stat,
        "P-Value": p_value,
        "F-Statistic": f_stat,
        "F-Test P-Value": f_p_value
    }

breusch_godfrey_test(ols_model, nlags=1)

To deal with autocorrelation we will try to add lagged variables. To get an impression of how many lags to use we visualize the partial autocorrelation function (PACF). 

In [None]:
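# Check PACF for 'log_return'
# Pass the axes explicitly; otherwise plot_pacf opens its own figure and the sized one stays empty
fig, ax = plt.subplots(figsize=(8, 5))
plot_pacf(df['log_return'], lags=21, method='ywm', ax=ax)
ax.set_title("Partial Autocorrelation Function (PACF)")
plt.show()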

The figure above shows the partial autocorrelation coefficient for each lag, with the shaded blue band representing the 95% confidence interval. Any bar extending beyond this band is statistically significant. From the figure, we observe that lag 7 and lag 18 seem to be the most significant.

Before settling on lags 7 and 18, we first add lags 1 through 30 of each explanatory variable, refit the model, and check whether this removes the autocorrelation.
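
Lagged variables are built with `shift`, which moves values down by the given number of rows so that row t holds the value from t minus the lag. A small sketch on a toy column (the name `d_x` is hypothetical) shows the alignment and the rows lost to NaN:

```python
import pandas as pd

toy = pd.DataFrame({"d_x": [0.1, 0.2, 0.3, 0.4, 0.5]})

# shift(k) puts the value from k rows earlier on the current row
for lag in (1, 2):
    toy[f"d_x_lag{lag}"] = toy["d_x"].shift(lag)

# The first max(lag) rows contain NaN and are dropped
clean = toy.dropna()
print(clean)
```

With 30 lags, the first 30 rows of the training set are lost in the same way.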

In [None]:
# Define the lags you want to create
lags = list(range(1, 31))  # Lags 1 through 30
variables = ['d_kpi', 'd_zero_coupon_rate', 'd_usd_nok_exchange_rate']

# Create lagged variables using a loop
for var in variables:
    for lag in lags:
        train_df[f'{var}_lag{lag}'] = train_df[var].shift(lag)

# Drop NaN values introduced by lagging
train_df.dropna(inplace=True)

# Define independent variables dynamically
X = train_df[[col for col in train_df.columns if col.startswith('d_kpi') or 
               col.startswith('d_zero_coupon_rate') or 
               col.startswith('d_usd_nok_exchange_rate')]]


y_train = train_df['log_return']

# Add a constant for intercept
X = sm.add_constant(X)

# Fit the new OLS model
lagged_model = sm.OLS(y_train, X).fit()


Check for autocorrelation again

In [None]:
breusch_godfrey_test(lagged_model, nlags=1)

We continue with a model that keeps only lags 7 and 18 of each explanatory variable. TODO: justify this choice.

In [None]:
# Define the lags we want to keep
keep_lags = [7, 18]

# Keep only the original variables and the selected lagged variables
columns_to_keep = ['d_kpi', 'd_zero_coupon_rate', 'd_usd_nok_exchange_rate']  # Original columns
columns_to_keep += [f'{var}_lag{lag}' for var in variables for lag in keep_lags]  # Only lags 7 & 18

# Select only the necessary columns
train_df = train_df[columns_to_keep + ['log_return']].copy()  # Copy so the in-place dropna below acts on an independent frame

# Drop NaN values introduced by lagging
train_df.dropna(inplace=True)

# Define independent and dependent variables
X = train_df.drop(columns=['log_return'])
y_train = train_df['log_return']

# Add a constant for intercept
X = sm.add_constant(X)

# Fit the new OLS model
keep_lags_model = sm.OLS(y_train, X).fit()

breusch_godfrey_test(keep_lags_model, nlags=1)

### Heteroskedasticity
We continue by testing for heteroskedasticity. OLS inference assumes homoskedasticity, meaning that all errors have the same variance. If this assumption fails, the standard errors, confidence intervals, and hypothesis tests from the OLS regression are no longer reliable.

We test if the homoskedasticity assumption holds by using the ARCH test, as it is well-suited for time-series data because it detects time-dependent variance. 
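
The idea behind the ARCH test can be sketched from scratch: regress the squared residuals on their own lags and use n times the R-squared as the LM statistic, which is approximately chi-squared with `nlags` degrees of freedom under the null of no ARCH effects. The code below is an illustrative assumption on simulated ARCH(1) data (the function name `arch_lm` is hypothetical), not the library implementation.

```python
import numpy as np

def arch_lm(resid, nlags=1):
    """LM statistic: n * R^2 from regressing squared residuals on their lags."""
    sq = np.asarray(resid) ** 2
    y = sq[nlags:]
    # Column k holds the squared residual from k periods earlier
    lags = np.column_stack(
        [sq[nlags - k: len(sq) - k] for k in range(1, nlags + 1)]
    )
    Z = np.column_stack([np.ones(len(y)), lags])

    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    fit_resid = y - Z @ coef
    centered = y - y.mean()
    r_squared = 1.0 - (fit_resid @ fit_resid) / (centered @ centered)
    return len(y) * r_squared

# Simulated ARCH(1) series: the variance depends on the previous squared value
rng = np.random.default_rng(7)
n = 2000
z = rng.normal(size=n)
e = np.zeros(n)
e[0] = z[0] * np.sqrt(0.2)
for t in range(1, n):
    e[t] = z[t] * np.sqrt(0.2 + 0.5 * e[t - 1] ** 2)

lm = arch_lm(e, nlags=1)
# 3.84 is the 5% critical value of the chi-squared distribution with 1 degree of freedom
print(lm > 3.84)
```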

In [None]:
residuals = keep_lags_model.resid

arch_test = het_arch(residuals)

print(f"ARCH Test Statistic: {arch_test[0]}")
print(f"p-value: {arch_test[1]}")

To address the strong heteroskedasticity identified by the ARCH test, we use heteroskedasticity-consistent (HC) standard errors to ensure valid inference in our regression model. Since our training dataset consists of approximately 2000 observations, we choose HC3, which is well suited to medium and large sample sizes. This improves the robustness of our statistical inference.
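
To make the HC3 correction concrete, the sketch below computes sandwich standard errors by hand on simulated heteroskedastic data (the data and the helper name `ols_standard_errors` are assumptions for illustration). HC0 weights each observation by its squared residual, while HC3 additionally divides by (1 - h_ii)^2, where h_ii is the leverage, which makes it the more conservative of the two. The coefficient estimates are identical in both cases; only the standard errors change.

```python
import numpy as np

def ols_standard_errors(y, X):
    """Return OLS coefficients with HC0 and HC3 robust standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Leverage h_ii = x_i' (X'X)^{-1} x_i for each observation
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

    def sandwich(weights):
        # (X'X)^{-1} X' diag(weights) X (X'X)^{-1}
        meat = X.T @ (X * weights[:, None])
        return np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

    hc0 = sandwich(resid ** 2)
    hc3 = sandwich(resid ** 2 / (1.0 - leverage) ** 2)
    return beta, hc0, hc3

# Simulated data where the error standard deviation grows with x
rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0.0, 3.0, size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n) * x
X = np.column_stack([np.ones(n), x])

beta, hc0, hc3 = ols_standard_errors(y, X)
print("HC0:", hc0)
print("HC3:", hc3)
```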

In [None]:
model_robust = keep_lags_model.get_robustcov_results(cov_type='HC3')

print(model_robust.summary())

### Performance Evaluation
Now, we want to evaluate the performance of our Multiple Linear Regression (MLR) model. In order to do this, we first need to add lag 7 and 18 to our test set.

After having done this, we use the robust model to make predictions and compute the performance metrics Mean Absolute Percentage Error (MAPE), Mean Squared Error (MSE), and Mean Absolute Error (MAE).

In [None]:
from tabulate import tabulate

X_test = X_test.copy()

X_test['d_kpi_lag7'] = X_test['d_kpi'].shift(7)
X_test['d_zero_coupon_rate_lag7'] = X_test['d_zero_coupon_rate'].shift(7)
X_test['d_usd_nok_exchange_rate_lag7'] = X_test['d_usd_nok_exchange_rate'].shift(7)

X_test['d_kpi_lag18'] = X_test['d_kpi'].shift(18)
X_test['d_zero_coupon_rate_lag18'] = X_test['d_zero_coupon_rate'].shift(18)
X_test['d_usd_nok_exchange_rate_lag18'] = X_test['d_usd_nok_exchange_rate'].shift(18)

# Drop rows with NaN values (first 18 rows will have NaNs due to lags)
X_test = X_test.dropna()

# Ensure y_test aligns with new X_test
y_test = y_test.loc[X_test.index]  

X_test = sm.add_constant(X_test)

y_pred = model_robust.predict(X_test)

mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100  
mse = mean_squared_error(y_test, y_pred)  
mae = mean_absolute_error(y_test, y_pred) 

results = [
    ["Metric", "Value"],
    ["MAPE (%)", f"{mape:.4f}"],
    ["MSE", f"{mse:.4f}"],
    ["MAE", f"{mae:.4f}"]
]

print(tabulate(results, headers="firstrow", tablefmt="grid"))

**Mean Absolute Percentage Error (MAPE)**

A MAPE value of **244.0910%** indicates that, on average, the model's predictions deviate from the actual values by more than twice their magnitude. The model clearly struggles to make accurate predictions. This is likely due to a combination of two factors: the model does not capture the complex dynamics in the explanatory variables, and log returns are very small values, so a small prediction error can translate into a large percentage error.
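
The sensitivity of MAPE to near-zero actual values can be seen on a handful of made-up returns (hypothetical numbers, not model output). The absolute errors are all of similar size, yet the observations closest to zero dominate the percentage error.

```python
import numpy as np

# Hypothetical actual and predicted log returns (illustrative only)
actual = np.array([0.0005, -0.0120, 0.0080, -0.0002])
predicted = np.array([0.0030, -0.0060, 0.0020, 0.0025])

abs_error = np.abs(actual - predicted)
mae = abs_error.mean()

# Percentage error divides by the actual value, which is close to zero for some days
ape = abs_error / np.abs(actual) * 100
mape = ape.mean()

print(f"MAE:  {mae:.4f}")
print(f"MAPE: {mape:.2f}%")
print("Largest percentage error at index", int(np.argmax(ape)))
```

Note that an actual return of exactly zero would make the percentage error undefined.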

**Mean Squared Error (MSE)**

The MSE value of **0.0001** seems low, but this is misleading. As noted above, the scale of the dependent variable is small, which makes the squared errors appear low even when the relative percentage error is large.

**Mean Absolute Error (MAE)**

The MAE value of **0.0061** is the average absolute error in raw terms and indicates small deviations in absolute terms. However, the MAE is low mainly because of the small scale of the dependent variable, which makes the absolute errors seem small even when the relative error is large.

**Conclusion**

While the MSE and MAE values appear low, the extremely high MAPE strongly suggests that the model performs poorly in predicting the dependent variable. The small scale of the OSEBX log returns makes the absolute errors seem small but leads to large percentage errors. The low **adjusted R-squared value of 0.113** further confirms that the model has weak explanatory power.

Overall, the model is not reliable for forecasting OSEBX returns and requires improvements, such as using a more sophisticated approach like ARIMAX.

## ARIMAX

In [None]:
# TODO: Elias and Erik