### Code to iterate through the subset datasets and outlier thresholds, fitting a SARIMAX model to each combination to collect error metrics
#### References: 
- Harris, C. R., Millman, K. J., Van Der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., Van Kerkwijk, M. H., Brett, M., Haldane, A., Del Río, J. F., Wiebe, M., Peterson, P., … Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825), 357–362. https://doi.org/10.1038/s41586-020-2649-2
- Kalyvas, V. (2024, January 19). Time Series Episode 3: ARIMA forecasting with exogenous variables. Medium. https://python.plainenglish.io/time-series-episode-3-arima-forecasting-with-exogenous-variables-6658f82170e4
- Peixeiro, M. (2022). Time series forecasting in Python (Section 9). Manning.


#### Packages
- Package Pandas (2.2). (2024). [Python]. https://pandas.pydata.org/
- Package NumPy (1.23). (2023). [Python]. https://numpy.org/
- Hunter, J. D., & Droettboom, M. (2024). Package matplotlib (3.8.4) [Python]. https://matplotlib.org
- Package scikit-learn (1.4). (2024). [Python]. https://scikit-learn.org/stable/index.html
- Package statsmodels (0.14). (2024). [Python]. statsmodels. https://github.com/statsmodels/statsmodels

In [7]:
# Import packages
from pmdarima import auto_arima
import pandas as pd
import numpy as np
import useful_functions as uf
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error
from sklearn.preprocessing import MinMaxScaler

# Define a list of file paths
file_paths = [
    '../data/data_orig_parameters.csv',
    '../data/data_cleaned_RF.csv',
    '../data/data_cleaned_LASSO.csv',
    '../data/data_cleaned_RFE.csv'
]

# List of thresholds for outliers
outlier_thresholds = [np.nan, 0.05, 0.10, 0.15, 0.20]

# Dictionary to store the errors
errors_dict = {}

# Loop through the files and thresholds
for file_path in file_paths:
    print(f"REading File: {file_path}") # Print the file path
    for remove_outliers_threshold in outlier_thresholds:
        print(f"Outlier Threshold: {remove_outliers_threshold}") # Print the threshold
        # Load data
        df_raw = pd.read_csv(file_path, parse_dates=['Date'], index_col='Date')
        target_variable = df_raw.columns[0]

        # Remove outliers using the threshold
        if not pd.isna(remove_outliers_threshold):
            df_cleaned = uf.remove_outliers(df_raw.copy(), threshold=remove_outliers_threshold)
        else:
            df_cleaned = df_raw.copy()

        # After removing the outliers, fill the missing values
        df_adjusted = uf.fill_missing_values(df_cleaned)

        # Define the train and test sets
        test_size = 48  # months
        df_train = df_adjusted[:-test_size]
        df_test = df_adjusted[-test_size:]

        # Let's scale the DataFrames
        # Define the scaler
        scaler = MinMaxScaler(feature_range=(0,1))
        scaled_train = scaler.fit_transform(df_train) # Fit and transform the train set
        scaled_test = scaler.transform(df_test) # Transform the test set
        # include df columns names in the train and test sets
        train = pd.DataFrame(scaled_train, columns=df_adjusted.columns)
        test = pd.DataFrame(scaled_test, columns=df_adjusted.columns)
        # Include the index in the train and test sets
        train.index = df_adjusted.index[:-test_size]
        test.index = df_adjusted.index[-test_size:]
        # define the exogenous variables as all except the first column
        exog_var_train = train.iloc[:, 1:].ffill() # fill NAs with the last valid observation
        exog_var_test = test.iloc[:, 1:].ffill()  # fill NAs with the last valid observation
        # Define the model using the same parameters as the SARIMA
        model = SARIMAX(train[target_variable], order=(5,1,4), 
                        seasonal_order=(2,0,0,12), exog = exog_var_train)
        # Fit the model
        model_fit = model.fit(disp=False, maxiter=200)
        # Predict the test set
        predictions = model_fit.forecast(steps=len(test[target_variable]), exog = exog_var_test)

        # Let's reverse the scaling to get the real values
        original_data_test = df_adjusted[-test_size:][target_variable]
        # Convert Pandas Series to NumPy arrays and reshape
        forecasts_on_test_scaled_np = predictions.to_numpy().reshape(-1, 1)
        forecasts_on_test_scaled_np = np.repeat(forecasts_on_test_scaled_np,test.shape[1], axis=-1)

        # Inverse transform to get the real values
        forecasts_on_test_all = scaler.inverse_transform(forecasts_on_test_scaled_np)

        # Subset the forecast to get only the first column
        forecasts_on_test = forecasts_on_test_all[:,0]

        # Convert to pandas dataframe and include the index
        forecasts_on_test = pd.DataFrame(forecasts_on_test, index=test.index, columns=[target_variable])

        # Calculate the errors
        mape = mean_absolute_percentage_error(original_data_test, forecasts_on_test)
        rmse = np.sqrt(mean_squared_error(original_data_test, forecasts_on_test))
        mae = mean_absolute_error(original_data_test, forecasts_on_test)

        # Save the errors and the model summary in the dictionary
        errors_dict[(file_path, remove_outliers_threshold)] = {'MAPE': mape, 'RMSE': rmse, 'MAE': mae, 'model': model_fit.summary()}



REading File: ../data/data_orig_parameters.csv
Outlier Threshold: nan


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  warn('Non-stationary starting autoregressive parameters'
  warn('Non-invertible starting MA parameters found.'


Outlier Threshold: 0.05


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  warn('Non-stationary starting autoregressive parameters'
  warn('Non-invertible starting MA parameters found.'


Outlier Threshold: 0.1


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  warn('Non-stationary starting autoregressive parameters'
  warn('Non-invertible starting MA parameters found.'


Outlier Threshold: 0.15


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  warn('Non-stationary starting autoregressive parameters'
  warn('Non-invertible starting MA parameters found.'


Outlier Threshold: 0.2


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  warn('Non-stationary starting autoregressive parameters'
  warn('Non-invertible starting MA parameters found.'


REading File: ../data/data_cleaned_RF.csv
Outlier Threshold: nan


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)


Outlier Threshold: 0.05


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)


Outlier Threshold: 0.1


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  warn('Non-stationary starting autoregressive parameters'
  warn('Non-invertible starting MA parameters found.'


Outlier Threshold: 0.15


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)


Outlier Threshold: 0.2


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)


REading File: ../data/data_cleaned_LASSO.csv
Outlier Threshold: nan


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  warn('Non-stationary starting autoregressive parameters'
  warn('Non-invertible starting MA parameters found.'


Outlier Threshold: 0.05


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  warn('Non-stationary starting autoregressive parameters'
  warn('Non-invertible starting MA parameters found.'


Outlier Threshold: 0.1


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  warn('Non-stationary starting autoregressive parameters'
  warn('Non-invertible starting MA parameters found.'


Outlier Threshold: 0.15


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  warn('Non-stationary starting autoregressive parameters'
  warn('Non-invertible starting MA parameters found.'


Outlier Threshold: 0.2


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)


REading File: ../data/data_cleaned_RFE.csv
Outlier Threshold: nan


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)


Outlier Threshold: 0.05


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)


Outlier Threshold: 0.1


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)


Outlier Threshold: 0.15


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)


Outlier Threshold: 0.2


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)


In [8]:
# Print the errors to evaluate the best model
for key, value in errors_dict.items():
    mape = value['MAPE']
    rmse = value['RMSE']
    mae = value['MAE']
    print(f"Model: SARIMAX., File: {key[0]}, Outlier Threshold: {key[1]} ->, MAPE: {mape:.2f}, RMSE: {rmse:.2f}, MAE: {mae:.2f}")

Model: SARIMAX., File: ../data/data_orig_parameters.csv, Outlier Threshold: nan ->, MAPE: 4.00, RMSE: 90089.25, MAE: 70023.61
Model: SARIMAX., File: ../data/data_orig_parameters.csv, Outlier Threshold: 0.05 ->, MAPE: 3.25, RMSE: 77843.47, MAE: 58207.18
Model: SARIMAX., File: ../data/data_orig_parameters.csv, Outlier Threshold: 0.1 ->, MAPE: 4.53, RMSE: 45271.07, MAE: 35030.28
Model: SARIMAX., File: ../data/data_orig_parameters.csv, Outlier Threshold: 0.15 ->, MAPE: 10.06, RMSE: 28717.24, MAE: 23046.59
Model: SARIMAX., File: ../data/data_orig_parameters.csv, Outlier Threshold: 0.2 ->, MAPE: 3.15, RMSE: 27656.73, MAE: 20848.04
Model: SARIMAX., File: ../data/data_cleaned_RF.csv, Outlier Threshold: nan ->, MAPE: 1.43, RMSE: 50348.55, MAE: 33085.53
Model: SARIMAX., File: ../data/data_cleaned_RF.csv, Outlier Threshold: 0.05 ->, MAPE: 1.32, RMSE: 45320.22, MAE: 30772.38
Model: SARIMAX., File: ../data/data_cleaned_RF.csv, Outlier Threshold: 0.1 ->, MAPE: 4.85, RMSE: 34119.23, MAE: 24432.60
Mod
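
When reading the figures above, it helps to recall how the three metrics are defined. scikit-learn's `mean_absolute_percentage_error` returns a fraction rather than a percentage, so a printed MAPE of 4.00 corresponds to 400%, which can happen when the actual values pass close to zero. The sketch below re-implements the metrics in plain Python on toy values that are not taken from the notebook. The function names are illustrative, and unlike scikit-learn this version does not guard against actual values equal to zero.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction (0.05 means 5%)."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the units of the target."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error, in the units of the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy values (hypothetical), chosen so the arithmetic is easy to follow
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]

mape_value = mape(actual, forecast)
rmse_value = rmse(actual, forecast)
mae_value = mae(actual, forecast)
print(f"MAPE: {mape_value:.4f} (fraction), RMSE: {rmse_value:.2f}, MAE: {mae_value:.2f}")
```

Because RMSE and MAE are in the units of the target while MAPE is relative, the three metrics can rank the file and threshold combinations differently, as the printed results show.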

In [9]:
print(model_fit.summary())

                                      SARIMAX Results                                      
Dep. Variable:             ECO_fiscal_result_month   No. Observations:                  228
Model:             SARIMAX(5, 1, 4)x(2, 0, [], 12)   Log Likelihood                 301.489
Date:                             Mon, 15 Apr 2024   AIC                           -498.978
Time:                                     15:21:52   BIC                           -320.880
Sample:                                 01-01-2001   HQIC                          -427.113
                                      - 12-01-2019                                         
Covariance Type:                               opg                                         
                                         coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------------
EXPEC_inflation_y                     -0.0604      0.068  
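
The information criteria in the summary header can be reproduced from the log-likelihood, which gives a quick consistency check. The sketch below uses the standard definitions `AIC = 2k - 2 lnL`, `BIC = k ln(n) - 2 lnL` and `HQIC = 2k ln(ln(n)) - 2 lnL`. The values `k = 52` and `n = 227` are not printed in the summary. They are assumed here because they reproduce the printed AIC, BIC and HQIC to within rounding, with `n` one less than the 228 listed observations.

```python
import math

log_likelihood = 301.489  # "Log Likelihood" from the summary header
k = 52   # assumed number of estimated parameters (inferred, not printed)
n = 227  # assumed effective sample size (inferred, not printed)

aic = 2 * k - 2 * log_likelihood
bic = k * math.log(n) - 2 * log_likelihood
hqic = 2 * k * math.log(math.log(n)) - 2 * log_likelihood

# Compare with the header values: AIC -498.978, BIC -320.880, HQIC -427.113
print(f"AIC:  {aic:.3f}")
print(f"BIC:  {bic:.3f}")
print(f"HQIC: {hqic:.3f}")
```

If `k = 52` is right, the 12 ARMA, seasonal and variance terms of a (5, 1, 4)x(2, 0, 0, 12) specification would leave 40 coefficients for the exogenous regressors, which is worth checking against the number of columns in the input file.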