### About SARIMA

SARIMA = Seasonal AutoRegressive Integrated Moving Average

Seasonal: The part of the model that captures patterns repeating at a fixed interval, such as daily or yearly cycles.

Autoregressive: Autoregression forecasts the variable of interest using a linear combination of its own past values. Essentially, it is a regression of the series against itself. The autoregressive order p is usually read from the partial autocorrelation function (PACF), while the autocorrelation function (ACF) is more informative for the moving-average order q. Autocorrelation at a given lag measures the total correlation between current and past values, including indirect effects carried through the intermediate lags. Partial autocorrelation measures only the direct relationship between the current value and the value at that lag, with the intermediate lags accounted for.
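To make "a regression against itself" concrete, here is a minimal sketch on synthetic data. The coefficient 0.7, the series length, and the helper name `sample_acf` are illustrative assumptions, not values from this notebook. It simulates an AR(1) series, recovers the coefficient by least squares on the lagged series, and computes the sample autocorrelation, which for an AR(1) process decays roughly geometrically with the lag.

```python
import numpy as np

rng = np.random.default_rng(0)
phi_true = 0.7   # assumed AR(1) coefficient, for illustration only
n = 5000

# Simulate y_t = phi * y_{t-1} + e_t
noise = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi_true * y[t - 1] + noise[t]

# Regress y_t on y_{t-1} (least squares without an intercept)
y_lag, y_now = y[:-1], y[1:]
phi_hat = float(y_lag @ y_now / (y_lag @ y_lag))


def sample_acf(x, lag):
    """Sample autocorrelation of x at a positive lag."""
    x = x - x.mean()
    return float(x[:-lag] @ x[lag:] / (x @ x))


acf_1 = sample_acf(y, 1)
acf_2 = sample_acf(y, 2)
print(f"estimated phi: {phi_hat:.3f}")
print(f"ACF at lag 1: {acf_1:.3f}, at lag 2: {acf_2:.3f}")
```

The lag-2 autocorrelation is non-zero even though the series depends directly only on lag 1, which is the indirect effect that the partial autocorrelation removes.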

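Integrated: Integration refers to the differencing applied to the data to make it stationary, or level, meaning its mean and variance stay roughly constant over time. A Dickey-Fuller test is run on the data to check for stationarity; if the test indicates the data is not stationary, the data is differenced and tested again.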
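As a rough illustration of what differencing does, the sketch below builds a synthetic random walk (an assumption for demonstration, not the solar data) and takes its first difference. The helper `half_means` is a hypothetical name used only here.

```python
import numpy as np

rng = np.random.default_rng(1)

# A random walk: each value is the previous value plus noise, so the level drifts
steps = rng.normal(size=2000)
walk = np.cumsum(steps)

# First-order differencing (d = 1): y'_t = y_t - y_{t-1}
diffed = np.diff(walk)


def half_means(x):
    """Mean of the first half and of the second half of a series."""
    mid = len(x) // 2
    return float(x[:mid].mean()), float(x[mid:].mean())


print("random walk half means:", half_means(walk))
print("differenced half means:", half_means(diffed))
```

The two halves of the random walk typically have quite different means, while the differenced series stays centered near zero in both halves.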

Moving Average: The moving-average part is a regression-like model on past forecast errors. The errors can only be found after fitting the model, and they represent the random deviations between the observed variable and the model's predictions.
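A small synthetic example may help here. The sketch below simulates an MA(1) process with an assumed coefficient of 0.6 (not a value from this notebook) and checks a known property: the autocorrelation at lag 1 is theta / (1 + theta^2), and it is approximately zero beyond lag 1. The helper name `sample_acf` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 0.6   # assumed MA(1) coefficient, for illustration only
n = 20000

# y_t = e_t + theta * e_{t-1}, where e_t is white noise
errors = rng.normal(size=n + 1)
y = errors[1:] + theta * errors[:-1]


def sample_acf(x, lag):
    """Sample autocorrelation of x at a positive lag."""
    x = x - x.mean()
    return float(x[:-lag] @ x[lag:] / (x @ x))


acf_1 = sample_acf(y, 1)
acf_2 = sample_acf(y, 2)
expected_acf_1 = theta / (1 + theta ** 2)

print(f"ACF at lag 1: {acf_1:.3f} (theory: {expected_acf_1:.3f})")
print(f"ACF at lag 2: {acf_2:.3f} (theory: 0)")
```

This sharp cutoff in the ACF after lag q is why the ACF is commonly used to suggest the moving-average order.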

### Pipeline details:

The decomposition gives a first look at the trend and seasonality in the data. ACF (autocorrelation) and PACF (partial autocorrelation) plots give hints as to the p and q values, where p is the autoregressive order (usually read from the PACF) and q is the moving-average order (usually read from the ACF). The adfuller test indicates whether the data is stationary (p-value < 0.05 and ADF statistic below, i.e. more negative than, the critical value) or needs differencing (p-value >= 0.05).
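The stationarity check can be written as a small decision helper. This is only a sketch: `is_stationary` is a hypothetical function, the 0.05 threshold and the 5% critical level are the conventional choices, and the numbers passed in are made up rather than results from this data set. For the ADF test, rejecting the unit-root null requires the statistic to be more negative than the critical value.

```python
def is_stationary(adf_stat, p_value, critical_values, alpha=0.05, level="5%"):
    """Return True if the ADF result rejects the unit-root null hypothesis.

    critical_values is a dict such as {"1%": ..., "5%": ..., "10%": ...},
    which is the shape of the fifth element returned by statsmodels' adfuller.
    """
    return p_value < alpha and adf_stat < critical_values[level]


# Illustrative numbers only, not results from the solar data
example_crit = {"1%": -3.43, "5%": -2.86, "10%": -2.57}

stationary_case = is_stationary(-5.10, 0.0001, example_crit)
needs_diff_case = is_stationary(-1.20, 0.67, example_crit)
print(stationary_case, needs_diff_case)
```

With the `solar_adfuller` tuple from the cell further down, the call would look like `is_stationary(solar_adfuller[0], solar_adfuller[1], solar_adfuller[4])`.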

The test-train split ratio is 80:20


(p, d, q) represents the non-seasonal part of the model

(P, D, Q, m) represents the seasonal part of the model, where P, D, and Q play the same roles as p, d, and q but act at seasonal lags, and m is the number of observations in one seasonal cycle. The observations here are 1 hour apart, so a daily cycle, for example, would correspond to m = 24.
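To show what the seasonal period m does, here is a sketch of seasonal differencing (D = 1) on a synthetic signal. It assumes hourly observations with a daily cycle, so m = 24; that value, the signal shape, and the noise level are assumptions for illustration and are not taken from this notebook's model.

```python
import numpy as np

m = 24   # assumed season length: 24 hourly observations per daily cycle
t = np.arange(10 * m)
rng = np.random.default_rng(3)

# Synthetic series: a repeating daytime bump plus a little noise
pattern = 100 * np.clip(np.sin(2 * np.pi * t / m), 0, None)
series = pattern + rng.normal(scale=1.0, size=t.size)

# Seasonal differencing (D = 1): y'_t = y_t - y_{t-m}
seasonal_diff = series[m:] - series[:-m]

print("std of original series:    ", round(float(series.std()), 2))
print("std of seasonal difference:", round(float(seasonal_diff.std()), 2))
```

Subtracting the value from one full cycle earlier removes the repeating pattern and leaves mostly noise, which is the seasonal counterpart of ordinary differencing.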

### Notes, Limitations, and Discussion:

My local machine doesn't have enough RAM to run this at the moment, which is why the p, d, q and P, D, Q values are currently blank. They will be filled in once appropriate values are found.

To further my understanding (and out of general curiosity), I plan to run the model as is, without the test-train split, by commenting the split out. While that runs on the distributed computing cluster, I will update this notebook to use the actual test-train split.

Weaknesses: the main weakness of this model will probably come from the amount of data that was thrown away (>90%) and the amount of data that was generated using the Interpolate Refill Sample method. There were ~186,000 values before the resampling and ~334,000 values after, so only ~55% of the total data is original. This could potentially be improved by predicting the gaps from previous data rather than using a resampling method.
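One way to keep track of how much of the series is original is to flag the values before filling the gaps. The sketch below is an assumed stand-in, since the Interpolate Refill Sample step itself is not shown in this notebook: the timestamps and values are made up, and linear interpolation in time is just one possible fill method.

```python
import pandas as pd

# Hypothetical irregular readings (timestamps and values are made up)
raw = pd.Series(
    [0.0, 10.0, 40.0, 20.0],
    index=pd.to_datetime(
        ["2020-01-01 00:00", "2020-01-01 01:00",
         "2020-01-01 04:00", "2020-01-01 06:00"]
    ),
)

# Put the readings on a regular hourly grid; hours without a reading become NaN
regular = raw.resample("1h").mean()
is_original = regular.notna()

# Fill the gaps by linear interpolation in time
filled = regular.interpolate(method="time")

share_original = float(is_original.mean())
print(filled)
print(f"{share_original:.0%} of the resampled values are original")
```

Keeping the `is_original` mask alongside the filled series would make it possible to score the model on original observations only.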

Additional future work could include checking different time intervals and comparing the results, and trying other filters with the same process.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy

In [None]:
solar_cleaned = pd.read_csv('/home/cyrus/Documents/Solar/cleaned_PSP_data.csv')
solar_cleaned['time'] = pd.to_datetime(solar_cleaned.time)

In [None]:
solar_cleaned

In [None]:
time = solar_cleaned['time']
radiance = solar_cleaned['Global PSP [W/m^2]']

In [None]:
plt.plot(time, radiance)
plt.show()

In [None]:
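from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose only the radiance column, indexed by time so the period can be
# inferred from a regular sampling frequency; otherwise pass period=... explicitly.
radiance_indexed = solar_cleaned.set_index('time')['Global PSP [W/m^2]']
decompose_data = seasonal_decompose(radiance_indexed, model="additive")
decompose_data.plot();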

In [None]:
seasonality=decompose_data.seasonal
seasonality.plot(color='red')

In [None]:
from statsmodels.tsa.stattools import adfuller
solar_adfuller = adfuller(radiance, autolag = 'AIC')
print("ADF : ",solar_adfuller[0])
print("P-Value : ", solar_adfuller[1])
print("Crit-Value: ", solar_adfuller[4])

In [None]:
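# Chronological 80:20 split; slicing at a single index keeps every observation
split_idx = int(len(radiance) * 0.8)
train = radiance[:split_idx]
test = radiance[split_idx:]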

In [None]:
from statsmodels.tsa.statespace.sarimax import SARIMAX as sm

In [None]:
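# p, d, q, P, D, Q and M (the seasonal period, m in the text above) are still blank and must be set before running
solar_model = sm(solar_cleaned['Global PSP [W/m^2]'],
                order=(p, d, q),
                seasonal_order=(P, D, Q, M))
predictions = solar_model.fit().predict()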

In [None]:
plt.figure(figsize=(16, 4))  # set the figure size before plotting
plt.plot(predictions, color = 'red', label = 'Predicted')
plt.plot(radiance, color = 'black', label = 'Actual')
plt.title('Prediction of the Solar Output over Time')
plt.legend()
plt.show()