# Model Research 

In this Jupyter notebook, we will share our research on the different models we will be using, namely:  
- SARIMAX  
- LSTM   
- Random Forest  
- XGBoost  

This entails the following points:  
- **Data preprocessing:** how does the data need to be processed for the various models? Does it need to be **scaled** or **standardized**? Do outliers need to be treated in a particular way?  
- **Model specifics:** why are these models suited for working with the kind of data we have (i.e. time series data that exhibits both trends and seasonality and therefore is not stationary)? Which assumptions do these models rely on to perform their analysis? What are their strengths? What are their limitations?

## SARIMAX
### Data Preprocessing  
A generally helpful guide to [SARIMAX](https://www.kaggle.com/code/nholloway/seasonality-and-sarimax).

**Do we need to scale/standardize the data?**  
SARIMAX does not require the data to be scaled or standardized, see [here](https://stats.stackexchange.com/questions/608658/arima-or-sarima-scale-and-normalize-data). The guide cited above does not scale the data either. That said, scaling appears to improve [numerical stability](https://github.com/statsmodels/statsmodels/issues/7382) in some cases. What SARIMAX does require is that **categorical data are one-hot encoded**, but we should check whether the required Python libraries handle this under the hood.  
Furthermore, SARIMAX needs us to **predefine the seasonal frequency**: we need to tell it the number of periods per season (for example, 12 for monthly data with a yearly cycle).

## LSTM
### Data Preprocessing  
**Do we need to scale/standardize the data?**   
Data for LSTM **needs to be scaled**. The reason is that fitting a network on unscaled data with different value ranges may slow down learning and convergence significantly. Worse, it can prevent the network from learning the problem effectively (see [Brownlee](https://machinelearningmastery.com/how-to-scale-data-for-long-short-term-memory-networks-in-python/)).  
We could scale via **normalization** or **standardization**, but with normalization, there is a caveat:  
"Normalization is a rescaling of the data from the original range so that all values are within the range of 0 and 1.
Normalization requires that you know or are able to accurately estimate the minimum and maximum observable values. You may be able to estimate these values from your available data. If your time series is trending up or down, estimating these expected values may be difficult and normalization may not be the best method to use on your problem." (Brownlee)  

Hence, **I would suggest standardization.**
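The caveat from the quote can be illustrated with a small sketch. The trending series and the 80/21 split below are assumptions made up for the example. Both scalers are fitted on the training part only, as we would do to avoid leaking information from the test period. With an upward trend, min-max normalization maps the test values outside of [0, 1], whereas standardization does not promise a fixed range in the first place.

```python
import numpy as np

# Synthetic upward-trending series: 10, 11, ..., 110 (illustration only)
series = np.linspace(10.0, 110.0, 101)
split = 80
train, test = series[:split], series[split:]

# Min-max normalization with min/max estimated from the training part
lo, hi = train.min(), train.max()
test_norm = (test - lo) / (hi - lo)

# Standardization with mean/std estimated from the training part
mu, sigma = train.mean(), train.std()
train_std = (train - mu) / sigma
test_std = (test - mu) / sigma


def inverse_standardize(values, mu, sigma):
    """Map standardized values (e.g. model predictions) back to the original scale."""
    return values * sigma + mu


print("normalized test range:", test_norm.min(), test_norm.max())
print("standardized test range:", test_std.min(), test_std.max())
```

The same `mu` and `sigma` have to be stored so that the LSTM's predictions can be transformed back to the original scale.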

## Random Forest
### Data Preprocessing
**Do we need to scale/standardize the data?**   
Answer from lab 5: "**You do not need to scale variables before using tree-based methods (e.g., decision trees, random forests, gradient boosting methods like XGBoost, LightGBM, and CatBoost)**. Tree-based models are scale-invariant because they split data based on feature thresholds rather than distance-based calculations. We *can* still scale variables if we want **however, doing so can have benefits for numerical stability in cases where values get very big.**"  

--> I suggest we scale the data: although it is not strictly necessary, it does no harm and can enhance numerical stability. We can scale via Z-score standardization or normalization, among other methods. [Ahsan et al.](https://www.mdpi.com/2227-7080/9/3/52) (p. 14) find that normalization performs well with tree-based models like XGBoost, but performance depends on the dataset.
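The scale-invariance claim can be checked with a quick sketch using scikit-learn. The synthetic features, their very different magnitudes, and the forest settings are assumptions for illustration only. Two forests with the same `random_state` are trained, one on the raw features and one on Z-score standardized features, and their predictions are compared. Since standardization preserves the ordering of each feature, we would expect the trees to find equivalent splits.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Three synthetic features on very different scales (illustration only)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 10000.0])
y = X[:, 0] + 0.01 * X[:, 1] + 0.0001 * X[:, 2] + rng.normal(0, 0.1, 200)

# Z-score standardization per feature
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Same hyperparameters and seed, so the only difference is the feature scale
rf_raw = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)
rf_scaled = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_scaled, y)

pred_raw = rf_raw.predict(X)
pred_scaled = rf_scaled.predict(X_scaled)

# Largest absolute difference between the two sets of predictions
max_diff = np.max(np.abs(pred_raw - pred_scaled))
print("max prediction difference:", max_diff)
```

If the difference is negligible, scaling neither helps nor hurts predictive performance here, which is consistent with the lab answer quoted above.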

## XGBoost  
### Data Preprocessing  
**Do we need to scale/standardize the data?**   
XGBoost, like Random Forest, is a tree-based method. Hence, the same reasoning as for Random Forest applies:  
--> I suggest we scale: although it is not strictly necessary, it does no harm and can enhance numerical stability. [Ahsan et al.](https://www.mdpi.com/2227-7080/9/3/52) (p. 14) find that normalization performs well with tree-based models like XGBoost, but performance depends on the dataset.