# Workshop 4.2 Time_Series_Forecasting_ML

When training any supervised learning model, it is important to split the data into training and test data. The training data is used to fit the model. The algorithm uses the training data to learn the relationship between the features and the target. The test data is used to evaluate the performance of the model.
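For a time series, the split is usually chronological: the oldest observations form the training set and the most recent ones form the test set, so the model is always evaluated on data that comes after what it was trained on. A minimal sketch on a toy array (the values and the hold-out size `n_test` are assumptions for illustration, not the workshop dataset):

```python
import numpy as np

values = np.arange(20)   # toy series, oldest observation first
n_test = 5               # hold out the 5 most recent observations

# Slice by position instead of shuffling, so the time order is preserved
train, test = values[:-n_test], values[-n_test:]
print(train.shape, test.shape)
```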


To fit and train a model, we’ll follow the machine learning workflow:

- Feature engineering
- Split the data
- Train the model
- Hyperparameter tuning
- Assess model performance



To understand how a random forest works, we first need decision trees. A decision tree is a supervised machine learning algorithm that can be used for both classification and regression problems. A random forest combines many such trees.


Ensemble learning is the process of training multiple models on the same data and combining their results, for example by averaging them, to obtain a more powerful predictor or classifier than any single model.
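The averaging idea can be checked directly on a fitted forest. The sketch below uses synthetic data (an assumption for illustration, not the workshop dataset) and compares the forest's prediction with the mean of the predictions of its individual trees, which scikit-learn exposes through the `estimators_` attribute:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only to illustrate the averaging step
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] * 2.0 + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Predictions of every individual tree for the first five rows
tree_preds = np.array([tree.predict(X[:5]) for tree in forest.estimators_])

# Average over the trees and compare with the forest's own prediction
manual_average = tree_preds.mean(axis=0)
forest_pred = forest.predict(X[:5])
print(np.allclose(manual_average, forest_pred))
```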

We will use the sklearn module for training our random forest regression model, specifically the RandomForestRegressor class.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import datetime


df=pd.DataFrame()
%matplotlib inline

In [None]:
df = pd.read_csv('Alcohol_Sales.csv',index_col='DATE',parse_dates=True)
df.index.freq = 'MS'

In [None]:
df

In [None]:
df.tail()

In [None]:
df.describe()

In [None]:
# Visualise data:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("Alcohol_Sales.csv")
df.columns = ["date", "sales"]
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")

df_train = df.iloc[:-48]
df_test = df.iloc[-48:]

plt.figure(figsize = (18,7))
plt.plot(df_train, label="Training data")  # plot only the training portion
plt.plot(df_test, label = "Test data")
plt.grid(alpha=0.5)
plt.margins(x=0)
plt.title("Alcohol Sales")
plt.legend()  # show the labels defined above

In [None]:
df.columns = ['Sales']
plt.figure()
df.plot(figsize=(12,8))

In [None]:
df['Sale_LastMonth']=df['Sales'].shift(+1)
df['Sale_2Monthsback']=df['Sales'].shift(+2)
df['Sale_3Monthsback']=df['Sales'].shift(+3)
df

In [None]:
df=df.dropna()
df
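To see what the lag features and `dropna` do, here is a minimal sketch on a five-row toy series (the numbers are made up for illustration). `shift(1)` moves each value down by one row, so every row sees the previous month's sales. The first rows have no earlier value and become NaN, which is why they are dropped:

```python
import pandas as pd

# Toy monthly series; the values are illustrative only
sales = pd.DataFrame(
    {"Sales": [10, 12, 15, 11, 14]},
    index=pd.date_range("2020-01-01", periods=5, freq="MS"),
)

# Each lag column holds the value from 1 or 2 months earlier
sales["Sale_LastMonth"] = sales["Sales"].shift(1)
sales["Sale_2Monthsback"] = sales["Sales"].shift(2)

# The first two rows contain NaN in at least one lag column and are removed
lagged = sales.dropna()
print(lagged)
```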

In [None]:
import seaborn as sns


sns.pairplot(df)

In [None]:
# Import shift method (if needed to look at data)
from scipy.ndimage import shift
import matplotlib.pyplot as plt

In [None]:
import numpy as np
x1,x2,x3,y=df['Sale_LastMonth'],df['Sale_2Monthsback'],df['Sale_3Monthsback'],df['Sales']
x1,x2,x3,y=np.array(x1),np.array(x2),np.array(x3),np.array(y)
x1,x2,x3,y=x1.reshape(-1,1),x2.reshape(-1,1),x3.reshape(-1,1),y.reshape(-1,1)
final_x=np.concatenate((x1,x2,x3),axis=1)
print(x1.shape)
final_x.shape

Let's understand the output:

In [None]:
final_x[:3]

In [None]:
plt.plot(final_x[:100])

# Choose the model

In [None]:
from sklearn.linear_model import LinearRegression
lin_model=LinearRegression()

In [None]:
from sklearn.ensemble import RandomForestRegressor
model=RandomForestRegressor(n_estimators=100,max_features=3, random_state=18)


# Splitting the Data

In [None]:
#from sklearn.model_selection import train_test_split


In [None]:
X_train,X_test,y_train,y_test=final_x[:-30],final_x[-30:],y[:-30],y[-30:]

In [None]:
print(X_test.shape, y_test.shape)

In [None]:
X_test[1,1]  #got it!

In [None]:
y_test[:]

# Train the model, fitting 

We created an instance of the Random Forest model above, with n_estimators=100, max_features=3 and random_state=18. We now fit it to our training data, passing both the features and the target variable so the model can learn the relationship between them.

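Fitting the RandomForestRegressor and LinearRegression algorithms gives us two trained models, m1 and m2. Note that fit returns the estimator itself, so m1 and model refer to the same object (likewise m2 and lin_model).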

In [None]:
m1=model.fit(X_train,y_train.ravel())
m2=lin_model.fit(X_train,y_train.ravel())

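# y_train has shape (n, 1); ravel() flattens it to the 1-D shape (n,) that fit expects
# (the simpler option m1=model.fit(X_train,y_train) triggers a DataConversionWarning)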

At this point, we have a trained Random Forest **model**, but we need to find out whether it is making accurate predictions.

In [None]:
pred=model.predict(X_test)
plt.rcParams["figure.figsize"] = (12,8)


plt.plot(pred,label='Random_Forest_Predictions')
plt.plot(y_test,label='Actual Sales')
plt.legend(loc="upper left")
plt.show()

Understand the output, the data format

In [None]:
print(pred.shape)

In [None]:
pred[1]

At this point, we have also trained the **Linear Regression model**, but we need to find out whether it is making accurate predictions.

In [None]:
linpred=lin_model.predict(X_test)
#linpredshifted = shift(linpred, 1, cval=np.NaN)

plt.rcParams["figure.figsize"] = (11,6)
plt.plot(linpred,label='Linear_Regression_Predictions')
plt.plot(y_test,label='Actual Sales')
plt.legend(loc="upper left")
#plt.ylim((0,15000))
plt.show()

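Visually, both models seem to follow the actual sales reasonably well, but a plot alone does not tell us how large the errors are.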

##  Statistics and Accuracy of the model

Accuracy, where we count how many predictions exactly match the actual values, is a metric for classification models. Our models predict a continuous value (sales), so we instead measure how far the predictions are from the actual values in the test set.

When viewing the performance metrics of a regression model, we can use measures such as mean squared error (MSE), root mean squared error (RMSE), $R^2$, adjusted $R^2$, and others. Here we focus on MSE and RMSE.

Note: the mean squared error (MSE) is the average of the squared differences between the actual and the predicted values.

*Our goal is to reduce the MSE as much as possible.*

For example, if we have an actual output array of (3,5,7,9) and a predicted output of (4,5,7,7), then we could calculate the mean squared error as:
$((3-4)^2 + (5-5)^2 + (7-7)^2 + (9-7)^2)/4 = (1+0+0+4)/4 = 5/4 = 1.25$


The root mean squared error (RMSE) is simply the square root of the MSE, so in this case $RMSE = \sqrt{1.25} \approx 1.12$.
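The same arithmetic can be reproduced with NumPy. This is a small sketch of the worked example above, using the same two arrays:

```python
import numpy as np

actual = np.array([3, 5, 7, 9])
predicted = np.array([4, 5, 7, 7])

# MSE: mean of the squared differences
mse = np.mean((actual - predicted) ** 2)

# RMSE: square root of the MSE, in the same units as the target
rmse = np.sqrt(mse)
print(mse, round(rmse, 2))
```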



In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt
# mean_squared_error expects (y_true, y_pred); the square root gives the RMSE
rmse_rf=sqrt(mean_squared_error(y_test,pred))
rmse_lr=sqrt(mean_squared_error(y_test,linpred))
print('Root Mean Squared Error for Random Forest Model is:',rmse_rf)
print('Root Mean Squared Error for Linear Regression Model is:',rmse_lr)

In [None]:
# For regressors, score() returns the coefficient of determination R^2, not an accuracy
print("R^2 Score for Random Forest Model:", m1.score(X_test, y_test))
print("R^2 Score for Linear Regression Model:", m2.score(X_test, y_test))
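For regression models, scikit-learn's `score` method returns the coefficient of determination $R^2$, defined as one minus the ratio of the residual sum of squares to the total sum of squares. A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean. The sketch below computes it by hand on the small example arrays used earlier (not the workshop data) and compares it with `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([4.0, 5.0, 7.0, 7.0])

# Residual sum of squares: squared errors of the model
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: squared deviations from the mean of the actual values
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```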

In [None]:
# residual plot: differences between the test values and the predicted values

plt.plot(np.ravel(y_test)-pred, marker='D',c='green', alpha=0.35)

## More about the parameters of the Random Forest algorithm

The RandomForestRegressor documentation shows many different parameters we can select for our model. Some of the important parameters are highlighted below:

- **n_estimators**: the number of decision trees in the forest
- **criterion**: the function used to measure the quality of a split. Options include squared error (MSE) and absolute error (MAE). The default is squared error (named "squared_error" in recent scikit-learn versions and "mse" in older ones).
- **max_depth**: the maximum possible depth of each tree
- **max_features**: the maximum number of features the model considers when looking for the best split
- **bootstrap**: the default value is True, meaning each tree is trained on a bootstrap sample, that is, a random sample drawn with replacement from the training data
- **max_samples**: only applies when bootstrap is True. It sets the number of samples drawn from the training data to train each tree.

- **Other** important parameters include min_samples_split, min_samples_leaf, and n_jobs; see the sklearn RandomForestRegressor documentation.

Example:

rf = RandomForestRegressor(n_estimators = 300, max_features = 'sqrt', max_depth = 5, random_state = 18).fit(X_train, y_train.ravel())

What does it mean?

Looking at the example above, we are using 300 trees; max_features is the square root of the number of features in our training dataset, so each split considers at most that many features. The max depth of each tree is set to 5. Lastly, random_state is set to 18 so that the results are reproducible.

In [None]:
#model=RandomForestRegressor(n_estimators=100,max_features=3, random_state=1)


# 100 trees
# the random_state is set to 1 just to keep the results reproducible.



We can now compute the RMSE of both models:

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt
# mean_squared_error expects (y_true, y_pred); linpred holds the linear regression predictions
rmse_rf=sqrt(mean_squared_error(y_test,pred))
rmse_lr=sqrt(mean_squared_error(y_test,linpred))
print('RMSE for Random Forest Model is:',rmse_rf)
print('RMSE for Linear Regression Model is:',rmse_lr)

Our results from this basic random forest model weren’t that great overall. The RMSE value of 1913 is pretty high given most values of our dataset are between 10000–14000. Looking ahead, we will see if tuning helps create a better performing model.

In [None]:
np.ravel(y_test)

In [None]:
pred

Cons:

One thing to consider when running random forest models on a large dataset is the potentially long training time. For example, the time required to run this first basic model was about 30 seconds, which isn’t too bad, but as I’ll demonstrate shortly, this time requirement can increase quickly.

# Next Step: Using GridSearchCV to optimize your Machine Learning model
    
Now that we have fitted our basic random forest regression, we will look for a better-performing choice of parameters using the GridSearchCV class from sklearn.

   

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
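# confusion_matrix only applies to classification targets; it raises a ValueError for these continuous predictions
# print(confusion_matrix(y_test, pred))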

In [None]:
## Define Grid 
grid1 = { 
    'n_estimators': [200,300],
    'max_features': ['sqrt','log2'],
    'max_depth' : [3,4,5,6,7],
    'random_state' : [18]
}

I included two print statements that display the current datetime; this way we can track the start and end times of the function to measure the runtime.

In [None]:
from sklearn.model_selection import GridSearchCV

## show start time
print(datetime.datetime.now())
## Grid Search function
CV_rfr = GridSearchCV(estimator=RandomForestRegressor(), param_grid=grid1, cv= 5)
CV_rfr.fit(X_train, y_train.ravel())  # ravel() gives the 1-D target shape that fit expects
## show end time
print(datetime.datetime.now())
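Once a grid search has been fitted, the selected parameters and the refitted model can be read from the `best_params_`, `best_score_` and `best_estimator_` attributes. The sketch below shows this on synthetic data with a deliberately small grid so that it runs quickly. The data, the grid values and the names `X_demo`, `y_demo`, `small_grid` and `search` are assumptions for illustration, not the workshop setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, used only for illustration
rng = np.random.default_rng(18)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([1.0, 0.5, -0.25]) + rng.normal(scale=0.1, size=120)

# A small grid: 2 x 2 = 4 parameter combinations, each evaluated with 3-fold CV
small_grid = {"n_estimators": [20, 40], "max_depth": [3, 5]}
search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=18),
    param_grid=small_grid,
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)           # parameter combination with the best mean CV score
print(search.best_score_)            # mean cross-validated R^2 of that combination
tuned_model = search.best_estimator_ # model refitted on all the data with those parameters
```

The number of fits grows with the product of the grid sizes times the number of folds, which is why larger grids quickly increase the runtime.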

# More Evaluation Metrics
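A confusion matrix applies to classification models; for the regression models in this workshop, error metrics such as RMSE and $R^2$ are the appropriate choice.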

# More on Data and models to predict sales

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

In [None]:
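# decompose only the Sales column (df also holds the lag features); period=12 for monthly data
decompose=seasonal_decompose(df['Sales'], period=12)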
decompose.plot();

In [None]:
# run this cell first if statsmodels is not installed
%pip install statsmodels

In [None]:
from statsmodels.tsa.holtwinters import ExponentialSmoothing

In [None]:
from statsmodels.tsa.stattools import adfuller
# fit on the Sales column only, since df also contains the lag feature columns
tesa=ExponentialSmoothing(df['Sales'],trend='add',seasonal='add',seasonal_periods=12).fit().fittedvalues.rename('triple-expo add')
tesm=ExponentialSmoothing(df['Sales'],trend='mul',seasonal='mul',seasonal_periods=12).fit().fittedvalues.rename('triple-expo mul')

In [None]:
df['Sales'].plot(figsize=(10,7),legend=True)
tesa.plot(legend=True)
tesm.plot(legend=True)

In [None]:
#let's zoom in:
df['Sales'].iloc[:24].plot(figsize=(10,7),legend=True)
tesa[:24].plot(legend=True)
tesm[:24].plot(legend=True)

In [None]:
from sklearn.metrics import r2_score
# r2_score returns R^2 (not RMSE); compare against the Sales column only
print('R^2 tesa:',r2_score(df['Sales'],tesa))
print('R^2 tesm:',r2_score(df['Sales'],tesm))

For curious people:
    
    https://www.kaggle.com/code/jshivam101998/alcohol-sales-csv-file-forecasting-for-next-3-year