# Machine Learning with Python

## 4.2 Forecasting

With time series data, we often want to predict values at future time points. This kind of regression task is called *forecasting*.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
sun = pd.read_csv('sunspots.csv')

In [None]:
sun.head()

In [None]:
spots = sun['Monthly_Spots']
plt.figure(figsize = (20, 6))
plt.plot(np.arange(0,len(spots)),spots)
plt.show()

Let's try to predict the current number of sunspots using the ten previous months' values as features - this is an *autoregressive model*.
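
Written out, and assuming a linear model such as the ridge regression fitted later in this section, the prediction for month $t$ takes roughly the form below. The symbols $y_t$ (sunspot count in month $t$), $\beta_i$ (model coefficients) and $\varepsilon_t$ (error term) are notation introduced here for illustration only.

```latex
y_t = \beta_0 + \sum_{i=1}^{10} \beta_i \, y_{t-i} + \varepsilon_t
```

Each $y_{t-i}$ corresponds to one `lag_i` column built in the next cell.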

In [None]:
shifts = np.arange(1,11)
ten_shifts = {'lag_{}'.format(i): spots.shift(i) for i in shifts}
ten_shifts = pd.DataFrame(ten_shifts)
ten_shifts

We skip the first ten rows, which have missing lag values, and use the remaining rows up to month 2000 as training data and the rest as testing data.
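
To see why the first rows contain missing values, here is a minimal sketch of `shift` on a toy series. The values and the names `toy` and `toy_lags` are made up for illustration and are not part of the sunspot data.

```python
import pandas as pd

# Toy series standing in for the monthly sunspot counts (values are made up).
toy = pd.Series([10, 20, 30, 40, 50])

# Build two lag features the same way as above: shift(i) moves values down by i rows.
toy_lags = pd.DataFrame({'lag_{}'.format(i): toy.shift(i) for i in (1, 2)})
print(toy_lags)

# Row k of lag_i holds the value from row k - i, so the first i rows have no data.
print(toy_lags.isna().sum())
```

With ten lags, the first ten rows each have at least one missing feature, which is why the training slice starts at row 10.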

In [None]:
X_train = ten_shifts[10:2000]
X_test = ten_shifts[2000:]

y_train = spots[10:2000]
y_test = spots[2000:]


In [None]:
from sklearn.ensemble import RandomForestRegressor

# Fit a random forest regressor on the lag features (not used for the predictions below)
rfr = RandomForestRegressor()
rfr.fit(X_train,y_train)


In [None]:
from sklearn.linear_model import Ridge

lm = Ridge(alpha=0.01)
lm.fit(X_train,y_train)

In [None]:
y_pred = lm.predict(X_test)

In [None]:

plt.figure(figsize = (20, 6))
plt.plot(np.arange(0,len(y_test)),y_test,label='observed')
plt.plot(np.arange(0,len(y_pred)),y_pred,label='predicted')
plt.legend()
plt.show()

Given the true values for the ten previous months, the predictions match the observed values very well.

In [None]:
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)
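
A high r2 is easier to judge against a naive reference. One common choice, which is not part of the analysis above, is a *persistence* baseline that predicts each month as equal to the previous one. The sketch below shows the idea on a synthetic series; the series and the names `y_true`, `y_naive` and `baseline_r2` are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Synthetic noisy sine wave, used only for illustration.
rng = np.random.default_rng(0)
t = np.arange(500)
series = 50 + 40 * np.sin(2 * np.pi * t / 120) + rng.normal(0, 5, size=t.size)

# Persistence baseline: predict each value as the one immediately before it.
y_true = series[1:]
y_naive = series[:-1]

baseline_r2 = r2_score(y_true, y_naive)
print("persistence baseline r2:", baseline_r2)
```

On the sunspot data, the equivalent comparison would be `r2_score(y_test, X_test['lag_1'])`. If the fitted model does not clearly beat this number, the extra lags are adding little.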

### Interpreting the model

The model coefficients can show us information about the smoothness of the signal.

In [None]:
plt.bar(X_train.columns,lm.coef_)
plt.show()

Clearly the month immediately before has by far the strongest contribution to the prediction - after this, the coefficients are considerably smaller.

If the signal were smoother then there would be more predictive value in previous months' data.

### What about cross-validation?

Cross-validation is possible for forecasting, but we need to be careful. If we randomly shuffle rows, information will leak between the training and testing folds, which often leads to an unrealistically high estimate of performance.

In general, we should only use *past* information to predict the future. There is a specific iterator for time series that deals with this properly: [TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html#timeseriessplit).

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import TimeSeriesSplit

# a 5-fold time series cross-validation (TimeSeriesSplit defaults to n_splits=5), scored using r2
score = cross_val_score( lm,
                         X_train,
                         y_train,
                         cv=TimeSeriesSplit(),
                         scoring='r2')
print("Cross-validated r2:")
print(score)
print("mean:", np.mean(score))
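
To see how `TimeSeriesSplit` keeps the folds in time order, the sketch below prints the indices it generates for a small placeholder array. The array `X_toy` and the choice of `n_splits=3` are assumptions made to keep the output short.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered samples; the values are placeholders.
X_toy = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X_toy))
for fold, (train_idx, test_idx) in enumerate(splits):
    print("fold", fold, "train:", train_idx, "test:", test_idx)
```

In every fold the training indices come strictly before the testing indices, and the training set grows from one fold to the next.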

### Rolling predictions

It is often interesting to see how far into the future our model can successfully predict. To do this, we can replace the *actual* lag features with *predicted* ones.

In [None]:
current_features = X_test.iloc[0]
current_features

In [None]:
current_pred = lm.predict(current_features.to_frame().T)
current_pred

In [None]:
# shift every lag one month older, then put the new prediction into lag_1
next_features = current_features.shift()
next_features.iloc[0] = current_pred[0]
next_features

In [None]:
predictions = []
current_features = X_test.iloc[0]
for i in range(len(X_test)):
    # predict one month ahead from the current lag features
    current_pred = lm.predict(current_features.to_frame().T)
    predictions.append(current_pred[0])
    # shift the lags by one month and feed the prediction back in as lag_1
    next_features = current_features.shift()
    next_features.iloc[0] = current_pred[0]
    current_features = next_features



In [None]:
plt.figure(figsize = (20, 6))
plt.plot(np.arange(0,len(y_test)),y_test,label='observed')
plt.plot(np.arange(0,len(predictions)),predictions,label='rolling prediction')
plt.legend()
plt.show()

Clearly, once it is fed its own predictions, our model cannot forecast reliably more than a few months ahead.
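
For experimenting with different numbers of lags, the rolling loop can be wrapped in a function. The following is a minimal sketch on a synthetic series, not on the sunspot data; the function name `rolling_forecast`, the noisy sine wave and the 300-sample training cut-off are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge


def rolling_forecast(model, start_lags, n_steps):
    """Recursive forecast; start_lags[0] is the most recent value (lag 1)."""
    lags = np.asarray(start_lags, dtype=float).copy()
    preds = []
    for _ in range(n_steps):
        pred = model.predict(lags.reshape(1, -1))[0]
        preds.append(pred)
        # Shift every lag one step older and put the new prediction at lag 1.
        lags = np.roll(lags, 1)
        lags[0] = pred
    return np.array(preds)


# Synthetic noisy sine wave, for illustration only.
rng = np.random.default_rng(0)
t = np.arange(400)
series = np.sin(2 * np.pi * t / 50) + rng.normal(0, 0.1, size=t.size)

n_lags = 10
# Column i holds the value i + 1 steps before the target.
X = np.column_stack(
    [series[n_lags - i - 1:len(series) - i - 1] for i in range(n_lags)]
)
y = series[n_lags:]

# Train on the first 300 rows, then forecast 50 steps from the next row's lags.
model = Ridge(alpha=0.01).fit(X[:300], y[:300])
forecast = rolling_forecast(model, X[300], n_steps=50)
print(forecast[:5])
```

Only the first forecast step uses real observations. Every later step depends on earlier predictions, so errors can accumulate as the horizon grows.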

### Exercise

Explore using different lag values to supply forecasting features for this dataset. Can we predict further into the future?

Would a sliding window help in producing smoother predictions?