# How to Forecast a Time Series with Python

Wouldn't it be nice to know the future? This notebook accompanies the blog post on Medium. Please check the blog for visualizations and explanations; this notebook is really just for the code :)


## Processing the Data

Let's explore the industrial production of electric and gas utilities in the United States from 1985 to 2018, measured as monthly production output.

You can access this data here: https://fred.stlouisfed.org/series/IPG2211A2N

This data measures the real output of all relevant establishments located in the United States, regardless of their ownership, but not those located in U.S. territories.

In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("Electric_Production.csv",index_col=0)
# data = pd.read_csv("Taste_Garden_Updated.csv",index_col=0)
data.head()

Unnamed: 0_level_0,IPG2211A2N
DATE,Unnamed: 1_level_1
1985-01-01,72.5052
1985-02-01,70.672
1985-03-01,62.4502
1985-04-01,57.4714
1985-05-01,55.3151


Right now our index is just a list of strings that look like dates. We'll want to convert these to timestamps so that our forecasting analysis can interpret them:

In [2]:
data.index

Index(['2017-12-09 17:17:00', '2017-12-09 17:27:00', '2017-12-09 17:37:00',
       '2017-12-09 17:47:00', '2017-12-09 17:57:00', '2017-12-09 18:07:00',
       '2017-12-09 18:17:00', '2017-12-09 18:27:00', '2017-12-09 18:37:00',
       '2017-12-09 18:47:00',
       ...
       '2018-09-13 12:58:00', '2018-09-13 13:08:00', '2018-09-13 13:18:00',
       '2018-09-13 13:27:00', '2018-09-13 13:37:00', '2018-09-13 13:47:00',
       '2018-09-13 13:57:00', '2018-09-13 14:07:00', '2018-09-13 14:17:00',
       '2018-09-13 14:27:00'],
      dtype='object', name='Date', length=39340)

In [3]:
data.index = pd.to_datetime(data.index)

In [4]:
data.head()

Unnamed: 0_level_0,Value
Date,Unnamed: 1_level_1
2017-12-09 17:17:00,11.6
2017-12-09 17:27:00,11.1
2017-12-09 17:37:00,12.8
2017-12-09 17:47:00,11.1
2017-12-09 17:57:00,11.5


In [5]:
data.index

DatetimeIndex(['2017-12-09 17:17:00', '2017-12-09 17:27:00',
               '2017-12-09 17:37:00', '2017-12-09 17:47:00',
               '2017-12-09 17:57:00', '2017-12-09 18:07:00',
               '2017-12-09 18:17:00', '2017-12-09 18:27:00',
               '2017-12-09 18:37:00', '2017-12-09 18:47:00',
               ...
               '2018-09-13 12:58:00', '2018-09-13 13:08:00',
               '2018-09-13 13:18:00', '2018-09-13 13:27:00',
               '2018-09-13 13:37:00', '2018-09-13 13:47:00',
               '2018-09-13 13:57:00', '2018-09-13 14:07:00',
               '2018-09-13 14:17:00', '2018-09-13 14:27:00'],
              dtype='datetime64[ns]', name='Date', length=39340, freq=None)
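As an alternative to converting the index after loading, pandas can attempt the conversion while reading the file. The sketch below uses a few made-up rows in the same layout as the output above (it does not read the real CSV), so the `csv_text` and `sample` names are illustrative assumptions.

```python
import io
import pandas as pd

# Made-up sample rows in the same layout as the data above (not the real file).
csv_text = """Date,Value
2017-12-09 17:17:00,11.6
2017-12-09 17:27:00,11.1
2017-12-09 17:37:00,12.8
"""

# parse_dates=True asks pandas to try to parse the index column as dates.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(sample.index).__name__)
print(sample.index.dtype)
```

Either route should end with a `DatetimeIndex`; doing it in `read_csv` just saves a separate step.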

Let's first make sure that the data doesn't have any missing data points:

In [6]:
data[pd.isnull(data['Value'])]

Unnamed: 0_level_0,Value
Date,Unnamed: 1_level_1
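An empty result here means no value is null, but it says nothing about whether the timestamps are evenly spaced. A minimal sketch of both checks follows, using a small made-up frame (the `sample` data is an assumption for illustration, with one missing value and one irregular gap planted in it).

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value and one irregular gap.
index = pd.to_datetime([
    "2018-09-13 12:58:00", "2018-09-13 13:08:00",
    "2018-09-13 13:18:00", "2018-09-13 13:27:00",
])
sample = pd.DataFrame({"Value": [15.8, np.nan, 14.9, 14.4]}, index=index)

# Number of missing values per column.
missing_counts = sample.isnull().sum()

# Time between consecutive observations; more than one distinct value
# means the series is not evenly spaced.
spacing = sample.index.to_series().diff().dropna()

print(missing_counts)
print(spacing.value_counts())
```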


Let's also rename this column, since it's hard to remember what the "IPG2211A2N" code stands for:

In [7]:
data.columns = ['Energy Production']

In [8]:
data.head()

Unnamed: 0_level_0,Energy Production
Date,Unnamed: 1_level_1
2017-12-09 17:17:00,11.6
2017-12-09 17:27:00,11.1
2017-12-09 17:37:00,12.8
2017-12-09 17:47:00,11.1
2017-12-09 17:57:00,11.5


In [9]:
import plotly
# plotly.tools.set_credentials_file()

In [10]:
from plotly.plotly import plot_mpl
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(data, model='multiplicative')
fig = result.plot()

#plot_mpl(fig)

ValueError: You must specify a freq or x must be a pandas object with a timeseries index with a freq not set to None
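The error says the index carries no frequency (note the `freq=None` in the `DatetimeIndex` output earlier). One way to give it one, sketched here on made-up readings, is to resample onto a fixed grid. The 10-minute bin size and the use of interpolation for empty bins are assumptions chosen for this illustration, not something the notebook prescribes.

```python
import pandas as pd

# Made-up readings with slightly irregular timestamps, as in the index above.
index = pd.to_datetime([
    "2018-09-13 12:58:00", "2018-09-13 13:08:00",
    "2018-09-13 13:18:00", "2018-09-13 13:27:00",
    "2018-09-13 13:47:00",
])
sample = pd.DataFrame({"Value": [15.8, 17.5, 18.8, 14.9, 14.4]}, index=index)

# Frequency of the raw index.
print(sample.index.freq)

# Average into fixed 10-minute bins, then fill empty bins by interpolation.
regular = sample.resample("10min").mean().interpolate()

# Frequency of the resampled index.
print(regular.index.freq)
```

Even with a frequency on the index, the decomposition needs to know how many observations make up one seasonal cycle. Recent statsmodels versions accept this as the `period` argument of `seasonal_decompose` (older versions called it `freq`). For 10-minute data with a daily cycle that would be 144, which is an assumption about the data rather than something established here.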

In [None]:
import plotly.plotly as ply
import cufflinks as cf
# Check the docs on setting up offline plotting

In [None]:
plt.plot(data)
#data.iplot(title="Energy Production Jan 1985--Jan 2018", theme='pearl')

In [None]:
# The pyramid-arima package was renamed to pmdarima
from pmdarima import auto_arima

The AIC (Akaike Information Criterion) measures how well a model fits the data while taking into account the overall complexity of the model. A model that fits the data very well while using lots of features will be assigned a larger AIC score than a model that uses fewer features to achieve the same goodness-of-fit. Therefore, we are interested in finding the model that yields the lowest AIC value.
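To make the trade-off concrete, the standard definition is AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The sketch below applies it to two hypothetical models; the `aic` helper and the numbers are made up for illustration and are not output from this notebook.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2*ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical numbers: both models reach the same log-likelihood,
# but the second one needs twice as many parameters.
small_model = aic(log_likelihood=-100.0, n_params=3)
large_model = aic(log_likelihood=-100.0, n_params=6)

print(small_model, large_model)
```

With an identical fit, the model with more parameters gets the higher (worse) score, which is the penalty described above.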

In [None]:
stepwise_model = auto_arima(data, start_p=1, start_q=1,
                           max_p=3, max_q=3, m=12,
                           start_P=0, seasonal=True,
                           d=1, D=1, trace=True,
                           error_action='ignore',  
                           suppress_warnings=True, 
                           stepwise=True) 

In [None]:
stepwise_model.aic()

## Train Test Split

In [10]:
data.head()

Unnamed: 0_level_0,Value
Date,Unnamed: 1_level_1
2017-12-09 17:17:00,11.6
2017-12-09 17:27:00,11.1
2017-12-09 17:37:00,12.8
2017-12-09 17:47:00,11.1
2017-12-09 17:57:00,11.5


In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 39340 entries, 2017-12-09 17:17:00 to 2018-09-13 14:27:00
Data columns (total 1 columns):
Value    39340 non-null float64
dtypes: float64(1)
memory usage: 1.9 MB


We'll train on the data from 1985 through 2016, test our forecast on the period after that, and compare it to the real data.

In [12]:
train = data.loc['1985-01-01':'2016-12-01']

In [13]:
train.tail()

Unnamed: 0_level_0,Value
Date,Unnamed: 1_level_1


In [14]:
# Start after the training window ends (2016-12-01) so the two sets don't overlap
test = data.loc['2017-01-01':]

In [15]:
test.head()

Unnamed: 0_level_0,Value
Date,Unnamed: 1_level_1
2017-12-09 17:17:00,11.6
2017-12-09 17:27:00,11.1
2017-12-09 17:37:00,12.8
2017-12-09 17:47:00,11.1
2017-12-09 17:57:00,11.5


In [16]:
test.tail()

Unnamed: 0_level_0,Value
Date,Unnamed: 1_level_1
2018-09-13 13:47:00,18.8
2018-09-13 13:57:00,17.5
2018-09-13 14:07:00,15.8
2018-09-13 14:17:00,14.9
2018-09-13 14:27:00,14.4


In [17]:
len(test)

39340
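Here `train.tail()` came back empty and `test` has the same length as the full dataset, which suggests the hard-coded date bounds do not match the dates actually present in the loaded index. A sketch that avoids hard-coded dates is to split by position instead. The 80/20 ratio and the made-up `sample` frame below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up series on a 10-minute grid standing in for `data`.
index = pd.date_range("2017-12-09 17:20", periods=100, freq="10min")
sample = pd.DataFrame({"Value": np.linspace(10.0, 20.0, 100)}, index=index)

# Keep the first 80% of rows for training and the rest for testing.
# The 80/20 ratio is an arbitrary choice for this sketch.
split_at = int(len(sample) * 0.8)
train_sample = sample.iloc[:split_at]
test_sample = sample.iloc[split_at:]

print(len(train_sample), len(test_sample))
# True when every training timestamp precedes every test timestamp.
print(train_sample.index.max() < test_sample.index.min())
```

Because the rows are in time order, slicing by position keeps the split chronological and guarantees neither part is empty.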

In [18]:
stepwise_model.fit(train)

NameError: name 'stepwise_model' is not defined

In [19]:
# Forecast one value per test row so the result lines up with test.index below
future_forecast = stepwise_model.predict(n_periods=len(test))

NameError: name 'stepwise_model' is not defined

In [20]:
future_forecast

NameError: name 'future_forecast' is not defined

In [21]:
future_forecast = pd.DataFrame(future_forecast,index = test.index,columns=['Prediction'])

NameError: name 'future_forecast' is not defined
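Wrapping the forecast in a DataFrame indexed by `test.index` only works when the forecast has exactly one value per test row. The sketch below shows that alignment and a simple side-by-side comparison. The forecast numbers are made up, and the mean absolute error is just one assumed way to summarize the comparison.

```python
import numpy as np
import pandas as pd

# Made-up test set and forecast values; a real forecast would come from the model.
test_index = pd.date_range("2018-09-13 13:00", periods=5, freq="10min")
test_sample = pd.DataFrame({"Value": [18.8, 17.5, 15.8, 14.9, 14.4]}, index=test_index)
forecast_values = np.array([18.0, 17.0, 16.0, 15.0, 14.0])

# The array must have exactly one value per row of the test set,
# otherwise pandas cannot attach the test index to it.
assert len(forecast_values) == len(test_sample)
forecast_sample = pd.DataFrame(
    forecast_values, index=test_sample.index, columns=["Prediction"]
)

# Put actual and predicted values side by side and summarize the gap.
comparison = pd.concat([test_sample, forecast_sample], axis=1)
mean_abs_error = (comparison["Value"] - comparison["Prediction"]).abs().mean()

print(comparison)
print(mean_abs_error)
```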

In [22]:
future_forecast.head()

NameError: name 'future_forecast' is not defined

In [23]:
test.head()

Unnamed: 0_level_0,Value
Date,Unnamed: 1_level_1
2017-12-09 17:17:00,11.6
2017-12-09 17:27:00,11.1
2017-12-09 17:37:00,12.8
2017-12-09 17:47:00,11.1
2017-12-09 17:57:00,11.5


In [24]:
plt.plot(future_forecast)
#pd.concat([test,future_forecast],axis=1).iplot()

NameError: name 'future_forecast' is not defined

In [25]:
future_forecast2 = future_forecast

NameError: name 'future_forecast' is not defined

In [26]:
plt.plot(future_forecast2)
#pd.concat([data,future_forecast2],axis=1).iplot()

NameError: name 'future_forecast2' is not defined