<a href="https://colab.research.google.com/github/cagBRT/timeSeries/blob/main/3_DifferencingATimeSeries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Differencing a Time Series**

In [None]:
# Clone the entire repo.
!git clone -s https://github.com/cagBRT/timeSeries.git cloned-repo
%cd cloned-repo

https://machinelearningmastery.com/difference-time-series-dataset-python/

**A stationary time series** is one whose properties do not depend on the time at which the series is observed.  <br>
Time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times.

In the most intuitive sense, stationarity means that the statistical properties of a process generating a time series do not change over time. **It does not mean that the series does not change over time, just that the way it changes does not itself change over time**. 

An algebraic analogy is a linear function rather than a constant one: the value of a linear function changes as 𝒙 grows, but the way it changes stays the same. It has a constant slope, a single value that captures its rate of change.

https://towardsdatascience.com/stationarity-in-time-series-analysis-90c94f27322

In [None]:
from IPython.display import Image
Image("StationaryNonStationary.png" , width=640)

Which of the time series shown below are stationary?

(a) Google stock price for 200 consecutive days; <br>
(b) Daily change in the Google stock price for 200 consecutive days;<br> 
(c) Annual number of strikes in the US; <br>
(d) Monthly sales of new one-family houses sold in the US; <br>
(e) Annual price of a dozen eggs in the US (constant dollars);<br>
(f) Monthly total of pigs slaughtered in Victoria, Australia; <br>
(g) Annual total of lynx trapped in the McKenzie River district of north-west Canada;<br>
(h) Monthly Australian beer production; <br>
(i) Monthly Australian electricity production.<br><br>
https://otexts.com/fpp2/stationarity.html



In [None]:
Image("stationary-1.png" , width=640)

In [None]:
#@title 
#Obvious seasonality rules out series 
#(d), (h) and (i). Trends and changing
# levels rule out series (a), (c), 
#(e), (f) and (i). Increasing variance
# also rules out (i). That leaves only 
#(b) and (g) as stationary series.

#At first glance, the strong cycles 
#in series (g) might appear to make it
# non-stationary. But these cycles are 
#aperiodic — they are caused when the
# lynx population becomes too large
# for the available feed, so that they 
#stop breeding and the population 
#falls to low numbers, then the 
#regeneration of their food sources 
#allows the population to grow again,
# and so on. In the long term, the timing
# of these cycles is not predictable. 
#Hence the series is stationary.

So how do we make a series stationary?

The most common approach is to difference it, that is, to subtract the previous value from the current value. Depending on the complexity of the series, more than one round of differencing may be needed.

The value of d, the differencing order (the d in an ARIMA(p, d, q) model), is the minimum number of differencing operations needed to make the series stationary. <br>
If the time series is already stationary, then d = 0.

The first difference of a time series is the series of changes from one period to the next. If $Y_t$ denotes the value of the time series $Y$ at period $t$, then the first difference of $Y$ at period $t$ is $Y_t - Y_{t-1}$.<br>
Rather than working with the values of the series directly, this transformation works with the difference between consecutive time steps.
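As a minimal sketch of the first difference, the toy series below (illustrative values, not taken from any dataset in this notebook) rises by 3 at every step. Its first difference is therefore constant, and the first entry is undefined because it has no predecessor.

```python
import pandas as pd

# Toy series with a linear trend: each step adds 3 (illustrative values only).
y = pd.Series([10, 13, 16, 19, 22])

# First difference: y[t] - y[t-1]. The first entry has no predecessor, so it is NaN.
first_diff = y.diff()

# Drop the undefined first entry to keep only the period-to-period changes.
changes = first_diff.dropna().tolist()
print(changes)  # [3.0, 3.0, 3.0, 3.0]
```

The trend in the original values has become a constant level in the differenced values.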

Defining the model to predict the difference in values between time steps, rather than the value itself, is a much stronger test of the model's predictive power.

**Import libraries**

In [None]:
from datetime import datetime  # pandas.datetime was removed in pandas 2.0
from pandas import read_csv
from pandas import Series
from matplotlib import pyplot

In [None]:
def parser(x):
	# Prefix '190' so that a value such as '1-01' parses as January 1901.
	return datetime.strptime('190'+x, '%Y-%m')

# **Differencing is a method of transforming a time series dataset.**

It can be used to remove the series' dependence on time, so-called temporal dependence. This includes structures like trends and seasonality.

Differencing can help stabilize the mean of the time series by removing changes in the level of a time series, and so eliminating (or reducing) trend and seasonality.


Differencing is performed by subtracting the previous observation from the current observation.

>difference(t) = observation(t) - observation(t-1)


# **Lag Difference**
Taking the difference between consecutive observations is called a lag-1 difference.

The lag difference can be adjusted to suit the specific temporal structure.

For time series with a seasonal component, the lag may be expected to be the period (width) of the seasonality.
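To illustrate choosing the lag, here is a small sketch on made-up data: a hypothetical four-step seasonal pattern repeated three times with a steady upward drift added. The names `pattern` and `seasonal` are assumptions for this example only. Differencing at lag 4 (the period) removes the repeating pattern, while a lag-1 difference does not.

```python
import pandas as pd

# Hypothetical seasonal pattern with period 4, plus a drift of +1 per step.
pattern = [5, 20, 10, 1]
seasonal = pd.Series([pattern[i % 4] + i for i in range(12)])

# Lag-1 difference: the seasonal swings are still visible.
lag1 = seasonal.diff(periods=1).dropna()

# Lag-4 difference: each value is compared with the same point in the previous cycle.
lag4 = seasonal.diff(periods=4).dropna()

print(lag1.tolist())
print(lag4.tolist())  # constant: only the drift over one full cycle remains
```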

# **Difference Order**
Temporal structure may still exist after performing a differencing operation, such as in the case of a nonlinear trend.

As such, the process of differencing can be repeated more than once until all temporal dependence has been removed.

The number of times that differencing is performed is called the difference order.
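A quadratic trend gives a simple way to see why a second round may be needed. In the sketch below (toy data, assumed purely for illustration), the first difference still grows linearly, and only the second difference is constant.

```python
import pandas as pd

# Toy nonlinear trend: 0, 1, 4, 9, ...
t = pd.Series(range(8))
quadratic = t ** 2

# Order 1: the differences still increase, so a trend remains.
first = quadratic.diff().dropna()

# Order 2: differencing the differences leaves a constant.
second = quadratic.diff().diff().dropna()

print(first.tolist())   # [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
print(second.tolist())  # [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]
```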

**Shampoo Sales Dataset**<br>
This dataset describes the monthly number of sales of shampoo over a three-year period.

The units are a sales count and there are 36 observations. The original dataset is credited to Makridakis, Wheelwright, and Hyndman (1998).


In [None]:
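# squeeze= was removed from read_csv in pandas 2.0 and date_parser= is deprecated,
# so load the single data column as a Series and parse the index afterwards.
series = read_csv('shampoo.csv', header=0, index_col=0).squeeze('columns')
series.index = [parser(x) for x in series.index]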
series.plot()
pyplot.show()

# **Manual Differencing**
We can difference the dataset manually.

This involves developing a new function that creates a differenced dataset. The function would loop through a provided series and calculate the differenced values at the specified interval or lag.

The function below named difference() implements this procedure.

In [None]:
# create a differenced series
def difference(dataset, interval=1):
	diff = list()
	for i in range(interval, len(dataset)):
		value = dataset[i] - dataset[i - interval]
		diff.append(value)
	return Series(diff)

We can see that the function is careful to begin the differenced dataset after the specified interval to ensure differenced values can, in fact, be calculated. A default interval or lag value of 1 is defined. This is a sensible default.

One further improvement would be to also be able to specify the order or number of times to perform the differencing operation.
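One possible way to add an order argument is sketched below. The name `difference_n` and its `order` parameter are hypothetical and not part of the original function; the sketch simply repeats the same lagged subtraction `order` times.

```python
from pandas import Series

def difference_n(dataset, interval=1, order=1):
    """Apply lag-`interval` differencing `order` times (hypothetical helper)."""
    values = list(dataset)
    for _ in range(order):
        # Each pass shortens the data by `interval` values.
        values = [values[i] - values[i - interval]
                  for i in range(interval, len(values))]
    return Series(values)

# Squares have a quadratic trend, so two passes leave a constant.
print(difference_n([1, 4, 9, 16, 25], order=2).tolist())  # [2, 2, 2]
```

With the default `order=1`, this behaves like the `difference()` function above.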

The example below applies the manual difference() function to the Shampoo Sales dataset.

In [None]:
X = series.values
diff = difference(X)
pyplot.plot(diff)
pyplot.show()

# **Automatic Differencing**
The Pandas library provides a function to automatically calculate the difference of a dataset.

This diff() function is provided on both the Series and DataFrame objects.

Like the manually defined difference function in the previous section, it takes an argument to specify the interval or lag, here called `periods`.

The example below demonstrates how to use the built-in difference function on the Pandas Series object.

In [None]:
diff = series.diff()
pyplot.plot(diff)
pyplot.show()
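The `periods` argument can be checked against a manual calculation. The sketch below uses a small made-up series (not the shampoo data) and shows one practical difference: `diff()` keeps the original length and fills the first `periods` entries with NaN, whereas a manual loop returns a shorter result.

```python
import pandas as pd

# Toy values, assumed only for this comparison.
data = pd.Series([10, 12, 15, 19, 24, 30])

# Built-in lag-2 difference: same length as the input, leading entries are NaN.
auto = data.diff(periods=2)

# Manual lag-2 difference: starts at the first index that has a value two steps back.
manual = [data[i] - data[i - 2] for i in range(2, len(data))]

print(auto.tolist())
print(manual)

# After dropping the NaN entries, the two results agree.
print(auto.dropna().tolist() == manual)
```

Calling `dropna()` on the differenced series is a common way to discard the undefined leading values before further analysis.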