# Time series and autocorrelations

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import month_plot, seasonal_plot, quarter_plot
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import acf

In [None]:
import warnings
warnings.filterwarnings('ignore')

## Time series

A time series is a sequence of numbers ordered by time.
Other types of ordered data behave like time series: e.g. spatial series (ordered by geographical proximity) and genome sequences (ordered by adjacent loci/nucleotides).
More generally, these data are known as **sequence data** (text and sound are other examples of sequence data).
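
As a minimal sketch of what "ordered by time" means in practice, a time series in pandas can be stored as a `Series` indexed by timestamps. The values below are made up for illustration and are not taken from any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical monthly readings, made up for illustration only
dates = pd.date_range("2020-01-01", periods=6, freq="MS")  # month-start timestamps
ts = pd.Series([10.2, 11.0, 12.4, 13.1, 12.8, 11.9], index=dates, name="value")

print(ts)

# The index is sorted in time, which is what makes this a time series
print(ts.index.is_monotonic_increasing)

# Shifting by one step pairs each observation with the previous one
print(ts.shift(1))
```

The `shift` call is shown because pairing each value with an earlier one is the basic operation behind lag plots and autocorrelations.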

Data on atmospheric $\text{CO}_2$ concentration at Mauna Loa (the `co2` dataset bundled with statsmodels holds weekly observations):

In [None]:
# Load the Mauna Loa CO2 dataset bundled with statsmodels and drop missing values
from statsmodels.datasets import co2
data = co2.load_pandas().data.dropna().reset_index()
data = data.rename(columns={"index": "date"})
data

In [None]:
data['co2'].describe()

In [None]:
# Plot the time series
plt.figure(figsize=(12, 5))
plt.plot(data['date'], data['co2'], label='CO₂ concentration')
plt.title("Atmospheric CO₂ concentration at Mauna Loa")
plt.xlabel("Year")
plt.ylabel("CO₂ (ppm)")
plt.grid(True)
plt.tight_layout()
plt.show()

The plot above shows an increasing trend over time, with a seasonal pattern superimposed on it.

Let's look at the seasonal pattern in more detail.

In [None]:
# Extract components
co2_series = data['co2']
years = pd.DatetimeIndex(data['date']).year
months = pd.DatetimeIndex(data['date']).month

In [None]:
co2_df = pd.DataFrame({
    'CO2': co2_series.values,
    'year': years,
    'month': months,
    'date': data['date']
})

co2_df['year'] = co2_df['year'].astype(str)

import calendar
co2_df['monthn'] = co2_df['month']
co2_df['month'] = co2_df['month'].apply(lambda x: calendar.month_abbr[x])

co2_df

In [None]:
## order months explicitly
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
co2_df['month'] = pd.Categorical(co2_df['month'], categories=month_order, ordered=True)

The plot below, with one line per year, allows the underlying **seasonal pattern** to be seen clearly, and also shows how seasonality changes over time.

In [None]:
plt.figure(figsize=(12, 6))
sns.lineplot(data=co2_df, x='month', y='CO2', hue='year', legend=False)
plt.title('Monthly CO2 by Year')
plt.ylabel('CO2')
plt.show()
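
The trend and the seasonal pattern seen in the plots above can also be separated numerically. The sketch below does this in a deliberately simple way, on a synthetic monthly series (an assumption for illustration, not the Mauna Loa data): the trend is estimated with a centered 12-month rolling mean, and the seasonal pattern as the average detrended value of each calendar month. It is a rough illustration of the idea and not necessarily what `seasonal_decompose` computes.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (assumption): linear trend + 12-month cycle + noise
rng = np.random.default_rng(0)
n_months = 120
t = np.arange(n_months)
dates = pd.date_range("2000-01-01", periods=n_months, freq="MS")
series = pd.Series(
    350 + 0.1 * t + 3 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.2, n_months),
    index=dates,
)

# Trend estimate: centered 12-month rolling mean (averages out the yearly cycle)
trend = series.rolling(window=12, center=True).mean()

# Seasonal estimate: mean detrended value per calendar month (NaN edges are skipped)
detrended = series - trend
seasonal_by_month = detrended.groupby(detrended.index.month).mean()

print(seasonal_by_month.round(2))
```

The rolling window spans one full seasonal period, so the seasonal swings largely cancel within each window and what remains is a smooth estimate of the trend.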

---

Now we use two paired time series: half-hourly electricity demand (gigawatts) and temperature (degrees Celsius) for Victoria (Australia) in 2014:

In [None]:
url="https://raw.githubusercontent.com/filippob/longitudinal_data_analysis/refs/heads/main/data/elecdemand.csv"
elecdemand = pd.read_csv(url)

elecdemand

In [None]:
# Build a half-hourly timestamp index ('30min' replaces the deprecated '30T' alias)
elecdemand['Datetime'] = pd.date_range('2014-01-01', periods=len(elecdemand), freq='30min')
elecdemand.set_index('Datetime', inplace=True)
elecdemand

In [None]:
# Create a FacetGrid to plot 'Demand' and 'Temperature' as separate subplots
data_melted = elecdemand.reset_index().melt(id_vars="Datetime", value_vars=['Demand', 'Temperature'],
                                      var_name="variable", value_name="value")

g = sns.FacetGrid(data_melted, row="variable", aspect = 2, height=4, sharey=False)
g.map(sns.lineplot, "Datetime", "value")

# Customize the plot with titles, axis labels, and more
g.set_axis_labels("", "Value")
g.set_titles("{row_name}")
g.set_xlabels("Year: 2014")
plt.suptitle("Half-hourly electricity demand: Victoria, Australia", fontsize=16, y=1.05)

# Rotate x-ticks for readability
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

We can use a scatterplot to look at the relationship between these two time series:

In [None]:
## scatterplot of demand against temperature
x = np.array(elecdemand['Temperature'])
y = np.array(elecdemand['Demand'])

plt.scatter(x, y, alpha=0.5)
plt.xlabel("Temperature (°C)")
plt.ylabel("Demand (GW)")
plt.show()

In [None]:
# Reload the CO2 dataset (atmospheric CO2 from Mauna Loa, weekly observations)
data = sm.datasets.co2.load_pandas().data

# Aggregate to monthly means and fill missing months by linear interpolation
# ('MS' labels each month by its first day and avoids the deprecated 'M' alias)
data = data.resample('MS').mean().interpolate()

# Extract data starting from 1980 (copy to avoid modifying a view of `data`)
co2x = data[data.index >= '1980-01-01'].copy()
co2x['date'] = co2x.index
co2x['month'] = co2x.index.month
co2x

### Lag plots

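A lag plot is a scatterplot of the series against a copy of itself shifted by $k$ time steps: the more tightly the points cluster around a line, the stronger the autocorrelation at that lag.
Below are examples of: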
1. no autocorrelation in the time series
2. moderate autocorrelations in the time series
3. strong autocorrelations in the time series

[figures from https://www.geeksforgeeks.org/lag-plots/]

![no_autocorrelation](https://drive.google.com/uc?export=view&id=1f3dI0osQNRTTC2urjGaJEwFzkoZWLwhE)

![weak_autocorrelation](https://drive.google.com/uc?export=view&id=1FRLtzYcJ_qBkTnqTDu7kOgigDEQCzjPs)

![strong_autocorrelation](https://drive.google.com/uc?export=view&id=1Pn1fignZOr--wRgybqc457Oq0ojNfmce)

Let's make a lag plot with the CO2 data we used before:

In [None]:
# Plot lag plots for lags 1 through 6
plt.figure(figsize=(12, 8))
for i in range(1, 7):
    plt.subplot(2, 3, i)
    pd.plotting.lag_plot(co2x['co2'], lag=i)
    plt.title(f'Lag {i}')
plt.tight_layout()
plt.show()
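
To make explicit what a lag plot draws, the sketch below builds the two coordinates by hand with NumPy slicing and summarizes each cloud of points with a correlation coefficient. The helper `lagged_pairs` and the two synthetic series are assumptions introduced for illustration only: pure noise has no serial dependence, while a random walk has a strong one.

```python
import numpy as np

def lagged_pairs(y, lag):
    """Return (y_t, y_{t+lag}): the x and y coordinates of a lag plot."""
    y = np.asarray(y, dtype=float)
    if not 1 <= lag < len(y):
        raise ValueError("lag must be between 1 and len(y) - 1")
    return y[:-lag], y[lag:]

rng = np.random.default_rng(1)
noise = rng.normal(size=500)   # independent draws: no autocorrelation expected
walk = np.cumsum(noise)        # random walk: strong autocorrelation expected

for name, series in [("noise", noise), ("random walk", walk)]:
    current, shifted = lagged_pairs(series, lag=1)
    corr = np.corrcoef(current, shifted)[0, 1]
    print(f"{name}: correlation of lag-1 pairs = {corr:.2f}")
```

A correlation near zero corresponds to a shapeless cloud in the lag plot, while a correlation near one corresponds to points lying close to the diagonal.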

## Autocorrelations

Autocorrelation measures the degree of similarity between a time series and a lagged version of itself, over a given range of lags.
It is also called *serial correlation* or *lagged correlation*.
In practice, it quantifies how strongly the current values of the series are related to its previous values.

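$$
r_k = \frac{\sum_{t=k+1}^T (y_t-\bar{y})(y_{t-k}-\bar{y})}{\sum_{t=1}^T(y_t-\bar{y})^2}
$$

where $r_k$ is the autocorrelation at lag $k$, $T$ is the length of the time series and $\bar{y}$ is its mean.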

We start with a made-up sequence:

In [None]:
v = np.array([12,34,56,30,23,15,28,36,45,31,21,10,29,40,58,64,78,90])
plt.plot(v)

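With lag = 0, the autocorrelation is exactly $1$, since the numerator and the denominator coincide: $\frac{\sum (y_t-\bar{y}) \cdot (y_t-\bar{y})}{\sum (y_t-\bar{y})^2} = 1$.

With lag = $1$, we correlate the series with a copy of itself shifted by one position. Using Python's 0-based slicing ($n$ is the length of the vector), the two vectors are:

-   `v[1:n]`: elements 2 to $n$
-   `v[0:n-1]`: elements 1 to $n-1$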

In [None]:
n = len(v)
v_avg = np.mean(v)
v1 = v[1:n]      # the series from the second value onwards
v2 = v[0:(n-1)]  # the series without its last value

In [None]:
adj_v = (v - v_avg)
adj_v1 = v1 - v_avg
adj_v2 = v2 - v_avg

In [None]:
r1 = np.inner(adj_v1, adj_v2) / np.inner(adj_v, adj_v)
print(r1)

**Q: can you calculate the autocorrelation with lag = 2?** (let's do this together!)

In [None]:
## LAG 2
n = len(v)
v_avg = np.mean(v)
v1 = v[2:n]
v2 = v[0:(n-2)]

In [None]:
## autocorrelation with lag = 2
r2 = np.inner((v1 - v_avg), (v2 - v_avg))/np.inner((v - v_avg), (v - v_avg))
print(r2)
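
The two manual calculations above follow the same recipe, so they can be wrapped in one function that works for any lag. The sketch below is a direct transcription of the formula for $r_k$; the function name `autocorr` is an assumption introduced here, and its results are expected to agree with `acf(v, fft=False)` from statsmodels.

```python
import numpy as np

def autocorr(y, k):
    """Sample autocorrelation r_k of the sequence y at lag k (0 <= k < len(y))."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    if not 0 <= k < n:
        raise ValueError("k must satisfy 0 <= k < len(y)")
    dev = y - y.mean()                       # deviations from the overall mean
    numerator = np.sum(dev[k:] * dev[:n - k])  # products of values k steps apart
    denominator = np.sum(dev ** 2)             # total sum of squares
    return numerator / denominator

# Same made-up sequence used in the notebook
v = np.array([12, 34, 56, 30, 23, 15, 28, 36, 45, 31, 21, 10, 29, 40, 58, 64, 78, 90])

for k in range(4):
    print(f"lag {k}: r = {autocorr(v, k):.3f}")
```

Note that the deviations are always taken from the mean of the whole series, and the denominator always uses all $T$ observations, which is why $r_k$ shrinks towards zero as the lag approaches the series length.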

In [None]:
from statsmodels.tsa.stattools import acf

# Compute autocorrelation without plotting
acf_values = acf(v, fft=False)
print(acf_values)

In [None]:
# Plot autocorrelation
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(v, lags=12)
plt.show()

Now we apply the calculation of autocorrelations to the CO2 and electricity demand time series that we encountered before:

In [None]:
acf(data['co2'], nlags = 10)

In [None]:
plot_acf(data['co2'], lags=26)
plt.show()

In [None]:
elecdemand

In [None]:
acf(elecdemand['Demand'], nlags = 10)

In [None]:
plot_acf(elecdemand['Demand'], lags=800)
plt.show()

In [None]:
acf(elecdemand['Temperature'], nlags = 50)

In [None]:
plot_acf(elecdemand['Temperature'], lags=len(elecdemand)-1)
plt.show()

### White noise time series

Time series that show no autocorrelation are called white noise.

In [None]:
y = np.random.randn(50)
plt.plot(y)

plt.title("White noise series")

plt.show()

In [None]:
acf(y)

In [None]:
y[1:]

In [None]:
plot_acf(y)
plt.show()

For **white noise series**, we expect each autocorrelation to be close to zero.
Of course, they will not be exactly equal to zero as there is random variation.
For a white noise series, we expect $95\%$ of the spikes in the autocorrelation plot to lie within:

$$
\pm \frac{2}{\sqrt{T}}
$$

where $T$ is the length of the time series.

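It is common to draw these bounds on the autocorrelation plot. In the plot above, `plot_acf` shows a shaded blue band instead, which by default is a 95% confidence interval and plays the same role.
If one or more large spikes are outside these bounds, or if substantially more than 5% of spikes are outside these bounds, then the series is probably not white noise.

In this white noise series, $T = 50$ and so the bounds are at $\pm \frac{2}{\sqrt{50}} \approx \pm 0.28$.
Because the series is drawn at random without a fixed seed, the exact values change at every run; if all (or nearly all) of the autocorrelation coefficients lie within these limits, the data are consistent with white noise.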
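
The check described above can be sketched with NumPy alone: compute the autocorrelations of a white noise series, compare them with $\pm 2/\sqrt{T}$, and report the share that falls outside. The seed, the maximum lag of 15 and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed so the sketch is reproducible
T = 50
y = rng.normal(size=T)            # white noise: independent standard normal draws

# Autocorrelations r_1 ... r_max_lag, computed with the formula for r_k
dev = y - y.mean()
denominator = np.sum(dev ** 2)
max_lag = 15
r = np.array([np.sum(dev[k:] * dev[:T - k]) / denominator
              for k in range(1, max_lag + 1)])

# Approximate 95% bounds for a white noise series of length T
bound = 2 / np.sqrt(T)
outside = np.abs(r) > bound

print(f"bounds: ±{bound:.2f}")
print(f"lags outside the bounds: {np.flatnonzero(outside) + 1}")
print(f"share of lags outside: {outside.mean():.0%}")
```

Since about 5% of the spikes are expected to exceed the bounds by chance even for true white noise, a single small excursion among many lags is not evidence against it.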