<a href="https://colab.research.google.com/github/gitmystuff/DTSC5502/blob/main/Module_10-Auto_Regression/Statsmodels_Time_Series.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Statsmodels Time Series

Name

## Getting Started

* Colab - get notebook from gitmystuff DTSC5502 repository
* Save a Copy in Drive
* Remove Copy of
* Submit shared link in Canvas


## Overview

* Autoregression
* Time Series with Statsmodels
* ETS
* Stationary Time Series
* EWMA
* Holt-Winters
* ACF
* PACF
* ARIMA
* SARIMA
* Finance Terms

## Autoregression

In a nutshell: autoregression uses observations from previous time steps as inputs to a regression equation that predicts the value at the next time step.

Autoregression is a time series modeling technique that predicts future values based on past values of the same series. It's like saying, "knowing what happened in the past can help us understand what might happen in the future."

Think of it like this: Imagine you're tracking the daily temperature. You notice that if today is hot, tomorrow is also likely to be hot.  Autoregression captures this relationship by using past temperatures to predict future temperatures.

**Here's a more formal definition:**

Autoregression is a statistical model that uses a linear combination of *lagged values* of a time series to predict future values. A lagged value is simply a previous value in the series.

**Key Concepts:**

* **Order of Autoregression (p):**  The number of lagged values used in the model. For example, an autoregressive model of order 2 (AR(2)) uses the two previous values to predict the current value.
* **Coefficients:** Each lagged value is multiplied by a coefficient that determines its influence on the prediction. These coefficients are estimated from the data.
* **Error term:**  The model also includes an error term to account for random fluctuations and unpredictable factors.

**Example:**

An AR(1) model for predicting temperature (T) might look like this:

  T(t) = c + a * T(t-1) + error(t)

where:

* T(t) is the temperature at time t
* T(t-1) is the temperature at the previous time step (t-1)
* c is a constant term
* a is the coefficient for the lagged temperature
* error(t) is the random error at time t
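
The AR(1) equation above can be sketched in plain Python. This is a minimal illustration, not how statsmodels fits the model: the helper names `fit_ar1` and `predict_next` are made up for this example, the coefficients are estimated with ordinary least squares on (previous value, current value) pairs, and the temperature series is synthetic and noise-free so the true values `c = 2` and `a = 0.5` are known in advance.

```python
def fit_ar1(series):
    """Estimate c and a in T(t) = c + a * T(t-1) + error(t) by least squares."""
    x = series[:-1]  # lagged values T(t-1)
    y = series[1:]   # current values T(t)
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    a = cov_xy / var_x
    c = mean_y - a * mean_x
    return c, a


def predict_next(last_value, c, a):
    """One-step-ahead prediction from the most recent observation."""
    return c + a * last_value


# Synthetic, noise-free series generated from T(t) = 2 + 0.5 * T(t-1)
temps = [10.0]
for _ in range(9):
    temps.append(2.0 + 0.5 * temps[-1])

c_hat, a_hat = fit_ar1(temps)
next_temp = predict_next(temps[-1], c_hat, a_hat)
print(c_hat, a_hat, next_temp)
```

Because the series has no error term, the fit recovers the generating coefficients almost exactly. With real data the estimates would only approximate the underlying relationship.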

**Applications:**

* **Forecasting:** Predicting future values of time series data, such as stock prices, sales, or weather patterns.
* **Economics:**  Analyzing economic data and understanding relationships between variables over time.
* **Signal processing:**  Filtering noise and extracting signals from data.

**Advantages:**

* **Simple and interpretable:** The model is relatively easy to understand and interpret.
* **Effective for many time series:**  Works well for data with clear autocorrelations (relationships between values at different time lags).

**Limitations:**

* **Assumes linearity:**  The model assumes a linear relationship between past and future values, which may not always be true.
* **Requires stationary data:**  The time series should be stationary (have constant statistical properties over time) for the model to be effective.

Overall, autoregression is a valuable tool for analyzing and forecasting time series data, providing insights into the underlying patterns and relationships within the data.


In [1]:
# # https://fred.stlouisfed.org/series/POPTHM
# import pandas as pd

# pop_url = 'https://raw.githubusercontent.com/gitmystuff/Datasets/main/uspopulation.csv'
# pop = pd.read_csv(pop_url, index_col='DATE', parse_dates=True)
# pop.index.freq = 'MS'
# print(len(pop))

# pop['PopEst'].plot().autoscale(axis='x', tight=True);

In [2]:
# # train and predict
# import matplotlib.pyplot as plt
# from statsmodels.tsa.ar_model import AutoReg

# train = pop.iloc[:84]
# test = pop.iloc[84:] # test len should be same len as forecast

# model = AutoReg(train['PopEst'], lags=[1, 2, 11, 12], seasonal=True, period=12).fit()
# predictions = model.predict(start=len(train), end=len(pop)-1)

# test['PopEst'].plot(legend=True)
# predictions.plot(label='Predictions')
# plt.title(f'AR({model.ar_lags}) Fit with Predictions')
# plt.legend()
# plt.show()


In the code `model = AutoReg(train['PopEst'], lags=[1, 2, 11, 12], seasonal=True, period=12).fit()`, the `lags=[1, 2, 11, 12]` argument specifies the autoregressive lags to be included in the model.

Here's what that means:

* **Autoregressive lags (AR lags):** These are the past values of the time series that are used to predict the current value. In this case, the model will use the values from 1, 2, 11, and 12 periods ago.
* **`lags=[1, 2, 11, 12]`:** This list specifies that the model should include the first, second, eleventh, and twelfth lags in the autoregressive equation.
* **Interpretation:** This suggests that the model is considering both short-term (lags 1 and 2) and seasonal (lags 11 and 12) patterns in the data. Since the `period=12` argument is also provided, it indicates that the data likely has a yearly seasonality (e.g., monthly data with patterns repeating every 12 months).

**In simpler terms:**

Imagine you're trying to predict the population of a city. This model is saying that the population in the current month is likely influenced by:

* The population in the previous month (lag 1)
* The population two months ago (lag 2)
* The population eleven months ago, one month after the same month last year (lag 11)
* The population in the same month last year (lag 12)

By including these specific lags, the model aims to capture both the short-term trends and the yearly seasonal patterns in the population data.
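
To make the lag list concrete, the sketch below shows which past observations feed each prediction when `lags=[1, 2, 11, 12]`. The helper `lagged_inputs` and the 24 monthly values are hypothetical, chosen so that each value is easy to trace back to its position in the series. It only illustrates the bookkeeping and does not reproduce what `AutoReg` does internally.

```python
def lagged_inputs(series, lags):
    """Pair each target value with the earlier values at the requested lags."""
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(series)):
        inputs = [series[t - k] for k in lags]  # value k steps before time t
        rows.append((inputs, series[t]))
    return rows


# 24 hypothetical monthly values: 100, 101, ..., 123
months = list(range(100, 124))
rows = lagged_inputs(months, [1, 2, 11, 12])

first_inputs, first_target = rows[0]
print(first_inputs, first_target)
```

The first usable target is the thirteenth observation, because lag 12 needs twelve earlier values. That is why a model with a longest lag of 12 loses the first twelve rows of the training data.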


In [3]:
# # forecast
# model = AutoReg(pop['PopEst'], lags=[1, 2, 11, 12], seasonal=True, period=12).fit()
# forecast = model.predict(start=len(pop), end=len(pop)+12)

# pop['PopEst'].plot(legend=True)
# forecast.plot(label='Forecast')
# plt.legend()
# plt.show()

### Time Series

Imagine you're trying to predict the weather. You know that today's temperature is often related to yesterday's temperature. Maybe it's a bit warmer or cooler, but it usually doesn't jump from freezing to scorching heat overnight. This is the basic idea behind autoregression in time series.

**Autoregression (AR)** is a way to predict future values in a time series by looking at past values of the *same* series. It assumes that the current value is influenced by a weighted sum of its own previous values.

Think of it like this:

* **Time series:** A sequence of data points collected over time (e.g., daily temperature, stock prices, monthly sales).
* **Autoregression:**  Using past values within the time series itself to predict future values.
* **Regression:** A statistical method to find relationships between variables. In this case, the relationship is between the current value and its own past values.

**Here's a simple example:**

Let's say you want to predict tomorrow's temperature. An autoregressive model might say:

"Tomorrow's temperature will be 0.8 times today's temperature plus 0.1 times yesterday's temperature plus some random error."

In this case:

* **0.8 and 0.1 are the weights** assigned to the previous two days' temperatures. These weights determine how much influence each past value has on the prediction.
* **The random error** accounts for unpredictable factors that the model can't capture.
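
The weighted-sum rule in this example can be turned into a multi-step forecast by feeding each prediction back in as the newest "observation". The sketch below assumes the weights 0.8 and 0.1 from the sentence above, leaves out the random error and any constant term, and uses a hypothetical helper name `forecast_ar2` with made-up starting temperatures.

```python
def forecast_ar2(history, w1, w2, steps):
    """Forecast recursively: next = w1 * latest + w2 * the value before it."""
    values = list(history)
    forecasts = []
    for _ in range(steps):
        next_value = w1 * values[-1] + w2 * values[-2]
        forecasts.append(next_value)
        values.append(next_value)  # the prediction becomes an input for the next step
    return forecasts


# Hypothetical temperatures: yesterday 20.0, today 22.0
forecast = forecast_ar2([20.0, 22.0], w1=0.8, w2=0.1, steps=3)
print(forecast)
```

Since these two weights sum to less than 1 and there is no constant, the forecasts drift toward zero. A fitted model would normally include a constant so that forecasts settle near the series mean instead.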

**Key concepts in autoregression:**

* **Lag:** The time difference between observations.  A lag of 1 means using the value from the previous time step.
* **Order:** The number of past values used in the prediction.  An AR model of order 2 (AR(2)) uses the values from the two previous time steps.
* **Autocorrelation:** The correlation between a time series and a lagged version of itself. It measures how strongly past values are related to current values.

**Why is autoregression useful?**

* **Forecasting:** Predict future values based on historical trends.
* **Understanding patterns:** Identify recurring patterns and seasonality in data.
* **Control:**  In some cases, AR models can be used to design control systems that react to changes in a time series.

Autoregression is a powerful tool for analyzing and predicting time series data. It's used in various fields like finance, economics, weather forecasting, and engineering.


### LLMs

Think of a large language model (LLM) that learns to predict the next word in a sequence based on the words that came before it. This is where autoregression comes in.

**Autoregression in LLMs**

LLMs use autoregression to generate text by predicting the probability of each word given its preceding words. It's like a chain reaction:

1. **Start with a prompt:** You give the LLM a starting phrase or sentence.
2. **Predict the next word:** The LLM analyzes the prompt and calculates the probability of different words following it. For example, if the prompt is "The cat sat on the", the LLM might predict a high probability for words like "mat", "chair", or "table".
3. **Generate the next word:** The LLM selects a word based on its prediction (often the most likely one, or one sampled from the predicted probabilities) and adds it to the sequence.
4. **Repeat:** The LLM continues this process, predicting and generating one word at a time, building a coherent and contextually relevant text.

**Example:**

Prompt: "The quick brown fox"

LLM prediction:
* "jumps" - high probability
* "runs" - medium probability
* "sleeps" - low probability

LLM output: "The quick brown fox jumps"
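
The predict-append-repeat loop can be sketched with a toy lookup table standing in for the model. Everything here is an assumption made for illustration: the probabilities in `next_word_probs` are invented, the table conditions only on the last word (a real LLM conditions on the whole context and works on tokens, not words), and `generate` always picks the most likely next word.

```python
# Invented next-word probabilities, keyed by the most recent word
next_word_probs = {
    "fox": {"jumps": 0.6, "runs": 0.3, "sleeps": 0.1},
    "jumps": {"over": 0.9, "high": 0.1},
    "over": {"the": 1.0},
}


def generate(prompt, table, max_new_words):
    """Greedy autoregressive generation: append the most likely next word, repeat."""
    words = prompt.split()
    for _ in range(max_new_words):
        choices = table.get(words[-1])
        if not choices:  # no prediction available for this word, so stop
            break
        words.append(max(choices, key=choices.get))
    return " ".join(words)


text = generate("The quick brown fox", next_word_probs, max_new_words=2)
print(text)
```

Each generated word becomes part of the input for the next prediction, which is the same feedback idea as a recursive AR forecast.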

**How LLMs learn autoregression:**

LLMs are trained on massive amounts of text data. During training, they learn to:

* **Identify patterns:**  Recognize common sequences of words and phrases.
* **Estimate probabilities:**  Calculate the likelihood of different words following a given context.
* **Generate coherent text:** Produce text that flows naturally and makes sense.

**Benefits of autoregression in LLMs:**

* **High-quality text generation:**  LLMs can generate remarkably human-like text, including stories, articles, and even code.
* **Contextual understanding:** LLMs can maintain context over long sequences of text, allowing them to generate coherent and relevant responses.
* **Flexibility:** Autoregressive LLMs can be used for various tasks, such as translation, summarization, and question answering.

**Limitations:**

* **Bias:** LLMs can inherit biases from their training data, leading to unfair or discriminatory outputs.
* **Lack of common sense:** LLMs may struggle with tasks that require real-world knowledge or reasoning.
* **Computational cost:**  Training and running large LLMs can be computationally expensive.

Despite these limitations, autoregression is a core technique behind the impressive capabilities of modern LLMs. It enables them to learn from data and generate human-quality text in a wide range of applications.


## Fly Me to the Moon

### ETS

https://www.statsmodels.org/dev/examples/notebooks/generated/ets.html

* Error
* Trend
* Seasonality
* Exponential Smoothing
* Trend Methods
* ETS Decomposition
* Additive models: used when the trend is roughly linear and/or the seasonal swings stay about the same size over time
* Multiplicative models: used when the trend is nonlinear (e.g. exponential) and/or the seasonal swings vary in proportion to the level of the series

### Moving Average

The moving average is a simple but powerful technique used to smooth out fluctuations in time series data and highlight underlying trends. It works by calculating the average of a specified number of consecutive data points, creating a new series of averages.

**Here's how it works:**

1. **Choose a window size:** Decide how many data points you want to include in each average. This is called the "window" or "order" of the moving average. For example, a 3-month moving average would use the average of the current month and the two preceding months.

2. **Calculate the averages:**  Slide the window across the time series, calculating the average of the data points within the window at each step.

3. **Create a new series:** The calculated averages form a new time series that is smoother than the original data.

**Example:**

Let's say you have monthly sales data:

| Month | Sales |
|---|---|
| Jan | 100 |
| Feb | 120 |
| Mar | 110 |
| Apr | 130 |
| May | 140 |

A 3-month moving average would be calculated as follows:

| Month | Sales | 3-Month Moving Average |
|---|---|---|
| Jan | 100 | - |
| Feb | 120 | - |
| Mar | 110 | (100 + 120 + 110) / 3 = 110 |
| Apr | 130 | (120 + 110 + 130) / 3 = 120 |
| May | 140 | (110 + 130 + 140) / 3 = 126.67 |
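
The table can be reproduced with a few lines of plain Python. This is a minimal sketch with a hypothetical helper name, `simple_moving_average`. It uses a trailing window (the current point plus the preceding ones), matching the table, and marks positions without enough history as `None`.

```python
def simple_moving_average(values, window):
    """Trailing simple moving average; None where the window is not yet full."""
    averages = []
    for i in range(len(values)):
        if i + 1 < window:
            averages.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            averages.append(sum(chunk) / window)
    return averages


sales = [100, 120, 110, 130, 140]  # Jan to May, from the table above
sma_3 = simple_moving_average(sales, 3)
print(sma_3)
```

In pandas the same calculation is usually written as `series.rolling(window=3).mean()`, which returns `NaN` rather than `None` for the first two positions.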

**Types of Moving Averages:**

* **Simple Moving Average (SMA):**  All data points within the window are given equal weight.
* **Weighted Moving Average (WMA):**  Assigns different weights to data points, typically giving more weight to recent observations.
* **Exponential Moving Average (EMA):**  A type of WMA that gives exponentially decreasing weights to older data points.

**Applications:**

* **Smoothing data:**  Reduces noise and highlights trends.
* **Technical analysis:** Used in finance to identify trends and generate trading signals.
* **Forecasting:** Can be used to make simple forecasts by extrapolating the trend.

**Choosing the window size:**

* **Smaller window:**  Responds more quickly to changes in the data but is less smooth.
* **Larger window:**  Produces a smoother trend but lags behind changes in the data.

The choice of window size depends on the specific application and the desired balance between responsiveness and smoothness.


In [4]:
# import pandas as pd
# import numpy as np

# date_rng = pd.date_range(start='2032-01-01', end='2043-12-01', freq='MS')

# # Generate a base series with an upward trend
# base_passengers = np.linspace(1000, 5000, len(date_rng))  # Start with 1000, steadily increase to 5000

# # Add some random noise to simulate real-world fluctuations
# noise = np.random.normal(0, 200, len(date_rng))
# passengers = base_passengers + noise

# # Ensure no negative or zero values
# passengers = np.maximum(passengers, 1)
# passengers = np.round(passengers).astype(int)  # Round to integers and convert to int

# # Create the DataFrame
# # df = pd.DataFrame({'Passengers': passengers}, index=date_rng)

# df = pd.DataFrame({'Passengers': passengers, 'Month': date_rng})
# df.set_index('Month', inplace=True)

# print(df.info())
# print(df.head())
# print(df.groupby(df.index.year).size())
# starliner = df

In [5]:
# starliner.tail()

In [6]:
# # ets decomposition
# from statsmodels.tsa.seasonal import seasonal_decompose

# seasonal_decompose(starliner['Passengers'], model='multiplicative').plot();

### EWMA

https://towardsdatascience.com/time-series-analysis-with-statsmodels-12309890539a

* Exponential Weighted Moving Average
* With the same window (span), EWMA weights recent observations more heavily, while SMA weights every point in the window equally

EWMA stands for **Exponentially Weighted Moving Average**. It's a way to smooth out data by giving more weight to recent observations and less weight to older observations. This makes it particularly useful for analyzing time series data where the most recent data points are often the most relevant.

Here's a breakdown of how EWMA works:

**1. Calculating the EWMA**

The EWMA is calculated using the following formula:

```
EWMA(t) = α * X(t) + (1 - α) * EWMA(t-1)
```

Where:

* `EWMA(t)` is the EWMA at time t
* `α` is the smoothing factor (a value between 0 and 1)
* `X(t)` is the observation at time t
* `EWMA(t-1)` is the EWMA at time t-1

**2. The Smoothing Factor (α)**

The smoothing factor (`α`) determines how much weight is given to recent observations.

* A higher `α` gives more weight to recent observations, making the EWMA more responsive to changes in the data.
* A lower `α` gives less weight to recent observations, making the EWMA smoother and less sensitive to noise.

**3. How it Works**

The EWMA starts with an initial value (often the first observation in the series). Then, for each subsequent observation, the EWMA is calculated by taking a weighted average of the current observation and the previous EWMA.  This process continues, with each new EWMA being influenced by all previous observations, but with exponentially decreasing weights.
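
The recursion can be written out directly. The sketch below assumes the first observation is used as the starting value, as described above. The helper name `ewma` and the three data points are made up for illustration. The conversion `alpha = 2 / (span + 1)` is the usual way to express the smoothing factor in terms of a span.

```python
def ewma(values, alpha):
    """EWMA(t) = alpha * X(t) + (1 - alpha) * EWMA(t-1), seeded with the first value."""
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


span = 12
alpha_from_span = 2 / (span + 1)  # smoothing factor equivalent to a 12-period span

demo = ewma([100, 120, 110], alpha=0.5)
print(demo, alpha_from_span)
```

This matches the recursion pandas uses for `ewm(..., adjust=False).mean()`. With `adjust=True` (the pandas default) the early values are weighted differently, so the two only agree once enough history has accumulated.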

**Key Advantages of EWMA:**

* **Simple to understand and implement.**
* **Requires minimal data storage.**
* **Adapts to changes in the data.**
* **Can be used for forecasting.**

**Common Applications:**

* **Technical Analysis:** Used to smooth price data and identify trends.
* **Volatility Modeling:**  Used to estimate the volatility of financial assets.
* **Quality Control:** Used to monitor processes and detect shifts in performance.
* **Forecasting:** Used to make predictions about future values.


**Analogy:**

Imagine you're trying to estimate the average temperature in a room. You could take the average of all the temperature readings you've ever taken, but that wouldn't be very useful because the older readings are less relevant to the current temperature. Instead, you could use an EWMA, which would give more weight to recent readings and less weight to older readings. This would give you a more accurate estimate of the current temperature.


In [7]:
# # add 6 and 12 month sma and annual ewma
# starliner['6-month-SMA'] = starliner['Passengers'].rolling(window=6).mean()
# starliner['12-month-SMA'] = starliner['Passengers'].rolling(window=12).mean()
# starliner['12-month-EWMA'] = starliner['Passengers'].ewm(span=12,adjust=False).mean()
# starliner.plot();

### Holt-Winters

* EWMA has just one smoothing factor and doesn't account for trend, seasonality, etc.
* Holt-Winters uses three smoothing factors, one each for level, trend, and seasonality ($\alpha, \beta, \gamma$), plus the number of periods per seasonal cycle ($L$)
* Exponential smoothing comes in three forms: Single (level only), Double (Holt's method, adds trend), and Triple (Holt-Winters, adds seasonality)

Both Holt-Winters and EWMA are smoothing methods used for time series analysis and forecasting, but they differ in their complexity and ability to handle trends and seasonality:

**EWMA (Exponentially Weighted Moving Average)**

* **Focus:** Smooths out data by giving more weight to recent observations.
* **Handles:**  Suitable for data with no clear trend or seasonality.
* **Parameters:**  One main parameter: the smoothing factor (α).
* **Limitations:** Doesn't explicitly account for trends or seasonality in the data.

**Holt-Winters**

* **Focus:**  A more advanced method that can handle both trend and seasonality.
* **Handles:**  Suitable for data with trends and/or seasonal patterns.
* **Parameters:** Three main parameters:
    *  α (smoothing factor for the level)
    *  β (smoothing factor for the trend)
    *  γ (smoothing factor for the seasonality)
* **Variations:**  
    * **Single Exponential Smoothing:**  Similar to EWMA, handles only the level.
    * **Double Exponential Smoothing:** Handles level and trend.
    * **Triple Exponential Smoothing:**  Handles level, trend, and seasonality.

**Here's an analogy:**

Imagine you're tracking the sales of ice cream over time.

* **EWMA:**  Like averaging daily sales with a focus on recent days. It might show you if sales are generally increasing or decreasing, but it won't capture the summer peak.
* **Holt-Winters:** Like considering not just the average sales, but also how quickly sales are increasing towards the summer (trend) and the repeating pattern of high sales every summer (seasonality).

**Key Differences:**

| Feature | EWMA | Holt-Winters |
|---|---|---|
| Trend | No explicit handling | Explicitly models trend |
| Seasonality | No explicit handling | Explicitly models seasonality |
| Complexity | Simpler | More complex |
| Parameters | One (α) | Three (α, β, γ) |
| Forecasting | Short-term forecasts | Short- to medium-term forecasts |

**When to Use Which:**

* **EWMA:** When dealing with data that doesn't have a clear trend or seasonality and you need a simple smoothing method.
* **Holt-Winters:** When dealing with time series data that exhibits trends and/or seasonality, and you need more accurate forecasts.

In essence, Holt-Winters is an extension of EWMA that incorporates trend and seasonality, making it more suitable for complex time series data.
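
To show how a trend term extends the EWMA recursion, here is a minimal sketch of double exponential smoothing (Holt's additive linear method). Several details are assumptions made for illustration: the helper name `holt_linear`, the initial level (first value) and initial trend (first difference), and the hand-picked `alpha` and `beta`. statsmodels initializes and estimates these differently, and the seasonal (`gamma`) component is left out to keep the example short.

```python
def holt_linear(values, alpha, beta, horizon):
    """Additive double exponential smoothing with a level and a trend component."""
    level = values[0]
    trend = values[1] - values[0]  # assumed initial trend: the first difference
    for x in values[1:]:
        previous_level = level
        # Level: blend the new observation with the previous one-step forecast
        level = alpha * x + (1 - alpha) * (previous_level + trend)
        # Trend: blend the latest change in level with the previous trend
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # Forecast h steps ahead by extending the last level along the last trend
    return [level + h * trend for h in range(1, horizon + 1)]


forecast = holt_linear([10, 12, 14, 16, 18], alpha=0.5, beta=0.5, horizon=2)
print(forecast)
```

On a perfectly linear series the method continues the line, which a plain EWMA cannot do because its forecast is flat at the last smoothed level.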


Source
* https://www.statsmodels.org/dev/tsa.html#exponential-smoothing

In [8]:
# from statsmodels.tsa.holtwinters import SimpleExpSmoothing, ExponentialSmoothing

# span = 12
# alpha = 2/(span+1)

# starliner['12-month-SES'] = SimpleExpSmoothing(starliner['Passengers']).fit(smoothing_level=alpha,
#                                                                         optimized=False).fittedvalues.shift(-1) # shift(-1) lines the one-step-ahead fitted values up with the EWMA column
# starliner['12-month-DES'] = ExponentialSmoothing(starliner['Passengers'],
#                                                trend='mul').fit().fittedvalues.shift(-1)
# starliner['12-month-TES'] = ExponentialSmoothing(starliner['Passengers'],
#                                                trend='mul',
#                                                seasonal='mul',
#                                                seasonal_periods=12).fit().fittedvalues

# print(starliner.info())
# print(starliner.columns)

In [9]:
# plots = ['Passengers', '6-month-SMA', '12-month-SMA', '12-month-EWMA', '12-month-DES', '12-month-TES', '12-month-SES']
# starliner[['Passengers', '12-month-SES', '12-month-DES', '12-month-TES']].plot(figsize=(12,5)).autoscale(axis='x', tight=True);

In [10]:
# # zoom in first two years
# starliner[['Passengers', '12-month-SES', '12-month-DES', '12-month-TES']].iloc[:24].plot(figsize=(12,5)).autoscale(axis='x', tight=True);

In [11]:
# starliner.tail()

In [12]:
# # zoom in last two years
# starliner[['Passengers', '12-month-SES', '12-month-DES', '12-month-TES']]['2042-01-01':].plot(figsize=(12,5)).autoscale(axis='x', tight=True);

### Stationary Time Series

In [13]:
# # non-stationary
# starliner['Passengers'].plot();

In [14]:
# # births - can you spot any trends
# import pandas as pd
# import numpy as np

# # Set the start and end dates for the year
# start_date = pd.to_datetime('2023-01-01')
# end_date = pd.to_datetime('2023-12-31')

# # Generate a daily datetime index for the year
# date_range = pd.date_range(start_date, end_date, freq='D')

# # Simulate the number of female births per day
# # We'll use a normal distribution with a mean of 50 and a standard deviation of 10
# # We'll also ensure there are no negative values or 0s
# births_per_day = np.random.normal(loc=50, scale=10, size=len(date_range))
# births_per_day = np.maximum(births_per_day, 1).astype(int)  # Ensure no negative values or 0s

# # Create a DataFrame with the datetime index and simulated birth data
# births = pd.DataFrame({'Births': births_per_day}, index=date_range)

# print(births.info())
# print(births.head())
# births.plot();

### Passenger Data (Non-Stationary)


### Key Characteristics:
* **Data:** Monthly **Passenger** counts over twelve years (2032 to 2043).
* **Trend:** It exhibits a very strong, **obvious upward trend**. The passenger numbers are consistently increasing year after year.
* **Seasonality/Cycles:** A clear **seasonal pattern** is visible, with regular peaks and troughs recurring every year.
* **Variance:** The amplitude of the seasonal peaks seems to be **increasing** as the mean value increases (called **multiplicative seasonality**).

### Stationarity Status:
* This series is clearly **Non-Stationary** because:
    1.  Its **mean** is **not constant** (it has an upward **Trend**).
    2.  Its **variance** is **not constant** (the seasonal amplitude increases over time).

---

### Daily Births (Approximately Stationary)


### Key Characteristics:
* **Data:** Daily number of **Births** over a single year (2023).
* **Trend:** There is **no clear upward or downward trend** over the year. The average number of births seems to hover around a consistent level (e.g., 40-50).
* **Seasonality/Cycles:** There are no strong, repeating seasonal patterns (like monthly peaks/troughs) visible. The fluctuations appear **random** (noise).
* **Variance:** The variation (volatility or amplitude) of the line appears relatively **constant** throughout the year.

### Stationarity Status:
* This series is approximately **stationary**: there is no trend, so the mean is roughly constant, and the size of the day-to-day **randomness (noise)** is roughly constant too. For practical modeling, one can treat its **mean and variance as constant** over time.

---

### Summary of Differences

| Feature | Daily Births | Passenger Data |
| :--- | :--- | :--- |
| **Trend** | No significant trend (stationary mean) | **Strong upward trend** (non-stationary mean) |
| **Seasonality** | No clear pattern (high noise) | **Strong and clear seasonal pattern** |
| **Volatility** | High, but constant variance | **Increasing variance** (multiplicative seasonality) |
| **Modeling Goal** | De-noising, identifying daily effects | **Modeling/Removing trend & seasonality** |
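
One common way to remove a trend is differencing: replacing each value with its change from the previous value. The sketch below uses a hypothetical helper, `difference`, on a made-up series with a steady upward trend. Passing `lag=12` instead would subtract the value from the same month a year earlier, which is the usual way to difference out yearly seasonality in monthly data.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t that has a lagged partner."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


trend_series = [100, 110, 120, 130, 140]  # made-up values with a linear trend
first_diff = difference(trend_series)
print(first_diff)
```

The differenced series is constant, so its mean no longer changes over time. The price is one lost observation per differencing step.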

## Forecasting

### Holt-Winters

In [15]:
# from statsmodels.tsa.holtwinters import ExponentialSmoothing

# train = starliner.iloc[:108].dropna()
# test = starliner.iloc[108:].dropna()
# # print(len(test))

# model = ExponentialSmoothing(train['Passengers'],
#                              trend='mul',
#                              seasonal='mul',
#                              seasonal_periods=12).fit()

# predictions = model.forecast(len(test))

# train['Passengers'].plot(legend=True, label='train', figsize=(12,5))
# test['Passengers'].plot(legend=True, label='test')
# predictions.plot(legend=True, label='predictions');

### Evaluation

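* Accuracy, precision, and recall are classification metrics, so they don't apply to numeric forecasts
* MSE (Mean Squared Error): squaring emphasizes larger residuals
* RMSE (Root Mean Squared Error): the square root of MSE, so it is in the same units as the data; minimizing it pulls forecasts toward the mean
* Compare RMSE to the average value and standard deviation of the data (there is no universal cutoff for a good score)
* MAE (Mean Absolute Error): minimizing it pulls forecasts toward the median
* AIC/BIC: metrics for comparing candidate models fitted to the same data; lower is better
* Forecasts of the true future can't be scored until the actual values arrive, which is why a test set is held out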
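
The three error metrics are simple enough to compute by hand, which helps when checking what a library call returns. The sketch below uses hypothetical helper names and made-up actual and predicted values.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average size of the errors, ignoring their sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average squared error; large misses count disproportionately."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [100, 110, 120]      # made-up test values
predicted = [90, 115, 120]    # made-up forecasts

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = math.sqrt(mse)  # back in the same units as the data
print(mae, mse, rmse)
```

Here the single 10-unit miss dominates the MSE (it contributes 100 of the 125 total squared error), while MAE treats it as just twice as bad as the 5-unit miss.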

In [16]:
# # evaluation
# import numpy as np
# from sklearn.metrics import mean_squared_error,mean_absolute_error

# print('MAE:', mean_absolute_error(test['Passengers'], predictions))
# print('MSE:', mean_squared_error(test['Passengers'], predictions))
# print('RMSE:', np.sqrt(mean_squared_error(test['Passengers'], predictions)))
# print(test.describe())

### ACF

https://medium.com/@krzysztofdrelczuk/acf-autocorrelation-function-simple-explanation-with-python-example-492484c32711

* Positive Correlation: as x goes up, y goes up
* Negative Correlation: as x goes up, y goes down
* AutoCorrelation: degree of correlation between successive time intervals (correlation with itself)
* Lag: the number of periods, k, separating the two observations being compared

ACF stands for **Autocorrelation Function**. It's a tool used in time series analysis to measure the correlation between a time series and lagged versions of itself. In simpler terms, it helps you understand how a data point at a particular time is related to data points that came before it.

**Here's a breakdown:**

1. **Lag:**  A lag refers to a previous point in time. For example, a lag of 1 means looking at the data point one time unit before the current one, a lag of 2 means two time units before, and so on.

2. **Correlation:**  Correlation measures the linear relationship between two variables. A positive correlation means they move in the same direction, a negative correlation means they move in opposite directions, and no correlation means there's no linear relationship.

3. **Autocorrelation:**  Autocorrelation specifically measures the correlation between a time series and its lagged versions. It tells you how much a data point at a certain time is influenced by its past values.

**How ACF Works**

The ACF calculates the correlation coefficient for different lag values. It produces a plot (called a correlogram) that shows these correlation coefficients on the y-axis and the lag values on the x-axis.
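
The correlation coefficient at each lag can be computed directly. The sketch below uses a common sample estimator: products of deviations from the overall mean, divided by the total sum of squared deviations. The helper name `autocorrelation` and the alternating test series are made up for illustration.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    denominator = sum(d * d for d in deviations)
    acf = []
    for k in range(max_lag + 1):
        # Pair each value with the value k steps earlier
        numerator = sum(deviations[t] * deviations[t - k] for t in range(k, n))
        acf.append(numerator / denominator)
    return acf


alternating = [1, -1] * 4  # 1, -1, 1, -1, ... (8 values)
acf_values = autocorrelation(alternating, max_lag=2)
print(acf_values)
```

The lag-0 value is always 1 because a series is perfectly correlated with itself. For this alternating series, lag 1 is strongly negative (neighbors move in opposite directions) and lag 2 is strongly positive (values two steps apart match).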

**Interpreting the ACF Plot**

* **Significant correlations:**  Bars that extend beyond the confidence intervals (usually shown as dashed lines) indicate statistically significant correlations.
* **Positive autocorrelation:**  Positive values suggest that the data points at a particular lag tend to move in the same direction as the current data point.
* **Negative autocorrelation:** Negative values suggest that the data points at a particular lag tend to move in the opposite direction as the current data point.
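* **Gradual decline:**  A slow, gradual decline in the ACF suggests a trend in the data (non-stationarity).
* **Recurring peaks:**  Spikes at regular intervals (e.g. lags 12, 24, 36 for monthly data) suggest a seasonal pattern, while a sharp drop to near zero after a few lags suggests that only short-term correlation is present.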

**Why is ACF Important?**

* **Identifying patterns:**  ACF helps identify trends, seasonality, and other patterns in time series data.
* **Model selection:** It helps choose the appropriate model for your time series data (e.g., ARIMA models).
* **Checking for randomness:**  If the ACF shows no significant correlations, it suggests that the data might be random (white noise).

**Analogy:**

Imagine you're tracking the daily temperature. If the temperature today is high, there's a good chance it will be high tomorrow as well (positive autocorrelation). The ACF would help you quantify this relationship and understand how the temperature on previous days influences the temperature today.

In [17]:
# from statsmodels.graphics.tsaplots import plot_acf

# title = 'Autocorrelation: starliner Passengers'
# lags = 40
# plot_acf(starliner['Passengers'], title=title, lags=lags);

* Each bar is the correlation between the series and a copy of itself shifted by that many lags
* This plot indicates non-stationary data: the ACF values stay large over many lags before dropping off
* The gradual decline shows that larger shifts are less correlated with the original series
* Lags that form peaks mark a rise in correlation and point to seasonality
* Together, the bars describe how strongly observations k periods apart are related

In [18]:
# # acf plots default lags = 40
# from statsmodels.graphics.tsaplots import plot_acf

# title = 'Autocorrelation: Daily Female Births'
# lags = 40
# plot_acf(births, title=title, lags=lags);

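This ACF plot shows stationary data, with lags on the horizontal axis and correlations on the vertical axis. The first value, at lag 0, is always 1 because a series is perfectly correlated with itself. A sharp cutoff in the ACF after a few lags is typical of a moving-average (MA) process; for an AR process, the sharp cutoff appears in the PACF instead.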

### Lag Plot

* **What it is:**  A scatter plot that compares a time series against a lagged version of itself.  
* **How it works:** It plots the values of the time series on the y-axis against the values of the time series at a previous time (lag) on the x-axis.
* **What it shows:**  Helps visualize the relationship between observations and their past values. You can create lag plots for different lags (e.g., lag 1, lag 2, etc.) to see how the relationships change over time.

**Example:**

If you have daily stock prices, a lag plot with a lag of 1 would plot today's price against yesterday's price. A point in the upper-right quadrant would mean both today and yesterday had high prices.
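
The points on a lag plot are just (earlier value, later value) pairs. The sketch below builds them for a made-up price series with a hypothetical helper, `lag_pairs`; a plotting call would then scatter the first element of each pair on the x-axis and the second on the y-axis.

```python
def lag_pairs(series, lag=1):
    """Pair each value with the value `lag` steps later (lag must be at least 1)."""
    return list(zip(series[:-lag], series[lag:]))


prices = [10, 11, 12, 11, 13]  # made-up daily prices
pairs = lag_pairs(prices)      # (yesterday's price, today's price)
print(pairs)
```

A series of n values yields n minus `lag` points, so larger lags leave fewer points on the plot.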

**Relationship to ACF**

* **Visual vs. Numerical:** A lag plot provides a *visual* representation of the relationship between data points at different lags, while the ACF provides a *numerical* measure (the correlation coefficient) of that relationship.
* **Patterns:**  Both can help identify patterns like:
    * **Strong positive correlation:** Points clustered along a rising diagonal in a lag plot often correspond to strong positive bars in the ACF.
    * **Strong negative correlation:** Points clustered along a falling diagonal in a lag plot often correspond to strong negative bars in the ACF.
    * **No correlation:** A random scatter of points in a lag plot suggests no correlation, which would be reflected by small bars in the ACF.

**In Summary**

Lag plots and ACF plots are complementary tools.

* **Start with lag plots:**  They give you a quick visual impression of potential correlations in your time series.
* **Then use ACF:** To quantify those correlations and get a more precise understanding of the relationships at different lags.


Think of it this way:

* **Lag plot:**  A picture of the relationship.
* **ACF:** A number that describes the strength of the relationship in the picture.
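
To make the "picture versus number" link concrete, here is a minimal sketch using only NumPy. The simulated AR(1) series and the coefficient 0.8 are illustrative assumptions, not the notebook's data. The correlation of the lag-1 scatter is approximately the lag-1 value that an ACF plot would show.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(1) series: y_t = 0.8 * y_{t-1} + noise (illustrative data only)
n = 2000
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.8 * y[t - 1] + rng.normal()

# A lag-1 lag plot is a scatter of each value against the value one step earlier
previous, current = y[:-1], y[1:]

# The Pearson correlation of that scatter is approximately the lag-1 autocorrelation
lag1_corr = np.corrcoef(previous, current)[0, 1]
print(f"lag-1 correlation: {lag1_corr:.3f}")
```

A tight rising diagonal in the lag plot goes with a value near 1, and a shapeless cloud goes with a value near 0.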


In [19]:
# from pandas.plotting import lag_plot

# lag_plot(starliner['Passengers']);

In [20]:
# lag_plot(births['Births']);

### PACF

* Partial autocorrelation: the correlation of a stationary time series with its own lagged values, after regressing out the values of the series at all shorter lags
* Can be thought of as relating what the shorter lags leave unexplained (the residuals) to the current values
* It contrasts with the autocorrelation function, which does not control for other lags
* Describes the direct relationship between an observation and its lag
* `statsmodels.tsa.statespace.tools.diff`: performs a differencing operation along the zeroth axis

PACF stands for **Partial Autocorrelation Function**. It's another important tool in time series analysis, and it's closely related to the ACF (Autocorrelation Function). However, there's a key difference:

**PACF vs. ACF**

* **ACF:** Measures the correlation between a time series and its lagged versions *without* removing the influence of intermediate lags, so direct and indirect effects are mixed together.
* **PACF:** Measures the correlation between a time series and its lagged versions *after* removing the effect of intermediate lags.

**In simpler terms:**

Imagine you're looking at the relationship between the temperature today and the temperature three days ago.

* **ACF:**  Would simply measure the correlation between these two temperatures.
* **PACF:**  Would measure the correlation between these two temperatures *but would remove the influence of the temperatures on the two days in between*.

**Why is this "removing the influence" important?**

Because sometimes, the correlation between two points in time might be due to their shared correlation with points in between.  The PACF helps you isolate the *direct* relationship between two points in time, excluding the influence of intermediate points.

**How PACF Works**

The PACF is calculated by fitting a series of autoregressive models (AR models) of increasing order. For each lag, it estimates the coefficient of that lag in the AR model, which represents the partial autocorrelation at that lag.
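
The regression view described above can be sketched with NumPy alone. The helper name `pacf_by_regression` and the simulated AR(1) series are hypothetical and purely for illustration. statsmodels offers several estimation methods, and this least-squares version is only one of them.

```python
import numpy as np

def pacf_by_regression(y, max_lag):
    """Partial autocorrelations for lags 1..max_lag via successive AR(k) least-squares fits."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    n = len(y)
    values = []
    for k in range(1, max_lag + 1):
        # Column j holds the series lagged by j + 1 steps, aligned with target y[k:]
        X = np.column_stack([y[k - j - 1 : n - j - 1] for j in range(k)])
        coefs, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
        values.append(coefs[-1])  # coefficient on lag k = partial autocorrelation at lag k
    return np.array(values)

rng = np.random.default_rng(0)

# Simulated AR(1) series with coefficient 0.6 (illustrative data only)
n = 3000
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.6 * y[t - 1] + rng.normal()

pacf_values = pacf_by_regression(y, max_lag=3)
acf_lag2 = np.corrcoef(y[:-2], y[2:])[0, 1]

print("PACF lags 1-3:", np.round(pacf_values, 3))
print("ACF at lag 2: ", round(acf_lag2, 3))
```

For an AR(1) process the ACF at lag 2 is clearly nonzero (roughly 0.6 squared), because lag 2 influences today through lag 1. The PACF at lag 2 is close to zero because that indirect path has been removed.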

**Interpreting the PACF Plot**

The PACF plot is similar to the ACF plot:

* **Significant correlations:** Bars that extend beyond the confidence intervals indicate statistically significant partial autocorrelations.
* **Positive/Negative values:**  Indicate the direction of the direct relationship.
* **Sharp drops:**  A sharp cutoff in the PACF after lag p is a classic sign of an autoregressive model of order p, AR(p).

**Why is PACF Important?**

* **Identifying direct relationships:**  Helps understand the direct relationship between data points at different lags.
* **Model selection:** Particularly useful for determining the order of an autoregressive (AR) model in ARIMA modeling.

**Analogy:**

Imagine you're studying the relationship between a grandfather and his grandson.

* **ACF:** Would measure the overall correlation between them, which might be influenced by the father (the intermediate generation).
* **PACF:** Would measure the correlation between the grandfather and grandson *after removing the influence of the father*, giving you a clearer picture of their direct relationship.

**In essence:**

The PACF provides a more refined view of the relationships within a time series by isolating the direct correlations between points in time. This is crucial for building accurate time series models, especially ARIMA models.


In [21]:
# from statsmodels.graphics.tsaplots import plot_pacf

# title = 'Partial Autocorrelation: Daily Female Births'
# lags = 40
# plot_pacf(births, title=title, lags=lags);

Partial autocorrelation works best with stationary data. To make starliner stationary, we difference it with the `diff` function.

In [22]:
# from statsmodels.tsa.statespace.tools import diff

# starliner['diff1'] = diff(starliner['Passengers'], k_diff=1)
# starliner[['Passengers', 'diff1']].plot();
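
As a small illustration of what first differencing does, the sketch below uses `np.diff` as a stand-in for the statsmodels `diff` helper. The synthetic trending series is an assumption made up for this example.

```python
import numpy as np

# Illustrative series: a linear trend (slope 2.0) plus a small repeating wiggle
t = np.arange(100)
series = 5.0 + 2.0 * t + np.sin(t)

# First difference: y_t - y_{t-1}; the result is one element shorter than the input
diff1 = np.diff(series, n=1)

print("original start:   ", np.round(series[:4], 2))
print("differenced start:", np.round(diff1[:4], 2))
print("mean of differenced series:", round(diff1.mean(), 2))
```

The upward trend turns into a roughly constant level (the slope), which is why differencing is the usual first step toward stationarity. When the differenced values are stored back in a DataFrame column, the first row has no predecessor and is left as `NaN`, which is why the next cell calls `.dropna()`.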

In [23]:
# # plot pacf with diff k = 1
# title = 'PACF: starliner Passengers Diff k=1'
# lags = 40
# plot_pacf(starliner['diff1'].dropna(), title=title, lags=np.arange(lags));

### Month and Quarter Plot

**`month_plot`**

* **Purpose:** To visualize monthly seasonality in time series data.
* **How it works:** It groups your time series data by calendar month and draws a seasonal subseries plot: for each month, the values from successive years are plotted as a short line, with a horizontal line marking that month's mean.
* **What it shows:**
    * **Monthly patterns:**  Helps identify if there are consistent differences in the behavior of your time series across different months.
    * **Outliers:**  Can reveal unusual values for specific months.
    * **Seasonality:** If the monthly means show a clear pattern across the months (e.g., higher values in summer months for ice cream sales), it suggests seasonality.

**`quarter_plot`**

* **Purpose:**  Similar to `month_plot`, but it visualizes quarterly seasonality.
* **How it works:** Groups your time series data by quarter and draws the same kind of seasonal subseries plot, with one short line and one mean line per quarter.
* **What it shows:**
    * **Quarterly patterns:**  Helps identify if there are consistent differences in the behavior of your time series across different quarters of the year.
    * **Outliers:** Can reveal unusual values for specific quarters.
    * **Seasonality:**  If the quarterly means show a clear pattern across the quarters, it suggests quarterly seasonality.

**Important Notes:**

* **Data requirements:** Your time series data needs to have a datetime index (e.g., a `pandas` `DatetimeIndex`) with a monthly frequency for `month_plot` and a quarterly frequency for `quarter_plot`.
* **Aggregation:** If you have daily or weekly data, aggregate it to monthly or quarterly values (for example, averages) before using these plots.
* **Interpretation:** These plots provide a visual indication of potential seasonality. You can combine them with other tools like ACF and PACF to confirm and further analyze the seasonality.

**In essence:**

`month_plot` and `quarter_plot` are quick and easy ways to visually explore potential monthly or quarterly seasonality in your time series data. They provide valuable insights into how your data behaves across different time periods within a year.
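
The grouping idea behind these plots can be sketched without any plotting. The data below are synthetic (four years of monthly values with a spring peak and a slow upward trend), and the variable names are illustrative only.

```python
import numpy as np

# Illustrative monthly data: 4 years, a repeating yearly cycle plus a slow upward trend
months = np.arange(48)
values = 100 + 0.5 * months + 10 * np.sin(2 * np.pi * (months % 12) / 12)

# Rows are years, columns are calendar months (January = column 0)
by_year = values.reshape(4, 12)

# One mean per calendar month: a consistent pattern across columns suggests seasonality
monthly_means = by_year.mean(axis=0)
print(np.round(monthly_means, 1))
print("highest month index:", monthly_means.argmax(), "lowest month index:", monthly_means.argmin())
```

Each column of `by_year` corresponds to one short line in a month plot, and each entry of `monthly_means` corresponds to the horizontal mean line for that month.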


In [24]:
# # monthly
# from statsmodels.graphics.tsaplots import month_plot

# # starliner.index.freq = 'D'
# month_plot(starliner['Passengers']);

In [25]:
# # quarterly
# from statsmodels.graphics.tsaplots import quarter_plot

# starlinerq = starliner.resample(rule='Q').mean()
# quarter_plot(starlinerq['Passengers']);

### ARIMA

* AutoRegressive Integrated Moving Average
* AutoRegressive (p)
* Integrated (d - degree of differencing)
* Moving Average (q)
* The order is written as (p, d, q)
* Works well with univariate data, less well with feature-rich data (like stocks)

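AR, MA, ARMA, and ARIMA

* PACF helps choose the order of the AR part
* A sharp drop-off in the PACF after lag k suggests an AR(k) model
* ACF helps choose the order of the MA part
* A gradual decline in the PACF (with a sharp cutoff in the ACF) suggests an MA model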

ARIMA stands for **Autoregressive Integrated Moving Average**. It's a powerful statistical model used for analyzing and forecasting time series data.

Here's a breakdown of what each part means:

* **AR (Autoregressive):**  This part of the model uses past values of the time series to predict future values. It's like saying, "Today's value depends on what happened yesterday (or the day before, etc.)." The 'p' in ARIMA(p,d,q) represents the order of the autoregressive part, meaning how many past values are considered.
* **I (Integrated):** This part deals with making the time series stationary.  Stationarity means that the statistical properties of the series (like mean and variance) don't change over time. This is often achieved by differencing the data (taking the difference between consecutive observations). The 'd' in ARIMA(p,d,q) represents the degree of differencing.
* **MA (Moving Average):** This part of the model uses past forecast errors to predict future values. It's like saying, "If we overpredicted yesterday, we should adjust our prediction for today." The 'q' in ARIMA(p,d,q) represents the order of the moving average part.
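
Written as an equation, the three parts combine as follows. The symbols are assumptions introduced here for illustration: $y'_t$ is the series after differencing $d$ times, $c$ is a constant, $\phi_i$ are the AR coefficients, $\theta_j$ are the MA coefficients, and $\varepsilon_t$ is the error at time $t$.

$$
y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t
$$

For example, with $d = 1$ the differenced series is $y'_t = y_t - y_{t-1}$, and ARIMA(1,1,1) reduces to $y'_t = c + \phi_1 y'_{t-1} + \theta_1 \varepsilon_{t-1} + \varepsilon_t$.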

**In simpler terms:**

Imagine you're trying to predict the daily temperature.

* **AR:** You might consider the temperatures of the past few days.
* **I:** You might look at the difference in temperature between yesterday and today, rather than the actual temperatures.
* **MA:** You might consider how much you over- or under-predicted the temperature yesterday and adjust your prediction for today accordingly.

**Why is ARIMA useful?**

* **Versatile:** It can handle a wide range of non-seasonal time series patterns, including trends (through differencing) and autocorrelated fluctuations. Seasonal patterns call for the SARIMA extension.
* **Forecasting:**  It's a popular method for making short-term forecasts.
* **Well-established:** It has a strong theoretical foundation and has been widely used in various fields.

**How to use ARIMA:**

1. **Identify stationarity:** Check if your time series is stationary (using plots, tests like the Augmented Dickey-Fuller test).
2. **Differencing:** If it's not stationary, apply differencing to make it stationary.
3. **Identify p, d, and q:**  `d` is the number of differences applied in step 2. Use the PACF and ACF plots of the differenced series to help choose p and q.
4. **Estimate parameters:**  Fit the ARIMA model to your data to estimate the model parameters.
5. **Evaluate the model:** Check the model's fit and make adjustments if needed.
6. **Forecast:** Use the fitted model to make forecasts.
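
The steps above can be sketched by hand for the simplest case, an ARIMA(1,1,0) model with no constant. This is a rough illustration under stated assumptions: the data are simulated, and the coefficient is estimated by ordinary least squares rather than the maximum-likelihood fit that statsmodels performs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative non-stationary series: the running total of an AR(1) process
n = 500
steps = np.zeros(n)
for t in range(1, n):
    steps[t] = 0.5 * steps[t - 1] + rng.normal()
series = 100 + np.cumsum(steps)

# Step 2: difference once (d = 1)
d1 = np.diff(series)

# Step 4: estimate the AR(1) coefficient on the differenced series by least squares
x, target = d1[:-1], d1[1:]
phi = (x @ target) / (x @ x)

# Step 6: forecast the next 5 differences, then undo the differencing with a running sum
horizon = 5
next_diffs = []
last = d1[-1]
for _ in range(horizon):
    last = phi * last
    next_diffs.append(last)
forecast = series[-1] + np.cumsum(next_diffs)

print(f"estimated phi: {phi:.2f}")
print("forecast:", np.round(forecast, 2))
```

The forecast is built on the differenced scale and then added back onto the last observed value, which is the "integrated" part of the model working in reverse.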

**ARIMA is a powerful tool, but it does have some limitations:**

* **Linearity:**  It assumes a linear relationship between past and future values.
* **Stationarity:** Requires the time series to be stationary or transformed to become stationary.
* **Model selection:**  Choosing the right values for p, d, and q can be challenging.


In [26]:
# !pip install pmdarima

In [27]:
# from pmdarima import auto_arima

# help(auto_arima)

In [28]:
# auto_arima(births['Births'],
#            error_action='ignore',
#            seasonal=False,
#            stationary=True).summary()

In [29]:
# # train and forecast
# import pandas as pd
# import numpy as np
# from statsmodels.tsa.arima.model import ARIMA

# train = births.iloc[:90]
# test = births.iloc[90:]

# model = ARIMA(train['Births'],order=(1,0,1))
# results = model.fit()
# results.summary()

In [30]:
# # make predictions
# predictions = results.predict(start=len(train), end=len(births)-1)
# test['Births'].plot(legend=True,figsize=(12,5),label='test').autoscale(axis='x',tight=True)
# predictions.plot(legend=True, label='predictions');

The forecast settles toward the mean of the series. This makes sense because our dataset has no trend or seasonal component.

### SARIMA(X)

* ARIMA with seasonality
* Uses seasonal orders (P, D, Q) as well as non-seasonal orders (p, d, q)
* X refers to support for exogenous (independent) variables

**SARIMA (Seasonal ARIMA)**

* **What it is:**  An extension of ARIMA that explicitly includes seasonal components.
* **Why it's needed:** ARIMA struggles with time series data that exhibits strong seasonal patterns (e.g., peaks and troughs that repeat at regular intervals like monthly or yearly).
* **How it works:**  It adds three seasonal orders (P, D, Q) and a seasonal period (m) to the ARIMA model to capture the seasonal autoregressive (SAR), seasonal integrated (SI), and seasonal moving average (SMA) components. These components operate at the seasonal level (e.g., with a lag of 12 for monthly data).
* **Parameters:** SARIMA(p, d, q)(P, D, Q)m
    * `p, d, q`:  Non-seasonal ARIMA parameters (order of autoregression, differencing, and moving average).
    * `P, D, Q`: Seasonal ARIMA parameters.
    * `m`: The number of observations in one seasonal cycle (e.g., 12 for monthly data with a yearly cycle, 4 for quarterly data).
* **Example:** SARIMA(1, 1, 1)(1, 1, 1)12 would model a monthly time series with both non-seasonal and yearly seasonal patterns.

**SARIMAX (Seasonal ARIMA with eXogenous regressors)**

* **What it is:** An extension of SARIMA that allows you to include external variables (exogenous variables) in your model.
* **Why it's needed:** Sometimes, factors outside of the time series itself can influence its behavior. SARIMAX lets you incorporate these factors.
* **How it works:** It adds an 'X' component to SARIMA, which represents the exogenous variables. These variables can be anything that might affect your time series, such as economic indicators, weather patterns, or promotional events.
* **Example:** If you're modeling sales data, you could include advertising spending or competitor prices as exogenous variables in a SARIMAX model.

**In Summary**

* **ARIMA:**  A basic model for time series with non-seasonal patterns.
* **SARIMA:**  Extends ARIMA to handle time series with seasonal patterns.
* **SARIMAX:** Extends SARIMA to include external variables that might influence the time series.

**When to use which:**

* **ARIMA:** When your time series doesn't have a clear seasonal pattern.
* **SARIMA:** When your time series has a seasonal pattern.
* **SARIMAX:** When your time series has a seasonal pattern and you have external variables that might influence it.

**Key takeaway:**

SARIMA and SARIMAX are more complex models than ARIMA, but they can be very powerful for analyzing and forecasting time series data with seasonal patterns and external influences. They provide a more comprehensive approach to understanding the dynamics of your time series.


### Mind your Ps and Qs

**1. Non-Seasonal Parameters (`p`, `d`, `q`)**

These parameters define the non-seasonal part of the ARIMA model, dealing with patterns that don't repeat at regular intervals.

* **`p` (Order of Autoregression):**
    * The number of past values used to predict the current value.
    * Example: `p = 2` means the model uses the values from the two previous time steps to predict the current value.
    * **How to find it:** Look for significant lags in the PACF plot.

* **`d` (Degree of Differencing):**
    * The number of times the data needs to be differenced to become stationary.
    * Differencing involves taking the difference between consecutive observations to remove trends and stabilize the mean.
    * Example: `d = 1` means the model uses the first difference of the time series.
    * **How to find it:** Use tests like the Augmented Dickey-Fuller (ADF) test and visual inspection of time series plots.

* **`q` (Order of Moving Average):**
    * The number of past forecast errors used to predict the current value.
    * Example: `q = 1` means the model uses the forecast error from one time step ago to predict the current value.
    * **How to find it:** Look for significant lags in the ACF plot.
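
To see why the ACF is the tool for choosing `q`, here is a minimal sketch that simulates an MA(1) process and computes its sample autocorrelations with NumPy. The coefficient 0.8 and the helper `autocorr` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate an MA(1) process: y_t = e_t + 0.8 * e_{t-1} (illustrative data only)
n = 5000
e = rng.normal(size=n + 1)
y = e[1:] + 0.8 * e[:-1]

def autocorr(series, lag):
    """Sample autocorrelation at a given lag (lag >= 1)."""
    centered = series - series.mean()
    return (centered[:-lag] @ centered[lag:]) / (centered @ centered)

acf_values = [autocorr(y, k) for k in range(1, 5)]
print(np.round(acf_values, 3))
```

Only lag 1 stands out (the theoretical value is 0.8 / (1 + 0.8²), about 0.49), and the later lags hover near zero. A cutoff after lag 1 in the ACF is the pattern that suggests `q = 1`.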

**2. Seasonal Parameters (`P`, `D`, `Q`)**

These parameters are specific to SARIMA and SARIMAX models and define the seasonal part of the model, dealing with patterns that repeat at regular intervals.

* **`P` (Seasonal Autoregressive Order):**
    * The number of seasonal lags used in the model.
    * Example: `P = 1` with `m = 12` (monthly data) means the model uses the value from 12 time steps ago (same month last year) to predict the current value.
    * **How to find it:** Look for significant lags at multiples of the seasonal period in the PACF plot.

* **`D` (Seasonal Differencing Order):**
    * The number of times the data needs to be differenced at the seasonal level to become stationary.
    * Example: `D = 1` with `m = 12` means taking the difference between the current value and the value from the same month in the previous year.
    * **How to find it:** Use tests like the ADF test and visual inspection of time series plots, focusing on seasonal patterns.

* **`Q` (Seasonal Moving Average Order):**
    * The number of seasonal moving average terms used in the model.
    * Example: `Q = 1` with `m = 12` means the model uses the forecast error from 12 time steps ago (same month last year) to predict the current value.
    * **How to find it:** Look for significant lags at multiples of the seasonal period in the ACF plot.
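
Seasonal differencing (`D = 1`) is easy to check by hand. The sketch below uses a made-up monthly series: a fixed 12-month pattern repeated five times on top of a linear trend.

```python
import numpy as np

# Illustrative monthly series: a repeating 12-month pattern plus a linear trend
m = 12
t = np.arange(5 * m)
pattern = np.array([3, 1, 0, 2, 5, 9, 12, 11, 7, 4, 2, 3], dtype=float)
series = 50 + 0.5 * t + np.tile(pattern, 5)

# Seasonal difference (D = 1): y_t - y_{t-m}, i.e. this month minus the same month last year
seasonal_diff = series[m:] - series[:-m]

print(np.unique(np.round(seasonal_diff, 6)))
```

The repeating pattern cancels out completely, and what remains is the trend's change over one year (0.5 per month times 12 months). Real data leave noise behind as well, but the seasonal swing is removed in the same way.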

**3. `m` (Seasonal Period)**

* This parameter defines the number of observations in one seasonal cycle.
* Example:
    * `m = 12` for monthly data (12 months in a year)
    * `m = 4` for quarterly data (4 quarters in a year)

**Finding the Right Values**

Finding the optimal values for these parameters often involves a combination of:

* **ACF and PACF plots:** To identify potential autocorrelations and partial autocorrelations.
* **Information criteria:**  Like AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) to compare different models.
* **Trial and error:**  Experimenting with different parameter combinations and evaluating the model's performance.

It's important to note that finding the perfect values can be challenging and may require some iteration and fine-tuning.
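
As a rough sketch of how an information criterion compares candidate models, the code below fits AR(p) models by least squares for several values of p and computes a Gaussian AIC for each, up to an additive constant. The simulated AR(2) data are an assumption for illustration. Tools such as `auto_arima` search a larger model space and use a full likelihood.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate an AR(2) series: y_t = 0.5 * y_{t-1} + 0.3 * y_{t-2} + noise
n = 2000
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + rng.normal()

max_p = 5
target = y[max_p:]  # same target sample for every candidate so the AICs are comparable

aic = {}
for p in range(1, max_p + 1):
    # Column for lag j is the series shifted back by j steps, aligned with target
    X = np.column_stack([y[max_p - j : n - j] for j in range(1, p + 1)])
    coefs, *_ = np.linalg.lstsq(X, target, rcond=None)
    rss = np.sum((target - X @ coefs) ** 2)
    # Gaussian AIC up to a constant: n * ln(RSS / n) + 2 * (number of coefficients)
    aic[p] = len(target) * np.log(rss / len(target)) + 2 * p

best_p = min(aic, key=aic.get)
print({p: round(v, 1) for p, v in aic.items()})
print("lowest AIC at p =", best_p)
```

Lower AIC is better. Going from p = 1 to p = 2 gives a large improvement because the second lag is real, while further lags reduce the error only slightly and are offset by the penalty term.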


In [31]:
# auto_arima(starliner['Passengers'],
#                       m=12,
#                       trace=True,
#                       error_action='ignore',
#                       seasonal=True, # default
#                       suppress_warnings=True,
#                       stationary=False).summary()

In [32]:
# stepwise = auto_arima(starliner['Passengers'],
#                       m=12,
#                       trace=True,
#                       error_action='ignore',
#                       seasonal=True, # default
#                       suppress_warnings=True,
#                       stepwise=True)

# stepwise.summary()

In [33]:
# # uses PDQ as well as pdq
# import statsmodels.api as sm

# train = starliner.iloc[:116]
# test = starliner.iloc[116:]

# model = sm.tsa.statespace.SARIMAX(train['Passengers'],
#                                   order=(2,1,1),
#                                   seasonal_order=(0,1,0,12))
# results = model.fit()
# results.summary()

In [34]:
# # make predictions
# predictions = results.predict(start=len(train), end=len(starliner)-1)
# test['Passengers'].plot(legend=True,figsize=(12,5),label='test').autoscale(axis='x',tight=True)
# predictions.plot(legend=True, label='predictions');

## Other Topics

* Vector AutoRegression (Moving Average) VAR(MA)
* Prophet Library (Facebook)

## Tests

* Dickey-Fuller for stationarity
* Granger Causality for one time series forecasting another
* Seasonality

In [35]:
# # Stationarity Test with Dickey-Fuller
# from statsmodels.tsa.stattools import adfuller

# dftest = adfuller(starliner['Passengers'],autolag='AIC')
# dfout = pd.Series(dftest[0:4],index=['ADF test statistic','p-value','lags used','observations'])

# for key, val in dftest[4].items():
#     dfout[f'critical value ({key})']=val

# print(dfout)

The p-value is 0.99, so we fail to reject the null hypothesis that the series has a unit root. There is no evidence that the starliner series is stationary, so we treat it as non-stationary.

In [36]:
# # Stationarity Test with Dickey-Fuller
# from statsmodels.tsa.stattools import adfuller

# dftest = adfuller(births['Births'],autolag='AIC')
# dfout = pd.Series(dftest[0:4],index=['ADF test statistic','p-value','lags used','observations'])

# for key, val in dftest[4].items():
#     dfout[f'critical value ({key})']=val

# print(dfout)

The p-value is 0.000052, which is strong evidence against the null hypothesis of a unit root. We reject it and treat the births series as stationary.
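
To show what the test statistic measures, here is a minimal sketch of the basic (non-augmented) Dickey-Fuller regression with a constant, written in NumPy. The helper `dickey_fuller_t` and the simulated series are hypothetical. `adfuller` additionally includes lagged differences and converts the statistic into a p-value.

```python
import numpy as np

def dickey_fuller_t(y):
    """t-statistic on gamma in the regression  diff(y)_t = alpha + gamma * y_{t-1} + error."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    coefs, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ coefs
    sigma2 = resid @ resid / (len(dy) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coefs[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(4)
noise = rng.normal(size=500)                     # stationary white noise
random_walk = np.cumsum(rng.normal(size=500))    # unit root, not stationary

t_noise = dickey_fuller_t(noise)
t_walk = dickey_fuller_t(random_walk)
print(f"white noise: {t_noise:.2f}, random walk: {t_walk:.2f}")
```

The statistic is compared with Dickey-Fuller critical values rather than the usual t table. For this constant-only form the large-sample 5% critical value is roughly -2.86. A statistic far below that, as for the white noise, rejects the unit root, while a random walk usually lands above it.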

In [37]:
# import statsmodels.api as sm
# from statsmodels.tsa.stattools import grangercausalitytests
# import numpy as np

# data = sm.datasets.macrodata.load_pandas()
# data = data.data[['realgdp', 'realcons', 'realinv', 'realgovt']].pct_change().dropna()
# print(sm.datasets.macrodata.NOTE)

In [38]:
# data.plot(figsize=(12,5));

In [39]:
# from statsmodels.tsa.stattools import grangercausalitytests

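# # null hypothesis: the second column (realgovt) does NOT Granger-cause the first column (realgdp)
# grangercausalitytests(data[['realgdp','realgovt']],maxlag=3);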

In [40]:
# from statsmodels.tsa.stattools import grangercausalitytests

# grangercausalitytests(data[['realinv','realcons']],maxlag=3);
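
The idea behind the Granger test can be sketched as a comparison of two regressions: one that predicts a series from its own past, and one that also uses the past of the other series. The helper `granger_f` and the simulated data below are assumptions for illustration. `grangercausalitytests` reports this F-test along with several related tests and their p-values.

```python
import numpy as np

def granger_f(target, candidate, lag=1):
    """F-statistic for whether lagged `candidate` improves an autoregression of `target`."""
    n = len(target)
    y = target[lag:]
    own = np.column_stack([target[lag - j : n - j] for j in range(1, lag + 1)])
    other = np.column_stack([candidate[lag - j : n - j] for j in range(1, lag + 1)])
    const = np.ones((len(y), 1))

    def rss(X):
        coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ coefs) ** 2)

    rss_restricted = rss(np.hstack([const, own]))      # own past only
    rss_full = rss(np.hstack([const, own, other]))     # own past plus the other series' past
    dof = len(y) - (1 + 2 * lag)
    return ((rss_restricted - rss_full) / lag) / (rss_full / dof)

rng = np.random.default_rng(5)

# Illustrative data: x drives y with a one-step delay, but y does not drive x
n = 500
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.normal()

f_x_to_y = granger_f(y, x)
f_y_to_x = granger_f(x, y)
print(f"x -> y: F = {f_x_to_y:.1f}   y -> x: F = {f_y_to_x:.1f}")
```

A large F-statistic means the other series' past adds real predictive power. Here it is large only in the x to y direction, matching how the data were generated. Granger causality is a statement about forecasting, not about cause and effect in the everyday sense.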

## Finance Terms

Notes from Jose Portilla's Python for Financial Analysis and Algorithmic Trading

* Adjusted Close
* Moving Average / Rolling Mean
* Bollinger Bands
* Cumulative Return
* Daily Return
* Cumulative Daily Return
* Time Series
* Error - Trends - Seasonality (ETS) Models
* ETS Decomposition
* Exponentially Weighted Moving Averages (EWMA)
* Auto Regressive Integrated Moving Average (ARIMA)
* ACF - Autocorrelation Function
* PACF - Partial Autocorrelation Function
* Volatility
* Portfolio Allocation
* Sharpe Ratio
* Exchange Traded Funds (ETF)
* Mutual Funds
* Hedge Funds
* High Frequency Trading
* Going Long and Selling Short
* Capital Asset Pricing Model
* Stock Splits and Dividends
* Survivorship Bias
* Efficient Market Hypothesis
* Pairs Trading