In [16]:
import statsmodels.api
import statsmodels.formula.api
import numpy as np
import pandas as pd
import scipy.stats
import matplotlib.pyplot as plt

# Time-series Analysis

## Deterministic vs. Stochastic Processes

1. What is a deterministic process?
2. What is an example of a deterministic process?
3. What is a stochastic process?
4. Are most physical phenomena deterministic or stochastic?

## Time-series patterns

1. **Stationarity**: mean and standard deviation constant over time
2. **Trend**: long-term change in the mean over time
3. **Seasonality**: systematic, periodic variation

![Air Passenger Data time-series example](air_log_transform.PNG)

*Figure 1: Air passenger time-series data and log transform of data.*

![Air Passenger Data time-series example](air_log_trend.png) ![Air Passenger Data time-series example](air_log_seasonality.png)

*Figure 2: Components for trend (left) of log transform of air data and seasonality (right) of log transform of air data.*


![Air Passenger Data time-series example](air_log_residual.png)

*Figure 3: Residual component of log transform of air data.*


## Forecasting time-series data

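Taking the log of the data can help stabilize the standard deviation, because seasonal swings that grow with the level of the series become roughly constant in size. A linear (additive) decomposition, which assumes the residual is stationary, then models the series as: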

$y_t = m_t + s_t + r_t$ where $y_t$ is the value of the time series, $m_t$ is the trend component, $s_t$ is the seasonality component, and $r_t$ is a residual component.

Try using `statsmodels.tsa.seasonal.seasonal_decompose()` with the `https://static-resources.zybooks.com/static/AirPassengers.csv` dataset:

## Error (cost or loss) functions for forecasting

Here $e_i = y_i - \hat{y}_i$ is the error between the observed value $y_i$ and the forecast $\hat{y}_i$, and $n$ is the number of forecasts.

- **Mean squared error (MSE):** $\frac{1}{n} \sum_{i=1}^{n} e_i^2$
- **Mean absolute error (MAE):** $\frac{1}{n} \sum_{i=1}^{n} |e_i|$
- **Mean absolute percentage error (MAPE):** $100\% \cdot \frac{1}{n} \sum_{i=1}^{n} \left|\frac{e_i}{y_i}\right|$
- Many more metrics exist; see the [scikit-learn documentation](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics).

Let's write functions for these using `numpy`:
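One possible set of implementations is sketched below. The function names `mse`, `mae`, and `mape` and the sample values are illustrative choices, and the error is taken as observed minus forecast. Note that MAPE is undefined when any observed value is zero.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: average of the squared errors."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(errors ** 2)


def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute errors."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(errors))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (requires nonzero y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    errors = y_true - np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(errors / y_true))


# Small made-up example: errors are -10, 10, and 0.
y_true = [100, 200, 400]
y_pred = [110, 190, 400]

print(mse(y_true, y_pred))   # 200 / 3, about 66.67
print(mae(y_true, y_pred))   # 20 / 3, about 6.67
print(mape(y_true, y_pred))  # 100 * (0.10 + 0.05 + 0) / 3 = 5.0
```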

# Supervised Machine Learning

## Getting started

- **Machine learning:** Generic term for computer algorithms that build models based on sample data
- **Task:** What the statistical model does (classification vs. regression vs. clustering)
- **Features:** Properties, attributes, or predictors of a dataset
- **Supervised learning:** Fitting an estimator to data with known labels (the "correct" answers)

We will be using the [`scikit-learn`](https://scikit-learn.org) library, which is shortened to `sklearn` in code. Go ahead and use `mamba` to install `scikit-learn`.

## Choosing the machine learning estimator to use

There's a [convenient chart](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) available in scikit-learn's documentation. There's also a [massive list of all of the types of supervised learning](https://scikit-learn.org/stable/supervised_learning.html) that exist in `sklearn`.

## Trying out some regression on the Diabetes dataset

First, load the Diabetes dataset, using `sklearn.datasets.load_diabetes()` as a DataFrame:
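A short sketch of the loading step, using the `as_frame=True` option so that the features and target come back as pandas objects. The variable names `diabetes`, `X`, `y`, and `df` are illustrative.

```python
from sklearn.datasets import load_diabetes

# as_frame=True returns pandas objects instead of NumPy arrays.
diabetes = load_diabetes(as_frame=True)

X = diabetes.data     # DataFrame with the 10 feature columns
y = diabetes.target   # Series with the disease-progression target
df = diabetes.frame   # Features and target together in one DataFrame

print(df.head())
print(df.shape)
```

The feature columns in this dataset are already mean-centered and scaled by scikit-learn, which is worth keeping in mind when interpreting coefficients.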

Next, let's fit ordinary least squares to it, this time using `sklearn.linear_model.LinearRegression()` rather than `statsmodels`:
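A minimal sketch of the fit is shown here. It trains and scores on the same data, so the reported $R^2$ is an in-sample value only; the variable names are illustrative.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Load features (DataFrame) and target (Series) directly.
X, y = load_diabetes(return_X_y=True, as_frame=True)

ols = LinearRegression()
ols.fit(X, y)

# score() returns the coefficient of determination, R^2.
r2 = ols.score(X, y)
print(f"In-sample R^2: {r2:.3f}")

# Fitted intercept and one coefficient per feature.
print(f"Intercept: {ols.intercept_:.2f}")
for name, coef in zip(X.columns, ols.coef_):
    print(f"{name:>4}: {coef:10.2f}")
```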

Now, we can try another model, such as `Ridge`, `Lasso`, or `ElasticNet`:
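The three regularized linear models share the same `fit`/`score` interface, so they can be compared in a loop. In this sketch the `alpha` and `l1_ratio` values are arbitrary assumptions chosen for illustration, not tuned settings, and the scores are again in-sample.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = load_diabetes(return_X_y=True, as_frame=True)

# alpha controls regularization strength; the values here are illustrative.
models = {
    "Ridge": Ridge(alpha=0.1),                           # L2 penalty
    "Lasso": Lasso(alpha=0.1),                           # L1 penalty
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),   # mix of L1 and L2
}

scores = {}
for name, model in models.items():
    model.fit(X, y)
    scores[name] = model.score(X, y)  # in-sample R^2
    print(f"{name:>10}: R^2 = {scores[name]:.3f}")
```

Because the features in this dataset are scaled to small magnitudes, the penalty strength has a large effect on the fit, so it is worth experimenting with several `alpha` values.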

Let's try something different -- the Support Vector Machine (SVM), `sklearn.svm.SVR()`:
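A sketch using support vector regression follows. The hyperparameters shown are simply the library defaults written out explicitly, and the score is in-sample.

```python
from sklearn.datasets import load_diabetes
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True, as_frame=True)

# RBF kernel with default regularization (C) and tube width (epsilon).
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)
svr.fit(X, y)

svr_r2 = svr.score(X, y)  # in-sample R^2
print(f"SVR in-sample R^2: {svr_r2:.3f}")

predictions = svr.predict(X)
print(predictions[:5])
```

SVR is sensitive to the scale of the target and to `C`, so the default settings may fit this dataset poorly; trying larger values of `C` is a reasonable next experiment.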

## What's the catch?

How do we know whether a model is overfitting, underfitting, etc.?
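One common way to start answering this is to hold out part of the data and compare the score on the training portion with the score on the unseen portion. The sketch below assumes a 75/25 split with a fixed `random_state`; both choices are illustrative.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print(f"Train R^2: {train_r2:.3f}")
print(f"Test  R^2: {test_r2:.3f}")
```

A training score far above the test score suggests overfitting, while low scores on both suggest underfitting.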