In [1]:
import statsmodels.api
import statsmodels.formula.api
import numpy as np
import pandas as pd
import scipy.stats
import matplotlib.pyplot as plt
import sklearn

# Time-series Analysis

## Deterministic vs. Stochastic Processes

1. **What is a deterministic process?**: Future values are fully determined by previous values or known inputs, with no randomness
2. **What is an example of a deterministic process?**: A pure sine wave, $y_t = \sin(\omega t)$
3. **What is a stochastic process?**: Values include a random component, so they cannot be predicted exactly even when the previous steps are known
4. **Are most physical phenomena deterministic or stochastic?**: Stochastic
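
To make the distinction concrete, here is a minimal sketch using `numpy`. The function names `deterministic_series` and `random_walk` are made up for illustration, and the sine wave and Gaussian random walk are just two convenient examples, not the only possible choices.

```python
import numpy as np


def deterministic_series(n_steps, amplitude=1.0, period=12):
    """Sine wave: every value is fixed once the parameters are known."""
    t = np.arange(n_steps)
    return amplitude * np.sin(2 * np.pi * t / period)


def random_walk(n_steps, seed=None):
    """Random walk: each value is the previous value plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    steps = rng.normal(loc=0.0, scale=1.0, size=n_steps)
    return np.cumsum(steps)


# Two runs of the deterministic process are identical.
det_a = deterministic_series(48)
det_b = deterministic_series(48)

# Two runs of the stochastic process (different seeds) are not.
walk_a = random_walk(48, seed=0)
walk_b = random_walk(48, seed=1)

print("deterministic runs match:", np.array_equal(det_a, det_b))
print("stochastic runs match:", np.array_equal(walk_a, walk_b))
```

Note that the random walk still depends on its previous value; what makes it stochastic is the random step added each time.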

## Time-series patterns

1. **Stationarity**: mean and variance are constant over time
2. **Trend**: long-term change in the mean over time
3. **Seasonality**: systematic, periodic variation

![Air Passenger Data time-series example](air_log_transform.PNG)

*Figure 1: Air passenger time-series data and its log transform.*

![Air Passenger Data time-series example](air_log_trend.png) ![Air Passenger Data time-series example](air_log_seasonality.png)

*Figure 2: Trend (left) and seasonal (right) components of the log-transformed air passenger data.*


![Air Passenger Data time-series example](air_log_residual.png)

*Figure 3: Residual component of the log-transformed air passenger data.*


## Forecasting time-series data

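Taking the log of the data can help stabilize the standard deviation, since seasonal swings that grow with the level of the series become roughly constant in size. A linear (additive) decomposition then splits the series into components, with the residual assumed to be stationary: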

$y_t = m_t + s_t + r_t$ where $y_t$ is the value of the time series, $m_t$ is the trend component, $s_t$ is the seasonality component, and $r_t$ is a residual component.

Try using `statsmodels.tsa.seasonal.seasonal_decompose()` with the `https://static-resources.zybooks.com/static/AirPassengers.csv` dataset:

## Error (cost or loss) functions for forecasting

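In these formulas, $e_i = y_i - \hat{y}_i$ is the error between the observed value $y_i$ and the forecast $\hat{y}_i$, and $n$ is the number of forecasts.

- **Mean squared error (MSE):** $\frac{1}{n} \sum_{i=1}^{n} e_i^2$
- **Mean absolute error (MAE):** $\frac{1}{n} \sum_{i=1}^{n} |e_i|$
- **Mean absolute percentage error (MAPE):** $100\% \cdot \frac{1}{n} \sum_{i=1}^{n} \left|\frac{e_i}{y_i}\right|$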
- Even more metrics! See the [scikit-learn documentation](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics) for more.

Let's write functions for these using `numpy`:
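
One possible version is sketched below. The short names `mse`, `mae`, and `mape` and the toy numbers are chosen for illustration only; the error is taken as observed minus forecast. Note that MAPE is undefined whenever an observed value is zero.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(errors ** 2)


def mae(y_true, y_pred):
    """Mean absolute error."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(errors))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be nonzero)."""
    y_true = np.asarray(y_true, dtype=float)
    errors = y_true - np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(errors / y_true))


# Toy values: the errors are -10, 10, and 0.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 400.0]

print(f"MSE:  {mse(y_true, y_pred):.3f}")
print(f"MAE:  {mae(y_true, y_pred):.3f}")
print(f"MAPE: {mape(y_true, y_pred):.3f}%")
```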

# Supervised Machine Learning

## Getting started

- **Machine learning:** Generic term for computer algorithms that build models based on sample data
- **Task:** What the statistical model does (classification vs. regression vs. clustering)
- **Features:** Properties, attributes, or predictors of a dataset
- **Supervised learning:** Fitting an estimator to data whose labels (the "correct" answers) are already known

We will be using the [`scikit-learn`](https://scikit-learn.org) library, which is shortened to `sklearn` in code. Go ahead and use `mamba` to install `scikit-learn`.

## Choosing the machine learning estimator to use

There's a [convenient chart](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) available in scikit-learn's documentation. There's also a [massive list of all of the types of supervised learning](https://scikit-learn.org/stable/supervised_learning.html) that exist in `sklearn`.

## Trying out some regression on the Diabetes dataset

First, load the Diabetes dataset, using `sklearn.datasets.load_diabetes()` as a DataFrame:

In [6]:
import sklearn.datasets

bunch_diabetes = sklearn.datasets.load_diabetes(as_frame=True)
# print(bunch_diabetes['DESCR'])
df_diabetes_data = bunch_diabetes['data']
df_diabetes_target = bunch_diabetes['target']
print(f"Data:\n{df_diabetes_data.head()}")
print(f"Target:\n{df_diabetes_target.head()}")

Data:
        age       sex       bmi        bp        s1        s2        s3  \
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412   
2  0.085299  0.050680  0.044451 -0.005671 -0.045599 -0.034194 -0.032356   
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038   
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142   

         s4        s5        s6  
0 -0.002592  0.019908 -0.017646  
1 -0.039493 -0.068330 -0.092204  
2 -0.002592  0.002864 -0.025930  
3  0.034309  0.022692 -0.009362  
4 -0.002592 -0.031991 -0.046641  
Target:
0    151.0
1     75.0
2    141.0
3    206.0
4    135.0
Name: target, dtype: float64


Next, let's fit ordinary least squares to it, this time using `sklearn.linear_model.LinearRegression()` rather than `statsmodels`:

In [14]:
import sklearn.linear_model

# Fit ordinary least squares on all of the features
regress_diabetes_model = sklearn.linear_model.LinearRegression()
regress_diabetes_model.fit(df_diabetes_data, df_diabetes_target)
# score() returns R^2; here it is evaluated on the same data used for fitting
print(f"fit score: {regress_diabetes_model.score(df_diabetes_data, df_diabetes_target)}")
# Coefficients are listed in the same order as the feature names
print(f"coef names: {regress_diabetes_model.feature_names_in_}")
print(f"coefficients: {regress_diabetes_model.coef_}")
print(f"intercept: {regress_diabetes_model.intercept_}")

fit score: 0.5177494254132934
coef names: ['age' 'sex' 'bmi' 'bp' 's1' 's2' 's3' 's4' 's5' 's6']
coefficients: [ -10.01219782 -239.81908937  519.83978679  324.39042769 -792.18416163
  476.74583782  101.04457032  177.06417623  751.27932109   67.62538639]
intercept: 152.1334841628965


Now, we can try another model, such as `Ridge`, `Lasso`, or `ElasticNet`:
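
As a rough sketch, the three regularized models can be fit side by side with plain least squares for comparison. The `alpha` and `l1_ratio` values are arbitrary assumptions rather than tuned choices, and the scores are still $R^2$ on the training data.

```python
import sklearn.datasets
import sklearn.linear_model

X, y = sklearn.datasets.load_diabetes(return_X_y=True, as_frame=True)

# Regularization strengths below are illustrative guesses, not tuned values.
models = {
    "OLS": sklearn.linear_model.LinearRegression(),
    "Ridge": sklearn.linear_model.Ridge(alpha=0.1),
    "Lasso": sklearn.linear_model.Lasso(alpha=0.1),
    "ElasticNet": sklearn.linear_model.ElasticNet(alpha=0.01, l1_ratio=0.5),
}

scores = {}
for name, model in models.items():
    model.fit(X, y)
    scores[name] = model.score(X, y)  # R^2 on the training data
    n_zero = int((model.coef_ == 0).sum())  # Lasso-type penalties can zero out coefficients
    print(f"{name:>10}: score = {scores[name]:.4f}, zeroed coefficients = {n_zero}")
```

On the training data, none of the penalized models can score above ordinary least squares, because least squares already minimizes the squared error there. Any benefit of regularization would show up on data the model has not seen.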

Let's try something different -- the Support Vector Machine (SVM), `sklearn.svm.SVR()`:
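
A minimal sketch with support vector regression is below. The kernel, `C`, and `epsilon` values are assumptions picked only to have something to run; `SVR` is sensitive to these settings and to the scale of the target, so treat the resulting score as a starting point.

```python
import sklearn.datasets
import sklearn.svm

X, y = sklearn.datasets.load_diabetes(return_X_y=True, as_frame=True)

# Hyperparameters are illustrative guesses, not tuned values.
svr_model = sklearn.svm.SVR(kernel="rbf", C=100.0, epsilon=1.0)
svr_model.fit(X, y)

svr_score = svr_model.score(X, y)     # R^2 on the training data
svr_pred = svr_model.predict(X.head())  # predictions for the first five rows

print(f"fit score: {svr_score}")
print(f"number of support vectors: {len(svr_model.support_)}")
print(f"first predictions: {svr_pred}")
```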

## What's the catch?

How do we know whether a model is overfitting or underfitting? Every score above was computed on the same data that was used to fit the model.
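
One common way to start answering this is to hold out part of the data and compare scores, sketched below with `sklearn.model_selection.train_test_split()`. The 25% test size and `random_state=0` are arbitrary assumptions, and a single split is only a rough check.

```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection

X, y = sklearn.datasets.load_diabetes(return_X_y=True, as_frame=True)

# Hold out a quarter of the rows; the split fraction and seed are arbitrary.
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = sklearn.linear_model.LinearRegression()
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)  # R^2 on data the model has seen
test_score = model.score(X_test, y_test)     # R^2 on held-out data

print(f"train score: {train_score:.4f}")
print(f"test score:  {test_score:.4f}")
```

A training score far above the test score suggests overfitting, while low scores on both suggest underfitting.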