In [None]:
import statsmodels.api
import statsmodels.formula.api
import numpy as np
import pandas as pd
import scipy.stats
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn.datasets
import sklearn.svm
import sklearn.linear_model
import sklearn.model_selection

# Time-series Analysis

## Deterministic vs. Stochastic Processes

1. **What is a deterministic process?**: Future values are completely determined by previous values or known inputs; there is no randomness
2. **What is an example of a deterministic process?**:
3. **What is a stochastic process?**: Values include a random component, so they can only be described probabilistically (they may still depend on previous steps)
4. **Are most physical phenomena deterministic or stochastic?**: Stochastic
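
To make the distinction concrete, here is a small sketch using `numpy`. The function names, the 12-step period, and the noise level are illustrative assumptions, not part of the lecture material. Regenerating the deterministic series always gives the same values, while each realization of the stochastic series differs.

```python
import numpy as np


def deterministic_series(n_steps: int) -> np.ndarray:
    """Sine wave with a 12-step period: fixed entirely by the time index."""
    t = np.arange(n_steps)
    return np.sin(2 * np.pi * t / 12)


def stochastic_series(n_steps: int, seed: int) -> np.ndarray:
    """The same sine wave plus Gaussian noise: one random realization per seed."""
    rng = np.random.default_rng(seed)
    return deterministic_series(n_steps) + rng.normal(0.0, 0.3, size=n_steps)


# Regenerating the deterministic series reproduces it exactly.
print(np.array_equal(deterministic_series(48), deterministic_series(48)))  # True
# Two realizations of the stochastic series differ.
print(np.array_equal(stochastic_series(48, seed=1), stochastic_series(48, seed=2)))  # False
```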

## Time-series patterns

1. **Stationarity**: statistical properties such as the mean and standard deviation stay constant over time
2. **Trend**: changes in mean over time
3. **Seasonality**: Systematic, periodic variation

![Air Passenger Data time-series example](air_log_transform.PNG)

*Figure 1: Air passenger time-series data and log transform of data.*

![Air Passenger Data time-series example](air_log_trend.png) ![Air Passenger Data time-series example](air_log_seasonality.png)

*Figure 2: Components for trend (left) of log transform of air data and seasonality (right) of log transform of air data.*


![Air Passenger Data time-series example](air_log_residual.png)

*Figure 3: Residual component of log transform of air data.*


## Forecasting time-series data

Taking the log of the data can help stabilize the standard deviation. Assuming the remaining residual is stationary, an additive (linear) decomposition writes the series as:

$y_t = m_t + s_t + r_t$ where $y_t$ is the value of the time series, $m_t$ is the trend component, $s_t$ is the seasonality component, and $r_t$ is a residual component.

Try using `statsmodels.tsa.seasonal.seasonal_decompose()` with the `https://static-resources.zybooks.com/static/AirPassengers.csv` dataset:

## Error (cost or loss) functions for forecasting

- $e_i = y_{true,i} - y_{pred,i}$
- **Mean squared error (MSE):** $\frac{1}{n} \sum_{i=1}^{n} e_i^2$
- **Mean absolute error (MAE):** $\frac{1}{n} \sum_{i=1}^{n} |e_i|$
- **Mean absolute percentage error (MAPE):** $100\% \cdot \frac{1}{n} \sum_{i=1}^{n} \left|\frac{e_i}{y_{true,i}}\right|$
- Even more metrics! See the [scikit-learn documentation](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics) for more.

Let's write functions for these using `numpy`:

In [None]:
def calc_mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Returns the mean squared error given y_true and y_pred.
    """
    # np.mean sums the squared errors and divides by the number of elements
    return float(np.mean(np.square(y_true - y_pred)))
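
The other two metrics listed above can be written the same way. This is a sketch: the names `calc_mae` and `calc_mape` are chosen here to match `calc_mse`, and the small arrays are made-up numbers used only to check the arithmetic.

```python
import numpy as np


def calc_mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Returns the mean absolute error given y_true and y_pred."""
    return float(np.mean(np.abs(y_true - y_pred)))


def calc_mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Returns the mean absolute percentage error (in percent).
    Undefined when any element of y_true is zero.
    """
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))


# Made-up values: absolute errors are 10, 20, 0; relative errors are 0.1, 0.1, 0.
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])
print(calc_mae(y_true, y_pred))   # 10.0
print(calc_mape(y_true, y_pred))  # about 6.67
```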

# Supervised Machine Learning

## Getting started

- **Machine learning:** Generic term for computer algorithms that build models based on sample data
- **Task:** What the statistical model does (classification vs. regression vs. clustering)
- **Features:** Properties, attributes, or predictors of a dataset
- **Supervised learning:** Estimator optimization using known data labels ("correct" data)

We will be using the [`scikit-learn`](https://scikit-learn.org) library, which is shortened to `sklearn` in code. Go ahead and use `mamba` to install `scikit-learn`.

## Choosing the machine learning estimator to use

There's a [convenient chart](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) available in scikit-learn's documentation. There's also a [massive list of all of the types of supervised learning](https://scikit-learn.org/stable/supervised_learning.html) available in `sklearn`.

## Trying out some regression on the Diabetes dataset

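First, load the Diabetes dataset as a DataFrame. The same data is available (with mean-centered, scaled features) from `sklearn.datasets.load_diabetes()`; here we read the original, unscaled tab-separated file instead: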

In [None]:
df_diabetes = pd.read_csv('https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt',
                          sep='\t')
df_diabetes.columns = df_diabetes.columns.str.lower()
# Split into features (everything except 'y') and the target column 'y'
df_diabetes_data = df_diabetes.drop('y', axis=1)
df_diabetes_target = df_diabetes['y'].copy()
print(f"Data:\n{df_diabetes_data.head()}")
print(f"Target:\n{df_diabetes_target.head()}")
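
For comparison, this is roughly how the copy bundled with scikit-learn can be loaded as pandas objects. The variable names `X_sk` and `y_sk` are illustrative. Note that these features are already mean-centered and rescaled, so the numbers will not match the raw file read above.

```python
import sklearn.datasets

# as_frame=True returns pandas objects instead of NumPy arrays.
diabetes = sklearn.datasets.load_diabetes(as_frame=True)
X_sk = diabetes.data    # DataFrame with the 10 features
y_sk = diabetes.target  # Series with the disease-progression target

print(X_sk.shape, y_sk.shape)
print(X_sk.columns.tolist())
# Each feature column has (numerically) zero mean in this version.
print(X_sk.mean().round(6))
```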

Sometimes, it's helpful to use a `sns.pairplot` to see how everything behaves:

In [None]:
sns.pairplot(data=df_diabetes, hue='y', kind='scatter', palette='plasma')

Next, let's just try ordinary least squares (but from `sklearn.linear_model.LinearRegression()`) with it:

In [None]:
df_diabetes_data_subbed = df_diabetes_data[['bmi', 'bp', 's4', 's5', 's6']]

regress_diabetes_model = sklearn.linear_model.LinearRegression()
regress_diabetes_model.fit(df_diabetes_data_subbed, df_diabetes_target)
print(f"fit score: {regress_diabetes_model.score(df_diabetes_data_subbed, df_diabetes_target)}")
print(f"coef names: {regress_diabetes_model.feature_names_in_}")
print(f"coefficients: {regress_diabetes_model.coef_}")
print(f"intercept: {regress_diabetes_model.intercept_}")

Now, we can try another model, such as `Ridge`, `Lasso`, or `ElasticNet`:

In [None]:
def test_linear_regressor(model_to_use: object,
                          true_features: pd.DataFrame,
                          true_targets: pd.Series) -> object:
    """
    Test a linear regressor (assuming models in sklearn.linear_model).
    
    :param model_to_use: The model class to use (e.g. sklearn.linear_model.Ridge)
    :param true_features: The dataframe of features from the dataset
    :param true_targets: The series of targets from the dataset
    :returns: the trained model
    """
    model = model_to_use()
    model.fit(true_features, true_targets)
    print(f"Testing model: {model}")
    print(f"fit score: {model.score(true_features, true_targets)}")
    print(f"coef names: {model.feature_names_in_}")
    print(f"coefficients: {model.coef_}")
    print(f"intercept: {model.intercept_}")
    return model

test_linear_regressor(sklearn.linear_model.Ridge, df_diabetes_data_subbed, df_diabetes_target)
test_linear_regressor(sklearn.linear_model.ElasticNet, df_diabetes_data_subbed, df_diabetes_target)

Let's try something different -- a Support Vector Machine (SVM) for regression, `sklearn.svm.SVR()`:

In [None]:
svm_diabetes_model = sklearn.svm.SVR(kernel="linear")
svm_diabetes_model.fit(df_diabetes_data_subbed, df_diabetes_target)
print(f"fit score: {svm_diabetes_model.score(df_diabetes_data_subbed, df_diabetes_target)}")
print(f"coef names: {svm_diabetes_model.feature_names_in_}")
print(f"coefficients: {svm_diabetes_model.coef_}")
print(f"intercept: {svm_diabetes_model.intercept_}")
print(f"support vectors: {svm_diabetes_model.support_vectors_}")

## What's the catch?

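All of the fit scores above were computed on the same data used for training, so they cannot tell us whether a model generalizes. How do we know we are overfitting, underfitting, etc.? --> Validation and Testing!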

In [None]:
svm_diabetes_model = sklearn.svm.SVR(kernel="linear")
scores = sklearn.model_selection.cross_val_score(svm_diabetes_model, 
                                                 df_diabetes_data_subbed, 
                                                 df_diabetes_target, 
                                                 cv=7)
print(f"Cross-validation scores: {scores}")

In [None]:
import sklearn.ensemble

rf_diabetes_model = sklearn.ensemble.RandomForestRegressor(n_estimators=1000)
scores = sklearn.model_selection.cross_val_score(rf_diabetes_model, 
                                                 df_diabetes_data, 
                                                 df_diabetes_target, 
                                                 cv=7,
                                                 verbose=2)
print(f"Cross-validation scores: {scores}")
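
Besides cross-validation, a quick way to spot overfitting is to hold out part of the data and compare the score on the training rows with the score on the held-out rows. The sketch below uses the scikit-learn copy of the Diabetes data so that it is self-contained; the 25% test size, `n_estimators=100`, and the fixed `random_state` are arbitrary choices made for this illustration.

```python
import sklearn.datasets
import sklearn.ensemble
import sklearn.model_selection

X, y = sklearn.datasets.load_diabetes(return_X_y=True, as_frame=True)

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = sklearn.ensemble.RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# A training score far above the held-out score is a sign of overfitting.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"train R^2: {train_r2:.3f}")
print(f"test R^2:  {test_r2:.3f}")
```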

## Feature Selection

In [None]:
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

feature_selector = SelectKBest(f_regression, k=6)
feature_selector.fit(df_diabetes_data, df_diabetes_target)
features_info = zip(feature_selector.feature_names_in_, feature_selector.pvalues_, feature_selector.scores_)
print("f_regression (linear assumption) results:")
# Sort by F-score, highest first
for feature_name, p_value, score in sorted(features_info, key=lambda info: info[2], reverse=True):
    print(f"{feature_name}: {score} with p-value {p_value}")

In [None]:
feature_selector = SelectKBest(mutual_info_regression, k=6)
feature_selector.fit(df_diabetes_data, df_diabetes_target)
features_info = zip(feature_selector.feature_names_in_, feature_selector.scores_)
print("mutual_info_regression (no linearity assumption) results:")
# Sort by estimated mutual information, highest first
for feature_name, score in sorted(features_info, key=lambda info: info[1], reverse=True):
    print(f"{feature_name}: {score}")
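
The cells above print one score per feature. To see which columns `SelectKBest` actually keeps, `get_support()` returns a boolean mask over the input columns. A small sketch, again on the scikit-learn copy of the Diabetes data so that it runs on its own:

```python
import sklearn.datasets
from sklearn.feature_selection import SelectKBest, f_regression

X, y = sklearn.datasets.load_diabetes(return_X_y=True, as_frame=True)

selector = SelectKBest(f_regression, k=6).fit(X, y)

# get_support() is a boolean mask over the input columns.
selected = X.columns[selector.get_support()].tolist()
print(f"selected features: {selected}")

# transform() keeps only the selected columns.
X_reduced = selector.transform(X)
print(X_reduced.shape)
```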

## Homework 2 Results

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats

grades = np.array(sorted([5.5, 10, 10, 5.5, 5, 7.75, 4.5, 5.5, 5.5, 10, 9.5, 5.5, 8.25, 10, 9.9, 6.5, 9]))
df_grades = pd.Series(grades)
print(df_grades.describe())
print(f"mode 1: {df_grades[df_grades < 7.75].mean()}")
print(f"mode 2: {df_grades[df_grades >= 7.75].mean()}")
scipy.stats.ttest_ind(df_grades[df_grades < 7.75], df_grades[df_grades >= 7.75])


In [None]:
sns.histplot(df_grades, binrange=(0.0, 10.0))

### Notes on Homework 2

There has been a pattern of work sharing, in particular of incorrect work, with these two homework assignments.

Do not copy and paste other students' work. This is not allowed and is considered cheating. Academic dishonesty results in an automatic failing grade in the course. I did not apply the automatic-fail penalty this time, but if cheating is found in the future, the grade will become an automatic zero. If you have any questions, please feel free to email me.

You MUST explain the reasoning behind the Python code you write to analyze datasets. You may talk through problems with your peers, but you must do your own writing and coding. If you use other sources to help you, please cite them.

If you are having trouble and have questions or would like clarifications on homework problems, please feel free to message me on Teams, Canvas, or email.

## Supervised learning steps from scratch

Dataset: `ml_datasets/power_plant.csv`

References cited:

<p style="font-size: smaller">
    Pınar Tüfekci, Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods, International Journal of Electrical Power & Energy Systems, Volume 60, September 2014, Pages 126-140, ISSN 0142-0615.
</p>
<p style="font-size: smaller">
    Heysem Kaya, Pınar Tüfekci, Sadık Fikret Gürgen: Local and Global Learning Methods for Predicting Power of a Combined Gas & Steam Turbine, Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering ICETCEE 2012, pp. 13-18 (Mar. 2012, Dubai).
</p>


#### Dataset summary

The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with full load. Features consist of the hourly average ambient variables Ambient Temperature (AT), Ambient Pressure (AP), Relative Humidity (RH), and Exhaust Vacuum (V), which are used to predict the net hourly electrical energy output (PE) of the plant.

A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another. While the Vacuum is collected from and has an effect on the steam turbine, the other three ambient variables affect the GT performance.

#### Dataset attributes:

Features consist of hourly average ambient variables:

- Ambient Temperature (AT) in the range 1.81°C to 37.11°C
- Ambient Pressure (AP) in the range 992.89 to 1033.30 millibar
- Relative Humidity (RH) in the range 25.56% to 100.16%
- Exhaust Vacuum (V) in the range 25.36 to 81.56 cm Hg

The target is the net hourly electrical energy output (PE), in the range 420.26 to 495.76 MW.

The averages are taken from various sensors located around the plant that record the ambient variables every second. The variables are given without normalization. 

### Step 1: Import libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

import sklearn.pipeline
import sklearn.svm

### Step 2: Import and visualize data

### Step 3: Clean data as necessary
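
What cleaning is needed depends on what Step 2 reveals. As a sketch, the checks below count missing values and exact duplicate rows, then drop both. The tiny frame is made-up data with one missing value and one duplicated row, and whether duplicates should really be dropped from sensor data is a judgment call.

```python
import numpy as np
import pandas as pd

# Made-up rows (NOT real plant data): row 2 duplicates row 1, row 3 has a missing AT.
df_raw = pd.DataFrame({
    "AT": [10.0, 20.0, 20.0, np.nan, 30.0],
    "V": [40.0, 50.0, 50.0, 60.0, 70.0],
    "AP": [1010.0, 1015.0, 1015.0, 1020.0, 1005.0],
    "RH": [60.0, 70.0, 70.0, 80.0, 50.0],
    "PE": [480.0, 460.0, 460.0, 450.0, 430.0],
})

print(df_raw.isna().sum())        # missing values per column
print(df_raw.duplicated().sum())  # number of exact duplicate rows

# Drop incomplete rows first, then exact duplicates, and renumber the index.
df_clean = df_raw.dropna().drop_duplicates().reset_index(drop=True)
print(df_clean.shape)  # (3, 5)
```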

### Step 4: Separate data and targets
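
This step can mirror what was done for the Diabetes data: everything except the target column becomes the feature table. The sketch assumes the target column is named `PE` and uses a few made-up rows in place of the real file.

```python
import pandas as pd

# Made-up rows (NOT real plant data) with the assumed column names.
df_power = pd.DataFrame({
    "AT": [10.0, 20.0, 30.0],
    "V": [40.0, 50.0, 70.0],
    "AP": [1010.0, 1015.0, 1005.0],
    "RH": [60.0, 70.0, 50.0],
    "PE": [480.0, 460.0, 430.0],
})

features = df_power.drop(columns="PE")  # AT, V, AP, RH
target = df_power["PE"].copy()          # net hourly energy output

print(f"Data:\n{features.head()}")
print(f"Target:\n{target.head()}")
```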

### Step 5: Check selection algorithm (linear vs. nonlinear model)

Select best features for use with linear techniques (`f_regression`)

Select best features for use with nonlinear techniques (`mutual_info_regression`)
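
Both scoring functions can also be called directly, which makes it easy to put the linear and nonlinear scores side by side. This is a sketch on a synthetic stand-in (made-up values in which `PE` depends only on `AT`), so the ranking it prints says nothing about the real plant data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression, mutual_info_regression

# Synthetic stand-in (made-up numbers, NOT the real plant data); column names assumed.
rng = np.random.default_rng(0)
n_rows = 300
df_power = pd.DataFrame({
    "AT": rng.uniform(2.0, 37.0, n_rows),
    "V": rng.uniform(25.0, 81.0, n_rows),
    "AP": rng.uniform(993.0, 1033.0, n_rows),
    "RH": rng.uniform(26.0, 100.0, n_rows),
})
df_power["PE"] = 500.0 - 2.0 * df_power["AT"] + rng.normal(0.0, 4.0, n_rows)
features = df_power.drop(columns="PE")
target = df_power["PE"]

# F-test for a linear relationship between each feature and the target.
f_scores, p_values = f_regression(features, target)
# Mutual information also picks up nonlinear dependence.
mi_scores = mutual_info_regression(features, target, random_state=0)

summary = pd.DataFrame(
    {"f_score": f_scores, "p_value": p_values, "mutual_info": mi_scores},
    index=features.columns,
).sort_values("f_score", ascending=False)
print(summary)
```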

### Step 6: Build Pipeline (scaler and regressor combined) and select model

In [None]:
import joblib
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.linear_model
import sklearn.model_selection

num_cores = 4

Let's try good ol' ElasticNet linear regression:
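
One possible version, as a sketch: a `StandardScaler` followed by `ElasticNet`, scored with 5-fold cross-validation. The `alpha` and `l1_ratio` values are arbitrary starting points, and the data is a synthetic stand-in with made-up values and assumed column names.

```python
import numpy as np
import pandas as pd
import sklearn.linear_model
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing

# Synthetic stand-in (made-up numbers, NOT the real plant data); column names assumed.
rng = np.random.default_rng(0)
n_rows = 300
df_power = pd.DataFrame({
    "AT": rng.uniform(2.0, 37.0, n_rows),
    "V": rng.uniform(25.0, 81.0, n_rows),
    "AP": rng.uniform(993.0, 1033.0, n_rows),
    "RH": rng.uniform(26.0, 100.0, n_rows),
})
df_power["PE"] = 500.0 - 2.0 * df_power["AT"] + rng.normal(0.0, 4.0, n_rows)
features = df_power.drop(columns="PE")
target = df_power["PE"]

# The scaler is refit inside every fold, so no test-fold statistics leak into training.
elastic_pipeline = sklearn.pipeline.make_pipeline(
    sklearn.preprocessing.StandardScaler(),
    sklearn.linear_model.ElasticNet(alpha=0.1, l1_ratio=0.5),
)
scores = sklearn.model_selection.cross_val_score(elastic_pipeline, features, target, cv=5)
print(f"Cross-validation R^2 scores: {scores}")
print(f"mean: {scores.mean():.3f}")
```

On the full dataset, passing `n_jobs=num_cores` to `cross_val_score` runs the folds in parallel.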

Let's try this dataset with a linear SVM:
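
A sketch of the same pipeline idea with `sklearn.svm.SVR(kernel="linear")`, the estimator used earlier on the Diabetes data. Scaling matters more here than for ordinary least squares because the SVM penalty depends on the size of the coefficients. The data is again a synthetic stand-in with made-up values and assumed column names.

```python
import numpy as np
import pandas as pd
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.svm

# Synthetic stand-in (made-up numbers, NOT the real plant data); column names assumed.
rng = np.random.default_rng(0)
n_rows = 300
df_power = pd.DataFrame({
    "AT": rng.uniform(2.0, 37.0, n_rows),
    "V": rng.uniform(25.0, 81.0, n_rows),
    "AP": rng.uniform(993.0, 1033.0, n_rows),
    "RH": rng.uniform(26.0, 100.0, n_rows),
})
df_power["PE"] = 500.0 - 2.0 * df_power["AT"] + rng.normal(0.0, 4.0, n_rows)
features = df_power.drop(columns="PE")
target = df_power["PE"]

linear_svm_pipeline = sklearn.pipeline.make_pipeline(
    sklearn.preprocessing.StandardScaler(),
    sklearn.svm.SVR(kernel="linear"),
)
scores = sklearn.model_selection.cross_val_score(linear_svm_pipeline, features, target, cv=5)
print(f"Cross-validation R^2 scores: {scores}")
print(f"mean: {scores.mean():.3f}")
```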

Now, try it with a nonlinear SVM:
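
For a nonlinear SVM, one option is the RBF kernel. In the sketch below, `C=100.0` is an arbitrary choice made because the target is not scaled and the default `C=1.0` tends to underfit in that situation; in practice `C`, `gamma`, and `epsilon` would be tuned. The data is a synthetic stand-in with made-up values and assumed column names.

```python
import numpy as np
import pandas as pd
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.svm

# Synthetic stand-in (made-up numbers, NOT the real plant data); column names assumed.
rng = np.random.default_rng(0)
n_rows = 300
df_power = pd.DataFrame({
    "AT": rng.uniform(2.0, 37.0, n_rows),
    "V": rng.uniform(25.0, 81.0, n_rows),
    "AP": rng.uniform(993.0, 1033.0, n_rows),
    "RH": rng.uniform(26.0, 100.0, n_rows),
})
df_power["PE"] = 500.0 - 2.0 * df_power["AT"] + rng.normal(0.0, 4.0, n_rows)
features = df_power.drop(columns="PE")
target = df_power["PE"]

rbf_svm_pipeline = sklearn.pipeline.make_pipeline(
    sklearn.preprocessing.StandardScaler(),
    sklearn.svm.SVR(kernel="rbf", C=100.0),
)
scores = sklearn.model_selection.cross_val_score(rbf_svm_pipeline, features, target, cv=5)
print(f"Cross-validation R^2 scores: {scores}")
print(f"mean: {scores.mean():.3f}")
```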

Let's try a neural network!
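
A small multilayer perceptron from `sklearn.neural_network` is one way to do this. The sketch wraps the pipeline in `sklearn.compose.TransformedTargetRegressor` so that the target is standardized too, since the output values sit far from zero. The layer size, `max_iter`, and `random_state` are arbitrary choices, and the data is a synthetic stand-in with made-up values and assumed column names.

```python
import numpy as np
import pandas as pd
import sklearn.compose
import sklearn.model_selection
import sklearn.neural_network
import sklearn.pipeline
import sklearn.preprocessing

# Synthetic stand-in (made-up numbers, NOT the real plant data); column names assumed.
rng = np.random.default_rng(0)
n_rows = 300
df_power = pd.DataFrame({
    "AT": rng.uniform(2.0, 37.0, n_rows),
    "V": rng.uniform(25.0, 81.0, n_rows),
    "AP": rng.uniform(993.0, 1033.0, n_rows),
    "RH": rng.uniform(26.0, 100.0, n_rows),
})
df_power["PE"] = 500.0 - 2.0 * df_power["AT"] + rng.normal(0.0, 4.0, n_rows)
features = df_power.drop(columns="PE")
target = df_power["PE"]

mlp_pipeline = sklearn.pipeline.make_pipeline(
    sklearn.preprocessing.StandardScaler(),
    sklearn.neural_network.MLPRegressor(
        hidden_layer_sizes=(32,), max_iter=2000, random_state=0
    ),
)
# Standardize the target as well; predictions are mapped back to MW automatically.
mlp_model = sklearn.compose.TransformedTargetRegressor(
    regressor=mlp_pipeline,
    transformer=sklearn.preprocessing.StandardScaler(),
)
scores = sklearn.model_selection.cross_val_score(mlp_model, features, target, cv=5)
print(f"Cross-validation R^2 scores: {scores}")
print(f"mean: {scores.mean():.3f}")
```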

### Step 7: Train final model on single shuffled train-test split
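
A sketch of the final step. The RBF SVR pipeline is used here purely as an example; whichever pipeline scored best in Step 6 would take its place. The 80/20 split, `C=100.0`, and `random_state` are arbitrary choices, and the data is a synthetic stand-in with made-up values and assumed column names.

```python
import numpy as np
import pandas as pd
import sklearn.metrics
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.svm

# Synthetic stand-in (made-up numbers, NOT the real plant data); column names assumed.
rng = np.random.default_rng(0)
n_rows = 300
df_power = pd.DataFrame({
    "AT": rng.uniform(2.0, 37.0, n_rows),
    "V": rng.uniform(25.0, 81.0, n_rows),
    "AP": rng.uniform(993.0, 1033.0, n_rows),
    "RH": rng.uniform(26.0, 100.0, n_rows),
})
df_power["PE"] = 500.0 - 2.0 * df_power["AT"] + rng.normal(0.0, 4.0, n_rows)
features = df_power.drop(columns="PE")
target = df_power["PE"]

# Single shuffled split: 80% of the rows for training, 20% held out for the final check.
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    features, target, test_size=0.2, shuffle=True, random_state=0
)

final_model = sklearn.pipeline.make_pipeline(
    sklearn.preprocessing.StandardScaler(),
    sklearn.svm.SVR(kernel="rbf", C=100.0),
)
final_model.fit(X_train, y_train)

y_pred = final_model.predict(X_test)
test_r2 = final_model.score(X_test, y_test)
test_mse = sklearn.metrics.mean_squared_error(y_test, y_pred)
print(f"test R^2: {test_r2:.3f}")
print(f"test MSE: {test_mse:.3f}")
```

The held-out rows are used only once, after model selection is finished, so the reported score is not biased by the choices made in Step 6.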