# Chemistry 313: Machine Learning in Chemistry
Spring 2025, Dinner

In [1]:
import torch
import jax
import numpy as np
import jax.numpy as jnp
import matplotlib.pyplot as plt

In [16]:
jkey = jax.random.PRNGKey(seed=1)
j2key = jax.random.PRNGKey(seed=2)
j3key = jax.random.PRNGKey(seed=3)

## 3/24/25
Types of ML
 - supervised: data = $\{(x_i,y_i)\}_i$, where $x$ is the features and $y$ is the label
    - regression ($y \in \mathbb{R}^d$)
    - classification ($y \in \mathbb{Z}^d$ or $y \in \{c_0,c_1,\ldots,c_d\}$)
 - unsupervised
    - dimensional reduction (visualization, speed optimization)
    - density estimation
    - clustering
 - reinforcement learning (RL)
    - choosing actions (faces the credit assignment problem which is overdetermined)
    - protein design

split data into train, validation (holdout), and test sets to detect and prevent overfitting
- in chemistry it is hard to split cleanly because samples are often correlated, causing "leakage" between train and test

cross validation (k-fold)
- partition the data into k folds; cycle through them, holding out one fold and training on the other k-1, then average the held-out error over the k folds
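
A minimal sketch of the k-fold procedure above, using NumPy with ordinary least squares as a stand-in model. The helper names `kfold_indices` and `kfold_mse` and the synthetic data are assumptions for illustration, not from the lecture.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    return np.array_split(perm, k)

def kfold_mse(X, y, k=5, seed=0):
    """Average held-out MSE of a least-squares fit over k folds."""
    folds = kfold_indices(len(y), k, seed)
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except fold i, evaluate on fold i
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        resid = X[val_idx] @ w - y[val_idx]
        scores.append(float(np.mean(resid ** 2)))
    return float(np.mean(scores)), scores

# Synthetic linear data (hypothetical, for illustration only)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([0.5, 2.0, -1.0]) + 0.1 * rng.normal(size=100)

mean_mse, fold_mse = kfold_mse(X, y, k=5)
print(mean_mse, fold_mse)
```

A random shuffle like this ignores the leakage issue noted above; for correlated samples, one option is to assign whole groups of related samples to the same fold.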

## 3/26/25
### Regularization
- any modification to the learning procedure intended to reduce test error but not training error
- types: ridge (L2) (better usually, hard to interpret, not robust to outliers), lasso (L1) (sparse coeffs, better for interpretability, outlier robust)

Q-regularization with hyperparameter $\lambda$ ($q=2$: ridge, $q=1$: lasso):
$$\tilde{L}(\beta) = \sum_n^{N_{train}} (\hat{y}_n - y_n)^2 + \frac{\lambda}{2}\sum_i |\beta_i|^q$$
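
For $q=2$ and a linear model $\hat{y} = X\beta$, the regularized loss above has a closed-form minimizer. The sketch below assumes that linear model and keeps the $\lambda/2$ prefactor from the notes; `ridge_fit` and the synthetic data are hypothetical names and values chosen for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize sum_n (X beta - y)_n^2 + (lam / 2) * sum_i beta_i^2.

    Setting the gradient to zero gives (X^T X + (lam / 2) I) beta = X^T y.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + 0.5 * lam * np.eye(d), X.T @ y)

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
beta_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=50)

# Larger lam shrinks the coefficient norm toward zero
for lam in [0.0, 1.0, 100.0]:
    beta = ridge_fit(X, y, lam)
    print(lam, np.linalg.norm(beta))
```

The $q=1$ (lasso) case has no closed-form solution in general and needs an iterative solver.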

### Bias-Variance Decomposition
$$y = f(x) + \varepsilon$$
learn $\hat{y}(\{x_n,y_n\})$

$E[(y-\hat{y})^2] = (f(x)-E[\hat{y}])^2 + E[(E[\hat{y}]-\hat{y})^2] + E[\varepsilon^2]$

i.e. expected error = (bias)$^2$, the average deviation of the model from the truth, + the variance of the model between data draws + noise

- better generalization = lower variance
- there is a tradeoff between low variance and low bias: more flexible models lower the bias but raise the variance
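
The tradeoff can be estimated numerically by refitting a model on many noisy draws of the same training inputs. This is a rough sketch under assumed choices (true function $\sin(2\pi x)$, $\sigma = 0.3$, polynomial fits of degree 1 and 6); none of these specifics come from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3                                # assumed noise level
x_train = np.linspace(0.0, 1.0, 20)        # fixed training inputs
x_test = np.linspace(0.05, 0.95, 50)

def f(x):
    """Assumed true function for this illustration."""
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_draws=500):
    """Estimate bias^2 and variance of a polynomial fit, averaged over x_test."""
    preds = np.empty((n_draws, x_test.size))
    for d in range(n_draws):
        # New noise draw on the same inputs, then refit
        y_train = f(x_train) + sigma * rng.normal(size=x_train.size)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[d] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)                   # estimate of E[y_hat]
    bias2 = np.mean((f(x_test) - mean_pred) ** 2)    # (f - E[y_hat])^2
    var = np.mean(preds.var(axis=0))                 # E[(E[y_hat] - y_hat)^2]
    return bias2, var

b_lo, v_lo = bias_variance(degree=1)   # rigid model
b_hi, v_hi = bias_variance(degree=6)   # flexible model
print("degree 1: bias^2, var =", b_lo, v_lo)
print("degree 6: bias^2, var =", b_hi, v_hi)
```

In this setup the rigid model should show the larger bias and the flexible model the larger variance; the noise term $\sigma^2$ adds equally to both.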

### Linear Basis Regression
basis functions $\phi_j$ can be nonlinear in $x$; the model is linear in the weights $w_j$
$$\hat{y}(x) = \sum_j w_j \phi_j(x) $$
$$\phi_0(x) = 1$$

assume $y(x) = f(x) + \varepsilon$ where $\varepsilon \sim N(0,\sigma^2)$

$$p(\{y_n\} | w, \sigma^2) = \prod_n^N \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}\Big(y_n - \sum_j w_j \phi_j(x_n)\Big)^2\right)$$
likelihood of seeing that data if the model were correct

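MLE: $$w^*, \sigma^{2*} = \arg\max_{w,\sigma^2} \log p(\{y_n\} | w,\sigma^2)$$
$$\log p(\{y_n\} | w,\sigma^2) = -\frac{N}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_n \Big(y_n -\sum_j w_j \phi_j(x_n)\Big)^2$$
$$= -\frac{N}{2} \log(2\pi \sigma^2) - \frac{L}{\sigma^2}, \qquad L = \frac{1}{2}\sum_n (y_n - \hat{y}(x_n))^2$$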

Taking gradients with respect to $w$ and setting to 0, with $\Phi_{nj} = \phi_j(x_n)$: $$\hat{y} = \Phi w$$
$$w^* = (\Phi^T \Phi)^{-1}\Phi^{T}y$$
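
A minimal NumPy sketch of the least-squares solution for the weights, assuming a polynomial basis $\phi_j(x) = x^j$ (so $\phi_0 = 1$). The basis choice, `design_matrix`, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def design_matrix(x, degree):
    """Phi[n, j] = phi_j(x_n) with phi_j(x) = x**j, so column 0 is all ones."""
    return np.vander(x, degree + 1, increasing=True)

# Synthetic data from a known quadratic plus Gaussian noise (hypothetical)
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
w_true = np.array([1.0, -2.0, 0.5])
Phi = design_matrix(x, degree=2)
y = Phi @ w_true + 0.1 * rng.normal(size=x.size)

# Normal equations: solve (Phi^T Phi) w = Phi^T y instead of forming the inverse
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Maximizing the log likelihood over sigma^2 gives the mean squared residual
sigma2_hat = np.mean((y - Phi @ w_hat) ** 2)
print(w_hat, sigma2_hat)
```

Solving the linear system (or calling `np.linalg.lstsq`) is generally more stable numerically than computing the matrix inverse explicitly.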

## 4/4/25

### PCA

$ x \in \mathbb{R}^R \quad z \in \mathbb{R}^K$


$L = \frac{1}{N} \sum_n^N ||x_n - \text{decode}(\text{encode}(x_n))||^2$

encode, decode are linear

$z = W^T x \quad W^TW = I$ ($W$ is $R \times K$ with orthonormal columns $w_k$)

$L = \frac{1}{N}\sum_n ||x_n-Wz_n||^2$

$x_n \approx \sum_k z_{nk}w_k$

for $K=1$

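$L = \frac{1}{N}\sum_n (x_n-w_1z_{n1})^T(x_n -w_1 z_{n1})$

$\frac{\partial L}{\partial z_{n1}} = \frac{1}{N}(-2w_1 \cdot x_n + 2 z_{n1}) = 0 \implies z_{n1} = w_1 \cdot x_n$ (using $w_1^T w_1 = 1$)

substituting back, $L = \frac{1}{N}\sum_n x_n^T x_n - \frac{1}{N}\sum_n z_{n1}^2$; dropping the constant first term:

$\tilde{L} = - \frac{1}{N}\sum_n z_{n1}^2 = - w_1^T (\frac{1}{N}\sum_n x_n x_n^T) w_1 = -w_1^T \Sigma w_1$ (for centered data, $\Sigma$ is the covariance matrix)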

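Unit-norm constraint ($w_1^T w_1 = 1$) via Lagrange multiplier $\lambda$:

$\tilde{L}' = -w_1^T \Sigma w_1 + \lambda (w_1^T  w_1 -1)$

$\frac{\partial \tilde{L}'}{\partial w_1} = 0 \implies \Sigma w_1 = \lambda w_1$

$\lambda = w_1^T \Sigma w_1 = -\tilde{L}$

therefore, to minimize $\tilde{L}$, pick $w_1$ as the eigenvector of $\Sigma$ with the largest eigenvalue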

#### PCA maximizes variance
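$\text{Var}(z_{n1}) = \frac{1}{N} \sum_n z_{n1}^2 = -\tilde{L}$ (for centered data, $\frac{1}{N}\sum_n x_n = 0$), so minimizing the reconstruction loss maximizes the variance of the projection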
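
The $K=1$ result can be checked numerically with an eigendecomposition of $\Sigma$. This is a sketch on synthetic, centered data; the data and variable names are assumptions for illustration.

```python
import numpy as np

# Synthetic correlated 3D data (hypothetical), centered so Sigma is the covariance
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
X = rng.normal(size=(500, 3)) @ A
X = X - X.mean(axis=0)

Sigma = X.T @ X / X.shape[0]            # (1/N) sum_n x_n x_n^T
evals, evecs = np.linalg.eigh(Sigma)    # eigenvalues in ascending order
w1 = evecs[:, -1]                       # eigenvector with the largest eigenvalue

z1 = X @ w1                             # encode: z_n1 = w_1 . x_n
X_rec = np.outer(z1, w1)                # decode: w_1 z_n1
recon_loss = np.mean(np.sum((X - X_rec) ** 2, axis=1))
var_z1 = np.mean(z1 ** 2)               # variance of the projection (zero mean)

print(var_z1, evals[-1])                          # these two should agree
print(recon_loss, np.trace(Sigma) - evals[-1])    # and so should these
```

The variance of the projection equals the largest eigenvalue, and the reconstruction loss equals the sum of the remaining eigenvalues, which matches $L = \frac{1}{N}\sum_n x_n^T x_n + \tilde{L}$.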