# Sequential Variance-Altered MLE (SVAM)

This is an implementation of the Sequential Variance-Altered MLE (SVAM) algorithm, a robust learning algorithm for generalized linear models (GLMs). The algorithm is described in the paper on [arXiv](https://arxiv.org/pdf/2212.05430). SVAM is a robust model recovery algorithm for supervised learning problems such as linear regression, gamma regression, and logistic regression, and for unsupervised problems such as mean estimation.

# Setup and Instructions

The code requires the packages scikit-learn, numpy, and matplotlib:

    pip install scikit-learn
    pip install numpy matplotlib

The notebooks (files with the .ipynb extension) run in Jupyter Notebook, so install it as well:

    pip install notebook

This repository contains four implementations of SVAM: 1) linear least squares regression, 2) logistic regression, 3) mean estimation, and 4) gamma regression. Each implementation is a Jupyter notebook (.ipynb file) in its own folder.

# Linear Least Squares Regression

SVAM learns a linear regression model under constant-fraction corruption. The clean labels are generated as $y=w_\ast^Tx+\epsilon$, where the noise $\epsilon$ is drawn from a zero-mean Gaussian distribution, and a constant fraction of the labels is then corrupted.

1. **Data Preparation**:
Generate $n$ random $d$-dimensional covariates from a normal distribution, $x\sim\mathcal{N}(\mu,\sigma)$. For a gold model $w_\ast$, generate the $n$ corresponding labels as $y_i = \langle w_\ast, x_i\rangle$ for $i=1,2,\ldots,n$. For $\alpha$-corruption with some constant $\alpha$, replace $\alpha n$ of the labels with labels generated by an adversarial model $\tilde{w}$.

2. **Run SVAM**:
Run the main SVAM algorithm with an initial (inverse) variance parameter $\beta_0$ and an increment parameter $\eta$. SVAM computes weights and calls a weighted linear regression library module with the computed weights. Perform a grid search to find the best initialization of the two parameters.

3. **Plot error vs time**:
Plot the model recovery error $\|w_\ast-\hat{w}_t\|$ (L2 distance) over time $t$. Compare SVAM, where the variance parameter changes gradually over time, against VAM with a fixed variance parameter, for a set of fixed values.
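
The three steps above can be sketched as follows. This is a minimal illustration, not the notebook's code, and it rests on several assumptions: the weights are taken to be Gaussian in the residual, $s_i=\exp(-\beta_t r_i^2/2)$; the increment is taken to be multiplicative, $\beta_{t+1}=\eta\beta_t$; the covariates are standard normal; and the weighted regression is solved in closed form with numpy rather than through a library module. All function names are hypothetical.

```python
import numpy as np


def make_corrupted_regression(n=1000, d=10, alpha=0.2, seed=0):
    """Clean labels y_i = <w_star, x_i>; alpha*n labels come from an adversarial model."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))            # assumption: standard normal covariates
    w_star = rng.normal(size=d)
    w_star /= np.linalg.norm(w_star)       # gold model
    w_adv = rng.normal(size=d)
    w_adv /= np.linalg.norm(w_adv)         # adversarial model
    y = X @ w_star
    corrupted = rng.choice(n, size=int(alpha * n), replace=False)
    y[corrupted] = X[corrupted] @ w_adv
    return X, y, w_star


def weighted_least_squares(X, y, s, ridge=1e-8):
    """Closed-form weighted least squares; the tiny ridge only guards against singularity."""
    Xs = X * s[:, None]
    return np.linalg.solve(Xs.T @ X + ridge * np.eye(X.shape[1]), Xs.T @ y)


def svam_linear(X, y, beta0=0.01, eta=1.5, n_iter=25):
    """Alternate between reweighting points and refitting, while beta grows."""
    w = np.zeros(X.shape[1])
    beta = beta0
    iterates = []
    for _ in range(n_iter):
        residual = y - X @ w
        s = np.exp(-0.5 * beta * residual**2)   # assumed weight form
        w = weighted_least_squares(X, y, s)
        iterates.append(w.copy())
        beta *= eta                             # assumed multiplicative increment
    return w, iterates


X, y, w_star = make_corrupted_regression()
w_hat, iterates = svam_linear(X, y)
w_ols = weighted_least_squares(X, y, np.ones(len(y)))   # unweighted baseline

# Recovery error per iteration, ready to be plotted against t.
errors = [np.linalg.norm(w_star - w_t) for w_t in iterates]
print("SVAM error:", errors[-1])
print("OLS error: ", np.linalg.norm(w_star - w_ols))
```

Keeping `beta` fixed inside the loop (that is, `eta=1.0`) gives a fixed-variance run for comparison, and looping over a few `(beta0, eta)` pairs is one simple way to carry out the grid search.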

# Logistic Regression

For binary classification problems, SVAM solves a logistic regression in which the labels are $y=\text{sign}(w_\ast^Tx)$, with a constant fraction of the labels corrupted.

1. **Data Preparation**:
Generate $n$ random $d$-dimensional covariates from a normal distribution, $x\sim\mathcal{N}(\mu,\sigma)$. For a gold model $w_\ast$, generate the $n$ corresponding labels as $y_i = (1+\text{sign}(\langle w_\ast, x_i\rangle))/2$ for $i=1,2,\ldots,n$. For $\alpha$-corruption with some constant $\alpha$, replace $\alpha n$ of the labels with $\tilde{y}_i=(1+\text{sign}(\langle \tilde{w}, x_i\rangle))/2$, where $\tilde{w}$ is an adversarial model.

2. **Run SVAM**:
Run the main SVAM algorithm with an initial (inverse) variance parameter $\beta_0$ and an increment parameter $\eta$. SVAM computes weights and calls a weighted logistic regression library module with the computed weights. Perform a grid search to find the best initialization of the two parameters.

3. **Plot error vs time**:
Plot the model recovery error $\|w_\ast-\hat{w}_t\|$ (L2 distance) over time $t$. Compare SVAM, where the variance parameter changes gradually over time, against VAM with a fixed variance parameter, for a set of fixed values. Also compare with the oracle, which knows which points are corrupted and therefore gives the best-performance benchmark.
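
The data-preparation step for this setting can be sketched as below. The sketch covers only label generation and corruption, following the formulas above; it assumes standard normal covariates and unit-norm models, and the function name is hypothetical. It also returns the corrupted indices, which is the information an oracle baseline would use.

```python
import numpy as np


def make_corrupted_binary_labels(n=1000, d=10, alpha=0.2, seed=0):
    """Labels y_i = (1 + sign(<w_star, x_i>)) / 2, with alpha*n labels set by w_adv."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))            # assumption: standard normal covariates
    w_star = rng.normal(size=d)
    w_star /= np.linalg.norm(w_star)       # gold model
    w_adv = rng.normal(size=d)
    w_adv /= np.linalg.norm(w_adv)         # adversarial model
    y = (1 + np.sign(X @ w_star)) / 2
    corrupted = rng.choice(n, size=int(alpha * n), replace=False)
    y[corrupted] = (1 + np.sign(X[corrupted] @ w_adv)) / 2
    return X, y, w_star, corrupted


X, y, w_star, corrupted = make_corrupted_binary_labels()

# Boolean mask of uncorrupted points, i.e. what the oracle is allowed to see.
clean = np.ones(len(y), dtype=bool)
clean[corrupted] = False
print("clean points:", clean.sum(), "corrupted points:", len(corrupted))
```

Note that a replaced label differs from the clean one only where the gold and adversarial models disagree on the sign, so the number of flipped labels can be smaller than $\alpha n$.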

# Mean Estimation

In the mean estimation problem, $d$-dimensional covariates are generated from a normal distribution with mean $\mu_\ast\in\mathbb{R}^d$.

1. **Data Preparation**:
   For $\alpha$-corruption, $\alpha n$ points are replaced by samples from a normal distribution with adversarial mean $\tilde{\mu}$.
   
2. **Run SVAM**:
   Run SVAM with $\beta_0$ and $\eta$ chosen by grid search. Compute the weights and estimate the mean as the weighted average.

   
3. **Plot error vs time**:
   Plot the mean estimation error $\|\mu_\ast-\hat{\mu}_t\|$ (L2 distance) over time. Compare the SVAM result with the median estimate and with the oracle, which knows which points are corrupted.
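
A rough sketch of these steps is given below. It is an illustration under assumptions, not the notebook's code: the weights are taken to be $s_i=\exp(-\beta_t\|x_i-\hat{\mu}_t\|^2/2)$, the increment is multiplicative, the covariance is the identity, and $\beta$ is capped at a hypothetical `beta_max` so that the weights do not collapse onto a handful of points. The function names and the `shift` parameter, which sets the distance between $\mu_\ast$ and $\tilde{\mu}$, are also hypothetical.

```python
import numpy as np


def make_corrupted_samples(n=1000, d=10, alpha=0.2, shift=5.0, seed=0):
    """Clean points ~ N(mu_star, I); alpha*n points are redrawn from N(mu_adv, I)."""
    rng = np.random.default_rng(seed)
    mu_star = rng.normal(size=d)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    mu_adv = mu_star + shift * direction    # adversarial mean
    X = mu_star + rng.normal(size=(n, d))
    corrupted = rng.choice(n, size=int(alpha * n), replace=False)
    X[corrupted] = mu_adv + rng.normal(size=(len(corrupted), d))
    return X, mu_star, corrupted


def svam_mean(X, beta0=0.01, eta=1.5, beta_max=1.0, n_iter=40):
    """Alternate between reweighting points and taking the weighted average."""
    mu = X.mean(axis=0)                     # start from the plain sample mean
    beta = beta0
    iterates = []
    for _ in range(n_iter):
        sq_dist = np.sum((X - mu) ** 2, axis=1)
        log_s = -0.5 * beta * sq_dist       # assumed weight form, in log space
        s = np.exp(log_s - log_s.max())     # rescaling leaves the average unchanged
        mu = s @ X / s.sum()
        iterates.append(mu.copy())
        beta = min(beta * eta, beta_max)    # assumed capped multiplicative increment
    return mu, iterates


X, mu_star, corrupted = make_corrupted_samples()
mu_hat, iterates = svam_mean(X)

# Baselines: plain mean, coordinate-wise median, and the oracle that sees only clean points.
clean = np.ones(len(X), dtype=bool)
clean[corrupted] = False
for name, est in [("SVAM", mu_hat), ("mean", X.mean(axis=0)),
                  ("median", np.median(X, axis=0)), ("oracle", X[clean].mean(axis=0))]:
    print(name, np.linalg.norm(mu_star - est))
```

The list `iterates` holds $\hat{\mu}_t$ for each $t$, so the error curve for the plot is the norm of `mu_star - mu_t` over that list.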

# Gamma Regression

By gamma regression we refer to the GLM whose exponential dispersion model is a Gamma distribution. It has two parameters, $\phi$ and $w_\ast$, and the labels are given by $y=(1-\phi)\exp(w_\ast^T x)$. We keep $\phi$ constant and learn $w_\ast$ when $\alpha n$ labels are corrupted.

1. **Data Preparation**:
   For $\alpha$-corruption, $\alpha n$ labels are replaced by $\tilde{y}_i=(1-\phi)\exp(\tilde{w}^T x_i)$.
   
2. **Run SVAM**:
   Run SVAM with $\beta_0$ and $\eta$ chosen by grid search. Compute the weights and take a gradient step on the weighted likelihood.

   
3. **Plot error vs time**:
   Plot the model recovery error $\|w_\ast-\hat{w}_t\|$ (L2 distance) over time. Compare the SVAM result with MLE and with fixed-variance VAM.
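
The label model and corruption step for this setting can be sketched as follows. Only data preparation is shown, following the formulas above. The covariates are scaled by $1/\sqrt{d}$ so that the exponentials stay moderate; that scaling, the default $\phi=0.5$, and the function name are assumptions made for illustration.

```python
import numpy as np


def make_corrupted_gamma_labels(n=1000, d=10, alpha=0.2, phi=0.5, seed=0):
    """Labels y_i = (1 - phi) * exp(<w_star, x_i>), with alpha*n labels set by w_adv."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d)) / np.sqrt(d)   # assumption: scaled to keep exp() moderate
    w_star = rng.normal(size=d)                # gold model
    w_adv = rng.normal(size=d)                 # adversarial model
    y = (1 - phi) * np.exp(X @ w_star)
    corrupted = rng.choice(n, size=int(alpha * n), replace=False)
    y[corrupted] = (1 - phi) * np.exp(X[corrupted] @ w_adv)
    return X, y, w_star, corrupted


X, y, w_star, corrupted = make_corrupted_gamma_labels()
print("label range:", y.min(), y.max())
```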