In [1]:
%matplotlib inline


# Generalized Linear Model


## Logistic Regression

Logistic regression is an important model for classification problems. It is specified as:
$$
\begin{aligned}
& P(y=1 \mid x)=\frac{1}{1+\exp \left(-x^T \beta\right)}, \\
& P(y=0 \mid x)=\frac{1}{1+\exp \left(x^T \beta\right)},
\end{aligned}
$$
where $\beta$ is an unknown parameter vector to be estimated. Since we expect only a few explanatory variables to contribute to predicting $y$, we assume that $\beta$ is a sparse vector with sparsity level $s$.
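
As a quick numerical check of the model above, the following minimal sketch evaluates both class probabilities and confirms that they sum to one. The helper `logistic_probs`, the vector `x`, and the coefficients `beta` are made-up values for illustration and are not taken from the data used later.

```python
import numpy as np

def logistic_probs(x, beta):
    """Return (P(y=1 | x), P(y=0 | x)) under the logistic model."""
    eta = x @ beta                      # linear predictor x^T beta
    p1 = 1.0 / (1.0 + np.exp(-eta))     # P(y = 1 | x)
    p0 = 1.0 / (1.0 + np.exp(eta))      # P(y = 0 | x)
    return p1, p0

# Hypothetical sparse coefficient vector (two nonzero entries) and one sample.
beta = np.array([0.0, 1.5, 0.0, -2.0])
x = np.array([0.3, 1.0, -0.7, 0.25])

p1, p0 = logistic_probs(x, beta)
print(p1, p0, p1 + p0)
```

Only the second and fourth entries of `x` influence the result, because the other coefficients are zero; this is what the sparsity assumption on $\beta$ expresses.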

With $n$ independent observations of the explanatory variables $x$ and the response variable $y$, we can estimate $\beta$ by minimizing the negative log-likelihood function under a sparsity constraint:
$$
\arg \min _{\beta \in R^p} L(\beta):=-\frac{1}{n} \sum_{i=1}^n\left\{y_i x_i^T \beta-\log \left(1+\exp \left(x_i^T \beta\right)\right)\right\}, \text { s.t. }\|\beta\|_0 \leq s
$$
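
The objective $L(\beta)$ can be evaluated directly with NumPy. The sketch below uses made-up data and hypothetical helper names; it computes $\log(1+\exp(t))$ as `np.logaddexp(0, t)`, an overflow-safe form, and compares it with a literal transcription of the formula.

```python
import numpy as np

def logistic_nll(beta, X, y):
    """Average negative log-likelihood L(beta) of logistic regression."""
    xbeta = X @ beta
    # log(1 + exp(t)) computed stably as logaddexp(0, t)
    return np.mean(np.logaddexp(0.0, xbeta) - y * xbeta)

def logistic_nll_naive(beta, X, y):
    """Literal transcription of the formula; may overflow for large x^T beta."""
    xbeta = X @ beta
    return -np.mean(y * xbeta - np.log(1.0 + np.exp(xbeta)))

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
beta_demo = np.array([1.0, 0.0, -0.5, 0.0, 0.0])   # two nonzero entries
y_demo = rng.binomial(1, 1.0 / (1.0 + np.exp(-X_demo @ beta_demo)))

print(logistic_nll(beta_demo, X_demo, y_demo))
print(np.count_nonzero(beta_demo))  # value of ||beta||_0 for beta_demo
```

At $\beta = 0$ every observation contributes $\log 2$, so the loss equals $\log 2 \approx 0.693$ regardless of the data; this is a convenient baseline when reading the loss values reported by a solver.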

In [2]:
import jax.numpy as jnp
import numpy as np
from scope import ScopeSolver

np.random.seed(0)

def data_generator(n, p, s, rho, random_state=None):
    r"""
    Generate synthetic data for sparse logistic regression.

    * $\beta^*_i$ ~ N(0, 1), $\forall i \in supp(\beta^*)$
    * $x = (x_1, \cdots, x_p)^T$, $x_{i+1}=\rho x_i+\sqrt{1-\rho^2}z_i$, where $x_1, z_i$ ~ N(0, 1)
    * $y\in\{0,1\}$, $P(y=0 \mid x)=\frac{1}{1+\exp(x^T\beta^*)}$
    """
    np.random.seed(random_state)
    # beta
    beta = np.zeros(p)
    true_support_set = np.random.choice(p, s, replace=False)
    beta[true_support_set] = np.random.normal(0, 1, s)
    # X
    X = np.empty((n, p))
    X[:, 0] = np.random.normal(0, 1, n)
    for j in range(1, p):
        X[:, j] = rho * X[:, j - 1] + np.sqrt(1-rho**2) * np.random.normal(0, 1, n)
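    # y: clip the linear predictor so that exp() cannot overflow
    xbeta = np.clip(X @ beta, -30, 30)
    prob = 1 / (1 + np.exp(-xbeta))  # P(y = 1 | x); named prob to avoid shadowing the argument p
    y = np.random.binomial(1, prob)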

    return X, y, beta, true_support_set

n, p, s, rho = 100, 10, 3, 0.0
X, y, true_params, true_support_set = data_generator(n, p, s, rho, 0)
# Define function to calculate negative log-likelihood of logistic regression
def logistic_loss(params):
    xbeta = jnp.clip(X @ params, -30, 30)
    return jnp.mean(jnp.log(1 + jnp.exp(xbeta)) - y * xbeta)

solver = ScopeSolver(p, s)
solver.solve(logistic_loss, jit=True)

print("True support set: ", np.sort(true_support_set))
print("Estimated support set: ", np.sort(solver.support_set))
print("True parameters: ", true_params)
print("True loss value: ", logistic_loss(true_params))
print("Estimated parameters: ", solver.params)
print("Estimated loss value: ", logistic_loss(solver.params))


True support set:  [2 4 8]
Estimated support set:  [2 7 8]
True parameters:  [ 0.          0.          0.95008842  0.         -0.10321885  0.
  0.          0.         -0.15135721  0.        ]
True loss value:  0.6396969
Estimated parameters:  [ 0.          0.          0.86291105  0.          0.          0.
  0.         -0.32276162  0.23823929  0.        ]
Estimated loss value:  0.60980606
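
To quantify how well the support was recovered, the two index sets can be compared directly. The helper below is a hypothetical utility (it is not part of `scope`); it is applied to the support sets printed above.

```python
import numpy as np

def support_recovery(true_support, est_support):
    """Return the correctly recovered indices, precision, and recall."""
    true_set = set(map(int, true_support))
    est_set = set(map(int, est_support))
    hits = sorted(true_set & est_set)
    precision = len(hits) / len(est_set) if est_set else 0.0
    recall = len(hits) / len(true_set) if true_set else 0.0
    return hits, precision, recall

# Support sets copied from the printed output of the logistic example.
hits, precision, recall = support_recovery(np.array([2, 4, 8]), np.array([2, 7, 8]))
print(hits, precision, recall)
```

Two of the three active variables are recovered. The missed index, 4, carries the smallest true coefficient in absolute value (about -0.103), which plausibly makes it the hardest to detect with $n = 100$ observations.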


## Poisson Regression
Poisson regression models a response variable that takes the form of counts,
for example the number of car accidents or the number of customers in line at a reception desk.
The response variable is assumed to follow a Poisson distribution.

The general mathematical equation for Poisson regression is

\begin{align}\log(E(y)) = \beta_0 + \beta_1 X_1+\beta_2 X_2+\dots+\beta_p X_p.\end{align}

With $n$ independent observations of the explanatory variables $x$ and the response variable $y$, we can estimate $\beta$ by minimizing the negative log-likelihood function under a sparsity constraint:
$$
\arg \min _{\beta \in R^p} L(\beta):=-\frac{1}{n} \sum_{i=1}^n\left\{y_i x_i^T \beta-\exp \left(x_i^T \beta\right)-\log \left(y_i!\right)\right\}, \text { s.t. }\|\beta\|_0 \leq s .
$$
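
The term $\log(y_i!)$ does not depend on $\beta$, so it shifts the objective by a constant and can be dropped during optimization. The NumPy sketch below illustrates this on made-up data; the function name, the flag `include_constant`, and the coefficient vectors are assumptions introduced for the example.

```python
import math
import numpy as np

def poisson_nll(beta, X, y, include_constant=True):
    """Average negative log-likelihood of Poisson regression with log link."""
    xbeta = X @ beta
    nll = np.exp(xbeta) - y * xbeta
    if include_constant:
        # log(y_i!) = lgamma(y_i + 1); this term does not depend on beta
        nll = nll + np.array([math.lgamma(v + 1.0) for v in y])
    return np.mean(nll)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 4)) * 0.5
beta_a = np.array([0.8, 0.0, 0.0, -0.4])
beta_b = np.zeros(4)
y_demo = rng.poisson(np.exp(X_demo @ beta_a))

# The gap between the two versions is the same constant for every beta.
gap_a = poisson_nll(beta_a, X_demo, y_demo) - poisson_nll(beta_a, X_demo, y_demo, False)
gap_b = poisson_nll(beta_b, X_demo, y_demo) - poisson_nll(beta_b, X_demo, y_demo, False)
print(gap_a, gap_b)
```

Because the gap is identical for any two coefficient vectors, both versions of the loss have the same minimizer; only the reported loss values differ.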

Here is Python code for solving the sparse Poisson regression problem:

In [3]:
import numpy as np
from abess.datasets import make_glm_data
import jax.numpy as jnp
from scope import ScopeSolver
np.random.seed(1)

n = 100
p = 10
s = 3
data = make_glm_data(n=n, p=p, k=s, family="poisson")
X = data.x
y = data.y
# Define function to calculate negative log-likelihood of poisson regression
def poisson_loss(params):
    xbeta = jnp.clip(X @ params, -30, 30)
    return jnp.mean(jnp.exp(xbeta) - y * xbeta)  # omit the log(y_i!) term, which does not depend on params


solver = ScopeSolver(p, s)
solver.solve(poisson_loss, jit=True)

print("True support set: ", np.nonzero(data.coef_)[0])
print("True parameters: ", data.coef_)
print("True loss value: ", poisson_loss(data.coef_))
print("Estimated support set: ", np.sort(solver.support_set))
print("Estimated parameters: ", solver.params)
print("Estimated loss value: ", poisson_loss(solver.params))

True support set:  [0 5 9]
True parameters:  [4.70030694 0.         0.         0.         0.         8.30570366
 0.         0.         0.         3.78436768]
True loss value:  0.5956122
Estimated support set:  [3 5 9]
Estimated parameters:  [ 0.          0.          0.         -3.99304304  0.          8.65190727
  0.          0.          0.          4.95582619]
Estimated loss value:  0.5782885


## Gamma Regression
Gamma regression can be used when the response variable is positive and continuous, such as payments for insurance claims
or the lifetime of a redundant system.
The density of the Gamma distribution can be written in terms of
a mean parameter ($\mu$) and a shape parameter ($\alpha$):
$$
f(y \mid \mu, \alpha)=\frac{1}{y \Gamma(\alpha)}\left(\frac{\alpha y}{\mu}\right)^{\alpha} e^{-\alpha y / \mu} I_{(0, \infty)}(y),
$$
where $I_{(0, \infty)}(\cdot)$ denotes the indicator function of the positive half-line. In the Gamma regression model,
the response variables are assumed to follow Gamma distributions. Specifically,

\begin{align}y_i \sim \mathrm{Gamma}(\mu_i, \alpha),\end{align}


where $1/\mu_i = x_i^T\beta$.
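
As a sanity check on this parametrization, the sketch below integrates the density numerically and confirms that it has total mass one and mean $\mu$. The values of `mu` and `alpha` are arbitrary illustrative choices, and `gamma_density` is a hypothetical helper written for this check.

```python
import math
import numpy as np

def gamma_density(y, mu, alpha):
    """Gamma density parametrized by mean mu and shape alpha (for y > 0)."""
    log_f = (alpha * np.log(alpha * y / mu) - alpha * y / mu
             - np.log(y) - math.lgamma(alpha))
    return np.exp(log_f)

mu, alpha = 2.0, 3.0                        # hypothetical values
grid = np.linspace(1e-8, 60.0, 600_001)      # fine grid on (0, 60]
f = gamma_density(grid, mu, alpha)

# Trapezoidal rule written out by hand.
dx = grid[1] - grid[0]
total = np.sum((f[:-1] + f[1:]) / 2) * dx                          # integral of f(y)
mean = np.sum((grid[:-1] * f[:-1] + grid[1:] * f[1:]) / 2) * dx    # integral of y f(y)
print(total, mean)
```

With the inverse link $1/\mu_i = x_i^T\beta$, the mean is positive only when $x_i^T\beta > 0$, so the linear predictor must stay positive for the model to be well defined.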

With $n$ independent observations of the explanatory variables $x$ and the response variable $y$, we can estimate $\beta$ by minimizing the negative log-likelihood function under a sparsity constraint:
$$
\arg \min _{\beta \in R^p} L(\beta):=-\frac{1}{n} \sum_{i=1}^n\left\{-\alpha \left( y_i x_i^T \beta - \log \left(x_i^T \beta\right)\right) + \alpha \log \alpha + \left(\alpha - 1\right) \log y_i - \log \Gamma \left(\alpha\right) \right\}, \text { s.t. }\|\beta\|_0 \leq s .
$$
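
This objective follows from taking the logarithm of the Gamma density and substituting $1/\mu_i = x_i^T\beta$. As a brief worked step, for $y_i > 0$:
$$
\begin{aligned}
\log f(y_i \mid \mu_i, \alpha) &= \alpha \log \alpha + \alpha \log y_i - \alpha \log \mu_i - \frac{\alpha y_i}{\mu_i} - \log y_i - \log \Gamma(\alpha) \\
&= -\alpha\left(y_i x_i^T \beta - \log \left(x_i^T \beta\right)\right) + \alpha \log \alpha + (\alpha - 1) \log y_i - \log \Gamma(\alpha).
\end{aligned}
$$
Only the first term depends on $\beta$, and $\alpha > 0$ is a common factor, so minimizing $L(\beta)$ is equivalent to minimizing $\frac{1}{n} \sum_{i=1}^n\left\{y_i x_i^T \beta - \log \left(x_i^T \beta\right)\right\}$, which is finite only when $x_i^T \beta > 0$ for every $i$.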

Here is Python code for solving the sparse Gamma regression problem:


In [4]:
np.random.seed(2)

n = 100
p = 10
s = 3
data = make_glm_data(n=n, p=p, k=s, family="gamma")
X = data.x
y = data.y

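# Define function to calculate negative log-likelihood of Gamma regression
# (the factor alpha and the terms that do not depend on params are dropped)
def gamma_loss(params):
    # Note: jnp.log requires xbeta > 0, which the clip range [-30, 30] does not enforce;
    # the loss is inf at params = 0 and nan whenever some x_i^T params < 0.
    xbeta = jnp.clip(X @ params, -30, 30)
    return jnp.mean(y * xbeta - jnp.log(xbeta))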


solver = ScopeSolver(p, s)
solver.solve(gamma_loss, jit=True)

print("True support set: ", np.nonzero(data.coef_)[0])
print("True parameters: ", data.coef_)
print("True loss value: ", gamma_loss(data.coef_))
print("Estimated support set: ", np.sort(solver.support_set))
print("Estimated parameters: ", solver.params)
print("Estimated loss value: ", gamma_loss(solver.params))

True support set:  [2 6 8]
True parameters:  [ 0.          0.         16.84626207  0.          0.          0.
  9.48390875  0.          7.42158219  0.        ]
True loss value:  nan
Estimated support set:  []
Estimated parameters:  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Estimated loss value:  inf
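
The `nan` and `inf` values above are consistent with the logarithm being evaluated at non-positive linear predictors, since the inverse link requires $x_i^T\beta > 0$. The following is a minimal NumPy sketch of a loss that makes this domain explicit. It runs on made-up data with a positive design matrix, an assumption chosen so that the true linear predictors are positive; the name `gamma_loss_guarded` is hypothetical, and the sketch is not a drop-in fix for the `scope` example.

```python
import numpy as np

def gamma_loss_guarded(params, X, y):
    """Gamma loss mean(y * x^T b - log(x^T b)); +inf outside the domain x^T b > 0."""
    xbeta = X @ params
    if np.any(xbeta <= 0):
        return np.inf
    return float(np.mean(y * xbeta - np.log(xbeta)))

rng = np.random.default_rng(0)
n_demo, alpha = 200, 5.0                                 # hypothetical size and shape
X_demo = rng.uniform(0.5, 1.5, size=(n_demo, 4))         # strictly positive design
beta_demo = np.array([2.0, 0.0, 1.0, 0.0])               # nonnegative coefficients
mu_demo = 1.0 / (X_demo @ beta_demo)                     # inverse link: 1/mu = x^T beta
y_demo = rng.gamma(shape=alpha, scale=mu_demo / alpha)   # Gamma with mean mu, shape alpha

loss_true = gamma_loss_guarded(beta_demo, X_demo, y_demo)    # finite
loss_zero = gamma_loss_guarded(np.zeros(4), X_demo, y_demo)  # inf: x^T beta = 0
print(loss_true, loss_zero)
```

An all-zero starting point lies on the boundary of the domain, where the loss is infinite; this matches the `inf` reported for the all-zero estimate above.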
