
Determine the optimal support size
------------

This tutorial uses logistic regression to show how to determine the optimal support size, using either an information criterion or cross-validation.

### Introduction

Logistic regression is an important model for classification problems. It is specified as:
$$
\begin{aligned}
& P(y=1 \mid x)=\frac{1}{1+\exp \left(-x^T \beta\right)}, \\
& P(y=0 \mid x)=\frac{1}{1+\exp \left(x^T \beta\right)},
\end{aligned}
$$
where $\beta$ is an unknown parameter vector to be estimated. Since we expect only a few explanatory variables to contribute to predicting $y$, we assume that $\beta$ is a sparse vector with sparsity level $s$.

With $n$ independent observations of the explanatory variables $x$ and the response variable $y$, we can estimate $\beta$ by minimizing the negative log-likelihood function under a sparsity constraint:

<a id='loss'></a>
$$
\arg \min _{\beta \in R^p} L(\beta):=-\frac{1}{n} \sum_{i=1}^n\left\{y_i x_i^T \beta-\log \left(1+\exp \left(x_i^T \beta\right)\right)\right\}, \text { s.t. }\|\beta\|_0 \leq s \tag{1}
$$ 
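
To make (1) concrete, the following NumPy sketch evaluates the objective and the $\ell_0$ norm on a tiny hand-made data set. The names `neg_log_likelihood` and `l0_norm`, as well as the toy data, are illustrative assumptions and not part of `skscope`.

```python
import numpy as np

def neg_log_likelihood(beta, X, y):
    """Average negative log-likelihood of logistic regression, as in (1)."""
    xbeta = X @ beta
    # log(1 + exp(z)) is computed stably as logaddexp(0, z)
    return np.mean(np.logaddexp(0.0, xbeta) - y * xbeta)

def l0_norm(beta):
    """Number of nonzero coefficients, i.e. ||beta||_0."""
    return int(np.count_nonzero(beta))

# Toy data: 4 observations, 3 variables (made up for illustration)
X_toy = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, -1.0],
                  [1.0, 1.0, 0.0],
                  [-1.0, 0.5, 1.0]])
y_toy = np.array([1, 0, 1, 0])

beta_zero = np.zeros(3)
beta_sparse = np.array([1.5, 0.0, 0.0])

print(neg_log_likelihood(beta_zero, X_toy, y_toy))    # equals log(2) at beta = 0
print(neg_log_likelihood(beta_sparse, X_toy, y_toy))  # smaller loss with one active variable
print(l0_norm(beta_sparse))                           # 1
```

Here `beta_sparse` satisfies the constraint $\|\beta\|_0 \leq s$ for any $s \geq 1$.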

### Import necessary packages

In [1]:
import jax.numpy as jnp
import numpy as np
from skscope import ScopeSolver
import time

### Set a seed

In [2]:
np.random.seed(2024)

### Generate the data

First, we define a data generator function that produces a suitable data set for this task.

The model:

* $\beta^*_i$ ~ U(1, 2), $\forall i \in supp(\beta^*)$
* $x = (x_1, \cdots, x_p)^T$, $x_{i+1}=\rho x_i+\sqrt{1-\rho^2}z_i$, where $x_1, z_i$ ~ N(0, 1)
* $y\in\{0,1\}$, $P(y=0 \mid x)=\frac{1}{1+\exp\left(x^T\beta^*\right)}$
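
The recursion for $x$ gives every coordinate unit variance and makes coordinates $i$ and $j$ correlated with coefficient $\rho^{|i-j|}$. A minimal NumPy check of this property is sketched below; the seed, sample size, and variable names are arbitrary choices made only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for the illustration
n_check, p_check, rho_check = 20000, 4, 0.5

X_check = np.empty((n_check, p_check))
X_check[:, 0] = rng.normal(0, 1, n_check)
for j in range(1, p_check):
    # x_{j+1} = rho * x_j + sqrt(1 - rho^2) * z_j
    noise = rng.normal(0, 1, n_check)
    X_check[:, j] = rho_check * X_check[:, j - 1] + np.sqrt(1 - rho_check**2) * noise

corr = np.corrcoef(X_check, rowvar=False)
print(np.round(corr, 2))  # entry (i, j) should be close to rho ** |i - j|
```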

In [3]:
def make_logistic_data(n, p, s, rho, random_state=None):
    np.random.seed(random_state)
    # beta
    beta = np.zeros(p)
    true_support_set = np.random.choice(p, s, replace=False)
    beta[true_support_set] = np.random.uniform(1, 2, s)
    # X
    X = np.empty((n, p))
    X[:, 0] = np.random.normal(0, 1, n)
    for j in range(1, p):
        X[:, j] = rho * X[:, j - 1] + np.sqrt(1-rho**2) * np.random.normal(0, 1, n)
    # y
    xbeta = np.clip(X @ beta, -30, 30)
    prob = 1 / (1 + np.exp(-xbeta))  # P(y = 1 | x); named `prob` to avoid shadowing the argument `p`
    y = np.random.binomial(1, prob)

    return X, y, beta, true_support_set

We then use this function to generate a data set containing 500 observations, in which only 5 of the 500 variables affect the expectation of the response.

In [4]:
n, p, s, rho = 500, 500, 5, 0.5
X, y, true_params, true_support_set = make_logistic_data(n, p, s, rho , 0)

print("The first five predictor variables of all samples:", '\n', X[:, :5])
print("The first five noisy observations:", '\n', y[:5])

The first five predictor variables of all samples: 
 [[ 0.69737282  0.37856863  0.55593142 -0.5971624   0.39692106]
 [-1.73742924 -1.6743813   0.18271199 -0.08962145 -0.43781848]
 [ 0.1158557   0.82780721  2.0148995   1.78473305  1.770524  ]
 ...
 [-0.72372136  0.44922885  0.65933275 -0.0693512  -0.91185329]
 [ 1.7275667   1.4409185   1.22541887  0.38692584 -0.6191575 ]
 [ 0.05005725 -0.1412343   0.82986437 -2.14078613 -2.09797562]]
The first five noisy observations: 
 [0 0 1 1 1]


### Define function to calculate negative log-likelihood of logistic regression

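Second, we define the loss function `logistic_loss` according to [(1)](#loss), so that it matches the data generating function `make_logistic_data`. The code sums over the observations instead of averaging, i.e. it omits the factor $\frac{1}{n}$ in (1); this rescaling does not change the minimizer.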

In [5]:
def logistic_loss(params):
    xbeta = jnp.clip(X @ params, -30, 30)
    return jnp.sum(jnp.log(1 + jnp.exp(xbeta)) - y * xbeta)
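
`logistic_loss` clips $x^T\beta$ to $[-30, 30]$ so that `exp` cannot overflow. An alternative, shown here only as a sketch and not as what this tutorial uses, is to compute $\log(1+\exp(z))$ with `logaddexp(0, z)`, which is stable without clipping. The NumPy comparison below uses made-up inputs; `jnp.logaddexp` is called the same way in JAX.

```python
import numpy as np

z = np.array([-50.0, -5.0, 0.0, 5.0, 50.0])  # made-up linear predictors

# Clip first, then apply log(1 + exp(.)), as logistic_loss does
clipped = np.log(1 + np.exp(np.clip(z, -30, 30)))
# Stable evaluation of log(exp(0) + exp(z)) without clipping
stable = np.logaddexp(0.0, z)

# The two agree inside the clipping range and differ only where |z| > 30
print(np.abs(clipped - stable)[1:4].max())
print(clipped[-1], stable[-1])  # roughly 30 versus roughly 50
```

Clipping therefore changes the loss only for observations whose linear predictor lies outside $[-30, 30]$.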

### Use SIC to decide the optimal support size

Four information criteria are available in `skscope.utilities`:
- Akaike information criterion (AIC)
- Bayesian information criterion (BIC)
- Extended BIC (EBIC)
- Special information criterion (SIC)
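
As a rough illustration of what such a criterion does, the sketch below scores a sequence of candidate support sizes with the textbook forms of AIC and BIC, $2\ell + 2k$ and $2\ell + k\log n$, where $\ell$ is the minimized negative log-likelihood and $k$ is the support size. The loss values are made up, the function names are hypothetical, and the exact formulas implemented in `skscope.utilities` may differ.

```python
import numpy as np

def aic_score(loss, k):
    """Textbook AIC: twice the negative log-likelihood plus 2 per parameter."""
    return 2.0 * loss + 2.0 * k

def bic_score(loss, k, n):
    """Textbook BIC: twice the negative log-likelihood plus log(n) per parameter."""
    return 2.0 * loss + k * np.log(n)

n_obs = 500
sizes = np.arange(1, 7)
# Made-up minimized losses: large drops up to k = 3, then almost flat
losses = np.array([300.0, 250.0, 200.0, 199.5, 199.2, 199.0])

aic = aic_score(losses, sizes)
bic = bic_score(losses, sizes, n_obs)

# The selected support size is the one with the smallest criterion value
print(sizes[np.argmin(aic)], sizes[np.argmin(bic)])  # 3 3
```

The penalty term stops the criterion from rewarding the small loss reductions obtained by adding variables beyond $k = 3$.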
  
Calling any of them takes just one line of code; here we use SIC:

In [6]:
from skscope.utilities import SIC

In [7]:
# Record start time
start_time = time.time()

solver_ic = ScopeSolver(p, sparsity = range(1, 6), sample_size=n, ic_method = SIC)
params_ic = solver_ic.solve(logistic_loss, jit=True)

# Calculate runtime
runtime = time.time() - start_time
print("Runtime:", runtime, "seconds")

An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.


Runtime: 2.9305684566497803 seconds


Now `solver_ic.params` contains the coefficients of a logistic model with no more than 5 variables; variables with a zero coefficient are unused in the model.

We can further compare the coefficients estimated by `skscope` with the true coefficients in three respects:

* The true support set and the estimated support set

* The true nonzero parameters and the estimated nonzero parameters

* The loss values at the true parameters and at the estimated parameters

In [8]:
print("True support set: ", (true_support_set))
print("Estimated support set: ", (solver_ic.support_set))

True support set:  [ 90 254 283 445 461]
Estimated support set:  [ 90 254 283 445 461]
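
When the two sets are not identical, a numeric summary is more informative than comparing printed arrays by eye. The sketch below computes precision and recall of support recovery with NumPy, using the two index sets printed above; the variable names are illustrative.

```python
import numpy as np

true_support = np.array([90, 254, 283, 445, 461])       # from the output above
estimated_support = np.array([90, 254, 283, 445, 461])  # from the output above

# Indices that appear in both sets are correctly selected variables
true_positive = np.intersect1d(true_support, estimated_support).size
precision = true_positive / estimated_support.size  # share of selected variables that are true
recall = true_positive / true_support.size          # share of true variables that are selected
print(precision, recall)  # 1.0 1.0
```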


In [9]:
print("True parameters: ", true_params[true_support_set])
print("Estimated parameters: ", solver_ic.params[solver_ic.support_set])

True parameters:  [1.45985588 1.0446123  1.79979588 1.07695645 1.51883515]
Estimated parameters:  [1.49872541 1.03224578 2.04393697 1.32145707 1.50325417]


In [10]:
print("True loss value: ", logistic_loss(true_params))
print("Estimated loss value: ", logistic_loss(solver_ic.params))

True loss value:  159.59784
Estimated loss value:  157.65933


### Use CV to decide the optimal support size

In [11]:
def logistic_loss_cv(params, data):
    xbeta = jnp.clip(data[0] @ params, -30, 30)
    return jnp.sum(jnp.log(1 + jnp.exp(xbeta)) - data[1] * xbeta)

# Record start time
start_time = time.time()

solver_cv = ScopeSolver(p, sparsity = range(1, 6), sample_size = n, cv = 5,
                        split_method=lambda data, index: (data[0][index, :], data[1][index]))
params_cv = solver_cv.solve(logistic_loss_cv, jit=True, data=(X, y))

# Calculate runtime
runtime = time.time() - start_time
print("Runtime:", runtime, "seconds")

Runtime: 8.449155330657959 seconds
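
In the cell above, `split_method` tells the solver how to extract the rows with given indices from `data`, so that the loss can be evaluated on a subset of the samples. The sketch below mimics that step by hand with NumPy on made-up data. The fold construction is an assumption made for illustration and is not necessarily how `ScopeSolver` partitions the samples internally.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
X_demo = rng.normal(size=(10, 3))
y_demo = rng.integers(0, 2, size=10)

# Same callable as the one passed to ScopeSolver above
split_method = lambda data, index: (data[0][index, :], data[1][index])

# Hypothetical 5-fold partition of the row indices
folds = np.array_split(rng.permutation(10), 5)
for k, test_index in enumerate(folds):
    train_index = np.setdiff1d(np.arange(10), test_index)
    X_train, y_train = split_method((X_demo, y_demo), train_index)
    X_test, y_test = split_method((X_demo, y_demo), test_index)
    print(k, X_train.shape, X_test.shape)
```

Each fold fits on 8 rows and evaluates on the remaining 2, which is why cross-validation costs several solver runs per candidate support size.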


In [12]:
print("True support set: ", (true_support_set))
print("Estimated support set: ", (solver_cv.support_set))

True support set:  [ 90 254 283 445 461]
Estimated support set:  [ 90 254 283 445 461]


In [13]:
print("True parameters: ", true_params[true_support_set])
print("Estimated parameters: ", solver_cv.params[solver_cv.support_set])

True parameters:  [1.45985588 1.0446123  1.79979588 1.07695645 1.51883515]
Estimated parameters:  [1.49872597 1.0322461  2.04393774 1.32145742 1.50325463]


In [14]:
print("True loss value: ", logistic_loss(true_params))
print("Estimated loss value: ", logistic_loss(solver_cv.params))

True loss value:  159.59784
Estimated loss value:  157.65933


Comparing SIC with 5-fold CV, both recover the true support set exactly, but SIC is clearly faster (about 2.9 seconds versus 8.4 seconds in this run).

### Compare the results with and without warm start

#### Using warm start

In [15]:
# Record start time
start_time = time.time()

solver_ws = ScopeSolver(p, s)
params_ws = solver_ws.solve(logistic_loss, jit=True)

# Calculate runtime
runtime = time.time() - start_time
print("Runtime:", runtime, "seconds")

Runtime: 1.8910024166107178 seconds


In [16]:
print("True support set: ", (true_support_set))
print("Estimated support set: ", (solver_ws.support_set))

True support set:  [ 90 254 283 445 461]
Estimated support set:  [ 90 254 283 445 461]


#### Not using warm start

In [17]:
# Record start time
start_time = time.time()

solver_nws = ScopeSolver(p, s)
solver_nws.warm_start = False
params_nws = solver_nws.solve(logistic_loss, jit=True)

# Calculate runtime
runtime = time.time() - start_time
print("Runtime:", runtime, "seconds")


Runtime: 3.612983465194702 seconds


In [18]:
print("True support set: ", (true_support_set))
print("Estimated support set: ", (solver_nws.support_set))

True support set:  [ 90 254 283 445 461]
Estimated support set:  [ 90 254 283 445 461]


Hint: all solvers use warm start by default. Disabling it can prolong the computation, as in this run (about 3.6 seconds versus 1.9 seconds).
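
The idea behind a warm start can be illustrated outside `skscope`: an iterative solver that starts near the solution usually needs fewer iterations than one that starts from scratch. The toy gradient-descent example below is only an analogy with made-up numbers and a hypothetical `gradient_descent` helper; it does not reproduce what `ScopeSolver` does internally.

```python
import numpy as np

def gradient_descent(start, target, step=0.1, tol=1e-6, max_iter=10_000):
    """Minimize 0.5 * ||x - target||^2 and return (solution, iterations used)."""
    x = np.asarray(start, dtype=float)
    for it in range(max_iter):
        grad = x - target
        if np.linalg.norm(grad) < tol:
            return x, it
        x = x - step * grad
    return x, max_iter

target = np.array([1.5, 1.0, 2.0])
_, cold_iters = gradient_descent(np.zeros(3), target)    # cold start from zero
_, warm_iters = gradient_descent(target + 0.01, target)  # warm start near the solution
print(cold_iters, warm_iters)  # the warm start needs fewer iterations
```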