glmnet in Python is always deterministic, regardless of seed #23

Closed
paulhendricks opened this issue Jun 6, 2017 · 2 comments

@paulhendricks

I was hoping to get some clarification on why glmnet for Python is always deterministic regardless of seed, even though the documentation states that the solver is not deterministic (e.g. https://github.com/civisanalytics/python-glmnet/blob/master/glmnet/linear.py#L77). For example, each of the following runs returns the same result, regardless of whether (or where) a seed is set:

from glmnet import ElasticNet
import io
import numpy as np
import pandas as pd
import requests
from sklearn.preprocessing import StandardScaler


# Load data
url = 'https://raw.githubusercontent.com/CCS-Lab/easyml/master/Python/datasets/prostate.csv'
s = requests.get(url).content
prostate = pd.read_csv(io.StringIO(s.decode('utf-8')))

# Generate coefficients from data by hand
X, y = prostate.drop('lpsa', axis=1).values, prostate['lpsa'].values
sclr = StandardScaler()
X_preprocessed = sclr.fit_transform(X)

# no random state
coefficients = []
for i in range(10):
    model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200)
    print(id(model))
    model.fit(X_preprocessed, y)
    coefficients.append(np.asarray(model.coef_))
print(coefficients)

# seed set at outer level
np.random.seed(43210)
coefficients = []
for i in range(10):
    model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200)
    print(id(model))
    model.fit(X_preprocessed, y)
    coefficients.append(np.asarray(model.coef_))
print(coefficients)

# seed set at inner level
coefficients = []
for i in range(10):
    np.random.seed(43210)
    model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200)
    print(id(model))
    model.fit(X_preprocessed, y)
    coefficients.append(np.asarray(model.coef_))
print(coefficients)

# seed set at function level
coefficients = []
for i in range(10):
    random_state = np.random.RandomState(i)
    model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200, random_state=random_state)
    print(id(model))
    model.fit(X_preprocessed, y)
    coefficients.append(np.asarray(model.coef_))
print(coefficients)

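# same random state object shared across all iterations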
coefficients = []
random_state = np.random.RandomState(43210)
for i in range(10):
    model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200, random_state=random_state)
    print(id(model))
    model.fit(X_preprocessed, y)
    coefficients.append(np.asarray(model.coef_))
print(coefficients)

This behavior is in direct contrast with that of the R version of the glmnet package:

library(easyml) # devtools::install_github("CCS-Lab/easyml", subdir = "R")
library(glmnet)

data("prostate", package = "easyml")

# Set X, y, and scale X
X <- as.matrix(prostate[, -9])
y <- prostate[, 9]
X_scaled <- scale(X)

# no seed
m <- 10
n <- ncol(X)
Z <- matrix(NA, nrow = m, ncol = n)
for (i in (1:m)) {
  model_cv <- cv.glmnet(X_scaled, y, standardize = FALSE)
  model <- glmnet(X_scaled, y)
  coefs <- coef(model, s = model_cv$lambda.min)
  Z[i, ] <- as.numeric(coefs)[-1]
}
print(Z)

# Seed set at outer level
set.seed(43210)
m <- 10
n <- ncol(X)
Z <- matrix(NA, nrow = m, ncol = n)
for (i in (1:m)) {
  model_cv <- cv.glmnet(X_scaled, y, standardize = FALSE)
  model <- glmnet(X_scaled, y)
  coefs <- coef(model, s = model_cv$lambda.min)
  Z[i, ] <- as.numeric(coefs)[-1]
}
print(Z)

# Seed set at inner level
Z <- matrix(NA, nrow = m, ncol = n)
for (i in (1:m)) {
  set.seed(43210)
  model_cv <- cv.glmnet(X_scaled, y, standardize = FALSE)
  model <- glmnet(X_scaled, y)
  coefs <- coef(model, s = model_cv$lambda.min)
  Z[i, ] <- as.numeric(coefs)[-1]
}
print(Z)

# Different seed set each loop at inner level
Z <- matrix(NA, nrow = m, ncol = n)
for (i in (1:m)) {
  set.seed(i)
  model_cv <- cv.glmnet(X_scaled, y, standardize = FALSE)
  model <- glmnet(X_scaled, y)
  coefs <- coef(model, s = model_cv$lambda.min)
  Z[i, ] <- as.numeric(coefs)[-1]
}
print(Z)

If the https://github.com/civisanalytics/python-glmnet/ version of glmnet is a wrapper around the same Fortran code, why do R and Python behave differently?

@xdavio (Contributor) commented Jun 6, 2017

Thank you for asking this question. As far as I know, the splits in the sklearn cross validator are deterministic by default, which is causing the issue. An immediate workaround is randomizing the dataset row order as a preprocessing step.
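To make that workaround concrete, here is a minimal sketch reusing X_preprocessed and y from the snippets above (the seed value is arbitrary):

from glmnet import ElasticNet
import numpy as np

# Shuffle the row order before fitting so that the otherwise-deterministic
# CV splits see a different partition of the data on each run.
rng = np.random.RandomState(43210)
perm = rng.permutation(len(y))
X_shuffled, y_shuffled = X_preprocessed[perm], y[perm]

model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200)
model.fit(X_shuffled, y_shuffled)
print(model.coef_)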

An alternative might be to set shuffle to True and pass the random state to the cross validator object right below the referenced line, but there may be a better way to handle this (such as exposing the cv object itself as an init parameter to the estimators). I'll follow up.
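For reference, here is roughly what shuffled splits look like with scikit-learn's KFold; this is a sketch of the general mechanism, not the actual python-glmnet internals:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)

# Default KFold: contiguous, deterministic folds -- identical on every run
for train_idx, test_idx in KFold(n_splits=3).split(X):
    print(test_idx)

# shuffle=True randomizes fold membership; seeding random_state
# makes the shuffled folds reproducible
cv = KFold(n_splits=3, shuffle=True, random_state=np.random.RandomState(43210))
for train_idx, test_idx in cv.split(X):
    print(test_idx)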

@stephen-hoover (Contributor)

This was fixed by #24.
