
glmnet in Python is always deterministic, regardless of seed #23

Closed

paulhendricks opened this issue Jun 6, 2017 · 2 comments

@paulhendricks

I was hoping to get some clarification on why glmnet for Python is always deterministic regardless of seed, even though the documentation states the solver is not deterministic (e.g. https://github.com/civisanalytics/python-glmnet/blob/master/glmnet/linear.py#L77). For example, each of the following runs returns the same result, whether or not a seed is set:

from glmnet import ElasticNet
import io
import numpy as np
import pandas as pd
import requests
from sklearn.preprocessing import StandardScaler


# Load data
url = 'https://raw.githubusercontent.com/CCS-Lab/easyml/master/Python/datasets/prostate.csv'
s = requests.get(url).content
prostate = pd.read_csv(io.StringIO(s.decode('utf-8')))

# Generate coefficients from data by hand
X, y = prostate.drop('lpsa', axis=1).values, prostate['lpsa'].values
sclr = StandardScaler()
X_preprocessed = sclr.fit_transform(X)

# no random state
coefficients = []
for i in range(10):
    model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200)
    print(id(model))
    model.fit(X_preprocessed, y)
    coefficients.append(np.asarray(model.coef_))
print(coefficients)

# seed set at outer level
np.random.seed(43210)
coefficients = []
for i in range(10):
    model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200)
    print(id(model))
    model.fit(X_preprocessed, y)
    coefficients.append(np.asarray(model.coef_))
print(coefficients)

# seed set at inner level
coefficients = []
for i in range(10):
    np.random.seed(43210)
    model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200)
    print(id(model))
    model.fit(X_preprocessed, y)
    coefficients.append(np.asarray(model.coef_))
print(coefficients)

# seed set at function level
coefficients = []
for i in range(10):
    random_state = np.random.RandomState(i)
    model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200, random_state=random_state)
    print(id(model))
    model.fit(X_preprocessed, y)
    coefficients.append(np.asarray(model.coef_))
print(coefficients)

# single RandomState shared across all fits
coefficients = []
random_state = np.random.RandomState(43210)
for i in range(10):
    model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200, random_state=random_state)
    print(id(model))
    model.fit(X_preprocessed, y)
    coefficients.append(np.asarray(model.coef_))
print(coefficients)

This behavior is in direct contrast with the behavior observed in the R version of the glmnet package:

library(easyml) # devtools::install_github("CCS-Lab/easyml", subdir = "R")
library(glmnet)

data("prostate", package = "easyml")

# Set X, y, and scale X
X <- as.matrix(prostate[, -9])
y <- prostate[, 9]
X_scaled <- scale(X)

# no seed
m <- 10
n <- ncol(X)
Z <- matrix(NA, nrow = m, ncol = n)
for (i in (1:m)) {
  model_cv <- cv.glmnet(X_scaled, y, standardize = FALSE)
  model <- glmnet(X_scaled, y)
  coefs <- coef(model, s = model_cv$lambda.min)
  Z[i, ] <- as.numeric(coefs)[-1]
}
print(Z)

# Seed set at outer level
set.seed(43210)
m <- 10
n <- ncol(X)
Z <- matrix(NA, nrow = m, ncol = n)
for (i in (1:m)) {
  model_cv <- cv.glmnet(X_scaled, y, standardize = FALSE)
  model <- glmnet(X_scaled, y)
  coefs <- coef(model, s = model_cv$lambda.min)
  Z[i, ] <- as.numeric(coefs)[-1]
}
print(Z)

# Seed set at inner level
Z <- matrix(NA, nrow = m, ncol = n)
for (i in (1:m)) {
  set.seed(43210)
  model_cv <- cv.glmnet(X_scaled, y, standardize = FALSE)
  model <- glmnet(X_scaled, y)
  coefs <- coef(model, s = model_cv$lambda.min)
  Z[i, ] <- as.numeric(coefs)[-1]
}
print(Z)

# Different seed set each loop at inner level
Z <- matrix(NA, nrow = m, ncol = n)
for (i in (1:m)) {
  set.seed(i)
  model_cv <- cv.glmnet(X_scaled, y, standardize = FALSE)
  model <- glmnet(X_scaled, y)
  coefs <- coef(model, s = model_cv$lambda.min)
  Z[i, ] <- as.numeric(coefs)[-1]
}
print(Z)

If the https://github.com/civisanalytics/python-glmnet/ version of glmnet is a wrapper around the same Fortran code, why does the behavior differ between R and Python?

@xdavio
Contributor

xdavio commented Jun 6, 2017

Thank you for asking this question. As far as I know, the splits in the sklearn cross-validator are deterministic by default, which is what causes this behavior. An immediate workaround is to randomize the dataset's row order as a preprocessing step, as sketched below.
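
For instance, a minimal sketch of that workaround, reusing X_preprocessed and y from the example above (the permutation changes which rows land in each of the otherwise-fixed folds):

import numpy as np
from glmnet import ElasticNet

# Sketch of the row-shuffling workaround: permute the rows with a seeded
# RandomState so the deterministic fold boundaries see different data
# for different seeds.
rng = np.random.RandomState(43210)
perm = rng.permutation(X_preprocessed.shape[0])
X_shuffled, y_shuffled = X_preprocessed[perm], y[perm]

model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200)
model.fit(X_shuffled, y_shuffled)
print(model.coef_)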

An alternative might be to set shuffle to True and pass the random state to the cross-validator object right below the referenced line, but there may be a better way to handle this, such as exposing the cv object itself as an init parameter to the estimators. I'll follow up.
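
To illustrate that second option, here is how an sklearn splitter exposes shuffling (a sketch of the idea only, not the actual python-glmnet internals):

from sklearn.model_selection import KFold

# With shuffle=True and an explicit random_state, the fold assignments
# vary with the seed; the default (shuffle=False) splitter always yields
# the same splits for a given dataset.
cv = KFold(n_splits=3, shuffle=True, random_state=43210)
for train_idx, test_idx in cv.split(X_preprocessed):
    print(train_idx[:5], test_idx[:5])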

@stephen-hoover
Contributor

This was fixed by #24.
