glmnet in Python is always deterministic, regardless of seed #23

Closed
paulhendricks opened this issue Jun 6, 2017 · 2 comments

@paulhendricks

I was hoping to get some clarification on why glmnet for Python is always deterministic regardless of seed, even though the documentation states that the solver is not deterministic (e.g. https://github.com/civisanalytics/python-glmnet/blob/master/glmnet/linear.py#L77). For example, each of the following runs returns the same result, regardless of whether (or where) a seed is set:

from glmnet import ElasticNet
import io
import numpy as np
import pandas as pd
import requests
from sklearn.preprocessing import StandardScaler


# Load data
url = 'https://raw.githubusercontent.com/CCS-Lab/easyml/master/Python/datasets/prostate.csv'
s = requests.get(url).content
prostate = pd.read_csv(io.StringIO(s.decode('utf-8')))

# Generate coefficients from data by hand
X, y = prostate.drop('lpsa', axis=1).values, prostate['lpsa'].values
sclr = StandardScaler()
X_preprocessed = sclr.fit_transform(X)

# no random state
coefficients = []
for i in range(10):
    model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200)
    print(id(model))
    model.fit(X_preprocessed, y)
    coefficients.append(np.asarray(model.coef_))
print(coefficients)

# seed set at outer level
np.random.seed(43210)
coefficients = []
for i in range(10):
    model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200)
    print(id(model))
    model.fit(X_preprocessed, y)
    coefficients.append(np.asarray(model.coef_))
print(coefficients)

# seed set at inner level
coefficients = []
for i in range(10):
    np.random.seed(43210)
    model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200)
    print(id(model))
    model.fit(X_preprocessed, y)
    coefficients.append(np.asarray(model.coef_))
print(coefficients)

# seed set at function level
coefficients = []
for i in range(10):
    random_state = np.random.RandomState(i)
    model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200, random_state=random_state)
    print(id(model))
    model.fit(X_preprocessed, y)
    coefficients.append(np.asarray(model.coef_))
print(coefficients)

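# same random state object shared across all iterations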
coefficients = []
random_state = np.random.RandomState(43210)
for i in range(10):
    model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200, random_state=random_state)
    print(id(model))
    model.fit(X_preprocessed, y)
    coefficients.append(np.asarray(model.coef_))
print(coefficients)

This behavior is in direct contrast with that of the R version of the glmnet package:

library(easyml) # devtools::install_github("CCS-Lab/easyml", subdir = "R")
library(glmnet)

data("prostate", package = "easyml")

# Set X, y, and scale X
X <- as.matrix(prostate[, -9])
y <- prostate[, 9]
X_scaled <- scale(X)

# no seed
m <- 10
n <- ncol(X)
Z <- matrix(NA, nrow = m, ncol = n)
for (i in (1:m)) {
  model_cv <- cv.glmnet(X_scaled, y, standardize = FALSE)
  model <- glmnet(X_scaled, y)
  coefs <- coef(model, s = model_cv$lambda.min)
  Z[i, ] <- as.numeric(coefs)[-1]
}
print(Z)

# Seed set at outer level
set.seed(43210)
m <- 10
n <- ncol(X)
Z <- matrix(NA, nrow = m, ncol = n)
for (i in (1:m)) {
  model_cv <- cv.glmnet(X_scaled, y, standardize = FALSE)
  model <- glmnet(X_scaled, y)
  coefs <- coef(model, s = model_cv$lambda.min)
  Z[i, ] <- as.numeric(coefs)[-1]
}
print(Z)

# Seed set at inner level
Z <- matrix(NA, nrow = m, ncol = n)
for (i in (1:m)) {
  set.seed(43210)
  model_cv <- cv.glmnet(X_scaled, y, standardize = FALSE)
  model <- glmnet(X_scaled, y)
  coefs <- coef(model, s = model_cv$lambda.min)
  Z[i, ] <- as.numeric(coefs)[-1]
}
print(Z)

# Different seed set each loop at inner level
Z <- matrix(NA, nrow = m, ncol = n)
for (i in (1:m)) {
  set.seed(i)
  model_cv <- cv.glmnet(X_scaled, y, standardize = FALSE)
  model <- glmnet(X_scaled, y)
  coefs <- coef(model, s = model_cv$lambda.min)
  Z[i, ] <- as.numeric(coefs)[-1]
}
print(Z)

If the https://github.com/civisanalytics/python-glmnet/ version of glmnet is a wrapper around the same Fortran code, why do R and Python behave differently?

@xdavio (Contributor) commented Jun 6, 2017

Thank you for asking this question. As far as I know, the splits in the sklearn cross validator are deterministic by default, which is causing the issue. An immediate workaround is randomizing the dataset row order as a preprocessing step.
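To make that workaround concrete, here is a minimal sketch reusing X_preprocessed and y from the snippets above (the seed value is arbitrary):

from glmnet import ElasticNet
import numpy as np

# Shuffle the row order before fitting so that the otherwise-deterministic
# CV splits see a different partition of the data on each run.
rng = np.random.RandomState(43210)
perm = rng.permutation(len(y))
X_shuffled, y_shuffled = X_preprocessed[perm], y[perm]

model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200)
model.fit(X_shuffled, y_shuffled)
print(model.coef_)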

An alternative might be to set shuffle to True and pass the random state to the cross validator object right below the referenced line, but there may be a better way to handle this (such as exposing the cv object itself as an init parameter to the estimators). I'll follow up.
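For reference, here is roughly what shuffled splits look like with scikit-learn's KFold; this is a sketch of the general mechanism, not the actual python-glmnet internals:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)

# Default KFold: contiguous, deterministic folds -- identical on every run
for train_idx, test_idx in KFold(n_splits=3).split(X):
    print(test_idx)

# shuffle=True randomizes fold membership; seeding random_state
# makes the shuffled folds reproducible
cv = KFold(n_splits=3, shuffle=True, random_state=np.random.RandomState(43210))
for train_idx, test_idx in cv.split(X):
    print(test_idx)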

@stephen-hoover (Contributor)

This was fixed by #24.
