I was hoping to get some clarification on why glmnet for Python is always deterministic regardless of seed, despite the fact that the documentation states the solver is not deterministic (e.g. https://github.com/civisanalytics/python-glmnet/blob/master/glmnet/linear.py#L77). For example, each of the following runs returns the same result, whether or not a seed is set:
from glmnet import ElasticNet
import io
import numpy as np
import pandas as pd
import requests
from sklearn.preprocessing import StandardScaler

# Load data
url = 'https://raw.githubusercontent.com/CCS-Lab/easyml/master/Python/datasets/prostate.csv'
s = requests.get(url).content
prostate = pd.read_csv(io.StringIO(s.decode('utf-8')))

# Generate coefficients from data by hand
X, y = prostate.drop('lpsa', axis=1).values, prostate['lpsa'].values
sclr = StandardScaler()
X_preprocessed = sclr.fit_transform(X)

# no random state
coefficients = []
for i in range(10):
    model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200)
    print(id(model))
    model.fit(X_preprocessed, y)
    coefficients.append(np.asarray(model.coef_))
print(coefficients)
# seed set at outer level
np.random.seed(43210)
coefficients = []
for i in range(10):
    model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200)
    print(id(model))
    model.fit(X_preprocessed, y)
    coefficients.append(np.asarray(model.coef_))
print(coefficients)
# seed set at inner level
coefficients = []
for i in range(10):
    np.random.seed(43210)
    model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200)
    print(id(model))
    model.fit(X_preprocessed, y)
    coefficients.append(np.asarray(model.coef_))
print(coefficients)
# seed set at function level, different seed each loop
coefficients = []
for i in range(10):
    random_state = np.random.RandomState(i)
    model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200, random_state=random_state)
    print(id(model))
    model.fit(X_preprocessed, y)
    coefficients.append(np.asarray(model.coef_))
print(coefficients)
# single random_state shared across all fits
coefficients = []
random_state = np.random.RandomState(43210)
for i in range(10):
    model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200, random_state=random_state)
    print(id(model))
    model.fit(X_preprocessed, y)
    coefficients.append(np.asarray(model.coef_))
print(coefficients)
This is in direct contrast to the behavior of the R version of the glmnet package:
library(easyml)  # devtools::install_github("CCS-Lab/easyml", subdir = "R")
library(glmnet)

data("prostate", package = "easyml")

# Set X, y, and scale X
X <- as.matrix(prostate[, -9])
y <- prostate[, 9]
X_scaled <- scale(X)

# no seed
m <- 10
n <- ncol(X)
Z <- matrix(NA, nrow = m, ncol = n)
for (i in (1:m)) {
  model_cv <- cv.glmnet(X_scaled, y, standardize = FALSE)
  model <- glmnet(X_scaled, y)
  coefs <- coef(model, s = model_cv$lambda.min)
  Z[i, ] <- as.numeric(coefs)[-1]
}
print(Z)
# Seed set at outer level
set.seed(43210)
m <- 10
n <- ncol(X)
Z <- matrix(NA, nrow = m, ncol = n)
for (i in (1:m)) {
  model_cv <- cv.glmnet(X_scaled, y, standardize = FALSE)
  model <- glmnet(X_scaled, y)
  coefs <- coef(model, s = model_cv$lambda.min)
  Z[i, ] <- as.numeric(coefs)[-1]
}
print(Z)
# Seed set at inner level
Z <- matrix(NA, nrow = m, ncol = n)
for (i in (1:m)) {
  set.seed(43210)
  model_cv <- cv.glmnet(X_scaled, y, standardize = FALSE)
  model <- glmnet(X_scaled, y)
  coefs <- coef(model, s = model_cv$lambda.min)
  Z[i, ] <- as.numeric(coefs)[-1]
}
print(Z)
# Different seed set each loop at inner level
Z <- matrix(NA, nrow = m, ncol = n)
for (i in (1:m)) {
  set.seed(i)
  model_cv <- cv.glmnet(X_scaled, y, standardize = FALSE)
  model <- glmnet(X_scaled, y)
  coefs <- coef(model, s = model_cv$lambda.min)
  Z[i, ] <- as.numeric(coefs)[-1]
}
print(Z)
Thank you for asking this question. As far as I know, the splits in the sklearn cross validator are deterministic by default, which is causing the issue. An immediate workaround is randomizing the dataset row order as a preprocessing step.
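For concreteness, here is a minimal sketch of that workaround, reusing X_preprocessed, y, and ElasticNet from the example above (the permutation seed is arbitrary). Shuffling the rows changes which observations land in each of the otherwise-deterministic CV folds, so different permutations can select a different lambda and hence produce different coefficients:

# Row-shuffling workaround (assumes X_preprocessed, y, ElasticNet from above).
# The permutation is the only source of run-to-run variation here.
rng = np.random.RandomState(43210)  # arbitrary seed for illustration
perm = rng.permutation(len(y))
X_shuffled, y_shuffled = X_preprocessed[perm], y[perm]

model = ElasticNet(alpha=1, standardize=False, cut_point=0.0, n_lambda=200)
model.fit(X_shuffled, y_shuffled)
print(np.asarray(model.coef_))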
An alternative might be to set shuffle to True and pass the random state to the cross validator object right below the referenced line, but there may be a better way to handle this (such as exposing the cv object itself as an init parameter to the estimators). I'll follow up.
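In scikit-learn terms, that change would look roughly like the sketch below. This is an illustration of the mechanism only, not the library's actual internals: KFold stands in for whatever splitter python-glmnet constructs near the referenced line, and n_splits=3 is an assumed value.

from sklearn.model_selection import KFold

# Illustration only, not python-glmnet's internals. With shuffle=True,
# fold assignment is driven by random_state, so the CV-selected lambda
# (and thus the coefficients) can vary across seeds.
cv = KFold(n_splits=3, shuffle=True, random_state=42)  # n_splits assumed
for train_idx, test_idx in cv.split(X_preprocessed):
    print(train_idx[:5], test_idx[:5])  # membership varies with the seed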
If the https://github.com/civisanalytics/python-glmnet/ version of glmnet is a wrapper around the same Fortran code, why does the behavior differ between R and Python?