Import required libraries.

In [1]:
!pip install group-lasso --quiet

In [72]:
!pip install rpy2 --quiet

In [37]:
# !pip install --user glmnet_py
# !pip install --upgrade scipy --quiet
# !pip uninstall -y scipy
# !pip install scipy==1.13.1

In [85]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from group_lasso import GroupLasso, LogisticGroupLasso
import rpy2.robjects as ro
import rpy2.robjects.numpy2ri as np2ri

In [75]:
ro.r('install.packages("glmnet")')  # install glmnet, an R package

(as ‘lib’ is unspecified)

	‘/tmp/RtmpCANoAt/downloaded_packages’



Initialize notebook 2 with the code from notebook 0.

In [76]:
# Load Dataset
mnist = fetch_openml("mnist_784")

# Filter and Prepare Dataset
keys = list(mnist.keys())
df = pd.concat([mnist[keys[0]],mnist[keys[1]]],axis=1)
df['class'] = df['class'].apply(lambda x: int(x))
bool_mask = df['class'].isin([3, 5, 8])  # keep only digits 3, 5 and 8
df = df.loc[bool_mask].reset_index(drop=True)

# Split data into training and testing
cutoff = int(df.shape[0]*(0.8))
X, y = df.iloc[:,:-1], df['class']
X_train, X_test = X[:cutoff], X[cutoff:]
y_train, y_test = y[:cutoff], y[cutoff:]

# Shuffle the training rows (the train/test boundary itself is not randomized)
n = X_train.shape[0]
rand = np.random.permutation(n)
X_train = X_train.iloc[rand, :].reset_index(drop=True)
y_train = y_train.iloc[rand].reset_index(drop=True)

assert X_train.shape[0] + X_test.shape[0] == df.shape[0], 'ERROR'
assert X_train.shape[1] == X_test.shape[1] == df.shape[1] - 1, 'ERROR' #class column == y
assert y_train.shape[0] + y_test.shape[0] == df.shape[0], 'ERROR'

# Formats
print(f'X_train: {X_train.shape},\nX_test: {X_test.shape},\ny_train: {y_train.shape},\ny_test: {y_test.shape}')

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train: (16223, 784),
X_test: (4056, 784),
y_train: (16223,),
y_test: (4056,)
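
The scaling step fits the mean and standard deviation on the training rows only and reuses them for the test rows. The sketch below mimics that behaviour in plain NumPy on toy data. It assumes, as StandardScaler does, the population standard deviation and a scale of 1 for zero-variance columns, which matters for image data where some pixels may be constant. The helper name `fit_standardizer` and the toy arrays are illustrative, not part of the notebook.

```python
import numpy as np

def fit_standardizer(X):
    """Return per-column mean and scale estimated from X (training data only)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)      # population standard deviation (ddof=0)
    std[std == 0.0] = 1.0    # constant columns: avoid division by zero
    return mean, std

# Toy data: column 0 is constant (like an always-black pixel), column 1 varies.
X_tr = np.array([[0.0, 1.0], [0.0, 3.0], [0.0, 5.0]])
X_te = np.array([[0.0, 3.0], [0.0, 7.0]])

mean, std = fit_standardizer(X_tr)
Z_tr = (X_tr - mean) / std   # training data: zero mean per column
Z_te = (X_te - mean) / std   # test data: scaled with the training statistics
print(Z_tr)
print(Z_te)
```

Because the test rows are scaled with the training statistics, their columns need not have zero mean or unit variance.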


Apply group-lasso regularized multinomial logistic regression to select
features that separate the three digits.  The glmnet package supports this, but it is only available in R; the Python wrappers we tried did not work.  We therefore pass the data to R through rpy2 and run glmnet there.
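
To make the selection mechanism concrete, here is a minimal NumPy sketch of block soft-thresholding, the proximal step associated with a group-lasso penalty when each feature's coefficients across the classes form one group. It is an illustration under that assumption, not glmnet's internal algorithm; the function name, the toy matrix, and the threshold `t` are hypothetical.

```python
import numpy as np

def group_soft_threshold(B, t):
    """Shrink each row of B by t in Euclidean norm; rows with norm <= t become zero."""
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return B * scale

# Toy coefficient matrix: 3 features (rows) x 3 classes (columns).
B = np.array([[3.0, 4.0, 0.0],   # row norm 5.0
              [0.3, 0.0, 0.4],   # row norm 0.5
              [0.0, 0.0, 2.0]])  # row norm 2.0

shrunk = group_soft_threshold(B, t=1.0)
selected = np.flatnonzero(np.linalg.norm(shrunk, axis=1) > 0)
print(shrunk)
print(selected)  # indices of features that survive the threshold
```

A feature is either kept for all classes or removed for all classes, which is why the grouped penalty acts as a feature selector rather than a per-class coefficient selector.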

In [105]:
# Convert to R format
X_train_r = ro.r['matrix'](np2ri.numpy2rpy(X_train_scaled), nrow=X_train_scaled.shape[0], ncol=X_train_scaled.shape[1])
y_train_r = ro.IntVector(y_train)

# Convert Data to R in order to use glmnet
ro.r.assign("X_train_scaled", X_train_r)
ro.r.assign("y_train", y_train_r)

# Put every feature in its own group (group ids 1..num_features)
num_features = X_train_scaled.shape[1]
groups = np.arange(1, num_features + 1)

# Convert the groups to R
groups_r = ro.IntVector(groups)
ro.r.assign("groups", groups_r)

IntVector with 784 elements: 1, 2, 3, ..., 782, 783, 784


In [132]:
# Fit the multinomial lasso path with glmnet
ro.r('''
library(glmnet)

# alpha = 1 selects the lasso penalty. Note: glmnet's documented arguments do
# not include `group`; the grouped multinomial penalty is requested with
# type.multinomial = "grouped", so `group = groups` may have no effect here.
fit <- glmnet(X_train_scaled, y_train, family = "multinomial", alpha = 1, group = groups)

# Plot the regularization path
plot(fit, xvar = "lambda", label = TRUE)
''')

# Extract the fitted object, lambda sequence, and coefficients
fit_result = ro.r('fit')
fit_lambda = ro.r('fit$lambda')  # lambda values along the path

# fit$beta is a list with one sparse matrix per class; densify each in R first
coefficients = ro.r('lapply(fit$beta, as.matrix)')

# Stack into a NumPy array of shape (n_classes, n_features, n_lambda)
coefficients_np = np.array([np.asarray(m) for m in coefficients])
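
Once the coefficients are available as a dense array, the selected features at each lambda can be read off by checking which features have a non-zero coefficient for at least one class. The sketch below assumes an array of shape (n_classes, n_features, n_lambda); the toy array `coefs` is a hypothetical stand-in for the real coefficients, not output from the fit.

```python
import numpy as np

# Hypothetical stand-in for dense coefficients: (n_classes, n_features, n_lambda)
coefs = np.zeros((3, 4, 2))
coefs[1, 3, 0] = 0.1    # feature 3 is active at the first lambda
coefs[1, 3, 1] = 0.3
coefs[0, 1, 1] = 0.5    # feature 1 enters at the second lambda
coefs[2, 1, 1] = -0.2

# A feature counts as selected if any class has a non-zero coefficient for it.
active = np.any(coefs != 0, axis=0)              # shape (n_features, n_lambda)
n_selected = active.sum(axis=0)                  # number of selected features per lambda
selected_last = np.flatnonzero(active[:, -1])    # feature indices at the last lambda

print(n_selected)
print(selected_last)
```

For MNIST, a selected index can be mapped back to a pixel position with `divmod(index, 28)`, since each image is 28 x 28.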

In [128]:
# Df: Number of non-zero coefficients in the model for each value of lambda.
# %Dev: The percentage of deviance explained by the model at each regularization step.
# Lambda: The value of the regularization parameter (lambda), which controls the strength of the penalization.
print(fit_result)


Call:  glmnet(x = X_train_scaled, y = y_train, family = "multinomial",      alpha = 1, group = groups) 

     Df  %Dev   Lambda
1     0  0.00 0.312300
2     2  3.62 0.284500
3     2  6.63 0.259200
4     3  9.15 0.236200
5     3 11.32 0.215200
6     5 13.94 0.196100
7     8 17.91 0.178700
8    11 21.87 0.162800
9    12 25.42 0.148400
10   15 28.70 0.135200
11   20 32.04 0.123200
12   21 35.32 0.112200
13   25 38.32 0.102300
14   26 41.06 0.093170
15   28 43.55 0.084890
16   32 45.90 0.077350
17   34 48.10 0.070480
18   38 50.10 0.064220
19   45 52.07 0.058510
20   52 54.01 0.053310
21   55 55.85 0.048580
22   61 57.56 0.044260
23   71 59.20 0.040330
24   81 60.78 0.036750
25   83 62.26 0.033480
26   93 63.67 0.030510
27  100 65.02 0.027800
28  105 66.29 0.025330
29  112 67.48 0.023080
30  120 68.63 0.021030
31  129 69.72 0.019160
32  136 70.75 0.017460
33  142 71.70 0.015910
34  145 72.57 0.014490
35  153 73.40 0.013210
36  161 74.19 0.012030
37  166 74.92 0.010960
38  174 75.60 0.0099
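
One way to use this table is to pick the sparsest model that reaches a chosen share of explained deviance. The sketch below does that for the first ten rows, copied from the printed path above; the 25% target is an arbitrary, hypothetical threshold, not a value used in the notebook.

```python
import numpy as np

# First ten rows of the printed path (Df, %Dev, Lambda), copied from the output above.
df_path = np.array([0, 2, 2, 3, 3, 5, 8, 11, 12, 15])
dev_path = np.array([0.00, 3.62, 6.63, 9.15, 11.32, 13.94, 17.91, 21.87, 25.42, 28.70])
lam_path = np.array([0.3123, 0.2845, 0.2592, 0.2362, 0.2152,
                     0.1961, 0.1787, 0.1628, 0.1484, 0.1352])

target_dev = 25.0  # hypothetical threshold on %Dev

reached = dev_path >= target_dev
if not reached.any():
    raise ValueError("no model on this part of the path reaches the target deviance")

# Lambda decreases along the path, so the first hit is the sparsest qualifying model.
idx = int(np.argmax(reached))
print(idx + 1, df_path[idx], dev_path[idx], lam_path[idx])
```

In practice, lambda is more often chosen by cross-validation (for example with `cv.glmnet` in R) than by a fixed deviance threshold.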

See the PDF output for the regularization paths.

*End of notebook 2*