In [8]:
import numpy as np
import scipy.io  # saving to MATLAB with scipy.io.savemat
import scipy.sparse.linalg  # sparse SVD (svds), used for the RCV1 data

In [9]:
import sklearn
sklearn.__version__

'0.24.2'

# Leukemia dataset

## From LIBSVM

https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#leukemia

Can be loaded directly in MATLAB using "libsvmread" from the LIBSVM package. In Python, the files can be read with `sklearn.datasets.load_svmlight_file`.

Source: [T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531, 1999.]

Preprocessing: Merge training/testing. Instance-wise normalization to mean zero and variance one. Then feature-wise normalization to mean zero and variance one. [S.K. Shevade and S. S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.]
- \# of classes: 2
- \# of data: 38 / 34 (testing)
- \# of features: 7129

Files:
 - leu.bz2
 - leu.t.bz2 (testing)
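
The two-stage preprocessing described above (instance-wise, then feature-wise standardization) can be sketched on toy data as follows. This is only an illustration: the array `A` is made up, and using the population variance (`ddof=0`) is an assumption, since the source does not say which variance estimator was used.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(loc=5.0, scale=3.0, size=(6, 4))  # toy data: 6 samples, 4 features

# Instance-wise: each row (sample) gets mean zero and variance one
A_rows = (A - A.mean(axis=1, keepdims=True)) / A.std(axis=1, keepdims=True)

# Feature-wise: each column (feature) gets mean zero and variance one
A_std = (A_rows - A_rows.mean(axis=0)) / A_rows.std(axis=0)

col_means = A_std.mean(axis=0)
col_vars = A_std.var(axis=0)
```

After the second step the columns are exactly standardized, while the rows are in general only approximately so.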


In [9]:
from sklearn.datasets import load_svmlight_file

# download the file here
# http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#leukemia
X, y = load_svmlight_file("datasets/classification/Leukemia/leu.bz2")
#X = X.astype(float)
y[y == -1] = 0

X_test, y_test = load_svmlight_file("datasets/classification/Leukemia/leu.t.bz2")
y_test[y_test == -1] = 0
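
As a rough illustration of what `load_svmlight_file` does, the sketch below parses a tiny in-memory file in LIBSVM format (`label index:value ...`) and applies the same label remapping as above. The three lines of data are invented for the example and are not taken from the Leukemia files.

```python
import io
import numpy as np
from sklearn.datasets import load_svmlight_file

# Made-up data in LIBSVM/svmlight format: "<label> <index>:<value> ..."
raw = b"1 1:0.5 3:-1.2\n-1 2:2.0\n1 1:1.0 4:0.3\n"

# A binary file-like object is accepted in place of a path
X_toy, y_toy = load_svmlight_file(io.BytesIO(raw))

# Map labels {-1, 1} to {0, 1}, as done for the real data
y_toy[y_toy == -1] = 0
```

The loader returns `X_toy` as a SciPy sparse matrix and `y_toy` as a float array; the feature indices in the file are one-based, which the default `zero_based="auto"` heuristic detects here.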

In [10]:
X.shape # (38, 7129)

(38, 7129)

In [11]:
X_test.shape # (34, 7129)

(34, 7129)

- Saving to MATLAB: using whole data (training and test)

In [12]:
scipy.io.savemat('Leukemia.mat', dict(A_train=X, y_train=y, A_test = X_test, y_test = y_test))
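
To check what ends up in the `.mat` file, a small round trip with `scipy.io.loadmat` can help. The sketch below uses a made-up 2x3 sparse matrix and a temporary file (both assumptions for illustration, not the Leukemia data).

```python
import os
import tempfile
import numpy as np
import scipy.io
import scipy.sparse

# Toy sparse matrix and labels standing in for X and y
A = scipy.sparse.csr_matrix(np.array([[0.0, 1.5, 0.0],
                                      [2.0, 0.0, 0.0]]))
y = np.array([1.0, 0.0])

path = os.path.join(tempfile.mkdtemp(), "toy.mat")
scipy.io.savemat(path, dict(A=A, y=y))

loaded = scipy.io.loadmat(path)
A_back = loaded["A"]  # comes back as a sparse matrix
y_back = loaded["y"]  # 1-D arrays come back as 2-D row vectors by default
```

On the MATLAB side, this means `y` is a 1-by-n row vector unless `oned_as='column'` is passed to `savemat`.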

## Loading with scikit-learn (DEPRECATED!)
**Works only with older versions of sklearn (e.g. 0.15); `fetch_mldata` was deprecated in 0.20 and removed in 0.22.**

See https://scikit-learn.org/0.15/modules/generated/sklearn.datasets.fetch_mldata.html
"Load the ‘leukemia’ dataset from mldata.org, which needs to be transposed to respects the sklearn axes convention:"

In [None]:
from sklearn.datasets import fetch_mldata #DEPRECATED!!
import tempfile
test_data_home = tempfile.mkdtemp() #temp directory/file

In [None]:
leuk = fetch_mldata('leukemia', transpose_data=True,data_home=test_data_home)
leuk.data.shape

See: https://github.com/EugeneNdiaye/Gap_Safe_Rules/blob/master/experiments/bench_logreg.py

In [None]:
dataset = "leukemia"
data = fetch_mldata(dataset)
X = data.data  # [:, ::10]
y = data.target
X = X.astype(float)
y = y.astype(float)
y[y == -1] = 0

# Colon Cancer dataset

## From LIBSVM

https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#colon-cancer

Can be loaded directly in MATLAB using "libsvmread" in LIBSVM package.

Source: [U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12):6745–6750, 1999.]

Preprocessing: Instance-wise normalization to mean zero and variance one. Then feature-wise normalization to mean zero and variance one. [S. K. Shevade and S. S. Keerthi, 2003; same reference as for the Leukemia dataset]
- \# of classes: 2
- \# of data: 62
- \# of features: 2,000

Files:
   - colon-cancer.bz2

In [3]:
from sklearn.datasets import load_svmlight_file

# download the file here
# http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#colon-cancer
X, y = load_svmlight_file("datasets/classification/Other-LIBSVM/colon-cancer.bz2")
#X = X.astype(float)
y[y == -1] = 0 #classes are to be represented by 0 and 1

In [4]:
X.shape # (62, 2000)

(62, 2000)

- Saving to MATLAB: using the whole dataset (this dataset has no separate test split)

In [20]:
scipy.io.savemat('Colon-Cancer.mat', dict(A=X, y=y))

# RCV1.binary dataset

## From LIBSVM

https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#rcv1.binary

Can be loaded directly in MATLAB using "libsvmread" from the LIBSVM package. In Python, the files can be read with `sklearn.datasets.load_svmlight_file`.

Source: [Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5, 361-397.]

Preprocessing: positive: CCAT, ECAT; negative: GCAT, MCAT; instances in both positive and negative classes are removed. [DL04b]
- \# of classes: 2
- \# of data: 20,242 / 677,399 (testing)
- \# of features: 47,236    

Files:
   - rcv1_train.binary.bz2 (training; not used below)
   - rcv1_test.binary.bz2 (testing; this larger file is the one loaded below)

In [13]:
from sklearn.datasets import load_svmlight_file

# download the file here
# https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#rcv1.binary
#X, y = load_svmlight_file("datasets/classification/Other-LIBSVM/rcv1_train.binary.bz2")
X, y = load_svmlight_file("datasets/classification/Other-LIBSVM/rcv1_test.binary.bz2")
#X = X.astype(float)
y[y == -1] = 0 #classes are to be represented by 0 and 1

In [14]:
X.shape # train: (20242, 47236), test: (677399, 47236)

(677399, 47236)

In [200]:
## 1000 samples (~9000 features)
#Select subset of samples (in X and y)
idx0 = X.getnnz(1)>10 #keep only rows with more than 'x' non-zeros (here x=10)
X = X[idx0] 
y = y[idx0]
idx1 = np.random.choice(X.shape[0], 1000, replace=False) #Random subset of 1000 rows
X = X[idx1] 
y = y[idx1]

X = X[:,X.getnnz(0)>0] #remove all-zero columns
X.shape #new shape

(1000, 9679)
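
The subsampling above is not reproducible because `np.random.choice` is called without a fixed seed. A minimal reproducible sketch of the same three steps (row filter, random row subset, removal of all-zero columns) is shown below on synthetic sparse data; the matrix sizes, the threshold `min_nnz` and the seed are arbitrary choices made for illustration.

```python
import numpy as np
import scipy.sparse

rng = np.random.default_rng(0)  # fixed seed so the subset can be reproduced

# Synthetic stand-in for the RCV1 matrix and labels
X_s = scipy.sparse.random(200, 50, density=0.1, format="csr", random_state=0)
y_s = rng.integers(0, 2, size=200).astype(float)

# 1) keep rows with more than min_nnz non-zeros
min_nnz = 3
keep_rows = X_s.getnnz(axis=1) > min_nnz
X_s, y_s = X_s[keep_rows], y_s[keep_rows]

# 2) random subset of rows, drawn without replacement
n_sub = min(20, X_s.shape[0])
idx = rng.choice(X_s.shape[0], n_sub, replace=False)
X_s, y_s = X_s[idx], y_s[idx]

# 3) drop columns that became all-zero after the row selection
X_s = X_s[:, X_s.getnnz(axis=0) > 0]
```

Applying the same index arrays to `X_s` and `y_s` at every step keeps samples and labels aligned.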

In [16]:
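#verify rank (requires a dense copy, so only feasible after subsampling)
#note: the output shown below exceeds 1000, so it presumably comes from a run on the 2000-sample subset
myrank = np.linalg.matrix_rank(X.toarray()) #rank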
myrank

1999

In [7]:
#normalize columns of X (original data is normalized by samples, i.e. by rows)
from sklearn.preprocessing import normalize
X = normalize(X, axis=0)
#scipy.sparse.linalg.norm(X[:,10]) # Verify column norms

In [8]:
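# NOTE: svds computes only k=6 singular triplets by default, so pinvX below is a
# rank-6 approximation of the pseudo-inverse, not the full pseudo-inverse.
U, S, Vt = scipy.sparse.linalg.svds(X)
pinvX = np.dot(Vt.T * (1/S), U.T)
pinvX_1 = np.linalg.norm(pinvX,1) # induced 1-norm, i.e. np.max(np.sum(np.abs(pinvX),axis=0))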

In [9]:
pinvX_1

2.739335660839793
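
Because `scipy.sparse.linalg.svds` returns only `k=6` singular triplets unless `k` is given, the pseudo-inverse built from its output is a truncated one. The sketch below contrasts it with the full pseudo-inverse obtained from a dense SVD, on a small synthetic matrix (sizes, density and tolerance are assumptions chosen for illustration; a dense SVD is only practical when the matrix fits in memory).

```python
import numpy as np
import scipy.sparse
import scipy.sparse.linalg

# Small synthetic stand-in for the subsampled data matrix
X_toy = scipy.sparse.random(30, 80, density=0.3, format="csr", random_state=0)

# Truncated pseudo-inverse: svds keeps k=6 singular triplets by default
U, S, Vt = scipy.sparse.linalg.svds(X_toy)
pinv_trunc = (Vt.T * (1.0 / S)) @ U.T

# Full pseudo-inverse from a dense thin SVD, discarding negligible singular values
Ud, Sd, Vtd = np.linalg.svd(X_toy.toarray(), full_matrices=False)
tol = Sd.max() * max(X_toy.shape) * np.finfo(float).eps
keep = Sd > tol
pinv_full = (Vtd[keep].T * (1.0 / Sd[keep])) @ Ud[:, keep].T

# Induced 1-norm (maximum absolute column sum) of each version
norm1_trunc = np.linalg.norm(pinv_trunc, 1)
norm1_full = np.linalg.norm(pinv_full, 1)
```

The two norms generally differ, so which one is saved as `pinvA_1` depends on whether the truncated or the full pseudo-inverse is intended.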

- Saving to MATLAB: the subsampled, column-normalized matrix, its labels, and the 1-norm of the pseudo-inverse computed above

In [10]:
scipy.io.savemat('rcv1.mat', dict(A=X, y=y, pinvA_1=pinvX_1))

- Other subsampling sizes (requires reloading the dataset)

In [15]:
## 2000 samples (~15000 features)
#Select subset of samples (in X and y)
idx0 = X.getnnz(1)>50 # keep only rows with more than 'x' non-zeros.
                      # x=10 => 675317 samples | x=50 => 380847 samples | x=100 => 162035 samples | x=200 => 19347 samples |  x=300 => 2151 samples
X = X[idx0]
y = y[idx0]
idx1 = np.random.choice(X.shape[0], 2000, replace=False)
X = X[idx1] #Random subset of 2000 rows
y = y[idx1]

X = X[:,X.getnnz(0)>0] #remove all-zero columns
X.shape #new shape

(2000, 15175)
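
The sample counts listed per threshold in the comment above can be obtained by counting rows before filtering. A minimal sketch on synthetic data follows; the matrix and the list of thresholds are made up for illustration.

```python
import numpy as np
import scipy.sparse

# Synthetic stand-in for the full sparse data matrix
X_full = scipy.sparse.random(500, 100, density=0.05, format="csr", random_state=0)

row_nnz = X_full.getnnz(axis=1)  # number of non-zeros in each row
thresholds = [0, 2, 5, 10]

# Number of rows that would survive the filter `row_nnz > x` for each x
kept = {x: int((row_nnz > x).sum()) for x in thresholds}
```

Checking these counts first makes it easier to pick a threshold that still leaves enough rows for the desired subset size.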

In [58]:
## 3000 samples (~22000 features) - hard to obtain full-rank with 3000 samples or more.
#Select subset of samples (in X and y)
idx0 = X.getnnz(1)>200 #keep only rows with more than 'x' non-zeros
X = X[idx0]
y = y[idx0]
idx1 = np.random.choice(X.shape[0], 3000, replace=False)
X = X[idx1] #Random subset of 3000 rows
y = y[idx1]

X = X[:,X.getnnz(0)>0] #remove all-zero columns
X.shape #new shape

(3000, 22446)

# Other classification datasets

## 20News

See: https://github.com/EugeneNdiaye/Gap_Safe_Rules/blob/master/experiments/bench_logreg.py 

Download link: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#news20.binary (LIBSVM)

Source: S. S. Keerthi and D. DeCoste. A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6:341–361, 2005.

Preprocessing: Each instance has unit length.
- \# of classes: 2
- \# of data: 19,996
- \# of features: 1,355,191

**ALTERNATIVELY: can be directly loaded in MATLAB using "libsvmread" in LIBSVM package.**

In [209]:
from sklearn.datasets import load_svmlight_file

# download the file here
# http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#news20.binary
X, y = load_svmlight_file("datasets/classification/News20/news20.binary")
X = X.astype(float)
y = y.astype(float)
y[y == -1] = 0

In [210]:
X.shape #(19996, 1355191)
#X[:2,:2].data #visualizing some data (float)

(19996, 1355191)

In [221]:
idx0 = X.getnnz(1)>50
X = X[idx0] #keep rows with more than 50 non-zeros
y = y[idx0]
#idx1 = np.random.choice(X.shape[0], 1000, replace=False)
#X = X[idx1] #Random subset of 1000 rows
#y = y[idx1]

X = X[:,X.getnnz(0)>20] #keep columns with more than 20 non-zeros

X.shape #new shape

(18376, 43993)

In [None]:
# Warning: the dense copy of an 18376 x 43993 float64 matrix needs about 6.5 GB of memory
myrank = np.linalg.matrix_rank(X.toarray()) #rank
myrank

In [None]:
# Save to MATLAB
scipy.io.savemat('20newsgroups_binary.mat', dict(A=X, y=y))

## Synthetic

See: https://github.com/EugeneNdiaye/Gap_Safe_Rules/blob/master/experiments/bench_logreg.py 

Synthetic data option using [sklearn.datasets.make_classification](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html?highlight=make_classification#sklearn.datasets.make_classification)

In [19]:
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=50,
                           n_features=3000,
                           n_classes=2,
                           random_state=42)
X = X.astype(float)
X /= np.sqrt(np.sum(X ** 2, axis=0))
mask = np.sum(np.isnan(X), axis=0) == 0
if np.any(mask):
    X = X[:, mask]
y = y.astype(float) #labels {0,1}

In [20]:
X.shape #(50, 3000)

(50, 3000)
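
The NaN mask in the cell above guards against all-zero columns, whose norm is zero and which therefore turn into NaN when divided by it. The toy example below (a made-up 2x3 array with one zero column) shows the effect in isolation.

```python
import numpy as np

# Toy matrix whose middle column is all zeros
X_toy = np.array([[3.0, 0.0, 1.0],
                  [4.0, 0.0, 1.0]])

# Column-wise l2 normalization; 0/0 yields NaN for the zero column
with np.errstate(invalid="ignore", divide="ignore"):
    Xn = X_toy / np.sqrt(np.sum(X_toy ** 2, axis=0))

# Keep only the columns that contain no NaN
mask = np.sum(np.isnan(Xn), axis=0) == 0
Xn = Xn[:, mask]

col_norms = np.sqrt(np.sum(Xn ** 2, axis=0))
```

Here the zero column is dropped and the remaining columns have unit norm.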

## Digits dataset

https://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic_l1_l2_sparsity.html#sphx-glr-auto-examples-linear-model-plot-logistic-l1-l2-sparsity-py

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_digits(return_X_y=True)

X = StandardScaler().fit_transform(X)

# classify small against large digits
y = (y > 4).astype(int)  # np.int was removed in NumPy 1.24; the builtin int is equivalent

## Other

- LIBSVM : Extensive list of binary classification datasets (loadable directly in MATLAB) https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

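- MLdata : https://www.mldata.io/ (Example: Smartphone Activity dataset, 6 classes, 10299x562). Note that sklearn's deprecated `fetch_mldata` fetched the Leukemia dataset from mldata.org (see the documentation quote above), which appears to be a different site.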

- Scikit Learn : https://scikit-learn.org/stable/datasets/index.html
