# SRBCT Microarray Data
Gene expression arrays are an important new technology in biology. The data for this example form a matrix of 2308 genes (columns) and 63 samples (rows) from a set of microarray experiments. Each expression value is a log-ratio $\log(R/G)$, where $R$ is the amount of gene-specific RNA in the target sample that hybridizes to a particular (gene-specific) spot on the microarray, and $G$ is the corresponding amount of RNA from a reference sample. The samples arose from small, round blue-cell tumors (SRBCT) found in children and are classified into four major types: BL (Burkitt lymphoma), EWS (Ewing's sarcoma), NB (neuroblastoma), and RMS (rhabdomyosarcoma). There is an additional test data set of 20 observations.
SRBCT gene expression data.

Cancer classes are labelled 1, 2, 3, 4 for EWS, RMS, NB, and BL, respectively.
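
The two conventions described above, the log-ratio expression value and the integer class codes, can be illustrated with a short sketch. The helper names `log_ratio` and `decode_labels` are hypothetical, the intensities are made up, and base 2 is an assumption, since the description does not state the base of the logarithm.

```python
import math

# Integer class codes as described above
CLASS_NAMES = {1: "EWS", 2: "RMS", 3: "NB", 4: "BL"}

def log_ratio(r, g):
    """Return log2(r / g) for positive intensities r (target) and g (reference)."""
    if r <= 0 or g <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(r / g)

def decode_labels(labels):
    """Map integer class codes to tumor-type abbreviations."""
    return [CLASS_NAMES[int(label)] for label in labels]

# Made-up values, not taken from the SRBCT data
print(log_ratio(200.0, 100.0))   # target has twice the reference amount
print(log_ratio(50.0, 100.0))    # target has half the reference amount
print(decode_labels([1, 4, 2]))
```

A positive log-ratio means the gene is more abundant in the target sample than in the reference, a negative one means it is less abundant, and zero means the two amounts are equal.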

In [1]:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

%matplotlib inline

# define commonly used colors
GRAY1, GRAY4, PURPLE = '#231F20', '#646369', '#A020F0'
BLUE, ORANGE, BLUE1 = '#57B5E8', '#E69E00', '#174A7E'
# configure plot font family to Arial
plt.rcParams['font.family'] = 'Arial'
plt.rcParams['axes.linewidth'] = 0.5

## Load and Prepare Data

In [2]:
data = np.load('../data/srbct.npy.npz')['data']

# last column contains 'is train' flag
is_train = data[:,-1].astype(int)
data_test = data[is_train == 0, :]
data_train = data[is_train == 1, :]
# second-to-last column contains the class label
y_train = data_train[:, -2].astype(int)
y_test = data_test[:, -2].astype(int)
X_train = data_train[:, :-2]
X_test = data_test[:, :-2]

## Nearest Shrunken Centroids

In [9]:
# standardize each gene, then classify by nearest shrunken centroid
nearest_centroid_classifier = Pipeline([
    ('scale', StandardScaler()),
    ('ncc', NearestCentroid())]
)
# choose the shrinkage threshold by 8-fold stratified cross-validation
shrink_threshold_grid_search = GridSearchCV(
    nearest_centroid_classifier,
    {'ncc__shrink_threshold': np.linspace(0, 20, 100)},
    cv=StratifiedKFold(8)
).fit(X_train, y_train)
best_model = shrink_threshold_grid_search.best_estimator_
# number of misclassified test samples
print(np.sum(y_test != best_model.predict(X_test)))


0


In [20]:
# boolean mask of centroid components that were not shrunk to zero
A = ~np.isclose(best_model.named_steps['ncc'].centroids_, 0)

In [27]:
# genes with a nonzero component in at least one class centroid
np.argwhere(np.sum(A, axis=0) > 0).shape

(407, 1)
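
The count above reflects how shrinkage acts as gene selection: centroid components are soft-thresholded, so any component whose magnitude is at most the threshold becomes exactly zero, and a gene that is zero in every class no longer influences the classification. The sketch below shows that operation in plain Python on made-up numbers. `soft_threshold` and `count_active_genes` are hypothetical helper names, and the sketch leaves out the per-gene standard-deviation scaling, so it is an illustration of the idea rather than a reimplementation of scikit-learn's `NearestCentroid`.

```python
def soft_threshold(value, delta):
    """Shrink value toward zero by delta; return 0.0 if |value| <= delta."""
    magnitude = abs(value) - delta
    if magnitude <= 0:
        return 0.0
    return magnitude if value > 0 else -magnitude

def count_active_genes(offsets, delta):
    """Count genes (columns) whose shrunken offset is nonzero in at least one class (row)."""
    shrunken = [[soft_threshold(d, delta) for d in row] for row in offsets]
    n_genes = len(offsets[0])
    return sum(
        any(row[j] != 0.0 for row in shrunken)
        for j in range(n_genes)
    )

# Made-up centroid offsets: 2 classes (rows) x 4 genes (columns)
offsets = [
    [2.5, -0.3, 0.8, -1.6],
    [-2.0, 0.2, -0.9, 0.4],
]
print(count_active_genes(offsets, delta=1.0))  # prints 2
```

With a threshold of 1.0 only the first and last genes keep a nonzero offset. Raising the threshold removes more genes, and a threshold of 0 keeps all of them.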