Support Vector Machines 
=======================

In [None]:
from copy import deepcopy
import numpy as np
import pandas as pd
import plotnine
from plotnine import ggplot, aes, geom_line, geom_text, geom_hline, geom_vline
import sklearn as sk

plotnine.theme_set(plotnine.theme_bw())

from ggfuntile import predictionContour
from maclearn_utils_2020 import extractPCs, PcaExtractor
from load_hess import hessTrain, hessTrainY, hessTest, hessTestY, probeAnnot

twoProbeData = hessTrain.T.loc[:, ["205548_s_at", "201976_s_at"]].copy()
twoProbeData.columns =\
        probeAnnot.loc[twoProbeData.columns, "Gene.Symbol"]

Linear models, including logistic regression and DLDA, are very useful
in many contexts but do have some characteristics which can be
limiting. *Support vector machines*, or SVMs
([@cortes1995support; @hastie2009elements]), are a type of
supervised classification algorithm which addresses two particular
limitations:
1.  The parameters fit by classical linear classification algorithms
    are generally sensitive to extremely easy-to-call sampling units
    -   (correctly called sampling units whose feature vectors are
        very far from the decision boundary)
    -   even when a more accurate classifier might result from
        parameters which move these outlier probabilities "in the wrong
        direction"
    -   but not far enough to change the final classification made.
    
2.  Linearity of response is a very strong and often unrealistic
    assumption; many real-world response patterns are highly nonlinear.

The term "support vectors" in the name SVMs refers to the feature
vectors corresponding to samples which are close to being on the wrong
side of the decision boundary; for a nice illustration check out
<https://en.wikipedia.org/wiki/Support_vector_machine#/media/File:SVM_margin.png>.

SVM models are fit by positioning the decision boundary so as to keep
the support vectors as far on the right sides as possible; the
parameters defining the decision boundary thus ultimately depend only
on the sampling units corresponding to the support vectors, which
mitigates point 1 above.

Point 2 can also be addressed in the SVM framework using a
mathematical technique known as the "kernel trick." The math here is
beyond the scope of these notes, but the core idea is that you first
apply a nonlinear transformation to the data matrix and then apply SVM
in the transformed coordinates, kind of like when we did feature
extraction prior to fitting a knn model. The trick is that certain
special transformations lead to model fitting problems which may be
described in terms of the *un*transformed coordinates in
intuitively interesting and useful ways.

(What makes this so tricky is that while it has been proven that there
do indeed exist particular transformations that lead to problems
nicely describable as modified versions of the original SVM problem in
untransformed coordinates, the actual transformations
themselves---which are very complicated---aren't actually needed in
doing the computations, just the modified problem description in the
original feature space!)

We'll focus on a specific class of transformations: those which
replace the standard dot products appearing in the mathematical
expressions composing the original SVM problem with so-called "radial
basis function" (RBF) kernels.
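
For concreteness, the RBF kernel between two feature vectors is usually written $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \lVert \mathbf{x} - \mathbf{z} \rVert^2)$, which is the form `sklearn.svm.SVC` uses when `kernel="rbf"`. Here is a minimal plain-Python sketch of that formula; the function name `rbf_kernel` and the toy vectors are made up for illustration and are not part of the code used elsewhere in these notes:

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2) for two equal-length sequences."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

x = [0.0, 0.0]
near = [0.5, 0.0]   # squared distance 0.25 from x
far = [2.0, 0.0]    # squared distance 4.0 from x

# larger gamma -> kernel value drops off faster with distance
for gamma in (0.25, 1.25, 6.25):
    print(gamma, rbf_kernel(x, near, gamma), rbf_kernel(x, far, gamma))
```

The kernel value is 1 for identical vectors and shrinks toward 0 as they move apart, with `gamma` controlling how quickly.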

While SVM models are somewhat more complex than knn models,
revisiting our two-probe-set contour plotting strategy using a
range of `gamma` parameter values reveals a striking similarity in
the types of decision boundaries learned by the two methods:

In [None]:
import sklearn.svm as svm
svmFitter = svm.SVC(kernel="rbf", C=1, gamma=0.25, probability=True)
twoProbeSvmFitSig0p25 = svmFitter.fit(twoProbeData, hessTrainY)
predictionContour(twoProbeSvmFitSig0p25,
                  twoProbeData, hessTrainY, "gamma = 0.25")

Increasing the `gamma` parameter creates a more local---and in this
case, likely more overfit---SVM model (similar to *decreasing*
the number $k$ of nearest neighbors in a knn model):

In [None]:
svmFitter = svm.SVC(kernel="rbf", gamma=1.25, probability=True)
twoProbeSvmFitSig1p25 = svmFitter.fit(twoProbeData, hessTrainY)
predictionContour(twoProbeSvmFitSig1p25,
                  twoProbeData, hessTrainY, "gamma = 1.25")

And increasing `gamma` still further\...

In [None]:
svmFitter = svm.SVC(kernel="rbf", gamma=6.25, probability=True)
twoProbeSvmFitSig6p25 = svmFitter.fit(twoProbeData, hessTrainY)
predictionContour(twoProbeSvmFitSig6p25,
                  twoProbeData, hessTrainY, "gamma = 6.25")

While SVM models using RBF kernels produce classifiers with somewhat
similar properties to knn, you can see that the decision boundaries
tend to be smoother. Why might this be?

The knn approach doesn't care whether the $(k+1)^{\text{th}}$ nearest
neighbor is just ever so slightly farther away than the
$k^{\text{th}}$, or whether the $k^{\text{th}}$ nearest neighbor is 10
times farther away than the $(k-1)^{\text{th}}$, but just gives
equal weight to the closest $k$ and zero weight to everything else.

In contrast, the SVM-with-RBF-kernel approach can be seen as making
predictions based on a *weighted* sum of the known
classifications for nearby training data, with the weightings based on
a smooth function of the distance from training feature vector to the
feature vector to be classified.
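
One way to make this weighted sum concrete is the standard form of the fitted SVM decision function (see, e.g., [@hastie2009elements]), assuming the two classes are coded as $y_i = \pm 1$:
$$f(\mathbf{x}) = b + \sum_{i \in \text{SV}} \alpha_i \, y_i \, \exp\left(-\gamma \, \lVert \mathbf{x}_i - \mathbf{x} \rVert^2\right)$$
Here the sum runs over the support vectors only, the weights $\alpha_i > 0$ and the offset $b$ are learned during fitting, and the sign of $f(\mathbf{x})$ gives the predicted class. The symbols $f$, $\alpha_i$, and $b$ are notation introduced just for this aside; they are not used elsewhere in these notes.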

In [None]:
import sklearn.pipeline as pl
np.random.seed(123)            ## for replicability
pcaSvmPipeline = pl.Pipeline([
    ("featextr", PcaExtractor(m=5)),
    ("classifier", svm.SVC(kernel="rbf", gamma="scale", probability=True))
])
 ## gamma="scale" uses default gamma value (may not be optimal!)
pcaSvmFit = deepcopy(pcaSvmPipeline).fit(hessTrain.T, hessTrainY)
pcaSvmTestPredictionClass = pcaSvmFit.predict(hessTest.T)
pd.crosstab(pcaSvmTestPredictionClass, hessTestY,
            rownames=["prediction"], colnames=["actual"])

We'll assess the performance of this classifier in the training set
as well, but instead of avoiding resubstitution
bias using cross-validation we'll try an alternative resampling
technique known as *bootstrapping*.

Bootstrapping 
=============

Machine learning is generally less concerned with questions about
whether the internal structure of a model is correct, necessary or
interpretable than is classical statistics, but there are still times
when we'd like to be able to characterize the uncertainty or
repeatability associated with an estimated parameter value.

Put another way: if we had another data set generated in the same way
as the one we do have, how similar would the value we estimated for
this or that parameter be to what we get using the actually realized
training data? Do we expect to get basically the same value or
something wildly different?

For linear models, the literature abounds with useful analytical
results on confidence intervals, credible intervals, and the like. But
for other types of modeling strategies, this is rarely the case!

If gathering data were cheap and easy, we could just go ahead and
replicate
-   the experiment which generated the data and then
-   re-fit the model to the newest round of data

many times to empirically estimate the distribution of fit model
parameters.

*Bootstrapping* is a clever approach to *simulate* such
replication using just the one data set we actually have
([@tibshirani1993introduction]). The bootstrapping process
consists of:
1.  Generate a case-resampled data set with feature matrix
    $\mathbf{\underline{X}}^{\text{boot}}$ and outcome vector
    $\mathbf{y}^{\text{boot}}$ by drawing $n$ random integers
    $1 \leq r_i \leq n$ *with replacement* and setting
    $$\begin{aligned}
       x_{ig}^{\text{boot}} &= x_{r_i g} \\
       y_i^{\text{boot}} &= y_{r_i}
      \end{aligned}$$
    Note that the $r_i$ will generally not be unique: $r_i$ and $r_j$
    may be the same sampling unit even when $i \neq j$, so that the same
    sampling unit may be included multiple times in the resampled data
    set!
2.  Fit the desired model to the resampled feature matrix
    $\mathbf{\underline{X}}^{\text{boot}}$ and outcome vector
    $\mathbf{y}^{\text{boot}}$ to learn parameters
    $\hat{\boldsymbol{\theta}}^{\text{boot}}$
    -   $\boldsymbol{\theta}$ is just a way of writing the set of all
        parameters needed by the model pulled together into one big vector,
        while
    -   the "hat" on top of $\hat{\boldsymbol{\theta}}$ indicates
        that we are talking about a specific data-derived estimate of the
        parameter values $\boldsymbol{\theta}$, and
    -   the superscript "boot" on $\hat{\boldsymbol{\theta}}^{\text{boot}}$
        just says the parameters were learned from the bootstrap-resampled
        data as opposed to the original training data set.
    
3.  Use fit model with parameters
    $\hat{\boldsymbol{\theta}}^{\text{boot}}$ to estimate parameter or
    statistic $\hat{\Omega}^{\text{boot}}$ of interest.
4.  Repeat steps 1-3 $B$ times, obtaining values
    $\hat{\Omega}_b^{\text{boot}}$ for $b \in \{1,\ldots,B\}$ using fit models
    with parameters $\hat{\boldsymbol{\theta}}_b^\text{boot}$.

Note that because bootstrap resampling generates new simulated data
sets of the same size $n$ as the original data set but in which some
sampling units are repeated, there will necessarily be some sampling
units that get left out in any particular resampled data set: on
average, a fraction $\frac{1}{e} \approx 0.368$ of all sampling units
will be omitted in each bootstrap sample.
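
The resampling in step 1 and the $\frac{1}{e}$ claim can both be checked with a few lines of plain Python. This is only a sketch: the helper name `bootstrap_indices`, the values of `n` and `B`, and the seed are arbitrary choices, and the indices run from 0 to $n-1$ rather than 1 to $n$ as in the text:

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices from 0..n-1 with replacement (one case-resample)."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(123)   # seeded for replicability
n, B = 200, 500

omitted_fracs = []
for _ in range(B):
    r = bootstrap_indices(n, rng)
    n_omitted = n - len(set(r))   # units never drawn in this resample
    omitted_fracs.append(n_omitted / n)

mean_omitted = sum(omitted_fracs) / B
print(mean_omitted)   # should land close to 1/e for moderately large n
```

For finite $n$ the exact expected fraction omitted is $(1 - \frac{1}{n})^n$, which approaches $\frac{1}{e}$ as $n$ grows.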

Bootstrapping for Performance Estimation 
----------------------------------------

Bootstrapping can also be used as an alternative to cross-validation
for estimation of prediction error $\Omega$.

How should we go about this?
-   We might try to estimate distribution of prediction error
    $\{\hat{\Omega}_b^{\text{full}}\}$
-   making predictions with each bootstrap model $b$ with parameters
    $\hat{\boldsymbol{\theta}}_b^\text{boot}$ applied to full (original)
    training set $\mathbf{\underline{X}}$.

However, since bootstrap training sets were drawn from the same
original feature matrix $\mathbf{\underline{X}}$,
$\{\hat{\Omega}_b^{\text{full}}\}$ will suffer from resubstitution
bias.

Instead we could follow cross-validation methodology:
-   use only fit models with parameters $\hat{\boldsymbol{\theta}}_b$ for
    which
-   sampling unit $i$ not used in the $b^{\text{th}}$ resampled
    training set.

Writing $R_b$ to indicate the set of sampling units included in the
$b^{\text{th}}$ resampled training set:
$$\label{eq:loo-boot}
\hat{\Omega}^{\text{loo-boot}} = \frac{1}{n} \sum\limits_{i} {
    \frac{1}{|\{b \mid i \notin R_b \}|}
    \sum\limits_{\{b \mid i \notin R_b \}} \hat{\Omega}(\hat{\boldsymbol{\theta}}_b, y_i)
}$$
(Aside re: set notation: $\{b \mid i \notin R_b \}$ is the set of
bootstrap iterations $b$ for which sampling unit $i$ does not appear
in the set of sampling units $R_b$, while $|\{b \mid i \notin R_b \}|$
is the number of elements in this set, that is, the number of
bootstrap iterations which omitted sampling unit $i$.)
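
To make the double sum in the leave-one-out bootstrap formula concrete, here is a rough plain-Python sketch using 0/1 misclassification loss. Everything in it is an illustrative assumption rather than the code used in these notes: the function names, the toy one-feature data, and the stand-in 1-nearest-neighbor classifier. It also simply skips any sampling unit that happens never to be omitted, which is one possible convention:

```python
import random

def loo_bootstrap_error(X, y, fit, predict, B=50, seed=0):
    """Leave-one-out bootstrap estimate of misclassification error.

    fit(X, y) returns a model; predict(model, x) returns a label.
    """
    rng = random.Random(seed)
    n = len(y)
    # losses[i] collects 0/1 losses for unit i from models that never saw it
    losses = [[] for _ in range(n)]
    for _ in range(B):
        r = [rng.randrange(n) for _ in range(n)]      # resampled indices R_b
        model = fit([X[i] for i in r], [y[i] for i in r])
        in_sample = set(r)
        for i in range(n):
            if i not in in_sample:
                losses[i].append(int(predict(model, X[i]) != y[i]))
    # average within each unit first, then across units (as in the formula)
    per_unit = [sum(l) / len(l) for l in losses if l]
    return sum(per_unit) / len(per_unit)

# toy 1-nearest-neighbor classifier on a single feature (illustrative only)
def fit_1nn(X, y):
    return list(zip(X, y))

def predict_1nn(model, x):
    return min(model, key=lambda pair: abs(pair[0] - x))[1]

X = [0.1, 0.4, 0.5, 0.9, 1.1, 2.0, 2.2, 2.6, 3.0, 3.3]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
err = loo_bootstrap_error(X, y, fit_1nn, predict_1nn, B=50, seed=1)
print(err)
```

The key point is the `if i not in in_sample` test: a model only ever contributes to the error estimate for sampling units it was not trained on.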

But while $\{\hat{\Omega}_b^{\text{full}}\}$ are generally overly
optimistic, $\hat{\Omega}^{\text{loo-boot}}$ may be too
*pessimistic*, since each bootstrap case-resampled training set
generally contains only a fraction $1-\frac{1}{e} \approx 0.632$ of
the true training sampling units (albeit with some showing up multiple
times!).

Since repeating training sampling units doesn't generally improve
models---the repeated units aren't really new data!---we are
effectively learning models using only $\approx 63.2\%$ of the
available data (albeit randomly upweighting some sampling units
relative to others).
[@efron1997improvements] showed that
$$\label{eq:632-bootstrap}
\hat{\Omega}^{.632} = 0.368 \, \hat{\Omega}^{\text{resub}} + 0.632 \, \hat{\Omega}^{\text{loo-boot}}$$
strikes a good balance between the optimism of $\hat{\Omega}^{\text{resub}}$
and the pessimism of $\hat{\Omega}^{\text{loo-boot}}$ in some situations.

However, in cases where overfitting is more severe,
[@efron1997improvements] recommend
$$\label{eq:632plus-bootstrap}
\hat{\Omega}^{.632+} =
(1-\hat{w}) \, \hat{\Omega}^{\text{resub}} +
\hat{w} \, \hat{\Omega}^{\text{loo-boot}}$$
where $\hat{w} \in [1-\frac{1}{e}, 1]$ depends on the degree of
overfitting.

There is a standard formula for calculating $\hat{w}$ for estimating
prediction error using the .632+ bootstrap which you can look up;
aside from [@efron1997improvements], [@hastie2009elements] has
a nice treatment.
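
As a rough guide to what those references describe, the sketch below writes out both estimators as I understand them from [@hastie2009elements]: a relative overfitting rate $\hat{R}$ compares the gap between the leave-one-out bootstrap and resubstitution errors to the gap between a "no-information" error rate and the resubstitution error, and $\hat{w} = 0.632 / (1 - 0.368 \hat{R})$. The function names are made up, the no-information rate is simply taken as an input, and the clamping details should be checked against the sources before relying on this:

```python
def boot632(err_resub, err_loo_boot):
    """Plain .632 estimate: fixed 0.368 / 0.632 weighting."""
    return 0.368 * err_resub + 0.632 * err_loo_boot

def boot632_plus(err_resub, err_loo_boot, no_info_rate):
    """.632+ estimate with weight w based on relative overfitting rate R."""
    # cap the leave-one-out bootstrap error at the no-information rate
    err_loo = min(err_loo_boot, no_info_rate)
    if err_loo > err_resub and no_info_rate > err_resub:
        R = (err_loo - err_resub) / (no_info_rate - err_resub)
    else:
        R = 0.0                      # no measurable overfitting
    w = 0.632 / (1.0 - 0.368 * R)    # ranges from 0.632 (R=0) to 1 (R=1)
    return (1.0 - w) * err_resub + w * err_loo

# made-up error rates, purely for illustration
print(boot632(0.05, 0.25))
print(boot632_plus(0.05, 0.25, no_info_rate=0.5))
```

With no overfitting ($\hat{R} = 0$) the two estimates coincide; as $\hat{R}$ approaches 1 the .632+ estimate moves all of its weight onto the leave-one-out bootstrap error.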

OK, let's get back to a concrete example: we'll use bootstrapping to
assess the performance of a select-10-feature-for-SVM-modeling
pipeline using 25 bootstrap resamples (this is a relatively low number
for illustration purposes only; most sources suggest $100$
resamples with bootstrapping, but that takes a while!):

In [None]:
 ## bootstrap .632+ implemented in mlxtend module:
from mlxtend.evaluate import bootstrap_point632_score
import sklearn.feature_selection as fs 
fsSvmPipeline = pl.Pipeline([
    ("featsel", fs.SelectKBest(fs.f_regression, k=10)),
    ("classifier", svm.SVC(kernel="rbf", gamma="scale", probability=True))
])
np.random.seed(321)
fsSvmBootAccs = bootstrap_point632_score(
    estimator = fsSvmPipeline,
    X = hessTrain.T.values,  ## .values b/c mlxtend likes numpy arrays
    y = hessTrainY.values,   ## .values b/c mlxtend likes numpy arrays
    method = ".632+",
     ## here do only 25 bootstrap resamples for speed;
     ## (usually recommended to do >= 100 in real usage!)
    n_splits = 25,
     ## also seed mlxtend's own resampler for replicability
    random_seed = 321
)
np.mean(fsSvmBootAccs)

Decision Tree Classifiers
=========================

Decision trees are probably best understood by considering an example. A
single decision tree can be constructed in Python using the function
`sklearn.tree.DecisionTreeClassifier`. 

The standard process of fitting a decision tree, which is sometimes
referred to as "recursive partitioning", actually performs a
form of embedded feature selection, but to keep these notes
as similar to the R version as possible, we'll connect our by now old-hat
$t$-test feature selector upstream of `tree.DecisionTreeClassifier`:

In [None]:
import sklearn.tree as tree
decTreePipeline = pl.Pipeline([
    ("featsel", fs.SelectKBest(fs.f_regression, k=100)),
    ("classifier", tree.DecisionTreeClassifier(
        min_samples_split = 10,  ## don't split if < 10 sampling units in bin
        max_depth = 3            ## split splits of splits but no more!
    ))
])
fsDecTree = decTreePipeline.fit(hessTrain.T, hessTrainY)
tree.plot_tree(fsDecTree[1])

Each node of the tree shown is associated with a subset of the set of
all sampling units. The topmost (or root) node contains all samples,
which are then split (or partitioned) into those samples for which the
expression level for probe set 212745_s_at was measured to be \<
7.673, which flow down to the left node, and those samples with higher
levels of 212745_s_at expression, which go down the right
branch. The fitting algorithm selected the probe set 212745_s_at and
the level 7.673 for the top split because this was determined to be
the best single split to separate pCR patient samples from RD patient
samples.

The "recursive" part of recursive partitioning is then to repeat
this splitting process within each of those sample subpopulations,
*unless* one of the stopping criteria is met. Stopping criteria
are usually based on the size and "impurity" of the sample
subpopulation: If the node is associated with too small a sample
subpopulation it will not be split, or if the sample subpopulation
within the node is sufficiently pure in either one outcome class or
the other (either close to all pCR or close to all RD), there is no
point in further splitting.
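
To give a feel for what "best single split" and "impurity" mean, here is a small plain-Python sketch that scans candidate cutoffs on one feature and picks the one minimizing the size-weighted Gini impurity of the two resulting groups. Gini impurity is the default splitting criterion in `sklearn.tree.DecisionTreeClassifier`, but the function names and the expression values below are made up, and the real fitting code is considerably more elaborate:

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    """Return (threshold, weighted_impurity) of the best single split."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))          # baseline: no split at all
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue                     # cannot split between equal values
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2.0
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        impurity = (len(left) * gini(left)
                    + len(right) * gini(right)) / len(pairs)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best

expr = [6.1, 6.8, 7.2, 7.5, 8.0, 8.4, 9.1]   # made-up expression levels
is_rd = [0, 0, 0, 1, 1, 1, 1]                # made-up outcome labels
thr, imp = best_split(expr, is_rd)
print(thr, imp)
```

Recursive partitioning would then call the same search again on the samples falling on each side of the chosen threshold (over all available features), stopping when a group is too small or already pure enough.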

Classification probabilities for any new sample may then be calculated
by starting at the root and following the branches of the tree
indicated by the sample's feature values until a terminal, or leaf, node
is reached: the fraction of training set samples in the leaf node with
classification RD is then the predicted probability that the patient from
which the new sample is derived will suffer from residual invasive
disease (RD).
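
The root-to-leaf walk just described can be sketched with a tree stored as nested dictionaries. The tree below is entirely hypothetical (the probe names, cutoffs, and leaf counts are invented), and it sends samples left when the feature value is `<=` the threshold, which is the convention scikit-learn uses when reporting splits:

```python
# hypothetical fitted tree as nested dicts; probe names, cutoffs, and
# leaf counts are made up for illustration
toy_tree = {
    "feature": "probe_A", "threshold": 7.5,
    "left": {"n_rd": 2, "n_pcr": 8},                 # leaf
    "right": {
        "feature": "probe_B", "threshold": 5.0,
        "left": {"n_rd": 9, "n_pcr": 3},             # leaf
        "right": {"n_rd": 20, "n_pcr": 0},           # leaf
    },
}

def predict_rd_prob(node, sample):
    """Follow branches to a leaf; return fraction of RD training samples."""
    while "feature" in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["n_rd"] / (node["n_rd"] + node["n_pcr"])

print(predict_rd_prob(toy_tree, {"probe_A": 8.1, "probe_B": 4.2}))
```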

In [None]:
fsDecTreePredProbs = fsDecTree.predict_proba(hessTest.T)
fsDecTreePredProbs[0:5, :]

In [None]:
fsDecTreePredClass = fsDecTree.predict(hessTest.T)
pd.crosstab(fsDecTreePredClass, hessTestY,
            rownames=["prediction"], colnames=["actual"])

Single decision trees are simple and intuitive but, despite the
reasonably good results seen just above, have generally not performed
very well in real world classification tasks. The structure of such
trees also tends to be very sensitive to small changes in the training
data; don't be surprised if you get an entirely different tree if a
single sampling unit is added or removed from the training data set!

There is, however, an approach to machine learning based on multiple
decision trees which has become very popular in the last few
decades\...

Bagging: **Bootstrap** **Agg**regat**ing** Models 
=================================================

We could consider using a set of $B$ bootstrap case-resample trained
models in place of a single model for making predictions.
Repeat for $b \in \{1,\ldots,B\}$:
1.  Generate $\mathbf{\underline{X}}_b$ by drawing $n$
    random integers $R_b=\{r_{b i}\}$ with replacement
    and setting $x_{b ig} = x_{r_{b i} g}$, $y_{b i} = y_{r_{b i}}$.
2.  Fit model using $\mathbf{\underline{X}}_b$ and $\mathbf{y}_b$ to
    obtain fitted parameters $\hat{\boldsymbol{\theta}}_b$.

Bagged predictions for a new datum with feature vector $\mathbf{x}$ are
then made by simply averaging together the predictions of each bagged
submodel $b$ with parameters $\hat{\boldsymbol{\theta}}_b$ for features
$\mathbf{x}$.

From [@breiman1996bagging]:
> For unstable procedures bagging works well ...The evidence,
> both experimental and theoretical, is that bagging can push a
> good but unstable procedure a significant step towards optimality.
> On the other hand, it can slightly degrade the performance of stable
> procedures.

In this context, "stability" refers to the stability of the fit model
parameters $\boldsymbol{\theta}$ with respect to the training data
$\{\mathbf{x}_i, y_i\}$. Recall that I said in the section on
[Decision Tree Classifiers](#Decision-Tree-Classifiers) that decision trees suffer
from exactly this sort of instability!
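
The bagging recipe (resample, refit, average) is short enough to sketch in plain Python. The base learner here is a deliberately crude one-split "stump" on a single feature that only considers calling larger values class 1; the function names, the data, and the choice of `B` are all assumptions made for illustration:

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split 'stump': pick the cutoff with fewest training errors."""
    best = None
    # candidate cutoffs: below everything, or at each observed value
    for cut in [float("-inf")] + sorted(set(xs)):
        # the stump predicts class 1 when x > cut
        errors = sum(int((x > cut) != bool(y)) for x, y in zip(xs, ys))
        if best is None or errors < best[1]:
            best = (cut, errors)
    return best[0]

def bagged_predict(x_new, xs, ys, B=200, seed=0):
    """Average the 0/1 predictions of B stumps fit to bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    votes = 0
    for _ in range(B):
        r = [rng.randrange(n) for _ in range(n)]       # resampled indices
        cut = fit_stump([xs[i] for i in r], [ys[i] for i in r])
        votes += int(x_new > cut)
    return votes / B

xs = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
print(bagged_predict(2.8, xs, ys))
```

A single stump can only answer 0 or 1, and its cutoff jumps around as the training data change; the bagged average instead gives a graded score for points near the class boundary.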

In fact the most well-known application of bagging is indeed the
generation of *random forests* of decision trees
([@breiman1999random]). A random forest is constructed by
repeating, for $b \in \{1,\ldots,B\}$:
1.  Generate $\mathbf{\underline{X}}_b$ and $\mathbf{y}_b$ by drawing $n$ random
    integers $R_b=\{1 \leq r_{b i} \leq n\}$ with replacement and setting
    $x_{b ig} = x_{r_{b i} g}$ and $y_{b i} = y_{r_{b i}}$.
2.  Randomly select $m' < m$ of the features and fit a decision tree
    classifier for $\mathbf{y}_b$ using the columns of feature matrix
    $\mathbf{\underline{X}}_b$ corresponding to those features.
    -   $m'$ random features redrawn for each new split.
    -   Commonly $m' \approx \sqrt{m}$.
    
`sklearn.ensemble` includes a class `RandomForestClassifier`
which is quite easy to use:

In [None]:
import sklearn.ensemble as ens
np.random.seed(321)
rf = ens.RandomForestClassifier(
    n_estimators = 100,          ## number of trees
    min_samples_split = 10
).fit(hessTrain.T, hessTrainY)

rfPredClass = rf.predict(hessTest.T)
pd.crosstab(rfPredClass, hessTestY,
            rownames=["prediction"], colnames=["actual"])

So\...here we found that a single decision tree combined with upstream
simple $t$-test feature selection of 100 probe sets outperformed a
random forest of 100 trees. Don't think this is a typical
result---random forests have been found to generate very competitive
ML classifiers in a wide variety of situations, while single decision
trees generally have not. But it does go to show that it can be hard
to generalize about ML algorithm performance, especially on relatively
small data sets like the Hess example here!

Classification Performance Metrics 
==================================

There are many ways to measure performance for classifiers, of which
accuracy is only one. Like accuracy, most are based on the discrete
classification label calls. For classifiers which output probability
scores, this means that some threshold probability $\psi$ (often, but
certainly not always, 0.5) must be set.

For binary (two-class) classification, when one class can be
considered "positive" and the other "negative," the cells of the 2x2
contingency table are often labeled as true positive (TP), true
negative (TN), false positive (FP), and false negative (FN), where,
e.g., a false positive is a sampling unit which the classifier declares
positive but for which the true value of the outcome is negative.
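
As a small illustration of these definitions, the sketch below counts the four cells at a given threshold $\psi$ and derives sensitivity, specificity, and accuracy from them. The function name, scores, and labels are all made up; a sampling unit is called positive when its score is strictly greater than the threshold:

```python
def confusion_counts(scores, is_positive, psi=0.5):
    """Count TP, TN, FP, FN when calling 'positive' for scores above psi."""
    tp = tn = fp = fn = 0
    for score, actual in zip(scores, is_positive):
        called_positive = score > psi
        if called_positive and actual:
            tp += 1
        elif called_positive and not actual:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30]        # made-up scores
actual = [True, True, False, True, True, False]      # made-up labels

counts = confusion_counts(scores, actual, psi=0.5)
sensitivity = counts["TP"] / (counts["TP"] + counts["FN"])
specificity = counts["TN"] / (counts["TN"] + counts["FP"])
accuracy = (counts["TP"] + counts["TN"]) / len(scores)
print(counts, sensitivity, specificity, accuracy)
```

Changing `psi` trades sensitivity against specificity: raising it converts some true positives into false negatives and some false positives into true negatives.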

We could consider the values of such hard-call metrics over a range of
threshold values $\psi$. The so-called receiver operating
characteristic (ROC) curve ([@fawcett2006introduction]) does this
for sensitivity and specificity:

In [None]:
 ## pick 20 test samples to score with pcaSvmFit classifier:
np.random.seed(123)
xfew = hessTrain.iloc[:, np.random.permutation(hessTrain.shape[1])[0:20]].T
yis1 = hessTrainY.loc[xfew.index] == 1
 ## do the scoring:
fewPredProbs = pcaSvmFit.predict_proba(xfew)[:, 1]
fewPredProbs = pd.Series(fewPredProbs, index=xfew.index)
 ## set up vector of all threshold values at which a call would change:
thresholds = pd.concat([
    pd.Series([1.0], index=["none"]),
    fewPredProbs.sort_values(ascending=False),
    pd.Series([0.0], index=["all"])
])
 ## calculate number true positives at each threshold:
tp = pd.Series(
    [np.sum((fewPredProbs > thresh) & yis1) for thresh in thresholds],
    index = thresholds.index
)
 ## and also number true negatives at each threshold:
tn = pd.Series(
    [np.sum((fewPredProbs <= thresh) & ~yis1) for thresh in thresholds],
    index = thresholds.index
)
 ## scale these by totals to obtain sens, spec at each threshold value:
sensitivity = tp / np.sum(yis1)
specificity = tn / np.sum(~yis1)

Having calculated sensitivity and specificity at every meaningful
threshold value, we can now plot the ROC curve using `ggplot`:

In [None]:
ggdata = pd.DataFrame({
    "sample" : sensitivity.index,
    "actual_class" : yis1.reindex(sensitivity.index).astype(float).values,
    "score" : fewPredProbs.reindex(sensitivity.index).values,
    "sensitivity" : sensitivity,
    "specificity" : specificity
})
ggdata["1-specificity"] = 1 - ggdata["specificity"]
gg = ggplot(ggdata, aes(x="1-specificity", y="sensitivity"))
gg += geom_line(aes(color="score"), size=1, alpha=0.75)
gg += geom_text(mapping = aes(label="sample"),
                data = ggdata.loc[ggdata["actual_class"] == 1, :],
                color = "red")
gg += geom_text(mapping = aes(label="sample"),
                data = ggdata.loc[ggdata["actual_class"] == 0, :],
                angle = -90,
                color = "black")
gg += geom_hline(mapping = aes(yintercept="sensitivity"),
                 data = ggdata.loc[ggdata["actual_class"] == 1, :],
                 alpha = 0.35,
                 size = 0.25)
gg += geom_vline(mapping = aes(xintercept="1-specificity"),
                 data = ggdata.loc[ggdata["actual_class"] == 0, :],
                 alpha = 0.35,
                 size = 0.25)
gg += plotnine.scale_color_gradientn(
    colors = ["orangered", "goldenrod", "seagreen", "dodgerblue", "#606060"]
)
gg += plotnine.theme_classic()
print(gg)

You can see in this plot that there are 15 RD (or "positive") and
5 pCR ("negative") samples in the subsampled test data `xfew`: 3 of
the 5 negative samples---M153, M206, M111---have lower prediction
scores than any of the 15 positive samples. Thus, there are 3 times
15 = 45 light gray vertices below the ROC curve in the 3 columns on
the right of the plot.

Adding to this the 26 light gray vertices below the ROC curve along the
vertical lines labeled by samples M125 and M309, corresponding to the 26
pairings of a positive sample with M125 or M309 in which the positive
sample has the higher score, we obtain 71 total ways of pairing one of
the positive samples with one of the negative samples for which the
positive sample has a higher score than the negative.

This corresponds to a fraction of 71 out of the 75, or 0.9467, vertices in the
plot which lie below the curve. This shows that the area under the curve (AUC)
for the ROC curve is 0.9467,
which must also be the probability that if we randomly pick one positive sample
and one negative sample from these 20 the positive sample will have a higher
score than the negative.
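
This pair-counting view of the AUC translates directly into code. The sketch below is a brute-force plain-Python version on made-up scores; the function name is hypothetical, and counting tied pairs as one half is a common convention that I am assuming here:

```python
def pairwise_auc(scores, is_positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, a in zip(scores, is_positive) if a]
    neg = [s for s, a in zip(scores, is_positive) if not a]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5      # common convention: ties count as half
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]           # made-up scores
actual = [True, True, False, True, False, False]  # made-up labels
print(pairwise_auc(scores, actual))
```

In this toy example 8 of the 9 positive-negative pairs are ordered correctly, in the same way that 71 of the 75 pairs were in the plot above.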

The ROC AUC score is one of the most popular metrics for assessing
classifier performance. Beyond being threshold-independent---since it
aggregates over all possible thresholds by considering the full ROC
curve---it has the property that an uninformative classifier will have
an AUC of 0.5 even when the two classes are unbalanced (more of one
than the other), as they are in the Hess data (almost 3x as many RD as
pCR).

This is not the case with accuracy: if you just assign all sampling
units the same classification score (ignoring all feature values) and
then set the classification threshold so that they are all called the
more common class, the accuracy will be the > 0.5 fraction assigned to
that class (almost 0.75 in the case of the Hess set!).

We don't usually want to do all of the work we did above to assess the
AUC score for a classifier; here are two easier ways to do it:

In [None]:
 ## calculate ROC-AUC using sklearn.metrics.roc_auc_score
import sklearn.metrics as met
met.roc_auc_score(yis1, fewPredProbs)

In [None]:
 ## or can calculate from scipy.stats.mannwhitneyu statistic:
import scipy.stats as stats
 ## (this nonparametric test is based on same underlying information):
mwu = stats.mannwhitneyu(fewPredProbs[yis1],
                         fewPredProbs[~yis1],
                         alternative = "less")
# MannwhitneyuResult(statistic=70.0, pvalue=0.9980146250417405)
mwu[0] / (np.sum(yis1) * np.sum(~yis1))

(In other words, the ROC AUC score is essentially a more interpretable
rescaling of the Wilcoxon-Mann-Whitney test (also known as the
Mann-Whitney U test) statistic. This makes sense in light of the
interpretation of AUC as the chance that a randomly chosen positive
case has a higher classification score than does a randomly chosen
negative case, since the Wilcoxon-Mann-Whitney test is based on this
same idea.)

Of course, we can get a better estimate of the AUC using the
*whole* test set instead of just `xfew`:

In [None]:
pcaSvmTestPredProbs = pcaSvmFit.predict_proba(hessTest.T)
met.roc_auc_score(hessTestY, pcaSvmTestPredProbs[:, 1])

So, a bit worse---but this still shows that even though
`pcaSvmFit` only managed to correctly call 2 of the 13 test pCR
samples as negative:

In [None]:
pcaSvmTestPredictionClass = pcaSvmFit.predict(hessTest.T)
pd.crosstab(pcaSvmTestPredictionClass, hessTestY,
            rownames=["prediction"], colnames=["actual"])

the scores of the negative (pCR) samples still tend to be lower than
the scores of the positive (RD) samples, even if they are above the
default threshold $\psi=0.5$.

Wrap-up: Comparing Models by AUC 
--------------------------------

Let's go back and try a quick head-to-head comparison of five of the
different classification models we've covered. First let's make sure
we have all of the necessary libraries loaded:

In [None]:
import sklearn.neighbors as nbr    ## KNeighborsClassifier
import sklearn.linear_model as lm  ## LogisticRegression
import sklearn.svm as svm          ## SVC
import sklearn.naive_bayes as nb   ## GaussianNB
import sklearn.ensemble as ens     ## RandomForestClassifier

We're going to use dictionary comprehensions to loop through the
five different classification
strategies. A comprehension of this kind takes a `dict` to work with
and produces a `dict` of result objects with the same keys:

In [None]:
downstreamFitters = {
    "knn" : nbr.KNeighborsClassifier(n_neighbors=9),
    "l2logistic" : lm.LogisticRegression(penalty="l2", max_iter=1000),
    "nb" : nb.GaussianNB(),
    "svm" : svm.SVC(kernel="rbf", C=1, gamma="scale", probability=True),
    "randomForest" : ens.RandomForestClassifier(n_estimators=500,
                                                min_samples_split=10)
}

Now we're ready to hook up a common upstream feature selection
strategy (guess which one we'll use!) to each of the
`downstreamFitters`, fit the resulting pipeline, make predictions on
the test set, and calculate the resulting AUC scores:

In [None]:
np.random.seed(123)
 ## use dictionary comprehension to loop through all downstreamFitters:
fitModels = {name : pl.Pipeline([
                 ("featsel", fs.SelectKBest(fs.f_regression, k=30)),
                 ("classifier", deepcopy(downstreamFitter))
             ]).fit(hessTrain.T, hessTrainY)
             for name, downstreamFitter in downstreamFitters.items()}
 ## now loop through fitModels to predict probabilities:
fitPredProbs = {name : fitModel.predict_proba(hessTest.T)[:, 1]
                for name, fitModel in fitModels.items()}
 ## and finally loop through fitPredProbs to get AUC values:
fsAucs = {name : met.roc_auc_score(hessTestY, predProbs)
          for name, predProbs in fitPredProbs.items()}
fsAucs

Interesting to note that the simplest strategy, knn, ends up winning
according to this comparison! Lots of caveats here: the results might
look very different with different methods of feature selection or
extraction, different numbers of features retained, different settings
of the various modeling parameters (number of nearest neighbors, SVM
cost or `gamma` parameters, number of trees in random forests, etc.), so
I wouldn't advise reading too much into this beyond this: sometimes
simplicity works.