# SPLEX TME 10 Feature Selection Model Selection

The goal of this TME is to learn various feature selection techniques.

Data (both data sets are provided):
• Molecular classification of leukemia data set of Golub et al. 1999, containing gene expressions of 72 patients and 3562 genes.
• Breast cancer data set

In [28]:
#You will need to load at least the following packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import ElasticNet
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn import linear_model

## Analysis
Repeat the same analyses for the two data sets.

In [39]:
#To read the data:
# For the Golub et al. 1999 data
X = pd.read_csv('Golub_X',sep=" ") # Observations
y = pd.read_csv('Golub_y',sep=" ") # Classes

(71, 3562)

In [41]:
# For the Breast cancer data
X = pd.read_csv("Breast.txt", sep=" ")
y = X.values[:, 30]    # Classes (column 30)
X = X.values[:, 0:29]  # Observations (columns 0 to 28; the slice end is exclusive)

We will use the sklearn Python library only.

## 1.  A simple heuristic approach 

A simple heuristic is to delete the features whose variance is less than a threshold. Try it (with two different arbitrary thresholds), but do not expect this method to give optimal performance (although it can be quite efficient on some data sets).
http://scikit-learn.org/stable/modules/feature_selection.html

In [31]:
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)

array([[ 1.82821197, -0.35332152,  1.68447255, ..., -0.14661996,
         1.08612862, -0.24367526],
       [ 1.5784992 ,  0.45578591,  1.56512598, ...,  0.85422232,
         1.95328166,  1.15124203],
       [-0.76823332,  0.25350905, -0.59216612, ...,  1.98783917,
         2.17387323,  6.04072615],
       ...,
       [ 0.70166686,  2.04377549,  0.67208442, ...,  0.32647934,
         0.41370467, -1.10357792],
       [ 1.83672491,  2.33440316,  1.98078127, ...,  3.1947936 ,
         2.28797231,  1.9173959 ],
       [-1.80681144,  1.22071793, -1.81279344, ..., -1.30468267,
        -1.7435287 , -0.04809589]])
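To see the effect of the threshold, the following minimal sketch counts the features kept for two arbitrary thresholds. It assumes the copy of the breast cancer data bundled with scikit-learn (`load_breast_cancer`) as a stand-in for the provided file; the names `X_demo`, `y_demo` and `kept` are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import VarianceThreshold

# Assumption: the data set bundled with scikit-learn stands in for the provided files.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

kept = {}
for threshold in (0.01, 1.0):  # two arbitrary thresholds
    selector = VarianceThreshold(threshold=threshold)
    X_reduced = selector.fit_transform(X_demo)
    kept[threshold] = X_reduced.shape[1]
    print(f"threshold={threshold}: kept {kept[threshold]} of {X_demo.shape[1]} features")
```

Variance depends on the scale of each feature, so a threshold is only meaningful relative to the units of the data: on standardized data every feature has variance 1 and a single threshold keeps either all features or none.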

## 2.  Univariate feature selection with statistical tests
The aim is to get rid of the features that are not statistically significant with respect to the class vector. Try the SelectFdr function, which selects features according to their p-values for an estimated false discovery rate.
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFdr.

In [38]:
# chi2 only accepts non-negative values, so we cannot use it with the breast cancer data loaded above.
# Let's try the ANOVA F-value instead.


from sklearn.feature_selection import SelectFdr, f_classif
X_new = SelectFdr(f_classif, alpha=0.01).fit_transform(X, y)
X_new.shape

# we remove 4 features

(569, 25)

    Load and return the breast cancer wisconsin dataset (classification).

    The breast cancer dataset is a classic and very easy binary classification
    dataset.

    =================   ==============
    Classes                          2
    Samples per class    212(M),357(B)
    Samples total                  569
    Dimensionality                  30
    Features            real, positive
    =================   ==============

    Parameters
    ----------
    return_X_y : boolean, default=False
        If True, returns ``(data, target)`` instead of a Bunch object.
        See below for more information about the `data` and `target` object.

        .. versionadded:: 0.18

    Returns
    -------
    data : Bunch
        Dictionary-like object, the interesting attributes are:
        'data', the data to learn, 'target', the classification labels,
        'target_names', the meaning of the labels, 'feature_names', the
        meaning of the features, and 'DESCR', the
        full description of the dataset.

    (data, target) : tuple if ``return_X_y`` is True

        .. versionadded:: 0.18

    The copy of UCI ML Breast Cancer Wisconsin (Diagnostic) dataset is
    downloaded from:
    https://goo.gl/U2Uwz2

    Examples
    --------
    Let's say you are interested in the samples 10, 50, and 85, and want to
    know their class name.

    >>> from sklearn.datasets import load_breast_cancer
    >>> data = load_breast_cancer()
    >>> data.target[[10, 50, 85]]
    array([0, 1, 0])
    >>> list(data.target_names)
    ['malignant', 'benign']

In [33]:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFdr, chi2
X, y = load_breast_cancer(return_X_y=True)
X.shape

X_new = SelectFdr(chi2, alpha=0.01).fit_transform(X, y)
X_new.shape

(569, 16)
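Beyond the shape of the reduced matrix, it can help to see which features were rejected and with which p-values. The sketch below is one way to do this, assuming the scikit-learn copy of the breast cancer data and the ANOVA F-test; `X_demo`, `mask` and `rejected` are illustrative names, not part of the exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFdr, f_classif

# Assumption: the data set bundled with scikit-learn stands in for the provided file.
data = load_breast_cancer()
X_demo, y_demo = data.data, data.target

selector = SelectFdr(f_classif, alpha=0.01).fit(X_demo, y_demo)
mask = selector.get_support()         # boolean mask, True for the kept features
rejected = data.feature_names[~mask]  # names of the removed features

print("kept", mask.sum(), "of", mask.size, "features")
for name, p_value in zip(rejected, selector.pvalues_[~mask]):
    print(f"removed {name}: p-value = {p_value:.3g}")
```

`get_support()` and `pvalues_` are available on the fitted selector whatever score function is used, so the same inspection works with `chi2` on non-negative data.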

## 3.  L1-based feature selection
This approach is designed to find an optimal solution. The sparsity parameter is important, since it controls the number of non-zero parameters: if too many parameters are kept, there is no real feature selection; if too few are kept, the accuracy may be very poor.
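The trade-off described above can be observed by sweeping the sparsity parameter and counting the non-zero coefficients. This is a minimal sketch, assuming an L1-penalized logistic regression fitted with the `liblinear` solver on the standardized scikit-learn copy of the breast cancer data; the three values of `C` are arbitrary, and in scikit-learn a small `C` means a strong penalty.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: the data set bundled with scikit-learn stands in for the provided file.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X_demo)  # put all features on one scale

n_nonzero = {}
for C in (0.001, 0.1, 10.0):  # arbitrary values; small C = strong L1 penalty
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X_scaled, y_demo)
    n_nonzero[C] = int(np.count_nonzero(clf.coef_))
    print(f"C={C}: {n_nonzero[C]} non-zero coefficients, "
          f"training accuracy {clf.score(X_scaled, y_demo):.3f}")
```

The accuracy printed here is measured on the training data, so it only illustrates the effect of the penalty and is not an estimate of generalization performance.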

### (a)  Logistic regression penalized by the L1 penalty term 

In [None]:
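# linear_model.Lasso(alpha=alpha) is L1-penalized least squares; the logistic version is:
linear_model.LogisticRegression(penalty="l1", solver="liblinear", C=C)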

### (b)  A support vector machine penalized by the L1 penalty term

In [None]:
LinearSVC(C=C, penalty="l1", dual=False)
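A sparse linear model can be turned into a feature selector with `SelectFromModel`, which keeps the features whose coefficients are non-zero. The sketch below assumes the standardized scikit-learn copy of the breast cancer data; `C=0.05` and `max_iter=5000` are arbitrary illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Assumption: the data set bundled with scikit-learn stands in for the provided file.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X_demo)

# The L1 penalty requires dual=False in LinearSVC.
svc = LinearSVC(C=0.05, penalty="l1", dual=False, max_iter=5000)
selector = SelectFromModel(svc).fit(X_scaled, y_demo)
X_selected = selector.transform(X_scaled)

print("kept", X_selected.shape[1], "of", X_scaled.shape[1], "features")
print("indices of the kept features:", selector.get_support(indices=True))
```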

### (c)  Explore the Elastic Net which is a compromise between the L1 and L2 penalty terms.

In [None]:
ElasticNet(alpha=alpha, l1_ratio=0.7)
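The Elastic Net can be plugged into `SelectFromModel` in the same way. `ElasticNet` is a regression model, so this sketch treats the 0/1 class labels as a numeric target, which is a simplification. With `l1_ratio` below 1, `SelectFromModel` falls back to the mean coefficient magnitude as its default threshold (as far as the scikit-learn documentation describes it), so a small explicit threshold is passed here to keep every feature with a non-zero coefficient. The values `alpha=0.05` and `threshold=1e-5` are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# Assumption: the data set bundled with scikit-learn stands in for the provided file.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X_demo)

# Simplification: the 0/1 labels are used as a regression target.
enet = ElasticNet(alpha=0.05, l1_ratio=0.7)
selector = SelectFromModel(enet, threshold=1e-5).fit(X_scaled, y_demo)

n_kept = int(selector.get_support().sum())
n_nonzero = int(np.count_nonzero(selector.estimator_.coef_))
print(f"kept {n_kept} features ({n_nonzero} non-zero coefficients)")
```

Moving `l1_ratio` towards 1 makes the model behave like the Lasso, while moving it towards 0 brings it closer to ridge regression, which shrinks coefficients without setting them exactly to zero.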

## 4. Features to keep
How many features do you keep using these different methods?  It is quite normal that each method selects a different number of features.

## 5.  Method leading to the best performance
Which method leads to the best performance on the given data sets?
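One way to compare the methods fairly is to put each selector in a pipeline with the same classifier and score it by cross-validation, so that the selection is refitted on each training fold. The sketch below assumes the scikit-learn copy of the breast cancer data, a plain logistic regression as the final classifier, and arbitrary selector settings; none of these choices is prescribed by the exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import (SelectFdr, SelectFromModel,
                                       VarianceThreshold, f_classif)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Assumption: the data set bundled with scikit-learn stands in for the provided file.
X_demo, y_demo = load_breast_cancer(return_X_y=True)


def classifier():
    """Return a fresh copy of the classifier shared by all pipelines."""
    return LogisticRegression(max_iter=1000)


pipelines = {
    # The variance filter runs before scaling, since scaling equalizes variances.
    "variance threshold": make_pipeline(
        VarianceThreshold(threshold=0.01), StandardScaler(), classifier()),
    "SelectFdr (ANOVA F)": make_pipeline(
        StandardScaler(), SelectFdr(f_classif, alpha=0.01), classifier()),
    "L1 linear SVM": make_pipeline(
        StandardScaler(),
        SelectFromModel(LinearSVC(C=0.05, penalty="l1", dual=False, max_iter=5000)),
        classifier()),
}

scores = {}
for name, pipeline in pipelines.items():
    # 5-fold cross-validated accuracy; selection is refitted inside each fold.
    scores[name] = cross_val_score(pipeline, X_demo, y_demo, cv=5).mean()
    print(f"{name}: mean accuracy = {scores[name]:.3f}")
```

Selecting features on the full data set before cross-validating would leak information from the test folds, which matters most for data such as Golub's, where features far outnumber samples.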

### References
• The original Lasso paper: Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc. B, Vol. 58, No. 1, pages 267–288.
http://statweb.stanford.edu/~tibs/lasso/lasso.pdf

• T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations (a good book).
https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS_corrected_1.4.16.pdf
