# Dimensionality Reduction via <br>Feature Selection



<div class="alert alert-block alert-info"><font color="#000000">
    
Reducing the dimensionality of the feature space not only lets learning algorithms run faster, but may also improve a model's predictive performance, especially when the dataset contains a large number of noisy features.

There are two main categories of dimensionality reduction techniques: <b>feature selection</b> and <b>feature extraction</b>.  

- In feature selection, we select a subset of the original features that retains most of the information needed for a given machine learning task;

- In feature extraction, we derive information from the original features and construct new features.

We focus on feature selection in this notebook.
</font></div>
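
To make the distinction concrete, here is a minimal NumPy sketch on a made-up array. The array `X_demo`, the kept column indices, and the weight matrix `W` are all hypothetical and chosen only for illustration; they are not part of the wine data used later.

```python
import numpy as np

# Hypothetical data: 2 observations, 3 original features
X_demo = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

# Feature selection: keep a subset of the original columns unchanged
X_selected = X_demo[:, [0, 2]]

# Feature extraction: construct new features from the original ones.
# Here each new feature is a weighted combination (W is made up).
W = np.array([[0.5, 0.0],
              [0.5, 0.0],
              [0.0, 1.0]])
X_extracted = X_demo @ W

print(X_selected)   # columns 0 and 2 of X_demo
print(X_extracted)  # first column is the average of features 0 and 1
```

Both results have two columns, but only the selected columns can be traced back to individual original features.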

<div class="alert alert-block alert-success"><font color="#000000">
<b><font color="#008000">Concept Review: Data Transformers</font></b>

Dimensionality reduction is a <b>data transformation process</b>: It transforms or compresses the original features into fewer features while retaining most of the information.

As with the two transformers (label encoder and standard scaler) we learned before, any transformer involves 'fit' and 'transform' steps:
<ol>
    <li>The <b>fitting</b> step finds the parameters of a transformer.</li>
    <li>The <b>transforming</b> step applies the fitted transformer to the data.  In dimensionality reduction, this step returns fewer features than it was given.</li>
</ol>
    
See a list of data transformers at https://scikit-learn.org/stable/data_transforms.html
</font></div>
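
The two steps can be sketched with a very simple selector. The example below uses scikit-learn's `VarianceThreshold`, which is not used elsewhere in this notebook, on a made-up toy array; it is only meant to show that `fit` learns which columns to keep from the training data and `transform` then applies that same choice to any data.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: the middle feature is constant in the training set
X_toy_train = np.array([[1.0, 0.0, 10.0],
                        [2.0, 0.0, 20.0],
                        [3.0, 0.0, 30.0]])
X_toy_test = np.array([[4.0, 5.0, 40.0]])

vt = VarianceThreshold(threshold=0.0)

# Fit: learn the per-feature variances from the training data only
vt.fit(X_toy_train)
mask = vt.get_support()          # boolean mask of the features that are kept

# Transform: apply the same column choice to training and test data
X_toy_train_rd = vt.transform(X_toy_train)
X_toy_test_rd = vt.transform(X_toy_test)

print(mask)
print(X_toy_train_rd.shape, X_toy_test_rd.shape)
```

Note that the test row loses its middle column even though its value there is not zero: the decision was made during fitting and is not revisited.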

In [None]:
# Run the following command to widen this notebook
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

<div class="alert alert-block alert-info"><font color="#000000">
Let's first run a block of code that we are already familiar with.
</font></div>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.set_printoptions(suppress=True, linewidth=200)

winedf = pd.read_csv("wine.csv")

# Define Predictors and Target Variable
X = winedf.iloc[:, 1:]       # iloc selects by integer position: all columns except the first
y = winedf.loc[:, 'Class label']

# Splitting Data into Training Set and Test Set
from sklearn.model_selection import train_test_split
testsize = 0.3
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=testsize, stratify=y, random_state=0)

# Standardizing data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_std= sc.transform(X_train)
X_test_std = sc.transform(X_test)

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=100, solver='liblinear', multi_class='ovr')
lr.fit(X_train_std, y_train)

print("Training accuracy: ", lr.score(X_train_std, y_train))
print("Test accuracy:     ", lr.score(X_test_std, y_test))

## <font color="#0000E0"> Recursive Feature Elimination (RFE) </font>

__Reference__:  
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html

<div class="alert alert-block alert-info"><font color="#000000">
The idea is simple: Starting from an initial set of features, we fit a model, eliminate the least important features, and repeat the procedure until the desired number of features is reached.  Importance is taken from the fitted model, e.g., from the magnitude of its weights.

<b>RFE</b> is readily available from the <b>feature_selection</b> module of the <b>scikit-learn</b> library.
</font></div>
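
The elimination loop can be written out by hand. The sketch below is an illustration under stated assumptions, not scikit-learn's implementation: it uses synthetic binary data from `make_classification` as a stand-in for the wine data, removes one feature per round, and measures importance by the squared weights of a logistic regression. The helper `manual_rfe` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 6 features, 2 of them informative
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=2, n_redundant=0,
                                     n_clusters_per_class=1, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

def manual_rfe(estimator, X, y, n_keep):
    """Drop the least important feature until n_keep features remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        estimator.fit(X[:, remaining], y)
        # Importance of each remaining feature: squared weight, summed over rows of coef_
        importance = (estimator.coef_ ** 2).sum(axis=0)
        # Remove the feature with the smallest importance
        remaining.pop(int(np.argmin(importance)))
    return remaining

kept = manual_rfe(LogisticRegression(C=100, solver='liblinear'),
                  X_demo, y_demo, n_keep=2)
print("Kept by the manual loop:", kept)

# For comparison, the library version with the same settings
rfe_demo = RFE(LogisticRegression(C=100, solver='liblinear'),
               n_features_to_select=2).fit(X_demo, y_demo)
print("Kept by RFE:            ", rfe_demo.get_support(indices=True))
```

The model is refitted after every removal, because the weights of the remaining features can change once a feature is dropped.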

In [None]:
lr = LogisticRegression(C=100, solver='liblinear', multi_class='ovr')

# Construct the feature selector
from sklearn.feature_selection import RFE
rfe = RFE(estimator=lr, n_features_to_select=2)

# Select features
rfe.fit(X_train_std, y_train)
print('Feature ranking: Selected features are ranked 1: ', rfe.ranking_)
print(np.vstack((rfe.ranking_, winedf.columns[1:])).T)
X_train_rd= rfe.transform(X_train_std)
X_test_rd = rfe.transform(X_test_std)

# Verify the selected features are indeed from desired columns
#dif = X_train_rd - X_train_std[:,[6,9]]
#print(np.mean(dif), np.std(dif))

# Train the model using only the selected features
lr.fit(X_train_rd, y_train)
print("Training accuracy: ", lr.score(X_train_rd, y_train))
print("Test accuracy:     ", lr.score(X_test_rd, y_test))

<div class="alert alert-block alert-success"><font color="#000000">
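Original features must be normalized or standardized before feature selection.  RFE ranks features by the magnitude of the model's weights, and the magnitude of a weight depends on the scale of its feature.  Without a common scale, we can make any feature look important or unimportant simply by rescaling it, as the following code shows.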
</font></div>

In [None]:
# Make a copy of the standardized features
X_alt = np.copy(X_train_std)

# Scale several features
X_alt[:,4] = X_alt[:,4] * 100000
X_alt[:,5] = X_alt[:,5] * 100000
X_alt[:,6] = X_alt[:,6] / 100000

# Now see how features rank
rfe.fit(X_alt, y_train)
print(np.vstack((rfe.ranking_, winedf.columns[1:])).T)

## <font color="#0000E0"> Recursive Feature Elimination with Cross Validation (RFECV) </font>

__Reference__:  
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html

<div class="alert alert-block alert-info"><font color="#000000">
In RFE, we specify <b>n_features_to_select</b>, but what is the best number of features to select?  
To answer this question, we can consider a series of RFE models with different values of <b>n_features_to_select</b>, evaluate each model using cross validation, and tune the hyperparameter <b>n_features_to_select</b> to achieve the best validation accuracy.  This entire process is already coded for you: <b>RFECV</b> from scikit-learn performs <b>R</b>ecursive <b>F</b>eature <b>E</b>limination with <b>C</b>ross-<b>V</b>alidated selection of the best number of features.
</font></div>
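
The tuning procedure described above can be sketched by hand as a loop over candidate values of `n_features_to_select`. This is a simplified illustration, not how `RFECV` is implemented internally; it assumes synthetic binary data from `make_classification` and 5-fold cross validation, and it places the scaler and RFE inside a pipeline so that both are refitted on each training fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 features, 3 of them informative
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     n_clusters_per_class=1, random_state=0)

mean_scores = []
for k in range(1, X_demo.shape[1] + 1):
    # Scale, keep k features with RFE, then classify
    model = make_pipeline(
        StandardScaler(),
        RFE(LogisticRegression(C=100, solver='liblinear'),
            n_features_to_select=k),
        LogisticRegression(C=100, solver='liblinear'))
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    mean_scores.append(scores.mean())

# Candidate with the highest mean validation accuracy (first one on ties)
best_k = int(np.argmax(mean_scores)) + 1
print("Mean validation accuracy per k:", np.round(mean_scores, 3))
print("Best number of features:", best_k)
```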

In [None]:
lr = LogisticRegression(C=100, solver='liblinear', multi_class='ovr')

# Construct the feature selector
from sklearn.feature_selection import RFECV
rfecv = RFECV(estimator=lr, cv=10)

# Select features
rfecv.fit(X_train_std, y_train)
print('Best number of features to select:', rfecv.n_features_,
      '\nFeature ranking: Selected features are ranked 1:\n',
      np.vstack((rfecv.ranking_, winedf.columns[1:])).T)
X_train_rd= rfecv.transform(X_train_std)
X_test_rd = rfecv.transform(X_test_std)

# Train the model using only the selected features
lr.fit(X_train_rd, y_train)
print("Training accuracy: ", lr.score(X_train_rd, y_train))
print("Test accuracy:     ", lr.score(X_test_rd, y_test))

In [None]:
rfecv.cv_results_['mean_test_score']  # cv_results_ replaced the older grid_scores_ attribute (scikit-learn 1.0+)

In [None]:
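mean_scores = rfecv.cv_results_['mean_test_score']
plt.figure(figsize=(12,6))
plt.xlabel("Number of features selected")
plt.ylabel("Mean cross-validation accuracy")
# One score per candidate number of features (1 to 13 here, with the default step=1)
plt.plot(range(1, len(mean_scores) + 1), mean_scores, label='validation accuracy')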
plt.grid()
plt.legend(loc='upper center')
plt.show()

## <font color="#0000E0"> Least Absolute Shrinkage and Selection Operator (LASSO)</font>

<div class="alert alert-block alert-success"><font color="#000000">
<b><font color="#008000">Concept Review </font></b>

In LASSO regularization, we add the penalty $\frac{1}{C}\sum_{j=1}^m |w_j|$ to the loss function, where the sum of absolute values of the weights is known as the L1 norm of the weights.  In the figure below, every point $(w_1, w_2)$ on the edge of the diamond has the same penalty.  The regularized loss is often minimized at a corner of the diamond, where one of the weights is exactly zero; this is why LASSO tends to produce sparse weights.
</font></div>
<img src="LASSO.png" width="500">

__Reference__:    
https://scikit-learn.org/stable/modules/feature_selection.html#feature-selection-using-selectfrommodel

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html


<div class="alert alert-block alert-info"><font color="#000000">
The idea of feature selection by LASSO is simple: We select those features with non-zero weights estimated by a model with LASSO regularization.  Scikit-learn automates this process via <b>SelectFromModel</b>.
</font></div>
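
How many weights end up non-zero depends on the regularization strength. The sketch below counts the non-zero weights of an L1-regularized logistic regression for several values of `C`. It assumes synthetic binary data from `make_classification` rather than the wine data, and the list of `C` values is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 10 features, 3 of them informative
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=3, n_redundant=0,
                                     n_clusters_per_class=1, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

C_values = [0.001, 0.01, 0.1, 1.0, 100.0]   # arbitrary grid; small C = strong penalty
nonzero_counts = []
for C in C_values:
    l1_model = LogisticRegression(C=C, penalty='l1', solver='liblinear')
    l1_model.fit(X_demo, y_demo)
    # Count the weights that the L1 penalty did not shrink to exactly zero
    nonzero_counts.append(int(np.sum(l1_model.coef_ != 0)))
    print(f"C={C:<7} non-zero weights: {nonzero_counts[-1]}")
```

A smaller `C` means a stronger penalty, so fewer features survive; `C` therefore acts as the knob that controls how many features are selected.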

In [None]:
# Logistic model with LASSO regularization
lr = LogisticRegression(C=0.05, solver='liblinear', multi_class='ovr', 
                        penalty='l1')

# Construct the feature selector 
from sklearn.feature_selection import SelectFromModel
selector = SelectFromModel(estimator=lr)

# Select features
selector.fit(X_train_std, y_train)
print('Selected features: ', selector.get_support(),
      '\n', np.sum(selector.get_support()),
      'features are selected:', 
      winedf.columns[1:].to_numpy()[selector.get_support()])
#with np.printoptions(precision=2, linewidth=200):
#    print(np.vstack((selector.estimator_.coef_, selector.get_support())))
X_train_rd= selector.transform(X_train_std)
X_test_rd = selector.transform(X_test_std)

# Train the model using the selected features
# A different regularization scheme can be used for this final model
lr = LogisticRegression(C=100, solver='liblinear', multi_class='ovr')
lr.fit(X_train_rd, y_train)
print("Training accuracy: ", lr.score(X_train_rd, y_train))
print("Test accuracy:     ", lr.score(X_test_rd, y_test))

## <font color="#0000E0"> Summary: Feature Selection </font>

<div class="alert alert-block alert-success"><font color="#000000">
    
1. Preprocess the data and construct a machine learning model (same as what we did in previous classes)

2. Construct a feature selector: RFE, RFECV, LASSO, etc.

3. Select features by using 'fit' and 'transform' functions:

<ul>
<li>The fit method finds the parameters of the transformer, e.g., the feature ranking (rfe.ranking_, rfecv.ranking_), the number of features selected (rfecv.n_features_), etc.  This step should use the training set only.
</li>
<li>The transform method collects the selected features and returns them as an array.  This step applies to both the training set and the test set.  The returned arrays have the same number of rows (observations) as the input but fewer columns (features).
</li>
</ul>

4. Train the model using the selected features, evaluate model performance, and tune hyperparameters.

</font></div>
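
The four steps can also be chained in a scikit-learn `Pipeline`, which fits the scaler and the selector on the training set only and reuses them on the test set. The sketch below is one possible arrangement, not the only one: it assumes synthetic binary data from `make_classification`, an L1-based `SelectFromModel` selector with an arbitrary `C=0.5`, and step names (`scale`, `select`, `model`) chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: data and train/test split (synthetic stand-in data, an assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=3, n_redundant=0,
                                     n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

# Steps 2 and 3: scaler and feature selector, followed by the final model
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectFromModel(
        LogisticRegression(C=0.5, penalty='l1', solver='liblinear'))),
    ('model', LogisticRegression(C=100, solver='liblinear')),
])

# Step 4: fit on the training set only, then evaluate on the test set
pipe.fit(X_tr, y_tr)
n_selected = int(pipe.named_steps['select'].get_support().sum())
test_acc = pipe.score(X_te, y_te)
print("Number of selected features:", n_selected)
print("Test accuracy:", test_acc)
```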