# Scikit Learn Tutorial #9 - Dimensionality Reduction 2

<table align="left"><td>
  <a target="_blank"  href="https://colab.research.google.com/github/TannerGilbert/Tutorials/blob/master/Scikit-Learn-Tutorial/9.%20Dimensionality%20Reduction%202.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab
  </a>
</td><td>
  <a target="_blank"  href="https://github.com/TannerGilbert/Tutorials/blob/master/Scikit-Learn-Tutorial/9.%20Dimensionality%20Reduction%202.ipynb">
    <img width=32px src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
</td></table>

![Scikit Learn Logo](http://scikit-learn.org/stable/_static/scikit-learn-logo-small.png)

## Quick recap of the last tutorial

In the last tutorial we learned that Dimensionality Reduction is the process of reducing the number of features (variables) in a dataset. We also learned that its techniques fall into two groups:
<ul>
    <li>Feature Selection</li>
    <li>Feature Extraction</li>
</ul>
We focused on Feature Extraction in the <a href="https://youtu.be/IvvPPKDRJqE">last tutorial</a>, where we used PCA to reduce the dimensionality of our dataset. If you want to learn about Feature Extraction, check out that video. In this tutorial we will focus on Feature Selection.

## What is Feature Selection?

<b>Feature Selection</b> is what its name sounds like: it tries to find a subset of the original features. There are three strategies for selecting features: filter, wrapper and embedded. Filter methods score each feature independently of any model, wrapper methods evaluate feature subsets by training a model on them, and embedded methods select features as part of model training. Compared with Feature Extraction, Feature Selection is simpler and keeps the features interpretable, because the selected columns are unchanged original features. Its main disadvantage is that you lose all information from the features you dropped, which can result in worse accuracy.
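
To make the three strategies more concrete, here is a minimal sketch with one example of each. It is only an illustration under some assumptions: it loads iris with `load_iris` instead of the CSV used later in this tutorial, and the choice of `f_classif`, `RFE`, `LogisticRegression` and `RandomForestClassifier` is just one possible pairing, not the only way to implement these strategies.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Assumption: scikit-learn's bundled iris data stands in for the CSV used below.
X, y = load_iris(return_X_y=True)

# Filter: score every feature on its own (ANOVA F-test here) and keep the top 2.
X_filter = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Wrapper: recursive feature elimination refits the model and drops the
# weakest feature each round until 2 features remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
X_wrapper = rfe.fit_transform(X, y)

# Embedded: the model's own feature importances decide what is kept
# (default threshold: the mean importance).
sfm = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
X_embedded = sfm.fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)
```

The filter variant is the cheapest because it never trains a model, while the wrapper variant trains one model per elimination round.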

## Feature Selection in Scikit Learn

Scikit Learn offers several options for Feature Selection. You can remove features with low <a href="https://en.wikipedia.org/wiki/Variance">variance</a> using <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold">Scikit Learn's VarianceThreshold</a>, or select the best features based on univariate statistical tests using <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest">SelectKBest</a> or <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html#sklearn.feature_selection.SelectPercentile">SelectPercentile</a>. Another option is to select features based on the weights or importances learned by a specific model using <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel">SelectFromModel</a>.

### Loading in Dataset

In [1]:
import pandas as pd

iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'label'])
X = iris.drop(['label'], axis=1)
y = iris['label']
X.head()

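|   | sepal_length | sepal_width | petal_length | petal_width |
|---|--------------|-------------|--------------|-------------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |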


### Removing features with low variance (VarianceThreshold)

In [2]:
from sklearn.feature_selection import VarianceThreshold

sel = VarianceThreshold(threshold=.5)

X_vt = sel.fit_transform(X)
X_vt[:5]  # sepal_width is removed because its variance is below the 0.5 threshold

array([[5.1, 1.4, 0.2],
       [4.9, 1.4, 0.2],
       [4.7, 1.3, 0.2],
       [4.6, 1.5, 0.2],
       [5. , 1.4, 0.2]])
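
To see why only sepal width is dropped, it can help to look at the variances the selector computed. The sketch below assumes scikit-learn's bundled iris data (`load_iris`) in place of the CSV loaded above; the bundled copy can differ from the UCI file in a couple of rows, so the exact variances may vary slightly.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

# Assumption: bundled iris data stands in for the CSV loaded earlier.
iris = load_iris()
X = iris.data

sel = VarianceThreshold(threshold=0.5).fit(X)

# variances_ holds the per-feature variance that was compared with the threshold;
# get_support() is a boolean mask over the original columns.
for name, var, kept in zip(iris.feature_names, sel.variances_, sel.get_support()):
    print(f"{name}: variance={var:.3f}, kept={kept}")
```

Keep in mind that variance depends on the scale of each feature, so a threshold that works for centimeters would not carry over to the same data expressed in millimeters.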

### Univariate feature selection (SelectKBest, SelectPercentile)

In [3]:
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2

X_sb = SelectKBest(chi2, k=2).fit_transform(X, y)  # chi2: scoring function, k: number of features to keep
print(X_sb.shape)
X_sb[:5]

(150, 2)


array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.4, 0.2]])

In [4]:
X_sp = SelectPercentile(chi2, percentile=50).fit_transform(X, y)
print(X_sp.shape)
X_sp[:5]

(150, 2)


array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.4, 0.2]])

In [5]:
X_sp = SelectPercentile(chi2, percentile=75).fit_transform(X, y)
print(X_sp.shape)
X_sp[:5]

(150, 3)


array([[5.1, 1.4, 0.2],
       [4.9, 1.4, 0.2],
       [4.7, 1.3, 0.2],
       [4.6, 1.5, 0.2],
       [5. , 1.4, 0.2]])
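
The arrays above do not say which columns survived or how they scored. A fitted selector exposes both, as this small sketch shows. It assumes the bundled iris data from `load_iris` rather than the CSV used above.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Assumption: bundled iris data stands in for the CSV loaded earlier.
iris = load_iris()
X, y = iris.data, iris.target

selector = SelectKBest(chi2, k=2).fit(X, y)

# scores_ holds the chi-squared statistic for each feature; a higher score
# means a stronger dependence between the feature and the label.
for name, score, kept in zip(iris.feature_names, selector.scores_, selector.get_support()):
    print(f"{name}: chi2={score:.2f}, kept={kept}")
```

`SelectPercentile` ranks features by the same scores, which is why raising the percentile from 50 to 75 adds the next best feature instead of changing the ones already selected. Note that `chi2` requires non-negative feature values, which holds for the iris measurements.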

### Feature selection using SelectFromModel

In [6]:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

clf = LinearSVC(C=0.01, penalty="l1", dual=False)  # C is smaller than the default, so the L1 penalty is stronger and fewer features are selected
clf.fit(X, y)
model = SelectFromModel(clf, prefit=True)
X_sfm = model.transform(X)
print(X_sfm.shape)
X_sfm[:5]

(150, 3)


array([[5.1, 3.5, 1.4],
       [4.9, 3. , 1.4],
       [4.7, 3.2, 1.3],
       [4.6, 3.1, 1.5],
       [5. , 3.6, 1.4]])
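
As an alternative to fitting the classifier first and passing `prefit=True`, `SelectFromModel` can fit the estimator itself. The sketch below shows that variant and uses `get_support()` to list the surviving column names. It assumes the bundled iris data from `load_iris` instead of the CSV, so the selected features may not match the output above exactly.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# Assumption: bundled iris data stands in for the CSV loaded earlier.
iris = load_iris()
X, y = iris.data, iris.target

# Without prefit=True, SelectFromModel trains the estimator inside fit().
selector = SelectFromModel(LinearSVC(C=0.01, penalty="l1", dual=False))
X_sfm = selector.fit_transform(X, y)

# get_support() returns a boolean mask over the original columns.
kept = [name for name, keep in zip(iris.feature_names, selector.get_support()) if keep]
print(X_sfm.shape, kept)
```

Letting the selector do the fitting also makes it usable as a step in a `Pipeline`, so the selection is learned only from the training folds during cross-validation.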

## Resources

<ul>
    <li><a href="http://scikit-learn.org/stable/modules/feature_selection.html#feature-selection">Feature Selection (Scikit Learn Documentation)</a></li>
    <li><a href="https://en.wikipedia.org/wiki/Feature_selection">Feature Selection (Wikipedia)</a></li>
</ul>

## Conclusion

That was a quick overview of Feature Selection and how to implement it in Scikit Learn.
I hope you liked this tutorial. If you did, consider subscribing to my <a href="https://www.youtube.com/channel/UCBOKpYBjPe2kD8FSvGRhJwA">YouTube channel</a> or following me on social media. If you have any questions, feel free to contact me.