# Intro to Data Science @ SzISz Part III.
## Dimensionality Reduction

### Table of contents
- <a href="#What-is-Dimensionality-Reduction?">Theory</a>
- <a href="#Feature-Selection">Feature Selection</a>
- <a href="#Matrix-Decomposition">Matrix Decomposition</a>
- <a href="#Other-Techniques">Other Techniques</a>

## What is Dimensionality Reduction?
Dimensionality reduction _"is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction."_

_"__Feature selection__ approaches try to find a subset of the original variables. ... In some cases, data analysis such as regression or classification can be done in the reduced space more accurately than in the original space."_

_"__Feature extraction__ transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist."_ (Source: <a href="https://en.wikipedia.org/wiki/Dimensionality_reduction">Wikipedia</a>)


## Why is it important?
With hundreds of features in a dataset, there will almost always be some that do not contribute to the predictive performance of the model. These features may be redundant, overlapping, linear combinations of one another, or simply irrelevant to the prediction. Reducing the number of features to a moderate amount improves training and transformation/prediction time.

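The <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a> is another reason to keep the number of features in check: as dimensions are added, the data becomes increasingly sparse in the feature space, and distance-based and density-based methods become less reliable.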


## Tools
- <a href="http://scikit-learn.org/stable/modules/feature_selection.html">Feature Selection</a>
- <a href="http://scikit-learn.org/stable/modules/decomposition.html#decompositions">Matrix decomposition</a>
- <a href="http://scikit-learn.org/stable/modules/feature_extraction.html#feature-hashing">Hashing</a>
- etc.

In [None]:
%matplotlib inline
import numpy as np
import scipy.sparse as sp
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_iris

In [None]:
iris = load_iris()
X, y = iris.data, iris.target

## Feature Selection

In [None]:
from sklearn.svm import SVR, LinearSVC
from sklearn.feature_selection import VarianceThreshold, RFE, SelectFromModel

__VarianceThreshold__:
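
`VarianceThreshold` drops every feature whose variance does not exceed a given threshold, without looking at the target at all. The sketch below applies it to the iris data; the threshold value `0.2` is an arbitrary choice for illustration, not a recommended setting, and the variable names are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

X, y = load_iris(return_X_y=True)

# Per-feature variances, computed by hand for comparison
variances = X.var(axis=0)

# Keep only features whose variance is above the (arbitrary) threshold
selector = VarianceThreshold(threshold=0.2)
X_vt = selector.fit_transform(X)

# Boolean mask of the features that survived
kept = selector.get_support()
print(variances.round(3), kept, X_vt.shape)
```

Because the variance depends on the scale of each feature, this filter is usually only meaningful when the features are measured in comparable units.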

__RFE__:
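
Recursive feature elimination (`RFE`) repeatedly fits an estimator, discards the weakest feature(s) according to the fitted coefficients, and refits on what remains. A minimal sketch on iris follows, assuming a linear-kernel `SVR` as the ranking estimator and two features to keep; both are illustrative choices, and the class labels are treated as a numeric target purely for demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

X, y = load_iris(return_X_y=True)

# A linear kernel is needed so that the estimator exposes coef_
estimator = SVR(kernel="linear")

# Remove one feature per iteration until two are left
rfe = RFE(estimator, n_features_to_select=2, step=1)
X_rfe = rfe.fit_transform(X, y)

# support_ marks the selected features; ranking_ is 1 for selected ones
print(rfe.support_, rfe.ranking_, X_rfe.shape)
```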

__Select from model__:
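
`SelectFromModel` wraps any estimator that exposes `coef_` or `feature_importances_` and keeps the features whose importance exceeds a threshold. The sketch below assumes an L1-penalized `LinearSVC`, which tends to drive some coefficients to zero; the value `C=0.01` is an illustrative setting, and how many features survive depends on it.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# The L1 penalty yields sparse coefficients; dual=False is required with penalty="l1"
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False)

# Fit the model and keep the features with non-negligible coefficients
sfm = SelectFromModel(lsvc)
X_sfm = sfm.fit_transform(X, y)

print(sfm.get_support(), X_sfm.shape)
```

A smaller `C` means stronger regularization and therefore fewer selected features.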

## Matrix Decomposition

In [None]:
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA, TruncatedSVD

In [None]:
data2 = fetch_20newsgroups_vectorized(subset='test', remove=('headers', 'footers', 'quotes'))
X2, y2 = data2.data, data2.target

__PCA__:
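
PCA projects the centered data onto the orthogonal directions of largest variance. As a small illustration, the sketch below reduces the four iris features to two components; the choice of `n_components=2` is an assumption made here so the result could be plotted, not something prescribed by the method.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Project the 4-dimensional data onto its first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_, X_pca.shape)
```

The `explained_variance_ratio_` attribute is a common guide for deciding how many components to keep.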

__SVD__:
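
`TruncatedSVD` does not center the data before decomposing it, so it can operate directly on `scipy.sparse` matrices such as bag-of-words features. To keep the sketch self-contained it uses a random sparse matrix as a stand-in for text data; the matrix size, density, and `n_components=20` are arbitrary assumptions.

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# Synthetic sparse matrix standing in for a document-term matrix
X_sparse = sp.random(200, 1000, density=0.01, format="csr", random_state=0)

# Reduce 1000 sparse columns to 20 dense components
svd = TruncatedSVD(n_components=20, random_state=0)
X_svd = svd.fit_transform(X_sparse)

print(X_svd.shape, svd.explained_variance_ratio_.sum())
```

The same two lines of fitting code should apply to a real sparse feature matrix, with `n_components` tuned to the data.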

## Other Techniques

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
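
Feature hashing maps tokens to a fixed number of columns with a hash function, so the output dimensionality is chosen up front and no vocabulary has to be stored. The sketch below chains `HashingVectorizer` with `TfidfTransformer` on a tiny made-up corpus; the documents, `n_features=2**8`, and the step names are illustrative assumptions.

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy corpus, invented for this example
docs = [
    "dimensionality reduction removes redundant features",
    "feature hashing maps tokens to a fixed number of columns",
    "principal component analysis is a linear technique",
]

pipe = Pipeline([
    # Non-negative raw counts so that TF-IDF weighting can be applied afterwards
    ("hash", HashingVectorizer(n_features=2**8, alternate_sign=False, norm=None)),
    ("tfidf", TfidfTransformer()),
])
X_hashed = pipe.fit_transform(docs)

print(X_hashed.shape)
```

A small `n_features` saves memory but increases the chance that different tokens collide in the same column.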