# Intro to Data Science
## Part IV. - Dimensionality Reduction

### Table of contents

- ##### Dimensionality reduction:
    - <a href="#What-is-Dimensionality-Reduction?">Theory</a>
    - <a href="#1.-Feature-Selection">Feature Selection</a>
    - <a href="#2.-Matrix-Decomposition">Matrix Decomposition</a>
    - <a href="#3.-Nonlinear-Dimensionality-Reduction">Nonlinear Dimensionality Reduction</a>
    - <a href="#4.-Trade-offs-&-When-to-Use-What">Trade-offs & When to Use What</a>

- ##### SVM
    - <a href="#SVM-=-Support-Vector-Machines">Theory</a>
    - <a href="#Example">Example</a>

- ##### Feature Union
    - <a href="#Feature-Unions">Feature Union</a>
    - <a href="#Create-custom-transformers">Custom transformers</a>
    - <a href="#Exercise:-Prediction-on-last-week's-dataset">Exercise</a>
    
---

## What is Dimensionality Reduction?

Dimensionality reduction _"is the process of reducing the number of random variables under consideration and can be divided into feature selection and feature extraction."_

_"__Feature selection__ approaches try to find a subset of the original variables. ... In some cases, data analysis such as regression or classification can be done in the reduced space more accurately than in the original space."_

_"__Feature extraction__ transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist."_ from: <a href="https://en.wikipedia.org/wiki/Dimensionality_reduction">Wiki</a>


## Why is it important?

With datasets containing hundreds (or even thousands) of features, some will inevitably be redundant, overlapping, or simply irrelevant to the prediction task. These unnecessary features can:
- Slow down training and prediction
- Increase the risk of overfitting
- Make models harder to interpret

Reducing the number of features to a manageable amount improves both model performance and efficiency.

### <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">The curse of dimensionality</a>

<img src="pics/curse-dimensions.png" align="left"> 
<br style="clear:left;"/>
(<a href="http://tm.durusau.net/wp-content/uploads/2016/06/curse-dimensions-460.png">source</a>)

This is one of our greatest enemies, right up there with overfitting and underfitting. As the number of dimensions increases, the number of possible states or input vectors grows **exponentially**. Even in the simplest case of binary variables, a dataset with 50 dimensions already has $2^{50} > 10^{15}$ possible inputs. That's **over a quadrillion** input combinations! 😱 To cover the input space equally well, **we need exponentially more training data**, which is often impractical.
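To make the growth concrete, here is a small sketch. The helper name `n_binary_states` and the fixed budget of one million samples are illustrative assumptions, not part of the course material. It counts the possible inputs for `d` binary features and shows the largest fraction of them that a fixed-size dataset could cover.

```python
def n_binary_states(d):
    """Number of distinct input vectors for d binary features."""
    return 2 ** d

n_samples = 1_000_000  # assumed, fixed training-set size

for d in (10, 20, 30, 50):
    states = n_binary_states(d)
    # Best case: every sample is a distinct input vector.
    coverage = min(1.0, n_samples / states)
    print(f"d={d:>2}  states={states:.3e}  max coverage={coverage:.2e}")
```

Even in this best case, the achievable coverage collapses towards zero long before `d` reaches 50.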

## Tools
- <a href="http://scikit-learn.org/stable/modules/feature_selection.html">Feature Selection</a>
- <a href="http://scikit-learn.org/stable/modules/decomposition.html#decompositions">Matrix decomposition</a>
- <a href="http://scikit-learn.org/stable/modules/feature_extraction.html#feature-hashing">Feature Hashing</a>
- Others (e.g., autoencoders, t-SNE, UMAP)

In [None]:
import numpy as np
import scipy.sparse as sp
import pandas as pd

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d, Axes3D
from matplotlib.colors import ListedColormap
import seaborn as sns

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression

np.random.seed(42)

In [None]:
def plotme(X, y):
    with sns.color_palette('muted', n_colors=3) as mycolors:
        plt.scatter(*X.T, c=y, cmap=ListedColormap(mycolors), edgecolors='k')

def plot_results_with_hyperplane(clf, clf_name, df, ax):
    x_min, x_max = df.x.min() - .5, df.x.max() + .5
    y_min, y_max = df.y.min() - .5, df.y.max() + .5

    # step between points. i.e. [0, 0.02, 0.04, ...]
    step = .02
    # to plot the boundary, we're going to create a matrix of every possible point
    # then label each point using our classifier
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step), np.arange(y_min, y_max, step))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # this gets our predictions back into a matrix
    Z = Z.reshape(xx.shape)
    
    # plot the boundaries
    ax.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired, shading='auto')
    # use the DataFrame that was passed in instead of relying on global arrays
    ax.scatter(df.x, df.y, c=['r' if c else 'b' for c in df.c], edgecolors='k')
    ax.set_title(clf_name)

In [None]:
iris = load_iris()
X, y = iris.data, iris.target

## 1. Feature Selection

### Simple (Variance Threshold) Based Selection:

"[_`VarianceThreshold`_](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html) _is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples."_ from: <a href="http://scikit-learn.org/stable/modules/feature_selection.html#removing-features-with-low-variance">sklearn docs</a>

In [None]:
from sklearn.feature_selection import VarianceThreshold

thres = VarianceThreshold(.6)
X_t = thres.fit_transform(X)
X_t.shape, list(zip(iris.feature_names, thres.variances_))

In [None]:
plotme(X_t, y)
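
The thresholding rule quoted at the start of this subsection can be sketched with plain NumPy. This is a simplified illustration on made-up toy data (`X_toy` and `variance_mask` are hypothetical names), assuming the rule is "keep a column if its population variance is strictly greater than the threshold"; it is not the scikit-learn implementation.

```python
import numpy as np

# Toy data (assumed for illustration): column 1 is constant, column 2 barely varies.
X_toy = np.array([
    [1.0, 5.0, 0.10],
    [2.0, 5.0, 0.11],
    [3.0, 5.0, 0.10],
    [4.0, 5.0, 0.12],
])

def variance_mask(X, threshold=0.0):
    """Boolean mask of columns whose population variance exceeds the threshold."""
    return X.var(axis=0) > threshold

mask_default = variance_mask(X_toy)        # drops only the constant column
mask_strict = variance_mask(X_toy, 0.6)    # keeps only the high-variance column
X_reduced = X_toy[:, mask_strict]
print(X_toy.var(axis=0), mask_default, mask_strict, X_reduced.shape)
```

Note that variance depends on the scale of each feature, so a threshold such as `0.6` only makes sense when the features are measured in comparable units.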

### Recursive Feature Elimination (<a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn-feature-selection-rfe">`RFE`</a>):

_"Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), __recursive feature elimination (RFE)__ is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and weights are assigned to each one of them. Then, features whose absolute weights are the smallest are pruned from the current set features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached."_ from: <a href="http://scikit-learn.org/stable/modules/feature_selection.html#recursive-feature-elimination">sklearn docs</a>

In [None]:
from sklearn.feature_selection import RFE

rfe = RFE(LinearRegression(), n_features_to_select=2)
X_t = rfe.fit_transform(X, y)
X_t.shape, list(zip(iris.feature_names, rfe.ranking_))

In [None]:
plotme(X_t, y)

Thought experiment: Consider the __<a href="http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html">digits</a>__ dataset and try to describe the results found __<a href="http://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_digits.html#recursive-feature-elimination">here</a>__.

**Tip**: RFE is particularly useful when working with high-dimensional datasets, but it can be computationally expensive. Consider using [`RFECV`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html) (Recursive Feature Elimination with Cross-Validation) to automatically determine the optimal number of features! 
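The elimination loop described in the quote above can be sketched by hand. The following is a minimal illustration, not scikit-learn's `RFE`: it assumes an ordinary least-squares model fitted with `np.linalg.lstsq`, removes one feature per round, and uses synthetic data in which only columns 0 and 2 matter. The names `rfe_least_squares`, `X_syn` and `y_syn` are hypothetical.

```python
import numpy as np

def rfe_least_squares(X, y, n_features_to_select):
    """Hand-rolled RFE sketch: refit least squares, drop the weakest feature, repeat."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Fit a linear model (with intercept) on the surviving columns.
        A = np.column_stack([X[:, remaining], np.ones(len(X))])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        weights = np.abs(coef[:-1])          # ignore the intercept
        weakest = int(np.argmin(weights))    # position within `remaining`
        remaining.pop(weakest)
    return sorted(remaining)

# Synthetic data (assumed): only columns 0 and 2 influence the target.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 4))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 2] + 0.01 * rng.normal(size=200)

selected = rfe_least_squares(X_syn, y_syn, n_features_to_select=2)
print(selected)
```

Comparing absolute coefficients is only meaningful here because all columns share the same scale; with real data, the features would usually be standardized first.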


### Select based on models:

This method is highly versatile since it relies on an external model to determine feature importance. Features are selected based on the model's coefficients or importance scores. If a feature's importance is below a predefined threshold, it is considered unimportant and removed.  
In sklearn, the <a href="http://scikit-learn.org/stable/modules/feature_selection.html#feature-selection-using-selectfrommodel">`SelectFromModel`</a> transformer requires an estimator that has either a `coef_` (for linear models) or `feature_importances_` (for tree-based models) attribute.

In [None]:
from sklearn.feature_selection import SelectFromModel

# lbfgs is already the default solver; `multi_class` is deprecated in recent scikit-learn releases
sel = SelectFromModel(LogisticRegression(C=.1))
X_t = sel.fit_transform(X, y)
X_t.shape, list(zip(iris.feature_names, sel.get_support()))

In [None]:
plotme(X_t, y)

**Tip**: This method works well with tree-based models (like Random Forests and Gradient Boosting) and linear models with L1 regularization (like Lasso). 

## 2. Matrix Decomposition

### Principal Component Analysis ([`PCA`](http://scikit-learn.org/stable/modules/decomposition.html#principal-component-analysis-pca))

<div style="display: flex; align-items: center; gap: 20px;">
    <img src="pics/transforming_axes.gif" width="500">
    <span style="display: block; width: 100%;">
        <h4>What is PCA?</h4>
        Principal Component Analysis (PCA) is a <b>linear dimensionality reduction technique</b> that transforms data into a lower-dimensional space while <b>maximizing variance</b>. Instead of using the original features, PCA finds new axes (principal components) that capture the most variation in the data.  
        </br></br>
        Mathematically, PCA:
        <ul>
            <li>Computes the <b>covariance matrix</b> of the data.</li>
            <li>Finds the <b>eigenvectors</b> and <b>eigenvalues</b> of this matrix.</li>
            <li>Uses the eigenvectors corresponding to the largest eigenvalues as new feature axes.</li>
        </ul>
        Simply put, PCA <b>rotates</b> the data to find the best angles that preserve its structure with the least number of dimensions.
    </span>
</div>

</br>

<div style="display: flex; align-items: center; gap: 20px;">
    <img src="pics/finding_pca.gif" width="500">
    <span style="display: block; width: 100%;">
        <h4>Why is PCA useful?</h4>
        <ul>
            <li><b>Reduces dimensionality</b> while retaining essential information.</li>
            <li><b>Removes redundancy</b> by capturing correlated features in fewer dimensions.</li>
            <li><b>Speeds up models</b> by working with fewer input variables.</li>
            <li><b>Helps visualization</b> of high-dimensional data in 2D or 3D.</li>
        </ul>
        In practice, we don't always compute exact eigenvectors manually. Instead, PCA is implemented using <b>eigendecomposition</b> of the covariance matrix or, more commonly, <b>Singular Value Decomposition (SVD)</b> of the centered data.
        </br></br>
        💡 <b>Fun Fact:</b> The first principal component is the direction of maximum variance, the second is orthogonal to it, capturing the next highest variance, and so on.
    </span>
</div>

**Animations above are from [this great StackExchange answer](https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues).**
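The three mathematical steps listed above (covariance matrix, eigenvectors and eigenvalues, projection onto the top eigenvectors) can be sketched directly in NumPy. This is an illustrative implementation on made-up correlated data, not how scikit-learn computes PCA internally; `pca_eig` and `X_demo` are hypothetical names.

```python
import numpy as np

def pca_eig(X, n_components):
    """PCA via eigendecomposition of the covariance matrix (illustrative sketch)."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)          # 1. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # 2. eigenpairs (ascending order)
    order = np.argsort(eigvals)[::-1]               #    sort by descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]          # 3. top eigenvectors as new axes
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    return X_centered @ components, components, explained_ratio

# Correlated toy data (assumed for illustration): three noisy copies of one latent variable.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 1))
X_demo = np.hstack([latent, 2 * latent, -latent]) + 0.1 * rng.normal(size=(300, 3))

X_proj, components, ratio = pca_eig(X_demo, n_components=2)
print(X_proj.shape, ratio)
```

Because the three columns are almost perfectly correlated, nearly all of the variance ends up in the first component, which is the redundancy-removal effect mentioned above.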

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_t = pca.fit_transform(X)
X_t.shape, pca.explained_variance_ratio_

In [None]:
plotme(X_t, y)

A notebook with the results shown in 3d space can be downloaded from <a href="http://scikit-learn.org/stable/_downloads/plot_pca_iris.ipynb">here</a>.


### Singular Value Decomposition (<a href="http://scikit-learn.org/stable/modules/decomposition.html#truncated-singular-value-decomposition-and-latent-semantic-analysis">`SVD`</a>):

<img src="pics/vector_decomposition.gif" height=400 align="left" style="margin-right: 20px">

Singular Value Decomposition (SVD) is a well-known matrix factorization method widely used in various fields, including statistics and signal processing. Essentially, it is a technique to decompose a matrix into its fundamental components, revealing its underlying structure.

It is analogous to decomposing a 2D vector into its $x$ and $y$ components. Each component describes the original vector by storing information about:  

**a)** The **direction of the components**, given an orthogonal basis (e.g., $u_x$ points to $(1, 0)$ and $u_y$ points to $(0, 1)$).  
**b)** The **length of the projections**, indicating how much each basis contributes (e.g., $s_x=0.5$ and $s_y=1.2$).  
**c)** The **description of the orthogonal basis** the projections are based on (e.g., $v_x$ points to $(1, 0)$, and $v_y$ to $(0, 1)$, so the original axes remain unchanged).  

<br style="clear:left;"/>

The animation is sourced from <a href="https://towardsdatascience.com/svd-8c2f72e264f">this detailed article on SVD</a>.

Instead of decomposing just a single vector (of dimension $n$), we can extend this to a set of $m$ vectors by forming an $m \times n$ matrix $M$. The decomposition of $M$ results in three matrices:

$$
M = U \Sigma V^T
$$

where:

- $U$ is an $m \times m$ orthogonal matrix storing the left singular vectors (directions of new basis vectors in the original space).  
- $\Sigma$ is an $m \times n$ diagonal matrix storing singular values, which indicate the importance of each component.  
- $V$ is an $n \times n$ orthogonal matrix describing the projection axes.  

The diagonal elements of $\Sigma$ contain the singular values of $M$, ordered in descending magnitude. In feature extraction, we can approximate $M$ by keeping only the top $k$ singular values and their corresponding vectors:

$$
M \approx M_k = U_k \Sigma_k V_k^T
$$

This allows us to reduce dimensionality while retaining the most significant information in the dataset.
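As a rough numerical check of the truncation idea, the sketch below factorizes a small random matrix with `np.linalg.svd` and keeps the top $k$ singular values. It uses the thin SVD (`full_matrices=False`), so $U$ is $m \times r$ with $r = \min(m, n)$ rather than $m \times m$ as in the definition above. The data and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))                       # toy m x n matrix (assumed data)

# Thin SVD: U is m x r, s holds the singular values, Vt is r x n, with r = min(m, n).
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # rank-k approximation
M_reduced = U[:, :k] * s[:k]                      # m x k reduced representation

# The Frobenius error equals the norm of the discarded singular values.
error = np.linalg.norm(M - M_k)
print(error, np.sqrt(np.sum(s[k:] ** 2)))
```

Here `M_reduced` (equal to `M @ Vt[:k].T`) plays the role of the reduced features that a truncated SVD would hand to a downstream model.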

In [None]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2)
X_t = svd.fit_transform(X)
X_t.shape, svd.explained_variance_ratio_

In [None]:
plotme(X_t, y)

## 3. Nonlinear Dimensionality Reduction

When working with high-dimensional data, linear methods like PCA or SVD often fall short in capturing complex nonlinear relationships. This is where **t-SNE** (t-Distributed Stochastic Neighbor Embedding) and **UMAP** (Uniform Manifold Approximation and Projection) come into play. These methods are particularly useful for **visualizing high-dimensional datasets in 2D or 3D** while preserving important structure.

### t-SNE (t-Distributed Stochastic Neighbor Embedding)

<img src="./pics/tsne.gif" width=500 align="center" style="margin-bottom: 20px">

[Source: Google Research](https://research.google/blog/realtime-tsne-visualizations-with-tensorflowjs/)

t-SNE is a **dimensionality reduction technique** designed to help visualize **high-dimensional data** in 2D or 3D. Unlike PCA, which looks for the directions of maximum variance, t-SNE focuses on **preserving relationships between nearby points**, making it useful for **revealing clusters** in complex datasets.

The key idea behind t-SNE is to measure how similar two points are in the high-dimensional space and then try to keep those similarities when mapping the data to a lower dimension. To do this, t-SNE builds two probability distributions:  

1. **In the original high-dimensional space:** The similarity between two points $\vec{x}_i$ and $\vec{x}_j$ is measured using a Gaussian function (bell curve). The probability that $\vec{x}_j$ is a "neighbor" of $\vec{x}_i$ depends on how close they are:

   $$ p_{j|i} = \frac{\exp\left(-\frac{||\vec{x}_i - \vec{x}_j||^2}{2\sigma_i^2}\right)}{\sum_{k \neq i} \exp\left(-\frac{||\vec{x}_i - \vec{x}_k||^2}{2\sigma_i^2}\right)} $$

   Here, $\sigma_i$ is chosen automatically to balance local vs. global structure (controlled by a parameter called **perplexity**).

2. **In the lower-dimensional space:** t-SNE tries to place the points $\vec{y}_i$ and $\vec{y}_j$ so that their similarities match the high-dimensional ones. However, instead of a Gaussian, it uses a **Student’s t-distribution**, which has "fatter tails" and allows distant points to stay far apart:

   $$ q_{ij} = \frac{(1 + ||\vec{y}_i - \vec{y}_j||^2)^{-1}}{\sum_{k \neq l} (1 + ||\vec{y}_k - \vec{y}_l||^2)^{-1}} $$

   This prevents the **"crowding problem,"** where too many points would get squeezed together.

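t-SNE then symmetrizes the conditional probabilities into joint probabilities, $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$ for $N$ data points, and **adjusts the low-dimensional positions** by minimizing the **difference (KL divergence)** between the high- and low-dimensional probabilities: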

   $$ C = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} $$

This optimization process is done through **gradient descent**, moving points around iteratively until their relative similarities match as closely as possible.

One thing to note is that **t-SNE focuses on preserving local structure**, meaning nearby points will stay close, but the global shape of the data might not be preserved. This makes it great for spotting clusters, but distances between clusters may not always be meaningful.
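The quantities above can be computed in a few lines of NumPy. This is a simplified sketch, not a t-SNE implementation: it assumes a single fixed `sigma` for every point (real t-SNE tunes $\sigma_i$ per point to match the perplexity), symmetrizes the conditionals as $(p_{j|i} + p_{i|j}) / 2N$, and only evaluates the KL divergence for a random 2D layout without optimizing it. All function and variable names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distances between all rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def high_dim_affinities(X, sigma=1.0):
    """Conditional p_{j|i} with one fixed sigma, then symmetrized to p_{ij}."""
    P = np.exp(-pairwise_sq_dists(X) / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                 # a point is not its own neighbor
    P /= P.sum(axis=1, keepdims=True)        # row i holds the distribution p_{.|i}
    return (P + P.T) / (2 * len(X))          # joint p_{ij}, sums to 1

def low_dim_affinities(Y):
    """Student-t (one degree of freedom) similarities q_{ij}."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) over all pairs with non-zero p_{ij}."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)
X_hd = rng.normal(size=(20, 5))              # toy high-dimensional points (assumed)
Y_ld = rng.normal(size=(20, 2))              # random, unoptimized 2D layout

P = high_dim_affinities(X_hd)
Q = low_dim_affinities(Y_ld)
print(kl_divergence(P, Q))
```

Gradient descent on the positions `Y_ld` would then push this value down; the sketch stops before that step.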

A runnable t-SNE example on the iris data is given at the end of this section.

#### Pros
✅ Great for **clustering visualization**  
✅ Captures **local** structure well  
✅ Works well for datasets with **clear groups**  

#### Cons
❌ Computationally expensive (doesn't scale well to large datasets)  
❌ Different runs can produce **different visualizations**  
❌ Distorts **global** relationships  


### UMAP (Uniform Manifold Approximation and Projection)

<img src="./pics/umap.gif" width=500 align="center" style="margin-bottom: 20px">

[Source: UMAP Documentation](https://umap-learn.readthedocs.io/en/latest/aligned_umap_basic_usage.html)

UMAP is a **dimensionality reduction technique** that is often used as an alternative to **t-SNE**, but it is generally **faster, better at preserving global structure, and scales to larger datasets**. Like t-SNE, it aims to keep similar points close together when mapping high-dimensional data into 2D or 3D, making it great for **visualizing clusters**.

#### Step 1: Constructing the High-Dimensional Graph  
UMAP starts by creating a **graph-based representation** of the data. Instead of just measuring distances, UMAP **assumes that the data lies on a manifold (a curved space) embedded in a high-dimensional space** and tries to learn its structure.  

1. **Computing local neighborhoods:**  
   - Each point $\vec{x}_i$ gets a local neighborhood defined by a distance-based function.  
   - The number of neighbors is controlled by a parameter called **n_neighbors**, which affects how local/global the embedding is.

2. **Building a weighted graph:**  
   - For each point, UMAP assigns probabilities to nearby points based on an exponential function:

     $$ p_{j|i} = \exp\left(-\frac{d(\vec{x}_i, \vec{x}_j) - \rho_i}{\sigma_i}\right) $$  

     where:
     - $d(\vec{x}_i, \vec{x}_j)$ is the distance between two points.
     - $\rho_i$ is the distance from $\vec{x}_i$ to its nearest neighbor, which ensures that every point is connected to at least one other point.
     - $\sigma_i$ controls how fast probability decreases with distance.

3. **Symmetric Graph Construction:**  
   - The final edge weight between two points is computed as:  

     $$ p_{ij} = p_{j|i} + p_{i|j} - p_{j|i} p_{i|j} $$  

   - This graph represents how "connected" each point is to others in the high-dimensional space.
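Step 1 can be sketched as follows. This is a heavily simplified illustration, not the `umap-learn` implementation: it assumes directed weights of the form $\exp(-(d - \rho_i)/\sigma)$ with $\rho_i$ taken as the distance to the nearest neighbor, uses one fixed `sigma` for all points, and connects every pair of points instead of restricting to `n_neighbors`. The name `umap_like_graph` is hypothetical.

```python
import numpy as np

def umap_like_graph(X, sigma=1.0):
    """Directed weights exp(-(d - rho_i) / sigma), then a fuzzy-union symmetrization."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt(np.sum(diff ** 2, axis=-1))          # pairwise distances d(x_i, x_j)

    D_no_self = D + np.diag(np.full(len(X), np.inf))
    rho = D_no_self.min(axis=1)                      # distance to the nearest neighbor

    # Clip at zero so the nearest neighbor always gets weight exactly 1.
    P_cond = np.exp(-np.maximum(D - rho[:, None], 0.0) / sigma)   # P_cond[i, j] = p_{j|i}
    np.fill_diagonal(P_cond, 0.0)

    # p_ij = p_{j|i} + p_{i|j} - p_{j|i} * p_{i|j}
    return P_cond + P_cond.T - P_cond * P_cond.T

rng = np.random.default_rng(0)
X_pts = rng.normal(size=(15, 4))                     # toy data (assumed)
P_graph = umap_like_graph(X_pts)
print(P_graph.round(2)[:3, :5])
```

The symmetrization acts like a probabilistic "or": an edge is strong if either endpoint considers the other a close neighbor.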

#### Step 2: Mapping to a Lower-Dimensional Space  
Now, UMAP **tries to construct a similar graph in lower dimensions (2D or 3D)** while preserving the relationships from the high-dimensional space.

1. **Defining the Low-Dimensional Probabilities:**  
   - UMAP builds a second **fuzzy graph** over the low-dimensional points, with edge weights $q_{ij}$ given by a smooth, heavy-tailed function of distance (similar in spirit to the Student's t-distribution used by t-SNE).  
   - The optimization goal is to **find a low-dimensional layout where the graph structure is best preserved**.

2. **Minimizing the Difference Between Graphs (Optimization Step):**  
   - UMAP minimizes the **cross-entropy** between the high- and low-dimensional graphs:

     $$ C = -\sum_{(i,j) \in \text{edges}} \left[ p_{ij} \log q_{ij} + (1 - p_{ij}) \log (1 - q_{ij}) \right] $$  

   - This ensures that nearby points stay together and far points stay apart.
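The optimization target can be illustrated with a small sketch. It writes the cross-entropy with a leading minus sign, so lower values are better, and assumes a low-dimensional similarity of the form $q_{ij} = 1 / (1 + \lVert \vec{y}_i - \vec{y}_j \rVert^2)$; the actual library fits two extra shape parameters for this curve, which are fixed to 1 here. The function names and toy layouts are hypothetical.

```python
import numpy as np

def fuzzy_cross_entropy(P, Q, eps=1e-12):
    """Cross-entropy between edge-weight matrices, summed over off-diagonal pairs."""
    off_diag = ~np.eye(len(P), dtype=bool)
    p = P[off_diag]
    q = np.clip(Q[off_diag], eps, 1 - eps)           # keep the logs finite
    return float(-np.sum(p * np.log(q) + (1 - p) * np.log(1 - q)))

def low_dim_weights(Y):
    """Assumed low-dimensional similarity: q_ij = 1 / (1 + ||y_i - y_j||^2)."""
    diff = Y[:, None, :] - Y[None, :, :]
    return 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))

rng = np.random.default_rng(0)
Y_a = rng.normal(size=(10, 2))                       # one candidate 2D layout
Y_b = rng.normal(size=(10, 2))                       # a different layout

P_target = low_dim_weights(Y_a)                      # pretend these are the high-dim weights
loss_match = fuzzy_cross_entropy(P_target, low_dim_weights(Y_a))
loss_other = fuzzy_cross_entropy(P_target, low_dim_weights(Y_b))
print(loss_match, loss_other)
```

The layout whose edge weights reproduce the target graph has the lower loss. The first term rewards keeping connected points close, and the second penalizes placing unconnected points close together.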

#### How is UMAP Different from t-SNE?  
- **Faster:** UMAP uses an **approximate nearest neighbor search** (t-SNE does not), making it much faster for large datasets.  
- **Better global structure:** Unlike t-SNE, which mainly preserves local relationships, UMAP can maintain **both local and some global structures**.  
- **More interpretable distances:** The space UMAP creates is often more meaningful in terms of distances between clusters.

UMAP is widely used for **data exploration, clustering, and visualization**, especially in **high-dimensional datasets like images, text, and biological data (e.g., single-cell RNA sequencing).** 🚀

#### Pros
✅ Faster and **scales better** than t-SNE  
✅ Preserves **more of the global structure**  
✅ Works well for datasets with complex manifolds  
✅ Can be used for **general-purpose dimensionality reduction** (not just visualization)  

#### Cons
❌ May not always capture clusters as well as t-SNE  
❌ Some parameters (like `n_neighbors`) can significantly impact the output  


### When to Use t-SNE vs. UMAP?
| Feature | t-SNE | UMAP |
|---------|------|------|
| Focus  | Local Structure | Local + Global Structure |
| Speed  | Slower | Faster |
| Interpretability | Harder to control | More interpretable |
| Large Datasets | Struggles | Handles well |
| Reproducibility | Less consistent | More stable |

**TL;DR**: Use **t-SNE** for **cluster visualization**, and **UMAP** for **general-purpose dimensionality reduction**!

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X)

plotme(X_tsne, y)

UMAP requires an installation with:
```bash
pip install umap-learn
```

In [None]:
import umap

umap_model = umap.UMAP(n_components=2, random_state=42)
X_umap = umap_model.fit_transform(X)

plotme(X_umap, y)

## 4. Trade-offs & When to Use What

Dimensionality reduction techniques come with different strengths and trade-offs. Choosing the right method depends on the dataset and the specific goal of the analysis. Below is a quick guide to help decide when to use each technique.

### **Quick Summary Table**
| **Method**        | **Best When...** | **Key Trade-offs** |
|------------------|-----------------|--------------------|
| **PCA**  | You need a **linear reduction** that captures the most variance. | Assumes linear relationships, may not work well for highly nonlinear data. |
| **SVD**  | You're working with **sparse or high-dimensional data**, such as text (NLP). | Similar to PCA but better suited for matrix factorization tasks. |
| **t-SNE / UMAP**  | You need a **visualization or clustering** rather than strict feature reduction. | t-SNE is computationally expensive, UMAP is faster but assumes a manifold structure. Neither method is great for actual feature reduction. |
| **Feature Selection**  | Interpretability is important (e.g., **medical or business applications**). | May remove informative but redundant features, does not create new combined features. |

### **How to Choose the Right Method**
- If your goal is to **reduce dimensions while preserving variance** → **Use PCA**.
- If you’re working with **sparse, high-dimensional data (e.g., text, NLP)** → **Use SVD**.
- If you want to **visualize high-dimensional data** and discover clusters → **Use t-SNE or UMAP**.
- If **interpretability** is critical and you need to keep meaningful features → **Use Feature Selection**.

There is no single "best" dimensionality reduction method—each has strengths and weaknesses. The best approach depends on the dataset and the specific task at hand. 🚀  

---

## Model of the Week:
## Support Vector Machines (<a href="http://scikit-learn.org/stable/modules/svm.html#support-vector-machines">`SVM`</a>)

Behold, the first truly **black-box** classifier! (Fun fact: SVMs can also be used for regression.)  
Don't worry about the strange name—everything will make sense soon.  

### Why are SVMs awesome?  
Because they are **both linear and nonlinear classifiers** at the same time! Sounds confusing? Here’s the trick: **SVMs perform linear classification, but only after transforming the data into a higher-dimensional space.**  

Let's first consider a simple **linear** case in 2D. There can be many ways to separate two classes with a straight line (a.k.a. a **hyperplane** in higher dimensions), as shown in the figure below:

<img src="pics/svm_separating_hyperplanes.png">  
<small>Image source: <a href="https://en.wikipedia.org/wiki/Support-vector_machine">Wikipedia</a></small>

Clearly, not all separating hyperplanes are equally good.  
- $H_3$ is useless—it doesn’t even separate the classes.  
- Both $H_1$ and $H_2$ do separate them, but… **$H_2$ feels like a better separator**. Why?  

Look at the training points closest to $H_1$—they are dangerously close to flipping sides. The best decision boundary is the one that **maximizes the margin**, meaning it keeps the classes as far apart as possible while still separating them.  

### Support Vectors  
The **support vectors** are the data points closest to the decision boundary. These are the most critical points in the dataset—if we remove or move them, the boundary will shift.  
**SVMs aim to maximize the margin around the decision boundary**, ensuring the best possible separation.  

### Example  
Let's see it in action! (Plotting function is adapted from <a href="http://blog.yhat.com/posts/why-support-vector-machine.html">this blog post</a>.)

In [None]:
from sklearn.datasets import make_circles

from sklearn import linear_model
from sklearn import tree
from sklearn import svm

We generate 500 points, and classify them according to an imaginary circle:

In [None]:
xs = np.random.rand(500) * 5
ys = np.random.rand(500) * 5
cs = ((xs - 3) ** 2 + (ys - 2) ** 2 > 3).astype(int)  # np.int0 was removed in NumPy 2.0

df = pd.DataFrame(data={'x': xs, 'y': ys, 'c': cs})
train_cols = ['x', 'y']

In [None]:
clfs = {
    "SVM": svm.SVC(gamma='auto'),
    # lbfgs is already the default solver; `multi_class` is deprecated in recent scikit-learn releases
    "Logistic": linear_model.LogisticRegression(),
    "Tree": tree.DecisionTreeClassifier()
}

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(15,5))

for i, (clf_name, clf) in enumerate(clfs.items()):
    clf.fit(df[train_cols], df.c)
    plot_results_with_hyperplane(clf, clf_name, df, ax[i])

#### How the heck is this linear?

It is linear in the *transformed space*. Suppose we introduce a third dimension, computed like this:

In [None]:
zs = (xs - 3) ** 2 + (ys - 2) ** 2

Then our data points will look like this in the 3D space:

In [None]:
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter3D(xs, ys, zs, c=cs, cmap=plt.cm.Paired)
ax.view_init(10, 30)

Now we can see that the data points can be separated by a plane in this 3D space. Projecting the intersection of that plane and the surface $z = (x_1-3)^2 + (x_2-2)^2$ back to 2D gives us the classification boundary:

In [None]:
fig = plt.figure()
ax = fig.gca()
ax.scatter(xs, ys, c=cs, cmap=plt.cm.Paired, edgecolors='k')
ax.add_patch(plt.Circle((3,2), radius=np.sqrt(3), fill=False, linewidth=.7))
fig.set_figwidth(4)
fig.set_figheight(4)

There are videos that make this waaay more clear: <a href="https://www.youtube.com/watch?v=9NrALgHFwTo">vid1</a>, <a href="https://www.youtube.com/watch?v=3liCbRZPrZA">vid2</a>

$\renewcommand{\vec}[1]{\mathbf{#1}}$
### The Kernel Trick  

When the data isn’t linearly separable, we might need to **transform** it into a higher-dimensional space where it *is* separable.  

Sounds expensive, right? Computing these extra dimensions can get **very** computationally heavy. Luckily, we don’t actually need to compute the transformed vectors—**we only need their dot products!**  

That’s where the **kernel trick** comes in. Instead of explicitly transforming the data, we use a **kernel function** that tells us what the dot product of two transformed vectors *would have been*:

$$ K(\vec{x},\vec{y}) = \phi(\vec{x}) \cdot \phi(\vec{y}) $$

This lets us work in higher-dimensional spaces **without ever computing the transformation explicitly**!  

#### Example: A Polynomial Kernel  

Let’s say we use a simple **polynomial kernel function** in two dimensions:

$$ K(\vec{x},\vec{y}) = (1 + \vec{x} \cdot \vec{y})^2 $$

where $\vec{x} = (x_1, x_2)$ and $\vec{y} = (y_1, y_2)$.  

At first glance, it’s not obvious what transformation $\phi$ this corresponds to. Let’s expand the equation:

$$K(\vec{x},\vec{y}) = (1+\vec{x} \cdot \vec{y})^2 = (1 + x_1y_1 + x_2y_2)^2$$  
$$ = 1 + x_1^2y_1^2 + x_2^2y_2^2 + 2x_1y_1 + 2x_2y_2 + 2x_1x_2y_1y_2$$  

If we look at this carefully, we can rewrite it as a dot product in a **6-dimensional space**:

$$\vec{x'} = \phi(\vec{x}) = (1, x_1^2, x_2^2, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_1x_2)$$  

and  

$$\vec{y'} = \phi(\vec{y}) = (1, y_1^2, y_2^2, \sqrt{2}y_1, \sqrt{2}y_2, \sqrt{2}y_1y_2)$$  

So, the **implicit transformation** that our kernel function is applying is:

$$\phi(\vec{x}) = (1, x_1^2, x_2^2, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_1x_2)$$  
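The identity derived above is easy to check numerically. The sketch below evaluates the kernel in the original 2D space and compares it with an explicit dot product in the 6-dimensional space. The two test vectors are arbitrary choices, and `poly_kernel` and `phi` are names introduced only for this illustration.

```python
import numpy as np

def poly_kernel(x, y):
    """K(x, y) = (1 + x . y)^2, computed in the original 2D space."""
    return (1.0 + np.dot(x, y)) ** 2

def phi(x):
    """Explicit 6-dimensional feature map matching the expansion above."""
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([1.0, x1 ** 2, x2 ** 2, r2 * x1, r2 * x2, r2 * x1 * x2])

x = np.array([0.5, -1.0])
y = np.array([2.0, 3.0])

k_implicit = poly_kernel(x, y)            # one 2D dot product, one addition, one square
k_explicit = np.dot(phi(x), phi(y))       # dot product in 6D
print(k_implicit, k_explicit)
```

Both routes give the same number (here $x \cdot y = -2$, so $K = (1 - 2)^2 = 1$), but the kernel never builds the 6-dimensional vectors.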

#### Why Does This Matter?  

The **key insight** is that we never actually compute $\phi(\vec{x})$! Instead, we just compute $K(\vec{x},\vec{y})$, which gives us the same result as if we had transformed the data. This **saves enormous amounts of computation** while still allowing us to work in higher-dimensional spaces.  

Kernel functions might sound mysterious at first, but this example shows that they aren’t just black magic! They’re simply a clever way to work in higher dimensions **without ever explicitly going there**.  

Check out some <a href="http://scikit-learn.org/stable/modules/svm.html#svm-kernels">common kernel functions</a> that people use.

---

## Feature Unions

<a href="http://scikit-learn.org/stable/modules/pipeline.html#featureunion-composite-feature-spaces">`FeatureUnions`</a> are "parallel pipes". Every transformer in the union is applied to the input data, and the results are concatenated. This is very useful when we want to create new features by applying different transformers to the same data.

Besides the transformer steps themselves, a weight can be assigned to each transformer.
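As a small illustration of the weights, the sketch below combines two identity transformers and scales the second branch through the `transformer_weights` argument of `FeatureUnion`. The toy matrix and the step names `raw` and `boosted` are made up for this example.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

X_small = np.arange(6, dtype=float).reshape(3, 2)   # toy input (assumed)

# Two identity transformers; the output of the second branch is multiplied by 10.
weighted_union = FeatureUnion(
    transformer_list=[
        ('raw', FunctionTransformer()),
        ('boosted', FunctionTransformer()),
    ],
    transformer_weights={'raw': 1.0, 'boosted': 10.0},
)

X_out = weighted_union.fit_transform(X_small)
print(X_out)
```

The output has four columns: the original two, followed by the same two multiplied by 10. Weights like this matter mostly for distance-based models such as k-nearest neighbors, where the scale of a feature block changes its influence.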

In [None]:
from sklearn.pipeline import FeatureUnion

In [None]:
feat = FeatureUnion(transformer_list=[
    ('thres', VarianceThreshold(.7)),
    ('svd', TruncatedSVD(n_components=2))
])

FeatureUnion can be a step in a pipeline:

In [None]:
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
pipe = Pipeline([
    ('norm', StandardScaler()),
    ('feat', feat),
    ('knn', KNeighborsClassifier())
])

In [None]:
pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

Pipes can be part of unions:

In [None]:
union = FeatureUnion([
    ('normsvd', Pipeline([('norm', StandardScaler()),
                          ('svd', TruncatedSVD(n_components=2))])),
    ('pca', PCA(n_components='mle'))
])

And put this into a pipeline:

In [None]:
pipe = Pipeline([
    ('feat', union),
    ('knn', KNeighborsClassifier())
])

In [None]:
pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

# PIPECEPTION!

---
 
### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#custom-transformers">Create custom transformers</a>

Sometimes we just can't find what we are looking for in sklearn's massive library. In this case we can write our own transformers.  
It's pretty easy:

- Import the base classes

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

- Subclass our transformer

In [None]:
class Multiplier(BaseEstimator, TransformerMixin):
    
    def __init__(self, multitude):
        self.multitude = multitude

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X * self.multitude

- We are good to go!

In [None]:
np.arange(1, 21).reshape(4, 5)

In [None]:
multi = Multiplier(5)
multi.transform(np.arange(1, 21).reshape(4, 5))

In [None]:
multi.transform(X_test)[:10]

---

## Exercise: Prediction on last week's dataset

- Use last week's dataset
- Transform the nominal features
- Transform the numerical features
- Use the custom transformer from the cheat sheet
- Create a feature union from the nominal and the numerical feature pipes
- Create a pipe with the feature union and a model of your liking
- Predict!

## Exercise: [Car Wash](https://www.youtube.com/watch?v=eB0aROCl530)

Build a pipeline to predict the car type using the 2004 cars dataset (`./data/04cars.csv`).