# Dimensionality Reduction

Training data sometimes has thousands of features, which makes training slow and makes it harder to find a good solution. Reducing the number of features can turn an intractable problem into a tractable one. Dimensionality reduction also helps with visualization, since data reduced to two or three dimensions can be plotted.
<br>Techniques of Dimensionality Reduction:
<ul>
    <li>PCA</li>
    <li>Kernel PCA</li>
    <li>LLE</li>
</ul>

In high-dimensional datasets, training instances tend to lie far away from each other, so the data is sparse and predictions rest on larger extrapolations. One solution, in theory, is to get more training data. However, the number of instances needed to keep a given density grows exponentially with the number of features.
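
The sparsity effect can be checked numerically. The sketch below is only an illustration (the helper `mean_pair_distance` and the sample sizes are assumptions, not part of the original notebook): it draws random pairs of points in a unit hypercube and measures how the average distance between them grows with the number of dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pair_distance(n_dims, n_pairs=2000):
    """Average Euclidean distance between random pairs of points in a unit hypercube."""
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return float(np.linalg.norm(a - b, axis=1).mean())

# Same number of points each time; only the dimensionality changes.
dims = (2, 10, 100, 1000)
distances = {n: mean_pair_distance(n) for n in dims}
for n, dist in distances.items():
    print(f"{n:>5} dimensions: mean distance {dist:.2f}")
```

With the number of points held fixed, the average distance keeps increasing as dimensions are added, which is why a fixed-size training set becomes sparser in higher dimensions.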

# Approaches for DR <br>
## Projection
Training instances are often not spread out uniformly across all dimensions: some features are almost constant, while others are highly correlated. As a result, the training instances lie within (or close to) a much lower-dimensional subspace. Projection maps the high-dimensional dataset onto that lower-dimensional subspace. Think of 3D data that lies close to a plane: instead of keeping an x, y, and z component for each point, you project it onto the 2D plane and keep two coordinates.
<br>
When a dataset twists and turns in 3D space (like a rolled-up sheet), projection isn't the best approach, because it squashes separate layers of the data on top of each other.

<br>
## Manifold Learning
A d-dimensional manifold is a shape in a higher-dimensional space that locally resembles a d-dimensional hyperplane; it can be bent and twisted in that space. Manifold learning works by modeling the manifold on which the training instances lie.
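
The difference between projecting and unrolling can be checked on a Swiss roll, a 2D sheet rolled up in 3D. This is a rough sketch, not part of the original notebook: it uses scikit-learn's `make_swiss_roll` and `NearestNeighbors`, and the helper `mean_neighbor_gap` is a hypothetical name introduced here. It compares a naive projection (dropping one axis) with the true unrolled coordinates, by checking how far apart along the roll each point's nearest neighbor is.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import NearestNeighbors

# t is each point's position along the rolled-up sheet.
X, t = make_swiss_roll(n_samples=1000, noise=0.0, random_state=42)

unrolled = np.column_stack([t, X[:, 1]])  # true 2D manifold coordinates
projected = X[:, :2]                      # naive projection: drop the third axis

def mean_neighbor_gap(points_2d):
    """Mean |t difference| between each point and its nearest neighbor in a 2D layout."""
    nn = NearestNeighbors(n_neighbors=2).fit(points_2d)
    _, idx = nn.kneighbors(points_2d)
    # idx[:, 0] is the point itself, idx[:, 1] is its nearest other point.
    return float(np.abs(t - t[idx[:, 1]]).mean())

gap_projected = mean_neighbor_gap(projected)
gap_unrolled = mean_neighbor_gap(unrolled)
print(gap_projected, gap_unrolled)
```

A larger gap for the projected layout means that points from different layers of the roll end up next to each other, which is the squashing effect that manifold learning tries to avoid.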


# PCA (Principal Component Analysis)
PCA first identifies the hyperplane that lies closest to the data, and then projects the data onto it. Choosing the right hyperplane matters: the idea is to preserve as much of the training set's variance as possible, which is equivalent to minimizing the mean squared distance between the original dataset and its projection onto that hyperplane.

<br>
PCA finds the axis that accounts for the largest amount of variance in the training set, and a second axis which is orthogonal to the first one, that accounts for the largest amount of remaining variance.

To find the principal components, use singular value decomposition (SVD), which factorizes the centered training set matrix X into the product of three matrices, U Σ Vᵀ. The columns of V are unit vectors that define the principal components: mutually orthogonal axes, one per dimension. Once you have identified them, you can reduce the dataset to d dimensions by projecting it onto the hyperplane defined by the first d principal components.
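
The SVD route can be sketched directly with NumPy. The toy data below (points near a 2D plane embedded in 3D) and the variable names are assumptions made for illustration; `np.linalg.svd` is the only library routine involved.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 200 points lying near a 2D plane embedded in 3D.
latent = rng.normal(size=(200, 2))
mixing = np.array([[2.0, 0.5, 0.3],
                   [0.1, 1.0, 0.4]])
X = latent @ mixing + 0.05 * rng.normal(size=(200, 3))

# PCA assumes the data is centered around the origin.
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal components, ordered by decreasing singular value.
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

d = 2
W_d = Vt[:d].T                # first d principal components as columns
X_reduced = X_centered @ W_d  # projection onto the d-dimensional hyperplane
print(X_reduced.shape)
```

scikit-learn's `PCA` class handles the centering itself, so this manual step is only needed when working with the SVD directly.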

## Explained Variance Ratio
The explained variance ratio (EVR) indicates the proportion of the dataset's variance that lies along the axis of each principal component.
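
As a small sketch of where these ratios come from (the synthetic data and variable names are assumptions for illustration), the variance along each principal component is proportional to the square of its singular value, so the ratios follow from normalizing those squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: four independent features with very different spreads.
X = rng.normal(size=(500, 4)) * np.array([5.0, 2.0, 1.0, 0.1])
X_centered = X - X.mean(axis=0)

_, s, _ = np.linalg.svd(X_centered, full_matrices=False)

# Variance captured along each principal component.
explained_variance = s**2 / (len(X) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()
print(explained_variance_ratio)
```

In scikit-learn, the same quantity is exposed as the `explained_variance_ratio_` attribute of a fitted `PCA` object.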

## How to choose the right number of dimensions?
For visualization, reduce to 2 or 3 dimensions. Otherwise, a common choice is the smallest number of dimensions that preserves a sufficiently large portion of the variance, such as 95%.
You can also plot the cumulative sum of the explained variance ratios against the number of dimensions and look for the elbow, where the curve stops growing quickly.
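
A minimal sketch of the 95% rule on the breast cancer dataset, assuming standardized features (the variable names are illustrative). It computes the cumulative explained variance by hand and then cross-checks the result against `PCA(n_components=0.95)`, since scikit-learn accepts a float between 0 and 1 as the fraction of variance to preserve.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# Fit PCA with all components, then accumulate the explained variance ratios.
pca_full = PCA().fit(X_scaled)
cumsum = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of dimensions whose cumulative ratio reaches 95%.
d = int(np.argmax(cumsum >= 0.95)) + 1
print("dimensions needed:", d)

# Equivalent shortcut: let PCA pick the number of components itself.
pca_95 = PCA(n_components=0.95).fit(X_scaled)
print("components kept:", pca_95.n_components_)
```

Plotting `cumsum` against the number of dimensions gives the curve on which to look for the elbow.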

In [9]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.neighbors import KNeighborsClassifier
import plotly
import plotly.graph_objs as go
 
%matplotlib inline

In [6]:
df.describe()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


In [8]:
corr_matrix = df.corr()
corr_matrix

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
mean radius,1.0,0.323782,0.997855,0.987357,0.170581,0.506124,0.676764,0.822529,0.147741,-0.311631,...,0.969539,0.297008,0.965137,0.941082,0.119616,0.413463,0.526911,0.744214,0.163953,0.007066
mean texture,0.323782,1.0,0.329533,0.321086,-0.023389,0.236702,0.302418,0.293464,0.071401,-0.076437,...,0.352573,0.912045,0.35804,0.343546,0.077503,0.27783,0.301025,0.295316,0.105008,0.119205
mean perimeter,0.997855,0.329533,1.0,0.986507,0.207278,0.556936,0.716136,0.850977,0.183027,-0.261477,...,0.969476,0.303038,0.970387,0.94155,0.150549,0.455774,0.563879,0.771241,0.189115,0.051019
mean area,0.987357,0.321086,0.986507,1.0,0.177028,0.498502,0.685983,0.823269,0.151293,-0.28311,...,0.962746,0.287489,0.95912,0.959213,0.123523,0.39041,0.512606,0.722017,0.14357,0.003738
mean smoothness,0.170581,-0.023389,0.207278,0.177028,1.0,0.659123,0.521984,0.553695,0.557775,0.584792,...,0.21312,0.036072,0.238853,0.206718,0.805324,0.472468,0.434926,0.503053,0.394309,0.499316
mean compactness,0.506124,0.236702,0.556936,0.498502,0.659123,1.0,0.883121,0.831135,0.602641,0.565369,...,0.535315,0.248133,0.59021,0.509604,0.565541,0.865809,0.816275,0.815573,0.510223,0.687382
mean concavity,0.676764,0.302418,0.716136,0.685983,0.521984,0.883121,1.0,0.921391,0.500667,0.336783,...,0.688236,0.299879,0.729565,0.675987,0.448822,0.754968,0.884103,0.861323,0.409464,0.51493
mean concave points,0.822529,0.293464,0.850977,0.823269,0.553695,0.831135,0.921391,1.0,0.462497,0.166917,...,0.830318,0.292752,0.855923,0.80963,0.452753,0.667454,0.752399,0.910155,0.375744,0.368661
mean symmetry,0.147741,0.071401,0.183027,0.151293,0.557775,0.602641,0.500667,0.462497,1.0,0.479921,...,0.185728,0.090651,0.219169,0.177193,0.426675,0.4732,0.433721,0.430297,0.699826,0.438413
mean fractal dimension,-0.311631,-0.076437,-0.261477,-0.28311,0.584792,0.565369,0.336783,0.166917,0.479921,1.0,...,-0.253691,-0.051269,-0.205151,-0.231854,0.504942,0.458798,0.346234,0.175325,0.334019,0.767297


In [63]:
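# preprocessed_data is not defined in the cells shown here. Standardizing the
# features is assumed, since PCA is sensitive to feature scale.
preprocessed_data = StandardScaler().fit_transform(df)
pca = PCA(n_components=3)
pca.fit(preprocessed_data)
decomposed_data = pca.transform(preprocessed_data)
print(pca.explained_variance_ratio_, sum(pca.explained_variance_ratio_))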

[0.44272026 0.18971182 0.09393163] 0.7263637090894923


In [64]:
plotly.offline.init_notebook_mode(connected=True)

In [65]:
data = go.Heatmap(z=pca.components_, 
                  x=bunch.feature_names, 
                  y=['PC 1', 'PC 2', 'PC 3'], 
                  colorscale='Viridis')
 
# Plot heatmap.
plotly.offline.iplot([data], filename='heatmap')
# This plot shows the weight (loading) of each feature in each principal component.

In [66]:
# Add malignant column.
decomposed_df = pd.DataFrame(decomposed_data, columns=['x', 'y', 'z'])
decomposed_df['malignant'] = 1 - bunch.target
 
# Create individual data sets.
malignant = decomposed_df[decomposed_df.malignant == 1]
benign = decomposed_df[decomposed_df.malignant == 0]

In [67]:
# Create line style.
line_style = dict(color='rgba(0, 0, 0, 0.14)',width=0.5)
 
# Create scatters.
malignant_scatter = go.Scatter3d(
    x=malignant['x'],
    y=malignant['y'],
    z=malignant['z'],
    mode='markers',
    marker=dict(
        color='rgb(181, 20, 37)',
        size=12,
        opacity=0.8,
        line=line_style
    ),
    name='Malignant'
)
benign_scatter = go.Scatter3d(
    x=benign['x'],
    y=benign['y'],
    z=benign['z'],
    mode='markers',
    marker=dict(
        color='rgb(5, 99, 226)',
        size=12,
        opacity=0.8,
        line=line_style
    ),
    name='Benign'
)
 
# Create data array. Ensure malignant scatter is rendered above (can we merge layers somehow?).
data = [benign_scatter, malignant_scatter]
 
# Create layout.
layout = go.Layout(showlegend=True, margin=dict(l=0,r=0,b=0,t=0))
 
# Render (offline).
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig, filename='3d-scatter')

In [3]:
# Get data.
bunch = load_breast_cancer()
df = pd.DataFrame(bunch.data, columns=bunch.feature_names)
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [18]:
dataset = load_breast_cancer()
features = dataset['data']
labels = dataset['target']
train, test, train_labels, test_labels = train_test_split(features,
                                                          labels,
                                                          test_size=0.15)

In [19]:
newClassifier = KNeighborsClassifier()
newClassifier.fit(train, train_labels)
print(newClassifier.score(test, test_labels))

0.9069767441860465


In [7]:
modifiedDataset = load_breast_cancer()
X_raw = modifiedDataset['data']
y = modifiedDataset['target']
scaler = StandardScaler()
X = scaler.fit_transform(X_raw)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=42)

In [10]:
pca = PCA(n_components=25)
X_train_pca = pca.fit_transform(X_train)
nearClassifier = KNeighborsClassifier()
X_test_pca = pca.transform(X_test)
nearClassifier.fit(X_train_pca, y_train)
print(nearClassifier.score(X_test_pca, y_test))


0.9473684210526315


# Incremental PCA
Regular PCA requires the whole training set to fit in memory. Incremental PCA lets you split the training set into mini-batches and feed them in one at a time.
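
A minimal sketch with scikit-learn's `IncrementalPCA`, assuming standardized breast cancer features and an arbitrary choice of 10 mini-batches. The dataset here is small enough to fit in memory, so the batching only illustrates the API; in a real out-of-core setting the batches would be read from disk instead.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import IncrementalPCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

inc_pca = IncrementalPCA(n_components=3)

# Feed the training set one mini-batch at a time.
for batch in np.array_split(X_scaled, 10):
    inc_pca.partial_fit(batch)

X_inc = inc_pca.transform(X_scaled)
print(X_inc.shape, inc_pca.explained_variance_ratio_)
```

Each call to `partial_fit` needs a batch with at least as many instances as `n_components`.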

# Randomized PCA 
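Randomized PCA is a stochastic algorithm that quickly finds an approximation of the first d principal components. Its computational complexity is much lower than that of full SVD-based PCA when d is much smaller than the number of features.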
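
In scikit-learn the randomized solver is selected through the `svd_solver` parameter of `PCA`. The sketch below (standardized breast cancer features and three components are assumptions for illustration) fits both solvers so their explained variance ratios can be compared side by side.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# Exact decomposition versus the stochastic approximation.
full_pca = PCA(n_components=3, svd_solver="full").fit(X_scaled)
rnd_pca = PCA(n_components=3, svd_solver="randomized", random_state=42).fit(X_scaled)

print(full_pca.explained_variance_ratio_)
print(rnd_pca.explained_variance_ratio_)
```

On a dataset this small the speed difference is negligible; the randomized solver pays off when both the number of instances and the number of features are large.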

# Kernel PCA
Kernel PCA uses the kernel trick, a mathematical technique that implicitly maps training instances into a very high-dimensional space, allowing for nonlinear classification and regression. A linear decision boundary in that high-dimensional space corresponds to a nonlinear boundary in the original space. <br>
The kernel trick can be applied to PCA to perform nonlinear projections for dimensionality reduction.
It is often good at preserving clusters of instances after projection.

# LLE
Locally Linear Embedding (LLE) is a manifold learning technique that does not rely on projections. LLE first measures how each training instance linearly relates to its closest neighbors, and then looks for a low-dimensional representation of the training set where these local relationships are best preserved. This makes it good at unrolling twisted manifolds, especially when there is not too much noise.

In [32]:
from sklearn.decomposition import KernelPCA
rbf_pca = KernelPCA(n_components = 25, kernel="rbf", gamma=0.001)
X_reduced = rbf_pca.fit_transform(X_train)
secondaryClassifier = KNeighborsClassifier()
X_reduced_test = rbf_pca.transform(X_test)
secondaryClassifier.fit(X_reduced, y_train)
print(secondaryClassifier.score(X_reduced_test, y_test))

0.9473684210526315


In [33]:
from sklearn.manifold import LocallyLinearEmbedding
lle = LocallyLinearEmbedding(n_components = 25, n_neighbors=20)
X_reduced_lle = lle.fit_transform(X_train)
thirdClassifier = KNeighborsClassifier()
X_reduced_lle_test = lle.transform(X_test)
thirdClassifier.fit(X_reduced_lle, y_train)
print(thirdClassifier.score(X_reduced_lle_test, y_test))

0.9210526315789473
