
<br>
=========================================================<br>
Hashing feature transformation using Totally Random Trees<br>
=========================================================<br>
RandomTreesEmbedding provides a way to map data to a<br>
very high-dimensional, sparse representation, which might<br>
be beneficial for classification.<br>
The mapping is completely unsupervised and very efficient.<br>
This example visualizes the partitions given by several<br>
trees and shows how the transformation can also be used for<br>
non-linear dimensionality reduction or non-linear classification.<br>
Neighboring points often share the same leaf of a tree and therefore<br>
share large parts of their hashed representation. This makes it possible to<br>
separate two concentric circles simply based on the principal components<br>
of the transformed data with truncated SVD.<br>
In high-dimensional spaces, linear classifiers often achieve<br>
excellent accuracy. For sparse binary data, BernoulliNB<br>
is particularly well-suited. The bottom row compares the<br>
decision boundary obtained by BernoulliNB in the transformed<br>
space with that of an ExtraTreesClassifier forest learned on the<br>
original data.<br>


In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding, ExtraTreesClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import BernoulliNB

Make a synthetic dataset of two noisy concentric circles

In [None]:
X, y = make_circles(factor=0.5, random_state=0, noise=0.05)

Use RandomTreesEmbedding to transform the data

In [None]:
hasher = RandomTreesEmbedding(n_estimators=10, random_state=0, max_depth=3)
X_transformed = hasher.fit_transform(X)
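
A quick way to see what the embedding looks like is to inspect its shape and sparsity. The sketch below rebuilds the same dataset and hasher so that it runs on its own; the names `dense_codes`, `nonzeros_per_row`, and `max_leaves` are illustrative and not part of the original example. It assumes the default one-hot leaf encoding, in which each sample activates one leaf per tree.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding

X, y = make_circles(factor=0.5, random_state=0, noise=0.05)
hasher = RandomTreesEmbedding(n_estimators=10, random_state=0, max_depth=3)
X_transformed = hasher.fit_transform(X)

# Densify only for inspection; the data here is small (100 samples).
dense_codes = X_transformed.toarray()

# Each sample falls into exactly one leaf of every tree, so each row
# should contain n_estimators non-zero entries.
nonzeros_per_row = np.count_nonzero(dense_codes, axis=1)

# A tree of depth d has at most 2**d leaves, which bounds the output width.
max_leaves = hasher.n_estimators * 2 ** hasher.max_depth

print("embedding shape:", dense_codes.shape)
print("non-zeros per row:", nonzeros_per_row.min(), "to", nonzeros_per_row.max())
print("upper bound on number of columns:", max_leaves)
```

The number of columns depends on how many leaves the random trees actually produce, so it can be smaller than the bound.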

Visualize result after dimensionality reduction using truncated SVD

In [None]:
svd = TruncatedSVD(n_components=2)
X_reduced = svd.fit_transform(X_transformed)
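
The SVD projection can only separate the circles if nearby points really do share leaves. The following is a rough check of that claim, not part of the original example: it counts, for every point, how many trees place it in the same leaf as its nearest neighbor and as its farthest point. The names `codes`, `shared`, `nearest`, and `farthest` are assumptions made for this sketch.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding

X, y = make_circles(factor=0.5, random_state=0, noise=0.05)
hasher = RandomTreesEmbedding(n_estimators=10, random_state=0, max_depth=3)
codes = hasher.fit_transform(X).toarray()

# Rows are one-hot per tree, so the dot product of two rows counts the
# trees in which both samples land in the same leaf.
shared = codes @ codes.T

# Pairwise Euclidean distances in the original 2d space.
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
farthest = dist.argmax(axis=1)
np.fill_diagonal(dist, np.inf)  # exclude each point from its own neighbors
nearest = dist.argmin(axis=1)

idx = np.arange(len(X))
shared_nearest = shared[idx, nearest].mean()
shared_farthest = shared[idx, farthest].mean()

print("mean shared leaves with nearest neighbor:", shared_nearest)
print("mean shared leaves with farthest point:  ", shared_farthest)
```

On this dataset the first number should be noticeably larger than the second, which is the locality property the hashed representation relies on.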

Learn a Naive Bayes classifier on the transformed data

In [None]:
nb = BernoulliNB()
nb.fit(X_transformed, y)

Learn an ExtraTreesClassifier for comparison

In [None]:
trees = ExtraTreesClassifier(max_depth=3, n_estimators=10, random_state=0)
trees.fit(X, y)
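
Before looking at decision boundaries, it can help to compare the two models numerically. The sketch below refits both on the same data and reports training accuracy through the standard `score` method; the names `nb_acc` and `trees_acc` are illustrative. Training accuracy is only a sanity check here, not an estimate of generalization.

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding, ExtraTreesClassifier
from sklearn.naive_bayes import BernoulliNB

X, y = make_circles(factor=0.5, random_state=0, noise=0.05)

# Unsupervised embedding followed by Naive Bayes on the sparse binary codes.
hasher = RandomTreesEmbedding(n_estimators=10, random_state=0, max_depth=3)
X_transformed = hasher.fit_transform(X)
nb = BernoulliNB().fit(X_transformed, y)

# Supervised extra-trees forest trained directly on the original 2d data.
trees = ExtraTreesClassifier(max_depth=3, n_estimators=10, random_state=0)
trees.fit(X, y)

nb_acc = nb.score(X_transformed, y)
trees_acc = trees.score(X, y)
print(f"BernoulliNB on embedding, training accuracy: {nb_acc:.2f}")
print(f"ExtraTreesClassifier, training accuracy:     {trees_acc:.2f}")
```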

Scatter plot of original and reduced data

In [None]:
fig = plt.figure(figsize=(9, 8))

In [None]:
ax = plt.subplot(221)
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, edgecolor='k')
ax.set_title("Original Data (2d)")
ax.set_xticks(())
ax.set_yticks(())

In [None]:
ax = plt.subplot(222)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, s=50, edgecolor='k')
ax.set_title("Truncated SVD reduction (2d) of transformed data (%dd)" %
             X_transformed.shape[1])
ax.set_xticks(())
ax.set_yticks(())

Plot the decision function in the original space. For that, we will assign a color<br>
to each point in the mesh [x_min, x_max]x[y_min, y_max].

In [None]:
h = .01
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

Transform the grid using RandomTreesEmbedding, then predict class probabilities with Naive Bayes

In [None]:
transformed_grid = hasher.transform(np.c_[xx.ravel(), yy.ravel()])
y_grid_pred = nb.predict_proba(transformed_grid)[:, 1]
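
The transform-then-predict step above can also be bundled into a single estimator. The sketch below uses `make_pipeline` from `sklearn.pipeline`, which the original example does not use; the coarser grid step of 0.1, the fixed grid bounds, and the names `pipeline`, `grid`, and `grid_proba` are assumptions chosen to keep it short.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

X, y = make_circles(factor=0.5, random_state=0, noise=0.05)

# The pipeline applies the embedding before every call to the classifier,
# so new points are always hashed with the same fitted trees.
pipeline = make_pipeline(
    RandomTreesEmbedding(n_estimators=10, random_state=0, max_depth=3),
    BernoulliNB(),
)
pipeline.fit(X, y)

# Coarse grid over the plotting area (step 0.1 instead of 0.01).
step = 0.1
xx, yy = np.meshgrid(np.arange(-1.5, 1.5, step), np.arange(-1.5, 1.5, step))
grid = np.c_[xx.ravel(), yy.ravel()]

# Probability of class 1 at each grid point, ready to reshape for pcolormesh.
grid_proba = pipeline.predict_proba(grid)[:, 1]
print(grid_proba.reshape(xx.shape).shape)
```

With the same `random_state`, this should give the same probabilities as calling `hasher.transform` and `nb.predict_proba` separately; the pipeline mainly removes the risk of forgetting the transform step.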

In [None]:
ax = plt.subplot(223)
ax.set_title("Naive Bayes on Transformed data")
ax.pcolormesh(xx, yy, y_grid_pred.reshape(xx.shape))
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, edgecolor='k')
ax.set_ylim(-1.4, 1.4)
ax.set_xlim(-1.4, 1.4)
ax.set_xticks(())
ax.set_yticks(())

Predict class probabilities on the grid using ExtraTreesClassifier

In [None]:
y_grid_pred = trees.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

In [None]:
ax = plt.subplot(224)
ax.set_title("ExtraTrees predictions")
ax.pcolormesh(xx, yy, y_grid_pred.reshape(xx.shape))
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, edgecolor='k')
ax.set_ylim(-1.4, 1.4)
ax.set_xlim(-1.4, 1.4)
ax.set_xticks(())
ax.set_yticks(())

In [None]:
plt.tight_layout()
plt.show()