
Add explanation for why t-SNE is not a good feature preprocessor for models #78

Closed
rhiever opened this issue Oct 30, 2016 · 22 comments

@rhiever (Contributor) commented Oct 30, 2016

Notebook 22 makes a really important point that t-SNE is only for visualization, yet doesn't explicitly explain why that is the case. We should add a brief explanation for why that is.

@rhiever changed the title from "Add explanation for why t-SNE is not a good feature preprocess for models" to "Add explanation for why t-SNE is not a good feature preprocessor for models" on Oct 30, 2016
@amueller (Owner)

Well, unsupervised learning will always throw away discriminative information...

@rhiever (Contributor, Author) commented Oct 31, 2016

I think the point of the exercise was to show that this effect is particular to t-SNE. E.g., if you apply PCA or Isomap, you can oftentimes improve, or at least not hurt, your model accuracy.

@amueller (Owner) commented Nov 2, 2016

Really? I would imagine that PCA down to two dimensions will heavily impact accuracy.

@rasbt (Collaborator) commented Nov 2, 2016

> Really? I would imagine that PCA down to two dimensions will heavily impact accuracy.

I'd say it really depends on a lot of factors (the model, the dataset, the explained-variance ratio, ...). I can imagine that for small datasets and models that tend to overfit (e.g., k-NN with a small k), it could be really helpful. More generally, I think it can be a useful way to improve performance (think 'curse of dimensionality') as an alternative to feature selection, and/or if you can't regularize your model.

@amueller (Owner) commented Nov 2, 2016

Yeah it depends on many things, that's true. It is certainly a form of regularization. But reducing digits to 2d is probably too much, no matter what method. The point of the exercise was more "don't use manifold learning for supervised tasks". PCA might be helpful in certain situations.
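
To make the "2d is probably too much" point concrete, here is a minimal sketch (not from the notebook) that sweeps the number of PCA components with a k-NN classifier on digits; the component counts and random_state are arbitrary choices:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=1)

# fit PCA on the training data only, then transform both splits
for n_components in (2, 5, 10, 20, 40):
    pca = PCA(n_components=n_components).fit(X_train)
    clf = KNeighborsClassifier().fit(pca.transform(X_train), y_train)
    print(n_components, clf.score(pca.transform(X_test), y_test))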

@rhiever (Contributor, Author) commented Nov 2, 2016

Just to explore your suspicions, @amueller:

from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits

digits = load_digits()

X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=1)

# baseline: k-NN on the raw 64-dimensional pixel features
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
print('KNeighborsClassifier accuracy: {}'.format(clf.score(X_test, y_test)))

# PCA: fit on the training data, reuse the same projection for the test data
pca = PCA(n_components=2)
digits_pca_train = pca.fit_transform(X_train)
digits_pca_test = pca.transform(X_test)

clf = KNeighborsClassifier()
clf.fit(digits_pca_train, y_train)
print('KNeighborsClassifier accuracy with PCA: {}'.format(clf.score(digits_pca_test, y_test)))

# t-SNE: no transform method, so the test set is embedded with a separate fit_transform
tsne = TSNE(random_state=42)
digits_tsne_train = tsne.fit_transform(X_train)
digits_tsne_test = tsne.fit_transform(X_test)

clf = KNeighborsClassifier()
clf.fit(digits_tsne_train, y_train)
print('KNeighborsClassifier accuracy with t-SNE: {}'.format(clf.score(digits_tsne_test, y_test)))
KNeighborsClassifier accuracy: 0.9933333333333333
KNeighborsClassifier accuracy with PCA: 0.6266666666666667
KNeighborsClassifier accuracy with t-SNE: 0.0022222222222222222

t-SNE is orders of magnitude worse.

@rasbt (Collaborator) commented Nov 2, 2016

Oh, wow, the minimum expected accuracy (random guessing) would be 10%; that's really, really bad then! But I see that you have an error here:

pca.fit_transform(X_test)
digits_tsne_test = tsne.fit_transform(X_test)

It should be

pca.transform(X_test) and digits_tsne_test = tsne.transform(X_test)
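
For reference, a minimal sketch of the fit-on-train / transform-both pattern being suggested here, reusing X_train and X_test from the snippet above; note that, as pointed out below, scikit-learn's TSNE does not actually provide a transform method, so this pattern only works for PCA:

from sklearn.decomposition import PCA

# learn the projection on the training split only
pca = PCA(n_components=2)
digits_pca_train = pca.fit_transform(X_train)
digits_pca_test = pca.transform(X_test)   # reuse the fitted projection on the test split

# TSNE has no transform(); calling tsne.transform(X_test) raises an AttributeError
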

@rhiever (Contributor, Author) commented Nov 3, 2016

Ah, you're right @rasbt. You can't use the transform method on TSNE, and I had just copied and pasted from the t-SNE code. :-)

I updated the code and results above with that fix. PCA actually performs MUCH better than t-SNE now. So why is t-SNE so bad for classification?

@rasbt (Collaborator) commented Nov 3, 2016

Oh yeah, good point ... mentioning one problem and introducing another :P

@amueller (Owner) commented Nov 8, 2016

What's your code now? Going down to two dimensions I would expect t-SNE to perform better than PCA, but I'd expect both to be worse than not doing anything.

@rasbt (Collaborator) commented Nov 9, 2016

> What's your code now? Going down to two dimensions I would expect t-SNE to perform better than PCA, but I'd expect both to be worse than not doing anything.

Hm, yeah, I'd also naturally expect t-SNE to perform better on this particular dataset. However, I think the comparison in the code above is not entirely fair. You can't do a fit_transform separately on training and test data, since the embedding depends on the order of the samples, right? I.e., the "position" of the "clusters" is arbitrary, isn't it? I think you would at least need to use something like

adjusted_rand_score(y_predict, y_test)
print('KNeighborsClassifier adjusted Rand score with t-SNE: {}'.format(adjusted_rand_score(y_predict, y_test)))

for t-SNE if you fit_transform train and test data separately.
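
For reference, a sketch of what that evaluation might look like, reusing X_train, X_test, y_train, and y_test from the code above; the variable names here are just illustrative:

from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score
from sklearn.neighbors import KNeighborsClassifier

# each fit_transform produces its own embedding with an arbitrary orientation
digits_tsne_train = TSNE(random_state=42).fit_transform(X_train)
digits_tsne_test = TSNE(random_state=42).fit_transform(X_test)

clf = KNeighborsClassifier()
clf.fit(digits_tsne_train, y_train)
y_predict = clf.predict(digits_tsne_test)

# plain accuracy is meaningless here because the two embeddings are not aligned;
# the adjusted Rand index only compares the grouping, not the label identities
print('Adjusted Rand score with t-SNE: {}'.format(adjusted_rand_score(y_test, y_predict)))
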

@rhiever (Contributor, Author) commented May 4, 2017

@rasbt is right on this one. The reason t-SNE doesn't work here is that it is fit separately on the training data and then on the testing data, which causes the clusters to fall in different areas of the embedding.

@rhiever closed this as completed May 4, 2017
@fingoldo

Is there really no way to add a pure .transform method to TSNE, like Isomap already has? In 2D, t-SNE's separation of the MNIST dataset is much, much better; a pity it can't be used as a regular transformer...

@amueller (Owner)

There is a way to implement this, I think, but it's not implemented in sklearn right now. Not sure if there's a PR. @fingoldo you might also be interested in UMAP: https://github.com/lmcinnes/umap
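
One possible workaround (just a sketch, not a scikit-learn feature): fit t-SNE on the training data only, then learn a separate regression model that maps the original feature space onto the training embedding and use it to project new points. The regressor choice here is an arbitrary example and the result is only an approximation:

from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsRegressor

tsne = TSNE(random_state=42)
train_embedding = tsne.fit_transform(X_train)    # embedding of the training data only

# approximate an out-of-sample "transform" by regressing features -> embedding
mapper = KNeighborsRegressor(n_neighbors=5)
mapper.fit(X_train, train_embedding)
test_embedding = mapper.predict(X_test)          # test points land in the same coordinate system
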

@fingoldo

Thank you so much, Andreas, for this great suggestion!
Features added by UMAP proved to be useful indeed :-) Quick & dirty assessment:


import numpy as np
from datetime import datetime
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from umap import UMAP

digits = load_digits()
# train/test split not shown in the original snippet; random_state is an arbitrary choice
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=1)

def EstimateClassifier(model, transformer=None):
    startTime = datetime.now()
    if transformer:
        transformer.fit(x_train)

        # append the low-dimensional embedding as extra columns next to the original features
        dp = transformer.transform(x_train)
        x_train_new = np.concatenate((x_train, dp), axis=1)

        dp = transformer.transform(x_test)
        x_test_new = np.concatenate((x_test, dp), axis=1)
    else:
        x_train_new, x_test_new = x_train, x_test
    model.fit(x_train_new, y_train)
    timeElapsed = datetime.now() - startTime
    print("Test Accuracy: %s" % (accuracy_score(y_test, model.predict(x_test_new))))
    print("time:", timeElapsed)
EstimateClassifier(GaussianNB())
Test Accuracy: 0.833333333333
time: 7.51 ms

EstimateClassifier(GaussianNB(),PCA(n_components=2))
Test Accuracy: 0.855555555556
time: 17.5 ms

EstimateClassifier(GaussianNB(),Isomap(n_components=2))
Test Accuracy: 0.893333333333
time: 1.5 s

EstimateClassifier(GaussianNB(),UMAP(n_components=2,n_neighbors=5,min_dist=0.3,metric='correlation'))
Test Accuracy: 0.917777777778
time: 2.67 s

EstimateClassifier(RandomForestClassifier())
Test Accuracy: 0.951111111111
time: 38.5 ms

EstimateClassifier(RandomForestClassifier(),PCA(n_components=2))
Test Accuracy: 0.935555555556
time: 49.5 ms

EstimateClassifier(RandomForestClassifier(),Isomap(n_components=2))
Test Accuracy: 0.971111111111
time: 1.53 s

EstimateClassifier(RandomForestClassifier(),UMAP(n_components=2,n_neighbors=5,min_dist=0.3,metric='correlation'))
Test Accuracy: 0.973333333333
time: 2.68 s

@amueller (Owner) commented Jul 14, 2018 via email

@rasbt (Collaborator) commented Jul 14, 2018

Haven't read up on UMAP yet -- heard from an attendee that the recent talk at PyData Ann Arbor was really good: https://www.youtube.com/watch?v=YPJQydzTLwQ&t=521s -- but I think it's meant more as a technique for visualizing training examples (or clusters thereof) in low dimensions rather than as a feature extraction technique (at least if you are not using generalized linear models), in a similar vein to t-SNE? So in that case it would be interesting to add an eval of the random forest on the raw features, like Andreas suggested.

> Set n_estimators to 100 in the random forest and it will be better, and probably better without umap.

In practice, it could come in handy for huge datasets though, as it is already much faster than t-SNE.

(screenshot attached)

@fingoldo

Here we go, guys.

It seems to still be helpful, but now I think I should have used cross_val_score from the beginning, as Isomap's result seems to be a bit of an outlier and affected by the split...


EstimateClassifier(RandomForestClassifier(n_estimators=100))
Test Accuracy: 0.977777777778
time: 346 ms

EstimateClassifier(RandomForestClassifier(n_estimators=100),PCA(n_components=2))
Test Accuracy: 0.98
time: 372 ms

EstimateClassifier(RandomForestClassifier(n_estimators=100),Isomap(n_components=2))
Test Accuracy: 0.977777777778
time: 1.93 s

EstimateClassifier(RandomForestClassifier(n_estimators=100),UMAP(n_components=2,n_neighbors=5,min_dist=0.3,metric='correlation'))
Test Accuracy: 0.986666666667
time: 2.98 s

@fingoldo

Added cross-validation to get a more definitive answer.

import time
import numpy as np
from sklearn.datasets import load_digits
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_selection import VarianceThreshold, SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from umap import UMAP

digits = load_digits()

def EstimateClassifier(model, transformer=None):
    startTime = time.time()
    if transformer:
        # keep all original features ('AsIs') and append the low-dimensional embedding
        pipe = Pipeline([('VarianceThreshold', VarianceThreshold()),
                         ('union', FeatureUnion([('AsIs', SelectKBest(k='all')),
                                                 ('transformer', transformer)])),
                         ('classifier', model)])
    else:
        pipe = Pipeline([('classifier', model)])
    accuracies = cross_val_score(pipe, digits.data, digits.target, cv=10)
    timeElapsed = time.time() - startTime
    print("Model: %s, Transformer: %s, avg.accuracy: %0.3f +- %0.3f, time=%0.3fs" % (type(model).__name__, type(transformer).__name__, np.mean(accuracies), np.std(accuracies), timeElapsed))

for model in (GaussianNB(), RandomForestClassifier(n_estimators=100)):
    for transformer in (None, PCA(n_components=2), Isomap(n_components=2),
                        UMAP(n_components=2, n_neighbors=5, min_dist=0.3, metric='correlation')):
        EstimateClassifier(model, transformer)

Model: GaussianNB, Transformer: NoneType, avg.accuracy: 0.810 +- 0.057, time=0.065s
Model: GaussianNB, Transformer: PCA, avg.accuracy: 0.843 +- 0.051, time=0.198s
Model: GaussianNB, Transformer: Isomap, avg.accuracy: 0.883 +- 0.046, time=16.367s
Model: GaussianNB, Transformer: UMAP, avg.accuracy: 0.921 +- 0.028, time=51.394s
Model: RandomForestClassifier, Transformer: NoneType, avg.accuracy: 0.953 +- 0.020, time=3.750s
Model: RandomForestClassifier, Transformer: PCA, avg.accuracy: 0.948 +- 0.023, time=4.002s
Model: RandomForestClassifier, Transformer: Isomap, avg.accuracy: 0.964 +- 0.017, time=19.914s
Model: RandomForestClassifier, Transformer: UMAP, avg.accuracy: 0.969 +- 0.017, time=55.271s

Do you think the benefit of adding the new features will hold up if we add proper hyperparameter tuning, or will it disappear?
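
For what it's worth, one rough way to check that would be to tune the classifier and the UMAP stage jointly inside the same pipeline (reusing the imports and digits from the snippet above); the parameter grid below is only an illustrative guess:

from sklearn.model_selection import GridSearchCV

pipe = Pipeline([('VarianceThreshold', VarianceThreshold()),
                 ('union', FeatureUnion([('AsIs', SelectKBest(k='all')),
                                         ('transformer', UMAP(n_components=2))])),
                 ('classifier', RandomForestClassifier())])

param_grid = {
    'classifier__n_estimators': [100, 500],
    'union__transformer__n_neighbors': [5, 15, 30],
    'union__transformer__min_dist': [0.1, 0.3],
}

search = GridSearchCV(pipe, param_grid, cv=10)   # slow: UMAP is refit for every candidate and fold
search.fit(digits.data, digits.target)
print(search.best_params_, search.best_score_)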

@rhiever (Contributor, Author) commented Jul 16, 2018

> I think it's meant more as a technique for visualizing training examples (or clusters thereof) in low dimensions rather than as a feature extraction technique

Not so -- UMAP can be used as a visualization technique similar to t-SNE, but it also works fine as a feature construction technique (as shown by @fingoldo). I was going to link the SciPy talk, but it seems you already found it. 👍

@fingoldo, I think your initial explorations show that UMAP can potentially be useful as a feature construction technique. It will have to be evaluated further on more benchmarks, perhaps on PMLB.
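
For anyone who wants to try that, a rough sketch of how such a benchmark might look with the pmlb package; the dataset names below are just examples, and the pipeline mirrors the one used above:

import numpy as np
from pmlb import fetch_data
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline, FeatureUnion
from umap import UMAP

# compare raw features vs. raw features + 2D UMAP embedding on a few PMLB datasets
for dataset in ['spambase', 'satimage', 'twonorm']:
    X, y = fetch_data(dataset, return_X_y=True)
    for use_umap in (False, True):
        steps = [('clf', RandomForestClassifier(n_estimators=100))]
        if use_umap:
            steps.insert(0, ('union', FeatureUnion([('AsIs', SelectKBest(k='all')),
                                                    ('umap', UMAP(n_components=2))])))
        scores = cross_val_score(Pipeline(steps), X, y, cv=5)
        print(dataset, 'with UMAP' if use_umap else 'raw', round(float(np.mean(scores)), 3))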

@fingoldo

@rhiever Randy, will UMAP be included in the tpot pipeline? :-)

@rhiever (Contributor, Author) commented Jul 18, 2018

It's possible!
