
Add explanation for why t-SNE is not a good feature preprocessor for models #78

Closed
rhiever opened this issue Oct 30, 2016 · 22 comments

@rhiever (Contributor) commented Oct 30, 2016

Notebook 22 makes a really important point that t-SNE is only for visualization, yet doesn't explicitly explain why that is the case. We should add a brief explanation for why that is.

@rhiever changed the title from "Add explanation for why t-SNE is not a good feature preprocess for models" to "Add explanation for why t-SNE is not a good feature preprocessor for models" on Oct 30, 2016
@amueller (Owner)

Well, unsupervised learning will always throw away discriminative information...

@rhiever (Contributor, Author) commented Oct 31, 2016

I think the point of the exercise was to show that this effect is particular to t-SNE. E.g., if you apply PCA or Isomap, you can oftentimes improve, or at least not hurt, your model accuracy.

@amueller (Owner) commented Nov 2, 2016

Really? I would imagine that PCA down to two dimensions will heavily impact accuracy.

@rasbt (Collaborator) commented Nov 2, 2016

> Really? I would imagine that PCA down to two dimensions will heavily impact accuracy.

I'd say it really depends on a lot of factors (the model, the dataset, the explained-variance ratio, ...). I can imagine that for small datasets and models that tend to overfit (e.g., k-NN with a small k), it could be really helpful. More generally, I think it can be a useful way to improve performance (think 'curse of dimensionality') as an alternative to feature selection, and/or if you can't regularize your model.

@amueller (Owner) commented Nov 2, 2016

Yeah it depends on many things, that's true. It is certainly a form of regularization. But reducing digits to 2d is probably too much, no matter what method. The point of the exercise was more "don't use manifold learning for supervised tasks". PCA might be helpful in certain situations.
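
To make the "2d is probably too much" point concrete, here is a minimal sketch (not from the notebook) that sweeps the number of PCA components with a k-NN classifier on digits; the component counts and random_state are arbitrary choices:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=1)

# fit PCA on the training data only, then transform both splits
for n_components in (2, 5, 10, 20, 40):
    pca = PCA(n_components=n_components).fit(X_train)
    clf = KNeighborsClassifier().fit(pca.transform(X_train), y_train)
    print(n_components, clf.score(pca.transform(X_test), y_test))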

@rhiever (Contributor, Author) commented Nov 2, 2016

Just to explore your suspicions, @amueller:

from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits

digits = load_digits()

X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=1)

# baseline: k-NN on the raw 64-dimensional pixel features
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
print('KNeighborsClassifier accuracy: {}'.format(clf.score(X_test, y_test)))

# PCA: fit on the training data, reuse the same projection for the test data
pca = PCA(n_components=2)
digits_pca_train = pca.fit_transform(X_train)
digits_pca_test = pca.transform(X_test)

clf = KNeighborsClassifier()
clf.fit(digits_pca_train, y_train)
print('KNeighborsClassifier accuracy with PCA: {}'.format(clf.score(digits_pca_test, y_test)))

# t-SNE: no transform method, so the test set is embedded with a separate fit_transform
tsne = TSNE(random_state=42)
digits_tsne_train = tsne.fit_transform(X_train)
digits_tsne_test = tsne.fit_transform(X_test)

clf = KNeighborsClassifier()
clf.fit(digits_tsne_train, y_train)
print('KNeighborsClassifier accuracy with t-SNE: {}'.format(clf.score(digits_tsne_test, y_test)))
KNeighborsClassifier accuracy: 0.9933333333333333
KNeighborsClassifier accuracy with PCA: 0.6266666666666667
KNeighborsClassifier accuracy with t-SNE: 0.0022222222222222222

t-SNE is orders of magnitude worse.

@rasbt (Collaborator) commented Nov 2, 2016

Oh, wow, the minimum expected accuracy (random guessing) would be 10%; that's really, really bad then! But I see that you have an error here:

pca.fit_transform(X_test)
digits_tsne_test = tsne.fit_transform(X_test)

It should be

pca.transform(X_test) and digits_tsne_test = tsne.transform(X_test)
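
For reference, a minimal sketch of the fit-on-train / transform-both pattern being suggested here, reusing X_train and X_test from the snippet above; note that, as pointed out below, scikit-learn's TSNE does not actually provide a transform method, so this pattern only works for PCA:

from sklearn.decomposition import PCA

# learn the projection on the training split only
pca = PCA(n_components=2)
digits_pca_train = pca.fit_transform(X_train)
digits_pca_test = pca.transform(X_test)   # reuse the fitted projection on the test split

# TSNE has no transform(); calling tsne.transform(X_test) raises an AttributeError
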

@rhiever (Contributor, Author) commented Nov 3, 2016

Ah, you're right @rasbt. You can't use the transform method on TSNE, and I had just copied and pasted from the t-SNE code. :-)

I updated the code and results above with that fix. PCA actually performs MUCH better than t-SNE now. So why is t-SNE so bad for classification?

@rasbt (Collaborator) commented Nov 3, 2016

Oh yeah, good point ... mentioning one problem and introducing another :P

@amueller (Owner) commented Nov 8, 2016

What's your code now? Going down to two dimensions I would expect t-SNE to perform better than PCA, but I'd expect both to be worse than not doing anything.

@rasbt (Collaborator) commented Nov 9, 2016

> What's your code now? Going down to two dimensions I would expect t-SNE to perform better than PCA, but I'd expect both to be worse than not doing anything.

Hm, yeah, I'd also naturally expect t-SNE to perform better on this particular dataset. However, I think the comparison in the code above is not entirely fair. You can't do a fit_transform separately on training and test data, since the embedding depends on the order of the samples, right? I.e., the "position" of the "clusters" is arbitrary, isn't it? I think you would at least need to use something like

adjusted_rand_score(y_predict, y_test)
print('KNeighborsClassifier adjusted Rand score with t-SNE: {}'.format(adjusted_rand_score(y_predict, y_test)))

for t-SNE if you fit_transform train and test data separately.
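
For reference, a sketch of what that evaluation might look like, reusing X_train, X_test, y_train, and y_test from the code above; the variable names here are just illustrative:

from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score
from sklearn.neighbors import KNeighborsClassifier

# each fit_transform produces its own embedding with an arbitrary orientation
digits_tsne_train = TSNE(random_state=42).fit_transform(X_train)
digits_tsne_test = TSNE(random_state=42).fit_transform(X_test)

clf = KNeighborsClassifier()
clf.fit(digits_tsne_train, y_train)
y_predict = clf.predict(digits_tsne_test)

# plain accuracy is meaningless here because the two embeddings are not aligned;
# the adjusted Rand index only compares the grouping, not the label identities
print('Adjusted Rand score with t-SNE: {}'.format(adjusted_rand_score(y_test, y_predict)))
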

@rhiever (Contributor, Author) commented May 4, 2017

@rasbt is right on this one. The reason t-SNE doesn't work here is that it is fit separately on the training data and then on the testing data, which causes the clusters to fall in different areas of the embedding.

@rhiever closed this as completed May 4, 2017
@fingoldo

Is there really no way to add a pure .transform method to TSNE, like Isomap already has? In 2D, t-SNE's separation of the MNIST dataset is much, much better; a pity it can't be used as a regular transformer...

@amueller (Owner)

There is a way to implement this, I think, but it's not implemented in sklearn right now. Not sure if there's a PR. @fingoldo you might also be interested in UMAP: https://github.com/lmcinnes/umap
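
One possible workaround (just a sketch, not a scikit-learn feature): fit t-SNE on the training data only, then learn a separate regression model that maps the original feature space onto the training embedding and use it to project new points. The regressor choice here is an arbitrary example and the result is only an approximation:

from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsRegressor

tsne = TSNE(random_state=42)
train_embedding = tsne.fit_transform(X_train)    # embedding of the training data only

# approximate an out-of-sample "transform" by regressing features -> embedding
mapper = KNeighborsRegressor(n_neighbors=5)
mapper.fit(X_train, train_embedding)
test_embedding = mapper.predict(X_test)          # test points land in the same coordinate system
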

@fingoldo

Thank you so much, Andreas, for this great suggestion!
Features added by UMAP proved to be useful indeed :-) Quick & dirty assessment:


import numpy as np
from datetime import datetime
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from umap import UMAP

digits = load_digits()
# train/test split not shown in the original snippet; random_state is an arbitrary choice
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=1)

def EstimateClassifier(model, transformer=None):
    startTime = datetime.now()
    if transformer:
        transformer.fit(x_train)

        # append the low-dimensional embedding as extra columns next to the original features
        dp = transformer.transform(x_train)
        x_train_new = np.concatenate((x_train, dp), axis=1)

        dp = transformer.transform(x_test)
        x_test_new = np.concatenate((x_test, dp), axis=1)
    else:
        x_train_new, x_test_new = x_train, x_test
    model.fit(x_train_new, y_train)
    timeElapsed = datetime.now() - startTime
    print("Test Accuracy: %s" % (accuracy_score(y_test, model.predict(x_test_new))))
    print("time:", timeElapsed)
EstimateClassifier(GaussianNB())
Test Accuracy: 0.833333333333
time: 7.51 ms

EstimateClassifier(GaussianNB(),PCA(n_components=2))
Test Accuracy: 0.855555555556
time: 17.5 ms

EstimateClassifier(GaussianNB(),Isomap(n_components=2))
Test Accuracy: 0.893333333333
time: 1.5 s

EstimateClassifier(GaussianNB(),UMAP(n_components=2,n_neighbors=5,min_dist=0.3,metric='correlation'))
Test Accuracy: 0.917777777778
time: 2.67 s

EstimateClassifier(RandomForestClassifier())
Test Accuracy: 0.951111111111
time: 38.5 ms

EstimateClassifier(RandomForestClassifier(),PCA(n_components=2))
Test Accuracy: 0.935555555556
time: 49.5 ms

EstimateClassifier(RandomForestClassifier(),Isomap(n_components=2))
Test Accuracy: 0.971111111111
time: 1.53 s

EstimateClassifier(RandomForestClassifier(),UMAP(n_components=2,n_neighbors=5,min_dist=0.3,metric='correlation'))
Test Accuracy: 0.973333333333
time: 2.68 s

@amueller (Owner) commented Jul 14, 2018 via email

@rasbt (Collaborator) commented Jul 14, 2018

Haven't read up on UMAP yet -- heard from an attendee that the recent talk at PyData Ann Arbor was really good: https://www.youtube.com/watch?v=YPJQydzTLwQ&t=521s -- but I think it's meant more as a technique for visualizing training examples (or clusters thereof) in low dimensions rather than as a feature extraction technique (at least if you are not using generalized linear models), in a similar vein to t-SNE? So in that case it would be interesting to add an eval of the random forest on the raw features, like Andreas suggested.

> Set n_estimators to 100 in the random forest and it will be better, and probably better without umap.

In practice, it could come in handy for huge datasets though, as it is already much faster than t-SNE.

(screenshot attached)

@fingoldo

Here we go, guys.

It seems to still be helpful, but now I think I should have used cross_val_score from the beginning, as Isomap's result seems to be a bit of an outlier and affected by the split...


EstimateClassifier(RandomForestClassifier(n_estimators=100))
Test Accuracy: 0.977777777778
time: 346 ms

EstimateClassifier(RandomForestClassifier(n_estimators=100),PCA(n_components=2))
Test Accuracy: 0.98
time: 372 ms

EstimateClassifier(RandomForestClassifier(n_estimators=100),Isomap(n_components=2))
Test Accuracy: 0.977777777778
time: 1.93 s

EstimateClassifier(RandomForestClassifier(n_estimators=100),UMAP(n_components=2,n_neighbors=5,min_dist=0.3,metric='correlation'))
Test Accuracy: 0.986666666667
time: 2.98 s

@fingoldo

Added cross-validation to get a more definitive answer.

import time
import numpy as np
from sklearn.datasets import load_digits
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_selection import VarianceThreshold, SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from umap import UMAP

digits = load_digits()

def EstimateClassifier(model, transformer=None):
    startTime = time.time()
    if transformer:
        # keep all original features ('AsIs') and append the low-dimensional embedding
        pipe = Pipeline([('VarianceThreshold', VarianceThreshold()),
                         ('union', FeatureUnion([('AsIs', SelectKBest(k='all')),
                                                 ('transformer', transformer)])),
                         ('classifier', model)])
    else:
        pipe = Pipeline([('classifier', model)])
    accuracies = cross_val_score(pipe, digits.data, digits.target, cv=10)
    timeElapsed = time.time() - startTime
    print("Model: %s, Transformer: %s, avg.accuracy: %0.3f +- %0.3f, time=%0.3fs" % (type(model).__name__, type(transformer).__name__, np.mean(accuracies), np.std(accuracies), timeElapsed))

for model in (GaussianNB(), RandomForestClassifier(n_estimators=100)):
    for transformer in (None, PCA(n_components=2), Isomap(n_components=2),
                        UMAP(n_components=2, n_neighbors=5, min_dist=0.3, metric='correlation')):
        EstimateClassifier(model, transformer)

Model: GaussianNB, Transformer: NoneType, avg.accuracy: 0.810 +- 0.057, time=0.065s
Model: GaussianNB, Transformer: PCA, avg.accuracy: 0.843 +- 0.051, time=0.198s
Model: GaussianNB, Transformer: Isomap, avg.accuracy: 0.883 +- 0.046, time=16.367s
Model: GaussianNB, Transformer: UMAP, avg.accuracy: 0.921 +- 0.028, time=51.394s
Model: RandomForestClassifier, Transformer: NoneType, avg.accuracy: 0.953 +- 0.020, time=3.750s
Model: RandomForestClassifier, Transformer: PCA, avg.accuracy: 0.948 +- 0.023, time=4.002s
Model: RandomForestClassifier, Transformer: Isomap, avg.accuracy: 0.964 +- 0.017, time=19.914s
Model: RandomForestClassifier, Transformer: UMAP, avg.accuracy: 0.969 +- 0.017, time=55.271s

Do you think the benefit of adding the new features will hold up if we add proper hyperparameter tuning, or will it disappear?
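
For what it's worth, one rough way to check that would be to tune the classifier and the UMAP stage jointly inside the same pipeline (reusing the imports and digits from the snippet above); the parameter grid below is only an illustrative guess:

from sklearn.model_selection import GridSearchCV

pipe = Pipeline([('VarianceThreshold', VarianceThreshold()),
                 ('union', FeatureUnion([('AsIs', SelectKBest(k='all')),
                                         ('transformer', UMAP(n_components=2))])),
                 ('classifier', RandomForestClassifier())])

param_grid = {
    'classifier__n_estimators': [100, 500],
    'union__transformer__n_neighbors': [5, 15, 30],
    'union__transformer__min_dist': [0.1, 0.3],
}

search = GridSearchCV(pipe, param_grid, cv=10)   # slow: UMAP is refit for every candidate and fold
search.fit(digits.data, digits.target)
print(search.best_params_, search.best_score_)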

@rhiever (Contributor, Author) commented Jul 16, 2018

> I think it's meant more as a technique for visualizing training examples (or clusters thereof) in low dimensions rather than as a feature extraction technique

Not so -- UMAP can be used as a visualization technique similar to t-SNE, but it also works fine as a feature construction technique (as shown by @fingoldo). I was going to link the SciPy talk, but it seems you already found it. 👍

@fingoldo, I think your initial explorations show that UMAP can potentially be useful as a feature construction technique. It will have to be evaluated further on more benchmarks, perhaps on PMLB.
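
For anyone who wants to try that, a rough sketch of how such a benchmark might look with the pmlb package; the dataset names below are just examples, and the pipeline mirrors the one used above:

import numpy as np
from pmlb import fetch_data
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline, FeatureUnion
from umap import UMAP

# compare raw features vs. raw features + 2D UMAP embedding on a few PMLB datasets
for dataset in ['spambase', 'satimage', 'twonorm']:
    X, y = fetch_data(dataset, return_X_y=True)
    for use_umap in (False, True):
        steps = [('clf', RandomForestClassifier(n_estimators=100))]
        if use_umap:
            steps.insert(0, ('union', FeatureUnion([('AsIs', SelectKBest(k='all')),
                                                    ('umap', UMAP(n_components=2))])))
        scores = cross_val_score(Pipeline(steps), X, y, cv=5)
        print(dataset, 'with UMAP' if use_umap else 'raw', round(float(np.mean(scores)), 3))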

@fingoldo

@rhiever Randy, will UMAP be included in the tpot pipeline? :-)

@rhiever (Contributor, Author) commented Jul 18, 2018

It's possible!
