Add explanation for why t-SNE is not a good feature preprocessor for models #78
Well, unsupervised learning will always throw away discriminative information...
I think the point of the exercise was to show that this effect is particular to t-SNE: e.g., if you apply PCA or Isomap, you can often improve -- or at least not negatively affect -- your model accuracy.
Really? I would imagine that PCA down to two dimensions will heavily impact accuracy.
I'd say it really depends on a lot of factors (the model, the dataset, the explained-variance ratio, ...). I can imagine that for small datasets and models that tend to overfit (e.g., k-NN with a small k), it could be really helpful. Or, in more general terms, I think it may be a useful way to improve performance ('curse of dimensionality') as an alternative to feature selection, and/or if you can't regularize your model.
Yeah, it depends on many things, that's true. It is certainly a form of regularization. But reducing the digits to 2D is probably too much, no matter what the method. The point of the exercise was more "don't use manifold learning for supervised tasks". PCA might be helpful in certain situations.
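As a concrete illustration of the "it depends" point above, one can compare a classifier on the raw 64 pixel features of digits against the same classifier behind a PCA step. The snippet below is only a sketch of that kind of check -- the choice of k-NN, the fold count, and the component counts are my own, not anything prescribed in the thread:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)

# Baseline: k-NN on the raw 64 pixel features.
raw_scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
print("raw features:", raw_scores.mean())

# Same classifier behind a PCA step; PCA is re-fit inside each CV fold.
for n_components in (2, 10, 30):
    pipe = make_pipeline(PCA(n_components=n_components),
                         KNeighborsClassifier(n_neighbors=3))
    print(n_components, "components:", cross_val_score(pipe, X, y, cv=5).mean())

Running something like this makes the trade-off visible: very aggressive reductions (two components) tend to cost accuracy, while more moderate ones typically do far less damage.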
Just to explore your suspicions, @amueller:
t-SNE is an order of magnitude worse.
Oh, wow, the minimum expected accuracy would be 10%; that's really, really bad then! But I see that you have an error here. It should be:
Ah, you're right @rasbt, you can't use that there. I updated the code and results above with that fix. PCA actually performs MUCH better than t-SNE now. So why is t-SNE so bad for classification?
Oh yeah, good point ... mentioning one problem and introducing another :P
What's your code now? Going down to two dimensions, I would expect t-SNE to perform better than PCA, but I'd expect both to be worse than not doing anything.
Hm, yeah, I'd also naturally expect t-SNE to perform better on this particular dataset; however, I think the comparison in the code above is not entirely fair. You can't do a separate transform for t-SNE if you fit it on the training data, since scikit-learn's TSNE only offers fit_transform.
@rasbt is right on this one. The reason t-SNE doesn't work here is that it is fit separately on the training data and then on the test data, which causes the clusters to fall in different areas of the two embeddings.
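To make that failure mode concrete, here is a small sketch (my own illustration, not code from the thread): because TSNE has no transform method, the test set has to be embedded with a second, independent fit, and a classifier trained on the first embedding is then evaluated in a coordinate system it has never seen.

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two *independent* t-SNE fits: the train and test embeddings are not aligned,
# since t-SNE coordinates are arbitrary up to rotation, reflection, and scale.
emb_train = TSNE(n_components=2, random_state=0).fit_transform(X_train)
emb_test = TSNE(n_components=2, random_state=0).fit_transform(X_test)

knn = KNeighborsClassifier().fit(emb_train, y_train)
print("accuracy on the mismatched test embedding:", knn.score(emb_test, y_test))

Contrast this with PCA or Isomap, where the projection fitted on the training data can be applied to the test data via transform, so both splits live in the same space.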
Is there absolutely no way to add a pure .transform method to TSNE, like Isomap already has? The 2D separation from t-SNE on the MNIST dataset is much, much better; it's a pity it can't be used as a regular transformer.
There is a way to implement this, I think, but it's not implemented in sklearn right now. Not sure if there's a PR. @fingoldo, you might also be interested in UMAP: https://github.com/lmcinnes/umap
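Unlike TSNE, the umap-learn package does expose a transform method, so it can be fitted on the training split and applied to new data like any other scikit-learn transformer. A minimal sketch (the dataset and parameters here are just placeholders I picked for illustration):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import umap

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit on the training data only...
reducer = umap.UMAP(n_components=2, random_state=0).fit(X_train)

# ...then embed both splits into the *same* learned space.
emb_train = reducer.transform(X_train)
emb_test = reducer.transform(X_test)
print(emb_train.shape, emb_test.shape)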
Thank you so much Andreas for this great suggestion!
Set n_estimators to 100 in the random forest and it will be better, and probably better without UMAP.
On Sat, Jul 14, 2018, 17:10 fingoldo wrote:
Thank you so much Andreas for this great suggestion! Features added by UMAP proved to be useful indeed :-) Quick & dirty assessment:
from datetime import datetime
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import Isomap
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from umap import UMAP

digits = load_digits()
# x_train/x_test were defined elsewhere in the original notebook; a plain split is assumed here
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=0)

def EstimateClassifier(model, transformer=None):
    startTime = datetime.now()
    if transformer:
        # fit the transformer on the training data only, then append its output as extra features
        transformer.fit(x_train)
        dp = transformer.transform(x_train)
        x_train_new = np.concatenate((x_train, dp), axis=1)
        dp = transformer.transform(x_test)
        x_test_new = np.concatenate((x_test, dp), axis=1)
    else:
        x_train_new, x_test_new = x_train, x_test
    model.fit(x_train_new, y_train)
    timeElapsed = datetime.now() - startTime  # computed but not printed in the original snippet
    print("Test Accuracy: %s" % (accuracy_score(y_test, model.predict(x_test_new))))
EstimateClassifier(GaussianNB())
Test Accuracy: 0.833333333333
time: 7.51 ms
EstimateClassifier(GaussianNB(),PCA(n_components=2))
Test Accuracy: 0.855555555556
time: 17.5 ms
EstimateClassifier(GaussianNB(),Isomap(n_components=2))
Test Accuracy: 0.893333333333
time: 1.5 s
EstimateClassifier(GaussianNB(),UMAP(n_components=2,n_neighbors=5,min_dist=0.3,metric='correlation'))
Test Accuracy: 0.917777777778
time: 2.67 s
EstimateClassifier(RandomForestClassifier())
Test Accuracy: 0.951111111111
time: 38.5 ms
EstimateClassifier(RandomForestClassifier(),PCA(n_components=2))
Test Accuracy: 0.935555555556
time: 49.5 ms
EstimateClassifier(RandomForestClassifier(),Isomap(n_components=2))
Test Accuracy: 0.971111111111
time: 1.53 s
EstimateClassifier(RandomForestClassifier(),UMAP(n_components=2,n_neighbors=5,min_dist=0.3,metric='correlation'))
Test Accuracy: 0.973333333333
time: 2.68 s
Haven't read up on UMAP yet -- I heard from an attendee that the recent talk at PyData Ann Arbor was really good: https://www.youtube.com/watch?v=YPJQydzTLwQ&t=521s -- but I think it's meant more as a technique for visualizing training examples (or clusters thereof) in low dimensions than as a feature extraction technique in the same vein as t-SNE (unless, maybe, you are not using generalized linear models)? In that case, it would be interesting to add an evaluation of the random forest on the raw features, like Andreas suggested.
In practice, it could come in handy for huge datasets though, as it is already much faster than t-SNE.
Here we go, guys. It still seems to be helpful, but now I think I should have used cross_val_score from the beginning, as Isomap's result looks a bit out of the picture and affected by the split...
Added cross-validation to get a more definitive answer.
Do you think the benefit of adding the new features will hold once we add proper hyperparameter tuning, or will it be negated?
Not so -- UMAP can be used as a visualization technique similar to t-SNE, but it also works fine as a feature construction technique (as shown by @fingoldo). I was going to link the SciPy talk, but it seems you already found it. 👍
@fingoldo, I think your initial explorations show that UMAP can potentially be useful as a feature construction technique. It will have to be evaluated further on more benchmarks, perhaps on PMLB.
@rhiever Randy, will UMAP be included in the TPOT pipeline? :-)
It's possible!
Notebook 22 makes a really important point that t-SNE is only for visualization, yet doesn't explicitly explain why that is the case. We should add a brief explanation for why that is.