Integrating dimensionality reduction into the pipeline #43
Thanks for posting this issue @htcai. To recap for those who weren't at the meetup last night: we have an expression matrix with 7,306 samples (rows) × 20,530 genes (columns). We want to reduce the dimensionality of the genes using a technique such as PCA. However, we were running into memory issues when using the algorithms in scikit-learn.

Tagging @gheimberg, who has experience applying these methods to gene expression datasets. @gheimberg and others, is the best way around the memory issue to perform feature selection before applying feature extraction? A sketch of that approach follows below. |
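For concreteness, here is a minimal sketch of the select-then-extract idea. It assumes the expression matrix lives in a pandas DataFrame named `expression_df`; that name, the file path, and the variance-based ranking are illustrative assumptions, not a settled choice:

```python
import pandas as pd

# Hypothetical: `expression_df` is the 7306 x 20530 expression matrix
# as a pandas DataFrame (samples x genes); loading is sketched only.
# expression_df = pd.read_table('expression.tsv', index_col=0)

n_keep = 5000  # number of genes to retain before feature extraction

# Rank genes by variance across samples and keep the most variable ones;
# variance is one simple unsupervised criterion, not the only option.
top_genes = expression_df.var(axis=0).nlargest(n_keep).index
filtered_df = expression_df[top_genes]

# PCA now factors a 7306 x 5000 matrix rather than 7306 x 20530,
# which shrinks the memory footprint of the decomposition.
```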
Which class has been tried? RandomizedPCA should use less memory than PCA. |
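As a hedged sketch: older scikit-learn releases expose this as the `RandomizedPCA` class, while newer releases fold the same randomized SVD into `PCA` via the `svd_solver` parameter. The component count below is illustrative:

```python
from sklearn.decomposition import PCA

# Randomized SVD approximates only the top components instead of
# computing the full decomposition; 500 components is illustrative.
pca = PCA(n_components=500, svd_solver='randomized', random_state=0)
# X_reduced = pca.fit_transform(X)  # X: samples x genes matrix
```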
@yl565, I don't remember anyone trying RandomizedPCA. So I guess we should compare the performance of classifier pipelines that use the different dimensionality reduction methods. |
Also tagging @nabeelsarwar. |
Great idea! Let people know which one you choose below. So we're all on the same page, make sure you're using the latest data. It may also be nice to print out max memory usage at the end of the script (not sure if this will work on all OSes):

```python
import resource

# Get peak memory usage in kilobytes
# https://docs.python.org/3/library/resource.html#resource.getrusage
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
```
|
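One caveat on the snippet above: the `resource` module is Unix-only, and the units of `ru_maxrss` are platform dependent (kilobytes on Linux, bytes on macOS). A small sketch that normalizes the value; the helper name is made up:

```python
import resource
import sys

def peak_memory_mb():
    """Return this process's peak resident set size in megabytes.

    ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == 'darwin':
        return rss / (1024 ** 2)
    return rss / 1024

print(f'Peak memory usage: {peak_memory_mb():.1f} MB')
```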
I will work on factor analysis. |
I would like to try Linear Discriminant Analysis (LDA) after feature selection. I will look for other commands that can report max memory usage if the one above does not work. Also, maybe we should agree on a uniform number of features: for instance, select 5,000 features and then reduce the dimensionality to 2,000 or 500. |
I suggest 2,000, to keep at least 10% of the original features. This can be fine-tuned with algorithms that report how much information each component retains (see the sketch below). But I think we should err on the side of more features. |
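For PCA specifically, `explained_variance_ratio_` reports the information retained per component, which is one way to fine-tune the count. A minimal sketch, assuming `X_filtered` is the matrix after feature selection (the name and the 90% threshold are hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA

# Fit more components than we expect to keep, then inspect how much
# variance each one captures; 2000 is the upper bound discussed above.
pca = PCA(n_components=2000, svd_solver='randomized', random_state=0)
pca.fit(X_filtered)

# Cumulative variance explained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components that retains 90% of the variance.
k_90 = int(np.searchsorted(cumulative, 0.90)) + 1
print(f'{k_90} components retain 90% of the variance')
```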
I tried PCA; it seems to run on my computer just fine:

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    PCA(n_components=500),
    StandardScaler(),  # Feature scaling
    clf_grid)  # classifier grid search, defined elsewhere
```
|
Peak memory is about 9 GB on Ubuntu. |
@yl565, is it important to standardize before performing PCA? |
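For reference, the common ordering puts standardization before PCA, since principal components are driven by variance and unscaled features can dominate them. A hedged sketch of that ordering, reusing `clf_grid` from the snippet above:

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale each gene to zero mean and unit variance *before* PCA, so that
# high-variance genes do not dominate the principal components.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=500),
    clf_grid)  # classifier grid search, defined elsewhere
```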
Most of these algorithms will also do the whitening and build the covariance matrix for you, or so I thought. |
I believe we only tried factor analysis and LDA. I ran out of memory using factor analysis inside the pipeline, even with the randomized solver. This was without selecting features beforehand. I will try to get some updates on the memory usage before the weekend. |
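For anyone retrying this, a minimal sketch of factor analysis with scikit-learn's randomized SVD solver, applied after feature selection to stay within memory; `X_filtered` and the component count are hypothetical:

```python
from sklearn.decomposition import FactorAnalysis

# Factor analysis with the randomized SVD solver; fitting on the
# feature-selected matrix keeps the working set much smaller than
# the full 20,530-gene matrix.
fa = FactorAnalysis(n_components=500, svd_method='randomized',
                    random_state=0)
# X_factors = fa.fit_transform(X_filtered)
```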
@yl565 really cool analysis. Can you link to the source code? If you just need a quick place to upload a file, you can check out GitHub Gists. Here is my interpretation of your plot: loading the data peaks at ~4.5 GB of memory and stabilizes around 4 GB -- hence 32-bit systems run into a memory error.

According to the sklearn docs, IncrementalPCA processes the data in minibatches, and its results "almost exactly match" those of regular PCA.

Depending on the extent of "almost exactly match", I think a good option is to use PCA/IncrementalPCA if we expect there to be a memory issue. However, it's also worth noting that a peak memory usage of 9 GB can be handled by many systems. Therefore, I still think it makes sense to first try the algorithms without an out-of-core variant. A sketch of the incremental approach follows below. |
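A minimal sketch of the out-of-core route, in case anyone wants to try it; the batch size and component count are illustrative:

```python
from sklearn.decomposition import IncrementalPCA

# IncrementalPCA fits on minibatches, so only batch_size rows of the
# expression matrix need to be decomposed at any one time.
ipca = IncrementalPCA(n_components=500, batch_size=500)
# X_reduced = ipca.fit_transform(X)  # streams over X in minibatches
```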
@yl565 - Are you working with PCA or IncrementalPCA? I started on PCA, but if you are working on it, I can try IncrementalPCA. |
@dhimmel, the source code is here: https://gist.github.com/yl565/caf34bce62cb0fb4fa0c1a26a298e1d6

@mans2singh, I'm not currently working on either. They should produce the same (or very close) results. You could try PCA first and, if you run out of memory, try IncrementalPCA instead. |
It will benefit all of us if dimensionality reduction can be integrated into the pipeline.
Moreover, it seems necessary to place dimensionality reduction after preliminary feature selection (keeping 5,000 features?); otherwise, our computers are likely to run out of memory. A sketch of such a pipeline follows below.
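Putting the pieces of this thread together, a hedged sketch of what the integrated pipeline could look like; the classifier choice and all parameter values are placeholders, not the project's settled configuration:

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Select 5000 genes first so that PCA never sees the full 20,530-gene
# matrix, then extract 500 components for the classifier.
pipeline = make_pipeline(
    SelectKBest(f_classif, k=5000),
    StandardScaler(),
    PCA(n_components=500, svd_solver='randomized', random_state=0),
    SGDClassifier(random_state=0))  # placeholder classifier

# pipeline.fit(X_train, y_train)
# y_pred = pipeline.predict(X_test)
```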