
Integrating dimensionality reduction into the pipeline #43

Closed
htcai opened this issue Sep 7, 2016 · 18 comments

Comments

@htcai
Member

htcai commented Sep 7, 2016

It would benefit all of us if dimensionality reduction could be integrated into the pipeline.

Moreover, it seems necessary to place dimensionality reduction after preliminary feature selection (keeping 5,000 features?); otherwise, our computers are likely to run out of memory.

@dhimmel
Member

dhimmel commented Sep 7, 2016

Thanks for posting this issue @htcai.

To recap for those who weren't at the meetup last night: we have an expression matrix with 7,306 samples (rows) × 20,530 genes (columns). We want to reduce the dimensionality of the genes, using a technique such as PCA. However, we were running into memory issues when using the algorithms in sklearn.

Tagging @gheimberg who has experience with applying these methods to gene expression datasets. @gheimberg and others, is the best solution to reduce the memory issue to perform feature selection before applying feature extraction?

@yl565
Contributor

yl565 commented Sep 7, 2016

Which class has been tried? RandomizedPCA should use less memory than PCA.

@dhimmel
Member

dhimmel commented Sep 7, 2016

@yl565, I don't remember anyone trying sklearn.decomposition.RandomizedPCA, which looks like it's designed to solve this problem. For reference, sklearn cites the following two studies: Halko et al 2011 and Martinsson et al 2011.

So I guess we should compare the performance of classifier pipelines which use:

  1. an approximate decomposition
  2. feature selection followed by an exact decomposition
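
For concreteness, a rough sketch of what those two variants could look like in sklearn (assuming sklearn >= 0.18, where PCA(svd_solver='randomized') supersedes RandomizedPCA; the component counts, k, and the SGDClassifier placeholder are illustrative, not decisions):

from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier

# 1. Approximate (randomized) decomposition on all 20,530 genes
approx_pipeline = make_pipeline(
    PCA(n_components=500, svd_solver='randomized'),
    SGDClassifier(),
)

# 2. Feature selection first, then an exact (full SVD) decomposition
exact_pipeline = make_pipeline(
    SelectKBest(k=5000),
    PCA(n_components=500, svd_solver='full'),
    SGDClassifier(),
)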

@dhimmel
Member

dhimmel commented Sep 7, 2016

Also tagging @nabeelsarwar.

@htcai
Member Author

htcai commented Sep 7, 2016

@dhimmel @yl565 Thank you for the references and discussion! Should each of us claim a dimensionality reduction algorithm (including a choice between 1 and 2)?

@dhimmel
Member

dhimmel commented Sep 7, 2016

Should each of us claim a dimensionality reduction algorithm (including a choice between 1 and 2)?

Great idea! Let people know which one you choose below. So we're all on the same page, make sure you're using the latest data retrieved by 1.download.ipynb. I recommend starting with algorithms/SGDClassifier-master.ipynb.

It may also be nice to print out max memory usage at the end of the script (not sure if this will work on all OSes):

import resource

# Peak memory usage of this process (kilobytes on Linux, bytes on macOS)
# https://docs.python.org/3/library/resource.html#resource.getrusage
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

@beelze-b
Contributor

beelze-b commented Sep 7, 2016

I will work on factor analysis.

@htcai
Member Author

htcai commented Sep 7, 2016

I would like to try Linear Discriminant Analysis (LDA) after feature selection. If the command above does not work, I will look for other ways to report max memory usage.

Also, maybe we should agree on a uniform number of features. For instance, we select 5,000 features and then reduce the dimensionality to 2,000 or 500.
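
A minimal sketch of that ordering (purely illustrative; note that sklearn's LinearDiscriminantAnalysis is supervised and yields at most n_classes - 1 components, so for a binary target it won't reach 500, let alone 2,000):

from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import SGDClassifier

lda_pipeline = make_pipeline(
    SelectKBest(k=5000),           # preliminary feature selection
    LinearDiscriminantAnalysis(),  # supervised projection, at most n_classes - 1 dimensions
    SGDClassifier(),               # placeholder downstream classifier
)
# lda_pipeline.fit(X_train, y_train)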

@beelze-b
Contributor

beelze-b commented Sep 7, 2016

I suggest 2,000, to keep at least 10% of the features. This can be fine-tuned with algorithms that report the information contained in each component, but I think we should err on the side of more features.

@yl565
Contributor

yl565 commented Sep 7, 2016

I tried PCA; it seems to run on my computer just fine:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipeline = make_pipeline(
    PCA(n_components=500),
    StandardScaler(),  # Feature scaling
    clf_grid)          # classifier grid defined elsewhere in the notebook

@yl565
Contributor

yl565 commented Sep 7, 2016

Peak memory is about 9 GB on Ubuntu.

@dhimmel
Member

dhimmel commented Sep 7, 2016

@yl565 is it important to standardize before performing PCA?

IncrementalPCA may also circumvent the memory issues.
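
For example, a sketch with scaling placed before the decomposition and IncrementalPCA as a drop-in for PCA (n_components and batch_size are placeholders):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import IncrementalPCA

reducer = make_pipeline(
    StandardScaler(),                                   # standardize each gene first
    IncrementalPCA(n_components=500, batch_size=1000),  # fits in minibatches to cap memory
)
# X_reduced = reducer.fit_transform(X)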

@beelze-b
Contributor

beelze-b commented Sep 7, 2016

Most of these algorithms will also do the whitening and build the covariance matrix for you, or so I thought.

@yl565
Contributor

yl565 commented Sep 7, 2016

From the PCA source code, it seems the data is demeaned but not standardized. Standardizing may help improve classification performance.

The figure below shows the memory cost of the three algorithms; either RandomizedPCA or IncrementalPCA (I used n_batch=1000) should work fine. All three produce a classification test AUROC of 0.93. To minimize memory cost, IncrementalPCA is better, at the cost of longer computation time.
[Figure: memory usage over time while fitting PCA, RandomizedPCA, and IncrementalPCA]

@beelze-b
Contributor

beelze-b commented Sep 7, 2016

I believe we have only tried Factor Analysis and LDA so far. I ran out of memory using FactorAnalysis inside the pipeline with the randomized SVD method. This was without selecting features beforehand. I will try to get some updates on the memory usage before the weekend.
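
In case it's useful, a hypothetical sketch of one way to retry that, selecting features before FactorAnalysis and keeping its randomized SVD method (k and n_components are placeholders):

from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import FactorAnalysis

fa_reducer = make_pipeline(
    SelectKBest(k=5000),                                        # shrink the gene set first
    FactorAnalysis(n_components=500, svd_method='randomized'),  # randomized SVD inside FA
)
# X_reduced = fa_reducer.fit_transform(X, y)  # y is needed by SelectKBest's default f_classif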

@dhimmel
Member

dhimmel commented Sep 8, 2016

@yl565 really cool analysis. Can you link to the source code? If you just need a quick place to upload a file, you can check out GitHub Gists.

So here is my interpretation of your plot. It looks like loading the data peaks at ~4.5 GB of memory and stabilizes around 4 GB -- hence 32-bit systems run into a memory error. PCA appears to require an additional ~5 GB of memory, RandomizedPCA ~2 GB, and IncrementalPCA ~1.8 GB.

PCA and RandomizedPCA took about the same runtime, while IncrementalPCA took ~30% longer.

According to the sklearn docs:

The IncrementalPCA object uses a different form of processing and allows for partial computations which almost exactly match the results of PCA while processing the data in a minibatch fashion.

Depending on the extent of "almost exactly match", I think a good option is to use PCA, falling back to IncrementalPCA when we expect a memory issue. However, it's also worth noting that the peak memory usage of 9 GB can be handled by many systems. Therefore, I still think it makes sense to try algorithms without an out-of-core (partial_fit) implementation.
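
For reference, the out-of-core route with IncrementalPCA.partial_fit might look like this (a sketch; X is assumed to be the expression matrix as a NumPy array, and the chunking is arbitrary):

import numpy as np
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=500)

# partial_fit sees one chunk at a time; in a true out-of-core setting
# each chunk would be read from disk instead of split from an in-memory array
for chunk in np.array_split(X, 10):
    ipca.partial_fit(chunk)

X_reduced = ipca.transform(X)  # transform can also be applied chunk-wise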

@mans2singh
Contributor

@yl565 - Are you working with PCA or IncrementalPCA? I started on PCA, but if you are working on it, I can try IncrementalPCA.

@yl565
Contributor

yl565 commented Sep 14, 2016

@dhimmel here is the source code: https://gist.github.com/yl565/caf34bce62cb0fb4fa0c1a26a298e1d6
Use memory_profiler to run it from the command line:
mprof run test_PCA_peak_memory.py
mprof plot
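
As an aside, memory_profiler can also be called programmatically; a sketch, assuming pipeline, X, and y are already defined:

from memory_profiler import memory_usage

# Peak memory (in MiB) while fitting the pipeline
peak = memory_usage((pipeline.fit, (X, y)), max_usage=True)
print(peak)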

@mans2singh I'm not currently working on either. They should produce the same (or very close) results. You could try PCA first and, if it runs out of memory, try IncrementalPCA instead.
