
Integrating dimensionality reduction into the pipeline #43

Closed
htcai opened this issue Sep 7, 2016 · 18 comments

Comments

@htcai
Member

htcai commented Sep 7, 2016

It would benefit all of us if dimensionality reduction could be integrated into the pipeline.

Moreover, it seems necessary to place dimensionality reduction after preliminary feature selection (keeping 5,000 features?); otherwise, our computers are likely to run out of memory.

@dhimmel
Member

dhimmel commented Sep 7, 2016

Thanks for posting this issue @htcai.

To recap for those who weren't at the meetup last night: we have an expression matrix with 7,306 samples (rows) × 20,530 genes (columns). We want to reduce the dimensionality of the genes, using a technique such as PCA. However, we were running into memory issues when using the algorithms in sklearn.

Tagging @gheimberg who has experience with applying these methods to gene expression datasets. @gheimberg and others, is the best solution to reduce the memory issue to perform feature selection before applying feature extraction?

@yl565
Contributor

yl565 commented Sep 7, 2016

Which class has been tried? RandomizedPCA should use less memory than PCA.

@dhimmel
Member

dhimmel commented Sep 7, 2016

@yl565, I don't remember anyone trying sklearn.decomposition.RandomizedPCA, which looks like it's designed to solve this problem. For reference, sklearn cites the following two studies: Halko et al 2011 and Martinsson et al 2011.

So I guess we should compare the performance of classifier pipelines which use:

  1. an approximate decomposition
  2. feature selection followed by an exact decomposition
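
For concreteness, a rough sketch of what those two variants could look like in sklearn (assuming sklearn >= 0.18, where PCA(svd_solver='randomized') supersedes RandomizedPCA; the component counts, k, and the SGDClassifier placeholder are illustrative, not decisions):

from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier

# 1. Approximate (randomized) decomposition on all 20,530 genes
approx_pipeline = make_pipeline(
    PCA(n_components=500, svd_solver='randomized'),
    SGDClassifier(),
)

# 2. Feature selection first, then an exact (full SVD) decomposition
exact_pipeline = make_pipeline(
    SelectKBest(k=5000),
    PCA(n_components=500, svd_solver='full'),
    SGDClassifier(),
)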

@dhimmel
Member

dhimmel commented Sep 7, 2016

Also tagging @nabeelsarwar.

@htcai
Member Author

htcai commented Sep 7, 2016

@dhimmel @yl565 Thank you for the references and discussion! Should each of us claim a dimensionality reduction algorithm (including a choice between 1 and 2)?

@dhimmel
Member

dhimmel commented Sep 7, 2016

Should each of us claim a dimensionality reduction algorithm (including a choice between 1 and 2)?

Great idea! Let people know which one you choose below. So we're all on the same page, make sure you're using the latest data retrieved by 1.download.ipynb. I recommend starting with algorithms/SGDClassifier-master.ipynb.

It may also be nice to print out max memory usage at the end of the script (not sure if this will work on all OSes):

import resource

# Peak memory usage of this process (kilobytes on Linux, bytes on macOS)
# https://docs.python.org/3/library/resource.html#resource.getrusage
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

@beelze-b
Contributor

beelze-b commented Sep 7, 2016

I will work on factor analysis.

@htcai
Member Author

htcai commented Sep 7, 2016

I would like to try Linear Discriminant Analysis (LDA) after feature selection. If the command above does not work, I will look for other ways to report max memory usage.

Also, maybe we should agree on a uniform number of features. For instance, we select 5,000 features and then reduce the dimensionality to 2,000 or 500.
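
A minimal sketch of that ordering (purely illustrative; note that sklearn's LinearDiscriminantAnalysis is supervised and yields at most n_classes - 1 components, so for a binary target it won't reach 500, let alone 2,000):

from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import SGDClassifier

lda_pipeline = make_pipeline(
    SelectKBest(k=5000),           # preliminary feature selection
    LinearDiscriminantAnalysis(),  # supervised projection, at most n_classes - 1 dimensions
    SGDClassifier(),               # placeholder downstream classifier
)
# lda_pipeline.fit(X_train, y_train)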

@beelze-b
Contributor

beelze-b commented Sep 7, 2016

I suggest 2,000, to keep at least 10% of the features. This can be fine-tuned with algorithms that report the information contained in each component, but I think we should err on the side of more features.

@yl565
Contributor

yl565 commented Sep 7, 2016

I tried PCA; it seems to run on my computer just fine:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipeline = make_pipeline(
    PCA(n_components=500),
    StandardScaler(),  # Feature scaling
    clf_grid)          # classifier grid defined elsewhere in the notebook

@yl565
Contributor

yl565 commented Sep 7, 2016

Peak memory is about 9 GB on Ubuntu.

@dhimmel
Member

dhimmel commented Sep 7, 2016

@yl565 is it important to standardize before performing PCA?

IncrementalPCA may also circumvent the memory issues.
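
For example, a sketch with scaling placed before the decomposition and IncrementalPCA as a drop-in for PCA (n_components and batch_size are placeholders):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import IncrementalPCA

reducer = make_pipeline(
    StandardScaler(),                                   # standardize each gene first
    IncrementalPCA(n_components=500, batch_size=1000),  # fits in minibatches to cap memory
)
# X_reduced = reducer.fit_transform(X)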

@beelze-b
Contributor

beelze-b commented Sep 7, 2016

Most of these algorithms will also do the whitening and build the covariance matrix for you, or so I thought.

@yl565
Contributor

yl565 commented Sep 7, 2016

From the PCA source code, it seems the data is demeaned but not standardized. Standardizing may help improve classification performance.

The figure below shows the memory cost of the three algorithms; either RandomizedPCA or IncrementalPCA (I used n_batch=1000) should work fine. All three produce a classification test AUROC of 0.93. To minimize memory cost, IncrementalPCA is better, at the cost of longer computation time.
[Figure: memory usage over time while fitting PCA, RandomizedPCA, and IncrementalPCA]

@beelze-b
Contributor

beelze-b commented Sep 7, 2016

I believe we have only tried Factor Analysis and LDA so far. I ran out of memory using FactorAnalysis inside the pipeline with the randomized SVD method. This was without selecting features beforehand. I will try to get some updates on the memory usage before the weekend.
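
In case it's useful, a hypothetical sketch of one way to retry that, selecting features before FactorAnalysis and keeping its randomized SVD method (k and n_components are placeholders):

from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import FactorAnalysis

fa_reducer = make_pipeline(
    SelectKBest(k=5000),                                        # shrink the gene set first
    FactorAnalysis(n_components=500, svd_method='randomized'),  # randomized SVD inside FA
)
# X_reduced = fa_reducer.fit_transform(X, y)  # y is needed by SelectKBest's default f_classif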

@dhimmel
Member

dhimmel commented Sep 8, 2016

@yl565 really cool analysis. Can you link to the source code? If you just need a quick place to upload a file, you can check out GitHub Gists.

So here is my interpretation of your plot. It looks like loading the data peaks at ~4.5 GB of memory and stabilizes around 4 GB -- hence 32-bit systems run into a memory error. PCA appears to require an additional ~5 GB of memory, RandomizedPCA ~2 GB, and IncrementalPCA ~1.8 GB.

PCA and RandomizedPCA took about the same runtime, while IncrementalPCA took ~30% longer.

According to the sklearn docs:

The IncrementalPCA object uses a different form of processing and allows for partial computations which almost exactly match the results of PCA while processing the data in a minibatch fashion.

Depending on the extent of "almost exactly match", I think a good option is to use PCA, falling back to IncrementalPCA when we expect a memory issue. However, it's also worth noting that the peak memory usage of 9 GB can be handled by many systems. Therefore, I still think it makes sense to try algorithms without an out-of-core (partial_fit) implementation.
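
For reference, the out-of-core route with IncrementalPCA.partial_fit might look like this (a sketch; X is assumed to be the expression matrix as a NumPy array, and the chunking is arbitrary):

import numpy as np
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=500)

# partial_fit sees one chunk at a time; in a true out-of-core setting
# each chunk would be read from disk instead of split from an in-memory array
for chunk in np.array_split(X, 10):
    ipca.partial_fit(chunk)

X_reduced = ipca.transform(X)  # transform can also be applied chunk-wise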

@mans2singh
Contributor

@yl565 - Are you working with PCA or IncrementalPCA? I started on PCA, but if you are working on it, I can try IncrementalPCA.

@yl565
Contributor

yl565 commented Sep 14, 2016

@dhimmel here is the source code: https://gist.github.com/yl565/caf34bce62cb0fb4fa0c1a26a298e1d6
Use memory_profiler to run it from the command line:
mprof run test_PCA_peak_memory.py
mprof plot
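
As an aside, memory_profiler can also be called programmatically; a sketch, assuming pipeline, X, and y are already defined:

from memory_profiler import memory_usage

# Peak memory (in MiB) while fitting the pipeline
peak = memory_usage((pipeline.fit, (X, y)), max_usage=True)
print(peak)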

@mans2singh I'm not currently working on either. They should produce the same (or very close) results. You could try PCA first and, if it runs out of memory, try IncrementalPCA instead.
