
Data Processing Requirements #17

Open
awm33 opened this issue Jul 16, 2016 · 8 comments

awm33 (Member) commented Jul 16, 2016

I'd like to get an idea of what, at least initially, the data processing requirements are for cognoma.

Here are some questions I think we'll need to answer. Let's discuss here.

How "big" is the data? Number of rows, columns, and bytes?

Can a classifier search run on a single multicore machine?

  • It looks like scikit-learn has some multithreading ability (http://scikit-learn.org/stable/developers/performance.html), but this may not apply to every machine learning algorithm we want to use (see the sketch after this list).
    • A single cloud machine can pack quite a punch: even the lower-tier machines can have 8 cores and 30 GB of RAM. You can keep going up from there, but it gets more expensive.
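
A minimal sketch of what that coarse parallelism looks like in scikit-learn: many (not all) estimators take an n_jobs parameter that fans work out across cores. The synthetic dataset and choice of estimator below are illustrative placeholders, not cognoma's actual setup.

```python
# Minimal sketch: n_jobs=-1 asks scikit-learn to use all available cores.
# The synthetic dataset and estimator choice are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=100, random_state=0)

# Individual trees are fit in parallel workers managed by joblib.
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
clf.fit(X, y)
```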

Will we need distributed computing for each classifier search?

  • If we can't compute the classifier on a single machine within a reasonable amount of time (30-120 min?), then we may need to distribute a single classifier search across multiple machines.
  • Can the algorithms we want to use be distributed? Easily?
  • If so, how do we distribute? Simple parent/child jobs? Spark? Hadoop?

Where will the cancer data live? How big is it? (goes into the first question)

  • Flat files on S3/GCS? If we don't need to query or select subsets of the data, this may be OK; the index can live in the primary app DB.
  • A Postgres database? Usually fine for 1 TB or less.
  • A data warehouse? Redshift and BigQuery are available. BigQuery supports nested records as well.

Where will the results live? How big is an individual result set?

  • Same options as above
dhimmel (Member) commented Jul 16, 2016

How "big" is the data? Number of rows, columns, and bytes?

I posted the preliminary datasets created by dhimmel/cancer-data@ffe66ab. Both datasets are sample × gene matrices, with one containing mutation status (for constructing y) and the other containing expression (for x). The datasets contain 7,706 samples (rows). The expression data contains 20,501 genes (columns) and the mutation data contains 30,236 genes (columns).

Where will the cancer data live?

Currently, the datasets are compressed TSVs (flat text files). On my machine, reading the TSV into a pandas.DataFrame takes 78 seconds for the 174 MB expression dataset and 58.1 seconds for the 2 MB mutation dataset. One thing to consider is that we'll often only be interested in a subset of the columns and rows. Therefore, we're wasting time reading rows/columns that are going to be discarded. This will be especially true for the mutation dataset.
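
For reference, a minimal sketch of the baseline read being timed above; the file name is a hypothetical stand-in for the dhimmel/cancer-data TSVs.

```python
# Sketch of timing a full TSV read; the file name is hypothetical.
import time
import pandas as pd

start = time.time()
# pandas infers decompression from the file extension.
expression = pd.read_table('expression-matrix.tsv.bz2', index_col=0)
elapsed = time.time() - start
print('read %d samples x %d genes in %.1f s'
      % (expression.shape[0], expression.shape[1], elapsed))
```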

What would be ideal is a storage format for matrix-like files that supports indexed reading of specific rows and columns in Python. Ideally, for convenience, the data could be stored and read directly from a file -- no need to have a separate application running to serve it. @awm33 do you have any ideas on what technologies may fit here? I've seen a project use pytables for a similar application. Does anyone have experience with pytables? numpy.memmap also looks promising.
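
To make the pytables idea concrete, here is a hedged sketch using pandas' HDF5 support (which is built on PyTables); the paths, store key, and gene names are all hypothetical.

```python
# Hedged sketch: write the matrix once in PyTables 'table' format, then
# read back only the columns (genes) a search needs. Paths, the store
# key, and the gene names are all hypothetical.
import pandas as pd

expression = pd.read_table('expression-matrix.tsv.bz2', index_col=0)
expression.to_hdf('expression.h5', 'expression', format='table')

# Indexed read: only the requested columns are parsed off disk.
subset = pd.read_hdf('expression.h5', 'expression', columns=['TP53', 'BRCA1'])
print(subset.shape)
```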

awm33 (Member, Author) commented Jul 16, 2016

@dhimmel So you have wide tables, but few rows.

This is a preliminary set? Will it be expanded beyond the initial 7,706 samples? If it stays relatively small, say under 100k samples, I think files could work.

I assume that any one classifier search would only select 2-25ish genes? Or at least that's what I've heard for the frontend. Then you're really only selecting a very small subset of those matrices.

Both pytables and memmap look like they may fit. Here are the I/O options for both: http://pandas.pydata.org/pandas-docs/stable/io.html and http://docs.scipy.org/doc/numpy/reference/routines.io.html

Are you guys using a mix of pandas and numpy structures? Something to be cognizant of is the performance and multithreading abilities of each structure.

One option would be to prepare the matrices in pytables and/or memmap format, upload them to S3/GCS, and load them onto the workers' ephemeral storage as a cache (sketched below). This load could happen during instance deploy, application deploy, or via a script on the machine.
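
A minimal sketch of warming that local cache from S3 with boto3; the bucket, key, and cache directory are hypothetical.

```python
# Hedged sketch: warm a worker's local cache from S3 with boto3.
# Bucket, key, and cache directory names are hypothetical.
import os
import boto3

CACHE_DIR = '/mnt/ephemeral/cognoma-cache'
BUCKET = 'cognoma-data'
KEY = 'matrices/expression.h5'

os.makedirs(CACHE_DIR, exist_ok=True)
local_path = os.path.join(CACHE_DIR, os.path.basename(KEY))
if not os.path.exists(local_path):
    boto3.client('s3').download_file(BUCKET, KEY, local_path)
```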

Here's another approach. Uncompressed, these matrices are about 1.2 GB, which is small enough to keep cached in a server's memory (assuming 15+ GB of RAM). Maybe you could store an indexed copy of the matrices in memory? You could keep a copy of the matrices in a pandas DataFrame, then select the columns (genes) you want and copy them into a new DataFrame (a minimal sketch follows). It looks like DataFrames support indexed querying (http://pandas.pydata.org/pandas-docs/version/0.15.0/indexing.html), but someone more familiar with pandas may know a better structure, as long as operations against that structure are thread-safe. This also assumes we would recycle processes across jobs.
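
A hedged sketch of that in-memory cache, with hypothetical file and function names:

```python
# Hedged sketch: keep the full matrix in a module-level DataFrame and
# hand each task a copy of only the genes it needs. Names are hypothetical.
import pandas as pd

# Loaded once when the worker process starts.
EXPRESSION = pd.read_hdf('expression.h5', 'expression')

def select_genes(genes, samples=None):
    """Return an independent copy of the requested submatrix."""
    subset = EXPRESSION[genes]
    if samples is not None:
        subset = subset.loc[samples]
    return subset.copy()  # copy so tasks never mutate the shared cache
```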

dhimmel (Member) commented Jul 16, 2016

This is a preliminary set? Will it be expanded beyond the initial 7,706 samples?

I don't expect the sample number to grow substantially. @gwaygenomics, is TCGA still collecting data and expanding?

I assume that any one classifier search would only select 2-25ish genes?

The classifier will train itself on many genes (often the full set of 20,501). Depending on the algorithm, the resulting model may only actually use a small number of genes. The machine learning portion of the project will need to take large subsets of the expression matrix. As far as the results viewer, I'm not sure whether the raw data will be incorporated.

One option would be to prepare the matrices in pytable and or memmap format, upload them to S3/GCS, and load them to the workers ephemeral storage as a cache.

I like this idea. Having the file stored locally provides lots of versatility for code running on the instance.

Uncompressed, these matrices are about 1.2GB, which is small enough to keep cached in a server's memory

I'm not a huge fan of this, as it creates a dependency on having everything operate through a single Python process. Threading in Python, although fun, can be cumbersome and problematic due to the GIL. Subsetting pandas DataFrames is easy -- it's communicating the copied subset to independent Python processes that will be hard.

Are you guys using a mix of pandas and numpy structures?

Unclear at this point. pandas is more user-friendly and uses numpy under the hood; numpy is less friendly but may have performance benefits.

gwaybio (Member) commented Jul 16, 2016

This is a preliminary set? Will it be expanded beyond the initial 7,706 samples?

I don't expect the sample number to grow substantially. @gwaygenomics, is TCGA still collecting data and expanding?

TCGA has completed sample collection, but more data will be made public eventually. The updated data will include more complete mutation calls, more in-depth clinical information, and a complete gene expression matrix. I think it should end up at around 15,000 total samples.

We're working on the cutting edge of cancer genomic data right now with these sets, but cognoma should be flexible enough to handle data updates.

awm33 (Member, Author) commented Jul 16, 2016

OK, assuming a roughly linear trend, the dataset will end up around ~3 GB.

@dhimmel Yeah, I've never done multithreading in python and the two people I've discussed it with were like "if you need multithreading, don't use python". I think I'll miss how easy Clojure makes threading/concurrency.

But you could still use the caching approach if a process only executes one task at a time but is kept around to pick up the next one, which I'm assuming would need the same matrices. It comes down to cycling processes vs. long-running processes.

I'm not sure what will end up being used. I think a long-running worker daemon is ideal, if memory leaks are not an issue in this code: one or more processes per instance, each independently communicating with the task queue and running one job at a time.

Spawning a new process or child process per task would clear out the previous task's memory, but adds the overhead of managing those processes.

awm33 (Member, Author) commented Jul 17, 2016

If we're using Python 3, https://docs.python.org/3/library/multiprocessing.html looks promising: "effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads" (minimal sketch below).
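
A minimal sketch of that pattern; fit_classifier is a hypothetical stand-in for the real classifier-search job.

```python
# Hedged sketch: run each task in a child process via multiprocessing,
# side-stepping the GIL. fit_classifier is a hypothetical placeholder.
from multiprocessing import Pool

def fit_classifier(task_id):
    # ... load data, train, evaluate ...
    return task_id, 'done'

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        for task_id, status in pool.imap_unordered(fit_classifier, range(10)):
            print(task_id, status)
```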

dhimmel (Member) commented Jul 17, 2016

For applications that are job/task based, concurrent.futures is my favorite method for parallelism in Python. You can use the same API but choose between a multiprocessing or threading backend (see the sketch below).

Multiprocessing escapes the Global Interpreter Lock, but has to pickle (serialize) all data sent to the subprocess, which causes a big overhead. If we can find a format that allows quick, indexed reading from disk, that's preferable, I think. If not, we should consider the in-memory approach -- threading would probably be the way to go, but we should test it out. Lots of the machine learning functions execute non-Python code and are thus likely to release the GIL.
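
A hedged sketch of that backend swap; run_search is a hypothetical placeholder for a classifier search.

```python
# Hedged sketch: concurrent.futures gives one API over both backends.
# Swap the executor class to compare process- vs thread-based parallelism.
from concurrent.futures import (ProcessPoolExecutor, ThreadPoolExecutor,
                                as_completed)

def run_search(task_id):
    # Hypothetical stand-in for a classifier search.
    return task_id ** 2

Executor = ProcessPoolExecutor  # or ThreadPoolExecutor; same API

if __name__ == '__main__':
    with Executor(max_workers=4) as executor:
        futures = [executor.submit(run_search, i) for i in range(10)]
        for future in as_completed(futures):
            print(future.result())
```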

awm33 (Member, Author) commented Jul 17, 2016

concurrent.futures looks neat. concurrent.futures and/or multiprocessing could be a way for a daemon process to manage per task child processes.

Yeah, it doesn't look like there is an easy way to share large amounts of data across threads/processes. An in-memory cache may be a premature optimization anyway.

Numpy uses libblas and liblapack under the hood. If pandas can tap into those through numpy, it should have good multicore performance (you can check the linked BLAS as sketched below).
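
A quick, hedged way to check which BLAS/LAPACK implementation NumPy was built against:

```python
# Prints the BLAS/LAPACK link information NumPy was compiled with;
# an optimized, multithreaded implementation (e.g. OpenBLAS or MKL)
# is what enables multicore linear algebra.
import numpy as np

np.show_config()
```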

It looks like experimenting with the indexed file formats is the way to go, maybe trying a few formats.
