Data Sets #2043

Closed
ssabnis opened this Issue Sep 6, 2018 · 8 comments


ssabnis commented Sep 6, 2018

Hello,

I am trying to use ADAM in the infrastructure that I have built, and I am looking for a large data set to process with ADAM. Is there any open data available?

Any info much appreciated.

Thanks

Member

heuermh commented Sep 6, 2018

Hello @ssabnis! You'll have to be a bit more specific: human variation? Cancer? Metagenomics?


ssabnis commented Sep 6, 2018

Hi @heuermh,

I appreciate your quick response, thank you.

I am looking for any large data set to performance-test ADAM on the analytics infrastructure that I have built. This setup is going to be used by biotech companies. Can you recommend data sets that I can use? I am thinking of cancer data.

Thanks


jondeaton commented Sep 7, 2018

Hello @ssabnis

You might consider using one of the standard human genome references made available by the Genome in a Bottle project. I have been using NA12878 for some of my own performance benchmarking of ADAM. You can obtain the whole genome sequencing data from:

ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/10XGenomics

The 95 GB BAM file "NA12878_phased_possorted_bam.bam" contains the sequencing records. Although this isn't "cancer data", it is a whole genome sequencing run of a human genome.


ssabnis commented Sep 7, 2018

Thanks a lot @jondeaton, this is a great help. I will start with this. If you know of any larger files, around 1 TB, that would help too.


ssabnis commented Sep 7, 2018

@jondeaton do I need to uncompress the .gz file to run ADAM?

Member

heuermh commented Sep 10, 2018

> do I need to uncompress the .gz file to run ADAM?

No, but unless the .gz file is in Blocked GZip Format (BGZF), reading performance will not scale with the number of executors, as regular gzip is not splittable.
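
One way to tell the two formats apart before starting a long run is to inspect the first gzip header of the file. The sketch below is not part of ADAM; the helper names `is_bgzf` and `file_is_bgzf` are hypothetical. It assumes the BGZF layout described in the SAM specification: a gzip member whose FEXTRA flag is set and whose extra field carries a `BC` subfield of length 2.

```python
import struct


def is_bgzf(header: bytes) -> bool:
    """Return True if the bytes start with a BGZF block header (hypothetical helper)."""
    # A gzip member header has 10 fixed bytes; XLEN (2 bytes) follows when FEXTRA is set.
    if len(header) < 12:
        return False
    # gzip magic bytes 0x1f 0x8b, compression method 8 (deflate).
    if header[0] != 0x1F or header[1] != 0x8B or header[2] != 8:
        return False
    # The FEXTRA flag (bit 2) must be set; plain gzip output usually leaves it clear.
    if not header[3] & 0x04:
        return False
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12:12 + xlen]
    # Walk the extra subfields looking for SI1='B' (66), SI2='C' (67), SLEN=2.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if si1 == 66 and si2 == 67 and slen == 2:
            return True
        pos += 4 + slen
    return False


def file_is_bgzf(path: str) -> bool:
    """Check only the first block of the file at `path`."""
    with open(path, "rb") as handle:
        # 1 KiB is an assumed upper bound on the first header; ample for typical files.
        return is_bgzf(handle.read(1024))
```

If `file_is_bgzf` returns False for a `.gz` input, the file is most likely regular gzip, and recompressing it with a BGZF-aware tool such as `bgzip` may be worth the time before benchmarking.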

Member

heuermh commented Sep 10, 2018

You might also want to take a look at the datasets referenced in
https://github.com/bcbio/bcbio_validations

Many different applications are discussed there.

heuermh added this to the 0.24.1 milestone Sep 20, 2018

Member

heuermh commented Sep 20, 2018

Closing as resolved.

heuermh closed this Sep 20, 2018
