[ADAM-1891] Reimplement FASTA sequence and slice converters for performance #2175

Merged: heuermh merged 1 commit into bigdatagenomics:master from heuermh:slice-perf on Jun 24, 2019

heuermh (Member) commented Jun 17, 2019

Fixes #1891, fixes #2174

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/3022/

Build result: FAILURE

[...truncated 3 lines...]
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
Wiping out workspace first.
Cloning the remote Git repository
Cloning repository https://github.com/bigdatagenomics/adam.git
 > git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > git rev-parse origin/pr/2175/merge^{commit} # timeout=10
 > git branch -a -v --no-abbrev --contains c5a37a3 # timeout=10
Checking out Revision c5a37a3 (origin/pr/2175/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f c5a37a382d918d3605352dc60077a3325ebdb729
First time build. Skipping changelog.
Triggering ADAM-prb » 2.7.5,2.12,2.4.3,ubuntu
Triggering ADAM-prb » 2.7.5,2.11,2.4.3,ubuntu
ADAM-prb » 2.7.5,2.12,2.4.3,ubuntu completed with result FAILURE
ADAM-prb » 2.7.5,2.11,2.4.3,ubuntu completed with result FAILURE
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/3023/

Build result: FAILURE

[...truncated 3 lines...]
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
Wiping out workspace first.
Cloning the remote Git repository
Cloning repository https://github.com/bigdatagenomics/adam.git
 > git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > git rev-parse origin/pr/2175/merge^{commit} # timeout=10
 > git branch -a -v --no-abbrev --contains 644d3e2 # timeout=10
Checking out Revision 644d3e2 (origin/pr/2175/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 644d3e28d4da54230b406db7f2288869426358c1
First time build. Skipping changelog.
Triggering ADAM-prb » 2.7.5,2.12,2.4.3,ubuntu
Triggering ADAM-prb » 2.7.5,2.11,2.4.3,ubuntu
ADAM-prb » 2.7.5,2.12,2.4.3,ubuntu completed with result FAILURE
ADAM-prb » 2.7.5,2.11,2.4.3,ubuntu completed with result FAILURE
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

coveralls commented Jun 18, 2019

Coverage Status

Coverage remained the same at ?% when pulling ae81b5b on heuermh:slice-perf into 05d474a on bigdatagenomics:master.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/3024/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/3025/

heuermh changed the title from "adding alt implementation of FastaSequenceConverter" to "[ADAM-1891] Reimplement FASTA sequence and slice converters for performance" on Jun 18, 2019
heuermh added this to the 0.28.0 milestone on Jun 18, 2019
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/3026/

heuermh marked this pull request as ready for review on June 18, 2019 19:42
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/3027/

heuermh (Member, Author) commented Jun 21, 2019

I ran performance tests for this branch on AWS EMR against Spark 2.4.0, using the scripts prepare-bowhead-whale.sh and transform-bowhead-whale-emr-yarn.sh.

>>> transforming proteins and coding sequences

transformSequences -alphabet PROTEIN \
  bowhead_whale_proteins.fasta.gz bowhead_whale_proteins.sequences.adam
real	0m34.581s
user	0m14.553s
sys	0m2.138s

transformSequences -alphabet DNA \
  bowhead_whale_coding_sequences.fasta.gz bowhead_whale_coding_sequences.sequences.adam
real	0m41.645s
user	0m14.547s
sys	0m2.336s


>>> transforming scaffolds to sequences and slices

transformSequences -alphabet DNA \
  bowhead_whale_scaffolds.fasta.gz bowhead_whale_scaffolds.sequences.adam
real	14m44.948s
user	0m16.720s
sys	0m2.431s

transformSequences -alphabet DNA -create_reference \
  bowhead_whale_scaffolds.fasta.gz bowhead_whale_scaffolds.ref.sequences.adam
real	15m15.814s
user	0m15.503s
sys	0m2.509s

transformSlices -maximum_length 10000 \
  bowhead_whale_scaffolds.fasta.gz bowhead_whale_scaffolds.10k.slices.adam
real	15m10.885s
user	0m15.569s
sys	0m2.568s

transformSlices -maximum_length 10000 -create_reference \
  bowhead_whale_scaffolds.fasta.gz bowhead_whale_scaffolds.ref.10k.slices.adam
real	16m38.960s
user	0m16.104s
sys	0m2.388s

transformSlices -maximum_length 100000 \
  bowhead_whale_scaffolds.fasta.gz bowhead_whale_scaffolds.100k.slices.adam
real	15m2.828s
user	0m16.460s
sys	0m2.395s

transformSlices -maximum_length 100000 -create_reference \
  bowhead_whale_scaffolds.fasta.gz bowhead_whale_scaffolds.ref.100k.slices.adam
real	16m37.634s
user	0m16.082s
sys	0m2.353s


>>> transforming Trinity sequences to sequences

transformSequences -alphabet DNA \
  Bickham_Trinity.fasta.gz Bickham_Trinity.sequences.adam
real	3m33.708s
user	0m14.141s
sys	0m2.359s

transformSequences -alphabet DNA -create_reference \
  Bickham_Trinity.fasta.gz Bickham_Trinity.ref.sequences.adam
real	3m50.797s
user	0m13.788s
sys	0m2.468s

transformSequences -alphabet DNA \
  Bo_bowhead_MusKid_TrinityFasta.fasta.gz Bo_bowhead_MusKid_TrinityFasta.sequences.adam
real	6m29.937s
user	0m14.690s
sys	0m2.314s

transformSequences -alphabet DNA -create_reference \
  Bo_bowhead_MusKid_TrinityFasta.fasta.gz Bo_bowhead_MusKid_TrinityFasta.ref.sequences.adam
real	7m11.082s
user	0m14.749s
sys	0m2.269s
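
For anyone who wants to compare these runs numerically, here is a minimal sketch that converts the `real` lines above into seconds. It is not part of ADAM or of the benchmark scripts; the names `parse_duration`, `real_times`, and `SAMPLE` are hypothetical, and it assumes the timings keep the bash `time` layout shown above (`<minutes>m<seconds>s`).

```python
import re
from typing import List

# Assumption: durations follow the bash `time` layout seen in the log above,
# for example "14m44.948s" (whole minutes, then fractional seconds).
_DURATION = re.compile(r"^(?P<minutes>\d+)m(?P<seconds>\d+(?:\.\d+)?)s$")


def parse_duration(text: str) -> float:
    """Convert a string such as '14m44.948s' into seconds."""
    match = _DURATION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match["minutes"]) * 60 + float(match["seconds"])


def real_times(log: str) -> List[float]:
    """Collect wall-clock seconds from every 'real <duration>' line in a log."""
    times = []
    for line in log.splitlines():
        fields = line.split()
        # Only the two-field lines that start with "real" carry wall-clock time.
        if len(fields) == 2 and fields[0] == "real":
            times.append(parse_duration(fields[1]))
    return times


# Hypothetical sample: the first run from the log above, copied verbatim.
SAMPLE = """\
transformSequences -alphabet PROTEIN \\
  bowhead_whale_proteins.fasta.gz bowhead_whale_proteins.sequences.adam
real\t0m34.581s
user\t0m14.553s
sys\t0m2.138s
"""

if __name__ == "__main__":
    for seconds in real_times(SAMPLE):
        print(f"{seconds:.3f} s")
```

Only the `real` lines are collected because `user` and `sys` here measure the local driver process, while the wall-clock time reflects the whole job on the cluster.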

heuermh merged commit aa33b06 into bigdatagenomics:master on Jun 24, 2019
heuermh deleted the slice-perf branch on June 24, 2019 16:49