
Reduce memory consumption in prelim_map #393

Closed
3 tasks done
donkirkby opened this issue May 15, 2017 · 14 comments

donkirkby commented May 15, 2017

We're trying to allocate enough memory for the pipeline steps when we run them under Slurm, and prelim_map failed for some of them because it used more than 3GB on a large sample. I ran prelim_map on a small sample and tracked the memory usage reported by top: it showed 100MB used by bowtie2 and 350MB used by the Python process. The bowtie2 memory was stable, but the Python memory climbed steadily.
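For anyone repeating this kind of measurement, here's a minimal sketch of recording the peak resident memory of a process and its children with psutil (assuming psutil is available; the numbers above were just read off top by hand):

```python
import time

import psutil


def watch_peak_rss(pid, interval=1.0):
    """Poll a process and its children, returning the peak resident set
    size in MB once the process exits."""
    process = psutil.Process(pid)
    peak = 0
    while process.is_running():
        try:
            rss = process.memory_info().rss
            rss += sum(child.memory_info().rss
                       for child in process.children(recursive=True))
        except psutil.NoSuchProcess:
            break
        peak = max(peak, rss)
        time.sleep(interval)
    return peak / (1024 * 1024)
```

Call it with the PID of the driver script while the step is running.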

  • See what is being held in memory, and decide if it can be avoided.
  • Look at other steps and create separate issues if needed.
  • Look at steps in the Mixed HCV pipeline.
donkirkby added this to the 7.7 - HIV references milestone May 16, 2017
donkirkby self-assigned this May 16, 2017
donkirkby commented May 16, 2017

The largest sample I found that stayed under 6GB of memory was D62195-HCV_S5 from the 26 Feb 2016 batch. Its steps had the following MaxRSS values:

  • trim_fastqs - 21M
  • prelim_map - 4.1G
  • remap - 700M
  • sam2aln - 4.8G
  • aln2counts - 5.8G
  • sam_g2p - 11M
  • coverage_plots - completed in less than 30s, so no memory reported

Another interesting sample is 67182A-HIV_S5 from 20 Sep 2016. It is a smaller sample, but it used more than 6GB of memory on the v7.7 pipeline.
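For reference, the MaxRSS and elapsed time per step can be pulled back out of Slurm accounting with sacct once the jobs finish. A rough convenience wrapper (the job id is a placeholder; this isn't necessarily how the numbers above were collected):

```python
import subprocess


def print_step_memory(job_id):
    """Print MaxRSS and elapsed time for every step of a finished Slurm job."""
    output = subprocess.check_output(
        ['sacct', '-j', str(job_id), '--noheader', '--parsable2',
         '--format=JobID,JobName,MaxRSS,Elapsed'],
        universal_newlines=True)
    for line in output.splitlines():
        job_step, name, max_rss, elapsed = line.split('|')
        print(job_step, name, max_rss, elapsed, sep='\t')
```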

@donkirkby

prelim_map is loading all of the reads into memory before it writes them to the prelim_map.csv file so that it can group the reads by reference name. Figure out if we really need to group the reads.
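To make the trade-off concrete, here's a simplified sketch of the two options, not the actual prelim_map.py code; the SAM parsing and CSV layout are stripped down for illustration:

```python
import csv
import itertools
import operator


def write_grouped_in_memory(sam_lines, csv_file):
    """Current behaviour, roughly: hold every read until the end so the
    output can be grouped by reference name. Memory grows with read count."""
    rows_by_rname = {}
    for line in sam_lines:
        if line.startswith('@'):
            continue  # skip SAM header lines
        fields = line.rstrip('\n').split('\t')
        rows_by_rname.setdefault(fields[2], []).append(fields)
    writer = csv.writer(csv_file)
    for rname in sorted(rows_by_rname):
        writer.writerows(rows_by_rname[rname])


def write_grouped_from_sorted(sorted_sam_lines, csv_file):
    """Alternative: if the SAM input is already sorted by reference name,
    each group can be streamed out as soon as it's complete."""
    writer = csv.writer(csv_file)
    reads = (line.rstrip('\n').split('\t') for line in sorted_sam_lines
             if not line.startswith('@'))
    for _rname, group in itertools.groupby(reads, key=operator.itemgetter(2)):
        writer.writerows(group)
```

The first version's memory grows with the number of reads; the second only ever holds one reference's group at a time, but it needs the SAM records sorted by reference name first.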

I'm pushing this back to the near future milestone, because Richard H. asked for #394 to get done before this.

@donkirkby

The largest sample in the 26 Feb 2016 run had a compressed FASTQ file of 800MB. As a workaround, we configured Kive to request 20GB for every driver script, and that worked. The most memory used was just over 10GB, for the 800MB sample's aln2counts step. The longest elapsed time was 5h50m, for the same sample's remap step, though that step used less than 1GB of memory.

donkirkby commented Jul 6, 2017

Here is a range of sample sizes to experiment with from the 13-Jun-2017.M04401 run:

  • 88160A_HCV (214.1MB)
  • 88160AMIDI_MidHCV (47.4MB)
  • 69268A-V3-1_V3LOOP (15.3MB)
  • 69258A-V3_V3LOOP (5.7MB)

The sizes are the sum of the two compressed FASTQ files. Presumably, the V3LOOP samples will behave differently from the HCV samples, because V3LOOP reads get mapped with the pairwise alignment script in fastq_g2p.py.

aln2counts.py seems to use the most memory, so we'll start with that. Here's the memory that each sample uses:

  • 88160A_HCV (2.7GB, 846s)
  • 88160AMIDI_MidHCV (437MB, 52s)
  • 69268A-V3-1_V3LOOP (67MB, 9s)
  • 69258A-V3_V3LOOP (50MB, 4s)

Memory-consuming steps in aln2counts.py:

  • counting nuc variants (removed)
  • writing coordinate insertions
  • reading consensus insertions from the previous step (waiting on changes to that step; probably not important)

donkirkby added a commit that referenced this issue Jul 11, 2017: Add a utility for sorting SAM files.
donkirkby added a commit that referenced this issue Jul 11, 2017: Add third-party license information to README.
@donkirkby

Reduced memory usage in aln2counts.py by writing work in progress to a file cache.
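Roughly the idea, as a generic sketch of spilling partial results to a temporary file (not the actual aln2counts.py code):

```python
import csv
import os
import tempfile


class FileBackedRows:
    """Accumulate rows in a small buffer and spill them to a temporary CSV
    whenever the buffer gets big, so memory stays roughly constant."""

    def __init__(self, max_buffered_rows=10000):
        self.max_buffered_rows = max_buffered_rows
        self.buffer = []
        fd, self.cache_path = tempfile.mkstemp(suffix='.csv')
        os.close(fd)

    def append(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.max_buffered_rows:
            self.flush()

    def flush(self):
        with open(self.cache_path, 'a', newline='') as cache_file:
            csv.writer(cache_file).writerows(self.buffer)
        self.buffer = []

    def read_back(self):
        """Stream the cached rows back out, e.g. when writing the report."""
        self.flush()
        with open(self.cache_path, newline='') as cache_file:
            yield from csv.reader(cache_file)

    def close(self):
        os.remove(self.cache_path)
```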

Here's the memory that each sample uses, after the changes:

  • 88160A_HCV (87MB, 912s)
  • 88160AMIDI_MidHCV (91MB, 60s)
  • 69268A-V3-1_V3LOOP (66MB, 9s)
  • 69258A-V3_V3LOOP (51MB, 4s)

Unsurprisingly, writing to disk is slower by up to a minute, but the memory usage now stays below 100MB.

@donkirkby

Next to tackle: sam2aln.py. Here's the memory that each sample currently uses:

  • 88160A_HCV (2.2GB, 549s)
  • 88160AMIDI_MidHCV (373MB, 84s)
  • 69268A-V3-1_V3LOOP (20MB, 4s)
  • 69258A-V3_V3LOOP (14MB, 2s)

@donkirkby

Reduced memory usage in sam2aln.py by writing work in progress to a file cache and not sorting by count. It would probably be even faster and easier to report individual reads instead of grouping them: the few examples I looked at only saved about 5% of disk space by grouping duplicates.
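The 5% figure came from just a few examples; a quick way to check it on other samples would be something like this (the column names are guesses at the aligned CSV layout, not the real field names):

```python
import csv
from collections import Counter


def grouping_savings(aligned_csv_path):
    """Fraction of rows saved by collapsing duplicate aligned reads."""
    counts = Counter()
    total_rows = 0
    with open(aligned_csv_path, newline='') as aligned_csv:
        for row in csv.DictReader(aligned_csv):
            total_rows += 1
            counts[(row['refname'], row['offset'], row['seq'])] += 1
    return 1 - len(counts) / total_rows


# e.g. print('{:.1%} fewer rows when duplicates are grouped'.format(
#     grouping_savings('aligned.csv')))
```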

Anyway, here's the memory that each sample now uses, along with the slightly slower times.

  • 88160A_HCV (39MB, 651s)
  • 88160AMIDI_MidHCV (59MB, 99s)
  • 69268A-V3-1_V3LOOP (22MB, 4s)
  • 69258A-V3_V3LOOP (17MB, 2s)

@donkirkby

Next to tackle: prelim_map.py. Here's the memory that each sample currently uses:

  • 88160A_HCV (990MB, 86s)
  • 88160AMIDI_MidHCV (204MB, 31s)
  • 69268A-V3-1_V3LOOP (104MB, 2s)
  • 69258A-V3_V3LOOP (96MB, 2s)

@donkirkby

Reduced memory usage in prelim_map.py by letting remap.py handle its input unsorted (a sketch of the streaming idea follows the numbers). Here's the memory that each sample now uses, and it's actually faster:

  • 88160A_HCV (110MB, 72s)
  • 88160AMIDI_MidHCV (108MB, 3s)
  • 69268A-V3-1_V3LOOP (102MB, 3s)
  • 69258A-V3_V3LOOP (96MB, 3s)
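Here's the sketch mentioned above: stream each bowtie2 record straight into the output CSV instead of buffering it for grouping. The bowtie2 arguments and CSV columns are placeholders, not the real prelim_map.py command line:

```python
import csv
import subprocess

SAM_FIELDS = ['qname', 'flag', 'rname', 'pos', 'mapq',
              'cigar', 'rnext', 'pnext', 'tlen', 'seq', 'qual']


def stream_prelim_map(fastq1, fastq2, ref_index, prelim_csv_path):
    """Write each bowtie2 record as it arrives; nothing is held for grouping."""
    bowtie2 = subprocess.Popen(
        ['bowtie2', '-x', ref_index, '-1', fastq1, '-2', fastq2, '--no-unal'],
        stdout=subprocess.PIPE,
        universal_newlines=True)
    with open(prelim_csv_path, 'w', newline='') as prelim_csv:
        writer = csv.writer(prelim_csv)
        writer.writerow(SAM_FIELDS)
        for line in bowtie2.stdout:
            if line.startswith('@'):
                continue  # skip SAM header lines
            writer.writerow(line.rstrip('\n').split('\t')[:len(SAM_FIELDS)])
    bowtie2.wait()
```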

donkirkby added a commit that referenced this issue Jul 13, 2017: Also fix a bunch of warnings in remap.py.
@donkirkby

Next to tackle: remap.py. Here's the memory that each sample currently uses:

  • 88160A_HCV (700MB, 35min)
  • 88160AMIDI_MidHCV (182MB, 204s)
  • 69268A-V3-1_V3LOOP (153MB, 5s)
  • 69258A-V3_V3LOOP (113MB, 3s)

@donkirkby

Made some small improvements to remap.py, but nothing significant. Leaving it for now, since all of the bigger memory users have been fixed.

donkirkby added a commit that referenced this issue Jul 18, 2017: Part of issue #393. Slight performance improvement to merge_pairs.
donkirkby added a commit that referenced this issue Jul 24, 2017: Part of #393. Fix some problems with QAI upload, such as removing HLA variants files. Start runs sorted by sample number, reversed.
@donkirkby

The Mixed HCV pipeline looks challenging. A lot of memory gets used by bowtie2 in the first step, but the actual failures when I ran some large samples came from sam2aln using more than 6GB. It seems like I might be able to improve some of the steps and then configure the rest with a higher memory limit.

donkirkby commented Jul 28, 2017

Here are the memory levels used by each step in the Mixed HCV pipeline on the 88160AMIDI_MidHCV sample:

  • random-primer-hcv 3.5GB (283s)
  • sam2aln 343MB (33s)
  • aln2aafreq 12MB (14s)
  • merge_by_ref_gene 18MB (3s)

Here are the memory levels used by the bigger 88160A_HCV sample:

  • random-primer-hcv 3.5GB (35min)
  • sam2aln 2.1GB (216s)
  • aln2aafreq 13MB (73s)
  • merge_by_ref_gene 21MB (12s)

It looks like random-primer-hcv uses 3.5GB for any sample size, probably because it uses the human genome as one of its references. I checked the Slurm accounting on Bulbasaur, and all of the compute-node runs also used 3.5GB.

We might look at removing the human genome in the future, but for this issue, I'm just going to work on the sam2aln step.

@donkirkby

After removing the sorting and changing the grouping, sam2aln now uses 43MB (255s) for the large sample.

donkirkby modified the milestones: 7.8 Mutation Prevalance, near future Jul 31, 2017