
Reduce memory consumption in prelim_map #393

Closed
3 tasks done
donkirkby opened this issue May 15, 2017 · 14 comments

donkirkby commented May 15, 2017

We're trying to allocate enough memory for the pipeline steps when we run them under Slurm, and prelim_map failed for some of them because it used more than 3GB on a large sample. I ran prelim_map on a small sample and tracked the memory usage reported by top: it showed 100MB used by bowtie2 and 350MB used by the Python process. The bowtie2 memory was stable, but the Python memory climbed steadily.
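For anyone repeating this kind of measurement, here's a minimal sketch of recording the peak resident memory of a process and its children with psutil (assuming psutil is available; the numbers above were just read off top by hand):

```python
import time

import psutil


def watch_peak_rss(pid, interval=1.0):
    """Poll a process and its children, returning the peak resident set
    size in MB once the process exits."""
    process = psutil.Process(pid)
    peak = 0
    while process.is_running():
        try:
            rss = process.memory_info().rss
            rss += sum(child.memory_info().rss
                       for child in process.children(recursive=True))
        except psutil.NoSuchProcess:
            break
        peak = max(peak, rss)
        time.sleep(interval)
    return peak / (1024 * 1024)
```

Call it with the PID of the driver script while the step is running.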

  • See what is being held in memory, and decide if it can be avoided.
  • Look at other steps and create separate issues if needed.
  • Look at steps in the Mixed HCV pipeline.
donkirkby added this to the 7.7 - HIV references milestone May 16, 2017
donkirkby self-assigned this May 16, 2017
donkirkby commented May 16, 2017

The largest sample I found that stayed under 6GB of memory was D62195-HCV_S5 from the 26 Feb 2016 batch. Its steps had the following MaxRSS values:

  • trim_fastqs - 21M
  • prelim_map - 4.1G
  • remap - 700M
  • sam2aln - 4.8G
  • aln2counts - 5.8G
  • sam_g2p - 11M
  • coverage_plots - completed in less than 30s, so no memory reported

Another interesting sample is 67182A-HIV_S5 from 20 Sep 2016. It is a smaller sample, but it used more than 6GB of memory on the v7.7 pipeline.
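For reference, the MaxRSS and elapsed time per step can be pulled back out of Slurm accounting with sacct once the jobs finish. A rough convenience wrapper (the job id is a placeholder; this isn't necessarily how the numbers above were collected):

```python
import subprocess


def print_step_memory(job_id):
    """Print MaxRSS and elapsed time for every step of a finished Slurm job."""
    output = subprocess.check_output(
        ['sacct', '-j', str(job_id), '--noheader', '--parsable2',
         '--format=JobID,JobName,MaxRSS,Elapsed'],
        universal_newlines=True)
    for line in output.splitlines():
        job_step, name, max_rss, elapsed = line.split('|')
        print(job_step, name, max_rss, elapsed, sep='\t')
```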

@donkirkby

prelim_map is loading all of the reads into memory before it writes them to the prelim_map.csv file so that it can group the reads by reference name. Figure out if we really need to group the reads.
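To make the trade-off concrete, here's a simplified sketch of the two options, not the actual prelim_map.py code; the SAM parsing and CSV layout are stripped down for illustration:

```python
import csv
import itertools
import operator


def write_grouped_in_memory(sam_lines, csv_file):
    """Current behaviour, roughly: hold every read until the end so the
    output can be grouped by reference name. Memory grows with read count."""
    rows_by_rname = {}
    for line in sam_lines:
        if line.startswith('@'):
            continue  # skip SAM header lines
        fields = line.rstrip('\n').split('\t')
        rows_by_rname.setdefault(fields[2], []).append(fields)
    writer = csv.writer(csv_file)
    for rname in sorted(rows_by_rname):
        writer.writerows(rows_by_rname[rname])


def write_grouped_from_sorted(sorted_sam_lines, csv_file):
    """Alternative: if the SAM input is already sorted by reference name,
    each group can be streamed out as soon as it's complete."""
    writer = csv.writer(csv_file)
    reads = (line.rstrip('\n').split('\t') for line in sorted_sam_lines
             if not line.startswith('@'))
    for _rname, group in itertools.groupby(reads, key=operator.itemgetter(2)):
        writer.writerows(group)
```

The first version's memory grows with the number of reads; the second only ever holds one reference's group at a time, but it needs the SAM records sorted by reference name first.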

I'm pushing this back to the near future milestone, because Richard H. asked for #394 to get done before this.

@donkirkby

The largest sample in the 26 Feb 2016 run had a compressed FASTQ file of 800MB. As a workaround, we configured Kive to request 20GB for every driver script, and that worked. The most memory used was just over 10GB, for the 800MB sample's aln2counts step. The longest elapsed time was 5h50m, for the same sample's remap step, though that step used less than 1GB of memory.

donkirkby commented Jul 6, 2017

Here is a range of sample sizes to experiment with from the 13-Jun-2017.M04401 run:

  • 88160A_HCV (214.1MB)
  • 88160AMIDI_MidHCV (47.4MB)
  • 69268A-V3-1_V3LOOP (15.3MB)
  • 69258A-V3_V3LOOP (5.7MB)

The sizes are the sum of the two compressed FASTQ files. Presumably, the V3LOOP samples will behave differently from the HCV samples, because V3LOOP reads get mapped with the pairwise alignment script in fastq_g2p.py.

aln2counts.py seems to use the most memory, so we'll start with that. Here's the memory that each sample uses:

  • 88160A_HCV (2.7GB, 846s)
  • 88160AMIDI_MidHCV (437MB, 52s)
  • 69268A-V3-1_V3LOOP (67MB, 9s)
  • 69258A-V3_V3LOOP (50MB, 4s)

Memory-consuming steps in aln2counts.py:

  • counting nuc variants (removed)
  • writing coordinate insertions
  • reading consensus insertions from the previous step (waiting on changes to that step; probably not important)

donkirkby added a commit that referenced this issue Jul 11, 2017: Add a utility for sorting SAM files.
donkirkby added a commit that referenced this issue Jul 11, 2017: Add third-party license information to README.
@donkirkby

Reduced memory usage in aln2counts.py by writing work in progress to a file cache.
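Roughly the idea, as a generic sketch of spilling partial results to a temporary file (not the actual aln2counts.py code):

```python
import csv
import os
import tempfile


class FileBackedRows:
    """Accumulate rows in a small buffer and spill them to a temporary CSV
    whenever the buffer gets big, so memory stays roughly constant."""

    def __init__(self, max_buffered_rows=10000):
        self.max_buffered_rows = max_buffered_rows
        self.buffer = []
        fd, self.cache_path = tempfile.mkstemp(suffix='.csv')
        os.close(fd)

    def append(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.max_buffered_rows:
            self.flush()

    def flush(self):
        with open(self.cache_path, 'a', newline='') as cache_file:
            csv.writer(cache_file).writerows(self.buffer)
        self.buffer = []

    def read_back(self):
        """Stream the cached rows back out, e.g. when writing the report."""
        self.flush()
        with open(self.cache_path, newline='') as cache_file:
            yield from csv.reader(cache_file)

    def close(self):
        os.remove(self.cache_path)
```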

Here's the memory that each sample uses, after the changes:

  • 88160A_HCV (87MB, 912s)
  • 88160AMIDI_MidHCV (91MB, 60s)
  • 69268A-V3-1_V3LOOP (66MB, 9s)
  • 69258A-V3_V3LOOP (51MB, 4s)

Unsurprisingly, writing to disk is slower by up to a minute, but the memory usage now stays below 100MB.

@donkirkby

Next to tackle: sam2aln.py. Here's the memory that each sample currently uses:

  • 88160A_HCV (2.2GB, 549s)
  • 88160AMIDI_MidHCV (373MB, 84s)
  • 69268A-V3-1_V3LOOP (20MB, 4s)
  • 69258A-V3_V3LOOP (14MB, 2s)

@donkirkby

Reduced memory usage in sam2aln.py by writing work in progress to a file cache and not sorting by count. It would probably be even faster and easier to report individual reads instead of grouping them: the few examples I looked at only saved about 5% of disk space by grouping duplicates.
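The 5% figure came from just a few examples; a quick way to check it on other samples would be something like this (the column names are guesses at the aligned CSV layout, not the real field names):

```python
import csv
from collections import Counter


def grouping_savings(aligned_csv_path):
    """Fraction of rows saved by collapsing duplicate aligned reads."""
    counts = Counter()
    total_rows = 0
    with open(aligned_csv_path, newline='') as aligned_csv:
        for row in csv.DictReader(aligned_csv):
            total_rows += 1
            counts[(row['refname'], row['offset'], row['seq'])] += 1
    return 1 - len(counts) / total_rows


# e.g. print('{:.1%} fewer rows when duplicates are grouped'.format(
#     grouping_savings('aligned.csv')))
```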

Anyway, here's the memory that each sample now uses, along with the slightly slower times.

  • 88160A_HCV (39MB, 651s)
  • 88160AMIDI_MidHCV (59MB, 99s)
  • 69268A-V3-1_V3LOOP (22MB, 4s)
  • 69258A-V3_V3LOOP (17MB, 2s)

@donkirkby

Next to tackle: prelim_map.py. Here's the memory that each sample currently uses:

  • 88160A_HCV (990MB, 86s)
  • 88160AMIDI_MidHCV (204MB, 31s)
  • 69268A-V3-1_V3LOOP (104MB, 2s)
  • 69258A-V3_V3LOOP (96MB, 2s)

@donkirkby

Reduced memory usage in prelim_map.py by letting remap.py handle its input unsorted (a sketch of the streaming idea follows the numbers). Here's the memory that each sample now uses, and it's actually faster:

  • 88160A_HCV (110MB, 72s)
  • 88160AMIDI_MidHCV (108MB, 3s)
  • 69268A-V3-1_V3LOOP (102MB, 3s)
  • 69258A-V3_V3LOOP (96MB, 3s)
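Here's the sketch mentioned above: stream each bowtie2 record straight into the output CSV instead of buffering it for grouping. The bowtie2 arguments and CSV columns are placeholders, not the real prelim_map.py command line:

```python
import csv
import subprocess

SAM_FIELDS = ['qname', 'flag', 'rname', 'pos', 'mapq',
              'cigar', 'rnext', 'pnext', 'tlen', 'seq', 'qual']


def stream_prelim_map(fastq1, fastq2, ref_index, prelim_csv_path):
    """Write each bowtie2 record as it arrives; nothing is held for grouping."""
    bowtie2 = subprocess.Popen(
        ['bowtie2', '-x', ref_index, '-1', fastq1, '-2', fastq2, '--no-unal'],
        stdout=subprocess.PIPE,
        universal_newlines=True)
    with open(prelim_csv_path, 'w', newline='') as prelim_csv:
        writer = csv.writer(prelim_csv)
        writer.writerow(SAM_FIELDS)
        for line in bowtie2.stdout:
            if line.startswith('@'):
                continue  # skip SAM header lines
            writer.writerow(line.rstrip('\n').split('\t')[:len(SAM_FIELDS)])
    bowtie2.wait()
```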

donkirkby added a commit that referenced this issue Jul 13, 2017: Also fix a bunch of warnings in remap.py.
@donkirkby

Next to tackle: remap.py. Here's the memory that each sample currently uses:

  • 88160A_HCV (700MB, 35min)
  • 88160AMIDI_MidHCV (182MB, 204s)
  • 69268A-V3-1_V3LOOP (153MB, 5s)
  • 69258A-V3_V3LOOP (113MB, 3s)

@donkirkby

Made some small improvements to remap.py, but nothing significant. Leaving it for now, since all of the bigger memory users have been fixed.

donkirkby added a commit that referenced this issue Jul 18, 2017: Part of issue #393. Slight performance improvement to merge_pairs.
donkirkby added a commit that referenced this issue Jul 24, 2017: Part of #393. Fix some problems with QAI upload, such as removing HLA variants files. Start runs sorted by sample number, reversed.
@donkirkby

The Mixed HCV pipeline looks challenging. A lot of memory gets used by bowtie2 in the first step, but the actual failures when I ran some large samples came from sam2aln using more than 6GB. It seems like I might be able to improve some of the steps and then configure the rest with a higher memory limit.

donkirkby commented Jul 28, 2017

Here are the memory levels used by each step in the Mixed HCV pipeline on the 88160AMIDI_MidHCV sample:

  • random-primer-hcv 3.5GB (283s)
  • sam2aln 343MB (33s)
  • aln2aafreq 12MB (14s)
  • merge_by_ref_gene 18MB (3s)

Here are the memory levels used by the bigger 88160A_HCV sample:

  • random-primer-hcv 3.5GB (35min)
  • sam2aln 2.1GB (216s)
  • aln2aafreq 13MB (73s)
  • merge_by_ref_gene 21MB (12s)

It looks like random-primer-hcv uses 3.5GB for any sample size, probably because it uses the human genome as one of its references. I checked the Slurm accounting on Bulbasaur, and all of the compute-node runs also used 3.5GB.

We might look at removing the human genome in the future, but for this issue, I'm just going to work on the sam2aln step.

@donkirkby

After removing the sorting and changing the grouping, sam2aln now uses 43MB (255s) for the large sample.

donkirkby modified the milestones: 7.8 Mutation Prevalance, near future Jul 31, 2017