
Possible improvement on chunk_barcoded_bam.py #73

Closed
ruochiz opened this issue Jun 25, 2023 · 2 comments

Comments

@ruochiz
Contributor

ruochiz commented Jun 25, 2023

Thank you for creating this useful toolkit. When running the software on a very large combined library (~200k cells to consider), I found that chunk_barcoded_bam.py becomes the bottleneck, and I found two possible ways to speed it up.

  1. Store the cell barcodes in a set instead of a list, bc = set([x.strip() for x in content]), which makes checking whether a barcode exists much faster (~800 records/s -> ~100k records/s).
  2. Use pysam's read.get_tag to look up the barcode tag directly, instead of iterating over all of the read's tags:
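def getBarcode(read, tag_get):
  '''
  Parse out the barcode per-read; return "AA" if the tag is absent
  '''
  # Previous approach: scan every tag on the read
  # for tg in read.tags:
  # 	if(tag_get == tg[0]):
  # 		return(tg[1])
  # return("AA")
  try:
    return read.get_tag(tag_get)
  except KeyError:
    # pysam raises KeyError when the tag is not present on the read
    return "AA"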

This further improves the speed from ~100k records/s to 130k records/s.
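To show how the two changes fit together, here is a minimal sketch rather than the actual chunk_barcoded_bam.py code. `FakeRead` is a hypothetical stand-in for a pysam read so the example runs without a BAM file. The `CB` tag name, the `keep_known` helper and the barcode strings are assumptions for illustration only.

```python
class FakeRead:
    """Hypothetical stand-in for a pysam read, so no BAM file is needed."""

    def __init__(self, tags):
        self._tags = dict(tags)

    def get_tag(self, tag):
        # Mirrors pysam's behaviour of raising KeyError when the tag is absent
        return self._tags[tag]


def get_barcode(read, tag, default="AA"):
    """Return the read's barcode tag value, or a default if the tag is missing."""
    try:
        return read.get_tag(tag)
    except KeyError:
        return default


def keep_known(reads, barcodes, tag="CB"):
    """Yield only the reads whose barcode is in the allowed barcode set."""
    for read in reads:
        # Membership in a set is a hash lookup, not a scan over every barcode
        if get_barcode(read, tag) in barcodes:
            yield read


# Made-up barcode file contents and reads (illustrative data only)
content = ["AAACGT-1\n", "TTTGCA-1\n"]
bc = {x.strip() for x in content}

reads = [
    FakeRead({"CB": "AAACGT-1"}),  # known barcode
    FakeRead({"CB": "GGGGGG-1"}),  # barcode not in the list
    FakeRead({"NM": 0}),           # no barcode tag at all
]
kept = list(keep_known(reads, bc))
print(len(kept))
```

With a list, each membership check compares against the barcodes one by one, so the cost grows with the number of cells. A set keeps the check at roughly constant time on average, which is where most of the reported speedup would come from at ~200k barcodes.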

@caleblareau
Owner

caleblareau commented Jun 25, 2023 via email

@caleblareau
Owner

Now implemented in v0.6.8. Thank you very much @ruochiz for the contribution. You should be able to pip install the latest version of the software now.
