pysequila

pysequila is a Python entry point to SeQuiLa, an ANSI-SQL compliant solution for efficient processing of sequencing reads and querying of genomic intervals, built on top of Apache Spark. Range joins, depth of coverage and pileup computations are the bread and butter of NGS analysis, but the high volume of data makes them run very slowly or even fail to complete.

Requirements

  • Python 3.7, 3.8, 3.9

Features

  • custom data sources for bioinformatics file formats (BAM, CRAM, VCF)
  • depth of coverage calculations
  • pileup calculations
  • reads filtering
  • efficient range joins
  • other utility functions
  • support for both the SQL and DataFrame/Dataset APIs
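
To make "depth of coverage" concrete, here is a minimal pure-Python sketch of what the computation produces: for each contig, blocks of consecutive positions covered by the same number of reads. It is only an illustration and not SeQuiLa's implementation, which runs distributed on Spark. The function name `depth_of_coverage` and the 1-based inclusive coordinate convention are assumptions made for this example.

```python
from collections import defaultdict


def depth_of_coverage(reads):
    """Illustrative helper (hypothetical, not part of pysequila).

    reads: iterable of (contig, start, end), 1-based inclusive coordinates.
    Returns a list of (contig, block_start, block_end, depth) tuples for
    every block with depth > 0.
    """
    # Record +1 where a read starts and -1 just past where it ends.
    events = defaultdict(lambda: defaultdict(int))
    for contig, start, end in reads:
        events[contig][start] += 1
        events[contig][end + 1] -= 1

    blocks = []
    for contig in sorted(events):
        depth = 0
        prev = None
        for pos in sorted(events[contig]):
            delta = events[contig][pos]
            if delta == 0:
                # Depth does not change here, so the current block continues.
                continue
            if prev is not None and depth > 0:
                blocks.append((contig, prev, pos - 1, depth))
            depth += delta
            prev = pos
    return blocks


example_reads = [("chr1", 1, 10), ("chr1", 5, 15), ("chrM", 3, 4)]
for block in depth_of_coverage(example_reads):
    print(block)
```

A single sorted sweep like this is fine for a handful of reads on one machine; the point of SeQuiLa is to run this kind of computation on data volumes where that no longer works.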

Setup

$ python -m pip install --user pysequila
or
(venv)$ python -m pip install pysequila

Usage

$ python
>>> from pysequila import SequilaSession
>>> ss = SequilaSession \
  .builder \
  .config("spark.jars.packages", "org.biodatageeks:sequila_2.12:1.1.0") \
  .config("spark.driver.memory", "2g") \
  .getOrCreate()
>>> ss.sql(
      """
      CREATE TABLE IF NOT EXISTS reads
      USING org.biodatageeks.sequila.datasources.BAM.BAMDataSource
      OPTIONS(path "/features/data/NA12878.multichrom.md.bam")
      """
    )
>>> ss.sql("SELECT * FROM coverage('reads', 'NA12878', '/features/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta')")
>>> # or, equivalently, using the DataFrame/Dataset API
>>> ss.coverage("/features/data/NA12878.multichrom.md.bam", "/features/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta")