### Cassiopeia-Preprocess

This notebook serves as a tutorial for the `Cassiopeia-Preprocess` module, which processes sequencing data into "Character Matrices" ready to pass into a phylogeny-inference algorithm. The pipeline consists of a preliminary step (step 0) followed by 7 main steps:

0. Run `cellranger count` on the raw FASTQ files to obtain BAMs relating read names to sequences.
1. "Collapse" sequences, indexed by UMIs, thereby counting reads.
2. "Resolve" UMI sequences, choosing the most likely sequencing read to represent each UMI in a cell.
3. "Align" sequences to the reference target site using the Smith-Waterman local alignment algorithm.
4. "Call Alleles" with respect to the reference target site and the alignment of a sequence, thereby reporting the set of mutations that a target site sequence contains.
5. "Error Correct UMIs" whose mutation data is identical and whose UMI barcode sequences are similar enough.
6. "Filter" UMIs that have conflicting allele information, too few reads, or do not meet other quality control criteria.
7. "Call lineages", or split up cells into clonal populations, based on their shared set of integration barcodes (intBCs).

The final output of this pipeline is an "AlleleTable", which stores the mutation data and clonal population identity for each cell. This data structure can then be broken up into character matrices for phylogenetic inference.
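
As a rough illustration of how an allele table might be broken up into a character matrix, the sketch below works on a toy pandas DataFrame that borrows the column names shown later in this notebook (`cellBC`, `intBC`, `r1`, `r2`). The helper `to_character_matrix` and the state encoding (0 for an uncut site, positive integers for distinct indels, -1 for missing data) are assumptions made for illustration; they are not Cassiopeia's own conversion routine.

```python
import pandas as pd

# Toy allele table; all values are made up for illustration.
toy = pd.DataFrame({
    "cellBC": ["cellA", "cellA", "cellB"],
    "intBC": ["BC1", "BC2", "BC1"],
    "r1": ["CCGAA[113:54D]CTCTG", "ATTCG[None]CGGAG", "CCGAA[113:54D]CTCTG"],
    "r2": ["ATTCG[None]CGGAG", "CATAA[161:18D]CGTGA", "ATTCG[None]CGGAG"],
})

def to_character_matrix(allele_table, cut_sites=("r1", "r2"), missing=-1):
    """Hypothetical helper: one row per cell, one column per (intBC, cut site)."""
    long = allele_table.melt(
        id_vars=["cellBC", "intBC"], value_vars=list(cut_sites),
        var_name="site", value_name="indel",
    )
    long["character"] = long["intBC"] + "_" + long["site"]
    wide = long.pivot(index="cellBC", columns="character", values="indel")

    def encode(column):
        # Uncut sites ("[None]") become state 0; each distinct indel gets 1, 2, ...
        states = {}
        out = []
        for value in column:
            if pd.isna(value):
                out.append(missing)
            elif "[None]" in value:
                out.append(0)
            else:
                out.append(states.setdefault(value, len(states) + 1))
        return pd.Series(out, index=column.index)

    return wide.apply(encode).astype(int)

character_matrix = to_character_matrix(toy)
print(character_matrix)
```

Each column of the result is one character (an intBC and cut-site pair), and each row is one cell. Here `cellB` has no observation for `BC2`, so those entries are marked as missing.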


## Pipeline API
Every stage of `Cassiopeia-Preprocess` is exposed as a function in `cassiopeia.pp`. This tutorial starts at the beginning of the pipeline, with the `collapse` stage. Full documentation is available on our [main site](https://cassiopeia-lineage.readthedocs.io/en/testdeployment/).

An alternative to running the pipeline interactively is the command line tool `cassiopeia-preprocess`, which takes a configuration file (for example, `Cassiopeia/data/preprocess.cfg`) and runs the pipeline end-to-end.

The `collapse` stage assumes that the user has already run `cellranger count` to obtain a BAM for the sequencing library.

In [5]:
import pandas as pd

import cassiopeia

In [6]:
# These paths are specific to the authors' filesystem; substitute your own.
input_bam = "/data/yosef2/users/mattjones/projects/scLineages/Cassiopeia/test_process_pipeline/test_possorted_genome_bam.subsampled.bam"
output_dir = "/data/yosef2/users/mattjones/projects/scLineages/Cassiopeia/test_process_pipeline/"
target_site_reference = "/data/yosef2/users/mattjones/projects/scLineages/Cassiopeia/data/PCT48.ref.fasta"

cassiopeia.pp.setup(output_dir)

In [7]:
umi_table = cassiopeia.pp.collapse_umis(input_bam, output_dir)

In [4]:
umi_table = cassiopeia.pp.resolve_umi_sequence(umi_table, output_dir)
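
To build intuition for the `collapse` and `resolve` stages, the toy sketch below counts identical reads per cell and UMI and then keeps the best-supported sequence for each UMI. It uses plain pandas on a made-up table; the column names and the "most reads wins" rule are simplifying assumptions and do not reproduce what `collapse_umis` or `resolve_umi_sequence` do internally.

```python
import pandas as pd

# Toy reads: each row is one sequencing read (all values are made up).
reads = pd.DataFrame({
    "cellBC": ["cellA"] * 5 + ["cellB"] * 2,
    "UMI":    ["U1", "U1", "U1", "U2", "U2", "U1", "U1"],
    "seq":    ["ACGT", "ACGT", "ACGA", "TTGC", "TTGC", "GGCA", "GGCA"],
})

# "Collapse": count identical reads per (cell, UMI, sequence).
collapsed = (
    reads.groupby(["cellBC", "UMI", "seq"])
    .size()
    .reset_index(name="readCount")
)

# "Resolve": keep the best-supported sequence for each (cell, UMI).
resolved = (
    collapsed.sort_values("readCount", ascending=False)
    .drop_duplicates(subset=["cellBC", "UMI"], keep="first")
    .sort_values(["cellBC", "UMI"])
    .reset_index(drop=True)
)
print(resolved)
```

In this toy data, UMI `U1` in `cellA` is seen with two different sequences; the one backed by two reads is kept and the single-read variant is discarded.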

In [5]:
umi_table = cassiopeia.pp.align_sequences(umi_table, ref_filepath=target_site_reference)
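
To give a feel for what local alignment computes in the `align` stage, here is a minimal, score-only Smith-Waterman sketch in pure Python. The function name `smith_waterman_score`, the linear gap penalty, and the scoring values are assumptions chosen for readability; they are not the parameters or the implementation used by `cassiopeia.pp.align_sequences`.

```python
def smith_waterman_score(query, reference, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (illustrative scores)."""
    rows, cols = len(query) + 1, len(reference) + 1
    # score[i][j]: best score of a local alignment ending at query[i-1], reference[j-1]
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if query[i - 1] == reference[j - 1] else mismatch
            diag = score[i - 1][j - 1] + step
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # Flooring at 0 is what makes the alignment local rather than global.
            score[i][j] = max(0, diag, up, left)
            best = max(best, score[i][j])
    return best

# The query matches exactly inside the longer reference.
print(smith_waterman_score("ACGT", "TTACGTTT"))  # 8: four matches at +2 each
```

A full aligner would also trace back through the score matrix to recover the aligned positions, which is what allows indels to be reported in the next stage.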

In [6]:
umi_table = cassiopeia.pp.call_alleles(umi_table, ref_filepath=target_site_reference)
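
Allele calling reports the mutations implied by each alignment. As a simplified sketch, assuming the alignment is summarized as a SAM-style CIGAR string, the hypothetical helper `cigar_to_indels` below walks the string and records where insertions and deletions fall on the reference. The output labels only resemble the `position:length` notation seen in the allele table at the end of this notebook; the real `call_alleles` also reports flanking sequence context and may use different coordinate conventions.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def cigar_to_indels(cigar, ref_start=0):
    """Hypothetical helper: list (reference position, length, kind) for each indel."""
    indels = []
    ref_pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I":
            # Insertions do not advance the reference position.
            indels.append((ref_pos, length, "I"))
        elif op == "D":
            indels.append((ref_pos, length, "D"))
            ref_pos += length
        elif op in "MN=X":
            ref_pos += length
        # S, H and P consume no reference bases.
    return indels

indels = cigar_to_indels("10M2I5M3D4M")
labels = [f"{pos}:{length}{kind}" for pos, length, kind in indels]
print(labels)  # ['10:2I', '15:3D']
```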

In [None]:
umi_table = cassiopeia.pp.error_correct_umis(umi_table, _id="test")
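
The idea behind UMI error correction, merging UMIs that carry identical mutation data and have similar barcode sequences, can be sketched on a toy table as follows. The helper `correct_umis`, the Hamming-distance threshold, and the greedy "largest UMI absorbs its neighbours" rule are assumptions for illustration rather than the logic of `cassiopeia.pp.error_correct_umis`.

```python
import pandas as pd

def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def correct_umis(table, max_distance=1):
    """Hypothetical sketch: merge rare UMIs into a similar, better-supported UMI."""
    rows = []
    for (cell, intbc, allele), group in table.groupby(["cellBC", "intBC", "allele"]):
        # Visit UMIs from most to fewest reads so well-supported UMIs absorb rare ones.
        group = group.sort_values("readCount", ascending=False)
        kept = {}  # UMI sequence -> accumulated read count
        for umi, count in zip(group["UMI"], group["readCount"]):
            target = next((k for k in kept if hamming(k, umi) <= max_distance), None)
            if target is None:
                kept[umi] = count
            else:
                kept[target] += count
        for umi, count in kept.items():
            rows.append({"cellBC": cell, "intBC": intbc, "allele": allele,
                         "UMI": umi, "readCount": count})
    return pd.DataFrame(rows)

# Toy molecule table; all values are made up.
toy = pd.DataFrame({
    "cellBC": ["cellA"] * 4,
    "intBC": ["BC1"] * 4,
    "allele": ["X", "X", "X", "Y"],
    "UMI": ["AAAA", "AAAT", "GGGG", "AAAC"],
    "readCount": [10, 1, 5, 3],
})
corrected = correct_umis(toy)
print(corrected)
```

Here `AAAT` is folded into `AAAA` because the two UMIs differ at one position and share an allele, while `AAAC` is left alone because its allele differs.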

In [10]:
umi_table = cassiopeia.pp.filter_molecule_table(umi_table, output_dir, plot=True)
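
Two of the filtering criteria described earlier, too few reads and conflicting allele information, can be illustrated with a few lines of pandas. The threshold `MIN_READS`, the column names, and the choice to drop every conflicting (cell, intBC) pair are assumptions made for this toy example; `filter_molecule_table` applies its own criteria and defaults.

```python
import pandas as pd

# Toy molecule table; all values are made up.
molecules = pd.DataFrame({
    "cellBC": ["cellA", "cellA", "cellA", "cellB"],
    "intBC": ["BC1", "BC1", "BC2", "BC1"],
    "allele": ["a1", "a2", "a3", "a1"],
    "UMI": ["U1", "U2", "U3", "U4"],
    "readCount": [20, 5, 1, 8],
})

MIN_READS = 3  # arbitrary threshold chosen for this toy example

# 1. Drop UMIs supported by too few reads.
kept = molecules[molecules["readCount"] >= MIN_READS]

# 2. Separate (cell, intBC) pairs that still report more than one allele.
n_alleles = kept.groupby(["cellBC", "intBC"])["allele"].transform("nunique")
conflicting = kept[n_alleles > 1]
filtered = kept[n_alleles == 1].reset_index(drop=True)

print(filtered)
```

In this example the single-read UMI `U3` is removed by the read threshold, and the `BC1` integration in `cellA` is set aside because its two UMIs disagree on the allele.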

In [11]:
allele_table = cassiopeia.pp.call_lineage_groups(umi_table, output_dir)

In [12]:
allele_table.head(5)

| | cellBC | intBC | allele | r1 | r2 | r3 | lineageGrp | UMI | readCount | Sample |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAACCTGAGGCTAGAC-1 | CCCCGTGCCTTCCT | | | | | 4 | 15 | 186.0 | AAACCTGAGGCTAGAC-1 |
| 1 | AAACCTGAGGCTAGAC-1 | CGACAATGTAGTTG | CTTTG[104:29D]TACGGGATAT[167:54D]CGGAGGATAT[16... | CTTTG[104:29D]TACGG | GATAT[167:54D]CGGAG | GATAT[167:54D]CGGAG | 4 | 26 | 326.0 | AAACCTGAGGCTAGAC-1 |
| 2 | AAACCTGAGGCTAGAC-1 | CGCGCGTCCGGGTC | CGCCG[111:1I]AAAAAACATAA[161:18D]CGTGAATTCG[No... | CGCCG[111:1I]AAAAAA | CATAA[161:18D]CGTGA | ATTCG[None]CGGAG | 4 | 13 | 104.0 | AAACCTGAGGCTAGAC-1 |
| 3 | AAACCTGAGGCTAGAC-1 | GACTTTAATGTACA | CCGAA[113:54D]CTCTGCCGAA[113:54D]CTCTGTAATT[21... | CCGAA[113:54D]CTCTG | CCGAA[113:54D]CTCTG | TAATT[219:2D]CGGAG | 4 | 18 | 187.0 | AAACCTGAGGCTAGAC-1 |
| 4 | AAACCTGAGGCTAGAC-1 | GATGGACATTGGGG | CCGAA[113:50D]ATATCCCGAA[113:50D]ATATCATTCG[No... | CCGAA[113:50D]ATATC | CCGAA[113:50D]ATATC | ATTCG[None]CGGAG | 4 | 19 | 178.0 | AAACCTGAGGCTAGAC-1 |
