## ddqc Tutorial

In [1]:
import pegasus as pg

### Reading the data
For this tutorial, we will use the Human Bone Marrow count matrix dataset provided by pegasus.
You can download it at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip.

We will load the data using the standard `read_input` pegasus function.

In [2]:
data = pg.read_input("/Users/michaelalperovich/Downloads/MantonBM_nonmix_subset.zarr.zip")
data

2021-03-26 20:41:47,825 - pegasusio.readwrite - INFO - zarr file '/Users/michaelalperovich/Downloads/MantonBM_nonmix_subset.zarr.zip' is loaded.
2021-03-26 20:41:47,826 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.39s.


MultimodalData object with 1 UnimodalData: 'GRCh38-rna'
    It currently binds to UnimodalData object GRCh38-rna

UnimodalData object with n_obs x n_vars = 48219 x 36601
    Genome: GRCh38; Modality: rna
    It contains 1 matrix: 'X'
    It currently binds to matrix 'X' as X

    obs: 'n_genes', 'Channel'
    var: 'featureid'
    obsm: 
    varm: 
    uns: 'genome', 'modality'

### Filtering step
ddqc performs adaptive quality control by clustering cells and picking a separate threshold for each cluster. The procedure is as follows:
- Initial Quality Control (QC) is performed, in which obvious low-quality cells are removed. By default, those are cells with n_genes < 100 or percent_mito > 80.
- The cells are clustered with a clustering resolution of 1.3 (default).
- Then thresholds are picked for each cluster. By default, the following metrics are considered:
    - **Number of counts/UMIs**: keep cells that have n_counts greater than median - 2 Median Absolute Deviations (MADs)
    - **Number of genes**: keep cells that have n_genes greater than median - 2 MADs
    - **Percent of mitochondrial transcripts**: keep cells that have percent_mito less than median + 2 MADs
- To prevent the removal of healthy cells in clusters with high median n_genes and low percent_mito, there are additional bounds for those thresholds:
    - The cluster-level threshold for n_genes can't be greater than 200 (default). If it is greater, it will be set to 200.
    - The cluster-level threshold for percent_mito can't be lower than 10 (default). If it is lower, it will be set to 10.
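The threshold rules above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not ddqc's actual implementation: the helper names (`mad`, `lower_cutoff`, `upper_cutoff`) and the example values are hypothetical, and the MAD is assumed to be the unscaled median of absolute deviations from the median.

```python
from statistics import median

def mad(values):
    """Unscaled median absolute deviation (assumption: no scaling constant is applied)."""
    med = median(values)
    return median(abs(v - med) for v in values)

def lower_cutoff(values, threshold=2.0, upper_limit=None):
    """median - threshold * MAD, optionally capped so it never exceeds upper_limit."""
    cutoff = median(values) - threshold * mad(values)
    if upper_limit is not None:
        cutoff = min(cutoff, upper_limit)
    return cutoff

def upper_cutoff(values, threshold=2.0, lower_limit=None):
    """median + threshold * MAD, optionally raised so it never falls below lower_limit."""
    cutoff = median(values) + threshold * mad(values)
    if lower_limit is not None:
        cutoff = max(cutoff, lower_limit)
    return cutoff

# Hypothetical cluster with a high median n_genes and a low percent_mito
n_genes = [900, 1000, 1100, 1200, 1300]
percent_mito = [1, 2, 3, 4, 5]

raw_genes_co = lower_cutoff(n_genes)                       # median 1100, MAD 100
bounded_genes_co = lower_cutoff(n_genes, upper_limit=200)  # capped at 200
raw_mito_co = upper_cutoff(percent_mito)                   # median 3, MAD 1
bounded_mito_co = upper_cutoff(percent_mito, lower_limit=10)  # raised to 10

print(raw_genes_co, bounded_genes_co, raw_mito_co, bounded_mito_co)
```

In this toy cluster the raw n_genes cutoff would remove every cell with fewer than 900 genes, while the bounded cutoff only removes cells below 200, which is the situation the additional bounds are meant to handle.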

To perform ddqc on a dataset, we first need to generate QC metrics. We can do this with the `ddqc_metrics` method, here using the default settings. Setting `return_df_qc=True` returns a dataframe showing, for each removed cell, which metric and which threshold caused it to be filtered out.

In [3]:
df_qc = pg.ddqc_metrics(data, return_df_qc=True)
data

  ribo_genes = data.var_names.map(lambda x: re.match(ribo_prefix, x, flags=re.IGNORECASE) is not None).values.nonzero()[0] # get all genes that match the pattern


2021-03-26 20:41:51,059 - pegasusio.qc_utils - INFO - After filtration, 48210 out of 48219 cell barcodes are kept in UnimodalData object GRCh38-rna.
2021-03-26 20:41:51,059 - pegasus.tools.preprocessing - INFO - Function 'filter_data' finished in 0.33s.
2021-03-26 20:41:51,812 - pegasus.tools.preprocessing - INFO - After filtration, 25910/36601 genes are kept. Among 25910 genes, 16967 genes are robust.
2021-03-26 20:41:51,813 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.75s.
2021-03-26 20:41:52,595 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.77s.
2021-03-26 20:41:52,718 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.12s.
2021-03-26 20:41:52,761 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2021-03-26 20:41:52,762 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.17s.
2021-03-26 20:42:05,628 -

If you want to customize the filtering, you can use the following parameters:
- `res`: float - clustering resolution (default: 1.3)
- `method`: string - statistic on which the threshold would be calculated (default: "mad", available options "mad", "outlier")
- `threshold`: float - parameter for the selected method (default: 2)
- `basic_n_genes`: int - parameter for the initial QC n_genes filtering (default: 100)
- `basic_percent_mito`: float - parameter for the initial QC percent_mito filtering (default: 80)
- `mito_prefix`: string - gene prefix used to calculate percent_mito in a cell (default: "MT-")
- `ribo_prefix`: string - gene regular expression used to calculate percent_ribo in a cell (default: "^RP[SL][[:digit:]]|^RPLP[[:digit:]]|^RPSA")
- `do_counts`: bool - whether to consider n_counts for ddqc (default: True)
- `do_genes`: bool - whether to consider n_genes for ddqc (default: True)
- `do_mito`: bool - whether to consider percent_mito for ddqc (default: True)
- `do_ribo`: bool - whether to consider percent_ribo for ddqc (default: False)
- `n_genes_lower_bound`: int - bound for lower n_genes cluster-level threshold (default: 200)
- `percent_mito_upper_bound`: float - bound for upper percent_mito cluster-level threshold (default: 10)
- `random_state`: int - random seed for clustering results reproducibility (default: 29)
- `return_df_qc`: bool - whether to return a dataframe showing, for each removed cell, which metric and which threshold caused it to be removed (default: False)
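A note on `ribo_prefix`: the default listed above is written with POSIX bracket classes (`[[:digit:]]`), which Python's built-in `re` module does not interpret as "any digit". The sketch below shows a Python-compatible pattern that is assumed to express the same intent, using `\d` in place of `[[:digit:]]`. The names `PY_RIBO_PATTERN` and `is_ribo_gene` and the example gene list are illustrative and not part of ddqc.

```python
import re

# Assumption: a Python-compatible rewrite of the documented default pattern,
# with \d standing in for the POSIX class [[:digit:]].
PY_RIBO_PATTERN = r"^RP[SL]\d|^RPLP\d|^RPSA"

def is_ribo_gene(gene_name, pattern=PY_RIBO_PATTERN):
    """Return True if the gene name matches the ribosomal pattern (case-insensitive)."""
    return re.match(pattern, gene_name, flags=re.IGNORECASE) is not None

# Illustrative gene names: the first four are expected to match, the rest are not
genes = ["RPS6", "RPL13A", "RPLP0", "RPSA", "MT-CO1", "RPN1", "ACTB"]
ribo = [g for g in genes if is_ribo_gene(g)]
print(ribo)
```

If you rely on percent_ribo (`do_ribo=True`), it may be worth checking which genes your pattern actually selects before passing it as `ribo_prefix`.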

If you set `return_df_qc=True`, the function will return a pandas dataframe containing the following information for each cell:
- `cluster_labels`: cluster from the initial clustering performed by ddqc
- `metric_lower_co` and `metric_upper_co`: lower and upper cutoffs for each metric on which ddqc was performed. If ddqc was not performed for the upper or lower end of a metric, the corresponding field will be `None`

This information is useful if you want to understand which metric caused a cell to be filtered out.
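As a rough illustration of how these cutoffs can be read, the sketch below checks one cell's metric values against its `metric_lower_co` / `metric_upper_co` fields. The helper `failed_metrics` and the cell values are hypothetical, and treating the comparisons as strict inequalities is an assumption; the cutoff values are taken from the first row of the table shown in this tutorial.

```python
def failed_metrics(values, cutoffs):
    """List the metrics on which a cell falls outside its cluster-level cutoffs.

    values:  dict mapping metric name -> value for one cell
    cutoffs: dict with '<metric>_lower_co' / '<metric>_upper_co' entries;
             a missing cutoff is represented as None
    """
    failed = []
    for metric, value in values.items():
        lower = cutoffs.get(f"{metric}_lower_co")
        upper = cutoffs.get(f"{metric}_upper_co")
        # Assumption: a cell fails when it is strictly below the lower cutoff
        # or strictly above the upper cutoff.
        if lower is not None and value < lower:
            failed.append(metric)
        elif upper is not None and value > upper:
            failed.append(metric)
    return failed

# Cutoffs copied from the first row of the df_qc table (cluster 6)
cutoffs = {
    "n_counts_lower_co": 1359.2792,
    "n_counts_upper_co": None,
    "n_genes_lower_co": 200,
    "n_genes_upper_co": None,
    "percent_mito_lower_co": None,
    "percent_mito_upper_co": 4.729336,
}

# Hypothetical metric values for a single cell
cell = {"n_counts": 1500, "n_genes": 150, "percent_mito": 6.0}

result = failed_metrics(cell, cutoffs)
print(result)
```

Here the hypothetical cell passes the n_counts cutoff but fails on both n_genes and percent_mito.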

In [6]:
df_qc

| barcodekey | cluster_labels | n_counts_lower_co | n_counts_upper_co | n_genes_lower_co | n_genes_upper_co | percent_mito_lower_co | percent_mito_upper_co |
|---|---|---|---|---|---|---|---|
| MantonBM1_HiSeq_1-AAACCTGAGCAGGTCA | 6 | 1359.2792 | | 200 | | | 4.729336 |
| MantonBM1_HiSeq_1-AAACCTGCACACTGCG | 3 | -464.3688 | | 111.2732 | | | 8.954169 |
| MantonBM1_HiSeq_1-AAACCTGCACCGGAAA | 1 | 731.1666 | | 200 | | | 4.225153 |
| MantonBM1_HiSeq_1-AAACCTGCATAGACTC | 4 | 729.4072 | | 200 | | | 6.374291 |
| MantonBM1_HiSeq_1-AAACCTGCATCGATGT | 13 | -335.4844 | | 77.3428 | | | 5.007331 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| MantonBM8_HiSeq_1-TTTGTCAGTCCGACGT | 1 | 731.1666 | | 200 | | | 4.225153 |
| MantonBM8_HiSeq_1-TTTGTCAGTCGCGGTT | 3 | -464.3688 | | 111.2732 | | | 8.954169 |
| MantonBM8_HiSeq_1-TTTGTCAGTGGTGTAG | 3 | -464.3688 | | 111.2732 | | | 8.954169 |
| MantonBM8_HiSeq_1-TTTGTCAGTTTACTCT | 21 | -1336.0876 | | -67.1496 | | | 4.206936 |


You can then generate QC plots as described in the pegasus tutorial. When you are done and want to proceed to downstream analysis, run the following command:

In [7]:
pg.filter_data(data)

2021-03-26 20:50:56,927 - pegasusio.qc_utils - INFO - After filtration, 43384 out of 48210 cell barcodes are kept in UnimodalData object GRCh38-rna.
2021-03-26 20:50:56,928 - pegasus.tools.preprocessing - INFO - Function 'filter_data' finished in 0.57s.
