vignettes/articles/cmd_qc.Rmd

---
title: "Generation of comprehensive quality control metrics with SCTK"
output: html_document
---

```{r setup, include=FALSE}
require(Biobase)

knitr::opts_chunk$set(warning = FALSE)
pkgVersion <- package.version("singleCellTK")
```

# Introduction

This pipeline will import data from single-cell preprocessing algorithms (e.g. [CellRanger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger), [HCA Optimus](https://data.humancellatlas.org/pipelines/optimus-workflow), [Alevin](https://salmon.readthedocs.io/en/latest/alevin.html)), generate various quality control metrics (e.g. general metrics, doublet scores, contamination estimates) using multiple tools, and output results in standard data containers (e.g. [SingleCellExperiment](https://rdrr.io/bioc/SingleCellExperiment/man/SingleCellExperiment.html), [Seurat object](https://satijalab.org/seurat/index.html), [AnnData](https://github.com/theislab/anndata)).

For data generated with microfluidic devices, the first major step after UMI counting is to detect cell barcodes that represent droplets containing a true cell and exclude empty droplets that only contain ambient RNA.

-   We use the terms **"Droplet" matrix** to denote a count matrix that still contains empty droplets;
-   **"Cell" matrix** to denote a count matrix of cells where empty droplets have been excluded but no other filtering has been performed;
-   And **"FilteredCell" matrix** to indicate a count matrix where poor quality cells have also been excluded.

The Droplet and Cell matrices have also been called "raw" and "filtered" matrices, respectively, by tools such as CellRanger. However, using the term "filtered" can be ambiguous as other forms of cell filtering can be applied beyond empty droplets (e.g. excluding poor-quality cells based on low number of UMIs). Both the original droplet matrix and the filtered cell matrix can be QC'ed in this pipeline. However, QC of the droplet matrix is specific for single cell data generated from microfluidic devices (e.g. 10X).

To run the pipeline, users can [install the singleCellTK (SCTK) package](installation.html) along with Python dependencies that may be potentially used (mainly [Scrublet](https://github.com/swolock/scrublet) and [AnnData](https://github.com/theislab/anndata)). Alternatively, users can run the Docker version of the pipeline which is described in detail further down in this page.

# Running the pipeline

## Running SCTK-QC with SCTK local installation

To run the pipeline script, users will need to download the `SCTK_runQC.R` script file [here](https://github.com/compbiomed/singleCellTK/blob/master/exec/SCTK_runQC.R). Also, the tutorial of installing singleCellTK package can found in [Installation](installation.html) page.

A simple example to run this pipeline on the 'Cell' matrix generated by Cellranger V3 is shown below:

```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-b /base/path \
-P CellRangerV3 \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F SCE,AnnData,FlatFile,Seurat \
-g /Path_to_gmt/name_of_gmt_file.gmt \
-d Cell \
-n 2 \
-T MulticoreParam 
```

This pipeline enables different ways to import CellrangerV2/CellrangerV3 data for flexibility. Also, the pipeline is compatible with datasets generated by other algorithms. Please refer to the section [*Importing data from different preprocessing tools*](#importing-data-from-different-preprocessing-tools) for more details.

User can quantify expression of mitochondrial genes by passing a GMT files containing mitochondrial genes (with `-g/--gmt` argument). User can also quantify expression of mitochondrial genes for human or mouse dataset by setting `-M/--detectMitoLevel` argument. Please refer to the section [*Gene sets*](#gene-sets) for more details.

Besides, the pipeline contains various parameters to control the process of quality control. Please refer to the section [*Parameters*](#parameters) for more details.

## Running SCTK-QC with Docker

### Installing docker

If you have not used docker before, you can follow the instruction to install and set up docker in [Windows](https://docs.docker.com/desktop/windows/install/), [Mac](https://docs.docker.com/desktop/mac/install/) or [Linux](https://runnable.com/docker/install-docker-on-linux).

### Running SCTK-QC pipeline using docker image

The Docker image can be obtained by running:

```{r, echo=FALSE, eval=TRUE, include=TRUE, comment=NA}
code <- paste0('docker pull campbio/sctk_qc:', pkgVersion)
cat(code)
```

The usage of each argument is the same as running command line analysis. Here is an example code to perform QC on CellRangerV3 data with SCTK docker:

```{r, echo=FALSE, eval=TRUE, include=TRUE, comment=NA}
code <- paste("docker run --rm -v /path/to/data:/SCTK_docker \\",
  paste0('  -it campbio/sctk_qc:', pkgVersion, " \\"),
  "  -b /SCTK_docker/cellranger_folder \\
  -P CellRangerV3 \\
  -s SampleName \\
  -o /SCTK_docker/Output_Directory \\
  -g /SCTK_docker/name_of_gmt_file.gmt \\
  -S TRUE \\
  -F SCE,AnnData,FlatFile,Seurat", sep='\n'
)
cat(code)
```

The docker image will not access files in your host file system by default. To get access to the files on your machine, you can properly set up a mount volume. Noted that the transcriptome data and GMT file needed to be accessible to the container through mounted volume. In the above example, mount volume is enabled for accessing input and output directory using argument `-v`. The transcriptome and GMT files stored in the path `/path/to/data` of your machine file system is now available in `/SCTK_docker` folder inside the docker. To learn more about mounted volumes, please check out [this post](https://docs.docker.com/storage/volumes/).

Please refer to the section [*Parameters*](#parameters) for more details about parameters.

### Running SCTK-QC pipeline docker image with singularity

Users who have not used [Singularity](https://sylabs.io/guides/2.6/user-guide/introduction.html) before can install it following the instruction [here](https://sylabs.io/guides/3.1/user-guide/installation.html). The Singularity image for SCTK-QC can be easily built using Docker Hub as a source:

```{r, echo=FALSE, eval=TRUE, include=TRUE, comment=NA}
code <- paste0('singularity pull docker://campbio/sctk_qc:', pkgVersion)
cat(code)
```

The usage of singleCellTK Singularity image is very similar to that of Docker. In Singularity 3.0+, the mount volume is [automatically overlaid](https://sylabs.io/guides/3.1/user-guide/bind_paths_and_mounts.html).

It's recommended to re-set the home directory when you run singularity. Singularity will mount `\$HOME` path on your file system by default, which might contain your personal R/Python library folder. If we don't re-set the home to mount, singularity will try to use R/Python libraries which are not built within the singularity image and cause some conflicts. You can point to some "sanitized home", which is different from `\$HOME` path on your machine, using argument `-H`/`--home` [(see more information)](https://sylabs.io/guides/3.1/user-guide/bind_paths_and_mounts.html). Or you can disable the `\$HOME` binding by setting the argument `--no-home`. Besides, you can use argument `--bind`/`-B` to specify your own mount volume, which is the path that contains the dataset and will be used to store the output of QC pipeline. The example is shown as below:

```{r, echo=FALSE, eval=TRUE, include=TRUE, comment=NA}
code <- paste("singularity run --home=/PathToSanitizedHome \\",
  paste0('  --bind /PathToData:/data sctk_qc_', pkgVersion, ".sif \\"),
  "  -b /SCTK_docker/cellranger_folder \\
  -P CellRangerV3 \\
  -s SampleName \\
  -o /SCTK_docker/Output_Directory \\
  -g /SCTK_docker/name_of_gmt_file.gmt \\
  -S TRUE \\
  -F SCE,AnnData,FlatFile,Seurat", sep='\n'
)
cat(code)
```

### Important note about docker image

One important note about this docker image: please run the docker image on a machine / node which has a **CPU** with the following architecture: **broadwell, haswell, skylake, cascadelake or the latest architecture**. This can avoid having the "illegal operation" issue from Scrublet package, because this Python package are compiled by SIMD instructions that are compatible with these CPU architectures. Please specify the CPU architecture, at the script header after `#$ -l cpu_arch=`, as one of the following: `broadwell`, `haswell`, `skylake`, `cascadelake` or latest architecture. One of the example is shown below:

```{r, echo=FALSE, eval=TRUE, include=TRUE, comment=NA}
code <- paste("#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe omp 16
#$ -l cpu_arch=broadwell|haswell|skylake|cascadelake

singularity run --home=/PathToSanitizedHome \ ### this also works for 'docker run'",
  paste0('  --bind /PathToData:/data sctk_qc_', pkgVersion, ".sif \\"),
  "  -b /SCTK_docker/cellranger_folder \\
  -P CellRangerV3 \\
  -s SampleName \\
  -o /SCTK_docker/Output_Directory \\
  -g /SCTK_docker/name_of_gmt_file.gmt \\
  -S TRUE \\
  -F SCE,AnnData,FlatFile,Seurat", sep='\n'
)
cat(code)
```

# Parameters {#parameters}

## Table of Parameters

The pipeline contains various parameters to control the process of quality control. The function of each parameter is shown below:

### Required arguments

The required arguments are as follows:

| Parameter                             | Description                                                                                                                                                                                                                                                                                                                       |
|---------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `-b`, `--basePath` (**required**)     | Base path for the output from the preprocessing algorithm.                                                                                                                                                                                                                                                                        |
| `-P`, `--preproc` (**required**)      | Algorithm used for preprocessing. One of `CellRangerV2`, `CellRangerV3`, `BUStools`, `STARSolo`, `SEQC`, `Optimus`, `DropEst`, `SceRDS`, `Seurat`, `CountMatrix` and `AnnData`.                                                                                                                                                   |
| `-s`, `--sample` (**required**)       | Name of the sample. This will be prepended to the cell barcodes.                                                                                                                                                                                                                                                                  |
| `-o`, `--directory` (**required**)    | Output directory. A new subdirectory will be created with the name "sample". R, Python, and FlatFile directories will be created under the "sample" directory containing the data containers with QC metrics. Default `.`. More information about output directory structure is explained in [*Outputs*](#outputs) section below. |
| `-F`, `--outputFormat` (**required**) | The output format of this QC pipeline. Currently, it supports `SCE`, `Seurat`, `FlatFile`, `AnnData` and `HTAN` (manifest files that meets HTAN requirement).                                                                                                                                                                     |
| `-S`, `--splitSample` (**required**)  | Save a `SingleCellExperiment` object for each sample. Default is `TRUE`. If `FALSE`, the data of all samples will be combined into one `SingleCellExperiment` object and this object will be output.                                                                                                                              |

### Batch processing

When running the pipeline with more than one sample to process, care should be exercised with the -S flag, and whether or not one wishes to combine the samples.

In addition, the following parameters will need multiple inputs, separated by a **comma** **without any spacing**:

| Parameter          | Description                                                                                                                                                      |
|--------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `-b`, `--basePath` | Base path for the output from the preprocessing algorithm; one is required for each sample, even if they are in the same location.                               |
| `-P`, `--preproc`  | In multiple object mode, each entry should have a preprocessing algorithm associated with it; the pipeline supports serial processing of different sample types. |
| `-s`, `--sample`   | Names of the samples.                                                                                                                                            |

An example execution is shown below:

```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-b /base/path1,/base/path2 \
-P Preprocessing_Algorithm_1,Preprocess_Algorithm_2 \
-s SampleName1,SampleName2 \
-o Output_Directory \
-S FALSE \
-F SCE,AnnData,FlatFile \
-d Droplet \
-D TRUE \
```

### Optional arguments

The optional arguments are as follows. Their usage depend on type of data and user-defined behaviour.

| Parameter                                       | Description                                                                                                                                                                                                                                                                                                                                                                      |
|-------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `-g`, `--gmt`                                   | GMT file containing gene sets for quality control.                                                                                                                                                                                                                                                                                                                               |
| `-t`, `--delim` (required when -g is specified) | Delimiter used in GMT file. Default `"\t"`.                                                                                                                                                                                                                                                                                                                                      |
| `-G`, `--genome`                                | The name of genome reference. This is only required for CellRangerV2 data.                                                                                                                                                                                                                                                                                                       |
| `-y`, `--yaml`                                  | YAML file used to specify parameters of QC functions called by SCTK-QC pipeline. Please check [*Specify parameters using yaml file*](#specify-parameters-using-yaml-file) section for details.                                                                                                                                                                                   |
| `-c`, `--cellData`                              | The full path of the RDS, H5AD or Matrix file of the cell matrix. This would be use only when `--preproc` is `SceRDS`, `Seurat`, `AnnData`, or `CountMatrix`.                                                                                                                                                                                                                    |
| `-r`, `--rawData`                               | The full path of the RDS file or Matrix file of the droplet matrix. This would be provided only when `--preproc` is `SceRDS`, `Seurat`, `AnnData`, or `CountMatrix`.                                                                                                                                                                                                             |
| `-C`, `--cellPath`                              | The directory containing `matrix.mtx.gz`, `features.tsv.gz` and `barcodes.tsv.gz` files originally generated by 10x CellrangerV2 or CellrangerV3 (files in the `filtered_feature_bc_matrix` directory). This argument only works when `--preproc` is `CellRangerV2` or `CellRangerV3`. Default is `NULL`. If `base_path` is `NULL`, `cellPath` or `rawPath` should be specified. |
| `-R`, `--rawPath`                               | The directory containing `matrix.mtx.gz`, `features.tsv.gz` and `barcodes.tsv.gz` files originally generated by 10x CellrangerV2 or CellrangerV3 (files in the `raw_feature_bc_matrix` directory). This argument only works when `--preproc` is `CellRangerV2` or `CellRangerV3`. Default is `NULL`. If `base_path` is `NULL`, `cellPath` or `rawPath` should be specified.      |
| `-d`, `--dataType`                              | Type of data as input. Default is `Both`, which means taking both droplet and cell matrix as input. If set as `Droplet`, it will only processes droplet data. If set as `Cell`, it will only processes cell data.                                                                                                                                                                |
| `-D`, `--detectCells`                           | Detect cells from droplet matrix. Default is `FALSE`. This argument is only evaluated when `-d` is `Droplet`. If set as `TRUE`, cells will be detected and cell matrix will be subset from the droplet matrix. Also, QC will be performed on the detected cell matrix.                                                                                                           |
| `-m`, `--cellDetectMethod`                      | Methods to detect cells from droplet matrix. Default is `EmptyDrops`. This argument is only evaluated when `-D` is `TRUE`. Other options could be `Knee` or `Inflection`. More information is provided in the [*Droplet QC* documentation](cnsl_dropletqc.html).                                                                                                                 |
| `-n`, `--numCores`                              | Number of cores used to run the pipeline. By default is `1`. Parallel computing is enabled if `-n` is greater than 1.                                                                                                                                                                                                                                                            |
| `-T`, `--parallelType`                          | Type of parallel computing used for parallel computing. Parallel computing used in this pipeline depends on [BiocParallel](https://bioconductor.org/packages/release/bioc/html/BiocParallel.html) package. Default is `MulticoreParam`. It can be `MulticoreParam` or `SnowParam`. This argument will be evaluated only when `--numCores` is greater than 1.                     |
| `-i`, `--studyDesign`                           | The TXT file containing the description of the study design. Default is `NULL`. This would be shown at the beginning of the HTML report of cell and droplet QC.                                                                                                                                                                                                                  |
| `-L`, `--subTitle`                              | The subtitle used in the cell and droplet QC HTML report. Default is `None`. The subtitle can contain information of the sample, like sample name, etc. If `-S` is set as `TRUE`, the length of subsitle should be the same as the number of samples. If `-S` is set as `FALSE`, the length of subtitle should be one or NULL.                                                   |
| `-M`, `--detectMitoLevel`                       | Detect mitochondrial gene expression level. If `TRUE`, the pipeline will examine mitochondrial gene expression level automatically without the need of importing user defined GMT file. Default is `TRUE`.                                                                                                                                                                       |
| `-E`, `--mitoType`                              | Type of mitochondrial gene-set to be loaded when `--detectMitoLevel` is set to `TRUE`. Possible choices are: `human-ensembl`, `human-symbol`, `human-entrez`, `human-ensemblTranscriptID`, `mouse-ensembl`, `mouse-symbol`, `mouse-entrez` and `mouse-ensemblTranscriptID`.                                                                                                      |

## Specify parameters using YAML file {#specify-parameters-using-yaml-file}

Users can specify parameters for QC algorithms in this pipeline with a YAML file (supplied with `-y`/`--yamlFile` argument). The current supported QC algorithms including doublet detection (`bcds`, `cxds`, `cxds_bcds_hybrid`, `doubletFinder`, `doubletCells` and `scrublet`), decontamination (`decontX`), emptyDrop detection (`emptyDrops`) and barcodeRankDrops (`barcodeRanks`). A summary of each function is shown below:

![QC yaml parameters](qc_yamlParameters.png)

An example of QC parameters YAML file is shown below:

```         
---
Params:   ### should not be omitted
  bcds:
    ntop: 600

  cxds:
    ntop: 600

  cxds_bcds_hybrid:
    nTop: 600

  decontX:
    maxIter: 600

  emptyDrops:
    lower: 50
    niters: 5000
    testAmbient: True

  barcodeRanks:
    lower: 50
```

The format of YAML file can be found [here](https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html). The parameters should be consistent with the parameters of each QC function in SCTK. Parameters that are not defined in this YAML file will use the default value. Please refer [reference](../reference/index.html#quality-control-preprocessing) for detailed information about arguments of each QC function.

## Parallel computing

SCTK-QC pipeline enables parallel computing to speed up the analysis. Parallel computing is enabled by setting `-n`/`--numCores` greater than 1. The `-n`/`--numCores` is used to set the number of cores used for the pipeline.

The backend of parallel computing is supported by [BiocParallel](https://bioconductor.org/packages/release/bioc/html/BiocParallel.html) package. Therefore, users can select different types of parallel evaluation by setting `-T`/`--parallelType` argument. Default is `MulticoreParam`. Currently, `MulticoreParam` and `SnowParam` are supported for `-T` argument. However, `MulticoreParam` is not supported by Windows system. Windows user can choose `SnowParam` as the backend of parallel computing.

# Description of input data

## QC on combinations of droplet and cell matrix

`-d` argument is used to specify which type of count matrix is used in the pipeline. Default is `Both`, which means the pipeline will run quality control on both droplet and cell count data.

Users can also choose to run SCTK-QC pipeline on only the droplet count or cell count matrix, instead of running on both. In this case, the pipeline will only take the single input and perform QC on it. An example is shown below:

```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-b /base/path \
-P Preprocessing_Algorithm \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F SCE,AnnData,FlatFile,Seurat \
-d Droplet \
-D TRUE \
-m EmptyDrops
```

If `-d` argument is set as `Droplet`, the QC pipeline will only take droplet count matrix as input and perform quality control. You can choose whether to detect cells from the droplet matrix by setting `-D` as `TRUE`. If yes, cell count matrix will be detected and the pipeline will also perform quality control on this matrix and output the result. You could further define the method used to detect cells from droplet matrix by setting `-m` argument. `-m` could be one of `EmptyDrops`, `Knee` or `Inflection`. `EmptyDrops` will keep cells that pass the `runEmptyDrops()` function test. `Knee` and `Inflection` will keep cells that pass the knee or inflection point returned from `runBarcodeRankDrops()` function.

If `-d` argument is set as `Cell`, the QC pipeline will only take cell count matrix as input and perform quality control. A figure showing the analysis steps and outputs of different inputs is shown below:

![QC single input](qc_singleInput.png)\

## Importing data from different preprocessing tools {#importing-data-from-different-preprocessing-tools}

### Import data from Cellranger

This pipeline enables different ways to import CellrangerV2/CellrangerV3 data for flexibility.

1.  If the cellranger data set is saved in the default [cellranger output directory](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/overview), you can load the data by running following code:

For CellRangerV3:

```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-b /base/path \
-P CellRangerV3 \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F SCE,AnnData,FlatFile,Seurat
```

For CellRangerV2, the reference used by cellranger needs to be specified by `-G`/`--genome`:

```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-b /base/path \
-P CellRangerV2 \
-s SampleName \
-o Output_Directory \
-S TRUE \
-G GenomeName \
-F SCE,AnnData,FlatFile,Seurat
```

`-b` specifies the base path and usually it's the output folder of 10x `cellranger-count`. `-s` specifies the sample name, which has to be the same as the name of the sample folder under the base folder. The folder layout would look like the following:

```         
├── BasePath
└── SampleName 
    ├── outs
    |   ├── filtered_feature_bc_matrix
    |   |   ├── barcodes.tsv.gz
    |   |   ├── features.tsv.gz
    |   |   └── matrix.mtx.gz
    |   ├── raw_feature_bc_matrix
    |   |   ├── barcodes.tsv.gz
    |   |   ├── features.tsv.gz
    |   |   └── matrix.mtx.gz
        ...
```

2.  If the `cellranger-count` output have been moved out of the default [cellranger output directory](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/overview), you can specified the path to droplet and cell count matrix using arguments `-R` and `-C`:

```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-P CellRangerV2 \
-C /path/to/cell/matrix \
-R /path/to/droplet/matrix \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F SCE,AnnData,FlatFile,Seurat
```

In this case, you **must** skip `-b` arguments and you can also skip `-G` argument for CellRangerV2 data.

### Import data from RDS, H5AD or matrix stored in a text file

1.  If your data in stored as a `SingleCellExperiment` object in RDS file, singleCellTK also supports this type of input. To run quality control with RDS file as input, run the following code:

```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-P SceRDS \
-s Samplename \
-o Output_Directory \
-S TRUE \
-F SCE,AnnData,FlatFile,Seurat \
-r /path/to/rds/file/droplet.RDS \
-c /path/to/rds/file/cell.RDS
```

2.  If your input is stored as a `Seurat` object in RDS file, you may use the following code:

```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-P Seurat \
-s Samplename \
-o Output_Directory \
-S TRUE \
-F SCE,AnnData,FlatFile,Seurat \
-r /path/to/rds/file/droplet.RDS \
-c /path/to/rds/file/cell.RDS
```

**Warning:** Both `SingleCellExperiment` and `Seurat` objects use the same RDS format. Be sure which one your data is formatted as or the pipeline may not successfully run.

3.  If your input as stored as an `AnnData` object in H5AD file,

```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-P AnnData \
-s Samplename \
-o Output_Directory \
-S TRUE \
-F SCE,AnnData,FlatFile,Seurat \
-r /path/to/rds/file/droplet.RDS \
-c /path/to/rds/file/cell.RDS
```

4.  If your input is stored in TXT file as a matrix, which has barcodes as column names and genes as row names, run the following code to start the quality control pipeline:

```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-P CountMatrix \
-s Samplename \
-o Output_Directory \
-S TRUE \
-F SCE,AnnData,FlatFile,Seurat \
-r /path/to/matrix/file/droplet.txt \
-c path/to/matrix/file/cell.txt
```

### Methods to run pipeline on data set generated by other algorithms

SCTK-QC pipeline allows importing data from the following pre-processing tools or objects:

-   [CellRanger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger)
-   [Optimus](https://data.humancellatlas.org/pipelines/optimus-workflow)
-   [DropEst](https://github.com/hms-dbmi/dropEst)
-   [BUStools](https://github.com/BUStools/bustools)
-   [Seqc](https://github.com/ambrosejcarr/seqc)
-   [STARSolo](https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md)
-   [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) object saved in RDS file
-   [AnnData](https://github.com/theislab/anndata) object saved in HDF5 file

If your data is preprocessed by other algorithms, you might want to make sure the '-b' argument matches the path storing the data and the '-P' argument matches the right preprocessing tools. Basically, the templated is shown below:

```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-b /base/path \
-P Preprocessing_Algorithm \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F SCE,AnnData,FlatFile,Seurat
```

The following table describes how SCTK expects the inputs to be structured and passed for each import function. In all cases, SCTK retains the standard output directory structure from upstream tools. All the import functions return the imported counts matrix as an `assay` in a `SingleCellExperiment` object, with associated information in respective `colData`, `rowData`, `reducedDims`, and `metadata` fields.

![QC data format](qc_inputShell.png)\

The table above also shows the R console functions for the QC algorithms. Detailed information about function parameters and defaults are available in the [*Reference*](../reference/index.html) section.

## Gene sets {#gene-sets}

Quantifying the level of gene sets can be useful quality control. For example, the percentage of counts from mitochondrial genes can be an indicator of cell stress or death.

User can quantify the expression of mitochondrial genes for human or mouse dataset by setting `-M`/`--detectMitoLevel` as `TRUE`. User needs to specify the correct `--mitoType` argument for the dataset. The SCTK-QC pipeline has built-in mitochondrial gene sets for human and mouse genes. It supports four different type of gene id: `gene symbol`, `entrez ID`, `ensembl ID` and `ensembl transcript ID`. Therefore, there are eight options for `--mitoType` arguments: `human-ensembl`, `human-symbol`, `human-entrez`, `human-ensemblTranscriptID`, `mouse-ensembl`, `mouse-symbol`, `mouse-entrez` and `mouse-ensemblTranscriptID`.

To quantify expression of other gene set, users can pass a [GMT](https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29) file (with `-g`/`--gmt` argument) to the pipeline with one row for each gene set. The first column should be the name of the gene set (e.g. `mito`).

The second column for each gene set in the GMT file (i.e. the description) should contain the location of where to look for the matching IDs in the data. If set to `rownames`, then the gene set IDs will be matched with the row IDs of the data matrix. If a character string or an integer index is supplied, then gene set IDs will be matched to the IDs in that column of feature table. Gene sets with mitochondrial genes can be found [here](https://github.com/compbiomed/singleCellTK/tree/master/exec).

# Outputs {#outputs}

The output directory is created under the path specified by `-o`/`--directory` argument. Each sample is stored in the subdirectory (named by `-s`/`--sample` argument) within this output direcotry. Within each sample directory, each output format will be separated into subdirectories. The output file hierarchy is shown below:

```         
(root; output directory)
├── level3Meta.csv
├── level4Meta.csv
├── sample1_dropletQC.html
├── sample1_cellQC.html
└── sample1 
    ├──sample1_cellQC_summary.csv
    ├── R
    |   ├── sample1_Droplets.rds
    |   └── sample1_Cells.rds
    ├── Python
    |   ├── Droplets
    |   |   └── sample1.h5ad
    |   └── Cells
    |       └── sample1.h5ad
    ├── FlatFile
    |   ├── Droplets
    |   |   ├── assays
    |   |   |   └── sample1_counts.mtx.gz
    |   |   ├── metadata
    |   |   |   └── sample1_metadata.rds
    |   |   ├── sample1_colData.txt.gz
    |   |   └── sample1_rowData.txt.gz 
    |   └── Cells
    |       ├── assays
    |       |   └── sample1_counts.mtx.gz
    |       ├── metadata
    |       |   └── sample1_metadata.rds
    |       ├── reducedDims
    |       |   ├──sample1_decontX_UMAP.txt.gz
    |       |   ├──sample1_scrublet_TSNE.txt.gz
    |       |   └──sample1_scrublet_UMAP.txt.gz
    |       ├── sample1_colData.txt.gz
    |       └── sample1_rowData.txt.gz 
    └── sample1_QCparameters.yaml
```

# Documentation of available tools

### Empty droplet detection:

-   [emptyDrops](https://rdrr.io/github/MarioniLab/DropletUtils/man/emptyDrops.html) from the package [DropletUtils](https://bioconductor.org/packages/release/bioc/html/DropletUtils.html)
-   [barcodeRanks](https://rdrr.io/github/MarioniLab/DropletUtils/man/barcodeRanks.html) from the package [DropletUtils](https://bioconductor.org/packages/release/bioc/html/DropletUtils.html)

### Doublet Detection

-   [scDblFinder](https://rdrr.io/bioc/scDblFinder/man/scDblFinder.html) from the package [scDblFinder](https://bioconductor.org/packages/release/bioc/html/scDblFinder.html)
-   [cxds](https://rdrr.io/bioc/scds/man/cxds.html), [bcds](https://rdrr.io/bioc/scds/man/bcds.html), and [cxds_bcds_hybrid](https://rdrr.io/bioc/scds/man/cxds_bcds_hybrid.html) from the package [scds](http://bioconductor.org/packages/release/bioc/html/scds.html)
-   [doubletFinder](https://rdrr.io/github/chris-mcginnis-ucsf/DoubletFinder/man/doubletFinder.html) from the package [DoubletFinder](https://github.com/chris-mcginnis-ucsf/DoubletFinder)
-   [Scrublet](https://bioconda.github.io/recipes/scrublet/README.html) from the package [scrublet](https://github.com/allonkleinlab/scrublet)

### Ambient RNA detection

-   [decontX](https://rdrr.io/bioc/celda/man/decontX.html) from the package [celda](https://bioconductor.org/packages/release/bioc/html/celda.html)
-   [SoupX](https://github.com/constantAmateur/SoupX) from package [SoupX](https://github.com/constantAmateur/SoupX)