scSPARKL

An Apache Spark based computational framework for the downstream analysis of scRNA-seq data.

Description

scSPARKL is a simple framework for conducting a variety of analyses on scRNA-seq data. It runs on Apache Spark, either in standalone or distributed mode (depending on the data load).

Prerequisites

The current implementation has been tested on Microsoft Windows 11. Other important prerequisites include:

  • Java (latest)
  • Apache Spark 3.0 or later
  • Python 3.9 or later
  • Jupyter Notebook (optional)

Installation

To install Apache Spark on Windows, use either of the following guides:
Spark Standalone on Windows
Apache Spark on Windows

Spark Memory allocations

Apache Spark is a distributed in-memory analytics engine. It is highly recommended to determine the number of cores and tune the memory allocations for the executors and the driver. To utilize the power of Apache Spark effectively, follow the tutorial below for configuring memory and assigning executor cores:

Configure Executors and Drivers in Spark
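As a sketch, the core and memory settings can be supplied when building the Spark session. The values below are placeholders, not repository defaults; size them to your machine's cores and RAM:

```python
from pyspark.sql import SparkSession

# Illustrative values only -- tune to your hardware.
spark = (
    SparkSession.builder
    .appName("scSPARKL")
    .master("local[8]")                       # number of local cores to use
    .config("spark.driver.memory", "8g")      # must be set before the JVM starts
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)
```

The same settings can equivalently be passed as `spark-submit` flags (`--master`, `--driver-memory`, `--executor-memory`, `--executor-cores`).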

Implementation

Download the source code and run either the Jupyter Notebook or the scSPARKL script file.

Please note: the pipeline currently accepts .csv and .parquet input files for analysis. We are continuously working to support more formats.

The following are the main tasks performed by the pipeline:

Data Melting

Input data is first cleaned and melted into a tall (long) format.
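Conceptually, melting turns the wide cell × gene matrix into (cell, gene, count) triples. A minimal pure-Python sketch of that transformation (the pipeline itself does this on Spark DataFrames; the function name and sample data are illustrative):

```python
def melt(wide_rows, gene_names):
    """Turn wide rows of (cell_id, [counts...]) into tall (cell, gene, count) triples."""
    tall = []
    for cell_id, counts in wide_rows:
        for gene, count in zip(gene_names, counts):
            tall.append((cell_id, gene, count))
    return tall

genes = ["GeneA", "GeneB"]
wide = [("cell_1", [5, 0]), ("cell_2", [2, 7])]
print(melt(wide, genes))
# [('cell_1', 'GeneA', 5), ('cell_1', 'GeneB', 0), ('cell_2', 'GeneA', 2), ('cell_2', 'GeneB', 7)]
```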

Generate cell and gene quality summaries; filter unwanted cells and genes

Data melting is followed by generating a variety of quality summaries for genes and cells, the output of which is saved in an automatically generated analyses folder. The quality summaries are then passed as arguments to data_filter() to filter out unwanted genes and cells. Default parameters can be changed by directly editing the filter package. Additional columns can also be added to support other filtering operations.
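The idea behind the summaries and filtering can be sketched in plain Python. The function names and thresholds below are illustrative, not the pipeline's data_filter() API:

```python
def cell_summaries(tall):
    """Per-cell QC summary: total counts and number of genes detected (count > 0)."""
    summary = {}
    for cell, gene, count in tall:
        total, detected = summary.get(cell, (0, 0))
        summary[cell] = (total + count, detected + (1 if count > 0 else 0))
    return summary

def filter_cells(summary, min_total=500, min_genes=200):
    """Keep cells passing both thresholds (threshold values are illustrative)."""
    return {cell for cell, (total, detected) in summary.items()
            if total >= min_total and detected >= min_genes}

tall = [("c1", "g1", 600), ("c1", "g2", 0), ("c2", "g1", 100)]
qc = cell_summaries(tall)            # {'c1': (600, 1), 'c2': (100, 1)}
print(filter_cells(qc, min_total=500, min_genes=1))  # {'c1'}
```

Analogous per-gene summaries (e.g. number of cells expressing each gene) can be built the same way over the tall triples.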

Normalization

Normalization currently supports two types:

  • Global or simple normalization, which is similar to CPM normalization.
  • Quantile normalization.

Normalization takes tall-formatted data and returns both a wide-formatted and a tall-formatted normalized dataset.
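A worked sketch of CPM-style global normalization on tall-formatted triples (plain Python for illustration; the pipeline performs this step on Spark):

```python
def cpm(tall):
    """Counts-per-million: scale each cell's counts so they sum to one million."""
    totals = {}
    for cell, gene, count in tall:
        totals[cell] = totals.get(cell, 0) + count
    return [(cell, gene, count / totals[cell] * 1_000_000)
            for cell, gene, count in tall]

tall = [("c1", "g1", 3), ("c1", "g2", 1)]
print(cpm(tall))  # [('c1', 'g1', 750000.0), ('c1', 'g2', 250000.0)]
```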

Gene Selection

We provide two methods for HVG/gene selection:

  • Median Absolute Deviation (MAD), with default threshold k > 3.
  • Squared coefficient of variation (CV²), returning the top n genes.
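The two selection rules can be sketched as follows (plain Python; the dispersion scores and sample values are illustrative, only the k > 3 default mirrors the text above):

```python
from statistics import median

def select_by_mad(gene_scores, k=3.0):
    """Keep genes whose dispersion score lies more than k MADs above the median."""
    scores = list(gene_scores.values())
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    return [g for g, s in gene_scores.items() if s - med > k * mad]

def top_n_by_cv2(gene_stats, n):
    """Rank genes by squared coefficient of variation (variance / mean^2), keep top n."""
    ranked = sorted(gene_stats, key=lambda t: t[2] / t[1] ** 2, reverse=True)
    return [gene for gene, mean, var in ranked[:n]]

print(select_by_mad({"g1": 1.0, "g2": 1.1, "g3": 0.9, "g4": 10.0}))  # ['g4']
print(top_n_by_cv2([("g1", 1.0, 1.0), ("g2", 2.0, 1.0)], 1))         # ['g1']
```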

Dimension Reduction using PCA

Dimension reduction is performed using PCA. The PCA implementation is exclusively Spark based.
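The underlying computation can be illustrated with NumPy (the pipeline's own PCA runs on Spark; this sketch only shows the centering-and-SVD math):

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project rows of X (cells x genes) onto the top principal components."""
    Xc = X - X.mean(axis=0)                          # center each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T                  # per-cell component scores

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(pca_scores(X, 1).shape)  # (3, 1)
```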

Read the paper for further details:

Click here for Preprint
