scSPARKL

An Apache Spark based computational framework for the downstream analysis of scRNA-seq data.

Description

scSPARKL is a simple framework for conducting a variety of analyses on scRNA-seq data. It runs on Apache Spark, in either standalone or distributed mode, depending on the data load.

Prerequisites

The current implementation has been tested on the Microsoft Windows 11 operating system. Other important prerequisites include:
Java (latest)
Apache Spark 3.0 or later
Python 3.9 or later
Jupyter Notebook (optional)

Installation

To install Apache Spark on Windows, use either of the following links:
Spark Standalone on Windows
Apache Spark on Windows

Spark Memory allocations

Apache Spark is a distributed in-memory analytics engine. It is highly recommended to choose the number of cores carefully and to tune the memory allocations for the executors and the driver. To use Apache Spark effectively, follow the tutorial below for configuring memory and assigning executor cores:

Configure Executors and Drivers in Spark
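As a rough illustration of where these settings go, the sketch below assembles a `spark-submit` command line in Python. The flags `--master`, `--driver-memory`, `--executor-memory` and `--executor-cores` are standard `spark-submit` options; the helper name `build_spark_submit`, the script name `my_analysis.py` and every value shown are placeholders chosen for this example, not recommendations from scSPARKL.

```python
def build_spark_submit(script, driver_memory="4g", executor_memory="4g",
                       executor_cores=2, master="local[*]"):
    """Return a spark-submit command as a list of arguments.

    All defaults are placeholder values (assumptions for this sketch);
    pick real values from the memory and cores available on your machine
    or cluster.
    """
    return [
        "spark-submit",
        "--master", master,
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        script,
    ]


# "my_analysis.py" is a hypothetical script name.
cmd = build_spark_submit("my_analysis.py", driver_memory="8g")
print(" ".join(cmd))
```

When Spark runs in local mode, the work is done inside the driver process, so the driver memory setting is usually the one that matters most there.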

Implementation

Download the source code and run either the Jupyter Notebook or the scSPARKL script file.

Please note: the pipeline currently accepts `.csv` and `.parquet` files as input for the analysis. We are continuously working to support more formats.

The following are the main tasks performed by the pipeline:

Data Melting

Input data is first cleaned and melted to tall format.
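To make "melting to tall format" concrete, here is a minimal plain-Python sketch (not the Spark code used by the pipeline). It assumes a wide table with one row per gene and one column per cell; that orientation, the function name `melt` and the column names are assumptions made for this example.

```python
def melt(wide_rows, id_column="gene"):
    """Turn wide rows (one dict per gene, one key per cell) into tall records.

    Each output record holds exactly one (gene, cell, count) triple.
    """
    tall = []
    for row in wide_rows:
        gene = row[id_column]
        for cell, count in row.items():
            if cell == id_column:
                continue  # skip the identifier column itself
            tall.append({"gene": gene, "cell": cell, "count": count})
    return tall


# Toy input: two genes measured in two cells (made-up values).
wide = [
    {"gene": "GeneA", "cell_1": 5, "cell_2": 0},
    {"gene": "GeneB", "cell_1": 2, "cell_2": 7},
]
tall = melt(wide)
for record in tall:
    print(record)
```

The tall layout has a fixed, small number of columns regardless of how many cells there are, which is what makes it convenient for grouped aggregations.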

Generating Cell and Gene Quality Summaries and Filtering Unwanted Cells and Genes

Data melting is followed by generating a variety of quality summaries for genes and cells, the output of which is saved in an automatically generated analyses folder. The quality summaries are then passed as arguments to `data_filter()` to filter out the unwanted genes and cells. Default parameters can be changed by directly editing the filter package. Additionally, new columns can be added to support other filtering operations.
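The idea of summarising and then filtering can be sketched in plain Python as follows. This is not the pipeline's `data_filter()`; the function names, the two summary columns and the `min_genes` threshold are all assumptions chosen for illustration. Gene-level summaries would follow the same pattern, grouped by gene instead of by cell.

```python
def cell_summaries(tall):
    """Per cell: total counts and number of genes with a non-zero count."""
    summary = {}
    for rec in tall:
        stats = summary.setdefault(
            rec["cell"], {"total_counts": 0, "genes_detected": 0}
        )
        stats["total_counts"] += rec["count"]
        if rec["count"] > 0:
            stats["genes_detected"] += 1
    return summary


def filter_cells(tall, summary, min_genes=2):
    """Keep only records from cells with at least `min_genes` detected genes.

    `min_genes` is a hypothetical threshold, not a scSPARKL default.
    """
    keep = {c for c, s in summary.items() if s["genes_detected"] >= min_genes}
    return [rec for rec in tall if rec["cell"] in keep]


# Toy tall-format data: three genes in three cells (made-up values).
counts = {
    "GeneA": {"cell_1": 5, "cell_2": 0, "cell_3": 1},
    "GeneB": {"cell_1": 2, "cell_2": 0, "cell_3": 0},
    "GeneC": {"cell_1": 0, "cell_2": 3, "cell_3": 4},
}
tall = [
    {"gene": g, "cell": c, "count": n}
    for g, cells in counts.items()
    for c, n in cells.items()
]

summary = cell_summaries(tall)
filtered = filter_cells(tall, summary, min_genes=2)
print(summary)
print(sorted({rec["cell"] for rec in filtered}))
```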

Normalization

Two types of normalization are currently available:

Global or simple normalization, which is similar to CPM normalization.
Quantile normalization.

Normalization takes tall-formatted data and returns both a wide-formatted and a tall-formatted normalized dataset.
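The two normalization types can be illustrated with a short plain-Python sketch. It shows the textbook versions of each method, not the pipeline's Spark implementation: the function names, the `scale_factor` of one million (as in CPM) and the tie handling in the quantile step are assumptions.

```python
def global_normalize(tall, scale_factor=1_000_000):
    """CPM-style scaling: count * scale_factor / total counts of the cell.

    Assumes every cell has a non-zero total (empty cells already filtered).
    """
    totals = {}
    for rec in tall:
        totals[rec["cell"]] = totals.get(rec["cell"], 0) + rec["count"]
    return [
        {**rec, "normalized": rec["count"] * scale_factor / totals[rec["cell"]]}
        for rec in tall
    ]


def quantile_normalize(columns):
    """Give every cell the same value distribution.

    `columns` maps each cell to a list of values, one per gene, all of the
    same length. Ties are broken by position, which is a simplification.
    """
    n = len(next(iter(columns.values())))
    sorted_cols = [sorted(vals) for vals in columns.values()]
    # Mean of the i-th smallest value across all cells.
    rank_means = [
        sum(col[i] for col in sorted_cols) / len(sorted_cols) for i in range(n)
    ]
    result = {}
    for cell, vals in columns.items():
        order = sorted(range(n), key=lambda i: vals[i])
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = rank_means[rank]
        result[cell] = normalized
    return result


# Toy data (made-up values).
tall = [
    {"gene": "GeneA", "cell": "cell_1", "count": 5},
    {"gene": "GeneB", "cell": "cell_1", "count": 2},
    {"gene": "GeneC", "cell": "cell_1", "count": 3},
]
cpm = global_normalize(tall)
qn = quantile_normalize({"cell_1": [5, 2, 3], "cell_2": [4, 1, 9]})
print([rec["normalized"] for rec in cpm])
print(qn)
```

After quantile normalization each cell contains the same set of values (the rank means), assigned according to the original ordering of genes within that cell.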

Gene Selection

Two methods are available for selecting highly variable genes (HVGs):

Median Absolute Deviation (MAD). Default threshold: k > 3.
Squared coefficient of variation (CV²). Returns the top n genes.
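The following plain-Python sketch shows one plausible reading of the two selection rules. The statistic that scSPARKL scores genes with for the MAD rule is not stated here, so `select_by_mad` simply takes any per-gene score and keeps genes lying more than k MADs above the median; that interpretation, the use of population variance, and all function names are assumptions.

```python
from statistics import mean, median, pvariance


def cv_squared(values):
    """Squared coefficient of variation: variance / mean^2.

    Assumes a non-zero mean (genes with no expression already filtered).
    """
    m = mean(values)
    return pvariance(values) / (m * m)


def top_n_by_cv2(expression, n):
    """Return the n genes with the largest CV^2."""
    scores = {gene: cv_squared(vals) for gene, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]


def select_by_mad(scores, k=3.0):
    """Keep genes whose score exceeds the median by more than k MADs."""
    med = median(scores.values())
    mad = median(abs(s - med) for s in scores.values())
    if mad == 0:
        return []  # no spread in the scores, nothing stands out
    return [gene for gene, s in scores.items() if (s - med) / mad > k]


# Toy expression values per gene across four cells (made-up numbers).
expression = {
    "GeneA": [1, 1, 1, 1],
    "GeneB": [0, 0, 0, 4],
    "GeneC": [1, 2, 1, 2],
    "GeneD": [2, 2, 2, 2],
}
top_genes = top_n_by_cv2(expression, n=2)

# Toy per-gene dispersion scores (made-up numbers).
dispersion = {"GeneA": 1.0, "GeneB": 1.2, "GeneC": 0.8, "GeneD": 1.1, "GeneE": 5.0}
mad_genes = select_by_mad(dispersion, k=3.0)

print(top_genes)
print(mad_genes)
```

The MAD rule adapts its cut-off to the spread of the data, whereas the CV² rule always returns a fixed number of genes.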

Dimension Reduction using PCA

Dimension reduction is performed using PCA. The PCA implementation is exclusively Spark-based.
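For readers unfamiliar with what PCA computes, the sketch below finds the first principal component of a tiny dataset in plain Python using power iteration on the covariance matrix. It is only a conceptual illustration and not the Spark-based implementation used by the pipeline; the function name, the starting vector and the iteration count are assumptions.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (direction, scores) of the first principal component.

    `rows` is a list of samples (cells), each a list of features (genes).
    Assumes the data has non-zero variance and that the starting vector is
    not orthogonal to the leading component.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    vec = [1.0 / (j + 1) for j in range(d)]  # arbitrary starting vector
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    # Project each centered sample onto the component.
    scores = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
    return vec, scores


# Toy data lying exactly on the line y = x (made-up values).
rows = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
direction, scores = first_principal_component(rows)
print(direction)
print(scores)
```

Because the toy points lie on the line y = x, the recovered direction points along that diagonal, and the scores are the positions of the centered points along it.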

Read paper for further details

Click here for Preprint
