scSPARKL is a simple framework for conducting a variety of analyses on scRNA-seq data. It runs on Apache Spark, in either standalone or distributed mode depending on the data load.
The current implementation has been tested on the Microsoft Windows 11 operating system.
Other important prerequisites include:
Java (latest)
Apache Spark 3.0 or later
Python 3.9 or later
Jupyter Notebook (optional)
To install Apache Spark on Windows, use either of the following links:
Spark Standalone on Windows
Apache Spark on Windows
Apache Spark is a distributed in-memory analytics engine. It is highly recommended to choose the number of cores carefully and to tune the memory allocations for the executors and the driver. To make effective use of Apache Spark, follow the tutorial below for memory configuration and assigning executor cores:
Configure Executors and Drivers in Spark
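As a rough illustration of the tuning advice above, the sketch below splits a machine's cores and memory into Spark settings. The helper names `spark_resource_conf` and `build_session`, the default of 4 cores per executor, and the reserved-resource heuristic are assumptions made for this example and are not part of scSPARKL; the configuration keys themselves are standard Spark properties.

```python
def spark_resource_conf(total_cores, total_memory_gb,
                        cores_per_executor=4, driver_memory_gb=4):
    """Derive Spark settings from machine resources (a heuristic, not a tuned recipe)."""
    usable_cores = max(total_cores - 1, 1)            # leave one core for the OS
    cores = min(cores_per_executor, usable_cores)
    executors = max(usable_cores // cores, 1)
    # Reserve the driver's memory plus roughly 1 GB for the OS.
    usable_mem = max(total_memory_gb - driver_memory_gb - 1, 1)
    executor_mem = max(usable_mem // executors, 1)
    return {
        "spark.executor.cores": str(cores),
        "spark.executor.instances": str(executors),
        "spark.executor.memory": f"{executor_mem}g",
        "spark.driver.memory": f"{driver_memory_gb}g",
    }


def build_session(conf, app_name="scSPARKL-example"):
    """Create a SparkSession with the given settings (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily so the helper above runs without Spark
    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Example: a 16-core machine with 64 GB of RAM.
conf = spark_resource_conf(total_cores=16, total_memory_gb=64)
# conf["spark.executor.instances"] == "3", conf["spark.executor.memory"] == "19g"
```

These settings need to be in place before the session is created; the right values depend on the dataset and cluster, so treat the numbers as a starting point only.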
Download the source code and run either the Jupyter Notebook or the scSPARKL script file.
Please note: the pipeline currently accepts .csv and .parquet files as input for the analysis. We are continuously working to support more formats.
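A minimal sketch of how the two supported formats could be loaded with the standard Spark readers is shown below. The helper names `detect_format` and `load_counts` are hypothetical and do not describe scSPARKL's own loader; `spark` is assumed to be an existing SparkSession.

```python
import os


def detect_format(path):
    """Return 'csv' or 'parquet' based on the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in (".csv", ".parquet"):
        raise ValueError(f"unsupported input format: {ext or path}")
    return ext.lstrip(".")


def load_counts(spark, path):
    """Read a count table into a Spark DataFrame (requires a running SparkSession)."""
    if detect_format(path) == "csv":
        # header=True uses the first row as column names; inferSchema=True
        # makes Spark scan the file to guess numeric column types.
        return spark.read.csv(path, header=True, inferSchema=True)
    return spark.read.parquet(path)
```

Parquet stores the schema with the data, so no type inference pass is needed for that branch.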
The following are the main tasks performed by the pipeline:
Input data is first cleaned and melted to tall format.
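To illustrate what melting to tall format means, the sketch below builds a Spark SQL `stack` expression that turns one column per gene into one row per (cell, gene) pair. The column names `cell_id`, `gene`, and `expression`, and the assumption that cells are rows in the wide table, are hypothetical choices for this example and may differ from the pipeline's own layout.

```python
def stack_expr(value_cols, key_name="gene", value_name="expression"):
    """Build a Spark SQL stack() expression that unpivots value_cols."""
    pairs = ", ".join(f"'{c}', `{c}`" for c in value_cols)
    return f"stack({len(value_cols)}, {pairs}) as ({key_name}, {value_name})"


def melt(df, id_col="cell_id"):
    """Wide (one column per gene) to tall (one row per cell and gene)."""
    value_cols = [c for c in df.columns if c != id_col]
    return df.selectExpr(f"`{id_col}`", stack_expr(value_cols))


# The generated expression for a two-gene table:
expr = stack_expr(["GeneA", "GeneB"])
# expr == "stack(2, 'GeneA', `GeneA`, 'GeneB', `GeneB`) as (gene, expression)"
```

`stack` expects all value columns to share one type, and gene names containing quotes or backticks would need escaping. On Spark 3.4 or later, `DataFrame.unpivot` offers the same reshaping without building a SQL string.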
Data melting is followed by generating a variety of quality summaries for genes and cells; the output is saved in an automatically generated analyses folder.
The quality summaries are then passed as arguments to data_filter() to filter out unwanted genes and cells. Default parameters can be changed by directly editing the filter package.
Additionally, new columns can be added to support other filtering operations.
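The summarize-then-filter logic can be sketched in plain Python on a handful of tall-format records. This is only an illustration of the idea: the function names, the two summaries (genes detected per cell, cells expressing each gene), and the thresholds `min_genes` and `min_cells` are assumptions, and none of it reflects the actual signature or defaults of `data_filter()`. In Spark the same steps would typically be expressed as grouped aggregations followed by joins.

```python
from collections import defaultdict


def quality_summaries(records):
    """records: iterable of (cell_id, gene, count) tuples in tall format."""
    genes_per_cell = defaultdict(int)
    cells_per_gene = defaultdict(int)
    for cell, gene, count in records:
        if count > 0:                      # only detected genes contribute
            genes_per_cell[cell] += 1
            cells_per_gene[gene] += 1
    return dict(genes_per_cell), dict(cells_per_gene)


def filter_records(records, genes_per_cell, cells_per_gene,
                   min_genes=2, min_cells=2):
    """Keep records whose cell and gene both pass the (hypothetical) thresholds."""
    keep_cells = {c for c, n in genes_per_cell.items() if n >= min_genes}
    keep_genes = {g for g, n in cells_per_gene.items() if n >= min_cells}
    return [r for r in records if r[0] in keep_cells and r[1] in keep_genes]


records = [
    ("c1", "g1", 5), ("c1", "g2", 1), ("c1", "g3", 0),
    ("c2", "g1", 3), ("c2", "g2", 0), ("c2", "g3", 0),
    ("c3", "g1", 2), ("c3", "g2", 4), ("c3", "g3", 0),
]
cell_qc, gene_qc = quality_summaries(records)
filtered = filter_records(records, cell_qc, gene_qc)
```

Here cell c2 is dropped because it expresses only one gene, and gene g3 is dropped because no cell expresses it.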
Normalization is currently available in two types:
- Global or simple normalization, which is similar to CPM normalization.
- Quantile normalization.
Normalization takes tall-formatted data and returns both wide-formatted and tall-formatted normalized data.
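The two normalization types can be illustrated on a tiny in-memory matrix. The sketch below is plain Python written for this example, not the pipeline's Spark implementation; the function names, the `{cell_id: [values per gene]}` layout, and the tie handling in the quantile step (ties are ordered by position rather than averaged) are assumptions.

```python
def global_normalize(matrix, scale=1_000_000):
    """Scale every cell to the same total; scale=1e6 gives CPM-like values."""
    out = {}
    for cell, counts in matrix.items():
        total = sum(counts)
        out[cell] = [c * scale / total if total else 0.0 for c in counts]
    return out


def quantile_normalize(matrix):
    """Give every cell the same distribution: the mean of the sorted cells."""
    cells = list(matrix)
    n_genes = len(matrix[cells[0]])
    sorted_cols = [sorted(matrix[c]) for c in cells]
    # Mean value at each rank across all cells.
    rank_means = [sum(col[i] for col in sorted_cols) / len(cells)
                  for i in range(n_genes)]
    out = {}
    for c in cells:
        # Gene indices ordered from lowest to highest value in this cell.
        order = sorted(range(n_genes), key=lambda i: matrix[c][i])
        normalized = [0.0] * n_genes
        for rank, gene_idx in enumerate(order):
            normalized[gene_idx] = rank_means[rank]
        out[c] = normalized
    return out


matrix = {"c1": [5, 2, 3], "c2": [4, 1, 7]}
scaled = global_normalize(matrix, scale=10)
quantiled = quantile_normalize(matrix)
# quantiled == {"c1": [6.0, 1.5, 3.5], "c2": [3.5, 1.5, 6.0]}
```

After quantile normalization both cells contain the same set of values (1.5, 3.5, 6.0), assigned according to each gene's rank within its cell.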
Two methods are available for highly variable gene (HVG) selection:
- Median Absolute Deviation (MAD). Default threshold: k > 3.
- Coefficient of Variation squared. Returns the top n genes.
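The two selection rules can be sketched in plain Python as follows. How scSPARKL applies the MAD threshold is not spelled out above, so this example assumes one common reading: a gene is kept when its variability score lies more than k MADs above the median score. The function names and the use of the squared coefficient of variation as that score are likewise assumptions made for illustration.

```python
import statistics


def cv_squared(values):
    """Squared coefficient of variation: variance divided by squared mean."""
    mean = statistics.fmean(values)
    return statistics.pvariance(values) / mean ** 2 if mean else 0.0


def top_n_by_cv2(expr, n):
    """expr: {gene: [normalized values across cells]}; return the n most variable genes."""
    ranked = sorted(expr, key=lambda g: cv_squared(expr[g]), reverse=True)
    return ranked[:n]


def mad_outliers(scores, k=3.0):
    """Keep genes whose score lies more than k MADs above the median score."""
    med = statistics.median(scores.values())
    mad = statistics.median(abs(s - med) for s in scores.values())
    if mad == 0:
        return []                          # no spread, nothing stands out
    return [g for g, s in scores.items() if (s - med) / mad > k]


expr = {
    "g1": [1, 1, 1, 1],   # constant, CV^2 = 0
    "g2": [0, 0, 0, 4],   # expressed in one cell only, CV^2 = 3
    "g3": [1, 2, 1, 2],   # mild variation
}
scores = {g: cv_squared(v) for g, v in expr.items()}
top2 = top_n_by_cv2(expr, 2)       # ["g2", "g3"]
hvg_mad = mad_outliers(scores)     # ["g2"]
```

The top-n rule always returns a fixed number of genes, while the MAD rule returns however many genes exceed the threshold, which may be none.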
Dimension reduction is performed using PCA. The PCA implementation is exclusively Spark-based.
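For readers unfamiliar with PCA in Spark, the sketch below shows one way to run it with the `pyspark.ml.feature` API on a wide-format DataFrame. It is not scSPARKL's own code: the function names, the `cell_id` column, and the choice of 50 components are assumptions for this example.

```python
def feature_columns(columns, id_col="cell_id"):
    """All columns except the cell identifier are treated as gene features."""
    return [c for c in columns if c != id_col]


def run_pca(wide_df, k=50, id_col="cell_id"):
    """Project cells onto the first k principal components (requires pyspark)."""
    from pyspark.ml.feature import PCA, VectorAssembler  # lazy import

    # Spark ML estimators expect a single vector column as input.
    assembler = VectorAssembler(
        inputCols=feature_columns(wide_df.columns, id_col),
        outputCol="features",
    )
    assembled = assembler.transform(wide_df)

    model = PCA(k=k, inputCol="features", outputCol="pca_features").fit(assembled)
    projected = model.transform(assembled).select(id_col, "pca_features")
    # explainedVariance holds the fraction of variance captured by each component.
    return projected, model.explainedVariance
```

Running PCA on the highly variable genes only, rather than on every gene, keeps the feature vector small.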