**Compare the outputs of Cell Ranger ARC and STARsolo for OSD/GLDS-352 samples.**


**Cell Ranger ARC command:**
- Used Cell Ranger ARC 2.0.2 for data processing

```bash
cellranger-arc count --jobmode=local \
    --localcores=32 \
    --localmem=115 \
    --id=${SAMPLE} \
    --reference=/cellranger/refdata-cellranger-arc-mm10-2020-A-2.0.0 \
    --libraries=${SAMPLE}_info.csv
```
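
The `--libraries` argument points to a CSV file listing the FASTQ inputs for the sample. As a minimal sketch, such a file could be generated as shown here. The three-column header (`fastqs`, `sample`, `library_type`) follows the Cell Ranger ARC convention; the helper name `libraries_csv`, the directories, and the FASTQ prefixes are hypothetical placeholders, not the values used for this dataset.

```python
import csv
import io


def libraries_csv(rows):
    """Render a libraries CSV from (fastq_dir, fastq_prefix, library_type) rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["fastqs", "sample", "library_type"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical paths and FASTQ prefixes, for illustration only.
example_csv = libraries_csv([
    ("/path/to/gex_fastqs", "CF2_GEX", "Gene Expression"),
    ("/path/to/atac_fastqs", "CF2_ATAC", "Chromatin Accessibility"),
])
print(example_csv)
```

Writing the returned text to `${SAMPLE}_info.csv` would give a file of the shape the command above expects.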



**STARsolo command:** 
- Used STAR 2.7.10a for data processing

```bash
STAR --runThreadN 18 \
	--genomeDir $genomeDir \
	--soloType CB_UMI_Simple \
	--clipAdapterType CellRanger4 \
	--outFilterScoreMin 30 \
	--soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts \
	--soloUMIfiltering MultiGeneUMI_CR \
	--soloUMIdedup 1MM_CR \
	--soloUMIlen 12 \
	--soloCellFilter EmptyDrops_CR $expectedCells 0.99 10 45000 90000 500 0.01 20000 0.01 10000 \
	--soloMultiMappers EM \
	--outSAMattributes NH HI nM AS CR UR GX GN sS sQ sM \
	--outSAMtype BAM Unsorted \
	--soloFeatures Gene GeneFull SJ Velocyto \
	--readFilesCommand zcat \
	--soloCBwhitelist $whitelist \
	--outFileNamePrefix $outDir/${sample}/${sample}_ \
	--readFilesIn $fastqDir/${sample}_R2_raw.fastq.gz $fastqDir/${sample}_R1_raw.fastq.gz
```
        
Run time: 47 minutes

**All plots are shown in pairs: Cell Ranger ARC first, STARsolo second.**

### Set count data paths:

In [None]:
sample='CF2'

cr_counts='./OSD-352_GLDS-352_outputs/' + sample + '/GL_CRA-filtered'

ss_counts='./OSD-352_GLDS-352_outputs/' + sample + '/GL_SS-filtered'

### Import libraries

In [None]:
import scanpy as sc

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
sc.settings.verbosity = 0             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')

### Import data

#### Extract Cell Ranger ARC (cr) data

In [None]:
cr = sc.read_10x_mtx(
    cr_counts,  
    var_names='gene_symbols',                
    cache=False)    

#### Extract STARsolo data

In [None]:
ss = sc.read_10x_mtx(
    ss_counts,  
    var_names='gene_symbols',                
    cache=False) 

In [None]:
cr

In [None]:
ss

-------

*Takeaway:* The two pipelines report different numbers of genes and different numbers of cells.

-------
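
To put numbers on this difference, one option is to count how many barcodes and genes the two outputs share. The sketch below uses only plain Python sets; `strip_gem_suffix` and `overlap_summary` are hypothetical helper names, and the barcodes are toy values. It assumes, as the subsetting step later in this notebook does, that Cell Ranger ARC barcodes carry a `-1` suffix that STARsolo barcodes lack.

```python
def strip_gem_suffix(barcode):
    """Drop a trailing '-<digits>' suffix, e.g. 'ACGT-1' -> 'ACGT'."""
    head, sep, tail = barcode.rpartition("-")
    return head if sep and tail.isdigit() else barcode


def overlap_summary(names_a, names_b):
    """Count labels unique to each collection, shared labels, and Jaccard index."""
    a, b = set(names_a), set(names_b)
    shared, union = a & b, a | b
    return {
        "only_a": len(a - b),
        "only_b": len(b - a),
        "shared": len(shared),
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Toy barcodes, not taken from the dataset.
cr_barcodes = ["AAAC-1", "AAAG-1", "AAAT-1"]
ss_barcodes = ["AAAC", "AAAT", "CCCC"]
summary = overlap_summary(map(strip_gem_suffix, cr_barcodes), ss_barcodes)
print(summary)
```

In this notebook the same helpers could be applied as `overlap_summary(map(strip_gem_suffix, cr.obs_names), ss.obs_names)` for cells and `overlap_summary(cr.var_names, ss.var_names)` for genes.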

#### Plot highest expressed genes

In [None]:
sc.pl.highest_expr_genes(cr, n_top=20, )

In [None]:
sc.pl.highest_expr_genes(ss, n_top=20, )

----

*Takeaway:* The most highly expressed genes differ between the two datasets.

----
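
`sc.pl.highest_expr_genes` ranks genes by their mean fraction of counts per cell. The sketch below reproduces that ranking on toy data so the two top-gene lists can be compared directly instead of by eye. It is an illustrative reimplementation in plain Python, not scanpy's code; the function name `top_genes_by_mean_fraction` and the count matrices are hypothetical, and on real data the same calculation would be done on the sparse matrices.

```python
def top_genes_by_mean_fraction(counts, genes, n_top):
    """Rank genes by the mean, over cells, of their fraction of each cell's counts.

    counts is a list of per-cell count lists (cells x genes).
    """
    n_cells = len(counts)
    mean_fraction = [0.0] * len(genes)
    for row in counts:
        total = sum(row)
        if total == 0:
            continue  # an empty cell contributes a fraction of zero for every gene
        for j, count in enumerate(row):
            mean_fraction[j] += count / total / n_cells
    ranked = sorted(zip(genes, mean_fraction), key=lambda pair: pair[1], reverse=True)
    return [gene for gene, _ in ranked[:n_top]]


# Toy matrices (2 cells x 3 genes), not taken from the dataset.
genes = ["GeneA", "GeneB", "GeneC"]
cr_top = top_genes_by_mean_fraction([[8, 1, 1], [6, 3, 1]], genes, n_top=2)
ss_top = top_genes_by_mean_fraction([[1, 1, 8], [5, 1, 4]], genes, n_top=2)
shared_top = sorted(set(cr_top) & set(ss_top))
print(cr_top, ss_top, shared_top)
```

The size of `shared_top` relative to `n_top` gives a simple measure of how much the two top-gene lists agree.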

#### Dimensionality Reduction

In [None]:
sc.tl.pca(cr, svd_solver='arpack')
sc.tl.pca(ss, svd_solver='arpack')

In [None]:
sc.pl.pca_variance_ratio(cr, log=True)

In [None]:
sc.pl.pca_variance_ratio(ss, log=True)

----

*Takeaway:* The variance explained per principal component differs between the two datasets.

----

In [None]:
sc.pp.neighbors(cr, n_neighbors=10, n_pcs=40)
sc.pp.neighbors(ss, n_neighbors=10, n_pcs=40)

sc.tl.leiden(cr)
sc.tl.leiden(ss)

sc.tl.umap(cr)
sc.tl.umap(ss)

In [None]:
sc.pl.umap(cr, color='leiden')

In [None]:
sc.pl.umap(ss, color='leiden')

----

*Takeaway:* The number of clusters and the cluster assignments differ between the two datasets.

----


#### Subset CellRanger to only the barcodes kept in STARsolo.

In [None]:
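# Cell Ranger ARC barcodes end in '-1' while STARsolo barcodes have no suffix,
# so append '-1' to each STARsolo barcode to match the Cell Ranger ARC names.
keep = [barcode + '-1' for barcode in ss.obs.index]
len(keep)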

In [None]:
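# subset CR; .copy() returns a standalone AnnData instead of a view,
# so the PCA, neighbors, Leiden, and UMAP steps below can write their results to it
cr_sub = cr[keep].copy()
cr_sub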

In [None]:
# rename the STARsolo barcodes with the '-1' suffix so that rows align by barcode in the correlation
ss.obs.index = keep

In [None]:
sc.tl.pca(cr_sub, svd_solver='arpack')
sc.pp.neighbors(cr_sub, n_neighbors=10, n_pcs=40)
sc.tl.leiden(cr_sub)
sc.tl.umap(cr_sub)

In [None]:
sc.pl.umap(cr_sub, color='leiden')

In [None]:
sc.pl.umap(ss, color='leiden')

----

*Takeaway:* Subsetting Cell Ranger ARC to the barcodes kept by STARsolo does not bring the two clusterings into closer agreement.

----
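
Agreement between two clusterings of the same cells can be quantified with the adjusted Rand index (ARI), which is 1 for identical partitions and close to 0 for unrelated ones. The sketch below is a plain-Python version of the standard formula, given here as one possible way to back the visual comparison with a number; the function name and toy labels are hypothetical. `sklearn.metrics.adjusted_rand_score` is the usual library alternative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items, in the same order."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of items grouped together in both, in the first, and in the second labeling.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every item in one cluster in both labelings
    return (together_both - expected) / (maximum - expected)


# Toy labels: the same partition under different cluster names, then a crossed one.
same = adjusted_rand_index(["0", "0", "1", "1"], ["b", "b", "a", "a"])
crossed = adjusted_rand_index(["0", "0", "1", "1"], ["0", "1", "0", "1"])
print(same, crossed)
```

Because `cr_sub` and `ss` hold the same barcodes in the same order at this point, the call would be `adjusted_rand_index(list(cr_sub.obs['leiden']), list(ss.obs['leiden']))`.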

#### Calculate pairwise correlation between the two datasets (cell-wise).

In [None]:
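# per-cell Spearman correlation: axis=1 correlates each row (barcode) of cr_sub
# with the row of ss that has the same barcode, using the genes present in both
corr = cr_sub.to_df().corrwith(other=ss.to_df(), axis=1, method='spearman')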
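
To make the per-cell quantity concrete, the sketch below computes a Spearman correlation for one pair of expression vectors in plain Python: each vector is converted to ranks (tied values share their average rank) and the Pearson correlation of the ranks is returned. Ties matter here because most entries of a single-cell count vector are zero. This is an illustrative reimplementation with hypothetical function names and toy vectors, not the routine pandas uses internally.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN when either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy count vectors for one cell over the same four genes.
identical = spearman([0, 0, 3, 5], [0, 0, 3, 5])
reversed_order = spearman([1, 2, 3], [3, 2, 1])
print(identical, reversed_order)
```

A cell whose counts are constant across the shared genes in either dataset has no defined correlation, which is why the sketch returns NaN in that case.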

In [None]:
corr.max()

In [None]:
corr.min()

In [None]:
plt.hist(corr)

----

*Takeaway:* For most cells, the expression profiles from the two pipelines are poorly correlated.

----
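
The histogram can be backed by a few summary numbers. The sketch below reports the median correlation and the fraction of cells above a cutoff, ignoring undefined (NaN) values. The helper name, the cutoff of 0.5, and the input values are all hypothetical choices for illustration, not results from this dataset.

```python
import math
import statistics


def summarize_correlations(values, threshold=0.5):
    """Summarize per-cell correlations, skipping NaN entries."""
    values = list(values)
    finite = [v for v in values if not math.isnan(v)]
    if not finite:
        raise ValueError("no defined correlation values to summarize")
    return {
        "n_cells": len(finite),
        "n_undefined": len(values) - len(finite),
        "median": statistics.median(finite),
        "fraction_above_threshold": sum(v > threshold for v in finite) / len(finite),
    }


# Toy values, not taken from the dataset.
toy_summary = summarize_correlations([0.1, 0.2, float("nan"), 0.9, -0.05])
print(toy_summary)
```

In this notebook the call would be `summarize_correlations(corr.tolist())`.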