## To-do:
- Implement violin plot
- Potentially vectorize selection and projection of cells

## Updates
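- The expression matrix can now be passed as a scipy sparse matrix, which substantially speeds up the correlation and differential expression calculations (in the example runs below, correlations on roughly 5,400 cells took about 0.3 s with sparse input versus about 16.5 s with dense input)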
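
As a rough illustration of why sparse input can help, the sketch below computes per-gene sums on a small synthetic matrix in both representations and checks that the results agree. The matrix, its size and density, and the `timed` helper are made up for this example and are not part of umap_explorer; actual timings depend heavily on matrix size and sparsity.

```python
import time

import numpy as np
import scipy.sparse as sparse

# Synthetic stand-in for an expression matrix (cells x genes), about 5% nonzero.
expr_sparse = sparse.random(2000, 500, density=0.05, format="csc", random_state=0)
expr_dense = expr_sparse.toarray()


def timed(label, fn):
    """Run fn(), print the elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    result = fn()
    print(f"{time.perf_counter() - start:.4f} seconds for {label}")
    return result


# The sparse sum only visits stored (nonzero) entries; it returns a 1 x genes
# matrix, so flatten it to a plain 1-D array.
sums_sparse = np.asarray(
    timed("sparse gene sums", lambda: expr_sparse.sum(axis=0))
).ravel()

# The dense sum visits every entry, zeros included.
sums_dense = timed("dense gene sums", lambda: expr_dense.sum(axis=0))

print("results agree:", np.allclose(sums_sparse, sums_dense))
```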

## Installation:
Currently:
conda env create -n py5coding -f http://py5.ixora.io/install/py5_environment.yml
conda activate py5coding
conda install scipy
python setup.py install  # run from within the embedview directory

In [1]:
import pandas as pd
import numpy as np
from scipy.io import mmread
import time
import scipy.sparse as sparse

In [2]:
%load_ext py5
%gui osx
%load_ext autoreload
%autoreload 2

In [7]:
print('DOWNLOADING AND EXTRACTING EXAMPLE DATA')
! mkdir -p ../data
! wget https://storage.googleapis.com/sabeti-public/dkotliar/scnavigator/pbmc68k_Tcell/data/Tcell50K_expression_log2TP10K_20210409.barcodes.tsv -O ../data/Tcell50K_expression_log2TP10K_20210409.barcodes.tsv
! wget https://storage.googleapis.com/sabeti-public/dkotliar/scnavigator/pbmc68k_Tcell/data/Tcell50K_expression_log2TP10K_20210409.genes.tsv -O ../data/Tcell50K_expression_log2TP10K_20210409.genes.tsv
! wget https://storage.googleapis.com/sabeti-public/dkotliar/scnavigator/pbmc68k_Tcell/data/Tcell50K_expression_log2TP10K_20210409.umap.tsv -O ../data/Tcell50K_expression_log2TP10K_20210409.umap.tsv
! wget https://storage.googleapis.com/sabeti-public/dkotliar/scnavigator/pbmc68k_Tcell/data/Tcell50K_expression_log2TP10K_20210409.mtx -O ../data/Tcell50K_expression_log2TP10K_20210409.mtx

DOWNLOADING AND EXTRACTING EXAMPLE DATA
--2021-04-09 20:32:16--  https://storage.googleapis.com/sabeti-public/dkotliar/scnavigator/pbmc68k_Tcell/data/Tcell50K_expression_log2TP10K_20210409.umap.tsv
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.7.16, 172.217.10.48, 172.217.10.80, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.7.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1948398 (1.9M) [text/tab-separated-values]
Saving to: ‘../data/Tcell50K_expression_log2TP10K_20210409.umap.tsv’


2021-04-09 20:32:16 (9.74 MB/s) - ‘../data/Tcell50K_expression_log2TP10K_20210409.umap.tsv’ saved [1948398/1948398]



In [3]:
print("LOADING UMAP DATA...")

_umap = pd.read_csv('../data/Tcell50K_expression_log2TP10K_20210409.umap.tsv', sep='\t', index_col=0)
_umap.head()

LOADING UMAP DATA...


| | UMAP_1 | UMAP_2 |
|---|---|---|
| AAACATACACCCAA-1 | 3.975246 | 10.370767 |
| AAACATACCCCTCA-1 | 9.388674 | 1.431675 |
| AAACATACCGGAGA-1 | 12.206055 | 11.943375 |
| AAACATACTCTTCA-1 | 15.312049 | -2.373958 |
| AAACATACTGGATC-1 | 10.571509 | -6.149192 |


In [4]:
print("LOADING GENE EXPRESSION DATA...")

_expr_sparse = mmread('../data/Tcell50K_expression_log2TP10K_20210409.mtx')
_expr_sparse = sparse.csc_matrix(_expr_sparse)
_expr_sparse

LOADING GENE EXPRESSION DATA...


<52899x12563 sparse matrix of type '<class 'numpy.float64'>'
	with 28209469 stored elements in Compressed Sparse Column format>
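
For readers unfamiliar with the Compressed Sparse Column layout named in the output above, here is a minimal sketch on a toy 4 x 3 matrix. The toy values are invented for illustration; the point is only that each column (gene) occupies one contiguous slice of the stored data, which makes per-gene access cheap.

```python
import numpy as np
import scipy.sparse as sparse

# Toy matrix: 4 cells x 3 genes in coordinate (COO) form.
values = np.array([1.0, 2.0, 3.0])
rows = np.array([0, 2, 3])
cols = np.array([0, 0, 2])
coo = sparse.coo_matrix((values, (rows, cols)), shape=(4, 3))

# Same conversion as in the cell above.
csc = sparse.csc_matrix(coo)

# In CSC, the stored values of column j are data[indptr[j]:indptr[j + 1]]
# and their row numbers are indices[indptr[j]:indptr[j + 1]].
j = 0
start, stop = csc.indptr[j], csc.indptr[j + 1]
gene_values = csc.data[start:stop]
gene_rows = csc.indices[start:stop]

print("column pointers:", csc.indptr)
print("gene 0 values:", gene_values, "in cells", gene_rows)
```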

In [5]:
genes = list(pd.read_csv('../data/Tcell50K_expression_log2TP10K_20210409.genes.tsv', sep='\t', header=None)[0])
barcodes = list(pd.read_csv('../data/Tcell50K_expression_log2TP10K_20210409.barcodes.tsv', sep='\t', header=None)[0])

In [22]:
import sys
sys.path.append('../umap_explorer/')  # make the local umap_explorer module importable
from umap_explorer import UMAPexplorer

# A sparse matrix carries no labels, so gene and cell names are passed explicitly.
test = UMAPexplorer(_umap, _expr_sparse, gene_names=genes,
                    cell_names=barcodes)
test.explore_data()

0.44756007194519043 seconds to select and project cells
Selected 366 cells
Calculating correlations...
1.2087547779083252 seconds to calculate correlations. Sparsity:  True
0.9712028503417969 seconds to select and project cells
Selected 5378 cells
Calculating correlations...
0.3305070400238037 seconds to calculate correlations. Sparsity:  True
Selected mode 2
0.40462327003479004 seconds to select and project cells
Selected 369 cells
Calculating differential expression...
0.025264978408813477 seconds to calculate genesums. Sparsity:  True
0.8906641006469727 seconds to calculate squared genesums. Sparsity:  True
2.331149101257324 seconds to calculate differential expression. Sparsity:  True
0.9409699440002441 seconds to select and project cells
Selected 5482 cells
Calculating differential expression...
0.7621569633483887 seconds to calculate differential expression. Sparsity:  True
2.619197130203247 seconds to select and project cells
Selected 23363 cells
Calculating differential expression...

In [23]:
test.significant_genes.head()

| | R | P |
|---|---|---|
| S100A4 | -205.189633 | 0.0 |
| IL32 | -170.047202 | 0.0 |
| CCL5 | -168.797463 | 0.0 |
| NKG7 | -139.371245 | 0.0 |
| B2M | -138.914805 | 0.0 |
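
The log output above mentions gene sums and squared gene sums, which suggests that group means and standard deviations are derived from running totals rather than by slicing the matrix twice. The sketch below shows one way that idea could work on a sparse matrix, using a Welch t-test. It is an assumption about the approach, not the UMAPexplorer implementation: the function name `diff_expression`, the choice of Welch's test, and the toy data are all hypothetical.

```python
import numpy as np
import scipy.sparse as sparse
from scipy import stats


def diff_expression(expr, selected):
    """Welch t-test per gene: selected cells versus all remaining cells.

    expr: scipy sparse matrix (cells x genes); selected: boolean mask over cells.
    """
    expr = sparse.csr_matrix(expr)
    n_sel = int(selected.sum())
    n_rem = expr.shape[0] - n_sel

    # Totals over all cells; these could be computed once and reused.
    gene_sum = np.asarray(expr.sum(axis=0)).ravel()
    gene_sqsum = np.asarray(expr.multiply(expr).sum(axis=0)).ravel()

    # Totals over the selection; the remainder follows by subtraction.
    sel = expr[selected]
    sel_sum = np.asarray(sel.sum(axis=0)).ravel()
    sel_sqsum = np.asarray(sel.multiply(sel).sum(axis=0)).ravel()

    def mean_std(total, sq_total, n):
        mean = total / n
        var = (sq_total - n * mean ** 2) / (n - 1)  # sample variance from sums
        return mean, np.sqrt(np.clip(var, 0, None))  # clip guards against rounding below zero

    sel_mean, sel_std = mean_std(sel_sum, sel_sqsum, n_sel)
    rem_mean, rem_std = mean_std(gene_sum - sel_sum, gene_sqsum - sel_sqsum, n_rem)
    return stats.ttest_ind_from_stats(sel_mean, sel_std, n_sel,
                                      rem_mean, rem_std, n_rem, equal_var=False)


# Toy data (hypothetical): 60 cells x 5 genes, first 20 cells "selected".
rng = np.random.default_rng(0)
toy_dense = rng.poisson(1.0, size=(60, 5)).astype(float)
toy_mask = np.arange(60) < 20
t_stat, p_val = diff_expression(sparse.csr_matrix(toy_dense), toy_mask)

# Reference computed directly on the dense arrays.
t_ref, p_ref = stats.ttest_ind(toy_dense[toy_mask], toy_dense[~toy_mask], equal_var=False)
print("matches dense reference:", np.allclose(t_stat, t_ref))
```

Genes that are constant in both groups have zero variance and yield NaN statistics; a real implementation would need to decide how to report them.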


In [25]:
# Dense copy for comparison; at float64 this needs roughly 5.3 GB of memory.
_expr = pd.DataFrame(_expr_sparse.toarray(), columns=genes, index=barcodes)

In [26]:
_expr.iloc[:5, :5]

| | LINC00115 | FAM41C | NOC2L | KLHL17 | PLEKHN1 |
|---|---|---|---|---|---|
| AAACATACACCCAA-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AAACATACCCCTCA-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AAACATACCGGAGA-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AAACATACTCTTCA-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AAACATACTGGATC-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [27]:
_expr.shape

(52899, 12563)
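
The shape above, combined with the 28209469 stored elements reported when the matrix was loaded, gives a back-of-the-envelope memory comparison between the dense and CSC representations. The estimate assumes float64 values and 32-bit index arrays; the `sparse_nbytes` helper is a hypothetical convenience for measuring an actual matrix.

```python
import scipy.sparse as sparse

n_cells, n_genes, n_stored = 52899, 12563, 28209469

# Dense float64: 8 bytes per entry, zeros included.
dense_bytes = n_cells * n_genes * 8

# CSC: an 8-byte value plus a 4-byte row index per stored entry, and one
# 4-byte column pointer per gene (plus one). Assumes 32-bit indices.
csc_bytes = n_stored * (8 + 4) + (n_genes + 1) * 4

print(f"dense: {dense_bytes / 1e9:.2f} GB, csc: {csc_bytes / 1e9:.2f} GB")


def sparse_nbytes(m):
    """Bytes actually held by a CSC/CSR matrix's three underlying arrays."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


# Measured on a small synthetic matrix (10% nonzero) for comparison.
toy = sparse.random(100, 50, density=0.1, format="csc", random_state=0)
print(sparse_nbytes(toy), "bytes sparse vs", toy.toarray().nbytes, "bytes dense")
```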

In [28]:
import sys
sys.path.append('../umap_explorer/')  # make the local umap_explorer module importable
from umap_explorer import UMAPexplorer

# Dense input: the DataFrame already carries gene names (columns) and barcodes (index).
test_dense = UMAPexplorer(_umap, _expr)
test_dense.explore_data()

0.39404726028442383 seconds to select and project cells
Selected 344 cells
Calculating correlations...


  rs = np.dot(DP, DO) / np.sqrt(np.sum(DO ** 2, 0) * np.sum(DP ** 2))


9.00390338897705 seconds to calculate correlations. Sparsity:  False
0.9730532169342041 seconds to select and project cells
Selected 5412 cells
Calculating correlations...
16.529144048690796 seconds to calculate correlations. Sparsity:  False
Selected mode 2
0.6109471321105957 seconds to select and project cells
Selected 375 cells
Calculating differential expression...
8.504546165466309 seconds to calculate genesums. Sparsity:  False
20.941229104995728 seconds to calculate squared genesums. Sparsity:  False
45.02797794342041 seconds to calculate differential expression. Sparsity:  False
0.9342422485351562 seconds to select and project cells
Selected 5562 cells
Calculating differential expression...


  remainder_stds = np.sqrt((self.gene_sqsum - selected_stds - (remainder_N*remainder_means**2)) / (remainder_N -1))


29.00832200050354 seconds to calculate differential expression. Sparsity:  False
2.678100347518921 seconds to select and project cells
Selected 23513 cells
Calculating differential expression...
60.34158396720886 seconds to calculate differential expression. Sparsity:  False
EXPORTING DATA...
BYE
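
The correlation expression that appears in the warning output above (`rs = np.dot(DP, DO) / np.sqrt(...)`) works on centered dense arrays. Centering a sparse matrix would fill in its zeros, so a sparse version has to avoid it. The sketch below shows one way to do that; it is an illustration under that assumption, not the code UMAPexplorer runs, and `pearson_vs_columns` and the toy data are hypothetical.

```python
import numpy as np
import scipy.sparse as sparse


def pearson_vs_columns(expr, v):
    """Pearson r between vector v (length n) and each column of sparse expr (n x g)."""
    expr = sparse.csc_matrix(expr)
    n = expr.shape[0]
    col_mean = np.asarray(expr.mean(axis=0)).ravel()
    col_sqsum = np.asarray(expr.multiply(expr).sum(axis=0)).ravel()
    v_centered = v - v.mean()

    # Because v_centered sums to zero, sum_i v_centered[i] * (x[i, j] - mean_j)
    # equals sum_i v_centered[i] * x[i, j]; the matrix itself is never centered.
    cov = expr.T @ v_centered

    # Sum of squared deviations per column, from the sum of squares and the mean.
    col_ss = col_sqsum - n * col_mean ** 2

    # Constant columns give 0/0; let them come out as NaN without a warning.
    with np.errstate(divide="ignore", invalid="ignore"):
        return cov / np.sqrt(col_ss * np.sum(v_centered ** 2))


# Toy data (hypothetical): 200 cells x 6 genes and a random projection vector.
rng = np.random.default_rng(1)
toy_expr = rng.poisson(0.5, size=(200, 6)).astype(float)
toy_v = rng.normal(size=200)
rs_sparse = pearson_vs_columns(sparse.csc_matrix(toy_expr), toy_v)

# Reference: column-by-column correlation on the dense array.
rs_ref = np.array([np.corrcoef(toy_v, toy_expr[:, j])[0, 1] for j in range(6)])
print("matches dense reference:", np.allclose(rs_sparse, rs_ref))
```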


In [None]:
if (self.pearsonsThreshold <= abs(rs[i])) and (ps[i] <= self.pvalueThreshold):
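
The per-gene condition above can also be applied to all genes at once with a boolean mask. This is a minimal sketch assuming `rs` and `ps` are equal-length arrays of correlations and p-values; the function name `significant_mask`, the example values, and the thresholds are hypothetical.

```python
import numpy as np


def significant_mask(rs, ps, pearsons_threshold, pvalue_threshold):
    """True where |r| meets the correlation threshold and p meets the p-value threshold."""
    rs = np.asarray(rs, dtype=float)
    ps = np.asarray(ps, dtype=float)
    # Comparisons involving NaN evaluate to False, so undefined correlations drop out.
    return (np.abs(rs) >= pearsons_threshold) & (ps <= pvalue_threshold)


# Hypothetical values: the last correlation is undefined (NaN).
rs = np.array([0.8, -0.6, 0.1, np.nan])
ps = np.array([0.001, 0.2, 0.0001, 0.5])
mask = significant_mask(rs, ps, pearsons_threshold=0.5, pvalue_threshold=0.05)
print(np.flatnonzero(mask))  # indices of genes passing both thresholds
```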