This example shows how to:  
1. Load a counts matrix (10X Chromium data from human peripheral blood cells)
2. Run the default Scrublet pipeline 
3. Check that doublet predictions make sense

In [1]:
%matplotlib inline
import scrublet as scr
import scipy.io
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd 
import glob

In [None]:
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = 'Arial'
plt.rc('font', size=14)
plt.rcParams['pdf.fonttype'] = 42

#### Download 8k PBMC data set from 10X Genomics
Download raw data from this link:
http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_filtered_gene_bc_matrices.tar.gz


Or use wget:

In [None]:
#!wget http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_filtered_gene_bc_matrices.tar.gz

Uncompress:

In [None]:
#!tar xfz pbmc8k_filtered_gene_bc_matrices.tar.gz

#### Load counts matrix and gene list
Load the raw counts matrix as a scipy sparse matrix with cells as rows and genes as columns.

In [None]:
#input_dir = 'Scrublet_Files/'
counts_matrix = scipy.io.mmread('CZI_Fourth_Run_RNA_ADT_UMAP_paired_res_1.2_with_HTO_doubs.raw.counts.for.Scrublet.mtx').T.tocsc()
#genes = np.array(scr.load_genes(input_dir + 'genes.tsv', delimiter='\t', column=1))

print('Counts matrix shape: {} rows, {} columns'.format(counts_matrix.shape[0], counts_matrix.shape[1]))
#print('Number of genes in gene list: {}'.format(len(genes)))
# The input file stores genes in rows and cells in columns, so after the
# transpose (.T) each row of counts_matrix is a cell and each column is a gene.
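
The orientation convention can be checked on a toy example. The sketch below writes a small made-up genes-by-cells matrix to a temporary Matrix Market file, reads it back, and transposes it so that cells are rows. The matrix values and the file name `toy_counts.mtx` are illustrative assumptions, not part of the dataset used here.

```python
import os
import tempfile

import numpy as np
import scipy.io
import scipy.sparse

# Hypothetical toy matrix: 3 genes (rows) x 4 cells (columns).
genes_by_cells = scipy.sparse.csr_matrix(np.array([
    [1, 0, 2, 0],
    [0, 3, 0, 0],
    [4, 0, 0, 5],
]))

with tempfile.TemporaryDirectory() as tmp_dir:
    mtx_path = os.path.join(tmp_dir, 'toy_counts.mtx')
    scipy.io.mmwrite(mtx_path, genes_by_cells)
    # Read the file back, transpose to cells x genes, and convert to CSC.
    toy_counts = scipy.io.mmread(mtx_path).T.tocsc()

n_cells, n_genes = toy_counts.shape
print('Toy matrix: {} cells, {} genes'.format(n_cells, n_genes))
```

If the printed number of cells looks like a gene count (tens of thousands), the transpose is probably missing or applied twice.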

#### Initialize Scrublet object
The relevant parameters are:
- *expected_doublet_rate*: the expected fraction of transcriptomes that are doublets, typically 0.05-0.1. Results are not particularly sensitive to this parameter. For this example, the expected doublet rate comes from the Chromium User Guide: https://support.10xgenomics.com/permalink/3vzDu3zQjY0o2AqkkkI4CC
- *sim_doublet_ratio*: the number of doublets to simulate, relative to the number of observed transcriptomes. This should be high enough that all doublet states are well-represented by simulated doublets. Setting it too high is computationally expensive. The default value is 2, though values as low as 0.5 give very similar results for the datasets that have been tested.
- *n_neighbors*: Number of neighbors used to construct the KNN classifier of observed transcriptomes and simulated doublets. The default value of `round(0.5*sqrt(n_cells))` generally works well.
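
To get a feel for the magnitudes these parameters imply, here is a minimal sketch that evaluates the formulas quoted above for a hypothetical dataset of 8,000 cells. The helper names and the cell count are assumptions made for illustration; they are not part of the Scrublet API.

```python
import math


def default_n_neighbors(n_cells):
    """Default neighbor count quoted above: round(0.5 * sqrt(n_cells))."""
    return int(round(0.5 * math.sqrt(n_cells)))


def n_simulated_doublets(n_cells, sim_doublet_ratio=2.0):
    """Number of simulated doublets implied by sim_doublet_ratio."""
    return int(n_cells * sim_doublet_ratio)


def n_expected_doublets(n_cells, expected_doublet_rate=0.1):
    """Number of observed transcriptomes expected to be doublets."""
    return int(round(n_cells * expected_doublet_rate))


n_cells = 8000  # hypothetical dataset size
print('n_neighbors:', default_n_neighbors(n_cells))
print('simulated doublets:', n_simulated_doublets(n_cells))
print('expected doublets:', n_expected_doublets(n_cells))
```

For 8,000 cells this works out to 45 neighbors, 16,000 simulated doublets at the default ratio of 2, and about 800 expected doublets at a rate of 0.1.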


In [None]:
scrub = scr.Scrublet(counts_matrix, expected_doublet_rate=0.1) #d_rate was 0.1

#### Run the default pipeline, which includes:
1. Doublet simulation
2. Normalization, gene filtering, rescaling, PCA
3. Doublet score calculation 
4. Doublet score threshold detection and doublet calling


In [None]:
doublet_scores, predicted_doublets = scrub.scrub_doublets(min_counts=2, 
                                                          min_cells=3, 
                                                          min_gene_variability_pctl=85, 
                                                          n_prin_comps=30)
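
The first step, doublet simulation, can be illustrated independently of Scrublet. The sketch below assumes that a simulated doublet is formed by summing the raw counts of two randomly chosen observed cells. It is a simplified stand-in, not Scrublet's internal implementation, and the function name `simulate_doublets` and the toy matrix are hypothetical.

```python
import numpy as np
import scipy.sparse


def simulate_doublets(counts, sim_doublet_ratio=2.0, seed=0):
    """Illustrative helper (not Scrublet's API): sum random pairs of rows."""
    rng = np.random.default_rng(seed)
    n_obs = counts.shape[0]
    n_sim = int(n_obs * sim_doublet_ratio)
    # Draw two parent cells (with replacement) for each simulated doublet.
    parents = rng.integers(0, n_obs, size=(n_sim, 2))
    counts = counts.tocsr()  # row indexing is efficient in CSR format
    doublets = counts[parents[:, 0]] + counts[parents[:, 1]]
    return doublets, parents


# Made-up counts: 4 cells (rows) x 3 genes (columns).
toy = scipy.sparse.csr_matrix(np.array([
    [1, 0, 2],
    [0, 3, 0],
    [4, 0, 5],
    [0, 1, 0],
]))
sim, parents = simulate_doublets(toy, sim_doublet_ratio=2.0)
print('Simulated {} doublets from {} cells'.format(sim.shape[0], toy.shape[0]))
```

Each row of `sim` has the combined counts of its two parent rows, which is why doublets formed from distinct cell types end up in regions of expression space that no singlet occupies.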

#### Plot doublet score histograms for observed transcriptomes and simulated doublets
The simulated doublet histogram is typically bimodal. The left mode corresponds to "embedded" doublets generated by two cells with similar gene expression. The right mode corresponds to "neotypic" doublets, which are generated by cells with distinct gene expression (e.g., different cell types) and are expected to introduce more artifacts in downstream analyses. Scrublet can only detect neotypic doublets.  
  
To call doublets vs. singlets, we must set a threshold doublet score, ideally at the minimum between the two modes of the simulated doublet histogram. `scrub_doublets()` attempts to identify this point automatically and has done a good job in this example. However, if automatic threshold detection doesn't work well, you can adjust the threshold with the `call_doublets()` function. For example:
```python
scrub.call_doublets(threshold=0.25)
```
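
To see how the choice of threshold changes the calls, the sketch below applies two thresholds to a handful of made-up scores. It assumes that a cell is called a doublet when its score exceeds the threshold; the scores and the helper name `call_by_threshold` are illustrative, not Scrublet output or API.

```python
import numpy as np

# Made-up doublet scores for six cells (illustrative values only).
toy_scores = np.array([0.05, 0.12, 0.40, 0.22, 0.31, 0.08])


def call_by_threshold(scores, threshold):
    """Hypothetical helper: flag cells whose score exceeds the threshold."""
    return scores > threshold


for threshold in (0.25, 0.35):
    calls = call_by_threshold(toy_scores, threshold)
    print('threshold={}: {} predicted doublets'.format(threshold, int(calls.sum())))
```

Raising the threshold from 0.25 to 0.35 drops the cell scoring 0.31 from the doublet calls, so a stricter threshold trades fewer false positives for more missed doublets.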

In [None]:
#scrub.plot_histogram()
scrub.call_doublets(threshold=0.35)


#### Get 2-D embedding to visualize the results

In [None]:
print('Running UMAP...')
scrub.set_embedding('UMAP', scr.get_umap(scrub.manifold_obs_, 10, min_dist=0.3))

# # Uncomment to run tSNE - slow
# print('Running tSNE...')
# scrub.set_embedding('tSNE', scr.get_tsne(scrub.manifold_obs_, angle=0.9))

# # Uncomment to run force layout - slow
# print('Running ForceAtlas2...')
# scrub.set_embedding('FA', scr.get_force_layout(scrub.manifold_obs_, n_neighbors=5, n_iter=1000))
    
print('Done.')

#### Plot doublet predictions on 2-D embedding
Predicted doublets should mostly co-localize in a few distinct regions of the embedding rather than being scattered evenly among singlets.

In [None]:
scrub.plot_embedding('UMAP', order_points=True);

# scrub.plot_embedding('tSNE', order_points=True);
# scrub.plot_embedding('FA', order_points=True);
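
Co-localization can also be checked numerically rather than by eye. The sketch below is one possible heuristic, not something Scrublet provides: for each predicted doublet it computes the fraction of its k nearest neighbors in the embedding that are also predicted doublets. The function name, the toy coordinates, and the brute-force distance computation (suitable only for small inputs) are all assumptions made for illustration.

```python
import numpy as np


def doublet_neighbor_fraction(embedding, is_doublet, k=5):
    """Mean fraction of each predicted doublet's k nearest neighbors
    that are also predicted doublets (hypothetical helper)."""
    embedding = np.asarray(embedding, dtype=float)
    is_doublet = np.asarray(is_doublet, dtype=bool)
    # Pairwise squared Euclidean distances (brute force, O(n^2) memory).
    diff = embedding[:, None, :] - embedding[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    frac = is_doublet[neighbors].mean(axis=1)
    return float(frac[is_doublet].mean())


# Toy embedding: 20 singlets near (0, 0) and 5 doublets near (10, 10).
rng = np.random.default_rng(0)
singlets = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
doublets = rng.normal(loc=10.0, scale=0.1, size=(5, 2))
toy_embedding = np.vstack([singlets, doublets])
toy_calls = np.array([False] * 20 + [True] * 5)

fraction = doublet_neighbor_fraction(toy_embedding, toy_calls, k=4)
print('Doublet neighbor fraction: {:.2f}'.format(fraction))
```

A value close to 1 means the predicted doublets sit together; a value close to the overall doublet fraction suggests they are spread among singlets and the threshold may need adjusting.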

In [None]:
pd.DataFrame(doublet_scores).to_csv("doublet_scores_1.0.csv")

In [None]:
aa=list(["bebe","deve"])

In [None]:
aa[0]

In [2]:

home_dir="/projects/ucar-lab/danaco/bncmrk-dblts"

os.chdir(home_dir+"/Python")

from scr_pipe import *

pr_names = ["CZI.PBMC", "PBMC.8.HTO", "Four.Cell.12.HTO"]

for i in pr_names:
    os.chdir(home_dir+"/Scrublet/input/"+i)
    fl=glob.glob("*raw.counts.for.Scrublet.mtx")
    for j in fl:
        scr_pipe(j)

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/home/danaco/.conda/envs/scSplit/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-1fd502045ead>", line 14, in <module>
    scr_pipe(j)
  File "/projects/ucar-lab/danaco/bncmrk-dblts/Python/scr_pipe.py", line 10, in scr_pipe
    counts_matrix = scipy.io.mmread(file_name).T.tocsc()
  File "/home/danaco/.conda/envs/scSplit/lib/python3.6/site-packages/scipy/io/mmio.py", line 76, in mmread
    return MMFile().read(source)
  File "/home/danaco/.conda/envs/scSplit/lib/python3.6/site-packages/scipy/io/mmio.py", line 415, in read
    return self._parse_body(stream)
  File "/home/danaco/.conda/envs/scSplit/lib/python3.6/site-packages/scipy/io/mmio.py", line 592, in _parse_body
    I[entry_number], J[entry_number] = map(int, l[:2])
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent

KeyboardInterrupt: 

In [4]:
home_dir="/".join(os.getcwd().split("/")[0:len(os.getcwd().split("/"))-3])
pr_name=os.getcwd().split("/")[-1]

out_dir=home_dir+"/Scrublet/output/"+pr_name+"/"

out_dir+"Scr.doublet.scores."+0.1+"exp_rate.csv"

TypeError: must be str, not float

In [8]:
out_dir+0.1

TypeError: must be str, not float

In [10]:
str(0.1)

'0.1'
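
The `TypeError` above arises because Python does not implicitly convert a float to a string when concatenating with `+`. A minimal sketch of one way to build the output path is shown below, using `str.format` so the rate is converted automatically. The helper name `scores_path` and the example directory are hypothetical; the file name pattern follows the cell above.

```python
import os


def scores_path(out_dir, expected_rate):
    """Hypothetical helper: build the CSV path for a given expected doublet rate."""
    # format() converts the float to its string form, e.g. 0.1 -> '0.1'.
    file_name = 'Scr.doublet.scores.{}exp_rate.csv'.format(expected_rate)
    return os.path.join(out_dir, file_name)


example_path = scores_path(os.path.join('Scrublet', 'output', 'CZI.PBMC'), 0.1)
print(example_path)
```

Using `os.path.join` also avoids having to remember the trailing slash on `out_dir`.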

In [None]:
anan(x=3)