# BESCAPE - tutorial on deconvolution of bulk RNA using single-cell annotations

BESCAPE (BESCA Proportion Estimator) is a deconvolution module. It uses single-cell annotations from the BESCA workflow to build a Gene Expression Profile (GEP). This GEP serves as the basis vector for deconvoluting bulk RNA samples, i.e. for predicting the cell type proportions within each sample.
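
To make the idea of a basis vector concrete, the toy sketch below models a bulk sample as a weighted sum of two cell type profiles and recovers the weights by ordinary least squares. It is an illustration only: the numbers are made up, the helper `estimate_two_fractions` is hypothetical, and this is not the algorithm used by BESCAPE, MuSiC or SCDC.

```python
def estimate_two_fractions(gep, bulk):
    """Least-squares fit of bulk ~ gep @ p for exactly two cell types.

    gep  : list of [profile_type_1, profile_type_2] rows, one row per gene
    bulk : list of bulk expression values, one per gene
    """
    # Normal equations (G^T G) p = G^T b, written out for a 2-column G.
    a11 = sum(row[0] * row[0] for row in gep)
    a12 = sum(row[0] * row[1] for row in gep)
    a22 = sum(row[1] * row[1] for row in gep)
    b1 = sum(row[0] * y for row, y in zip(gep, bulk))
    b2 = sum(row[1] * y for row, y in zip(gep, bulk))

    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("the two profiles are collinear; fractions are not identifiable")

    # Cramer's rule for the 2x2 system.
    p1 = (b1 * a22 - a12 * b2) / det
    p2 = (a11 * b2 - a12 * b1) / det

    # Crude post-processing (not a true non-negative solver):
    # clip negative weights and rescale so the fractions sum to 1.
    p1, p2 = max(p1, 0.0), max(p2, 0.0)
    total = p1 + p2
    if total == 0:
        raise ValueError("no non-negative signal found")
    return p1 / total, p2 / total


# Made-up GEP: 3 genes (rows) x 2 cell types (columns).
toy_gep = [[10.0, 0.0], [0.0, 8.0], [5.0, 5.0]]
# Simulate a bulk sample that is 30% type 1 and 70% type 2.
toy_bulk = [0.3 * a + 0.7 * b for a, b in toy_gep]

print(estimate_two_fractions(toy_gep, toy_bulk))
```

On this noise-free toy input the fit recovers fractions close to 0.3 and 0.7. Real deconvolution methods add gene weighting, non-negativity constraints and handling of many cell types.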

BESCAPE lets the user supply their own GEP and choose any of the supported deconvolution methods. This effectively decouples the deconvolution algorithm from its underlying GEP (basis vector).

This tutorial presents the workflow for deconvolution, as well as the link to BESCA single-cell annotations.

We assume that either Docker or Singularity is already installed.

Install bescape using `pip install bescape`.
Please note: if you have previously installed bescape, upgrade to the latest version:
`pip install --upgrade bescape`

# Initialising the predictor object

Initialise the deconvolution predictor object. It requires either a Docker or a Singularity image to run. Both methods are shown below.

## 1. Docker
To initialise the Bescape deconvolution object, we need to set `service='docker'` and `docker_image='bedapub/bescape:version'`. Bescape first looks for a local Docker image and, if none is available, pulls the bescape image from DockerHub. This also means that one can build a customised Docker image locally from the BESCAPE source and use it in the Bescape object.

Please note: if you have previously pulled the Docker image, make sure it is the latest version using:

`docker pull bedapub/bescape:latest`

In [1]:
import os
from bescape import Bescape

# docker
# may take some time if the docker image is being built for the first time
deconv = Bescape(service='docker', docker_image='bedapub/bescape:latest')

Docker client instantiated
Local Docker image found:  bedapub/bescape:latest


#### Troubleshooting Docker permission error
If you run into a permission error when running the Docker image, follow the steps in https://askubuntu.com/questions/477551/how-can-i-use-docker-without-sudo to run Docker without sudo.

Namely,
Add the docker group if it doesn't already exist:

`sudo groupadd docker`

Add the connected user "$USER" to the docker group. Change the user name to match your preferred user if you do not want to use your current user:

`sudo gpasswd -a $USER docker`
    
Either run `newgrp docker` or log out and back in to activate the group changes.
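
After changing the group membership, one way to confirm that the current user can reach the Docker daemon is to call `docker info`, which should exit with a non-zero status when the daemon is unreachable or access is denied. The helper `docker_usable` below is a hypothetical convenience wrapper for this tutorial, not part of bescape.

```python
import shutil
import subprocess


def docker_usable(timeout=20):
    """Return True if `docker info` succeeds for the current user, else False."""
    # The docker client must be on PATH at all.
    if shutil.which("docker") is None:
        return False
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Client could not be started, or the daemon did not answer in time.
        return False
    return result.returncode == 0


print("Docker usable without sudo:", docker_usable())
```

If this prints `False` although Docker is installed, the group change has most likely not taken effect in the current session yet.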


## 2. Singularity
When using Singularity, the user specifies the absolute path for the Singularity container file. 

If the path is not given, Bescape will attempt to pull the latest Docker image from DockerHub and build a new Singularity container file from it. In this case, the `docker_image` parameter specifies which image is pulled from DockerHub and converted to a Singularity container.

In [None]:
import os
from bescape import Bescape

# singularity
deconv = Bescape(service='singularity', docker_image='bedapub/bescape:latest', path_singularity=None)

# Performing Deconvolution
Once the Bescape object has been initialised, the methods are the same for both `docker` and `singularity`.

## Input file structure
An example of the expected input file structure is available here: https://github.com/bedapub/bescape/tree/master/docs/datasets/bescape

The user needs to provide:
1. Absolute path to the input FOLDER containing the [input.csv](https://github.com/bedapub/bescape/blob/master/docs/datasets/bescape/input/input.csv) file and the [bulk.csv](https://github.com/bedapub/bescape/blob/master/docs/datasets/bescape/input/ds1_ensg.csv) file (rows= bulk gene expression, columns=samples)
2. Absolute path to the gep FOLDER containing the GEP file to be used as a basis vector for deconvolution




## Using a single-cell annotation AnnData object as a basis vector
- should contain single-cell annotations of multiple samples from which the deconvolution method generates its own GEP
- currently supported packages:
    1. MuSiC
    2. SCDC
- __The packages above are written in R. Thus, we need to convert the AnnData objects to R ExpressionSet objects. This has been semi-automated in the following notebook: [Converting AnnData to Eset](https://bedapub.github.io/besca/tutorials/adata_to_eset.html)__
- implemented in the __Bescape.deconvolute_sc( )__ method

### 1. Set input file structure and download example files
An example of the expected input file structure is available here: https://github.com/bedapub/bescape/tree/master/docs/datasets/bescape

The user needs to provide:
1. Absolute path to the input FOLDER containing the [input.csv](https://github.com/bedapub/bescape/blob/master/docs/tutorial_data/input/input.csv) file and the [bulk.csv](https://github.com/bedapub/bescape/blob/master/docs/tutorial_data/input/simulated_blk_segerstolpe_hugo.csv) file (rows= bulk gene expression, columns=samples)
2. Absolute path to the gep FOLDER containing the GEP file to be used as a basis vector for deconvolution
3. Absolute path to the output FOLDER, into which the deconvolution results should be written out

The following cell handles folder creation and example file download for this tutorial

In [2]:
import urllib.request

# Important to specify ABSOLUTE directory paths
wd = os.getcwd()
dir_annot = wd + '/tutorial_data/gep'
dir_input = wd + '/tutorial_data/input'
dir_output = wd + '/tutorial_data/output'
dirlist = [dir_annot, dir_input, dir_output]
# Create the folders if they do not exist yet
for directory in dirlist:
    os.makedirs(directory, exist_ok=True)

uri_input = 'https://raw.githubusercontent.com/bedapub/bescape/master/docs/tutorial_data/input/input.csv'
urllib.request.urlretrieve(uri_input, dir_input + '/input.csv')

uri_sample = 'https://raw.githubusercontent.com/bedapub/bescape/master/docs/tutorial_data/input/simulated_blk_segerstolpe_hugo.csv'
urllib.request.urlretrieve(uri_sample, dir_input + '/simulated_blk_segerstolpe_hugo.csv')

uri_baron = 'https://raw.githubusercontent.com/bedapub/bescape/master/docs/tutorial_data/gep/baron_raw_exp_eset.RDS'
uri_seger = 'https://raw.githubusercontent.com/bedapub/bescape/master/docs/tutorial_data/gep/segerstolpe_raw_exp_eset.RDS'
urllib.request.urlretrieve(uri_baron, dir_annot + '/baron_raw_exp_eset.RDS')
urllib.request.urlretrieve(uri_seger, dir_annot + '/segerstolpe_raw_exp_eset.RDS')


('/home/tkamth/bescape/docs/tutorial_data/gep/segerstolpe_raw_exp_eset.RDS',
 <http.client.HTTPMessage at 0x7fb2e60ca910>)
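

Before running a deconvolution, it can help to verify that the three folders follow the layout described above. The sketch below is a hypothetical helper (`check_deconv_layout` is not part of bescape) that only checks what this tutorial states: the paths are absolute, the folders exist, `input.csv` is present in the input folder, and the annotation folder is not empty.

```python
import os


def check_deconv_layout(dir_annot, dir_input, dir_output):
    """Return a list of problems; an empty list means the layout looks fine."""
    problems = []
    folders = [("dir_annot", dir_annot), ("dir_input", dir_input), ("dir_output", dir_output)]
    for name, path in folders:
        if not os.path.isabs(path):
            problems.append(f"{name} is not an absolute path: {path}")
        elif not os.path.isdir(path):
            problems.append(f"{name} does not exist: {path}")
    # input.csv is expected inside the input folder.
    if os.path.isdir(dir_input) and not os.path.isfile(os.path.join(dir_input, "input.csv")):
        problems.append("input.csv is missing from dir_input")
    # At least one annotation / GEP file is needed.
    if os.path.isdir(dir_annot) and not os.listdir(dir_annot):
        problems.append("dir_annot is empty")
    return problems


wd = os.getcwd()
problems = check_deconv_layout(
    os.path.join(wd, "tutorial_data", "gep"),
    os.path.join(wd, "tutorial_data", "input"),
    os.path.join(wd, "tutorial_data", "output"),
)
print(problems or "layout looks fine")
```

The helper returns messages instead of raising, so it can be called safely at any point in the notebook.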

### 1. MuSiC
`dir_annot` should contain only one annotated ExpressionSet. If more than one is available, the first one in alphabetical order is picked.
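
To see which file would come first when several ExpressionSets sit in `dir_annot`, the folder can be listed in sorted order. This is a sketch under assumptions: `first_eset_alphabetical` is a hypothetical helper, it uses Python's plain string ordering, and the actual selection happens inside the container, whose ordering rules (for example regarding case or locale) may differ.

```python
import os


def first_eset_alphabetical(dir_annot):
    """Return the first .RDS file name in plain string order, or None if there is none."""
    candidates = sorted(f for f in os.listdir(dir_annot) if f.lower().endswith(".rds"))
    return candidates[0] if candidates else None


# With the two example files downloaded in this tutorial:
names = ["segerstolpe_raw_exp_eset.RDS", "baron_raw_exp_eset.RDS"]
print(sorted(names)[0])  # baron_raw_exp_eset.RDS
```

To make the choice explicit, keep a single ExpressionSet in the folder passed as `dir_annot`.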

In [None]:
# deconvolute using MuSiC - sc based basis vector
deconv.deconvolute_sc(dir_annot= dir_annot, 
                      dir_input= dir_input,
                      dir_output= dir_output, 
                      method='music')

### 2. SCDC

Using SCDC requires the following parameters:
* `dir_annot` can contain one or more sc-annotation ExpressionSets. If more than one is available, SCDC reads all of them and performs [ENSEMBLE deconvolution](https://rdrr.io/github/meichendong/SCDC/man/SCDC_ENSEMBLE.html)
* `celltype_sel` - cell types of interest to estimate; these must be present in every basis vector supplied in `dir_annot`, i.e. they must lie in the intersection of the cell type sets
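
The intersection requirement can be sketched as follows. The label lists here are made up for illustration and are not read from the tutorial's ExpressionSets; `common_celltypes` is a hypothetical helper, not part of bescape.

```python
def common_celltypes(label_sets):
    """Cell type labels present in every reference, sorted for stable output."""
    sets = [set(labels) for labels in label_sets]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


# Hypothetical cell type labels of two references (assumed, for illustration only).
ref_a = ["fibroblast", "PP cell", "pancreatic A cell", "mast cell"]
ref_b = ["pancreatic A cell", "fibroblast", "PP cell", "Schwann cell"]

shared = common_celltypes([ref_a, ref_b])
print(shared)
```

Only labels in `shared` would be valid candidates for `celltype_sel` in this made-up case; labels that occur in just one reference are dropped.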


In [3]:
wd = os.getcwd()
dir_annot = wd + '/tutorial_data/gep'
dir_input = wd + '/tutorial_data/input'
dir_output = wd + '/tutorial_data/output'

dir_annot

'/home/tkamth/bescape/docs/tutorial_data/gep'

In [4]:
deconv.deconvolute_sc(dir_annot=dir_annot, 
                      dir_input=dir_input,
                      dir_output=dir_output, 
                      method='scdc', 
                      celltype_sel=['fibroblast', 'PP cell', 'pancreatic D cell', 
                                    'pancreatic A cell', 'pancreatic ductal cell', 
                                    'type B pancreatic cell', 'pancreatic acinar cell', 
                                    'blood vessel endothelial cell'])

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.0
✔ tidyr   1.1.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:dplyr’:

combine, intersect, setdiff, union

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, append, as.data.frame, basename, cbind, 