[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/chanwkimlab/MarcoPolo/blob/main/notebooks/tutorial.ipynb)

# Setup

**Start the Colab kernel with a GPU**: Runtime -> Change runtime type -> GPU

## Install dependencies

In [None]:
!pip install marcopolo-pytorch --upgrade

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting marcopolo-pytorch
  Downloading marcopolo_pytorch-1.0.9-py3-none-any.whl (614 kB)
Collecting anndata>=0.7.4
  Downloading anndata-0.8.0-py3-none-any.whl (96 kB)
Collecting scipy>=1.6.1
  Downloading scipy-1.7.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (38.1 MB)
Collecting matplotlib>=3.3.0
  Downloading matplotlib-3.5.2-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.2 MB)
Collecting scanpy>=1.9.0
  Downloading scanpy-1.9.1-py3-none-any.whl (2.0 MB)
Collecting einops>=0.3
  Downloading einops-0.4.1-py3-none-any.whl (28 kB)
Collecting fo

In [None]:
!pip install matplotlib==3.1.3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting matplotlib==3.1.3
  Downloading matplotlib-3.1.3-cp37-cp37m-manylinux1_x86_64.whl (13.1 MB)
Installing collected packages: matplotlib
  Attempting uninstall: matplotlib
    Found existing installation: matplotlib 3.5.2
    Uninstalling matplotlib-3.5.2:
      Successfully uninstalled matplotlib-3.5.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scanpy 1.9.1 requires matplotlib>=3.4, but you have matplotlib 3.1.3 which is incompatible.
marcopolo-pytorch 1.0.9 requires matplotlib>=3.3.0, but you have matplotlib 3.1.3 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed matplotlib-3.1.3


# Run MarcoPolo

## Import packages

In [None]:
# Import packages
import pickle

import numpy as np
import pandas as pd
import torch
import anndata as ad
import scanpy as sc
import matplotlib.pyplot as plt

import MarcoPolo

assert torch.cuda.is_available(), "Make sure that you started the colab kernel with GPU: Runtime -> Change runtime type -> GPU"

## Read scRNA-seq data

You can use **example data** or **your own data**.

The data should be in AnnData format. `.X` should contain a raw count matrix of shape (# cells, # genes). You can explore the example datasets below.
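Before running the pipeline on your own data, it can help to confirm that `.X` really holds raw counts rather than normalized or log-transformed values. The following is a minimal sketch of such a check; `looks_like_raw_counts` is a hypothetical helper written for this tutorial, not part of MarcoPolo, and it only tests that the matrix is two-dimensional, non-negative, and integer-valued.

```python
import numpy as np


def looks_like_raw_counts(X):
    """Return True if X is a 2-D matrix of non-negative, integer-valued entries.

    Hypothetical helper for illustration; accepts a NumPy array or a
    SciPy sparse matrix (detected by the presence of `tocsr`).
    """
    if len(X.shape) != 2:
        return False
    if hasattr(X, "tocsr"):
        # Sparse matrix: only the explicitly stored values need checking,
        # because all other entries are zero.
        values = np.asarray(X.tocsr().data)
    else:
        values = np.asarray(X)
    non_negative = np.all(values >= 0)
    integer_valued = np.allclose(values, np.round(values))
    return bool(non_negative and integer_valued)
```

With an AnnData object loaded as `adata`, the check would be called as `looks_like_raw_counts(adata.X)`. A `False` result suggests the matrix has already been normalized or transformed.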

### Example data
We have prepared two example datasets: the human embryonic stem cell (hESC) dataset of Koh et al. and the liver dataset of MacParland et al.

In [None]:
!wget https://raw.githubusercontent.com/chanwkimlab/MarcoPolo/main/notebooks/example/hESC.h5ad
!wget https://raw.githubusercontent.com/chanwkimlab/MarcoPolo/main/notebooks/example/HumanLiver.h5ad
    
anndata_path = "hESC.h5ad"

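# Read the AnnData file. `anndata_path` should point to a file in `h5ad` format.
adata = ad.read_h5ad(anndata_path)

# For fast debugging, keep only the first 1,000 genes.
# `.copy()` turns the view into an independent AnnData object that can be modified safely.
adata = adata[:, :1000].copy()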

--2022-06-17 17:24:04--  https://raw.githubusercontent.com/chanwkimlab/MarcoPolo/main/notebooks/example/hESC.h5ad
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20842419 (20M) [application/octet-stream]
Saving to: ‘hESC.h5ad’


2022-06-17 17:24:05 (149 MB/s) - ‘hESC.h5ad’ saved [20842419/20842419]

--2022-06-17 17:24:05--  https://raw.githubusercontent.com/chanwkimlab/MarcoPolo/main/notebooks/example/HumanLiver.h5ad
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15403217 (15M) [application/octet-stream]
Saving to: ‘HumanLiver.h5

### Your own data
You can upload your own single-cell AnnData file to this session. If you intend to use your own data instead of the example data, run the following cell and upload your file.

In [None]:
from google.colab import files
uploaded = files.upload()

for file_name in uploaded.keys():
    print(f'User uploaded file "{file_name}" with length {len(uploaded[file_name])} bytes')

# If several files were uploaded, the last one is used.
anndata_path = file_name

# Read the AnnData file. `anndata_path` should point to a file in `h5ad` format.
adata = ad.read_h5ad(anndata_path)

## (1) Run regression

In [None]:
# (1) Run regression
# Calculate the size factor of each cell unless it is already stored in `adata.obs`.
if "size_factor" not in adata.obs.columns:
    norm_factor = sc.pp.normalize_total(adata, exclude_highly_expressed=True, max_fraction=0.2, inplace=False)["norm_factor"]
    adata.obs["size_factor"] = norm_factor / norm_factor.mean()
    print("size factor was calculated")

# On a local machine, setting `num_threads` higher than 1 (up to about 4) speeds up the regression considerably.
# `num_threads` > 1 may not work on Colab; if this call fails there, set `num_threads=1`.
regression_result = MarcoPolo.run_regression(adata=adata, size_factor_key="size_factor",
                         num_threads=2, device="cuda:0")

with open(f"{anndata_path}.regression_result.pickle", "wb") as f:
    pickle.dump(regression_result, f)

<INFO> Currently, you are using 2 threads for regression. If you encounter any memory issues, try to set `num_threads` to 1.
The numbers of clusters to test: [1, 2]
Y: (446, 1000) X: (446, 1) s: (446,)
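To make the size factor step more concrete, here is a simplified NumPy sketch of the idea: each cell's sequencing depth is summed over genes that are not highly expressed, then divided by the mean depth so that the size factors average to 1. This is an approximation of my understanding of `exclude_highly_expressed` and `max_fraction`, not scanpy's actual implementation, and `size_factors` is a hypothetical name.

```python
import numpy as np


def size_factors(counts, max_fraction=0.2):
    """Per-cell size factors from a dense (cells x genes) raw count matrix.

    Simplified illustration only; assumes at least one gene remains after
    excluding the highly expressed ones.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1)
    # A gene counts as highly expressed if, in at least one cell, it accounts
    # for more than `max_fraction` of that cell's total counts.
    highly_expressed = (counts > max_fraction * totals[:, None]).any(axis=0)
    # Sequencing depth computed over the remaining genes only.
    depth = counts[:, ~highly_expressed].sum(axis=1)
    # Scale so that the size factors have mean 1.
    return depth / depth.mean()


# Toy data: gene 0 dominates the first cell and is therefore excluded.
toy_counts = np.array([[100] + [1] * 10,
                       [0] + [2] * 10,
                       [0] + [3] * 10])
toy_size_factors = size_factors(toy_counts)
```

In the toy example the remaining depths are 10, 20, and 30, so the size factors come out as 0.5, 1.0, and 1.5.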


## (2) Find markers

In [None]:
# (2) Find markers
markers_result = MarcoPolo.find_markers(adata=adata, regression_result=regression_result)
with open(f"{anndata_path}.markers_result.pickle", "wb") as f:
    pickle.dump(markers_result, f)

Assign cells to on-cells and off-cells...
Calculating voting score...
Calculating proximity score...
Calculating bimodality score...
Calculating MarcoPolo score...
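Because both intermediate results are written to pickle files, a later session can reload them instead of repeating the regression. A minimal sketch, assuming the files were written by the cells above; `load_result` is a hypothetical helper name.

```python
import pickle


def load_result(path):
    """Load a previously pickled result object from `path`.

    Only unpickle files you created yourself: pickle can execute
    arbitrary code when loading untrusted data.
    """
    with open(path, "rb") as f:
        return pickle.load(f)
```

For example, `regression_result = load_result(f"{anndata_path}.regression_result.pickle")` would restore the output of step (1), and the same call with `.markers_result.pickle` would restore the output of step (2).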


## (3) Generate report

In [None]:
# Compute t-SNE coordinates if they do not already exist in `adata`.
if "X_tsne" not in adata.obsm.keys():
    sc.tl.tsne(adata=adata)

In [None]:
# (3) Generate report
MarcoPolo.generate_report(adata=adata, size_factor_key="size_factor", 
                          regression_result=regression_result, 
                          gene_scores=markers_result, 
                          output_dir="./",  
                          low_dim_key="X_tsne",
                          cell_color_key="cell_type",
                          gene_info_path="https://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz",
                          top_num_html=1000,
                          top_num_image=1000)

Assign cells to on-cells and off-cells...
Annotating genes with the gene info: https://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz


100%|██████████| 1000/1000 [00:20<00:00, 48.02it/s, Num. of unmatched genes=47]


47 not matched genes: ENSG00000198804, ENSG00000210082, ENSG00000198712, ENSG00000198938, ENSG00000198727, ENSG00000211459, ENSG00000198899, ENSG00000198886, ENSG00000198763, ENSG00000198888, ENSG00000198786, ENSG00000212907, ENSG00000225840, ENSG00000274474, ENSG00000183311, ENSG00000223367, ENSG00000226225, ENSG00000235650, ENSG00000096150, ENSG00000273673, ...
Generating table files...


findfont: Font family ['Arial'] not found. Falling back to DejaVu Sans.


Generating image files...
Drawing figures
size factor corrected


100%|██████████| 1000/1000 [12:56<00:00,  1.29it/s]
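The report call above refers to three keys: `size_factor` and `cell_type` in `adata.obs`, and `X_tsne` in `adata.obsm`. With your own data, some of these may be absent, so a quick pre-flight check can save a long run. The sketch below is a hypothetical helper, not part of MarcoPolo; it only compares key names and makes no claim about how `generate_report` behaves when a key is missing.

```python
def missing_report_keys(obs_columns, obsm_keys,
                        size_factor_key="size_factor",
                        low_dim_key="X_tsne",
                        cell_color_key="cell_type"):
    """Return the keys named in the report call that are absent from the data."""
    obs_columns = set(obs_columns)
    obsm_keys = set(obsm_keys)
    # Per-cell annotations expected in `adata.obs`.
    missing = [f"obs['{key}']"
               for key in (size_factor_key, cell_color_key)
               if key not in obs_columns]
    # Low-dimensional embedding expected in `adata.obsm`.
    if low_dim_key not in obsm_keys:
        missing.append(f"obsm['{low_dim_key}']")
    return missing
```

It would be called as `missing_report_keys(adata.obs.columns, adata.obsm.keys())`; an empty list means all three keys are present.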


# Download report

## Compress the report folder

In [None]:
!tar -zcf report.tar.gz report
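Outside Colab, or on a system without the `tar` command, the same archive can be produced with Python's standard `tarfile` module. This is a sketch of an equivalent step, assuming the report was written to a folder named `report` in the working directory; `compress_folder` is a hypothetical helper name.

```python
import tarfile
from pathlib import Path


def compress_folder(folder, archive_path):
    """Write `folder` into a gzip-compressed tar archive and return its path."""
    folder = Path(folder)
    with tarfile.open(archive_path, "w:gz") as tar:
        # `arcname` keeps paths inside the archive relative to the folder name.
        tar.add(folder, arcname=folder.name)
    return Path(archive_path)
```

Calling `compress_folder("report", "report.tar.gz")` would produce the same kind of archive as the shell command above.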

## Trigger download

In [None]:
from google.colab import files

files.download('report.tar.gz')
