[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/chanwkimlab/MarcoPolo/blob/main/notebooks/tutorial.ipynb)

# Setup

**Start the Colab kernel with a GPU**: Runtime -> Change runtime type -> GPU

## Install dependencies

In [None]:
!pip install marcopolo-pytorch --upgrade

# Run MarcoPolo

## Import packages

In [None]:
# Import packages
import pickle

import numpy as np
import pandas as pd
import torch
import anndata as ad
import scanpy as sc
import matplotlib.pyplot as plt

import MarcoPolo

assert torch.cuda.is_available(), "Make sure that you started the colab kernel with GPU: Runtime -> Change runtime type -> GPU"

## Read scRNA-seq data

You can use **example data** or **your own data**.

The data should be in AnnData format. `.X` should contain a raw count matrix of shape (# cells, # genes). You can explore the example datasets below.
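
The raw-count requirement above can be checked before running the pipeline. The following is a minimal sketch, not part of MarcoPolo: `looks_like_raw_counts` is a hypothetical helper, and the two toy matrices stand in for `adata.X`.

```python
import numpy as np


def looks_like_raw_counts(values):
    """Return True if every entry is a non-negative whole number.

    Hypothetical helper for illustration; it is not a MarcoPolo function.
    """
    values = np.asarray(values)
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))


# Toy matrices of shape (# cells, # genes); real data would come from `adata.X`.
raw = np.array([[0, 3, 1], [2, 0, 5]])
normalized = np.array([[0.0, 1.5, 0.5], [0.8, 0.0, 2.0]])

print(looks_like_raw_counts(raw))
print(looks_like_raw_counts(normalized))
```

If `.X` is stored as a SciPy CSR or CSC sparse matrix, passing `adata.X.data` checks only the explicitly stored entries, which is usually enough because the implicit entries are zeros.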

### Example data
We have prepared two example datasets: the human embryonic stem cell (hESC) dataset of Koh et al. and the liver dataset of MacParland et al.

In [None]:
!wget https://raw.githubusercontent.com/chanwkimlab/MarcoPolo/main/notebooks/example/hESC.h5ad
!wget https://raw.githubusercontent.com/chanwkimlab/MarcoPolo/main/notebooks/example/HumanLiver.h5ad
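
anndata_path = "HumanLiver.h5ad"

# Read the AnnData file. `anndata_path` should point to a file in `h5ad` format.
adata = ad.read_h5ad(anndata_path)

# For fast debugging, keep only the first 1,000 genes.
# `.copy()` turns the subset into a standalone object instead of a view.
adata = adata[:, :1000].copy()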

### Your own data
You can upload your own single-cell AnnData file to this session. If you intend to use your own data instead of the example data, run the following cell and upload your file.

In [None]:
from google.colab import files
uploaded = files.upload()

for file_name in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(name=file_name, length=len(uploaded[file_name])))

# Use the last uploaded file as the input.
anndata_path = file_name

# Read the AnnData file. `anndata_path` should point to a file in `h5ad` format.
adata = ad.read_h5ad(anndata_path)

## (1) Run regression

In [None]:
# (1) Run regression
# Calculate size factor
if "size_factor" not in adata.obs.columns:
    norm_factor = sc.pp.normalize_total(adata, exclude_highly_expressed=True, max_fraction= 0.2, inplace=False)["norm_factor"]
    adata.obs["size_factor"] = norm_factor/norm_factor.mean()
    print("size factor was calculated")
regression_result = MarcoPolo.run_regression(adata=adata, size_factor_key="size_factor",
                         num_threads=1, device="cuda:0")
# On a local machine, you can set `num_threads` higher than 1 (perhaps up to 4), which speeds up the regression considerably.
# For an unknown reason, num_threads>1 does not work on Colab.

with open(f"{anndata_path}.regression_result.pickle", "wb") as f:
    pickle.dump(regression_result, f)
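
To make the size factor used above more concrete, here is a simplified sketch on a toy count matrix. It assumes the normalization factor is each cell's total count and, unlike the call with `exclude_highly_expressed=True`, it does not leave out highly expressed genes. The matrix values are made up for illustration.

```python
import numpy as np

# Toy raw count matrix of shape (# cells, # genes); values are illustrative only.
counts = np.array(
    [
        [10, 0, 5],
        [20, 10, 0],
        [0, 30, 30],
    ],
    dtype=float,
)

# Simplified normalization factor: total counts per cell.
total_counts = counts.sum(axis=1)

# Dividing by the mean rescales the factors so that they average to 1.
size_factor = total_counts / total_counts.mean()

print(size_factor)
```

Cells with more total counts get a size factor above 1 and cells with fewer counts get a value below 1, so the factor expresses sequencing depth relative to the average cell.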

## (2) Find markers

In [None]:
# (2) Find markers
markers_result = MarcoPolo.find_markers(adata=adata, regression_result=regression_result)
with open(f"{anndata_path}.markers_result.pickle", "wb") as f:
    pickle.dump(markers_result, f)
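
Because both results are saved with `pickle`, a later session can reload them instead of rerunning the regression. The following is a minimal sketch: `load_pickle` is a hypothetical helper, and the dictionary is a placeholder rather than real MarcoPolo output.

```python
import os
import pickle
import tempfile


def load_pickle(path):
    """Load a Python object that was previously saved with pickle.dump."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Placeholder object; in the notebook this would be `markers_result`.
dummy_result = {"note": "placeholder, not real MarcoPolo output"}

# Write the placeholder to a temporary file, then read it back.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.markers_result.pickle")
with open(demo_path, "wb") as f:
    pickle.dump(dummy_result, f)

restored = load_pickle(demo_path)
print(restored == dummy_result)
```

In the notebook, the same helper could be pointed at `f"{anndata_path}.regression_result.pickle"` or `f"{anndata_path}.markers_result.pickle"`. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.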

## (3) Generate report

In [None]:
# Obtain tSNE coordinates if they do not already exist in adata.
if "X_tsne" not in adata.obsm.keys():
    sc.tl.tsne(adata=adata)

In [None]:
# (3) Generate report
MarcoPolo.generate_report(adata=adata, size_factor_key="size_factor", 
                          regression_result=regression_result, 
                          gene_scores=markers_result, 
                          output_dir="./",  
                          low_dim_key="X_tsne",
                          cell_color_key="cell_type",
                          gene_info_path="https://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz")

# Download report

## Compress the report folder

In [None]:
!tar -zcvf report.tar.gz report
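
Outside Colab, where the `!tar` shell command may not be available, the same archive can be built with Python's standard `tarfile` module. This is a sketch under assumptions: `compress_folder` is a hypothetical helper, and the temporary `report` folder with a single `index.html` file is a stand-in for the real report directory.

```python
import os
import tarfile
import tempfile


def compress_folder(folder, archive_path):
    """Write `folder` to a gzip-compressed tar archive and return its member names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store paths relative to the folder name, as `tar -zcvf` does here.
        tar.add(folder, arcname=os.path.basename(os.path.normpath(folder)))
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Stand-in report folder; the file name is a placeholder.
work_dir = tempfile.mkdtemp()
demo_folder = os.path.join(work_dir, "report")
os.makedirs(demo_folder)
with open(os.path.join(demo_folder, "index.html"), "w") as f:
    f.write("<html></html>")

names = compress_folder(demo_folder, os.path.join(work_dir, "report.tar.gz"))
print(names)
```

Listing the member names after writing is a quick way to confirm that the archive contains the report before downloading it.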

## Trigger download

In [None]:
from google.colab import files

files.download('report.tar.gz')