# <center><title>Single-Cell Analysis & Exploration Notebook</title></center>

## <center>Preface</center>  
<p style="font-size:20px;">This Notebook provides:</p>
<ul>
    <li style="font-size:16px;">an interactive way to process, analyze, and view single-cell RNA sequencing (scRNA-seq) data</li>
    <li style="font-size:16px;">suggested steps that follow emerging standard practices for single-cell analysis</li>
    <li style="font-size:16px;">guided control over how each step is implemented</li>
</ul>

<p>  
    <em>Currently based in a Python environment.  
    </em>  
</p>  
<p>  
    <em>Contact earezza@ohri.ca for any assistance required.  
    </em>  
</p> 

<div class="alert alert-block alert-info">  
    <p>  
        <em>  
            <center>Note 1:</center>  
            <p>This notebook assumes that the raw sequencing data have already been preprocessed into a count matrix. The analysis begins from that count matrix.  
            </p>  
        </em>  
    </p>  
</div>  


<div class="alert alert-block alert-warning">  
    <p>  
        <em>  
            <center>Note 2:</center>  
            <p>Although the framework below suggests a sequence of steps, it is up to <strong>you</strong> to decide which steps or values to change or skip, based on your knowledge of how the data were obtained and your overall analysis goals.  
            </p>  
        </em>  
    </p>  
</div>  


### <div class="alert alert-block alert-danger"><center>DO NOT MAKE CHANGES TO THIS NOTEBOOK, USE ONLY AS A REFERENCE TEMPLATE!</center><br><center>ON THE JUPYTER HOME PAGE CLICK THE CHECKBOX BESIDE THIS NOTEBOOK, THEN CLICK DUPLICATE (change name if desired) AND OPEN THE NEW NOTEBOOK TO WORK IN</center></div>  

#### Reference Docs  
> <a href="https://docs.jupyter.org/en/latest/"><strong>Jupyter Notebooks</strong></a>   
> <a href="https://scanpy.readthedocs.io/en/stable/index.html"><strong>Scanpy</strong></a>  
> <a href="https://github.com/dpeerlab/Palantir/"><strong>Palantir</strong></a>  
> <a href="https://scvelo.readthedocs.io/getting_started/#alignment"><strong>scVelo</strong></a> (_Currently not supported_)  

##### Additional Resources  
> <a href="https://www.10xgenomics.com/resources/analysis-guides">_10X Genomics Analysis Guides_</a>  
> <a href="https://www.10xgenomics.com/resources/analysis-guides/web-resources-for-cell-type-annotation">_10X Genomics Web Resources for Cell Type Annotation_</a>  
> <a href="https://www.singlecellcourse.org/">_University of Cambridge scRNAseq Course_</a>  
> <a href="https://satijalab.org/seurat/">_Seurat_</a>  
> <a href="https://bioconductor.org/books/3.14/OSCA.workflows/">_Examples in R_</a>  
> <a href="https://blog.bioturing.com/2022/06/13/single-cell-rna-seq-trajectory-analysis-review/">_Single Cell Trajectory Analysis_</a>  

## Single-Cell Analysis Workflow  
<ol>
    <li>Data Loading & Preprocessing</li>
    <li>Data Cleaning & Quality Control</li>
    <li>Normalization & Batch Correction</li>
    <li>Dimensionality Reduction & Clustering</li>
    <li>Differential Gene Expression & Cell Type Annotation</li>
    <li>Trajectory Inference</li>
</ol>

#### Double-click this cell for Anaconda environment requirements  
<style>
<div.input {display:none; 
name: scENV
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - alabaster=0.7.12=py_0
  - anndata=0.8.0=py39hcbf5309_0
  - appdirs=1.4.4=pyhd3eb1b0_0
  - argon2-cffi=21.3.0=pyhd3eb1b0_0
  - argon2-cffi-bindings=21.2.0=py39h2bbff1b_0
  - arrow=1.2.2=pyhd3eb1b0_0
  - astroid=2.6.6=py39haa95532_0
  - asttokens=2.0.5=pyhd3eb1b0_0
  - atomicwrites=1.4.0=py_0
  - attrs=21.4.0=pyhd3eb1b0_0
  - autopep8=1.6.0=pyhd3eb1b0_0
  - babel=2.10.3=pyhd8ed1ab_0
  - backcall=0.2.0=pyhd3eb1b0_0
  - bcrypt=3.2.0=py39h196d8e1_0
  - beautifulsoup4=4.11.1=py39haa95532_0
  - binaryornot=0.4.4=pyhd3eb1b0_1
  - black=19.10b0=py_0
  - bleach=4.1.0=pyhd3eb1b0_0
  - bokeh=2.4.2=py39haa95532_1
  - brotli=1.0.9=h8ffe710_7
  - brotli-bin=1.0.9=h8ffe710_7
  - brotlipy=0.7.0=py39hb82d6ee_1004
  - ca-certificates=2022.6.15=h5b45459_0
  - cached-property=1.5.2=hd8ed1ab_1
  - cached_property=1.5.2=pyha770c72_1
  - certifi=2022.6.15=py39haa95532_0
  - cffi=1.15.0=py39h0878f49_0
  - chardet=4.0.0=py39haa95532_1003
  - charset-normalizer=2.0.12=pyhd8ed1ab_0
  - click=8.0.4=py39haa95532_0
  - cloudpickle=2.0.0=pyhd3eb1b0_0
  - colorama=0.4.5=pyhd8ed1ab_0
  - cookiecutter=1.7.3=pyhd3eb1b0_0
  - cryptography=37.0.2=py39h7bc7c5c_0
  - cycler=0.11.0=pyhd8ed1ab_0
  - debugpy=1.5.1=py39hd77b12b_0
  - decorator=5.1.1=pyhd3eb1b0_0
  - defusedxml=0.7.1=pyhd3eb1b0_0
  - diff-match-patch=20200713=pyhd3eb1b0_0
  - docutils=0.18.1=py39hcbf5309_1
  - entrypoints=0.4=py39haa95532_0
  - executing=0.8.3=pyhd3eb1b0_0
  - fa2=0.3.5=py39hb82d6ee_1
  - flake8=3.9.2=pyhd3eb1b0_0
  - fonttools=4.33.3=py39hb82d6ee_0
  - freetype=2.10.4=h546665d_1
  - future=0.18.2=py39hcbf5309_5
  - glpk=4.65=h8ffe710_1004
  - h5py=3.7.0=nompi_py39hd4deaf1_100
  - hdf5=1.12.1=nompi_h2a0e4a3_104
  - icu=58.2=ha925a31_3
  - idna=3.3=pyhd8ed1ab_0
  - igraph=0.9.9=h89236e6_0
  - imagesize=1.3.0=pyhd8ed1ab_0
  - importlib-metadata=4.11.4=py39hcbf5309_0
  - importlib_metadata=4.11.4=hd8ed1ab_0
  - inflection=0.5.1=py39haa95532_0
  - intel-openmp=2022.1.0=h57928b3_3787
  - intervaltree=3.1.0=pyhd3eb1b0_0
  - ipykernel=6.9.1=py39haa95532_0
  - ipython=8.3.0=py39haa95532_0
  - ipython_genutils=0.2.0=pyhd3eb1b0_1
  - ipywidgets=7.6.5=pyhd3eb1b0_1
  - isort=5.9.3=pyhd3eb1b0_0
  - jedi=0.18.1=py39haa95532_1
  - jinja2=3.1.2=pyhd8ed1ab_1
  - jinja2-time=0.2.0=pyhd3eb1b0_3
  - joblib=1.1.0=pyhd8ed1ab_0
  - jpeg=9e=h8ffe710_1
  - jsonschema=4.4.0=py39haa95532_0
  - jupyter=1.0.0=py39haa95532_7
  - jupyter_client=6.1.12=pyhd3eb1b0_0
  - jupyter_console=6.4.0=pyhd3eb1b0_0
  - jupyter_core=4.10.0=py39haa95532_0
  - jupyterlab_pygments=0.1.2=py_0
  - jupyterlab_widgets=1.0.0=pyhd3eb1b0_1
  - keyring=23.4.0=py39haa95532_0
  - kiwisolver=1.4.3=py39h2e07f2f_0
  - krb5=1.19.3=h1176d77_0
  - lazy-object-proxy=1.6.0=py39h2bbff1b_0
  - lcms2=2.12=h2a16943_0
  - leidenalg=0.8.10=py39h5cc5705_0
  - lerc=3.0=h0e60522_0
  - libblas=3.9.0=15_win64_mkl
  - libbrotlicommon=1.0.9=h8ffe710_7
  - libbrotlidec=1.0.9=h8ffe710_7
  - libbrotlienc=1.0.9=h8ffe710_7
  - libcblas=3.9.0=15_win64_mkl
  - libcurl=7.83.1=h789b8ee_0
  - libdeflate=1.12=h8ffe710_0
  - libiconv=1.16=he774522_0
  - liblapack=3.9.0=15_win64_mkl
  - liblapacke=3.9.0=15_win64_mkl
  - libpng=1.6.37=h1d00b33_2
  - libspatialindex=1.9.3=h6c2663c_0
  - libssh2=1.10.0=h680486a_2
  - libtiff=4.4.0=h2ed3b44_1
  - libwebp=1.2.2=h57928b3_0
  - libwebp-base=1.2.2=h8ffe710_1
  - libxcb=1.13=hcd874cb_1004
  - libxml2=2.9.14=hf5bbc77_0
  - libzlib=1.2.12=h8ffe710_1
  - llvmlite=0.38.1=py39ha0cd8c8_0
  - loompy=2.0.16=py_0
  - louvain=0.7.1=py39h5cc5705_2
  - lz4-c=1.9.3=h8ffe710_1
  - m2w64-gcc-libgfortran=5.3.0=6
  - m2w64-gcc-libs=5.3.0=7
  - m2w64-gcc-libs-core=5.3.0=7
  - m2w64-gmp=6.1.0=2
  - m2w64-libwinpthread-git=5.0.0.4634.697f757=2
  - markupsafe=2.1.1=py39hb82d6ee_1
  - matplotlib-base=3.5.2=py39h581301d_0
  - matplotlib-inline=0.1.2=pyhd3eb1b0_2
  - mccabe=0.6.1=py39haa95532_1
  - mistune=0.8.4=py39h2bbff1b_1000
  - mkl=2022.1.0=h6a75c08_874
  - mpir=3.0.0=he025d50_1002
  - msys2-conda-epoch=20160418=1
  - multicore-tsne=0.1_d4ff4aab=py39hefe7e4c_2
  - munkres=1.1.4=pyh9f0ad1d_0
  - mypy_extensions=0.4.3=py39haa95532_1
  - natsort=8.1.0=pyhd8ed1ab_0
  - nbclient=0.5.13=py39haa95532_0
  - nbconvert=6.4.4=py39haa95532_0
  - nbformat=5.3.0=py39haa95532_0
  - nest-asyncio=1.5.5=py39haa95532_0
  - networkx=2.8.4=pyhd8ed1ab_0
  - notebook=6.4.11=py39haa95532_0
  - numba=0.55.2=py39hb8cd55e_0
  - numpy=1.22.3=py39h0948cea_2
  - numpydoc=1.2=pyhd3eb1b0_0
  - openjpeg=2.4.0=hb211442_1
  - openssl=1.1.1q=h8ffe710_0
  - packaging=21.3=pyhd8ed1ab_0
  - pandas=1.4.3=py39h2e25243_0
  - pandocfilters=1.5.0=pyhd3eb1b0_0
  - paramiko=2.8.1=pyhd3eb1b0_0
  - parso=0.8.3=pyhd3eb1b0_0
  - pathspec=0.7.0=py_0
  - patsy=0.5.2=pyhd8ed1ab_0
  - pexpect=4.8.0=pyhd3eb1b0_3
  - pickleshare=0.7.5=pyhd3eb1b0_1003
  - pillow=9.1.1=py39ha53f419_1
  - pip=21.2.4=py39haa95532_0
  - pluggy=1.0.0=py39haa95532_1
  - poyo=0.5.0=pyhd3eb1b0_0
  - progressbar2=4.0.0=pyhd8ed1ab_0
  - prometheus_client=0.13.1=pyhd3eb1b0_0
  - prompt-toolkit=3.0.20=pyhd3eb1b0_0
  - prompt_toolkit=3.0.20=hd3eb1b0_0
  - pthread-stubs=0.4=hcd874cb_1001
  - ptyprocess=0.7.0=pyhd3eb1b0_2
  - pure_eval=0.2.2=pyhd3eb1b0_0
  - pycodestyle=2.7.0=pyhd3eb1b0_0
  - pycparser=2.21=pyhd8ed1ab_0
  - pydocstyle=6.1.1=pyhd3eb1b0_0
  - pyflakes=2.3.1=pyhd3eb1b0_0
  - pygam=0.8.0=py_0
  - pygments=2.12.0=pyhd8ed1ab_0
  - pylint=2.9.6=py39haa95532_1
  - pyls-spyder=0.4.0=pyhd3eb1b0_0
  - pynacl=1.4.0=py39hbd8134f_1
  - pynndescent=0.5.7=pyh6c4a22f_0
  - pyopenssl=22.0.0=pyhd8ed1ab_0
  - pyparsing=3.0.9=pyhd8ed1ab_0
  - pyqt=5.9.2=py39hd77b12b_6
  - pyrsistent=0.18.0=py39h196d8e1_0
  - pysocks=1.7.1=py39hcbf5309_5
  - python=3.9.12=h6244533_0
  - python-dateutil=2.8.2=pyhd8ed1ab_0
  - python-fastjsonschema=2.15.1=pyhd3eb1b0_0
  - python-igraph=0.9.11=py39h4a3397e_0
  - python-lsp-black=1.0.0=pyhd3eb1b0_0
  - python-lsp-jsonrpc=1.0.0=pyhd3eb1b0_0
  - python-lsp-server=1.2.4=pyhd3eb1b0_0
  - python-slugify=5.0.2=pyhd3eb1b0_0
  - python-utils=3.3.3=pyhd8ed1ab_0
  - python_abi=3.9=2_cp39
  - pytz=2022.1=pyhd8ed1ab_0
  - pywin32=302=py39h2bbff1b_2
  - pywin32-ctypes=0.2.0=py39haa95532_1000
  - pywinpty=2.0.2=py39h5da7b33_0
  - pyyaml=6.0=py39h2bbff1b_1
  - pyzmq=22.3.0=py39hd77b12b_2
  - qdarkstyle=3.0.2=pyhd3eb1b0_0
  - qstylizer=0.1.10=pyhd3eb1b0_0
  - qt=5.9.7=vc14h73c81de_0
  - qtawesome=1.0.3=pyhd3eb1b0_0
  - qtconsole=5.3.0=pyhd3eb1b0_0
  - qtpy=2.0.1=pyhd3eb1b0_0
  - regex=2022.3.15=py39h2bbff1b_0
  - requests=2.28.0=pyhd8ed1ab_1
  - rope=0.22.0=pyhd3eb1b0_0
  - rtree=0.9.7=py39h2eaa2aa_1
  - scanpy=1.9.1=pyhd8ed1ab_0
  - scikit-learn=1.1.1=py39he931e04_0
  - scipy=1.8.1=py39hc0c34ad_0
  - scvelo=0.2.4=pyhdfd78af_0
  - seaborn=0.11.2=hd8ed1ab_0
  - seaborn-base=0.11.2=pyhd8ed1ab_0
  - send2trash=1.8.0=pyhd3eb1b0_1
  - session-info=1.0.0=pyhd8ed1ab_0
  - setuptools=61.2.0=py39haa95532_0
  - sip=4.19.13=py39hd77b12b_0
  - six=1.16.0=pyh6c4a22f_0
  - snowballstemmer=2.2.0=pyhd8ed1ab_0
  - sortedcontainers=2.4.0=pyhd3eb1b0_0
  - soupsieve=2.3.1=pyhd3eb1b0_0
  - sphinx=5.0.2=pyh6c4a22f_0
  - sphinxcontrib-applehelp=1.0.2=py_0
  - sphinxcontrib-devhelp=1.0.2=py_0
  - sphinxcontrib-htmlhelp=2.0.0=pyhd8ed1ab_0
  - sphinxcontrib-jsmath=1.0.1=py_0
  - sphinxcontrib-qthelp=1.0.3=py_0
  - sphinxcontrib-serializinghtml=1.1.5=pyhd8ed1ab_2
  - spyder=5.1.5=py39haa95532_1
  - spyder-kernels=2.1.3=py39haa95532_0
  - sqlite=3.38.5=h2bbff1b_0
  - stack_data=0.2.0=pyhd3eb1b0_0
  - statsmodels=0.13.2=py39h5d4886f_0
  - stdlib-list=0.7.0=py_2
  - suitesparse=5.4.0=h5d0cbe0_1
  - tbb=2021.5.0=h2d74725_1
  - terminado=0.13.1=py39haa95532_0
  - testpath=0.6.0=py39haa95532_0
  - text-unidecode=1.3=pyhd3eb1b0_0
  - textdistance=4.2.1=pyhd3eb1b0_0
  - texttable=1.6.4=pyhd8ed1ab_0
  - threadpoolctl=3.1.0=pyh8a188c0_0
  - three-merge=0.1.1=pyhd3eb1b0_0
  - tinycss=0.4=pyhd3eb1b0_1002
  - tk=8.6.12=h8ffe710_0
  - toml=0.10.2=pyhd3eb1b0_0
  - tornado=6.1=py39h2bbff1b_0
  - tqdm=4.64.0=pyhd8ed1ab_0
  - traitlets=5.1.1=pyhd3eb1b0_0
  - typed-ast=1.4.3=py39h2bbff1b_1
  - typing=3.10.0.0=py39haa95532_0
  - typing-extensions=4.1.1=hd3eb1b0_0
  - typing_extensions=4.1.1=pyh06a4308_0
  - ujson=5.1.0=py39hd77b12b_0
  - umap-learn=0.5.3=py39hcbf5309_0
  - unicodedata2=14.0.0=py39hb82d6ee_1
  - unidecode=1.2.0=pyhd3eb1b0_0
  - urllib3=1.26.9=pyhd8ed1ab_0
  - vc=14.2=h21ff451_1
  - vs2015_runtime=14.27.29016=h5e58377_2
  - watchdog=2.1.6=py39haa95532_0
  - wcwidth=0.2.5=pyhd3eb1b0_0
  - webencodings=0.5.1=py39haa95532_1
  - wheel=0.37.1=pyhd3eb1b0_0
  - widgetsnbextension=3.5.2=py39haa95532_0
  - win_inet_pton=1.1.0=py39hcbf5309_4
  - wincertstore=0.2=py39haa95532_2
  - winpty=0.4.3=4
  - wrapt=1.12.1=py39h196d8e1_1
  - xorg-libxau=1.0.9=hcd874cb_0
  - xorg-libxdmcp=1.1.3=hcd874cb_0
  - xz=5.2.5=h62dcd97_1
  - yaml=0.2.5=he774522_0
  - yapf=0.31.0=pyhd3eb1b0_0
  - zipp=3.8.0=pyhd8ed1ab_0
  - zlib=1.2.12=h8ffe710_1
  - zstd=1.5.2=h6255e5f_1
  - pip:
    - cmake==3.22.5
    - cython==0.29.30
    - fcsparser==0.2.4
    - numexpr==2.8.3
    - palantir==1.0.1
    - phenograph==1.5.7
    - psutil==5.9.1
    - pytz-deprecation-shim==0.1.0.post0
    - tables==3.7.0
    - tzdata==2022.1
    - tzlocal==4.2
           }></div>
    </style>
           

#### Double-click this cell for Virtualenv environment requirements (remove +computecanada if on local machine)  
<style>
<div.input {display:none; 
anndata==0.7.5+computecanada  
arff==0.9+computecanada  
argon2_cffi==21.3.0+computecanada  
argon2_cffi_bindings==21.2.0+computecanada  
async_generator==1.10+computecanada  
attrs==21.4.0+computecanada  
backcall==0.2.0+computecanada  
backports.shutil_get_terminal_size==1.0.0+computecanada  
backports.zoneinfo==0.2.1  
backports_abc==0.5+computecanada  
bcrypt==3.2.0+computecanada  
bitstring==3.1.9+computecanada  
bleach==4.1.0+computecanada  
bokeh==2.4.2+computecanada  
certifi==2021.10.8+computecanada  
cffi==1.15.0+computecanada  
chardet==4.0.0+computecanada  
charset_normalizer==2.0.11+computecanada  
click==8.1.3+computecanada  
cmake==3.21.3+computecanada  
cryptography==36.0.1+computecanada  
cycler==0.11.0+computecanada  
Cython==0.29.27+computecanada  
deap==1.0.1+computecanada  
debugpy==1.5.1+computecanada  
decorator==5.1.1+computecanada  
defusedxml==0.7.1+computecanada  
dnspython==2.2.0+computecanada  
ecdsa==0.17.0+computecanada  
entrypoints==0.4+computecanada  
fa2==0.3.5+computecanada  
fcsparser==0.2.4  
fonttools==4.29.1+computecanada  
forceatlas2-python==1.1+computecanada  
funcsigs==1.0.2+computecanada  
future==0.18.2+computecanada  
h5py==3.6.0+computecanada  
idna==3.3+computecanada  
igraph==0.9.11  
importlib_metadata==4.10.1+computecanada  
importlib_resources==5.4.0+computecanada  
iniconfig==1.1.1+computecanada  
ipykernel==6.0.3+computecanada  
ipython==7.31.1+computecanada  
ipython_genutils==0.2.0+computecanada  
ipywidgets==7.6.5+computecanada  
jedi==0.18.1+computecanada  
Jinja2==3.1.2+computecanada  
joblib==1.1.0+computecanada  
jsonschema==4.4.0+computecanada  
jupyter==1.0.0+computecanada  
jupyter-client==6.1.12+computecanada  
jupyter-console==6.4.0+computecanada  
jupyter_core==4.9.1+computecanada  
jupyterlab_pygments==0.1.2+computecanada  
jupyterlab_widgets==1.0.2+computecanada  
kiwisolver==1.3.2+computecanada  
leidenalg==0.8.10  
llvmlite==0.38.1+computecanada  
lockfile==0.12.2+computecanada  
loompy==3.0.6+computecanada  
louvain==0.7.1  
MarkupSafe==2.0.1+computecanada  
matplotlib==3.5.1+computecanada  
matplotlib_inline==0.1.3+computecanada  
mistune==0.8.4+computecanada  
mock==4.0.3+computecanada  
mpmath==1.2.1+computecanada  
natsort==8.1.0+computecanada  
nbclient==0.5.10+computecanada  
nbconvert==6.4.2+computecanada  
nbformat==5.1.3+computecanada  
nest_asyncio==1.5.4+computecanada  
netaddr==0.8.0+computecanada  
netifaces==0.11.0+computecanada  
networkx==2.8.4+computecanada  
nose==1.3.7+computecanada  
notebook==6.4.8+computecanada  
numba==0.55.2+computecanada  
numexpr==2.8.1+computecanada  
numpy==1.22.2+computecanada  
numpy-groupies==0.9.10+computecanada  
packaging==21.3+computecanada  
palantir==1.0.1  
pandas==1.4.1+computecanada  
pandocfilters==1.5.0+computecanada  
paramiko==2.9.2+computecanada  
parso==0.8.3+computecanada  
path==16.3.0+computecanada  
path.py==12.5.0+computecanada  
pathlib2==2.3.6+computecanada  
patsy==0.5.2+computecanada  
paycheck==1.0.2+computecanada  
pbr==5.8.1+computecanada  
pexpect==4.8.0+computecanada  
PhenoGraph==1.5.7  
pickleshare==0.7.5+computecanada  
Pillow==9.0.1+computecanada  
pluggy==1.0.0+computecanada  
progressbar2==4.0.0  
prometheus_client==0.13.1+computecanada  
prompt_toolkit==3.0.26+computecanada  
psutil==5.9.0+computecanada  
ptyprocess==0.7.0+computecanada  
py==1.11.0+computecanada  
pycparser==2.21+computecanada  
pygam==0.8.0  
Pygments==2.11.2+computecanada  
PyNaCl==1.5.0+computecanada  
pynndescent==0.5.2+computecanada  
pyparsing==3.0.9+computecanada  
pyrsistent==0.18.1+computecanada  
pysam==0.19.1+computecanada  
pytest==7.1.2+computecanada  
python-dateutil==2.8.2+computecanada  
python-utils==3.3.3  
pytz==2022.1+computecanada  
pytz-deprecation-shim==0.1.0.post0  
PyYAML==6.0+computecanada  
pyzmq==22.3.0+computecanada  
qtconsole==5.1.1+computecanada  
QtPy==2.1.0+computecanada  
requests==2.27.1+computecanada  
rpy2==3.1.0+computecanada  
scanpy==1.8.2+computecanada  
scikit_learn==1.0.2+computecanada  
scipy==1.8.0+computecanada  
scvelo==0.2.2+computecanada  
seaborn==0.11.2+computecanada  
Send2Trash==1.8.0+computecanada  
simplegeneric==0.8.1+computecanada  
sinfo==0.3.1+computecanada  
singledispatch==3.7.0+computecanada  
six==1.16.0+computecanada  
statsmodels==0.13.2+computecanada  
stdlib-list==0.7.0+computecanada  
sympy==1.9+computecanada  
tables==3.7.0+computecanada  
terminado==0.13.1+computecanada  
testpath==0.5.0+computecanada  
texttable==1.6.4+computecanada  
threadpoolctl==3.1.0+computecanada  
tomli==2.0.1+computecanada  
tornado==6.1+computecanada  
tqdm==4.64.0+computecanada  
traitlets==5.0.5+computecanada  
typing_extensions==4.2.0+computecanada  
tzdata==2022.1  
tzlocal==4.2  
umap-learn==0.5.1+computecanada  
urllib3==1.26.8+computecanada  
velocyto==0.17.17  
wcwidth==0.2.5+computecanada  
webencodings==0.5.1+computecanada  
widgetsnbextension==3.5.2+computecanada  
zipp==3.7.0+computecanada  
}>
    </div>
</style>

___  
# <center>BEGIN</center>  
___  

<p style="color:black;font-size:16px">Notebook Color Legend:  
    <ul>
        <li style="color:black">BLACK - Descriptions, hints, etc...</li>
        <li style="color:green">GREEN - The following step changes/adds/removes variables in the current data object</li>  
        <li style="color:darkviolet">VIOLET - The following step is for visualization/inspection and does not change the current data object</li>  
    </ul>
</p>

## Load libraries/packages/modules

In [None]:
import itertools
import scanpy as sc
import seaborn as sns
import pandas as pd
import numpy as np
import anndata as ad
import palantir
#import scvelo as scv # Currently not supported
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

Enable notebook visualizations

In [None]:
# for quick static visualization
%matplotlib inline
# for adjustable visualizations
#%matplotlib notebook

### Define some commonly used variables

In [None]:
# Taken from R package Seurat cc.genes
cell_cycle_genes = {'S': ["MCM5", "PCNA", "TYMS", "FEN1", "MCM2", "MCM4", "RRM1", "UNG", "GINS2", "MCM6",
                          "CDCA7", "DTL", "PRIM1", "UHRF1", "MLF1IP", "HELLS", "RFC2", "RPA2", "NASP", "RAD51AP1",
                          "GMNN", "WDR76", "SLBP", "CCNE2", "UBR7", "POLD3", "MSH2", "ATAD2", "RAD51", "RRM2",
                          "CDC45", "CDC6" , "EXO1", "TIPIN", "DSCC1", "BLM", "CASP8AP2", "USP1", "CLSPN", "POLA1",
                          "CHAF1B", "BRIP1", "E2F8"
                          ],
                    'G2M': ["HMGB2", "CDK1", "NUSAP1", "UBE2C", "BIRC5", "TPX2", "TOP2A", "NDC80", "CKS2", "NUF2", "CKS1B",
                            "MKI67", "TMPO", "CENPF", "TACC3", "FAM64A", "SMC4", "CCNB2", "CKAP2L", "CKAP2", "AURKB", "BUB1",
                            "KIF11", "ANP32E", "TUBB4B", "GTSE1", "KIF20B", "HJURP", "CDCA3", "HN1", "CDC20", "TTK", "CDC25C",
                            "KIF2C", "RANGAP1", "NCAPD2", "DLGAP5", "CDCA2", "CDCA8", "ECT2", "KIF23", "HMMR", "AURKA", "PSRC1",
                            "ANLN", "LBR", "CKAP5", "CENPE", "CTCF", "NEK2", "G2E3", "GAS2L3", "CBX5", "CENPA"
                            ]
                    }
cell_cycle_genes_update2019 = {'S': ["MCM5", "PCNA", "TYMS", "FEN1", "MCM7", "MCM4", "RRM1", "UNG", "GINS2", "MCM6",
                          "CDCA7", "DTL", "PRIM1", "UHRF1", "CENPU", "HELLS", "RFC2", "POLR1B", "NASP", "RAD51AP1",
                          "GMNN", "WDR76", "SLBP", "CCNE2", "UBR7", "POLD3", "MSH2", "ATAD2", "RAD51", "RRM2",
                          "CDC45", "CDC6" , "EXO1", "TIPIN", "DSCC1", "BLM", "CASP8AP2", "USP1", "CLSPN", "POLA1",
                          "CHAF1B", "MRPL36", "E2F8"
                          ],
                    'G2M': ["HMGB2", "CDK1", "NUSAP1", "UBE2C", "BIRC5", "TPX2", "TOP2A", "NDC80", "CKS2", "NUF2", "CKS1B",
                            "MKI67", "TMPO", "CENPF", "TACC3", "PIMREG", "SMC4", "CCNB2", "CKAP2L", "CKAP2", "AURKB", "BUB1",
                            "KIF11", "ANP32E", "TUBB4B", "GTSE1", "KIF20B", "HJURP", "CDCA3", "JPT1", "CDC20", "TTK", "CDC25C",
                            "KIF2C", "RANGAP1", "NCAPD2", "DLGAP5", "CDCA2", "CDCA8", "ECT2", "KIF23", "HMMR", "AURKA", "PSRC1",
                            "ANLN", "LBR", "CKAP5", "CENPE", "CTCF", "NEK2", "G2E3", "GAS2L3", "CBX5", "CENPA"
                            ]
                    }
# Format according to data's gene naming
# str.capitalize() upper-cases the first letter and lower-cases the rest (e.g. "PCNA" -> "Pcna")
for gene_sets in (cell_cycle_genes, cell_cycle_genes_update2019):
    for phase in gene_sets:
        gene_sets[phase] = [gene.capitalize() for gene in gene_sets[phase]]

#### Define any marker genes associated with cell types

In [None]:
# Change according to data analysis...e.g. 'Cell Type': ['list', 'of', 'gene', 'markers']
markers = {'MSC': ['Lepr', 'Cxcl12', 'Adipoq', 'Kitl', 'Angpt1', 'Nt5e', 'Vcam1', 'Eng', 'Grem1'],
           'OLC': ['Bglap'],
           'Chondrocyte': ['Col2a1', 'Acan'],
           'Fibroblast': ['S100a4'],
           'BMEC': ['Cdh5'],
           'Pericyte': ['Acta2']
          }
genes = list(itertools.chain(*list(markers.values())))
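
Before plotting or scoring marker genes, it can help to confirm that each symbol actually occurs in the dataset's gene names. The following is a minimal sketch, assuming the gene names are available as a list of strings (for example `list(data.var_names)` once the data is loaded). The helper name `split_markers_by_presence` and the toy inputs are hypothetical and not part of the original workflow.

```python
def split_markers_by_presence(markers, gene_names):
    """Return (found, missing) dicts listing marker genes per cell type."""
    available = set(gene_names)
    found, missing = {}, {}
    for cell_type, gene_list in markers.items():
        found[cell_type] = [g for g in gene_list if g in available]
        missing[cell_type] = [g for g in gene_list if g not in available]
    return found, missing


# Toy inputs (hypothetical); in the notebook use `markers` and list(data.var_names)
example_markers = {'MSC': ['Lepr', 'Cxcl12'], 'OLC': ['Bglap']}
example_gene_names = ['Lepr', 'Bglap', 'Actb']

found, missing = split_markers_by_presence(example_markers, example_gene_names)
print('found:', found)
print('missing:', missing)
```

Markers reported as missing usually point to a naming mismatch (letter case, aliases) rather than absent expression, so they are worth checking against `data.var`.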

___
# <center>Data Loading & Preprocessing</center>  
<center>Read count matrix file(s), add annotations, compute stats</center>  

___

<p style="font-size:16px;"><strong>Choose the appropriate Scanpy function to <a href="https://scanpy.readthedocs.io/en/stable/api.html#reading">read different file formats.</a></strong></p>  

#### If you have a single dataset under investigation  

> Load count matrix  
> ```python  
> data = sc.read_h5ad('dataset.h5ad')  
> ```  

#### If you have multiple datasets (e.g. different sample types, experimental conditions, etc...)  

> Load count matrix for each  
> ```python  
> data_1 = sc.read_h5ad('dataset_1.h5ad')  
> data_2 = sc.read_h5ad('dataset_2.h5ad')  
> data_3 = sc.read_h5ad('dataset_3.h5ad')  
> ```  
> Annotate each sample and combine them into a single dataset  
> ```python  
> data_1.obs['sample'] = 'first'  
> data_2.obs['sample'] = 'second'  
> data_3.obs['sample'] = 'third'  
> data = data_1.concatenate(data_2, data_3)  
> ```  

<p>In this example, two datasets were produced from an experiment: one contains wild-type ("WT") samples and the other contains gene knock-out ("KO") samples. Both were produced as 10X-formatted matrix data (i.e. a filtered_feature_bc_matrix/ directory containing barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz), so we load them with the sc.read_10x_mtx() function.</p>

<p style="color:green;font-size:16px;">Load count matrices</p>  

In [None]:
ko = sc.read_10x_mtx('filtered_feature_bc_matrix_Ko/', cache=True)
wt = sc.read_10x_mtx('filtered_feature_bc_matrix_Wt/', cache=True)                              
ko.var_names_make_unique()  
wt.var_names_make_unique() 

<p style="color:green;font-size:16px;">Annotate samples and combine into a single dataset</p>  

In [None]:
ko.obs['sample'] = "KO"
wt.obs['sample'] = "WT"
data = ko.concatenate(wt)
del ko, wt  # delete no longer used variables to save memory  

<p style="color:darkviolet">Show annotated data where n_obs = number of cells and n_vars = number of genes</p>

In [None]:
data

<p style="color:darkviolet">Peek at the data values</p>  

In [None]:
data.obs

In [None]:
data.var

If cells and genes need to be swapped, uncomment and run the following cell to transpose the data object:

In [None]:
#data = data.T.copy() # transpose so that rows are cells (obs) and columns are genes (var)
#data.write('data.h5ad')
#data = sc.read_h5ad('data.h5ad')

<p style="color:green;font-size:16px;">Annotate any groups of genes:  
    <ul style="color:green;">  
        <li>Mitochondrial genes as 'mt'</li>  
        <li>Ribosomal genes as 'ribo'</li>  
        <li>Etc...</li>  
    </ul>  
</p>  
(Check how these genes are named in data.var above and match the prefix exactly, including letter case; e.g. mitochondrial genes here start with 'MT-', not 'mito-' or 'mt-')

In [None]:
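data.var['mt'] = data.var_names.str.startswith('MT-')
# 'ribo-' is a placeholder prefix; ribosomal protein genes are commonly named RPS*/RPL* (Rps*/Rpl* in mouse), so adjust it to match data.var
data.var['ribo'] = data.var_names.str.startswith('ribo-')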
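
Because `str.startswith` is case-sensitive, a prefix that does not match the dataset's naming silently flags no genes. The sketch below illustrates this on a toy pandas `Index`. The gene names are hypothetical stand-ins for `data.var_names`, and the case-insensitive variant is only one possible workaround.

```python
import pandas as pd

# Toy gene names (hypothetical) mixing two naming conventions
var_names = pd.Index(['mt-Nd1', 'MT-CO1', 'Rps3', 'Actb'])

# Exact, case-sensitive prefix match (what the cell above does)
exact_mt = var_names.str.startswith('MT-')

# Case-insensitive match: upper-case the names before comparing
any_case_mt = var_names.str.upper().str.startswith('MT-')

print('exact matches:', exact_mt.sum())
print('case-insensitive matches:', any_case_mt.sum())
```

If a group reports zero matches, inspect `data.var` before concluding that those genes are absent from the data.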

<p style="color:green;font-size:16px;">Calculate quality control metrics for dataset (and independently for any gene groups in qc_vars)</p>  

In [None]:
sc.pp.calculate_qc_metrics(data, qc_vars=['mt', 'ribo'], percent_top=None, log1p=False, inplace=True) # Insert any names of groups defined above
#sc.pp.calculate_qc_metrics(data, percent_top=None, log1p=False, inplace=True) # If no gene groups defined, run this instead
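
To make the resulting `data.obs` columns easier to interpret, here is a minimal NumPy sketch of the per-cell quantities used later in this notebook (`total_counts`, `n_genes_by_counts`, `pct_counts_mt`). It works on a small dense toy matrix and is an illustration of the definitions, not Scanpy's implementation.

```python
import numpy as np

# Toy count matrix: rows = cells, columns = genes (hypothetical values)
counts = np.array([[5, 0, 3, 2],
                   [0, 0, 1, 9],
                   [4, 4, 0, 2]])
is_mt = np.array([False, False, False, True])  # last gene flagged as mitochondrial

total_counts = counts.sum(axis=1)                 # all counts per cell
n_genes_by_counts = (counts > 0).sum(axis=1)      # genes with at least one count per cell
pct_counts_mt = 100 * counts[:, is_mt].sum(axis=1) / total_counts  # share of counts in the 'mt' group

print(total_counts, n_genes_by_counts, pct_counts_mt)
```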

___
# <center>Data Cleaning & Quality Control</center>  
<center>Plot and view data, filter out cells and genes</center>  

___

<p style="color:darkviolet">Observe the current state of data object</p>  

In [None]:
data

<p style="color:darkviolet;">View overall sequence depth quality of data</p>  
(Ideally, points follow the diagonal regression line, indicating that the number of genes detected scales consistently with sequencing depth across cells)

In [None]:
sns.jointplot(
    data=data.obs,
    x="total_counts",
    y="n_genes_by_counts",
    kind="reg",
    marginal_ticks=True,
    color='indigo',
    height=7, # changes size of plot
).fig.suptitle("%s cells"%data.obs.shape[0])

<p style="color:darkviolet;">Observe percent of counts that are mitochondrial genes expressed</p>  
(Here, mitochondrial genes have already been filtered from the raw data beforehand, hence the 0% across data points)

In [None]:
sns.scatterplot(data=data.obs, x='total_counts', y='pct_counts_mt', color='indigo')

<p style="color:darkviolet;">Observe percent of counts that are ribosomal genes expressed</p>

In [None]:
sns.scatterplot(data=data.obs, x='total_counts', y='pct_counts_ribo', color='indigo')

<p style="color:darkviolet;">Repeat for any other grouped genes expressed, if annotated above (change "GROUP" in y value)</p>

In [None]:
# sns.scatterplot(data=data.obs, x='total_counts', y='pct_counts_GROUP', color='indigo')

<p style="color:darkviolet;">Observe distribution of cells by their gene counts</p>

In [None]:
sc.pl.violin(data, ['n_genes_by_counts', 'total_counts'], jitter=0.4, multi_panel=True, color='indigo')

<p style="font-size:16px">Filter out low-quality data points produced by inevitable instrumentation/procedural/technical flaws when the data was obtained:
    <ul>
        <li>Remove droplets captured that contain dead cells characterized by high mitochondrial gene expression counts.</li>  
        <li>Remove empty droplets captured characterized by little/no relative gene expression counts.</li>  
        <li>Remove droplets captured containing multiple cells characterized by very high relative gene expression counts.</li>
        <li>Remove genes expressed by very few cells.</li>
        <li>Remove any other specific genes from the dataset.</li>
    </ul>
</p>

<p style="color:green;font-size:16px;">Remove data points of dead cells</p>  
(Based on visual inspection of the above scatter plot...)


In [None]:
data = data[data.obs.pct_counts_mt < 5, :].copy() # keeps all cells where < 5% of expression counts are mitochondrial genes; .copy() turns the view into an independent object

<p style="color:green;font-size:16px;">Remove data points similarly for other annotated gene groups, as desired.</p>  
(Based on visual inspection of above scatter plot...)

In [None]:
# data = data[data.obs.pct_counts_ribo > 5, :] # keeps all cells where > 5% of expression counts are ribosomal genes

<p style="color:green;font-size:16px;">Remove data points with exceptionally low and high gene counts
</p>
(Based on visual inspection of above scatter plot...)  

Instead of choosing thresholds from the plots above, here we filter on quantiles of the number of genes detected per cell, e.g. removing cells below the 1st percentile and above the 99th percentile.

In [None]:
min_num_genes = data.obs['n_genes_by_counts'].quantile(q=0.01, interpolation='lower')
max_num_genes = data.obs['n_genes_by_counts'].quantile(q=0.99, interpolation='higher')
print('Minimum number of genes = %s'%min_num_genes)
print('Maximum number of genes = %s'%max_num_genes)

In [None]:
sc.pp.filter_cells(data, min_genes=min_num_genes) # Remove cells with fewer than min_num_genes genes detected
sc.pp.filter_cells(data, max_genes=max_num_genes) # Remove cells with more than max_num_genes genes detected
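
The quantile-based thresholds can be checked on toy data. This sketch assumes that cells whose gene count equals a threshold are kept (inclusive bounds). The series `n_genes` is a hypothetical stand-in for `data.obs['n_genes_by_counts']`.

```python
import numpy as np
import pandas as pd

# Toy "genes detected per cell" values: 1, 2, ..., 1000 (hypothetical)
n_genes = pd.Series(np.arange(1, 1001))

# 'lower' / 'higher' pick an observed value instead of interpolating between two
low = n_genes.quantile(q=0.01, interpolation='lower')
high = n_genes.quantile(q=0.99, interpolation='higher')

# Keep cells inside the inclusive [low, high] range
kept = n_genes[(n_genes >= low) & (n_genes <= high)]
print(low, high, len(kept))
```

Note that with `'lower'` and `'higher'` the thresholds round outward, so slightly fewer than 2% of cells are removed. On small datasets the thresholds can coincide with the minimum and maximum, in which case nothing is filtered.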

<p style="color:green;font-size:16px;">Remove genes with low number of cells expressing them
</p>

In [None]:
sc.pp.filter_genes(data, min_cells=3) # Filter out genes with < 3 cells expressing them
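
As an illustration of what the `min_cells=3` criterion means, the sketch below counts, for each gene, the cells with at least one count and keeps genes that reach the threshold. It uses a dense toy matrix with hypothetical values and is not Scanpy's implementation.

```python
import numpy as np

# Toy count matrix: rows = cells, columns = genes (hypothetical values)
counts = np.array([[1, 0, 0],
                   [2, 0, 1],
                   [3, 0, 0],
                   [0, 5, 0]])
min_cells = 3

n_cells_per_gene = (counts > 0).sum(axis=0)  # number of cells expressing each gene
keep_gene = n_cells_per_gene >= min_cells    # genes expressed in at least min_cells cells
filtered = counts[:, keep_gene]

print(n_cells_per_gene, keep_gene, filtered.shape)
```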

<p style="color:green;font-size:16px;">Remove cell-cycle genes, if desired
</p>

In [None]:
# Collect every cell-cycle gene from both lists, then drop them in a single step
cc_genes = set(itertools.chain(*cell_cycle_genes.values(), *cell_cycle_genes_update2019.values()))
data = data[:, ~data.var_names.isin(cc_genes)].copy()

<p style="color:darkviolet;">Observe new distribution of cells by their gene counts</p>

In [None]:
sc.pl.violin(data, ['n_genes_by_counts', 'total_counts'], jitter=0.4, multi_panel=True, color='indigo')

<p style="color:darkviolet">Observe the current state of data object</p>  

In [None]:
data

___
# <center>Normalization and Batch Correction</center>  
<center>Scale each cell's gene expression counts to a common total so that cells are comparable, which avoids misinterpreting more deeply sequenced cells as differentially expressed downstream.
Apply batch correction to reduce batch effects between sample sets, e.g. between WT and KO</center>  

___

<p style="color:green;font-size:16px;">Log-Normalize data</p>

In [None]:
sc.pp.normalize_total(data, target_sum=1e4)
sc.pp.log1p(data)
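
The two calls above can be sketched with NumPy on a toy matrix: each cell's counts are rescaled so that they sum to `target_sum`, then transformed with `log(1 + x)`. The values are hypothetical and the dense arithmetic is only an illustration of the idea.

```python
import numpy as np

# Toy count matrix: rows = cells, columns = genes (hypothetical values)
counts = np.array([[1., 3., 0.],
                   [10., 30., 60.]])
target_sum = 1e4

# Rescale every cell (row) so that its counts sum to target_sum
scaled = counts / counts.sum(axis=1, keepdims=True) * target_sum

# Natural log of (1 + x) dampens the influence of highly expressed genes
log_norm = np.log1p(scaled)

print(scaled)
print(log_norm.round(3))
```

After rescaling, the first two genes have a 1:3 ratio in both cells even though the second cell was sequenced ten times more deeply for those genes.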

<p style="color:green;font-size:16px;">Obtain highly variable genes</p>  
<a href="https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html">See API for options</a>  

In [None]:
# sc.pp.highly_variable_genes(data, flavor='seurat', min_mean=0.0125, max_mean=3, min_disp=0.5) # Use if single sample loaded in dataset
sc.pp.highly_variable_genes(data, flavor='seurat', min_mean=0.0125, max_mean=3, min_disp=0.5, batch_key='sample') # Use if multiple samples loaded in dataset

<p style="color:darkviolet">Observe highly variable genes plot</p>  

In [None]:
sc.pl.highly_variable_genes(data)

<p style="color:darkviolet">Observe highly variable genes</p>  
Highly variable genes found between KO and WT samples (batches)

In [None]:
var_genes = data.var[data.var['highly_variable_nbatches'] > 1] # highly variable among >1 batches (here, >1 is both samples, i.e. nbatches=2)
var_genes

<p style="color:darkviolet">Observe current state of data object</p>  

In [None]:
data

<p style="color:green;font-size:16px;">Save filtered and normalized data</p>

In [None]:
data.raw = data
data.write('data_clean.h5ad')

<p style="font-size:16px"><strong>Checkpoint!</strong></p>  
<p>(Skip batch-correction if not necessary)</p>  

<div class="alert alert-block alert-warning">  
    <p>
        <strong>Warning!</strong> Batch correction with ComBat below can consume a large amount of memory (depending on the size of the dataset) and may crash the notebook, losing the progress made so far. In this case, close the notebook and restart.
    </p>
    <p>It is recommended to perform this on a computer/computing platform that can handle this operation.
    </p>
    <p>One suggestion is to send the current "data_clean.h5ad" file to a more powerful computer to perform batch correction, save results to a new file, and re-load the batch corrected file to continue analysis.
    </p>
    <p>The batch-correction step may be skipped at first to determine whether it is needed. This can be judged after visualizing the t-SNE or UMAP plots in the subsequent steps: a strong separation between WT and KO cells can indicate that batch correction should be applied before dimensionality reduction and clustering.
    </p>
</div>
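
A back-of-the-envelope estimate can help decide whether the available machine is likely to cope. The sketch below assumes the correction works on a dense copy of the matrix stored as 64-bit floats; the helper name and the example dimensions are hypothetical, and actual peak usage is typically higher because of intermediate copies.

```python
def dense_matrix_gib(n_cells, n_genes, bytes_per_value=8):
    """Approximate size in GiB of a dense n_cells x n_genes matrix."""
    return n_cells * n_genes * bytes_per_value / 1024**3


# Hypothetical dataset: 50,000 cells x 20,000 genes
estimate = dense_matrix_gib(50_000, 20_000)
print(f"{estimate:.2f} GiB")
```

With a loaded dataset, `data.n_obs` and `data.n_vars` can be passed in place of the example numbers.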

<p style="color:green;font-size:16px;">Load previously saved data</p>

In [None]:
data = sc.read('data_clean.h5ad')

<p style="color:green;font-size:16px;">Store pre-corrected data to raw</p>

In [None]:
data_batch_corrected = sc.AnnData(X=data.raw.X, var=data.raw.var, obs = data.obs)
data_batch_corrected.raw = data_batch_corrected

<p style="color:green;font-size:16px;">Apply batch correction for KO + WT data integration</p>  
<a href="https://scanpy.readthedocs.io/en/stable/external.html#data-integration">See API for methods</a>  

In [None]:
sc.pp.combat(data_batch_corrected, key='sample')

# Other batch correction methods; if used instead of ComBat, these may require additional package installation
# Note: bbknn, harmony_integrate and scanorama_integrate work on a PCA representation, so run sc.tl.pca first
#sc.external.pp.bbknn(data_batch_corrected, batch_key='sample')
#sc.external.pp.harmony_integrate(data_batch_corrected, key='sample')
#sc.external.pp.mnn_correct(data_batch_corrected, batch_key='sample') # mnn_correct expects one AnnData object per batch
#sc.external.pp.scanorama_integrate(data_batch_corrected, key='sample')

<p style="color:green;font-size:16px;">Store log-normalized, unscaled data</p>

In [None]:
data_batch_corrected.raw = data_batch_corrected

<p style="color:green;font-size:16px;">Scale data, to be used in subsequent steps...</p>

In [None]:
sc.pp.scale(data_batch_corrected, max_value=10)
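
For intuition, the sketch below reproduces the idea of scaling with NumPy: each gene is centered to zero mean, divided by its standard deviation, and values above `max_value` are capped. The data are simulated, and this is only an approximation of `sc.pp.scale`; in particular, whether values are also clipped from below depends on the scanpy version, which is not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # 200 cells x 3 genes (simulated)
X[0, 0] = 500.0  # a single extreme outlier

max_value = 10  # same cap as used in the cell above

mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)

scaled = (X - mean) / std                  # zero mean, unit variance per gene
scaled = np.clip(scaled, None, max_value)  # cap large values at max_value

print(scaled.max())
```

Capping limits the influence that a handful of extreme cells can have on PCA and the steps that follow.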

<p style="color:green;font-size:16px;">Save batch-corrected data to file</p>

In [None]:
data_batch_corrected.write('data_batch_corrected.h5ad')

___
# <center>Dimensionality Reduction & Clustering</center>  
<center>Apply PCA, compute neighborhood graph, compute embedding, and cluster cells</center>

___

<p style="font-size:16px"><strong>Checkpoint start!</strong></p>

<p style="color:green;font-size:16px;">Load data from file</p>  

In [None]:
#data = sc.read('data_batch_corrected.h5ad')
data = sc.read('data_clean.h5ad')

<p style="color:darkviolet">Observe current state of data object</p>  

In [None]:
data

<p style="color:green;font-size:16px;">Compute principal components for dimensionality reduction</p>  
<a href="https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.pca.html">See API for options</a>  

In [None]:
sc.tl.pca(data, n_comps=50, svd_solver='arpack', random_state=2022)

<p style="color:darkviolet">Observe PC plot (not necessary)</p>

In [None]:
sc.pl.pca(data)

<p style="color:darkviolet">Observe PCs by their variance ratio</p>  
The elbow of the curve marks the point beyond which additional PCs explain little extra variance in the data

In [None]:
sc.pl.pca_variance_ratio(data, log=True)
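
The quantity plotted above can be sketched with NumPy as follows. The data are simulated with three strong underlying directions, and the 90% cumulative-variance threshold is an arbitrary assumption used only for illustration; it is not a rule taken from scanpy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 300 cells x 20 features driven by 3 latent factors plus noise
latent = rng.normal(size=(300, 3)) * np.array([10.0, 6.0, 3.0])
loadings = rng.normal(size=(3, 20))
X = latent @ loadings + rng.normal(scale=0.5, size=(300, 20))

# PCA via SVD of the centered matrix
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)

variance = singular_values**2 / (X.shape[0] - 1)
variance_ratio = variance / variance.sum()  # fraction of variance per PC
cumulative = np.cumsum(variance_ratio)

# Smallest number of PCs whose cumulative ratio reaches 90% (illustrative threshold)
n_pcs_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(variance_ratio[:5].round(3), n_pcs_90)
```

In the notebook, the corresponding values are stored in `data.uns['pca']['variance_ratio']` after running `sc.tl.pca`.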

<p style="color:green;font-size:16px;">Compute neighborhood graph</p>  
<a href="https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.neighbors.html#scanpy.pp.neighbors">See API for options</a>  

`n_pcs` can be chosen based on the elbow in the plot above (choose a value greater than the elbow point)  

In [None]:
sc.pp.neighbors(data, knn=True, n_neighbors=10, n_pcs=40, random_state=2022)
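
Conceptually, the neighborhood graph links each cell to its nearest cells in PC space. The brute-force sketch below shows that idea on simulated coordinates; it is an assumption-laden simplification, since scanpy uses a faster neighbor search and also converts distances into weighted connectivities.

```python
import numpy as np

rng = np.random.default_rng(0)
pcs = rng.normal(size=(50, 5))  # 50 cells x 5 PCs (simulated)
n_neighbors = 10

# Pairwise Euclidean distances between all cells
diff = pcs[:, None, :] - pcs[None, :, :]
dist = np.sqrt((diff**2).sum(axis=-1))

# Exclude each cell from its own neighbor list
np.fill_diagonal(dist, np.inf)

# Indices of the n_neighbors closest cells, nearest first
knn_idx = np.argsort(dist, axis=1)[:, :n_neighbors]
print(knn_idx[0])
```

A larger `n_neighbors` gives a more connected graph and tends to produce broader, fewer clusters downstream.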

<p style="color:green;font-size:16px;">Embed the neighborhood graph with a 2-D projection</p>  
<a href="https://scanpy.readthedocs.io/en/stable/api.html#embeddings">See API for options</a>  

Choose one of the embeddings below, or run all three for comparison

In [None]:
sc.tl.tsne(data, n_pcs=50, use_fast_tsne=False, random_state=2022) # May take a minute...

In [None]:
sc.tl.umap(data, random_state=2022) # recommended

In [None]:
sc.tl.draw_graph(data, random_state=2022)

<p style="color:green;font-size:16px;">Cluster the graph</p>  
<a href="https://scanpy.readthedocs.io/en/stable/api.html#clustering-and-trajectory-inference">See API for options</a>  

Choose one of the clustering methods below, or run both for comparison

In [None]:
sc.tl.louvain(data, resolution=1.0, random_state=2022) # recommended

In [None]:
sc.tl.leiden(data, resolution=1.0, random_state=2022)

<p style="color:darkviolet">Observe current state of data object</p>  

In [None]:
data

<p style="color:darkviolet">Observe projection(s) of data</p>  

In [None]:
sc.pl.tsne(data, color=["louvain"], legend_loc='on data', color_map='viridis')

In [None]:
sc.pl.umap(data, color=["louvain"], legend_loc='on data', color_map='viridis')

In [None]:
sc.pl.draw_graph(data, color=["louvain"], legend_loc='on data', color_map='viridis')

Provide genes to plot in addition to the desired clustering. Note: if a gene is not present in the data (because it was not measured or was removed by filtering), no plot will be generated.

In [None]:
genes = ['provide', 'list', 'of', 'genes', 'to', 'see', 'here']

<p style="color:darkviolet">Observe gene expression levels</p> 

In [None]:
sc.pl.tsne(data, color=genes, legend_loc='on data', color_map='viridis')

In [None]:
sc.pl.umap(data, color=genes, legend_loc='on data', color_map='viridis')

In [None]:
sc.pl.draw_graph(data, color=genes, legend_loc='on data', color_map='viridis')

In [None]:
sc.pl.violin(data, genes, groupby='louvain')

<p style="font-size:16px"><strong>Checkpoint!</strong></p>

(Optional)  
If you want to extract cells of certain clusters to look at more closely, define clusters below and save to file. Then you can load data and re-run steps as above, e.g. perform dimensionality reduction and clustering again to obtain further localized sub-clusters.

In [None]:
clusters = ['2', '3', '4', '5', '7', '9', '13', '15'] # Change values to desired clusters
sub_data = data[data.obs['louvain'].isin(clusters)]
# Extract matrix
adata = sc.AnnData(sub_data.X)
adata.obs = sc.get.obs_df(sub_data)
adata.var = sc.get.var_df(sub_data)
# Keep global clustering annotations
adata.obs['louvain_global'] = sub_data[sub_data.obs_names.isin(adata.obs_names)].obs['louvain']

In [None]:
# Save to file
adata.write_h5ad('sub_data.h5ad')

___
# <center>Differential Gene Expression & Cell Type Annotation</center>  
<center>Obtain differentially expressed genes between cell clusters, annotate cell types</center>  

___

Obtain differentially expressed genes for each cluster, then use them to define cell types.

<p style="color:green;font-size:16px;">Rank gene expression in each cluster</p>  
<a href="https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.rank_genes_groups.html">See API for options</a>  

In [None]:
sc.tl.rank_genes_groups(data, 'louvain', method='wilcoxon', random_state=2022)

Store gene rankings into DataFrames (can be saved to file later)

In [None]:
ranked_genes_in_clusters = pd.DataFrame(data.uns['rank_genes_groups']['names'])
result = data.uns['rank_genes_groups']
groups = result['names'].dtype.names
groups_pval = pd.DataFrame(
    {group + '_' + key[:1]: result[key][group]
    for group in groups for key in ['names', 'pvals']})

<p style="color:darkviolet">Observe gene rankings in clusters</p>  

In [None]:
ranked_genes_in_clusters.head(n=10) # Top 10 ranked genes for each cluster

In [None]:
groups_pval.head(n=10) # Top 10 ranked genes for each cluster, with p-values

(Optional) Save gene rankings to file

In [None]:
groups_pval.to_csv('ranked_genes.csv', header=True, index=True, sep=',')

<p style="color:darkviolet">Observe gene rankings across clusters</p>  

In [None]:
sc.pl.rank_genes_groups(data, n_genes=25, sharey=False)

<p style="color:green;font-size:16px;">Compute comparison of gene rankings between two groups</p>  

In [None]:
sc.tl.rank_genes_groups(data, 'louvain', groups=['0'], reference='1', method='wilcoxon')

<p style="color:darkviolet">Observe gene rankings of one cluster compared to another</p>

In [None]:
sc.pl.rank_genes_groups(data, groups=['0'], n_genes=20)

In [None]:
sc.pl.rank_genes_groups_violin(data, groups='0', n_genes=8)

(Optional) Define renaming for clusters/groups  
_Note: the number of names in the list must equal the number of clusters, and the names must be listed in the same order as the cluster numbers_  

In [None]:
new_cluster_names = ['CD4 T', 'CD14 Monocytes', 'B', 'CD8 T', 'NK', 
                     'FCGR3A Monocytes', 'Dendritic', 'Megakaryocytes'
                    ]

<p style="color:green;font-size:16px;">(Optional) Annotate cell types</p>  

In [None]:
data.rename_categories('louvain', new_cluster_names)

<p style="color:darkviolet">Observe gene expression in clusters</p>  

In [None]:
sc.pl.dotplot(data, genes, groupby='louvain')

In [None]:
sc.pl.stacked_violin(data, genes, groupby='louvain', rotation=90)

<p style="font-size:16px"><strong>Checkpoint!</strong></p>

___
# <center>Trajectory Inference</center>  
<center>Predict cell differentiation pathways based on gene expression</center>

___

### Using Palantir (recommended)  

<p style="color:green;font-size:16px;">Compute Palantir diffusion mappings, MAGIC imputation, multiscaling</p>  
<a href="https://scanpy.readthedocs.io/en/stable/generated/scanpy.external.tl.palantir.html">See API for options</a>  

In [None]:
sc.external.tl.palantir(data,  n_components=5, knn=30)
dm_res = {'T': data.obsp['palantir_diff_op'].T, 
          'EigenVectors': pd.DataFrame(data.obsm['X_palantir_diff_comp'], index=data.obs_names), 
          'EigenValues': pd.Series(data.uns['palantir_EigenValues']), 
          'kernel': data.obsp['palantir_diff_op']}
ms_data = pd.DataFrame(data.obsm['X_palantir_multiscale'], index=data.obs_names)

<p style="color:darkviolet">Observe current state of data object</p>  

In [None]:
data

<p style="color:green;font-size:16px;">Re-map t-SNE projection using Palantir's multiscale representation</p>  

In [None]:
sc.tl.tsne(data, n_pcs=2, use_rep='X_palantir_multiscale', perplexity=150, random_state=2022)

<p style="color:darkviolet">Observe projection(s) of data with imputations</p>  

In [None]:
sc.pl.tsne(data, layer='palantir_imp', color=["louvain"], legend_loc='on data', color_map='viridis')

In [None]:
sc.pl.umap(data, layer='palantir_imp', color=["louvain"], legend_loc='on data', color_map='viridis')

In [None]:
sc.pl.draw_graph(data, layer='palantir_imp', color=["louvain"], legend_loc='on data', color_map='viridis')

Provide genes to plot in addition to the desired clustering. Note: if a gene is not present in the data (because it was not measured or was removed by filtering), no plot will be generated.

In [None]:
genes = ['provide', 'list', 'of', 'genes', 'to', 'see', 'here']

<p style="color:darkviolet">Observe gene expression levels</p> 

In [None]:
sc.pl.tsne(data, layer='palantir_imp', color=genes, legend_loc='on data', color_map='viridis')

In [None]:
sc.pl.umap(data, layer='palantir_imp', color=genes, legend_loc='on data', color_map='viridis')

In [None]:
sc.pl.draw_graph(data, layer='palantir_imp', color=genes, legend_loc='on data', color_map='viridis')

<p>Store projections into DataFrames (only run those which have been computed above).</p>

In [None]:
tsne = pd.DataFrame(data.obsm['X_tsne'], index=data.obs_names)
tsne.rename(columns={0: 'x', 1: 'y'}, inplace=True)

In [None]:
umap = pd.DataFrame(data.obsm['X_umap'], index=data.obs_names)
umap.rename(columns={0: 'x', 1: 'y'}, inplace=True)

In [None]:
fa = pd.DataFrame(data.obsm['X_draw_graph_fa'], index=data.obs_names)
fa.rename(columns={0: 'x', 1: 'y'}, inplace=True)

<p style="color:darkviolet">Observe diffusion components</p> 

In [None]:
palantir.plot.plot_diffusion_components(tsne, dm_res)

In [None]:
palantir.plot.plot_diffusion_components(umap, dm_res)

In [None]:
palantir.plot.plot_diffusion_components(fa, dm_res)

<p>Define the "early" cell to be used as the start of the trajectory</p>  
<p>Here, we take the cell with the highest Sox9 gene expression</p>

In [None]:
early_cell = pd.DataFrame(data[:, ['Sox9']].layers['palantir_imp'], index=data.obs_names, columns=['Sox9']).idxmax().values[0] # for max expression
#early_cell = pd.DataFrame(data[:, ['Sox9']].layers['palantir_imp'], index=data.obs_names, columns=['Sox9']).idxmin().values[0] # for min expression

<p>(Optional) Define the terminating cell(s), if known or speculated, to be used as endpoints of the trajectory</p>  
<p>Here, we do not define any terminal cells</p>

In [None]:
#terminal_cells = pd.DataFrame(data[:, ['Sox9']].layers['palantir_imp'], index=data.obs_names, columns=['Sox9']).idxmax().values # for max expression
#terminal_cells = pd.DataFrame(data[:, ['Sox9']].layers['palantir_imp'], index=data.obs_names, columns=['Sox9']).idxmin().values # for min expression

<p>Define the number of waypoints to compute trajectory probabilities</p>
<p>Here, we are simply selecting a value of ~5% of the total number of cells in the data</p>

In [None]:
num_waypoints = int(0.05*data.n_obs)

<p style="color:green;font-size:16px;">Compute Palantir trajectory results</p>  
<a href="https://scanpy.readthedocs.io/en/stable/generated/scanpy.external.tl.palantir_results.html">See API for options</a>  

In [None]:
pr_res = sc.external.tl.palantir_results(data, early_cell, ms_data='X_palantir_multiscale', num_waypoints=num_waypoints, use_early_cell_as_start=True) # optionally also pass terminal_states=terminal_cells

<p style="color:darkviolet">Observe Palantir results</p> 

In [None]:
palantir.plot.plot_palantir_results(pr_res, tsne)

In [None]:
palantir.plot.plot_palantir_results(pr_res, umap)

In [None]:
palantir.plot.plot_palantir_results(pr_res, fa)

<p style="color:darkviolet">Observe terminal state probabilities for cells</p> 

In [None]:
cells = pr_res.branch_probs.columns.values.tolist()
cells.append(early_cell)
palantir.plot.plot_terminal_state_probs(pr_res, cells)

<p style="color:darkviolet">Observe locations of early and terminal cells</p> 

In [None]:
palantir.plot.highlight_cells_on_tsne(tsne, cells)

In [None]:
palantir.plot.highlight_cells_on_tsne(umap, cells)

In [None]:
palantir.plot.highlight_cells_on_tsne(fa, cells)

<p>Define genes for plotting gene expression trends between early and terminating cells</p>

In [None]:
genes = ['provide', 'list', 'of', 'genes', 'to', 'see', 'here']

<p style="color:green;font-size:16px;">Compute gene trends</p>  

In [None]:
imp_df = pd.DataFrame(data[:, genes].layers['palantir_imp'], index=data.obs_names, columns=genes) # Store imputed data into DataFrame
gene_trends = palantir.presults.compute_gene_trends(pr_res, imp_df.loc[:, genes])

<p style="color:darkviolet">Observe gene trends</p> 

In [None]:
palantir.plot.plot_gene_trends(gene_trends)

In [None]:
palantir.plot.plot_gene_trend_heatmaps(gene_trends)

<p style="color:green;font-size:16px;">Compute gene trends for 1000 genes (example)</p>  

In [None]:
imp_df_first1000 = pd.DataFrame(data[:, data.var_names.values[0:1000]].layers['palantir_imp'],
                     index=data.obs_names, columns=data.var_names.values[0:1000]) # Store imputed data into DataFrame
gene_trends_first1000 = palantir.presults.compute_gene_trends(pr_res, imp_df_first1000.iloc[:, 0:1000])

<p style="color:darkviolet">Observe trends for clusters of genes</p> 

In [None]:
for c in range(len(cells) - 1):  # the last entry of `cells` is the early cell, so it is skipped
    trends = gene_trends_first1000[cells[c]]['trends']
    gene_clusters = palantir.presults.cluster_gene_trends(trends)
    palantir.plot.plot_gene_trend_clusters(trends, gene_clusters)

### Using PAGA

<p style="color:green;font-size:16px;">Compute force-directed graph (if not done above)</p>  

In [None]:
sc.tl.draw_graph(data) # takes time...

<p style="color:darkviolet">Observe embedding</p>  

In [None]:
sc.pl.draw_graph(data, legend_loc='on data')

<p style="color:green;font-size:16px;">Compute PAGA graph</p>  

In [None]:
sc.tl.paga(data, groups='louvain') # Assumes Louvain was used for clustering above

<p style="color:darkviolet">Observe PAGA graph</p>  

In [None]:
sc.pl.paga(data, color=['louvain'])

In [None]:
sc.pl.paga(data, color=genes)

Annotate the Louvain groups (the number of new names must match the number of existing groups)

In [None]:
data.obs['louvain'].cat.categories

In [None]:
# Assigning to `.cat.categories` directly is not supported in pandas >= 2.0, so rename_categories is used
data.obs['louvain_anno'] = data.obs['louvain'].cat.rename_categories(
    ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10/Ery', '11', '12',
     '13', '14', '15', '16/Stem', '17', '18', '19/Neu', '20/Mk', '21', '22/Baso', '23', '24/Mo'])
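
When only a few clusters need new labels, pandas also accepts a mapping, which avoids retyping every category. The sketch below uses a small made-up Series rather than `data.obs`, and the labels are placeholders.

```python
import pandas as pd

# Made-up cluster assignments for five cells
clusters = pd.Series(['0', '1', '2', '1', '0'], dtype='category')

# Only the listed categories are renamed; the others are kept unchanged
annotated = clusters.cat.rename_categories({'0': '0/Stem', '2': '2/Ery'})

print(list(annotated.cat.categories))
```

Applied to the notebook, the same call could be made on `data.obs['louvain']` and the result stored in `data.obs['louvain_anno']`.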

<p style="color:green;font-size:16px;">Re-compute PAGA graph with annotations</p>  

In [None]:
sc.tl.paga(data, groups='louvain_anno') # Uses the annotated Louvain groups defined above

<p style="color:darkviolet">Observe embedding with annotations</p>  

In [None]:
sc.pl.draw_graph(data, legend_loc='on data')

<p style="color:green;font-size:16px;">Re-compute graph with PAGA initialization</p>  

In [None]:
sc.tl.draw_graph(data, init_pos='paga')

<p style="color:darkviolet">Observe embedding with annotations</p>  

In [None]:
sc.pl.draw_graph(data, color=['louvain_anno'], legend_loc='on data')

In [None]:
sc.pl.draw_graph(data, color=genes, legend_loc='on data')

View the palette colors; these can be used to define colors for trajectories

In [None]:
pl.figure(figsize=(8, 2))
for i in range(28):
    pl.scatter(i, 1, c=sc.pl.palettes.zeileis_28[i], s=200)
pl.show()

In [None]:
zeileis_colors = np.array(sc.pl.palettes.zeileis_28)
new_colors = np.array(data.uns['louvain_anno_colors'])

Define the colors used for visualizing trajectories; each index into `new_colors` is a cluster number, and each index into `zeileis_colors` selects a palette color

In [None]:
new_colors[[16]] = zeileis_colors[[12]]  # Stem colors / green
new_colors[[10, 17, 5, 3, 15, 6, 18, 13, 7, 12]] = zeileis_colors[[5, 5, 5, 5, 11, 11, 10, 9, 21, 21]]  # Ery colors / red
new_colors[[20, 8]] = zeileis_colors[[17, 16]]  # Mk early Ery colors / yellow
new_colors[[4, 0]] = zeileis_colors[[2, 8]]  # lymph progenitors / grey
new_colors[[22]] = zeileis_colors[[18]]  # Baso / turquoise
new_colors[[19, 14, 2]] = zeileis_colors[[6, 6, 6]]  # Neu / light blue
new_colors[[24, 9, 1, 11]] = zeileis_colors[[0, 0, 0, 0]]  # Mo / dark blue
new_colors[[21, 23]] = zeileis_colors[[25, 25]]  # outliers / grey
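
The assignments above rely on NumPy integer-array ("fancy") indexing: positions on the left are cluster numbers, positions on the right are palette entries, and the two lists are matched element by element. A minimal sketch with a made-up palette and five clusters:

```python
import numpy as np

# Hypothetical 4-color palette and a default color for each of 5 clusters
palette = np.array(['#111111', '#222222', '#333333', '#444444'])
cluster_colors = np.array(['#aaaaaa'] * 5)

# Clusters 0 and 3 both receive palette color 2
cluster_colors[[0, 3]] = palette[[2, 2]]

# Cluster 4 receives palette color 0
cluster_colors[[4]] = palette[[0]]

print(cluster_colors.tolist())
```

Clusters that are not mentioned keep their original color, so only the groups of interest need to be listed.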

<p style="color:green;font-size:16px;">Apply colors</p>  

In [None]:
data.uns['louvain_anno_colors'] = new_colors

<p style="color:darkviolet">Observe embedding with new colors</p>  

In [None]:
sc.pl.paga_compare(data, threshold=0.03, title='', right_margin=0.2, size=10, edge_width_scale=0.5, legend_fontsize=12, fontsize=12, frameon=False, edges=True, save=False)

<p style="color:green;font-size:16px;">Define starting cell for trajectory</p>  

In [None]:
data.uns['iroot'] = np.flatnonzero(data.obs['louvain_anno']  == '16/Stem')[0]
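
The expression above picks the positional index of the first cell that carries the chosen label. The sketch below shows the same logic on a made-up label array and adds a guard for a label that is absent; the variable names are hypothetical.

```python
import numpy as np

# Made-up cluster labels for five cells
labels = np.array(['0', '16/Stem', '3', '16/Stem', '5'])
root_label = '16/Stem'

# Positions of all cells in the root cluster
candidates = np.flatnonzero(labels == root_label)
if candidates.size == 0:
    raise ValueError(f"No cells found with label {root_label!r}")

iroot = int(candidates[0])  # first matching cell is used as the root
print(iroot)
```

Because the first matching cell is chosen arbitrarily, a different cell from the same cluster (for example one with high expression of a known progenitor marker) may be a better root if prior knowledge is available.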

<p style="color:green;font-size:16px;">Compute diffusion pseudotime</p>  

In [None]:
sc.tl.dpt(data)

Define marker genes

In [None]:
gene_names = ['Gata2', 'Gata1', 'Klf1', 'Epor', 'Hba-a2',  # erythroid
              'Elane', 'Cebpe', 'Gfi1',                    # neutrophil
              'Irf8', 'Csf1r', 'Ctsg']                     # monocyte

<p style="color:darkviolet">Observe embedding and diffusion pseudotime</p>  

In [None]:
sc.pl.draw_graph(data, color=['louvain_anno', 'dpt_pseudotime'], legend_loc='on data')

Define paths for each cell type based on above plots

In [None]:
paths = [('erythrocytes', [16, 12, 7, 13, 18, 6, 5, 10]),
         ('neutrophils', [16, 0, 4, 2, 14, 19]),
         ('monocytes', [16, 0, 4, 11, 1, 9, 24])]

<p style="color:green;font-size:16px;">(Re-)Define variables</p>  

In [None]:
data.obs['distance'] = data.obs['dpt_pseudotime']
data.obs['clusters'] = data.obs['louvain_anno']  # just a cosmetic change
data.uns['clusters_colors'] = data.uns['louvain_anno_colors']

<p style="color:darkviolet">Observe gene expression heatmaps for defined cell paths</p>  

In [None]:
_, axs = pl.subplots(ncols=3, figsize=(6, 2.5), gridspec_kw={'wspace': 0.05, 'left': 0.12})
pl.subplots_adjust(left=0.05, right=0.98, top=0.82, bottom=0.2)
for ipath, (descr, path) in enumerate(paths):
    _, path_data = sc.pl.paga_path(
        data, path, gene_names,
        show_node_names=False,
        ax=axs[ipath],
        ytick_fontsize=12,
        left_margin=0.15,
        n_avg=50,
        annotations=['distance'],
        show_yticks=(ipath == 0),
        show_colorbar=False,
        color_map='Greys',
        groups_key='clusters',
        color_maps_annotations={'distance': 'viridis'},
        title='{} path'.format(descr),
        return_data=True,
        show=False)
    #path_data.to_csv('./write/paga_path_{}.csv'.format(descr))

___  
# <center>END</center>  
___  