# Basic single-cell analysis

## Overview

This notebook follows the tutorial by [mousepixels/sanbomics](https://github.com/mousepixels/sanbomics/blob/main/single_cell_analysis_complete_class.ipynb), which has an accompanying [screencast](https://youtu.be/uvyG9yLuNSE?t=319).

The analysis is illustrated with single-nucleus RNA sequencing data from the following paper, <cite data-cite="Melms2021-bj">Melms et al. (2021)</cite>:

> Melms JC, Biermann J, Huang H, Wang Y, Nair A, Tagore S, et al.
A molecular single-cell lung atlas of lethal COVID-19.
Nature. 2021;595: 114–119. [doi:10.1038/s41586-021-03569-0](https://doi.org/10.1038/s41586-021-03569-0)

This paper examined about 116,000 nuclei from the lungs of nineteen patients who died of COVID-19 and underwent autopsy, together with seven control individuals. Findings reported in the abstract of the paper include:

1. activated monocyte-derived macrophages and alveolar macrophages
1. impaired T cell activation
1. monocyte/macrophage-derived interleukin-1β and epithelial cell-derived interleukin-6
1. alveolar type 2 cells adopted an inflammation-associated transient progenitor cell state and failed to undergo full transition into alveolar type 1 cells
1. expansion of CTHRC1+ pathological fibroblasts
1. protein activity and ligand–receptor interactions suggest putative drug targets

This notebook makes extensive use of the software described in <cite data-cite="Wolf2018-nu">Wolf et al. (2018)</cite> and <cite data-cite="Lopez2018-em">Lopez et al. (2018)</cite>, namely [scanpy](https://github.com/scverse/scanpy) and [scvi-tools](https://github.com/scverse/scvi-tools), including updates made to both packages since their initial publication.

## Setup

### Import libraries

In [1]:
from inspect import getmembers
from pprint import pprint
from types import FunctionType

import scanpy as sc

### Setup plotting

In [2]:
import matplotlib.font_manager
import matplotlib.pyplot as plt

# import matplotlib_inline

In [3]:
# fonts_path = "/usr/share/texmf/fonts/opentype/public/lm/" #ubuntu
# fonts_path = "~/Library/Fonts/" # macos
fonts_path = "/usr/share/fonts/OTF/"  # arch
# user_path = "$HOME/" # user
# fonts_path = user_path + "fonts/latinmodern/opentype/public/lm/"  # home
matplotlib.font_manager.fontManager.addfont(fonts_path + "lmsans10-regular.otf")
matplotlib.font_manager.fontManager.addfont(fonts_path + "lmroman10-regular.otf")
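
The commented-out alternatives above include `"~/Library/Fonts/"` and `"$HOME/"`, which plain string concatenation does not expand. A minimal sketch of one way to pick the first usable directory is shown here. The names `find_fonts_dir`, `CANDIDATE_FONT_DIRS` and `FONT_FILES` are hypothetical and not part of the original notebook; the directory list is assembled from the comments above.

```python
import os
from pathlib import Path

# Candidate directories assembled from the commented-out options in the cell above.
CANDIDATE_FONT_DIRS = [
    "/usr/share/texmf/fonts/opentype/public/lm/",  # ubuntu
    "~/Library/Fonts/",  # macos
    "/usr/share/fonts/OTF/",  # arch
    "$HOME/fonts/latinmodern/opentype/public/lm/",  # home
]
FONT_FILES = ["lmsans10-regular.otf", "lmroman10-regular.otf"]


def find_fonts_dir(candidates, font_files):
    """Return the first directory that contains every font file, else None."""
    for candidate in candidates:
        # Expand "~" and "$HOME"; string concatenation alone leaves them literal.
        directory = Path(os.path.expandvars(os.path.expanduser(candidate)))
        if all((directory / name).is_file() for name in font_files):
            return directory
    return None


fonts_dir = find_fonts_dir(CANDIDATE_FONT_DIRS, FONT_FILES)
print(fonts_dir)  # None when the Latin Modern fonts are not installed in any candidate
```

If a directory is found, each file in it could then be passed to `fontManager.addfont` as in the cell above.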

In [4]:
# https://stackoverflow.com/a/36622238/446907
%config InlineBackend.figure_formats = ['svg']

In [5]:
plt.style.use("default")  # reset default parameters
# https://stackoverflow.com/a/3900167/446907
plt.rcParams.update(
    {
        "font.size": 16,
        "font.family": ["sans-serif"],
        "font.serif": ["Latin Modern Roman"] + plt.rcParams["font.serif"],
        "font.sans-serif": ["Latin Modern Sans"] + plt.rcParams["font.sans-serif"],
    }
)

### Utility functions

In [6]:
def attributes(obj):
    """
    get object attributes
    """
    disallowed_names = {
        name for name, value in getmembers(type(obj)) if isinstance(value, FunctionType)
    }
    return {
        name: getattr(obj, name)
        for name in dir(obj)
        if name[0] != "_" and name not in disallowed_names and hasattr(obj, name)
    }


def print_attributes(obj):
    """
    print object attributes
    """
    pprint(attributes(obj))
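
To make the filter in `attributes` concrete, the sketch below repeats the helper so that it runs on its own and applies it to a small, hypothetical `Sample` class (not part of the original notebook). Plain methods and underscore-prefixed names are dropped, while class attributes, instance attributes and properties are kept.

```python
from inspect import getmembers
from types import FunctionType


def attributes(obj):
    """Copy of the helper above, repeated so this example is self-contained."""
    disallowed_names = {
        name for name, value in getmembers(type(obj)) if isinstance(value, FunctionType)
    }
    return {
        name: getattr(obj, name)
        for name in dir(obj)
        if name[0] != "_" and name not in disallowed_names and hasattr(obj, name)
    }


class Sample:
    """Hypothetical class used only to illustrate the filter."""

    label = "C51ctr"  # class attribute: kept

    def __init__(self):
        self.n_obs = 6099  # instance attribute: kept
        self._cache = {}  # leading underscore: dropped

    @property
    def shape(self):  # property: kept, because it is not a FunctionType
        return (self.n_obs, 34546)

    def summary(self):  # plain method: dropped
        return f"{self.label}: {self.shape}"


result = attributes(Sample())
print(result)
```

Note that `getattr` evaluates properties, so calling `attributes` on an object with expensive or deprecated properties will trigger them.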

## Import data

Here we review how the data were downloaded, and proceed to import and inspect the data.

### Data download

Data with GEO accession [GSE171524](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE171524) were downloaded using [./data/download_geo_data.sh](./data/download_geo_data.sh) with the following parameters:

```bash
./download_geo_data.sh \
       -a GSE171524 \
       -f 'ftp.*RAW.*' \
       -j '..|.supplementary_files?|..|.url?|select(length>0)'
```
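
The script itself is not reproduced here, so the following is only a sketch of what the `-j` and `-f` arguments appear to express, assuming `-j` is a `jq` filter applied to the metadata JSON and `-f` is a regular expression applied to the resulting URLs. The `jq` filter descends recursively, takes every `supplementary_files` value, descends again, and keeps non-empty `url` values. The function names and the `sample_metadata` structure are hypothetical.

```python
import re


def walk(node):
    """Yield a JSON node and all of its descendants (like jq's `..`)."""
    yield node
    if isinstance(node, dict):
        for value in node.values():
            yield from walk(value)
    elif isinstance(node, list):
        for value in node:
            yield from walk(value)


def supplementary_urls(metadata, pattern=r"ftp.*RAW.*"):
    """Collect non-empty `url` values found under any `supplementary_files` key."""
    urls = []
    for node in walk(metadata):
        if isinstance(node, dict) and "supplementary_files" in node:
            for inner in walk(node["supplementary_files"]):
                if isinstance(inner, dict):
                    url = inner.get("url")
                    # Empty or missing URLs are skipped, like select(length>0).
                    if url and re.search(pattern, url):
                        urls.append(url)
    return urls


# Hypothetical metadata; the real JSON layout may differ.
sample_metadata = {
    "GSE171524": {
        "accession": "GSE171524",
        "supplementary_files": [
            {"url": "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE171nnn/GSE171524/suppl/GSE171524_RAW.tar"},
            {"url": "ftp://example.org/GSE171524/notes.txt"},
            {"url": ""},
        ],
    }
}

print(supplementary_urls(sample_metadata))
```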

A skeleton of this script that may work in this case is:

```bash
#!/usr/bin/env bash

#-- debugging (comment to reduce stderr output)
#-- https://wiki.bash-hackers.org/scripting/debuggingtips
export PS4='+(${BASH_SOURCE}:${LINENO}): ${FUNCNAME[0]:+${FUNCNAME[0]}(): }'
set -o xtrace

# get metadata
# Melms JC, Biermann J, Huang H, Wang Y, Nair A, Tagore S, et al.
# A molecular single-cell lung atlas of lethal COVID-19.
# Nature. 2021;595: 114–119. doi:10.1038/s41586-021-03569-0
# GSE171524
ffq -l 1 -o GSE171524.json GSE171524

# download raw data
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE171nnn/GSE171524/suppl/GSE171524_RAW.tar

# list contents
tar -tvf GSE171524_RAW.tar

# untar
mkdir -p GSE171524 && \
tar -xvf GSE171524_RAW.tar -C GSE171524
```
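
The `wget` URL in the script places the series under a directory in which the last three digits of the accession are replaced by `nnn` (`GSE171nnn` for `GSE171524`). The sketch below builds that URL from an accession. The function name `geo_raw_tar_url` is hypothetical, and extending the pattern to other accessions is an assumption based on the single URL shown above.

```python
import re


def geo_raw_tar_url(accession):
    """Build the RAW tar URL for a GEO series accession such as 'GSE171524'."""
    match = re.fullmatch(r"GSE(\d+)", accession)
    if match is None:
        raise ValueError(f"not a GEO series accession: {accession!r}")
    digits = match.group(1)
    # Replace the last three digits with "nnn" to get the parent directory name.
    stub = "GSE" + digits[:-3] + "nnn"
    return (
        "ftp://ftp.ncbi.nlm.nih.gov/geo/series/"
        f"{stub}/{accession}/suppl/{accession}_RAW.tar"
    )


print(geo_raw_tar_url("GSE171524"))
```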

### Data load

In [7]:
adata = None
adata = sc.read_csv("data/GSE171524/supplementary/GSM5226574_C51ctr_raw_counts.csv.gz").T
adata

AnnData object with n_obs × n_vars = 6099 × 34546

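Note that the `scanpy.read_csv` function accepts gzipped files. The result is transposed with `.T` so that rows (observations) correspond to cell barcodes and columns (variables) correspond to genes.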
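
The following standard-library sketch illustrates the same idea on hypothetical toy data: a gzipped CSV with genes in rows and barcodes in columns is read and transposed to a cells × genes layout. It is not how `scanpy` implements `read_csv`; it only shows the orientation change that `.T` performs.

```python
import csv
import gzip
import io

# Hypothetical toy counts: rows are genes, columns are cell barcodes.
raw = "gene,cell_a,cell_b,cell_c\nGENE1,0,2,1\nGENE2,5,0,0\n"
compressed = gzip.compress(raw.encode())

# gzip.open accepts a file object as well as a path; "rt" decodes to text.
with gzip.open(io.BytesIO(compressed), mode="rt", newline="") as handle:
    rows = list(csv.reader(handle))

barcodes = rows[0][1:]
genes = [row[0] for row in rows[1:]]
genes_by_cells = [[float(value) for value in row[1:]] for row in rows[1:]]

# Transpose so that each row is a cell and each column is a gene.
cells_by_genes = [list(column) for column in zip(*genes_by_cells)]

print(len(cells_by_genes), "cells x", len(cells_by_genes[0]), "genes")
```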

### Data properties

In [8]:
type(adata)

anndata._core.anndata.AnnData

In [9]:
type(adata.T)

anndata._core.anndata.AnnData

In [10]:
print_attributes(adata)

{'T': AnnData object with n_obs × n_vars = 34546 × 6099,
 'X': array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32),
 'file': Backing file manager: no file is set.,
 'filename': None,
 'is_view': False,
 'isbacked': False,
 'isview': False,
 'layers': Layers with keys: ,
 'n_obs': 6099,
 'n_vars': 34546,
 'obs': Empty DataFrame
Columns: []
Index: [TAGGTACCATGGCCAC-1_1, ATTCACTGTAACAGGC-1_1, TAACTTCCAACCACGC-1_1, TTGGGTACACGACAAG-1_1, AGGCCACAGAGTCACG-1_1, CACTGAAGTCGAAGCA-1_1, ACTGATGTCTGCACCT-1_1, TTACCGCCACTCAGAT-1_1, TTGGTTTTCCTAGCTC-1_1, TGGGAAGTCAGTGATC-1_1, CCACGAGTCTCTTAAC-1_1, ACTTCCGCACAACGCC-1_1, GGGAAGTAGCGACCCT-1_1, TGGTAGTTCCCGTGTT-1_1, CGCATAACATGCCGGT-1_1, TCTATCACAAGGCTTT-1_1, ATCCACCAGAGGTATT-1_1, TAACGACAGATGACCG-1_1, TCTTAGTGTATGAGGC-1_1, CACTTCGCAGTACTAC-1_1, GTCAAAC


In [11]:
adata.obs

| index | clusters_coarse | clusters | S_score | G2M_score |
|---|---|---|---|---|
| AAACCTGAGAGGGATA | Pre-endocrine | Pre-endocrine | -0.224902 | -0.252071 |
| AAACCTGAGCCTTGAT | Ductal | Ductal | -0.014707 | -0.232610 |
| AAACCTGAGGCAATTA | Endocrine | Alpha | -0.171255 | -0.286834 |
| AAACCTGCATCATCCC | Ductal | Ductal | 0.599244 | 0.191243 |
| AAACCTGGTAAGTGGC | Ngn3 high EP | Ngn3 high EP | -0.179981 | -0.126030 |
| ... | ... | ... | ... | ... |
| TTTGTCAAGTGACATA | Pre-endocrine | Pre-endocrine | -0.235896 | -0.266101 |
| TTTGTCAAGTGTGGCA | Ngn3 high EP | Ngn3 high EP | 0.279374 | -0.204047 |
| TTTGTCAGTTGTTTGG | Ductal | Ductal | -0.045692 | -0.208907 |
| TTTGTCATCGAATGCT | Endocrine | Alpha | -0.240576 | -0.206865 |


Gene names are saved as the index of `adata.var`.

In [12]:
adata.var

| index | highly_variable_genes |
|---|---|
| Xkr4 | False |
| Gm37381 | |
| Rp1 | |
| Rp1-1 | |
| Sox17 | |
| ... | ... |
| Gm28672 | |
| Gm28670 | |
| Gm29504 | |
| Gm20837 | |


In [13]:
adata.obs_names

Index(['AAACCTGAGAGGGATA', 'AAACCTGAGCCTTGAT', 'AAACCTGAGGCAATTA',
       'AAACCTGCATCATCCC', 'AAACCTGGTAAGTGGC', 'AAACCTGGTATTAGCC',
       'AAACCTGTCCCTCTTT', 'AAACCTGTCTTTCCTC', 'AAACGGGAGACAATAC',
       'AAACGGGAGATATGGT',
       ...
       'TTTGGTTCACCAGATT', 'TTTGGTTCACGAAGCA', 'TTTGGTTTCACTTACT',
       'TTTGGTTTCCTTTCGG', 'TTTGTCAAGAATGTGT', 'TTTGTCAAGTGACATA',
       'TTTGTCAAGTGTGGCA', 'TTTGTCAGTTGTTTGG', 'TTTGTCATCGAATGCT',
       'TTTGTCATCTGTTTGT'],
      dtype='object', name='index', length=3696)

In [14]:
adata.var_names

Index(['Xkr4', 'Gm37381', 'Rp1', 'Rp1-1', 'Sox17', 'Gm37323', 'Mrpl15',
       'Rgs20', 'Npbwr1', '4732440D04Rik',
       ...
       'Gm28406', 'Gm29436', 'Gm28407', 'Gm29393', 'Gm21294', 'Gm28672',
       'Gm28670', 'Gm29504', 'Gm20837', 'Erdr1'],
      dtype='object', name='index', length=27998)

There are two layers corresponding to spliced and unspliced transcripts respectively.

In [15]:
adata.layers['spliced']

<3696x27998 sparse matrix of type '<class 'numpy.float32'>'
	with 9298890 stored elements in Compressed Sparse Row format>

In [16]:
adata.layers['unspliced']

<3696x27998 sparse matrix of type '<class 'numpy.float32'>'
	with 3156504 stored elements in Compressed Sparse Row format>
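
The stored-element counts reported above can be turned into a rough density for each layer by dividing by the matrix size. This is a small arithmetic sketch using only the numbers shown in the two outputs. Stored elements in a compressed sparse row matrix are usually the nonzero entries, although explicit zeros can also be stored, so the result is best read as an upper bound on the nonzero fraction.

```python
# Shape and stored-element counts copied from the outputs above.
n_obs, n_vars = 3696, 27998
stored = {"spliced": 9298890, "unspliced": 3156504}

total_entries = n_obs * n_vars
density = {name: count / total_entries for name, count in stored.items()}

for name, value in density.items():
    print(f"{name}: {value:.1%} of {total_entries} entries are stored")
```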

PCA and UMAP have retained 50 and 2 dimensions respectively.

In [17]:
print(adata.obsm)
print(adata.obsm['X_pca'].shape)
print(adata.obsm)
print(adata.obsm['X_umap'].shape)

AxisArrays with keys: X_pca, X_umap
(3696, 50)
AxisArrays with keys: X_pca, X_umap
(3696, 2)


In [18]:
print(adata.varm)

AxisArrays with keys: 


In [19]:
print(adata.obsp)
print(adata.obsp['distances'].shape)
print(adata.obsp)
print(adata.obsp['connectivities'].shape)

PairwiseArrays with keys: distances, connectivities
(3696, 3696)
PairwiseArrays with keys: distances, connectivities
(3696, 3696)


In [20]:
print(adata.varp)

PairwiseArrays with keys: 


The data appear to contain reads mapped to 34,546 RNA molecule-associated features and 6,099 cell-associated barcodes, matching the `n_vars` and `n_obs` values reported when the data were loaded.