# Immune Single Cell Data Science

Zhao J, Zhang S, Liu Y, He X, Qu M et al. (2020) Single-cell RNA sequencing reveals the heterogeneity of liver-resident immune cells in human.

- https://pubmed.ncbi.nlm.nih.gov/32351704/

In [12]:
import warnings
warnings.simplefilter('ignore', FutureWarning)
warnings.simplefilter('ignore', RuntimeWarning)

In [13]:
import ollama

In [48]:
from IPython.display import display, Markdown

## Use of LLMs

In general, it is fine to use LLMs such as `mistral` to ask general questions. However, it is dangerous to ask one for code if you do not know what you are doing, because the generated code is often wrong or sub-optimal.


In [51]:
def ask(query, model='mistral'):
    """Send a question to a local LLM via ollama and display the reply as Markdown."""

    response = ollama.chat(model=model, messages=[
      {
        'role': 'user',
        'content': query,
      },
    ])
    display(Markdown(response['message']['content']))
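
A minimal sketch of a variant that returns the reply as a string instead of displaying it, which makes the answer easier to inspect or store. The names `ask_text` and `fake_chat` and the `chat` parameter are assumptions introduced here for illustration; they are not part of `ollama`. The only `ollama` behaviour relied on is the `ollama.chat(model=..., messages=...)` call and the `response['message']['content']` lookup already used in `ask` above.

```python
def ask_text(query, model='mistral', chat=None):
    """Return the model's reply as a plain string instead of displaying it.

    `chat` is any callable that accepts `model` and `messages` keyword
    arguments like `ollama.chat`; it defaults to `ollama.chat`.
    """
    if chat is None:
        import ollama  # imported lazily so the helper also works with a stand-in
        chat = ollama.chat
    response = chat(model=model, messages=[{'role': 'user', 'content': query}])
    return response['message']['content']


def fake_chat(model, messages):
    """Hypothetical stand-in for ollama.chat that simply echoes the query."""
    return {'message': {'content': f"[{model}] {messages[-1]['content']}"}}


# Exercise the helper without a running ollama server.
print(ask_text('What is scanpy?', chat=fake_chat))
```

Passing the chat function as an argument is a design choice for testability: the helper can be checked without a model server, and calling `ask_text(query)` with no `chat` argument falls back to `ollama.chat`.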

In [52]:
ask("What is scanpy?")

 Scanpy is a Python library designed for the analysis of single-cell RNA sequencing (scRNA-seq) data. It provides tools for low-level data processing, such as quality control and normalization, as well as high-level analyses like clustering and differential expression analysis. Scanpy is built on top of NumPy, Pandas, and scipy, making it an efficient and flexible tool for scRNA-seq data analysis. It also integrates with Seurat's Louvain clustering algorithm and provides a simple interface to perform trajectory analysis using Monocle. Overall, Scanpy offers a streamlined and user-friendly approach for exploring the complexities of single-cell transcriptomic data.

In [8]:
import scanpy as sc
import scanpy.external as sce

In [53]:
ask("List 10 sites where I can download single cell datasets in a format for scanpy")

 Scanpy is a popular tool for analyzing single-cell RNA sequencing data. Here are ten repositories and databases where you can find single-cell datasets that are often available in formats compatible with Scanpy:

1. **Single Cell Portal** (https://singlecell.broadinstitute.org/): This is a comprehensive resource from the Broad Institute, which provides access to large collections of single-cell RNA sequencing data and associated metadata. The data can be downloaded in various formats, including those suitable for Scanpy.
2. **10X Genomics Data Browser** (https://data.10xgenomics.com/): 10X Genomics is a leading provider of single-cell sequencing solutions, and their data browser offers public datasets that can be downloaded in various formats, including H5AD format which Scanpy supports.
3. **Gene Expression Omnibus (GEO)** (https://www.ncbi.nlm.nih.gov/geo/): GEO is a public database of microarray and RNA-Seq datasets, including many single-cell experiments. Some data may require preprocessing before being used with Scanpy.
4. **European Nucleotide Archive (ENA)** (https://www.ebi.ac.uk/ena): ENA is a public repository for nucleotide sequence data, including single-cell RNA-Seq datasets. The data can often be downloaded in formats suitable for Scanpy.
5. **BioConductor Single Cell Experiment** (https://bioconductor.org/packages/3.16/bioc/html/SingleCellExperiment.html): This Bioconductor package provides a framework for managing and analyzing single-cell RNA-Seq experiments. The data can be accessed from various sources, including GEO and ENA.
6. **The Human Cell Atlas** (https://portal.humancellatlas.org/): The Human Cell Atlas project aims to map every cell type in the human body. They provide open access to single-cell datasets, which can be downloaded in various formats, including those suitable for Scanpy.
7. **National Center for Biotechnology Information (NCBI) SRA** (https://www.ncbi.nlm.nih.gov/sra): The Sequence Read Archive contains single-cell RNA sequencing data and associated metadata from various sources, including 10X Genomics, the Broad Institute, and the Allen Institute for Brain Science. The data can often be downloaded in formats suitable for Scanpy.
8. **International Cancer Genome Consortium (ICGC)** (https://dcc.icgc.org/): ICGC is a large-scale international initiative to analyze the genomic and epigenomic alterations that occur in cancer. They provide access to single-cell RNA sequencing data from various cancer types, which can be downloaded in various formats, including those suitable for Scanpy.
9. **Molecular Taxonomy Unit (MTU)** at the European Bioinformatics Institute (EBI) (https://www.ebi.ac.uk/gold): The MTU provides access to single-cell RNA sequencing data from various organisms, including plants and fungi. The data can often be downloaded in formats suitable for Scanpy.
10. **Roadmap Epigenomics Project** (https://www.roadmapepigenomics.org/): This is a comprehensive effort to map the epigenomes of human cells. They provide access to single-cell RNA sequencing data, which can be downloaded in various formats, including those suitable for Scanpy.

It's important to note that some of these datasets may require additional processing or formatting before being used with Scanpy. Always check the documentation and metadata associated with each dataset to ensure compatibility.

In [54]:
ask("Are there any built-in datasets provided with scanpy?")

 Yes, Scanpy comes with several built-in datasets that can be used for learning and experimentation. Some of the most commonly used ones are:

1. PBMC (Peripheral Blood Mononuclear Cells) dataset: This is a large-scale single-cell RNA sequencing dataset consisting of 106,708 cells from 56 donors. The data includes 31 different cell types and various phenotypic and clinical metadata.
2. Mouse Cortex dataset: This dataset consists of single-cell RNA sequencing data from the mouse cortex. It contains information about gene expression profiles, cell types, and their spatial locations.
3. TCGA (The Cancer Genome Atlas) dataset: Scanpy includes an interface for accessing the TCGA dataset, which is a large collection of genomic and clinical data from various types of cancer.
4. PanglaoDB: This is a curated database of single-cell RNA sequencing datasets, which can be accessed through Scanpy using the `mm10_panglao` and `grch38_panglao` datasets. It includes data from various human tissues and cell types.

To load these datasets in Scanpy, you can use the following code snippets:

```python
import scanpy as sc

# Load PBMC dataset
pbmc = sc.read_10x('./data/single_cell/5k_pbmc3k_v1_1.h5ad')

# Load Mouse Cortex dataset
mouse_cortex = sc.read('../data/mouse_cortex/mouse_cortex.h5ad')

# Access TCGA data using the 'mm10' or 'grch38' kernels
tcga = sc.set_data(sc.external.fetch_tisa_data('TCGA', 'mm10')) # for mm10 genome
tcga = sc.set_data(sc.external.fetch_tisa_data('TCGA', 'grch38'))  # for grch38 genome

# Load PanglaoDB dataset
panglao = sc.read('../data/panglaoDB/mm10_panglao.h5ad') # for mm10 genome
panglao = sc.read('../data/panglaoDB/grch38_panglao.h5ad')  # for grch38 genome
```

These datasets can be used as a starting point for various single-cell data analysis tasks, such as clustering, dimensionality reduction, differential expression analysis, and visualization.
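
The reply above illustrates the warning in "Use of LLMs": some of the suggested calls, such as `sc.read_10x` and `sc.set_data`, do not appear in scanpy's documented API as far as can be told. Rather than trusting the reply or this note, the names can be checked against the installed package. The sketch below does that with the standard library only; the helper name `missing_names` is an assumption introduced here, and the list of names is copied from the reply.

```python
import importlib


def missing_names(module_name, dotted_names):
    """Return the dotted names that cannot be resolved inside module_name."""
    module = importlib.import_module(module_name)
    missing = []
    for dotted in dotted_names:
        obj = module
        try:
            # Walk the attribute chain, e.g. 'datasets.pbmc3k'.
            for part in dotted.split('.'):
                obj = getattr(obj, part)
        except AttributeError:
            missing.append(dotted)
    return missing


# Names copied from the LLM reply above, plus one call used later in this notebook.
suggested = ['read_10x', 'set_data', 'external.fetch_tisa_data',
             'read', 'datasets.pbmc3k']

try:
    print(missing_names('scanpy', suggested))
except ImportError:
    print('scanpy is not installed; nothing to check')
```

Any name printed in the resulting list could not be found as an attribute of the installed `scanpy` module and should be treated as suspect before the suggested code is run.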

In [61]:
help(sc.datasets.pbmc3k)

Help on function pbmc3k in module scanpy.datasets._datasets:

pbmc3k() -> anndata._core.anndata.AnnData
    3k PBMCs from 10x Genomics.
    
    The data consists in 3k PBMCs from a Healthy Donor and is freely available
    from 10x Genomics (`here
    <http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz>`__
    from this `webpage
    <https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k>`__).
    
    The exact same data is also used in Seurat's
    `basic clustering tutorial <https://satijalab.org/seurat/pbmc3k_tutorial.html>`__.
    
    .. note::
    
        This downloads 5.9 MB of data upon the first call of the function and stores it in `./data/pbmc3k_raw.h5ad`.
    
    The following code was run to produce the file.
    
    .. code:: python
    
        adata = sc.read_10x_mtx(
            # the directory with the `.mtx` file
            './data/filtered_gene_bc_matrices/hg19/',
            # use gene s

In [38]:
data = sc.datasets.pbmc3k()

In [65]:
ask("Explain what an AnnData object is")

 An `AnnData` object is a type of data structure in the `anndata` package of the Scikit-learn library, which is specifically designed for handling single-cell RNA sequencing (scRNA-seq) data. This object extends the HDF5-based `h5ad` format to support additional annotation and metadata information, as well as various computational analyses common in scRNA-seq analysis, such as neighborhood graph construction, clustering, and dimensionality reduction.

At its core, an `AnnData` object is a multi-dimensional array with the following components:

1. **X**: The expression levels of genes across all cells, typically stored as a sparse matrix or dense NumPy array.
2. **obsm**: A dictionary that stores various pre-computed data matrices (e.g., PCA results, RNA velocity vectors) as optional slot-specific subarrays within the HDF5 file.
3. **layers**: A list of layer-specific data (e.g., cell type labels, trajectory information) as separate subarrays within the HDF5 file.
4. **obs**: Cell-level metadata and annotations such as batch information or cellular morphology features.
5. **uns**: Unstructured data associated with cells or genes, such as gene names, protein interaction data, or cellular phenotypes.

The `AnnData` object also provides methods for common scRNA-seq analysis tasks like neighborhood graph construction, clustering, and dimensionality reduction, making it a powerful tool for exploratory data analysis in the context of single-cell RNA sequencing experiments.

In [64]:
?data

Type:        AnnData
String form:
AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'
Length:      2700
File:        ~/.pyenv/versions/3.10.12/lib/python3.10/site-packages/anndata/_core/anndata.py
Docstring:
An annotated data matrix.

.. figure:: ../_static/img/anndata_schema.svg
   :width: 260px
   :align: right
   :class: dark-light

:class:`~anndata.AnnData` stores a data matrix :attr:`X` together with annotations
of observations :attr:`obs` (:attr:`obsm`, :attr:`obsp`),
variables :attr:`var` (:attr:`varm`, :attr:`varp`),
and unstructured annotations :attr:`uns`.

An :class:`~anndata.AnnData` object `adata` can be sliced like a
:class:`~pandas.DataFrame`,
for instance `adata_subset = adata[:, list_of_variable_names]`.
:class:`~anndata.AnnData`’s basic structure is similar to R’s ExpressionSet
[Huber15]_. If setting an `.h5ad`-formatted HDF5 backing file `.filename`,
data remains on the disk but is automatical

### Download data set from HCA

- [A single cell immune cell atlas of human hematopoietic system](https://explore.data.humancellatlas.org/projects/cc95ff89-2e68-4a08-a234-480eca21ce79)

In [29]:
%%bash

curl --location --fail https://service.azul.data.humancellatlas.org/manifest/files/ksQwlKVkY3AzNKRjdXJsxBAp0uUqOpZdEY6mvNiJeEmtxBDHUG4Ca5JfJLZvYxluSglOxCA1qXUJmqb7DtICTdweNzgqNavVJt6EfjEezHxf8dEUBg | curl --fail-early --continue-at - --retry 2 --retry-delay 10 --config -

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3127  100  3127    0     0   9060      0 --:--:-- --:--:-- --:--:--  9063
100  1951  100  1951    0     0   3492      0 --:--:-- --:--:-- --:--:-- 11148
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0
100 66.5M  100 66.5M    0     0  11.0M      0  0:00:05  0:00:05 --:--:-- 24.3M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100  351M  100  351M    0     0  20.2M      0  0:00:17  0:00:17 --:--:-- 22.0M
  % Total    % Received % Xferd  Average Speed   Tim

Downloading to: bc351b55-800a-4803-9455-1eec99aff456/BL_hashing.h5ad

Downloading to: e89dfe2c-d2fe-4dec-8151-a90b82632f62/CB_pooling_and_hashing.h5ad

Downloading to: 20492a4b-0def-457b-9574-60dfdde2a0f2/BM_standard_design.h5ad

Downloading to: 5f29c29a-51c6-435c-8ff0-2b2a9d05ebee/BL_standard_design.h5ad

Downloading to: 405003d2-26e8-48f8-80f2-d538af357b22/CB_standard_design.h5ad

Downloading to: 2ae30314-9b39-4259-af50-67521df39c9a/CB_extra.h5ad

Downloading to: c50d807d-2306-4273-ba62-875486a96517/BL_pooling_and_control.h5ad

Downloading to: 511478a9-1940-4aa3-ab63-e2791aa6e623/BM_pooling_and_control.h5ad
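
The download places each `.h5ad` file in its own UUID-named directory, so a small helper for locating them can be convenient. The following is a sketch using only `pathlib`; the function name `find_h5ad` is an assumption introduced here, and it assumes the files sit exactly one directory level below the working directory, as the log above suggests.

```python
from pathlib import Path


def find_h5ad(root='.'):
    """Return all .h5ad files one directory level below root, sorted by file name."""
    return sorted(Path(root).glob('*/*.h5ad'), key=lambda p: p.name)


# List what was downloaded, with file sizes in bytes.
for path in find_h5ad('.'):
    print(path.name, path.stat().st_size, 'bytes')
```

Each returned path could then be passed to `sc.read_h5ad(path)` to load the corresponding `AnnData` object.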



In [9]:
ad = sc.datasets.ebi_expression_atlas('E-HCAD-32')

In [10]:
ad

AnnData object with n_obs × n_vars = 58615 × 22790
    obs: 'Sample Characteristic[organism]', 'Sample Characteristic Ontology Term[organism]', 'Sample Characteristic[individual]', 'Sample Characteristic Ontology Term[individual]', 'Sample Characteristic[sex]', 'Sample Characteristic Ontology Term[sex]', 'Sample Characteristic[age]', 'Sample Characteristic Ontology Term[age]', 'Sample Characteristic[developmental stage]', 'Sample Characteristic Ontology Term[developmental stage]', 'Sample Characteristic[disease]', 'Sample Characteristic Ontology Term[disease]', 'Sample Characteristic[organism part]', 'Sample Characteristic Ontology Term[organism part]', 'Sample Characteristic[cell type]', 'Sample Characteristic Ontology Term[cell type]', 'Sample Characteristic[organism status]', 'Sample Characteristic Ontology Term[organism status]', 'Sample Characteristic[cause of death]', 'Sample Characteristic Ontology Term[cause of death]', 'Factor Value[organism part]', 'Factor Value Ontology Term

Object `sce.download_dataset` not found.
