In [3]:
%pip install scanpy

Collecting scanpy
  Downloading scanpy-1.11.4-py3-none-any.whl.metadata (9.2 kB)
Collecting anndata>=0.8 (from scanpy)
  Downloading anndata-0.12.2-py3-none-any.whl.metadata (9.6 kB)
Collecting legacy-api-wrap>=1.4.1 (from scanpy)
  Downloading legacy_api_wrap-1.4.1-py3-none-any.whl.metadata (2.1 kB)
Collecting session-info2 (from scanpy)
  Downloading session_info2-0.2.2-py3-none-any.whl.metadata (3.4 kB)
Collecting array-api-compat>=1.7.1 (from anndata>=0.8->scanpy)
  Downloading array_api_compat-1.12.0-py3-none-any.whl.metadata (2.5 kB)
Collecting zarr!=3.0.*,>=2.18.7 (from anndata>=0.8->scanpy)
  Downloading zarr-3.1.2-py3-none-any.whl.metadata (10 kB)
Collecting donfig>=0.8 (from zarr!=3.0.*,>=2.18.7->anndata>=0.8->scanpy)
  Downloading donfig-0.8.1.post1-py3-none-any.whl.metadata (5.0 kB)
Collecting numcodecs>=0.14 (from numcodecs[crc32c]>=0.14->zarr!=3.0.*,>=2.18.7->anndata>=0.8->scanpy)
  Downloading numcodecs-0.16.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.me

Cell 1: Import Libraries & Configure Settings

This cell imports the required packages and sets Scanpy's logging verbosity and plotting defaults.

In [4]:
import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt
import seaborn as sns

# Configure settings for Scanpy and Matplotlib
# Figures are drawn at 100 dpi on a white background without a frame
sc.settings.verbosity = 3             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=100, facecolor='white', frameon=False)

print("✅ Libraries imported and settings configured.")

✅ Libraries imported and settings configured.


Cell 2: Load the Dataset

Here we download the classic "PBMC 3k" dataset with Scanpy's built-in dataset loader. The data is returned as an AnnData object, the standard container for single-cell data in Scanpy.

In [5]:
# This function downloads the data and returns it as an AnnData object.
adata = sc.datasets.pbmc3k()

print("✅ PBMC 3k dataset loaded.")

try downloading from url
https://falexwolf.de/data/pbmc3k_raw.h5ad
... this may take a while but only happens once



✅ PBMC 3k dataset loaded.


Cell 3: Inspect the AnnData Object

Let's get a high-level summary of our dataset. It tells us how many cells (n_obs) and genes (n_vars) we have and which annotations are currently stored.



In [6]:
# AnnData is the central data structure in Scanpy.
# It stores the main data matrix (.X) along with annotations for cells (obs) and genes (var).
print("--- AnnData Object Summary ---")
print(adata)

--- AnnData Object Summary ---
AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'


Cell 4: Check the Shape

This confirms the dimensions of our main gene expression matrix: 2700 cells × 32738 genes.



In [7]:
# .X stores the primary data: the gene expression count matrix.
# It's usually a sparse matrix to save memory.
print(f"Shape of the count matrix (.X): {adata.X.shape}")
print(f"Number of cells (n_obs): {adata.n_obs}")
print(f"Number of genes (n_vars): {adata.n_vars}")

Shape of the count matrix (.X): (2700, 32738)
Number of cells (n_obs): 2700
Number of genes (n_vars): 32738
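
To illustrate why a sparse format saves memory for count data, here is a minimal sketch on a toy matrix. It does not use the PBMC data: the matrix size, the Poisson rate, and the variable names are assumptions chosen for illustration, and the matrix only stands in for adata.X.

```python
import numpy as np
from scipy import sparse

# Toy stand-in for adata.X (hypothetical): 200 cells x 1000 genes, mostly zeros.
rng = np.random.default_rng(0)
dense = rng.poisson(lam=0.05, size=(200, 1000)).astype(np.float32)
X = sparse.csr_matrix(dense)

n_cells, n_genes = X.shape
fraction_nonzero = X.nnz / (n_cells * n_genes)

# CSR storage = non-zero values + their column indices + one row pointer per row (+1).
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = dense.nbytes

print(f"Shape: {X.shape}")
print(f"Non-zero fraction: {fraction_nonzero:.3f}")
print(f"Dense: {dense_bytes / 1e6:.2f} MB, sparse: {sparse_bytes / 1e6:.2f} MB")
```

The saving grows with the share of zero entries, which is typically high in single-cell count matrices.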


Cell 5: Examine Cell Metadata (.obs)

The .obs attribute is a pandas DataFrame holding information about each cell. Right now it has no columns, only the cell barcodes as its index; cluster labels and QC metrics will be added to it in later notebooks.

In [8]:
# .obs stores metadata for the cells (observations)
print("--- First 5 rows of cell metadata (.obs) ---")
print(adata.obs.head())

--- First 5 rows of cell metadata (.obs) ---
Empty DataFrame
Columns: []
Index: [AAACATACAACCAC-1, AAACATTGAGCTAC-1, AAACATTGATCAGC-1, AAACCGTGCTTCCG-1, AAACCGTGTATGCG-1]
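
As a rough sketch of how per-cell annotations end up in such a table, the example below adds two columns to an empty, barcode-indexed DataFrame. The count values and the column names (total_counts, n_genes_detected) are hypothetical and chosen for illustration; this is not the QC procedure used in the later notebooks.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy count matrix (hypothetical values): 3 cells x 4 genes.
X = sparse.csr_matrix(np.array([[1, 0, 2, 0],
                                [0, 0, 0, 3],
                                [4, 1, 0, 0]]))

# An obs-like table: barcodes as the index, no columns yet.
barcodes = ["AAACATACAACCAC-1", "AAACATTGAGCTAC-1", "AAACATTGATCAGC-1"]
obs = pd.DataFrame(index=barcodes)

# Each new column holds one value per cell, i.e. per row of X.
obs["total_counts"] = np.asarray(X.sum(axis=1)).ravel()
obs["n_genes_detected"] = np.asarray((X > 0).sum(axis=1)).ravel()

print(obs)
```

The key point is the alignment: row i of the annotation table describes row i of the count matrix.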


Cell 6: Examine Gene Metadata (.var)

Similarly, .var is a DataFrame of gene information. It currently has a single column, gene_ids, holding the Ensembl gene IDs. The gene names (symbols) form its index and are also available as .var_names.

In [9]:
# .var stores metadata for the genes (variables)
print("--- First 5 rows of gene metadata (.var) ---")
print(adata.var.head())

# The '.var_names' are the gene symbols. Let's look at a few.
print("\n--- Example gene names (.var_names) ---")
print(adata.var_names[:10].tolist())

--- First 5 rows of gene metadata (.var) ---
                     gene_ids
index                        
MIR1302-10    ENSG00000243485
FAM138A       ENSG00000237613
OR4F5         ENSG00000186092
RP11-34P13.7  ENSG00000238009
RP11-34P13.8  ENSG00000239945

--- Example gene names (.var_names) ---
['MIR1302-10', 'FAM138A', 'OR4F5', 'RP11-34P13.7', 'RP11-34P13.8', 'AL627309.1', 'RP11-34P13.14', 'RP11-34P13.9', 'AP006222.2', 'RP4-669L17.10']
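
Because the gene symbols form the index of the table, a symbol can be used to look up its gene ID. The sketch below rebuilds the first three rows printed above as a plain pandas DataFrame (a stand-in for adata.var, not the real object) and shows the lookup along with a uniqueness check on the symbols.

```python
import pandas as pd

# Toy var-like table built from the first three rows shown in the output above.
var = pd.DataFrame(
    {"gene_ids": ["ENSG00000243485", "ENSG00000237613", "ENSG00000186092"]},
    index=pd.Index(["MIR1302-10", "FAM138A", "OR4F5"], name="index"),
)

# The row labels play the role of .var_names.
var_names = var.index

# Look up an Ensembl ID by gene symbol.
ensembl_id = var.loc["OR4F5", "gene_ids"]

# Label-based lookups like the one above return a single value only if symbols are unique.
symbols_unique = var_names.is_unique

print(ensembl_id)
print(f"Symbols unique: {symbols_unique}")
```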
