# Inspecting the AnnData object

## AnnData Object Structure

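- **X**: Main expression matrix (cells × genes, i.e. `n_obs × n_vars`)
- **obs**: Cell-level annotations (DataFrame, one row per cell)
- **var**: Gene-level annotations (DataFrame, one row per gene)
- **obsm**: Multi-dimensional cell annotations such as PCA or UMAP embeddings (dict-like, each entry has `n_obs` rows)
- **varm**: Multi-dimensional gene annotations such as gene embeddings (dict-like, each entry has `n_vars` rows)
- **layers**: Alternative versions of the expression matrix, e.g. raw counts (dict-like, each entry has the same shape as `X`)
- **uns**: Unstructured metadata / miscellaneous information (dict)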

In [None]:
import scanpy as sc
from pathlib import Path
import scipy.sparse as sp

# =============================================================================
# 1. LOAD DATA
# =============================================================================
H5AD_PATH = Path("../data/data/replogle_k562.h5ad")
adata = sc.read_h5ad(H5AD_PATH)

print(f"""
================================================================================
DATASET OVERVIEW
================================================================================
Shape: {adata.n_obs:,} cells × {adata.n_vars:,} genes

Key fields:
  adata.X                    → Expression matrix (log-normalized, sparse)
  adata.obs['condition']     → Perturbation label (e.g., "ctrl", "UBL5+ctrl")
  adata.obs['control']       → 1 = control cell, 0 = perturbed cell
  adata.var['gene_name']     → Gene names
""")


================================================================================
DATASET OVERVIEW
================================================================================
Shape: 162,751 cells × 5,000 genes

Key fields:
  adata.X                    → Expression matrix (log-normalized, sparse)
  adata.obs['condition']     → Perturbation label (e.g., "ctrl", "UBL5+ctrl")
  adata.obs['control']       → 1 = control cell, 0 = perturbed cell
  adata.var['gene_name']     → Gene names



## 1️⃣ High-level overview

In [4]:
adata

AnnData object with n_obs × n_vars = 162751 × 5000
    obs: 'condition', 'cell_type', 'cov_drug_dose_name', 'dose_val', 'control', 'condition_name'
    var: 'gene_name', 'chr', 'start', 'end', 'class', 'strand', 'length', 'in_matrix', 'mean', 'std', 'cv', 'fano', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'hvg', 'non_dropout_gene_idx', 'non_zeros_gene_idx', 'rank_genes_groups_cov_all', 'top_non_dropout_de_20', 'top_non_zero_de_20'

## 2️⃣ X: main expression matrix (cells × genes)

In [5]:
adata.X.shape  # not displayed: a notebook cell only echoes its last expression
type(adata.X)

scipy.sparse._csr.csr_matrix

In [21]:
# Check whether it is sparse, then view a small slice as a dense array:
is_sparse = sp.issparse(adata.X)
adata.X[:5, :5].toarray() if is_sparse else adata.X[:5, :5]


array([[0.        , 0.        , 0.        , 0.        , 0.5473604 ],
       [0.        , 0.        , 0.5986244 , 0.5986244 , 0.        ],
       [0.        , 0.        , 0.        , 0.70939165, 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.55177546, 0.        ]],
      dtype=float32)
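
The slice above is mostly zeros, which is why the matrix is stored in sparse form. As a rough sketch of how one might quantify that, the hypothetical helper `summarize_matrix` below reports shape, dtype, and the fraction of non-zero entries. It is demonstrated on a made-up 2 × 3 matrix rather than on `adata.X`.

```python
import numpy as np
import scipy.sparse as sp


def summarize_matrix(X):
    """Return shape, dtype, and fraction of non-zero entries for a dense or sparse matrix."""
    n_rows, n_cols = X.shape
    # count_nonzero ignores explicitly stored zeros, unlike the .nnz attribute.
    n_nonzero = X.count_nonzero() if sp.issparse(X) else np.count_nonzero(X)
    return {
        "shape": (n_rows, n_cols),
        "dtype": str(X.dtype),
        "density": float(n_nonzero) / (n_rows * n_cols),
    }


# Made-up matrix: 2 of its 6 entries are non-zero.
toy_X = sp.csr_matrix(np.array([[0, 0, 1.5], [0, 2.0, 0]], dtype=np.float32))
summary = summarize_matrix(toy_X)
print(summary)
```

Calling `summarize_matrix(adata.X)` would give the same summary for the real matrix without converting it to a dense array.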

## 3️⃣ obs: cell-level annotations

In [None]:
adata.obs.head()

| cell_barcode | condition | cell_type | cov_drug_dose_name | dose_val | control | condition_name |
|---|---|---|---|---|---|---|
| AAACCCAAGAAGCCAC-34 | UBL5+ctrl | K562 | K562_UBL5+ctrl_1+1 | 1+1 | 0 | K562_UBL5+ctrl_1+1 |
| AAAGGATTCTCTCGAC-42 | UBL5+ctrl | K562 | K562_UBL5+ctrl_1+1 | 1+1 | 0 | K562_UBL5+ctrl_1+1 |
| AACGGGAGTAATGATG-25 | UBL5+ctrl | K562 | K562_UBL5+ctrl_1+1 | 1+1 | 0 | K562_UBL5+ctrl_1+1 |
| AAGAACAAGCTAGATA-35 | UBL5+ctrl | K562 | K562_UBL5+ctrl_1+1 | 1+1 | 0 | K562_UBL5+ctrl_1+1 |
| AAGACTCTCTATTGTC-33 | UBL5+ctrl | K562 | K562_UBL5+ctrl_1+1 | 1+1 | 0 | K562_UBL5+ctrl_1+1 |


In [32]:
# List available columns:
print(adata.obs.columns.to_list())
# Commonly used fields:
print(adata.obs['condition'].value_counts().head())
print(adata.obs['control'].value_counts().head())

['condition', 'cell_type', 'cov_drug_dose_name', 'dose_val', 'control', 'condition_name']
condition
ctrl            10691
NCBP2+ctrl        765
SLC39A9+ctrl      724
DONSON+ctrl       688
GAB2+ctrl         637
Name: count, dtype: int64
control
0    152060
1     10691
Name: count, dtype: int64
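
The labels in `condition` appear to follow a `GENE+ctrl` pattern, with plain `ctrl` marking control cells. A minimal sketch for pulling the perturbed gene names out of such a label follows. The function name `perturbed_genes`, the `+` separator, and the `ctrl` token are assumptions inferred from the values printed above, not a documented format.

```python
def perturbed_genes(condition, control_token="ctrl"):
    """Split a label such as 'UBL5+ctrl' into the list of perturbed gene names."""
    return [part for part in condition.split("+") if part != control_token]


# Labels taken from the value counts shown above.
for label in ["ctrl", "UBL5+ctrl", "NCBP2+ctrl"]:
    print(label, "->", perturbed_genes(label))
```

On the real object this could be applied column-wise, for example with `adata.obs['condition'].map(perturbed_genes)`. A control cell yields an empty list.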


## 4️⃣ var: gene-level annotations

In [34]:
adata.var.head()

| gene_id | gene_name | chr | start | end | class | strand | length | in_matrix | mean | std | cv | fano | highly_variable | means | dispersions | dispersions_norm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000237491 | LINC01409 | chr1 | 778747 | 810065 | gene_version10 | + | 31318 | True | 0.137594 | 0.380048 | 2.762105 | 1.049733 | True | 0.130939 | 0.222407 | 0.028718 |
| ENSG00000188290 | HES4 | chr1 | 998962 | 1000172 | gene_version10 | - | 1210 | True | 0.249577 | 0.561933 | 2.25154 | 1.265214 | True | 0.205869 | 0.322631 | 0.715487 |
| ENSG00000187608 | ISG15 | chr1 | 1001138 | 1014540 | gene_version10 | + | 13402 | True | 0.377373 | 0.787623 | 2.08712 | 1.643865 | True | 0.335591 | 0.757568 | 3.695832 |
| ENSG00000176022 | B3GALT6 | chr1 | 1232237 | 1235041 | gene_version7 | + | 2804 | True | 0.315492 | 0.603217 | 1.911989 | 1.153345 | True | 0.251509 | 0.187828 | -0.208232 |
| ENSG00000131584 | ACAP3 | chr1 | 1292390 | 1309609 | gene_version19 | - | 17219 | True | 0.146009 | 0.391124 | 2.678769 | 1.047732 | True | 0.133338 | 0.198733 | -0.133505 |


In [None]:
# List gene annotation fields:
print(adata.var.columns.to_list())
# Example gene names:
print(adata.var['gene_name'][:10])

['gene_name', 'chr', 'start', 'end', 'class', 'strand', 'length', 'in_matrix', 'mean', 'std', 'cv', 'fano', 'highly_variable', 'means', 'dispersions', 'dispersions_norm']
gene_id
ENSG00000237491     LINC01409
ENSG00000188290          HES4
ENSG00000187608         ISG15
ENSG00000176022       B3GALT6
ENSG00000131584         ACAP3
ENSG00000162576         MXRA8
ENSG00000221978         CCNL2
ENSG00000224870    MRPL20-AS1
ENSG00000242485        MRPL20
ENSG00000160072        ATAD3B
Name: gene_name, dtype: category
Categories (4999, object): ['A1BG', 'AAGAB', 'AAK1', 'AAMDC', ..., 'ZWINT', 'ZYG11B', 'ZYX', 'ZZEF1']
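
The output above reports 4999 categories for 5,000 genes, which suggests that at least one gene name is repeated or missing. A small sketch of how such duplicates could be located with pandas follows, using a made-up four-row table in place of `adata.var`.

```python
import pandas as pd

# Made-up stand-in for adata.var: four IDs, one gene name used twice.
toy_var = pd.DataFrame(
    {"gene_name": pd.Categorical(["GENE_A", "GENE_B", "GENE_A", "GENE_C"])},
    index=pd.Index(["ID_1", "ID_2", "ID_3", "ID_4"], name="gene_id"),
)

n_unique = toy_var["gene_name"].nunique()

# keep=False marks every occurrence of a repeated name, not only the later ones.
dupes = toy_var[toy_var["gene_name"].duplicated(keep=False)]

print(f"{len(toy_var)} rows, {n_unique} unique names")
print(dupes)
```

Because the index (`gene_id`) stays unique, it is the safer key for selecting genes. Selecting by `gene_name` can return more than one row.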


## 5️⃣ obsm: cell embeddings / latent representations

In [37]:
adata.obsm.keys()

KeysView(AxisArrays with keys: )
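
`obsm` is empty for this object, so there is no embedding to inspect yet. With scanpy, `sc.pp.pca(adata)` is the usual way to fill `adata.obsm['X_pca']`, along with gene loadings in `adata.varm['PCs']`. The sketch below is not that implementation. It uses plain NumPy on made-up data only to show what shape such entries take.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(6, 4))   # 6 cells x 4 genes, made-up values

# PCA via SVD of the column-centered matrix.
centered = toy_X - toy_X.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

n_comps = 2
cell_embedding = U[:, :n_comps] * S[:n_comps]   # (n_cells, n_comps): the kind of array kept in obsm
gene_loadings = Vt[:n_comps].T                  # (n_genes, n_comps): the kind of array kept in varm

print(cell_embedding.shape, gene_loadings.shape)
```

The cell embedding has one row per cell and the loadings have one row per gene, which is the shape constraint `obsm` and `varm` enforce.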

In [40]:
# obsm is empty for this dataset, so guard each lookup:
if 'X_pca' in adata.obsm:
    print(adata.obsm['X_pca'].shape)
if 'X_umap' in adata.obsm:
    print(adata.obsm['X_umap'].shape)
    # Visualize UMAP (if available):
    sc.pl.umap(adata, color='condition')

## 6️⃣ varm: gene embeddings (often empty initially)

In [41]:
adata.varm.keys()

KeysView(AxisArrays with keys: )

In [None]:
# varm is empty here. If you later add gene embeddings, check their shape:
if 'gene_embedding' in adata.varm:
    print(adata.varm['gene_embedding'].shape)

## 7️⃣ layers: alternative expression layers

In [42]:
adata.layers.keys()

KeysView(Layers with keys: )

In [44]:
# layers is empty here. If a 'counts' layer is present, compare it with X:
if 'counts' in adata.layers:
    print(adata.layers['counts'].shape)
    print(adata.layers['counts'][:5, :5])

## 8️⃣ uns: unstructured metadata

In [47]:
adata.uns.keys()

dict_keys(['hvg', 'non_dropout_gene_idx', 'non_zeros_gene_idx', 'rank_genes_groups_cov_all', 'top_non_dropout_de_20', 'top_non_zero_de_20'])
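
The key names alone say little about what each `uns` entry holds. As a rough sketch, the hypothetical helper `describe_uns` below maps every key to the type and size of its value. It is shown on a made-up dictionary, since the actual contents of these `uns` entries are not displayed here.

```python
def describe_uns(uns):
    """Map each key to (type name, shape or length) of the stored value."""
    summary = {}
    for key, value in uns.items():
        shape = getattr(value, "shape", None)
        if shape is not None:
            size = tuple(shape)      # arrays and DataFrames
        elif hasattr(value, "__len__"):
            size = len(value)        # dicts, lists, strings
        else:
            size = None              # scalars and other objects
        summary[key] = (type(value).__name__, size)
    return summary


# Made-up contents, for illustration only.
toy_uns = {"hvg": {"flavor": "seurat"}, "some_idx": [0, 2, 5], "n_top": 20}
for key, info in describe_uns(toy_uns).items():
    print(key, info)
```

Since `adata.uns` behaves like a dictionary, `describe_uns(adata.uns)` should work on the real object as well.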

In [None]:
# Examples (.get returns None when a key is absent, as both are here):
print(adata.uns.get('log1p'))
print(adata.uns.get('neighbors'))

## 9️⃣ One-shot inspection helper

In [49]:
def inspect_anndata(adata):
    print("=== AnnData overview ===")
    print(adata)
    print()

    print("obs columns:", list(adata.obs.columns))
    print("var columns:", list(adata.var.columns))
    print()

    print("obsm keys:", list(adata.obsm.keys()))
    print("varm keys:", list(adata.varm.keys()))
    print("layers:", list(adata.layers.keys()))
    print("uns keys:", list(adata.uns.keys()))

inspect_anndata(adata)


=== AnnData overview ===
AnnData object with n_obs × n_vars = 162751 × 5000
    obs: 'condition', 'cell_type', 'cov_drug_dose_name', 'dose_val', 'control', 'condition_name'
    var: 'gene_name', 'chr', 'start', 'end', 'class', 'strand', 'length', 'in_matrix', 'mean', 'std', 'cv', 'fano', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'hvg', 'non_dropout_gene_idx', 'non_zeros_gene_idx', 'rank_genes_groups_cov_all', 'top_non_dropout_de_20', 'top_non_zero_de_20'

obs columns: ['condition', 'cell_type', 'cov_drug_dose_name', 'dose_val', 'control', 'condition_name']
var columns: ['gene_name', 'chr', 'start', 'end', 'class', 'strand', 'length', 'in_matrix', 'mean', 'std', 'cv', 'fano', 'highly_variable', 'means', 'dispersions', 'dispersions_norm']

obsm keys: []
varm keys: []
layers: []
uns keys: ['hvg', 'non_dropout_gene_idx', 'non_zeros_gene_idx', 'rank_genes_groups_cov_all', 'top_non_dropout_de_20', 'top_non_zero_de_20']
