<a href="https://www.kaggle.com/code/dalloliogm/cytetype-exploration-1?scriptVersionId=251162050" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Evaluation of CyteType

CyteType is a tool for automatically annotating cell types in single-cell data, using an LLM annotator agent.

In this notebook, we try the tool on a few datasets and compare it with other annotation tools.

### Install Libraries

In [None]:
!python -m pip install -q cytetype scanpy igraph leidenalg

In [None]:
import anndata
import scanpy as sc
import cytetype
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## Initialize Paul15 dataset

The Paul et al. (2015) dataset is a small single-cell dataset of myeloid cells, available from scanpy. The cell types have been manually annotated, so we can compare CyteType's predictions against the reference labels.

In [None]:
#adata = sc.datasets.paul15()


In [None]:
import os
import scanpy as sc

if not os.path.exists("paul15_small.h5ad"):
    adata = sc.datasets.paul15()
    adata.write("paul15_small.h5ad")
else:
    adata = sc.read("paul15_small.h5ad")


### Process the data - compute clusters, etc

In [None]:
adata.obs["paul15_clusters"]

In [None]:

# Load and preprocess your data
adata.var["gene_symbols"] = adata.var_names

# We compute the clusters as suggested by the tutorial
# However, this produces 10 clusters. We are going to use the 19
# clusters from the original paper, which are stored in the `paul15_clusters` column.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.leiden(adata)
sc.tl.rank_genes_groups(adata, groupby='paul15_clusters', method='t-test')

# Initialize CyteType (performs data preparation)
annotator = cytetype.CyteType(
    adata,
    group_key='paul15_clusters',           # Required: cluster column name
    rank_key='rank_genes_groups',          # DE results key (default)
    gene_symbols_column='gene_symbols',    # Gene symbols column (default)
    n_top_genes=50,                        # Top marker genes per cluster
    #results_prefix='cytetype'              # Prefix for result columns
)

In [None]:
# Inspect the parameters used for the differential expression test
adata.uns['rank_genes_groups']['params']

In [None]:
# 1. Grab the result dict
res = adata.uns['rank_genes_groups']

# 2. Extract the structured array of gene names
#    This is a numpy recarray whose field names are the cluster labels used in
#    `groupby`, here the Paul15 IDs, e.g. ('1Ery', '2Ery', …, '19Lymph').
names = res['names']

# 3. Get the list of clusters:
clusters = names.dtype.names
print("Clusters found:", clusters)

# 4. Loop through and print top 50 per cluster
for cl in clusters:
    top50 = names[cl][:50]
    print(f"\nCluster {cl} — top 50 markers:")
    print(", ".join(top50))
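
The structure described in the comments above can be hard to picture, so here is a minimal sketch using a toy structured array in place of `adata.uns['rank_genes_groups']['names']`. The gene names and the two-cluster layout are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np

# Toy stand-in for adata.uns['rank_genes_groups']['names']:
# one field per cluster, one row per rank (hypothetical gene names).
names = np.array(
    [("Gata1", "Elane"), ("Klf1", "Mpo"), ("Car2", "Prtn3")],
    dtype=[("1Ery", "U10"), ("16Neu", "U10")],
)

# Field names are the cluster labels
clusters = names.dtype.names

# Indexing by field name returns that cluster's ranked genes
top2 = {cl: names[cl][:2].tolist() for cl in clusters}
print(top2)
```

Indexing by field name returns one column of ranked genes, which is why the loop above can slice `names[cl][:50]` directly.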

In [None]:
adata.obs

In [None]:
# How many clusters are there?
print(f"Number of clusters: {len(adata.obs['paul15_clusters'].unique())}")
adata.obs['paul15_clusters'].value_counts().sort_index()


In [None]:
!pip install -q celltypist

In [None]:
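import celltypist

# Run the default immune reference.
# `cluster_support` and `gpu` are not arguments of `celltypist.annotate`,
# so they are dropped here; prediction runs on the CPU by default.
predictions = celltypist.annotate(
    adata,
    model='Immune_All_Low.pkl',    # or whichever fits your system
    majority_voting=True,
)

# `predicted_labels` is a DataFrame; keep the majority-voting label for each cell
adata.obs['cell_type_ct'] = predictions.predicted_labels['majority_voting']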
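
To make the effect of `majority_voting=True` more concrete, the sketch below shows the general idea on toy data: every cell in a cluster receives the most frequent per-cell label of that cluster. This is a simplified illustration, not CellTypist's actual implementation (which first over-clusters the data); the function name `majority_vote` and the toy labels are assumptions made for this example.

```python
import pandas as pd

def majority_vote(per_cell_labels: pd.Series, clusters: pd.Series) -> pd.Series:
    """Replace each cell's label with the most frequent label in its cluster."""
    # Most frequent label per cluster (ties resolved by the order of Series.mode)
    winner = per_cell_labels.groupby(clusters).agg(lambda s: s.mode().iloc[0])
    return clusters.map(winner)

# Toy data: cluster "a" has one dissenting cell
cells = pd.DataFrame({
    "cluster": ["a", "a", "a", "b", "b"],
    "label": ["Neutrophil", "Neutrophil", "Monocyte", "Mast cell", "Mast cell"],
})
cells["voted"] = majority_vote(cells["label"], cells["cluster"])
print(cells)
```

The single "Monocyte" call in cluster `a` is overruled by its neighbours, which is the smoothing effect that majority voting is meant to provide.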


### Call the Agent! Run the annotation

In [None]:
# Run annotation
adata = annotator.run(
    study_context="""
    Mouse bone marrow cells undergoing myeloid differentiation. This includes several subtypes of immune cell progenitors.

    """
)

# View results
#print(adata.obs.cytetype_leiden)

### Compare annotation with existing one

The original annotations are more granular than the ones from CyteType, so we manually create a dictionary to compare them.

In [None]:
print(adata.obs)

In [None]:
print(adata.obs.cytetype_annotation_paul15_clusters.unique().to_list())

In [None]:
label_map = {
    '1Ery': 'Erythroblast',
    '2Ery': 'Erythroid precursor',
    '3Ery': 'Erythroblast',
    '4Ery': 'Erythroid precursor',
    '5Ery': 'Erythroid cell',
    '6Ery': 'Erythroid progenitor',

    '7MEP': 'Myeloid progenitor cell',
    '8Mk': 'Megakaryocyte',
    '9GMP': 'Hematopoietic progenitor cell',
    '10GMP': 'Hematopoietic progenitor cell',

    '11DC': 'Antigen-presenting myeloid cell',
    '12Baso': 'Mast cell',
    '13Baso': 'Mast cell',

    '14Mo': 'Neutrophil promyelocyte',
    '15Mo': 'Neutrophil promyelocyte',

    '16Neu': 'Neutrophil',
    '17Neu': 'Neutrophil',

    '18Eos': 'Eosinophil',
    '19Lymph': 'Natural Killer (NK) cell'
}
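
A hand-written dictionary like this silently yields `NaN` for any cluster ID that is missing or mistyped, so a quick coverage check before mapping can help. The helper below is a minimal sketch; `unmapped_labels` is a hypothetical name and both example maps are truncated for illustration.

```python
def unmapped_labels(observed, mapping):
    """Return the observed cluster IDs that have no entry in `mapping`."""
    return sorted(set(observed) - set(mapping))

# Hypothetical, truncated examples
observed_ids = ["1Ery", "5Ery", "8Mk", "19Lymph"]
good_map = {
    "1Ery": "Erythroblast",
    "5Ery": "Erythroid cell",
    "8Mk": "Megakaryocyte",
    "19Lymph": "Natural Killer (NK) cell",
}
# Same map, but with two mistyped keys
bad_map = {"1Ery": "Erythroblast", "5MEP": "?", "8GMP": "?", "19Lymph": "?"}

missing_good = unmapped_labels(observed_ids, good_map)
missing_bad = unmapped_labels(observed_ids, bad_map)
print(missing_good)
print(missing_bad)
```

With the notebook objects, the equivalent call would be `unmapped_labels(adata.obs["paul15_clusters"].unique(), label_map)`, which should return an empty list.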


In [None]:
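# Map the original Paul15 cluster IDs (the keys of `label_map`) to broad labels
adata.obs["true_broad"] = adata.obs["paul15_clusters"].map(label_map)

# Row-normalized confusion matrix: each row (true label) sums to 1
conf_mat = pd.crosstab(adata.obs["true_broad"], adata.obs["cytetype_annotation_paul15_clusters"], normalize='index')
conf_mat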

In [None]:
conf_mat
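
For readers unfamiliar with `normalize='index'`, here is a small sketch on toy labels showing what the row-normalized table contains. The labels and counts are invented for illustration only.

```python
import pandas as pd

# Toy labels: 4 true "Ery" cells (one mispredicted) and 2 true "Neu" cells
obs = pd.DataFrame({
    "true": ["Ery", "Ery", "Ery", "Ery", "Neu", "Neu"],
    "pred": ["Ery", "Ery", "Ery", "Neu", "Neu", "Neu"],
})

# Raw counts per (true, predicted) pair
counts = pd.crosstab(obs["true"], obs["pred"])

# normalize="index" divides each row by its total, so every row sums to 1
conf = pd.crosstab(obs["true"], obs["pred"], normalize="index")
print(counts)
print(conf)
```

Because each row sums to 1, a diagonal entry reads as the fraction of cells with that true label that received the same predicted label.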

In [None]:
# # Plot Confusion matrix
# plt.figure(figsize=(10, 6))
# sns.heatmap(conf_mat, annot=True, cmap="Blues", fmt=".2f", cbar=True)
# plt.title("CyteType vs Paul15 Broad Cell Type Mapping")
# plt.ylabel("True Label (Paul15 Broad)")
# plt.xlabel("Predicted Label (CyteType)")
# plt.tight_layout()
# plt.show()

## Repeating the confusion matrix, using the original labels


In the code above we had to manually match the Paul15 labels with the ones generated by CyteType. Here we use the original labels instead, expanded into more readable full names that keep the original cluster ID as a prefix.


In [None]:
import pandas as pd

# Step 1: Mapping from cluster ID to full name
# Keys must match the categories in `paul15_clusters` (the same IDs used in `label_map`);
# any ID without an entry would be mapped to NaN.
paul15_label_fullname = {
    '1Ery': '1Ery - Early erythroid progenitors',
    '2Ery': '2Ery - Intermediate erythroid stage',
    '3Ery': '3Ery - Late erythroid / erythroblast',
    '4Ery': '4Ery - Terminally differentiating erythrocytes',
    '5Ery': '5Ery - Erythroid cells',
    '6Ery': '6Ery - Erythroid cells',
    '7MEP': '7MEP - Megakaryocyte-Erythroid Progenitor (late stage)',
    '8Mk': '8Mk - Megakaryocytes',
    '9GMP': '9GMP - Granulocyte-monocyte progenitors',
    '10GMP': '10GMP - Granulocyte-monocyte progenitors',
    '11DC': '11DC - Dendritic cells',
    '12Baso': '12Baso - Basophils',
    '13Baso': '13Baso - Mature basophil progenitors',
    '14Mo': '14Mo - Late-stage monocytes',
    '15Mo': '15Mo - Mature monocytes',
    '16Neu': '16Neu - Neutrophil progenitors',
    '17Neu': '17Neu - Late-stage neutrophils',
    '18Eos': '18Eos - Eosinophil progenitors',
    '19Lymph': '19Lymph - Lymphoid-like cells (NK/T precursors or contaminants)',
}

# Step 2: Map full names into obs
adata.obs["paul15_fullname"] = adata.obs["paul15_clusters"].map(paul15_label_fullname)
adata.obs.head()


In [None]:
adata.obs.cytetype_annotation_paul15_clusters.value_counts()

In [None]:

# Step 3: Define full-name order
manual_order_full = list(paul15_label_fullname.values())


# Step 4: Create and reorder confusion matrix
conf_mat_full = pd.crosstab(
    adata.obs["paul15_fullname"],
    adata.obs["cytetype_annotation_paul15_clusters"],
    normalize='index'
)
# Only keep labels that exist in the confusion matrix
existing_labels = [label for label in manual_order_full if label in conf_mat_full.index]

# Reorder based on existing labels only
conf_mat_ordered = conf_mat_full.loc[existing_labels]

# Optional: reorder columns if matching set
if all(label in conf_mat_ordered.columns for label in manual_order_full):
    conf_mat_ordered = conf_mat_ordered[manual_order_full]

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 10))
sns.heatmap(conf_mat_ordered, annot=True, fmt=".2f", cmap="Blues", cbar=True)

plt.title("Confusion Matrix (Original Cluster Names, Ordered by Manual Mapping)")
plt.xlabel("Predicted Label")
plt.ylabel("Original Cluster")
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()


In [None]:
# Cells per (CyteType label, original cluster) pair; observed=True skips empty pairs
counts = adata.obs.groupby(["cytetype_annotation_paul15_clusters", "paul15_clusters"], observed=True).size()
counts.reset_index(name="cell_count").sort_values("paul15_clusters").style.background_gradient(subset=["cell_count"])

In [None]:
adata.obs[["paul15_fullname"]].value_counts()

In [None]:
adata.obs[["cytetype_annotation_paul15_clusters", "paul15_fullname"]].value_counts()
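
The count tables above can also be condensed into one row per original cluster, showing which predicted label dominates and how large its share is. The following is a minimal sketch on toy data; `dominant_labels` is a hypothetical helper and the labels are invented.

```python
import pandas as pd

def dominant_labels(obs, cluster_key, label_key):
    """For each cluster, report the most common predicted label and its share."""
    conf = pd.crosstab(obs[cluster_key], obs[label_key], normalize="index")
    return pd.DataFrame({
        "dominant_label": conf.idxmax(axis=1),  # column with the largest share
        "fraction": conf.max(axis=1),           # size of that share
    })

# Toy data: one of four "16Neu" cells gets a different label
toy = pd.DataFrame({
    "cluster": ["16Neu"] * 4 + ["8Mk"] * 2,
    "label": ["Neutrophil"] * 3 + ["Monocyte"] + ["Megakaryocyte"] * 2,
})
summary = dominant_labels(toy, "cluster", "label")
print(summary)
```

In the notebook this could be called as `dominant_labels(adata.obs, "paul15_clusters", "cell_type_ct")`, for example, to summarize the CellTypist labels per original cluster.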

In [None]:
adata.obs

In [None]:
#sc.pl.embedding(adata, basis='umap', color='cytetype_annotation_cell_type')

## What if we give the wrong description?

Let's give CyteType a wrong study context, for example by claiming that this is a zebrafish retina dataset, to see whether it still returns good results.

In [None]:
adata2 = adata.copy()
# Initialize CyteType (performs data preparation)
annotator = cytetype.CyteType(
    adata2,
    group_key='paul15_clusters',
    rank_key='rank_genes_groups',
    gene_symbols_column='gene_symbols',
    n_top_genes=50,
)

In [None]:
adata2 = annotator.run(
    study_context="Zebrafish retina development during embryogenesis"
)
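
One simple way to compare the two runs is to list the clusters whose label changed. The sketch below assumes each cluster carries a single label per run and that both runs store their labels under the same column name; `changed_clusters` is a hypothetical helper, and the toy labels (including "Photoreceptor") are invented rather than actual CyteType output.

```python
import pandas as pd

def changed_clusters(obs_a, obs_b, cluster_key, label_key):
    """List clusters whose (first observed) label differs between two runs."""
    a = obs_a.groupby(cluster_key, observed=True)[label_key].first()
    b = obs_b.groupby(cluster_key, observed=True)[label_key].first()
    both = pd.DataFrame({"run_a": a, "run_b": b})
    return both[both["run_a"] != both["run_b"]]

# Toy stand-ins for adata.obs (correct context) and adata2.obs (wrong context)
run_a = pd.DataFrame({
    "cluster": ["8Mk", "8Mk", "16Neu"],
    "label": ["Megakaryocyte", "Megakaryocyte", "Neutrophil"],
})
run_b = pd.DataFrame({
    "cluster": ["8Mk", "8Mk", "16Neu"],
    "label": ["Megakaryocyte", "Megakaryocyte", "Photoreceptor"],
})
diff = changed_clusters(run_a, run_b, "cluster", "label")
print(diff)
```

With the notebook objects, the call would look like `changed_clusters(adata.obs, adata2.obs, "paul15_clusters", "cytetype_annotation_paul15_clusters")`, provided the second run writes to that same column.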


In [None]:
# adata2.obs["true_broad"] = adata2.obs["paul15_clusters"].map(label_map)

# pd.crosstab(adata2.obs["true_broad"], adata2.obs["cytetype_paul15_clusters"], normalize='index')


In [None]:
# adata2.obs["paul15_fullname"] = adata2.obs["paul15_clusters"].map(paul15_label_fullname)


In [None]:
# # Create normalized confusion matrix
# conf_mat = pd.crosstab(
#     adata2.obs["true_broad"],
#     adata2.obs["cytetype_paul15_clusters"],
#     normalize='index'
# )

# # Plot with seaborn
# plt.figure(figsize=(10, 6))
# sns.heatmap(conf_mat, annot=True, cmap="Blues", fmt=".2f", cbar=True)
# plt.title("CyteType vs Paul15 Broad Cell Type Mapping")
# plt.ylabel("True Label (Paul15 Broad)")
# plt.xlabel("Predicted Label (CyteType)")
# plt.tight_layout()
# plt.show()

In [None]:
# # Let's also look at the Confusion matrix with the original cluster names
# conf_mat_original = pd.crosstab(adata2.obs["paul15_clusters"], adata2.obs["cytetype_paul15_clusters"], normalize='index')
# conf_mat_original



# # Step 3: Define full-name order
# manual_order_full = list(paul15_label_fullname.values())


# # Step 4: Create and reorder confusion matrix
# conf_mat_full = pd.crosstab(
#     adata2.obs["paul15_fullname"],
#     adata2.obs["cytetype_paul15_clusters"],
#     normalize='index'
# )
# # Only keep labels that exist in the confusion matrix
# existing_labels = [label for label in manual_order_full if label in conf_mat_full.index]

# # Reorder based on existing labels only
# conf_mat_ordered = conf_mat_full.loc[existing_labels]

# # Optional: reorder columns if matching set
# if all(label in conf_mat_ordered.columns for label in manual_order_full):
#     conf_mat_ordered = conf_mat_ordered[manual_order_full]

In [None]:
# import seaborn as sns
# import matplotlib.pyplot as plt

# plt.figure(figsize=(12, 10))
# sns.heatmap(conf_mat_ordered, annot=True, fmt=".2f", cmap="Blues", cbar=True)

# plt.title("Confusion Matrix (Original Cluster Names, Ordered by Manual Mapping) - using a wrong study context")
# plt.xlabel("Predicted Label")
# plt.ylabel("Original Cluster")
# plt.xticks(rotation=90)
# plt.yticks(rotation=0)
# plt.tight_layout()
# plt.show()
