This package provides a wrapper around scanpy and OpenAI GPT to perform cell-type annotation on single-cell RNA-seq data.
pip install "git+https://github.com/VPetukhov/GPTCellAnnotator.git"
For a full usage example see auto_annotation.ipynb.
The package requires you to have an OpenAI API key. You can get one here. Once you have it, you can set it in your notebook to openai.api_key
. You can save it to a .env
file of your project and then use python-dotenv
to read it to an environment variable:
from dotenv import load_dotenv
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
- Generate a list of expected cell types and their markers for the given tissue using GPT. This list would later be used in the prompt to improve the quality. First, it is needed to make cell type names standardized. Second, it makes GPT focus on the cell types that are relevant to the tissue of interest.
- Generate a list of markers for the given cell types using GPT. GPT is surprisingly good in providing relevant cell type markers. While this list would not be directly used for cell type annotation, it seems to improve the annotation quality if provided to the annotation prompt. These two steps can be run with:
expected_types, expected_markers = get_expected_cell_types( species='mouse', tissue='pancreas', model='gpt-4', max_tokens=800 )
- Process scRNA-seq data using
scanpy
. This step includes filtering, normalization, dimensionality reduction, clustering, and marker gene selection. In addition to the standardscanpy
pipeline, it usesAUC
andSpecificity
metrics to rank genes. Example processing for theAnnData
objectad
:sc.pp.neighbors(ad, n_neighbors=30) sc.tl.leiden(ad, resolution=0.8) marker_dfs = get_markers_per_cluster(ad, clustering='leiden')
- Annotate the scRNA-seq clusters based on their markers using GPT. This step takes the data from the three previous steps, iterates over the clusters and queries OpenAI for each of them.
annotation_res = annotate_clusters( marker_genes, species='mouse', tissue='pancreas', expected_markers=expected_markers, model='gpt-4' ) ann_df = parse_annotation(annotation_res)
After these steps, the data can be visualized with:
ad.obs['annotation'] = ann_df['Cell type'][ad.obs['leiden'].values.astype(int)].values
sc.pl.umap(ad, color=['annotation'], legend_loc='on data', legend_fontsize=8)
Biological data is usually sensitive and should be handled with care. This package does not send any actual data to OpenAI. All processing goes locally, and the only data that is sent to OpenAI are:
- Name of the tissue and the species
- List of clusters and top-5 marker genes per cluster (only names, not expression values)