GPT Cell Annotator

This package provides a wrapper around scanpy and OpenAI GPT to perform cell-type annotation on single-cell RNA-seq data.

Installation

```
pip install "git+https://github.com/VPetukhov/GPTCellAnnotator.git"
```

Usage

For a full usage example see auto_annotation.ipynb.

Set OpenAI key

The package requires an OpenAI API key, which you can create in your OpenAI account. Once you have it, assign it to openai.api_key in your notebook. Alternatively, save it to a .env file in your project and use python-dotenv to load it into an environment variable:

```
import os
import openai
from dotenv import load_dotenv

load_dotenv()  # load variables from .env into the environment
openai.api_key = os.getenv("OPENAI_API_KEY")
```
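
A missing key otherwise only surfaces when the first request fails, so it can help to check for it right after loading the environment. The sketch below is one way to do that; `require_api_key` is a hypothetical helper, not part of this package.

```python
import os


def require_api_key(var_name="OPENAI_API_KEY"):
    """Return the key stored in the environment variable, or fail early.

    Hypothetical helper for illustration; not part of GPT Cell Annotator.
    """
    key = os.getenv(var_name)
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; add it to your .env file or export it."
        )
    return key
```

After load_dotenv() has run, the result can be assigned to openai.api_key as in the snippet above.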

Package workflow

Generate a list of expected cell types and their markers for the given tissue using GPT. This list is later included in the annotation prompt to improve its quality. First, it standardizes cell type names. Second, it makes GPT focus on the cell types that are relevant to the tissue of interest.
Generate a list of markers for the given cell types using GPT. GPT is surprisingly good at providing relevant cell type markers. Although this list is not used directly for cell type annotation, including it in the annotation prompt seems to improve annotation quality. These two steps can be run with:
```
expected_types, expected_markers = get_expected_cell_types(
    species='mouse', tissue='pancreas', model='gpt-4', max_tokens=800
)
```
Process the scRNA-seq data using scanpy. This step includes filtering, normalization, dimensionality reduction, clustering, and marker gene selection. In addition to the standard scanpy pipeline, it ranks genes by AUC and specificity. Example processing for the AnnData object ad:
```
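import scanpy as sc

# Assumes `ad` has already been filtered, normalized, and dimensionality-reduced.
sc.pp.neighbors(ad, n_neighbors=30)
sc.tl.leiden(ad, resolution=0.8)  # cluster labels are stored in ad.obs['leiden']
marker_dfs = get_markers_per_cluster(ad, clustering='leiden')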
```
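
To make the two ranking metrics concrete, here is a minimal pure-Python sketch for one gene and one cluster. The definitions are assumptions, since the description of this step does not spell them out: AUC is taken as the probability that a cell inside the cluster expresses the gene more highly than a cell outside it (ties count one half), and specificity as the fraction of cells outside the cluster that do not express the gene. The function names are hypothetical, and the package's own implementation may differ.

```python
def marker_auc(expr_in, expr_out):
    """AUC of one gene for separating cluster cells from all other cells.

    Assumed definition: P(in > out) + 0.5 * P(in == out) over all cell pairs.
    """
    wins = 0.0
    for x in expr_in:
        for y in expr_out:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(expr_in) * len(expr_out))


def marker_specificity(expr_out, threshold=0.0):
    """Assumed definition: share of outside cells with expression <= threshold."""
    silent = sum(1 for y in expr_out if y <= threshold)
    return silent / len(expr_out)


# Toy expression values of one gene (illustrative numbers, not real data).
inside = [2.0, 3.0, 0.0, 4.0]   # cells in the cluster
outside = [0.0, 0.0, 1.0, 0.0]  # all remaining cells

print(marker_auc(inside, outside))   # 0.84375
print(marker_specificity(outside))   # 0.75
```

The pairwise loop is quadratic and only meant for illustration; rank-based formulas give the same AUC much faster on real data.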

Annotate the scRNA-seq clusters based on their markers using GPT. This step takes the data from the three previous steps, iterates over the clusters and queries OpenAI for each of them.

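```
# marker_genes: the top marker genes per cluster (see the full example in auto_annotation.ipynb)
annotation_res = annotate_clusters(
    marker_genes, species='mouse', tissue='pancreas', expected_markers=expected_markers,
    model='gpt-4'
)
ann_df = parse_annotation(annotation_res)
```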
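
The per-cluster loop can be pictured with the following sketch, which builds one text prompt per cluster and stops short of calling any API. The prompt wording, the `build_prompt` helper, and the dictionary layout of the inputs are all assumptions made for illustration; `annotate_clusters` may format its queries differently.

```python
def build_prompt(cluster, genes, species, tissue, expected_markers):
    """Compose a hypothetical annotation prompt for a single cluster."""
    expected = "\n".join(
        f"- {cell_type}: {', '.join(markers)}"
        for cell_type, markers in expected_markers.items()
    )
    return (
        f"Species: {species}. Tissue: {tissue}.\n"
        f"Expected cell types and their markers:\n{expected}\n"
        f"Cluster {cluster} has marker genes: {', '.join(genes)}.\n"
        "Which cell type is this cluster most likely to be?"
    )


# Illustrative inputs (pancreas markers chosen by hand, not package output).
expected_markers = {"beta cell": ["Ins1", "Ins2"], "alpha cell": ["Gcg"]}
marker_genes = {0: ["Ins1", "Ins2", "Iapp"], 1: ["Gcg", "Arx"]}

# One prompt per cluster; each would be sent as a separate query.
prompts = {
    cluster: build_prompt(cluster, genes, "mouse", "pancreas", expected_markers)
    for cluster, genes in marker_genes.items()
}
print(prompts[0])
```

Repeating the expected cell types in every prompt keeps each query self-contained, at the cost of some extra tokens per cluster.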

After these steps, the data can be visualized with:

```
# Map each cell's leiden cluster label to its annotated cell type, then plot.
ad.obs['annotation'] = ann_df['Cell type'][ad.obs['leiden'].values.astype(int)].values
sc.pl.umap(ad, color=['annotation'], legend_loc='on data', legend_fontsize=8)
```

Data privacy

Biological data is usually sensitive and should be handled with care. This package does not send any actual data to OpenAI. All processing happens locally, and the only information sent to OpenAI is:

Name of the tissue and the species
List of clusters and top-5 marker genes per cluster (only names, not expression values)
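
As an illustration of what this means in practice, the sketch below reduces a per-cluster marker table to the names of its top 5 genes, dropping every numeric value before anything would leave the machine. The table layout, the score values, and the `top_marker_names` helper are hypothetical; the package's internal data structures may differ.

```python
def top_marker_names(marker_table, n=5):
    """Keep only the names of the n highest-scoring genes per cluster.

    marker_table maps a cluster label to (gene_name, score) pairs; scores
    are used for ranking locally and are not included in the result.
    """
    payload = {}
    for cluster, rows in marker_table.items():
        ranked = sorted(rows, key=lambda row: row[1], reverse=True)
        payload[cluster] = [gene for gene, _score in ranked[:n]]
    return payload


# Toy table with made-up scores (illustrative, not real data).
marker_table = {
    "0": [("Ins1", 0.98), ("Ins2", 0.97), ("Iapp", 0.91), ("Mafa", 0.80),
          ("Slc2a2", 0.78), ("Pdx1", 0.70)],
}
print(top_marker_names(marker_table))
# {'0': ['Ins1', 'Ins2', 'Iapp', 'Mafa', 'Slc2a2']}
```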
