# scGPT Embeddings Download and Set-Up Documentation

This guide should help you get scGPT working with flash-attn (on Lambda or another cloud service) while avoiding the pitfalls I encountered.

---
### 1. Setting Up the Instance

1. Download and install Anaconda:
```
wget https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Linux-x86_64.sh
bash Anaconda3-2024.06-1-Linux-x86_64.sh
source ~/.bashrc
```

2. Create and activate the scGPT Conda environment:

```bash
conda create -n "scgpt_conda" python=3.9
conda activate scgpt_conda
```

3. Install Jupyter kernel:

```bash
conda install ipykernel
python -m ipykernel install --user --name scgpt_conda --display-name "scgpt_conda 3.9"
```

4. Install necessary Python packages:

```bash
pip install scgpt 
pip install gseapy
pip install gdown
# PyMuPDF provides the `fitz` module; the separate PyPI package named `fitz` is unrelated
pip install PyMuPDF
```

Note: Installing scGPT without flash attention is fine when not training.

5. Download required models using gdown:

```bash
gdown --folder https://drive.google.com/drive/folders/1kkug5C7NjvXIwQGGaGoqXTk_Lb_pDrBU
```

Additional model downloads:

```
gdown --folder https://drive.google.com/drive/folders/1_GROJTzXiAV8HB4imruOTk6PEGuNOcgB # CP
gdown --folder https://drive.google.com/drive/folders/1vf1ijfQSk7rGdDGpBntR5bi5g6gNt-Gx # Brain
gdown --folder https://drive.google.com/drive/folders/1kkug5C7NjvXIwQGGaGoqXTk_Lb_pDrBU # BC
gdown --folder https://drive.google.com/drive/folders/1GcgXrd7apn6y4Ze_iSCncskX3UsWPY2r # Heart
gdown --folder https://drive.google.com/drive/folders/16A1DJ30PT6bodt4bWLa4hpS7gbWZQFBG # Lung
gdown --folder https://drive.google.com/drive/folders/1S-1AR65DF120kNFpEbWCvRHPhpkGK3kK # Kidney
gdown --folder http://drive.google.com/drive/folders/13QzLHilYUd0v3HTwa_9n4G4yEF-hdkqa # Pan Cancer
gdown --folder https://drive.google.com/drive/folders/1oWh_-ZRdhtoGQ2Fw24HP41FgLoomVo-y # Human (all 33 million)
```
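
Before moving on, it can help to confirm that each download finished. The sketch below assumes every model folder contains the three files that the export code later in this guide reads (`args.json`, `best_model.pt`, `vocab.json`) and that the folders are named `scGPT_*` in the current directory. The helper name `missing_model_files` is hypothetical.

```python
from pathlib import Path

# File names taken from the model-loading code later in this guide.
REQUIRED_FILES = ("args.json", "best_model.pt", "vocab.json")


def missing_model_files(model_dir):
    """Return the required files that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (model_dir / name).is_file()]


if __name__ == "__main__":
    # Assumes the scGPT_* folders were downloaded into the current directory.
    for folder in sorted(Path(".").glob("scGPT_*")):
        if not folder.is_dir():
            continue
        missing = missing_model_files(folder)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{folder.name}: {status}")
```

A folder that reports missing files most likely needs its `gdown` command rerun.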

6. Clone the gene embedding analysis repository:

```bash
git clone https://github.com/briannaflynn/gene_embedding_analysis.git
```

### 2. Installing Dependencies

1. Update package lists and install build tools:

```bash
sudo apt-get update
sudo apt-get install build-essential ninja-build
```

2. Install Python development headers and libraries:

```sudo apt-get install python3-dev```

### 3. Verifying Compiler Setup

1. Create a test program file named `test_pybind11.cpp`:

```cpp
#include <pybind11/pybind11.h>

int main() { return 0; }
```

2. Compile the test program to check if the compiler can find headers:

```bash
g++ test_pybind11.cpp -o test_pybind11 -I${CONDA_PREFIX}/include/python3.9 -L${CONDA_PREFIX}/lib -lpython3.9
```
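
The include and library paths in the command above depend on the Python version of the active environment. As a minimal sketch, the standard-library `sysconfig` module can report the directories the running interpreter actually uses, so the `-I` and `-L` arguments can be checked against them. The function name `python_build_paths` is hypothetical.

```python
import sysconfig


def python_build_paths():
    """Return header and library locations for the running interpreter."""
    return {
        # Directory that contains Python.h
        "include": sysconfig.get_paths()["include"],
        # Directory that contains libpython (may be None on some builds)
        "libdir": sysconfig.get_config_var("LIBDIR"),
        # Short version string in major.minor form
        "version": sysconfig.get_python_version(),
    }


if __name__ == "__main__":
    for key, value in python_build_paths().items():
        print(f"{key}: {value}")
```

Run this with the Python from the activated Conda environment; if the reported version differs from the one in the `g++` command, adjust the command to match.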

### 4. Setting Environment Variables

Set necessary environment variables to ensure correct library linking:

```bash
export CPLUS_INCLUDE_PATH=${CONDA_PREFIX}/include:${CONDA_PREFIX}/include/python3.9:$CPLUS_INCLUDE_PATH
export C_INCLUDE_PATH=${CONDA_PREFIX}/include:${CONDA_PREFIX}/include/python3.9:$C_INCLUDE_PATH
export LIBRARY_PATH=${CONDA_PREFIX}/lib:$LIBRARY_PATH
export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:$LD_LIBRARY_PATH
```

### 5. Ensuring pip Uses the Correct Environment

#### Issue:
When using Lambda, pip may still point to the base user Python installation instead of the Conda environment.

#### Fix:
1. Add the following to `~/.bashrc`:

```export PATH="/home/ubuntu/anaconda3/envs/scgpt_conda/bin:$PATH"```

2. Apply the changes:

```source ~/.bashrc```

3. Verify the correct pip path:

```which pip```

Expected output:

```/home/ubuntu/anaconda3/envs/scgpt_conda/bin/pip```
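
The same check can be made from inside Python. This sketch compares the directory of the first `pip` on `PATH` with the directory of the running interpreter; if they differ, `pip install` is probably targeting another environment. The function name `pip_matches_interpreter` is hypothetical.

```python
import os
import shutil
import sys


def pip_matches_interpreter(pip_path=None, python_path=None):
    """Return True when pip and the interpreter live in the same directory."""
    pip_path = pip_path or shutil.which("pip")
    python_path = python_path or sys.executable
    if not pip_path or not python_path:
        return False  # pip is not on PATH, or the interpreter path is unknown
    pip_dir = os.path.dirname(os.path.abspath(pip_path))
    python_dir = os.path.dirname(os.path.abspath(python_path))
    return pip_dir == python_dir


if __name__ == "__main__":
    print("pip:", shutil.which("pip"))
    print("python:", sys.executable)
    print("same environment:", pip_matches_interpreter())
```

Running `python -m pip install ...` instead of `pip install ...` also sidesteps the ambiguity, because it always uses the pip that belongs to that interpreter.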

### 6. Installing Flash Attention

The original instructions reference `flash-attn<1.0.5`, but this conflicts with CUDA 12. Instead, install version 1.0.6:

```pip install "flash-attn==1.0.6" --no-build-isolation```

- `--no-build-isolation`: builds the package against the packages already installed in the current environment instead of in an isolated build environment.
- If this causes issues in the future, downgrading to CUDA 11.7 may be necessary.

---
### 7. Installing Missing Dependencies

1. Install `wandb` for logging and visualization:

```pip install wandb```

### 8. Issues with scvi-tools and optax (JAX)

The documentation states that scGPT supports Python 3.8+, but certain dependencies required for fine-tuning need **Python 3.9**.

To resolve this:
- Set up a clean Python 3.9 environment.
- Reinstall scGPT and dependencies.
- Verify the exact installation versions used in the scGPT Docker branch (see the relevant pull request for details).
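
To compare an environment against a known-good one, it helps to list what is actually installed. The following is a small sketch using the standard-library `importlib.metadata`; the package list is an assumption drawn from the packages mentioned in this guide, and the function name `installed_versions` is hypothetical.

```python
from importlib import metadata

# Distribution names as published on PyPI; adjust the list as needed.
PACKAGES = ("scgpt", "scvi-tools", "optax", "jax", "numpy", "torch", "flash-attn")


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:12s} {version or 'not installed'}")
```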

---

### 9. Downgrading NumPy

There is a compatibility issue with NumPy 2.0, so NumPy must be downgraded to a 1.x release:

```pip install "numpy<2.0"```


In [1]:
import sys

In [2]:
# needs to be version 1, not version 2
import numpy
numpy.__version__

'1.26.4'

In [5]:
sys.executable

'/home/ubuntu/anaconda3/envs/scgpt_conda/bin/python'

In [6]:
import numpy as np

In [79]:
# The import prints a warning about flash attention, which can be ignored:
# flash-attn is kind of a pain to install and won't be needed unless training.
import scgpt

In [8]:
import copy
import json
import os
from pathlib import Path
import warnings

In [9]:
import torch
from anndata import AnnData
import scanpy as sc

In [52]:
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import pandas as pd
import tqdm
import pickle

In [11]:
import gseapy as gp

from torchtext.vocab import Vocab
from torchtext._torchtext import (
    Vocab as VocabPybind,
)

In [12]:
sys.path.insert(0, "../")
import scgpt as scg
from scgpt.tasks import GeneEmbedding
from scgpt.tokenizer.gene_tokenizer import GeneVocab
from scgpt.model import TransformerModel
from scgpt.preprocess import Preprocessor
from scgpt.utils import set_seed

os.environ["KMP_WARNINGS"] = "off"
warnings.filterwarnings('ignore')

In [80]:
def export2embeddings(model_name="./scGPT_bc", output_name="genes_names_embeddings.pkl"):

    # configs
    # setting parameters, seed value, num highly variable genes, num bins, etc
    set_seed(42)
    pad_token = "<pad>"
    special_tokens = [pad_token, "<cls>", "<eoc>"]
    n_hvg = 1200
    n_bins = 51
    mask_value = -1
    pad_value = -2
    n_input_bins = n_bins

    # locate the model files and load the vocabulary
    model_dir = Path(model_name)
    model_config_file = model_dir / "args.json"
    model_file = model_dir / "best_model.pt"
    vocab_file = model_dir / "vocab.json"

    vocab = GeneVocab.from_file(vocab_file)

    # make sure the special tokens are present in the vocabulary
    for s in special_tokens:
        if s not in vocab:
            vocab.append_token(s)

    with open(model_config_file, "r") as f:
        model_configs = json.load(f)
    print(f"resume model from {model_file} weights, model args override the {model_config_file} config")

    # embedding size, number of attn heads, number of hidden dimensions, number of layers
    embsize = model_configs["embsize"]
    nhead = model_configs["nheads"]
    d_hid = model_configs["d_hid"]
    nlayers = model_configs["nlayers"]
    n_layers_cls = model_configs["n_layers_cls"]

    gene2idx = vocab.get_stoi()

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    ntokens = len(vocab)
    print(f"The size of the vocabulary is {ntokens}")

    # create transformer model with specified configs
    model = TransformerModel(
        ntokens,
        embsize,
        nhead,
        d_hid,
        nlayers,
        vocab=vocab,
        pad_value=pad_value, 
        n_input_bins=n_input_bins)

    # load the parameters from model_file onto the selected device
    try:
        model.load_state_dict(torch.load(model_file, map_location=device))
        print(f"Loading all model params from {model_file}")
    except Exception:
        # only load parameters that are in the model and match the correct size
        model_dict = model.state_dict()
        pretrained_dict = torch.load(model_file, map_location=device)
        pretrained_dict = {k: v for k, v in pretrained_dict.items() if k in model_dict and v.shape == model_dict[k].shape}
        for k, v in pretrained_dict.items():
            print(f"Loading parameters {k} with shape {v.shape}")
        # update and load once, after the loop, instead of on every iteration
        model_dict.update(pretrained_dict)
        model.load_state_dict(model_dict)
    
    model.to(device)
    
    gene_ids = np.array(list(gene2idx.values()))
    # no gradients are needed to read out the embeddings
    with torch.no_grad():
        gene_embeddings = model.encoder(torch.tensor(gene_ids, dtype=torch.long).to(device))
    gene_embeddings_vec = gene_embeddings.detach().cpu().numpy()
    
    assert len(gene2idx.keys()) == len(gene_embeddings)

    print("\nTotal number of embedding vectors:", len(gene_embeddings))
    print("Length of each vector", len(gene_embeddings[0]))

    # connecting each of the gene names with its respective embedding array - this is the full version
    genes_names_embeddings = dict(zip(gene2idx.keys(), gene_embeddings_vec))
    test_key = list(genes_names_embeddings.keys())[0]
    print(f'\nTest Gene: {test_key}')
    print(genes_names_embeddings[test_key])

    # Save dictionary to a binary file
    with open(output_name, 'wb') as f:
        pickle.dump(genes_names_embeddings, f)
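
A pickle written by `export2embeddings` can be read back as a plain dictionary that maps gene names to NumPy vectors. The sketch below shows one way to load it and compare two genes by cosine similarity; the helper names `load_embeddings` and `cosine_similarity` are hypothetical, and the example path in the comments assumes the `./scgpt_embeddings/` output folder used in this guide.

```python
import pickle

import numpy as np


def load_embeddings(path):
    """Load the gene -> embedding dictionary written with pickle.dump."""
    # Only unpickle files you created yourself or otherwise trust.
    with open(path, "rb") as f:
        return pickle.load(f)


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Example usage (assumes the file exists):
# embeddings = load_embeddings("./scgpt_embeddings/scGPT_bc_embeddings.pkl")
# first, second = list(embeddings)[:2]
# print(first, second, cosine_similarity(embeddings[first], embeddings[second]))
```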

In [78]:
models

['scGPT_human',
 'scGPT_heart',
 'scGPT_bc',
 'scGPT_kidney',
 'scGPT_pancancer',
 'scGPT_lung',
 'scGPT_CP',
 'scGPT_brain']

In [81]:
os.makedirs('./scgpt_embeddings', exist_ok=True)  # safe to rerun if the folder already exists

In [82]:
models = [f for f in os.listdir('.') if f.startswith('scGPT_')]

for model in models:
    output_pkl = './scgpt_embeddings/' + model + '_embeddings.pkl'
    print('#' * 60)
    print(model)
    export2embeddings(model, output_pkl)
    print('')

############################################################
scGPT_human
resume model from scGPT_human/best_model.pt weights, model args override the scGPT_human/args.json config
The size of the vocabulary is 60697
Loading parameters encoder.embedding.weight with shape torch.Size([60697, 512])
Loading parameters encoder.enc_norm.weight with shape torch.Size([512])
Loading parameters encoder.enc_norm.bias with shape torch.Size([512])
Loading parameters value_encoder.linear1.weight with shape torch.Size([512, 1])
Loading parameters value_encoder.linear1.bias with shape torch.Size([512])
Loading parameters value_encoder.linear2.weight with shape torch.Size([512, 512])
Loading parameters value_encoder.linear2.bias with shape torch.Size([512])
Loading parameters value_encoder.norm.weight with shape torch.Size([512])
Loading parameters value_encoder.norm.bias with shape torch.Size([512])
Loading parameters transformer_encoder.layers.0.self_attn.out_proj.weight with shape torch.Size([512, 512