# LLM Usage on Cluster

Steps to download and install a Hugging Face model locally on a node. Assume you want to use [medgemma-27b-text-it](https://huggingface.co/google/medgemma-27b-text-it), which occupies around **60-80 GB** of storage. Given a typical quota of **100 GB**, this means that you can only install one model at a time.
You can check your usage with `df -h ~/` in a terminal while logged into one of the frontend nodes (`s-sc-frontend[1-3].charite.de`).
In general, shell commands can be executed in a terminal on the cluster (also available in the JupyterHub, scroll down) or in this notebook by prepending "!". E.g., to create a directory you can execute `mkdir new_dir` in a terminal, or you can create a new code cell in this notebook, write `!mkdir new_dir` and run the cell.
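
The same check can be done from a notebook cell before starting a large download. The following is a minimal sketch using only the standard library; the helper names `free_gb` and `has_room` are hypothetical, and the 80 GB threshold is taken from the upper size estimate above. Note that `shutil.disk_usage` reports free space of the whole file system, which may differ from your personal quota.

```python
import shutil
from pathlib import Path


def free_gb(path="."):
    """Return the free space in GB on the file system that holds `path`."""
    usage = shutil.disk_usage(Path(path).expanduser())
    return usage.free / 1e9


def has_room(path=".", required_gb=80.0):
    """Return True if at least `required_gb` GB are free under `path`."""
    return free_gb(path) >= required_gb
```

For example, `has_room("~", 80)` in a notebook cell indicates whether the file system holding your home directory could take the model at all.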

## Preliminary

Get an access token from Hugging Face: create an account on [Hugging Face](https://huggingface.co/), then create a token in your account settings under the tab `Access Tokens`.


## 1. Login to Cluster

### Pull this Notebook from GitHub

Log in to one of the three frontend nodes by specifying your username and a frontend node ID in [1,2,3] in a terminal (PowerShell on Windows):
```shell
ssh <usr>@s-sc-frontend[1-3].charite.de
```
Then clone the repository containing this notebook:
```shell
git clone https://github.com/aieoa/charite_hpc_llm.git
```


Go to the [JupyterHub](https://git.bihealth.org/charite-sc-public/sc-wiki/-/wikis/Resources/User%20Documentation/User%20Guide:%20HPC%20@Charite#jupyterhub), which offers pre-configured notebooks and shells. Select a server offering **Exclusive GPU**. Open a shell on the cluster and, if you have not cloned the repository yet, pull this notebook by cloning it into a dedicated folder:
```shell
mkdir -p git
git clone https://github.com/aieoa/charite_hpc_llm.git git/charite_hpc_llm
```



## 2. Create LLM-Specific Virtual Environment and Set Project Directory for Storage

### Virtual Environment
For reusability, create a virtual environment that contains all packages needed to run a specific model and simply load (activate) it later. For MedGemma, we need the packages provided in a pre-configured conda environment specific to GPU compute nodes, plus additional modules. A complete list of pre-configured kernels can be found [here](https://git.bihealth.org/charite-sc-public/sc-wiki/-/wikis/Resources/User%20Documentation/User%20Guide:%20HPC%20@Charite#shared-conda-environments) and in the [yaml files](https://git.bihealth.org/charite-sc-public/conda-envs/-/tree/main/).

As we require newer versions of at least one package than those provided in the pre-configured conda kernel `conda_envs-gpulab` (write-protected), we clone the gpulab environment, install additional packages into the clone and create a kernel that can be selected for this notebook:

#### Create Conda gpulab Clone
Create a clone called `llm_env` (or any other name) and activate it:
```shell
conda create --name llm_env --clone gpulab
conda activate llm_env
```

#### Register Jupyter Kernel
Register a Jupyter kernel for the currently active environment:
```shell
conda install ipykernel
python -m ipykernel install --user --name llm_env --display-name "Python (llm_env)"
```

#### Install Medgemma-Specific Modules
Install the missing packages and force an update of Jinja2:

```shell
conda env update -f environment.yaml --prune
```

#### Start Jupyter Notebook with Registered Kernel
Start the notebook by selecting it in the file browser, which shows your home directory on the cluster. Then select the new environment: it is shown at the top right of the notebook and can be chosen from a drop-down menu, where it is likely listed as `conda env:.conda-llm_env`.
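
If the kernel does not show up in the drop-down menu, it can help to check whether the registration step actually wrote a kernel specification. The sketch below assumes the default per-user kernel location on Linux (`~/.local/share/jupyter/kernels`); the helper name `find_user_kernel` is hypothetical and your installation may use a different path.

```python
import json
from pathlib import Path


def find_user_kernel(name, kernels_root=None):
    """Return the display name of a user-level kernel, or None if it is absent."""
    # Assumption: default per-user kernelspec directory on Linux.
    if kernels_root is None:
        kernels_root = Path.home() / ".local" / "share" / "jupyter" / "kernels"
    spec_file = Path(kernels_root) / name / "kernel.json"
    if not spec_file.is_file():
        return None
    spec = json.loads(spec_file.read_text(encoding="utf-8"))
    return spec.get("display_name", name)
```

Calling `find_user_kernel("llm_env")` should return the display name chosen during registration, or `None` if no such kernel was found at the assumed location.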

### Project Directory
On Charite's [service portal](https://s-m42-it-appl.charite.de/wm/app-SelfServicePortal/landing-page) you can apply for a project directory of, e.g., 1 TB.
Once you have been granted a project directory (see the email from bihealth), make sure that the model and the conda environments are stored there to avoid running out of disk space. Otherwise, for smaller models, you might be able to use a folder in the home directory if enough space is left (the user quota is 100 GB).

In [None]:
import os
from pathlib import Path

# replace the placeholder with your project directory; otherwise, e.g., use the home directory "~"
scratch_dir = "<your_granted_scratch_dir_on_hpc_see_mail>"
repo_id = "google/medgemma-27b-text-it"
model_dir = Path(scratch_dir) / "data" / "models" / repo_id.split('/')[-1]
model_dir.mkdir(parents=True, exist_ok=True)

model_dir
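
The fallback from project directory to home directory can also be made explicit in code. This is a minimal sketch under the assumption that a missing or non-writable project directory should silently fall back to the home directory; the helper names `resolve_storage_root` and `model_path` are hypothetical.

```python
import os
from pathlib import Path


def resolve_storage_root(candidate):
    """Return `candidate` if it is a writable directory, else the home directory."""
    path = Path(candidate).expanduser()
    if path.is_dir() and os.access(path, os.W_OK):
        return path
    return Path.home()


def model_path(storage_root, repo_id):
    """Build <storage_root>/data/models/<model name> from a Hub repository ID."""
    return Path(storage_root) / "data" / "models" / repo_id.split("/")[-1]
```

With these helpers, `model_path(resolve_storage_root(scratch_dir), repo_id)` yields the same layout as the cell above, but under the home directory when the project directory is not usable.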

In [None]:
# verify that llm_env is used and Jinja2>=3.1.x
import sys
print(sys.executable)
import jinja2
assert jinja2.__version__.startswith("3.1.")
print(jinja2.__version__)
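
A prefix check such as `startswith("3.1.")` rejects any later release line (e.g., 3.2). If a lower bound is what is actually needed, the versions can be compared numerically. The following sketch uses only the standard library; the helper names are hypothetical, and the simple parser only looks at the leading integers of each version component.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '3.1.4' (or '3.1.4.post1') into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def version_at_least(version, minimum):
    """Compare a version string against a minimum given as a tuple of ints."""
    return version_tuple(version) >= minimum


def installed_at_least(package, minimum):
    """Check an installed distribution; return False if it is not installed."""
    try:
        return version_at_least(metadata.version(package), minimum)
    except metadata.PackageNotFoundError:
        return False
```

For instance, `installed_at_least("jinja2", (3, 1))` is true for any installed Jinja2 from 3.1 upwards.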



## 4. Model Installation
Install the model (attached file) by executing the next cell. Note, this has to be done only once and is relatively time-consuming. Never share your notebook with your plaintext key. A better habit is to set it as an environment variable, say `HF_TOKEN`, and query it with `os.environ["HF_TOKEN"]`.

### a) Terminal

```shell
python scripts/download_model.py --repo_id "google/medgemma-27b-text-it" --local_dir <model_dir> --token $HF_TOKEN
```

### b) Notebook

In [None]:
import os

# Option a) unsafe: uncomment and paste your Hugging Face token here
# token = "<your_secret_HF_token>"

# Option b) preferred: set 'export HF_TOKEN="secret_token"' in ~/.bashrc followed by
# 'source ~/.bashrc' to take immediate effect, then load the environment variable with
token = os.environ.get("HF_TOKEN", "")

assert len(token) > 3, "ERROR\tToken not set!"
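
To reduce the risk of leaking the token through notebook output, it can be read with an explicit error message and only ever printed in masked form. This is a small sketch; the helper names `read_token` and `mask` are hypothetical and not part of any Hugging Face library.

```python
import os


def read_token(env_var="HF_TOKEN"):
    """Read the token from the environment and fail with a clear message if missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"{env_var} is not set; export it before starting Jupyter.")
    return token


def mask(token, visible=4):
    """Return the token with everything after the first `visible` characters hidden."""
    if len(token) <= visible:
        return "*" * len(token)
    return token[:visible] + "*" * (len(token) - visible)
```

Printing `mask(read_token())` confirms that a token was found without storing the secret in the saved notebook.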

In [None]:
from huggingface_hub import snapshot_download

## Download and install medgemma once by running this cell
## Accidentally re-running it does not trigger another download if model_dir has not changed
snapshot_download(
    repo_id=repo_id,
    local_dir=model_dir,
    max_workers=4,
    token=token
)
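
After the download, a quick look at the folder can confirm that it contains what a later `from_pretrained` call is likely to need. The sketch below assumes that the repository ships a `config.json` and weight files ending in `.safetensors` or `.bin`, which is common for transformers models but not guaranteed; `summarize_model_dir` is a hypothetical helper.

```python
from pathlib import Path


def summarize_model_dir(model_dir):
    """Count files and weight shards in a downloaded model folder and sum their size."""
    root = Path(model_dir)
    files = [p for p in root.rglob("*") if p.is_file()]
    # Assumption: weights are stored as .safetensors or .bin files.
    weights = [p for p in files if p.suffix in {".safetensors", ".bin"}]
    return {
        "n_files": len(files),
        "n_weight_files": len(weights),
        "total_gb": sum(p.stat().st_size for p in files) / 1e9,
        "has_config": (root / "config.json").is_file(),
    }
```

A `total_gb` far below the expected model size, or `n_weight_files` equal to zero, suggests that the download was interrupted or ran out of space.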

### Troubleshooting
If you encounter problems with your quota (you should have 100 GB), list the bulkiest folders with `du -h --max-depth=1 ~ | sort -hr | head -n 10` and delete what you no longer need.


In [None]:
!df -h $scratch_dir
!du -sh $scratch_dir
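
The search for bulky folders can also be done in Python, e.g., when the result should be processed further. This is a minimal sketch with hypothetical helper names; unlike `du`, it sums apparent file sizes rather than allocated disk blocks, so the numbers can differ slightly.

```python
import os
from pathlib import Path


def dir_size(path):
    """Sum the sizes of all regular files below `path` in bytes, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path, onerror=lambda err: None):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(file_path):
                    total += os.path.getsize(file_path)
            except OSError:
                pass  # file vanished or is unreadable
    return total


def largest_subdirs(root, top=10):
    """Return the `top` largest direct subfolders of `root` as (bytes, name) pairs."""
    root = Path(root).expanduser()
    sizes = [
        (dir_size(p), p.name)
        for p in root.iterdir()
        if p.is_dir() and not p.is_symlink()
    ]
    return sorted(sizes, reverse=True)[:top]
```

For example, `largest_subdirs("~")` lists the ten largest folders in the home directory, largest first.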


## 6. Prompt Completion

Organize your code to load model only once as this is a true bottleneck and then loop over prompts and collect completions.
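
One way to enforce the load-once pattern is to hide the loading step behind a cached function. The sketch below uses a placeholder dictionary instead of a real model so that it runs anywhere; `get_model` and `complete_all` are hypothetical names, and in practice the body of `get_model` would contain the actual `from_pretrained` calls.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_model(model_dir):
    """Stand-in loader: executed on the first call only, cached afterwards."""
    # Placeholder object; replace with the real tokenizer/model loading.
    return {"model_dir": model_dir}


def complete_all(prompts, model_dir):
    """Loop over prompts while reusing the single cached model."""
    model = get_model(model_dir)
    return [f"[{model['model_dir']}] {prompt}" for prompt in prompts]
```

Calling `complete_all` repeatedly with the same `model_dir` triggers the expensive loading step once; `get_model.cache_info()` shows how often the cache was hit.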

If the repo must be fetched again, redirect the Hub cache to a larger mount before any imports, either in the shell with `export HF_HUB_CACHE=/large_mount/hf_cache` or in Python prior to importing transformers with `import os; os.environ["HF_HUB_CACHE"] = "/large_mount/hf_cache"`. This is the recommended fix when "No space left on device" arises due to small root partitions.
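
The ordering constraint can be checked in code. The sketch below sets the variable and reports whether it was still early enough, assuming that `huggingface_hub` reads `HF_HUB_CACHE` only when it is first imported; `redirect_hub_cache` is a hypothetical helper and the cache path is a placeholder.

```python
import os
import sys
from pathlib import Path


def redirect_hub_cache(cache_dir):
    """Point HF_HUB_CACHE at `cache_dir`; return False if the library was already imported."""
    path = Path(cache_dir).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["HF_HUB_CACHE"] = str(path)
    # Assumption: the variable is evaluated at import time, so setting it
    # after huggingface_hub has been imported does not affect the running kernel.
    return "huggingface_hub" not in sys.modules
```

If the function returns `False`, restart the kernel and call it in the very first cell.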

In [None]:
!nvidia-smi

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

# Important: point here to downloaded version to avoid repeated caching under ~/.cache of the model
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, dtype="auto", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# this may take 10-15 minutes
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

## Automate Queries

### Push Data from PC to Cluster

Assume the text files are located in your home directory under `data` and you want to copy them to the cluster. If the target folder does not exist yet on the cluster, create it first. Ideally, use the granted project directory; if it is not available (yet), use your home directory (`~`) instead.
```shell
ssh <usr>@s-sc-frontend[1-3].charite.de "mkdir -p <scratch_dir>/data/input"
# or stored in home directory
ssh <usr>@s-sc-frontend[1-3].charite.de "mkdir -p ~/data/input"
```
Then copy all files recursively into the `input` folder created above. If the data is large, compressing it into a single file first is recommended.
```shell
scp -r ./data/* <usr>@s-sc-frontend[1-3].charite.de:<scratch_dir>/data/input/
# or copied to home directory
scp -r ./data/* <usr>@s-sc-frontend[1-3].charite.de:~/data/input/
```
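
Compressing the reports into a single archive before the transfer could look as follows. This is a sketch using the standard-library `zipfile` module; the helper names `pack_reports` and `unpack_reports` are hypothetical, and only `.txt` files directly inside the source folder are included.

```python
import zipfile
from pathlib import Path


def pack_reports(src_dir, archive_path):
    """Write all .txt files of `src_dir` into one compressed zip archive."""
    archive = Path(archive_path)
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(Path(src_dir).glob("*.txt")):
            zf.write(path, arcname=path.name)  # store without parent folders
    return archive


def unpack_reports(archive_path, dest_dir):
    """Extract the archive into `dest_dir` and return the extracted .txt file names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest)
    return sorted(p.name for p in dest.glob("*.txt"))
```

Run `pack_reports` on your PC, copy the single archive with `scp`, and call `unpack_reports` on the cluster with the input folder as destination.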

Now assume we want to iterate over the texts, each representing a single patient case, and ask the same question, e.g., "Is the patient depressed?".

In [None]:
# The model needs to be loaded only once: skip the two from_pretrained calls
# if tokenizer and model are already loaded from the cell above
import os

from pathlib import Path
from transformers import AutoTokenizer, AutoModelForCausalLM

# Important: point here to downloaded version to avoid repeated caching under ~/.cache of the model
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, dtype="auto", device_map="auto")

reports_dir = Path(scratch_dir) / "data" / "input"
# if scratch_dir not available:
# reports_dir = Path(os.path.expanduser("~")) / "data" / "input"
log_dir = Path(scratch_dir) / "data" / "output"
log_dir.mkdir(parents=True, exist_ok=True)  # do not fail if the folder already exists

In [None]:
# set here the maximum number of tokens to be generated to a reasonable number to restrict output length
# 1 word ~ 2.5 token
max_new_token = 500 

# Iterate over text files and store responses
for report_path in sorted(reports_dir.glob("*.txt")):  # filter for text files
    text_id = report_path.stem
    # iterdir()/glob() already yield full paths, so open the path directly
    text = report_path.read_text(encoding="utf-8")

    messages = [
        {
            "role": "user",
            "content": f"Given the record of a fictional patient: {text}. Is the patient depressed?"
        },
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=max_new_token)
    # decode only the newly generated tokens, not the prompt
    answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])
    print(answer)

    logfile = log_dir / f"{text_id}.log"
    logfile.write_text(answer, encoding="utf-8")
    print(f"STATUS\tOutput written to {logfile}")
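
Since a single completion can take minutes, it may be worth skipping reports that already have a log file when the loop is restarted after an interruption. The following sketch assumes the naming scheme used above (`<id>.txt` as input, `<id>.log` as output); `pending_reports` is a hypothetical helper.

```python
from pathlib import Path


def pending_reports(reports_dir, log_dir):
    """Return the .txt reports that do not yet have a matching .log file."""
    done = {p.stem for p in Path(log_dir).glob("*.log")}
    return [
        p for p in sorted(Path(reports_dir).glob("*.txt"))
        if p.stem not in done
    ]
```

Iterating over `pending_reports(reports_dir, log_dir)` instead of all files in `reports_dir` makes the loop resumable.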
