## 1. Choose a GPU Server: `NVIDIA RTX 4090 (24 GB)`

Ideally, each purpose gets its own server, because Python environments are fragile and break easily when several frameworks share one.
Otherwise, use `Conda` to keep an isolated Python environment per framework:
- Training frameworks
- Inference frameworks
- Deployment frameworks

## 2. Prepare the pre-trained model and dataset

### 2.1 HuggingFace `https://huggingface.co/`

- Option 1: Python `transformers` library
```python
# Download model
# !pip install transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-1.8B-Chat")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-1.8B-Chat")

# Download dataset (notebook shell command).
# Use /resolve/ to fetch the raw file; /blob/ returns the HTML viewer page.
!wget https://hf-mirror.com/datasets/LooksJuicy/ruozhiba/resolve/main/ruozhiba_qa.json
```
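The `wget` step needs the raw-file URL, not the web page for the file. On the Hub, raw files are served under `/resolve/<revision>/`, while `/blob/<revision>/` returns the HTML viewer. A minimal sketch of building such a URL follows; the helper name `hub_file_url` is hypothetical, and the default endpoint is assumed to be the mirror used elsewhere in this guide.

```python
from urllib.parse import quote


def hub_file_url(repo_id, filename, repo_type="model",
                 revision="main", endpoint="https://hf-mirror.com"):
    """Build the raw-download URL for one file in a Hub repository."""
    # Dataset repositories live under a "datasets/" prefix; models have none.
    prefix = "datasets/" if repo_type == "dataset" else ""
    base = endpoint.rstrip("/")
    return (f"{base}/{prefix}{repo_id}/resolve/"
            f"{quote(revision, safe='')}/{quote(filename)}")


print(hub_file_url("LooksJuicy/ruozhiba", "ruozhiba_qa.json", repo_type="dataset"))
```

The printed URL can be passed to `wget` directly.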

### 2.2 ModelScope `https://modelscope.cn/`

- Option 1: Python `modelscope` library
```python
# Download model
# !pip install modelscope
from modelscope import snapshot_download
snapshot_download('Qwen/Qwen1.5-1.8B-Chat', cache_dir='/root/autodl-tmp/models')

# Download dataset (notebook shell command).
# Use /resolve/ to fetch the raw file; /blob/ returns the HTML viewer page.
!wget https://hf-mirror.com/datasets/LooksJuicy/ruozhiba/resolve/main/ruozhiba_qa.json
```

### 2.3 HuggingFace Mirror `https://hf-mirror.com`

- Option 1: hfd.sh tool
```bash
export HF_ENDPOINT=https://hf-mirror.com
# download CLI tool
wget https://hf-mirror.com/hfd/hfd.sh
chmod a+x hfd.sh
# remember where the script lives, since the steps below change directory
HFD="$(pwd)/hfd.sh"

# download models
mkdir -p /root/autodl-tmp/models
cd /root/autodl-tmp/models
"$HFD" Qwen/Qwen1.5-1.8B-Chat

# download datasets
mkdir -p /root/autodl-tmp/datasets
cd /root/autodl-tmp/datasets
"$HFD" LooksJuicy/ruozhiba --dataset
```

## 3. Format the dataset for LLaMA-Factory

```python
import json

def convert_json_format(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    for item in data:
        item['instruction'] = item.pop('query')
        item['input'] = ''
        item['output'] = item.pop('response')

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

input_file = '/root/autodl-tmp/datasets/ruozhiba_qaswift.json'
output_file = '/root/autodl-tmp/datasets/ruozhiba.json'

convert_json_format(input_file, output_file)
```
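After conversion, LLaMA-Factory still has to be told about the new file. It looks up custom datasets in `data/dataset_info.json` inside its checkout. The sketch below adds one entry to that registry. The helper name `register_dataset` is hypothetical, and it assumes the converted file has been placed in LLaMA-Factory's `data/` directory and that the default Alpaca-style column mapping (`instruction`, `input`, `output`) applies.

```python
import json
from pathlib import Path


def register_dataset(dataset_info_path, name, file_name):
    """Add or update one entry in a LLaMA-Factory style dataset_info.json."""
    path = Path(dataset_info_path)
    # Start from the existing registry if present, otherwise from an empty one.
    info = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # Assumption: Alpaca-style records need only the file name here,
    # because the default column names already match.
    info[name] = {"file_name": file_name}
    path.write_text(json.dumps(info, ensure_ascii=False, indent=2), encoding="utf-8")
    return info


# Example call (paths are illustrative):
# register_dataset("LLaMA-Factory/data/dataset_info.json", "ruozhiba", "ruozhiba.json")
```

The registered name (`ruozhiba` in the example call) is then what gets selected as the dataset when configuring training.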

## 4. Choose fine-tuning tool: `LLaMA-Factory`

- Install LLaMA-Factory
```bash
# If `source /etc/network_turbo` (the autodl.com proxy) was enabled, reset it
unset http_proxy && unset https_proxy
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory

# Create the environment at an explicit path to control where packages are stored on the server.
# `-n` and `-p` are mutually exclusive, so only `-p` is used and the env is activated by path.
conda create -p /root/autodl-tmp/penv/llama-factory python=3.12
conda activate /root/autodl-tmp/penv/llama-factory
pip install -e ".[torch,metrics]"
```
- Run LLaMA-Factory webUI
```bash
nohup llamafactory-cli webui &
```

## 5. Configure and start fine-tuning process: `QLoRA`
- Pin the `bitsandbytes` version
```bash
pip install bitsandbytes==0.44.1
```
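To see why 4-bit quantization matters on a 24 GB card, here is a rough estimate of the memory taken by the base model weights alone at different precisions. This is a back-of-the-envelope sketch: it assumes a nominal 1.8 billion parameters and ignores activations, LoRA adapter weights, optimizer state, and quantization metadata, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to hold the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


params = 1.8e9  # nominal parameter count, an approximation
for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(params, bits):.2f} GiB")
```

The 4-bit figure is a quarter of the fp16 one, which frees memory for longer sequences or larger batches during training.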

## 6. Evaluate the fine-tuned model

- Chat using the Hugging Face inference backend
- Chat using the vLLM inference backend
```bash
pip install -e ".[vllm]"
```

## 7. Export the fine-tuned model

## 8. Convert model format from HF to GGUF: [`llama.cpp`](https://github.com/ggerganov/llama.cpp)

- Download llama.cpp
```bash
git clone https://github.com/ggerganov/llama.cpp.git
```

- Install requirements
```bash
source /etc/network_turbo

# `-n` and `-p` are mutually exclusive, so only `-p` is used and the env is activated by path
conda create -p /root/autodl-tmp/penv/llama-cpp python=3.12
conda activate /root/autodl-tmp/penv/llama-cpp

pip install -r llama.cpp/requirements.txt
# In case these are missing from requirements.txt
pip install sentencepiece
pip install safetensors
pip install transformers
```

- Convert model format
```bash
MODEL=./path/to/the/model/model_name
GGUF=./path/to/the/gguf/file/model_name.gguf
# Quote the variables so paths containing spaces survive
python llama.cpp/convert_hf_to_gguf.py "$MODEL" --outtype f16 --verbose --outfile "$GGUF"
```
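A quick sanity check on the converted file: a GGUF file starts with the 4-byte magic `GGUF`, followed by a little-endian 32-bit version number. The sketch below reads just that header. The function name `read_gguf_version` is hypothetical, and the check does not validate the rest of the file.

```python
import struct


def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if the magic is wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # The 4-byte magic is followed by a little-endian uint32 version number.
    (version,) = struct.unpack("<I", header[4:8])
    return version


# Example call (path is illustrative):
# print(read_gguf_version("./path/to/the/gguf/file/model_name.gguf"))
```

A `ValueError` here usually means the conversion did not finish or the path points at the wrong file.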

## 9. Deploy the fine-tuned model: [`Ollama`](https://ollama.com/)

- Install Ollama
```bash
curl -fsSL https://ollama.com/install.sh | sh

ollama serve
```

- Create an Ollama `ModelFile` with the following content
```text
FROM /path/to/the/gguf/file/model-name.gguf
```

- Create Ollama Model
```bash
ollama create model-name --file ./ModelFile
```

- Start Ollama Model
```bash
ollama run model-name
```
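Besides the interactive `ollama run`, the served model can be queried over Ollama's local REST API. A minimal sketch follows, assuming the server started by `ollama serve` listens on the default `127.0.0.1:11434` and the model was created as `model-name`. The helper names `build_generate_request` and `generate` are hypothetical.

```python
import json
import urllib.request


def build_generate_request(model, prompt, host="http://127.0.0.1:11434"):
    """Prepare a non-streaming POST to Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, host="http://127.0.0.1:11434"):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(model, prompt, host)) as resp:
        return json.loads(resp.read())["response"]


# Example call, once the server is up:
# print(generate("model-name", "Hello"))
```

Keeping request construction separate from sending makes the payload easy to inspect without a live server.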

## 10. Deploy [`OpenWebUI`](https://github.com/open-webui/open-webui)

- Create Conda environment
```bash
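# Create the environment at an explicit path to control where packages are stored on the server.
# `-n` and `-p` are mutually exclusive, so only `-p` is used and the env is activated by path.
conda create -p /root/autodl-tmp/penv/open-webui python=3.11
conda init
conda activate /root/autodl-tmp/penv/open-webui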

pip install -U open-webui
```

- Start `OpenWebUI` as a service
```bash
# Mirror used when OpenWebUI downloads Hugging Face assets during startup
export HF_ENDPOINT=https://hf-mirror.com
# Endpoint OpenWebUI polls to list models (here, Ollama's OpenAI-compatible API)
export OPENAI_API_BASE_URL=http://127.0.0.1:11434/v1
# Already the default value; set here only for visibility
export ENABLE_OLLAMA_API=True
# Optional; can be configured in the WebUI later
export DEFAULT_MODELS="/path/to/the/model"

open-webui serve
```

- Docker container with both `Ollama` and `OpenWebUI`
```bash
docker run -d -p 3000:8080 \
-v ollama:/root/.ollama \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:ollama
```