# Getting Started with EXAONE 3.0
- Created: 2024-12-02 (Mon)
- Updated: 2024-12-02 (Mon)

## Purpose
This guide is a practical walkthrough for running the EXAONE 3.0 language model. The model is available from the Hugging Face repository (https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct), but running it requires some initial setup. This notebook walks through the steps needed to execute the following code snippet successfully.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct")

# Choose your prompt (the second assignment overrides the first; keep only one)
prompt = "Explain who you are"  # English example
prompt = "너의 소원을 말해봐"   # Korean example ("Tell me your wish")

messages = [
    {"role": "system", 
     "content": "You are EXAONE model from LG AI Research, a helpful assistant."},
    {"role": "user", "content": prompt}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
)

output = model.generate(
    input_ids.to("cuda"),
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=128
)
print(tokenizer.decode(output[0]))
```

## Requirements
Make sure your instance meets both the CPU RAM and GPU RAM requirements.
- An instance with the Vertex AI Workbench default settings, or an `n1-standard-1` instance, will fail.

<img src="images/vertex-ai-workbench-instance-details-hardware-default-options-without-nvidia-t4-gpu.png" width="60%">

- Adding an NVIDIA T4 GPU is not enough either; it will also fail.

<img src="images/vertex-ai-workbench-instance-t4-is-default.png">

That is, the default hardware options shown below are insufficient:

<img src="images/vertex-ai-workbench-instance-details-hardware-default-options.png" width="60%">

An instance with an NVIDIA L4 GPU or better is required to run EXAONE successfully.

| Machine type  | CPU     | CPU RAM | Enough? | GPU    | GPU RAM | Enough? | Memo                                 |
|---------------|---------|---------|---------|--------|---------|---------|--------------------------------------|
| n1-standard-1 | 1 vCPU  | 3.75 GB | N       | T4 x 1 | 16 GB   | N/A     | RuntimeError                         |
| n1-standard-2 | 2 vCPUs | 7.5 GB  | Y       | T4 x 1 | 16 GB   | N       | OutOfMemoryError: CUDA out of memory |
| g2-standard-4 | 4 vCPUs | 16 GB   | Y       | L4 x 1 | 24 GB   | Y       | (No error)                           |

CPU RAM size:
- 3.75 GB is insufficient.
- 7.5 GB is good enough.

GPU RAM size:
- 16 GB on T4 is insufficient.
- 24 GB on L4 is good enough.

Note that, compared with the [NVIDIA T4 GPU](https://www.nvidia.com/en-us/data-center/tesla-t4/), the [NVIDIA L4 GPU](https://www.nvidia.com/en-us/data-center/l4/) has
- 1.5x the GPU RAM (24 GB vs. 16 GB) and
- 2.5x more Gen AI performance.
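As a rough sanity check on the GPU RAM figures above, the memory needed for the weights alone can be estimated from the parameter count and the bytes per parameter. The sketch below assumes a nominal 7.8 billion parameters (taken from the model name, not an exact count) and 2 bytes per parameter for `torch.bfloat16`; the helper name `weights_gib` is hypothetical. It ignores the CUDA context, activations, and the KV cache, so actual usage will be higher.

```python
GIB = 1024 ** 3  # bytes per GiB


def weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate the memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


# Assumption: ~7.8e9 parameters (nominal, from the model name), bfloat16 = 2 bytes.
estimate = weights_gib(7.8e9, bytes_per_param=2)
print(f"Estimated weights: {estimate:.2f} GiB")

# Usable capacities reported later in this guide:
#   T4: 14.57 GiB (from the CUDA out-of-memory message)
#   L4: 23034 MiB (from nvidia-smi)
for name, capacity_gib in [("T4", 14.57), ("L4", 23034 / 1024)]:
    headroom = capacity_gib - estimate
    print(f"{name}: {headroom:.2f} GiB left after loading weights")
```

Under these assumptions the weights come to roughly 14.5 GiB, which leaves almost no headroom on a T4 but several GiB on an L4, consistent with the results in the table.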

## Configure

In [1]:
# ModuleNotFoundError: No module named 'torch'
!pip install --upgrade --quiet torch torchvision torchaudio




In [17]:
# ModuleNotFoundError: No module named 'transformers'
!pip install --quiet transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
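The warning above is harmless here, and it names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming the variable is set early in the notebook, before `tokenizers` is used and before any `!` shell command forks the process:

```python
import os

# Set explicitly so huggingface/tokenizers does not warn after a fork.
# "false" disables tokenizer parallelism; use "true" to keep it enabled.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```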


In [3]:
# OSError: You are trying to access a gated repo. Make sure to have access to it at https:// ...
!pip install --quiet huggingface_hub



## Log into Hugging Face Hub
This guide assumes that a Hugging Face access token has been created and saved in `token-to-access-exaone-since-2024-q4.txt`.

Open a new terminal (File > New > Terminal) and run:
```bash
$ huggingface-cli login
# Paste your access token when prompted.
```
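Alternatively, the login can be done from a notebook cell. The sketch below assumes the token file named above sits in the current working directory; `read_token` and `login_from_file` are hypothetical helper names, while `huggingface_hub.login` is the library's programmatic login function.

```python
from pathlib import Path


def read_token(path: str) -> str:
    """Read an access token from a text file, dropping surrounding whitespace."""
    token = Path(path).read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"No token found in {path}")
    return token


def login_from_file(path: str) -> None:
    """Log in to the Hugging Face Hub with the token stored in `path`."""
    # Imported lazily so that read_token() works without huggingface_hub.
    from huggingface_hub import login

    login(token=read_token(path))


# Usage (assumed file location):
# login_from_file("token-to-access-exaone-since-2024-q4.txt")
```

Keeping the token in a file outside version control avoids pasting it into a cell whose contents may be shared.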


In [4]:
# ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install 'accelerate>=0.26.0'`
# Quote the requirement so that the shell does not treat `>` as a redirection.
!pip install 'accelerate>=0.26.0'

In [5]:
# Double-check
import accelerate
print(accelerate.__version__)

1.1.1
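To check several packages at once without importing them, the standard library's `importlib.metadata` can be used. This is a sketch, not part of the original notebook: `installed_version`, `numeric_parts`, and `meets_minimum` are hypothetical helpers, and the version comparison is deliberately simple (leading numeric components only; pre-release tags are ignored).

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def numeric_parts(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric components, e.g. "1.1.1" -> (1, 1, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(p) for p in match.group().split(".")) if match else ()


def meets_minimum(version: Optional[str], minimum: str) -> bool:
    """True if `version` is installed and at least `minimum`."""
    if version is None:
        return False
    have, need = numeric_parts(version), numeric_parts(minimum)
    width = max(len(have), len(need))
    # Pad with zeros so that "0.26" and "0.26.0" compare as equal.
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


for name in ["torch", "transformers", "huggingface_hub", "accelerate"]:
    print(name, installed_version(name))

print("accelerate OK:", meets_minimum(installed_version("accelerate"), "0.26.0"))
```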


In [6]:
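# Restart the runtime; otherwise, the ImportError won't go away.
# Note: this cell restarts the kernel only when running on Google Colab.
#   In other environments, restart the kernel manually from the Kernel menu.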
import sys

if "google.colab" in sys.modules:
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

Loading the EXAONE-3.0-7.8B-Instruct model requires a significant amount of CPU RAM.
If RAM is insufficient, loading fails with a `RuntimeError` such as:
```bash
# RuntimeError: unable to mmap 4932636680 bytes from file </home/jupyter/.cache/huggingface/hub/models--LGAI-EXAONE--EXAONE-3.0-7.8B-Instruct/snapshots/7f15baedd46858153d817445aff032f4d6cf4939/model-00001-of-00007.safetensors>: Cannot allocate memory (12)
```

In [7]:
!free -h

               total        used        free      shared  buff/cache   available
Mem:            15Gi       1.0Gi        13Gi       0.0Ki       1.2Gi        14Gi
Swap:             0B          0B          0B
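The same check can be done in Python before attempting to load the model. A minimal sketch, assuming a Linux instance where `/proc/meminfo` is available; `parse_meminfo_kib` and `available_gib` are hypothetical helper names.

```python
def parse_meminfo_kib(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: leading integer value}."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def available_gib(path: str = "/proc/meminfo") -> float:
    """Return the kernel's estimate of available memory, in GiB (Linux only)."""
    with open(path, encoding="utf-8") as f:
        info = parse_meminfo_kib(f.read())
    # MemAvailable is reported in KiB; 1 GiB = 1024 * 1024 KiB.
    return info["MemAvailable"] / (1024 ** 2)


# Usage on the instance (Linux):
# print(f"Available RAM: {available_gib():.1f} GiB")
```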







In [8]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

In [9]:
import datetime
print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

2024-12-02 05:00:34


In [10]:
model = AutoModelForCausalLM.from_pretrained(
    "LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)
print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

2024-12-02 05:03:02


In [11]:
tokenizer = AutoTokenizer.from_pretrained("LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct")
print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

2024-12-02 05:03:02
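The timestamps printed above bracket the model load. To measure a step directly, a small timer can wrap the call instead. This is a sketch using only the standard library; the `timed` context manager is a hypothetical helper, and `time.sleep` stands in for the real call to `from_pretrained`.

```python
import time
from contextlib import contextmanager
from datetime import datetime


@contextmanager
def timed(label: str):
    """Print how long the enclosed block took, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.1f} s")


with timed("stand-in for model loading"):
    time.sleep(0.1)  # replace with AutoModelForCausalLM.from_pretrained(...)

# Elapsed time between two of the timestamps logged above.
fmt = "%Y-%m-%d %H:%M:%S"
elapsed = datetime.strptime("2024-12-02 05:03:02", fmt) - datetime.strptime(
    "2024-12-02 05:00:34", fmt
)
print(elapsed.total_seconds())  # 148.0
```

The logged timestamps are 148 seconds apart. That is an upper bound on the load time, since the interval also includes any idle time between cells.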


## Find a sufficiently large instance type

### OutOfMemoryError: CUDA out of memory
```bash
OutOfMemoryError: CUDA out of memory. Tried to allocate 800.00 MiB. GPU 0 has a total capacity of 14.57 GiB of which 708.75 MiB is free. Including non-PyTorch memory, this process has 13.87 GiB memory in use. Of the allocated memory 12.87 GiB is allocated by PyTorch, and 898.19 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```

### Use NVIDIA L4 GPU instead of T4
| Machine type  | CPU     | CPU RAM | Enough? | GPU    | GPU RAM | Enough? | Memo                                 |
|---------------|---------|---------|---------|--------|---------|---------|--------------------------------------|
| n1-standard-1 | 1 vCPU  | 3.75 GB | N       | T4 x 1 | 16 GB   | N/A     | RuntimeError                         |
| n1-standard-2 | 2 vCPUs | 7.5 GB  | Y       | T4 x 1 | 16 GB   | N       | OutOfMemoryError: CUDA out of memory |
| g2-standard-4 | 4 vCPUs | 16 GB   | Y       | L4 x 1 | 24 GB   | Y       | (No error)                           |

In [29]:
# See the total GPU memory
!nvidia-smi --query-gpu=memory.total --format=csv,noheader

23034 MiB




In [30]:
# See "used memory, free memory"
!nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader,nounits

15479, 7001
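To use these numbers in code, for example to check headroom before generating, the CSV output can be parsed. A sketch under stated assumptions: `parse_used_free` and `query_gpu_memory` are hypothetical helpers, the query flags are the ones used in the cell above, and only the first GPU is read.

```python
import subprocess
from typing import Tuple


def parse_used_free(csv_line: str) -> Tuple[int, int]:
    """Parse a 'used, free' line (MiB, no units) from nvidia-smi CSV output."""
    used, free = (int(field.strip()) for field in csv_line.split(","))
    return used, free


def query_gpu_memory() -> Tuple[int, int]:
    """Run nvidia-smi and return (used MiB, free MiB) for the first GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_used_free(out.splitlines()[0])


used, free = parse_used_free("15479, 7001")  # the output shown above
print(f"used {used / 1024:.1f} GiB, free {free / 1024:.1f} GiB")
```

On an instance with a GPU, `query_gpu_memory()` returns the live values instead of the sample line.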




In [31]:
# See more details
!nvidia-smi

Mon Dec  2 06:34:57 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA L4                      On  |   00000000:00:03.0 Off |                    0 |
| N/A   75C    P0             35W /   72W |   15479MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                



## Test EXAONE with the sample prompt in English

In [20]:
# Choose your prompt
prompt = "Explain who you are"  # English example

messages = [
    {"role": "system", 
     "content": "You are EXAONE model from LG AI Research, a helpful assistant."},
    {"role": "user", "content": prompt}
]

In [21]:
print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
)

2024-12-02 05:17:55


In [22]:
print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
output = model.generate(
    input_ids.to("cuda"),
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=128
)
print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

print(tokenizer.decode(output[0]))

2024-12-02 05:17:55
2024-12-02 05:18:01
[|system|]You are EXAONE model from LG AI Research, a helpful assistant.[|endofturn|]
[|user|]Explain who you are
[|assistant|]Hello! I'm EXAONE 3.0, an advanced language model developed by LG AI Research. My primary function is to assist users by providing information, answering questions, and helping with various tasks using natural language. I'm designed to understand and generate human-like text based on the data I've been trained on. My goal is to be a helpful and informative assistant for your needs. How can I assist you today?[|endofturn|]
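The decoded output includes the full prompt and the chat-template markers. If only the reply is needed, it can be cut out of the decoded text. A minimal sketch, assuming the `[|assistant|]` and `[|endofturn|]` markers seen in the output above; `extract_reply` is a hypothetical helper, and the sample transcript is abbreviated.

```python
def extract_reply(decoded: str,
                  start_marker: str = "[|assistant|]",
                  end_marker: str = "[|endofturn|]") -> str:
    """Return the text of the last assistant turn in a decoded transcript."""
    # Take everything after the last assistant marker...
    _, found, tail = decoded.rpartition(start_marker)
    if not found:
        raise ValueError("No assistant turn found")
    # ...and drop the end-of-turn marker and anything after it.
    reply, _, _ = tail.partition(end_marker)
    return reply.strip()


# Abbreviated version of the transcript printed above.
sample = (
    "[|system|]You are EXAONE model from LG AI Research, a helpful assistant.[|endofturn|]\n"
    "[|user|]Explain who you are\n"
    "[|assistant|]Hello! I'm EXAONE 3.0.[|endofturn|]"
)
print(extract_reply(sample))  # Hello! I'm EXAONE 3.0.
```

An alternative is to decode only the newly generated tokens, for example `tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)`. Whether the turn markers are registered as special tokens for this tokenizer is not verified here.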


## Test EXAONE with the sample prompt in Korean

In [23]:
# Choose your prompt
prompt = "너의 소원을 말해봐"       # Korean example ("Tell me your wish")

messages = [
    {"role": "system", 
     "content": "You are EXAONE model from LG AI Research, a helpful assistant."},
    {"role": "user", "content": prompt}
]

In [24]:
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
)

output = model.generate(
    input_ids.to("cuda"),
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=128
)

print(tokenizer.decode(output[0]))

[|system|]You are EXAONE model from LG AI Research, a helpful assistant.[|endofturn|]
[|user|]너의 소원을 말해봐
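[|assistant|]EXAONE 3.0 모델로서 저의 주된 목적은 사용자에게 정확하고 유용한 정보를 제공하는 것입니다. 저는 다양한 질문에 답변하고, 문제를 해결하며, 학습과 연구를 돕는 역할을 합니다. 또한, 사용자의 프라이버시와 데이터 보안을 최우선으로 생각합니다. 이를 통해 사람들의 삶의 질을 향상시키고, 더 나은 세상을 만드는 데 기여하고자 합니다.[|endofturn|]

English translation of the reply: "As the EXAONE 3.0 model, my main purpose is to provide users with accurate and useful information. I answer a wide range of questions, solve problems, and help with learning and research. I also treat users' privacy and data security as the top priority. Through this, I hope to improve people's quality of life and contribute to building a better world."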
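The English and Korean tests repeat the same steps, so they can be folded into one function. A sketch, assuming `model` and `tokenizer` are the objects loaded earlier; `build_messages` and `chat` are hypothetical helper names, and `model.device` is used in place of the hard-coded `"cuda"`.

```python
from typing import Dict, List

SYSTEM_PROMPT = "You are EXAONE model from LG AI Research, a helpful assistant."


def build_messages(prompt: str, system: str = SYSTEM_PROMPT) -> List[Dict[str, str]]:
    """Build the two-message chat used throughout this guide."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ]


def chat(model, tokenizer, prompt: str, max_new_tokens: int = 128) -> str:
    """Apply the chat template, generate, and return the decoded transcript."""
    input_ids = tokenizer.apply_chat_template(
        build_messages(prompt),
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    )
    output = model.generate(
        input_ids.to(model.device),
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(output[0])


# Usage, with the model and tokenizer already loaded:
# print(chat(model, tokenizer, "Explain who you are"))
# print(chat(model, tokenizer, "너의 소원을 말해봐"))
```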


## Congratulations!
You have successfully run EXAONE 3.0!