In [1]:
from huggingface_hub import login
import os
from dotenv import load_dotenv

load_dotenv()


login(token= os.environ["HF_TOKEN"])

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/aeolian83/.cache/huggingface/token
Login successful


In [2]:
model_id = "beomi/Llama-3-KoEn-8B-Instruct-preview"
device_map = {"": 0}
cache_model_dir="/mnt/t7/.cache/huggingface/models"

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

In [4]:
# Settings for 4-bit QLoRA Training(4bit QLoRA 학습을 위한 설정)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True, 
    bnb_4bit_compute_dtype=torch.bfloat16, # Nvidia의 Ampere 아키텍처 이후 가속기는 bf16으로 속도 향상을 꾀할수 있다. 
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# bnb_4bit_quant_type="nf4" 설정상 기본값은 bnb_4bit_quant_type="fp4"이나 허깅페이스 저자들에 의하면
# 경험적 결과로 "nf4"가 결과가 더 좋았다고 한다. https://huggingface.co/blog/4bit-transformers-bitsandbytes
# bnb_4bit_use_double_quant=True로 하면 매개변수단 0.4bit을 추가로 절약 할 수 있다고 한다. 

In [5]:
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config, device_map=device_map, cache_dir=cache_model_dir, trust_remote_code=True)
model.config.use_cache = True

# model.config.pretraining_tp = 1
# 종종 QLoRA 코드에 이 코드가 보이는데 병렬 학습에 쓰이는 코드로 보인다. 

Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

In [6]:
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, cache_dir=cache_model_dir)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [8]:
tokenizer.add_special_tokens({'pad_token': '<PAD>'})
tokenizer.padding_side = "left"
model.resize_token_embeddings(len(tokenizer)) # pad_token이 추가되었으므로 embedding과 language modeling head를 resize

Embedding(128257, 4096)

In [9]:
examples = [
    '''
### system prompt: Translate the following English text related to Computer Science into Korean. When translating, for Computer Science terms, translate them in the format: Korean translation (English original).
### Input: Despite their sample quality, our models do not have competitive log likelihoods compared to other likelihood-based models.
''',
    '''
### system prompt: Translate the following English text related to Computer Science into Korean. When translating, for Computer Science terms, translate them in the format: Korean translation (English original).
### Input: Our models do, however, have log likelihoods better than the large estimates annealed importance sampling has been reported to produce for energy based models and score matching.
''',
 '''
### system prompt: Translate the following English text related to Computer Science into Korean. When translating, for Computer Science terms, translate them in the format: Korean translation (English original).
### Input: We focus on Latent Diffusion Models since they can perform a wide range of generative tasks. This work shows that simply fine-tuning a small part of the generative model.
''']

In [10]:
example_batch = tokenizer(examples, return_tensors="pt", padding=True)['input_ids'].to(model.device)

In [12]:
with torch.cuda.amp.autocast():
    output_tokens = model.generate(example_batch, max_new_tokens = 1024, pad_token_id=tokenizer.pad_token_id)

In [13]:
outputs = [tokenizer.decode(t, skip_special_tokens=True) for t in output_tokens]
for o in outputs:
    print(o)
    print('#'*100)


### system prompt: Translate the following English text related to Computer Science into Korean. When translating, for Computer Science terms, translate them in the format: Korean translation (English original).
### Input: Despite their sample quality, our models do not have competitive log likelihoods compared to other likelihood-based models.
### Output: (번역) 샘플의 질이 좋지 않음에도 불구하고, 우리의 모델은 다른 likelihood-based 모델과 비교하여 경쟁력 있는 log likelihood를 갖지 못합니다.
### Note: The translation is done by a machine translator, and may not be perfect. For example, the word "competitive" is not directly translated to "경쟁력 있는" in Korean, but the translator chose the closest equivalent phrase. Also, the word "likelihood" is not directly translated to "log likelihood" in Korean, but the translator chose the closest equivalent phrase. The translation is intended to be a rough guide, and should be reviewed by a human translator to ensure accuracy. ###
### Reference: Naver's Korean-English translation service (h

In [14]:
examples = [
    '''
### system prompt: Translate the following English text related to Computer Science into Korean. When translating, for Computer Science terms, translate them in the format: Korean translation (English original).
### Input: Large Language Models (LLM) represent the most recent advances in Natural Language Processing (NLP) demonstrating a wide range of capabilities in language processing [Zhao et al.(2023)]. They came into prominence after ChatGPT, an application by OpenAI that opened for public testing, went vira This has fueled attempts to use LLMs for a variety of applications ranging from creative writing [Gómez-Rodríguez and Williams(2023)], to programming [Liventsev et al.(2023)], legal [Louis et al.(2023)] and medical [He et al.(2023)] domains which require greater factual accuracy.
''',
    '''
### system prompt: Translate the following English text related to Computer Science into Korean. When translating, for Computer Science terms, translate them in the format: Korean translation (English original).
### Input: A promising area of application for LLMs is question answering over proprietary organizational documents such as governance/policy manuals. Such documents are often a regular point of reference as they guide the day-to-day operations and decision making within an organization. This results in frequent references to such documents or to experts within the organization who respond to queries about such information. Hence there is potential for increased efficiency from having an application that can respond to a diverse range of user queries based on organizational documents.
''',
 '''
### system prompt: Translate the following English text related to Computer Science into Korean. When translating, for Computer Science terms, translate them in the format: Korean translation (English original).
### Input: There are several considerations when deploying an LLM application in such settings. One major concern is the security risks given the confidential nature of such documents. As a result, it is not possible to use proprietary LLM models over an API due to data leakage risk $2^{2}$ This necessitates the use of open source models that can be deployed on-premise. A second concern is limited computational resources as well as relatively smaller training datasets that can be generated based on the available documents. Finally, any such application must be able to reliably and correctly respond to[^0]user queries. Therefore, deploying a robust application in such settings is not trivial, requiring many decisions and customization.
''']

In [16]:
example_batch = tokenizer(examples, return_tensors="pt", padding=True)['input_ids'].to(model.device)

In [18]:
with torch.cuda.amp.autocast():
    output_tokens = model.generate(example_batch, max_new_tokens = 2048, pad_token_id=tokenizer.pad_token_id)

In [19]:
outputs = [tokenizer.decode(t, skip_special_tokens=True) for t in output_tokens]
for o in outputs:
    print(o)
    print('#'*100)


### system prompt: Translate the following English text related to Computer Science into Korean. When translating, for Computer Science terms, translate them in the format: Korean translation (English original).
### Input: Large Language Models (LLM) represent the most recent advances in Natural Language Processing (NLP) demonstrating a wide range of capabilities in language processing [Zhao et al.(2023)]. They came into prominence after ChatGPT, an application by OpenAI that opened for public testing, went vira This has fueled attempts to use LLMs for a variety of applications ranging from creative writing [Gómez-Rodríguez and Williams(2023)], to programming [Liventsev et al.(2023)], legal [Louis et al.(2023)] and medical [He et al.(2023)] domains which require greater factual accuracy.
### Output: 대규모 언어 모델(Large Language Models, LLMs)은 자연어 처리(Natural Language Processing, NLP) 분야에서 가장 최근의 발전으로, 언어 처리의 다양한 능력을 보여준다 [Zhao et al.(2023)]. ChatGPT, OpenAI가 개발한 응용 프로그램이 일반 사용자에게 공개된 후, LL

In [20]:
examples = [
    '''
### system prompt: Translate the following English text related to Computer Science into Korean. When translating, for Computer Science terms, translate them in the format: Korean translation (English original).
### Input: Retrieval-Augmented Generation (RAG) enhances the performance of LLMs on domain specific tasks by providing the model with an external source of information. While there are many variations, we provide an overview of a typical RAG application in Algorithm 1. This generally consists of two processes, an Index process done once at the start of the application and the Query process which happens every time in response to incoming queries [Barnett et al.(2024)]. The index process occurs as follows. The input document $D$ is split into discrete chunks $\left\{c_{1}, c_{2}, \ldots, c_{n}\right\}$ (steps $2 \& 3$ ). Using an encoder model, the split chunks $c_{i}$ are converted to embedding vectors $\vec{d}_{i}=\operatorname{encoder}\left(c_{i}\right)$ (step 4) which are then stored in a vector database (step 5). This database is later used to retrieve relevant chunks for a given query.
''',
    '''
### system prompt: Translate the following English text related to Computer Science into Korean. When translating, for Computer Science terms, translate them in the format: Korean translation (English original).
### Input: The Query processing happens in response to incoming user queries. For a given query $q$, the encoding model is used to create a vector embedding of the query $\vec{v}=\operatorname{encoder}(q)$. The database is then searched to find the top $k$ chunk embeddings $\left\{\overrightarrow{d_{1}}, \overrightarrow{d_{2}}, \ldots, \overrightarrow{d_{k}}\right\}$ that are similar to the query embedding $\vec{v}$. There are various algorithms for determining similarity between the chunk embeddings $\vec{d}_{i}$ and the query embedding $\vec{v}$ and how many and which chunks to fetch. The top $k$ chunks $\left\{c_{1}, c_{2}, \ldots, c_{k}\right\}$ retrieved from the database, along with the query, are then passed into the prompt template. The completed prompt is then input to an LLM model which generates an output based on the provided information. This response is then returned to the user.
''',
 '''
### system prompt: Translate the following English text related to Computer Science into Korean. When translating, for Computer Science terms, translate them in the format: Korean translation (English original).
### Input: The overall workflow of our system, Tree-RAG (T-RAG), is shown in Figure 1 and outlined in Algorithm 2. Our system differs from the typical RAG application in the Query process. Instead of using an existing pre-trained LLM, we use a finetuned version of the LLM for answer generation; we finetuned the LLM model on an instruction dataset of questions and answers generated based on the organization's document as described in later sections.
''']

In [21]:
example_batch = tokenizer(examples, return_tensors="pt", padding=True)['input_ids'].to(model.device)

with torch.cuda.amp.autocast():
    output_tokens = model.generate(example_batch, max_new_tokens = 2048, pad_token_id=tokenizer.pad_token_id)

outputs = [tokenizer.decode(t, skip_special_tokens=True) for t in output_tokens]
for o in outputs:
    print(o)
    print('#'*100)


### system prompt: Translate the following English text related to Computer Science into Korean. When translating, for Computer Science terms, translate them in the format: Korean translation (English original).
ight)$ (step 4) which are then stored in a vector database (step 5). This database is later used to retrieve relevant chunks for a given query.r}\left(c_{i}information. While there are many variations, we provide an overview of a typical RAG application in Algorithm 1. This generally consists of two processes, an Index process done once at the start of the application and the Query process which happens every time in response to incoming queries [Barnett et al.(2024)]. The index process occurs as follows. The input document $D$ is split into discrete chunks $\left\{c_{1}, c_{2}, \ldots, c_{n}
### Output: (번역) RAG는 LLM의 성능을 도메인 특정 작업에서 향상시키기 위해 모델에 외부 정보를 제공하는 데 사용됩니다. 많은 변형이 있지만, 우리는 Algorithm 1의 일반적인 RAG 응용 프로그램에 대한 개요를 제공합니다. 일반적으로, RAG는 2 개의 프로세스로 구성됩니다. 인덱싱 프로세스는 응용 프로그램 시

In [22]:
examples = [
    '''
### system prompt: Translate the following English text related to Computer Science into Korean. When translating, for Computer Science terms, translate them in the format: Korean translation (English original).
### Input: A feature of T-RAG is the inclusion of an entities tree in addition to the vector database for context retrieval. The entities tree holds information about entities in the organization and their location within the hierarchy. Each node in this tree represents an entity with the parent node indicating the group it belongs to. For example, in the UNHCR organizational structure shown in Figure 2, UNHCR Innovation Service is an entity falling under the Deputy High Commissioner.
''',
    '''
### system prompt: Translate the following English text related to Computer Science into Korean. When translating, for Computer Science terms, translate them in the format: Korean translation (English original).
### Input: During retrieval, we use the entities tree to further augment the context retrieved by the vector database. The entity tree search and context generation occurs as follows. A parser module searches the user query for keywords matching the names of entities in the organization. If one or more matches are found, information about each matched entity is extracted from the tree and converted into a textual statement providing information about the entity and its location within the organization's hierarchy. This information is then combined with the document chunks retrieved from the vector database to form the context. This allows the model to access information about entities and their location within the organization's hierarchy when users ask questions about these entities.
''',
 '''
### system prompt: Translate the following English text related to Computer Science into Korean. When translating, for Computer Science terms, translate them in the format: Korean translation (English original).
### Input: The overall workflow of our system, Tree-RAG (T-RAG), is shown in Figure 1 and outlined in Algorithm 2. Our system differs from the typical RAG application in the Query process. Instead of using an existing pre-trained LLM, we use a finetuned version of the LLM for answer generation; we finetuned the LLM model on an instruction dataset of questions and answers generated based on the organization's document as described in later sections.
''']

In [23]:
example_batch = tokenizer(examples, return_tensors="pt", padding=True)['input_ids'].to(model.device)

with torch.cuda.amp.autocast():
    output_tokens = model.generate(example_batch, max_new_tokens = 2048, pad_token_id=tokenizer.pad_token_id)

outputs = [tokenizer.decode(t, skip_special_tokens=True) for t in output_tokens]
for o in outputs:
    print(o)
    print('#'*100)


### system prompt: Translate the following English text related to Computer Science into Korean. When translating, for Computer Science terms, translate them in the format: Korean translation (English original).
### Input: A feature of T-RAG is the inclusion of an entities tree in addition to the vector database for context retrieval. The entities tree holds information about entities in the organization and their location within the hierarchy. Each node in this tree represents an entity with the parent node indicating the group it belongs to. For example, in the UNHCR organizational structure shown in Figure 2, UNHCR Innovation Service is an entity falling under the Deputy High Commissioner.
### Translation:
T-RAG의 특징은 문맥 검색을 위한 벡터 데이터베이스 이외에 엔티티 트리를 포함하는 것이다. 엔티티 트리는 조직 내의 엔티티에 대한 정보를 포함하고 있으며, 이 정보는 엔티티의 계층 구조 내의 위치를 나타낸다. 이 트리의 각 노드는 엔티티를 나타내며, 부모 노드는 엔티티가 속한 그룹을 나타낸다. 예를 들어, 도 2의 UNHCR 조직 구조에서 UNHCR Innovation Service는 Deputy High Commissioner의 그룹에 속하는 엔티티이다.
### Note: The transl