# Korean LLM (Large Language Model) Inference
---

### Model: [KoAlpaca-KoRWKV-6B](https://huggingface.co/beomi/KoAlpaca-KoRWKV-6B)

- GitHub: https://github.com/Beomi/KoAlpaca

In [1]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.append('../utils')
sys.path.append('../templates') 

In [2]:
!pip install -qU boto3 huggingface_hub sagemaker langchain deepspeed 
!pip install -qU bitsandbytes accelerate peft

In [3]:
import os
from pathlib import Path
from huggingface_hub import snapshot_download

# Download the model into a local cache directory on the SageMaker notebook instance
local_model_path = Path("/home/ec2-user/SageMaker/hf_cache")
local_model_path.mkdir(parents=True, exist_ok=True)
model_id = "beomi/KoAlpaca-KoRWKV-6B"

# Restrict the download to JSON/text metadata, tokenizer files, and *.pt / *.safetensors weights
allow_patterns = ["*.json", "*.pt", "*.txt", "*.model", "*.safetensors"]

# Use snapshot_download to fetch the repository files, since the weights are stored with Git LFS
model_download_path = snapshot_download(
    repo_id=model_id,
    cache_dir=local_model_path,
    allow_patterns=allow_patterns,
)

Fetching 25 files:   0%|          | 0/25 [00:00<?, ?it/s]

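
The `allow_patterns` list in the download cell uses shell-style glob patterns. As a rough illustration of which files such a list keeps, the sketch below applies the same patterns with Python's standard `fnmatch` module. The file names are hypothetical examples rather than the actual repository listing, and `keep_file` is an illustrative helper, not part of `huggingface_hub`.

```python
from fnmatch import fnmatch

# Same patterns as in the download cell
allow_patterns = ["*.json", "*.pt", "*.txt", "*.model", "*.safetensors"]


def keep_file(filename, patterns):
    """Return True if the filename matches at least one glob pattern."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


# Hypothetical file names, for illustration only
candidates = [
    "config.json",
    "tokenizer.json",
    "model-00001-of-00015.safetensors",
    "pytorch_model-00001-of-00015.bin",
    "README.md",
]

kept = [name for name in candidates if keep_file(name, allow_patterns)]
print(kept)
```

In this sketch the `.bin` shard and the README are filtered out, which is the practical effect of listing `*.safetensors` but not `*.bin`.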

In [5]:
import torch
import deepspeed
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, GPTNeoXLayer

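# Create the model with fp16 tensors on the GPU
with deepspeed.OnDevice(dtype=torch.float16, device="cuda"):
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        cache_dir=local_model_path,
    )

# DeepSpeed inference settings: single GPU (tp_size=1), fp16, and an injection
# policy naming the attention and MLP output projections of each GPTNeoXLayer
config = {
    "tensor_parallel": {"tp_size": 1},
    "dtype": "fp16",
    "injection_policy": {
        GPTNeoXLayer: ('attention.dense', 'mlp.dense_4h_to_h')
    },
}

# Wrap the model in a DeepSpeed inference engine
model = deepspeed.init_inference(model, config)

# LOCAL_RANK is set by distributed launchers; fall back to GPU 0 in a notebook
local_rank = int(os.getenv('LOCAL_RANK', '0'))
tokenizer = AutoTokenizer.from_pretrained(model_id)
generator = pipeline(
    task="text-generation", model=model, tokenizer=tokenizer, device=local_rank
)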

[2023-07-24 01:18:59,188] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)


Loading checkpoint shards:   0%|          | 0/15 [00:00<?, ?it/s]

[2023-07-24 01:19:23,256] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
[2023-07-24 01:19:23,258] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1


Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
The model 'InferenceEngine' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusF
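
The `tp_size` in the DeepSpeed config is hard-coded to 1, which suits a single-GPU notebook. When a script is started by a distributed launcher, the degree of tensor parallelism is commonly read from the `WORLD_SIZE` environment variable, alongside `LOCAL_RANK`. A minimal sketch of that pattern follows. `build_inference_config` is a hypothetical helper, and the injection policy is passed in by the caller so that the block runs without DeepSpeed installed.

```python
import os


def build_inference_config(env=None, injection_policy=None):
    """Build a DeepSpeed-style inference config dict from launcher variables.

    Hypothetical helper: the keys mirror the config dict used in this notebook.
    """
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))  # number of processes (GPUs)
    local_rank = int(env.get("LOCAL_RANK", "0"))  # GPU index of this process

    config = {
        "tensor_parallel": {"tp_size": world_size},
        "dtype": "fp16",
    }
    if injection_policy is not None:
        config["injection_policy"] = injection_policy
    return config, local_rank


# Notebook case: no launcher variables are set, so one GPU and rank 0
ds_config, rank = build_inference_config(env={})
print(ds_config, rank)
```

The returned dictionary could then be passed to `deepspeed.init_inference(model, ds_config)` in the same way as the hand-written config.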

In [18]:
# config = {
#     "tensor_parallel": {"tp_size": 1},
#     "dtype": "fp16",
#     "injection_policy": {
#         GPTNeoXLayer:('attention.dense', 'mlp.dense_4h_to_h')
#     }
# }

# model = deepspeed.init_inference(model, config)

local_rank = int(os.getenv('LOCAL_RANK', '0'))
tokenizer = AutoTokenizer.from_pretrained(model_id)
generator = pipeline(
    task="text-generation", model=model, tokenizer=tokenizer
)

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [73]:
import json
from inference_lib import Prompter

prompter = Prompter("korwkv")

params = {
    "do_sample": False,  # greedy decoding; temperature and top_p apply only when do_sample=True
    "max_new_tokens": 256,
    "return_full_text": False,
    "temperature": 0.2,
    "top_p": 0.9,
    "repetition_penalty": 1.2,
    "early_stopping": True,  # only affects beam search
    "presence_penalty": None,
    "eos_token_id": 2,
}
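
With `do_sample` set to `False`, generation is greedy, so sampling settings such as `temperature` and `top_p` have no effect, and entries set to `None` carry no information. One possible way to make the effective settings explicit is to filter the dictionary before calling the generator. `clean_generation_params` and the `SAMPLING_ONLY` set are hypothetical names introduced for this sketch, and the set is not meant to be exhaustive.

```python
# Keys that only matter when sampling is enabled (assumption: non-exhaustive)
SAMPLING_ONLY = {"temperature", "top_p", "top_k"}


def clean_generation_params(params):
    """Drop None values and, for greedy decoding, sampling-only keys."""
    cleaned = {k: v for k, v in params.items() if v is not None}
    if not cleaned.get("do_sample", False):
        cleaned = {k: v for k, v in cleaned.items() if k not in SAMPLING_ONLY}
    return cleaned


example = {
    "do_sample": False,
    "max_new_tokens": 256,
    "temperature": 0.2,
    "top_p": 0.9,
    "repetition_penalty": 1.2,
    "presence_penalty": None,
}
effective = clean_generation_params(example)
print(effective)
```

Switching `do_sample` to `True` in the input keeps `temperature` and `top_p` in the result, which makes the dependency between these settings visible.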

In [81]:
instruction = ""
# Prompt: "Tell me about Amazon Web Services (AWS)."
input_text = "아마존 웹서비스(AWS)에 대해 알려줘."
prompt = prompter.generate_prompt(instruction, input_text)
payload = {
    "inputs": prompt,
    "parameters": params
}

text_inputs, params = payload["inputs"], payload["parameters"]
result = generator(text_inputs, **params)
print(result[0]['generated_text'].split('###')[0])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


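Amazon Web Services는 AWS의 서비스 이름입니다. 이 서비스는 클라우드 컴퓨팅 서비스를 제공하며, 전 세계 여러 지역에서 사용 가능합니다.  아마존은 고객이 더 많은 것을 할 수 있도록 지원하고자 합니다. 이를 위해 Amazon Web Services를 통해 기업이 IT 인프라를 쉽게 확장하고 관리할 수 있는 솔루션을 제공하고 있습니다.

(English translation: Amazon Web Services is the name of AWS's service. This service provides cloud computing services and is available in many regions around the world. Amazon aims to help customers do more. To that end, it offers solutions through Amazon Web Services that let companies easily scale and manage their IT infrastructure.)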
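
The `Prompter` class comes from the notebook's local `inference_lib` module, whose code is not shown here. The post-processing step `split('###')[0]` suggests that the prompt template marks turns with `###`, so anything the model generates after a new `###` marker is discarded. The sketch below assumes such a template. `SketchPrompter` and its template string are illustrative guesses, not the actual `korwkv` template.

```python
class SketchPrompter:
    """Hypothetical stand-in for the notebook's Prompter (template is assumed)."""

    TEMPLATE = "### Question: {question}\n\n### Answer:"

    def generate_prompt(self, instruction, input_text):
        # Join the optional instruction and the input into one question
        question = " ".join(part for part in (instruction, input_text) if part)
        return self.TEMPLATE.format(question=question)


def first_answer(generated_text, marker="###"):
    """Keep the text before the next turn marker and trim surrounding whitespace."""
    return generated_text.split(marker)[0].strip()


prompt = SketchPrompter().generate_prompt("", "Tell me about AWS.")
print(prompt)

# A made-up continuation in which the model starts a new turn on its own
raw = " AWS is a cloud platform.\n\n### Question: What else?"
print(first_answer(raw))
```

Unlike the notebook's one-liner, `first_answer` also strips leading and trailing whitespace, which is a small design choice of this sketch.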


In [84]:
instruction = ""
# Prompt: "How are you feeling today?"
input_text = "오늘 기분은 좀 어때?"
prompt = prompter.generate_prompt(instruction, input_text)
payload = {
    "inputs": prompt,
    "parameters": params
}

text_inputs, params = payload["inputs"], payload["parameters"]
result = generator(text_inputs, **params)
print(result[0]['generated_text'].split('###')[0])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

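
"좀 어떠니?"라는 질문에 대한 대답은 상황에 따라 다를 수 있습니다. 예를 들어, "오늘은 기분이 좋아."라고 말할 수도 있고, 아니면 다른 방식으로 표현할 수도 있습니다. 하지만 오늘 기분이 어떤지 묻는 것은 매우 중요합니다!

(English translation: The answer to the question "How are you?" can vary depending on the situation. For example, you could say "I feel good today," or you could express it in another way. But asking how someone feels today is very important!)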


In [85]:
instruction = ""
# Prompt: "Please tell me which kinds of futures investments are good for beginner investors."
input_text = "투자 초심자가 하기 좋은 선물 투자 종류를 알려주세요."
prompt = prompter.generate_prompt(instruction, input_text)
payload = {
    "inputs": prompt,
    "parameters": params
}

text_inputs, params = payload["inputs"], payload["parameters"]
result = generator(text_inputs, **params)
print(result[0]['generated_text'].split('###')[0])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


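선물 투자를 처음 시작하시는 분이라면, 주식과 마찬가지로 개별 종목에 투자하는 것이 좋습니다. 다만 초보자의 경우 위험성이 크기 때문에 안정적인 선물이 적합합니다.
예를 들어 KOSPI 200 지수 선물을 매수하는 방법이 있습니다. 이 때 주의할 점은 지수가 상승하더라도 수익을 얻을 수 있는 만큼 손실도 발생할 수 있다는 점입니다. 따라서 적절한 비율로 포트폴리오를 구성하는 것이 중요합니다.

(English translation: If you are starting futures investing for the first time, it is good to invest in individual items, just as with stocks. However, because the risk is high for beginners, stable futures are suitable. For example, one approach is to buy KOSPI 200 index futures. The point to watch here is that even if the index rises, losses can occur just as much as gains can be earned. It is therefore important to build a portfolio with appropriate proportions.)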
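
The three question cells repeat the same steps: build a prompt, call the generator, and cut the answer at the first `###`. One way to factor this out is a small helper such as the hypothetical `ask` function below. To keep the block runnable without a GPU, it is exercised here with stand-in objects (`EchoPrompter`, `fake_generator`) that only mimic the call shapes used in this notebook. In the notebook itself, the real `prompter`, `generator`, and `params` would be passed instead.

```python
def ask(generator, prompter, input_text, params, instruction="", stop="###"):
    """Build the prompt, run generation, and return the text before `stop`."""
    prompt = prompter.generate_prompt(instruction, input_text)
    result = generator(prompt, **params)
    return result[0]["generated_text"].split(stop)[0]


class EchoPrompter:
    """Stand-in prompter (hypothetical); the real template is not shown here."""

    def generate_prompt(self, instruction, input_text):
        return f"{instruction}{input_text}"


def fake_generator(prompt, **params):
    """Stand-in for the text-generation pipeline; returns the same list-of-dicts shape."""
    return [{"generated_text": f"echo: {prompt}### extra turn"}]


answer = ask(fake_generator, EchoPrompter(), "hello", {"max_new_tokens": 8})
print(answer)
```

With the real objects, a call would look like `print(ask(generator, prompter, input_text, params))`, replacing the payload handling that each cell currently duplicates.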
