# Korean LLM (Large Language Model) Inference
---

### Model: [KoAlpaca-Polyglot-12.8B](https://huggingface.co/beomi/KoAlpaca-Polyglot-12.8B)

- GitHub: https://github.com/Beomi/KoAlpaca

In [12]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.append('../utils')
sys.path.append('../templates') 

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [13]:
!pip install -qU boto3 huggingface_hub sagemaker langchain deepspeed 
!pip install -qU bitsandbytes accelerate peft

In [14]:
import os
from pathlib import Path
from huggingface_hub import snapshot_download

# Download the model into a local cache directory on the notebook instance
local_model_path = Path("/home/ec2-user/SageMaker/hf_cache")
local_model_path.mkdir(exist_ok=True)
model_id = "beomi/KoAlpaca-Polyglot-12.8B"

# Download only the config, tokenizer, and weight files that match these patterns
allow_patterns = ["*.json", "*.pt", "*.txt", "*.model", "*.safetensors"]

# Use snapshot_download to fetch the files, since the model weights are stored in the repository with Git LFS
model_download_path = snapshot_download(
    repo_id=model_id,
    cache_dir=local_model_path,
    allow_patterns=allow_patterns,
)

Fetching 37 files:   0%|          | 0/37 [00:00<?, ?it/s]


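The `allow_patterns` list above is a set of glob patterns: a repository file is fetched only if it matches at least one of them, so `.bin` checkpoints are skipped here. The following is a minimal sketch of that filtering, approximated with the standard-library `fnmatch` module. The helper `matches_any` and the file names are hypothetical and only illustrate the matching rule; they are not part of `huggingface_hub` and not a listing of the actual repository.

```python
from fnmatch import fnmatch

allow_patterns = ["*.json", "*.pt", "*.txt", "*.model", "*.safetensors"]


def matches_any(filename, patterns):
    """Return True if the file name matches at least one glob pattern."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


# Illustrative file names only (assumed, not taken from the repository).
candidates = ["config.json", "weights.safetensors", "pytorch_model.bin", "README.md"]

# Keep only the names that would pass the allow list.
selected = [name for name in candidates if matches_any(name, allow_patterns)]
print(selected)
```

Checking a pattern list this way before a large download can help catch a pattern that accidentally excludes the weight format the repository actually uses.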
In [15]:
import torch
from transformers import BitsAndBytesConfig

In [16]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
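To see why 4-bit loading matters for a model of this size, the weight storage can be estimated with simple arithmetic. This is a rough sketch under stated assumptions: the parameter count is taken from the model name (12.8B), and the estimate ignores quantization constants, modules that stay unquantized, activations, and the KV cache. The helper `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: parameter count read off the model name; the exact count may differ.
NUM_PARAMS = 12.8e9

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # half-precision weights
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights need roughly a quarter of the fp16 footprint, which is what makes single-GPU inference plausible for this model.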

In [17]:
import torch
import deepspeed
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, GPTNeoXLayer

# Load the model with the 4-bit bitsandbytes quantization config defined above
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    cache_dir=local_model_path,
    quantization_config=bnb_config
)

Loading checkpoint shards:   0%|          | 0/28 [00:00<?, ?it/s]

In [18]:
# config = {
#     "tensor_parallel": {"tp_size": 1},
#     "dtype": "fp16",
#     "injection_policy": {
#         GPTNeoXLayer:('attention.dense', 'mlp.dense_4h_to_h')
#     }
# }

# model = deepspeed.init_inference(model, config)

local_rank = int(os.getenv('LOCAL_RANK', '0'))
tokenizer = AutoTokenizer.from_pretrained(model_id)
generator = pipeline(
    task="text-generation", model=model, tokenizer=tokenizer
)

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [39]:
import json
from inference_lib import Prompter

prompter = Prompter("kullm")

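params = {
    # Greedy decoding: temperature and top_p only take effect when do_sample=True
    "do_sample": False,
    "max_new_tokens": 256,
    "temperature": 0.2,
    "top_p": 0.9,
    # Return only the generated continuation, without the prompt
    "return_full_text": False,
    "repetition_penalty": 1.1,
    "presence_penalty": None,
    "eos_token_id": 2,
}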

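# Korean instruction: "Answer the question below."
instruction = "아래 질문에 대답해줘."
#instruction = ""
# Korean input: "Tell me about Amazon Web Services (AWS)."
input_text = "아마존 웹서비스(AWS)에 대해 알려줘"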
prompt = prompter.generate_prompt(instruction, input_text)
payload = {
    "inputs": [prompt,],
    "parameters": params
}
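Generation parameter dictionaries like the one above tend to accumulate entries that have no effect, such as `None` values or sampling settings combined with `do_sample=False`. The following is a minimal sketch of a clean-up step, assuming that `temperature`, `top_p`, and `top_k` are only relevant when sampling is enabled. The helper `clean_generation_params` and the set `SAMPLING_ONLY_KEYS` are hypothetical and not part of `transformers`.

```python
# Assumption: these settings only influence output when do_sample=True.
SAMPLING_ONLY_KEYS = {"temperature", "top_p", "top_k"}


def clean_generation_params(params):
    """Drop None-valued entries and, for greedy decoding, sampling-only settings."""
    cleaned = {key: value for key, value in params.items() if value is not None}
    if not cleaned.get("do_sample", False):
        cleaned = {
            key: value
            for key, value in cleaned.items()
            if key not in SAMPLING_ONLY_KEYS
        }
    return cleaned


raw_params = {
    "do_sample": False,
    "max_new_tokens": 256,
    "temperature": 0.2,
    "top_p": 0.9,
    "return_full_text": False,
    "repetition_penalty": 1.1,
    "presence_penalty": None,
    "eos_token_id": 2,
}

print(clean_generation_params(raw_params))
```

Passing the cleaned dictionary keeps the call explicit about which settings actually shape the output.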

In [40]:
text_inputs, params = payload["inputs"], payload["parameters"]
result = generator(text_inputs, **params)
print(result)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


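[[{'generated_text': 'AWS은 클라우드 컴퓨팅 서비스로, EC2 인스턴스, RDS, Elastic Beanstalk, S3 등의 다양한 서비스를 제공하고 있습니다.'}]]

English translation of the generated text: "AWS is a cloud computing service that provides a variety of services such as EC2 instances, RDS, Elastic Beanstalk, and S3."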
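The printed result is nested: one inner list per input prompt, each holding dictionaries with a `generated_text` key. The following is a small sketch for flattening that structure into plain strings. The helper `extract_texts` is hypothetical, and the sample data is a stand-in with the same shape as the output above rather than real model output.

```python
def extract_texts(result):
    """Flatten pipeline-style output into a list of generated strings.

    Accepts either a flat list of dicts or a list of lists of dicts.
    """
    texts = []
    for item in result:
        candidates = item if isinstance(item, list) else [item]
        texts.extend(candidate["generated_text"] for candidate in candidates)
    return texts


# Stand-in data shaped like the printed result (assumed, not real output).
sample_result = [[{"generated_text": "first answer"}], [{"generated_text": "second answer"}]]

for text in extract_texts(sample_result):
    print(text)
```

Handling both the flat and the nested shape keeps the helper usable whether a single prompt string or a list of prompts was passed in.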
