Install the required libraries

In [None]:
# !pip install -U transformers bitsandbytes accelerate sentencepiece -qqq

#### The bitsandbytes library

bitsandbytes is a lightweight wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.
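
To make "quantization function" concrete, here is a minimal pure-Python sketch of absmax 8-bit quantization. It is a simplified illustration of the general idea, not the bitsandbytes implementation; the helper names `absmax_quantize` and `absmax_dequantize` are hypothetical.

```python
def absmax_quantize(values):
    """Map floats to integer codes in [-127, 127] using one absmax scale."""
    absmax = max(abs(v) for v in values)
    if absmax == 0.0:
        return [0] * len(values), 1.0
    scale = 127.0 / absmax
    codes = [round(v * scale) for v in values]
    return codes, scale


def absmax_dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c / scale for c in codes]


weights = [0.5, -2.0, 0.01, 0.0]
codes, scale = absmax_quantize(weights)
restored = absmax_dequantize(codes, scale)

print(codes)     # integer codes that fit in one byte each
print(restored)  # close to the original weights, with rounding error
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half a quantization step.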

# 1. 4-bit loading

In [4]:
from transformers import AutoModelForCausalLM
import torch


model_id = "bigscience/bloom-1b7"
# model_id = "gpt2"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
)



The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
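
The warning above asks for a `BitsAndBytesConfig` object instead of the `load_in_4bit` / `load_in_8bit` flags. The sketch below shows one way to do that. `BitsAndBytesConfig` and the `quantization_config` argument come from transformers; the wrapper functions `build_bnb_kwargs` and `load_quantized` are hypothetical helpers introduced only for this example.

```python
def build_bnb_kwargs(bits, compute_dtype=None, quant_type=None, double_quant=False):
    """Collect BitsAndBytesConfig keyword arguments for 4-bit or 8-bit loading."""
    if bits == 8:
        return {"load_in_8bit": True}
    if bits != 4:
        raise ValueError("bits must be 4 or 8")
    kwargs = {"load_in_4bit": True, "bnb_4bit_use_double_quant": double_quant}
    if compute_dtype is not None:
        kwargs["bnb_4bit_compute_dtype"] = compute_dtype
    if quant_type is not None:
        kwargs["bnb_4bit_quant_type"] = quant_type
    return kwargs


def load_quantized(model_id, **bnb_options):
    """Load a causal LM through quantization_config rather than the deprecated flags."""
    # Imported lazily so this cell can be defined without the GPU libraries present.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quantization_config = BitsAndBytesConfig(**build_bnb_kwargs(**bnb_options))
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quantization_config,
    )


# Example (needs a CUDA GPU and the libraries installed at the top of the notebook):
# model = load_quantized("bigscience/bloom-1b7", bits=4)
```

The optional arguments mirror the `bnb_4bit_*` keyword arguments of `BitsAndBytesConfig`, so the same helper can express 8-bit loading, a different compute dtype, NF4, or double quantization.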



In [5]:
# Quantized down to about 1.63 GB
print(model.get_memory_footprint() / 1000000000)

1.632878592


In [8]:
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # load_in_4bit=True,
)

In [9]:
# Original model size (default float32 weights)
print(model.get_memory_footprint() / 1000000000)

6.88963584


# 2. 8-bit loading

In [10]:
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
`low_cpu_mem_usage` was None, now set to True since model is quantized.


In [11]:
# Model size after 8-bit loading
print(model.get_memory_footprint() / 1000000000)

2.236858368


# 3. 16-bit loading

In [12]:
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.float16
)


In [13]:
print(model.get_memory_footprint()/1000000000)

3.44481792


In [14]:
model.save_pretrained('./../bloom-lb7-ft16')

In [15]:
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto'
)


if torch.cuda.is_available():
    model = model.to("cuda").half()  # Move the model to CUDA, then cast its weights to float16


In [17]:
print(model.get_memory_footprint()/1000000000)

3.44481792
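
The four footprints printed so far (float32, float16, 8-bit, 4-bit) can be reproduced with simple arithmetic. The sketch below assumes the BLOOM-1b7 layer shapes (hidden size 2048, 24 blocks, vocabulary of 250880, `lm_head` sharing the embedding weights) and assumes that only the `Linear` weights inside the blocks are quantized while everything else is kept in float16. Those assumptions are not stated in the notebook, but they match the printed numbers exactly.

```python
HIDDEN = 2048
VOCAB = 250880
N_BLOCKS = 24

# Weights of the four Linear layers inside one BloomBlock.
linear_weights_per_block = (
    HIDDEN * 3 * HIDDEN      # query_key_value
    + HIDDEN * HIDDEN        # dense
    + HIDDEN * 4 * HIDDEN    # dense_h_to_4h
    + 4 * HIDDEN * HIDDEN    # dense_4h_to_h
)
linear_biases_per_block = 3 * HIDDEN + HIDDEN + 4 * HIDDEN + HIDDEN
layernorm_per_block = 2 * (2 * HIDDEN)  # two LayerNorms, weight and bias each

linear_weights = N_BLOCKS * linear_weights_per_block
other_params = (
    VOCAB * HIDDEN  # word_embeddings (shared with lm_head)
    + N_BLOCKS * (linear_biases_per_block + layernorm_per_block)
    + 2 * (2 * HIDDEN)  # word_embeddings_layernorm and ln_f
)


def footprint_bytes(bytes_per_linear_weight, bytes_per_other_param):
    """Total size when Linear weights and the remaining parameters use different widths."""
    return int(linear_weights * bytes_per_linear_weight + other_params * bytes_per_other_param)


for label, linear_bytes, other_bytes in [
    ("float32", 4, 4),
    ("float16", 2, 2),
    ("8-bit", 1, 2),
    ("4-bit", 0.5, 2),
]:
    print(label, footprint_bytes(linear_bytes, other_bytes) / 1_000_000_000)
```

The embedding matrix alone holds about 0.51 billion of the roughly 1.72 billion parameters, which is why the 4-bit model is not simply one eighth of the float32 size.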


# 4. Other usage

## 4.1. Changing compute_dtype

compute_dtype changes the dtype that is used during computation.

For example, the hidden states can be kept in float32 while the computation runs in bf16 for speed.

By default, the compute dtype is float32.

In [19]:
import torch

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    device_map='auto',
    bnb_4bit_compute_dtype=torch.bfloat16
)


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


## 4.2. NF4 (Normal Float 4)

- Lightweight quantization to a 4-bit float format (NF4)
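
As a rough illustration of what NF4 does, the sketch below quantizes one block of weights to 4-bit codes by normalizing with the block's absolute maximum and picking the nearest of 16 fixed levels. The level values are approximate, rounded to four decimals from the commonly published NF4 codebook (an assumption; the exact table lives in bitsandbytes), and the helper names are hypothetical.

```python
# Approximate NF4 levels (assumption: rounded values, not the exact library table).
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def nf4_quantize_block(block):
    """Return 4-bit codes (0..15) and the absmax constant for one block of weights."""
    absmax = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
    codes = [
        min(range(16), key=lambda i: abs(NF4_LEVELS[i] - w / absmax))
        for w in block
    ]
    return codes, absmax


def nf4_dequantize_block(codes, absmax):
    """Map the codes back to approximate weights."""
    return [NF4_LEVELS[c] * absmax for c in codes]


codes, absmax = nf4_quantize_block([0.5, -1.0, 0.0, 0.25])
print(codes)
print(nf4_dequantize_block(codes, absmax))
```

The levels are denser near zero, where most normally distributed weights fall, so small weights are represented more precisely than with evenly spaced 4-bit integers.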

In [20]:
import torch

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    device_map='auto',
    bnb_4bit_quant_type='nf4'
)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


In [21]:
print(model.get_memory_footprint()/1000000000)

1.632878592


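## 4.3. Nested quantization

We also recommend using nested quantization. This technique saves additional memory at no extra performance cost. According to empirical observations, it makes it possible to fine-tune a llama-13b model on an NVIDIA T4 16GB with a sequence length of 1024, a batch size of 1, and 4 gradient accumulation steps.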
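
The saving comes from the per-block scaling constants rather than from the 4-bit weights themselves. The arithmetic below is a sketch that assumes the block sizes described in the QLoRA paper (64 weights per constant, 256 constants per second-level constant) and a count of `Linear` weights computed from the BLOOM-1b7 layer shapes (24 blocks, hidden size 2048); none of these values are stated in this notebook.

```python
LINEAR_WEIGHTS = 1_207_959_552  # Linear weights in the 24 BLOOM-1b7 blocks (assumed)
BLOCK_SIZE = 64                 # weights sharing one scaling constant (assumed)
NESTED_BLOCK_SIZE = 256         # constants sharing one second-level constant (assumed)

n_constants = LINEAR_WEIGHTS // BLOCK_SIZE

# Plain 4-bit quantization: one float32 (4-byte) constant per block.
plain_bytes = n_constants * 4

# Nested quantization: one 8-bit constant per block, plus one float32
# constant for every group of 256 first-level constants.
nested_bytes = n_constants * 1 + (n_constants // NESTED_BLOCK_SIZE) * 4

for label, size in [("plain", plain_bytes), ("nested", nested_bytes)]:
    bits_per_weight = size * 8 / LINEAR_WEIGHTS
    print(f"{label}: {size / 1e6:.1f} MB of constants, {bits_per_weight:.3f} bits per weight")
```

Under these assumptions the overhead drops from 0.5 to about 0.127 bits per weight. The footprint printed in this notebook is identical with and without `bnb_4bit_use_double_quant`, which suggests that `get_memory_footprint()` does not count these constants.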

In [23]:
from transformers import AutoModelForCausalLM

model_id = "bigscience/bloom-1b7"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True
)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


In [24]:
print(model.get_memory_footprint()/1000000000)

1.632878592


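## 4.4. Offloading between `cpu` and `gpu`

- When GPU memory is insufficient, part of the model can be placed on the CPU
- In the device_map below, each module can be assigned to the CPU or a GPU (0 is the first GPU, 'cpu' is the CPU)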

In [25]:
device_map = {
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": "cpu",
    "transformer.h": 0,
    "transformer.ln_f": 0,
}

In [35]:
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    device_map=device_map
)

Tied parameters are on different devices: {'lm_head.weight': 'cpu', 'transformer.word_embeddings.weight': 0}. Please modify your custom device map or set `device_map='auto'`. 
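
The warning above appears because `lm_head.weight` and `transformer.word_embeddings.weight` share the same tensor but were mapped to different devices. The sketch below is a hypothetical pure-Python check (not part of transformers or accelerate) that finds such conflicts in a device map before loading, given the groups of tied parameter names.

```python
def device_of(param_name, device_map):
    """Resolve a parameter's device from the longest matching module prefix."""
    best = None
    for prefix in device_map:
        if param_name == prefix or param_name.startswith(prefix + "."):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no device_map entry covers {param_name!r}")
    return device_map[best]


def split_tied_groups(device_map, tied_groups):
    """Return the tied-parameter groups whose members sit on different devices."""
    conflicts = []
    for group in tied_groups:
        devices = {name: device_of(name, device_map) for name in group}
        if len(set(devices.values())) > 1:
            conflicts.append(devices)
    return conflicts


device_map = {
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": "cpu",
    "transformer.h": 0,
    "transformer.ln_f": 0,
}
tied = [["lm_head.weight", "transformer.word_embeddings.weight"]]

print(split_tied_groups(device_map, tied))
```

Keeping tied modules on the same device, for example by mapping `lm_head` to the same device as the embeddings, makes the check return an empty list.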


In [36]:
model

BloomForCausalLM(
  (transformer): BloomModel(
    (word_embeddings): Embedding(250880, 2048)
    (word_embeddings_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
    (h): ModuleList(
      (0-23): 24 x BloomBlock(
        (input_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (self_attention): BloomAttention(
          (query_key_value): Linear(in_features=2048, out_features=6144, bias=True)
          (dense): Linear(in_features=2048, out_features=2048, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (post_attention_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (mlp): BloomMLP(
          (dense_h_to_4h): Linear(in_features=2048, out_features=8192, bias=True)
          (gelu_impl): BloomGelu()
          (dense_4h_to_h): Linear(in_features=8192, out_features=2048, bias=True)
        )
      )
    )
    (ln_f): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
  )
  (

# 5. Loading a 13-billion-parameter model and running inference

In [39]:
%%time

from transformers import AutoModelForCausalLM

model_path = "./../LLM/open_llama_13b/"
# model_id = "openlm-research/open_llama_13b"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    trust_remote_code=True
)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
`low_cpu_mem_usage` was None, now set to True since model is quantized.



CPU times: user 5.22 s, sys: 19.6 s, total: 24.8 s
Wall time: 9min 58s


In [40]:
print(model.get_memory_footprint()/1000000000)

7.04202752


In [42]:
from transformers import AutoTokenizer

# Load the tokenizer that matches the model loaded from model_path above
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
)

In [45]:
# 1) Query sentence
input_text = "What is deep learning?"

In [46]:
# 2) Tokenize the text and convert it to a tensor.
input_ids = tokenizer.encode(input_text, return_tensors="pt")
# input_ids = input_ids.to('cuda')

In [50]:
# 3) Set the generation options and generate text.
max_length = 200
sample_outputs = model.generate(input_ids, do_sample=True, max_length=max_length, temperature=0.75)
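
The `do_sample=True` and `temperature=0.75` options control how the next token is drawn. As a simplified sketch with toy logits (the real `generate()` call applies further processing, and the function names here are hypothetical), temperature divides the logits before the softmax, so values below 1 sharpen the distribution:

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a three-token vocabulary
print(softmax_with_temperature(toy_logits, 1.0))
print(softmax_with_temperature(toy_logits, 0.75))
print(sample_token(toy_logits, 0.75, random.Random(0)))
```

Because sampling is random, repeated runs of the generation cell produce different continuations unless a seed is fixed.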



In [51]:
# 4) Decode the generated text.

print(tokenizer.decode(sample_outputs[0], skip_special_tokens=True))

What is deep learning?
Kaplan, Alex
The term “deep learning” can be confusing because it is applied to a variety of research fields, each with their own goals and applications.
This article will be divided into the three main areas of deep learning research: 1) Representation learning, 2) Deep neural networks, and 3) Unsupervised learning.
Each area will be described in depth, with links to other resources and a bibliography to the major papers cited in this article.
Deep Learning and Representation Learning
Deep learning is a particular form of representation learning, in which the goal is to represent a complex, high-dimensional input as a low-dimensional representation.
Deep learning is applied in three major areas:
1) Computer vision
2) Natural language processing
3) Speech recognition
The term “deep learning” was coined in 2006 by Geoffrey Hinton. The term was
