# Text Generation

## Initial Environment Setup

In [1]:
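import os
from pathlib import Path

# Append the user's local bin directory to PATH so pip-installed command-line tools are found
HOME = str(Path.home())
Add_Binarry_Path = HOME + '/.local/bin'
os.environ['PATH'] = os.environ['PATH'] + ':' + Add_Binarry_Path

# Record the current working directory (the IPython shell magic returns a list of lines)
current_foldr = !pwd
current_foldr = current_foldr[0]
current_foldr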

'/work/g00cjz00/github/20240115_RAG'

## Check the CUDA Version and Whether a GPU Is Available
If no GPU is available, click Connected on the right side -> Change runtime type -> T4 GPU.

In [None]:
!nvidia-smi
import torch
torch.cuda.is_available()

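## Install Packages
After installation finishes, it is recommended to open the top menu and choose Runtime -> Restart session so that the libraries are reloaded.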

In [None]:
!pip install cohere gdown kaleido langchain openai pyngrok pypdf python-dotenv sentence-transformers tiktoken -q
!pip install accelerate bitsandbytes hf_transfer huggingface_hub optimum transformers -q 

## HF_TOKEN

### LOAD LIBRARY

In [None]:
# LOAD LIBRARY
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
from transformers.generation.utils import GenerationConfig
from langchain.llms.huggingface_pipeline import HuggingFacePipeline
import torch

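## Inputs
- inputs: the input prompt. If empty, generation is initialized with bos_token_id and a batch size of 1. For decoder-only models (the GPT family), the input must be input_ids; encoder-decoder models (BART, T5, etc.) accept more varied inputs.
- max_length: the maximum length of the generated sequence. For decoder-only models this counts the prompt plus the newly generated tokens; max_new_tokens limits only the new tokens.
- min_length: the minimum length of the generated sequence; the default is 0.
- do_sample: whether to enable sampling. The default is False, which means greedy decoding: always pick the token with the highest conditional probability.
- early_stopping: whether to stop beam search once at least num_beams sentences have been finished; the default is False.
- num_beams: the default is 1, which means no beam search.
- temperature: the default is 1.0. The lower the temperature (below 1), the larger the gap between high- and low-probability tokens in the softmax output; the higher the temperature, the smaller the gap.
- top_k: the number of highest-probability tokens that top-k filtering keeps as candidates; the default is 50. See below for details.
- top_p: the default is 1.0 (the probabilities of all tokens sum to 1). If top_p is less than 1, probabilities are accumulated from highest to lowest until the total reaches top_p, and only those top N tokens are kept as candidates.
- repetition_penalty: the penalty applied to repeated tokens; the default is 1.0 (no penalty).
- pad_token_id (int, optional): the id of the padding token, <PAD>.
- bos_token_id (int, optional): the id of the beginning-of-sequence token, <s>.
- eos_token_id (int, optional): the id of the end-of-sequence token, </s>.
- length_penalty: the length penalty used with beam search; the default is 1.0. The beam score is the sequence log-likelihood (a negative number) divided by the sequence length raised to the power length_penalty.
- length_penalty=1.0: the beam score is divided by the sequence length (the default).
- length_penalty=0.0: no length normalization.
- length_penalty>0.0: favors longer sequences.
- length_penalty<0.0: favors shorter sequences.
- no_repeat_ngram_size: controls repeated output; the default is 0. If greater than 0, every n-gram of that size can appear only once.
- encoder_no_repeat_ngram_size: also controls repeated output; the default is 0. If greater than 0, n-grams of that size that appear in encoder_input_ids cannot appear in decoder_input_ids.
- bad_words_ids: a list of token ids that must not be generated. The ids can be obtained with tokenizer(bad_words, add_prefix_space=True, add_special_tokens=False).input_ids.
- force_words_ids: the opposite of bad_words_ids; a list of token ids that must be generated. With the format List[List[int]], for example [[1,2],[3,4]], each inner list is a word that must appear. With the format List[List[List[int]]], this triggers a disjunctive constraint (Disjunctive Positive Constraint Decoding), which roughly means that any one of several forms of a word, such as "lonely" or "loneliness", satisfies the constraint.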
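
The effect of temperature, top_k, and top_p on a next-token distribution can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the transformers implementation: the vocabulary and logits are made-up values, and `softmax`, `top_k_filter`, and `top_p_filter` are hypothetical helper names.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize them."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}

# Hypothetical logits for four candidate tokens (toy values, not model output).
vocab = ["love", "lights", "light", "romance"]
logits = [3.0, 2.0, 1.0, 0.5]

sharp = softmax(logits, temperature=0.5)  # low temperature: more peaked
flat = softmax(logits, temperature=2.0)   # high temperature: flatter
probs = softmax(logits)                   # temperature 1.0: unchanged

top2 = top_k_filter(probs, k=2)
nucleus = top_p_filter(probs, p=0.9)
print("top-k candidates:", [vocab[i] for i in top2])
print("top-p candidates:", [vocab[i] for i in nucleus])
```

With do_sample=True, the next token is then drawn at random from the filtered candidates, which is why a small top_k restricts the output to only a few distinct continuations.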


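### Overview of Decoding Algorithms
This subsection introduces the most commonly used decoding methods for autoregressive text generation, including: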
- Greedy search
- Beam search 
- Top-K sampling
- Top-p sampling
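
The difference between greedy search and beam search can be sketched on a toy model. The transition table `NEXT` below holds made-up probabilities rather than model output, and `greedy_search` and `beam_search` are hypothetical helpers, not library functions. The final ranking divides the log-probability by the number of generated tokens raised to length_penalty, which is an assumption of this sketch.

```python
import math

# Hypothetical next-token probabilities (toy values, not model output).
NEXT = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "and": 0.05, "runs": 0.05},
    "car": {"is": 0.6, "drives": 0.4},
}

def greedy_search(start, steps):
    """Pick the single most probable next token at every step."""
    seq, logp = [start], 0.0
    for _ in range(steps):
        options = NEXT[seq[-1]]
        token = max(options, key=options.get)
        logp += math.log(options[token])
        seq.append(token)
    return seq, logp

def beam_search(start, steps, num_beams=2, length_penalty=1.0):
    """Keep the num_beams best partial sequences at every step."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for token, p in NEXT[seq[-1]].items():
                candidates.append((seq + [token], logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]

    def score(candidate):
        seq, logp = candidate
        generated = len(seq) - 1  # tokens produced after the start token
        return logp / generated ** length_penalty

    return max(beams, key=score)

# The toy table only covers two steps after "The".
greedy_seq, greedy_logp = greedy_search("The", steps=2)
beam_seq, beam_logp = beam_search("The", steps=2, num_beams=2)
print(" ".join(greedy_seq), round(math.exp(greedy_logp), 2))
print(" ".join(beam_seq), round(math.exp(beam_logp), 2))
```

Greedy search commits to "nice" because it is the best first step, while beam search also keeps "dog" alive and ends with a sequence of higher overall probability.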


# MODEL

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "/work/u00cjz00/slurm_jobs/github/models/Breeze-7B-Instruct-64k-v0.1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
#model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
#model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype=torch.float16).to(0)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", load_in_4bit=True)
#model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype=torch.float16, use_flash_attention_2=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## 1. Text Continuation

In [28]:
# Greedy search, top_k=1,do_sample=False
text = "Paris is the city"
inputs = tokenizer(text, return_tensors="pt").to(0)
for number in range(10):
    outputs = model.generate(**inputs, max_new_tokens=2,top_k=1,do_sample=False,pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Paris is the city of love
Paris is the city of love
Paris is the city of love
Paris is the city of love
Paris is the city of love
Paris is the city of love
Paris is the city of love
Paris is the city of love
Paris is the city of love
Paris is the city of love


In [29]:
# Top-k sampling, top_k=2, do_sample=True
text = "Paris is the city"
inputs = tokenizer(text, return_tensors="pt").to(0)
for number in range(10):
    outputs = model.generate(**inputs, max_new_tokens=2,top_k=2,do_sample=True,pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Paris is the city of love
Paris is the city of love
Paris is the city of love
Paris is the city of lights
Paris is the city of love
Paris is the city of lights
Paris is the city of love
Paris is the city of love
Paris is the city of love
Paris is the city of love


In [43]:
# Top-p (nucleus) sampling, top_p=0.90, do_sample=True
text = "Paris is the city"
inputs = tokenizer(text, return_tensors="pt").to(0)
for number in range(10):
    outputs = model.generate(**inputs, max_new_tokens=2,top_p=0.90,do_sample=True,pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Paris is the city of light
Paris is the city of love
Paris is the city of love
Paris is the city of lights
Paris is the city of light
Paris is the city of love
Paris is the city of lights
Paris is the city of love
Paris is the city of love
Paris is the city of lights


In [44]:
# Top-p sampling with temperature, top_p=0.90, temperature=1.0, do_sample=True
text = "Paris is the city"
inputs = tokenizer(text, return_tensors="pt").to(0)
for number in range(10):
    outputs = model.generate(**inputs, max_new_tokens=2,top_p=0.90,do_sample=True,temperature=1.0,pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Paris is the city of love
Paris is the city of love
Paris is the city of lights
Paris is the city of lights
Paris is the city of romance
Paris is the city of love
Paris is the city of love
Paris is the city of romance
Paris is the city of light
Paris is the city of lights


In [54]:
# Beam search, num_beams=2, do_sample=False, with max_new_tokens growing from 5 to 14
text = "Paris is the city"
inputs = tokenizer(text, return_tensors="pt").to(0)

for number in range(10):
    n=number+5
    outputs = model.generate(**inputs, max_new_tokens=n,num_beams=2,do_sample=False,pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Paris is the city of love, and it
Paris is the city of love, and it’
Paris is the city of love, and it’s
Paris is the city of love, but it’s also
Paris is the city of love, but it’s also a
Paris is the city of love, but it’s also the city
Paris is the city of love, but it’s also the city of
Paris is the city of love, but it’s also the city of art
Paris is the city of love, but it’s also the city of art,
Paris is the city of love, but it’s also the city of art, culture


In [57]:
# Beam search, num_beams=2, do_sample=False
text = "Paris is the city"
inputs = tokenizer(text, return_tensors="pt").to(0)

for number in range(10):
    n=number+5
    outputs = model.generate(**inputs, max_new_tokens=n,num_beams=2,do_sample=False,pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))



Paris is the city of love, and it
Paris is the city of love, and it’
Paris is the city of love, and it’s
Paris is the city of love, but it’s also
Paris is the city of love, but it’s also a
Paris is the city of love, but it’s also the city
Paris is the city of love, but it’s also the city of
Paris is the city of love, but it’s also the city of art
Paris is the city of love, but it’s also the city of art,
Paris is the city of love, but it’s also the city of art, culture


## 2. Tokenizer Chat Template

In [10]:
chat = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "A01!"},
  {"role": "user", "content": "Q02?"},
  {"role": "assistant", "content": "A02!"},
  {"role": "user", "content": "Q03?"},
  {"role": "assistant", "content": "A03!"},
    
]

print(tokenizer.apply_chat_template(chat, tokenize=False))

<s>You are a helpful AI assistant built by MediaTek Research. The user you are helping speaks Traditional Chinese and comes from Taiwan.   [INST] Hello, how are you? [/INST] A01! [INST] Q02? [/INST] A02! [INST] Q03? [/INST] A03! 

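The mapping from role-tagged messages to the flat prompt string shown above can be approximated in a few lines. This is only a sketch: the real template is the Jinja template stored with the tokenizer (`tokenizer.chat_template`), `render_chat` is a hypothetical helper, and the exact whitespace of the real output is not reproduced.

```python
# System prompt copied from the output above.
SYSTEM_PROMPT = (
    "You are a helpful AI assistant built by MediaTek Research. "
    "The user you are helping speaks Traditional Chinese and comes from Taiwan."
)

def render_chat(chat, system_prompt=SYSTEM_PROMPT, bos="<s>"):
    """Approximate an [INST]-style prompt from a list of role/content messages."""
    parts = [bos + system_prompt + " "]
    for message in chat:
        if message["role"] == "user":
            # User turns are wrapped in instruction markers.
            parts.append(f"[INST] {message['content']} [/INST] ")
        elif message["role"] == "assistant":
            # Assistant turns are appended as plain text.
            parts.append(f"{message['content']} ")
        else:
            raise ValueError(f"unsupported role: {message['role']}")
    return "".join(parts)

example_chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "A01!"},
    {"role": "user", "content": "Q02?"},
]
rendered = render_chat(example_chat)
print(rendered)
```

Ending the chat with a user message, as in `example_chat`, leaves the prompt open after the final `[/INST]`, which is the point where the model would be expected to continue.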

## 3. Tokenizer Extended Vocabulary

In [12]:
text = '''The primary use of LLaMA is research on large language models, including'''
print("Test text:\n", text)
print(f"{tokenizer.tokenize(text)} -> Media")

text = '''蔡英文，中華民國政治人物、法學家與律師，民主進步黨籍，現任中華民國總統。'''
print(f"{tokenizer.tokenize(text)} -> Media")

Test text:
 The primary use of LLaMA is research on large language models, including
['▁The', '▁primary', '▁use', '▁of', '▁L', 'La', 'MA', '▁is', '▁research', '▁on', '▁large', '▁language', '▁models', ',', '▁including'] -> Media
['▁', '蔡', '英文', '，', '中華', '民國', '政治', '人物', '、', '法', '學家', '與', '律師', '，', '民主', '進步', '黨', '籍', '，', '現任', '中華', '民國', '總統', '。'] -> Media
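
The token lists printed above can be checked offline to see how compactly each text is encoded. The sketch below copies the two lists from the output; `detokenize` is a hypothetical helper that assumes the SentencePiece-style convention in which '▁' marks a space.

```python
def detokenize(tokens):
    """Join SentencePiece-style pieces; the '▁' marker stands for a space."""
    return "".join(tokens).replace("▁", " ").strip()

# Token lists copied from the tokenizer output above.
english_tokens = ['▁The', '▁primary', '▁use', '▁of', '▁L', 'La', 'MA', '▁is',
                  '▁research', '▁on', '▁large', '▁language', '▁models', ',',
                  '▁including']
chinese_tokens = ['▁', '蔡', '英文', '，', '中華', '民國', '政治', '人物', '、',
                  '法', '學家', '與', '律師', '，', '民主', '進步', '黨', '籍',
                  '，', '現任', '中華', '民國', '總統', '。']

english_text = detokenize(english_tokens)
chinese_text = detokenize(chinese_tokens)

for name, text, tokens in [("English", english_text, english_tokens),
                           ("Chinese", chinese_text, chinese_tokens)]:
    ratio = len(text) / len(tokens)
    print(f"{name}: {len(text)} characters, {len(tokens)} tokens, "
          f"{ratio:.2f} characters per token")
```

Many of the Chinese pieces, such as '中華' and '總統', cover two characters each, so the Chinese sentence needs fewer tokens than it has characters.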


## BK