## LangChain LLM - Llama-2-7b-chat-hf

1. Llama-2-7b-chat-hf model card
- https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

2. LangChain documentation
- https://python.langchain.com/docs/integrations/llms/openai

3. Learn the llm module, for example llm("What is federated learning?")

## Initial environment setup

In [None]:
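import os
from pathlib import Path

# Add the user-level bin directory to PATH so pip-installed command-line tools are found
HOME = str(Path.home())
add_binary_path = HOME + '/.local/bin'
os.environ['PATH'] = os.environ['PATH'] + ':' + add_binary_path

# Record the current working directory (IPython shell escape)
current_folder = !pwd
current_folder = current_folder[0]
current_folder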

## Check the CUDA version and whether a GPU is available
If no GPU is available, click Connected (upper right) -> Change runtime type -> T4 GPU.

In [None]:
!nvidia-smi
import torch
torch.cuda.is_available()
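
The check above can also be wrapped in a small helper so that later cells can branch on the result. The sketch below is only an illustration: `describe_gpu` is a hypothetical helper name, it imports `torch` only if it is installed, and it reports bfloat16 support because the 4-bit configuration used later requests `torch.bfloat16` as the compute dtype.

```python
import importlib.util
import shutil


def describe_gpu() -> dict:
    """Collect basic GPU facts without failing when torch or the driver is missing."""
    info = {
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
        "device_name": None,
        "bf16_supported": False,
    }
    if info["torch_installed"]:
        import torch  # imported lazily so the helper also works before installation

        info["cuda_available"] = torch.cuda.is_available()
        if info["cuda_available"]:
            info["device_name"] = torch.cuda.get_device_name(0)
            info["bf16_supported"] = torch.cuda.is_bf16_supported()
    return info


gpu_info = describe_gpu()
print(gpu_info)
```

If `bf16_supported` comes back as `False`, switching the compute dtype to `torch.float16` is one option worth trying.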

## Install packages
After installation, it is recommended to use the top menu, Runtime -> Restart session, so that the libraries are reloaded.

In [None]:
!pip install cohere gdown kaleido langchain openai pyngrok pypdf python-dotenv sentence-transformers tiktoken -q
!pip install accelerate bitsandbytes hf_transfer huggingface_hub optimum transformers -q 
!pip install fire -q
!pip install git+https://github.com/huggingface/peft.git -q
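
After restarting the session it can help to confirm which versions were actually installed. A minimal sketch using only the standard library follows; the package list is a subset of the `pip install` lines above, and `report_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Distribution names taken from the pip commands above (subset).
REQUIRED = [
    "accelerate", "bitsandbytes", "langchain",
    "peft", "sentence-transformers", "transformers",
]


def report_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


versions = report_versions(REQUIRED)
for name, version in versions.items():
    print(f"{name:24s} {version or 'NOT INSTALLED'}")
```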

### HF_TOKEN

In [None]:
# HF_TOKEN method 1: store the token in a .env file and load it with python-dotenv

!echo "HF_TOKEN=hf_xxxxxxxxxxx" > .env
from dotenv import load_dotenv
load_dotenv() # loads env variables

In [None]:
# HF_TOKEN method 2: set the environment variable directly

import os
os.environ["HF_TOKEN"] = "hf_xxxxxxxxxxx"

In [None]:
# HF_TOKEN method 3: enter the token interactively so it is not stored in the notebook

import os
from getpass import getpass
os.environ["HF_TOKEN"] = getpass()
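
The three methods can also be combined into a single lookup. The sketch below is one possible arrangement, not part of the original notebook: `resolve_hf_token` is a hypothetical helper that checks the environment first, then a `.env` file parsed by hand, and only then falls back to an interactive prompt.

```python
import os
from getpass import getpass
from pathlib import Path


def resolve_hf_token(env_file: str = ".env", prompt: bool = True) -> str:
    """Return HF_TOKEN from the environment, a .env file, or an interactive prompt."""
    token = os.environ.get("HF_TOKEN", "").strip()

    path = Path(env_file)
    if not token and path.is_file():
        # Minimal KEY=VALUE parsing; python-dotenv handles more cases.
        for line in path.read_text().splitlines():
            key, sep, value = line.partition("=")
            if sep and key.strip() == "HF_TOKEN":
                token = value.strip().strip('"').strip("'")
                break

    if not token and prompt:
        token = getpass("HF_TOKEN: ").strip()

    if not token:
        raise ValueError("HF_TOKEN was not found")
    os.environ["HF_TOKEN"] = token  # export it for later cells
    return token
```

Calling `resolve_hf_token()` once near the top of the notebook would then cover all three cases.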

## Method 1

In [1]:
# LOAD LIBRARY
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig, GenerationConfig
from langchain.llms.huggingface_pipeline import HuggingFacePipeline
import torch
from peft import PeftConfig, PeftModel

In [2]:
# Set the path to the folder that contains adapter_config.json and the associated .bin files for the PEFT model
# peft_model_id = '/path/to/local/peft_model_folder'
MODEL_ID="/work/u00cjz00/slurm_jobs/github/models/Llama-2-7b-chat-hf"
LORA_ID="/work/u00cjz00/slurm_jobs/github/loras/math"

# Get the PeftConfig from the fine-tuned PEFT model. This config file contains the path to the base model,
# which is overridden below with the local copy.
config = PeftConfig.from_pretrained(LORA_ID)
config.base_model_name_or_path=MODEL_ID

# If you quantized the model while finetuning using bits and bytes 
# and want to load the model in 4bit for inference use the following code.
# NOTE: Make sure the quant and compute types match what you did during finetuning
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the base model
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    #use_auth_token=True,    
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token

# Load the Peft/Lora model
model = PeftModel.from_pretrained(model, LORA_ID)


# unwind broken decapoda-research config (try commenting these lines out to compare the results)
model.config.pad_token_id = tokenizer.pad_token_id = 0  # unk
model.config.bos_token_id = 1
model.config.eos_token_id = 2

# Pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    do_sample=True,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    repetition_penalty=1.1,
)

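# Wrap the pipeline as a LangChain LLM. The pipeline above already carries the sampling settings;
# model_kwargs is likely ignored for a pre-built pipeline (assumption, not verified here).
llm = HuggingFacePipeline(pipeline=pipe, model_kwargs={"temperature": 0.0})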

The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', ...]
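
To see why 4-bit loading matters on a card with roughly 16 GB of memory such as the T4, a rough weight-memory estimate helps. The figures below are back-of-the-envelope assumptions: 7 billion parameters taken from the model name, weights only, ignoring activations, the KV cache, and quantization overhead.

```python
PARAMS = 7e9  # nominal parameter count implied by "7b" (assumption)

BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16 / bfloat16": 2.0,
    "int8": 1.0,
    "nf4 (4-bit)": 0.5,
}


def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 1024**3


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:20s} ~{weight_memory_gib(PARAMS, nbytes):5.1f} GiB")
```

Under these assumptions the 16-bit weights alone take about 13 GiB, which leaves little headroom on such a card, while the 4-bit weights need a little over 3 GiB.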

### QUESTION to Model

In [3]:
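response = llm("Xiao Ming has 14 candies. He gives 5 to Xiao Hong and 4 to Xiao Wang. How many candies does he have left?")
print(response)


 Distinguishing between a sentence and a statement.
- Example:
Xiao Ming has 6 candies, and he gives 3 of them to Xiao Hong. How many candies does Xiao Ming have left now?
Answer: Xiao Ming now has 3 candies left.
From the problem we know:
Xiao Ming has 6 candies
Xiao Ming gave 3 of them to Xiao Hong
Therefore, the number of candies Xiao Ming has left equals 6-3=3. So the answer is 3.
Therefore, we need to use addition and subtraction to solve this problem.
2. Understand the meaning of mathematical symbols.
- Example:
Xiao Ming has 6 candies, and he gives 3 of them to Xiao Hong. How many candies does Xiao Ming have left now?
Answer: Xiao Ming now has 3 candies left.
From the problem we know:
Xiao Ming has 6 candies
Xiao Ming gave 3 of them to Xiao Hong
Therefore, the number of candies Xiao Ming has left equals 6-3=3. So the answer is 3.
Therefore, we need to use addition and subtraction to solve this...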
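
The output above drifts away from the question and never reaches an answer for the 14-candy problem. For reference, the expected result is 14 - 5 - 4 = 5, and a small checker makes it easier to compare runs. This is an illustrative sketch only; `contains_answer` is a hypothetical helper that looks for the expected number as a standalone integer in the generated text.

```python
import re

start, to_hong, to_wang = 14, 5, 4
expected = start - to_hong - to_wang  # 14 - 5 - 4


def contains_answer(text: str, answer: int) -> bool:
    """True if `answer` appears in `text` as a standalone integer."""
    numbers = re.findall(r"\d+", text)
    return str(answer) in numbers


print(expected)
print(contains_answer("6-3=3. So the answer is 3.", expected))
```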


## Method 2

In [None]:
from typing import Optional

import torch

from transformers.utils import is_accelerate_available, is_bitsandbytes_available
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    GenerationConfig,
    pipeline,
)

from peft import PeftModel

ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)


def load_adapted_hf_generation_pipeline(
    base_model_name,
    lora_model_name,
    temperature: float = 0.7,
    top_p: float = 1.,
    max_tokens: int = 512,
    batch_size: int = 16,
    device: str = "cuda",
    load_in_8bit: bool = True,
    generation_kwargs: Optional[dict] = None,
):
    """
    Load a huggingface model & adapt with PEFT.
    Borrowed from https://github.com/tloen/alpaca-lora/blob/main/generate.py
    """

    # Fail early if the libraries needed for GPU / 8-bit loading are missing
    if device == "cuda":
        if not is_accelerate_available():
            raise ValueError("Install `accelerate`")
        if load_in_8bit and not is_bitsandbytes_available():
            raise ValueError("Install `bitsandbytes`")
    
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    task = "text-generation"
    
    if device == "cuda":
        model = AutoModelForCausalLM.from_pretrained(
            base_model_name,
            load_in_8bit=load_in_8bit,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        model = PeftModel.from_pretrained(
            model,
            lora_model_name,
            torch_dtype=torch.float16,
        )
    elif device == "mps":
        model = AutoModelForCausalLM.from_pretrained(
            base_model_name,
            device_map={"": device},
            torch_dtype=torch.float16,
        )
        model = PeftModel.from_pretrained(
            model,
            lora_model_name,
            device_map={"": device},
            torch_dtype=torch.float16,
        )
    else:
        model = AutoModelForCausalLM.from_pretrained(
            base_model_name, device_map={"": device}, low_cpu_mem_usage=True
        )
        model = PeftModel.from_pretrained(
            model,
            lora_model_name,
            device_map={"": device},
        )

    # unwind broken decapoda-research config
    model.config.pad_token_id = tokenizer.pad_token_id = 0  # unk
    model.config.bos_token_id = 1
    model.config.eos_token_id = 2

    if not load_in_8bit:
        model.half()  # seems to fix bugs for some users.

    model.eval()

    generation_kwargs = generation_kwargs if generation_kwargs is not None else {}
    config = GenerationConfig(
        do_sample=True,
        temperature=temperature,
        max_new_tokens=max_tokens,
        top_p=top_p,
        **generation_kwargs,
    )
    pipe = pipeline(
        task,
        model=model,
        tokenizer=tokenizer,
        batch_size=batch_size,
        generation_config=config,
        framework="pt",
    )

    return pipe
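
One detail of the function above is worth illustrating: `generation_kwargs` is unpacked next to explicit keyword arguments, so passing a key such as `temperature` inside it would raise a `TypeError` for a repeated keyword. The sketch below shows one way to merge defaults and overrides first, using plain dictionaries; `merge_generation_kwargs` is a hypothetical helper, not part of the original code.

```python
from typing import Optional


def merge_generation_kwargs(
    temperature: float = 0.7,
    top_p: float = 1.0,
    max_tokens: int = 512,
    generation_kwargs: Optional[dict] = None,
) -> dict:
    """Build one dict of generation settings; caller overrides win over defaults."""
    merged = {
        "do_sample": True,
        "temperature": temperature,
        "max_new_tokens": max_tokens,
        "top_p": top_p,
    }
    merged.update(generation_kwargs or {})
    return merged


settings = merge_generation_kwargs(generation_kwargs={"temperature": 0.2, "top_k": 40})
print(settings)
```

The merged dictionary could then be passed on as `GenerationConfig(**settings)`.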

### QUESTION to Model

In [None]:
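# Build the generation pipeline from the local base model and LoRA adapter
pipe = load_adapted_hf_generation_pipeline(
    base_model_name="/work/u00cjz00/slurm_jobs/github/models/Llama-2-7b-chat-hf",
    lora_model_name="/work/u00cjz00/slurm_jobs/github/loras/math"
)
# Fill the Alpaca template with the instruction and the question
prompt = ALPACA_TEMPLATE.format(
    instruction="Please answer the following math question",
    input="Xiao Ming has 14 candies. He gives 5 to Xiao Hong and 4 to Xiao Wang. How many candies does he have left?"
)
print(pipe(prompt)[0]['generated_text'])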
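
By default the `text-generation` pipeline returns the prompt followed by the completion, so the printed text repeats the Alpaca template. A minimal sketch for keeping only the model's reply is shown below; `extract_response` is a hypothetical helper, and the sample string is made up for illustration rather than real model output.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(generated_text: str, marker: str = RESPONSE_MARKER) -> str:
    """Return the text after the last response marker, or the whole text if absent."""
    _, sep, tail = generated_text.rpartition(marker)
    return tail.strip() if sep else generated_text.strip()


# Made-up pipeline output, for illustration only.
sample = (
    "### Instruction:\nAnswer the math question\n\n"
    "### Input:\n14 - 5 - 4 = ?\n\n"
    "### Response:\n5"
)
print(extract_response(sample))
```

Passing `return_full_text=False` when calling the pipeline is another option that avoids the post-processing step.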

