# Llama 2 Chat
Notebook demonstrating how to use Llama 2 Chat to extract software responsibilities from requirement statements.
## Run
- Llama 2 70B Chat GGML: [download here](https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML) (version used: q5_K_M, `llama-2-70b-chat.ggmlv3.q5_K_M.bin`, which uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors and GGML_TYPE_Q5_K otherwise)
- Convert the GGML model to GGUF: clone and build [llama.cpp](https://github.com/ggerganov/llama.cpp), then run `py convert-llama-ggml-to-gguf.py -i {{input_file_path}} -o {{output_file_path}} --gqa 8 --eps 1e-5 -c 4096`
- (GPU only, uncomment on first use) Rebuild `llama-cpp-python` with CUDA support


In [1]:
#!CMAKE_ARGS="-DLLAMA_CUBLAS=1" FORCE_CMAKE=1 LLAMA_CUBLAS=1  pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose
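The GGML-to-GGUF conversion step listed under Run can also be scripted. The following is a minimal sketch, not part of the original notebook: the helper names `build_convert_command` and `run_conversion` and the example paths are hypothetical, while the script name and flags are the ones quoted in the conversion step.

```python
import subprocess
import sys


def build_convert_command(input_path, output_path, gqa=8, eps="1e-5",
                          context_length=4096,
                          script="convert-llama-ggml-to-gguf.py"):
    """Return the argument list for the GGML-to-GGUF conversion script.

    The defaults mirror the flags quoted in the Run section.
    """
    return [
        sys.executable, script,
        "-i", str(input_path),
        "-o", str(output_path),
        "--gqa", str(gqa),
        "--eps", str(eps),
        "-c", str(context_length),
    ]


def run_conversion(llama_cpp_dir, input_path, output_path):
    """Run the conversion from inside a llama.cpp checkout.

    Relative paths are resolved against llama_cpp_dir because of cwd.
    Raises subprocess.CalledProcessError if the script fails.
    """
    command = build_convert_command(input_path, output_path)
    subprocess.run(command, cwd=llama_cpp_dir, check=True)


# Hypothetical paths, shown only to illustrate the resulting command line.
print(" ".join(build_convert_command(
    "../model/llama-2-70b-chat.ggmlv3.q5_K_M.bin",
    "../model/gguf-model.bin",
)))
```

Building the argument list separately from running it makes the command easy to inspect before a long conversion is started.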

## Configure
Refer to the [docs](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md) to customize the values below.

In [4]:
MODEL_CONFIGURATION = {
    "embedding": False, # Embedding mode only.
    "f16_kv": True, # Use half-precision for key/value cache
    "last_n_tokens_size": 64, # Maximum number of tokens to keep in the last_n_tokens deque
    "logits_all": False, # Return logits for all tokens, not just the last token
    "lora_base": None, # Optional path to base model, useful if using a quantized base model and you want to apply LoRA to an f16 model
    "lora_path": None, # Path to a LoRA file to apply to the model
    "low_vram": False, # Use low VRAM mode
    "main_gpu": 0, # Main GPU to use
    "model_path": "../model/gguf-model.bin", # Replace with output_file_path from GGML to GGUF conversion
    "mul_mat_q": True, # if true, use experimental mul_mat_q kernels
    "n_batch": 512, # Maximum number of prompt tokens to batch together when calling llama_eval
    "n_ctx": 4096, # Maximum context size
    "n_gpu_layers": 1, # Number of layers to offload to GPU (-ngl). If -1, all layers are offloaded
    "n_threads": None, # Number of threads to use. If None, the number of threads is automatically determined
    "numa": False, # Enable NUMA support. (NOTE: The initial value of this parameter is used for the remainder of the program as this value is set in llama_backend_init)
    "rope_freq_base": 10000.0, # Base frequency for rope sampling
    "rope_freq_scale": 1.0, # Scale factor for rope sampling
    "use_mmap": True, # Use mmap if possible
    "use_mlock": True, # Force the system to keep the model in RAM
    "verbose": True, # Print verbose output to stderr
    "vocab_only": False # Only load the vocabulary, not the weights
}
          
INFERENCE_CONFIGURATION = {
    "echo": True, # Whether to echo the prompt
    "frequency_penalty": 0.0, # Repeat alpha frequency penalty
    #"grammar_path": "../../grammar/json.gbnf", # Path to the .gbnf file that defines formal grammars for constraining model outputs
    "logprobs": None, # The number of logprobs to return. If None, no logprobs are returned
    "max_tokens": 0, # The maximum number of tokens to generate. If max_tokens <= 0, the maximum number of tokens to generate is unlimited and depends on n_ctx
    "mirostat_mode": 0, # Enable Mirostat sampling, controlling perplexity during text generation (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
    "mirostat_tau": 5.0, # Set the Mirostat target entropy, parameter tau
    "mirostat_eta": 0.1, # Set the Mirostat learning rate, parameter eta 
    "n_parts": -1, # Number of parts to split the model into. If -1, the number of parts is automatically determined
    "presence_penalty": 0.0, # Repeat alpha presence penalty
    "repeat_penalty": 1.1, # The penalty to apply to repeated tokens
    "stop": [], # A list of strings to stop generation when encountered
    "stopping_criteria": None,
    "stream": True, # Whether to stream the results.
    "top_k": 40, # The top-k value to use for sampling
    "top_p": 0.95, # The top-p value to use for sampling
    "temperature": 0.0, # The sampling temperature, i.e. the 'randomness' of outputs; 0.0 is the minimum (least random)
    "tfs_z": 1.0, # Enable tail free sampling with parameter z
}

SYSTEM_MESSAGE = "\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."

In [2]:
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

In [None]:
from langchain.llms import LlamaCpp

llm = LlamaCpp(callback_manager=callback_manager, **MODEL_CONFIGURATION, **INFERENCE_CONFIGURATION)

llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from ../model/gguf-model.bin (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q5_K     [  8192, 32000,     1,     1 ]
llama_model_loader: - tensor    1:               output_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor    2:                    output.weight q6_K     [  8192, 32000,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_q.weight q5_K     [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.attn_k.weight q5_K     [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor    5:              blk.0.attn_v.weight q6_K     [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor    6:         blk.0.attn_output.weight q5_K     [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor    7:           blk.0.attn_norm.weight f32      [  8192,     1,     1,     1 ]
lla

In [5]:
from langchain.prompts import (
    FewShotChatMessagePromptTemplate,
    ChatPromptTemplate,
)

instruction = """
Identify from the following requirement the responsibilities. To identify responsibilities use the following rules:
Rules:
A - A responsibility is a verb in base form, 3rd person singular present, or non-3rd person singular present, gerund form, which is associated to a direct object and optionally to the subject performing the action indicated by the verb.
B - A responsibility is a verb associated to a direct object and optionally to the subject performing the action indicated by the verb, when a verb is actually a phrasal verb modified by preposition, or when using passive voice structures in which the action receives more attention than the subject that performs it; thus, it is rather common that these structures lack the subject of the action.
C - A responsibility is when a verb is actually a phrasal verb modified by preposition"
Requirement: "{0}"
"""

examples = [
    {"input": instruction.format("If a student accesses the system then the student can add a new course or drop an added course."), "output": "access system, add course, drop course"},
    {"input": instruction.format("After the student adds a course, the system sends the transaction information to the billing system."), "output": "add course, send information to billing system"}
]

example_prompt = ChatPromptTemplate.from_messages([
    ("user", "{input}"),
    ("assistant", "{output}"),
])

few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples,
)

final_prompt = ChatPromptTemplate.from_messages([
    ("system", SYSTEM_MESSAGE),
    few_shot_prompt,
    ("user", "{input}")
])
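Because `LlamaCpp` is a plain-text completion wrapper, the chat messages assembled above have to be rendered into one string before they reach the model. The sketch below approximates that rendering in pure Python, under the assumption that each message becomes a `Role: content` line with the role labels `System`, `Human`, and `AI`; the function `flatten_messages` and the shortened message texts are hypothetical and not part of LangChain or of this notebook.

```python
# Assumed mapping from chat roles to the labels used in the rendered prompt.
ROLE_PREFIX = {"system": "System", "user": "Human", "assistant": "AI"}


def flatten_messages(messages):
    """Join (role, content) pairs into a single prompt string."""
    return "\n".join(
        f"{ROLE_PREFIX[role]}: {content}" for role, content in messages
    )


# Shortened, hypothetical stand-ins for the system message, one few-shot
# example pair, and the final user input.
messages = [
    ("system", "You are a helpful assistant."),
    ("user", "Requirement: \"The student adds a course.\""),
    ("assistant", "add course"),
    ("user", "Requirement: \"The student drops a course.\""),
]
print(flatten_messages(messages))
```

Under this assumption the model sees a transcript whose last line is a `Human:` turn, which would be consistent with completions that begin with `AI:`.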

In [6]:
chain = final_prompt | llm

In [7]:
chain.invoke({"input": "The student can click on the list_courses button for listing its added courses."})



AI: 
list courses


llama_print_timings:        load time = 38240.44 ms
llama_print_timings:      sample time =     6.08 ms /     9 runs   (    0.68 ms per token,  1480.26 tokens per second)
llama_print_timings: prompt eval time = 51089.22 ms /   588 tokens (   86.89 ms per token,    11.51 tokens per second)
llama_print_timings:        eval time = 12840.74 ms /     8 runs   ( 1605.09 ms per token,     0.62 tokens per second)
llama_print_timings:       total time = 63981.24 ms


'\n\nAI: \nlist courses'
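The chain returns a raw string that still carries leading newlines and an `AI:` label. A minimal post-processing sketch, assuming the model keeps answering with a comma-separated list as in the few-shot examples, could look like this; `parse_responsibilities` is a hypothetical helper, not part of the notebook.

```python
def parse_responsibilities(raw_output):
    """Turn the raw completion into a list of responsibility strings."""
    text = raw_output.strip()
    # Drop the role label that precedes the answer, if present.
    if text.startswith("AI:"):
        text = text[len("AI:"):]
    # Responsibilities are assumed to be separated by commas.
    return [item.strip() for item in text.split(",") if item.strip()]


print(parse_responsibilities("\n\nAI: \nlist courses"))
```

If the model deviates from the comma-separated format, the function returns the whole answer as a single item, so downstream code may want to validate the result.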

In [8]:
chain.invoke({"input": "When the student drops a course, a confirmation dialog is displayed on the screen."})

Llama.generate: prefix-match hit




AI: drop course, display confirmation dialog


llama_print_timings:        load time = 38240.44 ms
llama_print_timings:      sample time =     7.83 ms /    12 runs   (    0.65 ms per token,  1531.98 tokens per second)
llama_print_timings: prompt eval time = 24339.85 ms /    18 tokens ( 1352.21 ms per token,     0.74 tokens per second)
llama_print_timings:        eval time = 17496.16 ms /    11 runs   ( 1590.56 ms per token,     0.63 tokens per second)
llama_print_timings:       total time = 41898.92 ms


'\n\nAI: drop course, display confirmation dialog'
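The tokens-per-second figures in the `llama_print_timings` lines are the token (or run) count divided by the elapsed time in seconds. The sketch below recomputes them from the log text of the first call; the regular expression and the `throughput` helper are hypothetical and only assume the line layout shown in this notebook.

```python
import re

# Matches timing lines that report both a duration and a token/run count.
TIMING_PATTERN = re.compile(
    r"llama_print_timings:\s+(?P<stage>[a-z ]+?) time =\s+(?P<ms>[\d.]+) ms"
    r" /\s+(?P<count>\d+) (?:runs|tokens)"
)

# Timing lines copied from the first chain.invoke call above.
LOG = """
llama_print_timings:        load time = 38240.44 ms
llama_print_timings:      sample time =     6.08 ms /     9 runs   (    0.68 ms per token,  1480.26 tokens per second)
llama_print_timings: prompt eval time = 51089.22 ms /   588 tokens (   86.89 ms per token,    11.51 tokens per second)
llama_print_timings:        eval time = 12840.74 ms /     8 runs   ( 1605.09 ms per token,     0.62 tokens per second)
llama_print_timings:       total time = 63981.24 ms
"""


def throughput(log_text):
    """Return tokens per second for each stage that reports a count."""
    rates = {}
    for match in TIMING_PATTERN.finditer(log_text):
        seconds = float(match["ms"]) / 1000.0
        rates[match["stage"]] = int(match["count"]) / seconds
    return rates


for stage, rate in throughput(LOG).items():
    print(f"{stage}: {rate:.2f} tokens per second")
```

In the second call the prompt evaluation covers only 18 tokens instead of 588. Together with the `prefix-match hit` message, this suggests that the shared system message and few-shot examples were reused from the previous call rather than evaluated again.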