# Use Llama 2 13B Locally

## Ways to improve results
- load
- split
    - Explore context-aware splitters, which keep the location ("context") of each split in the original document.
    - Text splitting by header: https://python.langchain.com/docs/use_cases/question_answering/document-context-aware-QA
    - Documents can be filtered during vector store retrieval using metadata filters.
- store
- retrieve
    - MultiQueryRetriever generates variants of the input question to improve retrieval hit rate.
    - MultiVectorRetriever instead generates variants of the embeddings, also to improve retrieval hit rate.
    - Max marginal relevance selects for relevance and diversity among the retrieved documents to avoid passing in duplicate context.
    - Integrations with retrieval services: https://python.langchain.com/docs/integrations/retrievers/
- generate
    - Choosing LLMs
    - prompt engineering
    - Adding memory (https://python.langchain.com/docs/use_cases/question_answering/#adding-memory)
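
To make the max marginal relevance item in the list above concrete, here is a minimal plain-Python sketch of the selection rule. It is not LangChain's implementation. The function names, the toy two-dimensional vectors, and the `lambda_mult` weighting are assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalty: similarity to the closest document that was already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 1.0]
docs = [
    [1.0, 0.9],  # highly relevant
    [1.0, 0.9],  # exact duplicate of the first document
    [0.2, 1.0],  # less relevant, but different
]
picked = mmr_select(query, docs, k=2, lambda_mult=0.5)
print(picked)  # [0, 2]
```

A plain top-2 by relevance would return the two duplicates (indices 0 and 1). The redundancy penalty swaps the duplicate for the different document. With a LangChain vector store, this behaviour is typically requested with `vectorstore.as_retriever(search_type="mmr")`.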

In [1]:
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp

In [2]:
model_path = "/Users/cintiaching/Library/Caches/llama_index/models/llama-2-13b-chat.Q5_K_M.gguf"

In [3]:
n_gpu_layers = 1000
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# LlamaCpp loads the model from a local path. If the model is not stored locally, download it first from https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q5_K_M.gguf
# https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama
llm = LlamaCpp(
    model_path=model_path,
    n_gpu_layers=n_gpu_layers,  # Number of layers to offload to the GPU (-ngl). If -1, all layers are offloaded.
    n_ctx=2048,  # Context window size in tokens, shared by the prompt and the response.
    f16_kv=True,  # Use half precision for the key/value cache. Must be True, otherwise problems appear after a couple of calls.
    callback_manager=callback_manager,
    # seed=1, # Random seed. -1 for random.
    verbose=False,
)

llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /Users/cintiaching/Library/Caches/llama_index/models/llama-2-13b-chat.Q5_K_M.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q5_K     [  5120, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q5_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q5_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q5_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.w

In [4]:
# response = llm.generate(["What can you tell me about the TV series Modern Family? Answer in fewer than 150 words."])

## RAG

In [5]:
from langchain.document_loaders import UnstructuredWordDocumentLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import GPT4AllEmbeddings

from langchain.prompts import PromptTemplate

from langchain import hub
from langchain.chains.question_answering import load_qa_chain

from langchain.chains import RetrievalQA

In [7]:
# load document
loader = UnstructuredWordDocumentLoader(
    "../data/New Staff Handbook Q&A.docx", strategy="fast",
)
docs = loader.load()

# Split the Document into chunks for embedding and vector storage.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=20)
all_splits = text_splitter.split_documents(docs)

# store the splits to look up later
vectorstore = Chroma.from_documents(documents=all_splits, embedding=GPT4AllEmbeddings())
retriever = vectorstore.as_retriever()

objc[3765]: Class GGMLMetalClass is implemented in both /Users/cintiaching/PycharmProjects/in-context-chatbot/venv/lib/python3.10/site-packages/llama_cpp/libllama.dylib (0x10d91c228) and /Users/cintiaching/PycharmProjects/in-context-chatbot/venv/lib/python3.10/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libllamamodel-mainline-metal.dylib (0x175c441d0). One of the two will be used. Which one is undefined.


bert_load_from_file: gguf version     = 2
bert_load_from_file: gguf alignment   = 32
bert_load_from_file: gguf data offset = 695552
bert_load_from_file: model name           = BERT
bert_load_from_file: model architecture   = bert
bert_load_from_file: model file type      = 1
bert_load_from_file: bert tokenizer vocab = 30522
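
To make the `chunk_size` and `chunk_overlap` arguments used in the cell above concrete, here is a simplified plain-Python sketch of overlapping chunks. It is an assumption-laden illustration, not how `RecursiveCharacterTextSplitter` works internally: the real splitter breaks on separators such as paragraphs, lines, and spaces and treats `chunk_size` as an upper bound, while this hypothetical `split_with_overlap` helper just slides a fixed character window.

```python
def split_with_overlap(text, chunk_size=400, chunk_overlap=20):
    """Cut text into character windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# 1000 characters of synthetic text, so the chunk boundaries are easy to inspect.
sample = "".join(str(i % 10) for i in range(1000))
chunks = split_with_overlap(sample, chunk_size=400, chunk_overlap=20)

print([len(c) for c in chunks])            # [400, 400, 240]
print(chunks[0][-20:] == chunks[1][:20])   # True: the last 20 characters are repeated
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of embedding some text twice.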


In [None]:
all_splits[1]

In [7]:
# Prompt
# Llama 2 chat format: the system prompt sits between <<SYS>> and <</SYS>>, inside [INST] ... [/INST].
prompt = PromptTemplate.from_template(
    "[INST]<<SYS>> You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. Don't try to make up an answer. "
    "Keep the answer concise, and try to use exact wording from the context that is relevant."
    "<</SYS>> \nQuestion: {question} \nContext: {context} \nAnswer: [/INST]"
)
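
The `<</SYS>>` and `[/INST]` markers in the template above are the closing halves of the Llama 2 chat format, which pairs them with an opening `[INST]` and `<<SYS>>`. The following is a minimal sketch of a helper that assembles the full wrapper. The name `build_llama2_prompt` is hypothetical, and the beginning-of-sequence token is assumed to be added by the tokenizer.

```python
def build_llama2_prompt(system_prompt, user_message):
    """Wrap a system prompt and a user message in the Llama 2 chat markers."""
    return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"


# The {question} and {context} placeholders are left untouched so that the
# result can still be used as a prompt template.
template = build_llama2_prompt(
    "You are an assistant for question-answering tasks.",
    "Question: {question}\nContext: {context}\nAnswer:",
)
print(template)
```

The resulting string can be passed to `PromptTemplate.from_template` in the same way as the hand-written template.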

In [8]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
)

In [9]:
def ask(question):
    # The answer is streamed to stdout by StreamingStdOutCallbackHandler, so nothing is returned.
    qa_chain({"query": question})

In [10]:
ask("why is there a new staff handbook")

  Sure, I'd be happy to help! Based on the context you provided, the answer to the question "Why is there a new staff handbook?" is:

The company is launching a new staff handbook to ensure consistency and alignment in terms of main policies and procedures across the group, including Lane Crawford, Lane Crawford Joyce Group, Joyce, and ImagineX.

Please let me know if you need any further assistance!

In [11]:
ask("difference between new staff handbook and the old one?")

  Sure! Based on the provided context, here's the answer to your question:

The main differences between the new staff handbook and the old one are:

1. Consistency and alignment: The new handbook has been updated and aligned across the group, ensuring consistency and alignment in terms of main policies and procedures.
2. Coverage: Part A of the handbook sets out policies and procedures specific to employees in Hong Kong, while Part B covers policies and procedures applicable to all employing companies within the Lane Crawford Joyce Group in Hong Kong.
3. Staff Declarations: The new handbook introduces the Staff Declarations: Disclosure of Outside Work and Conflicts of Interest, which restricts staff from taking on certain work for outside enterprises or individuals without LCJG's consent.

In [12]:
ask("when to accept the new handbook")

  Sure, I can help you with that! Based on the provided context, here are the answers to your questions:

1. When to accept the new staff handbook?
Answer: All employees, including full-time and part-time staff, must read, understand, and acknowledge the new staff handbook before passing their probation period.
2. What are the consequences if staff do not acknowledge the staff handbook?
Answer: Failure to acknowledge the staff handbook will result in delaying the probation end date.
3. Who should staff reach out to if there are questions?
Answer: All employees should reach out to their supervisors or HR representatives if they have any questions regarding the staff handbook.
4. How often would the staff handbook be renewed?
Answer: The Staff Handbook will be reviewed on a regular basis and updated as needed, with all changes communicated to all employees.
5. What is the Company's policy on dress code?
Answer: The Company does not have a specific policy on dress code, but staff are expe

In [13]:
ask("when is the deadline to accept the new handbook")

  Sure, I'd be happy to help! Based on the context you provided, the deadline to accept the new staff handbook is October 31st, 2023.

In [14]:
ask("how to accept the new handbook")

  Sure, I'd be happy to help! To acknowledge the new staff handbook, employees should read, understand, and acknowledge the handbook once they have joined the company. Failure to do so may result in delaying their probation end date. If staff have any questions, they should reach out to their supervisor or HR representative for assistance.

In [15]:
ask("How to acknowledge new staff handbook?")

  Sure, I can help you with that! Based on the provided context, all new staff members are required to acknowledge the Staff Handbook. This is a pre-requisite for passing probation, and failure to do so may result in delaying the probation end date. Therefore, all new staff should read, understand, and acknowledge the Staff Handbook as soon as possible after joining the company.

In [16]:
ask("do staffs get free tickets to taylor swift concert")

  Based on the provided context, here are the answers to the questions:

1. Do staff get free tickets to Taylor Swift concerts?
No, there is no mention of free Taylor Swift concert tickets in the provided context.
2. Why are staff required to acknowledge the staff handbook?
Staff are required to acknowledge the staff handbook to understand the policies and procedures that apply to their employment with the company.
3. Who needs to acknowledge the staff handbook?
All employees of the Company in Hong Kong are required to acknowledge the staff handbook.
4. How to download a copy of the staff handbook?
Employees can download a copy of the staff handbook from the staff handbook portal (https://portal.lcjgroup.com/staff_handbook/main/login.aspx).
5. Where can I find medical insurance benefit?
Medical insurance benefits can be found on SAP by clicking "Benefit" on the SAP home page and viewing the latest Medical Insurance Plan, Forms, Panel Network List etc.
6. Where can I find the Dental Ins

## Wrap in a function and test with another document

In [18]:
from langchain.document_loaders import PyPDFLoader

In [19]:

def init_qa_chain(path, chunk_size=500, chunk_overlap=20, prompt=None, doc_type="docx"):
    # load document
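    if doc_type == "docx":
        loader = UnstructuredWordDocumentLoader(
            path, strategy="fast",
        )
    elif doc_type == "pdf":
        # loader = UnstructuredPDFLoader(path)
        loader = PyPDFLoader(path)
    else:
        # Fail early instead of hitting an unbound `loader` below.
        raise ValueError(f"Unsupported doc_type: {doc_type!r}")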
        
    docs = loader.load()
    
    # Split the Document into chunks for embedding and vector storage.
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    all_splits = text_splitter.split_documents(docs)
    
    # store the splits to look up later
    vectorstore = Chroma.from_documents(documents=all_splits, embedding=GPT4AllEmbeddings())
    retriever = vectorstore.as_retriever()
    
    # Prompt
    if prompt is None:
        # Llama 2 chat format: the system prompt sits between <<SYS>> and <</SYS>>, inside [INST] ... [/INST].
        prompt = PromptTemplate.from_template(
            "[INST]<<SYS>> You are an assistant for question-answering tasks. "
            "Use the following pieces of retrieved context to answer the question. "
            "If you don't know the answer, just say that you don't know. Don't try to make up an answer. "
            "Use three sentences maximum and keep the answer concise, and try to use exact wording from the context that is relevant."
            "<</SYS>> \nQuestion: {question} \nContext: {context} \nAnswer: [/INST]"
        )
    else:
        prompt = PromptTemplate.from_template(prompt)
    
    qa_chain = RetrievalQA.from_chain_type(
        llm,
        retriever=retriever,
        chain_type_kwargs={"prompt": prompt},
    )
    return qa_chain


def ask(question, qa_chain):
    # The answer is streamed to stdout by StreamingStdOutCallbackHandler, so nothing is returned.
    qa_chain({"query": question})
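
Because `ask` only triggers the streaming callback, the answer text is not available for later use. A minimal sketch of a variant that returns it is shown here. The name `ask_and_return` is hypothetical, the `"result"` key is assumed to be the chain's output key (the `RetrievalQA` default), and `EchoChain` is a stand-in used only to keep the example runnable without a model.

```python
def ask_and_return(question, qa_chain):
    """Run the chain and hand back the answer text instead of discarding it."""
    response = qa_chain({"query": question})
    return response["result"]


class EchoChain:
    """Stand-in for a RetrievalQA chain: callable, returns a dict with a 'result' key."""

    def __call__(self, inputs):
        return {"query": inputs["query"], "result": f"(stub answer to: {inputs['query']})"}


answer = ask_and_return("tell me about sick leave", EchoChain())
print(answer)  # (stub answer to: tell me about sick leave)
```

Returning the text makes it possible to collect answers for several questions and compare them across chunk sizes or prompts.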
    

In [21]:
path = "./data/Employment+Handbook_0000020125.pdf"
qa_chain_full_book = init_qa_chain(
    path, 
    chunk_size=500, 
    chunk_overlap=20, 
    prompt=None, 
    doc_type="pdf",
)

bert_load_from_file: gguf version     = 2
bert_load_from_file: gguf alignment   = 32
bert_load_from_file: gguf data offset = 695552
bert_load_from_file: model name           = BERT
bert_load_from_file: model architecture   = bert
bert_load_from_file: model file type      = 1
bert_load_from_file: bert tokenizer vocab = 30522


In [22]:
ask("tell me about sick leave", qa_chain_full_book)

  Sure, I'd be happy to help! Based on the context you provided, here is the answer to your question:

Sick leave is granted under the following criteria: frontline employees who are unable to attend work should inform their Line Manager, and office employees should inform their Line Manager by phone or email within 30 minutes of the start of their working day. For the first three months of employment, paid sick leave will be granted in accordance with the provisions of the labor legislation currently in force. From the fourth month onward, full paid sick leave will be granted for the entire period taken, provided that the number of sickness days is within the accumulated sickness allowance under the discretion of brand heads.

## Use MultiVectorRetriever
Whereas MultiQueryRetriever generates variants of the input question, MultiVectorRetriever generates variants of the embeddings, also to improve retrieval hit rate.

https://python.langchain.com/docs/use_cases/question_answering/#go-deeper-3
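
As a rough illustration of the idea, the sketch below indexes several small vectors per document and returns the whole parent document when any of them matches. It is a toy under stated assumptions, not the LangChain API: the bag-of-words `embed` function stands in for a real embedding model, and the sample texts and the `retrieve_parent` helper are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (a stand-in for a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


parents = {
    "leave": "Sick leave: inform your line manager. Annual leave: apply in advance.",
    "dress": "Dress code: smart casual in the office. Uniform on the shop floor.",
}

# Several small vectors per parent document, each tagged with the parent's id.
index = []
for parent_id, text in parents.items():
    for sentence in text.split(". "):
        index.append((embed(sentence), parent_id))


def retrieve_parent(query):
    """Match the query against the child vectors, then return the full parent text."""
    query_vec = embed(query)
    _, best_parent = max(index, key=lambda item: cosine(query_vec, item[0]))
    return parents[best_parent]


print(retrieve_parent("sick leave"))
```

Matching on small pieces keeps the similarity search precise, while returning the parent gives the LLM more surrounding context than the matched sentence alone.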

In [None]:
# TBC

## Text splitting by header
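
A minimal plain-Python sketch of the idea follows: split a markdown document at its headers and attach the enclosing headers to each chunk as metadata, so that retrieved text keeps its place in the document. LangChain provides a `MarkdownHeaderTextSplitter` for this purpose. The hypothetical `split_by_header` function and the sample document below only illustrate the principle and are not that class's implementation.

```python
import re


def split_by_header(markdown_text):
    """Split markdown into sections, attaching the enclosing headers as metadata."""
    sections = []
    header_path = {}  # header level -> header title currently in effect
    buffer = []

    def flush():
        content = "\n".join(buffer).strip()
        if content:
            sections.append({"content": content, "metadata": dict(header_path)})
        buffer.clear()

    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level.
            for stale in [lvl for lvl in header_path if lvl >= level]:
                del header_path[stale]
            header_path[level] = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return sections


doc = """# Handbook
Intro text.
## Leave
Sick leave rules.
## Dress code
Smart casual.
"""
for section in split_by_header(doc):
    print(section)
```

Each section carries metadata such as `{1: "Handbook", 2: "Leave"}`, which can later be used for metadata filtering during retrieval.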