### Code base interaction chat w/CodeLlama 13B Instruct
- similar to QA over docs
- different splitting strategy for code vs docs
- each top level function in the code is loaded as a doc
- each top level class is loaded as a doc
- any remaining top level code loaded as a doc

In [5]:
import dotenv
from langchain.text_splitter import Language  
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import LanguageParser
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chains.question_answering import load_qa_chain

In [6]:
# Load the .env file containing the API key
dotenv.load_dotenv()

True

In [10]:
# I have cloned the langchain repo locally 
repo_path = '/Users/laceymorgan/Desktop/code_llama_repo_chat/Langchain/'

# Load
loader = GenericLoader.from_filesystem(repo_path+'Libs/langchain/langchain',
        glob = "**/*",
        suffixes = [".py"],
        parser = LanguageParser(language=Language.PYTHON, parser_threshold=0.5))
    
documents = loader.load()
len(documents)

3062

### Split
Spilte the documents into chucks for embedding and vector storage using RecursiveCharacterTextSplitter 

In [11]:
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=2000,
    chunk_overlap=200)

texts = python_splitter.split_documents(documents)
len(texts)

5139

In [12]:
texts[0]

Document(page_content='"""For backwards compatibility."""\nfrom langchain.utilities.serpapi import SerpAPIWrapper\n\n__all__ = ["SerpAPIWrapper"]', metadata={'source': '/Users/laceymorgan/Desktop/code_llama_repo_chat/Langchain/Libs/langchain/langchain/serpapi.py', 'content_type': 'simplified_code', 'language': <Language.PYTHON: 'python'>})

### RetrievalQA
We need to store the documents in a way that we can semantically search for their content. 

In [13]:
# Create a database for the vectors
db = Chroma.from_documents(texts, OpenAIEmbeddings(disallowed_special=()))
retriever = db.as_retriever(
    search_type='mmr', # using max marginal relevance - the chunks it retrieves should not belong to the same file
    search_kwargs={'k': 8},
)

### Using CodeLlama

In [14]:
from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.memory import ConversationSummaryMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

In [15]:
callbackmanager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = LlamaCpp(
    model_path='./model/codellama-13b-instruct.Q5_K_M.gguf',
    n_ctx=5000,
    max_tokens=5000,
    n_gpu_layers=1,
    n_batch=512,
    f16_kv=True, # Must be set to True, otherwise you will run tino problems after a couple of calls
    callback_manager=callbackmanager,
    verbose=True,
)

llama_model_loader: loaded meta data with 20 key-value pairs and 363 tensors from ./model/codellama-13b-instruct.Q5_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q5_K     [  5120, 32016,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q5_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q5_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q5_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q5_K     [  5120,  5120,

In [16]:
# Using the template langchain provides for the QA chain prompt, which produces so so results
# prompt
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say "I don't know", don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate(
    input_variables=["context","question"],
    template=template,
)

# Docs
question = "Is there a prompt template for evaluating a generative model?"
docs = retriever.get_relevant_documents(question)

In [17]:
# Chain
chain = load_qa_chain(llm, chain_type="stuff",prompt=QA_CHAIN_PROMPT)

# Run
chain({"input_documents":docs, "question":question}, return_only_outputs=True)

 Yes, use the EVAL_TEMPLATE."""


llama_print_timings:        load time = 17242.26 ms
llama_print_timings:      sample time =    12.11 ms /    14 runs   (    0.87 ms per token,  1155.78 tokens per second)
llama_print_timings: prompt eval time = 63922.82 ms /  2877 tokens (   22.22 ms per token,    45.01 tokens per second)
llama_print_timings:        eval time =  1407.84 ms /    13 runs   (  108.30 ms per token,     9.23 tokens per second)
llama_print_timings:       total time = 65393.84 ms


{'output_text': ' Yes, use the EVAL_TEMPLATE."""'}

In [18]:
# Using a llama template, works for codellama and llama2
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

system_prompt  = """You are a helpful assistant, you will use the provided context to 
answer the questions. Read the given context before answering questions and think
step by step. If you can not answer a user question based on the provided context,
inform the user. Do not use any other information for answering user."""

instruction = """"
Context: {context}
User: {question}"""

def prompt_format(instruction=instruction, system_prompt=system_prompt):
    SYSTEM_PROMPT= B_SYS + system_prompt + E_SYS
    prompt_template = B_INST + instruction + E_INST
    return prompt_template 

In [19]:
# prompt
template = prompt_format()

QA_CHAIN_PROMPT = PromptTemplate(
    input_variables=["context","question"],
    template=template,
)

# Docs
question = "What are the different chain_types that can be passed to load_qa_chain?"
docs = retriever.get_relevant_documents(question)

In [20]:
# chain 
chain = load_qa_chain(llm, chain_type="stuff",prompt=QA_CHAIN_PROMPT)
# Run
chain({"input_documents":docs, "question":question}, return_only_outputs=True)


Llama.generate: prefix-match hit


  The `load_qa_chain` function in LangChain allows you to specify a `chain_type` parameter, which determines what type of question-answering chain is loaded. Different `chain_type` values correspond to different types of chains that can be used for question-answering tasks.

Here are some examples of common `chain_type` values and the chains they represent:

* "stuff": This chain type corresponds to a basic question-answering chain that uses a simple prompt template to generate questions and answers. It is typically used as a starting point for more complex question-answering tasks.
* "qa-with-sources": This chain type corresponds to a question-answering chain that retrieves answers from a vector database, but also includes sources in the output. It is typically used when you want to retrieve answers and also provide information about the sources of those answers.
* "vector-db-qa": This chain type corresponds to a question-answering chain that uses a vector database to retrieve answers


llama_print_timings:        load time = 17242.26 ms
llama_print_timings:      sample time =   302.71 ms /   350 runs   (    0.86 ms per token,  1156.22 tokens per second)
llama_print_timings: prompt eval time = 37788.55 ms /  2349 tokens (   16.09 ms per token,    62.16 tokens per second)
llama_print_timings:        eval time = 36609.11 ms /   349 runs   (  104.90 ms per token,     9.53 tokens per second)
llama_print_timings:       total time = 75723.64 ms


{'output_text': '  The `load_qa_chain` function in LangChain allows you to specify a `chain_type` parameter, which determines what type of question-answering chain is loaded. Different `chain_type` values correspond to different types of chains that can be used for question-answering tasks.\n\nHere are some examples of common `chain_type` values and the chains they represent:\n\n* "stuff": This chain type corresponds to a basic question-answering chain that uses a simple prompt template to generate questions and answers. It is typically used as a starting point for more complex question-answering tasks.\n* "qa-with-sources": This chain type corresponds to a question-answering chain that retrieves answers from a vector database, but also includes sources in the output. It is typically used when you want to retrieve answers and also provide information about the sources of those answers.\n* "vector-db-qa": This chain type corresponds to a question-answering chain that uses a vector datab