<a href="https://colab.research.google.com/github/amitsangani/Llama-2/blob/main/Get_Building_The_Ultimate_Llama_Workshop__230826.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### **Presentation helper code**

In [None]:
import base64
from IPython.display import Image, display
import matplotlib.pyplot as plt

def mm(graph):
  graphbytes = graph.encode("ascii")
  base64_bytes = base64.b64encode(graphbytes)
  base64_string = base64_bytes.decode("ascii")
  display(Image(url="https://mermaid.ink/img/" + base64_string))

def llama2_family():
  mm("""
  graph LR;
      llama-2 --> llama-2-7b
      llama-2 --> llama-2-13b
      llama-2 --> llama-2-70b
      llama-2-7b --> llama-2-7b-chat
      llama-2-13b --> llama-2-13b-chat
      llama-2-70b --> llama-2-70b-chat
  """)

def apps_and_llms():
  mm("""
  graph LR;
    users --> apps
    apps --> frameworks
    frameworks --> platforms
    platforms --> models
  """)

import ipywidgets as widgets
from IPython.display import display, Markdown

# Create a password widget for entering the API key
API_KEY = widgets.Password(
    value='',
    placeholder='',
    description='API_KEY:',
    disabled=False
)

def md(t):
  display(Markdown(t))

def bot_arch():
  mm("""
  graph LR;
  user --> prompt
  prompt --> i_safety
  i_safety --> context
  context --> inference
  inference --> output
  output --> o_safety
  i_safety --> memory
  o_safety --> memory
  memory --> context
  o_safety --> user
  """)

def langchain_arch():
  mm("""
  graph LR;
      langchain --> vectorstore
      langchain --> prompts
      langchain --> agents
      langchain --> chains
      langchain --> llms
      langchain --> docloaders
  """)

def load_data_faiss_arch():
  mm("""
  graph LR;
      documents --> textsplitter
      textsplitter --> embeddings
      embeddings --> vectorstore
  """)

# **Get Building: The Ultimate Llama Workshop**

Our goal in this session is to provide a hands-on, engaging workshop that gives you the chance to build a custom AI chatbot using Llama, LangChain, agents, and tools. Learn how to incorporate prompt engineering and best practices for responsible AI development. Explore the cloud options that unlock seamless integration with Llama 2.

## **What is Llama 2?**

- State of the art open source large language models from Meta
- 3 model sizes: 7B, 13B, and 70B
- 2 variants on each size: pretrained and chat
- Trained on 2 trillion tokens
- Research and commercial use license
- Strong focus on responsible AI
- Industry partnerships
- https://ai.meta.com/llama/

## **The Llama 2 model family**

In [None]:
llama2_family()

## **Getting started:**

- Users interact with apps, websites, and services
- Apps and services can use frameworks to make large language model development easier
- Apps and frameworks connect to platforms which host models and provide inference APIs
- Models are given user input and respond


In [None]:
apps_and_llms()

## **Accessing Llama 2**

- Hosted API providers: [Bedrock](), [Replicate](), ...
- Hosted container providers: [Azure](), [SageMaker](), [Hugging Face](), ...
- Model downloads and self-hosting: [Meta](https://ai.meta.com/llama/), [Azure](), [Hugging Face](), ...

## **Choosing the right Llama 2**

- Larger models → more abilities / quality
- Smaller models → faster and cheaper to run
- Experimentation and evaluations to decide

## **Let's start building! Install dependencies**

In [None]:
!pip install -qU replicate langchain faiss-gpu sentence_transformers pdf2image pdfminer pdfminer.six

## **Load the Model**

1. Obtain Replicate API key → [here](https://replicate.com/account/api-tokens)
1. Find the model to use → we will use [`llama-2-13b-chat`](https://replicate.com/lucataco/llama-2-13b-chat)

### **Use your Replicate API_KEY**

In [None]:
#display(API_KEY)
# get a token: https://replicate.com/account
from getpass import getpass
import os

REPLICATE_API_TOKEN = getpass()
os.environ["REPLICATE_API_TOKEN"] = REPLICATE_API_TOKEN



## **Basic completion**

In [None]:
import replicate
model_name = "lucataco/llama-2-13b-chat:18f253bfce9f33fe67ba4f659232c509fbdfb5025e5dbe6027f72eeb91c8624b"
output = replicate.run(
    model_name,
    input={"prompt": "The color of the ocean is: "}
)
# replicate.run may return an iterator of text chunks, so join them before rendering
md("".join(output))

  Hello! I'm here to help answer your questions safely and helpfully. The color of the ocean is a complex and multifaceted topic, and there isn't a single definitive answer. The color of the ocean can vary depending on factors such as the depth of the water, the amount of sunlight it receives, and the presence of certain organisms or sediments.

In general, the ocean appears blue because of a phenomenon called scattering, where light waves are scattered in all directions by tiny particles in the water, such as salt and other impurities. This scattering effect gives the ocean its characteristic blue hue. However, the exact shade of blue can vary depending on the specific location and conditions.

I hope this helps! Is there anything else you would like to know?</s>

> **See responsible use guide page 15 for information on input safety**

## **System prompts and instruction**

- Language models love to complete text
- Instructions control the behavior of the LLM

In [None]:
output = replicate.run(
    model_name,
    input={"system_prompt": "respond with one word",
           "prompt": "The color of the ocean is: "}
)
# join the streamed chunks into one string before rendering
md("".join(output))

  Blue</s>

## **Response processing**

- Typical response is text
- LLM can be instructed to format (e.g. Markdown)
- LLM can be instructed to produce structured data (e.g. JSON or SQL)
- Advanced LLMs can be fine-tuned for accuracy (e.g. [Code Llama]())

> **Always check LLM output; see the Responsible Use Guide (RUG), page 16**
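
As a minimal sketch of checking structured output, the snippet below parses a JSON reply and rejects anything malformed. The helper name `parse_llm_json`, the required keys, and the sample replies are hypothetical; the `</s>` handling mirrors the end-of-sequence marker visible in the raw outputs in this notebook.

```python
import json

def parse_llm_json(raw_text, required_keys):
    """Parse a JSON object from raw LLM text; return None if it is unusable."""
    # Drop the end-of-sequence marker that appears in some raw outputs.
    cleaned = raw_text.strip()
    if cleaned.endswith("</s>"):
        cleaned = cleaned[: -len("</s>")].strip()
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    # Accept only a JSON object that carries every required key.
    if not isinstance(data, dict):
        return None
    if any(key not in data for key in required_keys):
        return None
    return data

# Hypothetical replies, standing in for real model output.
good_reply = '  {"color": "blue", "confidence": 0.9}</s>'
bad_reply = "Sure! The color is blue."

print(parse_llm_json(good_reply, ["color"]))
print(parse_llm_json(bad_reply, ["color"]))
```

Returning `None` instead of raising keeps the caller in control: it can retry the request, fall back to plain text, or show an error.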

## **Chatbot architecture**

- User input via prompts or UI generating prompts
- Input safety checks (RUG p-15)
- Inference to LLM
- LLM produces output → additional safety (RUG p-16)
- Input and output contribute to context

In [None]:
bot_arch()

In a chat app with multiple exchanges between a user and Llama, each user message must start with `[INST]` and end with `[/INST]`.



In [None]:
correct_prompt = """\
[INST] Hi! [/INST]
Hello! How are you?
[INST] I'm great, thanks for asking. Could you help me with a task? [/INST]
"""

incorrect_prompt = """\
User: Hi!
Assistant: Hello! How are you?
User: I'm great, thanks for asking. Could you help me with a task?
"""
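
The tagged format used in `correct_prompt` can be generated from a list of turns rather than written by hand. This is a minimal sketch; the helper name `build_chat_prompt` and the tuple layout are assumptions made for illustration, not part of any library.

```python
def build_chat_prompt(turns):
    """Format (user_message, assistant_reply) turns with [INST] tags.

    Use None as the reply for the final turn that the model should answer.
    """
    lines = []
    for user_message, assistant_reply in turns:
        lines.append(f"[INST] {user_message} [/INST]")
        if assistant_reply is not None:
            lines.append(assistant_reply)
    return "\n".join(lines) + "\n"

example_prompt = build_chat_prompt([
    ("Hi!", "Hello! How are you?"),
    ("I'm great, thanks for asking. Could you help me with a task?", None),
])
print(example_prompt)
```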

In [None]:
output = replicate.run(model_name, input={"prompt": incorrect_prompt, "system_prompt": ""})
''.join(output)

"  Sure thing! I'd be happy to help you with your task. What do you need help with? Please provide some more details or context so I can better understand what you need assistance with.</s>"

However, things start to go awry when the chat dialogue goes on longer: Llama starts responding with `Assistant:` prepended to every response. The Llama 2 chat models are trained to recognize the `[INST]` tags specifically.

In [None]:
incorrect_prompt_long = """\
User: Hi!
Assistant: Hello! How are you?
User: I'm great, thanks for asking. Could you help me with a task?
Assistant:  Sure thing! I'd be happy to assist you with your task. What do you need help with? Please provide some more details or context so I can better understand what you need and provide the best possible assistance.
User: How much wood could a wood chuck chuck or something like that?
"""

output = replicate.run(model_name,
            input={"prompt": incorrect_prompt_long, "system_prompt": ""}
         )
''.join(output)


"  Assistant: Well, the answer to that question is a bit of a tongue twister! A woodchuck would be able to chuck as much wood as a woodchuck could, chuck! However, if you're looking for a more specific answer, the amount of wood a woodchuck could chuck depends on various factors such as the size of the woodchuck and the type of wood being chucked. Generally speaking, a woodchuck could move about 10-20 pounds of wood per day. Is there anything else you'd like to know?</s>"

In [None]:
correct_prompt_long = """\
[INST] Hi! [/INST]
Hello! How are you?
[INST]  I'm great, thanks for asking. Could you help me with a task? [/INST]
Of course, I'd be happy to help! Can you please provide more details about the task you need assistance with, such as its purpose and any requirements or constraints you have? This will help me better understand how I can assist you. Additionally, if you have any specific questions or concerns, feel free to ask and I'll do my best to address them.
[INST] How much wood could a wood chuck chuck or something like that? [/INST]
"""

In [None]:
output = replicate.run(model_name,
            input={"prompt": correct_prompt_long, "system_prompt": ""}
         )
''.join(output)

'\nHello! How are you?\n[INST]  I\'m great, thanks for asking. Could you help me with a task? [/INST]\nOf course, I\'d be happy to help! Can you please provide more details about the task you need assistance with, such as its purpose and any requirements or constraints you have? This will help me better understand how I can assist you. Additionally, if you have any specific questions or concerns, feel free to ask and I\'ll do my best to address them.\n[INST] How much wood could a wood chuck chuck or something like that? [/INST]\n[/INST]  Oh my gosh, that\'s a classic tongue twister! The answer to that question is a bit of a trick question, though. Woodchucks, also known as groundhogs, don\'t actually "chuck" wood. They are burrowing animals that live in underground tunnels and dens. So, they don\'t have the ability to chuck wood, nor would they need to! However, if you\'re looking for an answer to a hypothetical question, I suppose we could say that a woodchuck might be able to move a 

## **Introducing LangChain**

#### **TODO**:
- [ ] LangChain makes all of the above easier

### **LangChain architecture**

In [None]:
langchain_arch()

### **LangChain setup**

#### **TODO**:
- [ ] Load LLM
- [ ] Load data

In [None]:
from langchain.llms import Replicate
from langchain import PromptTemplate, LLMChain

In [None]:
# call the model using Replicate
llama_model = Replicate(
    model="lucataco/llama-2-13b-chat:18f253bfce9f33fe67ba4f659232c509fbdfb5025e5dbe6027f72eeb91c8624b",
    input={"temperature": 0.75, "max_length": 500, "top_p": 1},
)

In [None]:
# prompting the model
prompt = """
Answer the following yes/no question by reasoning step by step.
Can a dog drive a car?
"""
llama_model(prompt)

"  No, a dog cannot drive a car. Here's my reasoning:\n\n1. Dogs do not have the physical ability to operate a vehicle. They do not have human-like hands and feet to reach the pedals and steering wheel, and their sensory abilities are not capable of processing the complex information required to drive a car safely.\n2. Dogs do not possess the cognitive abilities to understand the concept of driving a car. They do not have the mental capacity to comprehend the rules of the road, traffic signals, or the mechanics of a vehicle.\n3. Additionally, dogs are not capable of communicating effectively with humans through spoken language, which is essential for safe and efficient driving.\n\nTherefore, based on these reasons, it is not possible for a dog to drive a car.</s>"

In [None]:
load_data_faiss_arch()

In [None]:
# Step 1: load the document(s)
from langchain.document_loaders import OnlinePDFLoader
loader = OnlinePDFLoader("https://ai.meta.com/static-resource/responsible-use-guide/")
documents = loader.load()

In [None]:
# Step 2: Get text splits from document
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)
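
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunking. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter tries natural separators such as paragraph breaks first); the function `split_with_overlap` is a hypothetical fixed-window version that only shows how consecutive chunks share text.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.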

In [None]:
# Step 3: Get the embeddings
from langchain.vectorstores import FAISS
# "sentence-transformers/all-mpnet-base-v2" is the embedding model we are using
# all-mpnet-base-v2 maps sentences & paragraphs to a 768 dimensional dense vector space
from langchain.embeddings import HuggingFaceEmbeddings

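# use a separate name so the Replicate `model_name` defined earlier is not overwritten
embedding_model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}  # requires a GPU runtime
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name, model_kwargs=model_kwargs)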


In [None]:
# Step 4: Use vector store to store embeddings
vectorstore = FAISS.from_documents(all_splits, embeddings)
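
Conceptually, the vector store answers the question "which stored chunks are closest to this query vector?". The sketch below shows that idea with cosine similarity over made-up 3-dimensional vectors. It is a pure-Python illustration, not how FAISS is implemented, and the helper names and toy data are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, indexed, k=2):
    """Return the k texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Made-up 3-dimensional vectors standing in for real 768-dimensional embeddings.
toy_index = [
    ("open science and open source", [0.9, 0.1, 0.0]),
    ("input safety checks", [0.1, 0.9, 0.1]),
    ("model sizes and variants", [0.0, 0.2, 0.9]),
]
print(top_k([1.0, 0.0, 0.0], toy_index, k=1))  # ['open science and open source']
```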

### **Q&A Retriever using LangChain and Memory**

Initialize a `ConversationalRetrievalChain`. This chain gives you a chatbot with memory that relies on a vector store to find relevant information in your document.

You can also return the source documents used to answer each question by passing the optional parameter `return_source_documents=True` when constructing the chain.

In [None]:
# Q&A against your own data
from langchain.chains import ConversationalRetrievalChain
chain = ConversationalRetrievalChain.from_llm(llama_model, vectorstore.as_retriever(), return_source_documents=True)

In [None]:
chat_history = []

query = "How is Meta approaching open science?"
result = chain({"question": query, "chat_history": chat_history})

print(result['answer'])

  Based on the context provided, Meta is approaching open science by committing to open source code and datasets for machine translation, computer vision, and fairness evaluation. This move is aimed at democratizing access to AI technology and putting these models in more people's hands, which Meta believes is the right path to ensure that this technology benefits the world at large. Additionally, Meta is implementing safety measures to address context-specific risks, such as robustness and safety, privacy and security, and transparency and control. The company is also encouraging developers to consider the use of system cards to provide insight into their AI system's underlying architecture and explain how a particular AI output is generated. Overall, Meta's approach to open science is focused on promoting collaboration, transparency, and responsible innovation in the field of AI.</s>


In [None]:
# Include the previous question and answer as chat history so that follow-up questions can refer to them.
chat_history = [(query, result["answer"])]

query = "How is it benefiting the world?"
result = chain({"question": query, "chat_history": chat_history})

print(result['answer'])

  Hello! I'm here to assist you with your question. Meta's approach to open science is aimed at democratizing access to advanced AI models and promoting responsible innovation. By open-sourcing code and datasets, the company is enabling developers worldwide to contribute to the infrastructure of the AI-developer community with tools like PyTorch. This approach is expected to bring several benefits, including:

1. Accelerated technological advancement: With more people working together on AI research and development, the pace of innovation is likely to quicken.
2. Economic growth: As the technology becomes more accessible, it can potentially lead to new products and solutions, driving economic growth and creating new opportunities for businesses of all sizes.
3. Improved scientific discovery: The open-source nature of the technology can facilitate collaboration among researchers, leading to breakthroughs in various fields, such as education, agriculture, climate management, and cybersec
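
The two cells above rebuild `chat_history` by hand. For longer conversations, the bookkeeping can be wrapped in a small helper. This is a sketch under assumptions: `ask` and `fake_chain` are hypothetical names, and `fake_chain` is a stand-in that only mimics the input and output dictionary shape used above so the example runs without a model.

```python
def ask(chain_fn, question, chat_history):
    """Call the chain, then append the (question, answer) pair to the history."""
    result = chain_fn({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    return result["answer"]

def fake_chain(inputs):
    # Hypothetical stand-in for `chain`; it only reports how much history it saw.
    return {"answer": f"(saw {len(inputs['chat_history'])} earlier turns)"}

history = []
print(ask(fake_chain, "How is Meta approaching open science?", history))
print(ask(fake_chain, "How is it benefiting the world?", history))
```

With the real chain, a call such as `ask(chain, query, chat_history)` should take the place of the manual `chat_history = [(query, result["answer"])]` step.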

### **Chaining Calls**

In [None]:
prompt = PromptTemplate(
    input_variables=["product"],
    template="What is a good name for a company that makes {product}?",
)
chain = LLMChain(llm=llama_model, prompt=prompt)

In [None]:
second_prompt = PromptTemplate(
    input_variables=["company_name"],
    template="Write a description of a logo for this company: {company_name}",
)
chain_two = LLMChain(llm=llama_model, prompt=second_prompt)

In [None]:
from langchain.chains import SimpleSequentialChain
# Run the chain specifying only the input variable for the first chain.
overall_chain = SimpleSequentialChain(
    chains=[chain, chain_two], verbose=True
)
logo_description = overall_chain.run("VR Headsets")
print(logo_description)



> Entering new SimpleSequentialChain chain...
  Hello! I'm happy to help you come up with a name for your company that specializes in VR headsets! However, before we dive into suggestions, I want to make sure that the name we choose is safe, respectful, and appropriate.

To ensure this, I would like to point out that it's important to avoid names that could be considered offensive or hurtful to any particular group of people. Additionally, it's important to avoid names that could be perceived as misleading or inaccurate.

With those considerations in mind, here are some suggestions for a name for your company:

1. ImmerseTech - This name emphasizes the immersive experience that virtual reality provides, while also highlighting the technology behind it.
2. Realitech Reality - This name combines "real" and "tech" to convey a focus on realistic virtual experiences, while also incorporating the word "reality" to emphasize the authenticity of the technology.
3. MindSc
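
The idea behind a simple sequential chain is that each step's full output becomes the next step's only input. A minimal sketch of that data flow, using hypothetical stand-in functions (`name_step`, `logo_step`) instead of real LLM calls:

```python
def run_sequentially(steps, first_input):
    """Feed each step's output into the next step, like a simple pipeline."""
    value = first_input
    for step in steps:
        value = step(value)
    return value

# Hypothetical stand-ins for the two LLM calls: each fills its own template.
def name_step(product):
    return f"<name for a company that makes {product}>"

def logo_step(company_name):
    return f"<logo description for {company_name}>"

print(run_sequentially([name_step, logo_step], "VR Headsets"))
```

Because the whole first response is passed along, any preamble the model adds (such as the safety remarks in the output above) also ends up in the second prompt, so it can help to ask the first step for the name only.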

## **Few shot learning**

#### **TODO**:
- [ ] Show how bot can hallucinate
- [ ] Show example selection improves response

In [None]:
# Few shot learning code
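
One possible starting point for this section is a helper that assembles a few-shot prompt: an instruction, a handful of worked examples, and the new input. The function name, the `Input:`/`Output:` labels, and the sentiment examples are all hypothetical choices for illustration.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the new query into one prompt."""
    parts = [instruction, ""]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}")
        parts.append(f"Output: {example_output}")
        parts.append("")
    # Leave the final output empty so the model completes it.
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)

# Hypothetical sentiment-labelling examples.
examples = [
    ("I loved this workshop", "positive"),
    ("The demo kept crashing", "negative"),
]
few_shot_prompt = build_few_shot_prompt(
    "Classify the sentiment of each input as positive or negative.",
    examples,
    "The slides were clear and helpful",
)
print(few_shot_prompt)
```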

## **Input safety**

#### **TODO**:
- [ ] Reference RUG/15
- [ ] Talk about input safety options

In [None]:
# code to show one simple example of input safety (e.g. blocklist?)
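
A blocklist is about the simplest input check there is. The sketch below rejects input that contains any listed word; the terms in `BLOCKLIST` and the function name are hypothetical, and keyword matching alone is easy to evade, so treat this as a teaching example rather than a real safety layer.

```python
import re

BLOCKLIST = {"password", "ssn"}  # hypothetical terms, for illustration only

def is_input_allowed(user_text, blocklist=BLOCKLIST):
    """Return False when the input contains any blocklisted word."""
    # Lowercase and split into words so matching ignores case and punctuation.
    words = set(re.findall(r"[a-z0-9']+", user_text.lower()))
    return words.isdisjoint(blocklist)

print(is_input_allowed("What color is the ocean?"))     # True
print(is_input_allowed("Tell me your admin PASSWORD"))  # False
```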

## **Memory and context**

#### **TODO**:
- [ ] Context window
- [ ] Token limits and costs

In [None]:
# langchain ConversationBufferMemory
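
Because the context window is finite and hosted inference is typically billed by tokens, a long chat history eventually has to be trimmed. A minimal sketch of keeping only the most recent turns that fit a budget follows. The word-count token estimate is a rough assumption (real tokenizers count differently), and the helper names are hypothetical.

```python
def approx_tokens(text):
    # Rough assumption: one token per whitespace-separated word.
    return len(text.split())

def trim_history(chat_history, max_tokens):
    """Keep the most recent (question, answer) pairs that fit in the budget."""
    kept = []
    used = 0
    # Walk backwards from the newest turn and stop at the first one that does not fit.
    for question, answer in reversed(chat_history):
        cost = approx_tokens(question) + approx_tokens(answer)
        if used + cost > max_tokens:
            break
        kept.append((question, answer))
        used += cost
    kept.reverse()
    return kept

sample_history = [
    ("Hi!", "Hello! How are you?"),
    ("Can you help me?", "Of course, what do you need?"),
    ("How much wood could a woodchuck chuck?", "That is a classic tongue twister."),
]
print(trim_history(sample_history, max_tokens=20))
```

Dropping the oldest turns is the simplest policy; summarizing them instead is another option, listed under long-term memory in the "What next?" section.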

## **External data**

#### **TODO**:
- [ ] Explain retrieval-augmented generation
- [ ] Vector databases
- [ ] Embedding

In [None]:
# code to use FAISS to index llama2 website
# code to chunk the data
# code to create embeddings
# code to integrate with langchain and bot

## **Bot Interactions**

In [None]:
# prompt response list / demo

## **What next?**

#### **TODO**:
- [ ] Web UI (Streamlit et al.)
- [ ] Memory persistence
- [ ] Long term memory and summarization
- [ ] Industrial input and output safety
- [ ] Scale and deployment

## **Local Llama 2**

#### **TODO**:
- [ ] llama-cpp-python


In [None]:
# maybe we show it here?

## **Conclusions**

#### **TODO**:
- [ ] Llama 2 most powerful OSS model
- [ ] Llama 2 easy to use
- [ ] Llama 2 versatile
- [ ] Reminder about safety and RUG
- [ ] Can't wait to see what you all build

# **ARCHIVE BELOW. WILL MOVE CODE CONTENT UP**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## **Get a quantized Llama 13B model from Hugging Face**

To run on the Google Colab free tier, we need a quantized model. Quantized models available on Hugging Face let us run the model on a T4 GPU. We will use the [Llama-2-13B-chat-GGML model](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML), which is based on the GGML library, and run it with [llama.cpp](https://github.com/ggerganov/llama.cpp).

## **Install all the required packages**

In [None]:
# GPU llama-cpp-python
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose

In [None]:
# Download the models
!pip install huggingface_hub

In [None]:
# specify the model path
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGML"
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin" # the model is in bin format

## **Import the libraries and download the model**

In [None]:
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)


## **Load the model**

In [None]:
# GPU
lcpp_llm = None
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2, # CPU cores
    n_batch=512, # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    n_gpu_layers=32 # Change this value based on your model and your GPU VRAM pool.
    )

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 


In [None]:
# see the number of layers in GPU

lcpp_llm.params.n_gpu_layers


32

## **Create a prompt template**

In [None]:
prompt = "Tell me more about life and how to be successful"
prompt_template=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible.

USER: {prompt}

ASSISTANT:
'''

In [None]:
# simple utility to wrap text in colab before generating a response
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

## **Generate a response**

In [None]:
response=lcpp_llm(
    prompt=prompt_template,
    max_tokens=256,
    temperature=0.5,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=150,
    echo=True)

print(response["choices"][0]["text"])

## **Inference APIs**

In this section, we’ll go through different approaches to running inference with the Llama 2 models. Once you have received access to the models after signing up through Meta's official form, you will be ready to run inference on them.


## **Basic Chat Completion**

We will discuss the differences between pretrained and fine-tuned models and walk through the “chat completion” example to see the expected features and performance.

## **Enhance Chat Completion and Memory**

We will then enhance the chat completion example and discuss conversation context and short-term memory.

## **AI Chatbot and Langchain**

Convert your example above into an AI chatbot using LangChain, a popular framework for building applications with large language models.

## **Vector DB, Similarity Search (FAISS) and Embeddings**

In this section, we’ll go through different prompt engineering principles using vector databases and embeddings.



## **Q&A Retriever with LangChain**

Complete the AI chatbot with a Q&A retriever and walk through the entire codebase again.

## **Advanced chatbot concepts**

Explain advanced chatbot concepts (without code).

## **Responsible Use**

Safety and moderation. Refer to the Responsible Use Guide (RUG).

## **Llama deployments**

Finally, discuss how to deploy your AI chatbot for the world to use (including showing how this was done with llama-2-7b-chat on-device).
