# **Getting to know Llama 3: Everything you need to start building**
This session is a guided tour of Llama 3 and how it compares with Llama 2. It covers the Llama 3 model family, how and where to access the models, generative AI and chatbot architectures, prompt engineering, RAG (Retrieval Augmented Generation), fine-tuning, and more. Everything is implemented as starter code that you can take and use in your own Llama 3 projects.

### **0 - Prerequisites**
* Basic understanding of Large Language Models
* Basic understanding of Python

In [112]:
import base64
import logging
import os
from getpass import getpass

import replicate
from IPython.display import display, Image, Markdown, HTML
from groq import Groq
from langchain.chains import ConversationalRetrievalChain
from langchain.document_loaders import OnlinePDFLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import Replicate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

from chrisbase.data import *
from chrisbase.io import *
from chrisbase.util import *


def mm(graph):
    graphbytes = graph.encode("ascii")
    base64_bytes = base64.b64encode(graphbytes)
    base64_string = base64_bytes.decode("ascii")
    display(Image(url="https://mermaid.ink/img/" + base64_string))


def md(t):
    display(Markdown(t))


def hr(s="<hr style='border:3px solid MidnightBlue; width:880px; padding:0; margin:0; margin-bottom:15px; margin-top:10px;'>"):
    display(HTML(s))


def html(t,
         s="<pre style='font-family:Arial; font-size:12pt;"
           " padding:10px; margin-left:10px; margin-top:10px;"
           " background-color:LightSkyBlue; border:3px solid MidnightBlue;"
           " width:100%; height:100px;'>",
         e="</pre>"):
    display(HTML(s + t + e))


def llama3_family():
    mm("""
    graph LR;
        llama-3 --> llama-3-8b
        llama-3 --> llama-3-70b
        llama-3-8b --> llama-3-8b-instruct
        llama-3-70b --> llama-3-70b-instruct
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def genai_app_arch():
    mm("""
    flowchart TD
        A[Users] --> B(Applications e.g. mobile, web)
        B --> |Hosted API|C(Platforms e.g. Custom, HuggingFace, Replicate)
        B -- optional --> E(Frameworks e.g. LangChain)
        C-->|User Input|D[Llama 3]
        D-->|Model Output|C
        E --> C
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def bot_arch():
    mm("""
    graph LR;
        user --> prompt
        prompt --> i_safety
        i_safety --> context
        context --> Llama_3
        Llama_3 --> output
        output --> o_safety
        i_safety --> memory
        o_safety --> memory
        memory --> context
        o_safety --> user
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def mem_context():
    mm("""
    graph LR
        context(text)
        user_prompt --> context
        instruction --> context
        examples --> context
        memory --> context
        context --> tokenizer
        tokenizer --> embeddings
        embeddings --> LLM
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def rag_arch():
    mm("""
    flowchart TD
        A[User Prompts] --> B(Frameworks e.g. LangChain)
        B <--> |Database, Docs, XLS|C[fa:fa-database External Data]
        B -->|API|D[Llama 3]
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def load_data_faiss_arch():
    mm("""
    graph LR;
        documents --> textsplitter
        textsplitter --> embeddings
        embeddings --> vectorstore
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def fine_tuned_arch():
    mm("""
    graph LR;
        Custom_Dataset --> Pre-trained_Llama
        Pre-trained_Llama --> Fine-tuned_Llama
        Fine-tuned_Llama --> RLHF
        RLHF --> |Loss:Cross-Entropy|Fine-tuned_Llama
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)

### **1 - Understanding Llama 3**

### **1.1 - What is Llama 3?**

* State of the art (SOTA), Open Source LLM
* 8B, 70B
* Choosing model: Size, Quality, Cost, Speed
* Pretrained + Chat
* [Meta Llama 3 Blog](https://ai.meta.com/blog/meta-llama-3/)
* [Getting Started with Meta Llama](https://llama.meta.com/docs/get-started)

In [113]:
llama3_family()

### **1.2 - Accessing Llama 3**
* Download + Self Host (i.e. [download Llama](https://ai.meta.com/resources/models-and-libraries/llama-downloads))
* Hosted API Platform (e.g. [Groq](https://console.groq.com/), [Replicate](https://replicate.com/meta/meta-llama-3-8b-instruct), [Together](https://api.together.xyz/playground/language/meta-llama/Llama-3-8b-hf), [Anyscale](https://app.endpoints.anyscale.com/playground))
* Hosted Container Platform (e.g. [Azure](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/introducing-llama-2-on-azure/ba-p/3881233), [AWS](https://aws.amazon.com/blogs/machine-learning/llama-2-foundation-models-from-meta-are-now-available-in-amazon-sagemaker-jumpstart/), [GCP](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/139))

### **1.3 - Use Cases of Llama 3**
* Content Generation
* Summarization
* General Chatbots
* RAG (Retrieval Augmented Generation): Chat about Your Own Data
* Fine-tuning
* Agents

## **2 - Using and Comparing Llama 3 and Llama 2**

In this notebook, we will use the Llama 3 8b and 70b instruct models hosted on [Groq](https://console.groq.com/); the Llama 2 chat helpers defined in 2.2 call [Replicate](https://replicate.com/meta/) instead. To try Groq for free, first [sign in](https://console.groq.com/) with your GitHub or Gmail account, then get an [API token](https://console.groq.com/keys). (Groq runs Llama models very fast, and it supports only one Llama 2 model: Llama 2 70b chat.)

**Note: You can also use other Llama hosting providers such as [Replicate](https://replicate.com/blog/run-llama-3-with-an-api?input=python) or [Together](https://docs.together.ai/docs/quickstart). Click the links to see how to run `pip install` and use their free trial API keys, then use their example code to modify the following three cells in 2.1 and 2.2.**

### **2.1 - Install dependencies**

In [114]:
# Install dependencies and initialize
!pip list | grep -E "langchain|transformers|torch|faiss|groq|replicate|chris"

chrisbase                 0.5.2
faiss-cpu                 1.8.0
groq                      0.5.0
langchain                 0.1.17
langchain-community       0.0.36
langchain-core            0.1.48
langchain-text-splitters  0.0.1
replicate                 0.25.2
sentence-transformers     2.7.0
torch                     2.3.0
transformers              4.40.1


### **2.2 - Create helpers for Llama 2 and Llama 3**
First, set your Groq and Replicate API tokens as environment variables.

In [115]:
GROQ_API_TOKEN = read_or(first_path_or(".groq*.key")) or getpass()
REPLICATE_API_TOKEN = read_or(first_path_or(".replicate*.key")) or getpass()
os.environ["GROQ_API_KEY"] = GROQ_API_TOKEN
os.environ["REPLICATE_API_TOKEN"] = REPLICATE_API_TOKEN
print(f"Groq API Key: {mask_str(GROQ_API_TOKEN, start=4, end=-4)}")
print(f"Replicate API Key: {mask_str(REPLICATE_API_TOKEN, start=3, end=-3)}")

logging.getLogger("IPKernelApp").setLevel(logging.INFO)
logger = logging.getLogger(__name__)
args = CommonArguments(
    env=ProjectEnv(
        project="LLM-based",
        job_name="LLaMA-2-13B-Chat",
        msg_level=logging.INFO,
        msg_format=LoggingFormat.BRIEF_00,
    )
)
args.dataframe()

Groq API Key: gsk_************************************************016r
Replicate API Key: r8_**********************************MYA


| | CommonArguments | value |
|---|---|---|
| 0 | tag | |
| 1 | env.project | LLM-based |
| 2 | env.job_name | LLaMA-2-13B-Chat |
| 3 | env.job_version | |
| 4 | env.hostname | ChrisBookPro.local |
| 5 | env.hostaddr | 192.168.0.8 |
| 6 | env.time_stamp | 0502.172257 |
| 7 | env.python_path | /Users/chris/miniforge3/envs/LLM-based/bin/python3.11 |
| 8 | env.current_dir | /Users/chris/proj/LLM-based |
| 9 | env.current_file | /Users/chris/proj/LLM-based/Getting_to_know_Llama_3.ipynb |
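The token-loading cell above relies on helper functions from `chrisbase`. For readers without that package, here is a minimal standard-library sketch of the same idea: read a token from an environment variable, fall back to an interactive prompt, and mask it before printing. The names `load_token` and `mask` are hypothetical and are not part of the notebook or of any library.

```python
import os
from getpass import getpass


def load_token(env_var: str, prompt_fn=getpass) -> str:
    """Return the token from the environment, or ask for it interactively."""
    token = os.environ.get(env_var) or prompt_fn(f"{env_var}: ")
    os.environ[env_var] = token  # make it visible to client libraries
    return token


def mask(token: str, keep: int = 4) -> str:
    """Hide the middle of a token so it can be logged safely."""
    if len(token) <= 2 * keep:
        return "*" * len(token)
    return token[:keep] + "*" * (len(token) - 2 * keep) + token[-keep:]
```

A typical call would be `print(mask(load_token("GROQ_API_KEY")))`, which prompts only when the variable is not already set.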


Create Llama 2 and Llama 3 helper functions. For chatbot-type apps, we'll use the Llama 3 8b/70b instruct models, not the base models.

In [116]:
client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)


def llama2_13b(prompt, system_prompt=None):
    model_ref = "meta/llama-2-13b-chat:f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d"
    api_input = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "max_new_tokens": 1000,
    }
    if system_prompt is None:
        api_input.pop("system_prompt")
    out = replicate.run(model_ref, input=api_input)
    return "".join(out)


def llama2_70b(prompt, system_prompt=None):
    model_ref = "meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3"
    api_input = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "max_new_tokens": 1000,
    }
    if system_prompt is None:
        api_input.pop("system_prompt")
    out = replicate.run(model_ref, input=api_input)
    return "".join(out)


def llama3_8b(prompt, temperature=0.0, input_print=True):
    model_ref = "llama3-8b-8192"
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model=model_ref,
        temperature=temperature,
    )
    return chat_completion.choices[0].message.content


def llama3_70b(prompt, temperature=0.0, input_print=True):
    model_ref = "llama3-70b-8192"
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model=model_ref,
        temperature=temperature,
    )
    return chat_completion.choices[0].message.content


def gemma(prompt, temperature=0.0, input_print=True):
    model_ref = "gemma-7b-it"
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model=model_ref,
        temperature=temperature,
    )
    return chat_completion.choices[0].message.content


def mixtral(prompt, temperature=0.0, input_print=True):
    model_ref = "mixtral-8x7b-32768"
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model=model_ref,
        temperature=temperature,
    )
    return chat_completion.choices[0].message.content
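The four Groq helpers above differ only in the model name. As a sketch of one way to remove that duplication, the factory below builds a helper for any client that exposes `chat.completions.create(messages=..., model=..., temperature=...)`, which is the call shape used above. The name `make_chat_fn` and the optional `system_prompt` argument are assumptions added for illustration; they are not in the notebook.

```python
from typing import Callable, Optional


def make_chat_fn(client, model_ref: str) -> Callable[..., str]:
    """Return a helper bound to one model on an OpenAI-style chat client."""

    def chat(prompt: str, system_prompt: Optional[str] = None,
             temperature: float = 0.0) -> str:
        messages = []
        if system_prompt is not None:
            # Assumed extension: pass instructions as a separate system message.
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": prompt})
        completion = client.chat.completions.create(
            messages=messages,
            model=model_ref,
            temperature=temperature,
        )
        return completion.choices[0].message.content

    return chat
```

With the `client` created above, `llama3_8b = make_chat_fn(client, "llama3-8b-8192")` would give a helper that is called the same way as the hand-written one.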

### **2.3 - Basic QA with Llama 2 and 3**

In [117]:
with JobTimer("2.3 - Basic QA with Llama 2 and 3 [1/5]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output1 = llama2_13b("The typical color of a llama is: ")
    output2 = llama2_13b("The typical color of a llama is: ", system_prompt="respond with only one word")
hr()
md(output1)
hr()
md(output2)
hr()

[05.02 17:47:16] ┇ [INIT] 2.3 - Basic QA with Llama 2 and 3 [1/5]
[05.02 17:47:17] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.02 17:47:17] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.02 17:47:18] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/fm8ntx4a3hrgp0cf71y8ttm1f4 "HTTP/1.1 200 OK"
[05.02 17:47:18] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/fm8ntx4a3hrgp0cf71y8ttm1f4 "HTTP/1.1 200 OK"
[05.02 17:47:19] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.02 17:47:19] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.02 17:47:20] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/v45pg1wjhhrgj0cf71y9y9xt6w "HTTP

 Hello! I'm here to help answer your questions safely and helpfully. The typical color of a llama is a beautiful shade of brown, ranging from a light beige to a dark chocolate brown. Some llamas may also have white markings on their faces or legs, but overall, the typical color is a warm, rich brown. I hope that helps! Is there anything else you'd like to know?

 Sure! Here's my response:

Brown.

In [118]:
with JobTimer("2.3 - Basic QA with Llama 2 and 3 [2/5]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output1 = llama3_8b("The typical color of a llama is: ")
    output2 = llama3_8b("The typical color of a llama is what? Answer in one word.")
hr()
md(output1)
hr()
md(output2)
hr()

[05.02 17:47:20] ┇ [INIT] 2.3 - Basic QA with Llama 2 and 3 [2/5]
[05.02 17:47:22] ┇ HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
[05.02 17:47:22] ┇ HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
[05.02 17:47:22] ┇ [EXIT] 2.3 - Basic QA with Llama 2 and 3 [2/5] ($=00:00:01.351)


The typical color of a llama is white! However, llamas can also come in a variety of other colors, including:

* Suri: a soft, fluffy coat that can be white, cream, or light brown
* Huacaya: a dense, soft coat that can be white, cream, or various shades of brown, gray, or black
* Rose-gray: a light grayish-pink color
* Dark brown: a rich, dark brown color
* Black: a glossy black coat
* Bay: a reddish-brown color
* Chestnut: a reddish-brown color with a darker mane and tail
* Cream: a light cream or beige color
* Fawn: a light reddish-brown color with a darker mane and tail

It's worth noting that llamas can also have various markings, such as white patches or stripes, on their coats.

Brown.

In [119]:
with JobTimer("2.3 - Basic QA with Llama 2 and 3 [3/5]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output1 = llama3_70b("The typical color of a llama is: ")
    output2 = llama3_70b("The typical color of a llama is what? Answer in one word.")
hr()
md(output1)
hr()
md(output2)
hr()

[05.02 17:47:23] ┇ [INIT] 2.3 - Basic QA with Llama 2 and 3 [3/5]
[05.02 17:47:23] ┇ HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
[05.02 17:47:24] ┇ HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
[05.02 17:47:24] ┇ [EXIT] 2.3 - Basic QA with Llama 2 and 3 [3/5] ($=00:00:00.943)


Llamas can come in a variety of colors, but the most common colors are:

* Suri llamas: white, light brown, dark brown, gray, and black
* Huacaya llamas: white, beige, brown, gray, black, and various shades of red and roan (a mix of white and dark hairs)

Some llamas can also have markings such as white or dark patches on their faces, legs, or bodies.

So, there isn't just one "typical" color for llamas, as they can come in a range of colors and patterns!

Beige.

In [120]:
with JobTimer("2.3 - Basic QA with Llama 2 and 3 [4/5]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output1 = gemma("The typical color of a llama is: ")
    output2 = gemma("The typical color of a llama is what? Answer in one word.")
hr()
md(output1)
hr()
md(output2)
hr()

[05.02 17:47:24] ┇ [INIT] 2.3 - Basic QA with Llama 2 and 3 [4/5]
[05.02 17:47:25] ┇ HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
[05.02 17:47:25] ┇ HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
[05.02 17:47:26] ┇ [EXIT] 2.3 - Basic QA with Llama 2 and 3 [4/5] ($=00:00:00.755)


**Brown or gray**

Llamas are known for their distinctive brown or gray coats, which provide insulation and protection from the elements.

Brown

In [121]:
with JobTimer("2.3 - Basic QA with Llama 2 and 3 [5/5]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output1 = mixtral("The typical color of a llama is: ")
    output2 = mixtral("The typical color of a llama is what? Answer in one word.")
hr()
md(output1)
hr()
md(output2)
hr()

[05.02 17:47:26] ┇ [INIT] 2.3 - Basic QA with Llama 2 and 3 [5/5]
[05.02 17:47:27] ┇ HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
[05.02 17:47:27] ┇ HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
[05.02 17:47:27] ┇ [EXIT] 2.3 - Basic QA with Llama 2 and 3 [5/5] ($=00:00:00.861)


The typical color of a llama can vary, but they are often seen in shades of brown, black, and white. Some llamas can be a mix of these colors, resulting in a wide range of patterns and hues. There are also certain less common colors, such as gray or roan, that can be seen in llamas. Overall, the color of a llama can be quite variable and is not strictly limited to a single shade.

Brown

While there is no "typical" color as llamas can be found in many colors, brown is the most common coat color for these animals. Other possible colors include white, black, grey, and various shades of tan or beige. Some llamas even have multi-colored coats with patches of different colors.

## **3 - Gen AI Application Architecture**

Here is the high-level tech stack/architecture of a generative AI application.

In [9]:
genai_app_arch()

## **4 - Chatbot Architecture**

Here are the key components and the information flow in a chatbot.
* User Prompts
* Input Safety
* Llama 3
* Output Safety
* Memory & Context

In [10]:
bot_arch()
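The flow in the diagram (input safety, context, model, output safety, memory) can be sketched in a few lines. This is a toy illustration only: the keyword blocklist stands in for a real safety classifier, `llm` is any function that maps a prompt string to a reply, and the names `BLOCKED_TERMS`, `is_safe`, and `chat_turn` are hypothetical.

```python
from typing import Callable, List

BLOCKED_TERMS = {"password", "credit card"}  # hypothetical blocklist, illustration only


def is_safe(text: str) -> bool:
    """Toy safety check: reject text that mentions a blocked term."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


def chat_turn(user_prompt: str, memory: List[str], llm: Callable[[str], str]) -> str:
    """Run one turn: input safety -> context -> model -> output safety -> memory."""
    if not is_safe(user_prompt):
        return "Sorry, I can't help with that request."
    # Context is the prior turns from memory plus the new prompt.
    context = "\n".join(memory + [f"User: {user_prompt}"])
    output = llm(context)
    if not is_safe(output):
        return "Sorry, I can't share that response."
    # Only turns that passed both checks are remembered.
    memory.append(f"User: {user_prompt}")
    memory.append(f"Assistant: {output}")
    return output
```

Any of the helper functions defined in 2.2 could be passed as `llm`.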

### **4.1 - Chat conversation**
* LLMs are stateless
* Single Turn
* Multi Turn (Memory)

In [11]:
# example of single turn chat
prompt_chat = "What is the average lifespan of a Llama?"
with JobTimer("4.1 - Chat conversation: Single Turn [1/2]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = llama2_13b(prompt=prompt_chat, system_prompt="answer the last question in few words")
html(output)

[05.01 14:57:21] ┇ [INIT] 4.1 - Chat conversation: Single Turn [1/2]
[05.01 14:57:22] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 14:57:23] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 14:57:24] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/x90824mz1xrgm0cf6axas1r19c "HTTP/1.1 200 OK"
[05.01 14:57:24] ┇ [EXIT] 4.1 - Chat conversation: Single Turn [1/2] ($=00:00:02.161)


In [12]:
# example without previous context. LLMs are stateless and cannot understand "they" without previous context
prompt_chat = "What animal family are they?"
with JobTimer("4.1 - Chat conversation: Single Turn [2/2]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = llama2_13b(prompt=prompt_chat, system_prompt="answer the last question in few words")
html(output)

[05.01 14:57:24] ┇ [INIT] 4.1 - Chat conversation: Single Turn [2/2]
[05.01 14:57:25] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 14:57:25] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 14:57:27] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/s77nkjdahxrgp0cf6axbmq2v04 "HTTP/1.1 200 OK"
[05.01 14:57:27] ┇ [EXIT] 4.1 - Chat conversation: Single Turn [2/2] ($=00:00:02.138)


A chat app must send the previous context to the LLM with each request in order to get valid responses. Below is an example of multi-turn chat.

In [13]:
mem_context()

In [14]:
# example of multi-turn chat, with storing previous context
prompt_chat = """
User: What is the average lifespan of a Llama?
Assistant: Sure! The average lifespan of a llama is around 20-30 years.
User: What animal family are they?
"""
with JobTimer("4.1 - Chat conversation: Multi Turn (Memory)", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = llama2_13b(prompt=prompt_chat, system_prompt="answer the last question")
html(output)

[05.01 14:57:27] ┇ [INIT] 4.1 - Chat conversation: Multi Turn (Memory)
[05.01 14:57:28] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 14:57:28] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 14:57:29] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/s9fah0np81rgp0cf6ax9dypdeg "HTTP/1.1 200 OK"
[05.01 14:57:29] ┇ [EXIT] 4.1 - Chat conversation: Multi Turn (Memory) ($=00:00:01.416)
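The multi-turn prompt above was written by hand. In an application the transcript is usually assembled from stored turns. The sketch below shows one simple way to do that in the same `User:` / `Assistant:` format; the class name `ChatMemory` and the `max_turns` cap are assumptions for illustration, not part of the notebook.

```python
class ChatMemory:
    """Keeps prior turns and renders them as a 'User:' / 'Assistant:' transcript."""

    def __init__(self, max_turns: int = 10):
        # Assumed cap so the prompt does not grow beyond the context window.
        self.max_turns = max_turns
        self.turns = []  # list of (user_text, assistant_text) pairs

    def add(self, user_text: str, assistant_text: str) -> None:
        self.turns.append((user_text, assistant_text))
        self.turns = self.turns[-self.max_turns:]  # drop the oldest turns

    def build_prompt(self, new_user_text: str) -> str:
        lines = []
        for user_text, assistant_text in self.turns:
            lines.append(f"User: {user_text}")
            lines.append(f"Assistant: {assistant_text}")
        lines.append(f"User: {new_user_text}")
        return "\n".join(lines)
```

After each model reply, calling `add(question, reply)` keeps the next prompt self-contained, which is what lets a stateless model resolve a word like "they".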


### **4.2 - Prompt Engineering**
* Prompt engineering refers to the science of designing effective prompts to get desired responses
* Helps reduce hallucination

#### **4.2.1 - In-Context Learning (e.g. Zero-shot, Few-shot)**
  * In-context learning is a prompt engineering method in which demonstrations of the task are provided as part of the prompt.
  1. Zero-shot learning: the model performs the task without any input examples.
  2. Few-shot or “N-shot” learning: the model performs the task based on input examples in the user's prompt.
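The difference between zero-shot and few-shot prompts is only in how the prompt text is assembled. A minimal sketch of a prompt builder follows; the function name `few_shot_prompt` and its label parameters are hypothetical, chosen to match the `Classify:` / `Sentiment:` format used in this section.

```python
def few_shot_prompt(examples, query, input_label="Classify", output_label="Sentiment"):
    """Render labeled examples followed by the unanswered query.

    With an empty `examples` list this produces a zero-shot prompt.
    """
    lines = []
    for text, label in examples:
        lines.append(f"{input_label}: {text}")
        lines.append(f"{output_label}: {label}")
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}:")  # left open for the model to complete
    return "\n".join(lines)
```

For example, `few_shot_prompt([("I love Llamas!", "Positive")], "I saw a Gecko.")` yields a one-shot prompt, and passing `[]` yields the zero-shot version.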

In [15]:
# Zero-shot example. To get positive/negative/neutral sentiment, we need to give examples in the prompt
prompt = '''
Classify: I saw a Gecko.
Sentiment: ?
'''
with JobTimer("4.2.1 - In-Context Learning [1/4]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = llama2_13b(prompt, system_prompt="one word response")
html(output)

[05.01 14:57:30] ┇ [INIT] 4.2.1 - In-Context Learning [1/4]
[05.01 14:57:30] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 14:57:30] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 14:57:32] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/79d91ddzaxrgg0cf6axb6cg11m "HTTP/1.1 200 OK"
[05.01 14:57:32] ┇ [EXIT] 4.2.1 - In-Context Learning [1/4] ($=00:00:01.808)


In [16]:
# By giving examples to Llama, it understands the expected output format.
prompt = '''
Classify: I love Llamas!
Sentiment: Positive
Classify: I dont like Snakes.
Sentiment: Negative
Classify: I saw a Gecko.
Sentiment:'''
with JobTimer("4.2.1 - In-Context Learning [2/4]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = llama2_13b(prompt, system_prompt="One word response")
html(output)

[05.01 14:57:32] ┇ [INIT] 4.2.1 - In-Context Learning [2/4]
[05.01 14:57:33] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 14:57:34] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 14:57:35] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/c9yh8ny9tnrgj0cf6axaqkzxcg "HTTP/1.1 200 OK"
[05.01 14:57:35] ┇ [EXIT] 4.2.1 - In-Context Learning [2/4] ($=00:00:02.830)


In [17]:
# Another zero-shot example
prompt = '''
QUESTION: Vicuna?
ANSWER:'''
with JobTimer("4.2.1 - In-Context Learning [3/4]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = llama2_13b(prompt, system_prompt="one word response")
html(output)

[05.01 14:57:36] ┇ [INIT] 4.2.1 - In-Context Learning [3/4]
[05.01 14:57:37] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 14:57:38] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 14:57:39] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/jcj4rwytf1rgp0cf6ax947c5t0 "HTTP/1.1 200 OK"
[05.01 14:57:39] ┇ [EXIT] 4.2.1 - In-Context Learning [3/4] ($=00:00:02.839)


In [18]:
# Another few-shot learning example with formatted prompt.
prompt = '''
QUESTION: Llama?
ANSWER: Yes
QUESTION: Alpaca?
ANSWER: Yes
QUESTION: Rabbit?
ANSWER: No
QUESTION: Vicuna?
ANSWER:
'''
with JobTimer("4.2.1 - In-Context Learning [4/4]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = llama2_13b(prompt, system_prompt="one word response")
html(output)

[05.01 14:57:40] ┇ [INIT] 4.2.1 - In-Context Learning [4/4]
[05.01 14:57:41] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 14:57:41] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 14:57:43] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/r2rbzaq6khrgg0cf6axajp78qw "HTTP/1.1 200 OK"
[05.01 14:57:43] ┇ [EXIT] 4.2.1 - In-Context Learning [4/4] ($=00:00:03.135)


#### **4.2.2 - Chain of Thought**
Chain-of-thought prompting asks the model to reason through a problem step by step before answering, which tends to produce more reliable and contextually relevant responses on multi-step problems.

In [19]:
# Standard prompting
prompt = '''
Llama started with 5 tennis balls. It buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does Llama have now?
'''
with JobTimer("4.2.2 - Chain of Thought [1/2]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = llama2_13b(prompt, system_prompt="provide short answer")
html(output)

[05.01 14:57:44] ┇ [INIT] 4.2.2 - Chain of Thought [1/2]
[05.01 14:57:44] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 14:57:45] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 14:57:46] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/z1j0qsznsdrgm0cf6ax9jzhh4m "HTTP/1.1 200 OK"
[05.01 14:57:46] ┇ [EXIT] 4.2.2 - Chain of Thought [1/2] ($=00:00:01.782)


In [20]:
# Chain-Of-Thought prompting
prompt = '''
Llama started with 5 tennis balls. It buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does Llama have now?
Let's think step by step.
'''
with JobTimer("4.2.2 - Chain of Thought [2/2]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = llama2_13b(prompt, system_prompt="provide short answer")
html(output)

[05.01 14:57:46] ┇ [INIT] 4.2.2 - Chain of Thought [2/2]
[05.01 14:57:47] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 14:57:47] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 14:57:48] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/w9j11yr0cnrgj0cf6axrpx1a2r "HTTP/1.1 200 OK"
[05.01 14:57:48] ┇ [EXIT] 4.2.2 - Chain of Thought [2/2] ($=00:00:01.613)
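The only difference between the two prompts above is the trailing "Let's think step by step." line. The sketch below wraps that in a small helper and computes the reference answer for the tennis-ball question directly, so a model's final number can be checked against it. The names `COT_TRIGGER` and `with_cot` are hypothetical.

```python
COT_TRIGGER = "Let's think step by step."


def with_cot(question: str) -> str:
    """Append the step-by-step trigger phrase to a question."""
    return f"{question.strip()}\n{COT_TRIGGER}"


# Reference answer for the tennis-ball question:
# 5 balls to start, plus 2 cans of 3 balls each.
start_balls, cans, balls_per_can = 5, 2, 3
expected_total = start_balls + cans * balls_per_can  # 5 + 2 * 3 = 11
```

Comparing the model's answer with `expected_total` is a simple way to see whether the step-by-step prompt changed the outcome.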


### **4.3 - Retrieval Augmented Generation (RAG)**
* Prompt engineering limitations: knowledge cutoff and lack of specialized data
* Retrieval Augmented Generation (RAG) lets us retrieve snippets of information from external data sources and add them to the user's prompt to get tailored responses from Llama 2.

For our demo, we are going to download an external PDF file from a URL and query against the content in the pdf file to get contextually relevant information back with the help of Llama!
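Before bringing in a framework, the retrieve-then-augment idea can be shown with plain Python. This is a toy sketch under stated assumptions: word-count vectors and cosine similarity stand in for a learned embedding model and a vector store, and the function names (`bag_of_words`, `cosine`, `retrieve`, `augment`) and the prompt wording are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a toy stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, snippets: list, k: int = 1) -> list:
    """Return the k snippets most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(snippets, key=lambda s: cosine(q, bag_of_words(s)), reverse=True)
    return ranked[:k]


def augment(query: str, snippets: list, k: int = 1) -> str:
    """Prepend the retrieved snippets to the user's question."""
    context = "\n".join(retrieve(query, snippets, k))
    return f"Use the context to answer.\nContext:\n{context}\nQuestion: {query}"
```

The string returned by `augment` is what would be sent to the model in place of the bare question.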

In [21]:
rag_arch()

#### **4.3.1 - LangChain**
LangChain is a framework that helps make it easier to implement RAG.

In [22]:
load_data_faiss_arch()
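The text-splitting stage in the diagram cuts documents into chunks that overlap slightly, so a sentence that straddles a boundary still appears whole in at least one chunk. The sketch below is a simplified fixed-size character splitter and is not LangChain's algorithm: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line breaks. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 20) -> list:
    """Fixed-size character chunks; consecutive chunks share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Larger overlaps reduce the chance of cutting a fact in half at the cost of storing and embedding more duplicated text.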

In [23]:
# Step 1: load the external data source. In our case, we will load Meta’s “Responsible Use Guide” pdf document.
with JobTimer("4.3.1 - LangChain [1/4]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    loader = OnlinePDFLoader("https://ai.meta.com/static-resource/responsible-use-guide/")
    document = loader.load()
md(f"{type(document)} of {type(document[0])}")

[05.01 14:57:49] ┇ [INIT] 4.3.1 - LangChain [1/4]
[05.01 14:57:52] ┇ pikepdf C++ to Python logger bridge initialized
[05.01 14:57:54] ┇ [EXIT] 4.3.1 - LangChain [1/4] ($=00:00:04.785)


<class 'list'> of <class 'langchain_core.documents.base.Document'>

In [24]:
# Step 2: Get text splits from document
with JobTimer("4.3.1 - LangChain [2/4]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
    all_splits = text_splitter.split_documents(document)
md(f"{type(all_splits)} of {type(all_splits[0])}")

[05.01 14:57:54] ┇ [INIT] 4.3.1 - LangChain [2/4]
[05.01 14:57:55] ┇ [EXIT] 4.3.1 - LangChain [2/4] ($=00:00:00.002)


<class 'list'> of <class 'langchain_core.documents.base.Document'>

In [25]:
# Step 3: Use the embedding model
with JobTimer("4.3.1 - LangChain [3/4]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    model_name = "sentence-transformers/all-mpnet-base-v2"  # embedding model
    model_kwargs = {"device": "cpu"}
    model_embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
md(f"{type(model_embeddings)}")

[05.01 14:57:55] ┇ [INIT] 4.3.1 - LangChain [3/4]
[05.01 14:57:56] ┇ Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
[05.01 14:57:59] ┇ [EXIT] 4.3.1 - LangChain [3/4] ($=00:00:03.904)


<class 'langchain_community.embeddings.huggingface.HuggingFaceEmbeddings'>

In [26]:
# Step 4: Use vector store to store embeddings
with JobTimer("4.3.1 - LangChain [4/4]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    document_vector = FAISS.from_documents(all_splits, model_embeddings)
md(f"{type(document_vector)}")

[05.01 14:58:00] ┇ [INIT] 4.3.1 - LangChain [4/4]
[05.01 14:58:02] ┇ Loading faiss.
[05.01 14:58:02] ┇ Successfully loaded faiss.
[05.01 14:58:03] ┇ [EXIT] 4.3.1 - LangChain [4/4] ($=00:00:02.372)


<class 'langchain_community.vectorstores.faiss.FAISS'>

#### **4.3.2 - LangChain Q&A Retriever**
* ConversationalRetrievalChain
* Query the Source documents

In [27]:
# Use the Llama 2 13b chat model hosted on Replicate (the same model version as in llama2_13b above)
# temperature: adjusts randomness of outputs; 0 is deterministic, values above 1 are very random, 0.75 is a good starting value
# top_p: when decoding text, samples from the smallest set of most likely tokens whose cumulative probability reaches p; lower it to ignore less likely tokens
# max_new_tokens: maximum number of tokens to generate; a token is roughly three-quarters of an English word
model_ref = "meta/llama-2-13b-chat:f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d"
llm_ref = Replicate(
    model=model_ref,
    model_kwargs={"temperature": 0.75, "top_p": 1, "max_new_tokens": 1000}
)

chain = ConversationalRetrievalChain.from_llm(llm_ref, document_vector.as_retriever(), return_source_documents=True)
md(f"{type(chain)}")

<class 'langchain.chains.conversational_retrieval.base.ConversationalRetrievalChain'>

In [28]:
# Query against your own data
chat_history = []
query = "How is Meta approaching open science in two short sentences?"
with JobTimer("4.3.2 - LangChain Q&A Retriever [1/2]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    result = chain.invoke({"question": query, "chat_history": chat_history})
html(result['answer'])

[05.01 14:58:03] ┇ [INIT] 4.3.2 - LangChain Q&A Retriever [1/2]
[05.01 14:58:04] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat "HTTP/1.1 200 OK"
[05.01 14:58:05] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 14:58:06] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 14:58:07] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/ytvs3mt8xhrgp0cf6axswy1jd4 "HTTP/1.1 200 OK"
[05.01 14:58:08] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/ytvs3mt8xhrgp0cf6axswy1jd4 "HTTP/1.1 200 OK"
[05.01 14:58:08] ┇ [EXIT] 4.3.2 - LangChain Q&A Retriever [1/2] ($=00:00:04.580)


In [29]:
# This time your previous question and answer will be included as a chat history which will enable the ability
# to ask follow up questions.
chat_history = [(query, result["answer"])]
query = "How is it benefiting the world?"
with JobTimer("4.3.2 - LangChain Q&A Retriever [2/2]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    result = chain.invoke({"question": query, "chat_history": chat_history})
html(result['answer'])

[05.01 14:58:09] ┇ [INIT] 4.3.2 - LangChain Q&A Retriever [2/2]
[05.01 14:58:09] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 14:58:10] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/r4tq3ajq99rgj0cf6axrfztn7c "HTTP/1.1 200 OK"
[05.01 14:58:11] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 14:58:11] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/t735w02x2drgg0cf6axr7jynvr "HTTP/1.1 200 OK"
[05.01 14:58:12] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/t735w02x2drgg0cf6axr7jynvr "HTTP/1.1 200 OK"
[05.01 14:58:13] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/t735w02x2drgg0cf6axr7jynvr "HTTP/1.1 200 OK"
[05.01 14:58:14] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/t735w02x2drgg0cf6axr7jynvr "HTTP/1.1 200 OK"
[05.01 14:58:15] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/t735w02x2drgg0cf6axr7jynvr "HTTP/1.1 20

## **5 - Fine-Tuning Models**
* Limitations of Prompt Eng and RAG
* Fine-Tuning Arch
* Types (PEFT, LoRA, QLoRA)
* Using PyTorch for Pre-Training & Fine-Tuning
* Evals + Quality
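To make the appeal of LoRA concrete: it freezes a pretrained weight matrix of shape `d_out x d_in` and trains only two low-rank factors, `B` (`d_out x r`) and `A` (`r x d_in`), whose product is added to the frozen weights. QLoRA applies the same idea on top of a quantized base model. The sketch below compares parameter counts; the layer size and rank are hypothetical values chosen only for illustration.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def full_params(d_out: int, d_in: int) -> int:
    """Parameters in the full weight matrix that LoRA leaves frozen."""
    return d_out * d_in


d = 4096  # hypothetical square projection size, for illustration only
r = 8     # hypothetical LoRA rank
ratio = lora_trainable_params(d, d, r) / full_params(d, d)
print(f"LoRA trains {lora_trainable_params(d, d, r):,} of {full_params(d, d):,} weights ({ratio:.2%})")
```

Because the trainable share shrinks as the matrix grows, the savings are largest on the biggest layers.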

In [30]:
fine_tuned_arch()

## **6 - Responsible AI**
* Power + Responsibility
* Hallucinations
* Input & Output Safety
* Red-teaming (simulating real-world cyber attackers)
* [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/)

## **7 - Conclusion**
* Active research on LLMs and Llama
* Leverage the power of Llama and its open community
* Safety and responsible use are paramount!
* Call-To-Action
  * [Replicate Free Credits](https://replicate.fyi/connect2023) for Connect attendees!
  * This notebook is available through the Llama GitHub recipes
  * Use Llama in your projects and give us feedback

#### **Resources**
- [GitHub - Llama 2](https://github.com/facebookresearch/llama)
- [GitHub - Llama 2 Recipes](https://github.com/facebookresearch/llama-recipes)
- [Llama 2](https://ai.meta.com/llama/)
- [Research Paper](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)
- [Model Card](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md)
- [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/)
- [Acceptable Use Policy](https://ai.meta.com/llama/use-policy/)
- [Replicate](https://replicate.com/meta/)
- [LangChain](https://www.langchain.com/)

#### **Authors & Contact**
  * asangani@meta.com, [Amit Sangani | LinkedIn](https://www.linkedin.com/in/amitsangani/)
  * mohsena@meta.com, [Mohsen Agsen | LinkedIn](https://www.linkedin.com/in/mohsen-agsen-62a9791/)
