# **Getting to know Llama 3: Everything you need to start building**
This session is a guided tour of Llama 3, with comparisons to Llama 2. It covers the different Llama 3 models, how and where to access them, generative AI and chatbot architectures, prompt engineering, RAG (Retrieval Augmented Generation), fine-tuning, and more. Everything is implemented as starter code that you can take and use in your own Llama 3 projects.

### **0 - Prerequisites**
* Basic understanding of Large Language Models
* Basic understanding of Python

In [1]:
import base64
import logging
import os
from getpass import getpass

import replicate
from IPython.display import display, Image, Markdown, HTML
from groq import Groq
from langchain.chains import ConversationalRetrievalChain
from langchain.document_loaders import OnlinePDFLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import Replicate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

from chrisbase.data import *
from chrisbase.io import *
from chrisbase.util import *


def mm(graph):
    graphbytes = graph.encode("ascii")
    base64_bytes = base64.b64encode(graphbytes)
    base64_string = base64_bytes.decode("ascii")
    display(Image(url="https://mermaid.ink/img/" + base64_string))


def md(t):
    display(Markdown(t))


def hr():
    display(HTML("<hr style='border:3px solid MidnightBlue; width:880px; padding:0;"
                 " margin-left:0; margin-right:0; margin-top:10px; margin-bottom:15px;'>"))


def mds(ps: dict):
    for k, v in ps.items():
        hr()
        md(f"**[{k}]**\n{prefixed_str(v, '> ')}")
    hr()


def html(t,
         s="<pre style='font-family:Arial; font-size:12pt;"
           " padding:10px; margin-left:10px; margin-top:10px;"
           " background-color:LightSkyBlue; border:3px solid MidnightBlue;"
           " width:100%; height:100px;'>",
         e="</pre>"):
    display(HTML(s + t + e))


def llama3_family():
    mm("""
    graph LR;
        llama-3 --> llama-3-8b
        llama-3 --> llama-3-70b
        llama-3-8b --> llama-3-8b-base
        llama-3-8b --> llama-3-8b-instruct
        llama-3-70b --> llama-3-70b-base
        llama-3-70b --> llama-3-70b-instruct
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def genai_app_arch():
    mm("""
    flowchart TD
        A[Users] --> B(Applications e.g. mobile, web)
        B --> |Hosted API|C(Platforms e.g. Custom, HuggingFace, Replicate)
        B -- optional --> E(Frameworks e.g. LangChain)
        C-->|User Input|D[Llama 3]
        D-->|Model Output|C
        E --> C
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def bot_arch():
    mm("""
    graph LR;
        user --> prompt
        prompt --> i_safety
        i_safety --> context
        context --> Llama_3
        Llama_3 --> output
        output --> o_safety
        i_safety --> memory
        o_safety --> memory
        memory --> context
        o_safety --> user
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def mem_context():
    mm("""
    graph LR
        context(text)
        user_prompt --> context
        instruction --> context
        examples --> context
        memory --> context
        context --> tokenizer
        tokenizer --> embeddings
        embeddings --> LLM
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def rag_arch():
    mm("""
    flowchart TD
        A[User Prompts] --> B(Frameworks e.g. LangChain)
        B <--> |Database, Docs, XLS|C[fa:fa-database External Data]
        B -->|API|D[Llama 3]
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def load_data_faiss_arch():
    mm("""
    graph LR;
        documents --> textsplitter
        textsplitter --> embeddings
        embeddings --> vectorstore
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def fine_tuned_arch():
    mm("""
    graph LR;
        Custom_Dataset --> Pre-trained_Llama
        Pre-trained_Llama --> Fine-tuned_Llama
        Fine-tuned_Llama --> RLHF
        RLHF --> |Loss:Cross-Entropy|Fine-tuned_Llama
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)

## **1 - Understanding Llama 3**

### **1.1 - What is Llama 3?**

* State of the art (SOTA), Open Source LLM
* 8B, 70B - base and instruct models
* Choosing model: Size, Quality, Cost, Speed
* Pretrained + Chat
* [Meta Llama 3 Blog](https://ai.meta.com/blog/meta-llama-3/)
* [Getting Started with Meta Llama](https://llama.meta.com/docs/get-started)

In [2]:
llama3_family()

### **1.2 - Accessing Llama 3**
* Download + Self Host (i.e. [download Llama](https://ai.meta.com/resources/models-and-libraries/llama-downloads))
* Hosted API Platform (e.g. [Groq](https://console.groq.com/), [Replicate](https://replicate.com/meta/meta-llama-3-8b-instruct), [Together](https://api.together.xyz/playground/language/meta-llama/Llama-3-8b-hf), [Anyscale](https://app.endpoints.anyscale.com/playground))
* Hosted Container Platform (e.g. [Azure](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/introducing-llama-2-on-azure/ba-p/3881233), [AWS](https://aws.amazon.com/blogs/machine-learning/llama-2-foundation-models-from-meta-are-now-available-in-amazon-sagemaker-jumpstart/), [GCP](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/139))

### **1.3 - Use Cases of Llama 3**
* Content Generation
* Summarization
* General Chatbots
* RAG (Retrieval Augmented Generation): Chat about Your Own Data
* Fine-tuning
* Agents

## **2 - Using and Comparing Llama 3 and Llama 2**

We will use the Llama 2 7b & 70b chat models and the Llama 3 8b & 70b instruct models hosted on [Replicate](https://replicate.com/search?query=llama) to run the examples here.
First sign in to Replicate with your GitHub account, then create a free API token [here](https://replicate.com/account/api-tokens) that you can use for a while.
You can also use other Llama 3 cloud providers such as [Groq](https://console.groq.com/), [Together](https://api.together.xyz/playground/language/meta-llama/Llama-3-8b-hf), or [Anyscale](https://app.endpoints.anyscale.com/playground).

### **2.1 - Install dependencies**

In [3]:
# List the installed versions of the required packages
!pip list | grep -E "langchain|transformers|torch|faiss|groq|replicate|chris"

chrisbase                 0.5.3
faiss-cpu                 1.8.0
groq                      0.5.0
langchain                 0.1.20
langchain-community       0.0.38
langchain-core            0.1.52
langchain-text-splitters  0.0.1
replicate                 0.25.2
sentence-transformers     2.7.0
torch                     2.3.0
transformers              4.40.2


### **2.2 - Create helpers for Llama 2 and Llama 3**
First, set your API tokens as environment variables.

In [4]:
GROQ_API_TOKEN = read_or(first_path_or("*groq*.key")) or getpass()
REPLICATE_API_TOKEN = read_or(first_path_or("*replicate*.key")) or getpass()
os.environ["GROQ_API_KEY"] = GROQ_API_TOKEN
os.environ["REPLICATE_API_TOKEN"] = REPLICATE_API_TOKEN
print(f"Groq API Key: {mask_str(GROQ_API_TOKEN, start=4, end=-4)}")
print(f"Replicate API Key: {mask_str(REPLICATE_API_TOKEN, start=3, end=-3)}")

logging.getLogger("IPKernelApp").setLevel(logging.INFO)
logger = logging.getLogger(__name__)
args = CommonArguments(
    env=ProjectEnv(
        project="LLM-based",
        job_name="LLaMA-based-Chat",
        msg_level=logging.INFO,
        msg_format=LoggingFormat.PRINT_00,
    )
)
args.dataframe()

Groq API Key: gsk_************************************************2Uuw
Replicate API Key: r8_**********************************MYA


Unnamed: 0,CommonArguments,value
0,tag,
1,env.project,LLM-based
2,env.job_name,LLaMA-based-Chat
3,env.job_version,
4,env.hostname,chris137
5,env.hostaddr,129.254.164.137
6,env.time_stamp,0617.190135
7,env.python_path,/home/chris/miniforge3/envs/LLM-based/bin/python3.11
8,env.current_dir,/dat/proj/LLM-based
9,env.current_file,/home/chris/proj/LLM-based/Getting_to_know_Llama_3.ipynb


Create Llama 2 and Llama 3 helper functions - for chatbot type of apps, we'll use Llama 3 instruct and Llama 2 chat models, not the base models.

In [5]:
def llama2_7b(prompt):
    output = replicate.run(
        "meta/llama-2-7b-chat",
        input={"prompt": prompt}
    )
    return ''.join(output)


def llama2_70b(prompt):
    output = replicate.run(
        "meta/llama-2-70b-chat",
        input={"prompt": prompt}
    )
    return ''.join(output)


def llama3_8b(prompt):
    output = replicate.run(
        "meta/meta-llama-3-8b-instruct",
        input={"prompt": prompt}
    )
    return ''.join(output)


def llama3_70b(prompt):
    output = replicate.run(
        "meta/meta-llama-3-70b-instruct",
        input={"prompt": prompt}
    )
    return ''.join(output)

In [6]:
client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)


def llama3_8b_g(prompt, temperature=0.0, input_print=True):
    model_ref = "llama3-8b-8192"
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model=model_ref,
        temperature=temperature,
    )
    return chat_completion.choices[0].message.content


def llama3_70b_g(prompt, temperature=0.0, input_print=True):
    model_ref = "llama3-70b-8192"
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model=model_ref,
        temperature=temperature,
    )
    return chat_completion.choices[0].message.content

### **2.3 - Basic QA with Llama 2 and 3**

In [7]:
outputs = dict()
prompt = "The typical color of a llama is: "
with JobTimer("2.3 - Basic QA with Llama 2 and 3 (1)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    outputs["llama2_7b"] = llama2_7b(prompt)
    outputs["llama3_8b"] = llama3_8b(prompt)
mds(outputs)

[INIT] 2.3 - Basic QA with Llama 2 and 3 (1)
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-7b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/hrrtr9s4psrgg0cg4pjth1sdkg "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/sqcxfb18khrgj0cg4pjthkdas0 "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/sqcxfb18khrgj0cg4pjthkdas0 "HTTP/1.1 200 OK"
[EXIT] 2.3 - Basic QA with Llama 2 and 3 (1) ($=00:00:02.870)


**[llama2_7b]**
>  Thank you for asking! The typical color of a llama is actually gray. Llamas are known for their distinctive gray coats, which can range in shade from light to dark. Some llamas may also have white or brown markings on their faces and legs. However, it's important to note that llama coat colors can vary, and some llamas may have different colors depending on their breed or genetics. Is there anything else I can help with?

**[llama3_8b]**
> 
> 
> The typical color of a llama is brown or tan, although they can also come in a variety of other colors such as white, black, gray, and various shades of red and fawn. Some llamas may also have markings or patches of different colors.

In [8]:
outputs = dict()
prompt = "The typical color of a llama is what? Answer in one word."
with JobTimer("2.3 - Basic QA with Llama 2 and 3 (2)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    outputs["llama2_7b"] = llama2_7b(prompt)
    outputs["llama3_8b"] = llama3_8b(prompt)
mds(outputs)

[INIT] 2.3 - Basic QA with Llama 2 and 3 (2)
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-7b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/dgzgrkhjpxrgm0cg4pjseqr3sr "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/sjj0909pkhrgp0cg4pjvn5gz5w "HTTP/1.1 200 OK"
[EXIT] 2.3 - Basic QA with Llama 2 and 3 (2) ($=00:00:02.366)


**[llama2_7b]**
>  Great! I'm happy to help. The typical color of a llama is... gray! 🐮

**[llama3_8b]**
> 
> 
> Brown.

## **3 - Chat conversation**

### **3.1 - Single-turn chat**

In [9]:
# Example without previous context. LLMs are stateless and cannot resolve "they" without previous context
outputs = dict()
prompt = "What animal family are they? Answer the question in few words."
with JobTimer("3.1 - Single-turn chat", rt=1, rb=1, rw=114, rc='=', verbose=1):
    outputs["llama2_7b"] = llama2_7b(prompt)
    outputs["llama3_8b"] = llama3_8b(prompt)
    outputs["llama2_70b"] = llama2_70b(prompt)
    outputs["llama3_70b"] = llama3_70b(prompt)
mds(outputs)

[INIT] 3.1 - Single-turn chat
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-7b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/h778exhz75rgj0cg4pjvgpcdzw "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/3mxvpfj35drgj0cg4pjvjgav84 "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-70b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/3ne3vy274srgg0cg4pjvxz1xyr "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-70b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/xd74pfjbg1rj20cg4pjvzqvtrc "HTTP/1.1 200 OK"
[EXIT] 3.1 - Single-turn chat ($=00:00:04.644)


**[llama2_7b]**
>  Of course, I'd be happy to help! The animal family that the question refers to is the "Carnivora" family, which includes all meat-eating mammals such as lions, tigers, bears, and wolves.

**[llama3_8b]**
> 
> 
> Mammals.

**[llama2_70b]**
>  The animals you are referring to are part of the family Canidae.

**[llama3_70b]**
> 
> 
> Please provide the animal's name, and I'll respond with its family!

**Note: In this example, Llama 3 70b asks for the missing context instead of hallucinating an answer.**

### **3.2 - Multi-turn chat**
A chat app has to send the previous context to the LLM with each request to get valid responses. Below is an example of multi-turn chat.

In [10]:
# example of multi-turn chat, with storing previous context
outputs = dict()
prompt = """
User: What is the average lifespan of a Llama?
Assistant: 15-20 years.
User: What animal family are they?
"""
with JobTimer("3.2 - Multi-turn chat (1)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    outputs["llama2_7b"] = llama2_7b(prompt)
    outputs["llama3_8b"] = llama3_8b(prompt)
mds(outputs)

[INIT] 3.2 - Multi-turn chat (1)
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-7b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/780w5c2mmnrgg0cg4pjt42k43w "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/780w5c2mmnrgg0cg4pjt42k43w "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/pr6rvg2vfnrgp0cg4pjs88h344 "HTTP/1.1 200 OK"
[EXIT] 3.2 - Multi-turn chat (1) ($=00:00:02.764)


**[llama2_7b]**
>  Assistant: Hello! I'm happy to help you with your questions. However, I must inform you that Llamas are not actually animals, they are actually a type of South American camelid. So, they do not belong to any animal family. They are known for their distinctive long necks, ears, and woolly coats. Llamas are social animals and are often kept as pets or used for their wool and meat. They can live for around 15-20 years on average. I hope that answers your question! Is there anything else I can help you with?

**[llama3_8b]**
> 
> 
> Llamas are members of the camelid family (Camelidae).

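**Note: Both Llama 2 and Llama 3 use the chat history to resolve "they" as llamas in the follow-up question. In this run, however, Llama 2 7b's answer is self-contradictory: it says llamas belong to no animal family while also calling them camelids.**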

In [11]:
# example of multi-turn chat, with storing previous context
outputs = dict()
prompt = """
User: What is the average lifespan of a Llama?
Assistant: Sure! The average lifespan of a llama is around 20-30 years.
User: What animal family are they?

Answer the question with one word.
"""
with JobTimer("3.2 - Multi-turn chat (2)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    outputs["llama2_7b"] = llama2_7b(prompt)
    outputs["llama2_70b"] = llama2_70b(prompt)
    outputs["llama3_8b"] = llama3_8b(prompt)
mds(outputs)

[INIT] 3.2 - Multi-turn chat (2)
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-7b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/t44evcb3zdrgm0cg4pjrtbc2cw "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-70b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/cxrgjsk7rsrgg0cg4pjr8h0d0w "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/cx1ddsbbjsrgj0cg4pjrdgemsc "HTTP/1.1 200 OK"
[EXIT] 3.2 - Multi-turn chat (2) ($=00:00:03.254)


**[llama2_7b]**
>  Sure! The animal family that llamas belong to is: Camelidae.

**[llama2_70b]**
>  Camelids

**[llama3_8b]**
> 
> 
> Camelid.

**Note: Both Llama 3 8b and Llama 2 70b follow instructions (e.g. "Answer the question with one word") better than Llama 2 7b in multi-turn chat.**
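
The multi-turn prompts above were written by hand. As a minimal sketch of how an app might assemble them from stored turns, the helper below formats a history list into the same `User:` / `Assistant:` layout. The name `build_chat_prompt` and the tuple-based history are assumptions made for illustration; they are not part of Replicate, Groq, or this notebook's helpers.

```python
def build_chat_prompt(history, user_message):
    """Format prior turns plus the new user message as one prompt string.

    history: list of (user_text, assistant_text) tuples, oldest first.
    """
    lines = []
    for user_text, assistant_text in history:
        lines.append(f"User: {user_text}")
        lines.append(f"Assistant: {assistant_text}")
    # The new question goes last so the model answers it with the earlier turns as context.
    lines.append(f"User: {user_message}")
    return "\n".join(lines) + "\n"


history = [("What is the average lifespan of a Llama?", "15-20 years.")]
chat_prompt = build_chat_prompt(history, "What animal family are they?")
print(chat_prompt)
```

The resulting string could then be passed to any of the helper functions defined earlier, for example `llama3_8b(chat_prompt)`. After each reply, the app would append the new question and answer to `history` before the next turn.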

### **3.3 - Prompt Engineering**
* Prompt engineering refers to the science of designing effective prompts to get desired responses
* Helps reduce hallucination

#### **3.3.1 - In-Context Learning (e.g. Zero-shot, Few-shot)**
 * In-context learning is a prompt engineering method in which demonstrations of the task are provided as part of the prompt.
   1. Zero-shot learning - the model performs a task without any input examples.
   2. Few-shot or "N-shot" learning - the model performs a task based on input examples in the user's prompt.

In [12]:
# Zero-shot example. To get positive/negative/neutral sentiment, we need to give examples in the prompt
outputs = dict()
prompt = '''
Classify: I saw a Gecko.
Sentiment: ?

Give one word response.
'''
with JobTimer("3.3.1 - In-Context Learning (1)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    outputs["llama2_7b"] = llama2_7b(prompt)
    outputs["llama3_8b"] = llama3_8b(prompt)
mds(outputs)

[INIT] 3.3.1 - In-Context Learning (1)
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-7b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/50fvek3jmnrgm0cg4pjrswn7mg "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/661wb4vph9rgg0cg4pjvkj6s0m "HTTP/1.1 200 OK"
[EXIT] 3.3.1 - In-Context Learning (1) ($=00:00:02.000)


**[llama2_7b]**
>  Interesting!

**[llama3_8b]**
> 
> 
> Interesting

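**Note: In this run, both models answer "Interesting" instead of a sentiment label, because the zero-shot prompt gives no examples of the expected output.**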

In [13]:
# By giving examples to Llama, it understands the expected output format.
outputs = dict()
prompt = '''
Classify: I love Llamas!
Sentiment: Positive
Classify: I dont like Snakes.
Sentiment: Negative
Classify: I saw a Gecko.
Sentiment:

Give one word response.
'''
with JobTimer("3.3.1 - In-Context Learning (2)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    outputs["llama2_7b"] = llama2_7b(prompt)
    outputs["llama3_8b"] = llama3_8b(prompt)
mds(outputs)

[INIT] 3.3.1 - In-Context Learning (2)
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-7b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/5cekn5kxpxrgg0cg4pjr0y30d8 "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/e1xjmdc2z5rgg0cg4pjtax12sg "HTTP/1.1 200 OK"
[EXIT] 3.3.1 - In-Context Learning (2) ($=00:00:02.362)


**[llama2_7b]**
>  Sure, I'd be happy to help! Here are the classifications and sentiments for each of the statements you provided:
> 
> Classify: I love Llamas!
> Sentiment: Positive
> 
> Classify: I don't like Snakes.
> Sentiment: Negative
> 
> Classify: I saw a Gecko.
> Sentiment: Neutral

**[llama3_8b]**
> 
> 
> Neutral

**Note: Llama 2, with few shots, has the same output "Neutral" as Llama 3, but Llama 2 doesn't follow instructions (Give one word response) well.**
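
The zero-shot and few-shot prompts above differ only in the number of labeled examples placed before the query. The sketch below makes that explicit with a small prompt builder; `build_few_shot_prompt` is a hypothetical helper written for this illustration, and the example texts are copied from the prompts above.

```python
def build_few_shot_prompt(examples, query, instruction="Give one word response."):
    """Build a sentiment prompt from (text, label) examples and an unlabeled query.

    An empty examples list yields a zero-shot prompt.
    """
    lines = []
    for text, label in examples:
        lines.append(f"Classify: {text}")
        lines.append(f"Sentiment: {label}")
    # The query is left unlabeled so the model completes the pattern.
    lines.append(f"Classify: {query}")
    lines.append("Sentiment:")
    lines.append("")
    lines.append(instruction)
    return "\n".join(lines) + "\n"


examples = [("I love Llamas!", "Positive"), ("I dont like Snakes.", "Negative")]
few_shot_prompt = build_few_shot_prompt(examples, "I saw a Gecko.")
zero_shot_prompt = build_few_shot_prompt([], "I saw a Gecko.")
print(few_shot_prompt)
```

Keeping the examples in a list makes it easy to vary the number of shots and compare model behavior, as the two cells above do by hand.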

#### **3.3.2 - Chain of Thought**
Chain-of-thought prompting elicits complex reasoning by having the model work through a problem step by step before giving its final answer, which tends to produce more accurate and contextually relevant responses.

In [14]:
# Standard prompting
outputs = dict()
prompt = '''
Llama started with 5 tennis balls. It buys 2 more cans of tennis balls. Each can has 3 tennis balls.
How many tennis balls does Llama have?

Answer in one word.
'''
with JobTimer("3.3.2 - Chain of Thought (1)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    outputs["llama3_8b"] = llama3_8b(prompt)
    outputs["llama3_70b"] = llama3_70b(prompt)
mds(outputs)

[INIT] 3.3.2 - Chain of Thought (1)
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/9gjembcbghrgp0cg4pjr095ge4 "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-70b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/da1xxbwfg9rj00cg4pjtddq8qr "HTTP/1.1 200 OK"
[EXIT] 3.3.2 - Chain of Thought (1) ($=00:00:02.548)


**[llama3_8b]**
> 
> 
> Seven

**[llama3_70b]**
> 
> 
> Eleven.

**Note: Llama 3 8b answers incorrectly here (the correct answer is 5 + 2 x 3 = 11), likely because the one-word constraint leaves no room for intermediate reasoning.**

In [15]:
# By default, Llama 3 models follow "Chain-Of-Thought" prompting
outputs = dict()
prompt = '''
Llama started with 5 tennis balls. It buys 2 more cans of tennis balls. Each can has 3 tennis balls.
How many tennis balls does Llama have?
'''
with JobTimer("3.3.2 - Chain of Thought (2)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    outputs["llama3_8b"] = llama3_8b(prompt)
    outputs["llama3_70b"] = llama3_70b(prompt)
mds(outputs)

[INIT] 3.3.2 - Chain of Thought (2)
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/thzap0mqb9rgg0cg4pjsxgcb3w "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/thzap0mqb9rgg0cg4pjsxgcb3w "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-70b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/6ef11wcya9rj20cg4pjraav70r "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/6ef11wcya9rj20cg4pjraav70r "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/6ef11wcya9rj20cg4pjraav70r "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/6ef11wcya9rj20cg4pjraav70r "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/6ef11wcya9rj20cg4pjraav70r "HTTP/1.1 200

**[llama3_8b]**
> 
> 
> Let's help Llama count its tennis balls!
> 
> Llama started with 5 tennis balls.
> 
> It buys 2 more cans of tennis balls, and each can has 3 tennis balls. So, Llama gets 2 x 3 = 6 more tennis balls.
> 
> Adding the 6 new tennis balls to the 5 it already had, Llama now has a total of:
> 
> 5 + 6 = 11
> 
> So, Llama has 11 tennis balls now!

**[llama3_70b]**
> 
> 
> Let's break it down step by step!
> 
> Llama starts with 5 tennis balls.
> 
> It buys 2 more cans of tennis balls, and each can has 3 tennis balls. So, the total number of new tennis balls is:
> 2 cans x 3 tennis balls/can = 6 new tennis balls
> 
> Now, let's add the new tennis balls to the ones Llama already had:
> 5 (initial tennis balls) + 6 (new tennis balls) = 11 tennis balls
> 
> So, Llama now has 11 tennis balls!

**Note: By default, Llama 3 models identify word problems and solve them step by step!**

In [16]:
outputs = dict()
prompt = """
15 of us want to go to a restaurant.
Two of them have cars
Each car can seat 5 people.
Two of us have motorcycles.
Each motorcycle can fit 2 people.
Can we all get to the restaurant by car or motorcycle?
Think step by step.
Provide the answer as a single yes/no answer first.
Then explain each intermediate step.
"""
with JobTimer("3.3.2 - Chain of Thought (3)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    outputs["llama3_8b"] = llama3_8b(prompt)
    outputs["llama3_70b"] = llama3_70b(prompt)
mds(outputs)

[INIT] 3.3.2 - Chain of Thought (3)
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/xvq2hhnkxdrgm0cg4pjvxt04r4 "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/xvq2hhnkxdrgm0cg4pjvxt04r4 "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/xvq2hhnkxdrgm0cg4pjvxt04r4 "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/xvq2hhnkxdrgm0cg4pjvxt04r4 "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-70b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/yqp21re1sdrj60cg4pjr6k07kr "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/yqp21re1sdrj60cg4pjr6k07kr "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/yqp21re1sdrj60cg4pjr6k07kr "HTTP/1.1 200

**[llama3_8b]**
> 
> 
> Yes.
> 
> Here's the step-by-step breakdown:
> 
> 1. We have 15 people who want to go to the restaurant.
> 2. Two people have cars, which can seat 5 people each. This means we can transport 5 x 2 = 10 people by car.
> 3. However, we still have 15 - 10 = 5 people left who need to be transported.
> 4. Two people have motorcycles, which can seat 2 people each. This means we can transport 2 x 2 = 4 people by motorcycle.
> 5. Unfortunately, we still have 5 - 4 = 1 person left who needs to be transported, but we only have two motorcycles, which can't accommodate this person.
> 6. Therefore, we can't transport all 15 people to the restaurant by car or motorcycle.

**[llama3_70b]**
> 
> 
> **Answer:** NO
> 
> Here's the step-by-step explanation:
> 
> 1. We have 15 people who want to go to the restaurant.
> 2. We have 2 cars, each of which can seat 5 people. This means we can transport a total of 2 x 5 = 10 people by car.
> 3. We still have 15 - 10 = 5 people left who need transportation.
> 4. We have 2 motorcycles, each of which can fit 2 people. This means we can transport a total of 2 x 2 = 4 people by motorcycle.
> 5. We still have 5 - 4 = 1 person left who needs transportation, but we don't have any more vehicles to accommodate them.
> 6. Therefore, we cannot transport all 15 people to the restaurant by car or motorcycle.

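**Note: Llama 3 70b answers correctly in this example (the vehicles seat 2 x 5 + 2 x 2 = 14 people, fewer than 15). Llama 3 8b reaches the same conclusion in its steps but contradicts it with its initial "Yes".**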

**Summary: Llama 2 often needs encouragement to think step by step in order to reason correctly. Llama 3 understands, reasons, and explains better, making an explicit chain-of-thought instruction unnecessary in the tennis-ball example above.**
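
To make the expected answers for the two word problems explicit, the arithmetic can be checked directly. This is plain Python with values taken from the prompts above; the variable names are chosen for illustration.

```python
# Tennis balls: 5 to start, plus 2 cans with 3 balls each.
tennis_balls = 5 + 2 * 3

# Restaurant: 2 cars seating 5 each and 2 motorcycles seating 2 each, for 15 people.
people = 15
seats = 2 * 5 + 2 * 2
everyone_fits = seats >= people

print(tennis_balls)          # 11
print(seats, everyone_fits)  # 14 False
```

A check like this is a handy reference when judging whether a model's step-by-step explanation and its final answer agree.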

### **3.4 - Retrieval Augmented Generation (RAG)**
* Prompt engineering limitations: knowledge cutoff and lack of specialized data
* Retrieval Augmented Generation (RAG) lets us retrieve snippets of information from external data sources and add them to the user's prompt to get tailored responses from Llama.

For our demo, we will download an external PDF file from a URL and query its content to get contextually relevant information back with the help of Llama!

In [None]:
rag_arch()

#### **3.4.1 - LangChain**
LangChain is a framework that helps make it easier to implement RAG.

In [None]:
load_data_faiss_arch()

In [None]:
# Step 1: load the external data source. In our case, we will load Meta’s “Responsible Use Guide” pdf document.
with JobTimer("3.4.1 - LangChain [1/4]", rt=1, rb=1, rw=114, rc='=', verbose=1):
    loader = OnlinePDFLoader("https://ai.meta.com/static-resource/responsible-use-guide/")
    document = loader.load()
md(f"{type(document)} of {type(document[0])}")

In [None]:
# Step 2: Get text splits from document
with JobTimer("3.4.1 - LangChain [2/4]", rt=1, rb=1, rw=114, rc='=', verbose=1):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
    all_splits = text_splitter.split_documents(document)
md(f"{type(all_splits)} of {type(all_splits[0])}")
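
To illustrate what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window splitter. It is only a sketch: LangChain's `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, which this version does not attempt. The function name `split_with_overlap` is an assumption made for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so content that
    straddles a boundary appears in both neighboring chunks.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

A small overlap, like the 20 characters used in the cell above, reduces the chance that a sentence split across two chunks loses its meaning in both.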

In [None]:
# Step 3: Use the embedding model
with JobTimer("3.4.1 - LangChain [3/4]", rt=1, rb=1, rw=114, rc='=', verbose=1):
    model_name = "sentence-transformers/all-mpnet-base-v2"  # embedding model
    model_kwargs = {"device": "cpu"}
    model_embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
md(f"{type(model_embeddings)}")

In [None]:
# Step 4: Use vector store to store embeddings
with JobTimer("3.4.1 - LangChain [4/4]", rt=1, rb=1, rw=114, rc='=', verbose=1):
    document_vector = FAISS.from_documents(all_splits, model_embeddings)
md(f"{type(document_vector)}")
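
Conceptually, a vector store keeps one embedding per chunk and returns the chunks whose embeddings are closest to the query embedding. The toy sketch below shows that idea with cosine similarity over made-up 3-dimensional vectors. The vectors, chunk texts, and function names are all hypothetical, and FAISS itself uses optimized indexes and may default to a different distance metric (for example Euclidean distance).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=1):
    """Return the k (text, score) pairs with the highest cosine similarity."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up vectors standing in for real sentence embeddings.
toy_store = [
    ("chunk about open science", [0.9, 0.1, 0.0]),
    ("chunk about safety testing", [0.1, 0.9, 0.2]),
    ("chunk about licensing", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]
best = top_k(query_vec, toy_store, k=1)
print(best[0][0])  # chunk about open science
```

In the real pipeline, the embedding model from Step 3 produces the vectors for both the chunks and the query, and the retrieved chunks are added to the prompt sent to Llama.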

#### **3.4.2 - LangChain Q&A Retriever**
* ConversationalRetrievalChain
* Query the Source documents

In [None]:
# Use a Llama model hosted on Replicate
# temperature: Adjusts randomness of outputs; 0 is deterministic, values above 1 are highly random, and 0.75 is a good starting value
# top_p: When decoding text, samples from the smallest set of most likely tokens whose cumulative probability reaches top_p; lower to ignore less likely tokens
# max_new_tokens: Maximum number of tokens to generate. An English word is roughly 1-2 tokens on average
# model_ref was previously undefined in this scope; Llama 3 8b instruct is assumed here, matching the earlier Replicate helper
model_ref = "meta/meta-llama-3-8b-instruct"
llm_ref = Replicate(
    model=model_ref,
    model_kwargs={"temperature": 0.75, "top_p": 1, "max_new_tokens": 1000}
)

chain = ConversationalRetrievalChain.from_llm(llm_ref, document_vector.as_retriever(), return_source_documents=True)
md(f"{type(chain)}")
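
The `temperature` and `top_p` settings above shape how the next token is sampled. The sketch below shows the usual textbook formulation on a toy set of three logits: temperature rescales the logits before the softmax, and top-p (nucleus) filtering keeps only the most likely tokens up to a cumulative probability. It is an illustration under those assumptions, not Replicate's or Llama's actual decoding code, and the function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; temperature must be > 0.

    Lower temperatures sharpen the distribution, higher ones flatten it.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens reaching top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.1]
warm = softmax_with_temperature(logits, temperature=1.0)
cool = softmax_with_temperature(logits, temperature=0.25)
nucleus = top_p_filter(warm, top_p=0.8)
print([round(p, 3) for p in warm])
print([round(p, 3) for p in cool])
print(sorted(nucleus))  # indices of the tokens that survive top-p filtering
```

With `top_p` set to 1, as in the cell above, no tokens are filtered out and only the temperature affects sampling. A temperature of 0 cannot be used in this formula and is typically treated as always picking the most likely token.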

In [None]:
# Query against your own data
chat_history = []
query = "How is Meta approaching open science in two short sentences?"
with JobTimer("3.4.2 - LangChain Q&A Retriever [1/2]", rt=1, rb=1, rw=114, rc='=', verbose=1):
    result = chain.invoke({"question": query, "chat_history": chat_history})
html(result['answer'])

In [None]:
# This time your previous question and answer will be included as a chat history which will enable the ability
# to ask follow up questions.
chat_history = [(query, result["answer"])]
query = "How is it benefiting the world?"
with JobTimer("3.4.2 - LangChain Q&A Retriever [2/2]", rt=1, rb=1, rw=114, rc='=', verbose=1):
    result = chain.invoke({"question": query, "chat_history": chat_history})
html(result['answer'])

## **5 - Fine-Tuning Models**
* Limitations of Prompt Eng and RAG
* Fine-Tuning Arch
* Types (PEFT, LoRA, QLoRA)
* Using PyTorch for Pre-Training & Fine-Tuning
* Evals + Quality

In [None]:
fine_tuned_arch()
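
As a rough illustration of why LoRA is considered parameter-efficient, the sketch below compares the number of trainable values in a full weight update with those in a low-rank update, and applies a tiny low-rank update by hand. It follows the commonly described formulation W' = W + (alpha / r) * B A. The matrix sizes, values, and function names are assumptions chosen for illustration, and real fine-tuning would use a library such as PyTorch instead of Python lists.

```python
def full_update_params(d_out, d_in):
    """Trainable values when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_update_params(d_out, d_in, rank):
    """Trainable values for B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def apply_lora(W, B, A, alpha, rank):
    """Return W + (alpha / rank) * (B @ A); W itself stays frozen."""
    scale = alpha / rank
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


full = full_update_params(4096, 4096)
lora = lora_update_params(4096, 4096, rank=8)
print(full, lora, full // lora)  # 16777216 65536 256

W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]  # 2 x 1
A = [[0.5, 0.5]]    # 1 x 2
W_adapted = apply_lora(W, B, A, alpha=2, rank=1)
print(W_adapted)  # [[2.0, 1.0], [2.0, 3.0]]
```

For this hypothetical 4096 x 4096 layer, a rank-8 update trains 256 times fewer values than a full update, which is the main reason such methods fit on smaller hardware.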

## **6 - Responsible AI**
* Power + Responsibility
* Hallucinations
* Input & Output Safety
* Red-teaming (simulating real-world cyber attackers)
* [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/)
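
The input and output safety steps can be pictured as checks wrapped around the model call, following the flow of the `bot_arch` diagram defined earlier (prompt, input safety, context, model, output safety, memory). The sketch below is deliberately simplistic: the keyword blocklist, the `echo_model` stand-in, and all function names are hypothetical, and a real deployment would rely on a dedicated safety classifier instead of substring matching.

```python
BLOCKED_TERMS = {"password", "credit card"}  # hypothetical blocklist for illustration


def is_safe(text):
    """Very rough safety check: reject text containing any blocked term."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


def guarded_chat(user_prompt, model, memory):
    """Run one chat turn: input check, context, model call, output check, memory."""
    if not is_safe(user_prompt):
        return "Sorry, I can't help with that request."
    context = "\n".join(memory + [f"User: {user_prompt}"])
    output = model(context)
    if not is_safe(output):
        return "Sorry, I can't share that response."
    # Only turns that pass both checks are remembered in this simplified version.
    memory.append(f"User: {user_prompt}")
    memory.append(f"Assistant: {output}")
    return output


def echo_model(context):
    """Stand-in for a Llama 3 call; repeats the last line of the context."""
    return "You said: " + context.splitlines()[-1]


memory = []
reply = guarded_chat("Hello", echo_model, memory)
blocked = guarded_chat("What is my password?", echo_model, memory)
print(reply)
print(blocked)
```

Any of the earlier helpers, such as `llama3_8b`, could be passed in place of `echo_model`, since each takes a prompt string and returns the model's text.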

## **7 - Conclusion**
* Active research on LLMs and Llama
* Leverage the power of Llama and its open community
* Safety and responsible use is paramount!
* Call-To-Action
  * [Replicate Free Credits](https://replicate.fyi/connect2023) for Connect attendees!
  * This notebook is available through the Llama recipes on GitHub
  * Use Llama in your projects and give us feedback

#### **Resources**
- [GitHub - Llama 2](https://github.com/facebookresearch/llama)
- [GitHub - Llama 2 Recipes](https://github.com/facebookresearch/llama-recipes)
- [Llama 2](https://ai.meta.com/llama/)
- [Research Paper](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)
- [Model Card](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md)
- [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/)
- [Acceptable Use Policy](https://ai.meta.com/llama/use-policy/)
- [Replicate](https://replicate.com/meta/)
- [LangChain](https://www.langchain.com/)

#### **Authors & Contact**
  * asangani@meta.com, [Amit Sangani | LinkedIn](https://www.linkedin.com/in/amitsangani/)
  * mohsena@meta.com, [Mohsen Agsen | LinkedIn](https://www.linkedin.com/in/mohsen-agsen-62a9791/)
