# **Getting to know Llama 3: Everything you need to start building**
Our goal in this session is to provide a guided tour of Llama 3 and compare it with Llama 2. We cover the different Llama 3 models, how and where to access them, generative AI and chatbot architectures, prompt engineering, RAG (Retrieval Augmented Generation), fine-tuning, and more. Everything is implemented as starter code that you can take and use in your own Llama 3 projects.

### **0 - Prerequisites**
* Basic understanding of Large Language Models
* Basic understanding of Python

In [1]:
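import base64
import logging  # used explicitly below; do not rely on the star imports to provide it
import os  # used explicitly below for os.environ
from getpass import getpass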

from IPython.display import display, Image, Markdown, HTML

from chrisbase.data import *
from chrisbase.io import *
from chrisbase.util import *

In [2]:
def mm(graph):
    graphbytes = graph.encode("ascii")
    base64_bytes = base64.b64encode(graphbytes)
    base64_string = base64_bytes.decode("ascii")
    display(Image(url="https://mermaid.ink/img/" + base64_string))


def md(t):
    display(Markdown(t))


def hr():
    display(HTML("<hr style='border:3px solid MidnightBlue; width:880px; padding:0;"
                 " margin-left:0; margin-right:0; margin-top:10px; margin-bottom:15px;'>"))


def mds(ps: dict):
    for k, v in ps.items():
        hr()
        md(f"**[{k}]**\n{prefixed_str(v, '> ')}")
    if ps:
        hr()


def llama2_family():
    mm("""
    graph LR;
        llama-2 --> llama-2-7b
        llama-2 --> llama-2-13b
        llama-2 --> llama-2-70b
        llama-2-7b --> llama-2-7b-chat
        llama-2-13b --> llama-2-13b-chat
        llama-2-70b --> llama-2-70b-chat
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def llama3_family():
    mm("""
    graph LR;
        llama-3 --> llama-3-8b
        llama-3 --> llama-3-70b
        llama-3-8b --> llama-3-8b-instruct
        llama-3-70b --> llama-3-70b-instruct
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def rag_arch():
    mm("""
    flowchart TD
        A[User Prompts] --> B(Frameworks e.g. LangChain)
        B <--> |Database, Docs, XLS|C[fa:fa-database External Data]
        B -->|API|D[Llama 3]
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


## **1 - Understanding Llama 3**

### **1.1 - What is Llama 3?**
* State of the art (SOTA), Open Source LLM
* 8B, 70B - base and instruct models
* Choosing model: Size, Quality, Cost, Speed
* Pretrained + Chat
* [Meta Llama 3 Blog](https://ai.meta.com/blog/meta-llama-3/)
* [Getting Started with Meta Llama](https://llama.meta.com/docs/get-started)

In [3]:
llama2_family()

In [4]:
llama3_family()

### **1.2 - Accessing Llama 3**
* Download + Self Host (i.e. [download Llama](https://ai.meta.com/resources/models-and-libraries/llama-downloads))
* Hosted API Platform (e.g. [Groq](https://console.groq.com/), [Replicate](https://replicate.com/meta/meta-llama-3-8b-instruct), [Together](https://api.together.xyz/playground/language/meta-llama/Llama-3-8b-hf), [Anyscale](https://app.endpoints.anyscale.com/playground))
* Hosted Container Platform (e.g. [Azure](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/introducing-llama-2-on-azure/ba-p/3881233), [AWS](https://aws.amazon.com/blogs/machine-learning/llama-2-foundation-models-from-meta-are-now-available-in-amazon-sagemaker-jumpstart/), [GCP](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/139))

### **1.3 - Use Cases of Llama 3**
* Content Generation
* Summarization
* General Chatbots
* RAG (Retrieval Augmented Generation): Chat about Your Own Data
* Fine-tuning
* Agents

## **2 - Using and Comparing Llama 3 and Llama 2**
We will be using Llama 2 7b & 70b chat and Llama 3 8b & 70b instruct models hosted on [Replicate](https://replicate.com/search?query=llama) to run the examples here. 
You will first need to sign in to Replicate with your GitHub account, then create a free API token [here](https://replicate.com/account/api-tokens) that you can use for a while. 
You can also use other Llama 3 cloud providers such as [Groq](https://console.groq.com/), [Together](https://api.together.xyz/playground/language/meta-llama/Llama-3-8b-hf), or [Anyscale](https://app.endpoints.anyscale.com/playground).

### **2.1 - Install dependencies**

In [5]:
# List the installed versions of the required packages (install any that are missing with pip)
!pip list | grep -E "torch|faiss|langchain|transformers|huggingface|groq|replicate|chris"

chrisbase                 0.5.3
faiss-cpu                 1.8.0
groq                      0.9.0
huggingface-hub           0.23.4
langchain                 0.2.5
langchain-community       0.2.5
langchain-core            0.2.9
langchain-groq            0.1.5
langchain-huggingface     0.0.3
langchain-text-splitters  0.2.1
replicate                 0.26.0
sentence-transformers     3.0.1
torch                     2.3.1
transformers              4.41.2


### **2.2 - Create helpers for Llama 2 and Llama 3**
First, configure logging and the project arguments, then set your API tokens as environment variables.

In [6]:
logging.getLogger("IPKernelApp").setLevel(logging.INFO)
logger = logging.getLogger(__name__)
args = CommonArguments(
    env=ProjectEnv(
        project="LLM-based",
        job_name="LLaMA-based-Tutorial",
        msg_level=logging.INFO,
        msg_format=LoggingFormat.PRINT_00,
    )
)
args.dataframe()

Unnamed: 0,CommonArguments,value
0,tag,
1,env.project,LLM-based
2,env.job_name,LLaMA-based-Tutorial
3,env.job_version,
4,env.hostname,ChrisBookPro.local
5,env.hostaddr,192.168.0.2
6,env.time_stamp,0621.020355
7,env.python_path,/Users/chris/miniforge3/envs/LLM-based/bin/python3.11
8,env.current_dir,/Users/chris/PycharmProjects/LLM-based
9,env.current_file,/Users/chris/PycharmProjects/LLM-based/Getting_to_know_Llama_3.ipynb


Then load the API tokens and create the Llama 2 and Llama 3 helper functions. For chatbot-style apps, we'll use the Llama 3 instruct and Llama 2 chat models, not the base models.

In [7]:
GROQ_API_KEY = read_or(first_path_or("groq-tokens*")) or getpass()
REPLICATE_API_TOKEN = read_or(first_path_or("replicate-tokens*")) or getpass()
os.environ["GROQ_API_KEY"] = GROQ_API_KEY
os.environ["REPLICATE_API_TOKEN"] = REPLICATE_API_TOKEN
print(f'GROQ_API_KEY = {mask_str(os.environ.get("GROQ_API_KEY"), start=4, end=-4)}')
print(f'REPLICATE_API_TOKEN = {mask_str(os.environ.get("REPLICATE_API_TOKEN"), start=3, end=-3)}')

GROQ_API_KEY = gsk_************************************************2Uuw
REPLICATE_API_TOKEN = r8_**********************************MYA


In [8]:
import groq
import replicate

groq_client = groq.Groq(api_key=os.environ.get("GROQ_API_KEY"))
replicate_client = replicate.client.Client(api_token=os.environ.get("REPLICATE_API_TOKEN"))


def llm_chat_by_replicate(prompt, model):
    output = replicate_client.run(
        ref=model,
        input={"prompt": prompt}
    )
    return ''.join(output)


def llm_chat_by_groq(prompt, model):
    chat_completion = groq_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return chat_completion.choices[0].message.content


chat_functions = {
    "llama2_7b_by_replicate": lambda x: llm_chat_by_replicate(x, model="meta/llama-2-7b-chat"),
    "llama2_70b_by_replicate": lambda x: llm_chat_by_replicate(x, model="meta/llama-2-70b-chat"),
    "llama3_8b_by_replicate": lambda x: llm_chat_by_replicate(x, model="meta/meta-llama-3-8b-instruct"),
    "llama3_70b_by_replicate": lambda x: llm_chat_by_replicate(x, model="meta/meta-llama-3-70b-instruct"),
    # "llama3_8b_by_groq": lambda x: llm_chat_by_groq(x, model="llama3-8b-8192"),
    # "llama3_70b_by_groq": lambda x: llm_chat_by_groq(x, model="llama3-70b-8192"),
}
chat_functions

{'llama2_7b_by_replicate': <function __main__.<lambda>(x)>,
 'llama2_70b_by_replicate': <function __main__.<lambda>(x)>,
 'llama3_8b_by_replicate': <function __main__.<lambda>(x)>,
 'llama3_70b_by_replicate': <function __main__.<lambda>(x)>}
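
The cells that follow all repeat the same loop over `chat_functions`. As a minimal sketch, that loop could be wrapped in one helper that also times each call and keeps going when one hosted model fails. The name `run_on_all` and the stand-in functions are hypothetical; they are not part of the notebook or of any client library.

```python
import time
from typing import Callable, Dict


def run_on_all(prompt: str, functions: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Send one prompt to every chat function and collect the replies by name."""
    outputs = {}
    for name, function in functions.items():
        started = time.perf_counter()
        try:
            outputs[name] = function(prompt).strip()
        except Exception as error:  # keep going if one hosted model fails
            outputs[name] = f"[error] {error}"
        elapsed = time.perf_counter() - started
        print(f"{name}: {elapsed:.2f}s")
    return outputs


# Stand-in functions so the sketch runs without API tokens (hypothetical, not real models).
demo_functions = {
    "echo": lambda x: f"  you said: {x}",
    "broken": lambda x: 1 / 0,
}
demo_outputs = run_on_all("hello", demo_functions)
```

With real tokens set, the same helper could be called as `run_on_all(chat_prompt, chat_functions)`.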

In [9]:
from langchain.chains import ConversationalRetrievalChain
# With langchain 0.2.x, loaders, vector stores and splitters live in their own packages
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import FAISS
from langchain_groq import ChatGroq
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


def load_external_data(web_path):
    # Step 1: Load the document from a web url
    loader = WebBaseLoader(web_path)
    documents = loader.load()

    # Step 2: Split the document into chunks with a specified chunk size
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    all_splits = text_splitter.split_documents(documents)

    # Step 3: Store the document into a vector store with a specific embedding model
    vectorstore = FAISS.from_documents(all_splits, HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"))
    return vectorstore.as_retriever()


def llm_chat_with_retrieval(prompt, retriever, model, history=()):
    llm = ChatGroq(temperature=0, model_name=model)
    ret_chain = ConversationalRetrievalChain.from_llm(llm,
                                                      retriever,
                                                      return_source_documents=True)
    output = ret_chain.invoke({"question": prompt, "chat_history": history})
    return output["answer"]


langchain_functions = {
    "llama3_8b_by_langchain": lambda p, r, h=(): llm_chat_with_retrieval(prompt=p, retriever=r, history=h, model="llama3-8b-8192"),
    "llama3_70b_by_langchain": lambda p, r, h=(): llm_chat_with_retrieval(prompt=p, retriever=r, history=h, model="llama3-70b-8192"),
}
langchain_functions

USER_AGENT environment variable not set, consider setting it to identify your requests.


{'llama3_8b_by_langchain': <function __main__.<lambda>(p, r, h=())>,
 'llama3_70b_by_langchain': <function __main__.<lambda>(p, r, h=())>}
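
To make the three steps in `load_external_data` more concrete, here is a simplified, dependency-free sketch of chunking with overlap and similarity-based retrieval. It is an illustration under stated assumptions, not what LangChain does internally: `RecursiveCharacterTextSplitter` prefers to split on separators rather than at fixed character offsets, and the FAISS store searches dense vectors from the embedding model, whereas the `embed` function below is a toy word-count stand-in. All function names here are hypothetical.

```python
import math
from collections import Counter


def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    return sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)[:k]


demo_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
demo_docs = ["llamas eat grass and hay", "python is a programming language"]
demo_hit = retrieve("what do llamas eat", demo_docs, k=1)
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary still appears intact in at least one chunk.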

### **2.3 - Basic QA with Llama 2 and 3**

In [10]:
chat_prompt = "The typical color of a llama is: "
outputs = dict()
with JobTimer("2.3 - Basic QA with Llama 2 and 3 (1)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    for name, function in chat_functions.items():
        outputs[name] = function(chat_prompt)
mds(outputs)

[INIT] 2.3 - Basic QA with Llama 2 and 3 (1)
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-7b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/c798hg6z05rgm0cg6td9an9tbc "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-70b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/q0a819z341rgm0cg6td8kcz0z8 "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/q0a819z341rgm0cg6td8kcz0z8 "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/q0a819z341rgm0cg6td8kcz0z8 "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/kq4kxvqcy1rgj0cg6td8mnzw44 "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-70b-instruct/predic

**[llama2_7b_by_replicate]**
>  Great, I'm glad you're interested in learning about llamas! The typical color of a llama is actually gray. Llamas are known for their distinctive gray coats, which can range in shade from a light gray to a darker, more mottled gray. Some llamas may also have white or yellow markings on their faces or bodies, but overall, their typical color is gray. Is there anything else you'd like to know about llamas?

**[llama2_70b_by_replicate]**
>  The typical color of a llama is not a specific color, as llamas can come in a variety of colors and patterns. Some common colors include white, black, brown, gray, and various shades of red and tan. Some llamas may also have spotted or striped patterns. It's important to note that the color of a llama can vary depending on the individual animal and breed. Is there anything else I can help you with?

**[llama3_8b_by_replicate]**
> 
> 
> The typical color of a llama is white, but they can also come in a variety of other colors such as light brown, dark brown, gray, black, and even spotted or patchy patterns!

**[llama3_70b_by_replicate]**
> 
> 
> The typical color of a llama can vary, but they are often found in shades of:
> 
> * White
> * Brown
> * Gray
> * Black
> * Roan (a mix of white and dark hairs)
> * Fawn (a reddish-brown color)
> 
> Some llamas can also have unique markings, such as spots or patches, on their coats.

In [11]:
chat_prompt = "The typical color of a llama is what? Answer in one word."
outputs = dict()
with JobTimer("2.3 - Basic QA with Llama 2 and 3 (2)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    for name, function in chat_functions.items():
        outputs[name] = function(chat_prompt)
mds(outputs)

[INIT] 2.3 - Basic QA with Llama 2 and 3 (2)
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-7b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/tamm5ffvv1rgj0cg6td83eqe60 "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-70b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/rf2r5f7zvxrgp0cg6tdamsns10 "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/0rqvjpg3tdrgg0cg6tdv3vhwa4 "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-70b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/bjmb9eg7tdrj60cg6tds3929k4 "HTTP/1.1 200 OK"
[EXIT] 2.3 - Basic QA with Llama 2 and 3 (2) ($=00:00:04.036)


**[llama2_7b_by_replicate]**
>  Sure, I'd be happy to help! The typical color of a llama is gray.

**[llama2_70b_by_replicate]**
>  Brown

**[llama3_8b_by_replicate]**
> 
> 
> Brown.

**[llama3_70b_by_replicate]**
> 
> 
> Beige.

## **3 - Chat conversation**

### **3.1 - Single-turn chat**

In [12]:
chat_prompt = "What is the average lifespan of a Llama? Answer the question in few words."
outputs = dict()
with JobTimer("3.1 - Single-turn chat (1)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    for name, function in chat_functions.items():
        outputs[name] = function(chat_prompt)
mds(outputs)

[INIT] 3.1 - Single-turn chat (1)
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-7b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/5hpkbxgexhrgg0cg6tdsq5bb3g "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-70b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/pqpeb28jw9rgm0cg6tdvph3rar "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/xgga8p0pwnrgg0cg6tdtwzzkvg "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-70b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/fd6ty9rtrdrj00cg6tdrxt98br "HTTP/1.1 200 OK"
[EXIT] 3.1 - Single-turn chat (1) ($=00:00:03.989)


**[llama2_7b_by_replicate]**
>  The average lifespan of a llama is around 15-20 years.

**[llama2_70b_by_replicate]**
>  The average lifespan of a llama is 20-30 years.

**[llama3_8b_by_replicate]**
> 
> 
> 15-20 years on average.

**[llama3_70b_by_replicate]**
> 
> 
> 20-30 years.

In [13]:
# Example without previous context. LLMs are stateless and cannot resolve "they" without previous context.
chat_prompt = "What animal family are they? Answer the question in few words."
outputs = dict()
with JobTimer("3.1 - Single-turn chat (2)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    for name, function in chat_functions.items():
        outputs[name] = function(chat_prompt)
mds(outputs)

[INIT] 3.1 - Single-turn chat (2)
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-7b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/36dgty91v9rgm0cg6tdvntts04 "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-70b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/881ph4h5vdrgg0cg6tdsaw53fm "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/7xda9ns9wnrgg0cg6tdvnsjz3m "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-70b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/jzdsj1sds1rj00cg6tdtj2v3tg "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/jzdsj1sds1rj00cg6tdtj2v3tg 

**[llama2_7b_by_replicate]**
>  Of course, I'd be happy to help! The animal family that the question is referring to is... (pausing for a moment to ensure the answer is accurate) ...the Canidae family, which includes dogs, wolves, and foxes.

**[llama2_70b_by_replicate]**
>  They are primates.

**[llama3_8b_by_replicate]**
> 
> 
> The animal family is Canidae.

**[llama3_70b_by_replicate]**
> 
> 
> I'm happy to help! However, I need more context. Who are "they" referring to? Could you please provide more information or clarify your question?

**Note: In this run, only Llama 3 70b asked for the missing context; the other three models guessed (hallucinated) an answer.**

### **3.2 - Multi-turn chat**
Because the models are stateless, a chat app has to send the previous context to the LLM with every request in order to get valid responses. Below is an example of multi-turn chat.

In [14]:
# Example of multi-turn chat, with the previous context included in the prompt
chat_prompt = """
User: What is the average lifespan of a Llama?
Assistant: 15-20 years.
User: What animal family are they?
"""
outputs = dict()
with JobTimer("3.2 - Multi-turn chat (1)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    for name, function in chat_functions.items():
        outputs[name] = function(chat_prompt)
mds(outputs)

[INIT] 3.2 - Multi-turn chat (1)
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-7b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/g93q9m1qm1rgp0cg6tds86x8k4 "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-70b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/gpszsm1vjsrgj0cg6tdsbjz2zr "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/c6txp0szjhrgp0cg6tdswdn9p0 "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-70b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/q99bset3cxrj40cg6tdsxc7yzc "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/q99bset3cxrj40cg6tdsxc7yzc "

**[llama2_7b_by_replicate]**
>  Assistant: Thank you for your question! Llamas are actually not part of any specific animal family. They are members of the camel family (Camelidae) and are closely related to camels, alpacas, and vicuñas. So, the answer to your question is: the camel family.

**[llama2_70b_by_replicate]**
>  Assistant: Llamas are part of the Camelidae family, which includes camels and alpacas.

**[llama3_8b_by_replicate]**
> 
> 
> Llamas belong to the camelid family, which also includes camels and alpacas.

**[llama3_70b_by_replicate]**
> 
> 
> Llamas are part of the camelid family, which also includes camels, alpacas, guanacos, and vicuñas.

**Note: Llama 2 and Llama 3 both use the chat history well to answer follow-up questions.**
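
In an application, the transcript-style prompt used above would normally be assembled from stored turns rather than typed by hand. The sketch below shows two ways to do that. `build_transcript` and `build_messages` are hypothetical helper names. The message-list variant assumes the endpoint accepts an `assistant` role alongside the `user` role, as OpenAI-style chat APIs do.

```python
def build_transcript(history, question):
    """Render stored (user, assistant) turns plus the new question as one prompt string."""
    lines = []
    for user_text, assistant_text in history:
        lines.append(f"User: {user_text}")
        lines.append(f"Assistant: {assistant_text}")
    lines.append(f"User: {question}")
    return "\n".join(lines)


def build_messages(history, question):
    """Same history as a role/content message list (assumed OpenAI-style chat format)."""
    messages = []
    for user_text, assistant_text in history:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": question})
    return messages


history = [("What is the average lifespan of a Llama?", "15-20 years.")]
prompt = build_transcript(history, "What animal family are they?")
messages = build_messages(history, "What animal family are they?")
print(prompt)
```

After each reply, appending `(question, answer)` to `history` keeps the context growing turn by turn; a real app would also need to trim old turns to stay within the model's context window.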

In [15]:
# Example of multi-turn chat, with the previous context and an explicit output instruction
chat_prompt = """
User: What is the average lifespan of a Llama?
Assistant: Sure! The average lifespan of a llama is around 20-30 years.
User: What animal family are they?

Answer the question with one word.
"""
outputs = dict()
with JobTimer("3.2 - Multi-turn chat (2)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    for name, function in chat_functions.items():
        outputs[name] = function(chat_prompt)
mds(outputs)

[INIT] 3.2 - Multi-turn chat (2)
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-7b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/54z7r0teaxrgg0cg6tdvfn1yrw "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-70b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/0t9dwhtja5rgj0cg6tdtvhskjg "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/va3chvjp9xrgp0cg6tdt2w60j8 "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-70b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/zn2bp52t65rj40cg6tdtkat854 "HTTP/1.1 200 OK"
[EXIT] 3.2 - Multi-turn chat (2) ($=00:00:03.986)


**[llama2_7b_by_replicate]**
>  Sure! The animal family that llamas belong to is: Camelidae.

**[llama2_70b_by_replicate]**
>  Sure!
> Camelids

**[llama3_8b_by_replicate]**
> 
> 
> Camelid.

**[llama3_70b_by_replicate]**
> 
> 
> Camelid

**Note: In this run, both Llama 3 models follow the instruction "Answer the question with one word", Llama 2 70b adds a "Sure!" preamble before its one-word answer, and Llama 2 7b answers with a full sentence.**

### **3.3 - Prompt Engineering**
* Prompt engineering refers to the science of designing effective prompts to get desired responses
* Helps reduce hallucination

#### **3.3.1 - In-Context Learning (e.g. Zero-shot, Few-shot)**
 * In-context learning: a specific method of prompt engineering in which demonstrations of the task are provided as part of the prompt.
  1. Zero-shot learning: the model performs the task without any input examples.
  2. Few-shot or “N-shot” learning: the model performs the task based on input examples in the user's prompt.
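
The difference between the two settings comes down to how the prompt is assembled. As a minimal sketch, the hypothetical helper `build_few_shot_prompt` below produces a zero-shot prompt when given no examples and a few-shot prompt otherwise, using the same `Classify:` / `Sentiment:` layout as the cells in this section.

```python
def build_few_shot_prompt(examples, query, instruction="Give one word response."):
    """Build a sentiment prompt; an empty examples list yields a zero-shot prompt."""
    lines = []
    for text, label in examples:
        lines.append(f"Classify: {text}")
        lines.append(f"Sentiment: {label}")
    lines.append(f"Classify: {query}")
    lines.append("Sentiment:")  # left blank for the model to complete
    lines.append("")
    lines.append(instruction)
    return "\n".join(lines)


zero_shot = build_few_shot_prompt([], "I saw a Gecko.")
few_shot = build_few_shot_prompt(
    [("I love Llamas!", "Positive"), ("I dont like Snakes.", "Negative")],
    "I saw a Gecko.",
)
print(few_shot)
```

Keeping the examples in a list makes it easy to vary N and see how many demonstrations a given model needs before it settles on the expected label set.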

In [16]:
# Zero-shot example. To get positive/negative/neutral sentiment, we need to give examples in the prompt
chat_prompt = '''
Classify: I saw a Gecko.
Sentiment: ?

Give one word response.
'''
outputs = dict()
with JobTimer("3.3.1 - In-Context Learning (1)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    for name, function in chat_functions.items():
        outputs[name] = function(chat_prompt)
mds(outputs)

[INIT] 3.3.1 - In-Context Learning (1)
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-7b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/qe24brb17nrgm0cg6tdr734raw "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-70b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/2n1brvv56xrgj0cg6tdt7xj04g "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/txb3xrb95srgg0cg6tdtg84ay8 "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-70b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/6n5mgbkd19rj60cg6tdsrarv6w "HTTP/1.1 200 OK"
[EXIT] 3.3.1 - In-Context Learning (1) ($=00:00:03.986)


**[llama2_7b_by_replicate]**
>  Positive

**[llama2_70b_by_replicate]**
>  Neutral

**[llama3_8b_by_replicate]**
> 
> 
> Interesting

**[llama3_70b_by_replicate]**
> 
> 
> Neutral

**Note: Without examples, the models disagree: Llama 2 7b answers "Positive", Llama 3 8b answers "Interesting", and both 70b models answer "Neutral".**

In [17]:
# By giving examples to Llama, it understands the expected output format.
chat_prompt = '''
Classify: I love Llamas!
Sentiment: Positive
Classify: I dont like Snakes.
Sentiment: Negative
Classify: I saw a Gecko.
Sentiment:

Give one word response.
'''
outputs = dict()
with JobTimer("3.3.1 - In-Context Learning (2)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    for name, function in chat_functions.items():
        outputs[name] = function(chat_prompt)
mds(outputs)

[INIT] 3.3.1 - In-Context Learning (2)
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-7b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/p2wsw4vnc1rgj0cg6tdsrmzs0c "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-70b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/qbptpckscxrgp0cg6tdvzwyd4w "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/qjxvw5vxdxrgj0cg6tdvkwy80c "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-70b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/007t9s41c9rj20cg6tdt631ner "HTTP/1.1 200 OK"
[EXIT] 3.3.1 - In-Context Learning (2) ($=00:00:04.347)


**[llama2_7b_by_replicate]**
>  Of course! I'm here to help. Here are the classifications and sentiments for the three statements you provided:
> 
> Classify: I love Llamas!
> Sentiment: Positive
> 
> Classify: I don't like Snakes.
> Sentiment: Negative
> 
> Classify: I saw a Gecko.
> Sentiment: Neutral

**[llama2_70b_by_replicate]**
>  Neutral

**[llama3_8b_by_replicate]**
> 
> 
> Neutral

**[llama3_70b_by_replicate]**
> 
> 
> Neutral

**Note: With few-shot examples, all four models arrive at "Neutral", but Llama 2 7b does not follow the instruction to give a one-word response.**

#### **3.3.2 - Chain of Thought**
"Chain of thought" prompting elicits complex reasoning by having the model work through a problem step by step, which tends to produce more accurate and contextually relevant responses.

In [18]:
# Standard prompting
chat_prompt = '''
Llama started with 5 tennis balls. It buys 2 more cans of tennis balls. Each can has 3 tennis balls.
How many tennis balls does Llama have?

Answer in one word.
'''
outputs = dict()
with JobTimer("3.3.2 - Chain of Thought (1)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    for name, function in chat_functions.items():
        outputs[name] = function(chat_prompt)
mds(outputs)

[INIT] 3.3.2 - Chain of Thought (1)
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-7b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/3m747jw8dsrgp0cg6tdsezjq84 "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-70b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/arevx7ccdnrgj0cg6tdvz9hzqw "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/7hdj73cgd9rgp0cg6tdvfnsygm "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-70b-instruct/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/g2ddq64m9hrj40cg6tdv1j8yfr "HTTP/1.1 200 OK"
[EXIT] 3.3.2 - Chain of Thought (1) ($=00:00:04.010)


**[llama2_7b_by_replicate]**
>  Sure, I'd be happy to help! Based on the information provided, Llama has a total of 7 tennis balls.

**[llama2_70b_by_replicate]**
>  Sure! Here's my answer:
> 
> Seven

**[llama3_8b_by_replicate]**
> 
> 
> Seven.

**[llama3_70b_by_replicate]**
> 
> 
> Eleven

**Note: Only Llama 3 70b got the right answer (eleven). Llama 3 8b, like both Llama 2 models, answered seven, probably because the one-word constraint left no room to reason step by step.**
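
For reference, the arithmetic behind the expected answer can be written out explicitly. The last line shows one way to arrive at "seven": adding the number of cans as if they were balls. That explanation is an inference from the outputs, not something the models state.

```python
initial_balls = 5
cans_bought = 2
balls_per_can = 3

new_balls = cans_bought * balls_per_can      # 2 * 3 = 6
total_balls = initial_balls + new_balls      # 5 + 6 = 11 (correct answer)
naive_total = initial_balls + cans_bought    # 5 + 2 = 7 (counts cans as balls)

print(total_balls, naive_total)
```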

In [19]:
# By default, Llama 3 models follow "Chain-Of-Thought" prompting
chat_prompt = '''
Llama started with 5 tennis balls. It buys 2 more cans of tennis balls. Each can has 3 tennis balls.
How many tennis balls does Llama have?
'''
outputs = dict()
with JobTimer("3.3.2 - Chain of Thought (2)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    for name, function in chat_functions.items():
        outputs[name] = function(chat_prompt)
mds(outputs)

[INIT] 3.3.2 - Chain of Thought (2)
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-7b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/a72f3a4vd9rgg0cg6tdt0cfhtm "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/a72f3a4vd9rgg0cg6tdt0cfhtm "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-70b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/2mqtzq52bdrgj0cg6tdrqk0hv8 "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/2mqtzq52bdrgj0cg6tdrqk0hv8 "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/2mqtzq52bdrgj0cg6tdrqk0hv8 "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/2mqtzq52bdrgj0cg6tdrqk0hv8 "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Create

**[llama2_7b_by_replicate]**
>  Great question! I'm happy to help you with this puzzle.
> 
> Let's break it down step by step:
> 
> 1. Llama started with 5 tennis balls.
> 2. Llama buys 2 more cans of tennis balls.
> 3. Each can contains 3 tennis balls.
> 
> So, if Llama buys 2 more cans of tennis balls, that means they will have:
> 
> 5 + 2 x 3 = 19 tennis balls
> 
> Therefore, Llama has 19 tennis balls in total! 🎾🎾🐃

**[llama2_70b_by_replicate]**
>  Sure, I'd be happy to help you with that!
> 
> So, Llama started with 5 tennis balls, and then it bought 2 more cans of tennis balls. Each can has 3 tennis balls, so if we multiply 2 cans by 3 tennis balls per can, we get 6 tennis balls.
> 
> Now, we need to add the 6 tennis balls that Llama just bought to the 5 tennis balls that Llama already had. That gives us a total of 11 tennis balls!
> 
> So, to answer your question, Llama

**[llama3_8b_by_replicate]**
> 
> 
> I'd be happy to help you with that!
> 
> Llama started with 5 tennis balls. Then, it buys 2 more cans of tennis balls, and each can has 3 tennis balls. So, Llama buys a total of 2 x 3 = 6 tennis balls.
> 
> Adding the 6 new tennis balls to the 5 Llama already had, Llama now has a total of:
> 
> 5 + 6 = 11
> 
> So, Llama has 11 tennis balls!

**[llama3_70b_by_replicate]**
> 
> 
> Let's break it down step by step!
> 
> Llama started with 5 tennis balls.
> 
> It buys 2 more cans of tennis balls, and each can has 3 tennis balls. So, Llama gets 2 x 3 = 6 new tennis balls.
> 
> To find the total number of tennis balls Llama has, we add the new tennis balls to the ones it already had: 5 (initial tennis balls) + 6 (new tennis balls) = 11
> 
> Therefore, Llama now has 11 tennis balls!

**Note: By default, Llama 3 models recognize word problems and solve them step by step!**

In [20]:
chat_prompt = """
15 of us want to go to a restaurant.
Two of them have cars
Each car can seat 5 people.
Two of us have motorcycles.
Each motorcycle can fit 2 people.
Can we all get to the restaurant by car or motorcycle?
Think step by step.
Provide the answer as a single yes/no answer first.
Then explain each intermediate step.
"""
outputs = dict()
with JobTimer("3.3.2 - Chain of Thought (3)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    for name, function in chat_functions.items():
        outputs[name] = function(chat_prompt)
mds(outputs)

[INIT] 3.3.2 - Chain of Thought (3)
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-7b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/0yjaeny7qsrgp0cg6tdv6jbxnm "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/0yjaeny7qsrgp0cg6tdv6jbxnm "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/llama-2-70b-chat/predictions "HTTP/1.1 201 Created"
HTTP Request: GET https://api.replicate.com/v1/predictions/18g5apeennrgp0cg6tdvqpy70g "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/18g5apeennrgp0cg6tdvqpy70g "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/18g5apeennrgp0cg6tdvqpy70g "HTTP/1.1 200 OK"
HTTP Request: GET https://api.replicate.com/v1/predictions/18g5apeennrgp0cg6tdvqpy70g "HTTP/1.1 200 OK"
HTTP Request: POST https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct/predictions "HTTP/1.1 201 Create

**[llama2_7b_by_replicate]**
>  Yes, it is possible for all 15 people to get to the restaurant by car or motorcycle. Here's how:
> 
> Step 1: Two people with cars can transport 10 people (5 people per car) to the restaurant.
> Step 2: Two people with motorcycles can transport 2 people (2 people per motorcycle) to the restaurant.
> Step 3: The remaining 3 people can be transported by the two people with cars, or by the two motorcycles (if they are available).
> 
> Therefore, the answer is yes, all 15 people can get to the restaurant by car or motorcycle.

**[llama2_70b_by_replicate]**
>  No, we cannot all get to the restaurant by car or motorcycle.
> 
> Let's break this down step by step:
> 
> 1. Two people have cars.
> 2. Each car can seat 5 people.
> 3. Therefore, the two cars can seat a total of 10 people.
> 4. Two people have motorcycles.
> 5. Each motorcycle can fit 2 people.
> 6. Therefore, the two motorcycles can fit a total of 4 people.
> 7. In total, we have 15 people who want to go to the restaurant.
> 

**[llama3_8b_by_replicate]**
> 
> 
> **Yes**
> 
> Here's the step-by-step explanation:
> 
> 1. We have 15 people who want to go to the restaurant.
> 2. Two people have cars, which can seat 5 people each. This means each car can take 5 people to the restaurant.
> 3. So, we can take 2 x 5 = 10 people to the restaurant using the cars.
> 4. We still have 15 - 10 = 5 people left who haven't been accounted for yet.
> 5. Two people have motorcycles, which can seat 2 people each. This means each motorcycle can take 2 people to the restaurant.
> 6. We can take 2 x 2 = 4 people to the restaurant using the motorcycles.
> 7. We still have 5 - 4 = 1 person left who hasn't been accounted for yet.
> 8. Since we only have 1 person left, we can accommodate them by pairing them with one of the people who already has a car or motorcycle.
> 9. Therefore, we can take all 15 people to the restaurant using the available cars and motorcycles.

**[llama3_70b_by_replicate]**
> 
> 
> **Answer:** NO
> 
> Here's the step-by-step explanation:
> 
> 1. We have 15 people who want to go to the restaurant.
> 2. We have 2 cars, each of which can seat 5 people. This means we can transport a total of 2 x 5 = 10 people by car.
> 3. We still have 15 - 10 = 5 people left who need transportation.
> 4. We have 2 motorcycles, each of which can fit 2 people. This means we can transport a total of 2 x 2 = 4 people by motorcycle.
> 5. We still have 5 - 4 = 1 person left who needs transportation, but we don't have any more vehicles to accommodate them.
> 
> Therefore, we cannot transport all 15 people to the restaurant by car or motorcycle.

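**Note: The Llama 3 70B model answers correctly in this example: the vehicles seat only 2 x 5 + 2 x 2 = 14 people, one short of 15. The Llama 3 8B model incorrectly answers yes.**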
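
The expected answer can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers from the prompt above; the variable names are illustrative, and it assumes the vehicle owners are part of the group of 15.

```python
# Numbers taken directly from the prompt above.
people = 15
cars, seats_per_car = 2, 5
motorcycles, seats_per_motorcycle = 2, 2

# Total seats across all vehicles. The owners are assumed to be part
# of the group of 15, so they occupy seats like everyone else.
capacity = cars * seats_per_car + motorcycles * seats_per_motorcycle

everyone_fits = capacity >= people
print(f"capacity={capacity}, people={people}, everyone_fits={everyone_fits}")
```

With 14 seats for 15 people, the correct single-word answer is "no".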

**Summary: Llama 2 often needs to be encouraged to think step by step in order to reason correctly. Llama 3 understands, reasons, and explains better, making chain of thought unnecessary in the cases above.**

### **3.4 - Retrieval Augmented Generation (RAG)**
* Prompt engineering limitations: knowledge cutoff and lack of specialized data
* Retrieval Augmented Generation (RAG) lets us retrieve snippets of information from external data sources and add them to the user's prompt to get tailored responses from Llama.

For our demo, we are going to load an external document from a URL and query its content to get contextually relevant information back with the help of Llama!
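
To make the retrieve-then-augment idea concrete, here is a minimal, dependency-free sketch. It is not how the demo below is implemented: a simple bag-of-words cosine similarity stands in for a real embedding model and vector store, and the function names (`retrieve`, `build_rag_prompt`) and the toy snippets are assumptions made for illustration.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, snippets, k=2):
    """Return the k snippets most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(snippets, key=lambda s: cosine(q, Counter(tokenize(s))), reverse=True)
    return ranked[:k]


def build_rag_prompt(query, snippets, k=2):
    """Augment the user's question with the retrieved snippets."""
    context = "\n".join(f"- {s}" for s in retrieve(query, snippets, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical toy corpus standing in for an external data source.
snippets = [
    "Llama 3 comes in two sizes: 8B and 70B parameters.",
    "The restaurant opens at noon on weekdays.",
    "Llama 3 uses a tokenizer with a vocabulary of 128,256 tokens.",
]
prompt = build_rag_prompt("What sizes does Llama 3 come in?", snippets, k=1)
print(prompt)
```

The augmented prompt, rather than the bare question, is what gets sent to the model, so the answer can draw on data the model was never trained on.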

In [21]:
rag_arch()

#### **3.4.1 - LangChain**
LangChain is a framework that helps make it easier to implement RAG.

In [22]:
with JobTimer("3.4.1 - LangChain", rt=1, rb=1, rw=114, rc='=', verbose=1):
    about_llama3 = load_external_data(["https://huggingface.co/blog/llama3"])

[INIT] 3.4.1 - LangChain


Use pytorch device_name: mps
Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2




Loading faiss.
Successfully loaded faiss.
[EXIT] 3.4.1 - LangChain ($=00:00:05.069)


#### **3.4.2 - LangChain Q&A Retriever**
* ConversationalRetrievalChain
* Query the Source documents

In [23]:
# no chat history passed
chat_history = {name: [] for name in langchain_functions.keys()}
query = "What’s new with Llama 3?"
outputs = dict()
with JobTimer("3.4.2 - LangChain Q&A Retriever (1)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    for name, function in langchain_functions.items():
        outputs[name] = function(query, about_llama3)
        chat_history[name].append((query, outputs[name]))
mds(outputs)

[INIT] 3.4.2 - LangChain Q&A Retriever (1)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
[EXIT] 3.4.2 - LangChain Q&A Retriever (1) ($=00:00:02.978)


**[llama3_8b_by_langchain]**
> According to the provided context, the main changes in Llama 3 compared to Llama 2 are:
> 
> 1. A new tokenizer that expands the vocabulary size to 128,256 tokens, allowing for more efficient encoding of text and potentially stronger multilingualism.
> 2. The introduction of two model sizes: 8B for efficient deployment and development on consumer-size GPU, and 70B for large-scale AI native applications.
> 3. The availability of base and instruction-tuned variants for each model size.
> 4. The release of Llama Guard 2, a safety fine-tuned version of Llama 3 8B.
> 
> These changes aim to improve the performance and versatility of the Llama 3 models, making them more suitable for a wider range of applications.

**[llama3_70b_by_langchain]**
> According to the text, the new features of Llama 3 include:
> 
> * 4 new open LLM models, with 2 sizes (8B and 70B parameters) and 2 variants (base and instruct-tuned)
> * A new tokenizer that expands the vocabulary size to 128,256 (from 32K tokens in Llama 2)
> * A larger vocabulary that can encode text more efficiently and potentially yield stronger multilingualism
> * A larger embedding input and output matrices, which accounts for a good portion of the parameter count increase
> * A new version of Llama Guard, fine-tuned on Llama 3 8B and released as Llama Guard 2 (safety fine-tune)
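
The `huggingface/tokenizers` warning in the log above names an environment variable that can be set explicitly. A minimal sketch, assuming it is run before the tokenizer is first used (for example at the top of the notebook):

```python
import os

# Set this before the tokenizer is first used.
# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
```

Choosing `false` trades tokenizer parallelism for a quieter log; whether that matters depends on how much tokenization the workload does.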

In [24]:
# This time the previous question and answer are passed as chat history, which makes it possible to ask follow-up questions.
query = "What two sizes?"
outputs = dict()
with JobTimer("3.4.2 - LangChain Q&A Retriever (2)", rt=1, rb=1, rw=114, rc='=', verbose=1):
    for name, function in langchain_functions.items():
        outputs[name] = function(query, about_llama3, chat_history[name])
        chat_history[name].append((query, outputs[name]))
mds(outputs)

[INIT] 3.4.2 - LangChain Q&A Retriever (2)
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
[EXIT] 3.4.2 - LangChain Q&A Retriever (2) ($=00:00:05.509)


**[llama3_8b_by_langchain]**
> According to the context, the two model sizes introduced in Llama 3 are:
> 
> 1. 8B
> 2. 70B

**[llama3_70b_by_langchain]**
> The two sizes of the new open LLM models in Llama 3 are 8B and 70B parameters.
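
A follow-up such as "What two sizes?" only makes sense together with the earlier exchange. A common approach, and the one LangChain's `ConversationalRetrievalChain` takes when the history is non-empty, is to first ask the model to rewrite the follow-up as a standalone question and then run retrieval on that. The log above shows four requests for the second query versus two for the first, which would be consistent with one extra rephrasing call per model. The sketch below only builds such a rephrasing prompt; the function names and prompt wording are assumptions, not LangChain's actual template.

```python
def format_history(chat_history):
    """Render (question, answer) tuples as a plain-text transcript."""
    return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)


def build_condense_prompt(follow_up, chat_history):
    """Build a prompt asking a model to rewrite a follow-up as a standalone question.

    With an empty history there is nothing to resolve, so the follow-up
    is returned unchanged and no extra model call is needed.
    """
    if not chat_history:
        return follow_up
    return (
        "Given the conversation below, rephrase the follow-up question "
        "as a standalone question.\n\n"
        f"{format_history(chat_history)}\n\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone question:"
    )


# Same shape as chat_history[name] in the cells above: a list of (query, answer) tuples.
history = [("What's new with Llama 3?", "Llama 3 introduces two model sizes, 8B and 70B.")]
print(build_condense_prompt("What two sizes?", history))
```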

## **4 - Fine-Tuning Models**
* Limitations of prompt engineering and RAG
* Fine-tuning architecture
* Types (PEFT, LoRA, QLoRA)
* Using PyTorch for Pre-Training & Fine-Tuning
* Evals + Quality

* Examples of Fine-Tuning:
  * [Meta Llama Recipes](https://github.com/meta-llama/llama-recipes/tree/main/recipes/finetuning)
  * [Hugging Face fine-tuning with Llama 3](https://huggingface.co/blog/llama3#fine-tuning-with-%F0%9F%A4%97-trl)
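
To give a feel for why parameter-efficient methods such as LoRA are attractive, the sketch below compares the number of trainable parameters for one weight matrix. LoRA freezes the original `d_out x d_in` matrix and trains a low-rank update `B @ A`, with `B` of shape `d_out x rank` and `A` of shape `rank x d_in`. The matrix sizes and rank used here are illustrative assumptions, not a statement about any particular Llama 3 configuration.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for a rank-r LoRA update B @ A,
    where B is d_out x rank and A is rank x d_in."""
    return rank * (d_out + d_in)


# Illustrative sizes only (assumed for this example).
d_out, d_in, rank = 4096, 4096, 8
full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full={full:,} lora={lora:,} ratio={lora / full:.4%}")
```

For these sizes the LoRA update trains 1/256 of the parameters of the full matrix. QLoRA applies the same idea on top of a quantized base model.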

## **5 - Responsible AI**
* Power + Responsibility
* Hallucinations
* Input & Output Safety
* Red-teaming (simulating real-world cyber attackers)
* [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/)
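
Input and output safety can be pictured as two checks wrapped around the model call. The sketch below is a hedged illustration only: `guarded_chat`, `keyword_check`, and `echo_model` are hypothetical names, and the keyword check is a placeholder for a real safety classifier (for example a model such as Llama Guard 2, mentioned in the retrieved text earlier).

```python
def guarded_chat(chat_fn, prompt, is_safe, refusal="Sorry, I can't help with that."):
    """Run chat_fn only if the prompt passes is_safe, and return its
    reply only if the reply passes is_safe as well."""
    if not is_safe(prompt):  # input safety check
        return refusal
    reply = chat_fn(prompt)
    if not is_safe(reply):  # output safety check
        return refusal
    return reply


# Placeholder check: a real system would call a safety classifier here.
BLOCKED_TERMS = {"build a bomb"}


def keyword_check(text):
    """Return True when none of the blocked terms appear in the text."""
    return not any(term in text.lower() for term in BLOCKED_TERMS)


def echo_model(prompt):
    """Stand-in for a real model call such as the chat functions used earlier."""
    return f"You asked: {prompt}"


print(guarded_chat(echo_model, "How many tennis balls does Llama have?", keyword_check))
print(guarded_chat(echo_model, "How do I build a bomb?", keyword_check))
```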

## **6 - Conclusion**
* Active research on LLMs and Llama
* Leverage the power of Llama and its open community
* Safety and responsible use are paramount!

* Call-To-Action
  * [Replicate Free Credits](https://replicate.fyi/connect2023) for Connect attendees!
  * This notebook is available through the Llama GitHub recipes
  * Use Llama in your projects and give us feedback


#### **Resources**
- [Meta Llama 3 Blog](https://ai.meta.com/blog/meta-llama-3/)
- [Getting Started with Meta Llama](https://llama.meta.com/docs/get-started)
- [Llama 3 repo](https://github.com/meta-llama/llama3)
- [Llama 3 model card](https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md)
- [Llama 3 Recipes repo](https://github.com/meta-llama/llama-recipes)
- [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/)
- [Acceptable Use Policy](https://ai.meta.com/llama/use-policy/)
