# **Getting to know Llama 2: Everything you need to start building**
Our goal in this session is to provide a guided tour of Llama 2: the different Llama 2 models, how and where to access them, Generative AI and chatbot architectures, prompt engineering, RAG (Retrieval Augmented Generation), fine-tuning, and more. Everything is implemented as starter code that you can take and use in your own Llama 2 projects.

## **0 - Prerequisites**
* Basic understanding of Large Language Models
* Basic understanding of Python

In [1]:
import base64
import logging
import os
from getpass import getpass

import replicate
from IPython.display import display, Image, Markdown, HTML

from chrisbase.data import *
from chrisbase.io import *
from chrisbase.util import *


def mm(graph):
    graphbytes = graph.encode("ascii")
    base64_bytes = base64.b64encode(graphbytes)
    base64_string = base64_bytes.decode("ascii")
    display(Image(url="https://mermaid.ink/img/" + base64_string))


def md(t):
    display(Markdown(t))


def html(t,
         s="<div style='font-family:Arial; font-size:12pt;"
           " padding:10px; margin-left:10px; margin-top:10px;"
           " background-color:LightSkyBlue; border:3px solid MidnightBlue;'>",
         e="</div>"):
    display(HTML(s + t + e))


def genai_app_arch():
    mm("""
    flowchart TD
        A[Users] --> B(Applications e.g. mobile, web)
        B --> |Hosted API|C(Platforms e.g. Custom, HuggingFace, Replicate)
        B -- optional --> E(Frameworks e.g. LangChain)
        C-->|User Input|D[Llama 2]
        D-->|Model Output|C
        E --> C
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def rag_arch():
    mm("""
    flowchart TD
        A[User Prompts] --> B(Frameworks e.g. LangChain)
        B <--> |Database, Docs, XLS|C[fa:fa-database External Data]
        B -->|API|D[Llama 2]
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def llama2_family():
    mm("""
    graph LR;
        llama-2 --> llama-2-7b
        llama-2 --> llama-2-13b
        llama-2 --> llama-2-70b
        llama-2-7b --> llama-2-7b-chat
        llama-2-13b --> llama-2-13b-chat
        llama-2-70b --> llama-2-70b-chat
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def apps_and_llms():
    mm("""
    graph LR;
        users --> apps
        apps --> frameworks
        frameworks --> platforms
        platforms --> Llama_2
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def bot_arch():
    mm("""
    graph LR;
        user --> prompt
        prompt --> i_safety
        i_safety --> context
        context --> Llama_2
        Llama_2 --> output
        output --> o_safety
        i_safety --> memory
        o_safety --> memory
        memory --> context
        o_safety --> user
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def fine_tuned_arch():
    mm("""
    graph LR;
        Custom_Dataset --> Pre-trained_Llama
        Pre-trained_Llama --> Fine-tuned_Llama
        Fine-tuned_Llama --> RLHF
        RLHF --> |Loss:Cross-Entropy|Fine-tuned_Llama
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def load_data_faiss_arch():
    mm("""
    graph LR;
        documents --> textsplitter
        textsplitter --> embeddings
        embeddings --> vectorstore
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


def mem_context():
    mm("""
    graph LR
        context(text)
        user_prompt --> context
        instruction --> context
        examples --> context
        memory --> context
        context --> tokenizer
        tokenizer --> embeddings
        embeddings --> LLM
        classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
    """)


# text completion with input prompt
def Completion(prompt):
    output = replicate.run(
        model_ref,
        input={"prompt": prompt,
               "max_new_tokens": 1000}
    )
    return "".join(output)


# chat completion with input prompt and system prompt
def ChatCompletion(prompt, system_prompt=None):
    output = replicate.run(
        model_ref,
        input={"system_prompt": system_prompt,
               "prompt": prompt,
               "max_new_tokens": 1000}
    )
    return "".join(output)

## **1 - Understanding Llama 2**

### **1.1 - What is Llama 2?**
* State of the art (SOTA), Open Source LLM
* 7B, 13B, 70B
* Pretrained + Chat
* Choosing model: Size, Quality, Cost, Speed
* [Research paper](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)
* [Responsible use guide](https://ai.meta.com/llama/responsible-use-guide/)
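
When weighing size against cost and speed, a rough first estimate is the memory needed just to hold the weights: parameter count times bytes per parameter. The sketch below assumes the nominal 7B/13B/70B parameter counts and three common precisions (16-bit, 8-bit, 4-bit). It ignores activations, the KV cache, and runtime overhead, so real requirements are higher.

```python
# Rough weight-only memory estimate: parameters x bytes per parameter.
# Assumption: nominal parameter counts; activations and KV cache are ignored.
PARAMS = {"llama-2-7b": 7e9, "llama-2-13b": 13e9, "llama-2-70b": 70e9}
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Return approximate weight memory in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


for name, n in PARAMS.items():
    row = ", ".join(f"{p}: {weight_memory_gb(n, p):.1f} GB" for p in BYTES_PER_PARAM)
    print(f"{name:12s} {row}")
```

The names `PARAMS`, `BYTES_PER_PARAM`, and `weight_memory_gb` are illustrative and not part of any library.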

In [2]:
llama2_family()

### **1.2 - Accessing Llama 2**
* Download + Self Host (on-premise)
* Hosted API Platform (e.g. [Replicate](https://replicate.com/meta))
* Hosted Container Platform (e.g. [Azure](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/introducing-llama-2-on-azure/ba-p/3881233), [AWS](https://aws.amazon.com/blogs/machine-learning/llama-2-foundation-models-from-meta-are-now-available-in-amazon-sagemaker-jumpstart/), [GCP](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/139))

### **1.3 - Use Cases of Llama 2**
* Content Generation
* Chatbots
* Summarization
* Programming (e.g. Code Llama)
* and many more...

## **2 - Using Llama 2**
In this notebook, we access the [Llama 2 13B chat model](https://replicate.com/meta/llama-2-13b-chat) through the hosted API from Replicate.

### **2.1 - Install dependencies**

In [3]:
# List the installed versions of the packages this notebook depends on
!pip list | grep -E "torch|transformers|chris|langchain|replicate|faiss|jupyter"

chrisbase                 0.5.2
faiss-cpu                 1.8.0
jupyter                   1.0.0
jupyter_client            8.6.1
jupyter-console           6.6.3
jupyter_core              5.7.2
jupyter-events            0.10.0
jupyter-lsp               2.2.5
jupyter_server            2.14.0
jupyter_server_terminals  0.5.3
jupyterlab                4.1.8
jupyterlab_pygments       0.3.0
jupyterlab_server         2.27.1
jupyterlab_widgets        3.0.10
langchain                 0.1.16
langchain-community       0.0.34
langchain-core            0.1.46
langchain-text-splitters  0.0.1
replicate                 0.25.2
sentence-transformers     2.7.0
torch                     2.3.0
transformers              4.40.1


In [4]:
# Model reference on the Replicate platform that we will use for inference.
# We will use the Llama 2 13B chat model hosted on Replicate.
agent_name = "llama-2-13b-chat"
model_ref = "meta/llama-2-13b-chat:f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d"

In [5]:
# We will use the Replicate hosted cloud environment.
# Obtain a Replicate API token at https://replicate.com/account/api-tokens

# Enter your Replicate API token
REPLICATE_API_TOKEN = read_or(first_path_or(".replicate*")) or getpass()
os.environ["REPLICATE_API_TOKEN"] = REPLICATE_API_TOKEN

logging.getLogger("IPKernelApp").setLevel(logging.INFO)
logger = logging.getLogger(__name__)
args = CommonArguments(
    env=ProjectEnv(
        project="LLM-based",
        job_name="LLaMA-2-13B-Chat",
        msg_level=logging.INFO,
        msg_format=LoggingFormat.BRIEF_00,
    )
)
args.dataframe()

|   | CommonArguments | value |
|---|---|---|
| 0 | tag | |
| 1 | env.project | LLM-based |
| 2 | env.job_name | LLaMA-2-13B-Chat |
| 3 | env.job_version | |
| 4 | env.hostname | ChrisBookPro.local |
| 5 | env.hostaddr | 172.20.10.5 |
| 6 | env.time_stamp | 0501.134159 |
| 7 | env.python_path | /Users/chris/miniforge3/envs/LLM-based/bin/python3.11 |
| 8 | env.current_dir | /Users/chris/proj/LLM-based |
| 9 | env.current_file | /Users/chris/proj/LLM-based/Getting_to_know_Llama_2.ipynb |


### **2.2 - Basic completion**

In [6]:
with JobTimer("2.2 - Basic completion", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = Completion(prompt="The typical color of a llama is: ")
html(output)

[05.01 13:42:00] ┇ [INIT] 2.2 - Basic completion
[05.01 13:42:02] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 13:42:03] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 13:42:04] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/rvywavx65nrgp0cf69tv5h88j4 "HTTP/1.1 200 OK"
[05.01 13:42:04] ┇ [EXIT] 2.2 - Basic completion ($=00:00:03.571)


### **2.3 - System prompts**

In [7]:
with JobTimer("2.3 - System prompts", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = ChatCompletion(
        prompt="The typical color of a llama is: ",
        system_prompt="respond with only one word",
    )
html(output)

[05.01 13:42:05] ┇ [INIT] 2.3 - System prompts
[05.01 13:42:06] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 13:42:06] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 13:42:07] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/gx756j5kpdrgm0cf69trjcy20m "HTTP/1.1 200 OK"
[05.01 13:42:07] ┇ [EXIT] 2.3 - System prompts ($=00:00:02.431)


### **2.4 - Response formats**
* Llama 2 can produce output in different formats, e.g. plain text or JSON.

In [8]:
with JobTimer("2.4 - Response formats", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = ChatCompletion(
        prompt="The typical color of a llama is: ",
        system_prompt="response in json format",
    )
html(output)

[05.01 13:42:08] ┇ [INIT] 2.4 - Response formats
[05.01 13:42:09] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 13:42:09] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 13:42:10] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/tyvs8qe0fsrgg0cf69tt35pkmm "HTTP/1.1 200 OK"
[05.01 13:42:10] ┇ [EXIT] 2.4 - Response formats ($=00:00:01.628)
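
A model asked for JSON often wraps the object in extra prose, so it helps to extract and validate the JSON before using it. The following is a minimal sketch using only the standard library. The helper name `extract_json` and the sample reply are hypothetical, and the reply is not the actual output of the cell above.

```python
import json
import re


def extract_json(text: str):
    """Return the first JSON object found in `text`, or None if parsing fails."""
    # Greedy match from the first "{" to the last "}" so nested objects survive.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


# Hypothetical model reply, for illustration only.
reply = 'Sure! Here is the answer:\n{"animal": "llama", "typical_color": "brown"}\nHope that helps.'
print(extract_json(reply))
```

Returning `None` on failure lets the caller decide whether to retry the request or fall back to plain text.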


## **3 - Gen AI Application Architecture**

Here is the high-level tech stack and architecture of a generative AI application.

In [9]:
genai_app_arch()

## **4 - Chatbot Architecture**

Here are the key components and the information flow in a chatbot.
* User Prompts
* Input Safety
* Llama 2
* Output Safety
* Memory & Context

In [10]:
bot_arch()
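
The flow in the diagram can be sketched as a single function per turn: check the input, assemble context from memory, call the model, check the output, then update memory. Everything here is an assumption made for illustration: the keyword blocklist is a toy placeholder for a real safety layer, and `echo_llm` stands in for a real model call such as `ChatCompletion`.

```python
from typing import Callable, List

BLOCKED_TERMS = {"password", "credit card"}  # placeholder policy, illustration only


def is_safe(text: str) -> bool:
    """Toy safety check: reject text containing any blocked term."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


def chat_turn(user_prompt: str, memory: List[str], llm: Callable[[str], str]) -> str:
    """Run one turn: input safety -> context -> model -> output safety -> memory."""
    if not is_safe(user_prompt):
        return "Sorry, I can't help with that request."
    context = "\n".join(memory + [f"User: {user_prompt}"])
    output = llm(context)
    if not is_safe(output):
        return "Sorry, I can't share that response."
    # Only turns that passed both checks are remembered.
    memory.append(f"User: {user_prompt}")
    memory.append(f"Assistant: {output}")
    return output


def echo_llm(context: str) -> str:
    """Stand-in for a real model call."""
    return f"(reply to {len(context.splitlines())} line(s) of context)"


memory: List[str] = []
print(chat_turn("What is the average lifespan of a llama?", memory, echo_llm))
print(chat_turn("What animal family are they?", memory, echo_llm))
```

Passing the model as a callable keeps the safety and memory logic testable without network access.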

### **4.1 - Chat conversation**
* LLMs are stateless
* Single Turn
* Multi Turn (Memory)

In [11]:
# example of single turn chat
prompt_chat = "What is the average lifespan of a Llama?"
with JobTimer("4.1 - Chat conversation: Single Turn [1/2]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = ChatCompletion(prompt=prompt_chat, system_prompt="answer the last question in few words")
html(output)

[05.01 13:42:10] ┇ [INIT] 4.1 - Chat conversation: Single Turn [1/2]
[05.01 13:42:11] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 13:42:11] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 13:42:13] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/jvdp80ea49rgj0cf69ts1kryvm "HTTP/1.1 200 OK"
[05.01 13:42:13] ┇ [EXIT] 4.1 - Chat conversation: Single Turn [1/2] ($=00:00:01.953)


In [12]:
# Example without previous context. LLMs are stateless and cannot resolve "they" without it.
prompt_chat = "What animal family are they?"
with JobTimer("4.1 - Chat conversation: Single Turn [2/2]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = ChatCompletion(prompt=prompt_chat, system_prompt="answer the last question in few words")
html(output)

[05.01 13:42:13] ┇ [INIT] 4.1 - Chat conversation: Single Turn [2/2]
[05.01 13:42:14] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 13:42:14] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 13:42:15] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/xhy99tyn3nrgj0cf69ts59b3fr "HTTP/1.1 200 OK"
[05.01 13:42:15] ┇ [EXIT] 4.1 - Chat conversation: Single Turn [2/2] ($=00:00:01.545)


A chat app must send the previous context to the LLM with each request to get valid responses. Below is an example of multi-turn chat.

In [13]:
# Example of multi-turn chat, with the previous context stored in the prompt
prompt_chat = """
User: What is the average lifespan of a Llama?
Assistant: Sure! The average lifespan of a llama is around 20-30 years.
User: What animal family are they?
"""
with JobTimer("4.1 - Chat conversation: Multi Turn (Memory)", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = ChatCompletion(prompt=prompt_chat, system_prompt="answer the last question")
html(output)

[05.01 13:42:16] ┇ [INIT] 4.1 - Chat conversation: Multi Turn (Memory)
[05.01 13:42:16] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 13:42:17] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 13:42:18] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/5bvdcrpye1rgp0cf69ttd4bjm0 "HTTP/1.1 200 OK"
[05.01 13:42:18] ┇ [EXIT] 4.1 - Chat conversation: Multi Turn (Memory) ($=00:00:02.040)
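
The multi-turn cell above writes the transcript by hand. A small helper can accumulate turns and render them in the same `User:` / `Assistant:` layout. The sketch below is one possible design, not part of the original notebook: the class name `ChatMemory` and the `max_turns` limit are assumptions, and capping the number of stored turns is only a crude way to keep the prompt within the model's context window.

```python
from collections import deque


class ChatMemory:
    """Keep the most recent turns and render them as a transcript prompt."""

    def __init__(self, max_turns: int = 6):
        # A deque with maxlen silently drops the oldest turn when full.
        self.turns = deque(maxlen=max_turns)

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))

    def render(self, new_user_message: str) -> str:
        lines = [f"{role}: {text}" for role, text in self.turns]
        lines.append(f"User: {new_user_message}")
        return "\n".join(lines)


chat_memory = ChatMemory(max_turns=4)
chat_memory.add("User", "What is the average lifespan of a Llama?")
chat_memory.add("Assistant", "Sure! The average lifespan of a llama is around 20-30 years.")
rendered_prompt = chat_memory.render("What animal family are they?")
print(rendered_prompt)
```

The rendered string could then be passed as the `prompt` argument of `ChatCompletion`.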


### **4.2 - Prompt Engineering**
* Prompt engineering refers to the science of designing effective prompts to get desired responses
* Helps reduce hallucination

#### **4.2.1 - In-Context Learning (e.g. Zero-shot, Few-shot)**
  * In-context learning is a prompt engineering method in which demonstrations of the task are provided as part of the prompt.
  1. Zero-shot learning: the model performs the task without any examples in the prompt.
  2. Few-shot (“N-shot”) learning: the model performs the task based on examples given in the user's prompt.
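
The difference between the two modes comes down to how the prompt is assembled. As a minimal sketch, the hypothetical helper `build_few_shot_prompt` below formats labeled examples ahead of the query in a `Classify:` / `Sentiment:` layout. An empty example list yields a zero-shot prompt.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Format labeled examples followed by an unlabeled query."""
    lines = []
    for text, label in examples:
        lines.append(f"Classify: {text}")
        lines.append(f"Sentiment: {label}")
    lines.append(f"Classify: {query}")
    lines.append("Sentiment:")  # left open for the model to complete
    return "\n".join(lines)


zero_shot = build_few_shot_prompt([], "I saw a Gecko.")
few_shot = build_few_shot_prompt(
    [("I love Llamas!", "Positive"), ("I don't like Snakes.", "Negative")],
    "I saw a Gecko.",
)
print(few_shot)
```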

In [14]:
# Zero-shot example. Without examples in the prompt, the model may not answer with a positive/negative/neutral label.
prompt = '''
Classify: I saw a Gecko.
Sentiment: ?
'''
with JobTimer("4.2.1 - In-Context Learning [1/4]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = ChatCompletion(prompt, system_prompt="one word response")
html(output)

[05.01 13:42:18] ┇ [INIT] 4.2.1 - In-Context Learning [1/4]
[05.01 13:42:19] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 13:42:20] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 13:42:21] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/s0es0tq9r9rgj0cf69tr4rza0w "HTTP/1.1 200 OK"
[05.01 13:42:21] ┇ [EXIT] 4.2.1 - In-Context Learning [1/4] ($=00:00:02.083)


In [15]:
# By giving examples to Llama, it understands the expected output format.
prompt = '''
Classify: I love Llamas!
Sentiment: Positive
Classify: I dont like Snakes.
Sentiment: Negative
Classify: I saw a Gecko.
Sentiment:'''
with JobTimer("4.2.1 - In-Context Learning [2/4]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = ChatCompletion(prompt, system_prompt="One word response")
html(output)

[05.01 13:42:21] ┇ [INIT] 4.2.1 - In-Context Learning [2/4]
[05.01 13:42:22] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 13:42:23] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 13:42:24] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/7ywvpdqn3xrgg0cf69tsfmebdm "HTTP/1.1 200 OK"
[05.01 13:42:24] ┇ [EXIT] 4.2.1 - In-Context Learning [2/4] ($=00:00:02.447)


In [16]:
# another zero-shot learning
prompt = '''
QUESTION: Vicuna?
ANSWER:'''
with JobTimer("4.2.1 - In-Context Learning [3/4]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = ChatCompletion(prompt, system_prompt="one word response")
html(output)

[05.01 13:42:25] ┇ [INIT] 4.2.1 - In-Context Learning [3/4]
[05.01 13:42:25] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 13:42:26] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 13:42:27] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/368qsa01yxrgm0cf69v82krb2w "HTTP/1.1 200 OK"
[05.01 13:42:27] ┇ [EXIT] 4.2.1 - In-Context Learning [3/4] ($=00:00:01.833)


In [17]:
# Another few-shot learning example with formatted prompt.
prompt = '''
QUESTION: Llama?
ANSWER: Yes
QUESTION: Alpaca?
ANSWER: Yes
QUESTION: Rabbit?
ANSWER: No
QUESTION: Vicuna?
ANSWER:
'''
with JobTimer("4.2.1 - In-Context Learning [4/4]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = ChatCompletion(prompt, system_prompt="one word response")
html(output)

[05.01 13:42:27] ┇ [INIT] 4.2.1 - In-Context Learning [4/4]
[05.01 13:42:29] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 13:42:29] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 13:42:30] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/a8ana3repdrgp0cf69v943pyy8 "HTTP/1.1 200 OK"
[05.01 13:42:31] ┇ [EXIT] 4.2.1 - In-Context Learning [4/4] ($=00:00:02.869)


#### **4.2.2 - Chain of Thought**
"Chain of thought" enables complex reasoning through logical step by step thinking and generates meaningful and contextually relevant responses.

In [18]:
# Standard prompting
prompt = '''
Llama started with 5 tennis balls. It buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does Llama have now?
'''
with JobTimer("4.2.2 - Chain of Thought [1/2]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = ChatCompletion(prompt, system_prompt="provide short answer")
html(output)

[05.01 13:42:31] ┇ [INIT] 4.2.2 - Chain of Thought [1/2]
[05.01 13:42:32] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 13:42:33] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 13:42:33] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/mpjcmxrv0srgp0cf69v8892q5g "HTTP/1.1 200 OK"
[05.01 13:42:34] ┇ [EXIT] 4.2.2 - Chain of Thought [1/2] ($=00:00:02.235)


In [19]:
# Chain-Of-Thought prompting
prompt = '''
Llama started with 5 tennis balls. It buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does Llama have now?
Let's think step by step.
'''
with JobTimer("4.2.2 - Chain of Thought [2/2]", rt=1, rb=1, rw=80, rc='=', verbose=1):
    output = ChatCompletion(prompt, system_prompt="provide short answer")
html(output)

[05.01 13:42:34] ┇ [INIT] 4.2.2 - Chain of Thought [2/2]
[05.01 13:42:35] ┇ HTTP Request: POST https://api.replicate.com/v1/predictions "HTTP/1.1 201 Created"
[05.01 13:42:35] ┇ HTTP Request: GET https://api.replicate.com/v1/models/meta/llama-2-13b-chat/versions/f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d "HTTP/1.1 200 OK"
[05.01 13:42:36] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/d5mgps988srgp0cf69v9ts92pg "HTTP/1.1 200 OK"
[05.01 13:42:37] ┇ HTTP Request: GET https://api.replicate.com/v1/predictions/d5mgps988srgp0cf69v9ts92pg "HTTP/1.1 200 OK"
[05.01 13:42:37] ┇ [EXIT] 4.2.2 - Chain of Thought [2/2] ($=00:00:02.523)
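
For reference, the intermediate steps that a step-by-step answer to the tennis-ball problem should go through can be written out directly. The helper `with_cot` is a hypothetical convenience that appends the trigger phrase used in the cell above.

```python
# Worked steps for the tennis-ball problem in the cells above.
initial_balls = 5
cans_bought = 2
balls_per_can = 3

new_balls = cans_bought * balls_per_can   # step 1: 2 cans x 3 balls = 6
total_balls = initial_balls + new_balls   # step 2: 5 + 6 = 11
print(f"new balls: {new_balls}, total: {total_balls}")


def with_cot(question: str, trigger: str = "Let's think step by step.") -> str:
    """Append a chain-of-thought trigger phrase to a question."""
    return f"{question.strip()}\n{trigger}"
```

Comparing the model's reply against the known total of 11 is a simple way to check whether the step-by-step prompt helped.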


### **4.3 - Retrieval Augmented Generation (RAG)**
* Prompt engineering limitations: knowledge cutoff and lack of specialized data
* Retrieval Augmented Generation (RAG) lets us retrieve snippets of information from external data sources and add them to the user's prompt to get tailored responses from Llama 2.
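
The retrieve-then-augment idea can be illustrated without any external library. The sketch below swaps learned embeddings for plain bag-of-words vectors with cosine similarity, which is far cruder than a real embedding model but shows the same two steps. The snippets, function names, and prompt wording are all made up for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Lowercase word counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, snippets: List[str], k: int = 1) -> List[str]:
    """Return the k snippets most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(snippets, key=lambda s: cosine(q, bag_of_words(s)), reverse=True)
    return ranked[:k]


def augment(query: str, snippets: List[str], k: int = 1) -> str:
    """Prepend retrieved snippets to the user's question."""
    context = "\n".join(retrieve(query, snippets, k))
    return f"Use the context to answer.\nContext:\n{context}\nQuestion: {query}"


# Made-up snippets for illustration only.
docs = [
    "Llamas are domesticated South American camelids.",
    "FAISS is a library for similarity search over dense vectors.",
    "The office cafeteria opens at 8 am on weekdays.",
]
print(augment("When does the cafeteria open?", docs))
```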

For our demo, we download an external PDF file from a URL and query its content to get contextually relevant information back with the help of Llama 2.

In [20]:
rag_arch()

#### **4.3.1 - LangChain**
LangChain is a framework that makes it easier to implement RAG.

In [21]:
# langchain setup
from langchain_community.llms import Replicate

# Use the Llama 2 model hosted on Replicate
# temperature: adjusts randomness of outputs; values above 1 are more random, 0 is deterministic, 0.75 is a good starting value
# top_p: when decoding text, samples from the top p percentage of most likely tokens; lower it to ignore less likely tokens
# max_new_tokens: maximum number of tokens to generate; tokens are often word pieces, so an English word averages somewhat more than one token
llama_model = Replicate(
    model=model_ref,
    model_kwargs={"temperature": 0.75, "top_p": 1, "max_new_tokens": 1000}
)
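
To make the `temperature` and `top_p` settings concrete, here is a small standalone sketch of how they are commonly applied to next-token scores: temperature rescales the logits before the softmax, and top-p (nucleus) filtering keeps only the most likely tokens up to a cumulative probability. The logits are made up, the function names are illustrative, and a hosted model's actual sampler may differ in detail.

```python
import math
from typing import Dict


def softmax_with_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Convert logits to probabilities; temperature must be > 0 in this sketch."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    kept, mass = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        mass += p
        if mass >= top_p:
            break
    return {tok: p / mass for tok, p in kept.items()}


# Made-up logits for three candidate next tokens.
logits = {"brown": 2.0, "white": 1.0, "purple": -1.0}
for t in (0.25, 0.75, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
print(top_p_filter(softmax_with_temperature(logits, 0.75), top_p=0.9))
```

Lower temperatures concentrate probability on the top token, while a lower `top_p` removes unlikely tokens from consideration entirely.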

In [22]:
# Step 1: Load the external data source. In our case, we load Meta’s “Responsible Use Guide” PDF document.
from langchain_community.document_loaders import OnlinePDFLoader

loader = OnlinePDFLoader("https://ai.meta.com/static-resource/responsible-use-guide/")
documents = loader.load()

# Step 2: Split the document into text chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)

# Step 3: Load the embedding model
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"  # embedding model
model_kwargs = {"device": "cpu"}
embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

# Step 4: Use vector store to store embeddings
vectorstore = FAISS.from_documents(all_splits, embeddings)

[05.01 13:42:41] ┇ pikepdf C++ to Python logger bridge initialized
[05.01 13:42:42] ┇ Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
[05.01 13:42:49] ┇ Loading faiss.
[05.01 13:42:49] ┇ Successfully loaded faiss.
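
To see what `chunk_size=1000` and `chunk_overlap=20` mean in Step 2, the sketch below splits text into fixed-size character windows in which each chunk repeats the last 20 characters of the previous one. This is a deliberately simplified stand-in: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its chunks will generally differ. The function name and the synthetic text are assumptions.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 20) -> List[str]:
    """Fixed-size character chunks; each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))  # synthetic 2,500-character text
chunks = split_with_overlap(sample, chunk_size=1000, chunk_overlap=20)
print([len(c) for c in chunks])
```

The overlap keeps a sentence that straddles a chunk boundary at least partly visible in both neighboring chunks.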
