#### This demo app shows:
* How to run Llama 3 locally through Hugging Face's transformers pipeline
* How to use LangChain to ask Llama general questions and follow-up questions
* How to use LangChain to load a recent web page - Hugging Face's blog post on Llama 3 - and chat about it. This is the well-known RAG (Retrieval Augmented Generation) method, which lets an LLM such as Llama 3 answer questions about data that was not publicly available when it was trained, or about your own data. RAG is one way to reduce LLM hallucination.

Note: The examples here load meta-llama/Meta-Llama-3-8B-Instruct from Hugging Face and run it on a local GPU. You will need to sign in to Hugging Face with an account that has been granted access to the model, then create an access token at https://huggingface.co/settings/tokens. You can also use Llama cloud providers such as Replicate, Groq, Together, or Anyscale - see Section 2 of the Getting to Know Llama notebook for more information.
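
The retrieval step of RAG can be sketched without any libraries. The following is a minimal illustration, not the method used later in this notebook: it scores documents by bag-of-words cosine similarity as a stand-in for the sentence-transformers embeddings and FAISS search used below. The names `embed`, `cosine`, and `retrieve` and the sample sentences are made up for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (an assumption, not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


documents = [
    "Llama 3 uses a tokenizer with a vocabulary of 128,256 tokens.",
    "FAISS is a library for similarity search over dense vectors.",
    "The Innovator's Dilemma was written by Clayton Christensen.",
]
top = retrieve("How large is the Llama 3 tokenizer vocabulary?", documents, k=1)
print(top[0])
```

In a real pipeline the retrieved text is then placed in the prompt so the model can answer from it rather than from memory alone.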

Let's start by installing the necessary packages:

* sentence-transformers for text embeddings
* FAISS for vector database capabilities
* LangChain for the RAG tools used in this demo

In [1]:
!pip install langchain

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting langchain
  Downloading langchain-0.3.2-py3-none-any.whl.metadata (7.1 kB)

In [3]:
!pip install huggingface-hub

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting huggingface-hub
  Downloading huggingface_hub-0.25.1-py3-none-any.whl.metadata (13 kB)
Successfully installed filelock-3.16.1 fsspec-2024.9.0 huggingface-hub-0.25.1 tqdm-4.66.5


In [4]:
# go here for token: https://huggingface.co/settings/tokens
from huggingface_hub import notebook_login
notebook_login()


In [5]:
!pip install transformers
!pip install 'accelerate>=0.26.0'
!pip install langchain_huggingface

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting transformers
  Downloading transformers-4.45.1-py3-none-any.whl.metadata (44 kB)


In [6]:

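from langchain_huggingface import HuggingFacePipeline

# load Llama 3 8B Instruct from Hugging Face and wrap it as a LangChain LLM
llm = HuggingFacePipeline.from_model_id(
    model_id="meta-llama/Meta-Llama-3-8B-Instruct",
    device=0,  # run on the first GPU
    task="text-generation",
    pipeline_kwargs={
        "max_new_tokens": 100,  # cap on generated tokens; longer answers get cut off
        "top_k": 50,
        "temperature": 0.1,
    },
)
llm.invoke("Hugging Face is")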


"Hugging Face is a popular open-source library for natural language processing (NLP) tasks. It provides a wide range of pre-trained models and a simple interface for using them. In this tutorial, we will explore how to use Hugging Face's Transformers library to perform sentiment analysis on a text dataset.\n\n### Installing the required libraries\n\nBefore we start, make sure you have the following libraries installed:\n```\npip install transformers pandas numpy\n```\n### Loading the dataset\n\nFor this tutorial, we will use the IM"

In [7]:
print(llm.invoke("Hugging Face is"))

Hugging Face is a company that provides a range of AI-powered tools and services for natural language processing (NLP) and machine learning. Their flagship product is the Transformers library, which is a popular open-source library for NLP tasks such as language modeling, text classification, and sentiment analysis.

In this tutorial, we will explore how to use the Transformers library to build a simple sentiment analysis model using a pre-trained language model.

### Installing the Transformers Library

To install the Transformers library, you can use pip:
``


In [8]:
question = "who wrote the book Innovator's dilemma?"
answer = llm.invoke(question)
print(answer)

who wrote the book Innovator's dilemma??
The book "The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail" was written by Clayton M. Christensen, a Harvard Business School professor. The book was first published in 1997 and has since become a classic in the field of innovation and strategy. Christensen's work explores the challenges that established companies face when trying to innovate and adapt to new technologies, and how they often fail to do so due to their own strengths and success. The book has


In [9]:
# the chat history is not passed, so Llama lacks the context and doesn't know the follow-up is about the book
followup = "tell me more"
followup_answer = llm.invoke(followup)
print(followup_answer)

tell me more about the new 2019 ford f-150
The 2019 Ford F-150 is a full-size pickup truck that is part of the 14th generation of the F-Series. It was introduced in 2015 and has been updated for the 2019 model year with several new features and improvements. Here are some of the key changes and features of the 2019 Ford F-150:
New Engine Options: The 2019 F-150 offers three new engine options,


In [10]:
# using ConversationBufferMemory to pass memory (chat history) for follow up questions
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=False,
)



In [11]:
# restart from the original question
answer = conversation.predict(input=question)
print(answer)

The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:

Human: who wrote the book Innovator's dilemma?
AI: Ah, a great question! The book "The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail" was written by Clayton Christensen, a Harvard Business School professor. It was first published in 1997 and has since become a classic in the field of business and innovation. The book explores how established companies can struggle to innovate and adapt to new technologies, leading to their decline. Christensen's work has been widely influential and has been applied in many industries.

Human
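
The printed prompt shows what the memory contributes: earlier turns are pasted into the prompt ahead of the new question. A minimal sketch of that idea follows. `BufferMemory` and `build_prompt` are hypothetical names for illustration and are not LangChain classes; the template text is copied from the output above.

```python
TEMPLATE = (
    "The following is a friendly conversation between a human and an AI. "
    "The AI is talkative and provides lots of specific details from its context. "
    "If the AI does not know the answer to a question, it truthfully says it "
    "does not know.\n\n"
    "Current conversation:\n{history}\nHuman: {input}\nAI:"
)


class BufferMemory:
    """Hypothetical stand-in for a buffer memory: stores every turn verbatim."""

    def __init__(self):
        self.turns = []

    def save_context(self, human, ai):
        self.turns.append((human, ai))

    def history(self):
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


def build_prompt(memory, user_input):
    """Fill the template with the stored transcript and the new input."""
    return TEMPLATE.format(history=memory.history(), input=user_input)


memory_sketch = BufferMemory()
memory_sketch.save_context(
    "who wrote the book Innovator's dilemma?",
    "Clayton Christensen wrote it.",
)
prompt = build_prompt(memory_sketch, "tell me more")
print(prompt)
```

Because the whole transcript is resent on every call, the prompt grows with each turn, which matters when the model's context window or `max_new_tokens` budget is small.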


In [12]:
# conversation.predict() already saved the previous question and answer to memory, so no explicit
# memory.save_context() call is needed (it would store the same turn twice); the follow up "tell me more"
# is now sent along with that context, and Llama knows what "more" refers to
followup_answer = conversation.predict(input=followup)
print(followup_answer)

The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: who wrote the book Innovator's dilemma?
AI: The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:

Human: who wrote the book Innovator's dilemma?
AI: Ah, a great question! The book "The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail" was written by Clayton Christensen, a Harvard Business School professor. It was first published in 1997 and has since become a classic in the field of business and innovation. The book explores how established companies can struggle to innovate and adapt to new technologies, leading

In [14]:
!pip install langchain_community

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting langchain_community
  Downloading langchain_community-0.3.1-py3-none-any.whl.metadata (2.8 kB)

In [15]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
import bs4

loader = WebBaseLoader(["https://huggingface.co/blog/llama3"])
docs = loader.load()

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [18]:
!pip install faiss-cpu

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.7 kB)
Downloading faiss_cpu-1.8.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.8.0.post1


In [19]:
# Split the document into chunks with a specified chunk size
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
all_splits = text_splitter.split_documents(docs)

# Store the document into a vector store with a specific embedding model
vectorstore = FAISS.from_documents(all_splits, HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"))
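
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter. It is only a sketch: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, whereas this hypothetical `split_with_overlap` helper cuts at fixed character offsets.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Cut text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(sample, chunk_size=500, chunk_overlap=50)
print([len(c) for c in chunks])
```

With 1,200 characters of input this yields three chunks of 500, 500, and 300 characters, and the last 50 characters of each chunk reappear at the start of the next. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.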

In [20]:
# use LangChain's RetrievalQA, to associate Llama 3 with the loaded documents stored in the vector db
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever()
)

question = "What's new with Llama 3?"
result = qa_chain.invoke({"query": question})
print(result['result'])



Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

What’s new with Llama 3?
	

The Llama 3 release introduces 4 new open LLM models by Meta based on the Llama 2 architecture. They come in two sizes: 8B and 70B parameters, each with base (pre-trained) and instruct-tuned versions. All the variants can be run on various types of consumer hardware and have a context length of 8K tokens.

Meta’s Llama 3, the next iteration of the open-access Llama family, is now released and available at Hugging Face. It's great to see Meta continuing its commitment to open AI, and we’re excited to fully support the launch with comprehensive integration in the Hugging Face ecosystem.

A big change in Llama 3 compared to Llama 2 is the use of a new tokenizer that expands the vocabulary size to 128,256 (from 32K tokens in the previous version). This larger vocabulary can encode text more effi
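
The output above begins with the prompt that the chain sent to the model: the retrieved chunks are placed under a fixed instruction, followed by the question. A minimal sketch of this "stuff the context into the prompt" step is shown here. The instruction sentence is copied from the output above; the `Question:` and `Helpful Answer:` labels and the `build_qa_prompt` helper are assumptions for illustration, not necessarily LangChain's exact template.

```python
INSTRUCTION = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer."
)


def build_qa_prompt(chunks, question):
    """Join retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return f"{INSTRUCTION}\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"


retrieved = [
    "Llama 3 comes in two sizes: 8B and 70B parameters.",
    "Llama 3 uses a new tokenizer with a vocabulary of 128,256 tokens.",
]
qa_prompt = build_qa_prompt(retrieved, "What's new with Llama 3?")
print(qa_prompt)
```

Since every retrieved chunk is copied into the prompt, the chunk size and the number of retrieved chunks together determine how much of the model's context window the question consumes.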

In [21]:
# no chat history is passed, so Llama 3 doesn't have enough context to answer and lets its imagination go wild
result = qa_chain.invoke({"query": "Based on what architecture?"})
print(result['result'])

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Models on the Hub, with their model cards and licenses
🤗 Transformers integration
Hugging Chat integration for Meta Llama 3 70b
Inference Integration into Inference Endpoints, Google Cloud & Amazon SageMaker
An example of fine-tuning Llama 3 8B on a single GPU with 🤗 TRL






		Table of contents
	


What’s new with Llama 3?
Llama 3 evaluation
How to prompt Llama 3
Demo
Using 🤗 Transformers
Inference Integrations
Fine-tuning with 🤗 TRL
Additional Resources
Acknowledgments

Llama 3 comes in two sizes: 8B for efficient deployment and development on consumer-size GPU, and 70B for large-scale AI native applications. Both come in base and instruction-tuned variants. In addition to the 4 models, a new version of Llama Guard was fine-tuned on Llama 3 8B and is released as Llama Guard 2 (safety fine-tune).

ybelkada

In [22]:
# use ConversationalRetrievalChain to pass chat history for follow up questions
from langchain.chains import ConversationalRetrievalChain
chat_chain = ConversationalRetrievalChain.from_llm(llm, vectorstore.as_retriever(), return_source_documents=True)
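
Conceptually, a conversational retrieval chain first rewrites the follow-up into a standalone question using the chat history, and only then retrieves documents for it. The sketch below shows how a list of `(question, answer)` tuples, the format used for `chat_history` in the following cells, could be turned into such a rewriting prompt. The prompt wording and the `format_history` and `build_condense_prompt` helpers are illustrative assumptions rather than LangChain's exact template.

```python
def format_history(chat_history):
    """Render (question, answer) tuples as a plain-text transcript."""
    return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)


def build_condense_prompt(chat_history, followup):
    """Ask the model to rewrite a follow-up as a standalone question."""
    return (
        "Given the following conversation and a follow up question, "
        "rephrase the follow up question to be a standalone question.\n\n"
        f"Chat History:\n{format_history(chat_history)}\n"
        f"Follow Up Input: {followup}\n"
        "Standalone question:"
    )


history_sketch = [
    ("What's new with Llama 3?", "It introduces 4 new open LLM models by Meta."),
]
condense_prompt = build_condense_prompt(history_sketch, "Based on what architecture?")
print(condense_prompt)
```

A model given this prompt would ideally return something like a question about the architecture of Llama 3, which retrieves far better than the bare "Based on what architecture?".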

In [23]:
# let's ask the original question "What's new with Llama 3?" again
result = chat_chain.invoke({"question": question, "chat_history": []})
print(result['answer'])

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

What’s new with Llama 3?
	

The Llama 3 release introduces 4 new open LLM models by Meta based on the Llama 2 architecture. They come in two sizes: 8B and 70B parameters, each with base (pre-trained) and instruct-tuned versions. All the variants can be run on various types of consumer hardware and have a context length of 8K tokens.

Meta’s Llama 3, the next iteration of the open-access Llama family, is now released and available at Hugging Face. It's great to see Meta continuing its commitment to open AI, and we’re excited to fully support the launch with comprehensive integration in the Hugging Face ecosystem.

A big change in Llama 3 compared to Llama 2 is the use of a new tokenizer that expands the vocabulary size to 128,256 (from 32K tokens in the previous version). This larger vocabulary can encode text more effi

In [24]:
# this time we pass the chat history along with the follow-up, so the chain can tell what the question refers to
chat_history = [(question, result["answer"])]
followup = "Based on what architecture?"
followup_answer = chat_chain.invoke({"question": followup, "chat_history": chat_history})
print(followup_answer['answer'])

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

A big change in Llama 3 compared to Llama 2 is the use of a new tokenizer that expands the vocabulary size to 128,256 (from 32K tokens in the previous version). This larger vocabulary can encode text more efficiently (both for input and output) and potentially yield stronger multilingualism. This comes at a cost, though: the embedding input and output matrices are larger, which accounts for a good portion of the parameter count increase of the small model: it goes from 7B in Llama 2 to 8B in

What’s new with Llama 3?
	

The Llama 3 release introduces 4 new open LLM models by Meta based on the Llama 2 architecture. They come in two sizes: 8B and 70B parameters, each with base (pre-trained) and instruct-tuned versions. All the variants can be run on various types of consumer hardware and have a context length of 8K tokens.



In [25]:
# further follow-ups are possible by appending each exchange to chat_history like this:
chat_history.append((followup, followup_answer["answer"]))
more_followup = "What changes in vocabulary size?"
more_followup_answer = chat_chain.invoke({"question": more_followup, "chat_history": chat_history})
print(more_followup_answer['answer'])

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

A big change in Llama 3 compared to Llama 2 is the use of a new tokenizer that expands the vocabulary size to 128,256 (from 32K tokens in the previous version). This larger vocabulary can encode text more efficiently (both for input and output) and potentially yield stronger multilingualism. This comes at a cost, though: the embedding input and output matrices are larger, which accounts for a good portion of the parameter count increase of the small model: it goes from 7B in Llama 2 to 8B in

What’s new with Llama 3?
	

The Llama 3 release introduces 4 new open LLM models by Meta based on the Llama 2 architecture. They come in two sizes: 8B and 70B parameters, each with base (pre-trained) and instruct-tuned versions. All the variants can be run on various types of consumer hardware and have a context length of 8K tokens.



Note: If results get cut off, you can set "max_new_tokens" in the pipeline_kwargs of the HuggingFacePipeline.from_model_id call above to a larger number (as shown below) to avoid the cutoff.

pipeline_kwargs={"max_new_tokens": 1000, "top_k": 50, "temperature": 0.1}