# **Run LLM Falcon Locally**

# **Description:**
This project walks you through setting up a local Falcon LLM and wrapping it in a conversation with LangChain’s PromptTemplate and ConversationChain classes.

# **Steps to Perform:**
1. Set up the Environment
2. Download the Falcon 7B Model and Tokenizer from Hugging Face
3. Set up the LLM and Generation Configuration
4. Build the Conversation Chain
5. Modify the Prompt Template to Define a Specific Conversational Style
6. Manage Conversation History with ConversationBufferWindowMemory
7. Interact with the LLM


# **Step 1: Set up the Environment**

In [2]:
# Install the libraries if not installed
!pip install bitsandbytes
!pip install torch
!pip install transformers
!pip install accelerate
!pip install xformers
!pip install einops
!pip install langchain

Collecting bitsandbytes
  Downloading bitsandbytes-0.43.3-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Downloading bitsandbytes-0.43.3-py3-none-manylinux_2_24_x86_64.whl (137.5 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.43.3
Collecting xformers
  Downloading xformers-0.0.28.post1-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting torch==2.4.1 (from xformers)
  Downloading torch-2.4.1-cp310-cp310-manylinux1_x86_64.whl.metadata (26 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.4.1->xformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.4.1->xformers)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupt

Collecting langchain
  Downloading langchain-0.3.0-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.4.0,>=0.3.0 (from langchain)
  Downloading langchain_core-0.3.0-py3-none-any.whl.metadata (6.2 kB)
Collecting langchain-text-splitters<0.4.0,>=0.3.0 (from langchain)
  Downloading langchain_text_splitters-0.3.0-py3-none-any.whl.metadata (2.3 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.120-py3-none-any.whl.metadata (13 kB)
Collecting tenacity!=8.4.0,<9.0.0,>=8.1.0 (from langchain)
  Downloading tenacity-8.5.0-py3-none-any.whl.metadata (1.2 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain-core<0.4.0,>=0.3.0->langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting httpx<1,>=0.23.0 (from langsmith<0.2.0,>=0.1.17->langchain)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting orjson<4.0.0,>=3.9.14 (from langsmith<0.2.0,>=0.1.17->langchain)
  Downloading orjson-3.10.7-cp310-cp310-ma

* **bitsandbytes**: Optimizes models for lower memory usage by enabling 8-bit inference. This is especially useful for running large models like Falcon 7B.
* **torch**: A core library for machine learning that powers the training and inference of models.
* **transformers**: From Hugging Face, this provides the interfaces to load and use models like Falcon.
* **accelerate**: Helps in running models on multiple devices, distributing loads for efficiency.
* **xformers**: Optimizes attention layers for transformers, making large models faster.
* **einops**: A lightweight library for tensor operations, which aids in efficiently reshaping and manipulating data.
* **langchain**: A library that allows easy building of chains of LLM calls, including prompt handling, chaining responses, etc.
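Before moving on, it can help to confirm that every package listed above is actually present in the current environment. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the `installed_versions` helper are illustrative names, not part of any of the libraries.

```python
from importlib import metadata

# Distribution names as used in the pip commands above.
REQUIRED = [
    "bitsandbytes", "torch", "transformers", "accelerate",
    "xformers", "einops", "langchain",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name:<14} {version or 'NOT INSTALLED'}")
```

Any package reported as missing can be installed with the corresponding `pip install` line from the cell above.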

In [4]:
!pip install langchain_community

Collecting langchain_community
  Downloading langchain_community-0.3.0-py3-none-any.whl.metadata (2.8 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain_community)
  Downloading pydantic_settings-2.5.2-py3-none-any.whl.metadata (3.5 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading marshmallow-3.22.0-py3-none-any.whl.metadata (7.2 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting python-dotenv>=0.21.0 (from pydantic-settings<3.0.0,>=2.4.0->langchain_community)
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloa

In [5]:
import torch
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferWindowMemory
from langchain_community.llms import HuggingFacePipeline
from langchain_core.prompts import PromptTemplate
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# **Step 2: Download the Falcon 7B Model and Tokenizer from Hugging Face**

In [6]:
MODEL_NAME = "tiiuae/falcon-7b-instruct"

# Load the model in 8-bit; device_map="auto" places the weights on the available devices
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, trust_remote_code=True, load_in_8bit=True, device_map="auto"
)
# Switch to inference mode
model.eval()

# Load the matching tokenizer once
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print(f"Model device: {model.device}")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

configuration_falcon.py:   0%|          | 0.00/7.16k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/tiiuae/falcon-7b-instruct:
- configuration_falcon.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



modeling_falcon.py:   0%|          | 0.00/56.9k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/tiiuae/falcon-7b-instruct:
- modeling_falcon.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


pytorch_model.bin.index.json:   0%|          | 0.00/16.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.95G [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/4.48G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/117 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/287 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.73M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/281 [00:00<?, ?B/s]



Model device: cuda:0
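The loader output above includes a deprecation notice for `load_in_8bit` and recommends passing a `BitsAndBytesConfig` object through `quantization_config`. A minimal sketch of that alternative follows. The helper name `load_falcon_8bit` is hypothetical, and the imports sit inside the function so that defining it does not load the heavy libraries or download anything.

```python
def load_falcon_8bit(model_name="tiiuae/falcon-7b-instruct"):
    """Load the model in 8-bit via quantization_config instead of load_in_8bit."""
    # Imported lazily so that defining this helper stays cheap.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(load_in_8bit=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        quantization_config=quant_config,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model.eval()
    return model, tokenizer
```

Calling `model, tokenizer = load_falcon_8bit()` should be a drop-in replacement for the loading cell, assuming a GPU and the `bitsandbytes` package are available.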


# **Step 3: Set up the LLM and Generation Configuration**
Configure the LLM for inference using the following Python code:

In [8]:

model.eval()
generation_config = model.generation_config
# Use greedy decoding for deterministic responses (temperature only applies when sampling)
generation_config.do_sample = False
# Return a single sequence per prompt
generation_config.num_return_sequences = 1
# Limit each response to 512 newly generated tokens
generation_config.max_new_tokens = 512
# Disable the key/value cache (saves memory but slows down generation)
generation_config.use_cache = False
# Penalize tokens that have already appeared to reduce repetition
generation_config.repetition_penalty = 1.7
# Reuse the EOS token as the pad token and as the stop token
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id
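To see what a repetition penalty of 1.7 does, here is a simplified pure-Python sketch of the usual rule: the score of every token that has already been generated is divided by the penalty if it is positive and multiplied by the penalty if it is negative. This is an illustration of the idea, not the library's own code, and `apply_repetition_penalty` is a hypothetical name.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Return a copy of `logits` with already-generated token ids penalized."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Positive scores shrink; negative scores become more negative.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


logits = [3.4, -1.0, 0.5, 2.0]   # toy vocabulary of four tokens
already_generated = [0, 1]        # tokens 0 and 1 appeared earlier
print(apply_repetition_penalty(logits, already_generated, penalty=1.7))
```

In this toy case token 0 drops from 3.4 to about 2.0 and token 1 from -1.0 to about -1.7, while the unseen tokens keep their scores. A value as high as 1.7 suppresses repeats strongly, so it is worth lowering if answers start avoiding words they legitimately need to reuse.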

# **Step 4: Build the Conversation Chain**


*   Define a custom prompt template that sets the context for the conversation.
*   Create the ConversationChain object using the following Python code:





In [23]:

# Example of a fully formatted prompt, kept for reference; the chain below builds its own prompt
initial_prompt = """
The following is a conversation between a human and an AI. The AI is knowledgeable and provides detailed answers.

Current conversation:

Human: What is the theory of relativity?
AI:
""".strip()

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=512)

# Create the HuggingFacePipeline object
llm_pipeline = HuggingFacePipeline(pipeline=pipe)

# Create the ConversationChain object
dialogue_chain = ConversationChain(llm=llm_pipeline)

# Print the chain's default prompt template
print(dialogue_chain.prompt.template)

The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
{history}
Human: {input}
AI:


# **Step 5: Modify the Prompt Template to Define a Specific Conversational Style**

In [24]:
new_template = """
The following is a conversation between a human and an AI. The AI behaves like Albert Einstein, providing detailed explanations about physics.

Current conversation:
{history}
Human: {input}
AI:""".strip()

# Create a new PromptTemplate object
prompt = PromptTemplate(input_variables=["history", "input"], template=new_template)

# Print the new prompt template
print(new_template)


The following is a conversation between a human and an AI. The AI behaves like Albert Einstein, providing detailed explanations about physics.

Current conversation:
{history}
Human: {input}
AI:
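The `{history}` and `{input}` placeholders are filled in on every turn before the text reaches the model. The sketch below imitates that step with plain `str.format`, which behaves like the default f-string style of the template for this simple case; `render_prompt` is a hypothetical helper used only for illustration.

```python
template = """
The following is a conversation between a human and an AI. The AI behaves like Albert Einstein, providing detailed explanations about physics.

Current conversation:
{history}
Human: {input}
AI:""".strip()


def render_prompt(template, history, user_input):
    """Fill the {history} and {input} placeholders of the template."""
    return template.format(history=history, input=user_input)


# On the first turn there is no history yet, so that slot is empty.
first_turn = render_prompt(
    template, history="", user_input="What is the theory of relativity?"
)
print(first_turn)
```

The rendered text ends with `AI:`, which is the cue for the model to continue with the assistant's answer.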


# **Step 6: Manage Conversation History with ConversationBufferWindowMemory**

In [25]:
# Keep only the most recent exchange (k=1) in the {history} slot of the prompt
memory = ConversationBufferWindowMemory(memory_key="history", k=1)

chain = ConversationChain(llm=llm_pipeline, memory=memory, prompt=prompt, verbose=False)
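The effect of the window size `k` is easiest to see in isolation. The following toy class is a stand-in written for this explanation, not the LangChain implementation: it keeps only the last `k` human/AI exchanges and renders them in the same `Human:` / `AI:` layout that the prompt uses.

```python
from collections import deque


class WindowMemory:
    """Toy stand-in that keeps only the last k (human, ai) exchanges."""

    def __init__(self, k):
        # A bounded deque silently drops the oldest exchange when full.
        self.turns = deque(maxlen=k)

    def save(self, human, ai):
        self.turns.append((human, ai))

    def history(self):
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


memory_demo = WindowMemory(k=1)
memory_demo.save("Who was Albert Einstein?", "A theoretical physicist.")
memory_demo.save("What did he discover?", "The theory of relativity.")
print(memory_demo.history())
```

With `k=1` only the second exchange survives, so the model sees a single previous turn. A larger `k` gives more context at the cost of a longer prompt.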

# **Step 7: Interact with the LLM**
* Provide an input prompt to initiate the conversation and start interacting with the LLM.
* Observe the chain’s output and continue the dialogue by providing further input.
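The two steps above can be sketched as a small loop that sends one question after another to the chain. `run_dialogue` is a hypothetical helper, and `EchoChain` is a stand-in with the same `predict(input=...)` call shape so that the sketch runs without loading the model; with the real setup, pass the `chain` object instead.

```python
def run_dialogue(chain, questions):
    """Send each question to chain.predict and collect (question, reply) pairs."""
    transcript = []
    for question in questions:
        reply = chain.predict(input=question)
        transcript.append((question, reply))
    return transcript


class EchoChain:
    """Hypothetical stand-in so the sketch runs without a model."""

    def predict(self, input):
        return f"(reply to: {input})"


questions = ["Who was Albert Einstein?", "What is he best known for?"]
for question, reply in run_dialogue(EchoChain(), questions):
    print(f"Human: {question}\nAI: {reply}\n")
```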

In [27]:
text = "Who was Albert Einstein?"
res = chain.predict(input=text)
print(res)




The following is a conversation between a human and an AI. The AI behaves like Albert Einstein, providing detailed explanations about physics.

Current conversation:
Human: Write a story about a human and alien
AI: The following is a conversation between a human and an AI. The AI behaves like Albert Einstein, providing detailed explanations about physics.

Current conversation:

Human: Write a story about a human and alien
AI: Once upon a time, there was a human named John who had always been fascinated by the stars. He dreamed of visiting other galaxies and exploring what lay beyond our own. One day, he met an alien from a distant planet, who shared with him stories of their incredible homeworld. Curious to learn more, John asked the alien many questions. As they talked, they discovered that despite being vastly different in appearance, they both felt connected in ways they could never fully understand. They exchanged farewells as John wished to one day visit the alien's world, filled
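As the output above shows, the returned text can contain the whole prompt before the actual answer. One way to keep only the newest reply is to cut the text at the last `AI:` marker and drop anything after a following `Human:` marker. This is a minimal sketch with a hypothetical helper name, and it assumes the reply itself does not contain those markers.

```python
def extract_reply(generated_text, ai_prefix="AI:", human_prefix="Human:"):
    """Return the text after the last ai_prefix, cut at any following human_prefix."""
    # Everything after the final "AI:" marker is the newest completion.
    reply = generated_text.rsplit(ai_prefix, 1)[-1]
    # Drop any extra turn the model wrote on the human's behalf.
    reply = reply.split(human_prefix, 1)[0]
    return reply.strip()


raw = (
    "Current conversation:\n"
    "Human: Write a story about a human and alien\n"
    "AI: Once upon a time, there was a human named John.\n"
    "Human: And then?"
)
print(extract_reply(raw))
```

The same logic could be applied to the string returned by `chain.predict` before printing it.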

# **Conclusion**
This tutorial covered the basics of running a local Falcon LLM with LangChain’s PromptTemplate and ConversationChain.


In [29]:
# Define the new custom prompt template for a creative writing assistant
new_template = """
You are a highly creative and imaginative AI, specialized in assisting writers to develop new and exciting story ideas.
 Based on the input from the user,
 you will suggest a unique plot concept,
 describe the main characters, and outline key events in the story.

Current conversation:
{history}
Human: {input}
AI:""".strip()

# Create the new PromptTemplate object using the new template
prompt = PromptTemplate(input_variables=["history", "input"], template=new_template)

# Use this new template with the existing chain setup
memory = ConversationBufferWindowMemory(memory_key="history", k=6)
chain = ConversationChain(llm=llm_pipeline, memory=memory, prompt=prompt, verbose=True)

# Example interaction with the new prompt
text = "I want a mystery story set in a small town where strange things keep happening. Can you help me come up with a plot?"
res = chain.predict(input=text)
print(res)




Prompt after formatting:
You are a highly creative and imaginative AI, specialized in assisting writers to develop new and exciting story ideas.
 Based on the input from the user, 
 you will suggest a unique plot concept, 
 describe the main characters, and outline key events in the story.

Current conversation:

Human: I want a mystery story set in a small town where strange things keep happening. Can you help me come up with a plot?
AI:

> Finished chain.
You are a highly creative and imaginative AI, specialized in assisting writers to develop new and exciting story ideas.
 Based on the input from the user, 
 you will suggest a unique plot concept, 
 describe the main characters, and outline key events in the story.

Current conversation:

Human: I want a mystery story set in a small town where strange things keep happening. Can you help me come up with a plot?
AI: Of course! How about a story about a series of mysterious disappearances in a seemingly idyllic