<a href="https://www.kaggle.com/code/gpreda/simple-sequential-chain-with-llama-2-and-langchain?scriptVersionId=172016669" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<center><h1>Simple sequential chain with Llama 2 and LangChain</h1></center>

<center><img src="https://eu-images.contentstack.com/v3/assets/blt6b0f74e5591baa03/blt98d8a946b63c9b5f/64b7170ab314c94aa481d8c3/Untitled_design_(1).jpg" width=400></center>


# Introduction


## Objective  

Use Llama 2 and LangChain to create a multi-step task chain. 

## Model details  

* **Model #1**: Llama 2  
* **Variation**: 7b-chat-hf    
* **Version**: V1  
* **Framework**: PyTorch  


Llama 2 is a family of pretrained and fine-tuned models ranging from 7 to 70 billion parameters, pretrained on 2 trillion tokens, which makes it one of the more powerful open-source models. It is a significant improvement over Llama 1. 

# Installing, imports, utils

Install packages.

In [1]:
!pip install transformers==4.33.0 accelerate==0.22.0 einops==0.6.1 langchain==0.0.300 xformers==0.0.21 \
bitsandbytes==0.41.1 sentence_transformers==2.2.2 chromadb==0.4.12

Collecting einops==0.6.1
  Downloading einops-0.6.1-py3-none-any.whl (42 kB)
Collecting langchain==0.0.300
  Downloading langchain-0.0.300-py3-none-any.whl (1.7 MB)
Collecting xformers==0.0.21
  Downloading xformers-0.0.21-cp310-cp310-manylinux2014_x86_64.whl (167.0 MB)
Collecting bitsandbytes==0.41.1
  Downloading bitsandbytes-0.41.1-py3-none-any.whl (92.6 MB)
Collecting sentence_transformers==2.2.2
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)

Import packages.

In [2]:
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time

from langchain.llms import HuggingFacePipeline
from langchain.chains import LLMChain, SimpleSequentialChain
from langchain import PromptTemplate


## Initialize model, tokenizer, query pipeline  

Define the model, the device, and the bitsandbytes configuration.

In [3]:
model_id = '/kaggle/input/llama-2/pytorch/7b-chat-hf/1'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
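
To see why a 4-bit configuration reduces GPU memory use, a rough back-of-the-envelope estimate of weight memory helps. The sketch below assumes a round figure of 7 billion parameters and counts only the weights; it ignores activations, the KV cache, and the small overhead of quantization constants. The function name `weight_memory_gib` is illustrative and not part of any library.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / 1024**3


N_PARAMS = 7e9  # assumption: round figure for the 7b variant

# Compare common storage precisions for the same parameter count
for label, bits in [("float32", 32), ("float16 / bfloat16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>20}: {weight_memory_gib(N_PARAMS, bits):6.2f} GiB")
```

Under these assumptions the weights need about 13 GiB in 16-bit precision and a little over 3 GiB in 4-bit, which is the motivation for a quantized load on a memory-constrained GPU.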

Prepare the model and the tokenizer.  

We do this for the 7b model.

In [4]:
time_1 = time()
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
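    # note: the bnb_config defined above is not applied here, so the weights load unquantized;
    # pass quantization_config=bnb_config to load the model in 4-bit instead
    quantization_config=None,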
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
time_2 = time()
print(f"Prepare model #1, tokenizer: {round(time_2-time_1, 3)} sec.")



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



Prepare model #1, tokenizer: 157.695 sec.


Define a pipeline.

In [5]:
time_1 = time()
query_pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.float16,
        device_map="auto",)
time_2 = time()
print(f"Prepare pipeline #1: {round(time_2-time_1, 3)} sec.")

llm = HuggingFacePipeline(pipeline=query_pipeline)


Prepare pipeline #1: 2.429 sec.



We test it by running a simple query. First, we define a helper that colorizes the labels in the displayed output.

In [6]:
from IPython.display import display, Markdown

def colorize_text(text):
    for word, color in zip(["Reasoning", "Question", "Answer", "Total time"], ["blue", "red", "green", "magenta"]):
        text = text.replace(f"{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

In [7]:
# checking model
t = time()
response = llm(prompt="What is the most popular food in France for tourists? Just return the name of the food.")
display(Markdown(colorize_text(f"{response}\n\nTotal time: {round(time()-t, 2)} sec.")))







**<font color='green'>Answer:</font>** Escargots (Snails)



**<font color='magenta'>Total time:</font>** 5.65 sec.

# Define and execute the sequential chain


We define a chain with two tasks in sequence.  
The input for the second step is the output of the first step.  

In [8]:
def sequential_chain(country, llm):
    """
    Run a two-step chain: find a popular food, then list its top ingredients.

    Args:
        country: country selected
        llm: LangChain LLM wrapper used for both steps
    Returns:
        The text output of the second (final) step.
    """
    time_1 = time()
    template = "What is the most popular food in {country} for tourists? Just return the name of the food."

    # first step in chain: country -> food
    first_prompt = PromptTemplate(
        input_variables=["country"],
        template=template,
    )
    chain_one = LLMChain(llm=llm, prompt=first_prompt)

    # second step in chain: food -> top three ingredients
    second_prompt = PromptTemplate(
        input_variables=["food"],
        template="What are the top three ingredients in {food}. Just return the answer as three bullet points.",
    )
    chain_two = LLMChain(llm=llm, prompt=second_prompt)

    # combine the two steps and run the chain sequence
    overall_chain = SimpleSequentialChain(chains=[chain_one, chain_two], verbose=True)
    result = overall_chain.run(country)
    time_2 = time()
    print(f"Run sequential chain: {round(time_2-time_1, 3)} sec.")
    # return the final output so callers can assign it (e.g. to final_answer)
    return result
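
Conceptually, a simple sequential chain is function composition: each step formats its single input into a prompt, calls the model, and hands the resulting string to the next step. The plain-Python sketch below illustrates that data flow without LangChain. The names `fake_llm`, `make_step`, and `run_sequence` are hypothetical stand-ins, and the canned replies are placeholders, not model output.

```python
from typing import Callable, List


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; returns canned placeholder text."""
    if "most popular food" in prompt:
        return "FOOD_NAME"
    return f"(reply to: {prompt})"


def make_step(template: str, variable: str, llm: Callable[[str], str]) -> Callable[[str], str]:
    """Build one chain step: fill the template with the input, then call the model."""
    def step(value: str) -> str:
        prompt = template.format(**{variable: value})
        return llm(prompt)
    return step


def run_sequence(steps: List[Callable[[str], str]], first_input: str) -> str:
    """Feed the output of each step into the next one, in order."""
    value = first_input
    for step in steps:
        value = step(value)
    return value


steps = [
    make_step("What is the most popular food in {country} for tourists?", "country", fake_llm),
    make_step("What are the top three ingredients in {food}?", "food", fake_llm),
]
result = run_sequence(steps, "France")
print(result)
```

Because each step takes exactly one string and returns one string, the steps can be reordered or extended without changing `run_sequence`.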

Test the sequence with the Llama 2 **7b** chat HF model, using France as input.

In [9]:
final_answer = sequential_chain("France", llm)



> Entering new SimpleSequentialChain chain...

Answer: Escargots (Snails)

* Escargots (Snails)
* Garlic
* Butter

> Finished chain.
Run sequential chain: 3.566 sec.


Test the sequence with the Llama 2 **7b** chat HF model, using Italy as input.

In [10]:
final_answer = sequential_chain("Italy", llm)



> Entering new SimpleSequentialChain chain...

Answer: Pizza.

Top three ingredients in pizza:

• Cheese
• Tomato sauce
• Pepperoni

> Finished chain.
Run sequential chain: 3.881 sec.
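
In both runs the first step returned its answer with leading blank lines and an `Answer:` label, and that raw string was passed to the second prompt. If a cleaner intermediate value is wanted, one option is to normalize the text between steps. The helper below is a minimal sketch of such a cleanup; `clean_step_output` is a hypothetical name, and wiring it into the chain (for example by calling the two steps one after another by hand) is left as an assumption.

```python
import re


def clean_step_output(text: str) -> str:
    """Strip surrounding whitespace, a leading 'Answer:' label, and a trailing period."""
    text = text.strip()
    text = re.sub(r"^answer:\s*", "", text, flags=re.IGNORECASE)
    return text.rstrip(".").strip()


# The two raw first-step outputs seen in the runs above
print(clean_step_output("\n\nAnswer: Escargots (Snails)"))
print(clean_step_output("\n\nAnswer: Pizza."))
```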


# Conclusions

The model's answers are correct, and the response time is within acceptable limits: each two-step chain completed in under 4 seconds (3.566 sec. for France and 3.881 sec. for Italy).