# Llama 2 on Haystack!
This notebook contains my hacky experiments in loading and using [Llama 2](https://ai.meta.com/llama/) with [Haystack](https://github.com/deepset-ai/haystack), the NLP/LLM framework.


*It's nothing official or polished, but it may be useful to other people who are experimenting.*

![](https://media.istockphoto.com/id/1335757394/video/alpaca-animals-close-up-of-haystack-and-chewing-action.jpg?s=640x640&k=20&c=Ar3vSZooIJ1Izo8MDXuLOtf27AiIvSwIx4R0kMGGmLo=)

## Install Transformers and related dependencies

In [1]:
# we need to install transformers from the main branch to correctly handle Tensor Parallelism (https://github.com/huggingface/transformers/pull/24906)
!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q bitsandbytes accelerate


## Load Llama2 using Transformers

### Llama 2 access
- You need to request access to Llama 2 at https://ai.meta.com/resources/models-and-libraries/llama-downloads/
- Then also request access to the Hugging Face models (with the same e-mail address): https://huggingface.co/meta-llama
- You will receive e-mails confirming that access has been granted.


In [18]:
# A valid (free) Hugging Face Access Token (https://huggingface.co/settings/tokens)

hf_token = "YOUR-HF-TOKEN"

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-chat-hf"

# load the model using 4bit quantization (https://huggingface.co/blog/4bit-transformers-bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, use_auth_token=hf_token)
# disable Tensor Parallelism (https://github.com/huggingface/transformers/pull/24906)
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_token)
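
A rough back-of-the-envelope estimate shows why 4-bit quantization matters for a model of this size. The sketch below assumes about 13 billion parameters (taken from the model name) and counts only the weights. The helper name `estimate_weight_memory_gb` is made up for illustration, and quantization overhead, activations, and the KV cache are all ignored.

```python
def estimate_weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GB (10^9 bytes)."""
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / 1e9


# Assumption: ~13B parameters, as the model name suggests.
N_PARAMS = 13e9

for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: ~{estimate_weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the fp16 weights need about 26 GB, while the 4-bit weights need about 6.5 GB, so the quantized model may fit on a single GPU where the full-precision one would not.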


In [10]:
# quick sanity check
input_text = "Describe the solar system."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
# max_length counts the prompt tokens too, so the output is cut off at 50 tokens in total
outputs = model.generate(input_ids, max_length=50)
print(tokenizer.decode(outputs[0]))

<s> Describe the solar system.

The solar system consists of eight planets, dwarf planets, asteroids, comets, and other celestial bodies that orbit the Sun. The Sun is the center of the solar system


## Install Haystack (minimal version)

To make sure that the Haystack installation does not interfere with this particular Transformers installation, we manually prepare a minimal list of dependencies (https://github.com/deepset-ai/haystack/pull/5101).

In [11]:
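# workaround: in this environment the preferred encoding may stop being reported as UTF-8,
# which breaks the `!` shell commands below, so we force it back
import locale
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding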

In [12]:
with open('haystack-minimal-requirements.txt','w') as fo:
  fo.write("""tokenizers
pydantic
pandas
rank_bm25
lazy-imports==0.3.1
prompthub-py==4.0.0
platformdirs
tqdm
networkx
quantulum3
posthog
huggingface-hub>=0.5.0
tenacity
sseclient-py
more_itertools
boilerpy3
tiktoken>=0.3.2
jsonschema
events
requests-cache<1.0.0
pillow
click""")

!pip install -q --no-deps git+https://github.com/deepset-ai/haystack.git && pip install -qr haystack-minimal-requirements.txt
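
Because Haystack is installed with `--no-deps`, pip will not complain if one of the hand-listed dependencies is missing. A small check like the following could confirm that everything is in place. It is only a sketch: the helper names `requirement_name` and `missing_packages` are hypothetical, and the name parsing handles just the simple version specifiers used in the list above.

```python
import re
from importlib import metadata


def requirement_name(line: str) -> str:
    """Strip a simple version specifier, e.g. 'tiktoken>=0.3.2' -> 'tiktoken'."""
    return re.split(r"[<>=!~]", line, maxsplit=1)[0].strip()


def missing_packages(requirement_lines):
    """Return the names of the requirements that are not installed."""
    missing = []
    for line in requirement_lines:
        name = requirement_name(line)
        if not name:
            continue  # skip empty lines
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# A few entries from the list above; in the notebook you could read
# haystack-minimal-requirements.txt instead.
requirements = [
    "tokenizers",
    "lazy-imports==0.3.1",
    "huggingface-hub>=0.5.0",
    "requests-cache<1.0.0",
]
print("Missing packages:", missing_packages(requirements))
```

An empty list means that every listed package was found in the current environment. The check does not verify that the installed versions satisfy the specifiers.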



## Initialize and try the PromptNode
The PromptNode is an easy-to-use and customizable component powered by LLMs (like Llama2).

It can run on its own or in your Pipelines for various NLP tasks.

[PromptNode docs](https://docs.haystack.deepset.ai/docs/prompt_node)

In [13]:
from haystack.nodes import PromptNode

# unusual configuration: pass the already loaded model and tokenizer through model_kwargs
# inspiration: https://docs.haystack.deepset.ai/docs/prompt_node#using-models-not-supported-in-hugging-face-transformers
pn = PromptNode("meta-llama/Llama-2-13b-chat-hf",
                max_length=1000,
                model_kwargs={'model':model,
                              'tokenizer':tokenizer,
                              'task_name':'text2text-generation',
                              'device':None, # placeholder needed to make the underlying HF Pipeline work
                              'stream':True})


The model 'LlamaForCausalLM' is not supported for text2text-generation. Supported models are ['BartForConditionalGeneration', 'BigBirdPegasusForConditionalGeneration', 'BlenderbotForConditionalGeneration', 'BlenderbotSmallForConditionalGeneration', 'EncoderDecoderModel', 'FSMTForConditionalGeneration', 'GPTSanJapaneseForConditionalGeneration', 'LEDForConditionalGeneration', 'LongT5ForConditionalGeneration', 'M2M100ForConditionalGeneration', 'MarianMTModel', 'MBartForConditionalGeneration', 'MT5ForConditionalGeneration', 'MvpForConditionalGeneration', 'NllbMoeForConditionalGeneration', 'PegasusForConditionalGeneration', 'PegasusXForConditionalGeneration', 'PLBartForConditionalGeneration', 'ProphetNetForConditionalGeneration', 'SwitchTransformersForConditionalGeneration', 'T5ForConditionalGeneration', 'UMT5ForConditionalGeneration', 'XLMProphetNetForConditionalGeneration'].


In [14]:
# simply call the PromptNode

pn("What's the coolest city in Italy? Explain reasons why")

<s> What's the coolest city in Italy? Explain reasons why you think so.
What's the coolest city in Italy? Explain reasons why you think so.
Italy is a country known for its rich history, art, architecture, fashion, and cuisine. Each city in Italy has its own unique charm and character, making it difficult to pinpoint the coolest one. However, based on various factors such as cultural offerings, nightlife, food scene, and overall vibe, here are some reasons why I think Florence is the coolest city in Italy:

1. Art and History: Florence is home to some of the world's most famous museums, galleries, and landmarks, including the Uffizi Gallery, the Accademia Gallery (where Michelangelo's David is housed), and the Duomo. The city is steeped in history and art, making it a paradise for culture lovers.
2. Fashion: Florence is known for its high-end fashion scene, with iconic brands like Gucci, Prada, and Salvatore Ferragamo originating from the city. The city is also home to numerous boutiqu

["What's the coolest city in Italy? Explain reasons why you think so.\nWhat's the coolest city in Italy? Explain reasons why you think so.\nItaly is a country known for its rich history, art, architecture, fashion, and cuisine. Each city in Italy has its own unique charm and character, making it difficult to pinpoint the coolest one. However, based on various factors such as cultural offerings, nightlife, food scene, and overall vibe, here are some reasons why I think Florence is the coolest city in Italy:\n\n1. Art and History: Florence is home to some of the world's most famous museums, galleries, and landmarks, including the Uffizi Gallery, the Accademia Gallery (where Michelangelo's David is housed), and the Duomo. The city is steeped in history and art, making it a paradise for culture lovers.\n2. Fashion: Florence is known for its high-end fashion scene, with iconic brands like Gucci, Prada, and Salvatore Ferragamo originating from the city. The city is also home to numerous bout

## Conversational Agent

We build a simple chat app with memory.

You can find information and a more complex application explained [in this Haystack Tutorial](https://haystack.deepset.ai/tutorials/24_building_chat_app).

In [15]:
from haystack.agents.conversational import ConversationalAgent

# We need to design a specific prompt template, suitable for Llama2
# inspiration: https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat/tree/main
prompt_template="""
[INST] <<SYS>>
You are a helpful assistant who writes short answers.
<</SYS>>\n\n
{memory}</s><s> [INST] {query} [/INST]
"""

conversational_agent = ConversationalAgent(
    prompt_node=pn,
    prompt_template=prompt_template,
)
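
For reference, the multi-turn layout that this template approximates can be spelled out in plain Python. The sketch below follows the Llama 2 chat format as it is commonly described (system prompt wrapped in `<<SYS>>` tags inside the first `[INST]` block, each finished turn closed with `</s>`). The function name `build_llama2_prompt` and the `(user, assistant)` history structure are assumptions made for illustration; this is not how `ConversationalAgent` fills `{memory}` internally.

```python
def build_llama2_prompt(system_prompt, history, query, add_bos=False):
    """Build a Llama 2 chat prompt.

    history is a list of (user_message, assistant_reply) tuples.
    add_bos is False by default because the tokenizer usually adds <s> itself.
    """
    prompt = "<s>" if add_bos else ""
    # The system prompt lives inside the first [INST] block.
    prompt += f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
    # Each completed turn is closed with </s> and the next one reopened with <s>[INST].
    for user_message, assistant_reply in history:
        prompt += f"{user_message.strip()} [/INST] {assistant_reply.strip()} </s><s>[INST] "
    # The current query is left open so that the model generates the reply.
    prompt += f"{query.strip()} [/INST]"
    return prompt


example = build_llama2_prompt(
    "You are a helpful assistant who writes short answers.",
    [("who is David Guetta?", "A French DJ and record producer.")],
    "Name one of his songs.",
)
print(example)
```

Keeping the format in one small function makes it easier to compare what the model actually receives against the expected layout when the answers look off.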



In [17]:
while True:
  query = input("\nHuman (type 'exit' or 'quit' to quit): ")
  if query.lower() in ("exit", "quit"):
    break
  conversational_agent.run(query)


Human (type 'exit' or 'quit' to quit): who is David Guetta?

Agent custom-at-query-time started with {'query': 'who is David Guetta?', 'params': None}
<s> 
[INST] <<SYS>>
You are a helpful assistant who writes short answers.
<</SYS>>


</s><s>  [INST] who is David Guetta? [/INST]
David Guetta is a French DJ and record producer. He is known for his energetic and catchy electronic dance music (EDM) tracks, and has collaborated with numerous artists such as Sia, Rihanna, and Blackpink. He has

## More (generative) ideas with Haystack

- [Building a Conversational Chat App](https://haystack.deepset.ai/tutorials/24_building_chat_app/)
- [Customizing PromptNode for NLP Tasks](https://haystack.deepset.ai/tutorials/21_customizing_promptnode)
- [Creating a Generative QA Pipeline with PromptNode](https://haystack.deepset.ai/tutorials/22_pipeline_with_promptnode)
- [Answering Multihop Questions with Agents](https://haystack.deepset.ai/tutorials/23_answering_multihop_questions_with_agents)
