# Mistral + Haystack: build RAG pipelines that rock 🤘

### An unofficial experiment: I build a Retrieval-Augmented Generation pipeline for rock music, using the new and powerful [Mistral 7B](https://mistral.ai/news/announcing-mistral-7b/) and the [Haystack](https://github.com/deepset-ai/haystack) LLM orchestration framework.

<a target="_blank" href="https://colab.research.google.com/github/anakin87/mistral-haystack/blob/main/mistral_haystack.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" width="200" alt="Open In Colab"/>
</a>

<img src="https://techcrunch.com/wp-content/uploads/2023/09/mistral-7b-v0.1.jpg" width="300" style="display:inline;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<img src="https://img.freepik.com/premium-vector/electric-guitar-fire-hot-rock-music-guitar-flames-hard-rock-rock-roll-concert-festival-label-night-club-live-show-vector-logo-emblem_570429-23178.jpg?w=2000" width="200"><img src="https://haystack.deepset.ai/images/haystack-ogimage.png" width="400" style="display:inline;">

## Install Haystack

In [1]:
%%capture
! pip install farm-haystack

## Load Mistral 7B in Haystack

For simplicity, I am going to use the [Hugging Face Inference API](https://docs.haystack.deepset.ai/docs/prompt_node#using-hugging-face-inference-api), so we need a free [Hugging Face access token](https://huggingface.co/settings/tokens).


In [2]:
from haystack.nodes import PromptNode
from getpass import getpass

In [3]:
HF_TOKEN = getpass("Your Hugging Face Token")

Your Hugging Face Token··········


In [None]:
pn = PromptNode(model_name_or_path="mistralai/Mistral-7B-Instruct-v0.1",  # instruct fine-tuned model: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1
                max_length=800,
                api_key=HF_TOKEN)

In [5]:
# Let's quickly try the model

out = pn("<s>[INST] Explain in an ironic way why Large Language Models rock! [/INST]</s>")

print(out[0])

 Oh, absolutely! Large Language Models are the epitome of rock and roll. They're like the ultimate supergroup, with members from all walks of life and backgrounds, coming together to create something truly magical.

First of all, they're incredibly versatile. They can handle any topic or question you throw their way, from pop culture to politics to philosophy. They're like the ultimate jukebox, always ready to play whatever song you want to hear.

And let's not forget about their ability to learn and adapt. They're constantly evolving, taking in new information and refining their responses. It's like they're always on the cutting edge of the music scene, always pushing the boundaries of what's possible.

But perhaps the most impressive thing about Large Language Models is their sheer size. They're like a giant machine, churning out hit after hit, never stopping until they've exhausted every last bit of their creativity.

So yeah, in a completely ironic way, Large Language Models rock! 
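
The prompt above wraps the instruction in `[INST]` tags, which is the format the instruct model was fine-tuned on. As a minimal sketch, a small helper can build such single-turn prompts. The name `build_instruct_prompt` and the `close_eos` flag are hypothetical and not part of Haystack. The Mistral model card places `</s>` after the model's answer rather than right after `[/INST]`, so the helper leaves it off by default; `close_eos=True` reproduces the exact string used in this notebook.

```python
def build_instruct_prompt(instruction: str, close_eos: bool = False) -> str:
    """Wrap a single-turn instruction in Mistral-style [INST] tags."""
    prompt = f"<s>[INST] {instruction.strip()} [/INST]"
    # The cells in this notebook append </s> right after [/INST];
    # pass close_eos=True to reproduce that exact string.
    if close_eos:
        prompt += "</s>"
    return prompt


example = build_instruct_prompt("Name three rock bands.", close_eos=True)
print(example)
```

The resulting string can be passed to the `PromptNode` in the same way as the hand-written prompt.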

## Load and prepare data

I load the data (Wikipedia pages about rock music) from an old project of mine.

In [6]:
! git clone https://github.com/anakin87/fact-checking-rocks
! tar -xzf /content/fact-checking-rocks/data/rock_wiki.tar.gz

Cloning into 'fact-checking-rocks'...
remote: Enumerating objects: 319, done.
remote: Counting objects: 100% (42/42), done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 319 (delta 21), reused 22 (delta 9), pack-reused 277
Receiving objects: 100% (319/319), 77.79 KiB | 553.00 KiB/s, done.
Resolving deltas: 100% (186/186), done.
Filtering content: 100% (4/4), 222.56 MiB | 44.02 MiB/s, done.


In [7]:
# each file is a JSON document with a content field and some metadata

! python -m json.tool /content/rock_wiki/100405.json

{
    "content": "Robert Anthony Plant  (born 20 August 1948) is an English singer and songwriter, best known as the lead singer and lyricist of the English rock band Led Zeppelin for all of its existence from 1968 until 1980, when the band broke up following the death of John Bonham, the band's drummer. He was inducted with the band into the Rock and Roll Hall of Fame in 1995. \nPlant enjoyed great success with Led Zeppelin from the late 1960s to the end of the 1970s. He developed a compelling image as the charismatic rock-and-roll front man, similar to those of contemporaries such as Mick Jagger of the Rolling Stones, Roger Daltrey of the Who, Jim Morrison of the Doors, and Freddie Mercury of Queen. After Led Zeppelin dissolved in 1980,  Plant continued to perform and record continuously on a variety of solo and group projects.  His first well known post-Led Zeppelin project was The Honeydrippers, alongside former Led Zeppelin guitarist Jimmy Page, among others.  In 1988, he released

I load these files into Haystack Documents, then I chunk them using a [PreProcessor](https://docs.haystack.deepset.ai/docs/preprocessor).

In [8]:
import glob
import json
from haystack import Document
from haystack.nodes import PreProcessor

docs = []

for json_file in glob.glob("/content/rock_wiki/*.json"):
    with open(json_file, "r") as fin:
        doc_json = json.load(fin)
    doc = Document.from_json(doc_json)

    docs.append(doc)

In [9]:
processor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=200,
    split_respect_sentence_boundary=True,
    split_overlap=0,
    language="en",
)
preprocessed_docs = processor.process(docs)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Preprocessing: 100%|██████████| 453/453 [00:11<00:00, 38.18docs/s]
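
To give an intuition for what `split_by="word"`, `split_length=200` and `split_respect_sentence_boundary=True` do together, here is a simplified pure-Python sketch. It is not Haystack's implementation: the `PreProcessor` relies on NLTK for sentence splitting, while the hypothetical `chunk_by_words` below uses a naive regular expression and handles no overlap.

```python
import re
from typing import List


def chunk_by_words(text: str, split_length: int = 200) -> List[str]:
    """Group whole sentences into chunks of at most `split_length` words.

    A single sentence longer than the limit becomes its own chunk.
    """
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    current_words = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if this sentence would overflow it.
        if current and current_words + n_words > split_length:
            chunks.append(" ".join(current))
            current, current_words = [], 0
        current.append(sentence)
        current_words += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "Led Zeppelin formed in 1968. Robert Plant sang. "
    "Jimmy Page played guitar. They split in 1980."
)
chunks = chunk_by_words(sample, split_length=8)
for chunk in chunks:
    print(chunk)
```

With a limit of 8 words, the four sentences end up in two chunks, and no sentence is cut in half.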


In [10]:
# a smaller chunked document

preprocessed_docs[0]

<Document: {'content': 'Stephen Lawrence Winwood (born 12 May 1948) is an English professional musician and songwriter whose genres include blue-eyed soul, rhythm and blues, blues rock and pop rock. Though primarily a keyboard player and vocalist prominent for his distinctive, soulful high tenor voice, Winwood plays other instruments proficiently, including drums, mandolin, guitars, bass and saxophone.\nWinwood was an integral member of three seminal musical ensembles of the 1960s and 1970s: the Spencer Davis Group, Traffic, and Blind Faith. Beginning in the 1980s, his solo career flourished and he had a number of hit singles, including "While You See a Chance" (1980) from the album Arc of a Diver and "Valerie" (1982) from Talking Back to the Night ("Valerie" became a hit when it was re-released with a remix from Winwood\'s 1987 compilation album Chronicles). His 1986 album Back in the High Life marked his career zenith, with hit singles including "Back in the High Life Again", "The Fi

### Create an InMemoryDocumentStore and store data

It's a very simple Document Store, ideal for rapid prototyping.


In [11]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25=True)

In [12]:
document_store.write_documents(preprocessed_docs)

Updating BM25 representation...: 100%|██████████| 13236/13236 [00:01<00:00, 12001.50 docs/s]


## Create a RAG Pipeline

It is composed of:
- a [BM25 Retriever](https://docs.haystack.deepset.ai/docs/retriever#bm25-recommended) 🔎 to find relevant Documents
- a [PromptNode](https://docs.haystack.deepset.ai/docs/prompt_node) 🤖 with our LLM to generate grounded answers

In [13]:
from haystack import Pipeline
from haystack.nodes import BM25Retriever, PromptNode, PromptTemplate

In [14]:
retriever = BM25Retriever(document_store, top_k=4)
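
For readers curious about how BM25 ranks chunks, here is a rough, self-contained sketch of Okapi BM25 scoring. It is an illustration under assumptions, not the code Haystack runs: the function name `bm25_scores`, the whitespace tokenization, the smoothed IDF and the values `k1=1.5` and `b=0.75` are my own choices, and the sketch expects a non-empty corpus.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, corpus: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document in `corpus` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: long documents need more matches to score high.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "sum 41 is a canadian rock band",
    "led zeppelin were an english rock band",
    "nirvana released bleach in 1989",
]
scores = bm25_scores("canadian band", corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(corpus[best])
```

A retriever with `top_k=4` would keep the four highest-scoring chunks instead of only the best one.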

In [15]:
# a good Question Answering template, adapted for the instruction format
# (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)

qa_template = PromptTemplate(prompt=
  """<s>[INST] Using the information contained in the context, answer the question (using a maximum of two sentences).
  If the answer cannot be deduced from the context, answer \"I don't know.\"
  Context: {join(documents)};
  Question: {query}
  [/INST]</s>""")
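
To make the template less abstract, the sketch below mimics in plain Python what happens when it is filled in: the retrieved chunks are joined into one context string and inserted together with the query. This is only an approximation of the `PromptTemplate` behavior. `render_prompt` is a hypothetical helper, the single-space delimiter is an assumption, and the documents are plain strings instead of Haystack `Document` objects.

```python
from typing import List

TEMPLATE = (
    "<s>[INST] Using the information contained in the context, answer the question "
    "(using a maximum of two sentences).\n"
    "If the answer cannot be deduced from the context, answer \"I don't know.\"\n"
    "Context: {context};\n"
    "Question: {query}\n"
    "[/INST]</s>"
)


def render_prompt(documents: List[str], query: str, delimiter: str = " ") -> str:
    """Join the retrieved chunks and substitute them, with the query, into the template."""
    context = delimiter.join(documents)
    return TEMPLATE.format(context=context, query=query)


prompt = render_prompt(
    [
        "Robert Plant was the lead singer of Led Zeppelin.",
        "Led Zeppelin broke up in 1980.",
    ],
    "Who sang for Led Zeppelin?",
)
print(prompt)
```

Printing the rendered prompt is a handy way to check how much context the model actually receives.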

In [16]:
prompt_node = PromptNode(model_name_or_path="mistralai/Mistral-7B-Instruct-v0.1",
                         api_key=HF_TOKEN,
                         default_prompt_template=qa_template,
                         max_length=5500,
                         model_kwargs={"model_max_length":8000})

In [17]:
rag_pipeline = Pipeline()
rag_pipeline.add_node(component=retriever, name="retriever", inputs=["Query"])
rag_pipeline.add_node(component=prompt_node, name="prompt_node", inputs=["retriever"])

In [18]:
from pprint import pprint
def print_answer(out):
    # pretty-print the first generated answer, without surrounding whitespace
    pprint(out["results"][0].strip())

## Let's try our RAG Pipeline 🎸



In [19]:
print_answer(rag_pipeline.run(query="Who was Elvis Presley?"))


('Elvis Presley was a legendary American singer, actor, and musician, known '
 'for his significant impact on the popularization of rock and roll. He was '
 'born on January 8, 1935, in Tupelo, Mississippi, and had a close bond with '
 'his mother. His father was of German, Scottish, and English origins, while '
 "his mother was of Scots-Irish with some French Norman ancestry. Elvis's "
 'mother believed that her great-great-grandmother, Morning Dove White, was '
 "Cherokee. Elvis's early years were marked by his love for music, which he "
 'discovered at an Assembly of God church.')


In [20]:
print_answer(rag_pipeline.run(query="What was the initial name of Sum 41?"))

'The initial name of Sum 41 was Supernova.'


In [21]:
print_answer(rag_pipeline.run(query="Is Ozzy Osbourne a member of Blink 182?"))

('No, Ozzy Osbourne is not a member of Blink-182. He is a member of Black '
 'Sabbath.')


In [22]:
# let's try to fool the model

print_answer(rag_pipeline.run(query="Is the earth flat?"))

print_answer(rag_pipeline.run(query="Who is Michael Jordan?"))

("I don't know. The provided context does not mention anything about the Earth "
 'being flat.')
("I don't know. The context does mention Louis Jordan, but it does not provide "
 'any information about a person named Michael Jordan.')
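
Since the template asks the model to reply "I don't know." when the context is not enough, a simple string check can flag these cases, for example to count how often the pipeline declines to answer. This is a minimal sketch: `is_abstention` is a hypothetical helper, and it assumes the model opens its reply with the fallback phrase, as it does in the two outputs above.

```python
def is_abstention(answer: str) -> bool:
    """Return True when the answer opens with the fallback phrase from the template."""
    return answer.strip().lower().startswith("i don't know")


answers = [
    "I don't know. The provided context does not mention anything about the Earth being flat.",
    "The initial name of Sum 41 was Supernova.",
]
flags = [is_abstention(a) for a in answers]
print(flags)
```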


## Try it for yourself!


In [23]:
questions="""Who is Lady Gaga?
What is Nirvana's first album?
Where do System of a Down come from?
Who wrote the song Stairway to Heaven?
Can you tell me the names of some hard rock bands?
How many albums has Faith No More released?
Why are Sex Pistols so popular, even though their career was short?""".split("\n")

In [None]:
for q in questions:
  print("\n"+q)
  print_answer(rag_pipeline.run(query=q))

## More (generative) ideas with Haystack
- [Customizing PromptNode for NLP Tasks](https://haystack.deepset.ai/tutorials/21_customizing_promptnode/)
- [Creating a Generative QA Pipeline with PromptNode](https://haystack.deepset.ai/tutorials/22_pipeline_with_promptnode)
- [Answering Multihop Questions with Agents](https://haystack.deepset.ai/tutorials/23_answering_multihop_questions_with_agents)
- [Building a Conversational Chat App](https://haystack.deepset.ai/tutorials/24_building_chat_app/)
- [Customizing Agent to Chat with Your Documents](https://haystack.deepset.ai/tutorials/25_customizing_agent)

