<a href="https://colab.research.google.com/github/prosto/neo4j-haystack-playground/blob/main/neo4j_haystack_journey.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![](https://raw.githubusercontent.com/prosto/neo4j-haystack-playground/54b1f372f6f44e6851a5fa8bc1a1f554aa1dd080/images/neo4-haystack-friendship.svg)



# Agenda

*   **Introduction** 👋🏼
*   **Meet Neo4j** 🕸
*   **Meet Haystack 2.0** 🧩
*   **Neo4j Document Store** 🏗️
*   **Explore RAG Pipelines** 🔍
*   **Q/A** ❓

## Introduction


### First steps towards Graph + RAG + Haystack

The idea to build a Neo4j + Haystack integration came from an investigation into representing legislation as graphs with complex relationships between legal **documents**. For semantic search over legal documents, graphs seemed like a natural way to ground LLMs in RAG pipelines. At the time, there was news about Neo4j introducing support for vector indexes. Haystack, with its component-based NLP framework, appeared well structured and documented. I liked the idea of pipelines built from components with well-defined interfaces (e.g. input/output slots, document stores). I decided to build the integration and learn both technologies.


### Thinking Graphs

Modeling legislation, for example, is a complex task, and the simple chunking strategies that existed at the time did not seem to meet the requirements. One needs to decompose legislation into meaningful, interconnected pieces that are easy to query as "traceable data sources". Having both graph nodes and vector indexes in the same database helps reduce the number of queries and simplifies RAG solutions.

Below are some examples of graphs related to legislation to give you an idea of how to structure such content:

<table width="100%">
<tr><th>Legislators (Congress)</th><th>Clauses</th></tr>
<tr>
<td bgcolor="5b6663">
  <img src="https://raw.githubusercontent.com/prosto/neo4j-haystack-playground/54b1f372f6f44e6851a5fa8bc1a1f554aa1dd080/images/legis-graph.svg"/></td>
<td bgcolor="5b6663">
  <img src="https://raw.githubusercontent.com/prosto/neo4j-haystack-playground/54b1f372f6f44e6851a5fa8bc1a1f554aa1dd080/images/legis-clauses.svg"/>
</td>
</tr>
<tr>
<td>Source: <a href="https://github.com/jbarrasa/goingmeta/tree/main/session23">Going Meta</a>
</td>
</tr>
</table>

Apart from retrieving context for LLM prompts using semantic search, your RAG pipeline could additionally perform more complex queries to tune the information being retrieved, e.g. search only for documents issued by a particular body (Committee) for a given law domain.

Neo4j gives many benefits if we compare it to "single-purpose" vector databases:

- Role-based security
- Advanced queries using [Cypher](https://neo4j.com/docs/cypher-manual/current/introduction/cypher_overview/)
- Both Full-Text and Semantic search indexes
- [Schema Constraints](https://neo4j.com/docs/cypher-manual/current/constraints/)
- ACID transactions, cluster support, runtime failover

### Haystack Pipelines are also Graphs 🧐


> To build modern search pipelines with LLMs, you need two things: powerful components and an easy way to put them together. The Haystack pipeline is built for this purpose and enables you to design and scale your interactions with LLMs.

> The pipelines in Haystack 2.0 are directed **multigraphs** of different Haystack components and integrations. They give you the freedom to connect these components in various ways. This means that the pipeline doesn't need to be a continuous stream of information. With the flexibility of Haystack Pipelines, you can have simultaneous flows, standalone components, loops, and other types of connections.

Learn more from the [Pipelines](https://docs.haystack.deepset.ai/docs/pipelines) documentation.

<table width="100%">
<tr><th>Sample Pipeline</th></tr>
<tr>
<td align="center" bgcolor="202424">
  <img src="https://github.com/prosto/neo4j-haystack-playground/blob/main/images/generic-pipeline-sample.png?raw=true"/>
</td>
</tr>
</table>

## Meet Neo4j

> Neo4j uses a property graph database model. A graph data structure consists of nodes (discrete objects) that can be connected by relationships. Below is the image of a graph with three nodes (the circles) and three relationships (the arrows).

> Neo4j is a native graph database, which means that it implements a true graph model all the way down to the storage level. The data is stored as you whiteboard it, instead of as a "graph abstraction" on top of another technology.

### The property graph model

<table width="100%">
<tr>
<td align="center" bgcolor="202424">
  <img src="https://raw.githubusercontent.com/prosto/neo4j-haystack-playground/54b1f372f6f44e6851a5fa8bc1a1f554aa1dd080/images/the-property-graph-model.svg" />
</td>
</tr>
</table>

Nodes are the entities in the graph.

1. Nodes can be tagged with labels, representing their different roles in your domain (for example, Person).
2. Nodes can hold any number of key-value pairs, or properties (for example, name).
3. Node labels may also attach metadata (such as index or constraint information) to certain nodes.

Relationships provide directed, named connections between two node entities (for example, Person `LOVES` Person).

1. Relationships always have a direction, a type, a start node, and an end node, and they can have properties, just like nodes.
2. Nodes can have any number or type of relationships without sacrificing performance.
3. Although relationships are always directed, they can be navigated efficiently in any direction.
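
The model described above can be sketched in a few lines of plain Python. This is only an illustration of the concepts (labels, properties, directed and typed relationships), not how Neo4j stores data; the `Node` and `Relationship` classes and the `outgoing`/`incoming` helpers are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set[str]                                 # roles of the node, e.g. {"Person", "Actor"}
    properties: dict = field(default_factory=dict)   # arbitrary key-value pairs


@dataclass
class Relationship:
    start: Node                                      # every relationship has a start node ...
    type: str                                        # ... a type ...
    end: Node                                        # ... and an end node
    properties: dict = field(default_factory=dict)   # relationships can hold properties too


keanu = Node({"Person", "Actor"}, {"name": "Keanu Reeves", "born": 1964})
matrix = Node({"Movie"}, {"title": "The Matrix", "released": 1999})
relationships = [Relationship(keanu, "ACTED_IN", matrix, {"roles": ["Neo"]})]


def outgoing(node: Node, rel_type: str) -> list:
    """Follow relationships in their stored direction (start -> end)."""
    return [r.end for r in relationships if r.start is node and r.type == rel_type]


def incoming(node: Node, rel_type: str) -> list:
    """Follow the same relationships against their direction (end -> start)."""
    return [r.start for r in relationships if r.end is node and r.type == rel_type]


movies_of_keanu = [m.properties["title"] for m in outgoing(keanu, "ACTED_IN")]
cast_of_matrix = [p.properties["name"] for p in incoming(matrix, "ACTED_IN")]
print(movies_of_keanu, cast_of_matrix)
```

The single `ACTED_IN` relationship answers both "which movies did this person act in" and "who acted in this movie", which is what navigating a directed relationship in either direction means in practice.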

Below is another example of a Movies Graph:

<table width="100%">
<tr>
<td align="center" bgcolor="202424">
  <img src="https://raw.githubusercontent.com/prosto/neo4j-haystack-playground/54b1f372f6f44e6851a5fa8bc1a1f554aa1dd080/images/movies-graph-model.svg" />
</td>
</tr>
</table>


### Installation Options

1. [Neo4j AuraDB](https://neo4j.com/cloud/platform/aura-graph-database/) is a fully managed cloud service and a good place to start for anyone interested in graph technologies. Besides the free option, you can select the subscription plan that suits you best.
2. [Neo4j Database](https://neo4j.com/deployment-center/) can be installed on-premises and deployed on various systems.
3. The Neo4j Docker image provides a standard package of Neo4j Community Edition and Enterprise Edition for a variety of versions.
4. [Neo4j Desktop](https://neo4j.com/docs/desktop-manual/current/) is one of the ways to set up an environment for developing an application with Neo4j and Cypher®. Download Neo4j Desktop from https://neo4j.com/download/ and follow the installation instructions for your operating system. Neo4j Desktop comes with a variety of tools that can be installed as plugins.
5. [Neo4j Sandbox](https://neo4j.com/sandbox/) provides a number of example datasets that can help you to learn more about Neo4j graph database and Cypher queries applied to a specific use case.

The simplest way to start a database locally is with a Docker container:

```bash
docker run \
    --restart always \
    --publish=7474:7474 --publish=7687:7687 \
    --env NEO4J_AUTH=neo4j/passw0rd \
    neo4j:5.15.0
```

> **Note** Assuming you have a Docker container running, navigate to http://localhost:7474 to open [Neo4j Browser](https://neo4j.com/docs/browser-manual/current/) to explore graph data and run Cypher queries.

### Cypher with Movies

Let's prepare a Movies Graph database according to the property graph model example given above. We will use it to explore Vector Index setup.

If you are working in Google Colab, the easiest option is to start a free cloud instance in [AuraDB](https://neo4j.com/cloud/platform/aura-graph-database/). Once the instance is up and running, and assuming you have stored the credentials, navigate to the [Neo4j Browser App](https://browser.neo4j.io/) and connect to the instance.

Run the `:play movie-graph` command and you should see the "Movie Graph Guide" open as a result. As part of that guide there will be a "single Cypher query statement composed of multiple `CREATE` clauses. This will create the movie graph..". The guide should also help you to learn basic Cypher query syntax.

Below is an example `CREATE` clause to add a Movie and Actors with `:ACTED_IN` relationships:

```cypher
CREATE (TheMatrix:Movie {title:'The Matrix', released:1999, tagline:'Welcome to the Real World'})

CREATE (Keanu:Person:Actor {name:'Keanu Reeves', born:1964})
CREATE (Laurence:Person:Actor {name:'Laurence Fishburne', born:1961})

CREATE
  (Keanu)-[:ACTED_IN {roles:['Neo']}]->(TheMatrix),
  (Laurence)-[:ACTED_IN {roles:['Morpheus']}]->(TheMatrix)
```

#### Run queries with Neo4j Python Driver


[The Official Neo4j Driver for Python](https://neo4j.com/docs/api/python-driver/current/api.html) can be installed as follows:

In [None]:
!pip install neo4j

Create helper functions to query Neo4j using Cypher:
<a name="cell_cypher_read_query"/>

In [None]:
from neo4j import GraphDatabase, RoutingControl
from google.colab import userdata

GRAPH_DB_URI = userdata.get("GRAPH_DB_URI")
GRAPH_DB_NAME = userdata.get("GRAPH_DB_NAME")
GRAPH_DB_AUTH = ("neo4j", userdata.get("GRAPH_DB_SECRET"))

def cypher_read_query(query, **parameters):
  with GraphDatabase.driver(GRAPH_DB_URI, auth=GRAPH_DB_AUTH) as driver:
    records, _, _ = driver.execute_query(query,
                                         parameters_=parameters,
                                         database_=GRAPH_DB_NAME,
                                         routing_=RoutingControl.READ)
    return records

def cypher_write_query(query, **parameters):
  with GraphDatabase.driver(GRAPH_DB_URI, auth=GRAPH_DB_AUTH) as driver:
    result = driver.execute_query(query,
                                  parameters_=parameters,
                                  database_=GRAPH_DB_NAME,
                                  routing_=RoutingControl.WRITE)
    return result

Example (1): Find the actor named "Tom Hanks"

In [None]:
records = cypher_read_query("MATCH (tom {name: $name}) RETURN tom", name="Tom Hanks")
[record.data().get("tom") for record in records]

Example (2): Find movies released in the 1990s

In [None]:
records = cypher_read_query(
    # The year range comes from the query parameters instead of being hard-coded
    "MATCH (nineties:Movie) WHERE nineties.released >= $year_start AND nineties.released < $year_end RETURN nineties.title",
    year_start=1990, year_end=2000
)
[record.data() for record in records]

Example (3): List all Tom Hanks Movies

In [None]:
records = cypher_read_query(
    "MATCH (tom:Person {name: $name})-[:ACTED_IN]->(tomHanksMovies) RETURN tom,tomHanksMovies",
    name="Tom Hanks"
)
[record.data().get("tomHanksMovies") for record in records]

### Vector search indexes

> Node vector [search indexes](https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/) were released as a public beta in Neo4j 5.11 and general availability in Neo4j 5.13.

<table>
<tr><td>
<img src="https://github.com/prosto/neo4j-haystack-playground/blob/main/images/grounding-llm-neo4j-vector-index.png?raw=true"/>
</td></tr>
<tr><td>
Source: <a href="https://neo4j.com/labs/genai-ecosystem/vector-search/">Neo4j Vector Index and Search</a>
</td></tr>
</table>

#### Create and configure vector indexes

You can create vector indexes using the `CREATE VECTOR INDEX` command. An index can be given a unique name when created (or get a generated one), which is used to reference the specific index when querying or dropping it.

In [None]:
create_index_query = """
  CREATE VECTOR INDEX `movie-embeddings` IF NOT EXISTS
  FOR (n:Movie) ON (n.embedding)
  OPTIONS {indexConfig: {
  `vector.dimensions`: $dimensions,
  `vector.similarity_function`: $similarity_function
  }}
"""

cypher_write_query(create_index_query, dimensions=384, similarity_function="cosine")

Query index information:

In [None]:
cypher_read_query("SHOW VECTOR INDEXES YIELD name, type, entityType, labelsOrTypes, properties, options")

#### Create embeddings with sentence_transformers

In [None]:
!pip install -U sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

movie_nodes =  [record.get("movie") for record in cypher_read_query("MATCH (movie:Movie) RETURN movie")]

tagline_embeddings = model.encode([node.get("tagline") for node in movie_nodes])
movie_embeddings = [{"id": node.element_id, "vector": embedding} for embedding, node in zip(tagline_embeddings, movie_nodes)]

cypher_write_query("""
  WITH $movie_embeddings AS batch
  UNWIND batch as movie_embedding
  MATCH (n:Movie) WHERE elementId(n) = movie_embedding.id
  CALL db.create.setNodeVectorProperty(n, 'embedding', movie_embedding.vector)
  """, movie_embeddings=movie_embeddings)

In [None]:
cypher_read_query("SHOW VECTOR INDEXES YIELD name, type, entityType, labelsOrTypes, properties, options")
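
The write above sends all embeddings in a single `$movie_embeddings` parameter, which is fine for a small Movies Graph. For larger datasets one would typically split the rows into batches and run one write per batch. Below is a minimal sketch of such a helper; `batched` and `update_query` are hypothetical names for this example, and the batch size of 100 is an arbitrary assumption.

```python
from typing import Iterator, List, Sequence


def batched(rows: Sequence[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of `rows` with at most `batch_size` items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


# Stand-in rows shaped like the `movie_embeddings` list built above
rows = [{"id": str(i), "vector": [0.0, 1.0]} for i in range(250)]
batch_sizes = [len(batch) for batch in batched(rows, batch_size=100)]
print(batch_sizes)

# In the notebook each batch would become its own write, for example
# (`update_query` stands for the UNWIND query shown above):
# for batch in batched(movie_embeddings, batch_size=100):
#     cypher_write_query(update_query, movie_embeddings=batch)
```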

#### Query a vector index

You can query a vector index using the `db.index.vector.queryNodes` procedure.
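
One point is worth keeping in mind when the procedure is combined with a `WHERE` clause, as the query in the next cell does: the procedure first yields the `top_k` nearest nodes, and the filter is applied to those rows afterwards, so fewer than `top_k` rows may come back. The sketch below imitates this with an exact brute-force search in plain Python. Neo4j uses an approximate nearest-neighbour index instead, and the data and function names here are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_then_filter(query, movies, top_k, released_after):
    """Pick the top_k most similar movies first, then filter them by release year."""
    ranked = sorted(movies, key=lambda m: cosine_similarity(query, m["embedding"]), reverse=True)
    nearest = ranked[:top_k]
    return [m["title"] for m in nearest if m["released"] > released_after]


# Toy two-dimensional embeddings (real ones have 384 dimensions in this notebook)
movies = [
    {"title": "A", "released": 1999, "embedding": [1.0, 0.0]},
    {"title": "B", "released": 2003, "embedding": [0.9, 0.1]},
    {"title": "C", "released": 1995, "embedding": [0.8, 0.2]},
    {"title": "D", "released": 2010, "embedding": [0.0, 1.0]},
]

result = top_k_then_filter([1.0, 0.0], movies, top_k=3, released_after=2000)
print(result)
```

Movie "D" satisfies the year filter but is not among the three nearest neighbours, so only "B" is returned. Increasing `top_k` is the simplest way to leave room for rows that the filter removes.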

In [None]:
text_query = "crack the cypher" # @param {type:"string"}
# Embed the query text with the same model that produced the tagline embeddings
query_embedding = SentenceTransformer("all-MiniLM-L6-v2").encode(text_query)

cypher_read_query("""
  CALL db.index.vector.queryNodes('movie-embeddings', $top_k, $embedding)
  YIELD node AS similarMovie, score

  MATCH (similarMovie) WHERE similarMovie.released > 2000
  RETURN similarMovie.tagline AS tagline, score
""", embedding=query_embedding, top_k=3)

## Meet Haystack 2.0



**Haystack is an open-source framework for building production-ready LLM applications, retrieval-augmented generative pipelines and state-of-the-art search systems that work intelligently over large document collections.**

Please explore the [fabulous documentation](https://docs.haystack.deepset.ai/docs/intro) for more details.

<table width="100%">
<tr><th>Haystack Ecosystem</th></tr>
<tr>
<td align="center" bgcolor="202424">
  <img src="https://github.com/prosto/neo4j-haystack-playground/blob/main/images/haystack-universe.png?raw=true"/>
</td>
</tr>
</table>

and more...

#### Components

Components are the building blocks of a pipeline. They perform tasks such as preprocessing, retrieving, or summarizing text while routing queries through different branches of a pipeline.

Below are some examples of components which we will be using in pipelines later in the notebook.

<table width="100%">
<tr>
  <td align="center" bgcolor="202424">
    <img src="https://github.com/prosto/neo4j-haystack-playground/blob/main/images/retriever-class-diagram.png?raw=true"/>
  </td>
</tr>
</table>
<br/>
<table width="100%">
<tr>
  <td bgcolor="202424">
    <img src="https://github.com/prosto/neo4j-haystack-playground/blob/main/images/generator-class-diagram.png?raw=true"/>
  </td>
</tr>
</table>

Please notice the following:

* Each Haystack component is usually created with a set of initialization parameters (e.g. see `model`, `url` in `HuggingFaceTGIGenerator` or `top_k` in `InMemoryEmbeddingRetriever`).
* A component can be invoked directly in Python code by calling its `run` method.
* A component usually expects inputs, which you provide as arguments to the `run` method.
* The result of a component's execution is a Python dictionary, outlined by the `OutputType` note in the diagram.
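
The conventions listed above can be imitated with plain Python classes. The sketch below is not the Haystack API (real components are declared with Haystack's `@component` decorator and state their output types). `UpperCaseConverter` and `WordCounter` are hypothetical classes that only mirror the "init parameters, `run` method, dictionary output" shape.

```python
class UpperCaseConverter:
    """Toy component: configured at creation time, invoked through `run`."""

    def __init__(self, suffix: str = "!"):
        self.suffix = suffix  # initialization parameter

    def run(self, text: str) -> dict:
        # The output is a dictionary keyed by output name
        return {"text": text.upper() + self.suffix}


class WordCounter:
    """Toy component that consumes the `text` output of another component."""

    def run(self, text: str) -> dict:
        return {"count": len(text.split())}


converter = UpperCaseConverter(suffix="?")
counter = WordCounter()

# Calling `run` directly and wiring one output into the next input by hand
converted = converter.run(text="who created dothraki")
counted = counter.run(text=converted["text"])
print(converted, counted)
```

A pipeline automates exactly this hand-wiring: it knows which named output feeds which named input and calls each `run` method in order.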

For better understanding of how components work please refer to the [documentation](https://docs.haystack.deepset.ai/docs/components).

#### Data Classes

> In Haystack, there are a handful of core classes that are regularly used in many different places. These are classes that carry data through the system and you are likely to interact with these as either the input or output of your pipeline.

> Haystack 2.0 uses data classes to help components communicate with each other in a simple and modular way. By doing this, data flows seamlessly through the Haystack Pipelines.

Learn more about data classes in [Haystack docs](https://docs.haystack.deepset.ai/docs/data-classes).

<table width="100%">
<tr>
  <td bgcolor="202424" align="center">
  <img src="https://github.com/prosto/neo4j-haystack-playground/blob/main/images/document-class-diagram.png?raw=true"/>
  </td>
</tr>
</table>

> Document represents a central data abstraction in Haystack, capable of holding text, tables, and binary data.

In our case the `Document` is of particular interest, as it is the main data structure used to interact with Neo4j. A `Document` is stored in Neo4j as a node, and both the `meta` and `embedding` attributes are used to represent additional data points if needed.

#### Document Store

> [Document Store](https://docs.haystack.deepset.ai/docs/document-store) is an object that stores your Documents. In Haystack, a Document Store is different from a component, as it doesn’t have the `run()` method. You can think of it as an interface to your database – you put the information there, or you can look through it. This means that a Document Store is not a piece of a Pipeline, but rather a tool that the components of a pipeline have access to and can interact with.

> The most common way to use a Document Store in Haystack is to fetch documents using a Retriever. A Document Store will often have a corresponding Retriever to get the most out of specific technologies.

Below you can see methods of the [DocumentStore Protocol](https://docs.haystack.deepset.ai/docs/document-store#documentstore-protocol):

<table width="100%">
<tr>
  <td bgcolor="202424" align="center">
  <img src="https://github.com/prosto/neo4j-haystack-playground/blob/main/images/document-store-class-diagram.png?raw=true"/>
  </td>
</tr>
</table>

The DocumentStore Protocol has to be implemented if you are creating a [custom Document Store](https://docs.haystack.deepset.ai/docs/creating-custom-document-stores). The `neo4j-haystack` package implements all methods from the protocol in the [Neo4jDocumentStore](https://prosto.github.io/neo4j-haystack/reference/neo4j_store/) class.
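
To make the idea of "implementing a protocol" concrete, here is a deliberately simplified sketch using `typing.Protocol` and plain dictionaries instead of Haystack `Document` objects. `ToyDocumentStore` and `DictDocumentStore` are hypothetical names; the real protocol works with `Document` instances, a richer filter syntax and duplicate-handling policies, so treat this as an illustration of the shape only.

```python
from typing import Dict, List, Optional, Protocol


class ToyDocumentStore(Protocol):
    """Simplified stand-in for a document store protocol (note: no `run` method)."""

    def count_documents(self) -> int: ...
    def filter_documents(self, filters: Optional[dict] = None) -> List[dict]: ...
    def write_documents(self, documents: List[dict]) -> int: ...
    def delete_documents(self, document_ids: List[str]) -> None: ...


class DictDocumentStore:
    """Keeps documents in a dictionary keyed by document id."""

    def __init__(self) -> None:
        self._docs: Dict[str, dict] = {}

    def count_documents(self) -> int:
        return len(self._docs)

    def filter_documents(self, filters: Optional[dict] = None) -> List[dict]:
        # Equality-only matching on top-level keys; no filters returns everything
        filters = filters or {}
        return [d for d in self._docs.values() if all(d.get(k) == v for k, v in filters.items())]

    def write_documents(self, documents: List[dict]) -> int:
        for doc in documents:
            self._docs[doc["id"]] = doc  # a duplicate id overwrites the stored document
        return len(documents)

    def delete_documents(self, document_ids: List[str]) -> None:
        for doc_id in document_ids:
            self._docs.pop(doc_id, None)


# Structural typing: DictDocumentStore satisfies the protocol without inheriting from it
store: ToyDocumentStore = DictDocumentStore()
written = store.write_documents([
    {"id": "1", "content": "first", "genre": "Drama"},
    {"id": "2", "content": "second", "genre": "Comedy"},
])
dramas = store.filter_documents({"genre": "Drama"})
store.delete_documents(["2"])
remaining = store.count_documents()
print(written, dramas, remaining)
```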

### Building Pipelines with InMemoryDocumentStore

Our goal here is to quickly demonstrate how pipelines work and, in particular, how Document Stores are used. Later we will replace `InMemoryDocumentStore` with `Neo4jDocumentStore`, and the pipelines will remain practically the same.

#### Prepare Game Of Thrones (Data)

In [None]:
GOT_DOCS_DIR="data/got"
GOT_ZIP_URL="https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt6.zip"

In [None]:
# download Game of Thrones (GOT) wiki
!mkdir -p $GOT_DOCS_DIR
!wget $GOT_ZIP_URL -O wiki_gameofthrones_txt6.zip
!unzip -o wiki_gameofthrones_txt6.zip -d $GOT_DOCS_DIR

#### Install Haystack

In [None]:
!pip install haystack-ai

> The [InMemoryDocumentStore](https://docs.haystack.deepset.ai/docs/inmemorydocumentstore) is a very simple document store with no extra services or dependencies.

> It is great for experimenting with Haystack, however we do not recommend using it for production.

In [None]:
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

#### Indexing Pipeline with InMemoryDocumentStore

In [None]:
import os
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter

pipe = Pipeline()
pipe.add_component("text_file_converter", TextFileToDocument())
pipe.add_component("cleaner", DocumentCleaner())
pipe.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=250, split_overlap=30))
pipe.add_component("embedder", SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))
pipe.add_component("writer", DocumentWriter(document_store=document_store))

pipe.connect("text_file_converter.documents", "cleaner.documents")
pipe.connect("cleaner.documents", "splitter.documents")
pipe.connect("splitter.documents", "embedder.documents")
pipe.connect("embedder.documents", "writer.documents")

# Take the docs data directory as input and run the pipeline
file_paths = [GOT_DOCS_DIR / Path(name) for name in os.listdir(GOT_DOCS_DIR)]
result = pipe.run({"text_file_converter": {"sources": file_paths}})
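
The `split_length` and `split_overlap` settings of the splitter can be pictured as a sliding window over sentences. The sketch below is a rough model of that idea in plain Python, with small numbers to keep the output readable; `split_with_overlap` is a hypothetical helper, and Haystack's `DocumentSplitter` may treat edge cases (such as the last, shorter chunk) differently.

```python
from typing import List


def split_with_overlap(units: List[str], split_length: int, split_overlap: int) -> List[List[str]]:
    """Group units into windows of `split_length` that share `split_overlap` units."""
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be non-negative and smaller than split_length")
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + split_length])
        if start + split_length >= len(units):
            break  # the window already reached the end of the input
    return chunks


sentences = [f"s{i}" for i in range(10)]
chunks = split_with_overlap(sentences, split_length=4, split_overlap=1)
print(chunks)
```

Each chunk starts with the last sentence of the previous one, so context that straddles a chunk boundary is not lost when chunks are embedded separately.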


#### RAG Pipeline with InMemoryDocumentStore

In [None]:
from google.colab import userdata

from haystack import GeneratedAnswer, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import HuggingFaceTGIGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.utils import Secret

HF_TOKEN = Secret.from_token(userdata.get("HF_TOKEN"))

prompt_template = """
Given these documents, answer the question.\nDocuments:
{% for doc in documents %}
    {{ doc.content }}
{% endfor %}

\nQuestion: {{question}}
\nAnswer:
"""
rag_pipeline = Pipeline()
rag_pipeline.add_component(
    "query_embedder",
    SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2", progress_bar=False),
)
rag_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store, top_k=5))
rag_pipeline.add_component("prompt_builder", PromptBuilder(template=prompt_template))
rag_pipeline.add_component(
    "llm",
    HuggingFaceTGIGenerator(model="mistralai/Mistral-7B-v0.1", token=HF_TOKEN),
)
rag_pipeline.add_component("answer_builder", AnswerBuilder())

rag_pipeline.connect("query_embedder", "retriever.query_embedding")
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "llm.prompt")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

# Ask a question on the data you just added.
question = "Who created the Dothraki vocabulary?"
result = rag_pipeline.run(
    {
        "query_embedder": {"text": question},
        "retriever": {"top_k": 3},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)

answer: GeneratedAnswer = result["answer_builder"]["answers"][0]


In [None]:
display(answer)

## Neo4j Document Store

Now that we have covered the main components and concepts of both Neo4j and Haystack, we are ready to discuss the `neo4j-haystack` package and what it offers. We will start with some implementation details and show how `Neo4jDocumentStore` can be used to both create and query documents in Neo4j.

### Big Picture

<table width="100%">
<tr>
  <td bgcolor="202424" align="center">
  <img src="https://github.com/prosto/neo4j-haystack-playground/blob/main/images/big-picture.png?raw=true"/>
  </td>
</tr>
</table>

The conceptual diagram above shows the key areas involved in using the `neo4j-haystack` library and its components:

1. You would usually build Haystack pipelines that include Neo4j-related components. Notice the `retriever` component in the pipeline, which ships with the package.
2. `neo4j-haystack` library comes with a number of components:

  - [Neo4jDocumentStore](https://prosto.github.io/neo4j-haystack/reference/neo4j_store/) is a Document Store for the Neo4j database with support for dense retrieval using a vector search index. It implements the required [Protocol](https://docs.haystack.deepset.ai/v2.0/docs/document-store#documentstore-protocol). Documents are stored as graph nodes. Embeddings are stored as node properties along with the rest of the attributes (including meta). `Neo4jDocumentStore` also provides additional methods to query embeddings and manage the vector index.
  - [Neo4jEmbeddingRetriever](https://prosto.github.io/neo4j-haystack/reference/neo4j_retriever/#neo4j_haystack.components.neo4j_retriever.Neo4jEmbeddingRetriever) is a typical [retriever component](https://docs.haystack.deepset.ai/v2.0/docs/retrievers) which can be used to query the vector index and find related Documents. The component uses `Neo4jDocumentStore` to query embeddings.
  - [Neo4jDynamicDocumentRetriever](https://prosto.github.io/neo4j-haystack/reference/neo4j_retriever/#neo4j_haystack.components.neo4j_retriever.Neo4jDynamicDocumentRetriever) is also a retriever component in the sense that it can be used to query Documents in Neo4j. However, it is decoupled from `Neo4jDocumentStore` and allows running an arbitrary [Cypher query](https://neo4j.com/docs/cypher-manual/current/queries/) to extract documents. In practice, it can query Neo4j the same way `Neo4jDocumentStore` does, including vector search.

3. The `neo4j-haystack` library uses the [Python Driver](https://neo4j.com/docs/api/python-driver/current/api.html#api-documentation) and
[Cypher Queries](https://neo4j.com/docs/cypher-manual/current/introduction/) to interact with the Neo4j database and hides the complexity under the hood. In particular, [Neo4jClient](https://prosto.github.io/neo4j-haystack/reference/neo4j_client/) acts as a data access layer and handles all interactions with the database by invoking Cypher queries.
4. In Neo4j, a [vector search index](https://neo4j.com/docs/cypher-manual/current/indexes-for-vector-search/) is used for storing document embeddings and for dense retrieval.

> As of Neo4j 5.13, the vector search index is no longer a beta feature, so consider using database version 5.13 or later. Known issues and limitations are listed in the [documentation](https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/).



### Package Structure

The source code of the package is available at https://github.com/prosto/neo4j-haystack. The overall structure is shown below, with the main parts highlighted in comments:

```
.
├── CHANGELOG.md
├── CONTRIBUTING.md # How to install the project locally and contribute
├── LICENSE.txt
├── README.md
├── mkdocs.yml
├── pyproject.toml
├── data
├── docs # mkdocs pages for the API documentation web site
├── examples
├── scripts # Helper scripts, e.g. load sample data from hf datasets
├── site
├── src
│   └── neo4j_haystack
│       ├── __about__.py
│       ├── client
│       │   └── neo4j_client.py # Neo4j data access with Neo4jClient, Cypher queries are not exposed to the Neo4jDocumentStore
│       ├── components
│       │   └── neo4j_retriever.py # retrievers (e.g. Neo4jEmbeddingRetriever)
│       ├── document_stores
│       │   ├── neo4j_store.py # Implementation of the Neo4jDocumentStore, uses Neo4jClient to interact with Neo4j
│       │   └── utils.py
│       ├── errors.py
│       └── metadata_filter # Utilities to parse metadata filters
│           ├── neo4j_query_converter.py
│           └── parser.py
└── tests # Unit/Integration tests for better confidence when releasing :)
```

### Explore Neo4jDocumentStore

#### Install `neo4j-haystack` package

In [None]:
!pip install sentence-transformers # required for producing embeddings in our examples
!pip install neo4j-haystack

#### Create DocumentStore with settings

*We [clean up](https://neo4j.com/docs/aura/auradb/managing-databases/database-actions/#_resetting_an_instance) the existing Free instance in AuraDB before proceeding to the next step. There is a "Reset" option in the console which wipes out all data. The following warning is displayed before the reset happens:*

> *Resetting into a blank state will erase all data, so please be certain. If you want to keep the current data please take a snapshot and export it.*

In [None]:
from google.colab import userdata
from neo4j_haystack import Neo4jDocumentStore

GRAPH_DB_URI = userdata.get("GRAPH_DB_URI")
GRAPH_DB_NAME = userdata.get("GRAPH_DB_NAME")
GRAPH_DB_SECRET = userdata.get("GRAPH_DB_SECRET")

document_store = Neo4jDocumentStore(
    url=GRAPH_DB_URI,
    username="neo4j",
    password=GRAPH_DB_SECRET,
    database=GRAPH_DB_NAME,
    index="document-embeddings", # The name of the Vector Index in Neo4j
    node_label="Document", # Providing a label to Neo4j nodes which store Documents
    embedding_dim=384, # default is 768
    embedding_field="embedding",
    similarity="cosine", # "cosine" is default value for similarity
    progress_bar=False,
    create_index_if_missing=False,
    recreate_index=False,
    write_batch_size=100,
    verify_connectivity=True # Will try connect to Neo4j instance with given credentials as soon as Neo4jDocumentStore is created
)

* With `verify_connectivity=True`, the code above running without an error means the connection to Neo4j was successful.
* `node_label="Document"` sets the label given to the Neo4j nodes which store Documents.
* `create_index_if_missing=False` makes sure the index is not created automatically.
* `recreate_index=True` is useful during local testing if you would like to recreate the index each time.
* `similarity` is `cosine` by default (you could skip setting the value). Another supported option is `l2`, which maps to `euclidean` in Neo4j.
* `embedding_dim` depends on the model you use for embeddings, e.g. for "all-MiniLM-L6-v2" from sentence-transformers it is `384`.
* `embedding_field` specifies the name of the node property used to store and query embeddings; the vector index is configured on the same property.
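
The difference between the two `similarity` options can be illustrated with a small calculation. The sketch below follows my reading of the Neo4j vector index documentation, where raw values are normalized into a score between 0 and 1: `(1 + cosine) / 2` for `cosine` and `1 / (1 + distance²)` for `euclidean`. Treat the exact formulas as an assumption and check the documentation for your database version; the function names are made up for this example.

```python
import math


def cosine_score(a, b):
    """Assumed normalization for 'cosine': (1 + cos(a, b)) / 2, in the range [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return (1 + dot / (norm_a * norm_b)) / 2


def euclidean_score(a, b):
    """Assumed normalization for 'euclidean': 1 / (1 + squared distance), in (0, 1]."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 / (1 + squared_distance)


same_direction = ([1.0, 0.0], [10.0, 0.0])   # same direction, different length
print(cosine_score(*same_direction))          # direction only, so length is ignored
print(euclidean_score(*same_direction))       # penalized for the distance between the points
```

Cosine compares only the direction of two vectors, while Euclidean distance also reacts to their magnitude, which is why the choice should match how the embedding model was trained.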

> Note: If you are wondering why `username` is "neo4j" and not a secret, that's because the "neo4j" user is created by default in an AuraDB Free instance.

If you count the number of documents in the graph, you should get `0` at this point:

In [None]:
document_store.count_documents()

Let's create the instance once more. We will omit some of the attributes, leaving them at their default values, and instruct our code to create the index if it does not exist:

In [None]:
# same as above but `create_index_if_missing=True` meaning index will be created automatically
document_store = Neo4jDocumentStore(
    url=GRAPH_DB_URI,
    username="neo4j",
    password=GRAPH_DB_SECRET,
    database=GRAPH_DB_NAME,
    embedding_dim=384,
    create_index_if_missing=True,
)

It's time to see whether the index has been created in Neo4j. Let's verify by running the following Cypher query:

In [None]:
cypher_read_query("SHOW VECTOR INDEXES YIELD name, type, entityType, labelsOrTypes, properties, options")

The result should be a record containing index information matching the parameters specified for the `Neo4jDocumentStore` instance.

Alternatively, Neo4j connection properties can be specified using a dedicated [Neo4jClientConfig](https://prosto.github.io/neo4j-haystack/reference/neo4j_client/#neo4j_haystack.client.neo4j_client.Neo4jClientConfig) data class. This additional data structure was created for code reuse and convenience, so you can specify connection settings once and then share them between different instances of `Neo4jDocumentStore`. Internally, `Neo4jDocumentStore` creates a `Neo4jClientConfig` to hold credentials even if you provide credentials directly to the DocumentStore constructor, as in the examples above.

In [None]:
from neo4j_haystack import Neo4jClientConfig, Neo4jDocumentStore

client_config = Neo4jClientConfig(
    url=GRAPH_DB_URI,
    username="neo4j",
    password=GRAPH_DB_SECRET,
    database=GRAPH_DB_NAME,
)

document_store = Neo4jDocumentStore(client_config=client_config, embedding_dim=384)

#### Write data to Document Store



`Neo4jDocumentStore`, according to the Protocol, provides a `write_documents` method which can be used to write data to Neo4j and also update embeddings on the documents, if provided.

Before that, we need to prepare some data in the [Document](https://docs.haystack.deepset.ai/docs/data-classes#document) format.

The `movies.json` file was prepared to map easily to the Haystack Document model. Below is an example of a single JSON movie entry:

```json
{
  "id": "451999",
  "content": "A 1916 film directed by Chester M. Franklin.",
  "meta": {
    "title": "Martha's Vindication",
    "runtime": 50.0,
    "vote_average": 0.0,
    "release_date": "1916-02-20",
    "genres": ["Drama"]
  }
}
```

> **Important** `id` is provided in the JSON; each Document should have an `id` field. If it is not provided, the `Document` class will automatically generate one based on the document's contents.
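
To give a rough idea of what a content-derived id means, the sketch below hashes a document's fields with SHA-256. The helper name `content_based_id` and the choice of hashed fields are assumptions made for this illustration; Haystack's actual id generation may differ in detail.

```python
import hashlib
import json


def content_based_id(content: str, meta: dict) -> str:
    """Hypothetical helper: derive a deterministic id from document fields."""
    # Serialize with sorted keys so that key order does not change the hash
    payload = json.dumps({"content": content, "meta": meta}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


doc_id = content_based_id("A 1916 film directed by Chester M. Franklin.", {"title": "Martha's Vindication"})
same_id = content_based_id("A 1916 film directed by Chester M. Franklin.", {"title": "Martha's Vindication"})
other_id = content_based_id("A different description.", {"title": "Martha's Vindication"})
```

Identical fields always produce the same id, while any change in content produces a different one. This is why writing the same document twice is treated as a duplicate.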


In [None]:
import json
from typing import List

from haystack import Document
from urllib.request import urlopen

MOVIES_DATA_URL="https://raw.githubusercontent.com/prosto/neo4j-haystack-playground/main/data/movies.json"

def movie_documents() -> List[Document]:
    with urlopen(MOVIES_DATA_URL) as movies_json:
        file_contents = movies_json.read()
        docs_json = json.loads(file_contents)
        return [Document.from_dict(doc_json) for doc_json in docs_json]

documents=movie_documents()

display(documents[0])

Let's write the movie Documents to the store:

In [None]:
document_store.write_documents(documents)

If you are curious how to read the data just written to Neo4j, let's query it with Cypher:
> **Quick Action** Navigate to [cypher_read_query](#cell_cypher_read_query) definition

In [None]:
cypher_read_query("MATCH (doc:Document) RETURN doc LIMIT 5")

In addition to the Protocol methods, `Neo4jDocumentStore` provides a number of helper methods; `get_document_by_id` is one of them. See all public methods in the [API documentation](https://prosto.github.io/neo4j-haystack/reference/neo4j_store/).

In [None]:
document_store.get_document_by_id("451999").to_dict()

> **Important** The above `to_dict` invocation creates a Python `dict` from the Document fields. You could call it with the additional parameter `flatten=False`, in which case all metadata attributes (see the original JSON from `movies.json`) are stored under the `meta` key (also as a dictionary). The current implementation expects `flatten=True` (the default value), as Neo4j stores properties in a flat format (not nested).
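
The difference between the nested and the flat representation can be sketched in plain Python. The helper `flatten_document` below is hypothetical and only mimics the idea of moving `meta` entries to the top level; it is not the code Haystack runs.

```python
def flatten_document(doc: dict) -> dict:
    """Hypothetical helper: move entries of the nested `meta` dict to the top level."""
    flat = {key: value for key, value in doc.items() if key != "meta"}
    flat.update(doc.get("meta", {}))
    return flat


nested = {
    "id": "451999",
    "content": "A 1916 film directed by Chester M. Franklin.",
    "meta": {"title": "Martha's Vindication", "runtime": 50.0, "genres": ["Drama"]},
}
flat = flatten_document(nested)
```

The flat dictionary has `title`, `runtime` and `genres` next to `id` and `content`, which is the shape that maps directly onto node properties.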

We will put Haystack Document representation and Neo4j graph node representation side by side for you to see how data is being stored and mapped:

<table>
<tr><th>Haystack</th><th>Neo4j (Node)</th></tr>
<tr>
<td>
<pre>
{
  "id": "451999",
  "content": "A 1916 film directed by Chester M. Franklin.",
  "dataframe": null,
  "blob": null,
  "score": null,
  "embedding": null,
  "title": "Martha's Vindication",
  "vote_average": 0.0,
  "genres": ["Drama"],
  "release_date": "1916-02-20",
  "runtime": 50.0
}
</pre>
</td>
<td>
<pre>
{
  "identity": 12,
  "labels": ["Document"],
  "properties": {
    "release_date": "1916-02-20",
    "genres": ["Drama"],
    "vote_average": 0.0,
    "runtime": 50.0,
    "id": "451999",
    "title": "Martha's Vindication",
    "content": "A 1916 film directed by Chester M. Franklin."
  },
  "elementId": "4:bd65188c-5b0a-46c5-80ca-2ff365b9899a:12"
}
</pre>
</td>
</tr>
</table>

> **Note:** Neo4j creates additional fields, e.g. `elementId`, which are not controlled by the DocumentStore. Haystack Document fields are mapped to the Node's `properties`.

We did not create embeddings (see `"embedding": null` above) before writing documents to Neo4j. We could use `sentence_transformers` to create embeddings as before. However, there is another way: Haystack allows you to run components directly, not only inside a pipeline, and `SentenceTransformersDocumentEmbedder` can help us with that.



In [None]:
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

document_embedder.warm_up() # will download the model during first run
documents_with_embeddings = document_embedder.run(documents)

In [None]:
# Write documents with embeddings, in our case only embeddings are updated, rest of properties remain same:
from haystack.document_stores.types import DuplicatePolicy
document_store.write_documents(documents_with_embeddings.get("documents"), DuplicatePolicy.OVERWRITE)

In [None]:
# Retrieve documents with embeddings:
document_store.filter_documents()

`Neo4jDocumentStore` exposes a custom `query_by_embedding` method to help you query the Vector Index in Neo4j. We know there is a movie with the following description:

> A film student robs a bank under the guise of shooting a short film about a bank robbery.

We will embed a query which is semantically similar and ask the document store to find the document for us.

<a name="cell_embed_text"></a>

In [None]:
from haystack.components.embedders import SentenceTransformersTextEmbedder

def embed_text(text):
  text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
  text_embedder.warm_up()

  return text_embedder.run(text).get("embedding")

query_embedding = embed_text("A young fella pretending to be a good citizen but actually planning to commit a crime")

similar_documents = document_store.query_by_embedding(query_embedding, top_k=3)

# expected document should be in the list (not necessarily first)
display(similar_documents)
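
Conceptually, a query by embedding ranks stored vectors by their similarity to the query vector and keeps the best `top_k`. The brute-force sketch below assumes cosine similarity and uses made-up three-dimensional vectors; a real vector index relies on an approximate nearest-neighbour structure rather than a full scan, and the function names here are illustrative only.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_similar(query: List[float], embeddings: Dict[str, List[float]], top_k: int) -> List[Tuple[str, float]]:
    # Score every stored vector against the query, then keep the best top_k
    scored = [(doc_id, cosine_similarity(query, emb)) for doc_id, emb in embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy "embeddings" keyed by a made-up document id
toy_embeddings = {
    "heist": [0.9, 0.1, 0.0],
    "romance": [0.0, 1.0, 0.0],
    "space": [0.0, 0.0, 1.0],
}
ranked = top_k_similar([1.0, 0.2, 0.0], toy_embeddings, top_k=2)
```

Here `ranked` lists `"heist"` first and `"romance"` second, because the query vector points in nearly the same direction as the `"heist"` vector.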

#### Metadata Filtering

You probably noticed that `Neo4jDocumentStore`, as part of its Document Store Protocol implementation, provides a `filter_documents` method. It has the following signature:

```python
filter_documents(self, filters: Optional[FilterType] = None) -> List[Document]
```

You can use it to retrieve documents from Neo4j with [metadata filters](https://docs.haystack.deepset.ai/v2.0/docs/metadata-filtering).

Internally, `Neo4jDocumentStore` converts filters into a Cypher query, specifically into a [WHERE clause](https://neo4j.com/docs/cypher-manual/current/clauses/where/). For that, two utility classes, [FilterParser](https://prosto.github.io/neo4j-haystack/reference/metadata_filter/parser/) and [Neo4jQueryConverter](https://prosto.github.io/neo4j-haystack/reference/metadata_filter/neo4j_query_converter/), parse and convert filters to Cypher syntax respectively.

The following example uses both classes to produce a parsed Cypher query:

In [None]:
from neo4j_haystack.metadata_filter import FilterParser, FilterType, Neo4jQueryConverter

parser = FilterParser()
converter = Neo4jQueryConverter(field_name_prefix="doc")

filters = {
    "operator": "OR",
    "conditions": [
        {
            "operator": "AND",
            "conditions": [
                {"field": "type", "operator": "==", "value": "news"},
                {"field": "likes", "operator": "!=", "value": 100},
            ],
        },
        {
            "operator": "AND",
            "conditions": [
                {"field": "type", "operator": "==", "value": "blog"},
                {"field": "likes", "operator": ">=", "value": 500},
            ],
        },
    ],
}

filter_ast = parser.parse(filters)
cypher_query, params = converter.convert(filter_ast)

cypher_query, params

You should get the following query (applicable as `WHERE` clause):

```cypher
((doc.type = $fv_type AND doc.likes <> $fv_likes) OR (doc.type = $fv_type_1 AND doc.likes >= $fv_likes_1))
```

with parameters (`params`):

```python
{"fv_type": "news", "fv_likes": 100, "fv_type_1": "blog", "fv_likes_1": 500}
```

> **Note** The Cypher query is accompanied by parameters because data type conversion of parameter values is delegated to the Neo4j Python Driver instead of being repeated in the converter class. See the full mapping of core and extended types in the Data Types document.
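
To make the conversion less of a black box, here is a minimal sketch of how a nested filter could be turned into a parameterized `WHERE` expression. It is not the implementation of `FilterParser` or `Neo4jQueryConverter`: the function `to_where_clause`, the operator table and the parameter naming scheme are assumptions for illustration, and only logical `AND`/`OR` nodes with simple comparison operators are handled.

```python
from typing import Any, Dict, Tuple

# Comparison operators of the filter syntax mapped to Cypher operators (subset only)
COMPARISON_OPS = {"==": "=", "!=": "<>", ">": ">", ">=": ">=", "<": "<", "<=": "<="}


def to_where_clause(filters: Dict[str, Any], prefix: str = "doc") -> Tuple[str, Dict[str, Any]]:
    """Hypothetical converter: returns a Cypher WHERE expression and its parameters."""
    params: Dict[str, Any] = {}

    def unique_name(field: str) -> str:
        # fv_type, fv_type_1, fv_type_2, ... so repeated fields do not clash
        base = f"fv_{field}"
        name, suffix = base, 0
        while name in params:
            suffix += 1
            name = f"{base}_{suffix}"
        return name

    def visit(node: Dict[str, Any]) -> str:
        if "conditions" in node:  # logical node: AND / OR
            parts = [visit(child) for child in node["conditions"]]
            return "(" + f" {node['operator']} ".join(parts) + ")"
        # comparison node: the value travels as a parameter, never inlined into the query
        name = unique_name(node["field"])
        params[name] = node["value"]
        return f"{prefix}.{node['field']} {COMPARISON_OPS[node['operator']]} ${name}"

    return visit(filters), params


example_filters = {
    "operator": "OR",
    "conditions": [
        {
            "operator": "AND",
            "conditions": [
                {"field": "type", "operator": "==", "value": "news"},
                {"field": "likes", "operator": "!=", "value": 100},
            ],
        },
        {
            "operator": "AND",
            "conditions": [
                {"field": "type", "operator": "==", "value": "blog"},
                {"field": "likes", "operator": ">=", "value": 500},
            ],
        },
    ],
}
where_clause, where_params = to_where_clause(example_filters)
```

Keeping values in `where_params` rather than formatting them into the query string leaves type handling to the driver and avoids quoting problems.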


Let's find some movies, preferably comedies 😀

In [None]:
# "Comedy" genre
document_store.filter_documents({
    "field": "genres",
    "operator": "in",
    "value": ["Comedy"]
})


In [None]:
# "Comedy" genre with average rating > 7
document_store.filter_documents({
    "operator": "AND",
    "conditions": [
        {"field": "genres", "operator": "in", "value": ["Comedy"]},
        {"field": "vote_average", "operator": ">", "value": 7},
    ],
})

`filter_documents` does not let you combine an embedding (vector) search with metadata filters. `query_by_embedding` does:

> **Quick Action** Navigate to [embed_text](#cell_embed_text) definition

In [None]:
query_embedding = embed_text("Never growing up")
filters = {"field": "genres", "operator": "in", "value": ["Comedy"]}

similar_documents = document_store.query_by_embedding(query_embedding, top_k=10, filters=filters)

display(similar_documents)

## Explore RAG Pipelines

In [None]:
!pip install sentence-transformers # required for producing embeddings in our examples
!pip install neo4j-haystack

In [None]:
# setup common utils and constants
from google.colab import userdata

GRAPH_DB_URI = userdata.get("GRAPH_DB_URI")
GRAPH_DB_NAME = userdata.get("GRAPH_DB_NAME")
GRAPH_DB_SECRET = userdata.get("GRAPH_DB_SECRET")

GOT_DOCS_DIR="data/got"
GOT_ZIP_URL="https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt6.zip"

MODEL_EMBEDDING_DIM=384
MODEL_EMBEDDING_NAME="sentence-transformers/all-MiniLM-L6-v2"

HF_API_TOKEN=userdata.get("HF_TOKEN")

In [None]:
# download Game of Thrones (GOT) wiki
!mkdir -p $GOT_DOCS_DIR
!wget $GOT_ZIP_URL -O wiki_gameofthrones_txt6.zip
!unzip -o wiki_gameofthrones_txt6.zip -d $GOT_DOCS_DIR

### Indexing Pipeline

Our pipeline will:

* Convert wiki txt files to Document using [TextFileToDocument](https://docs.haystack.deepset.ai/docs/textfiletodocument) component
* Preprocess documents with [DocumentCleaner](https://docs.haystack.deepset.ai/docs/documentcleaner), e.g. remove empty lines
* Split Documents by chunks of length 250 using [DocumentSplitter](https://docs.haystack.deepset.ai/docs/documentsplitter)
* Embed resulting chunked Documents with SentenceTransformersDocumentEmbedder
* Write Document chunks to Neo4j as Graph Nodes using [DocumentWriter](https://docs.haystack.deepset.ai/docs/documentwriter)

> **Note** `DocumentWriter` is given `Neo4jDocumentStore` as the store to write documents to. This is the benefit of having a common Protocol for multiple stores.
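
The splitting step is the least obvious one in the list above, so here is a minimal sketch of word-based splitting with overlap. The helper `split_words` is hypothetical and only approximates what a splitter configured with `split_by="word"` does; the real component also tracks metadata such as the source document.

```python
from typing import List


def split_words(text: str, split_length: int, split_overlap: int) -> List[str]:
    """Hypothetical helper: sliding window over whitespace-separated words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_words("one two three four five six seven eight", split_length=4, split_overlap=1)
```

With a window of 4 words and an overlap of 1, each chunk starts with the last word of the previous one, so a sentence cut at a chunk boundary still keeps some shared context on both sides.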



In [None]:
import os
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter

from neo4j_haystack import Neo4jDocumentStore

document_store = Neo4jDocumentStore(
    url=GRAPH_DB_URI,
    username="neo4j",
    password=GRAPH_DB_SECRET,
    database=GRAPH_DB_NAME,
    embedding_dim=MODEL_EMBEDDING_DIM,
    create_index_if_missing=True,
)

# Create components and an indexing pipeline that converts txt to documents, cleans and splits them, and
# indexes them for dense retrieval.
pipe = Pipeline()
pipe.add_component("text_file_converter", TextFileToDocument())
pipe.add_component("cleaner", DocumentCleaner())
pipe.add_component("splitter", DocumentSplitter(split_by="word", split_length=250, split_overlap=30))
pipe.add_component("embedder", SentenceTransformersDocumentEmbedder(model=MODEL_EMBEDDING_NAME))
pipe.add_component("writer", DocumentWriter(document_store=document_store))

pipe.connect("text_file_converter.documents", "cleaner.documents")
pipe.connect("cleaner.documents", "splitter.documents")
pipe.connect("splitter.documents", "embedder.documents")
pipe.connect("embedder.documents", "writer.documents")

doc_sources=[GOT_DOCS_DIR / Path(name) for name in os.listdir(GOT_DOCS_DIR)]
result = pipe.run({"text_file_converter": {"sources": doc_sources}})

display(result)

In [None]:
# display pipeline diagram
pipe.show()

In [None]:
# count number of documents written
document_store.count_documents()

### Generative Question Answering Pipeline

Our pipeline will:

* Create `Neo4jDocumentStore` with required credentials. It will be used to connect to previously indexed documents
* Embed text query (our question) using [SentenceTransformersTextEmbedder](https://docs.haystack.deepset.ai/docs/sentencetransformerstextembedder)
* Pass query embedding to [Neo4jEmbeddingRetriever](https://prosto.github.io/neo4j-haystack/reference/neo4j_retriever/#neo4j_haystack.components.neo4j_retriever.Neo4jEmbeddingRetriever) which will obtain similar documents from `Neo4jDocumentStore`
* Construct simple Q/A prompt using [PromptBuilder](https://docs.haystack.deepset.ai/docs/promptbuilder) by passing over retrieved documents from Neo4j as a prompt context
* Ask the "Mistral-7B" model the question with a context consisting of the previously found documents. [HuggingFaceTGIGenerator](https://docs.haystack.deepset.ai/docs/huggingfacetgigenerator) interacts with the TGI endpoint and requires `HF_TOKEN` for this.
* Generated answer is parsed/composed by the [AnswerBuilder](https://docs.haystack.deepset.ai/docs/answerbuilder) component as a final execution step of the pipeline
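
The prompt construction step boils down to placing the retrieved document contents and the question into a text template. `PromptBuilder` renders a Jinja2 template for this; the plain-Python stand-in below (the hypothetical `build_prompt` with made-up document texts) only illustrates what the rendered prompt roughly looks like.

```python
from typing import List


def build_prompt(question: str, contents: List[str]) -> str:
    """Hypothetical stand-in for the template rendering step."""
    # One indented line per retrieved document, followed by the question
    context = "\n".join(f"    {content}" for content in contents)
    return (
        "Given these documents, answer the question.\n"
        f"Documents:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "Who created the Dothraki vocabulary?",
    ["First retrieved passage (placeholder).", "Second retrieved passage (placeholder)."],
)
```

The model only sees this final string, so the quality of the answer depends directly on which documents the retriever passed in.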

In [None]:
from haystack import GeneratedAnswer, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import HuggingFaceTGIGenerator
from haystack.utils import Secret

from neo4j_haystack import Neo4jDocumentStore, Neo4jEmbeddingRetriever

HF_TOKEN = Secret.from_token(HF_API_TOKEN)

document_store = Neo4jDocumentStore(
    url=GRAPH_DB_URI,
    username="neo4j",
    password=GRAPH_DB_SECRET,
    database=GRAPH_DB_NAME,
    embedding_dim=MODEL_EMBEDDING_DIM,
    create_index_if_missing=False,
)

# Build a RAG pipeline with a Retriever to get relevant documents to the query and a HuggingFaceTGIGenerator
# interacting with LLMs using a custom prompt.
prompt_template = """
Given these documents, answer the question.\nDocuments:
{% for doc in documents %}
    {{ doc.content }}
{% endfor %}

\nQuestion: {{question}}
\nAnswer:
"""
pipe = Pipeline()
pipe.add_component(
    "query_embedder",
    SentenceTransformersTextEmbedder(model=MODEL_EMBEDDING_NAME, progress_bar=False),
)
pipe.add_component("retriever", Neo4jEmbeddingRetriever(document_store=document_store))
pipe.add_component("prompt_builder", PromptBuilder(template=prompt_template))
pipe.add_component(
    "llm",
    HuggingFaceTGIGenerator(model="mistralai/Mistral-7B-v0.1", token=HF_TOKEN),
)
pipe.add_component("answer_builder", AnswerBuilder())

pipe.connect("query_embedder", "retriever.query_embedding")
pipe.connect("retriever.documents", "prompt_builder.documents")
pipe.connect("prompt_builder.prompt", "llm.prompt")
pipe.connect("llm.replies", "answer_builder.replies")
pipe.connect("llm.meta", "answer_builder.meta")
pipe.connect("retriever", "answer_builder.documents")

# Ask a question on the data you just added.
question = "Who created the Dothraki vocabulary?"
result = pipe.run(
    {
        "query_embedder": {"text": question},
        "retriever": {"top_k": 3},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)

# For details, like which documents were used to generate the answer, look into the GeneratedAnswer object
answer: GeneratedAnswer = result["answer_builder"]["answers"][0]

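# Build the list of sources first: backslashes inside f-string expressions fail before Python 3.12
sources = "\n".join(doc.meta["file_path"] for doc in answer.documents)

print(f"""
Query: {answer.query}
Answer: {answer.data}
== Sources:
{sources}
""")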

### RAG on Existing Graph

*Prerequisites*

*To test the pipeline you need a "Movie Graph" database with data and embeddings. We can [import an existing database](https://neo4j.com/docs/aura/auradb/importing/import-database/) from [github repo](https://github.com/prosto/neo4j-haystack-playground/blob/main/data/movie-graph-with-embeddings-384.dump).*

*`movie-graph-with-embeddings-384.dump` comes with embeddings for the `Movie:tagline` text field and `movie-embeddings` vector index. The `sentence-transformers/all-MiniLM-L6-v2` model was used to generate embeddings with dimension `384`*

*Before importing you might need to [clean up](https://neo4j.com/docs/aura/auradb/managing-databases/database-actions/#_resetting_an_instance) AuraDB Free instance to have a pristine setup for the next exercise.*

In certain scenarios you might have an existing graph in a Neo4j database which was created by custom scripts or data ingestion pipelines. The schema of the graph could be complex and might not fit the Haystack Document model exactly. Moreover, in many situations you might want to leverage existing graph data to extract more context for grounding LLMs. To make this possible with Haystack, the `neo4j-haystack` package provides the [Neo4jDynamicDocumentRetriever](https://prosto.github.io/neo4j-haystack/reference/neo4j_retriever/#neo4j_haystack.components.neo4j_retriever.Neo4jDynamicDocumentRetriever) component: a flexible retriever which can run an arbitrary Cypher query to obtain documents. This component does not require a Document Store to operate.

We will use the "Movie Graph" we created before to find best matching tagline for a movie.

Below is the schema of the Movie Graph to help us understand how to query Neo4j:

<table width="100%"><tr>
<td bgcolor="5b6663" align="center" valign="center">
<img width="23%" src="https://raw.githubusercontent.com/prosto/neo4j-haystack-playground/2e41a0f8b247eed21d69a28e9ebcd1261c854c6d/images/movie-graph-schema.svg"/>
</td>
</tr></table>

To better understand what data is available in each Graph node, let's look at the table below:

```
╒═══════════╤══════════╤════════════╤══════════════╤═════════╕
│nodeType   │nodeLabels│propertyName│propertyTypes │mandatory│
╞═══════════╪══════════╪════════════╪══════════════╪═════════╡
│":`Person`"│["Person"]│"born"      │["Long"]      │false    │
├───────────┼──────────┼────────────┼──────────────┼─────────┤
│":`Person`"│["Person"]│"name"      │["String"]    │true     │
├───────────┼──────────┼────────────┼──────────────┼─────────┤
│":`Movie`" │["Movie"] │"tagline"   │["String"]    │false    │
├───────────┼──────────┼────────────┼──────────────┼─────────┤
│":`Movie`" │["Movie"] │"title"     │["String"]    │true     │
├───────────┼──────────┼────────────┼──────────────┼─────────┤
│":`Movie`" │["Movie"] │"released"  │["Long"]      │true     │
├───────────┼──────────┼────────────┼──────────────┼─────────┤
│":`Movie`" │["Movie"] │"embedding" │["FloatArray"]│true     │
└───────────┴──────────┴────────────┴──────────────┴─────────┘
```

> **Note** In our case `embedding` field stores the embedding vector for the `tagline` field

Our pipeline will:

* Create `Neo4jDynamicDocumentRetriever` with the required credentials and a Cypher query to find matching taglines and the respective movies. We will also collect related actors and directors for the movies found. We could use all this information in the LLM prompt
* Embed the tagline guess using `SentenceTransformersTextEmbedder`. The idea is to search for something that comes to mind and "resonates" with some movie.
* Pass the query embedding to `Neo4jDynamicDocumentRetriever`, which will obtain movies with additional data directly from Neo4j (we do not use `Neo4jDocumentStore` here)
* Construct a prompt with `PromptBuilder` instructing the LLM to pick the best matching tagline and compose a letter to the movie director explaining the tagline.
* Ask the "Mixtral-8x7B-Instruct-v0.1" LLM to generate an email letter based on the prompt from the builder.
* Compose answer data with `AnswerBuilder` component and display results with respective movies found in Neo4j
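
The Cypher query returns one map per movie, and the retriever turns each map into a Document. The exact mapping rules are not shown in this notebook; the sketch below assumes that `content` and `score` become Document fields while every other key ends up in `meta`, which is consistent with how the results are accessed later (`doc.meta['title']`, `movie_doc.score`). `SimpleDocument`, `record_to_document` and the sample values are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SimpleDocument:
    """Simplified stand-in for a Haystack Document (illustration only)."""
    content: Optional[str] = None
    score: Optional[float] = None
    meta: Dict[str, Any] = field(default_factory=dict)


def record_to_document(record: Dict[str, Any]) -> SimpleDocument:
    # Assumed rule: known Document fields stay at the top level, everything else becomes metadata
    known = {"content", "score"}
    meta = {key: value for key, value in record.items() if key not in known}
    return SimpleDocument(content=record.get("content"), score=record.get("score"), meta=meta)


movie_record = {
    "title": "Example Movie",  # made-up values for illustration
    "tagline": "An example tagline",
    "content": "An example tagline",
    "score": 0.87,
    "actors": ["Actor A"],
    "directors": ["Director B"],
}
movie_doc = record_to_document(movie_record)
```

This is why the Cypher query aliases `movie.tagline` as `content`: the text that should be treated as the document body has to arrive under that key.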


In [None]:
from haystack import GeneratedAnswer, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import HuggingFaceTGIGenerator
from haystack.utils import Secret

from neo4j_haystack import Neo4jClientConfig, Neo4jDynamicDocumentRetriever

HF_TOKEN = Secret.from_token(HF_API_TOKEN)

client_config = Neo4jClientConfig(
    url=GRAPH_DB_URI,
    username="neo4j",
    password=GRAPH_DB_SECRET,
    database=GRAPH_DB_NAME,
)

rag_cypher_query = """
  CALL db.index.vector.queryNodes($index, $top_k, $query_embedding)
  YIELD node as movie, score
  MATCH (movie)
  WITH movie, score
  MATCH (actor:Person)-[:ACTED_IN]->(movie), (director:Person)-[:DIRECTED]->(movie)
  WITH movie, score, COLLECT(distinct actor.name) AS actors, COLLECT(distinct director.name) AS directors
  RETURN movie{.*, content: movie.tagline, score, actors, directors}
  ORDER BY score DESC LIMIT $top_k
"""

prompt_template = """
Given the list of Movies with their title, tagline and movie directors, pick
one of the taglines which matches the given guess and write a short email letter to the movie
director of the matched tagline explaining the meaning of the tagline.
The letter should be concise and have no more than 3 sentences.
Sign the letter with the name: "{{letter_from}}".

\nMovies:
{% for doc in documents %}
  - Title: {{ doc.meta['title'] }}, Tagline: {{ doc.meta['tagline'] }}, Directors: {{ doc.meta['directors'] }}
{% endfor %}

\nTagline Guess: {{tagline_guess}}
\nLetter to the director:
"""
pipe = Pipeline()
pipe.add_component(
    "query_embedder",
    SentenceTransformersTextEmbedder(model=MODEL_EMBEDDING_NAME, progress_bar=False),
)
pipe.add_component(
    "retriever",
    Neo4jDynamicDocumentRetriever(
        client_config=client_config,
        runtime_parameters=["query_embedding"],
        doc_node_name="movie",
        verify_connectivity=True,
    ),
)
pipe.add_component("prompt_builder", PromptBuilder(template=prompt_template))
pipe.add_component(
    "llm",
    HuggingFaceTGIGenerator(model="mistralai/Mixtral-8x7B-Instruct-v0.1",
                            token=HF_TOKEN,
                            generation_kwargs={ "max_new_tokens": 500 }),
)
pipe.add_component("answer_builder", AnswerBuilder())

pipe.connect("query_embedder", "retriever.query_embedding")
pipe.connect("retriever.documents", "prompt_builder.documents")
pipe.connect("prompt_builder.prompt", "llm.prompt")
pipe.connect("llm.replies", "answer_builder.replies")
pipe.connect("llm.meta", "answer_builder.meta")
pipe.connect("retriever", "answer_builder.documents")

# Ask a question on the data you just added.
tagline_guess = "Have we ever met before?"
result = pipe.run(
    {
        "query_embedder": {"text": tagline_guess},
        "retriever": {
            "query": rag_cypher_query,
            "parameters": {"index": "movie-embeddings", "top_k": 5},
        },
        "prompt_builder": {"tagline_guess": tagline_guess, "letter_from": "Sergey"},
        "answer_builder": {"query": tagline_guess},
    }
)

answer: GeneratedAnswer = result["answer_builder"]["answers"][0]

def movie_sources():
  sources = []
  for movie_doc in answer.documents:
    movie_info = (
        f"Score: {movie_doc.score}, "
        f"Movie Title: {movie_doc.meta['title']}, "
        f"Movie Tagline: {movie_doc.meta['tagline']}, "
        f"Directors: {str(movie_doc.meta['directors'])}"
    )
    sources.append(movie_info)
  return sources

print(answer.data)
print("============")
print("\n".join(movie_sources()))

### RAG and Parent Document

*Prerequisites*

*To test the pipeline you will need a "Dune Graph" database with data and embeddings. [Import an existing database](https://neo4j.com/docs/aura/auradb/importing/import-database/) from [github repo](https://github.com/prosto/neo4j-haystack-playground/blob/main/data/dune-parent-child.dump).*

*`dune-parent-child.dump` comes with `Document` and `Chunk` nodes and `chunk-embeddings` vector index. The `sentence-transformers/all-MiniLM-L6-v2` model was used to generate embeddings with dimension `384`*

*Before importing you might need to [clean up](https://neo4j.com/docs/aura/auradb/managing-databases/database-actions/#_resetting_an_instance) AuraDB Free instance to have a pristine setup for the next exercise.*

You have options to enhance your RAG pipeline with data having various schemas, for example by first finding nodes using vector search and then expanding query to search for nearby nodes using appropriate Cypher syntax. It is possible to implement "Parent-Child" chunking strategy with such approach. Before that you have to ingest/index data into Neo4j accordingly by building an indexing pipeline or a custom ingestion script. A simple schema is shown below:

```
┌────────────┐                ┌─────────────┐
│   Chunk    │                │   Document  │
│            │  :HAS_PARENT   │             │
│   content  ├────────────────┤   content   │
│  embedding │                │             │
└────────────┘                └─────────────┘
```

The `dune-parent-child.dump` was created by running a script which chunks the [dune.txt](https://github.com/prosto/neo4j-haystack-playground/blob/main/data/dune.txt) text according to the following rules:

* The whole text is chunked with the `split_by="word", split_length=512, split_overlap=30` settings. A `Document` node with a `content` property is created in Neo4j for each resulting piece
* Each `Document` is further split with the `split_by="word", split_length=100, split_overlap=24` settings, resulting in `Chunk` nodes
* `chunk-embeddings` vector index is created based on `content` of `Chunk` nodes using embeddings generated by `sentence-transformers/all-MiniLM-L6-v2` model
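
The two-level chunking described above can be sketched in plain Python. The original ingestion script is not shown here, so the helpers `split_words` and `parent_child_nodes`, as well as the id scheme, are assumptions; the embedding step and the actual writes to Neo4j are left out.

```python
from typing import Dict, List


def split_words(text: str, split_length: int, split_overlap: int) -> List[str]:
    """Hypothetical helper: sliding window over whitespace-separated words."""
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def parent_child_nodes(text: str) -> Dict[str, List[dict]]:
    documents, chunks = [], []
    # First level: large parent Documents
    for doc_index, doc_text in enumerate(split_words(text, split_length=512, split_overlap=30)):
        doc_id = f"doc-{doc_index}"  # made-up id scheme
        documents.append({"id": doc_id, "content": doc_text})
        # Second level: small Chunks that remember their parent via source_id
        for chunk_index, chunk_text in enumerate(split_words(doc_text, split_length=100, split_overlap=24)):
            chunks.append({"id": f"{doc_id}-chunk-{chunk_index}", "source_id": doc_id, "content": chunk_text})
    return {"documents": documents, "chunks": chunks}


sample_text = " ".join(f"w{i}" for i in range(1000))  # synthetic 1000-word text
nodes = parent_child_nodes(sample_text)
```

Each chunk dictionary carries the `source_id` of its parent, which is the information needed to create the `:HAS_PARENT` relationship when the nodes are written to the graph.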

The image below depicts the simple `:HAS_PARENT` relationship between `Document` nodes (green, in the center) and `Chunk` nodes:

<table width="100%"><tr>
<td bgcolor="5b6663" align="center" valign="center">
<img width="50%" src="https://raw.githubusercontent.com/prosto/neo4j-haystack-playground/68aeffa211455278b31bec8c5b81601b431c2a16/images/parent-child-graph.svg"/>
</td>
</tr></table>

To complete the picture, let's see which properties are stored in each node:

```
╒═════════════╤════════════╤════════════╤══════════════╤═════════╕
│nodeType     │nodeLabels  │propertyName│propertyTypes │mandatory│
╞═════════════╪════════════╪════════════╪══════════════╪═════════╡
│":`Chunk`"   │["Chunk"]   │"id"        │["String"]    │true     │
├─────────────┼────────────┼────────────┼──────────────┼─────────┤
│":`Chunk`"   │["Chunk"]   │"file_path" │["String"]    │true     │
├─────────────┼────────────┼────────────┼──────────────┼─────────┤
│":`Chunk`"   │["Chunk"]   │"source_id" │["String"]    │true     │
├─────────────┼────────────┼────────────┼──────────────┼─────────┤
│":`Chunk`"   │["Chunk"]   │"content"   │["String"]    │true     │
├─────────────┼────────────┼────────────┼──────────────┼─────────┤
│":`Chunk`"   │["Chunk"]   │"embedding" │["FloatArray"]│true     │
├─────────────┼────────────┼────────────┼──────────────┼─────────┤
│":`Document`"│["Document"]│"id"        │["String"]    │true     │
├─────────────┼────────────┼────────────┼──────────────┼─────────┤
│":`Document`"│["Document"]│"file_path" │["String"]    │true     │
├─────────────┼────────────┼────────────┼──────────────┼─────────┤
│":`Document`"│["Document"]│"source_id" │["String"]    │true     │
├─────────────┼────────────┼────────────┼──────────────┼─────────┤
│":`Document`"│["Document"]│"content"   │["String"]    │true     │
└─────────────┴────────────┴────────────┴──────────────┴─────────┘
```

> **Note** `Chunk` nodes have a property called `source_id`, which is the `id` of the parent `Document`

Our pipeline will:

* Create `Neo4jDynamicDocumentRetriever` with the required credentials and a Cypher query to find Chunks and then their respective parent Documents. The maximum score is calculated from the list of Chunks which belong to the same Document
* Embed the question using `SentenceTransformersTextEmbedder`. The idea is to search for parent Documents which provide broader context for the LLM to answer the question
* Pass the query embedding to `Neo4jDynamicDocumentRetriever`, which will obtain parent documents
* Construct a prompt with `PromptBuilder` instructing the LLM to answer the question based on the parent Documents
* Ask the "Mixtral-8x7B-Instruct-v0.1" LLM to generate the answer
* Compose answer data with `AnswerBuilder` component and display the answer

> **Important** Please pay attention to `cypher_query`, as it contains the Cypher query responsible for finding parent Documents and thus does all the work. The rest of the pipeline is typical RAG.
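
Several retrieved Chunks can point to the same parent, so the query has to collapse them into one result per Document. The sketch below mirrors that grouping step in plain Python, keeping the maximum chunk score per parent. The function name and the sample scores are made up.

```python
from typing import Dict, List, Tuple


def best_score_per_parent(hits: List[Tuple[str, float]]) -> Dict[str, float]:
    """hits: (parent_id, chunk_score) pairs, one per retrieved chunk."""
    best: Dict[str, float] = {}
    for parent_id, score in hits:
        # Keep only the highest chunk score seen so far for this parent
        best[parent_id] = max(score, best.get(parent_id, float("-inf")))
    return best


chunk_hits = [("doc-1", 0.91), ("doc-2", 0.85), ("doc-1", 0.78), ("doc-2", 0.88), ("doc-3", 0.80)]
parent_scores = best_score_per_parent(chunk_hits)
```

Five chunk hits collapse into three parents, so `top_k` chunks may yield fewer than `top_k` Documents in the prompt.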

In [None]:
from haystack import GeneratedAnswer, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import HuggingFaceTGIGenerator
from haystack.utils.auth import Secret

from neo4j_haystack import Neo4jClientConfig, Neo4jDynamicDocumentRetriever

HF_TOKEN = Secret.from_token(HF_API_TOKEN)

client_config = Neo4jClientConfig(
    url=GRAPH_DB_URI,
    username="neo4j",
    password=GRAPH_DB_SECRET,
    database=GRAPH_DB_NAME,
)

cypher_query = """
            // Query Child documents by $query_embedding
            CALL db.index.vector.queryNodes($index, $top_k, $query_embedding)
            YIELD node as child_doc, score

            // Find Parent document for previously retrieved child (e.g. extend RAG context)
            MATCH (child_doc)-[:HAS_PARENT]->(parent:Document)
            WITH parent, max(score) AS score // deduplicate parents
            RETURN parent{.*, score}
        """

prompt_template = """
Given these documents, answer the question.\nDocuments:
{% for doc in documents %}
    {{ doc.content }}
{% endfor %}

\nQuestion: {{question}}
\nAnswer:
"""
rag_pipeline = Pipeline()
rag_pipeline.add_component(
    "query_embedder",
    SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2", progress_bar=False),
)
rag_pipeline.add_component(
    "retriever",
    Neo4jDynamicDocumentRetriever(
        client_config=client_config,
        runtime_parameters=["query_embedding"],
        doc_node_name="parent",
        verify_connectivity=True,
    ),
)
rag_pipeline.add_component("prompt_builder", PromptBuilder(template=prompt_template))
rag_pipeline.add_component(
    "llm",
    HuggingFaceTGIGenerator(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        token=HF_TOKEN,
        generation_kwargs={"max_new_tokens": 120, "stop_sequences": ["."]},
    ),
)
rag_pipeline.add_component("answer_builder", AnswerBuilder())

rag_pipeline.connect("query_embedder", "retriever.query_embedding")
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "llm.prompt")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

question = "Why did the author suppress technology in the Dune universe?"
result = rag_pipeline.run(
    {
        "query_embedder": {"text": question},
        "retriever": {
            "query": cypher_query,
            "parameters": {"index": "chunk-embeddings", "top_k": 5},
        },
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)

answer: GeneratedAnswer = result["answer_builder"]["answers"][0]

print("Query: ", answer.query)
print("Answer: ", answer.data)
print("== Sources:")
for doc in answer.documents:
    print("-> ", doc)
