# Preparing Text Data for RAG

For full details, and for any changes in later LangChain releases, refer to the [Neo4j LangChain documentation](https://python.langchain.com/v0.1/docs/integrations/vectorstores/neo4jvector/).

In [1]:
# Required only for Vertex AI. This notebook does not use it, so this cell can be skipped.
# !pip install --upgrade google-cloud-aiplatform

In [2]:
import os
from dotenv import load_dotenv

from langchain_community.graphs import Neo4jGraph

# vector database integration of neo4j
from langchain_community.vectorstores import Neo4jVector
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_experimental.graph_transformers import LLMGraphTransformer

In [3]:
# load from environment
load_dotenv()

NEO4J_URI = os.getenv("NEO4J_URI")
NEO4J_USERNAME = os.getenv("NEO4J_USERNAME")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD")
NEO4J_DATABASE = os.getenv("NEO4J_DATABASE")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

GOOGLE_EMBEDDING_MODEL = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

# At the time of writing, LLMGraphTransformer only supported OpenAI and Mistral models.
# Check the LangChain documentation to see whether this has changed.
# llm_transformer = LLMGraphTransformer(llm=llm)

In [4]:
# connect the knowledge graph instance using LangChain
graph = Neo4jGraph(
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    database=NEO4J_DATABASE
)

## Create sample database


```python
from langchain_community.graphs import Neo4jGraph

graph = Neo4jGraph()

# Import movies together with their directors, actors, and genres.
# Note: this dataset does not have a movie.tagline property.
movies_query = """
LOAD CSV WITH HEADERS FROM 
'https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/movies/movies_small.csv'
AS row
MERGE (m:Movie {id:row.movieId})
SET m.released = date(row.released),
    m.title = row.title,
    m.imdbRating = toFloat(row.imdbRating)
FOREACH (director in split(row.director, '|') | 
    MERGE (p:Person {name:trim(director)})
    MERGE (p)-[:DIRECTED]->(m))
FOREACH (actor in split(row.actors, '|') | 
    MERGE (p:Person {name:trim(actor)})
    MERGE (p)-[:ACTED_IN]->(m))
FOREACH (genre in split(row.genres, '|') | 
    MERGE (g:Genre {name:trim(genre)})
    MERGE (m)-[:IN_GENRE]->(g))
"""

graph.query(movies_query)
```

In [5]:
# open sample database
with open("data/dummy/movies_cypher.txt") as fp:
    content = fp.read()

# create sample database
graph.query(content)

[]

## Create a Vector Index
- Create a vector index named `movie_tagline_embeddings` on the `taglineEmbedding` property of `Movie` nodes.
- Set the vector dimension to match the embedding model in use. This notebook uses Google's `models/embedding-001`, which returns 768-dimensional vectors. Other services differ: OpenAI embeddings, for example, have a dimension of 1536. Check the documentation of the service you use for the exact value.
- Use **cosine similarity** as the index's similarity function for comparing stored vectors with the query vector.
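
Because the index dimension has to match the embedding model, it can help to derive the value from a sample embedding rather than hard-coding it. The following is a minimal sketch, not part of the original notebook: `build_vector_index_query` is a hypothetical helper, and the zero-filled `sample_embedding` stands in for a real call such as `GOOGLE_EMBEDDING_MODEL.embed_query("probe")`.

```python
# Similarity functions accepted by this sketch (assumption: the two that
# Neo4j vector indexes document, 'cosine' and 'euclidean').
VALID_SIMILARITY = {"cosine", "euclidean"}


def build_vector_index_query(index_name, label, prop, dimensions, similarity="cosine"):
    """Return a CREATE VECTOR INDEX statement for the given settings.

    Hypothetical helper: names are interpolated directly into the Cypher
    text, so only pass trusted, hard-coded identifiers.
    """
    if dimensions <= 0:
        raise ValueError("dimensions must be a positive integer")
    if similarity not in VALID_SIMILARITY:
        raise ValueError(f"unsupported similarity function: {similarity}")
    return (
        f"CREATE VECTOR INDEX {index_name} IF NOT EXISTS\n"
        f"FOR (n:{label}) ON (n.{prop})\n"
        "OPTIONS { indexConfig: {\n"
        f"    `vector.dimensions`: {dimensions},\n"
        f"    `vector.similarity_function`: '{similarity}'\n"
        "}}"
    )


# Stand-in for a real embedding; only its length matters here.
sample_embedding = [0.0] * 768

query = build_vector_index_query(
    "movie_tagline_embeddings", "Movie", "taglineEmbedding", len(sample_embedding)
)
print(query)
```

With a live connection, the resulting string could be passed to `graph.query(query)` in place of the hand-written statement.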

In [6]:
graph.query("""
    CREATE VECTOR INDEX movie_tagline_embeddings IF NOT EXISTS
    FOR (m:Movie) ON (m.taglineEmbedding)
    OPTIONS { indexConfig: {
        `vector.dimensions`: 768, // use 1536 for OpenAI embeddings
        `vector.similarity_function`: 'cosine'
    }}"""
)

graph.query("""
    SHOW VECTOR INDEXES
    """
)

[{'id': 3,
  'name': 'movie_tagline_embeddings',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'VECTOR',
  'entityType': 'NODE',
  'labelsOrTypes': ['Movie'],
  'properties': ['taglineEmbedding'],
  'indexProvider': 'vector-2.0',
  'owningConstraint': None,
  'lastRead': neo4j.time.DateTime(2024, 8, 22, 14, 43, 50, 212000000, tzinfo=<UTC>),
  'readCount': 2}]

## Populate the vector index
- Calculate a vector representation of each movie tagline using Google Generative AI embeddings.
- Store the vector on the `Movie` node as the `taglineEmbedding` property.
- Documentation: [Neo4j](https://neo4j.com/docs/cypher-manual/current/genai-integrations/#ai-providers)
- **Note**: `genai.vector.encode` is reportedly unavailable in the Community Edition, but the error message raised there does not yet say so. See this [GitHub issue](https://github.com/neo4j/docker-neo4j/issues/489).

```python
from vertexai.language_models import TextEmbeddingModel
projectId = "gen-lang-client-0738820968"

graph.query("""
    MATCH (movie:Movie) WHERE movie.tagline IS NOT NULL
    WITH movie, genai.vector.encode(
        movie.tagline,
        "VertexAI",
        {
            token: $googleApiKey,
            projectId: $project
        }) AS vector
    CALL db.create.setNodeVectorProperty(movie, "taglineEmbedding", vector)
    """,
    params={"googleApiKey": GOOGLE_API_KEY,
           "project": projectId}
)
```
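
When `genai.vector.encode` is not available, the embeddings can instead be computed on the client and written back in batches. The sketch below is an illustration under stated assumptions, not the notebook's method: `batched`, `build_rows`, and `fake_embed` are hypothetical names, `fake_embed` stands in for a real call such as `GOOGLE_EMBEDDING_MODEL.embed_documents`, and matching movies by `title` assumes titles are unique in the sample graph.

```python
# Cypher that writes one embedding per row (assumes titles identify movies).
SET_EMBEDDING_QUERY = """
UNWIND $rows AS row
MATCH (m:Movie {title: row.title})
CALL db.create.setNodeVectorProperty(m, 'taglineEmbedding', row.embedding)
"""


def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_rows(movies, embed_documents):
    """Turn (title, tagline) pairs into parameter rows for SET_EMBEDDING_QUERY."""
    texts = [tagline for _, tagline in movies]
    vectors = embed_documents(texts)
    return [
        {"title": title, "embedding": vector}
        for (title, _), vector in zip(movies, vectors)
    ]


def fake_embed(texts):
    """Hypothetical stand-in for a real embedding model (3 dimensions only)."""
    return [[float(len(text)), 0.0, 1.0] for text in texts]


movies = [
    ("When Harry Met Sally",
     "Can two friends sleep together and still love each other in the morning?"),
    ("Cast Away", "At the edge of the world, his journey begins."),
]

for batch in batched(movies, size=1):
    rows = build_rows(batch, fake_embed)
    # With a live connection: graph.query(SET_EMBEDDING_QUERY, params={"rows": rows})
    print(rows[0]["title"], len(rows[0]["embedding"]))
```

Batching keeps each request to the embedding service and each write transaction small; the batch size of 1 here is only for demonstration.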

In [7]:
# Same as above, but using the LangChain integration
vector_index = Neo4jVector.from_existing_graph(
    GOOGLE_EMBEDDING_MODEL,
    # LangChain supports hybrid search (keyword + vector).
    # The keyword index name can be set with the keyword_index_name parameter.
    search_type="hybrid",
    node_label="Movie",
    text_node_properties=["tagline"],
    embedding_node_property="taglineEmbedding"
)

In [8]:
result = graph.query("""
    MATCH (m:Movie) 
    WHERE m.tagline IS NOT NULL
    RETURN m.tagline, m.taglineEmbedding
    LIMIT 1
    """
)

result_dict = result[0]
for key, value in result_dict.items():
    print(f"{key}: {type(value)}")

m.tagline: <class 'str'>
m.taglineEmbedding: <class 'list'>


In [9]:
print("Tagline: ", result[0]['m.tagline'])
print("tagLineEmbedding length: ", len(result[0]['m.taglineEmbedding']))
print("First ten elements of taglineEmbedding vector: ", result[0]['m.taglineEmbedding'][:10])

Tagline:  Welcome to the Real World
tagLineEmbedding length:  768
First ten elements of taglineEmbedding vector:  [0.05952933430671692, -0.01058897189795971, -0.04653780534863472, -0.0076306723058223724, 0.06363282352685928, 0.008514228276908398, 0.006364417262375355, -0.00433186674490571, -0.005778653081506491, 0.02067330852150917]


### Similarity Search
- Calculate the embedding for the question.
- Identify matching movies based on the similarity between the question embedding and the `taglineEmbedding` vectors.
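
Conceptually, the two steps above amount to scoring every stored vector against the question vector and keeping the best `k`. The following is a brute-force sketch of that idea with toy 2-dimensional vectors. The names `cosine_similarity`, `top_k`, and `toy_embeddings` are illustrative assumptions; a real vector index uses its own search strategy and score scale, so treat this only as a mental model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(question_vector, candidates, k):
    """Return the k (name, score) pairs with the highest cosine similarity."""
    scored = [
        (name, cosine_similarity(question_vector, vector))
        for name, vector in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for stored taglineEmbedding vectors.
toy_embeddings = {
    "A": [1.0, 0.0],
    "B": [0.0, 1.0],
    "C": [1.0, 1.0],
}

# The question vector points mostly along the first axis,
# so "A" ranks first and "C" second.
for name, score in top_k([1.0, 0.1], toy_embeddings, k=2):
    print(name, round(score, 3))
```
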
```python
graph.query("""
    WITH genai.vector.encode(
        $question, 
        "OpenAI", 
        {
          token: $openAiApiKey,
          endpoint: $openAiEndpoint
        }) AS question_embedding
    CALL db.index.vector.queryNodes(
        'movie_tagline_embeddings', 
        $top_k, 
        question_embedding
        ) YIELD node AS movie, score
    RETURN movie.title, movie.tagline, score
    """, 
    params={"openAiApiKey":OPENAI_API_KEY,
            "openAiEndpoint": OPENAI_ENDPOINT,
            "question": question,
            "top_k": 5
            })
```


In [10]:
question = "What movies are about Love?"

In [11]:
retrieved_data = vector_index.similarity_search(question, k=5)
retrieved_data[0]

Document(metadata={'released': 1998, 'title': 'When Harry Met Sally'}, page_content='\ntagline: Can two friends sleep together and still love each other in the morning?')

In [12]:
# Another question, this time returning relevance scores
question = "What movies are about Adventure?"
retrieved_data = vector_index.similarity_search_with_relevance_scores(question, k=5)
retrieved_data[0]

(Document(metadata={'released': 2000, 'title': 'Cast Away'}, page_content='\ntagline: At the edge of the world, his journey begins.'),
 1.0)