<a href="https://colab.research.google.com/github/chueneelvin/Databricks/blob/main/PDF_QnA_with_langchain_Neo4j_plus_Hybrid_Search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

Neo4j Vector Index

Neo4j is an open-source graph database with integrated support for vector similarity search.

It supports:

- Approximate nearest neighbor search.
- Euclidean similarity and cosine similarity.
- Hybrid search combining vector and keyword searches.

This notebook shows how to use the Neo4j vector index (Neo4jVector).
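
As a rough illustration of the two similarity measures listed above, the sketch below computes cosine similarity and Euclidean distance for small hand-made vectors in plain Python. The vectors and function names are made up for this example; real embeddings have far more dimensions, and Neo4j evaluates these measures inside the index rather than in Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between a and b (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy vectors, assumed for illustration only.
query_vec = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]

print(cosine_similarity(query_vec, same_direction))   # 1.0
print(cosine_similarity(query_vec, orthogonal))       # 0.0
print(euclidean_distance(query_vec, same_direction))  # 1.0
```

Cosine similarity ignores vector length and only compares direction, while Euclidean distance is sensitive to both.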

# Install required libraries

In [2]:
# Pip install necessary package
%pip install --upgrade --quiet  neo4j
%pip install --upgrade --quiet  langchain-openai langchain-community
%pip install --upgrade --quiet  tiktoken
%pip install pypdf

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m294.6/294.6 kB[0m [31m4.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m50.4/50.4 kB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m52.0/52.0 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.3/2.3 MB[0m [31m29.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m41.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m396.4/396.4 kB[0m [31m21.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m290.0/290.0 kB[0m [31m16.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m365.7/365.7 kB[0m [31m18.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

# Import the required libraries

In [3]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Neo4jVector
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
import os

# Set up the environment variables

In [4]:
# Read the API key and Neo4j credentials from Colab user data (secrets)
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
os.environ["NEO4J_URI"]      = userdata.get('NEO4J_URI')
os.environ["NEO4J_USERNAME"] = userdata.get('NEO4J_USERNAME')
os.environ["NEO4J_PASSWORD"] = userdata.get('NEO4J_PASSWORD')
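
Before connecting, it can help to confirm that all four settings used in this notebook are actually present. The helper below is a minimal sketch with a hypothetical name (`missing_env_vars`); it is not part of LangChain or Colab, and it only checks for unset or empty values.

```python
import os

# The four variable names are the ones set in the cell above.
REQUIRED_VARS = ["OPENAI_API_KEY", "NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present.")
```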

# Data ingestion

In [6]:
# Load the data
loader = PyPDFLoader("/content/Farming Potatoes in South Africa_ What You Need to Know.pdf")
documents = loader.load()

# Split the data into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

In [7]:
len(docs)

20
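
To make the chunking step more concrete, here is a simplified sketch of separator-based splitting: the text is cut at a separator and the pieces are greedily packed into chunks up to a size limit. This is an illustration under assumptions (the function name `split_text` is made up, overlap is ignored, and oversized pieces are kept whole); the actual `CharacterTextSplitter` handles more cases.

```python
def split_text(text, chunk_size, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks.

    Chunks stay within chunk_size characters where possible; a single
    piece longer than chunk_size is kept whole in this simplified version.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccccc"
print(split_text(sample, chunk_size=10))  # ['aaaa\n\nbbbb', 'cccccc']
```

With `chunk_overlap=0`, as in the cell above, neighbouring chunks share no text, so a sentence that straddles a boundary is only partly visible in each chunk.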

# Initialize the embeddings model

In [8]:
embeddings = OpenAIEmbeddings()  # uses the default embedding model: text-embedding-ada-002
embeddings

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x7caa305d9780>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x7ca9fda02f80>, model='text-embedding-ada-002', dimensions=None, deployment='text-embedding-ada-002', openai_api_version='', openai_api_base=None, openai_api_type='', openai_proxy='', embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=None, disallowed_special=None, chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

# Store embeddings in the neo4j vector store

In [9]:
# The Neo4jVector module will connect to Neo4j and create a vector index if needed.

db = Neo4jVector.from_documents(
    docs,
    OpenAIEmbeddings(),
    url=os.environ["NEO4J_URI"],
    username=os.environ["NEO4J_USERNAME"],
    password=os.environ["NEO4J_PASSWORD"],
)



In [10]:
db

<langchain_community.vectorstores.neo4j_vector.Neo4jVector at 0x7caa305d9ae0>

In [11]:
query = "What are the trends in potato farming?"
docs_with_score = db.similarity_search_with_score(query, k=2)

In [21]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.846527099609375

Farming P otatoes In South Africa: What
You Need To Know
“My idea of hea ven is a gr eat big bak ed potat o and
someone t o shar e it with. ”
- Opr ah Winfr ey
What's in this guide?
1. Introduction: Farming potatoes in South Africa
2. All about seed potatoes
3. Growing potatoes in South Africa
4. Challenges of potato farming
5. Sustainable potato farming
Chapter 1
Introduction: F arming P otatoes in South Africa
In South Africa, maiz e meal and br ead ar e the most commonly consumed sour ces of
carbohy drates. Howe ver, South Africans ha ve eaten twice as many potat oes o ver the past
decade compar ed to the decade befor e it, buo yed b y a gr owing middle class.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.8433074951171875
Copyright © RegenZ (Pty) Ltd. 

# Working with existing vector store

Above, we created a vector store from scratch. Often, however, we want to work with an existing vector store. In that case, we can initialize it directly from the existing index.

In [None]:
#index_name = "vector"  # default index name
#
#store = Neo4jVector.from_existing_index(
#    OpenAIEmbeddings(),
#    url=os.environ["NEO4J_URI"],
#    username=os.environ["NEO4J_USERNAME"],
#    password=os.environ["NEO4J_PASSWORD"],
#    index_name=index_name,
#)


# Working with existing graph store

We can also initialize a vector store from an existing graph using the from_existing_graph method. This method pulls the relevant text properties from the database, calculates their embeddings, and stores the embeddings back in the database.

In [22]:
# First we create sample data in graph
#store.query(
#    "CREATE (p:Person {name: 'Tomaz', location:'Slovenia', hobby:'Bicycle', age: 33})"
#)


In [None]:
# Now we initialize from existing graph
#existing_graph = Neo4jVector.from_existing_graph(
#    embedding=OpenAIEmbeddings(),
#    url=os.environ["NEO4J_URI"],
#    username=os.environ["NEO4J_USERNAME"],
#    password=os.environ["NEO4J_PASSWORD"],
#    index_name="person_index",
#    node_label="Person",
#    text_node_properties=["name", "location"],
#    embedding_node_property="embedding",
#)
#result = existing_graph.similarity_search("Slovenia", k=1)
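
The idea behind `text_node_properties` can be sketched in plain Python: the selected properties of a node are joined into one piece of text to embed, and the remaining properties can travel along as metadata. The helper names and the exact `key: value` text layout below are assumptions for illustration, not the library's confirmed behavior.

```python
def node_to_text(properties, text_keys):
    """Join the chosen properties into one 'key: value' line each."""
    return "\n".join(
        f"{key}: {properties[key]}" for key in text_keys if key in properties
    )


def node_metadata(properties, text_keys):
    """Keep every property that was not used to build the text."""
    return {key: value for key, value in properties.items() if key not in text_keys}


# Same sample node as in the commented-out CREATE statement above.
person = {"name": "Tomaz", "location": "Slovenia", "hobby": "Bicycle", "age": 33}
text_keys = ["name", "location"]

print(node_to_text(person, text_keys))
print(node_metadata(person, text_keys))
```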

# Hybrid search (vector + keyword)
Neo4j integrates both vector and keyword indexes, which allows you to use a hybrid search approach.

In [13]:
# The Neo4jVector module will connect to Neo4j and create vector and keyword indexes if needed.
hybrid_db = Neo4jVector.from_documents(
    docs,
    OpenAIEmbeddings(),
    url=os.environ["NEO4J_URI"],
    username=os.environ["NEO4J_USERNAME"],
    password=os.environ["NEO4J_PASSWORD"],
    search_type="hybrid",
)



In [14]:
hybrid_db

<langchain_community.vectorstores.neo4j_vector.Neo4jVector at 0x7ca9fdaf9b40>
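
Conceptually, hybrid search has to merge two ranked lists whose scores are on different scales. The sketch below shows one possible fusion strategy: scale each list by its best score, then keep the highest scaled score per document. The function names, chunk IDs, and scores are invented for this example, and the strategy is an assumption rather than a statement of what Neo4jVector does internally.

```python
def normalize(hits):
    """Scale scores so the best hit in the list scores 1.0."""
    if not hits:
        return {}
    top = max(hits.values())
    if top <= 0:
        return dict(hits)
    return {doc_id: score / top for doc_id, score in hits.items()}


def hybrid_merge(vector_hits, keyword_hits, k):
    """Merge two {doc_id: score} result sets and return the top k pairs."""
    merged = {}
    for hits in (normalize(vector_hits), normalize(keyword_hits)):
        for doc_id, score in hits.items():
            merged[doc_id] = max(score, merged.get(doc_id, 0.0))
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical scores: similarity from the vector index, relevance from the keyword index.
vector_hits = {"chunk-3": 0.9, "chunk-7": 0.45}
keyword_hits = {"chunk-7": 4.0, "chunk-9": 2.0}

print(hybrid_merge(vector_hits, keyword_hits, k=2))
# [('chunk-3', 1.0), ('chunk-7', 1.0)]
```

Here `chunk-7` ranks low in the vector results but is the best keyword match, so the merged ranking promotes it alongside the best vector match.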

# Hybrid search from existing indexes
To run hybrid search against existing indexes, you have to provide the names of both the vector index and the keyword index.

In [24]:
#index_name = "vector"  # default index name
#keyword_index_name = "keyword"  # default keyword index name
#
#store = Neo4jVector.from_existing_index(
#    OpenAIEmbeddings(),
#    url=os.environ["NEO4J_URI"],
#    username=os.environ["NEO4J_USERNAME"],
#    password=os.environ["NEO4J_PASSWORD"],
#    index_name=index_name,
#    keyword_index_name=keyword_index_name,
#    search_type="hybrid",
#)


# Retriever options
This section shows how to use Neo4jVector as a retriever.

In [15]:
retriever = hybrid_db.as_retriever()
retriever.invoke(query)[0]



Document(metadata={'source': '/content/Farming Potatoes in South Africa_ What You Need to Know.pdf', 'page': 15}, page_content='The lapse of the anti-dumping duties t ook place between Januar y and Ma y 2021. During that\ntime, 11.8 million kilograms of fr ozen fries wer e impor ted t o South Africa. This is 64.71% mor e\nthan Ma y 2020, and 199.19% mor e than Ma y 2019.\xa0\nAlready , local farmers and pr oducers ha ve had t o deal with the negativ e \x00nancial impact of the\nCOVID-19 pandemic and other pr evailing socio-economic and mark et conditions. Now , further\nthreats ma y for ce some local gr owers and pr ocessors out of business.\nConsumers ha ve a r ole t o pla y\xa0\nTo combat the eff ects of agricultur al dumping, consumers need t o activ ely suppor t the local\nagricultur al sect or by reading the packaging and choosing South African potat o products.\nTher e are also ecological adv antages t o suppor ting South African spuds, including a smaller\ncarbon footprint and t

# Question Answering with Sources
This section shows how to do question answering with sources over an index. It uses RetrievalQAWithSourcesChain, which looks up the relevant documents through the retriever and returns the answer together with the sources it drew on.

In [16]:
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_openai import ChatOpenAI

In [32]:
chain = RetrievalQAWithSourcesChain.from_chain_type(
    ChatOpenAI(temperature=0), chain_type="stuff", retriever=retriever
)
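
With `chain_type="stuff"`, all retrieved chunks are placed into a single prompt together with the question. The sketch below illustrates that idea in plain Python. The `Chunk` class, the `build_stuff_prompt` helper, and the prompt wording are all assumptions made for this example; the real chain uses its own prompt template.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieved document chunk."""
    text: str
    source: str


def build_stuff_prompt(question, chunks):
    """Put every retrieved chunk, labelled with its source, into one prompt."""
    context = "\n\n".join(
        f"Content: {chunk.text}\nSource: {chunk.source}" for chunk in chunks
    )
    return f"{context}\n\nQuestion: {question}\nAnswer (cite sources):"


# Made-up chunks for illustration only.
chunks = [
    Chunk("Dumping lowers local prices.", "guide.pdf"),
    Chunk("Consumers can choose local products.", "guide.pdf"),
]
print(build_stuff_prompt("What is dumping?", chunks))
```

Because every chunk goes into one prompt, this approach is simple but limited by the model's context window, so it suits a small number of retrieved chunks.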

In [44]:
chain.invoke(
    {"question": "what is dumping and how can one avoid agricultural dumping?"},
    return_only_outputs=False,
)



{'question': 'what is dumping and how can one avoid agricultural dumping?',
 'answer': 'Agricultural dumping refers to the practice of exporting commodities at prices well below the cost of production. To avoid agricultural dumping, consumers can actively support the local agricultural sector by choosing South African potato products. Additionally, sustainable farming practices, such as composting, precision farming, and water management, can help farmers achieve profitability and longevity, ultimately reducing the reliance on toxic chemicals and improving environmental outcomes.\n',
 'sources': '/content/Farming Potatoes in South Africa_ What You Need to Know.pdf'}

In [43]:
result = chain.invoke(
    {"question": "what is dumping and how can one avoid agricultural dumping?"},
    return_only_outputs=False,
)
answer = result['answer']
sources = result['sources']

print(f"Answer: {answer}\nSources: {sources}")



Answer: Agricultural dumping refers to the practice of exporting commodities at prices well below the cost of production. To avoid agricultural dumping, consumers can actively support the local agricultural sector by choosing South African potato products. Additionally, sustainable farming practices, such as composting, precision farming, and water management, can help farmers achieve profitability and longevity, ultimately reducing the reliance on toxic chemicals and improving environmental outcomes.

Sources: /content/Farming Potatoes in South Africa_ What You Need to Know.pdf
