<a href="https://colab.research.google.com/github/aurioldegbelo/sis2025/blob/main/2025_SIS_Demo_2_GraphQA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Topics: Data Modelling and Search Models
* Langsmith (for inspection and debugging)
* Semantic model extraction (continued)
* Graph QA using GraphCypherQAChain
* Graph QA using Vector Indices

# Part 1:  Langsmith

[Documentation](https://docs.smith.langchain.com/)

[Website](https://www.langchain.com/langsmith)

In [43]:
!pip install -qU langsmith

In [44]:
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_ENDPOINT"] = "https://api.smith.langchain.com/"
os.environ["LANGSMITH_API_KEY"] = "yourAPIkey"

In [45]:
import os
import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

In [47]:
!pip install -qU langchain-openai

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-neo4j 0.5.0 requires langchain-core<0.4.0,>=0.3.39, but you have langchain-core 1.0.0 which is incompatible.
langchain 0.3.27 requires langchain-core<1.0.0,>=0.3.72, but you have langchain-core 1.0.0 which is incompatible.[0m[31m
[0m

In [49]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Please respond to the user's request only based on the given context."),
    ("user", "Question: {question}\nContext: {context}")
])
model = ChatOpenAI(model="gpt-3.5-turbo")
output_parser = StrOutputParser() # https://www.restack.io/docs/langchain-knowledge-langchain-stroutputparser-guide

chain = prompt | model | output_parser

question = "What are the place names and geopolitical entities mentioned in the context?"
context = "Germany is a country in Europe and its capital is Berlin."
chain.invoke({"question": question, "context": context})

'Place names: Germany, Europe, Berlin\nGeopolitical entities: Germany'

# Part 2: Semantic Model Extraction

In [50]:
!pip install -q langchain langchain-neo4j langchain-openai neo4j

In [51]:
from langchain_neo4j import Neo4jGraph

url = "yourUrl"
username = "neo4j"
password = "yourPassword"

graph = Neo4jGraph(
    url=url,
    username=username,
    password=password
)

In [52]:
import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

In [53]:
# From wikipedia: https://en.wikipedia.org/wiki/M%C3%BCnster
example_text = """
Münster is an independent city (Kreisfreie Stadt)
in North Rhine-Westphalia, Germany. It is in the northern part of the state and is considered to
be the cultural centre of the Westphalia region. It is also a state district capital. Münster was the
location of the Anabaptist rebellion during the Protestant Reformation and the site of the signing of the
Treaty of Westphalia ending the Thirty Years' War in 1648. Today, it is known as the bicycle capital of Germany.
 Münster gained the status of a Großstadt (major city) with more than 100,000 inhabitants in 1915.[4]
 As of 2014, there are 300,000[5] people living in the city, with about 61,500 students,[6]
 only some of whom are recorded in the official population statistics as having their primary residence in Münster.
 Münster is a part of the international Euregio region with more than 1,000,000 inhabitants (Enschede, Hengelo, Gronau, Osnabrück).
 Companies offering jobs in Münster include the Institute for Geoinformatics at the University of Münster,
 the Münster University of Applied Sciences, Reedu GmbH, con terra, the Deutsche Bank, IKEA, LIDL, REWE, ALDI and BASF Coatings.
"""

In [54]:
!pip install -qU langchain-experimental

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/209.2 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━[0m [32m102.4/209.2 kB[0m [31m4.0 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m209.2/209.2 kB[0m [31m3.5 MB/s[0m eta [36m0:00:00[0m
[?25h

In [55]:
from langchain_experimental.graph_transformers import LLMGraphTransformer

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0, model_name="gpt-4o") # https://platform.openai.com/docs/models gpt-4o

llm_transformer = LLMGraphTransformer(llm=llm) # documentation, see https://python.langchain.com/docs/how_to/graph_constructing/

In [58]:
from langchain_core.documents import Document

documents = [Document(page_content=example_text)]
graph_documents = llm_transformer.convert_to_graph_documents(documents)
print(f"Nodes:{graph_documents[0].nodes}")
print(f"Relationships:{graph_documents[0].relationships}")

Nodes:[Node(id='Münster', type='Place', properties={}), Node(id='North Rhine-Westphalia', type='Place', properties={}), Node(id='Germany', type='Place', properties={}), Node(id='Westphalia', type='Place', properties={}), Node(id='Anabaptist Rebellion', type='Event', properties={}), Node(id='Protestant Reformation', type='Event', properties={}), Node(id='Treaty Of Westphalia', type='Event', properties={}), Node(id="Thirty Years' War", type='Event', properties={}), Node(id='Euregio', type='Place', properties={}), Node(id='Institute For Geoinformatics', type='Organization', properties={}), Node(id='University Of Münster', type='Organization', properties={}), Node(id='Münster University Of Applied Sciences', type='Organization', properties={}), Node(id='Reedu Gmbh', type='Organization', properties={}), Node(id='Con Terra', type='Organization', properties={}), Node(id='Deutsche Bank', type='Organization', properties={}), Node(id='Ikea', type='Organization', properties={}), Node(id='Lidl', t

In [59]:
graph.add_graph_documents(graph_documents)

# Part 3: Graph QA using GraphCypherQAChain

In [30]:
!pip install -q langchain langchain-neo4j langchain-openai neo4j

In [70]:
from langchain_neo4j import Neo4jGraph

# alternative way of initializing the database
os.environ["NEO4J_URI"] = "yourUrl"
os.environ["NEO4J_USERNAME"] = "neo4j"
os.environ["NEO4J_PASSWORD"] = "yourPassword"

graph = Neo4jGraph()

In [61]:
import os
import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

In [72]:
from langchain_neo4j import GraphCypherQAChain, Neo4jGraph
from langchain_openai import ChatOpenAI
import os

chain = GraphCypherQAChain.from_llm(
    graph=graph,
    cypher_llm=ChatOpenAI(temperature=0, model="gpt-4o"), # gpt-4o-mini	gpt-3.5-turbo
    qa_llm=ChatOpenAI(temperature=0, model="gpt-4o"), # gpt-3.5-turbo-16k
    verbose=True,
    allow_dangerous_requests=True
)

In [75]:
question_1 = "What is the population of Hessen?"
question_2 = "What is the geometry of Rheinland-Pfalz?"
question_3 = "What are the areas of Hessen and Niedersachen. Is the area of Hessen bigger than the area of Niedersachsen"
question_4 = "Is Düsseldorf the state capital of Nordrhein-Westfalen"

chain.invoke(question_4)



[1m> Entering new GraphCypherQAChain chain...[0m
Generated Cypher:
[32;1m[1;3mcypher
MATCH (s:State {name: 'Nordrhein-Westfalen', state_capital: 'Düsseldorf'})
RETURN s IS NOT NULL AS isCapital
[0m
Full Context:
[32;1m[1;3m[{'isCapital': True}][0m

[1m> Finished chain.[0m


{'query': 'Is Düsseldorf the state capital of Nordrhein-Westfalen',
 'result': 'Yes, Düsseldorf is the state capital of Nordrhein-Westfalen.'}

# Part 4: GraphQA using Vector Indices

## Indexing

In [7]:
#!pip install langchain openai wikipedia tiktoken neo4j langchain_openai langchain_community --quiet

!pip install -q langchain_openai langchain-neo4j langchain-openai langchain_community

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/449.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m92.2/449.8 kB[0m [31m2.9 MB/s[0m eta [36m0:00:01[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━[0m [32m389.1/449.8 kB[0m [31m5.5 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m449.8/449.8 kB[0m [31m5.2 MB/s[0m eta [36m0:00:00[0m
[?25h

In [76]:
# https://neo4j.com/developer-blog/knowledge-graph-rag-application/
# https://github.com/tomasonjo/blogs/blob/master/llm/devops_rag.ipynb
from langchain_neo4j import Neo4jGraph

url = "yourUrl"
username = "neo4j"
password = "yourPassword"

In [77]:
import os
import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

In [78]:
# create the index

import os
from langchain.vectorstores.neo4j_vector import Neo4jVector
from langchain_openai import OpenAIEmbeddings

vector_index = Neo4jVector.from_existing_graph(
    OpenAIEmbeddings(),
    url=url,
    username=username,
    password=password,
    index_name='index_for_state',
    node_label="State",
    text_node_properties= ['name', 'population', 'state_capital', 'area'], #['name', 'description', 'status'], #['name', 'state_capital', 'url'],
    embedding_node_property='embedding',
)

In [79]:
# see the index just created
vector_index.query(
    """SHOW INDEXES
       YIELD name, type, labelsOrTypes, properties, options
       WHERE type = 'VECTOR'
    """
)

[{'name': 'index_for_state',
  'type': 'VECTOR',
  'labelsOrTypes': ['State'],
  'properties': ['embedding'],
  'options': {'indexProvider': 'vector-3.0',
   'indexConfig': {'vector.hnsw.m': 16,
    'vector.hnsw.ef_construction': 100,
    'vector.dimensions': 1536,
    'vector.similarity_function': 'COSINE',
    'vector.quantization.enabled': True}}}]

## Retrieval

In [80]:
question1 = "How many states in the database?"
question2 = "How many geometries in the the database?"
question3 = "What is the population of Hessen?"
question4 = "What is the area of Niedersachsen?"
question5 = "What is the capital of Rheinland-Pfalz?"
question6 = "What is the geometry of Nordrhein-Westfalen?"
question7 = "What are the geometries of Hessen and Niedersachsen?"
question8 = "What is the url of the geometry of Hessen?"

In [91]:
response = vector_index.similarity_search(question5)
response

[Document(metadata={}, page_content='\nname: Rheinland-Pfalz\npopulation: 4159150\nstate_capital: Mainz\narea: 19854.21'),
 Document(metadata={}, page_content='\nname: Nordrhein-Westfalen\npopulation: 18139116\nstate_capital: Düsseldorf\narea: 34110.26'),
 Document(metadata={}, page_content='\nname: Hessen\npopulation: 6391360\nstate_capital: Wiesbaden\narea: 21114.94'),
 Document(metadata={}, page_content='\nname: Niedersachsen\npopulation: 8140242\nstate_capital: Hannover\narea: 47709.82')]

In [86]:
response_with_score = vector_index.similarity_search_with_score(question1)
response_with_score

[(Document(metadata={}, page_content='\nname: Hessen\npopulation: 6391360\nstate_capital: Wiesbaden\narea: 21114.94'),
  0.8828887939453125),
 (Document(metadata={}, page_content='\nname: Niedersachsen\npopulation: 8140242\nstate_capital: Hannover\narea: 47709.82'),
  0.8789215087890625),
 (Document(metadata={}, page_content='\nname: Nordrhein-Westfalen\npopulation: 18139116\nstate_capital: Düsseldorf\narea: 34110.26'),
  0.8784637451171875),
 (Document(metadata={}, page_content='\nname: Rheinland-Pfalz\npopulation: 4159150\nstate_capital: Mainz\narea: 19854.21'),
  0.875335693359375)]

## Generation: Example 1

In [87]:
# using documents as context
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
prompt

ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='Answer the question based only on the following context:\n{context}\n\nQuestion: {question}\n'), additional_kwargs={})])

In [88]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

In [89]:
chain = prompt | llm

In [96]:
docs = response

print(response)
chain.invoke({"context": docs, "question": question3}).content

[Document(metadata={}, page_content='\nname: Rheinland-Pfalz\npopulation: 4159150\nstate_capital: Mainz\narea: 19854.21'), Document(metadata={}, page_content='\nname: Nordrhein-Westfalen\npopulation: 18139116\nstate_capital: Düsseldorf\narea: 34110.26'), Document(metadata={}, page_content='\nname: Hessen\npopulation: 6391360\nstate_capital: Wiesbaden\narea: 21114.94'), Document(metadata={}, page_content='\nname: Niedersachsen\npopulation: 8140242\nstate_capital: Hannover\narea: 47709.82')]


'The population of Hessen is 6,391,360.'

## Generation: Example 2

In [97]:
# Using a retriever as context
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retriever = vector_index.as_retriever() # search_kwargs={"k": 1}

graph_chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser()

graph_chain.invoke(question6)

'The geometry of Nordrhein-Westfalen is 34110.26 square kilometers.'

## Generation: Example 3

In [98]:
# Using a custom retriever as a context and post-processing of the answer
# https://python.langchain.com/docs/how_to/custom_retriever/
from typing import List
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever

class CustomRetriever(BaseRetriever):
    """ Custom retriever to return the scores of the documents as well.
        Then the scores are passed into an custom ranking function to include the spatial similarity
        between the query and the document.
    """

    vector_index: Neo4jVector

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        """Sync implementations for retriever."""

        docs, scores = zip(*self.vector_index.similarity_search_with_score(query))
        for doc, score in zip(docs, scores):
             print("***", doc)
             #new_score = updated_score(score, query, doc)
             doc.page_content = doc.page_content
             doc.metadata["score"] = score
        return docs

def update_scores(docs):

    for doc in docs:
       new_score = doc.metadata["score"] * 10
       doc.page_content = doc.page_content+ "\nScore: " + str(new_score)
       doc.metadata["score"] = new_score
    return docs

In [101]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retriever_r = CustomRetriever(vector_index=vector_index)

graph_chain = {"context": retriever_r | update_scores, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser()

print(question5)
graph_chain.invoke(question5)

What is the capital of Rheinland-Pfalz?
*** page_content='
name: Rheinland-Pfalz
population: 4159150
state_capital: Mainz
area: 19854.21'
*** page_content='
name: Nordrhein-Westfalen
population: 18139116
state_capital: Düsseldorf
area: 34110.26'
*** page_content='
name: Hessen
population: 6391360
state_capital: Wiesbaden
area: 21114.94'
*** page_content='
name: Niedersachsen
population: 8140242
state_capital: Hannover
area: 47709.82'


'The capital of Rheinland-Pfalz is Mainz.'

# Cypher Queries

We will not dive deep into the cypher syntax during the course. The following queries should be enough for the interaction with the neo4j database. You can also check the [documentation](https://neo4j.com/docs/cypher-cheat-sheet/5/aura-dbe/auradb-free), if you happen to need more.

In [None]:
# delete every node and edge
MATCH(n)
DETACH DELETE (n)

# create nodes and edges
follow the structure shown at https://github.com/aurioldegbelo/sis2025/blob/main/vector_data/data.cypher

# visualize the model of the graph database
CALL apoc.meta.graph()

# Project work

* Exercice 01: clarify what your search target is

* Exercice 02: elaborate on your data model (what are entities and relationships)

* Exercice 03: create a neo4j account and a database instance

* Exercice 04: create an example of cypher query (CREATE) for your data (just a few instances), upload it to the database to see if it works

* Exercice 05: write a script to generate a CREATE query (it converts from your original format [csv, tsv, json, ...]) to a cypher template
