# Setup

**Connecting to a Neo4j Database and Using OpenAI's API to Generate Cypher Queries**

Here are the key steps:

- **Import Libraries**: Import necessary libraries such as pandas, neo4j, openai, dotenv, and custom utilities.
- **Load Environment Variables**: Use `load_dotenv()` to load environment variables from a `.env` file.
- **Retrieve Credentials**: Get the URI and authentication credentials for Neo4j and OpenAI from environment variables.
- **Initialize Neo4j Driver**: Initialize the Neo4j driver using the retrieved URI and authentication credentials.
- **Connect to OpenAI**: Set up the OpenAI client with the API key from the environment variables.
- **Generate Cypher Query**: Define a function `generate_cypher_query(user_query, client)` that uses OpenAI's API to generate a Cypher query based on a user query.
- **Example Usage**: Provide an example of how to use the `generate_cypher_query` function to generate a Cypher query from a user input.

The remaining cells post-process the generated query, run it against Neo4j, and format the results into a readable response.


In [1]:
import pandas as pd
from neo4j import GraphDatabase
from openai import OpenAI
from dotenv import load_dotenv
import os
from utils import *

# Load environment variables from .env file
load_dotenv()

True

In [2]:
# Get the URI and authentication credentials from environment variables
URI = os.getenv("NEO4J_URI")
AUTH = (os.getenv("NEO4J_USER"), os.getenv("NEO4J_PASSWORD"))

# Initialize the driver
driver = GraphDatabase.driver(URI, auth=AUTH)

# connect to OpenAI
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),  # This is the default and can be omitted
)
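
Because `os.getenv` returns `None` for an unset variable, a missing entry in `.env` only surfaces later as a connection or authentication error. A minimal sketch of an early check follows; the helper name `missing_env_vars` is hypothetical, and the variable names are the ones read in the cells above.

```python
import os

# Variable names taken from the cells above
REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD", "OPENAI_API_KEY")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given mapping."""
    # Hypothetical helper; defaults to the real process environment
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv()` makes a misconfigured `.env` file visible before the driver or the OpenAI client is created.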

# Define a Querying Function


In [3]:
def post_process_cypher_query(cypher_query):
    """
    Post-process the Cypher query to ensure it matches the knowledge graph schema.
    """
    # Define valid properties for the RETURN clause
    valid_properties = ["v.name", "v.label", "v.format"]

    # Extract the RETURN clause and validate it
    if "RETURN" in cypher_query:
        return_clause = cypher_query.split("RETURN")[-1].strip()
        cleaned_return = ", ".join(
            [prop.strip() for prop in return_clause.split(",") if prop.strip() in valid_properties]
        )

        # If no valid fields remain, use a default RETURN clause
        if not cleaned_return:
            cleaned_return = "v.name, v.label"

        # Reconstruct the query
        cypher_query = cypher_query.split("RETURN")[0] + " RETURN " + cleaned_return

    return cypher_query
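
The function above matches the literal, upper-case string `RETURN`, so a query that uses lower-case `return` passes through unchanged. The following is a sketch of a case-insensitive variant under the same whitelist; the name `restrict_return_clause` is hypothetical and the approach is one possible alternative, not the notebook's method. Like the original, it drops anything that is not exactly a whitelisted property, including aliases and trailing `ORDER BY` or `LIMIT` clauses.

```python
import re

VALID_PROPERTIES = ("v.name", "v.label", "v.format")
DEFAULT_RETURN = "v.name, v.label"

def restrict_return_clause(cypher_query, valid=VALID_PROPERTIES, default=DEFAULT_RETURN):
    """Keep only whitelisted properties after the last RETURN keyword."""
    # Find standalone RETURN keywords regardless of case
    matches = list(re.finditer(r"\bRETURN\b", cypher_query, flags=re.IGNORECASE))
    if not matches:
        return cypher_query

    last = matches[-1]
    head = cypher_query[:last.start()].rstrip()
    items = [item.strip() for item in cypher_query[last.end():].split(",")]
    kept = [item for item in items if item in valid]

    # Fall back to the default clause when nothing valid remains
    return f"{head} RETURN {', '.join(kept) if kept else default}"

print(restrict_return_clause("MATCH (v:Variable) return v.name, v.source"))
```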

In [4]:
# Use OpenAI's GPT (or a similar LLM) to generate the Cypher query.
def generate_cypher_query(user_query, client):
    """
    Use an LLM client to generate a Cypher query from the user query.
    """
    # Format the prompt with the user's query. parse_user_query is a prompt
    # template string, expected to come from the star import of utils.
    prompt = parse_user_query.format(user_query=user_query)

    # Call the chat completion endpoint
    chat_completion = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a Cypher query expert."},
            {"role": "user", "content": prompt}
        ],
        model="gpt-4o",
    )

    # Extract the Cypher query from the response
    try:
        # Get the raw content
        raw_content = chat_completion.choices[0].message.content.strip()

        # Extract only the Cypher query block
        if "```" in raw_content:
            query_start = raw_content.find("```") + 3
            query_end = raw_content.rfind("```")
            cypher_query = raw_content[query_start:query_end].strip()
        else:
            cypher_query = raw_content.strip()

        # Post-process the query to ensure it aligns with your schema
        cypher_query = post_process_cypher_query(cypher_query)
        return cypher_query
    except Exception as e:
        print("Error extracting Cypher query:", e)
        print("Raw response:", chat_completion)
        return None
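
The output recorded later in this notebook begins with the word `cypher`, which suggests that the fence's language tag survives the slicing above: `find` plus 3 skips the backticks but not a tag that follows them. A minimal sketch of an extraction step that also drops the tag is shown here. The helper name `extract_fenced_query` is hypothetical, and the pattern assumes the opening fence is followed by a newline.

```python
import re

FENCE = "`" * 3  # triple-backtick fence marker

# Opening fence, optional language tag, newline, then the body up to the next fence
FENCE_PATTERN = re.compile(FENCE + r"[ \t]*[\w+-]*[ \t]*\r?\n(.*?)" + FENCE, re.DOTALL)

def extract_fenced_query(raw_content):
    """Return the body of the first fenced block, or the stripped text if none."""
    match = FENCE_PATTERN.search(raw_content)
    if match:
        return match.group(1).strip()
    return raw_content.strip()

sample = "Here is the query:\n" + FENCE + "cypher\nMATCH (v:Variable)\nRETURN v.name\n" + FENCE
print(extract_fenced_query(sample))
```

The result could then be passed to `post_process_cypher_query` in place of the sliced string.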

# Execute KG Query

In [5]:
# Step 2: Execute the Query
# Run the generated Cypher query against Neo4j and collect the matching records.

def execute_cypher_query(cypher_query):
    """
    Execute the generated Cypher query in the Neo4j database.
    """
    with driver.session() as session:
        result = session.run(cypher_query)
        return [record.data() for record in result]
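
Since the query text comes from an LLM, it may be worth checking that it only reads data before it reaches `session.run`. The sketch below is a deliberately conservative keyword check, not a Cypher parser: the helper name `is_read_only` and the keyword list are assumptions, it can reject a harmless query whose string literal contains one of the keywords, and it does not inspect procedures invoked through `CALL`.

```python
import re

# Assumed list of clauses that modify the graph; extend as needed
WRITE_PATTERN = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b", flags=re.IGNORECASE
)

def is_read_only(cypher_query):
    """Return True when no write keyword appears anywhere in the query text."""
    return WRITE_PATTERN.search(cypher_query) is None

print(is_read_only("MATCH (v:Variable) RETURN v.name"))
print(is_read_only("MATCH (n) DETACH DELETE n"))
```

A caller could skip execution and return an error message whenever `is_read_only` is false.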

In [6]:
def format_results(results, user_input):
    """
    Format Neo4j results into a user-friendly response.
    """
    if not results:
        return f"I understand you want to see what is available related to '{user_input}'.\nUnfortunately, I couldn't find any relevant data in the knowledge graph."

    # Start building the response
    response = f"I understand you want to see what is available related to '{user_input}'.\n"
    response += "Here is what I found:\n\n"

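    # Iterate through the results and format them.
    # record.data() keys follow the RETURN clause (e.g. "v.name", "v.label"),
    # so fall back to those when no alias such as "variable" is present.
    for idx, record in enumerate(results, start=1):
        variable = record.get("variable", record.get("v.name", "Unknown Variable"))
        label = record.get("label", record.get("v.label", "No Description"))
        source = record.get("source", "Unknown Source")
        endpoint = record.get("endpoint", "Unknown Endpoint")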

        response += f"{idx}. {variable}, {label}, is in {source} under the {endpoint}\n"

    return response

In [11]:
def process_user_query(user_query, client):
    """
    Full workflow to process user query:
    1. Generate Cypher query using LLM.
    2. Execute the Cypher query in Neo4j.
    3. Return a formatted response.
    """
    # Step 1: Generate Cypher query
    cypher_query = generate_cypher_query(user_query, client)
    if not cypher_query:
        return "Sorry, I couldn't generate a query from your input."

    # Step 2: Execute the Cypher query
    results = execute_cypher_query(cypher_query)

    # Step 3: Format and return results
    return format_results(results, user_query)

# Tests

In [12]:
# Check the generate_cypher_query function
user_input = "What variables are available about retention?"
cypher_query = generate_cypher_query(user_input, client)
print("Cleaned Cypher Query:\n", cypher_query)

Cleaned Cypher Query:
 cypher
MATCH (v:Variable)
WHERE v.label CONTAINS "retention"
 RETURN v.name, v.label


In [15]:
user_input = "What variables are available about applicant numbers?"
response = process_user_query(user_input, client)
print(response)

I understand you want to see what is available related to 'What variables are available about applicant numbers?'.
Unfortunately, I couldn't find any relevant data in the knowledge graph.
