# Generating Graph Data from Gemini 

In this notebook we will use Google Gemini Pro and LangChain to:
* Generate topics about Neo4j, Graph Data Science, and Graph Theory  
* Generate question and answer pairs about these topics
* Expand the question and answer pairs to produce additional questions that could have resulted in each answer (response)

This workflow will enable us to generate a large number of questions and answers that can be used to train a model to generate prompts from answers. The data can also be used as an evaluation data set for RAG applications. 


## Load Libraries

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

In [2]:
from IPython.display import display, Markdown

def print_md(string):
    display(Markdown(string))


In [3]:
# to read .env file 
# !pip install python-dotenv

import os
from dotenv import load_dotenv



In [5]:
# !pip install --upgrade --quiet langchain 
import langchain

# !pip install --upgrade --quiet  langchain-google-genai
import langchain_google_genai
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import PromptTemplate

# output parsers: plain string and JSON (validated against a pydantic model)
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser
from langchain_core.pydantic_v1 import BaseModel, Field
from typing import List


In [4]:
import warnings
warnings.filterwarnings('ignore')

## Instantiate LangChain for Gemini 
We will use LangChain's [Google Generative AI integration](https://python.langchain.com/docs/integrations/chat/google_generative_ai/) so that we can leverage prompt templates and easily swap out models if necessary. 

In [6]:
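# read API key from local .env 
load_dotenv()
# fall back to GEMINI_API_KEY when GOOGLE_API_KEY is not already set
if "GOOGLE_API_KEY" not in os.environ:
    gemini_key = os.getenv("GEMINI_API_KEY")
    if gemini_key is None:
        raise RuntimeError("Set GOOGLE_API_KEY or GEMINI_API_KEY in the environment or .env file")
    os.environ["GOOGLE_API_KEY"] = gemini_key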

In [7]:
# instantiate Gemini via LangChain 
llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro-latest", convert_system_message_to_human=True)

# run test query to confirm setup 
prompt = '''What is Jupiter's largest moon? How does its size compare to the size of Earth?'''
result = llm.invoke(prompt)
print_md(result.content)

Jupiter's largest moon is **Ganymede**. It's not only the largest moon of Jupiter but also the largest moon in our entire solar system! 

Here's how it compares to Earth:

*   **Diameter:** Ganymede has a diameter of about 5,268 kilometers (3,273 miles), making it roughly 26% larger than the Earth's Moon and even bigger than the planet Mercury! 
*   **Mass:** Despite its size, Ganymede is less dense than Earth, resulting in it having only about half the mass of our Moon. 


LangChain's [ChatPromptTemplate](https://python.langchain.com/docs/modules/model_io/prompts/quick_start/#chatprompttemplate) allows us to provide general instructions to the LLM which can then be augmented by a user message. 

In [8]:
# instantiate the Chat Template 
from langchain_core.prompts import ChatPromptTemplate

# create the prompt template with Neo4j expertise 
neo4j_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful expert skilled and knowledgeable about Neo4j databases, Graph Data Science, knowledge graphs, and graph theory."), 
    ("user", "{input}")
])

## Generate Question Topics
Next we will ask the LLM to generate topics about Neo4j, Graph Data Science, and Graph Theory. These topics will be the categories we use next to generate question and answer pairs. 

While we could specify these ourselves, the LLM may recommend additional topics that we may not have considered. As with most LLM applications, a combination of human and LLM is often ideal.

In [15]:
# number of topics as a variable 
num_topics = 5

# create template 
topic_prompt_template = PromptTemplate.from_template("I want you to help me generate {num_topics} concise topics about Neo4j, Graph Data Science, knowledge graphs, and graph theory. These topics will represent categories of question and answer pairs that I use to evaluate an AI system. These should represent the most important concepts developers new to Neo4j need to know about. Return only the recommended topics and no other information. Return the topics as concise strings in an un-numbered comma separated list that I can extract into a Python list")

topic_prompt_template.invoke({'num_topics': num_topics} )

StringPromptValue(text='I want you to help me generate 5 concise topics about Neo4j, Graph Data Science, knowledge graphs, and graph theory. These topics will represent categories of question and answer pairs that I use to evaluate an AI system. These should represent the most important concepts developers new to Neo4j need to know about. Return only the recommended topics and no other information. Return the topics as concise strings in an un-numbered comma separated list that I can extract into a Python list')

The LangChain [StrOutputParser](https://api.python.langchain.com/en/latest/output_parsers/langchain_core.output_parsers.string.StrOutputParser.html) provides us with a string that we can easily parse into a Python List.

In [None]:
# parser to convert output to string 
output_parser = StrOutputParser()

We combine the Neo4j prompt, the LLM, and the output parser into a single chain using the LangChain Expression Language (LCEL) and pass it the topic-generation prompt.

In [16]:


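# chain the system prompt, model, and string parser
chain = neo4j_prompt | llm | output_parser

# pass the formatted prompt text (not the template object) as the user input
topic_response = chain.invoke({"input": topic_prompt_template.format(num_topics=num_topics)})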



Once we have the results, we parse them into a Python list to use going forward. 

In [18]:
# convert response into a python list
neo4j_topic_list = topic_response.split(',')

# trim whitespace 
neo4j_topic_list = sorted([topic.strip() for topic in neo4j_topic_list])
neo4j_topic_list

['Centrality and community detection algorithms',
 'Cypher query language',
 'Graph data modeling',
 'Graph traversal algorithms',
 'Knowledge graph applications']

Overall, these are good topics to start with. However, if we were building a larger testing data set we would likely want 10 or 20 total topics. We could also manually add topics in case the LLM misses any we think are important. Again, the combination of human and LLM is often ideal.
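As a minimal sketch of combining human and LLM topics, the helper below merges a generated list with a hand-written one, dropping blanks and case-insensitive duplicates. The function name `merge_topics` and the manual topics are hypothetical and are not part of the notebook's output.

```python
from typing import Iterable, List


def merge_topics(llm_topics: Iterable[str], manual_topics: Iterable[str]) -> List[str]:
    """Combine two topic lists, dropping blanks and case-insensitive duplicates."""
    seen = set()
    merged = []
    for topic in list(llm_topics) + list(manual_topics):
        cleaned = topic.strip()
        key = cleaned.lower()
        # keep the first spelling we encounter for each topic
        if cleaned and key not in seen:
            seen.add(key)
            merged.append(cleaned)
    return sorted(merged)


# hypothetical manual additions, chosen only for illustration
manual_topics = ["Graph embeddings", "cypher query language"]
combined = merge_topics(["Cypher query language", " Graph data modeling "], manual_topics)
print(combined)
```

Because the LLM's topics are listed first, its spelling wins when a manual topic differs only by case.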

## Generate Questions from Prompts
Once we have categories, we will use them to generate question and answer pairs. This step uses a different prompt template so that the pairs are returned as JSON. 

In [20]:
# Define the question and answer data structure
class QuestionAnswer(BaseModel):
    question: str = Field(description="question")
    answer: str = Field(description="answer")


# instantiate the JSON Output Parser which leverages pydantic 
parser = JsonOutputParser(pydantic_object=QuestionAnswer)


In the following process we will just use the first topic from the generated list. In future iterations we will automate the entire process so that we generate Q&A pairs for all topics.

In [21]:
# select the first topic from the list 
neo4j_topic = neo4j_topic_list[0]
neo4j_topic

'Centrality and community detection algorithms'

In [22]:
# specify the number of Q&A pairs to generate 
n_qa_pairs = 5

# define the prompt 
qa_query = f"I want you to generate {n_qa_pairs} question and answer pairs about the following topic as it relates to Neo4j, graph databases, and graph data science. The topic is: {neo4j_topic}. The question should be a single sentence. The answer should be no more than two sentences. Each question and answer pair must be unique. Each question and answer pair should be focused on general concepts about the topic and be appropriate for a developer new to Neo4j."

print(qa_query)

I want you to generate 5 question and answer pairs about the following topic as it relates to Neo4j, graph databases, and graph data science. The topic is: Centrality and community detection algorithms. The question should be a single sentence. The answer should be no more than two sentences. Each question and answer pair must be unique. Each question and answer pair should be focused on general concepts about the topic and be appropriate for a developer new to Neo4j.


In [23]:
# combine elements of the prompt 
qa_prompt = PromptTemplate(
    template="Answer the query according to the format instructions.\n{qa_query}\n{format_instructions}",
    input_variables=["qa_query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# combine the prompt, llm, and parser into a chain 
qa_gen_chain = qa_prompt | llm | parser


In [24]:
# invoke the chain 
qa_response = qa_gen_chain.invoke({"qa_query": qa_query})

# view the Q&A pairs in the output 
qa_response

[{'question': 'What is a centrality algorithm and how does it work in Neo4j?',
  'answer': 'Centrality algorithms identify the most important or influential nodes in a graph. Neo4j provides algorithms like PageRank and Degree Centrality to measure node importance based on connections and relationships.'},
 {'question': 'How is community detection used in graph analysis with Neo4j?',
  'answer': 'Community detection algorithms identify clusters or groups of nodes that are densely connected within the graph. Neo4j offers algorithms like Louvain Modularity and Label Propagation for discovering communities and understanding their structure.'},
 {'question': 'What are some practical applications of centrality algorithms in Neo4j?',
  'answer': 'Centrality algorithms can be used for various tasks, such as identifying key influencers in a social network, finding critical infrastructure components in a network, or detecting fraudulent activities in financial transactions.'},
 {'question': 'How

## Convert Q&A Pairs to DataFrame 
For ease of reading and manipulation, we will convert the JSON output to a pandas DataFrame. This will also easily enable us to aggregate the Q&A pairs and additional prompts, then output the entire data set to a CSV file.

In [25]:
# convert pairs to a dataframe 
llm_pairs = pd.DataFrame(qa_response)

# add the category
llm_pairs['category'] = neo4j_topic
llm_pairs.shape

(5, 3)
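
Because the parser returns whatever JSON the model produced, a light validation step before building the DataFrame may help catch malformed pairs. The following is a sketch under that assumption; `validate_qa_pairs` is a hypothetical helper, not a LangChain function.

```python
from typing import Any, Dict, List


def validate_qa_pairs(pairs: Any) -> List[Dict[str, str]]:
    """Keep only dicts that have non-empty 'question' and 'answer' strings."""
    if isinstance(pairs, dict):
        # a single pair may come back as one object rather than a list
        pairs = [pairs]
    if not isinstance(pairs, list):
        return []
    valid = []
    for item in pairs:
        if not isinstance(item, dict):
            continue
        question = item.get("question")
        answer = item.get("answer")
        if (
            isinstance(question, str)
            and isinstance(answer, str)
            and question.strip()
            and answer.strip()
        ):
            valid.append({"question": question.strip(), "answer": answer.strip()})
    return valid
```

The cleaned list could then be passed to `pd.DataFrame(...)` in place of the raw response.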

In [26]:
# view the dataframe 
llm_pairs

| | question | answer | category |
|---|---|---|---|
| 0 | What is a centrality algorithm and how does it work in Neo4j? | Centrality algorithms identify the most important or influential nodes in a graph. Neo4j provides algorithms like PageRank and Degree Centrality to measure node importance based on connections and relationships. | Centrality and community detection algorithms |
| 1 | How is community detection used in graph analysis with Neo4j? | Community detection algorithms identify clusters or groups of nodes that are densely connected within the graph. Neo4j offers algorithms like Louvain Modularity and Label Propagation for discovering communities and understanding their structure. | Centrality and community detection algorithms |
| 2 | What are some practical applications of centrality algorithms in Neo4j? | Centrality algorithms can be used for various tasks, such as identifying key influencers in a social network, finding critical infrastructure components in a network, or detecting fraudulent activities in financial transactions. | Centrality and community detection algorithms |
| 3 | How can community detection algorithms benefit recommendation systems built with Neo4j? | By identifying communities of users with similar interests or preferences, recommendation systems can provide more relevant and personalized suggestions, improving user experience and engagement. | Centrality and community detection algorithms |
| 4 | What are the differences between local and global centrality algorithms in Neo4j? | Local centrality measures the importance of a node within its immediate neighborhood, while global centrality considers the overall network structure. Neo4j offers both types of algorithms, such as Degree Centrality (local) and PageRank (global), allowing you to choose the appropriate measure based on your analysis goals. | Centrality and community detection algorithms |


## Generate Additional Prompts from Q&A Pairs
As part of the Kaggle competition, we are asked to generate additional prompts that could have resulted in the answer. Therefore, we will take one of the above responses and ask the LLM to generate five additional prompts that could have produced the response.

In [28]:
# specify the number of additional prompts to generate 
n_prompts = 5

# parse out the questions and answers as variables 
llm_response = qa_response[0]['answer']
llm_question = qa_response[0]['question']

In [54]:
print('Question:',llm_question)
print('Answer:',llm_response)

Question: What is a centrality algorithm and how does it work in Neo4j?
Answer: Centrality algorithms identify the most important or influential nodes in a graph. Neo4j provides algorithms like PageRank and Degree Centrality to measure node importance based on connections and relationships.


In [36]:
# generate a new prompt 
question_prompt = f"Your first task is to read the following response about {neo4j_topic}. The response is: '{llm_response}' \n Next, I want you to generate {n_prompts} questions that you have high confidence could, when asked, have resulted in the given response. The following is an example of a question that could have generated the response: '{llm_question}' Do not return this question, return only new questions. Each question you return must be unique. Return only the questions and no other information. Return the {n_prompts} questions as an un-numbered comma separated list I can extract into a Python list"

question_prompt

"Your first task is to read the following response about Centrality and community detection algorithms. The response is: 'Centrality algorithms identify the most important or influential nodes in a graph. Neo4j provides algorithms like PageRank and Degree Centrality to measure node importance based on connections and relationships.' \n Next, I want you to generate 5 questions that you have high confidence could, when asked, have resulted in the given response. The following is an example of a question that could have generated the response: 'What is a centrality algorithm and how does it work in Neo4j?' Do not return this question, return only new questions. Each question you return must be unique. Return only the questions and no other information. Return the 5 questions as an un-numbered comma separated list I can extract into a Python list"

In [40]:
# parser to convert output to string 
output_parser = StrOutputParser()

# chain the system prompt, model, and string parser
chain = neo4j_prompt | llm | output_parser

# invoke the chain
question_response = chain.invoke({"input": question_prompt})


In [42]:
# convert response into a python list
additional_prompts = question_response.split(',')

# strip excess whitespace 
additional_prompts = sorted([q.strip() for q in additional_prompts])
additional_prompts

['Can you explain how Neo4j helps in understanding node importance through its algorithms?',
 'How can I identify influential entities within a connected dataset using Neo4j?',
 'How does Neo4j leverage graph structure to assess the influence or centrality of nodes?',
 'What algorithms does Neo4j offer for analyzing the significance of nodes in a graph?',
 'What are some methods for determining important nodes in a graph database like Neo4j?']
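
Splitting on commas works for the output above, but a question that itself contains a comma would be cut in two. One possible alternative, sketched here with a hypothetical `split_questions` helper, is to split at question marks instead. It assumes every returned question ends with `?`; any trailing text without one is dropped.

```python
import re
from typing import List


def split_questions(raw: str) -> List[str]:
    """Split a model response into questions, using '?' as the delimiter."""
    # each chunk runs up to and including the next question mark
    chunks = re.findall(r"[^?]+\?", raw)
    # remove separators (commas, whitespace, quotes) left at the edges
    cleaned = [chunk.strip(" ,\n\t'\"") for chunk in chunks]
    return [question for question in cleaned if question]


# illustrative input, not taken from the model's output
sample = "How can I find hubs, bridges, and outliers?, What is PageRank?"
print(split_questions(sample))
```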

In [47]:
# convert new questions to a temporary dataframe 
question_df = pd.DataFrame(additional_prompts, columns=['question'])

# append the answers and categories 
question_df['answer'] = llm_response
question_df['category'] = neo4j_topic
question_df.shape

(5, 3)

In [48]:
question_df

| | question | answer | category |
|---|---|---|---|
| 0 | Can you explain how Neo4j helps in understanding node importance through its algorithms? | Centrality algorithms identify the most important or influential nodes in a graph. Neo4j provides algorithms like PageRank and Degree Centrality to measure node importance based on connections and relationships. | Centrality and community detection algorithms |
| 1 | How can I identify influential entities within a connected dataset using Neo4j? | Centrality algorithms identify the most important or influential nodes in a graph. Neo4j provides algorithms like PageRank and Degree Centrality to measure node importance based on connections and relationships. | Centrality and community detection algorithms |
| 2 | How does Neo4j leverage graph structure to assess the influence or centrality of nodes? | Centrality algorithms identify the most important or influential nodes in a graph. Neo4j provides algorithms like PageRank and Degree Centrality to measure node importance based on connections and relationships. | Centrality and community detection algorithms |
| 3 | What algorithms does Neo4j offer for analyzing the significance of nodes in a graph? | Centrality algorithms identify the most important or influential nodes in a graph. Neo4j provides algorithms like PageRank and Degree Centrality to measure node importance based on connections and relationships. | Centrality and community detection algorithms |
| 4 | What are some methods for determining important nodes in a graph database like Neo4j? | Centrality algorithms identify the most important or influential nodes in a graph. Neo4j provides algorithms like PageRank and Degree Centrality to measure node importance based on connections and relationships. | Centrality and community detection algorithms |


In [49]:
# merge question_df with llm_pairs
llm_pairs = pd.concat([llm_pairs, question_df], axis=0)
llm_pairs.shape

(10, 3)

In [52]:
llm_pairs

| | question | answer | category |
|---|---|---|---|
| 0 | What is a centrality algorithm and how does it work in Neo4j? | Centrality algorithms identify the most important or influential nodes in a graph. Neo4j provides algorithms like PageRank and Degree Centrality to measure node importance based on connections and relationships. | Centrality and community detection algorithms |
| 1 | How is community detection used in graph analysis with Neo4j? | Community detection algorithms identify clusters or groups of nodes that are densely connected within the graph. Neo4j offers algorithms like Louvain Modularity and Label Propagation for discovering communities and understanding their structure. | Centrality and community detection algorithms |
| 2 | What are some practical applications of centrality algorithms in Neo4j? | Centrality algorithms can be used for various tasks, such as identifying key influencers in a social network, finding critical infrastructure components in a network, or detecting fraudulent activities in financial transactions. | Centrality and community detection algorithms |
| 3 | How can community detection algorithms benefit recommendation systems built with Neo4j? | By identifying communities of users with similar interests or preferences, recommendation systems can provide more relevant and personalized suggestions, improving user experience and engagement. | Centrality and community detection algorithms |
| 4 | What are the differences between local and global centrality algorithms in Neo4j? | Local centrality measures the importance of a node within its immediate neighborhood, while global centrality considers the overall network structure. Neo4j offers both types of algorithms, such as Degree Centrality (local) and PageRank (global), allowing you to choose the appropriate measure based on your analysis goals. | Centrality and community detection algorithms |
| 0 | Can you explain how Neo4j helps in understanding node importance through its algorithms? | Centrality algorithms identify the most important or influential nodes in a graph. Neo4j provides algorithms like PageRank and Degree Centrality to measure node importance based on connections and relationships. | Centrality and community detection algorithms |
| 1 | How can I identify influential entities within a connected dataset using Neo4j? | Centrality algorithms identify the most important or influential nodes in a graph. Neo4j provides algorithms like PageRank and Degree Centrality to measure node importance based on connections and relationships. | Centrality and community detection algorithms |
| 2 | How does Neo4j leverage graph structure to assess the influence or centrality of nodes? | Centrality algorithms identify the most important or influential nodes in a graph. Neo4j provides algorithms like PageRank and Degree Centrality to measure node importance based on connections and relationships. | Centrality and community detection algorithms |
| 3 | What algorithms does Neo4j offer for analyzing the significance of nodes in a graph? | Centrality algorithms identify the most important or influential nodes in a graph. Neo4j provides algorithms like PageRank and Degree Centrality to measure node importance based on connections and relationships. | Centrality and community detection algorithms |
| 4 | What are some methods for determining important nodes in a graph database like Neo4j? | Centrality algorithms identify the most important or influential nodes in a graph. Neo4j provides algorithms like PageRank and Degree Centrality to measure node importance based on connections and relationships. | Centrality and community detection algorithms |


## Conclusion 
In this notebook we demonstrated how to use Google Gemini Pro and LangChain to:
* Generate topics about Neo4j, Graph Data Science, and Graph Theory
* Generate question and answer pairs about these topics
* Expand the question and answer pairs to produce additional questions that could have resulted in each answer (response)

This workflow will enable us to generate a large number of questions and answers that can be used to train a model to generate prompts from answers. The data can also be used as an evaluation data set for RAG applications.

There is more work to do, especially in automating the process to generate Q&A pairs for all topics. However, this notebook provides a good starting point for generating a large data set of questions and answers.
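
One way the automation could be structured is sketched below. The two callables are hypothetical stand-ins for the Q&A chain and the question-expansion chain used above; here they are replaced with stubs so the loop can run without calling the LLM. This is only an outline of the control flow, not a finished pipeline.

```python
from typing import Callable, Dict, Iterable, List

Pair = Dict[str, str]


def build_dataset(
    topics: Iterable[str],
    generate_pairs: Callable[[str], List[Pair]],
    generate_questions: Callable[[str, Pair], List[str]],
) -> List[Dict[str, str]]:
    """Collect original and expanded questions for every topic as flat rows."""
    rows = []
    for topic in topics:
        for pair in generate_pairs(topic):
            # the original question and answer pair
            rows.append({"question": pair["question"], "answer": pair["answer"], "category": topic})
            # additional questions that share the same answer
            for question in generate_questions(topic, pair):
                rows.append({"question": question, "answer": pair["answer"], "category": topic})
    return rows


# stub generators standing in for the LLM chains (assumptions for illustration)
def fake_pairs(topic: str) -> List[Pair]:
    return [{"question": f"What is {topic}?", "answer": f"{topic} explained."}]


def fake_questions(topic: str, pair: Pair) -> List[str]:
    return [f"Can you describe {topic}?", f"How would you define {topic}?"]


rows = build_dataset(["Cypher", "Graph data modeling"], fake_pairs, fake_questions)
print(len(rows))
```

The resulting list of dictionaries has the same three columns as `llm_pairs`, so it could be passed to `pd.DataFrame(rows)` and written to a CSV file.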