# Fine-Tuning Dataset Builder - Parametric Queries

**NOTES:**

To run this notebook, you need a non-empty Neo4j graph database and its credentials.

**In this notebook:**

- extract schema information from the graph,
- extract sample data from the graph (node and relationship instances),
- process the schema and the extracted data from the graph,
- generate sample queries using predefined Python functionalities,
- save the generated data to files.


## Workspace Setup

In [88]:

# Python functionalities to collect and save data samples
from utils.utilities import *
# Neo4j graph connector
from utils.neo4j_conn import *
# Functionalities to extract schema and data from the graph
from utils.neo4j_schema import *
# Functionalities to parse extracted graph data
from utils.graph_utils import *

In [89]:
# Initialize the Neo4j connector module
graph = Neo4jGraph(url='neo4j+s://46a80122.databases.neo4j.io:7687', username='neo4j', password='abcde777', database='neo4j')

# Module to extract data from the graph
gutils = Neo4jSchema(url='neo4j+s://46a80122.databases.neo4j.io:7687', username='neo4j', password='abcde777', database='neo4j')
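
The connection details above are hard-coded for illustration. A minimal sketch of reading them from environment variables instead is shown below. The helper `load_neo4j_settings` and the variable names `NEO4J_URI`, `NEO4J_USERNAME`, `NEO4J_PASSWORD`, and `NEO4J_DATABASE` are assumptions made for this example; they do not come from the notebook's utilities.

```python
import os

def load_neo4j_settings(env=None):
    """Collect connection settings from environment variables.

    The variable names used here are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    missing = [key for key in ("NEO4J_URI", "NEO4J_PASSWORD") if key not in env]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "url": env["NEO4J_URI"],
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env["NEO4J_PASSWORD"],
        "database": env.get("NEO4J_DATABASE", "neo4j"),
    }
```

The returned keys mirror the keyword arguments used in the cell above, so the result could be unpacked as `Neo4jGraph(**load_neo4j_settings())`.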

In [90]:
# How to query the graph using the 'graph' module
graph.query("MATCH (n) RETURN COUNT(n) AS TotalNodes")

[{'TotalNodes': 90}]

## Path Variables

In [91]:
# Create a path variable for the data folder
data_path = 'datas/'

# Graph schema file
schema_file = 'schema_file.json'

# Node and relationships instances
node_instances_file = 'node_instances_file.json'
rels_instances_file = 'rels_instances_file.json'

# Fine-tuning datasets
trainer_with_repeats_file = 'parametric_trainer_with_repeats.json'
trainer_without_repeats_file = 'parametric_trainer_without_repeats.json'

## Options For Building the Supervised Fine Tuning Dataset

**NOTES:**

- The node and relationship instances extracted from the graph illustrate realistic property values. The number of instances extracted depends on the graph's size and the complexity of the data.
- `ALLOW_REPEATS` controls whether the SFT messages may include repeated queries, that is, queries that differ only in the value of a node property. For example:
```
MATCH (n:Article {author: 'John Smith'}) RETURN n
MATCH (n:Article {author: 'Jane Doe'}) RETURN n
```
- In larger graphs, an individual sampler (which may include variations of the same query type, with or without repeats) can generate tens or hundreds of thousands of variations. The `M` parameter caps the number of samples taken from each sampler to keep the dataset balanced. If desired, each upper limit can be adjusted individually within the functions below.
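
The effect of an upper limit such as `M` can be illustrated with a small sketch. The helper `cap_sampler` is hypothetical and is not the notebook's `collect_samples`; it assumes the cap is applied by random sampling without replacement.

```python
import random

def cap_sampler(sampler, m, seed=0):
    """Return at most m items from sampler, chosen at random without replacement."""
    if len(sampler) <= m:
        return list(sampler)
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return rng.sample(sampler, m)

# 2000 illustrative queries that differ only in the property value
queries = [f"MATCH (n:Article {{author: 'Author {i}'}}) RETURN n" for i in range(2000)]
capped = cap_sampler(queries, 500)
print(len(capped))  # 500
```

A sampler that is already smaller than the limit is returned unchanged, so small query groups are never reduced.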


In [92]:
# Choose how many instances of each node label to extract
node_instances_size = 12

# Choose how many instances of each relationship type to extract
rels_instances_size = 12

# Choose whether to include repeats in the data builder
ALLOW_REPEATS = True

# Maximum number of samples kept from each individual sampler
M = 500

## Data Collection for Sample Building

**NOTES:**

This section requires access to a Neo4j database populated with nodes and relationships. Data relevant to the task is retrieved from the knowledge graph, processed, and then stored in files.



In [93]:
# Build the schema as a dictionary
jschema = gutils.get_structured_schema
# Check the output format
jschema.keys()

dict_keys(['node_props', 'rel_props', 'relationships'])

In [94]:
# Extract the list of nodes
nodes = get_nodes_list(jschema)
# Extract the list of relationships
relationships = jschema['relationships']

In [95]:
# Extract the node instances from the graph
node_instances = gutils.extract_node_instances(nodes, node_instances_size)

# Extract the relationship instances from the graph
rels_instances = gutils.extract_multiple_relationships_instances(relationships, rels_instances_size)

# Serialize extracted neo4j.time data - for saving to json files
nodes_instances_serialized = serialize_nodes_data(node_instances)
rels_instances_serialized = serialize_relationships_data(rels_instances)
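
Temporal values returned by the driver cannot be written to JSON directly, which is why the instances are serialized before saving. The sketch below is not the implementation of `serialize_nodes_data`; it shows one common approach, using the standard library's `date` and `datetime` as stand-ins for `neo4j.time` objects and a hypothetical `to_jsonable` helper.

```python
import json
from datetime import date, datetime

def to_jsonable(value):
    """Recursively convert temporal values to ISO 8601 strings."""
    if isinstance(value, (date, datetime)):
        return value.isoformat()
    if isinstance(value, dict):
        return {key: to_jsonable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(item) for item in value]
    return value

# Illustrative record, not taken from the graph
record = {"name": "Example", "published": date(2024, 1, 15), "tags": ["a", "b"]}
print(json.dumps(to_jsonable(record)))
# {"name": "Example", "published": "2024-01-15", "tags": ["a", "b"]}
```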

In [96]:
# Save data to json files
write_json(jschema, data_path+schema_file)
write_json(nodes_instances_serialized, data_path+node_instances_file)
write_json(rels_instances_serialized, data_path+rels_instances_file)

## Data Preparation for Sample Building



**NOTES:**

From this point forward, connection to the Neo4j knowledge graph is not required, as the necessary data has been saved to JSON files.

Some queries constructed below require specific data types. For instance, the query for "Find all the titles that contain 'approximation'!" is best suited for string properties. To select samples that meet such conditions, we build several dictionaries that organize the data according to each property's data type.

In [97]:
# Read the data from files if previously saved
jschema = read_json(data_path+schema_file)
node_instances = read_json(data_path+node_instances_file) # these are serialized, see above
rels_instances = read_json(data_path+rels_instances_file) # these are serialized, see above

In [98]:
# List of node labels
nodes = get_nodes_list(jschema)

# Read the nodes with their properties and datatypes
node_props_types = jschema['node_props']

# Read the relationship properties with their datatypes
rel_props_types = jschema['rel_props']

# Read the relationships as a list of triples
relationships = jschema['relationships']

In [99]:
# List of datatypes available as node properties in the graph
node_dtypes = retrieve_datatypes(jschema, "node")
print(f"The datatypes for node properties: {node_dtypes}")

The datatypes for node properties: ['INTEGER', 'STRING']


In [100]:
# List of datatypes available as relationship properties in the graph
rel_dtypes = retrieve_datatypes(jschema, "rel")
print(f"The datatypes for relationships properties: {rel_dtypes}")

The datatypes for relationships properties: []


**NOTES:**

- In the next cell, a dictionary is created with keys named `datatype_parsed`, where `datatype` corresponds to one of the potential data types of node properties as outlined in `node_dtypes` and exemplified in the extracted node instances.

- Additionally, a consolidated list of all data types is included under the key `dtypes_parsed`.

- The dictionary values are lists of triples in the format `[label, property, value]`.
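
To make the `[label, property, value]` format concrete, here is a minimal sketch of how instances might be flattened into triples for one data type. It is not `parse_node_instances_datatype`; the function `flatten_by_datatype` and the shapes of `schema` and `instances` are assumptions chosen for illustration.

```python
def flatten_by_datatype(schema, instances, datatype):
    """Return [label, property, value] triples for properties of one datatype."""
    triples = []
    for label, props in schema["node_props"].items():
        # Names of the properties on this label that have the requested datatype
        wanted = {p["property"] for p in props if p["datatype"] == datatype}
        for instance in instances.get(label, []):
            for prop, value in instance.items():
                if prop in wanted and value is not None:
                    triples.append([label, prop, value])
    return triples

# Assumed schema and instance layout, with made-up values
schema = {"node_props": {"Article": [
    {"property": "id", "datatype": "INTEGER"},
    {"property": "name", "datatype": "STRING"},
]}}
instances = {"Article": [{"id": 1, "name": "First"}, {"id": 2, "name": "Second"}]}

parsed = {f"{dt.lower()}_parsed": flatten_by_datatype(schema, instances, dt)
          for dt in ["INTEGER", "STRING"]}
parsed["dtypes_parsed"] = sum(parsed.values(), [])
print(parsed["string_parsed"])
# [['Article', 'name', 'First'], ['Article', 'name', 'Second']]
```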

In [101]:
# Extract and parse n instances of the specified datatype, return a flattened list
dparsed = {f"{datatype.lower()}_parsed": \
                        parse_node_instances_datatype(jschema,
                                                      node_instances,
                                                      nodes,
                                                      datatype,
                                                      True) for datatype in node_dtypes
                              }
# Add all the combined records
dparsed['dtypes_parsed'] = sum(dparsed.values(), [])

# Display available lists of instances
print(f"A dictionary is created, the keys are: {dparsed.keys()}.")

# Display a sample entry - instance of node and property with datatype STRING
print(f"Sample entry: {dparsed['string_parsed'][11]}.")


A dictionary is created, the keys are: dict_keys(['integer_parsed', 'string_parsed', 'dtypes_parsed']).
Sample entry: ['Article', 'metadata_url', 'http://193.175.161.176:5000/publication/3'].


**NOTES:**

Building on the concepts from the previous step, we select relationships where the start and end nodes have properties of a specified data type. This results in the creation of a dictionary with keys formatted as `datatypeStart_datatypeEnd_rels`. The `all_rels` values contain all the possible combinations of data types.
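
The key naming can be checked in isolation. This sketch only builds the key names from the two node data types reported earlier; the filtering itself is left to `filter_relationships_instances`.

```python
from itertools import product

node_dtypes = ["INTEGER", "STRING"]
keys = [f"{dt1.lower()}_{dt2.lower()}_rels"
        for dt1, dt2 in product(node_dtypes, repeat=2)]
print(keys)
# ['integer_integer_rels', 'integer_string_rels', 'string_integer_rels', 'string_string_rels']
```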

In [102]:
# Generate all possible pairs of node properties datatypes
from itertools import product  # explicit import; may already be provided by the utils star imports
dtypes_pairs = list(product(node_dtypes, repeat=2))

# Use dictionary comprehension with formatted keys for pairs
drels = {
    f"{dt1.lower()}_{dt2.lower()}_rels": \
    filter_relationships_instances(jschema, rels_instances, dt1, dt2)
    for dt1, dt2 in dtypes_pairs
}

# Add 'all_rels' key with concatenated lists from the other values
drels['all_rels'] = sum(drels.values(), [])

# Retain all those combinations that have nonempty entries
drels = {key: value for key, value in drels.items() if value}

# Display the list of node properties datatypes combinations for the relationships in the graph
print(f"The possible end node properties datatypes pairs for relationships are\n {drels.keys()}.\n")

# Sample entry
print("A sample entry:")
drels['string_string_rels'][1]

The possible end node properties datatypes pairs for relationships are
 dict_keys(['integer_integer_rels', 'integer_string_rels', 'string_integer_rels', 'string_string_rels', 'all_rels']).

A sample entry:


['Article',
 {'name': 'Automatic recognition of movement patterns in the vojta-therapy using RGB-D data',
  'url': 'https://ieeexplore.ieee.org/abstract/document/7532555',
  'creativeWorkType': 'article',
  'metadata_url': 'http://193.175.161.176:5000/publication/2'},
 'uses',
 'Dataset',
 {'name': 'PII | External Dataset',
  'url': 'https://www.kaggle.com/datasets/alejopaullier/pii-external-dataset',
  'creativeWorkType': 'dataset',
  'metadata_url': 'http://193.175.161.176:5000/dataset/32'}]

**NOTES:**

- We construct a dictionary that considers the data types of properties associated with relationships.
- The keys are formatted as: `datatypeStart_datatypeRelationship_datatypeEnd_rels`. The `all_rels` values contain all the possible combinations of data types.
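
A sketch of how such keys could be generated, mirroring the node-pair case. The function `relationship_prop_keys` is hypothetical, and the loop is an assumption about how the dictionary would be built. With `rel_dtypes` empty, as reported for this graph, no keys are produced.

```python
from itertools import product

def relationship_prop_keys(node_dtypes, rel_dtypes):
    """Build datatypeStart_datatypeRelationship_datatypeEnd_rels key names."""
    return [f"{start.lower()}_{rel.lower()}_{end.lower()}_rels"
            for start, rel, end in product(node_dtypes, rel_dtypes, node_dtypes)]

# Datatypes found in this graph: relationships carry no properties
print(relationship_prop_keys(["INTEGER", "STRING"], []))
# []

# A hypothetical graph whose relationships carry DATE properties
print(relationship_prop_keys(["INTEGER", "STRING"], ["DATE"]))
# ['integer_date_integer_rels', 'integer_date_string_rels', 'string_date_integer_rels', 'string_date_string_rels']
```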

## Samples Builder

**NOTES:**

- Each cell below features a function that constructs a message with four components: a system prompt, a question, a subschema (relevant information about the graph), and a parametric Cypher query.

- Notation details:
    - Node labels: `label_i`
    - Node properties: `prop_i`
    - Node properties values: `val_i`
    - Relationship types: `rtype_i`
    - Relationship properties: `rprop_i`
    - Relationship properties values: `rval_i`

- The subschema information is derived using:
```
graph_utils.build_minimal_subschema(jschema,
    nodes_info, relationships_info,
    include_node_props, include_rel_props, include_types)
```
- `nodes_info` is formatted as `[[label_1, prop_1], ...]`
- `relationships_info` follows the format `[[rtype_1, rprop_1],...]`.
- The last three parameters are boolean values that dictate the extent of the information included in subschema.

- To exclude any of the generated groups of samples, comment out the following line in the corresponding cell:  
```trainer += collect_samples(sampler, M)```
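
As a quick sanity check on the four-component structure, a message can be validated before it is added to the dataset. The helper `is_valid_message` is a hypothetical addition, not part of the notebook's utilities; the key names are those used by the samplers in this notebook.

```python
REQUIRED_KEYS = ("Prompt", "Question", "Schema", "Cypher")

def is_valid_message(message):
    """Check that a sample has the four expected components as non-empty strings."""
    return all(isinstance(message.get(key), str) and message[key].strip()
               for key in REQUIRED_KEYS)

example = {
    "Prompt": "Convert the following question into a Cypher query using the provided graph schema!",
    "Question": "Find the total number of Article in the graph!",
    "Schema": "Graph schema: Relevant node labels and their properties  are:\nArticle",
    "Cypher": "MATCH (n:Article) RETURN count(n)",
}
print(is_valid_message(example))  # True
```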

In [103]:
# Create a system message
system_message =  "Convert the following question into a Cypher query using the provided graph schema!"

In [104]:
# List to collect the samples
trainer=[]

**NOTES:**

- With the exception of the first couple of examples, samplers that pertain to a single node label are generated using ```utilities.build_node_sampler(nlist, prompter, allow_repeats)```. Here `nlist` takes the structure `dparsed["datatype_parsed"]` or `dparsed["dtypes_parsed"]`, specifying the data type(s) attributed to the selected properties.

- Most of the samples that involve two node labels are constructed via `utilities.build_nodes_property_pairs_sampler(nlist_1, nlist_2, prompter, same_node, allow_repeats)`. Here `nlist_1` and `nlist_2` are as in the previous case and allow for independent choices of data types for the node properties. The boolean argument `same_node` controls how the two nodes are selected.

- The samples that involve named relationships are constructed with `utilities.build_relationship_samples(rel_list, prompter, allow_repeats)`. Here `rel_list` is extracted as `drels["key"]`, where `key` indicates the data types of the start node and end node, or is `all_rels`, which takes into account all possible data types available in the graph.

- The samples that involve relationship properties are built using  `utilities.build_relationships_props_samples(rel_list, prompter, allow_repeats)` which is similar to the general relationships samples builder from above.
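
The general pattern behind these builders (iterate over parsed instances, call a `prompter`, optionally skip repeats) can be sketched as follows. This is a simplified stand-in, not the code of `utilities.build_node_sampler`; in particular, the rule that a repeat is a triple with an already seen (label, property) pair is an assumption.

```python
def sketch_node_sampler(nlist, prompter, allow_repeats=True):
    """Call prompter on each [label, property, value] triple.

    When allow_repeats is False, keep only the first triple seen for each
    (label, property) pair. This deduplication rule is an assumption.
    """
    sampler, seen = [], set()
    for label, prop, value in nlist:
        if not allow_repeats:
            if (label, prop) in seen:
                continue
            seen.add((label, prop))
        sampler.append(prompter(label, prop, value))
    return sampler

def prompter(*params, **kwargs):
    label_1, prop_1, val_1 = params
    return {"Question": f"Find the {label_1} for which {prop_1} is {val_1}!"}

# Made-up triples in the [label, property, value] format
triples = [["Article", "author", "John Smith"],
           ["Article", "author", "Jane Doe"],
           ["Dataset", "name", "PII"]]
print(len(sketch_node_sampler(triples, prompter, allow_repeats=True)))   # 3
print(len(sketch_node_sampler(triples, prompter, allow_repeats=False)))  # 2
```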

### One Node Label

In [105]:
def count_nodes_of_given_label():
    """ Determine how many nodes of specified label are in the graph."""

    def prompter(*params, **kwargs):

        label_1 = params[0]

        subschema =  build_minimal_subschema(jschema,[[label_1, ]],[], False, False, False)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find the total number of {label_1} in the graph!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) RETURN count(n)"
                   }
        return message

    sampler = []
    for label in nodes:
        temp_dict = prompter(label)
        sampler.append(temp_dict)

    return sampler

# Build the set
sampler = count_nodes_of_given_label()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 4 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find the total number of Article in the graph!',
 'Schema': 'Graph schema: Relevant node labels and their properties  are:\nArticle',
 'Cypher': 'MATCH (n:Article) RETURN count(n)'}

In [106]:
def paths_with_node_endpoint():
    """Find paths with specified endpoints."""

    def prompter(*params, **kwargs):

        label_1 = params[0]

        subschema = build_minimal_subschema(jschema,[[label_1, ]],[], False, False, False)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Identify three paths where {label_1} is a start or end node!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f" MATCH p=(b:{label_1})-[r*]->(n) RETURN p UNION MATCH p=(n)-[r*]->(b:{label_1}) RETURN p LIMIT 3"
                   }
        return message

    sampler = []

    for label in nodes:
        temp_dict = prompter(label)
        sampler.append(temp_dict)

    return sampler

# Build the set
sampler = paths_with_node_endpoint()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 4 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Identify three paths where Article is a start or end node!',
 'Schema': 'Graph schema: Relevant node labels and their properties  are:\nArticle',
 'Cypher': ' MATCH p=(b:Article)-[r*]->(n) RETURN p UNION MATCH p=(n)-[r*]->(b:Article) RETURN p LIMIT 3'}

### One Node Label, One Property

#### Any Data Type Input

In [107]:
def match_one_node_one_prop():
    """Return a given node label and a specified property."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]

        # Extract subschema for the variables of interest
        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment

        message = {"Prompt": f"{system_message}",
                   "Question": f"""Fetch the {label_1} nodes and extract their {prop_1} property!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) RETURN n.{prop_1}"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                              prompter,
                              allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = match_one_node_one_prop()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Fetch the Article nodes and extract their id property!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': 'MATCH (n:Article) RETURN n.id'}

In [108]:
def where_one_node_one_prop_notnull_numeral():
    """Return n (use figures, e.g. 8) nodes where a property is not null."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find 10 {label_1} that have the {prop_1} recorded and return these values!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE n.{prop_1} IS NOT NULL RETURN n.{prop_1} LIMIT 10"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                                 prompter,
                                 allow_repeats = ALLOW_REPEATS)

# Build the set
sampler = where_one_node_one_prop_notnull_numeral()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find 10 Article that have the id recorded and return these values!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': 'MATCH (n:Article) WHERE n.id IS NOT NULL RETURN n.id LIMIT 10'}

In [109]:
def where_one_node_one_prop_notnull_literal():
    """Return n (use words, e.g. eight) nodes where a property is not null."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find ten {label_1} that have {prop_1} and return their records!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE n.{prop_1} IS NOT NULL RETURN n.{prop_1} LIMIT 10"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                                 prompter,
                                 allow_repeats = ALLOW_REPEATS)

# Build the set
sampler = where_one_node_one_prop_notnull_literal()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find ten Article that have id and return their records!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': 'MATCH (n:Article) WHERE n.id IS NOT NULL RETURN n.id LIMIT 10'}

In [110]:
def where_one_node_one_prop_null_numeral():
    """Return n (use figures, e.g. 8) nodes where a property is null."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find 8 {label_1} that are missing the {prop_1}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE n.{prop_1} IS NULL RETURN n LIMIT 8"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                                 prompter,
                                 allow_repeats= ALLOW_REPEATS)


# Build the set
sampler = where_one_node_one_prop_null_numeral()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find 8 Article that are missing the id!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': 'MATCH (n:Article) WHERE n.id IS NULL RETURN n LIMIT 8'}

In [111]:
def find_node_notproperty_count():
    """Find how many nodes of given label are missing a specified property."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find the total number of {label_1} for which the {prop_1} is missing!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE n.{prop_1} IS NULL RETURN count(n)"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                                 prompter,
                                 allow_repeats= ALLOW_REPEATS)


# Build the set
sampler = find_node_notproperty_count()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find the total number of Article for which the id is missing!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': 'MATCH (n:Article) WHERE n.id IS NULL RETURN count(n)'}

In [112]:
def find_node_property_count():
    """Count nodes of given label which have a certain property."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find the total number of {label_1} that have the {prop_1} recorded!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE n.{prop_1} IS NOT NULL RETURN count(n)"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                                 prompter,
                                 allow_repeats= ALLOW_REPEATS)


# Build the set
sampler = find_node_property_count()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find the total number of Article that have the id recorded!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': 'MATCH (n:Article) WHERE n.id IS NOT NULL RETURN count(n)'}

In [113]:
def find_node_by_property():
    """Find instances of a given node label that have a property with a specified value."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]

        # Extract subschema for the variables of interest
        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find the {label_1} for which {prop_1} is {val_1}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1} {{{prop_1}:'{val_1}'}}) RETURN n"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)

# Build the set
sampler =  find_node_by_property()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find the Article for which id is 1!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': "MATCH (n:Article {id:'1'}) RETURN n"}
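
In the example above, the value is quoted even though the schema lists `id` as `INTEGER`, so the generated pattern compares against the string `'1'` rather than the number `1`. If type-aware literals are wanted, one option is a small formatting helper. This is a sketch with a hypothetical `cypher_literal` function, not part of the notebook's utilities; it covers only booleans, numbers, and strings.

```python
def cypher_literal(value):
    """Render a Python value as a Cypher literal (bool, int, float, or str only)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape backslashes and single quotes so the string literal stays well-formed
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"

print(f"MATCH (n:Article {{id:{cypher_literal(1)}}}) RETURN n")
# MATCH (n:Article {id:1}) RETURN n

title = "Bayes' rule"  # made-up value containing a single quote
print(f"MATCH (n:Article {{name:{cypher_literal(title)}}}) RETURN n")
```

The same helper also guards against property values that contain single quotes, which would otherwise terminate the string literal early.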

In [114]:
def match_skip_limit_return_property():
    """Return a list of values of a property, using skip and limit."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]

        nrecs = kwargs.get('nrecs', 2)

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Return the {prop_1} of the {label_1}, skip the first {nrecs} records and return {nrecs} records!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) RETURN n.{prop_1}  SKIP {nrecs} LIMIT {nrecs}"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)


# Build the set
sampler = match_skip_limit_return_property()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Return the id of the Article, skip the first 2 records and return 2 records!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': 'MATCH (n:Article) RETURN n.id  SKIP 2 LIMIT 2'}
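
`SKIP` and `LIMIT` behave like slicing a result list: `SKIP s LIMIT k` keeps rows `s` through `s + k - 1` (counting from zero). The Python analogy below uses made-up values and a hypothetical `skip_limit` helper. Note that without an `ORDER BY` clause, Cypher does not guarantee which rows come first.

```python
def skip_limit(rows, skip, limit):
    """Python analogue of Cypher's SKIP/LIMIT applied to an ordered result list."""
    return rows[skip:skip + limit]

ids = [1, 2, 3, 4, 5, 6]      # illustrative values only
print(skip_limit(ids, 2, 2))  # [3, 4]
```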

#### String Data Type

In [115]:
def match_where_skip_limit_return_property():
    """Fetch a list of nodes with certain properties, use skip and limit."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        nrecs = kwargs.get('nrecs', 2)

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find the {label_1} for which {prop_1} starts with {val_1[0]}, skip the first {nrecs} records and return the next {nrecs} records of {prop_1}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE n.{prop_1} STARTS WITH '{val_1[0]}' WITH n.{prop_1} AS {prop_1} SKIP {nrecs} LIMIT {nrecs} RETURN {prop_1}"
                   }
        return message

    return build_node_sampler(dparsed["string_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)

# Build the set
sampler = match_where_skip_limit_return_property()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 192 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find the Article for which name starts with M, skip the first 2 records and return the next 2 records of name!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {name: STRING}',
 'Cypher': "MATCH (n:Article) WHERE n.name STARTS WITH 'M' WITH n.name AS name SKIP 2 LIMIT 2 RETURN name"}

In [116]:
def where_one_node_one_prop_one_val():
    """Retrieve nodes of given label where a string property has a given value."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find the {label_1} where {prop_1} is {val_1.strip()}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE n.{prop_1} = '{val_1}' RETURN n"
                   }
        return message

    return build_node_sampler(dparsed["string_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)

# Build the set
sampler = where_one_node_one_prop_one_val()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 192 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find the Article where name is MexPub: Deep Transfer Learning for Metadata Extraction from German Publications!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {name: STRING}',
 'Cypher': "MATCH (n:Article) WHERE n.name = 'MexPub: Deep Transfer Learning for Metadata Extraction from German Publications' RETURN n"}

In [117]:
def where_one_node_one_string_contains():
    """Retrieve nodes of specified label where a string property contains a given substring."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find the {label_1} where {prop_1} contains {val_1[:5]}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE n.{prop_1} CONTAINS '{val_1[:5]}' RETURN n"
                   }
        return message

    return build_node_sampler(dparsed["string_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)

# Build the set
sampler = where_one_node_one_string_contains()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 192 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find the Article where name contains MexPu!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {name: STRING}',
 'Cypher': "MATCH (n:Article) WHERE n.name CONTAINS 'MexPu' RETURN n"}
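
The sampled values are interpolated directly into single-quoted Cypher literals, so a value that contains a single quote or a backslash would produce a query that does not parse. A minimal sketch of an escaping step follows; `escape_cypher_string` is a hypothetical helper, not part of the `utils` modules used in this notebook, and the sample value is made up.

```python
def escape_cypher_string(value: str) -> str:
    """Escape backslashes and single quotes for a single-quoted Cypher string literal."""
    # Escape backslashes first, so the ones added for quotes are not doubled again
    return value.replace("\\", "\\\\").replace("'", "\\'")


# Hypothetical substring containing a quote
fragment = escape_cypher_string("O'Rei")
query = f"MATCH (n:Article) WHERE n.name CONTAINS '{fragment}' RETURN n"
print(query)
```

Applying the helper to `val_1` (or its slice) inside a `prompter` would keep the generated query valid for such values.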

In [118]:
def find_node_by_start_substring():
    """Find instances of a given node label that have a property starting with a specified substring."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find the {label_1} for which {prop_1} starts with {val_1[:3]}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE n.{prop_1} STARTS WITH '{val_1[:3]}' RETURN n"
                   }
        return message

    return build_node_sampler(dparsed["string_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)

# Build the set
sampler = find_node_by_start_substring()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 192 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find the Article for which name starts with Mex!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {name: STRING}',
 'Cypher': "MATCH (n:Article) WHERE n.name STARTS WITH 'Mex' RETURN n"}
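
The `[:-29]` slice used in each `prompter` drops a fixed number of trailing characters, so it would cut the wrong text if the wording produced by `build_minimal_subschema` ever changed. One alternative is to cut at a known marker instead. In the sketch below, `cut_at_marker`, the marker string and the sample text are placeholders for illustration; the actual trailing text is not shown in this notebook.

```python
def cut_at_marker(text: str, marker: str) -> str:
    """Return the text before the last occurrence of marker, or the text unchanged if absent."""
    head, sep, _ = text.rpartition(marker)
    # rpartition returns an empty separator when the marker is not found
    return head if sep else text


# Placeholder strings, not the real output of build_minimal_subschema
raw = "Relevant node labels are:\nArticle {name: STRING}\n<relationship section>"
print(cut_at_marker(raw, "\n<relationship section>"))
```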

In [119]:
def where_one_node_string_re():
    """Retrieve nodes of a given label whose string property matches a regular expression."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Fetch the {label_1} where {prop_1} starts with {val_1[:2]}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE n.{prop_1} =~'{val_1[:2]}.*' RETURN n"
                   }
        return message

    return build_node_sampler(dparsed["string_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)

# Build the set
sampler = where_one_node_string_re()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 192 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Fetch the Article where name starts with Me!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {name: STRING}',
 'Cypher': "MATCH (n:Article) WHERE n.name =~'Me.*' RETURN n"}
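
The `=~` operator treats its right-hand side as a regular expression (Neo4j follows Java regex syntax), so a metacharacter such as `+` or `(` in the sampled value changes the meaning of the pattern. The sketch below shows one way to neutralize metacharacters before building a prefix pattern; `REGEX_META` and `regex_prefix_literal` are illustrative names, not notebook utilities.

```python
# Characters with a special meaning in a regular expression
REGEX_META = set(".^$*+?()[]{}|\\")


def regex_prefix_literal(prefix: str) -> str:
    """Build a quoted Cypher literal holding a regex that matches strings starting with prefix."""
    # Step 1: backslash-escape regex metacharacters, then allow any suffix
    pattern = "".join("\\" + ch if ch in REGEX_META else ch for ch in prefix) + ".*"
    # Step 2: escape backslashes and quotes for the Cypher string literal itself
    literal = pattern.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{literal}'"


query = f"MATCH (n:Article) WHERE n.name =~ {regex_prefix_literal('Me')} RETURN n"
print(query)
```

The two escaping layers are kept separate on purpose: the first protects the regex, the second protects the string literal that carries it.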

#### Temporal Data Types

#### Paths and Neighbors - Any Data Type

In [121]:
def find_unique_rels():
    """Count the distinct relationship types that originate from a given node instance."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""How many unique relationships originate from {label_1} where {prop_1} is {val_1}?""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (a:{label_1}{{{prop_1}:'{val_1}'}})-[r]->() RETURN COUNT(DISTINCT TYPE(r)) AS rels"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)

# Build the set
sampler = find_unique_rels()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'How many unique relationships originate from Article where id is 1?',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': "MATCH (a:Article{id:'1'})-[r]->() RETURN COUNT(DISTINCT TYPE(r)) AS rels"}

In [122]:
def connection_thru_two_rels():
    """How many nodes are connected to a given node instance via two relationships."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""How many nodes are connected to {label_1} for which {prop_1} is {val_1}, by exactly two different types of relationships?""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f" MATCH (a:{label_1}{{{prop_1}:'{val_1}'}})-[r]->(n) WITH n, COLLECT(DISTINCT TYPE(r)) AS Types WHERE SIZE(Types) = 2 RETURN COUNT(n)"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)

# Build the set
sampler = connection_thru_two_rels()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'How many nodes are connected to Article for which id is 1, by exactly two different types of relationships?',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': " MATCH (a:Article{id:'1'})-[r]->(n) WITH n, COLLECT(DISTINCT TYPE(r)) AS Types WHERE SIZE(Types) = 2 RETURN COUNT(n)"}
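
In the example above the schema reports `id: INTEGER`, while the generated query compares the property with the string `'1'`. Cypher does not coerce between strings and numbers in equality checks, so such a pattern matches nothing on an integer property. A possible remedy is to render each value according to its Python type. The sketch assumes a hypothetical helper `cypher_literal` and assumes the sampled values keep their native Python types, which this notebook does not confirm.

```python
def cypher_literal(value) -> str:
    """Render a Python value as a Cypher literal (sketch covering bool, int, float and str)."""
    # bool must be tested before int, because bool is a subclass of int
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    # Everything else is treated as a string and quoted
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


label_1, prop_1, val_1 = "Article", "id", 1
print(f"MATCH (a:{label_1} {{{prop_1}: {cypher_literal(val_1)}}})-[r]->(n) RETURN COUNT(n)")
```

With an integer value this produces `{id: 1}`; a string value is still quoted.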

In [124]:
def rels_and_counts_and_nodes():
    """Get information on nodes connected to a certain node instance."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""List the nodes that are connected to {label_1} for which {prop_1} is {val_1}, with their relationship types and count these types!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f" MATCH (a:{label_1}{{{prop_1}:'{val_1}'}})-[r]->(n) RETURN n, TYPE(r) AS Relations, COUNT(r) AS Counts"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)

# Build the set
sampler = rels_and_counts_and_nodes()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'List the nodes that are connected to Article for which id is 1, with their relationship types and count these types!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': " MATCH (a:Article{id:'1'})-[r]->(n) RETURN n, TYPE(r) AS Relations, COUNT(r) AS Counts"}

In [125]:
def rels_and_counts():
    """Find relationships and their counts that are connected to a specified node instance."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""List the types of relationships and their counts connected to {label_1} for which {prop_1} is {val_1}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f" MATCH (a:{label_1}{{{prop_1}:'{val_1}'}})-[r]->() RETURN TYPE(r) AS Relations, COUNT(r) AS Counts"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)

# Build the set
sampler = rels_and_counts()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'List the types of relationships and their counts connected to Article for which id is 1!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': " MATCH (a:Article{id:'1'})-[r]->() RETURN TYPE(r) AS Relations, COUNT(r) AS Counts"}
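
Several generated queries begin with a stray space (`" MATCH ..."`). Neo4j ignores it, but it adds noise to a fine-tuning set. A light normalization pass before saving could look like the following sketch; `REQUIRED_KEYS` and `normalize_sample` are illustrative names, not part of the notebook's utilities.

```python
REQUIRED_KEYS = ("Prompt", "Question", "Schema", "Cypher")


def normalize_sample(sample: dict) -> dict:
    """Check that all expected fields are present and trim surrounding whitespace."""
    missing = [key for key in REQUIRED_KEYS if key not in sample]
    if missing:
        raise KeyError(f"sample is missing fields: {missing}")
    return {key: sample[key].strip() for key in REQUIRED_KEYS}


example = {
    "Prompt": "Convert the following question into a Cypher query using the provided graph schema!",
    "Question": "List the types of relationships and their counts connected to Article for which id is 1!",
    "Schema": "Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}",
    "Cypher": " MATCH (a:Article{id:'1'})-[r]->() RETURN TYPE(r) AS Relations, COUNT(r) AS Counts",
}
print(normalize_sample(example)["Cypher"])
```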

In [126]:
def find_node_neighbours():
    """Find all neighbors of a given node instance."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find all nodes directly connected to the {label_1} that has {prop_1} {val_1}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH path=(:{label_1} {{{prop_1}:'{val_1}'}})-->() RETURN path"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)

# Build the set
sampler = find_node_neighbours()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find all nodes directly connected to the Article that has id 1!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': "MATCH path=(:Article {id:'1'})-->() RETURN path"}

In [127]:
def find_neighbors_properties():
    """Find the neighbors of a given node (specified intrinsically) and list their properties."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find the nodes connected to {label_1} where {prop_1} is {val_1} and list their properties!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f" MATCH (a:{label_1}{{{prop_1}:'{val_1}'}})-[r]->(n) RETURN properties(n), r"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)

# Build the set
sampler = find_neighbors_properties()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find the nodes connected to Article where id is 1 and list their properties!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': " MATCH (a:Article{id:'1'})-[r]->(n) RETURN properties(n), r"}

In [128]:
def find_node_neighbors_properties():
    """Find the neighbors of a given node (specified extrinsically) and list their properties."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Identify nodes that are connected to {label_1} where {prop_1} is {val_1} and list their properties, including those of {label_1}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f" MATCH (b:{label_1})-[r]->(n) WHERE b.{prop_1} = '{val_1}' RETURN properties(b) AS {label_1}_props, properties(n) AS props"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)


# Build the set
sampler = find_node_neighbors_properties()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Identify nodes that are connected to Article where id is 1 and list their properties, including those of Article!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': " MATCH (b:Article)-[r]->(n) WHERE b.id = '1' RETURN properties(b) AS Article_props, properties(n) AS props"}

In [129]:
def find_properties_neighbors_relationship():
    """Find properties of specified neighbors of a given node instance."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""What are the properties of nodes connected to {label_1} for which {prop_1} is {val_1}, and what are their relationships to {label_1}?""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (c:{label_1})<-[r]-(n) WHERE c.{prop_1} = '{val_1}' RETURN properties(n) AS props, r"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)

# Build the set
sampler = find_properties_neighbors_relationship()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'What are the properties of nodes connected to Article for which id is 1, and what are their relationships to Article?',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': "MATCH (c:Article)<-[r]-(n) WHERE c.id = '1' RETURN properties(n) AS props, r"}

In [130]:
def nodes_connected_to_two_nodes():
    """Find common neighbors of two nodes, only one specified."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Which nodes are connected to {label_1} where {prop_1} is {val_1}, and also to another node?""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (a:{label_1}{{{prop_1}:'{val_1}'}})-[r]->(n), (n)-[s]->(m) RETURN labels(n) AS Interim, labels(m) AS Target"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)

# Build the set
sampler = nodes_connected_to_two_nodes()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Which nodes are connected to Article where id is 1, and also to another node?',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': "MATCH (a:Article{id:'1'})-[r]->(n), (n)-[s]->(m) RETURN labels(n) AS Interim, labels(m) AS Target"}

In [131]:
def longest_path_from_node():
    """Find the longest path originating from a given node, basic approach."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Identify the longest path originating from {label_1} for which {prop_1} is {val_1}, and list the properties of the nodes on the path!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f" MATCH p=(a:{label_1}{{{prop_1}:'{val_1}'}})-[*]->(n) RETURN p, nodes(p) ORDER BY LENGTH(p) DESC LIMIT 1"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)

# Build the set
sampler = longest_path_from_node()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Identify the longest path originating from Article for which id is 1, and list the properties of the nodes on the path!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': " MATCH p=(a:Article{id:'1'})-[*]->(n) RETURN p, nodes(p) ORDER BY LENGTH(p) DESC LIMIT 1"}
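
The `-[*]->` pattern places no upper bound on path length, so on a densely connected graph the query may enumerate a very large number of paths. Cypher accepts a range such as `[*1..5]`, and the sketch below builds that fragment. The helper name `hop_range` and the bound of 5 are arbitrary choices for illustration.

```python
def hop_range(min_hops: int = 1, max_hops: int = 5) -> str:
    """Return a bounded variable-length relationship pattern such as [*1..5]."""
    if min_hops < 0 or max_hops < min_hops:
        raise ValueError("expected 0 <= min_hops <= max_hops")
    return f"[*{min_hops}..{max_hops}]"


# Same interpolation style as the prompter above, with illustrative values
label_1, prop_1, val_1 = "Article", "id", "1"
cypher = (
    f"MATCH p=(a:{label_1}{{{prop_1}:'{val_1}'}})-{hop_range()}->(n) "
    "RETURN p, nodes(p) ORDER BY LENGTH(p) DESC LIMIT 1"
)
print(cypher)
```

Note that a bounded query returns the longest path within the bound, so the question text would need to mention the limit as well.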

In [132]:
def node_properties_for_two_relationships():
    """Fetch node properties for a given path."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""What are the properties of nodes connected to {label_1} where {prop_1} is {val_1}, by two different types of relationships?""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (e:{label_1}{{{prop_1}:'{val_1}'}})-[r1]->(n)-[r2]->(m) WHERE TYPE(r1) <> TYPE(r2) RETURN properties(n) AS props1, properties(m) AS props2"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)

# Build the set
sampler = node_properties_for_two_relationships()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'What are the properties of nodes connected to Article where id is 1, by two different types of relationships?',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': "MATCH (e:Article{id:'1'})-[r1]->(n)-[r2]->(m) WHERE TYPE(r1) <> TYPE(r2) RETURN properties(n) AS props1, properties(m) AS props2"}

In [133]:
def average_props():
    """Find the average count of properties of nodes along a path."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""What is the average number of properties per node connected to {label_1} for which {prop_1} is {val_1}?""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f" MATCH (a:{label_1}{{{prop_1}:'{val_1}'}})-[r]->(n) RETURN AVG(SIZE(keys(n))) AS AvgProps"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)

# Build the set
sampler = average_props()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]


There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'What is the average number of properties per node connected to Article for which id is 1?',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': " MATCH (a:Article{id:'1'})-[r]->(n) RETURN AVG(SIZE(keys(n))) AS AvgProps"}

In [134]:
def first_and_far_neighbors():
    """Properties of nodes reachable by a path from a specified node."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Enumerate the properties of nodes that are either directly or indirectly connected to {label_1} for which {prop_1} is {val_1}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f" MATCH (a:{label_1}{{{prop_1}:'{val_1}'}})-[*]->(n) RETURN DISTINCT properties(n) AS Properties"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)

# Build the set
sampler = first_and_far_neighbors()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Enumerate the properties of nodes that are either directly or indirectly connected to Article for which id is 1!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': " MATCH (a:Article{id:'1'})-[*]->(n) RETURN DISTINCT properties(n) AS Properties"}

In [135]:
def nodes_connected_to_node():
    """Find the neighbors of a node (extrinsically specified property)."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""List all nodes that are connected to {label_1} where {prop_1} contains {val_1}, along with the type of their relationship with {label_1}!""",
                   "Schema": f"Graph schema: {subschema}",
                   # CONTAINS is defined for strings only, so non-string properties are converted with toString()
                   "Cypher": f"""MATCH (d:{label_1})-[r]->(n) WHERE toString(d.{prop_1}) CONTAINS '{val_1}' RETURN n, TYPE(r)"""
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)

# Build the set
sampler = nodes_connected_to_node()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'List all nodes that are connected to Article where id contains 1, along with the type of their relationship with Article!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': "MATCH (d:Article)-[r]->(n) WHERE toString(d.id) CONTAINS '1' RETURN n, TYPE(r)"}

In [136]:
def find_far_unique_rels():
    """Find the distinct properties of nodes that are nhops away from a given node."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        nhops = kwargs.get('nhops', 2)


        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""List the distinct properties of nodes that are {nhops} hops away from {label_1} with {prop_1} equal to {val_1}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f" MATCH (a:{label_1}{{{prop_1}:'{val_1}'}})-[*{nhops}]->(n) RETURN DISTINCT properties(n) AS props"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)

# Build the set
sampler = find_far_unique_rels()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'List the distinct properties of nodes that are 2 hops away from Article with id equal to 1!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': " MATCH (a:Article{id:'1'})-[*2]->(n) RETURN DISTINCT properties(n) AS props"}

In [137]:
def find_far_neighbors_properties():
    """Find the properties of nodes that are 3 hops away from a given node instance."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        nhops = kwargs.get('nhops', 3)

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""List the properties of nodes that are {nhops} hops away from {label_1} with {prop_1} equal to {val_1}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f" MATCH (a:{label_1})-[*{nhops}]->(n) WHERE a.{prop_1} = '{val_1}' RETURN properties(n) AS props"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                              prompter,
                              allow_repeats= ALLOW_REPEATS)


# Build the set
sampler = find_far_neighbors_properties()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'List the properties of nodes that are 3 hops away from Article with id equal to 1!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': " MATCH (a:Article)-[*3]->(n) WHERE a.id = '1' RETURN properties(n) AS props"}

In [138]:
def find_far_neighbors():
    """Retrieve the node labels of the nodes that are nhops away from a given node instance."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        nhops = kwargs.get('nhops', 3)

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""List nodes that are {nhops} hops away from {label_1} for which {prop_1}={val_1}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (a:{label_1}{{{prop_1}:'{val_1}'}})-[*{nhops}]->(n) RETURN labels(n) AS FarNodes"
                   }
        return message

    return build_node_sampler(dparsed["dtypes_parsed"],
                              prompter,
                              allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = find_far_neighbors()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 240 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'List nodes that are 3 hops away from Article for which id=1!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}',
 'Cypher': "MATCH (a:Article{id:'1'})-[*3]->(n) RETURN labels(n) AS FarNodes"}

### One Node Label, Two Properties

#### String Data Type

In [139]:
def match_with_where_not_value():
    """Retrieve a node property when another property does not take a certain value."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        prop_2 = params[3]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_1, prop_2]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Retrieve distinct values of the {prop_2} from {label_1} where {prop_1} is not {val_1}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE n.{prop_1} <> '{val_1}' RETURN DISTINCT n.{prop_2} AS {prop_2}"
                   }
        return message

    return build_nodes_property_pairs_sampler(dparsed["dtypes_parsed"],
                                              dparsed["dtypes_parsed"],
                                              prompter,
                                              same_node=True,
                                              allow_repeats=ALLOW_REPEATS)


# Build the set
sampler = match_with_where_not_value()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 14400 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Retrieve distinct values of the id from Article where id is not 1!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nArticle {id: INTEGER}',
 'Cypher': "MATCH (n:Article) WHERE n.id <> '1' RETURN DISTINCT n.id AS id"}

In [141]:
def match_with_where_contains_substring():
    """Retrieve two properties of a node if one of the properties contains a given substring."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        prop_2 = params[3]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_1, prop_2]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find the {prop_1} and the {prop_2} for those {label_1} where {prop_1} contains the substring {val_1[:2]}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE n.{prop_1} CONTAINS '{val_1[:2]}' RETURN n.{prop_1} AS {prop_1}, n.{prop_2} AS {prop_2}"
                   }
        return message

    return build_nodes_property_pairs_sampler(dparsed["string_parsed"],
                                              dparsed["dtypes_parsed"],
                                              prompter,
                                              same_node=True,
                                              allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = match_with_where_contains_substring()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 11520 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find the name and the id for those Article where name contains the substring Me!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {name: STRING}\nArticle {id: INTEGER}',
 'Cypher': "MATCH (n:Article) WHERE n.name CONTAINS 'Me' RETURN n.name AS name, n.id AS id"}

In [142]:
def match_with_where_starts_with_substring():
    """Retrieve two properties of a node if one of the properties starts with a given substring."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        prop_2 = params[3]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_1, prop_2]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find the {prop_1} and the {prop_2} for those {label_1} where {prop_1} starts with {val_1[0]}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE n.{prop_1} STARTS WITH '{val_1[0]}' RETURN n.{prop_1} AS {prop_1}, n.{prop_2} AS {prop_2}"
                   }
        return message

    return build_nodes_property_pairs_sampler(dparsed["string_parsed"],
                                              dparsed["dtypes_parsed"],
                                              prompter,
                                              same_node=True,
                                              allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = match_with_where_starts_with_substring()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 11520 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find the name and the id for those Article where name starts with M!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {name: STRING}\nArticle {id: INTEGER}',
 'Cypher': "MATCH (n:Article) WHERE n.name STARTS WITH 'M' RETURN n.name AS name, n.id AS id"}

In [143]:
def match_with_where_not_is_value():
    """Return two properties of a node if one of the properties does not start with a specified string."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        prop_2 = params[3]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_1, prop_2]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Fetch unique values of {prop_1} and {prop_2} from {label_1} where {prop_1} does not start with {val_1[0]}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE NOT n.{prop_1} STARTS WITH '{val_1[0]}' RETURN DISTINCT n.{prop_1} AS {prop_1}, n.{prop_2} AS {prop_2}"
                   }
        return message

    return build_nodes_property_pairs_sampler(dparsed["string_parsed"],
                                              dparsed["dtypes_parsed"],
                                              prompter,
                                              same_node=True,
                                              allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = match_with_where_not_is_value()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 11520 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Fetch unique values of name and id from Article where name does not start with M!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {name: STRING}\nArticle {id: INTEGER}',
 'Cypher': "MATCH (n:Article) WHERE NOT n.name STARTS WITH 'M' RETURN DISTINCT n.name AS name, n.id AS id"}
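
The three string predicates used in the cells above (`CONTAINS`, `STARTS WITH`, and `NOT ... STARTS WITH`) are case-sensitive in Cypher. The sketch below mirrors them with Python's `in` and `str.startswith`, which are case-sensitive as well, to show which values each predicate would keep. Only the first name is taken from the sample output; the other two are made-up values for illustration.

```python
# First value comes from the sample output; the other two are hypothetical
names = [
    "MexPub: Deep Transfer Learning for Metadata Extraction from German Publications",
    "Graph Queries",
    "metadata tools",
]

# Two-character fragment, built the same way as val_1[:2] in the prompter
fragment = names[0][:2]

# CONTAINS 'Me': case-sensitive substring test
contains = [s for s in names if fragment in s]

# STARTS WITH 'M': case-sensitive prefix test
starts_with = [s for s in names if s.startswith("M")]

# NOT ... STARTS WITH 'M': complement of the prefix test
not_starts_with = [s for s in names if not s.startswith("M")]

print(fragment, len(contains), len(starts_with), len(not_starts_with))
```

Note that `"metadata tools"` is excluded by both the `CONTAINS 'Me'` and the `STARTS WITH 'M'` analogues because of the lowercase `m`.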

In [144]:
def match_properties_with_union():
    """Find node instances if one of two properties contains a certain substring."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        prop_2 = params[3]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_1, prop_2]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Retrieve the {label_1} where {prop_1} or {prop_2} contains {val_1}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE n.{prop_1} CONTAINS '{val_1}' RETURN n AS node UNION ALL MATCH (m:{label_1}) WHERE m.{prop_2} CONTAINS '{val_1}' RETURN m AS node"
                   }
        return message

    return build_nodes_property_pairs_sampler(dparsed["string_parsed"],
                                              dparsed["string_parsed"],
                                              prompter,
                                              same_node=True,
                                              allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = match_properties_with_union()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 9216 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Retrieve the Article where name or name contains MexPub: Deep Transfer Learning for Metadata Extraction from German Publications!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {name: STRING}\nArticle {name: STRING}',
 'Cypher': "MATCH (n:Article) WHERE n.name CONTAINS 'MexPub: Deep Transfer Learning for Metadata Extraction from German Publications' RETURN n AS node UNION ALL MATCH (m:Article) WHERE m.name CONTAINS 'MexPub: Deep Transfer Learning for Metadata Extraction from German Publications' RETURN m AS node"}

In [145]:
def where_one_node_two_props_notnull_or():
    """Find a specified property of a given label if another property fulfills a given condition or the specified property is not null."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        prop_2 = params[3]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_1, prop_2]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Fetch the distinct values of the {prop_2} from {label_1} where either {prop_1} is {val_1} or {prop_2} is not null!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE n.{prop_1} = '{val_1}' OR n.{prop_2} IS NOT NULL RETURN DISTINCT n.{prop_2} AS {prop_2}"
                   }
        return message

    return build_nodes_property_pairs_sampler(dparsed["string_parsed"],
                                              dparsed["string_parsed"],
                                              prompter,
                                              same_node=True,
                                              allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = where_one_node_two_props_notnull_or()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 9216 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Fetch the distinct values of the name from Article where either name is MexPub: Deep Transfer Learning for Metadata Extraction from German Publications or name is not null!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {name: STRING}\nArticle {name: STRING}',
 'Cypher': "MATCH (n:Article) WHERE n.name = 'MexPub: Deep Transfer Learning for Metadata Extraction from German Publications' OR n.name IS NOT NULL RETURN DISTINCT n.name AS name"}

#### Temporal Data Types

#### Numerical Data Types

In [147]:
# Aggregate a numerical property, grouped by the values of another property
def aggregate_integers_by_string():
    """Find statistics of a numerical property for those nodes that satisfy a condition on a second property."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        prop_2 = params[3]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_1, prop_2]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""For each non-null {prop_1} of the {label_1}, how many times does it appear, and what are the minimum, maximum and average values of {prop_2} associated with it?""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE n.{prop_1} IS NOT NULL WITH DISTINCT n WITH n.{prop_1} as {prop_1}, COUNT(n) AS count, min(n.{prop_2}) AS min, max(n.{prop_2}) AS max, avg(n.{prop_2}) AS avg RETURN {prop_1}, count, min, max, avg"
                   }
        return message

    return build_nodes_property_pairs_sampler(dparsed["dtypes_parsed"],
                                              dparsed["integer_parsed"], # dparsed["integer_parsed"]+dparsed["float_parsed"] when available
                                              prompter,
                                              same_node=True,
                                              allow_repeats=ALLOW_REPEATS)


# Build the set
sampler = aggregate_integers_by_string()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 2880 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'For each non-null id of the Article, how many times does it appear, and what are the minimum, maximum and average values of id associated with it?',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nArticle {id: INTEGER}',
 'Cypher': 'MATCH (n:Article) WHERE n.id IS NOT NULL WITH DISTINCT n WITH n.id as id, COUNT(n) AS count, min(n.id) AS min, max(n.id) AS max, avg(n.id) AS avg RETURN id, count, min, max, avg'}
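
In Cypher, the non-aggregated expressions in a `WITH` clause act as grouping keys, so the query above groups nodes by the first property and aggregates the second within each group. The following sketch reproduces that behaviour in plain Python on a few made-up rows (the `name` and `id` values are hypothetical, not taken from the graph).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical node rows; one row has a null grouping key
rows = [
    {"name": "A", "id": 1},
    {"name": "A", "id": 3},
    {"name": "B", "id": 5},
    {"name": None, "id": 7},
]

# WHERE n.name IS NOT NULL, then group the ids by name
groups = defaultdict(list)
for row in rows:
    if row["name"] is not None:
        groups[row["name"]].append(row["id"])

# COUNT, min, max and avg per grouping key
summary = {
    key: {"count": len(ids), "min": min(ids), "max": max(ids), "avg": mean(ids)}
    for key, ids in groups.items()
}
print(summary)
```

When both properties are the same, as in the displayed sample, every group contains a single distinct value, so the minimum, maximum and average all coincide with the grouping key.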

In [148]:
def match_with_where_not_null():
    """Return nodes where a property is not null, a second property takes specified values, order by the second property."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        prop_2 = params[3]
        val_2 = params[4]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_1, prop_2]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Search for {prop_1} and {prop_2} from {label_1} where {prop_1} is not null and {prop_2} exceeds {val_2} and sort the results by {prop_2}, beginning with the largest!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE n.{prop_1}  IS NOT NULL AND n.{prop_2} > {val_2} RETURN n.{prop_1} AS {prop_1}, n.{prop_2} AS {prop_2} ORDER BY {prop_2} DESC"
                   }
        return message

    return build_nodes_property_pairs_sampler(dparsed["string_parsed"],
                                              dparsed["integer_parsed"], # dparsed["integer_parsed"]+dparsed["float_parsed"] when available
                                              prompter,
                                              same_node=True,
                                              allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = match_with_where_not_null()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 2304 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Search for name and id from Article where name is not null and id exceeds 1 and sort the results by id, beginning with the largest!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {name: STRING}\nArticle {id: INTEGER}',
 'Cypher': 'MATCH (n:Article) WHERE n.name  IS NOT NULL AND n.id > 1 RETURN n.name AS name, n.id AS id ORDER BY id DESC'}

In [149]:
def aggregate_numerical_by_integer():
    """Group the nodes that satisfy a numerical condition on one property and report the count, minimum, maximum and average of a second property."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        prop_2 = params[3]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_1, prop_2]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find the {label_1} counts where {prop_1} is smaller than ten, and return the maximum, minimum and average values of the {prop_2}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE n.{prop_1} < 10 WITH DISTINCT n WITH n.{prop_1} as {prop_1}, COUNT(n) AS count, min(n.{prop_2}) AS min_{prop_2}, max(n.{prop_2}) AS max_{prop_2}, avg(n.{prop_2}) AS avg_{prop_2} RETURN {prop_1}, count, min_{prop_2}, max_{prop_2}, avg_{prop_2}"
                   }
        return message

    return build_nodes_property_pairs_sampler(dparsed["integer_parsed"], # dparsed["integer_parsed"]+dparsed["float_parsed"] when available
                                              dparsed["integer_parsed"], # dparsed["integer_parsed"]+dparsed["float_parsed"] when available
                                              prompter,
                                              same_node=True,
                                              allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = aggregate_numerical_by_integer()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 576 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find the Article counts where id is smaller than ten, and return the maximum, minimum and average values of the id!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nArticle {id: INTEGER}',
 'Cypher': 'MATCH (n:Article) WHERE n.id < 10 WITH DISTINCT n WITH n.id as id, COUNT(n) AS count, min(n.id) AS min_id, max(n.id) AS max_id, avg(n.id) AS avg_id RETURN id, count, min_id, max_id, avg_id'}

In [150]:
def match_with_where_or_numerical_literal():
    """Find at most n nodes of specified label where a numerical property is greater or another is less than specified values."""
    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        prop_2 = params[3]
        val_2 = params[4]


        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_1, prop_2]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find eight instances of {label_1} where either {prop_1} exceeds {val_1} or {prop_2} is less than {val_2}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE n.{prop_1} > {val_1} OR n.{prop_2} < {val_2} RETURN n LIMIT 8"
                   }
        return message

    return build_nodes_property_pairs_sampler(dparsed["integer_parsed"], # dparsed["integer_parsed"]+dparsed["float_parsed"] when available
                                              dparsed["integer_parsed"], # dparsed["integer_parsed"]+dparsed["float_parsed"] when available
                                              prompter,
                                              same_node=True,
                                              allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = match_with_where_or_numerical_literal()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 576 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find eight instances of Article where either id exceeds 1 or id is less than 1!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nArticle {id: INTEGER}',
 'Cypher': 'MATCH (n:Article) WHERE n.id > 1 OR n.id < 1 RETURN n LIMIT 8'}
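
Several of the displayed samples pair a property with itself (for example, `id exceeds 1 or id is less than 1`), presumably because both properties are drawn from the same pool when `same_node=True`. If such degenerate samples are not wanted in the training set, one option is to filter the parameter tuples before they reach the prompter. The sketch below assumes the tuple layout `(label_1, prop_1, val_1, prop_2, val_2)` that the prompters in this section index into; the function `has_distinct_props` and the sample tuples are hypothetical.

```python
def has_distinct_props(params):
    """Return True when the two sampled properties differ (hypothetical filter)."""
    # params[1] is prop_1 and params[3] is prop_2, as in the prompters above
    return params[1] != params[3]


# Hypothetical parameter tuples in the layout used by the prompters
candidates = [
    ("Article", "id", 1, "id", 1),            # same property twice
    ("Article", "name", "MexPub", "id", 1),   # two different properties
]

kept = [params for params in candidates if has_distinct_props(params)]
print(kept)
```

Whether to drop these samples is a design choice: keeping them adds volume, while removing them avoids questions whose two conditions refer to the same property.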

### Two Node Labels, Properties

---



#### Relationships to Nodes

In [151]:
def find_nodes_connected_to_two_nodes():
    """Find the nodes connected to two given nodes."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        label_2 = params[1]

        subschema = build_minimal_subschema(jschema, [[label_1, ], [label_2, ]], [], False, False, False)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find nodes that share a relationship with both {label_1} and {label_2}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"""MATCH (c:{label_1})<-[r1]-(n)-[r2]->(d:{label_2}) RETURN labels(n)"""
                   }
        return message

    return build_nodes_pairs(nodes,
                             prompter,
                             allow_repeats = ALLOW_REPEATS
                             )

# Build the set
sampler = find_nodes_connected_to_two_nodes()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 16 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find nodes that share a relationship with both Article and Article!',
 'Schema': 'Graph schema: Relevant node labels and their properties  are:\nArticle\nArticle',
 'Cypher': 'MATCH (c:Article)<-[r1]-(n)-[r2]->(d:Article) RETURN labels(n)'}

In [152]:
def nodes_connected_to_two_nodes_both():
    """Find nodes on paths between two given nodes."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        label_2 = params[1]

        subschema = build_minimal_subschema(jschema, [[label_1, ], [label_2, ]], [], False, False, False)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Identify nodes that are connected to both {label_1} and {label_2}, directly or indirectly!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (a:{label_1})-[*]-(n)-[*]-(b:{label_2}) RETURN labels(n)"
        }
        return message

    return build_nodes_pairs(nodes,
                             prompter,
                             allow_repeats = ALLOW_REPEATS
                             )

# Build the set
sampler = nodes_connected_to_two_nodes_both()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 16 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Identify nodes that are connected to both Article and Article, directly or indirectly!',
 'Schema': 'Graph schema: Relevant node labels and their properties  are:\nArticle\nArticle',
 'Cypher': 'MATCH (a:Article)-[*]-(n)-[*]-(b:Article) RETURN labels(n)'}

In [153]:
def find_common_rels():
    """Find nodes that share common relationships with two given nodes."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        label_2 = params[1]

        subschema = build_minimal_subschema(jschema, [[label_1, ], [label_2, ]], [], False, False, False)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Are there any nodes that share a common relationship type with both {label_1} and {label_2}?""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (a:{label_1})-[r]->(n), (d:{label_2})-[s]->(m) WHERE TYPE(r) = TYPE(s) RETURN labels(n), labels(m)"
                   }
        return message

    return build_nodes_pairs(nodes,
                             prompter,
                             allow_repeats = ALLOW_REPEATS
                             )

# Build the set
sampler = find_common_rels()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 16 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Are there any nodes that share a common relationship type with both Article and Article?',
 'Schema': 'Graph schema: Relevant node labels and their properties  are:\nArticle\nArticle',
 'Cypher': 'MATCH (a:Article)-[r]->(n), (d:Article)-[s]->(m) WHERE TYPE(r) = TYPE(s) RETURN labels(n), labels(m)'}

In [154]:
def rel_and_common_prop():
    """Find nodes connected to a given node instance that share at least one property value with a second node instance."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        label_2 = params[3]
        prop_2 = params[4]
        val_2 = params[5]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2, prop_2]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Are there any nodes that are connected with {label_1} where {prop_1} is {val_1} and share a common property with {label_2}, for which {prop_2} equals {val_2}?""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"""MATCH (a:{label_1}{{{prop_1}:'{val_1}'}})-[r]->(n), (d:{label_2}{{{prop_2}:'{val_2}'}}) WHERE ANY(key in keys(n) WHERE n[key] = d[key]) RETURN n"""}
        return message

    return build_nodes_property_pairs_sampler(dparsed["dtypes_parsed"],
                                              dparsed["dtypes_parsed"],
                                              prompter,
                                              same_node=False,
                                              allow_repeats=ALLOW_REPEATS)


# Build the set
sampler = rel_and_common_prop()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 57600 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Are there any nodes that are connected with Article where id is 1 and share a common property with Article, for which id equals 1?',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nArticle {id: INTEGER}',
 'Cypher': "MATCH (a:Article{id:'1'})-[r]->(n), (d:Article{id:'1'}) WHERE ANY(key in keys(n) WHERE n[key] = d[key]) RETURN n"}

#### Unions of Sets

In [155]:
def match_nodes_with_union_all():
    """Build a union of two sets (without filtering duplicates) extracted from two distinct node labels and their properties."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        label_2 = params[3]
        prop_2 = params[4]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2, prop_2]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Return the {prop_1} for {label_1} combined with the {prop_2} for {label_2}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) RETURN n.{prop_1} AS Records UNION ALL MATCH (m:{label_2}) RETURN m.{prop_2} AS Records"
                   }
        return message

    return build_nodes_property_pairs_sampler(dparsed["dtypes_parsed"],
                                              dparsed["dtypes_parsed"],
                                              prompter,
                                              same_node=False,
                                              allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = match_nodes_with_union_all()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 57600 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Return the id for Article combined with the id for Article!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nArticle {id: INTEGER}',
 'Cypher': 'MATCH (n:Article) RETURN n.id AS Records UNION ALL MATCH (m:Article) RETURN m.id AS Records'}

In [156]:
def match_nodes_with_union():
    """Build a union of two sets (with filtering duplicates) extracted from two distinct node labels and their properties."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        label_2 = params[3]
        prop_2 = params[4]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2, prop_2]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Return the {prop_1} for {label_1} combined with the {prop_2} for {label_2}, filter the duplicates if any!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) RETURN n.{prop_1} AS Records UNION MATCH (m:{label_2}) RETURN m.{prop_2} AS Records"
                   }
        return message

    return build_nodes_property_pairs_sampler(dparsed["dtypes_parsed"],
                                              dparsed["dtypes_parsed"],
                                       prompter,
                                       same_node = False,
                                       allow_repeats = ALLOW_REPEATS)

# Build the set
sampler = match_nodes_with_union()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 57600 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Return the id for Article combined with the id for Article, filter the duplicates if any!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nArticle {id: INTEGER}',
 'Cypher': 'MATCH (n:Article) RETURN n.id AS Records UNION MATCH (m:Article) RETURN m.id AS Records'}
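
The two samplers above differ only in `UNION ALL` versus `UNION`. As a rough illustration of that difference, here is a plain-Python analogy; the lists are made-up stand-ins for the rows returned by each `MATCH ... RETURN` branch, and nothing is run against Neo4j.

```python
# Plain-Python analogy: each list stands in for the rows of one query branch.
first_branch = [1, 2, 3]
second_branch = [3, 4, 1]

def union_all(left, right):
    """UNION ALL: concatenate both result sets and keep every row."""
    return list(left) + list(right)

def union(left, right):
    """UNION: concatenate, then drop duplicate rows.

    Insertion order is kept here only for readability; Cypher does not
    promise a row order unless the query asks for one.
    """
    return list(dict.fromkeys(union_all(left, right)))

print(union_all(first_branch, second_branch))  # [1, 2, 3, 3, 4, 1]
print(union(first_branch, second_branch))      # [1, 2, 3, 4]
```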

#### Retrieve Properties

In [157]:
def match_two_nodes_two_props():
    """Retrieve several samples of property values that correspond to two node labels (same or distinct)."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        label_2 = params[3]
        prop_2 = params[4]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2, prop_2]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Fetch eight samples of the {prop_1} of the {label_1} and the {prop_2} for {label_2}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) MATCH (m:{label_2}) RETURN n.{prop_1}, m.{prop_2} LIMIT 8"
                   }
        return message

    return build_nodes_property_pairs_sampler(dparsed["dtypes_parsed"],
                                              dparsed["dtypes_parsed"],
                                       prompter,
                                       same_node = False,
                                       allow_repeats = ALLOW_REPEATS)

sampler = match_two_nodes_two_props()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 57600 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Fetch eight samples of the id of the Article and the id for Article!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nArticle {id: INTEGER}',
 'Cypher': 'MATCH (n:Article) MATCH (m:Article) RETURN n.id, m.id LIMIT 8'}

In [158]:
def where_not_simple_path_and_property():
    """Retrieve a property of nodes that are not related to another node with a given property value."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        label_2 = params[3]
        prop_2 = params[4]
        val_2 = params[5]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2, prop_2]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Look for the {prop_1} of the {label_1} that is not related to the {label_2} with the {prop_2} {val_2}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}), (m:{label_2} {{{prop_2}: '{val_2}'}}) WHERE NOT (n)-->(m) RETURN n.{prop_1}"
        }
        return message

    return build_nodes_property_pairs_sampler(dparsed["dtypes_parsed"],
                                              dparsed["dtypes_parsed"],
                                       prompter,
                                       same_node = False,
                                       allow_repeats = ALLOW_REPEATS)

# Build the set
sampler = where_not_simple_path_and_property()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 57600 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Look for the id of the Article that is not related to the Article with the id 1!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nArticle {id: INTEGER}',
 'Cypher': "MATCH (n:Article), (m:Article {id: '1'}) WHERE NOT (n)-->(m) RETURN n.id"}
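
The sample above renders the value as `'1'`, while the schema lists `id` as INTEGER. In Cypher a string literal does not compare equal to an integer, so a quoted value would not match an integer property. Below is a minimal sketch of rendering values by type. It assumes the sampled values reach the prompter as Python objects with their original types, which the notebook does not show; `cypher_literal` is a hypothetical helper and not part of `utils`.

```python
def cypher_literal(value):
    """Render a Python value as a Cypher literal (hypothetical helper)."""
    if value is None:
        return "null"
    # Check bool before int: in Python, bool is a subclass of int
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    # Anything else becomes a single-quoted string with backslashes and quotes escaped
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{text}'"

label, prop = "Article", "id"
print(f"MATCH (n:{label} {{{prop}: {cypher_literal(1)}}}) RETURN n")
print(f"MATCH (n:{label} {{{prop}: {cypher_literal('1')}}}) RETURN n")
```

The first line prints an unquoted integer literal and the second a quoted string, so the literal follows the Python type rather than always being quoted.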

#### Paths

In [159]:
def path_existence():
    """Determine if there is a path connecting two given nodes."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        label_2 = params[3]
        prop_2 = params[4]
        val_2 = params[5]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2, prop_2]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Is there a path connecting {label_1} where {prop_1} is {val_1} and {label_2}, for which {prop_2} is {val_2}?""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"""MATCH (a:{label_1}{{{prop_1}:'{val_1}'}}), (b:{label_2}{{{prop_2}:'{val_2}'}}) RETURN EXISTS((a)-[*]-(b)) AS pathExists"""}
        return message

    return build_nodes_property_pairs_sampler(dparsed["dtypes_parsed"],
                                              dparsed["dtypes_parsed"],
                                       prompter,
                                       same_node = False,
                                       allow_repeats = ALLOW_REPEATS)

# Build the set
sampler = path_existence()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]


There are 57600 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Is there a path connecting Article where id is 1 and Article, for which id is 1?',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nArticle {id: INTEGER}',
 'Cypher': "MATCH (a:Article{id:'1'}), (b:Article{id:'1'}) RETURN EXISTS((a)-[*]-(b)) AS pathExists"}

In [160]:
def number_of_paths():
    """Find the number of paths with given end nodes."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        label_2 = params[3]
        prop_2 = params[4]
        val_2 = params[5]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2, prop_2]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""How many paths are there between {label_1} where {prop_1} is {val_1} and {label_2}, for which {prop_2} equals {val_2}?""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"""MATCH p=(a:{label_1}{{{prop_1}:'{val_1}'}})-[*]->(d:{label_2}{{{prop_2}:'{val_2}'}}) RETURN count(p)"""}
        return message

    return build_nodes_property_pairs_sampler(dparsed["dtypes_parsed"],
                                              dparsed["dtypes_parsed"],
                                       prompter,
                                       same_node = False,
                                       allow_repeats = ALLOW_REPEATS)

# Build the set
sampler = number_of_paths()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 57600 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'How many paths are there between Article where id is 1 and Article, for which id equals 1?',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nArticle {id: INTEGER}',
 'Cypher': "MATCH p=(a:Article{id:'1'})-[*]->(d:Article{id:'1'}) RETURN count(p)"}

In [161]:
def end_of_the_path():
    """Find the end node of a given path."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        label_2 = params[3]
        prop_2 = params[4]
        val_2 = params[5]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2, prop_2]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find nodes that are at the end of a path starting at {label_1} where {prop_1} is {val_1} and traversing through {label_2} with {prop_2} {val_2}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"""MATCH (a:{label_1}{{{prop_1}:'{val_1}'}})-[*]->(d:{label_2}{{{prop_2}:'{val_2}'}})-[*]->(n) RETURN n"""}
        return message

    return build_nodes_property_pairs_sampler(dparsed["dtypes_parsed"],
                                              dparsed["dtypes_parsed"],
                                       prompter,
                                       same_node = False,
                                       allow_repeats = ALLOW_REPEATS)

# Build the set
sampler = end_of_the_path()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 57600 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find nodes that are at the end of a path starting at Article where id is 1 and traversing through Article with id 1!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nArticle {id: INTEGER}',
 'Cypher': "MATCH (a:Article{id:'1'})-[*]->(d:Article{id:'1'})-[*]->(n) RETURN n"}

In [162]:
def shortest_path_between_two_nodes():
    """Find the shortest path between two nodes."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        label_2 = params[3]
        prop_2 = params[4]
        val_2 = params[5]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2, prop_2]], [], True, False, True)[:-29] # remove relationship comment
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find the shortest path between {label_1} where {prop_1} is {val_1} and {label_2}, with {prop_2} equal to {val_2}, including the nodes on the path!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"""MATCH p=shortestPath((a:{label_1}{{{prop_1}:'{val_1}'}})-[*]-(e:{label_2}{{{prop_2}:'{val_2}'}})) RETURN nodes(p)"""
                   }
        return message

    return build_nodes_property_pairs_sampler(dparsed["dtypes_parsed"],
                                              dparsed["dtypes_parsed"],
                                       prompter,
                                       same_node = False,
                                       allow_repeats = ALLOW_REPEATS)

# Build the set
sampler = shortest_path_between_two_nodes()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 57600 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find the shortest path between Article where id is 1 and Article, with id equal to 1, including the nodes on the path!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nArticle {id: INTEGER}',
 'Cypher': "MATCH p=shortestPath((a:Article{id:'1'})-[*]-(e:Article{id:'1'})) RETURN nodes(p)"}
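
Triple-quoted f-strings easily pick up newlines and indentation that then end up in the training samples. Below is a minimal sketch of a normalizer that could be applied to each sample's `Cypher` field before saving. `normalize_cypher` is a hypothetical helper, and this naive version would also collapse repeated spaces inside quoted string values.

```python
def normalize_cypher(query):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(query.split())

# A query string with a trailing newline and indentation, as a triple-quoted
# template closed on the following line would produce
raw = """MATCH p=shortestPath((a:Article{id:'1'})-[*]-(e:Article{id:'1'})) RETURN nodes(p)
                    """
print(repr(normalize_cypher(raw)))
```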

### Relationships

Use only one set of instances when repeats are not allowed.
build_minimal_subschema(jschema, [[label_1, ], [label_2, ]], [], False, False, False)[:-29] # remove relationship comment

#### Nodes and Relationships

In [163]:
def find_not_connected_nodes():
    """Identify nodes that do not have certain relationships."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        rel_1 = params[3]

        subschema = build_minimal_subschema(jschema, [[label_1, ]], [[rel_1, ]], False, False, False)
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Fetch five {label_1} that are not linked through {rel_1} relationships!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (p:{label_1}) WHERE NOT EXISTS ((p)-[:{rel_1}]->()) RETURN p LIMIT 5"
        }
        return message

    return build_relationships_samples(drels["all_rels"],
                            prompter,
                            allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = find_not_connected_nodes()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 1200 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Fetch five Article that are not linked through uses relationships!',
 'Schema': "Graph schema: Relevant node labels and their properties  are:\nArticle\n\nRelevant relationships are:\n{'start': Article, 'type': uses, 'end': Dataset }",
 'Cypher': 'MATCH (p:Article) WHERE NOT EXISTS ((p)-[:uses]->()) RETURN p LIMIT 5'}

In [164]:
def find_connected_nodes():
    """Find nodes that are connected via certain relationships."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        rel_1 = params[3]

        subschema = build_minimal_subschema(jschema, [[label_1, ]], [[rel_1, ]], False, False, False)
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find four {label_1} that have {rel_1} links!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (p:{label_1}) WHERE EXISTS ((p)-[:{rel_1}]->()) RETURN p LIMIT 4"
        }
        return message

    return build_relationships_samples(drels["all_rels"],
                            prompter,
                            allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = find_connected_nodes()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 1200 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find four Article that have uses links!',
 'Schema': "Graph schema: Relevant node labels and their properties  are:\nArticle\n\nRelevant relationships are:\n{'start': Article, 'type': uses, 'end': Dataset }",
 'Cypher': 'MATCH (p:Article) WHERE EXISTS ((p)-[:uses]->()) RETURN p LIMIT 4'}
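
Since every template ends up in the fine-tuning data, it can help to confirm that the generated queries at least compile. Below is a minimal sketch that prefixes each query with `EXPLAIN`, which asks the database to plan the query without executing it. It assumes the query function raises an exception on invalid Cypher. `find_invalid_samples` and `fake_run_query` are hypothetical names; the fake only lets the sketch run without a database.

```python
def find_invalid_samples(samples, run_query):
    """Return (index, error message) pairs for samples whose Cypher is rejected."""
    failures = []
    for index, sample in enumerate(samples):
        try:
            # EXPLAIN plans the query but does not run it
            run_query("EXPLAIN " + sample["Cypher"])
        except Exception as error:  # broad on purpose: driver exception types vary
            failures.append((index, str(error)))
    return failures

# Stand-in for a real query function so the sketch is runnable offline
def fake_run_query(query):
    if "RETURN" not in query:
        raise ValueError("query has no RETURN clause")
    return []

demo = [{"Cypher": "MATCH (p:Article) RETURN p LIMIT 4"},
        {"Cypher": "MATCH (p:Article)"}]
print(find_invalid_samples(demo, fake_run_query))
```

Against the live graph the call would look like `find_invalid_samples(sampler, graph.query)`, under the same assumption about how `graph.query` reports errors.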

In [165]:
def find_node_relation_count():
    """Count the number of specified relationships a node has."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        rel_1 = params[3]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1]], [[rel_1, ]], True, False, False)
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Fetch ten {label_1} and return the {prop_1} and the number of nodes connected to them via {rel_1} given in descending order of the node counts.""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WITH n.{prop_1} AS {prop_1}, size([(n)-[:{rel_1}]->() | 1]) AS count ORDER BY count DESC LIMIT 10 RETURN {prop_1}, count"
                   }
        return message

    return build_relationships_samples(drels["all_rels"],
                            prompter,
                            allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = find_node_relation_count()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 1200 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Fetch ten Article and return the id and the number of nodes connected to them via uses given in descending order of the node counts.',
 'Schema': "Graph schema: Relevant node labels and their properties  are:\nArticle {id}\n\nRelevant relationships are:\n{'start': Article, 'type': uses, 'end': Dataset }",
 'Cypher': 'MATCH (n:Article) WITH n.id AS id, size([(n)-[:uses]->() | 1]) AS count ORDER BY count DESC LIMIT 10 RETURN id, count'}
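
A name used in the final `RETURN` has to be a variable bound earlier or an alias introduced with `AS`. Returning `article_id` after aliasing `n.id AS id`, for instance, refers to an undefined variable. Below is a heuristic sketch of a check for this kind of slip. `undefined_return_names` is a hypothetical helper based on regular expressions, not a Cypher parser, and it only inspects bare identifiers in the last `RETURN` clause.

```python
import re

def undefined_return_names(cypher):
    """List bare names in the last RETURN clause that were never bound or aliased."""
    head, separator, tail = cypher.rpartition(" RETURN ")
    if not separator:
        return []
    # Names introduced by "AS alias" and node variables such as (n:Label) or (n)
    defined = set(re.findall(r"\bAS\s+(\w+)", head))
    defined |= set(re.findall(r"\((\w+)[:)]", head))
    returned = [item.strip() for item in tail.split(",")]
    # Only plain identifiers are checked; expressions like n.id are skipped
    return [name for name in returned
            if re.fullmatch(r"\w+", name) and name not in defined]

mismatched = ("MATCH (n:Article) WITH n.id AS id, size([(n)-[:uses]->() | 1]) AS count "
              "ORDER BY count DESC LIMIT 10 RETURN article_id, count")
consistent = mismatched.replace("RETURN article_id", "RETURN id")
print(undefined_return_names(mismatched))  # ['article_id']
print(undefined_return_names(consistent))  # []
```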

#### Two Labels, One Property

In [166]:
def nodes_connected_to_first_node_and_not_connected_to_second_node():
    """Determine which nodes are connected to node A but not connected to node B via a given relationship."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        rel_1 = params[3]
        label_2 = params[4]

        subschema = build_minimal_subschema(jschema, [[label_1,], [label_2, ]], [[rel_1, ]], False, False, False)
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Which nodes are connected to {label_1}, but not to {label_2} via {rel_1}?""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"""MATCH (c:{label_1})-[r]-(n) WHERE NOT (n)-[:{rel_1}]-(:{label_2}) RETURN labels(n)"""
                   }
        return message

    return build_relationships_samples(drels["all_rels"],
                            prompter,
                            allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = nodes_connected_to_first_node_and_not_connected_to_second_node()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 1200 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Which nodes are connected to Article, but not to Dataset via uses?',
 'Schema': "Graph schema: Relevant node labels and their properties  are:\nArticle\nDataset\n\nRelevant relationships are:\n{'start': Article, 'type': uses, 'end': Dataset }",
 'Cypher': 'MATCH (c:Article)-[r]-(n) WHERE NOT (n)-[:uses]-(:Dataset) RETURN labels(n)'}

In [167]:
def find_node_property_with_count_limit():
    """Retrieve property values for several nodes A and the number of relationship counts to nodes B."""
    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        rel_1 = params[3]
        label_2 = params[4]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2, ]], [[rel_1, ]], True, False, True)
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Search for the {prop_1} values from 20 {label_1} that are linked to {label_2} via {rel_1} and return {prop_1} along with the respective {label_2} counts!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) -[:{rel_1}]->(m:{label_2}) WITH DISTINCT n, m RETURN n.{prop_1} AS {prop_1}, count(m) AS count LIMIT 20"
        }
        return message

    return build_relationships_samples(drels["all_rels"],
                            prompter,
                            allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = find_node_property_with_count_limit()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 1200 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Search for the id values from 20 Article that are linked to Dataset via uses and return id along with the respective Dataset counts!',
 'Schema': "Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nDataset {}\n\nRelevant relationships are:\n{'start': Article, 'type': uses, 'end': Dataset }",
 'Cypher': 'MATCH (n:Article) -[:uses]->(m:Dataset) WITH DISTINCT n, m RETURN n.id AS id, count(m) AS count LIMIT 20'}

In [168]:
def find_node_property_by_condition_on_node():
    """Retrieve property values for nodes A that have more than five relationships to nodes B."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        rel_1 = params[3]
        label_2 = params[4]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2, ]], [[rel_1, ]], True, False, True)
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find the {prop_1} of {label_1} that each have more than five {rel_1} relationships with {label_2}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) -[r:{rel_1}]->(m:{label_2}) WITH DISTINCT n, m, r WITH n.{prop_1} AS {prop_1}, count(r) AS count WHERE count > 5 RETURN {prop_1}"
        }
        return message

    return build_relationships_samples(drels["all_rels"],
                            prompter,
                            allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = find_node_property_by_condition_on_node()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 1200 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find the id of Article that each have more than five uses relationships with Dataset!',
 'Schema': "Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nDataset {}\n\nRelevant relationships are:\n{'start': Article, 'type': uses, 'end': Dataset }",
 'Cypher': 'MATCH (n:Article) -[r:uses]->(m:Dataset) WITH DISTINCT n, m, r WITH n.id AS id, count(r) AS count WHERE count > 5 RETURN id'}
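
In the query above, `count(r)` is grouped by the non-aggregated expression `n.id`, and the `WHERE` that follows `WITH` filters the groups. As a rough plain-Python illustration of that group-then-filter step, the `edges` list below is made-up data standing in for matched relationships, not values from the graph.

```python
from collections import Counter

# Hypothetical (article id, dataset id) pairs, one per matched relationship
edges = ([(1, d) for d in range(7)]
         + [(2, d) for d in range(3)]
         + [(3, d) for d in range(6)])

# WITH n.id AS id, count(r) AS count: one counter entry per article id
counts = Counter(article_id for article_id, _ in edges)

# WHERE count > 5: keep only ids with strictly more than five relationships
result = sorted(article_id for article_id, count in counts.items() if count > 5)
print(result)  # [1, 3]
```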

In [169]:
def where_and_exists_simple_path():
    """Fetch a property of nodes connected to a given node via a specified relationship."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        rel_1 = params[3]
        label_2 = params[4]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2, ]], [[rel_1, ]], True, False, True)
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Fetch {prop_1} of the {label_1} that are connected to {label_2} via {rel_1}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) WHERE EXISTS {{ MATCH (n)-[:{rel_1}]->(:{label_2}) }} RETURN n.{prop_1} AS {prop_1}"}
        return message

    return build_relationships_samples(drels["string_string_rels"],
                            prompter,
                            allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = where_and_exists_simple_path()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 768 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Fetch name of the Article that are connected to Dataset via uses!',
 'Schema': "Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {name: STRING}\nDataset {}\n\nRelevant relationships are:\n{'start': Article, 'type': uses, 'end': Dataset }",
 'Cypher': 'MATCH (n:Article) WHERE EXISTS { MATCH (n)-[:uses]->(:Dataset) } RETURN n.name AS name'}

In [170]:
def find_node_relation_ordered_count_desc():
    """Retrieve, in descending order, the count of nodes linked to a given node."""
    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        rel_1 = params[3]
        label_2 = params[4]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2, ]], [[rel_1, ]], True, False, True)
        message = {"Prompt": f"{system_message}",
                   "Question": f"""For each {label_1} find its {prop_1} and the count of {label_2} linked via {rel_1}, and retrieve seven results in desc order of the counts!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) -[:{rel_1}]->(m:{label_2}) WITH DISTINCT n, m RETURN n.{prop_1} AS {prop_1}, count(m) AS count ORDER BY count DESC LIMIT 7"
        }
        return message

    return build_relationships_samples(drels["all_rels"],
                            prompter,
                            allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = find_node_relation_ordered_count_desc()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 1200 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'For each Article find its id and the count of Dataset linked via uses, and retrieve seven results in desc order of the counts!',
 'Schema': "Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nDataset {}\n\nRelevant relationships are:\n{'start': Article, 'type': uses, 'end': Dataset }",
 'Cypher': 'MATCH (n:Article) -[:uses]->(m:Dataset) WITH DISTINCT n, m RETURN n.id AS id, count(m) AS count ORDER BY count DESC LIMIT 7'}

In [171]:
def find_node_relation_ordered_count():
    """Retrieve, in ascending order, the counts of nodes linked to a given node."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        rel_1 = params[3]
        label_2 = params[4]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2, ]], [[rel_1, ]], True, False, True)
        message = {"Prompt": f"{system_message}",
                   "Question": f"""For each {label_1}, find the number of {label_2} linked via {rel_1} and retrieve the {prop_1} of the {label_1} and the {label_2} counts in ascending order!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) -[:{rel_1}]->(m:{label_2}) WITH DISTINCT n, m RETURN n.{prop_1} AS {prop_1}, count(m) AS {label_2.lower()}_count ORDER BY {label_2.lower()}_count"
        }
        return message

    return build_relationships_samples(drels["all_rels"],
                            prompter,
                            allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = find_node_relation_ordered_count()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 1200 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'For each Article, find the number of Dataset linked via uses and retrieve the id of the Article and the Dataset counts in ascending order!',
 'Schema': "Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nDataset {}\n\nRelevant relationships are:\n{'start': Article, 'type': uses, 'end': Dataset }",
 'Cypher': 'MATCH (n:Article) -[:uses]->(m:Dataset) WITH DISTINCT n, m RETURN n.id AS id, count(m) AS dataset_count ORDER BY dataset_count'}

In [172]:
def find_node_relation_ordered_count_filter():
    """Retrieve the counts, larger than a given value, of nodes linked to a given node."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        rel_1 = params[3]
        label_2 = params[4]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2, ]], [[rel_1, ]], True, False, True)
        message = {"Prompt": f"{system_message}",
                   "Question": f"""For each {label_1} and its {prop_1}, count the {label_2} connected through {rel_1} and fetch the {prop_1} and the counts that are greater than 5, starting with the largest {prop_1} and count!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) -[:{rel_1}]->(m:{label_2}) WITH DISTINCT n, m WITH n.{prop_1} AS {prop_1}, count(m) AS count WHERE count > 5 RETURN {prop_1}, count ORDER BY {prop_1} DESC, count DESC"
        }
        return message

    return build_relationships_samples(drels["all_rels"],
                            prompter,
                            allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = find_node_relation_ordered_count_filter()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 1200 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'For each Article and its id, count the Dataset connected through uses and fetch the id and the counts that are greater than 5, starting with the largest id and count!',
 'Schema': "Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nDataset {}\n\nRelevant relationships are:\n{'start': Article, 'type': uses, 'end': Dataset }",
 'Cypher': 'MATCH (n:Article) -[:uses]->(m:Dataset) WITH DISTINCT n, m WITH n.id AS id, count(m) AS count WHERE count > 5 RETURN id, count ORDER BY id DESC, count DESC'}

In [173]:
def find_common_prop():
    """Find related nodes with common properties."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        rel_1 = params[3]
        label_2 = params[4]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2, ]], [[rel_1, ]], True, False, True)
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Which nodes have a common property with {label_1} where {prop_1} is {val_1} and are {rel_1} linked to a {label_2}?""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (a:{label_1} {{{prop_1}:'{val_1}'}})-[r:{rel_1}]->(b:{label_2}) WHERE ANY(key IN keys(a) WHERE a[key] = b[key]) RETURN b"
                   }
        return message

    return build_relationships_samples(drels["all_rels"],
                            prompter,
                            allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = find_common_prop()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 1200 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Which nodes have a common property with Article where id is 1 and are uses linked to a Dataset?',
 'Schema': "Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nDataset {}\n\nRelevant relationships are:\n{'start': Article, 'type': uses, 'end': Dataset }",
 'Cypher': "MATCH (a:Article {id:'1'})-[r:uses]->(b:Dataset) WHERE ANY(key IN keys(a) WHERE a[key] = b[key]) RETURN b"}
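
The templates in this section quote `val_1` in some queries and leave it bare in others, so whether the generated literal matches the property's datatype depends on the template. A minimal sketch of one way to render the value according to its Python type is given below. `cypher_literal` and `match_by_property` are hypothetical helpers introduced for illustration only and are not part of the `utils` modules.

```python
def cypher_literal(value):
    """Render a Python value as a Cypher literal (hypothetical helper, not part of utils)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Treat everything else as a string: escape backslashes and single quotes
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def match_by_property(label, prop, value):
    """Build a MATCH clause whose property value is rendered by type."""
    return f"MATCH (a:{label} {{{prop}: {cypher_literal(value)}}}) RETURN a"


# Integer values stay bare, string values are quoted
print(match_by_property("Article", "id", 1))
print(match_by_property("Article", "name", "MexPub"))
```

Inside a prompter, `cypher_literal(val_1)` could stand in for both `'{val_1}'` and `{val_1}`, assuming the extracted instance values keep their original Python types.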

In [174]:
def find_end_nodes_path():
    """Find nodes that are at the end of a path with specified starting node."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        rel_1 = params[3]
        label_2 = params[4]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2,]], [[rel_1, ]], True, False, True)
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Which nodes are at the end of a path starting from {label_1}, with {prop_1} equal to {val_1}, passing through {label_2} via {rel_1}?""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"""MATCH (a:{label_1}{{{prop_1}:'{val_1}'}})-[:{rel_1}]->(c:{label_2})-[r]->(n) RETURN n"""
                   }
        return message

    return build_relationships_samples(drels["all_rels"],
                            prompter,
                            allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = find_end_nodes_path()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 1200 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Which nodes are at the end of a path starting from Article, with id equal to 1, passing through Dataset via uses?',
 'Schema': "Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nDataset {}\n\nRelevant relationships are:\n{'start': Article, 'type': uses, 'end': Dataset }",
 'Cypher': "MATCH (a:Article{id:'1'})-[:uses]->(c:Dataset)-[r]->(n) RETURN n"}

In [175]:
# Extract properties of end nodes of a path
def find_end_node_properties():
    """Find properties of nodes connected to specified nodes."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        rel_1 = params[3]
        label_2 = params[4]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2,]], [[rel_1, ]], True, False, True)
        # NOTE: val_1 is inserted without quotes in the Cypher template, which suits numeric
        # properties; a string value would need to be quoted to yield valid Cypher.
        message = {"Prompt": f"{system_message}",
                   "Question": f"""What are the properties of {label_2} that is {rel_1} connected to {label_1} that has {prop_1} equal to {val_1}?""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) -[:{rel_1}]->(m:{label_2}) WHERE n.{prop_1} = {val_1} RETURN properties(m) AS props"
                   }
        return message

    return build_relationships_samples(drels["all_rels"],
                            prompter,
                            allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = find_end_node_properties()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 1200 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'What are the properties of Dataset that is uses connected to Article that has id equal to 1?',
 'Schema': "Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nDataset {}\n\nRelevant relationships are:\n{'start': Article, 'type': uses, 'end': Dataset }",
 'Cypher': 'MATCH (n:Article) -[:uses]->(m:Dataset) WHERE n.id = 1 RETURN properties(m) AS props'}

#### Two Labels, Two Properties

In [176]:
def find_node_relation_ordered_count_collect():
    """Find properties of nodes that are related under given conditions."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        rel_1 = params[3]
        label_2 = params[4]
        prop_2 = params[5]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2, prop_2 ]], [[rel_1, ]], True, False, True)
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Fetch the {prop_1} of the {label_1} that are linked via {rel_1} to more than three {label_2}, and list {label_2} {prop_2} and {label_2} counts, ordering by {label_2} count and limiting to the top six results!""",
                   "Schema": f"Graph schema: {subschema}",
                   # The collected list gets its own alias so it cannot clash with {prop_1} when both properties share a name
                   "Cypher": f"MATCH (n:{label_1}) -[:{rel_1}]->(m:{label_2}) WITH DISTINCT n, m WITH n.{prop_1} AS {prop_1}, count(m) AS count, COLLECT(m.{prop_2}) AS {prop_2}_list WHERE count > 3 RETURN {prop_1}, count, {prop_2}_list ORDER BY count LIMIT 6"
        }
        return message

    return build_relationships_samples(drels["all_rels"],
                            prompter,
                            allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = find_node_relation_ordered_count_collect()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 1200 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Fetch the id of the Article that are linked via uses to more than three Dataset, and list Dataset id and Dataset counts, ordering by Dataset count and limiting to the top six results!',
 'Schema': "Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nDataset {id: INTEGER}\n\nRelevant relationships are:\n{'start': Article, 'type': uses, 'end': Dataset }",
 'Cypher': 'MATCH (n:Article) -[:uses]->(m:Dataset) WITH DISTINCT n, m WITH n.id AS id, count(m) AS count, COLLECT(m.id) AS id_list WHERE count > 3 RETURN id, count, id_list ORDER BY count LIMIT 6'}
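
Cypher rejects a `WITH` or `RETURN` clause in which two columns share a name, which can happen in generated queries whenever two template parameters resolve to the same property name. The sketch below shows one possible check for repeated column names in the final `RETURN` clause. `duplicate_return_columns` is a hypothetical helper, and it splits on commas naively, so it is only suitable for simple return items such as the ones produced in this section.

```python
import re
from collections import Counter


def duplicate_return_columns(cypher):
    """Return the column names that occur more than once in the RETURN clause."""
    match = re.search(
        r"\bRETURN\b(.*?)(?:\bORDER BY\b|\bLIMIT\b|$)",
        cypher,
        flags=re.IGNORECASE | re.DOTALL,
    )
    if match is None:
        return []
    columns = []
    for item in match.group(1).split(","):
        # Use the alias when one is given, otherwise the expression itself
        parts = re.split(r"\s+AS\s+", item.strip(), flags=re.IGNORECASE)
        columns.append(parts[-1].strip())
    counts = Counter(columns)
    return [name for name, n in counts.items() if n > 1]


clashing = "MATCH (n:Article)-[:uses]->(m:Dataset) RETURN n.id AS id, count(m) AS count, m.id AS id LIMIT 6"
distinct = "MATCH (n:Article)-[:uses]->(m:Dataset) RETURN n.id AS id, count(m) AS count LIMIT 6"
print(duplicate_return_columns(clashing))
print(duplicate_return_columns(distinct))
```

Such a check could be run over `sampler` before calling `collect_samples`, so that queries with repeated column names are dropped or regenerated.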

In [177]:
# Average of a property, filtered by date and relation
def find_node_aggregation_date_rels():
    """Evaluate the average values of a property for all nodes of the same label that are connected to a specified node."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        rel_1 = params[3]
        label_2 = params[4]
        prop_2 = params[5]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2, prop_2 ]], [[rel_1, ]], True, False, True)
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Calculate the average {prop_2} for {label_2} that are linked via {rel_1} to {label_1} whose {prop_1} date is before December 31, 2020!""",
                   "Schema": f"Graph schema: {subschema}",
                   # prop_1 belongs to label_1, so the date filter applies to n rather than m
                   "Cypher": f"MATCH (n:{label_1}) -[:{rel_1}]->(m:{label_2}) WHERE n.{prop_1} < date('2020-12-31') RETURN avg(m.{prop_2}) AS avg_{prop_2}"
        }
        return message

    return build_relationships_samples(drels["all_rels"],  # best with drels["date_integer_rels"] or drels["date_float_rels"] if available
                            prompter,
                            allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = find_node_aggregation_date_rels()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 1200 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Calculate the average id for Dataset that are linked via uses to Article whose id date is before December 31, 2020!',
 'Schema': "Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nDataset {id: INTEGER}\n\nRelevant relationships are:\n{'start': Article, 'type': uses, 'end': Dataset }",
 'Cypher': "MATCH (n:Article) -[:uses]->(m:Dataset) WHERE n.id < date('2020-12-31') RETURN avg(m.id) AS avg_id"}

In [178]:
def where_and_simple_path():
    """Find a property of a node connected via a given relationship to a node for which a certain property takes a specified value."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        rel_1 = params[3]
        label_2 = params[4]
        prop_2 = params[5]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2, prop_2 ]], [[rel_1, ]], True, False, True)
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Retrieve the {prop_2} for {label_2} that is linked through a {rel_1} relationship with the {label_1} where {prop_1} is {val_1}!""",
                   "Schema": f"Graph schema: {subschema}",
                   # Restrict the end node to label_2, as the question asks
                   "Cypher": f"MATCH (n:{label_1}) -[{rel_1[:2].lower()}:{rel_1}]->(m:{label_2}) WHERE n.{prop_1}='{val_1}' RETURN m.{prop_2}"
                   }
        return message

    return build_relationships_samples(drels["all_rels"],
                            prompter,
                            allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = where_and_simple_path()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 1200 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Retrieve the id for Dataset that is linked through a uses relationship with the Article where id is 1!',
 'Schema': "Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {id: INTEGER}\nDataset {id: INTEGER}\n\nRelevant relationships are:\n{'start': Article, 'type': uses, 'end': Dataset }",
 'Cypher': "MATCH (n:Article) -[us:uses]->(m:Dataset) WHERE n.id='1' RETURN m.id"}

In [179]:
def relation_with_and_where():
    """Retrieve related node properties that satisfy given conditions."""

    def prompter(*params, **kwargs):

        label_1 = params[0]
        prop_1 = params[1]
        val_1 = params[2]
        rel_1 = params[3]
        label_2 = params[4]
        prop_2 = params[5]
        val_2 = params[6]

        subschema = build_minimal_subschema(jschema, [[label_1, prop_1], [label_2, prop_2 ]], [[rel_1, ]], True, False, True)
        message = {"Prompt": f"{system_message}",
                   "Question": f"""Find {label_2} that has a {prop_2} which begins with {label_2[0].lower()}, and is linked to {label_1} via {rel_1} relationship, where {label_1} has {prop_1} {val_1}!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1} {{{prop_1}: '{val_1}'}}) -[:{rel_1}]- (m:{label_2}) WHERE m.{prop_2} STARTS WITH '{label_2[0].lower()}' RETURN m"
        }
        return message

    return build_relationships_samples(drels["string_string_rels"],
                            prompter,
                            allow_repeats=ALLOW_REPEATS)

# Build the set
sampler = relation_with_and_where()
# Print information about the sampler set
print(f"There are {len(sampler)} queries in this subset.")
# Add to trainer dataset
trainer += collect_samples(sampler, M)
# Display an example for inspection
sampler[0]

There are 768 queries in this subset.


{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Find Dataset that has a name which begins with d, and is linked to Article via uses relationship, where Article has name MexPub: Deep Transfer Learning for Metadata Extraction from German Publications!',
 'Schema': "Graph schema: Relevant node labels and their properties (with datatypes) are:\nArticle {name: STRING}\nDataset {name: STRING}\n\nRelevant relationships are:\n{'start': Article, 'type': uses, 'end': Dataset }",
 'Cypher': "MATCH (n:Article {name: 'MexPub: Deep Transfer Learning for Metadata Extraction from German Publications'}) -[:uses]- (m:Dataset) WHERE m.name STARTS WITH 'd' RETURN m"}

### Relationships with Properties

#### Nodes and Relationships (with properties)

#### Two Labels, One Property, Relationship (with property)

#### Two Labels, Two Properties, Relationship (with property)

## Data Saving

In [183]:
import json

# Display the number of samples created and save the data to a file
print(f"There are {len(trainer)} samples in the fine-tuning dataset.")
write_json(trainer, data_path+trainer_with_repeats_file)

There are 25256 samples in the fine-tuning dataset.
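
Before using the saved file for fine-tuning, it can help to reload it and run a quick sanity check. The sketch below assumes the file holds a JSON list of dictionaries with the four keys shown in the examples above; the exact layout produced by `write_json` is not shown in this notebook, so this is an assumption. `summarize_dataset` is a hypothetical helper, and the demo writes a small temporary file rather than touching the real dataset.

```python
import json
import os
import tempfile
from collections import Counter

REQUIRED_KEYS = {"Prompt", "Question", "Schema", "Cypher"}


def summarize_dataset(path):
    """Count total, incomplete and repeated samples in a saved dataset file."""
    with open(path, "r", encoding="utf-8") as f:
        samples = json.load(f)  # assumed layout: a list of dicts
    incomplete = [s for s in samples if not REQUIRED_KEYS.issubset(s)]
    pairs = Counter((s.get("Question"), s.get("Cypher")) for s in samples)
    duplicates = sum(n - 1 for n in pairs.values() if n > 1)
    return {"total": len(samples), "incomplete": len(incomplete), "duplicates": duplicates}


# Demo data: one complete sample, one exact repeat, one sample missing "Schema"
complete = {"Prompt": "p", "Question": "q1", "Schema": "s", "Cypher": "MATCH (n) RETURN n"}
demo = [complete, dict(complete), {"Prompt": "p", "Question": "q2", "Cypher": "MATCH (m) RETURN m"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_trainer.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(demo, f)
    summary = summarize_dataset(demo_path)

print(summary)
```

Pointing `summarize_dataset` at `data_path + trainer_with_repeats_file` would report how many repeated question and query pairs the saved dataset contains, provided the file follows the assumed layout.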
