Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: Apache-2.0

# Learning Gremlin - Loops and Repeat Queries

This notebook is the second in a series of notebooks that walk through how to write queries using Gremlin.  In this notebook, we will examine the basics of how to perform looping and repeating queries in Gremlin.  


This notebook assumes that you have already completed the previous notebook, "01-Basic-Read-Queries". We continue from where that notebook ended and assume that the data has already been loaded into the cluster.

### Setting up the visualizations

Run the next two cells to configure various display options for our notebook, which we will use later on to display our results in a pleasing visual way.  

In [None]:
%%graph_notebook_vis_options
{
  "groups": {    
    "person": {
      "color": "#9ac7bf"
    },
    "review": {
      "color": "#f8cecc"
    },
    "city": {
      "color": "#d5e8d4"
    },
    "state": {
      "color": "#dae8fc"
    },
    "review_rating": {
      "color": "#e1d5e7"
    },
    "restaurant": {
      "color": "#ffe6cc"
    },
    "cuisine": {
      "color": "#fff2cc"
    }
  }
}

In [None]:
node_labels = '{"person":"first_name","city":"name","state":"name","restaurant":"name","cuisine":"name"}'

We'll be using the `node_labels` variable to provide a nicer visualisation when running the queries in this notebook. To use it, we pass it along with the query itself, as follows:

`%%gremlin -d $node_labels`

The `-d` option tells the notebook which property should be displayed for each specified node label.

### Looking at our graph data

As we examined the data model in the previous notebook, we will not go through it again here. The schema is shown below for reference.

![dining-by-friends.png](attachment:dining-by-friends.png)


## Looping

When working with any property graph, some of the most powerful queries you can write are ones where the number of connections between a source and a target entity is not known in advance. These queries are so common that property graph query languages, such as Gremlin, provide first-class support for them. In Gremlin, they are written using a mechanism known as looping and repeating: the loop specifies the traversal pattern (a sequence of nodes and relationships) to follow, and the repeat controls how many times that pattern is applied, or the condition that must be met for it to stop.

In Gremlin, a basic loop query to find all nodes exactly 3 hops away from each starting node looks like:

```
   g.V().repeat( out() ).times(3)
```

Examining this query, we see that there are two distinct parts to a loop in Gremlin. The first is the `repeat()` step, which wraps the traversal pattern that we'd like to repeat. The second part defines the *limit* to be applied to the repeat (we don't want to keep traversing indefinitely!). The *limit* can be expressed using the following steps:

* `times()` - used to specify the exact number of times a `repeat()` pattern is to be executed
* `until()` - used to specify a traversal pattern that, once satisfied, will stop the `repeat()` for a traverser
* `loops()` - used to extract the number of times a traverser has gone through the current loop, typically inside `until()`
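
To make the effect of `times()` concrete, here is a minimal Python sketch of what `repeat(out()).times(n)` does conceptually. It is an illustration only, not how a Gremlin engine is implemented; the `EDGES` graph and the `repeat_out_times` helper are invented for this example and are not part of the notebook's dataset.

```python
# Toy directed graph (hypothetical data, not the notebook's dataset).
EDGES = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}

def repeat_out_times(starts, n):
    """Mimic repeat(out()).times(n): keep one traverser per route taken."""
    traversers = list(starts)
    for _ in range(n):
        # Every traverser is replaced by one traverser per outgoing edge.
        traversers = [nxt for node in traversers for nxt in EDGES[node]]
    return traversers

print(repeat_out_times(["A"], 1))  # nodes exactly 1 hop from A
print(repeat_out_times(["A"], 3))  # nodes exactly 3 hops from A
```

Only the traversers that complete all `n` iterations are returned, which is why the 3-hop call does not include the nodes reached after 1 or 2 hops. The same node can appear more than once when it is reachable along several routes.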

### Diving deeper into `repeat()` ###

The `repeat()` step also supports two 'modulators', `until()` and `emit()`, both of which can be used either before or after the `repeat()` step. Using the `until()` step before the `repeat()` is similar to the common [`while...do`](https://www.w3schools.com/java/java_while_loop.asp) programming paradigm, whereas using the `until()` _after_ the `repeat()` is similar to the [`do...while`](https://www.w3schools.com/cpp/cpp_do_while_loop.asp) concept.

The `emit()` modulator returns the results of a traversal as it is executed, and is useful in conjunction with loop-limiting steps such as `times()`. An example of this is the query below, where we want to limit the `repeat()` to two hops but also want to return paths that include only one hop.
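
The difference between the two placements of `until()` can be sketched in plain Python. This is a simplified model under stated assumptions: the `NEXT` chain is invented data in which each node has at most one outgoing neighbour, and the two helper functions are hypothetical names, not Gremlin APIs.

```python
# Hypothetical chain A -> B -> C (C has no outgoing edge).
NEXT = {"A": "B", "B": "C", "C": None}

def until_then_repeat(start, done):
    """Like until(cond).repeat(out()): test first, then step (while/do)."""
    node = start
    while node is not None and not done(node):
        node = NEXT[node]
    return node  # None means the traverser ran out of edges and died

def repeat_then_until(start, done):
    """Like repeat(out()).until(cond): step first, then test (do/while)."""
    node = start
    while True:
        node = NEXT[node]
        if node is None or done(node):
            return node

matches = lambda n: n in {"A", "C"}
print(until_then_repeat("A", matches))  # the start already matches, so no hop is taken
print(repeat_then_until("A", matches))  # at least one hop is always taken
```

With the condition checked first, a starting node that already satisfies it is returned without traversing. With the condition checked afterwards, the body always runs at least once, so the traverser moves on to the next matching node.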

In [None]:
%%gremlin
g.V()
.repeat(
    out()
)
.emit()
.times(2)
.path()

We can also place the `emit()` modulator _before_ the `repeat()` step. This causes the traverser entering the `repeat()` (that is, the result of the previous step in the query) to be emitted as well, ahead of the results that follow.
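
As a rough illustration of the two `emit()` placements, the following Python sketch models `repeat(out()).emit().times(n)` and `emit().repeat(out()).times(n)` on a toy graph. The graph, the names in it, and the `repeat_out` helper are all invented for this example, and the output order is not meant to mirror the order a real Gremlin engine would produce.

```python
# Hypothetical toy graph; the names are made up for illustration.
EDGES = {"Dave": ["Jo", "Kim"], "Jo": ["Lee"], "Kim": [], "Lee": []}

def repeat_out(start, times, emit_first):
    """Collect the paths emitted by repeat(out()) with emit() before or after."""
    paths = [[start]]
    # emit() before repeat(): the incoming traverser is emitted too.
    emitted = [[start]] if emit_first else []
    for _ in range(times):
        paths = [p + [n] for p in paths for n in EDGES[p[-1]]]
        emitted.extend(paths)  # results of every iteration are returned
    return emitted

print(repeat_out("Dave", 2, emit_first=False))
print(repeat_out("Dave", 2, emit_first=True))
```

The only difference between the two calls is the extra single-element path for the starting node when `emit()` comes first.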

Run the following example, and notice that `Dave` is returned ahead of the results from the `repeat()`.

In [None]:
%%gremlin
g.V()
    .hasLabel("person")
    .has("first_name","Dave")
    .emit()
    .repeat(
        out().hasLabel("person")
    )
    .times(3)
    .limit(10)
    .path()

Compare this with the output of the following query, which doesn't use `emit()` prior to the `repeat()`. You'll notice that `Dave` is no longer returned as a path on its own.

In [None]:
%%gremlin
g.V()
    .hasLabel("person")
    .has("first_name","Dave")
    .repeat(
        out().hasLabel("person")
    )
    .times(3)
    .limit(10)
    .path()

Now that we have a basic understanding of Gremlin's loop syntax, let's look at how this is applied to answer some common graph query patterns.

### Static Number of Hops

The simplest looping pattern you can do in Gremlin is to specify a fixed number of hops for your pattern.  This is accomplished using the `times()` step.  Let's execute the query below to traverse outwards by 2 hops, and return the path.

In [None]:
%%gremlin
g.V()
.repeat(
    out()
)
.times(2)
.path()
.by(elementMap())
.limit(10)

### Explaining the previous Gremlin query

Using the `repeat()` step, we told Gremlin to traverse all **outgoing** edges **2 times**. The graphic below demonstrates how Gremlin creates additional traversers when there are multiple outgoing edges to follow.

![looping-example-v2.gif](attachment:looping-example-v2.gif)

### Variable Number of Hops

While the example above works on a static number of hops, sometimes we do not know the number of connections we need to traverse to answer a question. In this case, we can use the `until()` step to specify an additional pattern that will stop a traverser once the condition is met.

**Note**. The performance of graph queries depends on how much of the graph needs to be traversed. It's important to have an optimal graph data model so that fan-out is kept to a minimum and large portions of your graph aren't traversed unnecessarily.

Execute the query below to see the paths that connect people via any number of `friends` edges.

In [None]:
%%gremlin -d $node_labels
g.V()
.hasLabel("person")
.repeat(
    bothE()
        .hasLabel("friends")
    .otherV()
        .hasLabel("person")
)
.until(
    not(out().hasLabel("person"))
)
.path()
.by(elementMap())
.limit(10)

We can also do the same using `outE()` and `inV()` steps, ensuring we're only traversing in one direction.

In [None]:
%%gremlin -d $node_labels
g.V()
.hasLabel("person")
.repeat(
    outE()
        .hasLabel("friends")
    .inV()
        .hasLabel("person")
)
.until(
    not(out().hasLabel("person"))
)
.path()
.by(elementMap())
.limit(10)

Now execute the following query to limit the number of times we repeat our loop along the `friends` edges:

In [None]:
%%gremlin -d $node_labels
g.V()
.hasLabel("person")
.repeat(
    bothE()
        .hasLabel("friends")
    .otherV()
        .hasLabel("person")
)
.until(
    loops().is(2)
)
.path()
.by(elementMap())
.limit(10)

There is a 'gotcha' when combining filtering using `has()` and looping using `until()` or `loops()`. 

As we saw in the `01-Basic-Read-Queries` notebook, `has()` provides the functionality to filter on the existence of a specific property, or match based on a property value. We can use this when looping through our graph to stop when we match the specified criteria. For example:

In [None]:
%%gremlin -d $node_labels
g.V()
.hasLabel("person")
.repeat(
    bothE()
        .hasLabel("friends")
    .otherV()
        .hasLabel("person")
)
.until(
    or(
        has("first_name","Dave"),
        loops().is(2)
    )
)
.path()
.by(elementMap())
.limit(10)

What we're asking in the above query is:

*"traverse from every person node across the friends edge to another person node, and loop until the first_name property matches "Dave" or we've completed 2 iterations"*

However, it doesn't quite work the way we might expect. This is a common misconception. Although `has()` appears inside the `until()` step, it only decides when a traverser stops looping; it does not filter what leaves the loop. Traversers that stop because the `loops().is(2)` condition matched are returned as well, so the results include objects we're not interested in. To mitigate this, we need to apply the same `has()` filter to the output of the `until()`, as follows:

In [None]:
%%gremlin -d $node_labels
g.V()
.hasLabel("person")
.repeat(
    bothE()
        .hasLabel("friends")
    .otherV()
        .hasLabel("person")
)
.until(
    or(
        has("first_name","Dave"),
        loops().is(2)
    )
)
.has("first_name","Dave")
.path()
.by(elementMap())
.limit(9)
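
The 'gotcha' above can be reproduced with a small Python model of `repeat(...).until(...)`. This is a sketch under stated assumptions: the `FRIENDS` graph is invented (it is not the notebook's dataset), edges are treated as undirected to mimic `bothE()`, and `walk` is a hypothetical helper.

```python
# Hypothetical undirected friendship graph.
FRIENDS = {
    "Ann": ["Bob", "Cat"],
    "Bob": ["Ann", "Dave"],
    "Cat": ["Ann", "Eve"],
    "Dave": ["Bob"],
    "Eve": ["Cat"],
}

def walk(start, target="Dave", max_loops=2):
    """Step along friends until the target is reached OR max_loops is hit."""
    finished, active, loops = [], [[start]], 0
    while active:
        loops += 1
        stepped = [p + [n] for p in active for n in FRIENDS[p[-1]]]
        active = []
        for path in stepped:
            # Either condition releases the traverser from the loop.
            if path[-1] == target or loops == max_loops:
                finished.append(path)
            else:
                active.append(path)
    return finished

all_paths = walk("Ann")
only_dave = [p for p in all_paths if p[-1] == "Dave"]  # the extra has() filter
print(len(all_paths), only_dave)
```

Every traverser that reaches the loop limit leaves the loop regardless of where it ended up, which is why the final filter on the target name is still needed.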

### Cyclic Paths ###

When repeating a traversal in Gremlin using the `repeat()` step, it's common to come across a pattern whereby the path loops back on itself. This is called a cyclic path, and can lead to your Gremlin queries looping forever.

To stop this from occurring, it's good practice to include the `simplePath()` step. This removes paths with repeated objects, thus ensuring cyclic paths are not traversed.

**Important**. The `simplePath()` step filters out traversers whose path contains a repeated object, so place it directly after the step that moves the traverser, such as `in()` or `out()`.
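
The effect of `simplePath()` on a cyclic graph can be sketched in Python as follows. The graph and the helper names are invented for illustration, and the `max_loops` cap exists only so that the unfiltered version of this sketch terminates.

```python
# Hypothetical graph with a cycle A -> B -> C -> A and an exit C -> D.
EDGES = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []}

def is_simple(path):
    """A path is simple when no object appears in it more than once."""
    return len(set(path)) == len(path)

def walk_until_leaf(start, simple, max_loops=10):
    """Repeat out() until a node has no outgoing edges, optionally de-cycled."""
    finished, active = [], [[start]]
    for _ in range(max_loops):
        stepped = [p + [n] for p in active for n in EDGES[p[-1]]]
        if simple:
            stepped = [p for p in stepped if is_simple(p)]  # simplePath()
        finished += [p for p in stepped if not EDGES[p[-1]]]
        active = [p for p in stepped if EDGES[p[-1]]]
    return finished, active

print(walk_until_leaf("A", simple=True))
finished, still_looping = walk_until_leaf("A", simple=False)
print(len(finished), len(still_looping))
```

With the filter enabled, the traverser that returns to `A` is dropped and the walk ends with a single path. Without it, one traverser is still circling the cycle when the cap is reached.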

The following query provides an example of combining `simplePath()` with the `out()` step to filter on all connected `person` vertices.


In [None]:
%%gremlin -d $node_labels
g.V()
.hasLabel("person")
.repeat(
    out()
        .hasLabel("person")
    .simplePath()
)
.until(
    not(out().hasLabel("person"))
)
.path()
.limit(10)


### Visualising Results in a Neptune Notebook

A key part of using any graph database is being able to visualise the way the objects stored within it are connected to each other. We've already shown how to do this in previous examples, however it's important to understand which of the Gremlin steps support this type of functionality.

* `path()` - used to provide access to all nodes and edges within each unique path traversed
* `simplePath()` - used to ensure we don't repeat a traversal across an object we've already covered (this can lead to infinite looping if the model supports circular references)

If you're running this in a Neptune Notebook, we can use the `path()` step to tell the notebook to automatically present a visualisation of the output of a query. The following query returns 10 paths visualising the connections between `person`, `city`, `restaurant` and `cuisine`. Run the following query, and a graphical visualisation will automatically appear.

In [None]:
%%gremlin -d $node_labels
g.V()
    .hasLabel("person")     // start with all person nodes
    .out("lives")           // traverse the outbound "lives" edge to city
    .in("within")           // traverse the inbound edge from city to restaurant
    .where(__.inE("about")) // filter on restaurants where at least one review exists
    .out("serves")          // traverse the outbound edge from restaurant to cuisine
    .path()                 // return the path
    .limit(10)              // only return 10 results

Additionally, you can combine `path()` with the `by()` modulator along with the `valueMap()` or `values()` steps to return some or all of the non-internal property values stored against the objects within a path. The following query builds upon what we've already run, by returning all non-internal values as a map.

In [None]:
%%gremlin -d $node_labels
g.V()
    .hasLabel("person")     // start with all person nodes
    .out("lives")           // traverse the outbound "lives" edge to city
    .in("within")           // traverse the inbound edge from city to restaurant
    .where(__.inE("about")) // filter on restaurants where at least one review exists
    .out("serves")          // traverse the outbound edge from restaurant to cuisine
    .path()                 // return the path
    .by(
        valueMap()          // return all the non-internal properties of all vertices within the path
    )
    .limit(10)              // only return 10 results

The following query returns only a single property value by using the `values()` step instead of `valueMap()`.

In [None]:
%%gremlin -d $node_labels
g.V()
    .hasLabel("person")     // start with all person nodes
    .out("lives")           // traverse the outbound "lives" edge to city
    .in("within")           // traverse the inbound edge from city to restaurant
    .where(__.inE("about")) // filter on restaurants where at least one review exists
    .out("serves")          // traverse the outbound edge from restaurant to cuisine
    .path()                 // return the path
    .by(
        values('first_name','name')          // return only the first_name or name property value (whichever is applicable)
    )
    .limit(10)              // only return 10 results

**Important**. It's worth noting that in the above query, specifying only `first_name` or only `name` will result in no records being returned. This is because neither property exists on **all** vertices in our data model: the `person` vertex uses `first_name`, whereas all other vertices use `name` to store the name of the object. In this case, we can list both properties and Gremlin will use whichever one is applicable to each vertex.

We will dive more into using `valueMap()` and `values()` in the next section.

## Exercises

Now that we have gone through the basics of looping and repeating queries in Gremlin, it's time to put it into practice. Below are several exercises you can complete to verify your understanding of the material covered in this notebook.  As practice for what you have learned, please write the Gremlin queries specified below.

### Exercise 1: Find the friends of Dave's Friends using a loop

Using the data model above, write a query that will:

* Find a `person` node(s) with a `first_name` of "Dave"
* Find the friends of Dave (i.e. traverse the `friends` edge)
* Find the friends of that person (i.e. traverse the `friends` edge)
* Return the friends' `first_name`

The correct answer has three results: "Hank", "Denise", "Paras"

In [None]:
%%gremlin


### Exercise 2: Find all `person` nodes connected to Dave

Starting at a single node and trying to find all connected children (a.k.a. root to leaf) or trying to find the parent of any child node (a.k.a. leaf to root) are two very common hierarchical graph query patterns. Commonly, these queries support bill of materials, information organization, or compliance use cases.

In this exercise, we will be applying that same query pattern to find the hierarchy of people within our social network.  We'll accomplish this by writing a "root to leaf" type query where the root node is our `Dave` node in the social network.

Using the data model above, write a query that will:

* Find a `person` node(s) with a `first_name` of "Dave"
* Keep traversing the outgoing `friends` edge until there are no more outgoing `friends` edges
* Return all the paths

The correct answer has 5 results

In [None]:
%%gremlin


### Exercise 3: Find all the ways Dave and Denise are connected

A common extension to the path traversal query we wrote in the previous exercise is to return not just "if" someone is connected but "how" they are connected.

In this exercise, we will be making a slight modification to the previous query to return "how" Dave and Denise are connected, not just that they are.

Using the data model above, write a query that will:

* Find a `person` node(s) with a `first_name` of "Dave"
* Find the friends of Dave (i.e. traverse the `friends` edge)
* Keep traversing the `friends` edge until you find `Denise`
* Return the path

The correct answer has 3 results

In [None]:
%%gremlin


## Conclusion

In this notebook, we explored writing looping and repeat queries in Gremlin. These queries are a powerful and common way to explore connected data to answer questions, especially those where the exact number of connections is unknown.

In the next notebook we will take what we have learned in this notebook and extend it to demonstrate how to order, group, and aggregate values in queries.