# Gremlin 200 - Aerospike Graph Best Practices

## Use primary keys when possible

Let's look at a first example. Suppose we inserted the following data:
```
g.addV("Person").property("name", "Lyndon").iterate()
g.addV("Person").property("name", "Grant").iterate()
g.addV("Person").property("name", "Simon").iterate()
```
Now imagine our graph is extremely large and contains every person in the world, and we want to find "Lyndon". We would have to run `g.V().has("name", "Lyndon")`, which scans every vertex in the graph. This is very slow and does not scale.

Instead we can insert like this:
```
g.addV("Person").property(T.id, "Lyndon").iterate()
g.addV("Person").property(T.id, "Grant").iterate()
g.addV("Person").property(T.id, "Simon").iterate()
```

By doing this, we can retrieve "Lyndon" with `g.V("Lyndon").next()`, which is extremely fast because it is a direct primary index lookup.

The takeaway here: when modelling customer data, figure out what is unique for a given vertex type and use that as the `T.id`, so the customer can look vertices up by that value.
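As a minimal sketch of that takeaway, assume a hypothetical `User` vertex type whose unique attribute is an email address. The label, the email value, and the `name` property are illustrative only and do not come from a real dataset:

```
// Hypothetical: the email address is unique per User, so it becomes the T.id
g.addV("User").property(T.id, "lyndon@example.com").property("name", "Lyndon").iterate()

// Direct primary index lookup by the unique value, then read a property
g.V("lyndon@example.com").values("name").next()
```

Non-unique attributes such as `name` stay as regular properties on the vertex.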

## Use secondary indexes when necessary

If there is a use case where you need to look up a vertex by a property, a secondary index can be used.

For example:

```
g.addV("Person").property(T.id, "Lyndon").property("age", 30).iterate()
g.addV("Person").property(T.id, "Grant").property("age", 40).iterate()
g.addV("Person").property(T.id, "Simon").property("age", 14).iterate()
```
With a secondary index on `age`, queries like the following are very fast:
```
g.V().has("age", 30).toList() // All people who are 30
g.V().has("age", P.gt(30)).toList() // All people who are older than 30
```
Range predicates such as `P.between` and the other comparison predicates can be used in the same way; see the TinkerPop documentation for the full list.
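As a short sketch of those other predicates, the queries below use standard TinkerPop `P` predicates against the `age` data above. Whether each one is served by the secondary index is an assumption here; check the indexing documentation for your version.

```
g.V().has("age", P.between(30, 40)).toList() // 30 <= age < 40 (upper bound is exclusive)
g.V().has("age", P.inside(14, 40)).toList() // 14 < age < 40 (both bounds exclusive)
g.V().has("age", P.lte(30)).toList() // age <= 30
```

Note that `P.between` and `P.inside` differ only in whether the lower bound is included.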

To create a secondary index on a vertex property, add the following to either the properties file or the environment variables of the system:
`aerospike.graph.index.vertex.properties=property_key1,property_key2`

This will create a secondary index on `property_key1` and `property_key2`.

## Avoid using edge properties

Internally, we have strategies that leverage our 'edge cache', which is materialized on the vertex records. This allows us to move from vertex to vertex without reading the edge in between.

A query like `g.V("Lyndon").outE("knows").has("since", "1998").inV()` will be significantly slower than `g.V("Lyndon").out("knows")` because we have to read the edge in between, when we could otherwise skip it.

When possible, instead of using an edge property, use a vertex property. This is not always possible but when it is, it is considerably faster.

## Use sampling when traversing across supernodes

Customers will have supernodes. Imagine Twitter: we want to find the people that follow the people that we follow.

If we follow Justin Bieber, for example, this will explode into a huge number of people.

We have a custom strategy that allows you to sample some data from this very efficiently.

An example of this is below:
```
g.V("Lyndon").out("follows").in("follows").sample(100)
```
This will return 100 random people that follow the people that Lyndon follows.

This strategy pulls only the ids via a secondary index (sindex), assuming there is a supernode, and then randomly samples those ids, which is much quicker than materializing all of the data.

## Notes about combining filters

Our filtering is semi-intelligent, but not all-knowing. Let's look at this query:
`g.V().hasLabel("Person").has("name", "Lyndon").has("age", 30).has("location", "Port Alberni")`

If you have a secondary index on `name`, `age`, `location`, and the label, Aerospike Graph will, under the hood, run a single secondary index query against the index that has the highest cardinality. Note that this does not always yield the best result. Say `name` has the highest cardinality, but instead of `Lyndon` the query asks for `Mohammad`. Google estimates there are 150 million people named `Mohammad`, while only about 10-20 thousand people live in `Port Alberni`, so for that query the `location` index would have been far more selective.

Another important note: we always assume property filters are more selective than label filters. A secondary index created on labels will therefore only be used if there are no secondary indexes on the properties used in the query where the label filter is applied.

Quick note:
`g.V().hasLabel("Person").has("name", "Lyndon").has("age", 30).has("location", "Port Alberni").out().has("name", "Grant")`

The `has("name", "Grant")` step will not, under any circumstance, use a sindex. Sindexes can ONLY be used at the start of a traversal.

`g.V().hasLabel("Person").has("name", "Lyndon").has("age", 30).has("location", "Port Alberni").out().where(eq(__.V().has("name", "Grant").in()))`

Something like the above will run a sindex query on the injected anonymous traversal (this query is just an example, and may not work).

Refer to https://aerospike.com/docs/graph/indexing for more information.

## Prefer vertex properties over additional edges

A common pattern when working on a graph is to take something that is a common property of many vertices and model it as an edge to a shared vertex.

For example, suppose an application receives an address as input and wants to figure out which people might be at that address in order to serve an advertisement.

Since addresses are generally not unique and may belong to many people, a common way of dealing with this is to create an address vertex and connect it to each person that has that address.

The problem is that as your graph grows, you will likely end up with address supernodes (a large apartment building or a gym, for example). Traversing off these commonly visited places will explode at runtime and make things very slow.

Instead of storing an edge to an address vertex, store the address as a property on the person vertex. You can then use a secondary index to find all vertices that have that property value and traverse from there.

The query difference would be from:
`g.V("123 Fake Street").in()`
to:
`g.V().has("address", "123 Fake Street")`

This is not always possible. In some datasets each user has exactly one address; in others, many different addresses may be associated with a user. When there is only one address per user, this approach works and is a much better way of doing things. When a user has many addresses, it is not possible.
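A minimal sketch of the property-based model, assuming one address per person and assuming `address` has been added to `aerospike.graph.index.vertex.properties` so that a secondary index exists on it:

```
// Store the address on the person vertex instead of linking to an address vertex
g.addV("Person").property(T.id, "Lyndon").property("address", "123 Fake Street").iterate()
g.addV("Person").property(T.id, "Grant").property("address", "123 Fake Street").iterate()

// The lookup starts the traversal, so it can use the secondary index on address
g.V().has("address", "123 Fake Street").toList()
```

Because the `has("address", ...)` filter sits at the start of the traversal, it is in the position where a secondary index can be applied.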

## Summary alternative to Count

Counting vertices was mentioned as a common mistake. This is because it enters OLAP territory. We do, however, have a summary API as an alternative.

Running `g.call("summary")` returns the metadata counters of the graph very quickly. Unfortunately, if you need data beyond what it returns, you are out of luck.
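Side by side, the two approaches look like this. This is a sketch only: `.next()` is the usual TinkerPop terminator, and the exact shape of the summary result is not shown here because it is not described above.

```
// OLAP-style: counts by visiting every vertex in the graph
g.V().count().next()

// Reads the graph's metadata counters instead
g.call("summary").next()
```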