unique properties / indexing #45

ehsanonline · 2021-03-17T01:17:30Z

Hello, it would be great if we could use unique keys in nodes and edges. in Neo4j it's called Node Key and it can be used like these:

CREATE CONSTRAINT constraint_name ON (n:Person) ASSERT (n.firstname) IS NODE KEY
CREATE CONSTRAINT constraint_name ON (n:Person) ASSERT (n.firstname, n.surname) IS NODE KEY
CREATE CONSTRAINT constraint_name IF NOT EXISTS ON (n:Person) ASSERT (n.firstname, n.surname) IS NODE KEY
CREATE CONSTRAINT constraint_with_provider ON (n:Label) ASSERT (n.prop1) IS NODE KEY OPTIONS {indexProvider: 'native-btree-1.0'}
DROP CONSTRAINT constraint_name
DROP CONSTRAINT missing_constraint_name IF EXISTS

I've tried to use Postgres queries but since the properties column type is not json/jsonb I got error:

postgres=# CREATE UNIQUE INDEX person_name_idx ON mygraph.person( (properties->>'name') ) ;
ERROR:  operator does not exist: ag_catalog.agtype ->> unknown
LINE 1: ...INDEX person_name_idx ON mygraph.person( (properties->>'name')...
                                                               ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

The text was updated successfully, but these errors were encountered:

pdpotter · 2021-04-02T13:16:42Z

It would indeed be very useful to enable the creation of constraints / indexes. This could help to accelerate the creation of edges between nodes that were added earlier, which slows down a lot when there are a lot of nodes with the same label:

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA {id: 1}), (r:LabelB {id: 2})
CREATE (d)-[:Relation {prop: 'value'}]->(r)
$$) as (a agtype)

The CREATE UNIQUE, CREATE CONSTRAINT, and CREATE INDEX clauses / commands are part of cypher, but not of openCypher, so I don't know what the best approach would be here.

@ehsanonline

See comment below for an update. Click here to view the original comment content.

You can create a unique index, which will prevent the creation of duplicate values, but will not speed up matching, by creating an immutable function that casts to json over text

CREATE OR REPLACE FUNCTION get_name(properties agtype) 
  RETURNS text
AS
$BODY$
    select $1::text::json->>'name';
$BODY$
LANGUAGE sql
IMMUTABLE;

and using that immutable function in the create index command

CREATE UNIQUE INDEX person_name_idx ON mygraph.person(get_name(properties)) ;

Link to the mentioned comment below

audiBookning · 2021-04-02T16:31:31Z

Here some links just for reference.

List of Cypher Improvement Proposals (CIP)

As for indexes and constraints in Opencypher, there are already some PR for their addition, but i think that they are "blocked" by another CIP

CIP:

PR:

pdpotter · 2021-05-20T14:04:20Z

I tried using id() instead of using properties for matching, but this was even slower (I noticed using an additional WHERE clause instead of defining properties in the MATCH clause is slower in another use case as well).

For this, I rewrote

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA {id: 1}), (r:LabelB {id: 2})
CREATE (d)-[:Relation {prop: 'value'}]->(r)
$$) as (a agtype);

to

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA), (r:LabelB)
WHERE id(d) = 11111 and id(r) = 11112
CREATE (d)-[:Relation {prop: 'value'}]->(r)
$$) as (a agtype);

where I obtained the ids (the ones above are made up) using

SELECT * FROM cyper('graph_name',$$
MATCH (n:LabelA)
return id(n), n.id
$$) as (id agtype, prop agtype);

pdpotter · 2021-08-20T13:58:00Z

There is a (1000 times faster) workaround for the creation of edges between vertices that were added earlier. The workaround consists of inserting the data directly in the underlying tables that are used by Apache AGE.

See comment below for an update. Click here to view the original comment content.

First, the underlying ids of the vertices are retrieved:

SELECT * FROM cyper('graph_name',$$
MATCH (n:LabelA)
return id(n), n.id
$$) as (id agtype, prop agtype);

A single edge is then created using the cypher function to make sure the underlying tables are created correctly (the id values are made up)

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA), (r:LabelB)
WHERE id(d) = 11111 and id(r) = 11112
CREATE (d)-[:RelationA {prop: 'value'}]->(r)
$$) as (a agtype);

Indices on start_id and end_id are then created in the underlying table for the edges of a certain type, to speed up the inserts in the _ag_label_edge table later on:

CREATE INDEX RelationA__start_id ON graph_name.RelationA(start_id);
CREATE INDEX RelationA__end_id ON graph_name.RelationA(end_id);

All other edges are created by direct insertion, first in the specific edge table (using executemany):

INSERT INTO graph_name.RelationA (start_id, end_id, properties)
VALUES ($1, $2, $3)

Where $1 are the domain_ids, $2 are the range_ids and $3 are the properties (as json dump).

Secondly, the edges are also directly inserted into the _ag_label_edge table (also using executemany). In my use case, each edge has an id property that can be used to select the correct edge when there are multiple edges between two vertices:

INSERT INTO graph_name._ag_label_edge (id, start_id, end_id, properties)
VALUES (
    (
        SELECT id from graph_name.RelationA
        WHERE start_id = $1
        AND end_id = $2
        AND properties::text::json->>'id' = $3
    ),
    $1,
    $2,
    $4
)

Where $1 are the domain_ids, $2 are the range_ids, $3 are the relation ids and $4 are the properties (as json dump).

Any thoughts on this workaround?
Are there any plans to add property indexes to Apache AGE later on?

pdpotter · 2021-08-30T10:16:07Z

When the edges are inserted in the specific edge table, they are automatically added to the _ag_label_edge table, reducing the workaround to the following and making the relation creation even faster:

First, the underlying ids of the vertices are retrieved:

SELECT * FROM cyper('graph_name',$$
MATCH (n:LabelA)
return id(n), n.id
$$) as (id agtype, prop agtype);

A single edge is then created using the cypher function to make sure the underlying tables are created correctly (the id values are made up)

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA), (r:LabelB)
WHERE id(d) = 11111 and id(r) = 11112
CREATE (d)-[:RelationA {prop: 'value'}]->(r)
$$) as (a agtype);

All other edges are created by direct insertion in the specific edge table (using executemany):

INSERT INTO graph_name.RelationA (start_id, end_id, properties)
VALUES ($1, $2, $3)

Where $1 are the domain_ids, $2 are the range_ids and $3 are the properties (as json dump).

JoshInnis · 2021-11-30T04:57:08Z

Hello,

A graph name is a schema and a label name is a table. Id and properties are columns in vertex table. Id, start_id, end_id, and properties are columns in the edge tables. Use the agtype_access_operator(properties, key) to get to get a property value.

Knowing all that you can use Postges' standard DDL language to implement constraints, indices and unique values.

ALTER TABLE graph_name.label_name 
ADD CONSTRAINT constraint_name
CHECK(agtype_access_operator(properties, "name_of_property") != '"Check against here"'::agtype);

pdpotter · 2021-11-30T09:19:13Z

This is great, thank you!

Since indices require an immutable function, an additional function will still need to be created for them. When I create a get_id function with

CREATE OR REPLACE FUNCTION get_id(properties agtype) 
  RETURNS agtype
AS
$BODY$
    select agtype_access_operator($1, '"id"');
$BODY$
LANGUAGE sql
IMMUTABLE;

and use it in an index with

CREATE UNIQUE INDEX person_id_idx ON mygraph.person(get_id(properties)) ;

the creation of vertices with the same id will be prevented

ERROR:  duplicate key value violates unique constraint "person_id_idx"
DETAIL:  Key (get_id(properties))=(2250) already exists.

but the index will still not be used when trying to match vertices with a specific id:

SELECT * FROM ag_catalog.cypher('mygraph', $$EXPLAIN ANALYZE MATCH (a:person {id:2250}) return a$$) as (a agtype);

                                                                 QUERY PLAN                                                                 
--------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on person a  (cost=0.00..2775.32 rows=16906 width=32) (actual time=12.688..85.521 rows=1 loops=1)
   Filter: _property_constraint_check(properties, agtype_build_map('id'::text, '2250'::agtype))
   Rows Removed by Filter: 50718
 Planning Time: 0.122 ms
 Execution Time: 85.550 ms
(5 rows)

Is there a way to use indices when matching?

JoshInnis · 2021-11-30T22:47:38Z

It might be safe to change agtype_access_operator to immutable.

Per Postgres' Documentation:

IMMUTABLE indicates that the function cannot modify the database and always returns the same result when given the same argument values; that is, it does not do database lookups or otherwise use information not directly present in its argument list. If this option is given, any call of the function with all-constant arguments can be immediately replaced with the function value.

STABLE indicates that the function cannot modify the database, and that within a single table scan it will consistently return the same result for the same argument values, but that its result could change across SQL statements. This is the appropriate selection for functions whose results depend on database lookups, parameter variables (such as the current time zone), etc. (It is inappropriate for AFTER triggers that wish to query rows modified by the current command.) Also note that the current_timestamp family of functions qualify as stable, since their values do not change within a transaction.

The access operator works for Agtype Lists and Maps, it does not perform any database lookup, it just extracts a value from the first passed in parameter.

JoshInnis · 2021-11-30T23:03:51Z

Indices cannot currently be used while matching. There will need to be some re factoring done to allow the planner to realize opportunities where the indices can be used.

JoshInnis · 2022-04-24T21:15:47Z

Hello, the commit 57e11a3 in master should have resolved this issue and will be available in the next release.

pdpotter · 2022-04-25T10:53:19Z

Thank you for the update! Is it possible to explain in more detail what issue was resolved?

I tried creating a unique index as described in #45 (comment), but a Sequential scan is still executed when I use a MATCH query to find a vertex based on the property on which a unique index has been created.

I pulled the latest changes, did a make and sudo make install, DROP EXTENSTION age CASCADE; and CREATE EXTENTION age;. Is this sufficient to use the changes from 57e11a3, or do I have to drop and recreate the database I'm working on?

JoshInnis · 2022-04-25T19:53:17Z

Hi @pdpotter, sorry for the confusion, AGE now supports constraints, the MATCH clause does not yet support using index scans. Constraints now work and the updating (SET, REMOVE, etc) clauses work with constraints and they no longer break indices. The patch 6279c10 supports GIN indices. So if you create an index on a label's properties and place the quals in the {} in the MATCH clause. Such as MATCH (:label {prop_name: "filter value"}) the gin index will be used. Index scans using filters in the WHERE clause in nearing completion and will be in the review process shortly.

pdpotter · 2022-04-26T08:48:07Z

See comment below for an update. Click here to view the original comment content.

Wow, this is fantastic. It is now possible to create relations quickly (~10 000/s on my local VM) using simple queries (with executemany) in the form of ``` SELECT * FROM cypher('graph_name', $$ MATCH (d:LabelA {id: 1}), (r:LabelB {id: 2}) CREATE (d)-[:Relation {prop: 'value'}]->(r) $$) as (a agtype) ```

After simply adding GIN indexes

CREATE INDEX d:LabelA__properties
ON graph_name.LabelA USING GIN (properties)

CREATE INDEX d:LabelB__properties
ON graph_name.LabelB USING GIN (properties)

For this specific use case, it would of course be more disk space efficient to only index the id property and not all of them, but having MATCH working so much faster is a huge leap forward. Great work!

pdpotter · 2022-05-09T09:39:41Z

Commit 379983b includes some improvements which are relevant here:

agtype_access_operator is now immutable. This means it can be used directly to create unique indices. It is no longer necessary to create a separate immutable function. E.g.,

CREATE UNIQUE INDEX person_id_idx ON mygraph.person(ag_catalog.agtype_access_operator(properties, '"id"'::agtype));

~~GIN indices are now used when filtering using WHERE clauses.~~
edit: The unique index created above is used when filtering using WHERE clauses.

Very nice work, thank you @JoshInnis!

pdpotter · 2022-05-09T19:23:01Z

It looks like I have been a bit too enthusiastic in my previous comments.

Property constraints

When using GIN indices on properties and creating relations using these indices, I didn't check if the relations were actually added. Unfortunately, they were not.

After adding a GIN index and adding enough vertices so the index is used, a match query doesn't return any results. E.g.,

SELECT * FROM cypher('test_graph', $$CREATE (p:person {id: 1}) return p$$) as (p agtype);
SELECT * FROM cypher('test_graph', $$CREATE (p:person {id: 2}) return p$$) as (p agtype);
...
SELECT * FROM cypher('test_graph', $$CREATE (p:person {id: 500}) return p$$) as (p agtype);
CREATE INDEX person__properties ON test_graph.person USING GIN (properties);

A match query using property constraints returns 0 results.

SELECT * FROM cypher('test_graph', $$MATCH (p:person {id: 1}) return p$$) as (p agtype);
 p 
---
(0 rows)

Query plan:

                                                                         QUERY PLAN                                                                         
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on person p  (cost=12.01..16.03 rows=1 width=32) (actual time=0.026..0.027 rows=0 loops=1)
   Recheck Cond: (properties @> agtype_build_map('id'::text, '1'::agtype))
   ->  Bitmap Index Scan on person__properties  (cost=0.00..12.01 rows=1 width=0) (actual time=0.024..0.024 rows=0 loops=1)
         Index Cond: (properties @> agtype_build_map('id'::text, '1'::agtype))
 Planning Time: 0.158 ms
 Execution Time: 0.075 ms
(6 rows)

Where clause

When using the where clause, it is not the GIN index that is being used, but the unique index that was added to prevent duplicate entries by executing

CREATE UNIQUE INDEX person__id ON test_graph.person (properties);

The WHERE clause does give a correct result:

SELECT * FROM cypher('test_graph', $$MATCH (p:person) WHERE p.id = 1) return p$$) as (p agtype);
                                      p                                      
-----------------------------------------------------------------------------
 {"id": 844424930131969, "label": "person", "properties": {"id": 1}}::vertex
(1 row)

Query plan:

      QUERY PLAN                                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using person__id on person p  (cost=0.27..8.30 rows=1 width=32) (actual time=0.063..0.065 rows=1 loops=1)
   Index Cond: (agtype_access_operator(VARIADIC ARRAY[properties, '"id"'::agtype]) = '1'::agtype)
 Planning Time: 0.190 ms
 Execution Time: 0.103 ms
(4 rows)

When using the WHERE clause for creating relations, the performance decreases when adding a lot of relations (when adding relations in batches of 5000, the first batch achieves ~5000 it/s, while the ninth batch achieves ~500 it/s) using something similar to

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA ), (r:LabelB)
WHERE d.id = 1 AND r.id = 2
CREATE (d)-[:Relation {prop: 'value'}]->(r)
$$) as (a agtype)

Test suite

When having a new look at the tests (https://github.com/apache/incubator-age/blob/master/regress/expected/index.out#L299-L336) to check if I was making mistakes, I found out some of the index related tests might have some issues:

Are the indices being used with a limited amount of vertices?
The query at https://github.com/apache/incubator-age/blob/master/regress/expected/index.out#L304 returns 0 results, which is expected, but it might be useful to check if results are returned for other queries (e.g., MATCH (c:City {country_code: "US"}). Queries https://github.com/apache/incubator-age/blob/master/regress/expected/index.out#L318 and https://github.com/apache/incubator-age/blob/master/regress/expected/index.out#L330 are similar cases.
There are errors on lines https://github.com/apache/incubator-age/blob/master/regress/expected/index.out#L314 and https://github.com/apache/incubator-age/blob/master/regress/expected/index.out#L335

Thank you so much for making Apache AGE better and better with each commit. I'm sorry if I caused any confusion with my previous comments.

JoshInnis · 2022-05-09T19:31:00Z

Hi @pdpotter, GIN Indices are for a subset of JsonB operators https://www.postgresql.org/docs/11/gin-builtin-opclasses.html These operators are not usabale in the cypher Where clause. Currently the only place they can be used is in the property constraint field in the match clause. The where clause is now compatible with the comparison operators that the cypher command currently has. These new operators need to be added to the cypher command's where clause

JoshInnis · 2022-05-09T19:44:57Z

For an article about this you can checkout this https://bitnine.net/blog-postgresql/postgresql-internals-jsonb-type-and-its-indexes/ Its a bit dated and about Agensgraph and not AGE, but if you replace insances of the '->>' with the '.' operator the information is still accurate

pdpotter · 2022-08-10T11:26:36Z

As #228 has been fixed, it is now possible to use GIN indices for adding relations.
After adding GIN indices by executing

CREATE INDEX d:LabelA__properties
ON graph_name.LabelA USING GIN (properties);
CREATE INDEX d:LabelB__properties
ON graph_name.LabelB USING GIN (properties);

Adding relations using a query like

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA {id: 1}), (r:LabelB {id: 2})
CREATE (d)-[:Relation {prop: 'value'}]->(r)
$$) as (a agtype);

is now over 30 times faster (~300/s on my local VM) than without the GIN indices.

I tried improving the performance by creating indices using

CREATE INDEX d:LabelA__properties
ON graph_name.LabelA USING GIN (properties jsonb_path_ops);

but jsonb_path_ops is not supported for agtype.

Another attempt using a specialized index created by

CREATE INDEX d:LabelA__properties_id
ON graph_name.LabelA USING GIN ((properties->'id'));

didn't seem to work because the index wasn't used when executing MATCH queries (this was already mentioned in #212 (comment)).

As using GIN indixes is ~30 times slower than the method described in #45 (comment), I think I'm going to keep using this other method for initial data import.

Dzordzu · 2022-10-06T17:02:01Z

For onlookers: within Rust driver I've implemented methods that provide unique indexes, and properties contraints. As of now it can be treated as a summary of this disscussion

jrgemignani · 2022-10-17T17:49:38Z

Just wondering if this issue is resolved? Or, is there more that needs to be done that this issue needs to stay open? It is a bit difficult to tell from the correspondence.

Dzordzu · 2022-10-18T05:05:28Z

Just wondering if this issue is resolved? Or, is there more that needs to be done that this issue needs to stay open? It is a bit difficult to tell from the correspondence.

I wouldn't say it's resolved. Fact - currently there are mechanisms that allow client to create indexes / constraints. On the other hand there is neither good documentation on this, nor agreement on the topic if constraints should also be included within cypher-like methods.

jrgemignani · 2022-10-18T16:19:19Z

I will try to see if I can get others engaged that can help.

ddrao · 2023-06-24T19:30:26Z

Any update on indexing side of things? Any pointers on enabling the cypher queries to use specific property indexes? Other than the GIN index, nothing seems to be working.

pdpotter · 2023-06-24T19:42:22Z

~~In #954 (comment), a way to create specific property indexes has been mentioned. I haven't been able to test it out, but the issue has been closed, so I'm hopeful it will work.~~ Edit: it doesn't seem to work, see #1009

In #1000, a patch that is still being worked on is mentioned that would allow GIN indices to be used in WHERE clauses as well.

vladiksun · 2023-06-26T08:13:42Z

@pdpotter I suppose the indexes mentioned in the #1000 to be used from the WHERE clauses should be of BTREE type instead of GIN because GIN index is good for pattern matches like in the MATCH clauses.

In the update_entity_tuple() function, when we call table_tuple_update() and assign the returned value to the result variable, the buffer variable receives the value of 0. Made a workaround so that the original value isn't lost.

JoshInnis added the enhancement New request label Jun 16, 2021

This was referenced Mar 10, 2022

Variable Length Edges feature does not scale well with the length of the path #195

Closed

How to create a unique constraint on a property, like in AgensGraph? #200

Closed

This was referenced Jun 7, 2022

Fix and extend index tests #227

Closed

Match queries with integer properties constraints return 0 results after adding gin index #228

Closed

pdpotter mentioned this issue Nov 21, 2022

[Bitnine Tech] - It is not applied property index according to position of condition #359

Open

humzakt mentioned this issue May 14, 2023

Creating Index on nodes property #920

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unique properties / indexing #45

unique properties / indexing #45

ehsanonline commented Mar 17, 2021

pdpotter commented Apr 2, 2021 •

edited

Loading

audiBookning commented Apr 2, 2021

pdpotter commented May 20, 2021

pdpotter commented Aug 20, 2021 •

edited

Loading

pdpotter commented Aug 30, 2021

JoshInnis commented Nov 30, 2021 •

edited

Loading

pdpotter commented Nov 30, 2021

JoshInnis commented Nov 30, 2021

JoshInnis commented Nov 30, 2021

JoshInnis commented Apr 24, 2022

pdpotter commented Apr 25, 2022

JoshInnis commented Apr 25, 2022

pdpotter commented Apr 26, 2022 •

edited

Loading

pdpotter commented May 9, 2022 •

edited

Loading

pdpotter commented May 9, 2022 •

edited

Loading

JoshInnis commented May 9, 2022

JoshInnis commented May 9, 2022

pdpotter commented Aug 10, 2022 •

edited

Loading

Dzordzu commented Oct 6, 2022 •

edited

Loading

jrgemignani commented Oct 17, 2022

Dzordzu commented Oct 18, 2022 •

edited

Loading

jrgemignani commented Oct 18, 2022

ddrao commented Jun 24, 2023 •

edited

Loading

pdpotter commented Jun 24, 2023 •

edited

Loading

vladiksun commented Jun 26, 2023

unique properties / indexing #45

unique properties / indexing #45

Comments

ehsanonline commented Mar 17, 2021

pdpotter commented Apr 2, 2021 • edited Loading

audiBookning commented Apr 2, 2021

pdpotter commented May 20, 2021

pdpotter commented Aug 20, 2021 • edited Loading

pdpotter commented Aug 30, 2021

JoshInnis commented Nov 30, 2021 • edited Loading

pdpotter commented Nov 30, 2021

JoshInnis commented Nov 30, 2021

JoshInnis commented Nov 30, 2021

JoshInnis commented Apr 24, 2022

pdpotter commented Apr 25, 2022

JoshInnis commented Apr 25, 2022

pdpotter commented Apr 26, 2022 • edited Loading

pdpotter commented May 9, 2022 • edited Loading

pdpotter commented May 9, 2022 • edited Loading

Property constraints

Where clause

Test suite

JoshInnis commented May 9, 2022

JoshInnis commented May 9, 2022

pdpotter commented Aug 10, 2022 • edited Loading

Dzordzu commented Oct 6, 2022 • edited Loading

jrgemignani commented Oct 17, 2022

Dzordzu commented Oct 18, 2022 • edited Loading

jrgemignani commented Oct 18, 2022

ddrao commented Jun 24, 2023 • edited Loading

pdpotter commented Jun 24, 2023 • edited Loading

vladiksun commented Jun 26, 2023

pdpotter commented Apr 2, 2021 •

edited

Loading

pdpotter commented Aug 20, 2021 •

edited

Loading

JoshInnis commented Nov 30, 2021 •

edited

Loading

pdpotter commented Apr 26, 2022 •

edited

Loading

pdpotter commented May 9, 2022 •

edited

Loading

pdpotter commented May 9, 2022 •

edited

Loading

pdpotter commented Aug 10, 2022 •

edited

Loading

Dzordzu commented Oct 6, 2022 •

edited

Loading

Dzordzu commented Oct 18, 2022 •

edited

Loading

ddrao commented Jun 24, 2023 •

edited

Loading

pdpotter commented Jun 24, 2023 •

edited

Loading