Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unique properties / indexing #45

Open
ehsanonline opened this issue Mar 17, 2021 · 25 comments
Open

unique properties / indexing #45

ehsanonline opened this issue Mar 17, 2021 · 25 comments
Labels
enhancement New request

Comments

@ehsanonline
Copy link

Hello, it would be great if we could use unique keys in nodes and edges. in Neo4j it's called Node Key and it can be used like these:

  1. CREATE CONSTRAINT constraint_name ON (n:Person) ASSERT (n.firstname) IS NODE KEY
  2. CREATE CONSTRAINT constraint_name ON (n:Person) ASSERT (n.firstname, n.surname) IS NODE KEY
  3. CREATE CONSTRAINT constraint_name IF NOT EXISTS ON (n:Person) ASSERT (n.firstname, n.surname) IS NODE KEY
  4. CREATE CONSTRAINT constraint_with_provider ON (n:Label) ASSERT (n.prop1) IS NODE KEY OPTIONS {indexProvider: 'native-btree-1.0'}
  5. DROP CONSTRAINT constraint_name
  6. DROP CONSTRAINT missing_constraint_name IF EXISTS

I've tried to use Postgres queries but since the properties column type is not json/jsonb I got error:

postgres=# CREATE UNIQUE INDEX person_name_idx ON mygraph.person( (properties->>'name') ) ;
ERROR:  operator does not exist: ag_catalog.agtype ->> unknown
LINE 1: ...INDEX person_name_idx ON mygraph.person( (properties->>'name')...
                                                               ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.
@pdpotter
Copy link
Contributor

pdpotter commented Apr 2, 2021

It would indeed be very useful to enable the creation of constraints / indexes. This could help to accelerate the creation of edges between nodes that were added earlier, which slows down a lot when there are a lot of nodes with the same label:

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA {id: 1}), (r:LabelB {id: 2})
CREATE (d)-[:Relation {prop: 'value'}]->(r)
$$) as (a agtype)

The CREATE UNIQUE, CREATE CONSTRAINT, and CREATE INDEX clauses / commands are part of cypher, but not of openCypher, so I don't know what the best approach would be here.

@ehsanonline

See comment below for an update. Click here to view the original comment content. You can create a unique index, which will prevent the creation of duplicate values, but will not speed up matching, by creating an immutable function that casts to json over text
CREATE OR REPLACE FUNCTION get_name(properties agtype) 
  RETURNS text
AS
$BODY$
    select $1::text::json->>'name';
$BODY$
LANGUAGE sql
IMMUTABLE;

and using that immutable function in the create index command

CREATE UNIQUE INDEX person_name_idx ON mygraph.person(get_name(properties)) ;

Link to the mentioned comment below

@audiBookning
Copy link

Here some links just for reference.

List of Cypher Improvement Proposals (CIP)

As for indexes and constraints in Opencypher, there are already some PR for their addition, but i think that they are "blocked" by another CIP

CIP:

PR:

@pdpotter
Copy link
Contributor

I tried using id() instead of using properties for matching, but this was even slower (I noticed using an additional WHERE clause instead of defining properties in the MATCH clause is slower in another use case as well).

For this, I rewrote

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA {id: 1}), (r:LabelB {id: 2})
CREATE (d)-[:Relation {prop: 'value'}]->(r)
$$) as (a agtype);

to

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA), (r:LabelB)
WHERE id(d) = 11111 and id(r) = 11112
CREATE (d)-[:Relation {prop: 'value'}]->(r)
$$) as (a agtype);

where I obtained the ids (the ones above are made up) using

SELECT * FROM cyper('graph_name',$$
MATCH (n:LabelA)
return id(n), n.id
$$) as (id agtype, prop agtype);

@JoshInnis JoshInnis added the enhancement New request label Jun 16, 2021
@pdpotter
Copy link
Contributor

pdpotter commented Aug 20, 2021

There is a (1000 times faster) workaround for the creation of edges between vertices that were added earlier. The workaround consists of inserting the data directly in the underlying tables that are used by Apache AGE.

See comment below for an update. Click here to view the original comment content. First, the underlying ids of the vertices are retrieved:
SELECT * FROM cyper('graph_name',$$
MATCH (n:LabelA)
return id(n), n.id
$$) as (id agtype, prop agtype);

A single edge is then created using the cypher function to make sure the underlying tables are created correctly (the id values are made up)

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA), (r:LabelB)
WHERE id(d) = 11111 and id(r) = 11112
CREATE (d)-[:RelationA {prop: 'value'}]->(r)
$$) as (a agtype);

Indices on start_id and end_id are then created in the underlying table for the edges of a certain type, to speed up the inserts in the _ag_label_edge table later on:

CREATE INDEX RelationA__start_id ON graph_name.RelationA(start_id);
CREATE INDEX RelationA__end_id ON graph_name.RelationA(end_id);

All other edges are created by direct insertion, first in the specific edge table (using executemany):

INSERT INTO graph_name.RelationA (start_id, end_id, properties)
VALUES ($1, $2, $3)

Where $1 are the domain_ids, $2 are the range_ids and $3 are the properties (as json dump).

Secondly, the edges are also directly inserted into the _ag_label_edge table (also using executemany). In my use case, each edge has an id property that can be used to select the correct edge when there are multiple edges between two vertices:

INSERT INTO graph_name._ag_label_edge (id, start_id, end_id, properties)
VALUES (
    (
        SELECT id from graph_name.RelationA
        WHERE start_id = $1
        AND end_id = $2
        AND properties::text::json->>'id' = $3
    ),
    $1,
    $2,
    $4
)

Where $1 are the domain_ids, $2 are the range_ids, $3 are the relation ids and $4 are the properties (as json dump).

Any thoughts on this workaround?
Are there any plans to add property indexes to Apache AGE later on?

@pdpotter
Copy link
Contributor

When the edges are inserted in the specific edge table, they are automatically added to the _ag_label_edge table, reducing the workaround to the following and making the relation creation even faster:

First, the underlying ids of the vertices are retrieved:

SELECT * FROM cyper('graph_name',$$
MATCH (n:LabelA)
return id(n), n.id
$$) as (id agtype, prop agtype);

A single edge is then created using the cypher function to make sure the underlying tables are created correctly (the id values are made up)

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA), (r:LabelB)
WHERE id(d) = 11111 and id(r) = 11112
CREATE (d)-[:RelationA {prop: 'value'}]->(r)
$$) as (a agtype);

All other edges are created by direct insertion in the specific edge table (using executemany):

INSERT INTO graph_name.RelationA (start_id, end_id, properties)
VALUES ($1, $2, $3)

Where $1 are the domain_ids, $2 are the range_ids and $3 are the properties (as json dump).

@JoshInnis
Copy link
Contributor

JoshInnis commented Nov 30, 2021

Hello,

A graph name is a schema and a label name is a table. Id and properties are columns in vertex table. Id, start_id, end_id, and properties are columns in the edge tables. Use the agtype_access_operator(properties, key) to get to get a property value.

Knowing all that you can use Postges' standard DDL language to implement constraints, indices and unique values.

ALTER TABLE graph_name.label_name 
ADD CONSTRAINT constraint_name
CHECK(agtype_access_operator(properties, "name_of_property") != '"Check against here"'::agtype);

@pdpotter
Copy link
Contributor

This is great, thank you!

Since indices require an immutable function, an additional function will still need to be created for them. When I create a get_id function with

CREATE OR REPLACE FUNCTION get_id(properties agtype) 
  RETURNS agtype
AS
$BODY$
    select agtype_access_operator($1, '"id"');
$BODY$
LANGUAGE sql
IMMUTABLE;

and use it in an index with

CREATE UNIQUE INDEX person_id_idx ON mygraph.person(get_id(properties)) ;

the creation of vertices with the same id will be prevented

ERROR:  duplicate key value violates unique constraint "person_id_idx"
DETAIL:  Key (get_id(properties))=(2250) already exists.

but the index will still not be used when trying to match vertices with a specific id:

SELECT * FROM ag_catalog.cypher('mygraph', $$EXPLAIN ANALYZE MATCH (a:person {id:2250}) return a$$) as (a agtype);
                                                                 QUERY PLAN                                                                 
--------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on person a  (cost=0.00..2775.32 rows=16906 width=32) (actual time=12.688..85.521 rows=1 loops=1)
   Filter: _property_constraint_check(properties, agtype_build_map('id'::text, '2250'::agtype))
   Rows Removed by Filter: 50718
 Planning Time: 0.122 ms
 Execution Time: 85.550 ms
(5 rows)

Is there a way to use indices when matching?

@JoshInnis
Copy link
Contributor

It might be safe to change agtype_access_operator to immutable.

Per Postgres' Documentation:

IMMUTABLE indicates that the function cannot modify the database and always returns the same result when given the same argument values; that is, it does not do database lookups or otherwise use information not directly present in its argument list. If this option is given, any call of the function with all-constant arguments can be immediately replaced with the function value.

STABLE indicates that the function cannot modify the database, and that within a single table scan it will consistently return the same result for the same argument values, but that its result could change across SQL statements. This is the appropriate selection for functions whose results depend on database lookups, parameter variables (such as the current time zone), etc. (It is inappropriate for AFTER triggers that wish to query rows modified by the current command.) Also note that the current_timestamp family of functions qualify as stable, since their values do not change within a transaction.

The access operator works for Agtype Lists and Maps, it does not perform any database lookup, it just extracts a value from the first passed in parameter.

@JoshInnis
Copy link
Contributor

Indices cannot currently be used while matching. There will need to be some re factoring done to allow the planner to realize opportunities where the indices can be used.

@JoshInnis
Copy link
Contributor

Hello, the commit 57e11a3 in master should have resolved this issue and will be available in the next release.

@pdpotter
Copy link
Contributor

Thank you for the update! Is it possible to explain in more detail what issue was resolved?

I tried creating a unique index as described in #45 (comment), but a Sequential scan is still executed when I use a MATCH query to find a vertex based on the property on which a unique index has been created.

I pulled the latest changes, did a make and sudo make install, DROP EXTENSTION age CASCADE; and CREATE EXTENTION age;. Is this sufficient to use the changes from 57e11a3, or do I have to drop and recreate the database I'm working on?

@JoshInnis
Copy link
Contributor

Hi @pdpotter, sorry for the confusion, AGE now supports constraints, the MATCH clause does not yet support using index scans. Constraints now work and the updating (SET, REMOVE, etc) clauses work with constraints and they no longer break indices. The patch 6279c10 supports GIN indices. So if you create an index on a label's properties and place the quals in the {} in the MATCH clause. Such as MATCH (:label {prop_name: "filter value"}) the gin index will be used. Index scans using filters in the WHERE clause in nearing completion and will be in the review process shortly.

@pdpotter
Copy link
Contributor

pdpotter commented Apr 26, 2022

See comment below for an update. Click here to view the original comment content. Wow, this is fantastic. It is now possible to create relations quickly (~10 000/s on my local VM) using simple queries (with executemany) in the form of ``` SELECT * FROM cypher('graph_name', $$ MATCH (d:LabelA {id: 1}), (r:LabelB {id: 2}) CREATE (d)-[:Relation {prop: 'value'}]->(r) $$) as (a agtype) ```

After simply adding GIN indexes

CREATE INDEX d:LabelA__properties
ON graph_name.LabelA USING GIN (properties)
CREATE INDEX d:LabelB__properties
ON graph_name.LabelB USING GIN (properties)

For this specific use case, it would of course be more disk space efficient to only index the id property and not all of them, but having MATCH working so much faster is a huge leap forward. Great work!

@pdpotter
Copy link
Contributor

pdpotter commented May 9, 2022

Commit 379983b includes some improvements which are relevant here:

  • agtype_access_operator is now immutable. This means it can be used directly to create unique indices. It is no longer necessary to create a separate immutable function. E.g.,
CREATE UNIQUE INDEX person_id_idx ON mygraph.person(ag_catalog.agtype_access_operator(properties, '"id"'::agtype));
  • GIN indices are now used when filtering using WHERE clauses.
    edit: The unique index created above is used when filtering using WHERE clauses.

Very nice work, thank you @JoshInnis!

@pdpotter
Copy link
Contributor

pdpotter commented May 9, 2022

It looks like I have been a bit too enthusiastic in my previous comments.

Property constraints

When using GIN indices on properties and creating relations using these indices, I didn't check if the relations were actually added. Unfortunately, they were not.

After adding a GIN index and adding enough vertices so the index is used, a match query doesn't return any results. E.g.,

SELECT * FROM cypher('test_graph', $$CREATE (p:person {id: 1}) return p$$) as (p agtype);
SELECT * FROM cypher('test_graph', $$CREATE (p:person {id: 2}) return p$$) as (p agtype);
...
SELECT * FROM cypher('test_graph', $$CREATE (p:person {id: 500}) return p$$) as (p agtype);
CREATE INDEX person__properties ON test_graph.person USING GIN (properties);

A match query using property constraints returns 0 results.

SELECT * FROM cypher('test_graph', $$MATCH (p:person {id: 1}) return p$$) as (p agtype);
 p 
---
(0 rows)

Query plan:

                                                                         QUERY PLAN                                                                         
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on person p  (cost=12.01..16.03 rows=1 width=32) (actual time=0.026..0.027 rows=0 loops=1)
   Recheck Cond: (properties @> agtype_build_map('id'::text, '1'::agtype))
   ->  Bitmap Index Scan on person__properties  (cost=0.00..12.01 rows=1 width=0) (actual time=0.024..0.024 rows=0 loops=1)
         Index Cond: (properties @> agtype_build_map('id'::text, '1'::agtype))
 Planning Time: 0.158 ms
 Execution Time: 0.075 ms
(6 rows)

Where clause

When using the where clause, it is not the GIN index that is being used, but the unique index that was added to prevent duplicate entries by executing

CREATE UNIQUE INDEX person__id ON test_graph.person (properties);

The WHERE clause does give a correct result:

SELECT * FROM cypher('test_graph', $$MATCH (p:person) WHERE p.id = 1) return p$$) as (p agtype);
                                      p                                      
-----------------------------------------------------------------------------
 {"id": 844424930131969, "label": "person", "properties": {"id": 1}}::vertex
(1 row)

Query plan:

      QUERY PLAN                                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using person__id on person p  (cost=0.27..8.30 rows=1 width=32) (actual time=0.063..0.065 rows=1 loops=1)
   Index Cond: (agtype_access_operator(VARIADIC ARRAY[properties, '"id"'::agtype]) = '1'::agtype)
 Planning Time: 0.190 ms
 Execution Time: 0.103 ms
(4 rows)

When using the WHERE clause for creating relations, the performance decreases when adding a lot of relations (when adding relations in batches of 5000, the first batch achieves ~5000 it/s, while the ninth batch achieves ~500 it/s) using something similar to

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA ), (r:LabelB)
WHERE d.id = 1 AND r.id = 2
CREATE (d)-[:Relation {prop: 'value'}]->(r)
$$) as (a agtype)

Test suite

When having a new look at the tests (https://github.com/apache/incubator-age/blob/master/regress/expected/index.out#L299-L336) to check if I was making mistakes, I found out some of the index related tests might have some issues:

Thank you so much for making Apache AGE better and better with each commit. I'm sorry if I caused any confusion with my previous comments.

@JoshInnis
Copy link
Contributor

Hi @pdpotter, GIN Indices are for a subset of JsonB operators https://www.postgresql.org/docs/11/gin-builtin-opclasses.html These operators are not usabale in the cypher Where clause. Currently the only place they can be used is in the property constraint field in the match clause. The where clause is now compatible with the comparison operators that the cypher command currently has. These new operators need to be added to the cypher command's where clause

@JoshInnis
Copy link
Contributor

For an article about this you can checkout this https://bitnine.net/blog-postgresql/postgresql-internals-jsonb-type-and-its-indexes/ Its a bit dated and about Agensgraph and not AGE, but if you replace insances of the '->>' with the '.' operator the information is still accurate

@pdpotter
Copy link
Contributor

pdpotter commented Aug 10, 2022

As #228 has been fixed, it is now possible to use GIN indices for adding relations.
After adding GIN indices by executing

CREATE INDEX d:LabelA__properties
ON graph_name.LabelA USING GIN (properties);
CREATE INDEX d:LabelB__properties
ON graph_name.LabelB USING GIN (properties);

Adding relations using a query like

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA {id: 1}), (r:LabelB {id: 2})
CREATE (d)-[:Relation {prop: 'value'}]->(r)
$$) as (a agtype);

is now over 30 times faster (~300/s on my local VM) than without the GIN indices.

I tried improving the performance by creating indices using

CREATE INDEX d:LabelA__properties
ON graph_name.LabelA USING GIN (properties jsonb_path_ops);

but jsonb_path_ops is not supported for agtype.

Another attempt using a specialized index created by

CREATE INDEX d:LabelA__properties_id
ON graph_name.LabelA USING GIN ((properties->'id'));

didn't seem to work because the index wasn't used when executing MATCH queries (this was already mentioned in #212 (comment)).

As using GIN indixes is ~30 times slower than the method described in #45 (comment), I think I'm going to keep using this other method for initial data import.

@Dzordzu
Copy link
Contributor

Dzordzu commented Oct 6, 2022

For onlookers: within Rust driver I've implemented methods that provide unique indexes, and properties contraints. As of now it can be treated as a summary of this disscussion

@jrgemignani
Copy link
Contributor

Just wondering if this issue is resolved? Or, is there more that needs to be done that this issue needs to stay open? It is a bit difficult to tell from the correspondence.

@Dzordzu
Copy link
Contributor

Dzordzu commented Oct 18, 2022

Just wondering if this issue is resolved? Or, is there more that needs to be done that this issue needs to stay open? It is a bit difficult to tell from the correspondence.

I wouldn't say it's resolved. Fact - currently there are mechanisms that allow client to create indexes / constraints. On the other hand there is neither good documentation on this, nor agreement on the topic if constraints should also be included within cypher-like methods.

@jrgemignani
Copy link
Contributor

I will try to see if I can get others engaged that can help.

@ddrao
Copy link

ddrao commented Jun 24, 2023

Any update on indexing side of things? Any pointers on enabling the cypher queries to use specific property indexes? Other than the GIN index, nothing seems to be working.

@pdpotter
Copy link
Contributor

pdpotter commented Jun 24, 2023

In #954 (comment), a way to create specific property indexes has been mentioned. I haven't been able to test it out, but the issue has been closed, so I'm hopeful it will work. Edit: it doesn't seem to work, see #1009

In #1000, a patch that is still being worked on is mentioned that would allow GIN indices to be used in WHERE clauses as well.

@vladiksun
Copy link

@pdpotter I suppose the indexes mentioned in the #1000 to be used from the WHERE clauses should be of BTREE type instead of GIN because GIN index is good for pattern matches like in the MATCH clauses.

panosfol pushed a commit to panosfol/age that referenced this issue Aug 2, 2023
In the update_entity_tuple() function, when we call table_tuple_update() and assign the returned value to the result variable, the buffer variable receives the value of 0.
Made a workaround so that the original value isn't lost.
rafsun42 pushed a commit to rafsun42/age that referenced this issue Aug 31, 2023
In the update_entity_tuple() function, when we call table_tuple_update() and assign the returned value to the result variable, the buffer variable receives the value of 0.
Made a workaround so that the original value isn't lost.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New request
Projects
Development

No branches or pull requests

8 participants