added some more doc

cytosm · Dec 15, 2016 · 390ef69
1 parent b547e36
commit 390ef69
Showing 4 changed files with 235 additions and 1 deletion.
diff --git a/LICENSE.txt b/LICENSE.txt
@@ -10,4 +10,4 @@ Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
-limitations under the License.
+limitations under the License.README.md
diff --git a/README.md b/README.md
@@ -27,3 +27,46 @@ The following would be nice to have:
   are not supported
 - Improve `CypherConverter` and `pathfinder` to generate AST nodes instead of using intermediary string representations.
 - Improve the `pathfinder` related code to use the full information available about the variable and their type.
+
+
+## Overview
+
+A Cypher string goes through several transformations on its journey through Cytosm:
+
+* Parsing (an auto-generated ANTLR parser based on the openCypher EBNF grammar). It creates an AST that the later stages use.
+* PathFinder navigates the AST, given a graph topology file, in order to make Cypher queries more concrete.
+* Cypher2SQL. The module that turns the concrete Cypher queries into SQL.
+
+
+### Path Finder
+
+A set of simple optimisations that try to make Cypher queries more concrete (preventing the mapper from exploring patterns in the Cypher queries that
+are logically correct, but impossible given the tables that actually exist in the database).
+
+This way, the mapping process is simpler and the generated SQL queries are more efficient.
+
+See [PathFinder](pathfinder/README.md)
+
+### Cypher2SQL
+
+This module takes the concrete Cypher queries that the PathFinder module produces and
+
+ * analyses dependencies between Cypher variables and tracks their scope.
+ * creates an intermediate language representation of the query (something closer to SQL, but not quite there yet). This is a hierarchical representation.
+ * from the hierarchy created in the previous stage, builds a sequence of nested joins and unions in SQL to represent the graph patterns indicated in Cypher.
+
+
+### Common
+
+#### Graph Topology Files (gTop)
+
+A description of the graph hiding in your relational database. It also describes how to map from abstract nodes/edges in the graph to specific database tables/columns.
+
+Find more details about [gTop](common/README.md)
+
+A gTop file can be discovered automatically by the "Graph Extraction" module (to be open-sourced soon).
+
+## Benchmarks
+
+Cytosm queries have been run on a variety of backends, with quite surprising results. Please find more details in
+the sibling repo for [Cytosm benchmarking](https://github.com/Alnaimi-/database-benchmark).
diff --git a/common/README.md b/common/README.md
@@ -0,0 +1,33 @@
+# Cytosm: gTop
+
+## Overview:
+
+In Grapher, a property graph follows a graph topology file (gTop): a set of rules under which every edge or vertex of the graph can be assigned at least one edge or vertex type. Vertices or edges of the same type share the same property keys, possibly with different values for those keys. The file format supports mixed graphs (graphs that can contain both undirected and directed edges) and parallel edges. Through the use of the gTop, it is possible to map a graph query language (GQL) query into the relational domain.
+
+## Example:
+
+An example of a gTop file is shown in the figure below.
+
+<p align="center">
+  <img src="../docs/static_files/gtop.png?raw=true" alt="gtop"/>
+</p>
+
+The following graph could be modeled by the gTop above.
+
+<p align="center">
+  <img src="../docs/static_files/gtopgraph.png?raw=true" alt="gtopgraph"/>
+</p>
+
+## gTop Features
+
+gTop enables:
+
+* <b>Flexibility:</b> GQL queries can run on an RDBMS without knowledge of the system's implementation details. A translation step that takes as input the GQL query and its associated gTop file should be able to construct valid SQL for the RDBMS.
+* <b>Multiple Models:</b> it describes how data stored in the RDBMS would be visualised as a property graph, classifying information as node types and edge types as long as it fits certain interpretation rules. In other words, it maps relational tuple sets into node and edge types.
+
+The separation of gTop files into two layers enables flexibility in mapping. The abstraction layer describes the property graph model (as a serialized description of the first figure), while the implementation layer defines the mapping mechanisms between the two domains.
+
+The implementation layer describes how the information described in the abstraction-level gTop is stored in the underlying relational system. Nodes of the same type can be found in multiple tables, and edges can actually represent multiple table joins between the source and destination nodes.
+
+Nodes and edges of the same type can also be spread across multiple tables: nodes of type Person could be stored in the tables <i><b>proletariat</b></i> and <i><b>bourgeoisie</b></i> in the relational system. As long as both tables are assigned the type Person in the implementation-level gTop, the GQL is agnostic of this structure and refers to the data in both of them by the type Person. This feature can be extended to resemble a type hierarchy from object-oriented languages: nodes of type Electricity Suppliers can also be of type Company. Similarly, restrictions based on table data rules can be applied, which makes it possible to split a single table into several different types of nodes/edges; it would make sense, for example, for a relational system to keep the information about Companies and Electricity Suppliers in the same table. A further capability the implementation level adds is representing an edge as a sequence of multiple table joins.
+
diff --git a/pathfinder/README.md b/pathfinder/README.md
@@ -0,0 +1,158 @@
+# Cytosm: Path Finder
+
+## Overview:
+
+The Path Finder is a parallelizable Regular Path Query solver that takes into account the graph topology in the gTop file. The Regular Path Query is the pattern (relationship-chain) section of a Cypher query.
+
+## Context:
+Without a well-defined topological mapping between the relational and the property graph models, certain queries can cause a combinatorial explosion of arrangements, because the layout of the underlying graph model is unknown. Many of those arrangements turn out to be impossible once analysed in the light of the graph topology model. When every node and edge in the graph follows properly defined rules, the query planner can prune several impossible graph traversals before it starts executing the query, in a similar fashion to what a typical relational query planner does (or should do) with dead code.
+
+## Black Box Example
+
+The Path Finder module acts as a pre-planner that provides any relational query planner with graph topology information. It trims out any unnecessary join that would return an impossible graph traversal, that is, a sequence of nodes and edges that does not follow the model described in the graph topology file. The Path Finder algorithm is designed to be independently parallelizable in two stages. One of the benefits of this pre-computation layer is that malformed or model-invalid queries can be rejected before they are even sent to the RDBMS, drastically reducing the response time.
+
+<p align="center">
+  <img src="../docs/static_files/gtop.png?raw=true" alt="gtop"/>
+</p>
+
+
+The model described in the figure above contains six different types of nodes and the same number of unique types of directed edges. A sample Cypher query that could be run on top of this system is shown below. This query returns all the people who work at some company.
+
+```
+Match (a)-[:works_at]->(b)
+Return a;
+```
+
+The input of the Path Finder is the Regular Path Query. In the Cypher example, it is the relationship-chain defined as:
+
+```
+(a)-[:works_at]->(b)
+```
+
+The Path Finder module is, however, not limited to a single graph query language or representation of a Regular Path Query. Nodes <i>a</i> and <i>b</i>, represented by <i>(a)</i> and <i>(b)</i>, are what are called anonymous nodes. These are nodes whose type is not explicitly defined in the query. The letters <i>a</i> and <i>b</i> are called variables and are used to refer to the given node elsewhere in the query.
+
+A human reader, knowing the information in the gTop, would have correctly inferred that <i>(a)</i> can only be of type Person and <i>(b)</i> of type Company, since the edge between them is labeled <i>works_at</i>.
+
+## Detailed Example
+
+A more complex path description would take a human reader considerably more time to visualise and to enumerate all the possible outcomes. Assume the following path description:
+
+```
+( )<--(m: {"passport_no": "FD8X723"})-[*1..2]->(n)<--(c)
+```
+
+There are four anonymous nodes in the original query, three of them referenced with the variables <i>m</i>, <i>n</i> and <i>c</i>. The directed edge between <i>m</i> and <i>n</i> has the content <i>*1..2</i>. This is what we call an edge expansion wildcard: <i>n</i> can be one or two hops away from <i>m</i>.
+
+The correct set of paths found by the Path Finder, matching both the gTop description and the query, is:
+
+```
+(:Pet)<-[:owns]- (:Person) -[:works_at]-> (:Company) <-[:supplies]- (:Electric_Supplier)
+
+(:Company)<-[:works_at]- (:Person) -[:works_at]-> (:Company) <-[:supplies]- (:Electric_Supplier)
+
+(:Pet)<-[:owns]-(:Person)-[:works_at]->(:Company) -[:based_in]-> (:City)<-[:available_in]- (:Electric_Supplier)
+
+(:Company)<-[:works_at]- (:Person) -[:works_at]-> (:Company) -[:based_in]-> (:City) <-[:available_in]- (:Electric_Supplier)
+```
+
+### Top Level Algorithm
+
+The process that allows the Path Finder to enumerate all the possible routes that fit that pattern is described in the algorithm below:
+
+<b>Input:</b> Graph path that may contain regular edge and node expressions; a gTop file<br>
+<b>Output:</b> Set of graph paths without regular edge or node expressions, in accordance with the gTop rules.
+
+```
+read graph path description;
+read gTop file;
+solve any possible node and edge hints;
+replace multi-hop regular edge expressions with equivalent edge/node pairs;
+
+while (an unsolved graph path exists) {
+	contextualise the graph path using the gTop;
+	// each fully contextualised path joins the set of solutions
+}
+```
+
+The algorithm receives a gTop file and a graph route with edge wildcard expansions or anonymous nodes. Some queries contain characteristics that greatly reduce the possible matches for an anonymous node/edge. In this query, the anonymous node <i>(m)</i> has an attribute called "passport_no". Based on the gTop model, one can easily infer that this node can only be of type Person. Sometimes an attribute is not exclusive to a single node or edge type, but still reduces the number of possible matches for that node early in the search.
+Next, the algorithm identifies any edge wildcard expansion in the query and performs the route dilation. Thus, the original route is equivalent to:
+
+```
+[1]		( )<--(m: Person {"passport_no": "FD8X723"})-->(n)<--(c)
+
+[2]		( )<--(m: Person {"passport_no": "FD8X723"})-->()-->(n)<--(c)
+```
+
+In other words, the input route is the union of the routes [1] and [2]. Due to this property, routes [1] and [2] can be contextualized independently and in parallel. To continue the demonstration, we assume the route currently being analyzed is [2].
+
+A graph search algorithm (such as depth-first search) combined with a path tracking structure can be used to solve every single graph path according to the graph topology file. This process is called contextualization, since it turns a route with undefined nodes and edges into a set of well-defined node and edge sequences that follow the graph topology model.
+
+At the end of this process, the set of solutions described above is found.
+
+## Performance comparison:
+
+The following results compare the planning of four-hops-away queries with and without the Path Finder. They are based on LDBC query number 6.
+
+### 1. Four hops with anonymous nodes:
+
+```
+Profile MATCH (person:Person {id:2199023259437})-[:KNOWS]->()-[:KNOWS]->()-[:KNOWS]->()-[:KNOWS]->(friend:Person),
+(friend:Person)<-[:HAS_CREATOR]-(friendPost:Post)-[:HAS_TAG]->(knownTag:Tag {name:"A_Woman_and_a_Man"})
+WHERE not(person=friend)
+MATCH (friendPost:Post)-[:HAS_TAG]->(commonTag:Tag)
+WHERE not(commonTag=knownTag)
+WITH DISTINCT commonTag, knownTag, friend
+```
+
+Cypher version: CYPHER 2.3, planner: COST. <b>1678093696 total db hits</b> in <b>1769262 ms</b>.
+
+The equivalent plan is:
+
+<p align="center">
+  <img src="docs/pathFinderPlanning/neo4j2_3_4/anonNodesPlan/plan.png?raw=true" alt="Anonymous Nodes Plan"/>
+</p>
+
+### 2. Edge Expansion:
+
+```
+Profile MATCH (person:Person {id:2199023259437})-[:KNOWS*4]->(friend:Person),
+(friend:Person)<-[:HAS_CREATOR]-(friendPost:Post)-[:HAS_TAG]->(knownTag:Tag {name:"A_Woman_and_a_Man"})
+WHERE not(person=friend)
+MATCH (friendPost:Post)-[:HAS_TAG]->(commonTag:Tag)
+WHERE not(commonTag=knownTag)
+WITH DISTINCT commonTag, knownTag, friend
+```
+
+Cypher version: CYPHER 2.3, planner: COST. <b>1678093696 total db hits</b> in <b>1669792 ms</b>.
+
+The equivalent plan is:
+
+<p align="center">
+  <img src="docs/pathFinderPlanning/neo4j2_3_4/regEdgePlan/plan.png?raw=true" alt="Regular Edge Expression Plan"/>
+</p>
+
+### 3. Path Finder:
+
+```
+Profile MATCH (person:Person {id:2199023259437})-[:KNOWS]->(:Person)-[:KNOWS]->(:Person)-[:KNOWS]->(:Person)-[:KNOWS]->(friend:Person),
+(friend:Person)<-[:HAS_CREATOR]-(friendPost:Post)-[:HAS_TAG]->(knownTag:Tag {name:"A_Woman_and_a_Man"})
+WHERE not(person=friend)
+MATCH (friendPost:Post)-[:HAS_TAG]->(commonTag:Tag)
+WHERE not(commonTag=knownTag)
+WITH DISTINCT commonTag, knownTag, friend
+MATCH (commonTag:Tag)<-[:HAS_TAG]-(commonPost:Post)-[:HAS_TAG]->(knownTag:Tag)
+WHERE (commonPost:Post)-[:HAS_CREATOR]->(friend:Person)
+RETURN
+commonTag.name AS tagName,
+count(commonPost) AS postCount
+ORDER BY postCount DESC, tagName ASC
+LIMIT 20;
+```
+
+Cypher version: CYPHER 2.3, planner: COST. <b>829079 total db hits</b> in <b>3256 ms</b>.
+
+The equivalent plan is:
+
+<p align="center">
+  <img src="docs/pathFinderPlanning/neo4j2_3_4/pathFinder/plan.png?raw=true" alt="Regular Edge Expression Plan"/>
+</p>
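
As a companion to the "Top Level Algorithm" section of `pathfinder/README.md` in the diff above, here is a minimal Python sketch of the two stages it describes: expanding edge wildcards into fixed-length routes, then contextualising each route with a depth-first search over the topology. It is not Cytosm's implementation. The names `TOPOLOGY`, `expand`, `contextualise` and `solve` are hypothetical, and the toy topology is only a guess pieced together from the node and edge types named in the README, not the actual gTop from the figure, so its results need not match the solutions listed there.

```python
# Hypothetical toy topology: (source type, edge type, destination type).
# Assumption: reconstructed from names in the README, not the real gTop.
TOPOLOGY = [
    ("Person", "owns", "Pet"),
    ("Person", "works_at", "Company"),
    ("Company", "based_in", "City"),
    ("Electric_Supplier", "supplies", "Company"),
    ("Electric_Supplier", "available_in", "City"),
]


def expand(nodes, edges):
    """Replace every variable-length edge with fixed-length alternatives.

    nodes: list of node types (None = anonymous node).
    edges: list of (label or None, direction, min_hops, max_hops),
           where direction ">" points left-to-right and "<" right-to-left.
    Returns a list of (nodes, edges) routes whose union equals the input.
    """
    routes = [([nodes[0]], [])]
    for (label, direction, lo, hi), node in zip(edges, nodes[1:]):
        grown = []
        for r_nodes, r_edges in routes:
            for hops in range(lo, hi + 1):
                # hops - 1 anonymous nodes are introduced between the ends.
                grown.append((r_nodes + [None] * (hops - 1) + [node],
                              r_edges + [(label, direction)] * hops))
        routes = grown
    return routes


def contextualise(nodes, edges):
    """Depth-first search that assigns a type to every node and edge."""
    solutions = []

    def dfs(i, typed_nodes, typed_edges):
        if i == len(edges):
            solutions.append((typed_nodes, typed_edges))
            return
        label, direction = edges[i]
        for src, etype, dst in TOPOLOGY:
            if label is not None and label != etype:
                continue
            left, right = (src, dst) if direction == ">" else (dst, src)
            if left != typed_nodes[-1]:
                continue
            if nodes[i + 1] is not None and nodes[i + 1] != right:
                continue
            dfs(i + 1, typed_nodes + [right], typed_edges + [etype])

    all_types = sorted({t for s, _, d in TOPOLOGY for t in (s, d)})
    for start in ([nodes[0]] if nodes[0] is not None else all_types):
        dfs(0, [start], [])
    return solutions


def solve(nodes, edges):
    """Union of the contextualised solutions of every expanded route."""
    return [s for route in expand(nodes, edges) for s in contextualise(*route)]


# Black-box example: (a)-[:works_at]->(b)
print(solve([None, None], [("works_at", ">", 1, 1)]))
```

Because each expanded route is solved without reference to the others, the loop in `solve` could be distributed across workers, which is the independence property the README relies on. A route that the topology cannot satisfy simply yields an empty list, so an invalid query can be rejected before any SQL is generated.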