[SPARK-27299][GRAPH][WIP] Spark Graph API design proposal #24297

Open
wants to merge 82 commits into
base: master
from

8 participants
@s1ck
Contributor

commented Apr 4, 2019

What changes were proposed in this pull request?

This PR demonstrates a prototypical implementation of the new Spark Graph API. The PR should mainly be used to discuss the API proposed in this GoogleDoc. This PR is not intended to be merged.

The PR introduces two modules:

  • spark-graph-api (containing the API to be discussed)
  • spark-cypher (a prototypical implementation of spark-graph-api)

Please use the PR and/or the GoogleDoc to comment on the content of spark-graph-api. There will be follow-up PRs for spark-cypher.

How was this patch tested?

spark-cypher has been tested using the openCypher Technology Compatibility Kit (TCK).

Contributors

Design, documentation and implementation have been a collaborative effort:

Co-Authored-By: Xiangrui Meng meng@databricks.com
Co-Authored-By: Max Kießling max.kiessling@neotechnology.com
Co-Authored-By: Mats Rydberg mats@neotechnology.com
Co-Authored-By: Philip Stutz philip.stutz@gmail.com
Co-Authored-By: Sören Reichardt soren.reichardt@neotechnology.com
Co-Authored-By: Jonatan Jäderberg jonatan.jaderberg@gmail.com
Co-Authored-By: Tobias Johansson tobias.johansson@neotechnology.com
Co-Authored-By: Alastair Green alastair.green@neo4j.com

s1ck and others added some commits Feb 19, 2019

Add initial set of classes to get Cypher engine running
Co-authored-by: Philip Stutz <philip.stutz@gmail.com>
Add initial test infrastructure
Co-authored-by: Philip Stutz <philip.stutz@gmail.com>
Setup TCK testing infrastructure
Co-authored-by: Philip Stutz <philip.stutz@gmail.com>
Co-authored-by: Mats Rydberg <mats@neotechnology.com>
Get all TCK scenarios to pass
* except two temporal tests that failed because of loss of precision

Co-authored-by: Philip Stutz <philip.stutz@gmail.com>
Co-authored-by: Mats Rydberg <mats@neotechnology.com>
Remove unnecessary Neo4j specific dependencies
Co-authored-by: Philip Stutz <philip.stutz@gmail.com>
Exclude Antlr from spark-sql deps
* workaround to get TCK to work properly

Co-authored-by: Philip Stutz <philip.stutz@gmail.com>
Add spark-graph-api
Removed cherry-pick conflict in pom.xml

Co-authored-by: Martin Junghanns <martin.junghanns@neotechnology.com>
Add adapters for spark-graph-cypher and example
Co-authored-by: Philip Stutz <philip.stutz@gmail.com>
Use named arguments in example for better showcasing
Co-authored-by: Philip Stutz <philip.stutz@gmail.com>
Adapt to latest OKAPI changes
Co-authored-by: Martin Junghanns <martin.junghanns@neotechnology.com>
Support for turning Cypher results into nodes/rels for property graph
Co-authored-by: Martin Junghanns <martin.junghanns@neotechnology.com>
Fix round trip example by replacing dots when importing
Co-authored-by: Martin Junghanns <martin.junghanns@neotechnology.com>
Adapt to changes in okapi-relational
Co-authored-by: Martin Junghanns <martin.junghanns@neotechnology.com>
Clearer semantics for column access via header
It is only legal to convert an expression when the header contains it.
If the header contains an expression, but the physical table does not
contain the corresponding column, then the expression is converted to a
null literal.

Co-authored-by: Martin Junghanns <martin.junghanns@neotechnology.com>
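
The column-access rule described in the commit message above can be sketched in plain Scala without any Spark dependency. This is only an illustration of the three cases (expression not in the header, expression in the header with a physical column, expression in the header without one). The names `HeaderSemantics`, `Converted`, `ColumnRef`, `NullLiteral` and `convert` are hypothetical and are not taken from the PR, and expressions are modelled as plain strings.

```scala
object HeaderSemantics {
  // Hypothetical result type: either a reference to a physical column or a null literal.
  sealed trait Converted
  final case class ColumnRef(name: String) extends Converted
  case object NullLiteral extends Converted

  // `header` maps an expression (modelled as a string) to its column name;
  // `physicalColumns` are the columns that actually exist in the table.
  def convert(
      expr: String,
      header: Map[String, String],
      physicalColumns: Set[String]): Converted =
    header.get(expr) match {
      // Converting an expression that the header does not contain is illegal.
      case None =>
        throw new IllegalArgumentException(s"Header does not contain expression: $expr")
      // Header and physical table agree: use the column.
      case Some(column) if physicalColumns.contains(column) => ColumnRef(column)
      // Header knows the expression, but the table has no such column: null literal.
      case Some(_) => NullLiteral
    }
}
```

Under these assumptions, `HeaderSemantics.convert("n.name", Map("n.name" -> "n_name"), Set.empty)` yields `NullLiteral`, while the same call with an empty header throws.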
Adapt to entity mapping changes in okapi
Co-authored-by: Philip Stutz <philip.stutz@gmail.com>
Enable round trip example
Co-authored-by: Max Kießling <max.kiessling@neotechnology.com>
Co-authored-by: Philip Stutz <philip.stutz@gmail.com>
Rename Node/RelationshipDataFrame to Node/RelationshipFrame
Co-authored-by: Max Kießling <max.kiessling@neotechnology.com>
Co-authored-by: Philip Stutz <philip.stutz@gmail.com>
Adapt CypherResult to GraphElementFrame name changes
Co-authored-by: Max Kießling <max.kiessling@neotechnology.com>
Co-authored-by: Philip Stutz <philip.stutz@gmail.com>
Add constructor for GraphElementFrames that infers property mappings
Co-authored-by: Max Kießling <max.kiessling@neotechnology.com>
Adapt blacklist due to changes in Spark 2.4
Same changes as in CAPS.

Co-authored-by: Max Kießling <max.kiessling@neotechnology.com>
Add documentation for spark-graph-api
Co-authored-by: Max Kießling <max.kiessling@neotechnology.com>
Co-authored-by: Philip Stutz <philip.stutz@gmail.com>
Add default parameter for relationships in `createGraph`
Co-authored-by: Martin Junghanns <martin.junghanns@neotechnology.com>
Co-authored-by: Max Kießling <max.kiessling@neotechnology.com>
Use Spark 2.4 features where possible and remove/inline helpers
Co-authored-by: Max Kießling <max.kiessling@neotechnology.com>
Co-authored-by: Philip Stutz <philip.stutz@gmail.com>
Use CypherResult to create new Property Graph
Co-authored-by: Max Kießling <max.kiessling@neotechnology.com>
Adapt versions to migrate graph work to Spark 3.0 branch
Co-authored-by: Sören Reichardt <soren.reichardt@neotechnology.com>
Fix to get Spark master branch to compile
Co-authored-by: Sören Reichardt <soren.reichardt@neotechnology.com>
@SparkQA

commented Apr 15, 2019

Test build #104592 has finished for PR 24297 at commit b6f26aa.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • implicit class RichExpression(expr: Expr)
      • implicit class TemporalExpression(val expr: Expr) extends AnyVal
@dongjoon-hyun

Member

commented Apr 15, 2019

Hi, @mengxr . Could you help this PR to pass the Jenkins, please?

@mengxr

Contributor

commented Apr 15, 2019

@dongjoon-hyun This PR is a prototype for API and design discussions. We should break it down into smaller ones after we reach an agreement on the API and design. I don't think we can merge this one directly.

@dongjoon-hyun

Member

commented Apr 15, 2019

Thank you, @mengxr . I see.

@SparkQA

commented Apr 18, 2019

Test build #104719 has finished for PR 24297 at commit bfe66f8.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA

commented May 21, 2019

Test build #105643 has finished for PR 24297 at commit 9dde5f9.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
      • case class SchemaAdapter(schema: PropertyGraphSchema) extends PropertyGraphType

dongjoon-hyun added a commit to dongjoon-hyun/spark that referenced this pull request Jun 9, 2019

[SPARK-27300][GRAPH] Add Spark Graph modules and dependencies
## What changes were proposed in this pull request?

This PR introduces the necessary Maven modules for the new [Spark Graph](https://issues.apache.org/jira/browse/SPARK-25994) feature for Spark 3.0.

* `spark-graph` is a parent module that users depend on to get all graph functionalities (Cypher and Graph Algorithms)
* `spark-graph-api` defines the [Property Graph API](https://docs.google.com/document/d/1Wxzghj0PvpOVu7XD1iA8uonRYhexwn18utdcTxtkxlI) that is being shared between Cypher and Algorithms
* `spark-cypher` contains a Cypher query engine implementation

Both `spark-graph-api` and `spark-cypher` depend on Spark SQL.

Note that the Maven module for Graph Algorithms is not part of this PR and will be introduced in https://issues.apache.org/jira/browse/SPARK-27302

A PoC for a running Cypher implementation can be found in this WIP PR apache#24297

## How was this patch tested?

Pass Jenkins with all profiles, then manually build and check the following.
```
$ ls assembly/target/scala-2.12/jars/spark-cypher*
assembly/target/scala-2.12/jars/spark-cypher_2.12-3.0.0-SNAPSHOT.jar

$ ls assembly/target/scala-2.12/jars/spark-graph* | grep -v graphx
assembly/target/scala-2.12/jars/spark-graph-api_2.12-3.0.0-SNAPSHOT.jar
assembly/target/scala-2.12/jars/spark-graph_2.12-3.0.0-SNAPSHOT.jar
```

Closes apache#24490 from s1ck/SPARK-27300.

Lead-authored-by: Martin Junghanns <martin.junghanns@neotechnology.com>
Co-authored-by: Max Kießling <max@kopfueber.org>
Co-authored-by: Martin Junghanns <martin.junghanns@neo4j.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Add PropertyGraphTest
Co-authored-by: Max Kießling <max.kiessling@neotechnology.com>
@SparkQA

commented Jun 12, 2019

Test build #106419 has finished for PR 24297 at commit 94b01fd.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.
Bump okapi-shade 0.4.2
Co-authored-by: Max Kießling <max.kiessling@neotechnology.com>
@SparkQA

commented Jun 12, 2019

Test build #106420 has finished for PR 24297 at commit d74df52.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.
Adapt GraphElementFrame API to latest API changes
Co-authored-by: Max Kießling <max.kiessling@neotechnology.com>
@SparkQA

commented Jun 12, 2019

Test build #106421 has finished for PR 24297 at commit 4b6935b.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
      • abstract class GraphElementFrame

@dongjoon-hyun dongjoon-hyun added the GRAPH label Jun 14, 2019

emanuelebardelli added a commit to emanuelebardelli/spark that referenced this pull request Jun 15, 2019


override def toString: String = {
if (header.isEmpty) {
s"CAPSRecords.empty"

@kiszk

kiszk Jun 17, 2019

Contributor

nit: s is not necessary

noLabelNodeDirectoryName
} else {
// TODO: Find more elegant solution for encoding underline characters
seq.map(_.replace("_", "--UNDERLINE--")).mkString("_").encodeSpecialCharacters

@kiszk

kiszk Jun 17, 2019

Contributor

nit: Would it be better to define a variable to hold "--UNDERLINE--" and use it at all of the references for consistency?
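
One way to read this suggestion is sketched below: the placeholder lives in a single constant that both the encoding and a matching decoding step refer to. This is a simplified illustration, not the PR's code. `LabelEncoding`, `UnderscoreToken`, `encode` and `decode` are hypothetical names, the `encodeSpecialCharacters` step from the quoted line is left out, and the decoding direction is an assumption about where the other references might be.

```scala
object LabelEncoding {
  // Single definition of the placeholder, shared by every reference.
  private val UnderscoreToken = "--UNDERLINE--"

  // Escape underscores inside each label, then join the labels with "_".
  def encode(labels: Seq[String]): String =
    labels.map(_.replace("_", UnderscoreToken)).mkString("_")

  // Inverse of `encode`: split on the separator, then restore the underscores.
  // Assumes a non-empty label sequence and labels that do not contain the token itself.
  def decode(encoded: String): Seq[String] =
    encoded.split("_").toSeq.map(_.replace(UnderscoreToken, "_"))
}
```

With the token defined once, changing the placeholder later cannot leave the two directions out of sync.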


implicit class RichRelationshipDataFrame(val relDf: RelationshipFrame) extends AnyVal {
def toRelationshipMapping: ElementMapping = RelationshipMappingBuilder
.on(relDf.idColumn)

@kiszk

kiszk Jun 17, 2019

Contributor

2-indent?

* values from the evaluated children.
*/
def nullSafeConversion(expr: Expr)(withConvertedChildren: Seq[Column] => Column)
(implicit header: RecordHeader, df: DataFrame, parameters: CypherMap): Column = {

@kiszk

kiszk Jun 17, 2019

Contributor

4-indent?

@s1ck

Contributor Author

commented Jun 17, 2019

Hey Kazuaki. Very nice of you to find the time to look over the PR. However, this PR is not intended to be merged. Its main purpose is to have a running PoC while we implement the graph features in smaller PRs. The latest one is introducing the API for property graph construction (#24851). Please feel free to have a look. I'll address your comments when we start working on SPARK-27309. @kiszk

@SparkQA

commented Jun 24, 2019

Test build #106831 has finished for PR 24297 at commit 432389c.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
      • class PropertyGraphReadWrite extends QueryTest with SharedCypherContext with BeforeAndAfterEach
@SparkQA

commented Jun 28, 2019

Test build #106990 has finished for PR 24297 at commit 038a8ce.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.