
[Spark] Refine the GraphWriter to automatically generate graph info and improve the Neo4j case #196

Merged · 27 commits · Jul 18, 2023

Conversation

acezen

@acezen acezen commented Jul 7, 2023


Proposed changes

This change focuses on refining Spark's graph-level API and refactoring the Neo4j case:

  • Graph-level API refinement
    • Implement GraphWriter as a class, and add PutVertexData and PutEdgeData methods to put vertex/edge DataFrames into the writer. Users no longer need to construct a Mapping for the data.
    • Support generating the graph info from the DataFrame schema, and add a dump method to GraphInfo, VertexInfo and EdgeInfo to dump the info to a JSON string.
  • Neo4j case
    • Refactor the case with the graph-level API and cover the whole movie graph, not just Person Produced Movie.
    • Add some scripts to help users run the case easily (for getting started).
  • Some bug fixes and documentation updates.
  • Use Markdown format for the cpp/spark READMEs, since they are not included in the website.
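Going by the description above, usage of the refined writer might look roughly like this. This is a sketch only: the PutVertexData/PutEdgeData names come from the PR description, but the exact signatures and parameters are assumed, not taken from the code.

```scala
// Illustrative sketch, not the authoritative API: signatures are assumed.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val personDf = spark.read.option("header", "true").csv("person.csv")
val knowsDf  = spark.read.option("header", "true").csv("person_knows_person.csv")

val writer = new GraphWriter()            // GraphWriter is now a class
writer.PutVertexData("Person", personDf)  // no Mapping needed; schema comes from the DataFrame
writer.PutEdgeData(("Person", "knows", "Person"), knowsDf)
// graph info (GraphInfo / VertexInfo / EdgeInfo) is generated from the DataFrame schemas
writer.write("/tmp/graphar/", spark, "MyGraph")
```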

There are still some things we could optimize:

  • Use the .option(xxxx) style to configure the parameters of the write method, such as vertex_chunk_size and edge_chunk_size.
  • Or let the user provide a configuration file describing the parameters.
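If the .option(...) style were adopted, configuration might look like the following. This is purely hypothetical: the API does not exist yet, and the option names and values are illustrative.

```scala
// Hypothetical sketch of the .option(...) idea, in the style of Spark's DataFrameWriter.
writer
  .option("vertex_chunk_size", "262144")
  .option("edge_chunk_size", "4194304")   // illustrative value
  .option("file_type", "parquet")
  .write("/tmp/graphar/", spark, "MyGraph")
```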

Types of changes

What types of changes does your code introduce to GraphAr?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING doc
  • I have signed the CLA
  • Lint and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

Fixes #200
Fixes #201

@acezen acezen requested a review from lixueclaire July 10, 2023 06:09
@acezen acezen commented Jul 10, 2023

Hi @lixueclaire, could you please review this change and share your opinion?

@lixueclaire

> Hi @lixueclaire, could you please review this change and share your opinion?

Overall, it looks good to me. Have you tested the modified examples and checked the results? If you decide to change the Neo4j2GraphAr example, there may be some extra things that should be updated:

  1. the data generated by this example in gar-test
  2. the related documentation

Besides, since the data generated from reading Neo4j has changed, the GraphAr2Neo4j example may not work now.

@acezen acezen commented Jul 10, 2023

> Overall, it looks good to me. Have you tested the modified examples and checked the results? If you decide to change the Neo4j2GraphAr example, there may be some extra things that should be updated:
>
> 1. the data generated by this example in gar-test
> 2. the related documentation
>
> Besides, since the data generated from reading Neo4j has changed, the GraphAr2Neo4j example may not work now.

Yes, I will update the documentation and the Neo4j case once we are all good with the API change.

@acezen acezen changed the title [WIP][Spark] Refine the GraphWriter to automatically generate graph info base the… [Spark] Refine the GraphWriter to automatically generate graph info base the… Jul 11, 2023
@@ -94,7 +94,11 @@ class VertexReader(prefix: String, vertexInfo: VertexInfo, spark: SparkSession)
val pg0: PropertyGroup = propertyGroups.get(0)
val df0 = readVertexPropertyGroup(pg0, false)
if (len == 1) {
return df0
if (addIndex) {
return IndexGenerator.generateVertexIndexColumn(df0)
@acezen acezen Jul 14, 2023

NB: this fixes a bug where no index column was generated in the result when the property group contains only one property.

@@ -86,6 +86,19 @@ object IndexGenerator {
spark.createDataFrame(rdd_with_index, schema_with_index)
}

def generateVertexIndexColumnAndIndexMapping(vertexDf: DataFrame, primaryKey: String = ""): (DataFrame, DataFrame) = {
@acezen acezen Jul 14, 2023

NB: this method is added to generate the index column and the index mapping in one shot, avoiding regenerating the index mapping when processing edges.
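In other words, the one-shot method might be used as follows. The method name and the (DataFrame, DataFrame) return type come from the diff above; the variable names and the meaning of each returned DataFrame are inferred from the comment and are assumptions.

```scala
// Sketch: generate the index column and the vertex-to-index mapping together,
// so that edge processing can reuse the mapping instead of rebuilding it.
val (indexedVertexDf, indexMapping) =
  IndexGenerator.generateVertexIndexColumnAndIndexMapping(vertexDf, primaryKey = "id")
// indexedVertexDf: vertexDf plus the generated index column
// indexMapping: primary key -> generated index; pass it along when writing edges
// to translate src/dst ids without scanning the vertices a second time
```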

val df = DataFrameConcat.concat(adjList_df, properties_df)
val property_groups = edgeInfo.getPropertyGroups(adjListType)
val df = if (property_groups.size == 0) {
adjList_df
@acezen acezen Jul 14, 2023

NB: this fixes a bug where, when there is only an adjacency list and no property group, the concatenation in line 339 fails because the partition numbers of adjList_df and properties_df differ (properties_df is an empty DataFrame with 0 partitions).

@acezen acezen force-pushed the spark-graph-writer-refine branch from 7f70ab2 to 06a8bc7 Compare July 14, 2023 09:43
.save()
})

def main(args: Array[String]): Unit = {

fix format


- <id> the internal Neo4j ID
- <labels> a list of labels for that node
def main(args: Array[String]): Unit = {

fix format


putVertexDataIntoNeo4j(graphInfo, vertexData, spark)
putEdgeDataIntoNeo4j(graphInfo, vertexData, edgeData, spark)
}

See `GraphAr2Neo4j.scala`_ for the complete example.

Can you add some descriptions about how to write in different modes (i.e., how to modify the example)?
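For context, writing to Neo4j in different save modes with the Neo4j Spark connector generally follows this pattern. This is a generic connector sketch for illustration, not code from this PR.

```scala
// Generic Neo4j Spark connector usage, for illustration only.
import org.apache.spark.sql.SaveMode

vertexDf.write
  .format("org.neo4j.spark.DataSource")
  .mode(SaveMode.Overwrite)        // or SaveMode.Append / SaveMode.ErrorIfExists
  .option("labels", ":Person")     // target node label
  .option("node.keys", "name")     // key property used to match existing nodes in Overwrite mode
  .save()
```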

@@ -480,6 +480,29 @@ class EdgeInfo() {
def getConcatKey(): String = {
return getSrc_label + GeneralParams.regularSeperator + getEdge_label + GeneralParams.regularSeperator + getDst_label
}

/** Dump to Json string. */

Yaml?

@@ -223,6 +223,25 @@ class VertexInfo() {
}
return prefix + str
}

/** Dump to Json string. */

Yaml

@@ -34,20 +35,20 @@ object GraphReader {
private def readAllVertices(prefix: String, vertexInfos: Map[String, VertexInfo], spark: SparkSession): Map[String, DataFrame] = {
val vertex_dataframes: Map[String, DataFrame] = vertexInfos.map { case (label, vertexInfo) => {
val reader = new VertexReader(prefix, vertexInfo, spark)
(label, reader.readAllVertexPropertyGroups(false))
(label, reader.readAllVertexPropertyGroups(true))

Maybe adding a parameter for adding the index is required for this function.

def write(path: String,
spark: SparkSession,
name: String = "graph",
vertex_chunk_size: Long = 262144, // 2^18

Add the default value into general parameters?
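One way to read this suggestion, with hypothetical names: only the vertex default 262144 appears in the diff above; the edge value and member names are illustrative.

```scala
// Hypothetical: hoist the write defaults into GeneralParams so they live in one place.
object GeneralParams {
  val defaultVertexChunkSize: Long = 262144   // 2^18, the current default in write()
  val defaultEdgeChunkSize: Long   = 4194304  // 2^22, illustrative only
}
```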


object Neo4j2GraphAr {

def main(args: Array[String]): Unit = {

It would be better to add some comments describing the args.
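The requested comments might look like this: args(2) and args(3) follow the snippet below, while the meanings of args(0) and args(1) are inferred and are assumptions.

```scala
def main(args: Array[String]): Unit = {
  // args(0): output directory for the generated GraphAr files (assumed)
  // args(1): vertex chunk size (assumed)
  // args(2): edge chunk size
  // args(3): file type of the chunks (e.g., csv, orc or parquet)
  ...
}
```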

val edgeChunkSize: Long = args(2).toLong
val fileType: String = args(3)

writer.write(outputPath, spark, "MovieGraph", vertexChunkSize, edgeChunkSize, fileType)

Can you help to update the test data?

@acezen acezen Jul 17, 2023

The CI and the case no longer need to read from the gar-test repo. But if you think we still need to add a copy of the test data, I will update it.

@lixueclaire lixueclaire Jul 17, 2023

> The CI and the case no longer need to read from the gar-test repo.

We did not need to read from gar-test previously, either. The test data in gar-test is there to showcase the data generated by your example. Currently, that data is inconsistent with the new examples, which would be a little confusing.


object GraphAr2Neo4j {

def main(args: Array[String]): Unit = {

Add some comments for args.

@acezen acezen force-pushed the spark-graph-writer-refine branch from 697a657 to 33d375d Compare July 17, 2023 11:48
@acezen
Copy link
Contributor Author

acezen commented Jul 17, 2023

Thanks a lot for the review @lixueclaire . I agree with all your comments and made the necessary changes.

@lixueclaire lixueclaire left a comment

LGTM~ I believe it's a prime example of GraphAr's capabilities. Thank you for making this change!

@acezen acezen changed the title [Spark] Refine the GraphWriter to automatically generate graph info base the… [Spark] Refine the GraphWriter to automatically generate graph info and improve the Neo4j case Jul 18, 2023
@acezen acezen merged commit 11ebf37 into apache:main Jul 18, 2023
5 checks passed
@acezen acezen deleted the spark-graph-writer-refine branch July 18, 2023 02:18