
Error writing to ArangoGraph (cloud) #52

@aldoorozco


Setup:

  • ArangoGraph (oneshard model, 3 x 4GB)
  • Google Dataproc version 2.1 (Spark version 3.3.2, Scala 2.12)
  • Pyspark (Python 3.10)
  • ArangoDB Spark Connector version 1.6.0

Description:

I'm reading data from Google BigQuery and want to load it into ArangoDB using Apache Spark. I confirmed that the data can be read properly from BigQuery. I followed the instructions in this document, but I'm getting the following error while writing:

Traceback (most recent call last):
  File "/home/test.py", line 60, in <module>
    .save()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 966, in save
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o95.save.
: java.lang.NoSuchMethodError: 'scala.collection.immutable.ArraySeq scala.runtime.ScalaRunTime$.wrapRefArray(java.lang.Object[])'
        at org.apache.spark.sql.arangodb.datasource.ArangoTable.capabilities(ArangoTable.scala:27)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Implicits$TableHelper.supports(DataSourceV2Implicits.scala:95)
        at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:297)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:247)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.base/java.lang.Thread.run(Thread.java:829)

The dataframe contains 61K rows and has the following schema:

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)

This is the sample code I'm using:

df = spark.read.format("bigquery")...

logging.info(f"Table <table> contains {df.count():,} rows")

(
    df.write.mode("overwrite")
    .format("com.arangodb.spark")
    .option("table", "<table_name>")
    .option("endpoints", "<endpoint>")
    .option("user", "root")
    .option("password", "<password>")
    .save() # fails here
)

I'm submitting the job as follows:

spark-submit --master yarn --packages com.arangodb:arangodb-spark-datasource-3.3_2.12:1.6.0 /home/arangodb_test.py
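A `NoSuchMethodError` on `scala.runtime.ScalaRunTime$.wrapRefArray` returning `scala.collection.immutable.ArraySeq` is the Scala 2.13 signature of that method, so it can indicate a Scala binary-version mismatch between the connector artifact on the classpath and the Spark runtime. As a minimal sketch (the helper below is hypothetical, not part of the connector), one can at least confirm that the Scala suffix in the `--packages` coordinate matches the cluster's Scala version:

```python
def scala_suffix(coordinate: str) -> str:
    """Extract the Scala binary version from a Maven coordinate such as
    'com.arangodb:arangodb-spark-datasource-3.3_2.12:1.6.0'."""
    # A coordinate is group:artifact:version; the Scala suffix follows
    # the last underscore in the artifact id.
    artifact = coordinate.split(":")[1]
    return artifact.rsplit("_", 1)[1]

package = "com.arangodb:arangodb-spark-datasource-3.3_2.12:1.6.0"
cluster_scala = "2.12"  # Dataproc 2.1 ships Spark 3.3.2 built for Scala 2.12

# If these differ, a NoSuchMethodError at runtime is expected.
assert scala_suffix(package) == cluster_scala
```

The coordinate above matches the cluster, so if the mismatch is real, a stale or conflicting connector jar elsewhere on the classpath would be the thing to look for.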

I'm wondering if anyone has seen this error. I've tried previous versions of the connector, but to no avail.

Happy to share more details if needed

Thanks,
Aldo
