[SYSTEMML-2523] Update SystemML to Support Spark 2.3.0 by niketanpansare · Pull Request #857 · apache/systemds

niketanpansare · 2019-03-19T20:42:08Z

Spark 2.3 (released on February 28, 2018) has updated the Antlr version from 4.3 to 4.7, which throws a warning every time we invoke SystemML.

@bertholdreinwald @romeokienzler @mboehm7 @prithvirajsen @nakul02 @j143 Let's use this PR for discussion and raising potential concerns.

Spark 2.3 (released on February 28, 2018) has updated the Antlr version from 4.3 to 4.7, which throws a warning every time we invoke SystemML.

romeokienzler · 2019-03-19T21:26:44Z

@niketanpansare just trying out using niketanpansare:update_spark23

romeokienzler · 2019-03-19T21:32:27Z

@niketanpansare getting the same error on niketanpansare:update_spark23

Py4JJavaError Traceback (most recent call last)
in ()
----> 1 df.createOrReplaceTempView("df")

/opt/ibm/spark/python/pyspark/sql/dataframe.py in createOrReplaceTempView(self, name)
174
175 """
--> 176 self._jdf.createOrReplaceTempView(name)
177
178 @SInCE(2.1)

/opt/ibm/spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py in call(self, *args)
1158 answer = self.gateway_client.send_command(command)
1159 return_value = get_return_value(
-> 1160 answer, self.gateway_client, self.target_id, self.name)
1161
1162 for temp_arg in temp_args:

/opt/ibm/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()

/opt/ibm/spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
318 raise Py4JJavaError(
319 "An error occurred while calling {0}{1}{2}.\n".
--> 320 format(target_id, ".", name), value)
321 else:
322 raise Py4JError(

Py4JJavaError: An error occurred while calling o121.createOrReplaceTempView.
: java.lang.ExceptionInInitializerError
at java.lang.J9VMInternals.ensureError(J9VMInternals.java:146)
at java.lang.J9VMInternals.recordInitializationFailure(J9VMInternals.java:135)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:84)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableIdentifier(ParseDriver.scala:49)
at org.apache.spark.sql.Dataset.createTempViewCommand(Dataset.scala:3079)
at org.apache.spark.sql.Dataset.createOrReplaceTempView(Dataset.scala:3034)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:90)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
at java.lang.reflect.Method.invoke(Method.java:508)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:811)
Caused by: java.lang.UnsupportedOperationException: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with UUID 59627784-3be5-417a-b9eb-8131a7286089 (expected aadb8d7e-aeef-4415-ad2b-8204d6cf042e or a legacy UUID).
at org.antlr.v4.runtime.atn.ATNDeserializer.deserialize(ATNDeserializer.java:153)
at org.apache.spark.sql.catalyst.parser.SqlBaseLexer.(SqlBaseLexer.java:1153)
... 16 more
Caused by: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with UUID 59627784-3be5-417a-b9eb-8131a7286089 (expected aadb8d7e-aeef-4415-ad2b-8204d6cf042e or a legacy UUID).
... 18 more

mboehm7 · 2019-03-19T22:12:40Z

here is the original discussion from March 2018:
https://www.mail-archive.com/dev@systemml.apache.org/msg00571.html
https://www.mail-archive.com/dev@systemml.apache.org/msg00579.html

Back then, we tried updating ANTLR but had to revert it because it would not run on Spark 2.2 and before. So we decided to rather accept the warning and run on all versions.

niketanpansare · 2019-03-19T22:58:26Z

Thanks @mboehm7 ... I forgot about the discussion. Since Spark 2.4 is already released and Spark 2.3 is almost a year old, we can consider option 1: we directly release for Spark 2.3 and drop 2.2 and 2.1.

niketanpansare · 2019-03-19T23:33:03Z

@romeokienzler Can you please double-check the jar? It is working for me. Sometimes, the jar compiled using eclipse that uses a different version of Antlr can also cause these issues. Please try either of the following commands again:

Use the jar I compiled:

pip install https://github.com/niketanpansare/future_of_data/blob/master/systemml-1.3.0-SNAPSHOT-python.tar.gz?raw=true

OR compile from command-line:

git clone https://github.com/niketanpansare/systemml.git
cd systemml
git checkout update_spark23
mvn package -P distribution

romeokienzler · 2019-03-20T15:26:47Z

@niketanpansare I love you!!! You've made my day bro! It's working fine.

I've missed the
git checkout update_spark23
command, stupid me :)

So I confirm the JAR is working fine for me and you've saved me and the coursera community by making SystemML running on Spark 2.3, thanks a lot!

IMHO you can merge this PR but, of course, I have nothing to say here.

niketanpansare · 2019-03-20T17:33:42Z

The test failed on travis with The job exceeded the maximum log length, and has been terminated. but succeeds locally. This is because maven prints the progress status:

Downloading from central: http://repo.maven.apache.org/maven2/org/apache/spark/spark-parent_2.11/2.3.0/spark-parent_2.11-2.3.0.pom
Progress (1): 4.1/102 kB
Progress (1): 7.7/102 kB
Progress (1): 12/102 kB 
Progress (1): 16/102 kB
Progress (1): 20/102 kB

ghost · 2019-03-20T19:04:31Z

My two cents:
When releasing SystemML-1.3.0, kindly distribute one more jar with Spark-2.2 support, since there is good support for 2.2, and many companies are not that faster when it comes for an upgrade.

snippet source: spark downloads page

Also, I am testing this change and will let you know soon. :)

niketanpansare · 2019-03-20T20:20:45Z

@j143-bot The main problem with that is maintainability. We either (1) have a separate tests for Spark 2.1 and latest version or (2) re-run the tests before release (and push all the fixes in a separate branch), or (3) distribute spark 2.1 jar but don't test it (really bad option).

If we don't reach consensus, I am okay with keeping this PR open for people facing issues with SystemML + Spark 2.3.

ghost · 2019-03-20T21:20:18Z

+1
SystemML +Spark 2.3 build is smooth.

niketanpansare · 2019-03-21T20:43:37Z

@romeokienzler You are getting the error because the setup contains two SystemML (possibly conflicting dependencies) jars. There are two possible solutions to your problem:

Recommended: Remove the older incubating jar and do not include the corresponding 1.2.0 or 1.3.0-snapshot jars (i.e. no need for ln -s trick).
Use the python package compiled by this PR.

Since there is something weird happening here, I am including the logs. I apologize it advance for the long trace. Please ignore the below logs if you agree to the above statements.

Setup 1. With only incubating jar (FAILS !!)

$ ~/spark-2.3.0-bin-hadoop2.7/bin/pyspark --driver-memory 20g --master local[*] --driver-class-path systemml-0.14.0-incubating.jar
Python 3.6.3 (default, Mar 20 2018, 13:50:41) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
2019-03-21 13:07:11 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Python version 3.6.3 (default, Mar 20 2018 13:50:41)
SparkSession available as 'spark'.
>>> from systemml import MLContext
>>> ml = MLContext(spark)
2019-03-21 13:07:20 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException

Welcome to Apache SystemML!

>>> ml.version()
'0.14.0-incubating'
>>> df=spark.read.parquet('shake.parquet')
>>> df.show()
+-----+---------+-----+-----+-----+
|CLASS| SENSORID|    X|    Y|    Z|
+-----+---------+-----+-----+-----+
|    2| qqqqqqqq| 0.12| 0.12| 0.12|
|    2|aUniqueID| 0.03| 0.03| 0.03|
|    2| qqqqqqqq|-3.84|-3.84|-3.84|
|    2| 12345678| -0.1| -0.1| -0.1|
|    2| 12345678|-0.15|-0.15|-0.15|
|    2| 12345678| 0.47| 0.47| 0.47|
|    2| 12345678|-0.06|-0.06|-0.06|
|    2| 12345678|-0.09|-0.09|-0.09|
|    2| 12345678| 0.21| 0.21| 0.21|
|    2| 12345678|-0.08|-0.08|-0.08|
|    2| 12345678| 0.44| 0.44| 0.44|
|    2|    gholi| 0.76| 0.76| 0.76|
|    2|    gholi| 1.62| 1.62| 1.62|
|    2|    gholi| 5.81| 5.81| 5.81|
|    2| bcbcbcbc| 0.58| 0.58| 0.58|
|    2| bcbcbcbc|-8.24|-8.24|-8.24|
|    2| bcbcbcbc|-0.45|-0.45|-0.45|
|    2| bcbcbcbc| 1.03| 1.03| 1.03|
|    2|aUniqueID|-0.05|-0.05|-0.05|
|    2| qqqqqqqq|-0.44|-0.44|-0.44|
+-----+---------+-----+-----+-----+
only showing top 20 rows

>>> df.createOrReplaceTempView("df")
ANTLR Tool version 4.7 used for code generation does not match the current runtime version 4.5.3ANTLR Runtime version 4.7 used for parser compilation does not match the current runtime version 4.5.3Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/npansar/spark-2.3.0-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 176, in createOrReplaceTempView
    self._jdf.createOrReplaceTempView(name)
  File "/home/npansar/spark-2.3.0-bin-hadoop2.7/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/home/npansar/spark-2.3.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/npansar/spark-2.3.0-bin-hadoop2.7/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o52.createOrReplaceTempView.
: java.lang.ExceptionInInitializerError
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:84)
	at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableIdentifier(ParseDriver.scala:49)
	at org.apache.spark.sql.Dataset.createTempViewCommand(Dataset.scala:3079)
	at org.apache.spark.sql.Dataset.createOrReplaceTempView(Dataset.scala:3034)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsupportedOperationException: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with UUID 59627784-3be5-417a-b9eb-8131a7286089 (expected aadb8d7e-aeef-4415-ad2b-8204d6cf042e or a legacy UUID).
	at org.antlr.v4.runtime.atn.ATNDeserializer.deserialize(ATNDeserializer.java:153)
	at org.apache.spark.sql.catalyst.parser.SqlBaseLexer.<clinit>(SqlBaseLexer.java:1153)
	... 16 more
Caused by: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with UUID 59627784-3be5-417a-b9eb-8131a7286089 (expected aadb8d7e-aeef-4415-ad2b-8204d6cf042e or a legacy UUID).
	... 18 more

>>>

Setup 2: Put the older incubating jar before the current SystemML 1.2.0 jars (FAILS !!)

$ ~/spark-2.3.0-bin-hadoop2.7/bin/pyspark --driver-memory 20g --master local[*] --driver-class-path systemml-0.14.0-incubating.jar:systemml-1.2.0-extra.jar:systemml-1.2.0.jar
Python 3.6.3 (default, Mar 20 2018, 13:50:41) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
2019-03-21 13:12:11 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Python version 3.6.3 (default, Mar 20 2018 13:50:41)
SparkSession available as 'spark'.
>>> from systemml import MLContext
>>> ml = MLContext(spark)
2019-03-21 13:12:21 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException

Welcome to Apache SystemML!

>>> ml.version()
'0.14.0-incubating'
>>> df=spark.read.parquet('shake.parquet')
>>> df.show()
+-----+---------+-----+-----+-----+
|CLASS| SENSORID|    X|    Y|    Z|
+-----+---------+-----+-----+-----+
|    2| qqqqqqqq| 0.12| 0.12| 0.12|
|    2|aUniqueID| 0.03| 0.03| 0.03|
|    2| qqqqqqqq|-3.84|-3.84|-3.84|
|    2| 12345678| -0.1| -0.1| -0.1|
|    2| 12345678|-0.15|-0.15|-0.15|
|    2| 12345678| 0.47| 0.47| 0.47|
|    2| 12345678|-0.06|-0.06|-0.06|
|    2| 12345678|-0.09|-0.09|-0.09|
|    2| 12345678| 0.21| 0.21| 0.21|
|    2| 12345678|-0.08|-0.08|-0.08|
|    2| 12345678| 0.44| 0.44| 0.44|
|    2|    gholi| 0.76| 0.76| 0.76|
|    2|    gholi| 1.62| 1.62| 1.62|
|    2|    gholi| 5.81| 5.81| 5.81|
|    2| bcbcbcbc| 0.58| 0.58| 0.58|
|    2| bcbcbcbc|-8.24|-8.24|-8.24|
|    2| bcbcbcbc|-0.45|-0.45|-0.45|
|    2| bcbcbcbc| 1.03| 1.03| 1.03|
|    2|aUniqueID|-0.05|-0.05|-0.05|
|    2| qqqqqqqq|-0.44|-0.44|-0.44|
+-----+---------+-----+-----+-----+
only showing top 20 rows

>>> df.createOrReplaceTempView("df")
ANTLR Tool version 4.7 used for code generation does not match the current runtime version 4.5.3ANTLR Runtime version 4.7 used for parser compilation does not match the current runtime version 4.5.3Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/npansar/spark-2.3.0-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 176, in createOrReplaceTempView
    self._jdf.createOrReplaceTempView(name)
  File "/home/npansar/spark-2.3.0-bin-hadoop2.7/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/home/npansar/spark-2.3.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/npansar/spark-2.3.0-bin-hadoop2.7/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o52.createOrReplaceTempView.
: java.lang.ExceptionInInitializerError
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:84)
	at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableIdentifier(ParseDriver.scala:49)
	at org.apache.spark.sql.Dataset.createTempViewCommand(Dataset.scala:3079)
	at org.apache.spark.sql.Dataset.createOrReplaceTempView(Dataset.scala:3034)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsupportedOperationException: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with UUID 59627784-3be5-417a-b9eb-8131a7286089 (expected aadb8d7e-aeef-4415-ad2b-8204d6cf042e or a legacy UUID).
	at org.antlr.v4.runtime.atn.ATNDeserializer.deserialize(ATNDeserializer.java:153)
	at org.apache.spark.sql.catalyst.parser.SqlBaseLexer.<clinit>(SqlBaseLexer.java:1153)
	... 16 more
Caused by: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with UUID 59627784-3be5-417a-b9eb-8131a7286089 (expected aadb8d7e-aeef-4415-ad2b-8204d6cf042e or a legacy UUID).
	... 18 more

>>>

Setup 3: Put the the current SystemML 1.2.0 jars before the older incubating jar (FAILS !!)

$ ~/spark-2.3.0-bin-hadoop2.7/bin/pyspark --driver-memory 20g --master local[*] --driver-class-path systemml-1.2.0-extra.jar:systemml-1.2.0.jar:systemml-0.14.0-incubating.jar
Python 3.6.3 (default, Mar 20 2018, 13:50:41) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
2019-03-21 13:14:49 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Python version 3.6.3 (default, Mar 20 2018 13:50:41)
SparkSession available as 'spark'.
>>> from systemml import MLContext
>>> ml = MLContext(spark)
2019-03-21 13:15:11 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException

Welcome to Apache SystemML!
Version 1.2.0
>>> ml.version()
'1.2.0'
>>> df=spark.read.parquet('shake.parquet')
>>> df.show()
+-----+---------+-----+-----+-----+
|CLASS| SENSORID|    X|    Y|    Z|
+-----+---------+-----+-----+-----+
|    2| qqqqqqqq| 0.12| 0.12| 0.12|
|    2|aUniqueID| 0.03| 0.03| 0.03|
|    2| qqqqqqqq|-3.84|-3.84|-3.84|
|    2| 12345678| -0.1| -0.1| -0.1|
|    2| 12345678|-0.15|-0.15|-0.15|
|    2| 12345678| 0.47| 0.47| 0.47|
|    2| 12345678|-0.06|-0.06|-0.06|
|    2| 12345678|-0.09|-0.09|-0.09|
|    2| 12345678| 0.21| 0.21| 0.21|
|    2| 12345678|-0.08|-0.08|-0.08|
|    2| 12345678| 0.44| 0.44| 0.44|
|    2|    gholi| 0.76| 0.76| 0.76|
|    2|    gholi| 1.62| 1.62| 1.62|
|    2|    gholi| 5.81| 5.81| 5.81|
|    2| bcbcbcbc| 0.58| 0.58| 0.58|
|    2| bcbcbcbc|-8.24|-8.24|-8.24|
|    2| bcbcbcbc|-0.45|-0.45|-0.45|
|    2| bcbcbcbc| 1.03| 1.03| 1.03|
|    2|aUniqueID|-0.05|-0.05|-0.05|
|    2| qqqqqqqq|-0.44|-0.44|-0.44|
+-----+---------+-----+-----+-----+
only showing top 20 rows

>>> df.createOrReplaceTempView("df")
ANTLR Tool version 4.7 used for code generation does not match the current runtime version 4.5.3ANTLR Runtime version 4.7 used for parser compilation does not match the current runtime version 4.5.3Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/npansar/spark-2.3.0-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 176, in createOrReplaceTempView
    self._jdf.createOrReplaceTempView(name)
  File "/home/npansar/spark-2.3.0-bin-hadoop2.7/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/home/npansar/spark-2.3.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/npansar/spark-2.3.0-bin-hadoop2.7/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o52.createOrReplaceTempView.
: java.lang.ExceptionInInitializerError
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:84)
	at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableIdentifier(ParseDriver.scala:49)
	at org.apache.spark.sql.Dataset.createTempViewCommand(Dataset.scala:3079)
	at org.apache.spark.sql.Dataset.createOrReplaceTempView(Dataset.scala:3034)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsupportedOperationException: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with UUID 59627784-3be5-417a-b9eb-8131a7286089 (expected aadb8d7e-aeef-4415-ad2b-8204d6cf042e or a legacy UUID).
	at org.antlr.v4.runtime.atn.ATNDeserializer.deserialize(ATNDeserializer.java:153)
	at org.apache.spark.sql.catalyst.parser.SqlBaseLexer.<clinit>(SqlBaseLexer.java:1153)
	... 16 more
Caused by: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with UUID 59627784-3be5-417a-b9eb-8131a7286089 (expected aadb8d7e-aeef-4415-ad2b-8204d6cf042e or a legacy UUID).
	... 18 more

>>>

Setup 4: Put the jar from the PR before the older incubating jar (SUCCEEDS !!)

$ ~/spark-2.3.0-bin-hadoop2.7/bin/pyspark --driver-memory 20g --master local[*] --driver-class-path systemml-1.3.0-SNAPSHOT-extra-pr.jar:systemml-1.3.0-SNAPSHOT-pr.jar:systemml-0.14.0-incubating.jar
Python 3.6.3 (default, Mar 20 2018, 13:50:41) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
2019-03-21 13:19:59 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Python version 3.6.3 (default, Mar 20 2018 13:50:41)
SparkSession available as 'spark'.
>>> from systemml import MLContext
>>> ml = MLContext(spark)
2019-03-21 13:20:22 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException

Welcome to Apache SystemML!
Version 1.3.0-SNAPSHOT
>>> ml.version()
'1.3.0-SNAPSHOT'
>>> df=spark.read.parquet('shake.parquet')
>>> df.show()
+-----+---------+-----+-----+-----+
|CLASS| SENSORID|    X|    Y|    Z|
+-----+---------+-----+-----+-----+
|    2| qqqqqqqq| 0.12| 0.12| 0.12|
|    2|aUniqueID| 0.03| 0.03| 0.03|
|    2| qqqqqqqq|-3.84|-3.84|-3.84|
|    2| 12345678| -0.1| -0.1| -0.1|
|    2| 12345678|-0.15|-0.15|-0.15|
|    2| 12345678| 0.47| 0.47| 0.47|
|    2| 12345678|-0.06|-0.06|-0.06|
|    2| 12345678|-0.09|-0.09|-0.09|
|    2| 12345678| 0.21| 0.21| 0.21|
|    2| 12345678|-0.08|-0.08|-0.08|
|    2| 12345678| 0.44| 0.44| 0.44|
|    2|    gholi| 0.76| 0.76| 0.76|
|    2|    gholi| 1.62| 1.62| 1.62|
|    2|    gholi| 5.81| 5.81| 5.81|
|    2| bcbcbcbc| 0.58| 0.58| 0.58|
|    2| bcbcbcbc|-8.24|-8.24|-8.24|
|    2| bcbcbcbc|-0.45|-0.45|-0.45|
|    2| bcbcbcbc| 1.03| 1.03| 1.03|
|    2|aUniqueID|-0.05|-0.05|-0.05|
|    2| qqqqqqqq|-0.44|-0.44|-0.44|
+-----+---------+-----+-----+-----+
only showing top 20 rows

>>> df.createOrReplaceTempView("df")
>>>

Setup 5: No jar provided (SUCCEEDS !!)

$ ~/spark-2.3.0-bin-hadoop2.7/bin/pyspark --driver-memory 20g --master local[*]
Python 3.6.3 (default, Mar 20 2018, 13:50:41) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
2019-03-21 13:23:26 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Python version 3.6.3 (default, Mar 20 2018 13:50:41)
SparkSession available as 'spark'.
>>> from systemml import MLContext
>>> ml = MLContext(spark)
2019-03-21 13:23:46 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException

Welcome to Apache SystemML!
Version 1.2.0
>>> ml.version()
'1.2.0'
>>> df=spark.read.parquet('shake.parquet')
>>> df.show()
+-----+---------+-----+-----+-----+
|CLASS| SENSORID|    X|    Y|    Z|
+-----+---------+-----+-----+-----+
|    2| qqqqqqqq| 0.12| 0.12| 0.12|
|    2|aUniqueID| 0.03| 0.03| 0.03|
|    2| qqqqqqqq|-3.84|-3.84|-3.84|
|    2| 12345678| -0.1| -0.1| -0.1|
|    2| 12345678|-0.15|-0.15|-0.15|
|    2| 12345678| 0.47| 0.47| 0.47|
|    2| 12345678|-0.06|-0.06|-0.06|
|    2| 12345678|-0.09|-0.09|-0.09|
|    2| 12345678| 0.21| 0.21| 0.21|
|    2| 12345678|-0.08|-0.08|-0.08|
|    2| 12345678| 0.44| 0.44| 0.44|
|    2|    gholi| 0.76| 0.76| 0.76|
|    2|    gholi| 1.62| 1.62| 1.62|
|    2|    gholi| 5.81| 5.81| 5.81|
|    2| bcbcbcbc| 0.58| 0.58| 0.58|
|    2| bcbcbcbc|-8.24|-8.24|-8.24|
|    2| bcbcbcbc|-0.45|-0.45|-0.45|
|    2| bcbcbcbc| 1.03| 1.03| 1.03|
|    2|aUniqueID|-0.05|-0.05|-0.05|
|    2| qqqqqqqq|-0.44|-0.44|-0.44|
+-----+---------+-----+-----+-----+
only showing top 20 rows

>>> df.createOrReplaceTempView("df")
>>>

Setup 6: Provide just 1.2.0 jars (FAILS !!)

$ ~/spark-2.3.0-bin-hadoop2.7/bin/pyspark --driver-memory 20g --master local[*] --driver-class-path systemml-1.2.0
systemml-1.2.0-extra.jar  systemml-1.2.0.jar        
[npansar@dml3 debug_classpath]$ ~/spark-2.3.0-bin-hadoop2.7/bin/pyspark --driver-memory 20g --master local[*] --driver-class-path systemml-1.2.0.jar:systemml-1.2.0-extra.jar
Python 3.6.3 (default, Mar 20 2018, 13:50:41) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
2019-03-21 13:32:09 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Python version 3.6.3 (default, Mar 20 2018 13:50:41)
SparkSession available as 'spark'.
>>> from systemml import MLContext
>>> ml = MLContext(spark)
2019-03-21 13:32:25 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException

Welcome to Apache SystemML!
Version 1.2.0
>>> ml.version()
'1.2.0'
>>> df=spark.read.parquet('shake.parquet')
>>> df.show()
+-----+---------+-----+-----+-----+
|CLASS| SENSORID|    X|    Y|    Z|
+-----+---------+-----+-----+-----+
|    2| qqqqqqqq| 0.12| 0.12| 0.12|
|    2|aUniqueID| 0.03| 0.03| 0.03|
|    2| qqqqqqqq|-3.84|-3.84|-3.84|
|    2| 12345678| -0.1| -0.1| -0.1|
|    2| 12345678|-0.15|-0.15|-0.15|
|    2| 12345678| 0.47| 0.47| 0.47|
|    2| 12345678|-0.06|-0.06|-0.06|
|    2| 12345678|-0.09|-0.09|-0.09|
|    2| 12345678| 0.21| 0.21| 0.21|
|    2| 12345678|-0.08|-0.08|-0.08|
|    2| 12345678| 0.44| 0.44| 0.44|
|    2|    gholi| 0.76| 0.76| 0.76|
|    2|    gholi| 1.62| 1.62| 1.62|
|    2|    gholi| 5.81| 5.81| 5.81|
|    2| bcbcbcbc| 0.58| 0.58| 0.58|
|    2| bcbcbcbc|-8.24|-8.24|-8.24|
|    2| bcbcbcbc|-0.45|-0.45|-0.45|
|    2| bcbcbcbc| 1.03| 1.03| 1.03|
|    2|aUniqueID|-0.05|-0.05|-0.05|
|    2| qqqqqqqq|-0.44|-0.44|-0.44|
+-----+---------+-----+-----+-----+
only showing top 20 rows

>>> df.createOrReplaceTempView("df")
ANTLR Tool version 4.7 used for code generation does not match the current runtime version 4.5.3ANTLR Runtime version 4.7 used for parser compilation does not match the current runtime version 4.5.3Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/npansar/spark-2.3.0-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 176, in createOrReplaceTempView
    self._jdf.createOrReplaceTempView(name)
  File "/home/npansar/spark-2.3.0-bin-hadoop2.7/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/home/npansar/spark-2.3.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/npansar/spark-2.3.0-bin-hadoop2.7/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o52.createOrReplaceTempView.
: java.lang.ExceptionInInitializerError
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:84)
	at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableIdentifier(ParseDriver.scala:49)
	at org.apache.spark.sql.Dataset.createTempViewCommand(Dataset.scala:3079)
	at org.apache.spark.sql.Dataset.createOrReplaceTempView(Dataset.scala:3034)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsupportedOperationException: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with UUID 59627784-3be5-417a-b9eb-8131a7286089 (expected aadb8d7e-aeef-4415-ad2b-8204d6cf042e or a legacy UUID).
	at org.antlr.v4.runtime.atn.ATNDeserializer.deserialize(ATNDeserializer.java:153)
	at org.apache.spark.sql.catalyst.parser.SqlBaseLexer.<clinit>(SqlBaseLexer.java:1153)
	... 16 more
Caused by: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with UUID 59627784-3be5-417a-b9eb-8131a7286089 (expected aadb8d7e-aeef-4415-ad2b-8204d6cf042e or a legacy UUID).
	... 18 more

>>>

niketanpansare · 2019-03-21T21:40:25Z

Interestingly, running a similar code with 1.2.0 jars in spark-2.3.0../spark-shell succeeds, i.e. behaves similar to setup 5 rather than setup 6. Here is the Scala code used for testing:

val ml = new org.apache.sysml.api.mlcontext.MLContext(spark)
System.out.println(ml.version())
val df = spark.read.parquet("shake.parquet")
df.show()
df.createOrReplaceTempView("df")

Based on the above experiments, here are my thoughts:

We can continue to support older Spark 2.1 version and can get away with warning on Spark 2.3 in the following setups:

Invoked without any Spark SQL code
Part of Scala/Java pipeline (for example: if invoked via spark-shell)
With PySpark if and only if we recommend our users to not provide any jars in the driver-class-path or jars (see setup 5 and 6)

If we are uncomfortable with the above restriction, we should consider merging this PR.

Though I have validated that above Python code works with Spark 2.2.3 with a warning, I did not run exhaustive testing to guarantee backward compatibility support for older Spark 2.1 and 2.2 (with the exception of warning).

geektcp

LGTM

j143 · 2020-06-04T06:54:34Z

SystemML is working with latest spark version in 2.x series. Shall we close this?

cc @mboehm7

romeokienzler · 2020-06-04T07:52:23Z

I can confirm that it works on Spark 2.3 and Spark 2.4

Baunsgaard · 2020-07-14T19:33:26Z

Since we intend to release SystemDS 2.0 late August, it makes sense to update our spark versions as stated in [1].
The version to update to could be 2.3,1. but maybe we should aim for the latest one currently .2.4.6 [2].
Together with this update the Hadoop version could be bumped to 2.10.0 [3].

Do you, @niketanpansare , want to make this change in this PR, or should we open a new Task[4]/PR for this? it might be easier to start over since there have been multiple changes since the commits in this pr.

[1] https://mail-archives.apache.org/mod_mbox/systemds-dev/202007.mbox/%3Cf34cbccf-4c5a-479c-71ba-cff8810436d1%40gmail.com%3E
[2] https://spark.apache.org/downloads.html
[3] https://hadoop.apache.org/releases.html
[4] https://issues.apache.org/jira/browse/SYSTEMDS-2523

Bonus:

this would address open CVEs:

https://nvd.nist.gov/vuln/detail/CVE-2018-8024
https://nvd.nist.gov/vuln/detail/CVE-2018-1334

j143 · 2020-07-15T16:47:15Z

Niketan may not respond right now! Hi Sebastian, you are welcome to open a new PR referencing this PR, to update spark version. 🙂 Thank you, Janardhan

…

On Wed, 15 Jul, 2020, 01:03 Sebastian Baunsgaard, ***@***.***> wrote: Since we intend to release SystemDS 2.0 late August, it makes sense to update our spark versions as stated in [1]. The version to update to could be 2.3,1. but maybe we should aim for the latest one currently .2.4.6 [2]. Together with this update the Hadoop version could be bumped to 2.10.0 [3]. Do you, @niketanpansare <https://github.com/niketanpansare> , want to make this change in this PR, or should we open a new Task[4]/PR for this? it might be easier to start over since there have been multiple changes since the commits in this pr. [1] https://mail-archives.apache.org/mod_mbox/systemds-dev/202007.mbox/%3Cf34cbccf-4c5a-479c-71ba-cff8810436d1%40gmail.com%3E [2] https://spark.apache.org/downloads.html [3] https://hadoop.apache.org/releases.html [4] https://issues.apache.org/jira/browse/SYSTEMDS-2523 Bonus: this would address open CVEs: https://nvd.nist.gov/vuln/detail/CVE-2018-8024 https://nvd.nist.gov/vuln/detail/CVE-2018-1334 — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#857 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AMU4H42X3NPIQ4ILDJ6Y5YLR3SXJNANCNFSM4G7UJHJA> .

Baunsgaard · 2020-07-17T08:05:54Z

See #992 for continuation, closing this PR.

[SYSTEMML-2523] Update SystemML to Support Spark 2.3.0

0e4318b

Spark 2.3 (released on February 28, 2018) has updated the Antlr version from 4.3 to 4.7, which throws a warning every time we invoke SystemML.

Fixed compilation issues

1dc8982

update the documentation

d4d800e

geektcp reviewed Apr 27, 2020

View reviewed changes

Baunsgaard mentioned this pull request Jul 17, 2020

[SYSTEMDS-2523] Update to Spark 2.4.6 #992

Closed

Baunsgaard closed this Jul 17, 2020

Conversation

niketanpansare commented Mar 19, 2019

Uh oh!

romeokienzler commented Mar 19, 2019

Uh oh!

romeokienzler commented Mar 19, 2019

Uh oh!

mboehm7 commented Mar 19, 2019

Uh oh!

niketanpansare commented Mar 19, 2019

Uh oh!

niketanpansare commented Mar 19, 2019

Uh oh!

romeokienzler commented Mar 20, 2019

Uh oh!

niketanpansare commented Mar 20, 2019

Uh oh!

ghost commented Mar 20, 2019

Uh oh!

niketanpansare commented Mar 20, 2019

Uh oh!

ghost commented Mar 20, 2019

Uh oh!

niketanpansare commented Mar 21, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

niketanpansare commented Mar 21, 2019

Uh oh!

geektcp left a comment

Choose a reason for hiding this comment

Uh oh!

j143 commented Jun 4, 2020

Uh oh!

romeokienzler commented Jun 4, 2020

Uh oh!

Baunsgaard commented Jul 14, 2020

Uh oh!

j143 commented Jul 15, 2020 via email

Uh oh!

Baunsgaard commented Jul 17, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

niketanpansare commented Mar 21, 2019 •

edited

Loading