[SPARK-19134][EXAMPLE] Fix several sql, mllib and status api examples not working #16515

HyukjinKwon · 2017-01-09T10:39:37Z

What changes were proposed in this pull request?

binary_classification_metrics_example.py

The LibSVM datasource loads ml.linalg.SparseVector, whereas the example requires mllib.linalg.SparseVector. The equivalent Scala example, BinaryClassificationMetricsExample.scala, seems fine.

./bin/spark-submit examples/src/main/python/mllib/binary_classification_metrics_example.py

  File ".../spark/examples/src/main/python/mllib/binary_classification_metrics_example.py", line 39, in <lambda>
    .rdd.map(lambda row: LabeledPoint(row[0], row[1]))
  File ".../spark/python/pyspark/mllib/regression.py", line 54, in __init__
    self.features = _convert_to_vector(features)
  File ".../spark/python/pyspark/mllib/linalg/__init__.py", line 80, in _convert_to_vector
    raise TypeError("Cannot convert type %s into Vector" % type(l))
TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector

status_api_demo.py (this one does not work on Python 3.4.6)

The module is named queue in Python 3+.

PYSPARK_PYTHON=python3 ./bin/spark-submit examples/src/main/python/status_api_demo.py

Traceback (most recent call last):
  File ".../spark/examples/src/main/python/status_api_demo.py", line 22, in <module>
    import Queue
ImportError: No module named 'Queue'
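A common way to make such an import work on both Python 2 and Python 3 is a guarded import. The sketch below shows that pattern under the assumption that the rest of the script keeps using the old name Queue; it is not necessarily the exact change made in this pull request, and the helper run_in_background is hypothetical.

```python
import threading

# Python 2 ships the module as "Queue"; Python 3 renamed it to "queue".
try:
    import Queue  # Python 2
except ImportError:
    import queue as Queue  # Python 3, aliased to the old name


def run_in_background(func):
    """Run func in a daemon thread; return a queue that receives its result."""
    result_queue = Queue.Queue()

    def worker():
        result_queue.put(func())

    thread = threading.Thread(target=worker)
    thread.daemon = True
    thread.start()
    return result_queue


# Hypothetical usage: compute a value in the background and wait for it.
result = run_in_background(lambda: sum(range(10))).get(timeout=5)
print(result)
```

Aliasing the Python 3 module to the old name keeps every later Queue.Queue() call unchanged, which keeps the diff small.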

bisecting_k_means_example.py

BisectingKMeansModel does not implement save and load in Python.

./bin/spark-submit examples/src/main/python/mllib/bisecting_k_means_example.py

Traceback (most recent call last):
  File ".../spark/examples/src/main/python/mllib/bisecting_k_means_example.py", line 46, in <module>
    model.save(sc, path)
AttributeError: 'BisectingKMeansModel' object has no attribute 'save'

elementwise_product_example.py

It calls collect on a vector, which has no such method.

./bin/spark-submit examples/src/main/python/mllib/elementwise_product_example.py

Traceback (most recent call last):
  File ".../spark/examples/src/main/python/mllib/elementwise_product_example.py", line 48, in <module>
    for each in transformedData2.collect():
  File ".../spark/python/pyspark/mllib/linalg/__init__.py", line 478, in __getattr__
    return getattr(self.array, item)
AttributeError: 'numpy.ndarray' object has no attribute 'collect'

The following three examples appear to throw an exception because spark.sql.warehouse.dir is set to a relative path.

hive.py

./bin/spark-submit examples/src/main/python/sql/hive.py

Traceback (most recent call last):
  File ".../spark/examples/src/main/python/sql/hive.py", line 47, in <module>
    spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
  File ".../spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 541, in sql
  File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File ".../spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: 'org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:./spark-warehouse);'

SparkHiveExample.scala

./bin/run-example sql.hive.SparkHiveExample

Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:./spark-warehouse
	at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:498)
	at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:484)
	at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1668)

JavaSparkHiveExample.java

./bin/run-example sql.hive.JavaSparkHiveExample

Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:./spark-warehouse
	at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:498)
	at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:484)
	at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1668)
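The error messages above suggest that the warehouse location ends up as the relative URI file:./spark-warehouse. One way to avoid this class of error is to resolve the directory to an absolute path before passing it to the session builder. This is only a sketch under that assumption: the helper name absolute_warehouse_dir is hypothetical, and the builder call is shown as a comment because it needs a Spark installation with Hive support.

```python
import os


def absolute_warehouse_dir(relative_dir="spark-warehouse"):
    """Resolve a warehouse directory name against the current working directory."""
    return os.path.abspath(relative_dir)


warehouse_location = absolute_warehouse_dir()
print(warehouse_location)

# With PySpark available, the resolved path could then be passed as a config value:
#
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder \
#       .appName("Python Spark SQL Hive integration example") \
#       .config("spark.sql.warehouse.dir", warehouse_location) \
#       .enableHiveSupport() \
#       .getOrCreate()
```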

How was this patch tested?

Manually via

./bin/spark-submit examples/src/main/python/mllib/binary_classification_metrics_example.py

PYSPARK_PYTHON=python3 ./bin/spark-submit examples/src/main/python/status_api_demo.py

./bin/spark-submit examples/src/main/python/mllib/bisecting_k_means_example.py

./bin/spark-submit examples/src/main/python/mllib/elementwise_product_example.py

./bin/spark-submit examples/src/main/python/sql/hive.py

./bin/run-example sql.hive.JavaSparkHiveExample

./bin/run-example sql.hive.SparkHiveExample

These were found via

find ./examples/src/main/python -name "*.py" -exec spark-submit {} \;
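The find command above runs every script but does not summarize which ones failed. A minimal Python sketch of the same sweep that records exit codes could look like this. The function names are hypothetical, and the default command name spark-submit assumes that it is on the PATH.

```python
import os
import subprocess


def find_python_examples(root):
    """Return all *.py files under root, sorted for a stable order."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".py"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def run_examples(paths, submit_cmd=("spark-submit",)):
    """Run each script with submit_cmd and return a {path: exit code} dict."""
    results = {}
    for path in paths:
        results[path] = subprocess.call(list(submit_cmd) + [path])
    return results


# Hypothetical usage from the Spark source root:
#
#   codes = run_examples(find_python_examples("./examples/src/main/python"))
#   for path, code in sorted(codes.items()):
#       if code != 0:
#           print("FAILED (%d): %s" % (code, path))
```

A non-zero exit code only signals that a script failed; the traceback still has to be read from the console output to find the cause.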

HyukjinKwon · 2017-01-09T10:39:55Z

examples/src/main/python/mllib/binary_classification_metrics_example.py

-        .builder\
-        .appName("BinaryClassificationMetricsExample")\
-        .getOrCreate()
+    sc = SparkContext(appName="BinaryClassificationMetricsExample")


I just used SparkContext to be consistent with other examples.

Is the point that this is an .mllib example rather than .ml so should use the older API?

Yes, that is my understanding.

HyukjinKwon · 2017-01-09T10:40:07Z

@yanboliang Could I please ask you to take a look?

HyukjinKwon · 2017-01-09T11:19:45Z

Hm, actually, it seems there are more. Let me open a JIRA and sweep them.

SparkQA · 2017-01-09T12:02:09Z

Test build #71077 has finished for PR 16515 at commit 9a4fd40.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-09T12:05:34Z

Test build #71078 has finished for PR 16515 at commit 1f8c11e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-09T13:06:05Z

Test build #71080 has finished for PR 16515 at commit 1ce29fb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2017-01-10T10:19:38Z

LGTM. Thanks for catching this.
My network is having some problems, so I will try to merge later. Alternatively, @srowen can help merge this. Thanks.

HyukjinKwon · 2017-01-10T12:00:14Z

Thank you all!

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#16515 from HyukjinKwon/minor-example-fix.

Fix binary classification metrics example to work

9a4fd40

HyukjinKwon changed the title ~~[MINOR][PYTHON][EXAMPLE] Fix binary classification metrics example to work~~ [SPARK-19134][PYTHON][EXAMPLE] Fix several Python mllib and status api examples not working Jan 9, 2017

Add more examples

1f8c11e

HyukjinKwon changed the title ~~[SPARK-19134][PYTHON][EXAMPLE] Fix several Python mllib and status api examples not working~~ [WIP][SPARK-19134][PYTHON][SQL][EXAMPLE] Fix several Python mllib and status api examples not working Jan 9, 2017

Fix more

1ce29fb

HyukjinKwon changed the title ~~[WIP][SPARK-19134][PYTHON][SQL][EXAMPLE] Fix several Python mllib and status api examples not working~~ [SPARK-19134][EXAMPLE] Fix several sql, mllib and status api examples not working Jan 9, 2017

srowen approved these changes Jan 9, 2017

asfgit closed this in b0e5840 Jan 10, 2017

zhengruifeng mentioned this pull request Mar 31, 2017

[SPARK-13672] [ML] Add python examples of BisectingKMeans in ML and MLLIB #11515

Closed

HyukjinKwon deleted the minor-example-fix branch January 2, 2018 03:44
