[MAHOUT-1894] Add Support for Spark 2.x #271

Closed
wants to merge 3 commits

Conversation

rawkintrevo
Contributor

As long as we're sticking to Scala 2.10, running mahout on spark 2.x is simply a matter of

mvn clean package -Dspark.version=2.0.2
or
mvn clean package -Dspark.version=2.1.0

The trouble comes with the shell...

I checked Apache Zeppelin to see how they handle multiple Spark/Scala versions: a brief preview of the descent into hell that is maintaining a shell that supports multiple Spark/Scala versions.

So I took an alternate route: I dropped the Mahout shell altogether, changed the mahout bin file to launch the Spark shell directly, and passed it a Scala script that takes care of our imports.
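
For illustration, here is a minimal sketch of what such an init script could contain. The imports match what the shell prints on load (see the logs below); the explicit SparkDistributedContext construction is my assumption of how sdc gets created, not necessarily the exact code in this PR:

// Hypothetical bin/load-shell.scala: the setup the dedicated Mahout shell
// used to do, now run through the plain spark-shell's script-loading mechanism.
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

// Wrap the spark-shell's SparkContext (sc) in a Mahout distributed context
// so DRM operations have an engine to run on.
implicit val sdc: SparkDistributedContext = new SparkDistributedContext(sc)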

When building, there is a single deprecation warning regarding the sqlContext and how it is created in the spark-bindings.

I think we should add binaries for Spark 2.0 and Spark 2.1 as a matter of convenience and for the Zeppelin integration.

MAHOUT-1894 Add support for spark 2.x
@pferrel
Contributor

pferrel commented Feb 2, 2017

I'm soooo into dropping a special Mahout shell. Do your comments mean we just run Mahout classes in the Spark shell for Spark 2.x? Does this work both with and without Zeppelin (@andrewpalumbo's case)?

IF we can compile Mahout with Scala 2.11 fairly easily (excluding the shell) and IF we can run Mahout with some helper scripts in the Spark Shell, we can drop the Mahout Shell code and get all the advantages of using the plain Spark Shell with our extensions. Can/should this be done?

I realize I've asked these before but this seems the best forum.

@rawkintrevo
Contributor Author

@pferrel In short, yes. The idea here is that we drop the Mahout Shell entirely; it was also the blocker for upgrading to Spark 2.x.

The Zeppelin integration, for all intents and purposes, is a Spark shell plus some imports and setup of the distributed context.

So that is what we're doing here.

Hopefully removing the shell will also clear the way for the Scala 2.11 upgrade / profile.

@andrewpalumbo
Member

andrewpalumbo commented Feb 5, 2017

hmm.. just tried to launch into local[4] and blew it up:

AP-RE-X16743C45L:mahout apalumbo$ MASTER=local[4] mahout spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_102)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
Loading /Users/apalumbo/sandbox/mahout/bin/load-shell.scala...
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = org.apache.mahout.sparkbindings.SparkDistributedContext@73e0c775

                 _                 _
 _ __ ___   __ _| |__   ___  _   _| |_
| '_ ` _ \ / _` | '_ \ / _ \| | | | __|
| | | | | | (_| | | | | (_) | |_| | |_
|_| |_| |_|\__,_|_| |_|\___/ \__,_|\__|  version 0.13.0



Exception in thread "main" java.io.FileNotFoundException: spark-shell (Is a directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at scala.reflect.io.File.inputStream(File.scala:97)
	at scala.reflect.io.File.inputStream(File.scala:82)
	at scala.reflect.io.Streamable$Chars$class.reader(Streamable.scala:93)
	at scala.reflect.io.File.reader(File.scala:82)
	at scala.reflect.io.Streamable$Chars$class.bufferedReader(Streamable.scala:98)
	at scala.reflect.io.File.bufferedReader(File.scala:82)
	at scala.reflect.io.Streamable$Chars$class.bufferedReader(Streamable.scala:97)
	at scala.reflect.io.File.bufferedReader(File.scala:82)
	at scala.reflect.io.Streamable$Chars$class.applyReader(Streamable.scala:103)
	at scala.reflect.io.File.applyReader(File.scala:82)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$interpretAllFrom$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SparkILoop.scala:677)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$interpretAllFrom$1$$anonfun$apply$mcV$sp$1.apply(SparkILoop.scala:677)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$interpretAllFrom$1$$anonfun$apply$mcV$sp$1.apply(SparkILoop.scala:677)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$savingReplayStack(SparkILoop.scala:162)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$interpretAllFrom$1.apply$mcV$sp(SparkILoop.scala:676)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$interpretAllFrom$1.apply(SparkILoop.scala:676)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$interpretAllFrom$1.apply(SparkILoop.scala:676)
	at org.apache.spark.repl.SparkILoop.savingReader(SparkILoop.scala:167)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$interpretAllFrom(SparkILoop.scala:675)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$loadCommand$1.apply(SparkILoop.scala:740)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$loadCommand$1.apply(SparkILoop.scala:739)
	at org.apache.spark.repl.SparkILoop.withFile(SparkILoop.scala:733)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loadCommand(SparkILoop.scala:739)

{...}

Something in the script? Spark seems to think that spark-shell is a directory ...

@rawkintrevo
Contributor Author

Possibly a regression from last night, when I moved and renamed load.scala -> bin/load-shell.scala.

@rawkintrevo
Contributor Author

Confirmed shell explosion; fixed by deleting $MAHOUT_HOME/bin/metastore_db.

My shell explosion was a slightly different flavor though. Can you try the above?

@@ -211,10 +207,6 @@
</dependency>
Member

There is another shell dependency at line 158 that needs to come out.

@skanjila

skanjila commented Feb 6, 2017

@rawkintrevo I was wondering how we will evolve the shell when a new Spark version comes out. I am also wondering what the use cases for mahout-shell are; it seems like most people use Mahout as an embedded application or a library. Is the shell just to test out a few things? I would be all for removing the shell altogether, actually: less code to maintain in the long run. Let me know if I am missing something here.

@rawkintrevo
Contributor Author

@skanjila the shell is useful enough that I'd like to keep it around if possible. Some reasons off the cuff:

  • Good for 'demo-ing' Mahout: fire up the shell and do some simple stuff.
  • Good for sanity-checking bugs in Zeppelin (something very close to my heart).
  • As we move to add algorithms, I envision Mahout being used for more interactive data science. In that world there is a lot of iterative "try this, see what happens, try that" work. I do most of that in Zeppelin, but some people may use JetBrains or other IDEs, and the shell is useful in those cases.

To your point: yes, 86ing the entire shell module certainly offers some very attractive advantages. What we're seeing in this PR is an opportunity to get the best of both worlds (no code to maintain, but still have a shell). We just need to work out some kinks in getting it to work with spark-shell correctly.

@skanjila

skanjila commented Feb 7, 2017

@rawkintrevo The notion of interactive data science is very interesting to me, as that's what I do at work. However, what is the advantage of using Mahout for that versus doing it directly in the Spark shell with Spark SQL or Spark's ML algorithms? Is that where Samsara comes in? I'm just trying to understand the tradeoffs between the Spark and Mahout worlds.

@rawkintrevo
Contributor Author

@skanjila Yes. In short: SparkML has a few non-extensible algorithms with limited functionality. Mahout lets you write your own algorithms; at the moment there are some amazing tools to help you do that (e.g., distributed SVD) but not many pre-baked algorithms (you still need to go back to SparkML for Random Forests, etc.).
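
To make the "amazing tools" point concrete, here is a minimal sketch of a distributed SVD in the Samsara DSL. The data is made up, and it assumes the load-shell.scala imports plus an implicit distributed context (sdc) are already in scope:

import org.apache.mahout.math.decompositions._

// A small in-core matrix parallelized into a DRM (distributed row matrix).
val drmA = drmParallelize(dense(
  (1.0, 2.0, 3.0),
  (3.0, 4.0, 5.0),
  (5.0, 6.0, 7.0)), numPartitions = 2)

// Distributed stochastic SVD: rank-2 factors U and V plus singular values s.
val (drmU, drmV, s) = dssvd(drmA, k = 2, p = 1)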

With the new algorithms framework, I hope to see Mahout catch up to and then exceed SparkML's pre-canned algorithm collection in short order, driven primarily by community involvement.

@rawkintrevo
Contributor Author

@andrewpalumbo Checked against Spark 1.6.1/2.0.2/2.1.0 on another box; no issues.

Can someone else help test this?

@andrewpalumbo
Member

andrewpalumbo commented Feb 7, 2017 via email

@andrewpalumbo
Member

andrewpalumbo commented Feb 7, 2017 via email

@rawkintrevo
Contributor Author

I'd like someone else to test this.

Also curious whether this solves MAHOUT-1897.

This DOES NOT solve MAHOUT-1892 (serialization when doing a map block in the shell).

@skanjila

skanjila commented Feb 7, 2017

@rawkintrevo I was going to test this on an Azure VM; do you guys still need help testing?

@skanjila

skanjila commented Feb 7, 2017

@rawkintrevo Here's what I see when testing on an Azure VM:
saikan@dsexperiments:~/code/mahout$ MASTER=local[4] ./bin/mahout spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
17/02/07 18:24:40 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/02/07 18:24:40 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
17/02/07 18:24:46 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/02/07 18:24:46 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
SQL context available as sqlContext.
Loading /home/saikan/code/mahout/bin/load-shell.scala...
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = org.apache.mahout.sparkbindings.SparkDistributedContext@7804474a

                 _                 _
 _ __ ___   __ _| |__   ___  _   _| |_
| '_ ` _ \ / _` | '_ \ / _ \| | | | __|
| | | | | | (_| | | | | (_) | |_| | |_
|_| |_| |_|\__,_|_| |_|\___/ \__,_|\__|  version 0.13.0

That file does not exist

scala>

Looks good to me, +1 (non-binding, of course). Perhaps we should try some heavy matrix ops on it for further testing.

@rawkintrevo
Contributor Author

@skanjila Thanks for testing! I usually run the OLS example, though another type of test is probably advisable to truly detect bugs. Could you also confirm that it works as follows: build Mahout with mvn clean package -Dspark.version=2.0.2, set export SPARK_HOME=/path/to/spark-2.0.2-bin, and then repeat for Spark 2.1.0?
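
For reference, the OLS example is essentially the ordinary least squares walkthrough from the Samsara shell tutorial; a condensed sketch (toy data, normal-equations solve, same assumptions as above about imports and an implicit context):

// Toy dataset: four feature columns with the target in the last column.
val drmData = drmParallelize(dense(
  (2, 2, 10.5, 10, 29.509541),
  (1, 2, 12.3, 30, 18.042851),
  (1, 1, 10.5, 10, 22.736446),
  (2, 1, 18.0, 40, 32.207582)), numPartitions = 2)

val drmX = drmData(::, 0 until 4)   // features
val y = drmData.collect(::, 4)      // target vector, collected in-core

// Normal equations: X'X and X'y computed distributed, then solved in-core.
val XtX = (drmX.t %*% drmX).collect
val Xty = (drmX.t %*% y).collect(::, 0)
val beta = solve(XtX, Xty)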

Thanks again!

@skanjila

skanjila commented Feb 7, 2017

Here are the results with Spark 2.0.2:
saikan@dsexperiments:~/code/mahout$ MASTER=local[4] ./bin/mahout spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
17/02/07 19:22:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/07 19:22:21 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://10.4.9.4:4040
Spark context available as 'sc' (master = local[4], app id = local-1486495341246).
Spark session available as 'spark'.
Loading /home/saikan/code/mahout/bin/load-shell.scala...
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = org.apache.mahout.sparkbindings.SparkDistributedContext@43f44d37

                 _                 _
 _ __ ___   __ _| |__   ___  _   _| |_
| '_ ` _ \ / _` | '_ \ / _ \| | | | __|
| | | | | | (_| | | | | (_) | |_| | |_
|_| |_| |_|\__,_|_| |_|\___/ \__,_|\__|  version 0.13.0

That file does not exist

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

@skanjila

skanjila commented Feb 7, 2017

And here are the results for Spark 2.1.0:
saikan@dsexperiments:~/code/mahout$ MASTER=local[4] ./bin/mahout spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/02/07 19:27:59 WARN SparkContext: Support for Java 7 is deprecated as of Spark 2.0.0
17/02/07 19:27:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/07 19:28:05 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://10.4.9.4:4040
Spark context available as 'sc' (master = local[4], app id = local-1486495680739).
Spark session available as 'spark'.
Loading /home/saikan/code/mahout/bin/load-shell.scala...
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = org.apache.mahout.sparkbindings.SparkDistributedContext@3ea6753e

                 _                 _
 _ __ ___   __ _| |__   ___  _   _| |_
| '_ ` _ \ / _` | '_ \ / _ \| | | | __|
| | | | | | (_| | | | | (_) | |_| | |_
|_| |_| |_|\__,_|_| |_|\___/ \__,_|\__|  version 0.13.0

That file does not exist

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
I would highly recommend we come up with a beefy set of tests to validate the shell further. Thoughts on which set of operations to consider other than OLS?

@skanjila

skanjila commented Feb 8, 2017

@rawkintrevo Is there any other help I can provide on this? Maybe run through some example mscala scripts? Let me know.

@andrewmusselman
Contributor

Yeah, I'm getting a result for all three versions of Spark, but the welcome banner situation could use some work. I'd like to remove the "That file does not exist" message, and with 1.6.3 the Spark banner shows up before the Mahout banner, while with 2.x the Mahout banner shows up first. Perhaps suppressing the Mahout banner makes sense.

@asfgit asfgit closed this in 5afdc68 Feb 24, 2017