[MAHOUT-1894] Add Support for Spark 2.x #271

Closed
wants to merge 3 commits

Conversation

rawkintrevo
Contributor

As long as we're sticking to Scala 2.10, running mahout on spark 2.x is simply a matter of

mvn clean package -Dspark.version=2.0.2
or
mvn clean package -Dspark.version=2.1.0

The trouble comes with the shell...

I checked Apache Zeppelin to see how they handle multiple Spark/Scala versions: a brief preview of the descent into hell that is maintaining a shell that supports multiple Spark/Scala versions.

So I took an alternate route: I dropped the Mahout shell altogether, changed the mahout bin file to launch the Spark shell directly, and passed it a Scala script that takes care of our imports.
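
For illustration, here is a minimal sketch of what such an init script could contain. The imports match what the shell prints on load (see the logs below); the explicit SparkDistributedContext construction is my assumption of how sdc gets created, not necessarily the exact code in this PR:

// Hypothetical bin/load-shell.scala: the setup the dedicated Mahout shell
// used to do, now run through the plain spark-shell's script-loading mechanism.
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

// Wrap the spark-shell's SparkContext (sc) in a Mahout distributed context
// so DRM operations have an engine to run on.
implicit val sdc: SparkDistributedContext = new SparkDistributedContext(sc)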

When building, there is a single deprecation warning regarding the sqlContext and how it is created in the spark-bindings.

I think we should add binaries for Spark 2.0 and Spark 2.1 as a matter of convenience and for the Zeppelin integration.

MAHOUT-1894 Add support for spark 2.x
@pferrel
Contributor

pferrel commented Feb 2, 2017

I'm soooo into dropping a special Mahout shell. Do your comments mean we just run Mahout classes in the Spark shell for Spark 2.x? Does this work both with and without Zeppelin (@andrewpalumbo's case)?

IF we can compile Mahout with Scala 2.11 fairly easily (excluding the shell) and IF we can run Mahout with some helper scripts in the Spark Shell, we can drop the Mahout Shell code and get all the advantages of using the plain Spark Shell with our extensions. Can/should this be done?

I realize I've asked these before but this seems the best forum.

@rawkintrevo
Contributor Author

@pferrel In short, yes. The idea here is that we drop the Mahout Shell entirely; it was also the blocker for upgrading to Spark 2.x.

The Zeppelin integration, for all intents and purposes, is a Spark shell plus some imports and setup of the distributed context.

So that is what we're doing here.

Hopefully removing the shell will also clear the way for the Scala 2.11 upgrade / profile.

@andrewpalumbo
Member

andrewpalumbo commented Feb 5, 2017

hmm.. just tried to launch into local[4] and blew it up:

AP-RE-X16743C45L:mahout apalumbo$ MASTER=local[4] mahout spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_102)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
Loading /Users/apalumbo/sandbox/mahout/bin/load-shell.scala...
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = org.apache.mahout.sparkbindings.SparkDistributedContext@73e0c775

                 _                 _
 _ __ ___   __ _| |__   ___  _   _| |_
| '_ ` _ \ / _` | '_ \ / _ \| | | | __|
| | | | | | (_| | | | | (_) | |_| | |_
|_| |_| |_|\__,_|_| |_|\___/ \__,_|\__|  version 0.13.0



Exception in thread "main" java.io.FileNotFoundException: spark-shell (Is a directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at scala.reflect.io.File.inputStream(File.scala:97)
	at scala.reflect.io.File.inputStream(File.scala:82)
	at scala.reflect.io.Streamable$Chars$class.reader(Streamable.scala:93)
	at scala.reflect.io.File.reader(File.scala:82)
	at scala.reflect.io.Streamable$Chars$class.bufferedReader(Streamable.scala:98)
	at scala.reflect.io.File.bufferedReader(File.scala:82)
	at scala.reflect.io.Streamable$Chars$class.bufferedReader(Streamable.scala:97)
	at scala.reflect.io.File.bufferedReader(File.scala:82)
	at scala.reflect.io.Streamable$Chars$class.applyReader(Streamable.scala:103)
	at scala.reflect.io.File.applyReader(File.scala:82)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$interpretAllFrom$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SparkILoop.scala:677)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$interpretAllFrom$1$$anonfun$apply$mcV$sp$1.apply(SparkILoop.scala:677)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$interpretAllFrom$1$$anonfun$apply$mcV$sp$1.apply(SparkILoop.scala:677)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$savingReplayStack(SparkILoop.scala:162)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$interpretAllFrom$1.apply$mcV$sp(SparkILoop.scala:676)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$interpretAllFrom$1.apply(SparkILoop.scala:676)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$interpretAllFrom$1.apply(SparkILoop.scala:676)
	at org.apache.spark.repl.SparkILoop.savingReader(SparkILoop.scala:167)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$interpretAllFrom(SparkILoop.scala:675)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$loadCommand$1.apply(SparkILoop.scala:740)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$loadCommand$1.apply(SparkILoop.scala:739)
	at org.apache.spark.repl.SparkILoop.withFile(SparkILoop.scala:733)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loadCommand(SparkILoop.scala:739)

{...}

Something in the script? Spark seems to think that spark-shell is a directory ...

@rawkintrevo
Contributor Author

Possibly a regression from last night, when I moved and renamed load.scala -> bin/load-shell.scala.

@rawkintrevo
Contributor Author

Confirmed shell explosion; fixed by deleting $MAHOUT_HOME/bin/metastore_db.

My shell explosion was a slightly different flavor though. Can you try the above?

@@ -211,10 +207,6 @@
</dependency>
Member

There is another shell dependency at line 158 that needs to come out.

@skanjila

skanjila commented Feb 6, 2017

@rawkintrevo I was wondering how we will evolve the shell when a new Spark version comes out. I am also wondering what the use cases for mahout-shell are; it seems like most people use Mahout as an embedded application or a library. Is the shell just to test out a few things? I would be all for removing the shell altogether, actually: less code to maintain in the long run. Let me know if I am missing something here.

@rawkintrevo
Contributor Author

@skanjila the shell is useful enough that I'd like to keep it around if possible. Some reasons off the cuff:

  • Good for 'demo-ing' Mahout: fire up the shell and do some simple stuff.
  • Good for sanity-checking bugs in Zeppelin (something very close to my heart).
  • As we move to add algorithms, I envision Mahout being used for more interactive data science. In that world there is a lot of iterative "try this, see what happens, try that" work. I do most of that in Zeppelin, but some people may use JetBrains or other IDEs, and the shell is useful in those cases.

To your point: yes, 86ing the entire shell module certainly offers some very attractive advantages. What we're seeing in this PR is an opportunity to get the best of both worlds (no code to maintain, but still have a shell). We just need to work out some kinks in getting it to work with spark-shell correctly.

@skanjila

skanjila commented Feb 7, 2017

@rawkintrevo The notion of interactive data science is very interesting to me, as that's what I do at work. However, what is the advantage of using Mahout for that versus doing it directly in the Spark shell with Spark SQL or Spark's ML algorithms? Is that where Samsara comes in? I'm just trying to understand the tradeoffs between the Spark and Mahout worlds.

@rawkintrevo
Contributor Author

@skanjila Yes. In short: SparkML has a few non-extensible algorithms with limited functionality. Mahout lets you write your own algorithms; at the moment there are some amazing tools to help you do that (e.g., distributed SVD) but not many pre-baked algorithms (you still need to go back to SparkML for Random Forests, etc.).
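
To make the "amazing tools" point concrete, here is a minimal sketch of a distributed SVD in the Samsara DSL. The data is made up, and it assumes the load-shell.scala imports plus an implicit distributed context (sdc) are already in scope:

import org.apache.mahout.math.decompositions._

// A small in-core matrix parallelized into a DRM (distributed row matrix).
val drmA = drmParallelize(dense(
  (1.0, 2.0, 3.0),
  (3.0, 4.0, 5.0),
  (5.0, 6.0, 7.0)), numPartitions = 2)

// Distributed stochastic SVD: rank-2 factors U and V plus singular values s.
val (drmU, drmV, s) = dssvd(drmA, k = 2, p = 1)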

With the new algorithms framework, I hope to see Mahout catch up to and then exceed SparkML's pre-canned algorithm collection in short order, driven primarily by community involvement.

@rawkintrevo
Contributor Author

@andrewpalumbo Checked against Spark 1.6.1/2.0.2/2.1.0 on another box; no issues.

Can someone else help test this?

@andrewpalumbo
Member

andrewpalumbo commented Feb 7, 2017 via email

@andrewpalumbo
Member

andrewpalumbo commented Feb 7, 2017 via email

@rawkintrevo
Contributor Author

I'd like someone else to test this.

Also curious whether this solves MAHOUT-1897.

This DOES NOT solve MAHOUT-1892 (serialization when doing a map block in the shell).

@skanjila

skanjila commented Feb 7, 2017

@rawkintrevo I was going to test this on an Azure VM; do you guys still need help testing?

@skanjila

skanjila commented Feb 7, 2017

@rawkintrevo Here's what I see when testing on an Azure VM:
saikan@dsexperiments:~/code/mahout$ MASTER=local[4] ./bin/mahout spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
17/02/07 18:24:40 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/02/07 18:24:40 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
17/02/07 18:24:46 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/02/07 18:24:46 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
SQL context available as sqlContext.
Loading /home/saikan/code/mahout/bin/load-shell.scala...
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = org.apache.mahout.sparkbindings.SparkDistributedContext@7804474a

                 _                 _
 _ __ ___   __ _| |__   ___  _   _| |_
| '_ ` _ \ / _` | '_ \ / _ \| | | | __|
| | | | | | (_| | | | | (_) | |_| | |_
|_| |_| |_|\__,_|_| |_|\___/ \__,_|\__|  version 0.13.0

That file does not exist

scala>

Looks good to me, +1 (non-binding, of course). Perhaps we should try some heavy matrix ops on it for further testing.

@rawkintrevo
Contributor Author

@skanjila Thanks for testing! I usually run the OLS example, though another type of test is probably advisable to truly detect bugs. Could you also confirm that it works as follows: build Mahout with mvn clean package -Dspark.version=2.0.2, set export SPARK_HOME=/path/to/spark-2.0.2-bin, and then repeat for Spark 2.1.0?
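
For reference, the OLS example is essentially the ordinary least squares walkthrough from the Samsara shell tutorial; a condensed sketch (toy data, normal-equations solve, same assumptions as above about imports and an implicit context):

// Toy dataset: four feature columns with the target in the last column.
val drmData = drmParallelize(dense(
  (2, 2, 10.5, 10, 29.509541),
  (1, 2, 12.3, 30, 18.042851),
  (1, 1, 10.5, 10, 22.736446),
  (2, 1, 18.0, 40, 32.207582)), numPartitions = 2)

val drmX = drmData(::, 0 until 4)   // features
val y = drmData.collect(::, 4)      // target vector, collected in-core

// Normal equations: X'X and X'y computed distributed, then solved in-core.
val XtX = (drmX.t %*% drmX).collect
val Xty = (drmX.t %*% y).collect(::, 0)
val beta = solve(XtX, Xty)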

Thanks again!

@skanjila

skanjila commented Feb 7, 2017

Here are the results with Spark 2.0.2:
saikan@dsexperiments:~/code/mahout$ MASTER=local[4] ./bin/mahout spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
17/02/07 19:22:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/07 19:22:21 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://10.4.9.4:4040
Spark context available as 'sc' (master = local[4], app id = local-1486495341246).
Spark session available as 'spark'.
Loading /home/saikan/code/mahout/bin/load-shell.scala...
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = org.apache.mahout.sparkbindings.SparkDistributedContext@43f44d37

                 _                 _
 _ __ ___   __ _| |__   ___  _   _| |_
| '_ ` _ \ / _` | '_ \ / _ \| | | | __|
| | | | | | (_| | | | | (_) | |_| | |_
|_| |_| |_|\__,_|_| |_|\___/ \__,_|\__|  version 0.13.0

That file does not exist

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

@skanjila

skanjila commented Feb 7, 2017

And here are the results for Spark 2.1.0:
saikan@dsexperiments:~/code/mahout$ MASTER=local[4] ./bin/mahout spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/02/07 19:27:59 WARN SparkContext: Support for Java 7 is deprecated as of Spark 2.0.0
17/02/07 19:27:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/07 19:28:05 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://10.4.9.4:4040
Spark context available as 'sc' (master = local[4], app id = local-1486495680739).
Spark session available as 'spark'.
Loading /home/saikan/code/mahout/bin/load-shell.scala...
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = org.apache.mahout.sparkbindings.SparkDistributedContext@3ea6753e

                 _                 _
 _ __ ___   __ _| |__   ___  _   _| |_
| '_ ` _ \ / _` | '_ \ / _ \| | | | __|
| | | | | | (_| | | | | (_) | |_| | |_
|_| |_| |_|\__,_|_| |_|\___/ \__,_|\__|  version 0.13.0

That file does not exist

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
I would highly recommend we come up with a beefy set of tests to validate the shell further. Thoughts on which set of operations to consider other than OLS?

@skanjila

skanjila commented Feb 8, 2017

@rawkintrevo Is there any other help I can provide on this? Maybe run through some example mscala scripts? Let me know.

@andrewmusselman
Contributor

Yeah, I'm getting a result for all three versions of Spark, but the welcome banner situation could use some work. I'd like to remove the "That file does not exist" message, and with 1.6.3 the Spark banner shows up before the Mahout banner, while with 2.x the Mahout banner shows up first. Perhaps suppressing the Mahout banner makes sense.

@asfgit asfgit closed this in 5afdc68 Feb 24, 2017