SPARK-1416: PySpark support for SequenceFile and Hadoop InputFormats #455

MLnick · 2014-04-19T19:37:52Z

So I finally resurrected this PR. It seems the old one against the incubator mirror is no longer available, so I cannot reference it.

This adds initial support for reading Hadoop SequenceFiles, as well as arbitrary Hadoop InputFormats, in PySpark.

Overview

The basics are as follows:

PythonRDD object contains the relevant methods, that are in turn invoked by SparkContext in PySpark
The SequenceFile or InputFormat is read on the Scala side and converted from Writable instances to the relevant Scala classes (in the case of primitives)
Pyrolite is used to serialize Java objects. If this fails, the fallback is toString
PickleSerializer on the Python side deserializes.

This works "out the box" for simple Writables:

Text
IntWritable, DoubleWritable, FloatWritable
NullWritable
BooleanWritable
BytesWritable
MapWritable

It also works for simple, "struct-like" classes. Due to the way Pyrolite works, this requires that the classes satisfy the JavaBeans convenstions (i.e. with fields and a no-arg constructor and getters/setters). (Perhaps in future some sugar for case classes and reflection could be added).

I've tested it out with ESInputFormat as an example and it works very nicely:

conf = {"es.resource" : "index/type" }
rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat", "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
rdd.first()

I suspect for things like HBase/Cassandra it will be a bit trickier to get it to work out the box.

Some things still outstanding:

~~Requires msgpack-python and will fail without it. As originally discussed with Josh, add a as_strings argument that defaults to False, that can be used if msgpack-python is not available~~
I see from SPARK-1374: PySpark API for SparkSQL #363 that Pyrolite is being used there for SerDe between Scala and Python. @ahirreddy @mateiz what is the plan behind this - is Pyrolite preferred? It seems from a cursory glance that adapting the msgpack-based SerDe here to use Pyrolite wouldn't be too hard
Support the key and value "wrapper" that would allow a Scala/Java function to be plugged in that would transform whatever the key/value Writable class is into something that can be serialized (e.g. convert some custom Writable to a JavaBean or java.util.Map that can be easily serialized)
Support saveAsSequenceFile and saveAsHadoopFile etc. This would require SerDe in the reverse direction, that can be handled by Pyrolite. Will work on this as a separate PR

…InputFormat

Conflicts: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala project/SparkBuild.scala

…gument names

…ents

Conflicts: project/SparkBuild.scala python/pyspark/context.py

Conflicts: project/SparkBuild.scala

…mes. Use SPARK_HOME in path for writing test sequencefile data.

Conflicts: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala python/pyspark/context.py

Conflicts: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala

Conflicts: project/SparkBuild.scala

Conflicts: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala project/SparkBuild.scala python/pyspark/context.py python/pyspark/serializers.py

Conflicts: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala project/SparkBuild.scala

rjurney · 2014-07-29T00:50:46Z

@JoshRosen Actually, I just did: sbt/sbt assembly publish-local

Trying again with: SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.X.X sbt/sbt assembly publish-local

rjurney · 2014-07-29T00:52:23Z

@JoshRosen

Huh, I'm going by http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html

and I get:

sbt.ResolveException: unresolved dependency: org.apache.hadoop#hadoop-client;2.0.0-mr1-cdh4.X.X: not found

srowen · 2014-07-29T00:54:51Z

You're not literally writing 4.X.X in the version are you?

rjurney · 2014-07-29T00:56:27Z

@srowen Actually yes, I'm that stupid :) Figured it out on me own though, have it building across the cluster now.

rjurney · 2014-07-29T01:34:54Z

Ok, so now I rebuilt with my specific CDH version, and I get this when I run ./sbin/start-master.sh:

Spark Command: /usr/java/jdk1.8.0//bin/java -cp ::/home/hivedata/spark/conf:/home/hivedata/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.0.0-mr1-cdh4.4.0.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip hivecluster2 --port 7077 --webui-port 8080

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
14/07/28 18:33:37 INFO Master: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/07/28 18:33:37 INFO Master: Registered signal handlers for [TERM, HUP, INT]
14/07/28 18:33:37 INFO SecurityManager: Changing view acls to: hivedata
14/07/28 18:33:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hivedata)
14/07/28 18:33:38 INFO Slf4jLogger: Slf4jLogger started
14/07/28 18:33:38 INFO Remoting: Starting remoting
14/07/28 18:33:38 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkMaster-akka.actor.default-dispatcher-2] shutting down ActorSystem [sparkMaster]
java.lang.VerifyError: (class: org/jboss/netty/channel/socket/nio/NioWorkerPool, method: createWorker signature: (Ljava/util/concurrent/Executor;)Lorg/jboss/netty/channel/socket/nio/AbstractNioWorker;) Wrong return type in function
at akka.remote.transport.netty.NettyTransport.(NettyTransport.scala:282)
at akka.remote.transport.netty.NettyTransport.(NettyTransport.scala:239)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)
at scala.util.Try$.apply(Try.scala:161)
at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:73)
at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
at scala.util.Success.flatMap(Try.scala:200)
at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:84)
at akka.remote.EndpointManager$$anonfun$8.apply(Remoting.scala:618)
at akka.remote.EndpointManager$$anonfun$8.apply(Remoting.scala:610)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at akka.remote.EndpointManager.akka$remote$EndpointManager$$listens(Remoting.scala:610)
at akka.remote.EndpointManager$$anonfun$receive$2.applyOrElse(Remoting.scala:450)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/07/28 18:33:38 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
14/07/28 18:33:38 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
14/07/28 18:33:38 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
Exception in thread "main" java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at akka.remote.Remoting.start(Remoting.scala:173)
at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579)
at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577)
at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:104)
at org.apache.spark.deploy.master.Master$.startSystemAndActor(Master.scala:808)
at org.apache.spark.deploy.master.Master$.main(Master.scala:788)
at org.apache.spark.deploy.master.Master.main(Master.scala)

rjurney · 2014-07-29T01:42:28Z

Hmmm, I didn't clean before rebuilding with CDH 4.4. Trying that now.

rjurney · 2014-07-29T01:54:00Z

@srowen @JoshRosen Cleaned, made no difference. See https://issues.apache.org/jira/browse/SPARK-1138 where others had this issue.

srowen · 2014-07-29T09:01:08Z

@rjurney This means you have two conflicting versions of Netty in the build. It may or may not be to do with the JIRA you cite, just because there are lots of Netties floating around and lots of ways they can collide. Maybe you can restate what version you are buildling, how you are building at this point and any modifications to the build? mvn dependency:tree will reveal where Netties are coming from.

ericgarcia · 2014-07-29T16:30:08Z

@rjurney I had the same problem with the master branch so I patched the relevant changes from this pull request into the 1.0.0 release, tossed the AvroGenericConverter class into core/src/main/scala/org/apache/spark/api/python, built it and it worked fine.

rjurney · 2014-07-29T18:05:01Z

@ericgarcia @srowen Sorry, but again I can't make things go. I try to pull that request to branch-1.0.0 via: 'git fetch origin pull/455/head:master' but I get this:

From github.com:apache/spark
! [rejected] refs/pull/455/head -> master (non-fast-forward)

[new tag] v0.9.2 -> v0.9.2
[new tag] v1.0.2-rc1 -> v1.0.2-rc1

Which looks like it isn't applying?

srowen · 2014-07-29T18:09:17Z

@rjurney Yes I doubt in general that patches necessarily magically apply cleanly across versions. You may have to massage it as with any merge conflict.

rjurney · 2014-07-29T18:12:41Z

@ericgarcia Could you please create a public branch with this code in a working state and push it to your clone of spark so I can use that? I'm bad at merging conflicts.

ericgarcia · 2014-07-29T21:25:31Z

@rjurney For some reason I can't push my changes to a tagged release in github. Here's the packaged distribution that I built: https://dl.dropboxusercontent.com/u/145006/spark-1.0.0-bin-1.0.4.tgz it's built for hadoop 1.0

If you want to create your own modified 1.0.0 release, what i did was: check out the tagged 1.0.0 release https://github.com/apache/spark/tree/v1.0.0 and also check out this pull request and then copy the files from this pull request into the corresponding directories in the 1.0.0 release. Then create org.apache.spark.api.python.AvroGenericConverter in the same directory as the other org.apache.spark.api.python namespace files, change the namespace on AvroGenericConverter in the code accordingly, and build the project with ./make_distribution.sh

The modified code for loading the avro files in pyspark is

avroRdd = sc.newAPIHadoopFile(filename, 
    "org.apache.avro.mapreduce.AvroKeyInputFormat", 
    "org.apache.avro.mapred.AvroKey", 
    "org.apache.hadoop.io.NullWritable",
    keyConverter="org.apache.spark.api.python.AvroGenericConverter")

JIRA issue: https://issues.apache.org/jira/browse/SPARK-2024 This PR is a followup to #455 and adds capabilities for saving PySpark RDDs using SequenceFile or any Hadoop OutputFormats. * Added RDD methods ```saveAsSequenceFile```, ```saveAsHadoopFile``` and ```saveAsHadoopDataset```, for both old and new MapReduce APIs. * Default converter for converting common data types to Writables. Users may specify custom converters to convert to desired data types. * No out-of-box support for reading/writing arrays, since ArrayWritable itself doesn't have a no-arg constructor for creating an empty instance upon reading. Users need to provide ArrayWritable subtypes. Custom converters for converting arrays to suitable ArrayWritable subtypes are also needed when writing. When reading, the default converter will convert any custom ArrayWritable subtypes to ```Object[]``` and they get pickled to Python tuples. * Added HBase and Cassandra output examples to show how custom output formats and converters can be used. cc MLnick mateiz ahirreddy pwendell Author: Kan Zhang <kzhang@apache.org> Closes #1338 from kanzhang/SPARK-2024 and squashes the following commits: c01e3ef [Kan Zhang] [SPARK-2024] code formatting 6591e37 [Kan Zhang] [SPARK-2024] renaming pickled -> pickledRDD d998ad6 [Kan Zhang] [SPARK-2024] refectoring to get method params below 10 57a7a5e [Kan Zhang] [SPARK-2024] correcting typo 75ca5bd [Kan Zhang] [SPARK-2024] Better type checking for batch serialized RDD 0bdec55 [Kan Zhang] [SPARK-2024] Refactoring newly added tests 9f39ff4 [Kan Zhang] [SPARK-2024] Adding 2 saveAsHadoopDataset tests 0c134f3 [Kan Zhang] [SPARK-2024] Test refactoring and adding couple unbatched cases 7a176df [Kan Zhang] [SPARK-2024] Add saveAsSequenceFile to PySpark

rjurney · 2014-08-05T14:15:09Z

@ericgarcia @srowen @MLnick

Unfortunately when I follow those directions, I still get errors. It looks like I'll have to wait to get this functionality until its merged. Anyone have any idea when that will be? Is there a release in mind that this would be a part of?

MLnick · 2014-08-05T16:16:15Z

It will be in release 1.1.

You should be able to check out branch-1.1 and build from source and it should work ok.

Otherwise 1.1 should be released within a couple weeks
—
Sent from Mailbox

On Tue, Aug 5, 2014 at 4:15 PM, Russell Jurney notifications@github.com
wrote:

@ericgarcia @srowen @MLnick

Unfortunately when I follow those directions, I still get errors. It looks like I'll have to wait to get this functionality until its merged. Anyone have any idea when that will be? Is there a release in mind that this would be a part of?

Reply to this email directly or view it on GitHub:
#455 (comment)

@ahirreddy

So I finally resurrected this PR. It seems the old one against the incubator mirror is no longer available, so I cannot reference it. This adds initial support for reading Hadoop ```SequenceFile```s, as well as arbitrary Hadoop ```InputFormat```s, in PySpark. # Overview The basics are as follows: 1. ```PythonRDD``` object contains the relevant methods, that are in turn invoked by ```SparkContext``` in PySpark 2. The SequenceFile or InputFormat is read on the Scala side and converted from ```Writable``` instances to the relevant Scala classes (in the case of primitives) 3. Pyrolite is used to serialize Java objects. If this fails, the fallback is ```toString``` 4. ```PickleSerializer``` on the Python side deserializes. This works "out the box" for simple ```Writable```s: * ```Text``` * ```IntWritable```, ```DoubleWritable```, ```FloatWritable``` * ```NullWritable``` * ```BooleanWritable``` * ```BytesWritable``` * ```MapWritable``` It also works for simple, "struct-like" classes. Due to the way Pyrolite works, this requires that the classes satisfy the JavaBeans convenstions (i.e. with fields and a no-arg constructor and getters/setters). (Perhaps in future some sugar for case classes and reflection could be added). I've tested it out with ```ESInputFormat``` as an example and it works very nicely: ```python conf = {"es.resource" : "index/type" } rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat", "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf) rdd.first() ``` I suspect for things like HBase/Cassandra it will be a bit trickier to get it to work out the box. # Some things still outstanding: 1. ~~Requires ```msgpack-python``` and will fail without it. As originally discussed with Josh, add a ```as_strings``` argument that defaults to ```False```, that can be used if ```msgpack-python``` is not available~~ 2. ~~I see from apache#363 that Pyrolite is being used there for SerDe between Scala and Python. @ahirreddy @mateiz what is the plan behind this - is Pyrolite preferred? It seems from a cursory glance that adapting the ```msgpack```-based SerDe here to use Pyrolite wouldn't be too hard~~ 3. ~~Support the key and value "wrapper" that would allow a Scala/Java function to be plugged in that would transform whatever the key/value Writable class is into something that can be serialized (e.g. convert some custom Writable to a JavaBean or ```java.util.Map``` that can be easily serialized)~~ 4. Support ```saveAsSequenceFile``` and ```saveAsHadoopFile``` etc. This would require SerDe in the reverse direction, that can be handled by Pyrolite. Will work on this as a separate PR Author: Nick Pentreath <nick.pentreath@gmail.com> Closes apache#455 from MLnick/pyspark-inputformats and squashes the following commits: 268df7e [Nick Pentreath] Documentation changes mer @pwendell comments 761269b [Nick Pentreath] Address @pwendell comments, simplify default writable conversions and remove registry. 4c972d8 [Nick Pentreath] Add license headers d150431 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats cde6af9 [Nick Pentreath] Parameterize converter trait 5ebacfa [Nick Pentreath] Update docs for PySpark input formats a985492 [Nick Pentreath] Move Converter examples to own package 365d0be [Nick Pentreath] Make classes private[python]. Add docs and @experimental annotation to Converter interface. eeb8205 [Nick Pentreath] Fix path relative to SPARK_HOME in tests 1eaa08b [Nick Pentreath] HBase -> Cassandra app name oversight 3f90c3e [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats 2c18513 [Nick Pentreath] Add examples for reading HBase and Cassandra InputFormats from Python b65606f [Nick Pentreath] Add converter interface 5757f6e [Nick Pentreath] Default key/value classes for sequenceFile asre None 085b55f [Nick Pentreath] Move input format tests to tests.py and clean up docs 43eb728 [Nick Pentreath] PySpark InputFormats docs into programming guide 94beedc [Nick Pentreath] Clean up args in PythonRDD. Set key/value converter defaults to None for PySpark context.py methods 1a4a1d6 [Nick Pentreath] Address @mateiz style comments 01e0813 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats 15a7d07 [Nick Pentreath] Remove default args for key/value classes. Arg names to camelCase 9fe6bd5 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats 84fe8e3 [Nick Pentreath] Python programming guide space formatting d0f52b6 [Nick Pentreath] Python programming guide 7caa73a [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats 93ef995 [Nick Pentreath] Add back context.py changes 9ef1896 [Nick Pentreath] Recover earlier changes lost in previous merge for serializers.py 077ecb2 [Nick Pentreath] Recover earlier changes lost in previous merge for context.py 5af4770 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats 35b8e3a [Nick Pentreath] Another fix for test ordering bef3afb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats e001b94 [Nick Pentreath] Fix test failures due to ordering 78978d9 [Nick Pentreath] Add doc for SequenceFile and InputFormat support to Python programming guide 64eb051 [Nick Pentreath] Scalastyle fix e7552fa [Nick Pentreath] Merge branch 'master' into pyspark-inputformats 44f2857 [Nick Pentreath] Remove msgpack dependency and switch serialization to Pyrolite, plus some clean up and refactoring c0ebfb6 [Nick Pentreath] Change sequencefile test data generator to easily be called from PySpark tests 1d7c17c [Nick Pentreath] Amend tests to auto-generate sequencefile data in temp dir 17a656b [Nick Pentreath] remove binary sequencefile for tests f60959e [Nick Pentreath] Remove msgpack dependency and serializer from PySpark 450e0a2 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats 31a2fff [Nick Pentreath] Scalastyle fixes fc5099e [Nick Pentreath] Add Apache license headers 4e08983 [Nick Pentreath] Clean up docs for PySpark context methods b20ec7e [Nick Pentreath] Clean up merge duplicate dependencies 951c117 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats f6aac55 [Nick Pentreath] Bring back msgpack 9d2256e [Nick Pentreath] Merge branch 'master' into pyspark-inputformats 1bbbfb0 [Nick Pentreath] Clean up SparkBuild from merge a67dfad [Nick Pentreath] Clean up Msgpack serialization and registering 7237263 [Nick Pentreath] Add back msgpack serializer and hadoop file code lost during merging 25da1ca [Nick Pentreath] Add generator for nulls, bools, bytes and maps 65360d5 [Nick Pentreath] Adding test SequenceFiles 0c612e5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats d72bf18 [Nick Pentreath] msgpack dd57922 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats e67212a [Nick Pentreath] Add back msgpack dependency f2d76a0 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats 41856a5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats 97ef708 [Nick Pentreath] Remove old writeToStream 2beeedb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats 795a763 [Nick Pentreath] Change name to WriteInputFormatTestDataGenerator. Cleanup some var names. Use SPARK_HOME in path for writing test sequencefile data. 174f520 [Nick Pentreath] Add back graphx settings 703ee65 [Nick Pentreath] Add back msgpack 619c0fa [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats 1c8efbc [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats eb40036 [Nick Pentreath] Remove unused comment lines 4d7ef2e [Nick Pentreath] Fix indentation f1d73e3 [Nick Pentreath] mergeConfs returns a copy rather than mutating one of the input arguments 0f5cd84 [Nick Pentreath] Remove unused pair UTF8 class. Add comments to msgpack deserializer 4294cbb [Nick Pentreath] Add old Hadoop api methods. Clean up and expand comments. Clean up argument names 818a1e6 [Nick Pentreath] Add seqencefile and Hadoop InputFormat support to PythonRDD 4e7c9e3 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats c304cc8 [Nick Pentreath] Adding supporting sequncefiles for tests. Cleaning up 4b0a43f [Nick Pentreath] Refactoring utils into own objects. Cleaning up old commented-out code d86325f [Nick Pentreath] Initial WIP of PySpark support for SequenceFile and arbitrary Hadoop InputFormat

JIRA issue: https://issues.apache.org/jira/browse/SPARK-2024 This PR is a followup to apache#455 and adds capabilities for saving PySpark RDDs using SequenceFile or any Hadoop OutputFormats. * Added RDD methods ```saveAsSequenceFile```, ```saveAsHadoopFile``` and ```saveAsHadoopDataset```, for both old and new MapReduce APIs. * Default converter for converting common data types to Writables. Users may specify custom converters to convert to desired data types. * No out-of-box support for reading/writing arrays, since ArrayWritable itself doesn't have a no-arg constructor for creating an empty instance upon reading. Users need to provide ArrayWritable subtypes. Custom converters for converting arrays to suitable ArrayWritable subtypes are also needed when writing. When reading, the default converter will convert any custom ArrayWritable subtypes to ```Object[]``` and they get pickled to Python tuples. * Added HBase and Cassandra output examples to show how custom output formats and converters can be used. cc MLnick mateiz ahirreddy pwendell Author: Kan Zhang <kzhang@apache.org> Closes apache#1338 from kanzhang/SPARK-2024 and squashes the following commits: c01e3ef [Kan Zhang] [SPARK-2024] code formatting 6591e37 [Kan Zhang] [SPARK-2024] renaming pickled -> pickledRDD d998ad6 [Kan Zhang] [SPARK-2024] refectoring to get method params below 10 57a7a5e [Kan Zhang] [SPARK-2024] correcting typo 75ca5bd [Kan Zhang] [SPARK-2024] Better type checking for batch serialized RDD 0bdec55 [Kan Zhang] [SPARK-2024] Refactoring newly added tests 9f39ff4 [Kan Zhang] [SPARK-2024] Adding 2 saveAsHadoopDataset tests 0c134f3 [Kan Zhang] [SPARK-2024] Test refactoring and adding couple unbatched cases 7a176df [Kan Zhang] [SPARK-2024] Add saveAsSequenceFile to PySpark

freedafeng · 2014-09-05T21:52:04Z

Anyone still working on this patch? Pyspark + Hbase is the key to our data science application. I really hope it can work in the very near future.

JoshRosen · 2014-09-05T21:53:51Z

@freedafeng This PR was actually merged and will be available in Spark 1.1 (which should be released very soon). f971d6c

Add support for collecting logs of Spark+K8s jobs

…466][SQL] Remove redundant null checks in generated Java code by GenerateUnsafeProjection" (apache#455) This reverts commit c5583fd.

…cript K8S-1077 (apache#598) * K8S-1077 - use single k8s secret with user info MapR [SPARK-651] Replacing joda-time-*.jar with joda-time-2.10.3.jar. MapR [SPARK-638] Wrong permissions when creating files under directory with GID bit set. MapR [SPARK-627] SparkHistoryServer-2.4 is getting 403 Unauthorized home page for users(spark.ui.view.acls) via spark-submit MapR [SPARK-639] Default headers are adding two times MapR [SPARK-629] Spark UI for job lose CSS styles MapR [MS-925] After upgrade to MEP 6.2 (Spark 2.4.0) can no longer consume Kafka / MapR Streams. MapR [SPARK-626] Update kafka dependencies for Spark 2.4.4.0 in release MEP-6.3.0 MapR [SPARK-340] Jetty web server version at Spark should be updated tp v9.4.X MapR [SPARK-617] an't use ssl via spark beeline MapR [SPARK-617] Can't use ssl via spark beeline MapR [SPARK-620] Replace core dependency in Spark-2.4.4 MapR [SPARK-621] Fix multiple XML configuration initialization for (apache#575) custom headers. Use X-XSS-Protection, X-Content-Type-Options Content-Security-Policy and Strict-Transport-Security configuration only in case: cluster security is enabled OR spark.ui.security.headers.enabled set to true. MapR [SPARK-595] Spark cannot access hs2 through zookeeper Revert "MapR [SPARK-595] Spark cannot access hs2 through zookeeper (apache#577)" MapR [SPARK-595] Spark cannot access hs2 through zookeeper MapR [SPARK-620] Replace core dependency in Spark-2.4. MapR [SPARK-619] Move absent commits from 2.4.3 branch to 2.4.4 (apache#574) * Adding SQL API to write to kafka from Spark (apache#567) * Branch 2.4.3 extended kafka and examples (apache#569) * The v2 API is in its own package - the v2 api is in a different package - the old functionality is available in a separated package * v2 API examples - All the examples are using the newest API. - I have removed the old examples since they are not relevant any more and the same functionality is shown in the new examples usin the new API. * MapR [SPARK-619] Move absent commits from 2.4.3 branch to 2.4.4 CORE-321. Add custom http header support for jetty. MapR [SPARK-609] Port Apache Spark-2.4.4 changes to the MapR Spark-2.4.4 branch Adding multi table loader (apache#560) * Adding multi table loader - This allows us to load multiple matching tables into one Union DataFrame. If we have the fallowing MFS structure: ``` /clients/client_1/data.table /clients/client_2/data.table ``` we can load a union dataframe by doing `loadFromMapRDB("/clients/*/*.table")` * Fixing the path to the reader MapR [SPARK-588] Spark thriftserver fails when work with hive-maprdb json table MapR [SPARK-598] Spark can't add needed properties to hive-site.xml MAPR-SPARK-596: Change HBase compatible version for Spark 2.4.3 MapR [SPARK-592] Add possibility to use start-thriftserver.sh script with 2304 port MapR [SPARK-584] MaprDB connector's setHintUsingIndex method doesn't work as expected MapR [SPARK-583] MaprDB connector's loadFromMaprDB function for Java API doesn't work as expected SPARK-579 info about ssl_trustore is added for metrics MapR [SPARK-552] Failed to get broadcast_11_piece0 of broadcast_11 SPARK-569 Generation of SSL ceritificates for spark UI MapR [SPARK-575] Warning messages in spark workspace after the second attempt to login to job's UI Update zookeeper version Adding `joinWithMapRDBTable` function (apache#529) The related documentation of this function is here https://github.com/anicolaspp/MapRDBConnector#joinwithmaprdbtable. The main idea is that having a dataframe (no matter how was it constructed) we can join it with a MapR-DB table. This functions looks at the join query and load only those records from MapR-DB that will join instead of loading the full table and then join in memory. In other words, we only load what we know will be joint. Adding DataSource Reader Support (apache#525) * Adding DataSource Reader Support * Update SparkSessionExt.scala * creating a package object * Update MapRDBSpark.scala * fully path to avoid name collition * refactorings MapR [SPARK-451] Spark hadoop/core dependency updates MapR [SPARK-566] Move absent commits from 2.4.0 branch MapR [SPARK-561] Spark 2.4.3 porting to MapR MapR [SPARK-561] Spark 2.4.3 porting to MapR MapR [SPARK-558] Render application UI init page if driver is not up MapR [SPARK-541] Avoid duplication of the first unexpired record MapR [COLD-150][K8S] Fix metrics copy MapR [K8S-893] Hide plain text password from logs MapR [SPARK-540] Include 'avro' artifacts MapR [SPARK-536] PySpark streaming package for kafka-0-10 added K8S-853: Enable spark metrics for external tenant MapR [SPARK-531] Remove duplicating entries from classpath in ClasspathFilter MapR [SPARK-516] Spark jobs failure using yarn mode on kerberos fixed MapR [SPARK-462] Spark and SparkHistoryServer allow week ciphers, which can allow man in the middle attack [SPARK-508] MapR-DB OJAI Connector for Spark isNull condition returns incorrect result MapR [SPARK-510] nonmapr "admin" users not able to view other user logs in SHS SPARK-460: Spark Metrics for CollectD Configuration for collecting Spark metrics SPARK-463 MAPR_MAVEN_REPO variable for specifying mapR repository MapR [SPARK-492] Spark 2.4.0.0 configure.sh has error messages MapR [SPARK-515][K8S] Remove configure.sh call for k8s MapR [SPARK-515] Move configuring spark-env.sh back to the private-pkg MapR [SPARK-515] Move configuring spark-env.sh back to the private-pkg MapR [SPARK-514] Recovery from checkpoint is broken MapR [SPARK-445] Messages loss fixed by reverting [MAPR-32290] changes from kafka09 package (apache#460) * MapR [SPARK-445] Revert "[MAPR-32290] Spark processing offsets when messages are already TTL in the first batch (apache#376)" This reverts commit e8d59b9. * MapR [SPARK-445] Revert "[MAPR-32290] Spark processing offsets when messages are already ttl in first batch (apache#368)" This reverts commit b282a8b. MapR [SPARK-445] Messages loss fixed by reverting [MAPR-32290] changes from kafka10 package MapR [SPARK-469] Fix NPE in generated classes by reverting "[SPARK-23466][SQL] Remove redundant null checks in generated Java code by GenerateUnsafeProjection" (apache#455) This reverts commit c5583fd. MapR [SPARK-482] Spark streaming app fails to start by UnknownTopicOrPartitionException with checkpoint MapR [SPARK-496] Spark HS UI doesn't work MapR [SPARK-416] CVE-2018-1320 vulnerability in Apache Thrift MapR [SPARK-486][K8S] Fix sasl encryption error on Kubernetes MapR [SPARK-481] Cannot run spark configure.sh on Client node MapR [K8S-637][K8S] Add configure.sh configuration in spark-defaults.conf for job runtime MapR [SPARK-465] Error messages after update of spark 2.4 MapR [SPARK-465] Error messages after update of spark 2.4 MapR [SPARK-464] Can't submit spark 2.4 jobs from mapr-client [SPARK-466] SparkR errors fixed MapR [SPARK-456] Spark shell can't be started SPARK-417 impersonation fixes for spark executor. Impersonation is mo… (apache#433) * SPARK-417 impersonation fixes for spark executor. Impersonation is moved from HadoopRDD.compute() method to org.apache.spark.executor.Executor.run() method * SPARK-363 Hive version changed to '1.2.0-mapr-spark-MEP-6.0.0' [SPARK-449] Kafka offset commit issue fixed MapR [SPARK-287] Move logic of creating /apps/spark folder from installer's scripts to the configure.sh MapR [SPARK-221] Investigate possibility to move creating of the spark-env.sh from private-pkg to configure.sh MapR [SPARK-430] PID files should be under /opt/mapr/pid MapR [SPARK-446] Spark configure.sh doesn't start/stop Spark services MapR [SPARK-434] Move absent commits from 2.3.2 branch (apache#425) * MapR [SPARK-352] Spark shell fails with "NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream" if java is not available in PATH * MapR [SPARK-350] Deprecate Spark Kafka-09 package * MapR [SPARK-326] Investigate possibility of writing Java example for the MapRDB OJAI connector * [SPARK-356] Merge mapr changes from kafka-09 package into the kafka-10 * SPARK-319 Fix for sparkR version check * MapR [SPARK-349] Update OJAI client to v3 for Spark MapR-DB JSON connector * MapR [SPARK-367] Move absent commits from 2.3.1 branch * MapR [SPARK-137] Analyze the warning during compilation of OJAI connector * MapR [SPARK-369] Spark 2.3.2 fails with error related to zookeeper * [MAPR-26258] hbasecontext.HBaseDistributedScanExample fails * [SPARK-24355] Spark external shuffle server improvement to better handle block fetch requests * MapR [SPARK-374] Spark Hive example fails when we submit job from another(simple) cluster user * MapR [SPARK-434] Move absent commits from 2.3.2 branch * MapR [SPARK-434] Move absent commits from 2.3.2 branch * MapR [SPARK-373] Unexpected behavior during job running in standalone cluster mode * MapR [SPARK-419] Update hive-maprdb-json-handler jar for spark 2.3.2.0 and spark 2.2.1 * MapR [SPARK-396] Interface change of sendToKafka * MapR [SPARK-357] consumer groups are prepeneded with a "service_" prefix * MapR [SPARK-429] Changes in maprdb connector are the cause of broken backward compatibility * MapR [SPARK-427] Update kafka in Spark-2.4.0 to the 1.1.1-mapr * MapR [SPARK-434] Move absent commits from 2.3.2 branch * Move absent commits from 2.3.2 branch * MapR [SPARK-434] Move absent commits from 2.3.2 branch * Move absent commits from 2.3.2 branch * Move absent commits from 2.3.2 branch MapR [SPARK-427] Update kafka in Spark-2.4.0 to the 1.1.1-mapr MapR [SPARK-379] Spark 2.4 4-gidit version MapR [PIC-48][K8S] Port k8s changes to 2.4.0 [PIC-48] Create user for k8s driver and executor if required [PIC-48] Create user for k8s driver and executor if required Revert "Remove spark.ui.filters property" This reverts commit d8941ba36c3451cdce15d18d6c1a52991de3b971. [SPARK-351] Copy kubernetes start scripts anyway PIC-34: Rename default configmap name to be consistent with mapr-kubernetes [SPARK-23668][K8S] Add config option for passing through k8s Pod.spec.imagePullSecrets (apache#355) Pass through the `imagePullSecrets` option to the k8s pod in order to allow user to access private image registries. See https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/ Unit tests + manual testing. Manual testing procedure: 1. Have private image registry. 2. Spark-submit application with no `spark.kubernetes.imagePullSecret` set. Do `kubectl describe pod ...`. See the error message: ``` Error syncing pod, skipping: failed to "StartContainer" for "spark-kubernetes-driver" with ErrImagePull: "rpc error: code = 2 desc = Error: Status 400 trying to pull repository ...: \"{\\n \\\"errors\\\" : [ {\\n \\\"status\\\" : 400,\\n \\\"message\\\" : \\\"Unsupported docker v1 repository request for '...'\\\"\\n } ]\\n}\"" ``` 3. Create secret `kubectl create secret docker-registry ...` 4. Spark-submit with `spark.kubernetes.imagePullSecret` set to the new secret. See that deployment was successful. Author: Andrew Korzhuev <andrew.korzhuev@klarna.com> Author: Andrew Korzhuev <korzhuev@andrusha.me> Closes apache#20811 from andrusha/spark-23668-image-pull-secrets. [SPARK-321] Change default value of spark.mapr.ssl.secret.prefix property [PIC-32] Spark on k8s with MapR secure cluster Update entrypoint.sh with correct spark version (apache#340) This PR has minor fix to correct the spark version string [SPARK-274] Create home directory for user who submitted job [MAPR-SPARK-230] Implement security for Spark on Kubernetes Run Spark job with specify the username for driver and executor Read cluster configs from configMap Run configure.sh script form entrypoint.sh Remove spark.kubernetes.driver.pod.commands property Add Spark properties for executor and driver environment variable MapR [SPARK-296] Structured Streaming memory leak Revert "[MAPR-SPARK-210] Rename sprk-defaults.conf to spark-defaults.conf.tem…" (apache#252) * Revert "[MAPR-SPARK-176] Fix Spark Project Catalyst unit tests (apache#251)" This reverts commit 5de05075cd14abf8ac65046a57a5d76617818fbe. * Revert "[MAPR-SPARK-210] Rename sprk-defaults.conf to spark-defaults.conf.template (apache#249)" This reverts commit 1baa677d727e89db7c605ffbae9a9eba00337ad0. [MAPR-SPARK-210] Rename sprk-defaults.conf to spark-defaults.conf.template MapR [SPARK-379] Port Spark to 2.4.0 MapR [SPARK-341] Spark 2.3.2 porting [MAPR-32290] Spark processing offsets when messages are already TTL in the first batch * Bug 32263 - Seek called on unsubscribed partitions [MSPARK-331] Remove snapshot versions of mapr dependencies from Spark-2.3.1 [MAPR-32290] Spark processing offsets when messages are already ttl in first batch MapR [SPARK-325] Add examples for work with the MapRDB JSON connector into the Spark project [ATS-449] Unit test for EBF 32013 created. MAPR-SPARK-311: Spark beeline uses default ssl truststore instead of mapr ssl truststore Bug 32355 - Executor tab empty on Spark UI [SPARK-318] Submitting Spark jobs from Oozie fails due to ClassNotFoundException Bug 32014 - Spark Consumer fails with java.lang.AssertionError Revert "[SPARK-306] Kafka clients 1.0.1 present in jars directory for Spark 2.3.1" (apache#341) * Revert "[SPARK-306] Kafka clients 1.0.1 present in jars directory for Spark 2.3.1 (apache#335)" This reverts commit 832411e. Bug 32014 - Spark Consumer fails with java.lang.AssertionError (apache#326) (apache#336) * MapR [32014] Spark Consumer fails with java.lang.AssertionError [SPARK-306] Kafka clients 1.0.1 present in jars directory for Spark 2.3.1 DEVOPS-2768 temporarily removed curl for file downloading [SPARK-302] Local privilege escalation MapR [SPARK-297] Added unit test for empty value conversion MapR [SPARK-297] Empty values are loaded as non-null MapR [SPARK-296] Structured Streaming memory leak 2.3.1 spark 289 (apache#318) * MapR [SPARK-289] Fix unit test for Spark-2.3.1 [SPARK-130] MapRDB connector - NPE while saving Pair RDD with 'null' values MapR [SPARK-283] Unit tests fail during initialization SSL properties. [SPARK-212] SparkHiveExample fails when we run it twice MapR [SPARK-282] Remove maprfs and hadoop jars from mapr spark package MapR [SPARK-278] Spark submit fails for jobs with python MapR [SPARK-279] Can't connect to spark thrift server with new spark and hive packages MapR [SPARK-276] Update zookeeper dependency to v.3.4.11 for spark 2.3.1 MapR [SPARK-272] Use only client passwords from ssl-client.xml MapR [SPARK-266] Spark jobs can't finish correctly, when there is an error during job running MapR [SPARK-263] Add possibility to use keyPassword which is different from keyStorePassword [MSPARK-31632] RM UI showing broken page for Spark jobs MapR [SPARK-261] Use mapr-security-web for getting passwords. MapR [SPARK-259] Spark application doesn't finish correctly MapR [SPARK-268] Update Spark version for Warden change project version to 2.3.1-mapr-SNAPSHOT MapR [SPARK-256] Spark doesn't work on yarn mode MapR [SPARK-255] Installer fresh install 610/600 secure fails to start "mapr-spark-thriftserver", "mapr-spark-historyserver" Mapr [SPARK-248] MapRDBTableScanRDD fails to convert to Scala Dataframe when using where clause MapR [SPARK-225] Hadoop credentials provider usage for hiding passwords at spark-defaults MapR [SPARK-214] Hive-2.1 poperties can't be read from a hive-site.xml as Spark uses Hive-1.2 MapR [SPARK-216] Spark thriftserver fails when work with hive-maprdb json table SPARK-244 (apache#278) Provide ability to use MapR-Negotiation authentication for Spark HistoryServer MapR [SPARK-226] Spark - pySpark Security Vulnerability MapR [SPARK-220] SparkR fails with UDF functions bug fixed MapR [SPARK-227] KafkaUtils.createDirectStream fails with kafka-09 MapR [SPARK-183] Spark Integration for Kafka 0.10 unit tests disabled MapR [SPARK-182] Spark Project External Kafka Producer v09 unit tests fixed MapR [SPARK-179] Spark Integration for Kafka 0.9 unit tests fixed MapR [SPARK-181] Kafka 0.10 Structured Streaming unit tests fixed [MSPARK-31305] Spark History server NOT loading applications submitted by users other than 'mapr' MapR [SPARK-175] Fix Spark Project Streaming unit tests [MAPR-SPARK-176] Fix Spark Project Catalyst unit tests [MAPR-SPARK-178] Fix Spark Project Hive unit tests MapR [SPARK-174] Spark Core unit tests fixed Changed version for spark-kafka connector. MapR [SPARK-202] Update MapR Spark to 2.3.0 Fixed compile time errors in tests Change project version [SPARK-198] Update hadoop dependency version to 2.7.0-mapr-1803 for Spark 2.2.1 MapR [SPARK-188] Couldn't connect to thrift server via spark beeline on kerberos cluster MapR [SPARK-143] Spark History Server does not require login for secured-by-default clusters MapR [SPARK-186] Update OJAI versions to the latest for Spark-2.2.1 OJAI Connector MapR [SPARK-191] Incorrect work of MapR-DB Sink 'complete' and 'update' modes fixed MapR [SPARK-170] StackOverflowException in equals method in DBMapValue 2.2.1 build fixed (apache#231) * MapR [SPARK-164] Update Kafka version to 1.0.1-mapr in Spark Kafka Producer module MapR [SPARK-161] Include Kafka Structured streaming jar to Spark package. MapR [SPARK-155] Change Spark Master port from 8080 MapR [SPARK-153] Exception in spark job with configured labels on yarn-client mode MapR [SPARK-152] Incorrect date string parsing fixed MapR [SPARK-21] Structured Streaming MapR-DB Sink created MapR [SPARK-135] Spark 2.2 with MapR Streams ( Kafka 1.0) (apache#218) * MapR [SPARK-135] Spark 2.2 with MapR Streams (Kafka 1.0) Added functionality of MapR-Streams specific EOF handling. MapR [SPARK-143] Spark History Server does not require login for secured-by-default clusters Disable build failing if scalastyle checking is fall. MapR [SPARK-16] Change Spark version in Warden files and configure.sh MapR [SPARK-144] Add insertToMapRDB method for rdd for Java API [MAPR-30536] Spark SQL queries on Map column fails after upgrade MapR [SPARK-139] Remove "update" related APIs from connector MapR [SPARK-140] Change the option name "tableName" to "tablePath" in the Spark/MapR-DB connectors. MapR [SPARK-121] Spark OJAI JAVA: update functionality removed MapR [SPARK-118] Spark OJAI Python: missed DataFrame import while moving imports in order to fix MapR [ZEP-101] interpreter issue MapR [SPARK-118] Spark OJAI Python: move MapR DB Connector class importing in order to fix MapR [ZEP-101] interpreter issue MapR [SPARK-117] Spark OJAI Python: Save functionality implementation MapR [SPARK-131] Exception when try to save JSON table with Binary _id field Spark OJAI JAVA: load to RDD, save from RDD implementation (apache#195) * MapR [SPARK-124] Loading to JavaRDD implemented * MapR [SPARK-124] MapRDBJavaSparkContext constructor changed * MapR [SPARK-124] implemented RDD[Row] saving MapR [SPARK-118] Spark OJAI Python: Read implementation MapR [SPARK-128] MapRDB connector - wrong handle of null fields when nullable is false * MapR [SPARK-121] Spark OJAI JAVA: Read to Dataset functionality implementation * Minor refactoring MapR [SPARK-125] Default value of idFieldPath parameter is not handle MapR [SPARK-113] Hit java.lang.UnsupportedOperationException: empty.reduceLeft during loadFromMapRDB Spark Mapr-DB connector was refactored according to Scala style Removed code duplication [MSPARK-107]idField information is lost in MapRDBDataFrameWriterFunctions.saveToMapRDB configure.sh takes options to change ports Kafka client excluded from package because correct version is located in "mapr classpath" Changed Kafka version in Kafka producer module. Branch spark 69 (apache#170) * Fixing the wrong type casting of TimeStamp to OTimeStamp when read from spark dataFrame. * SPARK-69: Problem with license when we try to read from json and write to maprdb remove creatin /usr/local/spark link from configure.sh. This link will be creates by private-pkg remove include-maprdb from default profiles added profiles in maprdb pom file instead of two pom files Fixed maprdb connector dependencies. Fixing the wrong type casting of TimeStamp to OTimeStamp when read from spark dataFrame. changed port for spark-thriftserver as it conflicts with hive server changed port for spark-thriftserver as it conflicts with hive server remove .not_configured_yet file after success Ojai connector fixed required java version [MSPARK-45] Move Spark-OJAI connector code to Spark github repo (apache#132) * SPARK-45 Move Spark-OJAI connector code to Spark github repo * Fixing pom versions for maprdb spark connector. * Changes made to the connector code to be compatible with 5.2.* and 6.0 clients. Spark 2.1.0 mapr 29106 (apache#150) * [SPARK-20922][CORE] Add whitelist of classes that can be deserialized by the launcher. Blindly deserializing classes using Java serialization opens the code up to issues in other libraries, since just deserializing data from a stream may end up execution code (think readObject()). Since the launcher protocol is pretty self-contained, there's just a handful of classes it legitimately needs to deserialize, and they're in just two packages, so add a filter that throws errors if classes from any other package show up in the stream. This also maintains backwards compatibility (the updated launcher code can still communicate with the backend code in older Spark releases). Tested with new and existing unit tests. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#18166 from vanzin/SPARK-20922. (cherry picked from commit 8efc6e9) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 772a9b9) * [SPARK-20922][CORE][HOTFIX] Don't use Java 8 lambdas in older branches. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#18178 from vanzin/SPARK-20922-hotfix. Added security by default for historyserver use waitForConsumerAssignment() instead of consumer.poll(0) for spark-29052 change MAPR_HADOOP_CLASSPATH in configure.sh for creating it by mapr-classpath.sh change MAPR_HADOOP_CLASSPATH in configure.sh for creating it by mapr-classpath.sh changes for mapr-classpath.sh changes for mapr-classpath.sh configure.sh changes [SPARK-39] Classpath filter was added Fixed impersonation when data read from MapR-DB via Spark-Hive. added configure.sh and warden.spark-thriftserver.conf hive-hbase-handler added to Spark jars Fixed "Single message comes late" 28339 bug fixed Spark streaming skipped message with zero offset from Kafka 0.9 [MSPARK-9] Initial fix for Spark unit tests Bump dependencies after ECO-1703 release [SPARK-33] Streaming example fixed [MAPR-26060] Fixed case when mapr-streams make gaps in offsets ported features from kafka 10 to kafka 9 [MAPR-26289][SPARK-2.1] Streaming general improvements (apache#93) * Added include-kafka-09 profile to Assembly * Set default poll timeout to 120s Set default HBase verison to 1.1.8 Changes from Kafka10 package were ported to Kafka09 package. [MAPR-26053] Include MapR Classes to the default value of spark.sql.hive.metastore.sharedPrefixes [MAPR-25807] Spark-Warehouse path computes incorrectly Add MapR-SASL support for Thrift Server Adding scala library. [MAPR-25713] Spark might try to load MapR Class Loader multiple times and fail [MAPR-25311] Bump Spark dependencies after ECO-1611 release [MINOR] Fix spark-jars.sh script [MAPR-24603] Could not launch beeline shell after starting spark thrift server fixed syntax error in V09DirectKafkaWordCount example Spark 2.0.1 MAPR-streams Python API [MAPR-24415] SPARK_JAVA_OPTS is deprecated Kafka streaming producer added. Minor fix for previous commit Added script for MAPR-24374 Some minor changes to spark-defaults.conf Changed default HBase version to 1.1.1 in compatibility.version Streaming example was refactored [MAPR-24470] HiveFromSpark test fails in yarn-cluster mode Added MapR Repo [MAPR-22940] Failed to connect spark beeline (after spark thrift server is started) on Kerberos cluster [MAPR-18865] Unable to submit spark apps from Windows client Skip maven clean task on the parent module New: Issue with running Hive commands in Spark This is fixed in SPARK-7819 Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error Spark warden.services.conf should have dependency on cldb Remove DFS shuffle settings. These settings are not used right now. Copy every file in the conf directory into the distribution package. Create spark-defaults.conf for MapR Settings to enable DFS shuffle on MapR. Support hbase classpath computation in util script. Adding external conf and scripts. Enable SPARK_HIVE mode while building. This is needed to bundle datanucleus jars needed for hive table creation. Build Spark on MapR. - make-distribution.sh takes an environment variable to enable profiles - MVN_PROFILE_ARG - Added warden conf files under ext-conf. - Updated pom.xml to use right set of jars and version. Spark Master failed to start in HA mode Updated Apache Curator version Added spark streaming integration with kafka 0.9 and mapr-streams Added MapR Repo

MLnick added 30 commits December 9, 2013 08:53

Initial WIP of PySpark support for SequenceFile and arbitrary Hadoop …

d86325f

…InputFormat

Refactoring utils into own objects. Cleaning up old commented-out code

4b0a43f

Adding supporting sequncefiles for tests. Cleaning up

c304cc8

Merge remote-tracking branch 'upstream/master' into pyspark-inputformats

4e7c9e3

Conflicts: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala project/SparkBuild.scala

Add seqencefile and Hadoop InputFormat support to PythonRDD

818a1e6

Add old Hadoop api methods. Clean up and expand comments. Clean up ar…

4294cbb

…gument names

Remove unused pair UTF8 class. Add comments to msgpack deserializer

0f5cd84

mergeConfs returns a copy rather than mutating one of the input argum…

f1d73e3

…ents

Fix indentation

4d7ef2e

Remove unused comment lines

eb40036

Merge remote-tracking branch 'upstream/master' into pyspark-inputformats

1c8efbc

Conflicts: project/SparkBuild.scala python/pyspark/context.py

Merge remote-tracking branch 'upstream/master' into pyspark-inputformats

619c0fa

Conflicts: project/SparkBuild.scala

Add back msgpack

703ee65

Add back graphx settings

174f520

Change name to WriteInputFormatTestDataGenerator. Cleanup some var na…

795a763

…mes. Use SPARK_HOME in path for writing test sequencefile data.

Merge remote-tracking branch 'upstream/master' into pyspark-inputformats

2beeedb

Conflicts: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala python/pyspark/context.py

Remove old writeToStream

97ef708

Merge branch 'master' into pyspark-inputformats

41856a5

Conflicts: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala

Merge branch 'master' into pyspark-inputformats

f2d76a0

Conflicts: project/SparkBuild.scala

Add back msgpack dependency

e67212a

Merge remote-tracking branch 'upstream/master' into pyspark-inputformats

dd57922

Conflicts: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala project/SparkBuild.scala python/pyspark/context.py python/pyspark/serializers.py

msgpack

d72bf18

Merge branch 'master' into pyspark-inputformats

0c612e5

Adding test SequenceFiles

65360d5

Add generator for nulls, bools, bytes and maps

25da1ca

Add back msgpack serializer and hadoop file code lost during merging

7237263

Clean up Msgpack serialization and registering

a67dfad

Clean up SparkBuild from merge

1bbbfb0

Merge branch 'master' into pyspark-inputformats

9d2256e

Conflicts: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala project/SparkBuild.scala

Bring back msgpack

f6aac55

bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019

Merge pull request apache#455 from liu-sheng/collect-log-spark-k8s

66e7c43

Add support for collecting logs of Spark+K8s jobs

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SPARK-1416: PySpark support for SequenceFile and Hadoop InputFormats #455

SPARK-1416: PySpark support for SequenceFile and Hadoop InputFormats #455

MLnick commented Apr 19, 2014

rjurney commented Jul 29, 2014

rjurney commented Jul 29, 2014

srowen commented Jul 29, 2014

rjurney commented Jul 29, 2014

rjurney commented Jul 29, 2014

rjurney commented Jul 29, 2014

rjurney commented Jul 29, 2014

srowen commented Jul 29, 2014

ericgarcia commented Jul 29, 2014

rjurney commented Jul 29, 2014

srowen commented Jul 29, 2014

rjurney commented Jul 29, 2014

ericgarcia commented Jul 29, 2014

rjurney commented Aug 5, 2014

MLnick commented Aug 5, 2014

Unfortunately when I follow those directions, I still get errors. It looks like I'll have to wait to get this functionality until its merged. Anyone have any idea when that will be? Is there a release in mind that this would be a part of?

freedafeng commented Sep 5, 2014

JoshRosen commented Sep 5, 2014

SPARK-1416: PySpark support for SequenceFile and Hadoop InputFormats #455

SPARK-1416: PySpark support for SequenceFile and Hadoop InputFormats #455

Conversation

MLnick commented Apr 19, 2014

Overview

Some things still outstanding:

rjurney commented Jul 29, 2014

rjurney commented Jul 29, 2014

srowen commented Jul 29, 2014

rjurney commented Jul 29, 2014

rjurney commented Jul 29, 2014

rjurney commented Jul 29, 2014

rjurney commented Jul 29, 2014

srowen commented Jul 29, 2014

ericgarcia commented Jul 29, 2014

rjurney commented Jul 29, 2014

srowen commented Jul 29, 2014

rjurney commented Jul 29, 2014

ericgarcia commented Jul 29, 2014

rjurney commented Aug 5, 2014

MLnick commented Aug 5, 2014

Unfortunately when I follow those directions, I still get errors. It looks like I'll have to wait to get this functionality until its merged. Anyone have any idea when that will be? Is there a release in mind that this would be a part of?

freedafeng commented Sep 5, 2014

JoshRosen commented Sep 5, 2014