[SPARK-1308] [PySpark] Add partitions() method to PySpark RDDs #218

nchammas · 2014-03-24T23:46:20Z

I've added the partitions() method per the discussion here.

Looking at the instructions here, it is not clear how to add tests for this kind of modification to PySpark.

sbt/sbt test has 1 failing test -- org.apache.spark.sql.hive.execution.HiveCompatibilitySuite -- but this appears to be unrelated to my change. Reverting my changes and re-running assembly and test yields the same result.

First-time committer. Forgive me if I’ve messed anything up.

Per the discussion here: http://apache-spark-user-list.1001560.n3.nabble.com/How-many-partitions-is-my-RDD-split-into-td3072.html Looking at the instructions here (https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark), it is not clear how to add tests for this kind of modification to PySpark. First-time committer. Forgive me if I’ve messed anything up.

AmplabJenkins · 2014-03-25T00:12:46Z

Can one of the admins verify this patch?

nchammas · 2014-03-25T00:40:46Z

python/pyspark/rdd.py

+        """
+        Get the array of partitions of this RDD, taking into account whether the RDD is checkpointed or not.
+        """
+        return self._jrdd.splits()


Actually, I'm not sure that having a Python method return a Java object is the right thing to do. Or is it?

len(rdd.partitions()) works as you would expect, and so does rdd.partitions().size(), but I'm not sure how else this method might be used.

pwendell · 2014-03-25T00:57:19Z

It might be nicer if this just had a method that returned the number of partitions. The partitions are Java objects which can't be returned safely to Python.
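
A minimal sketch of that suggestion, assuming the method keeps calling `self._jrdd.splits()` as in the diff above and hands back only the count as a plain Python int. `FakeJavaList`, `FakeJavaRDD`, and the trimmed-down `RDD` class are hypothetical stand-ins so the example runs without a JVM; they are not PySpark classes, and the method name `getNumPartitions` is a placeholder.

```python
class FakeJavaList:
    """Hypothetical stand-in for the Py4J proxy of a Java list."""

    def __init__(self, items):
        self._items = list(items)

    def size(self):
        return len(self._items)


class FakeJavaRDD:
    """Hypothetical stand-in for the Java RDD held in RDD._jrdd."""

    def __init__(self, num_partitions):
        self._num_partitions = num_partitions

    def splits(self):
        return FakeJavaList(range(self._num_partitions))


class RDD:
    """Trimmed-down illustration, not the real pyspark.rdd.RDD."""

    def __init__(self, jrdd):
        self._jrdd = jrdd

    def getNumPartitions(self):
        """Return the number of partitions as a plain Python int."""
        # Only the count crosses the Java/Python boundary;
        # the partition objects themselves stay on the Java side.
        return int(self._jrdd.splits().size())


rdd = RDD(FakeJavaRDD(4))
num_partitions = rdd.getNumPartitions()
print(num_partitions)
```

Returning an int sidesteps the question of what a Python caller could safely do with the Java partition objects.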

nchammas · 2014-03-25T03:07:22Z

Makes sense. Would you rather have the method be called numPartitions() or getNumPartitions()?

The former lines up with the numPartitions input parameter to many methods. The latter lines up with the 2 other get...() methods.

pwendell · 2014-03-25T05:44:15Z

Hm - the naming for getters seems a little inconsistent in the API. /cc @JoshRosen any feelings about getNumPartitions vs numPartitions? Also it would be good to add a doctest:

>>> sc.parallelize([1, 2, 3, 4, 5, 6], 4).getNumPartitions()
4
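
One way such a doctest could be exercised, sketched with the standard `doctest` module. `FakeRDD` and `FakeContext` are hypothetical stand-ins for the PySpark classes so the example runs without Spark, and supplying `sc` through the doctest globals is an assumption about how the test would be wired up, not a description of PySpark's own test runner.

```python
import doctest


class FakeRDD:
    """Hypothetical stand-in for a PySpark RDD; it only tracks its partition count."""

    def __init__(self, data, num_slices):
        self._data = list(data)
        self._num_slices = num_slices

    def getNumPartitions(self):
        """
        Return the number of partitions in this RDD.

        >>> sc.parallelize([1, 2, 3, 4, 5, 6], 4).getNumPartitions()
        4
        """
        return self._num_slices


class FakeContext:
    """Hypothetical stand-in for SparkContext."""

    def parallelize(self, data, num_slices):
        return FakeRDD(data, num_slices)


# Parse the docstring into a DocTest, providing `sc` as a global
# so the example can use it without defining it.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    FakeRDD.getNumPartitions.__doc__,
    {"sc": FakeContext()},
    "getNumPartitions",
    None,
    0,
)

# Run the parsed example; the result reports failed and attempted counts.
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results.failed, results.attempted)
```

Keeping the example in the docstring means the documented usage and the test cannot drift apart.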

AmplabJenkins · 2014-03-28T01:54:39Z

Can one of the admins verify this patch?

Rebasing fork from source.

Change the definition of this method per Patrick’s comments [here](#218).

nchammas · 2014-04-22T16:18:43Z

Dunno if I've done this correctly, but I've rebased my fork and changed this method per our discussion here. The change list seems to contain everything I merged in, in addition to the one method I'm trying to add to PySpark RDD. Is that correct?

Anyway, feel free to close this PR if it should just be revisited at a later time.

nchammas · 2014-04-22T16:30:40Z

Ah, looks like I have indeed messed up this PR. I will revisit this issue later.

…de-error Fix UnicodeEncodeError in PySpark saveAsTextFile() (SPARK-970) This fixes [SPARK-970](https://spark-project.atlassian.net/browse/SPARK-970), an issue where PySpark's saveAsTextFile() could throw UnicodeEncodeError when called on an RDD of Unicode strings. Please merge this into master and branch-0.8. (cherry picked from commit 8a3475a) Signed-off-by: Reynold Xin <rxin@apache.org>

[SPARKR-225] Merge master into sparkr-sql branch

This pull requests integrates SparkR, an R frontend for Spark. The SparkR package contains both RDD and DataFrame APIs in R and is integrated with Spark's submission scripts to work on different cluster managers. Some integration points that would be great to get feedback on: 1. Build procedure: SparkR requires R to be installed on the machine to be built. Right now we have a new Maven profile `-PsparkR` that can be used to enable SparkR builds 2. YARN cluster mode: The R package that is built needs to be present on the driver and all the worker nodes during execution. The R package location is currently set using SPARK_HOME, but this might not work on YARN cluster mode. The SparkR package represents the work of many contributors and attached below is a list of people along with areas they worked on edwardt (edwart) - Documentation improvements Felix Cheung (felixcheung) - Documentation improvements Hossein Falaki (falaki) - Documentation improvements Chris Freeman (cafreeman) - DataFrame API, Programming Guide Todd Gao (7c00) - R worker Internals Ryan Hafen (hafen) - SparkR Internals Qian Huang (hqzizania) - RDD API Hao Lin (hlin09) - RDD API, Closure cleaner Evert Lammerts (evertlammerts) - DataFrame API Davies Liu (davies) - DataFrame API, R worker internals, Merging with Spark Yi Lu (lythesia) - RDD API, Worker internals Matt Massie (massie) - Jenkins build Harihar Nahak (hnahak87) - SparkR examples Oscar Olmedo (oscaroboto) - Spark configuration Antonio Piccolboni (piccolbo) - SparkR examples, Namespace bug fixes Dan Putler (dputler) - Dataframe API, SparkR Install Guide Ashutosh Raina (ashutoshraina) - Build improvements Josh Rosen (joshrosen) - Travis CI build Sun Rui (sun-rui)- RDD API, JVM Backend, Shuffle improvements Shivaram Venkataraman (shivaram) - RDD API, JVM Backend, Worker Internals Zongheng Yang (concretevitamin) - RDD API, Pipelined RDDs, Examples and EC2 guide Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Author: Shivaram 
Venkataraman <shivaram.venkataraman@gmail.com> Author: Zongheng Yang <zongheng.y@gmail.com> Author: cafreeman <cfreeman@alteryx.com> Author: Shivaram Venkataraman <shivaram@eecs.berkeley.edu> Author: Davies Liu <davies@databricks.com> Author: Davies Liu <davies.liu@gmail.com> Author: hlin09 <hlin09pu@gmail.com> Author: Sun Rui <rui.sun@intel.com> Author: lythesia <iranaikimi@gmail.com> Author: oscaroboto <oscarjr@gmail.com> Author: Antonio Piccolboni <antonio@piccolboni.info> Author: root <edward> Author: edwardt <edwardt.tril@gmail.com> Author: hqzizania <qian.huang@intel.com> Author: dputler <dan.putler@gmail.com> Author: Todd Gao <todd.gao.2013@gmail.com> Author: Chris Freeman <cfreeman@alteryx.com> Author: Felix Cheung <fcheung@AVVOMAC-119.local> Author: Hossein <hossein@databricks.com> Author: Evert Lammerts <evert@apache.org> Author: Felix Cheung <fcheung@avvomac-119.t-mobile.com> Author: felixcheung <felixcheung_m@hotmail.com> Author: Ryan Hafen <rhafen@gmail.com> Author: Ashutosh Raina <ashutoshraina@users.noreply.github.com> Author: Oscar Olmedo <oscarjr@gmail.com> Author: Josh Rosen <rosenville@gmail.com> Author: Yi Lu <iranaikimi@gmail.com> Author: Harihar Nahak <hnahak87@users.noreply.github.com> Closes #5096 from shivaram/R and squashes the following commits: da64742 [Davies Liu] fix Date serialization 59266d1 [Davies Liu] check exclusive of primary-py-file and primary-r-file 55808e4 [Davies Liu] fix tests 5581c75 [Davies Liu] update author of SparkR f731b48 [Shivaram Venkataraman] Only run SparkR tests if R is installed 64eda24 [Shivaram Venkataraman] Merge branch 'R' of https://github.com/amplab-extras/spark into R d7c3f22 [Shivaram Venkataraman] Address code review comments Changes include 1. Adding SparkR docs to API docs generated 2. Style fixes in SparkR scala files 3. 
Clean up of shell scripts and explanation of install-dev.sh 377151f [Shivaram Venkataraman] Merge remote-tracking branch 'apache/master' into R eb5da53 [Shivaram Venkataraman] Merge pull request #3 from davies/R2 a18ff5c [Davies Liu] Update sparkR.R 5133f3a [Shivaram Venkataraman] Merge pull request #7 from hqzizania/R3 940b631 [hqzizania] [SPARKR-92] Phase 2: implement sum(rdd) 0e788c0 [Shivaram Venkataraman] Merge pull request #5 from hlin09/doc-fix 3487461 [hlin09] Add tests log in .gitignore. 1d1802e [Shivaram Venkataraman] Merge pull request #4 from felixcheung/r-require 11981b7 [felixcheung] Update R to fail early if SparkR package is missing c300e08 [Davies Liu] remove duplicated file b045701 [Davies Liu] Merge branch 'remote_r' into R 19c9368 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into remote_r f8fa8af [Davies Liu] mute logging when start/stop context e7104b6 [Davies Liu] remove ::: in SparkR a1777eb [Davies Liu] move rules into R/.gitignore e88b649 [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R 6e20e71 [Davies Liu] address comments b433817 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R a1cedad [Shivaram Venkataraman] Merge pull request #228 from felixcheung/doc e089151 [Davies Liu] Merge pull request #225 from sun-rui/SPARKR-154_2 463e28c [Davies Liu] Merge pull request #2 from shivaram/doc-fixes bc2d6d8 [Shivaram Venkataraman] Remove arg from sparkR.stop and update docs d425363 [Shivaram Venkataraman] Some doc fixes for column, generics, group 1f1a7e0 [Shivaram Venkataraman] Some fixes to DataFrame, RDD, SQLContext docs 104ad4e [Shivaram Venkataraman] Check the right env in exists cf5cd99 [Shivaram Venkataraman] Remove unused numCols argument 85a50ec [Shivaram Venkataraman] Merge pull request #226 from RevolutionAnalytics/master 3eacfc0 [Davies Liu] fix flaky test 733380d [Davies Liu] update R examples (remove master from args) b21a0da [Davies Liu] Merge pull request #1 
from shivaram/log4j-tests a1493d7 [Shivaram Venkataraman] Address comments e1f83ab [Shivaram Venkataraman] Send Spark INFO logs to a file in SparkR tests 58276f5 [Shivaram Venkataraman] Merge branch 'R' of https://github.com/amplab-extras/spark into R 52cc92d [Shivaram Venkataraman] Add license to create-docs.sh 6ff5ea2 [Shivaram Venkataraman] Add instructions to generate docs 1f478c5 [Shivaram Venkataraman] Merge branch 'R' of https://github.com/amplab-extras/spark into R 02b4833 [Shivaram Venkataraman] Add a script to generate R docs (Rd, html) Also fix some issues with our documentation d6d3729 [Davies Liu] enable spark and pyspark tests 0e5a83f [Davies Liu] fix code style afd8a77 [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R d87a181 [Davies Liu] fix flaky tests 7100fb9 [Shivaram Venkataraman] Fix libPaths in README bdf3a14 [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R 05e7375 [Davies Liu] sort generics b44e371 [Shivaram Venkataraman] Include RStudio instructions in README 855537f [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R 9fb6af3 [Davies Liu] mark R classes/objects are private 423ea3c [Shivaram Venkataraman] Ignore unknown jobj in cleanup 974e4ea [Davies Liu] fix flaky test 410ec18 [Davies Liu] fix zipRDD() tests d8b24fc [Davies Liu] disable spark and python tests temporary ce3ca62 [Davies Liu] fix license check 7da0049 [Davies Liu] fix build 2892e29 [Davies Liu] support R in YARN cluster ebd4d07 [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R 38cbf59 [Davies Liu] fix test of zipRDD() 756ece0 [Shivaram Venkataraman] Update README remove outdated TODO d436f26 [Davies Liu] add missing files 40d193a [Shivaram Venkataraman] Merge pull request #224 from sun-rui/SPARKR-224-new 1a16cd6 [Davies Liu] rm PROJECT_HOME 56670ef [Davies Liu] rm man page ba4b80b [Davies Liu] Merge branch 'remote_r' into R f04080c [Davies Liu] Merge branch 'sparkr-sql' of 
github.com:amplab-extras/SparkR-pkg into remote_r 028cbfb [Davies Liu] fix exit code of sparkr unit test 42d8b4c [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R ef26015 [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R a1870e8 [Shivaram Venkataraman] Merge pull request #214 from sun-rui/SPARKR-156_3 cb6e5e3 [Shivaram Venkataraman] Add scripts to start SparkR on windows 8030847 [Shivaram Venkataraman] Set windows file separators, install dirs 05afef0 [Shivaram Venkataraman] Only stop backend JVM if R launched it 95d2de3 [Davies Liu] fix spark-submit with R scripot baefd9e [Shivaram Venkataraman] Make bin/sparkR use spark-submit As a part of this move the R initialization functions into first.R and first-submit.R d6f2bdd [Shivaram Venkataraman] Fix run-tests path ea90fab [Davies Liu] fix spark-submit with R path and sparkR -h 0e2412c [Davies Liu] fix bin/sparkR 9f6aa1f [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R 479e3fe [Davies Liu] change println() to logging 52ca6e5 [Shivaram Venkataraman] Add missing comma 716b16f [Shivaram Venkataraman] Merge branch 'R' of https://github.com/amplab-extras/spark into R 2d235d4 [Shivaram Venkataraman] Build SparkR with Maven profile aae881b [Davies Liu] fix rat ff776aa [Shivaram Venkataraman] Fix style e4f1937 [Shivaram Venkataraman] Remove DFC example f7b6936 [Davies Liu] remove Spark prefix for class 043959e [Davies Liu] cleanup ba53b09 [Davies Liu] support R in spark-submit f403b4a [Davies Liu] rm .travis.yml c4a5bdf [Davies Liu] run sparkr tests in Spark e8fc7ca [Davies Liu] fix .gitignore 35e5755 [Davies Liu] reduce size of example data 50bff63 [Davies Liu] add LICENSE header for R sources facb6e0 [Davies Liu] add .gitignore for .o, .so, .Rd 18e5eed [Davies Liu] update docs 0a0e632 [Davies Liu] move sparkR into bin/ a76472f [Davies Liu] fix path of assembly jar df3eeea [Davies Liu] move R/examples into examples/src/main/r 3415cc7 [Davies Liu] move Scala 
source into core/ and sql/ 180fc9c [Davies Liu] move scala 014d253 [Davies Liu] delete man pages 49a8133 [Davies Liu] Merge branch 'remote_r' into R 44994c2 [Davies Liu] Moved files to R/ 2fc553f [Shivaram Venkataraman] Merge pull request #222 from davies/column2 b043876 [Davies Liu] fix test 5e610cb [Davies Liu] add more API for Column 6f95d49 [Shivaram Venkataraman] Merge pull request #221 from shivaram/sparkr-stop-start 3214c6d [Shivaram Venkataraman] Merge pull request #217 from hlin09/cleanClosureFix f5d3355 [Shivaram Venkataraman] Merge pull request #218 from davies/merge 70f620c [Davies Liu] address comments 4b1628d [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into merge 3139325 [Shivaram Venkataraman] Merge pull request #212 from davies/toDF 6122e0e [Davies Liu] handle NULL bc2ff38 [Davies Liu] handle NULL 7f5e70c [Davies Liu] Update SerDe.scala 46454e4 [Davies Liu] address comments dd52cbc [Shivaram Venkataraman] Merge pull request #220 from shivaram/sparkr-utils-include 662938a [Shivaram Venkataraman] Include utils before SparkR for `head` to work Before this change calling `head` on a DataFrame would not work from the sparkR script as utils would be loaded after SparkR and placed ahead in the search list. 
This change requires utils to be loaded before SparkR 1bc2998 [Shivaram Venkataraman] Merge pull request #179 from evertlammerts/sparkr-sql 7695d36 [Evert Lammerts] added tests 8190127 [Evert Lammerts] fixed parquetFile signature d8c8fcc [Shivaram Venkataraman] Merge pull request #219 from shivaram/sparkr-build-final 963c7ee [Davies Liu] Merge branch 'master' into merge 8bff523 [Shivaram Venkataraman] Remove staging repo now that 1.3 is released e52258f [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into toDF 05b9126 [Shivaram Venkataraman] Merge pull request #215 from davies/agg 8e1497d [Davies Liu] Update DataFrame.R 72adb14 [Davies Liu] Update SQLContext.R 66cc92a [Davies Liu] address commets 55c38bc [Shivaram Venkataraman] Merge pull request #216 from davies/select2 3e0555d [Shivaram Venkataraman] Merge pull request #193 from davies/daemon 0467474 [Davies Liu] add more selecter for DataFrame 9a6be74 [Davies Liu] include grouping columns in agg() e87bb98 [Davies Liu] improve comment and logging a6dc435 [Davies Liu] remove dependency of jsonlite 26a3621 [Davies Liu] support date.frame and Date/Time 4e4908a [Davies Liu] createDataFrame from rdd 5757b95 [Shivaram Venkataraman] Merge pull request #196 from davies/die 90f2692 [Shivaram Venkataraman] Merge pull request #211 from hlin09/generics 8583968 [Davies Liu] readFully() 46cea3d [Davies Liu] retry 01aa5ee [Davies Liu] add config for using daemon, refactor ff948db [hlin09] Remove missingOrInteger. ecdfda1 [hlin09] Remove duplication. 411b751 [Davies Liu] make RStudio happy 8f8813f [Davies Liu] switch back to use parallel 6bccbbf [hlin09] Move roxygen doc back to implementation. ffd6e8e [Shivaram Venkataraman] Merge pull request #210 from hlin09/hlin09 471c794 [hlin09] Move getJRDD and broadcast's value to 00-generic.R. 89b886d [hlin09] Move setGeneric() to 00-generics.R. 97dde1a [hlin09] Add a test for access operators. 
09ff163 [Shivaram Venkataraman] Merge pull request #204 from cafreeman/sparkr-sql 15a713f [cafreeman] Fix example for `dropTempTable` dc1291b [hlin09] Add checks for namespace access operators in cleanClosure. b4c0b2e [Davies Liu] use fork package 3db5649 [cafreeman] Merge branch 'sparkr-sql' of https://github.com/amplab-extras/SparkR-pkg into sparkr-sql 789be97 [Shivaram Venkataraman] Merge pull request #207 from shivaram/err-remove e60578a [cafreeman] update tests to guarantee row order 5eec6fc [Shivaram Venkataraman] Merge pull request #206 from sun-rui/SPARKR-156_2 3f7aed6 [Sun Rui] Fix minor typos in the function description. a8cebf0 [Shivaram Venkataraman] Remove print statement in SparkRBackendHandler This print statement is noisy for SQL methods which have multiple APIs (like loadDF). We already have a better error message when no valid methods are found 5e3a576 [Sun Rui] Fix indentation. f3d99a6 [Sun Rui] [SPARKR-156] phase 2: implement zipWithIndex() of the RDD class. a582810 [cafreeman] Merge branch 'dfMethods' into sparkr-sql 7a5d6fd [cafreeman] `withColumn` and `withColumnRenamed` c5fa3b9 [cafreeman] New `select` method bcb0bf5 [Shivaram Venkataraman] Merge pull request #180 from davies/group 9dd6a5a [Davies Liu] Update SparkRBackendHandler.scala e6fb8d8 [Davies Liu] improve logging 428a99a [Davies Liu] remove test, catch exception fef99de [cafreeman] `intersect`, `subtract`, `unionAll` befbd32 [cafreeman] `insertInto` 9d01bcd [cafreeman] `dropTempTable` d8c1c09 [Davies Liu] add test to start and stop context multiple times 18c6004 [Shivaram Venkataraman] Merge pull request #201 from sun-rui/SPARKR-156_1 dfb399a [Davies Liu] address comments f06ccec [Sun Rui] Use mapply() instead of for statement. 
3c7674f [Davies Liu] Merge branch 'die' of github.com:davies/SparkR-pkg into die ac8a852 [Davies Liu] close monitor connection in sparkR.stop() 4d0fb56 [Shivaram Venkataraman] Merge pull request #203 from shivaram/sparkr-hive-fix 62b0760 [Shivaram Venkataraman] Fix test hive context package name 47a613f [Shivaram Venkataraman] Fix HiveContext package name fb3b139 [Davies Liu] fix tests d0d4626 [Shivaram Venkataraman] Merge pull request #199 from davies/load 8b7fb67 [Davies Liu] fix HiveContext bb46832 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into load e9e2a03 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into group b875b4f [Davies Liu] fix style de2abfa [Shivaram Venkataraman] Merge pull request #202 from cafreeman/sparkr-sql 3675fcf [cafreeman] Update `explain` and fixed doc for `toJSON` 5fd9575 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into load 6fac596 [Davies Liu] support Column expression in agg() f10a24e [Davies Liu] address comments ff8b005 [cafreeman] 'saveAsParquetFile` a5c2887 [cafreeman] fix test 3fab0f8 [cafreeman] `showDF` 779c102 [cafreeman] `isLocal` 68b11cf [cafreeman] `toJSON` 0ac4abc [cafreeman] 'explain` 20242c4 [cafreeman] clean up docs 6a1fe64 [Shivaram Venkataraman] Merge pull request #198 from cafreeman/sparkr-sql 198c130 [Shivaram Venkataraman] Merge pull request #200 from shivaram/sparkr-sql-build 870acd4 [Shivaram Venkataraman] Use rc2 explicitly 8b9a963 [cafreeman] Merge branch 'sparkr-sql' of https://github.com/amplab-extras/SparkR-pkg into sparkr-sql bc90115 [cafreeman] Fixed docs 3865f39 [Sun Rui] [SPARKR-156] phase 1: implement zipWithUniqueId() of the RDD class. 
a37fd80 [Davies Liu] Update sparkR.R d18f9d3 [Shivaram Venkataraman] Remove SparkR snapshot build We now have 1.3.0 RC2 on Apache Staging 8de958d [Davies Liu] Update SparkRBackend.scala 4e0becc [Shivaram Venkataraman] Merge pull request #194 from davies/api 197a79b [Davies Liu] add HiveContext (commented) 32aa01d [Shivaram Venkataraman] Merge pull request #191 from felixcheung/doc 5073e07 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into load 7918634 [cafreeman] Fix test acea146 [cafreeman] remove extra line 74269f3 [cafreeman] Merge branch 'dfMethods' into sparkr-sql cd7ac8a [Shivaram Venkataraman] Merge pull request #197 from cafreeman/sparkr-sql 494a4dd [cafreeman] update export e14c328 [cafreeman] `selectExpr` 32b37d1 [cafreeman] Fixed indent in `join` test. 2e7b190 [Felix Cheung] small update on yarn deploy mode. 8ff29d6 [Davies Liu] fix tests 12a6db2 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into api 294ca4a [cafreeman] `join`, `sort`, and `filter` 4fa6343 [cafreeman] Refactor `join` generic for use with `DataFrame` 3f22c8d [Shivaram Venkataraman] Merge pull request #195 from cafreeman/sparkr-sql 2b6f980 [Davies Liu] shutdown the JVM after R process die e8639c3 [cafreeman] New 1.3 repo and updates to `column.R` ed9a89f [Davies Liu] address comments 03bcf20 [Davies Liu] Merge branch 'group' of github.com:davies/SparkR-pkg into group 39c253d [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into group 98cc97a [Davies Liu] fix test and docs e2d144a [Felix Cheung] Fixed small typos 3beadcf [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into api 06cbc2d [Davies Liu] launch R worker by a daemon 8a676b1 [Shivaram Venkataraman] Merge pull request #188 from davies/column 524c122 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into column f798402 [Davies Liu] Update column.R 1d0f2ae [Davies Liu] Update DataFrame.R 
03402eb [Felix Cheung] Updates as per feedback on sparkR-submit 76cf2e0 [Shivaram Venkataraman] Merge pull request #192 from cafreeman/sparkr-sql 1955a09 [cafreeman] return object instead of a list of one object f585929 [cafreeman] Fix brackets e998356 [cafreeman] define generic for 'first' in RDD API 71d66a1 [Davies Liu] fix first(0 8ec21af [Davies Liu] fix signature acae527 [Davies Liu] refactor d7b17a4 [Davies Liu] fix approxCountDistinct 7dfe27d [Davies Liu] fix cyclic namespace dependency 8caf5bb [Davies Liu] use S4 methods 5c0bb24 [Felix Cheung] Doc updates: build and running on YARN 773baf0 [Zongheng Yang] Merge pull request #178 from davies/random 862f07c [Shivaram Venkataraman] Merge pull request #190 from shivaram/SPARKR-79 b457833 [Shivaram Venkataraman] Merge pull request #189 from shivaram/stdErrFix f7caeb8 [Davies Liu] Update SparkRBackend.scala 8c4deae [Shivaram Venkataraman] Remove unused function 6e51c7f [Shivaram Venkataraman] Fix stderr redirection on executors 7afa4c9 [Shivaram Venkataraman] Merge pull request #186 from hlin09/funcDep3 4d36ab1 [hlin09] Add tests for broadcast variables. 3f57e56 [hlin09] Fix comments. 7b72487 [hlin09] Fix comments. 
ae05bf1 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into column abb4bb9 [Davies Liu] add Column and expression eb8ac11 [Shivaram Venkataraman] Set Spark version 1.3.0 in Windows build 5c72e73 [Davies Liu] wait atmost 100 seconds e425437 [Shivaram Venkataraman] Merge pull request #177 from lythesia/master a00f502 [lythesia] fix indents 0346e5f [Davies Liu] address comment 6134649 [Shivaram Venkataraman] Merge pull request #187 from cafreeman/sparkr-sql ad0935e [lythesia] minor fixes b0e7f73 [cafreeman] Update `sampleDF` test 7b0d070 [lythesia] keep partitions check 889c265 [cafreeman] numToInt utility function 27dd3a0 [lythesia] modify tests for repartition cad0f0c [cafreeman] Fix docs and indents 2808dcf [cafreeman] Three more DataFrame methods 5ef66fb [Davies Liu] send back the port via temporary file 3b46429 [Davies Liu] Merge branch 'master' of github.com:amplab-extras/SparkR-pkg into random 798f453 [cafreeman] Merge branch 'sparkr-sql' into dev 9aa4acf [Shivaram Venkataraman] Merge pull request #184 from davies/socket 020bce8 [Shivaram Venkataraman] Merge pull request #183 from cafreeman/sparkr-sql 222e06b [cafreeman] Lazy evaluation and formatting changes e776324 [Davies Liu] fix import 211cc15 [cafreeman] Merge branch 'sparkr-sql' into dev 3351afd [hlin09] Replaces getDependencies with cleanClosure, to serialize UDFs to workers. e7c56d6 [lythesia] fix random partition key 50c74b1 [Davies Liu] address comments 083c89f [cafreeman] Remove commented lines an unused import dfa119b [hlin09] Improve the coverage of processClosure. a41c9b9 [cafreeman] Merge branch 'wrapper' into sparkr-sql 1cd714f [cafreeman] Wrapper function docs. 
db0cd9e [cafreeman] Clean up for wrapper functions 818c19f [cafreeman] Update schema-related functions a57884e [cafreeman] Remove unused import d72e830 [cafreeman] Add wrapper for `StructField` and `StructType` 2ea2ecf [lythesia] use generic arg 09b9512 [hlin09] add docs f4f077c [hlin09] Add recursive cleanClosure for function access. f84ad27 [hlin09] Merge remote-tracking branch 'upstream/master' into funcDep2 5300766 [Shivaram Venkataraman] Merge pull request #185 from hlin09/hlin09 07aa7c0 [hlin09] Unifies the implementation of lapply with lapplyParitionsWithIndex. f4dbb0b [Davies Liu] use socket in worker 8282c59 [Davies Liu] Update DataFrame.R ba495a8 [Davies Liu] Update NAMESPACE 36dffb3 [cafreeman] Add 'head` and `first` 534a95f [cafreeman] Schema-related methods 64f488d [cafreeman] Cache and Persist Methods 30d71fd [cafreeman] Standardize method arguments for DataFrame methods 785898b [Shivaram Venkataraman] Merge pull request #182 from cafreeman/sparkr-sql 2619003 [Shivaram Venkataraman] Merge pull request #181 from cafreeman/master a9bbe0b [cafreeman] Update existing SparkSQL functions 8c241a3 [cafreeman] Merge with master, include changes to method args 68d6de4 [cafreeman] Fix typos 8d2ec6e [Davies Liu] add sum/max/min/avg/mean 774e687 [Davies Liu] add missing API in SQLContext 1e72b4b [Davies Liu] missing API in SQLContext 3294949 [Chris Freeman] Restore `rdd` argument to `getJRDD` 3a58ebc [Davies Liu] rm unrelated file 8bd93b5 [Davies Liu] fix signature c652b4c [cafreeman] Update method signatures to use generic arg 48c8827 [Davies Liu] update NAMESPACE 84e2d8c [Davies Liu] groupBy and agg() 7c3ddbd [Davies Liu] create jmode in JVM 9465426 [Davies Liu] load and save 982f342 [lythesia] fix numeric issue 7651d84 [lythesia] fix coalesce 4e712e1 [Davies Liu] use random port in backend 041d22b [Shivaram Venkataraman] Merge pull request #172 from cafreeman/sparkr-sql 0d07770 [cafreeman] Added `limit` and updated `take` 301d8e5 [cafreeman] Remove extraneous 
map functions 0387db2 [cafreeman] Remove colNames 04c4b65 [lythesia] add repartition/coalesce 231deab [cafreeman] Change reserialize to serializeToBytes acf7e1a [cafreeman] Rework the Scala to R DataFrame Conversion 481ae37 [cafreeman] Updated stale comments and standardized arg names 21d4a97 [hlin09] Adds cleanClosure to capture the function closures. d24ffb4 [hlin09] Merge remote-tracking branch 'upstream/master' into funcDep2 8be02de [hlin09] Revert "loop 1-12 test pass." fddb9cc [hlin09] Revert "add docs" f8ef0ab [hlin09] Revert "More docs" 8e4b3da [hlin09] Revert "More docs" 57e005b [hlin09] Revert "fix tests." c10148e [Shivaram Venkataraman] Merge pull request #174 from shivaram/sparkr-runner 910e3be [Shivaram Venkataraman] Add a timeout for initialization Also move sparkRBackend.stop into a finally block bf52b17 [Shivaram Venkataraman] Merge remote-tracking branch 'amplab-sparkr/master' into sparkr-runner 08102b0 [Shivaram Venkataraman] Merge pull request #176 from lythesia/master 9c77b20 [Chris Freeman] Merge pull request #2 from shivaram/sparkr-sql 179ab38 [lythesia] add try counts and increase time interval 71a73b2 [Shivaram Venkataraman] Use a getter for serialization mode This change encapsulates the semantics of serialization mode for RDDs inside a getter function. 
For PipelinedRDDs if a backing JavaRDD is available we use that else we fall back to a default serialization mode 06bf250 [Shivaram Venkataraman] Merge pull request #173 from shivaram/windows-space-fix 88bf97f [Shivaram Venkataraman] Create SparkContext for R shell launch f9268d9 [Shivaram Venkataraman] Fix code review comments e6ad12d [Shivaram Venkataraman] Update comment describing sparkR-submit 17eda4c [Shivaram Venkataraman] Merge pull request #175 from falaki/docfix ba2b72b [Hossein] Spark 1.1.0 is default 4cd7d3f [lythesia] retry backend connection 749e2d0 [Hossein] Updated README bc04cf4 [Shivaram Venkataraman] Use SPARKR_BACKEND_PORT in sparkR.R as default Change SparkRRunner to use EXISTING_SPARKR_BACKEND_PORT to differentiate between the two 22a19ac [Shivaram Venkataraman] Use a semaphore to wait for backend to initalize Also pick a random port to avoid collisions 7f1f0f8 [cafreeman] Move comments to fit 100 char line length 8b84e4e [cafreeman] Make if statements more explicit ce5d5ab [cafreeman] New tests for Union and Object File b063320 [cafreeman] Changed 'serialized' to 'serializedMode' 0981dff [Zongheng Yang] Merge pull request #168 from sun-rui/SPARKR-153_2 86fc639 [Shivaram Venkataraman] Move sparkR-submit into pkg/inst fd8f8a9 [Shivaram Venkataraman] Merge branch 'hqzizania-master' a33dbea [Shivaram Venkataraman] Merge branch 'master' of https://github.com/hqzizania/SparkR-pkg into hqzizania-master 384e6e2 [Shivaram Venkataraman] Merge pull request #171 from hlin09/hlin09 1f5a6ac [hlin09] fixed comments 7f7596a [cafreeman] Additional handling for "row" serialization 8c3b8c5 [cafreeman] Add test for UnionRDD on "row" serialization b1141f8 [cafreeman] Fixed formatting issues. 
5db30bf [cafreeman] Changed serialized from bool to string 2f0c0b8 [cafreeman] Add check for serialized type d243dfb [cafreeman] Clean up code 5ff63a2 [cafreeman] Change test from boolean to string 77fec1a [cafreeman] Updated .Rd files 9224989 [cafreeman] Various updates for DataFrame to RRDD 26af62b [cafreeman] DataFrame to RRDD e004481 [cafreeman] Update UnionRDD test 5292be7 [hlin09] Adds support of pipeRDD(). e2a7560 [Shivaram Venkataraman] Merge pull request #170 from cafreeman/sparkr-sql 5d537f4 [cafreeman] Add pairRDD to Description b6fa88e [cafreeman] Updating to current master 0cda231 [Sun Rui] [SPARKR-153] phase 2: implement aggregateByKey() and foldByKey(). 95ee6b4 [Shivaram Venkataraman] Merge remote-tracking branch 'amplab-sparkr/master' into sparkr-runner 67fbc60 [Shivaram Venkataraman] Add support for SparkR shell to use spark-submit This ensures that SparkConf options are read in both in batch and interactive modes 2271030 [Shivaram Venkataraman] Merge pull request #167 from sun-rui/removePartionByInRDD 7fcb46a [Sun Rui] Remove partitionBy() in RDD. 52f94c4 [Shivaram Venkataraman] Merge pull request #160 from lythesia/master 59e2d54 [lythesia] merge with upstream 5836650 [Zongheng Yang] Merge pull request #163 from sun-rui/SPARKR-153_1 141723e [Sun Rui] fix comments. f73a07e [Shivaram Venkataraman] Merge pull request #165 from shivaram/sparkr-sql-build 10ffc6d [Shivaram Venkataraman] Set Spark version to 1.3 using staging dependency Also fix the maven build c91ede2 [Shivaram Venkataraman] Merge pull request #164 from hlin09/hlin09 9d335a9 [hlin09] Makes git to ignore Eclipse meta files. 94066bf [Sun Rui] [SPARKR-153] phase 1: implement fold() and aggregate(). 
9c391c7 [hqzizania] Merge remote-tracking branch 'upstream/master' 5f29551 [hqzizania] modified: pkg/R/RDD.R modified: pkg/R/context.R d968664 [lythesia] fix comment 7972858 [Shivaram Venkataraman] Merge pull request #159 from sun-rui/SPARKR-150_2 7690878 [lythesia] separate out pair RDD functions f4573c1 [Sun Rui] Use reduce() instead of sortBy().take() to get the ordered elements. 63e62ed [Sun Rui] [SPARKR-150] phase 2: implement takeOrdered() and top(). 050390b [Shivaram Venkataraman] Fix bugs in inferring R file 8398f2e [Shivaram Venkataraman] Add sparkR-submit helper script Also adjust R file path for YARN cluster mode bd6705b [Zongheng Yang] Merge pull request #154 from sun-rui/SPARKR-150 c7964c9 [Sun Rui] Merge with upstream master. 7feac38 [Sun Rui] Use default arguments for sortBy() and sortKeyBy(). de2bfb3 [Sun Rui] Fix minor comments and add more test cases. 0c6e071 [Zongheng Yang] Merge pull request #157 from lythesia/master f5038c0 [lythesia] pull out anonymous functions in groupByKey ba6f044 [lythesia] fixes for reduceByKeyLocally 343b6ab [Oscar Olmedo] Export sparkR.stop Closes #156 from oscaroboto/master 25639cf [Shivaram Venkataraman] Replace tabs with spaces bb25920 [Shivaram Venkataraman] Merge branch 'dputler-master' fd836db [hlin09] fix tests. 24a7f13 [hlin09] More docs a465165 [hlin09] More docs 6ad4fc3 [hlin09] add docs b082a35 [lythesia] add reduceByKeyLocally 7ca6512 [Shivaram Venkataraman] First cut of SparkRRunner 193f5fe [hlin09] loop 1-12 test pass. 345f1b8 [dputler] [SPARKR-195] Implemented project style guidelines for if-else statements 8043559 [Sun Rui] Add a TODO to use binary search in the range partitioner. 91b2fd6 [Sun Rui] Add more test cases. 
e8ebbe4 [Shivaram Venkataraman] Merge pull request #152 from cafreeman/sparkr-sql 0c53d6c [dputler] Data frames now coerced to lists, and messages issued for a data frame or matrix on how they are parallelized 6d57ec0 [cafreeman] Remove json test file since we're using a temp ac1ef09 [cafreeman] Update registerTempTable test d9da451 [Sun Rui] [SPARKR-150] phase 1: implement sortBy() and sortByKey(). 08ff30b [Shivaram Venkataraman] Merge pull request #153 from hqzizania/master 9767e8e [hqzizania] modified: pkg/man/collect-methods.Rd 5d69f0a [hqzizania] modified: pkg/R/RDD.R 4914091 [hqzizania] modified: pkg/inst/tests/test_rdd.R 742a68b [cafreeman] Update test_sparkRSQL.R a95823e [hqzizania] modified: pkg/R/RDD.R 2d04526 [cafreeman] Formatting fae9bdd [cafreeman] Renamed to SQLUtils.scala 39888ea [Chris Freeman] Update test_sparkSQL.R fce2453 [cafreeman] Updated documentation for SQLContext 13fbf12 [cafreeman] Regenerated .Rd files 51ecf41 [cafreeman] Updated Scala object 30d7337 [cafreeman] Added SparkSQL test 74b3ed6 [cafreeman] Incorporate code feedback 554bda0 [Zongheng Yang] Merge pull request #147 from shivaram/sparkr-ec2-fixes a5f4f8f [cafreeman] Squashed commit of the following: f34bb88 [Shivaram Venkataraman] Remove profiling information from this PR c662f29 [Zongheng Yang] Merge pull request #146 from shivaram/spark-1.2-build 21e9b74 [Zongheng Yang] Merge pull request #145 from lythesia/master 76f6b9e [Shivaram Venkataraman] Merge pull request #149 from hqzizania/master 1c2dbec [lythesia] minor fix for refactoring join code 5b380d3 [hqzizania] modified: pkg/man/combineByKey.Rd modified: pkg/man/groupByKey.Rd modified: pkg/man/partitionBy.Rd modified: pkg/man/reduceByKey.Rd 98794fe [hqzizania] modified: pkg/R/RDD.R b66534d [Zongheng Yang] Merge pull request #144 from shivaram/fix-rd-files 60da1df [Shivaram Venkataraman] Initialize timing variables 179aa75 [Shivaram Venkataraman] Bunch of fixes for longer running jobs 1. 
Increase the timeout for socket connection to wait for long jobs 2. Add some profiling information in worker.R 3. Put temp file writes before stdin writes in RRDD.scala 06d99f0 [Shivaram Venkataraman] Fix URI to have right number of slashes add97f5 [Shivaram Venkataraman] Use URL encode to create valid URIs for jars 4eec962 [lythesia] refactor join functions 73430c6 [Shivaram Venkataraman] Make SparkR work on paths with spaces on Windows aaf8f47 [Shivaram Venkataraman] Exclude hadoop client from Spark dependency 227ee42 [Zongheng Yang] Merge pull request #141 from shivaram/SPARKR-140 ac5ceb1 [Shivaram Venkataraman] Fix code review comments 32394de [Shivaram Venkataraman] Regenerate Rd files for SparkR This fixes a number of issues in SparkR man pages. The main changes are 1. Don't export or generate docs for PipelineRDD 2. Fix variable names for Filter, count to match base methods 3. Document missing arguments for sparkR.init, print.jobj etc. e157bf6 [Shivaram Venkataraman] Use prev_serialized to track if JRDD is serialized This change introduces a new variable in PipelineRDD environment to track if the prev_jrdd is serialized or not. 7428a7e [Zongheng Yang] Merge pull request #143 from shivaram/SPARKR-181 7dd1797 [Shivaram Venkataraman] Address code review comments 8f81c45 [Shivaram Venkataraman] Remove roxygen export for PipelinedRDD 0cb90f1 [Zongheng Yang] Merge pull request #142 from shivaram/SPARKR-169 d1c6e6c [Shivaram Venkataraman] Buffer stderr from R and return it on Exception This change buffers the last 100 lines from the R process and passes these lines back to the driver if we have an exception. This will help users debug why their tasks failed on the cluster d6c1393 [Shivaram Venkataraman] Suppress warnings from normalizePath a382835 [Shivaram Venkataraman] Fix serialization tracking in pipelined RDDs When creating a pipeline RDD, we need to check if the JavaRDD belonging to the parent is serialized. 
da39529 [Zongheng Yang] Merge pull request #140 from sun-rui/SPARKR-183 2814caa [Sun Rui] Merge with upstream master. cd2a5b3 [Sun Rui] Add reference to Nagle's algorithm and clean code. 52356b6 [Shivaram Venkataraman] Merge pull request #139 from shivaram/fix-backend-exit 97e5a1f [Sun Rui] [SPARKR-183] Fix the issue that parallelize collect tests are slow. a9f8e8e [Shivaram Venkataraman] Merge pull request #138 from concretevitamin/fix-collect-test 125ae43 [Shivaram Venkataraman] Fix SparkR backend to exit in more cases This change has two fixes 1. When the workspace is saved (from R or RStudio) the backend connection seems to be closed before the finalizer is run. In such cases we reopen the connection and stop the backend 2. With RStudio when R is restarted, there are port-conflicts which appear due to a race condition between the JVM and rsession restart. This change adds a 1 sec sleep to avoid this race. 12c102a [Zongheng Yang] Simplify a unit test. 9c0637a [Zongheng Yang] Merge pull request #137 from shivaram/fix-docs 0df0e18 [Shivaram Venkataraman] Fix documentation for includePackage 7549f88 [Zongheng Yang] Merge pull request #136 from shivaram/man-updates 7edbe46 [Shivaram Venkataraman] Add missing man pages 9cb9567 [Shivaram Venkataraman] Merge pull request #131 from shivaram/rJavaExpt 1fa722e [Shivaram Venkataraman] Rename to SerDe now 2fcb051 [Shivaram Venkataraman] Rename to SerDeJVMR d112cf0 [Shivaram Venkataraman] Style fixes 9fd01cc [Shivaram Venkataraman] Remove unnecessary braces 0881931 [Shivaram Venkataraman] Some more style fixes f00b531 [Shivaram Venkataraman] Address code review comments. 
Big changes include style fixes throughout for named arguments c09ba05 [Shivaram Venkataraman] Change jobj id to be just an integer Add a new print.jobj that gets the class name and prints it Also add a utility function isInstanceOf be05b16 [Shivaram Venkataraman] Check if context, connection exist before stopping d596a23 [Shivaram Venkataraman] Address code review comments 396e7ac [Shivaram Venkataraman] Changes to make new backend work on Windows This change uses file.path to construct the Java binary path in a OS agnostic way and uses system2 to handle quoting binary paths correctly. Tests pass on Mac OSX and a Windows EC2 instance. e7a4e03 [Shivaram Venkataraman] Remove unused file BACKEND.md 62f380b [Shivaram Venkataraman] Update worker.R to use new deserialization call 8b9c4e6 [Shivaram Venkataraman] Change RDD name, setName to use new backend 6dcd5c5 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/amplab-extras/SparkR-pkg into rJavaExpt 0873397 [Shivaram Venkataraman] Refactor java object tracking into a new singleton. Also add comments describing each class 95db964 [Shivaram Venkataraman] Add comments, cleanup new R code bcd4258 [Zongheng Yang] Merge pull request #130 from lythesia/master 74dbc5e [Sun Rui] Match method using parameter types. 7ad4a4d [Sun Rui] Use 1 char to represent types on the backend->client direction. bace887 [Sun Rui] Use an integer count for the backend java object ID because Uniqueness isn't guaranteed by System.identityHashCode(). b38d04f [Sun Rui] Use 1 char to represent types on the client -> backend direction. 
f88bc68 [lythesia] Merge branch 'master' of github.com:lythesia/SparkR-pkg 71d41f5 [lythesia] add test case for fullOuterJoin eb4f423 [lythesia] --amend cffecc5 [lythesia] add test case for fullOuterJoin a547dd2 [Shivaram Venkataraman] Move classTag, rddRef into newJObject call This avoids them getting eagerly garbage collected 1255391 [Shivaram Venkataraman] Add a finalizer for jobj objects This enables Java objects to be garbage collected on the backend when they are no longer referenced in R. Also rename newJava to newJObject to be more consistent with callJMethod 70fa409 [Sun Rui] Add YARN Conf Dir to the class path when launching the backend. a1108ca [lythesia] add fullOuterJoin in RDD.R 2152727 [Shivaram Venkataraman] Remove empty file cd08bee [Shivaram Venkataraman] Update all functions to use new backend All unit tests pass. 9de49b7 [Shivaram Venkataraman] Add high level calls for methods, constructors Also update BACKEND.md 5a97ea4 [Shivaram Venkataraman] Add jobj S3 class that holds backend refs e071d3e [Shivaram Venkataraman] Change SparkRBackend to use general method calls This change uses a custom protocol + JNI to invoke any method on a given object type. Also update serializers, deserializers to make code more concise 49f0404 [Shivaram Venkataraman] Merge pull request #129 from lythesia/master 7f8cd82 [lythesia] update man 4715ed2 [Yi Lu] Update RDD.R 5a53801 [lythesia] fix name,setName 4f3870b [lythesia] add name,setName in RDD.R 1c25700 [Shivaram Venkataraman] Merge pull request #128 from sun-rui/SPARKR-165 c8507d8 [Sun Rui] [SPARKR-165] IS_SCALAR is not present in R before 3.1 2cff2bd [Sun Rui] Add function to invoke Java method. 7a31da1 [Shivaram Venkataraman] Merge branch 'dputler-master'. 
Closes #119 0ceba82 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/dputler/SparkR-pkg into dputler-master 735f70c [Shivaram Venkataraman] Merge pull request #125 from 7c00/rawcon fccfe6c [Shivaram Venkataraman] Merge pull request #127 from sun-rui/SPARKR-164 387bd57 [Sun Rui] [SPARKR-164] Temporary files used by SparkR accumulate as time goes on. 5f2268f [Shivaram Venkataraman] Add support to stop backend 5f745c0 [Shivaram Venkataraman] Update notes in backend 22015c1 [Shivaram Venkataraman] Add first cut of SparkR Backend 52821da [Todd Gao] switch the order of packages and function deps d7b0007 [Todd Gao] remove memCompress cb6873e [Shivaram Venkataraman] Merge pull request #126 from sun-rui/SPARKR-147 c5962eb [Todd Gao] further optimize using rawConnection f04c6e0 [Sun Rui] [SPARKR-147] Support multiple directories as input to textFile. b7de604 [Todd Gao] optimize execFunctionDeps loading in worker.R 4d4fc30 [Shivaram Venkataraman] Merge pull request #122 from cafreeman/master b508877 [cafreeman] Update SparkR_IDE_Setup.sh 21ed9d7 [cafreeman] Update build.sbt f73ec16 [cafreeman] Delete SparkR_IDE_Setup_Guide.md d63b026 [cafreeman] Delete SparkR_Quick_Start_Guide.md 6e6cb62 [cafreeman] Update SparkR_IDE_Setup.sh bc6042b [cafreeman] Update build.sbt a8197d5 [cafreeman] Merge remote-tracking branch 'upstream/master' d671564 [Zongheng Yang] Merge pull request #123 from shivaram/jcheck-void 76b8d00 [Zongheng Yang] Merge pull request #124 from shivaram/master b690d58 [Shivaram Venkataraman] Specify how to change Spark versions in README 0fb003d [Shivaram Venkataraman] Merge branch 'master' of https://github.com/amplab-extras/SparkR-pkg into jcheck-void 1c227b4 [Shivaram Venkataraman] Also add a check in context.R 96812b6 [Shivaram Venkataraman] Check for exceptions after void method calls f5c216d [cafreeman] Merge remote-tracking branch 'upstream/master' 90c8933 [Zongheng Yang] Merge pull request #121 from shivaram/fix-sort-order bd0e3b4 [Shivaram 
Venkataraman] Fix saveAsTextFile test case 2e55f67 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/amplab-extras/SparkR-pkg into fix-sort-order f10c607 [Shivaram Venkataraman] Merge pull request #118 from sun-rui/saveAsTextFile 6c9bfc0 [Sun Rui] Merge remote-tracking branch 'SparkR_upstream/master' into saveAsTextFile 6faedbe [cafreeman] Update SparkR_IDE_Setup_Guide.md 57008bc [cafreeman] Update SparkR_IDE_Setup.sh bb1c17d [cafreeman] Update SparkR_IDE_Setup.sh 538bfdb [cafreeman] Update SparkR_Quick_Start_Guide.md 31322c6 [cafreeman] Update SparkR_IDE_Setup.sh ca3f593 [Sun Rui] Refactor RRDD code. df58d95 [cafreeman] Update SparkR_Quick_Start_Guide.md b488c88 [cafreeman] Rename Spark_IDE_Setup.sh to SparkR_IDE_Setup.sh b2545a4 [cafreeman] Added IDE Setup Guide 0ffb5de [cafreeman] Merge branch 'master' of https://github.com/cafreeman/SparkR-pkg bd8fbfb [cafreeman] Merge remote-tracking branch 'upstream/master' 98efa5b [cafreeman] Added Quick Start Guide 3cf88f2 [Shivaram Venkataraman] Sort lists before comparing in unit tests Since Spark doesn't guarantee that shuffle results will always be in the same order, we need to sort the results before comparing for deterministic behavior d621dbc [Shivaram Venkataraman] Merge pull request #120 from sun-rui/objectFile c4a44d7 [Sun Rui] Add @seealso in comments and extract some common code into a function. 724e3a4 [cafreeman] Update Spark_IDE_Setup.sh 8153e5a [Sun Rui] [SPARKR-146] Support read/save object files in SparkR. 17f9909 [cafreeman] Update Spark_IDE_Setup.sh a9eb080 [cafreeman] IDE Shell Script 64d800c [dputler] Merge remote branch 'upstream/master' 1fbdb2e [dputler] Added the ability for the user to specify a text file location through the use of tilde expansion or just the file name if it is in the working directory. d83c017 [Shivaram Venkataraman] Merge pull request #113 from sun-rui/stringHashCodeInC a7d9cdb [Sun Rui] Fix build on Windows. 
7d81b05 [Shivaram Venkataraman] Merge pull request #114 from hlin09/hlin09 47c4bb7 [hlin09] fix reviews a457f7f [Shivaram Venkataraman] Merge pull request #116 from dputler/master 0fa48d1 [Shivaram Venkataraman] Merge pull request #117 from sun-rui/keyBy 85cfeb4 [Sun Rui] [SPARKR-144] Implement saveAsTextFile() in the RDD class. 09083d9 [Sun Rui] Add keyBy() to the RDD class. caad5d7 [dputler] Adding the script to install software on the Cloudera Quick Start VM. dca3d05 [hlin09] Minor fix. ece5f7d [hlin09] Merge remote-tracking branch 'upstream/master' into hlin09 a40874b [hlin09] Use extendible accumulators aggregate the cogroup values. d0347ce [Zongheng Yang] Merge pull request #112 from sun-rui/outer_join 492f76e [Sun Rui] Refine code and add description. ba01358 [Shivaram Venkataraman] Merge pull request #115 from sun-rui/SPARKR-130 5c8e46e [Sun Rui] Fix per the review comments. 7190a2c [Sun Rui] Update comment to add a reference to storage levels. 1da705e [hlin09] Fix the review comments. c4b77be [Sun Rui] [SPARKR-130] Add persist(storageLevel) API to RDD. b424a1a [hlin09] Add function cogroup(). 9770312 [Shivaram Venkataraman] Merge pull request #111 from hlin09/hlin09 cead7df [hlin09] fix review comments. 54f712e [Sun Rui] Implement string hash code in C. 425f0c6 [Sun Rui] Add leftOuterJoin() and rightOuterJoin() to the RDD class. 39509c7 [hlin09] add Rd file for foreach and foreachPartition. 63d6ac7 [hlin09] Adds function foreach() and foreachPartition(). 9c954df [Zongheng Yang] Merge pull request #105 from sun-rui/join c71228d [Sun Rui] Pre-allocate list with fixed length. Add test case for join() using string key. bc3e9f6 [Shivaram Venkataraman] Merge pull request #108 from concretevitamin/take-optimize c06fc90 [Zongheng Yang] Fix: only optimize for unserialized dataset case. d399aeb [Zongheng Yang] Apply size-capping on logical representation instead of physical. 
e4217dd [Zongheng Yang] Merge pull request #107 from shivaram/master 7952180 [Shivaram Venkataraman] Copy, use getLocalDirs from Spark Utils.scala 08e24c3 [Zongheng Yang] Merge pull request #109 from hlin09/hlin09 97d4e02 [Zongheng Yang] Min() upper-bound size with actual size. bb779bf [hlin09] Rename the filter function to filterRDD to follow the API consistency. Filter() is also kept. ce1661f [Zongheng Yang] Fix slow take(): deserialize only up to necessary # of elements. 4dca9b1 [Shivaram Venkataraman] Merge pull request #106 from hlin09/hlin09 1220d92 [hlin09] Adds function numPartitions(). 2326a65 [Shivaram Venkataraman] Use SPARK_LOCAL_DIRS to create tmp files e119757 [hlin09] Minor fix. 9c24c8b [hlin09] Adds function countByKey(). 48fce67 [hlin09] Adds countByValue(). 6679eef [Sun Rui] Update documentation for join(). 70586b4 [Sun Rui] Add join() to the RDD class. e6fb999 [Zongheng Yang] Merge pull request #103 from shivaram/rlibdir-fix a21f146 [Shivaram Venkataraman] Merge pull request #102 from hlin09/hlin09 32eb619 [Shivaram Venkataraman] Merge pull request #104 from sun-rui/add_keys_values d8692e9 [Sun Rui] Add keys() and values() for the RDD class. 18b9be1 [Shivaram Venkataraman] Allow users to set where SparkR is installed This also adds a warning if somebody tries to call sparkR.init multiple times. a17f135 [hlin09] Adds tests for flatMap and flatMapValues. 4bcf59b [hlin09] Adds function flatMapValues. 4a193ef [Zongheng Yang] Merge pull request #101 from ashutoshraina/master 60d22f2 [Ashutosh Raina] changed sbt version 5400793 [Zongheng Yang] Merge pull request #98 from shivaram/windows-fixes-build 36d61a7 [Shivaram Venkataraman] Merge pull request #97 from hlin09/hlin09 f7d7d89 [hlin09] Remove redundant code in test. 6bbe823 [hlin09] minor style fix. 
9b47f3a [Shivaram Venkataraman] Merge pull request #100 from hnahak87/patch-1 7f6e4ea [Harihar Nahak] Update logistic_regression.R a605047 [Shivaram Venkataraman] Merge pull request #99 from hlin09/makefile 323151d [hlin09] Fix yar flag in Makefile to remove build error in Maven. 8911897 [hlin09] Make reserialize() private function in package. 79aee73 [Shivaram Venkataraman] Add notes on how to build SparkR on windows 49a99e7 [Shivaram Venkataraman] Clean up some commented code ddc271b [Shivaram Venkataraman] Only append file:/// to non empty jar paths a53952e [Shivaram Venkataraman] Add windows build scripts 325b179 [hlin09] Merge remote-tracking branch 'upstream/master' into hlin09 daf5040 [hlin09] Add reserialize() before union if two RDDs are not both serialized. 536afb1 [hlin09] Add new function of union(). 7044677 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/amplab-extras/SparkR-pkg into windows-fixes d22a02d [Zongheng Yang] Merge pull request #94 from shivaram/windows-fixes-stdin 51924f7 [Shivaram Venkataraman] Merge pull request #90 from oscaroboto/master eb97d85 [Shivaram Venkataraman] Merge pull request #96 from sun-rui/add_clarification_readme 5a128f4 [Sun Rui] Add clarification on setting Spark master when launching the SparkR shell. 187526a [oscaroboto] Update sparkR.R 32c567b [Shivaram Venkataraman] Merge pull request #95 from concretevitamin/master 4cd2d5e [Zongheng Yang] Notes about spark-ec2. 1c28e3b [Shivaram Venkataraman] Merge branch 'master' of https://github.com/amplab-extras/SparkR-pkg into windows-fixes 8e8a029 [Zongheng Yang] Merge pull request #92 from shivaram/sparkr-yarn 721043b [Zongheng Yang] Update README.md with YARN instructions. 1681f58 [Shivaram Venkataraman] Use temporary files for input instead of stdin This fixes a bug for Windows where stdin would get truncated b084314 [oscaroboto] removed ... 
from example 44c93d4 [oscaroboto] Added example to SparkR.R be82dcc [Shivaram Venkataraman] Merge pull request #93 from hlin09/hlin09 868554d [oscaroboto] Update sparkR.R 488ac47 [hlin09] Add generated Rd file of previous added functions, distinct() and mapValues(). b2740ad [hlin09] Add test for filter all elements. Add filter() as alias. 08d3631 [hlin09] Minor style fixes. 2c0e34f [hlin09] Adds function Filter(), which extracts the elements that satisfy a predicate. 5951d3b [Shivaram Venkataraman] Remove SBT plugin 4e70ced [oscaroboto] changed ExecutorEnv to sparkExecutorEnvMap, to make it consistent with sparkEnvirMap 903d18a [oscaroboto] changed executorEnv to sparkExecutorEnvMap, will do the same in R f97346e [oscaroboto] executorEnv to lower-case e 88a524e [oscaroboto] Added LD_LIBRARY_PATH to the ExecutorEnv. This is needed so that the nodes can find libjvm.so, or if the master has a different LD_LIBRARY_PATH than the nodes. Make sure to export LD_LIBRARY_PATH that includes the path to libjvm.so in the nodes. 1d208ae [oscaroboto] added the YARN_CONF_DIR to the classpath 8a9b75c [oscaroboto] forgot to change hm and ee inside the for loops 579db58 [Shivaram Venkataraman] Merge pull request #91 from sun-rui/add_max_min 4381efa [Sun Rui] use reduce() to implement max() and min(). a5459c5 [Shivaram Venkataraman] Consolidate yarn flags 86b04eb [Shivaram Venkataraman] Don't use quotes around yarn bf0797f [Shivaram Venkataraman] Add dependency on spark yarn module af5fe77 [Shivaram Venkataraman] Fix SBT build, add dependency tree plugin 4917607 [Sun Rui] Add maximum() and minimum() API to RDD. 51bbbe4 [Shivaram Venkataraman] Changes to make SparkR work with YARN 9d5e3ab [oscaroboto] a few stylistic changes. Also change vars to sparkEnvirMap and eevars to ExecutorEnv, to match sparkR.R 578f545 [oscaroboto] a few stylistic changes 39eea2f [oscaroboto] Modification to dynamically create a sparkContext with YARN. 
Added .setExecutorEnv to the sparkConf in createSparkContext within the RRDD object. This modification was made together with sparkR.R 17ec42e [oscaroboto] A modification to dynamically create a sparkContext with YARN. sparkR.R modified to pass custom Jar file names and EnvironmentEnv to the sparkConf. RRDD.scala was also modified to accept the new inputs to createSparkContext. 624ac9d [Shivaram Venkataraman] Merge pull request #87 from sun-rui/SPARKR-125 4f213db [Shivaram Venkataraman] Merge pull request #89 from sun-rui/SPARKR-108 eb833c5 [Shivaram Venkataraman] Merge pull request #88 from hlin09/hlin09 07bf971 [Sun Rui] [SPARKR-108] Implement map-side reduction for reduceByKey(). 4accba1 [hlin09] Fixes style and adds an optional param 'numPartition' in distinct(). 80d303a [hlin09] typo fixed. e37a9b5 [hlin09] Adds function distinct() and mapValues(). 08dac06 [Sun Rui] [SPARKR-125] Get the iterator of the parent RDD before launching a R worker process in compute() of RRDD/PairwiseRRDD c4ba53c [Shivaram Venkataraman] Merge pull request #85 from edwardt/master 72a9d27 [root] reorder to keep relative ordering the same f3fcb10 [root] fix up build.sbt also to match pom.xml 5ecbe3e [root] Make spark version configurable in build script per ISSUE122 a44e63d [Shivaram Venkataraman] Merge pull request #84 from sun-rui/SPARKR-94 fbb5663 [Sun Rui] Add {} to one-line functions and add a test case for lookup where no match is found. 95beb4e [Shivaram Venkataraman] Merge pull request #82 from edwardt/master 36776c5 [edwardt] missed one 0.9.0 revert b26deec [Sun Rui] [SPARKR-94] Add a method to get an element of a pair RDD object by key. 
1ba256e [edwardt] Keep 0.9.0 and says uses 1.1.0 by default 5380c43 [root] missed one version 21f74da [root] upgrade to spark version 1.1.0 to match latest merge list ddfcde9 [root] merge 67d067a [Shivaram Venkataraman] Merge pull request #81 from sun-rui/SparkR-117 993868f [Sun Rui] [SPARKR-117] Update Spark dependency to 1.1.0 d20661a [Zongheng Yang] Merge pull request #80 from sun-rui/master 0b2da9f [Sun Rui] Update Rd file and add a test case for mapPartitions. 5879648 [Sun Rui] Add mapPartitions() method to RDD for API consistency. c033461 [Shivaram Venkataraman] Merge pull request #79 from sun-rui/fix-kmeans f62b77e [Sun Rui] Adjust coding style. b40911d [Sun Rui] Fix syntax error in examples/kmeans.R. 5304451 [Shivaram Venkataraman] Merge pull request #78 from sun-rui/master 70ffbfb [Sun Rui] Fix a bug that modifications to build.sbt won't trigger rebuilding. a25696c [Shivaram Venkataraman] Merge pull request #76 from edwardt/addjira b8bbd93 [edwardt] Update README.md 615d930 [edwardt] Update README.md e522e69 [edwardt] Update README.md 03e6ced [edwardt] Update README.md 3007015 [root] don't check in gedit buffer file c35c9a6 [root] Add where to enter bugs and feedback 469eae3 [edwardt] Update README.md 61b4a43 [edwardt] Update Makefile (style uniformity) ce3337d [edwardt] Update README.md 7ff68fc [root] Merge branch 'master' of https://github.com/edwardt/SparkR-pkg 16353f5 [root] add links to devtools and install_github 513b9e5 [Shivaram Venkataraman] Merge pull request #72 from edwardt/master 31608a4 [edwardt] Update Makefile (style uniformity) 4ffe146 [root] Makefile: factor out SPARKR_VERSION to reduce potential copy&paste error; cp & rm called with -f in build/clean phase; .gitignore includes checkpoints and unit test log generated by run-tests.sh 715275f [Zongheng Yang] Merge pull request #68 from shivaram/master 90e2083 [Shivaram Venkataraman] Add return type to hasNext 8eb983d [Shivaram Venkataraman] Fix up comment 2206164 [Shivaram Venkataraman] 
Delete temporary files after they are read This change deletes temporary files used for communication between Rscript and the JVM once they have been completely read. 5881da7 [Zongheng Yang] Merge pull request #67 from shivaram/improve-shuffle 81251e2 [Shivaram Venkataraman] Address code review comments a5f573f [Shivaram Venkataraman] Use a better list append in shuffles This is helpful in scenarios where we have a large number of values in a bucket 388e64d [Shivaram Venkataraman] Merge pull request #55 from RevolutionAnalytics/master e1f95b6 [Zongheng Yang] Merge pull request #65 from concretevitamin/parallelize-fix fc1a71a [Zongheng Yang] Fix that collect(parallelize(sc,1:72,15)) drops elements. b8204c5 [Zongheng Yang] Minor: update a URL in README. 86f30c3 [Antonio Piccolboni] better fix for amplab-extras/SparkR-pkg#53 b3c318d [Antonio Piccolboni] delayed loading to have all namespaces available. f323e97 [Antonio Piccolboni] tentative fix for amplab-extras/SparkR-pkg#53 6f82269 [Zongheng Yang] Merge pull request #48 from shivaram/master 8f433e5 [Shivaram Venkataraman] Move up Hadoop in pom.xml and add back protobufs As Hadoop 1.0.4 doesn't use protobufs, we can't exclude protobufs from Spark always. This change tries to order the dependencies so that the shader first picks up Hadoop's protobufs over Mesos. 
bfe7e26 [Shivaram Venkataraman] Merge pull request #36 from RevolutionAnalytics/vectorize-examples 059ae41 [Antonio Piccolboni] and more formatting 9dbd531 [Antonio Piccolboni] more formatting per committer request 948738a [Antonio Piccolboni] converted tabs to spaces per project request 49f5f5a [Shivaram Venkataraman] Merge pull request #35 from shivaram/master 3eb5ad3 [Shivaram Venkataraman] on_failure -> after_failure in travis.yml 139bdee [Shivaram Venkataraman] Cache sbt, maven, ivy dependencies 4ebced2 [Shivaram Venkataraman] Merge pull request #34 from shivaram/master 8437061 [Shivaram Venkataraman] Exclude protobuf from Spark dependency in Maven This avoids pulling in multiple versions of protobuf from Mesos and Hadoop. 91aa527 [Antonio Piccolboni] vectorized version, 36s 10 slices 10^6 per slice. The older version takes 30 sec on 1/10th of data. f137a57 [Antonio Piccolboni] for rstudio users 1f7ffb0 [Antonio Piccolboni] implemented using matrices and vectorized calls wherever possible 46b23df [Antonio Piccolboni] replace require with library b15d7db [Antonio Piccolboni] faster parsing 8b7aeb3 [Antonio Piccolboni] 22x speed improvement, 3X mem improvement c5bce07 [Zongheng Yang] Merge pull request #30 from shivaram/string-tests 21fa2d8 [Shivaram Venkataraman] Fix bug where serialized was not changed for RRDD Reason: When an RRDD is created in getJRDD we have converted any possibly unserialized RDD to a serialized RDD. 9d1ea20 [Shivaram Venkataraman] Merge branch 'master' of github.com:amplab/SparkR-pkg into string-tests 7b9348c [Shivaram Venkataraman] Add tests for partition with string keys Add two tests one with a string array and one from a textFile to test both codepaths aacd726 [Shivaram Venkataraman] Update README with maven proxy instructions 803e62c [Shivaram Venkataraman] Merge pull request #28 from concretevitamin/master 7c093e6 [Zongheng Yang] Use inherits() to test an object's class. 
061c591 [Shivaram Venkataraman] Merge pull request #26 from hafen/master 90f9fda [Ryan Hafen] Fix isRdd() to properly check for class 5b10cc7 [Zongheng Yang] Merge pull request #24 from shivaram/master 7014f83 [Shivaram Venkataraman] Remove unused transformers in maven's pom.xml b00cea5 [Shivaram Venkataraman] Add support for a Maven build 11ec9b2 [Shivaram Venkataraman] Merge pull request #12 from concretevitamin/pipelined 6b18a90 [Zongheng Yang] Merge branch 'master' into pipelined 57127b8 [Zongheng Yang] Merge pull request #23 from shivaram/master 1ac3940 [Zongheng Yang] Review feedback. a06fb34 [Zongheng Yang] Remove outdated comment. 0a1fc13 [Shivaram Venkataraman] Fixes for using SparkR with Hadoop2. 1. Exclude ASM, Netty from Hadoop similar to Spark. 2. Concat services files to ensure HDFS filesystems work. 3. Update README with an example 9a1db44 [Zongheng Yang] Merge pull request #22 from shivaram/master e462448 [Shivaram Venkataraman] Use `$` for calling `put` instead of .jrcall ed4559a [Shivaram Venkataraman] Add support for passing Spark environment vars This change creates a new `createSparkContext` method in RRDD as we can't pass Map<String, String> through rJava. Also use SPARK_MEM in local mode to increase heap size and update the README with some examples. 10228fb [Shivaram Venkataraman] Merge pull request #20 from concretevitamin/digit-ex 1398d9f [Zongheng Yang] Add linear_solver_mnist to examples/. d484c2a [Zongheng Yang] Add tests for actions on PipelinedRDD. d9cb95c [Zongheng Yang] Add setCheckpointDir() to context.R; comment fix. f8bc8a9 [Zongheng Yang] Minor edits per Shivaram's comments. 8cd67f7 [Shivaram Venkataraman] Merge pull request #15 from shivaram/master d4468a9 [Shivaram Venkataraman] Remove trailing comma e2714b8 [Shivaram Venkataraman] Remove Apache Staging repo and update README 334eace [Zongheng Yang] Add a multi-transformation test to benchmark on pipelining. 
5650ad7 [Zongheng Yang] Put serialized field inside env for both RDD and PipelinedRDD. 0b9e8bb [Zongheng Yang] First cut at PipelinedRDD. a4c431e [Zongheng Yang] Add `isCheckpointed` field and checkpoint(). dac0795 [Zongheng Yang] Minor inline comment style fix. bfb8e26 [Zongheng Yang] Add isCached field (inside an env) and unpersist(). 295bff6 [Zongheng Yang] Merge pull request #11 from shivaram/master 4cb209c [Shivaram Venkataraman] Search rLibDir in worker before libPaths This ensures we pick up the SparkR intended and not an older version installed on the same machine ef198ff [Zongheng Yang] Merge pull request #10 from shivaram/unit-tests e0557a8 [Shivaram Venkataraman] Update travis to install plyr 8b18bc1 [Shivaram Venkataraman] Merge branch 'master' of github.com:amplab/SparkR-pkg into unit-tests 4a9ca31 [Shivaram Venkataraman] Use smaller broadcast and plyr instead of Matrix Matrix package takes around 2s to load and slows down unit tests. 21c6a61 [Zongheng Yang] Merge pull request #8 from shivaram/master 08c2947 [Shivaram Venkataraman] Move dev install directory to front of libPaths bda42ee [Shivaram Venkataraman] Merge pull request #7 from JoshRosen/travis cc5f5c0 [Josh Rosen] Add Travis CI integration (using craigcitro/r-travis) b6c864b [Shivaram Venkataraman] Merge pull request #6 from concretevitamin/env-style-fix 4fcef22 [Zongheng Yang] Use one style ($) for accessing names in environments. 
8a948c6 [Shivaram Venkataraman] Merge pull request #4 from shivaram/master 24978eb [Shivaram Venkataraman] Update README to use install_github 8899db4 [Shivaram Venkataraman] Update TODO.md 91792de [Shivaram Venkataraman] Update Spark requirements f34f4bf [Shivaram Venkataraman] Check tests for failures and output error msg cd750d3 [Shivaram Venkataraman] Update run-tests to use new path 1877b7c [Shivaram Venkataraman] Unset R_TESTS to make tests work with R CMD check Also silence Akka remoting logs and update Makefile to build on log4j changes e60e18a [Shivaram Venkataraman] Update README to remove Spark installation notes 4450189 [Shivaram Venkataraman] Add Spark 0.9 dependency from Apache Staging Also clean up assembly jar from inst on make clean 5eb2131 [Shivaram Venkataraman] Update repo path in README ec8210e [Shivaram Venkataraman] Remove broadcastId hack as it is public in Spark 9f0e080 [Shivaram Venkataraman] Merge branch 'install-github' 5c88fbd [Shivaram Venkataraman] Add helper script to run tests 77450a1 [Shivaram Venkataraman] Remove dependency on Spark Logging 6cb00d1 [Shivaram Venkataraman] Update README and add helper script install-dev.sh 28346ca [Shivaram Venkataraman] Only normalize if SPARK_HOME is not empty 0fd6571 [Shivaram Venkataraman] Normalize SPARK_HOME before passing it ff96d5c [Shivaram Venkataraman] Pass in SPARK_HOME and jar file path 34c4dce [Shivaram Venkataraman] Move src into pkg and update Makefile This enables the package to be installed using install_github using devtools and automates the build procedure. b25afed [Shivaram Venkataraman] Change package name to edu.berkeley.cs.amplab c691464 [Shivaram Venkataraman] Add Apache 2.0 License file 27a4a4b [Shivaram Venkataraman] Add notes on how to compile roxygen2 docs ca63844 [Shivaram Venkataraman] Add broadcast documentation Also generate documentation for sample, takeSample etc. 
e4dd976 [Shivaram Venkataraman] Update TODO.md e42d435 [Shivaram Venkataraman] Add support for broadcast variables 6b638e7 [Shivaram Venkataraman] Add the assembly jar to SparkContext bf24e32 [Shivaram Venkataraman] Merge branch 'master' of github.com:amplab/SparkR-pkg 43c05ce [Zongheng Yang] Fix a flaky/incorrect test for sampleRDD(). c6a9dfc [Zongheng Yang] Initial port of the kmeans example. 6885581 [Zongheng Yang] Implement element-level sampleRDD() and takeSample() with tests. d3a4987 [Zongheng Yang] Add a test for lapplyPartitionsWithIndex on pairwise RDD. c7899c1 [Zongheng Yang] Add lapplyPartitionsWithIndex, with a test and an alias function. a9a7436 [Shivaram Venkataraman] Add DFC example from Tselil, Benjamin and Jonah fbc5a95 [Zongheng Yang] Implement take() and takeSample(). c4a3409 [Shivaram Venkataraman] Use RDD instead of RRDD dfad3f5 [Zongheng Yang] Add test_utils.R: a unit test for convertJListToRList(). a45227d [Zongheng Yang] Update .gitignore. 238fe6e [Zongheng Yang] Add a unit test for textFile(). a88898b [Zongheng Yang] Rename test_rrd to test_rrdd 10c8baa [Shivaram Venkataraman] Make SparkR work as a standalone package. Changes include: 1. Adding a new `sbt` project that builds RRDD.scala 2. Change the onLoad functions to load the assembly jar for SparkR 3. Set rLibDir in RRDD.scala and worker.R to load things correctly 78adcd8 [Shivaram Venkataraman] Add a gitignore ca6108f [Shivaram Venkataraman] Merge branch 'SparkR-scalacode' of ../SparkR 999bd61 [Shivaram Venkataraman] Update collectPartition in R and use ClassTag c58f63e [Shivaram Venkataraman] Update collectPartition in R and use ClassTag 48265fd [Shivaram Venkataraman] Use new version of collectPartitions in take d4fe086 [Shivaram Venkataraman] Move collectPartitions to JavaRDDLike Also remove numPartitions in JavaRDD and update R code bfecd7b [Shivaram Venkataraman] Scala 2.10 changes 1. Update sparkR script 2. 
Use classTag instead of classManifest 092a4b3 [Shivaram Venkataraman] Add combineByKey, update TODO ac0d81d [Shivaram Venkataraman] Add more documentation d1dc3fa [Shivaram Venkataraman] Add more documentation c515e3a [Shivaram Venkataraman] Update TODO db56a34 [Shivaram Venkataraman] Add a test case for include package 41cea51 [Shivaram Venkataraman] Ensure all parent environments are serialized. Also add a test case with an inline function a978e84 [Shivaram Venkataraman] Add support to include packages in the worker 12bf8ce [Shivaram Venkataraman] Add support to include packages in the worker fb7e72c [Shivaram Venkataraman] Cleanup TODO 16ac314 [Shivaram Venkataraman] Add documentation for functions in context, sparkR 85b1d25 [Shivaram Venkataraman] Set license to Apache 88f1101 [Shivaram Venkataraman] Add unit test running instructions c40768e [Shivaram Venkataraman] Update TODO 0c7efbf [Shivaram Venkataraman] Refactor RRDD.scala and add comments to functions 5880d42 [Shivaram Venkataraman] Refactor RRDD.scala and add comments to functions 2dee36c [Shivaram Venkataraman] Remove empty test file a82219b [Shivaram Venkataraman] Update TODOs 5db00dc [Shivaram Venkataraman] Add reduceByKey, groupByKey and refactor shuffle Other ch…

This is a backport of apache@226d388

## What changes were proposed in this pull request?

This PR adds support for Hive UDFs that return fully typed Java Lists or Maps, for example `List<String>` or `Map<String, Integer>`. These structures may also be nested, for example `Map<String, List<Integer>>`. Raw collections and collections using wildcards are still not supported, and cannot be supported because they carry no element type information.

## How was this patch tested?

Modified existing tests in `HiveUDFSuite` and added test cases for raw collections and collections using wildcards.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes apache#218 from hvanhovell/SPARK-19548.
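The difference between fully typed, raw, and wildcard return types can be seen with plain Java reflection. The sketch below is not the Spark or Hive code: the `Udfs` class and its methods are hypothetical stand-ins for UDF `evaluate` methods, and `isFullyTyped` and `check` are illustrative helpers. It only shows, under those assumptions, why a raw `List` or a `List<?>` leaves nothing to derive an element type from.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;
import java.util.Map;

class ReturnTypeInfo {
    // Hypothetical stand-ins for UDF methods; these are not real Hive classes.
    static class Udfs {
        public List<String> typedList() { return null; }
        public Map<String, List<Integer>> nestedMap() { return null; }
        @SuppressWarnings("rawtypes")
        public List rawList() { return null; }
        public List<?> wildcardList() { return null; }
    }

    // True when the type, and every type argument nested inside it, is concrete.
    static boolean isFullyTyped(Type t) {
        if (t instanceof Class<?>) {
            // A generic class used without arguments (raw List) has no element type.
            return ((Class<?>) t).getTypeParameters().length == 0;
        }
        if (t instanceof ParameterizedType) {
            for (Type arg : ((ParameterizedType) t).getActualTypeArguments()) {
                if (!isFullyTyped(arg)) {
                    return false;
                }
            }
            return true;
        }
        // Wildcards, type variables and generic arrays: no usable element type.
        return false;
    }

    // Looks up one of the stand-in methods by name and inspects its return type.
    static boolean check(String methodName) {
        try {
            return isFullyTyped(Udfs.class.getDeclaredMethod(methodName).getGenericReturnType());
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("unknown method: " + methodName, e);
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"typedList", "nestedMap", "rawList", "wildcardList"}) {
            System.out.println(name + " fully typed: " + check(name));
        }
    }
}
```

In this sketch only `typedList` and `nestedMap` pass the check, which mirrors the supported and unsupported cases listed in the description.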

* MapR [SPARK-135] Spark 2.2 with MapR Streams (Kafka 1.0) Added functionality of MapR-Streams specific EOF handling.

* [SPARK-559] Parameterized the Makefile CF template URL, to allow different templates to be used with dcos_launch.
* Tests in strict mode: Permissions: added permission for drivers to launch tasks, updated user to 'nobody'. Updated options for installing Spark and running jobs.
* Moved away from old setup_permissions.sh script. Built upon sdk_security, added spark-specific permission and role. Added hdfs/kafka security setup.
* Fixed the configure_security fixture. Added separate service account and secret for spark.
* Marked 'test_marathon_group' as "xfail". It runs test_jar(), which is failing.
* Set a default "/spark" app name, explicitly encode the app name in granting permissions.
* Grant permission for the foldered spark service in test_marathon_group()
* Reverted setup_permissions.sh change
* (1) Restored test_marathon_group, now running sparkPi, (2) Removed mesos containerizer, need to set SPARK_USER in docker containerizer.

…cript K8S-1077 (apache#598) * K8S-1077 - use single k8s secret with user info MapR [SPARK-651] Replacing joda-time-*.jar with joda-time-2.10.3.jar. MapR [SPARK-638] Wrong permissions when creating files under directory with GID bit set. MapR [SPARK-627] SparkHistoryServer-2.4 is getting 403 Unauthorized home page for users(spark.ui.view.acls) via spark-submit MapR [SPARK-639] Default headers are adding two times MapR [SPARK-629] Spark UI for job lose CSS styles MapR [MS-925] After upgrade to MEP 6.2 (Spark 2.4.0) can no longer consume Kafka / MapR Streams. MapR [SPARK-626] Update kafka dependencies for Spark 2.4.4.0 in release MEP-6.3.0 MapR [SPARK-340] Jetty web server version at Spark should be updated tp v9.4.X MapR [SPARK-617] an't use ssl via spark beeline MapR [SPARK-617] Can't use ssl via spark beeline MapR [SPARK-620] Replace core dependency in Spark-2.4.4 MapR [SPARK-621] Fix multiple XML configuration initialization for (apache#575) custom headers. Use X-XSS-Protection, X-Content-Type-Options Content-Security-Policy and Strict-Transport-Security configuration only in case: cluster security is enabled OR spark.ui.security.headers.enabled set to true. MapR [SPARK-595] Spark cannot access hs2 through zookeeper Revert "MapR [SPARK-595] Spark cannot access hs2 through zookeeper (apache#577)" MapR [SPARK-595] Spark cannot access hs2 through zookeeper MapR [SPARK-620] Replace core dependency in Spark-2.4. MapR [SPARK-619] Move absent commits from 2.4.3 branch to 2.4.4 (apache#574) * Adding SQL API to write to kafka from Spark (apache#567) * Branch 2.4.3 extended kafka and examples (apache#569) * The v2 API is in its own package - the v2 api is in a different package - the old functionality is available in a separated package * v2 API examples - All the examples are using the newest API. - I have removed the old examples since they are not relevant any more and the same functionality is shown in the new examples usin the new API. 
* MapR [SPARK-619] Move absent commits from 2.4.3 branch to 2.4.4 CORE-321. Add custom http header support for jetty. MapR [SPARK-609] Port Apache Spark-2.4.4 changes to the MapR Spark-2.4.4 branch Adding multi table loader (apache#560) * Adding multi table loader - This allows us to load multiple matching tables into one Union DataFrame. If we have the fallowing MFS structure: ``` /clients/client_1/data.table /clients/client_2/data.table ``` we can load a union dataframe by doing `loadFromMapRDB("/clients/*/*.table")` * Fixing the path to the reader MapR [SPARK-588] Spark thriftserver fails when work with hive-maprdb json table MapR [SPARK-598] Spark can't add needed properties to hive-site.xml MAPR-SPARK-596: Change HBase compatible version for Spark 2.4.3 MapR [SPARK-592] Add possibility to use start-thriftserver.sh script with 2304 port MapR [SPARK-584] MaprDB connector's setHintUsingIndex method doesn't work as expected MapR [SPARK-583] MaprDB connector's loadFromMaprDB function for Java API doesn't work as expected SPARK-579 info about ssl_trustore is added for metrics MapR [SPARK-552] Failed to get broadcast_11_piece0 of broadcast_11 SPARK-569 Generation of SSL ceritificates for spark UI MapR [SPARK-575] Warning messages in spark workspace after the second attempt to login to job's UI Update zookeeper version Adding `joinWithMapRDBTable` function (apache#529) The related documentation of this function is here https://github.com/anicolaspp/MapRDBConnector#joinwithmaprdbtable. The main idea is that having a dataframe (no matter how was it constructed) we can join it with a MapR-DB table. This functions looks at the join query and load only those records from MapR-DB that will join instead of loading the full table and then join in memory. In other words, we only load what we know will be joint. 
Adding DataSource Reader Support (apache#525) * Adding DataSource Reader Support * Update SparkSessionExt.scala * creating a package object * Update MapRDBSpark.scala * fully path to avoid name collition * refactorings MapR [SPARK-451] Spark hadoop/core dependency updates MapR [SPARK-566] Move absent commits from 2.4.0 branch MapR [SPARK-561] Spark 2.4.3 porting to MapR MapR [SPARK-561] Spark 2.4.3 porting to MapR MapR [SPARK-558] Render application UI init page if driver is not up MapR [SPARK-541] Avoid duplication of the first unexpired record MapR [COLD-150][K8S] Fix metrics copy MapR [K8S-893] Hide plain text password from logs MapR [SPARK-540] Include 'avro' artifacts MapR [SPARK-536] PySpark streaming package for kafka-0-10 added K8S-853: Enable spark metrics for external tenant MapR [SPARK-531] Remove duplicating entries from classpath in ClasspathFilter MapR [SPARK-516] Spark jobs failure using yarn mode on kerberos fixed MapR [SPARK-462] Spark and SparkHistoryServer allow week ciphers, which can allow man in the middle attack [SPARK-508] MapR-DB OJAI Connector for Spark isNull condition returns incorrect result MapR [SPARK-510] nonmapr "admin" users not able to view other user logs in SHS SPARK-460: Spark Metrics for CollectD Configuration for collecting Spark metrics SPARK-463 MAPR_MAVEN_REPO variable for specifying mapR repository MapR [SPARK-492] Spark 2.4.0.0 configure.sh has error messages MapR [SPARK-515][K8S] Remove configure.sh call for k8s MapR [SPARK-515] Move configuring spark-env.sh back to the private-pkg MapR [SPARK-515] Move configuring spark-env.sh back to the private-pkg MapR [SPARK-514] Recovery from checkpoint is broken MapR [SPARK-445] Messages loss fixed by reverting [MAPR-32290] changes from kafka09 package (apache#460) * MapR [SPARK-445] Revert "[MAPR-32290] Spark processing offsets when messages are already TTL in the first batch (apache#376)" This reverts commit e8d59b9. 
* MapR [SPARK-445] Revert "[MAPR-32290] Spark processing offsets when messages are already ttl in first batch (apache#368)" This reverts commit b282a8b. MapR [SPARK-445] Messages loss fixed by reverting [MAPR-32290] changes from kafka10 package MapR [SPARK-469] Fix NPE in generated classes by reverting "[SPARK-23466][SQL] Remove redundant null checks in generated Java code by GenerateUnsafeProjection" (apache#455) This reverts commit c5583fd. MapR [SPARK-482] Spark streaming app fails to start by UnknownTopicOrPartitionException with checkpoint MapR [SPARK-496] Spark HS UI doesn't work MapR [SPARK-416] CVE-2018-1320 vulnerability in Apache Thrift MapR [SPARK-486][K8S] Fix sasl encryption error on Kubernetes MapR [SPARK-481] Cannot run spark configure.sh on Client node MapR [K8S-637][K8S] Add configure.sh configuration in spark-defaults.conf for job runtime MapR [SPARK-465] Error messages after update of spark 2.4 MapR [SPARK-465] Error messages after update of spark 2.4 MapR [SPARK-464] Can't submit spark 2.4 jobs from mapr-client [SPARK-466] SparkR errors fixed MapR [SPARK-456] Spark shell can't be started SPARK-417 impersonation fixes for spark executor. Impersonation is mo… (apache#433) * SPARK-417 impersonation fixes for spark executor. 
Impersonation is moved from HadoopRDD.compute() method to org.apache.spark.executor.Executor.run() method * SPARK-363 Hive version changed to '1.2.0-mapr-spark-MEP-6.0.0' [SPARK-449] Kafka offset commit issue fixed MapR [SPARK-287] Move logic of creating /apps/spark folder from installer's scripts to the configure.sh MapR [SPARK-221] Investigate possibility to move creating of the spark-env.sh from private-pkg to configure.sh MapR [SPARK-430] PID files should be under /opt/mapr/pid MapR [SPARK-446] Spark configure.sh doesn't start/stop Spark services MapR [SPARK-434] Move absent commits from 2.3.2 branch (apache#425) * MapR [SPARK-352] Spark shell fails with "NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream" if java is not available in PATH * MapR [SPARK-350] Deprecate Spark Kafka-09 package * MapR [SPARK-326] Investigate possibility of writing Java example for the MapRDB OJAI connector * [SPARK-356] Merge mapr changes from kafka-09 package into the kafka-10 * SPARK-319 Fix for sparkR version check * MapR [SPARK-349] Update OJAI client to v3 for Spark MapR-DB JSON connector * MapR [SPARK-367] Move absent commits from 2.3.1 branch * MapR [SPARK-137] Analyze the warning during compilation of OJAI connector * MapR [SPARK-369] Spark 2.3.2 fails with error related to zookeeper * [MAPR-26258] hbasecontext.HBaseDistributedScanExample fails * [SPARK-24355] Spark external shuffle server improvement to better handle block fetch requests * MapR [SPARK-374] Spark Hive example fails when we submit job from another(simple) cluster user * MapR [SPARK-434] Move absent commits from 2.3.2 branch * MapR [SPARK-434] Move absent commits from 2.3.2 branch * MapR [SPARK-373] Unexpected behavior during job running in standalone cluster mode * MapR [SPARK-419] Update hive-maprdb-json-handler jar for spark 2.3.2.0 and spark 2.2.1 * MapR [SPARK-396] Interface change of sendToKafka * MapR [SPARK-357] consumer groups are prepeneded with a "service_" prefix * MapR [SPARK-429] 
Changes in maprdb connector are the cause of broken backward compatibility * MapR [SPARK-427] Update kafka in Spark-2.4.0 to the 1.1.1-mapr * MapR [SPARK-434] Move absent commits from 2.3.2 branch * Move absent commits from 2.3.2 branch * MapR [SPARK-434] Move absent commits from 2.3.2 branch * Move absent commits from 2.3.2 branch * Move absent commits from 2.3.2 branch MapR [SPARK-427] Update kafka in Spark-2.4.0 to the 1.1.1-mapr MapR [SPARK-379] Spark 2.4 4-gidit version MapR [PIC-48][K8S] Port k8s changes to 2.4.0 [PIC-48] Create user for k8s driver and executor if required [PIC-48] Create user for k8s driver and executor if required Revert "Remove spark.ui.filters property" This reverts commit d8941ba36c3451cdce15d18d6c1a52991de3b971. [SPARK-351] Copy kubernetes start scripts anyway PIC-34: Rename default configmap name to be consistent with mapr-kubernetes [SPARK-23668][K8S] Add config option for passing through k8s Pod.spec.imagePullSecrets (apache#355) Pass through the `imagePullSecrets` option to the k8s pod in order to allow user to access private image registries. See https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/ Unit tests + manual testing. Manual testing procedure: 1. Have private image registry. 2. Spark-submit application with no `spark.kubernetes.imagePullSecret` set. Do `kubectl describe pod ...`. See the error message: ``` Error syncing pod, skipping: failed to "StartContainer" for "spark-kubernetes-driver" with ErrImagePull: "rpc error: code = 2 desc = Error: Status 400 trying to pull repository ...: \"{\\n \\\"errors\\\" : [ {\\n \\\"status\\\" : 400,\\n \\\"message\\\" : \\\"Unsupported docker v1 repository request for '...'\\\"\\n } ]\\n}\"" ``` 3. Create secret `kubectl create secret docker-registry ...` 4. Spark-submit with `spark.kubernetes.imagePullSecret` set to the new secret. See that deployment was successful. 
Author: Andrew Korzhuev <andrew.korzhuev@klarna.com> Author: Andrew Korzhuev <korzhuev@andrusha.me> Closes apache#20811 from andrusha/spark-23668-image-pull-secrets. [SPARK-321] Change default value of spark.mapr.ssl.secret.prefix property [PIC-32] Spark on k8s with MapR secure cluster Update entrypoint.sh with correct spark version (apache#340) This PR has minor fix to correct the spark version string [SPARK-274] Create home directory for user who submitted job [MAPR-SPARK-230] Implement security for Spark on Kubernetes Run Spark job with specify the username for driver and executor Read cluster configs from configMap Run configure.sh script form entrypoint.sh Remove spark.kubernetes.driver.pod.commands property Add Spark properties for executor and driver environment variable MapR [SPARK-296] Structured Streaming memory leak Revert "[MAPR-SPARK-210] Rename sprk-defaults.conf to spark-defaults.conf.tem…" (apache#252) * Revert "[MAPR-SPARK-176] Fix Spark Project Catalyst unit tests (apache#251)" This reverts commit 5de05075cd14abf8ac65046a57a5d76617818fbe. * Revert "[MAPR-SPARK-210] Rename sprk-defaults.conf to spark-defaults.conf.template (apache#249)" This reverts commit 1baa677d727e89db7c605ffbae9a9eba00337ad0. [MAPR-SPARK-210] Rename sprk-defaults.conf to spark-defaults.conf.template MapR [SPARK-379] Port Spark to 2.4.0 MapR [SPARK-341] Spark 2.3.2 porting [MAPR-32290] Spark processing offsets when messages are already TTL in the first batch * Bug 32263 - Seek called on unsubscribed partitions [MSPARK-331] Remove snapshot versions of mapr dependencies from Spark-2.3.1 [MAPR-32290] Spark processing offsets when messages are already ttl in first batch MapR [SPARK-325] Add examples for work with the MapRDB JSON connector into the Spark project [ATS-449] Unit test for EBF 32013 created. 
MAPR-SPARK-311: Spark beeline uses default ssl truststore instead of mapr ssl truststore Bug 32355 - Executor tab empty on Spark UI [SPARK-318] Submitting Spark jobs from Oozie fails due to ClassNotFoundException Bug 32014 - Spark Consumer fails with java.lang.AssertionError Revert "[SPARK-306] Kafka clients 1.0.1 present in jars directory for Spark 2.3.1" (apache#341) * Revert "[SPARK-306] Kafka clients 1.0.1 present in jars directory for Spark 2.3.1 (apache#335)" This reverts commit 832411e. Bug 32014 - Spark Consumer fails with java.lang.AssertionError (apache#326) (apache#336) * MapR [32014] Spark Consumer fails with java.lang.AssertionError [SPARK-306] Kafka clients 1.0.1 present in jars directory for Spark 2.3.1 DEVOPS-2768 temporarily removed curl for file downloading [SPARK-302] Local privilege escalation MapR [SPARK-297] Added unit test for empty value conversion MapR [SPARK-297] Empty values are loaded as non-null MapR [SPARK-296] Structured Streaming memory leak 2.3.1 spark 289 (apache#318) * MapR [SPARK-289] Fix unit test for Spark-2.3.1 [SPARK-130] MapRDB connector - NPE while saving Pair RDD with 'null' values MapR [SPARK-283] Unit tests fail during initialization SSL properties. [SPARK-212] SparkHiveExample fails when we run it twice MapR [SPARK-282] Remove maprfs and hadoop jars from mapr spark package MapR [SPARK-278] Spark submit fails for jobs with python MapR [SPARK-279] Can't connect to spark thrift server with new spark and hive packages MapR [SPARK-276] Update zookeeper dependency to v.3.4.11 for spark 2.3.1 MapR [SPARK-272] Use only client passwords from ssl-client.xml MapR [SPARK-266] Spark jobs can't finish correctly, when there is an error during job running MapR [SPARK-263] Add possibility to use keyPassword which is different from keyStorePassword [MSPARK-31632] RM UI showing broken page for Spark jobs MapR [SPARK-261] Use mapr-security-web for getting passwords. 
MapR [SPARK-259] Spark application doesn't finish correctly MapR [SPARK-268] Update Spark version for Warden change project version to 2.3.1-mapr-SNAPSHOT MapR [SPARK-256] Spark doesn't work on yarn mode MapR [SPARK-255] Installer fresh install 610/600 secure fails to start "mapr-spark-thriftserver", "mapr-spark-historyserver" Mapr [SPARK-248] MapRDBTableScanRDD fails to convert to Scala Dataframe when using where clause MapR [SPARK-225] Hadoop credentials provider usage for hiding passwords at spark-defaults MapR [SPARK-214] Hive-2.1 poperties can't be read from a hive-site.xml as Spark uses Hive-1.2 MapR [SPARK-216] Spark thriftserver fails when work with hive-maprdb json table SPARK-244 (apache#278) Provide ability to use MapR-Negotiation authentication for Spark HistoryServer MapR [SPARK-226] Spark - pySpark Security Vulnerability MapR [SPARK-220] SparkR fails with UDF functions bug fixed MapR [SPARK-227] KafkaUtils.createDirectStream fails with kafka-09 MapR [SPARK-183] Spark Integration for Kafka 0.10 unit tests disabled MapR [SPARK-182] Spark Project External Kafka Producer v09 unit tests fixed MapR [SPARK-179] Spark Integration for Kafka 0.9 unit tests fixed MapR [SPARK-181] Kafka 0.10 Structured Streaming unit tests fixed [MSPARK-31305] Spark History server NOT loading applications submitted by users other than 'mapr' MapR [SPARK-175] Fix Spark Project Streaming unit tests [MAPR-SPARK-176] Fix Spark Project Catalyst unit tests [MAPR-SPARK-178] Fix Spark Project Hive unit tests MapR [SPARK-174] Spark Core unit tests fixed Changed version for spark-kafka connector. 
MapR [SPARK-202] Update MapR Spark to 2.3.0 Fixed compile time errors in tests Change project version [SPARK-198] Update hadoop dependency version to 2.7.0-mapr-1803 for Spark 2.2.1 MapR [SPARK-188] Couldn't connect to thrift server via spark beeline on kerberos cluster MapR [SPARK-143] Spark History Server does not require login for secured-by-default clusters MapR [SPARK-186] Update OJAI versions to the latest for Spark-2.2.1 OJAI Connector MapR [SPARK-191] Incorrect work of MapR-DB Sink 'complete' and 'update' modes fixed MapR [SPARK-170] StackOverflowException in equals method in DBMapValue 2.2.1 build fixed (apache#231) * MapR [SPARK-164] Update Kafka version to 1.0.1-mapr in Spark Kafka Producer module MapR [SPARK-161] Include Kafka Structured streaming jar to Spark package. MapR [SPARK-155] Change Spark Master port from 8080 MapR [SPARK-153] Exception in spark job with configured labels on yarn-client mode MapR [SPARK-152] Incorrect date string parsing fixed MapR [SPARK-21] Structured Streaming MapR-DB Sink created MapR [SPARK-135] Spark 2.2 with MapR Streams ( Kafka 1.0) (apache#218) * MapR [SPARK-135] Spark 2.2 with MapR Streams (Kafka 1.0) Added functionality of MapR-Streams specific EOF handling. MapR [SPARK-143] Spark History Server does not require login for secured-by-default clusters Disable build failing if scalastyle checking is fall. MapR [SPARK-16] Change Spark version in Warden files and configure.sh MapR [SPARK-144] Add insertToMapRDB method for rdd for Java API [MAPR-30536] Spark SQL queries on Map column fails after upgrade MapR [SPARK-139] Remove "update" related APIs from connector MapR [SPARK-140] Change the option name "tableName" to "tablePath" in the Spark/MapR-DB connectors. 
MapR [SPARK-121] Spark OJAI JAVA: update functionality removed MapR [SPARK-118] Spark OJAI Python: missed DataFrame import while moving imports in order to fix MapR [ZEP-101] interpreter issue MapR [SPARK-118] Spark OJAI Python: move MapR DB Connector class importing in order to fix MapR [ZEP-101] interpreter issue MapR [SPARK-117] Spark OJAI Python: Save functionality implementation MapR [SPARK-131] Exception when try to save JSON table with Binary _id field Spark OJAI JAVA: load to RDD, save from RDD implementation (apache#195) * MapR [SPARK-124] Loading to JavaRDD implemented * MapR [SPARK-124] MapRDBJavaSparkContext constructor changed * MapR [SPARK-124] implemented RDD[Row] saving MapR [SPARK-118] Spark OJAI Python: Read implementation MapR [SPARK-128] MapRDB connector - wrong handle of null fields when nullable is false * MapR [SPARK-121] Spark OJAI JAVA: Read to Dataset functionality implementation * Minor refactoring MapR [SPARK-125] Default value of idFieldPath parameter is not handle MapR [SPARK-113] Hit java.lang.UnsupportedOperationException: empty.reduceLeft during loadFromMapRDB Spark Mapr-DB connector was refactored according to Scala style Removed code duplication [MSPARK-107]idField information is lost in MapRDBDataFrameWriterFunctions.saveToMapRDB configure.sh takes options to change ports Kafka client excluded from package because correct version is located in "mapr classpath" Changed Kafka version in Kafka producer module. Branch spark 69 (apache#170) * Fixing the wrong type casting of TimeStamp to OTimeStamp when read from spark dataFrame. * SPARK-69: Problem with license when we try to read from json and write to maprdb remove creatin /usr/local/spark link from configure.sh. This link will be creates by private-pkg remove include-maprdb from default profiles added profiles in maprdb pom file instead of two pom files Fixed maprdb connector dependencies. Fixing the wrong type casting of TimeStamp to OTimeStamp when read from spark dataFrame. 
changed port for spark-thriftserver as it conflicts with hive server changed port for spark-thriftserver as it conflicts with hive server remove .not_configured_yet file after success Ojai connector fixed required java version [MSPARK-45] Move Spark-OJAI connector code to Spark github repo (apache#132) * SPARK-45 Move Spark-OJAI connector code to Spark github repo * Fixing pom versions for maprdb spark connector. * Changes made to the connector code to be compatible with 5.2.* and 6.0 clients. Spark 2.1.0 mapr 29106 (apache#150) * [SPARK-20922][CORE] Add whitelist of classes that can be deserialized by the launcher. Blindly deserializing classes using Java serialization opens the code up to issues in other libraries, since just deserializing data from a stream may end up execution code (think readObject()). Since the launcher protocol is pretty self-contained, there's just a handful of classes it legitimately needs to deserialize, and they're in just two packages, so add a filter that throws errors if classes from any other package show up in the stream. This also maintains backwards compatibility (the updated launcher code can still communicate with the backend code in older Spark releases). Tested with new and existing unit tests. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#18166 from vanzin/SPARK-20922. (cherry picked from commit 8efc6e9) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 772a9b9) * [SPARK-20922][CORE][HOTFIX] Don't use Java 8 lambdas in older branches. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#18178 from vanzin/SPARK-20922-hotfix. 
Added security by default for historyserver use waitForConsumerAssignment() instead of consumer.poll(0) for spark-29052 change MAPR_HADOOP_CLASSPATH in configure.sh for creating it by mapr-classpath.sh change MAPR_HADOOP_CLASSPATH in configure.sh for creating it by mapr-classpath.sh changes for mapr-classpath.sh changes for mapr-classpath.sh configure.sh changes [SPARK-39] Classpath filter was added Fixed impersonation when data read from MapR-DB via Spark-Hive. added configure.sh and warden.spark-thriftserver.conf hive-hbase-handler added to Spark jars Fixed "Single message comes late" 28339 bug fixed Spark streaming skipped message with zero offset from Kafka 0.9 [MSPARK-9] Initial fix for Spark unit tests Bump dependencies after ECO-1703 release [SPARK-33] Streaming example fixed [MAPR-26060] Fixed case when mapr-streams make gaps in offsets ported features from kafka 10 to kafka 9 [MAPR-26289][SPARK-2.1] Streaming general improvements (apache#93) * Added include-kafka-09 profile to Assembly * Set default poll timeout to 120s Set default HBase verison to 1.1.8 Changes from Kafka10 package were ported to Kafka09 package. [MAPR-26053] Include MapR Classes to the default value of spark.sql.hive.metastore.sharedPrefixes [MAPR-25807] Spark-Warehouse path computes incorrectly Add MapR-SASL support for Thrift Server Adding scala library. [MAPR-25713] Spark might try to load MapR Class Loader multiple times and fail [MAPR-25311] Bump Spark dependencies after ECO-1611 release [MINOR] Fix spark-jars.sh script [MAPR-24603] Could not launch beeline shell after starting spark thrift server fixed syntax error in V09DirectKafkaWordCount example Spark 2.0.1 MAPR-streams Python API [MAPR-24415] SPARK_JAVA_OPTS is deprecated Kafka streaming producer added. 
Minor fix for previous commit Added script for MAPR-24374 Some minor changes to spark-defaults.conf Changed default HBase version to 1.1.1 in compatibility.version Streaming example was refactored [MAPR-24470] HiveFromSpark test fails in yarn-cluster mode Added MapR Repo [MAPR-22940] Failed to connect spark beeline (after spark thrift server is started) on Kerberos cluster [MAPR-18865] Unable to submit spark apps from Windows client Skip maven clean task on the parent module New: Issue with running Hive commands in Spark This is fixed in SPARK-7819 Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error Spark warden.services.conf should have dependency on cldb Remove DFS shuffle settings. These settings are not used right now. Copy every file in the conf directory into the distribution package. Create spark-defaults.conf for MapR Settings to enable DFS shuffle on MapR. Support hbase classpath computation in util script. Adding external conf and scripts. Enable SPARK_HIVE mode while building. This is needed to bundle datanucleus jars needed for hive table creation. Build Spark on MapR. - make-distribution.sh takes an environment variable to enable profiles - MVN_PROFILE_ARG - Added warden conf files under ext-conf. - Updated pom.xml to use right set of jars and version. Spark Master failed to start in HA mode Updated Apache Curator version Added spark streaming integration with kafka 0.9 and mapr-streams Added MapR Repo

### What changes were proposed in this pull request?

This PR fixes `InlineCTE`'s idempotence. E.g. the following query:

```
WITH
  x(r) AS (SELECT random()),
  y(r) AS (SELECT * FROM x),
  z(r) AS (SELECT * FROM x)
SELECT * FROM z
```

currently breaks it because we take into account the reference to `x` from `y` when deciding about not inlining `x` in the first round:

```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.InlineCTE ===
 WithCTE                                                        WithCTE
 :- CTERelationDef 0, false                                     :- CTERelationDef 0, false
 :  +- Project [rand()#218 AS r#219]                            :  +- Project [rand()#218 AS r#219]
 :     +- Project [random(2957388522017368375) AS rand()#218]   :     +- Project [random(2957388522017368375) AS rand()#218]
 :        +- OneRowRelation                                     :        +- OneRowRelation
!:- CTERelationDef 1, false                                     +- Project [r#222]
!:  +- Project [r#219 AS r#221]                                    +- Project [r#220 AS r#222]
!:     +- Project [r#219]                                             +- Project [r#220]
!:        +- CTERelationRef 0, true, [r#219]                             +- CTERelationRef 0, true, [r#220]
!:- CTERelationDef 2, false
!:  +- Project [r#220 AS r#222]
!:     +- Project [r#220]
!:        +- CTERelationRef 0, true, [r#220]
!+- Project [r#222]
!   +- CTERelationRef 2, true, [r#222]
```

But in the next round we inline `x` because `y` was removed due to lack of references:

```
Once strategy's idempotence is broken for batch Inline CTE
!WithCTE                                                        Project [r#222]
!:- CTERelationDef 0, false                                     +- Project [r#220 AS r#222]
!:  +- Project [rand()#218 AS r#219]                               +- Project [r#220]
!:     +- Project [random(2957388522017368375) AS rand()#218]         +- Project [r#225 AS r#220]
!:        +- OneRowRelation                                              +- Project [rand()#218 AS r#225]
!+- Project [r#222]                                                         +- Project [random(2957388522017368375) AS rand()#218]
!   +- Project [r#220 AS r#222]                                                +- OneRowRelation
!      +- Project [r#220]
!         +- CTERelationRef 0, true, [r#220]
```

### Why are the changes needed?

We use `InlineCTE` as an idempotent rule in the `Optimizer`, `CheckAnalysis` and `ProgressReporter`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added new UT.

Closes #40856 from peter-toth/SPARK-43199-make-inlinecte-idempotent.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
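The failure mode described in this commit message can be reproduced with a toy model. The sketch below is not Spark's `InlineCTE`. The `Cte`, `Plan`, `countRefs`, `applyOnce` and `isIdempotent` names are all hypothetical, and it assumes a simplified rule: drop a CTE with no references, inline one that is deterministic or referenced exactly once, and keep the rest. The `reachableOnly` flag illustrates one possible remedy, which is to ignore references that come from CTEs the main query never reaches. The commit message does not say this is how the actual fix works.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class InlineCteSketch {
    static final class Cte {
        final boolean deterministic;
        final List<String> refs; // names of the CTEs this CTE's body reads from
        Cte(boolean deterministic, List<String> refs) {
            this.deterministic = deterministic;
            this.refs = refs;
        }
    }

    static final class Plan {
        final Map<String, Cte> defs;  // sorted by name so describe() is stable
        final List<String> mainRefs;  // CTEs read by the main query
        Plan(Map<String, Cte> defs, List<String> mainRefs) {
            this.defs = defs;
            this.mainRefs = mainRefs;
        }
        String describe() {
            StringBuilder sb = new StringBuilder();
            for (Map.Entry<String, Cte> e : defs.entrySet()) {
                sb.append(e.getKey()).append(e.getValue().refs).append(' ');
            }
            return sb.append("main").append(mainRefs).toString();
        }
    }

    // Counts references per CTE, either from everywhere or only from live parts of the plan.
    static Map<String, Integer> countRefs(Plan p, boolean reachableOnly) {
        Map<String, Integer> counts = new HashMap<>();
        if (!reachableOnly) {
            for (String r : p.mainRefs) counts.merge(r, 1, Integer::sum);
            for (Cte c : p.defs.values()) {
                for (String r : c.refs) counts.merge(r, 1, Integer::sum);
            }
            return counts;
        }
        Deque<String> work = new ArrayDeque<>(p.mainRefs);
        while (!work.isEmpty()) {
            String name = work.pop();
            // On the first visit the body of this CTE becomes live, so follow its references.
            if (counts.merge(name, 1, Integer::sum) == 1) work.addAll(p.defs.get(name).refs);
        }
        return counts;
    }

    // Replaces references to removed CTEs with the references of their bodies.
    static List<String> expand(List<String> refs, Plan p, Set<String> kept) {
        List<String> out = new ArrayList<>();
        for (String r : refs) {
            if (kept.contains(r)) out.add(r);
            else out.addAll(expand(p.defs.get(r).refs, p, kept));
        }
        return out;
    }

    // One application of the simplified rule.
    static Plan applyOnce(Plan p, boolean reachableOnly) {
        Map<String, Integer> counts = countRefs(p, reachableOnly);
        Map<String, Cte> kept = new TreeMap<>();
        for (Map.Entry<String, Cte> e : p.defs.entrySet()) {
            int n = counts.getOrDefault(e.getKey(), 0);
            boolean inline = e.getValue().deterministic || n == 1;
            if (n > 0 && !inline) kept.put(e.getKey(), e.getValue());
        }
        Map<String, Cte> newDefs = new TreeMap<>();
        for (Map.Entry<String, Cte> e : kept.entrySet()) {
            Cte c = e.getValue();
            newDefs.put(e.getKey(), new Cte(c.deterministic, expand(c.refs, p, kept.keySet())));
        }
        return new Plan(newDefs, expand(p.mainRefs, p, kept.keySet()));
    }

    // A rule is idempotent on a plan if applying it twice equals applying it once.
    static boolean isIdempotent(Plan p, boolean reachableOnly) {
        Plan once = applyOnce(p, reachableOnly);
        Plan twice = applyOnce(once, reachableOnly);
        return once.describe().equals(twice.describe());
    }

    // x(r) AS (SELECT random()), y reads x, z reads x, main query reads z.
    static Plan example() {
        Map<String, Cte> defs = new TreeMap<>();
        defs.put("x", new Cte(false, Collections.<String>emptyList()));
        defs.put("y", new Cte(false, Arrays.asList("x")));
        defs.put("z", new Cte(false, Arrays.asList("x")));
        return new Plan(defs, Arrays.asList("z"));
    }

    public static void main(String[] args) {
        System.out.println("count all references:       idempotent = " + isIdempotent(example(), false));
        System.out.println("count reachable references: idempotent = " + isIdempotent(example(), true));
    }
}
```

When every reference is counted, the first pass keeps `x` because `y` still points at it, and the second pass inlines it once `y` is gone. Counting only reachable references gives the same plan on both passes.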


…cript K8S-1077 (apache#598) * K8S-1077 - use single k8s secret with user info MapR [SPARK-651] Replacing joda-time-*.jar with joda-time-2.10.3.jar. MapR [SPARK-638] Wrong permissions when creating files under directory with GID bit set. MapR [SPARK-627] SparkHistoryServer-2.4 is getting 403 Unauthorized home page for users(spark.ui.view.acls) via spark-submit MapR [SPARK-639] Default headers are adding two times MapR [SPARK-629] Spark UI for job lose CSS styles MapR [MS-925] After upgrade to MEP 6.2 (Spark 2.4.0) can no longer consume Kafka / MapR Streams. MapR [SPARK-626] Update kafka dependencies for Spark 2.4.4.0 in release MEP-6.3.0 MapR [SPARK-340] Jetty web server version at Spark should be updated tp v9.4.X MapR [SPARK-617] an't use ssl via spark beeline MapR [SPARK-617] Can't use ssl via spark beeline MapR [SPARK-620] Replace core dependency in Spark-2.4.4 MapR [SPARK-621] Fix multiple XML configuration initialization for (apache#575) custom headers. Use X-XSS-Protection, X-Content-Type-Options Content-Security-Policy and Strict-Transport-Security configuration only in case: cluster security is enabled OR spark.ui.security.headers.enabled set to true. MapR [SPARK-595] Spark cannot access hs2 through zookeeper Revert "MapR [SPARK-595] Spark cannot access hs2 through zookeeper (apache#577)" MapR [SPARK-595] Spark cannot access hs2 through zookeeper MapR [SPARK-620] Replace core dependency in Spark-2.4. MapR [SPARK-619] Move absent commits from 2.4.3 branch to 2.4.4 (apache#574) * Adding SQL API to write to kafka from Spark (apache#567) * Branch 2.4.3 extended kafka and examples (apache#569) * The v2 API is in its own package - the v2 api is in a different package - the old functionality is available in a separated package * v2 API examples - All the examples are using the newest API. - I have removed the old examples since they are not relevant any more and the same functionality is shown in the new examples usin the new API. 
* MapR [SPARK-619] Move absent commits from 2.4.3 branch to 2.4.4 CORE-321. Add custom http header support for jetty. MapR [SPARK-609] Port Apache Spark-2.4.4 changes to the MapR Spark-2.4.4 branch Adding multi table loader (apache#560) * Adding multi table loader - This allows us to load multiple matching tables into one Union DataFrame. If we have the fallowing MFS structure: ``` /clients/client_1/data.table /clients/client_2/data.table ``` we can load a union dataframe by doing `loadFromMapRDB("/clients/*/*.table")` * Fixing the path to the reader MapR [SPARK-588] Spark thriftserver fails when work with hive-maprdb json table MapR [SPARK-598] Spark can't add needed properties to hive-site.xml MAPR-SPARK-596: Change HBase compatible version for Spark 2.4.3 MapR [SPARK-592] Add possibility to use start-thriftserver.sh script with 2304 port MapR [SPARK-584] MaprDB connector's setHintUsingIndex method doesn't work as expected MapR [SPARK-583] MaprDB connector's loadFromMaprDB function for Java API doesn't work as expected SPARK-579 info about ssl_trustore is added for metrics MapR [SPARK-552] Failed to get broadcast_11_piece0 of broadcast_11 SPARK-569 Generation of SSL ceritificates for spark UI MapR [SPARK-575] Warning messages in spark workspace after the second attempt to login to job's UI Update zookeeper version Adding `joinWithMapRDBTable` function (apache#529) The related documentation of this function is here https://github.com/anicolaspp/MapRDBConnector#joinwithmaprdbtable. The main idea is that having a dataframe (no matter how was it constructed) we can join it with a MapR-DB table. This functions looks at the join query and load only those records from MapR-DB that will join instead of loading the full table and then join in memory. In other words, we only load what we know will be joint. 
Adding DataSource Reader Support (apache#525) * Adding DataSource Reader Support * Update SparkSessionExt.scala * creating a package object * Update MapRDBSpark.scala * fully path to avoid name collition * refactorings MapR [SPARK-451] Spark hadoop/core dependency updates MapR [SPARK-566] Move absent commits from 2.4.0 branch MapR [SPARK-561] Spark 2.4.3 porting to MapR MapR [SPARK-561] Spark 2.4.3 porting to MapR MapR [SPARK-558] Render application UI init page if driver is not up MapR [SPARK-541] Avoid duplication of the first unexpired record MapR [COLD-150][K8S] Fix metrics copy MapR [K8S-893] Hide plain text password from logs MapR [SPARK-540] Include 'avro' artifacts MapR [SPARK-536] PySpark streaming package for kafka-0-10 added K8S-853: Enable spark metrics for external tenant MapR [SPARK-531] Remove duplicating entries from classpath in ClasspathFilter MapR [SPARK-516] Spark jobs failure using yarn mode on kerberos fixed MapR [SPARK-462] Spark and SparkHistoryServer allow week ciphers, which can allow man in the middle attack [SPARK-508] MapR-DB OJAI Connector for Spark isNull condition returns incorrect result MapR [SPARK-510] nonmapr "admin" users not able to view other user logs in SHS SPARK-460: Spark Metrics for CollectD Configuration for collecting Spark metrics SPARK-463 MAPR_MAVEN_REPO variable for specifying mapR repository MapR [SPARK-492] Spark 2.4.0.0 configure.sh has error messages MapR [SPARK-515][K8S] Remove configure.sh call for k8s MapR [SPARK-515] Move configuring spark-env.sh back to the private-pkg MapR [SPARK-515] Move configuring spark-env.sh back to the private-pkg MapR [SPARK-514] Recovery from checkpoint is broken MapR [SPARK-445] Messages loss fixed by reverting [MAPR-32290] changes from kafka09 package (apache#460) * MapR [SPARK-445] Revert "[MAPR-32290] Spark processing offsets when messages are already TTL in the first batch (apache#376)" This reverts commit e8d59b9. 
* MapR [SPARK-445] Revert "[MAPR-32290] Spark processing offsets when messages are already ttl in first batch (apache#368)" This reverts commit b282a8b. MapR [SPARK-445] Messages loss fixed by reverting [MAPR-32290] changes from kafka10 package MapR [SPARK-469] Fix NPE in generated classes by reverting "[SPARK-23466][SQL] Remove redundant null checks in generated Java code by GenerateUnsafeProjection" (apache#455) This reverts commit c5583fd. MapR [SPARK-482] Spark streaming app fails to start by UnknownTopicOrPartitionException with checkpoint MapR [SPARK-496] Spark HS UI doesn't work MapR [SPARK-416] CVE-2018-1320 vulnerability in Apache Thrift MapR [SPARK-486][K8S] Fix sasl encryption error on Kubernetes MapR [SPARK-481] Cannot run spark configure.sh on Client node MapR [K8S-637][K8S] Add configure.sh configuration in spark-defaults.conf for job runtime MapR [SPARK-465] Error messages after update of spark 2.4 MapR [SPARK-465] Error messages after update of spark 2.4 MapR [SPARK-464] Can't submit spark 2.4 jobs from mapr-client [SPARK-466] SparkR errors fixed MapR [SPARK-456] Spark shell can't be started SPARK-417 impersonation fixes for spark executor. Impersonation is mo… (apache#433) * SPARK-417 impersonation fixes for spark executor. 
Impersonation is moved from HadoopRDD.compute() method to org.apache.spark.executor.Executor.run() method * SPARK-363 Hive version changed to '1.2.0-mapr-spark-MEP-6.0.0' [SPARK-449] Kafka offset commit issue fixed MapR [SPARK-287] Move logic of creating /apps/spark folder from installer's scripts to the configure.sh MapR [SPARK-221] Investigate possibility to move creating of the spark-env.sh from private-pkg to configure.sh MapR [SPARK-430] PID files should be under /opt/mapr/pid MapR [SPARK-446] Spark configure.sh doesn't start/stop Spark services MapR [SPARK-434] Move absent commits from 2.3.2 branch (apache#425) * MapR [SPARK-352] Spark shell fails with "NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream" if java is not available in PATH * MapR [SPARK-350] Deprecate Spark Kafka-09 package * MapR [SPARK-326] Investigate possibility of writing Java example for the MapRDB OJAI connector * [SPARK-356] Merge mapr changes from kafka-09 package into the kafka-10 * SPARK-319 Fix for sparkR version check * MapR [SPARK-349] Update OJAI client to v3 for Spark MapR-DB JSON connector * MapR [SPARK-367] Move absent commits from 2.3.1 branch * MapR [SPARK-137] Analyze the warning during compilation of OJAI connector * MapR [SPARK-369] Spark 2.3.2 fails with error related to zookeeper * [MAPR-26258] hbasecontext.HBaseDistributedScanExample fails * [SPARK-24355] Spark external shuffle server improvement to better handle block fetch requests * MapR [SPARK-374] Spark Hive example fails when we submit job from another(simple) cluster user * MapR [SPARK-434] Move absent commits from 2.3.2 branch * MapR [SPARK-434] Move absent commits from 2.3.2 branch * MapR [SPARK-373] Unexpected behavior during job running in standalone cluster mode * MapR [SPARK-419] Update hive-maprdb-json-handler jar for spark 2.3.2.0 and spark 2.2.1 * MapR [SPARK-396] Interface change of sendToKafka * MapR [SPARK-357] consumer groups are prepeneded with a "service_" prefix * MapR [SPARK-429] 
Changes in maprdb connector are the cause of broken backward compatibility * MapR [SPARK-427] Update kafka in Spark-2.4.0 to the 1.1.1-mapr * MapR [SPARK-434] Move absent commits from 2.3.2 branch * Move absent commits from 2.3.2 branch * MapR [SPARK-434] Move absent commits from 2.3.2 branch * Move absent commits from 2.3.2 branch * Move absent commits from 2.3.2 branch MapR [SPARK-427] Update kafka in Spark-2.4.0 to the 1.1.1-mapr MapR [SPARK-379] Spark 2.4 4-gidit version MapR [PIC-48][K8S] Port k8s changes to 2.4.0 [PIC-48] Create user for k8s driver and executor if required [PIC-48] Create user for k8s driver and executor if required Revert "Remove spark.ui.filters property" This reverts commit d8941ba36c3451cdce15d18d6c1a52991de3b971. [SPARK-351] Copy kubernetes start scripts anyway PIC-34: Rename default configmap name to be consistent with mapr-kubernetes [SPARK-23668][K8S] Add config option for passing through k8s Pod.spec.imagePullSecrets (apache#355) Pass through the `imagePullSecrets` option to the k8s pod in order to allow user to access private image registries. See https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/ Unit tests + manual testing. Manual testing procedure: 1. Have private image registry. 2. Spark-submit application with no `spark.kubernetes.imagePullSecret` set. Do `kubectl describe pod ...`. See the error message: ``` Error syncing pod, skipping: failed to "StartContainer" for "spark-kubernetes-driver" with ErrImagePull: "rpc error: code = 2 desc = Error: Status 400 trying to pull repository ...: \"{\\n \\\"errors\\\" : [ {\\n \\\"status\\\" : 400,\\n \\\"message\\\" : \\\"Unsupported docker v1 repository request for '...'\\\"\\n } ]\\n}\"" ``` 3. Create secret `kubectl create secret docker-registry ...` 4. Spark-submit with `spark.kubernetes.imagePullSecret` set to the new secret. See that deployment was successful. 
Author: Andrew Korzhuev <andrew.korzhuev@klarna.com> Author: Andrew Korzhuev <korzhuev@andrusha.me> Closes apache#20811 from andrusha/spark-23668-image-pull-secrets. [SPARK-321] Change default value of spark.mapr.ssl.secret.prefix property [PIC-32] Spark on k8s with MapR secure cluster Update entrypoint.sh with correct spark version (apache#340) This PR has minor fix to correct the spark version string [SPARK-274] Create home directory for user who submitted job [MAPR-SPARK-230] Implement security for Spark on Kubernetes Run Spark job with specify the username for driver and executor Read cluster configs from configMap Run configure.sh script form entrypoint.sh Remove spark.kubernetes.driver.pod.commands property Add Spark properties for executor and driver environment variable MapR [SPARK-296] Structured Streaming memory leak Revert "[MAPR-SPARK-210] Rename sprk-defaults.conf to spark-defaults.conf.tem…" (apache#252) * Revert "[MAPR-SPARK-176] Fix Spark Project Catalyst unit tests (apache#251)" This reverts commit 5de05075cd14abf8ac65046a57a5d76617818fbe. * Revert "[MAPR-SPARK-210] Rename sprk-defaults.conf to spark-defaults.conf.template (apache#249)" This reverts commit 1baa677d727e89db7c605ffbae9a9eba00337ad0. [MAPR-SPARK-210] Rename sprk-defaults.conf to spark-defaults.conf.template MapR [SPARK-379] Port Spark to 2.4.0 MapR [SPARK-341] Spark 2.3.2 porting [MAPR-32290] Spark processing offsets when messages are already TTL in the first batch * Bug 32263 - Seek called on unsubscribed partitions [MSPARK-331] Remove snapshot versions of mapr dependencies from Spark-2.3.1 [MAPR-32290] Spark processing offsets when messages are already ttl in first batch MapR [SPARK-325] Add examples for work with the MapRDB JSON connector into the Spark project [ATS-449] Unit test for EBF 32013 created. 
MAPR-SPARK-311: Spark beeline uses default ssl truststore instead of mapr ssl truststore Bug 32355 - Executor tab empty on Spark UI [SPARK-318] Submitting Spark jobs from Oozie fails due to ClassNotFoundException Bug 32014 - Spark Consumer fails with java.lang.AssertionError Revert "[SPARK-306] Kafka clients 1.0.1 present in jars directory for Spark 2.3.1" (apache#341) * Revert "[SPARK-306] Kafka clients 1.0.1 present in jars directory for Spark 2.3.1 (apache#335)" This reverts commit 832411e. Bug 32014 - Spark Consumer fails with java.lang.AssertionError (apache#326) (apache#336) * MapR [32014] Spark Consumer fails with java.lang.AssertionError [SPARK-306] Kafka clients 1.0.1 present in jars directory for Spark 2.3.1 DEVOPS-2768 temporarily removed curl for file downloading [SPARK-302] Local privilege escalation MapR [SPARK-297] Added unit test for empty value conversion MapR [SPARK-297] Empty values are loaded as non-null MapR [SPARK-296] Structured Streaming memory leak 2.3.1 spark 289 (apache#318) * MapR [SPARK-289] Fix unit test for Spark-2.3.1 [SPARK-130] MapRDB connector - NPE while saving Pair RDD with 'null' values MapR [SPARK-283] Unit tests fail during initialization SSL properties. [SPARK-212] SparkHiveExample fails when we run it twice MapR [SPARK-282] Remove maprfs and hadoop jars from mapr spark package MapR [SPARK-278] Spark submit fails for jobs with python MapR [SPARK-279] Can't connect to spark thrift server with new spark and hive packages MapR [SPARK-276] Update zookeeper dependency to v.3.4.11 for spark 2.3.1 MapR [SPARK-272] Use only client passwords from ssl-client.xml MapR [SPARK-266] Spark jobs can't finish correctly, when there is an error during job running MapR [SPARK-263] Add possibility to use keyPassword which is different from keyStorePassword [MSPARK-31632] RM UI showing broken page for Spark jobs MapR [SPARK-261] Use mapr-security-web for getting passwords. 
MapR [SPARK-259] Spark application doesn't finish correctly MapR [SPARK-268] Update Spark version for Warden change project version to 2.3.1-mapr-SNAPSHOT MapR [SPARK-256] Spark doesn't work on yarn mode MapR [SPARK-255] Installer fresh install 610/600 secure fails to start "mapr-spark-thriftserver", "mapr-spark-historyserver" Mapr [SPARK-248] MapRDBTableScanRDD fails to convert to Scala Dataframe when using where clause MapR [SPARK-225] Hadoop credentials provider usage for hiding passwords at spark-defaults MapR [SPARK-214] Hive-2.1 poperties can't be read from a hive-site.xml as Spark uses Hive-1.2 MapR [SPARK-216] Spark thriftserver fails when work with hive-maprdb json table SPARK-244 (apache#278) Provide ability to use MapR-Negotiation authentication for Spark HistoryServer MapR [SPARK-226] Spark - pySpark Security Vulnerability MapR [SPARK-220] SparkR fails with UDF functions bug fixed MapR [SPARK-227] KafkaUtils.createDirectStream fails with kafka-09 MapR [SPARK-183] Spark Integration for Kafka 0.10 unit tests disabled MapR [SPARK-182] Spark Project External Kafka Producer v09 unit tests fixed MapR [SPARK-179] Spark Integration for Kafka 0.9 unit tests fixed MapR [SPARK-181] Kafka 0.10 Structured Streaming unit tests fixed [MSPARK-31305] Spark History server NOT loading applications submitted by users other than 'mapr' MapR [SPARK-175] Fix Spark Project Streaming unit tests [MAPR-SPARK-176] Fix Spark Project Catalyst unit tests [MAPR-SPARK-178] Fix Spark Project Hive unit tests MapR [SPARK-174] Spark Core unit tests fixed Changed version for spark-kafka connector. 
MapR [SPARK-202] Update MapR Spark to 2.3.0 Fixed compile time errors in tests Change project version [SPARK-198] Update hadoop dependency version to 2.7.0-mapr-1803 for Spark 2.2.1 MapR [SPARK-188] Couldn't connect to thrift server via spark beeline on kerberos cluster MapR [SPARK-143] Spark History Server does not require login for secured-by-default clusters MapR [SPARK-186] Update OJAI versions to the latest for Spark-2.2.1 OJAI Connector MapR [SPARK-191] Incorrect work of MapR-DB Sink 'complete' and 'update' modes fixed MapR [SPARK-170] StackOverflowException in equals method in DBMapValue 2.2.1 build fixed (apache#231) * MapR [SPARK-164] Update Kafka version to 1.0.1-mapr in Spark Kafka Producer module MapR [SPARK-161] Include Kafka Structured streaming jar to Spark package. MapR [SPARK-155] Change Spark Master port from 8080 MapR [SPARK-153] Exception in spark job with configured labels on yarn-client mode MapR [SPARK-152] Incorrect date string parsing fixed MapR [SPARK-21] Structured Streaming MapR-DB Sink created MapR [SPARK-135] Spark 2.2 with MapR Streams ( Kafka 1.0) (apache#218) * MapR [SPARK-135] Spark 2.2 with MapR Streams (Kafka 1.0) Added functionality of MapR-Streams specific EOF handling. MapR [SPARK-143] Spark History Server does not require login for secured-by-default clusters Disable build failing if scalastyle checking is fall. MapR [SPARK-16] Change Spark version in Warden files and configure.sh MapR [SPARK-144] Add insertToMapRDB method for rdd for Java API [MAPR-30536] Spark SQL queries on Map column fails after upgrade MapR [SPARK-139] Remove "update" related APIs from connector MapR [SPARK-140] Change the option name "tableName" to "tablePath" in the Spark/MapR-DB connectors. 
MapR [SPARK-121] Spark OJAI JAVA: update functionality removed MapR [SPARK-118] Spark OJAI Python: missed DataFrame import while moving imports in order to fix MapR [ZEP-101] interpreter issue MapR [SPARK-118] Spark OJAI Python: move MapR DB Connector class importing in order to fix MapR [ZEP-101] interpreter issue MapR [SPARK-117] Spark OJAI Python: Save functionality implementation MapR [SPARK-131] Exception when try to save JSON table with Binary _id field Spark OJAI JAVA: load to RDD, save from RDD implementation (apache#195) * MapR [SPARK-124] Loading to JavaRDD implemented * MapR [SPARK-124] MapRDBJavaSparkContext constructor changed * MapR [SPARK-124] implemented RDD[Row] saving MapR [SPARK-118] Spark OJAI Python: Read implementation MapR [SPARK-128] MapRDB connector - wrong handle of null fields when nullable is false * MapR [SPARK-121] Spark OJAI JAVA: Read to Dataset functionality implementation * Minor refactoring MapR [SPARK-125] Default value of idFieldPath parameter is not handle MapR [SPARK-113] Hit java.lang.UnsupportedOperationException: empty.reduceLeft during loadFromMapRDB Spark Mapr-DB connector was refactored according to Scala style Removed code duplication [MSPARK-107]idField information is lost in MapRDBDataFrameWriterFunctions.saveToMapRDB configure.sh takes options to change ports Kafka client excluded from package because correct version is located in "mapr classpath" Changed Kafka version in Kafka producer module. Branch spark 69 (apache#170) * Fixing the wrong type casting of TimeStamp to OTimeStamp when read from spark dataFrame. * SPARK-69: Problem with license when we try to read from json and write to maprdb remove creatin /usr/local/spark link from configure.sh. This link will be creates by private-pkg remove include-maprdb from default profiles added profiles in maprdb pom file instead of two pom files Fixed maprdb connector dependencies. Fixing the wrong type casting of TimeStamp to OTimeStamp when read from spark dataFrame. 
changed port for spark-thriftserver as it conflicts with hive server changed port for spark-thriftserver as it conflicts with hive server remove .not_configured_yet file after success Ojai connector fixed required java version [MSPARK-45] Move Spark-OJAI connector code to Spark github repo (apache#132) * SPARK-45 Move Spark-OJAI connector code to Spark github repo * Fixing pom versions for maprdb spark connector. * Changes made to the connector code to be compatible with 5.2.* and 6.0 clients. Spark 2.1.0 mapr 29106 (apache#150) * [SPARK-20922][CORE] Add whitelist of classes that can be deserialized by the launcher. Blindly deserializing classes using Java serialization opens the code up to issues in other libraries, since just deserializing data from a stream may end up execution code (think readObject()). Since the launcher protocol is pretty self-contained, there's just a handful of classes it legitimately needs to deserialize, and they're in just two packages, so add a filter that throws errors if classes from any other package show up in the stream. This also maintains backwards compatibility (the updated launcher code can still communicate with the backend code in older Spark releases). Tested with new and existing unit tests. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#18166 from vanzin/SPARK-20922. (cherry picked from commit 8efc6e9) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 772a9b9) * [SPARK-20922][CORE][HOTFIX] Don't use Java 8 lambdas in older branches. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#18178 from vanzin/SPARK-20922-hotfix. 
Added security by default for historyserver use waitForConsumerAssignment() instead of consumer.poll(0) for spark-29052 change MAPR_HADOOP_CLASSPATH in configure.sh for creating it by mapr-classpath.sh change MAPR_HADOOP_CLASSPATH in configure.sh for creating it by mapr-classpath.sh changes for mapr-classpath.sh changes for mapr-classpath.sh configure.sh changes [SPARK-39] Classpath filter was added Fixed impersonation when data read from MapR-DB via Spark-Hive. added configure.sh and warden.spark-thriftserver.conf hive-hbase-handler added to Spark jars Fixed "Single message comes late" 28339 bug fixed Spark streaming skipped message with zero offset from Kafka 0.9 [MSPARK-9] Initial fix for Spark unit tests Bump dependencies after ECO-1703 release [SPARK-33] Streaming example fixed [MAPR-26060] Fixed case when mapr-streams make gaps in offsets ported features from kafka 10 to kafka 9 [MAPR-26289][SPARK-2.1] Streaming general improvements (apache#93) * Added include-kafka-09 profile to Assembly * Set default poll timeout to 120s Set default HBase verison to 1.1.8 Changes from Kafka10 package were ported to Kafka09 package. [MAPR-26053] Include MapR Classes to the default value of spark.sql.hive.metastore.sharedPrefixes [MAPR-25807] Spark-Warehouse path computes incorrectly Add MapR-SASL support for Thrift Server Adding scala library. [MAPR-25713] Spark might try to load MapR Class Loader multiple times and fail [MAPR-25311] Bump Spark dependencies after ECO-1611 release [MINOR] Fix spark-jars.sh script [MAPR-24603] Could not launch beeline shell after starting spark thrift server fixed syntax error in V09DirectKafkaWordCount example Spark 2.0.1 MAPR-streams Python API [MAPR-24415] SPARK_JAVA_OPTS is deprecated Kafka streaming producer added. 
Minor fix for previous commit Added script for MAPR-24374 Some minor changes to spark-defaults.conf Changed default HBase version to 1.1.1 in compatibility.version Streaming example was refactored [MAPR-24470] HiveFromSpark test fails in yarn-cluster mode Added MapR Repo [MAPR-22940] Failed to connect spark beeline (after spark thrift server is started) on Kerberos cluster [MAPR-18865] Unable to submit spark apps from Windows client Skip maven clean task on the parent module New: Issue with running Hive commands in Spark This is fixed in SPARK-7819 Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error Spark warden.services.conf should have dependency on cldb Remove DFS shuffle settings. These settings are not used right now. Copy every file in the conf directory into the distribution package. Create spark-defaults.conf for MapR Settings to enable DFS shuffle on MapR. Support hbase classpath computation in util script. Adding external conf and scripts. Enable SPARK_HIVE mode while building. This is needed to bundle datanucleus jars needed for hive table creation. Build Spark on MapR. - make-distribution.sh takes an environment variable to enable profiles - MVN_PROFILE_ARG - Added warden conf files under ext-conf. - Updated pom.xml to use right set of jars and version. Spark Master failed to start in HA mode Updated Apache Curator version Added spark streaming integration with kafka 0.9 and mapr-streams Added MapR Repo

nchammas reviewed Mar 25, 2014

nchammas added 3 commits April 3, 2014 16:56

Merge remote-tracking branch 'upstream/master'

d8682ec

Merge remote-tracking branch 'upstream/master'

88fa762

Rebasing fork from source.

Changed partitions() to getNumPartitions()

ab4467f

Change the definition of this method per Patrick’s comments [here](#218).
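
As a rough illustration of the change discussed in this thread, the sketch below returns a plain integer instead of handing a Java object back to Python, and carries a doctest as requested in review. It assumes, as in the original diff, that the wrapped Java RDD exposes `splits()`. `FakeJavaRDD` and `SketchRDD` are hypothetical stand-ins so the example runs without a Spark installation; they are not PySpark classes, and the merged implementation may differ.

```python
import doctest


class FakeJavaRDD:
    """Hypothetical stand-in for the Py4J-wrapped Java RDD held in `_jrdd`."""

    def __init__(self, num_partitions):
        # A real JavaRDD would return Java partition objects here.
        self._splits = [object() for _ in range(num_partitions)]

    def splits(self):
        return self._splits


class SketchRDD:
    """Hypothetical, heavily reduced model of a PySpark RDD wrapper."""

    def __init__(self, jrdd):
        self._jrdd = jrdd

    def getNumPartitions(self):
        """
        Returns the number of partitions in this RDD.

        >>> SketchRDD(FakeJavaRDD(2)).getNumPartitions()
        2
        """
        # Only the count crosses back into Python, never the Java objects.
        return len(self._jrdd.splits())


if __name__ == "__main__":
    doctest.testmod()
```

Returning the count keeps the Python API free of Py4J proxies, which is the concern raised in review, and the `get...` prefix matches the other getters mentioned in the naming discussion.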

nchammas closed this Apr 22, 2014

nchammas changed the title ~~[SPARK-1308] Add partitions() method to PySpark RDDs~~ [SPARK-1308] [PySpark] Add partitions() method to PySpark RDDs Aug 30, 2014

davies pushed a commit to davies/spark that referenced this pull request Mar 16, 2015

Merge pull request apache#218 from davies/merge

f5d3355

[SPARKR-225] Merge master into sparkr-sql branch

arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020

MapR [SPARK-135] Spark 2.2 with MapR Streams ( Kafka 1.0) (apache#218)

caeafa0

* MapR [SPARK-135] Spark 2.2 with MapR Streams (Kafka 1.0) Added functionality of MapR-Streams specific EOF handling.


nchammas commented Mar 24, 2014

AmplabJenkins commented Mar 25, 2014

nchammas Mar 25, 2014

pwendell commented Mar 25, 2014

nchammas commented Mar 25, 2014

pwendell commented Mar 25, 2014

AmplabJenkins commented Mar 28, 2014

nchammas commented Apr 22, 2014

nchammas commented Apr 22, 2014
