From 487710b79ffbed8eccbd531531149a838eb2fbb2 Mon Sep 17 00:00:00 2001
From: Michal Senkyr
Date: Mon, 5 Dec 2016 22:16:23 +0100
Subject: [PATCH 1/4] [SPARK-18723][DOC] Expanded programming guide information on wholeTextFiles

## What changes were proposed in this pull request?

Add additional information to wholeTextFiles in the Programming Guide. Also explain partitioning policy difference in relation to textFile and its impact on performance. Also added reference to the underlying CombineFileInputFormat

## How was this patch tested?

Manual build of documentation and inspection in browser

```
cd docs
SKIP_API=1 jekyll serve --watch
```

Author: Michal Senkyr
---
 docs/programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 4267b8cae8110..25fa561b9d204 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -347,7 +347,7 @@ Some notes on reading files with Spark:
 
 Apart from text files, Spark's Scala API also supports several other data formats:
 
-* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file.
+* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file. It takes an optional second argument for controlling the minimal number of partitions (by default this is 2). It uses [CombineFileInputFormat](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html) internally in order to process large numbers of small files effectively by grouping files on the same node into a single split. (This can lead to non-optimal partitioning. It is therefore advisable to set the minimal number of partitions explicitly.)
 
 * For [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html), use SparkContext's `sequenceFile[K, V]` method where `K` and `V` are the types of key and values in the file. These should be subclasses of Hadoop's [Writable](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Writable.html) interface, like [IntWritable](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/IntWritable.html) and [Text](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html). In addition, Spark allows you to specify native types for a few common Writables; for example, `sequenceFile[Int, String]` will automatically read IntWritables and Texts.
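The behavior the hunk above documents — `textFile` yielding one record per line versus `wholeTextFiles` yielding one (filename, content) pair per file, with an optional minimal number of partitions — looks roughly like the following sketch. The input path, master URL, and partition count are illustrative placeholders, not part of the patch.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WholeTextFilesSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder master and app name; any real deployment would differ.
    val conf = new SparkConf().setMaster("local[*]").setAppName("wholeTextFiles-sketch")
    val sc = new SparkContext(conf)

    // One String record per line across all files under the (hypothetical) directory.
    val lines = sc.textFile("/data/many-small-files")

    // One (filename, content) pair per file; the optional second argument
    // controls the minimal number of partitions (2 if omitted).
    val files = sc.wholeTextFiles("/data/many-small-files", minPartitions = 16)

    println(s"line records: ${lines.count()}, file records: ${files.count()}")
    sc.stop()
  }
}
```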
From cb6e002434d332f81d552ba17c2f296e264b2d34 Mon Sep 17 00:00:00 2001
From: Michal Senkyr
Date: Mon, 5 Dec 2016 23:58:04 +0100
Subject: [PATCH 2/4] [SPARK-18723][DOC] Expanded programming guide information on wholeTextFiles

---
 docs/programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 25fa561b9d204..2d8de32cfc157 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -347,7 +347,7 @@ Some notes on reading files with Spark:
 
 Apart from text files, Spark's Scala API also supports several other data formats:
 
-* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file. It takes an optional second argument for controlling the minimal number of partitions (by default this is 2). It uses [CombineFileInputFormat](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html) internally in order to process large numbers of small files effectively by grouping files on the same node into a single split. (This can lead to non-optimal partitioning. It is therefore advisable to set the minimal number of partitions explicitly.)
+* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file. It takes an optional second argument for controlling the minimal number of partitions (by default this is 2). It uses [CombineFileInputFormat](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html) internally in order to process large numbers of small files effectively by grouping files on the same executor into a single partition. (This can lead to non-optimal partitioning. It is therefore advisable to set the minimal number of partitions explicitly.)
 
 * For [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html), use SparkContext's `sequenceFile[K, V]` method where `K` and `V` are the types of key and values in the file. These should be subclasses of Hadoop's [Writable](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Writable.html) interface, like [IntWritable](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/IntWritable.html) and [Text](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html). In addition, Spark allows you to specify native types for a few common Writables; for example, `sequenceFile[Int, String]` will automatically read IntWritables and Texts.
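Patch 2 clarifies that files co-located on one executor are combined into a single partition, which is why the guide recommends setting the minimal number of partitions explicitly. A short sketch of checking and overriding that behavior follows; an existing `SparkContext` named `sc`, the input path, and the partition count are assumptions for illustration only.

```scala
// Partition count produced by the default grouping of small files.
val grouped = sc.wholeTextFiles("/data/many-small-files")
println(s"default partitions: ${grouped.getNumPartitions}")

// Request a higher minimum up front rather than repartitioning (and reshuffling) afterwards.
val split = sc.wholeTextFiles("/data/many-small-files", minPartitions = 64)
println(s"partitions with explicit minimum: ${split.getNumPartitions}")
```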
From d76c6c5038fc7130461741614b4c46c86c8e8f8d Mon Sep 17 00:00:00 2001
From: Michal Senkyr
Date: Wed, 7 Dec 2016 23:15:35 +0100
Subject: [PATCH 3/4] [SPARK-18723][DOC] Expanded programming guide information on wholeTextFiles

---
 core/src/main/scala/org/apache/spark/SparkContext.scala | 6 ++++++
 docs/programming-guide.md                               | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala
index be4dae19df5f9..4cccabf07a6a2 100644
--- a/core/src/main/scala/org/apache/spark/SparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -851,6 +851,9 @@ class SparkContext(config: SparkConf) extends Logging {
    * @note Small files are preferred, large file is also allowable, but may cause bad performance.
    * @note On some filesystems, `.../path/*` can be a more efficient way to read all files
    *       in a directory rather than `.../path/` or `.../path`
+   * @note Uses Hadoop's CombineFileInputFormat to merge multiple files into partitions by
+   *       executors on which they reside, which may cause sub-optimal partitioning in certain
+   *       situations. Set minPartitions in those cases
    *
    * @param path Directory to the input data files, the path can be comma separated paths as the
    *             list of inputs.
@@ -900,6 +903,9 @@ class SparkContext(config: SparkConf) extends Logging {
    * @note Small files are preferred; very large files may cause bad performance.
    * @note On some filesystems, `.../path/*` can be a more efficient way to read all files
    *       in a directory rather than `.../path/` or `.../path`
+   * @note Uses Hadoop's CombineFileInputFormat to merge multiple files into partitions by
+   *       executors on which they reside, which may cause sub-optimal partitioning in certain
+   *       situations. Set minPartitions in those cases
    *
    * @param path Directory to the input data files, the path can be comma separated paths as the
    *             list of inputs.
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 2d8de32cfc157..1f67337a976ff 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -347,7 +347,7 @@ Some notes on reading files with Spark:
 
 Apart from text files, Spark's Scala API also supports several other data formats:
 
-* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file. It takes an optional second argument for controlling the minimal number of partitions (by default this is 2). It uses [CombineFileInputFormat](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html) internally in order to process large numbers of small files effectively by grouping files on the same executor into a single partition. (This can lead to non-optimal partitioning. It is therefore advisable to set the minimal number of partitions explicitly.)
+* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file. It takes an optional second argument for controlling the minimal number of partitions (by default this is 2). It uses [CombineFileInputFormat](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html) internally in order to process large numbers of small files effectively by grouping files on the same executor into a single partition. This can lead to sub-optimal partitioning when the file sets would benefit from residing in multiple partitions (e.g., larger partitions would not fit in memory, files are replicated but a large subset is locally reachable from a single executor, subsequent transformations would benefit from multi-core processing). In those cases, set the `minPartitions` argument to enforce splitting.
 
 * For [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html), use SparkContext's `sequenceFile[K, V]` method where `K` and `V` are the types of key and values in the file. These should be subclasses of Hadoop's [Writable](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Writable.html) interface, like [IntWritable](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/IntWritable.html) and [Text](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html). In addition, Spark allows you to specify native types for a few common Writables; for example, `sequenceFile[Int, String]` will automatically read IntWritables and Texts.

From 1fff423c2dfc58057af005830180524886ebd99c Mon Sep 17 00:00:00 2001
From: Michal Senkyr
Date: Thu, 15 Dec 2016 12:06:55 +0100
Subject: [PATCH 4/4] Reworded wholeTextFiles documentation expansion and scaladoc note

---
 .../src/main/scala/org/apache/spark/SparkContext.scala | 10 ++++------
 docs/programming-guide.md                              |  2 +-
 2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala
index 4cccabf07a6a2..736ebd79c5748 100644
--- a/core/src/main/scala/org/apache/spark/SparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -851,9 +851,8 @@ class SparkContext(config: SparkConf) extends Logging {
    * @note Small files are preferred, large file is also allowable, but may cause bad performance.
    * @note On some filesystems, `.../path/*` can be a more efficient way to read all files
    *       in a directory rather than `.../path/` or `.../path`
-   * @note Uses Hadoop's CombineFileInputFormat to merge multiple files into partitions by
-   *       executors on which they reside, which may cause sub-optimal partitioning in certain
-   *       situations. Set minPartitions in those cases
+   * @note Partitioning is determined by data locality. This may result in too few partitions
+   *       by default.
    *
    * @param path Directory to the input data files, the path can be comma separated paths as the
    *             list of inputs.
@@ -903,9 +902,8 @@ class SparkContext(config: SparkConf) extends Logging {
    * @note Small files are preferred; very large files may cause bad performance.
    * @note On some filesystems, `.../path/*` can be a more efficient way to read all files
    *       in a directory rather than `.../path/` or `.../path`
-   * @note Uses Hadoop's CombineFileInputFormat to merge multiple files into partitions by
-   *       executors on which they reside, which may cause sub-optimal partitioning in certain
-   *       situations. Set minPartitions in those cases
+   * @note Partitioning is determined by data locality. This may result in too few partitions
+   *       by default.
    *
    * @param path Directory to the input data files, the path can be comma separated paths as the
    *             list of inputs.
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 1f67337a976ff..8633a4bc3eb33 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -347,7 +347,7 @@ Some notes on reading files with Spark:
 
 Apart from text files, Spark's Scala API also supports several other data formats:
 
-* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file. It takes an optional second argument for controlling the minimal number of partitions (by default this is 2). It uses [CombineFileInputFormat](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html) internally in order to process large numbers of small files effectively by grouping files on the same executor into a single partition. This can lead to sub-optimal partitioning when the file sets would benefit from residing in multiple partitions (e.g., larger partitions would not fit in memory, files are replicated but a large subset is locally reachable from a single executor, subsequent transformations would benefit from multi-core processing). In those cases, set the `minPartitions` argument to enforce splitting.
+* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file. Partitioning is determined by data locality which, in some cases, may result in too few partitions. For those cases, `wholeTextFiles` provides an optional second argument for controlling the minimal number of partitions.
 
 * For [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html), use SparkContext's `sequenceFile[K, V]` method where `K` and `V` are the types of key and values in the file. These should be subclasses of Hadoop's [Writable](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Writable.html) interface, like [IntWritable](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/IntWritable.html) and [Text](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html). In addition, Spark allows you to specify native types for a few common Writables; for example, `sequenceFile[Int, String]` will automatically read IntWritables and Texts.
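The unchanged bullet that appears as context in every hunk also describes `sequenceFile[K, V]` and its native-type shorthand. For completeness, a minimal sketch of that usage as the guide describes it; the output path is a placeholder and an existing `SparkContext` named `sc` is assumed.

```scala
// Write a small (Int, String) RDD as a Hadoop SequenceFile (IntWritable/Text under the hood).
val pairs = sc.parallelize(Seq((1, "one"), (2, "two"), (3, "three")))
pairs.saveAsSequenceFile("/tmp/sequence-file-sketch")

// Read it back with native types; Spark converts IntWritable and Text automatically.
val restored = sc.sequenceFile[Int, String]("/tmp/sequence-file-sketch")
restored.collect().foreach { case (k, v) => println(s"$k -> $v") }
```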