add user guide for binary file data source
mengxr committed Apr 29, 2019
1 parent 20a3ef7 commit bdd4529
Showing 2 changed files with 81 additions and 0 deletions.
80 changes: 80 additions & 0 deletions docs/sql-data-sources-binaryFile.md
@@ -0,0 +1,80 @@
---
layout: global
title: Binary File Data Source
displayTitle: Binary File Data Source
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at
     http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

Since Spark 3.0, Spark supports a binary file data source,
which reads binary files and converts each file into a single record that contains the raw content
and metadata of the file.
It produces a DataFrame with the following columns and possibly partition columns
(a short Scala sketch of the resulting schema follows the list):
* `path`: StringType
* `modificationTime`: TimestampType
* `length`: LongType
* `content`: BinaryType
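
For example, a minimal Scala sketch (the input path is hypothetical) that loads a directory
and inspects the resulting schema:

{% highlight scala %}

val df = spark.read.format("binaryFile").load("/path/to/data")
// Expected columns: path (string), modificationTime (timestamp), length (long), content (binary)
df.printSchema()

{% endhighlight %}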

It supports the following read option:
<table class="table">
<tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
<tr>
<td><code>pathGlobFilter</code></td>
<td>none (accepts all)</td>
<td>
    An optional glob pattern used to include only files whose paths match the pattern.
    The syntax follows <code>org.apache.hadoop.fs.GlobFilter</code>.
    It does not change the behavior of partition discovery.
</td>
</tr>
</table>

To read whole binary files, you need to specify the data source `format` as `binaryFile`.
For example, the following code reads all PNG files from the input directory:

<div class="codetabs">
<div data-lang="scala" markdown="1">
{% highlight scala %}

spark.read.format("binaryFile").option("pathGlobFilter", "*.png").load("/path/to/data")

{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
{% highlight java %}

spark.read().format("binaryFile").option("pathGlobFilter", "*.png").load("/path/to/data");

{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
{% highlight python %}

spark.read.format("pathGlobFilter").option("pathGlobFilter", "*.png").load("/path/to/data")

{% endhighlight %}
</div>
<div data-lang="r" markdown="1">
{% highlight r %}

read.df("/path/to/data", "binaryFile", pathGlobFilter = "*.png")

{% endhighlight %}
</div>
</div>
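
The loaded DataFrame can be queried like any other. As a hedged Scala example (the path and row
count are illustrative), the following lists the largest matched PNG files:

{% highlight scala %}

import org.apache.spark.sql.functions.desc

val pngs = spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.png")
  .load("/path/to/data")

// Show the ten largest PNG files by size.
pngs.select("path", "length")
  .orderBy(desc("length"))
  .show(10, truncate = false)

{% endhighlight %}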

The binary file data source does not support writing a DataFrame back to the original files.
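
As an illustrative sketch of this limitation (paths are hypothetical), attempting a write with the
`binaryFile` format fails at runtime instead of producing output files:

{% highlight scala %}

val df = spark.read.format("binaryFile").load("/path/to/data")

// Expected to throw an exception: the binary file data source is read-only.
df.write.format("binaryFile").save("/path/to/output")

{% endhighlight %}
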
1 change: 1 addition & 0 deletions docs/sql-data-sources.md
Expand Up @@ -54,4 +54,5 @@ goes into specific options that are available for the built-in data sources.
* [Compatibility with Databricks spark-avro](sql-data-sources-avro.html#compatibility-with-databricks-spark-avro)
* [Supported types for Avro -> Spark SQL conversion](sql-data-sources-avro.html#supported-types-for-avro---spark-sql-conversion)
* [Supported types for Spark SQL -> Avro conversion](sql-data-sources-avro.html#supported-types-for-spark-sql---avro-conversion)
* [Whole Binary Files](sql-data-sources-binaryFile.html)
* [Troubleshooting](sql-data-sources-troubleshooting.html)
