Commit
[SPARK-24499][SQL][DOC] Split the page of sql-programming-guide.html to multiple separate pages

## What changes were proposed in this pull request?

1. Split the main page of sql-programming-guide into 7 parts:
   - Getting Started
   - Data Sources
   - Performance Tuning
   - Distributed SQL Engine
   - PySpark Usage Guide for Pandas with Apache Arrow
   - Migration Guide
   - Reference
2. Add a left menu for sql-programming-guide, keeping the first-level index for each part in the menu.

![image](https://user-images.githubusercontent.com/4833765/47016859-6332e180-d183-11e8-92e8-ce62518a83c4.png)

## How was this patch tested?

Local test with jekyll build/serve.

Closes #22746 from xuanyuanking/SPARK-24499.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
1 parent c296254 · commit 987f386
Showing 25 changed files with 3,347 additions and 3,130 deletions.
@@ -0,0 +1,81 @@
- text: Getting Started
  url: sql-getting-started.html
  subitems:
    - text: "Starting Point: SparkSession"
      url: sql-getting-started.html#starting-point-sparksession
    - text: Creating DataFrames
      url: sql-getting-started.html#creating-dataframes
    - text: Untyped Dataset Operations (DataFrame operations)
      url: sql-getting-started.html#untyped-dataset-operations-aka-dataframe-operations
    - text: Running SQL Queries Programmatically
      url: sql-getting-started.html#running-sql-queries-programmatically
    - text: Global Temporary View
      url: sql-getting-started.html#global-temporary-view
    - text: Creating Datasets
      url: sql-getting-started.html#creating-datasets
    - text: Interoperating with RDDs
      url: sql-getting-started.html#interoperating-with-rdds
    - text: Aggregations
      url: sql-getting-started.html#aggregations
- text: Data Sources
  url: sql-data-sources.html
  subitems:
    - text: "Generic Load/Save Functions"
      url: sql-data-sources-load-save-functions.html
    - text: Parquet Files
      url: sql-data-sources-parquet.html
    - text: ORC Files
      url: sql-data-sources-orc.html
    - text: JSON Files
      url: sql-data-sources-json.html
    - text: Hive Tables
      url: sql-data-sources-hive-tables.html
    - text: JDBC To Other Databases
      url: sql-data-sources-jdbc.html
    - text: Avro Files
      url: sql-data-sources-avro.html
    - text: Troubleshooting
      url: sql-data-sources-troubleshooting.html
- text: Performance Tuning
  url: sql-performance-turing.html
  subitems:
    - text: Caching Data In Memory
      url: sql-performance-turing.html#caching-data-in-memory
    - text: Other Configuration Options
      url: sql-performance-turing.html#other-configuration-options
    - text: Broadcast Hint for SQL Queries
      url: sql-performance-turing.html#broadcast-hint-for-sql-queries
- text: Distributed SQL Engine
  url: sql-distributed-sql-engine.html
  subitems:
    - text: "Running the Thrift JDBC/ODBC server"
      url: sql-distributed-sql-engine.html#running-the-thrift-jdbcodbc-server
    - text: Running the Spark SQL CLI
      url: sql-distributed-sql-engine.html#running-the-spark-sql-cli
- text: PySpark Usage Guide for Pandas with Apache Arrow
  url: sql-pyspark-pandas-with-arrow.html
  subitems:
    - text: Apache Arrow in Spark
      url: sql-pyspark-pandas-with-arrow.html#apache-arrow-in-spark
    - text: "Enabling for Conversion to/from Pandas"
      url: sql-pyspark-pandas-with-arrow.html#enabling-for-conversion-tofrom-pandas
    - text: "Pandas UDFs (a.k.a. Vectorized UDFs)"
      url: sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
    - text: Usage Notes
      url: sql-pyspark-pandas-with-arrow.html#usage-notes
- text: Migration Guide
  url: sql-migration-guide.html
  subitems:
    - text: Spark SQL Upgrading Guide
      url: sql-migration-guide-upgrade.html
    - text: Compatibility with Apache Hive
      url: sql-migration-guide-hive-compatibility.html
- text: Reference
  url: sql-reference.html
  subitems:
    - text: Data Types
      url: sql-reference.html#data-types
    - text: NaN Semantics
      url: sql-reference.html#nan-semantics
    - text: Arithmetic operations
      url: sql-reference.html#arithmetic-operations
@@ -0,0 +1,6 @@
<div class="left-menu-wrapper">
  <div class="left-menu">
    <h3><a href="sql-programming-guide.html">Spark SQL Guide</a></h3>
    {% include nav-left.html nav=include.nav-sql %}
  </div>
</div>
File renamed without changes.
@@ -0,0 +1,166 @@
---
layout: global
title: Hive Tables
displayTitle: Hive Tables
---

* Table of contents
{:toc}

Spark SQL also supports reading and writing data stored in [Apache Hive](http://hive.apache.org/).
However, since Hive has a large number of dependencies, these dependencies are not included in the
default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them
automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as
they will need access to the Hive serialization and deserialization libraries (SerDes) in order to
access data stored in Hive.

Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` (for security configuration),
and `hdfs-site.xml` (for HDFS configuration) files in `conf/`.

When working with Hive, one must instantiate `SparkSession` with Hive support, including
connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions.
Users who do not have an existing Hive deployment can still enable Hive support. When not configured
by `hive-site.xml`, the context automatically creates `metastore_db` in the current directory and
creates a directory configured by `spark.sql.warehouse.dir`, which defaults to the directory
`spark-warehouse` in the current directory where the Spark application is started. Note that
the `hive.metastore.warehouse.dir` property in `hive-site.xml` has been deprecated since Spark 2.0.0.
Instead, use `spark.sql.warehouse.dir` to specify the default location of databases in the warehouse.
You may need to grant write privileges to the user who starts the Spark application.

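In outline, constructing such a session looks like the following minimal Scala sketch; the application name and warehouse path are illustrative assumptions, and the bundled examples below show the full pattern:

{% highlight scala %}
import org.apache.spark.sql.SparkSession

// Build a Hive-enabled session; enableHiveSupport() wires in the Hive
// metastore connectivity and SerDe support described above.
val spark = SparkSession.builder()
  .appName("Spark Hive Example")                         // illustrative name
  .config("spark.sql.warehouse.dir", "spark-warehouse")  // illustrative path
  .enableHiveSupport()
  .getOrCreate()
{% endhighlight %}
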
<div class="codetabs">

<div data-lang="scala" markdown="1">
{% include_example spark_hive scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example spark_hive java/org/apache/spark/examples/sql/hive/JavaSparkHiveExample.java %}
</div>

<div data-lang="python" markdown="1">
{% include_example spark_hive python/sql/hive.py %}
</div>

<div data-lang="r" markdown="1">

When working with Hive, one must instantiate `SparkSession` with Hive support. This
adds support for finding tables in the MetaStore and writing queries using HiveQL.

{% include_example spark_hive r/RSparkSQLExample.R %}

</div>
</div>

### Specifying storage format for Hive tables

When you create a Hive table, you need to define how this table should read/write data from/to the file system,
i.e. the "input format" and "output format". You also need to define how this table should deserialize the data
to rows, or serialize rows to data, i.e. the "serde". The following options can be used to specify the storage
format ("serde", "input format", "output format"), e.g. `CREATE TABLE src(id int) USING hive OPTIONS(fileFormat 'parquet')`.
By default, we will read the table files as plain text. Note that the Hive storage handler is not yet supported when
creating a table; you can create a table using a storage handler on the Hive side, and use Spark SQL to read it.

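As a concrete sketch of this statement form (assuming a Hive-enabled `SparkSession` named `spark`, as built above):

{% highlight scala %}
// Create a Hive table backed by Parquet files, then populate and query it.
spark.sql("CREATE TABLE src(id INT) USING hive OPTIONS(fileFormat 'parquet')")
spark.sql("INSERT INTO src VALUES (1), (2), (3)")
spark.sql("SELECT COUNT(*) FROM src").show()
{% endhighlight %}
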
<table class="table">
  <tr><th>Property Name</th><th>Meaning</th></tr>
  <tr>
    <td><code>fileFormat</code></td>
    <td>
      A fileFormat is a package of storage format specifications, including "serde", "input format" and
      "output format". Currently we support 6 fileFormats: 'sequencefile', 'rcfile', 'orc', 'parquet', 'textfile' and 'avro'.
    </td>
  </tr>

  <tr>
    <td><code>inputFormat, outputFormat</code></td>
    <td>
      These 2 options specify the name of a corresponding <code>InputFormat</code> and <code>OutputFormat</code> class as a string literal,
      e.g. <code>org.apache.hadoop.hive.ql.io.orc.OrcInputFormat</code>. These 2 options must appear as a pair, and you cannot
      specify them if you have already specified the <code>fileFormat</code> option.
    </td>
  </tr>

  <tr>
    <td><code>serde</code></td>
    <td>
      This option specifies the name of a serde class. When the <code>fileFormat</code> option is specified, do not specify this option
      if the given <code>fileFormat</code> already includes the serde information. Currently "sequencefile", "textfile" and "rcfile"
      don't include the serde information, so you can use this option with these 3 fileFormats.
    </td>
  </tr>

  <tr>
    <td><code>fieldDelim, escapeDelim, collectionDelim, mapkeyDelim, lineDelim</code></td>
    <td>
      These options can only be used with the "textfile" fileFormat. They define how to read delimited files into rows.
    </td>
  </tr>
</table>

All other properties defined with `OPTIONS` will be regarded as Hive serde properties.

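For instance, a minimal sketch combining a delimiter option with the "textfile" fileFormat (the table name and delimiter are illustrative):

{% highlight scala %}
// A delimited text table: fieldDelim applies only to the 'textfile' format,
// and any other OPTIONS key would be passed through as a Hive serde property.
spark.sql("""
  CREATE TABLE people_txt(id INT, name STRING) USING hive
  OPTIONS(fileFormat 'textfile', fieldDelim ',')
""")
{% endhighlight %}
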
### Interacting with Different Versions of Hive Metastore

One of the most important pieces of Spark SQL's Hive support is its interaction with the Hive metastore,
which enables Spark SQL to access the metadata of Hive tables. Starting from Spark 1.4.0, a single binary
build of Spark SQL can be used to query different versions of Hive metastores, using the configuration described below.
Note that, independent of the version of Hive that is being used to talk to the metastore, internally Spark SQL
will compile against Hive 1.2.1 and use those classes for internal execution (serdes, UDFs, UDAFs, etc.).

The following options can be used to configure the version of Hive that is used to retrieve metadata:

<table class="table">
  <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
  <tr>
    <td><code>spark.sql.hive.metastore.version</code></td>
    <td><code>1.2.1</code></td>
    <td>
      Version of the Hive metastore. Available
      options are <code>0.12.0</code> through <code>2.3.3</code>.
    </td>
  </tr>
  <tr>
    <td><code>spark.sql.hive.metastore.jars</code></td>
    <td><code>builtin</code></td>
    <td>
      Location of the jars that should be used to instantiate the HiveMetastoreClient. This
      property can be one of three options:
      <ol>
        <li><code>builtin</code><br/>
        Use Hive 1.2.1, which is bundled with the Spark assembly when <code>-Phive</code> is
        enabled. When this option is chosen, <code>spark.sql.hive.metastore.version</code> must be
        either <code>1.2.1</code> or not defined.</li>
        <li><code>maven</code><br/>
        Use Hive jars of the specified version downloaded from Maven repositories. This configuration
        is not generally recommended for production deployments.</li>
        <li>A classpath in the standard format for the JVM. This classpath must include all of Hive
        and its dependencies, including the correct version of Hadoop. These jars only need to be
        present on the driver, but if you are running in yarn cluster mode then you must ensure
        they are packaged with your application.</li>
      </ol>
    </td>
  </tr>
  <tr>
    <td><code>spark.sql.hive.metastore.sharedPrefixes</code></td>
    <td><code>com.mysql.jdbc,<br/>org.postgresql,<br/>com.microsoft.sqlserver,<br/>oracle.jdbc</code></td>
    <td>
      <p>
        A comma-separated list of class prefixes that should be loaded using the classloader that is
        shared between Spark SQL and a specific version of Hive. An example of classes that should
        be shared is JDBC drivers that are needed to talk to the metastore. Other classes that need
        to be shared are those that interact with classes that are already shared, for example,
        custom appenders that are used by log4j.
      </p>
    </td>
  </tr>
  <tr>
    <td><code>spark.sql.hive.metastore.barrierPrefixes</code></td>
    <td><code>(empty)</code></td>
    <td>
      <p>
        A comma-separated list of class prefixes that should explicitly be reloaded for each version
        of Hive that Spark SQL is communicating with. For example, Hive UDFs that are declared in a
        prefix that typically would be shared (i.e. <code>org.apache.spark.*</code>).
      </p>
    </td>
  </tr>
</table>
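
As an illustration, these properties can be set when the session is built; a minimal sketch, assuming Hive 2.3.3 metastore jars fetched from Maven (both values are illustrative, not recommendations):

{% highlight scala %}
import org.apache.spark.sql.SparkSession

// Point Spark SQL at a newer metastore; jars are resolved from Maven
// (convenient for testing, not generally recommended for production).
val spark = SparkSession.builder()
  .config("spark.sql.hive.metastore.version", "2.3.3")
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()
{% endhighlight %}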