[SPARK-24499][SQL][DOC] Split the page of sql-programming-guide.html to multiple separate pages

## What changes were proposed in this pull request?

1. Split the main page of sql-programming-guide into 7 parts:

- Getting Started
- Data Sources
- Performance Tuning
- Distributed SQL Engine
- PySpark Usage Guide for Pandas with Apache Arrow
- Migration Guide
- Reference

2. Add a left menu for sql-programming-guide, keeping the first-level index of each part in the menu.
![image](https://user-images.githubusercontent.com/4833765/47016859-6332e180-d183-11e8-92e8-ce62518a83c4.png)

## How was this patch tested?

Local test with jekyll build/serve.

Closes #22746 from xuanyuanking/SPARK-24499.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
xuanyuanking authored and gatorsmile committed Oct 18, 2018
1 parent c296254 commit 987f386
Showing 25 changed files with 3,347 additions and 3,130 deletions.
81 changes: 81 additions & 0 deletions docs/_data/menu-sql.yaml
@@ -0,0 +1,81 @@
- text: Getting Started
  url: sql-getting-started.html
  subitems:
    - text: "Starting Point: SparkSession"
      url: sql-getting-started.html#starting-point-sparksession
    - text: Creating DataFrames
      url: sql-getting-started.html#creating-dataframes
    - text: Untyped Dataset Operations (DataFrame operations)
      url: sql-getting-started.html#untyped-dataset-operations-aka-dataframe-operations
    - text: Running SQL Queries Programmatically
      url: sql-getting-started.html#running-sql-queries-programmatically
    - text: Global Temporary View
      url: sql-getting-started.html#global-temporary-view
    - text: Creating Datasets
      url: sql-getting-started.html#creating-datasets
    - text: Interoperating with RDDs
      url: sql-getting-started.html#interoperating-with-rdds
    - text: Aggregations
      url: sql-getting-started.html#aggregations
- text: Data Sources
  url: sql-data-sources.html
  subitems:
    - text: "Generic Load/Save Functions"
      url: sql-data-sources-load-save-functions.html
    - text: Parquet Files
      url: sql-data-sources-parquet.html
    - text: ORC Files
      url: sql-data-sources-orc.html
    - text: JSON Files
      url: sql-data-sources-json.html
    - text: Hive Tables
      url: sql-data-sources-hive-tables.html
    - text: JDBC To Other Databases
      url: sql-data-sources-jdbc.html
    - text: Avro Files
      url: sql-data-sources-avro.html
    - text: Troubleshooting
      url: sql-data-sources-troubleshooting.html
- text: Performance Tuning
  url: sql-performance-tuning.html
  subitems:
    - text: Caching Data In Memory
      url: sql-performance-tuning.html#caching-data-in-memory
    - text: Other Configuration Options
      url: sql-performance-tuning.html#other-configuration-options
    - text: Broadcast Hint for SQL Queries
      url: sql-performance-tuning.html#broadcast-hint-for-sql-queries
- text: Distributed SQL Engine
  url: sql-distributed-sql-engine.html
  subitems:
    - text: "Running the Thrift JDBC/ODBC server"
      url: sql-distributed-sql-engine.html#running-the-thrift-jdbcodbc-server
    - text: Running the Spark SQL CLI
      url: sql-distributed-sql-engine.html#running-the-spark-sql-cli
- text: PySpark Usage Guide for Pandas with Apache Arrow
  url: sql-pyspark-pandas-with-arrow.html
  subitems:
    - text: Apache Arrow in Spark
      url: sql-pyspark-pandas-with-arrow.html#apache-arrow-in-spark
    - text: "Enabling for Conversion to/from Pandas"
      url: sql-pyspark-pandas-with-arrow.html#enabling-for-conversion-tofrom-pandas
    - text: "Pandas UDFs (a.k.a. Vectorized UDFs)"
      url: sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
    - text: Usage Notes
      url: sql-pyspark-pandas-with-arrow.html#usage-notes
- text: Migration Guide
  url: sql-migration-guide.html
  subitems:
    - text: Spark SQL Upgrading Guide
      url: sql-migration-guide-upgrade.html
    - text: Compatibility with Apache Hive
      url: sql-migration-guide-hive-compatibility.html
- text: Reference
  url: sql-reference.html
  subitems:
    - text: Data Types
      url: sql-reference.html#data-types
    - text: NaN Semantics
      url: sql-reference.html#nan-semantics
    - text: Arithmetic operations
      url: sql-reference.html#arithmetic-operations
6 changes: 6 additions & 0 deletions docs/_includes/nav-left-wrapper-sql.html
@@ -0,0 +1,6 @@
<div class="left-menu-wrapper">
    <div class="left-menu">
        <h3><a href="sql-programming-guide.html">Spark SQL Guide</a></h3>
        {% include nav-left.html nav=include.nav-sql %}
    </div>
</div>
3 changes: 2 additions & 1 deletion docs/_includes/nav-left.html
@@ -10,7 +10,8 @@
{% endif %}
</a>
</li>
-{% if item.subitems and navurl contains item.url %}
+{% assign tag = item.url | remove: ".html" %}
+{% if item.subitems and navurl contains tag %}
{% include nav-left.html nav=item.subitems %}
{% endif %}
{% endfor %}
8 changes: 6 additions & 2 deletions docs/_layouts/global.html
@@ -126,8 +126,12 @@

<div class="container-wrapper">

-{% if page.url contains "/ml" %}
-  {% include nav-left-wrapper-ml.html nav-mllib=site.data.menu-mllib nav-ml=site.data.menu-ml %}
+{% if page.url contains "/ml" or page.url contains "/sql" %}
+  {% if page.url contains "/ml" %}
+    {% include nav-left-wrapper-ml.html nav-mllib=site.data.menu-mllib nav-ml=site.data.menu-ml %}
+  {% else %}
+    {% include nav-left-wrapper-sql.html nav-sql=site.data.menu-sql %}
+  {% endif %}
<input id="nav-trigger" class="nav-trigger" checked type="checkbox">
<label for="nav-trigger"></label>
<div class="content-with-sidebar" id="content">
2 changes: 1 addition & 1 deletion docs/ml-pipeline.md
@@ -57,7 +57,7 @@ E.g., a learning algorithm is an `Estimator` which trains on a `DataFrame` and p
Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data.
This API adopts the `DataFrame` from Spark SQL in order to support a variety of data types.

-`DataFrame` supports many basic and structured types; see the [Spark SQL datatype reference](sql-programming-guide.html#data-types) for a list of supported types.
+`DataFrame` supports many basic and structured types; see the [Spark SQL datatype reference](sql-reference.html#data-types) for a list of supported types.
In addition to the types listed in the Spark SQL guide, `DataFrame` can use ML [`Vector`](mllib-data-types.html#local-vector) types.

A `DataFrame` can be created either implicitly or explicitly from a regular `RDD`. See the code examples below and the [Spark SQL programming guide](sql-programming-guide.html) for examples.
6 changes: 3 additions & 3 deletions docs/sparkr.md
@@ -104,7 +104,7 @@ The following Spark driver properties can be set in `sparkConfig` with `sparkR.s
</div>

## Creating SparkDataFrames
-With a `SparkSession`, applications can create `SparkDataFrame`s from a local R data frame, from a [Hive table](sql-programming-guide.html#hive-tables), or from other [data sources](sql-programming-guide.html#data-sources).
+With a `SparkSession`, applications can create `SparkDataFrame`s from a local R data frame, from a [Hive table](sql-data-sources-hive-tables.html), or from other [data sources](sql-data-sources.html).

### From local data frames
The simplest way to create a data frame is to convert a local R data frame into a SparkDataFrame. Specifically, we can use `as.DataFrame` or `createDataFrame` and pass in the local R data frame to create a SparkDataFrame. As an example, the following creates a `SparkDataFrame` based using the `faithful` dataset from R.
@@ -125,7 +125,7 @@ head(df)

### From Data Sources

-SparkR supports operating on a variety of data sources through the `SparkDataFrame` interface. This section describes the general methods for loading and saving data using Data Sources. You can check the Spark SQL programming guide for more [specific options](sql-programming-guide.html#manually-specifying-options) that are available for the built-in data sources.
+SparkR supports operating on a variety of data sources through the `SparkDataFrame` interface. This section describes the general methods for loading and saving data using Data Sources. You can check the Spark SQL programming guide for more [specific options](sql-data-sources-load-save-functions.html#manually-specifying-options) that are available for the built-in data sources.

The general method for creating SparkDataFrames from data sources is `read.df`. This method takes in the path for the file to load and the type of data source, and the currently active SparkSession will be used automatically.
SparkR supports reading JSON, CSV and Parquet files natively, and through packages available from sources like [Third Party Projects](https://spark.apache.org/third-party-projects.html), you can find data source connectors for popular file formats like Avro. These packages can either be added by
@@ -180,7 +180,7 @@ write.df(people, path = "people.parquet", source = "parquet", mode = "overwrite"

### From Hive tables

-You can also create SparkDataFrames from Hive tables. To do this we will need to create a SparkSession with Hive support which can access tables in the Hive MetaStore. Note that Spark should have been built with [Hive support](building-spark.html#building-with-hive-and-jdbc-support) and more details can be found in the [SQL programming guide](sql-programming-guide.html#starting-point-sparksession). In SparkR, by default it will attempt to create a SparkSession with Hive support enabled (`enableHiveSupport = TRUE`).
+You can also create SparkDataFrames from Hive tables. To do this we will need to create a SparkSession with Hive support which can access tables in the Hive MetaStore. Note that Spark should have been built with [Hive support](building-spark.html#building-with-hive-and-jdbc-support) and more details can be found in the [SQL programming guide](sql-getting-started.html#starting-point-sparksession). In SparkR, by default it will attempt to create a SparkSession with Hive support enabled (`enableHiveSupport = TRUE`).

<div data-lang="r" markdown="1">
{% highlight r %}
File renamed without changes.
166 changes: 166 additions & 0 deletions docs/sql-data-sources-hive-tables.md
@@ -0,0 +1,166 @@
---
layout: global
title: Hive Tables
displayTitle: Hive Tables
---

* Table of contents
{:toc}

Spark SQL also supports reading and writing data stored in [Apache Hive](http://hive.apache.org/).
However, since Hive has a large number of dependencies, these dependencies are not included in the
default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them
automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as
they will need access to the Hive serialization and deserialization libraries (SerDes) in order to
access data stored in Hive.

Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` (for security configuration),
and `hdfs-site.xml` (for HDFS configuration) file in `conf/`.

When working with Hive, one must instantiate `SparkSession` with Hive support, including
connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions.
Users who do not have an existing Hive deployment can still enable Hive support. When not configured
by the `hive-site.xml`, the context automatically creates `metastore_db` in the current directory and
creates a directory configured by `spark.sql.warehouse.dir`, which defaults to the directory
`spark-warehouse` in the current directory where the Spark application is started. Note that
the `hive.metastore.warehouse.dir` property in `hive-site.xml` is deprecated since Spark 2.0.0.
Instead, use `spark.sql.warehouse.dir` to specify the default location of databases in the warehouse.
You may need to grant write privileges to the user who starts the Spark application.

<div class="codetabs">

<div data-lang="scala" markdown="1">
{% include_example spark_hive scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example spark_hive java/org/apache/spark/examples/sql/hive/JavaSparkHiveExample.java %}
</div>

<div data-lang="python" markdown="1">
{% include_example spark_hive python/sql/hive.py %}
</div>

<div data-lang="r" markdown="1">

When working with Hive, one must instantiate `SparkSession` with Hive support. This
adds support for finding tables in the MetaStore and writing queries using HiveQL.

{% include_example spark_hive r/RSparkSQLExample.R %}

</div>
</div>
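
The `include_example` tags above pull the full listings from the Spark examples project. As a rough Scala
sketch of the pattern those examples follow (the table name `src` is illustrative):

{% highlight scala %}
import java.io.File

import org.apache.spark.sql.SparkSession

// spark.sql.warehouse.dir points at the managed warehouse location and replaces
// the deprecated hive.metastore.warehouse.dir setting.
val warehouseLocation = new File("spark-warehouse").getAbsolutePath

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

import spark.sql

// HiveQL statements, including DDL, run through sql(...)
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
sql("SELECT COUNT(*) FROM src").show()
{% endhighlight %}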

### Specifying storage format for Hive tables

When you create a Hive table, you need to define how this table should read/write data from/to the file
system, i.e. the "input format" and "output format". You also need to define how this table should
deserialize the data to rows, or serialize rows to data, i.e. the "serde". The following options can be used
to specify the storage format ("serde", "input format", "output format"), e.g.
`CREATE TABLE src(id int) USING hive OPTIONS(fileFormat 'parquet')`. By default, we will read the table files
as plain text. Note that Hive storage handlers are not supported yet when creating a table; you can create a
table using a storage handler on the Hive side and use Spark SQL to read it.

<table class="table">
<tr><th>Property Name</th><th>Meaning</th></tr>
<tr>
<td><code>fileFormat</code></td>
<td>
A fileFormat is a package of storage format specifications, including "serde", "input format" and
"output format". Currently six fileFormats are supported: 'sequencefile', 'rcfile', 'orc', 'parquet', 'textfile' and 'avro'.
</td>
</tr>

<tr>
<td><code>inputFormat, outputFormat</code></td>
<td>
These two options specify the names of the corresponding `InputFormat` and `OutputFormat` classes as string literals,
e.g. `org.apache.hadoop.hive.ql.io.orc.OrcInputFormat`. These two options must appear in pairs, and you cannot
specify them if you have already specified the `fileFormat` option.
</td>
</tr>

<tr>
<td><code>serde</code></td>
<td>
This option specifies the name of a serde class. When the `fileFormat` option is specified, do not specify this option
if the given `fileFormat` already includes the serde information. Currently "sequencefile", "textfile" and "rcfile"
don't include the serde information, so you can use this option with these three fileFormats.
</td>
</tr>

<tr>
<td><code>fieldDelim, escapeDelim, collectionDelim, mapkeyDelim, lineDelim</code></td>
<td>
These options can only be used with the "textfile" fileFormat. They define how to read delimited files into rows.
</td>
</tr>
</table>

All other properties defined with `OPTIONS` will be regarded as Hive serde properties.
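
For illustration, a minimal Scala sketch of these options in action (the table names are hypothetical, and
`spark` is a `SparkSession` created with `enableHiveSupport()`):

{% highlight scala %}
// Derive serde, input format and output format from a single fileFormat option.
spark.sql(
  """
    |CREATE TABLE hive_parquet_records(key INT, value STRING)
    |USING hive
    |OPTIONS(fileFormat 'parquet')
  """.stripMargin)

// 'textfile' does not imply a serde, so it can be combined with an explicit
// serde class and delimiter options.
spark.sql(
  """
    |CREATE TABLE hive_csv_records(key INT, value STRING)
    |USING hive
    |OPTIONS(
    |  fileFormat 'textfile',
    |  serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe',
    |  fieldDelim ','
    |)
  """.stripMargin)
{% endhighlight %}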

### Interacting with Different Versions of Hive Metastore

One of the most important pieces of Spark SQL's Hive support is interaction with the Hive metastore,
which enables Spark SQL to access the metadata of Hive tables. Starting from Spark 1.4.0, a single binary
build of Spark SQL can be used to query different versions of Hive metastores, using the configuration described below.
Note that independent of the version of Hive that is being used to talk to the metastore, internally Spark SQL
will compile against Hive 1.2.1 and use those classes for internal execution (serdes, UDFs, UDAFs, etc).

The following options can be used to configure the version of Hive that is used to retrieve metadata:

<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
<td><code>spark.sql.hive.metastore.version</code></td>
<td><code>1.2.1</code></td>
<td>
Version of the Hive metastore. Available
options are <code>0.12.0</code> through <code>2.3.3</code>.
</td>
</tr>
<tr>
<td><code>spark.sql.hive.metastore.jars</code></td>
<td><code>builtin</code></td>
<td>
Location of the jars that should be used to instantiate the HiveMetastoreClient. This
property can be one of three options:
<ol>
<li><code>builtin</code>: Use Hive 1.2.1, which is bundled with the Spark assembly when <code>-Phive</code> is
enabled. When this option is chosen, <code>spark.sql.hive.metastore.version</code> must be
either <code>1.2.1</code> or not defined.</li>
<li><code>maven</code>: Use Hive jars of the specified version downloaded from Maven repositories. This configuration
is not generally recommended for production deployments.</li>
<li>A classpath in the standard format for the JVM. This classpath must include all of Hive
and its dependencies, including the correct version of Hadoop. These jars only need to be
present on the driver, but if you are running in yarn cluster mode then you must ensure
they are packaged with your application.</li>
</ol>
</td>
</tr>
<tr>
<td><code>spark.sql.hive.metastore.sharedPrefixes</code></td>
<td><code>com.mysql.jdbc,<br/>org.postgresql,<br/>com.microsoft.sqlserver,<br/>oracle.jdbc</code></td>
<td>
<p>
A comma-separated list of class prefixes that should be loaded using the classloader that is
shared between Spark SQL and a specific version of Hive. An example of classes that should
be shared is JDBC drivers that are needed to talk to the metastore. Other classes that need
to be shared are those that interact with classes that are already shared. For example,
custom appenders that are used by log4j.
</p>
</td>
</tr>
<tr>
<td><code>spark.sql.hive.metastore.barrierPrefixes</code></td>
<td><code>(empty)</code></td>
<td>
<p>
A comma-separated list of class prefixes that should explicitly be reloaded for each version
of Hive that Spark SQL is communicating with. For example, Hive UDFs that are declared in a
prefix that typically would be shared (i.e. <code>org.apache.spark.*</code>).
</p>
</td>
</tr>
</table>
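
As an illustrative sketch (the chosen version and shared prefix are examples, not recommendations), these
properties can be set on the `SparkSession` builder before Hive support is initialized, or equivalently in
`spark-defaults.conf`:

{% highlight scala %}
import org.apache.spark.sql.SparkSession

// Talk to a Hive 2.3.3 metastore with jars resolved from Maven, and share the
// MySQL JDBC driver classes between Spark SQL and the metastore client.
val spark = SparkSession
  .builder()
  .appName("Custom Hive metastore version")
  .config("spark.sql.hive.metastore.version", "2.3.3")
  .config("spark.sql.hive.metastore.jars", "maven")
  .config("spark.sql.hive.metastore.sharedPrefixes", "com.mysql.jdbc")
  .enableHiveSupport()
  .getOrCreate()
{% endhighlight %}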
