[SPARK-9148][SPARK-10252][SQL] Update SQL Programming Guide
marmbrus committed Aug 26, 2015
1 parent 00ae4be commit f11c169
89 changes: 70 additions & 19 deletions docs/sql-programming-guide.md
@@ -11,7 +11,7 @@ title: Spark SQL and DataFrames

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as distributed SQL query engine.

Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the [Hive Tables](#hive-tables) section.

# DataFrames

@@ -213,6 +213,11 @@ df.groupBy("age").count().show()
// 30 1
{% endhighlight %}

For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/scala/index.html#org.apache.spark.sql.DataFrame).

In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/scala/index.html#org.apache.spark.sql.functions$).
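
For instance, a minimal sketch (reusing the `df` DataFrame from the example above) that combines a couple of these built-in functions in a single `select`:

{% highlight scala %}
import org.apache.spark.sql.functions._

// Upper-case the name column and add 10 to the age column
df.select(upper(df("name")), df("age") + 10).show()
{% endhighlight %}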


</div>

<div data-lang="java" markdown="1">
@@ -263,6 +268,10 @@ df.groupBy("age").count().show();
// 30 1
{% endhighlight %}

For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/java/org/apache/spark/sql/DataFrame.html).

In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/java/org/apache/spark/sql/functions.html).

</div>

<div data-lang="python" markdown="1">
@@ -320,6 +329,10 @@ df.groupBy("age").count().show()

{% endhighlight %}

For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/python/pyspark.sql.html#pyspark.sql.DataFrame).

In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/python/pyspark.sql.html#module-pyspark.sql.functions).

</div>

<div data-lang="r" markdown="1">
@@ -370,10 +383,13 @@ showDF(count(groupBy(df, "age")))

{% endhighlight %}

For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/R/index.html).

In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/R/index.html).

</div>

</div>

## Running SQL Queries Programmatically

@@ -870,12 +886,11 @@ saveDF(select(df, "name", "age"), "namesAndAges.parquet", "parquet")

Save operations can optionally take a `SaveMode` that specifies how to handle existing data if
present. It is important to realize that these save modes do not utilize any locking and are not
atomic. Additionally, when performing an `Overwrite`, the data will be deleted before writing out the
new data.
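
As a rough illustration in Scala (reusing `df` and the output path from the earlier examples), a save mode can be passed either as a `SaveMode` constant or as its string form:

{% highlight scala %}
import org.apache.spark.sql.SaveMode

// Replace any data that already exists at the target path
df.write.mode(SaveMode.Overwrite).parquet("namesAndAges.parquet")

// The same mode expressed as a string
df.select("name", "age").write.format("parquet").mode("overwrite").save("namesAndAges.parquet")
{% endhighlight %}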

<table class="table">
<tr><th>Scala/Java</th><th>Any Language</th><th>Meaning</th></tr>
<tr>
<td><code>SaveMode.ErrorIfExists</code> (default)</td>
<td><code>"error"</code> (default)</td>
@@ -1642,12 +1657,12 @@ results <- collect(sql(sqlContext, "FROM src SELECT key, value"))
### Interacting with Different Versions of Hive Metastore

One of the most important pieces of Spark SQL's Hive support is interaction with Hive metastore,
which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.4.0, a single binary
build of Spark SQL can be used to query different versions of Hive metastores, using the configuration described below.
Note that independent of the version of Hive that is being used to talk to the metastore, internally Spark SQL
will compile against Hive 1.2.1 and use those classes for internal execution (serdes, UDFs, UDAFs, etc).
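
As a hedged sketch (the jar paths are placeholders and depend on your installation), the configuration properties described in the table below can be supplied through `SparkConf` before the `HiveContext` is created:

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Talk to a Hive 0.13.1 metastore while still executing with the built-in Hive 1.2.1 classes.
// The classpath is a placeholder; it must include Hive 0.13.1 and a matching Hadoop version.
val conf = new SparkConf()
  .setAppName("metastore-example")
  .set("spark.sql.hive.metastore.version", "0.13.1")
  .set("spark.sql.hive.metastore.jars", "/path/to/hive-0.13.1/lib/*:/path/to/hadoop/client/*")

val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
{% endhighlight %}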

The following options can be used to configure the version of Hive that is used to retrieve metadata:

<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
@@ -1656,7 +1671,7 @@ version specified by users. An isolated classloader is used here to avoid dependency conflicts.
<td><code>0.13.1</code></td>
<td>
Version of the Hive metastore. Available
options are <code>0.12.0</code> through <code>1.2.1</code>.
</td>
</tr>
<tr>
@@ -1667,12 +1682,14 @@ version specified by users. An isolated classloader is used here to avoid dependency conflicts.
property can be one of three options:
<ol>
<li><code>builtin</code></li>
Use Hive 1.2.1, which is bundled with the Spark assembly jar when <code>-Phive</code> is
enabled. When this option is chosen, <code>spark.sql.hive.metastore.version</code> must be
either <code>1.2.1</code> or not defined.
<li><code>maven</code></li>
Use Hive jars of the specified version downloaded from Maven repositories. This configuration
is not generally recommended for production deployments.
<li>A classpath in the standard format for the JVM. This classpath must include all of Hive
and its dependencies, including the correct version of Hadoop.</li>
</ol>
</td>
</tr>
@@ -1988,6 +2005,27 @@ options.

# Migration Guide

## Upgrading From Spark SQL 1.4 to 1.5

- Optimized execution using manually managed memory (Tungsten) is now enabled by default, along with
code generation for expression evaluation. These features can both be disabled by setting
`spark.sql.tungsten.enabled` to `false` (see the sketch after this list).
- Parquet schema merging is no longer enabled by default. It can be re-enabled by setting
`spark.sql.parquet.mergeSchema` to `true`.
- Resolution of strings to columns in Python now supports using dots (`.`) to qualify the column or
access nested values. For example, `df['table.column.nestedField']`. However, this means that if
your column name contains any dots you must now escape them using backticks.
- In-memory columnar storage partition pruning is on by default. It can be disabled by setting
`spark.sql.inMemoryColumnarStorage.partitionPruning` to `false`.
- Unlimited precision decimal columns are no longer supported; instead, Spark SQL enforces a maximum
precision of 38. When inferring schema from `BigDecimal` objects, a precision of (38, 18) is now
used. When no precision is specified in DDL then the default remains `Decimal(10, 0)`.
- In the `sql` dialect, floating point numbers are now parsed as decimal. HiveQL parsing remains
unchanged.
- The canonical names of SQL/DataFrame functions are now lower case (e.g. `sum` vs `SUM`).
- It has been determined that using the DirectOutputCommitter when speculation is enabled is unsafe
and thus this output committer will not be used when speculation is on, independent of configuration.
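
A brief Scala sketch of the Tungsten and Parquet schema-merging switches mentioned above (the path is a placeholder):

{% highlight scala %}
// Re-enable Parquet schema merging for a single read (it is off by default in 1.5)
val merged = sqlContext.read.option("mergeSchema", "true").parquet("/path/to/parquet/table")

// Fall back to the pre-Tungsten execution path if needed
sqlContext.setConf("spark.sql.tungsten.enabled", "false")
{% endhighlight %}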

## Upgrading from Spark SQL 1.3 to 1.4

#### DataFrame data reader/writer interface
@@ -2009,7 +2047,8 @@ See the API docs for `SQLContext.read` (

#### DataFrame.groupBy retains grouping columns

Based on user feedback, we changed the default behavior of `DataFrame.groupBy().agg()` to retain the
grouping columns in the resulting `DataFrame`. To keep the behavior in 1.3, set `spark.sql.retainGroupColumns` to `false`.

<div class="codetabs">
<div data-lang="scala" markdown="1">
@@ -2146,7 +2185,7 @@ Python UDF registration is unchanged.
When using DataTypes in Python you will need to construct them (i.e. `StringType()`) instead of
referencing a singleton.

## Migration Guide for Shark Users

### Scheduling
To set a [Fair Scheduler](job-scheduling.html#fair-scheduler-pools) pool for a JDBC client session,
@@ -2222,6 +2261,7 @@ Spark SQL supports the vast majority of Hive features, such as:
* User defined functions (UDF)
* User defined aggregation functions (UDAF)
* User defined serialization formats (SerDes)
* Window functions
* Joins
* `JOIN`
* `{LEFT|RIGHT|FULL} OUTER JOIN`
@@ -2232,7 +2272,7 @@ Spark SQL supports the vast majority of Hive features, such as:
* `SELECT col FROM ( SELECT a + b AS col from t1) t2`
* Sampling
* Explain
* Partitioned tables including dynamic partition insertion
* View
* All Hive DDL Functions, including:
* `CREATE TABLE`
@@ -2294,8 +2334,9 @@ releases of Spark SQL.
Hive can optionally merge the small files into fewer large files to avoid overflowing the HDFS
metadata. Spark SQL does not support that.

# Reference

## Data Types

Spark SQL and DataFrames support the following data types:

@@ -2908,3 +2949,13 @@ from pyspark.sql.types import *

</div>

## NaN Semantics

There is special handling for not-a-number (NaN) when dealing with `float` or `double` types that
does not exactly match standard floating point semantics.
Specifically:

- NaN = NaN returns true.
- In aggregations all NaN values are grouped together.
- NaN is treated as a normal value in join keys.
- NaN values go last when in ascending order, larger than any other numeric value.
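
A short sketch illustrating the grouping and ordering behavior (assuming a `sqlContext` is in scope):

{% highlight scala %}
val nanDF = sqlContext.createDataFrame(Seq(
  (Double.NaN, "a"), (Double.NaN, "b"), (1.0, "c")
)).toDF("value", "label")

// All NaN values fall into a single group
nanDF.groupBy("value").count().show()

// In ascending order, NaN sorts after every other numeric value
nanDF.orderBy("value").show()
{% endhighlight %}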
