[SPARK-9148][SPARK-10252][SQL] Update SQL Programming Guide #8441
|
@@ -11,7 +11,7 @@ title: Spark SQL and DataFrames | |
|
||
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as distributed SQL query engine. | ||
|
||
For how to enable Hive support, please refer to the [Hive Tables](#hive-tables) section. | ||
Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the [Hive Tables](#hive-tables) section. | ||
|
||
# DataFrames | ||
|
||
|
@@ -213,6 +213,11 @@ df.groupBy("age").count().show() | |
// 30 1 | ||
{% endhighlight %} | ||
|
||
For a complete list of the types of operations that can be performed on a DataFrame refer to the [API Documentation](api/scala/index.html#org.apache.spark.sql.DataFrame). | ||
|
||
In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/scala/index.html#org.apache.spark.sql.DataFrame). | ||
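As a rough illustration (not part of the patch; it assumes `df` is the people DataFrame built earlier in this section), these built-in functions mix freely with column expressions:

```scala
import org.apache.spark.sql.functions._

// Uppercase a string column and do simple arithmetic in one select
df.select(upper(col("name")), col("age") + 1).show()

// Built-in aggregate functions work the same way
df.groupBy("age").agg(count("name"), avg("age")).show()
```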
|
||
|
||
</div> | ||
|
||
<div data-lang="java" markdown="1"> | ||
|
@@ -263,6 +268,10 @@ df.groupBy("age").count().show(); | |
// 30 1 | ||
{% endhighlight %} | ||
|
||
For a complete list of the types of operations that can be performed on a DataFrame refer to the [API Documentation](api/java/org/apache/spark/sql/DataFrame.html). | ||
|
||
In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/java/org/apache/spark/sql/functions.html). | ||
|
||
</div> | ||
|
||
<div data-lang="python" markdown="1"> | ||
|
@@ -320,6 +329,10 @@ df.groupBy("age").count().show() | |
|
||
{% endhighlight %} | ||
|
||
For a complete list of the types of operations that can be performed on a DataFrame refer to the [API Documentation](api/python/pyspark.sql.html#pyspark.sql.DataFrame). | ||
|
||
In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/python/pyspark.sql.html#module-pyspark.sql.functions). | ||
|
||
</div> | ||
|
||
<div data-lang="r" markdown="1"> | ||
|
@@ -370,10 +383,13 @@ showDF(count(groupBy(df, "age"))) | |
|
||
{% endhighlight %} | ||
|
||
</div> | ||
For a complete list of the types of operations that can be performed on a DataFrame refer to the [API Documentation](api/R/index.html). | ||
|
||
In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/R/index.html). | ||
|
||
</div> | ||
|
||
</div> | ||
|
||
## Running SQL Queries Programmatically | ||
|
||
|
@@ -870,12 +886,11 @@ saveDF(select(df, "name", "age"), "namesAndAges.parquet", "parquet") | |
|
||
Save operations can optionally take a `SaveMode`, that specifies how to handle existing data if | ||
present. It is important to realize that these save modes do not utilize any locking and are not | ||
atomic. Thus, it is not safe to have multiple writers attempting to write to the same location. | ||
Additionally, when performing an `Overwrite`, the data will be deleted before writing out the | ||
atomic. Additionally, when performing an `Overwrite`, the data will be deleted before writing out the | ||
new data. | ||
|
||
<table class="table"> | ||
<tr><th>Scala/Java</th><th>Python</th><th>Meaning</th></tr> | ||
<tr><th>Scala/Java</th><th>Any Language</th><th>Meaning</th></tr> | ||
<tr> | ||
<td><code>SaveMode.ErrorIfExists</code> (default)</td> | ||
<td><code>"error"</code> (default)</td> | ||
|
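For context, a minimal sketch of how a `SaveMode` is passed (not part of the patch; it reuses the `namesAndAges.parquet` path from the example above):

```scala
import org.apache.spark.sql.SaveMode

// Replace whatever data is already present at the target path
df.write.format("parquet").mode(SaveMode.Overwrite).save("namesAndAges.parquet")

// The string form ("error", "append", "overwrite", "ignore") is accepted as well
df.write.mode("append").parquet("namesAndAges.parquet")
```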
@@ -1671,12 +1686,12 @@ results <- collect(sql(sqlContext, "FROM src SELECT key, value")) | |
### Interacting with Different Versions of Hive Metastore | ||
|
||
One of the most important pieces of Spark SQL's Hive support is interaction with Hive metastore, | ||
which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.4.0, a single binary build of Spark SQL can be used to query different versions of Hive metastores, using the configuration described below. | ||
which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.4.0, a single binary | ||
build of Spark SQL can be used to query different versions of Hive metastores, using the configuration described below. | ||
Note that independent of the version of Hive that is being used to talk to the metastore, internally Spark SQL | ||
will compile against Hive 1.2.1 and use those classes for internal execution (serdes, UDFs, UDAFs, etc). | ||
|
||
Internally, Spark SQL uses two Hive clients, one for executing native Hive commands like `SET` | ||
and `DESCRIBE`, the other dedicated for communicating with Hive metastore. The former uses Hive | ||
jars of version 0.13.1, which are bundled with Spark 1.4.0. The latter uses Hive jars of the | ||
version specified by users. An isolated classloader is used here to avoid dependency conflicts. | ||
The following options can be used to configure the version of Hive that is used to retrieve metadata: | ||
|
||
<table class="table"> | ||
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr> | ||
|
@@ -1685,7 +1700,7 @@ version specified by users. An isolated classloader is used here to avoid depend | |
<td><code>0.13.1</code></td> | ||
<td> | ||
Version of the Hive metastore. Available | ||
options are <code>0.12.0</code> and <code>0.13.1</code>. Support for more versions is coming in the future. | ||
options are <code>0.12.0</code> through <code>1.2.1</code>. | ||
</td> | ||
</tr> | ||
<tr> | ||
|
@@ -1696,12 +1711,16 @@ version specified by users. An isolated classloader is used here to avoid depend | |
property can be one of three options: | ||
<ol> | ||
<li><code>builtin</code></li> | ||
Use Hive 0.13.1, which is bundled with the Spark assembly jar when <code>-Phive</code> is | ||
Use Hive 1.2.1, which is bundled with the Spark assembly jar when <code>-Phive</code> is | ||
enabled. When this option is chosen, <code>spark.sql.hive.metastore.version</code> must be | ||
either <code>0.13.1</code> or not defined. | ||
either <code>1.2.1</code> or not defined. | ||
<li><code>maven</code></li> | ||
Use Hive jars of specified version downloaded from Maven repositories. | ||
<li>A classpath in the standard format for both Hive and Hadoop.</li> | ||
Use Hive jars of specified version downloaded from Maven repositories. This configuration | ||
is not generally recommended for production deployments. | ||
<li>A classpath in the standard format for the JVM. This classpath must include all of Hive | ||
It might be nice if we could say something about how the jars either need to be installed on the cluster or, on YARN, shipped with your application. |
||
and its dependencies, including the correct version of Hadoop. These jars only need to be | ||
present on the driver, but if you are running in yarn cluster mode then you must ensure | ||
they are packaged with your application.</li> | ||
</ol> | ||
</td> | ||
</tr> | ||
|
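As a hedged illustration of the options above (the property values here are examples only; in practice these settings are usually supplied in `spark-defaults.conf` or via `--conf` before the metastore client is first used):

```scala
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)
// Talk to an existing Hive 0.13.1 metastore while executing with the built-in Hive 1.2.1 classes
sqlContext.setConf("spark.sql.hive.metastore.version", "0.13.1")
sqlContext.setConf("spark.sql.hive.metastore.jars", "maven")
```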
@@ -2017,6 +2036,28 @@ options. | |
|
||
# Migration Guide | ||
|
||
## Upgrading From Spark SQL 1.4 to 1.5 | ||
|
||
- Optimized execution using manually managed memory (Tungsten) is now enabled by default, along with | ||
code generation for expression evaluation. These features can both be disabled by setting | ||
`spark.sql.tungsten.enabled` to `false`. | ||
- Parquet schema merging is no longer enabled by default. It can be re-enabled by setting | ||
`spark.sql.parquet.mergeSchema` to `true`. | ||
- Resolution of strings to columns in python now supports using dots (`.`) to qualify the column or | ||
access nested values. For example `df['table.column.nestedField']`. However, this means that if | ||
your column name contains any dots you must now escape them using backticks (e.g., ``table.`column.with.dots`.nested``). | ||
- In-memory columnar storage partition pruning is on by default. It can be disabled by setting | ||
`spark.sql.inMemoryColumnarStorage.partitionPruning` to `false`. | ||
- Unlimited precision decimal columns are no longer supported, instead Spark SQL enforces a maximum | ||
Should also mention that timestamp precision is now 1us, rather than 1ns. |
||
precision of 38. When inferring schema from `BigDecimal` objects, a precision of (38, 18) is now | ||
used. When no precision is specified in DDL then the default remains `Decimal(10, 0)`. | ||
- Timestamps are now stored at a precision of 1us, rather than 1ns. | ||
- In the `sql` dialect, floating point numbers are now parsed as decimal. HiveQL parsing remains | ||
unchanged. | ||
- The canonical names of SQL/DataFrame functions are now lower case (e.g., sum vs SUM). | ||
- It has been determined that using the DirectOutputCommitter when speculation is enabled is unsafe | ||
and thus this output committer will not be used when speculation is on, independent of configuration. | ||
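For readers who want the 1.4 behavior back, a small sketch of reverting the first two defaults listed above (configuration keys as named in the list; not part of the patch):

```scala
// Disable Tungsten-managed memory and expression code generation
sqlContext.setConf("spark.sql.tungsten.enabled", "false")

// Re-enable automatic Parquet schema merging
sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")
```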
|
||
## Upgrading from Spark SQL 1.3 to 1.4 | ||
|
||
#### DataFrame data reader/writer interface | ||
|
@@ -2038,7 +2079,8 @@ See the API docs for `SQLContext.read` ( | |
|
||
#### DataFrame.groupBy retains grouping columns | ||
|
||
Based on user feedback, we changed the default behavior of `DataFrame.groupBy().agg()` to retain the grouping columns in the resulting `DataFrame`. To keep the behavior in 1.3, set `spark.sql.retainGroupColumns` to `false`. | ||
Based on user feedback, we changed the default behavior of `DataFrame.groupBy().agg()` to retain the | ||
grouping columns in the resulting `DataFrame`. To keep the behavior in 1.3, set `spark.sql.retainGroupColumns` to `false`. | ||
|
||
<div class="codetabs"> | ||
<div data-lang="scala" markdown="1"> | ||
|
@@ -2175,7 +2217,7 @@ Python UDF registration is unchanged. | |
When using DataTypes in Python you will need to construct them (i.e. `StringType()`) instead of | ||
referencing a singleton. | ||
|
||
## Migration Guide for Shark User | ||
## Migration Guide for Shark Users | ||
|
||
### Scheduling | ||
To set a [Fair Scheduler](job-scheduling.html#fair-scheduler-pools) pool for a JDBC client session, | ||
|
@@ -2251,6 +2293,7 @@ Spark SQL supports the vast majority of Hive features, such as: | |
* User defined functions (UDF) | ||
* User defined aggregation functions (UDAF) | ||
* User defined serialization formats (SerDes) | ||
* Window functions | ||
* Joins | ||
* `JOIN` | ||
* `{LEFT|RIGHT|FULL} OUTER JOIN` | ||
|
@@ -2261,7 +2304,7 @@ Spark SQL supports the vast majority of Hive features, such as: | |
* `SELECT col FROM ( SELECT a + b AS col from t1) t2` | ||
* Sampling | ||
* Explain | ||
* Partitioned tables | ||
* Partitioned tables including dynamic partition insertion | ||
* View | ||
* All Hive DDL Functions, including: | ||
* `CREATE TABLE` | ||
|
@@ -2323,8 +2366,9 @@ releases of Spark SQL. | |
Hive can optionally merge the small files into fewer large files to avoid overflowing the HDFS | ||
metadata. Spark SQL does not support that. | ||
|
||
# Reference | ||
|
||
# Data Types | ||
## Data Types | ||
|
||
Spark SQL and DataFrames support the following data types: | ||
|
||
|
@@ -2937,3 +2981,13 @@ from pyspark.sql.types import * | |
|
||
</div> | ||
|
||
## NaN Semantics | ||
|
||
There is special handling for not-a-number (NaN) when dealing with `float` or `double` types that | ||
does not exactly match standard floating point semantics. | ||
Specifically: | ||
|
||
- NaN = NaN returns true. | ||
- In aggregations all NaN values are grouped together. | ||
- NaN is treated as a normal value in join keys. | ||
- NaN values go last when in ascending order, larger than any other numeric value. |
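A small sketch of the aggregation rule above (illustrative only; the data and column names are made up):

```scala
// With the semantics above, all NaN values fall into a single group when aggregating
val nanDF = sqlContext.createDataFrame(Seq(
  (1, Double.NaN), (2, 3.0), (3, Double.NaN)
)).toDF("id", "v")
nanDF.groupBy("v").count().show()  // the NaN group has count 2, the 3.0 group has count 1
```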
Python/R rather than Any Language?
It works in scala too though? I wasn't really sure what to do here. We could also just delete the other column?