SQL documentation generation for built-in functions
HyukjinKwon committed Jul 25, 2017
1 parent 481f079 commit b95be04
Showing 12 changed files with 239 additions and 3 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -47,6 +47,8 @@ dev/pr-deps/
dist/
docs/_site
docs/api
sql/docs
sql/site
lib_managed/
lint-r-report.log
log/
6 changes: 3 additions & 3 deletions docs/README.md
@@ -68,6 +68,6 @@ jekyll plugin to run `build/sbt unidoc` before building the site so if you haven
may take some time as it generates all of the scaladoc. The jekyll plugin also generates the
PySpark docs using [Sphinx](http://sphinx-doc.org/).

-NOTE: To skip the step of building and copying over the Scala, Python, R API docs, run `SKIP_API=1
-jekyll`. In addition, `SKIP_SCALADOC=1`, `SKIP_PYTHONDOC=1`, and `SKIP_RDOC=1` can be used to skip a single
-step of the corresponding language.
+NOTE: To skip the step of building and copying over the Scala, Python, R and SQL API docs, run `SKIP_API=1
+jekyll`. In addition, `SKIP_SCALADOC=1`, `SKIP_PYTHONDOC=1`, `SKIP_RDOC=1` and `SKIP_SQLDOC=1` can be used
+to skip a single step of the corresponding language.
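
For example, combining these flags as the note describes, a run that builds only the SQL docs might look like the following sketch (env-var prefixes, since jekyll reads them from the environment):

    # A sketch, assuming the flags behave as the note above describes:
    # build the site while skipping the Scala, Python and R API docs.
    SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll build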
1 change: 1 addition & 0 deletions docs/_layouts/global.html
@@ -86,6 +86,7 @@
<li><a href="api/java/index.html">Java</a></li>
<li><a href="api/python/index.html">Python</a></li>
<li><a href="api/R/index.html">R</a></li>
<li><a href="api/sql/index.html">SQL, Built-in Functions</a></li>
</ul>
</li>

27 changes: 27 additions & 0 deletions docs/_plugins/copy_api_dirs.rb
@@ -150,4 +150,31 @@
cp("../R/pkg/DESCRIPTION", "api")
end

if not (ENV['SKIP_SQLDOC'] == '1')
  # Build SQL API docs

  puts "Moving to project root and building API docs."
  curr_dir = pwd
  cd("..")

  puts "Running 'build/sbt clean package' from " + pwd + "; this may take a few minutes..."
  system("build/sbt clean package") || raise("SQL doc generation failed")

  puts "Moving back into docs dir."
  cd("docs")

  puts "Moving to SQL directory and building docs."
  cd("../sql")
  system("./create-docs.sh") || raise("SQL doc generation failed")

  puts "Moving back into docs dir."
  cd("../docs")

  puts "Making directory api/sql"
  mkdir_p "api/sql"

  puts "cp -r ../sql/site/. api/sql"
  cp_r("../sql/site/.", "api/sql")
end

end
1 change: 1 addition & 0 deletions docs/api.md
@@ -9,3 +9,4 @@ Here you can read API docs for Spark and its submodules.
- [Spark Java API (Javadoc)](api/java/index.html)
- [Spark Python API (Sphinx)](api/python/index.html)
- [Spark R API (Roxygen2)](api/R/index.html)
- [Spark SQL, Built-in Functions (MkDocs)](api/sql/index.html)
1 change: 1 addition & 0 deletions docs/index.md
@@ -100,6 +100,7 @@ options for deployment:
* [Spark Java API (Javadoc)](api/java/index.html)
* [Spark Python API (Sphinx)](api/python/index.html)
* [Spark R API (Roxygen2)](api/R/index.html)
* [Spark SQL, Built-in Functions (MkDocs)](api/sql/index.html)

**Deployment Guides:**

2 changes: 2 additions & 0 deletions sql/README.md
@@ -8,3 +8,5 @@ Spark SQL is broken up into four subprojects:
- Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
- Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
- HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.

Running `sql/create-docs.sh` generates SQL documentation for built-in functions under `sql/site`.
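
As a rough end-to-end sketch (assuming `mkdocs` is on the PATH; the script's relative paths suggest running it from inside `sql/`):

    # Hedged sketch: build Spark first, then generate the built-in function docs.
    cd "$SPARK_HOME"
    build/sbt clean package      # the doc script expects a built Spark
    cd sql && ./create-docs.sh   # run from sql/ so relative paths like gen-sql-markdown.py resolve
    ls site/index.html           # the generated HTML lands under sql/site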
sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala
@@ -17,9 +17,16 @@

package org.apache.spark.sql.api.python

import org.apache.spark.sql.catalyst.analysis.FunctionRegistry
import org.apache.spark.sql.catalyst.expressions.ExpressionInfo
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.DataType

private[sql] object PythonSQLUtils {
  def parseDataType(typeText: String): DataType = CatalystSqlParser.parseDataType(typeText)

  // This is needed when generating SQL documentation for built-in functions.
  def listBuiltinFunctionInfos(): Array[ExpressionInfo] = {
    FunctionRegistry.functionSet.flatMap(f => FunctionRegistry.builtin.lookupFunction(f)).toArray
  }
}
56 changes: 56 additions & 0 deletions sql/create-docs.sh
@@ -0,0 +1,56 @@
#!/bin/bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Script to create SQL API docs. This requires `mkdocs`.
# Spark also needs to be built beforehand.

# After running this script the HTML docs can be found in
# $SPARK_HOME/sql/site

set -o pipefail
set -e

FWDIR="$(cd "`dirname "${BASH_SOURCE[0]}"`"; pwd)"
SPARK_HOME="$(cd "`dirname "${BASH_SOURCE[0]}"`"/..; pwd)"
WAREHOUSE_DIR="$FWDIR/_spark-warehouse"

if ! hash python 2>/dev/null; then
  echo "Missing python in the path, skipping SQL documentation generation."
  exit 0
fi

if ! hash mkdocs 2>/dev/null; then
  echo "Missing mkdocs in the path, skipping SQL documentation generation."
  exit 0
fi

# Now create markdown file
rm -fr docs
rm -rf "$WAREHOUSE_DIR"
mkdir docs
echo "Generating markdown files for SQL documentation."
"$SPARK_HOME/bin/spark-submit" \
--driver-java-options "-Dlog4j.configuration=file:$FWDIR/log4j.properties" \
--conf spark.sql.warehouse.dir="$WAREHOUSE_DIR" \
gen-sql-markdown.py
rm -rf "$WAREHOUSE_DIR"

# Now create HTML files
echo "Generating HTML files for SQL documentation."
mkdocs build --clean
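
Since the script skips generation rather than failing when `mkdocs` is missing, a one-time setup plus a local preview could look like this sketch (`mkdocs serve` is MkDocs' standard dev server; port 8000 is its default):

    # Hypothetical setup and preview flow around create-docs.sh.
    pip install mkdocs                        # without it the script exits 0 and generates nothing
    (cd "$SPARK_HOME/sql" && ./create-docs.sh)
    (cd "$SPARK_HOME/sql" && mkdocs serve)    # serves the docs at http://127.0.0.1:8000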
96 changes: 96 additions & 0 deletions sql/gen-sql-markdown.py
@@ -0,0 +1,96 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import sys
import os
from collections import namedtuple

from pyspark.sql import SparkSession

ExpressionInfo = namedtuple("ExpressionInfo", "className usage name extended")


def _list_function_infos(spark):
    """
    Returns a list of function information via JVM. Sorts wrapped expression infos by name
    and returns them.
    """

    jinfos = spark.sparkContext._jvm \
        .org.apache.spark.sql.api.python.PythonSQLUtils.listBuiltinFunctionInfos()
    infos = []
    for jinfo in jinfos:
        name = jinfo.getName()
        usage = jinfo.getUsage()
        usage = usage.replace("_FUNC_", name) if usage is not None else usage
        extended = jinfo.getExtended()
        extended = extended.replace("_FUNC_", name) if extended is not None else extended
        infos.append(ExpressionInfo(
            className=jinfo.getClassName(),
            usage=usage,
            name=name,
            extended=extended))
    return sorted(infos, key=lambda i: i.name)


def _make_pretty_usage(usage):
    """
    Makes the usage description pretty and returns a formatted string if `usage`
    is not an empty string. Otherwise, returns None.
    """

    if usage is not None and usage.strip() != "":
        usage = "\n".join(map(lambda u: u.strip(), usage.split("\n")))
        return "%s\n\n" % usage


def _make_pretty_extended(extended):
    """
    Makes the extended description pretty and returns a formatted string if `extended`
    is not an empty string. Otherwise, returns None.
    """

    if extended is not None and extended.strip() != "":
        extended = "\n".join(map(lambda u: u.strip(), extended.split("\n")))
        return "```%s```\n\n" % extended


def generate_sql_markdown(spark, path):
    """
    Generates a markdown file after listing the function information. The output file
    is created in `path`.
    """

    with open(path, 'w') as mdfile:
        for info in _list_function_infos(spark):
            mdfile.write("### %s\n\n" % info.name)
            usage = _make_pretty_usage(info.usage)
            extended = _make_pretty_extended(info.extended)
            if usage is not None:
                mdfile.write(usage)
            if extended is not None:
                mdfile.write(extended)


if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("GenSQLDocs") \
        .getOrCreate()
    markdown_file_path = "%s/docs/index.md" % os.path.dirname(sys.argv[0])
    generate_sql_markdown(spark, markdown_file_path)
    spark.stop()
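
Each function thus becomes a `### <name>` section with `_FUNC_` replaced by the function name, and the extended text wrapped in backticks. A quick spot-check of the output (per the script, `index.md` is written next to it under `sql/docs`):

    # Spot-check the first generated markdown section.
    head -n 10 "$SPARK_HOME/sql/docs/index.md"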
24 changes: 24 additions & 0 deletions sql/log4j.properties
@@ -0,0 +1,24 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# SQL documentation generation simply accesses the JVM and gets the list of functions.
# This just suppresses INFO-level logs.
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
19 changes: 19 additions & 0 deletions sql/mkdocs.yml
@@ -0,0 +1,19 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

site_name: Spark SQL, Built-in Functions
theme: readthedocs
pages:
- 'Functions': 'index.md'
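
Note that `docs_dir` and `site_dir` are left at MkDocs' documented defaults (`docs` and `site`), which is why the markdown is generated into `sql/docs` and the HTML into `sql/site`, matching the new `.gitignore` entries and the `cp_r("../sql/site/.", "api/sql")` step in the jekyll plugin. A manual rebuild from this config is then simply:

    # Rebuild the HTML from sql/mkdocs.yml; --clean removes stale files from sql/site first.
    (cd "$SPARK_HOME/sql" && mkdocs build --clean)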
