Closed · wants to merge 3 commits
2 changes: 2 additions & 0 deletions .gitignore
@@ -47,6 +47,8 @@ dev/pr-deps/
dist/
docs/_site
docs/api
sql/docs
sql/site
lib_managed/
lint-r-report.log
log/
6 changes: 3 additions & 3 deletions docs/README.md
@@ -68,6 +68,6 @@ jekyll plugin to run `build/sbt unidoc` before building the site so if you haven
may take some time as it generates all of the scaladoc. The jekyll plugin also generates the
PySpark docs using [Sphinx](http://sphinx-doc.org/).

NOTE: To skip the step of building and copying over the Scala, Python, R API docs, run `SKIP_API=1
jekyll`. In addition, `SKIP_SCALADOC=1`, `SKIP_PYTHONDOC=1`, and `SKIP_RDOC=1` can be used to skip a single
step of the corresponding language.
NOTE: To skip the step of building and copying over the Scala, Python, R and SQL API docs, run `SKIP_API=1
jekyll`. In addition, `SKIP_SCALADOC=1`, `SKIP_PYTHONDOC=1`, `SKIP_RDOC=1` and `SKIP_SQLDOC=1` can be used
to skip a single step of the corresponding language.
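
For example, building the site from the `docs` directory with individual steps skipped might look like this (illustrative commands only):

    # build the site without any of the generated API docs
    SKIP_API=1 jekyll build

    # build the site but skip only the SQL built-in function docs
    SKIP_SQLDOC=1 jekyll build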
1 change: 1 addition & 0 deletions docs/_layouts/global.html
@@ -86,6 +86,7 @@
<li><a href="api/java/index.html">Java</a></li>
<li><a href="api/python/index.html">Python</a></li>
<li><a href="api/R/index.html">R</a></li>
<li><a href="api/sql/index.html">SQL, Built-in Functions</a></li>
</ul>
</li>

27 changes: 27 additions & 0 deletions docs/_plugins/copy_api_dirs.rb
@@ -150,4 +150,31 @@
cp("../R/pkg/DESCRIPTION", "api")
end

if not (ENV['SKIP_SQLDOC'] == '1')
  # Build SQL API docs

  puts "Moving to project root and building API docs."
  curr_dir = pwd
  cd("..")
Member: Rather than cd, is it possible to generate the output directly into docs/ somewhere? Maybe I'm missing why that's hard. It would avoid creating more temporary output directories in the sql folder.

Member Author: I think I misunderstood your previous comments initially. It shouldn't be hard. But the other language API docs are, to my knowledge, generated in separate directories and then copied into docs/ later. I was thinking we could extend this further in the future (e.g., syntax documentation), and it can be easier to check the doc output in a separate directory (in fact, I check the other docs' output that way more often).

puts "Running 'build/sbt clean package' from " + pwd + "; this may take a few minutes..."
system("build/sbt clean package") || raise("SQL doc generation failed")

puts "Moving back into docs dir."
cd("docs")

puts "Moving to SQL directory and building docs."
cd("../sql")
system("./create-docs.sh") || raise("SQL doc generation failed")

puts "Moving back into docs dir."
cd("../docs")

puts "Making directory api/sql"
mkdir_p "api/sql"

puts "cp -r ../sql/site/. api/sql"
cp_r("../sql/site/.", "api/sql")
end

end
1 change: 1 addition & 0 deletions docs/api.md
@@ -9,3 +9,4 @@ Here you can read API docs for Spark and its submodules.
- [Spark Java API (Javadoc)](api/java/index.html)
- [Spark Python API (Sphinx)](api/python/index.html)
- [Spark R API (Roxygen2)](api/R/index.html)
- [Spark SQL, Built-in Functions (MkDocs)](api/sql/index.html)
1 change: 1 addition & 0 deletions docs/index.md
@@ -100,6 +100,7 @@ options for deployment:
* [Spark Java API (Javadoc)](api/java/index.html)
* [Spark Python API (Sphinx)](api/python/index.html)
* [Spark R API (Roxygen2)](api/R/index.html)
* [Spark SQL, Built-in Functions (MkDocs)](api/sql/index.html)

**Deployment Guides:**

2 changes: 2 additions & 0 deletions sql/README.md
@@ -8,3 +8,5 @@ Spark SQL is broken up into four subprojects:
- Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
- Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs.
- HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.

Running `sql/create-docs.sh` generates SQL documentation for built-in functions under `sql/site`.
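
For example, assuming Spark has already been built (e.g. with `build/sbt package`) and `mkdocs` is available on the PATH (e.g. installed via `pip install mkdocs`), generating the docs might look like:

    # run from the repository root; the rendered HTML ends up in sql/site
    cd sql && ./create-docs.sh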
sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala
@@ -17,9 +17,16 @@

package org.apache.spark.sql.api.python

import org.apache.spark.sql.catalyst.analysis.FunctionRegistry
import org.apache.spark.sql.catalyst.expressions.ExpressionInfo
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.DataType

private[sql] object PythonSQLUtils {
  def parseDataType(typeText: String): DataType = CatalystSqlParser.parseDataType(typeText)

  // This is needed when generating SQL documentation for built-in functions.
  def listBuiltinFunctionInfos(): Array[ExpressionInfo] = {
    FunctionRegistry.functionSet.flatMap(f => FunctionRegistry.builtin.lookupFunction(f)).toArray
  }
}
49 changes: 49 additions & 0 deletions sql/create-docs.sh
@@ -0,0 +1,49 @@
#!/bin/bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Script to create SQL API docs. This requires `mkdocs` and a Spark build.
# After running this script, the HTML docs can be found in
# $SPARK_HOME/sql/site

set -o pipefail
set -e

FWDIR="$(cd "`dirname "${BASH_SOURCE[0]}"`"; pwd)"
SPARK_HOME="$(cd "`dirname "${BASH_SOURCE[0]}"`"/..; pwd)"

if ! hash python 2>/dev/null; then
  echo "Missing python in your path, skipping SQL documentation generation."
  exit 0
fi

if ! hash mkdocs 2>/dev/null; then
  echo "Missing mkdocs in your path, skipping SQL documentation generation."
  exit 0
fi

# Now create the markdown file
rm -fr docs
mkdir docs
echo "Generating markdown files for SQL documentation."
"$SPARK_HOME/bin/spark-submit" gen-sql-markdown.py

# Now create the HTML files
echo "Generating HTML files for SQL documentation."
mkdocs build --clean
rm -fr docs
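
For reference, the script above roughly amounts to the following steps when run from the `sql` directory (a sketch assuming Spark is built at the repository root):

    rm -rf docs && mkdir docs                  # scratch dir for the generated markdown
    ../bin/spark-submit gen-sql-markdown.py    # writes docs/index.md via the JVM gateway
    mkdocs build --clean                       # renders docs/index.md into site/
    rm -rf docs                                # drop the intermediate markdown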
91 changes: 91 additions & 0 deletions sql/gen-sql-markdown.py
@@ -0,0 +1,91 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import sys
Member: Does it complicate things to put these files in some kind of bin directory under sql?

Member Author: It should be pretty simple. Let me try.

import os
from collections import namedtuple

ExpressionInfo = namedtuple("ExpressionInfo", "className usage name extended")


def _list_function_infos(jvm):
    """
    Returns a list of function information via JVM. Sorts wrapped expression infos by name
    and returns them.
    """

    jinfos = jvm.org.apache.spark.sql.api.python.PythonSQLUtils.listBuiltinFunctionInfos()
    infos = []
    for jinfo in jinfos:
        name = jinfo.getName()
        usage = jinfo.getUsage()
        usage = usage.replace("_FUNC_", name) if usage is not None else usage
        extended = jinfo.getExtended()
        extended = extended.replace("_FUNC_", name) if extended is not None else extended
        infos.append(ExpressionInfo(
            className=jinfo.getClassName(),
            usage=usage,
            name=name,
            extended=extended))
    return sorted(infos, key=lambda i: i.name)


def _make_pretty_usage(usage):
    """
    Makes the usage description pretty and returns a formatted string.
    Otherwise, returns None.
    """

    if usage is not None and usage.strip() != "":
        usage = "\n".join(map(lambda u: u.strip(), usage.split("\n")))
        return "%s\n\n" % usage


def _make_pretty_extended(extended):
    """
    Makes the extended description pretty and returns a formatted string.
    Otherwise, returns None.
    """

    if extended is not None and extended.strip() != "":
        extended = "\n".join(map(lambda u: u.strip(), extended.split("\n")))
        return "```%s```\n\n" % extended


def generate_sql_markdown(jvm, path):
    """
    Generates a markdown file after listing the function information. The output file
    is created in `path`.
    """

    with open(path, 'w') as mdfile:
        for info in _list_function_infos(jvm):
            mdfile.write("### %s\n\n" % info.name)
            usage = _make_pretty_usage(info.usage)
            extended = _make_pretty_extended(info.extended)
            if usage is not None:
                mdfile.write(usage)
            if extended is not None:
                mdfile.write(extended)


if __name__ == "__main__":
    from pyspark.java_gateway import launch_gateway

    jvm = launch_gateway().jvm
    markdown_file_path = "%s/docs/index.md" % os.path.dirname(sys.argv[0])
    generate_sql_markdown(jvm, markdown_file_path)
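
To iterate on the generated markdown without rebuilding the HTML, the generator can also be run on its own from `$SPARK_HOME` (assuming Spark is built; the output directory must already exist, since the script only opens the file for writing):

    mkdir -p sql/docs
    bin/spark-submit sql/gen-sql-markdown.py   # writes sql/docs/index.md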
19 changes: 19 additions & 0 deletions sql/mkdocs.yml
@@ -0,0 +1,19 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

site_name: Spark SQL, Built-in Functions
theme: readthedocs
pages:
- 'Functions': 'index.md'
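
With the generated markdown still present under `sql/docs`, a live preview is also possible (an assumed workflow, not part of the build scripts): `mkdocs serve` reads this `mkdocs.yml`, renders the single 'Functions' page from `index.md`, and serves it locally.

    cd sql
    mkdocs serve   # preview at http://127.0.0.1:8000/ by default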