[SPARK-21485][SQL][DOCS] Spark SQL documentation generation for built-in functions #18702
Conversation
docs/_plugins/copy_api_dirs.rb
Outdated
cd("..")

puts "Running 'build/mvn -DskipTests clean package' from " + pwd + "; this may take a few minutes..."
system("build/mvn -DskipTests clean package") || raise("SQL doc generation failed")
Somehow, I can't run `./build/sbt clean package` and `spark-submit` in my environment - see #17150 (comment). I tested this via `sbt` after manually reverting that commit in my environment. I will change this to `sbt` if anyone can confirm it was my mistake.
.gitignore
Outdated
@@ -47,6 +47,8 @@ dev/pr-deps/
dist/
docs/_site
docs/api
sql/docs/docs
My main question is: can we do this without introducing a new `sql/docs` module? Can it just live in `sql/` and generate output into the top-level `docs/` folder?
Sure, I can. I actually was wondering where I should put those. Just to double-check, would it be okay to simply put the four files (`sql/docs/mkdocs.yml`, `sql/docs/log4j.properties`, `sql/docs/gen-sql-markdown.py` and `sql/docs/create-docs.sh`) into `sql/`?
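For reference, a minimal `mkdocs.yml` of the kind being discussed might look like the following. These are hypothetical contents for illustration; the actual file in the PR may differ.

```yaml
# Hypothetical minimal MkDocs configuration; not the actual file in the PR.
site_name: Spark SQL, Built-in Functions
theme: readthedocs
```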
Test build #79832 has finished for PR 18702 at commit
Test build #79859 has finished for PR 18702 at commit
sql/gen-sql-markdown.py
Outdated
Otherwise, returns None.
"""

if extended is not None and extended.strip() != "":
@rxin, I could make this prettier with manual parsing, and I tried many things, but I just reverted them. I was thinking it would be better if we make some changes within `ExpressionDescription` and `ExpressionInfo`, and that I could do this in a follow-up. WDYT?
If that does not sound good to you, I am willing to add some parsing logic here, mainly to separate examples and arguments in the extended description.
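If manual parsing were added here, a minimal sketch might look like this. The function name and the input layout are illustrative assumptions, not Spark's actual `ExpressionInfo` format.

```python
# Hedged sketch of splitting an ExpressionInfo-style "extended" description
# into "Arguments:" and "Examples:" sections. The input format here is an
# illustrative assumption, not the real Spark layout.
import re

def split_extended(extended):
    """Split an extended description into a dict keyed by section name."""
    if extended is None or extended.strip() == "":
        return {}
    sections = {"description": []}
    current = "description"
    for line in extended.splitlines():
        header = re.match(r"\s*(Arguments|Examples):\s*$", line)
        if header:
            current = header.group(1).lower()
            sections[current] = []
        else:
            sections[current].append(line)
    return {k: "\n".join(v).strip() for k, v in sections.items()}

extended = """
    Arguments:
      str - a string expression
    Examples:
      > SELECT upper('SparkSql');
       SPARKSQL
"""
parsed = split_extended(extended)
print(sorted(parsed))
```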
Test build #79860 has finished for PR 18702 at commit
Test build #79862 has finished for PR 18702 at commit
Test build #79863 has finished for PR 18702 at commit
Test build #79864 has finished for PR 18702 at commit
@rxin, while I was doing minor cleanup, the comment got folded, but please see #18702 (comment).
Test build #79867 has finished for PR 18702 at commit
Test build #79868 has finished for PR 18702 at commit
@rxin, do you have any thoughts on this?
sql/log4j.properties
Outdated
@@ -0,0 +1,24 @@
#
Pardon, is a `log4j.properties` really needed here? What reads it?
This is used here: https://github.com/apache/spark/pull/18702/files#diff-a4b1e8e0e72fd59bd246285a34b21a45R49
I hesitated to add this one, but did so to keep the INFO logs quiet when a Spark session is initialised.
There is a similar one in R (https://github.com/apache/spark/blob/master/R/run-tests.sh#L26 and https://github.com/apache/spark/blob/master/R/log4j.properties), although in that case the logs are quite small.
```
$ sh create-docs.sh
```

Before:

```
Generating markdown files for SQL documentation.
Generating HTML files for SQL documentation.
INFO - Cleaning site directory
INFO - Building documentation to directory: /.../spark/sql/site
```

After:

```
Generating markdown files for SQL documentation.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/07/25 18:02:57 INFO SparkContext: Running Spark version 2.3.0-SNAPSHOT
17/07/25 18:02:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/25 18:02:57 INFO SparkContext: Submitted application: GenSQLDocs
17/07/25 18:02:57 INFO SecurityManager: Changing view acls to: hyukjinkwon
17/07/25 18:02:57 INFO SecurityManager: Changing modify acls to: hyukjinkwon
17/07/25 18:02:57 INFO SecurityManager: Changing view acls groups to:
17/07/25 18:02:57 INFO SecurityManager: Changing modify acls groups to:
17/07/25 18:02:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hyukjinkwon); groups with view permissions: Set(); users with modify permissions: Set(hyukjinkwon); groups with modify permissions: Set()
17/07/25 18:02:58 INFO Utils: Successfully started service 'sparkDriver' on port 62695.
17/07/25 18:02:58 INFO SparkEnv: Registering MapOutputTracker
17/07/25 18:02:58 INFO SparkEnv: Registering BlockManagerMaster
17/07/25 18:02:58 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/07/25 18:02:58 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/07/25 18:02:58 INFO DiskBlockManager: Created local directory at /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/blockmgr-13909519-0864-4aa3-82fb-eba4b9d1e527
17/07/25 18:02:58 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/07/25 18:02:58 INFO SparkEnv: Registering OutputCommitCoordinator
17/07/25 18:02:58 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/07/25 18:02:58 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.146:4040
17/07/25 18:02:58 INFO SparkContext: Added file file:/.../spark/sql/gen-sql-markdown.py at file:/.../spark/sql/gen-sql-markdown.py with timestamp 1500973378636
17/07/25 18:02:58 INFO Utils: Copying /.../spark/sql/gen-sql-markdown.py to /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-0eb99795-6a5e-4101-8ed2-c55b3ba80173/userFiles-64210993-0ed8-4c05-85bd-83d13ae22831/gen-sql-markdown.py
17/07/25 18:02:58 INFO Executor: Starting executor ID driver on host localhost
17/07/25 18:02:58 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 62696.
17/07/25 18:02:58 INFO NettyBlockTransferService: Server created on 192.168.0.146:62696
17/07/25 18:02:58 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/07/25 18:02:58 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.0.146, 62696, None)
17/07/25 18:02:58 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.146:62696 with 366.3 MB RAM, BlockManagerId(driver, 192.168.0.146, 62696, None)
17/07/25 18:02:58 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.0.146, 62696, None)
17/07/25 18:02:58 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.0.146, 62696, None)
17/07/25 18:02:59 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('/.../spark/sql/_spark-warehouse').
17/07/25 18:02:59 INFO SharedState: Warehouse path is '/.../spark/sql/_spark-warehouse'.
17/07/25 18:02:59 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
17/07/25 18:02:59 INFO SparkUI: Stopped Spark web UI at http://192.168.0.146:4040
17/07/25 18:02:59 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/07/25 18:02:59 INFO MemoryStore: MemoryStore cleared
17/07/25 18:02:59 INFO BlockManager: BlockManager stopped
17/07/25 18:02:59 INFO BlockManagerMaster: BlockManagerMaster stopped
17/07/25 18:02:59 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/07/25 18:02:59 INFO SparkContext: Successfully stopped SparkContext
17/07/25 18:02:59 INFO ShutdownHookManager: Shutdown hook called
17/07/25 18:02:59 INFO ShutdownHookManager: Deleting directory /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-0eb99795-6a5e-4101-8ed2-c55b3ba80173
17/07/25 18:02:59 INFO ShutdownHookManager: Deleting directory /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-0eb99795-6a5e-4101-8ed2-c55b3ba80173/pyspark-4cdacd99-6ab2-4a1e-9b76-c4d6d457477e
Generating HTML files for SQL documentation.
INFO - Cleaning site directory
INFO - Building documentation to directory: /.../spark/sql/site
```
Would you prefer to get rid of this?
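For context, a `log4j.properties` of the sort under discussion would typically contain settings like the following (illustrative log4j 1.x contents, not necessarily the actual file in the PR):

```properties
# Illustrative settings to silence INFO logs during doc generation;
# not necessarily the actual contents of the file in the PR.
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```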
Oh, actually, I think I can get rid of this. Let me try.
# limitations under the License.
#

import sys
Does it complicate things to put these files in some kind of `bin` directory under `sql/`?
It should be pretty simple. Let me try.
puts "Moving to project root and building API docs."
curr_dir = pwd
cd("..")
Rather than `cd`, is it possible to generate the output directly into `docs/` somewhere? Maybe I'm missing why that's hard. It would avoid creating more temp output dirs in the `sql/` folder.
I think I misunderstood your previous comments initially. It shouldn't be hard, but the other language API docs are, as far as I know, generated in separate directories and then copied into `docs/` later. I was thinking we could extend this further in the future (e.g., syntax documentation), and it could be easier to check doc output in a separate dir (actually, I check other docs output this way more often).
I addressed the log4j comment first for now (I got rid of the log4j file by accessing the JVM directly rather than through a Spark session).
@srowen, so, the current structure is as follows:

(directory listings collapsed in the original page)

It is pretty easy to move files around in any case, and I am fine with moving them multiple times to show how it looks. So, do you mean something like the one below?

(directory listings collapsed in the original page)

Or does this maybe give you another idea?
Yeah, the latter is what I had in mind, but I don't know, is it hard? It does seem more natural to generate the output directly into its destination, I think, unless I'm overlooking something else the docs process does.
In any case, it is easy. I was just thinking of matching it up with R and Python, which, as far as I know, are laid out something like this:

R: (directory listings collapsed in the original page)

Python: (directory listings collapsed in the original page)
(sorry, I just edited the structure above to be more correct)
Oh, I see, it's for consistency. Well, OK, stay consistent with R/Python then.
Thank you @srowen. I will double-check, clean up minor nits, and ping you by tomorrow.
Test build #79931 has finished for PR 18702 at commit
@srowen, I tried to resemble, in particular, R's structure (directory listings collapsed in the original page), and I cleaned up a few nits in the last commit. I believe it is ready for another look.
Test build #79953 has finished for PR 18702 at commit
Looks good to me. Maybe you can merge this one shortly, when you're able?
I guess it will probably take about a week more for my Apache account creation (according to the doc).
I am fine with leaving it to me too :).
LGTM too. Merging into master.
@HyukjinKwon It looks like the Jenkins doc build was broken by this PR (and I can't build it locally either). Is it just that the script needs to be updated to install `mkdocs` and to run `sql/create-docs.sh`?
Oh, yes, it looks so. I will update the script to install `mkdocs`.
I will test this on a bare machine and open a PR in any event. The changes should be small.
I tested locally, and it's fine if I install `mkdocs` and run `sql/create-docs.sh`.
Nice. Thanks for trying it out!
What changes were proposed in this pull request?
This generates documentation for the Spark SQL built-in functions.
One drawback is that this requires a proper build to generate the built-in function list. Once it is built, generation only takes a few seconds via `sql/create-docs.sh`.
Please see https://spark-test.github.io/sparksqldoc/ , which I hosted to show the output documentation.
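The rendering stage of such a generator can be sketched as follows. This is a hedged illustration, not the actual `gen-sql-markdown.py`: the real script pulls function info from Spark's registry, whereas here the data is hard-coded.

```python
# Hedged sketch of rendering function-info records into a markdown page,
# in the spirit of gen-sql-markdown.py. Data is hard-coded for illustration;
# the real script reads Spark's function registry.
from collections import namedtuple

ExpressionInfo = namedtuple("ExpressionInfo", "name usage extended")

def generate_markdown(infos):
    """Render each function as a markdown section, sorted by name."""
    lines = []
    for info in sorted(infos, key=lambda i: i.name):
        lines.append("### %s" % info.name)
        lines.append("")
        lines.append(info.usage.strip())
        if info.extended and info.extended.strip():
            lines.append("")
            lines.append(info.extended.strip())
        lines.append("")
    return "\n".join(lines)

infos = [
    ExpressionInfo(
        "upper",
        "upper(str) - Returns str with all characters changed to uppercase.",
        "",
    ),
]
md = generate_markdown(infos)
print(md)
```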
There is a bit more work to be done to make the documentation pretty, for example, separating `Arguments:` and `Examples:`, but I guess this should be done within `ExpressionDescription` and `ExpressionInfo` rather than by manually parsing them. I will fix these in a follow-up.
This requires `pip install mkdocs` to generate HTML from the markdown files.
How was this patch tested?
Manually tested.