
[SPARK-21485][SQL][DOCS] Spark SQL documentation generation for built-in functions #18702

Closed
wants to merge 3 commits

Conversation

@HyukjinKwon (Member) commented Jul 21, 2017

What changes were proposed in this pull request?

This generates documentation for Spark SQL built-in functions.

One drawback is that this requires a proper build to generate the built-in function list.
Once it is built, generating the documentation via sql/create-docs.sh takes only a few seconds.

Please see https://spark-test.github.io/sparksqldoc/, which I hosted to show the output documentation.

There is a bit more work to be done to make the documentation prettier, for example, separating Arguments: and Examples:, but I think this should be done within ExpressionDescription and ExpressionInfo rather than by manually parsing the text. I will fix this in a follow-up.

This requires pip install mkdocs to generate HTML files from the markdown files.
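
For reference, MkDocs itself needs only a tiny configuration file. A minimal mkdocs.yml might look like the following (illustrative values, not necessarily the exact file in this PR):

# Minimal MkDocs configuration (illustrative values).
site_name: Spark SQL, Built-in Functions
theme: readthedocs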

How was this patch tested?

Manually tested:

cd docs
jekyll build

,

cd docs
jekyll serve

and

cd sql
create-docs.sh

@HyukjinKwon HyukjinKwon changed the title [SPARK-21485][SQL][DOCS] SQL documentation generation for built-in functions [WIP][SPARK-21485][SQL][DOCS] SQL documentation generation for built-in functions Jul 21, 2017
cd("..")

puts "Running 'build/mvn -DskipTests clean package' from " + pwd + "; this may take a few minutes..."
system("build/mvn -DskipTests clean package") || raise("SQL doc generation failed")
Member Author

Somehow, I can't run ./build/sbt clean package and spark-submit in my environment -
#17150 (comment)

I tested this via sbt after manually reverting that commit in my environment. I will change this to sbt if anyone can confirm it was my mistake.

@HyukjinKwon (Member Author)

cc @srowen and @rxin.

.gitignore Outdated
@@ -47,6 +47,8 @@ dev/pr-deps/
dist/
docs/_site
docs/api
sql/docs/docs
Member

My main question is: can we do this without introducing a new sql/docs module? Can it just live in sql/ and generate output into the top-level docs/ folder?

Member Author

Sure, I can. I was actually wondering where I should put those.

Just to double-check, would it be okay to simply put the four files (sql/docs/mkdocs.yml, sql/docs/log4j.properties, sql/docs/gen-sql-markdown.py and sql/docs/create-docs.sh) into sql/?

@SparkQA commented Jul 21, 2017

Test build #79832 has finished for PR 18702 at commit ea86529.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 22, 2017

Test build #79859 has finished for PR 18702 at commit 6175320.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-21485][SQL][DOCS] SQL documentation generation for built-in functions [SPARK-21485][SQL][DOCS] SQL documentation generation for built-in functions Jul 22, 2017
Otherwise, returns None.
"""

# Only produce extended output for expressions that actually define a
# non-empty extended description.
if extended is not None and extended.strip() != "":
Member Author

@rxin, I could make this prettier with manual parsing, and I tried many approaches, but I reverted them all. I think it would be better to make some changes within ExpressionDescription and ExpressionInfo instead, and I could do that in a follow-up. WDYT?

If that does not sound good to you, I am willing to add some parsing logic here, mainly to separate the examples and arguments in the extended description.
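
For reference, a minimal sketch of what such manual parsing could look like in gen-sql-markdown.py. The helper below is hypothetical (it is not part of this PR) and assumes the extended strings put the Arguments: and Examples: headings on their own lines:

def split_extended(extended):
    """Split an extended description into its free-form usage text and
    its 'Arguments:'/'Examples:' sections, when those headings appear.

    Hypothetical helper for illustration only.
    """
    sections = {"usage": []}
    current = "usage"
    for line in extended.split("\n"):
        heading = line.strip()
        if heading in ("Arguments:", "Examples:"):
            # Start collecting lines under the new section.
            current = heading.rstrip(":").lower()
            sections[current] = []
        else:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}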

@SparkQA commented Jul 22, 2017

Test build #79860 has finished for PR 18702 at commit d96ef25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 22, 2017

Test build #79862 has finished for PR 18702 at commit 9853029.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 22, 2017

Test build #79863 has finished for PR 18702 at commit 1e21000.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 22, 2017

Test build #79864 has finished for PR 18702 at commit 4bc93b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member Author)

@rxin, while I was making minor cleanups, the comment above got folded, so please see #18702 (comment).

@SparkQA commented Jul 22, 2017

Test build #79867 has finished for PR 18702 at commit acf2cfa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 22, 2017

Test build #79868 has finished for PR 18702 at commit 9252f97.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-21485][SQL][DOCS] SQL documentation generation for built-in functions [SPARK-21485][SQL][DOCS] Spark SQL documentation generation for built-in functions Jul 23, 2017
@HyukjinKwon (Member Author)

@rxin, do you have any thoughts on this?

@@ -0,0 +1,24 @@
#
Member

Pardon, is a log4j.properties really needed here? What reads it?

Member Author

It is used here - https://github.com/apache/spark/pull/18702/files#diff-a4b1e8e0e72fd59bd246285a34b21a45R49

I hesitated to add this one, but added it to silence the INFO logs printed when a Spark session is initialised.

There is a similar one in R - https://github.com/apache/spark/blob/master/R/run-tests.sh#L26 and https://github.com/apache/spark/blob/master/R/log4j.properties - although in that case the logs are quite small.
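
For context, the file is an ordinary log4j 1.x configuration. A minimal sketch, assuming the same keys as Spark's conf/log4j.properties.template (raising the root level so only warnings and errors reach the console during doc generation):

# Minimal sketch; keys follow Spark's conf/log4j.properties.template.
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n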

$ sh create-docs.sh

Before (with the log4j.properties):

Generating markdown files for SQL documentation.
Generating HTML files for SQL documentation.
INFO    -  Cleaning site directory
INFO    -  Building documentation to directory: /.../spark/sql/site

After (without it):

Generating markdown files for SQL documentation.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/07/25 18:02:57 INFO SparkContext: Running Spark version 2.3.0-SNAPSHOT
17/07/25 18:02:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/25 18:02:57 INFO SparkContext: Submitted application: GenSQLDocs
17/07/25 18:02:57 INFO SecurityManager: Changing view acls to: hyukjinkwon
17/07/25 18:02:57 INFO SecurityManager: Changing modify acls to: hyukjinkwon
17/07/25 18:02:57 INFO SecurityManager: Changing view acls groups to:
17/07/25 18:02:57 INFO SecurityManager: Changing modify acls groups to:
17/07/25 18:02:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hyukjinkwon); groups with view permissions: Set(); users  with modify permissions: Set(hyukjinkwon); groups with modify permissions: Set()
17/07/25 18:02:58 INFO Utils: Successfully started service 'sparkDriver' on port 62695.
17/07/25 18:02:58 INFO SparkEnv: Registering MapOutputTracker
17/07/25 18:02:58 INFO SparkEnv: Registering BlockManagerMaster
17/07/25 18:02:58 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/07/25 18:02:58 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/07/25 18:02:58 INFO DiskBlockManager: Created local directory at /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/blockmgr-13909519-0864-4aa3-82fb-eba4b9d1e527
17/07/25 18:02:58 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/07/25 18:02:58 INFO SparkEnv: Registering OutputCommitCoordinator
17/07/25 18:02:58 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/07/25 18:02:58 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.146:4040
17/07/25 18:02:58 INFO SparkContext: Added file file:/.../spark/sql/gen-sql-markdown.py at file:/.../spark/sql/gen-sql-markdown.py with timestamp 1500973378636
17/07/25 18:02:58 INFO Utils: Copying /.../spark/sql/gen-sql-markdown.py to /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-0eb99795-6a5e-4101-8ed2-c55b3ba80173/userFiles-64210993-0ed8-4c05-85bd-83d13ae22831/gen-sql-markdown.py
17/07/25 18:02:58 INFO Executor: Starting executor ID driver on host localhost
17/07/25 18:02:58 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 62696.
17/07/25 18:02:58 INFO NettyBlockTransferService: Server created on 192.168.0.146:62696
17/07/25 18:02:58 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/07/25 18:02:58 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.0.146, 62696, None)
17/07/25 18:02:58 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.146:62696 with 366.3 MB RAM, BlockManagerId(driver, 192.168.0.146, 62696, None)
17/07/25 18:02:58 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.0.146, 62696, None)
17/07/25 18:02:58 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.0.146, 62696, None)
17/07/25 18:02:59 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('/.../spark/sql/_spark-warehouse').
17/07/25 18:02:59 INFO SharedState: Warehouse path is '/.../spark/sql/_spark-warehouse'.
17/07/25 18:02:59 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
17/07/25 18:02:59 INFO SparkUI: Stopped Spark web UI at http://192.168.0.146:4040
17/07/25 18:02:59 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/07/25 18:02:59 INFO MemoryStore: MemoryStore cleared
17/07/25 18:02:59 INFO BlockManager: BlockManager stopped
17/07/25 18:02:59 INFO BlockManagerMaster: BlockManagerMaster stopped
17/07/25 18:02:59 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/07/25 18:02:59 INFO SparkContext: Successfully stopped SparkContext
17/07/25 18:02:59 INFO ShutdownHookManager: Shutdown hook called
17/07/25 18:02:59 INFO ShutdownHookManager: Deleting directory /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-0eb99795-6a5e-4101-8ed2-c55b3ba80173
17/07/25 18:02:59 INFO ShutdownHookManager: Deleting directory /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-0eb99795-6a5e-4101-8ed2-c55b3ba80173/pyspark-4cdacd99-6ab2-4a1e-9b76-c4d6d457477e
Generating HTML files for SQL documentation.
INFO    -  Cleaning site directory
INFO    -  Building documentation to directory: /.../spark/sql/site

Would you prefer to get rid of this?

Member Author

Oh, actually, I think I can get rid of this. Let me try.

# limitations under the License.
#

import sys
Member

Does it complicate things to put these files in some kind of bin directory under sql?

Member Author

It should be pretty simple. Let me try.


puts "Moving to project root and building API docs."
curr_dir = pwd
cd("..")
Member

Rather than cd, is it possible to generate the output directly into docs/ somewhere? Maybe I am missing why that's hard. It would avoid creating more temp output dirs in the sql folder.

Member Author

I think I misunderstood your previous comments initially. It shouldn't be hard.

However, the other language API docs are, to my knowledge, generated in separate directories and then copied into docs/ later. I was thinking we could extend this further in the future (e.g., syntax documentation), and it can be easier to check the doc output in a separate directory (actually, that is how I usually check the other docs output myself).

@HyukjinKwon (Member Author) commented Jul 25, 2017

I addressed the log4j comment first (I got rid of the log4j file by accessing the JVM directly rather than creating a Spark session).
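
Roughly, the JVM-only approach looks like the sketch below. pyspark.java_gateway.launch_gateway is a real helper, but the JVM-side entry point named in the last comment is illustrative only, not necessarily what gen-sql-markdown.py actually calls:

# A rough sketch, assuming pyspark is on PYTHONPATH.
from pyspark.java_gateway import launch_gateway

# launch_gateway() starts the Py4J gateway and a bare JVM without
# creating a SparkContext/SparkSession, so none of the INFO logs
# shown above are emitted.
jvm = launch_gateway().jvm

# JVM-side classes are then reachable directly, e.g. a helper that
# lists the built-in functions' ExpressionInfo objects (illustrative
# name, not necessarily the actual entry point):
# infos = jvm.org.apache.spark.sql.api.python.PythonSQLUtils.listBuiltinFunctions()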

@HyukjinKwon (Member Author) commented Jul 25, 2017

@srowen, so, the current structure is like this:

.
├── docs
└── sql
    ├── create-docs.sh      # Generates HTML files.
    ├── gen-sql-markdown.py # Generates markdown files.
    └── mkdocs.yml          # MkDocs configuration file.

After cd sql && create-docs.sh:

.
├── docs
└── sql
    ├── create-docs.sh
    ├── gen-sql-markdown.py
    ├── mkdocs.yml
    └── site                # Generated HTML files.

After cd docs && SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll build:

.
├── docs
│   └── api
│       └── sql             # Copied from ./sql/site
└── sql
    ├── create-docs.sh
    ├── gen-sql-markdown.py
    ├── mkdocs.yml
    └── site

It is pretty easy to move the files around in any case, and I am fine with trying out multiple layouts to show how they look.

So ... do you mean something like the below?

.
├── docs
└── sql
    ├── bin
    │  ├── create-docs.sh      # Generates HTML files.
    │  └── gen-sql-markdown.py # Generates markdown files.
    └── mkdocs.yml             # MkDocs configuration file.

After cd sql && create-docs.sh:

.
├── docs
│   └── api
│       └── sql              # Generated HTML files.
└── sql
    ├── bin
    │  ├── create-docs.sh
    │  └── gen-sql-markdown.py
    └── mkdocs.yml

After SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll build:

.
├── docs
│   └── api
│       └── sql              # Generated HTML files.
└── sql
    ├── bin
    │  ├── create-docs.sh
    │  └── gen-sql-markdown.py
    └── mkdocs.yml

Or, does this maybe give you another idea?

@srowen (Member) commented Jul 25, 2017

Yeah, the latter is what I had in mind, but I don't know - is it hard? It does seem more natural to generate the output directly into its destination, I think, unless I'm overlooking something else the docs process does.

@HyukjinKwon (Member Author) commented Jul 25, 2017

In any case, it is easy. I was just thinking of matching it up with R and Python, which are, to my knowledge, laid out something like the below:

R:

.
├── docs
└── R
    └── create-docs.sh      # Generates HTML files.

After cd R && create-docs.sh:

.
├── docs
└── R
    ├── create-docs.sh      
    └── pkg
        └── html           # Generated HTML files.

After cd docs && jekyll build:

.
├── docs
│   └── api
│       └── R              # Copied from ./R/pkg/html
└── R
    ├── create-docs.sh      
    └── pkg
        └── html

Python:

.
├── docs
└── python
    └── docs
        └── Makefile      # Generates HTML files.

After cd python && make html:

.
├── docs
└── python
    └── docs
        ├── Makefile 
        └── _build
            └── html       # Generated HTML files.

After cd docs && jekyll build:

.
├── docs
│   └── api
│       └── python        # Copied from ./python/docs/_build/html
└── python
    └── docs
        ├── Makefile
        └── _build
            └── html 

@HyukjinKwon (Member Author)

(sorry, I just edited the structure above to be more correct)

@srowen (Member) commented Jul 25, 2017

Oh I see, it's for consistency. Well, OK, stay consistent with R/Python then.

@HyukjinKwon (Member Author)

Thank you @srowen. I will double-check, clean up minor nits, and ping you by tomorrow.

@SparkQA commented Jul 25, 2017

Test build #79931 has finished for PR 18702 at commit c92533b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member Author)

@srowen, I made the layout resemble R's in particular:

.
├── docs
└── sql
    ├── create-docs.sh      # Generates HTML files.
    ├── gen-sql-markdown.py # Generates markdown files.
    └── mkdocs.yml          # MkDocs configuration file.

After cd sql && create-docs.sh:

.
├── docs
└── sql
    ├── create-docs.sh
    ├── gen-sql-markdown.py
    ├── mkdocs.yml
    └── site                # Generated HTML files.

After cd docs && SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll build:

.
├── docs
│   └── api
│       └── sql             # Copied from ./sql/site
└── sql
    ├── create-docs.sh
    ├── gen-sql-markdown.py
    ├── mkdocs.yml
    └── site

and I cleaned up a few nits in the last commit. I believe it is ready for another look.

@SparkQA commented Jul 26, 2017

Test build #79953 has finished for PR 18702 at commit c711ff5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a comment

Looks good to me. Maybe you can merge this one shortly when you're able?

@HyukjinKwon (Member Author)

I guess it will probably take about a week more for my Apache account to be created (according to the docs).

@HyukjinKwon (Member Author)

I am fine with leaving it to me too :).

@rxin (Contributor) commented Jul 26, 2017

LGTM too.

Merging in master.

@asfgit asfgit closed this in 60472db Jul 26, 2017
@HyukjinKwon (Member Author)

Thank you @srowen and @rxin.

@jkbradley (Member) commented Aug 17, 2017

@HyukjinKwon It looks like the Jenkins doc build was broken by this PR (and I can't build it locally either). Is it just that the script needs to be updated to install mkdocs and to run sql/create-docs.sh?

@HyukjinKwon (Member Author)

Oh, yes, it looks like it. I will update the script to run pip install mkdocs when mkdocs is missing from the path, if you are not already looking into this. Does the doc build still fail even after you install mkdocs locally?
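
A possible sketch of that check for create-docs.sh (not necessarily the actual follow-up fix; it assumes pip is on the PATH):

# Install mkdocs on the fly if it is not already available.
if ! hash mkdocs 2> /dev/null; then
  echo "Could not find mkdocs; installing it via pip..."
  pip install mkdocs
fi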

@HyukjinKwon (Member Author)

I will test this on a bare machine and open a PR in any event. The changes should be small.

@jkbradley (Member)

I tested locally, and it's fine if I install mkdocs and run sql/create-docs.sh.
Thanks for fixing it!

@HyukjinKwon (Member Author)

Nice. Thanks for trying it out!
