[SPARK-21485][SQL][DOCS] Spark SQL documentation generation for built-in functions #18702
Conversation
docs/_plugins/copy_api_dirs.rb
Outdated
cd("..")

puts "Running 'build/mvn -DskipTests clean package' from " + pwd + "; this may take a few minutes..."
system("build/mvn -DskipTests clean package") || raise("SQL doc generation failed")
Somehow, I can't run `./build/sbt clean package` and `spark-submit` in my environment - see #17150 (comment). I tested this via `sbt` after manually reverting that commit in my environment. I will change this to `sbt` if anyone can confirm it was my mistake.
.gitignore
Outdated
@@ -47,6 +47,8 @@ dev/pr-deps/
dist/
docs/_site
docs/api
sql/docs/docs
My main question is: can we do this without introducing a new `sql/docs` module? Can it just live in `sql/` and generate output into the top-level `docs/` folder?
Sure, I can. I actually was wondering where I should put those. Just to double-check, would it be okay to simply put the four files (`sql/docs/mkdocs.yml`, `sql/docs/log4j.properties`, `sql/docs/gen-sql-markdown.py` and `sql/docs/create-docs.sh`) into `sql/`?
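For reference, a minimal `mkdocs.yml` of the kind being discussed might look like the following. These are hypothetical contents for illustration; the actual file in the PR may differ.

```yaml
# Hypothetical minimal MkDocs configuration; not the actual file in the PR.
site_name: Spark SQL, Built-in Functions
theme: readthedocs
```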
Test build #79832 has finished for PR 18702 at commit
Test build #79859 has finished for PR 18702 at commit
sql/gen-sql-markdown.py
Outdated
Otherwise, returns None.
"""

if extended is not None and extended.strip() != "":
@rxin, I could make this prettier with manual parsing, and I tried many things, but I just reverted them. I was thinking it would be better if we make some changes within `ExpressionDescription` and `ExpressionInfo`, and that I could do this in a follow-up. WDYT?
If that does not sound good to you, I am willing to add some parsing logic here, mainly to separate examples and arguments in the extended description.
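If manual parsing were added here, a minimal sketch might look like this. The function name and the input layout are illustrative assumptions, not Spark's actual `ExpressionInfo` format.

```python
# Hedged sketch of splitting an ExpressionInfo-style "extended" description
# into "Arguments:" and "Examples:" sections. The input format here is an
# illustrative assumption, not the real Spark layout.
import re

def split_extended(extended):
    """Split an extended description into a dict keyed by section name."""
    if extended is None or extended.strip() == "":
        return {}
    sections = {"description": []}
    current = "description"
    for line in extended.splitlines():
        header = re.match(r"\s*(Arguments|Examples):\s*$", line)
        if header:
            current = header.group(1).lower()
            sections[current] = []
        else:
            sections[current].append(line)
    return {k: "\n".join(v).strip() for k, v in sections.items()}

extended = """
    Arguments:
      str - a string expression
    Examples:
      > SELECT upper('SparkSql');
       SPARKSQL
"""
parsed = split_extended(extended)
print(sorted(parsed))
```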
Test build #79860 has finished for PR 18702 at commit
Test build #79862 has finished for PR 18702 at commit
Test build #79863 has finished for PR 18702 at commit
Test build #79864 has finished for PR 18702 at commit
@rxin, while I was doing minor cleanup, the comment got folded, but please see #18702 (comment).
Test build #79867 has finished for PR 18702 at commit
Test build #79868 has finished for PR 18702 at commit
@rxin, do you have any thoughts on this?
sql/log4j.properties
Outdated
@@ -0,0 +1,24 @@
#
Pardon, is a `log4j.properties` really needed here? What reads it?
This is used here: https://github.com/apache/spark/pull/18702/files#diff-a4b1e8e0e72fd59bd246285a34b21a45R49
I hesitated to add this one, but did so to keep the INFO logs quiet when a Spark session is initialised.
There is a similar one in R (https://github.com/apache/spark/blob/master/R/run-tests.sh#L26 and https://github.com/apache/spark/blob/master/R/log4j.properties), although in that case the logs are quite small.
```
$ sh create-docs.sh
```

Before:

```
Generating markdown files for SQL documentation.
Generating HTML files for SQL documentation.
INFO - Cleaning site directory
INFO - Building documentation to directory: /.../spark/sql/site
```

After:

```
Generating markdown files for SQL documentation.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/07/25 18:02:57 INFO SparkContext: Running Spark version 2.3.0-SNAPSHOT
17/07/25 18:02:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/25 18:02:57 INFO SparkContext: Submitted application: GenSQLDocs
17/07/25 18:02:57 INFO SecurityManager: Changing view acls to: hyukjinkwon
17/07/25 18:02:57 INFO SecurityManager: Changing modify acls to: hyukjinkwon
17/07/25 18:02:57 INFO SecurityManager: Changing view acls groups to:
17/07/25 18:02:57 INFO SecurityManager: Changing modify acls groups to:
17/07/25 18:02:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hyukjinkwon); groups with view permissions: Set(); users with modify permissions: Set(hyukjinkwon); groups with modify permissions: Set()
17/07/25 18:02:58 INFO Utils: Successfully started service 'sparkDriver' on port 62695.
17/07/25 18:02:58 INFO SparkEnv: Registering MapOutputTracker
17/07/25 18:02:58 INFO SparkEnv: Registering BlockManagerMaster
17/07/25 18:02:58 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/07/25 18:02:58 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/07/25 18:02:58 INFO DiskBlockManager: Created local directory at /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/blockmgr-13909519-0864-4aa3-82fb-eba4b9d1e527
17/07/25 18:02:58 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/07/25 18:02:58 INFO SparkEnv: Registering OutputCommitCoordinator
17/07/25 18:02:58 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/07/25 18:02:58 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.146:4040
17/07/25 18:02:58 INFO SparkContext: Added file file:/.../spark/sql/gen-sql-markdown.py at file:/.../spark/sql/gen-sql-markdown.py with timestamp 1500973378636
17/07/25 18:02:58 INFO Utils: Copying /.../spark/sql/gen-sql-markdown.py to /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-0eb99795-6a5e-4101-8ed2-c55b3ba80173/userFiles-64210993-0ed8-4c05-85bd-83d13ae22831/gen-sql-markdown.py
17/07/25 18:02:58 INFO Executor: Starting executor ID driver on host localhost
17/07/25 18:02:58 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 62696.
17/07/25 18:02:58 INFO NettyBlockTransferService: Server created on 192.168.0.146:62696
17/07/25 18:02:58 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/07/25 18:02:58 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.0.146, 62696, None)
17/07/25 18:02:58 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.146:62696 with 366.3 MB RAM, BlockManagerId(driver, 192.168.0.146, 62696, None)
17/07/25 18:02:58 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.0.146, 62696, None)
17/07/25 18:02:58 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.0.146, 62696, None)
17/07/25 18:02:59 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('/.../spark/sql/_spark-warehouse').
17/07/25 18:02:59 INFO SharedState: Warehouse path is '/.../spark/sql/_spark-warehouse'.
17/07/25 18:02:59 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
17/07/25 18:02:59 INFO SparkUI: Stopped Spark web UI at http://192.168.0.146:4040
17/07/25 18:02:59 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/07/25 18:02:59 INFO MemoryStore: MemoryStore cleared
17/07/25 18:02:59 INFO BlockManager: BlockManager stopped
17/07/25 18:02:59 INFO BlockManagerMaster: BlockManagerMaster stopped
17/07/25 18:02:59 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/07/25 18:02:59 INFO SparkContext: Successfully stopped SparkContext
17/07/25 18:02:59 INFO ShutdownHookManager: Shutdown hook called
17/07/25 18:02:59 INFO ShutdownHookManager: Deleting directory /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-0eb99795-6a5e-4101-8ed2-c55b3ba80173
17/07/25 18:02:59 INFO ShutdownHookManager: Deleting directory /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-0eb99795-6a5e-4101-8ed2-c55b3ba80173/pyspark-4cdacd99-6ab2-4a1e-9b76-c4d6d457477e
Generating HTML files for SQL documentation.
INFO - Cleaning site directory
INFO - Building documentation to directory: /.../spark/sql/site
```
Would you prefer to get rid of this?
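For context, a `log4j.properties` of the sort under discussion would typically contain settings like the following (illustrative log4j 1.x contents, not necessarily the actual file in the PR):

```properties
# Illustrative settings to silence INFO logs during doc generation;
# not necessarily the actual contents of the file in the PR.
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```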
Oh, actually, I think I can get rid of this. Let me try.
# limitations under the License.
#

import sys
Does it complicate things to put these files in some kind of `bin` directory under `sql/`?
It should be pretty simple. Let me try.
puts "Moving to project root and building API docs."
curr_dir = pwd
cd("..")
Rather than `cd`, is it possible to generate the output directly into `docs/` somewhere? Maybe I'm missing why that's hard. It would avoid creating more temp output dirs in the `sql/` folder.
I think I misunderstood your previous comments initially. It shouldn't be hard, but the other language API docs are, as far as I know, generated in separate directories and then copied into `docs/` later. I was thinking we could extend this further in the future (e.g., syntax documentation), and it could be easier to check doc output in a separate dir (actually, I check other docs output this way more often).
I addressed the log4j comment first for now (I got rid of the log4j file by accessing the JVM directly rather than through a Spark session).
@srowen, so, the current structure is as follows:

(directory listings collapsed in the original page)

It is pretty easy to move files around in any case, and I am fine with moving them multiple times to show how it looks. So, do you mean something like the one below?

(directory listings collapsed in the original page)

Or does this maybe give you another idea?
Yeah, the latter is what I had in mind, but I don't know, is it hard? It does seem more natural to generate the output directly into its destination, I think, unless I'm overlooking something else the docs process does.
In any case, it is easy. I was just thinking of matching it up with R and Python, which, as far as I know, are laid out something like this:

R: (directory listings collapsed in the original page)

Python: (directory listings collapsed in the original page)
(sorry, I just edited the structure above to be more correct)
Oh, I see, it's for consistency. Well, OK, stay consistent with R/Python then.
Thank you @srowen. I will double-check, clean up minor nits, and ping you by tomorrow.
Test build #79931 has finished for PR 18702 at commit
@srowen, I tried to resemble, in particular, R's structure (directory listings collapsed in the original page), and I cleaned up a few nits in the last commit. I believe it is ready for another look.
Test build #79953 has finished for PR 18702 at commit
Looks good to me. Maybe you can merge this one shortly, when you're able?
I guess it will probably take about a week more for my Apache account creation (according to the doc).
I am fine with leaving it to me too :).
LGTM too. Merging into master.
@HyukjinKwon It looks like the Jenkins doc build was broken by this PR (and I can't build it locally either). Is it just that the script needs to be updated to install `mkdocs` and to run `sql/create-docs.sh`?
Oh, yes, it looks so. I will update the script to install `mkdocs`.
I will test this on a bare machine and open a PR in any event. The changes should be small.
I tested locally, and it's fine if I install `mkdocs` and run `sql/create-docs.sh`.
Nice. Thanks for trying it out!
What changes were proposed in this pull request?
This generates documentation for the Spark SQL built-in functions.
One drawback is that this requires a proper build to generate the built-in function list. Once it is built, generation only takes a few seconds via `sql/create-docs.sh`.
Please see https://spark-test.github.io/sparksqldoc/ , which I hosted to show the output documentation.
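The rendering stage of such a generator can be sketched as follows. This is a hedged illustration, not the actual `gen-sql-markdown.py`: the real script pulls function info from Spark's registry, whereas here the data is hard-coded.

```python
# Hedged sketch of rendering function-info records into a markdown page,
# in the spirit of gen-sql-markdown.py. Data is hard-coded for illustration;
# the real script reads Spark's function registry.
from collections import namedtuple

ExpressionInfo = namedtuple("ExpressionInfo", "name usage extended")

def generate_markdown(infos):
    """Render each function as a markdown section, sorted by name."""
    lines = []
    for info in sorted(infos, key=lambda i: i.name):
        lines.append("### %s" % info.name)
        lines.append("")
        lines.append(info.usage.strip())
        if info.extended and info.extended.strip():
            lines.append("")
            lines.append(info.extended.strip())
        lines.append("")
    return "\n".join(lines)

infos = [
    ExpressionInfo(
        "upper",
        "upper(str) - Returns str with all characters changed to uppercase.",
        "",
    ),
]
md = generate_markdown(infos)
print(md)
```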
There is a bit more work to be done to make the documentation pretty, for example, separating `Arguments:` and `Examples:`, but I guess this should be done within `ExpressionDescription` and `ExpressionInfo` rather than by manually parsing them. I will fix these in a follow-up.
This requires `pip install mkdocs` to generate HTML from the markdown files.
How was this patch tested?
Manually tested.