Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-46764][DOCS] Reorganize script to build API docs #44791

Closed
wants to merge 4 commits into from

Conversation

nchammas
Copy link
Contributor

@nchammas nchammas commented Jan 19, 2024

What changes were proposed in this pull request?

We are abusing Jekyll's plugin system to set flags for what API docs to build. This change maintains this overall status quo but makes the following improvements:

  • Rename the plugin file to be in line with its actual purpose.
  • Organize the code into functions so it's easier to follow and understand.
  • Print section headers that are easier to find in the output when building the docs.

The behavior of the documentation build otherwise remains unchanged.

Why are the changes needed?

This should make maintaining this part of the doc building workflow easier.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

In several ways:

  • I built the docs with various skip flags set and confirmed the build succeeds. I also manually browsed through some of the Scala, Java, Python, and SQL API docs.
    • The only docs I didn't test building were the R docs, because I do not have R installed locally.
    • Here is the build output on my machine for SKIP_RDOC=1: build-docs.log.zip
    • There are some errors output when building the Java unidoc, but these are present already in master.
  • I built the docs against master and diffed the resulting _site/ directory against the one output by this branch.
    • The _site/ directories are identical except for minor differences in some general SQL function example files.
    • Here is the diff: site.diff.zip
  • I also confirmed that ./dev/change-scala-version.sh 2.13 runs successfully (though I didn't try to run the modified scripts after that).

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the DOCS label Jan 19, 2024
@nchammas
Copy link
Contributor Author

In the future, I think it would be appropriate to convert this script into Python, as Ruby is not a focus of our community. (I don't use Ruby much myself, either.) But organizing the code into clear functions is something we'd want to do regardless of the language.

print_header "Building Scala and Java API docs."
cd(SPARK_PROJECT_ROOT)

command = "build/sbt -Pkinesis-asl clean compile unidoc"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we build both the Scala docs as well as either the Python or SQL docs, we end up building Spark twice. In a follow-up improvement, I think it would make sense to build Spark just once if any of the Scala, Python, or SQL docs are requested:

./build/sbt -Pkinesis-asl -Phive clean package

Then, if Scala docs are specifically also requested, we just build the unidoc:

./build/sbt -Pkinesis-asl -Phive unidoc

This should save us around 2 minutes on the complete documentation build.

For now, I'm leaving it like this since that's the current behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also not sure we need the kinesis-asl profile here, as it seems the Kinesis docs are built with or without it. But I'd leave that as-is for now since it's not that much noise and I'm not 100% sure it's useless.

It was added in #2239.

@@ -0,0 +1,205 @@
#
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably change:

dev/change-scala-version.sh:echo "$BASEDIR/docs/_plugins/copy_api_dirs.rb"
dev/change-scala-version.sh:  sed_i '/\-Pscala-'$TO_VERSION'/!s:build/sbt:build/sbt \-Pscala\-'$TO_VERSION':' "$BASEDIR/docs/_plugins/copy_api_dirs.rb"
dev/change-scala-version.sh:  sed_i 's:build/sbt \-Pscala\-'$FROM_VERSION':build/sbt:' "$BASEDIR/docs/_plugins/copy_api_dirs.rb"
dev/change-scala-version.sh:sed_i 's/scala\-'$FROM_VERSION'/scala\-'$TO_VERSION'/' "$BASEDIR/docs/_plugins/copy_api_dirs.rb"

together ..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good catch. Will update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed and updated the PR description.

@@ -0,0 +1,205 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is the change basically refactoring?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The goal is to preserve current behavior while making the code easier to maintain.

@github-actions github-actions bot added the BUILD label Jan 22, 2024
@nchammas
Copy link
Contributor Author

I diffed the built _site/ directory across master and this branch. They are identical except for some minor differences in some generated files. I've updated the PR description accordingly with the details.

@HyukjinKwon - I think this PR is good to go. Let me know if there is anything more you'd like me to do here.

@HyukjinKwon
Copy link
Member

Merged to master.

@nchammas nchammas deleted the create-api-docs branch January 24, 2024 00:13
HyukjinKwon pushed a commit that referenced this pull request Jan 24, 2024
### What changes were proposed in this pull request?

As [suggested here][1], this change improves the documentation build so that it builds Spark at most one time, regardless of what API docs are requested in the build.

[1]: #44791 (comment)

### Why are the changes needed?

There is no need to build Spark multiple times when generating docs. In particular, building Scala and Python docs, or Scala and SQL docs, causes Spark to be built twice.

Fixing this problem saves us a couple of minutes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I built the docs as follows on `master` as well as on this branch:

```sh
time SKIP_RDOC=1 SKIP_PYTHONDOC=1 bundle exec jekyll build
```

The time results before and after this change are as follows:

```
before
------
real    6m48.815s
user    23m17.943s
sys     1m29.578s

after
-----
real    4m10.672s
user    14m10.130s
sys     1m0.773s
```

That's a savings of about 2.5 minutes.

Additionally, I diffed the generated `_site/` dir across `master` and this branch and confirmed they are essentially identical except for some general SQL examples files.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44865 from nchammas/SPARK-46825-jekyll-build-spark-once.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants