
Speed up docs build #2430

Merged

merged 11 commits into main from speed-up-docs-build on Jun 24, 2021

Conversation

@freddyaboulton (Contributor) commented Jun 23, 2021

Pull Request Description

Following the process described in this document

I think this is about a 45% reduction in total run time, so it's about twice as fast now!

It's hard to look at the internals of the Sphinx build process, but I think there are two main buckets:

  1. Running our notebooks and converting them to HTML
  2. Creating our API reference from our source code by importing our modules

The Sphinx build takes about 8 minutes total, and about 5 of those minutes go to running the notebooks. So the other 3 minutes are Sphinx creating our API reference according to our templates.

I scaled down the notebook computations as much as possible while still keeping them reasonable, so that 5 minutes looks hard to beat (feel free to make suggestions!). I'm not exactly sure how to speed up the API reference beyond setting collapse_navigation to True, and it's not clear how we can profile that step to see where the bottleneck is (if one exists). One thing that comes to mind is using the -j parameter to build the API reference in parallel, but Read the Docs (RTD) does not let us control that as far as I know (once again, correct me if I'm wrong!).

So I think this 8 minute build time is about as low as we can go if we still want to have dynamically generated docs, which I think we do!
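
For anyone who wants to poke at the parallel-build idea locally, here is a minimal sketch that passes Sphinx's -j flag through its Python entry point. The docs/ paths are assumptions about a typical layout, not confirmed for this repo, and note that Sphinx only parallelizes reading/writing documents, so the module-import side of the API reference may not benefit much:

```python
# Minimal sketch: equivalent to `sphinx-build -b html -j auto docs/source docs/build/html`.
# "-j auto" sizes the worker pool to the machine's CPU count.
from sphinx.cmd.build import main as sphinx_build

exit_code = sphinx_build(
    ["-b", "html", "-j", "auto", "docs/source", "docs/build/html"]
)
print(f"sphinx-build exited with {exit_code}")
```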


After creating the pull request: in order to pass the release_notes_updated check, you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding `:pr:123`.

@codecov (bot) commented Jun 23, 2021

Codecov Report

Merging #2430 (bebedd6) into main (150a1b8) will not change coverage.
The diff coverage is n/a.


@@          Coverage Diff          @@
##            main   #2430   +/-   ##
=====================================
  Coverage   99.7%   99.7%           
=====================================
  Files        283     283           
  Lines      25478   25478           
=====================================
  Hits       25378   25378           
  Misses       100     100           


@@ -3,3 +3,4 @@ pydata-sphinx-theme>=0.3.1
Sphinx>=2.0.1,<4.0.0
nbconvert>=5.5.0
nbsphinx>=0.8.5
ipython-autotime
Contributor Author:

This is just to time the notebook cells. I will delete before merge.
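
For context, here is a minimal sketch of how ipython-autotime gets wired in: once installed, the extension makes each cell report its wall time. It can be loaded per notebook with `%load_ext autotime` (per the autotime README), or globally through an IPython profile config, which is the route this PR takes:

```python
# ipython_config.py sketch: load the autotime extension for every session.
# IPython injects get_config() when executing this file; importing it from
# traitlets keeps the sketch runnable on its own.
from traitlets.config import get_config

c = get_config()
c.InteractiveShellApp.extensions = ["autotime"]  # print wall time per cell
```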

@@ -101,6 +101,8 @@
html_theme_options = {
"github_url": "https://github.com/alteryx/evalml",
"twitter_url": "https://twitter.com/AlteryxOSS",
"collapse_navigation": True,
Contributor Author:

This turns out to be pretty important. It shaves about 3-4 minutes from build time without any change to the docs that I can notice 🤷

https://sphinx-rtd-theme.readthedocs.io/en/stable/configuring.html#confval-collapse_navigation

Contributor:

🤯 amazing.

I wonder if the navigation depth also helped
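
For reference, a sketch of what those sidebar knobs look like side by side in conf.py. collapse_navigation is the option this PR actually sets; navigation_depth is only the companion knob being speculated about here, with an illustrative value (see the theme docs linked above):

```python
# conf.py sketch: sidebar options that trade navigation detail for build time.
html_theme_options = {
    "github_url": "https://github.com/alteryx/evalml",
    "twitter_url": "https://twitter.com/AlteryxOSS",
    "collapse_navigation": True,  # set by this PR: don't expand every subtree
    "navigation_depth": 2,        # hypothetical value; caps sidebar nesting
}
```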

@@ -133,7 +133,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the lead scoring labels are binary, we will use `AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary')`. When we call `.search()`, the search for the best pipeline will begin. "
"Because the lead scoring labels are binary, we will use set the problem type to \"binary\". When we call `.search()`, the search for the best pipeline will begin. "
Contributor Author:

It was bugging me that `AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary')` is not how we actually create automl in the next cell 😂

@@ -146,6 +146,7 @@
" problem_type='binary',\n",
" objective=lead_scoring_objective,\n",
" additional_objectives=['auc'],\n",
" allowed_model_families=[\"catboost\", \"random_forest\", \"linear_model\"],\n",
Contributor Author:

Seems like the only way to not fail the `lead_score >= auc` score assertion is to train CatBoost on the entire dataset. I think this is the cleanest way to do that while also cutting out some pipelines, so we don't spend about a minute on this search.

Contributor:

good idea!

@freddyaboulton freddyaboulton marked this pull request as ready for review June 23, 2021 20:47
@freddyaboulton freddyaboulton changed the title WIP: Speed up docs build Speed up docs build Jun 23, 2021
@@ -270,6 +272,7 @@ def setup(app):
p.mkdir(parents=True, exist_ok=True)
shutil.copy("disable-warnings.py", "/home/docs/.ipython/profile_default/startup/")
shutil.copy("set-headers.py", "/home/docs/.ipython/profile_default/startup")
shutil.copy("ipython_config.py", "/home/docs/.ipython/profile_default")
Contributor Author:

Will also delete before merge.

@chukarsten (Collaborator) left a comment:

I love this @freddyaboulton, thanks for making such an impactful change so quickly.

@@ -123,7 +127,7 @@
"outputs": [],
"source": [
"from evalml.model_understanding.graphs import partial_dependence\n",
"partial_dependence(pipeline_binary, X, features='mean radius')"
"partial_dependence(pipeline_binary, X_holdout, features='mean radius', grid_resolution=5)"
Collaborator:

classic grid resolution 👍

@dsherry (Contributor) left a comment:

👏 incredible

(gif: can_you_believe_this_guy)


@@ -122,7 +122,8 @@
"outputs": [],
"source": [
"from evalml import AutoMLSearch\n",
"automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary', objective='log loss binary')\n",
"automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary', objective='log loss binary',\n",
" max_iterations=5)\n",
Contributor:

Seems fine. We'll probably have to revisit this when the new automl algo comes in because max_iterations may no longer be supported at that point.

Contributor Author:

Agreed! Hopefully there's a "super fast" mode hehe

@@ -68,7 +68,7 @@
"metadata": {},
"outputs": [],
"source": [
"X, y = evalml.demos.load_fraud(n_rows=5000)"
"X, y = evalml.demos.load_fraud(n_rows=1000)"
Contributor:

Does the fraud demo still read OK with this change? I seem to remember us changing this because the numbers in the demo were screwy when the data size was too small. But I could be wrong.

Contributor:

@freddyaboulton Looks like for the fraud demo, both pipelines score evenly on fraud. AUC is actually higher on the fraud-optimized pipeline as well. Not sure if this is a change in behavior, but it could make this section a little confusing.

Contributor Author:

I think the fraud demo is currently broken, which is why @bchen1116 is reworking the fraud objective? It's possible we'll have to increase the size of the data once the fraud objective "works" again, but when we get there, we can do what we're doing now for lead scoring and limit the model families to keep the computation fast.

So I think this is the simplest way to not spend a lot of time in this notebook while we fix it. Let me know what you think!

(screenshot)

Contributor:

@freddyaboulton works for me! thanks for explaining 😄

Contributor Author:

Ohhh nvm, lower scores are better for fraud! I'll limit the model families now!

Contributor Author:

Ohh, that screenshot was from stable, not latest. Looks like it is broken on latest/main hehe

https://evalml.alteryx.com/en/latest/demos/fraud.html

(screenshot)

@@ -38,7 +38,7 @@
"import pandas as pd\n",
"\n",
"input_data = urlopen('https://featurelabs-static.s3.amazonaws.com/spam_text_messages_modified.csv')\n",
"data = pd.read_csv(input_data)\n",
"data = pd.read_csv(input_data)[:750]\n",
Contributor:

Nice. Dang how big was this dataset?

Contributor:

Ah, 3k rows. Not too bad. Truncating seems good to me tho

Contributor Author:

Yea 3k isn't too bad but the problem is that the text featurizer takes up a lot of the run time. Perhaps when we add component caching, we can go back to the full 3k rows!

@@ -0,0 +1,2 @@

c.InteractiveShellApp.extensions = ['autotime']
Contributor:

I'm guessing this will also be deleted before merge?

Contributor Author:

Yes!

@@ -607,10 +607,12 @@
"outputs": [],
"source": [
"X, y = evalml.demos.load_breast_cancer()\n",
"X, y = X[:250], y[:250]\n",
Contributor:

Is this necessary? You already set n_rows at the top

Contributor Author:

This is a different dataset from fraud! Let me time it and see if I can use the full breast cancer data.

Contributor Author:

Yea I think shrinking the model families and doing one less batch are sufficient so I'll use the full dataset!

"X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, problem_type='binary',\n",
" test_size=0.2, random_seed=0)\n",
"\n",
"\n",
Contributor:

Niiiice good call

@jeremyliweishih (Contributor) left a comment:

great work @freddyaboulton! I'll double-check the docs themselves, but this looks good to me.


@freddyaboulton freddyaboulton merged commit 6f3b6e9 into main Jun 24, 2021
@dsherry dsherry mentioned this pull request Jul 2, 2021
@freddyaboulton freddyaboulton deleted the speed-up-docs-build branch May 13, 2022 14:50