Speed up docs build #2430
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2430 +/- ##
=====================================
Coverage 99.7% 99.7%
=====================================
Files 283 283
Lines 25478 25478
=====================================
Hits 25378 25378
Misses 100 100
Continue to review full report at Codecov.
docs-requirements.txt
Outdated
Sphinx>=2.0.1,<4.0.0
nbconvert>=5.5.0
nbsphinx>=0.8.5
ipython-autotime
This is just to time the notebook cells. I will delete before merge.
html_theme_options = {
    "github_url": "https://github.com/alteryx/evalml",
    "twitter_url": "https://twitter.com/AlteryxOSS",
    "collapse_navigation": True,
This turns out to be pretty important. It shaves about 3-4 minutes from build time without any change to the docs that I can notice 🤷
https://sphinx-rtd-theme.readthedocs.io/en/stable/configuring.html#confval-collapse_navigation
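For reference, a minimal sketch of what the relevant `conf.py` block might look like with this change applied. `navigation_depth` is a related knob and its value here is an assumption, not part of this PR:

```python
# Sketch of the html_theme_options block in docs/source/conf.py.
# collapse_navigation is the option discussed in this PR; navigation_depth
# is a related, hypothetical tweak mentioned in the thread.
html_theme_options = {
    "github_url": "https://github.com/alteryx/evalml",
    "twitter_url": "https://twitter.com/AlteryxOSS",
    "collapse_navigation": True,  # don't expand the full toctree on every page
    "navigation_depth": 2,        # hypothetical: cap sidebar nesting depth
}
```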
🤯 amazing.
I wonder if the navigation depth also helped
| "metadata": {}, | ||
| "source": [ | ||
| "Because the lead scoring labels are binary, we will use `AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary')`. When we call `.search()`, the search for the best pipeline will begin. " | ||
| "Because the lead scoring labels are binary, we will use set the problem type to \"binary\". When we call `.search()`, the search for the best pipeline will begin. " |
It was bugging me that `AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary')` is not how we actually create automl in the next cell 😂
" problem_type='binary',\n",
" objective=lead_scoring_objective,\n",
" additional_objectives=['auc'],\n",
" allowed_model_families=[\"catboost\", \"random_forest\", \"linear_model\"],\n",
Seems like the only way to not fail the `lead_score >= auc` score assertion is to train catboost on the entire data. I think this is the cleanest way to do that while also cutting out some pipelines, so we don't spend about a minute on this search.
good idea!
docs/source/conf.py
Outdated
p.mkdir(parents=True, exist_ok=True)
shutil.copy("disable-warnings.py", "/home/docs/.ipython/profile_default/startup/")
shutil.copy("set-headers.py", "/home/docs/.ipython/profile_default/startup")
shutil.copy("ipython_config.py", "/home/docs/.ipython/profile_default")
Will also delete before merge.
chukarsten
left a comment
I love this @freddyaboulton , thanks for making such an impactful change so quickly.
| "source": [ | ||
| "from evalml.model_understanding.graphs import partial_dependence\n", | ||
| "partial_dependence(pipeline_binary, X, features='mean radius')" | ||
| "partial_dependence(pipeline_binary, X_holdout, features='mean radius', grid_resolution=5)" |
classic grid resolution 👍
dsherry
left a comment
| "from evalml import AutoMLSearch\n", | ||
| "automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary', objective='log loss binary')\n", | ||
| "automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary', objective='log loss binary',\n", | ||
| " max_iterations=5)\n", |
Seems fine. We'll probably have to revisit this when the new automl algorithm comes in, because `max_iterations` may no longer be supported at that point.
Agreed! Hopefully there's a "super fast" mode hehe
docs/source/demos/fraud.ipynb
Outdated
| "outputs": [], | ||
| "source": [ | ||
| "X, y = evalml.demos.load_fraud(n_rows=5000)" | ||
| "X, y = evalml.demos.load_fraud(n_rows=1000)" |
Does the fraud demo still read OK with this change? I seem to remember us changing this because the numbers in the demo were screwy when the data size was too small. But I could be wrong.
@freddyaboulton
Looks like for the fraud demo, both pipelines score evenly on fraud. AUC is actually higher on the fraud-optimized pipeline as well. Not sure if this is a change in behavior, but it could make this section a little confusing.
I think the fraud demo is currently broken, which is why @bchen1116 is reworking the fraud objective? It's possible we'll have to increase the size of the data once the fraud objective "works" again, but when we get there, we can do what we're doing now for lead scoring and limit the model families to keep the computation fast.
So I think this is the simplest way to not spend a lot of time in this notebook while we fix it. Let me know what you think!
@freddyaboulton works for me! thanks for explaining 😄
Ohhh nvm, lower scores are better for fraud! I'll limit the model families now!
Ohh, that screenshot was from stable, not latest. Looks like it is broken on latest/main hehe
| "\n", | ||
| "input_data = urlopen('https://featurelabs-static.s3.amazonaws.com/spam_text_messages_modified.csv')\n", | ||
| "data = pd.read_csv(input_data)\n", | ||
| "data = pd.read_csv(input_data)[:750]\n", |
Nice. Dang how big was this dataset?
Ah, 3k rows. Not too bad. Truncating seems good to me tho
Yea, 3k isn't too bad, but the problem is that the text featurizer takes up a lot of the run time. Perhaps when we add component caching, we can go back to the full 3k rows!
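The truncation in the hunk above is plain positional slicing on the DataFrame. A tiny self-contained sketch of the same pattern (the CSV here is a stand-in for the real S3 download):

```python
import io
import pandas as pd

# Stand-in for the spam CSV; the real notebook reads ~3k rows from S3
# and keeps only the first 750 with a positional slice, same as below.
csv_text = "Category,Message\n" + "\n".join(f"spam,msg {i}" for i in range(20))
data = pd.read_csv(io.StringIO(csv_text))[:15]  # same pattern as [:750]

print(len(data))  # 15 rows survive the slice
```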
docs/source/ipython_config.py
Outdated
@@ -0,0 +1,2 @@
c.InteractiveShellApp.extensions = ['autotime']
I'm guessing this will also be deleted before merge?
Yes!
docs/source/user_guide/automl.ipynb
Outdated
| "outputs": [], | ||
| "source": [ | ||
| "X, y = evalml.demos.load_breast_cancer()\n", | ||
| "X, y = X[:250], y[:250]\n", |
Is this necessary? You already set `n_rows` at the top.
This is a different dataset from fraud! Let me time if I can use the full breast cancer data.
Yea, I think shrinking the model families and doing one less batch are sufficient, so I'll use the full dataset!
| "X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, problem_type='binary',\n", | ||
| " test_size=0.2, random_seed=0)\n", | ||
| "\n", | ||
| "\n", |
Niiiice good call
jeremyliweishih
left a comment
Great work @freddyaboulton! I'll double-check the docs themselves, but this looks good to me.
| " problem_type='binary',\n", | ||
| " objective=lead_scoring_objective,\n", | ||
| " additional_objectives=['auc'],\n", | ||
| " allowed_model_families=[\"catboost\", \"random_forest\", \"linear_model\"],\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea!



Pull Request Description
Following the process described in this document
I think this is about a 45% reduction in total run time, so it's about twice as fast now!
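As a rough back-of-envelope check on that claim (assuming the 45% figure), a 45% cut in wall time works out to roughly a 1.8x speedup, i.e. about twice as fast:

```python
# Back-of-envelope: speedup implied by a fractional reduction in run time.
reduction = 0.45                # ~45% less total run time
speedup = 1 / (1 - reduction)   # since new_time = old_time * (1 - reduction)
print(f"{speedup:.2f}x")        # ~1.82x, i.e. roughly twice as fast
```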
It's hard to look at the internals of the sphinx process, but I think there are two main buckets:
The sphinx build time is about 8 minutes total. It takes about 5 minutes to run all of the notebooks. So the other 3 minutes are sphinx creating our api reference according to our templates.
I scaled the notebook computations down as much as possible while still being reasonable, so that 5 minutes looks hard to beat (feel free to make suggestions!). I'm not exactly sure how to speed up the api reference beyond setting `collapse_navigation` to `True`. It's not clear how we can profile that to see where the bottleneck is (if it exists). One thing that comes to mind is using the `-j` parameter to build the api ref in parallel, but RTD does not let us control that as far as I know (once again, correct me if I'm wrong!). So I think this 8 minute build time is about as low as we can go if we still want to have dynamically generated docs, which I think we do!
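For local experiments, the `-j` option mentioned above is passed straight to `sphinx-build`; `-j auto` uses all available cores. This is a sketch assuming the repo's `docs/source` layout and a local `docs/_build/html` output directory (ReadTheDocs controls its own build command, so this only helps locally):

```shell
# Parallel local docs build; sphinx-build has supported -j for parallel
# source reading/writing, and "-j auto" picks the core count automatically.
sphinx-build -b html -j auto docs/source docs/_build/html
```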
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:`123`.