
Update and overhaul documentation #937

Merged
merged 37 commits into from
Jul 16, 2020

Conversation

Contributor

@dsherry dsherry commented Jul 16, 2020

Fixes #861; also fixes #713 and #630.

View the docs changes here.

Changes:

  • Adds new sphinx style from pandas
  • Updates the docs outline: adds Install and Start guides as top-level items, a Tutorials section to grow over time, a User Guide to hold examples of our key APIs and features, and a Development section on how to contribute.
  • Updates user guide content: adds a Model Understanding section and reworks the automl/objectives/components/pipelines sections.
  • Renames the changelog to release notes and updates the CircleCI job (will cause breaking changes).

What's missing:

  • Clean up the home page, make sure we're on board with the wording
  • Update / clean up the "Start" section to make it as high-impact as possible
  • Many more tutorials! Examples of using featuretools.
  • Clean up API ref style under new template. Release names aren't bolded anymore.
  • There are plenty of automl features, data check features, etc. which I didn't have time to fit into this PR, but which we should add back in the appropriate places. Perhaps the User Guide section needs two toctree levels instead of one flat list.
  • I like the cards and icons on the pandas homepage. We should add something similar for a good first impression.
  • Consider making the toctrees hidden on most if not all pages.


codecov bot commented Jul 16, 2020

Codecov Report

Merging #937 into main will increase coverage by 0.20%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##             main     #937      +/-   ##
==========================================
+ Coverage   99.65%   99.86%   +0.20%     
==========================================
  Files         170      170              
  Lines        8593     8593              
==========================================
+ Hits         8563     8581      +18     
+ Misses         30       12      -18     
Impacted Files Coverage Δ
.../automl_tests/test_automl_search_classification.py 100.00% <0.00%> (+0.45%) ⬆️
evalml/automl/automl_search.py 99.26% <0.00%> (+0.48%) ⬆️
evalml/tests/component_tests/test_components.py 100.00% <0.00%> (+0.61%) ⬆️
evalml/tests/pipeline_tests/test_pipelines.py 100.00% <0.00%> (+0.92%) ⬆️
...ests/automl_tests/test_automl_search_regression.py 100.00% <0.00%> (+1.07%) ⬆️
evalml/utils/gen_utils.py 100.00% <0.00%> (+2.63%) ⬆️
evalml/tests/component_tests/test_utils.py 100.00% <0.00%> (+3.57%) ⬆️
evalml/tests/utils_tests/test_dependencies.py 100.00% <0.00%> (+12.50%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 07d5bea...dd9e5d4. Read the comment docs.

@@ -9,7 +9,7 @@
"EvalML is available for Python 3.6+. It can be installed by running the following command:\n",
"\n",
"```bash\n",
" pip install evaml --extra-index-url https://install.featurelabs.com/<license>/\n",
"pip install evaml\n",
Contributor Author

🎉

Collaborator

side note: has our release documents/process been updated to reflect pushing to pip?

Collaborator

also should be evalml!

Contributor Author

Not yet! Good point. I'd like to do that once we've done our first pip release. I'll file something for that now.

@dsherry dsherry marked this pull request as ready for review July 16, 2020 14:29
@angela97lin
Contributor

Looks so much better!! Excited to look through this, but at a quick glance I noticed the pagination links are a little funky?

(screenshot: pagination links)

Minor, and I'm assuming this happens because it's the last link in the previous section, but a little confusing 🤔

@ctduffy
Contributor

ctduffy commented Jul 16, 2020

I have noticed that at the bottom of a few of the pages there seems to be an extra code cell, as seen here. Looks super clean overall though!!
(screenshot: extra code cell at the bottom of a page)

@ctduffy
Contributor

ctduffy commented Jul 16, 2020

Also, this is a more general comment not specifically related to this PR, but when we begin open sourcing the repo, are we considering adding a "Thanks to the following people for contributing to this release: ..." to the changelog similar to the featuretools changelog?

@dsherry
Contributor Author

dsherry commented Jul 16, 2020

@ctduffy that's a great idea! Yes, once we open up the repo to outside contributors, adding that sort of info to the release notes would be good. GitHub shows who's contributed and when, so that information will already be available in another form at least.

Collaborator

@jeremyliweishih jeremyliweishih left a comment

Overall this is looking awesome and I'm super excited for it! I left comments below that are either typo/grammatical fixes or style comments. Feel free to ignore the style comments based on your own judgement.

There are a couple more issues:

  • favicon: we need to update the favicon to the new evalml logo (or at least the old version still shows up for me). If we need to fix this, it would be great to include in this PR.
  • inheritance graph: since we changed the templating of our API reference, the inheritance graph looks a little short now. We should file a new issue to fix this.

This is a lot of comments at once so let me know if you want to hop on a call to go through them!

@@ -9,7 +9,7 @@
"EvalML is available for Python 3.6+. It can be installed by running the following command:\n",
"\n",
"```bash\n",
" pip install evaml --extra-index-url https://install.featurelabs.com/<license>/\n",
"pip install evaml\n",
Collaborator

also should be evalml!

docs/source/user_guide/automl.ipynb (outdated, resolved)
docs/source/user_guide/automl.ipynb (outdated, resolved)
docs/source/user_guide/automl.ipynb (outdated, resolved)
"\n",
"There are [a variety](https://en.wikipedia.org/wiki/Machine_learning#Approaches) of ML problem types. Supervised learning describes the case where the collected data contains an output quantity to be modeled and a set of inputs with which to train the model. EvalML focuses on training models for the supervised learning case.\n",
"\n",
"EvalML supports three common supervised ML problem types. The first is regression, where the target quantity to model is a continuous numeric value. Next are binary and multiclass classification, where the target quantity to model consists of two or more discrete values or categories. The choice of which supervised ML problem type is most appropriate depends on domain expertise and on how the model will be evaluated and used.\n",
Collaborator

Although it's implied, I think it might be nice to explain that binary = 2 classes and multiclass = 2+.

Collaborator

Not sure if this is helpful, but I'm more used to reading supervised ML problem types as split between regression and classification (with classification then split between binary and multiclass). I like how it says 3 problem types, since that coincides with evalml's problem types, though. The former might be easier for an ML newbie to understand, but I don't think this comment is important to consider if the target audience is ML engineers or data scientists etc.
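The three-way split the docs describe can be illustrated with a tiny stdlib-only helper. This is a hypothetical sketch, not EvalML code, and `infer_problem_type` is an invented name; it just shows the distinction: a continuous numeric target means regression, two discrete classes mean binary, and more than two mean multiclass.

```python
# Hypothetical helper (not part of EvalML) illustrating the three supervised
# problem types: regression for a continuous numeric target, binary for two
# discrete classes, multiclass for three or more.
def infer_problem_type(y):
    """Guess a supervised problem type from a list of target values."""
    unique = set(y)
    if any(isinstance(v, float) for v in unique):
        return "regression"  # continuous numeric target
    return "binary" if len(unique) == 2 else "multiclass"

infer_problem_type([0, 1, 1, 0])     # "binary"
infer_problem_type([1.5, 0.2, 3.7])  # "regression"
infer_problem_type(["a", "b", "c"])  # "multiclass"
```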

"\n",
"Estimator classes each define a `model_family` attribute indicating what type of model is used.\n",
"\n",
"Here's an example of using the [`LogisticRegressionClassifier`](../generated/evalml.pipelines.components.LogisticRegressionClassifier.ipynb) estimator to fit and predict on a simple dataset:"
Collaborator

This link is broken on the actual website.

Contributor Author

Thanks. I think I fixed it. It needed to be a ref to the rst file, I think?

docs/source/user_guide/pipelines.ipynb (outdated, resolved)
docs/source/user_guide/pipelines.ipynb (resolved)
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Get Pipeline\n",
"We can get the object of any pipeline via their `id` as well:"
"## Feature Importance\n",
Collaborator

We can add how we calculate such importances (maybe in a new issue).

@@ -137,7 +103,7 @@
"source": [
"## Precision-Recall Curve\n",
Collaborator

This particular dataset makes the curves not very useful. We could file an issue to use another dataset for the curves below.
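For readers unfamiliar with the quantities behind a precision-recall curve, here is a stdlib-only sketch (not EvalML code; `precision_recall` is an invented helper) that computes precision = TP / (TP + FP) and recall = TP / (TP + FN) at a given probability threshold; sweeping the threshold traces out the curve.

```python
# Toy sketch of one point on a precision-recall curve.
def precision_recall(y_true, y_scores, threshold):
    preds = [1 if s >= threshold else 0 for s in y_scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention when no positives predicted
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 0, 1, 1, 0]
y_scores = [0.9, 0.8, 0.7, 0.3, 0.2]
# Lowering the threshold trades precision for recall:
precision_recall(y_true, y_scores, 0.1)  # (0.6, 1.0)
precision_recall(y_true, y_scores, 0.5)  # (2/3, 2/3)
```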

@kmax12
Contributor

kmax12 commented Jul 16, 2020

A page on how to get help would be good to add, something like: https://docs.featuretools.com/en/stable/help.html. Not blocking though.

"cell_type": "markdown",
"metadata": {},
"source": [
"First, we load in the features and outcomes we want to use to train our model."
Contributor

I think it might be helpful here to describe what form this might take, e.g. specify whether this should be two dataframes, two arrays, or something else. I think this is a good addition, as it might not be specifically mentioned elsewhere in the documentation as X and y together.
Also, it might be useful to change the wording from outcomes to target? Not sure exactly, but I feel like target is more commonly used to refer to the y value in ML.

@dsherry
Contributor Author

dsherry commented Jul 16, 2020

docs/source/start.ipynb (outdated, resolved)
Contributor

@freddyaboulton freddyaboulton left a comment

@dsherry The new theme looks great! These changes look good to me but the links to classes, such as LogisticRegressionClassifier are still broken for me (even after you changed the link to point to the .rst file). I think it'd be nice to fix that before merging.

I also left some comments that can be addressed in other issues. One thing I have in mind is making sure all classes wrapped in backticks link to their source. My instinct is to expect a link and go to the source when a new class or concept is introduced.

"[What is EvalML](self)\n",
"\n",
"[Installation](install)"
"[Start](start)"
Contributor

Not blocking: I prefer the old "Getting Started" title

Contributor Author

Got it. Let's circle back on this. I liked having both Install and Start as top-level items, but I could be persuaded otherwise.

docs/source/start.ipynb (resolved)
"\n",
"Estimator classes each define a `model_family` attribute indicating what type of model is used.\n",
"\n",
"Here's an example of using the [`LogisticRegressionClassifier`](../generated/evalml.pipelines.components.LogisticRegressionClassifier.rst) estimator to fit and predict on a simple dataset:"
Contributor

This link is still broken for me. Also, should we make all the class names we put in backticks links? It might be a little confusing that some class names are links and some aren't.

Contributor Author

🤦 dang, I don't know why it's broken. Will try one more push but probably going to punt.

Contributor Author

maybe needs to be html

@@ -20,7 +20,7 @@
"source": [
"## Missing Data\n",
"\n",
"Missing data or rows with `NaN` values provide many challenges for machine learning pipelines. In the worst case, many algorithms simply will not run with missing data! EvalML pipelines contain imputation [components](../pipelines/components.ipynb) to ensure that doesn't happen. Imputation works by approximating missing values with existing values. However, if a column contains a high number of missing values, a large percentage of the column would be approximated by a small percentage. This could potentially create a column without useful information for machine learning pipelines. By using the `HighlyNullDataCheck()` data check, EvalML will alert you to this potential problem by returning the columns that pass the missing values threshold."
"Missing data or rows with `NaN` values provide many challenges for machine learning pipelines. In the worst case, many algorithms simply will not run with missing data! EvalML pipelines contain imputation [components](../user_guide/components.ipynb) to ensure that doesn't happen. Imputation works by approximating missing values with existing values. However, if a column contains a high number of missing values, a large percentage of the column would be approximated by a small percentage. This could potentially create a column without useful information for machine learning pipelines. By using the `HighlyNullDataCheck()` data check, EvalML will alert you to this potential problem by returning the columns that pass the missing values threshold."
Contributor

Not-blocking: Make HighlyNullDataCheck a link to the source?
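The idea the diff above describes (flagging columns whose missing-value fraction crosses a threshold, since imputing most of a column from a small remainder yields little signal) can be sketched in plain Python. This is an assumed illustration of the behavior, not EvalML's `HighlyNullDataCheck` source, and `highly_null_columns` is an invented name.

```python
# Stdlib sketch of a highly-null data check: flag columns whose fraction of
# missing values (None) exceeds a threshold.
def highly_null_columns(columns, threshold=0.95):
    """columns: dict mapping column name -> list of values (None = missing)."""
    flagged = []
    for name, values in columns.items():
        null_fraction = sum(v is None for v in values) / len(values)
        if null_fraction > threshold:
            flagged.append(name)
    return flagged

data = {"mostly_null": [None, None, None, 1], "ok": [1, 2, 3, 4]}
highly_null_columns(data, threshold=0.5)  # ["mostly_null"]
```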

docs/source/user_guide/pipelines.ipynb (resolved)
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature Importance\n",
"## Permutation Importance\n",
Contributor

Not-blocking: I think it'd be nice to add a few sentences interpreting the plots for users who are not familiar with these techniques.
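As one such interpretation aid, the general permutation-importance technique can be sketched in a few lines: shuffle one feature column, re-score the model, and report the drop from the baseline score as that feature's importance. This is a minimal stdlib sketch of the general technique, not EvalML's implementation; `permutation_importance` and `score_fn` here are invented names.

```python
import random

# Minimal sketch of permutation importance: a large score drop after
# shuffling a column means the model relied heavily on that feature.
def permutation_importance(score_fn, X, y, n_features, seed=0):
    rng = random.Random(seed)
    baseline = score_fn(X, y)
    importances = []
    for j in range(n_features):
        shuffled = [row[:] for row in X]  # copy rows so X is untouched
        column = [row[j] for row in shuffled]
        rng.shuffle(column)
        for row, v in zip(shuffled, column):
            row[j] = v
        importances.append(baseline - score_fn(shuffled, y))
    return importances

# Toy "model": predict the first feature directly; accuracy as the score.
def score_fn(X, y):
    return sum(row[0] == t for row, t in zip(X, y)) / len(y)

X = [[0, 5], [1, 5], [0, 5], [1, 5]]
y = [0, 1, 0, 1]
imps = permutation_importance(score_fn, X, y, n_features=2)
# Feature 1 is constant, so shuffling it changes nothing: imps[1] == 0.0.
```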

@@ -1,7 +1,7 @@
.. _changelog:
Contributor

let's change the file name and page name to release notes, so the url updates. currently https://evalml.featurelabs.com/en/ds_update_theme_pandas/changelog.html

docs/source/development.ipynb (outdated, resolved)
Collaborator

@jeremyliweishih jeremyliweishih left a comment

🚢 let's goo

@@ -23,7 +23,7 @@
"EvalML includes several dependencies in `requirements.txt` by default: `xgboost` and `catboost` support pipelines built around those modeling libraries, and `plotly` and `ipywidgets` support plotting functionality in automl searches. These dependencies are recommended but are not required in order to install and use EvalML. To install these additional dependencies run `pip install -r requirements.txt`.\n",
"\n",
"### Core Dependencies\n",
"If you wish to install EvalML with only the core required dependencies, include `--no-dependencies` in your EvalML pip install command, and then install all core dependencies with `pip install -r core-requirements.txt`."
"If you wish to install EvalML with only the core required dependencies, include `--no-dependencies` in your EvalML pip install command, and then install all core dependencies with `pip install -r core-requirements.txt`."
Contributor

Not necessarily related to your change, but I'm curious how a user actually installs evalml. Do they not just call pip install evalml and call it a day? Just thinking that with other pip packages, I don't call pip install -r core-requirements.txt or similar commands.

Contributor Author

@angela97lin and I just chatted about this, and I made some improvements to the install page as a result. Long-term, it's unclear whether we need to document our core-requirements.txt vs requirements.txt on a user-facing install page. Maybe we just keep that context for developers / contributors.
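For reference, the two-step core-dependencies install the diff above describes would look roughly like this, assuming the package is published as evalml and core-requirements.txt is available locally (note that pip's actual flag for skipping dependencies is --no-deps):

```shell
# Install evalml itself without pulling in any dependencies,
# then install only the core requirements from the repo checkout.
pip install evalml --no-deps
pip install -r core-requirements.txt
```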

Contributor

@angela97lin angela97lin left a comment

This is some really awesome stuff!! Thank you for doing this and adding in so much more documentation 👏 All my comments are minor / not blocking :)

docs/source/user_guide/automl.ipynb (resolved)
docs/source/user_guide/automl.ipynb (resolved)
docs/source/user_guide/components.ipynb (resolved)
docs/source/user_guide/objectives.ipynb (outdated, resolved)
docs/source/user_guide/objectives.ipynb (outdated, resolved)
@dsherry
Contributor Author

dsherry commented Jul 16, 2020

Will merge when tests are green!

@dsherry dsherry merged commit 947d814 into main Jul 16, 2020
@dsherry dsherry deleted the ds_update_theme_pandas branch July 16, 2020 20:14
@dsherry dsherry mentioned this pull request Jul 16, 2020