Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch order of boolean checks in infer_feature_types #2661

Merged
merged 2 commits into from
Aug 19, 2021

Conversation

freddyaboulton
Copy link
Contributor

@freddyaboulton freddyaboulton commented Aug 19, 2021

Pull Request Description

Fixes #2642

I was able to track the regression down to infer_feature_types like @chukarsten and I suspected.

Steps to identify problem:

  1. Profile this script locally for both v0.30.0 and main
import pandas as pd
from evalml.automl import AutoMLSearch
from evalml.preprocessing import split_data

X = pd.read_csv("/Users/freddy.boulton/Downloads/KDDCup09_churn.csv")
y = X.pop("CHURN")

X_train, X_validation, y_train, y_validation = split_data(X, y, problem_type="binary",test_size=0.5, random_seed=0)
automl = AutoMLSearch(X_train, y_train, "binary")
automl.search()
  1. Visualize with snakeviz. I saw that main took 5 minutes 3 seconds while v0.30.0 took 4 minutes 30 seconds. There's pretty much an exact 30 second increase in total time spent in infer_feature_types corresponding to is_pd_na

main

image

0.30.0

image

  1. We need to perform is_pd_na but the trick is to only do it when the column is Unknown, that's the only case we care about. In python, the booleans are evaluated from left to right so we'll put the is_column_unknown check first since that one is near instant. Example:

Screen Shot 2021-08-19 at 12 00 41 PM

  1. With the new order, infer_feature_types now only takes 10 seconds longer compared to 0.30.0. This is coming from running convert_all_nan_unknown_to_double and checking if the columns are Unknown.

image


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

@codecov
Copy link

codecov bot commented Aug 19, 2021

Codecov Report

Merging #2661 (0bff54c) into main (b0b657e) will not change coverage.
The diff coverage is 100.0%.

Impacted file tree graph

@@          Coverage Diff          @@
##            main   #2661   +/-   ##
=====================================
  Coverage   99.9%   99.9%           
=====================================
  Files        298     298           
  Lines      27305   27305           
=====================================
  Hits       27261   27261           
  Misses        44      44           
Impacted Files Coverage Δ
evalml/utils/woodwork_utils.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b0b657e...0bff54c. Read the comment docs.

Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for looking into this!

Copy link
Contributor

@angela97lin angela97lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really cool and as always, your writeup is super clear. LGTM! 🚢 👏

@chukarsten chukarsten merged commit 019455a into main Aug 19, 2021
@freddyaboulton freddyaboulton deleted the 2642-kdd-increase-time branch August 19, 2021 18:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Spike: Investigate increase in fit time for KDDCup dataset
3 participants