
Update demos to download data from S3 #2387

Merged: 24 commits merged into main from download-demo-files on Jun 22, 2021

Conversation

frances-h
Contributor

Update EvalML demos to pull data from S3 instead of using local data files in the package. Tests that use the demos have been updated to pass the use_local flag, which loads data from the local data directory rather than from S3.
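For orientation, the new behavior looks roughly like the following minimal sketch (the loader name and use_local flag come from this PR; return types and other details are not shown here):

# Minimal usage sketch, assuming the loaders live in evalml.demos and accept
# a use_local flag as described in this PR.
from evalml import demos

# Default after this PR: the dataset is downloaded from S3.
X, y = demos.load_fraud()

# Tests pass use_local=True to read the copy in the local data directory instead.
X_local, y_local = demos.load_fraud(use_local=True)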

@frances-h frances-h self-assigned this Jun 16, 2021
@CLAassistant

CLAassistant commented Jun 16, 2021

CLA assistant check
All committers have signed the CLA.

@codecov

codecov bot commented Jun 16, 2021

Codecov Report

Merging #2387 (150d134) into main (369c2e0) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #2387     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        283     283             
  Lines      25247   25321     +74     
=======================================
+ Hits       25147   25221     +74     
  Misses       100     100             
Impacted Files Coverage Δ
evalml/demos/breast_cancer.py 100.0% <100.0%> (ø)
evalml/demos/churn.py 100.0% <100.0%> (ø)
evalml/demos/diabetes.py 100.0% <100.0%> (ø)
evalml/demos/fraud.py 100.0% <100.0%> (ø)
evalml/demos/wine.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 99.8% <100.0%> (-<0.1%) ⬇️
evalml/tests/conftest.py 98.5% <100.0%> (+0.2%) ⬆️
evalml/tests/demo_tests/test_datasets.py 100.0% <100.0%> (ø)
.../prediction_explanations_tests/test_force_plots.py 100.0% <100.0%> (ø)
...del_understanding_tests/test_partial_dependence.py 100.0% <100.0%> (ø)
... and 3 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 369c2e0...150d134.

@gsheni gsheni requested review from freddyaboulton, chukarsten and dsherry and removed request for dsherry, freddyaboulton and chukarsten June 16, 2021 20:18
@frances-h frances-h marked this pull request as ready for review June 17, 2021 19:56
Contributor

@freddyaboulton freddyaboulton left a comment


@frances-h Thanks for this! I have questions about the change to data_files and a suggestion for creating fixtures that set use_local=True that I'd like to resolve before merge.

@@ -21,6 +21,5 @@
            'evalml = evalml.__main__:cli'
        ]
    },
    data_files=[('evalml/demos/data', ['evalml/demos/data/fraud_transactions.csv.gz', 'evalml/demos/data/churn.csv']),
Contributor

If we delete fraud_transactions.csv.gz and churn.csv from this list, won't that mean that load_fraud and load_churn will fail for users who pip install and set use_local=True?

Contributor Author

Yeah, use_local would fail unless they manually download the files. Since we're defaulting to downloading from S3, it seemed unnecessary to keep them in the package. They are small though, so keeping them in isn't a big deal if that's the way we want to go.

Also, if we don't want to include the data files but do want to allow use_local for users / make it easier to use, I think there are two other options (a sketch of option 1 follows this list):

  1. Change use_local from a boolean to a path. That way users could download a local copy and point to it instead of downloading from S3 every time.
  2. Add a save_data flag that saves the CSV to the default data directory after downloading from S3, so that use_local would work in the future.
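
A hypothetical sketch of option 1, where use_local becomes an optional path instead of a boolean (the URL, function body, and names here are placeholders, not the actual evalml implementation):

# Hypothetical illustration of option 1 only; evalml's real loader differs.
import pandas as pd

CHURN_S3_URL = "https://example.com/churn.csv"  # placeholder URL, not the real one

def load_churn(use_local=None):
    """Read from a user-supplied local path if given, otherwise download."""
    path = use_local if use_local is not None else CHURN_S3_URL
    return pd.read_csv(path)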

Contributor

I agree that what we should do depends on whether users are intended to be able to use the local datasets.

  1. If we want them to be able to use the local data, then I think the simplest thing to do is to include fraud_transactions.csv.gz and churn.csv in the package. I don't think we need to add save_data or change use_local to be a path but let me know what you think.

  2. If we want them to always go through s3, then I think what we can do is remove use_local from the load_data functions. In our tests, we'll create fixtures that load the data locally. Then we can remove the demo data from our pypi package.

I'm partial to 2 (users always go through S3) because it's consistent with what featuretools does and it lets us run the unit tests without an internet connection.

Let me know what you think!

Contributor

@frances-h Thanks for making the changes! This looks great. I thought about it this morning, and I think we'll still need to include fraud and churn in the PyPI package. The reason is that we run our unit tests when we build the conda package for a new release, and the conda build happens from the tar.gz on PyPI.

Once that small change is in, feel free to merge! Thank you 😄

FYI @dsherry @chukarsten

opener.addheaders = [("Testing", "True")]
urllib.request.install_opener(opener)


def test_fraud():
    X, y = demos.load_fraud()
Contributor

The purpose of keeping this line is to make sure the API requests work?

Contributor Author

Yep, and to catch any issues if for any reason the local and S3 versions ever get out of sync.

Contributor

So this unit test will fail if there's no network connection? Are there others which fail under that scenario?

Contributor Author

Yeah, these will fail if there's no network connection, but they're the only ones that will. I can pull them out into a separate test file so it's clearer that they're a separate case. It might also be possible to conditionally xfail if the tests can't reach S3, so they still work offline without causing failures.

Contributor Author

@dsherry I split out checking that the local and downloaded versions match and marked them to be skipped if the update server is offline. The tests won't fail if you run them offline, and they should still catch if something's wrong with the individual S3 links. Let me know what you think?
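
A rough sketch of what one of those split-out comparison tests could look like (the test name and assertion are assumed; the skip_offline marker is registered later in this PR):

# Sketch only: compares the S3 download against the bundled local copy and is
# skipped when the download server cannot be reached.
import pytest
from evalml import demos

@pytest.mark.skip_offline
def test_fraud_s3_matches_local():
    X_s3, y_s3 = demos.load_fraud()
    X_local, y_local = demos.load_fraud(use_local=True)
    # Lightweight consistency check; the real tests may compare full contents.
    assert X_s3.shape == X_local.shape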

@@ -269,6 +269,7 @@ def setup(app):
    p = Path("/home/docs/.ipython/profile_default/startup")
    p.mkdir(parents=True, exist_ok=True)
    shutil.copy("disable-warnings.py", "/home/docs/.ipython/profile_default/startup/")
    shutil.copy("set-headers.py", "/home/docs/.ipython/profile_default/startup")
Contributor

Good catch. Would not have thought to do this!
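
For context, set-headers.py presumably installs a urllib opener that tags doc-build requests with a testing header, mirroring the test snippet earlier in this thread; a guess at its contents:

# Guessed contents of set-headers.py, inferred from the test snippet above;
# the actual file is not shown in this PR.
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [("Testing", "True")]
urllib.request.install_opener(opener)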

if "bool" in target_type:
y = y.map({"malignant": False, "benign": True})
elif automl_type == ProblemTypes.MULTICLASS:
if "bool" in target_type:
pytest.skip(
"Skipping test where problem type is multiclass but target type is boolean"
)
X, y = load_wine()
X, y = load_wine(use_local=True)
Contributor

I think it may be best to just create wine_data, breast_cancer_data fixtures that set use_local=True. What do you think?

Contributor Author

Yeah that makes sense 👍
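
A sketch of the suggested fixtures (names taken from the comment above; the exact conftest.py code in the PR may differ):

# Sketch of wine_data / breast_cancer_data fixtures that always load locally.
import pytest
from evalml.demos import load_breast_cancer, load_wine

@pytest.fixture
def wine_data():
    return load_wine(use_local=True)

@pytest.fixture
def breast_cancer_data():
    return load_breast_cancer(use_local=True)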

@chukarsten chukarsten force-pushed the download-demo-files branch 2 times, most recently from 07b1f16 to 42e7a97, on June 20, 2021 18:40
Contributor

@dsherry dsherry left a comment


@frances-h this is super cool! Thanks for tackling it.

After this PR, can we still run all of our unit tests without a network connection and have them all pass? I'd like us to preserve this property. I see you and @freddyaboulton have already had some discussion about this.

If we are going to have a unit test or two which tests that the downloaded data matches the local data, which seems like a helpful thing to me, let's do what we can to make it clear why these tests will fail when there's no network connection. A couple ideas: add a note, move them to a separate file.


def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "skip_offline: mark test to be skipped if offline (https://api.featurelabs.com cannot be reached)",
    )
Contributor

Whoa neat 👍 this is cool
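
One plausible way such a marker can be enforced from conftest.py is a collection hook that probes the server once and skips marked tests when it is unreachable (a sketch under that assumption, not necessarily the hook this PR uses):

# Sketch of enforcing the skip_offline marker; details assumed.
import urllib.error
import urllib.request

import pytest

def pytest_collection_modifyitems(config, items):
    # Probe the download server once per test session.
    try:
        urllib.request.urlopen("https://api.featurelabs.com", timeout=5)
        offline = False
    except (urllib.error.URLError, OSError):
        offline = True
    if offline:
        skip_marker = pytest.mark.skip(reason="https://api.featurelabs.com cannot be reached")
        for item in items:
            if "skip_offline" in item.keywords:
                item.add_marker(skip_marker)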

@chukarsten chukarsten merged commit 96abe4a into main Jun 22, 2021
@chukarsten chukarsten mentioned this pull request Jun 22, 2021
@frances-h frances-h deleted the download-demo-files branch June 23, 2021 13:51