Update demos to download data from S3 #2387
Conversation
Codecov Report
@@           Coverage Diff            @@
##            main    #2387    +/-   ##
========================================
+ Coverage   99.7%    99.7%    +0.1%
========================================
  Files        283      283
  Lines      25247    25321      +74
========================================
+ Hits       25147    25221      +74
  Misses       100      100

Continue to review full report at Codecov.
@frances-h Thanks for this! I have a question about the change to data_files and a suggestion for creating fixtures that set use_local=True that I'd like to resolve before merge.
@@ -21,6 +21,5 @@
            'evalml = evalml.__main__:cli'
        ]
    },
    data_files=[('evalml/demos/data', ['evalml/demos/data/fraud_transactions.csv.gz', 'evalml/demos/data/churn.csv']),
If we delete fraud_transactions.csv.gz and churn.csv from this list, won't that mean that load_fraud and load_churn will fail for users who pip install and set use_local=True?
Yeah, use_local would fail unless they manually download the files. Since we're defaulting to downloading from S3, it seemed unnecessary to keep them in the package. They are small, though, so keeping them in isn't a big deal if that's the way we want to go.
Also, if we don't want to include the data files but do want to allow use_local for users / make it easier to use, I think there are two other options (sketched right after this list):

- Change use_local from a boolean to a path. That way users could download a local copy and specify its location instead of downloading from S3 every time.
- Add a save_data flag that saves the CSV after downloading from S3 to the default data directory, so that use_local would work in the future.
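To make the two options concrete, here is a rough, hypothetical sketch of what a loader could look like under either approach. The URL, default directory, and return shape are assumptions for illustration, not the actual evalml implementation:

```python
import os
import pandas as pd

# Hypothetical locations -- the real S3 URL and default data directory may differ.
FRAUD_S3_URL = "https://example-bucket.s3.amazonaws.com/fraud_transactions.csv.gz"
DEFAULT_DATA_DIR = os.path.join(os.path.dirname(__file__), "data")


def load_fraud(use_local=None, save_data=False):
    """Simplified loader sketch: returns a single DataFrame rather than X, y.

    use_local: optional path to a previously downloaded copy (option 1).
    save_data: if True, cache the S3 download under DEFAULT_DATA_DIR (option 2).
    """
    if use_local is not None:
        return pd.read_csv(use_local)

    df = pd.read_csv(FRAUD_S3_URL)  # pandas can read gzipped CSVs straight from a URL
    if save_data:
        os.makedirs(DEFAULT_DATA_DIR, exist_ok=True)
        df.to_csv(os.path.join(DEFAULT_DATA_DIR, "fraud_transactions.csv.gz"), index=False)
    return df
```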
I agree with you that what we should do depends on whether users are intended to be able to use the local datasets.

1. If we want them to be able to use the local data, then I think the simplest thing to do is to include fraud_transactions.csv.gz and churn.csv in the package. I don't think we need to add save_data or change use_local to be a path, but let me know what you think.
2. If we want them to always go through S3, then I think what we can do is remove use_local from the load_data functions. In our tests, we'll create fixtures that load the data locally (a rough sketch follows after this comment). Then we can remove the demo data from our PyPI package.

I'm partial to 2 (users always go through S3) because it's consistent with what featuretools does and it lets us run the unit tests without an internet connection.
Let me know what you think!
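As an illustration of the fixture idea, a minimal sketch could look like the following. The fixture name, local path, and target column are assumptions, not the actual evalml test code:

```python
import os

import pandas as pd
import pytest

# Assumed location of demo CSVs checked into the test tree, not the real path.
LOCAL_DATA_DIR = os.path.join(os.path.dirname(__file__), "data")


@pytest.fixture
def fraud_local():
    """Load the fraud demo data from disk so tests never hit S3."""
    df = pd.read_csv(os.path.join(LOCAL_DATA_DIR, "fraud_transactions.csv.gz"))
    X = df.drop(columns=["fraud"])  # assumed target column name
    y = df["fraud"]
    return X, y
```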
@frances-h Thanks for making the changes! This looks great. I thought about it this morning, and I think we'll still need to include fraud and churn in the PyPI package. The reason is that we run our unit tests when we build the conda package for a new release, and the conda build happens from the tar.gz on PyPI.
Once that small change is in, feel free to merge! Thank you 😄
FYI @dsherry @chukarsten
opener.addheaders = [("Testing", "True")]
urllib.request.install_opener(opener)


def test_fraud():
    X, y = demos.load_fraud()
The purpose of keeping this line is to make sure the API requests work?
Yep, and to catch any issues if for any reason the local and S3 versions ever get out of sync.
So this unit test will fail if there's no network connection? Are there others which fail under that scenario?
Yeah, these will fail if there's no network connection, but they're the only ones that will. I can pull them out into a separate test file so it's clearer that they're a separate kind of test. It might also be possible to conditionally xfail them if the tests can't reach S3, so they still work offline without causing failures.
@dsherry I split out the checks that the local and downloaded versions match and marked them to be skipped if the update server is offline. The tests won't fail if you run them offline, and they should still catch if something's wrong with the individual S3 links. Let me know what you think.
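For context, applying the skip_offline marker registered later in this PR might look roughly like this. The test name, body, and comparison details are illustrative, not the actual test code:

```python
import pandas as pd
import pytest

from evalml import demos


@pytest.mark.skip_offline  # skipped when https://api.featurelabs.com cannot be reached
def test_fraud_local_matches_s3():
    """Illustrative check that the packaged copy matches the S3 copy."""
    X_local, y_local = demos.load_fraud(use_local=True)
    X_s3, y_s3 = demos.load_fraud()
    # Depending on the loaders' return types, a conversion step may be needed here.
    pd.testing.assert_frame_equal(X_local, X_s3)
    pd.testing.assert_series_equal(y_local, y_s3)
```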
@@ -269,6 +269,7 @@ def setup(app):
    p = Path("/home/docs/.ipython/profile_default/startup")
    p.mkdir(parents=True, exist_ok=True)
    shutil.copy("disable-warnings.py", "/home/docs/.ipython/profile_default/startup/")
    shutil.copy("set-headers.py", "/home/docs/.ipython/profile_default/startup")
Good catch. Would not have thought to do this!
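For reference, a startup script like set-headers.py presumably installs a urllib opener with the testing header, along the lines of the snippet in the tests above. This is an assumed reconstruction, not the file's actual contents:

```python
# set-headers.py -- assumed contents, mirroring the test setup shown earlier
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [("Testing", "True")]
urllib.request.install_opener(opener)
```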
if "bool" in target_type: | ||
y = y.map({"malignant": False, "benign": True}) | ||
elif automl_type == ProblemTypes.MULTICLASS: | ||
if "bool" in target_type: | ||
pytest.skip( | ||
"Skipping test where problem type is multiclass but target type is boolean" | ||
) | ||
X, y = load_wine() | ||
X, y = load_wine(use_local=True) |
I think it may be best to just create wine_data and breast_cancer_data fixtures that set use_local=True. What do you think?
Yeah that makes sense 👍
@frances-h this is super cool! Thanks for tackling it.
After this PR, can we still run all of our unit tests without a network connection and have them all pass? I'd like us to preserve this property. I see you and @freddyaboulton have already had some discussion about this.
If we are going to have a unit test or two that check whether the downloaded data matches the local data, which seems like a helpful thing to me, let's do what we can to make it clear why those tests will fail when there's no network connection. A couple of ideas: add a note, or move them to a separate file.
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "skip_offline: mark test to be skipped if offline (https://api.featurelabs.com cannot be reached)",
Whoa neat 👍 this is cool
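For anyone unfamiliar with the pattern: once the marker is registered, a conftest.py hook roughly like the following could enforce it. The connectivity check, timeout, and hook details are assumptions, not necessarily what this PR does:

```python
import urllib.request

import pytest


def _is_offline(url="https://api.featurelabs.com", timeout=5):
    """Return True if the update server cannot be reached (assumed check)."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return False
    except Exception:
        return True


def pytest_collection_modifyitems(config, items):
    # Skip any test marked with @pytest.mark.skip_offline when there is no connectivity.
    if not _is_offline():
        return
    skip = pytest.mark.skip(reason="offline: https://api.featurelabs.com cannot be reached")
    for item in items:
        if "skip_offline" in item.keywords:
            item.add_marker(skip)
```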
Update EvalML demos to pull data from S3 instead of using local data files in the package. Tests that use the demos have been updated to use the use_local flag, which pulls data from the local data directory rather than S3.
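In practice, usage after this change would look something like the sketch below; the exact return types follow the existing demos API:

```python
from evalml import demos

# Default: fetch the demo dataset from S3 (requires a network connection).
X, y = demos.load_fraud()

# Opt in to the local copy instead, e.g. in unit tests.
X_local, y_local = demos.load_fraud(use_local=True)
```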