Update demos to download data from S3 #2387
Conversation
Codecov Report
@@           Coverage Diff            @@
##            main    #2387    +/-   ##
========================================
+ Coverage   99.7%    99.7%    +0.1%
========================================
  Files        283      283
  Lines      25247    25321      +74
========================================
+ Hits       25147    25221      +74
  Misses       100      100

Continue to review full report at Codecov.
@frances-h Thanks for this! I have a question about the change to data_files and a suggestion for creating fixtures that set use_local=True that I'd like to resolve before merge.
@@ -21,6 +21,5 @@
            'evalml = evalml.__main__:cli'
        ]
    },
    data_files=[('evalml/demos/data', ['evalml/demos/data/fraud_transactions.csv.gz', 'evalml/demos/data/churn.csv']),
If we delete fraud_transactions.csv.gz and churn.csv from this list, won't that mean that load_fraud and load_churn will fail for users who pip install and set use_local=True?
Yeah, use_local would fail unless they manually download the files. Since we're defaulting to downloading from S3, it seemed unnecessary to keep them in the package. They are small, though, so keeping them in isn't a big deal if that's the way we want to go.
Also, if we don't want to include the data files but do want to allow use_local for users / make it easier to use, I think there are two other options (sketched right after this list):

- Change use_local from a boolean to a path. That way users could download a local copy and specify its location instead of downloading from S3 every time.
- Add a save_data flag that saves the CSV after downloading from S3 to the default data directory, so that use_local would work in the future.
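To make the two options concrete, here is a rough, hypothetical sketch of what a loader could look like under either approach. The URL, default directory, and return shape are assumptions for illustration, not the actual evalml implementation:

```python
import os
import pandas as pd

# Hypothetical locations -- the real S3 URL and default data directory may differ.
FRAUD_S3_URL = "https://example-bucket.s3.amazonaws.com/fraud_transactions.csv.gz"
DEFAULT_DATA_DIR = os.path.join(os.path.dirname(__file__), "data")


def load_fraud(use_local=None, save_data=False):
    """Simplified loader sketch: returns a single DataFrame rather than X, y.

    use_local: optional path to a previously downloaded copy (option 1).
    save_data: if True, cache the S3 download under DEFAULT_DATA_DIR (option 2).
    """
    if use_local is not None:
        return pd.read_csv(use_local)

    df = pd.read_csv(FRAUD_S3_URL)  # pandas can read gzipped CSVs straight from a URL
    if save_data:
        os.makedirs(DEFAULT_DATA_DIR, exist_ok=True)
        df.to_csv(os.path.join(DEFAULT_DATA_DIR, "fraud_transactions.csv.gz"), index=False)
    return df
```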
I agree with you that what we should do depends on whether users are intended to be able to use the local datasets.

1. If we want them to be able to use the local data, then I think the simplest thing to do is to include fraud_transactions.csv.gz and churn.csv in the package. I don't think we need to add save_data or change use_local to be a path, but let me know what you think.
2. If we want them to always go through S3, then I think what we can do is remove use_local from the load_data functions. In our tests, we'll create fixtures that load the data locally (a rough sketch follows after this comment). Then we can remove the demo data from our PyPI package.

I'm partial to 2 (users always go through S3) because it's consistent with what featuretools does and it lets us run the unit tests without an internet connection.
Let me know what you think!
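As an illustration of the fixture idea, a minimal sketch could look like the following. The fixture name, local path, and target column are assumptions, not the actual evalml test code:

```python
import os

import pandas as pd
import pytest

# Assumed location of demo CSVs checked into the test tree, not the real path.
LOCAL_DATA_DIR = os.path.join(os.path.dirname(__file__), "data")


@pytest.fixture
def fraud_local():
    """Load the fraud demo data from disk so tests never hit S3."""
    df = pd.read_csv(os.path.join(LOCAL_DATA_DIR, "fraud_transactions.csv.gz"))
    X = df.drop(columns=["fraud"])  # assumed target column name
    y = df["fraud"]
    return X, y
```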
@frances-h Thanks for making the changes! This looks great. I thought about it this morning, and I think we'll still need to include fraud and churn in the PyPI package. The reason is that we run our unit tests when we build the conda package for a new release, and the conda build happens from the tar.gz on PyPI.
Once that small change is in, feel free to merge! Thank you 😄
FYI @dsherry @chukarsten
opener.addheaders = [("Testing", "True")]
urllib.request.install_opener(opener)


def test_fraud():
    X, y = demos.load_fraud()
The purpose of keeping this line is to make sure the API requests work?
Yep, and to catch any issues if for any reason the local and S3 versions ever get out of sync.
So this unit test will fail if there's no network connection? Are there others which fail under that scenario?
Yeah, these will fail if there's no network connection, but they're the only ones that will. I can pull them out into a separate test file so it's clearer that they're a separate kind of test. It might also be possible to conditionally xfail them if the tests can't reach S3, so they still work offline without causing failures.
@dsherry I split out the checks that the local and downloaded versions match and marked them to be skipped if the update server is offline. The tests won't fail if you run them offline, and they should still catch if something's wrong with the individual S3 links. Let me know what you think.
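For context, applying the skip_offline marker registered later in this PR might look roughly like this. The test name, body, and comparison details are illustrative, not the actual test code:

```python
import pandas as pd
import pytest

from evalml import demos


@pytest.mark.skip_offline  # skipped when https://api.featurelabs.com cannot be reached
def test_fraud_local_matches_s3():
    """Illustrative check that the packaged copy matches the S3 copy."""
    X_local, y_local = demos.load_fraud(use_local=True)
    X_s3, y_s3 = demos.load_fraud()
    # Depending on the loaders' return types, a conversion step may be needed here.
    pd.testing.assert_frame_equal(X_local, X_s3)
    pd.testing.assert_series_equal(y_local, y_s3)
```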
@@ -269,6 +269,7 @@ def setup(app):
    p = Path("/home/docs/.ipython/profile_default/startup")
    p.mkdir(parents=True, exist_ok=True)
    shutil.copy("disable-warnings.py", "/home/docs/.ipython/profile_default/startup/")
    shutil.copy("set-headers.py", "/home/docs/.ipython/profile_default/startup")
Good catch. Would not have thought to do this!
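For reference, a startup script like set-headers.py presumably installs a urllib opener with the testing header, along the lines of the snippet in the tests above. This is an assumed reconstruction, not the file's actual contents:

```python
# set-headers.py -- assumed contents, mirroring the test setup shown earlier
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [("Testing", "True")]
urllib.request.install_opener(opener)
```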
if "bool" in target_type: | ||
y = y.map({"malignant": False, "benign": True}) | ||
elif automl_type == ProblemTypes.MULTICLASS: | ||
if "bool" in target_type: | ||
pytest.skip( | ||
"Skipping test where problem type is multiclass but target type is boolean" | ||
) | ||
X, y = load_wine() | ||
X, y = load_wine(use_local=True) |
I think it may be best to just create wine_data and breast_cancer_data fixtures that set use_local=True. What do you think?
Yeah that makes sense 👍
@frances-h this is super cool! Thanks for tackling it.
After this PR, can we still run all of our unit tests without a network connection and have them all pass? I'd like us to preserve this property. I see you and @freddyaboulton have already had some discussion about this.
If we are going to have a unit test or two that check whether the downloaded data matches the local data, which seems like a helpful thing to me, let's do what we can to make it clear why those tests will fail when there's no network connection. A couple of ideas: add a note, or move them to a separate file.
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "skip_offline: mark test to be skipped if offline (https://api.featurelabs.com cannot be reached)",
Whoa neat 👍 this is cool
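For anyone unfamiliar with the pattern: once the marker is registered, a conftest.py hook roughly like the following could enforce it. The connectivity check, timeout, and hook details are assumptions, not necessarily what this PR does:

```python
import urllib.request

import pytest


def _is_offline(url="https://api.featurelabs.com", timeout=5):
    """Return True if the update server cannot be reached (assumed check)."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return False
    except Exception:
        return True


def pytest_collection_modifyitems(config, items):
    # Skip any test marked with @pytest.mark.skip_offline when there is no connectivity.
    if not _is_offline():
        return
    skip = pytest.mark.skip(reason="offline: https://api.featurelabs.com cannot be reached")
    for item in items:
        if "skip_offline" in item.keywords:
            item.add_marker(skip)
```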
Update EvalML demos to pull data from S3 instead of using local data files in the package. Tests that use the demos have been updated to use the use_local flag, which pulls data from the local data directory rather than S3.
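In practice, usage after this change would look something like the sketch below; the exact return types follow the existing demos API:

```python
from evalml import demos

# Default: fetch the demo dataset from S3 (requires a network connection).
X, y = demos.load_fraud()

# Opt in to the local copy instead, e.g. in unit tests.
X_local, y_local = demos.load_fraud(use_local=True)
```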