Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Download datasets from online so unit tests can run locally #408

Merged
merged 8 commits into from
Feb 27, 2020

Conversation

angela97lin
Copy link
Contributor

@angela97lin angela97lin commented Feb 21, 2020

Closes #342.

  • Uploaded titanic.csv and tips.csv datasets in tests/data folder
  • Uploaded zipped fraud dataset in demos/data folder

(Also uploaded license for tips.csv in tests/data/licenses folder)

@codecov
Copy link

codecov bot commented Feb 21, 2020

Codecov Report

Merging #408 into master will increase coverage by 0.07%.
The diff coverage is 99.02%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #408      +/-   ##
==========================================
+ Coverage   97.39%   97.47%   +0.07%     
==========================================
  Files         103      105       +2     
  Lines        3229     3287      +58     
==========================================
+ Hits         3145     3204      +59     
+ Misses         84       83       -1
Impacted Files Coverage Δ
...s/components/estimators/regressors/rf_regressor.py 100% <ø> (ø) ⬆️
...eature_selection/rf_classifier_feature_selector.py 100% <ø> (ø) ⬆️
...ml/pipelines/classification/logistic_regression.py 100% <ø> (ø) ⬆️
...components/transformers/scalers/standard_scaler.py 100% <ø> (ø) ⬆️
evalml/automl/auto_classification_search.py 100% <ø> (ø) ⬆️
...onents/estimators/regressors/catboost_regressor.py 100% <ø> (ø) ⬆️
evalml/pipelines/classification/catboost.py 100% <ø> (ø) ⬆️
...nents/estimators/classifiers/xgboost_classifier.py 100% <ø> (ø) ⬆️
evalml/automl/auto_regression_search.py 100% <ø> (ø) ⬆️
...components/transformers/encoders/onehot_encoder.py 100% <ø> (ø) ⬆️
... and 21 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 56a734c...8da061a. Read the comment docs.

Comment on lines +1 to +27
Copyright (c) 2012-2020, Michael L. Waskom
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

* Neither the name of the project nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I'm not exactly sure if this is how to distribute w/ license? 🤔🤔🤔🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good -- maybe add a note on #317?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have one "licenses" directory somewhere, perhaps just licenses/. Let's do that to start? Do we have other licenses in the repo yet, and if so, where do we keep them?

Also, what is this license for?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, it's a license for the dataset. Ok, I guess this is a fine place for the license.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I currently have the license folder as closely linked to the dataset as possible, so here I have the /data folder and then a licenses subfolder. What do you think about this vs a root /licenses folder for all datasets?

@dsherry
Copy link
Contributor

dsherry commented Feb 25, 2020

Is the fraud transactions data generated? 2.5mb is pretty good :) but if its generated perhaps we could just truncate it.

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! 👍

@angela97lin
Copy link
Contributor Author

angela97lin commented Feb 25, 2020

Is the fraud transactions data generated? 2.5mb is pretty good :) but if its generated perhaps we could just truncate it.

@dsherry Yup, it's generated! 2.5mb is the size of it, zipped. I think the original .csv has 99,000+ rows and we could truncate it but not sure if it's necessary? :o Could be nice to just have a larger dataset on hand, if not simply for testing purposes. Users currently have the ability to load in any number of rows they'd like using the n_rows parameter in load_fraud if they wanted a smaller dataset :)

@dsherry
Copy link
Contributor

dsherry commented Feb 27, 2020

Sounds good! In that case, I think we should file a separate ticket to track permanently reducing the size of this dataset. There we can either generate a new one and upload it, or have the unit tests generate that data in realtime. This PR is a great start though! Keeping installation as lightweight as possible is helpful to keep in mind.

@angela97lin
Copy link
Contributor Author

Filed #424, closing this!

@angela97lin angela97lin deleted the 342-offline-unit-tests branch February 27, 2020 17:44
@angela97lin angela97lin mentioned this pull request Mar 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unit tests should be able to run offline
2 participants