
Make example data downloader more robust #13

Closed
rebeccabilbro opened this issue May 18, 2016 · 5 comments
Labels: level: novice (good for beginners or new contributors), priority: low (no particular rush in addressing), type: task (non-code related task)

Comments

@rebeccabilbro (Member) commented May 18, 2016

Make the UCI data downloader from the examples more robust.

@rebeccabilbro added this to the Version 0.2 milestone May 18, 2016
@bbengfort added the ready, priority: low, and type: task labels May 18, 2016
@bbengfort (Member) commented:

In the examples directory we have a simple download script that fetches the example data sets for the blog posts we created on Yellowbrick (and powers the Jupyter notebook that is also in that directory).

However, we also need to make sure that this script can be used to download or unpack data for testing (e.g. to a tmp directory). So we have to do one of the following:

  • Add compressed datasets and write a script in tests to load and decompress them in memory (or write them to a temporary directory).
  • Upload the datasets to S3 (they're on Dropbox now, I think, which isn't exactly a great host) and then write a more robust download script.

For whoever takes this item: we'd be happy to discuss a path for either of the above.
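The first option can be sketched with the standard library's zipfile module. Note that the archive path and member name below are hypothetical, not files that exist in the repo; a real test would point at whatever compressed bundles get checked in:

```python
import csv
import io
import zipfile

def load_compressed_dataset(archive_path, member):
    """Read a CSV member out of a zip archive entirely in memory.

    Nothing is unpacked to disk, so this is safe to call from tests.
    """
    with zipfile.ZipFile(archive_path) as archive:
        with archive.open(member) as fobj:
            # ZipFile.open returns a binary stream; wrap it so the
            # csv module can read it as text.
            text = io.TextIOWrapper(fobj, encoding="utf-8")
            reader = csv.reader(text)
            header = next(reader)
            rows = list(reader)
    return header, rows
```

For the temporary-directory variant, a fixture could instead extract into a directory made with `tempfile.mkdtemp()` and remove it in teardown.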

@rebeccabilbro added the level: novice label May 31, 2016
@bbengfort modified the milestones: Version 0.2, Backlog Sep 4, 2016
@bbengfort removed the ready label Sep 7, 2016
@bbengfort modified the milestones: Version 0.3.2, Backlog Oct 13, 2016
@bbengfort self-assigned this Oct 13, 2016
@bbengfort added the in progress label and removed the ready label Oct 13, 2016
@bbengfort (Member) commented Oct 13, 2016

@rebeccabilbro moving forward on this for my afternoon Yellowbrick session; just to take on something fairly easy so that I can get back to my proposal.

@ojedatony1616 I'm going to add the datasets to the DDL Data Bucket on S3; hope that's ok, let me know if you have other thoughts.

Tasks I'm going to do here:

  • Create dataset bundles (CSV with a header row, metadata, and a README) from the three data sets we currently have.
  • Zip the dataset bundles into a single file with an MD5 hash.
  • Create a downloader that checks the hash against the file and stores it locally.
  • Update examples.ipynb and examples.py to use the downloaded data.
  • Create a test fixture that downloads the data to a temporary file.
  • Write a test or two that uses the temporary fixture.
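The hash-checking downloader in the task list could look roughly like the sketch below. The function name and arguments are illustrative rather than the actual download.py API, and sha256 is used here (the follow-up comment settled on sha256 rather than the MD5 mentioned above):

```python
import hashlib
import os
from urllib.request import urlopen

def download_dataset(url, expected_sha256, path, block_size=8192):
    """Stream a file to disk, hashing as we go, then verify the checksum.

    url, expected_sha256, and path are placeholders; the real values
    would live alongside the dataset registry in the download script.
    """
    digest = hashlib.sha256()
    with urlopen(url) as response, open(path, "wb") as fobj:
        for chunk in iter(lambda: response.read(block_size), b""):
            digest.update(chunk)
            fobj.write(chunk)
    if digest.hexdigest() != expected_sha256:
        os.remove(path)  # don't leave a corrupt or stale file behind
        raise ValueError("downloaded file does not match expected hash")
    return path
```

The test fixture task then falls out naturally: call this with a path under `tempfile.mkdtemp()` and delete the directory afterward.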

@bbengfort (Member) commented Oct 13, 2016

@rebeccabilbro ok, so what I've done (particularly as it relates to #60) for each dataset is as follows:

  1. Created a README.md from the UCI ML Repository information with correct citations (including bibtex) and other information (similar to the bundle methodology)

  2. Created a meta.json with the feature names and, if the dataset is for a classifier, the target names

  3. Created a single CSV dataset (e.g. combining the test, training, etc. splits) with a header row that works with pd.read_csv.

  4. Ensured that the feature names in the CSV file were easily understandable.

  5. Ensured that the target is the last column in the dataset

  6. Packaged the dataset into a directory with zip as follows:

    $ zip -r name.zip name/ 
    

    Where name is the single identifier of the directory (e.g. "occupancy")

  7. Took the sha256 hash of the file and stored it in download.py to ensure that the latest version of the dataset is being downloaded and that it hasn't been corrupted.

  8. Uploaded the .zip file to S3 -- namely to DDL's data-lake bucket and made it publicly available for download

  9. Modified download.py to ensure that the new file is downloaded in download_all.

If possible, I'd like to ensure that all of our data sets that we produce for Yellowbrick are treated in a similar fashion.
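The hash in step 7 can be produced with a short helper along these lines (the function name is illustrative); the resulting hex digest is what gets stored in download.py:

```python
import hashlib

def sha256sum(path, block_size=65536):
    """Compute the sha256 hex digest of a file, reading in chunks
    so large zip bundles never have to fit in memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(block_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

On most systems, `shasum -a 256 name.zip` at the shell produces the same value, which is a handy cross-check before pasting the digest into the script.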

@bbengfort (Member) commented:

Also note that this means the examples.ipynb will break if you haven't redownloaded the data files!

@bbengfort (Member) commented:

And now we cross our fingers that Travis passes ...

@bbengfort removed the in progress label Oct 14, 2016