This repository has been archived by the owner on Apr 5, 2024. It is now read-only.

Sample Machine Learning datasets on DataHub #33

Closed
4 tasks done
rufuspollock opened this issue Dec 4, 2017 · 11 comments

Comments

@rufuspollock
Member

rufuspollock commented Dec 4, 2017

Many potential users come from a machine learning context and may be interested in sample machine learning datasets, so let's get some up on the DataHub.

See also openml/OpenML#482

Tasks

  • Identify some sample datasets
  • Package them as Tabular Data Packages
  • Get them into a machine-learning section (maybe create a special org or add these to examples and/or awesome list)
  • Write a tutorial especially explaining how to convert to common formats wanted elsewhere (e.g. ARFF?)
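
For the conversion task above, here is a minimal sketch of going from a data package's CSV plus its Table Schema fields to ARFF. It is an illustration under assumptions, not DataHub's or OpenML's tooling: the function name `csv_to_arff`, the type mapping, and the sample field names are all hypothetical, and values or attribute names containing commas or spaces would need ARFF quoting that this sketch does not do.

```python
import csv
import io

# Assumed mapping from Table Schema field types to ARFF attribute types.
SCHEMA_TO_ARFF = {"integer": "NUMERIC", "number": "NUMERIC", "string": "STRING"}


def csv_to_arff(csv_text, relation, fields):
    """Render CSV text as ARFF, given Table Schema-style field dicts."""
    lines = [f"@RELATION {relation}", ""]
    for field in fields:
        enum = field.get("constraints", {}).get("enum")
        if enum:
            # A field restricted to fixed values becomes a nominal attribute.
            arff_type = "{" + ",".join(enum) + "}"
        else:
            arff_type = SCHEMA_TO_ARFF.get(field.get("type", "string"), "STRING")
        lines.append(f"@ATTRIBUTE {field['name']} {arff_type}")
    lines += ["", "@DATA"]
    rows = csv.reader(io.StringIO(csv_text))
    next(rows)  # skip the CSV header row
    for row in rows:
        # ARFF marks missing values with "?".
        lines.append(",".join(value if value != "" else "?" for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical two-column example.
example_fields = [
    {"name": "x", "type": "number"},
    {"name": "label", "type": "string", "constraints": {"enum": ["a", "b"]}},
]
print(csv_to_arff("x,label\n1.5,a\n,b\n", "demo", example_fields))
```

Using the `enum` constraint to decide between a nominal and a plain string attribute is one possible convention; a tutorial would need to state which convention it picks.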

Research

https://github.com/renatopp/arff-datasets

@HeidiSeibold

You can for example use any data set from OpenML and extract the relevant info to create the data package.

Example in R:

library("OpenML")
eyedat <- getOMLDataSet(data.name = "eeg-eye-state")
str(eyedat)

The same can be done in Python, Java, ... (whatever you prefer, see the OpenML website).

@rufuspollock
Member Author

@HeidiSeibold great. Can you recommend a top-5 set of "getting started" datasets for us to data package?

@HeidiSeibold

HeidiSeibold commented Dec 4, 2017

There is a collection called OpenML 100 with interesting (classification) data sets. It was collected as a quick way to do benchmarking with OpenML.

Maybe the first 5 of those 😉

@rufuspollock
Member Author

@HeidiSeibold thanks - if you have any tips on which ones you'd especially recommend let us know.

@Mikanebu please take a look through these datasets, see how they work, and check whether there are any that you think would be good.

From my own quick browse I found:

@Mikanebu
Member

Mikanebu commented Dec 5, 2017

@rufuspollock ok, I will take a look

@rufuspollock
Member Author

@HeidiSeibold we've started data packaging some machine learning datasets here: http://datahub.io/machine-learning

Any feedback or suggestions for next steps would be very welcome 😄

@HeidiSeibold

HeidiSeibold commented Jan 15, 2018

Cool, thanks 🎉 . So there is at least one data set that is both in your collection and on OpenML:

I guess a first step now would be to check:

  • which fields in the datapackage correspond to which fields in the ARFF file / XML file.
  • if there is any information in the ARFF file / XML file that we cannot get from the data package.
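
As a starting point for the field-correspondence check above, one plausible mapping from ARFF attribute declarations to Table Schema field descriptors can be sketched as follows. The mapping table and the function name `arff_attribute_to_field` are assumptions made for illustration, not an agreed OpenML or DataHub convention; the sketch also assumes unquoted attribute names without spaces, and the sample attribute names are made up.

```python
# Assumed correspondence between ARFF attribute types and Table Schema types.
ARFF_TO_SCHEMA = {
    "numeric": "number",
    "real": "number",
    "integer": "integer",
    "string": "string",
    "date": "datetime",
}


def arff_attribute_to_field(line):
    """Convert one '@attribute name type' line into a Table Schema field dict."""
    _, name, arff_type = line.strip().split(None, 2)
    arff_type = arff_type.strip()
    if arff_type.startswith("{"):
        # Nominal attribute: keep the allowed values as an enum constraint.
        values = [v.strip().strip("'\"") for v in arff_type.strip("{}").split(",")]
        return {"name": name, "type": "string", "constraints": {"enum": values}}
    kind = arff_type.split()[0].lower()
    # Unknown types fall back to "any" rather than guessing.
    return {"name": name, "type": ARFF_TO_SCHEMA.get(kind, "any")}


# Hypothetical attribute lines.
print(arff_attribute_to_field("@ATTRIBUTE height NUMERIC"))
print(arff_attribute_to_field("@attribute class {yes, no}"))
```

In this sketch a date attribute's format string is dropped, which is one example of information that would not survive the conversion unless it is stored elsewhere in the descriptor.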

Is the R package for data packages stable yet? That would make it very easy because then we could just convert between the file formats in R, I think.

@Branko-Dj
Contributor

FIXED! We added new datasets from OpenML at datahub.io/machine-learning and wrote a blog post specifically about the ARFF format: https://datahub.io/blog/attribute-relation-file-format-arff

@HeidiSeibold

Awesome, thanks @Branko-Dj! Do you have code for the transformation from ARFF to data package that we could share?

@Branko-Dj
Contributor

@HeidiSeibold Currently we do it by converting the ARFF file to CSV with a pre-existing script (for example https://github.com/haloboy777/arfftocsv) and then creating a data package for the CSV file.
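
The two-step workflow described here (ARFF to CSV, then a data package around the CSV) could look roughly like the sketch below. It is not the linked script and not DataHub's actual pipeline: the function name `arff_to_csv_and_descriptor`, the default `data.csv` path, and the choice to emit every field as type `any` are assumptions, and the parser only handles simple dense ARFF without quoted commas or sparse rows.

```python
import csv
import io
import json


def arff_to_csv_and_descriptor(arff_text, resource_path="data.csv"):
    """Split simple dense ARFF text into CSV text and a datapackage.json string."""
    relation, names, rows, in_data = "dataset", [], [], False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        lower = line.lower()
        if in_data:
            # "?" is ARFF's missing-value marker; CSV uses an empty cell.
            rows.append(["" if v.strip() == "?" else v.strip() for v in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1].strip("'\"")
        elif lower.startswith("@attribute"):
            names.append(line.split(None, 2)[1])
        elif lower.startswith("@data"):
            in_data = True

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(names)
    writer.writerows(rows)

    # Minimal descriptor; field types are left as "any" in this sketch.
    descriptor = {
        "name": relation,
        "resources": [{
            "name": relation,
            "path": resource_path,
            "schema": {"fields": [{"name": n, "type": "any"} for n in names]},
        }],
    }
    return buffer.getvalue(), json.dumps(descriptor, indent=2)
```

The returned CSV text would be written to `data.csv` and the descriptor to `datapackage.json` alongside it. A real relation name may need normalising before it is used as a package name.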
