In [None]:
%matplotlib inline


# Supplemental Material on OpenML (Optional)

This notebook provides (optional) supplemental material on working with datasets and tasks in OpenML.



## Datasets

How to list and download datasets.


In [50]:
import openml
import pandas as pd
from openml.datasets import edit_dataset, fork_dataset, get_dataset

### Datasets Exercise 0

* List datasets

  * Use the output_format parameter to select output type
  * Default gives 'dict' (other option: 'dataframe', see below)




In [51]:
openml_list = openml.datasets.list_datasets()  # returns a dict

# Show a nice table with some key data properties
datalist = pd.DataFrame.from_dict(openml_list, orient="index")
datalist = datalist[["did", "name", "NumberOfInstances", "NumberOfFeatures", "NumberOfClasses"]]

print(f"First 10 of {len(datalist)} datasets...")
datalist.head(n=10)

# The same can be done with fewer lines of code
openml_df = openml.datasets.list_datasets(output_format="dataframe")
openml_df.head(n=10)

First 10 of 4040 datasets...


Unnamed: 0,did,name,version,uploader,status,format,MajorityClassSize,MaxNominalAttDistinctValues,MinorityClassSize,NumberOfClasses,NumberOfFeatures,NumberOfInstances,NumberOfInstancesWithMissingValues,NumberOfMissingValues,NumberOfNumericFeatures,NumberOfSymbolicFeatures
2,2,anneal,1,1,active,ARFF,684.0,7.0,8.0,5.0,39.0,898.0,898.0,22175.0,6.0,33.0
3,3,kr-vs-kp,1,1,active,ARFF,1669.0,3.0,1527.0,2.0,37.0,3196.0,0.0,0.0,0.0,37.0
4,4,labor,1,1,active,ARFF,37.0,3.0,20.0,2.0,17.0,57.0,56.0,326.0,8.0,9.0
5,5,arrhythmia,1,1,active,ARFF,245.0,13.0,2.0,13.0,280.0,452.0,384.0,408.0,206.0,74.0
6,6,letter,1,1,active,ARFF,813.0,26.0,734.0,26.0,17.0,20000.0,0.0,0.0,16.0,1.0
7,7,audiology,1,1,active,ARFF,57.0,24.0,1.0,24.0,70.0,226.0,222.0,317.0,0.0,70.0
8,8,liver-disorders,1,1,active,ARFF,,,,0.0,6.0,345.0,0.0,0.0,6.0,0.0
9,9,autos,1,1,active,ARFF,67.0,22.0,3.0,6.0,26.0,205.0,46.0,59.0,15.0,11.0
10,10,lymph,1,1,active,ARFF,81.0,8.0,2.0,4.0,19.0,148.0,0.0,0.0,3.0,16.0
11,11,balance-scale,1,1,active,ARFF,288.0,3.0,49.0,3.0,5.0,625.0,0.0,0.0,4.0,1.0


### Datasets Exercise 1

* Find datasets with more than 10000 examples.
* Find a dataset called 'eeg-eye-state'.
* Find all datasets with more than 50 classes.



In [53]:
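# Datasets with more than 10000 examples
# (a notebook cell displays only its last expression; run the lines separately to see each result)
datalist[datalist.NumberOfInstances > 10000].sort_values(["NumberOfInstances"]).head(n=20)
# The dataset called 'eeg-eye-state'
datalist.query('name == "eeg-eye-state"')
# Datasets with more than 50 classes
datalist.query("NumberOfClasses > 50")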

Unnamed: 0,did,name,NumberOfInstances,NumberOfFeatures,NumberOfClasses
1491,1491,one-hundred-plants-margin,1600.0,65.0,100.0
1492,1492,one-hundred-plants-shape,1600.0,65.0,100.0
1493,1493,one-hundred-plants-texture,1599.0,65.0,100.0
4552,4552,BachChoralHarmony,5665.0,17.0,102.0
41167,41167,dionis,416188.0,61.0,355.0
41169,41169,helena,65196.0,28.0,100.0
41960,41960,seattlecrime6,523590.0,8.0,144.0
41983,41983,CIFAR-100,60000.0,3073.0,100.0
42078,42078,beer_reviews,1586614.0,13.0,104.0
42087,42087,beer_reviews,1586614.0,13.0,104.0
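
The same three filters can also be written against the dict-of-dicts shape that `list_datasets()` returns by default, without pandas. The sketch below is only an illustration: `sample` is a hand-copied subset of rows from the tables above (not a live server response), and `filter_datasets` is a hypothetical helper, not part of openml-python.

```python
# Hand-copied subset of the rows shown above, keyed by dataset ID in the
# same dict-of-dicts shape; this is not a live server response.
sample = {
    6: {"did": 6, "name": "letter", "NumberOfInstances": 20000.0,
        "NumberOfFeatures": 17.0, "NumberOfClasses": 26.0},
    11: {"did": 11, "name": "balance-scale", "NumberOfInstances": 625.0,
         "NumberOfFeatures": 5.0, "NumberOfClasses": 3.0},
    1491: {"did": 1491, "name": "one-hundred-plants-margin", "NumberOfInstances": 1600.0,
           "NumberOfFeatures": 65.0, "NumberOfClasses": 100.0},
    41167: {"did": 41167, "name": "dionis", "NumberOfInstances": 416188.0,
            "NumberOfFeatures": 61.0, "NumberOfClasses": 355.0},
}


def filter_datasets(listing, predicate):
    """Return the sorted IDs of all entries for which predicate(entry) is true."""
    return sorted(did for did, entry in listing.items() if predicate(entry))


# .get(..., 0) treats a missing quality as "does not match".
large = filter_datasets(sample, lambda d: d.get("NumberOfInstances", 0) > 10000)
many_classes = filter_datasets(sample, lambda d: d.get("NumberOfClasses", 0) > 50)
by_name = filter_datasets(sample, lambda d: d["name"] == "dionis")
print(large, many_classes, by_name)
```

The pandas `query` calls above remain the more convenient route for interactive work; the plain-Python form is mainly useful when only the dictionary is at hand.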


### Download datasets



In [54]:
# This is done based on the dataset ID.
dataset = openml.datasets.get_dataset(1471)

# Print a summary
print(
    f"This is dataset '{dataset.name}', the target feature is "
    f"'{dataset.default_target_attribute}'"
)
print(f"URL: {dataset.url}")
print(dataset.description[:500])

This is dataset 'eeg-eye-state', the target feature is 'Class'
URL: https://old.openml.org/data/v1/download/1587924/eeg-eye-state.arff
**Author**: Oliver Roesler  
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/EEG+Eye+State), Baden-Wuerttemberg, Cooperative State University (DHBW), Stuttgart, Germany  
**Please cite**: [UCI](https://archive.ics.uci.edu/ml/citation_policy.html)  

All data is from one continuous EEG measurement with the Emotiv EEG Neuroheadset. The duration of the measurement was 117 seconds. The eye state was detected via a camera during the EEG measurement and added later manually to the file after


Get the actual data.

The dataset can be returned in three formats: a NumPy array, a SciPy
sparse matrix, or a pandas DataFrame. The format is
controlled by the parameter ``dataset_format``, which can be either 'array'
(default) or 'dataframe'; with 'array', sparse datasets are returned as a SciPy
sparse matrix. Let's first build our dataset from a NumPy array
and manually create a dataframe.



In [55]:
X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format="array", target=dataset.default_target_attribute
)
eeg = pd.DataFrame(X, columns=attribute_names)
eeg["class"] = y
print(eeg[:10])

            V1           V2           V3           V4           V5  \
0  4329.229980  4009.229980  4289.229980  4148.209961  4350.259766   
1  4324.620117  4004.620117  4293.850098  4148.720215  4342.049805   
2  4327.689941  4006.669922  4295.379883  4156.410156  4336.919922   
3  4328.720215  4011.790039  4296.410156  4155.899902  4343.589844   
4  4326.149902  4011.790039  4292.310059  4151.279785  4347.689941   
5  4321.029785  4004.620117  4284.100098  4153.330078  4345.640137   
6  4319.490234  4001.030029  4280.509766  4151.790039  4343.589844   
7  4325.640137  4006.669922  4278.459961  4143.080078  4344.100098   
8  4326.149902  4010.770020  4276.410156  4139.490234  4345.129883   
9  4326.149902  4011.280029  4276.919922  4142.049805  4344.100098   

            V6           V7           V8           V9          V10  \
0  4586.149902  4096.919922  4641.029785  4222.049805  4238.459961   
1  4586.669922  4097.439941  4638.970215  4210.770020  4226.669922   
2  4583.589844  409

Instead of manually creating the dataframe, you can already request a
dataframe with the correct dtypes.



In [56]:
X, y, categorical_indicator, attribute_names = dataset.get_data(
    target=dataset.default_target_attribute, dataset_format="dataframe"
)
print(X.head())
print(X.info())

        V1       V2       V3       V4       V5       V6       V7       V8  \
0  4329.23  4009.23  4289.23  4148.21  4350.26  4586.15  4096.92  4641.03   
1  4324.62  4004.62  4293.85  4148.72  4342.05  4586.67  4097.44  4638.97   
2  4327.69  4006.67  4295.38  4156.41  4336.92  4583.59  4096.92  4630.26   
3  4328.72  4011.79  4296.41  4155.90  4343.59  4582.56  4097.44  4630.77   
4  4326.15  4011.79  4292.31  4151.28  4347.69  4586.67  4095.90  4627.69   

        V9      V10      V11      V12      V13      V14  
0  4222.05  4238.46  4211.28  4280.51  4635.90  4393.85  
1  4210.77  4226.67  4207.69  4279.49  4632.82  4384.10  
2  4207.69  4222.05  4206.67  4282.05  4628.72  4389.23  
3  4217.44  4235.38  4210.77  4287.69  4632.31  4396.41  
4  4210.77  4244.10  4212.82  4288.21  4632.82  4398.46  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14980 entries, 0 to 14979
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V
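
`get_data` also returns `categorical_indicator` and `attribute_names`, which the cells above do not use. Below is a minimal sketch of how the two lists line up, one boolean per feature name. The example values are made up for illustration and `split_by_type` is a hypothetical helper, not an openml-python function.

```python
def split_by_type(attribute_names, categorical_indicator):
    """Split feature names into (categorical, numeric) using the parallel boolean list."""
    if len(attribute_names) != len(categorical_indicator):
        raise ValueError("expected one indicator per attribute name")
    pairs = list(zip(attribute_names, categorical_indicator))
    categorical = [name for name, is_cat in pairs if is_cat]
    numeric = [name for name, is_cat in pairs if not is_cat]
    return categorical, numeric


# Made-up example values, not taken from a real dataset.
names = ["age", "colour", "height"]
flags = [False, True, False]
categorical, numeric = split_by_type(names, flags)
print(categorical, numeric)
```

In the notebook, the two lists returned by `dataset.get_data(...)` can be passed in directly.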

Sometimes you only need access to a dataset's metadata.
In those cases, you can download the dataset without downloading the
data file. The dataset object can be used as normal.
Whenever you use any functionality that requires the data,
such as `get_data`, the data will be downloaded.



In [58]:
dataset = openml.datasets.get_dataset(1471, download_data=False)
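
As a small illustration of working with metadata only, the sketch below builds a summary from the `name` and `default_target_attribute` attributes used earlier, without calling `get_data`. The `summarize` helper is hypothetical, and a stand-in object (with values copied from the output above) replaces the real dataset so the snippet runs offline.

```python
from types import SimpleNamespace


def summarize(dataset):
    """Build a one-line summary from metadata attributes only (no call to get_data)."""
    return f"{dataset.name} (target: {dataset.default_target_attribute})"


# Stand-in for illustration; in the notebook, pass the object returned by
# openml.datasets.get_dataset(1471, download_data=False) instead.
stand_in = SimpleNamespace(name="eeg-eye-state", default_target_attribute="Class")
summary = summarize(stand_in)
print(summary)
```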

## Dataset Upload

There are several ways to upload a dataset; the most convenient are documented in [this example](https://github.com/openml/openml-python/blob/master/examples/create_upload_tutorial.py). The simplest route is to start from a [pandas dataframe](https://github.com/openml/openml-python/blob/a0ef724fec6ab31f6381d3ac2a84827ab535170d/examples/create_upload_tutorial.py#L206). In addition, we need to create an [OpenMLDataset](https://openml.github.io/openml-python/master/generated/openml.OpenMLDataset.html#openml.OpenMLDataset) object that holds information about the dataset. Most notably, the arguments `name`, `default_target_attribute`, `attributes` and `data` need to be set.

* Find your favorite dataset (on your laptop), load it as pandas dataframe and upload it to OpenML.
* Common problem: the server returns error 131. This means that the description file was incomplete. The [XSD](https://github.com/openml/OpenML/blob/master/openml_OS/views/pages/api_new/v1/xsd/openml.data.upload.xsd) for uploading the dataset shows which fields are mandatory.


* Note that:
    * The dataset should not already be on OpenML.
    * The data should be tabular (e.g. CSV) and represent a classification or regression problem.
    * Text or image data should not be uploaded unless it has already been featurized.
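
A minimal sketch of preparing such an upload, assuming the `openml.datasets.create_dataset` function used in the linked tutorial. The keyword names below follow that function as I understand it and should be checked against the installed openml-python version. `build_dataset_kwargs` and `upload` are hypothetical helpers, the toy data is invented, and the `None` values are placeholders that the server may reject.

```python
# Fields the text above singles out as essential.
ESSENTIAL_FIELDS = ("name", "default_target_attribute", "attributes", "data")


def build_dataset_kwargs(name, description, attributes, data, target):
    """Collect keyword arguments for a dataset upload and check the essentials."""
    kwargs = {
        "name": name,
        "description": description,
        "creator": None,            # placeholder
        "contributor": None,        # placeholder
        "collection_date": None,    # placeholder
        "language": "English",
        "licence": None,            # placeholder
        "attributes": attributes,
        "data": data,
        "default_target_attribute": target,
        "ignore_attribute": None,
        "citation": None,           # placeholder
    }
    missing = [field for field in ESSENTIAL_FIELDS if kwargs[field] is None]
    if missing:
        raise ValueError(f"missing essential fields: {missing}")
    return kwargs


def upload(kwargs):
    """Create and publish the dataset; needs openml-python, an API key, and network access."""
    import openml  # imported lazily so the rest of the sketch runs without it

    dataset = openml.datasets.create_dataset(**kwargs)
    dataset.publish()
    return dataset


# Toy data, invented for illustration: (name, type) pairs plus rows of values.
attributes = [("width", "REAL"), ("height", "REAL"), ("label", ["small", "large"])]
data = [[1.0, 2.0, "small"], [5.0, 9.0, "large"]]
kwargs = build_dataset_kwargs("toy-shapes", "Toy example", attributes, data, "label")
print(sorted(kwargs))
# upload(kwargs)  # uncomment to actually publish (preferably against the test server)
```

Checking the essential fields locally before publishing is one way to catch an incomplete description early, rather than learning about it from error 131.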


## Tasks

A tutorial on how to list and download tasks.


In [None]:
# License: BSD 3-Clause

import openml
from openml.tasks import TaskType
import pandas as pd

Tasks are identified by IDs and can be accessed in two different ways:

1. In a list providing basic information on all tasks available on OpenML.
   This function does not download the actual tasks; it downloads only the
   metadata, which can be used to filter the tasks and retrieve a set of IDs.
   The list can be filtered, for example to tasks that carry a particular tag
   or to tasks of a specific type such as
   *supervised classification*.
2. A single task by its ID. It contains all meta information, the target
   metric, the splits and an iterator which can be used to access the
   splits in a useful manner.



### Listing tasks

We will start by simply listing only *supervised classification* tasks:



In [None]:
tasks = openml.tasks.list_tasks(task_type=TaskType.SUPERVISED_CLASSIFICATION)

**openml.tasks.list_tasks()** returns a dictionary of dictionaries by default, which we convert
into a
[pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)
to have better visualization capabilities and easier access:



In [None]:
tasks = pd.DataFrame.from_dict(tasks, orient="index")
print(tasks.columns)
print(f"First 5 of {len(tasks)} tasks:")
print(tasks.head())

# As conversion to a pandas dataframe is a common task, we have added this functionality to the
# OpenML-Python library which can be used by passing ``output_format='dataframe'``:
tasks_df = openml.tasks.list_tasks(
    task_type=TaskType.SUPERVISED_CLASSIFICATION, output_format="dataframe"
)
print(tasks_df.head())

We can filter the list of tasks to contain only datasets with more than
500 samples but fewer than 1000 samples:



In [None]:
filtered_tasks = tasks.query("NumberOfInstances > 500 and NumberOfInstances < 1000")
print(list(filtered_tasks.index))

In [None]:
# Number of tasks
print(len(filtered_tasks))

Then, we can further restrict the tasks to all have the same resampling strategy:



In [None]:
filtered_tasks = filtered_tasks.query('estimation_procedure == "10-fold Crossvalidation"')
print(list(filtered_tasks.index))

In [None]:
# Number of tasks
print(len(filtered_tasks))

Resampling strategies can be found on the
[OpenML Website](https://www.openml.org/search?type=measure&q=estimation%20procedure).

Similar to listing tasks by task type, we can list tasks by tags:



In [None]:
tasks = openml.tasks.list_tasks(tag="OpenML100", output_format="dataframe")
print(f"First 5 of {len(tasks)} tasks:")
print(tasks.head())

Furthermore, we can list tasks based on the dataset ID:



In [None]:
tasks = openml.tasks.list_tasks(data_id=1471, output_format="dataframe")
print(f"First 5 of {len(tasks)} tasks:")
print(tasks.head())

In addition, a size limit and an offset can be applied both separately and simultaneously:



In [None]:
tasks = openml.tasks.list_tasks(size=10, offset=50, output_format="dataframe")
print(tasks)
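
Together, `size` and `offset` are enough to walk through a long listing in batches. The sketch below shows the idea with an injected `fetch` callable so that it runs offline. `iter_pages` and `fake_fetch` are hypothetical names, and the fake listing stands in for the server.

```python
def iter_pages(fetch, page_size, max_items):
    """Yield items batch by batch, advancing `offset` until a batch comes back short."""
    offset = 0
    while offset < max_items:
        size = min(page_size, max_items - offset)
        batch = fetch(size=size, offset=offset)
        yield from batch
        if len(batch) < size:
            break  # the listing is exhausted
        offset += size


# Fake listing standing in for the server: 25 made-up task IDs.
all_ids = list(range(1, 26))


def fake_fetch(size, offset):
    return all_ids[offset:offset + size]


collected = list(iter_pages(fake_fetch, page_size=10, max_items=100))
print(len(collected))
```

Against the real server, `fetch` could wrap `openml.tasks.list_tasks(size=size, offset=offset, output_format="dataframe")` and return its `tid` column as a list.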

**OpenML 100**
is a curated list of 100 tasks to start using OpenML. They are all
supervised classification tasks with more than 500 and fewer than 50000
instances per task. To make things easier, the tasks do not contain highly
unbalanced or sparse data. However, the tasks do include missing values and
categorical features. You can find out more about the *OpenML 100* on
[the OpenML benchmarking page](https://docs.openml.org/benchmark/).

Finally, it is also possible to list all tasks on OpenML with:



In [None]:
tasks = openml.tasks.list_tasks(output_format="dataframe")
print(len(tasks))

### Tasks Exercise

Search for the tasks on the 'eeg-eye-state' dataset.



In [None]:
tasks.query('name=="eeg-eye-state"')

### Downloading tasks

We provide two functions to download tasks, one which downloads only a
single task by its ID, and one which takes a list of IDs and downloads
all of these tasks:



In [None]:
task_id = 31
task = openml.tasks.get_task(task_id)

Properties of the task are stored as member variables:



In [None]:
print(task)

To download several tasks at once, pass a list of IDs:



In [None]:
ids = [2, 1891, 31, 9983]
tasks = openml.tasks.get_tasks(ids)
print(tasks[0])

### Creating tasks

You can also create new tasks. Take the following into account:

* You can only create tasks on *active* datasets
* For now, only the following tasks are supported: classification, regression,
  clustering, and learning curve analysis.
* For now, tasks can only be created on a single dataset.
* The exact same task must not already exist.

Creating a task requires the following input:

* task_type: The task type ID (see below). Required.
* dataset_id: The dataset ID. Required.
* target_name: The name of the attribute you aim to predict. Optional.
* estimation_procedure_id : The ID of the estimation procedure used to create train-test
  splits. Optional.
* evaluation_measure: The name of the evaluation measure. Optional.
* Any additional inputs for specific tasks

It is best to leave the evaluation measure open if there is no strong prerequisite for a
specific measure. OpenML will always compute all appropriate measures and you can filter
or sort results on your favourite measure afterwards. Only add an evaluation measure if
necessary (e.g. when other measures make no sense), since it will create a new task, which
scatters results across tasks.
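
To make the role of the estimation procedure concrete, here is a plain-Python sketch of what a 10-fold cross-validation split looks like. It is an illustration only: OpenML tasks come with their own splits, and this unshuffled, unstratified version (with the hypothetical helper `k_fold_indices`) is not how those splits are generated.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists; each sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop


folds = list(k_fold_indices(n_samples=25, n_folds=10))
print([len(test) for _, test in folds])
```

Every task that shares the same dataset, target and estimation procedure uses the same splits, which is what makes results on a task comparable.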



We'll use the test server for the rest of this tutorial.

<div class="alert alert-danger"><h4>Warning</h4><p>The cells below upload data, so they connect to the OpenML test server instead of the main server.</p></div>



In [None]:
openml.config.start_using_configuration_for_example()

### Task Example

Let's create a classification task on a dataset. In this example we will do this on the
Iris dataset (ID=128 (on test server)). We'll use 10-fold cross-validation (ID=1),
and *predictive accuracy* as the predefined measure (this can also be left open).
If a task with these parameters exists, we will get an appropriate exception.
If such a task doesn't exist, a task will be created and the corresponding task_id
will be returned.



In [None]:
try:
    my_task = openml.tasks.create_task(
        task_type=TaskType.SUPERVISED_CLASSIFICATION,
        dataset_id=128,
        target_name="class",
        evaluation_measure="predictive_accuracy",
        estimation_procedure_id=1,
    )
    my_task.publish()
except openml.exceptions.OpenMLServerException as e:
    # Error code for 'task already exists'
    if e.code == 614:
        # Lookup task
        tasks = openml.tasks.list_tasks(data_id=128, output_format="dataframe")
        tasks = tasks.query(
            'task_type == "Supervised Classification" '
            'and estimation_procedure == "10-fold Crossvalidation" '
            'and evaluation_measures == "predictive_accuracy"'
        )
        task_id = tasks.loc[:, "tid"].values[0]
        print("Task already exists. Task ID is", task_id)
    else:
        # Any other server error is unexpected here, so let it propagate
        raise

# reverting to prod server
openml.config.stop_using_configuration_for_example()

* [Complete list of task types](https://www.openml.org/search?type=task_type).
* [Complete list of model estimation procedures](https://www.openml.org/search?q=%2520measure_type%3Aestimation_procedure&type=measure).
* [Complete list of evaluation measures](https://www.openml.org/search?q=measure_type%3Aevaluation_measure&type=measure).


