## Small demo of the OpenML API

It covers downloading datasets and tasks, using scikit-learn to build classifiers, and uploading the results to the server.


Initialization and login. This assumes you have a .openml directory in your home directory, with a subdirectory for caches and a file containing your API key. You can find your API key in your account settings on openml.org.

In [None]:
from sklearn import preprocessing, ensemble
from openml.apiconnector import APIConnector
from openml.autorun import openml_run
import numpy as np
import pandas as pd
import os

home_dir = os.path.expanduser("~")
openml_dir = os.path.join(home_dir, ".openml")
os.makedirs(openml_dir, exist_ok=True)  # create ~/.openml if it does not exist yet
cache_dir = os.path.join(openml_dir, "cache")
key = "YOUR_API_KEY"  # Placeholder: replace with your own API key as a string

In [None]:
openml = APIConnector(cache_directory=cache_dir, apikey=key)
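
The cell above hard-codes the key. Since the setup already assumes a file with your API key in the .openml directory, the key could also be read from that file. The following is a minimal sketch; the filename `apikey.txt` and the helper `read_api_key` are assumptions made for illustration and are not part of the OpenML API.

```python
import os

def read_api_key(openml_dir, filename="apikey.txt"):
    """Return the API key stored in openml_dir/filename, or None if missing.

    The filename is a hypothetical choice; use whatever file holds your key.
    """
    path = os.path.join(openml_dir, filename)
    if not os.path.isfile(path):
        return None
    with open(path, "r") as handle:
        # Strip surrounding whitespace, e.g. the trailing newline editors add.
        return handle.read().strip()
```

With this helper, `key = read_api_key(openml_dir)` could replace the hard-coded string, which keeps the key out of the notebook itself.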

List all datasets on OpenML

In [None]:
datasets = openml.get_dataset_list()

data = pd.DataFrame(datasets)
print("Found %s datasets on OpenML. Here are the first 10:" % len(datasets))
print(data[:10][['did','NumberOfInstances','NumberOfFeatures','NumberOfClasses']])

cl_data = data.loc[data['NumberOfClasses'] >= 2]
print("Selected %s classification datasets. Here are the first 10:" % len(cl_data))
print(cl_data[:10][['did','NumberOfInstances','NumberOfFeatures','NumberOfClasses']])
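
The same boolean-mask pattern extends to several conditions at once. The sketch below uses a small hand-made table in place of the real listing (the rows are invented for illustration; only the column names come from the cell above) and keeps classification datasets with at most 1000 instances.

```python
import pandas as pd

# Hypothetical rows standing in for the result of openml.get_dataset_list().
example = pd.DataFrame([
    {"did": 1, "NumberOfInstances": 150, "NumberOfFeatures": 5, "NumberOfClasses": 3},
    {"did": 2, "NumberOfInstances": 5000, "NumberOfFeatures": 20, "NumberOfClasses": 2},
    {"did": 3, "NumberOfInstances": 800, "NumberOfFeatures": 10, "NumberOfClasses": 0},
])

# Build one boolean mask per condition and combine them with &.
is_classification = example["NumberOfClasses"] >= 2
is_small = example["NumberOfInstances"] <= 1000
small_cl = example.loc[is_classification & is_small]

# Sort so the largest of the remaining datasets comes first.
small_cl = small_cl.sort_values("NumberOfInstances", ascending=False)
print(small_cl[["did", "NumberOfInstances", "NumberOfClasses"]])
```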

Download a specific dataset. This is done based on the dataset ID (called 'did' in the table above).

In [None]:
from pprint import pprint
import arff

print("Downloading dataset 61.")
dataset = openml.download_dataset(61)

print("This is dataset '%s', the target feature is called '%s'" % (dataset.name, dataset.default_target_attribute))
print("More info, including the location of the .arff file on disk:")
pprint(vars(dataset))

print("Peeking at the actual data (class labels are replaced by indices):")
X, y, attribute_names = dataset.get_dataset(target=dataset.default_target_attribute, return_attribute_names=True)
iris = pd.DataFrame(X, columns=attribute_names)
iris['class'] = y
print(iris[:10])

Training a scikit-learn model with the data

In [None]:
dataset = openml.download_dataset(61)
X, y = dataset.get_dataset(target=dataset.default_target_attribute)
clf = ensemble.RandomForestClassifier()
clf.fit(X, y)
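
Fitting on all of the data, as above, gives no estimate of how well the model generalizes. A minimal sketch of a holdout evaluation follows. It uses the iris data bundled with scikit-learn as a stand-in for the downloaded X and y, so it runs without a server connection; the split size, number of trees, and random seeds are arbitrary choices.

```python
from sklearn import ensemble
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in data; with OpenML you would use the X and y obtained above.
X_demo, y_demo = load_iris(return_X_y=True)

# Hold out 30% of the rows, stratified so class proportions are preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

clf_demo = ensemble.RandomForestClassifier(n_estimators=100, random_state=0)
clf_demo.fit(X_train, y_train)

# score() returns the mean accuracy on the held-out rows.
accuracy = clf_demo.score(X_test, y_test)
print("Holdout accuracy: %.3f" % accuracy)
```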

You can also ask which features are categorical, so that you can do your own encoding.

In [None]:
X, y, categorical = dataset.get_dataset(target=dataset.default_target_attribute, return_categorical_indicator=True)
enc = preprocessing.OneHotEncoder(categorical_features=categorical)
X = enc.fit_transform(X)
clf.fit(X, y)
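
Note that the `categorical_features` argument of `OneHotEncoder` was deprecated in scikit-learn 0.20 and removed in 0.22, so the cell above only runs on older versions. On newer versions, one possible equivalent is a `ColumnTransformer` driven by the same boolean indicator. The sketch below uses a tiny made-up array and mask (`X_demo` and `categorical_demo` are illustrative stand-ins for the X and categorical returned above).

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up data: column 0 is categorical (values 0, 1, 2), column 1 is numeric.
X_demo = np.array([[0, 1.5], [1, 2.5], [2, 3.5], [0, 4.5]])
categorical_demo = np.array([True, False])

encoder = ColumnTransformer(
    transformers=[
        # One-hot encode only the columns flagged as categorical.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_demo),
    ],
    remainder="passthrough",  # keep the remaining columns unchanged
    sparse_threshold=0,       # always return a dense array
)
X_encoded = encoder.fit_transform(X_demo)
print(X_encoded.shape)
```

The encoded columns come first, followed by the passed-through columns, so the column order differs from the input.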

To run benchmarks consistently (also across studies and tools), OpenML offers Tasks, which include specific train-test splits and other information.

In [None]:
task_list = openml.get_task_list()
print(task_list[0])

tasks = pd.DataFrame(task_list)
print("Found %s tasks on OpenML. Here are the first 10:" % len(tasks))
print(tasks[:10][['tid','did','task_type','NumberOfInstances','NumberOfFeatures','NumberOfClasses']])
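
To see why fixed splits matter, consider that two studies shuffling the same data differently would evaluate on different test sets. The sketch below is not how OpenML stores or serves its splits; it only illustrates the idea that a split defined once (here by a seed) can be reproduced exactly. The helper `make_folds` is hypothetical.

```python
import random

def make_folds(n_instances, n_folds, seed):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffled order
    folds = []
    for k in range(n_folds):
        # Fold k takes every n_folds-th element of the shuffled order.
        test = sorted(indices[k::n_folds])
        test_set = set(test)
        train = [i for i in range(n_instances) if i not in test_set]
        folds.append((train, test))
    return folds

folds_a = make_folds(150, 10, seed=42)
folds_b = make_folds(150, 10, seed=42)
print(folds_a == folds_b)  # identical seeds give identical folds
```

An OpenML task plays the role of the seed here: everyone who runs the task evaluates on the same train-test splits, which is what makes results comparable.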

Download a single OpenML task (id=10), create a scikit-learn classifier (RandomForest), and run it on the task

In [None]:
task = openml.download_task(10)
print(task)

clf = ensemble.RandomForestClassifier()
prediction_path, description_path = openml_run(task, clf)
print("RandomForest has run on the task.")
print("The predictions are in file %s" % prediction_path)
print("The complete run description is stored in file %s" % description_path)

#import json
#print(json.dumps(xmltodict.parse(open(os.path.abspath(description_path), "r").read()), indent=4))

Upload the run to the OpenML server

In [None]:
import xmltodict
from IPython.display import display, HTML

prediction_abspath = os.path.abspath(prediction_path)
description_abspath = os.path.abspath(description_path)

return_code, response = openml.upload_run(prediction_abspath, description_abspath)

if return_code == 200:
    response_dict = xmltodict.parse(response.content)
    run_id = response_dict['oml:upload_run']['oml:run_id']
    print("Uploaded run with id %s" % (run_id))
    display(HTML("<a href='http://www.openml.org/r/{0}'>http://www.openml.org/r/{0}</a>".format(run_id)))
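
The cell above silently does nothing when the upload fails. A small sketch of a more defensive version follows. The helper `extract_run_id` is hypothetical, and it only assumes the nested structure already used above ('oml:upload_run' containing 'oml:run_id').

```python
def extract_run_id(return_code, response_dict):
    """Return the run id from a parsed upload response.

    Raises RuntimeError if the status code is not 200 and ValueError if
    the expected keys are missing from the parsed response.
    """
    if return_code != 200:
        raise RuntimeError("Upload failed with HTTP status %s" % return_code)
    try:
        return response_dict['oml:upload_run']['oml:run_id']
    except (KeyError, TypeError):
        raise ValueError("Response does not contain 'oml:upload_run'/'oml:run_id'")

# Hand-built dictionary with a made-up id, standing in for xmltodict.parse(...).
example_response = {'oml:upload_run': {'oml:run_id': '123'}}
print("Uploaded run with id %s" % extract_run_id(200, example_response))
```

In the notebook, this would be called as `extract_run_id(return_code, xmltodict.parse(response.content))`, so that a failed upload raises an error instead of passing unnoticed.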
