Merged
108 commits
78f36ef
reproduce figure 3 of nips paper
Sep 26, 2018
3e24361
update readme
Sep 26, 2018
ccd24b2
Update Readme.md
herilalaina Sep 26, 2018
cd5eebc
Update Readme.md
herilalaina Sep 26, 2018
fe3c435
Fix pep8 error
herilalaina Sep 26, 2018
fa81b9d
Fix codestyle issues
herilalaina Sep 27, 2018
3e4457d
Delete unused instruction
herilalaina Sep 27, 2018
1d3814c
Save progress
ahn1340 Dec 11, 2018
19c556b
ADD get_tasks.py. converts given dataset_ids to task_ids.
ahn1340 Dec 12, 2018
def630b
Save current progress
ahn1340 Dec 14, 2018
3d80512
Save progress.
ahn1340 Dec 20, 2018
c3b3c8d
ADD resource folder
ahn1340 Dec 20, 2018
467c2cf
Save progress.
ahn1340 Dec 21, 2018
2220ab3
ADD score_metalearning.py (In progress)
ahn1340 Jan 13, 2019
53d97e6
Testing with cluster
ahn1340 Jan 14, 2019
340dc94
Testing with cluster
ahn1340 Jan 14, 2019
c7de82b
ADD load_task_offline.py
Jan 14, 2019
4c479c7
Testing with cluster
ahn1340 Jan 14, 2019
653262c
Testing
ahn1340 Jan 14, 2019
3c10c9c
Testing on cluster
ahn1340 Jan 15, 2019
8d942f3
Save Progress
ahn1340 Jan 23, 2019
393ff70
Update score_metalearning
ahn1340 Jan 23, 2019
7ee5d1d
Split test script to startup and shutdown file.
ahn1340 Jan 23, 2019
0b38666
rename
Jan 24, 2019
5f6de3a
Trying figure out which task_ids are missing in the cluster cache
ahn1340 Jan 24, 2019
7758b27
Modify scripts to run on cluster
Jan 24, 2019
eed3181
Modify
ahn1340 Jan 24, 2019
e297414
Update
Jan 24, 2019
22e23eb
.
Jan 27, 2019
2b74e08
ADD plot
Jan 28, 2019
239b09e
Looking at the result
Jan 30, 2019
c0f4b92
Save progress
ahn1340 Feb 7, 2019
e1d37d6
Change way of computing vanilla and metalearning
ahn1340 Feb 8, 2019
8e77b65
Inspecting run result again
Feb 8, 2019
84b8be3
Save changes
ahn1340 Feb 20, 2019
a088899
update metalearning part
ahn1340 Feb 20, 2019
489e4f0
Commit so it can pull from origin
Feb 21, 2019
801ef4b
Bring some files for testing
Feb 21, 2019
8eac064
Merge branch 'herilalaina-nips_reproduce' of https://github.com/ahn13…
ahn1340 Feb 21, 2019
076e9f5
Use pandas series for fill_trajectory
ahn1340 Feb 21, 2019
fb0b9ec
Clean up the code
ahn1340 Feb 21, 2019
c7e12b9
Merge pull request #5 from automl/development
ahn1340 Feb 21, 2019
145105e
Cleanup more code
ahn1340 Feb 21, 2019
a5222ea
reproduce figure 3 of nips paper
Sep 26, 2018
c8662d9
update readme
Sep 26, 2018
bbff346
Update Readme.md
herilalaina Sep 26, 2018
d181542
Update Readme.md
herilalaina Sep 26, 2018
4f64985
Fix pep8 error
herilalaina Sep 26, 2018
908b67d
Fix codestyle issues
herilalaina Sep 27, 2018
c16adef
Delete unused instruction
herilalaina Sep 27, 2018
fee4ecd
Save progress
ahn1340 Dec 11, 2018
5b93a14
ADD get_tasks.py. converts given dataset_ids to task_ids.
ahn1340 Dec 12, 2018
2c34093
Save current progress
ahn1340 Dec 14, 2018
b507972
Save progress.
ahn1340 Dec 20, 2018
d03201b
ADD resource folder
ahn1340 Dec 20, 2018
610f4d2
Save progress.
ahn1340 Dec 21, 2018
01ffe95
ADD score_metalearning.py (In progress)
ahn1340 Jan 13, 2019
f28d643
Testing with cluster
ahn1340 Jan 14, 2019
92cbfcf
Testing with cluster
ahn1340 Jan 14, 2019
1b2639f
ADD load_task_offline.py
Jan 14, 2019
3c59403
Testing with cluster
ahn1340 Jan 14, 2019
c45f4b6
Testing
ahn1340 Jan 14, 2019
96574b4
Testing on cluster
ahn1340 Jan 15, 2019
ec9065b
Save Progress
ahn1340 Jan 23, 2019
1a081f6
Update score_metalearning
ahn1340 Jan 23, 2019
0aab2f7
Split test script to startup and shutdown file.
ahn1340 Jan 23, 2019
9318704
rename
Jan 24, 2019
6743158
Trying figure out which task_ids are missing in the cluster cache
ahn1340 Jan 24, 2019
f0aee25
Modify scripts to run on cluster
Jan 24, 2019
e38e122
Modify
ahn1340 Jan 24, 2019
28c614c
Update
Jan 24, 2019
172095b
.
Jan 27, 2019
16f65b4
ADD plot
Jan 28, 2019
353204b
Looking at the result
Jan 30, 2019
8197bcf
Save progress
ahn1340 Feb 7, 2019
1129259
Change way of computing vanilla and metalearning
ahn1340 Feb 8, 2019
5861d80
Inspecting run result again
Feb 8, 2019
018eda7
Save changes
ahn1340 Feb 20, 2019
474be77
update metalearning part
ahn1340 Feb 20, 2019
ebed88d
Commit so it can pull from origin
Feb 21, 2019
3434e57
Bring some files for testing
Feb 21, 2019
44e1adf
Use pandas series for fill_trajectory
ahn1340 Feb 21, 2019
2388087
Clean up the code
ahn1340 Feb 21, 2019
b8b9b6d
Cleanup more code
ahn1340 Feb 21, 2019
53290eb
Modify fill_trajectory to not delete unchanging incumbent trajectory
ahn1340 Mar 7, 2019
40fa855
Merge branch 'herilalaina-nips_reproduce' of https://github.com/ahn13…
ahn1340 Mar 13, 2019
3e5f481
Modify run_with_meatalearning to use new metadata_directory feature,
ahn1340 Mar 13, 2019
df2df61
PEP8
ahn1340 Mar 13, 2019
89ed909
Update Readme.md
ahn1340 Mar 13, 2019
d5d4f97
Organize files
Mar 18, 2019
97b10c0
save progress
Mar 18, 2019
b5929c5
Modify plotting script
ahn1340 Mar 27, 2019
7fbd153
Change plotting file name
ahn1340 Mar 27, 2019
e8d1694
Minor docstring change
ahn1340 Mar 27, 2019
659f420
Modify Readme.md
ahn1340 Mar 27, 2019
19781c6
Fix PEP8
ahn1340 Mar 27, 2019
e26a9f8
Fix what Matthias pointed out
ahn1340 Apr 10, 2019
26e7976
fix pep8
ahn1340 Apr 11, 2019
40f78ad
Add kwargs to AutoMLRegressor init
yazanobeidi Apr 25, 2019
8896e85
Merge pull request #597 from ahn1340/herilalaina-nips_reproduce
mfeurer Apr 26, 2019
d350685
Merge pull request #669 from yazanobeidi/patch-1
mfeurer Apr 26, 2019
502e8b4
fix issue with np 1.16.3 which forbids pickle loading
mfeurer May 10, 2019
d7d4e03
Merge pull request #675 from automl/fix_pickle_error
mfeurer May 10, 2019
80dc173
Allow brackets in path arguments
mfeurer May 10, 2019
a548e2b
prepare new release
mfeurer May 10, 2019
5fc9128
Merge pull request #676 from automl/glob_escape
mfeurer May 10, 2019
b3a8667
Merge pull request #677 from automl/prepare_new_release
mfeurer May 10, 2019
f236ded
add missing contributor
mfeurer May 11, 2019
3 changes: 3 additions & 0 deletions .gitignore
@@ -4,6 +4,7 @@ docs/build/*
*.py[cod]

# C extensions
*.c
*.so

# Packages
@@ -46,3 +47,5 @@ download
*.pkl
num_run
number_submission
.pypirc
dmypy.json
2 changes: 1 addition & 1 deletion .travis.yml
@@ -33,7 +33,7 @@ matrix:
- os: linux
env: DISTRIB="conda" COVERAGE="true" DOCPUSH="true" PYTHON="3.6"
- os: linux
env: DISTRIB="conda" $TEST_DIST="true" PYTHON="3.7"
env: DISTRIB="conda" TEST_DIST="true" PYTHON="3.7"
- os: linux
env: DISTRIB="conda" EXAMPLES="true" PYTHON=3.7"
- os: linux
2 changes: 1 addition & 1 deletion autosklearn/__version__.py
@@ -1,4 +1,4 @@
"""Version information."""

# The following line *must* be the last in the module, exactly as formatted:
__version__ = "0.5.1"
__version__ = "0.5.2"
3 changes: 3 additions & 0 deletions autosklearn/automl.py
@@ -1057,6 +1057,9 @@ def predict_proba(self, X, batch_size=None, n_jobs=1):


class AutoMLRegressor(BaseAutoML):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)

def fit(
self,
X: np.ndarray,
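The `AutoMLRegressor` change above (#669) forwards every constructor argument to `BaseAutoML`. A minimal sketch of that forwarding pattern, with hypothetical class and parameter names (not auto-sklearn's API):

```python
class Base:
    def __init__(self, time_left: int = 3600, **extra):
        self.time_left = time_left
        self.extra = extra


class Child(Base):
    def __init__(self, *args, **kwargs):
        # Accept any positional/keyword arguments and forward them unchanged,
        # so Child never falls out of sync with Base's signature.
        super().__init__(*args, **kwargs)


c = Child(time_left=60, seed=1)
print(c.time_left, c.extra)  # 60 {'seed': 1}
```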
38 changes: 24 additions & 14 deletions autosklearn/ensemble_builder.py
@@ -257,13 +257,15 @@ def read_ensemble_preds(self):

if self.shared_mode is False:
pred_path = os.path.join(
self.dir_ensemble,
'predictions_ensemble_%s_*.npy' % self.seed)
glob.escape(self.dir_ensemble),
'predictions_ensemble_%s_*.npy' % self.seed,
)
# pSMAC
else:
pred_path = os.path.join(
self.dir_ensemble,
'predictions_ensemble_*_*.npy')
glob.escape(self.dir_ensemble),
'predictions_ensemble_*_*.npy',
)

y_ens_files = glob.glob(pred_path)
# no validation predictions so far -- no files
@@ -453,13 +455,21 @@ def get_valid_test_preds(self, selected_keys: list):

for k in selected_keys:
valid_fn = glob.glob(
os.path.join(self.dir_valid, 'predictions_valid_%d_%d.npy'
% (self.read_preds[k]["seed"],
self.read_preds[k]["num_run"])))
os.path.join(
glob.escape(self.dir_valid),
'predictions_valid_%d_%d.npy' % (
self.read_preds[k]["seed"],
self.read_preds[k]["num_run"])
)
)
test_fn = glob.glob(
os.path.join(self.dir_test, 'predictions_test_%d_%d.npy' %
(self.read_preds[k]["seed"],
self.read_preds[k]["num_run"])))
os.path.join(
glob.escape(self.dir_test),
'predictions_test_%d_%d.npy' % (
self.read_preds[k]["seed"],
self.read_preds[k]["num_run"])
)
)

# TODO don't read valid and test if not changed
if len(valid_fn) == 0:
@@ -636,11 +646,11 @@ def predict(self, set_: str,

def _read_np_fn(self, fp):
if self.precision is "16":
predictions = np.load(fp).astype(dtype=np.float16)
predictions = np.load(fp, allow_pickle=True).astype(dtype=np.float16)
elif self.precision is "32":
predictions = np.load(fp).astype(dtype=np.float32)
predictions = np.load(fp, allow_pickle=True).astype(dtype=np.float32)
elif self.precision is "64":
predictions = np.load(fp).astype(dtype=np.float64)
predictions = np.load(fp, allow_pickle=True).astype(dtype=np.float64)
else:
predictions = np.load(fp)
predictions = np.load(fp, allow_pickle=True)
return predictions
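For context on the `allow_pickle=True` additions above (the fix for #667/#675): NumPy 1.16.3 changed the default of `np.load` to `allow_pickle=False`, so any array that must be unpickled now fails to load unless the caller opts in. A minimal sketch of the failure mode, using a hypothetical file path:

```python
import numpy as np

# An object array can only be loaded by unpickling it.
arr = np.array([{"acc": 0.9}, {"acc": 0.8}], dtype=object)
np.save("/tmp/preds.npy", arr)  # hypothetical path

try:
    np.load("/tmp/preds.npy")  # NumPy >= 1.16.3 raises ValueError here
except ValueError as err:
    print("refused:", err)

preds = np.load("/tmp/preds.npy", allow_pickle=True)  # explicit opt-in works
```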
20 changes: 12 additions & 8 deletions autosklearn/util/backend.py
@@ -244,7 +244,7 @@ def get_smac_output_directory_for_run(self, seed):

def get_smac_output_glob(self, smac_run_id: Union[str, int] = 1) -> str:
return os.path.join(
self.temporary_directory,
glob.escape(self.temporary_directory),
'smac3-output',
'run_%s' % str(smac_run_id),
)
@@ -265,7 +265,7 @@ def save_targets_ensemble(self, targets):
# number of times where we erronously keep a lock on the ensemble
# targets file although the process already was killed
try:
existing_targets = np.load(filepath)
existing_targets = np.load(filepath, allow_pickle=True)
if existing_targets.shape[0] > targets.shape[0] or \
(existing_targets.shape == targets.shape and
np.allclose(existing_targets, targets)):
@@ -278,7 +278,7 @@ def save_targets_ensemble(self, targets):
with lockfile.LockFile(lock_path):
if os.path.exists(filepath):
with open(filepath, 'rb') as fh:
existing_targets = np.load(fh)
existing_targets = np.load(fh, allow_pickle=True)
if existing_targets.shape[0] > targets.shape[0] or \
(existing_targets.shape == targets.shape and
np.allclose(existing_targets, targets)):
@@ -299,7 +299,7 @@ def load_targets_ensemble(self):
lock_path = filepath + '.lock'
with lockfile.LockFile(lock_path):
with open(filepath, 'rb') as fh:
targets = np.load(fh)
targets = np.load(fh, allow_pickle=True)

return targets

@@ -346,8 +346,9 @@ def save_model(self, model, idx, seed):
def list_all_models(self, seed):
model_directory = self.get_model_dir()
if seed >= 0:
model_files = glob.glob(os.path.join(model_directory,
'%s.*.model' % seed))
model_files = glob.glob(
os.path.join(glob.escape(model_directory), '%s.*.model' % seed)
)
else:
model_files = os.listdir(model_directory)
model_files = [os.path.join(model_directory, mf)
@@ -408,9 +409,11 @@ def load_ensemble(self, seed):
self.logger.warning('Directory %s does not exist' % ensemble_dir)
return None

print(seed)
if seed >= 0:
indices_files = glob.glob(os.path.join(ensemble_dir,
'%s.*.ensemble' % seed))
indices_files = glob.glob(
os.path.join(glob.escape(ensemble_dir), '%s.*.ensemble' % seed)
)
indices_files.sort()
else:
indices_files = os.listdir(ensemble_dir)
@@ -419,6 +422,7 @@

with open(indices_files[-1], 'rb') as fh:
ensemble_members_run_numbers = pickle.load(fh)
print(indices_files)

return ensemble_members_run_numbers

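The `glob.escape` calls added throughout this file and `ensemble_builder.py` (#676) are needed because `*`, `?`, and `[` are glob metacharacters: a temporary or output directory whose path contains brackets would otherwise silently match nothing. A small sketch of the failure and the fix:

```python
import glob
import os
import tempfile

# A directory whose name contains brackets, holding one model file.
base = tempfile.mkdtemp(suffix="[1]")
open(os.path.join(base, "0.1.model"), "w").close()

# Unescaped, "[1]" is read as a character class, so nothing matches.
print(glob.glob(os.path.join(base, "*.model")))               # []
# Escaped, the brackets match literally and the file is found.
print(glob.glob(os.path.join(glob.escape(base), "*.model")))  # ['.../0.1.model']
```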
16 changes: 16 additions & 0 deletions doc/releases.rst
@@ -11,6 +11,22 @@
Releases
========

Version 0.5.2
=============

* FIX #669: Correctly handle arguments to the ``AutoMLRegressor``
* FIX #667: Auto-sklearn works with numpy 1.16.3 again.
* ADD #676: Allow brackets [ ] inside the temporary and output directory paths.
* ADD #424: (Experimental) scripts to reproduce the results from the original Auto-sklearn paper.

Contributors
************

* Jin Woo Ahn
* Herilalaina Rakotoarison
* Matthias Feurer
* yazanobeidi

Version 0.5.1
=============

34 changes: 34 additions & 0 deletions scripts/2015_nips_paper/Readme.md
@@ -0,0 +1,34 @@
## Reproduce results of Efficient and Robust Automated Machine Learning (Feurer et al.)
This folder contains all the scripts necessary to reproduce the results shown in
Figure 3 of Efficient and Robust Automated Machine Learning (Feurer et al.). The scripts
can be modified to include different datasets, change the runtime, etc. They handle
only classification tasks, and balanced accuracy is used as the scoring metric.

### 1. Creating commands.txt
To run the experiment, first create commands.txt by running:
```bash
cd setup
bash create_commands.sh
```
The script can be modified to run experiments with different settings, e.g.
a different runtime and/or different tasks.

### 2. Executing commands.txt
Run each command in commands.txt:
```bash
cd run
bash run_commands.sh
```
Each command in commands.txt first fits the models and then creates the
single-best and ensemble trajectories. The commands are therefore independent and
can be run in parallel on a cluster by modifying run_commands.sh (a minimal
parallel launcher is also sketched below).

### 3. Plotting the results
To plot the results, run:
```bash
cd plot
python3 plot_ranks.py
```
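Since each line of commands.txt is self-contained (step 2 above), a minimal parallel launcher is one hypothetical alternative to editing run_commands.sh; a sketch using a local process pool:

```python
#!/usr/bin/env python3
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Read one shell command per non-empty line of commands.txt.
with open("commands.txt") as fh:
    commands = [line.strip() for line in fh if line.strip()]


def run(cmd: str) -> int:
    # shell=True because each line is a complete shell command.
    return subprocess.run(cmd, shell=True).returncode


# Run up to four commands at a time and collect their exit codes.
with ThreadPoolExecutor(max_workers=4) as pool:
    exit_codes = list(pool.map(run, commands))
print(exit_codes)
```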



165 changes: 165 additions & 0 deletions scripts/2015_nips_paper/plot/plot_ranks.py
@@ -0,0 +1,165 @@
#!/usr/bin/env python3

import csv
import sys
import os

import numpy as np

import pandas as pd
import matplotlib.pyplot as plt


def read_csv(fn, has_header=True, data_type=str):
"""
Function which reads the csv files containing trajectories
of the auto-sklearn runs.
"""
data = list()
header = None
with open(fn, 'r') as csvfile:
csv_reader = csv.reader(csvfile, delimiter=',', quotechar='|')
for row in csv_reader:
if header is None and has_header:
header = row
continue
data.append(list(map(data_type, [i.strip() for i in row])))
return header, data


def fill_trajectory(performance_list, time_list):
# Create n series objects.
series_list = []
for n in range(len(time_list)):
series_list.append(pd.Series(data=performance_list[n], index=time_list[n]))

# Concatenate to one Series with NaN vales.
series = pd.concat(series_list, axis=1)

# Fill missing performance values (NaNs) with last non-NaN value.
series = series.fillna(method='ffill')

# Return the trajectories over all seeds as a DataFrame (one column per run).
return series


def main():
# name of the file where the plot is stored
saveto = "../plot.png"
# runtime of each experiment
max_runtime = 3600
# folder where all trajectories are stored.
working_directory = "../log_output"

# list of models
model_list = ['vanilla', 'ensemble', 'metalearning', 'meta_ensemble']

# list of seeds
seed_dir = os.path.join(working_directory, 'vanilla')
seed_list = [seed for seed in os.listdir(seed_dir)]

# list of tasks
vanilla_task_dir = os.path.join(seed_dir, seed_list[0])
task_list = [task_id for task_id in os.listdir(vanilla_task_dir)]

# Step 1. Merge all trajectories into one Dataframe object.
#####################################################################################
all_trajectories = []

for model in model_list:
trajectories = []
for task_id in task_list:
csv_files = []

for seed in seed_list:
# collect all csv files of different seeds for current model and
# current task.
if model in ['vanilla', 'ensemble']:
csv_file = os.path.join(working_directory,
'vanilla',
seed,
task_id,
"score_{}.csv".format(model)
)

elif model in ['metalearning', 'meta_ensemble']:
csv_file = os.path.join(working_directory,
'metalearning',
seed,
task_id,
"score_{}.csv".format(model),
)
csv_files.append(csv_file)

performance_list = []
time_list = []

# Get data from csv
for fl in csv_files:
_, csv_data = read_csv(fl, has_header=True)
csv_data = np.array(csv_data)
# Replace overly large values with sys.maxsize
data = [min([sys.maxsize, float(i.strip())]) for i in
csv_data[:, 2]] # test trajectories are stored in third column

time_steps = [float(i.strip()) for i in csv_data[:, 0]]
assert time_steps[0] == 0

performance_list.append(data)
time_list.append(time_steps)

# trajectory is the pd.Series object containing all seed runs of the
# current model and current task.
trajectory = fill_trajectory(performance_list, time_list)
trajectories.append(trajectory)

# list[list[pd.Series]]
all_trajectories.append(trajectories)

# Step 2. Compute average ranks of the trajectories.
#####################################################################################
all_rankings = []
n_iter = 500 # number of bootstrap samples to use for estimating the ranks.
n_tasks = len(task_list)

for i in range(n_iter):
pick = np.random.choice(all_trajectories[0][0].shape[1],
size=(len(model_list)))

for j in range(n_tasks):
all_trajectories_tmp = pd.DataFrame(
{model_list[k]: at[j].iloc[:, pick[k]] for
k, at in enumerate(all_trajectories)}
)
all_trajectories_tmp = all_trajectories_tmp.fillna(method='ffill', axis=0)
r_tmp = all_trajectories_tmp.rank(axis=1)
all_rankings.append(r_tmp)

final_ranks = []
for i, model in enumerate(model_list):
ranks_for_model = []
for ranking in all_rankings:
ranks_for_model.append(ranking.loc[:, model])
ranks_for_model = pd.DataFrame(ranks_for_model)
ranks_for_model = ranks_for_model.fillna(method='ffill', axis=1)
final_ranks.append(ranks_for_model.mean(skipna=True))

# Step 3. Plot the average ranks over time.
#####################################################################################
for i, model in enumerate(model_list):
X_data = []
y_data = []
for x, y in final_ranks[i].iteritems():
X_data.append(x)
y_data.append(y)
X_data.append(max_runtime)
y_data.append(y)
plt.plot(X_data, y_data, label=model)
plt.xlabel('time [sec]')
plt.ylabel('average rank')
plt.legend()
plt.savefig(saveto)


if __name__ == "__main__":
main()
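A usage sketch for `fill_trajectory` above, with made-up numbers: trajectories recorded at different time stamps are aligned on the union of their time indices, and gaps are forward-filled with the last known value.

```python
# Assumes fill_trajectory from the script above is in scope.
performance_list = [[1.0, 0.5, 0.3], [0.9, 0.4]]
time_list = [[0.0, 10.0, 30.0], [0.0, 20.0]]

print(fill_trajectory(performance_list, time_list))
#         0    1
# 0.0   1.0  0.9
# 10.0  0.5  0.9
# 20.0  0.5  0.4
# 30.0  0.3  0.4
```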