Contextual Memory Tree #1799

Merged
merged 57 commits into from
Jun 5, 2019
Merged
Show file tree
Hide file tree
Changes from 21 commits
Commits
Show all changes
57 commits
Select commit Hold shift + click to select a range
6cf9eb7
setup with the latest vw; online and offline aloi results can be repr…
Mar 2, 2019
bc5ba44
wiki online script
Mar 2, 2019
79ce8d2
wiki offline few shots script
Mar 2, 2019
c6a28ec
readme
Mar 2, 2019
145e681
.
Mar 2, 2019
87563f2
.
Mar 2, 2019
1c5c9a8
scripts updated
Mar 2, 2019
b8f051e
Merge branch 'master' of github.com:LAIRLAB/vowpal_wabbit
Mar 2, 2019
f1f7d55
seperated multilabel and multiclass
Mar 3, 2019
52071c2
updated xml part
Mar 3, 2019
b5de530
.
Mar 3, 2019
788bbb7
multilabel classification scripts
Mar 3, 2019
3e190c1
fixed loaded bug in multilabel setting
Mar 5, 2019
f664ac9
Merge remote-tracking branch 'upstream/master'
Mar 7, 2019
c30d7ee
a fix of nan prediction: initialized the ec.l.simple
Mar 8, 2019
cd7f9d1
update readme
Mar 8, 2019
630ae03
Merge remote-tracking branch 'upstream/master'
Mar 20, 2019
ff39bab
scripts added to demo
Mar 20, 2019
8b6f5c4
updates on scripts
Mar 20, 2019
ac6c146
Merge branch 'master' into master
JohnLangford Apr 1, 2019
8f8577e
Merge branch 'master' into master
JohnLangford Apr 1, 2019
5c3848e
fixed some comments
May 31, 2019
2feb347
remove the unique feature function and added sort feature to wikipara…
Jun 1, 2019
7c8c91d
sort namespace indices and then walk through two sorted indices to av…
Jun 1, 2019
0a413a4
avoided double loop in computing hamming loss
Jun 2, 2019
66e234c
random seed, name changed on descent and insert example rew
Jun 3, 2019
140102f
merge from upstream
Jun 3, 2019
caebc2b
add memory tree cc in cmakelist
Jun 3, 2019
8313f76
got rid of write it define in memory tree file, putted it in io buf h…
Jun 3, 2019
276d7ce
allocated a space in memory tree for designing kprod example, and fre…
Jun 3, 2019
b19bf61
Update vowpalwabbit/memory_tree.cc
Jun 4, 2019
401a9d3
Update vowpalwabbit/memory_tree.cc
Jun 4, 2019
b974b12
Update vowpalwabbit/memory_tree.cc
Jun 4, 2019
89b33ee
Update vowpalwabbit/memory_tree.cc
Jun 4, 2019
7fc9820
typo
Jun 4, 2019
0b3e1ba
Merge branch 'master' into master
JohnLangford Jun 4, 2019
5d7017b
Update vowpalwabbit/memory_tree.cc
Jun 4, 2019
654df26
Update vowpalwabbit/memory_tree.cc
Jun 4, 2019
b47360b
Update vowpalwabbit/memory_tree.cc
Jun 4, 2019
b684fcc
lower case alpha in demo scripts
Jun 4, 2019
6c34d8d
added two tests (online and offline) for cmt
Jun 4, 2019
affeac6
Update test/RunTests
Jun 4, 2019
cf463ea
Update test/RunTests
Jun 4, 2019
daaed75
Update test/RunTests
Jun 4, 2019
c48317f
Update test/RunTests
Jun 4, 2019
a2bdfc7
Update test/RunTests
Jun 4, 2019
d88479a
staged stderr files in train set ref folder and deleted time output i…
Jun 4, 2019
6c64c73
decrease problem (smaller rcv1) and solution size (bit 15)
Jun 4, 2019
8d3a40d
updates on stderr files
Jun 4, 2019
c6adcf6
ignore cache file
Jun 4, 2019
5bc6c87
dealt with some initilization
Jun 5, 2019
cf0ca60
.
Jun 5, 2019
802fefb
merge
Jun 5, 2019
90ef274
memory leak
Jun 5, 2019
563175f
memory leak..
Jun 5, 2019
d02a187
Update test/RunTests
JohnLangford Jun 5, 2019
645aada
Update test/RunTests
JohnLangford Jun 5, 2019
34 changes: 34 additions & 0 deletions demo/memory_tree/README.md
@@ -0,0 +1,34 @@
Contextual Memory Tree (CMT)
===============================

This demo exercises CMT for logarithmic-time multiclass classification (online and offline)
and logarithmic-time multilabel classification.


The multiclass datasets are [ALOI](http://aloi.science.uva.nl/) and WikiPara. ALOI
has 1000 classes with, on average, 100 training examples per class. WikiPara
contains 10000 classes. We consider two versions of WikiPara here: a 1-shot version with
1 training example per class, and a 2-shot version with 2 training examples per class.

The multilabel datasets are RCV1-2K, AmazonCat-13K, and Wiki10-31K from the Extreme Classification [repository](http://manikvarma.org/downloads/XC/XMLRepository.html).

We refer users to the [manuscript](https://arxiv.org/pdf/1807.06473.pdf) for the detailed data structures and algorithms of CMT.
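
As a rough intuition (a simplified sketch, not the repository's C++ implementation: the node class, `router`, and `score` functions below are all hypothetical), a contextual memory tree routes a query from the root to a leaf through learned routers and then scans the memories stored at that leaf:

```python
# Simplified sketch of CMT-style query routing; all names are hypothetical.
class Node:
    def __init__(self):
        self.left = None        # child nodes (None for a leaf)
        self.right = None
        self.memories = []      # (features, label) pairs stored at a leaf

def route(node, x, router):
    """Descend to a leaf: each internal node's router scores x left/right."""
    while node.left is not None:
        node = node.left if router(node, x) < 0 else node.right
    return node

def query(root, x, router, score):
    """Return the stored memory at the reached leaf that best matches x."""
    leaf = route(root, x, router)
    return max(leaf.memories, key=lambda m: score(x, m[0]), default=None)

# toy usage with trivial router/score functions
root = Node()
root.left, root.right = Node(), Node()
root.right.memories = [([1.0], "cat"), ([0.2], "dog")]
best = query(root, [0.9], router=lambda n, x: +1,
             score=lambda a, b: -abs(a[0] - b[0]))
```

Because the descent touches only one root-to-leaf path, querying is logarithmic in the number of stored memories, which is the property the demo exploits.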

## Dependencies:
Python 3

## Training Online Contextual Memory Tree on ALOI and WikiPara:
```bash
python aloi_script_progerror.py
python wikipara10000_script_progerror.py
```

## Training Offline Contextual Memory Tree on ALOI, WikiPara, RCV1-2K, AmazonCat-13K and Wiki10-31K:
```bash
python aloi_script.py
python wikipara10000_script.py
python xml_rcv1x.script.py
python xml_amazoncat_13K_script.py
python xml_wiki10.script.py
```
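
All of these scripts follow the same pattern: assemble one `vw` command line for training and one for testing. A minimal sketch of that pattern (the tree size and file names here are placeholders; the real scripts compute `tree_node` from the dataset size):

```python
# hypothetical inputs; the demo scripts derive tree_node from the dataset size
train_data, test_data, tree_node = "aloi_train.vw", "aloi_test.vw", 1000
saved_model = "{}.vw".format(train_data)

train_cmd = ("../../build/vowpalwabbit/vw {} --memory_tree {} "
             "-c --passes 3 --holdout_off -f {}").format(train_data, tree_node, saved_model)
test_cmd = "../../build/vowpalwabbit/vw {} -i {}".format(test_data, saved_model)

print(train_cmd)  # inspect the command before running it
# os.system(train_cmd); os.system(test_cmd)  # run it as the demo scripts do
```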

56 changes: 56 additions & 0 deletions demo/memory_tree/aloi_script.py
@@ -0,0 +1,56 @@
import os
import time
import numpy as np


print("## perform experiments on aloi ##")
num_of_classes = 1000
leaf_example_multiplier = 4  # 8
shots = 100
lr = 0.001
bits = 29
alpha = 0.1  # 0.3
passes = 3  # 5
use_oas = 0
dream_at_update = 0
learn_at_leaf = 1  # turning learn_at_leaf on actually works better
num_queries = 5  # int(np.log(passes*num_of_classes*shots))
loss = "squared"
dream_repeats = 3
online = 0

# number of internal nodes, sized so each leaf holds roughly
# log2(#examples) * leaf_example_multiplier examples
tree_node = int(2 * passes * (num_of_classes * shots /
                              (np.log(num_of_classes * shots) / np.log(2) * leaf_example_multiplier)))

train_data = "aloi_train.vw"
test_data = "aloi_test.vw"
if not os.path.exists(train_data):
    os.system("wget http://kalman.ml.cmu.edu/wen_datasets/{}".format(train_data))
if not os.path.exists(test_data):
    os.system("wget http://kalman.ml.cmu.edu/wen_datasets/{}".format(test_data))

saved_model = "{}.vw".format(train_data)

print("## Training...")
start = time.time()
command_train = ("../../build/vowpalwabbit/vw {} --memory_tree {} --learn_at_leaf {} "
                 "--max_number_of_labels {} --dream_at_update {} --dream_repeats {} "
                 "--oas {} --online {} --leaf_example_multiplier {} --Alpha {} "
                 "-l {} -b {} -c --passes {} --loss_function {} --holdout_off -f {}".format(
                     train_data, tree_node, learn_at_leaf, num_of_classes, dream_at_update,
                     dream_repeats, use_oas, online, leaf_example_multiplier,
                     alpha, lr, bits, passes, loss, saved_model))
print(command_train)
os.system(command_train)
train_time = time.time() - start

# test:
print("## Testing...")
start = time.time()
os.system("../../build/vowpalwabbit/vw {} -i {}".format(test_data, saved_model))
test_time = time.time() - start

print("## train time {}, and test time {}".format(train_time, test_time))




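The `tree_node` formula in the script above sizes the tree so that each leaf holds on the order of `log2(N) * leaf_example_multiplier` examples. A quick sanity check with the ALOI numbers (a standalone check, not part of the demo):

```python
import numpy as np

# ALOI settings from the script above
num_of_classes, shots, passes, leaf_example_multiplier = 1000, 100, 3, 4
N = num_of_classes * shots  # 100000 training examples

# same expression as in the script
tree_node = int(2 * passes * (N / (np.log(N) / np.log(2) * leaf_example_multiplier)))
# with N = 100000, log2(N) is about 16.6, so tree_node lands on the order of 9000
```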
56 changes: 56 additions & 0 deletions demo/memory_tree/aloi_script_progerror.py
@@ -0,0 +1,56 @@
import os
import time
import numpy as np


print("## perform experiments on aloi ##")
num_of_classes = 1000
leaf_example_multiplier = 10
shots = 100
lr = 0.001
bits = 29
alpha = 0.1  # 0.3
passes = 1  # 3, 5
use_oas = 0
dream_at_update = 0
learn_at_leaf = 1  # turning learn_at_leaf on actually works better
loss = "squared"
dream_repeats = 20  # 3
online = 1

# number of internal nodes, sized as in aloi_script.py
tree_node = int(2 * passes * (num_of_classes * shots /
                              (np.log(num_of_classes * shots) / np.log(2) * leaf_example_multiplier)))

train_data = "aloi_train.vw"
test_data = "aloi_test.vw"
if not os.path.exists(train_data):
    os.system("wget http://kalman.ml.cmu.edu/wen_datasets/{}".format(train_data))
if not os.path.exists(test_data):
    os.system("wget http://kalman.ml.cmu.edu/wen_datasets/{}".format(test_data))

saved_model = "{}.vw".format(train_data)

print("## Training...")
start = time.time()
os.system("../../build/vowpalwabbit/vw {} --memory_tree {} --learn_at_leaf {} "
          "--max_number_of_labels {} --dream_at_update {} --dream_repeats {} "
          "--oas {} --online {} --leaf_example_multiplier {} --Alpha {} "
          "-l {} -b {} -c --passes {} --loss_function {} --holdout_off -f {}".format(
              train_data, tree_node, learn_at_leaf, num_of_classes, dream_at_update,
              dream_repeats, use_oas, online, leaf_example_multiplier,
              alpha, lr, bits, passes, loss, saved_model))
train_time = time.time() - start

# In the online setting there is no separate test pass; vw reports
# progressive error during training.
print("## train time {}".format(train_time))





62 changes: 62 additions & 0 deletions demo/memory_tree/wikipara10000_script.py
@@ -0,0 +1,62 @@
import os
import time
import numpy as np


available_shots = {"three": 3, "one": 1}

for shot, shots in available_shots.items():
    print("## perform experiments on {}-shot wikipara-10K ##".format(shot))
    num_of_classes = 10000
    leaf_example_multiplier = 4  # 2
    lr = 0.1
    bits = 29  # 30
    passes = 2  # 1
    alpha = 0.1
    learn_at_leaf = 1
    use_oas = 0
    dream_at_update = 1
    dream_repeats = 5
    loss = "squared"
    online = 0

    # number of internal nodes, sized as in the ALOI scripts
    tree_node = int(2 * passes * (num_of_classes * shots /
                                  (np.log(num_of_classes * shots) / np.log(2) * leaf_example_multiplier)))

    train_data = "paradata10000_{}_shot.vw.train".format(shot)
    test_data = "paradata10000_{}_shot.vw.test".format(shot)
    if not os.path.exists(train_data):
        os.system("wget http://kalman.ml.cmu.edu/wen_datasets/{}".format(train_data))
    if not os.path.exists(test_data):
        os.system("wget http://kalman.ml.cmu.edu/wen_datasets/{}".format(test_data))

    saved_model = "{}.vw".format(train_data)

    print("## Training...")
    start = time.time()
    os.system("../../build/vowpalwabbit/vw {} --memory_tree {} --learn_at_leaf {} "
              "--max_number_of_labels {} --oas {} --online {} --dream_at_update {} "
              "--leaf_example_multiplier {} --dream_repeats {} "
              "--Alpha {} -l {} -b {} -c --passes {} --loss_function {} --holdout_off -f {}".format(
                  train_data, tree_node, learn_at_leaf, num_of_classes, use_oas, online,
                  dream_at_update, leaf_example_multiplier, dream_repeats,
                  alpha, lr, bits, passes, loss, saved_model))
    train_time = time.time() - start

    # test:
    print("## Testing...")
    start = time.time()
    os.system("../../build/vowpalwabbit/vw {} -i {}".format(test_data, saved_model))
    test_time = time.time() - start

    print("## train time {}, and test time {}".format(train_time, test_time))





60 changes: 60 additions & 0 deletions demo/memory_tree/wikipara10000_script_progerror.py
@@ -0,0 +1,60 @@
import os
import time
import numpy as np


# available_shots = {"three": 3, "one": 1}
available_shots = {"three": 3}

for shot, shots in available_shots.items():
    print("## perform experiments on {}-shot wikipara-10K ##".format(shot))
    num_of_classes = 10000
    leaf_example_multiplier = 10  # 2
    lr = 0.1
    bits = 29  # 30
    passes = 1  # 2
    alpha = 0.1
    learn_at_leaf = 0
    use_oas = 0
    dream_at_update = 1
    dream_repeats = 15
    loss = "squared"
    online = 1

    # number of internal nodes, sized as in the ALOI scripts
    tree_node = int(2 * passes * (num_of_classes * shots /
                                  (np.log(num_of_classes * shots) / np.log(2) * leaf_example_multiplier)))

    train_data = "paradata10000_{}_shot.vw.train".format(shot)
    test_data = "paradata10000_{}_shot.vw.test".format(shot)
    if not os.path.exists(train_data):
        os.system("wget http://kalman.ml.cmu.edu/wen_datasets/{}".format(train_data))
    if not os.path.exists(test_data):
        os.system("wget http://kalman.ml.cmu.edu/wen_datasets/{}".format(test_data))

    saved_model = "{}.vw".format(train_data)

    print("## Training...")
    start = time.time()
    os.system("../../build/vowpalwabbit/vw {} --memory_tree {} --learn_at_leaf {} "
              "--max_number_of_labels {} --oas {} --online {} --dream_at_update {} "
              "--leaf_example_multiplier {} --dream_repeats {} "
              "--Alpha {} -l {} -b {} -c --passes {} --loss_function {} --holdout_off -f {}".format(
                  train_data, tree_node, learn_at_leaf, num_of_classes, use_oas, online,
                  dream_at_update, leaf_example_multiplier, dream_repeats,
                  alpha, lr, bits, passes, loss, saved_model))
    train_time = time.time() - start

    # In the online setting there is no separate test pass; vw reports
    # progressive error during training.
    print("## train time {}".format(train_time))





54 changes: 54 additions & 0 deletions demo/memory_tree/xml_amazoncat_13K_script.py
@@ -0,0 +1,54 @@
import os
import time
import numpy as np


print("perform experiments on amazoncat 13K (multilabel)")
leaf_example_multiplier = 2
lr = 1
bits = 30
alpha = 0.1  # 0.3
passes = 4
learn_at_leaf = 1
use_oas = 1
dream_at_update = 1
loss = "squared"
dream_repeats = 3

num_examples = 1186239
max_num_labels = 13330

# number of internal nodes, sized from the training-set size
tree_node = int(num_examples / (np.log(num_examples) / np.log(2) * leaf_example_multiplier))

train_data = "amazoncat_train.mat.mult_label.vw.txt"
test_data = "amazoncat_test.mat.mult_label.vw.txt"
if not os.path.exists(train_data):
    os.system("wget http://kalman.ml.cmu.edu/wen_datasets/{}".format(train_data))
if not os.path.exists(test_data):
    os.system("wget http://kalman.ml.cmu.edu/wen_datasets/{}".format(test_data))

saved_model = "{}.vw".format(train_data)

print("## Training...")
start = time.time()
os.system("../../build/vowpalwabbit/vw {} --memory_tree {} --learn_at_leaf {} --dream_at_update {} "
          "--max_number_of_labels {} --dream_repeats {} --oas {} "
          "--leaf_example_multiplier {} --Alpha {} -l {} -b {} -c --passes {} "
          "--loss_function {} --holdout_off -f {}".format(
              train_data, tree_node, learn_at_leaf, dream_at_update,
              max_num_labels, dream_repeats, use_oas,
              leaf_example_multiplier, alpha, lr, bits, passes, loss, saved_model))
train_time = time.time() - start

print("## Testing...")
start = time.time()
os.system("../../build/vowpalwabbit/vw {} --oas {} -i {}".format(test_data, use_oas, saved_model))
test_time = time.time() - start
print("## train time {}, and test time {}".format(train_time, test_time))

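In the multilabel scripts, `--oas` (one-against-some) is passed at both train and test time so the model scores label sets rather than a single class. The hamming loss mentioned in the commit history ("avoided double loop in computing hamming loss") can be illustrated on toy label sets (a hedged illustration of the metric itself, not vw's exact reporting):

```python
def hamming_loss(predicted, actual):
    """Count labels present in exactly one of the two sets (symmetric difference)."""
    return len(set(predicted) ^ set(actual))

# label 3 is predicted but absent, label 4 is present but missed: loss of 2
loss = hamming_loss([1, 3, 5], [1, 4, 5])
```

Using set operations avoids the double loop over predicted and actual labels that the commit above removed.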
53 changes: 53 additions & 0 deletions demo/memory_tree/xml_rcv1x.script.py
@@ -0,0 +1,53 @@
import os
import time
import numpy as np


print("perform experiments on rcv1x (multilabel)")
leaf_example_multiplier = 2
lr = 0.1
bits = 30
alpha = 0.1
passes = 6  # 4
learn_at_leaf = 1
use_oas = 1
dream_at_update = 0  # 1
loss = "squared"
dream_repeats = 3

num_examples = 630000
max_num_labels = 2456

# number of internal nodes, sized from the training-set size
tree_node = int(num_examples / (np.log(num_examples) / np.log(2) * leaf_example_multiplier))

train_data = "rcv1x_train.mat.mult_label.vw.txt"
test_data = "rcv1x_test.mat.mult_label.vw.txt"
if not os.path.exists(train_data):
    os.system("wget http://kalman.ml.cmu.edu/wen_datasets/{}".format(train_data))
if not os.path.exists(test_data):
    os.system("wget http://kalman.ml.cmu.edu/wen_datasets/{}".format(test_data))

saved_model = "{}.vw".format(train_data)

print("## Training...")
start = time.time()
os.system("../../build/vowpalwabbit/vw {} --memory_tree {} --learn_at_leaf {} --dream_at_update {} "
          "--max_number_of_labels {} --dream_repeats {} --oas {} "
          "--leaf_example_multiplier {} --Alpha {} -l {} -b {} -c --passes {} "
          "--loss_function {} -f {}".format(
              train_data, tree_node, learn_at_leaf, dream_at_update,
              max_num_labels, dream_repeats, use_oas,
              leaf_example_multiplier, alpha, lr, bits, passes, loss, saved_model))
train_time = time.time() - start

print("## Testing...")
start = time.time()
os.system("../../build/vowpalwabbit/vw {} --oas {} -i {}".format(test_data, use_oas, saved_model))
test_time = time.time() - start
print("## train time {}, and test time {}".format(train_time, test_time))
