Add popper workflow for comparing xlearn vs liblinear for the higgs demo #332

6 changes: 5 additions & 1 deletion demo/classification/higgs/README.rst
@@ -5,4 +5,8 @@ You can find the full data from this here (`Link`__)

The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator. The last seven features are functions of the first 21 features; these are high-level features derived by physicists to help discriminate between the two classes. There is an interest in using deep learning methods to obviate the need for physicists to manually develop such features. Benchmark results using Bayesian Decision Trees from a standard physics package and 5-layer neural networks are presented in the original paper. The last 500,000 examples are used as a test set.

.. __: https://archive.ics.uci.edu/ml/datasets/HIGGS

Popper
******
A performance validation test in the ``popper`` folder compares the liblinear and xLearn libraries. Its workflow automatically downloads the data set, runs the benchmark, and shows the results in a chart.

.. __: https://archive.ics.uci.edu/ml/datasets/HIGGS
23 changes: 23 additions & 0 deletions demo/classification/higgs/popper/Dockerfile
@@ -0,0 +1,23 @@
FROM python:3.7-slim-buster as base

ENV USER=root

# install build dependencies and python libs to run benchmarks
# (the PyPI package that provides sklearn is named scikit-learn)
RUN apt-get update && \
    apt-get install -y cmake g++ git curl && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* && \
    pip install --no-cache-dir scikit-learn pandas==1.0.4

# install liblinear from source, pinned to a fixed commit
RUN git clone https://github.com/cjlin1/liblinear /opt/liblinear && \
    cd /opt/liblinear/python && \
    git checkout f41e72c && \
    make -j4
ENV PYTHONPATH=/opt/liblinear/python

# install xlearn from source
COPY . /xlearn
RUN cd /xlearn && \
    ls -l && \
    ./build.sh && \
    rm -r /xlearn
44 changes: 44 additions & 0 deletions demo/classification/higgs/popper/README.md
@@ -0,0 +1,44 @@
# Performance Validation Workflow with HIGGS
## Using Popper

[Popper](https://github.com/systemslab/popper) is a tool for defining and executing container-native workflows in Docker, as well as other container engines. More details about Popper can be found [here](https://popper.readthedocs.io/).

## Description

This folder contains a `wf.yml` file that defines a Popper workflow. The workflow automatically downloads and verifies the complete [HIGGS data set](https://archive.ics.uci.edu/ml/datasets/HIGGS) from UCI (11 million entries), runs a benchmark that compares the liblinear library with xLearn, and finally generates a report with a chart that shows the results, including error bars.
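
The download-and-verify step can be illustrated with a minimal Python sketch. It assumes that "verifying" means checking that every row has 29 columns (one label plus the 28 features listed in the data set description); the workflow's own check may differ. `count_valid_rows` and `EXPECTED_COLUMNS` are hypothetical names used only for this illustration.

```python
import csv
import gzip
import io

# Assumption: 1 label column + 28 feature columns, per the HIGGS description.
EXPECTED_COLUMNS = 29


def count_valid_rows(stream, expected_columns=EXPECTED_COLUMNS):
    """Count CSV rows, raising ValueError on the first row with the wrong width."""
    count = 0
    for number, row in enumerate(csv.reader(stream), start=1):
        if len(row) != expected_columns:
            raise ValueError(
                f"row {number}: expected {expected_columns} columns, got {len(row)}"
            )
        count += 1
    return count


if __name__ == "__main__":
    # Tiny in-memory stand-in for HIGGS.csv.gz, so the sketch runs offline.
    line = ",".join(["1.0"] + ["0.5"] * 28) + "\n"
    payload = gzip.compress((line * 3).encode("utf-8"))
    with gzip.open(io.BytesIO(payload), mode="rt", newline="") as handle:
        print(count_valid_rows(handle))
```

For the real file, the same function could be pointed at `gzip.open("HIGGS.csv.gz", "rt", newline="")` and the returned count compared with the expected number of entries.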

The benchmark measures the performance of each library by running the following tasks five times:
- Load the data set with the help of [Pandas](https://pandas.pydata.org/).
- Train the linear model.
- Predict.
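
The repeated-run measurement described above can be sketched in Python as follows. This is only an illustration: the workflow itself times each run from shell scripts in whole seconds, whereas this sketch uses `time.perf_counter`, and `time_runs` and `write_report` are hypothetical helper names. The `time,library` column layout matches the header written by `run_benchmark.sh`.

```python
import csv
import io
import time


def time_runs(task, library, runs=5):
    """Call task() `runs` times and return one (seconds, library) row per run."""
    rows = []
    for _ in range(runs):
        start = time.perf_counter()
        task()  # stands in for: load data, train the model, predict
        rows.append((time.perf_counter() - start, library))
    return rows


def write_report(rows, out):
    """Write the rows as CSV with a `time,library` header."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["time", "library"])
    writer.writerows(rows)


if __name__ == "__main__":
    # Placeholder workload, not a real benchmark.
    rows = time_runs(lambda: sum(range(100_000)), "example")
    buffer = io.StringIO()
    write_report(rows, buffer)
    print(buffer.getvalue())
```

Keeping one row per run, rather than a single average, is what lets the report show the spread between runs.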

This is an example of how the chart looks:

![report](https://user-images.githubusercontent.com/33427324/86541248-39be6a00-bec0-11ea-8961-132951ac028f.png)
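
To show where a bar height and its error bar come from, here is a minimal sketch that computes the mean and sample standard deviation per library from a `time,library` report. It is an assumption-laden illustration: seaborn's `barplot` draws a bootstrapped confidence interval by default rather than a standard deviation, so the numbers will not match the chart exactly, and `summarize` is a hypothetical helper name. The timings in the example are made up.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarize(report):
    """Return {library: (mean, sample_stdev)} from a `time,library` CSV stream."""
    times = defaultdict(list)
    for row in csv.DictReader(report):
        times[row["library"]].append(float(row["time"]))
    return {
        lib: (
            statistics.mean(values),
            # stdev needs at least two samples; report 0.0 for a single run
            statistics.stdev(values) if len(values) > 1 else 0.0,
        )
        for lib, values in times.items()
    }


if __name__ == "__main__":
    # Made-up timings for illustration only, not measured results.
    sample = io.StringIO("time,library\n10,a\n12,a\n14,a\n20,b\n20,b\n")
    for lib, (mean, stdev) in summarize(sample).items():
        print(f"{lib}: mean={mean:.1f}s stdev={stdev:.1f}s")
```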
### Instructions:

1. Clone the repository.
```
git clone https://github.com/aksnzhy/xlearn.git
```

2. Install [docker](https://docs.docker.com/get-docker/).

3. Install the `popper` tool.
```
curl -sSfL https://raw.githubusercontent.com/getpopper/popper/master/install.sh | sh
```
4. Run the workflow.
```
cd xlearn/
popper run -f demo/classification/higgs/popper/wf.yml
```
To run a single step instead of the whole workflow, append the name of the step to the command, as in the following example.
```
popper run -f demo/classification/higgs/popper/wf.yml prepare-data
```
When a step fails, you can debug it by opening an interactive shell, instead of updating the YAML file and invoking `popper run` again.
```
popper sh -f demo/classification/higgs/popper/wf.yml prepare-data
```
The example above opens a shell inside the container for that step, where you can run commands manually. More information can be found [here](https://popper.readthedocs.io/en/latest/sections/getting_started.html#run-your-workflow).
76 changes: 76 additions & 0 deletions demo/classification/higgs/popper/csv2libsvm.py
@@ -0,0 +1,76 @@
#!/usr/bin/env python

"""
Convert CSV file to libsvm format. Works only with numeric variables.
Put -1 as label index (argv[3]) if there are no labels in your file.
Expecting no headers. If present, headers can be skipped with argv[4] == 1.

source: https://stackoverflow.com/questions/23170152/converting-csv-file-to-libsvm-compatible-data-file-using-python

"""

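import sys
import csv


def construct_line(label, line, labels_dict):
    """Build one libsvm-formatted line from a label and its feature values."""
    new_line = []
    if label.isnumeric():
        # Integer labels are written as-is; zero is normalized to "0".
        if float(label) == 0.0:
            label = "0"
        new_line.append(label)
    else:
        # Any other label (e.g. "1.0e+00" or a class name) is mapped to an
        # integer id in order of first appearance.
        if label in labels_dict:
            new_line.append(labels_dict.get(label))
        else:
            label_id = str(len(labels_dict))
            labels_dict[label] = label_id
            new_line.append(label_id)

    for i, item in enumerate(line):
        # libsvm is a sparse format: empty and zero-valued features are omitted.
        if item == '' or float(item) == 0.0:
            continue
        elif item == 'NaN':
            item = "0.0"
        new_item = "%s:%s" % (i + 1, item)
        new_line.append(new_item)
    new_line = " ".join(new_line)
    new_line += "\n"
    return new_line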

# ---

input_file = sys.argv[1]
try:
    output_file = sys.argv[2]
except IndexError:
    output_file = input_file + ".out"

try:
    label_index = int(sys.argv[3])
except IndexError:
    label_index = 0

try:
    # argv[4] arrives as a string, so compare it explicitly: a bare truth
    # test would treat any non-empty value, even "0" or "False", as true.
    skip_headers = sys.argv[4] == '1'
except IndexError:
    skip_headers = False

with open(input_file, 'rt', newline='') as i, open(output_file, 'wb') as o:
    reader = csv.reader(i)

    if skip_headers:
        headers = next(reader)

    labels_dict = {}
    for line in reader:
        if label_index == -1:
            label = '1'
        else:
            label = line.pop(label_index)

        new_line = construct_line(label, line, labels_dict)
        o.write(new_line.encode('utf-8'))
28 changes: 28 additions & 0 deletions demo/classification/higgs/popper/run_benchmark.sh
@@ -0,0 +1,28 @@
#!/bin/bash
set -ex

timestamp=$(date "+%Y%m%d-%H%M%S")
results_dir="results/$timestamp"
report_file="results/$timestamp/report.csv"

if [ -f "$report_file" ]; then
    rm -f "$report_file"
fi

# Generate the output directory
if [ ! -d "$results_dir" ]; then
    mkdir -p "./$results_dir"
    chmod -R 777 "./$results_dir"
fi

echo time,library >> "$report_file"
# Run the training 5 times
counter=1
while [ $counter -le 5 ]
do
    # Each script is sourced so that it can set $result (elapsed seconds)
    . ./run_xlearn.sh
    echo "$result,xlearn" >> "$report_file"
    . ./run_liblinear.sh
    echo "$result,liblinear" >> "$report_file"
    counter=$(( counter+1 ))
done
6 changes: 6 additions & 0 deletions demo/classification/higgs/popper/run_higgs_liblinear.py
@@ -0,0 +1,6 @@
from liblinearutil import *

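# Read data in LIBSVM format (libsvm.data is the file written by the
# prepare-data step of wf.yml)
y, x = svm_read_problem('libsvm.data')
# Train on the first 8,800,000 rows: L2-regularized logistic regression (-s 0)
# with cost 4 (-c 4) and a bias term (-B 1)
m = train(y[:8800000], x[:8800000], '-s 0 -c 4 -B 1')
# Evaluate on the remaining rows
p_label, p_acc, p_val = predict(y[8800000:], x[8800000:], m)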
47 changes: 47 additions & 0 deletions demo/classification/higgs/popper/run_higgs_xlearn.py
@@ -0,0 +1,47 @@
# Import dataset
import numpy as np
import pandas as pd
import xlearn as xl
from sklearn.model_selection import train_test_split

# Load dataset
higgs = pd.read_csv("HIGGS.csv", header=None, sep=",")

X = higgs[higgs.columns[1:]]
y = higgs[0]

# Split train and test set
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.2)

# DMatrix transition
xdm_train = xl.DMatrix(x_train, y_train)
xdm_test = xl.DMatrix(x_test, y_test)

# Training task
linear_model = xl.create_linear() # Use linear model
linear_model.setTrain(xdm_train) # Training data
linear_model.setValidate(xdm_test) # Validation data

# param:
#  0. binary classification task
#  1. learning rate: 0.2
#  2. regular lambda: 0.002
#  3. evaluation metric: acc
param = {'task':'binary', 'lr':0.2,
         'lambda':0.002, 'metric':'acc'}

# Start to train
# The trained model will be stored in model_dm.out
linear_model.fit(param, './model_dm.out')

# Prediction task
linear_model.setTest(xdm_test)  # Test data
linear_model.setSigmoid()       # Convert output to 0-1

# Start to predict
# No output path is set, so the result is returned as a numpy.ndarray
res = linear_model.predict("./model_dm.out")

print(res)

24 changes: 24 additions & 0 deletions demo/classification/higgs/popper/run_liblinear.sh
@@ -0,0 +1,24 @@
#!/bin/sh
set -ex

# start timing
start=$(date +%s)
start_fmt=$(date +%Y-%m-%d\ %r)
echo "STARTING TIMING RUN AT $start_fmt"

# run benchmark

echo "running benchmark"

python3 run_higgs_liblinear.py

# end timing
end=$(date +%s)
end_fmt=$(date +%Y-%m-%d\ %r)
echo "ENDING TIMING RUN AT $end_fmt"

# report result
result=$(( $end - $start ))
result_name="liblinear"

echo "RESULT,$result_name,$result,$USER,$start_fmt"
25 changes: 25 additions & 0 deletions demo/classification/higgs/popper/run_xlearn.sh
@@ -0,0 +1,25 @@
#!/bin/sh
set -ex

# start timing
start=$(date +%s)
start_fmt=$(date +%Y-%m-%d\ %r)
echo "STARTING TIMING RUN AT $start_fmt"

# run benchmark

echo "running benchmark"

python3 run_higgs_xlearn.py

# end timing
end=$(date +%s)
end_fmt=$(date +%Y-%m-%d\ %r)
echo "ENDING TIMING RUN AT $end_fmt"

# report result
result=$(( $end - $start ))
result_name="xlearn"

echo "RESULT,$result_name,$result,$USER,$start_fmt"

18 changes: 18 additions & 0 deletions demo/classification/higgs/popper/show_results.py
@@ -0,0 +1,18 @@
#!/usr/bin/env python

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import glob

list_reports = glob.glob("results/*/report.csv")
dir_list = glob.glob("results/*")
list_reports.sort()
dir_list.sort()

results = pd.read_csv(list_reports[-1], sep=",")

sns.barplot(x = 'library', y = 'time', data = results)
plt.title('Performance of the libraries with HIGGS dataset')
plt.savefig(dir_list[-1] + "/report.png")
plt.show()
45 changes: 45 additions & 0 deletions demo/classification/higgs/popper/wf.yml
@@ -0,0 +1,45 @@
steps:

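- id: build-img
  uses: docker://docker:19.03.10
  args:
  - build
  - --tag=xlearn
  # The build context (.) is the repository root, so the Dockerfile path is
  # given relative to it
  - --file=demo/classification/higgs/popper/Dockerfile
  - .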

- id: download-data
  uses: docker://byrnedo/alpine-curl:0.1.8
  runs: [sh]
  dir: /workspace/demo/classification/higgs/popper
  args:
  - -c
  - |
    set -ex
    if [ ! -f HIGGS.csv ]; then
      curl -LO https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
      if [ -f HIGGS.csv.gz ]; then
        gunzip HIGGS.csv.gz
      fi
    fi

- id: prepare-data
  uses: docker://python:3.7
  dir: /workspace/demo/classification/higgs/popper
  runs: [python3]
  # HIGGS.csv has no header row, so the optional skip-headers argument is omitted
  args: ['csv2libsvm.py','HIGGS.csv','libsvm.data','0']

- id: run-benchmark
  uses: docker://xlearn
  dir: /workspace/demo/classification/higgs/popper
  skip_pull: true
  runs: [sh]
  args: [run_benchmark.sh]

- id: show-results
  uses: docker://jupyter/scipy-notebook:latest
  dir: /workspace/demo/classification/higgs/popper
  runs: [python3]
  args: [show_results.py]

21 changes: 21 additions & 0 deletions demo/docker/Dockerfile
@@ -0,0 +1,21 @@
FROM python:3.7-slim-buster as base

RUN apt-get update && \
    apt-get install -y cmake g++ && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

COPY . /xlearn

# build from source (installs to /usr/local/lib/python3.7/site-packages/)
RUN cd /xlearn && \
    ls -l && \
    ./build.sh && \
    rm -r /xlearn

# install other library (installs to same site-packages path)
RUN pip install --no-cache-dir liblinear==2.11.2

# create an image without build dependencies
FROM python:3.7-slim-buster AS lib
ENV USER=root
# copy the installed packages into the matching site-packages directory
COPY --from=base /usr/local/lib/python3.7/site-packages/ /usr/local/lib/python3.7/site-packages/