# Coresets On Bacalhau 


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bacalhau-project/examples/blob/main/Coreset/BIDS/index.ipynb)
[![Open In Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/bacalhau-project/examples/HEAD?labpath=miscellaneous/Coreset/index.ipynb)

## **Introduction**

[Coreset](https://arxiv.org/abs/2011.09384) is a data subsetting method. Uncompressed datasets can get very large, and training on them becomes harder because training time increases with dataset size. To reduce training time and save costs, we use the coreset method, which can also be applied to other datasets.

A coreset offers functionality similar to that of the whole dataset.

![](https://i.imgur.com/AQDLMXn.png)

In this case, we use the coreset method to speed up solving the k-means problem on big data while keeping accuracy high.

We construct a small coreset for arbitrary shapes of numerical data with a decent time cost. The implementation was mainly based on the coreset construction algorithm that was proposed by Braverman et al. (SODA 2021).
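To make the idea concrete, here is a minimal sketch of a coreset for k-means. It does not reproduce the Braverman et al. construction used by the repository. It swaps in the simpler "lightweight coreset" importance sampling of Bachem et al. (2018): each point is sampled with a probability that mixes a uniform term and its squared distance to the data mean, and receives a weight that is the inverse of its expected number of draws. The function names and the synthetic data are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lightweight_coreset(points, m, seed=0):
    """Sample m weighted points (lightweight coreset, NOT the Braverman et al. method)."""
    rng = random.Random(seed)
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    dists = [squared_distance(p, mean) for p in points]
    total = sum(dists)
    if total > 0:
        # Half uniform, half proportional to squared distance from the mean.
        probs = [0.5 / n + 0.5 * d / total for d in dists]
    else:
        probs = [1.0 / n] * n
    idx = rng.choices(range(n), weights=probs, k=m)
    sample = [points[i] for i in idx]
    # Inverse-probability weights keep the weighted cost an unbiased estimate.
    weights = [1.0 / (m * probs[i]) for i in idx]
    return sample, weights


def kmeans_cost(points, centers, weights=None):
    """Weighted sum of squared distances from each point to its nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(
        w * min(squared_distance(p, c) for c in centers)
        for p, w in zip(points, weights)
    )


# Synthetic 2-D data, invented purely for this illustration.
data_rng = random.Random(42)
data = [(data_rng.gauss(9.5, 0.05), data_rng.gauss(47.1, 0.05)) for _ in range(2000)]
centers = [(9.48, 47.08), (9.55, 47.15)]

coreset, weights = lightweight_coreset(data, m=200)
full_cost = kmeans_cost(data, centers)
approx_cost = kmeans_cost(coreset, centers, weights)
print(f"full cost: {full_cost:.4f}, coreset estimate: {approx_cost:.4f}")
```

The weighted cost on the 200 sampled points stands in for the cost on all 2000 points, which is what allows k-means to be run on the much smaller set.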


## **Running Locally**

Clone the repo that contains the code:


In [None]:
%%bash
git clone https://github.com/js-ts/Coreset




Downloading the dataset

OpenStreetMap is a public repository that aims to generate and distribute accessible geographic data for the whole world. It supplies detailed position information, including the longitude and latitude of places around the world.

The dataset is an osm.pbf file (the compressed format for .osm files). It can be downloaded from the [Geofabrik Download Server](https://download.geofabrik.de/).


In [None]:
%%bash
# -O (uppercase) sets the output file; lowercase -o would write wget's log to that path instead
wget https://download.geofabrik.de/europe/liechtenstein-latest.osm.pbf -O liechtenstein-latest.osm.pbf


Installing the Linux dependencies


In [None]:
%%bash
# Chain the commands with && so each one runs separately and stops on the first failure
sudo apt-get -y update && \
sudo apt-get -y install osmium-tool && \
sudo apt-get -y install libpq-dev gdal-bin libgdal-dev libxml2-dev libxslt-dev


Installing Python Dependencies


In [None]:
%%bash
pip3 install -r Coreset/requirements.txt

Running coreset locally

Convert from compressed pbf format to geojson format

In [None]:
%%bash
osmium export liechtenstein-latest.osm.pbf -o liechtenstein-latest.geojson

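The exported file is a GeoJSON `FeatureCollection` in which each feature carries a geometry whose coordinates are ordered as longitude, then latitude. How `coreset.py` reads this file is not shown in this guide, so the sketch below only illustrates what the data looks like and how point coordinates could be pulled out with the standard library. The helper name `point_coordinates` and the inline sample are assumptions.

```python
import json


def point_coordinates(geojson_text):
    """Return (lon, lat) pairs for every Point feature in a FeatureCollection."""
    collection = json.loads(geojson_text)
    coords = []
    for feature in collection.get("features", []):
        geometry = feature.get("geometry") or {}
        if geometry.get("type") == "Point":
            # GeoJSON stores positions as [longitude, latitude, (optional) elevation].
            lon, lat = geometry["coordinates"][:2]
            coords.append((lon, lat))
    return coords


# Tiny hand-written sample; a real export contains many more features.
sample = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {},
         "geometry": {"type": "Point", "coordinates": [9.5342551, 47.1020112]}},
        {"type": "Feature", "properties": {},
         "geometry": {"type": "LineString",
                      "coordinates": [[9.52, 47.11], [9.53, 47.12]]}},
    ],
})
print(point_coordinates(sample))
```

Only the `Point` feature is returned here; lines and polygons would need their own handling.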


Running the Python script to generate the coreset

In [None]:
%%bash
python Coreset/python/coreset.py -f liechtenstein-latest.geojson



Building the docker container

In this step you will create a `Dockerfile` for your Docker deployment. The `Dockerfile` is a text document that contains the commands used to assemble the image.

First, create the `Dockerfile`.

Next, add your desired configuration to the `Dockerfile`. These commands specify how the image will be built, and what extra requirements will be included.

Dockerfile


```
FROM python:3.8

RUN apt-get -y update && apt-get -y install osmium-tool && apt update && apt-get -y install libpq-dev gdal-bin libgdal-dev libxml2-dev libxslt-dev

ADD Coreset Coreset

ADD monaco-latest.geojson .

RUN cd Coreset && pip3 install -r requirements.txt
```


We use the `python:3.8` image and run the same dependency-installation commands that we used locally. We also add the files and directories that are present on our local machine, then install the Python requirements.

To build the Docker container, run the `docker build` command:


```
docker build -t <hub-user>/<repo-name>:<tag> .
```


Please replace:

`<hub-user>` with your Docker Hub username. If you don't have a Docker Hub account, [follow these instructions to create one](https://docs.docker.com/docker-id/) and use the username of the account you created.

`<repo-name>` with the name of the container. You can name it anything you want.

`<tag>` with a tag of your choice. This is not required, but you can use the `latest` tag.

After you have built the container, the next step is to test it locally and then push it to Docker Hub.

Now you can push this repository to the registry designated by its name or tag.


```
docker push <hub-user>/<repo-name>:<tag>
```


After the image has been pushed to Docker Hub, we can use the container to run jobs on Bacalhau.


## Running on Bacalhau

COMMAND


```
bacalhau docker run \
-v QmXuatKaWL24CwrBPC9PzmLW8NGjgvBVJfk6ZGCWUGZgCu:/input \
jsace/coreset \
-- /bin/bash -c 'osmium export input/liechtenstein-latest.osm.pbf -o liechtenstein-latest.geojson;
python Coreset/python/coreset.py -f liechtenstein-latest.geojson -o outputs'
```


Backend: the Docker backend is used to run the job.

Input dataset: upload the .osm.pbf file that you want to use as a dataset to IPFS and pass its CID here. The `-v` flag mounts it to a folder inside the container so that the script can use it.

Image: a custom Docker image (it has osmium, Python, and the requirements for the script installed).

Command:

Convert the osm.pbf dataset to GeoJSON (the dataset is stored in the input volume folder):


```
osmium export input/liechtenstein-latest.osm.pbf -o liechtenstein-latest.geojson
```


Run the script. The `-f` flag takes the path of the GeoJSON file produced by the step above:


```
python Coreset/python/coreset.py -f liechtenstein-latest.geojson -o outputs
```


We get the output in stdout

Additional parameters:

-k: number of initialized centers (default=5)

-n: size of the coreset (default=50)

-o: the folder where you want to store your outputs
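The flags above can be pictured as a small command-line interface. The sketch below is a hypothetical mirror built with `argparse`, not the parser that `coreset.py` actually uses. The defaults for `-k` and `-n` are taken from the list above, while the default for `-o` is an assumption, since this guide does not state one.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags of coreset.py."""
    parser = argparse.ArgumentParser(description="Illustrative mirror of the coreset.py flags")
    parser.add_argument("-f", required=True, help="path to the input GeoJSON file")
    parser.add_argument("-k", type=int, default=5, help="number of initialized centers")
    parser.add_argument("-n", type=int, default=50, help="size of the coreset")
    # The default below is an assumption made for this sketch.
    parser.add_argument("-o", default="outputs", help="folder for the output files")
    return parser


args = build_parser().parse_args(["-f", "liechtenstein-latest.geojson", "-k", "8"])
print(args.f, args.k, args.n, args.o)
```

Passing `-k 8` overrides the number of centers, while the coreset size falls back to its default of 50.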

Installing Bacalhau

In [None]:
%%bash
curl -sL https://get.bacalhau.org/install.sh | bash

Your system is linux_amd64

BACALHAU CLI is detected:
Client Version: v0.2.5
Server Version: v0.2.5
Reinstalling BACALHAU CLI - /usr/local/bin/bacalhau...
Getting the latest BACALHAU CLI...
Installing v0.2.5 BACALHAU CLI...
Downloading https://github.com/filecoin-project/bacalhau/releases/download/v0.2.5/bacalhau_v0.2.5_linux_amd64.tar.gz ...
Downloading sig file https://github.com/filecoin-project/bacalhau/releases/download/v0.2.5/bacalhau_v0.2.5_linux_amd64.tar.gz.signature.sha256 ...
Verified OK
Extracting tarball ...
NOT verifying Bin
bacalhau installed into /usr/local/bin successfully.
Client Version: v0.2.5
Server Version: v0.2.5


In [None]:
%%bash
echo $(bacalhau docker run --id-only --wait --wait-timeout-secs 1000 -v QmXuatKaWL24CwrBPC9PzmLW8NGjgvBVJfk6ZGCWUGZgCu:/input jsace/coreset -- /bin/bash -c 'osmium export input/liechtenstein-latest.osm.pbf -o liechtenstein-latest.geojson; python Coreset/python/coreset.py -f liechtenstein-latest.geojson -o outputs') > job_id.txt
cat job_id.txt

339d24aa-743c-4af9-8ebb-d09b5590730e
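If you prefer to submit the job from a script rather than a notebook cell, the same command line can be assembled in Python. This is a minimal sketch that reuses only the flags shown above. The helper name `build_submit_command` is an assumption, and the actual submission is left commented out because it requires the `bacalhau` CLI on your `PATH`.

```python
import shlex


def build_submit_command(cid, image, shell_command, timeout_secs=1000):
    """Build the argument list for the `bacalhau docker run` call shown above."""
    return [
        "bacalhau", "docker", "run",
        "--id-only", "--wait", "--wait-timeout-secs", str(timeout_secs),
        "-v", f"{cid}:/input",  # mount the IPFS CID at /input
        image,
        "--", "/bin/bash", "-c", shell_command,
    ]


command = build_submit_command(
    cid="QmXuatKaWL24CwrBPC9PzmLW8NGjgvBVJfk6ZGCWUGZgCu",
    image="jsace/coreset",
    shell_command=(
        "osmium export input/liechtenstein-latest.osm.pbf -o liechtenstein-latest.geojson; "
        "python Coreset/python/coreset.py -f liechtenstein-latest.geojson -o outputs"
    ),
)
print(shlex.join(command))

# To actually submit the job (requires the bacalhau CLI):
# import subprocess
# job_id = subprocess.run(command, check=True, capture_output=True, text=True).stdout.strip()
```

Passing the arguments as a list avoids a second layer of shell quoting around the `/bin/bash -c` string.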



Running the commands will output a UUID (like `54506541-4eb9-45f4-a0b1-ea0aecd34b3e`). This is the ID of the job that was created. You can check the status of the job with the following command:


In [None]:
%%bash
bacalhau list --id-filter $(cat job_id.txt)

 CREATED   ID        JOB                      STATE      VERIFIED  PUBLISHED
 10:52:06  339d24aa  Docker jsace/coreset...  Completed            /ipfs/QmQ31zBAKJqcc5...



When the `STATE` column reads `Completed` and an IPFS path appears under `PUBLISHED`, the job is done and we can get the results.

To find out more information about your job, run the following command:

In [None]:
%%bash
bacalhau describe $(cat job_id.txt)

JobAPIVersion: ""
ID: 339d24aa-743c-4af9-8ebb-d09b5590730e
RequesterNodeID: QmdZQ7ZbhnvWY1J12XYKGHApJ6aufKyLNSvf8jZBrBaAVL
ClientID: 5efb801527e7b02e1f071fabfc282986241f76b80d0e3ad0d82dad0837294474
Spec:
    Engine: 2
    Verifier: 1
    Publisher: 4
    Docker:
        Image: jsace/coreset
        Entrypoint:
            - /bin/bash
            - -c
            - osmium export input/liechtenstein-latest.osm.pbf -o liechtenstein-latest.geojson; python Coreset/python/coreset.py -f liechtenstein-latest.geojson -o outputs
    inputs:
        - Engine: 1
          Cid: QmXuatKaWL24CwrBPC9PzmLW8NGjgvBVJfk6ZGCWUGZgCu
          path: /input
    outputs:
        - Engine: 1
          Name: outputs
          path: /outputs
    Sharding:
        BatchSize: 1
        GlobPatternBasePath: /inputs
Deal:
    Concurrency: 1
CreatedAt: 2022-10-02T10:52:06.591904351Z
JobState:
    Nodes:
        QmYgxZiySj3MRkwLSL4X2MF5F9f2PMhAE3LV49XkfNL1o3:
            Shards:
                0:
                    N

Since no error is reported and the state of our job is complete, we can download the results. First, we create a temporary directory to save them.

In [None]:
%%bash
mkdir -p results


To download the results of your job, run the following command:

In [None]:
%%bash
bacalhau get  $(cat job_id.txt)  --output-dir results

10:56:15.402 | INF bacalhau/get.go:67 > Fetching results of job '339d24aa-743c-4af9-8ebb-d09b5590730e'...
2022/10/02 10:56:16 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/lucas-clemente/quic-go/wiki/UDP-Receive-Buffer-Size for details.
10:56:26.315 | INF ipfs/downloader.go:115 > Found 1 result shards, downloading to temporary folder.
10:56:31.334 | INF ipfs/downloader.go:195 > Combining shard from output volume 'outputs' to final location: '/content/results'


After the download has finished, you should see the following contents in the results directory:

In [None]:
%%bash
ls results/

shards	stderr	stdout	volumes
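To see every file under the downloaded directory rather than only the top level, a short walk with `pathlib` is enough. The sketch below builds a throwaway directory that imitates part of the layout so that it can run anywhere. The helper name `list_result_files` is an assumption, and you would call it with `"results"` on your own machine.

```python
import tempfile
from pathlib import Path


def list_result_files(root):
    """Return the sorted relative paths of all files below root."""
    root = Path(root)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())


# Imitate a small part of the results layout in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    outputs = Path(tmp) / "volumes" / "outputs"
    outputs.mkdir(parents=True)
    (outputs / "centers.csv").write_text("lat,long\n")
    (Path(tmp) / "stdout").write_text("")
    demo_listing = list_result_files(tmp)

print(demo_listing)
```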


Viewing the output CSV files

In [None]:
%%bash
cat results/volumes/outputs/centers.csv | head -n 10

lat,long
9.5342551,47.1020112
9.53747591608732,47.21394087505187
9.52950185,47.1150105
9.507580397758776,47.06997243435313
9.5125655,47.17233984999999


In [None]:
%%bash
cat results/volumes/outputs/coreset-values-liechtenstein-latest.csv | head -n 10

9.654311561365327421e+00,4.751590749417925963e+01
9.513394350375845576e+00,4.707597406564644160e+01
9.548034159454211078e+00,4.708135142425457786e+01
9.546953000000000245e+00,4.711704789999999576e+01
9.656645984032927288e+00,4.751462491221921169e+01
9.658719169316757558e+00,4.751368737901290729e+01
9.548072499999999962e+00,4.722114349999999661e+01
9.523127485222827815e+00,4.714089508645799498e+01
9.509368300000000218e+00,4.706821285000000188e+01
9.660201312308860366e+00,4.751053105944505006e+01


In [None]:
%%bash
cat results/volumes/outputs/coreset-weights-liechtenstein-latest.csv | head -n 10

1.799991151164056724e+00
1.184127214484931756e+03
1.010669193783661740e+03
2.092230760981649837e+03
1.799998215566914528e+00
1.799999994087179589e+00
2.537794141864215817e+03
1.926616704844339438e+03
1.164522728189803502e+03
1.799995045638392410e+00
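The three files can be combined to evaluate a weighted k-means cost on the coreset. The sketch below assumes that row i of the weights file belongs to row i of the values file, which this guide does not state explicitly, and it uses a few rows copied from the outputs above instead of reading from disk. The helper names are assumptions.

```python
import csv
import io


def read_points(text, has_header):
    """Parse CSV text into a list of float tuples, skipping the header if present."""
    rows = list(csv.reader(io.StringIO(text)))
    if has_header:
        rows = rows[1:]
    return [tuple(float(v) for v in row) for row in rows if row]


def weighted_cost(points, weights, centers):
    """Sum over points of weight times squared distance to the nearest center."""
    return sum(
        w * min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        for p, w in zip(points, weights)
    )


# Rows copied from the outputs shown above.
centers_csv = "lat,long\n9.5342551,47.1020112\n9.52950185,47.1150105\n"
values_csv = (
    "9.513394350375845576e+00,4.707597406564644160e+01\n"
    "9.546953000000000245e+00,4.711704789999999576e+01\n"
)
weights_csv = "1.184127214484931756e+03\n2.092230760981649837e+03\n"

centers = read_points(centers_csv, has_header=True)
points = read_points(values_csv, has_header=False)
weights = [row[0] for row in read_points(weights_csv, has_header=False)]

print(f"total weight: {sum(weights):.1f}")
print(f"weighted k-means cost: {weighted_cost(points, weights, centers):.6f}")
```

To run it on the real outputs, read the text of each file under `results/volumes/outputs/` and pass it to `read_points` in the same way.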



Sources

[1] [http://proceedings.mlr.press/v97/braverman19a/braverman19a.pdf](http://proceedings.mlr.press/v97/braverman19a/braverman19a.pdf)

[2] [https://aaltodoc.aalto.fi/bitstream/handle/123456789/108293/master_Wu_Xiaobo_2021.pdf?sequence=2](https://aaltodoc.aalto.fi/bitstream/handle/123456789/108293/master_Wu_Xiaobo_2021.pdf?sequence=2)


In [None]:
%%bash
bacalhau describe $(cat job_id.txt) --spec > job.yaml

In [None]:
%%bash
cat job.yaml