Commit: Integrate custom template into main usage

abelsiqueira committed Aug 15, 2023
1 parent a5382f3 commit 05d73b5
Showing 17 changed files with 209 additions and 278 deletions.
4 changes: 4 additions & 0 deletions .gitignore
env
output
scripts
synergy

credentials.yml
tmp*
jobs.sh.part*
124 changes: 113 additions & 11 deletions 40-kubernetes.md
All the `.yml` files that you need to run below are inside the `k8-config` folder,
The Dockerfiles and scripts are inside `code`.
Remember to change to the correct folder as necessary.

## Specific preparation

First, follow the specific guides to set up your local computer or cluster:

- [Single computer](41-kubernetes-single-computer.md)
- [Cloud provider](42-kubernetes-cloud-provider.md)

Run the following command taken from the [RabbitMQ Cluster Operator](https://www.rab) docs:

```bash
kubectl apply -f "https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml"
```
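
You can check that the operator is running before moving on (assuming the default installation, which goes into the `rabbitmq-system` namespace):

```bash
kubectl get pods -n rabbitmq-system
```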

## Start RabbitMQ configuration

Run

```bash
kubectl apply -f rabbitmq.yml
```

Check that the `rabbitmq-server-0` pod starts running after a minute or two:

```bash
kubectl -n asreview-cloud get pods
```
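
You should see something like the following (illustrative output; timings will differ):

```
NAME                READY   STATUS    RESTARTS   AGE
rabbitmq-server-0   1/1     Running   0          2m
```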

## S3 storage (_Optional step_)
To change that, edit [tasker.sh](code/tasker.sh).
The [tasker.sh](code/tasker.sh) defines everything that will be executed by the tasker, and indirectly by the workers.
The [tasker.Dockerfile](code/tasker.Dockerfile) will create the image that will be executed in the tasker pod.
You can modify these as you see fit.

The default commands used inside the tasker script and Dockerfile assume that you are:

- simulating using data from a `data` folder.
- running various settings, classifiers, and/or feature extractors.
- running a custom ARFI template.
- aggregating all `jobs.sh` files into a single one.

### Data

If you are providing the data, create a `data` folder inside the `code` folder and put your csv files in there.
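
For example, assuming your datasets are in `~/my-datasets` (a hypothetical path):

```bash
mkdir -p code/data
cp ~/my-datasets/*.csv code/data/
```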

> **Warning**
>
> The default tasker assumes that a data folder exists with your data.
> Make sure to either provide the data or change the tasker and Dockerfile.
> Don't skip this step: you either need to create the data folder or change the tasker script and Dockerfile accordingly.

If, instead, you want to use the Synergy dataset, edit [tasker.Dockerfile](code/tasker.Dockerfile) and look for the relevant lines.

### Settings, classifiers and feature extractors

As in the use case ["Running many jobs.sh files one after the other"](30-many-jobs.md), each line of the file [makita-args.txt](code/makita-args.txt) contains a different setting that you can pass to the asreview command.

By default, we are running `-m logistic -e tfidf` and `-m nb -e tfidf`.
Edit the file if you want to change or add more.
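
For example, to also test a random forest classifier (an illustrative extra line; any options accepted by `asreview simulate` work here), append:

```bash
echo "-m rf -e tfidf" >> makita-args.txt
```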

### Custom ARFI template

We also assume that we are running a custom ARFI template [custom_arfi.txt.template](code/custom_arfi.txt.template).
The template contains placeholder values related to the settings mentioned in the section above.
The placeholder `SETTINGS_PLACEHOLDER` will be substituted with each line of the [makita-args.txt](code/makita-args.txt) file.
The placeholder `SETTINGS_DIR` is used to create a folder one level above the data.
By default, the value of `SETTINGS_DIR` is equal to `SETTINGS_PLACEHOLDER`, except that spaces are replaced by `_`.
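
For example, here is a sketch of the substitution for the first line of the default [makita-args.txt](code/makita-args.txt):

```bash
SETTINGS="-m logistic -e tfidf"
SETTINGS_DIR="${SETTINGS// /_}"
echo "$SETTINGS_DIR"  # prints -m_logistic_-e_tfidf
```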

This template also removes some unnecessary lines for our case (such as creating images and aggregating the results).

Furthermore, it runs an extra command, `rm -f ...`, to remove the `.asreview` project file after use.
This keeps disk usage from growing to absurd proportions.

Finally, it chains three commands on the same line, to ensure that the same worker runs them in order (see the sketch after this list):

- simulate (which creates the project file);
- create metrics using the project file;
- delete the project file.
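
A minimal sketch of the resulting line, with hypothetical paths (the real lines are generated from the template with the full output folder structure):

```bash
asreview simulate data/dataset.csv -m nb -e tfidf -s output/sim.asreview && \
    asreview metrics output/sim.asreview -o output/metrics.json && \
    rm -f output/sim.asreview
```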

### Aggregating all jobs.sh into a single jobs.sh file

Instead of following ["Running many jobs.sh files one after the other"](30-many-jobs.md), we want to parallelize across different jobs files as well.
To do that, we aggregate all `jobs.sh` files into a single one.
Then, when we split the file, the simulation calls from all jobs are sent to the workers at the same time.
This allows scaling the number of workers even further.
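The `while` loop added to [tasker.sh](code/tasker.sh), shown near the bottom of this commit, implements this aggregation.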

To keep things organized, we create an additional folder level before the dataset, as described in the custom template section above.

### Build and push

After you are done with modifications, compile and push the image:

```bash
docker build -t YOURUSER/tasker -f tasker.Dockerfile .
docker push YOURUSER/tasker
```
## Prepare the worker script and Docker image

The [worker.sh](code/worker.sh) script simply runs [worker-receiver.py](code/worker-receiver.py).
You can do other things before that, but tasks that are meant to be run before **all** workers start working should go in [tasker.sh](code/tasker.sh).
The [worker-receiver.py](code/worker-receiver.py) runs continuously, waiting for new tasks from the tasker.

```bash
docker build -t YOURUSER/worker -f worker.Dockerfile .
docker push YOURUSER/worker
```

> **Note**
>
> We have created a small script that builds and pushes both images called [build-and-push.sh](code/build-and-push.sh).
> You can run it with `bash build-and-push.sh YOURUSER`.

## Running the workers

The file [worker.yml](k8-config/worker.yml) contains the configuration of the deployment of the workers.
Change the `image` to reflect the path to the image that you pushed.

> **Warning**
>
> Did you change the image?

You can select the number of `replicas` to change the number of workers.
Pay attention to the resource limits, and change as you see fit.

Logging as ...

Similarly, the [tasker.yml](k8-config/tasker.yml) allows you to run the tasker as a Kubernetes job.
Change the `image`, and optionally add a `ttlSecondsAfterFinished` to auto-delete the job; I prefer to keep it until I review the log.
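
A minimal sketch of where that field goes (the field comes from the Kubernetes `batch/v1` Job API; the job name and the rest of the spec are abbreviated assumptions):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tasker
  namespace: asreview-cloud
spec:
  ttlSecondsAfterFinished: 86400  # delete the job one day after it finishes
  # ... remaining spec as in tasker.yml
```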

> **Warning**
>
> Did you change the image?

Run

```bash
kubectl apply -f tasker.yml
```

Similarly, you should see a `tasker` pod, and you can follow its log.

## Retrieving the output

You can copy the `output` folder from the volume with

```bash
kubectl -n asreview-cloud cp asreview-worker-FULL-NAME:/app/workdir/output ./output
```
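
If you don't know the full pod name, list the pods first:

```bash
kubectl -n asreview-cloud get pods
```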

Also, check the `/app/workdir/issues` folder.
It collects errors that occur while running the simulate code, so it should be empty.
If it is not empty, the offending lines will be listed there.

### If you used NFS

If you used an NFS server, the easiest way to retrieve the output is to mount it locally.
Run the following command in a terminal:

```bash
kubectl -n asreview-cloud port-forward nfs-server-FULL-NAME 2049
```

In another terminal, run

```bash
mkdir asreview-storage
sudo mount -v -o vers=4,loud localhost:/ asreview-storage
```

Copy things out as necessary.
When you're done, run

```bash
sudo umount asreview-storage
```

And hit CTRL-C on the running `kubectl port-forward` command.

## Deleting and restarting

If you plan to make modifications to the tasker or the worker, the corresponding job or deployment has to be deleted first.
Expand Down
20 changes: 8 additions & 12 deletions 41-kubernetes-single-computer.md
```bash
minikube start --cpus CPU_NUMBER --memory HOW_MUCH_MEMORY
```
The `CPU_NUMBER` argument is the number of CPUs you want to dedicate to `minikube`.
The `HOW_MUCH_MEMORY` argument is how much memory to dedicate to it.

## Create a namespace for asreview things

The configuration files use the namespace `asreview-cloud` by default, so if you want to change it, you need to change it in the file below and in all other places that have `# namespace: asreview-cloud`.

```bash
kubectl apply -f asreview-cloud-namespace.yml
```
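
You can verify that the namespace was created:

```bash
kubectl get namespace asreview-cloud
```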

## Create a volume

To share data between the workers and the tasker, and to keep that data after the run, we need to create a volume.
```yaml
volumes:
  persistentVolumeClaim:
    claimName: asreview-storage
```

33 changes: 8 additions & 25 deletions 42-kubernetes-cloud-provider.md
You can check the guide for [Single computer](41-kubernetes-single-computer.md),
You have to configure access to the cluster, and since that depends on the cloud provider, I will leave that to you.
Please remember that all commands will assume that you are connecting to the cluster, which might involve additional flags to pass your credentials.

## Create a namespace for asreview things

The configuration files use the namespace `asreview-cloud` by default, so if you want to change it, you need to change it in the file below and in all other places that have `# namespace: asreview-cloud`.

```bash
kubectl apply -f asreview-cloud-namespace.yml
```

## Create a volume

To share data between the worker and taskers, and to keep that data after using it, we need to create a volume.
```yaml
volumes:
  server: NFS_SERVICE_IP
  path: "/"
```

20 changes: 20 additions & 0 deletions code/build-and-push.sh
#!/bin/bash

YOURUSER=$1

if [ -z "$YOURUSER" ]; then
    echo "ERROR: Missing YOURUSER. Run 'bash build-and-push.sh YOURUSER'"
    exit 1
fi

for f in worker tasker
do
    if ! docker build -t "$YOURUSER/$f" -f $f.Dockerfile .; then
        echo "ERROR building docker image"
        exit 1
    fi
    if ! docker push "$YOURUSER/$f"; then
        echo "ERROR pushing docker image"
        exit 1
    fi
done
29 changes: 29 additions & 0 deletions code/custom_arfi.txt.template
---
name: ARFI-settings
name_long: All Relevant, Fixed Irrelevant, with settings

scripts:
- get_plot.py
- merge_descriptives.py
- merge_metrics.py
- merge_tds.py

docs:
- README.md

---
#!/bin/bash
{# This is a template for the ARFI method #}
# version {{ version }}

{% for dataset in datasets %}
mkdir -p {{ output_folder }}/simulation/SETTINGS_DIR/{{ dataset.input_file_stem }}/metrics
mkdir -p {{ output_folder }}/simulation/SETTINGS_DIR/{{ dataset.input_file_stem }}/descriptives
asreview data describe {{ dataset.input_file }} -o {{ output_folder }}/simulation/SETTINGS_DIR/{{ dataset.input_file_stem }}/descriptives/data_stats_{{ dataset.input_file_stem }}.json
mkdir -p {{ output_folder }}/simulation/SETTINGS_DIR/{{ dataset.input_file_stem }}/state_files

{% for prior in dataset.priors %}
asreview simulate {{ dataset.input_file }} SETTINGS_PLACEHOLDER -s {{ output_folder }}/simulation/SETTINGS_DIR/{{ dataset.input_file_stem }}/state_files/sim_{{ dataset.input_file_stem }}_{{ prior[0] }}.asreview --prior_record_id {{ " ".join(prior) }} --seed {{ dataset.model_seed }} && asreview metrics {{ output_folder }}/simulation/SETTINGS_DIR/{{ dataset.input_file_stem }}/state_files/sim_{{ dataset.input_file_stem }}_{{ prior[0] }}.asreview -o {{ output_folder }}/simulation/SETTINGS_DIR/{{ dataset.input_file_stem }}/metrics/metrics_sim_{{ dataset.input_file_stem }}_{{ prior[0] }}.json && rm -f {{ output_folder }}/simulation/SETTINGS_DIR/{{ dataset.input_file_stem }}/state_files/sim_{{ dataset.input_file_stem }}_{{ prior[0] }}.asreview
{% endfor %}

{% endfor %}
2 changes: 2 additions & 0 deletions code/makita-args.txt
-m logistic -e tfidf
-m nb -e tfidf
17 changes: 9 additions & 8 deletions code/tasker.Dockerfile
-FROM ghcr.io/asreview/asreview:v1.2
+FROM ghcr.io/asreview/asreview:v1.2.1

RUN apt-get update && \
-    apt-get install -y curl ca-certificates amqp-tools python \
+    apt-get install -y curl ca-certificates amqp-tools python3 \
    --no-install-recommends \
    && rm -rf /var/lib/apt/lists/* \
    && pip install pika

+#### Don't modify above this line
+# Alternative 1: Copy your data folder
+COPY data /app/data
+# Alternative 2: Install and use synergy-dataset
+# RUN pip install synergy-dataset
+# RUN mkdir -p /app/data
+# RUN synergy get -l -o ./app/data

-# This is necessary until a new asreview-makita is released and the asreview image is updated
-RUN apt-get update && \
-    apt-get install -y git \
-    --no-install-recommends \
-    && rm -rf /var/lib/apt/lists/* \
-    && pip install asreview-makita
+#### Don't modify below this line

+COPY ./custom_arfi.txt.template /app/custom_arfi.txt.template
+COPY ./makita-args.txt /app/makita-args.txt
COPY ./split-file.py /app/split-file.py
COPY ./tasker-send.py /app/tasker-send.py
COPY ./tasker.sh /app/tasker.sh
16 changes: 12 additions & 4 deletions code/tasker.sh
rm -rf /app/workdir/*
# Copy files from the parent folder for the workdir.
cp ../*.sh ../*.py ./
cp -r ../data ./
+cp ../custom_arfi.txt.template ./
+cp ../makita-args.txt ./

# Create a logging function
function log {
# Run makita
log "Running makita"

-asreview makita template arfi -f jobs.sh
+echo "" > all-jobs.sh
+while read -r SETTINGS
+do
+    SETTINGS_DIR=$(echo "$SETTINGS" | tr ' ' '_')
+    echo "A" | asreview makita template arfi --template custom_arfi.txt.template -f jobs.sh
+    sed -i "s/SETTINGS_PLACEHOLDER/$SETTINGS/g" jobs.sh
+    sed -i "s/SETTINGS_DIR/$SETTINGS_DIR/g" jobs.sh
+    cat jobs.sh >> all-jobs.sh
+done < makita-args.txt
+mv all-jobs.sh jobs.sh
# Define the S3_PREFIX, using whatever you think makes sense.
# This file is run exactly once, so it makes sense to use the date.
# You could also use the settings, if any.
log "Sending part 3 to rabbitmq"
python tasker-send.py jobs.sh.part3

log "Done"

-# Send results someplace?
-# TODO
4 changes: 2 additions & 2 deletions code/worker.Dockerfile
-FROM ghcr.io/asreview/asreview:v1.2
+FROM ghcr.io/asreview/asreview:v1.2.1

RUN apt-get update && \
-    apt-get install -y curl ca-certificates amqp-tools python \
+    apt-get install -y curl ca-certificates amqp-tools python3 \
    --no-install-recommends \
    && rm -rf /var/lib/apt/lists/* \
    && pip install boto3 pika