docker.yml: run demo script in CI #1126

Merged: 17 commits merged into main from timg/test-demo-script on Jun 20, 2024

Conversation

Contributor

@tgeoghegan commented Jun 17, 2024

Add steps to the compose job in docker.yml that run the demo script from cli/README.md against a Docker Compose deployment to ensure that it works.
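
For context, the new CI steps take roughly this shape (a sketch only, not the exact docker.yml contents; the compose invocation here is illustrative):

docker compose up --detach --wait
# ... run the demo commands from cli/README.md: list aggregators, create a
# task, upload reports, then collect ...
docker compose down --volumes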

Perhaps unsurprisingly, adding a test for this revealed a couple of bugs, which are also addressed in this change:

  • The task discovery and task creation intervals were poorly tuned.
  • The leader was trying to reach the helper at localhost:9002, which is not routable inside the Docker Compose network. To work around this, nc runs alongside the leader's Janus processes so that connections to localhost:9002 are redirected to janus_2_aggregator:8080, which is routable inside Docker Compose (sketched below).
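
A minimal sketch of that kind of forwarder (not necessarily the exact command used in this change; netcat flag syntax differs between implementations, and this variant relays one connection at a time):

mkfifo /tmp/relay
while true; do
  # Listen on port 9002 next to the leader and relay each connection to the
  # helper's in-network address; the FIFO carries the response bytes back.
  nc -l -p 9002 < /tmp/relay | nc janus_2_aggregator 8080 > /tmp/relay
done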

AGGREGATOR_LIST=`./divviup aggregator list`
printf 'aggregator list %s\n' $AGGREGATOR_LIST

LEADER_ID=`echo $AGGREGATOR_LIST | ../jq -r 'map_values(select(.name == "leader")).[0].id'`
Contributor

FYI, it looks like we could lower map_values as follows, to avoid needing to install a newer version of jq.

Suggested change
LEADER_ID=`echo $AGGREGATOR_LIST | ../jq -r 'map_values(select(.name == "leader")).[0].id'`
LEADER_ID=`echo $AGGREGATOR_LIST | ../jq -r '.[] |= select(.name == "leader") |.[0].id'`

--leader-aggregator-id $LEADER_ID --helper-aggregator-id $HELPER_ID \
--collector-credential-id $COLLECTOR_CREDENTIAL_ID \
--vdaf histogram --categorical-buckets 0,1,2,3,4,5,6,7,8,9,10 \
--min-batch-size 10 --max-batch-size 200 --time-precision 60sec`
Contributor

I think this is going to run afoul of our validation:

#[validate(required, range(min = 100))]
pub min_batch_size: Option<u64>,

Contributor Author

It does. That's kind of annoying. I worry this test will take forever to run if we have to wait for 100ish reports to be aggregated.

Contributor Author

@tgeoghegan commented Jun 18, 2024

Finally figured out what's wrong here: first, the problem was that the task refresh interval in the aggregation job creator was set to 3600s, so it was never noticing the task being created (I'm not sure why this worked for me locally; I must have been winning some kind of race between task provisioning and the job creator starting up). Second, it's still failing because the aggregators can't talk to each other:

{
    "timestamp": "2024-06-18T22:12:13.710011Z",
    "level": "WARN",
    "fields": {
        "message": "Encountered retryable network error",
        "err": "reqwest::Error { kind: Request, url: Url { scheme: \"http\", cannot_be_a_base: false, username: \"\", password: None, host: Some(Domain(\"localhost\")), port: Some(9002), path: \"/tasks/WkMb56665qFp_PLNZAxI8ouwGXKh8XuEjMTEbRO25fM/aggregation_jobs/0grBtgKu2-PVK-Bc_AJPGw\", query: None, fragment: None }, source: Error { kind: Connect, source: Some(ConnectError(\"tcp connect error\", Os { code: 111, kind: ConnectionRefused, message: \"Connection refused\" })) } }"
    },
    "target": "janus_core::retries",
    "filename": "core/src/retries.rs",
    "line_number": 154,
    "spans": [
        {
            "acquired_job": "AcquiredAggregationJob { task_id: TaskId(WkMb56665qFp_PLNZAxI8ouwGXKh8XuEjMTEbRO25fM), aggregation_job_id: AggregationJobId(0grBtgKu2-PVK-Bc_AJPGw), query_type: FixedSize { max_batch_size: Some(200), batch_time_window_size: None }, vdaf: Prio3Histogram { length: 11, chunk_length: 4 } }",
            "name": "Job stepper"
        },
        {
            "method": "PUT",
            "route_label": "tasks/:task_id/aggregation_jobs/:aggregation_job_id",
            "url": "http://localhost:9002/tasks/WkMb56665qFp_PLNZAxI8ouwGXKh8XuEjMTEbRO25fM/aggregation_jobs/0grBtgKu2-PVK-Bc_AJPGw",
            "name": "send_request_to_helper"
        }
    ],
    "threadId": "ThreadId(5)"
}

The leader can't connect to the helper at localhost:9002, which makes sense, because that is the helper aggregator's address outside the compose network. What I don't get is: how did this ever work? I must have mangled something about the network configuration while working on this.

edit: it's possible that this never worked in Docker Compose, and that I only ever managed to run collections against staging.

@tgeoghegan
Contributor Author

Like all problems, this wound up being solvable using nc(1).

@tgeoghegan marked this pull request as ready for review June 19, 2024 00:04
@tgeoghegan requested a review from a team as a code owner June 19, 2024 00:04
@tgeoghegan mentioned this pull request Jun 19, 2024

echo "finished uploading measurements"

sleep 120
Contributor

I'm cool with this for now, but I sense this might end up flaky depending on whether our GitHub Actions runner has had a good breakfast that day.

But we can see what happens; if that's consistently a long enough wait, it beats hacky shell retry logic.

Contributor Author

I think the way to make it robust would be to add collection job polling logic to divviup dap-client akin to what Janus' collect already has, but I was/am reluctant to copy over that much code.
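
For reference, the kind of shell retry logic being weighed against the fixed sleep might look like this (a hypothetical sketch; COLLECT_CMD is a placeholder for whatever command fetches the collection result, since its exact invocation isn't shown in this thread):

# Poll for the collection result instead of sleeping a fixed 120 seconds.
for attempt in $(seq 1 30); do
  if $COLLECT_CMD; then
    echo "collection succeeded on attempt ${attempt}"
    exit 0
  fi
  echo "collection not ready yet (attempt ${attempt}); retrying in 10s"
  sleep 10
done
echo "gave up waiting for the collection" >&2
exit 1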

@tgeoghegan merged commit 3251d15 into main Jun 20, 2024
8 checks passed
@tgeoghegan deleted the timg/test-demo-script branch June 20, 2024 21:14