docker.yml: run demo script in CI #1126
Conversation
.github/workflows/docker.yml:

```shell
AGGREGATOR_LIST=`./divviup aggregator list`
printf 'aggregator list %s\n' $AGGREGATOR_LIST

LEADER_ID=`echo $AGGREGATOR_LIST | ../jq -r 'map_values(select(.name == "leader")).[0].id'`
```
FYI, it looks like we could lower `map_values` as follows, to avoid needing to install a newer version:

```diff
- LEADER_ID=`echo $AGGREGATOR_LIST | ../jq -r 'map_values(select(.name == "leader")).[0].id'`
+ LEADER_ID=`echo $AGGREGATOR_LIST | ../jq -r '.[] |= select(.name == "leader") | .[0].id'`
```
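For illustration, here's how this kind of selection behaves on sample data. The aggregator list JSON below is made up, and the expression uses a plain pipe, `map(select(...)) | .[0].id`, which works on older `jq` releases too (it avoids both `map_values` and the post-parenthesis `.[0]` index syntax):

```shell
# Hypothetical aggregator list; the real one comes from `./divviup aggregator list`.
AGGREGATOR_LIST='[{"name":"helper","id":"h-1"},{"name":"leader","id":"l-1"}]'

# Keep only the object named "leader", then print its id.
LEADER_ID=$(echo "$AGGREGATOR_LIST" | jq -r 'map(select(.name == "leader")) | .[0].id')
echo "$LEADER_ID"   # prints l-1
```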
.github/workflows/docker.yml:

```shell
--leader-aggregator-id $LEADER_ID --helper-aggregator-id $HELPER_ID \
--collector-credential-id $COLLECTOR_CREDENTIAL_ID \
--vdaf histogram --categorical-buckets 0,1,2,3,4,5,6,7,8,9,10 \
--min-batch-size 10 --max-batch-size 200 --time-precision 60sec`
```
I think this is going to run afoul of our validation:

divviup-api/src/entity/task/new_task.rs (lines 30 to 31 in 152f352):

```rust
#[validate(required, range(min = 100))]
pub min_batch_size: Option<u64>,
```
It does. That's kind of annoying. I worry this test will take forever to run if we have to wait for 100ish reports to be aggregated.
Force-pushed 70f9cbb to ab4416f
Finally figured out what's wrong here: first, the problem was that the task refresh interval in the aggregation job creator was set to 3600s, so it was never noticing the task being created (I'm not sure why this worked for me locally; I must have been winning some kind of race between task provisioning and the job creator starting up). Second, it's still failing because the aggregators can't talk to each other:

```json
{
  "timestamp": "2024-06-18T22:12:13.710011Z",
  "level": "WARN",
  "fields": {
    "message": "Encountered retryable network error",
    "err": "reqwest::Error { kind: Request, url: Url { scheme: \"http\", cannot_be_a_base: false, username: \"\", password: None, host: Some(Domain(\"localhost\")), port: Some(9002), path: \"/tasks/WkMb56665qFp_PLNZAxI8ouwGXKh8XuEjMTEbRO25fM/aggregation_jobs/0grBtgKu2-PVK-Bc_AJPGw\", query: None, fragment: None }, source: Error { kind: Connect, source: Some(ConnectError(\"tcp connect error\", Os { code: 111, kind: ConnectionRefused, message: \"Connection refused\" })) } }"
  },
  "target": "janus_core::retries",
  "filename": "core/src/retries.rs",
  "line_number": 154,
  "spans": [
    {
      "acquired_job": "AcquiredAggregationJob { task_id: TaskId(WkMb56665qFp_PLNZAxI8ouwGXKh8XuEjMTEbRO25fM), aggregation_job_id: AggregationJobId(0grBtgKu2-PVK-Bc_AJPGw), query_type: FixedSize { max_batch_size: Some(200), batch_time_window_size: None }, vdaf: Prio3Histogram { length: 11, chunk_length: 4 } }",
      "name": "Job stepper"
    },
    {
      "method": "PUT",
      "route_label": "tasks/:task_id/aggregation_jobs/:aggregation_job_id",
      "url": "http://localhost:9002/tasks/WkMb56665qFp_PLNZAxI8ouwGXKh8XuEjMTEbRO25fM/aggregation_jobs/0grBtgKu2-PVK-Bc_AJPGw",
      "name": "send_request_to_helper"
    }
  ],
  "threadId": "ThreadId(5)"
}
```

The leader can't connect to the helper at `localhost:9002`.

edit: it's possible that this never worked in Docker Compose, and that I only ever managed to run collections against staging.
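As a generic diagnostic sketch (not a command from this PR), this kind of "connection refused" failure can be reproduced from the leader's network namespace by probing the helper endpoint named in the log:

```shell
# Probe the helper endpoint with a 2-second timeout; nc's exit status tells
# whether the TCP connect succeeded. Host and port are taken from the log above.
if nc -z -w 2 localhost 9002 2>/dev/null; then
  echo "helper reachable"
else
  echo "connection refused"
fi
```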
Force-pushed 4a1554c to 0ff24bb
Like all problems, this wound up being solvable using `nc`.

.github/workflows/docker.yml:

```shell
echo "finished uploading measurements"

sleep 120
```
I'm cool with this for now, but I sense this might end up flaky depending on whether our GitHub Actions runner has had a good breakfast that day.
But we can see what happens; if that's always a long enough wait, then it beats hacky shell retry logic.
I think the way to make it robust would be to add collection job polling logic to `divviup dap-client`, akin to what Janus' `collect` already has, but I was/am reluctant to copy over that much code.
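In case the fixed sleep does prove flaky, a minimal sketch of what bounded shell retry logic could look like (`wait_for` and the marker-file example are hypothetical, not part of divviup or this PR):

```shell
#!/bin/sh
# wait_for ATTEMPTS DELAY CMD...: run CMD until it succeeds, retrying up to
# ATTEMPTS times with DELAY seconds between tries; returns non-zero if CMD
# never succeeds within the allotted attempts.
wait_for() {
  attempts=$1; delay=$2; shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: wait for a marker file that appears after about a second.
marker="/tmp/collection-ready.$$"
( sleep 1; touch "$marker" ) &
wait_for 10 1 test -f "$marker" && echo "collection ready"
rm -f "$marker"
```

In the workflow, the polled command would be whatever observable condition marks the collection as complete, in place of the `test -f` above.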
Add steps to the `compose` job in `docker.yml` that run the demo script from `cli/README.md` against a Docker Compose deployment to ensure that it works.

Perhaps unsurprisingly, adding a test for this revealed a couple of bugs, which are also addressed in this change: the leader tries to reach the helper at `localhost:9002`, which is not routable inside of the Docker Compose network. So we run `nc` alongside the leader's Janus processes so that connections to `localhost:9002` get redirected to `janus_2_aggregator:8080`, which is routable inside Docker Compose.
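A sketch of how such an `nc` sidecar might be wired up in a Compose file. The service names, the image, and the BusyBox `nc` flags here are assumptions for illustration, not taken from this PR:

```yaml
services:
  helper-forwarder:
    image: alpine:3.20
    # Share the leader service's network namespace, so this container's
    # localhost:9002 is the same localhost:9002 the leader connects to.
    # "janus_1_aggregator" is a hypothetical leader service name.
    network_mode: "service:janus_1_aggregator"
    command: >
      sh -c 'while true; do
               nc -l -p 9002 -e nc janus_2_aggregator 8080;
             done'
```

The loop accepts one connection at a time on port 9002 and proxies it to `janus_2_aggregator:8080` over the Compose network.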