docker.yml: run demo script in CI #1126

Merged: 17 commits merged into main from timg/test-demo-script on Jun 20, 2024

Conversation

Contributor

@tgeoghegan commented Jun 17, 2024

Add steps to the compose job in docker.yml that run the demo script from cli/README.md against a Docker Compose deployment to ensure that it works.
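
For context, the new CI steps take roughly this shape (a sketch only, not the exact docker.yml contents; the compose invocation here is illustrative):

docker compose up --detach --wait
# ... run the demo commands from cli/README.md: list aggregators, create a
# task, upload reports, then collect ...
docker compose down --volumes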

Perhaps unsurprisingly, adding a test for this revealed a couple of bugs, which are also addressed in this change:

  • The task discovery and task creation intervals were poorly tuned.
  • The leader was trying to reach the helper at localhost:9002, which is not routable inside the Docker Compose network. To work around this, nc runs alongside the leader's Janus processes so that connections to localhost:9002 are redirected to janus_2_aggregator:8080, which is routable inside Docker Compose (sketched below).
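
A minimal sketch of that kind of forwarder (not necessarily the exact command used in this change; netcat flag syntax differs between implementations, and this variant relays one connection at a time):

mkfifo /tmp/relay
while true; do
  # Listen on port 9002 next to the leader and relay each connection to the
  # helper's in-network address; the FIFO carries the response bytes back.
  nc -l -p 9002 < /tmp/relay | nc janus_2_aggregator 8080 > /tmp/relay
done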

AGGREGATOR_LIST=`./divviup aggregator list`
printf 'aggregator list %s\n' $AGGREGATOR_LIST

LEADER_ID=`echo $AGGREGATOR_LIST | ../jq -r 'map_values(select(.name == "leader")).[0].id'`
Contributor

FYI, it looks like we could lower map_values as follows, to avoid needing to install a newer version of jq.

Suggested change
LEADER_ID=`echo $AGGREGATOR_LIST | ../jq -r 'map_values(select(.name == "leader")).[0].id'`
LEADER_ID=`echo $AGGREGATOR_LIST | ../jq -r '.[] |= select(.name == "leader") |.[0].id'`

--leader-aggregator-id $LEADER_ID --helper-aggregator-id $HELPER_ID \
--collector-credential-id $COLLECTOR_CREDENTIAL_ID \
--vdaf histogram --categorical-buckets 0,1,2,3,4,5,6,7,8,9,10 \
--min-batch-size 10 --max-batch-size 200 --time-precision 60sec`
Contributor

I think this is going to run afoul of our validation:

#[validate(required, range(min = 100))]
pub min_batch_size: Option<u64>,

Contributor Author

It does. That's kind of annoying. I worry this test will take forever to run if we have to wait for 100ish reports to be aggregated.

Contributor Author

@tgeoghegan commented Jun 18, 2024

Finally figured out what's wrong here: first, the problem was that the task refresh interval in the aggregation job creator was set to 3600s, so it was never noticing the task being created (I'm not sure why this worked for me locally; I must have been winning some kind of race between task provisioning and the job creator starting up). Second, it's still failing because the aggregators can't talk to each other:

{
    "timestamp": "2024-06-18T22:12:13.710011Z",
    "level": "WARN",
    "fields": {
        "message": "Encountered retryable network error",
        "err": "reqwest::Error { kind: Request, url: Url { scheme: \"http\", cannot_be_a_base: false, username: \"\", password: None, host: Some(Domain(\"localhost\")), port: Some(9002), path: \"/tasks/WkMb56665qFp_PLNZAxI8ouwGXKh8XuEjMTEbRO25fM/aggregation_jobs/0grBtgKu2-PVK-Bc_AJPGw\", query: None, fragment: None }, source: Error { kind: Connect, source: Some(ConnectError(\"tcp connect error\", Os { code: 111, kind: ConnectionRefused, message: \"Connection refused\" })) } }"
    },
    "target": "janus_core::retries",
    "filename": "core/src/retries.rs",
    "line_number": 154,
    "spans": [
        {
            "acquired_job": "AcquiredAggregationJob { task_id: TaskId(WkMb56665qFp_PLNZAxI8ouwGXKh8XuEjMTEbRO25fM), aggregation_job_id: AggregationJobId(0grBtgKu2-PVK-Bc_AJPGw), query_type: FixedSize { max_batch_size: Some(200), batch_time_window_size: None }, vdaf: Prio3Histogram { length: 11, chunk_length: 4 } }",
            "name": "Job stepper"
        },
        {
            "method": "PUT",
            "route_label": "tasks/:task_id/aggregation_jobs/:aggregation_job_id",
            "url": "http://localhost:9002/tasks/WkMb56665qFp_PLNZAxI8ouwGXKh8XuEjMTEbRO25fM/aggregation_jobs/0grBtgKu2-PVK-Bc_AJPGw",
            "name": "send_request_to_helper"
        }
    ],
    "threadId": "ThreadId(5)"
}

The leader can't connect to the helper at localhost:9002, which makes sense, because that is the helper aggregator's address outside the compose network. What I don't get is: how did this ever work? I must have mangled something about the network configuration while working on this.

edit: it's possible that this never worked in Docker Compose, and that I only ever managed to run collections against staging.

@tgeoghegan
Contributor Author

Like all problems, this wound up being solvable using nc(1).

@tgeoghegan marked this pull request as ready for review June 19, 2024 00:04
@tgeoghegan requested a review from a team as a code owner June 19, 2024 00:04
@tgeoghegan mentioned this pull request Jun 19, 2024

echo "finished uploading measurements"

sleep 120
Contributor

I'm cool with this for now, but I sense this might end up flaky depending on whether our GitHub Actions runner has had a good breakfast that day.

But we can see what happens; if that's consistently a long enough wait, it beats hacky shell retry logic.

Contributor Author

I think the way to make it robust would be to add collection job polling logic to divviup dap-client akin to what Janus' collect already has, but I was/am reluctant to copy over that much code.
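
For reference, the kind of shell retry logic being weighed against the fixed sleep might look like this (a hypothetical sketch; COLLECT_CMD is a placeholder for whatever command fetches the collection result, since its exact invocation isn't shown in this thread):

# Poll for the collection result instead of sleeping a fixed 120 seconds.
for attempt in $(seq 1 30); do
  if $COLLECT_CMD; then
    echo "collection succeeded on attempt ${attempt}"
    exit 0
  fi
  echo "collection not ready yet (attempt ${attempt}); retrying in 10s"
  sleep 10
done
echo "gave up waiting for the collection" >&2
exit 1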

@tgeoghegan merged commit 3251d15 into main Jun 20, 2024
8 checks passed
@tgeoghegan deleted the timg/test-demo-script branch June 20, 2024 21:14