[EDR Workflows][E2E] Recreate agent on createEndpointHost task fail (#169092)

Restart the Vagrant VM on error during the `beforeAll` task `createEndpointHost`.

The Defend Workflows Cypress suite was run 300 times through the flaky test runner:
1. 100x
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/3699
2. 50x
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/3707
3. 50x
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/3708
4. 50x
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/3709
5. 50x
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/3710


Flaky test runner runs in which the `createEndpointHost` task failed and
then recovered successfully:
1.
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/3710#018b62fd-9ae9-4988-b1e0-ab0f04d8efdc
2.
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/3710#018b62fd-9ae6-4340-992b-1474ee0f114b
3.
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/3708#018b62fd-578e-4817-ae1c-8c58e8774eec
4.
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/3708#018b62fd-5787-4245-85a6-cb446e42bc73
5.
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/3707#018b62fc-fc17-407e-88de-d0b43b6b1d44
(failed due to unrelated issue)
6.
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/3699#018b61d9-d2c3-430c-b3e3-72b9fbb22d24
7.
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/3699#018b61d9-d2c6-4315-b828-b3218a70f209
8.
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/3699#018b61d9-d2c7-4ff7-9a70-7354f90179e0
9.
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/3699#018b61d9-d2d7-418f-b043-049e5effb26f
10.
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/3699#018b61d9-d2da-47cc-b4ea-a4d4de3ba0a0

New errors, not seen before, that are related to environment setup:

1. `vagrant up` failed:
1.1
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/3708#018b62fd-5787-4245-85a6-cb446e42bc73
1.2
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/3699#018b61d9-d2d0-4a52-87d9-34caa8927465

2. ``CypressError: `cy.task('indexFleetEndpointPolicy')` timed out after waiting `60000ms`.``:
2.1
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/3707#018b62fc-fc04-40d4-b155-46f094681edb
2.2
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/3699#018b61d9-d2c9-4ebb-9174-eb9d79d04d02
2.3
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/3699#018b61d9-d2dc-438f-94b0-9f94ae95701c
    

Closes:
#168284
#169343
#169468
#169469
#169467
#169465
#169466
#169157
#168719
#168427
#168359
#168340
#169689

---------

Co-authored-by: Patryk Kopyciński <contact@patrykkopycinski.com>
szwarckonrad and patrykkopycinski committed Oct 25, 2023
1 parent 980162b commit 1fdcb41
Showing 7 changed files with 57 additions and 22 deletions.
3 changes: 2 additions & 1 deletion .buildkite/pipelines/flaky_tests/pipeline.ts
@@ -162,10 +162,11 @@ for (const testSuite of testSuites) {
         `Group configuration was not found in groups.json for the following cypress suite: {${suiteName}}.`
       );
     }
+    const agentQueue = suiteName.includes('defend_workflows') ? 'n2-4-virt' : 'n2-4-spot';
     steps.push({
       command: `.buildkite/scripts/steps/functional/${suiteName}.sh`,
       label: group.name,
-      agents: { queue: 'n2-4-spot' },
+      agents: { queue: agentQueue },
       depends_on: 'build',
       parallelism: testSuite.count,
       concurrency,

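The queue selection in the hunk above can be read as a small pure function. The sketch below is only an illustration: `pickAgentQueue` and the sample suite names are hypothetical, and the commit itself inlines the expression in the pipeline loop.

```typescript
// Hypothetical helper (not part of the commit): same expression as in the diff,
// wrapped in a function so it can be exercised on its own.
function pickAgentQueue(suiteName: string): string {
  // Suites whose name contains 'defend_workflows' go to the 'n2-4-virt' queue;
  // every other suite stays on 'n2-4-spot'.
  return suiteName.includes('defend_workflows') ? 'n2-4-virt' : 'n2-4-spot';
}

// Sample suite names are made up for the example.
console.log(pickAgentQueue('security_solution/defend_workflows')); // n2-4-virt
console.log(pickAgentQueue('some_other_suite')); // n2-4-spot
```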
@@ -20,8 +20,7 @@ import { createEndpointHost } from '../../tasks/create_endpoint_host';
 import { deleteAllLoadedEndpointData } from '../../tasks/delete_all_endpoint_data';
 import { enableAllPolicyProtections } from '../../tasks/endpoint_policy';
 
-// FLAKY: https://github.com/elastic/kibana/issues/168340
-describe.skip(
+describe(
   'Automated Response Actions',
   {
     tags: [
@@ -76,8 +75,7 @@ describe.skip(
       disableExpandableFlyoutAdvancedSettings();
     });
 
-    // FLAKY: https://github.com/elastic/kibana/issues/168427
-    describe.skip('From alerts', () => {
+    describe('From alerts', () => {
       let ruleId: string;
       let ruleName: string;
 

@@ -32,8 +32,7 @@ import { createEndpointHost } from '../../tasks/create_endpoint_host';
 import { deleteAllLoadedEndpointData } from '../../tasks/delete_all_endpoint_data';
 import { enableAllPolicyProtections } from '../../tasks/endpoint_policy';
 
-// FLAKY: https://github.com/elastic/kibana/issues/168284
-describe.skip('Endpoints page', { tags: ['@ess', '@serverless'] }, () => {
+describe('Endpoints page', { tags: ['@ess', '@serverless'] }, () => {
   let indexedPolicy: IndexedFleetEndpointPolicyResponse;
   let policy: PolicyData;
   let createdHost: CreateAndEnrollEndpointHostResponse;

@@ -332,16 +332,39 @@ export const dataLoadersForRealEndpoints = (
       options: Omit<CreateAndEnrollEndpointHostOptions, 'log' | 'kbnClient'>
     ): Promise<CreateAndEnrollEndpointHostResponse> => {
       const { kbnClient, log } = await stackServicesPromise;
-      return createAndEnrollEndpointHost({
-        useClosestVersionMatch: true,
-        ...options,
-        log,
-        kbnClient,
-      }).then((newHost) => {
-        return waitForEndpointToStreamData(kbnClient, newHost.agentId, 360000).then(() => {
+
+      let retryAttempt = 0;
+      const attemptCreateEndpointHost = async (): Promise<CreateAndEnrollEndpointHostResponse> => {
+        try {
+          log.info(`Creating endpoint host, attempt ${retryAttempt}`);
+          const newHost = await createAndEnrollEndpointHost({
+            useClosestVersionMatch: true,
+            ...options,
+            log,
+            kbnClient,
+          });
+          await waitForEndpointToStreamData(kbnClient, newHost.agentId, 360000);
           return newHost;
-        });
-      });
+        } catch (err) {
+          log.info(`Caught error when setting up the agent: ${err}`);
+          if (retryAttempt === 0 && err.agentId) {
+            retryAttempt++;
+            await destroyEndpointHost(kbnClient, {
+              hostname: err.hostname || '', // No hostname in CI env for vagrant
+              agentId: err.agentId,
+            });
+            log.info(`Deleted endpoint host ${err.agentId} and retrying`);
+            return attemptCreateEndpointHost();
+          } else {
+            log.info(
+              `${retryAttempt} attempts of creating endpoint host failed, reason for the last failure was ${err}`
+            );
+            throw err;
+          }
+        }
+      };
+
+      return attemptCreateEndpointHost();
     },
 
     destroyEndpointHost: async (

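The hunk above retries host creation once, and only when the thrown error carries an `agentId` that can be used to destroy the half-created host first. A minimal, self-contained sketch of that "clean up, then retry once" pattern follows. It is not the Kibana implementation: `createWithOneRetry`, `SetupError`, `resourceId` and `demo` are hypothetical names chosen for illustration.

```typescript
// Hypothetical error shape: a setup failure that may identify the resource
// that was partially created before the failure.
interface SetupError extends Error {
  resourceId?: string;
}

// Runs `create`. If it fails with an error carrying a resourceId, cleans that
// resource up and tries exactly one more time. Any other failure, or a second
// failure, is rethrown to the caller.
async function createWithOneRetry<T>(
  create: () => Promise<T>,
  cleanup: (resourceId: string) => Promise<void>,
  log: (msg: string) => void = () => {}
): Promise<T> {
  try {
    return await create();
  } catch (err) {
    const resourceId = (err as SetupError).resourceId;
    // Without an id there is nothing to clean up, so do not retry.
    if (!resourceId) {
      throw err;
    }
    log(`First attempt failed (${err}); removing ${resourceId} and retrying once`);
    await cleanup(resourceId);
    // A failure of the second attempt propagates unchanged.
    return create();
  }
}

// Usage with fake create/cleanup functions: the first attempt fails with a
// tagged error, the second one succeeds.
async function demo(): Promise<string> {
  let calls = 0;
  const cleaned: string[] = [];
  const host = await createWithOneRetry(
    async () => {
      calls++;
      if (calls === 1) {
        throw Object.assign(new Error('enroll timed out'), { resourceId: 'agent-1' });
      }
      return `host-${calls}`;
    },
    async (id) => {
      cleaned.push(id);
    }
  );
  return `${host} after cleaning ${cleaned.join(',')}`;
}
```

Capping the retry at one attempt keeps the worst-case setup time bounded, which matters because the surrounding Cypress task has its own timeout.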
@@ -17,6 +17,6 @@ export const createEndpointHost = (
     {
       agentPolicyId,
     },
-    { timeout: timeout ?? 900000 } // 15 minutes, since setup can take 10 minutes and more. Task will time out if is not resolved within this time.
+    { timeout: timeout ?? 30 * 60 * 1000 }
   );
 };

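The default task timeout above changes from `900000` ms (15 minutes) to `30 * 60 * 1000` ms (30 minutes). As a small illustration of the arithmetic, the sketch below uses a hypothetical `minutes` helper that is not part of the commit.

```typescript
// Hypothetical helper (not in the commit): converts minutes to milliseconds.
const minutes = (n: number): number => n * 60 * 1000;

const previousTimeoutMs = 900000; // default removed by the diff
const currentTimeoutMs = minutes(30); // same value as 30 * 60 * 1000

console.log(previousTimeoutMs / minutes(1)); // 15
console.log(currentTimeoutMs); // 1800000
```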
@@ -234,7 +234,7 @@ const createMultipassVm = async ({
   };
 };
 
-const deleteMultipassVm = async (vmName: string): Promise<void> => {
+export const deleteMultipassVm = async (vmName: string): Promise<void> => {
   if (process.env.CI) {
     await execa.command(`vagrant destroy -f`, {
       env: {
@@ -339,7 +339,10 @@ const enrollHostWithFleet = async ({
     ]);
   }
   log.info(`Waiting for Agent to check-in with Fleet`);
-  const agent = await waitForHostToEnroll(kbnClient, vmName, 240000);
+
+  const agent = await waitForHostToEnroll(kbnClient, vmName, 8 * 60 * 1000);
 
+  log.info(`Agent enrolled with Fleet, status: `, agent.status);
+
   return {
     agentId: agent.id,

@@ -147,15 +147,19 @@ export const waitForHostToEnroll = async (
     return elapsedTime > timeoutMs;
   };
   let found: Agent | undefined;
+  let agentId: string | undefined;
 
   while (!found && !hasTimedOut()) {
     found = await retryOnError(
       async () =>
         fetchFleetAgents(kbnClient, {
           perPage: 1,
-          kuery: `(local_metadata.host.hostname.keyword : "${hostname}") and (status:online)`,
+          kuery: `(local_metadata.host.hostname.keyword : "${hostname}")`,
           showInactive: false,
-        }).then((response) => response.items[0]),
+        }).then((response) => {
+          agentId = response.items[0]?.id;
+          return response.items.filter((agent) => agent.status === 'online')[0];
+        }),
       RETRYABLE_TRANSIENT_ERRORS
     );
 
@@ -166,7 +170,14 @@ export const waitForHostToEnroll = async (
   }
 
   if (!found) {
-    throw new Error(`Timed out waiting for host [${hostname}] to show up in Fleet`);
+    throw Object.assign(
+      new Error(
+        `Timed out waiting for host [${hostname}] to show up in Fleet in ${
+          timeoutMs / 60 / 1000
+        } seconds`
+      ),
+      { agentId, hostname }
+    );
   }
 
   return found;
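Two things happen in the hunks above: the Fleet query no longer filters on `status:online`, so the agent id is remembered even while the agent is not yet online, and the timeout error is tagged with `agentId` and `hostname` via `Object.assign` so a caller can clean up. The sketch below illustrates both steps under assumptions: `AgentLike`, `pickOnlineAgent` and `enrollTimeoutError` are hypothetical names, and no Fleet API is called. Note that `timeoutMs / 60 / 1000` evaluates to minutes, so the sketch labels the figure as minutes.

```typescript
// Hypothetical, simplified agent shape for the example.
interface AgentLike {
  id: string;
  status: string;
}

type EnrollTimeoutError = Error & { agentId: string | undefined; hostname: string };

// Remember the id of the first agent returned, whatever its status, but only
// report an agent as "found" once it is online.
function pickOnlineAgent(items: AgentLike[]): { agentId?: string; online?: AgentLike } {
  return {
    agentId: items[0]?.id,
    online: items.filter((agent) => agent.status === 'online')[0],
  };
}

// Build an Error that carries the data needed for cleanup as extra properties.
function enrollTimeoutError(
  hostname: string,
  timeoutMs: number,
  agentId?: string
): EnrollTimeoutError {
  const timeoutMinutes = timeoutMs / 60 / 1000; // milliseconds -> minutes
  return Object.assign(
    new Error(
      `Timed out waiting for host [${hostname}] to show up in Fleet in ${timeoutMinutes} minutes`
    ),
    { agentId, hostname }
  );
}

// Usage: the agent exists but never came online, so the error still names it.
const seen = pickOnlineAgent([{ id: 'agent-1', status: 'enrolling' }]);
const timeoutErr = enrollTimeoutError('host-a', 8 * 60 * 1000, seen.agentId);
console.log(seen.online); // undefined
console.log(timeoutErr.agentId); // agent-1
console.log(timeoutErr.message); // Timed out waiting for host [host-a] to show up in Fleet in 8 minutes
```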
