
[Fleet] pipeline with id [*] does not exists #116343

Closed
nchaulet opened this issue Oct 26, 2021 · 5 comments · Fixed by #116707
Labels
bug (Fixes for quality problems that affect the customer experience) · Team:Fleet (Team label for Observability Data Collection Fleet team)

Comments

@nchaulet
Member

nchaulet commented Oct 26, 2021

Description

It happens after an upgrade that the agent is not able to send data, with the following error in the logs (replace synthetics-http-0.2.1 with other data streams and packages as applicable):

{"type":"illegal_argument_exception","reason":"pipeline with id [synthetics-http-0.2.1] does not exist"}

We have seen this error a few times for different packages and data streams that do not define any ingest pipeline. I tried different upgrade scenarios and was not able to reproduce it.

What's odd is that none of these pipelines should exist. These data streams are not intended to have a pipeline in these versions of these packages. For example, metrics-system.process.summary does not have any pipeline but has been reported as missing a pipeline in error logs.
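One quick way to confirm that a pipeline named in such an error really is absent is to ask Elasticsearch directly from Dev Tools (a minimal check using the pipeline id from the error above; substitute the id from your own logs):

GET _ingest/pipeline/synthetics-http-0.2.1

If the pipeline does not exist, this request returns a 404, which matches the ingest-time error the agent is hitting.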

How to reproduce

How to reproduce locally?

You need to corrupt the package cache. Using nginx as the package:

  1. From a fresh Kibana and ES
  2. Navigate to Fleet and wait for setup to complete
  3. Install the nginx package
curl --request POST \
  --url http://localhost:5601/api/fleet/epm/packages/nginx-1.1.1 \
  --header 'Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==' \
  --header 'Content-Type: application/json' \
  --header 'kbn-xsrf: as' \
  --data '{
	"force": true
}'
  4. Break the connection between Kibana and the registry (disabling your wifi does the trick)
  5. Restart Kibana and do not visit any UI
  6. Re-enable the connection between Kibana and the registry (enabling your wifi does the trick)
  7. Then force reinstall the same version of the nginx package
curl --request POST \
  --url http://localhost:5601/api/fleet/epm/packages/nginx-1.1.1 \
  --header 'Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==' \
  --header 'Content-Type: application/json' \
  --header 'kbn-xsrf: as' \
  --data '{
	"force": true
}'
  8. Check the index template for the nginx package (in Dev Tools: GET _index_template/metrics-nginx.stubstatus, see the filtered request below); you should see a default pipeline in the index template settings that does not exist.
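For example, to inspect just the relevant setting (the same filter_path used in the investigation section further down; metrics-nginx.stubstatus is one of the nginx data stream index templates):

GET /_index_template/metrics-nginx.stubstatus?filter_path=index_templates.index_template.template.settings.index.default_pipeline

If this returns a default_pipeline, and GET _ingest/pipeline/<that id> returns a 404, you should have reproduced the broken state.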

Bug details: if there is no cache entry, we install the package from a version saved in Elasticsearch, and there is a bug there where we populate the ingest_pipeline with default. We should fix that bug, but we should probably also think about a longer-term solution; relying on a cache system to install a package does not seem future proof.

I think in most scenarios users probably did not call the reinstall endpoint themselves, but we have a mechanism that reinstalls packages during upgrade to install the Fleet final pipeline. I think this, combined with a connection error to the registry, could have caused the same issue.

Workaround

Force reinstalling the package should solve this. If force reinstalling the package does not solve it, you should probably manually roll over the data streams (see below).
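For reference, a force reinstall is the same API call used in the reproduction steps above; replace nginx-1.1.1 with the affected package and version, and adjust the credentials for your own setup:

curl --request POST \
  --url http://localhost:5601/api/fleet/epm/packages/nginx-1.1.1 \
  --header 'Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==' \
  --header 'Content-Type: application/json' \
  --header 'kbn-xsrf: as' \
  --data '{
	"force": true
}'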

Way to investigate and potential workaround (⚠️ not tested yet)

For this investigation, let's use `metrics-system.process` as an example; the same applies to any other data stream. Given the following error from elastic-agent, or a similar one from Elasticsearch:
{"type":"illegal_argument_exception","reason":"pipeline with id [metrics-system.process-1.4.0] does not exist"}, dropping event!
  1. First, let's see what is pointing to the non-existent metrics-system.process-1.4.0 pipeline. Run the following command from Dev Tools in Kibana:
    GET /_index_template/metrics-system.process?filter_path=index_templates.index_template.template.settings.index.default_pipeline
  2. If that returns an empty response, then the reinstall most likely worked. Let's see if this setting is still present on the current concrete index:
    GET /metrics-system.process-*/_settings?filter_path=*.settings.index.default_pipeline
  3. If any of these indices return a non-empty value AND the template request from (1) was empty, then it's likely that rolling over the data stream should fix the issue. Here's the command to try this. ⚠️ This has not yet been tested. If anyone tries this, please add a comment with what happened to get in this state (if known) and how the workaround goes. Note you'd need to change default if you customized the namespace:
    POST /metrics-system.process-default/_rollover

This last command would need to be repeated for each of these data streams. There is no bulk API for this. Another option could be to delete the underlying indices completely if you don't need the data. ⚠️ Warning: this deletes all data ingested by Elastic Agent:

DELETE /logs-*,metrics-*
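If you go the rollover route instead and are not sure which data streams are affected, you can list the candidates first and then check their backing indices with the _settings request from step 2 above (a hedged example; adjust the patterns to your setup):

GET /_data_stream/logs-*,metrics-*?filter_path=data_streams.name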

We're not yet sure of a root cause here so anything you can share would be helpful in making sure that we can fix this bug.

@nchaulet added the bug and Team:Fleet labels on Oct 26, 2021
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@joshdover
Member

Bug details: if there is no cache entry, we install the package from a version saved in Elasticsearch, and there is a bug there where we populate the ingest_pipeline with default. We should fix that bug, but we should probably also think about a longer-term solution; relying on a cache system to install a package does not seem future proof.

I think in most scenarios users probably did not call the reinstall endpoint themselves, but we have a mechanism that reinstalls packages during upgrade to install the Fleet final pipeline. I think this, combined with a connection error to the registry, could have caused the same issue.

@nchaulet I haven't yet been able to reproduce this with your given steps, so I can't yet verify your fix in #116707. I'm unclear on why the registry connectivity would cause this issue. Inspecting the code where we retrieve packages, it seems we attempt to retrieve packages that were already installed from in-memory cache, then ES regardless of connectivity:

if (installedPkg && installedPkg.version === pkgVersion) {
  const { install_source: pkgInstallSource } = installedPkg;
  // check cache
  res = getArchivePackage({
    name: pkgName,
    version: pkgVersion,
  });
  if (res) {
    logger.debug(`retrieved installed package ${pkgName}-${pkgVersion} from cache`);
  }
  if (!res && installedPkg.package_assets) {
    res = await getEsPackage(
      pkgName,
      pkgVersion,
      installedPkg.package_assets,
      savedObjectsClient
    );
    if (res) {
      logger.debug(`retrieved installed package ${pkgName}-${pkgVersion} from ES`);
    }
  }
  // for packages not in cache or package storage and installed from registry, check registry
  if (!res && pkgInstallSource === 'registry') {
    try {
      res = await Registry.getRegistryPackage(pkgName, pkgVersion);
      logger.debug(`retrieved installed package ${pkgName}-${pkgVersion} from registry`);
      // TODO: add to cache and storage here?
    } catch (error) {
      // treating this is a 404 as no status code returned
      // in the unlikely event its missing from cache, storage, and never installed from registry
    }
  }
} else {
That said, I can see why your fix should work. I do think this code needs refactoring. It's not clear to me why we need special post-processing logic for retrieving packages from ES to rebuild the PackageInfo type that we normally retrieve from the registry. Could we not save this info directly instead of trying to re-build it in code and having two separate sources of truth?

Here's where we get this from the registry:

export async function fetchInfo(pkgName: string, pkgVersion: string): Promise<RegistryPackage> {
const registryUrl = getRegistryUrl();
try {
const res = await fetchUrl(`${registryUrl}/package/${pkgName}/${pkgVersion}`).then(JSON.parse);
return res;
} catch (err) {
if (err instanceof RegistryResponseError && err.status === 404) {
throw new PackageNotFoundError(`${pkgName}@${pkgVersion} not found`);
}
throw err;
}
}

And how we rebuild this when retrieving from ES:

// create the packageInfo
// TODO: this is mostly copied from validtion.ts, needed in case package does not exist in storage yet or is missing from cache
// we don't want to reach out to the registry again so recreate it here. should check whether it exists in packageInfoCache first
const manifestPath = `${pkgName}-${pkgVersion}/manifest.yml`;
const soResManifest = await savedObjectsClient.get<PackageAsset>(
  ASSETS_SAVED_OBJECT_TYPE,
  assetPathToObjectId(manifestPath)
);
const packageInfo = safeLoad(soResManifest.attributes.data_utf8);
try {
  const readmePath = `docs/README.md`;
  await savedObjectsClient.get<PackageAsset>(
    ASSETS_SAVED_OBJECT_TYPE,
    assetPathToObjectId(`${pkgName}-${pkgVersion}/${readmePath}`)
  );
  packageInfo.readme = `/package/${pkgName}/${pkgVersion}/${readmePath}`;
} catch (err) {
  // read me doesn't exist
}
let dataStreamPaths: string[] = [];
const dataStreams: RegistryDataStream[] = [];
paths
  .filter((path) => path.startsWith(`${pkgKey}/data_stream/`))
  .forEach((path) => {
    const parts = path.split('/');
    if (parts.length > 2 && parts[2]) dataStreamPaths.push(parts[2]);
  });
dataStreamPaths = uniq(dataStreamPaths);
await Promise.all(
  dataStreamPaths.map(async (dataStreamPath) => {
    const dataStreamManifestPath = `${pkgKey}/data_stream/${dataStreamPath}/manifest.yml`;
    const soResDataStreamManifest = await savedObjectsClient.get<PackageAsset>(
      ASSETS_SAVED_OBJECT_TYPE,
      assetPathToObjectId(dataStreamManifestPath)
    );
    const dataStreamManifest = safeLoad(soResDataStreamManifest.attributes.data_utf8);
    const {
      ingest_pipeline: ingestPipeline,
      dataset,
      streams: manifestStreams,
      ...dataStreamManifestProps
    } = dataStreamManifest;
    const streams = parseAndVerifyStreams(manifestStreams, dataStreamPath);
    dataStreams.push({
      dataset: dataset || `${pkgName}.${dataStreamPath}`,
      package: pkgName,
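      // Note: the `|| 'default'` fallback on the next line is what the "Bug details" above point at:
      // it assigns a default ingest pipeline name even when the data stream's manifest defines none.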
      ingest_pipeline: ingestPipeline || 'default',
      path: dataStreamPath,
      streams,
      ...dataStreamManifestProps,
    });
  })
);
packageInfo.policy_templates = parseAndVerifyPolicyTemplates(packageInfo);
packageInfo.data_streams = dataStreams;
packageInfo.assets = paths.map((path) => {
  return path.replace(`${pkgName}-${pkgVersion}`, `/package/${pkgName}/${pkgVersion}`);
});

@joshdover
Member

Could we not save this info directly instead of trying to re-build it in code and having two separate sources of truth?

Maybe the use case for this is for uploaded packages. If that's the case, then we should always build this PackageInfo object in Kibana so that behavior is consistent regardless of how the package is retrieved.

@nchaulet
Member Author

Maybe the use case for this is for uploaded packages. If that's the case, then we should always build this PackageInfo object in Kibana so that behavior is consistent regardless of how the package is retrieved.

Yes, I think the use case for this is uploaded packages, and I agree that we should always build this package info in Kibana and have only one code path for uploaded and registry packages. It will also help us avoid a PR to the package registry each time we add something to the package (like this one: elastic/package-registry#750).

For the refactoring, I think it's probably too late to do it for 7.16 (my fix should help mitigate the problem), but it is probably something we should tackle in the next releases.

I'm unclear on why the registry connectivity would cause this issue. Inspecting the code where we retrieve packages, it seems we attempt to retrieve packages that were already installed from in-memory cache, then ES regardless of connectivity:

I was only able to reproduce the bug when my Kibana was not able to reach the registry during setup; otherwise, I think the cache is populated somehow and there is no bug.

@joshdover
Member

For the refactoring, I think it's probably too late to do it for 7.16 (my fix should help mitigate the problem), but it is probably something we should tackle in the next releases.

Yep, definitely agree.

I was only able to reproduce the bug when my Kibana was not able to reach the registry during setup; otherwise, I think the cache is populated somehow and there is no bug.

Ah, maybe there's a missing step between 5 and 6 here to hit the setup API?

4. Break the connection between Kibana and the registry (disabling your wifi does the trick)
5. Restart Kibana and do not visit any UI
6. Re-enable the connection between Kibana and the registry (enabling your wifi does the trick)
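For reference, a minimal way to hit the setup API from the command line (assuming the same local credentials used in the reproduction steps above):

curl --request POST \
  --url http://localhost:5601/api/fleet/setup \
  --header 'Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==' \
  --header 'kbn-xsrf: as'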
