
[Fleet] pipeline with id [*] does not exists #116343

Closed
nchaulet opened this issue Oct 26, 2021 · 5 comments · Fixed by #116707
Labels
bug (Fixes for quality problems that affect the customer experience) · Team:Fleet (Team label for Observability Data Collection Fleet team)

Comments

@nchaulet
Member

nchaulet commented Oct 26, 2021

Description

It happens after an upgrade that the agent is not able to send data, with the following error in the logs (replace synthetics-http-0.2.1 with other data streams and packages as applicable):

{"type":"illegal_argument_exception","reason":"pipeline with id [synthetics-http-0.2.1] does not exist"}

We have seen this error a few times for different packages and data streams that do not define any ingest pipeline. I tried different upgrade scenarios and was not able to reproduce it.

What's odd is that none of these pipelines should exist. These data streams are not intended to have a pipeline in these versions of these packages. For example, metrics-system.process.summary does not have any pipeline but has been reported as missing a pipeline in error logs.
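One quick way to confirm that a pipeline named in such an error really is absent is to ask Elasticsearch directly from Dev Tools (a minimal check using the pipeline id from the error above; substitute the id from your own logs):

GET _ingest/pipeline/synthetics-http-0.2.1

If the pipeline does not exist, this request returns a 404, which matches the ingest-time error the agent is hitting.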

How to reproduce

How to reproduce locally?

You need to corrupt the package cache. Using nginx as the package:

  1. From a fresh Kibana and ES
  2. Navigate to Fleet and wait for setup to complete
  3. Install the nginx package
curl --request POST \
  --url http://localhost:5601/api/fleet/epm/packages/nginx-1.1.1 \
  --header 'Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==' \
  --header 'Content-Type: application/json' \
  --header 'kbn-xsrf: as' \
  --data '{
	"force": true
}'
  4. Break the connection between Kibana and the registry (disabling your wifi does the trick)
  5. Restart Kibana and do not visit any UI
  6. Re-enable the connection between Kibana and the registry (enabling your wifi does the trick)
  7. Then force reinstall the same version of the nginx package
curl --request POST \
  --url http://localhost:5601/api/fleet/epm/packages/nginx-1.1.1 \
  --header 'Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==' \
  --header 'Content-Type: application/json' \
  --header 'kbn-xsrf: as' \
  --data '{
	"force": true
}'
  8. Check the index template for the nginx package (in Dev Tools: GET _index_template/metrics-nginx.stubstatus, see the filtered request below); you should see a default pipeline in the index template settings that does not exist.
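For example, to inspect just the relevant setting (the same filter_path used in the investigation section further down; metrics-nginx.stubstatus is one of the nginx data stream index templates):

GET /_index_template/metrics-nginx.stubstatus?filter_path=index_templates.index_template.template.settings.index.default_pipeline

If this returns a default_pipeline, and GET _ingest/pipeline/<that id> returns a 404, you should have reproduced the broken state.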

Bug details: if there is no cache entry, we install the package from a version saved in Elasticsearch, and there is a bug there where we populate the ingest_pipeline with default. We should fix that bug, but we should probably also think about a longer-term solution; relying on a cache system to install a package does not seem future proof.

I think in most scenarios users probably did not call the reinstall endpoint themselves, but we have a mechanism that reinstalls packages during upgrade to install the Fleet final pipeline. I think this, combined with a connection error to the registry, could have caused the same issue.

Workaround

Force reinstalling the package should solve this. If force reinstalling the package does not solve it, you should probably manually roll over the data streams (see below).
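For reference, a force reinstall is the same API call used in the reproduction steps above; replace nginx-1.1.1 with the affected package and version, and adjust the credentials for your own setup:

curl --request POST \
  --url http://localhost:5601/api/fleet/epm/packages/nginx-1.1.1 \
  --header 'Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==' \
  --header 'Content-Type: application/json' \
  --header 'kbn-xsrf: as' \
  --data '{
	"force": true
}'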

Way to investigate and potential workaround (⚠️ not tested yet)

For this investigation, let's use `metrics-system.process` as an example; the same applies to any other data stream. Given the following error from elastic-agent, or a similar one from Elasticsearch:
{"type":"illegal_argument_exception","reason":"pipeline with id [metrics-system.process-1.4.0] does not exist"}, dropping event!
  1. First, let's see what is pointing to the non-existent metrics-system.process-1.4.0 pipeline. Run the following command from Dev Tools in Kibana:
    GET /_index_template/metrics-system.process?filter_path=index_templates.index_template.template.settings.index.default_pipeline
  2. If that returns an empty response, then the reinstall most likely worked. Let's see if this setting is still present on the current concrete index:
    GET /metrics-system.process-*/_settings?filter_path=*.settings.index.default_pipeline
  3. If any of these indices return a non-empty value AND the template request from (1) was empty, then it's likely that rolling over the data stream should fix the issue. Here's the command to try this. ⚠️ This has not yet been tested. If anyone tries this, please add a comment with what happened to get in this state (if known) and how the workaround goes. Note you'd need to change default if you customized the namespace:
    POST /metrics-system.process-default/_rollover

This last command would need to be repeated for each of these data streams. There is no bulk API for this. Another option could be to delete the underlying indices completely if you don't need the data. ⚠️ Warning: this deletes all data ingested by Elastic Agent:

DELETE /logs-*,metrics-*
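If you go the rollover route instead and are not sure which data streams are affected, you can list the candidates first and then check their backing indices with the _settings request from step 2 above (a hedged example; adjust the patterns to your setup):

GET /_data_stream/logs-*,metrics-*?filter_path=data_streams.name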

We're not yet sure of a root cause here so anything you can share would be helpful in making sure that we can fix this bug.

@nchaulet added the bug and Team:Fleet labels on Oct 26, 2021
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@joshdover
Member

Bug details: if there is no cache entry, we install the package from a version saved in Elasticsearch, and there is a bug there where we populate the ingest_pipeline with default. We should fix that bug, but we should probably also think about a longer-term solution; relying on a cache system to install a package does not seem future proof.

I think in most scenarios users probably did not call the reinstall endpoint themselves, but we have a mechanism that reinstalls packages during upgrade to install the Fleet final pipeline. I think this, combined with a connection error to the registry, could have caused the same issue.

@nchaulet I haven't yet been able to reproduce this with your given steps, so I can't yet verify your fix in #116707. I'm unclear on why the registry connectivity would cause this issue. Inspecting the code where we retrieve packages, it seems we attempt to retrieve packages that were already installed from in-memory cache, then ES regardless of connectivity:

if (installedPkg && installedPkg.version === pkgVersion) {
  const { install_source: pkgInstallSource } = installedPkg;
  // check cache
  res = getArchivePackage({
    name: pkgName,
    version: pkgVersion,
  });
  if (res) {
    logger.debug(`retrieved installed package ${pkgName}-${pkgVersion} from cache`);
  }
  if (!res && installedPkg.package_assets) {
    res = await getEsPackage(
      pkgName,
      pkgVersion,
      installedPkg.package_assets,
      savedObjectsClient
    );
    if (res) {
      logger.debug(`retrieved installed package ${pkgName}-${pkgVersion} from ES`);
    }
  }
  // for packages not in cache or package storage and installed from registry, check registry
  if (!res && pkgInstallSource === 'registry') {
    try {
      res = await Registry.getRegistryPackage(pkgName, pkgVersion);
      logger.debug(`retrieved installed package ${pkgName}-${pkgVersion} from registry`);
      // TODO: add to cache and storage here?
    } catch (error) {
      // treating this is a 404 as no status code returned
      // in the unlikely event its missing from cache, storage, and never installed from registry
    }
  }
} else {
That said, I can see why your fix should work. I do think this code needs refactoring. It's not clear to me why we need special post-processing logic for retrieving packages from ES to rebuild the PackageInfo type that we normally retrieve from the registry. Could we not save this info directly instead of trying to re-build it in code and having two separate sources of truth?

Here's where we get this from the registry:

export async function fetchInfo(pkgName: string, pkgVersion: string): Promise<RegistryPackage> {
const registryUrl = getRegistryUrl();
try {
const res = await fetchUrl(`${registryUrl}/package/${pkgName}/${pkgVersion}`).then(JSON.parse);
return res;
} catch (err) {
if (err instanceof RegistryResponseError && err.status === 404) {
throw new PackageNotFoundError(`${pkgName}@${pkgVersion} not found`);
}
throw err;
}
}

And how we rebuild this when retrieving from ES:

// create the packageInfo
// TODO: this is mostly copied from validtion.ts, needed in case package does not exist in storage yet or is missing from cache
// we don't want to reach out to the registry again so recreate it here. should check whether it exists in packageInfoCache first
const manifestPath = `${pkgName}-${pkgVersion}/manifest.yml`;
const soResManifest = await savedObjectsClient.get<PackageAsset>(
  ASSETS_SAVED_OBJECT_TYPE,
  assetPathToObjectId(manifestPath)
);
const packageInfo = safeLoad(soResManifest.attributes.data_utf8);
try {
  const readmePath = `docs/README.md`;
  await savedObjectsClient.get<PackageAsset>(
    ASSETS_SAVED_OBJECT_TYPE,
    assetPathToObjectId(`${pkgName}-${pkgVersion}/${readmePath}`)
  );
  packageInfo.readme = `/package/${pkgName}/${pkgVersion}/${readmePath}`;
} catch (err) {
  // read me doesn't exist
}
let dataStreamPaths: string[] = [];
const dataStreams: RegistryDataStream[] = [];
paths
  .filter((path) => path.startsWith(`${pkgKey}/data_stream/`))
  .forEach((path) => {
    const parts = path.split('/');
    if (parts.length > 2 && parts[2]) dataStreamPaths.push(parts[2]);
  });
dataStreamPaths = uniq(dataStreamPaths);
await Promise.all(
  dataStreamPaths.map(async (dataStreamPath) => {
    const dataStreamManifestPath = `${pkgKey}/data_stream/${dataStreamPath}/manifest.yml`;
    const soResDataStreamManifest = await savedObjectsClient.get<PackageAsset>(
      ASSETS_SAVED_OBJECT_TYPE,
      assetPathToObjectId(dataStreamManifestPath)
    );
    const dataStreamManifest = safeLoad(soResDataStreamManifest.attributes.data_utf8);
    const {
      ingest_pipeline: ingestPipeline,
      dataset,
      streams: manifestStreams,
      ...dataStreamManifestProps
    } = dataStreamManifest;
    const streams = parseAndVerifyStreams(manifestStreams, dataStreamPath);
    dataStreams.push({
      dataset: dataset || `${pkgName}.${dataStreamPath}`,
      package: pkgName,
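      // Note: the `|| 'default'` fallback on the next line is what the "Bug details" above point at:
      // it assigns a default ingest pipeline name even when the data stream's manifest defines none.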
      ingest_pipeline: ingestPipeline || 'default',
      path: dataStreamPath,
      streams,
      ...dataStreamManifestProps,
    });
  })
);
packageInfo.policy_templates = parseAndVerifyPolicyTemplates(packageInfo);
packageInfo.data_streams = dataStreams;
packageInfo.assets = paths.map((path) => {
  return path.replace(`${pkgName}-${pkgVersion}`, `/package/${pkgName}/${pkgVersion}`);
});

@joshdover
Member

Could we not save this info directly instead of trying to re-build it in code and having two separate sources of truth?

Maybe the use case for this is for uploaded packages. If that's the case, then we should always build this PackageInfo object in Kibana so that behavior is consistent regardless of how the package is retrieved.

@nchaulet
Member Author

Maybe the use case for this is for uploaded packages. If that's the case, then we should always build this PackageInfo object in Kibana so that behavior is consistent regardless of how the package is retrieved.

Yes, I think the use case for this is uploaded packages, and I agree that we should always build this package info in Kibana and have only one code path for uploaded and registry packages. It will also help us avoid a PR to the package registry each time we add something to the package (like this one: elastic/package-registry#750).

For the refactoring, I think it's probably too late to do it for 7.16 (my fix should help mitigate the problem), but it is probably something we should tackle in the next releases.

I'm unclear on why the registry connectivity would cause this issue. Inspecting the code where we retrieve packages, it seems we attempt to retrieve packages that were already installed from in-memory cache, then ES regardless of connectivity:

I was only able to reproduce the bug when my Kibana was not able to reach the registry during setup; otherwise, I think the cache is populated somehow and there is no bug.

@joshdover
Member

For the refactoring, I think it's probably too late to do it for 7.16 (my fix should help mitigate the problem), but it is probably something we should tackle in the next releases.

Yep, definitely agree.

I was only able to reproduce the bug when my Kibana was not able to reach the registry during setup; otherwise, I think the cache is populated somehow and there is no bug.

Ah, maybe there's a missing step between 5 and 6 here to hit the setup API?

4. Break the connection between Kibana and the registry (disabling your wifi does the trick)
5. Restart Kibana and do not visit any UI
6. Re-enable the connection between Kibana and the registry (enabling your wifi does the trick)
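For reference, a minimal way to hit the setup API from the command line (assuming the same local credentials used in the reproduction steps above):

curl --request POST \
  --url http://localhost:5601/api/fleet/setup \
  --header 'Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==' \
  --header 'kbn-xsrf: as'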
