[RFE] Get Flatcar added to supported operating systems for Google Ops Agent #560

Closed
cpswan opened this issue Nov 25, 2021 · 6 comments

@cpswan
Contributor

cpswan commented Nov 25, 2021

Current situation

The install script for the Ops Agent (curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh && sudo bash add-google-cloud-ops-agent-repo.sh --also-install) fails on Flatcar with the message "Unidentifiable or unsupported platform. See https://cloud.google.com/stackdriver/docs/solutions/ops-agent/#supported_operating_systems for a list of supported platforms."

Flatcar is not listed at https://cloud.google.com/stackdriver/docs/solutions/ops-agent/#supported_operating_systems

Impact

GCE VMs running Flatcar can't be added to monitoring dashboards.

Ideal future situation

Flatcar will become a supported OS for the Ops Agent, and future versions of the script will install it.

Implementation options

The Ops Agent could be baked into future Flatcar GCE images.

@cpswan cpswan added the kind/feature A feature request label Nov 25, 2021
@t-lo
Member

t-lo commented Jan 11, 2022

Hello @cpswan,

Please pardon the long silence. We have taken a closer look at GCP's ops-agent (https://github.com/GoogleCloudPlatform/ops-agent) and its potential relevance for Flatcar, and would like to share our findings below. Could you also elaborate on the scenario you're using Flatcar in, and on what benefits ops-agent would bring over other monitoring tools?

Ops-agent ships basic host metrics support (CPU, network, disk, and memory usage stats), but it also includes support for gathering metrics from a significant number of on-host applications such as JVM apps, Apache, memcached, etc. Supporting these at the host level feels irrelevant for Flatcar since we're a container-focused OS; application workloads are expected to ship in containers instead of running directly in the host OS. Flatcar also does not ship a JVM, and the SDK does not include a Java runtime, so even a native build of ops-agent for Flatcar would be challenging. Support for these apps pulls in a large number of dependencies and seems to be baked in; that is, it does not seem possible to ship a stripped-down version of ops-agent without said support unless ops-agent's code is changed significantly.

Furthermore, ops-agent appears to have runtime dependencies on a number of metrics gateways (fluentd, collectd, opentelemetry) which are not readily available on Flatcar and would need to be added, e.g. as container images. Alternatively, significant code changes to ops-agent would be required to make these runtime requirements optional.

Thirdly, to make the installer script work, we would need GCP to either produce or ingest the code changes called out above, and to produce a build. This is on the Google side of things, which we don't have control over.

Could you please elaborate on your scenario so we can better help find a path forward?

@cpswan
Contributor Author

cpswan commented Jan 12, 2022

Thanks @t-lo

Basic host metrics are what we use the agent for. We have a custom dashboard in GCP Monitoring that shows us:

  • CPU load (5m) [MEAN]
  • VM Instance - Memory utilization [MEAN]

This first group seems to get its data from the agent, as the Flatcar hosts we have running don't show up.

We also have these on the dashboard, and the Flatcar hosts are already included, as the metrics come from the VM rather than the OS.

  • VM Instance CPU usage
  • VM Swap out
  • VM Swap In
  • VM Memory Used
  • VM Instance - Received bytes, Sent bytes
  • VM Instance CPU utilization [MEAN]

Digging around in the Metrics Explorer, I can see that the first group comes from agent.googleapis.com whilst the second group comes from compute.googleapis.com.
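To illustrate the split, here is a minimal sketch that sorts metric types by their domain prefix. The helper name `metric_source` and the mapping are hypothetical, and the metric type strings in the usage lines are examples of the naming pattern rather than an exact match for the dashboard widgets listed above.

```python
# Hypothetical mapping from metric domain to where the data originates.
SOURCES = {
    "agent.googleapis.com": "in-guest agent",
    "compute.googleapis.com": "hypervisor (no agent needed)",
    "custom.googleapis.com": "user-defined custom metric",
}


def metric_source(metric_type: str) -> str:
    """Classify a metric type by the domain before its first slash."""
    domain = metric_type.split("/", 1)[0]
    return SOURCES.get(domain, "unknown")


# Example usage with illustrative metric type names.
print(metric_source("agent.googleapis.com/cpu/load_5m"))
print(metric_source("compute.googleapis.com/instance/cpu/utilization"))
```

Anything reported under the agent domain will be missing for hosts where the agent cannot be installed, which matches what we see for the Flatcar hosts.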

We can probably replace VM Instance - Memory utilization [MEAN] with VM Instance - VM Memory Used.

CPU load (5m) [MEAN] is harder to substitute, as it's also used for alerting, when a host has a load average over 5 for over 15m (which is generally a good indicator that one of our containers has gone awry).
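For clarity, the alert condition can be sketched as below. This is only an illustration of the rule, not how GCP Monitoring evaluates alerting policies; the function name is hypothetical, and it assumes one load sample per minute so that 15 samples cover 15 minutes.

```python
from typing import Sequence


def load_sustained_above(
    samples: Sequence[float], threshold: float = 5.0, window: int = 15
) -> bool:
    """Return True if the most recent `window` samples all exceed `threshold`."""
    if len(samples) < window:
        # Not enough history yet to say the load has been sustained.
        return False
    return all(sample > threshold for sample in samples[-window:])


# Example usage: 15 minutes above 5 fires, a single dip does not.
print(load_sustained_above([6.2] * 15))
print(load_sustained_above([6.2] * 14 + [4.8]))
```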

We're not at all interested in metrics from on-host applications, because we run everything in containers.

If you can suggest another way of getting CPU load from Flatcar into GCP Monitoring we won't need the agent.

@cpswan
Contributor Author

cpswan commented Mar 17, 2022

Closing this as I've solved the problem with a containerised Python script to send CPU load average as a GCP custom metric.

Dockerfile

FROM python:3.10.3-slim

WORKDIR /usr/src/app

RUN pip3 install --no-cache-dir google-cloud-monitoring

COPY docker_send_data.py .

CMD [ "python3", "./docker_send_data.py" ]

docker_send_data.py

#!/usr/bin/env python3
"""Send the 5-minute load average to GCP Monitoring as a custom metric."""
import os
import time

import requests
from google.cloud import monitoring_v3

# Look up this VM's hostname and project from the GCE metadata server.
metadata_server = "http://metadata/computeMetadata/v1/"
metadata_flavor = {"Metadata-Flavor": "Google"}
# gce_id = requests.get(metadata_server + "instance/id", headers=metadata_flavor).text
gce_name = requests.get(metadata_server + "instance/hostname", headers=metadata_flavor).text
gce_project = requests.get(metadata_server + "project/project-id", headers=metadata_flavor).text
# The first two dot-separated fields of the hostname are used as the
# instance and zone labels below.
split_gce_name = gce_name.split(".", 2)

client = monitoring_v3.MetricServiceClient()
project_id = gce_project
project_name = f"projects/{project_id}"

# One time series, reused on every iteration with a fresh data point.
series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/at_swarm_node_load"
series.resource.type = "gce_instance"
series.resource.labels["instance_id"] = split_gce_name[0]
series.resource.labels["zone"] = split_gce_name[1]

while True:
    # Only the 5-minute load average is reported.
    load1, load5, load15 = os.getloadavg()

    # Split the current time into whole seconds and nanoseconds.
    now = time.time()
    seconds = int(now)
    nanos = int((now - seconds) * 10 ** 9)
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": seconds, "nanos": nanos}}
    )
    point = monitoring_v3.Point({"interval": interval, "value": {"double_value": load5}})
    series.points = [point]
    client.create_time_series(request={"name": project_name, "time_series": [series]})
    # Write one point per minute.
    time.sleep(60)
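The two pieces of the script that do not need GCP at all, splitting the hostname and splitting the timestamp, can be pulled out and checked on their own. The sketch below is a hedged variation rather than part of the script above: the helper names are hypothetical, and it assumes the metadata hostname has the form name.zone.c.project.internal, raising an error otherwise instead of sending a wrong zone label.

```python
from typing import Tuple


def parse_gce_hostname(hostname: str) -> Tuple[str, str]:
    """Return (instance name, zone) from a hostname like name.zone.c.project.internal."""
    parts = hostname.split(".", 2)
    # A hostname without a zone field would put "c" where the zone should be.
    if len(parts) < 3 or not parts[0] or not parts[1] or parts[1] == "c":
        raise ValueError(f"hostname has no zone field: {hostname!r}")
    return parts[0], parts[1]


def split_timestamp(now: float) -> Tuple[int, int]:
    """Split a float Unix time into whole seconds and nanoseconds."""
    seconds = int(now)
    nanos = int((now - seconds) * 10 ** 9)
    return seconds, nanos


# Example usage with a made-up hostname.
name, zone = parse_gce_hostname("node1.europe-west2-a.c.my-project.internal")
print(name, zone)
print(split_timestamp(1.5))
```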

@cpswan cpswan closed this as completed Mar 17, 2022
@t-lo
Member

t-lo commented Mar 25, 2022

Amazing work, @cpswan! I just read your blog post about this: https://blog.atsign.dev/google-cloud-custom-metrics-cl16j2q2b05gujonv0h9x2dg5

Just out of curiosity, would you be interested in adding your approach (and your script) to our GCP platform docs as a work-around for our lack of ops agent?
(After merging, this would ultimately become available here: https://www.flatcar.org/docs/latest/installing/cloud/gcp/)

@cpswan
Contributor Author

cpswan commented Mar 25, 2022

Sure, @t-lo. I've done a docs PR here before, so I'll dive in for another...

@t-lo
Member

t-lo commented Mar 25, 2022

Awesome, thank you!
