
Loki crashes after ResourceExhausted exc #418

Closed
cbartz opened this issue Jun 6, 2024 · 4 comments

Comments

@cbartz

cbartz commented Jun 6, 2024

Bug Description

If I query Loki and the result is too large, Loki quits and restarts after a backoff.

Loki is unavailable for some time.

See also grafana/loki#6568 (may be related)

To Reproduce

Query Loki (e.g. via Grafana) with a query that is expected to return a large result.

Environment

Juju 3.1.8, Loki charm rev 128

Relevant log output

2024-06-06T12:42:07.480Z [loki] level=info ts=2024-06-06T12:42:07.443049992Z caller=roundtrip.go:241 org_id=fake traceID=6beb14df2d1e9b3a msg="executing query" type=range query="sum by (level) (count_over_time({juju_unit=~\"grafana-agent.*\"} |= `ERROR`[1s]))" length=1h0m0s step=1s query_hash=2233776241
2024-06-06T12:42:07.480Z [loki] level=info ts=2024-06-06T12:42:07.443205745Z caller=roundtrip.go:241 org_id=fake traceID=6fe9fbd58352e86e msg="executing query" type=range query="{juju_unit=~\"grafana-agent.*\"} |= `ERROR`" length=1h0m0s step=1s query_hash=1441634808
2024-06-06T12:42:07.480Z [loki] level=info ts=2024-06-06T12:42:07.445664656Z caller=engine.go:234 component=querier org_id=fake traceID=6fe9fbd58352e86e msg="executing query" type=range query="{juju_unit=~\"grafana-agent.*\"} |= `ERROR`" length=1h0m0s step=1s query_hash=1441634808
2024-06-06T12:42:07.480Z [loki] level=info ts=2024-06-06T12:42:07.445695844Z caller=engine.go:234 component=querier org_id=fake traceID=6beb14df2d1e9b3a msg="executing query" type=range query="sum by (level) (count_over_time({juju_unit=~\"grafana-agent.*\"} |= `ERROR`[1s]))" length=1h0m0s step=1s query_hash=2233776241
2024-06-06T12:42:07.779Z [loki] level=warn ts=2024-06-06T12:42:07.747137446Z caller=grpc_logging.go:78 method=/logproto.Querier/Query duration=300.217086ms err="rpc error: code = ResourceExhausted desc = trying to send message larger than max (5549220 vs. 4194304)" msg=gRPC
2024-06-06T12:42:09.880Z [loki] level=error ts=2024-06-06T12:42:09.802213459Z caller=errors.go:26 org_id=fake traceID=6fe9fbd58352e86e message="closing iterator" error="rpc error: code = ResourceExhausted desc = trying to send message larger than max (5549220 vs. 4194304)"
2024-06-06T12:42:09.880Z [loki] level=info ts=2024-06-06T12:42:09.802446528Z caller=metrics.go:160 component=querier org_id=fake traceID=6fe9fbd58352e86e latency=fast query="{juju_unit=~\"grafana-agent.*\"} |= `ERROR`" query_hash=1441634808 query_type=filter range_type=range length=1h0m0s start_delta=1h0m2.618438092s end_delta=2.618438232s step=1s duration=2.356562891s status=500 limit=1000 returned_lines=1000 throughput=412MB total_bytes=970MB total_bytes_structured_metadata=0B lines_per_second=65639 total_lines=154683 post_filter_lines=2918 total_entries=1000 store_chunks_download_time=645.688015ms queue_time=114.055µs splits=0 shards=0 chunk_refs_fetch_time=27.059301ms cache_chunk_req=350 cache_chunk_hit=206 cache_chunk_bytes_stored=105578575 cache_chunk_bytes_fetched=29595003 cache_chunk_download_time=603.825µs cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=0 cache_stats_results_hit=0 cache_stats_results_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s
2024-06-06T12:42:09.880Z [loki] level=error ts=2024-06-06T12:42:09.809884994Z caller=retry.go:73 org_id=fake traceID=6fe9fbd58352e86e msg="error processing request" try=0 query="{juju_unit=~\"grafana-agent.*\"} |= `ERROR`" err="rpc error: code = Code(500) desc = rpc error: code = ResourceExhausted desc = trying to send message larger than max (5549220 vs. 4194304)\n"
2024-06-06T12:42:09.880Z [loki] level=info ts=2024-06-06T12:42:09.811440904Z caller=engine.go:234 component=querier org_id=fake traceID=6fe9fbd58352e86e msg="executing query" type=range query="{juju_unit=~\"grafana-agent.*\"} |= `ERROR`" length=1h0m0s step=1s query_hash=1441634808
2024-06-06T12:42:10.079Z [loki] level=warn ts=2024-06-06T12:42:10.054895491Z caller=grpc_logging.go:78 method=/logproto.Querier/Query duration=242.990002ms err="rpc error: code = ResourceExhausted desc = trying to send message larger than max (5549220 vs. 4194304)" msg=gRPC
2024-06-06T12:42:10.244Z [pebble] Service "loki" stopped unexpectedly with code 137
2024-06-06T12:42:10.245Z [pebble] Service "loki" on-failure action is "restart", waiting ~500ms before restart (backoff 1)

2024-06-06T12:40:17.255Z [traefik] time="2024-06-06T12:40:17Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 192.168.102.219:3100: connect: connection refused"
2024-06-06T12:40:17.255Z [traefik] time="2024-06-06T12:40:17Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 192.168.102.219:3100: connect: connection refused"
2024-06-06T12:40:17.406Z [traefik] time="2024-06-06T12:40:17Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 192.168.102.219:3100: connect: connection refused"
2024-06-06T12:40:17.457Z [traefik] time="2024-06-06T12:40:17Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 192.168.102.219:3100: connect: connection refused"
2024-06-06T12:40:17.559Z [traefik] time="2024-06-06T12:40:17Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 192.168.102.219:3100: connect: connection refused"
2024-06-06T12:40:17.577Z [traefik] time="2024-06-06T12:40:17Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 192.168.102.219:3100: connect: connection refused"
2024-06-06T12:40:17.708Z [traefik] time="2024-06-06T12:40:17Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 192.168.102.219:3100: connect: connection refused"
2024-06-06T12:40:17.821Z [traefik] time="2024-06-06T12:40:17Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 192.168.102.219:3100: connect: connection refused"
2024-06-06T12:40:44.350Z [traefik] time="2024-06-06T12:40:44Z" level=debug msg="'502 Bad Gateway' caused by: read tcp 192.168.102.244:44092->192.168.102.219:3100: read: connection reset by peer"
2024-06-06T12:40:44.350Z [traefik] time="2024-06-06T12:40:44Z" level=debug msg="'502 Bad Gateway' caused by: read tcp 192.168.102.244:44106->192.168.102.219:3100: read: connection reset by peer"
2024-06-06T12:40:44.351Z [traefik] time="2024-06-06T12:40:44Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 192.168.102.219:3100: connect: connection refused"
2024-06-06T12:40:44.351Z [traefik] time="2024-06-06T12:40:44Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 192.168.102.219:3100: connect: connection refused"
2024-06-06T12:40:44.446Z [traefik] time="2024-06-06T12:40:44Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 192.168.102.219:3100: connect: connection refused"
2024-06-06T12:40:44.558Z [traefik] time="2024-06-06T12:40:44Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 192.168.102.219:3100: connect: connection refused"
2024-06-06T12:40:44.623Z [traefik] time="2024-06-06T12:40:44Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 192.168.102.219:3100: connect: connection refused"
2024-06-06T12:40:45.119Z [traefik] time="2024-06-06T12:40:45Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 192.168.102.219:3100: connect: connection refused"
2024-06-06T12:40:45.232Z [traefik] time="2024-06-06T12:40:45Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 192.168.102.219:3100: connect: connection refused"
2024-06-06T12:40:45.237Z [traefik] time="2024-06-06T12:40:45Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 192.168.102.219:3100: connect: connection refused"
2024-06-06T12:40:45.245Z [traefik] time="2024-06-06T12:40:45Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 192.168.102.219:3100: connect: connection refused"
2024-06-06T12:40:45.307Z [traefik] time="2024-06-06T12:40:45Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 192.168.102.219:3100: connect: connection refused"
2024-06-06T12:40:45.440Z [traefik] time="2024-06-06T12:40:45Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 192.168.102.219:3100: connect: connection refused"

Additional context

No response

@dstathis
Contributor

There seem to be two issues here:

  1. Queries larger than a certain size are blocked. This is clearly related to the linked bug and should be easy to fix if we are okay with increasing the value. We can discuss this within the team.
  2. Loki crashes. This does not happen in the upstream issue. What resources are available to the Loki pod on your system? They might not be enough to handle such large queries. Do you have the cpu or memory config values set? How much memory does your k8s system have?

@cbartz
Author

cbartz commented Jun 13, 2024

@dstathis We are using 1 Gi for the memory config, and the cpu config is unset.

In terms of memory:

$ kubectl top pod loki-0
NAME     CPU(cores)   MEMORY(bytes)   
loki-0   19m          711Mi
root@loki-0:/# cat /sys/fs/cgroup/memory.max 
1073741824

and the container's resource spec is

    resources:
      limits:
        memory: "1073741824"
      requests:
        cpu: 250m
        memory: 200Mi

The particular k8s node has 32 GiB of total memory.

@sed-i
Contributor

sed-i commented Jun 13, 2024

  • Are you able to temporarily update the Loki config by hand with a higher grpc_server_max_send_msg_size? You can juju ssh into the unit, apt install vim, edit /etc/loki/loki-local-config.yaml, and pebble restart loki (see the sketch after this list). If this works, then we could easily add it as a config option.
  • Perhaps you could narrow down your LogQL queries so they return less data?
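
A minimal sketch of what the relevant server settings in /etc/loki/loki-local-config.yaml could look like after such a manual edit, assuming the charm-rendered config uses the standard Loki server block; the 16 MiB value is only an illustrative example, not a recommendation:

    server:
      # gRPC message size limits; Loki's default for both is 4194304 (4 MiB),
      # which matches the "larger than max (... vs. 4194304)" error above.
      grpc_server_max_recv_msg_size: 16777216
      grpc_server_max_send_msg_size: 16777216

As noted above, this would only be a temporary manual test; if it helps, the value could be exposed as a charm config option.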

@cbartz
Author

cbartz commented Jun 14, 2024

Thanks for your answers. I will close the ticket, as the cause of the crash seems to be clearly memory-related (exit code 137) rather than the ResourceExhausted error (I also saw the crash for other queries without the ResourceExhausted). This probably indicates that we need to increase the memory: 1 Gi seems to be too low, and at first glance a test with 2 Gi looks better. I wonder if the charm could log a memory-related warning in this case (when Loki exits with 137), so that the operator does not have to dig into the k8s logs for troubleshooting.
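
For reference, a hedged sketch of the memory bump described above, assuming the application is deployed under the name loki and that the charm's memory option accepts a Kubernetes-style quantity (both details may differ in your deployment):

$ juju config loki memory=2Gi
$ kubectl top pod loki-0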

If necessary, I will open another issue related to the grpc_server_max_send_msg_size config; so far this morning I could not reproduce it.

And of course narrowing the LogQL queries helps, but sometimes it is hard to narrow a query when you first need to find the root cause of an issue (you need to know what to narrow for).
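
For what it's worth, a hypothetical narrowing of the query seen in the logs above: restricting the matcher to a single unit (the unit name here is illustrative) or shrinking the time range in Grafana both reduce the amount of data Loki has to return:

    {juju_unit="grafana-agent/0"} |= `ERROR`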

@cbartz cbartz closed this as completed Jun 14, 2024