Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
117 changes: 117 additions & 0 deletions docs/admin/bootstrap-checks.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,117 @@
```{highlight} sh
```

(bootstrap-checks)=

# Bootstrap checks

If you are binding to a network reachable IP address, CrateDB performs a number
of bootstrap checks during startup. These checks examine your setup and will
prevent startup if a problem is detected.

This best practices guide is intended to help you configure your setup so that
CrateDB passes the bootstrap checks and can perform optimally.

If you are binding to the loopback address, the bootstrap checks will not be
run, but it is still a good idea to follow these instructions.

:::{TIP}
If you are using Docker, these checks are dependent on the host system. You
must configure the host system appropriately if you wish to run CrateDB on
Docker. Consult the additional documentation on Docker {ref}`resource
constraints <resource_constraints>` for more information.
:::

## System settings

### Official packages

If you are using one of the official packages, all of the necessary operating
system configuration is handled for you.

### Tarball

If you have installed CrateDB from a tarball, you must manually configure your
operating system.

Here's what needs to be configured:

- **File descriptors**
: - Set hard and soft limit to unlimited
- **Memory lock**
: - Set hard and soft limit to unlimited
- **Threads**
: - Set hard and soft limit to 4096 (CrateDB versions < 3.0.0 will also work
with 2048)
- **Virtual memory**
: - Set hard and soft limit to unlimited (on Linux only)
- **Memory map**
: - Set limit to 262144 (on Linux only)

You might be able to set these limits per process or per user, depending on
your operating system and setup. And for this to take effect, you may also have
to set these limits for the superuser.

#### Linux

If you're running Linux, you can perform the necessary configuration by
following these instructions.

Edit `/etc/security/limits.conf` and configure these lines:

```
crate soft nofile unlimited
crate hard nofile unlimited

crate soft memlock unlimited
crate hard memlock unlimited

crate soft nproc 4096
crate hard nproc 4096

crate soft as unlimited
crate hard as unlimited
```

Here, `crate` is the user that runs the CrateDB daemon.

Edit `/etc/sysctl.conf` and configure:

```
vm.max_map_count = 262144
```

To apply this change, run:

```console
$ sudo sysctl -p
```

This command will this reload all settings from `/etc/sysctl.conf`.

:::{TIP}
Alternatively, `vm.max_map_count` can be set directly, like so:

```console
$ sysctl -w vm.max_map_count=262144
```

Note, however, this setting will be reset to the value in
`/etc/sysctl.conf` when your system next boots.
:::

## Garbage collection

CrateDB has been tested using the [CMS garbage collector] (default up to and
including CrateDB 4.0) and the [G1 garbage collector] (the default with CrateDB 4.1).

`G1GC` can also be used in earlier CrateDB versions, but should only be used in
combination with Java 11 or later.

:::{WARNING}
Other garbage collectors have not been tested with CrateDB and we do not
support using other GCs in production.
:::

[cms garbage collector]: https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/cms.html
[g1 garbage collector]: https://docs.oracle.com/javase/10/gctuning/garbage-first-garbage-collector.htm
125 changes: 0 additions & 125 deletions docs/admin/bootstrap-checks.rst

This file was deleted.

90 changes: 90 additions & 0 deletions docs/admin/circuit-breaker.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
(circuit-breaker)=

# Circuit Breaker

CrateDB's circuit breakers prevent queries and background processes from exhausting node memory and destabilizing the cluster.
This works by estimating the memory each operation will require and aborting the operation before the JVM heap space is exhausted.

## What is a Circuit Breaker

A **circuit breaker** is a safeguard that halts work when resource usage crosses a dangerous threshold.
Think of the miniature breakers inside a household fuse box: if too many appliances pull current on the same line, the breaker
trips and cuts power to prevent the wires from melting. The same principle applies in software, only the resource under pressure
is memory, CPU, file descriptors, or an external service.

In CrateDB, the critical resource is **RAM**. Queries run in parallel across many shards; a single
oversize aggregation or JOIN can allocate gigabytes in milliseconds. The breaker detects this and aborts the query with a
`CircuitBreakingException` instead of letting the JVM run out of heap and crash the node.

## How Circuit Breakers Work in CrateDB

A query executes as an ordered set of operations. Before running each stage, CrateDB estimates the extra memory that step will need.
If the projected total exceeds the breaker limit, the system aborts the query and returns a `CircuitBreakingException`.
This pre-emptive trip prevents the JVM's garbage collector from reaching an unrecoverable out-of-memory state.

It is important to understand CrateDB doesn’t aspire to do a fully accurate memory accounting, but instead opts for a best-effort approach,
since a precise estimate is tricky to achieve.

## Types of Circuit Breakers

There are six different Circuit Breaker types which are described in detail in the [cluster settings] documentation page: `query`,
`request`, `jobs_log`, `operations_log`, `total` and `accounting`, which was deprecated and will be removed soon. The `total` Circuit Breaker, also
known as `parent`, accounts for all others, meaning that it controls the general use of memory, tripping an operation if a
combination of the circuit breakers threatens the cluster.

## Monitoring & Observability

To monitor the Circuit Breaker usage, follow the [JMX monitoring] guide. In particular,
consult the [CircuitBreakers MXBean] section.

For hosted deployments, see [Cloud monitoring] and for self-managed clusters, the [on-prem monitoring] guide. Both describe the complete path from cluster
deployment to collecting metrics and displaying them on a Grafana dashboard.

## Exception Handling

```console
CircuitBreakingException[Allocating 2mb for 'query: mergeOnHandler' failed, breaker would use 976.4mb in total. Limit is 972.7mb. Either increase memory and limit, change the query or reduce concurrent query load]
```

- **Understanding the error**

The memory estimate for **mergeOnHandler** exceeded the `indices.breaker.query.limit`, so the query was aborted and the
exception returned.

- **Immediate actions**

- **Optimize the query** - see {ref}`Query Optimization 101 <performance-optimization>` for detailed guidance.

- **Identify memory-hungry queries** - run:

```psql
SELECT js.id,
stmt,
username,
sum(used_bytes) sum_bytes
FROM sys.operations op
JOIN sys.jobs js ON op.job_id = js.id
GROUP BY js.id, stmt, username
ORDER BY sum_bytes DESC;
```

The query above will return all jobs that are currently being executed in the cluster. To check completed jobs and operations, query
the corresponding system tables `sys.jobs_log` and `sys.operations_log`. Access to these tables are subject to [table permissions].

- **Scale the cluster** - if breakers trip frequently even after query tuning, evaluate scaling your cluster to get more resources.

Similar exceptions exist for the other breaker types `[request]`, `[parent]`, `[jobs_log]`, etc.
If you experience a `CircuitBreakingException [parent]` it is because other queries/tasks were running simultaneously and their summed estimate
exceeded `indices.breaker.total.limit`.

## Further Reading

[The Circuit Breaker Mechanism in CrateDB]

[circuitbreakers mxbean]: https://cratedb.com/docs/crate/reference/en/master/admin/monitoring.html#circuitbreakers-mxbean
[cloud monitoring]: https://community.cratedb.com/t/monitoring-cratedb-cloud-clusters/1397
[cluster settings]: https://cratedb.com/docs/crate/reference/en/master/config/cluster.html#query-circuit-breaker
[jmx monitoring]: inv:crate-reference#jmx_monitoring
[on-prem monitoring]: https://community.cratedb.com/t/monitoring-a-self-managed-cratedb-cluster-with-prometheus-and-grafana/1236
[table permissions]: inv:crate-reference#jobs_table_permissions
[the circuit breaker mechanism in cratedb]: https://zignar.net/2021/06/17/the-circuit-breaker-mechanism-in-cratedb/
Loading