octopus: doc: start: Update hardware recommendations #34225

Merged: 1 commit, merged on Apr 1, 2020
122 changes: 79 additions & 43 deletions doc/start/hardware-recommendations.rst
@@ -59,8 +59,45 @@ configured to consume. We recommend 1 GB as a minimum for most systems. See
OSDs (ceph-osd)
---------------

By default, OSDs that use the BlueStore backend require 3-5 GB of RAM. You can
adjust the amount of memory the OSD consumes with the ``osd_memory_target`` configuration option when BlueStore is in use. When using the legacy FileStore backend, the operating system page cache is used for caching data, so no tuning is normally needed, and the OSD memory consumption is generally related to the number of PGs per daemon in the system.
Memory
======

BlueStore uses its own memory to cache data rather than relying on the
operating system page cache. With BlueStore you can adjust the amount of
memory that the OSD attempts to consume with the ``osd_memory_target``
configuration option (see the example after the list below).

- Setting ``osd_memory_target`` below 2GB is typically not recommended (the
OSD may fail to keep memory consumption that low, and extremely slow
performance may result).

- Setting the memory target between 2GB and 4GB typically works but may result
in degraded performance, as metadata may be read from disk during IO unless the
active data set is relatively small.

- 4GB is the current default ``osd_memory_target`` size. This default was
chosen to balance memory requirements with OSD performance for typical use
cases.

- Setting ``osd_memory_target`` higher than 4GB may improve performance when
there are many (small) objects, or when large (256GB/OSD or more) data sets are
being processed.
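
For example, one way to raise the target to 8GB on every OSD is through the
centralized configuration database (the 8GB value here is only an
illustration; ``osd_memory_target`` is expressed in bytes)::

    ceph config set osd osd_memory_target 8589934592

The same option can instead be set per host in the ``[osd]`` section of
``ceph.conf``.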

.. important:: The OSD memory autotuning is "best effort". While the OSD may
unmap memory to allow the kernel to reclaim it, there is no guarantee that
the kernel will actually reclaim freed memory within any specific time
frame. This is especially true in older versions of Ceph where transparent
huge pages can prevent the kernel from reclaiming memory freed from
fragmented huge pages. Modern versions of Ceph disable transparent huge
pages at the application level to avoid this, though that still does not
guarantee that the kernel will immediately reclaim unmapped memory. The OSD
may still at times exceed its memory target. We recommend budgeting around
20% extra memory on your system to prevent OSDs from going OOM during
temporary spikes or due to any delay in reclaiming freed pages by the
kernel. That value may be more or less than needed depending on the exact
configuration of the system.
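
As a rough illustration of that guideline: a host running 12 OSDs at the
default 4GB memory target (the 12-OSD count is just an example) would be
budgeted roughly 12 x 4GB x 1.2, or about 58GB of RAM, for the OSDs alone, in
addition to memory for the operating system and any other daemons on the host.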

When using the legacy FileStore backend, the page cache is used for caching
data, so no tuning is normally needed, and the OSD memory consumption is
generally related to the number of PGs per daemon in the system.


Data Storage
@@ -89,9 +126,6 @@ gigabyte (i.e., $75 / 1024 = 0.0732). By contrast, a 3 terabyte hard disk priced
at $150.00 has a cost of $0.05 per gigabyte (i.e., $150 / 3072 = 0.0488). In the
foregoing example, using the 1 terabyte disks would generally increase the cost
per gigabyte by 40%--rendering your cluster substantially less cost efficient.
Also, the larger the storage drive capacity, the more memory per Ceph OSD Daemon
you will need, especially during rebalancing, backfilling and recovery. A
general rule of thumb is ~1GB of RAM for 1TB of storage space.

.. tip:: Running multiple OSDs on a single disk--irrespective of partitions--is
**NOT** a good idea.
@@ -209,31 +243,21 @@ multiple OSDs per host.
Networks
========

We recommend that each host has at least two 1Gbps network interface
controllers (NICs). Since most commodity hard disk drives have a throughput of
approximately 100MB/second, your NICs should be able to handle the traffic for
the OSD disks on your host. We recommend a minimum of two NICs to account for a
public (front-side) network and a cluster (back-side) network. A cluster network
(preferably not connected to the internet) handles the additional load for data
replication and helps stop denial of service attacks that prevent the cluster
from achieving ``active + clean`` states for placement groups as OSDs replicate
data across the cluster. Consider starting with a 10Gbps network in your racks.
Replicating 1TB of data across a 1Gbps network takes 3 hours, and 3TBs (a
typical drive configuration) takes 9 hours. By contrast, with a 10Gbps network,
the replication times would be 20 minutes and 1 hour respectively. In a
petabyte-scale cluster, failure of an OSD disk should be an expectation, not an
exception. System administrators will appreciate PGs recovering from a
``degraded`` state to an ``active + clean`` state as rapidly as possible, with
price / performance tradeoffs taken into consideration. Additionally, some
deployment tools (e.g., Dell's Crowbar) deploy with five different networks,
but employ VLANs to make hardware and network cabling more manageable. VLANs
using 802.1q protocol require VLAN-capable NICs and Switches. The added hardware
expense may be offset by the operational cost savings for network setup and
maintenance. When using VLANs to handle VM traffic between the cluster
and compute stacks (e.g., OpenStack, CloudStack, etc.), it is also worth
considering using 10G Ethernet. Top-of-rack routers for each network also need
to be able to communicate with spine routers that have even faster
throughput--e.g., 40Gbps to 100Gbps.
Consider starting with a 10Gbps+ network in your racks. Replicating 1TB of data
across a 1Gbps network takes 3 hours, and 10TBs takes 30 hours! By contrast,
with a 10Gbps network, the replication times would be 20 minutes and 1 hour
respectively. In a petabyte-scale cluster, failure of an OSD disk should be an
expectation, not an exception. System administrators will appreciate PGs
recovering from a ``degraded`` state to an ``active + clean`` state as rapidly
as possible, with price / performance tradeoffs taken into consideration.
Additionally, some deployment tools employ VLANs to make hardware and network

Review comment: Extra space after "make"
Reply (Contributor): this is a cherry-pick, though

cabling more manageable. VLANs using 802.1q protocol require VLAN-capable NICs
and Switches. The added hardware expense may be offset by the operational cost
savings for network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), it is also worth considering using 10G Ethernet. Top-of-rack routers for
each network also need to be able to communicate with spine routers that have
even faster throughput--e.g., 40Gbps to 100Gbps.
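
As a rough check on those figures: assuming an effective throughput of about
100MB/s on a 1Gbps link (the line rate is roughly 125MB/s, before protocol and
replication overhead), moving 1TB takes about 1,000,000MB / 100MB/s = 10,000
seconds, or roughly 3 hours; a 10Gbps link cuts that time by about a factor of
ten.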

Your server hardware should have a Baseboard Management Controller (BMC).
Administration and deployment tools may also use BMCs extensively, so consider
@@ -266,34 +290,46 @@ and development clusters can run successfully with modest hardware.
+--------------+----------------+-----------------------------------------+
| Process | Criteria | Minimum Recommended |
+==============+================+=========================================+
| ``ceph-osd`` | Processor | - 1x 64-bit AMD-64 |
| | | - 1x 32-bit ARM dual-core or better |
| ``ceph-osd`` | Processor | - 1 core minimum |
| | | - 1 core per 200-500 MB/s |
| | | - 1 core per 1000-3000 IOPS |
| | | |
| | | * Results are before replication. |
| | | * Results may vary with different |
| | | CPU models and Ceph features. |
| | | (erasure coding, compression, etc) |
| | | * ARM processors specifically may |
| | | require additional cores. |
| | | * Actual performance depends on many |
| | | factors including disk, network, and |
| | | client throughput and latency. |
| | | Benchmarking is highly recommended. |
| +----------------+-----------------------------------------+
| | RAM | ~1GB for 1TB of storage per daemon |
| | RAM | - 4GB+ per daemon (more is better) |
| | | - 2-4GB often functions (may be slow) |
| | | - Less than 2GB not recommended |
| +----------------+-----------------------------------------+
| | Volume Storage | 1x storage drive per daemon |
| +----------------+-----------------------------------------+
| | Journal | 1x SSD partition per daemon (optional) |
| | DB/WAL | 1x SSD partition per daemon (optional) |
| +----------------+-----------------------------------------+
| | Network | 2x 1GB Ethernet NICs |
| | Network | 1x 1GbE+ NICs (10GbE+ recommended) |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor | - 1x 64-bit AMD-64 |
| | | - 1x 32-bit ARM dual-core or better |
| ``ceph-mon`` | Processor | - 1 core minimum |
| +----------------+-----------------------------------------+
| | RAM | 1 GB per daemon |
| | RAM | 2GB+ per daemon |
| +----------------+-----------------------------------------+
| | Disk Space | 10 GB per daemon |
| +----------------+-----------------------------------------+
| | Network | 2x 1GB Ethernet NICs |
| | Network | 1x 1GbE+ NICs |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor | - 1x 64-bit AMD-64 quad-core |
| | | - 1x 32-bit ARM quad-core |
| ``ceph-mds`` | Processor | - 1 core minimum |
| +----------------+-----------------------------------------+
| | RAM | 1 GB minimum per daemon |
| | RAM | 2GB+ per daemon |
| +----------------+-----------------------------------------+
| | Disk Space | 1 MB per daemon |
| +----------------+-----------------------------------------+
| | Network | 2x 1GB Ethernet NICs |
| | Network | 1x 1GbE+ NICs |
+--------------+----------------+-----------------------------------------+
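
As a rough worked example of the OSD CPU guidance above (the drive count and
per-drive throughput here are assumptions chosen for illustration): a host
with 8 HDD-backed OSDs, each expected to sustain on the order of 200MB/s,
would call for roughly 8 cores for the OSDs by the 1 core per 200-500 MB/s
rule, before accounting for replication, the operating system, and other
daemons.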

.. tip:: If you are running an OSD with a single disk, create a