From 51fccc8be4a4b6e5fb8184628ef3b0e739cbfe6a Mon Sep 17 00:00:00 2001
From: Maria Andersson
Date: Mon, 13 Apr 2015 16:47:01 +0200
Subject: [PATCH] A base to build on for 2.0

---
 src/cluster/databases.rst |  43 ++++++
 src/cluster/index.rst     |  33 +++++
 src/cluster/nodes.rst     |  81 ++++++++++
 src/cluster/setup.rst     | 163 +++++++++++++++++++++
 src/cluster/sharding.rst  | 300 ++++++++++++++++++++++++++++++++++++++
 src/cluster/theory.rst    |  71 +++++++++
 src/contents.rst          |   1 +
 7 files changed, 692 insertions(+)
 create mode 100644 src/cluster/databases.rst
 create mode 100644 src/cluster/index.rst
 create mode 100644 src/cluster/nodes.rst
 create mode 100644 src/cluster/setup.rst
 create mode 100644 src/cluster/sharding.rst
 create mode 100644 src/cluster/theory.rst

diff --git a/src/cluster/databases.rst b/src/cluster/databases.rst
new file mode 100644
index 00000000..f50cb45b
--- /dev/null
+++ b/src/cluster/databases.rst
@@ -0,0 +1,43 @@
.. Licensed under the Apache License, Version 2.0 (the "License"); you may not
.. use this file except in compliance with the License. You may obtain a copy of
.. the License at
..
..   http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
.. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
.. License for the specific language governing permissions and limitations under
.. the License.

.. _cluster/databases:

===================
Database Management
===================

.. _cluster/databases/create:

Creating a database
===================

This will create a database with ``3`` replicas and ``8`` shards:

.. code-block:: bash

    curl -X PUT "http://xxx.xxx.xxx.xxx:5984/database-name?n=3&q=8" --user admin-user

The database files are stored in ``data/shards``. Look around on all the nodes
and you will find all of its parts.

If you do not specify ``n`` and ``q``, the defaults are used: ``3`` replicas
and ``8`` shards.
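
If you want to confirm what a database was actually created with, you can read
the values back from its database information document. A minimal sketch,
using the same placeholder host and credentials as above; in 2.0 the response
should contain a ``cluster`` object listing the database's ``q``, ``n``, ``r``
and ``w``:

.. code-block:: bash

    # Read the database information document; the "cluster" object in the
    # response shows which q, n, r and w values the database ended up with.
    curl -X GET "http://xxx.xxx.xxx.xxx:5984/database-name" --user admin-user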

.. _cluster/databases/delete:

Deleting a database
===================

.. code-block:: bash

    curl -X DELETE "http://xxx.xxx.xxx.xxx:5984/database-name" --user admin-user

diff --git a/src/cluster/index.rst b/src/cluster/index.rst
new file mode 100644
index 00000000..4ae88eaf
--- /dev/null
+++ b/src/cluster/index.rst
@@ -0,0 +1,33 @@
.. Licensed under the Apache License, Version 2.0 (the "License"); you may not
.. use this file except in compliance with the License. You may obtain a copy of
.. the License at
..
..   http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
.. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
.. License for the specific language governing permissions and limitations under
.. the License.

.. _cluster:

=================
Cluster Reference
=================

As of 2.0, CouchDB has two modes of operation:
 * Standalone
 * Cluster

This part of the documentation is about setting up and maintaining a CouchDB
cluster.

.. toctree::
    :maxdepth: 2

    setup
    theory
    nodes
    databases
    sharding

diff --git a/src/cluster/nodes.rst b/src/cluster/nodes.rst
new file mode 100644
index 00000000..4d05a9b6
--- /dev/null
+++ b/src/cluster/nodes.rst
@@ -0,0 +1,81 @@
.. Licensed under the Apache License, Version 2.0 (the "License"); you may not
.. use this file except in compliance with the License. You may obtain a copy of
.. the License at
..
..   http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
.. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
.. License for the specific language governing permissions and limitations under
.. the License.

.. _cluster/nodes:

===============
Node Management
===============

.. _cluster/nodes/add:

Adding a node
=============

Go to ``http://server1:5984/_membership`` to see the name of this node, all
the nodes it knows about, and the nodes it is connected to.

.. code-block:: text

    curl -X GET "http://xxx.xxx.xxx.xxx:5984/_membership" --user admin-user

.. code-block:: javascript

    {
        "all_nodes":[
            "node1@xxx.xxx.xxx.xxx"],
        "cluster_nodes":[
            "node1@xxx.xxx.xxx.xxx"]
    }

* ``all_nodes`` are all the nodes that this node knows about.
* ``cluster_nodes`` are the nodes that are connected to this node.

To add a node, simply do:

.. code-block:: text

    curl -X PUT "http://xxx.xxx.xxx.xxx:5986/_nodes/node2@yyy.yyy.yyy.yyy" -d {}

Now look at ``http://server1:5984/_membership`` again:

.. code-block:: javascript

    {
        "all_nodes":[
            "node1@xxx.xxx.xxx.xxx",
            "node2@yyy.yyy.yyy.yyy"
        ],
        "cluster_nodes":[
            "node1@xxx.xxx.xxx.xxx",
            "node2@yyy.yyy.yyy.yyy"
        ]
    }

And you have a 2 node cluster :)

``http://yyy.yyy.yyy.yyy:5984/_membership`` will show the same thing, so you
only have to add a node once.

.. _cluster/nodes/remove:

Removing a node
===============

Before you remove a node, make sure that you have moved all
:ref:`shards <cluster/sharding>` away from that node.
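
One way to check is to look for the node in the database metadata documents in
``_dbs``, described in the sharding chapter. A rough sketch, using the
placeholder addresses and node names from this chapter:

.. code-block:: bash

    # If this prints anything, some database still has shards placed on node2
    # and the node is not yet safe to remove.
    curl -s "http://xxx.xxx.xxx.xxx:5986/_dbs/_all_docs?include_docs=true" --user admin-user \
        | grep "node2@yyy.yyy.yyy.yyy"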

To remove ``node2`` from the server ``yyy.yyy.yyy.yyy``:

.. code-block:: text

    curl -X DELETE "http://xxx.xxx.xxx.xxx:5986/_nodes/node2@yyy.yyy.yyy.yyy" -d {}

diff --git a/src/cluster/setup.rst b/src/cluster/setup.rst
new file mode 100644
index 00000000..eafb203d
--- /dev/null
+++ b/src/cluster/setup.rst
@@ -0,0 +1,163 @@
.. Licensed under the Apache License, Version 2.0 (the "License"); you may not
.. use this file except in compliance with the License. You may obtain a copy of
.. the License at
..
..   http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
.. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
.. License for the specific language governing permissions and limitations under
.. the License.

.. _cluster/setup:

=====
Setup
=====

Everything you need to know to prepare the cluster for the installation of
CouchDB.

Firewall
========

If you do not have a firewall between your servers, you can skip this section.

CouchDB in cluster mode uses port ``5984``, just as in standalone mode, but it
also uses ``5986`` for the admin interface.

Erlang uses TCP port ``4369`` (EPMD) to find other nodes, so all servers must
be able to speak to each other on this port. In an Erlang cluster, all nodes
are connected to all other nodes, in a mesh.

.. warning::
    If you expose the port ``4369`` to the Internet or any other untrusted
    network, then the only thing protecting you is the
    :ref:`cookie <cluster/setup/cookie>`.

Every Erlang application then uses other ports for talking to each other. Yes,
this means random ports. This will obviously not work with a firewall, but it
is possible to force an Erlang application to use a specific port range.

This documentation will use the range TCP ``9100-9200``. Open up those ports
in your firewalls, and it is time to test it.

You need 2 servers with working hostnames. Let us call them server1 and
server2.

On server1:

.. code-block:: bash

    erl -sname bus -setcookie 'brumbrum' -kernel inet_dist_listen_min 9100 -kernel inet_dist_listen_max 9200

Then on server2:

.. code-block:: bash

    erl -sname car -setcookie 'brumbrum' -kernel inet_dist_listen_min 9100 -kernel inet_dist_listen_max 9200

An explanation of the commands:
 * ``erl`` the Erlang shell.
 * ``-sname bus`` the name of the Erlang node.
 * ``-setcookie 'brumbrum'`` the "password" used when nodes connect to each
   other.
 * ``-kernel inet_dist_listen_min 9100`` the lowest port in the range.
 * ``-kernel inet_dist_listen_max 9200`` the highest port in the range.

This gives us 2 Erlang shells: shell1 on server1 and shell2 on server2.
Time to connect them. The ``.`` is to Erlang what ``;`` is to C.

In shell1:

.. code-block:: erlang

    net_kernel:connect_node(car@server2).

This will connect to the node called ``car`` on the server called ``server2``.

If that returns true, then you have an Erlang cluster and the firewalls are
open. If you get false or nothing at all, then you have a problem with the
firewall.

First time in Erlang? Time to play!
-----------------------------------

Run in both shells:

.. code-block:: erlang

    register(shell, self()).

shell1:

.. code-block:: erlang

    {shell, car@server2} ! {hello, from, self()}.

shell2:

.. code-block:: erlang

    flush().
    {shell, bus@server1} ! {"It speaks!", from, self()}.

shell1:

.. code-block:: erlang

    flush().

To close the shells, run in both:

.. code-block:: erlang

    q().

Make CouchDB use the open ports
-------------------------------

Open ``sys.config`` on all nodes and add ``inet_dist_listen_min, 9100`` and
``inet_dist_listen_max, 9200`` like below. Note that these are settings for
the ``kernel`` application, so they go into their own ``kernel`` section
rather than into the ``lager`` section:

.. code-block:: erlang

    [
        {lager, [
            {error_logger_hwm, 1000},
            {error_logger_redirect, true},
            {handlers, [
                {lager_console_backend, [debug, {
                    lager_default_formatter,
                    [
                        date, " ", time,
                        " [", severity, "] ",
                        node, " ", pid, " ",
                        message,
                        "\n"
                    ]
                }]}
            ]}
        ]},
        {kernel, [
            {inet_dist_listen_min, 9100},
            {inet_dist_listen_max, 9200}
        ]}
    ].

Configuration files
===================

.. _cluster/setup/cookie:

Erlang Cookie
-------------

Open up ``vm.args`` and set ``-setcookie`` to something secret. This must be
identical on all nodes.

Set ``-name`` to the name the node will have. All nodes must have a unique
name.

Admin
-----

All nodes authenticate users locally, so you must add an admin user to
``local.ini`` on all nodes. Otherwise you will not be able to log in to the
cluster.
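
A minimal sketch of what that can look like; the path to ``local.ini`` and the
password are placeholders, and CouchDB replaces the plain-text password with a
hash the next time it starts:

.. code-block:: bash

    # Add an [admins] section with one admin user to local.ini.
    # Repeat on every node, using the same user name and password.
    printf '\n[admins]\nadmin-user = secret-password\n' >> /path/to/couchdb/etc/local.ini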

diff --git a/src/cluster/sharding.rst b/src/cluster/sharding.rst
new file mode 100644
index 00000000..4df71b03
--- /dev/null
+++ b/src/cluster/sharding.rst
@@ -0,0 +1,300 @@
.. Licensed under the Apache License, Version 2.0 (the "License"); you may not
.. use this file except in compliance with the License. You may obtain a copy of
.. the License at
..
..   http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
.. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
.. License for the specific language governing permissions and limitations under
.. the License.

.. _cluster/sharding:

========
Sharding
========

.. _cluster/sharding/scaling-out:

Scaling out
===========

Normally you start small and grow over time. In the beginning you might do
just fine with one node, but as your data and number of clients grow, you need
to scale out.

For simplicity we will start fresh and small.

Start node1 and add a database to it. To keep it simple we will have 2 shards
and no replicas.

.. code-block:: bash

    curl -X PUT "http://xxx.xxx.xxx.xxx:5984/small?n=1&q=2" --user daboss

If you look in the directory ``data/shards`` you will find the 2 shards.

.. code-block:: text

    data/
    +-- shards/
    |   +-- 00000000-7fffffff/
    |   |    -- small.1425202577.couch
    |   +-- 80000000-ffffffff/
    |        -- small.1425202577.couch

Now, go to the admin panel

.. code-block:: text

    http://xxx.xxx.xxx.xxx:5986/_utils

and look in the database ``_dbs``; it is here that the metadata for each
database is stored. As the database is called small, there is a document
called small there. Let us look at it. Yes, you can get it with curl too:

.. code-block:: javascript

    curl -X GET "http://xxx.xxx.xxx.xxx:5986/_dbs/small"

    {
        "_id": "small",
        "_rev": "1-5e2d10c29c70d3869fb7a1fd3a827a64",
        "shard_suffix": [46, 49, 52, 50, 53, 50, 48, 50, 53, 55, 55],
        "changelog": [
            ["add", "00000000-7fffffff", "node1@xxx.xxx.xxx.xxx"],
            ["add", "80000000-ffffffff", "node1@xxx.xxx.xxx.xxx"]
        ],
        "by_node": {
            "node1@xxx.xxx.xxx.xxx": [
                "00000000-7fffffff",
                "80000000-ffffffff"
            ]
        },
        "by_range": {
            "00000000-7fffffff": ["node1@xxx.xxx.xxx.xxx"],
            "80000000-ffffffff": ["node1@xxx.xxx.xxx.xxx"]
        }
    }

* ``_id`` The name of the database.
* ``_rev`` The current revision of the metadata.
* ``shard_suffix`` The numbers after small and before .couch: the number of
  seconds after the UNIX epoch that the database was created, stored as ASCII
  character codes (see the decoding sketch after this list).
* ``changelog`` The history of the shard map, as ``add`` (and later
  ``delete``) entries. Only for admins to read.
* ``by_node`` Which shards each node has.
* ``by_range`` On which nodes each shard is.
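
As a small illustration of how ``shard_suffix`` is encoded, the character
codes can be decoded back into the file name suffix and the creation time. A
sketch; the ``date`` call assumes GNU coreutils:

.. code-block:: bash

    # Decode the ASCII codes from shard_suffix back into the file suffix.
    echo "46 49 52 50 53 50 48 50 53 55 55" \
        | awk '{ for (i = 1; i <= NF; i++) printf "%c", $i; print "" }'
    # Prints ".1425202577"; the digits are the creation time in seconds:
    date -d @1425202577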

Nothing here, nothing there, a shard in my sleeve
-------------------------------------------------

Start node2 and add it to the cluster. Check in ``/_membership`` that the
nodes are talking with each other.

If you look in the directory ``data`` on node2, you will see that there is no
directory called shards.

Go to Fauxton and edit the metadata for small so that it looks like this:

.. code-block:: javascript

    {
        "_id": "small",
        "_rev": "1-5e2d10c29c70d3869fb7a1fd3a827a64",
        "shard_suffix": [46, 49, 52, 50, 53, 50, 48, 50, 53, 55, 55],
        "changelog": [
            ["add", "00000000-7fffffff", "node1@xxx.xxx.xxx.xxx"],
            ["add", "80000000-ffffffff", "node1@xxx.xxx.xxx.xxx"],
            ["add", "00000000-7fffffff", "node2@yyy.yyy.yyy.yyy"],
            ["add", "80000000-ffffffff", "node2@yyy.yyy.yyy.yyy"]
        ],
        "by_node": {
            "node1@xxx.xxx.xxx.xxx": [
                "00000000-7fffffff",
                "80000000-ffffffff"
            ],
            "node2@yyy.yyy.yyy.yyy": [
                "00000000-7fffffff",
                "80000000-ffffffff"
            ]
        },
        "by_range": {
            "00000000-7fffffff": [
                "node1@xxx.xxx.xxx.xxx",
                "node2@yyy.yyy.yyy.yyy"
            ],
            "80000000-ffffffff": [
                "node1@xxx.xxx.xxx.xxx",
                "node2@yyy.yyy.yyy.yyy"
            ]
        }
    }

Then press Save and marvel at the magic. The shards are now on node2 too! We
now have ``n=2``!

If the shards are large, you can copy them over manually and only have CouchDB
sync the changes from the last few minutes instead.

.. _cluster/sharding/move:

Moving Shards
=============

Add, then delete
----------------

In the world of CouchDB there is no such thing as moving. You can add a new
replica to a shard and then remove the old replica, thereby creating the
illusion of moving. If you try to uphold this illusion with a database that
has ``n=1``, you might find yourself in the following scenario:

#. Copy the shard to a new node.
#. Update the metadata to use the new node.
#. Delete the shard on the old node.
#. Lose all writes made between 1 and 2.

As the reality "I added a new replica of shard X on node Y, waited for them to
sync, and then removed the replica of shard X from node Z" is a bit tedious,
people, and this documentation, tend to use the illusion of moving.

Moving
------

When you get to ``n=3`` you should start moving the shards instead of adding
more replicas.

We will stop at ``n=2`` to keep things simple. Start node number 3 and add it
to the cluster. Then create the directories for the shard on node3:

.. code-block:: bash

    mkdir -p data/shards/00000000-7fffffff

And copy over ``data/shards/00000000-7fffffff/small.1425202577.couch`` from
node1 to node3. Do not move files between the shard directories, as that will
confuse CouchDB!
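
What that copy can look like, as a sketch; the host name and the data
directory are placeholders for wherever your installation keeps its files:

.. code-block:: bash

    # Copy the shard file from node1 into the matching directory on node3.
    scp data/shards/00000000-7fffffff/small.1425202577.couch \
        node3:/path/to/data/shards/00000000-7fffffff/

Remember that the matching view files under ``data/.shards`` need the same
treatment; see :ref:`cluster/sharding/views` below.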

Edit the database document in ``_dbs`` again. Make it so that node3 has a
replica of the shard ``00000000-7fffffff``. Save the document and let CouchDB
sync. If we do not do this, then writes made during the copy of the shard and
the updating of the metadata will only have ``n=1`` until CouchDB has synced.

Then update the metadata document so that node2 no longer has the shard
``00000000-7fffffff``. You can now safely delete
``data/shards/00000000-7fffffff/small.1425202577.couch`` on node2.

The changelog is nothing that CouchDB cares about, it is only for the admins.
But for the sake of completeness, we will update it again. Use ``delete`` for
recording the removal of the shard ``00000000-7fffffff`` from node2.

Start node4, add it to the cluster and do the same as above with shard
``80000000-ffffffff``.

All documents added during this operation were saved, and all reads were
responded to, without the users noticing anything.

.. _cluster/sharding/views:

Views
=====

The views need to be moved together with the shards. If you do not move them,
CouchDB will rebuild them, and this will take time if you have a lot of
documents.

The views are stored in ``data/.shards``.

It is possible to not move the views and let CouchDB rebuild the view every
time you move a shard. As this can take quite some time, it is not
recommended.

.. _cluster/sharding/preshard:

Reshard? No, Preshard!
======================

Reshard? Nope. It cannot be done. So do not create databases with too few
shards.

If you cannot scale out any more because you set the number of shards too low,
then you need to create a new cluster and migrate over.

#. Build a cluster with enough nodes to handle one copy of your data.
#. Create a database with the same name, ``n=1`` and with enough shards so you
   do not have to do this again.
#. Set up 2 way replication between the 2 clusters.
#. Let it sync.
#. Tell clients to use both the clusters.
#. Add some nodes to the new cluster and add them as replicas.
#. Remove some nodes from the old cluster.
#. Repeat 6 and 7 until you have enough nodes in the new cluster to have 3
   replicas of every shard.
#. Redirect all clients to the new cluster.
#. Turn off the 2 way replication between the clusters.
#. Shut down the old cluster and add the servers as new nodes to the new
   cluster.
#. Relax!

Creating more shards than you need and then moving the shards around is called
presharding. The number of shards you need depends on how much data you are
going to store. But creating too many shards increases the complexity without
any real gain. You might even get lower performance. As an example of this, we
can take the author's 15-year-old lab server. It gets noticeably slower with
more than one shard and high load, as the hard drive must seek more.

How many shards you should have depends, as always, on your use case and your
hardware. If you do not know what to do, use the default of 8 shards.

diff --git a/src/cluster/theory.rst b/src/cluster/theory.rst
new file mode 100644
index 00000000..30f6ddf6
--- /dev/null
+++ b/src/cluster/theory.rst
@@ -0,0 +1,71 @@
.. Licensed under the Apache License, Version 2.0 (the "License"); you may not
.. use this file except in compliance with the License. You may obtain a copy of
.. the License at
..
..   http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
.. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
.. License for the specific language governing permissions and limitations under
.. the License.

.. _cluster/theory:

======
Theory
======

Before we move on, we need some theory.

As you can see in ``etc/default.ini``, there is a section called ``[cluster]``:

.. code-block:: text

    [cluster]
    q=8
    r=2
    w=2
    n=3

* ``q`` - The number of shards.
* ``r`` - The number of copies of a document with the same revision that have
  to be read before CouchDB returns with a ``200`` and the document. If there
  is only one copy of the document accessible, then that is returned with
  ``200``.
* ``w`` - The number of nodes that need to save a document before a write is
  returned with ``201``. If the document could not be saved on ``w`` nodes,
  ``202`` is returned instead.
* ``n`` - The number of copies there are of every document. Replicas.

When creating a database or doing a read or write, you can send your own
values with the request and thereby override the defaults in ``default.ini``.
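
For example, ``q`` and ``n`` are given when the database is created, as shown
in :ref:`cluster/databases/create`, while ``r`` and ``w`` can be sent with
individual requests. A sketch, assuming they are accepted as query parameters
on the document endpoints; the host, database and document name are the usual
placeholders:

.. code-block:: bash

    # Write a document and wait until all three replicas have confirmed it.
    curl -X PUT "http://xxx.xxx.xxx.xxx:5984/database-name/doc-id?w=3" \
        --user admin-user -d '{"example": true}'

    # Read the document back, accepting an answer from a single replica.
    curl -X GET "http://xxx.xxx.xxx.xxx:5984/database-name/doc-id?r=1" --user admin-user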

We will focus on the shards and replicas for now.

A shard is a part of a database. The more shards, the more you can scale out.
If you have 4 shards, that means that you can have at most 4 nodes. With one
shard you can have only one node, just the way CouchDB 1.x works.

Replicas add failure resistance, as some nodes can be offline without
everything crashing down.

* ``n=1`` All nodes must be up.
* ``n=2`` Any 1 node can be down.
* ``n=3`` Any 2 nodes can be down.
* etc

Computers go down, and sysadmins pull out network cables in a furious rage
from time to time, so using ``n<2`` is asking for downtime. Setting ``n`` too
high adds servers and complexity without any real benefit. The sweet spot is
``n=3``.

Say that we have a database with 3 replicas and 4 shards. That would give us a
maximum of 12 nodes: 4*3=12. Every shard has 3 copies.

We can lose any 2 nodes and still read and write all documents.

What happens if we lose more nodes? It depends on how lucky we are. As long as
there is at least one copy of every shard online, we can read and write all
documents.

So, if we are very lucky, we can lose up to 8 nodes.

diff --git a/src/contents.rst b/src/contents.rst
index 6405d3fa..29b947f9 100644
--- a/src/contents.rst
+++ b/src/contents.rst
@@ -28,6 +28,7 @@
     query-server/index
     fauxton/index
     api/index
+    cluster/index
    json-structure
    experimental
    contributing