From 51fccc8be4a4b6e5fb8184628ef3b0e739cbfe6a Mon Sep 17 00:00:00 2001
From: Maria Andersson
Date: Mon, 13 Apr 2015 16:47:01 +0200
Subject: [PATCH] A base to build on for 2.0

---
 src/cluster/databases.rst |  43 ++++++
 src/cluster/index.rst     |  33 +++++
 src/cluster/nodes.rst     |  81 ++++++++++
 src/cluster/setup.rst     | 163 +++++++++++++++++++++
 src/cluster/sharding.rst  | 300 ++++++++++++++++++++++++++++++++++++++
 src/cluster/theory.rst    |  71 +++++++++
 src/contents.rst          |   1 +
 7 files changed, 692 insertions(+)
 create mode 100644 src/cluster/databases.rst
 create mode 100644 src/cluster/index.rst
 create mode 100644 src/cluster/nodes.rst
 create mode 100644 src/cluster/setup.rst
 create mode 100644 src/cluster/sharding.rst
 create mode 100644 src/cluster/theory.rst

diff --git a/src/cluster/databases.rst b/src/cluster/databases.rst
new file mode 100644
index 00000000..f50cb45b
--- /dev/null
+++ b/src/cluster/databases.rst
@@ -0,0 +1,43 @@
.. Licensed under the Apache License, Version 2.0 (the "License"); you may not
.. use this file except in compliance with the License. You may obtain a copy of
.. the License at
..
..   http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
.. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
.. License for the specific language governing permissions and limitations under
.. the License.

.. _cluster/databases:

===================
Database Management
===================

.. _cluster/databases/create:

Creating a database
===================

This will create a database with ``3`` replicas and ``8`` shards:

.. code-block:: bash

    curl -X PUT "http://xxx.xxx.xxx.xxx:5984/database-name?n=3&q=8" --user admin-user

The database files are stored in ``data/shards``. Look around on all the nodes
and you will find all of its parts.

If you do not specify ``n`` and ``q``, the defaults are used: ``3`` replicas
and ``8`` shards.
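
If you want to confirm what a database was actually created with, you can read
the values back from its database information document. A minimal sketch,
using the same placeholder host and credentials as above; in 2.0 the response
should contain a ``cluster`` object listing the database's ``q``, ``n``, ``r``
and ``w``:

.. code-block:: bash

    # Read the database information document; the "cluster" object in the
    # response shows which q, n, r and w values the database ended up with.
    curl -X GET "http://xxx.xxx.xxx.xxx:5984/database-name" --user admin-user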

.. _cluster/databases/delete:

Deleting a database
===================

.. code-block:: bash

    curl -X DELETE "http://xxx.xxx.xxx.xxx:5984/database-name" --user admin-user

diff --git a/src/cluster/index.rst b/src/cluster/index.rst
new file mode 100644
index 00000000..4ae88eaf
--- /dev/null
+++ b/src/cluster/index.rst
@@ -0,0 +1,33 @@
.. Licensed under the Apache License, Version 2.0 (the "License"); you may not
.. use this file except in compliance with the License. You may obtain a copy of
.. the License at
..
..   http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
.. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
.. License for the specific language governing permissions and limitations under
.. the License.

.. _cluster:

=================
Cluster Reference
=================

As of 2.0, CouchDB has two modes of operation:
 * Standalone
 * Cluster

This part of the documentation is about setting up and maintaining a CouchDB
cluster.

.. toctree::
    :maxdepth: 2

    setup
    theory
    nodes
    databases
    sharding

diff --git a/src/cluster/nodes.rst b/src/cluster/nodes.rst
new file mode 100644
index 00000000..4d05a9b6
--- /dev/null
+++ b/src/cluster/nodes.rst
@@ -0,0 +1,81 @@
.. Licensed under the Apache License, Version 2.0 (the "License"); you may not
.. use this file except in compliance with the License. You may obtain a copy of
.. the License at
..
..   http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
.. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
.. License for the specific language governing permissions and limitations under
.. the License.

.. _cluster/nodes:

===============
Node Management
===============

.. _cluster/nodes/add:

Adding a node
=============

Go to ``http://server1:5984/_membership`` to see the name of this node, all
the nodes it knows about, and the nodes it is connected to.

.. code-block:: text

    curl -X GET "http://xxx.xxx.xxx.xxx:5984/_membership" --user admin-user

.. code-block:: javascript

    {
        "all_nodes":[
            "node1@xxx.xxx.xxx.xxx"],
        "cluster_nodes":[
            "node1@xxx.xxx.xxx.xxx"]
    }

* ``all_nodes`` are all the nodes that this node knows about.
* ``cluster_nodes`` are the nodes that are connected to this node.

To add a node, simply do:

.. code-block:: text

    curl -X PUT "http://xxx.xxx.xxx.xxx:5986/_nodes/node2@yyy.yyy.yyy.yyy" -d {}

Now look at ``http://server1:5984/_membership`` again:

.. code-block:: javascript

    {
        "all_nodes":[
            "node1@xxx.xxx.xxx.xxx",
            "node2@yyy.yyy.yyy.yyy"
        ],
        "cluster_nodes":[
            "node1@xxx.xxx.xxx.xxx",
            "node2@yyy.yyy.yyy.yyy"
        ]
    }

And you have a 2 node cluster :)

``http://yyy.yyy.yyy.yyy:5984/_membership`` will show the same thing, so you
only have to add a node once.

.. _cluster/nodes/remove:

Removing a node
===============

Before you remove a node, make sure that you have moved all
:ref:`shards <cluster/sharding>` away from that node.
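
One way to check is to look for the node in the database metadata documents in
``_dbs``, described in the sharding chapter. A rough sketch, using the
placeholder addresses and node names from this chapter:

.. code-block:: bash

    # If this prints anything, some database still has shards placed on node2
    # and the node is not yet safe to remove.
    curl -s "http://xxx.xxx.xxx.xxx:5986/_dbs/_all_docs?include_docs=true" --user admin-user \
        | grep "node2@yyy.yyy.yyy.yyy"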

To remove ``node2`` from the server ``yyy.yyy.yyy.yyy``:

.. code-block:: text

    curl -X DELETE "http://xxx.xxx.xxx.xxx:5986/_nodes/node2@yyy.yyy.yyy.yyy" -d {}

diff --git a/src/cluster/setup.rst b/src/cluster/setup.rst
new file mode 100644
index 00000000..eafb203d
--- /dev/null
+++ b/src/cluster/setup.rst
@@ -0,0 +1,163 @@
.. Licensed under the Apache License, Version 2.0 (the "License"); you may not
.. use this file except in compliance with the License. You may obtain a copy of
.. the License at
..
..   http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
.. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
.. License for the specific language governing permissions and limitations under
.. the License.

.. _cluster/setup:

=====
Setup
=====

Everything you need to know to prepare the cluster for the installation of
CouchDB.

Firewall
========

If you do not have a firewall between your servers, you can skip this section.

CouchDB in cluster mode uses port ``5984``, just as in standalone mode, but it
also uses ``5986`` for the admin interface.

Erlang uses TCP port ``4369`` (EPMD) to find other nodes, so all servers must
be able to speak to each other on this port. In an Erlang cluster, all nodes
are connected to all other nodes, in a mesh.

.. warning::
    If you expose the port ``4369`` to the Internet or any other untrusted
    network, then the only thing protecting you is the
    :ref:`cookie <cluster/setup/cookie>`.

Every Erlang application then uses other ports for talking to each other. Yes,
this means random ports. This will obviously not work with a firewall, but it
is possible to force an Erlang application to use a specific port range.

This documentation will use the range TCP ``9100-9200``. Open up those ports
in your firewalls, and it is time to test it.

You need 2 servers with working hostnames. Let us call them server1 and
server2.

On server1:

.. code-block:: bash

    erl -sname bus -setcookie 'brumbrum' -kernel inet_dist_listen_min 9100 -kernel inet_dist_listen_max 9200

Then on server2:

.. code-block:: bash

    erl -sname car -setcookie 'brumbrum' -kernel inet_dist_listen_min 9100 -kernel inet_dist_listen_max 9200

An explanation of the commands:
 * ``erl`` the Erlang shell.
 * ``-sname bus`` the name of the Erlang node.
 * ``-setcookie 'brumbrum'`` the "password" used when nodes connect to each
   other.
 * ``-kernel inet_dist_listen_min 9100`` the lowest port in the range.
 * ``-kernel inet_dist_listen_max 9200`` the highest port in the range.

This gives us 2 Erlang shells: shell1 on server1 and shell2 on server2.
Time to connect them. The ``.`` is to Erlang what ``;`` is to C.

In shell1:

.. code-block:: erlang

    net_kernel:connect_node(car@server2).

This will connect to the node called ``car`` on the server called ``server2``.

If that returns true, then you have an Erlang cluster and the firewalls are
open. If you get false or nothing at all, then you have a problem with the
firewall.

First time in Erlang? Time to play!
-----------------------------------

Run in both shells:

.. code-block:: erlang

    register(shell, self()).

shell1:

.. code-block:: erlang

    {shell, car@server2} ! {hello, from, self()}.

shell2:

.. code-block:: erlang

    flush().
    {shell, bus@server1} ! {"It speaks!", from, self()}.

shell1:

.. code-block:: erlang

    flush().

To close the shells, run in both:

.. code-block:: erlang

    q().

Make CouchDB use the open ports
-------------------------------

Open ``sys.config`` on all nodes and add ``inet_dist_listen_min, 9100`` and
``inet_dist_listen_max, 9200`` like below. Note that these are settings for
the ``kernel`` application, so they go into their own ``kernel`` section
rather than into the ``lager`` section:

.. code-block:: erlang

    [
        {lager, [
            {error_logger_hwm, 1000},
            {error_logger_redirect, true},
            {handlers, [
                {lager_console_backend, [debug, {
                    lager_default_formatter,
                    [
                        date, " ", time,
                        " [", severity, "] ",
                        node, " ", pid, " ",
                        message,
                        "\n"
                    ]
                }]}
            ]}
        ]},
        {kernel, [
            {inet_dist_listen_min, 9100},
            {inet_dist_listen_max, 9200}
        ]}
    ].

Configuration files
===================

.. _cluster/setup/cookie:

Erlang Cookie
-------------

Open up ``vm.args`` and set ``-setcookie`` to something secret. This must be
identical on all nodes.

Set ``-name`` to the name the node will have. All nodes must have a unique
name.

Admin
-----

All nodes authenticate users locally, so you must add an admin user to
``local.ini`` on all nodes. Otherwise you will not be able to log in to the
cluster.
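
A minimal sketch of what that can look like; the path to ``local.ini`` and the
password are placeholders, and CouchDB replaces the plain-text password with a
hash the next time it starts:

.. code-block:: bash

    # Add an [admins] section with one admin user to local.ini.
    # Repeat on every node, using the same user name and password.
    printf '\n[admins]\nadmin-user = secret-password\n' >> /path/to/couchdb/etc/local.ini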

diff --git a/src/cluster/sharding.rst b/src/cluster/sharding.rst
new file mode 100644
index 00000000..4df71b03
--- /dev/null
+++ b/src/cluster/sharding.rst
@@ -0,0 +1,300 @@
.. Licensed under the Apache License, Version 2.0 (the "License"); you may not
.. use this file except in compliance with the License. You may obtain a copy of
.. the License at
..
..   http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
.. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
.. License for the specific language governing permissions and limitations under
.. the License.

.. _cluster/sharding:

========
Sharding
========

.. _cluster/sharding/scaling-out:

Scaling out
===========

Normally you start small and grow over time. In the beginning you might do
just fine with one node, but as your data and number of clients grow, you need
to scale out.

For simplicity we will start fresh and small.

Start node1 and add a database to it. To keep it simple we will have 2 shards
and no replicas.

.. code-block:: bash

    curl -X PUT "http://xxx.xxx.xxx.xxx:5984/small?n=1&q=2" --user daboss

If you look in the directory ``data/shards`` you will find the 2 shards.

.. code-block:: text

    data/
    +-- shards/
    |   +-- 00000000-7fffffff/
    |   |    -- small.1425202577.couch
    |   +-- 80000000-ffffffff/
    |        -- small.1425202577.couch

Now, go to the admin panel

.. code-block:: text

    http://xxx.xxx.xxx.xxx:5986/_utils

and look in the database ``_dbs``; it is here that the metadata for each
database is stored. As the database is called small, there is a document
called small there. Let us look at it. Yes, you can get it with curl too:

.. code-block:: javascript

    curl -X GET "http://xxx.xxx.xxx.xxx:5986/_dbs/small"

    {
        "_id": "small",
        "_rev": "1-5e2d10c29c70d3869fb7a1fd3a827a64",
        "shard_suffix": [46, 49, 52, 50, 53, 50, 48, 50, 53, 55, 55],
        "changelog": [
            ["add", "00000000-7fffffff", "node1@xxx.xxx.xxx.xxx"],
            ["add", "80000000-ffffffff", "node1@xxx.xxx.xxx.xxx"]
        ],
        "by_node": {
            "node1@xxx.xxx.xxx.xxx": [
                "00000000-7fffffff",
                "80000000-ffffffff"
            ]
        },
        "by_range": {
            "00000000-7fffffff": ["node1@xxx.xxx.xxx.xxx"],
            "80000000-ffffffff": ["node1@xxx.xxx.xxx.xxx"]
        }
    }

* ``_id`` The name of the database.
* ``_rev`` The current revision of the metadata.
* ``shard_suffix`` The numbers after small and before .couch: the number of
  seconds after the UNIX epoch that the database was created, stored as ASCII
  character codes (see the decoding sketch after this list).
* ``changelog`` The history of the shard map, as ``add`` (and later
  ``delete``) entries. Only for admins to read.
* ``by_node`` Which shards each node has.
* ``by_range`` On which nodes each shard is.
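
As a small illustration of how ``shard_suffix`` is encoded, the character
codes can be decoded back into the file name suffix and the creation time. A
sketch; the ``date`` call assumes GNU coreutils:

.. code-block:: bash

    # Decode the ASCII codes from shard_suffix back into the file suffix.
    echo "46 49 52 50 53 50 48 50 53 55 55" \
        | awk '{ for (i = 1; i <= NF; i++) printf "%c", $i; print "" }'
    # Prints ".1425202577"; the digits are the creation time in seconds:
    date -d @1425202577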

Nothing here, nothing there, a shard in my sleeve
-------------------------------------------------

Start node2 and add it to the cluster. Check in ``/_membership`` that the
nodes are talking with each other.

If you look in the directory ``data`` on node2, you will see that there is no
directory called shards.

Go to Fauxton and edit the metadata for small so that it looks like this:

.. code-block:: javascript

    {
        "_id": "small",
        "_rev": "1-5e2d10c29c70d3869fb7a1fd3a827a64",
        "shard_suffix": [46, 49, 52, 50, 53, 50, 48, 50, 53, 55, 55],
        "changelog": [
            ["add", "00000000-7fffffff", "node1@xxx.xxx.xxx.xxx"],
            ["add", "80000000-ffffffff", "node1@xxx.xxx.xxx.xxx"],
            ["add", "00000000-7fffffff", "node2@yyy.yyy.yyy.yyy"],
            ["add", "80000000-ffffffff", "node2@yyy.yyy.yyy.yyy"]
        ],
        "by_node": {
            "node1@xxx.xxx.xxx.xxx": [
                "00000000-7fffffff",
                "80000000-ffffffff"
            ],
            "node2@yyy.yyy.yyy.yyy": [
                "00000000-7fffffff",
                "80000000-ffffffff"
            ]
        },
        "by_range": {
            "00000000-7fffffff": [
                "node1@xxx.xxx.xxx.xxx",
                "node2@yyy.yyy.yyy.yyy"
            ],
            "80000000-ffffffff": [
                "node1@xxx.xxx.xxx.xxx",
                "node2@yyy.yyy.yyy.yyy"
            ]
        }
    }

Then press Save and marvel at the magic. The shards are now on node2 too! We
now have ``n=2``!

If the shards are large, you can copy them over manually and only have CouchDB
sync the changes from the last few minutes instead.

.. _cluster/sharding/move:

Moving Shards
=============

Add, then delete
----------------

In the world of CouchDB there is no such thing as moving. You can add a new
replica to a shard and then remove the old replica, thereby creating the
illusion of moving. If you try to uphold this illusion with a database that
has ``n=1``, you might find yourself in the following scenario:

#. Copy the shard to a new node.
#. Update the metadata to use the new node.
#. Delete the shard on the old node.
#. Lose all writes made between 1 and 2.

As the reality "I added a new replica of shard X on node Y, waited for them to
sync, and then removed the replica of shard X from node Z" is a bit tedious,
people, and this documentation, tend to use the illusion of moving.

Moving
------

When you get to ``n=3`` you should start moving the shards instead of adding
more replicas.

We will stop at ``n=2`` to keep things simple. Start node number 3 and add it
to the cluster. Then create the directories for the shard on node3:

.. code-block:: bash

    mkdir -p data/shards/00000000-7fffffff

And copy over ``data/shards/00000000-7fffffff/small.1425202577.couch`` from
node1 to node3. Do not move files between the shard directories, as that will
confuse CouchDB!
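
What that copy can look like, as a sketch; the host name and the data
directory are placeholders for wherever your installation keeps its files:

.. code-block:: bash

    # Copy the shard file from node1 into the matching directory on node3.
    scp data/shards/00000000-7fffffff/small.1425202577.couch \
        node3:/path/to/data/shards/00000000-7fffffff/

Remember that the matching view files under ``data/.shards`` need the same
treatment; see :ref:`cluster/sharding/views` below.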

Edit the database document in ``_dbs`` again. Make it so that node3 has a
replica of the shard ``00000000-7fffffff``. Save the document and let CouchDB
sync. If we do not do this, then writes made during the copy of the shard and
the updating of the metadata will only have ``n=1`` until CouchDB has synced.

Then update the metadata document so that node2 no longer has the shard
``00000000-7fffffff``. You can now safely delete
``data/shards/00000000-7fffffff/small.1425202577.couch`` on node2.

The changelog is nothing that CouchDB cares about, it is only for the admins.
But for the sake of completeness, we will update it again. Use ``delete`` for
recording the removal of the shard ``00000000-7fffffff`` from node2.

Start node4, add it to the cluster and do the same as above with shard
``80000000-ffffffff``.

All documents added during this operation were saved, and all reads were
responded to, without the users noticing anything.

.. _cluster/sharding/views:

Views
=====

The views need to be moved together with the shards. If you do not move them,
CouchDB will rebuild them, and this will take time if you have a lot of
documents.

The views are stored in ``data/.shards``.

It is possible to not move the views and let CouchDB rebuild the view every
time you move a shard. As this can take quite some time, it is not
recommended.

.. _cluster/sharding/preshard:

Reshard? No, Preshard!
======================

Reshard? Nope. It cannot be done. So do not create databases with too few
shards.

If you cannot scale out any more because you set the number of shards too low,
then you need to create a new cluster and migrate over.

#. Build a cluster with enough nodes to handle one copy of your data.
#. Create a database with the same name, ``n=1`` and with enough shards so you
   do not have to do this again.
#. Set up 2 way replication between the 2 clusters.
#. Let it sync.
#. Tell clients to use both the clusters.
#. Add some nodes to the new cluster and add them as replicas.
#. Remove some nodes from the old cluster.
#. Repeat 6 and 7 until you have enough nodes in the new cluster to have 3
   replicas of every shard.
#. Redirect all clients to the new cluster.
#. Turn off the 2 way replication between the clusters.
#. Shut down the old cluster and add the servers as new nodes to the new
   cluster.
#. Relax!

Creating more shards than you need and then moving the shards around is called
presharding. The number of shards you need depends on how much data you are
going to store. But creating too many shards increases the complexity without
any real gain. You might even get lower performance. As an example of this, we
can take the author's 15-year-old lab server. It gets noticeably slower with
more than one shard and high load, as the hard drive must seek more.

How many shards you should have depends, as always, on your use case and your
hardware. If you do not know what to do, use the default of 8 shards.

diff --git a/src/cluster/theory.rst b/src/cluster/theory.rst
new file mode 100644
index 00000000..30f6ddf6
--- /dev/null
+++ b/src/cluster/theory.rst
@@ -0,0 +1,71 @@
.. Licensed under the Apache License, Version 2.0 (the "License"); you may not
.. use this file except in compliance with the License. You may obtain a copy of
.. the License at
..
..   http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
.. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
.. License for the specific language governing permissions and limitations under
.. the License.

.. _cluster/theory:

======
Theory
======

Before we move on, we need some theory.

As you can see in ``etc/default.ini``, there is a section called ``[cluster]``:

.. code-block:: text

    [cluster]
    q=8
    r=2
    w=2
    n=3

* ``q`` - The number of shards.
* ``r`` - The number of copies of a document with the same revision that have
  to be read before CouchDB returns with a ``200`` and the document. If there
  is only one copy of the document accessible, then that is returned with
  ``200``.
* ``w`` - The number of nodes that need to save a document before a write is
  returned with ``201``. If the document could not be saved on ``w`` nodes,
  ``202`` is returned instead.
* ``n`` - The number of copies there are of every document. Replicas.

When creating a database or doing a read or write, you can send your own
values with the request and thereby override the defaults in ``default.ini``.
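
For example, ``q`` and ``n`` are given when the database is created, as shown
in :ref:`cluster/databases/create`, while ``r`` and ``w`` can be sent with
individual requests. A sketch, assuming they are accepted as query parameters
on the document endpoints; the host, database and document name are the usual
placeholders:

.. code-block:: bash

    # Write a document and wait until all three replicas have confirmed it.
    curl -X PUT "http://xxx.xxx.xxx.xxx:5984/database-name/doc-id?w=3" \
        --user admin-user -d '{"example": true}'

    # Read the document back, accepting an answer from a single replica.
    curl -X GET "http://xxx.xxx.xxx.xxx:5984/database-name/doc-id?r=1" --user admin-user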

We will focus on the shards and replicas for now.

A shard is a part of a database. The more shards, the more you can scale out.
If you have 4 shards, that means that you can have at most 4 nodes. With one
shard you can have only one node, just the way CouchDB 1.x works.

Replicas add failure resistance, as some nodes can be offline without
everything crashing down.

* ``n=1`` All nodes must be up.
* ``n=2`` Any 1 node can be down.
* ``n=3`` Any 2 nodes can be down.
* etc

Computers go down, and sysadmins pull out network cables in a furious rage
from time to time, so using ``n<2`` is asking for downtime. Setting ``n`` too
high adds servers and complexity without any real benefit. The sweet spot is
``n=3``.

Say that we have a database with 3 replicas and 4 shards. That would give us a
maximum of 12 nodes: 4*3=12. Every shard has 3 copies.

We can lose any 2 nodes and still read and write all documents.

What happens if we lose more nodes? It depends on how lucky we are. As long as
there is at least one copy of every shard online, we can read and write all
documents.

So, if we are very lucky, we can lose up to 8 nodes.

diff --git a/src/contents.rst b/src/contents.rst
index 6405d3fa..29b947f9 100644
--- a/src/contents.rst
+++ b/src/contents.rst
@@ -28,6 +28,7 @@
     query-server/index
     fauxton/index
     api/index
+    cluster/index
    json-structure
    experimental
    contributing