Docs updates on range distr. and copy from worker in append distr. (#970)

* Fix mismatch type error: boolean to int

* Remove copy from worker for append-partitioned table

* Remove range-distribution mentions in the docs

* Change node_capacity_function return type to real
naisila authored and jonels-msft committed Jan 13, 2021
1 parent 8f4b4c2 commit 30263dd
Showing 3 changed files with 8 additions and 18 deletions.
4 changes: 2 additions & 2 deletions develop/api_metadata.rst
@@ -436,9 +436,9 @@ Here are examples of functions that can be used within new shard rebalancer stra
-- example of node_capacity_function
CREATE FUNCTION v2_node_double_capacity(nodeidarg int)
- RETURNS boolean AS $$
+ RETURNS real AS $$
SELECT
- (CASE WHEN nodename LIKE '%.v2.worker.citusdata.com' THEN 2 ELSE 1 END)
+ (CASE WHEN nodename LIKE '%.v2.worker.citusdata.com' THEN 2.0::float4 ELSE 1.0::float4 END)
FROM pg_dist_node where nodeid = nodeidarg
$$ LANGUAGE sql;
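A capacity function like the one above only takes effect once it is registered in a rebalance strategy. The sketch below assumes Citus 9.5+'s ``citus_add_rebalance_strategy`` UDF and its built-in defaults (``citus_shard_cost_1``, ``citus_shard_allowed_on_node_true``); the strategy name ``v2_weighted`` is hypothetical:

.. code-block:: sql

    -- Register the capacity function in a custom rebalance strategy;
    -- the cost and allowed-on-node functions here are the built-in defaults
    SELECT citus_add_rebalance_strategy(
        'v2_weighted',
        'citus_shard_cost_1',
        'v2_node_double_capacity',
        'citus_shard_allowed_on_node_true',
        0.1,   -- default_threshold
        0.01   -- minimum_threshold
    );

The rebalancer can then be invoked with ``rebalance_table_shards(rebalance_strategy := 'v2_weighted')`` to place twice as many shards on the ``.v2.worker`` nodes.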
4 changes: 2 additions & 2 deletions develop/api_udf.rst
@@ -243,7 +243,7 @@ will also be rebalanced. If table B does not have a replica identity, the rebala
fail. Therefore, this function can be useful for breaking the implicit colocation in that case.

Both of the arguments should be hash distributed tables; currently we do not support colocation
- of APPEND or RANGE distributed tables.
+ of APPEND distributed tables.

Note that this function does not move any data around physically.

@@ -844,7 +844,7 @@ The example below fetches and displays the table metadata for the github_events
get_shard_id_for_distribution_column
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

- Citus assigns every row of a distributed table to a shard based on the value of the row's distribution column and the table's method of distribution. In most cases the precise mapping is a low-level detail that the database administrator can ignore. However it can be useful to determine a row's shard, either for manual database maintenance tasks or just to satisfy curiosity. The :code:`get_shard_id_for_distribution_column` function provides this info for hash- and range-distributed tables as well as reference tables. It does not work for the append distribution.
+ Citus assigns every row of a distributed table to a shard based on the value of the row's distribution column and the table's method of distribution. In most cases the precise mapping is a low-level detail that the database administrator can ignore. However it can be useful to determine a row's shard, either for manual database maintenance tasks or just to satisfy curiosity. The :code:`get_shard_id_for_distribution_column` function provides this info for hash-distributed tables as well as reference tables. It does not work for the append distribution.
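For instance, the lookup described above can be run directly; a minimal sketch, assuming a hash-distributed table named ``github_events`` whose distribution column value is ``4`` (the table name and value are illustrative):

.. code-block:: sql

    -- Which shard holds rows whose distribution column equals 4?
    SELECT get_shard_id_for_distribution_column('github_events', 4);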

Arguments
************************
18 changes: 4 additions & 14 deletions develop/append.rst
@@ -147,12 +147,12 @@ command is used to copy data from a file to a distributed table while handling
replication and failures automatically. You can also use the server side `COPY command <http://www.postgresql.org/docs/current/static/sql-copy.html>`_.
In the examples, we use the \\copy command from psql, which sends a COPY .. FROM STDIN to the server and reads files on the client side, whereas COPY from a file would read the file on the server.

- You can use \\copy both on the coordinator and from any of the workers. When using it from the worker, you need to add the master_host option. Behind the scenes, \\copy first opens a connection to the coordinator using the provided master_host option and uses master_create_empty_shard to create a new shard. Then, the command connects to the workers and copies data into the replicas until the size reaches shard_max_size, at which point another new shard is created. Finally, the command fetches statistics for the shards and updates the metadata.
+ You can use \\copy only on the coordinator. Behind the scenes, \\copy uses master_create_empty_shard to create a new shard. Then, the command connects to the workers and copies data into the replicas until the size reaches shard_max_size, at which point another new shard is created. Finally, the command fetches statistics for the shards and updates the metadata.

.. code-block:: psql
      SET citus.shard_max_size TO '64MB';
-     \copy github_events from 'github_events-2015-01-01-0.csv' WITH (format CSV, master_host 'coordinator-host')
+     \copy github_events from 'github_events-2015-01-01-0.csv' WITH (format CSV)
Citus assigns a unique shard id to each new shard and all its replicas have the same shard id. Each shard is represented on the worker node as a regular PostgreSQL table with name 'tablename_shardid' where tablename is the name of the distributed table and shardid is the unique id assigned to that shard. One can connect to the worker postgres instances to view or run commands on individual shards.
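The shard ids behind those per-worker table names can be looked up in the ``pg_dist_shard`` metadata table on the coordinator; a sketch, assuming a distributed table named ``github_events``:

.. code-block:: sql

    -- List the shards of github_events; on a worker each appears
    -- as a regular table named github_events_<shardid>
    SELECT shardid, shardminvalue, shardmaxvalue
    FROM pg_dist_shard
    WHERE logicalrelid = 'github_events'::regclass;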

@@ -161,10 +161,7 @@ By default, the \\copy command depends on two configuration parameters for its b
(1) **citus.shard_max_size :-** This parameter determines the maximum size of a shard created using \\copy, and defaults to 1 GB. If the file is larger than this parameter, \\copy will break it up into multiple shards.
(2) **citus.shard_replication_factor :-** This parameter determines the number of nodes each shard gets replicated to, and defaults to one. Set it to two if you want Citus to replicate data automatically and provide fault tolerance. You may want to increase the factor even higher if you run large clusters and observe node failures on a more frequent basis.

- .. note::
-   The configuration setting citus.shard_replication_factor can only be set on the coordinator node.
-
- Please note that you can load several files in parallel through separate database connections or from different nodes. It is also worth noting that \\copy always creates at least one shard and does not append to existing shards. You can use the method described below to append to previously created shards.
+ Please note that you can load several files in parallel through separate database connections. It is also worth noting that \\copy always creates at least one shard and does not append to existing shards. You can use the method described below to append to previously created shards.

.. note::

@@ -272,14 +269,7 @@ $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

For very high data ingestion rates, data can be staged via the workers. This method scales out horizontally and provides the highest ingestion rates, but can be more complex to use. Hence, we recommend trying this method only if your data ingestion rates cannot be addressed by the previously described methods.

- Append distributed tables support COPY via the worker, by specifying the address of the coordinator in a master_host option, and optionally a master_port option (defaults to 5432). COPY via the workers has the same general properties as COPY via the coordinator, except the initial parsing is not bottlenecked on the coordinator.
-
- .. code-block:: psql
-
-     psql -h worker-host-n -c "\COPY events FROM 'data.csv' WITH (FORMAT CSV, MASTER_HOST 'coordinator-host')"
-
- An alternative to using COPY is to create a staging table and use standard SQL clients to append it to the distributed table, which is similar to staging data via the coordinator. An example of staging a file via a worker using psql is as follows:
+ The technique of staging data via the workers is to create a staging table and use standard SQL clients to append it to the distributed table, which is similar to staging data via the coordinator. An example of staging a file via a worker using psql is as follows:

.. code-block:: bash
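The staging flow can be sketched with the append-distribution UDFs ``master_create_empty_shard`` and ``master_append_table_to_shard``; the host name ``worker-host-n``, file name, and staging table name below are hypothetical:

.. code-block:: psql

    -- On worker-host-n: stage the file into a local regular table
    CREATE TABLE github_events_staging (LIKE github_events);
    \copy github_events_staging from 'github_events-2015-01-01-0.csv' WITH (format CSV)

    -- On the coordinator: create a shard and append the staged table to it
    SELECT master_append_table_to_shard(
        master_create_empty_shard('github_events'),
        'github_events_staging', 'worker-host-n', 5432);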
