Update README
This commit updates the examples in light of the parallel-data-loading-for-pg-shard blog post.
onderkalaci committed Oct 26, 2015
1 parent 6ad2903 commit a1ec121
doc/README.md (10 changes: 7 additions & 3 deletions)
@@ -143,13 +143,17 @@ Call the script with the `-h` for more usage information.

### Increasing INSERT throughput

-To maximize INSERT throughput, you should run statements in parallel. This helps utilizing multiple CPU cores. For instance, if you are loading data from two files you could run them in parallel such as the following:
+To maximize INSERT throughput, you should run statements in parallel. This helps utilize multiple CPU cores. For instance, if you want to load the contents of `input.csv`, first split the file and then run `copy_to_distributed_table` in parallel as shown below:

```
-copy_to_distributed_table -CH -d '|' -n NULL input_1.csv users &
-copy_to_distributed_table -CH -d '|' -n NULL input_2.csv users &
+mkdir chunks
+split -n l/64 input.csv chunks/
+find chunks/ -type f | xargs -n 1 -P 64 sh -c 'echo $0 `copy_to_distributed_table -C $0 users`'
```

+Note that the above commands load the contents of `input.csv` with 64 concurrent connections. You can tune that number to match your hardware.
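
For instance, one way to tie the chunk count to the machine's CPU cores is sketched below (assuming GNU coreutils, where `nproc` reports the number of available cores):

```
# use one chunk and one loader process per CPU core
CORES=$(nproc)
mkdir chunks
split -n l/$CORES input.csv chunks/
find chunks/ -type f | xargs -n 1 -P $CORES sh -c 'echo $0 `copy_to_distributed_table -C $0 users`'
```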


Similarly, if you run statements on the PostgreSQL server via psql, you should open multiple connections and run the INSERT statements concurrently.
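
As an illustration, a minimal sketch of that approach from the shell, assuming the `users` table from the examples above and hypothetical `id` and `name` columns:

```
# each psql client opens its own connection; start them in the background and wait for all to finish
psql -c "INSERT INTO users (id, name) VALUES (1, 'adam');" &
psql -c "INSERT INTO users (id, name) VALUES (2, 'berk');" &
wait
```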

### Repairing Shards
