diff --git a/README.md b/README.md
index 7aef51a..ba9efa4 100644
--- a/README.md
+++ b/README.md
@@ -2,29 +2,23 @@
 [![Build Status](https://travis-ci.com/carapace/cellar.svg?branch=master)](https://travis-ci.com/carapace/cellar)
 [![CircleCI](https://circleci.com/gh/carapace/cellar/tree/master.svg?style=svg)](https://circleci.com/gh/carapace/cellar/tree/master)
-[![Go Report Card](https://goreportcard.com/badge/github.com/carapace/cellar)](https://goreportcard.com/report/github.com/carapace/cellar)
 [![Coverage Status](https://coveralls.io/repos/github/carapace/cellar/badge.svg?branch=master)](https://coveralls.io/github/carapace/cellar?branch=master)
+
 [![License](https://img.shields.io/badge/License-BSD%203--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause)
+[![GoDoc](https://godoc.org/github.com/carapace/cellar?status.svg)](http://godoc.org/github.com/carapace/cellar)
+[![Go Report Card](https://goreportcard.com/badge/github.com/carapace/cellar)](https://goreportcard.com/report/github.com/carapace/cellar)
 
-Cellar is the append-only storage backend in Go designed for the analytical
-workloads. It replaces [geyser-net](https://github.com/abdullin/geyser-net).
+Cellar is an append-only storage backend in Go, based on Abdullin's Cellar.
+This fork is currently being redesigned, so the API should be considered unstable.
 
 Core features:
 
 - events are automatically split into chunks;
-- chunks are encrypted (LZ4) and compressed;
+- chunks may be encrypted using the Cipher interface;
 - designed for batching operations (high throughput);
 - supports a single writer and multiple concurrent readers;
 - stores secondary indexes and lookups in the metadata DB.
 
-This storage takes ideas from [Message Vault](https://github.com/abdullin/messageVault),
-which was based on the ideas of Kafka and the append-only storage in [Lokad.CQRS](https://github.com/abdullin/lokad-cqrs).
-
-An analytical pipeline on top of this library was deployed at
-HappyPancake to run real-time aggregation and long-term data analysis
-on the largest social website in Sweden. You can read more about it in
-[Real-time Analytics with Go and LMDB](https://abdullin.com/bitgn/real-time-analytics/).
-
 # Contributors
 
 In alphabetical order:
@@ -38,100 +32,10 @@ Don't hesitate to send a PR to include your profile.
 
 Cellar stores data in a very simple manner:
 
-- LMDB database is used for keeping metadata (including user-defined);
+- the MetaDB database is used for keeping metadata (including user-defined entries); see metadb.go;
 - a single pre-allocated file is used to buffer all writes;
 - when the buffer fills, it is compressed, encrypted and added to the chunk list.
 
-# Writing
-
-You can have **only one writer at a time**. This writer has two operations:
-
-- `Append` - adds new bytes to the buffer, but doesn't flush it;
-- `Checkpoint` - flushes the buffer and saves the checkpoint.
-
-The store is optimized for throughput. You can efficiently execute
-thousands of appends followed by a single call to `Checkpoint`.
-
-Whenever a buffer is about to overflow (exceed the predefined max
-size), it is "sealed" into an immutable chunk (compressed,
-encrypted and added to the chunk table) and replaced by a new buffer.
-
-See the tests in `writer_test.go` for sample usage patterns (for both
-writing and reading); a brief sketch also follows the Reading section
-below.
-
-# Reading
-
-At any point in time, **multiple readers can be created** via
-`NewReader(folder, encryptionKey)`. You can optionally configure a
-reader after creation by setting `StartPos` or `EndPos` to constrain
-reading to a part of the database.
-
-Readers have the following operations available:
-
-- `Scan` - reads the database, executing the passed function against
-  each record;
-- `ReadDb` - executes an LMDB transaction against the metadata database
-  (used to read lookup tables or indexes stored by
-  custom writing logic);
-- `ScanAsync` - launches reading in a goroutine and returns a buffered
-  channel that will be filled up with records.
-
-The unit tests in `writer_test.go` exercise readers as well.
-
-Note that the reader tries to help you achieve maximum
-throughput. While reading events from a chunk, it will decrypt and
-unpack the entire file in one go, allocating a memory buffer. All
-individual event reads are then performed against this buffer.
-
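+# Usage Sketch
+
+A minimal sketch of the single-writer/multi-reader flow described
+above. Only `Append`, `Checkpoint`, `Scan` and
+`NewReader(folder, encryptionKey)` come from the prose; the writer
+constructor, return values and callback shape are assumptions, and
+this fork's API is unstable in any case:
+
+```go
+package main
+
+import (
+	"fmt"
+	"log"
+
+	"github.com/carapace/cellar"
+)
+
+func main() {
+	folder, key := "/tmp/cellar-example", []byte("secret")
+
+	// Only one writer at a time. Append buffers bytes; Checkpoint
+	// flushes them. Batch many appends per checkpoint for throughput.
+	w, err := cellar.NewWriter(folder) // hypothetical constructor
+	if err != nil {
+		log.Fatal(err)
+	}
+	for i := 0; i < 10000; i++ {
+		if err := w.Append([]byte("event payload")); err != nil {
+			log.Fatal(err)
+		}
+	}
+	if err := w.Checkpoint(); err != nil {
+		log.Fatal(err)
+	}
+
+	// Any number of concurrent readers may run alongside the writer.
+	r := cellar.NewReader(folder, key)
+	err = r.Scan(func(pos int64, data []byte) error { // callback shape assumed
+		fmt.Printf("record at %d: %d bytes\n", pos, len(data))
+		return nil
+	})
+	if err != nil {
+		log.Fatal(err)
+	}
+}
+```
+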
-# Example: Incremental Reporting
-
-This library was used as a building block for capturing millions or
-billions of events and then running reports on them. Consider the
-following example of building an incremental reporting pipeline.
-
-There is an external append-only store with billions of events and a
-few terabytes of data (events are compressed individually with an
-equivalent of Snappy). It is located on remote storage (cloud or a
-NAS). Custom reports must be run on this data, refreshing them
-every hour.
-
-Cellar can serve as a local cache on a dedicated reporting machine
-(e.g. you can find an instance with 32GB of RAM, an Intel Xeon and a
-500GB NVMe SSD under 100 EUR per month). Since Cellar compresses
-events in chunks, a high compression ratio can be achieved. For
-instance, protobuf messages tend to reach compression ratios of 2-10
-in chunks.
-
-A solution might include an equivalent of a cron job that executes the
-following apps in sequence:
-
-- import job - a Go console app that reads the last retrieved offset
-  from the cellar, requests any new data from the remote storage and
-  stores it locally in raw form (see the sketch at the end of this
-  README);
-- compaction job - a Go console app that incrementally pumps data from
-  the "raw" cellar store to another (using checkpoints to determine
-  its position), while compacting and filtering events to keep only
-  the ones needed for reporting;
-- report jobs - apps that perform a full scan of the compacted data,
-  building reports in memory and then dumping them into TSV (or
-  whatever format is used by your data processing framework).
-
-All these steps usually execute quickly even on large datasets, since
-the import and compaction jobs are incremental and operate only on
-fresh data. The report jobs may require a full DB scan, but they work
-with the optimized and compacted data, so they are fast as well. To
-get the most performance, you might need to structure your messages
-for very fast reads without unnecessary memory allocations or CPU
-work (e.g. using something like FlatBuffers instead of JSON or
-Protobuf).
-
-Note that the compaction job is optional. However, on fairly large
-datasets it might make sense to optimize messages for very fast
-reads, while discarding all the unnecessary information. Should the
-job requirements change, you'll need to update the compaction logic,
-discard the compacted store and re-process all the raw data from the
-start.
-
 # License
 
 3-clause BSD license.
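+
+# Appendix: Import Job Sketch
+
+To make the import job from the incremental-reporting example above a
+little more concrete, here is a rough sketch. The remote client
+(`fetchSince`) is a hypothetical stand-in, and the cellar calls reuse
+the assumed signatures from the usage sketch:
+
+```go
+package main
+
+import (
+	"log"
+
+	"github.com/carapace/cellar"
+)
+
+// fetchSince stands in for the remote store client: it returns events
+// recorded after the given offset, plus the new offset.
+func fetchSince(offset int64) ([][]byte, int64, error) {
+	// ... call the cloud/NAS storage here ...
+	return nil, offset, nil
+}
+
+func main() {
+	w, err := cellar.NewWriter("/data/raw-cellar") // hypothetical constructor
+	if err != nil {
+		log.Fatal(err)
+	}
+
+	// In real code the last retrieved offset would be persisted in the
+	// metadata DB; a plain variable stands in for that lookup here.
+	var offset int64
+
+	events, next, err := fetchSince(offset)
+	if err != nil {
+		log.Fatal(err)
+	}
+	for _, e := range events {
+		if err := w.Append(e); err != nil {
+			log.Fatal(err)
+		}
+	}
+	// One checkpoint per batch keeps the import incremental and cheap.
+	if err := w.Checkpoint(); err != nil {
+		log.Fatal(err)
+	}
+	offset = next // persist alongside the checkpoint in real code
+	_ = offset
+}
+```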