DNM: erasure code XOR plugin #1164

Closed

apeters1971

This is a performance-optimized EC plug-in computing simple parity, similar to RAID-4 algorithms.
It is particularly useful in combination with the EC pyramid code to compute local parities.

The implementation uses SSE2 assembler and region XOR'ing of 512-bit blocks. If SSE2 is not available, it falls back to vector operations with 128-bit or 64-bit arithmetic.
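
The parity idea described above can be sketched in a few lines of Python. This is only an illustration of XOR parity and single-chunk recovery, not the SSE2 implementation from this pull request; the names `xor_region` and `xor_parity` are hypothetical, and Python's arbitrary-precision integers stand in for the wide-word arithmetic.

```python
def xor_region(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length regions using one wide integer operation."""
    if len(a) != len(b):
        raise ValueError("regions must have the same length")
    n = int.from_bytes(a, "little") ^ int.from_bytes(b, "little")
    return n.to_bytes(len(a), "little")


def xor_parity(chunks):
    """Return the XOR of a non-empty list of equal-length chunks."""
    parity = bytes(len(chunks[0]))  # all-zero region
    for chunk in chunks:
        parity = xor_region(parity, chunk)
    return parity


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(data)

# Pretend data[1] is lost: XOR of the survivors and the parity rebuilds it.
rebuilt = xor_parity([data[0], data[2], parity])
```

Because XOR is its own inverse, the same routine serves for both encoding and recovery, which is why a single parity chunk tolerates the loss of exactly one chunk.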

Loic Dachary and others added 15 commits January 28, 2014 19:24
There is no need to specialize more than ostream : it only makes it
impossible to use cerr or cout as a parameter to str_map.

Reviewed-By: Christophe Courtaut <christophe.courtaut@gmail.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
Reviewed-By: Christophe Courtaut <christophe.courtaut@gmail.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
So that a plugin can provide a more efficient implementation.

Reviewed-By: Christophe Courtaut <christophe.courtaut@gmail.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
* With year 2014
* Use "Ceph distributed storage system" instead of "Ceph - scalable
  distributed file system"

Reviewed-By: Christophe Courtaut <christophe.courtaut@gmail.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
The encode and decode interfaces are expected to allocate properly
aligned chunks when needed and convert bufferlists into chunks. The
encode_chunks and decode_chunks methods are lower-level interfaces that
assume all chunks are allocated and that all chunks are to be decoded or
encoded.

They are meant to be used in contexts where these constraints are
enforced by default, such as when a plugin is used within the pyramid
erasure code implementation.

Reviewed-By: Christophe Courtaut <christophe.courtaut@gmail.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
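
The preparation work that the higher-level encode interface is described as doing (allocating aligned chunks and converting a buffer into chunks) could look roughly like the sketch below. It assumes that alignment can be modeled as rounding the chunk size up to a multiple of `alignment`, since Python bytes give no control over memory alignment; `split_into_chunks` is a hypothetical name, not a Ceph function.

```python
def split_into_chunks(data: bytes, k: int, alignment: int = 16):
    """Zero-pad data and split it into k equal chunks of aligned size."""
    if k <= 0 or alignment <= 0:
        raise ValueError("k and alignment must be positive")
    chunk_size = -(-len(data) // k)                       # ceil(len / k)
    chunk_size = -(-chunk_size // alignment) * alignment  # round up
    padded = data.ljust(k * chunk_size, b"\0")
    return [padded[i * chunk_size:(i + 1) * chunk_size] for i in range(k)]


chunks = split_into_chunks(b"hello erasure coding", k=3, alignment=8)
```

A lower-level routine in the spirit of encode_chunks would then receive chunks like these already sized and allocated, and only fill in the coding chunks.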
The ErasureCode class is derived from ErasureCodeInterface and
implements stubs for some of the methods.

The encode() method stub relies on encode_chunks(). The decode() method
stub relies on decode_chunks(). Both are otherwise copied from
ErasureCodeJerasure.

The minimum_to_decode() and minimum_to_decode_with_cost() implementations
are copied verbatim from ErasureCodeJerasure.

The corresponding ErasureCodeJerasure methods are removed.

The existing decode_concat() helper is moved from ErasureCodeInterface to
ErasureCode so ErasureCodeInterface only contains pure virtual methods, as
is expected from an interface definition.

Reviewed-By: Christophe Courtaut <christophe.courtaut@gmail.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
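
The layering in this commit (a pure interface, a base class with an encode() stub, and a plugin that only supplies the low-level method) can be illustrated with Python abstract base classes. This is a sketch of the structure only: the real classes are C++, the method signatures here are simplified assumptions, and `XorCode` is a hypothetical plugin.

```python
from abc import ABC, abstractmethod


class ErasureCodeInterface(ABC):
    """Interface: abstract methods only."""

    @abstractmethod
    def encode(self, data): ...

    @abstractmethod
    def encode_chunks(self, chunks): ...


class ErasureCode(ErasureCodeInterface):
    """Base class: encode() stub that delegates to encode_chunks()."""

    k = 0  # number of data chunks
    m = 0  # number of coding chunks

    def encode(self, data):
        size = -(-len(data) // self.k)  # ceil(len / k)
        padded = data.ljust(size * self.k, b"\0")
        chunks = [bytearray(padded[i * size:(i + 1) * size])
                  for i in range(self.k)]
        chunks += [bytearray(size) for _ in range(self.m)]
        self.encode_chunks(chunks)  # the plugin fills the coding chunks
        return {i: bytes(c) for i, c in enumerate(chunks)}


class XorCode(ErasureCode):
    """Hypothetical plugin: implements only the low-level method."""

    k, m = 3, 1

    def encode_chunks(self, chunks):
        parity = chunks[self.k]
        for chunk in chunks[:self.k]:
            for i, byte in enumerate(chunk):
                parity[i] ^= byte


encoded = XorCode().encode(b"abcdef")
```

The plugin never deals with padding or allocation; it inherits that from the base class and writes into chunks that already exist.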
Implementation of the corresponding ErasureCodeInterface methods which
convert bufferlists into char * and map almost exactly to the jerasure
method prototype ( modulo the function name which is wrapped in a
virtual method ).

Reviewed-By: Christophe Courtaut <christophe.courtaut@gmail.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
The Mutex scope is restricted to only protect the load() method and not
the factory() method. This allows a plugin to load another plugin from
within the factory() method. This is convenient for the pyramid plugin
where each layer can specify a different plugin.

Reviewed-By: Christophe Courtaut <christophe.courtaut@gmail.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
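
The locking change can be pictured with a small registry sketch. It assumes a non-reentrant lock (Python's `threading.Lock`) in place of the Mutex, and every name in it (`PluginRegistry`, `add_loader`, `make_xor`, `make_pyramid`) is hypothetical. If the lock were held for the whole of factory(), the nested factory() call made by the pyramid plugin would deadlock.

```python
import threading


class PluginRegistry:
    def __init__(self):
        self._lock = threading.Lock()  # non-reentrant, like a plain mutex
        self._loaders = {}
        self._plugins = {}

    def add_loader(self, name, loader):
        self._loaders[name] = loader

    def load(self, name):
        # Only the load step is protected by the lock.
        with self._lock:
            if name not in self._plugins:
                self._plugins[name] = self._loaders[name]()
            return self._plugins[name]

    def factory(self, name, params):
        plugin = self.load(name)      # lock is released when load() returns
        return plugin(self, params)   # may call self.factory() again


def make_xor(registry, params):
    return ("xor", params["k"], params["m"])


def make_pyramid(registry, layers):
    # Each layer names its own plugin, loaded from within factory().
    return [registry.factory(layer["plugin"], layer) for layer in layers]


registry = PluginRegistry()
registry.add_loader("xor", lambda: make_xor)
registry.add_loader("pyramid", lambda: make_pyramid)
layers = registry.factory("pyramid", [{"plugin": "xor", "k": "3", "m": "1"}])
```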
The erasure code example plugin is re-implemented using the
encode_chunks() and decode_chunks() methods to show how they
work. The bet is that a plugin implementor is likely to find
this API more straightforward to adapt than the encode() and decode()
helpers, which are more convenient from the point of view of the caller
but not from the point of view of the plugin implementor.

Reviewed-By: Christophe Courtaut <christophe.courtaut@gmail.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
The decode() stub method from ErasureCode implements the intended side
effect of only returning the chunks required by want_to_decode. Modify
the tests to reflect this change.

Reviewed-By: Christophe Courtaut <christophe.courtaut@gmail.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
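
The behaviour described here, returning only the chunks listed in want_to_decode, might be sketched as follows for a single-parity XOR code. The function signature and the dictionary representation of chunks are assumptions made for the illustration, not the ErasureCode API.

```python
def decode(want_to_decode, available, k=3):
    """Decode k data chunks + 1 XOR parity; return only the wanted chunks."""
    missing = [i for i in range(k + 1) if i not in available]
    if len(missing) > 1:
        raise ValueError("XOR parity can rebuild at most one chunk")
    chunks = dict(available)
    if missing and missing[0] in want_to_decode:
        size = len(next(iter(available.values())))
        rebuilt = bytearray(size)
        for chunk in available.values():
            for i, byte in enumerate(chunk):
                rebuilt[i] ^= byte
        chunks[missing[0]] = bytes(rebuilt)
    # Side effect described in the commit: only the wanted chunks come back.
    return {i: chunks[i] for i in want_to_decode}


parity = bytes(x ^ y ^ z for x, y, z in zip(b"ab", b"cd", b"ef"))
available = {0: b"ab", 2: b"ef", 3: parity}  # chunk 1 is lost
recovered = decode({1}, available)
untouched = decode({0}, available)
```

A test written against such a stub has to compare against the wanted chunks only, rather than against every chunk of the object.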
Add tests demonstrating that decode() and encode() methods avoid copying
the buffers when they can.

Reviewed-By: Christophe Courtaut <christophe.courtaut@gmail.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
Reviewed-By: Christophe Courtaut <christophe.courtaut@gmail.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
An erasure code plugin providing an implementation of
ErasureCodeInterface. The caller can specify how to recursively apply
erasure coding to the chunks to control the placement of the erasure
coded chunks.

For instance with a crush ruleset containing the following steps:

    take root
    set choose datacenter 2
    set choose devices 5

An erasure coded pool is given 10 OSDs ( 0123456789 ): the first five (
01234 ) are in one datacenter, the last five ( 56789 ) are in another
datacenter.

Creating a pyramidal layout by which recovering from the loss of a
single OSD does not require getting data from an OSD located in another
datacenter can be done with:

    [
     { "plugin": "jerasure",
       "technique": "cauchy_good",
       "k": "6",
       "m": "2",
       "mapping": "^-ABCDEF-^",
     },

     { "plugin": "xor",
       "k": "3",
       "m": "1",
       "type": "datacenter",
       "size": 2,
       "mapping": "-^ABCDEF^-",
     },
    ]

The object is first encoded in six data chunks ( "k": "6" ) and two
coding chunks ( "m": "2" ) by the first layer of the pyramid, using the
jerasure plugin ( "plugin": "jerasure" ). The jerasure plugin creates a
total of eight chunks ( k=6 + m=2 == 8 ) and ensures that the first six
contain the original data. If the data chunks were designated by letters
and the coding chunks by ^, it could be something like ABCDEF^^

If used outside of the context of the pyramid plugin, the jerasure
plugin would spread data and coding chunks as follows ( the dash -
designates a chunk that is not being used ):

    01234 56789
    ABCDE F^^--

i.e. with the first five data chunks in a datacenter ( the crush ruleset
above provides OSDs 01234 in one datacenter and 56789 in another ) and
the remaining chunks in another datacenter. The pyramid plugin remaps it (
"mapping": "^-ABCDEF-^" ) and the chunk placement becomes :

    01234 56789
    ^-ABC DEF-^

which is more evenly distributed, with three data chunks and a coding
chunk in one datacenter plus three data chunks and a coding chunk in
another datacenter.
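
The remapping step can be sketched as a small function. The semantics assumed here are read off the example above and are not taken from the plugin source: letters consume data chunks in order, ^ consumes coding chunks in order, and - leaves the position unused. The name `remap` and the labels `P0` and `P1` for the two coding chunks are hypothetical.

```python
def remap(chunks, k, mapping):
    """Place k data chunks and the remaining coding chunks per mapping."""
    data, coding = iter(chunks[:k]), iter(chunks[k:])
    placement = []
    for symbol in mapping:
        if symbol == "-":
            placement.append(None)          # position not used by this layer
        elif symbol == "^":
            placement.append(next(coding))  # next coding chunk
        else:
            placement.append(next(data))    # next data chunk
    return placement


# Output of the first layer: six data chunks followed by two coding chunks.
chunks = ["A", "B", "C", "D", "E", "F", "P0", "P1"]
placement = remap(chunks, 6, "^-ABCDEF-^")
```

Indexing `placement` by OSD number gives one coding chunk and three data chunks in each half, matching the layout shown above.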

The next level of the pyramid is expected to create coding chunks that
allow recovery without crossing datacenter boundaries, using an XOR
coding ( "plugin": "xor" ) supporting the loss of a single chunk ( "m":
"1" ) out of the three ( "k": "3" ) found in a given datacenter. It
starts by splitting the chunks in two ( "size": 2 ), starting with:

    01234
    ^-ABC

The XOR plugin is given ABC as an input and creates ABC^ which are
remapped into

    01234
    -^ABC

as specified in the first half of the mapping ( "mapping": "-^ABCDEF^-"
). The coding chunk of the previous level of the pyramid is left
undisturbed because the dash ( - ) in the mapping requires that it is
not used.
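
The splitting of the second-layer mapping into one part per datacenter can be sketched as below. The interpretation of "size" as the number of equal parts is taken from the description above; the function names are hypothetical, and positions are reported using the OSD numbering 0 to 9.

```python
def split_mapping(mapping, size):
    """Split a mapping string into `size` equal parts."""
    if len(mapping) % size:
        raise ValueError("mapping length must be a multiple of size")
    width = len(mapping) // size
    return [mapping[i * width:(i + 1) * width] for i in range(size)]


def local_groups(mapping, size):
    """For each part, list the (data positions, coding positions)."""
    width = len(mapping) // size
    groups = []
    for g, part in enumerate(split_mapping(mapping, size)):
        offset = g * width
        data = [offset + i for i, s in enumerate(part) if s not in "-^"]
        coding = [offset + i for i, s in enumerate(part) if s == "^"]
        groups.append((data, coding))
    return groups


parts = split_mapping("-^ABCDEF^-", 2)
groups = local_groups("-^ABCDEF^-", 2)
```

Each group holds three data positions and one coding position, so a single lost OSD in a datacenter can be rebuilt from the three other chunks of its own group.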

The same logic can be applied to three levels with:

    take root
    set choose datacenter 2
    set choose rack 2
    set choose devices 5

    ^-ABC-DEF--GHI-JKL-^
    -^ABC-DEF--GHI-JKL^-  datacenter
    --ABC^DEF^^GHI^JKL--  rack

http://tracker.ceph.com/issues/7238 Fixes: ceph#7238

Signed-off-by: Loic Dachary <loic@dachary.org>
Signed-off-by: Loic Dachary <loic@dachary.org>
@apeters1971 apeters1971 deleted the wip-xor branch January 30, 2014 09:50
@apeters1971 apeters1971 reopened this Jan 30, 2014

ghost commented Feb 19, 2014

Reopen when work starts again: a link is added to http://tracker.ceph.com/issues/6478#note-6 to not lose track. It would be convenient to have a "DRAFT" category of pull requests but that will do. Having a long running DNM in the list of open pull requests is not a good practice because it makes noise.

@ghost ghost closed this Feb 19, 2014
liewegas pushed a commit to liewegas/ceph that referenced this pull request Nov 18, 2016
buildpackages: backport make-rpm.sh improvements

Reviewed-by: Loic Dachary <ldachary@redhat.com>
This pull request was closed.