
Erasure Code developer notes

Introduction

Each chapter of this document explains an aspect of the implementation of erasure code within Ceph. It mostly relies on worked examples to demonstrate how things work. It is written as if the implementation were complete, although that may not be the case. For instance, the plugin system and the jerasure plugin are implemented, but the erasure coded pool is not.

Reading and writing encoded chunks from and to OSDs

An erasure coded pool stores each object as K+M chunks: K data chunks and M coding chunks. The pool is configured to have a size of K+M so that each chunk is stored on an OSD in the acting set. The rank of the chunk is stored as an attribute of the object.

For instance, an erasure coded pool is created to use five OSDs (K+M = 5) and sustain the loss of two of them (M = 2).

When the object NYAN containing ABCDEFGHI is written to it, the erasure encoding function splits the content into three data chunks, simply by dividing the content in three: the first contains ABC, the second DEF and the last GHI. The content will be padded if its length is not a multiple of K. The function also creates two coding chunks: the fourth with YXY and the fifth with QGC. Each chunk is stored on an OSD in the acting set. The chunks are stored in objects that have the same name (NYAN) but reside on different OSDs. The order in which the chunks were created must be preserved and is stored, in addition to the object's name, as an attribute of the object (shard_t). Chunk 1 contains ABC and is stored on OSD5 while chunk 4 contains YXY and is stored on OSD3.

                          +-------------------+
                     name |        NYAN       |
                          +-------------------+
                  content |      ABCDEFGHI    |
                          +--------+----------+
                                   |
                                   |
                                   v
                            +------+------+
            +---------------+ encode(3,2) +-----------+
            |               +--+--+---+---+           |
            |                  |  |   |               |
            |          +-------+  |   +-----+         |
            |          |          |         |         |
         +--v---+   +--v---+   +--v---+  +--v---+  +--v---+
   name  | NYAN |   | NYAN |   | NYAN |  | NYAN |  | NYAN |
         +------+   +------+   +------+  +------+  +------+
  shard  |  1   |   |  2   |   |  3   |  |  4   |  |  5   |
         +------+   +------+   +------+  +------+  +------+
content  | ABC  |   | DEF  |   | GHI  |  | YXY  |  | QGC  |
         +--+---+   +--+---+   +--+---+  +--+---+  +--+---+
            |          |          |         |         |
            |          |          |         |         |
            |          |       +--+---+     |         |
            |          |       | OSD1 |     |         |
            |          |       +------+     |         |
            |          |       +------+     |         |
            |          +------>| OSD2 |     |         |
            |                  +------+     |         |
            |                  +------+     |         |
            |                  | OSD3 |<----+         |
            |                  +------+               |
            |                  +------+               |
            |                  | OSD4 |<--------------+
            |                  +------+
            |                  +------+
            +----------------->| OSD5 |
                               +------+
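The split-and-pad step can be sketched in a few lines. The following is illustrative only (the helper name and the NUL padding byte are assumptions of this sketch, not Ceph code): it pads the payload to a multiple of K and cuts it into K equal data chunks, from which the erasure code library would then compute the coding chunks.

#include <iostream>
#include <string>
#include <vector>

// Hypothetical helper: pad the payload to a multiple of k, then cut it
// into k equal data chunks. The coding chunks would be computed from
// these by the erasure code library.
std::vector<std::string> split_into_data_chunks(std::string payload, unsigned k) {
  while (payload.size() % k != 0)
    payload += '\0'; // pad when the length is not a multiple of K
  const size_t chunk_size = payload.size() / k;
  std::vector<std::string> chunks;
  for (unsigned i = 0; i < k; i++)
    chunks.push_back(payload.substr(i * chunk_size, chunk_size));
  return chunks;
}

int main() {
  for (const auto &chunk : split_into_data_chunks("ABCDEFGHI", 3))
    std::cout << chunk << std::endl; // prints ABC, DEF, GHI
}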

When the object NYAN is read from the erasure coded pool, the decoding function reads three chunks: chunk 1 containing ABC, chunk 3 containing GHI and chunk 4 containing YXY, and rebuilds the original content of the object, ABCDEFGHI. The decoding function is informed that chunks 2 and 5 are missing (they are called erasures). Chunk 5 could not be read because OSD4 is out. The decoding function can be called as soon as three chunks are read: OSD2 was the slowest and its chunk was not taken into account.

                          +-------------------+
                     name |        NYAN       |
                          +-------------------+
                  content |      ABCDEFGHI    |
                          +--------+----------+
                                   ^
                                   |
                                   |
                            +------+------+
                            | decode(3,2) |
                            | erasures 2,5|
            +-------------->|             |
            |               +-------------+
            |                     ^   ^
            |                     |   +-----+
            |                     |         |
         +--+---+   +------+   +--+---+  +--+---+
   name  | NYAN |   | NYAN |   | NYAN |  | NYAN |
         +------+   +------+   +------+  +------+
  shard  |  1   |   |  2   |   |  3   |  |  4   |
         +------+   +------+   +------+  +------+
content  | ABC  |   | DEF  |   | GHI  |  | YXY  |
         +--+---+   +--+---+   +--+---+  +--+---+
            ^          .          ^         ^
            |          .          |         |
            |    TOO   .       +--+---+     |
            |    SLOW  .       | OSD1 |     |
            |          ^       +------+     |
            |          |       +------+     |
            |          +-------| OSD2 |     |
            |                  +------+     |
            |                  +------+     |
            |                  | OSD3 +-----+
            |                  +------+
            |                  +------+
            |                  | OSD4 |  OUT
            |                  +------+
            |                  +------+
            +------------------+ OSD5 |
                               +------+
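The selection logic of the read path can be sketched as follows. This is an illustrative model, not Ceph code: the object is decodable as soon as any K of the K+M chunks have been read, and the ranks that were not read are the erasures handed to the decode function.

#include <iostream>
#include <map>
#include <string>

int main() {
  const unsigned k = 3, m = 2;
  // Chunks actually read: shard 5 is unavailable because OSD4 is out,
  // and shard 2 is skipped because OSD2 was the slowest to answer.
  std::map<unsigned, std::string> chunks = {{1, "ABC"}, {3, "GHI"}, {4, "YXY"}};
  std::cout << "erasures:";
  for (unsigned shard = 1; shard <= k + m; shard++)
    if (chunks.count(shard) == 0)
      std::cout << " " << shard; // prints: erasures: 2 5
  std::cout << (chunks.size() >= k ? " (decodable)" : " (lost)") << std::endl;
}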

Interrupted full writes

In an erasure coded pool the primary OSD in the up set receives all write operations. It is responsible for encoding the payload into K+M chunks and sending them to the other OSDs. It is also responsible for maintaining an authoritative version of the placement group logs.

    primary
+---OSD 1---+
|       log |
|           |
|       1,1 |
+-----------+
+---OSD 2---+
|       log |
|           |
|       1,1 |
+-----------+
+---OSD 3---+
|       log |
|           |
|       1,1 |
+-----------+

An erasure coded placement group has been created with K = 2 + M = 1 and is supported by three OSDs, two for K and one for M. The acting set of the placement group is made of OSD 1, OSD 2 and OSD 3. An object has been encoded and stored in the OSDs: the chunk D1v1 (i.e. Data chunk number 1 version 1) is on OSD 1, D2v1 on OSD 2 and C1v1 (i.e. Coding chunk number 1 version 1) on OSD 3. The placement group logs on each OSD are identical (i.e. 1,1).
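With M = 1 the single coding chunk behaves like plain XOR parity, which makes it easy to see why any two of the three chunks are enough to rebuild the third. A minimal sketch (illustrative only; the real arithmetic lives in the erasure code plugin):

#include <cassert>

int main() {
  unsigned char d1 = 'A', d2 = 'B'; // data chunks D1 and D2 (one byte each)
  unsigned char c1 = d1 ^ d2;       // coding chunk C1, computed on write
  assert((c1 ^ d2) == d1);          // D1 can be rebuilt from C1 and D2
  assert((c1 ^ d1) == d2);          // D2 can be rebuilt from C1 and D1
  return 0;
}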

    primary
+---OSD 1---+
|+----+ log |
||D1v2| 1,2 |
|+----+     |
||D1v1| 1,1 |
|+----+     |
+++---------+
 ||
 ||          +---OSD 2---+
 ||  +----+  |+----+ log |
 |+->|D2v2|  ||D2v1| 1,1 |
 |   +----+  |+----+     |
 |           +-----------+
 |           +---OSD 3---+
 |           |+----+ log |
 +---------->||C1v2| 1,2 |
             |+----+     |
             ||C1v1| 1,1 |
             |+----+     |
             +-----------+

OSD 1 is the primary and receives a WRITE FULL from a client, which means the payload is to replace the object entirely instead of overwriting a portion of it. Version two of the object is created to override version one. OSD 1 encodes the payload into three chunks: D1v2 (i.e. Data chunk number 1 version 2) will be on OSD 1, D2v2 on OSD 2 and C1v2 (i.e. Coding chunk number 1 version 2) on OSD 3. Each chunk is sent to the target OSD, including the primary OSD, which is responsible for storing chunks in addition to handling write operations and maintaining an authoritative version of the placement group logs. When an OSD receives the message instructing it to write the chunk, it also creates a new entry in the placement group logs to reflect the change. For instance, as soon as OSD 3 stores C1v2, it adds the entry 1,2 (i.e. epoch 1, version 2) to its logs. Because the OSDs work asynchronously, some chunks may still be in flight (such as D2v2) while others are acknowledged and on disk (such as C1v2 and D1v2).

    primary
+---OSD 1---+
|+----+ log |
||D1v2| 1,2 |
|+----+     |
||D1v1| 1,1 |
|+----+     |
+++---------+
 ||
 ||          +---OSD 2---+
 ||          |+----+ log |
 |+--------->||D2v2| 1,2 |
 |           |+----+     |
 |           ||D2v1| 1,1 |
 |           |+----+     |
 |           +-----------+
 |           +---OSD 3---+
 |           |+----+ log |
 +---------->||C1v2| 1,2 |
             |+----+     |
             ||C1v1| 1,1 |
             |+----+     |
             +-----------+

If all goes well, the chunks are acknowledged on each OSD in the acting set, the logs' last_complete pointer can move from 1,1 to 1,2, and the files used to store the chunks of the previous version of the object can be removed: D1v1 on OSD 1, D2v1 on OSD 2 and C1v1 on OSD 3.
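A toy model of this rule (the names and data structures are assumptions of the sketch, not Ceph's): last_complete may only advance to 1,2 once every OSD in the acting set has acknowledged that version.

#include <iostream>
#include <map>
#include <string>
#include <utility>

using Version = std::pair<unsigned, unsigned>; // (epoch, version), e.g. 1,2

int main() {
  // Last version acknowledged by each OSD in the acting set.
  std::map<std::string, Version> acked = {
      {"OSD 1", {1, 2}}, {"OSD 2", {1, 2}}, {"OSD 3", {1, 2}}};
  Version last_complete = {1, 1};
  const Version candidate = {1, 2};
  bool all_acked = true;
  for (const auto &entry : acked)
    all_acked = all_acked && entry.second >= candidate;
  if (all_acked)
    last_complete = candidate; // safe: 1,2 is everywhere, 1,1 can be pruned
  std::cout << last_complete.first << "," << last_complete.second << std::endl;
}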

+---OSD 1---+
|           |
|   DOWN    |
|           |
+-----------+
+---OSD 2---+
|+----+ log |
||D2v1| 1,1 |
|+----+     |
+-----------+
+---OSD 3---+
|+----+ log |
||C1v2| 1,2 |
|+----+     |
|+----+     |
||C1v1| 1,1 |
|+----+     |
+-----------+
    primary
+---OSD 4---+
|       log |
|       1,1 |
|           |
+-----------+

But accidents happen. If OSD 1 goes down while D2v2 is still in flight, the object's version 2 is partially written: OSD 3 has one chunk but that is not enough to recover, because two chunks were lost (D1v2 and D2v2) and the erasure coding parameters K = 2 + M = 1 require that at least two chunks be available to rebuild the third. OSD 4 becomes the new primary and finds that the last_complete log entry (i.e. all objects before this entry were known to be available on all OSDs in the previous acting set) is 1,1, and that will be the head of the new authoritative log.

+---OSD 2---+
|+----+ log |
||D2v1| 1,1 |
|+----+     |
+-----------+
+---OSD 3---+
|+----+ log |
||C1v1| 1,1 |
|+----+     |
+-----------+
    primary
+---OSD 4---+
|       log |
|       1,1 |
|           |
+-----------+

The log entry 1,2 found on OSD 3 is divergent from the new authoritative log provided by OSD 4: it is discarded and the file containing the C1v2 chunk is removed.

+---OSD 2---+
|+----+ log |
||D2v1| 1,1 |
|+----+     |
+-----------+
+---OSD 3---+
|+----+ log |
||C1v1| 1,1 |
|+----+     |
+-----------+
    primary
+---OSD 4---+
|+----+ log |
||D1v1| 1,1 |
|+----+     |
+-----------+

The D1v1 chunk is rebuilt with the decode function of the erasure coding library during scrubbing and stored on the new primary OSD 4.

Interrupted append

An object is coded in stripes, either because it is too big or because it is created with multiple write operations instead of a single full write. When appending to an existing object, the stripe size is retrieved from the attributes of the object. This applies, for instance, when rgw writes an object with a sequence of appends instead of a single full write.
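The stripe arithmetic can be sketched as follows (the chunk size value is made up for the example; in Ceph it would come from the object attributes):

#include <iostream>

int main() {
  const unsigned k = 2;                     // data chunks per stripe
  const unsigned chunk_size = 4096;         // hypothetical, read from attributes
  const unsigned stripe_size = k * chunk_size;
  const unsigned long append_offset = 8192; // end of the existing object
  // Stripe numbers start at 1: an append at the end of stripe 1 writes stripe 2.
  std::cout << "append writes stripe "
            << append_offset / stripe_size + 1 << std::endl; // stripe 2
}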

    primary
+---OSD 1---+
|+-s1-+ log |
||S1D1| 1,2 |
||----|     |
||S2D1| 1,1 |
|+----+     |
+++---------+
 ||
 ||  +-s2-+  +---OSD 2---+
 |+->|S2D2|  |+-s2-+ log |
 |   +----+  ||S1D2| 1,1 |
 |           |+----+     |
 |           +-----------+
 |           +---OSD 3---+
 |           |+-s3-+ log |
 +---------->||S1C1| 1,2 |
             ||----|     |
             ||S2C1| 1,1 |
             |+----+     |
             +-----------+

OSD 1 is the primary and receives an APPEND from a client, meaning the payload is to be appended to the end of the object. OSD 1 encodes the payload into three chunks: S2D1 (i.e. Stripe two data chunk number 1) will be in s1 (shard 1) on OSD 1, S2D2 in s2 on OSD 2 and S2C1 (i.e. Stripe two coding chunk number 1) in s3 on OSD 3. Each chunk is sent to the target OSD, including the primary OSD, which is responsible for storing chunks in addition to handling write operations and maintaining an authoritative version of the placement group logs. When an OSD receives the message instructing it to write the chunk, it also creates a new entry in the placement group logs to reflect the change. For instance, as soon as OSD 3 stores S2C1, it adds the entry 1,2 (i.e. epoch 1, version 2) to its logs. The log entry also carries the nature of the operation: in this case 1,2 is an APPEND whereas 1,1 was a CREATE. Because the OSDs work asynchronously, some chunks may still be in flight (such as S2D2) while others are acknowledged and on disk (such as S2D1 and S2C1).

+---OSD 1---+
|           |
|   DOWN    |
|           |
+-----------+
+---OSD 2---+
|+-s2-+ log |
||S1D2| 1,1 |
|+----+     |
+-----------+
+---OSD 3---+
|+-s3-+ log |
||S1C1| 1,2 |
||----|     |
||S2C1| 1,1 |
|+----+     |
+-----------+
    primary
+---OSD 4---+
|       log |
|       1,1 |
|           |
+-----------+

If OSD 1 goes down while S2D2 is still in flight, the payload is partially appended: s3 (shard 3) on OSD 3 has one chunk but that is not enough to recover, because two chunks were lost (S2D1 and S2D2) and the erasure coding parameters K = 2 + M = 1 require that at least two chunks be available to rebuild the third. OSD 4 becomes the new primary and finds that the last_complete log entry (i.e. all objects before this entry were known to be available on all OSDs in the previous acting set) is 1,1, and that will be the head of the new authoritative log.

+---OSD 2---+
|+-s2-+ log |
||S1D2| 1,1 |
|+----+     |
+-----------+
+---OSD 3---+
|+-s3-+ log |
||S1C1| 1,1 |
|+----+     |
+-----------+
    primary
+---OSD 4---+
|       log |
|       1,1 |
|           |
+-----------+

The log entry 1,2 found on OSD 3 is divergent from the new authoritative log provided by OSD 4: it is discarded and the file containing the S2C1 chunk is truncated to the nearest multiple of the stripe size.
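The truncation can be sketched as follows (sizes are made up for the example): a shard left with partial, divergent bytes is cut back to the last whole stripe boundary.

#include <iostream>

int main() {
  const unsigned long chunk_size = 4096; // hypothetical per-shard stripe unit
  const unsigned long shard_size = 6144; // S1C1 plus the divergent S2C1 bytes
  // Drop the divergent bytes: keep only whole stripes.
  const unsigned long keep = (shard_size / chunk_size) * chunk_size;
  std::cout << "truncate shard to " << keep << " bytes" << std::endl; // 4096
}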

Erasure code library

Using Reed-Solomon with parameters K+M, object O is encoded by dividing it into K data chunks O1, O2, ... OK and computing M coding chunks P1, P2, ... PM. Any K chunks out of the available K+M chunks can be used to obtain the original object. If a data chunk such as O2 or a coding chunk such as P2 is lost, it can be repaired using any K of the remaining chunks. If more than M chunks are lost, it is not possible to recover the object.

Reading the original content of object O can be a simple concatenation of O1, O2, ... OK, because the plugins use systematic codes. Otherwise the chunks must be given to the erasure code library decode method to retrieve the content of the object.
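For a healthy object stored with a systematic code, the read path is then just the inverse of the split shown earlier. An illustrative sketch (trimming trailing NUL bytes stands in for removing the padding, which is an assumption of this example):

#include <iostream>
#include <string>
#include <vector>

int main() {
  const std::vector<std::string> data_chunks = {"ABC", "DEF", "GHI"};
  std::string object;
  for (const auto &chunk : data_chunks)
    object += chunk; // systematic code: data chunks are the plain content
  object.erase(object.find_last_not_of('\0') + 1); // drop any padding
  std::cout << object << std::endl; // ABCDEFGHI
}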

Reed-Solomon is significantly more expensive to encode than fountain codes with the current jerasure implementation. However, gf-complete, which will be used in the upcoming version of jerasure, is twice as fast, and the difference becomes negligible. The difference matters even more when an object is divided into hundreds of chunks or more, but Ceph will typically be used with fewer than 32 chunks.

Performance depends on the parameters given to the encoding functions and is also influenced by the packet sizes used when calling them (for the Cauchy or Liberation techniques, for instance): smaller packets mean more calls and more overhead.

Although Reed-Solomon is provided as a default, Ceph uses it via an abstract API designed to allow each pool to choose the plugin that implements it, using key=value pairs when creating the pool:

ceph osd pool create <pool> \
   erasure-code-directory=<dir> \
   erasure-code-plugin=<plugin>

The <plugin> is dynamically loaded from <dir> (which defaults to /usr/lib/ceph/erasure-code) and is expected to implement the int __erasure_code_init(char *plugin_name) function, which is responsible for registering an object derived from ErasureCodePlugin in the registry. The ErasureCodePluginExample plugin reads:

// Register the example plugin under its name in the singleton registry.
ErasureCodePluginRegistry &instance = ErasureCodePluginRegistry::instance();
instance.add(plugin_name, new ErasureCodePluginExample());

The ErasureCodePlugin derived object must provide a factory method from which the concrete implementation of the ErasureCodeInterface object can be generated. The ErasureCodePluginExample plugin reads:

virtual int factory(const map<std::string,std::string> &parameters,
                    ErasureCodeInterfaceRef *erasure_code) {
  // Build the concrete ErasureCodeInterface from the pool parameters.
  *erasure_code = ErasureCodeInterfaceRef(new ErasureCodeExample(parameters));
  return 0;
}

The parameters argument is the list of key=value pairs that were set when the pool was created. Each key must be prefixed with erasure-code to avoid name collisions:

ceph osd pool create poolname 123 \
   erasure-code-directory=<dir>         \ # mandatory
   erasure-code-plugin=jerasure         \ # mandatory
   erasure-code-m=10                    \ # optional and plugin dependent
   erasure-code-k=3                     \ # optional and plugin dependent
   erasure-code-technique=reed_sol_van    # optional and plugin dependent

Scrubbing

See also Refactor scrub to use PGBackend methods. The simplest form of scrubbing is to check with each OSD holding a chunk whether it exists locally. If more than M chunks are missing, the object is marked as lost. If up to M chunks are missing, they are repaired and written to the relevant OSDs.
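The decision rule can be sketched as follows (illustrative only, not the actual scrub code):

#include <iostream>
#include <set>

int main() {
  const unsigned k = 3, m = 2;
  std::set<unsigned> present = {1, 3, 4}; // shards found on their OSDs
  const unsigned missing = (k + m) - present.size();
  if (missing > m)
    std::cout << "object is lost" << std::endl;
  else if (missing > 0)
    std::cout << "repair " << missing << " chunk(s)" << std::endl; // repairs 2
  else
    std::cout << "object is intact" << std::endl;
}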

From time to time it may make sense to attempt to read an object, using all of its chunks. If the decode function fails, the object is lost.

Bit flips happen. Not often, but they are possible. Here is an article from 2011; also search for "bit rot" and "bit error rate". To detect corrupted chunks, a checksum (CRC32C for instance) must be added as an attribute of the file containing the chunk (or shard), so that deep scrubbing can check that the chunk is valid by recomputing the checksum of the chunk content and comparing it with the stored signature. BTRFS and ZFS have a CRC32C check built in on a per-block basis.
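An illustrative sketch of that check (the bitwise CRC32C below is a minimal reference implementation, not Ceph's optimized code, and the attribute handling is omitted):

#include <cstdint>
#include <iostream>
#include <string>

// Minimal bitwise CRC32C (Castagnoli polynomial, reflected), for
// illustration only.
uint32_t crc32c(const std::string &data) {
  uint32_t crc = 0xFFFFFFFFu;
  for (unsigned char byte : data) {
    crc ^= byte;
    for (int i = 0; i < 8; i++)
      crc = (crc >> 1) ^ ((crc & 1u) ? 0x82F63B78u : 0u);
  }
  return crc ^ 0xFFFFFFFFu;
}

int main() {
  const std::string chunk = "ABC";       // chunk content read back from disk
  const uint32_t stored = crc32c("ABC"); // checksum stored as an attribute
  std::cout << (crc32c(chunk) == stored ? "chunk is valid" : "chunk is corrupted")
            << std::endl;
}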

Notes

If the objects are large, it may be impractical to encode and decode them in memory. However, when using RBD a 1TB device is divided into many individual 4MB objects, and RGW does the same.

Encoding and decoding is implemented in the OSD. Although it could be implemented client side for reads and writes, the OSD must be able to encode and decode on its own when scrubbing.