Fixes and code refactoring (#38)

* Removed space
* RSBX_SPECS overhaul
* Comment fix
* Comment and spacing
* RSBX_SPECS overhaul
* RSBX_SPECS overhaul
* RSBX_SPECS overhaul
* Casing fix
* RSBX_SPECS overhaul
* RSBX_SPECS update
* RSBX_SPECS overhaul
* RSBX_SPECS overhaul
* Updated RSBX_SPECS
* Overhaul of block_utils::get_ref_block
* Switched to using release build for test scripts
* Added code to handle repairing of cut off block set properly
* Updated changelog
darrenldl · Mar 22, 2018 · commit b70416f · 1 parent 46c0df6

Showing 18 changed files with 238 additions and 101 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,6 +8,7 @@
   - file size
   - container size
 - General output text polishing
+- Fixed repair mode code to properly handle block sets with blocks missing due to truncation
 
 ## 0.9.3
 - Various UI/UX improvements in subcommands

diff --git a/README.md b/README.md
@@ -40,7 +40,7 @@ The [wiki](https://github.com/darrenldl/rust-SeqBox/wiki) contains comprehensive
 [Changelog](CHANGELOG.md)
 
 ## Specification
-[Sbx format](SBX_FORMAT.md)
+[SBX format](SBX_FORMAT.md)
 
 [rsbx specs](RSBX_SPECS.md)
 

diff --git a/RSBX_SPECS.md b/RSBX_SPECS.md
@@ -1,4 +1,5 @@
 # Specification of rust-SeqBox
+The specification is only concerned with actual data operations rather than UI/UX related matter.
 
 ## Exit code
 rsbx returns
@@ -11,43 +12,68 @@ rsbx returns
   - This applies to encoding, decoding, rescuing (showing does not generate any files)
   - This is mainly in case the partial data is useful to the user
 
+## Block handling in general
+#### Block validity
+A block is valid if
+- Header can be parsed
+- CRC-CCITT is correct
+
+#### Handling of duplicate metadata in metadata block given the block is valid
+- For a given ID, only the first occurrence of the metadata will be used
+  - e.g. if there are two FNM metadata fields in the metadata block, only the first (in terms of byte order) will be used
+- This applies everywhere metadata fields need to be accessed
+
+#### Handling of incorrect metadata fields in metadata block given the block is valid
+- To avoid propagation of error into core logic, incorrect fields either fail the parsing stage, or are filtered out immediately after the parsing stage. That is, invalid metadata fields are never accessible by other modules.
+- This tradeoff means rsbx's error messages regarding metadata fields will be very coarse. For example, if the recorded file name is not a valid UTF-8 string, the `sbx_block` module drops the field during parsing, so the core logic code only sees the field as missing. It cannot tell whether the field is missing or incorrect, and cannot tell the user why the field is incorrect.
+- This overall means trading flexibility for security.
+
+## Finding reference block
+1. The entire SBX container is scanned using an alignment of 128 bytes. 128 is used as it is the greatest common divisor of 512(block size for version 1), 128(block size for version 2), and 4096(block size for version 3)
+  - if any block type is allowed
+    - the first valid block of any type(i.e. valid metadata or data block) will be used as reference block
+  - else
+    - if there is any valid metadata block in SBX container, then the first one will be used as reference block
+    - else the first valid data block will be used as reference block
+
+## Guessing burst error resistance level
+1. Read the sequence numbers of up to the first **1 + parity shard count + 1000** blocks
+- if block is valid, record the sequence number
+- else mark the sequence number as missing
+- a ref block is required to provide guidance on version and uid accepted
+2. Go through level 0 to 1000(inclusive), calculate the supposed sequence number at each block position, and record the number of mismatches for each level
+- if a sequence number was marked missing, then it is ignored and not checked for mismatch
+3. Return the level with the fewest mismatches
+
 ## Encode workflow
-1. If metadata is enabled, the following file metadata are gathered from file or retrieved from user input : file name, SBX file name, file size, file last modification time, encoding start time
+1. If metadata is enabled, the following file metadata are gathered from file or retrieved from user input
+- file name
+- SBX file name
+- file size
+- file last modification time
+- encoding start time
 2. If metadata is enabled, then a partial metadata block is written into the output file as filler
   - The written metadata block is valid, but does not contain the actual file hash, a filler pattern of 0x00 is used in place of the hash part of the multihash(the header and length indicator of multihash are still valid)
 3. Load version specific data sized chunk one at a time from input file to encode and output(and if metadata is enabled, Multihash hash state/ctx is updated as well(the actual hash state/ctx used depends on hash type, defaults to SHA256))
   - data size = block size - header size (e.g. version 1 has data size of 512 - 16 = 496)
 4. If metadata is enabled, the encoder seeks back to starting position of output file and overwrites the metadata block with one that contains the actual hash
 
 ## Decode workflow
-Metadata block is valid if and only if
-- Header can be parsed
-- All metadata fields(duplicate or not) can be parsed successfully
-  - Duplicate refers to metadata fields with the same ID
-- All remaining space is filled with 0x1A pattern
-- Version(specifically alignment/block size) matches reference block(see below)
-- CRC-CCITT is correct
+Metadata block is valid if
+- Basic block validity criteria are satisfied(see **Block handling in general** above)
+- Version and uid match reference block(see below)
+- Several aspects are relaxed and allowed to not conform to `SBX_FORMAT`
+  - Metadata fields are optional, i.e. do not have to be parsable
+  - Padding of 0x1A is not mandatory
 
 Data block is valid if and only if
-- Header can be parsed
+- Basic block validity criteria are satisfied(see **Block handling in general** above)
 - Version and uid match reference block(see below)
-- CRC-CCITT is correct
 
-1. A reference block is retrieved first(which is used for guidance on alignment, version, and uid)
-  - the entire SBX container is scanned using alignment of 128 bytes, 128 is used as it is the largest common divisor of 512(block size for version 1), 128(block size for version 2), and 4096(block size for version 3)
-  - if no-meta flag is specified
-    - the first whatever valid block(i.e. valid metadata or data block) will be used as reference block
-  - else
-    - if there is any valid metadata block in SBX container, then the first one will be used as reference block
-    - else the first valid data block will be used as reference block
-  - if the version of reference block is 1, 2, or 3
-    - the block can be either `Data` or `Meta`, and all metadata fields are optional
-  - else if the version of reference block is 17, 18, or 19
-    - the block must be `Meta`, and metadata fields `RSD`, `RSP` must be present
+1. A reference block is retrieved first and is used for guidance on alignment, version, and uid(see **Finding reference block** procedure specified above)
 2. Scan for valid blocks from start of SBX container to decode and output using reference block's block size as alignment
   - if a block is invalid, nothing is done
   - if a block is valid, and is a metadata block, nothing is done
-  - if a block is valid, and is a metadata parity block, nothing is done
   - if a block is valid, and is a data parity block, nothing is done
   - if a block is valid, and is a data block, then it will be written to the writepos at output file, where writepos = (sequence number - 1) * block size of reference block in bytes
 3. If possible, truncate output file to remove data padding done for the last block during encoding
@@ -62,10 +88,6 @@ Data block is valid if and only if
 - First valid metadata block will be used(if exists)
 - For all other data blocks, the last seen valid data block will be used for a given sequence number
 
-#### Handling of duplicate metadata in metadata block given the block is valid
-- For a given ID, only the first occurrence of the metadata will be used
-  e.g. if there are two FNM metadata fields in the metadata block, only the first (in terms of byte order) will be used
-
 #### Handling of corrupted/missing blocks
 - Corrupted blocks or missing blocks are not repaired in this mode
 - User needs to invoke repair mode to repair the archive
@@ -77,61 +99,57 @@ Data block is valid if and only if
   - if the log file exists, then it will be used to initialize the scan's starting position
     - bytes_processed field will be rounded down to closest multiple of 128 automatically
   - the log file will be updated on every ~1.0 second
-- each block is appended to OUTDIR/uid, where :
+- each block is appended to OUTDIR/UID, where :
   - OUTDIR = output directory specified
-  - uid    = uid of the block in hex
-- the original bytes in the file is used, that is, the output block bytes are not generated from scratch by oSBX
-2. User is expected to attempt to decode the rescued data in OUTDIR using the oSBX decode command
+  - UID    = uid of the block in hex(uppercase)
+- the original bytes in the file are used, that is, the output block bytes are not generated from scratch by rsbx
+2. User is expected to attempt to decode the rescued data in OUTDIR using the rsbx decode command
 
 ## Show workflow
 1. Scan for metadata blocks from start of provided file using 128 bytes alignment
-  - if block scanned has sequence number 0, then
-    - if the block is a valid metadatablock, it will be collected
-    - up to some specified maximum number of blocks are collected(defaults to 1)
-  - else
-    - nothing is done
-2. Metadata of collected list of metadata blocks are displayed
+- if show all flag is supplied, all valid metadata blocks are displayed
+- else only the first valid metadata block is displayed
+- blocks are displayed immediately(no buffering of blocks)
 
 ## Repair workflow
-1. Load metadata block and the 3 parity blocks, repair any of the 4 blocks if necessary
-2. Load up to M + N blocks sequentially, where M is the number of data shards and N is the number of parity shards
-3. Check CRC of all blocks and record invalid blocks
-4. Reconstruct the invalid blocks if possible
+1. A reference block is retrieved first and is used for guidance on alignment, version, and uid(see **Finding reference block** procedure specified above)
+- a metadata block must be used as reference block in this mode
+2. If the version of ref block does not use RS, then exit
+3. If `RSD` and `RSP` fields are not found in the ref block, then exit
+4. Total block count is then calculated from
+- `FSZ` field in ref block if present
+- otherwise estimated from the container size
+5. Go through all positions where metadata blocks are stored in container
+- if the metadata block is valid, nothing is done
+- else the metadata block is overwritten by the reference block
+6. Go through sequence numbers sequentially until the block count reaches calculated total block count
+- For each sequence number, calculate the block position and try to parse
+- Each valid block is loaded into the RS codec, and repair process starts for the current block set when the current block set is filled
+7. If the current block set contains enough blocks for repair, but the repair process failed to start because the block count reached the calculated total block count
+- This indicates blocks are missing due to truncation
+- The RS codec is invoked once to attempt repair, and the remaining blocks are written out if the repair is successful
 
 ## Check workflow
-1. A reference block is retrieved first(which is used for guidance on alignment, version, and uid)
-  - the entire SBX container is scanned using alignment of 128 bytes, 128 is used as it is the largest common divisor of 512(block size for version 1), 128(block size for version 2), and 4096(block size for version 3)
-  - if no-meta flag is specified
-    - the first whatever valid block(i.e. valid metadata or data block) will be used as reference block
-  - else
-    - if there is any valid metadata block in SBX container, then the first one will be used as reference block
-    - else the first valid data block will be used as reference block
-  - if the version of reference block is 1, 2, or 3
-    - the block can be either `Data` or `Meta`, and all metadata fields are optional
-  - else if the version of reference block is 17, 18, or 19
-    - the block must be `Meta`, and metadata fields `RSD`, `RSP` must be present
+1. A reference block is retrieved first and is used for guidance on alignment, version, and uid(see **Finding reference block** procedure specified above)
 2. Scan for valid blocks from start of SBX container to decode and output using reference block's block size as alignment
-  - if a block is invalid, and error message is shown
-  - if a block is valid, nothing is done
+- if a block is invalid, an error message is shown
+- if a block is valid, nothing is done
+- By default, completely blank sections are ignored as they usually indicate gaps introduced by the burst error resistance pattern
 
 #### Handling of irreparable blocks
 - Output sequence number of the blocks to log
 
-#### Handling of duplicate, out of order blocks, or block sequence number jumps
-- Halt repair process
-
 ## Sort workflow
-1. Check if destination has sufficient space for a complete replica of the current file(may not be sufficient estimate)
-2. Read block from input file sequentially, write to position calculated from sequence number and block size to output file
+1. Read blocks from input file sequentially, and write each to the position in the output file calculated from sequence number, block size and burst error resistance level
+- The burst error resistance level by default is guessed using the **Guessing burst error resistance level** procedure specified above
+- The first metadata block is used for all metadata blocks in output container
+- The last valid data block is used for each sequence number
 
 #### Handling of missing blocks
 - Jumps/gaps caused by missing blocks are left to file system to handle(i.e. this may result in sparse file, or file with blank data in the gaps)
 
-#### Handling of corrupted blocks
-- Still write to output file
-
-#### Handling of duplicate metadata/data blocks
-- Append block to FILENAME.TIME.rSBX.leftover, where FILENAME is the specified archive name(not the name stored in metadata), TIME is string of format "%Y-%M-%D_%h%m" of the start of the sorting process
+## Calc workflow
+Calc mode only operates at UI/UX level and does not handle any file data, thus it is not documented here.
 
 ## To successfully encode a file
 - File size must be within threshold
@@ -145,12 +163,11 @@ Data block is valid if and only if
 - If data padding was done for the last block, then at least one valid metadata block must exist and the first block amongst the valid metadata blocks needs to contain a field for the file size in order for truncation of the output file to happen
 
 ## To successfully rescue your SBX container
-- Get enough valid SBX blocks of your container such that a successful decoding may take place
+- Get enough valid SBX blocks of your container such that a successful decoding or repair may take place
 
 ## To successfully repair your SBX container
 - The container has metadata block(or enough metadata parity blocks to reconstruct if corrupted/missing)
-- The container blocks are sorted by the sequence number in increasing order
-- The container has no duplicate blocks
+- The blocks' sequence numbers are in consistent order
 - The container has enough valid parity blocks to correct all errors
 
 ## To successfully sort your SBX container
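
Note on the arithmetic in the RSBX_SPECS.md changes above: the 128-byte scan alignment, the data size, the decode write position, and the rounding of `bytes_processed` can be sketched as below. This is only an illustration under stated assumptions, not code from this commit. All function names and the `HEADER_SIZE` constant are hypothetical; the 16-byte header comes from the spec's own example (512 - 16 = 496), and the write position follows the formula exactly as the spec words it.

```rust
/// Header size in bytes, taken from the spec's example (512 - 16 = 496).
/// Hypothetical constant name, not from the rsbx source.
const HEADER_SIZE: u64 = 16;

/// Greatest common divisor via Euclid's algorithm.
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// Scan alignment: the GCD of all supported block sizes.
fn scan_alignment(block_sizes: &[u64]) -> u64 {
    block_sizes.iter().fold(0, |acc, &size| gcd(acc, size))
}

/// data size = block size - header size
fn data_size(block_size: u64) -> u64 {
    block_size - HEADER_SIZE
}

/// writepos = (sequence number - 1) * block size of reference block,
/// as worded in the spec. Sequence number 0 is the metadata block,
/// which is not written to the output, so `None` is returned for it.
fn write_pos(seq_num: u64, ref_block_size: u64) -> Option<u64> {
    seq_num.checked_sub(1).map(|n| n * ref_block_size)
}

/// Round a byte count down to the closest multiple of `alignment`,
/// as done for the rescue log's bytes_processed field.
fn round_down(bytes_processed: u64, alignment: u64) -> u64 {
    bytes_processed / alignment * alignment
}
```

For example, `scan_alignment(&[512, 128, 4096])` gives the 128 used by the scanner, and `round_down(1000, 128)` resumes a rescue scan at byte 896.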

diff --git a/build.rs b/build.rs
@@ -24,7 +24,7 @@
  *
  * The above copyright notice and this permission notice shall be included in all
  * copies or substantial portions of the Software.
- * 
+ *
  * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
  * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
  * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE

diff --git a/src/block_utils.rs b/src/block_utils.rs
@@ -25,6 +25,13 @@ use progress_report::*;
 
 use general_error::Error;
 
+#[derive(Clone, Copy, Debug, PartialEq)]
+pub enum RefBlockChoice {
+    Any,
+    Prefer(BlockType),
+    MustBe(BlockType),
+}
+
 pub struct LazyReadResult {
     pub len_read : usize,
     pub usable   : bool,
@@ -108,7 +115,7 @@ pub fn read_block_lazily(block  : &mut Block,
 }
 
 pub fn get_ref_block(in_file            : &str,
-                     use_any_block_type : bool,
+                     ref_block_choice   : RefBlockChoice,
                      pr_verbosity_level : PRVerbosityLevel,
                      stop_flag          : &Arc<AtomicBool>)
                      -> Result<Option<(u64, Block)>, Error> {
@@ -167,19 +174,54 @@ pub fn get_ref_block(in_file            : &str,
             }
         }
 
-        if use_any_block_type {
-            if let Some(_) = meta_block { break; }
-            if let Some(_) = data_block { break; }
-        } else {
-            if let Some(_) = meta_block { break; }
+        match ref_block_choice {
+            RefBlockChoice::Any        => {
+                if let Some(_) = meta_block { break; }
+                if let Some(_) = data_block { break; }
+            },
+            RefBlockChoice::Prefer(bt) |
+            RefBlockChoice::MustBe(bt) => match bt {
+                BlockType::Meta => {
+                    if let Some(_) = meta_block { break; }
+                },
+                BlockType::Data => {
+                    if let Some(_) = data_block { break; }
+                }
+            },
         }
     }
 
     reporter.stop();
 
-    Ok(if      let Some(x) = meta_block { Some(x) }
-       else if let Some(x) = data_block { Some(x) }
-       else                             { None    })
+    Ok(match ref_block_choice {
+        RefBlockChoice::Any        => match (meta_block, data_block) {
+            (Some(m), _      ) => Some(m),
+            (_      , Some(d)) => Some(d),
+            (None,    None   ) => None,
+        },
+        RefBlockChoice::Prefer(bt) => match bt {
+            BlockType::Meta => match (meta_block, data_block) {
+                (Some(m), _      ) => Some(m),
+                (_      , Some(d)) => Some(d),
+                (None,    None   ) => None,
+            },
+            BlockType::Data => match (meta_block, data_block) {
+                (_      , Some(d)) => Some(d),
+                (Some(m), _      ) => Some(m),
+                (None,    None   ) => None,
+            }
+        },
+        RefBlockChoice::MustBe(bt) => match bt {
+            BlockType::Meta => match (meta_block, data_block) {
+                (Some(m), _      ) => Some(m),
+                (_      , _      ) => None,
+            },
+            BlockType::Data => match (meta_block, data_block) {
+                (_      , Some(d)) => Some(d),
+                (_      , _      ) => None,
+            }
+        }
+    })
 }
 
 pub fn guess_burst_err_resistance_level(in_file       : &str,
@@ -217,7 +259,7 @@ pub fn guess_burst_err_resistance_level(in_file       : &str,
 
     let pred = block_pred_same_ver_uid!(ref_block);
 
-    // record first up to 1000 seq nums
+    // record first up to 1 + parity count + 1000 seq nums
     loop {
         let read_res = reader.read(sbx_block::slice_buf_mut(ref_block.get_version(),
                                                             &mut buffer))?;
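
Note on the `get_ref_block` change above: the final selection it performs with nested `match` arms can be read as the following condensed sketch. It is an illustration only, not the code in this commit. `pick_ref_block` is a hypothetical helper, the enums are redeclared locally so the snippet stands alone, and the generic `T` stands in for the `(u64, Block)` pair.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum BlockType {
    Meta,
    Data,
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum RefBlockChoice {
    Any,
    Prefer(BlockType),
    MustBe(BlockType),
}

/// Pick the reference block from the first valid metadata block and
/// the first valid data block found during the scan (either may be absent).
fn pick_ref_block<T>(choice: RefBlockChoice,
                     meta:   Option<T>,
                     data:   Option<T>) -> Option<T> {
    match choice {
        // Any and Prefer(Meta) both favor the metadata block and
        // fall back to the data block.
        RefBlockChoice::Any |
        RefBlockChoice::Prefer(BlockType::Meta) => meta.or(data),
        // Prefer(Data) favors the data block and falls back to metadata.
        RefBlockChoice::Prefer(BlockType::Data) => data.or(meta),
        // MustBe has no fallback.
        RefBlockChoice::MustBe(BlockType::Meta) => meta,
        RefBlockChoice::MustBe(BlockType::Data) => data,
    }
}
```

`Any` and `Prefer(Meta)` return the same block once both candidates are known. In the diff they differ only in when the scan loop stops: `Any` breaks as soon as either kind of block is found, while `Prefer` and `MustBe` keep scanning until the named type turns up.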

diff --git a/src/check_core.rs b/src/check_core.rs
@@ -23,24 +23,26 @@ use sbx_specs::{SBX_LARGEST_BLOCK_SIZE,
 use time_utils;
 use block_utils;
 
+use block_utils::RefBlockChoice;
+
 use cli_utils::report_ref_block_info;
 
 pub struct Param {
-    no_meta            : bool,
+    ref_block_choice   : RefBlockChoice,
     report_blank       : bool,
     in_file            : String,
     verbose            : bool,
     pr_verbosity_level : PRVerbosityLevel,
 }
 
 impl Param {
-    pub fn new(no_meta            : bool,
+    pub fn new(ref_block_choice   : RefBlockChoice,
                report_blank       : bool,
                in_file            : &str,
                verbose            : bool,
                pr_verbosity_level : PRVerbosityLevel) -> Param {
         Param {
-            no_meta,
+            ref_block_choice,
             report_blank,
             in_file  : String::from(in_file),
             verbose,

diff --git a/src/cli_check.rs b/src/cli_check.rs
@@ -23,7 +23,7 @@ pub fn check<'a>(matches : &ArgMatches<'a>) -> i32 {
     let pr_verbosity_level = get_pr_verbosity_level!(matches);
 
     let in_file  = get_in_file!(matches);
-    let param = Param::new(matches.is_present("no_meta"),
+    let param = Param::new(get_ref_block_choice!(matches),
                            matches.is_present("report_blank"),
                            in_file,
                            matches.is_present("verbose"),