# Persistence and Consistency

## Crash Consistency

Because a single update to the file system usually requires several writes to its on-disk data structures, a crash in the middle of this process can leave the file system in an inconsistent state.

## Example

To write to a new block in a file (whose inode has room to point to this data block), we need to update:

- INODE
- DATA BLOCK
- data BITMAP

### Crash Cases

1) The DATA BLOCK is written, but nothing else.
    - The user's data has been written, but neither the BITMAP nor INODE have.
    - This is as if the data wasn't written since the inode doesn't refer to that data block. From the user's perspective, they lost data.
    - From the system's perspective, everything is still consistent.
2) The BITMAP is written, but nothing else.
    - The bitmap will say that the block is claimed, but the inode won't refer to it. Since no file refers to it, it can NEVER be freed.
    - This is a space leak. The file system is permanently one 4KB block smaller.
    - The inodes and bitmaps disagree. The system is in an INCONSISTENT state.
3) The INODE is written, but nothing else.
    - The inode points to the data block, but the data block contains garbage. The file will likely get corrupted.
    - The bitmap says it's free, so this data block might be allocated to a second file. If two files point to the same data block, they can corrupt each other at some distant point in the future.
    - The user will lose data AND the system is in an INCONSISTENT state.
4) The DATA BLOCK and BITMAP are written, but not the INODE.
    - The system will be INCONSISTENT since the metadata disagrees.
    - This is a space leak, and the user loses data.
5) The DATA BLOCK and INODE are written, but not the BITMAP.
    - The system will be INCONSISTENT since the metadata disagrees.
    - The data block can be allocated to a second file, leading to the same outcome as case 3.
6) The BITMAP and INODE are written, but not the DATA BLOCK.
    - The system will be consistent since the metadata agrees.
    - The user loses data and the file is likely corrupted.
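
The six cases above can be enumerated mechanically. Below is a minimal sketch, assuming a toy model in which each of the three writes either lands completely or not at all. The names `classify` and `crash_cases` are made up for illustration and do not come from any real file system.

```python
from itertools import combinations

# The three on-disk structures that must be updated for the append.
WRITES = ("data", "bitmap", "inode")


def classify(landed):
    """Describe the on-disk state, given which of the three writes landed."""
    landed = set(landed)
    data, bitmap, inode = (w in landed for w in WRITES)
    return {
        # The metadata is consistent when bitmap and inode agree about the block.
        "consistent": bitmap == inode,
        # Block marked as used, but no inode refers to it.
        "space_leak": bitmap and not inode,
        # Inode refers to a block whose contents were never written.
        "points_to_garbage": inode and not data,
        # Inode refers to a block the bitmap still considers free.
        "double_allocation_risk": inode and not bitmap,
    }


def crash_cases():
    """Yield every partial outcome: one or two of the three writes landed."""
    for count in (1, 2):
        for landed in combinations(WRITES, count):
            yield landed, classify(landed)
```

Iterating over `crash_cases()` yields six outcomes, one per case in the list. Only "data alone" and "bitmap plus inode" come out consistent, which matches cases 1 and 6.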


# How to maintain consistency despite unexpected crashes

One way is to fix inconsistencies on reboot: scan all inodes and bitmaps, and update them wherever they don't agree.

Of course, this is an expensive price to pay to fix a single inconsistency.

The Linux utility `fsck` takes this approach. On small systems, it can add seconds to minutes to boot time; on large systems, hours to days.
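
To make the cost concrete, here is a toy sketch of the scan-and-repair idea. It is not how `fsck` is implemented. It assumes the inodes are trusted as the source of truth and that the file system is just a dictionary of inode numbers mapped to block pointers plus a list of booleans for the data bitmap. The function names are hypothetical.

```python
def expected_bitmap(inodes, num_blocks):
    """Recompute the data bitmap by walking every block pointer in every inode."""
    bitmap = [False] * num_blocks
    for block_pointers in inodes.values():
        for block in block_pointers:
            bitmap[block] = True
    return bitmap


def check_and_repair(inodes, bitmap):
    """Return (repaired_bitmap, leaked_blocks, unmarked_blocks).

    leaked_blocks:   marked as used on disk, but no inode points to them.
    unmarked_blocks: pointed to by an inode, but marked as free on disk.
    """
    expected = expected_bitmap(inodes, len(bitmap))
    leaked = [b for b in range(len(bitmap)) if bitmap[b] and not expected[b]]
    unmarked = [b for b in range(len(bitmap)) if expected[b] and not bitmap[b]]
    return expected, leaked, unmarked
```

Note that `expected_bitmap` has to visit every inode and every pointer even when only one block is wrong, which is why this approach scales so poorly with disk size.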

Not ideal. A better way:

# A Journaling File System

To prevent inconsistencies, the file system maintains a journal.

When a request comes along, the file system first records what it is about to do in the journal, then it actually does it.

After a crash or power loss, the system checks the journal on reboot and completes any pending requests.

If a crash occurs in the middle of writing to the journal, the write is lost, but the system is consistent.

If a crash happens in the middle of writing to the disk, on reboot, the journal is replayed and the write is completed.

The OS writes **transactions** to the journal. A **transaction** has a header and tail. If the tail is missing, then the system crashed in the middle of writing to the journal.

Between the header and tail, the OS writes the inodes, bitmaps, and data blocks to be written to disk.

Journaling everything this way is not efficient. It effectively halves the speed of the disk, since everything must be written out twice: once to the journal and once to its final location.

The journal is stored after the superblock.

<br>
<img src="images/14-fs-with-journal.png" width="500">
<br>

**A transaction**
<br>
<img src="images/15-journal.png" width="500">
<br>
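
The header/tail rule and the replay step can be sketched as follows. This is a simplified model under stated assumptions: the journal is a Python list of tuples, `"begin"` and `"end"` stand in for the header and tail, and the disk is a dictionary keyed by made-up addresses. None of these names reflect a real on-disk format.

```python
def replay(journal, disk):
    """Apply every complete transaction in the journal to the disk.

    A transaction counts as complete only if its "end" (tail) record is
    present. Writes from a transaction without a tail are discarded.
    """
    pending = None  # writes collected for the transaction being read
    for record in journal:
        kind = record[0]
        if kind == "begin":
            pending = []
        elif kind == "write" and pending is not None:
            _, address, contents = record
            pending.append((address, contents))
        elif kind == "end" and pending is not None:
            for address, contents in pending:
                disk[address] = contents
            pending = None
    # Anything still in `pending` belongs to a torn transaction: ignore it.
    return disk


journal = [
    ("begin", 1),
    ("write", "inode:7", "v2"),
    ("write", "bitmap", "v2"),
    ("end", 1),
    ("begin", 2),                # crash happened before the tail was written
    ("write", "inode:8", "v2"),
]
disk = replay(journal, {"inode:7": "v1", "bitmap": "v1", "inode:8": "v1"})
```

After the call, transaction 1 has been applied and transaction 2 has been dropped. Replaying the same journal a second time leaves the disk unchanged, so a crash during recovery is harmless in this model.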

Rather than write everything to the journal, just write the bitmaps and inodes.

DON'T write the data blocks.

Question: Should we write the data block before the journal transaction or after?

**Case 1**: Write to the journal, then write the data block to disk.

If the system crashes before writing the data block, what is the state of the file upon reboot?

The file will be corrupted. The inode points to a garbage data block.

**Case 2**: Write the data block to disk, then the transaction to the journal.

What if a crash happens between these two steps?

The last write will be lost, but the file will NOT be corrupted.

Our journaled file system goes with the second option: first write the data block to disk, then write the transaction to the journal.
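
The data-first ordering can be simulated with a crash injected after each step. This is a minimal sketch, assuming a journal append is atomic (torn transactions are ignored here) and that the disk is an in-memory dictionary. Every function name and the `crash_after` parameter are hypothetical.

```python
def new_disk():
    # blocks: block number -> contents; bitmap: set of allocated block numbers;
    # inodes: inode number -> list of block pointers.
    return {"blocks": {}, "bitmap": set(), "inodes": {}}


def apply_metadata(disk, tx):
    """Apply one journaled metadata update (safe to apply more than once)."""
    pointers = disk["inodes"].setdefault(tx["inode"], [])
    if tx["block"] not in pointers:
        pointers.append(tx["block"])
    disk["bitmap"].add(tx["block"])


def append_block(disk, journal, inode, block, contents, crash_after=None):
    """Append a block: data first, then the journal, then the real metadata."""
    disk["blocks"][block] = contents                   # step 1: data block
    if crash_after == "data":
        return
    journal.append({"inode": inode, "block": block})   # step 2: journal
    if crash_after == "journal":
        return
    apply_metadata(disk, journal[-1])                  # step 3: metadata in place


def recover(disk, journal):
    """On reboot, replay every journaled metadata transaction."""
    for tx in journal:
        apply_metadata(disk, tx)


def is_intact(disk):
    """True if every block an inode points to was written and is marked used."""
    return all(
        block in disk["blocks"] and block in disk["bitmap"]
        for pointers in disk["inodes"].values()
        for block in pointers
    )
```

With `crash_after="data"`, recovery finds an empty journal, so the write is lost but no inode points to garbage. With `crash_after="journal"`, replay completes the update. In both runs `is_intact` holds, which is the property the second option was chosen for.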