# UTILS.ipynb 
# Helpful Data Management Utilities

---
## Overview
- cache.py

# 1 - cache.py

This file is responsible for project data management. `ensure_dir` is used often throughout the pipeline to drastically save runtime, and `file_sha256` is a useful debugging tool.

# 1.1 - Imports

This section covers the necessary imports. 

`from __future__ import annotations` enables postponed evaluation of type hints, which are used throughout for clarity. `Path` is used for file system management. `hashlib` is used for file integrity checks.

In [None]:
from __future__ import annotations
from pathlib import Path
import hashlib

# 1.2 - Ensure Directory

This simple function guarantees that an input directory `p` (Path) exists, and returns the same object for chaining.

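`p.mkdir(parents=True, exist_ok=True)` creates directory `p` along with any missing parent directories, and does nothing if the directory already exists. If `p` exists but is not a directory, `FileExistsError` is still raised.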

In [None]:
def ensure_dir(p: Path) -> Path:
    """Create directory `p` (and any missing parents) if needed, then return it."""
    p.mkdir(parents=True, exist_ok=True)
    return p
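
A minimal usage sketch of the chaining pattern follows. The `cache/features` layout and the file name `part-0.txt` are hypothetical, chosen only for illustration, and everything is written under a throwaway temporary directory. `ensure_dir` is repeated so the cell runs on its own.

```python
from pathlib import Path
import tempfile


def ensure_dir(p: Path) -> Path:
    # Same definition as above, repeated so this cell runs standalone.
    p.mkdir(parents=True, exist_ok=True)
    return p


# Hypothetical layout: a nested cache directory under a temporary root.
root = Path(tempfile.mkdtemp())

# Because ensure_dir returns its argument, the result can be used inline.
out_file = ensure_dir(root / "cache" / "features") / "part-0.txt"
out_file.write_text("example")

# Calling it again on an existing directory does nothing and returns the path.
same_dir = ensure_dir(root / "cache" / "features")
```

Returning the path lets callers build an output location and guarantee it exists in a single expression, rather than calling `mkdir` separately before every write.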

# 1.3 - File SHA-256

This function computes the SHA-256 checksum of a file, which is useful for cache invalidation, change detection, data integrity verification, and reproducible pipelines. It accepts a Path `path` and an integer `chunk_size` (default 1 MiB, i.e. `1024 * 1024` bytes) and returns the digest as a 64-character hexadecimal string. Reading the file in chunks keeps memory use bounded regardless of file size.

Line-by-line breakdown:
- Initialize a SHA-256 hash object with an empty internal state.
- Open `path` in binary mode.
- Loop until the file has been completely read: read up to `chunk_size` bytes at a time, and exit the loop when a read returns empty bytes.
- Feed each chunk into the hash object, updating its internal state incrementally.
- After the loop exits, return the final digest as a readable hex string.

In [None]:
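def file_sha256(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # An empty read means end of file.
                break
            h.update(chunk)
    return h.hexdigest()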
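
One way the checksum could support change detection is sketched below. The helper `has_changed` and the `.sha256` sidecar file naming are assumptions introduced for illustration and are not part of `cache.py`. `file_sha256` is repeated so the cell is self-contained.

```python
from pathlib import Path
import hashlib
import tempfile


def file_sha256(path: Path, chunk_size: int = 1024 * 1024) -> str:
    # Same definition as above, repeated so this cell runs standalone.
    h = hashlib.sha256()
    with path.open("rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()


def has_changed(path: Path) -> bool:
    """Hypothetical helper: compare the file's digest with a stored sidecar.

    Returns True (and refreshes the sidecar) when no digest is stored yet
    or the stored digest differs from the current one.
    """
    sidecar = path.with_name(path.name + ".sha256")
    digest = file_sha256(path)
    if sidecar.exists() and sidecar.read_text() == digest:
        return False
    sidecar.write_text(digest)
    return True


# Illustrative data file in a temporary directory.
data_file = Path(tempfile.mkdtemp()) / "data.bin"
data_file.write_bytes(b"abc" * 1000)

first = has_changed(data_file)    # no sidecar yet, so treated as changed
second = has_changed(data_file)   # same content, digest matches the sidecar
data_file.write_bytes(b"abd" * 1000)
third = has_changed(data_file)    # content differs, digest no longer matches
```

A pipeline step could skip recomputation whenever such a check reports no change. The digest does not depend on `chunk_size`, so the chunk size can be tuned for I/O performance without invalidating stored checksums.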