
## Hardware-Level Caching

### Leveraging the Chip Layer Caches
- Big Cache, sure... L1, L2, L3
- Choosing your hardware purposefully, e.g. EPYC MegaL3* Cache vs Xeon MAX**

[*] Not its formal marketing name, but close
[**] This is its actual marketing name

### DRAM Faster All The Time
DDR4 no more.. we go with DDR5 and soon DDR6.

#### Concerns for DRAM
- NUMA, and it applies to chiplet designs just as much as it applies to multi-socket motherboards.
- Interleaved memory: how it can improve performance and high availability, and how it can degrade performance if the design does not fit the requirements.
- Memory channel density: populating more DIMMs per channel can increase total DRAM capacity, though in some architectures there is a DRAM speed penalty for populating more than one DIMM per channel (that is, going beyond 1DPC).
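
To put rough numbers on the channel-density trade-off, here is a back-of-the-envelope sketch of theoretical peak bandwidth: transfers per second, times 8 bytes per transfer (a 64-bit data path per channel), times the number of channels. The 12-channel count and the 4800 vs 4400 MT/s speeds are illustrative assumptions, not figures for any specific platform; check the vendor's population guidelines for real values.

```python
def peak_bandwidth_gbs(mega_transfers_per_s: int, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes)."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9

# Hypothetical platform: 12 channels, with a speed drop when going from 1DPC to 2DPC.
one_dpc = peak_bandwidth_gbs(4800, channels=12)
two_dpc = peak_bandwidth_gbs(4400, channels=12)

print(f"1DPC: {one_dpc:.1f} GB/s, 2DPC: {two_dpc:.1f} GB/s")
```

Under these assumed speeds, doubling capacity with 2DPC costs roughly 8% of peak bandwidth, which may or may not matter depending on whether the workload is capacity-bound or bandwidth-bound.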


### NVMe are not all equal
- U.2, U.3, AIC, or Rulers? Different specs all around, and their interconnects to the PCIe bus make a substantial difference.
- SLC on there: is it only an SLC cache, or is the full NAND array made of SLC?

### RDMA and Network Shared Memory
Is OpenMPI ever enough? Is RDMA what we want? Why not Myricom?

#### Interconnect Designs for Memory Nets
As a starting point, the Torus has been a high-performing base architecture for TOP500 supercomputers. More on that here:
- https://en.wikipedia.org/wiki/Torus_interconnect#Performance
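
One reason the torus performs well is its wraparound links, which cut the worst-case hop count roughly in half compared to a plain mesh of the same size. A minimal sketch of that arithmetic, assuming nodes are addressed by integer coordinates and `dims` gives the ring size in each dimension (both names are illustrative):

```python
def torus_hops(a, b, dims):
    """Minimum hop count between nodes a and b on a torus.

    In each dimension the packet can go either way around the ring,
    so the cost is the shorter of the direct and wraparound distances.
    """
    return sum(min(abs(x - y), k - abs(x - y)) for x, y, k in zip(a, b, dims))

def torus_diameter(dims):
    """Worst-case hop count: half of each ring, summed over dimensions."""
    return sum(k // 2 for k in dims)

# 8x8x8 torus: going from (0,0,0) to (7,1,4) takes 1 + 1 + 4 hops,
# because the first dimension wraps around instead of walking 7 links.
hops = torus_hops((0, 0, 0), (7, 1, 4), (8, 8, 8))
diameter = torus_diameter((8, 8, 8))
```
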

Variations on the theme, including the SeaStar and Gemini interconnects, are described here:
- https://github.com/jeffhammond/HPCInfo/blob/master/docs/Cray.md

#### Message Passing Interface - aka MPI
What would we do with a shared memory network if not pass data back and forth between compute nodes? And we do it in a standardized manner, which is where MPI comes in, with plenty of examples to draw from. This has been an evolving field of computer science, which deserves a variety of source code examples and tutorials:
- From the great Jeff Hammond: https://github.com/jeffhammond/HPCInfo/tree/master/mpi
- Lawrence Livermore Labs gets a mention: https://hpc-tutorials.llnl.gov/mpi/


### Multi-Node Distributed Caching
This necessarily involves an amalgam of the various cache layers and technologies, and so we'll simply consider this to be the default when dealing with clusters of KVUs. The exceptions include hypothetical designs like the following:

- Micro-Cluster: A single CS-N with 2-4U compute nodes, vertically scaled for high-density resources per node, running tens to hundreds of VMs per node.
    - Each physical node could be deployed for a single dedicated tenant, similar in concept to using physically separate L2 network domains for disaggregation and strong physical isolation of systems when logical partitioning does not satisfy CSO requirements, a DoD spec, or a similar architectural design requirement.
    - In this design the VMs could participate in aggregation-level distributed caching internally to the physical node, but would not leverage hardware offloads directly on NICs or DPUs unless those are distributed via SR-IOV or NPAR style controls within the node's OS, kernel, and cgroups security layers.


## Application Level Caching


### Semantics-Schmemantics with these LLM Queryings


### Library Layer Cache Optimizers

#### Template and Prefix Caching
#### Multi-Turn Conversations
#### RAG and Top-K Concerns
#### Vector Embeds, aka The Matryoshka Method
#### SDK Optimizers Per-Platform


## The Many Methods of Popularized Cache Algorithms

- Most Recently Used (MRU): evicts the most recently used item, which can be useful in page-scanning engines or cyclical control-loop caches. It operates under the assumption that an item that was just scanned and accessed will not be referenced again until the next full loop, cycle, or page. MRU is only beneficial for those specific access patterns (like cyclical scans), so it’s generally not used as a primary policy in caching systems for LLMs, except possibly in some layered buffer situations or as a layered algo with trigger calls to enable/disable it as needed.
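
The cyclical-scan case is easy to demonstrate. Below is a minimal sketch of an MRU cache (the class name, the `load` callback, and the hit/miss counters are illustrative, not from any particular library), driven by a loop that repeatedly scans 4 pages through a 3-slot cache:

```python
class MRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}      # dicts keep insertion order; the last entry is the most recent
        self.hits = 0
        self.misses = 0

    def get(self, key, load):
        if key in self.data:
            self.hits += 1
            value = self.data.pop(key)      # re-inserted below to mark it most recent
        else:
            self.misses += 1
            value = load(key)
            if len(self.data) >= self.capacity:
                self.data.popitem()         # dict.popitem() drops the most recent entry
        self.data[key] = value
        return value

cache = MRUCache(capacity=3)
for _ in range(3):                          # three full scans over pages 0..3
    for page in range(4):
        cache.get(page, load=lambda k: f"page-{k}")
```

On this access pattern the MRU sketch ends with 6 hits out of 12 accesses. An LRU cache of the same size would get zero hits, because it always evicts exactly the page the scan needs next.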

- Least Recently Used (LRU): possibly the most commonly seen cache algo in the wild. LRU evicts the item that was accessed the longest time ago, on the assumption that if it hasn’t been used in a while then it’s less likely to be needed soon. Downside: LRU only tracks recency, not frequency of use, and has no weighting controls.
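
A minimal LRU sketch using the standard library's `OrderedDict`, where the front of the dict is the least recently used entry. The class and method names are illustrative:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key, default=None):
        if key not in self.data:
            return default
        self.data.move_to_end(key)          # a read refreshes recency
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict the least recently used entry

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # "a" is now more recent than "b"
cache.put("c", 3)     # evicts "b"
```

For memoizing pure functions, Python's built-in `functools.lru_cache` decorator provides the same policy without a hand-rolled class.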

- LRU variants (2Q, LRU-K, etc.):
  - 2Q introduces a probationary queue for items seen once and a protected queue for items that are referenced again
  - LRU-K (LRU-2 being the common case) tracks the K-th most recent access to each item, rather than only the last access, and evicts the item whose K-th most recent access is oldest
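
To show how looking at the K-th most recent access changes the eviction decision, here is a simplified LRU-K sketch. It is not the full published algorithm: it scans all entries to find a victim, and it forgets an item's history once the item is evicted. Items with fewer than K recorded accesses are treated as the best eviction candidates. All names are illustrative.

```python
from collections import deque
import itertools

class LRUKCache:
    def __init__(self, capacity, k=2):
        self.capacity = capacity
        self.k = k
        self.data = {}
        self.history = {}                   # key -> deque of its last k access ticks
        self.clock = itertools.count()      # logical time, one tick per access

    def _touch(self, key):
        self.history.setdefault(key, deque(maxlen=self.k)).append(next(self.clock))

    def _victim(self):
        def kth_recent(key):
            h = self.history[key]
            # Fewer than k accesses: treat the k-th access as infinitely old.
            kth = h[0] if len(h) == self.k else -1
            return (kth, h[-1])             # ties fall back to plain recency
        return min(self.data, key=kth_recent)

    def get(self, key, default=None):
        if key not in self.data:
            return default
        self._touch(key)
        return self.data[key]

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            victim = self._victim()
            del self.data[victim]
            del self.history[victim]
        self.data[key] = value
        self._touch(key)

cache = LRUKCache(capacity=2, k=2)
cache.put("a", 1)
cache.get("a")        # "a" now has two recorded accesses
cache.put("b", 2)     # "b" has only one
cache.put("c", 3)     # evicts "b", even though "b" was touched more recently than "a"
```

Plain LRU would have evicted "a" in the last step; LRU-2 keeps it because it has proven itself with a second access.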

- Least Frequently Used (LFU): evicts the item with the lowest access count to better capture long-term hot items vs cold items, another entry in the list of comparatively simple cache algos.
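
A minimal LFU sketch with a per-key access counter. For brevity it scans all counters on eviction; production implementations usually keep frequency buckets to make eviction constant-time. Names are illustrative:

```python
class LFUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}
        self.counts = {}                    # key -> number of accesses so far

    def get(self, key, default=None):
        if key not in self.data:
            return default
        self.counts[key] += 1
        return self.data[key]

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            # Lowest count loses; ties go to the entry that was inserted first.
            victim = min(self.counts, key=self.counts.get)
            del self.data[victim], self.counts[victim]
        self.data[key] = value
        self.counts[key] = self.counts.get(key, 0) + 1

cache = LFUCache(capacity=2)
cache.put("a", 1)
cache.get("a")
cache.get("a")        # "a" is hot
cache.put("b", 2)
cache.put("c", 3)     # evicts "b", the least frequently used
```

Note that the counters in this sketch never decay, so an item that was hot long ago can linger; that is the usual motivation for aging or windowed variants of LFU.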

- Random: evicts an arbitrary item. It's *a* method, but rarely the best method for cache controls.

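- FIFO: evicts items in the order they were inserted, regardless of how often or how recently they were accessed. A circular-buffer type cache policy, similar to the Round-Robin method in load-balancing controls. Very simple, and sometimes KISS is all that's needed.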

- Clock Second Chance (CSCh): An approximation of LRU used in OS kernel page replacement, which arranges pages in a circular buffer like a clock face. Each page carries a reference bit that is set when the page is accessed. When the clock hand reaches a page whose bit is set, it clears the bit and moves on (the "second chance"); the first page found with its bit already clear is evicted. Effectively this is a Round-Robin method with a simple one-bit weighting system.
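
A minimal sketch of the clock sweep, assuming one reference bit per slot that is set on access and cleared when the hand passes over it. New entries start with the bit clear here, which is one of several reasonable choices; the names are illustrative:

```python
class ClockCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = []         # each slot is [key, value, referenced_bit]
        self.index = {}         # key -> slot position
        self.hand = 0

    def get(self, key, default=None):
        pos = self.index.get(key)
        if pos is None:
            return default
        self.slots[pos][2] = True           # mark as recently used
        return self.slots[pos][1]

    def put(self, key, value):
        if key in self.index:
            slot = self.slots[self.index[key]]
            slot[1], slot[2] = value, True
            return
        if len(self.slots) < self.capacity:
            self.index[key] = len(self.slots)
            self.slots.append([key, value, False])
            return
        # Sweep: clear set bits (the second chance) until an unreferenced slot turns up.
        while self.slots[self.hand][2]:
            self.slots[self.hand][2] = False
            self.hand = (self.hand + 1) % self.capacity
        victim_key = self.slots[self.hand][0]
        del self.index[victim_key]
        self.slots[self.hand] = [key, value, False]
        self.index[key] = self.hand
        self.hand = (self.hand + 1) % self.capacity

cache = ClockCache(capacity=3)
for k, v in (("a", 1), ("b", 2), ("c", 3)):
    cache.put(k, v)
cache.get("a")        # sets the reference bit on "a"
cache.put("d", 4)     # the hand skips "a" (clearing its bit) and evicts "b"
```

The sweep always terminates: after at most one full rotation every bit has been cleared, so the next slot under the hand is evictable.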

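- Adaptive Replacement Cache (ARC): a highly regarded algorithm that dynamically balances LRU-style recency against LFU-style frequency. It keeps two resident lists (items seen once recently, and items seen at least twice) plus two "ghost" lists of recently evicted keys, and it uses hits in the ghost lists to tune the split between the two resident lists automatically. A variant of ARC is used by the ZFS filesystem for its DRAM cache, which in turn feeds the second-level L2ARC on fast storage devices, and it appears in various applications which require real-time analysis and automated tuning of the caching system. This is arguably the most advanced of the cache controls described here.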
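
The sketch below follows the general shape of the published ARC algorithm: two resident lists (`t1` for items seen once, `t2` for items seen again), two ghost lists of evicted keys (`b1`, `b2`), and a target size `p` for `t1` that grows on a `b1` ghost hit and shrinks on a `b2` ghost hit. It uses integer division for the adaptation step and assumes `capacity >= 1`; treat it as a teaching aid, not as what ZFS or any other product actually ships.

```python
from collections import OrderedDict

class ARCCache:
    def __init__(self, capacity):
        self.c = capacity
        self.p = 0                                    # target size of t1
        self.t1, self.t2 = OrderedDict(), OrderedDict()   # resident entries
        self.b1, self.b2 = OrderedDict(), OrderedDict()   # ghost keys only

    def _replace(self, hit_in_b2):
        # Evict from t1 if it is over its target, otherwise from t2.
        if self.t1 and (len(self.t1) > self.p or (hit_in_b2 and len(self.t1) == self.p)):
            key, _ = self.t1.popitem(last=False)
            self.b1[key] = None
        else:
            key, _ = self.t2.popitem(last=False)
            self.b2[key] = None

    def get(self, key, default=None):
        if key in self.t1:                            # second access: promote
            self.t2[key] = self.t1.pop(key)
            return self.t2[key]
        if key in self.t2:
            self.t2.move_to_end(key)
            return self.t2[key]
        return default

    def put(self, key, value):
        if key in self.t1 or key in self.t2:
            self.t1.pop(key, None)
            self.t2[key] = value
            self.t2.move_to_end(key)
        elif key in self.b1:                          # recency side was too small
            self.p = min(self.c, self.p + max(len(self.b2) // len(self.b1), 1))
            self._replace(False)
            del self.b1[key]
            self.t2[key] = value
        elif key in self.b2:                          # frequency side was too small
            self.p = max(0, self.p - max(len(self.b1) // len(self.b2), 1))
            self._replace(True)
            del self.b2[key]
            self.t2[key] = value
        else:                                         # brand-new key
            l1 = len(self.t1) + len(self.b1)
            total = l1 + len(self.t2) + len(self.b2)
            if l1 == self.c:
                if len(self.t1) < self.c:
                    self.b1.popitem(last=False)
                    self._replace(False)
                else:
                    self.t1.popitem(last=False)
            elif total >= self.c:
                if total == 2 * self.c:
                    self.b2.popitem(last=False)
                self._replace(False)
            self.t1[key] = value
```

The ghost lists are what make the policy adaptive: a miss on a key that was recently evicted is evidence that the corresponding side of the cache was sized too small, and `p` moves accordingly.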

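- Segmented LRU (S/LRU): Takes the LRU method and splits cache entries into two segments, referred to as "protected vs probationary". New items enter the probationary segment, a hit there promotes the item into the protected segment, and evictions are taken from the probationary segment.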
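
A minimal sketch of the protected vs probationary split, assuming each segment is its own LRU and that entries pushed out of a full protected segment are demoted back to probation rather than dropped. Segment sizes and names are illustrative:

```python
from collections import OrderedDict

class SLRUCache:
    def __init__(self, probation_size, protected_size):
        self.probation_size = probation_size
        self.protected_size = protected_size
        self.probation = OrderedDict()
        self.protected = OrderedDict()

    def _insert_probation(self, key, value):
        self.probation[key] = value
        if len(self.probation) > self.probation_size:
            self.probation.popitem(last=False)    # the only place entries leave the cache

    def get(self, key, default=None):
        if key in self.protected:
            self.protected.move_to_end(key)
            return self.protected[key]
        if key in self.probation:
            value = self.probation.pop(key)       # hit in probation: promote
            self.protected[key] = value
            if len(self.protected) > self.protected_size:
                old_key, old_value = self.protected.popitem(last=False)
                self._insert_probation(old_key, old_value)   # demote, don't drop
            return value
        return default

    def put(self, key, value):
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
        else:
            self.probation.pop(key, None)
            self._insert_probation(key, value)

cache = SLRUCache(probation_size=2, protected_size=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # promotes "a" to the protected segment
cache.put("c", 3)
cache.put("d", 4)     # probation overflows and evicts "b"; "a" is untouched
```
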

- Window TinyLFU (Wt/LFU, more commonly written W-TinyLFU): an advancement upon the standard LFU policy which places a small LRU "window" in front of the main cache and uses "TinyLFU", a compact approximate frequency counter, as its admission policy: an item leaving the window is admitted to the main cache only if it is estimated to be used more frequently than the item it would replace. These actions can be modeled by ARC and improved upon dynamically, whereas the non-ARC methods of Wt/LFU may become stale over time or induce inefficiency which is difficult to change during production use, unless real-time controls are implemented.
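
The window-plus-admission idea can be sketched in a few dozen lines. This version makes several simplifying substitutions that should be read as assumptions: exact counts in a `Counter` stand in for the compact approximate counters that TinyLFU uses, the main region is a plain LRU instead of a segmented one, and aging is done by halving all counts every `sample_size` recorded accesses. All names are illustrative.

```python
from collections import Counter, OrderedDict

class WTinyLFUSketch:
    def __init__(self, window_size, main_size, sample_size=1000):
        self.window_size = window_size
        self.main_size = main_size
        self.window = OrderedDict()     # small LRU that admits everything
        self.main = OrderedDict()       # main region, guarded by the admission check
        self.freq = Counter()           # exact counts, standing in for an approximate sketch
        self.sample_size = sample_size
        self.events = 0

    def _record(self, key):
        self.freq[key] += 1
        self.events += 1
        if self.events >= self.sample_size:     # aging: halve every counter
            self.freq = Counter({k: n // 2 for k, n in self.freq.items() if n // 2})
            self.events = 0

    def get(self, key, default=None):
        self._record(key)                       # misses count toward popularity too
        for region in (self.window, self.main):
            if key in region:
                region.move_to_end(key)
                return region[key]
        return default

    def put(self, key, value):
        if key in self.main:
            self.main[key] = value
            self.main.move_to_end(key)
            return
        self.window[key] = value
        self.window.move_to_end(key)
        if len(self.window) <= self.window_size:
            return
        candidate, cand_value = self.window.popitem(last=False)
        if len(self.main) < self.main_size:
            self.main[candidate] = cand_value
            return
        victim = next(iter(self.main))          # LRU entry of the main region
        if self.freq[candidate] > self.freq[victim]:
            del self.main[victim]
            self.main[candidate] = cand_value
        # Otherwise the candidate is dropped and the victim keeps its place.
```

The admission check is what gives the policy its scan resistance: a burst of one-hit keys passes through the window but cannot displace main-region entries that have accumulated higher counts.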


## Cache Invalidation Strategies
- Common strategies include time-to-live (TTL), where each entry expires a fixed interval after it is written
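
A minimal TTL sketch with lazy invalidation, meaning an expired entry is only removed when something tries to read it. The clock is injectable so the behavior can be exercised without sleeping; the class name and parameters are illustrative:

```python
import time

class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock              # injectable so tests can control time
        self.data = {}                  # key -> (value, expires_at)

    def put(self, key, value):
        self.data[key] = (value, self.clock() + self.ttl)

    def get(self, key, default=None):
        entry = self.data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self.data[key]          # lazy invalidation on read
            return default
        return value

# Usage with a fake clock: the entry is readable until its TTL elapses.
now = [0.0]
cache = TTLCache(ttl_seconds=10, clock=lambda: now[0])
cache.put("greeting", "hello")
fresh = cache.get("greeting")
now[0] = 10.0
stale = cache.get("greeting")
```

Lazy expiry keeps the code simple but lets dead entries occupy memory until they are read; a periodic sweep or a size bound is a common complement.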



## Concerns in Caching Architectures