
desire2020/Locas-Memory


Locas-Memory

A fast-convergence, plug-and-play, expandable FFN-style memory mechanism for LLMs.

Overview

Locas Memory is a lightweight parametric memory mechanism that augments LLM Feed-Forward Network (FFN) layers with external, trainable memory entries. Unlike KV-cache or retrieval-augmented approaches, Locas Memory directly injects structured knowledge into the MLP computation path, enabling:

  • Fast convergence – Memory tensors are extracted from activation statistics in a single forward pass (memorize()), then refined through gradient-based optimization.
  • Plug-and-play – Works with standard HuggingFace Qwen3 models out of the box; no architectural redesign required.
  • Dense export – Memory can be fused back into standard MLP weights via to_dense(), producing a vanilla Qwen3 model that is compatible with any inference engine (e.g., vLLM).
  • Flexible training – Supports Next-Token Prediction (NTP), Self-Distillation (SD), and Reinforcement Learning (GRPO) training paradigms.
  • LoRA baseline – Includes a LoRA adapter baseline for fair comparison.
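The dense export relies on the memory entries being FFN-shaped: each entry can be read as one extra neuron, so fusing amounts to concatenating rows onto the MLP projection matrices. A minimal sketch of that idea, assuming a SwiGLU-style FFN and assuming the key/gate/value components map onto the gate/up/down projections (variable names here are illustrative, not the repository's actual internals):

```python
import torch

# Illustrative dimensions only (not tied to a specific Qwen3 config).
hidden_size, intermediate_size, memory_size = 64, 256, 8

# Frozen MLP projections in a SwiGLU-style FFN:
# gate_proj/up_proj map hidden -> intermediate, down_proj maps back.
W_gate = torch.randn(intermediate_size, hidden_size)
W_up = torch.randn(intermediate_size, hidden_size)
W_down = torch.randn(hidden_size, intermediate_size)

# One memory entry per extra neuron: key/gate/value components,
# each of size hidden_size (matching the (..., hidden_size, 3) layout).
mem_key = torch.randn(memory_size, hidden_size)
mem_gate = torch.randn(memory_size, hidden_size)
mem_value = torch.randn(memory_size, hidden_size)

# "Fusing" then amounts to widening the FFN by memory_size neurons.
W_gate_fused = torch.cat([W_gate, mem_key], dim=0)        # (264, 64)
W_up_fused = torch.cat([W_up, mem_gate], dim=0)           # (264, 64)
W_down_fused = torch.cat([W_down, mem_value.t()], dim=1)  # (64, 264)
```

Because the result is still an ordinary (wider) MLP, the exported model needs no custom code at inference time.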

Architecture

                    ┌───────────────────────────┐
                    │    Qwen3 Decoder Layer    │
                    │                           │
  hidden_states ──► │      Self-Attention       │
                    │             │             │
                    │ Post-Attention LayerNorm  │
                    │             │             │
                    │      ┌──────┴──────┐      │
                    │ ┌────┴────┐   ┌────┴────┐ │
                    │ │Original │   │ Memory  │ │
                    │ │  MLP    │   │  MLP    │ │
                    │ │(frozen) │   │         │ │
                    │ └────┬────┘   └────┬────┘ │
                    │      └──────+──────┘      │
                    │             │             │
                    │          output           │
                    └───────────────────────────┘

Memory tensor shape: (num_layers, batch_size, memory_size, hidden_size, 3) — storing key, gate, and value components per layer.
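The three components can be peeled off the trailing axis. A quick shape check (the dimension values below are placeholders, not a specific Qwen3 config):

```python
import torch

# Placeholder dimensions for illustration.
num_layers, batch_size, memory_size, hidden_size = 28, 1, 32, 1024
memory = torch.zeros(num_layers, batch_size, memory_size, hidden_size, 3)

# Split the trailing axis into the per-entry key/gate/value components.
key, gate, value = memory.unbind(dim=-1)
print(key.shape)  # torch.Size([28, 1, 32, 1024])
```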

Key Operations

| Method | Description |
| --- | --- |
| `model.memorize(input_ids, keep_top=k)` | Extract the top-`k` memory entries from activation importance scores |
| `model.to_dense(memory)` | Fuse memory into the MLP weights, yielding a standard `Qwen3ForCausalLM` |
| `model.compute_nll(input_ids, memory)` | Compute per-token negative log-likelihood with memory |

Project Structure

Locas-Memory/
├── models/
│   └── modeling_qwen3_locas.py        # Core: Qwen3ForCausalLMWithMemory model
├── utils/
│   └── evaluate_mmlu.py               # MMLU benchmark evaluation (NLL-based)
├── launch_pg19_experiment.py           # PG-19 long-document perplexity evaluation
├── launch_locomo_experiments.py        # LoCoMo conversational QA evaluation
├── requirements.txt                    # Python dependencies
└── data/
    ├── pg-19-docs/                     # PG-19 long documents
    └── locomo.json                     # LoCoMo dataset

Installation

pip install -r requirements.txt

Core Dependencies

| Package | Purpose |
| --- | --- |
| `torch` | Core deep learning framework |
| `transformers` | Qwen3 model backbone |
| `peft` | LoRA adapter baseline |
| `datasets` | HuggingFace dataset loading |
| `flash_attn` | Efficient attention computation |

Quick Start

1. Memory Extraction & Dense Export

from transformers import AutoTokenizer

from models.modeling_qwen3_locas import Qwen3ForCausalLMWithMemory

# Load base model and tokenizer
model = Qwen3ForCausalLMWithMemory.from_pretrained("Qwen/Qwen3-0.6B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Extract memory from sample input
input_ids = tokenizer.encode("Your knowledge text here", return_tensors="pt")
memory = model.memorize(input_ids, keep_top=32, memory_init="highest")
# memory shape: (num_layers, 1, 32, hidden_size, 3)
# Optionally continue to refine the memory via backpropagation.

# Fuse memory into a standard dense model (compatible with vLLM, etc.)
dense_model = model.to_dense(memory)
dense_model.save_pretrained("./output/dense_model")

2. PG-19 Perplexity Evaluation

Evaluate long-document language modeling with online memory adaptation:

python launch_pg19_experiment.py \
    --model Qwen/Qwen3-1.7B-Base \
    --memory_width 64 \
    --memory_init highest \
    --loss_function NTP \
    --lr 1e-3 \
    --window_size 1024 \
    --num_gpus 8

Supported loss functions:

  • NTP – Next-Token Prediction
  • SD – Self-Distillation (teacher = frozen base model)
  • MIX – Mix-NTP Distillation
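For reference, a self-distillation objective of the kind described above typically minimizes the KL divergence from the frozen teacher's next-token distribution to the student's. A generic sketch of such a loss, not the repository's exact implementation:

```python
import torch
import torch.nn.functional as F

def self_distill_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over the vocabulary, averaged over the batch."""
    t = temperature
    teacher_log_probs = F.log_softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # F.kl_div expects log-probs as input; log_target=True lets the
    # target also be log-probs. The t*t factor keeps gradient scale
    # comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_log_probs,
                    log_target=True, reduction="batchmean") * t * t

# Sanity check: identical logits give zero loss.
logits = torch.randn(2, 5, 100)  # (batch, seq, vocab)
print(self_distill_loss(logits, logits.clone()).item())  # ~0.0
```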

3. LoCoMo Conversational QA Evaluation

Evaluate long-context conversational question answering on the LoCoMo benchmark:

python launch_locomo_experiments.py \
    --model Qwen/Qwen3-1.7B-Base \
    --memory_type locas \
    --ttt_loss SD \
    --context_mode date_split

Memory types: locas (Locas Memory) or lora (LoRA adapter baseline)

Memory Initialization Strategies

| Strategy | Description |
| --- | --- |
| `highest` | Select neurons with the highest activation magnitude (default) |
| `lowest` | Select neurons with the lowest activation magnitude |
| `random` | Random Gaussian initialization |
| `random_index` | Randomly permute the extracted neuron indices |
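The `highest`/`lowest` strategies presumably rank FFN neurons by an activation-magnitude score and keep the top (or bottom) `keep_top` of them. A toy sketch of that selection, where the mean-absolute-activation scoring is an assumption of this example, not the repository's documented criterion:

```python
import torch

def select_neurons(activations, keep_top, strategy="highest"):
    """activations: (seq_len, intermediate_size) MLP activations of one layer.

    Returns indices of the neurons to extract as memory entries.
    (The `random` strategy is omitted: it initializes memory directly
    from a Gaussian rather than selecting existing neurons.)
    """
    # Score each neuron by its mean absolute activation over the sequence.
    scores = activations.abs().mean(dim=0)
    return scores.topk(keep_top, largest=(strategy == "highest")).indices

acts = torch.randn(128, 1024)
idx = select_neurons(acts, keep_top=32)
print(idx.shape)  # torch.Size([32])
```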

License

Currently released part: MIT License.
Complete project: Copyright 2026 Tencent.
