
desire2020/Locas-Memory


Locas-Memory

A fast-convergence, plug-and-play, expandable FFN-style memory mechanism for LLMs.

Overview

Locas Memory is a lightweight parametric memory mechanism that augments LLM Feed-Forward Network (FFN) layers with external, trainable memory entries. Unlike KV-cache or retrieval-augmented approaches, Locas Memory directly injects structured knowledge into the MLP computation path, enabling:

  • Fast convergence – Memory tensors are extracted from activation statistics in a single forward pass (memorize()), then refined through gradient-based optimization.
  • Plug-and-play – Works with standard HuggingFace Qwen3 models out of the box; no architectural redesign required.
  • Dense export – Memory can be fused back into standard MLP weights via to_dense(), producing a vanilla Qwen3 model that is compatible with any inference engine (e.g., vLLM).
  • Flexible training – Supports Next-Token Prediction (NTP), Self-Distillation (SD), and Reinforcement Learning (GRPO) training paradigms.
  • LoRA baseline – Includes a LoRA adapter baseline for fair comparison.
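The dense export relies on the memory entries being FFN-shaped: each entry can be read as one extra neuron, so fusing amounts to concatenating rows onto the MLP projection matrices. A minimal sketch of that idea, assuming a SwiGLU-style FFN and assuming the key/gate/value components map onto the gate/up/down projections (variable names here are illustrative, not the repository's actual internals):

```python
import torch

# Illustrative dimensions only (not tied to a specific Qwen3 config).
hidden_size, intermediate_size, memory_size = 64, 256, 8

# Frozen MLP projections in a SwiGLU-style FFN:
# gate_proj/up_proj map hidden -> intermediate, down_proj maps back.
W_gate = torch.randn(intermediate_size, hidden_size)
W_up = torch.randn(intermediate_size, hidden_size)
W_down = torch.randn(hidden_size, intermediate_size)

# One memory entry per extra neuron: key/gate/value components,
# each of size hidden_size (matching the (..., hidden_size, 3) layout).
mem_key = torch.randn(memory_size, hidden_size)
mem_gate = torch.randn(memory_size, hidden_size)
mem_value = torch.randn(memory_size, hidden_size)

# "Fusing" then amounts to widening the FFN by memory_size neurons.
W_gate_fused = torch.cat([W_gate, mem_key], dim=0)        # (264, 64)
W_up_fused = torch.cat([W_up, mem_gate], dim=0)           # (264, 64)
W_down_fused = torch.cat([W_down, mem_value.t()], dim=1)  # (64, 264)
```

Because the result is still an ordinary (wider) MLP, the exported model needs no custom code at inference time.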

Architecture

                    ┌───────────────────────────┐
                    │    Qwen3 Decoder Layer    │
                    │                           │
  hidden_states ──► │      Self-Attention       │
                    │             │             │
                    │ Post-Attention LayerNorm  │
                    │             │             │
                    │      ┌──────┴──────┐      │
                    │ ┌────┴────┐   ┌────┴────┐ │
                    │ │Original │   │ Memory  │ │
                    │ │  MLP    │   │  MLP    │ │
                    │ │(frozen) │   │         │ │
                    │ └────┬────┘   └────┬────┘ │
                    │      └──────+──────┘      │
                    │             │             │
                    │          output           │
                    └───────────────────────────┘

Memory tensor shape: (num_layers, batch_size, memory_size, hidden_size, 3) — storing key, gate, and value components per layer.
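The three components can be peeled off the trailing axis. A quick shape check (the dimension values below are placeholders, not a specific Qwen3 config):

```python
import torch

# Placeholder dimensions for illustration.
num_layers, batch_size, memory_size, hidden_size = 28, 1, 32, 1024
memory = torch.zeros(num_layers, batch_size, memory_size, hidden_size, 3)

# Split the trailing axis into the per-entry key/gate/value components.
key, gate, value = memory.unbind(dim=-1)
print(key.shape)  # torch.Size([28, 1, 32, 1024])
```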

Key Operations

| Method | Description |
| --- | --- |
| `model.memorize(input_ids, keep_top=k)` | Extract the top-`k` memory entries from activation importance scores |
| `model.to_dense(memory)` | Fuse memory into the MLP weights, yielding a standard `Qwen3ForCausalLM` |
| `model.compute_nll(input_ids, memory)` | Compute per-token negative log-likelihood with memory |

Project Structure

Locas-Memory/
├── models/
│   └── modeling_qwen3_locas.py        # Core: Qwen3ForCausalLMWithMemory model
├── utils/
│   └── evaluate_mmlu.py               # MMLU benchmark evaluation (NLL-based)
├── launch_pg19_experiment.py           # PG-19 long-document perplexity evaluation
├── launch_locomo_experiments.py        # LoCoMo conversational QA evaluation
├── requirements.txt                    # Python dependencies
└── data/
    ├── pg-19-docs/                     # PG-19 long documents
    └── locomo.json                     # LoCoMo dataset

Installation

pip install -r requirements.txt

Core Dependencies

| Package | Purpose |
| --- | --- |
| `torch` | Core deep learning framework |
| `transformers` | Qwen3 model backbone |
| `peft` | LoRA adapter baseline |
| `datasets` | HuggingFace dataset loading |
| `flash_attn` | Efficient attention computation |

Quick Start

1. Memory Extraction & Dense Export

from transformers import AutoTokenizer

from models.modeling_qwen3_locas import Qwen3ForCausalLMWithMemory

# Load base model and tokenizer
model = Qwen3ForCausalLMWithMemory.from_pretrained("Qwen/Qwen3-0.6B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Extract memory from sample input
input_ids = tokenizer.encode("Your knowledge text here", return_tensors="pt")
memory = model.memorize(input_ids, keep_top=32, memory_init="highest")
# memory shape: (num_layers, 1, 32, hidden_size, 3)
# Optionally continue to refine the memory via backpropagation.

# Fuse memory into a standard dense model (compatible with vLLM, etc.)
dense_model = model.to_dense(memory)
dense_model.save_pretrained("./output/dense_model")

2. PG-19 Perplexity Evaluation

Evaluate long-document language modeling with online memory adaptation:

python launch_pg19_experiment.py \
    --model Qwen/Qwen3-1.7B-Base \
    --memory_width 64 \
    --memory_init highest \
    --loss_function NTP \
    --lr 1e-3 \
    --window_size 1024 \
    --num_gpus 8

Supported loss functions:

  • NTP – Next-Token Prediction
  • SD – Self-Distillation (teacher = frozen base model)
  • MIX – Mix-NTP Distillation
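For reference, a self-distillation objective of the kind described above typically minimizes the KL divergence from the frozen teacher's next-token distribution to the student's. A generic sketch of such a loss, not the repository's exact implementation:

```python
import torch
import torch.nn.functional as F

def self_distill_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over the vocabulary, averaged over the batch."""
    t = temperature
    teacher_log_probs = F.log_softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # F.kl_div expects log-probs as input; log_target=True lets the
    # target also be log-probs. The t*t factor keeps gradient scale
    # comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_log_probs,
                    log_target=True, reduction="batchmean") * t * t

# Sanity check: identical logits give zero loss.
logits = torch.randn(2, 5, 100)  # (batch, seq, vocab)
print(self_distill_loss(logits, logits.clone()).item())  # ~0.0
```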

3. LoCoMo Conversational QA Evaluation

Evaluate long-context conversational question answering on the LoCoMo benchmark:

python launch_locomo_experiments.py \
    --model Qwen/Qwen3-1.7B-Base \
    --memory_type locas \
    --ttt_loss SD \
    --context_mode date_split

Memory types: locas (Locas Memory) or lora (LoRA adapter baseline)

Memory Initialization Strategies

| Strategy | Description |
| --- | --- |
| `highest` | Select neurons with the highest activation magnitude (default) |
| `lowest` | Select neurons with the lowest activation magnitude |
| `random` | Random Gaussian initialization |
| `random_index` | Randomly permute the extracted neuron indices |
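The `highest`/`lowest` strategies presumably rank FFN neurons by an activation-magnitude score and keep the top (or bottom) `keep_top` of them. A toy sketch of that selection, where the mean-absolute-activation scoring is an assumption of this example, not the repository's documented criterion:

```python
import torch

def select_neurons(activations, keep_top, strategy="highest"):
    """activations: (seq_len, intermediate_size) MLP activations of one layer.

    Returns indices of the neurons to extract as memory entries.
    (The `random` strategy is omitted: it initializes memory directly
    from a Gaussian rather than selecting existing neurons.)
    """
    # Score each neuron by its mean absolute activation over the sequence.
    scores = activations.abs().mean(dim=0)
    return scores.topk(keep_top, largest=(strategy == "highest")).indices

acts = torch.randn(128, 1024)
idx = select_neurons(acts, keep_top=32)
print(idx.shape)  # torch.Size([32])
```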

License

Currently released part: MIT License.
Complete project: Copyright 2026 Tencent.
