
Spectra-NSA: Neural Semantic Architecture


⚠️ ALPHA RELEASE: Spectra-NSA is under active development. Breaking changes may occur. Use for research and experimentation.

Advanced neural embedding model combining spectral-entropy attention, Matryoshka representations, and built-in anomaly detection for state-of-the-art semantic similarity tasks.


🎯 Key Features

  • 🧠 Spectral-Entropy Attention - Novel attention mechanism with learnable fractal enrichment
  • 🎭 Matryoshka Embeddings - Multi-scale semantic representations (768→512→256)
  • 🔍 Anomaly Detection - Built-in out-of-distribution detection via a learned anomalous basis
  • 📊 Multiple Size Presets - M456 (458M), M600 (600M), M700 (700M), M1B (1B params)
  • 🎛️ Component Toggles - Flexible ablation studies via CLI flags
  • 📈 Real-time Monitoring - Integrated SOTA comparison and health checks
  • 💾 Auto-backup to Google Drive - Automatic checkpoint backup every 6 saves

📊 Model Sizes & Performance

Model  Parameters  STS-B Target  VRAM   Training Time (A100)  Best For
M456   458M        >0.825        ~35GB  4-6h                  Colab Pro ✅
M600   600M        >0.835        ~37GB  6-8h                  Balanced
M700   700M        >0.840        ~38GB  8-10h                 SOTA target
M1B    1.0B        >0.850        ~55GB  12-16h                Research

Default: M456 - Optimized for Google Colab Pro (A100 40GB)


🚀 Quick Start

Installation

pip install transformers datasets accelerate scipy torch

Training (Default M456)

python anomalous_embedding_ultimate.py --mode train --epochs 3

Expected output:

  • Training time: ~4-6 hours (A100, fp32 debug mode)
  • STS-B score: >0.825
  • Automatic checkpoints every 500 steps
  • Drive backup every 6 checkpoints

Evaluation

python anomalous_embedding_ultimate.py --mode eval --checkpoint checkpoints/best_sts.pt

Embedding Extraction

python anomalous_embedding_ultimate.py \
    --mode extract \
    --checkpoint checkpoints/best_sts.pt \
    --texts "sample text" "another text"
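
The exact output format of --mode extract is defined by the script; assuming it yields one dense vector per input text, two embeddings can be compared with plain cosine similarity. A minimal sketch (random vectors stand in for real extracted embeddings):

import torch
import torch.nn.functional as F

# Stand-ins for two extracted 768-dim embeddings.
emb_a, emb_b = torch.randn(768), torch.randn(768)

# Cosine similarity is the usual metric for semantic embeddings.
score = F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()
print(f"similarity: {score:.4f}")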

πŸŽ›οΈ Advanced Usage

Size Presets

# SOTA target (M700)
python anomalous_embedding_ultimate.py --size M700 --mode train --epochs 3

# Research scale (M1B) - requires 80GB GPU
python anomalous_embedding_ultimate.py --size M1B --mode train --epochs 3

Ablation Studies

# Disable spectral attention
python anomalous_embedding_ultimate.py --no-spectral --mode train --epochs 1

# Disable anchor64 head
python anomalous_embedding_ultimate.py --no-anchor --mode train --epochs 1

# Minimal configuration
python anomalous_embedding_ultimate.py \
    --no-spectral --no-anchor --no-bridge --no-matryoshka \
    --mode train --epochs 1

πŸ—οΈ Architecture Overview

Core Components

  1. CustomEncoder - Transformer backbone with spectral-entropy attention

    • Learnable fractal depth mixing
    • Learnable spectral fusion weights
    • Temperature-annealed contrastive learning
  2. Matryoshka Heads - Multi-scale embeddings

    • Semantic: 768, 512, 256 dimensions
    • Entity: 384, 192, 96 dimensions
    • Progressive nesting for efficient inference (see the sketch after this list)
  3. Anomalous Projection - OOD detection

    • Learned anomalous basis (16 prototypes)
    • Spectral regularization
    • Ranking head for retrieval tasks
  4. LossStack - Multi-objective training

    • InfoNCE (semantic, anchor, fast retrieval)
    • Triplet margin loss
    • Matryoshka angular alignment
    • Bridge loss (semantic↔entity)
    • Spectral entropy regularization
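
The multi-scale property from item 2 means a full 768-dim semantic embedding can be truncated to its leading dimensions and re-normalized, trading a little quality for faster search. A minimal sketch of that idea (illustrative only, not the model's internal head implementation):

import torch
import torch.nn.functional as F

def truncate_matryoshka(emb: torch.Tensor, dim: int) -> torch.Tensor:
    """Keep the leading `dim` dimensions and re-normalize to unit length."""
    return F.normalize(emb[..., :dim], p=2, dim=-1)

full = F.normalize(torch.randn(768), dim=-1)  # stand-in for a real embedding
for d in (768, 512, 256):                     # the semantic scales listed above
    print(d, truncate_matryoshka(full, d).shape)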

Training Features

  • Temperature Scheduling: Cosine decay (0.07→0.05; sketched below)
  • Gradient Accumulation: Effective batch size 64
  • Mixed Precision: FP16 support (currently disabled for debugging)
  • Early Stopping: Configurable patience (default: disabled)
  • Auto-backup: Google Drive sync every 6 checkpoints
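
A minimal sketch of the cosine temperature decay named above (the exact schedule inside the training script may differ):

import math

def contrastive_temperature(step: int, total_steps: int,
                            t_start: float = 0.07, t_end: float = 0.05) -> float:
    """Cosine-anneal the contrastive temperature from t_start to t_end."""
    progress = min(step / max(total_steps, 1), 1.0)
    return t_end + 0.5 * (t_start - t_end) * (1.0 + math.cos(math.pi * progress))

print(contrastive_temperature(0, 36_750))       # 0.07 at the start
print(contrastive_temperature(36_750, 36_750))  # 0.05 at the end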

πŸ“ Project Structure

NSA_2.0/
├── anomalous_embedding_ultimate.py  # Main training script
├── training_monitor.py              # Real-time monitoring & health checks
├── anomalous_eval_suite.py          # Comprehensive evaluation suite
├── colab_training.ipynb             # Google Colab training notebook
├── ULTIMATE_GUIDE.md                # Detailed usage guide
├── DATASET_INFO.md                  # Dataset information
├── GOOGLE_DRIVE_SETUP.md            # Drive setup instructions
└── checkpoints/                     # Saved models (auto-created)

🔧 Configuration

Key parameters in Config dataclass:

# Model Architecture
hidden_size: int = 1024          # M456 default
num_layers: int = 24
num_heads: int = 16
spectral_dim: int = 192
max_length: int = 160            # VRAM optimized

# Training
batch_train: int = 8             # Physical batch size
grad_accum: int = 8              # Effective batch = 64
epochs: int = 3
lr: float = 2e-4
fp16: bool = False               # Debug mode (use fp32)

# Monitoring
save_every: int = 500            # Checkpoint frequency
eval_every: int = 500            # STS evaluation frequency
early_stop_patience: int = 9999  # Disabled (duration by epochs)
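
A rough sanity check on these numbers (a back-of-envelope sketch: the 12·L·H² rule of thumb counts only the attention and FFN blocks, so embeddings and the extra heads account for the remainder of the 458M total):

# Back-of-envelope check derived from the Config values above.
hidden_size, num_layers = 1024, 24
batch_train, grad_accum = 8, 8

print("effective batch:", batch_train * grad_accum)          # 64, as documented
print("backbone params:", 12 * num_layers * hidden_size**2)  # ~302M of the 458M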

📈 Training Metrics

Expected Timeline (M456, 3 epochs, 98k samples)

Step    Epoch  Event
500     0.04   First STS evaluation (~0.68)
3,000   0.24   First Drive backup
12,250  1.00   End of epoch 1 (STS ~0.78)
24,500  2.00   End of epoch 2 (STS ~0.82)
36,750  3.00   Final (STS >0.825)

Total training steps: ~36,750
Drive backups: ~12 (every 3,000 steps)
Checkpoints saved: ~73 (every 500 steps)
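
These totals follow directly from the configuration above (steps here count physical batches of 8, not optimizer updates). A quick arithmetic check:

# Verifying the timeline numbers from the Config values.
samples, batch_train, epochs = 98_000, 8, 3
steps_per_epoch = samples // batch_train   # 12,250
total_steps = steps_per_epoch * epochs     # 36,750
print(total_steps // 500)                  # ~73 checkpoints (save_every = 500)
print(total_steps // 3_000)                # ~12 Drive backups (every 6th save)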


🎯 Performance Targets

M456 (Default)

  • ✅ STS-B > 0.825 (competitive with BGE-base)
  • ✅ Fits in Colab Pro (40GB VRAM)
  • ✅ Training completes in 4-6h (A100)
  • ✅ No OOM errors

M700 (SOTA)

  • ✅ STS-B > 0.840 (competitive with GTE-large)
  • ✅ NDCG@10 > 0.420
  • ✅ Recall@1 > 0.550
  • ✅ Training completes in 8-10h (A100)

πŸ› Troubleshooting

Out of Memory (OOM)

For M456:

# Edit Config in anomalous_embedding_ultimate.py
batch_train = 6        # Reduce from 8
max_length = 128       # Reduce from 160

For M700:

  • Switch to M456 or use 80GB GPU
  • Enable gradient checkpointing (future)

Training Too Slow

Check GPU type:

!nvidia-smi

Should show: A100-SXM4-40GB or A100-80GB

STS Not Improving

  1. Verify all components enabled (no --no-* flags)
  2. Check temperature annealing is active
  3. Monitor loss components (should decrease)
  4. Wait until warmup ends (8% of training, roughly the first 2,940 steps for M456)

Checkpoints Not Backing Up to Drive

Verify Drive is mounted:

import os
print(os.path.exists("/content/drive/MyDrive"))  # Should be True
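
If it prints False, mount Drive first with the standard Colab call:

from google.colab import drive
drive.mount("/content/drive")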

πŸ“ Citation

If you use this work in your research, please cite:

@software{spectra_nsa,
  title = {Spectra-NSA: Neural Semantic Architecture - Advanced Embeddings},
  author = {Your Name},
  year = {2025},
  url = {https://github.com/UnityLoop-official/Spectra-NSA},
  note = {Alpha release - Architecture under active development}
}

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Copyright 2025

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

⚠️ Disclaimer

ALPHA SOFTWARE: This architecture is experimental and under active development.

  • ✅ Use for: Research, experimentation, benchmarking
  • ⚠️ Not recommended for: Production systems without extensive testing
  • 🔄 Breaking changes: May occur between releases
  • 🐛 Known issues: NaN debugging enabled (fp32 mode for stability)

Current Status:

  • Core architecture: Stable
  • Training pipeline: Stable
  • Evaluation suite: Stable
  • Multi-GPU support: In development
  • Gradient checkpointing: Planned
  • Mixed precision (fp16): Disabled for debugging

🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Areas of interest:

  • Multi-GPU training optimization
  • FP16 stability improvements
  • Additional evaluation benchmarks
  • Memory optimization techniques
  • Documentation improvements

📞 Contact

For questions, issues, or collaboration, please open an issue on the repository.


πŸ™ Acknowledgments

  • Hugging Face - Transformers & Datasets libraries
  • PyTorch Team - Deep learning framework
  • Sentence-Transformers - MS MARCO dataset
  • MTEB - Benchmark datasets
  • Google Colab - Training infrastructure

📚 Additional Resources

  • ULTIMATE_GUIDE.md - Detailed usage guide
  • DATASET_INFO.md - Dataset information
  • GOOGLE_DRIVE_SETUP.md - Drive setup instructions

Built with ❀️ for the NLP research community

🔗 Links:

  • Repository: UnityLoop-official/Spectra-NSA
  • Documentation: Coming soon
  • Paper: In preparation

Last updated: November 15, 2025
