SecureBERT 2.0 is Cisco AI's officially released, domain-adapted, encoder-based language model for cybersecurity and threat intelligence. Built on the ModernBERT architecture, it incorporates hierarchical encoding and long-context modeling, enabling it to process complex cybersecurity documents, source code, and threat intelligence reports effectively. Pretrained on a large multi-modal corpus of over 13 billion text tokens and 53 million code tokens, SecureBERT 2.0 achieves state-of-the-art performance in semantic search, named entity recognition, code vulnerability detection, and threat analysis. With this release, Cisco aims to advance cybersecurity and AI research by promoting transparency and collaboration, empowering practitioners, researchers, and organizations to build on this work, accelerate innovation, and strengthen defenses against emerging cyber threats.
- Domain-Specific Pretraining: Extensive cybersecurity corpus, including threat reports, vulnerability advisories, technical blogs, and source code.
- Multi-Modal Understanding: Integrates natural language and code for advanced vulnerability detection and threat intelligence.
- Hierarchical & Long-Context Modeling: Captures both fine-grained and high-level structures across extended documents.
- Optimized for Cybersecurity Tasks:
  - Semantic search and document retrieval
  - Named entity recognition (NER)
  - Code vulnerability detection
  - Threat intelligence analysis
Pretraining Corpus Statistics
| Dataset Category | Code Tokens | Text Tokens |
|---|---|---|
| Seed corpus | 9,406,451 | 256,859,788 |
| Large-scale web text | 268,993 | 12,231,942,693 |
| Reasoning-focused data | -- | 3,229,293 |
| Instruction-tuning data | 61,590 | 2,336,218 |
| Code vulnerability corpus | 2,146,875 | -- |
| Cybersecurity dialogue data | 41,503,749 | 56,871,556 |
| Original baseline dataset | -- | 1,072,798,637 |
| Total | 53,387,658 | 13,623,037,185 |
Masked Language Modeling (MLM) Results
SecureBERT 2.0 demonstrates strong domain-specific understanding:
| Top-n | Objects (Nouns) | Verbs (Actions) | Code Tokens |
|---|---|---|---|
| 1 | 56.20% | 45.02% | 39.27% |
| 5 | 82.72% | 74.12% | 55.41% |
| 10 | 88.80% | 81.64% | 60.03% |
Outperforms general-purpose models in predicting cybersecurity-specific terms and code elements.
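As an illustration of how such a top-n check can be run, here is a minimal sketch using the standard Hugging Face fill-mask pipeline; the prompt and target word are illustrative, and we assume the released checkpoint is compatible with the pipeline.

```python
# Minimal top-n masked-token sketch; assumes the checkpoint works with the
# standard fill-mask pipeline. The prompt and target word are illustrative.
from transformers import pipeline

fill = pipeline("fill-mask", model="cisco-ai/SecureBERT2.0-base")

prompt = "The ransomware encrypted all [MASK] on the server."
target = "files"

preds = fill(prompt, top_k=10)                    # ranked mask candidates
tokens = [p["token_str"].strip() for p in preds]
for n in (1, 5, 10):
    print(f"top-{n}: {'hit' if target in tokens[:n] else 'miss'}")
```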
Cross-Encoder Results
| Model | mAP | R@1 | NDCG@10 | MRR@10 |
|---|---|---|---|---|
| ms-marco-TinyBERT-L2 | 0.920 | 0.849 | 0.964 | 0.955 |
| SecureBERT 2.0 | 0.955 | 0.948 | 0.986 | 0.983 |
Bi-Encoder Results
| Model | mAP | R@1 | MRR@10 |
|---|---|---|---|
| all-MiniLM-L12-v2 | 0.912 | 0.924 | 0.945 |
| SecureBERT 2.0 | 0.951 | 0.984 | 0.989 |
Demonstrates high precision in semantic search and scalable retrieval.
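A common way to combine the two checkpoints is to retrieve candidates cheaply with the bi-encoder and rerank them with the cross-encoder. The sketch below assumes both checkpoints load with the `sentence-transformers` library; the corpus and query are illustrative.

```python
# Retrieve-then-rerank sketch; assumes compatibility with sentence-transformers'
# SentenceTransformer and CrossEncoder loaders. Corpus/query are illustrative.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

bi_encoder = SentenceTransformer("cisco-ai/SecureBERT2.0-biencoder")
reranker = CrossEncoder("cisco-ai/SecureBERT2.0-cross_encoder")

corpus = [
    "Advisory describing a local privilege escalation in the Linux kernel.",
    "Quarterly report on cloud infrastructure spending.",
    "Threat actors used SQL injection to exfiltrate user credentials.",
]
query = "privilege escalation via kernel vulnerability"

# Stage 1: fast dense retrieval with the bi-encoder.
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]

# Stage 2: precise reranking of retrieved candidates with the cross-encoder.
candidates = [corpus[h["corpus_id"]] for h in hits]
scores = reranker.predict([(query, c) for c in candidates])
for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {doc}")
```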
NER Results
| Model | F1 | Recall | Precision |
|---|---|---|---|
| CyBERT | 0.351 | 0.281 | 0.467 |
| SecureBERT | 0.734 | 0.759 | 0.717 |
| SecureBERT 2.0 | 0.945 | 0.965 | 0.927 |
Highly accurate recognition of cybersecurity entities such as Malware, Vulnerability, System, Indicator, and Organization.
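A minimal usage sketch, assuming the NER checkpoint is a standard Hugging Face token-classification model and that entity labels follow the categories above; the input sentence is illustrative.

```python
# NER sketch; assumes a standard token-classification head on the checkpoint.
from transformers import pipeline

ner = pipeline(
    "ner",
    model="cisco-ai/SecureBERT2.0-NER",
    aggregation_strategy="simple",  # merge word pieces into whole entities
)

text = "Emotet was delivered via phishing and exploited CVE-2017-11882 on Windows hosts."
for ent in ner(text):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 3))
```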
Code Vulnerability Detection Results
| Model | Accuracy | F1 | Recall | Precision |
|---|---|---|---|---|
| CodeBERT | 0.627 | 0.372 | 0.241 | 0.821 |
| CyBERT | 0.459 | 0.630 | 1.000 | 0.459 |
| SecureBERT 2.0 | 0.655 | 0.616 | 0.602 | 0.630 |
More balanced detection than prior models: SecureBERT 2.0 attains the highest accuracy, trades CyBERT's near-all-positive behavior (recall 1.000 at precision 0.459) for substantially fewer false positives, and recovers much of the recall that CodeBERT sacrifices for precision.
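A minimal classification sketch, assuming the released checkpoint exposes a standard sequence-classification head; the code snippet is illustrative, and the class names should be read from `model.config.id2label` rather than assumed.

```python
# Vulnerability-classification sketch; assumes a standard sequence-classification
# head. The input snippet is illustrative; read class names from id2label.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "cisco-ai/SecureBERT2.0-code-vuln-detection"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

code = "strcpy(buf, user_input);  /* unbounded copy into a fixed-size buffer */"
inputs = tokenizer(code, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1).squeeze()

for idx, p in enumerate(probs.tolist()):
    print(model.config.id2label[idx], f"{p:.3f}")
```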
**All models are available on Hugging Face.**
| Task | Hugging Face Model ID |
|---|---|
| SecureBERT 2.0 | cisco-ai/SecureBERT2.0-base |
| Cross Encoder | cisco-ai/SecureBERT2.0-cross_encoder |
| Bi-Encoder | cisco-ai/SecureBERT2.0-biencoder |
| Named Entity Recognition (NER) | cisco-ai/SecureBERT2.0-NER |
| Vulnerability Classification | cisco-ai/SecureBERT2.0-code-vuln-detection |
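Since the checkpoints are assumed to follow standard Hugging Face conventions, loading the base model for token-level embeddings could look like the sketch below; swap in any task-specific ID from the table above.

```python
# Loading sketch; assumes standard Hugging Face AutoModel/AutoTokenizer weights.
from transformers import AutoModel, AutoTokenizer

model_id = "cisco-ai/SecureBERT2.0-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("Lateral movement via PsExec was observed.", return_tensors="pt")
hidden = model(**inputs).last_hidden_state  # one embedding per input token
print(hidden.shape)
```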
This repository provides the full framework for pretraining, fine-tuning, and evaluating SecureBERT 2.0 across key cybersecurity tasks.
Repository Structure
```
.
├── mlm/                          # Model pretraining (Masked Language Modeling)
│   ├── train.py                  # Pretraining script for MLM
│   └── SecureBERT_mlm_eval.py    # MLM evaluation script
├── vuln_classification/          # Code vulnerability detection
│   ├── CodeVuln_train.py         # Fine-tuning SecureBERT for vulnerability detection
│   └── CodeVuln_eval.py          # Evaluation on code vulnerability datasets
├── rt2/ner/                      # Named Entity Recognition (NER) tasks
│   ├── NER_train.py              # Fine-tuning SecureBERT for cybersecurity NER
│   └── NER_eval.py               # Evaluation script for NER models
├── doc_embedding/                # Document embedding tasks
│   ├── BiEncoder_train.py        # Bi-encoder training for semantic search
│   ├── CrossEncoder_train.py     # Cross-encoder training for fine-grained ranking
│   ├── BiEncoder_eval.py         # Bi-encoder evaluation
│   └── CrossEncoder_eval.py      # Cross-encoder evaluation
├── opensource_data/              # Preprocessed datasets
│   ├── data_vuln_dataset.parquet
│   ├── data_vuln_dataset_test.parquet
│   ├── data_NER_train.json
│   ├── data_NER_test.json
│   ├── data_sentence_pairs.parquet
│   ├── data_sentence_pairs_test.parquet
│   └── data_pretrain.parquet
├── dataset.py                    # Dataset loading and preprocessing utilities
├── requirements.txt              # Python dependencies
├── LICENSE
├── README.md
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── SECURITY.md
├── MAINTAINERS.md
└── .gitignore
```
Requirements
- Python 3.10+
- PyTorch 2.1+ with CUDA
- Hugging Face Transformers
- Lightning Fabric
- tqdm
Installation
- Clone the repository:

  ```bash
  git clone https://github.com/cisco-ai-defense/securebert2.git
  ```

- Create a virtual environment (recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install the required Python packages:

  ```bash
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121  # Adjust cu121 for your CUDA version
  pip install transformers lightning tqdm pandas pyarrow
  ```

  Note: Ensure your `torch` installation matches your CUDA version. The example above is for CUDA 12.1.

- Ensure `dataset.py` and the datasets under `opensource_data/` are available.
Usage
Each task directory contains training and evaluation scripts; customize them with your desired model and dataset paths before running.
To pretrain with masked language modeling, move into the `mlm/` directory:

```bash
cd mlm
```

By default, the dataset is set to `ModernBertDataset()` from `dataset.py`, and the model is set to `answerdotai/ModernBERT-base`.

To start training on a single GPU, simply run:

```bash
python train.py
```

For a multi-GPU setup, run:

```bash
torchrun --nproc_per_node=8 train.py
```

For evaluation, provide a list of Hugging Face model IDs along with the evaluation dataset. Below is an example format for the MLM task.
```python
sentences = [
    "The attacker gained access through a [MASK] vulnerability.",
    "Users should always enable [MASK] authentication for better security.",
    "The malicious [MASK] was detected by the intrusion detection system.",
    "The ransomware encrypted all [MASK] on the server.",
    "A strong [MASK] policy helps prevent brute-force attacks."
]
ground_truths = ["software", "multi-factor", "payload", "files", "password"]
model_ids = [
    "cisco-ai/SecureBERT2.0-base",
    "answerdotai/ModernBERT-base",
    "ehsanaghaei/SecureBERT",
]
```

Similar to training, simply run:
```bash
python SecureBERT_mlm_eval.py
```

Contributing
We welcome contributions to improve SecureBERT 2.0, including:
- New datasets and pretraining corpora
- Additional downstream cybersecurity tasks
- Model architecture enhancements
- Optimized evaluation pipelines
Please review CONTRIBUTING.md for guidelines.