
cisco-ai-defense/securebert2

SecureBERT 2.0: Advanced Domain-Specific Language Model for Cybersecurity Intelligence

About The Project

SecureBERT 2.0 is Cisco AI's officially released, domain-adapted encoder-based language model for cybersecurity and threat intelligence. Built on the ModernBERT architecture, it incorporates hierarchical encoding and long-context modeling, enabling effective processing of complex cybersecurity documents, source code, and threat intelligence reports. Pretrained on a massive, multi-modal corpus—including over 13 billion text tokens and 53 million code tokens—SecureBERT 2.0 achieves state-of-the-art performance in semantic search, named entity recognition, code vulnerability detection, and threat analysis. With this release, Cisco aims to advance research in cybersecurity and AI by promoting transparency, enabling collaboration, and empowering practitioners, researchers, and organizations to build upon this work, accelerate innovation, and strengthen defenses against emerging cyber threats.


Key Features

  • Domain-Specific Pretraining: Extensive cybersecurity corpus, including threat reports, vulnerability advisories, technical blogs, and source code.
  • Multi-Modal Understanding: Integrates natural language and code for advanced vulnerability detection and threat intelligence.
  • Hierarchical & Long-Context Modeling: Captures both fine-grained and high-level structures across extended documents.
  • Optimized for Cybersecurity Tasks:
    • Semantic search and document retrieval
    • Named entity recognition (NER)
    • Code vulnerability detection
    • Threat intelligence analysis

Pretraining Dataset

Dataset Category              Code Tokens       Text Tokens
Seed corpus                     9,406,451       256,859,788
Large-scale web text              268,993    12,231,942,693
Reasoning-focused data                 --         3,229,293
Instruction-tuning data            61,590         2,336,218
Code vulnerability corpus       2,146,875                --
Cybersecurity dialogue data    41,503,749        56,871,556
Original baseline dataset              --     1,072,798,637
Total                          53,387,658    13,623,037,185

Masked Language Modeling (MLM) Evaluation

SecureBERT 2.0 demonstrates strong domain-specific understanding:

Top-n    Objects (Nouns)    Verbs (Actions)    Code Tokens
1                 56.20%             45.02%         39.27%
5                 82.72%             74.12%         55.41%
10                88.80%             81.64%         60.03%

Outperforms general-purpose models in predicting cybersecurity-specific terms and code elements.
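
As a quick sanity check of this capability, the base model can be queried through a standard Hugging Face fill-mask pipeline (a minimal sketch; the example sentence is ours and not taken from the evaluation set):

    from transformers import pipeline

    # Load the released base checkpoint as a fill-mask pipeline.
    fill_mask = pipeline("fill-mask", model="cisco-ai/SecureBERT2.0-base")

    # Inspect the top-5 candidates for the masked token.
    for pred in fill_mask("The attacker exploited a [MASK] vulnerability.", top_k=5):
        print(f"{pred['token_str']!r}  score={pred['score']:.3f}")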


Downstream Tasks

1. Document Embedding

Cross-Encoder Results

Model                   mAP      R@1      NDCG@10    MRR@10
ms-marco-TinyBERT-L2    0.920    0.849    0.964      0.955
SecureBERT 2.0          0.955    0.948    0.986      0.983

Bi-Encoder Results

Model                mAP      R@1      MRR@10
all-MiniLM-L12-v2    0.912    0.924    0.945
SecureBERT 2.0       0.951    0.984    0.989

Demonstrates high precision in semantic search and scalable retrieval.
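
For example, the bi-encoder can embed a query and a corpus independently and rank documents by cosine similarity (a minimal sketch assuming the released checkpoint loads with the sentence-transformers library; the query and corpus are illustrative):

    from sentence_transformers import SentenceTransformer, util

    # Assumes the bi-encoder checkpoint is sentence-transformers compatible.
    model = SentenceTransformer("cisco-ai/SecureBERT2.0-biencoder")

    corpus = [
        "The malware establishes persistence via a scheduled task.",
        "The patch addresses a buffer overflow in the SMB driver.",
    ]
    query = "How does the malware maintain persistence?"

    # Embed both sides and rank the corpus by cosine similarity.
    corpus_emb = model.encode(corpus, convert_to_tensor=True)
    query_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, corpus_emb)[0]
    print(corpus[int(scores.argmax())])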


2. Named Entity Recognition (NER)

Model            F1       Recall    Precision
CyBERT           0.351    0.281     0.467
SecureBERT       0.734    0.759     0.717
SecureBERT 2.0   0.945    0.965     0.927

High-accuracy recognition of cybersecurity entities such as Malware, Vulnerability, System, Indicator, and Organization.
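
A typical way to apply the NER checkpoint is through a token-classification pipeline (a sketch using the standard transformers API; the label names in the comment are assumptions based on the entity types listed above):

    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into whole entity spans.
    ner = pipeline(
        "token-classification",
        model="cisco-ai/SecureBERT2.0-NER",
        aggregation_strategy="simple",
    )

    text = "Emotet exploited CVE-2017-11882 to compromise Windows hosts."
    for ent in ner(text):
        # Expected label types include Malware, Vulnerability, and System.
        print(ent["entity_group"], ent["word"], round(float(ent["score"]), 3))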


3. Code Vulnerability Detection

Model            Accuracy    F1       Recall    Precision
CodeBERT         0.627       0.372    0.241     0.821
CyBERT           0.459       0.630    1.000     0.459
SecureBERT 2.0   0.655       0.616    0.602     0.630

Balanced detection performance with higher F1 score and reduced false positives compared to prior models.
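
In practice, the classifier can be applied to a raw code snippet via a text-classification pipeline (a sketch; the exact label names returned by the checkpoint are an assumption and depend on its config):

    from transformers import pipeline

    clf = pipeline(
        "text-classification",
        model="cisco-ai/SecureBERT2.0-code-vuln-detection",
    )

    snippet = """
    char buf[16];
    strcpy(buf, user_input);  /* unbounded copy into a fixed-size buffer */
    """
    # Returns a label (e.g. vulnerable / safe, per the checkpoint config) and a score.
    print(clf(snippet))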

All models are available on Hugging Face.

Hugging Face Model Paths

Task                             Model Path
SecureBERT 2.0                   cisco-ai/SecureBERT2.0-base
Cross-Encoder                    cisco-ai/SecureBERT2.0-cross_encoder
Bi-Encoder                       cisco-ai/SecureBERT2.0-biencoder
Named Entity Recognition (NER)   cisco-ai/SecureBERT2.0-NER
Vulnerability Classification     cisco-ai/SecureBERT2.0-code-vuln-detection
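
The cross-encoder is typically used to rerank a short list of candidate documents against a query (a sketch that assumes the checkpoint is compatible with the sentence-transformers CrossEncoder class):

    from sentence_transformers import CrossEncoder

    # Assumes the released cross-encoder loads via sentence-transformers.
    reranker = CrossEncoder("cisco-ai/SecureBERT2.0-cross_encoder")

    query = "initial access via phishing"
    candidates = [
        "The actor delivered a malicious attachment in a spear-phishing email.",
        "The database was migrated to a new cluster over the weekend.",
    ]

    # Score each (query, document) pair jointly and sort by relevance.
    scores = reranker.predict([(query, doc) for doc in candidates])
    for doc, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
        print(f"{score:.3f}  {doc}")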

Getting Started

This repository provides the full framework for pretraining, fine-tuning, and evaluating SecureBERT 2.0 across key cybersecurity tasks.

Repository Structure

.
├── mlm/                       # Model pretraining (Masked Language Modeling)
│   ├── train.py                # Pretraining script for MLM
│   └── SecureBERT_mlm_eval.py # MLM evaluation script
├── vuln_classification/        # Code vulnerability detection
│   ├── CodeVuln_train.py       # Fine-tuning SecureBERT for vulnerability detection
│   └── CodeVuln_eval.py        # Evaluation on code vulnerability datasets
├── rt2/ner/                    # Named Entity Recognition (NER) tasks
│   ├── NER_train.py            # Fine-tuning SecureBERT for cybersecurity NER
│   └── NER_eval.py             # Evaluation script for NER models
├── doc_embedding/              # Document embedding tasks
│   ├── BiEncoder_train.py      # Bi-encoder training for semantic search
│   ├── CrossEncoder_train.py   # Cross-encoder training for fine-grained ranking
│   ├── BiEncoder_eval.py       # Bi-encoder evaluation
│   └── CrossEncoder_eval.py    # Cross-encoder evaluation
├── opensource_data/            # Preprocessed datasets
│   ├── data_vuln_dataset.parquet
│   ├── data_vuln_dataset_test.parquet
│   ├── data_NER_train.json
│   ├── data_NER_test.json
│   ├── data_sentence_pairs.parquet
│   ├── data_sentence_pairs_test.parquet
│   └── data_pretrain.parquet
├── dataset.py                  # Dataset loading and preprocessing utilities
├── requirements.txt            # Python dependencies
├── LICENSE
├── README.md
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── SECURITY.md
├── MAINTAINERS.md
└── .gitignore

Requirements

  • Python 3.10+
  • PyTorch 2.1+ with CUDA
  • Hugging Face Transformers
  • Lightning Fabric
  • tqdm

Installation

  1. Clone the repository:

    git clone https://github.com/cisco-ai-defense/securebert2.git
  2. Create a virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate # On Windows: `venv\Scripts\activate`
  3. Install the required Python packages:

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 # Adjust cu121 for your CUDA version
    pip install transformers lightning tqdm pandas pyarrow

    Note: Ensure your torch installation matches your CUDA version. The example above is for CUDA 12.1.

  4. Ensure that dataset.py and the datasets under opensource_data/ are available.
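
To sanity-check the environment before training, confirm that PyTorch can see the GPU and that the core libraries import cleanly (a quick check, not part of the repository):

    import torch
    import transformers

    print(torch.__version__, "CUDA available:", torch.cuda.is_available())
    print("transformers", transformers.__version__)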

Train and Evaluate

Each task directory contains train and eval scripts. Customize them with your desired model and dataset paths before running.

cd mlm

By default, the dataset is set to ModernBertDataset() from dataset.py, and the model is set to answerdotai/ModernBERT-base.

To start training on a single GPU, simply run:

python train.py

For multi-GPU training, adjust --nproc_per_node to match your GPU count and run:

torchrun --nproc_per_node=8 train.py
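
Since the requirements include Lightning Fabric, a distributed MLM run typically wraps the model, optimizer, and dataloader with Fabric so that torchrun's processes are coordinated automatically. The sketch below illustrates that pattern on toy data; it is our illustration of the approach, not the repository's actual train.py:

    import torch
    from lightning.fabric import Fabric
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling)

    # Fabric picks up the process group spawned by torchrun.
    fabric = Fabric(accelerator="cuda", devices="auto", strategy="ddp")
    fabric.launch()

    tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
    model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    model, optimizer = fabric.setup(model, optimizer)

    # The MLM collator randomly masks 15% of tokens; the texts are placeholders.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
    texts = ["The ransomware encrypted all files on the server."]
    encodings = [tokenizer(t, truncation=True) for t in texts]
    loader = torch.utils.data.DataLoader(encodings, batch_size=1, collate_fn=collator)
    loader = fabric.setup_dataloaders(loader)

    model.train()
    for batch in loader:
        optimizer.zero_grad()
        loss = model(**batch).loss
        fabric.backward(loss)
        optimizer.step()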

For evaluation, provide a list of Hugging Face model IDs along with the evaluation dataset. Below is an example format for the MLM task.

    sentences = [
        "The attacker gained access through a [MASK] vulnerability.",
        "Users should always enable [MASK] authentication for better security.",
        "The malicious [MASK] was detected by the intrusion detection system.",
        "The ransomware encrypted all [MASK] on the server.",
        "A strong [MASK] policy helps prevent brute-force attacks."
    ]

    ground_truths = ["software", "multi-factor", "payload", "files", "password"]

    model_ids = [
        "cisco-ai/SecureBERT2.0-base",
        "answerdotai/ModernBERT-base",
        "ehsanaghaei/SecureBERT",
    ]
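
Conceptually, the evaluation runs each model's fill-mask head over the sentences and checks whether the ground-truth word appears among the top-n predictions. A simplified sketch of that loop is below (our illustration, not the repository's exact script; multi-word answers such as "multi-factor" tokenize into several pieces and need extra handling):

    from transformers import pipeline

    def top_n_accuracy(model_id, sentences, ground_truths, n=5):
        """Fraction of sentences whose ground truth is in the top-n [MASK] predictions."""
        fill_mask = pipeline("fill-mask", model=model_id)
        hits = 0
        for sentence, truth in zip(sentences, ground_truths):
            preds = [p["token_str"].strip() for p in fill_mask(sentence, top_k=n)]
            hits += truth in preds
        return hits / len(sentences)

    for model_id in model_ids:
        print(model_id, top_n_accuracy(model_id, sentences, ground_truths))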

Similar to training, simply run:

python SecureBERT_mlm_eval.py

Contribution

We welcome contributions to improve SecureBERT 2.0, including:

  • New datasets and pretraining corpora
  • Additional downstream cybersecurity tasks
  • Model architecture enhancements
  • Optimized evaluation pipelines

Please review CONTRIBUTING.md for guidelines.
