
cisco-ai-defense/securebert2

SecureBERT 2.0: Advanced Domain-Specific Language Model for Cybersecurity Intelligence

About The Project

SecureBERT 2.0 is Cisco AI's officially released, domain-adapted encoder-based language model for cybersecurity and threat intelligence. Built on the ModernBERT architecture, it incorporates hierarchical encoding and long-context modeling, enabling effective processing of complex cybersecurity documents, source code, and threat intelligence reports. Pretrained on a massive, multi-modal corpus—including over 13 billion text tokens and 53 million code tokens—SecureBERT 2.0 achieves state-of-the-art performance in semantic search, named entity recognition, code vulnerability detection, and threat analysis. With this release, Cisco aims to advance research in cybersecurity and AI by promoting transparency, enabling collaboration, and empowering practitioners, researchers, and organizations to build upon this work, accelerate innovation, and strengthen defenses against emerging cyber threats.


Key Features

  • Domain-Specific Pretraining: Extensive cybersecurity corpus, including threat reports, vulnerability advisories, technical blogs, and source code.
  • Multi-Modal Understanding: Integrates natural language and code for advanced vulnerability detection and threat intelligence.
  • Hierarchical & Long-Context Modeling: Captures both fine-grained and high-level structures across extended documents.
  • Optimized for Cybersecurity Tasks:
    • Semantic search and document retrieval
    • Named entity recognition (NER)
    • Code vulnerability detection
    • Threat intelligence analysis

Pretraining Dataset

Dataset Category              Code Tokens       Text Tokens
Seed corpus                     9,406,451       256,859,788
Large-scale web text              268,993    12,231,942,693
Reasoning-focused data                 --         3,229,293
Instruction-tuning data            61,590         2,336,218
Code vulnerability corpus       2,146,875                --
Cybersecurity dialogue data    41,503,749        56,871,556
Original baseline dataset              --     1,072,798,637
Total                          53,387,658    13,623,037,185

Masked Language Modeling (MLM) Evaluation

SecureBERT 2.0 demonstrates strong domain-specific understanding:

Top-n    Objects (Nouns)    Verbs (Actions)    Code Tokens
1                 56.20%             45.02%         39.27%
5                 82.72%             74.12%         55.41%
10                88.80%             81.64%         60.03%

Outperforms general-purpose models in predicting cybersecurity-specific terms and code elements.
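
As a quick sanity check of this capability, the base model can be queried through a standard Hugging Face fill-mask pipeline (a minimal sketch; the example sentence is ours and not taken from the evaluation set):

    from transformers import pipeline

    # Load the released base checkpoint as a fill-mask pipeline.
    fill_mask = pipeline("fill-mask", model="cisco-ai/SecureBERT2.0-base")

    # Inspect the top-5 candidates for the masked token.
    for pred in fill_mask("The attacker exploited a [MASK] vulnerability.", top_k=5):
        print(f"{pred['token_str']!r}  score={pred['score']:.3f}")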


Downstream Tasks

1. Document Embedding

Cross-Encoder Results

Model                   mAP      R@1      NDCG@10    MRR@10
ms-marco-TinyBERT-L2    0.920    0.849    0.964      0.955
SecureBERT 2.0          0.955    0.948    0.986      0.983

Bi-Encoder Results

Model                mAP      R@1      MRR@10
all-MiniLM-L12-v2    0.912    0.924    0.945
SecureBERT 2.0       0.951    0.984    0.989

Demonstrates high precision in semantic search and scalable retrieval.
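
For example, the bi-encoder can embed a query and a corpus independently and rank documents by cosine similarity (a minimal sketch assuming the released checkpoint loads with the sentence-transformers library; the query and corpus are illustrative):

    from sentence_transformers import SentenceTransformer, util

    # Assumes the bi-encoder checkpoint is sentence-transformers compatible.
    model = SentenceTransformer("cisco-ai/SecureBERT2.0-biencoder")

    corpus = [
        "The malware establishes persistence via a scheduled task.",
        "The patch addresses a buffer overflow in the SMB driver.",
    ]
    query = "How does the malware maintain persistence?"

    # Embed both sides and rank the corpus by cosine similarity.
    corpus_emb = model.encode(corpus, convert_to_tensor=True)
    query_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, corpus_emb)[0]
    print(corpus[int(scores.argmax())])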


2. Named Entity Recognition (NER)

Model            F1       Recall    Precision
CyBERT           0.351    0.281     0.467
SecureBERT       0.734    0.759     0.717
SecureBERT 2.0   0.945    0.965     0.927

High-accuracy recognition of cybersecurity entities such as Malware, Vulnerability, System, Indicator, and Organization.
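
A typical way to apply the NER checkpoint is through a token-classification pipeline (a sketch using the standard transformers API; the label names in the comment are assumptions based on the entity types listed above):

    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into whole entity spans.
    ner = pipeline(
        "token-classification",
        model="cisco-ai/SecureBERT2.0-NER",
        aggregation_strategy="simple",
    )

    text = "Emotet exploited CVE-2017-11882 to compromise Windows hosts."
    for ent in ner(text):
        # Expected label types include Malware, Vulnerability, and System.
        print(ent["entity_group"], ent["word"], round(float(ent["score"]), 3))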


3. Code Vulnerability Detection

Model            Accuracy    F1       Recall    Precision
CodeBERT         0.627       0.372    0.241     0.821
CyBERT           0.459       0.630    1.000     0.459
SecureBERT 2.0   0.655       0.616    0.602     0.630

Balanced detection performance with higher F1 score and reduced false positives compared to prior models.
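
In practice, the classifier can be applied to a raw code snippet via a text-classification pipeline (a sketch; the exact label names returned by the checkpoint are an assumption and depend on its config):

    from transformers import pipeline

    clf = pipeline(
        "text-classification",
        model="cisco-ai/SecureBERT2.0-code-vuln-detection",
    )

    snippet = """
    char buf[16];
    strcpy(buf, user_input);  /* unbounded copy into a fixed-size buffer */
    """
    # Returns a label (e.g. vulnerable / safe, per the checkpoint config) and a score.
    print(clf(snippet))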

All models are available on Hugging Face.

Hugging Face Model Paths

Task                             Model Path
SecureBERT 2.0                   cisco-ai/SecureBERT2.0-base
Cross-Encoder                    cisco-ai/SecureBERT2.0-cross_encoder
Bi-Encoder                       cisco-ai/SecureBERT2.0-biencoder
Named Entity Recognition (NER)   cisco-ai/SecureBERT2.0-NER
Vulnerability Classification     cisco-ai/SecureBERT2.0-code-vuln-detection
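
The cross-encoder is typically used to rerank a short list of candidate documents against a query (a sketch that assumes the checkpoint is compatible with the sentence-transformers CrossEncoder class):

    from sentence_transformers import CrossEncoder

    # Assumes the released cross-encoder loads via sentence-transformers.
    reranker = CrossEncoder("cisco-ai/SecureBERT2.0-cross_encoder")

    query = "initial access via phishing"
    candidates = [
        "The actor delivered a malicious attachment in a spear-phishing email.",
        "The database was migrated to a new cluster over the weekend.",
    ]

    # Score each (query, document) pair jointly and sort by relevance.
    scores = reranker.predict([(query, doc) for doc in candidates])
    for doc, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
        print(f"{score:.3f}  {doc}")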

Getting Started

This repository provides the full framework for pretraining, fine-tuning, and evaluating SecureBERT 2.0 across key cybersecurity tasks.

Repository Structure

.
├── mlm/                       # Model pretraining (Masked Language Modeling)
│   ├── train.py                # Pretraining script for MLM
│   └── SecureBERT_mlm_eval.py # MLM evaluation script
├── vuln_classification/        # Code vulnerability detection
│   ├── CodeVuln_train.py       # Fine-tuning SecureBERT for vulnerability detection
│   └── CodeVuln_eval.py        # Evaluation on code vulnerability datasets
├── rt2/ner/                    # Named Entity Recognition (NER) tasks
│   ├── NER_train.py            # Fine-tuning SecureBERT for cybersecurity NER
│   └── NER_eval.py             # Evaluation script for NER models
├── doc_embedding/              # Document embedding tasks
│   ├── BiEncoder_train.py      # Bi-encoder training for semantic search
│   ├── CrossEncoder_train.py   # Cross-encoder training for fine-grained ranking
│   ├── BiEncoder_eval.py       # Bi-encoder evaluation
│   └── CrossEncoder_eval.py    # Cross-encoder evaluation
├── opensource_data/            # Preprocessed datasets
│   ├── data_vuln_dataset.parquet
│   ├── data_vuln_dataset_test.parquet
│   ├── data_NER_train.json
│   ├── data_NER_test.json
│   ├── data_sentence_pairs.parquet
│   ├── data_sentence_pairs_test.parquet
│   └── data_pretrain.parquet
├── dataset.py                  # Dataset loading and preprocessing utilities
├── requirements.txt            # Python dependencies
├── LICENSE
├── README.md
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── SECURITY.md
├── MAINTAINERS.md
└── .gitignore

Requirements

  • Python 3.10+
  • PyTorch 2.1+ with CUDA
  • Hugging Face Transformers
  • Lightning Fabric
  • tqdm

Installation

  1. Clone the repository:

    git clone https://github.com/cisco-ai-defense/securebert2.git
  2. Create a virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate # On Windows: `venv\Scripts\activate`
  3. Install the required Python packages:

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 # Adjust cu121 for your CUDA version
    pip install transformers lightning tqdm pandas pyarrow

    Note: Ensure your torch installation matches your CUDA version. The example above is for CUDA 12.1.

  4. Ensure that dataset.py and the datasets under opensource_data/ are available.
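
To sanity-check the environment before training, confirm that PyTorch can see the GPU and that the core libraries import cleanly (a quick check, not part of the repository):

    import torch
    import transformers

    print(torch.__version__, "CUDA available:", torch.cuda.is_available())
    print("transformers", transformers.__version__)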

Train and Evaluate

Each task directory contains train and eval scripts. Customize them with your desired model and dataset paths before running.

cd mlm

By default, the dataset is set to ModernBertDataset() from dataset.py, and the model is set to answerdotai/ModernBERT-base.

To start training on a single GPU, simply run:

python train.py

For multi-GPU training, adjust --nproc_per_node to match your GPU count and run:

torchrun --nproc_per_node=8 train.py
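
Since the requirements include Lightning Fabric, a distributed MLM run typically wraps the model, optimizer, and dataloader with Fabric so that torchrun's processes are coordinated automatically. The sketch below illustrates that pattern on toy data; it is our illustration of the approach, not the repository's actual train.py:

    import torch
    from lightning.fabric import Fabric
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling)

    # Fabric picks up the process group spawned by torchrun.
    fabric = Fabric(accelerator="cuda", devices="auto", strategy="ddp")
    fabric.launch()

    tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
    model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    model, optimizer = fabric.setup(model, optimizer)

    # The MLM collator randomly masks 15% of tokens; the texts are placeholders.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
    texts = ["The ransomware encrypted all files on the server."]
    encodings = [tokenizer(t, truncation=True) for t in texts]
    loader = torch.utils.data.DataLoader(encodings, batch_size=1, collate_fn=collator)
    loader = fabric.setup_dataloaders(loader)

    model.train()
    for batch in loader:
        optimizer.zero_grad()
        loss = model(**batch).loss
        fabric.backward(loss)
        optimizer.step()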

For evaluation, provide a list of Hugging Face model IDs along with the evaluation dataset. Below is an example format for the MLM task.

    sentences = [
        "The attacker gained access through a [MASK] vulnerability.",
        "Users should always enable [MASK] authentication for better security.",
        "The malicious [MASK] was detected by the intrusion detection system.",
        "The ransomware encrypted all [MASK] on the server.",
        "A strong [MASK] policy helps prevent brute-force attacks."
    ]

    ground_truths = ["software", "multi-factor", "payload", "files", "password"]

    model_ids = [
        "cisco-ai/SecureBERT2.0-base",
        "answerdotai/ModernBERT-base",
        "ehsanaghaei/SecureBERT",
    ]
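
Conceptually, the evaluation runs each model's fill-mask head over the sentences and checks whether the ground-truth word appears among the top-n predictions. A simplified sketch of that loop is below (our illustration, not the repository's exact script; multi-word answers such as "multi-factor" tokenize into several pieces and need extra handling):

    from transformers import pipeline

    def top_n_accuracy(model_id, sentences, ground_truths, n=5):
        """Fraction of sentences whose ground truth is in the top-n [MASK] predictions."""
        fill_mask = pipeline("fill-mask", model=model_id)
        hits = 0
        for sentence, truth in zip(sentences, ground_truths):
            preds = [p["token_str"].strip() for p in fill_mask(sentence, top_k=n)]
            hits += truth in preds
        return hits / len(sentences)

    for model_id in model_ids:
        print(model_id, top_n_accuracy(model_id, sentences, ground_truths))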

Similar to training, simply run:

python SecureBERT_mlm_eval.py

Contribution

We welcome contributions to improve SecureBERT 2.0, including:

  • New datasets and pretraining corpora
  • Additional downstream cybersecurity tasks
  • Model architecture enhancements
  • Optimized evaluation pipelines

Please review CONTRIBUTING.md for guidelines.
