# Datasheet for the NSL-KDD Dataset

## 1. Motivation

- **Purpose:**  
  The NSL-KDD dataset is a refined version of the original KDD Cup 99 dataset, designed to evaluate and benchmark network intrusion detection systems. Its purpose is to provide a standardized dataset for training and evaluating machine learning models in the cybersecurity domain.

- **Intended Uses:**  
  - Training and evaluating machine learning models for intrusion detection.  
  - Research in anomaly detection, cybersecurity, and network traffic analysis.

- **Beneficiaries:**  
  - Researchers and developers in the field of cybersecurity.  
  - Organizations aiming to improve their network intrusion detection systems.

## 2. Composition

- **Data Description:**  
  The dataset comprises network connection records with various features describing the behavior of the connection. Each record is labeled as either "normal" or as one of several types of attacks.  
  - **Features:**  
    - **Numeric Features:** e.g., `duration`, `src_bytes`, `dst_bytes`, and several others.  
    - **Categorical Features:**  
      - `protocol_type`: 3 unique values (e.g., tcp, udp, icmp)  
      - `service`: 70 unique values (e.g., http, ftp, smtp, etc.)  
      - `flag`: 11 unique values (e.g., SF, S0, REJ, etc.)
  - **Label:** Binary indicator, derived from the original per-record label (either "normal" or a named attack type), where:
    - `0` denotes a normal (benign) connection.
    - `1` denotes an attack (malicious connection).

- **Size and Structure:**  
  - **Training Set:** Approximately 125,973 records with 42 columns (including a "difficulty" column, which is typically dropped during preprocessing).  
  - **Test Set:** Approximately 22,543 records with 41 columns (with proper alignment after adjustments).

- **Data Format:**  
  - The data is stored in CSV files with each row representing a network connection and columns representing the features and label.
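
As a minimal sketch of how such a file might be read and its labels collapsed to the binary indicator described above, the snippet below parses a few made-up rows with Python's standard `csv` module. The column subset, the `label` and `target` names, and the sample values are assumptions for illustration only; the actual files carry the full feature set and may not include a header row.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt: a subset of columns with made-up values, not real records.
SAMPLE_CSV = """duration,protocol_type,service,flag,src_bytes,dst_bytes,label
0,tcp,http,SF,181,5450,normal
0,tcp,private,S0,0,0,neptune
0,icmp,ecr_i,SF,1032,0,smurf
2,udp,domain_u,SF,45,44,normal
"""


def to_binary(label):
    """Map a textual label to 0 (normal) or 1 (any attack type)."""
    return 0 if label.strip().lower() == "normal" else 1


def load_records(text):
    """Parse CSV text into dicts and add a binary `target` field to each."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        row["target"] = to_binary(row["label"])
        records.append(row)
    return records


records = load_records(SAMPLE_CSV)

# Counting the binary targets gives a quick view of the class balance.
class_counts = Counter(r["target"] for r in records)
print(class_counts)
```

For a real file, `io.StringIO(text)` would be replaced by an open file handle, and column names would need to be supplied explicitly if the file has no header.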

## 3. Collection Process

- **Methodology:**  
  - The NSL-KDD dataset was derived from the original KDD Cup 99 dataset by removing redundant (duplicate) records, which otherwise bias learners toward the most frequent records and inflate evaluation results.
  - The records represent simulated network traffic and known attack patterns, making the dataset a useful benchmark for intrusion detection research.

- **Timeframe and Geography:**  
  - The original KDD Cup 99 data derives from the 1998 DARPA intrusion detection evaluation, in which traffic was simulated at MIT Lincoln Laboratory. NSL-KDD was released in 2009 to address shortcomings identified in that data.
  
- **Inclusion/Exclusion Criteria:**  
  - The dataset includes a curated set of network connection records to ensure a diverse representation of both normal and attack instances.
  - The "difficulty" column in the training set indicates how challenging a record is to classify; however, this column is typically dropped during model training.
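
The redundancy removal mentioned under Methodology can be illustrated with a small order-preserving deduplication. This is only a sketch of the general idea, not the procedure the NSL-KDD authors actually ran; the function name `drop_duplicates` and the sample tuples are made up for the example.

```python
def drop_duplicates(rows):
    """Keep the first occurrence of each record, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # tuples are hashable, so they can be stored in a set
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical records (subset of fields, made-up values).
rows = [
    ("0", "tcp", "http", "SF", "181", "5450", "normal"),
    ("0", "tcp", "private", "S0", "0", "0", "neptune"),
    ("0", "tcp", "private", "S0", "0", "0", "neptune"),  # exact duplicate
    ("0", "tcp", "http", "SF", "181", "5450", "normal"),  # exact duplicate
]

deduped = drop_duplicates(rows)
print(len(rows), "->", len(deduped))
```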

## 4. Preprocessing/Cleaning

- **Processing Steps:**  
  - **Dropping the 'difficulty' Column:**  
    The training set contains an extra "difficulty" column which is not used for modeling and is removed.
  - **Handling Categorical Features:**  
    - The categorical features (`protocol_type`, `service`, and `flag`) are one-hot encoded.
  - **Feature Scaling:**  
    - Numeric features are standardized to zero mean and unit variance (e.g., using scikit-learn's `StandardScaler`), which helps models that are sensitive to feature scale.
  - **Alignment:**  
    - After one-hot encoding, the training and test sets are aligned so that both have identical feature columns in the same order. This matters because a category that appears in only one split would otherwise produce a column that the other split lacks.

- **Known Issues:**  
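  - Class imbalance: Under the binary labeling, normal and attack records are not evenly balanced, and the imbalance is considerably more pronounced across individual attack types, some of which are represented by very few records.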
  - Historical Bias: As the data is based on network traffic from the 1990s, it might not reflect modern attack vectors or network behaviors.
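
The processing steps above can be sketched in plain Python to make each one explicit. This is an illustrative outline under stated assumptions, not the exact pipeline used with this dataset: the rows are made up, `NUMERIC` lists only a subset of the numeric features, and the helper names (`fit_categories`, `fit_scaler`, `transform`) are hypothetical. Categories and scaling statistics are learned from the training rows only, so test-only categories simply encode as all zeros and both sets end up with the same columns.

```python
from statistics import mean, pstdev

CATEGORICAL = ["protocol_type", "service", "flag"]
NUMERIC = ["duration", "src_bytes", "dst_bytes"]  # assumed subset of numeric features

# Made-up rows for illustration; only the training rows carry "difficulty".
train_rows = [
    {"duration": 0, "protocol_type": "tcp", "service": "http", "flag": "SF",
     "src_bytes": 181, "dst_bytes": 5450, "difficulty": 20},
    {"duration": 0, "protocol_type": "tcp", "service": "private", "flag": "S0",
     "src_bytes": 0, "dst_bytes": 0, "difficulty": 19},
    {"duration": 2, "protocol_type": "udp", "service": "domain_u", "flag": "SF",
     "src_bytes": 45, "dst_bytes": 44, "difficulty": 21},
]
test_rows = [
    {"duration": 0, "protocol_type": "icmp", "service": "ecr_i", "flag": "SF",
     "src_bytes": 1032, "dst_bytes": 0},
    {"duration": 1, "protocol_type": "tcp", "service": "http", "flag": "REJ",
     "src_bytes": 0, "dst_bytes": 0},
]


def fit_categories(rows):
    """Collect the sorted values of each categorical feature."""
    return {col: sorted({r[col] for r in rows}) for col in CATEGORICAL}


def fit_scaler(rows):
    """Compute (mean, population std) per numeric feature; std of 0 becomes 1."""
    stats = {}
    for col in NUMERIC:
        values = [float(r[col]) for r in rows]
        std = pstdev(values)
        stats[col] = (mean(values), std if std > 0 else 1.0)
    return stats


def transform(rows, categories, stats):
    """Standardize numeric features and one-hot encode with a fixed column layout."""
    columns = list(NUMERIC) + [
        f"{col}_{value}" for col in CATEGORICAL for value in categories[col]
    ]
    matrix = []
    for r in rows:
        scaled = [(float(r[col]) - stats[col][0]) / stats[col][1] for col in NUMERIC]
        onehot = [
            1.0 if r[col] == value else 0.0
            for col in CATEGORICAL
            for value in categories[col]
        ]
        matrix.append(scaled + onehot)
    return columns, matrix


# Step 1: drop the "difficulty" column from the training rows.
train_rows = [{k: v for k, v in r.items() if k != "difficulty"} for r in train_rows]

# Steps 2-4: fit on the training rows, then apply the same layout to both sets.
categories = fit_categories(train_rows)
stats = fit_scaler(train_rows)
train_cols, X_train = transform(train_rows, categories, stats)
test_cols, X_test = transform(test_rows, categories, stats)
```

With pandas and scikit-learn available, the same steps are commonly written with `pandas.get_dummies`, `DataFrame.reindex(columns=..., fill_value=0)` for alignment, and `StandardScaler`.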

## 5. Intended Uses and Limitations

- **Primary Use Cases:**  
  - Research and development of intrusion detection systems.
  - Benchmarking new anomaly detection and classification techniques in cybersecurity.

- **Misuse:**  
  - The dataset should not be used as a definitive representation of modern network traffic.
  - Models trained solely on this dataset might not generalize well to current, real-world network conditions without additional, updated data.

- **Licensing and Distribution:**  
  - The NSL-KDD dataset is publicly available for research purposes. Users should review the original licensing information provided with the dataset for details on its use and distribution.

## 6. Ethical Considerations

- **Privacy:**  
  - The dataset does not contain personally identifiable information (PII). It is based on simulated network traffic rather than real user data.
  
- **Bias and Fairness:**  
  - The dataset reflects the network conditions and attack types prevalent during its collection period.  
  - There may be biases in the representation of different attack types, which could affect the generalizability of models trained on this data.

- **Usage Cautions:**  
  - Researchers and practitioners should be cautious when applying models trained on this dataset to modern network traffic. It is advisable to supplement NSL-KDD with more recent data when deploying models in production.

## 7. Additional Information

- **References:**  
  - Tavallaee, M., Bagheri, E., Lu, W., & Ghorbani, A. A. (2009). "A Detailed Analysis of the KDD CUP 99 Data Set." In *IEEE Symposium on Computational Intelligence for Security and Defense Applications*.  
  - [NSL-KDD Dataset GitHub Repository](https://github.com/defcom17/NSL_KDD)

- **Contact Information:**  
  - For further inquiries or clarification regarding the dataset, refer to the dataset documentation provided by its maintainers or the hosting repository.
