<a href="https://colab.research.google.com/github/fjadidi2001/Cyber-Attack-Detection/blob/main/DL4cyber.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Deep Learning Algorithms to Improve Cybersecurity



# Introduction

The **CICIDS2017 dataset** is a comprehensive collection of labeled network traffic data curated by the Canadian Institute for Cybersecurity (CIC). It is widely used for evaluating the performance of intrusion detection systems (IDS) and benchmarking cybersecurity models.

This dataset was created to reflect real-world scenarios by simulating benign and malicious traffic in a realistic network environment, including both modern and legacy protocols. The benign background traffic was generated with the B-Profile system, which emulates the behavior of human users based on statistical distributions. The dataset also captures traffic from a variety of attack types, making it suitable for training and testing both signature-based and anomaly-based IDS models.

Key characteristics of the CICIDS2017 dataset include:

- **Time-stamped flow-based data** collected using CICFlowMeter.
- **Multiple attack scenarios** such as DDoS, brute-force, botnet, infiltration, port scanning, web attacks, and more.
- **Five-day capture** (Monday to Friday): Monday contains only benign traffic, and the remaining days each combine benign traffic with different attack types.
- **Features**: Over 80 network traffic features including flow duration, packet size, header flags, and inter-arrival times.
- **Labeling**: Each traffic flow is labeled as either benign or one of the specific attack types.

The CICIDS2017 dataset is particularly valuable for researchers and developers working on:

- Supervised learning-based IDS
- Unsupervised anomaly detection
- Real-time traffic classification
- Security policy and defense system testing

By offering a well-structured and diverse dataset, CICIDS2017 helps bridge the gap between academic research and practical cybersecurity applications.



Reasons CICIDS2017 is well suited to cybersecurity research:

1. **Realistic traffic**
   - Simulates real-world network traffic, both benign and malicious.
   - Includes user behavior profiles that mimic human interactions with networks.

2. **Diverse attack types**, including:
   - DDoS
   - Brute-force attacks
   - Port scanning
   - Botnet activity
   - Infiltration
   - Web attacks (e.g., SQL injection, XSS)

3. **Labeled and time-stamped**
   - All network flows are labeled, allowing for supervised ML training.
   - The time-stamped structure supports research on real-time detection and temporal analysis.

4. **Rich feature set**
   - Extracted using CICFlowMeter, with 80+ flow features such as flow duration, packet sizes, header flags, and flow direction.
   - Suitable for deep analysis and feature engineering.

5. **Well documented and publicly available**
   - Publicly available for academic and commercial research.
   - Comes with detailed documentation and tools.
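To make the flow features listed above more concrete, here is a minimal sketch of how a few of them (flow duration, packet sizes, inter-arrival times) could be derived from the packets of a single flow. The function `flow_features` and its output keys are hypothetical names chosen for illustration; they are not CICFlowMeter's implementation or its column names.

```python
from statistics import mean


def flow_features(packets):
    """Summarize one flow given (timestamp_seconds, size_bytes) pairs.

    Illustrative only: mirrors the kind of statistics a flow exporter
    reports, not any specific tool's output format.
    """
    if not packets:
        raise ValueError("a flow needs at least one packet")
    packets = sorted(packets)  # order by timestamp
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    # Inter-arrival times are the gaps between consecutive packets
    iats = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "flow_duration": times[-1] - times[0],
        "packet_count": len(packets),
        "mean_packet_size": mean(sizes),
        "mean_iat": mean(iats) if iats else 0.0,
    }


# Toy flow with three packets (values are made up)
toy_flow = [(0.0, 60), (0.5, 1500), (2.0, 40)]
print(flow_features(toy_flow))
```

Each row of the CICIDS2017 CSV files plays the role of one such summary, with many more statistics per flow.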

# Step 1: Set Up the Environment



In [1]:
# Install required libraries
!pip install transformers torch pandas numpy scikit-learn imbalanced-learn


In [12]:
# Import libraries
# Standard library (os is needed later for os.path.join)
import os
import zipfile

# Data handling and modeling
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix
from imblearn.over_sampling import SMOTE

# Plotting and Colab utilities
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive

In [5]:
# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
# Set device (GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


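- Install `transformers` for SecBERT/BERT, `torch` for model training, `pandas` and `numpy` for data handling, `scikit-learn` for data splitting and metrics, and `imbalanced-learn` for SMOTE.

- Mount Google Drive to load the CICIDS2017 dataset (assumed to be stored at /content/drive/MyDrive/network-intrusion-dataset.zip, the path used in Step 2).

- Select the GPU when one is available and fall back to the CPU otherwise; training is considerably faster on a GPU.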
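Since SMOTE is installed here but not explained, the following is a simplified sketch of its core idea: each synthetic minority sample is placed on the line segment between a real minority sample and one of its nearest minority-class neighbours. The function `smote_like_oversample` is a hypothetical name written for illustration with NumPy only; it is not imbalanced-learn's implementation, which should be preferred in practice.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between
    a randomly chosen sample and one of its k nearest minority neighbours."""
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a base sample and one of its neighbours for every new point
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy minority class: four points on the unit square (made-up data)
minority = [[0, 0], [1, 0], [0, 1], [1, 1]]
synthetic = smote_like_oversample(minority, n_new=10)
print(synthetic.shape)
```

Because every synthetic point is an interpolation between two real minority samples, it always lies within the region those samples span.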



# Step 2: Load and Combine the CICIDS2017 Dataset



In [15]:
# Path to the zip file
zip_path = '/content/drive/MyDrive/network-intrusion-dataset.zip'
extract_dir = '/content/cicids2017/'

In [16]:
# Unzip the dataset
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)
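The extraction step above assumes the archive exists and that its CSV files sit directly in the extraction directory, neither of which the notebook shows. As a hedged alternative, this sketch validates the archive and discovers the CSV files by walking the extracted tree. The helper `extract_csvs` is a hypothetical name, and the demo archive is built on the fly purely for illustration.

```python
import os
import tempfile
import zipfile


def extract_csvs(zip_path, extract_dir):
    """Extract an archive and return the sorted paths of all CSV files found."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"not a readable zip archive: {zip_path}")
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(extract_dir)
    found = []
    # Walk the tree so CSVs inside nested folders are picked up too
    for root, _dirs, files in os.walk(extract_dir):
        for name in files:
            if name.lower().endswith(".csv"):
                found.append(os.path.join(root, name))
    return sorted(found)


# Demo with a throwaway archive (file names are made up)
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("nested/a.csv", "x,y\n1,2\n")
        zf.writestr("readme.txt", "not a csv")
    demo_paths = extract_csvs(demo_zip, os.path.join(tmp, "out"))
    demo_names = [os.path.basename(p) for p in demo_paths]
print(demo_names)
```

With the real dataset, the returned list could replace a hard-coded list of file names, at the cost of loading whatever CSV files the archive happens to contain.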

In [9]:
# List of CSV files
csv_files = [
    'Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv',
    'Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv',
    'Friday-WorkingHours-Morning.pcap_ISCX.csv',
    'Monday-WorkingHours.pcap_ISCX.csv',
    'Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv',
    'Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv',
    'Tuesday-WorkingHours.pcap_ISCX.csv',
    'Wednesday-workingHours.pcap_ISCX.csv'
]

# Load and combine CSVs
dfs = []
for file in csv_files:
    file_path = os.path.join(extract_dir, file)
    df = pd.read_csv(file_path)
    dfs.append(df)

# Concatenate all DataFrames
df = pd.concat(dfs, ignore_index=True)

# Display basic information
print("Combined Dataset Shape:", df.shape)
print("Columns:", df.columns)
print("Label Distribution:\n", df[' Label'].value_counts())

# Handle missing values: treat infinities as missing, then drop incomplete rows
df = df.replace([np.inf, -np.inf], np.nan)
df = df.dropna()

# Simplify labels for binary classification: 0 = benign, 1 = malicious.
# Note: the column name ' Label' includes a leading space, matching the usage above.
df[' Label'] = (df[' Label'] != 'BENIGN').astype(int)
print("Binary Label Distribution:\n", df[' Label'].value_counts())

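The cleaning and label-simplification steps can be checked on a small toy DataFrame before running them on the full dataset. This is a minimal sketch: `clean_and_binarize` is a hypothetical helper, the toy values are made up, and the extra `.str.strip()` on the labels is an added assumption (guarding against stray whitespace) that the original cell does not make.

```python
import numpy as np
import pandas as pd


def clean_and_binarize(frame, label_col=" Label", benign="BENIGN"):
    """Return a copy with inf/NaN rows dropped and labels mapped to 0/1."""
    # Treat infinities as missing values, then drop incomplete rows
    out = frame.replace([np.inf, -np.inf], np.nan).dropna().copy()
    # 0 = benign, 1 = any attack type
    out[label_col] = (out[label_col].str.strip() != benign).astype(int)
    return out


# Toy data: one clean benign row, one inf row, one clean attack row, one NaN row
toy = pd.DataFrame({
    " Flow Duration": [10.0, np.inf, 5.0, np.nan],
    " Label": ["BENIGN", "DDoS", "PortScan", "BENIGN"],
})
cleaned = clean_and_binarize(toy)
print(cleaned[" Label"].tolist())  # [0, 1]
```

Only the two rows with finite values survive, so the remaining labels are one benign (0) and one malicious (1).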