## Machine Learning Model for Cyberattack Detection and Classification

**Authors:** Conner Jordan, Matt Perona, Nathan Nawrocki

## Introduction

We aim to develop a machine learning model capable of identifying and classifying various types of cyberattacks using the Incribo synthetic cybersecurity dataset. By leveraging both supervised and unsupervised learning techniques, we will build a system that not only recognizes known attack signatures but also detects anomalies indicative of potential zero-day attacks. This project will enhance our understanding of cybersecurity threats and contribute to the development of more robust defense mechanisms.

## Dataset

The dataset used in this project is the Incribo synthetic cyber dataset from Kaggle, which consists of 40,000 records with 25 columns (metrics). It simulates real-world cyberattack scenarios and includes metrics such as timestamps, IP addresses, ports, protocols, packet lengths, malware indicators, and anomaly scores.

**Link to dataset:** [Incribo Synthetic Cyber Dataset](https://www.kaggle.com/datasets/teamincribo/cyber-security-attacks)

**Key Metrics in the Dataset:**

- Timestamp
- Source IP Address
- Destination IP Address
- Source Port
- Destination Port
- Protocol
- Packet Length
- Packet Type
- Traffic Type
- Payload Data
- Malware Indicators
- Anomaly Scores
- Alerts/Warnings
- Attack Type
- Attack Signature
- Action Taken
- Severity Level
- User Information
- Device Information
- Network Segment
- Geo-location Data
- Proxy Information
- Firewall Logs
- IDS/IPS Alerts
- Log Source

## What We Are Going to Predict

Our goal is to build a system that predicts the type of cyberattack (Attack Type) and identifies anomalies that may indicate zero-day attacks.

## Features We Plan to Use as Predictors

We will use a subset of the provided metrics as predictors. These include:

- Source IP Address
- Destination IP Address
- Source Port
- Destination Port
- Protocol
- Packet Length
- Packet Type
- Traffic Type
- Malware Indicators
- Anomaly Scores
- Severity Level
- Network Segment
- Geo-location Data
- Proxy Information
- Firewall Logs
- IDS/IPS Alerts

## Preliminary Work on Data Preparation

**Data Cleaning:**

- Handling missing values
- Removing duplicate records
- Converting categorical data to numerical format (required for any string-valued column, because scikit-learn estimators accept only numeric input)
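
The cleaning steps above can be sketched on a small hand-made frame. The column names follow the dataset, but the rows are invented for illustration, and reading a blank `Malware Indicators` cell as "nothing detected" is an assumption rather than something the dataset documentation states here.

```python
import pandas as pd

# Toy rows invented for illustration; only the column names come from the dataset.
raw = pd.DataFrame({
    "Protocol": ["TCP", "UDP", "TCP", "ICMP"],
    "Packet Length": [503, 1174, 503, 306],
    "Malware Indicators": ["IoC Detected", None, "IoC Detected", None],
    "Attack Type": ["Malware", "DDoS", "Malware", "Intrusion"],
})

# 1. Remove exact duplicate records (the third row repeats the first).
clean = raw.drop_duplicates().reset_index(drop=True)

# 2. Handle missing values: turn "value present / absent" into a 0/1 flag
#    instead of dropping the row (assumes a blank means "nothing detected").
clean["Malware Indicators"] = clean["Malware Indicators"].notna().astype(int)

# 3. Convert a low-cardinality categorical column into numeric indicator columns.
clean = pd.get_dummies(clean, columns=["Protocol"], dtype=int)

print(clean)
```

Flagging missingness keeps every record, whereas `dropna()` discards any row with at least one blank cell. Which option is appropriate depends on what a blank actually means in each column.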

**Feature Engineering:**

- Creating new features from existing ones (e.g., combining Source IP and Source Port into a single feature)
- Normalizing/standardizing data
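
As one possible alternative to concatenating IP and port into a string, an address can be turned into numeric features directly. The sketch below uses only the standard library; `ip_features` and `standardize` are hypothetical helper names, and the code assumes IPv4 addresses.

```python
import ipaddress
from statistics import mean, pstdev


def ip_features(address: str) -> dict:
    """Derive numeric features from an IPv4 address string (hypothetical helper)."""
    ip = ipaddress.ip_address(address)
    return {
        "ip_as_int": int(ip),              # whole address as one integer
        "first_octet": int(ip) >> 24,      # leading octet (IPv4 only)
        "is_private": int(ip.is_private),  # 1 if the address is in a private range
    }


def standardize(values):
    """Z-score standardization: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


print(ip_features("192.168.1.10"))
print(standardize([100, 200, 300]))
```

Note that `standardize` divides by zero for a constant column, so a real pipeline would need to guard against that case or use a library scaler.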

## Preliminary Work on Data Exploration and Visualization

**Exploratory Data Analysis (EDA):**

- Summary statistics of key metrics
- Distribution plots for numerical features
- Bar charts for categorical features

**Visualization:**

- Heatmap of correlation between features
- Time series analysis of attack occurrences
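
The quantities behind these plots can be computed before any figure is drawn. The following is a minimal sketch on invented rows (only the column names follow the dataset); it prepares the tables that a bar chart, a correlation heatmap, and a time series plot would display.

```python
import pandas as pd

# Toy rows invented for illustration; column names follow the dataset.
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2023-01-05 10:00", "2023-01-20 12:30", "2023-02-11 08:15",
        "2023-02-12 09:45", "2023-02-28 23:10", "2023-03-03 14:00",
    ]),
    "Packet Length": [503, 1174, 306, 385, 1462, 1423],
    "Anomaly Scores": [28.67, 51.50, 87.42, 15.79, 0.52, 5.76],
    "Attack Type": ["Malware", "Malware", "DDoS", "Malware", "DDoS", "Intrusion"],
})

# Summary statistics for the numerical features.
summary = df[["Packet Length", "Anomaly Scores"]].describe()

# Category counts: the data behind a bar chart.
attack_counts = df["Attack Type"].value_counts()

# Pairwise Pearson correlations: the data behind a heatmap.
corr = df[["Packet Length", "Anomaly Scores"]].corr()

# Attacks per calendar month: the data behind a time series plot.
monthly = df.groupby(df["Timestamp"].dt.to_period("M")).size()

print(attack_counts, corr, monthly, sep="\n\n")
```

Passing `corr` to `seaborn.heatmap` and calling `monthly.plot()` would then render the two visualizations listed above.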

## Preliminary Work on Machine Learning to Make Predictions

**Train/Test Split:**

- Splitting the dataset into training (80%) and testing (20%) sets
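
When attack types are unevenly represented, a stratified split keeps the class proportions the same in both sets. The sketch below uses invented labels and placeholder features to show the effect of the `stratify` argument of `train_test_split`; whether the real dataset is imbalanced is not established here.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy labels invented for illustration: 50 Malware, 30 DDoS, 20 Intrusion.
y = ["Malware"] * 50 + ["DDoS"] * 30 + ["Intrusion"] * 20
X = [[i] for i in range(len(y))]  # placeholder single-feature rows

# 80% training, 20% testing, with class proportions preserved in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
print(Counter(y_test))
```

Fixing `random_state` makes the split reproducible across notebook runs.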

**Initial Model Building:**

- Implementing a basic decision tree classifier to predict the Attack Type
- Evaluating model performance using accuracy, precision, recall, and F1-score
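
To make the four metrics concrete, the sketch below computes them by hand for a handful of invented predictions. `precision_recall_f1` is a hypothetical helper written for illustration; in practice `classification_report` from scikit-learn reports the same per-class quantities.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall, and F1, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented labels and predictions for illustration.
y_true = ["DDoS", "DDoS", "Malware", "Malware", "Intrusion", "DDoS"]
y_pred = ["DDoS", "Malware", "Malware", "DDoS", "Intrusion", "DDoS"]

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print("accuracy:", accuracy)
for label in ("DDoS", "Malware", "Intrusion"):
    print(label, precision_recall_f1(y_true, y_pred, label))
```

Reporting precision and recall per class matters because accuracy alone can look acceptable while one attack type is almost never predicted correctly.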

**Anomaly Detection:**

- Using unsupervised learning techniques (e.g., isolation forest) to identify potential zero-day attacks based on anomaly scores and other relevant features
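
A minimal sketch of how an isolation forest separates unusual points, using synthetic numeric data rather than the project dataset. The two features, the placement of the far-away points, and the `contamination` value are all assumptions chosen to make the toy example clear.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic stand-in for two numeric features:
# 200 "typical" points around the origin plus 5 points placed far away.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-8.0, 8.0], [8.0, -8.0], [-8.0, -8.0], [10.0, 0.0]])
X = np.vstack([normal, outliers])

# contamination is the assumed share of anomalies; 5/205 matches the toy data.
detector = IsolationForest(n_estimators=100, contamination=5 / 205, random_state=42)
labels = detector.fit_predict(X)        # -1 = anomaly, 1 = normal
scores = detector.decision_function(X)  # lower = more anomalous

print("flagged rows:", np.where(labels == -1)[0])
print("mean score, typical points:", scores[:200].mean())
print("mean score, far-away points:", scores[200:].mean())
```

On real traffic the share of anomalies is unknown, so `contamination` becomes a tuning choice, and flagged records are candidates for review rather than confirmed zero-day attacks.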


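```python
# Import the libraries used in this notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import IsolationForest

# Load the dataset
data = pd.read_csv('cybersecurity_attacks.csv')

# Data cleaning: drop rows with missing values for simplicity
data = data.dropna()
print(f"{len(data)} rows remain after dropping missing values")

# Feature engineering: combine Source IP and Source Port into a single feature
data['Source_IP_Port'] = data['Source IP Address'].astype(str) + ':' + data['Source Port'].astype(str)

# Select features and target
features = ['Source_IP_Port', 'Destination IP Address', 'Destination Port', 'Protocol', 'Packet Length',
            'Packet Type', 'Traffic Type', 'Malware Indicators', 'Anomaly Scores', 'Severity Level',
            'Network Segment', 'Geo-location Data', 'Proxy Information', 'Firewall Logs', 'IDS/IPS Alerts']
X = data[features]
y = data['Attack Type']

# Train/test split (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Encode string-valued columns as integer codes. scikit-learn estimators need numeric
# input; fitting on the raw strings fails with
# "ValueError: could not convert string to float: '122.187.37.42:61869'".
# The encoder is fit on the training split only; unseen categories map to -1.
# This is a simple baseline encoding, not a tuned representation of IP addresses.
categorical_cols = X.select_dtypes(exclude='number').columns
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
X_train, X_test, X_all = X_train.copy(), X_test.copy(), X.copy()
X_train[categorical_cols] = encoder.fit_transform(X_train[categorical_cols])
X_test[categorical_cols] = encoder.transform(X_test[categorical_cols])
X_all[categorical_cols] = encoder.transform(X_all[categorical_cols])

# Initial model: a basic decision tree classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Evaluation
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Anomaly detection on the encoded feature matrix
anomaly_detector = IsolationForest(random_state=42)
anomaly_detector.fit(X_all)
data['Anomaly_Score'] = anomaly_detector.decision_function(X_all)  # lower = more anomalous
data['Anomaly'] = anomaly_detector.predict(X_all)  # -1 = anomaly, 1 = normal
```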