# Cyber Security Attacks Model

The goal is a model that predicts the type of a cyber attack from user input.

Questions:
- What exactly is the user input? Which fields can a user provide? Presumably every field that ends up in the final dataset.

In [1]:
import pandas as pd
import numpy as np

pd.options.display.max_columns = None

## Data Loading

In [2]:
df = pd.read_csv("../data/cybersecurity_attacks.csv")
df.head(3)

Unnamed: 0,Timestamp,Source IP Address,Destination IP Address,Source Port,Destination Port,Protocol,Packet Length,Packet Type,Traffic Type,Payload Data,Malware Indicators,Anomaly Scores,Alerts/Warnings,Attack Type,Attack Signature,Action Taken,Severity Level,User Information,Device Information,Network Segment,Geo-location Data,Proxy Information,Firewall Logs,IDS/IPS Alerts,Log Source
0,2023-05-30 06:33:58,103.216.15.12,84.9.164.252,31225,17616,ICMP,503,Data,HTTP,Qui natus odio asperiores nam. Optio nobis ius...,IoC Detected,28.67,,Malware,Known Pattern B,Logged,Low,Reyansh Dugal,Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ...,Segment A,"Jamshedpur, Sikkim",150.9.97.135,Log Data,,Server
1,2020-08-26 07:08:30,78.199.217.198,66.191.137.154,17245,48166,ICMP,1174,Data,HTTP,Aperiam quos modi officiis veritatis rem. Omni...,IoC Detected,51.5,,Malware,Known Pattern A,Blocked,Low,Sumer Rana,Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ...,Segment B,"Bilaspur, Nagaland",,Log Data,,Firewall
2,2022-11-13 08:23:25,63.79.210.48,198.219.82.17,16811,53600,UDP,306,Control,HTTP,Perferendis sapiente vitae soluta. Hic delectu...,IoC Detected,87.42,Alert Triggered,DDoS,Known Pattern B,Ignored,Low,Himmat Karpe,Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...,Segment C,"Bokaro, Rajasthan",114.133.48.179,Log Data,Alert Data,Firewall


In [3]:
# data from https://ipapi.is/geolocation.html
# last updated March 10, 2025

df_geolocation = pd.read_csv("../data/geolocationDatabaseIPv4.csv")

# removes leading 0's within octets, e.g. "001.002.003.004" -> "1.2.3.4"
def rm_leading_zero_octet(ip_address):
    return ".".join(str(int(octet)) for octet in ip_address.split("."))

for col in ["start_ip", "end_ip"]:
    df_geolocation[col] = df_geolocation[col].apply(rm_leading_zero_octet)

df_geolocation.head(3)

Unnamed: 0,ip_version,start_ip,end_ip,continent,country_code,country,state,city,zip,timezone,latitude,longitude,accuracy
0,4,175.103.32.0,175.103.32.255,AS,ID,Indonesia,,Tangerang,,Asia/Jakarta,-6.144135,106.723992,2
1,4,1.178.160.0,1.178.175.255,OC,AU,Australia,New South Wales,Sydney,1001.0,Australia/Sydney,-33.823931,151.192832,2
2,4,202.9.90.0,202.9.90.255,AS,TH,Thailand,Khon Kaen,Bangkok,40350.0,Asia/Bangkok,13.738564,100.524805,2
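
The normalisation step in the cell above can be illustrated on its own. This is a minimal sketch that reuses `rm_leading_zero_octet` from the notebook; the zero-padded sample address is made up for illustration and is not taken from the database. It also hints at why the step matters: Python 3.9.5 and later reject octets with leading zeros in `ipaddress.IPv4Address`, so the check is wrapped in `try/except` to behave sensibly on other versions too.

```python
import ipaddress


def rm_leading_zero_octet(ip_address):
    # "001" -> 1 -> "1" for every octet
    return ".".join(str(int(octet)) for octet in ip_address.split("."))


padded = "001.178.160.000"  # hypothetical zero-padded address
normalised = rm_leading_zero_octet(padded)
print(normalised)  # 1.178.160.0

try:
    ipaddress.IPv4Address(padded)
    print("padded form accepted by this Python version")
except ipaddress.AddressValueError:
    print("padded form rejected; normalise first")

# the normalised form always parses
print(int(ipaddress.IPv4Address(normalised)))
```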


## Exploration

### Data Extraction

#### IP Addresses

The 2 IP Address columns can be used to extract more valuable data. According to https://ipinfo.io/blog/ip-address-information, an IP address can yield its location, ISP, network information (the ASN and its type; an ASN identifies a network of IP blocks operated by a single organisation), hostname, the number of domains hosted on the IP, and privacy detection (whether the traffic comes from a VPN or proxy).

Most of this data is behind a paywall, except for the geolocation data. Data such as the ASN and lists of IP addresses known for attacks could still be useful.

In our case, the addresses are compared against a downloaded geolocation database with the help of the `ipaddress` module. It is part of the Python 3 standard library; the PyPI package at https://pypi.org/project/ipaddress/ is a backport for older Python versions.
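
To make the comparison concrete, here is a minimal sketch of the idea: an IPv4 address is converted to a single integer, so checking whether it falls inside a range becomes two integer comparisons. The helper names `ip_to_int` and `in_range` are hypothetical and not part of the notebook. The address is the one looked up later in this notebook, and the range endpoints are the dotted forms of the integer range that the database returns for it.

```python
import ipaddress


def ip_to_int(ip_address):
    # an IPv4 address is a 32-bit number; int() exposes it directly
    return int(ipaddress.IPv4Address(ip_address))


def in_range(ip_address, start_ip, end_ip):
    # inclusive range check on the integer forms
    return ip_to_int(start_ip) <= ip_to_int(ip_address) <= ip_to_int(end_ip)


print(ip_to_int("78.199.217.198"))  # 1321720262
print(in_range("78.199.217.198", "78.192.0.0", "78.255.255.255"))  # True
```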

In [5]:
import ipaddress

# changes all IP addresses in the geolocation database to integers making IP address comparison easier
df_geolocation['start_ip'] = df_geolocation['start_ip'].apply(lambda x: int(ipaddress.IPv4Address(x)))
df_geolocation['end_ip'] = df_geolocation['end_ip'].apply(lambda x: int(ipaddress.IPv4Address(x)))

In [6]:
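class IPData:
    """Looks up geolocation data for a single IPv4 address in df_geolocation."""

    def __init__(self, ip_address):
        # store the address as an integer so it can be compared with start_ip/end_ip
        self.ip_address = int(ipaddress.IPv4Address(ip_address))

    def get_ip_location_data(self):
        # rows whose [start_ip, end_ip] range contains the address (full scan per lookup)
        in_range = (df_geolocation['start_ip'] <= self.ip_address) & (df_geolocation['end_ip'] >= self.ip_address)
        return df_geolocation.loc[in_range]

    @property
    def country(self):
        # country of the first matching range, or None when no range matches
        data = self.get_ip_location_data()
        if data.empty:
            return None
        return data["country"].iloc[0]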

IPData("78.199.217.198").get_ip_location_data()

Unnamed: 0,ip_version,start_ip,end_ip,continent,country_code,country,state,city,zip,timezone,latitude,longitude,accuracy
978145,4,1321205760,1325400063,EU,FR,France,Grand Est,Charleville-Mézières,8800,Europe/Paris,49.7495,4.6095,3


In [7]:
# appends the source and destination countries to the DataFrame and removes the IP address columns
# this takes a few minutes to run ~ 5 mins

df["Source Country"] = df["Source IP Address"].apply(lambda x: IPData(x).country)
df["Destination Country"] = df["Destination IP Address"].apply(lambda x: IPData(x).country)
df = df.drop(columns=["Source IP Address", "Destination IP Address"])
df.head(3)

Unnamed: 0,Timestamp,Source Port,Destination Port,Protocol,Packet Length,Packet Type,Traffic Type,Payload Data,Malware Indicators,Anomaly Scores,Alerts/Warnings,Attack Type,Attack Signature,Action Taken,Severity Level,User Information,Device Information,Network Segment,Geo-location Data,Proxy Information,Firewall Logs,IDS/IPS Alerts,Log Source,Source Country,Destination Country
0,2023-05-30 06:33:58,31225,17616,ICMP,503,Data,HTTP,Qui natus odio asperiores nam. Optio nobis ius...,IoC Detected,28.67,,Malware,Known Pattern B,Logged,Low,Reyansh Dugal,Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ...,Segment A,"Jamshedpur, Sikkim",150.9.97.135,Log Data,,Server,China,United Kingdom
1,2020-08-26 07:08:30,17245,48166,ICMP,1174,Data,HTTP,Aperiam quos modi officiis veritatis rem. Omni...,IoC Detected,51.5,,Malware,Known Pattern A,Blocked,Low,Sumer Rana,Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ...,Segment B,"Bilaspur, Nagaland",,Log Data,,Firewall,France,United States
2,2022-11-13 08:23:25,16811,53600,UDP,306,Control,HTTP,Perferendis sapiente vitae soluta. Hic delectu...,IoC Detected,87.42,Alert Triggered,DDoS,Known Pattern B,Ignored,Low,Himmat Karpe,Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...,Segment C,"Bokaro, Rajasthan",114.133.48.179,Log Data,Alert Data,Firewall,United States,
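
The per-row lookup above scans the whole geolocation table for every address, which is likely why it takes minutes. A possible alternative, sketched here under the assumption that the ranges in the database do not overlap, is to sort the ranges once and locate each address with a binary search via `numpy.searchsorted`. The function `lookup_countries` and the toy table are hypothetical; only the France range is taken from the lookup output in this notebook, and "Testland" is invented.

```python
import ipaddress

import numpy as np
import pandas as pd


def lookup_countries(ips, ranges):
    """Map a Series of IPv4 strings to countries (missing when no range matches).

    Assumes `ranges` has integer start_ip/end_ip columns, at least one row,
    and no overlapping ranges.
    """
    ranges = ranges.sort_values("start_ip").reset_index(drop=True)
    starts = ranges["start_ip"].to_numpy()
    ends = ranges["end_ip"].to_numpy()
    countries = ranges["country"].to_numpy(dtype=object)

    ip_ints = np.array([int(ipaddress.IPv4Address(ip)) for ip in ips], dtype=np.int64)

    # index of the last range whose start is <= the address (-1 if none)
    idx = np.searchsorted(starts, ip_ints, side="right") - 1
    safe_idx = np.clip(idx, 0, None)
    hit = (idx >= 0) & (ip_ints <= ends[safe_idx])

    return pd.Series(np.where(hit, countries[safe_idx], None), index=ips.index)


toy_ranges = pd.DataFrame({
    "start_ip": [int(ipaddress.IPv4Address("78.192.0.0")), int(ipaddress.IPv4Address("10.0.0.0"))],
    "end_ip": [int(ipaddress.IPv4Address("78.255.255.255")), int(ipaddress.IPv4Address("10.0.0.255"))],
    "country": ["France", "Testland"],
})
toy_ips = pd.Series(["78.199.217.198", "10.0.1.1", "1.1.1.1", "10.0.0.5"])
result = lookup_countries(toy_ips, toy_ranges)
print(result)
```

In the notebook this would be called as `lookup_countries(df["Source IP Address"], df_geolocation)` before the IP columns are dropped. If the database does contain overlapping ranges, the result can differ from the `IPData` lookup, so it would be worth comparing both on a sample first.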


#### Device Information

The values in this column are user agent strings, from which we can extract information such as browser, operating system, and device model.

There is a Python package that can parse this data: https://pypi.org/project/user-agents/

In [8]:
from user_agents import parse

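# alternative: split the user agent into separate columns instead of one summary string
# (the OS line is an assumption; the original repeated the Browser line twice)
# df = df.assign(**{"Browser": df["Device Information"].apply(lambda x : parse(x).browser.family)})
# df = df.assign(**{"OS": df["Device Information"].apply(lambda x : parse(x).os.family)})
# df.drop("Device Information", axis=1, inplace=True)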
df["Device Information"] = df["Device Information"].apply(lambda x : str(parse(x)))
df.head(3)

Unnamed: 0,Timestamp,Source Port,Destination Port,Protocol,Packet Length,Packet Type,Traffic Type,Payload Data,Malware Indicators,Anomaly Scores,Alerts/Warnings,Attack Type,Attack Signature,Action Taken,Severity Level,User Information,Device Information,Network Segment,Geo-location Data,Proxy Information,Firewall Logs,IDS/IPS Alerts,Log Source,Source Country,Destination Country
0,2023-05-30 06:33:58,31225,17616,ICMP,503,Data,HTTP,Qui natus odio asperiores nam. Optio nobis ius...,IoC Detected,28.67,,Malware,Known Pattern B,Logged,Low,Reyansh Dugal,PC / Windows 8 / IE 9.0,Segment A,"Jamshedpur, Sikkim",150.9.97.135,Log Data,,Server,China,United Kingdom
1,2020-08-26 07:08:30,17245,48166,ICMP,1174,Data,HTTP,Aperiam quos modi officiis veritatis rem. Omni...,IoC Detected,51.5,,Malware,Known Pattern A,Blocked,Low,Sumer Rana,PC / Windows Vista / IE 8.0,Segment B,"Bilaspur, Nagaland",,Log Data,,Firewall,France,United States
2,2022-11-13 08:23:25,16811,53600,UDP,306,Control,HTTP,Perferendis sapiente vitae soluta. Hic delectu...,IoC Detected,87.42,Alert Triggered,DDoS,Known Pattern B,Ignored,Low,Himmat Karpe,PC / Windows 8 / IE 9.0,Segment C,"Bokaro, Rajasthan",114.133.48.179,Log Data,Alert Data,Firewall,United States,
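
The summary strings in the output above follow the pattern `device / OS / browser`. A minimal sketch of how they could be split into separate columns, assuming every value has exactly these three parts separated by `" / "` (this should be verified on the full column first). The column names `Device`, `OS`, and `Browser` are hypothetical.

```python
import pandas as pd

# sample values copied from the output above
sample = pd.Series(
    ["PC / Windows 8 / IE 9.0", "PC / Windows Vista / IE 8.0"],
    name="Device Information",
)

# split into at most three parts; expand=True returns one column per part
parts = sample.str.split(" / ", n=2, expand=True)
parts.columns = ["Device", "OS", "Browser"]
print(parts)
```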


In [None]:
# find all with PC

### Raw Data

In [10]:
df.columns

Index(['Timestamp', 'Source Port', 'Destination Port', 'Protocol',
       'Packet Length', 'Packet Type', 'Traffic Type', 'Payload Data',
       'Attack Type', 'Attack Signature', 'Action Taken', 'Severity Level',
       'User Information', 'Device Information', 'Network Segment',
       'Geo-location Data', 'Proxy Information', 'Firewall Logs',
       'IDS/IPS Alerts', 'Log Source', 'Source Country',
       'Destination Country'],
      dtype='object')

In [11]:
dups = df.duplicated()
dups[dups == True]

Series([], dtype: bool)

In [12]:
# showing number of unique values
values_unique = df.nunique()
values_unique

Timestamp              39997
Source Port            29761
Destination Port       29895
Protocol                   3
Packet Length           1437
Packet Type                2
Traffic Type               3
Payload Data           40000
Malware Indicators         1
Anomaly Scores          9826
Attack Type                3
Attack Signature           2
Action Taken               3
Severity Level             3
User Information       32389
Device Information     14716
Network Segment            3
Geo-location Data       8723
Proxy Information      20148
Firewall Logs              1
IDS/IPS Alerts             1
Log Source                 2
Source Country           177
Destination Country      169
dtype: int64

In [None]:
# showing all columns with null values and their count
null_count = df.isna().sum()
null_count[null_count > 0]

In [None]:
df.info()

### Data Insights

In [None]:
categorical_possible = values_unique[values_unique < 10]

for col_name, val_count in categorical_possible.items():
    msg = ""
    col_unique_vals = df[col_name].unique()
    
    if val_count == 1:
        col_unique_vals = [x for x in col_unique_vals if not pd.isnull(x)]
        msg = "Removed null value. Possible boolean?"
        
    print(f"{col_name}, Values: {col_unique_vals} {msg}")

---

From the raw data above, we can see that some columns have only a few unique values. These columns are likely categorical and can therefore be encoded, which makes them easier for the algorithms to work with.
Both ordinal and nominal encoding should be considered.

Possible fields for ordinal encoding: Severity Level

In addition to that, there are columns that contain only 1 unique value; the remaining rows are null. Those columns can possibly be used as booleans.
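
The three ideas above (ordinal encoding, nominal encoding, and boolean flags) can be sketched on a toy frame. The values are assumptions for illustration: only "Low", "ICMP", "UDP", and "IoC Detected" appear in the rows shown in this notebook, while "Medium", "High", and "TCP" are guesses at the remaining categories and should be checked against the output of the cell above.

```python
import pandas as pd

sample = pd.DataFrame({
    "Severity Level": ["Low", "Medium", "High", "Low"],
    "Protocol": ["ICMP", "UDP", "TCP", "ICMP"],
    "Malware Indicators": ["IoC Detected", None, "IoC Detected", None],
})

encoded = sample.copy()

# ordinal: the categories have a natural order (assumed Low < Medium < High)
severity_order = {"Low": 0, "Medium": 1, "High": 2}
encoded["Severity Level"] = encoded["Severity Level"].map(severity_order)

# boolean: a single non-null value versus null becomes True/False
encoded["Malware Indicators"] = encoded["Malware Indicators"].notna()

# nominal: no order, so one indicator column per category
encoded = pd.get_dummies(encoded, columns=["Protocol"], prefix="Protocol")
print(encoded)
```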

In [None]:
df["Payload Data"].tolist()[10000:10010]

In [None]:
# needs: import seaborn as sns; from matplotlib import pyplot as plt
# df_corr = df.select_dtypes(include="number")  # corr() needs numeric columns; the IP columns were dropped earlier
# plt.figure(figsize=(10,10))
# sns.heatmap(df_corr.corr(),annot=True)
# plt.show()

In [None]:
# plt.figure(figsize=(8,5))
# sns.boxplot(data=df, x="Attack Type", y="Packet Length")
# plt.title("Packet Length vs Attack Type")
# plt.show()

In [None]:
# numeric_cols = df.select_dtypes(include=['int64','float64']).columns

# for col in numeric_cols:
#     print(f"Extreme values for column {col}:")
#     print("Minimum:", df[col].min())
#     print("Maximum:", df[col].max())
#     print("-----------")

## Analysis

- No duplicated rows.

Based on the data, some columns are NOT valuable for model training:
- The 2 IP Address columns, because their values are mostly unique and might be too "random". However, we can probably extract more useful data from an IP address. According to https://ipinfo.io/blog/ip-address-information, this includes location, ISP, network information (the ASN and its type), hostname, the number of domains on the IP, and privacy detection (VPN or proxy).
- Payload Data seems fairly useless. It looks like irrelevant Latin filler text, possibly auto-generated.

IDS/IPS Alerts is NEEDED.


## Need Clarity

- From whose perspective was the data collected? The attacker's?
- Does Timestamp use UTC or some other uniform time zone?
- Are countries relevant? We don't know whether the data collection was concentrated in a certain area, which could skew model results.
- Ports above a certain value can be used for anything, unlike the ports below that threshold. Can we assume that the ports are randomly assigned by the attacker?
- Columns like Action Taken hold data recorded after the attack, but the attack type may not be known at the time of the attack. With that data, we can only figure out the type after the attack has happened.
- What is the Anomaly Score? It is not clear at which point in the process it is produced. How does it relate to Attack Signature?
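
For the port question above, the threshold most likely refers to the IANA port ranges: 0-1023 are system (well-known) ports, 1024-49151 are registered ports, and 49152-65535 are dynamic or private ports. A minimal sketch of turning a port number into such a category, which could serve as a feature instead of the raw port; the function name `port_category` and the label strings are hypothetical.

```python
def port_category(port):
    """Classify a TCP/UDP port number by IANA range."""
    if not 0 <= port <= 65535:
        raise ValueError(f"not a valid port: {port}")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic"


# source ports from the first rows of the dataset, plus port 80 for contrast
for port in (31225, 17245, 53600, 80):
    print(port, port_category(port))
```

In the notebook this could be applied with `df["Source Port"].apply(port_category)`.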

## Extra Checks

- Packet Type has some relation to Protocol. Do some more checks to make sure.
- More checks on Packet Length.
- Check which values from some other column are paired with Malware Indicators.
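
One way to run the first check is a cross-tabulation of the two columns. The sketch below uses a made-up toy frame, so the numbers say nothing about the real data; "TCP" is an assumed third protocol value. With `normalize="index"`, each row shows the share of packet types within one protocol, so a strong relation would show up as rows that differ clearly from each other.

```python
import pandas as pd

# toy data for illustration only
sample = pd.DataFrame({
    "Protocol": ["ICMP", "ICMP", "UDP", "TCP", "UDP", "TCP"],
    "Packet Type": ["Data", "Data", "Control", "Data", "Control", "Control"],
})

# share of each Packet Type within each Protocol (rows sum to 1)
table = pd.crosstab(sample["Protocol"], sample["Packet Type"], normalize="index")
print(table)
```

On the real data the call would be `pd.crosstab(df["Protocol"], df["Packet Type"], normalize="index")`.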

## TODO
- Extract browser, OS with version from Device Info. Find out the format of the Device Info column.

In [None]:
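# df_base is not defined earlier in the notebook; assume it is a working copy of df
df_base = df.copy()
df_base["Hourly"] = pd.to_datetime(df_base["Timestamp"]).dt.hour
df_base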

In [None]:
import seaborn as sns
from matplotlib import pyplot as plt

plt.figure(figsize=(8,5))
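# assumption: the intent is to count records per hour of day
sns.countplot(data=df_base, x="Hourly")
plt.title("Number of Records per Hour of Day")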
plt.show()