# Data Explanation

## 1. Load the Data into a Pandas DataFrame

In [1]:
# imports
import pandas as pd
from sklearn.preprocessing import StandardScaler

##### Step 1: Check the Column Names

In [2]:
# Open and read the kddcup.names file
with open('data/kddcup.names', 'r') as file:
    # Read lines and exclude the first line which is not a column name
    lines = file.readlines()[1:]

# Extract column names from each line
column_names = [line.split(":")[0] for line in lines]

# The dataset also has a 'label' column which represents the type of network interaction or attack type
column_names.append("label")

print(column_names)

['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'label']
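
The parsing step above keeps only the text before the colon on each line. As a minimal sketch, assuming every feature line has the form `name: type.` (for example `duration: continuous.`), the declared type can be kept as well and blank lines skipped. The helper `parse_names` and the three-line sample are illustrative and not part of the original notebook.

```python
def parse_names(text):
    """Return (name, kind) pairs from the body of a kddcup.names-style file."""
    pairs = []
    # The first line is not a column name, so it is skipped (as in the cell above).
    for line in text.splitlines()[1:]:
        line = line.strip()
        if not line or ":" not in line:
            continue  # ignore blank or malformed lines
        name, kind = line.split(":", 1)
        pairs.append((name.strip(), kind.strip().rstrip(".")))
    return pairs


# Hypothetical excerpt in the assumed format, not the real file contents.
sample = "back,normal,smurf.\nduration: continuous.\nprotocol_type: symbolic.\n\n"
features = parse_names(sample)
column_names = [name for name, _ in features] + ["label"]
print(features)
print(column_names)
```

Keeping the declared type alongside the name makes it easy to cross-check the dtypes that pandas infers later.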


The KDD Cup 1999 dataset is one of the most widely used datasets for intrusion detection research. Each record describes one network connection through 41 features, grouped below into basic, content, time-based traffic, and host-based traffic features, followed by a label. Here's a breakdown of the columns and their descriptions:

### Basic Features:

1. **duration**: Continuous. Represents the number of seconds of the connection.
2. **protocol_type**: Categorical. The type of the protocol. Examples include `tcp`, `udp`, and `icmp`.
3. **service**: Categorical. The network service on the destination, e.g., `http`, `telnet`, etc.
4. **flag**: Categorical. Status of the connection, e.g., `SF` (normal), `S0` (connection attempt seen, no reply), etc.
5. **src_bytes**: Continuous. Number of data bytes from source to destination.
6. **dst_bytes**: Continuous. Number of data bytes from destination to source.
7. **land**: Binary. `1` if the connection is from/to the same host/port, `0` otherwise.
8. **wrong_fragment**: Continuous. Number of "wrong" fragments.
9. **urgent**: Continuous. Number of urgent packets.

### Content Features:

These features are derived from the connection's data payload.

10. **hot**: Continuous. Number of "hot" indicators.
11. **num_failed_logins**: Continuous. Number of failed login attempts.
12. **logged_in**: Binary. `1` if successfully logged in, `0` otherwise.
13. **num_compromised**: Continuous. Number of "compromised" conditions.
14. **root_shell**: Binary. `1` if root shell is obtained, `0` otherwise.
15. **su_attempted**: Binary. `1` if "su root" command attempted, `0` otherwise.
16. **num_root**: Continuous. Number of root accesses.
17. **num_file_creations**: Continuous. Number of file creation operations.
18. **num_shells**: Continuous. Number of shell prompts.
19. **num_access_files**: Continuous. Number of operations on access control files.
20. **num_outbound_cmds**: Continuous. Number of outbound commands in an FTP session.
21. **is_host_login**: Binary. `1` if the login belongs to the "host" list, `0` otherwise.
22. **is_guest_login**: Binary. `1` if the login is a "guest" login, `0` otherwise.

### Time-based Traffic Features:

These features are computed using a two-second time window. The `*_rate` columns are stored as fractions between 0 and 1, not as values between 0 and 100.

23. **count**: Continuous. Number of connections to the same host as the current connection in the past two seconds.
24. **srv_count**: Continuous. Number of connections to the same service (e.g., HTTP) as the current connection in the past two seconds.
25. **serror_rate**: Continuous. Percentage of connections that have "SYN" errors.
26. **srv_serror_rate**: Continuous. Percentage of connections that have "SYN" errors to the same service.
27. **rerror_rate**: Continuous. Percentage of connections that have "REJ" errors.
28. **srv_rerror_rate**: Continuous. Percentage of connections that have "REJ" errors to the same service.
29. **same_srv_rate**: Continuous. Percentage of connections to the same service.
30. **diff_srv_rate**: Continuous. Percentage of connections to different services.
31. **srv_diff_host_rate**: Continuous. Percentage of connections to different hosts.
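
To make the two-second window concrete, here is a minimal sketch that derives `count` and `srv_count` from a toy connection log. The timestamps, hosts, and the helper `window_counts` are made up for illustration, and excluding the current connection from its own count is an assumption, not something the dataset documentation quoted here confirms.

```python
import pandas as pd

# Toy connection log (made-up values): time in seconds, destination host, service.
conns = pd.DataFrame({
    "time": [0.0, 0.5, 1.0, 2.2, 2.4, 5.0],
    "dst_host": ["A", "A", "B", "A", "A", "A"],
    "service": ["http", "http", "http", "smtp", "http", "http"],
})


def window_counts(log, window=2.0):
    """For each connection, count earlier connections within `window` seconds
    that share its destination host (count) or its service (srv_count)."""
    counts, srv_counts = [], []
    for _, row in log.iterrows():
        # Connections strictly before the current one and inside the window.
        recent = log[(log["time"] < row["time"]) & (log["time"] >= row["time"] - window)]
        counts.append(int((recent["dst_host"] == row["dst_host"]).sum()))
        srv_counts.append(int((recent["service"] == row["service"]).sum()))
    return log.assign(count=counts, srv_count=srv_counts)


result = window_counts(conns)
print(result)
```

The row-by-row loop is quadratic and only suitable for a small illustration; the rate features in the list above would then be ratios taken over the same windowed subsets.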

### Host-based Traffic Features:

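These features summarize traffic to the same destination host over a longer history than the two-second window above. They are built from a window of recent connections to that host rather than from elapsed time, which helps expose slow probing that a short time window would miss.

32-41. **dst_host_count**, **dst_host_srv_count**, **dst_host_same_srv_rate**, **dst_host_diff_srv_rate**, **dst_host_same_src_port_rate**, **dst_host_srv_diff_host_rate**, **dst_host_serror_rate**, **dst_host_srv_serror_rate**, **dst_host_rerror_rate**, **dst_host_srv_rerror_rate**: Continuous. Counts and rates for connections to the same destination host and service, mirroring the time-based features over the longer window.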

Finally,

42. **label**: Categorical. The type of network interaction or specific attack type, e.g., "normal.", "smurf.", "neptune.", etc.
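
The four feature groups described above map onto contiguous column ranges, which can be sketched as a small lookup. The column names are the ones printed earlier; the group keys (`basic`, `content`, `time_traffic`, `host_traffic`) are names chosen here for illustration.

```python
column_names = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes",
    "land", "wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in",
    "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
    "is_host_login", "is_guest_login", "count", "srv_count", "serror_rate",
    "srv_serror_rate", "rerror_rate", "srv_rerror_rate", "same_srv_rate",
    "diff_srv_rate", "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
    "dst_host_same_srv_rate", "dst_host_diff_srv_rate",
    "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate",
    "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate", "label",
]

# Slices follow the numbering used in the feature lists (1-9, 10-22, 23-31, 32-41).
feature_groups = {
    "basic": column_names[0:9],
    "content": column_names[9:22],
    "time_traffic": column_names[22:31],
    "host_traffic": column_names[31:41],
}

for group, cols in feature_groups.items():
    print(f"{group:13s} {len(cols):2d} columns: {cols[0]} ... {cols[-1]}")
```

Once the DataFrame is loaded, an expression such as `df[feature_groups["content"]].describe()` would summarize one group at a time.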

Given that the main dataset is compressed (kddcup.data.gz), we'll use Pandas to directly read from the gzipped file.

In [3]:
# Load the main dataset
data_path = "data/kddcup.data.gz"
column_names = pd.read_csv("data/kddcup.names", sep=":", skiprows=1, header=None)[0].tolist()  # Extracting feature names
column_names.append("label")  # The last column is the label
df = pd.read_csv(data_path, header=None, names=column_names)

# Display the first few rows to get an overview
df.head()

| | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tcp | http | SF | 215 | 45076 | 0 | 0 | 0 | 0 | ... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |
| 1 | 0 | tcp | http | SF | 162 | 4528 | 0 | 0 | 0 | 0 | ... | 1 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |
| 2 | 0 | tcp | http | SF | 236 | 1228 | 0 | 0 | 0 | 0 | ... | 2 | 1.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |
| 3 | 0 | tcp | http | SF | 233 | 2032 | 0 | 0 | 0 | 0 | ... | 3 | 1.0 | 0.0 | 0.33 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |
| 4 | 0 | tcp | http | SF | 239 | 486 | 0 | 0 | 0 | 0 | ... | 4 | 1.0 | 0.0 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |
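
Reading the gzipped file directly works because `pd.read_csv` infers the compression from the `.gz` suffix. The sketch below demonstrates this on a tiny temporary file with made-up rows and three columns standing in for the real 42, and it also reads in chunks, which is one option (an assumption here, not something the notebook does) if the full file of roughly 4.9 million rows does not fit comfortably in memory.

```python
import gzip
import os
import tempfile

import pandas as pd

# Two made-up rows, standing in for the real 42-column records.
rows = "0,tcp,normal.\n0,icmp,smurf.\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.data.gz")
    with gzip.open(path, "wt") as fh:
        fh.write(rows)

    # compression="infer" is the default, so the ".gz" suffix is enough.
    # chunksize makes read_csv return an iterator of smaller DataFrames.
    with pd.read_csv(path, header=None,
                     names=["duration", "protocol_type", "label"],
                     chunksize=1) as reader:
        sample_df = pd.concat(reader, ignore_index=True)

print(sample_df)
```

For the real file a much larger `chunksize` would be appropriate, and each chunk could be filtered or aggregated before concatenation.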


## 2. Familiarize with the Data Structure

#### a. Basic Information

In [4]:
# Get basic information about the dataset
df.info()

# Get a summary of the numerical columns
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898431 entries, 0 to 4898430
Data columns (total 42 columns):
 #   Column                       Dtype  
---  ------                       -----  
 0   duration                     int64  
 1   protocol_type                object 
 2   service                      object 
 3   flag                         object 
 4   src_bytes                    int64  
 5   dst_bytes                    int64  
 6   land                         int64  
 7   wrong_fragment               int64  
 8   urgent                       int64  
 9   hot                          int64  
 10  num_failed_logins            int64  
 11  logged_in                    int64  
 12  num_compromised              int64  
 13  root_shell                   int64  
 14  su_attempted                 int64  
 15  num_root                     int64  
 16  num_file_creations           int64  
 17  num_shells                   int64  
 18  num_access_files             int64  
 ...

| | duration | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | num_failed_logins | logged_in | num_compromised | ... | dst_host_count | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4898431.0 | 4898431.0 | 4898431.0 | 4898431.0 | 4898431.0 | 4898431.0 | 4898431.0 | 4898431.0 | 4898431.0 | 4898431.0 | ... | 4898431.0 | 4898431.0 | 4898431.0 | 4898431.0 | 4898431.0 | 4898431.0 | 4898431.0 | 4898431.0 | 4898431.0 | 4898431.0 |
| mean | 48.34243 | 1834.621 | 1093.623 | 5.716116e-06 | 0.0006487792 | 7.961733e-06 | 0.01243766 | 3.205108e-05 | 0.143529 | 0.008088304 | ... | 232.9811 | 189.2142 | 0.7537132 | 0.03071111 | 0.605052 | 0.006464107 | 0.1780911 | 0.1778859 | 0.0579278 | 0.05765941 |
| std | 723.3298 | 941431.1 | 645012.3 | 0.002390833 | 0.04285434 | 0.007215084 | 0.4689782 | 0.007299408 | 0.3506116 | 3.856481 | ... | 64.02094 | 105.9128 | 0.411186 | 0.1085432 | 0.4809877 | 0.04125978 | 0.3818382 | 0.3821774 | 0.2309428 | 0.2309777 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 45.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 255.0 | 49.0 | 0.41 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 520.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 255.0 | 255.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 1032.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 255.0 | 255.0 | 1.0 | 0.04 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 58329.0 | 1379964000.0 | 1309937000.0 | 1.0 | 3.0 | 14.0 | 77.0 | 5.0 | 1.0 | 7479.0 | ... | 255.0 | 255.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


#### b. Check for Missing Values

In [5]:
# Check for any missing values in the dataset
missing_values = df.isnull().sum()

# Display columns with missing values (if any)
missing_values[missing_values > 0]

Series([], dtype: int64)

There are no missing values in the dataset. This means you can proceed without having to handle null or NaN entries, making the preprocessing stage smoother.
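
For comparison, this is what the same check reports when values are missing. The toy frame below is made up, and the extra `share` column is an illustrative addition rather than part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in each of two columns.
toy = pd.DataFrame({
    "src_bytes": [215, np.nan, 236],
    "flag": ["SF", "SF", None],
    "land": [0, 0, 0],
})

missing = toy.isnull().sum()
# Keep the count next to the fraction of rows affected.
report = pd.DataFrame({"missing": missing, "share": missing / len(toy)})
report = report[report["missing"] > 0]
print(report)
```

On the KDD Cup DataFrame the equivalent report would be empty, matching the empty `Series` shown above.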

#### c. Explore the Target Variable (Labels)

In [6]:
# Check the distribution of the 'label' column
df['label'].value_counts()

label
smurf.              2807886
neptune.            1072017
normal.              972781
satan.                15892
ipsweep.              12481
portsweep.            10413
nmap.                  2316
back.                  2203
warezclient.           1020
teardrop.               979
pod.                    264
guess_passwd.            53
buffer_overflow.         30
land.                    21
warezmaster.             20
imap.                    12
rootkit.                 10
loadmodule.               9
ftp_write.                8
multihop.                 7
phf.                      4
perl.                     3
spy.                      2
Name: count, dtype: int64

The distribution of the label column shows the following insights:

1. "smurf." and "neptune." are the most frequent attack types, with 2,807,886 and 1,072,017 occurrences, respectively.
2. "normal." interactions (non-malicious traffic) appear 972,781 times, which is less frequent than both "smurf." and "neptune." but more frequent than every remaining attack type.
3. The dataset contains a variety of attacks, including but not limited to "back.", "satan.", "ipsweep.", and "portsweep.", among others. Each of these represents different methods or vectors of cyberattacks.
4. Some attacks, like "perl.", "spy.", and "phf.", are relatively rare in this dataset.

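This distribution shows that the dataset is heavily imbalanced: "smurf." and "neptune." together account for about 79% of all records, "normal." traffic for about 20%, and the remaining 20 attack types combined for under 1%. This matters when modeling, because a classifier can score high overall accuracy while missing the rare classes entirely. Strategies such as resampling, class weighting, or per-class evaluation metrics (precision, recall, F1) help produce a model that performs well across all categories.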
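
The imbalance can be quantified directly from the counts printed above. The sketch below copies those counts into a Series, computes each label's share, and derives inverse-frequency class weights using `total / (n_classes * count)`, the formula scikit-learn documents for `class_weight="balanced"`. Using such weights is one possible mitigation, not a step the notebook itself takes.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
label_counts = pd.Series({
    "smurf.": 2807886, "neptune.": 1072017, "normal.": 972781,
    "satan.": 15892, "ipsweep.": 12481, "portsweep.": 10413,
    "nmap.": 2316, "back.": 2203, "warezclient.": 1020, "teardrop.": 979,
    "pod.": 264, "guess_passwd.": 53, "buffer_overflow.": 30, "land.": 21,
    "warezmaster.": 20, "imap.": 12, "rootkit.": 10, "loadmodule.": 9,
    "ftp_write.": 8, "multihop.": 7, "phf.": 4, "perl.": 3, "spy.": 2,
})

total = label_counts.sum()
share = label_counts / total
normal_share = share["normal."]
attack_share = 1.0 - normal_share

# Inverse-frequency weights: rare classes receive large weights.
class_weight = total / (len(label_counts) * label_counts)

print(f"normal: {normal_share:.1%}, attack: {attack_share:.1%}")
print(share.head(3).round(4))
print(class_weight.tail(3).round(1))
```

With the real DataFrame, `df['label'].value_counts(normalize=True)` gives the same shares without copying the counts by hand.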

#### d. Explore Feature Types

In [7]:
# Check the data types of each column
df.dtypes

duration                         int64
protocol_type                   object
service                         object
flag                            object
src_bytes                        int64
dst_bytes                        int64
land                             int64
wrong_fragment                   int64
urgent                           int64
hot                              int64
num_failed_logins                int64
logged_in                        int64
num_compromised                  int64
root_shell                       int64
su_attempted                     int64
num_root                         int64
num_file_creations               int64
num_shells                       int64
num_access_files                 int64
num_outbound_cmds                int64
is_host_login                    int64
is_guest_login                   int64
count                            int64
srv_count                        int64
serror_rate                    float64
...

#### e. Explore Categorical Features

The KDD Cup 1999 dataset contains some categorical features. Let's explore their unique values:

In [8]:
categorical_features = df.select_dtypes(include=['object']).columns
for feature in categorical_features:
    print(f"\nUnique values for {feature}:")
    print(df[feature].value_counts())



Unique values for protocol_type:
protocol_type
icmp    2833545
tcp     1870598
udp      194288
Name: count, dtype: int64

Unique values for service:
service
ecr_i        2811660
private      1100831
http          623091
smtp           96554
other          72653
              ...   
tftp_u             3
harvest            2
aol                2
http_8001          2
http_2784          1
Name: count, Length: 70, dtype: int64

Unique values for flag:
flag
SF        3744328
S0         869829
REJ        268874
RSTR         8094
RSTO         5344
SH           1040
S1            532
S2            161
RSTOS0        122
OTH            57
S3             50
Name: count, dtype: int64

Unique values for label:
label
smurf.              2807886
neptune.            1072017
normal.              972781
satan.                15892
ipsweep.              12481
portsweep.            10413
nmap.                  2316
back.                  2203
warezclient.           1020
teardrop.               979
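pod.                    264
guess_passwd.            53
buffer_overflow.         30
land.                    21
warezmaster.             20
imap.                    12
rootkit.                 10
loadmodule.               9
ftp_write.                8
multihop.                 7
phf.                      4
perl.                     3
spy.                      2
Name: count, dtype: int64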

The KDD Cup dataset consists of three main protocol types:

1. icmp
2. tcp
3. udp

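The `label` column contains 22 different attack types in addition to `normal` (the trailing `.` in the raw labels is dropped here):
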
1. back
2. buffer_overflow
3. ftp_write
4. guess_passwd
5. imap
6. ipsweep
7. land
8. loadmodule
9. multihop
10. neptune
11. nmap
12. perl
13. phf
14. pod
15. portsweep
16. rootkit
17. satan
18. smurf
19. spy
20. teardrop
21. warezclient
22. warezmaster
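
The list above can be derived from the raw labels rather than typed by hand. This is a minimal sketch that hard-codes the 23 label values from the `value_counts()` output; with the DataFrame in memory, `df['label'].unique()` would supply them instead.

```python
# The 23 raw label values shown in the value_counts() output.
labels = [
    "smurf.", "neptune.", "normal.", "satan.", "ipsweep.", "portsweep.",
    "nmap.", "back.", "warezclient.", "teardrop.", "pod.", "guess_passwd.",
    "buffer_overflow.", "land.", "warezmaster.", "imap.", "rootkit.",
    "loadmodule.", "ftp_write.", "multihop.", "phf.", "perl.", "spy.",
]

# Drop the non-attack class and strip the trailing dot from each name.
attack_types = sorted(label.rstrip(".") for label in labels if label != "normal.")

print(len(attack_types))
for i, name in enumerate(attack_types, start=1):
    print(f"{i}. {name}")
```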