# Feature Selection
## Dataset: WEB-IDS23 Dataset
#### Tasks:
    1. Feature selection using Forward - Backward Analysis

#### Navigation:
1. [Import Libraries](#import-libraries)
2. [Utility Functions](#utility-functions)
3. [Load the Dataset](#load-the-dataset)
4. [Data Preprocessing](#data-preprocessing)
5. [Model Building and Feature Selection](#model-building-and-feature-selection)

        Author: Nithusikan T.
        Email: e19266@eng.pdn.ac.lk
        Date: 29/05/2025

### Import Libraries [🏠](#feature-selection) <a id="import-libraries"></a>

In [1]:
import os
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.preprocessing import LabelEncoder


### Utility Functions [🏠](#feature-selection) <a id="utility-functions"></a>

In [11]:
def load_top_rows_from_csvs(folder_path, n_rows=10000):
    """
    Reads the first `n_rows` from each CSV in the specified folder
    and concatenates them into a single DataFrame.

    Note: Skips "web-ids23_smtp_enum.csv" (it contains only 7 rows), as well as
    "ssh_login.csv" and "ssh_login_successful.csv".
    
    Parameters:
        folder_path (str): Path to the folder containing CSV files.
        n_rows (int): Number of rows to read from each file.

    Returns:
        pd.DataFrame: Combined DataFrame from all CSVs.
    """
    # Files skipped on purpose
    excluded_files = {"web-ids23_smtp_enum.csv", "ssh_login.csv", "ssh_login_successful.csv"}
    all_dfs = []
    
    for filename in os.listdir(folder_path):
        if filename.endswith(".csv") and filename not in excluded_files:
            file_path = os.path.join(folder_path, filename)
            try:
                df = pd.read_csv(file_path, nrows=n_rows)
                all_dfs.append(df)
                print(f"{filename} loaded with {df.shape[0]} rows and {df.shape[1]} columns.")
            except Exception as e:
                print(f"Could not read {filename}: {e}")
    
    # Fail early with a message that names the folder if nothing was loaded
    if not all_dfs:
        raise ValueError(f"No CSV files could be loaded from {folder_path}")

    combined_df = pd.concat(all_dfs, ignore_index=True)
    return combined_df

def preprocess_data(df, columns_2_drop, target_columns=['attack_type', 'attack']):
    """
    Preprocess the DataFrame: drop rows with missing values, separate the
    target columns, drop unneeded columns, one-hot encode categorical features,
    and label-encode the first target.

    Parameters:
    df (pd.DataFrame): Input DataFrame.
    columns_2_drop (list): List of columns to drop from the DataFrame.
    target_columns (list): Names of the two target columns, in the order
        [attack_type, attack].

    Returns:
    X (pd.DataFrame): Features DataFrame.
    y_1 (pd.Series): First target (attack_type), label-encoded if it has object dtype.
    y_2 (pd.Series): Second target (attack), returned unchanged.
    """
    # Label encoder
    label_encoder = LabelEncoder()

    # Handle missing values
    df = df.dropna(axis=0, how='any')  # Drop rows with any missing values

    # Separate features and target variable
    y_1 = df[target_columns[0]]  # attack_type
    y_2 = df[target_columns[1]]  # attack
    X = df.drop(columns=target_columns)

    # Drop some columns that are not needed for analysis
    print(f"Dropping columns: {columns_2_drop}")
    X = X.drop(columns=columns_2_drop)

    # Encode categorical variables if any
    X = pd.get_dummies(X, drop_first=True)

    # Replace labels with numerical values
    if y_1.dtype == 'object':
        y_1 = label_encoder.fit_transform(y_1)
    
    return pd.DataFrame(X, columns=X.columns), pd.Series(y_1), pd.Series(y_2)
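
`preprocess_data` fits a `LabelEncoder` on the first target but does not return it, so the integer codes cannot be mapped back to attack names afterwards. The sketch below shows how the encoder assigns codes and how a retained encoder decodes them. The label values are hypothetical stand-ins, not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical label values, for illustration only
labels = pd.Series(["dos_http", "benign", "portscan", "benign"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# Classes are sorted, so benign -> 0, dos_http -> 1, portscan -> 2
print(encoded.tolist())           # [1, 0, 2, 0]
print(encoder.classes_.tolist())  # ['benign', 'dos_http', 'portscan']

# Keeping the fitted encoder makes it possible to decode predictions later
decoded = encoder.inverse_transform([2, 0])
print(decoded.tolist())           # ['portscan', 'benign']
```

One option would be to return the fitted encoder from the function alongside `X`, `y_1`, and `y_2`.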

### Load the Dataset [🏠](#feature-selection) <a id="load-the-dataset"></a>

In [12]:
df = load_top_rows_from_csvs("E:\\Accadamics\\Semesters\\Final Year Project\\e19-4yp-The-Compound-Prediction-Analysis-of-Cybersecurity\\Data\\web-ids23")

web-ids23_benign.csv loaded with 10000 rows and 38 columns.
web-ids23_bruteforce_http.csv loaded with 10000 rows and 38 columns.
web-ids23_bruteforce_https.csv loaded with 10000 rows and 38 columns.
web-ids23_dos_http.csv loaded with 10000 rows and 38 columns.
web-ids23_dos_https.csv loaded with 10000 rows and 38 columns.
web-ids23_ftp_login.csv loaded with 10000 rows and 38 columns.
web-ids23_ftp_version.csv loaded with 10000 rows and 38 columns.
web-ids23_hostsweep_Pn.csv loaded with 10000 rows and 38 columns.
web-ids23_hostsweep_sn.csv loaded with 10000 rows and 38 columns.
web-ids23_portscan.csv loaded with 10000 rows and 38 columns.
web-ids23_revshell_http.csv loaded with 8549 rows and 38 columns.
web-ids23_revshell_https.csv loaded with 9404 rows and 38 columns.
web-ids23_smtp_version.csv loaded with 10000 rows and 38 columns.
web-ids23_sql_injection_http.csv loaded with 10000 rows and 38 columns.
web-ids23_sql_injection_https.csv loaded with 10000 rows and 38 columns.
web-ids23_

### Data Preprocessing [🏠](#feature-selection) <a id="data-preprocessing"></a>

#### Drop unnecessary columns

In [13]:
# Drop columns that are not needed for analysis
columns_2_drop = [
    'uid',               # Unique flow ID (not predictive)
    'ts',                # Timestamp (not useful directly; time-series analysis might use it differently)
    'id.orig_h',         # Origin IP (environment-specific)
    'id.resp_h',         # Destination IP (environment-specific)
    'service',           # Tool-specific, may not generalize
    'traffic_direction', # Typically derived from IPs, so not generalizable
    'fwd_init_window_size', # Typically the window size varies with OS
    'bwd_init_window_size', # ""
    'fwd_last_window_size', # ""
    'bwd_last_window_size'  # ""
]

target_columns = [       # Define target columns
    'attack_type', 
    'attack'
]  

X, y_attack_type, y_attack = preprocess_data(df, columns_2_drop=columns_2_drop, target_columns=target_columns)


Dropping columns: ['uid', 'ts', 'id.orig_h', 'id.resp_h', 'service', 'traffic_direction', 'fwd_init_window_size', 'bwd_init_window_size', 'fwd_last_window_size', 'bwd_last_window_size']


#### Feature and Label Split

In [None]:
feature_cols = [
    "flow_duration", "fwd_pkts_tot", "bwd_pkts_tot",
    "fwd_data_pkts_tot", "bwd_data_pkts_tot",
    "fwd_pkts_per_sec", "bwd_pkts_per_sec", "flow_pkts_per_sec",
    "down_up_ratio", "payload_bytes_per_second",
    "flow_FIN_flag_count", "flow_SYN_flag_count", "flow_RST_flag_count",
    "flow_ACK_flag_count", "fwd_PSH_flag_count", "bwd_PSH_flag_count",
    "flow_CWR_flag_count", "flow_ECE_flag_count",
    "fwd_URG_flag_count", "bwd_URG_flag_count",
    "fwd_header_size_tot", "bwd_header_size_tot",
    "fwd_header_size_min", "fwd_header_size_max",
    "bwd_header_size_min", "bwd_header_size_max"
]


print(f"The length of feature_cols is {len(feature_cols)}.")
print(f"The shape of X is {X.shape}")

The length of feature_cols is 26.
The shape of X is (137024, 26)


#### One hot encoding

In [16]:
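# preprocess_data() already applied pd.get_dummies, so this second pass is expected to leave X unchanged
X = pd.get_dummies(X, drop_first=True)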
print(f"The shape of X is {X.shape}")

The shape of X is (137024, 26)
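
The unchanged shape is consistent with how `pd.get_dummies` behaves: it only expands categorical (object, string, or category) columns and passes numeric ones through. A minimal sketch on a hypothetical two-column frame (the column names `pkts` and `proto` are made up for illustration):

```python
import pandas as pd

# Hypothetical frame: one numeric column and one categorical column
demo = pd.DataFrame({"pkts": [3, 7, 5], "proto": ["tcp", "udp", "tcp"]})

# drop_first=True drops the first level ("tcp"), leaving a single indicator column
encoded_demo = pd.get_dummies(demo, drop_first=True)
print(list(encoded_demo.columns))  # ['pkts', 'proto_udp']

# A second pass finds no categorical columns, so the shape stays the same
encoded_twice = pd.get_dummies(encoded_demo, drop_first=True)
print(encoded_twice.shape == encoded_demo.shape)  # True
```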


#### Train test split

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y_attack_type, test_size=0.2, random_state=42)
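
The split above is purely random. If the class proportions should be identical in both partitions, `train_test_split` accepts a `stratify` argument. The sketch below uses hypothetical, deliberately imbalanced labels to make the effect visible; whether stratification is needed for this dataset is a judgment call.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
y_demo = pd.Series([0] * 90 + [1] * 10)
X_demo = pd.DataFrame({"x": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both partitions keep the 90/10 class ratio
print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

In the notebook this would correspond to passing `stratify=y_attack_type` in the call above.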

### Model Building and Feature Selection [🏠](#feature-selection) <a id="model-building-and-feature-selection"></a>

In [18]:
# Initialize the model
rfc = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)

#### Forward feature selection

In [None]:
sfs_forward = SFS(
    rfc,
    k_features='best',
    forward=True,
    floating=False,
    scoring='accuracy',
    cv=5,
    n_jobs=-1 
)

sfs_forward = sfs_forward.fit(X_train, y_train)

print("🔍 Forward Selection Best Features:")
print(list(sfs_forward.k_feature_names_))
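
To make the procedure concrete, here is a hand-rolled sketch of greedy forward selection built on `cross_val_score`. It is an illustration of the idea, not mlxtend's implementation: the helper `greedy_forward_selection` is hypothetical, the data is synthetic (`make_classification`), and the forest is kept small so the loop finishes quickly.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def greedy_forward_selection(model, X, y, cv=3):
    """Illustrative forward selection; returns the best subset and its mean CV accuracy."""
    remaining = list(X.columns)
    selected = []
    best_subset, best_score = [], -np.inf

    while remaining:
        # Score every candidate that extends the current subset by one feature
        scores = {
            col: cross_val_score(
                model, X[selected + [col]], y, cv=cv, scoring="accuracy"
            ).mean()
            for col in remaining
        }
        winner = max(scores, key=scores.get)
        selected.append(winner)
        remaining.remove(winner)

        # Remember the best subset seen at any size (similar in spirit to k_features='best')
        if scores[winner] > best_score:
            best_subset, best_score = list(selected), scores[winner]

    return best_subset, best_score


# Synthetic toy data, not the WEB-IDS23 features
X_arr, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=2, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

model = RandomForestClassifier(n_estimators=10, random_state=0)
toy_features, toy_score = greedy_forward_selection(model, X_toy, y_toy)
print(toy_features, round(toy_score, 3))
```

Backward elimination follows the same pattern in reverse: start from all features and repeatedly remove the one whose removal hurts the score least.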

#### Backward feature selection

In [None]:
sfs_backward = SFS(rfc,
                   k_features='best',
                   forward=False,
                   floating=False,
                   scoring='accuracy',
                   cv=5,
                   n_jobs=-1)

sfs_backward = sfs_backward.fit(X_train, y_train)

print("\n🔍 Backward Elimination Best Features:")
print(list(sfs_backward.k_feature_names_))

#### Selected Features

In [None]:
# k_feature_names_ is a tuple, so convert to sets before using | and &
forward_features = set(sfs_forward.k_feature_names_)
backward_features = set(sfs_backward.k_feature_names_)

union_features = sorted(forward_features | backward_features)
print("The union of fwd and bwd features:\n")
print(union_features)

intersection_features = sorted(forward_features & backward_features)
print("\nThe common (intersection) of fwd and bwd features:\n")
print(intersection_features)
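
Python lists do not support `|` or `&`, and set operations discard the order in which features were selected. If that order matters, the union and intersection can be computed in an order-preserving way. The feature tuples below are hypothetical stand-ins for `sfs_forward.k_feature_names_` and `sfs_backward.k_feature_names_`.

```python
# Hypothetical selections standing in for the two k_feature_names_ tuples
forward_features = ("flow_duration", "fwd_pkts_tot", "flow_SYN_flag_count")
backward_features = ("flow_SYN_flag_count", "bwd_pkts_tot", "flow_duration")

# dict.fromkeys removes duplicates while keeping first-seen order
union_ordered = list(dict.fromkeys(forward_features + backward_features))

# Keep forward-selection order, filtered to features that backward selection also kept
backward_set = set(backward_features)
intersection_ordered = [f for f in forward_features if f in backward_set]

print(union_ordered)
print(intersection_ordered)
```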
