# Recursive feature elimination

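Recursive feature elimination (RFE) repeatedly fits a decision tree, removes the least important feature, and refits on the features that remain until the requested number is left. scikit-learn's RFE class, used below, stops at a fixed number of features (n_features_to_select); picking the subset with the highest accuracy means comparing several subset sizes, for example with RFECV. The method can be computationally expensive because the model is refit at every round, but it can be effective for identifying the most important features.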
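To make "recursively removing features" concrete, the loop below is a minimal hand-written sketch of the idea, not scikit-learn's actual implementation. It assumes a synthetic dataset from make_classification and a target of three features; the names X_demo, y_demo, remaining and n_keep are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the real dataset (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

remaining = list(range(X_demo.shape[1]))  # column indices still in play
n_keep = 3                                # hypothetical target size

while len(remaining) > n_keep:
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_demo[:, remaining], y_demo)
    # Position (within `remaining`) of the feature the tree rates lowest
    weakest = int(np.argmin(tree.feature_importances_))
    remaining.pop(weakest)

print("Kept feature indices:", remaining)
```

Each pass costs one full model fit, which is why the method becomes expensive as the number of columns grows.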

Step 1: Load the dataset

Load your dataset into a pandas DataFrame.

Step 2: Split the dataset into features and target variable

Split your dataset into a feature matrix X and a target variable y.

Step 3: Split the dataset into training and testing sets

Split your dataset into training and testing sets using the train_test_split function from scikit-learn.

Step 4: Define a decision tree classifier

Define a decision tree classifier using the DecisionTreeClassifier class from scikit-learn.

Step 5: Apply Recursive Feature Elimination

Apply Recursive Feature Elimination using the RFE class from scikit-learn.
Specify the decision tree classifier and the number of features to select using the n_features_to_select parameter.
Fit the RFE object to the training data using the fit method.

Step 6: Identify the selected features

Identify the selected features using the support_ attribute of the RFE object.

Step 7: Create a new feature matrix with selected features

Create a new feature matrix X_selected containing only the selected features.

Step 8: Build a decision tree model with the selected features

Build a new decision tree model with the same settings as the classifier in Step 4, using only the selected features.
Fit the decision tree model to the training data.
Evaluate the performance of the decision tree model on the testing data.
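The steps above can be condensed into one short script. The version below is a sketch that swaps the CSV file for synthetic data so it runs anywhere; the column names f0..f9, the choice of four features and all variable names are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Steps 1-2: synthetic feature matrix and target (stand-in for the real CSV)
X_arr, y_arr = make_classification(
    n_samples=500, n_features=10, n_informative=4, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])
y_demo = pd.Series(y_arr, name="label")

# Step 3: hold out 20% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Steps 4-5: RFE wrapped around a decision tree, fitted on the training split
rfe_demo = RFE(
    estimator=DecisionTreeClassifier(random_state=42), n_features_to_select=4
)
rfe_demo.fit(X_tr, y_tr)

# Steps 6-7: boolean mask -> names of the surviving columns
selected = X_tr.columns[rfe_demo.support_]

# Step 8: fresh tree on the selected columns only, scored on the test split
model = DecisionTreeClassifier(random_state=42).fit(X_tr[selected], y_tr)
test_accuracy = model.score(X_te[selected], y_te)
print("Selected:", list(selected))
print("Test accuracy:", test_accuracy)
```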

In [15]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_selection import RFE
import numpy as np

le = LabelEncoder()

In [2]:
st_path = 'C:/Users/lenovo/DataSet/CIC-IDS-2017/GeneratedLabelledFlows/TrafficLabelling/'
st_file = 'Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv'
st_file2 = 'Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv'
# st_file_merged = 'merged.csv'
# st_file = 'Friday-WorkingHours-Morning.pcap_ISCX.csv'
# st_file = 'Monday-WorkingHours.pcap_ISCX.csv'
# st_file = 'Tuesday-WorkingHours.pcap_ISCX.csv'
encoding = 'utf_8'
df_data = pd.read_csv(os.path.join(st_path, st_file), encoding=encoding)
df_data2 = pd.read_csv(os.path.join(st_path, st_file2), encoding=encoding)
df_data.columns

Index(['Flow ID', ' Source IP', ' Source Port', ' Destination IP',
       ' Destination Port', ' Protocol', ' Timestamp', ' Flow Duration',
       ' Total Fwd Packets', ' Total Backward Packets',
       'Total Length of Fwd Packets', ' Total Length of Bwd Packets',
       ' Fwd Packet Length Max', ' Fwd Packet Length Min',
       ' Fwd Packet Length Mean', ' Fwd Packet Length Std',
       'Bwd Packet Length Max', ' Bwd Packet Length Min',
       ' Bwd Packet Length Mean', ' Bwd Packet Length Std', 'Flow Bytes/s',
       ' Flow Packets/s', ' Flow IAT Mean', ' Flow IAT Std', ' Flow IAT Max',
       ' Flow IAT Min', 'Fwd IAT Total', ' Fwd IAT Mean', ' Fwd IAT Std',
       ' Fwd IAT Max', ' Fwd IAT Min', 'Bwd IAT Total', ' Bwd IAT Mean',
       ' Bwd IAT Std', ' Bwd IAT Max', ' Bwd IAT Min', 'Fwd PSH Flags',
       ' Bwd PSH Flags', ' Fwd URG Flags', ' Bwd URG Flags',
       ' Fwd Header Length', ' Bwd Header Length', 'Fwd Packets/s',
       ' Bwd Packets/s', ' Min Packet Length', ' Max Pa

In [3]:
st_path_test_file = 'C:/Users/lenovo/DataSet/MachineLearningCSV/MachineLearningCVE/'
st_file_test_file = 'Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv'
encoding = 'utf_8'
df_data_test_file = pd.read_csv(os.path.join(st_path_test_file, st_file_test_file), encoding=encoding)

In [4]:
filtered_df = df_data[df_data[' Label'] != 'BENIGN']
label_column = filtered_df[' Label']
unique_labels = label_column.unique()
print(unique_labels)

['DDoS']


In [11]:
# Explain why we are dropping these columns
# df_features = df_data.drop(['Flow ID',' Source IP', 'Flow ID', ' Timestamp', ' Destination IP', ' Source Port'], axis=1)  # Features
# df_target = df_data[' Label']  # Target variable

# df_features_test_file = df_data_test_file  # Features
# df_target_test_file = df_data_test_file[' Label']  # Target variable

In [5]:
# Split the dataset into features and target variable
X = df_data.drop(" Label", axis=1)
y = df_data[" Label"]
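The column headers in this CSV carry leading spaces (for example " Label"), and the commented-out cell above hints at dropping identifier columns. A possible way to handle both is sketched below on a toy frame; the id_columns list is a hypothetical choice, not something the notebook settles on.

```python
import pandas as pd

# Toy frame mimicking the CSV's headers, including the leading spaces
df_toy = pd.DataFrame({
    "Flow ID": ["a-b-1", "c-d-2"],
    " Source IP": ["10.0.0.1", "10.0.0.2"],
    " Timestamp": ["7/7/2017 3:30", "7/7/2017 3:31"],
    " Flow Duration": [120, 4500],
    " Label": ["BENIGN", "DDoS"],
})

# Normalise header whitespace so columns can be addressed by their plain names
df_toy.columns = df_toy.columns.str.strip()

# Hypothetical list of identifier-like columns to keep out of the features
id_columns = ["Flow ID", "Source IP", "Timestamp"]
X_toy = df_toy.drop(columns=id_columns + ["Label"])
y_toy = df_toy["Label"]
print(list(X_toy.columns))
```

Identifiers such as flow IDs and IP addresses describe where traffic came from rather than how it behaves, so a model that leans on them may not generalise to other captures.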

In [6]:
# Label-encode every non-numeric column. The shared encoder is refit for each
# column, so afterwards it only remembers the classes of the last one.
for st_col in X.columns:
    if X[st_col].dtypes not in ['int64', 'float64']:
        print(X[st_col].dtypes)
        X[st_col] = le.fit_transform(X[st_col])

# for st_col in df_features_test_file.columns:
#     if df_features_test_file[st_col].dtypes not in ['int64', 'float64']:
#         print(df_features_test_file[st_col].dtypes)
#         df_features_test_file[st_col] = le.fit_transform(df_features_test_file[st_col])

object
object
object
object
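As an alternative to looping with a single LabelEncoder, all text columns can be encoded in one call with OrdinalEncoder, which keeps a separate category list per column. This is a sketch on a toy frame with made-up values; it is not what the notebook ran.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy feature frame: one text column, one numeric column
X_toy = pd.DataFrame({
    "Source IP": ["10.0.0.2", "10.0.0.1", "10.0.0.2"],
    "Flow Duration": [120, 4500, 30],
})

# Every column that is not numeric gets encoded
text_cols = X_toy.select_dtypes(exclude="number").columns

encoder = OrdinalEncoder()
X_toy[text_cols] = encoder.fit_transform(X_toy[text_cols])

print(X_toy)
print(encoder.categories_)  # one array of original values per encoded column
```

Because encoder.categories_ stores the mapping for each column, the integer codes can later be translated back with encoder.inverse_transform.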


In [7]:
# Search for columns with infinite values
lt_columns = X.columns[X.max() == np.inf]
# lt_columns_test_file = df_features_test_file[df_features_test_file.columns[df_features_test_file.max() == np.inf]].columns

In [8]:
# Replace each infinite value with 10 x the largest finite value in its column
for st_column_inf in lt_columns:
    print(st_column_inf)
    df_column_aux = X[st_column_inf]
    # largest finite value in this column
    vl_max_aux = df_column_aux[df_column_aux < np.inf].max()
    print(vl_max_aux)
    # use .loc so the assignment modifies X itself rather than a temporary copy
    X.loc[X[st_column_inf] == np.inf, st_column_inf] = 10*vl_max_aux

Flow Bytes/s
2070000000.0
 Flow Packets/s
3000000.0
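The same capping rule can be wrapped in a small reusable function. The helper below, cap_infinite, is a hypothetical name introduced here; it is a sketch that returns a modified copy instead of changing the frame in place, shown on toy values.

```python
import numpy as np
import pandas as pd

def cap_infinite(df, factor=10):
    """Return a copy of df with +inf replaced by factor * the column's largest finite value."""
    out = df.copy()
    for col in out.columns:
        is_inf = out[col] == np.inf
        if is_inf.any():
            finite_max = out.loc[~is_inf, col].max()
            out.loc[is_inf, col] = factor * finite_max
    return out

X_toy = pd.DataFrame({
    "Flow Bytes/s": [10.0, np.inf, 250.0],
    "Flow Packets/s": [1.0, 2.0, np.inf],
    "Flow Duration": [5.0, 6.0, 7.0],
})

X_capped = cap_infinite(X_toy)
print(X_capped)
```

The factor of 10 mirrors the notebook's choice; it is an arbitrary constant, and a different value (or dropping the affected rows) may suit other datasets better.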


In [9]:
# check if there are still columns with infinite values

lt_columns = X.columns[X.max() == np.inf]
print('columns inf', lt_columns)

columns inf Index([], dtype='object')


In [11]:
# Search for the columns with NaN values
for st_column_nan in X.columns:
    df_column_aux = X[X[st_column_nan].isna()].copy()
    if len(df_column_aux) > 0:
        print(df_column_aux.transpose())
        print(X[X[st_column_nan].isna()].transpose())
        print(st_column_nan)
        print('Total number of NaNs:', len(X[X[st_column_nan].isna()]))
        print(X[st_column_nan].describe())
# Drop the rows with NaN values
X.dropna(inplace=True)
y = y[y.index.isin(X.index)]

                    6796     14739    15047    209728
Flow ID            57339.0  59849.0  58541.0  58653.0
 Source IP           779.0    703.0    698.0    698.0
 Source Port          80.0  37575.0  48283.0  39026.0
 Destination IP      860.0    863.0      0.0    867.0
 Destination Port  36812.0  53581.0     80.0  18467.0
...                    ...      ...      ...      ...
 Active Min            0.0      0.0      0.0      0.0
Idle Mean              0.0      0.0      0.0      0.0
 Idle Std              0.0      0.0      0.0      0.0
 Idle Max              0.0      0.0      0.0      0.0
 Idle Min              0.0      0.0      0.0      0.0

[84 rows x 4 columns]
                    6796     14739    15047    209728
Flow ID            57339.0  59849.0  58541.0  58653.0
 Source IP           779.0    703.0    698.0    698.0
 Source Port          80.0  37575.0  48283.0  39026.0
 Destination IP      860.0    863.0      0.0    867.0
 Destination Port  36812.0  53581.0     80.0  18467.0
...  
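A more compact NaN overview is possible without printing whole rows. The sketch below counts missing values per column and then drops the affected rows from the features and the target together; the toy values and the names X_toy, y_toy, X_clean and y_clean are illustrative.

```python
import numpy as np
import pandas as pd

X_toy = pd.DataFrame({
    "Flow Bytes/s": [10.0, np.nan, 250.0, np.nan],
    "Flow Duration": [5.0, 6.0, 7.0, 8.0],
})
y_toy = pd.Series(["BENIGN", "DDoS", "DDoS", "BENIGN"])

# Number of NaNs per column, showing only the columns that have any
nan_counts = X_toy.isna().sum()
print(nan_counts[nan_counts > 0])

# Keep only rows with no missing value, using one mask for X and y
keep = X_toy.notna().all(axis=1)
X_clean = X_toy[keep]
y_clean = y_toy[keep]
print(len(X_clean), "rows kept")
```

Applying one boolean mask to both objects keeps the features and labels aligned by construction.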

In [12]:
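# Scale numerical features
# Note: mt_features_scaled is not used in the cells below; the split and the
# models work on the unscaled X (tree-based models do not need scaling).
scaler = StandardScaler()
mt_features_scaled = scaler.fit_transform(X)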
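If scaled features are wanted for a model that is sensitive to feature scale, a common pattern is to fit the scaler on the training split only and reuse it on the test split, so that test statistics do not influence training. The sketch below uses synthetic data and hypothetical variable names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Learn mean and standard deviation from the training rows only
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)  # reuses the training statistics

print(np.round(X_tr_scaled.mean(axis=0), 6))  # approximately zero per column
```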


In [13]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [14]:
# Define a decision tree classifier
clf = DecisionTreeClassifier(random_state=42)

In [16]:
# Apply Recursive Feature Elimination
rfe = RFE(estimator=clf, n_features_to_select=5)
rfe.fit(X_train, y_train)
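Besides support_, a fitted RFE object exposes ranking_, which records the order in which features were eliminated (rank 1 marks the selected ones). The sketch below shows how to read it on synthetic data; the column names f0..f7 are made up.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

rfe_demo = RFE(
    estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=3
)
rfe_demo.fit(X_demo, y_demo)

# Rank 1 = selected; larger ranks were eliminated earlier
ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()
print(ranking)
```

Looking at the features with rank 2 or 3 gives a sense of which columns narrowly missed the cut.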

In [20]:
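# Define a random forest classifier as an alternative estimator for RFE
# (this run was interrupted before it finished; see the KeyboardInterrupt below)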
from sklearn.ensemble import RandomForestClassifier


clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Apply Recursive Feature Elimination
rfe = RFE(estimator=clf, n_features_to_select=5)
rfe.fit(X_train, y_train)

KeyboardInterrupt: 
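With the default step=1, RFE refits the estimator once for every feature it removes, which is slow for a 100-tree forest on a wide dataset. One way to shorten the run is to remove several features per round with the step parameter and to train the forest in parallel with n_jobs. The values below (20 trees, step=0.25, synthetic data) are assumptions chosen to keep the sketch quick, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X_demo, y_demo = make_classification(
    n_samples=300, n_features=40, n_informative=5, random_state=42
)

# Smaller forest trained on all available cores
forest = RandomForestClassifier(n_estimators=20, n_jobs=-1, random_state=42)

# step=0.25 removes a quarter of the original feature count per round
rfe_fast = RFE(estimator=forest, n_features_to_select=5, step=0.25)
rfe_fast.fit(X_demo, y_demo)

print("Features kept:", rfe_fast.support_.sum())
```

A larger step trades some precision in the elimination order for far fewer model fits.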

In [19]:
# Identify the selected features
selected_features = X.columns[rfe.support_]

# Create a new feature matrix with selected features
X_selected = X[selected_features]
print(X_selected.head())

# Build a decision tree model with the selected features
clf_selected = DecisionTreeClassifier(random_state=42)
clf_selected.fit(X_train[selected_features], y_train)

# Evaluate the performance of the decision tree model on the testing data
accuracy = clf_selected.score(X_test[selected_features], y_test)
print("Accuracy:", accuracy)

   Flow ID   Source IP   Destination Port   Idle Max   Idle Min
0    82985          22              54865          0          0
1    83017          37              55054          0          0
2    83018          37              55055          0          0
3    57106          45              46236          0          0
4    83023          49              54863          0          0
Accuracy: 0.9999335533455891
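A single accuracy figure can hide how each class is handled, especially when one class dominates. The notebook imports classification_report and confusion_matrix but never calls them; the sketch below shows how they could be used, on an imbalanced synthetic dataset with hypothetical variable names.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced two-class problem (roughly 80% / 20%)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, weights=[0.8, 0.2], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=1
)

model = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
print(cm)
print(classification_report(y_te, y_pred))
```

In the run above, the selected columns include the label-encoded Flow ID and Source IP, so the near-perfect accuracy may partly reflect identifiers specific to this capture rather than general traffic behaviour. Repeating the selection with those columns excluded would be one way to check.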
