# Build a Malware Classification Model Using Random Forest

In a clinical laboratory setting, a large dataset of patient test results has been compiled, including thousands of entries with numerous biological and chemical measurements. The objective is to develop a predictive model that can classify whether a patient has a particular disease based on a subset of these measurements. The dataset contains both numerical and categorical data, and there are potential correlations among the features that need to be accounted for.

The hospital's research team has decided to implement a Random Forest classifier to model the disease classification problem. However, due to the large number of features, a feature selection technique is required to identify the most relevant attributes. The team also seeks to evaluate the need for feature scaling and perform hyperparameter tuning to optimize the Random Forest model. Recursive Feature Elimination (RFE) will be used for feature selection, and GridSearchCV will be employed to find the best combination of hyperparameters.

Question:

You are tasked with building a predictive model for disease classification using the given clinical dataset, which contains thousands of entries with multiple biological measurements. To achieve this, you will:
 

1. Data Preparation:

 * Load a random sample of 1,000 records from the larger dataset.
 * Conduct a data quality check, including handling missing values and removing non-numeric columns.

2. Feature Selection:

 * Implement Recursive Feature Elimination (RFE) with a Random Forest classifier to select the top 10 most important features.

3. Train/Test Split:

 * Split the dataset into training and test sets using a 70/30 ratio.

4. Model Training and Evaluation:

 * Train a Random Forest classifier using the selected features.
 * Evaluate the model’s performance with accuracy, confusion matrix, and classification report.

5. Hyperparameter Tuning:

 * Apply GridSearchCV to tune the hyperparameters of the Random Forest model.

6. Scaling Impact Analysis:

 * Assess the impact of feature scaling by comparing the model's accuracy with and without scaling.

7. Optimization:

 * Determine the best model configuration based on hyperparameter tuning and scaling.
 

In [1]:
print("Hello, Begin Your Data Journey")


Hello, Begin Your Data Journey


In [3]:
!pip3 install matplotlib
!pip3 install scikit-learn
!pip3 install seaborn


Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Applications/Xcode.app/Contents/Developer/usr/bin/python3 -m pip install --upgrade pip' command.


In [4]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
import random
import json
import os
import warnings
warnings.filterwarnings("ignore")

jupyter_notebook_dataset = os.getenv("dataset_url", "https://d3dyfaf3iutrxo.cloudfront.net/general/upload/eab3a345c4f148748ae95eec5c9af955.csv")
n = 5210  # number of records in file
s = 1000  # desired sample size
# Randomly pick n - s row numbers to skip; row 0 (the header) is never skipped
skip = sorted(random.sample(range(1, n), n - s))
data = pd.read_csv(jupyter_notebook_dataset, skiprows=skip, header=0)
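The sampling cell above is unseeded, so each run loads a different subset, and the bounds of `range` decide how many rows survive. Below is a minimal sketch of a seeded variant. It assumes a small in-memory CSV (`csv_text`, hypothetical) in place of the real file, and the names `n_rows`, `sample_size`, and `rng` are illustrative only.

```python
import io
import random

import pandas as pd

# Hypothetical stand-in for the real CSV: one header line plus 50 data rows.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(50))

n_rows = 50        # number of data rows, header excluded
sample_size = 10   # desired sample size

# A dedicated, seeded generator makes the sample repeatable across runs.
rng = random.Random(42)

# Data rows sit on file lines 1..n_rows (line 0 is the header),
# so skipping n_rows - sample_size of them keeps exactly sample_size rows.
skip = set(rng.sample(range(1, n_rows + 1), n_rows - sample_size))

sample = pd.read_csv(io.StringIO(csv_text), skiprows=lambda i: i in skip, header=0)
print(sample.shape)
```

Passing a callable to `skiprows` avoids building a sorted list. The set lookup keeps the per-row check cheap.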



In [5]:
data


Unnamed: 0,e_cblp,e_cp,e_cparhdr,e_maxalloc,e_sp,e_lfanew,NumberOfSections,CreationYear,FH_char0,FH_char1,...,sus_sections,non_sus_sections,packer,packer_type,E_text,E_data,filesize,E_file,fileinfo,class
0,144,3,4,65535,184,184,4,1,0,1,...,1,3,0,NoPacker,5.205926,2.123522,7680,5.318221,0,0
1,144,3,4,65535,184,224,5,1,0,1,...,1,4,0,NoPacker,6.355626,0.702621,48128,5.545531,1,0
2,144,3,4,65535,184,272,8,1,0,1,...,4,4,1,Armadillov1xxv2xx,6.595606,2.843601,397936,6.295515,1,0
3,144,3,4,65535,184,128,3,1,0,1,...,1,2,1,NETexecutableMicrosoft,4.618148,0.000000,26528,3.954612,0,0
4,144,3,4,65535,184,272,8,1,0,1,...,3,5,1,Armadillov1xxv2xx,5.861511,4.873827,787048,5.864200,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
997,144,3,4,65535,184,216,4,1,1,1,...,0,4,0,NoPacker,5.951382,0.352759,11264,5.727615,0,1
998,144,3,4,65535,184,240,4,1,1,1,...,2,2,0,NoPacker,3.465415,1.282154,102400,5.700887,0,1
999,144,3,4,65535,184,216,5,1,0,1,...,2,3,0,NoPacker,6.503422,3.790871,227328,7.823114,0,1
1000,144,3,4,65535,184,248,5,1,1,1,...,1,4,0,NoPacker,6.115208,7.919091,271616,7.886012,0,1


In [6]:
# Display the first few rows of the dataset to understand its structure
data.head()


Unnamed: 0,e_cblp,e_cp,e_cparhdr,e_maxalloc,e_sp,e_lfanew,NumberOfSections,CreationYear,FH_char0,FH_char1,...,sus_sections,non_sus_sections,packer,packer_type,E_text,E_data,filesize,E_file,fileinfo,class
0,144,3,4,65535,184,184,4,1,0,1,...,1,3,0,NoPacker,5.205926,2.123522,7680,5.318221,0,0
1,144,3,4,65535,184,224,5,1,0,1,...,1,4,0,NoPacker,6.355626,0.702621,48128,5.545531,1,0
2,144,3,4,65535,184,272,8,1,0,1,...,4,4,1,Armadillov1xxv2xx,6.595606,2.843601,397936,6.295515,1,0
3,144,3,4,65535,184,128,3,1,0,1,...,1,2,1,NETexecutableMicrosoft,4.618148,0.0,26528,3.954612,0,0
4,144,3,4,65535,184,272,8,1,0,1,...,3,5,1,Armadillov1xxv2xx,5.861511,4.873827,787048,5.8642,1,0


In [7]:
# Display the shape of the dataset (rows, columns)
data.shape


(1002, 70)

In [8]:
# Generate descriptive statistics to understand the range, mean, and other info of each feature
data.describe()


Unnamed: 0,e_cblp,e_cp,e_cparhdr,e_maxalloc,e_sp,e_lfanew,NumberOfSections,CreationYear,FH_char0,FH_char1,...,LoaderFlags,sus_sections,non_sus_sections,packer,E_text,E_data,filesize,E_file,fileinfo,class
count,1002.0,1002.0,1002.0,1002.0,1002.0,1002.0,1002.0,1002.0,1002.0,1002.0,...,1002.0,1002.0,1002.0,1002.0,1002.0,1002.0,1002.0,1002.0,1002.0,1002.0
mean,155.362275,22.904192,16.787425,65048.968064,265.57485,222.666667,4.721557,0.994012,0.352295,1.0,...,1.0,1.409182,3.312375,0.164671,4.874386,2.675332,888890.9,6.375087,0.54491,0.543912
std,556.440648,633.312168,405.853396,5468.333464,2137.307744,46.596646,1.922919,0.077189,0.477924,0.0,...,0.0,1.609858,1.12781,0.371068,2.578308,2.729365,6896763.0,1.114365,0.498228,0.498317
min,0.0,0.0,0.0,0.0,0.0,12.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1536.0,0.942085,0.0,0.0
25%,144.0,3.0,4.0,65535.0,184.0,208.0,4.0,1.0,0.0,1.0,...,1.0,1.0,3.0,0.0,3.793057,0.0,61576.0,5.700882,0.0,0.0
50%,144.0,3.0,4.0,65535.0,184.0,232.0,5.0,1.0,0.0,1.0,...,1.0,1.0,4.0,0.0,6.11486,1.919523,122880.0,6.402027,1.0,1.0
75%,144.0,3.0,4.0,65535.0,184.0,248.0,5.0,1.0,1.0,1.0,...,1.0,2.0,4.0,0.0,6.496799,4.834551,321024.0,7.281841,1.0,1.0
max,17739.0,20050.0,12851.0,65535.0,65534.0,600.0,29.0,1.0,1.0,1.0,...,1.0,24.0,7.0,1.0,7.999859,7.998312,165708100.0,7.999819,1.0,1.0
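In the summary above, `FH_char1` and `LoaderFlags` show a standard deviation of 0.0, meaning they hold a single value in this sample. Below is a minimal sketch of how such constant columns might be detected and dropped. It assumes a toy frame (`df`, hypothetical) rather than the real data.

```python
import pandas as pd

# Hypothetical toy frame; 'FH_char1' mimics a column whose std is 0.
df = pd.DataFrame({
    "FH_char1": [1, 1, 1, 1],
    "filesize": [7680, 48128, 397936, 26528],
    "class": [0, 0, 1, 1],
})

# A column with one distinct value cannot help separate the classes.
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
reduced = df.drop(columns=constant_cols)

print(constant_cols)
print(reduced.columns.tolist())
```

Whether to drop such columns is a judgment call. RFE will likely rank them last anyway, but removing them first shortens the elimination loop.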


In [9]:
# Step 2: Data quality check
# Checking for missing values
data.isnull().sum()


e_cblp        0
e_cp          0
e_cparhdr     0
e_maxalloc    0
e_sp          0
             ..
E_data        0
filesize      0
E_file        0
fileinfo      0
class         0
Length: 70, dtype: int64

In [10]:
# Drop rows that contain missing values
data.dropna(inplace=True)



In [11]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1002 entries, 0 to 1001
Data columns (total 70 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   e_cblp                       1002 non-null   int64  
 1   e_cp                         1002 non-null   int64  
 2   e_cparhdr                    1002 non-null   int64  
 3   e_maxalloc                   1002 non-null   int64  
 4   e_sp                         1002 non-null   int64  
 5   e_lfanew                     1002 non-null   int64  
 6   NumberOfSections             1002 non-null   int64  
 7   CreationYear                 1002 non-null   int64  
 8   FH_char0                     1002 non-null   int64  
 9   FH_char1                     1002 non-null   int64  
 10  FH_char2                     1002 non-null   int64  
 11  FH_char3                     1002 non-null   int64  
 12  FH_char4                     1002 non-null   int64  
 13  FH_char5          

In [12]:
# Drop the non-numeric 'packer_type' column before feature selection

data.drop(columns=['packer_type'], inplace=True)
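The cell above hardcodes the column name. Below is a minimal sketch of finding non-numeric columns by dtype instead. It assumes a toy frame (`df`, hypothetical) that borrows a few column names from the dataset.

```python
import pandas as pd

# Hypothetical toy frame with one text column among numeric ones.
df = pd.DataFrame({
    "packer": [0, 1, 0],
    "packer_type": ["NoPacker", "Armadillov1xxv2xx", "NoPacker"],
    "E_file": [5.318221, 6.295515, 5.545531],
    "class": [0, 0, 1],
})

# select_dtypes(exclude="number") keeps only the columns that are not numeric.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
numeric_df = df.drop(columns=non_numeric)

print(non_numeric)
print(numeric_df.dtypes)
```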


In [13]:
# Features: every column except the last one ('class')
X = data.iloc[:, :-1]
# Target: the last column, 'class'
y = data.iloc[:, -1]


In [14]:
# Step 3: Feature Selection with Recursive Feature Elimination (RFE)
# Initialize the RandomForestClassifier with random_state=42
rf = RandomForestClassifier(random_state=42)




In [15]:
# Perform Recursive Feature Elimination to select the top 10 most important features
rfe = RFE(rf , n_features_to_select=10)

rfe = rfe.fit(X, y)
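For readers who want to see what RFE exposes after fitting, here is a minimal sketch on synthetic data. It assumes `make_classification` as a stand-in for the real features; the names `X_demo`, `y_demo`, and `selector` are illustrative. The forest is kept small only to shorten the run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in: 12 features, of which 4 carry signal.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=12, n_informative=4, random_state=42
)

# RFE refits the estimator repeatedly, dropping the least important
# feature each round (step=1) until 5 features remain.
selector = RFE(
    RandomForestClassifier(n_estimators=25, random_state=42),
    n_features_to_select=5,
    step=1,
)
selector.fit(X_demo, y_demo)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # rank 1 marks a kept feature; higher ranks were dropped earlier
```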


In [16]:
# Display selected features

X.columns[rfe.support_]





Index(['FH_char12', 'AddressOfEntryPoint', 'MajorImageVersion', 'CheckSum',
       'OH_DLLchar0', 'OH_DLLchar2', 'E_data', 'filesize', 'E_file',
       'fileinfo'],
      dtype='object')

In [17]:
# Step 4: Train/Test Split
# Select only the top 10 features based on RFE for the modeling process

# X = data.iloc[:, :-1] 
# y = data.iloc[:, -1]

X_selected = X[X.columns[rfe.support_]]

In [18]:
# Split the dataset into training (70%) and testing (30%) sets, keeping random_state=42

X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.3, random_state=42) 
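In this notebook RFE is fit on all rows before the split, so the test rows influence which features are kept. One possible alternative, sketched here on synthetic data, is to split first and wrap selection and model in a `Pipeline`. The data, the `stratify` argument, and the smaller selection forest are assumptions for illustration, not part of the original exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real feature matrix and labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=42
)

# Split first; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Feature selection now happens inside fit(), on training rows only.
pipe = Pipeline([
    ("select", RFE(RandomForestClassifier(n_estimators=25, random_state=42),
                   n_features_to_select=5)),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```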




In [19]:
# Step 5: Train the Random Forest Classifier
# Initialize the RandomForestClassifier with 100 trees (n_estimators=100), keeping random_state=42

rf_class = RandomForestClassifier(n_estimators=100, random_state=42)



In [22]:
rf_class.fit(X_train, y_train)

In [23]:
# Step 6: Model Evaluation
# Predict on the test data
y_pred = rf_class.predict(X_test)


In [24]:
# Calculate the accuracy score; the result is stored in the variable 'accuracy' in the next cell
def get_accuracy():
    ans = accuracy_score(y_test, y_pred)
    return ans

In [25]:
accuracy = get_accuracy()
accuracy


0.9601328903654485

In [26]:
# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred) 

# Calculate precision For binary classification
precision = precision_score(y_test, y_pred) 

# Calculate recall For binary classification
recall = recall_score(y_test, y_pred) 

# Calculate F1-score For binary classification
F1_score = f1_score(y_test, y_pred)
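The task list also asks for a classification report, which the cell above does not produce. Below is a minimal sketch using hand-made labels (`y_true_demo` and `y_pred_demo`, both hypothetical) so the numbers are easy to check by eye.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels: 4 true negatives/positives with one error each way.
y_true_demo = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0, 1, 0]

# For binary labels the matrix is laid out as [[TN, FP], [FN, TP]].
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()
print(cm_demo)

# The report lists precision, recall, F1 and support for each class.
print(classification_report(y_true_demo, y_pred_demo,
                            target_names=["class 0", "class 1"]))
```

On the notebook's own variables, the equivalent call would be `classification_report(y_test, y_pred)`.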


In [27]:
# Step 7: Hyperparameter Tuning with GridSearchCV
# Define a parameter grid to search through for hyperparameter tuning
# - 'n_estimators': [50, 100, 200] sets the number of trees in the forest.
# - 'max_features': [10, 20, None] sets how many features are considered at each split.
#   (The exercise brief specifies 'max_depth': [10, 20, None] for this slot; the results
#   recorded below come from the 'max_features' grid as run.)
# - 'min_samples_split': [2, 5, 10] sets the minimum number of samples required to split an internal node.

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_features': [10, 20, None],
    'min_samples_split': [2, 5, 10],
}
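The comments in this step describe a grid over `max_depth`, while the dictionary searches `max_features`. Below is a minimal sketch of a search over `max_depth`, run on synthetic data. The grid is trimmed and `cv=3` is used only to keep the run short; `depth_grid`, `search`, and the data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the 10 selected features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Trimmed grid: max_depth caps tree depth (None lets trees grow fully).
depth_grid = {
    "n_estimators": [50, 100],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=depth_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.best_score_)        # mean cross-validated accuracy on the training folds
print(search.score(X_te, y_te))  # accuracy of the refit best model on held-out rows
```

Note that `best_score_` is a cross-validation average over training data, so it is not directly comparable to a single test-set accuracy.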

In [28]:
# Initialize GridSearchCV with 5-fold cross-validation and accuracy as the scoring metric



GS = GridSearchCV(estimator=rf_class, param_grid=param_grid, cv=5, scoring='accuracy') 



In [29]:
# Fit the GridSearch model to the training data
GS.fit(X_train,y_train)


In [30]:
# Print the best parameters found by GridSearchCV

GS.best_params_


{'max_features': 10, 'min_samples_split': 2, 'n_estimators': 100}

In [31]:
# Best mean cross-validated accuracy found by GridSearchCV (computed on the training folds)
# The result is stored in the variable 'accuracy_g' in the next cell
def get_accuracy_g():
    ans = GS.best_score_
    return ans


In [32]:
accuracy_g = get_accuracy_g()
accuracy_g

0.9586322188449848

In [33]:
# Step 8: Check whether scaling is required
# StandardScaler scales the features to have zero mean and unit variance
sc = StandardScaler() 
X_train_scale = sc.fit_transform(X_train) 
X_test_scale = sc.transform(X_test) 


In [34]:
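# Fit a RandomForest classifier on the scaled data with random_state=42 and n_estimators=200
# (the unscaled model above used n_estimators=100, so the two runs differ in tree count as well as scaling)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train_scale, y_train)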


In [35]:
# Predict using the scaled data
y_pred_scale = rf.predict(X_test_scale)


In [36]:
# Accuracy on the scaled test data, for comparison with the unscaled 'accuracy' computed earlier
accuracy_score(y_test, y_pred_scale)


0.9634551495016611
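The scaled run above uses 200 trees, while the unscaled model used 100, so the small accuracy gap cannot be attributed to scaling alone. Below is a minimal sketch of a like-for-like comparison on synthetic data, with the same forest settings on both sides. The data and the helper `fit_and_score` are hypothetical. Tree splits depend on the ordering of feature values, which standardization preserves, so the two scores would be expected to match or nearly match.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the selected features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)


def fit_and_score(train_X, test_X):
    """Train an identically configured forest and return test accuracy."""
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(train_X, y_tr)
    return accuracy_score(y_te, model.predict(test_X))


# Fit the scaler on training rows only, then apply it to both parts.
scaler = StandardScaler().fit(X_tr)

acc_raw = fit_and_score(X_tr, X_te)
acc_scaled = fit_and_score(scaler.transform(X_tr), scaler.transform(X_te))

print(acc_raw, acc_scaled)
```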