# LAB: Malware detection - ML approaches
The malware detection dataset as proposed by *Borah, P., Bhattacharyya, D. K., & Kalita, J. K. (2020, December). Malware dataset generation and evaluation. In 2020 IEEE 4th Conference on Information & Communication Technology (CICT) (pp. 1-6). IEEE.* contains 4465 instances and 241 attributes. The target is categorical (malware - goodware).

### Imports

In [9]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error, confusion_matrix
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.svm import SVC
import xgboost as xgb
import warnings
warnings.filterwarnings("ignore")

### Load dataset
Load the malware dataset and show the first 10 lines

In [14]:
df = pd.read_csv("TUANDROMD.csv")
df.head(10)

Unnamed: 0,ACCESS_ALL_DOWNLOADS,ACCESS_CACHE_FILESYSTEM,ACCESS_CHECKIN_PROPERTIES,ACCESS_COARSE_LOCATION,ACCESS_COARSE_UPDATES,ACCESS_FINE_LOCATION,ACCESS_LOCATION_EXTRA_COMMANDS,ACCESS_MOCK_LOCATION,ACCESS_MTK_MMHW,ACCESS_NETWORK_STATE,...,Landroid/telephony/TelephonyManager;->getLine1Number,Landroid/telephony/TelephonyManager;->getNetworkOperator,Landroid/telephony/TelephonyManager;->getNetworkOperatorName,Landroid/telephony/TelephonyManager;->getNetworkCountryIso,Landroid/telephony/TelephonyManager;->getSimOperator,Landroid/telephony/TelephonyManager;->getSimOperatorName,Landroid/telephony/TelephonyManager;->getSimCountryIso,Landroid/telephony/TelephonyManager;->getSimSerialNumber,Lorg/apache/http/impl/client/DefaultHttpClient;->execute,Label
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,malware
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,malware
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,malware
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,malware
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,malware
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,malware
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,malware
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,malware
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,malware
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,malware


### Exploratory Data Analysis
Perform EDA on the dataset

- Give the shape of the dataset
- Check for unique and missing values
- Drop the rows with missing values
- Encode the label with a LabelEncoder
- Count the target classes: is there an imbalance?
- Convert all columns to integer type

In [44]:
print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} features")

The dataset has 4465 rows and 242 features


In [50]:
for col in df.columns:
    print(f"{col} has the following values: {set(df[col].unique())} and contains {df[col].isna().sum()} missing values")

ACCESS_ALL_DOWNLOADS has the following values: {0.0, nan, 1.0} and contains 1 missing values
ACCESS_CACHE_FILESYSTEM has the following values: {0.0, 1.0, nan} and contains 1 missing values
ACCESS_CHECKIN_PROPERTIES has the following values: {0.0, 1.0, nan} and contains 1 missing values
ACCESS_COARSE_LOCATION has the following values: {0.0, 1.0, nan} and contains 1 missing values
ACCESS_COARSE_UPDATES has the following values: {0.0, 1.0, nan} and contains 1 missing values
ACCESS_FINE_LOCATION has the following values: {0.0, 1.0, nan} and contains 1 missing values
ACCESS_LOCATION_EXTRA_COMMANDS has the following values: {0.0, 1.0, nan} and contains 1 missing values
ACCESS_MOCK_LOCATION has the following values: {0.0, 1.0, nan} and contains 1 missing values
ACCESS_MTK_MMHW has the following values: {0.0, nan, 1.0} and contains 1 missing values
ACCESS_NETWORK_STATE has the following values: {0.0, 1.0, nan} and contains 1 missing values
ACCESS_PROVIDER has the following values: {0.0, nan} a

In [52]:
df = df.dropna()
print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} features after deleting rows with missing values")

The dataset has 4464 rows and 242 features after deleting rows with missing values


In [62]:
le = LabelEncoder()
df["Label"] = le.fit_transform(df["Label"])
print(f"Classes: {le.classes_}")
df.head().T

Classes: [0 1]


Unnamed: 0,0,1,2,3,4
ACCESS_ALL_DOWNLOADS,0.0,0.0,0.0,0.0,0.0
ACCESS_CACHE_FILESYSTEM,0.0,0.0,0.0,0.0,0.0
ACCESS_CHECKIN_PROPERTIES,0.0,0.0,0.0,0.0,0.0
ACCESS_COARSE_LOCATION,0.0,0.0,0.0,0.0,0.0
ACCESS_COARSE_UPDATES,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...
Landroid/telephony/TelephonyManager;->getSimOperatorName,0.0,0.0,0.0,0.0,0.0
Landroid/telephony/TelephonyManager;->getSimCountryIso,0.0,1.0,0.0,1.0,0.0
Landroid/telephony/TelephonyManager;->getSimSerialNumber,0.0,0.0,0.0,0.0,0.0
Lorg/apache/http/impl/client/DefaultHttpClient;->execute,1.0,0.0,0.0,0.0,0.0


0 = goodware; 1 = malware

In [65]:
df["Label"].value_counts() / df.shape[0] * 100

Label
1    79.861111
0    20.138889
Name: count, dtype: float64

The dataset is imbalanced: almost 80% of the samples belong to the malware class and only about 20% to the goodware class.
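
Because of this imbalance, accuracy figures are easiest to read against a majority-class baseline. The following is a minimal sketch, assuming the class counts implied by the percentages above; the variable names are illustrative and not part of the lab code.

```python
# Row count after dropna() and the malware share reported above
n_total = 4464
malware_share = 0.79861111

# Class counts implied by the reported percentages
n_malware = round(malware_share * n_total)
n_goodware = n_total - n_malware

# A trivial classifier that always predicts "malware" is right on every malware sample
baseline_accuracy = n_malware / n_total

print(f"malware: {n_malware}, goodware: {n_goodware}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.2%}")
```

Any classifier trained later should be judged by how far it exceeds this baseline, not by its raw accuracy alone.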

In [75]:
# Cast every column to integer in one call
df = df.astype(int)

### Feature selection
*Lasso regression* is commonly used for feature selection because of its ability to automatically select features by shrinking coefficients of less important features to zero. This is achieved via $L_1$ regularization, penalizing the sum of the absolute values of the coefficients. 

Let $w = (w_1, \dots, w_p)$ be the coefficients, $X$ the input matrix, $y$ the true output, and $\alpha$ the regularization parameter. Lasso is a linear model that minimizes the squared error plus a regularization term, where $||w||_1$ is the $L_1$ norm of the coefficient vector:

$$\frac{1}{2 n_{samples}} ||y - Xw||^2_2 + \alpha ||w||_1$$
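
Why the $L_1$ penalty produces exact zeros can be seen in a special case. Assuming standardized, uncorrelated features ($X^\top X / n_{samples} = I$), minimizing the objective above reduces to soft-thresholding each least-squares coefficient by $\alpha$. The sketch below illustrates this under that assumption; the coefficient values are made up and `soft_threshold` is an illustrative helper, not a scikit-learn function.

```python
def soft_threshold(z, alpha):
    """Shrink z toward zero by alpha; values with |z| <= alpha become exactly 0."""
    if z > alpha:
        return z - alpha
    if z < -alpha:
        return z + alpha
    return 0.0

# Hypothetical least-squares coefficients for four features
ols_coefs = [0.9, -0.3, 0.05, -0.0002]
alpha = 0.1

lasso_coefs = [soft_threshold(z, alpha) for z in ols_coefs]
kept = [i for i, w in enumerate(lasso_coefs) if w != 0.0]

print(lasso_coefs)
print("indices of surviving features:", kept)
```

Features whose least-squares coefficient is smaller in magnitude than $\alpha$ are dropped entirely, which is what makes Lasso usable as a feature selector.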

- Perform Lasso regression with $\alpha=0.0004$ on the dataset
- Drop the features with a zero coefficient from the important feature list

In [80]:
X = df.drop("Label", axis=1)
lasso = Lasso(alpha=0.0004)
lasso.fit(X, df["Label"])
# Keep only the features whose Lasso coefficient is non-zero
important_features = list(X.columns[lasso.coef_ != 0])
important_features

['ACCESS_COARSE_LOCATION',
 'ACCESS_FINE_LOCATION',
 'ACCESS_NETWORK_STATE',
 'ACCESS_WIFI_STATE',
 'BATTERY_STATS',
 'BLUETOOTH',
 'CALL_PHONE',
 'CHANGE_NETWORK_STATE',
 'CLEAR_APP_CACHE',
 'DISABLE_KEYGUARD',
 'FLASHLIGHT',
 'GET_ACCOUNTS',
 'GET_TASKS',
 'INTERNET',
 'KILL_BACKGROUND_PROCESSES',
 'MANAGE_ACCOUNTS',
 'NFC',
 'PROCESS_OUTGOING_CALLS',
 'READ_CALL_LOG',
 'READ_CONTACTS',
 'READ_EXTERNAL_STORAGE',
 'READ_LOGS',
 'READ_PHONE_STATE',
 'READ_SMS',
 'READ_SOCIAL_STREAM',
 'RECEIVE_BOOT_COMPLETED',
 'RECEIVE_SMS',
 'RECEIVE_WAP_PUSH',
 'RECORD_AUDIO',
 'RESTART_PACKAGES',
 'SEND_SMS',
 'SET_TIME_ZONE',
 'SYSTEM_ALERT_WINDOW',
 'USE_SIP',
 'VIBRATE',
 'WAKE_LOCK',
 'WRITE_EXTERNAL_STORAGE',
 'WRITE_INTERNAL_STORAGE',
 'WRITE_SETTINGS',
 'WRITE_SOCIAL_STREAM',
 'Ljava/lang/reflect/Method;->invoke',
 'Ljavax/crypto/Cipher;->doFinal',
 'Ljava/lang/Runtime;->exec',
 'Ljava/lang/System;->load',
 'Ldalvik/system/DexClassLoader;->loadClass',
 'Ljava/net/URL;->openConnection',
 'Lan

Since this is a classification problem, we can use logistic regression to verify the feature selection. We compare the performance of the logistic model that uses all features and a logistic model using the selected features. 

- Perform logistic regression on all features
- Perform logistic regression on the selected features

In [90]:
X_train, X_test, y_train, y_test = train_test_split(df.drop("Label", axis=1), df["Label"], test_size=0.2, random_state=22)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(f"Accuracy with all features: {round(accuracy_score(y_test, clf.predict(X_test)) * 100, 3)} %")

Accuracy with all features: 98.656 %


In [92]:
X_train, X_test, y_train, y_test = train_test_split(df[important_features], df["Label"], test_size=0.2, random_state=22)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(f"Accuracy with selected features: {round(accuracy_score(y_test, clf.predict(X_test)) * 100, 3)} %")

Accuracy with selected features: 98.88 %


### ML model training
Compare the performance of a variety of ML models on the selected features. Track the accuracy, the MSE, and the number of false positives (FP) and false negatives (FN). Which classifier scores best?
- SVM (use the Support Vector Classifier, SVC)
- XGBoost Classifier
- Logistic Regression
- AdaBoost Classifier
- K-NN Classifier
- Random Forest Classifier
- Decision Tree Classifier
- Gaussian Process Classifier
- Gradient Boosting Classifier
- Histogram-Based Gradient Boosting Classifier

#### SVM Classifier

The SVC is a support vector machine classifier with a regularization parameter $C$; the strength of the regularization is inversely proportional to $C$. If $C$ is small, the SVM focuses on achieving a large margin and tolerates more misclassifications. If $C$ is large, it emphasizes minimizing the training error, which results in a narrower margin.
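
The role of $C$ is easiest to see in the standard soft-margin formulation, given here as a reference sketch. It assumes labels $y_i \in \{-1, +1\}$, a feature map $\phi$ induced by the kernel, and slack variables $\xi_i$ that measure how far each sample violates the margin:

$$\min_{w, b, \xi} \; \frac{1}{2} ||w||^2_2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i \left(w^\top \phi(x_i) + b\right) \geq 1 - \xi_i, \quad \xi_i \geq 0$$

The first term favors a wide margin and the second penalizes margin violations, so $C$ sets the exchange rate between the two.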

#### Logistic Regression

Logistic (or logit) regression is used primarily for binary classification tasks. The goal is to predict the probability that a data instance belongs to one of two classes. The probability of the positive class for data point $i$ is predicted as:
$$P(y_i = 1 | X_i) = \sigma (X_i w + w_0)$$
where $\sigma(z) = 1 / (1 + e^{-z})$ is the sigmoid function, $w$ the weight vector, and $w_0$ the intercept.
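
As a worked example of this formula, the sketch below evaluates the sigmoid $\sigma(z) = 1 / (1 + e^{-z})$ of a linear score for a single sample. The weights, intercept, and feature vector are hypothetical values chosen for illustration; they are not fitted to the dataset.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, w0):
    """P(y = 1 | x) for one sample x, given weights w and intercept w0."""
    score = sum(xj * wj for xj, wj in zip(x, w)) + w0
    return sigmoid(score)

# Hypothetical weights for three binary features (not fitted to the dataset)
w = [2.0, -1.0, 0.5]
w0 = -0.5

x = [1, 0, 1]  # features 1 and 3 present
p = predict_proba(x, w, w0)
label = int(p >= 0.5)  # threshold the probability at 0.5
print(f"P(y=1|x) = {p:.3f}, predicted label = {label}")
```

The linear score here is $2.0 + 0.5 - 0.5 = 2.0$, which the sigmoid maps to a probability of about 0.88.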

#### Gaussian Process Classifier

The Gaussian Process Classifier (GPC) is a non-parametric, probabilistic model for classification. It places a Gaussian process prior on a latent function over the feature space, so that the function values at any finite collection of points follow a joint Gaussian distribution. This lets GPC quantify uncertainty in its predictions by estimating a distribution over the functions that could fit the training data, rather than a single fixed decision boundary. GPC uses kernel functions to measure similarity between points, which allows it to model complex, non-linear relationships in the data. During prediction, GPC provides not only a class label but also a confidence estimate in the form of the posterior probability of each class. The main limitation of GPCs is computational: they scale poorly to large datasets because training requires inverting large covariance matrices.
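
To make the notion of a kernel as a similarity measure concrete, here is a minimal sketch that builds a covariance matrix with an RBF kernel, a common choice. The feature vectors and the `rbf_kernel` helper are illustrative assumptions, not the scikit-learn implementation.

```python
import math

def rbf_kernel(a, b, length_scale=1.0):
    """Similarity of two feature vectors: 1 for identical points, decaying with distance."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))

# Hypothetical binary feature vectors (e.g. permission flags)
apps = [
    [1, 0, 1, 1],
    [1, 0, 1, 0],  # differs from the first app in one feature
    [0, 1, 0, 0],  # differs from the first app in all four features
]

# Covariance (Gram) matrix: an n x n matrix for n training points
K = [[rbf_kernel(a, b) for b in apps] for a in apps]
for row in K:
    print([round(v, 3) for v in row])
```

The matrix has one row and one column per training point, which is why the cost of working with it grows quickly as the dataset gets larger.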

#### Gradient Boosting Classifier

Gradient boosting is an ensemble method that combines the predictions of several base estimators, typically shallow decision trees, to improve generalization over a single estimator. It builds an additive model in a forward stage-wise fashion, which allows for the optimization of arbitrary differentiable loss functions.
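
The forward stage-wise idea can be sketched on a toy problem. The example below is a simplification that assumes squared loss, one-split regression stumps, and made-up 1-D data; the classifier used in this lab optimizes a classification loss, and `fit_stump` is an illustrative helper.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression stump to the residuals (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

# Toy 1-D data (hypothetical): the target switches from 0 to 1 between x=3 and x=4
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]
learning_rate = 0.5

# Stage 0: start from the constant prediction that minimizes squared loss
prediction = [sum(y) / len(y)] * len(y)
mse_per_stage = []

for stage in range(3):
    mse_per_stage.append(sum((yi - pi) ** 2 for yi, pi in zip(y, prediction)) / len(y))
    # For squared loss the negative gradient is the residual y - F(x)
    residuals = [yi - pi for yi, pi in zip(y, prediction)]
    stump = fit_stump(x, residuals)
    # Forward stage-wise update: add the new learner, leave earlier ones untouched
    prediction = [pi + learning_rate * stump(xi) for pi, xi in zip(prediction, x)]

print(mse_per_stage)
```

Each stage fits a new weak learner to what the current ensemble still gets wrong, so the training error shrinks from stage to stage.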

#### Histogram-Based Gradient Boosting

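Histogram-based gradient boosting is a faster variant of the gradient boosting algorithm for intermediate and large datasets ($n \geq 10\,000$ samples). Feature values are grouped into a small number of bins, which reduces the number of candidate split points and therefore the computational cost.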
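
The binning step can be sketched as follows. This is a simplified illustration that assumes equal-width bins and a made-up continuous feature; the binning strategy of the actual library implementation may differ, and both helper functions are hypothetical.

```python
from bisect import bisect_right

def make_bin_edges(values, n_bins):
    """Equal-width bin edges between min and max (interior edges only)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]

def to_bins(values, edges):
    """Map each value to the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]

# Hypothetical continuous feature
feature = [0.1, 0.4, 0.35, 2.7, 3.9, 1.5, 0.05, 3.2]
edges = make_bin_edges(feature, n_bins=4)
binned = to_bins(feature, edges)
print(binned)
```

After binning, a tree only has to consider splits at the bin edges (three here) instead of between every pair of distinct values.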

#### AdaBoost Classifier

AdaBoost is one of the more popular boosting algorithms. The basic principle is to fit a sequence of weak learners (models that are only slightly better than random guessing, such as small decision trees) on repeatedly reweighted versions of the data. The predictions of all learners are then combined through a weighted majority vote to produce the final prediction.
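
The weighted majority vote can be sketched as below. The example assumes the classic discrete AdaBoost formulation with labels in $\{-1, +1\}$ and made-up error rates; the scikit-learn implementation may differ in its details, and both helper functions are illustrative.

```python
import math

def learner_weight(error_rate):
    """Vote weight of a weak learner in discrete AdaBoost (labels in {-1, +1})."""
    return 0.5 * math.log((1.0 - error_rate) / error_rate)

def weighted_vote(predictions, weights):
    """Sign of the weighted sum of the learners' -1/+1 predictions."""
    total = sum(a * p for a, p in zip(weights, predictions))
    return 1 if total >= 0 else -1

# Hypothetical weighted training error rates of three weak learners
errors = [0.10, 0.40, 0.45]
alphas = [learner_weight(e) for e in errors]

# The accurate learner says +1, the two weaker learners say -1
predictions = [1, -1, -1]
print([round(a, 3) for a in alphas])
print("final prediction:", weighted_vote(predictions, alphas))
```

The single accurate learner outvotes the two weaker ones because its weight exceeds the sum of theirs, which is the difference between a weighted and a plain majority vote.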

#### XGBoost Classifier

XGBoost, or Extreme Gradient Boosting, is an advanced implementation of gradient boosting designed to optimize both speed and predictive performance. It builds an ensemble of decision trees sequentially, where each new tree is fitted to correct the errors of the current ensemble. Unlike standard gradient boosting, XGBoost introduces several innovations, such as explicit regularization (to prevent overfitting), efficient handling of missing values, and efficient use of hardware resources, which enables parallel processing and improved computational efficiency.
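
The regularization mentioned above can be written down explicitly. As a reference sketch following the usual presentation of XGBoost, the training objective adds a complexity penalty $\Omega$ for each tree $f_k$ to the loss $l$; here $T$ is the number of leaves of a tree, $w$ denotes its vector of leaf weights (not the linear coefficients used earlier), and $\gamma$ and $\lambda$ are regularization parameters:

$$\mathcal{L} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2} \lambda ||w||^2_2$$

Larger values of $\gamma$ and $\lambda$ favor smaller trees with more conservative leaf values.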

In [98]:
models = [
    SVC(),
    xgb.XGBClassifier(random_state=22),  # max_iter is not an XGBoost parameter and was ignored
    LogisticRegression(),
    AdaBoostClassifier(n_estimators=2000),
    KNeighborsClassifier(n_neighbors=10),
    RandomForestClassifier(n_estimators=100, max_features='log2'),
    DecisionTreeClassifier(max_depth=10),
    GaussianProcessClassifier(),
    GradientBoostingClassifier(n_estimators=2000),
    HistGradientBoostingClassifier(max_iter=100),
]

In [126]:
best_model = None
best_acc = None
best_loss = None
for clf in models:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    acc = round(accuracy_score(y_test, y_pred) * 100, 4)
    # For 0/1 labels the MSE equals the misclassification rate
    mse = round(mean_squared_error(y_test, y_pred), 5)
    cnf = confusion_matrix(y_test, y_pred)
    # cnf[0][1]: goodware predicted as malware (FP); cnf[1][0]: malware predicted as goodware (FN)
    print(f"{clf.__class__.__name__:30}: Accuracy: {acc:7} %, MSE: {mse:7} FP|FN: {cnf[0][1]:2}|{cnf[1][0]:2}")

    # Keep the model with the lowest MSE seen so far
    if best_loss is None or mse < best_loss:
        best_model = clf
        best_acc = acc
        best_loss = mse

print("-"*92)
# Report the accuracy of the best model, not of the last model in the loop
print(f"{best_model.__class__.__name__:30}: Accuracy: {best_acc:7} %, MSE: {best_loss:7}")

SVC                           : Accuracy: 98.8802 %, MSE:  0.0112 FP|FN:  4| 6
XGBClassifier                 : Accuracy: 99.4401 %, MSE:  0.0056 FP|FN:  3| 2
LogisticRegression            : Accuracy: 98.8802 %, MSE:  0.0112 FP|FN:  3| 7
AdaBoostClassifier            : Accuracy: 98.3203 %, MSE:  0.0168 FP|FN:  9| 6
KNeighborsClassifier          : Accuracy: 98.5442 %, MSE: 0.01456 FP|FN:  6| 7
RandomForestClassifier        : Accuracy: 99.6641 %, MSE: 0.00336 FP|FN:  2| 1
DecisionTreeClassifier        : Accuracy: 98.8802 %, MSE:  0.0112 FP|FN:  8| 2
GaussianProcessClassifier     : Accuracy: 99.4401 %, MSE:  0.0056 FP|FN:  2| 3
GradientBoostingClassifier    : Accuracy:  99.776 %, MSE: 0.00224 FP|FN:  1| 1
HistGradientBoostingClassifier: Accuracy: 99.5521 %, MSE: 0.00448 FP|FN:  3| 1
--------------------------------------------------------------------------------------------
GradientBoostingClassifier    : Accuracy:  99.776 %, MSE: 0.00224
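
With 0/1 labels, the three tracked metrics are tied together: every error contributes a squared error of 1, so the MSE is the misclassification rate $(FP + FN) / n_{test}$ and the accuracy is one minus that rate. The sketch below checks this against two rows of the results above. It assumes that the 20% test split of the 4464 rows is rounded up to a whole number of samples; `summarize` is an illustrative helper.

```python
import math

n_rows = 4464      # rows after dropping missing values
test_size = 0.2    # as passed to train_test_split
n_test = math.ceil(n_rows * test_size)  # assumption: the test size is rounded up

def summarize(fp, fn, n):
    """Accuracy (%) and MSE implied by the error counts for 0/1 labels."""
    error_rate = (fp + fn) / n
    return round((1 - error_rate) * 100, 4), round(error_rate, 5)

# FP|FN counts taken from the results above
print("SVC:", summarize(4, 6, n_test))
print("GradientBoostingClassifier:", summarize(1, 1, n_test))
```

Because the MSE carries no information beyond the accuracy here, the FP and FN counts are the more informative columns when comparing classifiers: a missed malware sample (FN) and a false alarm on goodware (FP) usually have different costs.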
