# Project Overview
The goal of this project is to compare the accuracy of popular, well-studied machine learning models (such as the Extra Trees Classifier available through PyCaret) with that of a novel machine learning model called a "KAN", whose Python implementation had been available for only about a week when this project was carried out. By the end of the project, we hope to determine whether KAN models can predict malware in our dataset more accurately than the more traditional models. The comparison is based on which model achieves the higher test accuracy.

## What is a KAN?
KANs (Kolmogorov-Arnold Networks) are a cutting-edge type of artificial neural network that draws inspiration from the Kolmogorov-Arnold representation theorem. Instead of fixed activation functions on the nodes, KANs place learnable univariate spline functions on the edges of the network. The idea is that these splines can capture more complex relationships in the data with less training.
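For reference, the theorem that KANs take their name from can be written as below. This is the standard textbook statement for a continuous function $f$ on $[0,1]^n$, not something derived in this notebook: $\Phi_q$ and $\phi_{q,p}$ are continuous functions of a single variable, so the only truly multivariate operation is addition.

$$
f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)
$$

A KAN replaces these univariate functions with learnable splines and stacks the construction into layers.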

## How do KANs differ from Multi-Layer Perceptrons (MLPs)?
In the study we are replicating, Kolmogorov-Arnold Networks (KANs) are compared to Multi-Layer Perceptrons (MLPs). Compared to MLPs, KANs are reported to learn more complex tasks with less training data, thanks to the spline functions mentioned above. The spline functions also help with interpretability, because they provide additional insight into how the relationships were learned. The research paper claims that "a 2-Layer width-10 KAN is 100 times more accurate than a 4-Layer width-100 MLP ($10^{-7}$ vs $10^{-5}$ MSE) and 100 times more parameter efficient ($10^{2}$ vs $10^{4}$ parameters)." Will this hold up in our testing in Colab? Let's find out!



## Resources Utilized
- https://archive.ics.uci.edu/dataset/722/naticusdroid+android+permissions+dataset
- https://arxiv.org/pdf/2404.19756
- https://github.com/KindXiaoming/pykan/blob/master/tutorials/Example_3_classfication.ipynb

# Imports & Installs

In [None]:
!pip install pycaret
!pip install pykan
!pip install ucimlrepo

import pandas as pd
import matplotlib.pyplot as plt
import torch
import numpy as np
from pycaret.classification import *
from ucimlrepo import fetch_ucirepo
from kan import KAN
from sklearn.model_selection import train_test_split



# Get Data from UCI Repository

In [None]:
# fetch dataset
naticusdroid_android_permissions = fetch_ucirepo(id=722)

# data (as pandas dataframes)
X = naticusdroid_android_permissions.data.features
Y = naticusdroid_android_permissions.data.targets.reset_index()

# metadata
print(naticusdroid_android_permissions.metadata)

# variable information
print(naticusdroid_android_permissions.variables)

{'uci_id': 722, 'name': 'NATICUSdroid (Android Permissions)', 'repository_url': 'https://archive.ics.uci.edu/dataset/722/naticusdroid+android+permissions+dataset', 'data_url': 'https://archive.ics.uci.edu/static/public/722/data.csv', 'abstract': 'Contains permissions extracted from more than 29000 benign & malware Android apps released between 2010-2019.', 'area': 'Computer Science', 'tasks': ['Classification'], 'characteristics': ['Tabular'], 'num_instances': 29333, 'num_features': 86, 'feature_types': [], 'demographics': [], 'target_col': ['Result'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2021, 'last_updated': 'Tue Apr 09 2024', 'dataset_doi': '10.24432/C5FS64', 'creators': ['Akshay Mathur'], 'intro_paper': {'title': 'NATICUSdroid: A malware detection framework for Android using native and custom permissions', 'authors': 'A. Mathur, Laxmi M. Podila, Keyur Kulkarni, Quamar Niyaz, A. Javaid', 'published_in': 'J. Inf. Se

# Combine Data into a Merged DF

In [None]:
# Merge to create a comprehensive df
df = pd.merge(X, Y, left_index = True, right_index = True)
df = df.drop(columns=['index']) # Remove index column

df

Unnamed: 0,android.permission.GET_ACCOUNTS,com.sonyericsson.home.permission.BROADCAST_BADGE,android.permission.READ_PROFILE,android.permission.MANAGE_ACCOUNTS,android.permission.WRITE_SYNC_SETTINGS,android.permission.READ_EXTERNAL_STORAGE,android.permission.RECEIVE_SMS,com.android.launcher.permission.READ_SETTINGS,android.permission.WRITE_SETTINGS,com.google.android.providers.gsf.permission.READ_GSERVICES,...,com.android.launcher.permission.UNINSTALL_SHORTCUT,com.sec.android.iap.permission.BILLING,com.htc.launcher.permission.UPDATE_SHORTCUT,com.sec.android.provider.badge.permission.WRITE,android.permission.ACCESS_NETWORK_STATE,com.google.android.finsky.permission.BIND_GET_INSTALL_REFERRER_SERVICE,com.huawei.android.launcher.permission.READ_SETTINGS,android.permission.READ_SMS,android.permission.PROCESS_INCOMING_CALLS,Result
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29327,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
29328,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
29329,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
29330,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


# Exploring the Data

## Check the value counts
Checking the value counts gives a more in-depth understanding of the data, particularly how many rows are labeled as malware or not in the 'Result' column of our DF. The output shows that the numbers of 0s and 1s are approximately even, given the size of the dataset.

In [None]:
df['Result'].value_counts()

Result
1    14700
0    14632
Name: count, dtype: int64
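To put the class balance in numbers, the sketch below recomputes each class's share from the counts printed above, along with the accuracy that a trivial "always predict the most common class" model would reach. The variable names are illustrative only and are not used elsewhere in this notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 14700, 0: 14632}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most common class.
majority_baseline = max(shares.values())

print(f"total rows: {total}")
for label, share in shares.items():
    print(f"class {label}: {share:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

Because the baseline sits very close to 0.5, accuracy is a reasonable headline metric for this dataset.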

# Shuffle the Dataframe

In [None]:
shuffled_df = df.sample(frac=1, random_state=1).reset_index(drop=True)

# Define Test and Train Data
- First 80% of the rows should be for training
- Last 20% of the rows should be for testing

In [None]:
# Calculate the index for the split
split_index = int(len(shuffled_df) * 0.8)

# Split the dataframe into training and testing sets
train_df = shuffled_df.iloc[:split_index]
test_df = shuffled_df.iloc[split_index:]

# Now you have train_df for training and test_df for testing
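As a quick sanity check on the split, the sketch below reproduces the index arithmetic with plain integers. It assumes the merged dataframe has 29,332 rows (the sum of the two value counts above); the variable names are illustrative.

```python
n_rows = 29332        # assumed row count: 14700 + 14632 from the value counts
train_fraction = 0.8

# int() truncates 23465.6 down to 23465, exactly as in the split cell above.
split_index = int(n_rows * train_fraction)

n_train = split_index           # rows selected by iloc[:split_index]
n_test = n_rows - split_index   # rows selected by iloc[split_index:]

print(f"train rows: {n_train}")
print(f"test rows:  {n_test}")
print(f"actual test share: {n_test / n_rows:.4f}")
```

These sizes are worth comparing against the train and test set shapes that PyCaret's `setup` reports.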

# PyCaret Model Experimentation

In [None]:
clf_setup = setup(
    data = train_df,
    target = 'Result',
    session_id = 2024,
    test_data = test_df  # without this, PyCaret would hold out 30% of train_df as its own test set
    )

best_model = compare_models(n_select=5)

Unnamed: 0,Description,Value
0,Session id,2024
1,Target,Result
2,Target type,Binary
3,Original data shape,"(29332, 87)"
4,Transformed data shape,"(29332, 87)"
5,Transformed train set shape,"(23465, 87)"
6,Transformed test set shape,"(5867, 87)"
7,Numeric features,86
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.9715,0.9921,0.9662,0.9768,0.9715,0.9431,0.9431,3.18
rf,Random Forest Classifier,0.9709,0.9932,0.9668,0.9751,0.9709,0.9419,0.9419,2.452
xgboost,Extreme Gradient Boosting,0.9693,0.9938,0.9675,0.9713,0.9694,0.9386,0.9387,0.898
lightgbm,Light Gradient Boosting Machine,0.9667,0.9937,0.9667,0.967,0.9668,0.9334,0.9335,1.398
dt,Decision Tree Classifier,0.9647,0.9775,0.9602,0.9691,0.9646,0.9293,0.9294,0.253
knn,K Neighbors Classifier,0.9626,0.9815,0.9663,0.9595,0.9628,0.9252,0.9252,1.149
gbc,Gradient Boosting Classifier,0.9603,0.9896,0.9661,0.9553,0.9606,0.9206,0.9206,3.379
lr,Logistic Regression,0.9589,0.9887,0.9652,0.9535,0.9593,0.9177,0.9178,1.102
svm,SVM - Linear Kernel,0.9572,0.9874,0.9634,0.952,0.9576,0.9144,0.9146,0.241
ada,Ada Boost Classifier,0.9571,0.9876,0.9652,0.9502,0.9576,0.9142,0.9144,1.287



## Predicting on Test Data

In [None]:
# Use the predict_model function with the best model and the test data
predictions = predict_model(best_model[0])  # index 0 is the top-ranked model in the table above (Extra Trees Classifier)

# predictions now contains the predictions along with the original data.
# It adds columns for the label (prediction) and the score (probability of the positive class for binary classification)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.9693,0.992,0.9648,0.9735,0.9691,0.9386,0.9387


# PyKAN Experimentation

## Sampling the Dataframe
Due to the computational requirements of the KAN algorithm and the memory limitations of Colab, we decided to sample the dataframe. The original dataset has ~29,000 rows, while the sample has 5,000. We do not expect this to affect training and testing accuracy much, since 5,000 rows is still a substantial amount of data, but it does mean the KAN is trained on fewer rows than the PyCaret models.

In [None]:
pykan_sampled_df = (
    df
    .sample(n=5000, random_state=2024)
    .reset_index(drop=True)
)

## Shuffling the Dataframe

In [None]:
pykan_shuffled_df = (
    pykan_sampled_df
    .sample(frac=1, random_state=2024)
    .reset_index(drop=True)
)

## Check 'Result' Count
This step ensures that the numbers of 1s and 0s in the 'Result' column are still roughly even (50/50) after sampling. The output below shows that they are.

In [None]:
pykan_shuffled_df['Result'].value_counts()

Result
1    2518
0    2482
Name: count, dtype: int64

## Training & Testing the Model
Training and testing the model, following pykan's classification example, gave us a training accuracy of 0.9512 and a test accuracy of 0.941. What does this really mean? Training accuracy measures how well the model fits the data it was trained on; it is calculated by comparing the model's predictions for the training rows against their actual labels. In our case, pykan learned the patterns in our training data with only about a 5% error. Test accuracy measures how well the model performs on data it has not seen before, which makes it a more reliable indicator of how accurate the model really is. Our test accuracy of 0.941 means that the model correctly classified 94.1% of the apps in the test set, malware and benign alike.

We accomplished this with the following settings:
- Training data percentage: 80%
- Testing data percentage: 20%
- Width of [86, 1]: 86 predictor variables feed directly into one outcome variable
- Grid: 7, k: 7
- Optimizer: LBFGS
- Steps: 20

We experimented with numerous optimizer types, step counts, grid values, and k values. After all of our testing, these settings struck the best balance between performance and time efficiency. This configuration took about 5 minutes to execute in Colab. Any more computational demand would have made run times excessive for only marginal benefit.
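To give a feel for why larger grid and k values cost more, here is a rough sketch that counts spline coefficients under the assumption that each edge carries `grid + k` B-spline coefficients. Both helper functions are hypothetical and written for this illustration; a real pykan model stores additional parameters (for example scaling factors), so its true parameter count will differ.

```python
def kan_spline_coefficients(width, grid, k):
    """Rough estimate: (grid + k) spline coefficients per edge between adjacent layers."""
    edges = sum(n_in * n_out for n_in, n_out in zip(width[:-1], width[1:]))
    return edges * (grid + k)


def mlp_parameters(width):
    """Weights plus biases of a fully connected network with the same layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(width[:-1], width[1:]))


# Settings used in this notebook: width=[86, 1], grid=7, k=7.
print(kan_spline_coefficients([86, 1], grid=7, k=7))  # 86 edges * 14 coefficients
# A smaller grid and order, shown only for comparison.
print(kan_spline_coefficients([86, 1], grid=3, k=3))  # 86 edges * 6 coefficients
# A linear layer of the same shape has one weight per edge plus one bias.
print(mlp_parameters([86, 1]))
```

Under this estimate the coefficient count grows linearly with `grid + k`, which is consistent with the longer run times we saw at larger settings.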

In [None]:
# Split the data into features (X) and target (y)
X = pykan_shuffled_df.drop('Result', axis=1)
y = pykan_shuffled_df['Result']

# Split the data into train and test sets
train_input, test_input, train_label, test_label = train_test_split(X, y, test_size=0.2, random_state=2024)

dataset = {}
dataset['train_input'] = torch.from_numpy(train_input.values).float()
dataset['test_input'] = torch.from_numpy(test_input.values).float()
dataset['train_label'] = torch.from_numpy(train_label.values).float().unsqueeze(1)
dataset['test_label'] = torch.from_numpy(test_label.values).float().unsqueeze(1)

model = KAN(width=[86, 1], grid=7, k=7)  # Update the width to match the number of features

def train_acc():
    return torch.mean((torch.round(model(dataset['train_input'])[:,0]) == dataset['train_label'][:,0]).float())

def test_acc():
    return torch.mean((torch.round(model(dataset['test_input'])[:,0]) == dataset['test_label'][:,0]).float())

results = model.train(dataset, opt="LBFGS", steps=20, metrics=(train_acc, test_acc))
print(f"\nFinal train accuracy: {results['train_acc'][-1]}")
print(f"Final test accuracy: {results['test_acc'][-1]}")

train loss: 2.28e-01 | test loss: 2.41e-01 | reg: 1.85e+01 : 100%|██| 20/20 [05:01<00:00, 15.09s/it]


Final train accuracy: 0.9512500166893005
Final test accuracy: 0.9409999847412109
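The accuracy metric in the cell above rounds each raw model output to the nearest integer and checks it against the label. A minimal pure-Python sketch of the same idea follows; `rounded_accuracy` and the toy values are illustrative and not part of pykan. Note that an output such as 1.6 rounds to 2 and therefore counts as wrong even though it leans toward class 1.

```python
def rounded_accuracy(outputs, labels):
    """Fraction of outputs that equal their label after rounding to the nearest integer."""
    correct = sum(1 for out, label in zip(outputs, labels) if round(out) == label)
    return correct / len(labels)


# Toy example: raw model outputs and their true 0/1 labels.
outputs = [0.1, 0.8, 0.4, 1.2, 0.6]
labels = [0, 1, 1, 1, 0]
toy_accuracy = rounded_accuracy(outputs, labels)
print(toy_accuracy)

# Translating the reported accuracies back into counts, assuming the
# 80/20 split of 5,000 sampled rows gives 4,000 training and 1,000 test rows.
train_correct = round(0.95125 * 4000)
test_correct = round(0.941 * 1000)
print(train_correct, test_correct)
```

Seen this way, the reported scores correspond to 3,805 of 4,000 training rows and 941 of 1,000 test rows classified correctly.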





# Results

## How does KAN's accuracy compare to that of our testing in PyCaret?
- PyCaret accuracy - Extra Trees Classifier: 0.9715 (cross-validated score from the comparison table; 0.9693 on the held-out test set)
- PyKAN test accuracy - KAN (LBFGS optimizer): 0.941

As the results above show, in our use case with limited computational resources, PyCaret's Extra Trees Classifier predicted the 'Result' column more accurately than the KAN by about 3 percentage points. While this may seem insignificant, applied to the whole dataset it can be the difference of roughly 895 correctly predicted records, as worked out below. This has serious implications in the cybersecurity realm, where even one or a few devices infected with malware can do significant damage.
- 29332 * 0.9715 = 28496.04
- 29332 * 0.941 = 27601.41
- 28496 - 27601 = 895
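The extrapolation above can be reproduced with a few lines of Python. The sketch also repeats the calculation with the 0.9693 accuracy that `predict_model` reported on the held-out test set, since that figure is the closer counterpart to the KAN's test accuracy. `correct_records` is a hypothetical helper, and scaling an accuracy to the full dataset is itself only a rough approximation.

```python
n_rows = 29332            # rows in the full merged dataframe
kan_test_acc = 0.941      # KAN test accuracy
et_cv_acc = 0.9715        # Extra Trees accuracy from the compare_models table
et_holdout_acc = 0.9693   # Extra Trees accuracy from the predict_model output


def correct_records(accuracy, n=n_rows):
    """Whole number of correctly classified rows at a given accuracy (truncated)."""
    return int(accuracy * n)


gap_cv = correct_records(et_cv_acc) - correct_records(kan_test_acc)
gap_holdout = correct_records(et_holdout_acc) - correct_records(kan_test_acc)

print(f"Extra Trees (0.9715): {correct_records(et_cv_acc)} correct")
print(f"Extra Trees (0.9693): {correct_records(et_holdout_acc)} correct")
print(f"KAN (0.941):          {correct_records(kan_test_acc)} correct")
print(f"gap using 0.9715: {gap_cv}")
print(f"gap using 0.9693: {gap_holdout}")
```

Either way the gap is in the high hundreds of records, so the conclusion does not depend on which Extra Trees figure is used.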

Because pykan is so new and people are only beginning to experiment with it, more optimization is likely to follow. Until then, according to our testing in a compute-limited environment, PyCaret's Extra Trees Classifier is still the winner!
