# Machine Learning to Detect Android Malware using Android App Permissions

## Project Overview

This project uses a public data set of Android permissions collected from more than 29,000 benign and malicious Android apps.
The goal of my project is to explore several supervised ML algorithms and compare how effectively they
can distinguish harmless apps from malware. The problem is of interest because malware
on mobile devices causes significant economic harm and violates users' privacy.
This is a supervised ML problem using a labeled data set. The task is binary classification: determine whether a given app is
likely to be malware, based on the presence or absence of specific Android permissions.

### Project Repository

https://github.com/albert-kepner/Supervised_ML_Project

### The Data Set

This project uses the NATICUSdroid (Android Permissions) Dataset from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.php.
A link to this specific data set is here: https://archive-beta.ics.uci.edu/ml/datasets/naticusdroid+android+permissions+dataset .

Citation: Mathur, Akshay & Mathur, Akshay. (2022). NATICUSdroid (Android Permissions) Dataset. UCI Machine Learning Repository.

The data set data.csv can be downloaded from the above website.
The data set consists of 86 features, each of which is either a standard or a custom Android permission. The data set authors selected these features
from a larger set of possible Android permissions
with the goal of maximizing discrimination between malware and benign apps.
Each permission is either present or absent for a given app,
so there are 86 columns containing 0 or 1 to indicate the absence or presence of a given permission.
The last column of the data set is the label, which is 0 for benign or 1 for malware.
The data is already clean, with no missing values.
There are 29332 rows, where each row represents one Android app known to be either malware or benign.
14700 of the apps are malware and 14632 are benign, so the two classes are nearly balanced.
The data was collected from benign and malware Android applications over the period from 2010 to 2019.
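
The properties described above (binary columns, no missing values, class counts) can be checked with a few lines of pandas. This is a minimal sketch rather than the notebook's code: `summarize_permissions` is a hypothetical helper, the toy table is made up, and the only thing taken from the description is that the label is the last column.

```python
import pandas as pd


def summarize_permissions(df: pd.DataFrame) -> dict:
    """Summarize a permissions table whose last column is the 0/1 label."""
    features = df.iloc[:, :-1]
    labels = df.iloc[:, -1]
    return {
        "n_rows": len(df),
        "n_features": features.shape[1],
        "n_missing": int(df.isna().sum().sum()),
        "all_binary": bool(df.isin([0, 1]).all().all()),
        "n_malware": int((labels == 1).sum()),
        "n_benign": int((labels == 0).sum()),
    }


# Tiny made-up table (NOT the real data) to show the shape of the output.
toy = pd.DataFrame(
    {"perm_a": [1, 0, 1, 0], "perm_b": [0, 0, 1, 1], "label": [1, 0, 1, 0]}
)
summary = summarize_permissions(toy)
print(summary)
```

On the real file this would be called as `summarize_permissions(pd.read_csv("data.csv"))`; if the file matches the description, it should report 29332 rows and 86 features.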

## Exploratory Data Analysis and Feature Selection

In this data set all the features are permissions encoded 0/1, as is the label.
Binary data of this kind offers limited options for graphical display. One thing of interest is
how correlated the features are with each other. I created a correlation matrix and heat map showing all
the pairwise correlations between features.

See more details in the project notebook here: 
    
https://github.com/albert-kepner/Supervised_ML_Project/blob/master/Data_Set_And_Exploratory_Data_Analysis.ipynb

In the above notebook I also used the pairwise correlations between the 86 features to prune redundant features.
I repeatedly removed one feature from the most highly correlated pair until no pair had a correlation above 0.90.
This process eliminated 12 features, leaving 74 feature columns.
At the end of this notebook, I used sklearn.model_selection.train_test_split
to split the data into training data (70%) and testing data (30%), saved in separate CSV files train_data.csv and test_data.csv. This
makes it convenient to train and evaluate multiple models on the same data in separate notebooks.
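
The pruning and splitting steps can be sketched as below. This is an illustration under stated assumptions, not the notebook's code: `drop_correlated_features` is a hypothetical helper, it compares absolute correlations, it always drops the later column of the offending pair, and `random_state=0` is chosen only to make the toy run repeatable. The toy table is made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def drop_correlated_features(features: pd.DataFrame, threshold: float = 0.90) -> pd.DataFrame:
    """Greedily drop one feature of the most correlated pair until no pair exceeds threshold."""
    kept = features.copy()
    while kept.shape[1] > 1:
        corr = kept.corr().abs().to_numpy(copy=True)
        np.fill_diagonal(corr, 0.0)  # ignore each feature's correlation with itself
        i, j = np.unravel_index(np.nanargmax(corr), corr.shape)
        if not corr[i, j] > threshold:
            break  # no remaining pair is above the threshold
        # Assumption: drop the later column of the pair; the text does not say which one.
        kept = kept.drop(columns=kept.columns[max(i, j)])
    return kept


# Made-up 0/1 columns (NOT the real data): perm_a_copy duplicates perm_a exactly.
toy = pd.DataFrame({
    "perm_a": [0, 1] * 100,
    "perm_a_copy": [0, 1] * 100,
    "perm_c": [0, 0, 1, 1] * 50,
})
labels = pd.Series([0, 1, 1, 0] * 50, name="label")

pruned = drop_correlated_features(toy)

# 70% / 30% split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    pruned, labels, test_size=0.30, random_state=0
)
print(list(pruned.columns), len(X_train), len(X_test))
```

Each split could then be written out with `DataFrame.to_csv` to produce files such as train_data.csv and test_data.csv.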

## Training ML models on this dataset

I set out to compare how well I could use this data set for prediction with several of the ML algorithms in this course. 
I also wanted to try out neural networks with keras/tensorflow on the same problem. 
Each of the models is in a separate notebook in my GitHub project.
I created the following models, which can be viewed using these links:
    

https://github.com/albert-kepner/Supervised_ML_Project/blob/master/Logistic_Regression_Model.ipynb

https://github.com/albert-kepner/Supervised_ML_Project/blob/master/KNeighborsClassifier_Model.ipynb

https://github.com/albert-kepner/Supervised_ML_Project/blob/master/Decision_Tree_Model.ipynb

https://github.com/albert-kepner/Supervised_ML_Project/blob/master/RandomForest_Model.ipynb

https://github.com/albert-kepner/Supervised_ML_Project/blob/master/AdaBoostClassifier_Model.ipynb

https://github.com/albert-kepner/Supervised_ML_Project/blob/master/GradientBoostingClassifier_Model.ipynb

https://github.com/albert-kepner/Supervised_ML_Project/blob/master/SupportVectorClassifier_Model.ipynb

https://github.com/albert-kepner/Supervised_ML_Project/blob/master/NeuralNetwork_Model1.ipynb

https://github.com/albert-kepner/Supervised_ML_Project/blob/master/NeuralNetwork_Model2.ipynb

https://github.com/albert-kepner/Supervised_ML_Project/blob/master/NeuralNetwork_Model3.ipynb


## Hyperparameter Tuning

For each of the traditional ML models (all except neural networks) 
I used sklearn.model_selection.GridSearchCV to search for the best
values of appropriate tuning parameters. For this tuning I used the
default 5-fold cross validation with accuracy as the scoring parameter.
At the end of each model notebook I used the best model found to predict
on the held-out test data set, so that we have a final accuracy score
which can be compared across models.
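
The tuning procedure described above can be sketched with a small helper. This is a minimal sketch, assuming a decision tree and a made-up 0/1 data set in place of the real permissions; `tune_and_score` is a hypothetical name and the `max_depth` grid here is for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def tune_and_score(estimator, param_grid, X_train, y_train, X_test, y_test):
    """Grid-search with 5-fold CV on the training data, then score once on the test data."""
    search = GridSearchCV(estimator, param_grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)  # refits the best candidate on all training data
    test_accuracy = accuracy_score(y_test, search.predict(X_test))
    return search.best_params_, search.best_score_, test_accuracy


# Made-up 0/1 data standing in for the permission columns (NOT the real data set).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 10))
y = X[:, 0] & X[:, 1]  # toy rule: "malware" only if both permissions are present
X_train, X_test = X[:280], X[280:]
y_train, y_test = y[:280], y[280:]

best_params, cv_accuracy, test_accuracy = tune_and_score(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 3]},
    X_train, y_train, X_test, y_test,
)
print(best_params, cv_accuracy, test_accuracy)
```

Keeping the test data out of the grid search means the final accuracy is measured on apps that played no part in choosing the parameters.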

## Logistic Regression

For logistic regression I tried the following values for the 'C' parameter:

    [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0,
     20.0, 50.0, 100.0, 200.0, 500.0, 1000.0, 2000.0, 5000.0]

The best model based on cross-validation accuracy used C = 1000.

## K Nearest Neighbors

For this model the following parameters were tried:

    parameters = {'n_neighbors': [1,3,5,7,9,11,15],
                  'weights': ['uniform','distance'], 'p':[1,2]}

The best model used these parameters:
    
    {'n_neighbors': 7, 'p': 1, 'weights': 'distance'}
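
A small worked example may help in reading the `p` parameter on 0/1 features. With binary permissions, the Manhattan distance (`p=1`) is simply the number of permissions on which two apps differ, and the Euclidean distance (`p=2`) is the square root of that count. Both values of `p` therefore rank neighbors in the same order, and the choice mainly changes the weights when `weights='distance'` is used. The two app vectors below are made up for illustration.

```python
import numpy as np

# Two made-up apps described by six 0/1 permission flags.
app_a = np.array([1, 0, 1, 1, 0, 0])
app_b = np.array([1, 1, 0, 1, 0, 1])

differing = int((app_a != app_b).sum())               # permissions held by only one app
manhattan = int(np.abs(app_a - app_b).sum())          # Minkowski distance with p = 1
euclidean = float(np.sqrt(((app_a - app_b) ** 2).sum()))  # Minkowski distance with p = 2

print(differing, manhattan, euclidean)
```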
    
## Decision Tree

Parameters tried:
    
    parameters = {'max_depth':[3,5,7,10,12,13,15, 17], 'min_samples_leaf':[1,2,5,10]}
    
Best model:
    
    {'max_depth': 12, 'min_samples_leaf': 1}
    
## Random Forest

Parameters tried:
    
    parameters = {'n_estimators': [100, 110, 120],
                  'max_depth': [20,22,24],
                  'min_samples_split': [2],
                  'min_samples_leaf': [1],
                  'max_features': ['log2'],
                  'ccp_alpha': [0.0, 0.005, 0.01]}

Best model:

    {'ccp_alpha': 0.0,
     'max_depth': 22,
     'max_features': 'log2',
     'min_samples_leaf': 1,
     'min_samples_split': 2,
     'n_estimators': 120}
    
## AdaBoost Classifier

Parameters tried:
    
    parameters = {'learning_rate': [0.85, 0.75, 0.65, 0.50], 'n_estimators': [50, 75, 90]}
    
Best model:
    
    {'learning_rate': 0.65, 'n_estimators': 90}
    
## Gradient Boosting Classifier

Parameters tried:
    
    parameters = {'learning_rate': [0.05, 0.075, 0.10, 0.15], 'n_estimators': [100, 150, 200]}
    
Best model:
    
    {'learning_rate': 0.15, 'n_estimators': 200}
    
## Support Vector Classifier

Parameters tried:
    
    parameters = {'C': [1.0, 2.0, 5.0, 10.0],
                  'kernel': ['linear','poly','rbf','sigmoid']}
    
Best model:
    
    {'C': 5.0, 'kernel': 'rbf'}
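
To give a sense of the search cost, the grids listed in the sections above can be counted with `sklearn.model_selection.ParameterGrid`. This is a rough sketch: the grid values are copied from this README, the dictionary keys are just labels, and the fit counts assume 5-fold cross validation (the one extra refit of the best model per search is not counted).

```python
from sklearn.model_selection import ParameterGrid

N_FOLDS = 5  # the cross-validation setting described under hyperparameter tuning

grids = {
    "LogisticRegression": {"C": [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0,
                                 20.0, 50.0, 100.0, 200.0, 500.0, 1000.0, 2000.0, 5000.0]},
    "KNeighbors": {"n_neighbors": [1, 3, 5, 7, 9, 11, 15],
                   "weights": ["uniform", "distance"], "p": [1, 2]},
    "DecisionTree": {"max_depth": [3, 5, 7, 10, 12, 13, 15, 17],
                     "min_samples_leaf": [1, 2, 5, 10]},
    "RandomForest": {"n_estimators": [100, 110, 120], "max_depth": [20, 22, 24],
                     "min_samples_split": [2], "min_samples_leaf": [1],
                     "max_features": ["log2"], "ccp_alpha": [0.0, 0.005, 0.01]},
    "AdaBoost": {"learning_rate": [0.85, 0.75, 0.65, 0.50], "n_estimators": [50, 75, 90]},
    "GradientBoosting": {"learning_rate": [0.05, 0.075, 0.10, 0.15],
                         "n_estimators": [100, 150, 200]},
    "SupportVector": {"C": [1.0, 2.0, 5.0, 10.0],
                      "kernel": ["linear", "poly", "rbf", "sigmoid"]},
}

# Number of parameter combinations per model, and the resulting number of CV fits.
candidates = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}
for name, n_candidates in candidates.items():
    print(f"{name}: {n_candidates} candidates, {n_candidates * N_FOLDS} cross-validation fits")
```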


## Neural Network Models

Three models were tried. All three models have a single output node with "sigmoid" activation, 
which is appropriate for binary classification. The three models are:
    
* Model1 -- one dense hidden layer with 50 units
* Model2 -- three dense hidden layers with 148/74/37 units in the layers
* Model3 -- three dense hidden layers with 148/74/37 units in the layers plus 2 Dropout(rate=0.5) layers between the
hidden layers

The more complex models turned out to be slightly more accurate than Model1, but the difference was minor.
All three models used a validation split of 0.2 (20% of the training data held out for validation) during training.
All three models used early stopping based on validation loss to prevent overfitting.
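
The three architectures can be sketched in Keras as follows. This is an illustration, not the notebooks' code. The layer sizes, the sigmoid output, the dropout rate, the validation split and the use of early stopping on validation loss come from the description above. Everything else is an assumption: the `build_model` helper, ReLU hidden activations, the Adam optimizer, 74 input features (the count left after feature pruning), and the `patience` and `epochs` values.

```python
from tensorflow import keras


def build_model(n_features, hidden_units, dropout_rate=None):
    """Dense binary classifier; optional Dropout between (not after) the hidden layers."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_features,)))
    for index, units in enumerate(hidden_units):
        model.add(keras.layers.Dense(units, activation="relu"))  # assumed activation
        if dropout_rate is not None and index < len(hidden_units) - 1:
            model.add(keras.layers.Dropout(dropout_rate))
    model.add(keras.layers.Dense(1, activation="sigmoid"))  # single sigmoid output node
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


model1 = build_model(74, [50])
model2 = build_model(74, [148, 74, 37])
model3 = build_model(74, [148, 74, 37], dropout_rate=0.5)

# Stop when validation loss stops improving; the patience value is an assumption.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)

# Training call, shown for shape only (X_train / y_train would come from train_data.csv):
# model3.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])
```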