# Machine Learning to Detect Android Malware using Android App Permissions

## Project Overview

This project uses a public data set of Android permissions collected from more than 29,000 benign and malicious Android apps.
The goal of my project is to explore several supervised ML algorithms and compare how effectively they
can distinguish harmless apps from malware. The problem is of interest because malware
on mobile devices causes significant economic harm and violates users' privacy.
This is a supervised ML problem using a labeled data set. The task is binary classification: determine whether a given app is
likely to be malware, based on the presence or absence of specific Android permissions.

### Project Repository

https://github.com/albert-kepner/Supervised_ML_Project

### The Data Set

This project uses the NATICUSdroid (Android Permissions) Dataset from the UCI ML data repository: https://archive.ics.uci.edu/ml/datasets.php.
A link to this specific data set is here: https://archive-beta.ics.uci.edu/ml/datasets/naticusdroid+android+permissions+dataset .

Citation: Mathur, Akshay & Mathur, Akshay. (2022). NATICUSdroid (Android Permissions) Dataset. UCI Machine Learning Repository.

The data set data.csv can be downloaded from the above website.
The data set consists of 86 features, each of which is a standard or custom Android permission.
The data set authors selected these features from a larger set of possible Android permissions
with the goal of maximizing discrimination between malware and benign apps.
Each permission is either present or absent for a given app,
so we have 86 columns containing 1 or 0 for the presence or absence of a given permission.
The last column of the data set is the label, which is 0 for benign or 1 for malware.
The data is already clean with no missing values.
There are 29332 rows, where each row represents one Android app known to be either malware or benign.
14700 of the apps are malware and 14632 are benign, so the two classes are nearly balanced.
The data was collected from benign and malware Android applications over the period from 2010 to 2019.
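
The properties described above (binary feature columns, a final 0/1 label column, no missing values, near-balanced classes) can be checked with a short pandas sketch. This is only an illustration: the helper name `summarize_permissions` and the toy column names are hypothetical, and the one assumption taken from the description is that the label is the last column.

```python
import pandas as pd


def summarize_permissions(df: pd.DataFrame) -> dict:
    """Summarize a permissions table whose last column is the 0/1 label."""
    label = df.columns[-1]        # assumption: the label is the last column
    features = df.columns[:-1]
    return {
        "n_rows": len(df),
        "n_features": len(features),
        "n_missing": int(df.isna().sum().sum()),
        "all_binary": bool(df.isin([0, 1]).all().all()),
        "class_counts": df[label].value_counts().sort_index().to_dict(),
    }


# Tiny synthetic stand-in for data.csv (column names are made up).
toy = pd.DataFrame(
    {
        "perm_a": [1, 0, 1, 0],
        "perm_b": [0, 0, 1, 1],
        "label": [1, 0, 1, 0],
    }
)
summary = summarize_permissions(toy)
print(summary)
```

Calling the same helper on `pd.read_csv("data.csv")` should, if the description above holds, report 29332 rows, 86 features, and no missing values.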

## Exploratory Data Analysis and Feature Selection

In this data set, all the features are permissions encoded as 0/1, as is the label.
Because every column is binary, there are limited choices for displaying this data graphically. One thing of interest is
how correlated the features are with each other. I created a correlation matrix and heat map showing all
the pairwise correlations between features.

See more details in the project notebook here: 
    
https://github.com/albert-kepner/Supervised_ML_Project/blob/master/Data_Set_And_Exploratory_Data_Analysis.ipynb

In the above notebook I also used the pairwise correlations between the 86 features for feature selection.
I repeatedly eliminated one feature from the most highly correlated pair until no remaining pair was correlated above 0.90.
This process eliminated 12 features, leaving 74 feature columns.
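
A minimal sketch of this kind of correlation-based pruning is shown below. It is not copied from the notebook. The function name `drop_correlated`, the use of absolute correlation, and the choice to drop the later column of each pair are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def drop_correlated(features: pd.DataFrame, threshold: float = 0.90):
    """Greedily drop one feature from the most correlated pair until no
    pair has an absolute correlation above `threshold`."""
    kept = features.copy()
    dropped = []
    while kept.shape[1] > 1:
        # Absolute pairwise correlations; NaN (constant columns) treated as 0.
        corr = np.nan_to_num(kept.corr().abs().to_numpy())
        np.fill_diagonal(corr, 0.0)  # ignore each feature's self-correlation
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        if corr[i, j] <= threshold:
            break
        # Assumption: drop the later column of the pair, keep the earlier one.
        victim = kept.columns[max(i, j)]
        dropped.append(victim)
        kept = kept.drop(columns=victim)
    return kept, dropped


# Toy example with made-up permission names.
toy = pd.DataFrame(
    {
        "perm_a": [1, 0, 1, 0, 1, 0],
        "perm_b": [1, 0, 1, 0, 1, 0],  # identical to perm_a
        "perm_c": [1, 1, 0, 0, 1, 0],
    }
)
reduced, dropped = drop_correlated(toy)
print(dropped, list(reduced.columns))
```

Recomputing the correlation matrix on every pass is wasteful but keeps the sketch short, and with fewer than 100 columns the cost is negligible.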
At the end of this notebook, I used sklearn.model_selection.train_test_split
to split the data into training data (70%) and testing data (30%), and saved them in separate CSV files, train_data.csv and test_data.csv. This
makes it convenient to train and evaluate multiple models on the same data in separate notebooks.
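
The split-and-save step might look roughly like the sketch below. The 70/30 ratio and the file names come from the description above. The helper name `split_and_save`, the fixed `random_state`, and the toy data are assumptions added so the example is reproducible and self-contained.

```python
from pathlib import Path
import tempfile

import pandas as pd
from sklearn.model_selection import train_test_split


def split_and_save(df: pd.DataFrame, out_dir, test_size: float = 0.30, seed: int = 0):
    """Split rows 70/30 and write train_data.csv and test_data.csv to out_dir."""
    # The seed is an assumption; it only makes the split repeatable.
    train, test = train_test_split(df, test_size=test_size, random_state=seed)
    out_dir = Path(out_dir)
    train.to_csv(out_dir / "train_data.csv", index=False)
    test.to_csv(out_dir / "test_data.csv", index=False)
    return train, test


# Toy table with made-up permission names; the last column is the label.
toy = pd.DataFrame(
    {
        "perm_a": [1, 0] * 10,
        "perm_b": [0, 1, 1, 0] * 5,
        "label": [1, 0] * 10,
    }
)

with tempfile.TemporaryDirectory() as tmp:
    train, test = split_and_save(toy, tmp)
    # Other notebooks would reload the saved files like this.
    reloaded = pd.read_csv(Path(tmp) / "train_data.csv")
```

Writing the split to disk once means every model notebook reads exactly the same training and testing rows, so their scores are directly comparable.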