# Data Exploration JSON Parser
This is a modified version of a notebook used to explore initial test data from the MalDroid dataset. This version crawls through random samples from different malware classes, allowing us to note and account for any discrepancies.
## Objectives
1. Build a dictionary that maps each class to a list of file paths to its JSON files
2. Define a function to log the frequency of JSON fields in the static data to a CSV for each class
3. Repeat Objective 2 for the frequency of behavior types in the dynamic data

In [1]:
import pandas as pd
import json
import ast
from flatten_json import flatten
from pandas import json_normalize  # pandas >= 1.0; importing from pandas.io.json is deprecated
import os
import numpy as np

# Objective 1

In [2]:
class_folders = ["adware", "banking", "benign_comp", "riskware", "sms"]
# os.path.join keeps the suffix portable instead of hard-coding Windows separators
path_suffix = os.path.join("sample_for_analysis.apk", "sample_for_analysis.apk.json")
ref_dict = {
    "adware": [],
    "banking": [],
    "benign_comp": [],
    "riskware": [],
    "sms": []
}

for base_dir in class_folders:
    path_list = ref_dict[base_dir]
    for sample_folder in os.listdir(base_dir):
        sample_dir = os.path.join(base_dir, sample_folder)
        json_path = os.path.join(sample_dir, path_suffix)
        path_list.append(json_path)
    ref_dict[base_dir] = path_list
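
If a sample folder lacks the expected JSON file, the problem only surfaces later when the file is opened. As a minimal sketch, the same dictionary could be built while skipping paths that do not exist. The helper name `collect_json_paths` and its `root` and `suffix_parts` parameters are hypothetical and not part of the original notebook.

```python
import os


def collect_json_paths(root, class_folders, suffix_parts):
    """Map each class name to the JSON paths found under root/<class>/<sample>/."""
    paths_by_class = {}
    for class_name in class_folders:
        class_dir = os.path.join(root, class_name)
        found = []
        # sorted() keeps the order reproducible across runs and platforms
        for sample_folder in sorted(os.listdir(class_dir)):
            json_path = os.path.join(class_dir, sample_folder, *suffix_parts)
            # Keep only paths that point at an existing file
            if os.path.isfile(json_path):
                found.append(json_path)
        paths_by_class[class_name] = found
    return paths_by_class
```

A call such as `collect_json_paths(".", class_folders, ["sample_for_analysis.apk", "sample_for_analysis.apk.json"])` would then return a dictionary shaped like `ref_dict`.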

# Objective 2
To make things a bit easier, we first build a list of all possible static features and then go back over the JSON files to log their frequency. This way we don't have to worry about iteratively adding columns to a DataFrame.

In [6]:
feat_names = []

for sample_class in class_folders:
    json_paths = ref_dict[sample_class]
    for path in json_paths:
        with open(path) as open_json:
            try:
                json_data = json.load(open_json)
            except ValueError:  # json.JSONDecodeError is a subclass of ValueError
                print("Error loading json from path: " + path)
                continue
        static = json_data['behaviors']['static']
        for item in static.keys():
            if item not in feat_names:
                if type(static[item]) is dict:
                    #can only parse one subdict
                    for subkey in static[item].keys():
                        full_key = item + ': ' + subkey
                        if full_key not in feat_names:
                            feat_names.append(full_key)
                else:
                    feat_names.append(item)

list_iterator = 0
# One row per class, one column per static feature
freq_matrix = np.zeros((len(class_folders), len(feat_names)))

for sample_class in class_folders:
    json_paths = ref_dict[sample_class]
    for path in json_paths:
        with open(path) as open_json:
            try:
                json_data = json.load(open_json)
            except ValueError:  # json.JSONDecodeError is a subclass of ValueError
                print("Error loading json from path: " + path)
                continue
        static = json_data['behaviors']['static']
        feat_iterator = 0
        for feature in feat_names:
            for item in static.keys():
                if type(static[item]) is dict:
                    #can only parse one subdict
                    for subkey in static[item].keys():
                        full_key = item + ': ' + subkey
                        if full_key.lower() == feature.lower():
                            freq_matrix[list_iterator, feat_iterator] += 1
                elif item.lower() == feature.lower():
                    freq_matrix[list_iterator, feat_iterator] += 1
            feat_iterator += 1
    list_iterator += 1

static_feat_freq = pd.DataFrame(freq_matrix, columns = feat_names)
static_feat_freq['class'] = class_folders
static_feat_freq

Error loading json from path: benign_comp\60777C42F710E9774C0F057D91239E34A86A63943EEEFA569FBD0A6DB3131AC4.copperdroid\sample_for_analysis.apk\sample_for_analysis.apk.json
Error loading json from path: benign_comp\84D2D583E7AA3D69D9641B493C7C1193D296D177E0D1E267CE043942B913CDD3.copperdroid\sample_for_analysis.apk\sample_for_analysis.apk.json
Error loading json from path: riskware\sample_for_analysis.apk_0\sample_for_analysis.apk\sample_for_analysis.apk.json
Error loading json from path: riskware\sample_for_analysis.apk_1\sample_for_analysis.apk\sample_for_analysis.apk.json
Error loading json from path: riskware\sample_for_analysis.apk_2\sample_for_analysis.apk\sample_for_analysis.apk.json
Error loading json from path: riskware\sample_for_analysis.apk_3\sample_for_analysis.apk\sample_for_analysis.apk.json
Error loading json from path: benign_comp\60777C42F710E9774C0F057D91239E34A86A63943EEEFA569FBD0A6DB3131AC4.copperdroid\sample_for_analysis.apk\sample_for_analysis.apk.json
Error loadin

| | num_intent_actions | intent_actions | num_permissions | intent_consts | num_intent_const_android_intent | used_permissions: android.permission.ACCESS_FINE_LOCATION | used_permissions: android.permission.VIBRATE | used_permissions: android.permission.INTERNET | used_permissions: android.permission.SET_WALLPAPER | used_permissions: android.permission.ACCESS_WIFI_STATE | ... | incognito.method_tags: GRAPHICS | incognito.method_tags: WEBKIT | file: C | file: raw | file: TTComp | incognito.method_tags: PROVIDER | incognito.num_intent_action_android_net | file: Apple | file: PE32 | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 10.0 | 10.0 | 11.0 | 1.0 | 10.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | adware |
| 1 | 16.0 | 16.0 | 16.0 | 15.0 | 15.0 | 6.0 | 7.0 | 12.0 | 0.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | banking |
| 2 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 10.0 | 9.0 | 10.0 | 0.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | benign_comp |
| 3 | 14.0 | 14.0 | 14.0 | 10.0 | 10.0 | 5.0 | 6.0 | 9.0 | 0.0 | 7.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | riskware |
| 4 | 33.0 | 33.0 | 33.0 | 27.0 | 27.0 | 0.0 | 19.0 | 9.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 | sms |
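
The two passes in the cell above could also be collapsed into one by counting keys per class with `collections.Counter`. The following is a minimal sketch on in-memory dictionaries rather than files. The helper names `static_feature_keys` and `count_static_features` and the toy data are hypothetical, and unlike the cell above the sketch compares keys case-sensitively.

```python
from collections import Counter


def static_feature_keys(static):
    """Yield feature names, expanding one level of sub-dictionaries as 'key: subkey'."""
    for item, value in static.items():
        if isinstance(value, dict):
            for subkey in value:
                yield item + ': ' + subkey
        else:
            yield item


def count_static_features(samples_by_class):
    """Return {class name: Counter of how many samples contain each feature}."""
    counts = {}
    for class_name, samples in samples_by_class.items():
        counter = Counter()
        for json_data in samples:
            # set() guards against counting a feature twice for one sample
            counter.update(set(static_feature_keys(json_data['behaviors']['static'])))
        counts[class_name] = counter
    return counts


# Toy data standing in for the parsed JSON files
toy = {
    "adware": [
        {"behaviors": {"static": {"num_permissions": 3,
                                  "used_permissions": {"android.permission.INTERNET": 1}}}},
        {"behaviors": {"static": {"num_permissions": 1}}},
    ],
}
freq = count_static_features(toy)
print(freq["adware"]["num_permissions"])  # 2
print(freq["adware"]["used_permissions: android.permission.INTERNET"])  # 1
```

Because a `Counter` returns 0 for keys it has never seen, features missing from a class need no special handling before the counts are turned into a table.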


# Objective 3
Variations in the structure of objects, even within the same class, need to be noted and accounted for. Each dictionary contains a list of the possible structures within a class. Attributes are treated as keys in the dictionary, and each value is either the attribute's dtype or, for sub-attributes, a nested dictionary.
The end result is a JSON object whose top-level attributes are the class types, each containing a list of dictionaries, one for every possible structure within that class.
## Issues parsing lists
- Documenting the contents of lists proved difficult: the `arguments` and `parameters` keys in the binder and syscall classes contain lists that mix varied string parameters and JSON
- These lists vary in size and content, so they are left out of the attribute documentation, both for brevity and because there is no reliable, repeatable way to document them
- These features could be valuable, so they will need to be taken into consideration during feature engineering
- Note: these values were effectively extracted in the flattening process

In [None]:
class_attributes_dict = {}

def dictparse(item):
    #A bit messy, but seems to be accurate. Had to workaround some strange formatting, long dtypes, and unicode
    attribute_dict = {}
    for key in item.keys():
        attribute = item[key]
        # if key == 'blob' and type(attribute) is str:
        #     if '{' in attribute:
        #         attribute = attribute.replace("L,", ",")
        #         attribute = attribute.replace("L}", "}")
        #         attribute = attribute.replace("u\'", "\'")
        #         try:
        #             attribute = ast.literal_eval(attribute)
        #         except:
        #             print(item)
        if type(attribute) is dict:
            attribute_dict[key] = dictparse(attribute)
        elif type(attribute) is list:
            #WARNING: This does not account for n-dimensional lists
            attribute_list = []
            for entry in attribute:
                if type(entry) is dict:
                    to_validate = dictparse(entry)
                    if to_validate not in attribute_list:
                        attribute_list.append(to_validate)
                # else:
                #     attribute_list.append(str(type(attribute)))
            attribute_dict[key] = attribute_list
        else:
            attribute_dict[key] = str(type(attribute))
    return attribute_dict

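# NOTE: class_dict is not defined in this version of the notebook; it is assumed to
# map each behavior class to a list of parsed JSON entries for that class.
for class_type in class_dict.keys():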
    class_attributes_dict[class_type] = []
    for item in class_dict[class_type]:
        item_dict = dictparse(item)
        if item_dict not in class_attributes_dict[class_type]:
            class_attributes_dict[class_type].append(item_dict)

with open('data_exploration/class_attributes.json', 'w') as write_json:
    json.dump(class_attributes_dict, write_json)
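
The warning in `dictparse` notes that nested lists are not handled, and scalar list entries are currently skipped. One possible way to cover both cases is to recurse on every value, as in the sketch below. `describe_structure` is a hypothetical helper, not the function used to produce `class_attributes.json`, the example entry is toy data, and the sketch records short type names such as `'str'` rather than the full `str(type(...))` string.

```python
def describe_structure(value):
    """Return a JSON-serializable summary of the shape of a parsed JSON value."""
    if isinstance(value, dict):
        return {key: describe_structure(sub) for key, sub in value.items()}
    if isinstance(value, list):
        # Keep one summary per distinct entry structure, in order of first appearance
        distinct = []
        for entry in value:
            summary = describe_structure(entry)
            if summary not in distinct:
                distinct.append(summary)
        return distinct
    # Scalars are recorded by type name, e.g. 'str', 'int', 'NoneType'
    return type(value).__name__


# Toy entry with a nested list inside 'parameters'
example = {
    "class": "FS PIPE ACCESS",
    "low": [
        {"id": 1, "parameters": ["a", ["b", 2]]},
        {"id": 2, "parameters": []},
    ],
}
summary = describe_structure(example)
print(summary)
```

Since lists are summarized by their distinct entry structures, two entries that differ only in their values collapse into a single description.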

# Objective 4
In order to correctly flatten the JSON data, sub-dictionaries that are stored as double-quoted strings need to be parsed into proper dictionaries/JSON. This functionality will be added to Objective 2 and removed from Objective 3. It is also worth keeping in mind that each entry in the list under the `content` key will need to be flattened separately and incrementally added to the DataFrame; otherwise every entry ends up along the x-axis (as columns of a single row). Additionally, the different data formats for different classes, as shown in Objective 3, may introduce None/NaN values. These will need to be imputed at some point and may cause issues when exporting the DataFrames to CSV or ingesting them into data visualization software/libraries. Thankfully, imputation will be made easier by the steps taken to document each 
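
The string-encoded sub-dictionaries appear to be Python 2 style `repr` output, judging by the commented-out code in Objective 3 that strips `L` suffixes and `u'` prefixes. A minimal sketch of such a conversion is shown below. `parse_embedded_dict` is a hypothetical helper, and the regular expression is an assumption: it would also alter a digit followed by `L` inside a string value, so results should be checked against real samples.

```python
import ast
import re

# Matches a Python 2 long suffix, e.g. the 'L' in 12L, when followed by a delimiter
_LONG_SUFFIX = re.compile(r"(?<=\d)L(?=\s*[,}\]])")


def parse_embedded_dict(text):
    """Try to turn a string such as "{u'size': 12L}" into a dict; return the input on failure."""
    if not isinstance(text, str) or '{' not in text:
        return text
    cleaned = _LONG_SUFFIX.sub("", text)
    try:
        # Python 3 still accepts the u'' prefix, so only the L suffix needs removing
        parsed = ast.literal_eval(cleaned)
    except (ValueError, SyntaxError):
        return text
    return parsed if isinstance(parsed, dict) else text


print(parse_embedded_dict("{u'size': 12L, u'name': u'cache'}"))
```

Returning the original value on failure keeps the surrounding pipeline running, at the cost of leaving unparsed strings in the data that need a later check.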

In [None]:
#sample flattening of a test json into csv
# path = file_path_dict['FS PIPE ACCESS']

# with open(path) as read:
#     json_to_flat = json.load(read)

# list1 = json_to_flat['content']
# flat = flatten(list1[0])
# flat = json_normalize(flat)
# flat = pd.DataFrame(flat)

# flat1 = flatten(list1[2])
# flat1 = json_normalize(flat1)
# flat1 = pd.DataFrame(flat1)

# pd.concat([flat, flat1], axis=0, ignore_index=True)

In [None]:
def flatten_and_export(structured_json, key):
    # Flatten each entry under 'content' separately so that every entry becomes one row
    frames = []
    for data in structured_json['content']:
        frames.append(json_normalize(flatten(data)))
    # Concatenate once at the end; keys missing from an entry show up as NaN
    dataframe = pd.concat(frames, axis=0, ignore_index=True)
    # Passing the path lets pandas handle newline conventions on every platform
    dataframe.to_csv('data_exploration/csv_files/' + key + '.csv')

for key, value in file_path_dict.items():
    with open(value, 'r') as path:
        flatten_and_export(json.load(path), key)
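
To illustrate why each entry is flattened separately and where the missing values come from, here is a standard-library sketch that uses neither `flatten_json` nor pandas. `flatten_entry` and `entries_to_csv` are hypothetical helpers, the underscore separator is an assumption chosen for the example, and the `content` list is toy data.

```python
import csv
import io


def flatten_entry(value, prefix=""):
    """Flatten nested dicts and lists into a single-level dict with joined keys."""
    flat = {}
    if isinstance(value, dict):
        children = value.items()
    elif isinstance(value, list):
        children = enumerate(value)
    else:
        return {prefix: value}
    for key, sub in children:
        child_prefix = f"{prefix}_{key}" if prefix else str(key)
        flat.update(flatten_entry(sub, child_prefix))
    return flat


def entries_to_csv(entries):
    """Write one CSV row per entry; keys missing from an entry become empty cells."""
    rows = [flatten_entry(entry) for entry in entries]
    # Union of all keys, in order of first appearance
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


content = [
    {"class": "A", "low": [{"id": 1}]},
    {"class": "B", "procname": "p"},
]
csv_text = entries_to_csv(content)
print(csv_text)
```

Each entry contributes one row, and the empty cells mark the positions that would appear as NaN in a DataFrame and need imputing later.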

# Objective 5
The goal of extensively documenting the key attributes is to provide semantic context both to those reviewing the data and to the model when evaluating it. Ideally, each key will have a semantic definition that is given additional context by its associated value and by related parent and child keys and values. To start, each key will be given a definition, but these may have to change depending on the context of the same key within a particular class or dict.

In [None]:
key_reference = {
    "class": "Categorical type of the observed dynamic behavior",
    "low": "List of low-level features associated with the observed behavior",
    "sysname": "Name of the system object associated with low-level features",
    "type": "Type of low-level process, either system call or binder",
    "id": "ID of low-level behavior features, in order of observation",
    "parameters": "Parameters sent to the system object in the SYSCALL class",
    "ts": "Timestamp of observed low-level behaviors in milliseconds since January 1, 1970",
    "sin_port": "Service port of the low-level process",
    "in_addr": "IP address of the low-level process",
    "sin_family": "The type of addresses the socket can communicate with",
    "classType": "", #Unsure, looks to be some sort of categorical feature of an observed behavior
    "operationFlags": "", #Unsure, looks to be some sort of categorical feature of an observed behavior
    "procname": "Name of the high-level process associated with the observed behavior",
    "blob": "SQL BLOB object of a column value of a row of a database table storing data associated with the low-level behavior",
    "flags": "", #Unsure, looks to be some sort of categorical feature of the blob object
    "mode": "", #Unsure, looks to be some sort of categorical feature of the blob object
    "filename": "Local file path or socket of the blob object",
    "xref": "The id of the parent/primary low-level feature in the same observed behavior",
    "tid": "Thread identifier of the schedulable object in the kernel",
    "size": "Size in bytes of the blob object",
    "devicename": "", #Unsure, found only in fs access(c,w) and device_access. Seems to be no different from filename in the former, whereas in the latter it is used to address /dev/binder, /dev/ashmem, /proc/meminfo and /dev/urandom but still seems to be no different from filename
    "pid": "Process identifier of the group of schedulable objects that share memory and file descriptors",
    "socket domain": "Data communications endpoint for exchanging data between processes executing on the same host",
    "socket type": "Defines the semantics and properties of communications using that socket",
    "socket protocol": "Protocol on which the socket communicates with other processes",
    "host": "Path or IP address of the network host",
    "port": "Port number the process is exposed on",
    "returnValue": "Value returned by the network call",
    "subclass": "Subclass of the associated behavior class",
    "read fd": "File descriptor to be read", 
    "write fd": "File descriptor to be written to",
    "interfaceGroup": "Parent package of the attached binder interface",
    "interface": "Full package name of the attached binder interface",
    "arguments": "List of arguments passed with the binder"
}
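
Several entries above still have empty definitions. A small sketch like the following could list which keys remain undocumented, including keys seen in the data that are absent from the reference altogether. `undocumented_keys` is a hypothetical helper and the toy dictionary is illustrative only.

```python
def undocumented_keys(reference, observed_keys):
    """Return observed keys whose definition is missing or empty, sorted by name."""
    missing = []
    for key in set(observed_keys):
        # .get() treats keys absent from the reference like empty definitions
        if not reference.get(key, "").strip():
            missing.append(key)
    return sorted(missing)


toy_reference = {"pid": "Process identifier", "flags": "", "mode": ""}
print(undocumented_keys(toy_reference, ["pid", "flags", "uid"]))  # ['flags', 'uid']
```

Run against the keys collected in Objective 3, a check like this would give a to-do list of definitions still to be written.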