# Data Exploration JSON Parser
The goal of this notebook is to flatten the .apk.json files included in the MalDroid dataset, which contain the CopperDroid analysis of Android APKs, and extract their data into .csv files. This notebook focuses on the data under the dynamic:host: headers of the JSON file. Flattening makes it easier to visualize and analyze the hundreds of thousands of rows of system calls.
## Objectives
1. Isolate the data under the 'dynamic' header, which holds a JSON array under the 'host' header
2. Create separate JSON files of objects with the same "class" attribute
3. Record a nested dictionary of common attributes associated with each "class" attribute
4. Flatten "class"-separated JSON files into their own CSVs for analysis

In [1]:
import pandas as pd
import json
import ast
from flatten_json import flatten
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated

with open('075049984D2937039DDE452818BC6B844C8C8CD17232DB8D951306F02234B2EA/sample_for_analysis.apk.json') as json_file:
    full_json = json.load(json_file)

# Objective 1

In [2]:
isolated_json = full_json['behaviors']['dynamic']

# Objective 2
A list of all class names is compiled and a dictionary is built that maps each class name to its entries. The dictionary entries are then exported to JSON, one file per class, along with a second dictionary that records the relative file path for each class for reference.

In [3]:
class_list = []
class_dict = {}

for item in isolated_json['host']:
    if not isinstance(item, dict):
        print("item not of type dict")
        break
    item_class = item['class']
    if item_class not in class_list:
        class_list.append(item_class)
        class_dict[item_class] = [item]
    else:
        class_dict[item_class].append(item)

#parsing strings of subdicts, see Objective 4 for explanation
def reformat(dict_in):
    updated_dict = {}
    for key in dict_in.keys():
        attribute = dict_in[key]
        if key == 'blob' and type(attribute) is str:
            if '{' in attribute:
                attribute = attribute.replace("L,", ",")
                attribute = attribute.replace("L}", "}")
                attribute = attribute.replace("u\'", "\'")
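                try:
                    attribute = ast.literal_eval(attribute)
                except (ValueError, SyntaxError):
                    # Parsing failed: keep the blob as a string and print the entry for inspection
                    print(dict_in)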
        if type(attribute) is dict:
            updated_dict[key] = reformat(attribute)
        elif type(attribute) is list:
            # WARNING: This does not account for n-dimensional (nested) lists
            updated_list = []
            for entry in attribute:
                if type(entry) is dict:
                    updated_list.append(reformat(entry))
                else:
                    updated_list.append(entry)
            updated_dict[key] = updated_list
        else:
            updated_dict[key] = attribute
    return updated_dict

for key_value in class_dict.keys():
    updated_values = []
    for entry in class_dict[key_value]:
        updated_entry = reformat(entry)
        updated_values.append(updated_entry)
    class_dict[key_value] = updated_values

file_path_dict = {}

for class_type in class_dict.keys():
    holder_dict = {'content': class_dict[class_type]}
    path = 'data_exploration_TEST/' + class_type + '.json'
    file_path_dict[class_type] = path
    # Reuse the path recorded above instead of rebuilding the same string
    with open(path, 'w') as write_json:
        json.dump(holder_dict, write_json)

with open('data_exploration_TEST/relative_class_file_paths.json', 'w') as write_json:
    json.dump(file_path_dict, write_json)
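
To make the blob cleanup inside `reformat` concrete, here is a minimal sketch on a made-up blob string. The helper name `parse_blob` and the sample value are hypothetical and are not taken from the dataset; the sketch only assumes that blobs look like Python 2 dict reprs with `L` long suffixes and `u'` unicode prefixes, as the replacements above suggest.

```python
import ast

def parse_blob(blob):
    """Strip Python 2 long suffixes and unicode prefixes, then parse the string as a dict."""
    cleaned = blob.replace("L,", ",").replace("L}", "}").replace("u'", "'")
    try:
        return ast.literal_eval(cleaned)
    except (ValueError, SyntaxError):
        # Not a parseable literal: return the original string unchanged
        return blob

# Hypothetical blob in the style of a Python 2 dict repr
sample_blob = "{u'type': u'read/write pipe', u'size': 4096L, u'fd': 36L}"
parsed = parse_blob(sample_blob)
print(parsed)
```

One caveat of plain string replacement: `replace("u'", "'")` also matches a value that ends in the letter u right before its closing quote, so such values would lose their last character. A regular expression anchored on the opening quote would be a more careful alternative.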

# Objective 3
Variations in the structure of objects, even within the same class, need to be noted and accounted for. Each class maps to a list of the possible structures found within it. In each structure, attributes are the dictionary keys, and the values are either the attribute's dtype or, for nested attributes, a sub-dictionary of the same form.
The end result is a JSON object whose top-level keys are the class types, each containing a list of dictionaries, one per distinct structure within that class.

In [4]:
class_attributes_dict = {}

def dictparse(item):
    # A bit messy, but seems to be accurate. Had to work around some strange formatting, long dtypes, and unicode
    attribute_dict = {}
    for key in item.keys():
        attribute = item[key]
        # if key == 'blob' and type(attribute) is str:
        #     if '{' in attribute:
        #         attribute = attribute.replace("L,", ",")
        #         attribute = attribute.replace("L}", "}")
        #         attribute = attribute.replace("u\'", "\'")
        #         try:
        #             attribute = ast.literal_eval(attribute)
        #         except:
        #             print(item)
        if type(attribute) is dict:
            attribute_dict[key] = dictparse(attribute)
        elif type(attribute) is list:
            #WARNING: This does not account for n-dimensional lists
            for entry in attribute:
                if type(entry) is dict:
                    attribute_dict[key] = dictparse(entry)
                else:
                    attribute_dict[key] = str(type(attribute))
        else:
            attribute_dict[key] = str(type(attribute))
    return attribute_dict

for class_type in class_dict.keys():
    class_attributes_dict[class_type] = []
    for item in class_dict[class_type]:
        item_dict = dictparse(item)
        if item_dict not in class_attributes_dict[class_type]:
            class_attributes_dict[class_type].append(item_dict)

with open('data_exploration_TEST/class_attributes.json', 'w') as write_json:
    json.dump(class_attributes_dict, write_json)
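
As an illustration of the structure summary described for Objective 3, the sketch below runs a simplified variant of the idea on three made-up records. The function `structure_of` and the sample records are hypothetical; unlike `dictparse`, this variant summarizes a list by its first dictionary entry and records empty lists as a list type.

```python
def structure_of(value):
    """Return a nested description of value: dicts keep their keys, leaves become type names."""
    if isinstance(value, dict):
        return {key: structure_of(sub) for key, sub in value.items()}
    if isinstance(value, list):
        for entry in value:
            if isinstance(entry, dict):
                # Summarize the list by its first dictionary entry
                return structure_of(entry)
        return str(type(value))
    return str(type(value))

# Hypothetical records of one class; the third has a different shape
records = [
    {"class": "FS ACCESS", "tid": 1, "low": [{"sysname": "open", "ts": 1.0}]},
    {"class": "FS ACCESS", "tid": 2, "low": [{"sysname": "read", "ts": 2.0}]},
    {"class": "FS ACCESS", "tid": 3, "procname": "zygote", "low": []},
]

# Keep each distinct structure once, in order of first appearance
structures = []
for record in records:
    summary = structure_of(record)
    if summary not in structures:
        structures.append(summary)

print(len(structures))
```

The first two records share a structure, so only two distinct summaries are kept. Comparing the summaries shows which attributes, such as `procname` here, are optional within a class.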

# Objective 4
To flatten the JSON data correctly, sub-dictionaries that are stored as quoted strings inside entries first need to be parsed into proper dictionaries/JSON. This functionality has been added to Objective 2 (the reformat function) and disabled in Objective 3. Each entry in the list under the content header also needs to be flattened separately and added to the dataframe incrementally; otherwise every entry is spread along the x-axis as extra columns of a single row. Additionally, the different data formats for different classes, as shown in Objective 3, may introduce None/NaN values. These will need to be imputed at some point and may cause issues when exporting the dataframes to CSV or ingesting them into data visualization software/libraries. Thankfully, imputation will be made easier by the steps taken to document each class's structure in Objective 3.

In [5]:
#sample flattening of a test json into csv
path = file_path_dict['FS PIPE ACCESS']

with open(path) as read:
    json_to_flat = json.load(read)

flat = flatten(json_to_flat)
flat = json_normalize(flat)

flat

Unnamed: 0,content_0_classType,content_0_operationFlags,content_0_procname,content_0_low_0_sysname,content_0_low_0_ts,content_0_low_0_id,content_0_low_0_blob_type,content_0_low_0_blob_filename,content_0_low_0_type,content_0_low_0_read fd,...,content_2_low_0_read fd,content_2_low_1_sysname,content_2_low_1_ts,content_2_low_1_write fd,content_2_low_1_blob_type,content_2_low_1_blob_filename,content_2_low_1_type,content_2_low_1_id,content_2_tid,content_2_class
0,0,2048,zygote,pipe,1545638795.518,327,read/write pipe,anonymous pipe [read],SYSCALL,36,...,106,pipe,1545638805.521,113,read/write pipe,anonymous pipe [write],SYSCALL,21746,1163,FS PIPE ACCESS
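
The sample output above has a single row because the whole file was flattened at once. The sketch below shows the per-entry approach described under Objective 4, using only the standard library so that it runs on its own. The helper `flatten_entry` and the sample entries are hypothetical stand-ins; in the notebook itself, `flatten` from flatten_json applied to each entry, followed by `pd.DataFrame(rows)`, would play the same role, with missing keys showing up as NaN.

```python
import csv
import io

def flatten_entry(value, prefix=""):
    """Flatten nested dicts and lists into one level with '_'-joined keys."""
    flat = {}
    if isinstance(value, dict):
        for key, sub in value.items():
            flat.update(flatten_entry(sub, f"{prefix}{key}_"))
    elif isinstance(value, list):
        for index, sub in enumerate(value):
            flat.update(flatten_entry(sub, f"{prefix}{index}_"))
    else:
        # Leaf value: drop the trailing separator from the accumulated prefix
        flat[prefix[:-1]] = value
    return flat

# Hypothetical entries in the shape of the 'content' list of one class file
content = [
    {"class": "FS PIPE ACCESS", "tid": 1163,
     "low": [{"sysname": "pipe", "id": 327}]},
    {"class": "FS PIPE ACCESS", "tid": 1164, "procname": "zygote",
     "low": [{"sysname": "pipe", "id": 328}, {"sysname": "pipe", "id": 329}]},
]

# One flattened row per entry, instead of one row for the whole file
rows = [flatten_entry(entry) for entry in content]

# Union of all column names, in order of first appearance
columns = []
for row in rows:
    for key in row:
        if key not in columns:
            columns.append(key)

# Missing attributes are written as empty cells (restval)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Each entry becomes its own row, and attributes that an entry lacks are left empty, which is where the None/NaN values mentioned above come from.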
