# Extract Features (Vectorization)

This notebook extracts the features from the dataset.
Each JSON record (asm_code) is transformed into an array built from the following features:
                                        (# = number of / occurrences of)
    > CFG ANALYSIS
        - #loops
        - #edges
        - #nodes
        - Cyclomatic Complexity (M)
            M = E - N + 2P
                E = # edges
                N = # nodes
                P = Connected Components (=1)
        
    > ASM ANALYSIS
        The FOLDER_KEYWORD_ASM_ANALYSIS folder contains several files, each holding one category of keywords (e.g. one type of instruction, one type of register).
        There are two file extensions:
            .instr -> contains keywords that refer to a specific type of assembly instruction (e.g. jump)
            .reg   -> contains keywords for one type of register
        The distinction is necessary because, during the search, a line can contain only one instruction but two or more registers (more than two e.g. in a memory access).
        The two kinds of keyword are also vectorized differently. In both cases the vector contains the number of occurrences of each keyword plus one number that summarizes the whole category.
        For instructions, the summary is the sum of the occurrences of all instructions of that type. For registers, it is the number of different registers used (we are not interested in how many times the code uses registers but in how many of them it uses, based on the idea that a complex program uses many registers).

        The number of memory accesses (NOTE: obtained by counting the instructions whose operands contain '[') and the number of lines of asm code are also considered.

        So, for this part, the features are the following:
            - #memory accesses
            - #asm lines

            For each file in FOLDER_KEYWORD_ASM_ANALYSIS:
                if extension == .instr:
                    for keyword in file:
                        - #keyword
                    - sum of the occurrences of all keywords in the file

                else if extension == .reg:
                    for keyword in file:
                        - #keyword
                    - number of keywords with at least one occurrence

        
NOTE:

(1) Memory accesses are counted as the instructions whose operands contain '[' <br>
(2) A register and its sub-registers are considered different registers (i.e. %eax and %ax count as two different registers) <br>
(3) After #memory access and #asm lines, the ASM ANALYSIS features are ordered alphabetically by the name of the keyword file
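
As a rough illustration of the layout described above, the sketch below builds the keyword part of the feature vector from counts that are already available. The file names `jump.instr` and `general.reg`, the counts, and the helper `layout` are made up for the example; only the ordering rule and the two summary features come from the description above.

```python
# Hypothetical keyword counts, keyed by keyword file name (made-up values).
counts = {
    'jump.instr': {'jmp': 3, 'je': 4},
    'general.reg': {'eax': 3, 'ebx': 0, 'ecx': 1},
}

def layout(counts):
    names, values = [], []
    for filename in sorted(counts):  # alphabetical order of the keyword files
        group = counts[filename]
        for keyword, n in group.items():
            names.append(keyword)
            values.append(n)
        if filename.endswith('.instr'):
            summary = sum(group.values())                      # total occurrences
        else:
            summary = sum(1 for n in group.values() if n > 0)  # distinct registers used
        names.append(filename.split('.')[0])
        values.append(summary)
    return names, values

names, values = layout(counts)
print(list(zip(names, values)))
```

With these inputs the `general` group comes first and ends with the distinct-register count, and the `jump` group ends with the sum of its occurrences.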




In [1]:
import json
import os
import networkx as nx
from networkx.readwrite import json_graph

In [2]:
FOLDER_KEYWORD_ASM_ANALYSIS = './keyword_asm_analysis'

#extensions of keyword files
INSTR_EXT = '.instr'
REG_EXT = '.reg'

In [3]:
semantic_code_to_int = {'undefined': 0, 'encryption': 1, 'string': 2, 'sort': 3, 'math': 4}

semantic_code_to_string = ['undefined', 'encryption', 'string', 'sort', 'math']

In [4]:
def cfg_analysis(dict_graph):
    '''
    @dict_graph = graph as dictionary (output of json load)
    @return the feature array and the list of feature names. The features are:
        > CFG ANALYSIS
            - Number of loops
            - Number of edges
            - Number of nodes
            - Cyclomatic Complexity:
                M = E - N + 2P
                    E = # edges
                    N = # nodes
                    P = Connected Components (=1)
    '''
    graph = json_graph.adjacency_graph(dict_graph)

    # loops are estimated as the number of independent cycles of the undirected version of the graph
    n_loops = len(nx.cycle_basis(graph.to_undirected()))
    n_edges = graph.number_of_edges()
    n_nodes = graph.number_of_nodes()
    M = n_edges - n_nodes + 2  # P = 1: the CFG is assumed to be a single connected component

    x = [n_loops, n_edges, n_nodes, M]
    x_meaning = ['num_loops', 'num_edges', 'num_nodes', 'cyclomatic_complex']
    
    return x, x_meaning
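
To make the formula concrete, here is a small worked example that does not depend on networkx. The edge list is a made-up CFG with one branch and one self-loop, and `cyclomatic_complexity` is a hypothetical helper that is not part of this notebook.

```python
def cyclomatic_complexity(edges, n_components=1):
    # M = E - N + 2P, with N taken as the set of nodes that appear in the edges
    nodes = {node for edge in edges for node in edge}
    return len(edges) - len(nodes) + 2 * n_components

# entry -> cond; cond -> then / else; both reach loop; loop repeats or exits
edges = [
    ('entry', 'cond'),
    ('cond', 'then'),
    ('cond', 'else'),
    ('then', 'loop'),
    ('else', 'loop'),
    ('loop', 'loop'),
    ('loop', 'exit'),
]
print(cyclomatic_complexity(edges))  # 7 edges - 6 nodes + 2 = 3
```

The result matches the usual reading of M as "decision points + 1": the branch at `cond` and the loop test at `loop` give two decisions.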


In [5]:
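def to_empty_dictionary(file_keys_path):
    '''
        Aux function to build the empty dictionary used in asm analysis:
        one entry per keyword in the file (one keyword per line), each set to 0
    '''
    dict_result = {}
    # the context manager closes the file even if reading fails
    with open(file_keys_path, 'r', encoding='utf8') as file_keys:
        for line in file_keys:
            key = line.strip()
            if key:  # skip blank lines: an empty keyword would match every instruction
                dict_result[key] = 0
    return dict_result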

In [6]:
def asm_decomposed_list(lista_asm):
    '''
        convert lista_asm to a list containing a dict for each instruction, where the keys are: instruction, S, D
        if an instruction doesn't have S or D, the value is empty = ''

        @lista_asm = asm from json file (a string expected to look like "['push rbp', 'mov rbp, rsp']")
        @return list of instructions
    '''
    result = []
    # [1:-2] drops the leading "[" and the trailing "']"; split on "'," instead of a single comma to avoid splitting an instruction like "mov rbp, rsp"
    raw_list = lista_asm[1:-2].strip().split("',")

    for instr in raw_list:
        line = instr.strip()[1:] #remove "'" at the beginning

        S=''
        D=''
        decomposed = line.split(" ", 1) #INSTRUCTION - S,D
        instr = decomposed[0].strip()
        if(len(decomposed)>1): #search S and D
            s_d = decomposed[1].split(',')
            S = s_d[0].strip()
            if(len(s_d) == 2):
                D = s_d[1].strip()
        dic = {}
        dic['instruction']=instr
        dic['S'] = S 
        dic['D'] = D
        result.append(dic)

    return result
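
The manual slicing above relies on the exact quoting of `lista_asm`. If the field is always a valid Python list literal (an assumption about the dataset, not something this notebook verifies), `ast.literal_eval` from the standard library can do the splitting instead. The sketch below uses a made-up sample string and a hypothetical helper `decompose` that keeps the same S/D rule as the function above.

```python
import ast

def decompose(lista_asm):
    result = []
    for line in ast.literal_eval(lista_asm):  # safely parses the list literal
        mnemonic, _, rest = line.strip().partition(' ')
        parts = [p.strip() for p in rest.split(',')] if rest else []
        S = parts[0] if parts else ''
        D = parts[1] if len(parts) == 2 else ''  # D is set only for two operands
        result.append({'instruction': mnemonic, 'S': S, 'D': D})
    return result

sample = "['push rbp', 'mov rbp, rsp', 'ret']"
rows = decompose(sample)
for row in rows:
    print(row)
```

`ast.literal_eval` only evaluates literals, so it is safe on untrusted input, but it raises an exception if the string is not a well-formed literal.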

In [7]:
def search_keyword(dict_keyword, string_to_parse, early_stop_search=True, are_register=False):
    '''
        search each keyword of the dict inside string_to_parse (a single asm instruction) and increment the corresponding dictionary entry

        @dict_keyword = dict with the keywords to search
        @string_to_parse = instruction mnemonic or operand string to scan

        @early_stop_search = if True, stop the search at the first keyword found
            // set to False during register search

        @are_register = if True, each register (e.g. ax) is searched with a leading delimiter:
            ' ax' -> avoids counting it inside eax
            '[ax' -> start of a memory access
            (the 'ax]' check is disabled below; ',ax' is not searched)

        @return:
            False   no keyword found
            True    keyword found

        @side effect:   entries of dict corresponding to the keywords found are incremented
    '''
    found_keyword = False
    keys = dict_keyword.keys()
    for key in keys:
        if are_register:
            if (' ' + key in string_to_parse):
                found_keyword = True
                dict_keyword[key] = dict_keyword[key] + 1 
                if early_stop_search:
                    return found_keyword

            if ('[' + key in string_to_parse):
                found_keyword = True
                dict_keyword[key] = dict_keyword[key] + 1 
                if early_stop_search:
                    return found_keyword
            '''
            if (key + ']' in string_to_parse):
                found_keyword = True
                dict_keyword[key] = dict_keyword[key] + 1 
                if early_stop_search:
                    return found_keyword
            '''

        else:
            if key in string_to_parse:
                found_keyword = True
                dict_keyword[key] = dict_keyword[key] + 1 
                if early_stop_search:
                    return found_keyword

    return found_keyword
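
The delimiter-prefix rule used above keeps `ax` from matching inside `eax`, but it does not see a register that follows `+` or `*` inside a memory operand. One possible alternative, shown here only as a sketch, is a regular expression that requires the register name not to be glued to other letters or digits. `count_register` is a hypothetical helper, and unlike `search_keyword` it counts every occurrence in the string.

```python
import re

def count_register(key, operands):
    # match `key` only when it is not adjacent to another letter or digit
    pattern = r'(?<![A-Za-z0-9])' + re.escape(key) + r'(?![A-Za-z0-9])'
    return len(re.findall(pattern, operands))

operands = ' eax [rbp+rax*4]'
for reg in ('ax', 'eax', 'rax', 'rbp'):
    print(reg, count_register(reg, operands))
```

Here `ax` is not counted because both of its appearances are part of longer register names, while `rax` is found even though it follows `+`.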


In [8]:
def asm_analysis(asm_list_dict, keyword_folder):
    '''
        @asm_list_dict = asm code as a list of dictionaries, one per line (NOT obtained directly from the json load function). The keys are: instruction, S, D
        if an operand is missing from the line, its value is the empty string.

        @keyword_folder contains the keywords to find in the assembly code. Each file contains the keywords of one category. Two file extensions are used:
        .instr = the file contains instruction keywords, so each line can match only one of them
        .reg = the file contains registers; here the src, dst part of a line can contain two or more different registers (more than two e.g. in a memory access)

        @return array with the components described above, and the list of feature names
            NOTE: For each .instr file, both the occurrences of each keyword and the sum of the occurrences over the whole category are reported.
            (e.g. file.instr = {xor or and}, the features of the code will be:
                #xor        (# = occurrences of)
                #or
                #and
                #xor + #or + #and)

            NOTE_2: For each .reg file, the number of different registers found in the asm code is reported instead of the sum.
            (e.g. file.reg = {eax, ebx, ecx}
                #eax
                #ebx
                #ecx
                (#eax>0?1:0)+(#ebx>0?1:0)+(#ecx>0?1:0)
            )
    '''

    #build an empty dictionary for each file
    ordered_list_files = sorted(os.listdir(keyword_folder))
    all_keywords = {} #dictionary of dictionaries: the keys are filenames, the values are dicts with the occurrences of each keyword
    '''
    e.g.
        all_keywords = {
            'jump.instr': {'jmp':3, 'je':4},
            'register.reg':{'eax':3, 'ecx':1}
        }
    '''
    for filename in ordered_list_files:
        all_keywords[filename] = to_empty_dictionary(os.path.join(keyword_folder, filename))

    MEMORY_ACCESS = '['
    number_of_memory_access = 0

    number_of_asm_lines = len(asm_list_dict)

    for line_dict in asm_list_dict:
        instruction_found = False #used to avoid search of other instruction

        instruction = line_dict['instruction']
        operands = " " + line_dict['S'] + " " + line_dict['D']

        for keyword_type in all_keywords.keys():
            if REG_EXT in keyword_type: #file contains keywords that refer to registers
                search_keyword(all_keywords[keyword_type], operands, early_stop_search=False, are_register=True)

            if (INSTR_EXT in keyword_type) and (not instruction_found):
                instruction_found = search_keyword(all_keywords[keyword_type], instruction)

    
        if MEMORY_ACCESS in operands:
            number_of_memory_access = number_of_memory_access + 1

    x=[]
    x_meaning = []

    x.append(number_of_memory_access)
    x_meaning.append('memory_access')

    x.append(number_of_asm_lines)
    x_meaning.append('num_asm_lines')


    for keyword_type in sorted(all_keywords):
        name_type = keyword_type.split('.')[0]

        if INSTR_EXT in keyword_type:
            sum_occurences = 0
            for keyword in all_keywords[keyword_type]:
                occurences = all_keywords[keyword_type][keyword]
                sum_occurences = sum_occurences + occurences
                x.append(occurences)
                x_meaning.append(keyword)
            x.append(sum_occurences)
            x_meaning.append(name_type)

        elif REG_EXT in keyword_type:
            sum_no_zero_occurences = 0
            for keyword in all_keywords[keyword_type]:
                occurences = all_keywords[keyword_type][keyword]
                if occurences > 0:
                    sum_no_zero_occurences = sum_no_zero_occurences + 1
                x.append(occurences)
                x_meaning.append(keyword)
            x.append(sum_no_zero_occurences)
            x_meaning.append(name_type)    

    return x, x_meaning    

In [9]:
def vectorization(json_file):
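    '''
        return the tuple (ID, x_meaning, y_meaning, x, y):
            ID = id of the sample (same as in the json file)
            x_meaning = name of each feature in x
            y_meaning = list mapping each class index to its semantic name
            x = vectorization of the sample
            y = {0,1,2,3,4} //ground truth
                0: undefined (test instance without label)
                1: encryption
                2: string
                3: sort
                4: math
    '''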
    ID = json_file['id']
    if 'semantic' in json_file.keys():
        y = semantic_code_to_int[json_file['semantic']]
    else:
        y = semantic_code_to_int['undefined']   #blind_dataset

    y_meaning = semantic_code_to_string

    #start build x:
    code_graph = json_file['cfg']
    x_cfg, x_meaning_cfg = cfg_analysis(code_graph)

    asm_file = json_file['lista_asm']
    asm_code_list_dict = asm_decomposed_list(asm_file)
    x_asm, x_meaning_asm = asm_analysis(asm_code_list_dict, FOLDER_KEYWORD_ASM_ANALYSIS)

    x = x_cfg + x_asm
    x_meaning = x_meaning_cfg + x_meaning_asm


    return ID, x_meaning, y_meaning, x, y 

# EXPORT DATASET TO JSON FILE

The exported file is a single JSON file that contains the following keys: <br>
- x_meaning = meaning of each feature <br>
- y_meaning = mapping from class index to semantic <br>
- dataset = list of samples: <br>
>For each sample:<br>
    |--------ID -> ID from the provided dataset <br>
    |--------x -> features <br>
    |--------y -> class (index) <br>
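
As a quick illustration of this structure, the sketch below builds a tiny object with the same keys, serializes it, and reads it back the way a consumer of the exported file might. The IDs and feature values are made up, and only the four CFG features are shown to keep the example short.

```python
import json

# Made-up feature names and samples, only to show the shape of the file.
exported = {
    'x_meaning': ['num_loops', 'num_edges', 'num_nodes', 'cyclomatic_complex'],
    'y_meaning': ['undefined', 'encryption', 'string', 'sort', 'math'],
    'dataset': [
        {'ID': 'sample-0', 'x': [1, 7, 6, 3], 'y': 4},
        {'ID': 'sample-1', 'x': [0, 2, 3, 1], 'y': 2},
    ],
}

text = json.dumps(exported)   # what would be written to disk
loaded = json.loads(text)     # what a consumer reads back

X = [sample['x'] for sample in loaded['dataset']]
labels = [loaded['y_meaning'][sample['y']] for sample in loaded['dataset']]
print(X)
print(labels)
```

Because `y` is stored as an index, `y_meaning` is what turns it back into a readable class name.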

In [10]:
def export_dataset_to_file(_path, _x_meaning, _y_meaning, _IDs, _X, _Y,):
    dataset_out_dict={} #dataset in dict format
    dataset_out_dict['x_meaning'] = _x_meaning
    dataset_out_dict['y_meaning'] = _y_meaning

    dataset_out_dict['dataset']=[]

    if (len(_IDs) != len(_X)) or (len(_X) != len(_Y)):
        print("Error, wrong dimensions")
        return None
  
    for idx in range(len(_X)):
        new_element = {}
        new_element['ID'] = _IDs[idx]
        new_element['x'] = _X[idx]
        new_element['y'] = _Y[idx]

        dataset_out_dict['dataset'].append(new_element)

    #write dataset_out_dict to the json file; the context manager closes it even if the dump fails
    with open(_path, 'w') as file_dataset_out:
        json.dump(dataset_out_dict, file_dataset_out)

In [11]:
def vectorization_dataset_and_export(_path_src, _path_dst):
    # the source file holds one JSON object per line
    with open(_path_src, 'r') as dataset_file:
        json_files = [json.loads(line.strip()) for line in dataset_file if line.strip()]

    x_meaning = None
    y_meaning = None
    X = []
    Y = []
    IDs = []
    for json_file in json_files:
        ID, _x_meaning, _y_meaning, x, y = vectorization(json_file)
        if x_meaning is None:
            # the feature names are the same for every sample: keep those of the first one
            x_meaning = _x_meaning
            y_meaning = _y_meaning
        X.append(x)
        Y.append(y)
        IDs.append(ID)

    export_dataset_to_file(_path_dst, x_meaning, y_meaning, IDs, X, Y)


## Dataset with duplicates

In [1]:
SRC_DATASET_DUP = './dataset/original/DUP/dataset.json'
DST_DATASET_DUP =  './dataset/vectorizated/DUP/dataset.json'

SRC_DATASET_DUP_BLIND = './dataset/original/DUP/blindtest.json'
DST_DATASET_DUP_BLIND = './dataset/vectorizated/DUP/blindtest.json'

In [13]:
vectorization_dataset_and_export(SRC_DATASET_DUP,DST_DATASET_DUP)

In [14]:
vectorization_dataset_and_export(SRC_DATASET_DUP_BLIND,DST_DATASET_DUP_BLIND)

In [2]:
with open(SRC_DATASET_DUP_BLIND, 'r') as f:
    lines = f.readlines()
len(lines)

757

## Dataset with NO duplicates

In [15]:
SRC_DATASET_NO_DUP = './dataset/original/NO_DUP/noduplicatedataset.json'
DST_DATASET_NO_DUP =  './dataset/vectorizated/NO_DUP/noduplicatedataset.json'

SRC_DATASET_NO_DUP_BLIND = './dataset/original/NO_DUP/nodupblindtest.json'
DST_DATASET_NO_DUP_BLIND = './dataset/vectorizated/NO_DUP/nodupblindtest.json'

In [16]:
vectorization_dataset_and_export(SRC_DATASET_NO_DUP,DST_DATASET_NO_DUP)

In [17]:
vectorization_dataset_and_export(SRC_DATASET_NO_DUP_BLIND,DST_DATASET_NO_DUP_BLIND)