# Tokenization
The goal of this notebook is to decide on a pre-built tokenization algorithm for the sample JSON data. 

In [1]:
import tensorflow as tf
import pandas as pd 
import os
import json 
import numpy as np
import tensorflow_text as text

## Test WordPiece Vocabulary
WordPiece is BERT's tokenizer, and we can use it to generate a custom vocabulary from our data. This test uses sample JSON data to see what such a vocabulary would look like. It is a potential alternative to a custom algorithm because it builds the vocabulary iteratively from subwords.

Requires: tensorflow_text_nightly and tf-nightly

Testing with different vocabulary sizes yielded mixed results: WordPiece was sometimes able to pick out pertinent tokens, but it mostly fixated on short tokens from the structure of the JSON file or from behavior parameters, most of which are unnecessary. A custom algorithm or reference dictionary will likely need to be developed.
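To illustrate why structural tokens tend to dominate, the sketch below counts tokens in a small, made-up JSON record after lowercasing and splitting on whitespace and punctuation. The regex split is only a rough stand-in for the pre-tokenization step and is not the tensorflow_text implementation; the record, its values, and the helper name `basic_split` are assumptions for illustration.

```python
import json
import re
from collections import Counter

# Hypothetical toy record: the field names come from keys seen in this notebook,
# the values are made up for illustration.
toy_sample = json.dumps({
    "behaviors": {"dynamic": {"host": [
        {"class": "FS ACCESS",
         "low": [{"id": 1, "ts": 10, "type": "SYSCALL", "sysname": "open"}]},
        {"class": "NETWORK ACCESS",
         "low": [{"id": 2, "ts": 11, "type": "SYSCALL", "sysname": "connect"}]},
    ]}}
})

def basic_split(text):
    """Lowercase, then split into word runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

token_counts = Counter(basic_split(toy_sample))

# Share of all tokens that are pure punctuation (quotes, braces, colons, ...)
punct_total = sum(n for tok, n in token_counts.items()
                  if re.fullmatch(r"[^\w\s]", tok))
structural_share = punct_total / sum(token_counts.values())

print(token_counts.most_common(5))
print(f"punctuation share: {structural_share:.2f}")
```

Even in this tiny record, quotes, colons, and repeated key names outnumber the values that actually describe behavior, which is consistent with the mixed results described above.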

In [2]:
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

In [3]:
# Draw 50 random sample indices per category directory to build a test dataset for vocabulary building
# Note: randint samples with replacement, so an index can repeat within a row

adware_sample = np.random.randint(0, high=887, size=(1,50))
banking_sample = np.random.randint(0, high=2049, size=(1,50))
riskware_sample = np.random.randint(0, high=2374, size=(1,50))
sms_sample = np.random.randint(0, high=3853, size=(1,50))

# Stack the four index rows into one (4, 50) matrix, one row per directory
sample_mat = np.concatenate((adware_sample, banking_sample, riskware_sample, sms_sample), axis=0)

dir_list = ['adware', 'banking', 'riskware', 'sms']

# train_data = tf.data.TextLineDataset(str('adware\\' + os.listdir('adware')[sample_mat[0,0]] + '\\sample_for_analysis.apk.json'))

# mat_index = 0
# for sample_dir in dir_list: 
#     sample_list = os.listdir(sample_dir)
#     # The first adware sample already seeds train_data, so skip it here
#     if sample_dir == 'adware':
#         start_index = 1
#     else:
#         start_index = 0
#     for rand_ind in sample_mat[mat_index,start_index:]:
#         # concatenate returns a new dataset, so the result must be reassigned
#         train_data = train_data.concatenate(tf.data.TextLineDataset(str(sample_dir + '\\' + sample_list[rand_ind] + '\\sample_for_analysis.apk.json')))
#     mat_index += 1

In [4]:
# bert_tokenizer_params=dict(lower_case=True)
# reserved_tokens=["[PAD]", "[UNK]", "[START]", "[END]"]

# bert_vocab_args = dict(
#     # The target vocabulary size
#     vocab_size = 800,
#     # Reserved tokens that must be included in the vocabulary
#     reserved_tokens=reserved_tokens,
#     # Arguments for `text.BertTokenizer`
#     bert_tokenizer_params=bert_tokenizer_params,
#     # Arguments for `wordpiece_vocab.wordpiece_tokenizer_learner_lib.learn`
#     learn_params={},
# )

# test_vocab = bert_vocab.bert_vocab_from_dataset(
#     train_data.shuffle(45).batch(45).prefetch(2),
#     **bert_vocab_args
# )

# with open('test_vocab_rand_sample.txt', 'w') as f:
#     for token in test_vocab:
#         print(token, file=f)

## Custom Tokenization
Due to the apparent underperformance of pre-built models in deriving their own dictionaries, a custom method is required. The goal of this process is to derive reusable tokens and preserve the contextual, nested structure of the JSON file. We will do this by iteratively extracting tokens from a random sample until no meaningful (frequent) tokens remain.
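As a rough illustration of the iterative idea, the sketch below repeatedly pulls the most frequent string out of a list until nothing reaches a minimum count, then repeats the process on the words of whatever is left. This is a hypothetical outline, not the method developed in this notebook; the function names, the `min_count` threshold, and the example values are all assumptions.

```python
from collections import Counter

def extract_frequent_tokens(strings, min_count=2):
    """Repeatedly move the most frequent string into the vocabulary.

    Stops as soon as the most frequent remaining string occurs fewer
    than min_count times. Returns (vocab, remaining).
    """
    remaining = list(strings)
    vocab = []
    while remaining:
        token, count = Counter(remaining).most_common(1)[0]
        if count < min_count:
            break
        vocab.append(token)
        remaining = [s for s in remaining if s != token]
    return vocab, remaining

def iterative_vocab(values, min_count=2):
    """Extract whole values first, then frequent words from the leftovers."""
    vocab, leftovers = extract_frequent_tokens(values, min_count)
    words = [word for value in leftovers for word in value.split()]
    word_vocab, rare = extract_frequent_tokens(words, min_count)
    return vocab + word_vocab, rare

# Made-up values for illustration only
example_values = ["FS ACCESS", "FS ACCESS", "NETWORK ACCESS", "FS OPEN", "NETWORK SEND"]
vocab, rare = iterative_vocab(example_values)
print(vocab)
print(rare)
```

Extracting whole values before splitting them into words keeps frequent multi-word values intact as single tokens, which is one way to preserve some of the context the paragraph above asks for.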

In [5]:
# Read the randomly selected samples (indices from sample_mat above) into memory
sample_data = []

for mat_index, sample_dir in enumerate(dir_list):
    sample_list = os.listdir(sample_dir)
    for rand_ind in sample_mat[mat_index]:
        # os.path.join keeps the path portable across operating systems
        sample_path = os.path.join(sample_dir, sample_list[rand_ind], 'sample_for_analysis.apk.json')
        with open(sample_path) as sample_file:
            sample_data.append(sample_file.read().replace("\n", " "))

In [6]:
# Count how often each high-level key appears across all sampled behaviors
high_keys = []

for sample in sample_data:
    sample_json = json.loads(sample)['behaviors']['dynamic']['host']
    for behavior in sample_json:
        for key in behavior.keys():
            high_keys.append(key)

uniq_keys, key_frequency = np.unique(high_keys, return_counts=True)
print(uniq_keys)
print(key_frequency)

['arguments' 'class' 'classType' 'interface' 'interfaceGroup' 'low'
 'method' 'operationFlags' 'procname' 'subclass' 'tid']
[  21036 4805170   64309   21036   21036 4805170   21036   56733   71113
   10006   71113]


### High Level Tokens
Inspection of the high-level keys in a random sample revealed 11 keys:
* class
* classType
* interface
* interfaceGroup
* method
* operationFlags
* procname
* subclass
* tid
* low
* arguments

Of these 11, low holds the nested lower-level features rather than a value of its own. tid and arguments will likely not be used as tokens.
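The split described above could be sketched as follows: keep the candidate token fields, set aside the nested low entries, and drop tid and arguments. The helper name `split_behavior`, the constants, and the example record are hypothetical; only the key names come from the output above.

```python
# Keys that will likely not be used as tokens (per the note above)
EXCLUDED_KEYS = {"tid", "arguments"}
# Key whose value holds the nested lower-level features
NESTED_KEY = "low"

def split_behavior(behavior):
    """Return (top_level_fields, low_level_entries) for one behavior record."""
    top_level = {
        key: value
        for key, value in behavior.items()
        if key not in EXCLUDED_KEYS and key != NESTED_KEY
    }
    low_level = behavior.get(NESTED_KEY, [])
    return top_level, low_level

# Made-up record for illustration only
example_behavior = {
    "class": "FS ACCESS",
    "procname": "com.example.app",
    "tid": 1234,
    "arguments": {"path": "/data/tmp"},
    "low": [{"id": 1, "ts": 10, "type": "SYSCALL", "sysname": "open"}],
}

top_level, low_level = split_behavior(example_behavior)
print(top_level)
print(low_level)
```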

In [8]:
# Count how often each low-level key appears inside the 'low' entries
low_keys = []
for sample in sample_data:
    sample_json = json.loads(sample)['behaviors']['dynamic']['host']
    for behavior in sample_json:
        for low_behavior in behavior['low']:
            for key in low_behavior.keys():
                low_keys.append(key)

uniq_keys, key_frequency = np.unique(low_keys, return_counts=True)
print(uniq_keys)
print(key_frequency)

['blob' 'id' 'methodName' 'method_name' 'parameters' 'read fd' 'socket fd'
 'sysname' 'ts' 'type' 'write fd' 'xref']
[ 685381 5426236   21036    6804 4713021    2109    1102 5405200 5426236
 5426236    2109  618406]


In [7]:
# uniq, counts = np.unique(value_list, return_counts=True)

## ID and TS Features