This tool creates datasets for machine learning (ML) programs that use natural language processing (NLP) techniques. Its main feature is converting x86_64 executable files in PE and ELF format to disassembly code and writing the result to CSV files that are easy to handle in ML. Its advantages are simpler usage and the fact that it is free and open source, so it does not depend on paid tools such as IDA-Python.

Input
- DirectoryPath : PE format (.exe) files or ELF format executable files

Output
- [dirname]_EP.csv: Extract the entry point of all files
- [dirname]_EP_asm.csv: Extract disassembly code from the entry point of all files, for an arbitrary number of bytes
- [dirname]_TEXTSec_asm.csv: Extract disassembly code from the text section of all files
- [dirname]_TEXTSec_asm_TRANS.csv: Transform the disassembly code from the CSV file by arbitrary rules
- [dirname]_FUNC_asm.csv: Extract disassembly code from all functions of all files (ELF binaries)
- [dirname]_FUNC_asm_TRANS.csv: Transform the disassembly code from the CSV file by arbitrary rules (ELF binaries)
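
As a minimal sketch of how a downstream script might locate these files, the helper below builds the expected file names from a prefix. The function name `expected_outputs`, its parameters, and the assumption that the files sit in one output directory are hypothetical and not part of pinja; only the file-name suffixes come from the list above.

```python
from pathlib import Path

# Suffixes taken from the output list above.
COMMON_SUFFIXES = [
    "_EP.csv",
    "_EP_asm.csv",
    "_TEXTSec_asm.csv",
    "_TEXTSec_asm_TRANS.csv",
]
# The FUNC outputs are listed for ELF binaries only.
ELF_ONLY_SUFFIXES = [
    "_FUNC_asm.csv",
    "_FUNC_asm_TRANS.csv",
]


def expected_outputs(prefix, elf=False, out_dir="."):
    """Return the output paths expected for a given name prefix.

    Hypothetical helper: `prefix` stands in for [dirname], and `out_dir`
    is assumed to be the directory where the CSV files were written.
    """
    suffixes = COMMON_SUFFIXES + (ELF_ONLY_SUFFIXES if elf else [])
    return [Path(out_dir) / (prefix + suffix) for suffix in suffixes]


if __name__ == "__main__":
    for path in expected_outputs("out", elf=True):
        status = "found" if path.exists() else "missing"
        print(f"{path}: {status}")
```

A script could use such a list to check which CSV files exist before starting a training run.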
pip3 install -r requirements.txt
pip3 install .
pip3 install -r requirements.txt
pip3 install -e .
pinja [INPUT_DIRPATH]
pinja --help
pinja data/infilePE
pinja -f elf data/infileELF -b 180 -o OUTPUTNAME
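
When several directories need processing, the commands above can be driven from Python. The following is a sketch under stated assumptions: the helper names `build_pinja_command` and `run_pinja` are hypothetical, and the flags `-f`, `-b`, and `-o` are used exactly as they appear in the example above; their precise meaning should be confirmed with `pinja --help`.

```python
import subprocess


def build_pinja_command(input_dir, file_format=None, num_bytes=None, output_name=None):
    """Build a pinja argument list mirroring the command-line examples above.

    Hypothetical helper: only flags that appear in the examples are used.
    """
    cmd = ["pinja"]
    if file_format is not None:
        cmd += ["-f", file_format]
    cmd.append(input_dir)
    if num_bytes is not None:
        cmd += ["-b", str(num_bytes)]
    if output_name is not None:
        cmd += ["-o", output_name]
    return cmd


def run_pinja(input_dir, **options):
    """Run pinja and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_pinja_command(input_dir, **options), check=True)


if __name__ == "__main__":
    # Prints the argument list without running pinja.
    print(build_pinja_command("data/infileELF", file_format="elf",
                              num_bytes=180, output_name="OUTPUTNAME"))
```

Passing the arguments as a list avoids shell quoting problems when directory names contain spaces.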
pip3 install pandas
pip3 install gensim
cd UseCase/
python3 use.py out_TEXTSec_asm.csv
#!/usr/bin/python3
# Usage: python3 use.py [CSVfile]
import pprint
import sys

import pandas as pd
from gensim import models

args = sys.argv
filename = args[1]

print('>>>> Read CSV PINJA dataset')
df = pd.read_csv(filename, header=0, index_col=0, dtype='str')
# Rows have different lengths; mark the padding cells so they can be dropped.
df = df.fillna('EMPTY')
asm_all = []
for row in df.values.tolist():
    asm_all.append([s for s in row if s != 'EMPTY'])

print('>>>> Make the ML model')
asmtext = []
for x in asm_all:
    # The first token of each row is used as the document tag.
    asmtext.append(models.doc2vec.TaggedDocument(words=x, tags=[x[0]]))
model = models.Doc2Vec(asmtext, dm=1, vector_size=300, window=5, alpha=.025, min_alpha=.025, min_count=0, sample=1e-6)

print('>>>> Learning and Save model')
epoch_num = 8
for epoch in range(epoch_num):
    print('Epoch: {}'.format(epoch + 1))
    # `model.epochs` replaces the older `model.iter` attribute.
    model.train(asmtext, epochs=model.epochs, total_examples=model.corpus_count)
    # Lower the learning rate linearly from 0.025 toward 0.0001.
    model.alpha -= (0.025 - 0.0001) / (epoch_num - 1)
    model.min_alpha = model.alpha
model.save(filename + ".doc2vec")

print('>>>> Load model and Print Binary Code Similarity')
model = models.Doc2Vec.load(filename + '.doc2vec')
sim_list = []
for x in asm_all:
    # gensim 4 renames `docvecs` to `dv`.
    sim_list.append([x[0], model.docvecs.most_similar([x[0]])])
pprint.pprint(sim_list)

Project based on the cookiecutter data science project template. #cookiecutterdatascience
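
The training loop in use.py lowers the learning rate by a fixed step after every pass. As a small worked example of that arithmetic, the sketch below lists the rate used in each of the 8 passes. The function name `alpha_schedule` is hypothetical; the start value 0.025, the end value 0.0001, and the number of passes are taken from the script.

```python
def alpha_schedule(start=0.025, end=0.0001, epoch_num=8):
    """Return the learning rate used in each pass of the training loop.

    Mirrors the update `alpha -= (start - end) / (epoch_num - 1)`, which is
    applied after every pass, so the first pass uses `start` and the last
    pass uses a value very close to `end`.
    """
    step = (start - end) / (epoch_num - 1)
    rates = []
    alpha = start
    for _ in range(epoch_num):
        rates.append(alpha)  # rate in effect during this pass
        alpha -= step        # decay applied for the next pass
    return rates


if __name__ == "__main__":
    for number, rate in enumerate(alpha_schedule(), start=1):
        print(f"Epoch {number}: alpha = {rate:.6f}")
```

Because the script sets `min_alpha` equal to `alpha` before each call to `train`, the rate stays constant within a pass and only changes between passes.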


