Latest commit 31cd4c2 Oct 19, 2019

BINARY PINJA

This tool creates datasets for machine learning (ML) programs that use natural language processing (NLP) techniques. Its main feature is disassembling x86_64 executables in PE and ELF format and writing the disassembly to CSV files that are easy to handle in ML workflows. Its advantages are simpler usage and being free and open source, with no dependence on paid tools such as IDAPython.

FEATURES:

  • Input

    • DirectoryPath : directory of PE-format (.exe) files or ELF-format executables
  • Output

    • [dirname]_EP.csv : Entry point of every file
    • [dirname]_EP_asm.csv : Disassembly starting at the entry point of every file, for an arbitrary number of bytes
    • [dirname]_TEXTSec_asm.csv : Disassembly of the text section of every file
    • [dirname]_TEXTSec_asm_TRANS.csv : Disassembly from the CSV file, transformed by arbitrary rules
    • [dirname]_FUNC_asm.csv : Disassembly of every function of every file (ELF binaries)
    • [dirname]_FUNC_asm_TRANS.csv : Disassembly from the CSV file, transformed by arbitrary rules (ELF binaries)
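
To make the "entry point" outputs above more concrete, here is a minimal sketch of where the entry-point address lives in an ELF64 header. It uses only the standard library and is not pinja's own code (the reference list suggests pinja relies on elftools and pefile); the function name `elf64_entry_point` and the synthetic header are assumptions made for illustration.

```python
import struct

ELF_MAGIC = b"\x7fELF"
ELFCLASS64 = 2       # value of e_ident[4] for 64-bit objects
ELFDATA2LSB = 1      # value of e_ident[5] for little-endian data
E_ENTRY_OFFSET = 24  # e_entry follows e_ident(16) + e_type(2) + e_machine(2) + e_version(4)


def elf64_entry_point(header: bytes) -> int:
    """Return e_entry from the first bytes of an ELF64 file (hypothetical helper)."""
    if len(header) < E_ENTRY_OFFSET + 8 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF image")
    if header[4] != ELFCLASS64:
        raise ValueError("this sketch only handles ELF64")
    byte_order = "<" if header[5] == ELFDATA2LSB else ">"
    (entry,) = struct.unpack_from(byte_order + "Q", header, E_ENTRY_OFFSET)
    return entry


# Synthetic 32-byte header prefix: x86_64 executable with entry point 0x401000.
fake_header = (
    ELF_MAGIC
    + bytes([ELFCLASS64, ELFDATA2LSB, 1, 0])
    + bytes(8)
    + struct.pack("<HHIQ", 2, 0x3E, 1, 0x401000)
)
print(hex(elf64_entry_point(fake_header)))
```

For a real file, the first 32 bytes read with `open(path, "rb").read(32)` could be passed in the same way.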
Sample CSV (image): out_FUNC_asm.csv
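
The CSV outputs can also be consumed without pandas. The sketch below assumes the layout that `use.py` appears to expect: a header row, an index in the first column, an identifying value next, then one disassembly token per cell with trailing empty cells as padding. The sample text and the helper name `load_token_rows` are made up for illustration and are not real pinja output.

```python
import csv
import io
from typing import List


def load_token_rows(csv_text: str) -> List[List[str]]:
    """Parse CSV text into token lists, dropping the index column and empty cells."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    rows = []
    for record in reader:
        tokens = [cell for cell in record[1:] if cell != ""]
        if tokens:  # ignore rows that contain nothing but padding
            rows.append(tokens)
    return rows


# Assumed layout, hand-written for this example.
sample = (
    ",name,0,1,2\n"
    "0,a.out,push rbp,mov rbp rsp,ret\n"
    "1,b.out,xor eax eax,ret,\n"
)
rows = load_token_rows(sample)
for row in rows:
    print(row)
```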

HOW TO INSTALL:

pip3 install -r requirements.txt 
pip3 install .

HOW TO INSTALL for Developers:

pip3 install -r requirements.txt 
pip3 install -e . 

Usage:

pinja [INPUT_DIRPATH]
Example Command:
pinja --help
pinja data/infilePE
pinja -f elf data/infileELF  -b 180 -o OUTPUTNAME 
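
When several directories need processing, the command lines above can be assembled programmatically. This is a small sketch that only uses the flags shown in the examples (`-f`, `-b`, `-o`); their meanings (file format, byte count, output name) are inferred from those examples, and `build_pinja_command` is a hypothetical helper, not part of pinja.

```python
from typing import List, Optional


def build_pinja_command(
    input_dir: str,
    file_format: Optional[str] = None,
    num_bytes: Optional[int] = None,
    output_name: Optional[str] = None,
) -> List[str]:
    """Build an argument list that mirrors the example commands."""
    cmd = ["pinja"]
    if file_format is not None:
        cmd += ["-f", file_format]
    cmd.append(input_dir)
    if num_bytes is not None:
        cmd += ["-b", str(num_bytes)]
    if output_name is not None:
        cmd += ["-o", output_name]
    return cmd


print(build_pinja_command("data/infilePE"))
print(build_pinja_command("data/infileELF", "elf", 180, "OUTPUTNAME"))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once pinja is installed.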

Use Case: Binarycode Similarity with pinja-dataset

Purpose: Get binary code similarity scores by using a pinja dataset
Type: Machine learning with natural language processing
Overview: Outputs a similarity score, as a numerical value from 0 to 1, for each binary code.
pip3 install pandas
pip3 install gensim
cd UseCase/
python3 use.py out_TEXTSec_asm.csv
DEMO: takes about 1 minute
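
The similarity scores come from gensim's `most_similar`, which ranks documents by cosine similarity between their vectors. The sketch below shows that measure in plain Python on made-up vectors; `cosine_similarity` is an illustrative helper, not the gensim implementation. Note that cosine similarity can in principle range from -1 to 1, so the 0-to-1 range stated above is best read as the typical range for similar documents.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # unrelated vectors
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # opposite vectors
print(same, orthogonal, opposite)
```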

Source code
#!/usr/bin/python3
# Usage: python3 use.py [CSVfile]

import pandas as pd
from gensim import models
import pprint
import sys

args = sys.argv
filename = args[1]

print('>>>> Read CSV PINJA dataset')
df = pd.read_csv(filename, header=0, index_col=0, dtype='str')
df = df.fillna('EMPTY')
asm_all = []
for row in df.values.tolist():
    asm_all.append([s for s in row if s != 'EMPTY'])

print('>>>> Make the ML model')
asmtext = []
for x in asm_all:
    asmtext.append(models.doc2vec.TaggedDocument(words=x, tags=[x[0]]))
model = models.Doc2Vec(asmtext, dm=1, vector_size=300, window=5, alpha=.025, min_alpha=.025, min_count=0, sample=1e-6)

print('>>>> Learning and Save model')
epoch_num = 8
for epoch in range(epoch_num):
    print('Epoch: {}'.format(epoch + 1))
    # gensim 4.x renamed `model.iter` to `model.epochs`
    model.train(asmtext, epochs=model.epochs, total_examples=model.corpus_count)
    model.alpha -= (0.025 - 0.0001) / (epoch_num - 1)
    model.min_alpha = model.alpha
model.save(filename + ".doc2vec")

print('>>>> Load model and Print Binary Code Similarity')
model = models.Doc2Vec.load(filename + '.doc2vec')
sim_list = []
for x in asm_all:
    # gensim 4.x exposes document vectors as `model.dv` (formerly `model.docvecs`)
    sim_list.append([x[0], model.dv.most_similar([x[0]])])
pprint.pprint(sim_list)
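
The training loop above lowers the learning rate by hand, stepping `alpha` linearly from 0.025 down to 0.0001 over the 8 epochs. The arithmetic can be checked in isolation with the sketch below; `linear_alpha_schedule` is a hypothetical helper written for this illustration and does not call gensim.

```python
from typing import List


def linear_alpha_schedule(start: float = 0.025, end: float = 0.0001, epochs: int = 8) -> List[float]:
    """Learning rate used in each epoch when decaying linearly from start to end."""
    if epochs < 2:
        raise ValueError("need at least two epochs to interpolate")
    step = (start - end) / (epochs - 1)  # same decrement as in the loop above
    return [start - i * step for i in range(epochs)]


schedule = linear_alpha_schedule()
for epoch, alpha in enumerate(schedule, start=1):
    print("Epoch {}: alpha = {:.6f}".format(epoch, alpha))
```

The first epoch runs at the starting rate and the last one at (approximately) the final rate, which is why the decrement divides by `epoch_num - 1` rather than `epoch_num`.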

Reference

  • click
  • capstone
  • glob
  • pefile
  • pefile UsageExamples.md
  • pefile.DIRECTORY_ENTRY
  • PEheader pefileformat.html
  • elftools
  • elftools user's guide
  • elftools example
  • ELF about elf

Project based on the cookiecutter data science project template. #cookiecutterdatascience
