
This tool creates datasets for machine learning (ML) programs in natural language processing (NLP). Its main feature is converting x86_64 executable files in PE and ELF format into disassembly code and writing the results to CSV files that are easy to handle in ML. Its advantages are simpler usage and that it is free and open source, with no need for paid tools such as IDA-Python.
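To give a feel for the kind of header parsing involved in locating an entry point, here is a minimal sketch that reads `e_entry` from an ELF header using only the standard library. It is an illustration under stated assumptions, not the tool's own implementation: the function name `elf_entry_point` and the hand-built `demo_header` are hypothetical, and PE files are not covered.

```python
import struct


def elf_entry_point(header: bytes) -> int:
    """Return the e_entry field from the first bytes of an ELF file."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64bit = header[4] == 2                 # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if header[5] == 1 else ">"   # EI_DATA: 1 = little, 2 = big
    # e_entry follows e_ident (16 bytes), e_type (2), e_machine (2), e_version (4),
    # so it starts at offset 24 and is 8 bytes wide on ELF64, 4 bytes on ELF32.
    fmt = endian + ("Q" if is_64bit else "I")
    (entry,) = struct.unpack_from(fmt, header, 24)
    return entry


# Hand-built 64-bit little-endian header, for demonstration only (hypothetical data).
demo_header = (
    b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)   # e_ident (16 bytes)
    + struct.pack("<HHI", 2, 62, 1)               # e_type=ET_EXEC, e_machine=EM_X86_64, e_version
    + struct.pack("<Q", 0x401000)                 # e_entry
)
print(hex(elf_entry_point(demo_header)))  # prints 0x401000
```

On a real file, the same function could be given the first 32 bytes read in binary mode.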


  • Input

    • DirectoryPath : a directory of PE-format (.exe) files or ELF-format executable files
  • Output

    • [dirname]_EP.csv : entry point of every file
    • [dirname]_EP_asm.csv : disassembly code starting at the entry point of every file, for an arbitrary number of bytes
    • [dirname]_TEXTSec_asm.csv : disassembly code of the text section of every file
    • [dirname]_TEXTSec_asm_TRANS.csv : disassembly code from the CSV file, transformed by arbitrary rules
    • [dirname]_FUNC_asm.csv : disassembly code of every function of every file (ELF binaries)
    • [dirname]_FUNC_asm_TRANS.csv : disassembly code from the CSV file, transformed by arbitrary rules (ELF binaries)
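The `_TRANS` outputs apply transformation rules to disassembly text. The rules themselves are not spelled out here, so the sketch below only shows one plausible shape for such a step: an ordered list of regex rules that replaces memory operands and immediate values with placeholder tokens. The rule set, the `MEM`/`IMM` tokens, and the `transform` function are all assumptions made for illustration, not the tool's actual rules.

```python
import re

# Hypothetical rules, applied in order: (compiled pattern, replacement token).
RULES = [
    (re.compile(r"\[[^\]]*\]"), "MEM"),                  # any bracketed memory operand
    (re.compile(r"\b0x[0-9a-fA-F]+\b|\b\d+\b"), "IMM"),  # hex or decimal immediates
]


def transform(instruction: str) -> str:
    """Apply every rule in order to one line of disassembly text."""
    for pattern, replacement in RULES:
        instruction = pattern.sub(replacement, instruction)
    return instruction


for line in ["mov eax, 0x10", "mov rax, qword ptr [rip + 0x2004]", "push rbp"]:
    print(transform(line))
```

Normalizing operands this way shrinks the vocabulary an NLP model has to learn, at the cost of discarding concrete addresses and constants.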

csv sample


HOW TO INSTALL:

pip3 install -r requirements.txt
pip3 install .

HOW TO INSTALL for Developers:

pip3 install -r requirements.txt 
pip3 install -e . 


Example Command:
pinja --help
pinja data/infilePE
pinja -f elf data/infileELF  -b 180 -o OUTPUTNAME 


Use Case: Binary Code Similarity with the pinja dataset

Purpose: Get binary code similarity scores by using the pinja dataset.
Type: Machine learning for natural language processing.
Overview: Output a similarity score, as a numerical value from 0 to 1, for each binary code.
pip3 install pandas
pip3 install gensim
cd UseCase/
python3 out_TEXTSec_asm.csv
DEMO: about 1 minute
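To make the idea of a similarity score concrete, here is a minimal sketch that computes cosine similarity between two instruction sequences using plain token counts. This is a deliberately simpler stand-in for the Doc2Vec model used in this use case, not the same technique; the function name and sample token lists are hypothetical. With non-negative counts the score always falls between 0 and 1, whereas cosine similarity between learned vectors can in principle be negative.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the token-count vectors of two sequences."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only tokens present in both sequences contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical mnemonic sequences standing in for two binaries.
first = ["push", "mov", "ret"]
second = ["push", "mov", "call"]
print(round(cosine_similarity(first, second), 4))
```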


Source code
# Usage: python3 [CSVfile]

import pandas as pd
from gensim import models
import pprint
import sys

args = sys.argv
filename = args[1]

print('>>>> Read CSV PINJA dataset')
df = pd.read_csv(filename, header=0, index_col=0, dtype='str')
df = df.fillna('EMPTY')
asm_all = []
for row in df.values.tolist():
    asm_all.append([s for s in row if s != 'EMPTY'])

print('>>>> Make the ML model')
asmtext = []
for x in asm_all:
    asmtext.append(models.doc2vec.TaggedDocument(words=x, tags=[x[0]]))
model = models.Doc2Vec(asmtext, dm=1, vector_size=300, window=5, alpha=.025, min_alpha=.025, min_count=0, sample=1e-6)

print('>>>> Learning and Save model')
epoch_num = 8
for epoch in range(epoch_num):
    print('Epoch: {}'.format(epoch + 1))
    model.train(asmtext, epochs=model.epochs, total_examples=model.corpus_count)
    model.alpha -= (0.025 - 0.0001) / (epoch_num - 1)
    model.min_alpha = model.alpha
model.save(filename + ".doc2vec")

print('>>>> Load model and Print Binary Code Similarity')
model = models.Doc2Vec.load(filename + '.doc2vec')
sim_list = []
for x in asm_all:
    sim_list.append([x[0], model.docvecs.most_similar([x[0]])])
pprint.pprint(sim_list)
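The nested `sim_list` built above holds, for each tag, a list of `(other_tag, score)` pairs. If a flat table is easier to work with, something like the following sketch could turn it into CSV rows sorted by score. The `flatten` helper, the column names, and the `demo` data are hypothetical and only mirror the shape of `sim_list`.

```python
import csv
import io


def flatten(sim_list):
    """Turn [[tag, [(other, score), ...]], ...] into sorted (tag, other, score) rows."""
    rows = []
    for tag, neighbours in sim_list:
        for other, score in neighbours:
            rows.append((tag, other, round(float(score), 4)))
    # Highest similarity first; ties keep their original order.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Made-up data with the same shape as sim_list, for demonstration only.
demo = [
    ["a.exe", [("b.exe", 0.91), ("c.exe", 0.35)]],
    ["b.exe", [("a.exe", 0.91)]],
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["file", "similar_file", "score"])
writer.writerows(flatten(demo))
print(buffer.getvalue())
```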









Project based on the cookiecutter data science project template. #cookiecutterdatascience
