<a href="https://colab.research.google.com/github/binaryinferno/binaryinferno/blob/main/BinaryInferno.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Standard Caveats: 
* Grad Student Code. 
* Papers first, then code cleanup. 

What does it need to run? 

* Python 3 (3.7, I think)
* the shell `parallel` command for the pattern search and speedups
* `scipy` and `sklearn` (scikit-learn) in a couple of places. Those might be vestigial.

What do you need to do?

* Get your data extracted as hex, one message per line. 
* You can do this with `tshark`, but be careful to strip the encapsulating TCP / UDP headers so that only the message payload remains.
* If you know a priori whether the system is big or little endian, run it with the matching flag. This restricts the tool to building descriptions out of detectors of that endianness.
  * `--detectors BE` for Big Endian
  * `--detectors LE` for Little Endian
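
As a rough sketch of the input format, assuming you already have the application-layer payloads as raw bytes (the `payloads` list and the `to_hex_lines` helper are made-up names for illustration, not part of BinaryInferno):

```python
def to_hex_lines(payloads):
    """Return one hex string per message payload, skipping empty payloads."""
    return [p.hex() for p in payloads if p]


if __name__ == "__main__":
    # Hypothetical payloads: message bytes only, no Ethernet/IP/TCP/UDP headers.
    payloads = [
        bytes.fromhex("01000D60A67AED054150504C45"),
        b"",  # e.g. a bare ACK with no payload; dropped
        bytes.fromhex("01000E60A67AF9064F52414E4745"),
    ]
    # One message per line, ready to redirect into input.txt.
    for line in to_hex_lines(payloads):
        print(line)
```

Redirecting the printed lines into `input.txt` gives a file in the shape the examples below use.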

Time stuff

* If messages are all the same length, no serialization pattern search will be performed, since we assume the fields are fixed length.
* Serialization pattern search is the slowest part. We use parallelization via some lowest-cost technically acceptable shell scripts. 
* More CPUs help with serialization pattern search. We used 40 cores and 128GB of RAM for the paper.
* There's a parameter deep in there which sets the amount of time before the serialization pattern search will give up when searching from a specific offset. 
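
If you want to know ahead of time whether your data will hit the slow path, you can check the message lengths yourself. This is only a sketch of that check with illustrative function names, not BinaryInferno's own code:

```python
def message_lengths(hex_lines):
    """Return the set of message lengths (in bytes) for the non-empty hex lines."""
    lengths = set()
    for line in hex_lines:
        line = line.strip()
        if line:
            lengths.add(len(bytes.fromhex(line)))
    return lengths


def is_fixed_length(hex_lines):
    """True when every message has the same length."""
    return len(message_lengths(hex_lines)) <= 1


if __name__ == "__main__":
    messages = [
        "01000D60A67AED054150504C45",
        "01001160A67B0504504C554D0450454152",
        "01000E60A67AF9064F52414E4745",
    ]
    print(sorted(message_lengths(messages)))
    if is_fixed_length(messages):
        print("All messages are the same length: expect no serialization pattern search.")
    else:
        print("Lengths vary: expect the (slow) serialization pattern search to run.")
```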

* If you have a question about use, email me or post an issue. I'll do my best to help. 

* I will work on getting a better set of documentation together in the future as my schedule allows.

* If you're a research group / organization I'm happy to schedule a more in-depth discussion. 

In [None]:
%%bash
# Setup stuff

# We need parallel because we use a shell script deep down to run the serialization pattern search in parallel
apt -q -y install parallel > /dev/null 


# We use this stuff to calculate entropy 
pip3 install scikit-learn > /dev/null  # imported as "sklearn"; the "sklearn" name on PyPI is deprecated
pip3 install scipy > /dev/null

# Get a copy of the source.
git clone https://github.com/binaryinferno/binaryinferno.git



Reading package lists...


Cloning into 'binaryinferno'...


In [None]:
%%bash

# Setup our input file with our hex messages (one message per line)
cat <<EOT > input.txt
00000012000005d60004746573740a6b6b622d7562756e747500
0000001e000009f9030474657374175468697320697320612074657374206d65737361676521
00000017000007570304746573741048656c6c6f202d2074657374696e6721
000000150000068d021349276d20676f696e672061776179206e6f7721
EOT


# The flag "BE" means use only BIG ENDIAN detectors
# Use "LE" for LITTLE ENDIAN detectors
(cd binaryinferno/binaryinferno ; cat ../../input.txt | python3 blackboard.py --detectors BE 1> ../../log.txt 2> ../../errs.txt )

# log.txt contains BinaryInferno's exhaustive output
# errs.txt contains anything which came out on stderr
# We mainly care about the stuff at the very end of log.txt
cat log.txt | awk '/INFERRED DESCRIPTION/,/SPECEND/'



The above should have produced the following output:



```
INFERRED DESCRIPTION
--------------------------------------------------------------------------------

	LLLLLLLL | ?????????? RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
	--
	00000012 | 000005D600 04746573740A6B6B622D7562756E747500
	0000001E | 000009F903 0474657374175468697320697320612074657374206D65737361676521
	00000017 | 0000075703 04746573741048656C6C6F202D2074657374696E6721
	00000015 | 0000068D02 1349276D20676F696E672061776179206E6F7721
	--
	0 L BE UINT32 LENGTH + 8 = TOTAL MESSAGE LENGTH 16.0
	1 ? UNKNOWN TYPE 5 BYTE(S) 20.0
	2 R 0T_1L_V_BIG* 88.0

QTY SAMPLES
4
HEADER ONLY
LLLLLLLL | ?????????? RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
SPECSTART
Length 4V_BE (BE uint32 Length + 8 = Total Message Length)
FieldFixed 5V (Unknown Type 5 Byte(s))
FieldRep *Q_0T_1L_1V_BE (0T_1L_V_big*)
SPECEND
```
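
To make the description concrete, here is a small sketch that splits a message along the inferred fields. It assumes our reading of the spec: a BE uint32 length where length + 8 equals the total message length, 5 bytes of unknown type, and then `0T_1L_V` records repeated to the end of the message (no type byte, a 1-byte length, then the value). `parse_message` is a hypothetical helper, not something BinaryInferno ships.

```python
import struct


def parse_message(hex_msg):
    """Split one message into (length, unknown, values) per the assumed layout."""
    data = bytes.fromhex(hex_msg)
    # Field 0: BE uint32 length; length + 8 should equal the total message length.
    (length,) = struct.unpack(">I", data[:4])
    if length + 8 != len(data):
        raise ValueError(f"length field {length} + 8 != {len(data)} bytes")
    # Field 1: 5 bytes of unknown type.
    unknown = data[4:9]
    # Field 2: repeated records, each a 1-byte length followed by that many bytes.
    values, pos = [], 9
    while pos < len(data):
        n = data[pos]
        end = pos + 1 + n
        if end > len(data):
            raise ValueError("record runs past the end of the message")
        values.append(data[pos + 1:end])
        pos = end
    return length, unknown, values


if __name__ == "__main__":
    msg = "00000012000005d60004746573740a6b6b622d7562756e747500"
    length, unknown, values = parse_message(msg)
    print(length, unknown.hex(), values)
```

For the first sample message this yields a length of 18, the unknown bytes `000005d600`, and the values `test`, `kkb-ubuntu`, and an empty string.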



In [None]:
%%bash


# This is the example we show in the paper 
# Setup our input file with our hex messages (one message per line)
cat <<EOT > input.txt
01000D60A67AED054150504C45
01001160A67B0504504C554D0450454152
01000E60A67AF9064F52414E4745
EOT

# The flag "BE" means use only BIG ENDIAN detectors
# Use "LE" for LITTLE ENDIAN detectors
# --tslow is the lower bound for timestamps
# --tshigh is the upper bound for timestamps
# Don't worry if the window is years too wide, that's totally fine
(cd binaryinferno/binaryinferno; cat ../../input.txt | python3 blackboard.py --detectors BE --tslow "2001-02-08 11:41:41" --tshigh "2028-02-08 11:41:41" 1> ../../log.txt 2> ../../errs.txt )


# log.txt contains BinaryInferno's exhaustive output
# errs.txt contains anything which came out on stderr
# We mainly care about the stuff at the very end of log.txt

cat log.txt | awk '/INFERRED DESCRIPTION/,/SPECEND/'



The above should have produced the following results:
```
INFERRED DESCRIPTION
--------------------------------------------------------------------------------

	?? LLLL | TTTTTTTT RRRRRRRRRRRR
	--
	01 000D | 60A67AED 054150504C45
	01 0011 | 60A67B05 04504C554D0450454152
	01 000E | 60A67AF9 064F52414E4745
	--
	0 ? UNKNOWN TYPE 1 BYTE(S) 3.0
	1 L BE UINT16 LENGTH + 0 = TOTAL MESSAGE LENGTH 6.0
	2 T BE 32BIT SPAN SECONDS 2001-02-08 11:41:41.000000 TO 2028-02-08 11:41:41.000000 1.0 12.0
	3 R 0T_1L_V_BIG* 23.0

QTY SAMPLES
3
HEADER ONLY
?? LLLL | TTTTTTTT RRRRRRRRRRRR
SPECSTART
FieldFixed 1V (Unknown Type 1 Byte(s))
Length 2V_BE (BE uint16 Length + 0 = Total Message Length)
FieldFixed 4V_BE (BE 32BIT SPAN Seconds 2001-02-08 11:41:41.000000 to 2028-02-08 11:41:41.000000 1.0)
FieldRep *Q_0T_1L_1V_BE (0T_1L_V_big*)
SPECEND
```
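
As a sanity check on the `T` field, the sketch below decodes the 4-byte values by hand and tests them against the same window. It assumes the field is seconds since the Unix epoch and that the bounds are in UTC; both are our assumptions, and the helper names are illustrative. This is not BinaryInferno's timestamp detector.

```python
from datetime import datetime, timezone


def decode_be_timestamp(hex_field):
    """Interpret hex bytes as big-endian seconds since the Unix epoch (UTC)."""
    seconds = int.from_bytes(bytes.fromhex(hex_field), "big")
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def in_window(ts, low, high):
    """True when ts falls inside the inclusive [low, high] window."""
    return low <= ts <= high


if __name__ == "__main__":
    # Same bounds as --tslow / --tshigh above, assumed here to be UTC.
    low = datetime(2001, 2, 8, 11, 41, 41, tzinfo=timezone.utc)
    high = datetime(2028, 2, 8, 11, 41, 41, tzinfo=timezone.utc)
    for field in ["60A67AED", "60A67B05", "60A67AF9"]:
        ts = decode_be_timestamp(field)
        print(field, ts.isoformat(), in_window(ts, low, high))
```

Under those assumptions all three samples decode to 2021-05-20, a few seconds apart, which sits comfortably inside the window.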
