# Steganography in Python

In [1]:
!cowsay -f stegosaurus "Uhhh... I'm supposed to be here, right?"

 _________________________________________ 
< Uhhh... I'm supposed to be here, right? >
 ----------------------------------------- 
\                             .       .
 \                           / `.   .' " 
  \                  .---.  <    > <    >  .---.
   \                 |    \  \ - ~ ~ - /  /    |
         _____          ..-~             ~-..-~
        |     |   \~~~\.'                    `./~~~/
       ---------   \__/                        \__/
      .'  O    \     /               /       \  " 
     (_____,    `._.'               |         }  \/~~~/
      `----.          /       }     |        /    \__/
            `-.      |       /      |       /      `. ,~~|
                ~-.__|      /_ - ~ ^|      /- _      `..-'   
                     |     /        |     /     ~-.     `-. _  _  _
                     |_____|        |_____|         ~ - . _ _ _ _ _>


## Setup

I'll be using Conda (miniconda to be specific) to handle virtual environments, but feel free to use `venv` or whatever else you like.

If you don't have python:

https://conda.io/miniconda.html

Get python 3.

On MacOS:
```bash
$ cd ~
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
$ bash Miniconda3-latest-MacOSX-x86_64.sh
```
Type `yes` when prompted.

Now set up the environment:

```bash
$ conda create -n steganography python=3.6 jupyter
$ source activate steganography
$ pip install jupyter
$ git clone git@github.com:fastforwardlabs/steganos.git
$ cd steganos
$ python setup.py install
```
And start a notebook server
```bash
$ jupiter notebook
```

If things don't automatically open, navigate to `localhost:8888` in your favorite browser.

In [2]:
cd ~/steganos

/Users/charlesfranzen/steganos


## Brainstorm: What is steganography?

Discuss with a partner. Can you think of any examples of steganography? 

## Text steganography

This blogpost inspired this talk.

http://blog.fastforwardlabs.com/2017/06/23/fingerprinting-documents-with-steganography.html

In [3]:
import steganos

source code here: https://github.com/fastforwardlabs/steganos

Let's start with a short sample of text:

In [4]:
original_text = '"Hiya!" he said.\n\t"I cannot believe there are 6 elephants outside!"'
print(original_text)

"Hiya!" he said.
	"I cannot believe there are 6 elephants outside!"


How can we encode information in this text?

By identifying branchpoints in the text, ie. points at which we can alter the text in a subtle way.

Examples of branchpoints:
* contractions
* numerals
* confusable characters
* invisible characters

Exercise: Read through the source [here](https://github.com/fastforwardlabs/steganos/blob/master/steganos/src/branchpoints.py) and identify at least 2 other branchpoints. Try to think of a branchpoint that's _not_ in the source. Their list is by no means exhaustive!


In [5]:
capacity = steganos.bit_capacity(original_text)
print(f'{capacity} bits can be encoded in \n{original_text}')

12 bits can be encoded in 
"Hiya!" he said.
	"I cannot believe there are 6 elephants outside!"


In [6]:
# ascii branchpoints
steganos.src.branchpoints.ascii_branchpoints(original_text)

[[(17, 18, '    ')], [(21, 27, "can't")]]

What are some other branchpoints in this text sample?

The package can encode bits automatically.

In [7]:
# encoding some bits
hidden_bits = '1001'
encoded_text = steganos.encode(hidden_bits, original_text)
print('***Original***')
print(original_text)
print('\n***Encoded***')
print(encoded_text)
decoded_bits = steganos.decode_full_text(encoded_text, original_text)
print(f'\nThe message "{decoded_bits}" was decoded.')

***Original***
"Hiya!" he said.
	"I cannot believe there are 6 elephants outside!"

***Encoded***
"Hiya!" he said‏‎.
    "I cannot​ believe​ there are six elephants​ outside!"

The message "100110011001" was decoded.


Exercise: Examine the differences in the string representations. Do you notice anything odd? Try flipping some of the hidden bits. Which branchpoints are the most noticeable?

Exercise: Create your own text to encode information into. What's its bit capacity according to `steganos`? Can you create a text that is shorter than the above text, looks innocuous, and has a _higher_ bit capacity than 12? Identify some branchpoints in your message. Try encoding and decoding some bits.

It's great that we can encode bits, but ultimately we want to encode hidden messages.

### Objective: convert message strings into bit strings

Bit strings can be encoded by `steganos`, but only message strings can be read by humans. As it happens, computers _also_ represent strings as bits.

### A brief detour on string encodings

#### Discuss with a partner: how are strings represented in computers?

We won't talk about the data structures used in string representation, just the encoding schemes. There are many ([here's an old western European one](https://en.wikipedia.org/wiki/EBCDIC_1047)), but **ASCII** and **Unicode** are the most important in Python.

**ASCII**:
* Standard encoding in Python 2.
* One byte per character.
* 2^7 = 128 characters (not all of which are printable).
* Each character maps to an integer, which is also the encoding.
    * eg. 'A' = 65 = 01000001
    * eg. 'a' = 97 = 01100001
    * Yes, there's an extra bit. ASCII was concieved before the 8-bit byte became a standard.
* Caused many headaches when having to convert between encodings.

https://ascii.cl/


**Unicode**:
* Standard encoding in Python 3.
* Variable byte encoding (1-4).
* Over 1M characters.
* Each character maps to an integer, or _code point_, which has different encodings.
* One encoding to rule them all. Get text from your colleage in Japan. You still might not be able to read it, but it won't be the encoding's fault!
* ASCII converts directly to UTF-8.

https://www.unicode.org/

While Unicode was created to help people communicate more easily, it also helps our deception through steganography.

### Back to string-to-bit conversion

In [8]:
# useful functions for working with characters

# ord() converts unicode string characters into ints
print(f'"A" as an int: {ord("A")}')
# chr() does the reverse
print(f'65 as a char: {chr(65)}')

# oct() and hex() convert ints to base 8 and 16, respectively
print(f'100 in octal: {oct(100)}')
print(f'100 in hex: {hex(100)}')

# int takes a kwarg for base
print(f'0xff in base 10: {int("0xff", base=16)}')
print(f'0o377 in base 10: {int("0o377", base=8)}')

"A" as an int: 65
65 as a char: A
100 in octal: 0o144
100 in hex: 0x64
0xff in base 10: 255
0o377 in base 10: 255


In [9]:
# converting between ints and chars
# ints returned for Unicode return the code point.
# NOTE: when lookin up code points, they are usually given as hex values
ints = [ord(c) for c in 'Chiρ']
print(ints)
chars = [chr(i) for i in ints]
print(''.join(chars))

[67, 104, 105, 961]
Chiρ


Confusables can be used to create branchpoints.

In [10]:
# confusables in unicode
chr(int('03f9', base=16))

'Ϲ'

In [11]:
chr(int('03c1', base=16))

'ρ'

In [12]:
# unicode support glyphs from many languages
chr(15000)

'㪘'

Exercise: Print your name in confusable characters. Look up confusables for your name [here](https://unicode.org/cldr/utility/confusables.jsp?a=Chip&r=None). Represent your confusable name as a string, as base 10 ints, and as hex values.

In [13]:
def hex_to_str(hex_list):
    return ''.join([chr(int(code, base=16)) for code in hex_list])

confused_chip = ['03f9', '0570', 'ab75', '03c1']
print(hex_to_str(confused_chip))
very_confused_chip = ['1d672', '1d691', '1d692', '1d699']
print(hex_to_str(very_confused_chip))
print('Chip')

Ϲհꭵρ
𝙲𝚑𝚒𝚙
Chip


`int`s can be easily converted into binary. Let's try to use this to encode characters from our message for enciphering.

In [14]:
print(f'λ as an int: {ord("λ")}')
print(f'in binary: {bin(955)}')
print(f'n bits required to represent lambda: {len(bin(955)[2:])}')

λ as an int: 955
in binary: 0b1110111011
n bits required to represent lambda: 10


In [15]:
print(int('01110', base=2))
print(int('11011', base=2))

14
27


In [16]:
chr(14)

'\x0e'

In [17]:
chr(27)

'\x1b'

Huh, the binary representaion of $\lambda$ requires 10 bits. See any problems here?

How to tell the difference between $\lambda$ (01110111011) and Form Feed (01110) followed by End of Transmission Block (10001)?

### All about bytes

Unicode to the rescue! Unicode encodes characters as bytes (blocks of 8 bits). UTF-8 is variable byte, with each character taking up between 1 and 4 bytes. https://en.wikipedia.org/wiki/UTF-8#Description Each character is easily seperable. There are other Unicode encodings (UTF-16 and UTF-32), but UTF-8 tends to be the most compact.

In [18]:
# string.encode() converts strings into bytes
# string.decode() does the reverse

# NOTE: the default encoding in python3 is utf-8, but I'll pass it explicitly here to stress
# that we are using this encoding.
chip_still_confused = 'Chiρ'
print(f'original string: {chip_still_confused}')
encoded = 'Chiρ'.encode('utf-8')
print(f'encoded: {encoded}')
print(f'printing each element: {[c for c in encoded]}')
print(f'type of encoded string: {type(encoded)}')
decoded = encoded.decode('utf-8')
print(f'decoded: {decoded}')

original string: Chiρ
encoded: b'Chi\xcf\x81'
printing each element: [67, 104, 105, 207, 129]
type of encoded string: <class 'bytes'>
decoded: Chiρ


Exercise: write a function that takes in a unicode string and returns a binary string representation. Use UTF-8.

In [19]:
def message_to_bit_string(message):
    bytes_message = message.encode('utf-8', 'strict')
    bin_list = [bin(c) for c in bytes_message]
    bit_string_list = [c[2:].rjust(8, '0') for c in bin_list]
    bit_string = ''.join(bit_string_list)
    return bit_string

def chunk_string(s, n, fill=None):
    if fill is None:
        assert len(s) % n == 0, f'{s}, len {len(s)}, must break evenly into {n} chunks'
    else:
        n_fill_chars = (n - (len(s) % n)) % n
        s += fill * n_fill_chars
    return [s[i:i+n] for i in range(0, len(s), n)]
    
def bit_string_to_message(bit_string):
    bit_string_list = chunk_string(bit_string, 8)
    ints_list = [(int(c, base=2)) for c in bit_string_list]
    byte_string = bytes(ints_list)
    chars = byte_string.decode('utf-8')
    return chars

In [20]:
# tests
assert message_to_bit_string('Chíp') == '0100001101101000110000111010110101110000'
assert message_to_bit_string('Noisebridge') == '0100111001101111011010010111001101100101011000100111001001101001011001000110011101100101'

Exercise: write the reverse, ie. a function that takes in a binary string and returns a unicode message. Use UTF-8. This is a bit harder than the first one.

In [21]:
# tests
assert bit_string_to_message('0100100001101001011110010110000100100001') == 'Hiya!'
assert bit_string_to_message('11101000101011111011011111100111101110111001100111100110100010001001000111100100101110001000000011100110100111011010111111101001100001011001001000100001') == '请给我一杯酒!'

In [22]:
chunk_string('chip is the greatest', 6, fill='x')

['chip i', 's the ', 'greate', 'stxxxx']

In [23]:
# full demo of the above functions
message = 'λ is cool!'
bit_string = message_to_bit_string(message)
print(f'original message: {message}')
print(f'len: {len(message)} unicode characters')
print(f'as bits: {bit_string}')
print(f'len: {len(bit_string)} bits')
print(f'chunked into bytes: {chunk_string(bit_string, 8)}')
print(f'decoded: {bit_string_to_message(bit_string)}')

original message: λ is cool!
len: 10 unicode characters
as bits: 1100111010111011001000000110100101110011001000000110001101101111011011110110110000100001
len: 88 bits
chunked into bytes: ['11001110', '10111011', '00100000', '01101001', '01110011', '00100000', '01100011', '01101111', '01101111', '01101100', '00100001']
decoded: λ is cool!


Now that we did all that work, here are functions provided by `steganos` to do the conversion for you!

In [24]:
steganos.bytes_to_binary('Chip'.encode('utf_8', 'strict'))

'01000011011010000110100101110000'

In [25]:
steganos.binary_to_bytes('01000011011010000110100101110000')

b'Chip'

Let's work with a longer text.

In [26]:
with open('../noisebridge_steg_1/confidential.txt') as f:
    base_text = f.read()
print(base_text)


    CONFIDENTIAL: INTERNAL ONLY
    GOOGLERS ONLY (FULL TIME AND PART TIME EMPLOYEES)

    I’m pleased to share some very, very good news with Googlers worldwide. But first let me say, on behalf of everyone on the management team, that we believe we have the best employees in the world. Period. The brightest, most capable group of this size ever assembled. It’s why I’m excited to come to work every day—and I’m sure you feel the same way. We want to make sure that you feel rewarded for your hard work, and we want to continue to attract the best people to Google.

    So that is why we’ve decided...to give all of you a 10% raise, effective January 1st. This salary increase is global and across the board—everyone gets a raise, no matter their level, to recognize the contribution that each and every one of you makes to Google.

    There’s more. We’ve heard from your feedback on Googlegeist and other surveys that salary is more important to you than any other component of pay (i.e., bonus

What length messages can be encoded?

In [27]:
capacity = steganos.bit_capacity(base_text)
print(f'We can encode {capacity} bits into this file. This is a max of {capacity // 8} unicode characters.')

We can encode 343 bits into this file. This is a max of 42 unicode characters.


In [28]:
secret_message = 'Meet me at midnight at the usual place.'
secret_message_bits = message_to_bit_string(secret_message)
assert len(secret_message_bits) <= capacity
print(f'At {len(secret_message_bits)} bits, this is close to the longest message we can encode.')
encoded_text = steganos.encode(bit_string, base_text)

At 312 bits, this is close to the longest message we can encode.


In [29]:
import os
user_dir = os.path.expanduser('~')
with open(user_dir +'/noisebridge_steg_1/secret_message.txt', mode='w', encoding='utf-8', errors='strict') as f:
    f.write(encoded_text)

In [30]:
print(encoded_text)


    C⁠ONF⁠I⁠D⁠EN⁠TI⁠A⁠L⁠: IN⁠T⁠ERN⁠AL ONLY
    G⁠O⁠OG⁠LER⁠S O⁠N⁠L⁠Y (FU⁠L⁠L TI⁠ME AND PA⁠R⁠T TIM⁠E​ EM⁠P⁠LO⁠Y⁠E⁠E⁠S)

    I⁠’m​ pleased to​ share​ some​ very, very​ good news​ with​ Googlers​ worldwide‏‎. But first let me​ say, on behalf of everyone on​ the​ management​ team, that we believe​ we​ have​ the best​ employees in​ the​ world‏‎. Period‏‎. T⁠he brightest, most capable​ group of this size ever assembled. I⁠t’s​ why I⁠’m excited to​ come to​ work​ every​ day—and I’m​ sure​ you feel the​ same way. We want to make​ sure​ that you feel rewarded​ for​ your hard​ work, and​ we want​ to​ continue​ to​ attract the​ best​ people to​ G⁠oogle‏‎.

    S⁠o that​ is​ why we’ve​ decided...to​ give all of you a​ 10% raise, effective January onest. T⁠his​ salary​ increase is global​ and​ across​ the board—everyone​ gets a​ raise, no​ matter​ their level, to​ recognize​ the contribution that​ each and every one of you makes​ to​ Google‏‎.

    There’s more‏‎. We’ve​ heard​ from​ your feedback 

Exercise: Can you detect whether the text has been tampered with?

Try:
* the default text editor (text edit in macOS)
* Sublime Text
* Atom
* \*nix tools (`less`, `cat`, `diff`, etc.)
* vim/vi
* emacs
* etc.

Exercise: Look at the file `nothing_to_see_here.txt`. Decode the hidden message.

Exercise: I have sent you a document that is fingerprinted with an id number. You want to release the document to the world, but if I know it's you I'll never invite you to our weekly board game night again! Alter the message to evade this attack, so I'll never know it was you who shared my secrets!