# Decoding the Frustrations and Vulnerabilities of Modern Text Processing
## Evan Kohilas

Hello!

This is my 2nd PyCon, and my first conference talk!

I'm here because...

### Text processing is awesome!!!

# Until...
![](https://pbs.twimg.com/media/DE35D56UwAALgoJ.jpg:large)

# Join me in part of my adventure
![adventure time](http://www.indiewire.com/wp-content/uploads/2015/03/adventure-time.png?w=780)

But first, let's learn about strings

### Python 2:  
* Unicode:
    ``` python
    print(u"...")
    ```
* Bytes:
    ``` python
    print("...")
    ```
    
### Python 3:  
* Unicode:
    ``` python
    print("...")
    ```
* Bytes:
    ``` python
    print(b"...")
    ```

Easy right?

# No...
![](https://media.tenor.com/images/014788c2a75ff6380650f34ca92e04df/tenor.gif)

Even with all this Unicode support in Python 3, we *still* come across errors.

Python 2 would do implicit conversions. Python 3 has better Unicode support, but it is stricter: every conversion between text and bytes has to be explicit.
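Here's a minimal sketch of that strictness, assuming Python 3. Mixing `str` and `bytes` raises a `TypeError` instead of silently converting, so you have to spell out the direction with `encode` or `decode`. The variable names are just for illustration.

``` python
text = "abc"
data = b"def"

try:
    text + data  # Python 3 refuses to guess an encoding here
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Be explicit about which direction you are converting
as_text = text + data.decode("ascii")
as_bytes = text.encode("ascii") + data

print(mixed_ok, as_text, as_bytes)
```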

# Bytes vs. Unicode?

So, Unicode strings are cool right?!

😎😎😎👍👍👍🔥🔥🔥😍😍😍

こんにちは

Ḩ̱̙͔͍͉̦̖̮̩͍̝̣ͪ͐͋͐͊ę̭̟̫͇͔̖̠̲̣̬̯͇͍͎̖̟̘̩ͣ͗̽ͯ͢ͅl̸̢̢̢̯̣͇͉̜̥͖̺̭̤̪̣̤̬͚͚̠̪ͣ̅͋͌̒͐̾ͬ̋̄̇͌̊̄͂̽̿ͪͯl̷̵̳͚̠͕͓̈̈́̒͗̄̋ͭ̃̔ͬ͑ͩ̂ǫ̹̼̟͙̋̃͗̏̊̃̎͑ͮ͑̄̏̚̕ ̶̷͓͔͍͇̗̞̫̩͎̗̟̰̗̩̳̗̜̇̅̾͐ͮ̈̔͒̓̈́ͣ̽̂̉͊̀W̴̧͉̱̩̤̞̻ͯ́͋̓̏̓̌ͦ̃ͮ̂̀̕͟o̸̸͕̭͚͇̫̜̮̲͙͐͒͂̒͑̂ͧ̕͢͡ͅŗ̳̳͚̱͈̺̩̹̰͓́͒̔̌ͯͪ̀ͥͣ̐͂ͪͤ͗ͅl̵̪̹̫̠̺̱̯̤͓̞̝̗̆̈́̄̏́͟d̢͇̰̩̹̋͛͌̂ͩ̓ͤ̓́̀͘ͅ

But what can we do with byte strings?

They can do some useful things, like

``` python
s = b"abc"
a = s[0]
print(a, chr(a))
```

```
97 a
```


How do I make them?
* Make a bytes literal string
    * `b"this is a bytes string"`
* Use the `bytes()` constructor
* `"string".encode(encoding = ..., errors = ...)`
* And more

Converting back...
* `b"bytes".decode(encoding = ..., errors = ...)`
* `str(object = b"", encoding = ..., errors = ...)`
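A small sketch of the round trip, using the greeting from earlier and assuming UTF-8 throughout. It also shows that indexing a bytes object gives an `int`, while slicing gives `bytes`.

``` python
greeting = "こんにちは"

encoded = greeting.encode("utf-8")        # str -> bytes
from_ints = bytes([97, 98, 99])           # bytes() from an iterable of ints
decoded = encoded.decode("utf-8")         # bytes -> str
via_str = str(encoded, encoding="utf-8")  # same result through str()

print(len(greeting), len(encoded))   # characters vs. bytes
print(from_ints[0], from_ints[:1])   # indexing gives an int, slicing gives bytes
```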

However something really cool...

``` python
b = bytes.fromhex("2e f0f2 F1")
print(b)
print(b.hex())
```

```
b'.\xf0\xf2\xf1'
2ef0f2f1
```


Handy!

So let's pretend you're me...

And you're trying to use these byte things for science!

Trying to smash the stack

If you don't know what stack smashing is (aka buffer overflow exploits)

It's about overflowing a buffer to overwrite memory at a specific location with a malicious value.

So you write this...
``` bash
$ python3 -c 'print("3"*72 + "\xdb\x07\x40\x00")' > smash
```

But it works for all my friends in Python2!
``` bash
$ python2 -c 'print("3"*72 + "\xdb\x07\x40\x00")' > smash
```

# Why?

* In Python 3, `sys.stdout` is a text I/O object
* And `sys.stdout` doesn't accept bytes!
* So when you print a string to stdout in Python 3, it is first encoded with stdout's encoding (usually UTF-8). The character `"\xdb"` (U+00DB) comes out as the two bytes `c3 9b` instead of the single byte `db` that we want
* ![](https://cdn1.iconfinder.com/data/icons/emoticon-set-volume-4/512/emoticon-60-512.png)
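You can see the damage without touching stdout at all. This sketch encodes the same string by hand, assuming stdout would use UTF-8, and compares it with Latin-1, which maps each code point below 256 to one byte.

``` python
payload = "3" * 72 + "\xdb\x07\x40\x00"

as_utf8 = payload.encode("utf-8")      # roughly what print() sends on a UTF-8 stdout
as_latin1 = payload.encode("latin-1")  # one byte per character

print(len(payload), len(as_utf8))  # one extra byte sneaks in
print(as_utf8[-5:])                # the address is no longer what we wrote
print(as_latin1[-4:])
```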

This creates problems whenever you try to write raw bytes through `print`.

For example:

``` python
import codecs
import sys

sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
```
is the commonly suggested fix for:

*"Halp a unicode dragon is breathing fire"*
![](http://www.chrisdillustration.com/wp-content/uploads/2017/03/dragon_fire_vector_LRGsliderthumb.jpg)

This gets the `StreamWriter` class for the UTF-8 codec, detaches the underlying binary buffer from `sys.stdout`, and wraps that buffer so everything printed is encoded as UTF-8. That helps with encoding errors, but it still can't emit arbitrary raw bytes.

What's a better fix?

``` python
import sys; sys.stdout.buffer.write(b"3"*72 + b"\xdb\x07\x40\x00")
```

Writes bytes directly to the stdout buffer!
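If you want the payload to be a bit more readable, here's one way to sketch it. I'm assuming the four trailing bytes are the little-endian address `0x004007db`, and `build_payload` and `write_payload` are names I made up for this example. An `io.BytesIO` stands in for `sys.stdout.buffer` so the result can be inspected.

``` python
import io
import struct


def build_payload(padding=72, address=0x004007DB):
    # "<I" packs the address as a 4-byte little-endian unsigned integer
    return b"3" * padding + struct.pack("<I", address)


def write_payload(stream):
    # stream must be a binary stream, such as sys.stdout.buffer
    stream.write(build_payload())
    stream.flush()


demo = io.BytesIO()  # stand-in for sys.stdout.buffer
write_payload(demo)
print(demo.getvalue()[-4:])
```

In a real script you would call `write_payload(sys.stdout.buffer)` and redirect the output to a file, as in the one-liner above.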

There's another problem though...

Like, what's this encoded in?

By default, the encoding used for text I/O depends on your system, and is taken from `locale.getpreferredencoding()`

And of course Windows likes to do things differently.
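A quick way to check what your own machine does. The printed values will differ between systems, which is the whole problem, so this sketch doesn't show any expected output.

``` python
import locale
import sys

preferred = locale.getpreferredencoding(False)  # typically what open() uses when no encoding is given
stdout_encoding = sys.stdout.encoding           # what print() encodes with

print(preferred, stdout_encoding)

# Passing the encoding explicitly gives the same bytes on every platform
data = "こんにちは".encode("utf-8")
print(data)
```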

But if you're really stuck, you should check out the `errors` argument.

* strict - default
* ignore 
* replace
* xmlcharrefreplace 
* backslashreplace
* namereplace - replaces with `\N{...}` (named escape sequences)
* surrogateescape - replaces undecodable bytes with lone surrogate code points, so they can be restored when encoding again

So it lets you do things like ignoring unicode errors entirely!

Or escaping/replacing them, as you should be...
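Here's a small sketch of the handlers in action, encoding `"café"` to ASCII (which has no `é`) and then round-tripping an invalid UTF-8 byte with `surrogateescape`. The sample strings are my own.

``` python
text = "café"

for handler in ("ignore", "replace", "xmlcharrefreplace",
                "backslashreplace", "namereplace"):
    print(handler, text.encode("ascii", errors=handler))

# surrogateescape smuggles undecodable bytes through str and back again
raw = b"caf\xe9"  # not valid UTF-8
recovered = raw.decode("utf-8", errors="surrogateescape")
round_trip = recovered.encode("utf-8", errors="surrogateescape")
print(round_trip == raw)
```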

https://docs.python.org/3/library/codecs.html#error-handlers

So we went over `u''` and `b''` string literals...

Can we break text processing in other ways?

What about f-strings?

Well first, let's take a step back in time to `str.format`

But to do that: what is a format string exploit?

It's about abusing string formatting, by controlling the format string itself, to leak data or hijack execution

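``` c
#include <stdio.h>

int main(int argc, char *argv[]){
    char s[64];
    scanf("%63s", s);   // read at most 63 characters into the buffer
    printf(s);          // vulnerable: user input is used as the format string
    return 0;
}
```

``` c
#include <stdio.h>

int main(int argc, char *argv[]){
    char s[64];
    scanf("%63s", s);   // read at most 63 characters into the buffer
    printf("%s", s);    // safe: user input is only ever treated as data
    return 0;
}
```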

Does Python have them?

Yes!

http://cse.unsw.edu.au/~evank/format_string/

``` python
{.__init__.__globals__[USERS][1].password}
```

``` python
{.__init__.__globals__[app].secret_key}
```
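To see why those payloads work, here's a minimal self-contained sketch. `SECRET_KEY`, `Event` and `render` are all hypothetical names invented for this example. The only thing that matters is that the *template* comes from the user.

``` python
SECRET_KEY = "hunter2"  # hypothetical secret living in module globals


class Event:
    def __init__(self, name):
        self.name = name


def render(template, event):
    # Vulnerable: the format string itself is attacker-controlled
    return template.format(event)


print(render("Welcome to {0.name}", Event("PyCon")))

# Attribute and index lookups in the field let us walk from the object
# to its __init__ method, then to the module globals, then to the secret
leaked = render("{0.__init__.__globals__[SECRET_KEY]}", Event("PyCon"))
print(leaked)
```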

How do we fix them?

http://lucumr.pocoo.org/2016/12/29/careful-with-str-format/
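One possible mitigation, which is my own suggestion rather than something taken from that article, is to avoid `str.format` for user-supplied templates and use `string.Template` instead. Its `$name` placeholders only substitute whole values and have no syntax for attribute or index access. `render_greeting` is a hypothetical helper.

``` python
from string import Template


def render_greeting(user_template, name):
    # $-placeholders substitute whole values only: no ".attr" or "[key]" lookups
    return Template(user_template).safe_substitute(name=name)


print(render_greeting("Hello $name!", "Evan"))

# The attack text is left in the output as plain characters
attack = render_greeting("$name.__init__.__globals__", "Evan")
print(attack)
```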

So now what about these new fancy f-strings?

Well, they're pretty cool.

``` python
print(f"{input()}")
```

```
hello!
hello!
```


Can we break them?

Well to start with:
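* f-strings don't call `locals()` or `globals()`. Their expressions are compiled as ordinary code in the enclosing scope, so a template can't be supplied at runtime.
* Recursive interpolation is not supported: text produced by an expression is never scanned again for replacement fields.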

``` python
# What about this?
# What happens here?
a = "a"
b = "b"
f"{a}".format(a = b)
```

```
'a'
```

The f-string is evaluated first, so it has already become the plain string `'a'` before `.format` is called, and that string has no replacement fields left to fill.
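The same rule is what protects us from user input. In this sketch (the variable names are made up) the attacker's braces are just characters in a value, and they only become dangerous if we explicitly call `.format` on the result.

``` python
secret = "hunter2"       # hypothetical value an attacker would like to read
user_input = "{secret}"  # attacker-controlled text

rendered = f"{user_input}"  # the braces inside the value are not re-parsed
print(rendered)

# Only an explicit second pass treats the text as a template again
second_pass = rendered.format(secret=secret)
print(second_pass)
```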

``` python
# What will this do?
s = r"esc\ape"
f"{s.replace('\\', '')}"
```

```
SyntaxError: f-string expression part cannot include a backslash (<ipython-input-5-05a68932bd07>, line 3)
```

The expression part of an f-string can't contain backslashes :( (this restriction was lifted in Python 3.12 by PEP 701)
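A workaround that should behave the same on old and new Python versions is to keep the backslash out of the braces by naming it first. A minimal sketch:

``` python
s = r"esc\ape"
backslash = "\\"  # define the backslash outside the f-string

cleaned = f"{s.replace(backslash, '')}"
print(cleaned)
```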

Can they contain colons?

``` python
l = (0, 1, 2)
f"{l[:1][0]:10}"
```

```
'         0'
```

What about a lambda?

``` python
l = [3, 2, 1]
f"lambda : {lambda x: x**2}"
```

```
SyntaxError: unexpected EOF while parsing (<fstring>, line 1)
```
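The lambda's colon gets read as the start of a format spec. Wrapping the lambda in parentheses avoids that, as in this small sketch:

``` python
# The parentheses stop the colon from being taken as a format spec
squared = f"lambda : {(lambda x: x**2)(3)}"
print(squared)
```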

f-strings inside f-strings?

``` python
a = "hello"
print(f"""{a.replace(f"{a.replace('a', 'c')}", "b")}""")
```

```
b
```


Nesting?

``` python
f"result: {value:{width:{0}}.{precision:1}}"
```

```
SyntaxError: f-string: expressions nested too deeply (<ipython-input-10-a9c0ef26de06>, line 1)
```
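One level of nesting inside the format spec is fine, though. Here's a sketch with sample values of my own choosing for `value`, `width` and `precision`:

``` python
value = 12.34567
width = 10
precision = 2

# Replacement fields may appear inside the format spec, one level deep
formatted = f"result: {value:{width}.{precision}f}"
print(repr(formatted))
```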

So it's pretty hard to break things using user input like we did with `.format`, provided you don't `eval` or `exec` anything that user input controls.

But who would do that right?

``` python
os.system(f"echo {message_from_user}")
myquery = sql(f"SELECT {column} FROM {table};")
myresponse = html(f"<html><body>{response.body}</body></html>")
```

How do we fix **that**?

PEP 501 proposes i-strings, which look like f-strings but defer rendering, so the interpolated values can be escaped before the final string is built.
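Until something like that exists, each of the three examples above has a well-known safer form in the standard library. This is only a sketch: the sample inputs and the `users` table are invented, and the subprocess call runs the Python interpreter as a portable stand-in for `echo`. Note that SQL placeholders only bind values, so column and table names would still need checking against an allow-list.

``` python
import html
import sqlite3
import subprocess
import sys

message_from_user = "hello; echo pwned"

# Shell: pass an argument list so no shell ever parses the message
echoed = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", message_from_user],
    capture_output=True, text=True, check=True,
).stdout.strip()

# SQL: let the driver bind values instead of formatting them into the query
name_from_user = "Robert'); DROP TABLE users;--"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", (name_from_user,))
rows = conn.execute("SELECT name FROM users").fetchall()

# HTML: escape the value before interpolating it
body_from_user = "<script>alert(1)</script>"
page = f"<html><body>{html.escape(body_from_user)}</body></html>"

print(echoed, rows, page, sep="\n")
```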

So I hope you now have some extra knowledge

on how to avoid this:

![](https://pbs.twimg.com/media/DE35GiCUMAAXbCu.jpg)

and this:

![](http://diysolarpanelsv.com/images/dragon-fire-clipart-4.jpg)

# Evan Kohilas
## Thanks!
### Questions?
### Tweet at me: @ekohilas
### All source on github.com/ekohilas