# `.format` Nanny with Python's AST

##### In which we detect problematic `.format` calls quickly and painlessly using Python's powerful `ast` module.

The issue: where I work, we are still a Python 2 shop owing to technical debt and legacy dependencies. As you might expect, painful Unicode-related problems surface in our applications from time to time. A fairly typical example runs something like this: a developer writes some code to format the data into a string, using the convenient and powerful `.format` method supported by all string objects:

In [26]:
def pretty_format(some_data):
    return 'Hello, {}!'.format(some_data)

Through the course of our exhaustive testing, we prove that this function is correct over a wide range of inputs:

In [27]:
print pretty_format('world')

Hello, world!


The code ships. Months pass without incident, our `pretty_format` routine prettily formatting every bit of data thrown its way. Lulled into complacency by our apparent success, we move on to other tasks. Then one day everything comes to a screeching halt as, completely unprepared, we receive one of the most dreaded error messages in all of software development:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 8: ordinal not in range(128)

Here's what happened: much of the data that flows through this format template and others like it is simple ASCII-valued information: dates, simple US addresses, phone numbers, and the like. Having used Python 2 for many years, we are accustomed to spelling strings, including our format templates, with plain single quotes:

    'a typical string'

What happens, though, when our user data contains an accented character or other non-ASCII symbol?

In [28]:
full_name = u'Ariadne Éowyn'
print pretty_format(full_name)

UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 8: ordinal not in range(128)

Boom! Python detects the mismatch between the byte-string template and the Unicode data, which contains a non-ASCII character that simply cannot be represented in the target ASCII format. In other words, Python is refusing to guess what we want: do we prefer a byte expansion, and, if so, in what encoding? Should the accented characters simply be dropped? Do we want unexpected symbols to be replaced with an ASCII placeholder character? Python has no way of knowing which of these options is appropriate to the present situation, so it takes the only reasonable course and raises an exception.
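The choices Python declines to make for us can each be spelled out explicitly. The following is a minimal sketch rather than a recommendation: the variable names are illustrative only, and the É is written as the escape `\xc9` so the snippet does not depend on the source file's encoding.

```python
# The name from the example, with the accented capital E written as an escape.
full_name = u'Ariadne \xc9owyn'

# Option 1: a byte expansion in an explicitly chosen encoding.
as_utf8 = full_name.encode('utf-8')

# Option 2: silently drop anything that is not ASCII.
dropped = full_name.encode('ascii', 'ignore')

# Option 3: substitute a '?' placeholder for each unencodable character.
replaced = full_name.encode('ascii', 'replace')

# Strict ASCII encoding, with no error policy chosen, refuses outright.
try:
    full_name.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Each option loses or transforms information in a different way, which is why a silent default would be dangerous.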

Many Unicode issues can be quite challenging to reconcile, but this case is rather simple: if the format string is written as a Unicode literal, rather than as a plain byte string, this entire class of problem is avoided:

In [29]:
def pretty_format(some_data):
    # Unicode object template prepares this routine to handle non-ASCII symbols
    return u'Hello, {}!'.format(some_data)

print pretty_format(full_name)

Hello, Ariadne Éowyn!


But how do we know where the problematic calls to `.format` are lurking in our code base without waiting for the next error to occur? Is there a way we could find these calls proactively, eliminating them from the system before they wreak havoc on our application?


# What's an AST?

# Getting started: trees and parse
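As a starting point, here is a small sketch of what `ast.parse` returns for a single `.format` call. The one-line `source` string is a made-up example, not code from our application; the point is only to show the node types involved.

```python
import ast

# A one-line program containing the kind of call we care about.
source = "'Hello, {}!'.format(some_data)"

# ast.parse turns source text into a tree of node objects without running it.
tree = ast.parse(source)

# The module body holds one expression statement wrapping a Call node.
call = tree.body[0].value

# The thing being called is an attribute lookup: <string literal>.format
print(type(call).__name__)        # Call
print(type(call.func).__name__)   # Attribute
print(call.func.attr)             # format
```

Because the source is only parsed, never executed, the undefined name `some_data` causes no error.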

# Visiting `.format`
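One possible shape for such a checker is sketched below. It is an illustration under stated assumptions, not a finished tool: the names `FormatNanny` and `find_plain_format_calls` are hypothetical, and the literal test relies on the `kind` attribute of `ast.Constant`, which exists in Python 3.8 and later. Under Python 2 itself, string literals appear as `ast.Str` nodes, and the analogous check would probably inspect whether the node's `s` attribute is a `str` or a `unicode` object.

```python
import ast


class FormatNanny(ast.NodeVisitor):
    """Collect line numbers of .format calls made on non-Unicode string literals."""

    def __init__(self):
        self.problems = []

    def visit_Call(self, node):
        func = node.func
        # Match calls of the shape <receiver>.format(...)
        if isinstance(func, ast.Attribute) and func.attr == 'format':
            receiver = func.value
            # Assumption: Python 3.8+, where string literals are ast.Constant
            # nodes and `kind` is 'u' only for u-prefixed literals.
            if (isinstance(receiver, ast.Constant)
                    and isinstance(receiver.value, str)
                    and receiver.kind != 'u'):
                self.problems.append(node.lineno)
        # Keep walking so that nested calls are inspected too.
        self.generic_visit(node)


def find_plain_format_calls(source):
    """Return the line numbers of suspicious .format calls in source text."""
    nanny = FormatNanny()
    nanny.visit(ast.parse(source))
    return nanny.problems


sample = (
    "def risky(x):\n"
    "    return 'Hello, {}!'.format(x)\n"
    "\n"
    "def safe(x):\n"
    "    return u'Hello, {}!'.format(x)\n"
)
print(find_plain_format_calls(sample))  # [2]
```

Note that this only catches templates written as literals at the call site. A template first assigned to a variable and formatted later has a `Name` node as its receiver, so a purely syntactic check like this one cannot tell what kind of string it holds.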

# Both No and Yes

# Disturbing Dynamism

In [30]:
d = 'Ariadne Éowyn'
du = d.decode('utf-8')

print du
print u'Hello, {}!'.format(du)
d, du

Ariadne Éowyn
Hello, Ariadne Éowyn!


('Ariadne \xc3\x89owyn', u'Ariadne \xc9owyn')

In [31]:
u = u'Ariadne Éowyn'

print u
print u'Hello, {}!'.format(u)
print repr(u'Hello, {}!'.format(u))
u

Ariadne Éowyn
Hello, Ariadne Éowyn!
u'Hello, Ariadne \xc9owyn!'


u'Ariadne \xc9owyn'

In [32]:
u.encode('utf-8')

'Ariadne \xc3\x89owyn'

In [33]:
d

'Ariadne \xc3\x89owyn'

In [34]:
print '{}'.format(u)

UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 8: ordinal not in range(128)

In [35]:
print '{}'.format(u.encode('utf-8'))

Ariadne Éowyn
