# Building a `.format` Nanny with Python's AST

##### In which we detect problematic `.format` calls quickly and painlessly using Python's powerful `ast` module.

The issue: where I work, we are still a Python 2 shop owing to technical debt and legacy dependencies. As you might expect, painful Unicode-related problems surface in our applications from time to time. A fairly typical example runs something like this: a developer writes some code to format a piece of data into a string, using the convenient and powerful `.format` method supported by all string objects:

In [2]:
def pretty_format(some_data):
    return 'Hello, {}!'.format(some_data)

Through the course of our exhaustive testing, we prove that this function is correct over a wide range of inputs:

In [3]:
print pretty_format('world')

Hello, world!


The code ships. Months pass without incident, our `pretty_format` routine prettily formatting every bit of data thrown its way. Lulled into complacency by our apparent success, we move on to other tasks. One day, everything comes to a screeching halt as, completely unprepared, we receive one of the most dreaded error messages in all of software development:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 8: ordinal not in range(128)

Here's what happened: much of the data that flows through this format template and others like it is simple ASCII-valued information: dates, simple US addresses, phone numbers, and the like. Having used Python 2 for many years, we are in the habit of spelling strings, including our format templates, with plain single quotes:

    'a typical string'

What happens, though, when our user data contains an accented character or other non-ASCII symbol?

In [4]:
full_name = u'Ariadne Éowyn'
print pretty_format(full_name)

UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 8: ordinal not in range(128)

Boom! Because the template is a byte string, Python 2 has to convert the Unicode argument to bytes before it can fill in the placeholder, and it attempts the conversion with the default ASCII codec, which cannot represent `É`. In other words, Python is refusing to guess what we want: do we prefer a binary expansion, and, if so, in what encoding? Should the accented characters simply be dropped? Do we want unrepresentable symbols to be translated into ASCII error characters? Python has no way of knowing which of these options is appropriate to the present situation, so it takes the only reasonable course and raises an exception.
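The choices Python declines to make for us can be spelled out explicitly with `encode`. The following is a small sketch, not part of the original notebook; it is written so that it also runs on Python 3, and the variable names are purely illustrative. The strict ASCII encode approximates what the byte-string template ends up attempting.

```python
# Illustrative sketch: the explicit versions of the choices Python will not guess.
full_name = u'Ariadne \xc9owyn'  # u'Ariadne Éowyn'

# Roughly what the byte-string template attempts: a strict ASCII encode.
try:
    full_name.encode('ascii')
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc

# A binary expansion in an encoding we choose ourselves.
as_utf8 = full_name.encode('utf-8')

# Silently drop whatever ASCII cannot represent.
dropped = full_name.encode('ascii', 'ignore')

# Substitute an ASCII error character ('?') instead.
replaced = full_name.encode('ascii', 'replace')

print(repr(as_utf8))
print(repr(dropped))
print(repr(replaced))
```

Each of these is a defensible answer in some situation, which is exactly why the interpreter will not pick one silently.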

Many Unicode issues can be quite challenging to reconcile, but this case is rather simple: if the format string is specified as a Unicode object, rather than as a plain byte string, this entire class of problem is avoided:

In [5]:
def pretty_format(some_data):
    # Unicode object template prepares this routine to handle non-ASCII symbols
    return u'Hello, {}!'.format(some_data)

print pretty_format(full_name)

Hello, Ariadne Éowyn!


But how do we know where the problematic calls to `.format` are lurking in our code base without waiting for the next error to occur? Is there a way we could find these calls proactively, eliminating them from the system before they wreak havoc on our application?


# What's an AST?

# Getting started: trees and parse

In [6]:
import ast

In [7]:
a_variable = 2

def foo():
    another_variable = 2
    
    return a_variable + another_variable

In [8]:
foo()

4

In [9]:
tree = ast.parse("""a_variable = 2

def foo():
    another_variable = 2
    
    return a_variable + another_variable""")

ast.dump(tree)

"Module(body=[Assign(targets=[Name(id='a_variable', ctx=Store())], value=Num(n=2)), FunctionDef(name='foo', args=arguments(args=[], vararg=None, kwarg=None, defaults=[]), body=[Assign(targets=[Name(id='another_variable', ctx=Store())], value=Num(n=2)), Return(value=BinOp(left=Name(id='a_variable', ctx=Load()), op=Add(), right=Name(id='another_variable', ctx=Load())))], decorator_list=[])])"

    Module(body=[
        Assign(targets=[Name(id='a_variable', ctx=Store())], value=Num(n=2)), 
        FunctionDef(name='foo', args=arguments(args=[], vararg=None, kwarg=None, defaults=[]), body=[
            Assign(targets=[Name(id='another_variable', ctx=Store())], value=Num(n=2)), 
            Return(value=
                BinOp(
                    left=Name(id='a_variable', ctx=Load()), 
                    op=Add(), 
                    right=Name(id='another_variable', ctx=Load())))], 
            decorator_list=[])])
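The nesting in the dump maps directly onto attribute access on the node objects. As a brief sketch (the variable names here are illustrative and not from the original notebook), the same tree can be navigated by hand:

```python
import ast

source = """a_variable = 2

def foo():
    another_variable = 2

    return a_variable + another_variable"""

tree = ast.parse(source)

# Module.body holds the two top-level statements, in source order.
assignment, function = tree.body

# The last statement inside foo() is the Return node; its value is the BinOp.
return_stmt = function.body[-1]
addition = return_stmt.value

print(function.name)
print(type(addition.op).__name__)
print(addition.left.id + ' + ' + addition.right.id)
```

Only node types and fields that look the same under Python 2 and Python 3 are touched here; the nodes for literals such as `2` differ between versions.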


# Visiting `.format`

In [10]:
tree = ast.parse("""

'Hello, {}!'.format('world')

'  other string  '.trim()

print len('asdf')

[3, 1, 2].sort()

""")

In [11]:
class FormatVisitor(ast.NodeVisitor):
    def visit_Attribute(self, node):
        print node.attr, node.value
        self.generic_visit(node)

In [12]:
FormatVisitor().visit(tree)

format <_ast.Str object at 0x7f7573fd6790>
trim <_ast.Str object at 0x7f7573328390>
sort <_ast.List object at 0x7f75733285d0>


In [13]:
class FormatVisitor(ast.NodeVisitor):
    def visit_Attribute(self, node):
        if node.attr == 'format':
            print node.attr, node.value
            print node.value.__dict__
        self.generic_visit(node)

In [14]:
FormatVisitor().visit(tree)

format <_ast.Str object at 0x7f7573fd6790>
{'s': 'Hello, {}!', 'lineno': 3, 'col_offset': 0}


In [15]:
class FormatVisitor(ast.NodeVisitor):
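    def visit_Attribute(self, node):
        # only string literals carry an `s` attribute; skip `.format`
        # accessed on names, calls, and other expressions
        if node.attr == 'format' and isinstance(node.value, ast.Str):
            _str = repr(node.value.s)

            # in Python 2, the repr of a unicode object starts with `u`
            if _str[0] != 'u':
                print u'{}: {}'.format(node.lineno, _str)

        self.generic_visit(node)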

In [16]:
FormatVisitor().visit(tree)

3: 'Hello, {}!'
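For comparison, here is a sketch of the same idea for the case where the checker itself runs under Python 3.8 or later. That is an assumption, not what the notebook does: on those versions string literals are `ast.Constant` nodes whose `kind` attribute is `'u'` for `u''` literals, and Python 2 source containing `print` statements will not parse. The class name `FormatCollector` is hypothetical. It collects findings in a list rather than printing them.

```python
import ast


class FormatCollector(ast.NodeVisitor):
    """Collect `.format` accesses on string literals lacking a `u` prefix.

    Assumes Python 3.8+, where literals are ast.Constant nodes.
    """

    def __init__(self):
        self.findings = []

    def visit_Attribute(self, node):
        target = node.value
        if (node.attr == 'format'
                and isinstance(target, ast.Constant)
                and isinstance(target.value, str)
                and target.kind != 'u'):
            self.findings.append((node.lineno, target.value))
        # keep descending so nested expressions are checked too
        self.generic_visit(node)


source = """
'Hello, {}!'.format('world')
u'Hello, {}!'.format('world')
template.format('world')
"""

collector = FormatCollector()
collector.visit(ast.parse(source))
print(collector.findings)
```

Returning data instead of printing makes the visitor easier to test and to reuse from a larger script.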


# Both No and Yes

# Disturbing Dynamism

In [17]:
d = 'Ariadne Éowyn'
du = d.decode('utf-8')

print du
print u'Hello, {}!'.format(du)
d, du

Ariadne Éowyn
Hello, Ariadne Éowyn!


('Ariadne \xc3\x89owyn', u'Ariadne \xc9owyn')

In [18]:
u = u'Ariadne Éowyn'

print u
print u'Hello, {}!'.format(u)
print repr(u'Hello, {}!'.format(u))
u

Ariadne Éowyn
Hello, Ariadne Éowyn!
u'Hello, Ariadne \xc9owyn!'


u'Ariadne \xc9owyn'

In [19]:
u.encode('utf-8')

'Ariadne \xc3\x89owyn'

In [20]:
d

'Ariadne \xc3\x89owyn'

In [21]:
print '{}'.format(u)

UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 8: ordinal not in range(128)

In [22]:
print '{}'.format(u.encode('utf-8'))

Ariadne Éowyn


In [24]:
tree = ast.parse("""

from __future__ import unicode_literals

""")

ast.dump(tree)

"Module(body=[ImportFrom(module='__future__', names=[alias(name='unicode_literals', asname=None)], level=0)])"

In [45]:
class FutureVisitor(ast.NodeVisitor):
    def __init__(self):
        super(FutureVisitor, self).__init__()
        self.has_future_import = False
        
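    def visit_ImportFrom(self, node):
        if node.module == '__future__':
            names = [name.name for name in node.names]
            # never reset the flag: a later `__future__` import without
            # `unicode_literals` must not undo an earlier one that had it
            self.has_future_import = (self.has_future_import
                                      or 'unicode_literals' in names)
            print node.names, self.has_future_import
            print node.__dict__
        self.generic_visit(node)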
        
def has_future_import(src):
    tree = ast.parse(src)
    visitor = FutureVisitor()
    visitor.visit(tree)
    return visitor.has_future_import

In [46]:
FutureVisitor().visit(tree)

[<_ast.alias object at 0x7f7573328390>] True
{'lineno': 3, 'col_offset': 0, 'names': [<_ast.alias object at 0x7f7573328390>], 'module': '__future__', 'level': 0}


In [47]:
has_future_import("""

from __future__ import unicode_literals

""")

[<_ast.alias object at 0x7f7573328f50>] True
{'lineno': 3, 'col_offset': 0, 'names': [<_ast.alias object at 0x7f7573328f50>], 'module': '__future__', 'level': 0}


True

In [48]:
has_future_import("""

from __future__ import print_function, unicode_literals

""")

[<_ast.alias object at 0x7f7573328c10>, <_ast.alias object at 0x7f7573328450>] True
{'lineno': 3, 'col_offset': 0, 'names': [<_ast.alias object at 0x7f7573328c10>, <_ast.alias object at 0x7f7573328450>], 'module': '__future__', 'level': 0}


True

In [49]:
has_future_import("""

import sys

""")

False
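The same check can also be written without a visitor class. The sketch below uses `ast.walk` and returns as soon as any `__future__` import names `unicode_literals`, so a module with several `__future__` lines gives the same answer regardless of their order. The function name `uses_unicode_literals` is hypothetical, and the code is written to run on Python 3 as well.

```python
import ast


def uses_unicode_literals(src):
    """Return True if `src` contains `from __future__ import unicode_literals`."""
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, ast.ImportFrom) and node.module == '__future__':
            if any(alias.name == 'unicode_literals' for alias in node.names):
                return True
    return False


two_imports = (
    "from __future__ import unicode_literals\n"
    "from __future__ import print_function\n"
)

print(uses_unicode_literals(two_imports))
print(uses_unicode_literals("import sys\n"))
```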

### U Format, I Don't

Proof of concept demonstrating how to detect `'…'.format(…)` calls in Python source.

First, we need some stuff:

Where are we again?

In [34]:
pwd

u'/home/drocco/source/brightlink/brighttrac'

From here, we need a way to find all of the Python files that we'd like to check.

In [35]:
import fnmatch
import os

def all_pys():
    """Find all of the Python source files anywhere in the tree rooted at `.`"""

    for path, dirs, files in os.walk('.'):
        # keep only the `*.py` entries of this directory
        for name in fnmatch.filter(files, '*.py'):
            yield os.path.join(path, name)

Then we need to check each one:

In [44]:
def format_nanny(path):
    """Detect possibly bad `.format` calls in Python source.
    
    Given a Python source filename, find all of the calls to `.format` invoked
    on plain str objects (`''`) rather than Unicode objects (`u''`). For each,
    print out the path, line number, and string in question.
    
    """
    
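    # read the source, making sure the file handle is closed afterwards
    with open(path) as f:
        src = f.read()
    tree = ast.parse(src)

    # this is quick and dirty, the correct approach is to use a
    # Visitor subclass
    attrs = [node for node in ast.walk(tree)
             if isinstance(node, ast.Attribute) and node.attr == 'format']

    for attr in attrs:
        # only string literals have an `s` attribute; skip `.format` on
        # variables, calls, and other expressions
        if not isinstance(attr.value, ast.Str):
            continue

        _str = repr(attr.value.s)

        if _str[0] != 'u':
            print '{} {}: {}'.format(path, attr.lineno, _str)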

#### Engage!

In [45]:
for py in all_pys():
    format_nanny(py)

./brighttrac2/model.py 62: '{}.model.model'
./brighttrac2/migrate/__init__.py 121: ' {}\n'
./brighttrac2/migrate/migrators/model.py 42: '{}_{}'
./brighttrac2/store/brighttrac_satchmo/utils.py 74: 'Invalid item, product, or slug: {0}'
./brighttrac2/store/brighttrac_satchmo/listeners.py 58: 'dummy-{0}'
./brighttrac2/store/brighttrac_satchmo/bt_job_queuer.py 67: 'Queuing job {}.{} to {} failed'
./brighttrac2/store/brighttrac_satchmo/views.py 62: 'Could not add item "{0}" to cart'
./brighttrac2/store/brighttrac_satchmo/views.py 136: 'No item found with ID {}.'
./brighttrac2/store/brighttrac_satchmo/views.py 558: 'Payment with transaction ID {} settled on {}<br/>'
./brighttrac2/store/brighttrac_satchmo/views.py 560: 'Payment with transaction ID {} not found, recording null settlement date'
./brighttrac2/store/brighttrac_satchmo/api/model.py 48: ' ({})'
./brighttrac2/store/brighttrac_satchmo/api/order.py 155: 'Transfered from User ({});'
./brighttrac2/store/brighttrac_satchmo/api/order.py 12

This won't detect dynamic cases, where `.format` is applied to a variable rather than to a string literal, as here:

In [46]:
# %load -r 834-851 brighttrac2/base_model/renewal_audit.py
def _adjustment_requested(adjustment):
    if adjustment.auditor:
        template = 'Staff requested candidate adjustment for {what} for CE item ' \
                   '{title}'
    else:
        template = 'System-initiated candidate adjustment for {what} for CE item ' \
                   '{title}'

    items = [('credit_error', 'number of credits'),
             ('document_error', 'item documentation')]

    what = [description for attr, description in items
            if getattr(adjustment, attr)]

    note = template.format(what=', '.join(what),
                           title=adjustment.ce_item.title)

    adjustment.audit.note(note, staff=adjustment.auditor)
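Static analysis cannot tell what `template` holds at run time, but it can at least list such calls for manual review. The sketch below is one possible approach; the function name `dynamic_format_calls` and the sample source are illustrative only. It reports every `.format` accessed on a bare variable name and is written to run on Python 3 as well.

```python
import ast


def dynamic_format_calls(src):
    """Return sorted (line, name) pairs for `.format` accessed on a variable."""
    return sorted(
        (node.lineno, node.value.id)
        for node in ast.walk(ast.parse(src))
        if isinstance(node, ast.Attribute)
        and node.attr == 'format'
        and isinstance(node.value, ast.Name)
    )


sample = """
template = 'System-initiated adjustment for {what}'
note = template.format(what='number of credits')
label = u'{}'.format(note)
"""

print(dynamic_format_calls(sample))
```

This only catches receivers that are plain names; `.format` on attribute lookups or function results would need additional cases.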