# Chapter 5. Dual-Mode str and bytes APIs
---

## ToC

1. [str Versus bytes in Regular Expressions](#str-versus-bytes-in-regular-expressions)
2. [str Versus bytes in os Functions](#str-versus-bytes-in-os-functions)

---

## str Versus bytes in Regular Expressions

If you build a regular expression with `bytes`, patterns such as `\d` and `\w` match only
ASCII characters; in contrast, if these patterns are given as `str`, they also match Unicode
digits and letters beyond ASCII.

```python
import re

re_numbers_str = re.compile(r'\d+')
re_words_str = re.compile(r'\w+')
re_numbers_bytes = re.compile(rb'\d+')
re_words_bytes = re.compile(rb'\w+')

# The escaped characters are the Tamil digits for 1729
text_str = ("Ramanujan saw \u0be7\u0bed\u0be8\u0bef"
            " as 1729 = 1³ + 12³ = 9³ + 10³.")

# bytes patterns can only search bytes, so encode the text first
text_bytes = text_str.encode('utf_8')

print(f'Text\n {text_str!r}')
print('Numbers')
print(' str :', re_numbers_str.findall(text_str))
print(' bytes:', re_numbers_bytes.findall(text_bytes))
print('Words')
print(' str :', re_words_str.findall(text_str))
print(' bytes:', re_words_bytes.findall(text_bytes))
```

```
Text
 'Ramanujan saw ௧௭௨௯ as 1729 = 1³ + 12³ = 9³ + 10³.'
Numbers
 str : ['௧௭௨௯', '1729', '1', '12', '9', '10']
 bytes: [b'1729', b'1', b'12', b'9', b'10']
Words
 str : ['Ramanujan', 'saw', '௧௭௨௯', 'as', '1729', '1³', '12³', '9³', '10³']
 bytes: [b'Ramanujan', b'saw', b'as', b'1729', b'1', b'12', b'9', b'10']
```


For str regular expressions, there is a `re.ASCII` flag that makes `\w`, `\W`, `\b`, `\B`, `\d`, `\D`, `\s`, and `\S` perform ASCII-only matching.
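
As a minimal sketch of that flag, the snippet below recompiles the two `str` patterns with `re.ASCII` and runs them on the same sample text. The names `re_numbers_ascii`, `re_words_ascii`, `ascii_numbers`, and `ascii_words` are made up for this illustration.

```python
import re

# Same sample text as in the example above (Tamil digits for 1729).
text_str = ("Ramanujan saw \u0be7\u0bed\u0be8\u0bef"
            " as 1729 = 1³ + 12³ = 9³ + 10³.")

# str patterns restricted to ASCII-only matching
re_numbers_ascii = re.compile(r'\d+', re.ASCII)
re_words_ascii = re.compile(r'\w+', re.ASCII)

ascii_numbers = re_numbers_ascii.findall(text_str)
ascii_words = re_words_ascii.findall(text_str)

print(' numbers:', ascii_numbers)
print(' words  :', ascii_words)
```

With the flag set, the Tamil digits and the superscript `³` no longer match, so the `str` patterns find the same tokens as the `bytes` patterns did, only returned as `str` instead of `bytes`.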

## str Versus bytes in os Functions

The GNU/Linux kernel is not Unicode savvy, so in the real world you may find filenames made of byte sequences that are not valid in any sensible encoding scheme, and cannot be decoded to `str`. File servers with clients using a variety of OSes are particularly prone to this problem.

In order to work around this issue, all `os` module functions that accept filenames or
pathnames take arguments as `str` or `bytes`. If one such function is called with a `str`
argument, the argument will be automatically converted using the codec named by
`sys.getfilesystemencoding()`, and the OS response will be decoded with the same
codec. This is almost always what you want, in keeping with the Unicode sandwich
best practice.
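
The same conversion is available directly through `os.fsdecode()` and `os.fsencode()`, which use the filesystem encoding and its error handler. The sketch below round-trips one filename; `raw_name`, `decoded`, and `encoded` are illustrative names, and what `decoded` looks like depends on the encoding your system reports.

```python
import os
import sys

# UTF-8 bytes for the filename 'digits-of-π.txt'
raw_name = b'digits-of-\xcf\x80.txt'

# bytes -> str, using the filesystem encoding and its error handler
decoded = os.fsdecode(raw_name)

# str -> bytes, the inverse conversion
encoded = os.fsencode(decoded)

print('filesystem encoding:', sys.getfilesystemencoding())
print('decoded            :', repr(decoded))
print('round trip intact  :', encoded == raw_name)
```

On a system whose filesystem encoding is UTF-8, `decoded` is `'digits-of-π.txt'`. The round trip back to `bytes` is what lets `str` arguments stand in for byte-oriented filenames.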

But if you must deal with (and perhaps fix) filenames that cannot be handled in that
way, you can pass `bytes` arguments to the `os` functions to get `bytes` return values.
This feature lets you deal with any file or pathname, no matter how many gremlins
you may find.

```python
import os
os.listdir('./materials')
```

```
['cafe.txt',
 'digits-of-π.txt',
 'dummy',
 'ola.py',
 'ola_broken.py',
 'ola_fixed.py',
 'text-byte.asciidoc.txt',
 'unicode_name_cache.json']
```

Note the entry `digits-of-π.txt`: with a `str` argument, the name comes back decoded, with `π` as a single character.

```python
os.listdir(b'./materials')
```

```
[b'cafe.txt',
 b'digits-of-\xcf\x80.txt',
 b'dummy',
 b'ola.py',
 b'ola_broken.py',
 b'ola_fixed.py',
 b'text-byte.asciidoc.txt',
 b'unicode_name_cache.json']
```

Note the entry `b'digits-of-\xcf\x80.txt'`: with a `bytes` argument, the `π` appears as its raw UTF-8 bytes, `\xcf\x80`.
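
One way to use `bytes` listings when hunting for problem filenames is to test each name against the encoding you expect. The helper below is a hypothetical sketch, not part of the `os` module: `split_by_decodability` and the sample name `b'caf\xe9.txt'` (a Latin-1 encoded name, invalid as UTF-8) are invented for illustration.

```python
import os

def split_by_decodability(names, encoding='utf-8'):
    """Split bytes filenames into (decodable, undecodable) lists."""
    good, bad = [], []
    for name in names:
        try:
            name.decode(encoding)
        except UnicodeDecodeError:
            bad.append(name)    # not valid in the expected encoding
        else:
            good.append(name)
    return good, bad

# Made-up sample: the last name is Latin-1 encoded, so it is not valid UTF-8.
sample = [b'cafe.txt', b'digits-of-\xcf\x80.txt', b'caf\xe9.txt']
good, bad = split_by_decodability(sample)
print('decodable  :', good)
print('undecodable:', bad)

# The same check applied to a real listing of the current directory.
good_here, bad_here = split_by_decodability(os.listdir(b'.'))
print('undecodable names here:', bad_here)
```

Because the names stay as `bytes` throughout, an undecodable entry can still be passed back to functions such as `os.rename()` to give the file a clean name.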