# [Chapter 2. Strings and Text](http://chimera.labs.oreilly.com/books/1230000000393/ch02.html)

Almost every useful program involves some kind of text processing, whether it is parsing data or generating output.  
This chapter focuses on common problems involving text manipulation, such as pulling apart strings, searching, substitution, lexing, and parsing.  
Many of these tasks can be easily solved using built-in methods of strings.  
However, more complicated operations might require the use of regular expressions or the creation of a full-fledged parser.  
All of those topics will be covered.  
In addition, a few tricky aspects of [working with Unicode](https://en.wikipedia.org/wiki/Unicode) are addressed.  
Let's do this!

## [Splitting Strings on Any of Multiple Delimiters](http://chimera.labs.oreilly.com/books/1230000000393/ch02.html#_splitting_strings_on_any_of_multiple_delimiters)

### Problem 

You need to split a string into fields, but the delimiters (and spacing around them) aren't consistent throughout the string.

### Solution

The `split()` method of string objects is really meant for very simple cases, and does not allow for multiple delimiters or account for possible whitespace around the delimiters.  
In cases when you need a bit more flexibility, use the [re.split() function](https://docs.python.org/3/library/re.html#re.split):

In [149]:
import re

line = 'asdf fjdk; afed, fjek,asdf,      foo'
re.split(r'[;,\s]\s*', line)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

Here are some [examples from the documentation](https://docs.python.org/3/library/re.html#re.split):

In [150]:
re.split(r'\W+', 'Words, words, words.')

['Words', 'words', 'words', '']

In [151]:
re.split(r'(\W+)', 'Words, words, words.')

['Words', ', ', 'words', ', ', 'words', '.', '']

In [152]:
re.split(r'\W+', 'Words, words, words.', maxsplit=1)

['Words', 'words, words.']

In [153]:
re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)

['0', '3', '9']

### Discussion

The `re.split()` function is useful because you can specify multiple patterns for the separator.  
For example, as shown in the solution, the separator is either a comma (,), semicolon (;), or whitespace followed by any amount of extra whitespace.  
Whenever that pattern is found, the entire match becomes the delimiter between whatever fields lie on either side of the match.  
The result is a list of fields, just as with `str.split()`.
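One caveat worth knowing: because the solution's pattern treats a single whitespace character as a delimiter in its own right, whitespace *before* a comma or semicolon produces an empty field. The following is a minimal sketch of one possible workaround. The sample string `spaced` and the alternative pattern are illustrative assumptions, not part of the original recipe:

```python
import re

# Hypothetical input with whitespace on both sides of the semicolon.
spaced = 'asdf ; fjdk'

# The solution's pattern sees the leading space and the '; ' as two delimiters.
original = re.split(r'[;,\s]\s*', spaced)

# Assumed alternative: optional whitespace around ; or , OR a run of whitespace.
tolerant = re.split(r'\s*[;,]\s*|\s+', spaced)

print(original)  # ['asdf', '', 'fjdk']
print(tolerant)  # ['asdf', 'fjdk']
```

On the `line` used in the solution, both patterns should yield the same six fields, so the alternative only matters when delimiters may be preceded by spaces.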

When using `re.split()`, you need to be a bit careful should the regular expression pattern involve a capture group enclosed in parentheses.  
If capture groups are used, then the matched text is also included in the result.  
For example, watch what happens here:

In [154]:
line = 'asdf fjdk; afed, fjek,asdf,      foo'
fields = re.split(r'(;|,|\s)\s*', line)
fields

['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']

Getting the split characters might be useful in certain contexts.  
For example, maybe you need the split characters later on to re-form an output string:

In [155]:
fields

['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']

In [156]:
values = fields[::2]
values

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

In [157]:
delimiters = fields[1::2] + ['']
delimiters

[' ', ';', ',', ',', ',', '']

In [158]:
# Now re-form the line using the same delimiters and values:
''.join(v+d for v, d in zip(values, delimiters))

'asdf fjdk;afed,fjek,asdf,foo'

In [159]:
# The original line for reference:
line

'asdf fjdk; afed, fjek,asdf,      foo'
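The re-formed line differs from the original because the capture group holds only the single delimiter character, so the extra whitespace consumed by `\s*` is lost. If an exact round trip matters, one option (a sketch, not the recipe's own approach) is to capture the entire separator, trailing whitespace included:

```python
import re

line = 'asdf fjdk; afed, fjek,asdf,      foo'

# Capture the whole separator: the delimiter plus any whitespace after it.
fields = re.split(r'([;,\s]\s*)', line)
values = fields[::2]
delimiters = fields[1::2] + ['']

# Interleave values and separators again.
rebuilt = ''.join(v + d for v, d in zip(values, delimiters))
print(rebuilt == line)  # True
```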

If you don't want the separator characters in the result, but still need to use parentheses to group parts of the regular expression pattern, make sure you use a [noncapture group](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-what-does-a-question-mark-followed-by-a-colon), specified as `(?:...)`.

In [160]:
re.split(r'(?:,|;|\s)\s*', line)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

## [Matching Text at the Start or End of a String](http://chimera.labs.oreilly.com/books/1230000000393/ch02.html#_matching_text_at_the_start_or_end_of_a_string)

### Problem

You need to check the start or end of a string for specific text patterns, such as filename extensions, URL schemes, and so on.

### Solution

A simple way to check the beginning or end of a string is to use the `str.startswith()` or `str.endswith()` methods.

In [161]:
filename = 'spam.txt'
filename.endswith('.txt')

True

In [162]:
filename.startswith('file:')

False

In [163]:
url = 'http://www.python.org'
url.startswith('http:')

True

If you need to check against multiple choices, simply provide a tuple of possibilities to `startswith()` or `endswith()`:

In [164]:
import os

filenames = os.listdir('.')
filenames

['.ipynb_checkpoints',
 '1_Chapter.ipynb',
 '2_Chapter.ipynb',
 'python_ipsum.txt']

In [165]:
[name for name in filenames if name.endswith(('.ipynb', '.txt'))]

['1_Chapter.ipynb', '2_Chapter.ipynb', 'python_ipsum.txt']

In [166]:
any(name.endswith('.ipynb') for name in filenames)

True

Another example, for your edutainment:

In [167]:
from urllib.request import urlopen

def read_data(name):
    if name.startswith(('http:', 'https:', 'ftp:')):
        return urlopen(name).read()
    else:
        # The with block closes the file automatically.
        with open(name) as f:
            return f.read()

Oddly, this is one part of Python where a tuple is actually required as input.  
If you happen to have the choices specified in a list or set, just make sure that you convert them to a tuple  (using `tuple()`) first.

In [168]:
choices = ['http:', 'ftp:']
url = 'http://www.python.org'

In [169]:
url.startswith(tuple(choices))

True
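To see why the conversion is needed, the short sketch below passes a list directly and catches the resulting `TypeError`. The variable `raised` is only there for illustration:

```python
choices = ['http:', 'ftp:']
url = 'http://www.python.org'

raised = False
try:
    # A list is rejected; only a str or a tuple of str is accepted.
    url.startswith(choices)
except TypeError as exc:
    raised = True
    print('TypeError:', exc)

# Converting to a tuple first works as expected.
print(url.startswith(tuple(choices)))  # True
```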

### Discussion

The `startswith()` and `endswith()` methods provide a very convenient way to perform basic prefix and suffix checking.  
Similar operations can be performed with slices, but are far less elegant.

In [170]:
filename = 'spam.txt'
filename[-4:] == '.txt'

True

In [171]:
url = 'http://www.python.org'
url[:5] == 'http:' or url[:6] == 'https:' or url[:4] == 'ftp:'

True

You might also be inclined to use regular expressions as an alternative.

In [172]:
import re

url = 'http://www.python.org'
re.match('http:|https:|ftp:', url)

<_sre.SRE_Match object; span=(0, 5), match='http:'>

This works, but it is often overkill for simple matching.  
Using the `str.startswith()` or `str.endswith()` methods is simpler and much faster.
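If you want to check the speed difference on your own machine, a rough measurement with `timeit` might look like the sketch below. The iteration count is an arbitrary assumption, and the timings will vary by system and Python version, so no expected numbers are shown:

```python
import re
import timeit

url = 'http://www.python.org'
prefixes = ('http:', 'https:', 'ftp:')
pattern = re.compile(r'http:|https:|ftp:')

# Time the same prefix check done both ways.
t_str = timeit.timeit(lambda: url.startswith(prefixes), number=100_000)
t_re = timeit.timeit(lambda: pattern.match(url) is not None, number=100_000)

print('startswith: {:.4f}s  re.match: {:.4f}s'.format(t_str, t_re))
```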

Last, but not least, the `startswith()` and `endswith()` methods look nice when combined with other operations, such as common data reductions.  
For example, the `any(name.endswith('.ipynb') for name in filenames)` expression shown in the solution checks a directory for the presence of certain kinds of files.

## [Matching Strings Using Shell Wildcard Patterns](http://chimera.labs.oreilly.com/books/1230000000393/ch02.html#_matching_strings_using_shell_wildcard_patterns)

### Problem

You want to match text using the same wildcard patterns that are commonly used when working with Unix shells (e.g., `*.py`, `Dat[0-9]*.csv`, and so on).

### Solution

The `fnmatch` module provides two functions -- `fnmatch()` and `fnmatchcase()` -- that can be used to perform such matching.

In [173]:
# Here's an example from the Python 3 documentation:

import fnmatch
import os

for file in os.listdir('.'):
    if fnmatch.fnmatch(file, '*.txt'):
        print(file)

python_ipsum.txt


In [174]:
from fnmatch import fnmatch, fnmatchcase

In [175]:
fnmatch('foo.txt', '*.txt')

True

In [176]:
fnmatch('foo.txt', '?oo.txt')

True

In [177]:
fnmatch('Dat45.csv', 'Dat[0-9]*')

True

In [178]:
names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']
[name for name in names if fnmatch(name, 'Dat*.csv')]

['Dat1.csv', 'Dat2.csv']

Normally, `fnmatch()` matches patterns using the same case-sensitivity rules as the system's underlying filesystem (which varies based on the operating system).

In [179]:
# On Mac OSX:
fnmatch('foo.txt', '*.TXT')

False

If this distinction matters, use `fnmatchcase()` instead.  
It matches exactly based on the lower- and upper-case conventions that you supply:

In [180]:
fnmatchcase('foo.txt', '*.TXT')

False

An often overlooked feature of these functions is their potential use with data processing of nonfilename strings.  
For example, suppose that you have a list of street addresses like this:

In [181]:
addresses = [
    '5412 N CLARK ST',
    '1060 W ADDISON ST',
    '1039 W GRANVILLE AVE',
    '2122 N CLARK ST',
    '4802 N BROADWAY',
]

You could write list comprehensions like this:

In [182]:
from fnmatch import fnmatchcase

In [183]:
[addr for addr in addresses if fnmatchcase(addr, '* ST')]

['5412 N CLARK ST', '1060 W ADDISON ST', '2122 N CLARK ST']

In [184]:
[addr for addr in addresses if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')]

['5412 N CLARK ST']

### Discussion

The matching performed by `fnmatch` sits somewhere between the functionality of simple string methods and the full power of regular expressions.  
If you're just trying to provide a simple mechanism for allowing wildcards in data processing operations, it's often a reasonable solution.  
If you're actually trying to write code that matches filenames, use the `glob` module instead.  
See ["Getting A Directory Listing"](http://chimera.labs.oreilly.com/books/1230000000393/ch05.html#dirlisting).
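To make the relationship between wildcards and regular expressions concrete, the `fnmatch` module also offers `translate()`, which converts a wildcard into a regular expression string, and `filter()`, which applies a wildcard to a whole list. The sketch below reuses the `names` list from the solution. The exact regular expression text that `translate()` returns differs between Python versions, so it is not printed here:

```python
import fnmatch
import re

names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']

# translate() returns a regular expression string equivalent to the wildcard.
regex = re.compile(fnmatch.translate('Dat*.csv'))
matched = [name for name in names if regex.match(name)]
print(matched)  # ['Dat1.csv', 'Dat2.csv']

# filter() applies a wildcard to a list, using fnmatch()'s case rules.
py_files = fnmatch.filter(names, '*.py')
print(py_files)  # ['foo.py']
```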

## [Matching and Searching for Text Patterns](http://chimera.labs.oreilly.com/books/1230000000393/ch02.html#_matching_and_searching_for_text_patterns)

### Problem

You want to match or search text for a specific pattern.

### Solution

If the text you're trying to match is a simple literal, you can often just use the basic string methods, such as `str.find()`, `str.endswith()`, `str.startswith()`, or similar.

In [185]:
text = 'yeah, but no, but yeah, but no, but yeah'

In [186]:
# Exact match:
text == 'yeah'

False

In [187]:
# Match at the beginning:
text.startswith('yeah')

True

In [188]:
# Match at the end:
text.endswith('no')

False

In [189]:
# Location of the first occurrence:
text.find('no')

10

For more complicated matching, use regular expressions and the `re` module.  
To illustrate the basic mechanics of using regular expressions, suppose you want to match dates specified as digits, such as `"11/27/2012"`:

In [190]:
text1 = '11/27/2012'
text2 = 'Nov 27, 2012'

In [191]:
import re

# Simple matching.
# \d+ means match one or more digits.
if re.match(r'\d+/\d+/\d+', text1):
    print('yes')
else:
    print('no')

yes


In [192]:
if re.match(r'\d+/\d+/\d+', text2):
    print('yes')
else:
    print('no')

no


If you're going to perform a lot of matches using the same pattern, it usually pays to precompile the regular expression pattern into a pattern object first.

In [193]:
datepat = re.compile(r'\d+/\d+/\d+')
if datepat.match(text1):
    print('yes')
else:
    print('no')

yes


In [194]:
if datepat.match(text2):
    print('yes')
else:
    print('no')

no


The `match()` method always tries to find the match at the start of a string.  
If you want to search text for all occurrences of a pattern, use the `findall()` method instead.

In [195]:
text = 'Today is 10/18/2017. PyCon starts 5/9/2018.'
datepat.findall(text)

['10/18/2017', '5/9/2018']
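Between `match()` and `findall()` there is also `search()`, which returns a match object for the first occurrence anywhere in the text. A small sketch using the same pattern and text:

```python
import re

datepat = re.compile(r'\d+/\d+/\d+')
text = 'Today is 10/18/2017. PyCon starts 5/9/2018.'

# match() only looks at the start of the string, so nothing is found.
print(datepat.match(text))  # None

# search() scans forward and stops at the first occurrence.
m = datepat.search(text)
print(m.group())  # 10/18/2017
print(m.span())   # (9, 19)
```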

When defining regular expressions, it is common to introduce capture groups by enclosing parts of the pattern in parentheses.

In [196]:
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
datepat

re.compile(r'(\d+)/(\d+)/(\d+)', re.UNICODE)

Capture groups often simplify subsequent processing of the matched text because the contents of each group can be extracted individually.

In [197]:
m = datepat.match('11/27/2012')
m

<_sre.SRE_Match object; span=(0, 10), match='11/27/2012'>

In [198]:
# Extract the contents of each group:
print(m.group(0))
print(m.group(1))
print(m.group(2))
print(m.group(3))
print(m.groups())

11/27/2012
11
27
2012
('11', '27', '2012')


In [199]:
month, day, year = m.groups()
print(month)
print(day)
print(year)

11
27
2012


In [200]:
# Find all matches.
# Notice the splitting into tuples:
text

'Today is 10/18/2017. PyCon starts 5/9/2018.'

In [201]:
datepat.findall(text)

[('10', '18', '2017'), ('5', '9', '2018')]

In [202]:
for month, day, year in datepat.findall(text):
    print('{}-{}-{}'.format(year, month, day))

2017-10-18
2018-5-9
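Note that the captured groups are plain strings, so the output above is not zero-padded. If proper dates are needed, one possible follow-up (an illustration, not part of the original recipe) is to convert the groups to integers and build `datetime.date` objects:

```python
import re
from datetime import date

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
text = 'Today is 10/18/2017. PyCon starts 5/9/2018.'

# Groups arrive in month, day, year order; date() wants year, month, day.
dates = [date(int(year), int(month), int(day))
         for month, day, year in datepat.findall(text)]

for d in dates:
    print(d.isoformat())
# 2017-10-18
# 2018-05-09
```

As a side effect, `date()` raises `ValueError` for impossible values such as month 13, which the pattern `\d+` alone would accept.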


The `findall()` method searches the text and finds all matches, returning them as a list.  
If you want to find matches iteratively, use the `finditer()` method instead:

In [203]:
for m in datepat.finditer(text):
    print(m.groups())

('10', '18', '2017')
('5', '9', '2018')


### Discussion

A basic tutorial on the theory of regular expressions is beyond the scope of this book.  
However, this recipe illustrates the absolute basics of using the `re` module to match and search for text.  
The essential functionality is first compiling a pattern using `re.compile()` and then using methods such as `match()`, `findall()`, or `finditer()`.  
When specifying patterns, it is relatively common to use raw strings such as `r'(\d+)/(\d+)/(\d+)'`.  
Such strings leave the backslash character uninterpreted, which can be useful in the context of regular expressions.  
Otherwise, you need to use double backslashes such as `'(\\d+)/(\\d+)/(\\d+)'`.  
Be aware that the `match()` method only checks the beginning of a string.  
It’s possible that it will match things you aren’t expecting.

In [204]:
m = datepat.match('11/27/2012abcdef')
m

<_sre.SRE_Match object; span=(0, 10), match='11/27/2012'>

In [205]:
m.group()

'11/27/2012'

If you want an exact match, make sure the pattern includes the end-marker (`$`), as follows:

In [206]:
datepat = re.compile(r'(\d+)/(\d+)/(\d+)$')
datepat.match('11/27/2012abcdef')

In [207]:
datepat.match('11/27/2012')

<_sre.SRE_Match object; span=(0, 10), match='11/27/2012'>

Last, if you're just doing a simple text matching/searching operation, you can often skip the compilation step and use the module-level functions in the `re` module instead.

In [208]:
re.findall(r'(\d+)/(\d+)/(\d+)', text)

[('10', '18', '2017'), ('5', '9', '2018')]

Be aware, though, that if you’re going to perform a lot of matching or searching, it usually pays to compile the pattern first and use it over and over again.  
The module-level functions keep a cache of recently compiled patterns, so there isn’t a huge performance hit, but you’ll save a few lookups and extra processing by using your own compiled pattern.
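As an alternative to adding the `$` end-marker for exact matches, pattern objects also provide a `fullmatch()` method (Python 3.4 and later) that only succeeds when the entire string matches. The sketch below also shows a subtle difference: `$` still matches just before a trailing newline, whereas `fullmatch()` does not:

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

print(datepat.fullmatch('11/27/2012abcdef'))     # None
print(datepat.fullmatch('11/27/2012').groups())  # ('11', '27', '2012')

# '$' also matches just before a trailing newline; fullmatch() does not.
with_dollar = re.compile(r'(\d+)/(\d+)/(\d+)$')
print(with_dollar.match('11/27/2012\n') is not None)  # True
print(datepat.fullmatch('11/27/2012\n'))              # None
```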

## [Searching and Replacing Text](http://chimera.labs.oreilly.com/books/1230000000393/ch02.html#_searching_and_replacing_text)

### Problem

You want to search for and replace a text pattern in a string.

### Solution

For simple literal patterns, use the `str.replace()` method.

In [209]:
text = 'yeah, but no, but yeah, but no, but yeah'

In [210]:
text.replace('yeah', 'yep')

'yep, but no, but yep, but no, but yep'

For more complicated patterns, use the `sub()` functions/methods in the `re` module.  
To illustrate, suppose you want to rewrite dates of the form "11/27/2012" as "2012-11-27".  
Here's a sample of how to do it:

In [211]:
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)

'Today is 2012-11-27. PyCon starts 2013-3-13.'

The first argument to `sub()` is the pattern to match and the second argument is the replacement pattern.  
Backslashed digits such as `\3` refer to capture group numbers in the pattern.
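When a pattern has several groups, numbered references can become hard to read. As a sketch of one alternative, the groups can be named with `(?P<name>...)` and referenced in the replacement as `\g<name>`. The group names `month`, `day`, and `year` are chosen here purely for illustration:

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# Same pattern as before, but each group is given a name.
named_pat = r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)'

# \g<name> in the replacement refers to the named group.
result = re.sub(named_pat, r'\g<year>-\g<month>-\g<day>', text)
print(result)  # Today is 2012-11-27. PyCon starts 2013-3-13.
```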

If you're going to perform repeated substitutions of the same pattern, consider compiling it first for better performance.

In [212]:
import re 

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
datepat.sub(r'\3-\1-\2', text)

'Today is 2012-11-27. PyCon starts 2013-3-13.'

For more complicated substitutions, it's possible to specify a substitution callback function instead.

In [213]:
from calendar import month_abbr

def change_date(m):
    mon_name = month_abbr[int(m.group(1))]
    return '{} {} {}'.format(m.group(2), mon_name, m.group(3))

datepat.sub(change_date, text)

'Today is 27 Nov 2012. PyCon starts 13 Mar 2013.'

As input, the argument to the substitution callback is a match object, as returned by `match()` or `search()`.  
Use the `.group()` method to extract specific parts of the match.  
The function should return the replacement text.  
If you want to know how many substitutions were made in addition to getting the replacement text, use `re.subn()` instead.

In [214]:
newtext, n = datepat.subn(r'\3-\1-\2', text)

In [215]:
newtext

'Today is 2012-11-27. PyCon starts 2013-3-13.'

In [216]:
n

2

### Discussion

There isn't much more to regular expression search and replace than the `sub()` method shown.  
The trickiest part is specifying the regular expression pattern -- something that's best left as an exercise for the reader.

## [Searching and Replacing Case-Insensitive Text](http://chimera.labs.oreilly.com/books/1230000000393/ch02.html#_searching_and_replacing_case_insensitive_text)

### Problem

You need to search for and possibly replace text in a case-insensitive manner.

### Solution

To perform case-insensitive text operations, you need to use the `re` module and supply the `re.IGNORECASE` flag to various operations.

In [217]:
text = 'UPPER PYTHON, lower python, Mixed Python'
re.findall('python', text, flags=re.IGNORECASE)

['PYTHON', 'python', 'Python']

In [218]:
re.sub('python', 'snake', text, flags=re.IGNORECASE)

'UPPER snake, lower snake, Mixed snake'

The last example reveals a limitation: the replacement text doesn't match the case of the matched text.  
If you need to fix this, you might have to use a support function, like so:

In [219]:
def matchcase(word):
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

Here is an example that uses the function above:

In [220]:
re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)

'UPPER SNAKE, lower snake, Mixed Snake'

### Discussion

For simple cases, providing the `re.IGNORECASE` flag is enough to perform case-insensitive matching.  
However, be aware that this may not be enough for certain kinds of Unicode matching involving case folding.  
See ["Working with Unicode Characters in Regular Expressions"](http://chimera.labs.oreilly.com/books/1230000000393/ch02.html#unicodere) for more details.
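A small illustration of the case-folding limitation, using the German word "straße" as an assumed example: the `re.IGNORECASE` flag compares characters one-to-one, so it cannot equate "ß" with "SS", while `str.casefold()` can. This is only a sketch of the problem, not a general solution:

```python
import re

pattern = 'straße'
candidate = 'STRASSE'

# IGNORECASE compares characters one-to-one, so 'ß' never matches 'SS'.
print(re.fullmatch(pattern, candidate, flags=re.IGNORECASE))  # None

# casefold() expands 'ß' to 'ss', so the folded strings compare equal.
print(pattern.casefold() == candidate.casefold())  # True
```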

## [Specifying a Regular Expression for the Shortest Match](http://chimera.labs.oreilly.com/books/1230000000393/ch02.html#_specifying_a_regular_expression_for_the_shortest_match)

### Problem

You're trying to match a text pattern using regular expressions, but it is identifying the longest possible matches of a pattern.  
Instead, you would like to change it to find the shortest possible match. 

### Solution

This problem often arises in patterns that try to match text enclosed inside a pair of starting and ending delimiters (like a quoted string).

In [221]:
str_pat = re.compile(r'\"(.*)\"')
text1 = 'Computer says "no."'
str_pat.findall(text1)

['no.']

In [222]:
text2 = 'Computer says "no." Phone says "yes."'
str_pat.findall(text2)

['no." Phone says "yes.']

In this example, the pattern `r'\"(.*)\"'` is attempting to match text enclosed inside quotes.  
However, the `*` operator in a regular expression is greedy, so matching is based on finding the longest possible match. Thus, in the second example involving `text2`, it incorrectly matches the two quoted strings as a single match.  
To fix this, add the `?` modifier after the `*` operator in the pattern, like this:

In [223]:
str_pat = re.compile(r'\"(.*?)\"')
str_pat.findall(text2)

['no.', 'yes.']

This makes the matching nongreedy, and produces the shortest match instead.

### Discussion

This recipe addresses one of the more common problems encountered when writing regular expressions involving the dot (.) character.  
In a pattern, the dot matches any character except a newline.  
However, if you bracket the dot with starting and ending text (such as a quote), matching will try to find the longest possible match to the pattern.  
This causes multiple occurrences of the starting or ending text to be skipped altogether and included in the results of the longer match.  
Adding the `?` right after operators such as `*` or `+` forces the matching algorithm to look for the shortest possible match instead.
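Two related points can be sketched briefly. First, a negated character class is a common alternative to a nongreedy dot for delimited text. Second, the `?` modifier behaves the same way after `+`. The `tags` string below is a made-up example:

```python
import re

text2 = 'Computer says "no." Phone says "yes."'

# A negated character class can never run past the closing quote.
quoted = re.findall(r'"([^"]*)"', text2)
print(quoted)  # ['no.', 'yes.']

# The ? modifier works the same way after +.
tags = '<b>bold</b>'
greedy = re.findall(r'<(.+)>', tags)
nongreedy = re.findall(r'<(.+?)>', tags)
print(greedy)     # ['b>bold</b']
print(nongreedy)  # ['b', '/b']
```

Unlike the dot, `[^"]` also matches newline characters, so the two approaches are not interchangeable for text that spans multiple lines.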

## [Writing a Regular Expression for Multiline Patterns](http://chimera.labs.oreilly.com/books/1230000000393/ch02.html#_writing_a_regular_expression_for_multiline_patterns)