# [Chapter 2. Strings and Text](http://chimera.labs.oreilly.com/books/1230000000393/ch02.html)

Almost every useful program involves some kind of text processing, whether it is parsing data or generating output.  
This chapter focuses on common problems involving text manipulation, such as pulling apart strings, searching, substitution, lexing, and parsing.  
Many of these tasks can be easily solved using built-in methods of strings.  
However, more complicated operations might require the use of regular expressions or the creation of a full-fledged parser.  
All of those topics will be covered.  
In addition, a few tricky aspects of [working with Unicode](https://en.wikipedia.org/wiki/Unicode) are addressed.  
Let's do this!

## [Splitting Strings on Any of Multiple Delimiters](http://chimera.labs.oreilly.com/books/1230000000393/ch02.html#_splitting_strings_on_any_of_multiple_delimiters)

### Problem 

You need to split a string into fields, but the delimiters (and spacing around them) aren't consistent throughout the string.

### Solution

The `split()` method of string objects is really meant for very simple cases, and does not allow for multiple delimiters or account for possible whitespace around the delimiters.  
When you need a bit more flexibility, use the [`re.split()` function](https://docs.python.org/3/library/re.html#re.split):

In [21]:
import re

line = 'asdf fjdk; afed, fjek,asdf,      foo'
re.split(r'[;,\s]\s*', line)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
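
To see why `str.split()` falls short here, the following sketch contrasts it with the regular-expression version on the same input. The variable names `comma_only`, `whitespace_only`, and `regex_fields` are illustrative and not part of the original recipe.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf,      foo'

# str.split() takes a single literal separator, so the other
# delimiters and the surrounding spaces stay inside the fields.
comma_only = line.split(',')
# ['asdf fjdk; afed', ' fjek', 'asdf', '      foo']

# With no argument it splits on runs of whitespace only,
# leaving the commas and semicolons attached to the fields.
whitespace_only = line.split()
# ['asdf', 'fjdk;', 'afed,', 'fjek,asdf,', 'foo']

# re.split() handles all three delimiters plus trailing whitespace.
regex_fields = re.split(r'[;,\s]\s*', line)
# ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
```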

Here are some [examples from the documentation](https://docs.python.org/3/library/re.html#re.split):

In [22]:
re.split(r'\W+', 'Words, words, words.')

['Words', 'words', 'words', '']

In [23]:
re.split(r'(\W+)', 'Words, words, words.')

['Words', ', ', 'words', ', ', 'words', '.', '']

In [24]:
re.split(r'\W+', 'Words, words, words.', maxsplit=1)

['Words', 'words, words.']

In [25]:
re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)

['0', '3', '9']

### Discussion

The `re.split()` function is useful because you can specify multiple patterns for the separator.  
In the solution, for example, the separator is a comma (,), a semicolon (;), or a whitespace character, followed by any amount of extra whitespace.  
Whenever that pattern is found, the entire match becomes the delimiter between whatever fields lie on either side of the match.  
The result is a list of fields, just as with `str.split()`.
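
One consequence of this behavior is that adjacent delimiters, or a delimiter at the very start or end of the string, produce empty strings in the result (like the trailing `''` in the documentation example). Here is a minimal sketch of one way to handle that. The helper `split_fields()` is hypothetical and not part of the original recipe.

```python
import re

def split_fields(text):
    """Split on commas, semicolons, or whitespace and drop empty fields.

    Hypothetical helper for illustration only.
    """
    # strip() removes leading/trailing whitespace that would otherwise
    # produce an empty first or last field.
    parts = re.split(r'[;,\s]\s*', text.strip())
    # Discard the empty strings left by adjacent or trailing delimiters.
    return [p for p in parts if p]

raw = re.split(r'[;,\s]\s*', 'a,,b; c,')
# ['a', '', 'b', 'c', '']

cleaned = split_fields('  a,,b; c,  ')
# ['a', 'b', 'c']
```

Whether empty fields should be dropped depends on the data: in a CSV-like format an empty field may be meaningful, in which case the raw result is the one to keep.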

When using `re.split()`, you need to be a bit careful should the regular expression pattern involve a capture group enclosed in parentheses.  
If capture groups are used, then the text matched by each group is also included in the result.  
For example, watch what happens here:

In [26]:
line = 'asdf fjdk; afed, fjek,asdf,      foo'
fields = re.split(r'(;|,|\s)\s*', line)
fields

['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']

Getting the split characters might be useful in certain contexts.  
For example, maybe you need the split characters later on to re-form an output string (the extra whitespace matched by `\s*` lies outside the capture group, so it is not restored):

In [27]:
fields

['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']

In [28]:
values = fields[::2]
values

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

In [29]:
delimiters = fields[1::2] + ['']
delimiters

[' ', ';', ',', ',', ',', '']

In [30]:
# Now re-form the line using the same delimiters and values:
''.join(v+d for v, d in zip(values, delimiters))

'asdf fjdk;afed,fjek,asdf,foo'

In [31]:
# The original line for reference:
line

'asdf fjdk; afed, fjek,asdf,      foo'
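
The re-formed line differs from the original because the `\s*` part of the pattern sits outside the capture group. If an exact round trip is needed, one option is to capture the whole delimiter, whitespace included. This is a sketch under that assumption, and the names `parts` and `rebuilt` are illustrative:

```python
import re

line = 'asdf fjdk; afed, fjek,asdf,      foo'

# Put the entire separator pattern inside the capture group so that
# the trailing whitespace is kept along with the delimiter character.
parts = re.split(r'([;,\s]\s*)', line)

values = parts[::2]       # fields sit at the even positions
delimiters = parts[1::2]  # full delimiters sit at the odd positions

# Values and delimiters alternate in order, so joining every piece
# reproduces the input exactly.
rebuilt = ''.join(parts)
print(rebuilt == line)
```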

If you don't want the separator characters in the result, but still need to use parentheses to group parts of the regular expression pattern, make sure you use a [non-capturing group](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-what-does-a-question-mark-followed-by-a-colon), specified as `(?:...)`.

In [32]:
re.split(r'(?:,|;|\s)\s*', line)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
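
Grouping becomes necessary when the alternatives are longer than one character, so a character class no longer works. The following sketch uses a made-up input string and the separators "and" and "or", neither of which comes from the original recipe, to show the difference between the two kinds of group:

```python
import re

text = 'apples and pears or plums'

# (?:and|or) groups the alternatives without capturing them,
# so only the fields appear in the result.
non_capturing = re.split(r'\s+(?:and|or)\s+', text)
# ['apples', 'pears', 'plums']

# With a capturing group, the matched word is returned as well.
capturing = re.split(r'\s+(and|or)\s+', text)
# ['apples', 'and', 'pears', 'or', 'plums']
```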

## [Matching Text at the Start or End of a String](http://chimera.labs.oreilly.com/books/1230000000393/ch02.html#_matching_text_at_the_start_or_end_of_a_string)