# Chapter 2 Strings and Text

Almost every useful program involves some kind of text processing, whether it is parsing data or generating output. This chapter focuses on common problems involving text manipulation, such as pulling apart strings, searching, substitution, lexing, and parsing. Many of these tasks can be easily solved using built-in methods of strings. However, more complicated operations might require the use of regular expressions or the creation of a full-fledged parser. All of these topics are covered. In addition, a few tricky aspects of working with Unicode are addressed.

## 2.1 Splitting Strings on Any of Multiple Delimiters
The **split()** method of string objects is meant for very simple cases: it does not allow for multiple delimiters or account for possible whitespace around the delimiters. When you need more flexibility, use the **re.split()** function.


In [10]:
line = 'asdf fjdk; afed, fjek, asdf,       foo'
import re 
result = re.split(r'[;,\s]\s*', line)

print(result)

fields = re.split(r'(;|,|\s)\s*', line)
print(fields)

values = fields[::2]
delimiters = fields[1::2] + ['']
print(values)
print(delimiters)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
[' ', ';', ',', ',', ',', '']
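
The **values** and **delimiters** lists above can be zipped back together to rebuild the line with its original delimiters, and a noncapture group (?:...) gives grouping without keeping the delimiters in the result. A minimal sketch reusing the same sample **line**:

```python
import re

line = 'asdf fjdk; afed, fjek, asdf,       foo'

# Capture group: the delimiters are kept in the result list
fields = re.split(r'(;|,|\s)\s*', line)
values = fields[::2]
delimiters = fields[1::2] + ['']

# Rebuild the line using the captured delimiters (extra whitespace is dropped)
rebuilt = ''.join(v + d for v, d in zip(values, delimiters))
print(rebuilt)

# Noncapture group: parentheses group the alternatives without keeping them
plain = re.split(r'(?:,|;|\s)\s*', line)
print(plain)
```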


## 2.2 Matching Text at the Start or End of a String
A simple way to check the beginning or end of a string is to use the **str.startswith()** or **str.endswith()** methods. 



In [19]:
filename = 'spam.txt'
print("file end with txt : ", filename.endswith('.txt'))

print("file start with :", filename.startswith('file:'))

url = 'http://www.python.org'
print("url start with http : ", url.startswith('http'))

file end with txt :  True
file start with : False
url start with http :  True


In [26]:
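import os
# In practice the names would come from a directory listing:
filesname = os.listdir('.')
# A fixed sample list is used here so the output is reproducible
filesname = ['Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h']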
print("all files with .h/.c : ", [name for name in filesname if name.endswith(('.c','.h'))])
print("have python file: ", any(name.endswith('.py') for name in filesname))

all files with .h/.c :  ['foo.c', 'spam.c', 'spam.h']
have python file:  True


In [28]:
from urllib.request import urlopen

def read_data(name):
    if name.startswith(('http:', 'https:', 'ftp:')):
        return urlopen(name).read()
    else:
        with open(name) as f:
            return f.read()
choices = ['http:', 'ftp:']
url = 'http://www.python.org'
print(url.startswith(tuple(choices)))

True
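
One detail worth illustrating: **startswith()** and **endswith()** accept a single string or a tuple of strings, but not a list or set, which is why the example above converts **choices** with **tuple()**. The sketch below shows the error raised for a list:

```python
choices = ['http:', 'ftp:']
url = 'http://www.python.org'

# Passing a list raises TypeError; only str or a tuple of str is accepted
try:
    url.startswith(choices)
except TypeError as exc:
    error = exc
    print('TypeError:', exc)

# Converting to a tuple first works
ok = url.startswith(tuple(choices))
print(ok)
```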


## 2.3 Matching Strings Using Shell Wildcard Patterns

The **fnmatch** module provides two functions, **fnmatch()** and **fnmatchcase()**, that match strings against the wildcard patterns commonly used in Unix shells.

In [33]:
from fnmatch import fnmatch, fnmatchcase
print(fnmatch('foo.txt', '*.txt'))

print(fnmatch('foo.txt', '?oo.txt'))
print(fnmatch('Dat45.csv', 'Dat[0-9]*'))

names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']

print([name for name in names if fnmatch(name, 'Dat*.csv')])

True
True
True
['Dat1.csv', 'Dat2.csv']


In [35]:
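# fnmatch() follows the case-sensitivity rules of the underlying operating system
print(fnmatch('foo.txt', '*.TXT'))  # False on OS X (Mac), True on Windows

# fnmatchcase() matches exactly as written (case-sensitive) on every platform
print(fnmatchcase('foo.txt', '*.TXT'))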


False
False


In [38]:
addresses = [
        '5412 N CLARK ST',
        '1060 W ADDISON ST',
        '1039 W GRANVILLE AVE',
        '2122 N CLARK ST',
        '4802 N BROADWAY',
]

from fnmatch import fnmatch, fnmatchcase

print([addr for addr in addresses if fnmatchcase(addr, '* ST')])
print([addr for addr in addresses if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')])

['5412 N CLARK ST', '1060 W ADDISON ST', '2122 N CLARK ST']
['5412 N CLARK ST']


## 2.4 Matching and Searching for Text Patterns
If the text you’re trying to match is a simple literal, you can often just use the basic string methods, such as **str.find()**, **str.endswith()**, **str.startswith()**, or similar.


In [41]:
text = 'yeah, but no, but yeah, but no, but yeah'

# Exact match 
print(text == 'yeah')

# Match at start or end
print(text.startswith('yeah'))
print(text.endswith('no'))

# Search for the location of the first occurrence
print(text.find('no'))


False
True
False
10


In [46]:
test1 = '11/27/2012'
test2 = 'Nov 27,  2012'

import re
# Simple matching: \d+ means match one or more digits
if re.match(r'\d+/\d+/\d+', test1):
    print('yes')
else:
    print('no')

if re.match(r'\d+/\d+/\d+', test2):
    print('yes')
else:
    print('no')

datepat = re.compile(r'\d+/\d+/\d+')
if datepat.match(test1):
    print('yes')
else:
    print('no')

if datepat.match(test2):
    print('yes')
else:
    print('no')

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
print(datepat.findall(text))

yes
no
yes
no
['11/27/2012', '3/13/2013']


In [53]:
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
m = datepat.match('11/27/2012')
print(m)

print(m.group(0))
print(m.group(1))
print(m.group(2))
print(m.group(3))
print(m.groups())

month, day, year = m.groups()
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
# Find all matches (notice splitting into tuples)
print(datepat.findall(text))

for month, day, year in datepat.findall(text):
    print('{}-{}-{}'.format(year,month,day))

for m in datepat.finditer(text):
    print(m.groups())

<re.Match object; span=(0, 10), match='11/27/2012'>
11/27/2012
11
27
2012
('11', '27', '2012')
[('11', '27', '2012'), ('3', '13', '2013')]
2012-11-27
2013-3-13
('11', '27', '2012')
('3', '13', '2013')
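
Note that **match()** only checks the beginning of a string, so it can accept text with trailing junk. A short sketch of the difference, using an end marker ($) or **fullmatch()** when the whole string must match:

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

# match() succeeds even though extra text follows the date
m = datepat.match('11/27/2012abcdef')
print(m.group())

# Anchor the pattern with $ to require an exact match
exact = re.compile(r'(\d+)/(\d+)/(\d+)$')
print(exact.match('11/27/2012abcdef'))
print(exact.match('11/27/2012'))

# fullmatch() gives the same effect without changing the pattern
print(datepat.fullmatch('11/27/2012abcdef'))
```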


## 2.5 Searching and Replacing Text

For simple literal patterns, use the **str.replace()** method.

In [58]:
text = 'yeah, but no, but yeah, but no, but yeah'

print(text.replace('yeah', 'yep'))

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
import re
print(re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text))

from calendar import month_abbr

def change_date(m):
    mon_name = month_abbr[int(m.group(1))]
    return '{} {} {}'.format(m.group(2), mon_name, m.group(3))

print(datepat.sub(change_date, text))

yep, but no, but yep, but no, but yep
Today is 2012-11-27. PyCon starts 2013-3-13.
Today is 27 Nov 2012. PyCon starts 13 Mar 2013.
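
To find out how many substitutions were made, **re.subn()** can be used in place of **re.sub()**; it returns the new text together with a count. A minimal sketch on the same sample text:

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

# subn() returns a (new_text, number_of_substitutions) tuple
newtext, n = datepat.subn(r'\3-\1-\2', text)
print(newtext)
print(n)
```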


## 2.6 Searching and Replacing Case-Insensitive Text

To perform case-insensitive text operations, you need to use the re module and supply the re.IGNORECASE flag to various operations.


In [60]:
text = 'UPPER PYTHON, lower python, Mixed Python'

print(re.findall('python', text, flags=re.IGNORECASE))
print(re.sub('python','snake', text, flags=re.IGNORECASE))

['PYTHON', 'python', 'Python']
UPPER snake, lower snake, Mixed snake
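
A limitation of the example above is that the replacement text does not follow the case of the matched text. One way to handle this is a support function such as the following sketch, where **matchcase()** is an illustrative helper name and not part of the re module:

```python
import re

def matchcase(word):
    """Return a callback that adapts `word` to the case of each match."""
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

text = 'UPPER PYTHON, lower python, Mixed Python'
result = re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
print(result)
```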


## 2.7 Specifying a Regular Expression for the Shortest Match

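The need for a shortest match often arises in patterns that try to match text enclosed inside a pair of starting and ending delimiters (e.g., a quoted string). By default the * operator is greedy and matches as much text as possible; adding ? after it makes the match nongreedy.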

In [63]:
str_pat = re.compile(r'\"(.*)\"')
text1 = 'Computer say "no."'
print(str_pat.findall(text1))

text2 = 'Computer say "no." Phone says "yes."'
print(str_pat.findall(text2))

str_pat = re.compile(r'\"(.*?)\"')
print(str_pat.findall(text2))

['no.']
['no." Phone says "yes.']
['no.', 'yes.']


## 2.8 Writing a Regular Expression for Multiline Patterns

The need for a multiline pattern typically arises when a pattern uses the dot (.) to match any character but does not account for the fact that the dot doesn’t match newlines by default.

In [66]:
comment = re.compile(r'/\*(.*?)\*/')
text1 = '/* this is a comment */'
text2 = '''/* this is a
              multiline comment */'''
print(comment.findall(text1))
print(comment.findall(text2))

comment = re.compile(r'/\*((?:.|\n)*?)\*/')
print(comment.findall(text2))

comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)
print(comment.findall(text2))

[' this is a comment ']
[]
[' this is a\n              multiline comment ']
[' this is a\n              multiline comment ']


## 2.9 Normalizing Unicode Text to a Standard Representation 

In Unicode, certain characters can be represented by more than one valid sequence of code points. 


In [70]:
s1 = 'Spicy Jalape\u00f1o'
s2 = 'Spicy Jalapen\u0303o'

print(s1)
print(s2)
print(s1 == s2)
print(len(s1))
print(len(s2))

Spicy Jalapeño
Spicy Jalapeño
False
14
15


In [72]:
import unicodedata
# NFC means that characters should be fully composed
t1 = unicodedata.normalize('NFC', s1)
t2 = unicodedata.normalize('NFC', s2)

print(t1 == t2)
print(ascii(t1))

# NFD means that characters should be fully decomposed with the use of combining characters.
t3 = unicodedata.normalize('NFD', s1)
t4 = unicodedata.normalize('NFD', s2)

print(t3 == t4)
print(ascii(t3))

True
'Spicy Jalape\xf1o'
True
'Spicy Jalapen\u0303o'


In [76]:
s = '\ufb01'

print(s)
print(unicodedata.normalize('NFD',s))

# Notice how the combined letters are broken apart here
print(unicodedata.normalize('NFKD', s))

print(unicodedata.normalize('NFKC', s))

t1 = unicodedata.normalize('NFD', s1)
print(''.join(c for c in t1 if not unicodedata.combining(c)))

ﬁ
ﬁ
fi
fi
Spicy Jalapeno


## 2.10 Working with Unicode Characters in Regular Expressions

By default, the re module is already programmed with rudimentary knowledge of certain Unicode character classes.


In [78]:
import re
num = re.compile(r'\d+')
# ASCII digits
print(num.match('123'))

# Arabic digits
print(num.match('\u0661\u0662\u0663'))

pat = re.compile('stra\u00dfe', re.IGNORECASE)
s = 'straße'
print(pat.match(s))  # Matches
print(pat.match(s.upper())) # Doesn't match

print(s.upper())  # upper() turns 'ß' into 'SS'


<re.Match object; span=(0, 3), match='123'>
<re.Match object; span=(0, 3), match='١٢٣'>
<re.Match object; span=(0, 6), match='straße'>
None
STRASSE
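
For case-insensitive comparison of plain strings (outside of regular expressions), **str.casefold()** applies full Unicode case folding and handles the ß case shown above. A small sketch:

```python
s = 'straße'
t = s.upper()          # 'STRASSE'

# lower() leaves 'ß' alone, so the lowercased strings differ
same_lower = s.lower() == t.lower()
print(same_lower)

# casefold() folds 'ß' to 'ss', so the comparison succeeds
same_folded = s.casefold() == t.casefold()
print(same_folded)
```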


## 2.11 Stripping Unwanted Characters from Strings

The **strip()** method can be used to strip characters from the beginning or end of a string. **lstrip()** and **rstrip()** perform stripping from the left or right side, respectively. By default, these methods strip whitespace, but other characters can be given.


In [87]:
# Whitespace stripping
s = '   hello world   \n'

print(s.strip())
print(s.lstrip())
print(s.rstrip())

# Character stripping
t = '-----hello====='
print(t.lstrip('-'))
print(t.strip('-='))

s = '  hello     world   \n'

print(s.replace(' ', ''))

import re 
print(re.sub(r'\s+', ' ', s))

hello world
hello world   

   hello world
hello=====
hello
helloworld

 hello world 
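
Stripping is often combined with other iterative processing, such as reading lines of data. A generator expression does this without first reading everything into a temporary list. The sketch below uses **io.StringIO** as a stand-in for an open file, and the sample contents are made up:

```python
import io

# Stand-in for a file opened with open(); the contents are illustrative
f = io.StringIO('  alpha  \n\tbeta\n  gamma\t\n')

# Strip each line lazily as it is read
lines = (line.strip() for line in f)
result = list(lines)
print(result)
```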


## 2.12 Sanitizing and Cleaning Up Text

In [89]:

s = 'pýtĥöñ\fis\tawesome\r\n'
print(s)

remap = {
    ord('\t'): ' ',
    ord('\f'): ' ',
    ord('\r'): None    # Deleted
}

a = s.translate(remap)
print(a)

b = unicodedata.normalize('NFD', a)
print(b.encode('ascii', 'ignore').decode('ascii'))

# Fast approach to cleaning up whitespace
def clean_space(s):
    s = s.replace('\r', '')
    s = s.replace('\t', ' ')
    s = s.replace('\f', ' ')
    return s


pýtĥöñis	awesome

pýtĥöñ is awesome

python is awesome
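
Another way to remove the combining characters is a translation table that maps every Unicode combining character to None, built with **dict.fromkeys()**. This sketch assumes the whitespace has already been cleaned up; **cmb_chrs** is an illustrative name:

```python
import sys
import unicodedata

# Map every combining character's code point to None (i.e. delete it)
cmb_chrs = dict.fromkeys(
    c for c in range(sys.maxunicode + 1) if unicodedata.combining(chr(c))
)

a = 'pýtĥöñ is awesome\n'
b = unicodedata.normalize('NFD', a)   # split accented letters into base + mark
cleaned = b.translate(cmb_chrs)
print(cleaned)
```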



## 2.13 Aligning Text Strings

For basic alignment of strings, the **ljust()**, **rjust()**, and **center()** methods of strings can be used. 

In [98]:
text = 'Hello World'
print(text.ljust(20))

print(text.rjust(20))

print(text.center(20))

print(text.rjust(20, '='))

print(text.center(20, '*'))

print(format(text, '>20'))

print(format(text, '<20'))

print(format(text,'^20'))

print(format(text, '=>20s'))

print(format(text, '*^20s'))

Hello World         
         Hello World
    Hello World     
=========Hello World
****Hello World*****
         Hello World
Hello World         
    Hello World     
=========Hello World
****Hello World*****
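
The same format codes work with the **format()** method of strings when several values are formatted at once, and the **format()** function is not limited to strings. A brief sketch:

```python
# Right-align two strings in 10-character fields
row = '{:>10s} {:>10s}'.format('Hello', 'World')
print(repr(row))

# format() works with any value, including numbers
x = 1.2345
right = format(x, '>10')
centered = format(x, '^10.2f')
print(repr(right))
print(repr(centered))
```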


## 2.14 Combining and Concatenating Strings

If the strings you wish to combine are found in a sequence or iterable, the fastest way to combine them is to use the **join()** method. 

In [104]:
parts = ['Is', 'Chicago', 'Not', 'Chicago?']

print(' '.join(parts))
print(','.join(parts))
print(''.join(parts))

a = 'Is Chicago'
b = 'Not Chicago?'
print('{} {}'.format(a, b))

data = ['ACME', 50, 91.1]
print(','.join(str(d) for d in data))

Is Chicago Not Chicago?
Is,Chicago,Not,Chicago?
IsChicagoNotChicago?
Is Chicago Not Chicago?
ACME,50,91.1


## 2.15 Interpolating Variables in Strings

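Python 3.6 and later support direct interpolation through f-strings (formatted string literals). When the template is defined separately from the values, as in the examples here, the same effect can be achieved with the **format()** and **format_map()** methods of strings.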

In [107]:
s = '{name} has {n} messages.'
print(s.format(name='Guido', n = 37))

name = 'Guido'
n = 37

print(s.format_map(vars()))

class Info:
    def __init__(self, name, n):
        self.name = name
        self.n = n
a = Info('Guido', 37)
s.format_map(vars(a))

Guido has 37 messages.
Guido has 37 messages.


'Guido has 37 messages.'
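
One drawback of **format()** and **format_map()** is that they raise KeyError when a value is missing. A common workaround is a dictionary subclass with a **__missing__()** method; the sketch below uses the illustrative name **safesub**. It also shows an f-string (Python 3.6+) for comparison.

```python
class safesub(dict):
    """Dictionary that leaves unknown placeholders untouched."""
    def __missing__(self, key):
        return '{' + key + '}'

s = '{name} has {n} messages.'
name = 'Guido'

# 'n' is not supplied, so its placeholder is left in place
partial = s.format_map(safesub(name=name))
print(partial)

# An f-string evaluates the expressions in braces where the literal appears
n = 37
direct = f'{name} has {n} messages.'
print(direct)
```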

## 2.16 Reformatting Text to a Fixed Number of Columns

Use the textwrap module to reformat text for output.

In [115]:
s = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
     "the eyes, not around the eyes, don't look around the eyes, "
     "look into my eyes, you're under.")

import textwrap

print(textwrap.fill(s, 70))
print(textwrap.fill(s, 40))

print(textwrap.fill(s, 40, initial_indent='   '))

print(textwrap.fill(s, 40, subsequent_indent='   '))

Look into my eyes, look into my eyes, the eyes, the eyes, the eyes,
not around the eyes, don't look around the eyes, look into my eyes,
you're under.
Look into my eyes, look into my eyes,
the eyes, the eyes, the eyes, not around
the eyes, don't look around the eyes,
look into my eyes, you're under.
   Look into my eyes, look into my eyes,
the eyes, the eyes, the eyes, not around
the eyes, don't look around the eyes,
look into my eyes, you're under.
Look into my eyes, look into my eyes,
   the eyes, the eyes, the eyes, not
   around the eyes, don't look around
   the eyes, look into my eyes, you're
   under.


## 2.17 Handling HTML and XML Entities in Text
If you are producing text, replacing special characters such as < or > is relatively easy if you use the **html.escape()** function. 


In [121]:
s = 'Elements are written as "<tag>text</tag>".'

import html
print(s)

print(html.escape(s))
# Disable escaping of quotes
print(html.escape(s, quote=False))

s = 'Spicy &quot;Jalape&#241;o&quot.'
# HTMLParser.unescape() was removed in Python 3.9; html.unescape() replaces it
html.unescape(s)

t = 'The prompt is &gt;&gt;&gt;'
from xml.sax.saxutils import unescape
unescape(t)

Elements are written as "<tag>text</tag>".
Elements are written as &quot;&lt;tag&gt;text&lt;/tag&gt;&quot;.
Elements are written as "&lt;tag&gt;text&lt;/tag&gt;".


'The prompt is >>>'
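
When text must be emitted as ASCII, non-ASCII characters can be embedded as character code entities by passing **errors='xmlcharrefreplace'** when encoding. A short sketch:

```python
import html

s = 'Spicy Jalapeño'

# Non-ASCII characters become numeric character references
encoded = s.encode('ascii', errors='xmlcharrefreplace')
print(encoded)

# html.unescape() reverses entity and character references in text
decoded = html.unescape(encoded.decode('ascii'))
print(decoded)
```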

## 2.18 Tokenizing Text

In [126]:
# Target: the token stream we want for the text 'foo = 23 + 42 * 10'
tokens = [('NAME', 'foo'), ('EQ','='), ('NUM', '23'), ('PLUS','+'),
          ('NUM', '42'), ('TIMES', '*'), ('NUM', '10')]

import re
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)' 
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
EQ = r'(?P<EQ>=)'
WS = r'(?P<WS>\s+)'
master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))

scanner = master_pat.scanner('foo = 42')
scanner.match()



from collections import namedtuple
Token = namedtuple('Token', ['type', 'value'])

def generate_tokens(pat, text):
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())


for tok in generate_tokens(master_pat, 'foo = 42'):
    print(tok)

Token(type='NAME', value='foo')
Token(type='WS', value=' ')
Token(type='EQ', value='=')
Token(type='WS', value=' ')
Token(type='NUM', value='42')
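
The token stream can be filtered before further processing, for example to drop whitespace. The sketch below repeats the definitions above so that it runs on its own; note that **scanner()** is an undocumented method of compiled patterns, used here only because the example above relies on it.

```python
import re
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

# Order matters: re tries the alternatives left to right
master_pat = re.compile('|'.join([
    r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)',
    r'(?P<NUM>\d+)',
    r'(?P<PLUS>\+)',
    r'(?P<TIMES>\*)',
    r'(?P<EQ>=)',
    r'(?P<WS>\s+)',
]))

def generate_tokens(pat, text):
    scanner = pat.scanner(text)
    # Call scanner.match() repeatedly until it returns None
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())

# Keep everything except whitespace tokens
tokens = [tok for tok in generate_tokens(master_pat, 'foo = 23 + 42 * 10')
          if tok.type != 'WS']
for tok in tokens:
    print(tok)
```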


## 2.19 Writing a Simple Recursive Descent Parser



In [3]:
from ply.lex import lex
from ply.yacc import yacc

# Token list
tokens = [ 'NUM', 'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'LPAREN', 'RPAREN' ]

# Ignored characters
t_ignore = ' \t\n'

# Token specifications (as regexs)
t_PLUS   = r'\+'
t_MINUS  = r'-'
t_TIMES  = r'\*'
t_DIVIDE = r'/'
t_LPAREN = r'\('
t_RPAREN = r'\)'

# Token processing functions
def t_NUM(t): 
    r'\d+'
    t.value = int(t.value) 
    return t

# Error handler
def t_error(t):
    print('Bad character: {!r}'.format(t.value[0])) 
    t.lexer.skip(1)

# Build the lexer
lexer = lex()

# Grammar rules and handler functions
def p_expr(p): 
    '''
    expr : expr PLUS term
         | expr MINUS term
    '''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]

def p_expr_term(p): 
    '''
    expr : term 
    '''
    p[0] = p[1]


def p_term(p): 
    '''
    term : term TIMES factor
         | term DIVIDE factor
    '''
    if p[2] == '*':
        p[0] = p[1] * p[3]
    elif p[2] == '/':
        p[0] = p[1] / p[3]

def p_term_factor(p): 
    '''
    term : factor 
    '''
    p[0] = p[1]

def p_factor(p): 
    '''
    factor : NUM
    '''
    p[0] = p[1]

def p_factor_group(p): 
    '''
    factor : LPAREN expr RPAREN 
    '''
    p[0] = p[2]

def p_error(p): 
    print('Syntax error')

parser = yacc()

print(parser.parse('2'))

TypeError: <module '__main__'> is a built-in module
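
The PLY example above generates a parser from grammar rules. For comparison, the same grammar can be evaluated with a hand-written recursive descent parser, in which each grammar rule becomes a method. The following is a minimal sketch using only the standard library; the names **tokenize()** and **ExpressionEvaluator** are illustrative, and error handling is kept deliberately simple.

```python
import re
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

TOKEN_PAT = re.compile(
    r'(?P<NUM>\d+)|(?P<PLUS>\+)|(?P<MINUS>-)|(?P<TIMES>\*)'
    r'|(?P<DIVIDE>/)|(?P<LPAREN>\()|(?P<RPAREN>\))|(?P<WS>\s+)'
)

def tokenize(text):
    """Yield non-whitespace tokens; raise on characters no pattern matches."""
    pos = 0
    while pos < len(text):
        m = TOKEN_PAT.match(text, pos)
        if m is None:
            raise SyntaxError('Bad character: {!r}'.format(text[pos]))
        if m.lastgroup != 'WS':
            yield Token(m.lastgroup, m.group())
        pos = m.end()

class ExpressionEvaluator:
    def parse(self, text):
        self.tokens = tokenize(text)
        self.tok = None          # last token consumed
        self.nexttok = None      # lookahead token
        self._advance()
        result = self.expr()
        if self.nexttok is not None:
            raise SyntaxError('Unexpected trailing input')
        return result

    def _advance(self):
        self.tok, self.nexttok = self.nexttok, next(self.tokens, None)

    def _accept(self, toktype):
        # Consume the lookahead token only if it has the expected type
        if self.nexttok and self.nexttok.type == toktype:
            self._advance()
            return True
        return False

    def expr(self):
        # expr : term { (PLUS | MINUS) term }
        value = self.term()
        while self._accept('PLUS') or self._accept('MINUS'):
            op = self.tok.type
            right = self.term()
            value = value + right if op == 'PLUS' else value - right
        return value

    def term(self):
        # term : factor { (TIMES | DIVIDE) factor }
        value = self.factor()
        while self._accept('TIMES') or self._accept('DIVIDE'):
            op = self.tok.type
            right = self.factor()
            value = value * right if op == 'TIMES' else value / right
        return value

    def factor(self):
        # factor : NUM | LPAREN expr RPAREN
        if self._accept('NUM'):
            return int(self.tok.value)
        if self._accept('LPAREN'):
            value = self.expr()
            if not self._accept('RPAREN'):
                raise SyntaxError('Expected )')
            return value
        raise SyntaxError('Expected NUM or (')

e = ExpressionEvaluator()
print(e.parse('2'))
print(e.parse('2 + 3 * 4'))
print(e.parse('2 + (3 + 4) * 5'))
```

The left-recursive rules of the PLY grammar (expr : expr PLUS term) are written here as loops, since a recursive descent method that called itself first would never terminate.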

## 2.20 Performing Text Operations on Byte Strings

Byte strings already support most of the same built-in operations as text strings.

In [11]:
data = b'Hello World'
print(data[0:5])
print(data.startswith(b'Hello'))
print(data.split())
data.replace(b'Hello', b'Hello Cruel')

## byte arrays
data = bytearray(b'Hello World')
print(data[0:5])
print(data.startswith(b'Hello'))
print(data.split())
data.replace(b'Hello', b'Hello Cruel')

# regular expression
data = b'FOO:BAR,SPAM'
import re
re.split(b'[:,]', data)

s = b'Hello World'
print(s.decode('ascii'))

b'Hello'
True
[b'Hello', b'World']
bytearray(b'Hello')
True
[bytearray(b'Hello'), bytearray(b'World')]
Hello World
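
A few differences from text strings are worth keeping in mind: indexing a byte string yields integers rather than single characters, and formatted output is most easily produced by formatting text and then encoding it. A short sketch:

```python
a = 'Hello World'    # text string
b = b'Hello World'   # byte string

print(a[0])          # a one-character string
print(b[0])          # an integer byte value

# Format as text first, then encode the result to bytes
formatted = '{:10s} {:10d} {:10.2f}'.format('ACME', 100, 490.1).encode('ascii')
print(formatted)
```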
