# Text Formatting, Text Processing, and Regular Expressions

One of Python's greatest strengths is how easily it handles text. In other languages, such as Java, developers often rely on helper libraries like Apache Commons Lang's `StringUtils`, which I have always found kludgy and somewhat cumbersome. Python provides native support for handling text deftly and clearly.

## Text Formatting
Briefly, let's take a look at the most common methods for "formatting" a string (i.e., combining and coercing disparate pieces of data/types into a string).

In [1]:
# How do I combine these elements into a string?
s = 'Shirt'
quantity = 4
cost = 12.99
# expected output: "Shirt x 4 @ $12.99"
s + ' x ' + str(quantity) + ' @ $' + str(cost)

'Shirt x 4 @ $12.99'

In [2]:
# all of those str() calls were rather kludgy, and the result
# is not particularly readable
# we can instead use format strings
res = '{} x {} @ ${}'.format(s, quantity, cost)  # super flexible
res

'Shirt x 4 @ $12.99'

In [3]:
res = f'{s} x {quantity} @ ${cost}'  # not as flexible, but readable
res

'Shirt x 4 @ $12.99'
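
To make the "flexible versus readable" trade-off concrete, here is a small sketch. The `template` name and the second set of values are made up for illustration. A `.format()` template is an ordinary string that can be stored and filled in later, while an f-string is evaluated on the spot but accepts arbitrary expressions and format specs.

```python
# .format(): the template is an ordinary string, so it can be defined once
# and reused later with different values (hypothetical example data)
template = '{item} x {qty} @ ${price}'
first = template.format(item='Shirt', qty=4, price=12.99)
second = template.format(item='Hat', qty=2, price=4.89)
print(first)
print(second)

# f-string: evaluated immediately, but any expression is allowed inside {}
s, quantity, cost = 'Shirt', 4, 12.99
total = f'{s} x {quantity} @ ${cost} = ${quantity * cost:.2f}'
print(total)
```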

Another string form worth learning, especially in the context of regular expressions, is the raw string, written `r''`. In a raw string, backslashes are kept as literal characters rather than being interpreted as escape sequences. Let's take a look.

What do these symbols signify?
* `\n`
* `\r`
* `\t`
* `\u0000`
* `\u10A7`

In [4]:
# print these out (in Python 3, all strings are Unicode!)
print('\n\t\r')


	


In [5]:
print('\u0000\u10A7')

 Ⴇ


In [6]:
# but this doesn't help us figure out what these actually are
# prefix the string with `r` to keep the escape sequences as literal text
r'\u10A7\n\r'

'\\u10A7\\n\\r'
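
One way to find out what an escape sequence stands for is to ask Python directly. This is a minimal sketch using the standard `unicodedata` module; the `describe` helper is a name invented here, not something from the notebook. Control characters have no Unicode name, so a fallback label is supplied.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode category, name) for a single character."""
    # control characters such as \n have no name, so supply a fallback
    name = unicodedata.name(ch, '<unnamed control character>')
    return f'U+{ord(ch):04X}', unicodedata.category(ch), name

for ch in '\n\r\t\u0000\u10A7':
    print(repr(ch), *describe(ch))
```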

In [7]:
# raw strings are most useful with Windows paths: without `r`,
# the \t and \n below are interpreted as a tab and a newline
print('C:\Program Files\tableau\newitem.txt')

C:\Program Files	ableau
ewitem.txt


In [8]:
print(r'C:\Program Files\tableau\newitem.txt')

C:\Program Files\tableau\newitem.txt
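
A raw string is only a different way of typing the same characters. The sketch below (variable names are illustrative) shows that the raw literal equals the version with doubled backslashes, and that a forward-slash path handed to `pathlib` sidesteps the escaping question altogether.

```python
from pathlib import PureWindowsPath

raw = r'C:\Program Files\tableau\newitem.txt'
escaped = 'C:\\Program Files\\tableau\\newitem.txt'

# both literals describe exactly the same sequence of characters
print(raw == escaped)

# PureWindowsPath accepts forward slashes and renders backslashes
p = PureWindowsPath('C:/Program Files/tableau/newitem.txt')
print(p)
print(p.name, p.suffix)
```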


## Intro to Regular Expressions
* Regexr.com
    * Literal text
    * `[A-Za-z]`, `[0-9]`
    * `\w`, `\W`, `\s`, `\S`
    * `*`, `+`, and `?`
    * `.`
    * `IGNORECASE` flag
    * `(x|y)`
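
The building blocks listed above can be tried directly in Python as well as on Regexr. The sample sentence and patterns below are invented for illustration, so treat this as a sketch rather than a reference.

```python
import re

text = 'Call Bob at 555-1234 or ALICE at 555 9876.'

# character ranges: runs of digits
digits = re.findall(r'[0-9]+', text)

# \w+ matches runs of word characters (letters, digits, underscore)
words = re.findall(r'\w+', text)

# \s matches whitespace; ? makes the separator optional
numbers = re.findall(r'[0-9]+[-\s]?[0-9]+', text)

# . matches any single character except a newline
dot = re.findall(r'B.b', text)

# (x|y) alternation combined with the IGNORECASE flag
names = re.findall(r'(bob|alice)', text, re.IGNORECASE)

print(digits, words, numbers, dot, names, sep='\n')
```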

In [9]:
telephone_numbers = [
    '211-3012',
    '(214)-214-2014',
    'cell: 1234567890'
]

In [None]:
import re  # `re` must be imported before it can be used
help(re.match)

In [11]:
# match the first number using a literal pattern
import re
m = re.match('211-3012', telephone_numbers[0])
m

<_sre.SRE_Match object; span=(0, 8), match='211-3012'>

In [None]:
# explore .start/.end/.group
help(m)

In [12]:
m.group(), m.start(), m.end()

('211-3012', 0, 8)

In [13]:
telephone_numbers[0][m.start(): m.end()]

'211-3012'

In [None]:
re.match(r'\d+-\d+', telephone_numbers[0])
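
The pattern above only handles the first entry. As a rough sketch (the pattern and the `phone_re` name are my own, and real-world phone formats vary far more than this), `re.search` together with an optional group can cover all three sample numbers. Note that `re.match` only matches at the start of the string, so it misses the number that follows `cell: `.

```python
import re

telephone_numbers = [
    '211-3012',
    '(214)-214-2014',
    'cell: 1234567890'
]

# optional area code (with or without parentheses), then 3 + 4 digits,
# each part optionally separated by a hyphen
phone_re = re.compile(r'(?:\(?\d{3}\)?-?)?\d{3}-?\d{4}')

found = []
for number in telephone_numbers:
    m = phone_re.search(number)  # search scans the whole string
    found.append(m.group() if m else None)
    print(number, '->', found[-1])

# re.match anchors at position 0, so the 'cell: ' prefix defeats it
print(phone_re.match(telephone_numbers[2]))
```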

In [16]:
# extract MMSE scores from the text
mmse_scores = '''
Your MMSE is 24/30
MMSE had 24 out of 30
mmse is 21/30
mmse: 30/30
'''

In [None]:
help(re.finditer)

In [18]:
for m in re.finditer(r'mmse(\W+\w+){,4}?\W*\d{1,2}(/|\Wout of\W)\d{2}', mmse_scores, re.IGNORECASE):
    print(m)

<_sre.SRE_Match object; span=(6, 19), match='MMSE is 24/30'>
<_sre.SRE_Match object; span=(20, 41), match='MMSE had 24 out of 30'>
<_sre.SRE_Match object; span=(42, 55), match='mmse is 21/30'>
<_sre.SRE_Match object; span=(56, 67), match='mmse: 30/30'>


In [20]:
# capturing
# (?:...) -> non-capturing parentheses
# all other parentheses are capturing; their contents can be accessed
# by number with match.group(1), match.group(2), ...
for m in re.finditer(r'mmse(?:\W+\w+){,4}?\W*(\d{1,2})(?:/|\Wout of\W)(\d{2})', mmse_scores, re.IGNORECASE):
    print(m, m.group(1), m.group(2))

<_sre.SRE_Match object; span=(6, 19), match='MMSE is 24/30'> 24 30
<_sre.SRE_Match object; span=(20, 41), match='MMSE had 24 out of 30'> 24 30
<_sre.SRE_Match object; span=(42, 55), match='mmse is 21/30'> 21 30
<_sre.SRE_Match object; span=(56, 67), match='mmse: 30/30'> 30 30


In [21]:
# named groups
# (?P<name>...) - named capture group, accessed with match.group('name')
# (?P=name) - backreference: matches the same text the named group matched earlier
for m in re.finditer(r'mmse(?:\W+\w+){,4}?\W*(?P<num>\d{1,2})(?:/|\Wout of\W)(?P<denom>\d{2})', mmse_scores, re.IGNORECASE):
    print(m, m.group('num'), m.group('denom'))

<_sre.SRE_Match object; span=(6, 19), match='MMSE is 24/30'> 24 30
<_sre.SRE_Match object; span=(20, 41), match='MMSE had 24 out of 30'> 24 30
<_sre.SRE_Match object; span=(42, 55), match='mmse is 21/30'> 21 30
<_sre.SRE_Match object; span=(56, 67), match='mmse: 30/30'> 30 30
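
Two follow-ups on named groups, as a sketch: `m.groupdict()` collects all named groups into a dictionary, and `(?P=name)` (listed in the comments above but not demonstrated) matches the same text a named group captured earlier. The `perfect` pattern is a made-up example that finds scores whose numerator and denominator are identical.

```python
import re

mmse_scores = '''
Your MMSE is 24/30
MMSE had 24 out of 30
mmse is 21/30
mmse: 30/30
'''

pattern = r'mmse(?:\W+\w+){,4}?\W*(?P<num>\d{1,2})(?:/|\Wout of\W)(?P<denom>\d{2})'

# groupdict() returns {'num': ..., 'denom': ...}; convert the values to integers
scores = [
    {k: int(v) for k, v in m.groupdict().items()}
    for m in re.finditer(pattern, mmse_scores, re.IGNORECASE)
]
print(scores)

# (?P=num) must repeat whatever the group 'num' matched, e.g. 30/30
perfect = re.findall(r'(?P<num>\d{2})/(?P=num)', mmse_scores)
print(perfect)
```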


In [None]:
help(re.compile)
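
As a sketch of why `re.compile` is useful: it builds the pattern object once so it can be reused, and the flags travel with it. The `VERBOSE` flag additionally allows whitespace and comments inside the pattern. The name `mmse_re`, the simplified pattern, and the sample notes are all invented here.

```python
import re

# VERBOSE lets the pattern be laid out and commented piece by piece
mmse_re = re.compile(r'''
    mmse                 # the literal label
    \W*                  # optional punctuation or whitespace
    (?P<num>\d{1,2})     # score
    /
    (?P<denom>\d{2})     # maximum score
''', re.IGNORECASE | re.VERBOSE)

notes = ['MMSE: 28/30', 'no score recorded', 'mmse 19/30 today']
results = [mmse_re.search(note) for note in notes]

for note, m in zip(notes, results):
    print(note, '->', m.group('num', 'denom') if m else None)
```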

## Advanced String Formatting and Math
* Inserting datetime
* Inserting numbers

In [22]:
'{}'.format(2/3)

'0.6666666666666666'

In [23]:
'{:.2}'.format(2/3)

'0.67'

In [24]:
'{:<5}'.format(2)

'2    '

In [25]:
'{:>5}'.format(2)

'    2'

In [26]:
'{:^5}'.format(2)

'  2  '

In [29]:
# How to right-align 2/3?
'{:>10.2}'.format(2/3)

'      0.67'
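
A caveat worth seeing in action: without a type code, `.2` means two significant digits, not two decimal places. It only looked like two decimals above because 2/3 is below 1. Adding the `f` type code fixes the number of decimals instead. The value below is arbitrary.

```python
value = 123.456

# no type code: precision counts significant digits,
# so a larger number switches to scientific notation
print('{:.2}'.format(value))

# 'f' type code: precision counts digits after the decimal point
print('{:.2f}'.format(value))

# the same spec works in f-strings and with the built-in format()
print(f'{value:>10.2f}')
print(format(value, '>10.2f'))
```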

In [30]:
# In a single line, left-align the words and right-align the numbers
prices = [
    ('jacket', 10.99),
    ('hat', 4.89),
    ('shirt', 14),
]

In [31]:
for item, price in prices:
    # '.' is the fill character; '<' left-aligns, '>' right-aligns
    line = '{:.<10}{:.>10.2f}'.format(item, price)
    print(line)
    print(len(line))

jacket.........10.99
20
hat.............4.89
20
shirt..........14.00
20
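
The column widths above are hard-coded to 10. As a sketch, widths can also be passed in through nested replacement fields, so the columns adapt to the data. The `name_width` and `price_width` names are made up for this example.

```python
prices = [
    ('jacket', 10.99),
    ('hat', 4.89),
    ('shirt', 14),
]

# size the first column from the longest item name, plus some padding
name_width = max(len(item) for item, _ in prices) + 4
price_width = 8

# {nw} and {pw} inside the format spec are filled in from keyword arguments
lines = [
    '{:.<{nw}}{:.>{pw}.2f}'.format(item, price, nw=name_width, pw=price_width)
    for item, price in prices
]
print('\n'.join(lines))
```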


In [32]:
# format datetime
import datetime
dt = datetime.datetime.now()

In [33]:
'{:%Y%m%d%H%M%S}'.format(dt)

'20180211223723'

In [34]:
dt.strftime('%Y%m%d%H%M%S')

'20180211223723'

See: https://docs.python.org/3.6/library/datetime.html#strftime-and-strptime-behavior

In [36]:
dt.strftime('%Y%m%d%H%M%S - %A (%a)')

'20180211223723 - Sunday (Sun)'
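
The same format codes work in reverse with `strptime`, which parses a string back into a `datetime`. A small sketch using the timestamp string shown above (the variable names are illustrative):

```python
import datetime

stamp = '20180211223723'
parsed = datetime.datetime.strptime(stamp, '%Y%m%d%H%M%S')
print(parsed)

# round trip: formatting the parsed value reproduces the original string
print(parsed.strftime('%Y%m%d%H%M%S') == stamp)

# a more readable layout; the names produced by %A and %B depend on the locale
print(f'{parsed:%A, %d %B %Y at %H:%M}')
```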