# Text Formatting, Text Processing, and Regular Expressions

One of Python's greatest strengths is how easily it handles text. Other languages often lean on external helpers (Java projects, for example, commonly pull in `StringUtils` from Apache Commons Lang), which I have always found kludgy and somewhat cumbersome. Python provides native support for handling text deftly and clearly.

## Text Formatting
Briefly, let's take a look at the most common methods for "formatting" a string (i.e., combining and coercing disparate pieces of data/types into a string).

In [None]:
# How do I combine these elements into a string?
s = 'Shirt'
quantity = 4
cost = 12.99
# expected output: "Shirt x 4 @ $12.99"
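
One way to build this string is plain concatenation, converting each number with `str()` first. This is only a sketch of what the empty cell might contain, not necessarily the intended answer:

```python
s = 'Shirt'
quantity = 4
cost = 12.99

# str() turns each number into text so that + can join the pieces
res = s + ' x ' + str(quantity) + ' @ $' + str(cost)
print(res)  # Shirt x 4 @ $12.99
```

Leaving out either `str()` call raises a `TypeError`, because `+` will not join a string and a number.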


In [None]:
# all of those str() calls were rather kludgy, and the result
# is not particularly readable
# we can instead use format strings
res = '{} x {} @ ${}'.format(s, quantity, cost)  # super flexible
res

In [None]:
res = f'{s} x {quantity} @ ${cost}'  # not as flexible, but readable
res
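
One possible reading of "flexible" here: a `.format` template is an ordinary string, so it can be stored once and filled in later with different values, whereas an f-string is evaluated where it is written. The names `template` and `rows`, and the field names inside the template, are made up for this sketch:

```python
# A reusable template with named fields (hypothetical names)
template = '{name} x {qty} @ ${price:.2f}'

rows = [('Shirt', 4, 12.99), ('Hat', 1, 4.5)]

# The same template is filled in once per row
lines = [template.format(name=n, qty=q, price=p) for n, q, p in rows]
for line in lines:
    print(line)
```

The `:.2f` part pads `4.5` out to `4.50`, so every price shows two decimals.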

Another string prefix worth learning about, especially in the context of regular expressions, is the raw string `r''`, which keeps backslashes as literal characters rather than interpreting them as escape sequences. Let's take a look.

What do these symbols signify?
* `\n`
* `\r`
* `\t`
* `\u0000`
* `\u10A7`

In [None]:
# print these out (all Python 3 strings are Unicode!)
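
A small sketch of one way to inspect these characters: `repr()` shows the escaped form, `ord()` gives the code point, and `unicodedata.name()` gives the official name when there is one. The fallback text passed to `unicodedata.name()` is my own wording:

```python
import unicodedata

samples = ['\n', '\r', '\t', '\u0000', '\u10A7']

rows = []
for ch in samples:
    # control characters have no Unicode name, so supply a fallback
    name = unicodedata.name(ch, 'no name (control character)')
    rows.append((repr(ch), hex(ord(ch)), name))

for row in rows:
    print(row)
```

The first four are control characters (newline, carriage return, tab, NUL); the last is a letter from the Georgian block.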


In [None]:
# but printing doesn't help us figure out what these actually are
# prefix the string with `r` to get a raw string literal
r'\u10A7\n\r'

In [None]:
# raw strings are most useful with Windows paths:
# without the r prefix, \t and \n below become a tab and a newline
print('C:\Program Files\tableau\newitem.txt')

In [None]:
print(r'C:\Program Files\tableau\newitem.txt')
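
To see exactly what changed, compare the lengths of the two versions. The path below is a shorter, hypothetical one chosen so that every backslash starts a valid escape:

```python
normal = 'C:\temp\new.txt'   # \t becomes a tab, \n becomes a newline
raw = r'C:\temp\new.txt'     # every backslash is kept as typed

print(len(normal), len(raw))   # 13 15
print(repr(normal))            # shows the tab and newline as escapes

# Doubling each backslash is the non-raw way to write the same string
print(raw == 'C:\\temp\\new.txt')  # True
```

One caveat: a raw string literal cannot end in a single backslash, so `r'C:\temp\'` is a syntax error.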

## Intro to Regular Expressions
* regexr.com (an interactive site for testing patterns)
    * Literal text
    * `[A-Za-z]`, `[0-9]`
    * `\w`, `\W`, `\s`, `\S`
    * `*`, `+`, & `?`
    * `.`
    * IGNORECASE flag
    * `(x|y)`
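
The pieces listed above can also be tried directly in Python. The following is a quick sketch using `re.findall`; the sample sentence and the variable names are invented for illustration:

```python
import re

text = 'Call Bob or ALICE at 555-1234 today'

letters = re.findall(r'[A-Za-z]+', text)   # runs of ASCII letters
digits = re.findall(r'[0-9]+', text)       # runs of digits
words = re.findall(r'\w+', text)           # letters, digits, underscore
spaces = re.findall(r'\s', text)           # single whitespace characters

star = re.findall(r'go*gle', 'ggle gogle google')   # * = zero or more o's
plus = re.findall(r'go+gle', 'ggle gogle google')   # + = one or more o's
optional = re.findall(r'colou?r', 'color colour')   # ? = u is optional
any_char = re.findall(r'b.b', 'bob bab b-b')        # . = any one character

nocase = re.findall(r'alice', text, re.IGNORECASE)           # case-insensitive
either = re.findall(r'(Bob|Alice)', text, re.IGNORECASE)     # alternation

print(letters, digits, words, len(spaces))
print(star, plus, optional, any_char)
print(nocase, either)
```

Note that `[A-z]` is not a safe shortcut for letters: in ASCII, that range also covers the punctuation characters between `Z` and `a`.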

In [None]:
import re  # or check out the third-party `regex` package (fuzzy matching and more)

In [None]:
telephone_numbers = [
    '211-3012',
    '(214)-214-2014',
    'cell: 1234567890'
]

In [None]:
help(re.match)

In [None]:
# capture the first one
m = re.match('211-3012', telephone_numbers[0])
m

In [None]:
# explore .start/.end/.group
help(m)

In [None]:
m.group(), m.start(), m.end()

In [None]:
# use start and end to get the string itself
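
A minimal sketch of this step: `m.start()` and `m.end()` are plain indexes into the searched string, so slicing with them gives back the matched text. The variable `number` stands in for `telephone_numbers[0]`:

```python
import re

number = '211-3012'
m = re.match('211-3012', number)

# slice the original string with the match boundaries
matched = number[m.start():m.end()]
print(matched)               # 211-3012
print(matched == m.group())  # True
```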

In [None]:
# how to generalize? use regexr
pattern = r'\d+-\d+'
re.match(pattern, telephone_numbers[0])
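
The pattern above only fits the first number. One possible generalization, offered as a sketch rather than a robust phone-number parser, makes the area code and the separators optional. It also switches to `re.search`, because `re.match` only tries the very start of the string and so misses the number after `cell: `:

```python
import re

telephone_numbers = [
    '211-3012',
    '(214)-214-2014',
    'cell: 1234567890',
]

# optional area code (with or without parentheses), optional hyphens
pattern = r'(?:\(?\d{3}\)?-?)?\d{3}-?\d{4}'

found = [re.search(pattern, t).group() for t in telephone_numbers]
print(found)

# re.match anchors at position 0, so the 'cell: ' prefix defeats it
print(re.match(pattern, telephone_numbers[2]))  # None
```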

In [None]:
# extract MMSE scores from the text
mmse_scores = '''
Your MMSE is 24/30
MMSE had 24 out of 30
Had an mmse of 21/30
mmse: 30/30
'''

In [None]:
help(re.finditer)

In [None]:
pattern = r''  # exercise: fill in a pattern that matches each score
for m in re.finditer(pattern, mmse_scores, re.IGNORECASE):
    print(m)
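
One pattern that fits these four lines is sketched below. It assumes a score always looks like `N/N` or `N out of N` and follows the word "mmse" on the same line; real clinical text would need more cases:

```python
import re

mmse_scores = '''
Your MMSE is 24/30
MMSE had 24 out of 30
Had an mmse of 21/30
mmse: 30/30
'''

# 'mmse', then any non-digit text on the same line, then the score
pattern = r'mmse[^\d\n]*\d+\s*(?:/|out of)\s*\d+'

matches = [m.group() for m in re.finditer(pattern, mmse_scores, re.IGNORECASE)]
for text in matches:
    print(text)
```

`re.finditer` yields one match object per hit, so `.group()`, `.start()`, and `.end()` are all available inside the loop.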

In [None]:
# capturing
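
Parentheses in a pattern capture the text they match, so the numbers can be pulled out separately from the whole match. A short sketch, using one line from the sample text:

```python
import re

line = 'Your MMSE is 24/30'
m = re.search(r'mmse\D*(\d+)/(\d+)', line, re.IGNORECASE)

print(m.group())    # whole match: MMSE is 24/30
print(m.groups())   # captured pieces: ('24', '30')

# groups are numbered from 1, left to right
score, total = int(m.group(1)), int(m.group(2))
print(score, total)  # 24 30
```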

In [None]:
# named groups
# (?P<name>...) - capture group that can be looked up by name
# (?P=name) - backreference: matches the same text the named group matched earlier
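
A sketch of both constructs. The group names `score`, `total`, and `n` are arbitrary choices for this example:

```python
import re

# (?P<name>...) lets the pieces be fetched by name instead of by number
m = re.search(r'(?P<score>\d+)\s*(?:/|out of)\s*(?P<total>\d+)',
              'MMSE had 24 out of 30')
print(m.group('score'), m.group('total'))  # 24 30
info = m.groupdict()
print(info)

# (?P=n) must repeat exactly what group n matched: finds perfect scores only
perfect_re = r'\b(?P<n>\d+)/(?P=n)\b'
perfect = re.search(perfect_re, 'mmse: 30/30')
not_perfect = re.search(perfect_re, 'Your MMSE is 24/30')
print(perfect.group(), not_perfect)  # 30/30 None
```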

In [None]:
help(re.compile)
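
`re.compile` turns a pattern into a reusable object whose methods (`search`, `match`, `finditer`, and so on) mirror the module-level functions. A brief sketch, with `SCORE_RE` and the sample `lines` made up for illustration:

```python
import re

# compile once, reuse many times; flags are given at compile time
SCORE_RE = re.compile(r'(?P<score>\d+)\s*(?:/|out of)\s*(?P<total>\d+)',
                      re.IGNORECASE)

lines = ['Your MMSE is 24/30', 'MMSE had 24 out of 30', 'no score here']

scores = []
for line in lines:
    m = SCORE_RE.search(line)
    # search returns None when nothing matches, so check before using it
    scores.append(int(m.group('score')) if m else None)

print(scores)  # [24, 24, None]
```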

## Advanced String Formatting and Math
* Inserting datetime
* Inserting numbers

In [None]:
'{}'.format(2/3)

In [None]:
'{:.2}'.format(2/3)

In [None]:
'{:<5}'.format(2)

In [None]:
'{:>5}'.format(2)

In [None]:
'{:^5}'.format(2)

In [None]:
# How to right-align 2/3?
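
Alignment, width, and precision can be combined in a single format spec. A sketch, assuming a field width of 8 and two decimal places:

```python
# > right-aligns, 8 is the total width, .2f means two decimals
right = '{:>8.2f}'.format(2/3)
print(repr(right))  # '    0.67'

# the same spec works inside an f-string
value = 2/3
print(repr(f'{value:>8.2f}'))  # '    0.67'
```

Numbers are right-aligned by default, so `'{:8.2f}'` gives the same result; the `>` just makes the intent explicit.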


In [None]:
# In a single line, left-align words and right-align numbers
prices = [
    ('jacket', 10.99),
    ('hat', 4.89),
    ('shirt', 14),
]

In [None]:
for item, price in prices:
    pass
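
One way to fill in the loop body is sketched here. The column widths (10 and 8) are arbitrary choices:

```python
prices = [
    ('jacket', 10.99),
    ('hat', 4.89),
    ('shirt', 14),
]

# <10 left-aligns the word in 10 columns; >8.2f right-aligns the price in 8
lines = ['{:<10}{:>8.2f}'.format(item, price) for item, price in prices]
for line in lines:
    print(line)
```

The `.2f` spec also accepts the integer `14` and renders it as `14.00`, so the decimal points line up.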

In [None]:
# format datetime
import datetime
dt = datetime.datetime.now()

In [None]:
'{:%Y%m%d%H%M%S}'.format(dt)

In [None]:
dt.strftime('%Y%m%d%H%M%S')
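
The two cells above produce the same string, and `strptime` reverses the process. A sketch with a fixed, made-up timestamp so the output is reproducible:

```python
import datetime

dt = datetime.datetime(2020, 1, 31, 9, 5, 7)  # hypothetical fixed timestamp

stamp = dt.strftime('%Y%m%d%H%M%S')
same = '{:%Y%m%d%H%M%S}'.format(dt)
print(stamp, stamp == same)  # 20200131090507 True

# strptime parses a string back into a datetime using the same codes
parsed = datetime.datetime.strptime(stamp, '%Y%m%d%H%M%S')
print(parsed == dt)  # True

# format codes can be mixed with literal text, including in f-strings
readable = f'{dt:%Y-%m-%d %H:%M}'
print(readable)  # 2020-01-31 09:05
```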

See: https://docs.python.org/3.6/library/datetime.html#strftime-and-strptime-behavior