# 12) Wrangle and Mangle Data <a class="tocSkip">

In this notebook we shall learn practical skills for taming data. Sometimes this is called data munging or ETL (extract/transform/load). Data formats fall roughly into two categories: text and binary. Python strings are used for text data, and this notebook covers two string topics that we have skipped so far:

- Unicode characters.

- Regular expression pattern matching.

There are two more Python built-in types related to binary data:

- Bytes for immutable eight-bit values.

- Bytearrays for mutable ones.
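
The difference between the two types can be sketched as follows. The names blist, the_bytes and the_byte_array are illustrative and not used elsewhere in this notebook; the values are arbitrary integers in the range 0 to 255.

```python
# A small list of integers, each of which fits in eight bits (0-255)
blist = [1, 2, 3, 255]

# bytes is immutable: assigning to an item raises TypeError
the_bytes = bytes(blist)
try:
    the_bytes[1] = 127
except TypeError as err:
    print('bytes is immutable:', err)

# bytearray is mutable: the same assignment succeeds
the_byte_array = bytearray(blist)
the_byte_array[1] = 127
print(the_byte_array)
```

Both types behave like sequences of small integers, so indexing either one returns an int rather than a one-character string.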

### Text strings: Unicode

When data is exchanged with the outside world you need two things: a way to encode character strings to bytes and a way to decode bytes to character strings. UTF-8 is the standard text encoding in Python, Linux and HTML. It is fast, complete and works well. Using UTF-8 throughout your code is much easier than hopping in and out of various encodings.

You encode a string to bytes. The first argument of the string encode() method is the encoding name. The choices include ascii, utf-8, latin-1, cp1252 and unicode-escape.
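
As a rough illustration of how the choice of encoding changes the result, the sketch below encodes one short string several ways. The variable snack is a hypothetical name chosen for this example, and the errors argument shown is one optional way to handle characters an encoding cannot represent.

```python
snack = 'caf\u00e9'

# The same four characters produce different byte sequences
utf8_bytes = snack.encode('utf-8')      # é takes two bytes in UTF-8
latin_bytes = snack.encode('latin-1')   # é takes one byte in Latin-1

# ASCII has no é, so a strict encode raises UnicodeEncodeError
try:
    snack.encode('ascii')
except UnicodeEncodeError as err:
    print('ascii cannot encode:', err.object[err.start:err.end])

# The optional errors argument trades the exception for data loss
ascii_replaced = snack.encode('ascii', errors='replace')
ascii_ignored = snack.encode('ascii', errors='ignore')

print(utf8_bytes, latin_bytes, ascii_replaced, ascii_ignored)
```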

We decode byte strings to Unicode text strings. Whenever we get text from some external source (files, databases, websites, network APIs, etc.) it is encoded as byte strings. The tricky part is knowing which encoding was used so we can run it backwards and get Unicode strings. Below we create a Unicode string called place. We shall encode it in UTF-8 format in a bytes variable and then try to decode it using different encodings.

In [1]:
# Define a unicode string

place = 'caf\u00e9'
place

'café'

In [2]:
# Encode using UTF-8

place_bytes = place.encode('utf-8')
place_bytes

b'caf\xc3\xa9'

In [3]:
# Various decodings

place_utf = place_bytes.decode('utf-8')
place_latin = place_bytes.decode('latin-1')
place_windows = place_bytes.decode('windows-1252')

print('UTF-8 decoding: ', place_utf)
print('Latin decoding: ', place_latin)
print('Windows decoding: ', place_windows)

UTF-8 decoding:  café
Latin decoding:  cafÃ©
Windows decoding:  cafÃ©
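
Decoding with the wrong codec does not always yield garbled text; it can also fail outright. The sketch below assumes some bytes that were written as Latin-1 (legacy_bytes is a hypothetical name) and tries to read them back as UTF-8.

```python
# Bytes produced by a Latin-1 encoder: é is the single byte 0xe9
legacy_bytes = 'caf\u00e9'.encode('latin-1')

# 0xe9 on its own is not a valid UTF-8 sequence, so strict decoding fails
try:
    legacy_bytes.decode('utf-8')
except UnicodeDecodeError as err:
    print('not valid UTF-8:', err.reason)

# errors='replace' substitutes U+FFFD for the undecodable byte
lossy = legacy_bytes.decode('utf-8', errors='replace')
print(lossy)
```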


Whenever possible use UTF-8 encoding. It works, is supported everywhere, can express every Unicode character and is quickly decoded and encoded.
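
One place this advice applies is file I/O. The following is a minimal sketch, assuming a temporary directory and a hypothetical file name place.txt, that passes encoding='utf-8' explicitly instead of relying on the platform default.

```python
import os
import tempfile

text_out = 'caf\u00e9'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'place.txt')  # hypothetical file name

    # Write text, stating the encoding explicitly
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text_out)

    # Read the raw bytes back to see what was stored on disk
    with open(path, 'rb') as f:
        raw = f.read()

    # Read the text back with the same encoding
    with open(path, encoding='utf-8') as f:
        text_in = f.read()

print(raw)
print(text_in)
```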

### Text strings: regular expressions

We can match patterns using regular expressions, provided in the standard module re. You define a string pattern that you want to match and the source string to match against. There are a number of ways we can compare the pattern and the source:

- match() matches a pattern starting at the beginning of the source.

- search() returns the first match, if any.

- findall() returns a list of all non-overlapping matches, if any.

- split() splits the source at matches with pattern and returns a list of the string pieces.

- sub() takes another replacement argument, and changes all parts of source that are matched by pattern to replacement.
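
The difference between match() and search() is easy to miss, so here is a small sketch of it. The source string and the variable names are made up for this example.

```python
import re

source = 'Young Frankenstein'

# match() only succeeds if the pattern is found at the very start
m_start = re.match('You', source)
m_middle = re.match('Frank', source)    # None: 'Frank' is not at the start

# search() scans the whole source and returns the first match
s_middle = re.search('Frank', source)

print(m_start.group())
print(m_middle)
print(s_middle.span())
```

Both functions return a match object on success and None on failure, which is why results are usually tested with if before calling group().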

In [4]:
import re

In [7]:
# Using re.search() to search a string

source = 'My name is Bradley'
search = re.search('Bradley', source)

if search:
    print(search.group())

Bradley


In [10]:
# Using findall() to find every occurrence of a letter

source = 'Bradley Anthony Ward'
findall = re.findall('d', source)
findall

['d', 'd']

In [11]:
# Using findall() with any following character

findall = re.findall('d.', source)
findall

['dl']

In [13]:
# Using split to split the string at a particular character

split = re.split('n', source)
split

['Bradley A', 'tho', 'y Ward']

In [14]:
# Using sub() to replace a character

replace = re.sub('n', 'm', source)
replace

'Bradley Amthomy Ward'

Instead of literal letters, a pattern can use special characters that each stand for a class of characters or a position:

    \d - A single digit
    \D - A single non-digit
    \w - An alphanumeric character or underscore
    \W - Any character that \w does not match
    \s - A whitespace character
    \S - A non-whitespace character
    \b - A word boundary (between a \w and a \W, in either order)
    \B - A non-word boundary
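
A short sketch of these special characters in use, assuming an invented sample string. The patterns are written as raw strings (r'...') so that Python passes the backslashes through to re unchanged.

```python
import re

sample = 'Order 66 shipped on 2024-05-04 to station bay_7.'

# \d+ : runs of digits
digits = re.findall(r'\d+', sample)

# \w+ : runs of letters, digits and underscores
words = re.findall(r'\w+', sample)

# \s+ : split wherever there is whitespace
pieces = re.split(r'\s+', sample)

# Without \b, 'on' is also found inside 'station'
any_on = re.findall(r'on', sample)
whole_on = re.findall(r'\bon\b', sample)

print(digits)
print(words)
print(pieces)
print(len(any_on), len(whole_on))
```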

We also have pattern specifiers for regular expressions, which are presented in the table:

    -----------------------------------------------------------------
    | Pattern        | Matches                                      |
    -----------------------------------------------------------------
    | abc            | Literal abc                                  |
    | (expr)         | expr                                         |
    | expr1 | expr2  | expr1 or expr2                               |
    | .              | Any character except \n                      |
    | ^              | Start of source string                       |
    | $              | End of source string                         |
    | prev ?         | Zero or one prev                             |
    | prev *         | Zero or more prev, as many as possible       |
    | prev *?        | Zero or more prev, as few as possible        |
    | prev +         | One or more prev, as many as possible        |
    | prev +?        | One or more prev, as few as possible         |
    | prev {m}       | m consecutive prev                           |
    | prev {m,n}     | m to n consecutive prev, as many as possible |
    | prev {m,n}?    | m to n consecutive prev, as few as possible  |
    | [abc]          | a or b or c                                  |
    | [^abc]         | not (a or b or c)                            |
    | prev (?=next)  | prev if followed by next                     |
    | prev (?!next)  | prev if not followed by next                 |
    | (?<=prev) next | next if preceded by prev                     |
    | (?<!prev) next | next if not preceded by prev                 |
    -----------------------------------------------------------------

The characters ^ and $ are called anchors: ^ anchors the search to the beginning of the source string and $ anchors it to the end. The pattern .$ matches any single character at the end of the line, including a period; to match only a literal period there, escape it as \.$. Below are a couple of examples using the patterns above.
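
Anchors and the greedy versus non-greedy quantifiers from the table can be sketched as follows. The variable line repeats the sentence used in the examples of this section, and the other names are illustrative.

```python
import re

line = 'I wish I may, I wish I might have a dish of fish tonight.'

# ^ ties the pattern to the start of the source
starts_with_wish = re.search(r'^I wish', line)

# $ ties the pattern to the end; the source ends with 'tonight.', not 'fish'
ends_with_fish = re.search(r'fish$', line)

# .$ matches whatever the last character is; \.$ insists on a period
last_char = re.search(r'.$', line).group()
ends_with_period = re.search(r'\.$', line)

# .* takes as much as possible, .*? as little as possible
greedy = re.search(r'I.*wish', line).group()
lazy = re.search(r'I.*?wish', line).group()

print(greedy)
print(lazy)
```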

In [18]:
# Example source text

text = 'I wish I may, I wish I might have a dish of fish tonight.'

In [19]:
# Finding w or f followed by ish

re.findall('[wf]ish', text)

['wish', 'wish', 'fish']

In [20]:
# Finding ght followed by a non-alphanumeric

re.findall(r'ght\W', text)

['ght ', 'ght.']

In [22]:
# Finding ' wish' (including its leading space) preceded by I

re.findall('(?<=I) wish', text)

[' wish', ' wish']
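
The remaining lookaround patterns from the table work the same way. This is a minimal sketch on the same sentence, redefined here so the block stands alone; the result names are illustrative.

```python
import re

text = 'I wish I may, I wish I might have a dish of fish tonight.'

# Lookahead: words immediately followed by a comma or period
before_punct = re.findall(r'\w+(?=[,.])', text)

# Negative lookahead: four-letter 'ish' words that do not start with w
not_wish = re.findall(r'\b(?!w)\wish\b', text)

# Lookbehind: the word that comes right after 'of '
after_of = re.findall(r'(?<=of )\w+', text)

# Negative lookbehind: 'ish' not preceded by w
ish_no_w = re.findall(r'(?<!w)ish', text)

print(before_punct)
print(not_wish)
print(after_of)
print(ish_no_w)
```

Lookarounds test the surrounding text without consuming it, which is why the punctuation and the 'of ' prefix do not appear in the results.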