Strings
====

Cleaning data for analysis often requires working with text, for example to correct typos, convert to standard nomenclature, and resolve ambiguous labels. In some statistical applications, such as processing electronic medical records, information science, or recommendations based on user feedback, the text itself must be processed before analysis, for example by converting it to a bag of words.

We will use a whimsical example to illustrate Python tools for *munging* text data using string methods and regular expressions. Finally, we will see how to format text data for reporting.

Get "Through the Looking Glass" 
---

In [1]:
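import requests

# Use the cached copy if present; otherwise download the book and cache it
try:
    with open('looking_glass.txt', encoding='utf-8') as f:
        text = f.read()
except IOError:
    url = 'http://www.gutenberg.org/cache/epub/12/pg12.txt'
    res = requests.get(url)
    res.raise_for_status()  # fail early on an HTTP error
    text = res.text
    with open('looking_glass.txt', 'w', encoding='utf-8') as f:
        f.write(text)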

Slice to get Jabberwocky
----

In [2]:
start = text.find('JABBERWOCKY')

In [3]:
text[start:start+2000]

"JABBERWOCKY\n\n     'Twas brillig, and the slithy toves\n      Did gyre and gimble in the wabe;\n     All mimsy were the borogoves,\n      And the mome raths outgrabe.\n\n     'Beware the Jabberwock, my son!\n      The jaws that bite, the claws that catch!\n     Beware the Jubjub bird, and shun\n      The frumious Bandersnatch!'\n\n     He took his vorpal sword in hand:\n      Long time the manxome foe he sought--\n     So rested he by the Tumtum tree,\n      And stood awhile in thought.\n\n     And as in uffish thought he stood,\n      The Jabberwock, with eyes of flame,\n     Came whiffling through the tulgey wood,\n      And burbled as it came!\n\n     One, two! One, two! And through and through\n      The vorpal blade went snicker-snack!\n     He left it dead, and with its head\n      He went galumphing back.\n\n     'And hast thou slain the Jabberwock?\n      Come to my arms, my beamish boy!\n     O frabjous day! Callooh! Callay!'\n      He chortled in his joy.\n\n     'Twas bril

In [4]:
end = text.find('It seems very pretty', start)

In [5]:
poem = text[start:end]
poem

"JABBERWOCKY\n\n     'Twas brillig, and the slithy toves\n      Did gyre and gimble in the wabe;\n     All mimsy were the borogoves,\n      And the mome raths outgrabe.\n\n     'Beware the Jabberwock, my son!\n      The jaws that bite, the claws that catch!\n     Beware the Jubjub bird, and shun\n      The frumious Bandersnatch!'\n\n     He took his vorpal sword in hand:\n      Long time the manxome foe he sought--\n     So rested he by the Tumtum tree,\n      And stood awhile in thought.\n\n     And as in uffish thought he stood,\n      The Jabberwock, with eyes of flame,\n     Came whiffling through the tulgey wood,\n      And burbled as it came!\n\n     One, two! One, two! And through and through\n      The vorpal blade went snicker-snack!\n     He left it dead, and with its head\n      He went galumphing back.\n\n     'And hast thou slain the Jabberwock?\n      Come to my arms, my beamish boy!\n     O frabjous day! Callooh! Callay!'\n      He chortled in his joy.\n\n     'Twas bril

In [6]:
print(poem)

JABBERWOCKY

     'Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe;
     All mimsy were the borogoves,
      And the mome raths outgrabe.

     'Beware the Jabberwock, my son!
      The jaws that bite, the claws that catch!
     Beware the Jubjub bird, and shun
      The frumious Bandersnatch!'

     He took his vorpal sword in hand:
      Long time the manxome foe he sought--
     So rested he by the Tumtum tree,
      And stood awhile in thought.

     And as in uffish thought he stood,
      The Jabberwock, with eyes of flame,
     Came whiffling through the tulgey wood,
      And burbled as it came!

     One, two! One, two! And through and through
      The vorpal blade went snicker-snack!
     He left it dead, and with its head
      He went galumphing back.

     'And hast thou slain the Jabberwock?
      Come to my arms, my beamish boy!
     O frabjous day! Callooh! Callay!'
      He chortled in his joy.

     'Twas brillig, and the slithy toves
      D

In [7]:
print(poem.title())

Jabberwocky

     'Twas Brillig, And The Slithy Toves
      Did Gyre And Gimble In The Wabe;
     All Mimsy Were The Borogoves,
      And The Mome Raths Outgrabe.

     'Beware The Jabberwock, My Son!
      The Jaws That Bite, The Claws That Catch!
     Beware The Jubjub Bird, And Shun
      The Frumious Bandersnatch!'

     He Took His Vorpal Sword In Hand:
      Long Time The Manxome Foe He Sought--
     So Rested He By The Tumtum Tree,
      And Stood Awhile In Thought.

     And As In Uffish Thought He Stood,
      The Jabberwock, With Eyes Of Flame,
     Came Whiffling Through The Tulgey Wood,
      And Burbled As It Came!

     One, Two! One, Two! And Through And Through
      The Vorpal Blade Went Snicker-Snack!
     He Left It Dead, And With Its Head
      He Went Galumphing Back.

     'And Hast Thou Slain The Jabberwock?
      Come To My Arms, My Beamish Boy!
     O Frabjous Day! Callooh! Callay!'
      He Chortled In His Joy.

     'Twas Brillig, And The Slithy Toves
      D

In [8]:
poem.count('the')

15
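Note that `str.count` counts non-overlapping substring occurrences: it is case-sensitive and also counts `the` inside longer words. A minimal sketch of the difference, using a made-up sample string (`sample` is not taken from the book) and a word-boundary regular expression:

```python
import re

# Hypothetical sample text, chosen to contain 'the' inside other words
sample = "The cat sat there, then the dog and the bird left."

# Substring count: case-sensitive, also matches inside 'there' and 'then'
substring_count = sample.count('the')

# Whole-word count: \b marks word boundaries, IGNORECASE also matches 'The'
word_count = len(re.findall(r'\bthe\b', sample, flags=re.IGNORECASE))

print(substring_count, word_count)  # 4 3
```

This is one reason the poem is lowercased and split into words before counting in the cells that follow.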

In [9]:
print(poem.replace('the', 'XXX'))

JABBERWOCKY

     'Twas brillig, and XXX slithy toves
      Did gyre and gimble in XXX wabe;
     All mimsy were XXX borogoves,
      And XXX mome raths outgrabe.

     'Beware XXX Jabberwock, my son!
      The jaws that bite, XXX claws that catch!
     Beware XXX Jubjub bird, and shun
      The frumious Bandersnatch!'

     He took his vorpal sword in hand:
      Long time XXX manxome foe he sought--
     So rested he by XXX Tumtum tree,
      And stood awhile in thought.

     And as in uffish thought he stood,
      The Jabberwock, with eyes of flame,
     Came whiffling through XXX tulgey wood,
      And burbled as it came!

     One, two! One, two! And through and through
      The vorpal blade went snicker-snack!
     He left it dead, and with its head
      He went galumphing back.

     'And hast thou slain XXX Jabberwock?
      Come to my arms, my beamish boy!
     O frabjous day! Callooh! Callay!'
      He chortled in his joy.

     'Twas brillig, and XXX slithy toves
      D

Find palindromic words in the poem, if any
----

In [10]:
poem = poem.lower()

In [11]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [12]:
poem = poem.translate(dict.fromkeys(map(ord, string.punctuation)))
poem

'jabberwocky\n\n     twas brillig and the slithy toves\n      did gyre and gimble in the wabe\n     all mimsy were the borogoves\n      and the mome raths outgrabe\n\n     beware the jabberwock my son\n      the jaws that bite the claws that catch\n     beware the jubjub bird and shun\n      the frumious bandersnatch\n\n     he took his vorpal sword in hand\n      long time the manxome foe he sought\n     so rested he by the tumtum tree\n      and stood awhile in thought\n\n     and as in uffish thought he stood\n      the jabberwock with eyes of flame\n     came whiffling through the tulgey wood\n      and burbled as it came\n\n     one two one two and through and through\n      the vorpal blade went snickersnack\n     he left it dead and with its head\n      he went galumphing back\n\n     and hast thou slain the jabberwock\n      come to my arms my beamish boy\n     o frabjous day callooh callay\n      he chortled in his joy\n\n     twas brillig and the slithy toves\n      did gyre 
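The translation table used above maps each punctuation code point to `None`, which `str.translate` treats as "delete this character". An equivalent way to build such a table is the three-argument form of `str.maketrans`. The sketch below applies it to a single line of the poem rather than the full text:

```python
import string

# One line of the poem, used as a small test input
line = "'Beware the Jabberwock, my son!'"

# Table built as in the cell above: code point -> None means "delete"
table_a = dict.fromkeys(map(ord, string.punctuation))

# The third argument of str.maketrans lists the characters to delete
table_b = str.maketrans('', '', string.punctuation)

cleaned = line.lower().translate(table_b)
print(table_a == table_b)  # True
print(cleaned)             # beware the jabberwock my son
```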

In [13]:
words = poem.split()
words[:10]

['jabberwocky',
 'twas',
 'brillig',
 'and',
 'the',
 'slithy',
 'toves',
 'did',
 'gyre',
 'and']

In [14]:
def is_palindrome(word):
    return word == word[::-1]

In [15]:
{word for word in words if is_palindrome(word)}

{'did', 'o'}

Top 10 most frequent words
----

In [16]:
import collections

In [17]:
poem_counter = collections.Counter(words)

In [18]:
poem_counter.most_common(10)

[('the', 19),
 ('and', 14),
 ('he', 7),
 ('in', 6),
 ('my', 3),
 ('jabberwock', 3),
 ('through', 3),
 ('slithy', 2),
 ('did', 2),
 ('that', 2)]

Words that appear exactly twice.
----

In [19]:
[(k, v) for (k, v) in poem_counter.items() if v==2]

[('slithy', 2),
 ('did', 2),
 ('that', 2),
 ('outgrabe', 2),
 ('his', 2),
 ('gimble', 2),
 ('mimsy', 2),
 ('stood', 2),
 ('borogoves', 2),
 ('raths', 2),
 ('vorpal', 2),
 ('gyre', 2),
 ('with', 2),
 ('beware', 2),
 ('wabe', 2),
 ('toves', 2),
 ('mome', 2),
 ('it', 2),
 ('two', 2),
 ('as', 2),
 ('one', 2),
 ('all', 2),
 ('brillig', 2),
 ('went', 2),
 ('twas', 2),
 ('came', 2),
 ('were', 2),
 ('thought', 2)]

Trigrams
----

All sequences of 3 consecutive words in the poem.

In [20]:
list(zip(words[:-2], words[1:-1], words[2:]))[:10]

[('jabberwocky', 'twas', 'brillig'),
 ('twas', 'brillig', 'and'),
 ('brillig', 'and', 'the'),
 ('and', 'the', 'slithy'),
 ('the', 'slithy', 'toves'),
 ('slithy', 'toves', 'did'),
 ('toves', 'did', 'gyre'),
 ('did', 'gyre', 'and'),
 ('gyre', 'and', 'gimble'),
 ('and', 'gimble', 'in')]
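The same `zip` idiom generalizes to n-grams of any length by zipping n shifted views of the word list. A minimal sketch, where `ngrams` is a helper name introduced here for illustration:

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest input, so the ends need no explicit trimming
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['twas', 'brillig', 'and', 'the', 'slithy', 'toves']
print(ngrams(tokens, 2)[:2])   # [('twas', 'brillig'), ('brillig', 'and')]
print(len(ngrams(tokens, 3)))  # 4
```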

Find words that are over-represented in the poem relative to the book
----

In [21]:
book = text

In [22]:
book = book.lower().translate(dict.fromkeys(map(ord, string.punctuation)))

In [23]:
book_counter = collections.Counter(book.split())

In [24]:
n = sum(book_counter.values())
book_freqs = {k: v/n for k, v in book_counter.items()}

In [25]:
n = sum(poem_counter.values())
stats = [(k, v, book_freqs.get(k,0)*n) for k, v in poem_counter.items()]

In [26]:
from pandas import DataFrame

In [27]:
df = DataFrame(stats, columns = ['word', 'observed', 'expected'])

In [28]:
df['score'] = (df.observed-df.expected)**2/df.expected

In [29]:
df = df.sort_values(['score'], ascending=False)
df.head(n=10)

            word  observed  expected       score
44    jabberwock         3   0.01557   572.04551
52        vorpal         2   0.01038  381.363673
59        beware         2   0.01038  381.363673
41     borogoves         2   0.01557  252.917766
84          twas         2   0.01557  252.917766
51        tulgey         1   0.00519  190.681837
22       manxome         1   0.00519  190.681837
23         flame         1   0.00519  190.681837
24       callooh         1   0.00519  190.681837
26  snickersnack         1   0.00519  190.681837
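The score computed above has the form of a per-word Pearson chi-square term, `(observed - expected)**2 / expected`, where the expected count is the word's relative frequency in the whole book multiplied by the number of words in the poem. A sketch on toy data (both word lists below are invented for illustration) shows the same calculation without pandas:

```python
import collections

# Toy "book" and "excerpt"; the excerpt is part of the book, so every
# excerpt word has a non-zero book frequency (no division by zero)
book_words = 'the cat and the dog and the vorpal sword'.split()
excerpt_words = 'the vorpal sword'.split()

book_counts = collections.Counter(book_words)
excerpt_counts = collections.Counter(excerpt_words)

n_book = sum(book_counts.values())
n_excerpt = sum(excerpt_counts.values())

scores = {}
for word, observed in excerpt_counts.items():
    expected = book_counts[word] / n_book * n_excerpt
    scores[word] = (observed - expected) ** 2 / expected

# Rank words by descending score
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f'{word:8s} {score:.3f}')
```

Because the difference is squared, a strongly under-represented word would also receive a high score; checking the sign of `observed - expected` distinguishes the two cases.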


Encode and decode poem using a Caesar cipher
----

In [30]:
print(poem)

jabberwocky

     twas brillig and the slithy toves
      did gyre and gimble in the wabe
     all mimsy were the borogoves
      and the mome raths outgrabe

     beware the jabberwock my son
      the jaws that bite the claws that catch
     beware the jubjub bird and shun
      the frumious bandersnatch

     he took his vorpal sword in hand
      long time the manxome foe he sought
     so rested he by the tumtum tree
      and stood awhile in thought

     and as in uffish thought he stood
      the jabberwock with eyes of flame
     came whiffling through the tulgey wood
      and burbled as it came

     one two one two and through and through
      the vorpal blade went snickersnack
     he left it dead and with its head
      he went galumphing back

     and hast thou slain the jabberwock
      come to my arms my beamish boy
     o frabjous day callooh callay
      he chortled in his joy

     twas brillig and the slithy toves
      did gyre and gimble in the wabe
     all mi

### Encoding

In [31]:
def encode(text, k):
    table = dict(zip(map(ord, string.ascii_lowercase), 
                              string.ascii_lowercase[k:] + string.ascii_lowercase[:k]))
    return text.translate(table)

In [32]:
cipher = encode(poem, 2)
print(cipher)

lcddgtyqema

     vycu dtknnki cpf vjg unkvja vqxgu
      fkf iatg cpf ikodng kp vjg ycdg
     cnn okoua ygtg vjg dqtqiqxgu
      cpf vjg oqog tcvju qwvitcdg

     dgyctg vjg lcddgtyqem oa uqp
      vjg lcyu vjcv dkvg vjg encyu vjcv ecvej
     dgyctg vjg lwdlwd dktf cpf ujwp
      vjg htwokqwu dcpfgtupcvej

     jg vqqm jku xqtrcn uyqtf kp jcpf
      nqpi vkog vjg ocpzqog hqg jg uqwijv
     uq tguvgf jg da vjg vwovwo vtgg
      cpf uvqqf cyjkng kp vjqwijv

     cpf cu kp whhkuj vjqwijv jg uvqqf
      vjg lcddgtyqem ykvj gagu qh hncog
     ecog yjkhhnkpi vjtqwij vjg vwniga yqqf
      cpf dwtdngf cu kv ecog

     qpg vyq qpg vyq cpf vjtqwij cpf vjtqwij
      vjg xqtrcn dncfg ygpv upkemgtupcem
     jg nghv kv fgcf cpf ykvj kvu jgcf
      jg ygpv icnworjkpi dcem

     cpf jcuv vjqw unckp vjg lcddgtyqem
      eqog vq oa ctou oa dgcokuj dqa
     q htcdlqwu fca ecnnqqj ecnnca
      jg ejqtvngf kp jku lqa

     vycu dtknnki cpf vjg unkvja vqxgu
      fkf iatg cpf ikodng kp vjg ycdg
     cnn ok

### Decoding

In [33]:
recovered = encode(cipher, -2)
print(recovered)

jabberwocky

     twas brillig and the slithy toves
      did gyre and gimble in the wabe
     all mimsy were the borogoves
      and the mome raths outgrabe

     beware the jabberwock my son
      the jaws that bite the claws that catch
     beware the jubjub bird and shun
      the frumious bandersnatch

     he took his vorpal sword in hand
      long time the manxome foe he sought
     so rested he by the tumtum tree
      and stood awhile in thought

     and as in uffish thought he stood
      the jabberwock with eyes of flame
     came whiffling through the tulgey wood
      and burbled as it came

     one two one two and through and through
      the vorpal blade went snickersnack
     he left it dead and with its head
      he went galumphing back

     and hast thou slain the jabberwock
      come to my arms my beamish boy
     o frabjous day callooh callay
      he chortled in his joy

     twas brillig and the slithy toves
      did gyre and gimble in the wabe
     all mi
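Decoding reuses `encode` with a negative shift because Python slices accept negative indices: `string.ascii_lowercase[-2:]` is `'yz'`, so the table rotates the alphabet the other way. Characters without an entry in the table, such as spaces and newlines, pass through unchanged. A self-contained sketch using `str.maketrans`, with `caesar` as a hypothetical function name and a modulo added so that shifts outside 0..25 also work:

```python
import string

def caesar(text, k):
    """Shift lowercase ASCII letters by k positions, wrapping around."""
    letters = string.ascii_lowercase
    k %= len(letters)  # normalizes negative and large shifts
    table = str.maketrans(letters, letters[k:] + letters[:k])
    return text.translate(table)

secret = caesar('beware the jabberwock', 2)
print(secret)              # dgyctg vjg lcddgtyqem
print(caesar(secret, -2))  # beware the jabberwock
```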

Using Regular Expressions
----

- [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html)
- [Test your Python regular expressions](http://pythex.org/)

### Find all words with a sequence of two or more identical letters, e.g. "look"

In [34]:
import re

In [35]:
regex = re.compile(r'(\w*(\w)\2+\w*)', re.IGNORECASE | re.MULTILINE)

In [36]:
for match in regex.finditer(poem):
    print(match.group(2), match.group(1))

b jabberwocky
l brillig
l all
b jabberwock
o took
e tree
o stood
f uffish
o stood
b jabberwock
f whiffling
o wood
b jabberwock
o callooh
l callay
l brillig
l all
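In the pattern above, the inner group `(\w)` captures a single word character and the backreference `\2+` requires that same character to repeat at least once more; the outer group captures the whole word. Because the leading `\w*` is greedy, a word with several doubled letters reports the last one, which is why `callooh` is listed under `o` rather than `l`. A sketch of the same pattern with named groups (the group names `word` and `letter` are chosen here for readability):

```python
import re

pattern = re.compile(r'(?P<word>\w*(?P<letter>\w)(?P=letter)+\w*)')

found = [(m.group('letter'), m.group('word'))
         for m in pattern.finditer('o frabjous day callooh callay')]
print(found)  # [('o', 'callooh'), ('l', 'callay')]
```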


### Convert the identical sequences to uppercase

In [37]:
def f(match):
    word, letter = match.groups()
    return word.replace(letter, letter.upper())

print(regex.sub(f, poem))

jaBBerwocky

     twas briLLig and the slithy toves
      did gyre and gimble in the wabe
     aLL mimsy were the borogoves
      and the mome raths outgrabe

     beware the jaBBerwock my son
      the jaws that bite the claws that catch
     beware the jubjub bird and shun
      the frumious bandersnatch

     he tOOk his vorpal sword in hand
      long time the manxome foe he sought
     so rested he by the tumtum trEE
      and stOOd awhile in thought

     and as in uFFish thought he stOOd
      the jaBBerwock with eyes of flame
     came whiFFling through the tulgey wOOd
      and burbled as it came

     one two one two and through and through
      the vorpal blade went snickersnack
     he left it dead and with its head
      he went galumphing back

     and hast thou slain the jaBBerwock
      come to my arms my beamish boy
     o frabjous day callOOh caLLay
      he chortled in his joy

     twas briLLig and the slithy toves
      did gyre and gimble in the wabe
     aLL mi
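The replacement function above calls `word.replace(letter, ...)`, which uppercases every occurrence of the captured letter in the word and handles only one doubled letter per word. If the intent is to uppercase exactly the repeated runs, one alternative (a sketch, not the approach used in the cell above) is to match the runs directly:

```python
import re

# Match any word character followed by one or more copies of itself
run = re.compile(r'(\w)\1+')

def upper_runs(text):
    """Uppercase every run of two or more identical characters."""
    return run.sub(lambda m: m.group(0).upper(), text)

print(upper_runs('o frabjous day callooh callay'))
# o frabjous day caLLOOh caLLay
print(upper_runs('lolly'))
# loLLy
```

With the original function, `lolly` would become `LoLLy` and `callooh` would become `callOOh`, so the two approaches differ on words that repeat a letter elsewhere or contain more than one run.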

String formatting
----

### Selection

In [38]:
import math

In [39]:
stuff = ('bun', 'shoe', ['bee', 'door'], 2, math.pi, 0.05)

In [40]:
'One: {}, Two {}'.format(*stuff)

'One: bun, Two shoe'

In [41]:
'One: {0}, Two {1}'.format(*stuff)

'One: bun, Two shoe'

In [42]:
'One: {1}, Two {1}'.format(*stuff)

'One: shoe, Two shoe'

In [43]:
'One: {0}, Two {2[1]}'.format(*stuff)

'One: bun, Two door'

### Formatting

In [44]:
'One: {0:^10s}, Two {1:_>15s}'.format(*stuff)

'One:    bun    , Two ___________shoe'

In [45]:
'One: {3}, Two {4}'.format(*stuff)

'One: 2, Two 3.141592653589793'

In [46]:
'One: {3:+10d}, Two {4:.4f}'.format(*stuff)

'One:         +2, Two 3.1416'

In [47]:
'One: {3:04d}, Two {4:.4g}'.format(*stuff)

'One: 0002, Two 3.142'

In [48]:
'One: {3:.4e}, Two {4:.4e}'.format(*stuff)

'One: 2.0000e+00, Two 3.1416e+00'

In [49]:
'One: {5:.2%}, Two {5:f}'.format(*stuff)

'One: 5.00%, Two 0.050000'
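The same format specifications work in f-strings (Python 3.6 and later), where the expression is written directly inside the braces. A short sketch mirroring some of the calls above; the variable names `a`, `b`, and `c` are only for illustration:

```python
import math

stuff = ('bun', 'shoe', ['bee', 'door'], 2, math.pi, 0.05)

# Center in a field of 10; right-align in 15, padded with underscores
a = f'One: {stuff[0]:^10s}, Two {stuff[1]:_>15s}'
# Signed integer in a field of 10; float with 4 decimal places
b = f'One: {stuff[3]:+10d}, Two {stuff[4]:.4f}'
# Percentage with 2 decimal places
c = f'One: {stuff[5]:.2%}'

print(a)
print(b)
print(c)
```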

### Old style formatting

In [50]:
'%s, %s, %a, %d, %.4f, %.2f' % stuff

"bun, shoe, ['bee', 'door'], 2, 3.1416, 0.05"

### Formatting numpy arrays

In [51]:
import numpy as np

In [52]:
x = np.arange(1, 13).reshape(3,4)
x

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [53]:
np.set_printoptions(formatter={'int': lambda x: '%8.2f' % x})

In [54]:
x

array([[    1.00,     2.00,     3.00,     4.00],
       [    5.00,     6.00,     7.00,     8.00],
       [    9.00,    10.00,    11.00,    12.00]])

In [55]:
np.set_printoptions()

In [56]:
x

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])
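
`np.set_printoptions` changes global state, so the formatter has to be reset by hand as in the cells above. For a temporary change, NumPy (version 1.15 and later) also provides the `np.printoptions` context manager, which restores the previous settings on exit. A minimal sketch:

```python
import numpy as np

x = np.arange(1, 13).reshape(3, 4)

# The custom integer formatter applies only inside the with-block
with np.printoptions(formatter={'int': lambda v: '%8.2f' % v}):
    formatted = repr(x)

plain = repr(x)  # default formatting is back in effect here
print(formatted)
print(plain)
```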