Strings
====

**Note:** this is a modified version of an [original notebook by Cliburn Chan](https://github.com/cliburn/sta-663-2017/blob/master/notebook/04_Strings.ipynb). The original and this modified version are both released under the terms of the MIT license.

Cleaning data for analysis often requires working with text, for example to correct typos, convert to standard nomenclature, and resolve ambiguous labels. In fields that deal with electronic medical records, information science, or recommendations based on user feedback, the text itself must be processed before analysis, for example by converting it to a bag of words.

We will use a whimsical example to illustrate Python tools for *munging* text data using string methods and regular expressions. Finally, we will see how to format text data for reporting. 

Get "Through the Looking Glass" 
---

In [72]:
import requests

try:
    with open('looking_glass.txt') as f:
        text = f.read()
except IOError:
    url = 'http://www.gutenberg.org/cache/epub/12/pg12.txt'
    res = requests.get(url)
    text = res.text
    with open('looking_glass.txt', 'w') as f:
        f.write(str(text))

Slice to get Jabberwocky
----

In [73]:
start = text.find('JABBERWOCKY')

In [74]:
text[start:start+2000]

"JABBERWOCKY\n\n     'Twas brillig, and the slithy toves\n      Did gyre and gimble in the wabe;\n     All mimsy were the borogoves,\n      And the mome raths outgrabe.\n\n     'Beware the Jabberwock, my son!\n      The jaws that bite, the claws that catch!\n     Beware the Jubjub bird, and shun\n      The frumious Bandersnatch!'\n\n     He took his vorpal sword in hand:\n      Long time the manxome foe he sought--\n     So rested he by the Tumtum tree,\n      And stood awhile in thought.\n\n     And as in uffish thought he stood,\n      The Jabberwock, with eyes of flame,\n     Came whiffling through the tulgey wood,\n      And burbled as it came!\n\n     One, two! One, two! And through and through\n      The vorpal blade went snicker-snack!\n     He left it dead, and with its head\n      He went galumphing back.\n\n     'And hast thou slain the Jabberwock?\n      Come to my arms, my beamish boy!\n     O frabjous day! Callooh! Callay!'\n      He chortled in his joy.\n\n     'Twas bril

In [75]:
end = text.find('It seems very pretty', start)
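Both `find` calls return -1 when the marker is absent, and a slice built from -1 silently yields the wrong text instead of failing. A minimal sketch of a more defensive variant, assuming a short made-up string (`sample`) and a hypothetical helper named `extract_between`; it uses `str.index`, which raises `ValueError` when the marker is missing:

```python
def extract_between(text, start_marker, end_marker):
    """Return text from start_marker up to (not including) end_marker.

    Hypothetical helper for illustration only.
    """
    start = text.index(start_marker)      # index() raises ValueError if absent
    end = text.index(end_marker, start)   # search only from the start marker on
    return text[start:end]

# Made-up sample text, not the downloaded book
sample = "preface JABBERWOCKY 'Twas brillig It seems very pretty, she said"
excerpt = extract_between(sample, 'JABBERWOCKY', 'It seems very pretty')
print(repr(excerpt))
```

Whether to fail loudly or fall back to a default is a design choice; for a one-off notebook the plain `find` calls are fine as long as the result is inspected.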

In [76]:
poem = text[start:end]
poem

"JABBERWOCKY\n\n     'Twas brillig, and the slithy toves\n      Did gyre and gimble in the wabe;\n     All mimsy were the borogoves,\n      And the mome raths outgrabe.\n\n     'Beware the Jabberwock, my son!\n      The jaws that bite, the claws that catch!\n     Beware the Jubjub bird, and shun\n      The frumious Bandersnatch!'\n\n     He took his vorpal sword in hand:\n      Long time the manxome foe he sought--\n     So rested he by the Tumtum tree,\n      And stood awhile in thought.\n\n     And as in uffish thought he stood,\n      The Jabberwock, with eyes of flame,\n     Came whiffling through the tulgey wood,\n      And burbled as it came!\n\n     One, two! One, two! And through and through\n      The vorpal blade went snicker-snack!\n     He left it dead, and with its head\n      He went galumphing back.\n\n     'And hast thou slain the Jabberwock?\n      Come to my arms, my beamish boy!\n     O frabjous day! Callooh! Callay!'\n      He chortled in his joy.\n\n     'Twas bril

In [77]:
print(poem)

JABBERWOCKY

     'Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe;
     All mimsy were the borogoves,
      And the mome raths outgrabe.

     'Beware the Jabberwock, my son!
      The jaws that bite, the claws that catch!
     Beware the Jubjub bird, and shun
      The frumious Bandersnatch!'

     He took his vorpal sword in hand:
      Long time the manxome foe he sought--
     So rested he by the Tumtum tree,
      And stood awhile in thought.

     And as in uffish thought he stood,
      The Jabberwock, with eyes of flame,
     Came whiffling through the tulgey wood,
      And burbled as it came!

     One, two! One, two! And through and through
      The vorpal blade went snicker-snack!
     He left it dead, and with its head
      He went galumphing back.

     'And hast thou slain the Jabberwock?
      Come to my arms, my beamish boy!
     O frabjous day! Callooh! Callay!'
      He chortled in his joy.

     'Twas brillig, and the slithy toves
      D

In [78]:
print(poem.title())

Jabberwocky

     'Twas Brillig, And The Slithy Toves
      Did Gyre And Gimble In The Wabe;
     All Mimsy Were The Borogoves,
      And The Mome Raths Outgrabe.

     'Beware The Jabberwock, My Son!
      The Jaws That Bite, The Claws That Catch!
     Beware The Jubjub Bird, And Shun
      The Frumious Bandersnatch!'

     He Took His Vorpal Sword In Hand:
      Long Time The Manxome Foe He Sought--
     So Rested He By The Tumtum Tree,
      And Stood Awhile In Thought.

     And As In Uffish Thought He Stood,
      The Jabberwock, With Eyes Of Flame,
     Came Whiffling Through The Tulgey Wood,
      And Burbled As It Came!

     One, Two! One, Two! And Through And Through
      The Vorpal Blade Went Snicker-Snack!
     He Left It Dead, And With Its Head
      He Went Galumphing Back.

     'And Hast Thou Slain The Jabberwock?
      Come To My Arms, My Beamish Boy!
     O Frabjous Day! Callooh! Callay!'
      He Chortled In His Joy.

     'Twas Brillig, And The Slithy Toves
      D

In [79]:
poem.count('the')

15
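Note that `str.count` counts substrings, so a word that merely contains "the" also contributes, while a capitalised "The" does not. A small sketch of counting whole words instead, using a regular expression with word boundaries on a made-up sentence (`sample` is illustrative, not the poem):

```python
import re

sample = "The jaws that bite, the claws that catch! Then they bathe there."

# Substring count: also matches inside 'they', 'bathe' and 'there'
substring_count = sample.count('the')

# Whole-word, case-insensitive count using \b word boundaries
word_count = len(re.findall(r'\bthe\b', sample, flags=re.IGNORECASE))

print(substring_count, word_count)
```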

In [80]:
print(poem.replace('the', 'XXX'))

JABBERWOCKY

     'Twas brillig, and XXX slithy toves
      Did gyre and gimble in XXX wabe;
     All mimsy were XXX borogoves,
      And XXX mome raths outgrabe.

     'Beware XXX Jabberwock, my son!
      The jaws that bite, XXX claws that catch!
     Beware XXX Jubjub bird, and shun
      The frumious Bandersnatch!'

     He took his vorpal sword in hand:
      Long time XXX manxome foe he sought--
     So rested he by XXX Tumtum tree,
      And stood awhile in thought.

     And as in uffish thought he stood,
      The Jabberwock, with eyes of flame,
     Came whiffling through XXX tulgey wood,
      And burbled as it came!

     One, two! One, two! And through and through
      The vorpal blade went snicker-snack!
     He left it dead, and with its head
      He went galumphing back.

     'And hast thou slain XXX Jabberwock?
      Come to my arms, my beamish boy!
     O frabjous day! Callooh! Callay!'
      He chortled in his joy.

     'Twas brillig, and XXX slithy toves
      D

Find palindromic words in the poem, if any
----

In [81]:
poem = poem.lower()

We first need to strip the punctuation characters so that only plain text remains:

In [82]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [63]:
list(map(ord, string.punctuation))

[33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 91,
 92,
 93,
 94,
 95,
 96,
 123,
 124,
 125,
 126]

In [71]:
'abc'.replace('b', '')

'ac'

In [65]:
import itertools as it
# cycle needs an iterable argument; it repeats the items endlessly
it.cycle(['X'])

In [68]:
dict.fromkeys(map(ord, string.punctuation))

{33: None,
 34: None,
 35: None,
 36: None,
 37: None,
 38: None,
 39: None,
 40: None,
 41: None,
 42: None,
 43: None,
 44: None,
 45: None,
 46: None,
 47: None,
 58: None,
 59: None,
 60: None,
 61: None,
 62: None,
 63: None,
 64: None,
 91: None,
 92: None,
 93: None,
 94: None,
 95: None,
 96: None,
 123: None,
 124: None,
 125: None,
 126: None}

In [66]:

dict(zip(map(ord, string.punctuation), it.cycle(['X'])))

{33: 'X',
 34: 'X',
 35: 'X',
 36: 'X',
 37: 'X',
 38: 'X',
 39: 'X',
 40: 'X',
 41: 'X',
 42: 'X',
 43: 'X',
 44: 'X',
 45: 'X',
 46: 'X',
 47: 'X',
 58: 'X',
 59: 'X',
 60: 'X',
 61: 'X',
 62: 'X',
 63: 'X',
 64: 'X',
 91: 'X',
 92: 'X',
 93: 'X',
 94: 'X',
 95: 'X',
 96: 'X',
 123: 'X',
 124: 'X',
 125: 'X',
 126: 'X'}

In [83]:
#poem  = poem.translate(dict.fromkeys(map(ord, string.punctuation)))

poem = poem.translate(dict(zip(map(ord, string.punctuation), it.cycle(['']))))
print(poem)


jabberwocky

     twas brillig and the slithy toves
      did gyre and gimble in the wabe
     all mimsy were the borogoves
      and the mome raths outgrabe

     beware the jabberwock my son
      the jaws that bite the claws that catch
     beware the jubjub bird and shun
      the frumious bandersnatch

     he took his vorpal sword in hand
      long time the manxome foe he sought
     so rested he by the tumtum tree
      and stood awhile in thought

     and as in uffish thought he stood
      the jabberwock with eyes of flame
     came whiffling through the tulgey wood
      and burbled as it came

     one two one two and through and through
      the vorpal blade went snickersnack
     he left it dead and with its head
      he went galumphing back

     and hast thou slain the jabberwock
      come to my arms my beamish boy
     o frabjous day callooh callay
      he chortled in his joy

     twas brillig and the slithy toves
      did gyre and gimble in the wabe
     all mi
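An arguably simpler way to build the same deletion table is `str.maketrans`, whose third argument lists the characters to delete. A minimal sketch on a short made-up line (`line` is illustrative):

```python
import string

# Third argument of str.maketrans: characters mapped to None (i.e. deleted)
table = str.maketrans('', '', string.punctuation)

line = "'Beware the Jabberwock, my son!"
cleaned = line.lower().translate(table)
print(cleaned)
```

The resulting table is the same mapping from code points to `None` that `dict.fromkeys(map(ord, string.punctuation))` produces.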

In [46]:
words = poem.split()
words[:10]

['jabberwocky',
 'twas',
 'brillig',
 'and',
 'the',
 'slithy',
 'toves',
 'did',
 'gyre',
 'and']

In [49]:
def is_palindrome(word):
    return word == word[::-1] and len(word)>1

In [50]:
{word for word in words if is_palindrome(word)}

{'did'}

Top 5 most frequent words
----

In [86]:
import collections

In [87]:
poem_counter = collections.Counter(words)

In [89]:
poem_counter.most_common(5)

[('the', 19), ('and', 14), ('he', 7), ('in', 6), ('jabberwock', 3)]

In [90]:
poem_counter.items()

dict_items([('jabberwocky', 1), ('twas', 2), ('brillig', 2), ('and', 14), ('the', 19), ('slithy', 2), ('toves', 2), ('did', 2), ('gyre', 2), ('gimble', 2), ('in', 6), ('wabe', 2), ('all', 2), ('mimsy', 2), ('were', 2), ('borogoves', 2), ('mome', 2), ('raths', 2), ('outgrabe', 2), ('beware', 2), ('jabberwock', 3), ('my', 3), ('son', 1), ('jaws', 1), ('that', 2), ('bite', 1), ('claws', 1), ('catch', 1), ('jubjub', 1), ('bird', 1), ('shun', 1), ('frumious', 1), ('bandersnatch', 1), ('he', 7), ('took', 1), ('his', 2), ('vorpal', 2), ('sword', 1), ('hand', 1), ('long', 1), ('time', 1), ('manxome', 1), ('foe', 1), ('sought', 1), ('so', 1), ('rested', 1), ('by', 1), ('tumtum', 1), ('tree', 1), ('stood', 2), ('awhile', 1), ('thought', 2), ('as', 2), ('uffish', 1), ('with', 2), ('eyes', 1), ('of', 1), ('flame', 1), ('came', 2), ('whiffling', 1), ('through', 3), ('tulgey', 1), ('wood', 1), ('burbled', 1), ('it', 2), ('one', 2), ('two', 2), ('blade', 1), ('went', 2), ('snickersnack', 1), ('left',

Words that appear exactly twice.
----

In [None]:
[(k, v) for (k, v) in poem_counter.items() if v==2]

Trigrams
----

All sequences of 3 consecutive words in the poem.

In [None]:
list(zip(words[:], words[1:], words[2:]))[:10]

In [None]:
import itertools as it

In [None]:
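def window(x, n):
    """Sliding window of size n from iterable x."""
    # tee gives n independent iterators, so this also works for one-shot iterators
    iters = it.tee(x, n)
    s = (it.islice(t, i, None) for i, t in enumerate(iters))
    return zip(*s)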

In [None]:
list(window(words, 3))[:10]
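Because trigrams are plain tuples, they can be counted just like single words. A short sketch with `collections.Counter` on a made-up word list (`sample_words` is illustrative):

```python
import collections

sample_words = 'one two one two one'.split()

# Build trigrams by zipping the list against itself shifted by 1 and 2
trigrams = list(zip(sample_words, sample_words[1:], sample_words[2:]))

# Tuples are hashable, so Counter can tally them directly
trigram_counter = collections.Counter(trigrams)
print(trigram_counter.most_common(1))
```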

Using Regular Expressions
----

- [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html)
- [Test your Python regular expressions](http://pythex.org/)

### Find all words with a sequence of two or more identical letters e.g. "look"

In [92]:
import re

In [93]:
regex = re.compile(r'(\w*(\w)\2+\w*)', re.IGNORECASE | re.MULTILINE)

In [94]:
for match in regex.finditer(poem):
    print(match.group(2), match.group(1))

b jabberwocky
l brillig
l all
b jabberwock
o took
e tree
o stood
f uffish
o stood
b jabberwock
f whiffling
o wood
b jabberwock
o callooh
l callay
l brillig
l all


### Convert the identical sequences to uppercase

In [None]:
def f(match):
    word, letter = match.groups()
    return word.replace(letter, letter.upper())

print(regex.sub(f, poem))
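Note that `word.replace(letter, letter.upper())` uppercases every occurrence of the captured letter in the word, and only one repeated letter is captured per word. If only the runs themselves should change, one option (a sketch, not the approach used in the cell above) is to match each run directly; `run_regex` and `sample` are illustrative names:

```python
import re

# A word character followed by one or more copies of itself
run_regex = re.compile(r'(\w)\1+')

sample = 'he took the balloon to callooh'

# m.group(0) is the whole run, so only the repeated letters are uppercased
result = run_regex.sub(lambda m: m.group(0).upper(), sample)
print(result)
```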

Natural language processing
----

If you intend to perform statistical analysis on natural language, you should probably use NLTK to pre-process the text instead of using string methods and regular expressions. For example, a simple challenge is to first parse the paragraph below into sentences, then parse each sentence into words.

Paragraph from a random PubMed abstract.

In [99]:
para = """When compared with the control group no significant associations were found for the NS-PEecl group after adjustment of confounding variables. For the S-PEecl group, antiβ2GP1 IgG (OR 16.91, 95% CI 3.71-77.06) was associated, as well as age, obesity, smoking and multiparity. Antiβ2GP1-domain I IgG were associated with aCL, antiβ2GP1 and aPS/PT IgG in the three groups. aPS/PT IgG were associated with aCL IgG, and aPS/PT IgM were associated with aCL and antiβ2GP1 IgM in the three groups CONCLUSION: S-PEecl is a distinct entity from NS-PEecl and is mainly associated with the presence of antiβ2GP1 IgG. Antiβ2GP1 domain I correlate with other aPL IgG tests, and aPS/PT may be promising in patients in which LA tests cannot be interpreted."""
print(para)

When compared with the control group no significant associations were found for the NS-PEecl group after adjustment of confounding variables. For the S-PEecl group, antiβ2GP1 IgG (OR 16.91, 95% CI 3.71-77.06) was associated, as well as age, obesity, smoking and multiparity. Antiβ2GP1-domain I IgG were associated with aCL, antiβ2GP1 and aPS/PT IgG in the three groups. aPS/PT IgG were associated with aCL IgG, and aPS/PT IgM were associated with aCL and antiβ2GP1 IgM in the three groups CONCLUSION: S-PEecl is a distinct entity from NS-PEecl and is mainly associated with the presence of antiβ2GP1 IgG. Antiβ2GP1 domain I correlate with other aPL IgG tests, and aPS/PT may be promising in patients in which LA tests cannot be interpreted.


### Naive splitting of sentences as anything separated by ., ! or ?

In [100]:
sep = re.compile(r'[\?\!\.]')

In [101]:
ss = sep.split(para)

In [102]:
for i, s in enumerate(ss, 1):
    print(i, ':', s, end='\n\n')

1 : When compared with the control group no significant associations were found for the NS-PEecl group after adjustment of confounding variables

2 :  For the S-PEecl group, antiβ2GP1 IgG (OR 16

3 : 91, 95% CI 3

4 : 71-77

5 : 06) was associated, as well as age, obesity, smoking and multiparity

6 :  Antiβ2GP1-domain I IgG were associated with aCL, antiβ2GP1 and aPS/PT IgG in the three groups

7 :  aPS/PT IgG were associated with aCL IgG, and aPS/PT IgM were associated with aCL and antiβ2GP1 IgM in the three groups CONCLUSION: S-PEecl is a distinct entity from NS-PEecl and is mainly associated with the presence of antiβ2GP1 IgG

8 :  Antiβ2GP1 domain I correlate with other aPL IgG tests, and aPS/PT may be promising in patients in which LA tests cannot be interpreted

9 : 
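A slightly less naive rule, still well short of a trained sentence tokenizer, is to split only where sentence-ending punctuation is followed by whitespace, which leaves decimals such as 16.91 intact. A sketch on a made-up string (`sample` and `better_sep` are illustrative names); abbreviations followed by a space, such as "e.g. ", would still be split incorrectly:

```python
import re

# Split on whitespace preceded by ., ! or ? (the lookbehind keeps the punctuation)
better_sep = re.compile(r'(?<=[.!?])\s+')

sample = "The OR was 16.91 (95% CI 3.71-77.06). It was significant! Was it?"
sentences = better_sep.split(sample)
for i, s in enumerate(sentences, 1):
    print(i, ':', s)
```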



### Using NLTK

In [103]:
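import nltk
# The tokenizers and taggers below need NLTK data packages; if one is missing,
# NLTK raises a LookupError naming the resource to fetch with nltk.download().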

In [104]:
ss_nltk = nltk.sent_tokenize(para)

In [105]:
for i, s in enumerate(ss_nltk, 1):
    print(i, ':', s, end='\n\n')

1 : When compared with the control group no significant associations were found for the NS-PEecl group after adjustment of confounding variables.

2 : For the S-PEecl group, antiβ2GP1 IgG (OR 16.91, 95% CI 3.71-77.06) was associated, as well as age, obesity, smoking and multiparity.

3 : Antiβ2GP1-domain I IgG were associated with aCL, antiβ2GP1 and aPS/PT IgG in the three groups.

4 : aPS/PT IgG were associated with aCL IgG, and aPS/PT IgM were associated with aCL and antiβ2GP1 IgM in the three groups CONCLUSION: S-PEecl is a distinct entity from NS-PEecl and is mainly associated with the presence of antiβ2GP1 IgG.

5 : Antiβ2GP1 domain I correlate with other aPL IgG tests, and aPS/PT may be promising in patients in which LA tests cannot be interpreted.



### Naive parsing of the second sentence into words

In [None]:
s = ss_nltk[1]
s

In [None]:
# remove punctuation and split on whitespace
table = dict.fromkeys(map(ord, string.punctuation))
s.translate(table).split()

### Using NLTK

In [None]:
text = nltk.word_tokenize(s)
text

### NLTK is a rich library for natural language processing

See http://www.nltk.org for details.

#### Tag tokens with part-of-speech labels

In [None]:
tagged_text = nltk.pos_tag(text)
tagged_text

In [None]:
s

### A simplistic way to pick up nouns

In [None]:
[w for w, t in tagged_text if t.startswith('N')]

String formatting
----

### Selection

In [None]:
import math

In [None]:
stuff = ('bun', 'shoe', ['bee', 'door'], 2, math.pi, 0.05)

In [None]:
'One: {}, Two {}'.format(*stuff)

In [None]:
'One: {0}, Two {1}'.format(*stuff)

In [None]:
'One: {1}, Two {1}'.format(*stuff)

In [None]:
'One: {0}, Two {2[1]}'.format(*stuff)

### Formatting

In [None]:
'One: {0:^10s}, Two {1:_>15s}'.format(*stuff)

In [None]:
'One: {3}, Two {4}'.format(*stuff)

In [None]:
'One: {3:+10d}, Two {4:.4f}'.format(*stuff)

In [None]:
'One: {3:04d}, Two {4:.4g}'.format(*stuff)

In [None]:
'One: {3:.4e}, Two {4:.4e}'.format(*stuff)

In [None]:
'One: {5:.2%}, Two {5:f}'.format(*stuff)
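The same format specifications also work inside f-strings (Python 3.6+), where the expression goes before the colon. A brief sketch reusing a few of the specs from the cells above; the variable names are illustrative:

```python
import math

name, count, ratio = 'bun', 2, 0.05

centered = f'{name:^10s}'     # centred in a field of width 10
padded = f'{count:04d}'       # zero-padded to width 4
rounded = f'{math.pi:.4f}'    # four digits after the decimal point
percent = f'{ratio:.2%}'      # scaled by 100 and shown with a % sign

print(centered, padded, rounded, percent, sep='|')
```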

### Old style formatting

In [None]:
'%s, %s, %a, %d, %.4f, %.2f' % stuff

### Formatting numpy arrays

In [None]:
import numpy as np

In [None]:
x = np.arange(1, 13).reshape(3,4)
x

In [None]:
np.set_printoptions(formatter={'int': lambda x: '%8.2f' % x})

In [None]:
x

In [None]:
np.set_printoptions()

In [None]:
x