Strings

Cleaning data for analysis often requires working with text: correcting typos, converting to standard nomenclature, and resolving ambiguous labels. In some fields, such as processing electronic medical records, information science, or recommendations based on user feedback, the text itself must be processed before analysis, for example by converting it to a bag of words.
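
As a rough illustration of the bag-of-words idea, the sketch below counts lowercase, whitespace-separated tokens with `collections.Counter`. The helper name `bag_of_words` and the sample sentence are made up for this example, and real pipelines usually also strip punctuation and drop common stop words.

```python
from collections import Counter


def bag_of_words(text):
    """Map each lowercase, whitespace-separated token to its count."""
    return Counter(text.lower().split())


# Hypothetical sample sentence, not part of the data used below
bag = bag_of_words("The cat saw the other cat")
print(bag["the"], bag["cat"], bag["saw"])
```

A `Counter` returns 0 for tokens it has never seen, so lookups such as `bag["dog"]` need no special handling.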

We will use a whimsical example to illustrate Python tools for munging text data using string methods and regular expressions. Finally, we will see how to format text data for reporting.

In [1]:
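# Get "Through the Looking Glass", caching a local copy
import requests

try:
    # Reuse the cached copy if it exists
    with open('looking_glass.txt') as f:
        text = f.read()
except IOError:
    # Otherwise download the text and cache it for next time
    url = 'http://www.gutenberg.org/cache/epub/12/pg12.txt'
    res = requests.get(url)
    res.raise_for_status()  # fail loudly on a bad HTTP response
    text = res.text
    with open('looking_glass.txt', 'w') as f:
        f.write(text)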

In [2]:
# Slice to get Jabberwocky

In [5]:
start = text.find('JABBERWOCKY')
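
One caveat with `str.find`: it returns -1 when the substring is absent, and a -1 used as a slice bound silently produces the wrong slice rather than an error. The sketch below wraps the lookup in a hypothetical helper, `slice_between`, that raises instead. The sample string is invented for illustration.

```python
def slice_between(text, start_marker, end_marker):
    """Return text from start_marker up to (not including) end_marker.

    str.find signals "not found" with -1, so each result is checked
    before it is used as a slice bound.
    """
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker {start_marker!r} not found")
    # The second argument makes the search begin at the start marker
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError(f"end marker {end_marker!r} not found")
    return text[start:end]


# Hypothetical sample text
sample = "preface JABBERWOCKY 'Twas brillig It seems very pretty"
print(repr(slice_between(sample, "JABBERWOCKY", "It seems")))
```

`str.index` is the built-in alternative: it behaves like `find` but raises `ValueError` itself when the substring is missing.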

In [6]:
text[start:start+2000]

"JABBERWOCKY\r\n\r\n     'Twas brillig, and the slithy toves\r\n      Did gyre and gimble in the wabe;\r\n     All mimsy were the borogoves,\r\n      And the mome raths outgrabe.\r\n\r\n     'Beware the Jabberwock, my son!\r\n      The jaws that bite, the claws that catch!\r\n     Beware the Jubjub bird, and shun\r\n      The frumious Bandersnatch!'\r\n\r\n     He took his vorpal sword in hand:\r\n      Long time the manxome foe he sought--\r\n     So rested he by the Tumtum tree,\r\n      And stood awhile in thought.\r\n\r\n     And as in uffish thought he stood,\r\n      The Jabberwock, with eyes of flame,\r\n     Came whiffling through the tulgey wood,\r\n      And burbled as it came!\r\n\r\n     One, two! One, two! And through and through\r\n      The vorpal blade went snicker-snack!\r\n     He left it dead, and with its head\r\n      He went galumphing back.\r\n\r\n     'And hast thou slain the Jabberwock?\r\n      Come to my arms, my beamish boy!\r\n     O frabjous day! Callooh! 

In [8]:
end = text.find('It seems very pretty', start)
poem = text[start:end]
poem

"JABBERWOCKY\r\n\r\n     'Twas brillig, and the slithy toves\r\n      Did gyre and gimble in the wabe;\r\n     All mimsy were the borogoves,\r\n      And the mome raths outgrabe.\r\n\r\n     'Beware the Jabberwock, my son!\r\n      The jaws that bite, the claws that catch!\r\n     Beware the Jubjub bird, and shun\r\n      The frumious Bandersnatch!'\r\n\r\n     He took his vorpal sword in hand:\r\n      Long time the manxome foe he sought--\r\n     So rested he by the Tumtum tree,\r\n      And stood awhile in thought.\r\n\r\n     And as in uffish thought he stood,\r\n      The Jabberwock, with eyes of flame,\r\n     Came whiffling through the tulgey wood,\r\n      And burbled as it came!\r\n\r\n     One, two! One, two! And through and through\r\n      The vorpal blade went snicker-snack!\r\n     He left it dead, and with its head\r\n      He went galumphing back.\r\n\r\n     'And hast thou slain the Jabberwock?\r\n      Come to my arms, my beamish boy!\r\n     O frabjous day! Callooh! 

In [9]:
print(poem)

JABBERWOCKY

     'Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe;
     All mimsy were the borogoves,
      And the mome raths outgrabe.

     'Beware the Jabberwock, my son!
      The jaws that bite, the claws that catch!
     Beware the Jubjub bird, and shun
      The frumious Bandersnatch!'

     He took his vorpal sword in hand:
      Long time the manxome foe he sought--
     So rested he by the Tumtum tree,
      And stood awhile in thought.

     And as in uffish thought he stood,
      The Jabberwock, with eyes of flame,
     Came whiffling through the tulgey wood,
      And burbled as it came!

     One, two! One, two! And through and through
      The vorpal blade went snicker-snack!
     He left it dead, and with its head
      He went galumphing back.

     'And hast thou slain the Jabberwock?
      Come to my arms, my beamish boy!
     O frabjous day! Callooh! Callay!'
      He chortled in his joy.

     'Twas brillig, and the slithy toves
      D

In [10]:
print(poem.title())

Jabberwocky

     'Twas Brillig, And The Slithy Toves
      Did Gyre And Gimble In The Wabe;
     All Mimsy Were The Borogoves,
      And The Mome Raths Outgrabe.

     'Beware The Jabberwock, My Son!
      The Jaws That Bite, The Claws That Catch!
     Beware The Jubjub Bird, And Shun
      The Frumious Bandersnatch!'

     He Took His Vorpal Sword In Hand:
      Long Time The Manxome Foe He Sought--
     So Rested He By The Tumtum Tree,
      And Stood Awhile In Thought.

     And As In Uffish Thought He Stood,
      The Jabberwock, With Eyes Of Flame,
     Came Whiffling Through The Tulgey Wood,
      And Burbled As It Came!

     One, Two! One, Two! And Through And Through
      The Vorpal Blade Went Snicker-Snack!
     He Left It Dead, And With Its Head
      He Went Galumphing Back.

     'And Hast Thou Slain The Jabberwock?
      Come To My Arms, My Beamish Boy!
     O Frabjous Day! Callooh! Callay!'
      He Chortled In His Joy.

     'Twas Brillig, And The Slithy Toves
      D
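
Note that `title()` starts a new word after any non-letter character, so an apostrophe inside a word produces a stray capital. The sketch below compares it with `string.capwords` on an invented line. Be aware that `capwords` splits on whitespace and rejoins with single spaces, so it would not preserve the indentation and line breaks of the poem.

```python
import string

# Hypothetical line containing apostrophes inside words
line = "he's the jabberwock's foe"

titled = line.title()             # capitalizes after every non-letter
capped = string.capwords(line)    # capitalizes only after whitespace

print(titled)
print(capped)
```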

In [12]:
poem.count('All')

2
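
Keep in mind that `count` is case-sensitive and counts substrings, not whole words. The sketch below, using an invented line, shows how the answer changes depending on case folding and on whether the text is split into words first.

```python
# Hypothetical line: "Callay" contains the substring "all"
line = "All mimsy were the borogoves and all was well Callay"

exact = line.count("All")                       # case-sensitive substring count
lower_sub = line.count("all")                   # matches "all" and the "all" in "Callay"
any_case = line.lower().count("all")            # also picks up the leading "All"
whole_words = line.lower().split().count("all") # whole-word matches only

print(exact, lower_sub, any_case, whole_words)
```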

In [13]:
print(poem.replace('the', 'XXX'))

JABBERWOCKY

     'Twas brillig, and XXX slithy toves
      Did gyre and gimble in XXX wabe;
     All mimsy were XXX borogoves,
      And XXX mome raths outgrabe.

     'Beware XXX Jabberwock, my son!
      The jaws that bite, XXX claws that catch!
     Beware XXX Jubjub bird, and shun
      The frumious Bandersnatch!'

     He took his vorpal sword in hand:
      Long time XXX manxome foe he sought--
     So rested he by XXX Tumtum tree,
      And stood awhile in thought.

     And as in uffish thought he stood,
      The Jabberwock, with eyes of flame,
     Came whiffling through XXX tulgey wood,
      And burbled as it came!

     One, two! One, two! And through and through
      The vorpal blade went snicker-snack!
     He left it dead, and with its head
      He went galumphing back.

     'And hast thou slain XXX Jabberwock?
      Come to my arms, my beamish boy!
     O frabjous day! Callooh! Callay!'
      He chortled in his joy.

     'Twas brillig, and XXX slithy toves
      D
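
In the output above, the capitalized "The" is left alone because `replace` is case-sensitive. It is also substring-based, so it would change words that merely contain "the". One possible alternative, sketched below on an invented line, is `re.sub` with word boundaries (`\b`) and the `re.IGNORECASE` flag.

```python
import re

# Hypothetical line: "gathered" and "there" both contain "the"
line = "The toves gathered there, by the wabe"

# Plain replace: case-sensitive and matches inside longer words
by_replace = line.replace("the", "XXX")

# Regular expression: whole words only, any capitalization
by_regex = re.sub(r"\bthe\b", "XXX", line, flags=re.IGNORECASE)

print(by_replace)
print(by_regex)
```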

In [14]:
# Find palindromic words in poem if any
poem = poem.lower()

In [15]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [16]:
# Delete punctuation: a translation table that maps a character to None removes it
poem = poem.translate(dict.fromkeys(map(ord, string.punctuation)))
poem

'jabberwocky\r\n\r\n     twas brillig and the slithy toves\r\n      did gyre and gimble in the wabe\r\n     all mimsy were the borogoves\r\n      and the mome raths outgrabe\r\n\r\n     beware the jabberwock my son\r\n      the jaws that bite the claws that catch\r\n     beware the jubjub bird and shun\r\n      the frumious bandersnatch\r\n\r\n     he took his vorpal sword in hand\r\n      long time the manxome foe he sought\r\n     so rested he by the tumtum tree\r\n      and stood awhile in thought\r\n\r\n     and as in uffish thought he stood\r\n      the jabberwock with eyes of flame\r\n     came whiffling through the tulgey wood\r\n      and burbled as it came\r\n\r\n     one two one two and through and through\r\n      the vorpal blade went snickersnack\r\n     he left it dead and with its head\r\n      he went galumphing back\r\n\r\n     and hast thou slain the jabberwock\r\n      come to my arms my beamish boy\r\n     o frabjous day callooh callay\r\n      he chortled in his jo
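
The same translation table can also be built with `str.maketrans`, whose third argument lists characters to delete. Because deleting the hyphen fuses "snicker-snack" into "snickersnack" (as in the output above), the sketch below also shows a variant that maps punctuation to spaces instead. The variable names are made up for this example.

```python
import string

# Table that deletes every punctuation character
delete_table = str.maketrans("", "", string.punctuation)

# Table that turns every punctuation character into a space
space_table = str.maketrans(string.punctuation, " " * len(string.punctuation))

line = "The vorpal blade went snicker-snack!"

deleted = line.translate(delete_table)
spaced_words = line.translate(space_table).split()

print(deleted)
print(spaced_words)
```

Which variant is preferable depends on whether hyphenated compounds should count as one word or two.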

In [24]:
words = poem.split()
words[:30]

['jabberwocky',
 'twas',
 'brillig',
 'and',
 'the',
 'slithy',
 'toves',
 'did',
 'gyre',
 'and',
 'gimble',
 'in',
 'the',
 'wabe',
 'all',
 'mimsy',
 'were',
 'the',
 'borogoves',
 'and',
 'the',
 'mome',
 'raths',
 'outgrabe',
 'beware',
 'the',
 'jabberwock',
 'my',
 'son',
 'the']

In [25]:
def is_palindrome(word):
    return word == word[::-1]
# Palindromes yet again

In [26]:
{word for word in words if is_palindrome(word)}

{'did', 'o'}
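
Single-letter words such as 'o' are trivially palindromic. If those are not of interest, one option is to add a length filter, as in the sketch below. The word list is invented for illustration, and the minimum length of 2 is an arbitrary choice.

```python
def is_palindrome(word):
    """A word is a palindrome if it reads the same backwards."""
    return word == word[::-1]


# Hypothetical word list
words = ["did", "o", "gyre", "noon", "a", "did"]

# The set removes duplicates; sorted() gives a stable order for reporting
palindromes = sorted({w for w in words if len(w) > 1 and is_palindrome(w)})
print(palindromes)
```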