# String Processing

## Objectives

1. Use various string methods to transform a body of text.
2. Understand special (escaped) characters.
3. Read the contents of a text file into Python.
4. Clean up and answer questions about a body of text.

## Useful string methods

* Use `lower` and `upper` to change case.
* Use `strip`, `rstrip`, and `lstrip` to strip whitespace.
* Use `replace` to make changes to the text.

## Changing case

* Strings are case-sensitive.
* Use `lower` to normalize case before comparing strings.

In [1]:
"Hello".lower()

'hello'

In [2]:
name = "Todd Iverson"
name.lower()

'todd iverson'
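As a minimal sketch of why normalizing case matters, the comparison below uses two illustrative spellings of the same name (the variable names `first` and `second` are not from the original):

```python
# Two spellings of the same name that differ only in case.
first = "Todd Iverson"
second = "TODD IVERSON"

# A direct comparison is case-sensitive, so these are not equal.
same_exact = first == second

# Lowercasing both sides removes case from the comparison.
same_ignoring_case = first.lower() == second.lower()

print(same_exact)
print(same_ignoring_case)
```
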

### <font color="red"> Exercise 1 </font>

Import and explore the string module.

## Escaped Characters

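* Python uses **escaped characters** to write whitespace and other special characters inside a string literal.
* All escaped characters start with `\`.
* Common escaped characters include
    * "\n" is *newline*
    * "\t" is *tab*
    * '\'' and "\"" are literal single and double quotes
    * "\\" is a literal backslash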

In [3]:
from string import whitespace
whitespace

' \t\n\r\x0b\x0c'
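The quote and backslash escapes from the list above can be sketched the same way. The `message` and `path` values are made up for illustration:

```python
# Each escape sequence below is a single character once Python reads it.
single_quote = '\''
double_quote = "\""
backslash = "\\"

print(len(single_quote), len(double_quote), len(backslash))

# Escaped quotes let a string contain its own delimiter.
message = 'It\'s a \"quoted\" word'
print(message)

# A Windows-style path needs each backslash doubled (hypothetical path).
path = "C:\\temp\\notes.txt"
print(path)
```
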

## Whitespace - Evaluating versus printing

In [4]:
"\t"

'\t'

In [5]:
print('\t')

	


In [6]:
a_string = "This string\nhas\nmultiple\nlines"
a_string

'This string\nhas\nmultiple\nlines'

In [7]:
print(a_string)

This string
has
multiple
lines


## Whitespace counts

In [8]:
len("\n")

1

In [9]:
len("\t")

1

In [10]:
len(" ")

1

## Removing whitespace

Since whitespace counts toward string length, we frequently strip it from the ends of a string.

In [2]:
raw_name = "    Todd\n\t\n"
len(raw_name)

11

In [3]:
raw_name.strip()

'Todd'

In [4]:
raw_name.lstrip()

'Todd\n\t\n'

In [5]:
raw_name.rstrip()

'    Todd'
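The strip methods also accept an optional argument listing the characters to remove instead of whitespace. A small sketch, using a made-up `raw_title` value:

```python
# Hypothetical raw value padded with asterisks and spaces.
raw_title = "** Chapter 1 **"

# With an argument, strip removes any of the listed characters
# from both ends, stopping at the first character not in the set.
title = raw_title.strip("* ")
print(title)

# Characters in the middle are never touched.
inner_kept = "xxhixxthere".strip("x")
print(inner_kept)
```
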

## Chaining methods in one expression

* You can chain methods together using dot notation
* Think about the type that each part of the expression returns

In [6]:
raw_name.strip().lower()

'todd'

<img src="img/chaining_methods.png">
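One way to check the types in a chain is to break it into named steps. This is only a sketch; the intermediate names `stripped`, `lowered`, and `d_count` are illustrative:

```python
raw_name = "    Todd\n\t\n"

# Each method call returns a new string, so the next method
# in the chain is applied to that intermediate result.
stripped = raw_name.strip()
lowered = stripped.lower()
print(type(stripped), type(lowered))

# A chain can end in a method that returns a different type;
# here count returns an int, so no string method can follow it.
d_count = raw_name.strip().lower().count("d")
print(type(d_count), d_count)
```
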

## Changing a string with `replace`

* `replace` swaps every occurrence of a substring (one or more characters) for another string

In [19]:
"Mississippi".replace("i", "I")

'MIssIssIppI'

In [7]:
"Mississippi".replace("ss", "")

'Miiippi'
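Two details are worth a quick sketch: `replace` returns a new string rather than changing the original, and an optional third argument limits the number of replacements.

```python
state = "Mississippi"

# replace returns a new string; the original is unchanged
# because strings are immutable.
upper_i = state.replace("i", "I")
print(state)
print(upper_i)

# An optional third argument limits how many occurrences are replaced,
# counting from the left.
first_only = state.replace("ss", "SS", 1)
print(first_only)
```
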

## Inserting data in a string

* Use `format` to insert data into a string template
    * Identify insertion point with `{}` or `{0}`, `{1}`, etc.
   
**Source:** https://docs.python.org/3.4/library/string.html#formatexamples

In [8]:
'{0}, {1}, {2}'.format('a', 'b', 'c')

'a, b, c'

In [10]:
'{2}, {1}, {0}'.format('a', 'b', 'c')

'c, b, a'

In [12]:
'{0}{1}{0}'.format('abra', 'cad')   # arguments' indices can be repeated

'abracadabra'
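Insertion points can also be named or left empty. In the sketch below, the field names `student` and `score` are illustrative choices:

```python
# Field names can be keywords instead of positions.
# The names 'student' and 'score' are illustrative.
template = "{student} scored {score} points"
line = template.format(student="Todd", score=19)
print(line)

# Empty braces are filled in order, left to right.
pair = "{} and {}".format("split", "join")
print(pair)
```
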

## `format` provides advanced formatting

In [22]:
# show sign always
'{0:+f}; {1:+f}'.format(3.14, -3.14)  

'+3.140000; -3.140000'

In [23]:
# show space for positive numbers
'{0: f}; {1: f}'.format(3.14, -3.14) 

' 3.140000; -3.140000'

In [28]:
# Specify the number of decimals with .2f
# A leading - shows a sign only for negative numbers (the default).
'{0:-.2f}; {1:-.2f}'.format(3.14, -3.14) 

'3.14; -3.14'

In [29]:
# Insert commas in long ints
'{:,}'.format(1234567890)

'1,234,567,890'

In [31]:
# Express as percent
points = 19
total = 22
'Correct answers: {:.2%}'.format(points/total)

'Correct answers: 86.36%'

In [32]:
# format also supports other bases: hex, octal, and binary
"int: {0:d};  hex: {0:x};  oct: {0:o};  bin: {0:b}".format(42)

'int: 42;  hex: 2a;  oct: 52;  bin: 101010'
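The same format specification can set a field width and alignment, which helps when lining up columns. A minimal sketch, using made-up `rows` data:

```python
# Hypothetical (name, score) rows for a small report.
rows = [("Ann", 0.75), ("Bartholomew", 0.8), ("Cy", 1.0)]

# <12 left-aligns in 12 columns; >8.1% right-aligns a percentage
# with one decimal in 8 columns.
lines = ["{:<12}|{:>8.1%}".format(name, score) for name, score in rows]
for line in lines:
    print(line)
```

Every line comes out the same width, so the `|` separators line up when printed.
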

## Splitting strings

* `split` *cuts* a string into parts
* It returns a list of strings
* The separator (a character or sequence) is removed from the result
    * With no argument, `split` cuts on runs of whitespace


In [24]:
state = "Mississippi"
state.split("i")

['M', 'ss', 'ss', 'pp', '']

In [35]:
split_str = state.split("ss")
split_str

['Mi', 'i', 'ippi']

In [37]:
quote = '''I know something ain't right.
            Sweetie, we're crooks. If everything were right, we'd be in jail.'''
quote.lower().split()

['i',
 'know',
 'something',
 "ain't",
 'right.',
 'sweetie,',
 "we're",
 'crooks.',
 'if',
 'everything',
 'were',
 'right,',
 "we'd",
 'be',
 'in',
 'jail.']
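Splitting with no argument is not the same as splitting on a single space. The sketch below contrasts the two on a made-up string and also shows the optional `maxsplit` argument:

```python
spaced = "a  b \n c"

# No argument: runs of any whitespace act as one separator,
# and leading/trailing whitespace produces no empty strings.
on_whitespace = spaced.split()
print(on_whitespace)

# An explicit " " splits at every single space, so consecutive
# spaces yield empty strings and the newline is kept.
on_space = spaced.split(" ")
print(on_space)

# maxsplit limits the number of cuts, counting from the left.
one_cut = "key=a=b".split("=", 1)
print(one_cut)
```
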

## Joining strings

* `join` is the reverse of `split`
    * It joins a list of strings into one string
    * It *glues* the strings together, using the base string as the separator

In [27]:
split_str

['Mi', 'i', 'ippi']

In [28]:
"".join(split_str)

'Miiippi'

In [29]:
"-".join(split_str)

'Mi-i-ippi'

In [40]:
"***".join(split_str)

'Mi***i***ippi'
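`join` only works on strings, so other values need converting first. A short sketch with an illustrative `numbers` list, plus a check that `split` and `join` undo each other:

```python
# Hypothetical list of integers to display as one string.
numbers = [3, 1, 4, 1, 5]

# join only accepts strings, so convert each item first.
# Passing the integers directly raises a TypeError.
digits = ", ".join([str(n) for n in numbers])
print(digits)

# split and join with the same separator undo each other.
state = "Mississippi"
round_trip = "ss".join(state.split("ss"))
print(round_trip == state)
```
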

## Using list comprehensions and join

* List comprehension on a string processes each character
* Join the strings back together after processing

In [42]:
[ch for ch in "Mississippi"]

['M', 'i', 's', 's', 'i', 's', 's', 'i', 'p', 'p', 'i']

In [47]:
no_vowels = [ch 
             for ch in "Mississippi".lower() 
             if ch not in "aeiou"]
no_vowels

['m', 's', 's', 's', 's', 'p', 'p']

In [49]:
no_vowels = "".join([ch 
                     for ch in "Mississippi".lower() 
                     if ch not in "aeiou"])
no_vowels

'msssspp'

In [51]:
"".join([2*ch for ch in "Mississippi"])

'MMiissssiissssiippppii'

## Processing strings with split and join

<img src="img/building_a_string_diagram.png">

## Example: Zen of Python - Remove Punctuation

In [58]:
zen_of_python = '''The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!'''

In [64]:
from string import punctuation
zen_list_no_punc = [ch for ch in zen_of_python if ch not in punctuation]
zen_list_no_punc[:5]

['T', 'h', 'e', ' ', 'Z']

In [65]:
zen_string_no_punc = ''.join(zen_list_no_punc)
zen_string_no_punc[:5]

'The Z'

In [66]:
print(zen_string_no_punc)

The Zen of Python by Tim Peters
Beautiful is better than ugly
Explicit is better than implicit
Simple is better than complex
Complex is better than complicated
Flat is better than nested
Sparse is better than dense
Readability counts
Special cases arent special enough to break the rules
Although practicality beats purity
Errors should never pass silently
Unless explicitly silenced
In the face of ambiguity refuse the temptation to guess
There should be one and preferably only one obvious way to do it
Although that way may not be obvious at first unless youre Dutch
Now is better than never
Although never is often better than right now
If the implementation is hard to explain its a bad idea
If the implementation is easy to explain it may be a good idea
Namespaces are one honking great idea  lets do more of those
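A different technique from the comprehension above is `str.translate` with a deletion table from `str.maketrans`. It is sketched here on a single line of the text so the example stands alone:

```python
from string import punctuation

# Short sample so the example stands alone.
sample = "Special cases aren't special enough to break the rules."

# maketrans with three arguments builds a table that deletes
# every character listed in the third argument.
delete_punc = str.maketrans("", "", punctuation)
translated = sample.translate(delete_punc)

# The comprehension approach gives the same result.
joined = "".join([ch for ch in sample if ch not in punctuation])
print(translated)
print(translated == joined)
```
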


## Computing statistics

1. Clean up the text
    1. Remove punctuation and other items you want to ignore
2. Use `split` and a list comprehension to map each word to a number (for example, its length)
3. Use reduction functions like `sum`, `len`, `all`, `any` to compute a statistic

In [68]:
quote = '''I know something ain't right.
   Sweetie, we're crooks. If everything were right, we'd be in jail.'''
characters_no_punc = [ch.lower() for ch in quote if ch not in punctuation]
characters_no_punc[:10]

['i', ' ', 'k', 'n', 'o', 'w', ' ', 's', 'o', 'm']

In [70]:
str_no_punc = "".join(characters_no_punc)
str_no_punc

'i know something aint right\n   sweetie were crooks if everything were right wed be in jail'

In [78]:
# Combine in one step
s_no_punc = "".join([ch.lower() for ch in quote if ch not in punctuation])
s_no_punc

'i know something aint right\n   sweetie were crooks if everything were right wed be in jail'

In [79]:
split_lower = s_no_punc.lower().split()
split_lower

['i',
 'know',
 'something',
 'aint',
 'right',
 'sweetie',
 'were',
 'crooks',
 'if',
 'everything',
 'were',
 'right',
 'wed',
 'be',
 'in',
 'jail']

In [80]:
# Map each word to its length
lengths = [len(w) 
           for w in split_lower]
lengths

[1, 4, 9, 4, 5, 7, 4, 6, 2, 10, 4, 5, 3, 2, 2, 4]

In [82]:
mean_word_length = sum(lengths)/len(lengths)
mean_word_length

4.5
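The mean uses `sum` and `len`. The other reductions named in the steps, `any` and `all`, follow the same pattern on a list of booleans. A minimal sketch on a short illustrative word list:

```python
# Hypothetical cleaned word list, like the output of split above.
words = ["i", "know", "something", "aint", "right"]

# any: is there at least one word longer than 8 characters?
has_long_word = any([len(w) > 8 for w in words])

# all: is every word shorter than 8 characters?
all_short = all([len(w) < 8 for w in words])

print(has_long_word, all_short)
```
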

### <font color="red"> Exercise 2 </font>

Use the expressions from the last example to create the functions `remove_punc`, `split_lower`, `word_lengths`, and `mean`.

### <font color="red"> Exercise 3 </font>

Compose the functions from Exercise 2 into one function that cleans, splits, and computes the mean word length.

### <font color="red"> Exercise 4 </font>

Compute the length of the longest word in the Zen of Python.

### <font color="red"> Exercise 5 </font>

Count the number of words in the Zen of Python that have at least 5 characters.