# Class 21: Introduction to algorithms for strings

Many of the operations we used on lists, such as indexing and slicing, work on strings as well.

In [1]:
p = 'a man a plan a canal panama'

In [2]:
p[0] # Indexing to retrieve first character of string

'a'

In [3]:
p[-1] # Reverse indexing to retrieve last character

'a'

In [4]:
p[::2] # Advanced slicing to skip every other character

'amnapa  aa aaa'

In [5]:
p[::-1] # Reversing a string

'amanap lanac a nalp a nam a'

In [65]:
gw = '(Lived: 67 years)'
gw[8:10]

'67'

In [69]:
print(r'(Lived: \d+ years)')

(Lived: \d+ years)


In [7]:
p[0] = 'P' # Note that string mutation via indexing doesn't work and raises a TypeError

TypeError: 'str' object does not support item assignment

In [10]:
list(p)[0] = 'P' # This only changes a temporary list copy of the characters; p itself is untouched
p

'a man a plan a canal panama'
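
Because strings are immutable, the usual workaround is to build a new string rather than change the old one. A minimal sketch of two common ways to do that (the names `capitalized_p`, `chars`, and `rejoined_p` are only illustrative):

```python
p = 'a man a plan a canal panama'

# Option 1: concatenate the replacement character with a slice of the original
capitalized_p = 'P' + p[1:]

# Option 2: convert to a list, mutate the list, then join it back into a string
chars = list(p)
chars[0] = 'P'
rejoined_p = ''.join(chars)

print(capitalized_p)  # P man a plan a canal panama
print(p)              # the original string is unchanged
```

Both options create a new string object and leave `p` as it was.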

Some string-specific methods. These are going to be ***super*** helpful throughout the rest of your Python career!

In [11]:
p.upper() #make all the characters upper-case

'A MAN A PLAN A CANAL PANAMA'

In [12]:
p.replace('canal','railroad') # replace a specific substring with another string

'a man a plan a railroad panama'

In [13]:
split_p = p.split(' ') # splits a string on a character, returning a list of strings
split_p

['a', 'man', 'a', 'plan', 'a', 'canal', 'panama']

In [14]:
'! '.join(split_p) # joins a list of strings together (inside the join function) by another string (outside)

'a! man! a! plan! a! canal! panama'

In [25]:
p / 2 # Arithmetic operators like division are not defined for strings and raise a TypeError

TypeError: unsupported operand type(s) for /: 'str' and 'int'

In [26]:
excited_p = '\n' + '! '.join(split_p) + '\n' # Special characters like newlines are written as escape sequences starting with a backslash.
excited_p

'\na! man! a! plan! a! canal! panama\n'

In [27]:
print(excited_p) # Printing the string makes the newlines actually generate new lines.


a! man! a! plan! a! canal! panama



In [28]:
excited_p.strip() # Remove whitespace characters like line breaks from the start and end of the string

'a! man! a! plan! a! canal! panama'

## Encoding and decoding strings

This is adapted from [Ramalho (2015) Chapter 4](https://learn.colorado.edu/d2l/le/content/190526/viewContent/2910062/View) and [Lutz (2013) Chapter 7](https://learn.colorado.edu/d2l/le/content/190526/viewContent/2910061/View)

What are the characters in an ASCII string?

In [29]:
import string

string.ascii_letters + string.digits + string.punctuation

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Every one of them has an underlying integer code.

In [30]:
ord('B')

66

We can also get the characters residing at a location in the codebook.

In [31]:
chr(66)

'B'
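
As a small illustration of how `ord` and `chr` mirror each other, the sketch below converts a word to its integer codes, shifts each code by one, and converts back. The word `'Bee'` and the shift of one are arbitrary choices for the example.

```python
word = 'Bee'

# Look up the integer code for every character
codes = [ord(ch) for ch in word]
print(codes)  # [66, 101, 101]

# Shift each code by one and map the integers back to characters
shifted = ''.join(chr(code + 1) for code in codes)
print(shifted)  # Cff

# chr undoes ord, so the round trip recovers the original word
round_trip = ''.join(chr(code) for code in codes)
```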

What happens when we have non-ASCII characters?

![](https://www.wired.com/wp-content/uploads/2017/01/giphy-1.gif)

In [32]:
s = 'Beyoncé'

How many characters are in it?

In [33]:
len(s)

7

Convert this string to a bytes encoding.

In [34]:
b = s.encode('utf8')
b

b'Beyonc\xc3\xa9'

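How many bytes are in the encoded version? The count is one higher than the number of characters because the é takes two bytes in UTF-8.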

In [35]:
len(b)

8

Convert back to unicode from bytes using the UTF-8 encoding.

In [36]:
b.decode('utf8')

'Beyoncé'
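
To see where the extra byte comes from, this sketch encodes each character on its own and counts the bytes. The é is written as the escape `\u00e9` only to make sure the example uses the single precomposed code point.

```python
s = 'Beyonc\u00e9'  # 'Beyoncé' with é as one code point

# Encode character by character and report how many bytes each one needs
for ch in s:
    encoded = ch.encode('utf8')
    print(ch, ord(ch), len(encoded), encoded)

byte_lengths = [len(ch.encode('utf8')) for ch in s]
print(byte_lengths)       # [1, 1, 1, 1, 1, 1, 2]
print(sum(byte_lengths))  # 8, the same as len(s.encode('utf8'))
```

The ASCII letters each fit in one byte, while the é (code point 233) needs two.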

There are many, *many* kinds of [character encodings](https://en.wikipedia.org/wiki/Character_encoding) for representing non-ASCII text. [This cartoon](https://xkcd.com/927/) pretty much explains why there are so many standards rather than a single standard:

* [Latin-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1): the basis for many encodings
* [CP-1252](https://en.wikipedia.org/wiki/Windows-1252): a common default encoding in Microsoft products similar to Latin-1
* [UTF-8](https://en.wikipedia.org/wiki/UTF-8): one of the most widely adopted and compatible - *use it wherever possible*
* [CP-437](https://en.wikipedia.org/wiki/Code_page_437): used by the original IBM PC (predates latin1) but this old zombie is still lurking
* [GB-2312](https://en.wikipedia.org/wiki/GB_2312): a legacy standard for simplified Chinese characters that also covers Japanese kana and the Greek & Cyrillic alphabets
* [UTF-16](https://en.wikipedia.org/wiki/UTF-16): treats everyone equally poorly, here there also be emojis

For more on why Unicode is what it is, see the talk by [Ned Batchelder](https://nedbatchelder.com/text/unipain.html), the tutorial by [Esther Nam and Travis Fischer](https://www.slideshare.net/fischertrav/character-encoding-unicode-how-to-with-dignity-33352863), or [this Unicode tutorial](https://docs.python.org/3.5/howto/unicode.html) in the docs.

In [37]:
for codec in ['latin1','utf8','cp437','gb2312','utf16']:
    print(codec.rjust(10),s.encode(codec), sep=' = ')

    latin1 = b'Beyonc\xe9'
      utf8 = b'Beyonc\xc3\xa9'
     cp437 = b'Beyonc\x82'
    gb2312 = b'Beyonc\xa8\xa6'
     utf16 = b'\xff\xfeB\x00e\x00y\x00o\x00n\x00c\x00\xe9\x00'


Watch how quickly things can go wrong.

In [38]:
b2 = b'Montr\xe9al'

for codec in ['cp1252','iso8859_7','koi8_r','utf8']:
    print(codec.rjust(10),b2.decode(codec),sep=' = ')

    cp1252 = Montréal
 iso8859_7 = Montrιal
    koi8_r = MontrИal


UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte

You can also chain encoding/decodings together to translate between them.

In [None]:
b2.decode('cp1252').encode('utf8')

There are different ways to handle these encoding/decoding errors.

In [None]:
for error_handling in ['ignore','replace']:
    print(error_handling,b2.decode('utf8',errors=error_handling),sep='\t')
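
For reference, here is a self-contained sketch of what the transcoding step and the two error handlers do to the same byte string. The variable names `transcoded`, `ignored`, and `replaced` are introduced only for this example.

```python
b2 = b'Montr\xe9al'

# Transcode: interpret the bytes as CP-1252, then re-encode the text as UTF-8
transcoded = b2.decode('cp1252').encode('utf8')
print(transcoded)  # b'Montr\xc3\xa9al'

# 'ignore' silently drops the byte that is not valid UTF-8
ignored = b2.decode('utf8', errors='ignore')

# 'replace' substitutes U+FFFD, the Unicode replacement character
replaced = b2.decode('utf8', errors='replace')

print(ignored, replaced, sep='\t')
```

Note that `'ignore'` loses information without any warning, so `'replace'` is usually the safer choice when some damage has to be tolerated.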

How do you discover the proper encoding given an observed byte sequence? **You can't.** But you can make some informed guesses by using a library like [`chardet`](https://github.com/chardet/chardet) to find clues based on relative frequencies and presence of byte-order marks.
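
The sketch below is not how `chardet` works; it is a much cruder standard-library illustration of why guessing is hard. The hypothetical helper `candidate_decodings` simply tries a list of codecs (chosen arbitrarily here) and keeps every one that decodes without raising an error.

```python
def candidate_decodings(raw, codecs=('utf8', 'cp1252', 'iso8859_7', 'koi8_r')):
    """Return {codec: text} for every codec that decodes raw without error."""
    results = {}
    for codec in codecs:
        try:
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            # This codec cannot be the right one for these bytes
            continue
    return results

guesses = candidate_decodings(b'Montr\xe9al')
for codec, text in guesses.items():
    print(codec.rjust(10), text, sep=' = ')
```

Only UTF-8 is ruled out. The other three codecs all decode "successfully", so choosing among them takes extra evidence such as character frequencies or knowledge of where the data came from.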

![](https://image.slidesharecdn.com/unicodeandcharacterencoding-140410003026-phpapp01/95/character-encoding-unicode-how-to-with-dignity-1-638.jpg?cb=1397089926)

When you're reading and writing files, it's important to not rely on default system encodings but to declare encodings explicitly.

Create a file called "queen.txt", open it in write mode by passing "w" to `open`, and specify an encoding to use, like "utf8". Then write the single text string "Beyoncé" to the file.

Then read the file back in.

In [41]:
with open('queen.txt','w',encoding='utf8') as f:
    f.write(s)

# Note the lack of an encoding declaration in the open function, it uses the system's default instead
open('queen.txt','r').read()

'Beyoncé'

Doing this properly.

In [42]:
open('queen.txt','r',encoding='utf8').read()

'Beyoncé'
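
To see why the explicit declaration matters, here is a small sketch that reads the same file back with the wrong codec. It writes to a temporary directory (an assumption made to keep the example self-contained) and uses CP-1252 to stand in for a mismatched system default.

```python
import os
import tempfile

s = 'Beyonc\u00e9'  # 'Beyoncé' with é as one code point

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'queen.txt')

    # Write the text as UTF-8
    with open(path, 'w', encoding='utf8') as f:
        f.write(s)

    # Read it back with the matching encoding
    with open(path, 'r', encoding='utf8') as f:
        right = f.read()

    # Read it back with a mismatched encoding
    with open(path, 'r', encoding='cp1252') as f:
        wrong = f.read()

print(right)  # Beyoncé
print(wrong)  # BeyoncÃ©
```

The mismatched read raises no error. It quietly turns the two UTF-8 bytes of é into two unrelated characters, which is how garbled text ends up in datasets.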

What is your system's default? More like what *are* the defaults.

In [43]:
import sys, locale

expressions = """
locale.getpreferredencoding()
my_file.encoding
sys.stdout.encoding
sys.stdin.encoding
sys.stderr.encoding
sys.getdefaultencoding()
sys.getfilesystemencoding()
"""

my_file = open('dummy', 'w')

for expression in expressions.split():
    value = eval(expression)
    print(expression.rjust(30), '=', repr(value))

 locale.getpreferredencoding() = 'UTF-8'
              my_file.encoding = 'UTF-8'
           sys.stdout.encoding = 'UTF-8'
            sys.stdin.encoding = 'UTF-8'
           sys.stderr.encoding = 'UTF-8'
      sys.getdefaultencoding() = 'utf-8'
   sys.getfilesystemencoding() = 'utf-8'


We're still not done with string encoding madness. Unicode has code points that modify other characters in the sequence, such as combining accents. Read about how to deal with these kinds of problems using the `normalize` function in the [`unicodedata`](https://docs.python.org/3.5/library/unicodedata.html) library and in [Ramalho (2015) Chapter 4](https://learn.colorado.edu/d2l/le/content/190526/viewContent/2910062/View).

Interestingly, exploiting these characters is how to make [glitch art](https://en.wikipedia.org/wiki/Glitch_art).

In [46]:
s1 = 'Beyoncé'
s2 = 'B\u0301e\u0301y\u0301o\u0301n\u0301c\u0301e\u0301\u0301\u0301\u0301'

print(s1,len(s1))
print(s2,len(s2))

Beyoncé 7
B́éýóńćé́́́ 17


**tl;dr**: Make sure you decode bytestrings into strings using the proper character set, and encode with UTF-8 wherever possible.

## Edit distance

How similar are two strings? One way to measure this is the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance), which is the minimum number of single-character edits (insertions, deletions, substitutions) needed to change one word into the other.

* **k**itt**e**n --> **s**itt**i**n**g**

Most algorithms compute a matrix of cumulative distances between every prefix of one string and every prefix of the other as we move from character to character. We can visualize this in the table below by following the highlighted cells along the diagonal, which increase by one every time a difference is observed. Before the operation starts, the distance is 0. The k and s are different, so the value increments to one; the e and i are different, so the value increments again; and finally "sitting" is longer than "kitten", so inserting the g increments it a third time.

|       |   | k | i | t | t | e | n |
| ----- | --- | --- | --- | --- | --- | --- | --- |
|       | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| **s** | 1 | ***1*** | 2 | 3 | 4 | 5 | 6 |
| **i** | 2 | 2 | ***1*** | 2 | 3 | 4 | 5 |
| **t** | 3 | 3 | 2 | ***1*** | 2 | 3 | 4 |
| **t** | 4 | 4 | 3 | 2 | ***1*** | 2 | 3 |
| **i** | 5 | 5 | 4 | 3 | 2 | ***2*** | 3 |
| **n** | 6 | 6 | 5 | 4 | 3 | 3 | ***2*** |
| **g** | 7 | 7 | 6 | 5 | 4 | 4 | ***3*** |
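
A table like this can be generated with a short sketch that keeps the whole matrix in memory instead of just two rows. The function name `levenshtein_matrix` and its argument names are made up for this example; columns follow the first word and rows follow the second, as in the table.

```python
def levenshtein_matrix(source, target):
    """Return the full matrix of edit distances between all prefixes."""
    rows, cols = len(target) + 1, len(source) + 1
    matrix = [[0] * cols for _ in range(rows)]

    # Turning an empty prefix into a prefix of length n takes n insertions
    for col in range(cols):
        matrix[0][col] = col
    for row in range(rows):
        matrix[row][0] = row

    for row in range(1, rows):
        for col in range(1, cols):
            cost = int(source[col - 1] != target[row - 1])
            matrix[row][col] = min(
                matrix[row - 1][col] + 1,         # edit coming from the cell above
                matrix[row][col - 1] + 1,         # edit coming from the cell to the left
                matrix[row - 1][col - 1] + cost,  # match or substitution on the diagonal
            )
    return matrix

matrix = levenshtein_matrix('kitten', 'sitting')
for row in matrix:
    print(row)
```

The bottom-right cell, `matrix[-1][-1]`, holds the distance between the two complete words.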

In [48]:
# From: https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python

def levenshtein(s1, s2):
    if len(s1) < len(s2):
        return levenshtein(s2, s1)

    # len(s1) >= len(s2)
    if len(s2) == 0:
        return len(s1)

    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1 # j+1 instead of j since previous_row and current_row are one character longer
            deletions = current_row[j] + 1       # than s2
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    
    return previous_row[-1]

In [49]:
levenshtein('kitten','sitting')

3

In [50]:
levenshtein('high','bye')

4

In [51]:
levenshtein('football','soccer')

7

## Longest common substring

If you're working in a domain like genetics, you want to find sequences of data that are similar to each other since these likely encode similar genes or proteins. Like the edit distance algorithm above, this algorithm also constructs a similarity matrix.

In [53]:
# https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_substring

def longest_common_substring(s1, s2):
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]

In [57]:
import numpy as np

In [54]:
s1 = 'ProgrammingBuildsStrongBones'
s2 = 'BuildProgramsForStrongBones'

longest_common_substring(s1,s2)

'StrongBones'
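
As a cross-check, the standard library's `difflib.SequenceMatcher` can find the same substring. This is a sketch of an alternative route to the answer, not the algorithm used above; `autojunk=False` is passed so that no characters are skipped by the matcher's heuristics.

```python
from difflib import SequenceMatcher

s1 = 'ProgrammingBuildsStrongBones'
s2 = 'BuildProgramsForStrongBones'

matcher = SequenceMatcher(None, s1, s2, autojunk=False)

# Search the full range of both strings for the longest matching block
match = matcher.find_longest_match(0, len(s1), 0, len(s2))

# match.a is the start position in s1, match.size the length of the block
common = s1[match.a: match.a + match.size]
print(common)  # StrongBones
```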

We'll dive deeper into how each algorithm works in the lab assignments at the end of the week.