# Week 6 : Lecture A 
 ## Data structures: Strings
 ##### CS1P - University of Glasgow - John H. Williamson - 2017/2018 Python 3.x


## Strings
We've already used strings extensively. But they are so useful for so many tasks, that we're going to spend some time covering interesting ways in which they can be used. And we'll see *regular expressions*, which are a **very** powerful way to manipulate strings.

### Text
Much of what we do with computer systems is to do with text. It's so prevalent that it seems almost silly to point this out.

Strings are the data structure to represent text. And that means that the operations we might want to perform on text are operations we need to be able to do to strings. This includes, among many others:

* breaking text into chunks (e.g. split by lines)
* joining together text (e.g. merging documents into a single one)
* searching for text (e.g. finding all occurrences of `<img`)
* substituting text (e.g. replacing all occurrences of "Trump" with "Drumpf")
* formatting text (e.g. word wrapping a long piece of text)
* string interpolation (e.g. inserting variable values into strings)
* encrypting and decrypting (e.g. so that eavesdroppers cannot read it)

### Universal data type
The "universal" go-to data type is a string. *Every* useful language has some form of strings. Strings are usually supported at a very basic operating system level; there may even be special processor instructions for strings (e.g. on x86). 

And every kind of data structure can be written into a string. We can *serialize* data into strings, and *deserialize* it back out. The `pickle` module we saw earlier does exactly this: it converts any value into a string of bytes. That conversion is the tricky bit; writing the result to a file is easy.

In [2]:
x = ["---", ["string", "serialised"], 5, 6]
print(x, type(x))

['---', ['string', 'serialised'], 5, 6] <class 'list'>


In [3]:
# convert x to a string
y = str(x)
print(y, type(y))

['---', ['string', 'serialised'], 5, 6] <class 'str'>


In [4]:
# convert a string back into raw Python values
# in practice, we would never use eval(). Later we will see data
# formats like JSON which are much more suitable for this purpose.
x2 = eval(y)
print(x2, type(x2))

['---', ['string', 'serialised'], 5, 6] <class 'list'>
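
A safer round trip than `eval()` can be sketched with the standard library. This is only an illustration, reusing the list `x` from above: `json.dumps()` and `json.loads()` handle JSON-compatible values, and `ast.literal_eval()` parses Python literals without executing arbitrary code.

```python
import ast
import json

x = ["---", ["string", "serialised"], 5, 6]

# serialise to a JSON string, then parse it back into a list
as_json = json.dumps(x)
print(as_json, type(as_json))
from_json = json.loads(as_json)
print(from_json == x)

# literal_eval accepts only literals (lists, strings, numbers, ...),
# so it is a safer way to read back the output of str(x)
from_literal = ast.literal_eval(str(x))
print(from_literal == x)
```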


### I/O
Strings can be read and written from files. They can be pushed into network packets and sent over the Internet. They can be printed to a console or displayed on a user interface. They are a common language for communication within computer systems. 

Computer systems don't even agree on common formats for integers (big-endian vs. little-endian) or floating-point numbers, never mind more complex data structures. But they *do* agree (for the most part) about how strings should be stored.

-----
### What are strings?

<img src="imgs/string.jpg">
*[Image credit:  Christopher Sessums via flickr.com CC-BY-SA 3.0]*


A string is a sequence of characters. A character is just a length-1 string. A string is a compound data type; and because it has an order, it is a **sequence** data type.

    "string" = "s" "t" "r" "i" "n" "g"

Operations like iterating, indexing, slicing, taking the length and concatenating, which we saw with lists, work just as well with strings. These are not special list operations, but operations which work on a wide range of sequence types.

In [1]:
string = "hypothesis"
print(string[0]) # index
print(string[:5]) # slice
print(string[4:6]) # slice
print(string + ": true") # concatenate
print(string*2) # repeat
print(len(string)) # length
for char in string:   # iterate
    print(char, end=' ')
print()    
print("h" in "hypothesis") # membership

h
hypot
th
hypothesis: true
hypothesishypothesis
10
h y p o t h e s i s 
True


## Subsequence match
Note that `in` has special behaviour with strings that does not apply to lists or other sequences:

In [2]:
print("hyp" in "hypothesis") # sub-sequence membership test (special to strings!)

True
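
For contrast, here is a small sketch of how `in` behaves on lists, where it only ever tests for a single element. The example values are made up for illustration.

```python
# strings: `in` tests for a substring
print("hyp" in "hypothesis")

# lists: `in` tests whether the left-hand value is one *element* of the list
print(["h", "y"] in ["h", "y", "p"])

# this is only True when the list itself is an element
print(["h", "y"] in [["h", "y"], "p"])
```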


## Equality and ordering
Two strings are equal if they have the same characters in the same order. Strings are ordered lexicographically, i.e. in dictionary order, comparing character by character.

In [3]:
## equality and ordering
abc2 = "a" + "b" + "c"
print("abc==abc", "abc" == abc2)  # True
print("abc==def", "abc" == "def") # False
print("Aardvark<Zebra", "Aardvark" < "Zebra") # True
print("Aardvark<Aarhus", "Aardvark" < "Aarhus") # True

abc==abc True
abc==def False
Aardvark<Zebra True
Aardvark<Aarhus True


In [4]:
# but watch out for ordering traps (we'll see why this
# is like this in a minute)
print("10<AB", "10" < "AB") # True!
print("ab<AB", "ab" < "AB") # False!
print("ab==AB", "ab" == "AB") # False!

10<AB True
ab<AB False
ab==AB False


### Immutability
However, strings are **immutable**. They cannot be altered after they have been created. 

Unlike lists, where we have operations like `append`, `pop`, `del` and can assign to specific indices, this doesn't work for strings. Strings are closer to tuples than to lists in this regard.

In [5]:
# error, cannot modify a string
string = "helooetheere"
string[5] = '-'

TypeError: 'str' object does not support item assignment
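
To get the effect of changing one character, we have to build a new string. A minimal sketch, reusing the string from the failing cell above: slice out the parts on either side of the position and concatenate them around the new character.

```python
string = "helooetheere"

# everything before index 5, the new character, everything after index 5
fixed = string[:5] + "-" + string[6:]
print(fixed)

# the original string has not been modified
print(string)
```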

### Copying
String operations like slicing, indexing, etc. all just return **new** strings with characters *copied* in from the original string. Strings **cannot** be modified after they have been created. 


In [6]:
a = "hello"
b = "hello"
# these actually refer to exactly the same string object
# (CPython reuses identical string literals)
# note that `is` tests if two names refer to the same object,
# not if they are equal in value
print(a is b)

True


In [7]:
a += "o" # note: DOES NOT MODIFY a! Creates a new string and stores it in a
# any operator or function will (must!) allocate a new string and copy in
# characters as needed
print(a is b)
print(a,b)

False
helloo hello


-------


### Character encodings
Internally, strings are really sequences of integers. Each character has a numerical **code**. 

<img src="imgs/string.merm.png">

The encoding of characters can be quite complex, particularly when "foreign" (i.e. non-American!) characters are involved; this is a problem dealt with by **Unicode**.

Unicode lets us do useful things like:

* ¡uʍop ǝpᴉsdn spɹoʍ ǝʇᴉɹʍ uɐɔ ǝʍ
* 侢 侣 侤 侥 侦 侧!
* 😍 🤖 🎤 💩 ☣ 🕵 

### ASCII
For the moment, we will consider characters to be given integer values between 0-127 -- these are **ASCII codes**.

Strings in Python 3 are sequences of Unicode characters. UTF-8, the most common way of encoding Unicode text as bytes, is compatible with ASCII: ASCII characters keep the same codes. Unicode is a complex and important topic, but we won't delve into the details. For now, we will only consider strings which contain ASCII characters, having codes in the range 0-127. 
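
As a brief illustration of the relationship between characters and UTF-8 bytes, the sketch below encodes a three-character string. The example text is an assumption chosen to include one ASCII character, one accented letter and one emoji, written with escapes so the code points are unambiguous.

```python
# "A", then e-acute (U+00E9), then an emoji (U+1F60D)
text = "A\u00e9\U0001F60D"
print(len(text))                 # number of characters

encoded = text.encode("utf-8")   # str -> bytes
print(len(encoded))              # number of bytes: 1 + 2 + 4

# ASCII characters are stored as a single, unchanged byte
print("A".encode("utf-8"))

# decoding the bytes gives back the original string
print(encoded.decode("utf-8") == text)
```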

We can convert a character to its integer value using `ord()` (**ordinal**) and convert an integer code back to a character using `chr()` (**character**).

In [8]:
print("A:", ord('A'))
print("B:", ord('B'))
print(65,":", chr(65))
print(66,":", chr(66))

A: 65
B: 66
65 : A
66 : B
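
These codes explain the ordering surprises seen earlier: strings are compared character by character using the character codes, and digits come before uppercase letters, which come before lowercase letters. The sketch below shows this, along with one common way around it when sorting (the list of names is made up for illustration).

```python
print(ord("1"), ord("A"), ord("a"))   # 49 65 97
print("10" < "AB")                    # True, because 49 < 65
print("ab" < "AB")                    # False, because 97 > 65

names = ["apple", "Banana", "cherry"]
print(sorted(names))                  # uppercase sorts first
print(sorted(names, key=str.lower))   # compare lowercased copies instead
```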


In [11]:
# print all ASCII characters and their codes
# 16 x 8 = 128
code = 0
for j in range(16):
    for i in range(8):                
        print("%3d"%code, chr(code), end="\t")
        code = code + 1
    print()

  0  	  1 	  2 	  3 	  4 	  5 	  6 	  7 	
  8 	  9 		 10 
	 14 	 15 	
 16 	 17 	 18 	 19 	 20 	 21 	 22 	 23 	
 24 	 25 	 26 	 27 	 28 	 29 	 30 	 31 	
 32  	 33 !	 34 "	 35 #	 36 $	 37 %	 38 &	 39 '	
 40 (	 41 )	 42 *	 43 +	 44 ,	 45 -	 46 .	 47 /	
 48 0	 49 1	 50 2	 51 3	 52 4	 53 5	 54 6	 55 7	
 56 8	 57 9	 58 :	 59 ;	 60 <	 61 =	 62 >	 63 ?	
 64 @	 65 A	 66 B	 67 C	 68 D	 69 E	 70 F	 71 G	
 72 H	 73 I	 74 J	 75 K	 76 L	 77 M	 78 N	 79 O	
 80 P	 81 Q	 82 R	 83 S	 84 T	 85 U	 86 V	 87 W	
 88 X	 89 Y	 90 Z	 91 [	 92 \	 93 ]	 94 ^	 95 _	
 96 `	 97 a	 98 b	 99 c	100 d	101 e	102 f	103 g	
104 h	105 i	106 j	107 k	108 l	109 m	110 n	111 o	
112 p	113 q	114 r	115 s	116 t	117 u	118 v	119 w	
120 x	121 y	122 z	123 {	124 |	125 }	126 ~	127 	


For example, all of the UPPERCASE characters have codes 32 below the corresponding lowercase characters. So you could lowercase a string like this (**assuming it was only uppercase letters in the first place!**)

In [7]:
upper = "SHOUTING"
lower = ""
for ch in upper:
    # convert to integer, add 32, back to a character
    lower += chr(ord(ch)+32)
print(upper, lower)

SHOUTING shouting
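
The loop above breaks on anything that is not an uppercase letter. A slightly more careful sketch, using a hypothetical helper `ascii_lower`, only shifts characters in the range `A` to `Z` and copies everything else unchanged. In real code you would simply call `.lower()`.

```python
def ascii_lower(text):
    """Lowercase only the ASCII letters A-Z; leave everything else alone."""
    lower = ""
    for ch in text:
        if "A" <= ch <= "Z":
            # uppercase ASCII letter: shift its code up by 32
            lower += chr(ord(ch) + 32)
        else:
            lower += ch
    return lower

print(ascii_lower("SHOUTING, Quietly 123!"))
```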


We can also look at other Unicode characters, like emojis; these have character codes much larger than 127.

In [12]:
# print some emoji characters and their codes
# 32 x 8 = 256
code = 0
for j in range(32):
    for i in range(8):   
        # this block of emojis starts at offset 0x1f601, or 128513 in decimal
        smiley = code + 0x1f601
        print("%5x" % smiley, chr(smiley), end='\t')
        code = code + 1
    print()

1f601 😁	1f602 😂	1f603 😃	1f604 😄	1f605 😅	1f606 😆	1f607 😇	1f608 😈	
1f609 😉	1f60a 😊	1f60b 😋	1f60c 😌	1f60d 😍	1f60e 😎	1f60f 😏	1f610 😐	
1f611 😑	1f612 😒	1f613 😓	1f614 😔	1f615 😕	1f616 😖	1f617 😗	1f618 😘	
1f619 😙	1f61a 😚	1f61b 😛	1f61c 😜	1f61d 😝	1f61e 😞	1f61f 😟	1f620 😠	
1f621 😡	1f622 😢	1f623 😣	1f624 😤	1f625 😥	1f626 😦	1f627 😧	1f628 😨	
1f629 😩	1f62a 😪	1f62b 😫	1f62c 😬	1f62d 😭	1f62e 😮	1f62f 😯	1f630 😰	
1f631 😱	1f632 😲	1f633 😳	1f634 😴	1f635 😵	1f636 😶	1f637 😷	1f638 😸	
1f639 😹	1f63a 😺	1f63b 😻	1f63c 😼	1f63d 😽	1f63e 😾	1f63f 😿	1f640 🙀	
1f641 🙁	1f642 🙂	1f643 🙃	1f644 🙄	1f645 🙅	1f646 🙆	1f647 🙇	1f648 🙈	
1f649 🙉	1f64a 🙊	1f64b 🙋	1f64c 🙌	1f64d 🙍	1f64e 🙎	1f64f 🙏	1f650 🙐	
1f651 🙑	1f652 🙒	1f653 🙓	1f654 🙔	1f655 🙕	1f656 🙖	1f657 🙗	1f658 🙘	
1f659 🙙	1f65a 🙚	1f65b 🙛	1f65c 🙜	1f65d 🙝	1f65e 🙞	1f65f 🙟	1f660 🙠	
1f661 🙡	1f662 🙢	1f663 🙣	1f664 🙤	1f665 🙥	1f666 🙦	1f667 🙧	1f668 🙨	
1f669 🙩	1f66a 🙪	1f66b 🙫	1f66c 🙬	1f66d 🙭	1f66e 🙮	1f66f 🙯	1f670 🙰	
1f671 🙱	1f672 🙲	1f673 🙳	1f674 🙴	1f675 🙵	1f676 🙶	1f677 🙷	1f678 🙸	
1f679 🙹	1f67a 🙺	1f67b 🙻	1

### Special codes
Only codes between 32 and 126 appear as "printable" characters. These include things like letters, numbers and punctuation (*printable* or *graphic* characters). The codes below 32 (and code 127, DEL) are *special* or *control* characters, like TAB, newline, carriage return, backspace and some pretty weird control characters left over from old-fashioned teletype devices.

What computers looked like when ASCII was standardised:
<img src="imgs/teletype.jpg">
*[Image credit: By Jamie - Flickr: Telex machine TTY, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=19282428]*

Most of the time we don't need to worry about this, but there are some useful *whitespace* characters.

### Whitespace
Whitespace is any spacing character which has no visible appearance. The most common ones are:
* ` `: space (character 32)
* `\n`: newline (character 10)
* `\r`: carriage return (character 13)
* `\t`: tab (character 9)

Note that we can write these characters in strings using the special backslash `\` notation. Python will replace any occurrence of `\n` with the actual character 10 in a string.

In [2]:
print("Hi\nare\nyou\nice?")

Hi
are
you
ice?


In [14]:
print("Before tab\tAfter tab")
print("Before tab"+chr(9)+"After tab")

Before tab	After tab
Before tab	After tab


We can write any character with a code up to 255 directly in a string using `\xnn`, where `nn` is the two-digit *hex* (base 16) code of the character we want. 

In [44]:
print("Before tab\x09After tab\x0aAfter newline\x41")

Before tab	After tab
After newlineA


If we want Python to ignore any backslash special codes, we can put an `r` in front of the string (a **raw string**). Otherwise, to get a backslash (the "escape" character) to appear, it will have to be doubled.

In [45]:
print("This\n has\n newlines")
# note \\ means a literal \ appears
print("This\\n does not have\\n newlines")

# note the r in front -- for raw string
print(r"This\n does not have\n newlines")

This
 has
 newlines
This\n does not have\n newlines
This\n does not have\n newlines


### Splitting and joining
We very often have data that is a string representing a sequence of values, separated by some character. For example, you can export an Excel spreadsheet as a CSV (*comma separated value*) file, which is plain text. It has one row per line, and each column is separated with a comma.

    YCG,Canada,49.296389,-117.6325
    YCH,Canada,47.007778,-65.449167
    YCL,Canada,47.990833,-66.330278
    YCO,Canada,67.816667,-115.143889
    YCT,Canada,52.075001,-111.445278
    YCW,Canada,49.152779,-121.9388
    
CSV is a very simple and very widely used way of storing and transferring data.

We can use the `split()` method to split a string at **delimiters** (like commas) and return a list of strings. Because splitting by spaces is very common, if you don't tell `split()` what the delimiter is, it will split on any run of *whitespace* characters.

In [15]:
words = "these are some space separated words".split()
print(words)

['these', 'are', 'some', 'space', 'separated', 'words']


In [47]:
numbers = "5, 7,  31, 8"
print(numbers.split(",")) # note that the whitespace is *not* removed!

['5', ' 7', '  31', ' 8']
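
A common follow-up is to clean each piece with `strip()` and convert it to the type we want. The sketch below does this for the numbers above and for one line of the airport data; the variable names are just for illustration. (Real CSV files can contain quoted fields with commas inside them, for which Python's `csv` module is the better tool.)

```python
numbers = "5, 7,  31, 8"
# strip the stray whitespace from each field, then convert to int
values = [int(field.strip()) for field in numbers.split(",")]
print(values)

row = "YCG,Canada,49.296389,-117.6325"
# unpack the four fields of one CSV line into separate variables
code, country, lat, lon = row.split(",")
print(code, country, float(lat), float(lon))
```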


### Multi-character delimiters
Delimiters do not have to be single characters.

In [48]:
multi_char_delimiter = "Helium <==> 2 <==> He"
print(multi_char_delimiter.split(' <==> '))

['Helium', '2', 'He']


#### Splitting lines
We can split lines apart using `.split("\n")`, or the convenience function `.splitlines()`

In [None]:
multi = """A multiline string
will have line
breaks within it"""

In [26]:
print(multi.split("\n"))

['A multiline string', 'will have line', 'breaks within it']


In [27]:
print(multi.splitlines())

['A multiline string', 'will have line', 'breaks within it']
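
The two approaches differ in a couple of edge cases. As a small sketch with made-up strings: `split("\n")` leaves an empty string after a trailing newline and keeps any `\r` from Windows-style line endings, while `splitlines()` handles both.

```python
text = "first\nsecond\n"
print(text.split("\n"))      # note the empty string at the end
print(text.splitlines())

windows = "first\r\nsecond"
print(windows.split("\n"))   # the \r is left on the first line
print(windows.splitlines())
```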


However, most of the time that we work with multiline strings is when reading from files (or file-like objects), and the splitting is usually done already (e.g. with `readlines()` or `for line in f`).

## Efficient concatenation

You can concatenate strings with +. But each use of + allocates a totally new string. This uses up memory quickly, and can slow things down a lot. The interpreter has to copy the string each time + is used. 

Joining one character at a time is very expensive:

    "" + "c" = "c"
    "c" + "o" = "co"
    "co" + "n" = "con"
    "con" + "c" = "conc"
    "conc" + "a" = "conca"
    "conca" + "t" = "concat"

If you use + to join 1000 strings, **999 intermediate strings have to be allocated and copied.** Ouch.

The *efficient way* to join lots of strings together is to use `join()`. `join()` takes a sequence (list, tuple, etc.) and a separator, and concatenates each element of the sequence with the separator in between. The separator can be the empty string, a single character or any other string. 

It is *much better* performance-wise to put the strings into a mutable data structure like a list, and then join the whole list at once.

`join()` is the complement of `split()`.

### join syntax
`join` has an unusual syntax. It is a string method called on the separator, and takes a sequence. So to join together three strings without any space the syntax is:

In [22]:
# read: join the following list with no separator
print("".join(["alpha", "bravo", "charlie"]))

alphabravocharlie


To put a dash between, it would be like this:

In [23]:
# read: join the following list with a dash between each element
print("-".join(["alpha", "bravo", "charlie"]))

alpha-bravo-charlie


In [24]:
# each element on its own line
print("\n".join(["alpha", "bravo", "charlie"]))

alpha
bravo
charlie


or any other string

In [25]:
print("<-WORD->".join(["alpha", "bravo", "charlie"]))

alpha<-WORD->bravo<-WORD->charlie


Note that `join` only puts the separator *in between* the elements of the sequence. This is a very useful behaviour and is annoying to get right with a loop.
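
To see why this is awkward by hand, here is a sketch of the equivalent loop, which needs a special case to avoid a leading separator. It also shows that `join` expects strings, so other values must be converted first. The example lists are made up for illustration.

```python
words = ["alpha", "bravo", "charlie"]

# manual version: add the separator before every element except the first
result = ""
for i, word in enumerate(words):
    if i > 0:
        result += "-"
    result += word
print(result)
print(result == "-".join(words))

# join only accepts strings, so convert numbers with str() first
numbers = [1, 2, 3]
joined_numbers = ", ".join(str(n) for n in numbers)
print(joined_numbers)
```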

## Speed of joining
For long sequences, `join` is **much** faster than repeatedly calling +, because it only allocates the big string once (instead of copying strings hundreds or thousands of times in intermediate computations).

In [49]:
# 1 milllllllion strings
number_names = [str(i) for i in range(1000000)]
print(number_names[-10:]) # last ten elements

['999990', '999991', '999992', '999993', '999994', '999995', '999996', '999997', '999998', '999999']


In [50]:
%%timeit # timeit is cool -- it will time the execution time of this cell.
concat = ""
for string in number_names:
    concat += string

1.71 s ± 150 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [51]:
%%timeit
y = "".join(number_names)

14.9 ms ± 282 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
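
`%%timeit` only works inside IPython/Jupyter. Outside a notebook, a rough equivalent can be sketched with the standard `timeit` module. The function names and the smaller list size here are assumptions for illustration, and the timings will vary by machine and Python version.

```python
import timeit

number_names = [str(i) for i in range(100000)]

def concat_plus():
    # build the result one += at a time
    out = ""
    for s in number_names:
        out += s
    return out

def concat_join():
    # build the result with a single join call
    return "".join(number_names)

# time 3 runs of each approach (total seconds for the 3 runs)
print("+=  :", timeit.timeit(concat_plus, number=3))
print("join:", timeit.timeit(concat_join, number=3))
```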


## On the case
Latin-derived languages have two cases of characters: upper and lower. We can convert the case of strings easily. Non-alphabetic characters are unharmed:

In [30]:
mad_caps = "My cApS lOCK is bRokEN. CaLL 101 SoS!"
print(mad_caps)
print(mad_caps.lower()) # to lower case
print(mad_caps.upper()) # to upper case
print(mad_caps.capitalize()) # Capitalize first letter
print(mad_caps.title()) # Title Case (Capitalize First Letter Of Each Word)
print(mad_caps.swapcase()) # when was this ever useful!?

My cApS lOCK is bRokEN. CaLL 101 SoS!
my caps lock is broken. call 101 sos!
MY CAPS LOCK IS BROKEN. CALL 101 SOS!
My caps lock is broken. call 101 sos!
My Caps Lock Is Broken. Call 101 Sos!
mY CaPs Lock IS BrOKen. cAll 101 sOs!


Case conversion is very useful when you want to search for strings in a case-insensitive way. Just lower case (or upper case) the query string and the document string, and do the search; all characters are then guaranteed to be the same case. 
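
A minimal sketch of such a case-insensitive search, with a made-up document and query:

```python
document = "The Pelican's beak holds more than its belly can."
query = "PELICAN"

# a direct test fails because the cases differ
print(query in document)

# lower-casing both sides makes the comparison case-insensitive
print(query.lower() in document.lower())
```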

## Padding: Filling and justifying
We often need to line up strings so that they have a constant width, for example to make data appear nicely in columns. This looks bad:

In [16]:
columns = ["Aiport", "Country", "Lat", "Lon"]
data = [["YCG","Canada","49.296389","-117.6325"],
        ["LRL","Togo","9.767333","1.09125"],
        ["LFW","Togo","6.165611","1.254511"]]

for col in columns:
    print(col, end=' ')
print()    
for row in data:
    for entry in row:
        print(entry, end=' ')
    print()

Aiport Country Lat Lon 
YCG Canada 49.296389 -117.6325 
LRL Togo 9.767333 1.09125 
LFW Togo 6.165611 1.254511 


We can justify text to the left, right or center, so that it fits into fixed length chunks using `.ljust()`, `.rjust()` and `.center()`. 

Note that this will *pad* the string to make it fit into the column size, but will *not* truncate it.

In [32]:
for col in columns:
    print(col.ljust(10), end=' ')
print()    
for row in data:
    for entry in row:
        print(entry.ljust(10), end=' ')
    print()

Aiport     Country    Lat        Lon       
YCG        Canada     49.296389  -117.6325 
LRL        Togo       9.767333   1.09125   
LFW        Togo       6.165611   1.254511  


In [17]:
# to truncate, just use slices
# note: a slice extending beyond the end of a sequence is fine
# it is *not* an error to do "abc"[:5] -- this will take up to
# the first five characters
for col in columns:
    print(col[:4].ljust(4), end=' ')
print()    

for row in data:
    for entry in row:
        print(entry[:4].ljust(4), end=' ')
    print()

Aipo Coun Lat  Lon  
YCG  Cana 49.2 -117 
LRL  Togo 9.76 1.09 
LFW  Togo 6.16 1.25 
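
The other justification methods work the same way. As a small sketch with made-up rows: numbers often read better right-justified with `.rjust()`, and `.center()` (like the others) takes an optional fill character.

```python
rows = [("YCG", "49.296389"), ("LRL", "9.767333")]

# left-justify the code column, right-justify the number column
print("Code".ljust(6) + "Lat".rjust(12))
for code, lat in rows:
    print(code.ljust(6) + lat.rjust(12))

# center within 11 characters, padding with dashes instead of spaces
print("Lat".center(11, "-"))
```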


## Interpolation
We saw basic formatting with %, which allows us to substitute values into strings. This substitution is known as **string interpolation**; values are interpolated into the string.

* `%f` float
* `%e` float, scientific notation
* `%d` integer, decimal
* `%s` string
* `%x` integer, hex

(there are many others, but they are rarely used. Only `%d`, `%s` and `%f` are widely used)

In [21]:
x = 215
print("%f %e %d %x %X %s" % (x,x,x,x,x,x))

215.000000 2.150000e+02 215 d7 D7 215


#### Formatting numbers
When formatting numbers, we often want to be able to specify the *precision* of the number (or the length of a string).

There is quite a sophisticated set of codes that can be used to customize the interpolation. 

Most usefully, a number can be placed between the % and the formatting character to specify the minimum number of characters the value should take up. The value is padded on the left, which is identical to what `rjust()` does; putting a `-` before the number (e.g. `%-8s`) pads on the right instead, like `ljust()`.

In [35]:
x = "Barn"
y = 2
print("%s %4s %8s %16s" % (x,x,x,x))
print("%d %4d %8d %16d" % (y,y,y,y))

Barn Barn     Barn             Barn
2    2        2                2


We can also put a . (decimal point) followed by a number to specify how many digits to include after the decimal point of a floating-point number -- or the maximum number of characters to use for a string.

In [36]:
x = "Flexible"
y = 2
pi = 3.141592653589793
# String: here . means maximum string length
print("%.1s %.4s %.8s" % (x,x,x))

# Decimal: here . means fill with leading zeros to given length
print("%.d %.4d %.8d" % (y,y,y))

# Float: here . means round the number to the given
# number of digits after the decimal point
print("%f %.1f %.4f %.16f" % (pi,pi,pi,pi))

F Flex Flexible
2 0002 00000002
3.141593 3.1 3.1416 3.1415926535897931
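
Width, precision and flags can be combined in one code. The sketch below shows a few combinations; the square brackets are only there to make the padding visible.

```python
pi = 3.141592653589793

print("[%-8s]" % "Barn")   # - flag: pad on the right (left-justify)
print("[%8s]" % "Barn")    # default: pad on the left (right-justify)
print("[%05d]" % 2)        # 0 flag: pad with zeros to width 5
print("[%8.3f]" % pi)      # width 8, 3 digits after the decimal point
print("[%-8.3f]" % pi)     # same, but left-justified
```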


These formatting commands or similar equivalents are available in other languages (e.g. C, Java).

# Matching
Being able to *find* strings within other strings is an essential part of string manipulation. From web search and Word's find-and-replace to command line tools like `grep`, testing whether text matches a search pattern is a vital task.

## Matching strings
We know how to find if a (sub)string is inside another string using `in`

In [53]:
print("ican" in "pelicans")
print("" in "pelicans") # the empty string is in everything
print("pelicans" in "pelicans") # note: inclusive of the whole string!
print("icant" in "pelicans")

True
True
True
False


We also have `startswith()` and `endswith()` to test the special cases where a string appears at the start or end of a string

In [54]:
print("pelican".startswith("p"))
print("pelican".startswith("peli"))
print("pelican".startswith("q"))

True
True
False


In [55]:
print("pelican".endswith("can"))
print("pelican".endswith("peli"))

True
False


If a string is present inside another, we can find out **where** using `index()`. Note that `index()` raises an error if the string searched for *doesn't appear* in the other string. The returned index is the index of the character that the substring starts on.

In [22]:
print("pelican".index("ican"))

3


In [23]:
# no luck - this will cause an exception "substring not found"
print("pelican".index("icant"))

ValueError: substring not found

Just like `index()` for lists, a second parameter can be used to specify an index from where to start searching. This can be used to find all occurrences of a search string inside another string, by finding each index and then starting the next search from the character afterwards (e.g. like using Ctrl-F in a browser).

### Find
`find()` does the same as `index()` but returns -1 on failure to match, instead of causing an error.

In [58]:
## find does the same as index() but returns -1 on failure instead of causing an error
print("pelican".find("icant"))
print("pelican".find("ican"))

-1
3
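
The idea of restarting the search after each hit, using the optional start index, can be sketched as follows. `find_all` is a hypothetical helper written for this example, not a built-in.

```python
def find_all(text, query):
    """Return the start index of every occurrence of query in text."""
    positions = []
    start = text.find(query)
    while start != -1:
        positions.append(start)
        # resume searching one character after the previous match
        start = text.find(query, start + 1)
    return positions

print(find_all("banana", "an"))
print(find_all("banana", "x"))
```

Because the search resumes one character after each match, overlapping occurrences are also reported.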


## Search and replace
We can perform literal search and replace on a string using `.replace()`. This replaces **every** occurrence of a given substring with a replacement substring.

In [59]:
x = "pelican"
print(x)

pelican


In [60]:
x = x.replace("n", "t") # single character replace
print(x)

pelicat


In [61]:
x = x.replace("pel", "del") # can be any two strings
print(x)

delicat


In [62]:
x = x.replace("t", "tely") # don't have to be the same length
print(x)

delicately


In [63]:
x = x.replace("elic", "") # can remove all occurrences of a substring
print(x)

dately


In [64]:
x = x.replace("", " | ") # put the given string between every existing character
print(x)

 | d | a | t | e | l | y | 
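
`.replace()` also accepts an optional third argument limiting how many occurrences are replaced. A short sketch with a made-up string, which also shows that the original is never modified:

```python
x = "banana"
replaced_all = x.replace("a", "o")       # every occurrence
replaced_one = x.replace("a", "o", 1)    # only the first occurrence
print(replaced_all)
print(replaced_one)
print(x)                                 # the original is unchanged
```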


-----------

## Pattern matching
But what if we wanted to match things less precisely? If we didn't want just exact matches, but wanted to find strings that followed a certain pattern or template?

**Regular expressions** are an extremely powerful tool for doing this. They are sophisticated tools, and we will only scratch the surface. Regular expressions are an entire sub-language, which is widely available as part of most common programming languages or their standard libraries. They are also part of many command line utilities (`grep`, `sed`, etc.) and part of any "serious" text editor (`vim`, `emacs`, `notepad++`, `sublime`, `atom`, etc.).

<img src="https://imgs.xkcd.com/comics/regular_expressions.png">

Regular expressions have a syntax that can be obscure, but is very compact. They are powerful, so it is also easy to abuse them. But, for many cases, they are the tool of choice to solve text processing problems. 

### Regular expression patterns
A regular expression pattern is just a string, which uses **special characters** to represent ways in which the pattern can vary. Regular expression matching functions take a regular expression pattern and can match that against a string.

In the most basic case, a regular expression can just look for a literal string, just as `.find()` does:

In [24]:
import re # import the Regular Expression module
print(re.findall("ican", "pelican"))

['ican']


### Crossword matching
<img src="imgs/British_crossword.png">
*[Image credit: Public domain]*

**Impress your grandparents!**
Say you have a partial solution to a crossword: `c_tl___`. How could you find all the words that match that partial solution?

## A placeholder: .

We can write a `.` (period, dot) in a regular expression to mean "any character can go here". Let's use the list of words from `words.txt` and solve this problem.

There is a function `re.match(pattern, string)` that determines if the given pattern matches a given string, *with the pattern required to be present at the very start of the string.*

In [25]:
with open("words.txt") as f:
    for line in f:
        word = line.strip() # strip that newline!
        if re.match('c.tl...', word):
            print(word)

cutlass
cutlasses
cutlers
cutlery
cutlets


## Anchors: $ and ^
This works, but `match()` will match any string that matches the pattern at the start, regardless of what comes next. This includes `cutlasses`, even though it has extra characters at the end. 

We can use the special characters `^` (caret) and `$` (dollar) to mean `start of string` and `end of string`. For example, $ forces the pattern to only match if the string ends where the `$` is. These characters are called `anchors` because they anchor the pattern to the start or end of a string. (`^` isn't useful with `re.match()` but it is useful with other regular expression functions).

In [28]:
# let's write this as a function to save some repetition
def match_words(pattern):
    with open("words.txt") as f:
        for line in f:
            word = line.strip() 
            if re.match(pattern, word):
                print(word)
                
match_words('c.tl...$')

cutlass
cutlers
cutlery
cutlets
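
The difference between the matching functions can be sketched directly, without the word list. `re.search()` looks for the pattern anywhere in the string (which is where `^` becomes useful), and `re.fullmatch()` requires the whole string to match, as if both anchors were present.

```python
import re

# match: the pattern must be present at the start of the string
print(re.match("c.tl...", "cutlasses") is not None)
# adding $ rejects the extra characters at the end
print(re.match("c.tl...$", "cutlasses") is not None)
# fullmatch: the entire string must match the pattern
print(re.fullmatch("c.tl...", "cutlass") is not None)

# search: the pattern may appear anywhere...
print(re.search("tl", "cutlass") is not None)
# ...unless ^ anchors it to the start
print(re.search("^tl", "cutlass") is not None)
```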


## Escaping
What if we wanted to actually match a `$` or a literal `.`? We can always **escape** a special character to make it behave as if it were not special. Backslash \ is the escape character. It makes the following character work as if it were not special.


In [3]:
import re # needed if this cell is run before the earlier import

# does not match -- because the $ is taken to mean the anchor
print(re.match('200$', '200$'))

# now we escape the $ and it behaves as the literal character $
# it matches correctly
print(re.match('200\$', '200$'))


**However, \ still has its effect of making characters like \n into newlines.** This can be a pain, and raw strings (with an `r` in front) are often used to avoid this.

In [54]:
# note the r: no chance the \ does something unexpected
print(re.match(r'200\$', '200$'))

<_sre.SRE_Match object at 0x000000000AB5B8B8>


### Character classes
We can restrict a placeholder to a set of possible characters, instead of just any character. 

To do this, we put all the possible characters we want to match inside square brackets `[]`. We can also specify a consecutive range of characters, like `a-z` or `0-9`.

### *The whole square bracketed set of characters applies to one character.*

In [30]:
match_words("p[nt][aeiou]")

pneumatic
pneumatically
pneumonia
pneumonic
ptarmigan
pterodactyl
ptolemaic
ptolemy
ptomaine


In [58]:
# using a range of characters
print(re.match('[a-z].$', "as")) # match
print(re.match('[a-z].$', "0s")) # no match
# note that it is case sensitive: [a-z] would not match "AS", but [A-Z] does
print(re.match('[A-Z].$', "AS")) # match

<_sre.SRE_Match object at 0x000000000AB5BAC0>
None
<_sre.SRE_Match object at 0x000000000AB5BAC0>


### Inverted character classes
We can also invert a character class to say match **any character except these ones**. To do this, we put a ^ (caret) as the very first character in the square brackets:

In [59]:
# a word beginning with f then three **non-vowels**
match_words("f[^aeiou][^aeiou][^aeiou]")

flycatcher
flycatchers
flyleaf
flynn
flyway
flyways
flyweight
flywheel
flywheels


## Built-in character classes
Some classes are so commonly used there are special codes for them:

    \d  Match any digit: a character in the range 0 - 9, same as [0-9]
    \D  Match any non-digit: a character NOT in the range 0 - 9, same as [^0-9]
    \s  Match any whitespace character (space, tab, newline etc.)
    \S  Match any character that is NOT whitespace
    \w  Match any "word" character: a letter, a digit or underscore, same as [A-Za-z0-9_] for ASCII text
    \W  Match any character not in \w
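
A brief sketch of these classes in use, with made-up input strings. The `+` after each class means "one or more", and the patterns are raw strings so the backslashes reach the regular expression engine untouched.

```python
import re

# runs of digits
digits = re.findall(r"\d+", "YCG 49.296389 -117.6325")
print(digits)

# runs of word characters: underscore counts, hyphen and comma do not
words = re.findall(r"\w+", "snake_case, kebab-case")
print(words)

# runs of non-whitespace characters
chunks = re.findall(r"\S+", "tabs\tand   spaces")
print(chunks)
```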

## Repeat characters
We can also tell the matcher to allow characters to **repeat**. The most common ways of doing that are postfixing an expression element (e.g. a character or a character class) with:
* a `*` (repeat zero or more times)
* a `+` (repeat one or more times)
* a `?` (appears zero or one times)
* a `{n}` (appears exactly n times)

In [60]:
# all words that begin fir and end with t
match_words("fir.*t$")

fireboat
firelight
firmament
firmest
first


In [61]:
# all words that have f<something>t<something>ha<something>
match_words("f.*t.*ha.*")

feltham
firsthand
fotheringhay


In [62]:
# match nec<zero or more vowels><one or more s><something>y
match_words("nec[aeiou]*s+.*y")

necessarily
necessary
necessitously
necessity


In [63]:
# same as the f followed by three non-vowels example
# note the way the repeat character binds to the previous expression, which might not
# be a single character in the pattern!
match_words("f[^aeiou]{3}")

flycatcher
flycatchers
flyleaf
flynn
flyway
flyways
flyweight
flywheel
flywheels


In [64]:
# match a sequence of digits, possibly followed by one letter
# Note that we can just jam in multiple ranges in the character class 
# inside the square brackets
options = ["13131", "3133103b", "31hello", "o88", "7B"]

for option in options:
    print(option, re.match("[0-9]+[A-Za-z]?$", option) is not None)

13131 True
3133103b True
31hello False
o88 False
7B True


### Groups
We can group multiple elements in a regular expression together. To do this, we put the "subexpression" in brackets. So:

    f(lip)+$
    
means anything with an `f`, followed by one or more `lip`, then the end of string. It will match:

    flip
    fliplip
    flipliplip
    
but not:

    flipli
    fliplp
    
Any regular expression components can go in these brackets.    

**Any repeat operator following will apply to the whole group -- everything in the group works as if it were just one character**

In [32]:
tests = ["flip", "fliplip", "flipliplip", "flipli", "flipl", "fliplipliplop"]

# match a pattern against a list of tests
def match_against(tests, pattern):
    for st in tests:
        print(st.ljust(20), re.match(pattern, st) is not None)
        
match_against(tests, "f(lip)+$")

flip                 True
fliplip              True
flipliplip           True
flipli               False
flipl                False
fliplipliplop        False


### Alternatives
We can use this grouping functionality to make an intelligent kind of "or". Imagine we wanted to match any of `Mrs.` or `Ms.` or `Miss.`. How could we write that as a regular expression?

`|` is an operator which means "match one of the options on either side of the `|`". Each option can be a *grouped expression*.

So this pattern would work:

    (Mrs\.)|(Ms\.)|(Miss\.)
    
or this shorter one, which keeps the shared `M` and `\.` outside a single group of alternatives:

    M(rs|s|iss)\.
    


In [33]:
names = ["Mrs. Purple", "Miss. White", "Ms. Yellow", "Dr. Blue", "Mr. Red"]

# raw string (r"...") so that \. reaches the regex engine without an escape warning
match_against(names, r"(Mrs\.)|(Miss\.)|(Ms\.)")

Mrs. Purple          True
Miss. White          True
Ms. Yellow           True
Dr. Blue             False
Mr. Red              False
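
Because `|` splits everything on its left from everything on its right, it often helps to wrap just the alternatives in one group. A minimal sketch, reusing the same list of names (the variable name `pattern` is only for illustration):

```python
import re

# Only the part that differs between the titles goes inside the group;
# the shared "M" and the literal "." stay outside it.
pattern = r"M(rs|s|iss)\."

for name in ["Mrs. Purple", "Miss. White", "Ms. Yellow", "Dr. Blue", "Mr. Red"]:
    print(name.ljust(20), re.match(pattern, name) is not None)

# Mrs. Purple          True
# Miss. White          True
# Ms. Yellow           True
# Dr. Blue             False
# Mr. Red              False
```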


  
Any regular expression element can be alternated with |:

    [a-z]|[A-Z] same as [a-zA-Z]
    b|g  matches b or g, same as [bg]
    (b[aeiou]+k)|(d[aeiou]+t) matches both book and duet
    
But grouped subexpressions are the most useful thing to alternate -- usually character classes can capture most other patterns.

In [68]:
match_words("((b[aeiou]+k)|(d[aeiou]+t))$")

beak
book
diet
dot
duet


### Captures and extraction
As well as being able to do interesting alternation, every time a group is used, a regular expression matcher *captures* the contents of a group. This is how regular expressions can be used to extract specific text.

Every capture is numbered by counting the opening `(` from the left, starting at 1. So if I wanted to capture someone's title and the name following it:

    (Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\. (\w*)

Then the title would be in capture 1 and their name would be in capture 2. Capture 0 always refers to the whole match.

`re.match()` lets us get at those groups, by calling `.groups()` on the return value. This returns the captures as a tuple, so the title is at index 0 of that tuple and the name at index 1.

In [69]:
names = ["Mrs. Purple", "Miss. White", "Ms. Yellow", "Dr. Blue", "Mr. Red", "Lord. Black"]
# simple matching
match_against(names, r"(Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\. (\w*)")

Mrs. Purple          True
Miss. White          True
Ms. Yellow           True
Dr. Blue             True
Mr. Red              True
Lord. Black          True


In [70]:
names = ["Mrs. Purple", "Miss. White", "Ms. Yellow", "Dr. Blue", "Mr. Red", "Lord. Black"]
# now we actually use the captures
def print_groups(tests, pattern):
    for test in tests:
        print(test.ljust(20), end=' ')
        match = re.match(pattern, test)
        if match is not None:
            print(match.groups()) # groups() returns a tuple of the captures
        else:
            print("no match")     # still end the line when nothing matched
print_groups(names, r"(Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\. (\w*)")

Mrs. Purple          ('Mrs', 'Purple')
Miss. White          ('Miss', 'White')
Ms. Yellow           ('Ms', 'Yellow')
Dr. Blue             ('Dr', 'Blue')
Mr. Red              ('Mr', 'Red')
Lord. Black          ('Lord', 'Black')
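
Individual captures can also be read one at a time with `.group(n)`. A minimal sketch on a single name: `.group(n)` counts captures from 1 (with 0 meaning the whole match), while the tuple from `.groups()` is indexed from 0 like any other Python tuple.

```python
import re

match = re.match(r"(Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\. (\w*)", "Dr. Blue")

print(match.group(0))     # whole matched text: Dr. Blue
print(match.group(1))     # first capture, the title: Dr
print(match.group(2))     # second capture, the name: Blue
print(match.groups())     # all captures as a tuple: ('Dr', 'Blue')
print(match.groups()[0])  # tuple index 0 is capture 1: Dr
```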


### Substitution
We can also do a regular expression find and replace: every piece of text that matches a pattern is replaced with new text. `re.sub(pattern, replacement, string)` performs this operation:

In [71]:
def sub_list(tests, pattern, replacement):
    for test in tests:
        print(test.ljust(20), "=>", end=' ')
        subst = re.sub(pattern,  replacement, test)
        print(subst)
        
sub_list(names, r"(Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\.", "<title>")

Mrs. Purple          => <title> Purple
Miss. White          => <title> White
Ms. Yellow           => <title> Yellow
Dr. Blue             => <title> Blue
Mr. Red              => <title> Red
Lord. Black          => <title> Black


### Back references
We can refer to the value of any previously captured group using the notation `\<n>`, where `<n>` is an integer giving the number of the group. Numbering starts at 1: group 0 is the whole match, so `\1` is the first capture. (In a Python replacement string the whole match is written `\g<0>`, not `\0`.) This works in substitutions:


In [72]:
def sub_list(tests, pattern, replacement):
    for test in tests:
        print(test.ljust(20), "=>", end=' ')
        subst = re.sub(pattern,  replacement, test)
        print(subst)
# the \1 refers to the title and the \2 refers to the name
sub_list(names, r"(Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\. (\w*)", 
         r"{'title': '\1', 'name': '\2'}")

Mrs. Purple          => {'title': 'Mrs', 'name': 'Purple'}
Miss. White          => {'title': 'Miss', 'name': 'White'}
Ms. Yellow           => {'title': 'Ms', 'name': 'Yellow'}
Dr. Blue             => {'title': 'Dr', 'name': 'Blue'}
Mr. Red              => {'title': 'Mr', 'name': 'Red'}
Lord. Black          => {'title': 'Lord', 'name': 'Black'}
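
Back references do not have to appear in their original order. The sketch below swaps the two captures, and also uses `\g<0>`, which Python's `re.sub` accepts for the whole matched text. The variable names are only for illustration.

```python
import re

pattern = r"(Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\. (\w*)"

# \2 then \1 puts the name before the title
swapped = re.sub(pattern, r"\2, \1", "Dr. Blue")
# \g<0> stands for everything the pattern matched
wrapped = re.sub(pattern, r"[\g<0>]", "Dr. Blue")

print(swapped)   # Blue, Dr
print(wrapped)   # [Dr. Blue]
```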


Back references also work in the matching part of a pattern, where they force a previously matched value to appear again.

In [73]:
# match everything that has a v, followed by two of the **same** vowel
match_words(r"v([aeiou])\1")

veer
veered
veering
veers
voodoo
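
The same idea works with `re.search` on individual strings. A minimal sketch that looks for any doubled lowercase letter (the word list is made up for illustration):

```python
import re

# ([a-z]) captures one letter; \1 then requires that same letter again
doubled = r"([a-z])\1"

for word in ["book", "letter", "python"]:
    found = re.search(doubled, word)
    print(word.ljust(10), found.group(0) if found else None)

# book       oo
# letter     tt
# python     None
```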


### Matching versus finding
* `re.match()` matches a pattern only at the start of a string.
* `re.search()` finds the first match of a pattern anywhere in a string (like match, but the pattern does not have to match at the start of the string).
* `re.findall()` finds **all** the non-overlapping matches in a string, and returns them in a list.

In [74]:
# one massive string
names = """Mrs. Purple, Miss. White, Ms. Yellow, Dr. Blue, Mr. Red, Lord. Black"""
# just matched the first one
print(re.match(r"(Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\. (\w*)", names).groups())
# search does the same thing here
print(re.search(r"(Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\. (\w*)", names).groups())

('Mrs', 'Purple')
('Mrs', 'Purple')


In [75]:
names = """some stuff Mrs. Purple, Miss. White, Ms. Yellow, Dr. Blue, Mr. Red, Lord. Black"""
# will *not* match, because of some stuff at the start
print(re.match(r"(Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\. (\w*)", names))
# search still finds the first match
print(re.search(r"(Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\. (\w*)", names).groups())

None
('Mrs', 'Purple')


In [76]:
# using findall
all_matches = re.findall(r"(Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\. (\w*)", names)
# with two groups in the pattern, this is a list of (title, name) tuples
for match in all_matches:
    print(match)

('Mrs', 'Purple')
('Miss', 'White')
('Ms', 'Yellow')
('Dr', 'Blue')
('Mr', 'Red')
('Lord', 'Black')
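
What `re.findall()` returns depends on how many groups the pattern contains. A minimal sketch on a shorter string (the patterns and variable names here are only for illustration):

```python
import re

text = "Mrs. Purple, Dr. Blue, Mr. Red"

# no groups: a list of the matched strings
no_groups = re.findall(r"\w+\.", text)
# one group: a list of that group's contents
one_group = re.findall(r"(\w+)\.", text)
# two groups: a list of tuples, one tuple per match
two_groups = re.findall(r"(\w+)\. (\w+)", text)

print(no_groups)    # ['Mrs.', 'Dr.', 'Mr.']
print(one_group)    # ['Mrs', 'Dr', 'Mr']
print(two_groups)   # [('Mrs', 'Purple'), ('Dr', 'Blue'), ('Mr', 'Red')]
```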


### Confused about regular expressions?
Try practicing on https://regex101.com/
It's very helpful.

## Finally
Feedback on CS1P, please.



In [None]:
## Syntax review (from learnxinyminutes.com)

In [None]:
# Strings are created with " or '
"This is a string."
'This is also a string.'

# Strings can be added too!
"Hello " + "world!"  # => "Hello world!"
# Adjacent string literals are joined automatically, without using '+'
"Hello " "world!"  # => "Hello world!"

# ... or multiplied
"Hello" * 3  # => "HelloHelloHello"

# A string can be treated like a list of characters
"This is a string"[0]  # => 'T'

# You can find the length of a string
len("This is a string")  # => 16

# String formatting with %
# The % operator is the older formatting style. It is still supported in Python 3,
# although str.format() and f-strings are now generally preferred.
x = 'apple'
y = 'lemon'
z = "The items in the basket are %s and %s" % (x,y)

##### Regular expressions
See http://www.rexegg.com/regex-quickstart.html for a complete guide.

## Week review

* Strings are sequences of characters.
* A character is a length 1 string.
* Every character has a numerical code.
* ASCII codes range from 0 to 127 and cover the "standard" unaccented Latin letters, Arabic digits and punctuation.
* `chr()` converts integers to characters, and `ord()` converts characters to integers.
* Strings can be split into a list of elements with `string.split()`
* A list of strings can be joined together using `"sep".join(list)`
* Leading and trailing whitespace can be removed with `strip()` (use `rstrip()` to remove only trailing whitespace)
* In Python, strings are **immutable**.
* We can justify strings easily using `ljust()`, `rjust()` and `center()`
* We can test for substrings using `in`
* We can find the location of substrings using `index()`
* We can substitute substrings using `replace()`
* We can change the case of text using `upper()`, `lower()` and `capitalize()`.
* We can format strings using `%` to substitute in values. 
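
The string operations in this list can be tried together in a few lines. A minimal sketch, using a made-up input line:

```python
line = "  red,green,blue  \n"

clean = line.strip()          # remove leading and trailing whitespace
parts = clean.split(",")      # ['red', 'green', 'blue']
joined = " | ".join(parts)    # 'red | green | blue'

print(joined)
print("green" in clean, clean.index("green"))     # True 4
print(clean.replace("green", "teal").upper())     # RED,TEAL,BLUE
print("%s has %d items" % (joined, len(parts)))   # red | green | blue has 3 items
print(ord("A"), chr(97))                          # 65 a
print("left".ljust(8) + "|" + "right".rjust(8))   # left    |   right
```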
#### Regular expressions
* Regular expressions allow us to match text using flexible templates.
* Regexes use special characters to specify matching templates.
* Any special character can be **escaped** by preceding it with a `\ ` to make it behave as a regular character.
* `.` means match any character (`"c.tl.ss"` matches "cutlass").
* We can allow a character to be repeated using `*` (zero or more), `+` (one or more) and `?` (zero or one). `"gl.*ow"` matches anything beginning `gl` and ending in `ow`, including `glow` and `glasgow`.
* `$` and `^` allow us to force a pattern to match at the end or beginning of a string, respectively.
* We can use **character classes** to represent a set of characters that can be matched. `[abc]` matches a,b or c.
* We can specify a range of consecutive characters in a character class using `-` (e.g. `[a-z]`), and we can invert a character class by starting it with `^` (meaning anything *but* any of these).
* We can group characters together using `()`
* We can alternate groups using `|`, meaning "or": `(cat)|(dog)` means match either `cat` or `dog`.
* We can capture groups to extract information from a match.
* We can use capture groups to perform very flexible substitutions.
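
A few of the points above can be checked directly. This is a minimal sketch with made-up test strings; `checks` and `results` are names chosen only for illustration.

```python
import re

checks = [
    (r"c.tl.ss", "cutlass"),     # . matches any character
    (r"3\.14", "3.14"),          # \. matches a literal full stop
    (r"3\.14", "3x14"),          # ... and nothing else
    (r"gl.*ow$", "glasgow"),     # .* repeats "any character", $ anchors the end
    (r"^[^aeiou]+$", "rhythm"),  # inverted class: only non-vowels, start to end
    (r"^[a-c]+$", "abcd"),       # range a-c does not include d
]

results = [re.search(pattern, text) is not None for pattern, text in checks]

for (pattern, text), ok in zip(checks, results):
    print(pattern.ljust(14), text.ljust(10), ok)
```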