# Sorting and Pattern Matching

## <img src="https://az712634.vo.msecnd.net/notebooks/python_course/v1/octopus.png" alt="Smiley face" width="42" height="42" align="left">Learning Objectives
* * *
* Become comfortable with sorting (inline or copied) and the options you have
* Explore fancy sorting with a built-in or custom function as a "key"
* Learn ways to search and find patterns with regular expressions
* See how to split up strings with regular expressions

#### Here are some people's names
* How could you imagine wanting to sort them?

In [1]:
people = ['Joshua Richardson', 'Wei Ling', 'Sarah McKearny', 'Matthias Strauch']

#### Here are some shoe prices in US dollars
* What if you wanted to sort and convert to Euros at the same time?

In [2]:
prices = [134.50, 49.99, 300.00, 12.50, 68.49]
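
One way to approach the question above is to convert inside a generator expression and pass the result straight to `sorted()`. This is only a sketch: the exchange rate `USD_TO_EUR` is a made-up value chosen for illustration, not a real quote.

```python
prices = [134.50, 49.99, 300.00, 12.50, 68.49]

# Hypothetical exchange rate, assumed for illustration only
USD_TO_EUR = 0.9

# Convert each price, round to cents, and sort the converted values
euro_prices = sorted(round(p * USD_TO_EUR, 2) for p in prices)
print(euro_prices)
```

Because `sorted()` returns a new list, `prices` itself is left unchanged.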

### Sorting functions
* Inline (in place) - no copying with `list.sort()`

In [5]:
# Create a list of 10 random integers
import random
a = random.sample(range(50), 10)
print('a:', a, '\n')

# Inline sorting
a.sort(reverse=True)
print('inline sorted a:', a, '\n')

a: [48, 27, 29, 7, 47, 28, 22, 33, 12, 20] 

inline sorted a: [48, 47, 33, 29, 28, 27, 22, 20, 12, 7] 



**To reverse:** pass `reverse=True` to the `sort()` method (as above); leave it out for ascending order

* Sort and create a copy with `sorted`

In [6]:
# another list of 10 random integers
b = random.sample(range(50), 10)
print('b:', b)

# make a sorted copy of list
b_sorted = sorted(b)
print('sorted copy of b:', b_sorted)

b: [13, 15, 43, 20, 3, 38, 41, 0, 6, 17]
sorted copy of b: [0, 3, 6, 13, 15, 17, 20, 38, 41, 43]


Will the following work with inline `sort` and/or copy version `sorted`?
```python
c = (3, 4, 2, 6, 5)
```

In [9]:
c = (3, 4, 2, 6, 5)
d = sorted(c, reverse=True)
print(d)

[6, 5, 4, 3, 2]
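
To make the answer to the question above explicit, here is a small sketch showing that `sorted()` works on a tuple while the inline `sort()` method does not exist for tuples. The variable name `sorted_tuple` is just an illustrative choice.

```python
c = (3, 4, 2, 6, 5)

# sorted() accepts any iterable and always returns a new list
print(sorted(c))

# Tuples are immutable, so they have no in-place sort() method
try:
    c.sort()
except AttributeError as err:
    print('AttributeError:', err)

# To get a sorted tuple back, convert the resulting list
sorted_tuple = tuple(sorted(c))
print(sorted_tuple)
```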


<b>Sorting efficiency</b>
* `sort()` is usually slightly faster on lists than `sorted()` because it sorts in place and doesn't build a new list
* however, the `sort()` method only works on lists, whereas `sorted()` accepts any iterable (tuples, strings, sets, generators, ...) and always returns a new list
* use the `%timeit` magic to run a speed test of your own (modify the code above)
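
Outside a notebook, where the `%timeit` magic is not available, the standard `timeit` module can be used for a similar comparison. The following is a rough sketch, assuming a list size of 1000 and 200 repetitions (both arbitrary choices); actual timings depend on your machine.

```python
import random
import timeit

data = random.sample(range(10000), 1000)

# sorted() builds a new list on every call, so the input stays unsorted
t_sorted = timeit.timeit(lambda: sorted(data), number=200)

# list.sort() works in place, so each run sorts a fresh copy;
# note that the copy itself is included in the measured time
t_sort = timeit.timeit(lambda: data.copy().sort(), number=200)

print(f'sorted():      {t_sorted:.4f} s for 200 runs')
print(f'copy + sort(): {t_sort:.4f} s for 200 runs')
```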

EXERCISE 1:  Do the following and time each with `%timeit` magic
* Create a "big" list of random integers (1000+) like below
* Create a sorted copy
* Sort the list inline
* Sort the list inline in reverse order
* Convert to a list of floating point numbers (like below)
* Sort the float list inline
* Sort the float list inline in reverse order
---
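NB:  Make a copy of the original list before applying an inline function; otherwise later timing runs sort a list that is already sorted. A shallow copy is enough for a list of numbers, e.g.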
```python
test = a.copy()
```
The list and float conversion:
```python
a = random.sample(range(10000), 1000)

# converted to float list
b = [float(x) for x in a]
```

Which of the above operations are fastest/slowest and why?

In [None]:
# Try your solution here

**Adding a *function* to the sorting process**
You can
* sort by built-in functions
* sort by custom functions
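
As a quick illustration of the first option, the sketch below passes the built-in `len` and the string method `str.lower` as keys. The `words` list is a made-up example, not part of the course data.

```python
people = ['Joshua Richardson', 'Wei Ling', 'Sarah McKearny', 'Matthias Strauch']

# Built-in function as key: order names by string length
by_length = sorted(people, key=len)
print(by_length)

# Built-in method as key: compare case-insensitively
words = ['banana', 'Cherry', 'apple']   # hypothetical sample data
default_order = sorted(words)            # uppercase letters sort before lowercase
ignore_case = sorted(words, key=str.lower)
print(default_order)
print(ignore_case)
```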


In [None]:
people = ['Joshua Richardson', 'Wei Ling', 'Sarah McKearny', 'Matthias Strauch']

Let's sort `people` from above by last name

In [None]:
# Our function (anonymous one here)
func = lambda s: s.split()[1]
    
func(people[0])

In [None]:
sorted(people, key = func) # does not modify the original

```python
# Does this work for the inline "sort" method?
___.sort(key = ___)
```
Copy and fill in blanks below.

In [None]:
# Try your solution here

EXERCISE 2: Sort a list of random integers by their value on the sin curve

```python
import math
import random

# HINT: create an anonymous function
#  math.sin(x) is the conversion function

a = random.sample(range(25, 50), 10)
```

In [None]:
# Try your solution here

<b>Behind the scenes of the `sorted()` function with a key</b>
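* when using a key, `sorted()` calls the key function once per element and sorts on the resulting intermediate list of key values (a proxy); the original elements are what it returns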
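
The idea can be imitated by hand with a "decorate, sort, undecorate" pattern. This is a conceptual sketch only, not how the interpreter actually implements it; the helper `last_name` and the `calls` list are names introduced here for illustration.

```python
people = ['Joshua Richardson', 'Wei Ling', 'Sarah McKearny', 'Matthias Strauch']

calls = []  # records every time the key function is invoked

def last_name(full_name):
    calls.append(full_name)
    return full_name.split()[-1]

result = sorted(people, key=last_name)
print(len(calls))  # the key function ran once per element

# Roughly equivalent manual steps:
# 1. decorate each element with its key (the index keeps ties in original order)
decorated = [(last_name(p), i, p) for i, p in enumerate(people)]
# 2. sort the decorated list
decorated.sort()
# 3. undecorate: keep only the original elements
manual = [p for _, _, p in decorated]
print(manual == result)
```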

### The `re` module
* Most letters and characters match themselves, e.g. `super` will only match 'super' in a case-sensitive manner (we can make it case-insensitive though with an optional argument)
* `re` module gives us the power to search and match specific patterns in strings
* `re` gives us functions to find all occurrences or iterate over all matches
* `re` allows us to split strings and substitute based on patterns
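
A short sketch of the first and last points: the optional `flags` argument for case-insensitive matching, and `re.sub()` for substitution. The sample strings are made up for illustration.

```python
import re

# Case-sensitive by default: no match, so search() returns None
no_match = re.search('super', 'SUPERMAN')
print(no_match)

# The optional flags argument makes the match case-insensitive
m = re.search('super', 'SUPERMAN', flags=re.IGNORECASE)
print(m.group())

# sub() replaces every match of the pattern
replaced = re.sub('colou?r', 'shade', 'color or colour')
print(replaced)
```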

<b>Regular expressions in Python</b>
* These are the characters (called metacharacters) which have special meaning in Python regex: 
`. ^ $ * + ? { } [ ] \ | ( )`
* regex basics are not going to be covered here (please reference https://docs.python.org/3/howto/regex.html for a Python regex howto)
* Here are examples of using regex with functions and methods in `re` module:

In [10]:
import re

# match() only matches at the START of the string and returns a match object (or None)
# Note: use [A-Z], not [A-z]; the range A-z also covers the punctuation between 'Z' and 'a'
p = re.compile('[A-Z][a-z]')
m = p.match('AaAa')
print(m.group())

# search() takes the regex AND the string, scans for the first match anywhere,
# and returns a match object (or None)
s = re.search('[A-Z][a-z]', 'AaAa')
print(s.group())

# findall() takes the regex AND the string and returns a list of all matches
f = re.findall('[A-Z][a-z]', 'AaAa')
print(f)

# print(f.index('Aa'))

# finditer() takes the regex AND the string and returns an iterator over all match objects
i = re.finditer('[A-Z][a-z]', 'AaAa')
for x in i:
    print((x.group(), x.span()))

Aa
Aa
['Aa', 'Aa']
('Aa', (0, 2))
('Aa', (2, 4))
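
On the string `'AaAa'` the difference between `match()` and `search()` is invisible, because the first match sits at position 0. The sketch below uses a different, made-up string to show it, and also shows why a range written as `[A-z]` matches more than letters.

```python
import re

pattern = re.compile('[A-Z][a-z]')
text = 'aAa'

# match() only tries at the beginning of the string
at_start = pattern.match(text)
print(at_start)

# search() scans forward until it finds a match
found = pattern.search(text)
print(found.group(), found.span())

# The range A-z spans ASCII 65..122, which includes [ \ ] ^ _ and `
punctuation = re.findall('[A-z]', '[_^]')
print(punctuation)
```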


EXERCISE 3:  A small one surrounded by three bodyguards on either side

---
Using the following text and the `re` module, find all sets of letters that match this pattern: three upper case, one lower case, and three more uppercase letters consecutively (e.g.  ABCdEFG).  Then print all the "small ones" (middle lowercase letters).  What does it say?

```python
text = "kAewtloIKHbWJZNhHVGxXDiQmzjfcpYbzRPBoLPDSmUbCunkfxZWDZoUZMiGqhRRiUvGmYmvnJIHEmbTMUKLECKdCthezSYBpIEl"
```

Hint:  A *regex* could look like: `[A-Z][a-z][A-Z]` to match `AbC`

In [11]:
help(re)

Help on module re:

NAME
    re - Support for regular expressions (RE).

MODULE REFERENCE
    http://docs.python.org/3.5/library/re
    
    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    This module provides regular expression matching operations similar to
    those found in Perl.  It supports both 8-bit and Unicode strings; both
    the pattern and the strings being processed can contain null bytes and
    characters outside the US ASCII range.
    
    Regular expressions can contain both special and ordinary characters.
    Most ordinary characters, like "A", "a", or "0", are the simplest
    regular expressions; they simply match themselves.  You can
    concatenate ordinary characters, so last matc

<b>Note about regex with backslashes</b>
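* When trying to match a literal backslash using an ordinary string literal as the pattern, we must write `'\\\\'`: Python's string escaping reduces it to two backslashes, which the regex engine then reads as one escaped backslash
* This is where raw strings come in handy: the same pattern becomes `r'\\'`
* Matched text comes back as an ordinary string, so a single backslash is displayed as `'\\'` when a list of results is printed
* Take the example below, where we search for sections in a LaTeX document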

In [None]:
# Here's a bit of latex (a raw string, so '\r' in '\ref' isn't read as a carriage return)
latex = r"""\section{Introduction}
Make it possible for all to write documents with \LaTeX{}!

\subsection{more introduction}
Go more in detail \ldots

\subsubsection{even more introduction}
come to the point \ldots

\paragraph{Paragraphs}
A paragraph is small but 

\subparagraph{Subparagraphs}
subparagraphs are smaller! 

\paragraph{Outline}
First we start with a little example of the article class, which is an 
important documentclass. But there would be other documentclasses like 
book \ref{book}, report \ref{report} and letter \ref{letter} which are 
described in Section \ref{documentclasses}. Finally, Section 
\ref{conclusions} gives the conclusions.



\section{Documentclasses} \label{documentclasses}"""

# Find section using a string literal
literal = '\\\\section'
found1 = re.findall(literal, latex)
print(found1)

# Find section using a raw string (prefix with 'r')
raw = r'\\section'
found2 = re.findall(raw, latex)
print(found2)
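
To confirm that the two spellings really are the same pattern, they can be compared directly. This is a small sketch; `sample` is a made-up one-line string rather than the full document above.

```python
import re

literal = '\\\\section'   # ordinary string literal: four backslashes typed
raw = r'\\section'        # raw string: two backslashes typed

# Both spellings produce the same 9-character pattern
same = (literal == raw)
print(same, len(raw))

# The pattern matches one literal backslash followed by 'section'
sample = r'\section{Intro} and \subsection{More}'
matches = re.findall(raw, sample)
print(matches)
```

Only `\section` is found: in `\subsection` the backslash is followed by `sub`, not by `section`.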

<b>Splitting</b>
* The `split()` method splits a string wherever a compiled pattern matches (it is called on the pattern object, like `match()`)
* It has an optional second argument (`maxsplit`) which limits the number of splits, counted from the beginning of the input string; the unsplit remainder is returned as the last element

In [None]:
text = 'Happy 15th Birthday!  You are 15 going on 30.'

# Create pattern which recognizes digits
p = re.compile(r'\d+')

# Use split method on pattern with string as argument
words = p.split(text)
print(words)

# Now we can use the string.format method to reconstruct a new string
newwish = '{}60{}60{}30{}'.format(words[0], words[1], words[2], words[3])
print(newwish)

In [None]:
# Example of using maxsplit with split()
text = 'grabxxfirstxthreexxxandxherexxxxisxxrest'
p = re.compile('x+')
p.split(text, maxsplit = 3)

EXERCISE 4: Phonebook
* using `split()`, parse the following text
<pre><code>"""Steve Martin: 310.222.3333 400 Holly Ave
Sandra Bullock: 512.456.1789 200 52nd St
Tommy-Lee Jones: 210.555.7777 321 Calahan Rd"""</code></pre>
to create a dictionary of dictionaries, keyed by last name, similar to this one:
```python
{'Bullock': {'address': '200 52nd St',
  'first': 'Sandra',
  'last': 'Bullock',
  'phone': '512.456.1789'},
 'Jones': {'address': '321 Calahan Rd',
  'first': 'Tommy-Lee',
  'last': 'Jones',
  'phone': '210.555.7777'},
 'Martin': {'address': '400 Holly Ave',
  'first': 'Steve',
  'last': 'Martin',
  'phone': '310.222.3333'}}
```

e.g.

```python
# Split on new lines
entries = re.split('\n', text)

# A bit fancier split - splits one entry on a space (optionally preceded by ':'), at most 3 times
re.split(":? ", entry, maxsplit=3)
```

In [None]:
# Try your solution here

---
Created by a Microsoft Employee.
	
The MIT License (MIT)<br>
Copyright (c) 2016