# Data Science Day 6

## Generators

### Generator Expressions

#### List comprehensions use square brackets, while generator expressions use parentheses

- This is a representative list comprehension:

In [1]:
[n ** 2 for n in range(12)]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121]

- While this is a representative generator expression:

In [3]:
(n ** 2 for n in range(12))

<generator object <genexpr> at 0x10898cf20>

- Notice that printing the generator expression does not print the contents
- One way to see the contents of a generator expression is to pass it to the list constructor; note that calling next(G) first consumes the initial value, so the list that follows starts at 1:

In [5]:
G = (n ** 2 for n in range(12))

In [6]:
next(G)

0

In [7]:
list(G)

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121]

#### A list is a collection of values, while a generator is a recipe for producing values

- When you create a list, you are actually building a collection of values, and there is some memory cost associated with that
- When you create a generator, you are not building a collection of values, but a recipe for producing those values
- Both expose the same iterator interface, as we can see here:

In [8]:
L = [n ** 2 for n in range(12)]
for val in L:
    print(val, end=' ')

0 1 4 9 16 25 36 49 64 81 100 121 

In [9]:
G = (n ** 2 for n in range(12))
for val in G:
    print(val, end=' ')

0 1 4 9 16 25 36 49 64 81 100 121 

- The difference is that a generator expression does not actually compute the values until they are needed
- This not only leads to memory efficiency, but to computational efficiency as well
- This also means that while the size of a list is limited by available memory, the size of a generator expression is unlimited
- An infinite iterator can be created using count, which is defined in itertools:

In [10]:
from itertools import count
count()

count(0)

In [11]:
for i in count():
    print(i, end=' ')
    if i >= 10: break

0 1 2 3 4 5 6 7 8 9 10 

- The count iterator will go on happily counting forever until you tell it to stop
- This makes it convenient to create generators that will also go on forever:

In [12]:
factors = [2, 3, 5, 7]
G = (i for i in count() if all(i % n > 0 for n in factors))
for val in G:
    print(val, end=' ')
    if val > 40: break

1 11 13 17 19 23 29 31 37 41 

- If we were to expand the list of factors appropriately, we would have the beginnings of a prime number generator based on the Sieve of Eratosthenes algorithm
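
- As a minimal sketch of the memory point above, the sizes reported by sys.getsizeof can be compared for a list and for the equivalent generator expression (the exact byte counts depend on the Python build, so only their ordering is checked)
- The same sketch uses itertools.islice, which is not used elsewhere in these notes, to take a fixed number of values from an unbounded generator without writing a manual break

```python
import sys
from itertools import count, islice

# A list stores every value up front; a generator stores only its current state.
squares_list = [n ** 2 for n in range(100_000)]
squares_gen = (n ** 2 for n in range(100_000))

list_bytes = sys.getsizeof(squares_list)  # grows with the number of elements
gen_bytes = sys.getsizeof(squares_gen)    # small, independent of the range size
print(list_bytes > gen_bytes)

# islice takes a finite slice from an infinite generator.
factors = [2, 3, 5, 7]
unbounded = (i for i in count() if all(i % n > 0 for n in factors))
first_ten = list(islice(unbounded, 10))
print(first_ten)
```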

#### A list can be iterated multiple times; a generator expression is single-use

- This is one of the potential gotchas of generator expressions
- With a list, we can straightforwardly do this:

In [13]:
L = [n ** 2 for n in range(12)]
for val in L:
    print(val, end=' ')
print()

for val in L:
    print(val, end=' ')

0 1 4 9 16 25 36 49 64 81 100 121 
0 1 4 9 16 25 36 49 64 81 100 121 

- A generator expression, on the other hand, is used up after one iteration:

In [14]:
G = (n ** 2 for n in range(12))
list(G)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121]

In [15]:
list(G)

[]

- This can be very useful because it means iteration can be stopped and started:

In [16]:
G = (n ** 2 for n in range(12))
for n in G:
    print(n, end=' ')
    if n > 30: break

print("\ndoing something in between")

for n in G:
    print(n, end=' ')

0 1 4 9 16 25 36 
doing something in between
49 64 81 100 121 

### Generator Functions: Using yield

- List comprehensions are best used to create relatively simple lists, while using a normal for loop can be better in more complicated situations
- The same is true of generator expressions: we can make more complicated generators using generator functions, which make use of the yield statement
- Here we have two ways of constructing the same list:

In [17]:
L1 = [n ** 2 for n in range(12)]
L2 = []
for n in range(12):
    L2.append(n ** 2)
print(L1)
print(L2)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121]


- Similarly, here we have two ways of constructing equivalent generators:

In [18]:
G1 = (n ** 2 for n in range(12))
def gen():
    for n in range(12):
        yield n ** 2
G2 = gen()
print(*G1)
print(*G2)

0 1 4 9 16 25 36 49 64 81 100 121
0 1 4 9 16 25 36 49 64 81 100 121


- A generator function is a function that, rather than using return to return a value once, uses yield to yield a (potentially infinite) sequence of values
- Just as in generator expressions, the state of the generator is preserved between partial iterations, but if we want a fresh copy of the generator, we can simply call the function again
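
- The two points above can be sketched with a small generator function; fibonacci is an illustrative name that does not appear elsewhere in these notes
- The sketch shows a potentially infinite sequence, state that is preserved between partial iterations, and a fresh generator obtained by calling the function again

```python
from itertools import islice

def fibonacci():
    """Yield the Fibonacci numbers indefinitely (illustrative example)."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

fib = fibonacci()
first_five = [next(fib) for _ in range(5)]   # state is kept between next() calls
next_three = list(islice(fib, 3))            # continues where the last call stopped
fresh_five = list(islice(fibonacci(), 5))    # calling the function again starts over

print(first_five, next_three, fresh_five)
```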

#### Example: Prime Number Generator

- An example of a generator function is a function to generate an unbounded series of prime numbers
- A classic algorithm for this is the Sieve of Eratosthenes, which works something like this:

In [19]:
#Generate a list of candidates
L = [n for n in range(2, 40)]
print(L)

[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]


In [20]:
#Remove all multiples of the first value
L = [n for n in L if n == L[0] or n % L[0] > 0]
print(L)

[2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39]


In [21]:
#Remove all multiples of the second value
L = [n for n in L if n == L[1] or n % L[1] > 0]
print(L)

[2, 3, 5, 7, 11, 13, 17, 19, 23, 25, 29, 31, 35, 37]


In [22]:
#Remove all multiples of the third value
L = [n for n in L if n == L[2] or n % L[2] > 0]
print(L)

[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]


- If we repeat this procedure enough times on a large enough list, we can generate as many primes as we wish
- Let's encapsulate this logic in a generator function:

In [23]:
def gen_primes(N):
    """Generate primes less than N"""
    primes = set()
    for n in range(2, N):
        if all(n % p > 0 for p in primes):
            primes.add(n)
            yield n
            
print(*gen_primes(100))

2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97
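
- The function above stops below N; a truly unbounded variant can be sketched by replacing range with itertools.count
- This is only one possible approach: gen_primes_unbounded is a hypothetical name, and like gen_primes it tests each candidate by division against the primes found so far rather than crossing out multiples in a table

```python
from itertools import count, islice

def gen_primes_unbounded():
    """Yield primes indefinitely by trial division against earlier primes."""
    primes = []  # a list keeps the primes in increasing order
    for n in count(2):
        # Only primes p with p * p <= n need to be tested as divisors.
        if all(n % p > 0 for p in primes if p * p <= n):
            primes.append(n)
            yield n

# The caller decides how many primes to take.
first_ten_primes = list(islice(gen_primes_unbounded(), 10))
print(first_ten_primes)
```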


## Regular Expression Operations

- Regular expressions use the backslash character ('\\') to indicate special forms or to allow special characters to be used without invoking their special meaning
- This collides with Python's usage of the same character for the same purpose in string literals:
    - For example, to match a literal backslash, one might have to write '\\\\\\\\' as the pattern string, because the regular expression must be \\\\, and each backslash must be expressed as \\\\ inside a regular Python string literal
- The solution is to use Python's raw string notation for regular expression patterns
- Backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n' while "\n" is a one-character string containing a newline
- Usually patterns will be expressed in Python code using this raw string notation.
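
- The difference between raw and regular string literals can be checked directly; in this small sketch the sample value path is made up for illustration
- Both pattern spellings below describe the same regular expression, one that matches a single literal backslash

```python
import re

# A raw string keeps the backslash; a regular string interprets the escape.
raw_len = len(r"\n")      # two characters: a backslash and 'n'
plain_len = len("\n")     # one character: a newline

path = r"C:\temp"         # hypothetical sample text containing one backslash

# The raw form needs two backslashes, the regular form needs four.
with_raw = re.findall(r"\\", path)
with_plain = re.findall("\\\\", path)

print(raw_len, plain_len, with_raw == with_plain)
```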

### Regular Expressions

- We have already seen that we can ask whether a string str begins with some substring as follows: str.startswith('Apple')
- If we would like to know whether it starts with "Apple" or "apple", we would have to call the startswith method twice
- Regular expressions offer a simpler solution: re.match(r"[Aa]pple", str)
- The bracket notation is one example of the special syntax of regular expressions
- In this case, it says that any of the characters inside brackets will do: either "A" or "a"
- The other letters in "pple" will act normally
- The string r"[Aa]pple" is called a pattern

- A more complicated example asks whether the string str starts with either apple or banana (no matter if the first letter is capital or not): re.match(r"[Aa]pple|[Bb]anana", str)
- In this example, we saw a new special character | that denotes an alternative
- On either side of the bar character we have a subpattern

- A legal name in Python starts with a letter or an underscore character, and the following characters can also be digits
- So legal names are, for instance: _hidden, L_value, A123_
- But the name 2abc is not a valid variable name
- Let's see what would be the regular expression pattern to recognize valid variable names: r"[A-Za-z_][A-Za-z_0-9]*\Z"
- Here we have used a shorthand for character ranges: A-Z
    - This means all the characters from A to Z

- The first character of the variable name is defined in the first brackets
- The subsequent characters are defined in the second brackets
- The special character * means that we allow any number (0, 1, 2, ...) of the previous subpattern
- For example, the pattern r"ba*" allows strings "b", "ba", "baa", "baaa", and so on
- The special syntax \Z denotes the end of the string
- Without it we would also accept abc- as a valid name since the match function normally checks only that a string starts with a pattern
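
- The variable-name pattern can be tried out with re.match; is_valid_name is a hypothetical helper written only for this sketch
- Note that the pattern checks the spelling only, so reserved words such as "for" would also pass

```python
import re

name_pattern = r"[A-Za-z_][A-Za-z_0-9]*\Z"

def is_valid_name(candidate):
    """Return True if candidate matches the variable-name pattern."""
    return re.match(name_pattern, candidate) is not None

candidates = ["_hidden", "L_value", "A123_", "2abc", "abc-"]
results = {name: is_valid_name(name) for name in candidates}
print(results)

# Without \Z only the beginning of the string is checked.
prefix_only = re.match(r"[A-Za-z_][A-Za-z_0-9]*", "abc-") is not None
print(prefix_only)
```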

- The special notations, like \Z, also cause problems with string handling
- Remember that normally in string literals we have some special notation: \n stands for newline, \t stands for tab, and so on
- So, both string literals and regular expressions use similar looking notations, which can create serious confusion
- This can be solved by using the so-called raw strings
- We denote a raw string by having an r letter before the first quotation mark, for example r"ab*\Z"
- When using raw strings, the newline (\n), tab (\t), and other special string literal notations aren't interpreted
- One should always use raw strings when defining regular expression patterns

### Patterns

- A pattern represents a set of strings
- This set can even be potentially infinite
- They can be used to describe a set of strings that have some commonality; some regular structure

- In patterns, normal characters (letters, numbers) just represent themselves, unless preceded by a backslash, which may trigger some special meaning
- Many punctuation characters have a special meaning unless preceded by a backslash (\\), which removes their special meaning
- Use \\\\ to represent a backslash character without any special meaning

- We have already seen that a | character denotes alternatives
- For example, the pattern r"Get (on|off|ready)" matches the following strings: "Get on", "Get off", "Get ready"
- We can use parentheses to create groupings inside a pattern: r"(ab)+" will match the strings "ab", "abab", "ababab", and so on
- These groups are also given a reference number starting from 1
- We can refer to groups using backreferences: \number
- For example, we can find separated patterns that get repeated: r"([a-z]{3,}) \1 \1"
- This will recognize, for example, the following strings: "aca aca aca", "turn turn turn", but not the strings "aca aba aca" or "ac ac ac"
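
- The grouping and backreference examples above can be checked with a short sketch; the variable names are illustrative

```python
import re

# Alternatives inside a group: group(1) holds the alternative that matched.
chosen = re.match(r"Get (on|off|ready)", "Get ready").group(1)

# A repeated group: fullmatch requires the whole string to fit the pattern.
repeated = re.fullmatch(r"(ab)+", "ababab") is not None

# The backreference \1 must repeat exactly what group 1 matched.
repeat_pattern = r"([a-z]{3,}) \1 \1"
samples = ["aca aca aca", "turn turn turn", "aca aba aca", "ac ac ac"]
matched = [s for s in samples if re.search(repeat_pattern, s)]

print(chosen, repeated, matched)
```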

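- We can shorten our previous variable name example to r'[a-zA-Z_]\w*\Z', since \w matches any letter, digit, or underscore (for str patterns, \w also accepts non-ASCII letters and digits unless the re.ASCII flag is used)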

- The patterns \A, \b, \B, and \Z will all match an empty string, but in specific places
- The patterns \A and \Z will recognize the beginning and end of the string, respectively
- Note that the patterns ^ and \$ can in some cases also match after a newline and before a newline, respectively
    - So \A is distinct from ^, and \Z is distinct from $
- The pattern \b matches at the start or end of a word
- The pattern \B does the reverse: it matches everywhere except at the start or end of a word
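
- A small sketch of these position patterns, assuming the re.MULTILINE flag as one of the cases in which ^ matches after a newline; the sample strings are made up

```python
import re

text = "first line\nsecond line"

# Without flags, ^ matches only at the very start of the string.
starts_default = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches right after each newline; \A still does not.
starts_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)
starts_absolute = re.findall(r"\A\w+", text, flags=re.MULTILINE)

phrase = "back comeback"
# \b requires a word boundary, \B requires that there is none.
at_boundary = [m.start() for m in re.finditer(r"\bback\b", phrase)]
inside_word = [m.start() for m in re.finditer(r"\Bback\b", phrase)]

print(starts_default, starts_multiline, starts_absolute)
print(at_boundary, inside_word)
```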

### Match and search functions

- The function re.search allows matching any substring of a string, not just its beginning
- Example: re.search(r'\bback\b', s) will match strings "back", "a back, is a body part", "get back", but it will not match the strings "backspace" or "comeback"

- The function re.search finds only the first occurrence
- We can use the re.findall function to find all occurrences
- Let's say we want to find all present participle words in a string s
- Present participles end with 'ing'
- The function call would look like this: re.findall(r'\w+ing\b', s)
- Let's try running this:

In [26]:
import re
s = "Doing things, going home, staying awaking, sleeping later"
re.findall(r'\w+ing\b', s)

['Doing', 'going', 'staying', 'awaking', 'sleeping']

- Let's say we want to pick up all the integers from a string
- We can try that with the following function call:

In [27]:
re.findall(r'[+-]?\d+', "23 + -24 = -1")

['23', '-24', '-1']

- Suppose we are given a string of if/then sentences, and we would like to extract the conditions from these sentences:

In [29]:
s = ("If I'm not in a hurry, then I should stay. " + "On the other hand, if I leave, then I can sleep.")
re.findall(r'[Ii]f (.*), then', s)

["I'm not in a hurry, then I should stay. On the other hand, if I leave"]

- But the desired result is ["I'm not in a hurry", "I leave"]
    - That is, the condition from both sentences

- The problem is that the pattern .\* tries to match as many characters as possible
- This is called greedy matching
- One way of solving this problem is to notice that the two sentences are separated by a full-stop (.)
- So, instead of matching all the characters, we need to match everything but the dot character
- This can be achieved by using the complement character class: [^.]
- The hat character (^) at the beginning of a character class means the complement character class

- After the modification, the function call looks like this: re.findall(r'[Ii]f ([^.]\*), then', s)
- Another way of solving this problem is to use non-greedy matching
- The repetition specifiers +, \*, ?, and {m,n} have corresponding non-greedy versions: +?, \*?, ??, and {m,n}?
- These expressions use as few characters as possible to make the whole pattern match some substring
- By using non-greedy versions, the function call looks like this: re.findall(r'[Ii]f (.\*?), then', s)
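
- Running the greedy pattern and both fixes side by side on the same string shows the difference; the variable names are illustrative

```python
import re

s = ("If I'm not in a hurry, then I should stay. "
     "On the other hand, if I leave, then I can sleep.")

greedy = re.findall(r"[Ii]f (.*), then", s)       # one long match across both sentences
no_dots = re.findall(r"[Ii]f ([^.]*), then", s)   # a match cannot cross a full stop
lazy = re.findall(r"[Ii]f (.*?), then", s)        # as few characters as possible

print(len(greedy))
print(no_dots)
print(lazy)
```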

### Functions in the re module

- re.match(pattern, str)
- re.search(pattern, str)
- re.findall(pattern, str)
- re.finditer(pattern, str)
- re.sub(pattern, replacement, str, count=0)

- Functions match and search return a match object, or None if the pattern does not match
- A match object describes the found occurrence
- The function findall returns a list of all the occurrences of the pattern
- The elements in the list are strings (or tuples of strings if the pattern contains more than one group)
- The function finditer works like findall except that instead of returning a list, it returns an iterator whose items are match objects
- The function sub replaces all the occurrences of the pattern in str (or at most count of them if count is nonzero) with the string replacement and returns the new string
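
- The difference between findall and finditer can be sketched as follows; because finditer yields match objects, the position of each match is available as well as its text

```python
import re

s = "Doing things, going home, staying awaking, sleeping later"

words = re.findall(r"\w+ing\b", s)   # a list of strings

# Each item from finditer is a match object with group() and start().
found = [(m.group(), m.start()) for m in re.finditer(r"\w+ing\b", s)]

for word, start in found:
    print(start, word)
```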

- The following program will replace all "she" words with "he"

In [30]:
import re
# Named text rather than str to avoid shadowing the built-in str type
text = "She goes where she wants to, she's a sheriff."
newstr = re.sub(r'\b[Ss]he\b', 'he', text)
print(newstr)

he goes where he wants to, he's a sheriff.


- The sub function can also use backreferences to refer to the matched string
- The backreferences \1, \2 and so on, refer to the groups of the pattern, in order:

In [31]:
import re
text = """He is the president of Russia
He's a powerful man."""
# count=1 limits the substitution to the first occurrence
newstr = re.sub(r'(\b[Hh]e\b)', r'\1 (Putin)', text, count=1)
print(newstr)

He (Putin) is the president of Russia
He's a powerful man.


#### Match object

- Functions match, search, and finditer use match objects to describe the found occurrence
- The method groups() of the match object returns the tuple of all the substrings matched by the groups of the pattern
- Each pair of parentheses in the pattern creates a new group
- These groups are referred to by indices 1, 2, ...
- The group 0 is a special one: it refers to the match created by the whole pattern

In [33]:
mo = re.search(r'\d+ (\d+) \d+ (\d+)','first 123 45 67 890 last')

- The call mo.groups() returns a tuple ('45', '890')
- We can access just some individual groups by using the method group(gid, ...)
- For example, the call mo.group(1) will return '45'
- The zeroth group will represent the whole match: '123 45 67 890'
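
- The calls described above can be run together in one sketch; span() is an additional match-object method, not mentioned above, that reports where the whole match lies in the string

```python
import re

mo = re.search(r"\d+ (\d+) \d+ (\d+)", "first 123 45 67 890 last")

all_groups = mo.groups()   # every group as a tuple
first = mo.group(1)        # a single group
both = mo.group(1, 2)      # several ids return a tuple
whole = mo.group(0)        # the whole match
position = mo.span()       # (start, end) indices of the whole match

# search returns None when nothing matches, so check before calling group().
missing = re.search(r"\d+", "no digits here")

print(all_groups, first, both, whole, position, missing)
```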

### Miscellaneous stuff

- If the same pattern is used in many function calls, it may be wise to precompile the pattern, mainly for efficiency reasons
- This can be done using the compile(pattern, flags=0) function in the re module
- The function returns a compiled pattern object (an instance of re.Pattern), sometimes called an RE object
- The pattern object has method versions of the functions found in module re
- The only difference is that the first parameter is not the pattern since the precompiled pattern is stored in the pattern object

- The details of match operation can be specified using optional flags
- These flags can be given either inside the pattern or as a parameter to the compile function
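
- A minimal sketch of precompiling and of the two ways to give a flag, using re.IGNORECASE as the example flag; the pattern and sample strings are made up for illustration

```python
import re

# The flag is passed as a parameter to compile.
fruit = re.compile(r"apple|banana", flags=re.IGNORECASE)

# Method versions take the string first; the pattern is stored in the object.
first = fruit.match("Banana bread")
everything = fruit.findall("Apple, BANANA, cherry, apple")

# The same flag given inside the pattern with the inline (?i) notation.
inline = re.findall(r"(?i)apple|banana", "Apple, BANANA, cherry, apple")

print(first.group(), everything, inline)
```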

## String Manipulation and Regular Expressions

- Strings in Python can be defined using either single or double quotations (they are functionally equivalent):

In [34]:
x = 'a string'
y = "a string"
x == y

True

- In addition, it is possible to define multi-line strings using a triple-quote syntax:

In [35]:
multiline = """
one
two
three
"""
print(multiline)


one
two
three



### Simple String Manipulation in Python

- For basic manipulation of strings, Python's built-in string methods can be extremely convenient

#### Formatting strings: Adjusting case

- Python makes it quite easy to adjust the case of a string

In [36]:
fox = "tHe qUICK bROWn fOx."

- To convert the entire string into upper-case or lower-case, you can use the upper() or lower() methods respectively:

In [37]:
fox.upper()

'THE QUICK BROWN FOX.'

In [38]:
fox.lower()

'the quick brown fox.'

- A common formatting need is to capitalize just the first letter of each word, or perhaps the first letter of each sentence
- This can be done with the title() and capitalize() methods:

In [39]:
fox.title()

'The Quick Brown Fox.'

In [40]:
fox.capitalize()

'The quick brown fox.'

- To swap the case of every letter, use the swapcase() method:

In [41]:
fox.swapcase()

'ThE Quick BrowN FoX.'

#### Formatting strings: Adding and removing spaces

- Another common need is to remove spaces (or other characters) from the beginning or end of the string
- The basic method of removing characters is the strip() method, which strips whitespace from the beginning and end of the line:

In [42]:
line = '       this is the content       '
line.strip()

'this is the content'

- To remove whitespace from just the right or left, use rstrip() or lstrip() respectively:

In [43]:
line.rstrip()

'       this is the content'

In [44]:
line.lstrip()

'this is the content       '

- To remove characters other than whitespace, you can pass the desired characters to the strip() method:

In [45]:
num = "0000000000000435"
num.strip('0')

'435'

- The opposite of this operation, adding spaces or other characters, can be accomplished using the center(), ljust(), and rjust() methods

In [49]:
line.center(30)

'       this is the content       '

- All these methods additionally accept an optional fill character, which is used in place of spaces to pad the string
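
- The padding methods are easier to see on a short string; in this sketch the value word and the widths are chosen only for illustration
- When the requested width is not larger than the string's length, these methods return the string unchanged
- zfill(), which is not mentioned above, is a built-in shortcut for padding with zeros on the left

```python
word = "435"

centered = word.center(9)        # padded with spaces on both sides
left = word.ljust(9, "-")        # padded on the right with a fill character
right = word.rjust(9, "0")       # padded on the left with a fill character
zero_filled = word.zfill(9)      # same result as rjust(9, "0") here

# A width shorter than the string leaves it unchanged.
unchanged = "this is the content".center(5)

print(repr(centered), left, right, zero_filled, unchanged)
```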

#### Finding and replacing substrings

- If you want to find occurrences of a certain character or substring in a string, the find() / rfind(), index() / rindex() and replace() methods are the best built-in methods
- find() and index() are very similar in that they search for the first occurrence of a character or substring within a string, and return the index of the substring:

In [51]:
line = 'the quick brown fox jumped over a lazy dog'
line.find('fox')

16

In [52]:
line.index('fox')

16

- The only difference between find() and index() is their behavior when the search string is not found: find() returns -1, while index() raises a ValueError:

In [53]:
line.find('bear')

-1

In [54]:
line.index('bear')

ValueError: substring not found
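
- The rfind() and replace() methods mentioned above can be sketched on the same line; the replacement strings are arbitrary examples
- The sketch also shows one way to handle the ValueError raised by index()

```python
line = "the quick brown fox jumped over a lazy dog"

first_o = line.find("o")     # searches from the left
last_o = line.rfind("o")     # searches from the right

# replace() returns a new string with every occurrence replaced.
swapped = line.replace("brown", "red")
all_replaced = line.replace("o", "--")

# index() raises ValueError, so it can be wrapped in try/except.
try:
    bear_at = line.index("bear")
except ValueError:
    bear_at = None

print(first_o, last_o, bear_at)
print(swapped)
print(all_replaced)
```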

#### Splitting and partitioning strings

- If you would like to find a substring and then split the string based on its location, the partition() and/or split() methods are what you're looking for: both will return a sequence of substrings
- The partition() method returns a tuple with three elements: the substring before the first instance of the split-point, the split-point itself, and the substring after:

In [55]:
line.partition('fox')

('the quick brown ', 'fox', ' jumped over a lazy dog')

- The rpartition() method is similar, but searches from the right of the string
- The split() method is more useful: it finds all instances of the split-point and returns the substrings in between
- The default is to split on any whitespace, returning a list of the individual words in a string:

In [56]:
line.split()

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']

- A related method is splitlines(), which splits on newline characters

In [57]:
haiku = """matsushima-ya
aah matsushima-ya
matsushima-ya"""

haiku.splitlines()

['matsushima-ya', 'aah matsushima-ya', 'matsushima-ya']

- Note that if you would like to undo a split(), you can use the join() method, which returns a string built from a split-point and an iterable:

In [58]:
'--'.join(['1', '2', '3'])

'1--2--3'

- A common pattern is to use the special character "\n" (newline) to join together lines that have been previously split and recover the input:

In [59]:
print("\n".join(['matsushima-ya', 'aah matsushima-ya', 'matsushima-ya']))

matsushima-ya
aah matsushima-ya
matsushima-ya


### Format Strings

- Another use of string methods is to manipulate string representations of values of other types
- Simple string representations can be found using the str() function
- For more control, the format() method inserts values into a string at positions marked by curly braces ({})
- Inside the {} marker, you can also include information on exactly what you would like to appear there
- If you include a number, it will refer to the index of the argument to insert:

In [62]:
"""First letter: {0}. Last letter: {1}.""".format('A', 'Z')

'First letter: A. Last letter: Z.'

- If you include a string, it will refer to the key of any keyword argument
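
- A short sketch of index and keyword markers; the keyword names first and last are made up for illustration
- The format codes after the colon and the f-string on the last line go slightly beyond what is described above and are included only as a pointer

```python
# Numbers refer to positional arguments, names to keyword arguments.
by_index = "First letter: {0}. Last letter: {1}.".format("A", "Z")
by_name = "First letter: {first}. Last letter: {last}.".format(last="Z", first="A")

# A format code after the colon controls how the value is rendered.
rounded = "pi = {0:.3f}".format(3.14159265)   # three digits after the decimal point
padded = "{0:>8}|".format("right")            # right-aligned in a field of width 8

# An f-string places the expression directly inside the braces.
first, last = "A", "Z"
with_fstring = f"First letter: {first}. Last letter: {last}."

print(by_index, by_name, rounded, padded, with_fstring, sep="\n")
```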