![DSB Logo](img/Dolan.jpg)
# Python Data Types: Dictionaries
## PY4E Chapter 9
### How data are stored and processed in Python

# Dicts in General

- Dicts are somewhat similar to lists
    - in lists, _positions_ of elements are marked by _integers_
    - in dicts, _positions_ of elements are marked by (_almost_) any data type, as long as it is _hashable_ (e.g. strings, numbers, and tuples of these)
    
- A dictionary is a mapping between a set of indices (_keys_) and a set of _values_
    - each key in a dict is mapped to a value
        - this type of mapping is called a _key-value pair_ or an _item_
        
```
dict -> {key : value}
```
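
As a small illustration of the "(almost) any data type" rule for keys, the sketch below mixes several key types in one dict and then tries a list as a key. The dict name `mixed` and its contents are made up for this example.

```python
# Keys of several hashable types can live in the same dict
mixed = {'one': 1, 2: 'two', (3, 4): 'a tuple key', 5.0: 'a float key'}
print(mixed[(3, 4)])

# A list is mutable and therefore unhashable, so it cannot be a key
try:
    bad = {[1, 2]: 'a list key'}
except TypeError as err:
    error_message = str(err)
    print('TypeError:', error_message)
```

Values have no such restriction: any object, including a list or another dict, can be stored as a value.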

In [1]:
sample_dict = dict()
# This will print out an empty dict `{}`
print(sample_dict)

{}


In [2]:
# another way of defining a dict
sample_dict2 = {}
print(sample_dict2)

{}


In [3]:
# to add or update an item in a dict
# in the following item, `'one'` is the key, and `1` is the value
sample_dict['one'] = 1
print(sample_dict)

{'one': 1}


In [6]:
# You can define a dictionary with multiple items
sample_dict2 = {'one': 1, 'two': 2, 'three': 3}
print(sample_dict2)

{'one': 1, 'two': 2, 'three': 3}


In [7]:
# you can always access the value in a dict by using its key
sample_dict2['one']

1

In [8]:
# however, if the key is not in the dict, Python raises a `KeyError`
sample_dict2['four']

KeyError: 'four'

In [9]:
# also, the reverse-lookup (using value to look up key) does not work
sample_dict2[1]

KeyError: 1

In [10]:
# you can find the number of key-value pairs using `len()`
len(sample_dict2) # 3

3

In [11]:
# of course an empty dict will return a zero
len({})

0

In [12]:
# you can use the `in` operator 
# however this works on keys but not values
'one' in sample_dict2

True

In [13]:
# but this will not work on values (`1` is a value here, not a key)
1 in sample_dict2

False

In [14]:
# but you can use the `.values()` method to retrieve the values
vals = sample_dict2.values()
1 in vals

True

# Dictionary as a Set of Counters

- Suppose you are given a string and want to count how many times each letter appears. There are several ways to do it:
    - You could create 26 variables, one for each letter of the alphabet. Then you could traverse the string and, for each character, increment the corresponding counter, probably using a chained conditional.
    - You could create a list with 26 elements. Then you could convert each character to a number (using the built-in function ord), use the number as an index into the list, and increment the appropriate counter.
    - You could create a dictionary with characters as keys and counters as the corresponding values. The first time you see a character, you would add an item to the dictionary. After that you would increment the value of an existing item.
        - the advantage here is that you only reserve space for the letters that do appear, not for the letters that do not
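
For comparison with the dictionary approach, here is a minimal sketch of the second option (a list of 26 counters indexed through `ord`). It assumes the text contains only lowercase letters `a` to `z`; the variable name `letter_counts` is made up for this example.

```python
word = 'brontosaurus'

# one counter per letter of the alphabet, all starting at zero
letter_counts = [0] * 26

for c in word:
    # ord('a') maps to index 0, ord('b') to index 1, and so on
    letter_counts[ord(c) - ord('a')] += 1

# report only the letters that actually appear
for i, n in enumerate(letter_counts):
    if n > 0:
        print(chr(ord('a') + i), n)
```

Note that the list always holds 26 counters, even though only 8 different letters appear in the word.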

In [15]:
word = 'brontosaurus' 
d = dict()
for c in word:
    if c not in d: 
        d[c] = 1
    else:
        d[c] = d[c] + 1
print(d)

{'b': 1, 'r': 2, 'o': 2, 'n': 1, 't': 1, 's': 2, 'a': 1, 'u': 2}


In [16]:
# you can use the `.get()` method to retrieve a value from a dict
# the second argument is the default returned when the key is missing
counts = { 'chuck' : 1 , 'annie' : 42, 'jan': 100}
print(counts.get('jan', 0)) # 100
print(counts.get('tim', 0)) # 0

100
0


In [17]:
# So with this method, we can rewrite the counter code as below
# this is the preferred method because it avoids the explicit if/else check
d = dict()
for c in word:
    d[c] = d.get(c,0) + 1
print(d)

{'b': 1, 'r': 2, 'o': 2, 'n': 1, 't': 1, 's': 2, 'a': 1, 'u': 2}


# Traversing the Dictionary

- Like lists and strings, dicts are collections, so we can use a `for` loop to traverse them
    - you can go through the keys
    - or you can go through key-value pairs (__preferred method__)
        - in Python, each _key-value pair_ returned by `.items()` is a _tuple_ (see chapter 10 in PY4E for more details)

In [18]:
# iterate through using keys
for key in d: # d is the counter dict we created
    print(key, d[key])

b 1
r 2
o 2
n 1
t 1
s 2
a 1
u 2


In [19]:
# iterate through dictionary using key-value pairs
for k,v in d.items(): # k-key, v-value
    print(k,v)

b 1
r 2
o 2
n 1
t 1
s 2
a 1
u 2


In [20]:
# you can always add conditions to the for loop
# when iterating through a dict
# for instance, we want the letters that appear more than once
# in the word 'brontosaurus' 
for k,v in d.items(): # k-key, v-value
    if v > 1:
        print(k,v)

r 2
o 2
s 2
u 2


# Sorting the Dictionaries

- Unlike lists, dictionaries have no `.sort()` method, but we can sort their _keys_ and _values_
    - _keys_ and _values_ are list-like views that can be converted to lists
        - in practice, sorting by _values_ is often more useful than sorting by _keys_

In [23]:
# sorting by keys is easy

# extract keys as a list
key_lst = list(d.keys())

# sort a list
key_lst.sort()

# traverse through the sorted list
for key in key_lst:
    print(key, d[key])

a 1
b 1
n 1
o 2
r 2
s 2
t 1
u 2


# Sorting the Dictionaries

- Sorting a dictionary by values is not as direct and easy
    - essentially, you still sort the keys
    - what you need to do is to _reverse_ the dict, so that the values become the keys
    - the problem is that keys are _unique_ but values are not, so if the dict has duplicate _values_, some items are lost in the reversed dict

In [25]:
# reverse the `counts` dict
rev_counts = {v:k for k,v in counts.items()}
# get keys from the `rev_counts` dict
# keep in mind that these keys are the values in the original dict
val_lst = list(rev_counts.keys())
# sort the value list
val_lst.sort()
for val in val_lst:
    # reason we print it out this way is that 
    # we need to reverse it back to the original (key, value) pairs
    print(rev_counts[val], val)

chuck 1
annie 42
jan 100
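
The reversed-dict approach works for `counts` because its values are all different. When values repeat, as in the letter counter `d`, one possible alternative is to sort the `(key, value)` tuples directly with `sorted()` and a `key` function. This is a sketch of a different technique, not the method used in the cell above; the name `by_value` is made up for this example.

```python
# the counter dict built from 'brontosaurus' has many duplicate values
d = {'b': 1, 'r': 2, 'o': 2, 'n': 1, 't': 1, 's': 2, 'a': 1, 'u': 2}

# sort the (key, value) tuples by the value (item[1]), largest first
by_value = sorted(d.items(), key=lambda item: item[1], reverse=True)

for k, v in by_value:
    print(k, v)
```

Because nothing is used as a dict key during the sort, no item is lost when two letters share the same count.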


# Dictionary Operations

- Like other data types we have covered so far, dictionaries have certain operations
    - keep in mind that dict items are __only__ accessible using __keys__
    - dictionaries are mutable, which means that you can _update_ values in dict items
    - you can also use the `del` operator to delete an item from a dictionary

In [26]:
inventory = {"apples": 430, "bananas": 312, "oranges": 525, "pears": 217}
del inventory["pears"]
print(inventory)

{'apples': 430, 'bananas': 312, 'oranges': 525}


In [27]:
inventory["bananas"] += 200
print(inventory)

{'apples': 430, 'bananas': 512, 'oranges': 525}
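
As a complement to `del`, the sketch below uses the dict method `.pop()`, which removes an item and hands back its value, and shows what happens when the key is missing. The key `"kiwis"` is a made-up example of a key that is not in the dict.

```python
inventory = {"apples": 430, "bananas": 312, "oranges": 525, "pears": 217}

# .pop() removes the item and returns its value
removed = inventory.pop("pears")
print(removed)

# with a default as second argument, a missing key does not raise an error
missing = inventory.pop("kiwis", None)
print(missing)

# `del` on a missing key raises a KeyError
try:
    del inventory["kiwis"]
except KeyError as err:
    print("KeyError:", err)
```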


# Aliasing and Copying

- Since dicts are mutable, we need to be careful of _aliasing_
    - keep in mind that aliasing creates a second name for the same object
    - thus, a change made through the alias also changes the original dict

- If you want to change a version of the dict, without affecting the original:
    - you need to create a copy of the dict
    - since the copy is a different object, changing the copy will _not_ change the original

In [28]:
opposites = {"up": "down", "right": "wrong", "yes": "no"}
alias_dict = opposites
copy_dict = opposites.copy() 

In [29]:
# this will change the dict value to 'left'
alias_dict['right'] = 'left'
opposites['right']

'left'

In [30]:
# this will not change the dict value - it remains as 'left'
copy_dict['right'] = 'privilege'
opposites['right']

'left'
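
One way to check whether two names refer to the same object is the `is` operator. The sketch below also shows a caveat: `.copy()` makes a _shallow_ copy, so mutable values inside the dict are still shared, while `copy.deepcopy()` from the standard library copies them as well. The dict `baskets` is a made-up example.

```python
import copy

opposites = {"up": "down", "right": "wrong", "yes": "no"}
alias_dict = opposites
copy_dict = opposites.copy()

print(alias_dict is opposites)  # True: two names, one object
print(copy_dict is opposites)   # False: a separate object
print(copy_dict == opposites)   # True: same content

# hypothetical dict whose value is a mutable list
baskets = {"fruit": ["apple"]}
shallow = baskets.copy()
deep = copy.deepcopy(baskets)

# the shallow copy shares the inner list with the original
shallow["fruit"].append("pear")
print(baskets["fruit"])  # ['apple', 'pear']
print(deep["fruit"])     # ['apple']
```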

# Sparse Matrices

- In data analytics, we often deal with sparse matrices
    - sparse matrices are matrices in which most entries are `0`
    - we can use a list of lists to represent such a matrix
    - but this wastes memory, since the `0`s carry little information
    - the alternative is to use a dictionary that stores only the non-zero entries
    
$$ \begin{equation*}
\begin{bmatrix}
\ 0 & 0 & 0 & \mathbf{1} & 0 \\
\ 0 & 0 & 0 & 0 & 0 \\
\ 0 & \mathbf{2} & 0 & 0 & 0 \\
\ 0 & 0 & 0 & 0 & 0 \\
\ 0 & 0 & 0 & \mathbf{3} & 0 \\
\end{bmatrix}
\end{equation*} $$

In [31]:
# the following list of lists represents the matrix above

matrix = [[0, 0, 0, 1, 0],
          [0, 0, 0, 0, 0],
          [0, 2, 0, 0, 0],
          [0, 0, 0, 0, 0],
          [0, 0, 0, 3, 0]]


# the same matrix can be represented using this dictionary
matrix = {(0, 3): 1, (2, 1): 2, (4, 3): 3} # all zero values are ignored

# compared to the list method, the dict method only keeps the three non-zero values
# in the dict method, the keys are the positions (row, column) of the non-zero values

In [33]:
# retrieving a non-zero value is easy
matrix[(0, 3)]

1

In [35]:
# retrieving a zero value is trickier
# we get a KeyError since there is no such key in the dict
matrix[(1, 3)]

KeyError: (1, 3)

In [36]:
# we can always use the `.get()` method for that purpose
# the first argument is the key
# the second argument is the value to return
# if the key is not in the dict
matrix.get((1, 3), 0)

0
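
To connect the two representations, here is a minimal sketch that builds the sparse dict from the list of lists and then rebuilds the full matrix with `.get()`. The names `dense`, `sparse`, and `rebuilt` are made up for this example.

```python
dense = [[0, 0, 0, 1, 0],
         [0, 0, 0, 0, 0],
         [0, 2, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 3, 0]]

# keep only the non-zero entries, keyed by (row, column)
sparse = {(r, c): value
          for r, row in enumerate(dense)
          for c, value in enumerate(row)
          if value != 0}
print(sparse)

# rebuild the full matrix; missing positions default to 0
n_rows, n_cols = len(dense), len(dense[0])
rebuilt = [[sparse.get((r, c), 0) for c in range(n_cols)]
           for r in range(n_rows)]
print(rebuilt == dense)
```

Note that the dict alone does not record the matrix size, so the number of rows and columns has to be kept separately.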

# Creating Dictionaries from Lists

- Since dict keys and values are both collections, we can create a dict from two lists:
    - you will use the `zip()` built-in function
        - here `zip()` takes two arguments, one for each list
        - the first list provides the keys, the second list provides the values
        - you need to make sure that the first list does not contain __duplicate__ values, because a repeated key overwrites the earlier item

In [37]:
# values
lst1 = [1, 2, 3]
# keys
lst2 = ['a', 'b', 'c']
my_dict = dict(zip(lst2, lst1))
my_dict

{'a': 1, 'b': 2, 'c': 3}
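
The sketch below illustrates two things to watch for when building a dict with `zip()`: a duplicate key keeps only the last value, and `zip()` stops at the end of the shorter list. All list contents here are made up for this example.

```python
# duplicate key 'a': the later value (3) overwrites the earlier one (1)
keys = ['a', 'b', 'a']
values = [1, 2, 3]
dup_dict = dict(zip(keys, values))
print(dup_dict)  # {'a': 3, 'b': 2}

# lists of different lengths: 'z' has no partner and is silently dropped
short_dict = dict(zip(['x', 'y', 'z'], [10, 20]))
print(short_dict)  # {'x': 10, 'y': 20}
```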

# Your Turn Here
Finish the exercises below by following the instructions for each of them.

Make sure you provide proper __pseudo code__ for each of your programs.

## Q1. Coding Problem

Write a function that calculates the sales of each product (price multiplied by units sold).

Example inputs and output:
```python
price_dict = {'A': 100, 'B': 200, 'C': 300}
unit_dict = {'A': 1, 'B': 0, 'C': 5}
-> sales_dict = {'A': 100, 'B': 0, 'C': 1500}
```

## Q2. Coding Problem

In cryptography, a *Caesar* cipher is a very simple encryption technique in which each letter in the plain text is replaced by a letter some fixed number of positions down the alphabet.

For example, with a shift of **3**:
- A would be replaced by D, 
- B would become E, and so on. 

The method is named after Julius Caesar, who used it to communicate with his generals. 

**ROT-13** ("rotate by 13 places") is a widely used example of a Caesar cipher where the shift is _13_. In Python, the key for ROT-13 may be represented by means of the following dictionary:

```python
key = {'a':'n', 'b':'o', 'c':'p', 'd':'q', 'e':'r', 'f':'s', 'g':'t', 'h':'u', 
       'i':'v', 'j':'w', 'k':'x', 'l':'y', 'm':'z', 'n':'a', 'o':'b', 'p':'c', 
       'q':'d', 'r':'e', 's':'f', 't':'g', 'u':'h', 'v':'i', 'w':'j', 'x':'k',
       'y':'l', 'z':'m', 'A':'N', 'B':'O', 'C':'P', 'D':'Q', 'E':'R', 'F':'S', 
       'G':'T', 'H':'U', 'I':'V', 'J':'W', 'K':'X', 'L':'Y', 'M':'Z', 'N':'A', 
       'O':'B', 'P':'C', 'Q':'D', 'R':'E', 'S':'F', 'T':'G', 'U':'H', 'V':'I', 
       'W':'J', 'X':'K', 'Y':'L', 'Z':'M'}
```

Your task in this exercise is to implement an encoder/decoder of ROT-13. Once you're done, you will be able to:

1. Encode the following message: BRAVO! I have created a secret ROT decipher in June 2018!
    
2. Decode the following secret message: V nqzver lbh, TERNG CLGUBA Znfgre!!! Ubj qvq lbh ohvyq guvf qrpbqre?

![DSB Logo](img/Dolan.jpg)
# Python Data Types: Regular Expressions
## PY4E Chapter 11
### How do you search data in Python

# Searching for Patterns in Text

- As discussed in chapter 6, Python provides some built-in functions and methods for finding text parts using patterns
    - for example, `find()` and `split()`
- However, in real-world practice, we often deal with much more complicated pattern extraction
- Python has a very powerful library called _regular expressions_ (the `re` module) that handles many of these tasks quite elegantly
    - regular expressions are a small language of their own
    - they have their own syntax and can get quite complicated - here is a good [reference](https://docs.python.org/library/re.html)
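
To make the comparison with the string methods concrete, the sketch below runs `.find()` and `re.search()` on the same line. The sample line is taken from the style of the log file used later; the variable names are made up for this example.

```python
import re

line = 'From: stephen.marquard@uct.ac.za'

# str.find() returns the index of the substring, or -1 when it is absent
print(line.find('From:'))
print(line.find('To:'))

# re.search() returns a match object, or None when there is no match
match = re.search('From:', line)
print(match is not None)
print(re.search('To:', line))
```

Since a match object counts as true and `None` counts as false, the result of `re.search()` can be used directly in an `if` statement.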

# Example

- We are dealing with a complete log file like the one below:
    - the log file is in `'./data/mbox.txt'`

```
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
Return-Path: <postmaster@collab.sakaiproject.org>
Received: from murder (mail.umich.edu [141.211.14.90])
	 by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
	 Sat, 05 Jan 2008 09:14:16 -0500
X-Sieve: CMU Sieve 2.3
Received: from murder ([unix socket])
	 by mail.umich.edu (Cyrus v2.2.12) with LMTPA;
	 Sat, 05 Jan 2008 09:14:16 -0500
Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])
	by flawless.mail.umich.edu () with ESMTP id m05EEFR1013674;
	Sat, 5 Jan 2008 09:14:15 -0500
Received: FROM paploo.uhi.ac.uk (app1.prod.collab.uhi.ac.uk [194.35.219.184])
	BY holes.mr.itd.umich.edu ID 477F90B0.2DB2F.12494 ; 
	 5 Jan 2008 09:14:10 -0500
Received: from paploo.uhi.ac.uk (localhost [127.0.0.1])
	by paploo.uhi.ac.uk (Postfix) with ESMTP id 5F919BC2F2;
	Sat,  5 Jan 2008 14:10:05 +0000 (GMT)
Message-ID: <200801051412.m05ECIaH010327@nakamura.uits.iupui.edu>
Mime-Version: 1.0
```

In [2]:
# Our task is to extract every line that contains 'From:'
# in essence we are retrieving meta-data from the log file
# meta-data is data about data - you will learn more about that in BA 510

import re
# we use `with open` so that the file is released when we are done with it
# this is the preferred method
with open('./data/mbox.txt') as fp: 
    for line in fp: # this is how you read data line-by-line
        line = line.rstrip() # standard process of handling text data - strip trailing whitespace
        if re.search('From:', line): # this is similar to the `.find()` method
            print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: mmmay@indiana.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: cwen@iupui.ed

From: zqian@umich.edu
From: josrodri@iupui.edu
From: chmaurer@iupui.edu
From: chmaurer@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: chmaurer@iupui.edu
From: chmaurer@iupui.edu
From: ray@media.berkeley.edu
From: chmaurer@iupui.edu
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: rjlowe@iupui.edu
From: ian@caret.cam.ac.uk
From: gjthomas@iupui.edu
From: gjthomas@iupui.edu
From: gjthomas@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: chmaurer@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: chmaurer@iupui.edu
From: chmaurer@iupui.edu
From: chmaurer@iupui.edu
From: chmaurer@iupui.edu
From: chmaurer@iupui.edu
From: chmaurer@iupui.edu
From: chmaurer@iupui.edu
From: chmaurer@iupui.edu
From: chmaurer@iupui.edu
From: chmaurer@iupui.edu
From: aaronz@vt.edu
From: gjthomas@iupui.edu
From: gjthomas@iupui.edu
From: gjthomas

From: gsilver@umich.edu
From: ian@caret.cam.ac.uk
From: ajpoland@iupui.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: zach.thomas@txstate.edu
From: ajpoland@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: gjthomas@iupui.edu
From: rjlowe@iupui.edu
From: rjlowe@iupui.edu
From: dlhaines@umich.edu
From: rjlowe@iupui.edu
From: rjlowe@iupui.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: cwen@iupui.edu
From: ktsao@stanford.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: jzaremba@unicon.net
From: aaronz@vt.edu
From: aaronz@vt.edu
From: dlhaines@umich.edu
From: dlhaines@umich.edu
From: dlhaines@umich.edu
From: ajpoland@iupui.edu
From: rjlowe@iupui.edu
From: rjlowe@iupui.edu
From: rjlowe@iupui.edu
From: dlhaines@umich.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: ian@caret.cam.ac.uk


In [3]:
# The example above does not show the power of RegEx
# We change our task to extracting every line that starts with 'From:'

import re

with open('./data/mbox.txt') as fp: 
    for line in fp: 
        line = line.rstrip() 
        # we use the caret sign (`^`) before 'From:'
        # to match only at the start of the line
        if re.search('^From:', line): # this is equivalent to the `.startswith()` method
            print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: mmmay@indiana.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: cwen@iupui.ed

From: stuart.freeman@et.gatech.edu
From: chmaurer@iupui.edu
From: chmaurer@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: ian@caret.cam.ac.uk
From: dlhaines@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: gopal.ramasammycook@gmail.com
From: dlhaines@umich.edu
From: zqian@umich.edu
From: ian@caret.cam.ac.uk
From: aaronz@vt.edu
From: john.ellis@rsmart.com
From: john.ellis@rsmart.com
From: aaronz@vt.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: wagnermr@iupui.edu
From: aaronz@vt.edu
From: wagnermr@iupui.edu
From: gopal.ramasammycook@gmail.com
From: lance@indiana.edu
From: sgithens@caret.cam.ac.uk
From: ggolden@umich.edu
From: ggolden@umich.edu
From: cwen@iupui.edu
From: ggolden@umich.edu
From: ggolden@umich.edu
From: ggolden@umich.edu
From: ggolden@umich.edu
From: david.horwitz@uct.ac.za
From: antranig@caret.cam.ac.uk
From: ian@caret.cam.ac.uk
From: ian@caret.cam.ac.uk
From: ian@caret.cam.ac.uk
From: antranig@caret.c
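
The printed output of the two cells looks the same, so the effect of the caret is easier to see on a few hand-written lines. The sketch below uses made-up sample lines (they are not taken from `mbox.txt`), one of which contains `From:` in the middle.

```python
import re

# hypothetical sample lines in the style of the log file
lines = [
    'From: stephen.marquard@uct.ac.za',
    'X-Note: forwarded From: someone@example.com',
    'Subject: hello',
]

# 'From:' anywhere in the line
anywhere = [ln for ln in lines if re.search('From:', ln)]

# 'From:' only at the start of the line
at_start = [ln for ln in lines if re.search('^From:', ln)]

# the string method gives the same result as the caret version
with_startswith = [ln for ln in lines if ln.startswith('From:')]

print(len(anywhere), len(at_start), at_start == with_startswith)
```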

# Making RegEx more Powerful

- So far our use of RegEx has been fairly basic; now we can look at some more advanced features
    - we can use period (`.`) to match any character
        - e.g. `Fairfi..d` can match `Fairfield`, `Fairfi12d`, ...
        
    - we can combine period (`.`) with repetition characters, such as `*` and `+`
        - `+` means _one or more_ of the preceding character
        - `*` means _zero or more_ of the preceding character
        - Note that `+` and `*` are _greedy_ - they match as many characters as they can
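
The sketch below checks the `Fairfi..d` example and shows what _greedy_ means in practice. The sample line is made up so that it contains two `@` characters. The non-greedy form `.+?` is not covered in the bullets above; it is a standard part of the `re` syntax and is included only for contrast.

```python
import re

# `.` stands for exactly one character, whatever it is
print(bool(re.search('Fairfi..d', 'Fairfield')))  # True
print(bool(re.search('Fairfi..d', 'Fairfi12d')))  # True

# hypothetical line containing two '@' characters
line = 'From: csev@umich.edu to cwen@iupui.edu'

# greedy: `.+` runs as far as it can, up to the LAST '@'
greedy = re.search('From:.+@', line).group()
print(greedy)  # From: csev@umich.edu to cwen@

# non-greedy: `.+?` stops at the FIRST '@'
lazy = re.search('From:.+?@', line).group()
print(lazy)    # From: csev@
```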

In [4]:
with open('./data/mbox.txt') as fp: 
    for line in fp: 
        line = line.rstrip() 
        if re.search('^F..m:', line): # each `.` stands in for any single character
            print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: mmmay@indiana.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: cwen@iupui.ed

From: mmmay@indiana.edu
From: ktsao@stanford.edu
From: jimeng@umich.edu
From: jimeng@umich.edu
From: jimeng@umich.edu
From: cwen@iupui.edu
From: mmmay@indiana.edu
From: zqian@umich.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: thoppaymallika@fhda.edu
From: chmaurer@iupui.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: chmaurer@iupui.edu
From: sgithens@caret.cam.ac.uk
From: chmaurer@iupui.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: chmaurer@iupui.edu
From: chmaurer@iupui.edu
From: chmaurer@iupui.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: sgithens@caret.cam.ac.uk
From: david.horwitz@uct.ac.za
From: zqian@umich.edu
From: sgithens@caret.cam.ac

In [6]:
with open('./data/mbox.txt') as fp: 
    for line in fp: 
        line = line.rstrip() 
        # `.+@` matches one or more characters followed by '@'
        if re.search('From:.+@', line): # note that '@' is a literal character, not a RegEx special character
            print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: mmmay@indiana.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: cwen@iupui.ed

From: aaronz@vt.edu
From: zqian@umich.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: kimsooil@bu.edu
From: kimsooil@bu.edu
From: aaronz@vt.edu
From: nuno@ufp.pt
From: arwhyte@umich.edu
From: dlhaines@umich.edu
From: dlhaines@umich.edu
From: ajpoland@iupui.edu
From: mmmay@indiana.edu
From: zqian@umich.edu
From: mmmay@indiana.edu
From: ajpoland@iupui.edu
From: rjlowe@iupui.edu
From: rjlowe@iupui.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: ajpoland@iupui.edu
From: wagnermr@iupui.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: aaronz@vt.edu
From: sgithens@caret.cam.ac.uk
From: ian@caret.cam.ac.uk
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: josrodri@iupui.edu
From: josrodri@iupui.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: dlhai

# Extract Multiple Sub-strings from Text

- Sometimes a particular kind of substring may appear several times in a text
    - it is important to extract all occurrences of it
    - also, we may want to store them in a _list_ for future use
    - see example below

In [7]:
s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'
# RegEx used below is `\S+@\S+`
# `\S+` means at least one non-whitespace character
# the pattern is written as a raw string (r'...') so Python leaves the backslashes alone
lst = re.findall(r'\S+@\S+', s)
print(lst)

['csev@umich.edu', 'cwen@iupui.edu']
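
Since `re.findall()` always returns a list, it is worth seeing what comes back when nothing matches. The sketch below uses two made-up strings, one with addresses and one without.

```python
import re

# hypothetical sample strings
with_emails = 'Contact csev@umich.edu or cwen@iupui.edu'
without_emails = 'No addresses on this line'

found = re.findall(r'\S+@\S+', with_emails)
nothing = re.findall(r'\S+@\S+', without_emails)

print(found)    # a list with two strings
print(nothing)  # an empty list, not None and not an error

# an empty list counts as false, so it can be tested directly
if not nothing:
    print('no match')
```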


# Your Turn Here

Use your understanding of the RegEx used above to explain why `'@2PM'` is not extracted.

In [8]:
# Now that we have a working email extractor
# can we use it on the log file above?

with open('./data/mbox.txt') as fp: 
    for line in fp: 
        line = line.rstrip() 
        x = re.findall(r'\S+@\S+', line)  # this is our email extractor
        if len(x) > 0: # if anything is returned then print out
            print(x)


['stephen.marquard@uct.ac.za']
['<postmaster@collab.sakaiproject.org>']
['<200801051412.m05ECIaH010327@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['stephen.marquard@uct.ac.za']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['stephen.marquard@uct.ac.za']
['louis@media.berkeley.edu']
['<postmaster@collab.sakaiproject.org>']
['<200801042308.m04N8v6O008125@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['louis@media.berkeley.edu']
['source@collab.sakaiproject.org']
['louis@media.berkeley.edu']
['louis@media.berkeley.edu']
['zqian@umich.edu']
['<postmaster@collab.sakaiproject.org>']
['<200801042109.m04L92hb007923@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject

['dlhaines@umich.edu']
['<postmaster@collab.sakaiproject.org>']
['<200712211855.lBLItOjY010357@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['dlhaines@umich.edu']
['source@collab.sakaiproject.org']
['dlhaines@umich.edu']
['dlhaines@umich.edu']
['bkirschn@umich.edu']
['<postmaster@collab.sakaiproject.org>']
['<200712211854.lBLIsB6l010345@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['bkirschn@umich.edu']
['source@collab.sakaiproject.org']
['bkirschn@umich.edu']
['bkirschn@umich.edu']
['dlhaines@umich.edu']
['<postmaster@collab.sakaiproject.org>']
['<200712211850.lBLIoVI9010330@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<sou

['lance@indiana.edu']
['source@collab.sakaiproject.org']
['lance@indiana.edu']
['lance@indiana.edu']
['lance@indiana.edu']
['<postmaster@collab.sakaiproject.org>']
['<200712181640.lBIGelu5001477@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['lance@indiana.edu']
['source@collab.sakaiproject.org']
['lance@indiana.edu']
['lance@indiana.edu']
['cwen@iupui.edu']
['<postmaster@collab.sakaiproject.org>']
['<200712181625.lBIGP9a7001435@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['cwen@iupui.edu']
['source@collab.sakaiproject.org']
['cwen@iupui.edu']
['cwen@iupui.edu']
['cwen@iupui.edu']
['<postmaster@collab.sakaiproject.org>']
['<200712181614.lBIGEaYr001405@nakamura.uits.iupui.edu>']
['<so

['cwen@iupui.edu']
['<postmaster@collab.sakaiproject.org>']
['<200712142014.lBEKE2vZ013276@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['cwen@iupui.edu']
['source@collab.sakaiproject.org']
['cwen@iupui.edu']
['cwen@iupui.edu']
['cwen@iupui.edu']
['<postmaster@collab.sakaiproject.org>']
['<200712142004.lBEK4RpU013237@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['cwen@iupui.edu']
['source@collab.sakaiproject.org']
['cwen@iupui.edu']
['cwen@iupui.edu']
['dlhaines@umich.edu']
['<postmaster@collab.sakaiproject.org>']
['<200712141957.lBEJv0iR013058@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']


['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['cwen@iupui.edu']
['source@collab.sakaiproject.org']
['cwen@iupui.edu']
['cwen@iupui.edu']
['josrodri@iupui.edu']
['<postmaster@collab.sakaiproject.org>']
['<200712111916.lBBJGomV002954@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['josrodri@iupui.edu']
['source@collab.sakaiproject.org']
['josrodri@iupui.edu']
['josrodri@iupui.edu']
['josrodri@iupui.edu']
['<postmaster@collab.sakaiproject.org>']
['<200712111910.lBBJASdb002942@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['josrodri@iupui.edu']
['source@collab.sakai

['source@collab.sakaiproject.org;']
['ian@caret.cam.ac.uk']
['source@collab.sakaiproject.org']
['ian@caret.cam.ac.uk']
['ian@caret.cam.ac.uk']
['sgithens@caret.cam.ac.uk']
['<postmaster@collab.sakaiproject.org>']
['<200712010509.lB159lt8009136@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['sgithens@caret.cam.ac.uk']
['source@collab.sakaiproject.org']
['sgithens@caret.cam.ac.uk']
['sgithens@caret.cam.ac.uk']
['sgithens@caret.cam.ac.uk']
['<postmaster@collab.sakaiproject.org>']
['<200712010101.lB111m8a008768@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['sgithens@caret.cam.ac.uk']
['source@collab.sakaiproject.org']
['sgithens@caret.cam.ac.uk']
['sgithens@caret.cam.ac.uk']
['zqian@umich

['<200711202358.lAKNwfug003712@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['ian@caret.cam.ac.uk']
['source@collab.sakaiproject.org']
['ian@caret.cam.ac.uk']
['ian@caret.cam.ac.uk']
['ray@media.berkeley.edu']
['<postmaster@collab.sakaiproject.org>']
['<200711202308.lAKN8aJp003616@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['ray@media.berkeley.edu']
['source@collab.sakaiproject.org']
['ray@media.berkeley.edu']
['ray@media.berkeley.edu']
['ktsao@stanford.edu']
['<postmaster@collab.sakaiproject.org>']
['<200711202243.lAKMhbqq003574@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@loca

['ray@media.berkeley.edu']
['ostermmg@whitman.edu']
['<postmaster@collab.sakaiproject.org>']
['<200711142218.lAEMINnY029055@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['ostermmg@whitman.edu']
['source@collab.sakaiproject.org']
['ostermmg@whitman.edu']
['ostermmg@whitman.edu']
['gsilver@umich.edu']
['josrodri@iupui.edu']
['<postmaster@collab.sakaiproject.org>']
['<200711142158.lAELwDor029012@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['josrodri@iupui.edu']
['source@collab.sakaiproject.org']
['josrodri@iupui.edu']
['josrodri@iupui.edu']
['ostermmg@whitman.edu']
['<postmaster@collab.sakaiproject.org>']
['<200711142151.lAELpgq1029000@nakamura.uits.iupui.edu>']
['<source@collab.sakaip

['apache@localhost)']
['source@collab.sakaiproject.org;']
['dlhaines@umich.edu']
['source@collab.sakaiproject.org']
['dlhaines@umich.edu']
['dlhaines@umich.edu']
['arwhyte@umich.edu']
['<postmaster@collab.sakaiproject.org>']
['<200711062100.lA6L00fB029030@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['arwhyte@umich.edu']
['source@collab.sakaiproject.org']
['arwhyte@umich.edu']
['arwhyte@umich.edu']
['arwhyte@umich.edu']
['<postmaster@collab.sakaiproject.org>']
['<200711062057.lA6KvWRD029013@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['arwhyte@umich.edu']
['source@collab.sakaiproject.org']
['arwhyte@umich.edu']
['arwhyte@umich.edu']
['arwhyte@umich.edu']
['<postmaster@collab.sakaipr

['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['ajpoland@iupui.edu']
['source@collab.sakaiproject.org']
['ajpoland@iupui.edu']
['ajpoland@iupui.edu']
['mbreuker@loi.nl']
['ajpoland@iupui.edu']
['<postmaster@collab.sakaiproject.org>']
['<200710301441.l9UEfjs7022541@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['ajpoland@iupui.edu']
['source@collab.sakaiproject.org']
['ajpoland@iupui.edu']
['ajpoland@iupui.edu']
['mbreuker@loi.nl']
['ajpoland@iupui.edu']
['<postmaster@collab.sakaiproject.org>']
['<200710301441.l9UEfQfA022529@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['ajpoland@iupui.edu']
['sourc

['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['ray@media.berkeley.edu']
['source@collab.sakaiproject.org']
['ray@media.berkeley.edu']
['ray@media.berkeley.edu']
['ajpoland@iupui.edu']
['<postmaster@collab.sakaiproject.org>']
['<200710251928.l9PJSkkf019212@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['ajpoland@iupui.edu']
['source@collab.sakaiproject.org']
['ajpoland@iupui.edu']
['ajpoland@iupui.edu']
['rjlowe@iupui.edu']
['<postmaster@collab.sakaiproject.org>']
['<200710251928.l9PJS9ZL019200@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['rjlowe@iupui.edu']
['source@collab.sakaiproject.org']
['rj

# Fine-tuning Your Match

- From the results above, we see that some email addresses were extracted successfully
- But many results are incorrect
    - for instance, some matches include stray characters such as `<`, `>`, `;`, or `)`
    - we only care about legitimate email addresses
        - here, a legitimate email address starts with a letter or a digit and ends with a letter
    - we can use a new RegEx for that
        - `[a-zA-Z0-9]` matches a __single__ _lowercase letter_, _uppercase letter_, or _digit_ at the __beginning__ of the match
        - `\S*@\S*` matches any non-whitespace characters around the `@`
        - `[a-zA-Z]` requires the match to __end__ with a letter
        
```
[a-zA-Z0-9]\S*@\S*[a-zA-Z]
```
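
As a quick check, the sketch below applies the refined pattern to two made-up lines (they are hypothetical, not copied from `mbox.txt`). It compares the result with a looser `\S+@\S+` pattern, which is only assumed here to resemble the earlier extractor.

```python
import re

# Hypothetical sample lines, written only for illustration
samples = [
    'for <source@collab.sakaiproject.org>; Sat, 05 Jan 2008',
    'by apache@localhost) with ESMTP',
]

loose = r'\S+@\S+'                       # assumed earlier pattern
strict = r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]'   # refined pattern from above

for s in samples:
    # the loose pattern keeps the stray symbols, the strict one drops them
    print(re.findall(loose, s), re.findall(strict, s))
```

The strict pattern trims the match so that it starts at the first letter or digit and stops at the last letter, which is why `<`, `>;`, and `)` disappear.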

In [9]:
# Now the results look much better

with open('./data/mbox.txt') as fp: 
    for line in fp: 
        line = line.rstrip() 
        x = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)  # this is our email extractor (raw string keeps `\S` intact)
        if len(x) > 0: # if anything is returned, print it out
            print(x)

['stephen.marquard@uct.ac.za']
['postmaster@collab.sakaiproject.org']
['200801051412.m05ECIaH010327@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['stephen.marquard@uct.ac.za']
['louis@media.berkeley.edu']
['postmaster@collab.sakaiproject.org']
['200801042308.m04N8v6O008125@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['louis@media.berkeley.edu']
['source@collab.sakaiproject.org']
['louis@media.berkeley.edu']
['louis@media.berkeley.edu']
['zqian@umich.edu']
['postmaster@collab.sakaiproject.org']
['200801042109.m04L92hb007923@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject

['200712211433.lBLEX8tH009885@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['david.horwitz@uct.ac.za']
['source@collab.sakaiproject.org']
['david.horwitz@uct.ac.za']
['david.horwitz@uct.ac.za']
['chmaurer@iupui.edu']
['david.horwitz@uct.ac.za']
['postmaster@collab.sakaiproject.org']
['200712211408.lBLE8eQg009817@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['david.horwitz@uct.ac.za']
['source@collab.sakaiproject.org']
['david.horwitz@uct.ac.za']
['david.horwitz@uct.ac.za']
['stephen.marquard@uct.ac.za']
['postmaster@collab.sakaiproject.org']
['200712211407.lBLE7LPt009805@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apach

['mmmay@indiana.edu']
['source@collab.sakaiproject.org']
['mmmay@indiana.edu']
['mmmay@indiana.edu']
['jbush@rsmart.com']
['ssmail@indiana.edu']
['postmaster@collab.sakaiproject.org']
['200712171919.lBHJJtr2031760@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['ssmail@indiana.edu']
['source@collab.sakaiproject.org']
['ssmail@indiana.edu']
['ssmail@indiana.edu']
['cwen@iupui.edu']
['postmaster@collab.sakaiproject.org']
['200712171919.lBHJJC6Y031748@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['cwen@iupui.edu']
['source@collab.sakaiproject.org']
['cwen@iupui.edu']
['cwen@iupui.edu']
['mmmay@indiana.edu']
['postmaster@collab.sakaiproject.org']
['200712171919.lBHJJBZL031736@nakamura.uits.iupui.edu']
['source@co

['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['cwen@iupui.edu']
['source@collab.sakaiproject.org']
['cwen@iupui.edu']
['cwen@iupui.edu']
['cwen@iupui.edu']
['postmaster@collab.sakaiproject.org']
['200712141616.lBEGGH5E012558@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['cwen@iupui.edu']
['source@collab.sakaiproject.org']
['cwen@iupui.edu']
['cwen@iupui.edu']
['cwen@iupui.edu']
['postmaster@collab.sakaiproject.org']
['200712141600.lBEG0vmY012525@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['cwen@iupui.edu']
['source@collab.sakaiproject.org']
['cwen@iupui.edu']
['cwen@iupui.edu']
['josrodri@iu

['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['zqian@umich.edu']
['source@collab.sakaiproject.org']
['zqian@umich.edu']
['zqian@umich.edu']
['zqian@umich.edu']
['zqian@umich.edu']
['postmaster@collab.sakaiproject.org']
['200712072055.lB7KtfaQ004342@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['zqian@umich.edu']
['source@collab.sakaiproject.org']
['zqian@umich.edu']
['zqian@umich.edu']
['zqian@umich.edu']
['zqian@umich.edu']
['postmaster@collab.sakaiproject.org']
['200712072052.lB7KqVfp004315@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['zqian@umich.edu']
['source@collab.sakaiproject.org']
['zqian@umich.edu']
['zqian@umich.edu

['source@collab.sakaiproject.org']
['chmaurer@iupui.edu']
['source@collab.sakaiproject.org']
['chmaurer@iupui.edu']
['chmaurer@iupui.edu']
['bkirschn@umich.edu']
['bkirschn@umich.edu']
['bkirschn@umich.edu']
['bkirschn@umich.edu']
['bkirschn@umich.edu']
['mmmay@indiana.edu']
['postmaster@collab.sakaiproject.org']
['200711292106.lATL6HiV005971@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['mmmay@indiana.edu']
['source@collab.sakaiproject.org']
['mmmay@indiana.edu']
['mmmay@indiana.edu']
['david.horwitz@uct.ac.za']
['david.horwitz@uct.ac.za']
['mmmay@indiana.edu']
['postmaster@collab.sakaiproject.org']
['200711292040.lATKef5X005943@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['mmmay@indiana.edu']
['source@co

['rjlowe@iupui.edu']
['postmaster@collab.sakaiproject.org']
['200711200533.lAK5XckH001169@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['rjlowe@iupui.edu']
['source@collab.sakaiproject.org']
['rjlowe@iupui.edu']
['rjlowe@iupui.edu']
['rjlowe@iupui.edu']
['postmaster@collab.sakaiproject.org']
['200711200532.lAK5WNHu001156@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['rjlowe@iupui.edu']
['source@collab.sakaiproject.org']
['rjlowe@iupui.edu']
['rjlowe@iupui.edu']
['rjlowe@iupui.edu']
['postmaster@collab.sakaiproject.org']
['200711200528.lAK5SrTf001118@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source

['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['jimeng@umich.edu']
['source@collab.sakaiproject.org']
['jimeng@umich.edu']
['jimeng@umich.edu']
['aaronz@vt.edu']
['postmaster@collab.sakaiproject.org']
['200711110036.lAB0a13j008258@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['aaronz@vt.edu']
['source@collab.sakaiproject.org']
['aaronz@vt.edu']
['aaronz@vt.edu']
['aaronz@vt.edu']
['postmaster@collab.sakaiproject.org']
['200711110026.lAB0QfJr008246@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['aaronz@vt.edu']
['source@collab.sakaiproject.org']
['aaronz@vt.edu']
['aaronz@vt.edu']
['aaronz@vt.edu']
['postmaster@collab.sakaiproject

['200711052150.lA5Lo6XW025758@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['mmmay@indiana.edu']
['source@collab.sakaiproject.org']
['mmmay@indiana.edu']
['mmmay@indiana.edu']
['nuno@ufp.pt']
['zqian@umich.edu']
['postmaster@collab.sakaiproject.org']
['200711052119.lA5LJvkd025631@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['zqian@umich.edu']
['source@collab.sakaiproject.org']
['zqian@umich.edu']
['zqian@umich.edu']
['zqian@umich.edu']
['postmaster@collab.sakaiproject.org']
['200711052117.lA5LHEIF025619@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['zqian@umich.edu']

['200710292000.l9TK0Y6g020085@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['mmmay@indiana.edu']
['source@collab.sakaiproject.org']
['mmmay@indiana.edu']
['mmmay@indiana.edu']
['ian@caret.cam.ac.uk']
['mmmay@indiana.edu']
['postmaster@collab.sakaiproject.org']
['200710291958.l9TJwvN1020072@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['mmmay@indiana.edu']
['source@collab.sakaiproject.org']
['mmmay@indiana.edu']
['mmmay@indiana.edu']
['ian@caret.cam.ac.uk']
['mmmay@indiana.edu']
['postmaster@collab.sakaiproject.org']
['200710291955.l9TJtNQm020050@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@col

['200710221731.l9MHVbh6006443@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['rjlowe@iupui.edu']
['source@collab.sakaiproject.org']
['rjlowe@iupui.edu']
['rjlowe@iupui.edu']
['aaronz@vt.edu']
['postmaster@collab.sakaiproject.org']
['200710221717.l9MHH6Ii006431@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['aaronz@vt.edu']
['source@collab.sakaiproject.org']
['aaronz@vt.edu']
['aaronz@vt.edu']
['aaronz@vt.edu']
['postmaster@collab.sakaiproject.org']
['200710221716.l9MHGsxX006419@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['aaronz@vt.edu']
['source@collab.sakaiproject.o

# Your Turn Here

Look at the results and compare them with the ones we got before. Do you see the improvement?

# More Complicated Use Cases

- In data analytics, an important task is collecting data from different data sources
    - this task is part of a larger process called _ETL_ (extract, transform, load)
        - you will learn about it in BA 510
    - we can use RegEx to search for and extract useful data from text data sources
    - for instance, how about extracting lines such as the ones below?
        - can you look at the examples and find the pattern(s)?
    
```
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
```

In [12]:
# we can use this RegEx for pattern matching: `^X\S*: [0-9.]+`
# Search for lines that start with 'X', followed by any
# non-whitespace characters and ':',
# followed by a space and a number.
# The number can include a decimal point.

with open('./data/mbox.txt') as fp: 
    for line in fp: 
        line = line.rstrip() 
        if re.search(r'^X\S*: [0-9.]+', line):
            print(line)
        

X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6961
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7565
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7626
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7556
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7002
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7615
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7601
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7605
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6959
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7606
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7559
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7605
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6932
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7558
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6526
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6948
X-DSPAM-Probability: 0.0000
X-DSPAM-Co

X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.8419
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7613
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.9777
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.8428
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.9815
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7604
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.9840
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.8491
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.9847
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.9860
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.9839
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.8424
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.8434
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.9821
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.8481
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.9808
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7602
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.8479
X-DSPAM-Pr

X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.9844
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.8474
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.8470
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.9843
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.9848
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.9824
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.9853
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.9855
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.8486
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7003
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.9815
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.9828
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6197
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.5993
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7007
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6568
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7622
X-DSPAM-Pr

# More Complicated Use Cases

- You can see from the results above that we successfully extracted the lines we needed
    - for instance, we could convert these results to a list of dicts for further use
- But we can go one step further and retrieve just the numerical values from these lines
    - see the example below
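
The list-of-dicts idea mentioned above could be sketched as follows. This is only an illustration: the sample lines and the dict keys `header` and `value` are hypothetical choices, and parentheses are used to capture the two parts of each line.

```python
import re

# Hypothetical sample lines in the same format as the output above
lines = [
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-Other-Header: not a number',
]

records = []
for line in lines:
    # two groups: the header name, and the number after ': '
    m = re.search(r'^(X\S*): ([0-9.]+)', line)
    if m:
        # the value is left as a string here
        records.append({'header': m.group(1), 'value': m.group(2)})

print(records)
```

The third line is skipped because nothing numeric follows the colon and space.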

In [14]:
with open('./data/mbox.txt') as fp: 
    for line in fp: 
        line = line.rstrip() 
        # the key difference is the `()` we added around the number pattern,
        # which means we only want this part of the match returned
        x = re.findall(r'^X\S*: ([0-9.]+)', line) 
        # note that the extracted items are strings, not floats,
        # and each one is a single item in its own list
        if len(x) > 0:
            print(x)

['0.8475']
['0.0000']
['0.6178']
['0.0000']
['0.6961']
['0.0000']
['0.7565']
['0.0000']
['0.7626']
['0.0000']
['0.7556']
['0.0000']
['0.7002']
['0.0000']
['0.7615']
['0.0000']
['0.7601']
['0.0000']
['0.7605']
['0.0000']
['0.6959']
['0.0000']
['0.7606']
['0.0000']
['0.7559']
['0.0000']
['0.7605']
['0.0000']
['0.6932']
['0.0000']
['0.7558']
['0.0000']
['0.6526']
['0.0000']
['0.6948']
['0.0000']
['0.6528']
['0.0000']
['0.7002']
['0.0000']
['0.7554']
['0.0000']
['0.6956']
['0.0000']
['0.6959']
['0.0000']
['0.7556']
['0.0000']
['0.9846']
['0.0000']
['0.8509']
['0.0000']
['0.9907']
['0.0000']
['0.7003']
['0.0000']
['0.8507']
['0.0000']
['0.9895']
['0.0000']
['0.9965']
['0.0000']
['0.9875']
['0.0000']
['0.9867']
['0.0000']
['0.9903']
['0.0000']
['0.7006']
['0.0000']
['0.9907']
['0.0000']
['0.9886']
['0.0000']
['0.8495']
['0.0000']
['0.7606']
['0.0000']
['0.9875']
['0.0000']
['0.8489']
['0.0000']
['0.9854']
['0.0000']
['0.7549']
['0.0000']
['0.9877']
['0.0000']
['0.9881']
['0.0000']
['0.9864']

['0.9836']
['0.0000']
['0.7551']
['0.0000']
['0.7542']
['0.0000']
['0.8440']
['0.0000']
['0.6527']
['0.0000']
['0.6951']
['0.0000']
['0.9791']
['0.0000']
['0.9861']
['0.0000']
['0.9827']
['0.0000']
['0.9801']
['0.0000']
['0.6945']
['0.0000']
['0.7548']
['0.0000']
['0.9840']
['0.0000']
['0.6939']
['0.0000']
['0.9804']
['0.0000']
['0.8473']
['0.0000']
['0.9804']
['0.0000']
['0.9795']
['0.0000']
['0.9888']
['0.0000']
['0.8496']
['0.0000']
['0.9792']
['0.0000']
['0.9797']
['0.0000']
['0.9825']
['0.0000']
['0.9864']
['0.0000']
['0.7626']
['0.0000']
['0.9810']
['0.0000']
['0.9811']
['0.0000']
['0.8434']
['0.0000']
['0.9826']
['0.0000']
['0.9851']
['0.0000']
['0.9902']
['0.0000']
['0.9834']
['0.0000']
['0.9806']
['0.0000']
['0.8419']
['0.0000']
['0.7613']
['0.0000']
['0.9777']
['0.0000']
['0.8428']
['0.0000']
['0.9815']
['0.0000']
['0.7604']
['0.0000']
['0.9840']
['0.0000']
['0.8491']
['0.0000']
['0.9847']
['0.0000']
['0.9860']
['0.0000']
['0.9839']
['0.0000']
['0.8424']
['0.0000']
['0.8434']

['0.0000']
['0.7007']
['0.0000']
['0.6567']
['0.0000']
['0.7006']
['0.0000']
['0.7542']
['0.0000']
['0.7622']
['0.0000']
['0.7006']
['0.0000']
['0.9817']
['0.0000']
['0.9813']
['0.0000']
['0.9837']
['0.0000']
['0.8439']
['0.0000']
['0.8444']
['0.0000']
['0.7005']
['0.0000']
['0.7618']
['0.0000']
['0.8467']
['0.0000']
['0.7008']
['0.0000']
['0.7623']
['0.0000']
['0.7619']
['0.0000']
['0.7538']
['0.0000']
['0.7622']
['0.0000']
['0.7622']
['0.0000']
['0.7624']
['0.0000']
['0.7009']
['0.0000']
['0.7623']
['0.0000']
['0.8478']
['0.0000']
['0.9814']
['0.0000']
['0.9877']
['0.0000']
['0.8470']
['0.0000']
['0.7605']
['0.0000']
['0.6964']
['0.0000']
['0.9829']
['0.0000']
['0.7546']
['0.0000']
['0.9751']
['0.0000']
['0.7005']
['0.0000']
['0.9821']
['0.0000']
['0.7006']
['0.0000']
['0.6202']
['0.0000']
['0.7619']
['0.0000']
['0.7004']
['0.0000']
['0.7610']
['0.0000']
['0.9848']
['0.0000']
['0.9870']
['0.0000']
['0.9851']
['0.0000']
['0.7514']
['0.0000']
['0.7544']
['0.0000']
['0.9844']
['0.0000']

# Your Turn Here

Can you fix the code above so that we get a __single__ list with floating-point numbers as items?

# More Complicated Use Cases

- Please read PY4E pp. 134 - 135 for more illustrative examples.

# Escape Character

- In RegEx we use special characters to match specific parts of the line
    - for instance, `+`, `*`, or `$`
    - however, sometimes we need to search for these characters themselves
        - without their special meaning
    - we can put the escape character `\` in front of them

In [15]:
x = 'We just received $10.00 for cookies.'
y = re.findall(r'\$[0-9.]+', x) # note the `\` we used before `$` to search for `$10.00`
print(y)

['$10.00']
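
When the literal text to search for is stored in a variable, escaping every special character by hand becomes error-prone. The sketch below shows the standard-library helper `re.escape()` next to the manual approach; the sample sentence and the variable names are made up for illustration.

```python
import re

text = 'We just received $10.00 for cookies (2+2 boxes).'

# Escaping by hand: `\$` matches a literal dollar sign
manual = re.findall(r'\$[0-9.]+', text)
print(manual)

# re.escape() adds the backslashes for us, so `(`, `+`, and `)`
# are treated as ordinary characters
literal = '(2+2 boxes)'
escaped = re.findall(re.escape(literal), text)
print(escaped)
```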


# Use RegEx to Replace Substring

- In strings, we can use the `.replace()` method to replace part of the string
    - however, `.replace()` only supports exact matches
- Now with the power of RegEx, we can match more flexibly
    - `.replace()` does not accept a RegEx pattern
    - instead, use the `re.sub()` function

In [16]:
# Exact match using replace
my_str = 'In the year 2012, ABC company spend $1 million on their new product line.'
my_str.replace('ABC', 'XYZ')

'In the year 2012, XYZ company spend $1 million on their new product line.'

In [19]:
# Now let's search and replace using a RegEx
# `re.sub()` takes three arguments
# first is the RegEx you want to search for
# second is the replacement (a string or a function)
# third is the string you want to search in
# Note the `r` prefix makes this a raw string, so backslashes in the pattern are kept as-is
re.sub(r'[1-3][0-9]{3}', 'YEAR', my_str)

'In the year YEAR, ABC company spend $1 million on their new product line.'
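
The replacement argument of `re.sub()` does not have to be a fixed string. As a small sketch (the helper name `next_year` is hypothetical), it can refer back to a captured group with `\1`, or it can be a function that receives the match object and returns the new text:

```python
import re

my_str = 'In the year 2012, ABC company spend $1 million on their new product line.'

# A backreference (`\1`) reuses the text captured by the first group
bracketed = re.sub(r'([1-3][0-9]{3})', r'[\1]', my_str)
print(bracketed)

# A function receives the match object and returns the replacement string
def next_year(match):
    return str(int(match.group(0)) + 1)

bumped = re.sub(r'[1-3][0-9]{3}', next_year, my_str)
print(bumped)
```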

# Summary

- Refer to PY4E pp. 136 - 137 for a quick summary of useful RegEx.

# Your Turn Here
Finish the exercises below by following the instructions for each of them. 

Make sure you provide proper __pseudo code__ for each of your programs.

## Q1. Coding Problem

Write a function to remove leading zeros from any number.
- Take user input for a multi-digit number;
    - if the number starts with one or more zeros, remove them and return the remainder of the number
    - if the number does not start with a zero, return the number unchanged.

## Q2. Coding Problem

When you register for some websites, you may have noticed that the website checks whether the phone number or email address you entered is legitimate. In this exercise, you will write a function using __RegEx__ to detect whether a phone number or email address is legitimate. 
- Legitimate phone numbers follow this format: `(xxx)xxx-xxxx`, where `x` is any digit from 0 to 9.
    - note that the parentheses `()` and dash `-` are required.
    - also note how many digits appear in each group separated by the symbols.

- Legitimate email addresses follow the rules below:
    - it starts with a lowercase letter, an uppercase letter, or a digit
    - it contains an `@` symbol, with non-whitespace characters before and after it
    - after the `@` symbol, it should also contain a period `.`.


# Classwork (start here in class)
You can start working on them right now:
- Read Chapters 9 and 11 in PY4E
- If time permits, start in on your homework. 
- Ask questions when you need help. Use this time to get help from the professor!

# Homework (do at home)
The following is due before class next week:
  - Any remaining classwork from tonight
  - Data Camp “Regular Expressions for Pattern Matching” assignment 

Note: All work on Data Camp is logged. Don't try to fake it!

Please email jtao@fairfield.edu if you have any problems or questions.

![DSB Logo](img/Dolan.jpg)
# Python Data Types: Regular Expressions
## PY4E Chapter 11
### How do you search data in Python