# Tutorial 2 Regular expression

In the lecture, you have learnt how to use regular expressions. To quickly review some regular expression syntax:


* <font color="red">[0-9]</font> Matches a single digit
* <font color="red">[a-z0-9]</font> Matches a single character that must be a lower case letter or a digit.
* <font color="red">[A-Za-z]</font> Matches a single character that must be an upper- or lower-case letter.
* <font color="red">\d</font> Matches any decimal digit; equivalent to the set [0-9].
* <font color="red">\D</font> Matches characters that are not digits, which is equivalent to [^0-9] or [^\d].
* <font color="red">\w</font> Matches any word character (alphanumeric or underscore), which is equivalent to [a-zA-Z0-9_].
* <font color="red">\W</font> Matches any non-word character, which is equivalent to [^a-zA-Z0-9_] or [^\w].
* <font color="red">\s</font> Matches any whitespace character, which is equivalent to [ \t\n\r\f\v], where \t indicates tabs, \n line feeds, \r carriage returns, \f form feeds and \v vertical tabs.
* <font color="red">\S</font> Matches any non-whitespace character, which is equivalent to [^ \t\n\r\f\v].
* <font color="red">^</font> Matches the start of the line.
* <font color="red">$</font> Matches the end of the line.
* <font color="red">.</font> Matches any character (a wildcard).
* <font color="red">*</font> Matches when the preceding character occurs zero or more times
* <font color="red">?</font> Matches when the preceding character occurs zero or one times
* <font color="red">+</font> Matches when the preceding character occurs one or more times
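
A few of the classes above can be tried on a short string. This is only a minimal sketch; the `sample` string is made up for illustration and is not part of the tutorial data.

```python
import re

# Made-up sample text; "\t" in a normal string is a real tab character
sample = "Room_42 opens at 9am\ton 3 May."

# \d matches each decimal digit on its own
digits = re.findall(r"\d", sample)
print(digits)   # ['4', '2', '9', '3']

# \w+ matches runs of word characters; the underscore counts as one
words = re.findall(r"\w+", sample)
print(words)    # ['Room_42', 'opens', 'at', '9am', 'on', '3', 'May']

# \s matches whitespace: here five spaces and one tab
spaces = re.findall(r"\s", sample)
print(len(spaces), spaces.count("\t"))   # 6 1
```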

More information can be found here:
https://docs.python.org/2/library/re.html
* * *

In [1]:
import sys
print (sys.version_info)

sys.version_info(major=3, minor=6, micro=3, releaselevel='final', serial=0)


Libraries needed are:

In [2]:
import re # library for regular expression
import pandas as pd
pd.__version__

'0.20.3'

## 1. Backslash 

**First, what is '\'?**

'\', backslash or escape-character, is used to indicate special forms or to allow special characters to be used without invoking their special meaning.



**How about r""? When should you use it?**

r"" is Python's raw string literal prefix. It is a feature of Python string literals, not of regular expressions. With r"" or r'', Python does not process backslash escape sequences; in other words, it treats the contents as a raw string. For example, r"\t" represents a two-character string containing '\' and 't', whereas "\t" represents a tab.

Sometimes you can use them interchangeably:

In [97]:
str1 = re.findall('\t', "Please find \t s")
print (str1)
s = "Please find \t s"
print(s)

str2 = re.findall(r'\t', "Please find \t")
print (str2)

['\t']
Please find 	 s
['\t']
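
Why do both forms find the tab? A small sketch that compares the two strings directly may help; the variable names here are only illustrative.

```python
import re

plain = "\t"    # Python converts the escape: one tab character
raw = r"\t"     # raw string: a backslash followed by the letter t

print(len(plain), len(raw))   # 1 2

text = "Please find \t s"
# The first pattern is a literal tab character. The second is the
# two-character escape \t, which the re module itself reads as a tab.
same_result = re.findall(plain, text) == re.findall(raw, text)
print(same_result)            # True
```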


Sometimes not!

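#### About "\1"
In a regular expression, '\\1' matches the literal characters '\1' (because '\\' matches the character '\'), and '\\2' matches the literal characters '\2'.

With a single backslash, \1 and \2 are backreferences.
'\1' matches whatever the first () group captured. For example, '(\d)\1' matches two identical consecutive digits, such as the 33 in 33aa.
'\2' matches whatever the second () group captured.
For example, '(\d)(a)\1' matches a digit, then the character a, then the same digit again, because \1 must repeat what group 1 matched. So 9a9 matches, but 9a8 does not, since the third character must be 9.

'(\d)(a)\2' matches a digit, then a, then a repeat of what the second group matched. So 8aa matches, but 8ab and 7a7 do not: the third character must be a copy of the second group's match.

The same applies to higher group numbers.
https://blog.csdn.net/liangf05/article/details/79361191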
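
The examples from the note above can be checked directly. This is a minimal sketch that uses raw strings and `re.fullmatch` so that each test string must match in full.

```python
import re

# (\d)\1: a digit followed by the same digit again
pair = re.search(r"(\d)\1", "33aa").group()
print(pair)                                    # 33

# (\d)(a)\1: digit, 'a', then whatever group 1 matched
m_9a9 = re.fullmatch(r"(\d)(a)\1", "9a9") is not None
m_9a8 = re.fullmatch(r"(\d)(a)\1", "9a8") is not None
print(m_9a9, m_9a8)                            # True False

# (\d)(a)\2: digit, 'a', then whatever group 2 matched
m_8aa = re.fullmatch(r"(\d)(a)\2", "8aa") is not None
m_8ab = re.fullmatch(r"(\d)(a)\2", "8ab") is not None
m_7a7 = re.fullmatch(r"(\d)(a)\2", "7a7") is not None
print(m_8aa, m_8ab, m_7a7)                     # True False False

# \\1 in the pattern matches a literal backslash followed by "1"
literal = re.search(r"\\1", r"a\1b").group()
print(len(literal))                            # 2
```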

In [87]:
# https://blog.csdn.net/liangf05/article/details/79361191
str1=re.match(r"(\W)(.)\1\W", " g  ") 
print (str1)

<_sre.SRE_Match object; span=(0, 4), match=' g  '>


In [101]:
str1=re.match(r"\W(.)\1\W", " gg ") 
str11=re.match(r"\W(.)(.)\1\W", " k1k ") 
str12=re.match(r"\W(.)(.)\2\W", " mkk ") 
# \1 is equivalent to re.search(...).group(1), 
# the first parentheses-delimited expression inside of the regex.
print (str1)
print (str11)
print (str12)

str2=re.match("\W(.)\1\W", " ff ")
str22=re.match("\W(.)\1\W", " f\1 ")
print (str2)
print (str22)

str3=re.match("\\W(.)\\1\\W", " ff ")
print (str3)

<_sre.SRE_Match object; span=(0, 4), match=' gg '>
<_sre.SRE_Match object; span=(0, 5), match=' k1k '>
<_sre.SRE_Match object; span=(0, 5), match=' mkk '>
None
<_sre.SRE_Match object; span=(0, 4), match=' f\x01 '>
<_sre.SRE_Match object; span=(0, 4), match=' ff '>


Without the r prefix, "\W(.)\1\W" doesn't match " ff ". What is the difference?

In [5]:
# Without r, Python turns "\1" into the single character \x01 before the regex engine sees it
str4="\W(.)\1\W"
print (str4)
str4

\W(.)\W


'\\W(.)\x01\\W'

In [6]:
# With r, the string keeps the backslash and the 1, which the regex engine reads as a backreference to group 1
str4=r"\W(.)\1\W"
print (str4)
str4

\W(.)\1\W


'\\W(.)\\1\\W'

Now you might be able to guess what "\W(.)\1\W" will match.

In [67]:
str2=re.match("\W(.)\1\W", " f\x01 ")
str2_1=re.match("\W(.)\1\W", " f\1 ")
print (str2)
print (str2_1)

<_sre.SRE_Match object; span=(0, 4), match=' f\x01 '>
<_sre.SRE_Match object; span=(0, 4), match=' f\x01 '>


It matches a non-word character + any one character + "\x01" + a non-word character.

**Conclusion: always validate your regular expression first, then test it with Python.**

What is \*?  <br>
\* is a quantifier, like ? and \+  <br>
\* matches zero or more occurrences <br>
? matches zero or one occurrence <br>
\+ matches one or more occurrences <br>

In [102]:
# find 0 or more
str1 = re.findall(r'.*', 'Please find all.')
print (str1)

['Please find all.', '']


In [69]:
# find a character each time
str1 = re.findall(r'.?', 'Please find all.')
print (str1)

['P', 'l', 'e', 'a', 's', 'e', ' ', 'f', 'i', 'n', 'd', ' ', 'a', 'l', 'l', '.', '']


In [70]:
# find one or more
str1 = re.findall(r'.+', 'Please find all.')
print (str1)

['Please find all.']


In [74]:
# find one or more "l"
str1 = re.findall(r'l+', 'Please find all')
print (str1)

['l', 'll']


## 2. Homework in the reading material "Introduction to Regular Expressions"

Refine the regular expression for dates to distinguish months with 29, 30 and 31 days.
*Note: assume every year is a leap year, which means every February has 29 days.

In [88]:
def date(pattern, m):
    if re.match(pattern, m):
        print (m + " is a date")
    else:
        print (m + " is NOT a date")

In [107]:
# ''' this is a comment '''
regex = r'''(?x)
    (?:
    # February (29 days every year)
    ([12][0-9]|0?[1-9])[/-](0?2)
      
    # 30-day months
    |(30|[12][0-9]|0?[1-9])[/-](0?[469]|11)
      
    # 31-day months
    |(3[01]|[12][0-9]|0?[1-9])[/-](0?[13578]|1[02])
      
    ) 
    # Year
    [/-]((?:[0-9]{2})?[0-9]{2}) 
    # A 4-digit year is read in full; a 2-digit year is matched by the trailing [0-9]{2} alone (the leading (?:[0-9]{2})? is optional)
    
'''

In [108]:
#date(r"((31[/-](0?[13578]|1[02]))|(30[/-](0?[469]|11))|(28[/-]02))[/-]((?:\d{2})?\d{2})", "28/02/2019")
date(regex, "28/02/2019")
date(regex, "31/04/2019")
date(regex, "29/05/2019")
date(regex, "31/06/2019")

28/02/2019 is a date
31/04/2019 is NOT a date
29/05/2019 is a date
31/06/2019 is NOT a date
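
One thing to keep in mind: `re.match` only anchors the start of the string, so trailing characters after a valid date are ignored. The sketch below uses a compact variant of the date pattern (non-capturing groups, day 0 excluded) and compares `re.match` with `re.fullmatch`. The name `date_regex` and the test string are only illustrative.

```python
import re

# Compact variant of the tutorial's date pattern, for illustration only
date_regex = r'''(?x)
    (?:
      (?:[12][0-9]|0?[1-9])[/-]0?2                        # February
    | (?:30|[12][0-9]|0?[1-9])[/-](?:0?[469]|11)          # 30-day months
    | (?:3[01]|[12][0-9]|0?[1-9])[/-](?:0?[13578]|1[02])  # 31-day months
    )
    [/-](?:[0-9]{2})?[0-9]{2}                             # 2- or 4-digit year
'''

# re.match accepts the string because the prefix "28/02/2019" is a date
loose = re.match(date_regex, "28/02/20199")
# re.fullmatch requires the whole string to be a date
strict = re.fullmatch(date_regex, "28/02/20199")

print(loose is not None)    # True
print(strict is not None)   # False
```

If trailing text should be rejected, `re.fullmatch` (or a `$` at the end of the pattern) is one way to do it.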


## 3. Extract IPs, dates, and email addresses with regular expressions

For the following tasks we will use the mailbox data ([mbox-short.txt](http://www.pythonlearn.com/code3/mbox-short.txt)) provided by the book [Python for Informatics: Exploring Information](http://www.pythonlearn.com/book.php#python-for-informatics). 

In [109]:
with open('mbox-short.txt','r') as infile:
    text = infile.read()

### 3.1 Find IP addresses 

In this task we will need to 
1. find all IP addresses in the mbox-short dataset.
2. print unique IP addresses 

We assume here that all of them are valid in the dataset. 
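
One possible way to approach the two steps is `re.findall` followed by a `set`. This is a sketch, not the official solution: the `sample` lines are made up, and in the tutorial the pattern would be applied to `text` (the contents of mbox-short.txt) instead.

```python
import re

# Hypothetical sample lines; replace `sample` with `text` for the real data
sample = (
    "Received: from host-a (host-a.example [192.0.2.10])\n"
    "Received: from host-b (host-b.example [198.51.100.7])\n"
    "Received: from host-a (host-a.example [192.0.2.10])\n"
)

# Four groups of 1-3 digits separated by dots; no range check on each group
ip_pattern = r"\b(?:\d{1,3}\.){3}\d{1,3}\b"

all_ips = re.findall(ip_pattern, sample)   # step 1: every occurrence
unique_ips = sorted(set(all_ips))          # step 2: drop duplicates

print(all_ips)
print(unique_ips)   # ['192.0.2.10', '198.51.100.7']
```

Note that the pattern has no `$`, so it can find addresses in the middle of a line.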



In [119]:
def ip(pattern, i):
    if re.match(pattern, i):
        print (i + " is a valid IP address")
    else:
        print (i + " is Not a valid IP address")

In [129]:
ip_address = r"\b(?:\d{1,3}\.){3}\d{1,3}$\b"

In [134]:
ip(ip_address, "2222.4.56.5")
ip(ip_address, "255. 255.268.7")
ip(ip_address, "0.0.0000.0")
ip(ip_address, "120.333.2.343") # $ anchors the end: nothing may follow the final digits; octet ranges are not checked

2222.4.56.5 is Not a valid IP address
255. 255.268.7 is Not a valid IP address
0.0.0000.0 is Not a valid IP address
120.333.2.343 is a valid IP address
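
The last result shows that this pattern only checks the shape of the address, not whether each number is in the range 0-255. If stricter validation is wanted, one common approach is to spell out the allowed ranges for an octet. The names `octet`, `strict_ip` and `is_valid_ip` below are made up for this sketch.

```python
import re

# One octet: 250-255, 200-249, 100-199, or 0-99 without a leading zero
octet = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
strict_ip = re.compile(r"(?:%s\.){3}%s" % (octet, octet))

def is_valid_ip(candidate):
    """Return True if the whole string is a dotted quad with octets in 0-255."""
    return strict_ip.fullmatch(candidate) is not None

print(is_valid_ip("192.168.0.1"))     # True
print(is_valid_ip("120.333.2.343"))   # False
print(is_valid_ip("0.0.0000.0"))      # False
```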


### 3.2 Extract all date-times


In the next task, we need to extract all date-times from the file. For now, we trust that all of them are valid.
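
As a starting point, here is a sketch that assumes the timestamps look like `Sat Jan  5 09:14:16 2008` (weekday, month, day, time, year). Both that format and the `sample` lines are assumptions for illustration; check the actual file and adjust the pattern if other formats appear.

```python
import re

# Hypothetical lines in an mbox-like format
sample = (
    "From alice@example.org Sat Jan  5 09:14:16 2008\n"
    "From bob@example.org Fri Jan  4 18:10:48 2008\n"
)

# Weekday, month, day (padded with one or two spaces), hh:mm:ss, 4-digit year
datetime_pattern = (
    r"[A-Z][a-z]{2} [A-Z][a-z]{2} {1,2}\d{1,2} "
    r"\d{2}:\d{2}:\d{2} \d{4}"
)

datetimes = re.findall(datetime_pattern, sample)
print(datetimes)
```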


### 3.3 Extract author's email address


There are many email addresses included in the file. We would like to 

1. extract all email addresses (the format is normally: "Author: stephen.marquard@uct.ac.za" )
2. extract the email address after "Author:"

Have a try now: 

```python
r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
```
which was copied and pasted from http://emailregex.com/

Does it work in the task?

### Nope, but I don't yet know where it goes wrong

In [135]:
def email(pattern, e):
    if re.match(pattern, e):
        print("This is a valid email.")
    else:
        print("This is NOT a valid email.")

In [178]:
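# \b is a word boundary: the position between a word character and a non-word character (or the start/end of the string)
# \bage\b -> "the age is" -> matches: a non-word character comes before "a" and after "e"
# \bage\b -> "language" -> no match: "age" is preceded by the word character "u"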
email_pattern1 = r"\b[a-zA-Z0-9_\.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]\b"
email_pattern2 = r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b"

In [179]:
email(email_pattern1, "stephen.marquard@uct.ac.za")
email(email_pattern2, "stephen.marquard@uct.ac.za")

This is NOT a valid email.
This is a valid email.
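
Why does `email_pattern1` fail? Its last character class has no quantifier, so after `uct.` it consumes only the single character `a`, and `\b` then fails because `a` is followed by `c`. The sketch below shows this, along with one possible fix and one possible way to capture only the address after "Author:". The names `fixed_pattern` and `authors` are made up here.

```python
import re

address = "stephen.marquard@uct.ac.za"

# Original pattern: the final class matches exactly one character
email_pattern1 = r"\b[a-zA-Z0-9_\.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]\b"
first_try = re.match(email_pattern1, address)
print(first_try)            # None

# Allowing "." in the last class and adding "+" lets it consume "ac.za"
fixed_pattern = r"\b[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+\b"
second_try = re.match(fixed_pattern, address).group()
print(second_try)           # stephen.marquard@uct.ac.za

# Capture only the address that follows "Author:" at the start of a line
line = "Author: stephen.marquard@uct.ac.za"
authors = re.findall(r"^Author:\s*(\S+@\S+)", line, flags=re.MULTILINE)
print(authors)              # ['stephen.marquard@uct.ac.za']
```

With `re.MULTILINE`, the same `findall` call can be run on the whole file contents, since `^` then matches at the start of every line.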


## 4. Homework

Watch the Software Carpentry lecture on regular expressions if you need more help.

https://www.youtube.com/playlist?list=PL7C1EB31127AB8A0B

Alternatively, you can watch the video lecture on Lynda.com:

https://wwwlyndacom.ezproxy.lib.monash.edu.au/Regular-Expressions-tutorials/Welcome/85870/93904-4.html?autoplay=true

To access Lynda, you need to set up your account according to

http://resources.lib.monash.edu.au/eresources/lyndacom.html

In [170]:
match = re.search('(....)-(..)-(..)', 'Baker 1\t2019-11-02\t1223.0')

In [172]:
print(match.group(0)) # group(0) is the whole match
print(match.group(1), match.group(2), match.group(3)) # the individual captured groups

2019-11-02
2019 11 02
