**Regular expressions in Python**

[Tutorial on YouTube](https://www.youtube.com/watch?v=sZyAn2TW7GY)

# Identifiers
- Says what you're looking for

```
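\d any digit (0-9)
\D anything but a digit
\s any whitespace character (space, tab, newline, ...)
\S anything but whitespace
\w any word character (letter, digit, or underscore)
\W anything but a word character
.  (period) any character, except for a newline
\b a word boundary (the zero-width position at the start or end of a word)
\. a literal period (needs a backslash)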

```
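
A quick way to check the identifiers above is to run each one through `re.findall`. This is only a sketch; the sample strings are made up for illustration.

```python
import re

# Made-up sample text containing digits, punctuation, a tab, and an underscore
text = "Room 42, floor 3.\tCall ext_7 now"

digits = re.findall(r"\d", text)        # every single digit
words = re.findall(r"\w+", text)        # runs of letters, digits, or underscores
non_space = re.findall(r"\S+", text)    # runs of non-whitespace characters
periods = re.findall(r"\.", text)       # a literal period

# \b is zero-width: it matches a position, not a character
whole_word = re.findall(r"\bcat\b", "cat concat cats")

print(digits)      # ['4', '2', '3', '7']
print(words)       # ['Room', '42', 'floor', '3', 'Call', 'ext_7', 'now']
print(non_space)   # ['Room', '42,', 'floor', '3.', 'Call', 'ext_7', 'now']
print(periods)     # ['.']
print(whole_word)  # ['cat']
```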

# Modifiers
- Says how many or what type you're looking for, like a description

```
{5}    Expecting exactly 5 of these
{1,3}  Expecting 1 to 3 (no space after the comma)
?   Match 0 or 1
*   Match 0 or more
+   Match 1 or more
$   Match the end of a string
^   Match the beginning of a string
|   Match either/or, for example:  \d{1,3}|\w{5,6}
[]  Character set or range, for example:  [A-Za-z] # any letter, [0-9] # any digit
```
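
The modifiers can be tried out the same way. The following is a small sketch with made-up inputs, showing how each quantifier changes what `re.findall` returns.

```python
import re

exactly_five = re.findall(r"\d{5}", "zip 90210 and 1234")   # needs 5 digits in a row
one_to_three = re.findall(r"\d{1,3}", "7 42 1234")          # 1234 is split into 123 and 4
optional_u = bool(re.fullmatch(r"colou?r", "color"))        # ? makes the 'u' optional
zero_or_more = re.findall(r"ab*", "a ab abbb")              # 'a' followed by any number of 'b'
one_or_more = re.findall(r"ab+", "a ab abbb")               # 'a' followed by at least one 'b'
either = re.findall(r"\d{1,3}|[a-z]{5,6}", "42 hello abc")  # digits OR 5-6 lowercase letters
first_word = re.findall(r"^\w+", "first second")            # anchored to the start
last_word = re.findall(r"\w+$", "first second")             # anchored to the end

print(exactly_five)  # ['90210']
print(one_to_three)  # ['7', '42', '123', '4']
print(zero_or_more)  # ['a', 'ab', 'abbb']
print(one_or_more)   # ['ab', 'abbb']
print(either)        # ['42', 'hello']
print(first_word, last_word)  # ['first'] ['second']
```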

# Whitespace characters

```
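\n  new line
\s  any whitespace character (space, tab, newline, ...)
\t  tab
\e  escape (not supported by Python's re module; use \x1b instead)
\f  form feed
\r  carriage return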
```
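
As a rough illustration, `\s` covers all of the individual whitespace characters listed above, which is what makes it handy for splitting. The sample strings are invented for this sketch.

```python
import re

# One of each: space, tab, newline, carriage return, form feed
sample = "a b\tc\nd\re\ff"

whitespace_found = re.findall(r"\s", sample)
tokens = re.split(r"\s+", "one  two\tthree\nfour")  # split on any run of whitespace

# Python's re module rejects \e as an unknown escape
try:
    re.compile(r"\e")
    escape_supported = True
except re.error:
    escape_supported = False

print(whitespace_found)  # [' ', '\t', '\n', '\r', '\x0c']
print(tokens)            # ['one', 'two', 'three', 'four']
print(escape_supported)  # False
```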

# Don't forget!  Need a \ to escape these characters

```
.  +  *  ?  [  ]  $  ^  (  )  {  }  |  \
```
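
Here is a short sketch of why escaping matters, using made-up strings. When the text to match comes from a variable, `re.escape` adds the backslashes for you.

```python
import re

# Unescaped, the period is a wildcard and also matches the 'x'
unescaped = re.findall(r"3.50", "3x50 3.50")
escaped = re.findall(r"3\.50", "3x50 3.50")

# re.escape() backslash-escapes every special character in a literal string
price = "Total: $3.50 (approx.)"
pattern = re.escape("$3.50")
found = re.search(pattern, price).group()

print(unescaped)  # ['3x50', '3.50']
print(escaped)    # ['3.50']
print(found)      # $3.50
```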

# Practicing regular expressions

## Tutorial example

In [11]:
import re

In [12]:
exampleString = '''
Jessica is 15 years old, and Daniel is 27 years old.
Edward is 97 and his grandfather, Oscar, is 102.
'''

In [4]:
exampleString

'\nJessica is 15 years old, and Daniel is 27 years old.\nEdward is 97 and his grandfather, Oscar, is 102.\n'

In [7]:
ages = re.findall(r'\d{1,3}', exampleString)

In [8]:
ages

['15', '27', '97', '102']

In [9]:
names = re.findall('[A-Z][a-z]*', exampleString)

In [13]:
names

['Jessica', 'Daniel', 'Edward', 'Oscar']

## Work example

In [None]:
jack = 'xxCOL04_test'

In [None]:
x = re.findall('COL[0-9.]+', jack)

In [None]:
x

In [None]:
x[0][3:5]
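
An alternative to slicing the match by position is to put the digits in a capture group, so the regex itself returns just the number. This is a sketch of that idea, not the original approach; `col_number` is a name introduced here for illustration.

```python
import re

jack = 'xxCOL04_test'

# The parentheses capture only the digits that follow 'COL'
match = re.search(r'COL(\d+)', jack)
col_number = match.group(1) if match else None

print(col_number)  # 04
```

This avoids hard-coding the `[3:5]` offsets, which would break if the number had a different length.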

# Applying regular expressions to Pandas

In [14]:
import pandas as pd

In [16]:
# Series of sample IDs, each containing a COL and a ROW field
a_series = pd.Series(['G2_TKD171001753-AK232_AHG3W3CCXY_COL06_L8_ROW37', 'G2_TKD171001753-AK232_AHG3W3CCXY_COL06_L8_ROW19',
                     'D2_TKD171001753-AK226_AHG3W3CCXY_COL10_L8_ROW07', 'G2_TKD171001753-AK232_AHG3W3CCXY_COL06_L8_ROW03',
                     'H3_TKD171001753-AK235_AHG3W3CCXY_COL18_L8_ROW15'])

In [17]:
a_series

0    G2_TKD171001753-AK232_AHG3W3CCXY_COL06_L8_ROW37
1    G2_TKD171001753-AK232_AHG3W3CCXY_COL06_L8_ROW19
2    D2_TKD171001753-AK226_AHG3W3CCXY_COL10_L8_ROW07
3    G2_TKD171001753-AK232_AHG3W3CCXY_COL06_L8_ROW03
4    H3_TKD171001753-AK235_AHG3W3CCXY_COL18_L8_ROW15
dtype: object

In [25]:
# Extract the two digits after the literal 'COL'
# (square brackets would make [COL] match just one of C, O, or L)
a_series.str.extract(r'COL(\d\d)', expand=False)

0    06
1    06
2    10
3    06
4    18
dtype: object

In [26]:
# Extract the two digits after the literal 'ROW'
a_series.str.extract(r'ROW(\d\d)', expand=False)

0    37
1    19
2    07
3    03
4    15
dtype: object
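
One thing to watch for in patterns like these: a bracketed `[COL]` is a character class that matches a single `C`, `O`, or `L`, whereas bare `COL` matches the three letters in sequence. The sketch below uses plain `re` and a hypothetical ID (not from the series above) to show where the two differ. pandas' `str.extract` accepts the same pattern syntax, and named groups become column names there.

```python
import re

# Hypothetical ID in which a 'C' is followed by two digits before 'COL' appears
sample_id = 'C12_TKD_COL06_L8_ROW37'

char_class = re.search(r'[COL](\d\d)', sample_id).group(1)  # matches 'C12' first
literal = re.search(r'COL(\d\d)', sample_id).group(1)       # matches 'COL06'

# Named groups pull out both fields in one pass
fields = re.search(r'COL(?P<col>\d\d).*ROW(?P<row>\d\d)', sample_id).groupdict()

print(char_class)  # 12
print(literal)     # 06
print(fields)      # {'col': '06', 'row': '37'}
```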


# From py4e book, Chapter 11

For example, the caret character is used in regular expressions to match “the
beginning” of a line. We could change our program to only match lines where
“From:” was at the beginning of the line as follows:

```
# Search for lines that start with 'From:'
import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line):
        print(line)
# Code: http://www.py4e.com/code3/re02.py
```
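
To try the caret without the `mbox-short.txt` file, the same test can be run on a few in-memory lines. This is a sketch; the lines are made up. With the `re.MULTILINE` flag, the caret also matches at the start of each line inside a single multi-line string.

```python
import re

# Made-up lines standing in for the contents of a mail file
lines = [
    "From: stephen@example.com",
    "Subject: From: is in the middle here",
    "From stephen Sat Jan  5",
]

# Only lines where 'From:' sits at the very beginning are kept
matching = [line for line in lines if re.search(r'^From:', line)]

# Same idea on one multi-line string: ^ and $ anchor to each line
text = "\n".join(lines)
multiline = re.findall(r'^From:.*$', text, flags=re.MULTILINE)

print(matching)   # ['From: stephen@example.com']
print(multiline)  # ['From: stephen@example.com']
```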

In [2]:
import re
s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'
lst = re.findall(r'\S+@\S+', s)     # Find strings that surround an '@' character
print(lst)

['csev@umich.edu', 'cwen@iupui.edu']


In [3]:
lst = re.findall(r'\S+s\S+', s)     # Find strings that surround an 's' character
print(lst)

['message', 'csev@umich.edu']


In [4]:
lst = re.findall(r'\S+a\S+', s)     # Find strings that surround an 'a' character
print(lst)

['message']


In [11]:
lst = re.findall(r'a\S+', s)     # Find an 'a' followed by non-whitespace characters (anywhere, even mid-word)
print(lst)

['age', 'about']
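
Note that `'age'` above comes from the middle of `message`. To match only whole words that begin with `a`, one option is to add a `\b` word boundary in front. A sketch using the same string:

```python
import re

s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'

# \b forces the 'a' to sit at the start of a word
starts_with_a = re.findall(r'\ba\w*', s)

# re.IGNORECASE also picks up the capital 'A'
any_case = re.findall(r'\ba\w*', s, flags=re.IGNORECASE)

print(starts_with_a)  # ['about']
print(any_case)       # ['A', 'about']
```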


## Using square brackets
Square brackets are used to indicate a set of multiple acceptable characters we are willing to consider matching. In a sense,
the “\S” is asking to match the set of “non-whitespace characters”. Now we will be a little more explicit in terms of the characters we will match.
Here is our new regular expression:
```
[a-zA-Z0-9]\S*@\S*[a-zA-Z]
```
This is getting a little complicated and you can begin to see why regular expressions are their own little language unto themselves. Translating this regular expression, we are looking for substrings that start with a single lowercase letter, uppercase letter, or number “[a-zA-Z0-9]”, followed by zero or more non-blank characters (“\S\*”), followed by an at-sign, followed by zero or more non-blank characters (“\S\*”), followed by an uppercase or lowercase letter. Note that we switched from
“+” to “\*” to indicate zero or more non-blank characters since “[a-zA-Z0-9]” is already one non-blank character. Remember that the “\*” or “+” applies to the single character immediately to the left of the plus or asterisk.

Inside square brackets, most characters are not “special” (the exceptions are “]”, the backslash, a leading “^”, and a “-” between two characters). So when we say “[0-9.]”, it really means digits or a period. Outside of square brackets, a period is the “wildcard” character and matches any character. Inside square brackets, the period is
a period.
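
The difference between the loose `\S+@\S+` pattern and the bracketed version shows up when addresses are wrapped in punctuation. The string below is made up for this sketch.

```python
import re

# Made-up text with addresses wrapped in punctuation
s = 'Contact <csev@umich.edu>; or (cwen@iupui.edu), thanks'

loose = re.findall(r'\S+@\S+', s)                     # drags the punctuation along
tight = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', s)  # must start and end on a letter or digit

# Inside brackets the period is literal, so this matches digits and periods only
number = re.findall(r'[0-9.]+', 'v1.25 beta')

print(loose)   # ['<csev@umich.edu>;', '(cwen@iupui.edu),']
print(tight)   # ['csev@umich.edu', 'cwen@iupui.edu']
print(number)  # ['1.25']
```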

*Note: I escaped asterisks so that it would show properly for the purposes of the markdown code (except for this line which is intended to be italicized). In the above text, the asterisks are used to demonstrate matching 0 or more for regular expressions.*

## Extracting specific substrings based on their location relative to another string


- X-DSPAM-Confidence: 0.8475
- X-DSPAM-Probability: 0.0000
- X-DSPAM-Confidence: 0.6178
- X-DSPAM-Probability: 0.0000

But now we have to solve the problem of extracting the numbers. While it would
be simple enough to use split, we can use another feature of regular expressions
to both search and parse the line at the same time.

Parentheses are another special character in regular expressions. When you add
parentheses to a regular expression, they are ignored when matching the string.
But when you are using findall(), parentheses indicate that while you want the
whole expression to match, you only are interested in extracting a portion of the
substring that matches the regular expression.
So we make the following change to our program:

```
# Search for lines that start with 'X' followed by any
# non-whitespace characters and ':' followed by a space
# and any number. The number can include a decimal.
# Then print the number if a match was found.

import re
hand = open('mbox-short.txt')

for line in hand:
    line = line.rstrip()
    x = re.findall(r'^X\S*: ([0-9.]+)', line)
    if len(x) > 0:
        print(x)
# Code: http://www.py4e.com/code3/re11.py
```
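
The effect of the parentheses is easier to see on a few in-memory lines than on the whole file. In this sketch the `Subject` line is made up, and the conversion to `float` is an extra step not in the program above.

```python
import re

lines = [
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'Subject: hello',               # made-up non-matching line
    'X-DSPAM-Confidence: 0.6178',
]

# Without parentheses, findall returns the whole matched text
whole = re.findall(r'^X\S*: [0-9.]+', lines[0])

# With parentheses, findall returns only the captured number
values = []
for line in lines:
    found = re.findall(r'^X\S*: ([0-9.]+)', line)
    if found:
        values.append(float(found[0]))

print(whole)   # ['X-DSPAM-Confidence: 0.8475']
print(values)  # [0.8475, 0.0, 0.6178]
```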



## Dr. Chuck's summary

```
^ Matches the beginning of the line.
$ Matches the end of the line.
. Matches any character (a wildcard).
\s Matches a whitespace character.
\S Matches a non-whitespace character (opposite of \s).
* Applies to the immediately preceding character and indicates to match zero or
more of the preceding character(s).
*? Applies to the immediately preceding character and indicates to match zero or
more of the preceding character(s) in “non-greedy mode”.
+ Applies to the immediately preceding character and indicates to match one or
more of the preceding character(s).
+? Applies to the immediately preceding character and indicates to match one or
more of the preceding character(s) in “non-greedy mode”.
[aeiou] Matches a single character as long as that character is in the specified set.
In this example, it would match “a”, “e”, “i”, “o”, or “u”, but no other characters.
[a-z0-9] You can specify ranges of characters using the minus sign. This example
is a single character that must be a lowercase letter or a digit.
[^A-Za-z] When the first character in the set notation is a caret, it inverts the logic.
This example matches a single character that is anything other than an uppercase
or lowercase letter.
( ) When parentheses are added to a regular expression, they are ignored for the
purpose of matching, but allow you to extract a particular subset of the matched
string rather than the whole string when using findall().
\b Matches the empty string, but only at the start or end of a word.
\B Matches the empty string, but not at the start or end of a word.
\d Matches any decimal digit; equivalent to the set [0-9].
\D Matches any non-digit character; equivalent to the set [^0-9].
```
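
The “non-greedy mode” entries in the summary are the easiest to misread, so here is a short sketch of the difference between `+` and `+?` on a string with two colons.

```python
import re

s = 'From: Using the : character'

greedy = re.findall(r'^F.+:', s)       # .+ grabs as much as possible, up to the LAST colon
non_greedy = re.findall(r'^F.+?:', s)  # .+? stops at the FIRST colon

print(greedy)      # ['From: Using the :']
print(non_greedy)  # ['From:']
```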

## Regex Coach

http://www.weitz.de/regex-coach/

# Kevin's lesson on regex

1. Learning regular expression syntax in general
2. Learning how to use that syntax **in Python**

He suggests learning these separately to avoid confusion.

## Learning syntax (in general, outside of Python)



Go to regex101.com for practice/lessons

>literal - matches itself exactly <br>
>metacharacter - matches something other than itself, e.g. a period (.)

. = any character other than a newline