# Regular expressions (regex)
A regular expression (regex) is a sequence of characters that defines a search pattern. Regular expressions let us search text for structured patterns, for example an email address that starts with an "s", is followed by 3 digits and ends with "@yahoo.com".

In this notebook we will briefly touch upon string manipulation and using regex with pandas.
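
As a first taste, the email example from the introduction can be sketched as follows. The pattern and the candidate addresses are illustrative assumptions, not part of any dataset used in this notebook.

```python
import re

# Hypothetical pattern: an "s", exactly three digits, then "@yahoo.com".
# The dot is escaped so that it matches a literal "." only.
pattern = re.compile(r"s\d{3}@yahoo\.com")

# Made-up candidate strings to test the pattern against.
candidates = ["s123@yahoo.com", "s12@yahoo.com", "t123@yahoo.com", "s1234@yahoo.com"]

# fullmatch() requires the whole string to fit the pattern.
matches = [c for c in candidates if pattern.fullmatch(c)]
print(matches)
```

Only the first candidate fits; the others have too few digits, the wrong first letter, or too many digits.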

# String manipulation <a name="strings"></a>
Python has long been popular for its raw data manipulation in part due to its ease of use for string and text processing. Most text operations are made simple with the string object's built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed.

In [1]:
import numpy as np
import pandas as pd

### Basics
Let's refresh what ordinary `str` (string) objects are capable of in Python.

In [2]:
# complex strings can be broken into small bits
val = "Edinburgh is great"
val.split(" ")

['Edinburgh', 'is', 'great']

In [3]:
# substrings can be concatenated together with +
first, second, last = val.split(" ")
first + "::" + second + "::" + last

'Edinburgh::is::great'
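
When there are many pieces to glue together, `str.join` is usually more convenient than repeated `+`. A minimal sketch, reusing the same sentence:

```python
val = "Edinburgh is great"

# join() inserts the separator between every element of the list
joined = "::".join(val.split(" "))
print(joined)
```

This produces the same string as the concatenation above, without needing one variable per word.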

Remember that strings are sequences of individual characters, so you can loop over them just like a list

In [4]:
val = "Edinburgh"
for each in val:
    print(each)

E
d
i
n
b
u
r
g
h


Strings also have their own search methods, such as `find`, which returns the index of the first match

In [5]:
val.find("n")

3

In [6]:
val.find("x")  # -1 means that there is no such element

-1
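
Besides `find`, a few other built-in tools are often enough before reaching for a regex. The sketch below uses the `in` operator for a membership test and slicing to take a prefix; the variable names are only for illustration.

```python
val = "Edinburgh"

# "in" answers yes/no without returning a position
contains_burgh = "burgh" in val

# find() also accepts substrings, not just single characters
position = val.find("burgh")

# slicing works on strings exactly as it does on lists
prefix = val[:4]

print(contains_burgh, position, prefix)
```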

In [7]:
# and of course remember about upper() and lower()
val.upper()

'EDINBURGH'

If you want to learn more about strings you can always refer to the [Python manual](https://docs.python.org/2/library/string.html)

### Regular expressions
Regular expressions provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called a *regex*, is a string formed according to the regular expression language. Python's built-in `re` module is responsible for applying regular expressions to strings.

In [8]:
import re
text = "foo    bar\t baz   \tqux"
text

'foo    bar\t baz   \tqux'

In [9]:
re.split(r"\s+", text)

['foo', 'bar', 'baz', 'qux']

This expression splits the text wherever it finds whitespace. `\s` matches any single whitespace character (spaces, tabs such as `\t`, newlines), and the `+` after it means one or more consecutive occurrences, so each run of whitespace is treated as a single separator.
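
To see what the `+` contributes, the sketch below compares splitting on a single whitespace character with splitting on runs of whitespace, and also uses `re.sub` to collapse each run into one space.

```python
import re

text = "foo    bar\t baz   \tqux"

# Without "+", every whitespace character is its own separator,
# so consecutive separators leave empty strings in the result.
single = re.split(r"\s", text)

# With "+", a whole run of whitespace counts as one separator.
runs = re.split(r"\s+", text)

# The same pattern can replace each run with a single space.
collapsed = re.sub(r"\s+", " ", text)

print(single)
print(runs)
print(collapsed)
```

For this particular task the plain `text.split()` with no arguments gives the same list as `runs`; the regex version becomes useful once the separator is more complicated than whitespace.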

Let's have a look at a more complex example - identifying email addresses in a text file:

In [10]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

# pattern to be used for searching
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

In [11]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']


Let's dissect the regex part by part:
```
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
```

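- the `r` prefix marks a raw string: backslashes are kept as literal characters instead of starting Python escape sequences. Without it, Python would turn `\n` into a newline before the regex engine ever saw it; with it, sequences such as `\.` reach the `re` module unchanged
- `A-Z` means all uppercase letters from A to Z. Lowercase letters are matched here only because we pass `re.IGNORECASE`
- `0-9` similarly means all digits from 0 to 9
- the run `._%+-` means just those literal characters (inside square brackets `.` and `+` lose their special meaning, and a `-` placed last is a literal hyphen)
- the square brackets [ ] define a character class that matches any single one of the characters inside. For example `[A-Z0-9._%+-]` matches one letter A to Z, one digit 0 to 9, or one of the characters ._%+-
- `+` after an element means one or more repetitions of that element
- `@` matches itself, and `\.` matches a literal dot (an unescaped `.` would match any character)
- `{2,4}` means 2 to 4 repetitions of the preceding element, here 2 to 4 letters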

To summarise, the pattern above searches for one or more letters, digits or the characters `._%+-`, followed by an `@`, then one or more letters, digits, dots or hyphens, followed by a `.` and 2 to 4 letters.
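
The individual building blocks of the pattern can be tried out in isolation. The strings below are made up purely to illustrate each piece.

```python
import re

# Raw strings keep the backslash: "\n" is one character, r"\n" is two.
plain_length = len("\n")
raw_length = len(r"\n")

# A character class matches any single character listed in the brackets.
upper_only = re.findall(r"[A-C]", "ABCDabcd")
any_case = re.findall(r"[A-C]", "ABCDabcd", flags=re.IGNORECASE)

# "+" repeats the preceding element one or more times.
digit_runs = re.findall(r"[0-9]+", "room 42, floor 7")

# "{2,4}" repeats the preceding element between 2 and 4 times.
words = ["a", "ab", "abcd", "abcdef"]
two_to_four = [w for w in words if re.fullmatch(r"[a-z]{2,4}", w)]

print(plain_length, raw_length)
print(upper_only, any_case)
print(digit_runs)
print(two_to_four)
```

Note that `[A-C]` on its own matches uppercase letters only; the lowercase matches appear once `re.IGNORECASE` is passed.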

### Regular expressions and pandas
Let's see how they can be combined by replicating the example above.

In [12]:
data = pd.Series({'Dave': 'Daves email dave@google.com', 'Steve': 'Steves email steve@gmail.com',
        'Rob': 'Robs rob@gmail.com', 'Wes': np.nan})
data

Dave      Daves email dave@google.com
Steve    Steves email steve@gmail.com
Rob                Robs rob@gmail.com
Wes                               NaN
dtype: object

We can reuse the same `pattern` variable from above

In [13]:
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [dave@google.com]
Steve    [steve@gmail.com]
Rob        [rob@gmail.com]
Wes                    NaN
dtype: object
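
If the parts of each address are needed separately, one option is to add capture groups to the pattern and use `str.extract`. This is a sketch: the grouped pattern and the column names `user`, `domain` and `suffix` are assumptions introduced here for illustration.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({
    "Dave": "Daves email dave@google.com",
    "Steve": "Steves email steve@gmail.com",
    "Rob": "Robs rob@gmail.com",
    "Wes": np.nan,
})

# Same email pattern as before, with parentheses around the three parts.
grouped_pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

# extract() returns a DataFrame with one column per capture group.
parts = data.str.extract(grouped_pattern, flags=re.IGNORECASE)
parts.columns = ["user", "domain", "suffix"]  # hypothetical column names
print(parts)
```

Rows with missing input, such as `Wes`, stay missing in every extracted column.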

pandas also offers more standard string operations. For example, we can check if a string is contained within a data row:

In [14]:
data.str.contains("gmail")

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
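
The `NaN` in the result can get in the way when the output is used as a filter. A minimal sketch of one way around it, using the `na` argument of `str.contains`; the second pattern is a made-up example showing that `contains` also accepts a regex.

```python
import numpy as np
import pandas as pd

data = pd.Series({
    "Dave": "Daves email dave@google.com",
    "Steve": "Steves email steve@gmail.com",
    "Rob": "Robs rob@gmail.com",
    "Wes": np.nan,
})

# na=False treats missing values as "does not contain"
has_gmail = data.str.contains("gmail", na=False)

# the pattern is interpreted as a regex by default;
# (?:...) is a non-capturing group for the alternatives
gmail_or_yahoo = data.str.contains(r"@(?:gmail|yahoo)\.com", na=False)

# a purely boolean result can be used directly as a mask
print(data[has_gmail])
```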

Many more of these methods exist:
    
    
| Methods | Description |
| -- | -- |
| cat | Concatenate strings element-wise with optional delimiter |
| contains | Return boolean array if each string contains pattern/regex |
| count | Count occurrences of a pattern |
| extract | Use a regex with groups to extract one or more strings from a Series |
| findall | Compute list of all occurrences of pattern/regex for each string |
| get | Index into each element |
| isdecimal | Checks whether all characters in the string are decimal characters |
| isdigit | Checks whether all characters in the string are digits |
| islower | Checks if the string is in lower case |
| isupper | Checks if the string is in upper case |
| join | Join strings in each element of the Series with passed separator |
| len | Compute the length of each string |
| lower, upper | Convert cases |
| match | Checks whether each string matches the regex from its start (returns booleans) |
| pad | Adds whitespace to left, right or both sides of strings |
| repeat | Duplicate string values |
| slice | Slice each string in the Series |
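
A few of the methods from the table, applied to a small made-up Series of city names:

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the methods
cities = pd.Series(["Edinburgh", "Glasgow", "Aberdeen"])

lengths = cities.str.len()            # number of characters in each string
upper = cities.str.upper()            # convert to upper case
e_counts = cities.str.count("e")      # occurrences of lowercase "e"
first_three = cities.str.slice(0, 3)  # first three characters

print(lengths.tolist())
print(upper.tolist())
print(e_counts.tolist())
print(first_three.tolist())
```

Each call works element-wise and returns a new Series, so the calls can be chained or assigned to DataFrame columns.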

### Exercise
There is a dataset `data/yob2012.txt` which lists the number of newborns registered in 2012 with their names and sex. Using regular expressions, extract all names from the dataset which start with letters A to C. How many names did you find?

Note: `^` is the "starting with" operator in regular expressions: it anchors the match to the start of the string.
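
The effect of the anchor is easiest to see on a handful of made-up names (these are illustrative strings, not rows from the dataset):

```python
import re

names = ["Alice", "Bob", "Carol", "Dave", "Eve-Anna"]

# With "^", the letter A-C must be the very first character.
anchored = [n for n in names if re.search(r"^[A-C]", n)]

# Without "^", an A-C anywhere in the string is enough.
unanchored = [n for n in names if re.search(r"[A-C]", n)]

print(anchored)
print(unanchored)
```

`Eve-Anna` only appears in the second list, because its capital `A` is not at the start of the string.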

In [29]:
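# capture a name that starts with A, B or C and runs up to the first comma
new_pattern = re.compile(r"^([A-C][a-z]+),")
count = 0  # number of matching names; this is the answer to the exercise
with open("data/yob2012.txt", "r", encoding="utf-8") as f:
    for line in f:  # iterate lazily instead of loading all lines at once
        match = new_pattern.search(line)
        if match:
            count += 1
            print(match.group(1))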

Ava
Abigail
Chloe
Avery
Addison
Aubrey
Charlotte
Amelia
Brooklyn
Anna
Aaliyah
Allison
Alexis
Audrey
Alyssa
Claire
Camila
Arianna
Ashley
Brianna
Bella
Alexa
Aubree
Autumn
Ariana
Alexandra
Caroline
Bailey
Aria
Annabelle
Andrea
Brooke
Brielle
Alice
Angelina
Clara
Brooklynn
Aliyah
Amy
Adriana
Cora
Alaina
Catherine
Aurora
Alana
Ariel
Alivia
Brynn
Aniyah
Angela
Adalyn
Allie
Alayna
Alexandria
Ashlyn
Adrianna
Amaya
Cecilia
Ana
Callie
Angel
Chelsea
Adelyn
Adeline
Camille
Adalynn
Arabella
Athena
Ayla
Alexia
Addyson
Allyson
Amber
Amanda
Alina
Alicia
Alison
Ashlynn
Cassidy
Alondra
Christina
Carly
Cadence
Briana
Charlie
Abby
Annabella
Bianca
Cheyenne
Brynlee
Aubrie
Cali
Carmen
Anastasia
Ainsley
Baylee
Alessandra
Adelaide
Camryn
Bethany
Angelica
Addisyn
Annie
Amiyah
Briella
Caitlyn
Charlee
Crystal
Angelique
Alejandra
Anya
April
Breanna
Brittany
Brylee
Arielle
Arya
Cynthia
Aleah
Cassandra
Caitlin
Carolina
Bristol
Camilla
Audrina
Braelyn
Bridget
Aniya
Averie
Aylin
Adelynn
Celeste
Anaya
Catalina
Catale

## Further Resources
In this extra notebook, we briefly touched upon *regular expressions* and how they are used in Python. However, regular expressions are a standard way of matching text across all of computing and IT, i.e. they are not particular to one programming language or tool (although the exact syntax varies slightly between implementations). If you are interested in textual analysis or working with databases, then I would recommend taking the time to learn regular expressions.

Here is a collection of resources for that:
- [Lynda course](https://www.lynda.com/Regular-Expressions-tutorials/Using-Regular-Expressions/85870-2.html?srchtrk=index%3a1%0alinktypeid%3a2%0aq%3aregular+expressions%0apage%3a1%0as%3arelevance%0asa%3atrue%0aproducttypeid%3a2) - very good one-stop-shop for learning regular expressions. You'll be a pro at the end of the course. Lynda is free for students and staff of Edinburgh University.
- https://regexone.com/ - free website course on regular expressions. Also very good, comprehensive and interactive.
- [Python-specific tutorial](https://www.w3schools.com/python/python_regex.asp) - if you ever need to see more examples of how regular expressions are used in Python.