# MSAS Python Tutorial Six

Authored by: Ben Weber

Content drawn from Professor Chris Brooks at the University of Michigan

**Today we are going to focus on a new Python library, called regex**

Before beginning, take a moment to think about the functionality of a library. Why is it important that Python has a wide collection of libraries that can be imported into your environment?

**What is regex?**

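* regex is an important library in Python for data cleaning, and its name is short for regular expression
* regex is a third-party package (installed with pip install regex) that is compatible with re, the regular expression module built into Python, which is why it is commonly imported under the name re
* The built-in functions that regex offers allow the user to match strings against a given regular expression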

Why might these functions be so useful to data scientists when cleaning data?

First, import regex into your environment 

In [None]:
import regex as re

In [None]:
simple_string = """Amy is 5 years old, and her sister Mary is 2 years old. 
    Ruth and Peter, their parents, have 3 kids."""

result = re.findall("[A-Z][a-z]*", simple_string)
print(result)

This is a simple example of regex's findall method:

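* findall takes a user-defined regular expression as its first argument, and the string to search as its second argument
* findall returns a list of all the non-overlapping parts of the string that match the regular expression
* Here the pattern [A-Z][a-z]* matches one uppercase letter followed by zero or more lowercase letters, so the result is ['Amy', 'Mary', 'Ruth', 'Peter']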

At this point, you are likely wondering how to actually come up with the regular expression in the parameter. How do we create the right regular expression?

* Regular expressions are built from a combination of metacharacters (such as [ ], * and { }) and special sequences (such as \s for whitespace)
* There are too many to cover here, but a link at the end of this tutorial points to a page that lists many of them with a description
* With this syntax, you can write regular expressions that match just about any sequence of characters in a string or file
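To make the bullets above concrete, here is a minimal sketch of a few common metacharacters and special sequences. The sample string and variable names are made up for illustration. The sketch imports the standard-library re module so that it runs without installing anything; the same calls should behave the same way with import regex as re.

```python
import re  # standard-library module, used here so nothing extra needs installing

# Hypothetical sample text, invented for this illustration
sample = "Order 66 shipped on 2022-10-05 to Ann Arbor, MI 48109."

# \d is a special sequence for a digit; + means "one or more"
numbers = re.findall(r"\d+", sample)

# {4} and {2} repeat the preceding item an exact number of times
dates = re.findall(r"\d{4}-\d{2}-\d{2}", sample)

# [A-Z] is a character set; \b marks a word boundary
state_codes = re.findall(r"\b[A-Z]{2}\b", sample)

print(numbers)      # ['66', '2022', '10', '05', '48109']
print(dates)        # ['2022-10-05']
print(state_codes)  # ['MI']
```

Writing each pattern as a raw string (the r prefix) keeps Python from interpreting the backslashes before the regular expression engine sees them.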

**It takes practice to figure out how all of the different metacharacters and sequences work! A regex problem is best approached by first determining what types of characters appear in the strings you want to match, and then figuring out the metacharacters and sequences that describe them.**

Let's try an example that's a little more involved:

* We are going to load in a file with students' names and their grades in this format:

Ronald Mayr: A

* We want to return a list of the names of the students that earned B's

In [None]:
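# Read the whole grades file into one string
grades = open("grades.txt", "r")
sg = grades.read()
# Raw string (r"...") so that \s reaches the regex engine unchanged
B_students = re.findall(r"[A-Za-z]{3,20}\s[A-Za-z]{3,20}:\sB", sg)
# Each match looks like "First Last: B"; drop the last three characters (": B")
for number, item in enumerate(B_students):
    l = len(item)
    new_string = ""
    for num in range(l):
        if num < l-3:
            new_string += item[num]
    B_students[number] = new_string
print(B_students)
grades.close()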

**Explanation:**

* Open the file with the open function and read its contents into a new variable with the read() method
* Use the findall method to return a list of the entries for students that earned a B
* Each match still ends with ": B" and we only want the names, so a nested for loop rebuilds each string without its last three characters
* Closing the file object when you are done is always good practice
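A common alternative to calling close() yourself is a with statement, which closes the file automatically. The sketch below is only an illustration: it writes a few made-up grade lines (apart from the Ronald Mayr example above, the names are hypothetical) to a temporary file so it can run without grades.txt, and it uses the standard-library re module.

```python
import os
import re
import tempfile

# Hypothetical sample data in the "First Last: Grade" format described above
sample_lines = "Ronald Mayr: A\nJane Doe: B\nJohn Smith: C\n"

# Write the sample to a temporary file so the sketch runs without grades.txt
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample_lines)
    path = tmp.name

# The with statement closes the file when the block ends, even if an error occurs
with open(path, "r") as grades_file:
    text = grades_file.read()

os.remove(path)  # clean up the temporary file

matches = re.findall(r"[A-Za-z]{3,20}\s[A-Za-z]{3,20}:\sB", text)
print(grades_file.closed)  # True
print(matches)             # ['Jane Doe: B']
```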

**Is there a way that we could reduce code by using findall again instead of the nested for loop?**
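One possible answer, offered as a sketch and not the only approach, is a capture group: wrapping part of the pattern in parentheses makes findall return only that part of each match. The sample string below is made up so the example runs without grades.txt, and it uses the standard-library re module.

```python
import re

# Hypothetical sample data in the "First Last: Grade" format
sample = "Ronald Mayr: A\nJane Doe: B\nJohn Smith: C\nMary Major: B\n"

# The parentheses form a capture group around the name only.
# When a pattern has exactly one group, findall returns the group's text
# for each match instead of the whole match.
b_names = re.findall(r"([A-Za-z]{3,20}\s[A-Za-z]{3,20}):\sB", sample)
print(b_names)  # ['Jane Doe', 'Mary Major']
```

The ": B" part still has to be present for a match to succeed, but because it sits outside the parentheses it is left out of the returned list, so no loop is needed to trim it.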

**Using regex on a dataframe:**

* When cleaning data in a dataframe, you may want to change how values are formatted so they are easier to read
* This is where regex comes into play, through the pandas replace() method


In [None]:
import pandas as pd

In [None]:
pbp = pd.read_csv("UofM 2022 pbp")

Let's say we want to change the format of the season column so that it says 2022 instead of 2022/2023:

* We can think of a regex that will allow us to accomplish this

In [None]:
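# Assign the result back to the column; inplace=True on a column selection
# triggers a warning in recent pandas versions and may not update the dataframe
pbp['season'] = pbp['season'].replace(to_replace = '/2023', value = '', regex = True)
pbp['season']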

Look at the dataframe to see if there is any formatting that you would want to change and then give it a try with regex!
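As a starting point, here is a small sketch of replace() with a more general pattern. The dataframe is a made-up stand-in with only a season column (the real play-by-play file is not needed), and the pattern /\d{4}$ is one illustrative choice that removes a slash followed by four digits at the end of each value, whatever the year.

```python
import pandas as pd

# Hypothetical stand-in for the play-by-play data; only 'season' is included
demo = pd.DataFrame({"season": ["2022/2023", "2022/2023", "2021/2022"]})

# \d{4} matches four digits and $ anchors the match to the end of the string
demo["season"] = demo["season"].replace(to_replace=r"/\d{4}$", value="", regex=True)
print(demo["season"].tolist())  # ['2022', '2022', '2021']
```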

**Here is a source that shares some other helpful functions that regex offers and a list of useful metacharacters and sequences to know:**

https://www.w3schools.com/python/python_regex.asp

**Remember, you do not need to memorize all these things to be capable of using regex. If you understand the concepts, you can use online resources very easily to remind yourself how to address a wide range of problems when data cleaning!**

**That's all for this tutorial!**