# RegEx in Pandas STR Methods
Thanks to Travis Vitello

*This tutorial was prepared in the midst of the Covid-19 pandemic and thus has a slightly different theme running through it.*

#### Introduction

In Python, it is possible to leverage the <b>pandas</b> library to manipulate strings and text using Regular Expressions.

The power of Regular Expressions is that a manipulation isn't limited to one fixed sequence of characters in the affected string or text; it applies to the wider, more complex variety of sequences that the Regular Expression defines.  This has the advantage of reducing the number of statements needed in one's code, as detailed below.

Before moving on, readers are encouraged to familiarize themselves with this section of the pandas online guide, <https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html>, whose content supplements and forms the basis of the following examples.

First, consider a string like: "Goodbye World!".

We would like to change this to read "Hello World!".  Now, very simply we can do the following in Python:

In [1]:
string = "Goodbye World!"
string = string.replace("Goodbye","Hello")
print(string)

Hello World!


In pandas, we can also do this very simply as:

In [2]:
import pandas as pd
string = pd.Series(["Goodbye World!"], dtype="str")
string = string.str.replace("Goodbye" , "Hello")
string[0]

'Hello World!'

This doesn't look particularly helpful, however.  But what if we had many strings we wanted to manipulate?

In [3]:
string_1 = "The house is blue."
string_2 = "The car is blue."
string_3 = "The sky is blue."
print(string_1)
print(string_2)
print(string_3)

The house is blue.
The car is blue.
The sky is blue.


In [4]:
s_1 = string_1.replace("blue","red")
s_2 = string_2.replace("blue","red")
s_3 = string_3.replace("blue","red")
print(s_1)
print(s_2)
print(s_3)

The house is red.
The car is red.
The sky is red.


With pandas, this can be simplified as:

In [5]:
strings = pd.Series([string_1,string_2,string_3],dtype="str")
strings = strings.str.replace("blue","red")
for i in range(0,len(strings)):
    print(strings[i])

The house is red.
The car is red.
The sky is red.


We observe that a single "replace" statement manipulated every string in the pandas series.  But what if things were more complex?  Say we wanted to replace both the color and the objects of the sentences.

We can do this as:

In [7]:
strings = pd.Series([string_1,string_2,string_3],dtype="str")
strings = strings.str.replace("blue","red").str.replace("sky","apple").\
str.replace("house","dinosaur").str.replace("car","hat")
for i in range(0,len(strings)):
    print(strings[i])

The dinosaur is red.
The hat is red.
The apple is red.


We observe from this example that pandas allows us to chain together a set of string replacements in one statement!
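
The chained replacements above can also be collapsed into a single regular expression. The following is a minimal sketch rather than the only way to do it: it assumes a hypothetical lookup dictionary named `swaps` and passes a callable to `str.replace` with `regex=True`, so that each match is swapped for its mapped value.

```python
import re
import pandas as pd

strings = pd.Series(["The house is blue.", "The car is blue.", "The sky is blue."])

# Hypothetical lookup table (not part of the original example): old word -> new word.
swaps = {"blue": "red", "sky": "apple", "house": "dinosaur", "car": "hat"}

# Build one alternation pattern, e.g. "blue|sky|house|car".
pattern = "|".join(re.escape(word) for word in swaps)

# With regex=True, str.replace accepts a callable that receives each match object.
swapped = strings.str.replace(pattern, lambda m: swaps[m.group(0)], regex=True)
print(swapped.tolist())
```

Adding another substitution then only requires a new dictionary entry instead of another chained call.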

Say we want to manipulate the following opening lines from Lewis Carroll's poem "Jabberwocky":
<br>
<i>
    
'Twas brillig, and the slithy toves<br>
      Did gyre and gimble in the wabe:<br>
    
All mimsy were the borogoves,<br>
      And the mome raths outgrabe.<br>
"Beware the Jabberwock, my son!<br>
      The jaws that bite, the claws that catch!<br>      
Beware the Jubjub bird, and shun<br>
      The frumious Bandersnatch!" </i>

Let's try this with Regular Expressions.

First, we establish our pandas series containing the above lines of the poem "Jabberwocky":

In [14]:
jabberwocky = pd.Series([
    "'Twas brillig, and the slithy toves", "Did gyre and gimble in the wabe:",
    "All mimsy were the borogoves,", "And the mome raths outgrabe.",
    '"Beware the Jabberwock, my son!', "The jaws that bite, the claws that catch!",
    "Beware the Jubjub bird, and shun", 'The frumious Bandersnatch!"',
], dtype="str")
jb = jabberwocky

In [15]:
for i in range(0,len(jb)):
    print(jb[i])

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.
"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"


#### Replace

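Now, let's try a simple regular expression that replaces all instances of the fantasy creatures listed with something less harmful, like kittens.  We can do this with the following code, which calls the series-level "replace" method with regex=True so that each "|" in the pattern is read as "or":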

In [17]:
jb_1 = jb.replace(['Jubjub bird|Bandersnatch|Jabberwock'],"kittens",regex=True)
for i in range(0,len(jb_1)):
    print(jb_1[i])

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.
"Beware the kittens, my son!
The jaws that bite, the claws that catch!
Beware the kittens, and shun
The frumious kittens!"


This still doesn't feel right.  Why should we "beware the kittens" and "shun the frumious kittens"?  Let's see if Regular Expressions can help us out.

We recall that pandas lets us chain multiple manipulations together.  Keeping this in mind, we write the following:

In [103]:
jb_2 = jb.replace(['Jubjub bird|Bandersnatch|Jabberwock'],"kittens",regex=True).\
replace(['Beware'],'Cuddle',regex=True).replace(['shun'],'post internet photos of',regex=True)
for i in range(0,len(jb_2)):
    print(jb_2[i])

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.
"Cuddle the kittens, my son!
The jaws that bite, the claws that catch!
Cuddle the kittens, and post internet photos of
The frumious kittens!"


With just a few Regular Expressions, we were able to turn Lewis Carroll's poem into something much more feline-friendly!
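
The cells in this section call the series-level `replace` method with `regex=True`. The string accessor offers the same substitution through `str.replace`; the sketch below, which uses only three lines of the poem for brevity, suggests that the two give the same result here. Note that in pandas 2.0 and later `str.replace` treats its pattern as a literal string unless `regex=True` is passed.

```python
import pandas as pd

lines = pd.Series([
    '"Beware the Jabberwock, my son!',
    "Beware the Jubjub bird, and shun",
    'The frumious Bandersnatch!"',
])
pattern = "Jubjub bird|Bandersnatch|Jabberwock"

# Series-level replace, as used in the cells above.
via_series = lines.replace(pattern, "kittens", regex=True)

# String-accessor replace; regex=True makes "|" mean "or".
via_str = lines.str.replace(pattern, "kittens", regex=True)

print(via_str.tolist())
```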

#### Extract

What else can be done with Regular Expressions in pandas?  Say we wanted to find the first word in each line of the poem "Jabberwocky".  We can do this by entering the following, whereby we capture the leading run of characters in each line that are letters (A through Z, in either case), apostrophes, or quotation marks.  Inside the square brackets every character stands for itself, so no "|" separator is needed between the alternatives:

In [24]:
jb_3 = jb.str.extract('([a-zA-Z\'"]*)')
print(jb_3)

         0
0    'Twas
1      Did
2      All
3      And
4  "Beware
5      The
6   Beware
7      The


#### Contains

Say we wanted to figure out each line that contains the word "and".  We can use Regular Expressions to help by entering the following:

In [158]:
jb.str.contains('(?:and|And)')  # non-capturing group; a capturing group triggers a pandas warning

0     True
1     True
2    False
3     True
4    False
5    False
6     True
7     True
dtype: bool

But wait: the final line in our set reads <i>The frumious Bandersnatch!"</i>.  The Regular Expression is picking up the sequence "and" inside the word "Bandersnatch".

This isn't what we want, so we adjust our Regular Expression to look for "and" or "And" followed by a whitespace character.

We enter the following, which gives us:

In [159]:
jb.str.contains(r'(?:and|And)\s')

0     True
1     True
2    False
3     True
4    False
5    False
6     True
7    False
dtype: bool

We observe that the final line went from "True" to "False", as the word "Bandersnatch" no longer matches the Regular Expression passed to the "contains" method.
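
A trailing-whitespace check would still match any word that merely ends in "and", such as "band". A more defensive sketch uses the word-boundary token `\b` together with `case=False`; the last sample line below is hypothetical and was added only to show the difference.

```python
import pandas as pd

lines = pd.Series([
    "Did gyre and gimble in the wabe:",
    "And the mome raths outgrabe.",
    'The frumious Bandersnatch!"',
    "The band played on",  # hypothetical line, not from the poem
])

# Trailing-whitespace pattern from the cell above.
trailing_space = lines.str.contains(r"(?:and|And)\s")

# \b marks a word boundary on both sides; case=False covers "and" and "And".
whole_word = lines.str.contains(r"\band\b", case=False)

print(trailing_space.tolist())
print(whole_word.tolist())
```

Only the word-boundary version rejects the hypothetical "band" line.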

#### Concatenate

Getting away from the "Jabberwocky" example, we look at some basic examples to highlight other string methods in pandas that can make use of Regular Expressions.

Assume the following series of strings:

In [164]:
CS = pd.Series(['CS5012', 'is', 'a', 'really', 'great', 'course!'], dtype="str")
print(CS)

0     CS5012
1         is
2          a
3     really
4      great
5    course!
dtype: object


We observe that we've defined a pandas series containing 6 individual strings.  What if we wanted to tie these together into a single sentence?  We can do that with the following:

In [170]:
CS.str.cat(sep=' ')

'CS5012 is a really great course!'

Because "cat" returns an ordinary Python string, we can chain Python's built-in string "replace" method onto the concatenation to give us something like the following:

In [172]:
CS.str.cat(sep=' ').replace("really ","").replace("great ","")

'CS5012 is a course!'

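We can also achieve a similar result using Regular Expressions.  Here each matching word is first replaced with a single space, the pieces are joined, and the run of five spaces left behind is reduced to one by removing spaces two at a time: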

In [218]:
CS.replace(['really|great']," ",regex=True).str.cat(sep=' ').replace('  ','')

'CS5012 is a course!'
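
The Regular Expression version above depends on exactly how many spaces are left behind after the join. An alternative sketch, assuming pandas 1.1 or later for `str.fullmatch`, drops the unwanted elements before joining so that no whitespace cleanup is needed:

```python
import pandas as pd

CS = pd.Series(["CS5012", "is", "a", "really", "great", "course!"])

# True where the whole element is exactly "really" or "great".
is_filler = CS.str.fullmatch(r"(?:really|great)")

# Keep the other elements and join them with single spaces.
sentence = CS[~is_filler].str.cat(sep=" ")
print(sentence)
```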

#### Conclusion

This Jupyter notebook has hopefully provided a brief overview of some of the ways Regular Expressions can be used to manipulate strings or text in Python using pandas.  Methods like "Replace", "Extract", "Contains", and "Concatenate" are just some of the many pandas string methods.  For a complete list with detailed examples, see <https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html>.