# Introduction to Regex

## GOAL

Introduce the `re` (regular expression) library, show its main functions, and apply them to Netflix data.

## DESCRIPTION

This workshop reviews the following `re` functions:

* `findall()`
* `search()`
* `split()`
* `sub()`

It also covers these members of the Match object that `search()` returns:

* `span()`
* `string` (an attribute, not a method)
* `group()`

Metacharacters: ` . ^ $ * + ? { } [ ] \ | ( )`

Special Sequences: `\A` `\b` `\d` `\s`

It also shows how to compile regular expressions so that they can be reused.

More information is available in the [W3Schools Python RegEx tutorial](https://www.w3schools.com/python/python_regex.asp).

In [1]:
import re
import pandas as pd

netflix = pd.read_csv('C:/Users/david/WBSCodingSchool/SecondWeek/NetflixWorkshop/netflix_csv/netflix-titles.csv')
netflix.head()

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [2]:
# extract a specific description
dark_descr = netflix.query('title == "Dark"')['description'].values[0]

dark_descr

'A missing child sets four families on a frantic hunt for answers as they unearth a mind-bending mystery that spans three generations.'

### `findall`

Returns a list containing all matches

In [3]:
# return all occurrences of "a" in the string
re.findall('a', dark_descr)

['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']

### `search`

Returns a Match object if there is a match anywhere in the string. If there is more than one match, only the first occurrence of the match will be returned.

Match objects have the following members:
- `.span()` returns a tuple containing the start and end positions of the match.
- `.string` returns the string passed into the function.
- `.group()` returns the part of the string where there was a match.


In [4]:
dark_descr

'A missing child sets four families on a frantic hunt for answers as they unearth a mind-bending mystery that spans three generations.'

In [5]:
match_obj = re.search('mystery', dark_descr)

In [6]:
match_obj.string

'A missing child sets four families on a frantic hunt for answers as they unearth a mind-bending mystery that spans three generations.'

In [7]:
match_obj.group()

'mystery'

In [8]:
match_obj.span()

(96, 103)
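
When nothing matches, `re.search` returns `None` instead of a Match object, so calling `.group()` or `.span()` directly would raise an error. The sketch below guards against that case; the helper name `describe_match` is hypothetical and only used for illustration.

```python
import re

description = (
    "A missing child sets four families on a frantic hunt for answers "
    "as they unearth a mind-bending mystery that spans three generations."
)


def describe_match(pattern, text):
    """Return (matched_text, span) for the first match, or None if there is none."""
    match_obj = re.search(pattern, text)
    if match_obj is None:  # re.search returns None when nothing matches
        return None
    return match_obj.group(), match_obj.span()


found = describe_match("mystery", description)   # first (and only) occurrence
missing = describe_match("comedy", description)  # word does not appear
print(found)
print(missing)
```

Checking for `None` before using the Match object is the usual pattern whenever the input text is not guaranteed to contain the pattern.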

### `split`
Returns a list where the string has been split at each match

In [9]:
re.split(' a ', dark_descr) # the ' a ' separators are removed from the list

['A missing child sets four families on',
 'frantic hunt for answers as they unearth',
 'mind-bending mystery that spans three generations.']
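
The benefit of `re.split` over the plain string method shows up when the separator is a pattern rather than a fixed string. A minimal sketch, using the original description text and the optional `maxsplit` argument:

```python
import re

description = (
    "A missing child sets four families on a frantic hunt for answers "
    "as they unearth a mind-bending mystery that spans three generations."
)

# Split on any run of whitespace characters (spaces, tabs, newlines).
words = re.split(r"\s+", description)

# maxsplit limits the number of splits; the remainder stays in one piece.
head, rest = re.split(r"\s+", description, maxsplit=1)

print(words[:3])
print(head)
print(rest[:13])
```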

### `sub`

Replaces one or many matches with a string

In [10]:
dark_descr = re.sub("child", "spoon", dark_descr)
print(dark_descr)

A missing spoon sets four families on a frantic hunt for answers as they unearth a mind-bending mystery that spans three generations.
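
By default `re.sub` replaces every match. As a small sketch of the "one or many" behaviour, the `count` argument limits the number of replacements, and `re.subn` also reports how many were made. The shortened `text` below is only an excerpt chosen for illustration.

```python
import re

text = "A missing child sets four families on a frantic hunt for answers"

# Replace every lowercase "a".
all_replaced = re.sub("a", "_", text)

# Replace only the first lowercase "a" by passing count=1.
first_only = re.sub("a", "_", text, count=1)

# re.subn returns the new string together with the number of replacements.
new_text, n_replaced = re.subn("a", "_", text)

print(all_replaced)
print(first_only)
print(n_replaced)
```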


In [11]:
## CHALLENGE!

### METACHARACTERS


Some characters are special metacharacters, and don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them or changing their meaning.

` . ^ $ * + ? { } [ ] \ | ( )`

#### `[]` matches a set of characters:

 - `[abc]` will match any of the characters a, b, or c
 - `[a-c]` will do the same, written as a range
 - `[a-z]` will match any lowercase letter

In [12]:
alphanumeric = "4298fsfsv012rvv21v9"

In [13]:
re.findall(r"[a-z]", alphanumeric)

['f', 's', 'f', 's', 'v', 'r', 'v', 'v', 'v']
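
Sets also accept other ranges, and a `^` placed right after the opening bracket negates the set. The following sketch reuses the same `alphanumeric` string to show both:

```python
import re

alphanumeric = "4298fsfsv012rvv21v9"

# A range of digits instead of letters.
digits = re.findall(r"[0-9]", alphanumeric)

# ^ as the first character inside the brackets negates the set:
# match anything that is NOT a lowercase letter.
non_letters = re.findall(r"[^a-z]", alphanumeric)

print(digits)
print(non_letters)
```

Because the string contains only lowercase letters and digits, both calls return the same list here.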

`\` can help us escape special characters

In [14]:
alphanumeric_with_special = alphanumeric + "[a-z]"
print(alphanumeric_with_special)
# CHALLENGE: use \ to escape the square brackets
re.findall(r"\[a-z\]", alphanumeric_with_special)

4298fsfsv012rvv21v9[a-z]


['[a-z]']

#### Some special sequences:

- `\A` - Matches only at the beginning of the string
- `\b` - Matches the position at the beginning or at the end of a word (a word boundary)
- `\d` - Matches any digit (such as 0-9); `\D` matches any character that is NOT a digit
- `\s` - Matches any whitespace character; `\S` matches any character that is NOT whitespace

In [15]:
dark_descr

'A missing spoon sets four families on a frantic hunt for answers as they unearth a mind-bending mystery that spans three generations.'

In [16]:
# CHALLENGE: use a special sequence to find every "sp" that starts a word in the "dark_descr" string
re.findall(r'\b(sp)', dark_descr)

['sp', 'sp']

In [17]:
dark_descr_split = dark_descr.split(' a ')
dark_descr_split

['A missing spoon sets four families on',
 'frantic hunt for answers as they unearth',
 'mind-bending mystery that spans three generations.']

In [18]:
# from the middle fragment, return "f" only if the fragment starts with it
re.findall(r"\A(f)", dark_descr_split[1])

['f']

In [19]:
# replace the string "four" with "4"
dark_descr_split[0] = dark_descr_split[0].replace('four', '4')
# find all possible numbers
re.findall(r"\d", dark_descr_split[0])

['4']

In [20]:
# find the word "on": whitespace, "o", one non-whitespace character, whitespace
print(dark_descr)
re.findall(r"\so\S\s",dark_descr)

A missing spoon sets four families on a frantic hunt for answers as they unearth a mind-bending mystery that spans three generations.


[' on ']
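
The pattern above returns the surrounding spaces as part of the match. As an alternative sketch, `\b` on both sides matches the standalone word without them, and the negated sequences `\D` and `\S` can be checked on a short string. The sample string `"93 min"` comes from the `duration` column shown earlier.

```python
import re

description = (
    "A missing child sets four families on a frantic hunt for answers "
    "as they unearth a mind-bending mystery that spans three generations."
)

# Without boundaries, "on" is also found inside "generations".
loose = re.findall(r"on", description)

# \b on both sides keeps only the standalone word, without the spaces.
strict = re.findall(r"\bon\b", description)

duration = "93 min"
digits = re.findall(r"\d", duration)      # digit characters
non_digits = re.findall(r"\D", duration)  # everything that is not a digit
non_spaces = re.findall(r"\S", duration)  # everything that is not whitespace

print(loose, strict)
print(digits, non_digits, non_spaces)
```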

### `.` Any character (except newline character)

In [21]:
print(dark_descr)
re.findall(r".o", dark_descr)

A missing spoon sets four families on a frantic hunt for answers as they unearth a mind-bending mystery that spans three generations.


['po', 'fo', ' o', 'fo', 'io']

### `+` One or more occurrences

In [22]:
# use re.findall() and re.sub() together with + to locate runs of one or more "o"
print(dark_descr)
# show where it is happening
print(re.findall("o+", dark_descr))
# another way to visualize it
print(re.sub("o+", "__", dark_descr))

A missing spoon sets four families on a frantic hunt for answers as they unearth a mind-bending mystery that spans three generations.
['oo', 'o', 'o', 'o', 'o']
A missing sp__n sets f__ur families __n a frantic hunt f__r answers as they unearth a mind-bending mystery that spans three generati__ns.


### `{}`- Exactly the specified number of occurrences

In [23]:
# use re.findall() and re.sub() together with {2} to locate exactly two consecutive "o"
print(dark_descr)
# show where it is happening
print(re.findall("o{2}", dark_descr))
# another way to visualize it
print(re.sub("o{2}", "__", dark_descr))

A missing spoon sets four families on a frantic hunt for answers as they unearth a mind-bending mystery that spans three generations.
['oo']
A missing sp__n sets four families on a frantic hunt for answers as they unearth a mind-bending mystery that spans three generations.
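
Braces also accept a range written as `{m,n}`, meaning between m and n occurrences. A short sketch on an excerpt of the description (the excerpt is chosen only for illustration):

```python
import re

text = "A missing spoon sets four families on a frantic hunt"

# {1,2}: one or two consecutive "o" characters.
one_or_two_o = re.findall(r"o{1,2}", text)

# {2}: exactly two consecutive "s" characters.
double_s = re.findall(r"s{2}", text)

# Combined with \b and \w: whole words that are exactly four characters long.
four_letter_words = re.findall(r"\b\w{4}\b", text)

print(one_or_two_o)
print(double_s)
print(four_letter_words)
```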


### `^` Starts with

In [24]:
print(dark_descr_split)
re.findall(r"^f", dark_descr_split[1])

['A missing spoon sets 4 families on', 'frantic hunt for answers as they unearth', 'mind-bending mystery that spans three generations.']


['f']

#### How to apply it on the whole dataframe?

In [25]:
# Look for all the titles in the Netflix dataset that start with "F"
netflix.loc[netflix['title'].str[0] == 'F',:].head(3)

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
16,TV Show,Feb-09,,"Shahd El Yaseen, Shaila Sabt, Hala, Hanadi Al-...",,"March 20, 2019",2018,TV-14,1 Season,"International TV Shows, TV Dramas","As a psychology professor faces Alzheimer's, h..."
2083,TV Show,F is for Family,,"Bill Burr, Laura Dern, Justin Long, Debi Derry...","United States, France, Canada","June 12, 2020",2020,TV-MA,4 Seasons,TV Comedies,"Follow the Murphy family back to the 1970s, wh..."
2084,Movie,F the Prom,Benny Fine,"Cameron Palatas, Richard Karn, Cheri Oteri, Da...",United States,"March 5, 2018",2017,TV-MA,92 min,"Comedies, Romantic Movies",Maddy and Cole were inseparable before high sc...


In [26]:
# Another way to do it, this time with a regular expression
netflix.loc[netflix['title'].str.count(r'^F') == 1].head(3)

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
16,TV Show,Feb-09,,"Shahd El Yaseen, Shaila Sabt, Hala, Hanadi Al-...",,"March 20, 2019",2018,TV-14,1 Season,"International TV Shows, TV Dramas","As a psychology professor faces Alzheimer's, h..."
2083,TV Show,F is for Family,,"Bill Burr, Laura Dern, Justin Long, Debi Derry...","United States, France, Canada","June 12, 2020",2020,TV-MA,4 Seasons,TV Comedies,"Follow the Murphy family back to the 1970s, wh..."
2084,Movie,F the Prom,Benny Fine,"Cameron Palatas, Richard Karn, Cheri Oteri, Da...",United States,"March 5, 2018",2017,TV-MA,92 min,"Comedies, Romantic Movies",Maddy and Cole were inseparable before high sc...


In [27]:
# Learn more about using regex with pandas:
# https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/
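
Besides `str.count`, pandas offers `str.contains` and `str.match`, which return a boolean mask directly. The sketch below uses a small toy Series in place of `netflix['title']`; the `None` entry is a hypothetical missing value added to show the `na` argument.

```python
import pandas as pd

# Toy stand-in for netflix['title']; None is a hypothetical missing value.
titles = pd.Series(["F is for Family", "F the Prom", "Dark", "21", None])

# str.contains searches each value for the pattern;
# na=False treats missing values as "no match".
starts_with_f = titles.str.contains(r"^F", na=False)

# str.match only matches at the start of each value, so ^ is not needed.
also_starts_with_f = titles.str.match(r"F", na=False)

f_titles = titles[starts_with_f].tolist()
print(f_titles)
```

A boolean mask avoids the `== 1` comparison and reads closer to the intent of the filter.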

### `*` Zero or more occurrences

In [28]:
similar_words = ["hey", "hay", "how", "h i j k", "h", "ha", "oops"]

In [29]:
# use ".*" to return everything from the first "h" onwards in each word
for word in similar_words:
    print(re.findall("h.*", word))

['hey']
['hay']
['how']
['h i j k']
['h']
['ha']
[]


In [30]:
print(dark_descr)
re.findall(r"spo*\S*", dark_descr)

A missing spoon sets four families on a frantic hunt for answers as they unearth a mind-bending mystery that spans three generations.


['spoon', 'spans']

In [31]:
# Another way to get the same result
re.findall(r"spo*\w+", dark_descr)
# \w: matches any word character
#    (letters a-z and A-Z, digits 0-9, and the underscore _ character)
# +: One or more occurrences

['spoon', 'spans']

### Examples into dataframes

In [32]:
# I would like to filter all the titles that start with Joan or John
netflix.loc[netflix['title'].str.count(r'^(Joan|John)') == 1].head(3)

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
3192,Movie,Joan Didion: The Center Will Not Hold,Griffin Dunne,Joan Didion,United States,"October 27, 2017",2017,TV-14,98 min,Documentaries,Literary icon Joan Didion reflects on her rema...
3193,Movie,Joan Rivers: Don't Start with Me,Scott L. Montoya,Joan Rivers,United States,"July 3, 2018",2012,TV-MA,69 min,Stand-Up Comedy,"At 78, Joan Rivers has no interest in slowing ..."
3200,Movie,John & Jane,Ashim Ahluwalia,,India,"August 15, 2016",2005,TV-14,79 min,"Documentaries, International Movies",Truth and fiction blend in this quasi-document...


In [33]:
# CHALLENGE: how can you shorten the previous regular expression?
netflix.loc[netflix['title'].str.count(r'^Jo(a|h)n') == 1].head(3)

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
3192,Movie,Joan Didion: The Center Will Not Hold,Griffin Dunne,Joan Didion,United States,"October 27, 2017",2017,TV-14,98 min,Documentaries,Literary icon Joan Didion reflects on her rema...
3193,Movie,Joan Rivers: Don't Start with Me,Scott L. Montoya,Joan Rivers,United States,"July 3, 2018",2012,TV-MA,69 min,Stand-Up Comedy,"At 78, Joan Rivers has no interest in slowing ..."
3200,Movie,John & Jane,Ashim Ahluwalia,,India,"August 15, 2016",2005,TV-14,79 min,"Documentaries, International Movies",Truth and fiction blend in this quasi-document...
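
When the alternatives are single characters, a set such as `[ah]` is another way to write `(a|h)`. A minimal sketch with plain `re`, using two titles from the output above plus two hypothetical ones for contrast:

```python
import re

titles = [
    "Joan Didion: The Center Will Not Hold",
    "John & Jane",
    "Jon Example",  # hypothetical: no "a" or "h" between "Jo" and "n"
    "Dear John",    # hypothetical: "John" is not at the start
]

# [ah] matches exactly one character, either "a" or "h".
pattern = re.compile(r"^Jo[ah]n")

matching = [title for title in titles if pattern.search(title)]
print(matching)
```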


In [34]:
# Okay, now I would like to filter all the titles with "spoon" in their description
netflix.loc[netflix['description'].str.count(r'spoon') >= 1]

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
5567,TV Show,Shine On with Reese,,Reese Witherspoon,United States,"October 29, 2019",2018,TV-MA,1 Season,"Docuseries, Stand-Up Comedy & Talk Shows","In a talk show straight from the heart, actor ..."


In [35]:
netflix.loc[netflix['description'].str.count(r'spoon') >= 1 ]['description'].values[0]
# spoons are not a popular topic

'In a talk show straight from the heart, actor and producer Reese Witherspoon visits with groundbreaking women to discuss their inspiring journeys.'

In [36]:
(
netflix.loc[
        # note that I am using \b because I want to exclude all words like "football"
        netflix['description'].str.count(r'\bball\s') >= 1
    ]
    .head(3)['description']
    .to_list()
)

['Dynamic comic DeRay Davis hits the stage like a ball of fire, nailing the finer points of living, dating and handling show business as a black man.',
 'In 1987 New York, LGBTQ ball fixture Blanca starts her own house, soon becoming mother to a gifted dancer and a sex worker in love with a yuppie client.']

### Compile regular expressions

In [37]:
pattern = re.compile(r"\bball\s")

In [38]:
(
netflix.loc[
        netflix['description'].str.count(pattern) >= 1
    ]
    .head(3)['description']
    .to_list()
)

['Dynamic comic DeRay Davis hits the stage like a ball of fire, nailing the finer points of living, dating and handling show business as a black man.',
 'In 1987 New York, LGBTQ ball fixture Blanca starts her own house, soon becoming mother to a gifted dancer and a sex worker in love with a yuppie client.']
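
A compiled pattern also exposes the functions seen earlier as methods, so the same object can be reused without repeating the pattern string. In the sketch below the first description is shortened from the output above and the second one is hypothetical.

```python
import re

pattern = re.compile(r"\bball\s")

descriptions = [
    "Dynamic comic DeRay Davis hits the stage like a ball of fire",
    "A football coach rebuilds a struggling team",  # hypothetical example
]

# The compiled object offers search, findall, sub, and split as methods.
matching = [text for text in descriptions if pattern.search(text)]
counts = [len(pattern.findall(text)) for text in descriptions]
replaced = pattern.sub("BALL ", descriptions[0])

print(matching)
print(counts)
print(replaced)
```

The word "football" is skipped because `\b` requires a word boundary right before "ball".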

### Other examples

In [39]:
print(dark_descr)

A missing spoon sets four families on a frantic hunt for answers as they unearth a mind-bending mystery that spans three generations.


In [40]:
# Extract the whole words that start with "m"
re.findall(r"\bm\w+", dark_descr)

['missing', 'mind', 'mystery']

In [41]:
# Replace all the words not starting with "m" with ______
re.sub(r"\b(?!m)\w+", "______", dark_descr)
# More information in https://docs.python.org/3/library/re.html

'______ missing ______ ______ ______ ______ ______ ______ ______ ______ ______ ______ ______ ______ ______ ______ mind-______ mystery ______ ______ ______ ______.'

In [42]:
# CHALLENGE: how can I avoid turning mind-bending into mind-______?
# This editor can help you with this task!
# https://www.debuggex.com/r/E7N3Sscav7q9eFw1
re.sub(r"(\b(m\w*)|(m\w*[-']\w*)\b)", "______", dark_descr)

'A ______ spoon sets four families on a frantic hunt for answers as they unearth a ______-bending ______ that spans three generations.'
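
One possible approach to the challenge, offered as a sketch rather than the intended solution, is to treat a hyphenated or apostrophized compound as a single word. The lookbehind `(?<![\w'-])` is an assumption of this sketch: it only lets a match start where the previous character is not a word character, hyphen, or apostrophe.

```python
import re

dark_descr = (
    "A missing spoon sets four families on a frantic hunt for answers "
    "as they unearth a mind-bending mystery that spans three generations."
)

# (?<![\w'-])    : the match may not start in the middle of a (compound) word
# (?!m)          : the word may not start with "m"
# \w+(?:[-']\w+)*: a word, optionally continued by -part or 'part
pattern = r"(?<![\w'-])(?!m)\w+(?:[-']\w+)*"

masked = re.sub(pattern, "______", dark_descr)
print(masked)
```

With this pattern "mind-bending" is left untouched because the compound starts with "m", while any compound that does not start with "m" would be replaced by a single blank.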