# Solutions

1. [Intro to Regular Expressions](#1.-Intro-to-Regular-Expressions)
1. [Quantifiers](#2.-Quantifiers)
1. [Or Conditions](#3.-Or-Conditions)
1. [Character Sets and Grouping](#4.-Character-Sets-and-Grouping)


In [1]:
import pandas as pd
import numpy as np

# 1. Intro to Regular Expressions

In [2]:
movie = pd.read_csv('../data/movie.csv')
title = movie['title']

### Problem 1
<span  style="color:green; font-size:16px">Find all movies that have 2 consecutive z's in them.</span>

In [3]:
filt = title.str.contains('zz')
title[filt]

416                All That Jazz
907         The Dukes of Hazzard
1041                   Bedazzled
2234                   Paparazzi
2524                    Hot Fuzz
2593    The Lizzie McGuire Movie
3215       Into the Grizzly Maze
3535                Mystic Pizza
4399              Blue Like Jazz
Name: title, dtype: object

### Problem 2
<span  style="color:green; font-size:16px">Find all movies that begin with 9.</span>

In [4]:
filt = title.str.contains('^9')
title[filt]

1651                       9
2416                9½ Weeks
3705    90 Minutes in Heaven
Name: title, dtype: object

### Problem 3
<span  style="color:green; font-size:16px">Find all movies that have a `b` as their third character.</span>

In [5]:
filt = title.str.contains('^..b')
title[filt].head()

22                Robin Hood
228                  RoboCop
286           Public Enemies
448                   Robots
494    Babe: Pig in the City
Name: title, dtype: object

### Problem 4
<span  style="color:green; font-size:16px">Find all movies with a fourth-to-last character of `M` and a last character of `e`.</span>

In [6]:
filt = title.str.contains('M..e$')
title[filt].head()

704     The Green Mile
1167            8 Mile
1616         Like Mike
2122    Moonlight Mile
2486      How She Move
Name: title, dtype: object

### Problem 5
<span  style="color:green; font-size:16px">Could you use a regular expression to find a movie that was exactly 6 characters in length?</span>

In [7]:
filt = title.str.contains('^......$')
title[filt].head(10)

0      Avatar
41     Cars 2
58     WALL·E
125    Frozen
168    Sahara
292    Eraser
298    Eragon
368    Pixels
426    Jumper
428    Zodiac
Name: title, dtype: object

### Problem 6
<span  style="color:green; font-size:16px">What is a more natural way to complete problem 5 without a regex?</span>

In [8]:
filt = title.str.len() == 6
title[filt].head(10)

0      Avatar
41     Cars 2
58     WALL·E
125    Frozen
168    Sahara
292    Eraser
298    Eragon
368    Pixels
426    Jumper
428    Zodiac
Name: title, dtype: object

# 2. Quantifiers

Read in the movie dataset first.

In [9]:
movie = pd.read_csv('../data/movie.csv')
title = movie['title']

### Problem 1
<span  style="color:green; font-size:16px">Find all movies that have a 'z' as their 15th character.</span>

In [10]:
pattern = '^.{14}z'
filt = title.str.contains(pattern)
title[filt]

2484      American Dreamz
2625    Ramona and Beezus
Name: title, dtype: object

### Problem 2
<span  style="color:green; font-size:16px">Find all movies that have the word 'Friend' or 'Friends' in them.</span>

In [11]:
pattern = 'Friends?'
filt = title.str.contains(pattern)
title[filt]

1055                     My Best Friend's Wedding
1413                        Friends with Benefits
1775        How to Lose Friends & Alienate People
2216                        My Best Friend's Girl
3116    Seeking a Friend for the End of the World
3495                           Friends with Money
4184                          We Are Your Friends
4279                        Dysfunctional Friends
4670                               Mutual Friends
Name: title, dtype: object

### Problem 3
<span  style="color:green; font-size:16px">Find all movies that have between 40 and 43 characters in them. Can you verify the results with another `str` accessor method?</span>

In [12]:
pattern = '^.{40,43}$'
filt = title.str.contains(pattern)
m40_43 = title[filt]
m40_43.head()

1        Pirates of the Caribbean: At World's End
4      Star Wars: Episode VII - The Force Awakens
13     Pirates of the Caribbean: Dead Man's Chest
16       The Chronicles of Narnia: Prince Caspian
18    Pirates of the Caribbean: On Stranger Tides
Name: title, dtype: object

In [13]:
m40_43.str.len().head()

1     40
4     42
13    42
16    40
18    43
Name: title, dtype: int64

In [14]:
m40_43.str.len().value_counts()

40    15
41    12
43     9
42     9
Name: title, dtype: int64
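
The same check can be made directly with a boolean comparison instead of inspecting lengths by eye. The following is a minimal sketch on a small hypothetical Series (the strings are made up and are not rows from `movie.csv`), comparing the regex filter with `str.len` combined with `between`, which is inclusive on both ends by default.

```python
import pandas as pd

# Hypothetical strings of length 39, 40, 43 and 44 (not from movie.csv)
sample = pd.Series(['a' * 39, 'b' * 40, 'c' * 43, 'd' * 44])

# Regex approach: exactly 40 to 43 characters between start and end
regex_filt = sample.str.contains(r'^.{40,43}$')

# Length approach: between() includes both endpoints by default
len_filt = sample.str.len().between(40, 43)

# True when both filters select exactly the same rows
same = (regex_filt == len_filt).all()
print(same)
```

Running the equivalent comparison on `title` should confirm that the two approaches agree on the full dataset.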

### Problem 4
<span  style="color:green; font-size:16px">Find all movies that begin with 'The' and end in 'Movie'</span>

In [15]:
pattern = 'The.*Movie'
filt = title.str.contains(pattern)
title[filt]

319                                     The Peanuts Movie
561                                 The Angry Birds Movie
569                                    The Simpsons Movie
580              The SpongeBob Movie: Sponge Out of Water
759                                        The Lego Movie
1586                      The SpongeBob SquarePants Movie
1592                            Hannah Montana: The Movie
1593                          Rugrats in Paris: The Movie
1734                                    The Rugrats Movie
1895                           The Wild Thornberrys Movie
2162                                     The Tigger Movie
2593                             The Lizzie McGuire Movie
2645    The Pirates Who Don't Do Anything: A VeggieTal...
3068                             Twilight Zone: The Movie
3099                                Hey Arnold! The Movie
3253                           Glee: The 3D Concert Movie
3296                                     The Muppet Movie
3630          
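
The pattern `'The.*Movie'` matches 'The' followed anywhere later by 'Movie', which is why titles such as 'Hannah Montana: The Movie' appear in the output. Under a stricter reading of the problem, the pattern would be anchored with `^` and `$`. A minimal sketch, assuming that reading is the intended one, on a small sample of titles copied from the output above:

```python
import pandas as pd

# Small sample; these three titles appear in the output above
sample = pd.Series([
    'The Lego Movie',
    'Hannah Montana: The Movie',
    'The SpongeBob Movie: Sponge Out of Water',
])

# Unanchored: 'The' anywhere, then 'Movie' anywhere later
unanchored = sample.str.contains(r'The.*Movie')

# Anchored: must start with 'The' and end with 'Movie'
anchored = sample.str.contains(r'^The.*Movie$')

print(sample[unanchored].tolist())
print(sample[anchored].tolist())
```

Only 'The Lego Movie' satisfies the anchored pattern, while all three satisfy the unanchored one.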

### Problem 5
<span  style="color:green; font-size:16px">Create your own Series and make a regular expression that uses the `+` metacharacter. Is this character necessary?</span>
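
One possible answer, sketched on a hypothetical Series made up for this exercise: `+` means "one or more", which can also be written as `{1,}` or as the character followed by `*`. In that sense `+` is a convenience rather than a necessity.

```python
import pandas as pd

# Hypothetical Series created for this exercise
s = pd.Series(['gol', 'goal', 'gooal', 'gl', 'goooool'])

# One or more 'o' between 'g' and 'l'
plus = s.str.contains(r'^go+l$')

# Same meaning with an explicit quantifier
braces = s.str.contains(r'^go{1,}l$')

# Same meaning again: one 'o' followed by zero or more 'o'
star = s.str.contains(r'^goo*l$')

print(s[plus].tolist())
print((plus == braces).all() and (plus == star).all())
```

All three patterns select the same rows, 'gol' and 'goooool'.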

# 3. Or Conditions

In [16]:
import pandas as pd
title = pd.read_csv('../data/movie.csv')['title']
title.head()

0                                        Avatar
1      Pirates of the Caribbean: At World's End
2                                       Spectre
3                         The Dark Knight Rises
4    Star Wars: Episode VII - The Force Awakens
Name: title, dtype: object

### Problem 1
<span  style="color:green; font-size:16px">Find all movies that begin with 'The' followed by a word that begins with a digit.</span>

In [17]:
pattern = '^The [0-9]'
filt = title.str.contains(pattern)
title[filt]

212                                      The 13th Warrior
429                                           The 6th Day
1354                                         The 5th Wave
1817                               The 40-Year-Old Virgin
1958                                               The 33
3567                                      The 5th Quarter
4373    The 41-Year-Old Virgin Who Knocked Up Sarah Ma...
Name: title, dtype: object

### Problem 2
<span  style="color:green; font-size:16px">Find all movies that have three consecutive capital letters in them.</span>

In [18]:
pattern = '[A-Z]{3}'
filt = title.str.contains(pattern)
title[filt].head()

4      Star Wars: Episode VII - The Force Awakens
40                                   TRON: Legacy
58                                         WALL·E
140                       Mission: Impossible III
177                                       The BFG
Name: title, dtype: object

### Problem 3
<span  style="color:green; font-size:16px">Find all movies that begin and end with a capital letter.</span>

In [19]:
pattern = '^[A-Z].*[A-Z]$'
filt = title.str.contains(pattern)
title[filt].head()

46                 World War Z
58                      WALL·E
140    Mission: Impossible III
151            Men in Black II
177                    The BFG
Name: title, dtype: object

### Problem 4
<span  style="color:green; font-size:16px">Find all the movies that have a digit followed by a comma followed by a digit.</span>

In [20]:
pattern = r'[0-9],[0-9]'
filt = title.str.contains(pattern)
title[filt].head()

276                                10,000 B.C.
3266    Ultramarines: A Warhammer 40,000 Movie
3641              20,000 Leagues Under the Sea
4775             The Beast from 20,000 Fathoms
Name: title, dtype: object

### Problem 5
<span  style="color:green; font-size:16px">Find all the movies that have either an ampersand or a question mark in them.</span>

In [21]:
pattern = '[&?]'
filt = title.str.contains(pattern)
title[filt].head()

129          Angels & Demons
145    Mr. Peabody & Sherman
214           Batman & Robin
252         Mr. & Mrs. Smith
278           Town & Country
Name: title, dtype: object

### Problem 6
<span  style="color:green; font-size:16px">Which movie has the most ampersands, question marks, and periods in it?</span>

In [22]:
pattern = '[&.?]'
count = title.str.count(pattern)
count.head()

0    0
1    0
2    0
3    0
4    0
Name: title, dtype: int64

In [23]:
filt = count == count.max()
title[filt]

542    The Man from U.N.C.L.E.
Name: title, dtype: object

In [24]:
count.max()

5
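
Inside square brackets, `.` and `?` lose their special meaning and match the literal characters, which is why `'[&.?]'` needs no escaping. A minimal sketch on a hypothetical three-title Series, comparing the character set with an escaped alternation and with a bare `.`:

```python
import pandas as pd

# Hypothetical sample; not the full movie dataset
sample = pd.Series(['Mr. & Mrs. Smith', 'Avatar', 'Who? What? Why?'])

# Character set: '.', '?' and '&' are all literal inside the brackets
in_set = sample.str.count(r'[&.?]')

# Equivalent alternation: '.' and '?' must be escaped outside a set
escaped = sample.str.count(r'&|\.|\?')

# A bare '.' matches any character, so this counts every character
any_char = sample.str.count(r'.')

print(in_set.tolist())
print(any_char.tolist())
```

The first two counts agree, while the bare `.` simply returns each title's length.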

# 4. Character Sets and Grouping

### Problem 1
<span  style="color:green; font-size:16px">For all movies that begin with 'The' followed by a word that begins with a digit, extract just the digits of that word.</span>

In [25]:
pattern = r'^The (\d+)'
title.str.extract(pattern).dropna()

       0
212   13
429    6
1354   5
1817  40
1958  33
3567   5
4373  41


### Problem 2
<span  style="color:green; font-size:16px">Find all movies that have two separate numbers in them. An example would be '7 days and 7 nights'.</span>

In [26]:
pattern = r'\d+\D+\d+'
filt = title.str.contains(pattern)
title[filt]

276                                10,000 B.C.
289                 The Taking of Pelham 1 2 3
509                           2 Fast 2 Furious
1043                              3:10 to Yuma
1610                            13 Going on 30
1617        Naked Gun 33 1/3: The Final Insult
2466                     40 Days and 40 Nights
2646                                     U2 3D
3266    Ultramarines: A Warhammer 40,000 Movie
3308                                     50/50
3516                           Fahrenheit 9/11
3576                                     11:14
3641              20,000 Leagues Under the Sea
3934                                      2:13
4210                   24 7: Twenty Four Seven
4376                    Friday the 13th Part 2
4532              4 Months, 3 Weeks and 2 Days
4775             The Beast from 20,000 Fathoms
Name: title, dtype: object

### Problem 3
<span  style="color:green; font-size:16px">Find all the movies that have 6 or more non-vowel and non-space characters in a row.</span>

In [27]:
pattern = r'[^aeiouAEIOU ]{6,}'
filt = title.str.contains(pattern)
title[filt]

276                                10,000 B.C.
542                    The Man from U.N.C.L.E.
1935                          Punch-Drunk Love
2392                                  Catch-22
2480                         Brooklyn's Finest
2507                   When Harry Met Sally...
2912        Tales from the Crypt: Demon Knight
3266    Ultramarines: A Warhammer 40,000 Movie
3641              20,000 Leagues Under the Sea
4775             The Beast from 20,000 Fathoms
Name: title, dtype: object

### Problem 4
<span  style="color:green; font-size:16px">Extract the very next character after 't' or 'T' for each movie.</span>

In [28]:
pattern = r'[Tt](.)'
title.str.extract(pattern).head()

   0
0  a
1  e
2  r
3  h
4  a


### Problem 5
<span  style="color:green; font-size:16px">What is the most common character after 't' or 'T'?</span>

In [29]:
pattern = r'[Tt](.)'
letters = title.str.extract(pattern)
letters.head()

   0
0  a
1  e
2  r
3  h
4  a


This is a DataFrame. The column name is the integer 0. Let's select it as a Series.

In [30]:
letter_series = letters[0]
letter_series.head()

0    a
1    e
2    r
3    h
4    a
Name: 0, dtype: object

In [31]:
letter_series.value_counts().head()

h    1431
      311
e     266
o     183
i     169
Name: 0, dtype: int64

Minor detail here: there is an `expand` parameter that you can set to `False` to return a Series.

In [32]:
pattern = r'[Tt](.)'
letters = title.str.extract(pattern, expand=False)
letters.head()

0    a
1    e
2    r
3    h
4    a
Name: title, dtype: object

In [33]:
letters.value_counts().head()

h    1431
      311
e     266
o     183
i     169
Name: title, dtype: int64

The above only extracts the character after the first appearance of 't' or 'T' in each title. Use the `extractall` string method to get the character after every 't' or 'T'.

In [34]:
pattern = r'[Tt](.)'
letters = title.str.extractall(pattern)
letters.head()

         0
  match   
0 0      a
1 0      e
  1      h
  2       
2 0      r


In [35]:
letters[0].value_counts().head()

h    1942
      620
e     480
o     354
i     343
Name: 0, dtype: int64

### Problem 6
<span style="color:green; font-size:16px">Extract all the words that begin with 'T' or 't' and end in 'e', then find their frequency. Research the word boundary special character.</span>

In [36]:
pattern = r'\b([tT]\w*e)\b'
letters = title.str.extractall(pattern)
letters[0].str.lower().value_counts()

the              1555
time               26
tale               12
true               10
three               8
teenage             7
there               6
take                6
trouble             4
trade               4
treasure            2
terrace             2
twice               2
torque              1
torture             1
tombstone           1
thunderdome         1
tide                1
turtle              1
triple              1
trapeze             1
tae                 1
tree                1
tadpole             1
terrible            1
tease               1
trance              1
throttle            1
thr3e               1
transcendence       1
twelve              1
timeline            1
turbulence          1
tootsie             1
triangle            1
tape                1
temple              1
Name: 0, dtype: int64
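
The effect of the word boundary `\b` is easier to see on a small example. The sketch below uses a hypothetical two-title Series (made up for illustration) and runs the same pattern with and without `\b`. Without the boundaries, the pattern also picks up fragments from inside words.

```python
import pandas as pd

# Hypothetical titles chosen to contain 't...e' fragments inside words
sample = pd.Series(['The Outer Limits', 'Tale of the Tiger'])

# With boundaries: only whole words starting with t/T and ending in e
with_b = sample.str.extractall(r'\b([tT]\w*e)\b')[0].tolist()

# Without boundaries: partial matches such as 'te' (from 'Outer')
# and 'Tige' (from 'Tiger') are extracted as well
without_b = sample.str.extractall(r'([tT]\w*e)')[0].tolist()

print(with_b)
print(without_b)
```

With `\b` on both sides, only 'The', 'Tale' and 'the' are returned.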