That said, learning regular expressions is a worthwhile investment.

* Once you understand how regular expressions work, you can write complex string operations much more quickly, which saves you time.
* Regular expressions often run faster than equivalent hand-written string-handling code.
* Nearly all modern programming languages, as well as databases and command-line tools, support regular expressions. Understanding them gives you a powerful tool that you can use anywhere you work with data.


Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles; stories that reach the top of its listings can get hundreds of thousands of visitors.

The dataset we will be working with is based on a CSV file of Hacker News stories from September 2015 to September 2016. The columns in the dataset are explained below:

* `id`: The unique identifier from Hacker News for the story
* `title`: The title of the story
* `url`: The URL that the story links to, if the story has a URL
* `num_points`: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: The number of comments that were made on the story
* `author`: The username of the person who submitted the story
* `created_at`: The date and time at which the story was submitted

In [1]:
import pandas as pd

In [2]:
hn = pd.read_csv("hacker_news.csv")

In [3]:
import re

In [4]:
re.search("and", "I have two hands")

<re.Match object; span=(12, 15), match='and'>

In [5]:
"I have two hands"[12:15]

'and'

In [6]:
re.search("and", "antidote")
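Nothing is displayed for the call above because `re.search()` returns `None` when the pattern is not found anywhere in the string. A minimal sketch of how that return value might be checked before use (the helper name `describe_match` is hypothetical and not part of the original notebook):

```python
import re

def describe_match(pattern, text):
    """Return a short description of the first match, or a 'no match' note."""
    m = re.search(pattern, text)
    if m is None:  # re.search returns None when nothing matches
        return "no match"
    return f"{m.group()!r} at {m.span()}"

print(describe_match("and", "I have two hands"))  # 'and' at (12, 15)
print(describe_match("and", "antidote"))          # no match
```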

<img src = "set_1.svg" width = "100%" />

In [11]:
pattern = "[msb]end" # mend, send, bend

re.search(pattern, "bsending")

<re.Match object; span=(1, 5), match='send'>

The regular expression above will match the strings `mend`, `send`, and `bend`.
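As a rough check of that claim, the sketch below runs the same set against a few words. The word list is invented for illustration; only `[msb]end` comes from the cell above.

```python
import re

pattern = "[msb]end"

# Hypothetical sample words: the first character must be m, s, or b.
words = ["mend", "send", "bend", "lend", "bsending", "end"]
matched = [w for w in words if re.search(pattern, w)]

print(matched)  # ['mend', 'send', 'bend', 'bsending']
```

`lend` and `end` are rejected because no character from the set precedes `end`.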

<img src = "set_2.svg" width = "60%" />

In [14]:
title = hn.title
title.head()

0                            Interactive Dynamic Video
1    Florida DJs May Face Felony for April Fools' W...
2         Technology ventures: From Idea to Enterprise
3    Note by Note: The Making of Steinway L1037 (2007)
4    Title II kills investment? Comcast and other I...
Name: title, dtype: object

In [16]:
pattern = "[Pp]ython"

title[title.str.contains(pattern)]

102                    From Python to Lua: Why We Switched
103              Ubuntu 16.04 LTS to Ship Without Python 2
144      Create a GUI Application Using Qt and Python i...
196      How I Solved GCHQ's Xmas Card with Python and ...
436      Unikernel Power Comes to Java, Node.js, Go, an...
                               ...                        
19597    David Beazley  Python Concurrency from the Gro...
19852      Ask HN: How to automate Python apps deployment?
19862                            Moving Away from Python 2
19980                        Python vs. Julia Observations
19998    Show HN: Decorating: Animated pulsed for your ...
Name: title, Length: 160, dtype: object

`1000` to `2999`

In [17]:
pattern = "[1-2][0-9][0-9][0-9]"

title[title.str.contains(pattern)]

3        Note by Note: The Making of Steinway L1037 (2007)
34                        The reverse job applicant (2010)
50         Ask HN: Which framework for a CRUD app in 2016?
59            2015 in review  1 year after I quit blogging
80       Apple Watch Scooped Up Over Half the Smartwatc...
                               ...                        
20032    Things I Wont Work With: Dioxygen Difluoride (...
20049    Study: US is an oligarchy, not a democracy (2014)
20075    Tips from a Pro: An Introduction to Microscopi...
20081                 Usenet, what have you become? (2012)
20082    Colma, Calif., Is a Town of 2.2 Square Miles, ...
Name: title, Length: 1143, dtype: object

<img src = "quantifiers_numeric.svg" width = "80%" />

In [19]:
pattern = "[1-2][0-9]{3}"

title[title.str.contains(pattern)]

3        Note by Note: The Making of Steinway L1037 (2007)
34                        The reverse job applicant (2010)
50         Ask HN: Which framework for a CRUD app in 2016?
59            2015 in review  1 year after I quit blogging
80       Apple Watch Scooped Up Over Half the Smartwatc...
                               ...                        
20032    Things I Wont Work With: Dioxygen Difluoride (...
20049    Study: US is an oligarchy, not a democracy (2014)
20075    Tips from a Pro: An Introduction to Microscopi...
20081                 Usenet, what have you become? (2012)
20082    Colma, Calif., Is a Town of 2.2 Square Miles, ...
Name: title, Length: 1143, dtype: object

<img src = "quantifier_example.svg" width = "90%" />

<img src = "character_classes_1.svg" width = "70%" />
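`[A-Za-z]` matches any ASCII letter. Note that `[Aa-Zz]` is not a valid class, because the range `a-Z` runs backwards in character-code order.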
<img src = "character_classes_2.svg" width = "70%" />
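Ranges inside a set follow character-code order, so a class for ASCII letters is usually written `[A-Za-z]`. The sketch below compares it with `[Aa-Zz]`, which Python rejects when compiling; the sample string is made up for illustration.

```python
import re

letters = "[A-Za-z]"
found = re.findall(letters, "R2-D2 & c3po")
print(found)  # ['R', 'D', 'c', 'p', 'o']

# 'a' (code 97) comes after 'Z' (code 90), so the range a-Z is rejected.
try:
    re.compile("[Aa-Zz]")
    bad_range_rejected = False
except re.error as exc:
    bad_range_rejected = True
    print("invalid pattern:", exc)
```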

In [20]:
print("Danial \n Gauhar")

Danial 
 Gauhar


In [21]:
pattern = r"[1-2]\d{3}"

title[title.str.contains(pattern)]

3        Note by Note: The Making of Steinway L1037 (2007)
34                        The reverse job applicant (2010)
50         Ask HN: Which framework for a CRUD app in 2016?
59            2015 in review  1 year after I quit blogging
80       Apple Watch Scooped Up Over Half the Smartwatc...
                               ...                        
20032    Things I Wont Work With: Dioxygen Difluoride (...
20049    Study: US is an oligarchy, not a democracy (2014)
20075    Tips from a Pro: An Introduction to Microscopi...
20081                 Usenet, what have you become? (2012)
20082    Colma, Calif., Is a Town of 2.2 Square Miles, ...
Name: title, Length: 1143, dtype: object

`email` or `e-mail`

In [22]:
pattern = "e-{0,1}mail" # email, e-mail

title[title.str.contains(pattern)]

119      Show HN: Send an email from your shell to your...
313          Disposable emails for safe spam free shopping
1361     Ask HN: Doing cold emails? helps us prove this...
1750     Protect yourself from spam, bots and phishing ...
2421                    Ashley Madison hack treating email
                               ...                        
18098    House panel looking into Reddit post about Cli...
18583    Mailgen  Generates clean, responsive HTML for ...
18847    Show HN: Crisp iOS keyboard for email and text...
19303    Ask HN: Why big email providers don't sign the...
19446    Tell HN: Secure email provider Riseup will run...
Name: title, Length: 86, dtype: object

In [25]:
pattern = "(e-{0,1}mail)" # email, e-mail

title.str.extract(pattern).value_counts()

email     81
e-mail     5
dtype: int64

In [26]:
pattern = "e-{0,1}(mail)" # email, e-mail

title.str.extract(pattern).value_counts()

mail    86
dtype: int64
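`Series.str.extract()` returns the text captured by the parentheses rather than the whole match, which is why moving the group changes the counts above. The same behaviour can be sketched with plain `re` (the sample sentence is invented for illustration):

```python
import re

text = "Please e-mail me"

whole = re.search(r"(e-{0,1}mail)", text)  # group wraps the entire pattern
part = re.search(r"e-{0,1}(mail)", text)   # group wraps only 'mail'

print(whole.group(1))  # e-mail
print(part.group(1))   # mail
print(part.group(0))   # e-mail  (group 0 is always the full match)
```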

## Single-Character Quantifiers

<img src = "quantifiers_other.svg" width = "70%" />

`email` or `e-mail`

In [None]:
pattern = "(e-?mail)" # email, e-mail

title.str.extract(pattern).value_counts()
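The `?` quantifier is shorthand for `{0,1}`. A small sketch comparing it with `*` (zero or more) and `+` (one or more) on three invented strings; `re.fullmatch` is used so that the whole string has to fit the pattern:

```python
import re

samples = ["email", "e-mail", "e--mail"]
results = {}

for quantifier in ["{0,1}", "?", "*", "+"]:
    pattern = "e-" + quantifier + "mail"
    # Keep only the samples that the pattern matches in full.
    results[quantifier] = [s for s in samples if re.fullmatch(pattern, s)]
    print(quantifier, results[quantifier])
```

Here `{0,1}` and `?` accept `email` and `e-mail`, `*` accepts all three strings, and `+` accepts only the two hyphenated ones.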

`[pdf]`, `[video]`, `[png]`, `[8]`, `[danial]`

In [29]:
pattern = r"\[pdf\]"

title[title.str.contains(pattern)]

66       Analysis of 114 propaganda sources from ISIS, ...
159           File indexing and searching for Plan 9 [pdf]
162      Attack on Kunduz Trauma Centre, Afghanistan  I...
445      New Directions in Cryptography by Diffie and H...
511      Office of Inspector General's Audit of the Off...
                               ...                        
19549                IMP: Indirect Memory Prefetcher [pdf]
19763    TSA can now force you to go through body scann...
19947                                Swift Reversing [pdf]
19979    WSJ/Dowjones Announce Unauthorized Access Betw...
20089    Users Really Do Plug in USB Drives They Find [...
Name: title, Length: 276, dtype: object

In [32]:
pattern = r"(\[\w+\])"

title.str.extract(pattern).value_counts()

[pdf]            276
[video]          111
[2015]             3
[audio]            3
[2014]             2
[beta]             2
[slides]           2
[1996]             1
[map]              1
[ask]              1
[blank]            1
[coffee]           1
[comic]            1
[crash]            1
[detainee]         1
[gif]              1
[png]              1
[much]             1
[Ubuntu]           1
[repost]           1
[satire]           1
[song]             1
[survey]           1
[transcript]       1
[updated]          1
[videos]           1
[Videos]           1
[USA]              1
[2008]             1
[SpaceX]           1
[5]                1
[ANNOUNCE]         1
[Australian]       1
[Benchmark]        1
[Beta]             1
[CSS]              1
[Challenge]        1
[Excerpt]          1
[GOST]             1
[German]           1
[HBR]              1
[Infograph]        1
[JavaScript]       1
[Live]             1
[Map]              1
[NSFW]             1
[Petition]         1
[Python]     

`[pdf]`, `[video]`, `[png]`, `[8]`, `[danial]`, `[@khan]`

In [None]:
# Homework: check which of the tags above this pattern matches

### `Accessing the Matching Text with Capture Groups`

In [34]:
pattern = r"\[(\w+)\]"

title.str.extract(pattern).value_counts()

pdf            276
video          111
2015             3
audio            3
2014             2
beta             2
slides           2
1996             1
map              1
ask              1
blank            1
coffee           1
comic            1
crash            1
detainee         1
gif              1
png              1
much             1
Ubuntu           1
repost           1
satire           1
song             1
survey           1
transcript       1
updated          1
videos           1
Videos           1
USA              1
2008             1
SpaceX           1
5                1
ANNOUNCE         1
Australian       1
Benchmark        1
Beta             1
CSS              1
Challenge        1
Excerpt          1
GOST             1
German           1
HBR              1
Infograph        1
JavaScript       1
Live             1
Map              1
NSFW             1
Petition         1
Python           1
React            1
SPA              1
Skinnywhale      1
viz              1
dtype: int64

### `Negative Character Classes`

<img src = "negative_character_classes.svg" width = "70%" />

In [None]:
"javascript"
"java"

pattern = "[Jj]ava[^Ss]"

In [42]:
a = "I am Learning JavaScript"
b = "I am Learning Javapython"
c = "Java is my fav"
d = "I am Learning Javaspython"

pattern = "[Jj]ava[^Ss]"

re.search(pattern,d)
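One limitation of the negative class is worth spelling out: `[^Ss]` still has to consume a character, so a `Java` at the very end of a string is not matched. A minimal sketch using strings like those in the cells of this section:

```python
import re

pattern = "[Jj]ava[^Ss]"
samples = [
    "I am Learning JavaScript",  # 'S' follows 'Java': no match
    "I am Learning Javapython",  # 'p' follows: matches 'Javap'
    "Java is my fav",            # a space follows: matches 'Java '
    "I am Learning Java",        # nothing follows: no match
]

results = {}
for s in samples:
    m = re.search(pattern, s)
    results[s] = m.group() if m else None
    print(repr(s), "->", repr(results[s]))
```

Note that the matched text includes the extra character (`'Javap'`, `'Java '`), which is one reason word boundaries and lookarounds are often a better fit.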

### `Word Boundaries`

In [43]:
print("hello\b")

hello


In [44]:
print("\bhello\b")

hello


In [45]:
print("\\bhello\\b")

\bhello\b
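The outputs above show why the pattern needs either doubled backslashes or a raw string: in an ordinary Python string, `\b` is the backspace character, not the regex word-boundary token. A minimal sketch (the sample sentence is invented):

```python
import re

plain = "\bJava\b"   # each \b becomes a single backspace character
raw = r"\bJava\b"    # backslash and 'b' are kept for the regex engine

print(len(plain), len(raw))  # 6 8

text = "I like Java"
plain_match = re.search(plain, text)  # looks for literal backspaces: None
raw_match = re.search(raw, text)      # word-boundary match

print(plain_match)
print(raw_match.group())  # Java
```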


In [48]:
a = "I am Learning JavaScript"
b = "I am Learning Javapython"
c = "Java is my fav"
d = "I am Learning Java"

pattern = "\\b[Jj]ava\\b"

re.search(pattern,c)

<re.Match object; span=(0, 4), match='Java'>

In [56]:
a = "I am Learning C"
b = "I am Learning C$"
c = "Java is my fav"
d = "I am Learning Java"

pattern = "\\b[Cc]\\b"

re.search(pattern,b)

<re.Match object; span=(14, 15), match='C'>

### `Matching at the Start and End of Strings`

<img src = "positional_anchors.svg" width = "70%" />

In [60]:
pattern = r"^\[\w+\]"

title[title.str.contains(pattern)]

195                 [Beta] Speedtest.net  HTML5 Speed Test
398         [video] Google Self-Driving SUV Sideswipes Bus
3136                           [CSS] Yellow Fade Technique
5054     [React] proptypes-parser: Define React PropTyp...
9389     [Petition] Tell Microsoft to stop making browsers
10960      [pdf] Ninth Circuit Decision on AT&T Throttling
11356    [video] a new tool to make your workflow as a ...
12323                [Map] Watch as the US grows over time
12374    [JavaScript] to promise or to callback? This i...
13385                [video] Introducing Apple File System
14397    [video] Boston Dynamics Atlas robot video comm...
16747    [ask] Why you should borrow me your spare comp...
19035    [2015] How one man earns $1M a year teaching w...
19482       [Challenge] Sorting algorithm with constraints
19583    [Ubuntu]if you do this sudo chmod 777 -R /etc,...
Name: title, dtype: object

In [61]:
pattern = r"\[\w+\]$"

title[title.str.contains(pattern)]

66       Analysis of 114 propaganda sources from ISIS, ...
100      Munich Gunman Got Weapon from the Darknet [Ger...
159           File indexing and searching for Plan 9 [pdf]
162      Attack on Kunduz Trauma Centre, Afghanistan  I...
210      A plan to rescue western democracy from the ig...
                               ...                        
19763    TSA can now force you to go through body scann...
19867                       Using Pony for Fintech [video]
19947                                Swift Reversing [pdf]
19979    WSJ/Dowjones Announce Unauthorized Access Betw...
20089    Users Really Do Plug in USB Drives They Find [...
Name: title, Length: 417, dtype: object
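The effect of the two anchors can be sketched on a handful of strings without pandas. The first two titles appear in the outputs above; the third is invented to show a tag that is neither at the start nor at the end.

```python
import re

titles = [
    "[video] Introducing Apple File System",
    "Swift Reversing [pdf]",
    "A [beta] tag in the middle",  # hypothetical title
]

# ^ pins the tag to the start of the string, $ pins it to the end.
starts = [t for t in titles if re.search(r"^\[\w+\]", t)]
ends = [t for t in titles if re.search(r"\[\w+\]$", t)]

print(starts)  # ['[video] Introducing Apple File System']
print(ends)    # ['Swift Reversing [pdf]']
```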

`email`, `Email`, `e Mail`, `e mail`, `E-mail`, `e-mail`, `eMail`, `E-Mail`, `EMAIL`, `emails`, `Emails`, `E-MailS`

In [63]:
pattern = r"(e[\s-]?mails?)"

title.str.extract(pattern, flags = re.I).value_counts()

email      57
Email      42
emails     18
Emails     15
e Mail      5
e-mail      5
e mail      4
E-mails     2
E-Mail      1
EMAIL       1
eMail       1
dtype: int64

In [64]:
title = title.str.replace(r"(e[\s-]?mails?)", "email", flags=re.I, regex=True)


In [65]:
pattern = r"(e[\s-]?mails?)"

title.str.extract(pattern, flags = re.I).value_counts()

email    151
dtype: int64

`mysql`, `PostgreSQL`, `SQL` 

In [None]:
# task

### `Counting Mentions of the 'C' Language`

In [67]:
pattern = r"\b[Cc]\b[^\.+]"

title[title.str.contains(pattern)]

365                       The new C standards are worth it
444            Moz raises $10m Series C from Foundry Group
521           Fuchsia: Micro kernel written in C by Google
1307             Show HN: Yupp, yet another C preprocessor
1326                      The C standard formalized in Coq
                               ...                        
18543                 C-style for loops removed from Swift
18549            Show HN: An awesome C library for Windows
18649                 Python vs. C/C++ in embedded systems
19151                      Ask HN: How to learn C in 2016?
19933    Lightweight C library to parse NMEA 0183 sente...
Name: title, Length: 84, dtype: object

In [68]:
pattern = r"[^Ss]\s\b[Cc]\b[^\.+]"

title[title.str.contains(pattern)]

365                       The new C standards are worth it
521           Fuchsia: Micro kernel written in C by Google
1307             Show HN: Yupp, yet another C preprocessor
1326                      The C standard formalized in Coq
1365                           GNU C Library 2.23 released
                               ...                        
18479    Tis-interpreter detects subtle bugs in C programs
18549            Show HN: An awesome C library for Windows
18649                 Python vs. C/C++ in embedded systems
19151                      Ask HN: How to learn C in 2016?
19933    Lightweight C library to parse NMEA 0183 sente...
Name: title, Length: 65, dtype: object

### `Lookarounds`

<img src = "lookarounds.svg" width = "70%" />

In [71]:
pattern = r"(?<!Series\s)\b[Cc]\b(?![\.+])"

title[title.str.contains(pattern)]

365                       The new C standards are worth it
521           Fuchsia: Micro kernel written in C by Google
1307             Show HN: Yupp, yet another C preprocessor
1326                      The C standard formalized in Coq
1365                           GNU C Library 2.23 released
                               ...                        
18543                 C-style for loops removed from Swift
18549            Show HN: An awesome C library for Windows
18649                 Python vs. C/C++ in embedded systems
19151                      Ask HN: How to learn C in 2016?
19933    Lightweight C library to parse NMEA 0183 sente...
Name: title, Length: 102, dtype: object

In [72]:
pattern = r"((?<!Series\s)\b[Cc]\b(?![\.+]))"

title.str.extract(pattern).value_counts()

C    99
c     3
dtype: int64
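The difference between the earlier pattern, which consumes a character after the `C`, and the lookaround version can be sketched on three strings. The second and third are titles from the outputs above; the first is invented so that `C` is the last character.

```python
import re

consuming = r"\b[Cc]\b[^\.+]"                   # must consume one character after the C
lookaround = r"(?<!Series\s)\b[Cc]\b(?![\.+])"  # only inspects the surrounding text

samples = [
    "Ask HN: How to learn C",                       # hypothetical: C ends the string
    "Moz raises $10m Series C from Foundry Group",
    "Python vs. C/C++ in embedded systems",
]

results = {}
for s in samples:
    results[s] = (bool(re.search(consuming, s)), bool(re.search(lookaround, s)))
    print(results[s], s)
```

The consuming pattern misses a trailing `C` and accepts `Series C`, while the lookaround pattern accepts the trailing `C` and rejects `Series C`.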

### `Backreferences Using Capture Groups`

In [73]:
"Bye Bye"
"Women Women"

pattern = r"\b(\w+)\s\1\b"

title[title.str.contains(pattern)]

3102                  Silicon Valley Has a Problem Problem
3176                Wire Wire: A West African Cyber Threat
3178                         Flexbox Cheatsheet Cheatsheet
4797                            The Mindset Mindset (2015)
7276     Valentine's Day Special: Bye Bye Tinder, Flirt...
10371    Mcdonalds copying cyriak  cows cows cows in th...
11575                                    Bang Bang Control
11901          Cordless Telephones: Bye Bye Privacy (1991)
12697          Solving the the Monty-Hall-Problem in Swift
15049    Bye Bye Webrtc2SIP: WebRTC with Asterisk and A...
15839          Intellij-Rust Rust Plugin for IntelliJ IDEA
Name: title, dtype: object

In [74]:
"Bye Bye"
"Women Women"

pattern = r"\b(\w+)\s\1\s\1\b"

title[title.str.contains(pattern)]

10371    Mcdonalds copying cyriak  cows cows cows in th...
Name: title, dtype: object

In [76]:
string = "Problem OK OK Problem"

pattern = r"\b(\w+)\s(\w+)\s\2\s\1\b"

re.search(pattern,string)

<re.Match object; span=(0, 21), match='Problem OK OK Problem'>
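A backreference such as `\1` matches the exact text that the first group captured, so it is case-sensitive, and it can also be used in a replacement string. A short sketch on strings adapted from the titles above (the lowercase variant is invented):

```python
import re

pattern = r"\b(\w+)\s\1\b"

same_case = re.search(pattern, "Bye Bye Tinder")
mixed_case = re.search(pattern, "Bye bye Tinder")  # 'bye' != 'Bye', so no match

# In re.sub, \1 in the replacement keeps one copy of the repeated word.
deduped = re.sub(pattern, r"\1", "Solving the the Monty-Hall-Problem")

print(same_case.group())  # Bye Bye
print(mixed_case)         # None
print(deduped)            # Solving the Monty-Hall-Problem
```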

In [78]:
hn["url"]

0                  http://www.interactivedynamicvideo.com/
1        http://www.thewire.com/entertainment/2013/04/f...
2        https://www.amazon.com/Technology-Ventures-Ent...
3        http://www.nytimes.com/2007/11/07/movies/07ste...
4        http://arstechnica.com/business/2015/10/comcas...
                               ...                        
20094    https://puri.sm/philosophy/how-purism-avoids-i...
20095    https://medium.com/@zreitano/the-yc-applicatio...
20096    http://blog.darknedgy.net/technology/2016/01/0...
20097    https://medium.com/@benjiwheeler/how-product-h...
20098                https://github.com/jmcarp/robobrowser
Name: url, Length: 20099, dtype: object

In [80]:
pattern = r"\[(?P<Name>\w+)\]"

title.str.extract(pattern)

Unnamed: 0,Name
0,
1,
2,
3,
4,
...,...
20094,
20095,
20096,
20097,
