That being said, making the investment to learn regular expressions is worthwhile.

* Once you understand how regular expressions work, you can write complex string operations much more quickly, which saves you time.
* Regular expressions are often faster to execute than their manual equivalents.
* Regular expressions are supported in nearly all current programming languages, as well as in places like databases and command-line tools. Understanding them gives you a powerful tool that you can use anywhere you work with data.


Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles; stories that make it to the top of Hacker News' listings can get hundreds of thousands of visitors.

The dataset we will be working with is based on a CSV of Hacker News stories from September 2015 to September 2016. The columns in the dataset are explained below:

* `id`: The unique identifier from Hacker News for the story
* `title`: The title of the story
* `url`: The URL that the story links to, if the story has a URL
* `num_points`: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: The number of comments that were made on the story
* `author`: The username of the person who submitted the story
* `created_at`: The date and time at which the story was submitted

In [1]:
import pandas as pd

In [2]:
hn = pd.read_csv("hacker_news.csv")

In [3]:
import re

In [4]:
re.search("and", "I have two hands")

<re.Match object; span=(12, 15), match='and'>

In [5]:
print(re.search("and", "antidote"))

None


In [6]:
if re.search("and", "I have two hands"):
    print("Match")

Match


In [7]:
if re.search("abc", "I have two hands"):
    print("Match")
else:
    print("No Match")

No Match


In [8]:
string_list = ["Ali's favorite color is Blue.",
               "Noman's favorite color is Green.",
               "Ayesha's favorite colors are blue and red."]

pattern = "Blue"

for s in string_list:
    if re.search(pattern, s):
        print("Match")
    else:
        print("No Match")

Match
No Match
No Match


<img src = "set_1.svg" width = "100%" />

In [9]:
pattern = "[msb]end"

re.search(pattern, "mend")

<re.Match object; span=(0, 4), match='mend'>

The regular expression above will match the strings `mend`, `send`, and `bend`.
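
As a quick check of that claim, here is a minimal sketch that runs the same set against a few made-up words (the word list is an assumption for illustration, not part of the dataset). Because `re.search` looks anywhere in the string, `amend` matches as well, while `lend` does not.

```python
import re

pattern = "[msb]end"

# Hypothetical sample words, chosen for illustration only.
words = ["mend", "send", "bend", "lend", "amend"]

# Keep the words in which the pattern is found somewhere in the string.
matches = [w for w in words if re.search(pattern, w)]
print(matches)
```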

<img src = "set_2.svg" width = "60%" />

In [10]:
titles = hn["title"]
python_mentions = 0
pattern = "[Pp]ython"

for t in titles:
    if re.search(pattern, t):
        python_mentions += 1

In [11]:
python_mentions

160

In [12]:
pattern = '[Pp]ython'

python_mentions = titles.str.contains(pattern).sum()
python_mentions

160

In [13]:
titles[titles.str.contains("[Pp]ython")].head()

102                  From Python to Lua: Why We Switched
103            Ubuntu 16.04 LTS to Ship Without Python 2
144    Create a GUI Application Using Qt and Python i...
196    How I Solved GCHQ's Xmas Card with Python and ...
436    Unikernel Power Comes to Java, Node.js, Go, an...
Name: title, dtype: object

Four-digit numbers from `1000` to `2999`:

In [14]:
pattern = "[1-2][0-9][0-9][0-9]"

titles[titles.str.contains(pattern)].head()

3     Note by Note: The Making of Steinway L1037 (2007)
34                     The reverse job applicant (2010)
50      Ask HN: Which framework for a CRUD app in 2016?
59         2015 in review  1 year after I quit blogging
80    Apple Watch Scooped Up Over Half the Smartwatc...
Name: title, dtype: object

<img src = "quantifiers_numeric.svg" width = "60%" />

<img src = "quantifier_example.svg" width = "70%" />

In [15]:
pattern = "[1-2][0-9]{3}"

titles[titles.str.contains(pattern)].head()

3     Note by Note: The Making of Steinway L1037 (2007)
34                     The reverse job applicant (2010)
50      Ask HN: Which framework for a CRUD app in 2016?
59         2015 in review  1 year after I quit blogging
80    Apple Watch Scooped Up Over Half the Smartwatc...
Name: title, dtype: object

<img src = "character_classes_1.svg" width = "70%" />
<img src = "character_classes_2.svg" width = "70%" />

In [16]:
pattern = r"[1-2]\d{3}"

titles[titles.str.contains(pattern)].head()

3     Note by Note: The Making of Steinway L1037 (2007)
34                     The reverse job applicant (2010)
50      Ask HN: Which framework for a CRUD app in 2016?
59         2015 in review  1 year after I quit blogging
80    Apple Watch Scooped Up Over Half the Smartwatc...
Name: title, dtype: object
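
The three year patterns used so far are meant to be interchangeable. The sketch below checks that on a handful of sample strings (the strings are assumptions for illustration; two of them echo titles shown above). One caveat: in Python 3, `\d` in a `str` pattern also matches non-ASCII Unicode digits, so it is slightly broader than `[0-9]`.

```python
import re

patterns = [
    "[1-2][0-9][0-9][0-9]",  # digits spelled out one by one
    "[1-2][0-9]{3}",         # numeric quantifier
    r"[1-2]\d{3}",           # digit character class
]

# Hypothetical sample strings for illustration.
samples = [
    "The reverse job applicant (2010)",
    "Top 10 tips",
    "Route 3999",
    "Steinway L1037",
]

# One row of True/False values per pattern.
results = [[bool(re.search(p, s)) for s in samples] for p in patterns]
for p, row in zip(patterns, results):
    print(p, row)
```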

`email` or `e-mail`

In [17]:
pattern = 'e-{0,1}mail'

titles[titles.str.contains(pattern)].head()

119     Show HN: Send an email from your shell to your...
313         Disposable emails for safe spam free shopping
1361    Ask HN: Doing cold emails? helps us prove this...
1750    Protect yourself from spam, bots and phishing ...
2421                   Ashley Madison hack treating email
Name: title, dtype: object

## Single Character Quantifiers

<img src = "quantifiers_other.svg" width = "70%" />

`email` or `e-mail`

In [18]:
pattern = 'e-?mail'

titles[titles.str.contains(pattern)].head()

119     Show HN: Send an email from your shell to your...
313         Disposable emails for safe spam free shopping
1361    Ask HN: Doing cold emails? helps us prove this...
1750    Protect yourself from spam, bots and phishing ...
2421                   Ashley Madison hack treating email
Name: title, dtype: object

`[pdf]`, `[video]`, `[png]`, `[8]`, `[danial]`

In [19]:
pattern = r'\[\w+\]'

titles[titles.str.contains(pattern)]

66       Analysis of 114 propaganda sources from ISIS, ...
100      Munich Gunman Got Weapon from the Darknet [Ger...
159           File indexing and searching for Plan 9 [pdf]
162      Attack on Kunduz Trauma Centre, Afghanistan  I...
195                 [Beta] Speedtest.net  HTML5 Speed Test
                               ...                        
19763    TSA can now force you to go through body scann...
19867                       Using Pony for Fintech [video]
19947                                Swift Reversing [pdf]
19979    WSJ/Dowjones Announce Unauthorized Access Betw...
20089    Users Really Do Plug in USB Drives They Find [...
Name: title, Length: 444, dtype: object

`[pdf]`, `[video]`, `[png]`, `[8]`, `[danial]`, `[@khan]`

In [20]:
pattern = r'\[.+\]'

titles[titles.str.contains(pattern)]

66       Analysis of 114 propaganda sources from ISIS, ...
100      Munich Gunman Got Weapon from the Darknet [Ger...
159           File indexing and searching for Plan 9 [pdf]
162      Attack on Kunduz Trauma Centre, Afghanistan  I...
195                 [Beta] Speedtest.net  HTML5 Speed Test
                               ...                        
19867                       Using Pony for Fintech [video]
19947                                Swift Reversing [pdf]
19975    [FOR GIT USERS] YoLog  Lightweight Wrapper to ...
19979    WSJ/Dowjones Announce Unauthorized Access Betw...
20089    Users Really Do Plug in USB Drives They Find [...
Name: title, Length: 471, dtype: object

### `Accessing the Matching Text with Capture Groups`

In [21]:
pattern = r"(\[\w+\])"

titles.str.extract(pattern, expand=False).value_counts().head()

[pdf]       276
[video]     111
[audio]       3
[2015]        3
[slides]      2
Name: title, dtype: int64

In [22]:
pattern = r"\[(\w+)\]"

titles.str.extract(pattern, expand=False).value_counts().head()

pdf      276
video    111
audio      3
2015       3
beta       2
Name: title, dtype: int64

In [23]:
pattern = r"\[(.+)\]"

titles.str.extract(pattern, expand=False).value_counts().index.unique()

Index(['pdf', 'video', 'audio', '2015', 'beta', 'slides', '2014', 'The Verge',
       'CSS', 'Benchmark', 'Google Sheets', 'systemd-devel] [ANNOUNCE',
       'Challenge', 'Reuters Institute survey', 'detainee', 'HBR', 'React',
       'updated', 'Skinnywhale', 'Python][Angular', 'XKCD Flowchart',
       'XSA-148', 'viz', 'coffee', 'Beta', 'Promo Codes in Comments',
       'With Infographic', 'USA', 'FOR GIT USERS', 'map', 'January 2016',
       'much', 'info needed', 'song', 'video/animation', '47:03',
       'Ksummit-Discuss', 'GOST', 'gif', 'Will Replace Logstash Forwarder',
       'videos', 'US Senate] [1996', 'Excerpt', 'Videos', 'comic',
       'from AGPL to Apache', '5', 'ACM Queue', '2007?', 'dns-operations',
       'Osmf-talk', 'German', 'Infograph', 'JavaScript', 'SpaceX',
       'In a 40mph Collision', 'Ubuntu', 'Live', 'Halting Problem',
       '] stroke and risk factors [', 'blank', '1:47', 'crash', 'satire',
       'repost', 'Map', 'transcript', 'PHP-DEV', 'SPA', '2008', 's
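
Entries such as `Python][Angular` in the index above come from titles with more than one bracketed tag: `.+` is greedy, so it runs to the last `]` in the title. A minimal sketch on a made-up title (the title string is hypothetical), contrasting it with the non-greedy form `.+?`:

```python
import re

# Hypothetical title with two bracketed tags.
title = "[Python][Angular] A hypothetical title"

# Greedy: .+ consumes as much as possible, up to the last "]".
greedy = re.search(r"\[(.+)\]", title).group(1)

# Non-greedy: .+? stops at the first "]" it can.
lazy = re.search(r"\[(.+?)\]", title).group(1)

print(greedy)
print(lazy)
```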

In [24]:
"Danial  Gauhar".replace(r"\s", "_")  # str.replace treats \s as literal text, so nothing changes

'Danial  Gauhar'

In [25]:
titles.str.replace(r"\s", "_", regex=True)

0                                Interactive_Dynamic_Video
1        Florida_DJs_May_Face_Felony_for_April_Fools'_W...
2             Technology_ventures:_From_Idea_to_Enterprise
3        Note_by_Note:_The_Making_of_Steinway_L1037_(2007)
4        Title_II_kills_investment?_Comcast_and_other_I...
                               ...                        
20094    How_Purism_Avoids_Intels_Active_Management_Tec...
20095            YC_Application_Translated_and_Broken_Down
20096    Microkernels_are_slow_and_Elvis_didn't_do_no_d...
20097                        How_Product_Hunt_really_works
20098    RoboBrowser:_Your_friendly_neighborhood_web_sc...
Name: title, Length: 20099, dtype: object

In [26]:
re.sub(r"\s+", " ", 'Danial  Gauhar')

'Danial Gauhar'
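
The built-in `str.replace` (used in `In [24]`) treats its first argument as literal text, while `re.sub` treats it as a regular expression. A small sketch of the difference, reusing the same example name:

```python
import re

name = "Danial  Gauhar"  # two spaces between the words

# str.replace looks for the two literal characters backslash + "s",
# which do not occur in the string, so nothing changes.
literal_result = name.replace(r"\s", "_")

# re.sub interprets \s as "any whitespace character".
regex_each = re.sub(r"\s", "_", name)   # every space replaced individually
regex_run = re.sub(r"\s+", "_", name)   # the whole run of spaces replaced once

print(literal_result)
print(regex_each)
print(regex_run)
```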

### `Negative Character Classes`

<img src = "negative_character_classes.svg" width = "70%" />

In [27]:
"javascript"
"javaScript"
"java"
"Java"

pattern = "[jJ]ava[^Ss]"

In [28]:
a = "I am Learning JavaScript"
b = "I am Learning Javapython"
c = "Java is my favourite"
d = "I am Learning Java"

pattern = "[jJ]ava[^Ss]"

re.search(pattern, d)
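
The search on `d` shows no output because it returns `None`: a negative set still has to consume one character, and nothing follows `Java` at the end of that string. A sketch that runs the same pattern over all four strings:

```python
import re

pattern = "[jJ]ava[^Ss]"

strings = {
    "a": "I am Learning JavaScript",
    "b": "I am Learning Javapython",
    "c": "Java is my favourite",
    "d": "I am Learning Java",
}

# Record the matched text for each string, or None when there is no match.
found = {}
for key, text in strings.items():
    m = re.search(pattern, text)
    found[key] = m.group() if m else None

print(found)
```

Note that the match includes the extra character (`Javap`, or `Java` plus a trailing space), which is one reason word boundaries are a better fit for this task.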

### `Word Boundaries`

In [29]:
print("hello\b")

hello


In [30]:
print("\bhello\b")

hello


In [31]:
print("\\bhello\\b")

\bhello\b


In [32]:
a = "I am Learning JavaScript"
b = "I am Learning Javapython"
c = "Java is my favourite"
d = "I am Learning Java"

pattern = "\\b[jJ]ava\\b"

re.search(pattern, d)

<re.Match object; span=(14, 18), match='Java'>

In [33]:
a = "I am Learning C"
b = "I am Learning C++"

pattern = "\\b[Cc]\\b"

re.search(pattern, b)

<re.Match object; span=(14, 15), match='C'>
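
The match on `b` happens because `+` is not a word character, so there is a word boundary between `C` and `+`. A minimal sketch on three sample strings (the third string is an assumption added for contrast):

```python
import re

pattern = r"\b[Cc]\b"

samples = [
    "I am Learning C",
    "I am Learning C++",
    "I am Learning CSS",  # hypothetical extra example
]

# \b sits between a word character and a non-word character (or a string edge).
hits = [bool(re.search(pattern, s)) for s in samples]
print(hits)
```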

### `Matching at the Start and End of Strings`

<img src = "positional_anchors.svg" width = "70%" />

In [34]:
pattern_beginning = r"^\[\w+\]"

titles[titles.str.contains(pattern_beginning)].head()

195                [Beta] Speedtest.net  HTML5 Speed Test
398        [video] Google Self-Driving SUV Sideswipes Bus
3136                          [CSS] Yellow Fade Technique
5054    [React] proptypes-parser: Define React PropTyp...
9389    [Petition] Tell Microsoft to stop making browsers
Name: title, dtype: object

In [35]:
pattern_ending =  r"\[\w+\]$"

titles[titles.str.contains(pattern_ending)].head()

66     Analysis of 114 propaganda sources from ISIS, ...
100    Munich Gunman Got Weapon from the Darknet [Ger...
159         File indexing and searching for Plan 9 [pdf]
162    Attack on Kunduz Trauma Centre, Afghanistan  I...
210    A plan to rescue western democracy from the ig...
Name: title, dtype: object

`email`, `Email`, `e Mail`, `e mail`, `E-mail`, `e-mail`, `eMail`, `E-Mail`, `EMAIL`, `emails`, `Emails`, `E-Mails`

In [36]:
pattern = "[Ee][Mm][Aa][Ii][Ll]"

titles[titles.str.contains(pattern)]

119      Show HN: Send an email from your shell to your...
161      Computer Specialist Who Deleted Clinton Emails...
174                                        Email Apps Suck
261      Emails Show Unqualified Clinton Foundation Don...
313          Disposable emails for safe spam free shopping
                               ...                        
18847    Show HN: Crisp iOS keyboard for email and text...
19303    Ask HN: Why big email providers don't sign the...
19395    I used HTML Email when applying for jobs, here...
19446    Tell HN: Secure email provider Riseup will run...
19905    Gmail Will Soon Warn Users When Emails Arrive ...
Name: title, Length: 136, dtype: object

In [37]:
pattern = "([Ee][Mm][Aa][Ii][Ll])"

titles.str.extract(pattern).value_counts()

email    77
Email    57
EMAIL     1
eMail     1
dtype: int64

In [38]:
pattern = "(email)"

titles.str.extract(pattern, flags = re.I).value_counts() # Ignore Case

email    77
Email    57
EMAIL     1
eMail     1
dtype: int64

In [39]:
pattern = r"(e[\s-]?mails?)"

titles.str.extract(pattern, flags = re.I).value_counts() # Ignore Case

email      57
Email      42
emails     18
Emails     15
e Mail      5
e-mail      5
e mail      4
E-mails     2
E-Mail      1
EMAIL       1
eMail       1
dtype: int64
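
To see what the last pattern accepts and rejects, here is a small sketch using plain `re` on a few sample strings (the strings are assumptions for illustration). The `re.I` flag makes the match case-insensitive, `[\s-]?` allows at most one space or hyphen after the `e`, and `s?` allows an optional plural.

```python
import re

pattern = r"e[\s-]?mails?"

# Hypothetical sample strings for illustration.
samples = ["email", "E-Mail", "e mail", "EMAILS", "gmail", "e--mail"]

# Keep only the strings in which the pattern is found, ignoring case.
matched = [s for s in samples if re.search(pattern, s, flags=re.I)]
print(matched)
```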

In [40]:
print("\bhello\b")

hello


In [41]:
print(r"\bhello\b")

\bhello\b
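
The two cells above show why the later patterns carry an `r` prefix: in a normal string, Python turns `\b` into a backspace character before `re` ever sees it. A minimal sketch of the effect on a search:

```python
import re

text = "I am Learning Java"

plain = "\bJava\b"      # \b becomes the backspace character (\x08)
raw = r"\bJava\b"       # backslashes are kept, so re sees a word boundary
escaped = "\\bJava\\b"  # same pattern as raw, written with doubled backslashes

plain_match = re.search(plain, text)  # looks for literal backspace characters
raw_match = re.search(raw, text)

print(plain_match)
print(raw_match)
print(raw == escaped)
```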


`mysql`, `PostgreSQL`, `SQL` 

In [42]:
pattern = r"(\w+sql)"

titles.str.extract(pattern, flags = re.I).value_counts()

PostgreSQL    27
NoSQL         16
MySQL         12
CloudSQL       1
MemSQL         1
SparkSQL       1
mySql          1
nosql          1
dtype: int64

In [43]:
pattern = r"(\bsql\b)"

titles.str.extract(pattern, flags = re.I).value_counts()

SQL    36
sql     3
dtype: int64

### `Counting Mentions of the 'C' Language`

In [44]:
pattern = r"\b[Cc]\b"

titles[titles.str.contains(pattern)]

13                  Custom Deleters for C++ Smart Pointers
220                         Lisp, C++: Sadness in my heart
221                   MemSQL (YC W11) Raises $36M Series C
353      VW C.E.O. Personally Apologized to President O...
365                       The new C standards are worth it
                               ...                        
19667                         Ill-Advised C++ Rant, Part 2
19799    Introducing a new, advanced Visual C++ code op...
19829    Ferret: Compiling a Subset of Clojure to ISO C...
19933    Lightweight C library to parse NMEA 0183 sente...
19997                                    Proposal: C.UTF-8
Name: title, Length: 190, dtype: object

In [45]:
pattern = r"\b[Cc]\b[^.+]"

titles[titles.str.contains(pattern)]

365                       The new C standards are worth it
444            Moz raises $10m Series C from Foundry Group
521           Fuchsia: Micro kernel written in C by Google
1307             Show HN: Yupp, yet another C preprocessor
1326                      The C standard formalized in Coq
                               ...                        
18543                 C-style for loops removed from Swift
18549            Show HN: An awesome C library for Windows
18649                 Python vs. C/C++ in embedded systems
19151                      Ask HN: How to learn C in 2016?
19933    Lightweight C library to parse NMEA 0183 sente...
Name: title, Length: 84, dtype: object

### `Lookarounds`

<img src = "lookarounds.svg" width = "70%" />

In [46]:
pattern = r"(?<!Series\s)\b[Cc]\b(?![+\.])"

titles[titles.str.contains(pattern)]

365                       The new C standards are worth it
521           Fuchsia: Micro kernel written in C by Google
1307             Show HN: Yupp, yet another C preprocessor
1326                      The C standard formalized in Coq
1365                           GNU C Library 2.23 released
                               ...                        
18543                 C-style for loops removed from Swift
18549            Show HN: An awesome C library for Windows
18649                 Python vs. C/C++ in embedded systems
19151                      Ask HN: How to learn C in 2016?
19933    Lightweight C library to parse NMEA 0183 sente...
Name: title, Length: 102, dtype: object
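
The pattern combines a negative lookbehind, `(?<!Series\s)`, which rejects a `C` preceded by `Series `, with a negative lookahead, `(?![+\.])`, which rejects a `C` followed by `+` or `.`. A minimal sketch applying it to four titles taken from the outputs above:

```python
import re

pattern = r"(?<!Series\s)\b[Cc]\b(?![+\.])"

samples = [
    "The new C standards are worth it",
    "MemSQL (YC W11) Raises $36M Series C",
    "Lisp, C++: Sadness in my heart",
    "Proposal: C.UTF-8",
]

# Lookarounds check the surrounding text without consuming it.
kept = [s for s in samples if re.search(pattern, s)]
print(kept)
```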

### `Backreferences Using Capture Groups`

In [47]:
"Saad Saad"
"Bye Bye"
"Women Women"

pattern = r"\b(\w+)\s\1\b"

In [48]:
pattern = r"\b(\w+)\s\1\b"

titles[titles.str.contains(pattern)]

3102                  Silicon Valley Has a Problem Problem
3176                Wire Wire: A West African Cyber Threat
3178                         Flexbox Cheatsheet Cheatsheet
4797                            The Mindset Mindset (2015)
7276     Valentine's Day Special: Bye Bye Tinder, Flirt...
10371    Mcdonalds copying cyriak  cows cows cows in th...
11575                                    Bang Bang Control
11901          Cordless Telephones: Bye Bye Privacy (1991)
12697          Solving the the Monty-Hall-Problem in Swift
15049    Bye Bye Webrtc2SIP: WebRTC with Asterisk and A...
15839          Intellij-Rust Rust Plugin for IntelliJ IDEA
Name: title, dtype: object

In [49]:
pattern = r"\b(\w+)\s\1\s\1\b"

titles[titles.str.contains(pattern)]

10371    Mcdonalds copying cyriak  cows cows cows in th...
Name: title, dtype: object
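
In these patterns `\1` means "the same text that group 1 just matched", and the trailing `\b` stops a partial repeat from counting. A sketch on the three example strings from `In [47]` plus two made-up non-repeats (the last two strings are assumptions for illustration):

```python
import re

pattern = r"\b(\w+)\s\1\b"

samples = ["Saad Saad", "Bye Bye", "Women Women", "Bye Buy", "the then"]

# True where a word is immediately followed by the same whole word.
repeated = {s: bool(re.search(pattern, s)) for s in samples}
print(repeated)
```

`"the then"` is rejected only because of the final `\b`: without it, `the` followed by the `the` inside `then` would match.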

In [50]:
string = "Problem OK OK Problem"

pattern = r"\b(\w+)\s(\w+)\s\2\s\1\b"

re.search(pattern, string)

<re.Match object; span=(0, 21), match='Problem OK OK Problem'>

In [51]:
url = hn["url"]
url.head()

0              http://www.interactivedynamicvideo.com/
1    http://www.thewire.com/entertainment/2013/04/f...
2    https://www.amazon.com/Technology-Ventures-Ent...
3    http://www.nytimes.com/2007/11/07/movies/07ste...
4    http://arstechnica.com/business/2015/10/comcas...
Name: url, dtype: object

In [52]:
pattern = r"https?://(?P<Domain>[\w.]+)/?[\w.]*"


url.str.extract(pattern)

                                Domain
0      www.interactivedynamicvideo.com
1                      www.thewire.com
2                       www.amazon.com
3                      www.nytimes.com
4                      arstechnica.com
...                                ...
20094                          puri.sm
20095                       medium.com
20096               blog.darknedgy.net
20097                       medium.com


In [53]:
pattern = r"(\w+)://([\w.]+)/?([\w.]*)"


url.str.extract(pattern)

           0                                1              2
0       http  www.interactivedynamicvideo.com
1       http                  www.thewire.com  entertainment
2      https                   www.amazon.com     Technology
3       http                  www.nytimes.com           2007
4       http                  arstechnica.com       business
...      ...                              ...            ...
20094  https                          puri.sm     philosophy
20095  https                       medium.com
20096   http               blog.darknedgy.net     technology
20097  https                       medium.com


In [54]:
pattern = r"(?P<Protocol>\w+)://(?P<Domain>[\w.]+)/?(?P<Path>[\w.]*)"


url.str.extract(pattern)

       Protocol                           Domain           Path
0          http  www.interactivedynamicvideo.com
1          http                  www.thewire.com  entertainment
2         https                   www.amazon.com     Technology
3          http                  www.nytimes.com           2007
4          http                  arstechnica.com       business
...         ...                              ...            ...
20094     https                          puri.sm     philosophy
20095     https                       medium.com
20096      http               blog.darknedgy.net     technology
20097     https                       medium.com
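
Named groups work the same way in plain `re`, where `groupdict()` returns the captured parts by name. A minimal sketch on the first URL from the dataset and one made-up URL (the `example.com` address is hypothetical):

```python
import re

pattern = r"(?P<Protocol>\w+)://(?P<Domain>[\w.]+)/?(?P<Path>[\w.]*)"

first = re.search(pattern, "http://www.interactivedynamicvideo.com/").groupdict()

# Hypothetical URL for illustration.
second = re.search(pattern, "https://example.com/blog/post.html").groupdict()

print(first)
print(second)
```

Because `[\w.]*` cannot match `/`, the `Path` group captures only the first path segment (`blog` here), which is consistent with the `Path` column in the table above.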
