# Extract Information With Regular Expressions

Wikipedia: A [regular expression](https://en.wikipedia.org/wiki/Regular_expression) is a sequence of characters that define a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.

In this notebook we will go over a series of handy regex patterns and see how to use them. The goal is to use regex, not to learn how to build whole regex patterns from scratch.

Regex is a super useful tool when working with text. It allows you to quickly extract or replace patterns in a long text. It's reliable, lightning fast and flexible.

But the cryptic pattern definitions do take some getting used to.

We'll start simple with:
- finding #hashtags in tweets
- extracting and replacing @usernames 

## #Hashtags




In [3]:
# Here is a small corpus of tweets that contain hashtags
tweets = [
    'An #autumn scene showing a beautiful #horse coming to visit me.', 
    'My new favourite eatery in #liverpool and I mean superb! #TheBrunchClub #breakfast #food', 
    '#nowplaying Pointer Sisters - Dare Me | #80s #disco #funk #radio']
    
# import the regex module
import re






### Define the pattern

This pattern finds all the sequences of characters that start with a # sign
and contain no whitespace (spaces, tabs, line returns ...):


```# followed by a non-empty sequence of non-whitespace characters: \S+```


In [6]:
pattern = r'#\S+'

Use ```re.findall``` to extract all the elements from the text that match the pattern.

In [7]:
for text in tweets:
    print(re.findall(pattern, text))

['#autumn', '#horse']
['#liverpool', '#TheBrunchClub', '#breakfast', '#food']
['#nowplaying', '#80s', '#disco', '#funk', '#radio']
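
One caveat worth keeping in mind: ```\S+``` matches any non-whitespace character, so punctuation glued to a hashtag is captured too. The sketch below uses a made-up tweet (not part of the corpus above) and shows ```\w+``` as one possible stricter alternative; which one fits depends on your data.

```python
import re

# Hypothetical tweet, written for this illustration only
sample = "Loving the #sunset, the #beach! and #summer2020."

# \S+ matches any non-whitespace character, so trailing punctuation is kept
loose = re.findall(r'#\S+', sample)

# \w+ only matches letters, digits and underscores
strict = re.findall(r'#\w+', sample)

print(loose)   # ['#sunset,', '#beach!', '#summer2020.']
print(strict)  # ['#sunset', '#beach', '#summer2020']
```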


## @usernames

Slightly modify the pattern to find all the @usernames

In [8]:
import re

text = 'Check out this new NLP course on @openclassrooms by @alexip'
    
# change the pattern # -> @
pattern = r'@\S+' 

print(re.findall(pattern, text))

['@openclassrooms', '@alexip']


We can also use ```re.sub``` to replace all the usernames with a special token.

For instance, replace the usernames with the token USR. The pattern stays the same.

In [12]:
print("\t",text)
print("becomes:")
print("\t",re.sub(pattern, 'USR', text))

	 Check out this new NLP course on @openclassrooms by @alexip
becomes:
	 Check out this new NLP course on USR by USR
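
Two variations on the replacement above, as a rough sketch: ```re.subn``` also reports how many replacements were made, and ```re.sub``` accepts a function instead of a fixed string. The helper name ```mask``` is made up for this example.

```python
import re

text = 'Check out this new NLP course on @openclassrooms by @alexip'
pattern = r'@\S+'

# re.subn returns the new string and the number of replacements made
anonymized, n_replaced = re.subn(pattern, 'USR', text)
print(anonymized)   # Check out this new NLP course on USR by USR
print(n_replaced)   # 2

# The replacement can also be a function that receives each match object.
# `mask` is a hypothetical helper: it keeps the @ sign and replaces
# the rest of the username with stars.
def mask(match):
    return '@' + '*' * (len(match.group()) - 1)

masked = re.sub(pattern, mask, text)
print(masked)
```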


## Remove HTML tags

A slightly more complex example. We have a web page and we want to remove all the HTML tags. HTML tags have the form ```< some text >```.

So we want to remove everything between ```<``` and ```>```, including the brackets.


We define the pattern as a ```<```, followed by any number of characters that are not ```>```, followed by a ```>```:

```
pattern = r"<[^>]*>"
```
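
To see why the character class ```[^>]``` matters, here is a small sketch on a made-up HTML fragment (not taken from any real page). It compares the pattern with a naive greedy ```<.*>```.

```python
import re

# Hypothetical HTML fragment, used only for this illustration
snippet = '<h1 id="title">House music</h1><p>A <b>genre</b> of music</p>'

# [^>]* matches any run of characters that are not '>',
# so each match stops at the first closing bracket
tag_pattern = r"<[^>]*>"

tags = re.findall(tag_pattern, snippet)
print(tags)
# ['<h1 id="title">', '</h1>', '<p>', '<b>', '</b>', '</p>']

# A greedy .* would instead swallow everything up to the last '>'
greedy = re.findall(r"<.*>", snippet)
print(len(greedy))  # 1
```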

Let's apply that to a web page that we download raw from Wikipedia.
For a change, consider the page about [House Music](https://en.wikipedia.org/wiki/House_music). The ```html``` variable contains the raw HTML.

In [16]:
import requests
import re

# Music is in the House!
url = 'https://en.wikipedia.org/wiki/House_music'

# GET the content
# Note: requests.get().content returns a bytes object
# that we can decode into a string with .decode('UTF-8')
html = requests.get(url).content.decode('UTF-8')

# remove the header part of the html 
html = html.split('</head>')[1]

print(html)


<body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject mw-editable page-House_music rootpage-House_music skin-vector action-view skin-vector-legacy"><div id="mw-page-base" class="noprint"></div>
<div id="mw-head-base" class="noprint"></div>
<div id="content" class="mw-body" role="main">
	<a id="top"></a>
	<div id="siteNotice" class="mw-body-content"><!-- CentralNotice --></div>
	<div class="mw-indicators mw-body-content">
	</div>
	<h1 id="firstHeading" class="firstHeading" lang="en">House music</h1>
	<div id="bodyContent" class="mw-body-content">
		<div id="siteSub" class="noprint">From Wikipedia, the free encyclopedia</div>
		<div id="contentSub"></div>
		<div id="contentSub2"></div>
		
		<div id="jump-to-nav"></div>
		<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>
		<a class="mw-jump-link" href="#searchInput">Jump to search</a>
		<div id="mw-content-text" lang="en" dir="ltr" class="mw-content-ltr"><div class="mw-parser-output"><div class="shortdescr

Now remove all the HTML tags with ```re.sub```, replacing each tag with a space:

In [19]:
pattern = r"<[^>]*>"
text = re.sub(pattern,' ', html)

In [20]:
print(text)


   
  
 
	  
	   
	 
	 
	 House music 
	 
		 From Wikipedia, the free encyclopedia 
		  
		  
		
		  
		 Jump to navigation 
		 Jump to search 
		   Genre of electronic dance music 
 For the Steven James Adams album, see  House Music (album) . 
 Not to be confused with  House band . 
    House music    Stylistic origins  
   Disco   &#91;1&#93;   
  Hi-NRG   &#91;2&#93;    &#91;3&#93;    &#91;4&#93;    &#91;5&#93;   
  boogie   &#91;6&#93;    &#91;7&#93;   
  electro   &#91;8&#93;   
  Latin soul   &#91;1&#93;   
  jazz   &#91;1&#93;    
    Cultural origins  1980s,  Chicago ,  Illinois , United States    Derivative forms      Electroclash    Eurobeat    techno    UK garage    speed garage    trance    dance-pop    2-step garage    Detroit techno        Subgenres        Acid house    Brazilian bass    deep house    diva house    funky house    future house    garage house    ghetto house    hardbag    microhouse    tribal house    tropical house     ( complete list )    Fusion genres 

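No more HTML tags, just raw text! Note that HTML entities such as ```&#91;``` (an encoded ```[```) are not tags, so they are still there.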
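
The output above still contains entities like ```&#91;1&#93;``` and a lot of stray whitespace. One possible clean-up, sketched on a made-up fragment that imitates the page, is to decode the entities with the standard library's ```html.unescape``` and collapse the whitespace with a second regex. The module is imported under the alias ```html_lib``` (an arbitrary name) so it does not clash with the ```html``` variable used in this notebook.

```python
import html as html_lib  # aliased to avoid clashing with the `html` variable
import re

# Hypothetical fragment, imitating the structure of the page above
fragment = '<td>Disco <sup>&#91;1&#93;</sup></td>\n\t<td>Hi-NRG</td>'

# 1. strip the tags
no_tags = re.sub(r"<[^>]*>", ' ', fragment)

# 2. decode HTML entities such as &#91; and &#93;
decoded = html_lib.unescape(no_tags)

# 3. collapse runs of whitespace into a single space
clean = re.sub(r'\s+', ' ', decoded).strip()

print(clean)  # Disco [1] Hi-NRG
```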

## Extracting URLs

If we just remove all the HTML tags, we also remove all the links, which have the form ```<a href="some url"> ... </a>```.

So we may also want to extract the URLs from a web page.
For instance, if you want to list the sources cited on social networks or build a bot that follows the links from a web page.

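To extract the URLs we will use the following pattern

```
r'http.+?(?=\?|"|<)'
```

This pattern finds all strings that start with http and stop just before the first ?, " or <. The delimiter itself is not part of the match, and stopping at ? drops any query string from the URL.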

Let's extract the urls from the wikipedia [House Music](https://en.wikipedia.org/wiki/House_music) page.

In [21]:
url = 'https://en.wikipedia.org/wiki/House_music'
html = requests.get(url).content.decode('UTF-8').split('</head>')[1]

In [28]:
pattern = r'http.+?(?=\?|"|<)'
urls = re.findall(pattern, html)
print(f"We find {len(urls)} urls")

We find 282 urls


In [30]:
for found_url in urls[:10]:
    print(f"- {found_url}")

- https://en.wiktionary.org/wiki/proto-#Prefix
- https://en.wikipedia.org/w/index.php
- http://www.allmusic.com/subgenre/house-ma0000002651
- https://web.archive.org/web/20121006233620/http://www.allmusic.com/subgenre/house-ma0000002651
- http%3A%2F%2Fwww.allmusic.com%2Fsubgenre%2Fhouse-ma0000002651&amp;rfr_id=info%3Asid%2Fen.wikipedia.org%3AHouse+music
- http://www.allmusic.com/style/hi-nrg-ma0000012074
- https://web.archive.org/web/20120617152422/http://www.allmusic.com/style/hi-nrg-ma0000012074
- http%3A%2F%2Fwww.allmusic.com%2Fstyle%2Fhi-nrg-ma0000012074&amp;rfr_id=info%3Asid%2Fen.wikipedia.org%3AHouse+music
- https://www.spin.com/2014/04/burning-down-the-house-chicago-club-80s/
- https://web.archive.org/web/20180405214757/https://www.spin.com/2014/04/burning-down-the-house-chicago-club-80s/
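
Some of the matches above, such as the ones starting with ```http%3A%2F%2F```, are URL-encoded fragments embedded in other links rather than links of their own. The sketch below reproduces this on made-up strings and tries a stricter pattern that requires ```://``` after the scheme. The stricter pattern is an assumption for illustration, not the pattern used in this notebook, and it is still far from a full URL parser.

```python
import re

# Hypothetical HTML fragment, used only for this illustration
sample_html = (
    '<a href="https://example.com/page?id=3">link</a> '
    'see http://example.org/docs<br>'
)

# (?=...) is a lookahead: it checks that one of ?, " or < comes next
# without including that character in the match
pattern = r'http.+?(?=\?|"|<)'
print(re.findall(pattern, sample_html))
# ['https://example.com/page', 'http://example.org/docs']

# A stricter variant (an assumption, not the notebook's pattern)
# that requires the :// separator and therefore skips URL-encoded text
strict = r'https?://[^\s"<>?]+'
encoded = 'u=http%3A%2F%2Fexample.com" href="https://example.com/a"'

print(re.findall(pattern, encoded))
# ['http%3A%2F%2Fexample.com', 'https://example.com/a']
print(re.findall(strict, encoded))
# ['https://example.com/a']
```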


## Punctuation signs

We can also use a regex to remove all the punctuation signs from a text.


In [36]:
text = "Hello, is your name bob? "

print(text)

print(re.sub(r'[^\w\s]', '', text) )


Hello, is your name bob? 
Hello is your name bob 
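
The class ```[^\w\s]``` means "anything that is neither a word character nor whitespace". Since ```\w``` includes the underscore, underscores are kept. If that is not what you want, one possible alternative, sketched here on a made-up sentence, is to build the character class from ```string.punctuation```.

```python
import re
import string

# Hypothetical sentence, written for this illustration only
text = "Don't mix snake_case, kebab-case & e-mail!"

# [^\w\s] removes everything that is not a word character or whitespace;
# the underscore counts as a word character, so it survives
kept_underscore = re.sub(r'[^\w\s]', '', text)
print(kept_underscore)
# Dont mix snake_case kebabcase  email

# Alternative: build the character class from string.punctuation,
# which does include the underscore; re.escape protects the
# characters that are special inside a character class
punct_pattern = '[' + re.escape(string.punctuation) + ']'
no_punct = re.sub(punct_pattern, '', text)
print(no_punct)
# Dont mix snakecase kebabcase  email
```

Note the double space left behind where the ```&``` used to be: removing punctuation does not touch the surrounding whitespace.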


## Tokenization

The following pattern makes a decent tokenizer when used with ```re.findall```

```r"\b\w+\b"```


In [52]:
text = "Hello, is your name bob? "
re.findall(r"\b\w+\b", text)

['Hello', 'is', 'your', 'name', 'bob']
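
For comparison, here is a short sketch of the same tokenization done with ```re.split```. Because ```split``` cuts the text at each match, it needs the complementary pattern ```\W+``` (runs of non-word characters) rather than the token pattern itself.

```python
import re

text = "Hello, is your name bob? "

# re.findall returns the pieces that match the pattern: the tokens
tokens = re.findall(r"\b\w+\b", text)
print(tokens)  # ['Hello', 'is', 'your', 'name', 'bob']

# re.split returns the pieces between matches, so it is given
# the separators (runs of non-word characters) instead
pieces = re.split(r"\W+", text)
print(pieces)  # ['Hello', 'is', 'your', 'name', 'bob', '']

# split may leave empty strings at the edges; filter them out
filtered = [p for p in pieces if p]
print(filtered)  # ['Hello', 'is', 'your', 'name', 'bob']
```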