# Extract Information With Regular Expressions


[![Open In Colab](colab-badge.svg)](https://colab.research.google.com/github/alexisperrier/intro2nlp/blob/master/notebooks/intro2nlp_09_regex.ipynb)

Wikipedia: A [regular expression](https://en.wikipedia.org/wiki/Regular_expression) is a sequence of characters that defines a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.

In this notebook we will go over a series of handy regex patterns and see how to use them. The goal is to use regexes, not to learn how to build whole regex patterns from scratch.

Regex is a super useful tool when working with text. It allows you to quickly extract or replace patterns in a long text. It's reliable, lightning fast and flexible.

But the cryptic pattern syntax does take some getting used to.

We'll start simple with:
- finding #hashtags in tweets
- extracting and replacing @usernames 

## Hashtags




In [None]:
# Here is a small corpus of tweets that contain hashtags
tweets = [
    'An #autumn scene showing a beautiful #horse coming to visit me.', 
    'My new favourite eatery in #liverpool and I mean superb! #TheBrunchClub #breakfast #food', 
    '#nowplaying Pointer Sisters - Dare Me | #80s #disco #funk #radio']
    
# import the regex module
import re




### Define the pattern

This pattern finds all the sequences of characters that start with a # sign and contain no whitespace (spaces, tabs, line returns ...):

```# followed by a non-empty sequence of non-whitespace characters: \S+```


In [None]:
pattern = r'#\S+'

Use ```re.findall``` to extract all the elements from the text that match the pattern.

In [None]:
for text in tweets:
    print(re.findall(pattern, text))
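Because ```\S+``` accepts any non-whitespace character, a hashtag that is directly followed by a comma or a period keeps that punctuation. The sketch below compares it with ```\w+```, which stops at the first character that is not a letter, a digit or an underscore. The tweet is made up for illustration.

```python
import re

# Hypothetical tweet where punctuation is glued to the hashtags
tweet = "Loving the #autumn, and my #horse! #80s #disco."

# \S+ grabs every non-whitespace character, punctuation included
greedy = re.findall(r"#\S+", tweet)

# \w+ stops at the first character that is not a letter, digit or underscore
strict = re.findall(r"#\w+", tweet)

print(greedy)  # ['#autumn,', '#horse!', '#80s', '#disco.']
print(strict)  # ['#autumn', '#horse', '#80s', '#disco']
```

Which variant is preferable depends on the data: ```\w+``` gives cleaner tags but would cut a hashtag at a hyphen.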

## @usernames

Slightly modify the pattern to find all the @usernames

In [None]:
import re

text = 'Check out this new NLP course on @openclassrooms by @alexip'
    
# change the pattern # -> @
pattern = r'@\S+' 

print(re.findall(pattern, text))

We can also use ```re.sub``` to replace all the usernames with a special token.

For instance, replace the usernames with the token USR. The pattern stays the same.

In [None]:
print("\t",text)
print("becomes:")
print("\t",re.sub(pattern, 'USR', text))
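When a username is followed by punctuation, ```@\S+``` replaces the punctuation too. As a minimal sketch, assuming handles only contain letters, digits and underscores, ```@\w+``` leaves the punctuation in place, and ```re.subn``` also returns the number of replacements. The sentence is made up for illustration.

```python
import re

# Hypothetical sentence with punctuation right after the usernames
text = "Thanks @openclassrooms, and hello @alexip!"

# @\S+ also swallows the comma and the exclamation mark
loose = re.sub(r"@\S+", "USR", text)

# @\w+ replaces only the handle; re.subn also reports how many were replaced
tight, n_replaced = re.subn(r"@\w+", "USR", text)

print(loose)              # Thanks USR and hello USR
print(tight, n_replaced)  # Thanks USR, and hello USR! 2
```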

## Remove HTML tags

A slightly more complex example. We have a web page and we want to remove all the HTML tags. HTML tags are represented by ```< some text >```. 

So we want to remove everything between ```<``` and ```>```, brackets included.


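We define the pattern as an opening ```<```, then any run of characters other than ```>``` (that is the ```[^>]*``` part), then the closing ```>```: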

```
pattern = r"<[^>]*>"
```
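To see why the pattern uses the negated class ```[^>]*``` rather than ```.*```, here is a small sketch on a made-up snippet. The greedy ```.*``` runs from the first ```<``` to the last ```>``` and wipes out the text in between.

```python
import re

# Hypothetical HTML snippet, for illustration only
snippet = "<p>House <b>music</b></p>"

# Greedy .* matches from the first < up to the very last >
greedy = re.sub(r"<.*>", " ", snippet)

# [^>]* cannot cross a >, so each tag is matched on its own
negated = re.sub(r"<[^>]*>", " ", snippet)

print(repr(greedy))     # ' '
print(negated.split())  # ['House', 'music']
```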

Let's apply that to a web page that we download raw from Wikipedia. 
For a change, consider the page about [House Music](https://en.wikipedia.org/wiki/House_music). The ```html``` variable contains the raw HTML.

In [None]:
import requests
import re

# Music is in the House!
url = 'https://en.wikipedia.org/wiki/House_music'

# GET the content
# Note: requests.get().content returns a bytes object
# that we can decode into a string with .decode('UTF-8')
html = requests.get(url).content.decode('UTF-8')

# remove the header part of the html 
html = html.split('</head>')[1]

print(html)

Now remove all the HTML tags with ```re.sub```.

In [None]:
pattern = r"<[^>]*>"
text = re.sub(pattern,' ', html)

In [None]:
print(text)

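No more HTML tags, just raw text! Keep in mind that the pattern only removes the tags themselves: whatever sat between two tags, including the content of ```<script>``` and ```<style>``` elements, is still in the text.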
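The stripped text usually contains long runs of spaces and line returns, plus any JavaScript or CSS that was embedded in the page. A possible cleanup, sketched on a made-up snippet, is to drop ```<script>``` and ```<style>``` elements first and collapse the whitespace at the end. This assumes well-formed, non-nested tags; for anything more demanding a real HTML parser is likely a safer choice.

```python
import re

# Hypothetical page body, for illustration only
html = (
    "<body><script>var x = 1;</script><h1>House music</h1>\n"
    "<p>A genre of <a href='/wiki/EDM'>electronic</a> music.</p></body>"
)

# 1. Drop <script> and <style> elements together with their content
#    (assumes the tags are well formed and not nested)
no_scripts = re.sub(
    r"<(script|style)\b[^>]*>.*?</\1>", " ", html,
    flags=re.DOTALL | re.IGNORECASE,
)

# 2. Remove the remaining tags
text = re.sub(r"<[^>]*>", " ", no_scripts)

# 3. Collapse runs of whitespace into a single space
clean = re.sub(r"\s+", " ", text).strip()

print(clean)  # House music A genre of electronic music.
```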

## Extracting urls

If we just remove all the HTML tags, we also remove all the links, which are in the form ``` <a href="some url"> ... </a>```. 

So we may also want to extract the URLs from a web page, 
for instance to list the sources cited on social networks or to build a bot that follows the links from a web page.

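To extract the URLs we will use the following pattern:

```
r'http.+?(?=\?|"|<)'
```

This pattern finds all strings that start with http and stop just before the first ?, " or <. The lookahead ```(?=...)``` checks for these characters without including them in the match, so any query string after ? is cut off.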

Let's extract the URLs from the Wikipedia [House Music](https://en.wikipedia.org/wiki/House_music) page.

In [None]:
url = 'https://en.wikipedia.org/wiki/House_music'
html = requests.get(url).content.decode('UTF-8').split('</head>')[1]

In [None]:
pattern = r'http.+?(?=\?|"|<)'
urls = re.findall(pattern, html)
print(f"We find {len(urls)} urls")

In [None]:
# show the first 10 urls (slicing is safe even if fewer than 10 were found)
for found_url in urls[:10]:
    print(f"- {found_url}")
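To check what the pattern from the cell above captures without downloading anything, here is a sketch on a made-up snippet. It also shows one possible alternative, ```https?://[^\s"<>]+```, which is not part of the original notebook and keeps the query string.

```python
import re

# Hypothetical HTML with made-up URLs, for illustration only
html = '<a href="https://example.org/page?lang=en">link</a> see http://example.com/a<br>'

# Pattern from the notebook: lazy match that stops before ?, " or <
pattern = r'http.+?(?=\?|"|<)'
urls = re.findall(pattern, html)

# Alternative (an assumption, not the notebook's pattern):
# match until whitespace, a quote or an angle bracket, keeping the query string
alt_urls = re.findall(r'https?://[^\s"<>]+', html)

print(urls)      # ['https://example.org/page', 'http://example.com/a']
print(alt_urls)  # ['https://example.org/page?lang=en', 'http://example.com/a']
```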

## Punctuation signs

We can also use a regex to remove all the punctuation signs from a text.


In [None]:
text = "Hello, is your name bob? "

print(text)

# remove every character that is neither a word character (\w) nor whitespace (\s)
print(re.sub(r'[^\w\s]', '', text))
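Two side effects of ```[^\w\s]``` are worth knowing: underscores survive because ```\w``` includes them, and apostrophes are deleted, which glues contractions together. The sketch below, on a made-up sentence, shows both, along with a variant that replaces punctuation and underscores with a space.

```python
import re

# Hypothetical sentence, for illustration only
text = "Don't touch my_var, it's 100% fine!"

# Underscores are kept (they belong to \w), apostrophes are removed
no_punct = re.sub(r"[^\w\s]", "", text)

# Variant: turn punctuation and underscores into spaces, then tidy whitespace
spaced = re.sub(r"[^\w\s]|_", " ", text)
spaced = re.sub(r"\s+", " ", spaced).strip()

print(no_punct)  # Dont touch my_var its 100 fine
print(spaced)    # Don t touch my var it s 100 fine
```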


## Tokenization

The following pattern makes a decent tokenizer when used with ```re.findall```, which returns every run of word characters:

```r"\b\w+\b"```


In [None]:
text = "Hello, is your name bob? "
re.findall(r"\b\w+\b", text)
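The choice of function matters here. As a small sketch: ```re.findall``` with the word pattern returns the tokens, ```re.split``` with the same pattern returns the separators instead, and splitting on ```\W+``` gives the tokens plus an empty string when the text ends with punctuation or a space. Contractions are also cut at the apostrophe.

```python
import re

text = "Hello, is your name bob? "

# findall returns the words themselves
tokens = re.findall(r"\b\w+\b", text)

# split with the same pattern returns what lies between the words
separators = re.split(r"\b\w+\b", text)

# splitting on non-word characters leaves a trailing empty string
pieces = re.split(r"\W+", text)

# the apostrophe is not a word character, so "Don't" becomes two tokens
contraction = re.findall(r"\b\w+\b", "Don't stop")

print(tokens)       # ['Hello', 'is', 'your', 'name', 'bob']
print(separators)   # ['', ', ', ' ', ' ', ' ', '? ']
print(pieces)       # ['Hello', 'is', 'your', 'name', 'bob', '']
print(contraction)  # ['Don', 't', 'stop']
```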