## Useful RegEx for text analysis

Overview:
<div style="float:left">
    <table style="width:300px;">
      <tr>
        <td style="width:80px; font-weight:bold"><b>\d</b></td>
        <td style="text-align:left;">any digit</td>
      </tr>
      <tr>
        <td><b>\D</b></td>
        <td style="text-align:left;">any non-digit character</td>
      </tr>
      <tr>
        <td><b>{m}</b></td>
        <td style="text-align:left;">m repeats</td>
      </tr>
      <tr>
        <td><b>{m,n}</b></td>
        <td style="text-align:left;">m to n repeats</td>
      </tr>
      <tr>
        <td><b>*</b></td>
        <td style="text-align:left;">zero or more repeats</td>
      </tr>
      <tr>
        <td><b>+</b></td>
        <td style="text-align:left;">one or more repeats</td>
      </tr>
      <tr>
        <td><b>^</b></td>
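        <td style="text-align:left;">start of the string (or of each line with re.MULTILINE)</td>
      </tr>
      <tr>
        <td><b>$</b></td>
        <td style="text-align:left;">end of the string (or of each line with re.MULTILINE)</td>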
      </tr>
      <tr>
        <td><b>\s</b></td>
        <td style="text-align:left;">any whitespace character</td>
      </tr>
      <tr>
        <td><b>\S</b></td>
        <td style="text-align:left;">any non-whitespace character</td>
      </tr>
      <tr>
        <td><b>\w</b></td>
        <td style="text-align:left;">any alphanumeric character or underscore</td>
      </tr>
      <tr>
        <td><b>\W</b></td>
        <td style="text-align:left;">any character that is neither alphanumeric nor an underscore</td>
      </tr>
      <tr>
        <td><b>.</b></td>
        <td style="text-align:left;">any character except a newline</td>
      </tr>
      <tr>
        <td><b>?</b></td>
        <td style="text-align:left;">zero or one repeat (the preceding element is optional)</td>
      </tr>
      <tr>
        <td><b>[a-z]</b></td>
        <td style="text-align:left;">any one character from a to z</td>
      </tr>
      <tr>
        <td><b>[abc]</b></td>
        <td style="text-align:left;">one of a, b or c</td>
      </tr>
      <tr>
        <td><b>[^abc]</b></td>
        <td style="text-align:left;">any one character except a, b or c</td>
      </tr>
      <tr>
        <td><b>[0-9]</b></td>
        <td style="text-align:left;">any one digit from 0 to 9</td>
      </tr>
      <tr>
        <td><b>(ab|cd)</b></td>
        <td style="text-align:left;">matches ab or cd</td>
      </tr>
    </table>
</div>
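
A few of the tokens from the overview can be tried out directly. The following is a minimal sketch; the sample strings and variable names are made up for illustration and are not part of the recipes below.

```python
import re

# Hypothetical sample sentence, chosen only to exercise the tokens above.
sample = "Order 66 shipped on 2021-03-04 to room B12"

digit_runs  = re.findall(r"\d+", sample)       # + : one or more digits
four_digits = re.findall(r"\d{4}", sample)     # {m} : exactly four digits
first_word  = re.findall(r"^\w+", sample)      # ^ : anchored at the start
last_word   = re.findall(r"\w+$", sample)      # $ : anchored at the end
letter_num  = re.findall(r"[A-Z]\d+", sample)  # one capital letter, then digits
optional_u  = re.findall(r"colou?r", "color colour")  # ? : the u is optional
either      = re.findall(r"(ab|cd)", "abcdxx")         # | : ab or cd

for result in (digit_runs, four_digits, first_word, last_word,
               letter_num, optional_u, either):
    print(result)
```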

In [1]:
import re

#### Find e-mail addresses

In [2]:
text_email  = 'My email is example.test@test.com and <example1@test.net_?.'
regex_email = re.compile(r"\"?([-a-zA-Z0-9._`?{}]+@\w+\.+[a-zA-Z0-9\-]+)\"?", re.IGNORECASE)

print(re.findall(regex_email, text_email))

['example.test@test.com', 'example1@test.net']


#### Extract e-mail domains

In [3]:
text_email_domains  = 'My email is example.test@test.com and <example1@test.net_?.'
regex_email_domains = re.compile(r"\"?[-a-zA-Z0-9._`?{}]+@(\w+\.+[a-zA-Z0-9\-]+)\"?", re.IGNORECASE)

print(re.findall(regex_email_domains, text_email_domains))

['test.com', 'test.net']
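
For text analysis it is often useful to count how frequently each domain occurs. A minimal sketch, assuming a simplified version of the domain pattern above and a made-up input string (`text_many` and `domain_counts` are illustrative names):

```python
import re
from collections import Counter

# Simplified domain pattern (an assumption, not the exact regex used above).
regex_domain = re.compile(r"[-a-zA-Z0-9._]+@(\w+\.[a-zA-Z0-9\-]+)")

# Hypothetical input with a repeated domain.
text_many = 'Contacts: a@test.com, b@test.net, c@Test.com'

# Lower-case the domains so that Test.com and test.com are counted together.
domain_counts = Counter(d.lower() for d in regex_domain.findall(text_many))

print(domain_counts.most_common())
```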


#### Get URLs

In [4]:
text_urls  = 'The links are http://www.test.de\xab<, https://www.test.de and www.test.com'
regex_urls = re.compile(r"((?:https?://|www\d{0,3}[.])?[a-z0-9.\-]+[.](?:(?:de)|(?:com)|(?:net)))", re.IGNORECASE)

print(re.findall(regex_urls, text_urls))

['http://www.test.de', 'https://www.test.de', 'www.test.com']


#### Find a specific number, e.g. an ID number of length 10

In [5]:
text_id_no  = 'Hello my id is 1234567890 and not 12342554644846532131'
regex_id_no = re.compile(r'\b\d{10}(?:[-\s]\d{4})?\b')

print(re.findall(regex_id_no, text_id_no))

['1234567890']
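
The word boundaries `\b` are what keep the pattern from matching inside a longer run of digits. A minimal sketch comparing the two variants on the same sentence (the variable names are illustrative):

```python
import re

text = 'Hello my id is 1234567890 and not 12342554644846532131'

# Without \b, any 10 consecutive digits match, even inside the 20-digit number.
without_boundary = re.findall(r'\d{10}', text)

# With \b on both sides, only a standalone 10-digit number matches.
with_boundary = re.findall(r'\b\d{10}\b', text)

print(without_boundary)  # ['1234567890', '1234255464', '4846532131']
print(with_boundary)     # ['1234567890']
```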


#### Find dates

In [6]:
text_date  = 'Today is 20.09.1993 or 21-07-2018 or 1/1/2001_02.02.18'
regex_date = re.compile(r"(\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4})", re.IGNORECASE)

print(re.findall(regex_date, text_date))

['20.09.1993', '21-07-2018', '1/1/2001', '02.02.18']
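
If the individual parts of a date are needed, named groups can split each match into its components. A minimal sketch, assuming day-first ordering (which the pattern itself cannot verify); the group names `day`, `month` and `year` are chosen for illustration:

```python
import re

# Same separators as above, with one named group per component.
regex_date_parts = re.compile(
    r"(?P<day>\d{1,2})[/.-](?P<month>\d{1,2})[/.-](?P<year>\d{2,4})"
)

text = 'Today is 20.09.1993 or 21-07-2018'

# finditer yields match objects; groupdict maps group names to matched text.
parts = [m.groupdict() for m in regex_date_parts.finditer(text)]

print(parts)
```

The values are still strings; converting and validating them (for example rejecting month 13) would be a separate step.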


#### Extract value between two tags

In [7]:
tags = 'Hey <pos>this is<end> a tag'
# Lazy .+? stops at the first <end>; no extra character is required around the tags
regex_tags = re.compile(r'<pos>(.+?)<end>', re.IGNORECASE)

print(re.findall(regex_tags, tags))

['this is']
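
When a text contains more than one tagged span, the choice between a greedy `.+` and a lazy `.+?` quantifier changes the result. A minimal sketch with a made-up string containing two spans:

```python
import re

# Hypothetical text with two tagged spans.
text = "<pos>first<end> filler <pos>second<end>"

# Greedy: .+ runs to the LAST <end>, swallowing everything in between.
greedy = re.findall(r"<pos>(.+)<end>", text)

# Lazy: .+? stops at the FIRST <end> after each <pos>.
lazy = re.findall(r"<pos>(.+?)<end>", text)

print(greedy)  # ['first<end> filler <pos>second']
print(lazy)    # ['first', 'second']
```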


#### Find telephone numbers

In [8]:
text_tele  = 'My no is +136-123456789?'
regex_tele = re.compile(r"([+0-9]*[0-9]{2,5}[-. ][0-9]*)", re.IGNORECASE)

print(re.findall(regex_tele, text_tele))

['+136-123456789']


#### Find street names

In [9]:
text_street  = 'Hi im living in the street of talst. 5 or talstreet 5'
regex_street = re.compile(r'[-a-zA-Z0-9._`?{}]+(?:street|st|avenue)[. ]*\d{1,4}\W?(?=\s|$)', re.IGNORECASE)

print(re.findall(regex_street, text_street))

['talst. 5', 'talstreet 5']


#### Find surnames

In [10]:
text_name  = 'My dear mr. heisenberg and mrs titan'
# [ ]* allows spaces before the title; mrs is listed before mr so the longer title wins
regex_name = re.compile(r'(?:hello|dear|and)[ ]*(?:mrs|mr)[. ]*([-a-zA-Z0-9._`?{}]*)', re.IGNORECASE)

print(re.findall(regex_name, text_name))

['heisenberg', 'titan']


#### Remove punctuation

In [11]:
import string

text_punct  = 'I()=can/"/&$read%%.this!+'
# string.punctuation covers all ASCII punctuation (including @); re.escape makes it safe inside [...]
regex_punct = re.compile('[' + re.escape(string.punctuation) + ']+')

print(re.sub(regex_punct,' ',text_punct))

I can read this 


#### Collapse multiple spaces, tabs and line breaks

In [12]:
text_ws  = 'I have a                         cat.\tYes \nyes.'
regex_ws = re.compile(r"\s+", re.IGNORECASE)

print(re.sub(regex_ws, ' ', text_ws))

I have a cat. Yes yes.


#### Remove / replace numbers

In [13]:
text_num  = 'I will stay for 23 days.'
regex_num = re.compile(r'[0-9]+', re.IGNORECASE)

print(re.sub(regex_num,'xx',text_num))

I will stay for xx days.
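
The three clean-up steps above are typically applied together before further analysis. A minimal sketch of one way to combine them; the function name `clean_text`, the placeholder `xx` and the order of the steps are assumptions made for illustration:

```python
import re
import string

# Compile each pattern once so the function can be called on many texts.
PUNCT = re.compile('[' + re.escape(string.punctuation) + ']+')
NUM = re.compile(r'\d+')
WS = re.compile(r'\s+')

def clean_text(text, number_token='xx'):
    """Replace numbers, strip punctuation and collapse whitespace."""
    text = NUM.sub(number_token, text)  # numbers first, while they are still intact
    text = PUNCT.sub(' ', text)         # punctuation runs become a single space
    text = WS.sub(' ', text)            # collapse spaces, tabs and line breaks
    return text.strip()

cleaned = clean_text('I()=will  stay\tfor 23 days!!')
print(cleaned)  # I will stay for xx days
```

Replacing numbers before punctuation matters for inputs such as `3.5`: removing the dot first would split it into two separate numbers.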
