# Regex Introduction

Use [regexr.com](regexr.com) to practice and refine your regular expressions before applying them in Python.

In [31]:
import re  # standard-library module for regular expressions

SAMPLE_TWEET = '''
#wolfram Alpha SUCKS! Even for researchers the information provided is less than you can get from 
#google or #wikipedia, totally useless!"
'''

`re.match` only matches at the beginning of the string, while `re.search` scans the entire string and returns the first match it finds.
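To make the difference between the two functions concrete, here is a minimal sketch. The sample string is made up for illustration and is not part of the tweet data.

```python
import re

text = "say #hello to regex"

# re.match only succeeds if the pattern matches at position 0
at_start = re.match(r"#\w+", text)

# re.search scans forward and returns the first match anywhere in the string
anywhere = re.search(r"#\w+", text)

print(at_start)          # None
print(anywhere.group())  # #hello
```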

### Match the first time a capital letter appears in the tweet

In [5]:
match = re.search("[A-Z]", SAMPLE_TWEET)
match.group()

'A'

### Match all capital letters that appear in the tweet

In [7]:
re.findall("[A-Z]", SAMPLE_TWEET)

['A', 'S', 'U', 'C', 'K', 'S', 'E']

### Match all words that are at least 3 characters long

In [8]:
re.findall("[a-zA-Z0-9]{3,}", SAMPLE_TWEET)

['wolfram',
 'Alpha',
 'SUCKS',
 'Even',
 'for',
 'researchers',
 'the',
 'information',
 'provided',
 'less',
 'than',
 'you',
 'can',
 'get',
 'from',
 'google',
 'wikipedia',
 'totally',
 'useless']

### Match all hashtags in the tweet

In [9]:
re.findall("#[a-zA-Z0-9]+", SAMPLE_TWEET)

['#wolfram', '#google', '#wikipedia']

### Match all hashtags in the tweet, capture only the text of the hashtag

In [10]:
# capturing group: only the text inside the parentheses is returned

re.findall(r"#(\w+)", SAMPLE_TWEET)

['wolfram', 'google', 'wikipedia']

### Match all words that start with `t` followed by `h` or `o`

In [15]:
re.findall(r"\bt[ho]\w*", SAMPLE_TWEET)

['the', 'than', 'totally']

### Match all words that end a sentence

In [13]:
re.findall(r"(\w+)([.?!])", SAMPLE_TWEET)

[('SUCKS', '!'), ('useless', '!')]
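The result above is a list of tuples because the pattern contains two capturing groups. As a rough illustration of how the number of groups changes what `re.findall` returns, consider this sketch on a made-up sentence:

```python
import re

text = "Wait, what? It works! Good."

# No groups: each item is the whole match
whole = re.findall(r"\w+[.?!]", text)

# One capturing group: each item is that group's text
words = re.findall(r"(\w+)[.?!]", text)

# Two capturing groups: each item is a tuple with one entry per group
pairs = re.findall(r"(\w+)([.?!])", text)

print(whole)  # ['what?', 'works!', 'Good.']
print(words)  # ['what', 'works', 'Good']
print(pairs)  # [('what', '?'), ('works', '!'), ('Good', '.')]
```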

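### Match on word boundaries
*A thorough examination of the movie shows Thor was a thorn in the side of the villains. Thor.*

```python
# Raw string so that \b reaches the regex engine as a word boundary, not a backspace
re.findall(r"\b[tT]hor\b", "A thorough examination of the movie shows Thor was a thorn in the side of the villains. Thor.")
```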

### Handling a regex that does not match

In [33]:
SAMPLE_TWEET = "A thorough examination of the movie shows Thor was a thorn in the side of the villains. Thor."


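# In a normal (non-raw) string, "\b" is a backspace character, not a word boundary, so nothing matches
re.findall("\b[tT]hor\b", SAMPLE_TWEET)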

[]

In [35]:
# A raw string (or an escaped "\\b") passes a real word boundary to the regex engine
if re.findall(r"\bThor\b", SAMPLE_TWEET):
    print("Found")
else:
    print("Not found")

Found
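`re.findall` returns an empty list when nothing matches, which is why a plain truthiness test works with it. `re.search` and `re.match` return `None` instead, so calling `.group()` on a failed match raises `AttributeError`. One way to guard against that is sketched below; `first_match` is a hypothetical helper name, not part of the `re` module.

```python
import re

def first_match(pattern, text, default=None):
    """Return the first match of pattern in text, or default if there is none."""
    match = re.search(pattern, text)
    return match.group() if match else default

sentence = "A thorough examination of the movie shows Thor was a thorn."

print(first_match(r"\bThor\b", sentence))         # Thor
print(first_match(r"\bLoki\b", sentence))         # None
print(first_match(r"\bLoki\b", sentence, "n/a"))  # n/a
```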


# Using Regex Combined with Pandas

In [14]:
import pandas as pd

# load in dataframe


In [15]:
# get rid of some columns we don't care about

# preview the data

In [16]:
# get length of tweets in characters


In [17]:
# count number of times Obama appears in tweets


In [18]:
# find all the @s in the tweets 
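As a rough sketch of how the three cells above might look, the example below uses the pandas `.str` accessor, which accepts regular expressions. The `tweets` DataFrame and its `text` column are assumptions made for illustration; the real dataset and its column names may differ.

```python
import pandas as pd

# Hypothetical stand-in for the tweets dataset; the real column names may differ
tweets = pd.DataFrame({
    "text": [
        "@alice Obama spoke today, Obama again tomorrow",
        "no mentions here",
        "thanks @bob and @carol_99!",
    ]
})

# Length of each tweet in characters
tweets["length"] = tweets["text"].str.len()

# Number of times "Obama" appears as a whole word in each tweet
tweets["obama_count"] = tweets["text"].str.count(r"\bObama\b")

# All @mentions in each tweet, as one list per row (the group drops the "@")
tweets["mentions"] = tweets["text"].str.findall(r"@(\w+)")

print(tweets[["length", "obama_count", "mentions"]])
```

`Series.str.findall` behaves like `re.findall` applied to every row, so the capturing-group rules shown earlier carry over.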


In [19]:
# sample timestamp format: Mon May 11 03:17:40 UTC 2009

# get the weekday of tweet



In [20]:
# get the month of the tweet



In [21]:
# get the year of the tweet
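One possible approach to the weekday, month, and year cells is a single pattern with named groups, sketched here with plain `re` on the sample timestamp from the comment above. The group names and the `DATE_PATTERN` constant are illustrative choices, and the pattern assumes every timestamp follows that exact layout.

```python
import re

# Named groups for the fields of a timestamp shaped like the sample in the notes
DATE_PATTERN = re.compile(
    r"(?P<weekday>[A-Z][a-z]{2}) "   # Mon
    r"(?P<month>[A-Z][a-z]{2}) "     # May
    r"(?P<day>\d{1,2}) "             # 11
    r"(?P<time>\d{2}:\d{2}:\d{2}) "  # 03:17:40
    r"(?P<tz>[A-Z]+) "               # UTC
    r"(?P<year>\d{4})"               # 2009
)

match = DATE_PATTERN.match("Mon May 11 03:17:40 UTC 2009")
if match:
    print(match.group("weekday"), match.group("month"), match.group("year"))
```

On a DataFrame column, `Series.str.extract` accepts the same kind of pattern and returns one column per named group.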



### Exercises (15 minutes)
1. Identify the list of email addresses for your security administrator to blacklist from your company's email servers.
2. Identify any IP addresses that should be blacklisted (each of the four parts of an IPv4 address ranges from 0 to 255, so addresses run from **0.0.0.0 to 255.255.255.255**).
3. Find a sensible way to identify all names of individuals in the spam emails.
4. Find all hashtags mentioned in the tweets dataset. Store them in a separate column called **hashtags**.

In [2]:
# 1 Identify the list of email addresses for your security administrator to blacklist from your company's email servers.






In [3]:
# 2 Identify any IP addresses that should be blacklisted (an IPv4 address goes from **0.0.0.0 to 255.255.255.255**)
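A common pitfall in this exercise is `\d{1,3}`, which also accepts values such as 999. The sketch below shows one way to restrict each part to 0-255. The `OCTET` and `IPV4` names and the sample text are assumptions for illustration, not part of the exercise data.

```python
import re

# One octet in the range 0-255 (a bare \d{1,3} would also accept 999)
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"

# Four octets separated by dots, bounded so that longer digit runs are rejected
IPV4 = re.compile(rf"\b{OCTET}(?:\.{OCTET}){{3}}\b")

text = "Blocked 192.168.0.1 and 10.0.0.255, ignored 999.1.1.1 and 256.300.1.2"
found = IPV4.findall(text)
print(found)  # ['192.168.0.1', '10.0.0.255']
```

All groups are non-capturing, so `findall` returns the full address rather than fragments.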






In [4]:
# 3 Find a sensible way to identify all names of individuals in the spam emails.

In [5]:
# 4 Find all hashtags mentioned in the tweets dataset. Store them in a separate column called **hashtags**.