# Regex Introduction

Use [regexp.com](regexp.com) to practice and refine your regular expressions before applying them in Python.

In [2]:
import re

SAMPLE_TWEET = '''
#wolfram Alpha SUCKS! Even for researchers the information provided is less than you can get from 
#google or #wikipedia, totally useless!"
'''

`re.match` only matches at the beginning of the string, while `re.search` scans the entire string and returns the first match it finds.
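A minimal sketch of that difference, using a short made-up string (`greeting` is a hypothetical name, not part of the tweet above):

```python
import re

greeting = "hello World"  # hypothetical sample string

# re.match only succeeds if the pattern matches at position 0
match_result = re.match("[A-Z]", greeting)

# re.search scans forward until it finds the first match anywhere
search_result = re.search("[A-Z]", greeting)

print(match_result)           # None, because the string starts with a lowercase "h"
print(search_result.group())  # W
```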

### Match the first time a capital letter appears in the tweet

In [14]:
match = re.search("[A-Z]", SAMPLE_TWEET)
match.group()

'A'

### Match all capital letters that appear in the tweet

In [15]:
re.findall("[A-Z]", SAMPLE_TWEET)

['A', 'S', 'U', 'C', 'K', 'S', 'E']

### Match all words that are at least 3 characters long

In [16]:
re.findall("[a-zA-Z]{3,}", SAMPLE_TWEET)

['wolfram',
 'Alpha',
 'SUCKS',
 'Even',
 'for',
 'researchers',
 'the',
 'information',
 'provided',
 'less',
 'than',
 'you',
 'can',
 'get',
 'from',
 'google',
 'wikipedia',
 'totally',
 'useless']

### Match all hashtags in the tweet

In [17]:
re.findall("#[a-zA-Z0-9]+", SAMPLE_TWEET)

['#wolfram', '#google', '#wikipedia']

### Match all hashtags in the tweet, capture only the text of the hashtag

In [50]:
# capturing group (...): findall returns only the text inside the parentheses
# the lambda function uppercases each captured hashtag
hashtags = list(map(lambda x: x.upper(), re.findall("#([a-zA-Z0-9]+)", SAMPLE_TWEET)))
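The cell above prints nothing, so here is a small sketch of what a capturing group changes in the result. It assumes a shortened copy of the tweet (`tweet`, `with_hash`, and `without_hash` are illustrative names), and it swaps in a list comprehension as an alternative to `map` plus `lambda`:

```python
import re

tweet = "#google or #wikipedia, totally useless!"  # shortened from SAMPLE_TWEET

# No group: findall returns the whole match, "#" included
with_hash = re.findall(r"#[a-zA-Z0-9]+", tweet)

# One capturing group: findall returns only the text inside (...)
without_hash = re.findall(r"#([a-zA-Z0-9]+)", tweet)

# A list comprehension does the same job as map + lambda
hashtags = [tag.upper() for tag in without_hash]

print(with_hash)     # ['#google', '#wikipedia']
print(without_hash)  # ['google', 'wikipedia']
print(hashtags)      # ['GOOGLE', 'WIKIPEDIA']
```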

### Match all words that start with `t`, and are followed by `h`, `o`, or `a`

In [3]:
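# capturing group: findall returns only the captured prefix ('th', 'to', or 'ta')
re.findall(r'(th|to|ta)\w+', SAMPLE_TWEET)

# non-capturing group (?:...): findall returns the whole match
re.findall(r't(?:h|o|a)\w+', SAMPLE_TWEET)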

['the', 'than', 'totally']
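Neither pattern above is anchored to the start of a word, so it can also match in the middle of one. The sketch below uses a made-up phrase (`phrase` is hypothetical) to compare the capturing, non-capturing, and `\b`-anchored variants:

```python
import re

phrase = "the paths to town"  # hypothetical sample phrase

# Capturing group: only the two-letter prefix is returned
prefixes = re.findall(r'(th|to|ta)\w+', phrase)

# Non-capturing group: the full match is returned, including "ths" from "paths"
unanchored = re.findall(r't(?:h|o|a)\w+', phrase)

# \b forces the "t" to sit at the start of a word
anchored = re.findall(r'\bt(?:h|o|a)\w+', phrase)

print(prefixes)    # ['th', 'th', 'to']
print(unanchored)  # ['the', 'ths', 'town']
print(anchored)    # ['the', 'town']
```

The standalone word "to" never matches, because `\w+` requires at least one more word character after the two-letter prefix.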

### Match all words that end a sentence

In [65]:
re.findall(r'(\w+)(?:[!\.\?])', SAMPLE_TWEET)

['SUCKS', 'useless']

### How to Handle a Regex That Does Not Match

When no match occurs, `re.findall()` returns an empty list, while `re.match()` and `re.search()` return `None`. Both values are falsy, so you can rely on Python's truthiness evaluation to handle each case.

In [4]:
if re.findall("a", SAMPLE_TWEET):
    print("True")
else:
    print("False")
    
if re.match("x", SAMPLE_TWEET):
    print("True") # Do stuff here if it matches
else:
    print("False") # Do stuff here if it does not match

True
False
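The same check matters before calling `.group()`, since `None` has no such method. A minimal sketch, where `first_capital` is a hypothetical helper introduced only for illustration:

```python
import re

def first_capital(text):
    """Return the first capital letter in text, or None if there is none."""
    match = re.search("[A-Z]", text)
    if match is None:
        # No match: calling match.group() here would raise AttributeError
        return None
    return match.group()

print(first_capital("regex is Fun"))  # F
print(first_capital("no capitals"))   # None
```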


### Word Boundaries

*A thorough examination of the movie shows Thor was a thorn in the side of the villains. Thor.*

In [12]:
text = "A thorough examination of the movie shows Thor was a thorn in the side of the villains. Thor."
re.findall(r'\bThor\b', text) # notice the use of the r string prefix!

['Thor', 'Thor']

### Case Insensitive Matching

In [20]:
re.findall(r'\bthor\b', text, re.IGNORECASE) # matches even though Thor is capitalized in the text above.

['Thor', 'Thor']
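To see why the word boundaries still matter once case is ignored, this sketch runs the pattern with and without `\b` on the same sentence (`loose` and `strict` are illustrative names):

```python
import re

text = "A thorough examination of the movie shows Thor was a thorn in the side of the villains. Thor."

# Without \b, the case-insensitive pattern also hits "thorough" and "thorn"
loose = re.findall(r'thor', text, re.IGNORECASE)

# With \b on both sides, only the standalone word survives
strict = re.findall(r'\bthor\b', text, re.IGNORECASE)

print(loose)   # ['thor', 'Thor', 'thor', 'Thor']
print(strict)  # ['Thor', 'Thor']
```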

### Use regex to find the usernames and domains of emails that
- have subdomains (for example `marshall.usc.edu`, not `gmail.com`)
- are all lowercased

Using `^` as the first character inside brackets negates the character class: `[^A-Z]` matches any character that is NOT a capital letter. The `\n` stands for a newline (line feed), so `[^A-Z\n]` also refuses to match across line breaks. Copy and paste the addresses below into `regex.com` for practice.

```
myemail@gmail.com
ychen220@marshall.usc.edu
anotherone@hotmail.com
yu.chen@anderson.ucla.edu
yu.chen@usc.edu
YUCHEN@gmail.com
```


In [22]:
EMAILS = '''
myemail@gmail.com
ychen220@marshall.usc.edu
anotherone@hotmail.com
yu.chen@anderson.ucla.edu
yu.chen@usc.edu
yu.chen@big.company.org
YUCHEN@gmail.com
'''

# raw string (r"...") so that \. reaches the regex engine without an invalid-escape warning
re.findall(r"([^A-Z\n]{3,12})@([^A-Z\n]+\.[^A-Z\n]+\.[^A-Z\n]{3})", EMAILS)

[('ychen220', 'marshall.usc.edu'),
 ('yu.chen', 'anderson.ucla.edu'),
 ('yu.chen', 'big.company.org')]
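A small sketch of how the negated classes in that pattern behave on their own. The address and the two-line string are made up for illustration, and every variable name is hypothetical:

```python
import re

line = "Yu.Chen@usc.edu"  # hypothetical address with two capitals

# [A-Z] keeps only capitals; [^A-Z] keeps runs of everything else
capitals = re.findall(r"[A-Z]", line)
non_capitals = re.findall(r"[^A-Z]+", line)

print(capitals)      # ['Y', 'C']
print(non_capitals)  # ['u.', 'hen@usc.edu']

two_lines = "ab\ncd"

# A newline is not a capital letter, so [^A-Z] happily matches across lines
across = re.findall(r"[^A-Z]+", two_lines)

# Adding \n to the negated class stops each match at the line break
per_line = re.findall(r"[^A-Z\n]+", two_lines)

print(across)    # ['ab\ncd']
print(per_line)  # ['ab', 'cd']
```

Note that `[^A-Z\n]` also accepts `@` and `.`, so the pattern above is permissive: it excludes capitals and line breaks rather than validating email syntax.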

### Should I compile regex?

Compiling a regular expression with `re.compile` is supposed to improve performance. However, I've never noticed much of a difference, and neither does this StackOverflow post.

Internally, Python's `re` module compiles and caches the patterns you pass to functions such as `re.findall`, so in general, compiling by hand is not worth the time.

```python
h = re.compile(r'\bThor\b')
h.findall(text)
# ['Thor', 'Thor']
```
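If you want to measure this yourself, the sketch below times both approaches with the standard `timeit` module. The repeat count of 10,000 is an arbitrary assumption, and the timings will vary by machine and Python version, so no expected numbers are shown:

```python
import re
import timeit

text = "A thorough examination of the movie shows Thor was a thorn in the side of the villains. Thor."
pattern = r'\bThor\b'
compiled = re.compile(pattern)

# Both approaches return the same matches
module_result = re.findall(pattern, text)
compiled_result = compiled.findall(text)

# Time 10,000 calls of each (arbitrary repeat count)
module_time = timeit.timeit(lambda: re.findall(pattern, text), number=10_000)
compiled_time = timeit.timeit(lambda: compiled.findall(text), number=10_000)

print(f"re.findall:       {module_time:.4f}s")
print(f"compiled.findall: {compiled_time:.4f}s")
```

A compiled pattern can still be handy for readability when the same expression is reused in many places.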