Announcements:
    
- Week 10 schedule
- Please fill out course evaluations!


Earlier this week: Natural language processing. 

Today: regex (regular expression)


# Regular Expressions

https://docs.python.org/3/howto/regex.html#introduction



Regular expressions are powerful tools to extract *structured information* from *unstructured text.*  For example, suppose that we are scraping Twitter data, and we'd like to extract a list of all the mentions and hashtags in a ~~tweet~~ X post. Our raw data might look something like this: 

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Our Great American Model was built on tough (very strong!!) parametric assumptions! <br><br>But FAR LEFT elitists living in coastal TANGENT SPACES (out of touch!) want to throw these out. Not on my watch!!<a href="https://twitter.com/hashtag/statstwitter?src=hash&amp;ref_src=twsrc%5Etfw">#statstwitter</a> <a href="https://twitter.com/hashtag/epitwitter?src=hash&amp;ref_src=twsrc%5Etfw">#epitwitter</a> <a href="https://twitter.com/hashtag/rstats?src=hash&amp;ref_src=twsrc%5Etfw">#rstats</a> <a href="https://twitter.com/hashtag/math?src=hash&amp;ref_src=twsrc%5Etfw">#math</a> <a href="https://twitter.com/hashtag/AI?src=hash&amp;ref_src=twsrc%5Etfw">#AI</a> <a href="https://twitter.com/hashtag/DataScience?src=hash&amp;ref_src=twsrc%5Etfw">#DataScience</a> <a href="https://twitter.com/hashtag/python?src=hash&amp;ref_src=twsrc%5Etfw">#python</a> <a href="https://twitter.com/hashtag/Science?src=hash&amp;ref_src=twsrc%5Etfw">#Science</a></p>&mdash; Statistician Trump (@StatisticianTr2) <a href="https://twitter.com/StatisticianTr2/status/1281959378371969024?ref_src=twsrc%5Etfw">July 11, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>    

We'd like to extract the hashtags from this ~~tweet~~post. For example, we'd like to write a function `collect_hashtags()` with the following output: 

```python
collect_hashtags(tw)
['statstwitter', 'epitwitter', 'rstats', 'math', 'AI', 'DataScience', 'python', 'Science']
```

We could then use this function on many ~~tweet~~posts in order to conduct an analysis of what people are talking about on ~~Twitter~~ X. How can we recognize the hashtags? 

If you're familiar with X, you know that a hashtag consists of the symbol \#, followed by one or more letters or numbers; the letters may or may not be capitalized. A space `" "` terminates the hashtag. 

This is an informal description of a *pattern* -- a rule for detecting hashtags in text. In this case, the rule is: 

> Find a `#`. Then, continue through letters and numbers until a space `" "` is reached.

Regular expressions allow us to formally construct and use patterns to obtain structured data like hashtags from unstructured text. They are an extremely powerful tool in any application in which we need to work with text data. 

To work with regular expressions, we need a few functions from the `re` module, which is part of the Python standard library. 


In [None]:
import re

Special metacharacters: `. ^ $ * + ? { } [ ] \ | ( )`

Every other character matches itself, so we can match literal text one character at a time.

In [None]:
s = "california cat colony scale"
pattern = 'c' # look for character 'c' in string

matches = re.findall(pattern, s)
print(matches)

In [None]:
s = "california cat colony scale"
pattern = 'ca' # look for characters 'ca' in string

matches = re.findall(pattern, s)
print(matches)

In [None]:
s = "california cat colony"


### Raw strings
It is better to write patterns as *raw strings*, which are marked by an `r` prefix before the opening quote. Raw strings do not process backslash escape sequences. For example, the string `"\n"` has just one character (the newline character), but the string `r"\n"` has two (a backslash and `n`).  

In [None]:
s1 = 'hello\nworld' # "normal" string so it reads \n as a newline
print(s1)
print( )

s2 = r'hello\nworld' # raw string, ignores the special characters
print(s2)

# input(...) vs. raw_input(...)

In [None]:
s1 = '\n'
print(len(s1)) # should give 1

s2 = r'\n'
print(len(s2)) # 2
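
Raw strings matter for regular expressions because the regex engine has its own backslash escapes on top of Python's. As a small sketch (the sample string `path` is made up for illustration), matching one literal backslash takes the two-character pattern `\\`, which is easiest to write as a raw string:

```python
import re

# Hypothetical sample text containing two literal backslashes.
path = r"home\user\notes"

raw_pattern = r"\\"      # two characters: backslash, backslash
plain_pattern = "\\\\"   # the same two characters, written without the r prefix

print(raw_pattern == plain_pattern)   # both spell the same pattern
backslashes = re.findall(raw_pattern, path)
print(len(backslashes))               # number of literal backslashes found
```

Without the `r` prefix, every backslash meant for the regex engine has to be doubled, which gets hard to read quickly.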

### Character class

`[ ]` defines a set of characters; the pattern matches any single character in that set.

In [None]:
s = "california cat colony"
pattern = '[abc]' # look for a character that is a, b, or c in a string

matches = re.findall(pattern, s)
print(matches)

Complement of a class: a leading `^` inside the brackets matches everything *but* the listed characters.

In [None]:
s = "california cat colony"
pattern = '[^a-g]' # look for a character that is NOT a, b, .. g in a string

matches = re.findall(pattern, s)
print(matches)


- `\d`: Matches any decimal digit; this is equivalent to the class `[0-9]`.
- `\D`: Matches any non-digit character; this is equivalent to the class `[^0-9]`.
- `\s`: Matches any whitespace character; this is equivalent to the class `[ \t\n\r\f\v]`.
- `\S`: Matches any non-whitespace character; this is equivalent to the class `[^ \t\n\r\f\v]`.
- `\w`: Matches any alphanumeric character; this is equivalent to the class `[a-zA-Z0-9_]`.
- `\W`: Matches any non-alphanumeric character; this is equivalent to the class `[^a-zA-Z0-9_]`.
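
To see these shorthand classes in action, here is a minimal sketch on a made-up sample string (the string `s` is an assumption for illustration, not part of the lecture data):

```python
import re

# Made-up sample string with letters, digits, spaces, and punctuation.
s = "Room 42, Floor 7"

digits = re.findall(r"\d", s)       # decimal digits only
spaces = re.findall(r"\s", s)       # whitespace characters
word_chars = re.findall(r"\w", s)   # letters, digits, and underscore
non_word = re.findall(r"\W", s)     # everything \w does not match

print(digits)
print(spaces)
print("".join(word_chars))
print(non_word)
```

Note that for ordinary (Unicode) string patterns in Python 3, `\w` also matches letters outside the ASCII range; the `[a-zA-Z0-9_]` equivalence holds exactly when the `re.ASCII` flag is used.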


Escape metacharacters with `\`:

In [None]:
s = "what^s [up]"
pattern = r'\[' # look for the character '[', not interpreting it as a special character
#pattern = r'[' # unescaped '[' raises re.error (unterminated character set)

matches = re.findall(pattern, s)
print(matches)
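
When the text to match contains several metacharacters, escaping each one by hand is error-prone. One option, sketched below, is the standard-library function `re.escape`, which adds the backslashes for us. The variable name `literal` is just an illustrative choice:

```python
import re

s = "what^s [up]"
literal = "[up]"               # text we want to find exactly as written

pattern = re.escape(literal)   # backslash-escapes the metacharacters
matches = re.findall(pattern, s)
print(matches)
```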

Match any single character except a newline: `.`

In [None]:
s = "california cat colony 999 kafhlq6817468!!?"
pattern = r'.' 

matches = re.findall(pattern, s)
print(matches)
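
The newline exception is easy to miss, so here is a small sketch on a made-up two-line string. It also shows the `re.DOTALL` flag, which lifts the exception:

```python
import re

s = "ab\ncd"   # made-up two-line string

without_newline = re.findall(r".", s)                 # '.' skips the newline
with_newline = re.findall(r".", s, flags=re.DOTALL)   # DOTALL lets '.' match it too

print(without_newline)
print(with_newline)
```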

### Repeating things
    
`*` : zero or more times

`+` : one or more times

`?` : zero or one time (think of it being optional, like multi-national or multinational)

In [None]:
s = 'ct cat caat caaat caaaaaaaaaat'
#pattern = r'ca*t' # look for something that is c + a (zero or more times) + t
#pattern = r'ca+t' # c + a (one or more times) + t
pattern = r'ca?t' # c + a (zero or one time) + t

matches = re.findall(pattern, s)
print(matches)
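
To compare the three quantifiers without commenting lines in and out, the following sketch runs all of them on the same string and collects the results in a dictionary (the name `results` is an illustrative choice):

```python
import re

s = 'ct cat caat caaat caaaaaaaaaat'

# Map each pattern to the list of matches it produces on s.
results = {p: re.findall(p, s) for p in (r'ca*t', r'ca+t', r'ca?t')}

for p, found in results.items():
    print(p, found)
```

Here `ca*t` matches all five words, `ca+t` drops `ct` because it needs at least one `a`, and `ca?t` keeps only `ct` and `cat`.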

In [None]:
s = 'multi-national multinational'
pattern = r'multi-?national'

matches = re.findall(pattern, s)
print(matches)

### Example: tweet (uh, X post)


In [None]:
s = "Our Great American Model was built on tough (very strong!!) parametric assumptions! \
But FAR LEFT elitists living in coastal TANGENT SPACES (out of touch!) want to throw these out. \
Not on my watch!! #statstwitter #epitwitter #rstats #math #AI #DataScience #python #Science" 
s

# look for # and then some characters and then once you hit a space, end there

In [None]:
pattern = r'#\w+' # look for # + one or more alphanumeric characters

matches = re.findall(pattern, s)
print(matches)
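
The `collect_hashtags()` function described at the start of these notes can be sketched with this pattern. This is one possible version, assuming hashtags consist only of `\w` characters; it strips the leading `#` with slicing, and the shortened sample string `tw` is only for illustration:

```python
import re

def collect_hashtags(text):
    """Return the hashtags in `text`, without the leading '#'."""
    # Find '#' followed by one or more word characters, then drop the '#'.
    return [tag[1:] for tag in re.findall(r'#\w+', text)]

tw = "Not on my watch!! #statstwitter #epitwitter #rstats #math #AI #DataScience #python #Science"
hashtags = collect_hashtags(tw)
print(hashtags)
```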

In [None]:
s = 'GAME ON, Bruins! 🏀 \
Mini basketballs, @ASUCLAStudentU coupon books and other giveaways await you at the #UCLAfirstthursdays block party!'

pattern = r'[#@]\w+' # look for [# or @] + one or more alphanumeric characters

matches = re.findall(pattern, s)
print(matches)

### Groups: parsing email addresses
Suppose that we'd like to extract some email addresses from a body of text. For example: 

> You can reach me at kose@math.ucla.edu or kos@ucla.edu or kos@g.ucla.edu.

We'd like to extract the usernames and domains of each of these email addresses. 

For this we can use **groups**. Groups, written with parentheses `( )`, let us capture "parts" of a match, enabling further processing. 

Intuitively, we are looking for: 

1. **The username**: a sequence of one or more letters and numbers, followed by 
2. An `@` symbol, followed by  
3. **The domain**: another sequence of letters, numbers, or the symbol `.`.
4. We should not include the sentence-final `.` in the last domain name. 

To see how groups work, let's take a look at an interactive demonstration in [Pythex](https://pythex.org/). 

In [None]:
s = "You can reach me at kose@math.ucla.edu or kos@ucla.edu or kos@g.ucla.edu."

# characters + @ + [characters and . ] + characters

pattern = r'\w+@[\w\.]+\w' # ending in \w keeps the sentence-final '.' out of the match

matches = re.findall(pattern, s)
print(matches)

In [None]:
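# use [A-Za-z], not [A-z]: the range A-z also covers the characters [ \ ] ^ _ `
pattern = r"([A-Za-z0-9]+)@([a-z\.]+[a-z]+)"
result = re.search(pattern, s)
result.groups() # username and domain of the first match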

In [None]:
re.findall(pattern, s)
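
Groups can also be given explicit names with the `(?P<name>...)` syntax, which makes the extracted parts easier to refer to than positions in a tuple. A minimal sketch, where the group names `username` and `domain` are our own choices:

```python
import re

s = "You can reach me at kose@math.ucla.edu or kos@ucla.edu or kos@g.ucla.edu."

# (?P<name>...) is a group that can be looked up by name instead of by position.
pattern = r"(?P<username>[A-Za-z0-9]+)@(?P<domain>[a-z.]+[a-z]+)"

# finditer yields one match object per email address found.
emails = [(m.group("username"), m.group("domain")) for m in re.finditer(pattern, s)]
print(emails)
```

Because the domain part must end in a letter, the period that closes the sentence is left out of the last address.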

### Example: Parsing HTML

There are many online tools for testing regular expressions: 

https://pythex.org/

https://regexr.com/

In [None]:
from urllib.request import urlopen

url = "https://www.ucla.edu/"

page = urlopen(url)
html_bits = page.read()

html = html_bits.decode('latin-1')

print(html[:2000])

In [None]:
urls = re.findall(r'href=[\'"]?([^\'">]+)', html)

urls

[url for url in urls if "http" in url]
# ---
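
To see what the `href` pattern does without a network connection, here is a sketch that runs it on a small hand-written HTML fragment. The fragment and its links are hypothetical, standing in for a downloaded page:

```python
import re

# Hypothetical HTML fragment with double-quoted, single-quoted, and unquoted hrefs.
html = (
    '<a href="https://www.ucla.edu/about">About</a>'
    "<a href='/admissions'>Admissions</a>"
    '<a href=#top>Top</a>'
)

# href= + optional quote + (group: everything up to the next quote or '>')
urls = re.findall(r'href=[\'"]?([^\'">]+)', html)
absolute = [u for u in urls if u.startswith("http")]

print(urls)
print(absolute)
```

This version filters with `startswith("http")` rather than `"http" in url`, so a relative link that merely contains the text `http` somewhere would not be kept. For anything beyond quick exploration, a dedicated HTML parser is usually more robust than a regular expression.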

### Parsing Unstructured Scientific Data

Sometimes, data doesn't come to us neatly wrapped in CSV files. For example, consider the following: 

In [None]:
data = """
Andrea    5:31
Ben       5:02
Carl      6:21
Didi      5:10
"""
data

Since it looks like these data represent times, let's parse the data into names, minutes, and seconds. 

In [None]:
pattern = r"([A-Za-z]+)\s+(\d+):(\d+)" # groups: name, minutes, seconds

In [None]:
parsed = re.findall(pattern, data)
parsed

In [None]:
{p[0] : 60*int(p[1]) + int(p[2]) for p in parsed}