In [41]:
from lec_utils import *


<div class="alert alert-info" markdown="1">

#### Lecture 7


# Regular Expressions

### QSS 20, Fall 2025
    
</div>


## Regex in Python

---

### `re` in Python

- The `re` module is built into Python. It allows us to use regular expressions to find, extract, and replace strings.

In [None]:
import re

- `re.findall` takes in a string `regex` and a string `text` and returns a list of all matches of `regex` in `text`. **You'll use this most often.**

In [2]:
re.findall('AB*A', 
           'here is a string for you: ABBBA. here is another: ABBBBBBBA')

['ABBBA', 'ABBBBBBBA']

- `re.sub` takes in a string `regex`, a string `repl`, and a string `text`, and replaces all matches of `regex` in `text` with `repl`.

In [11]:
re.sub('AB*A', 
       'billy', 
       'here is a string for you: ABBBA. here is another: ABBBBBBBA')

'here is a string for you: billy. here is another: billy'
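
As a small sketch of how the two functions relate, the cell below runs the same pattern through both `re.findall` and `re.sub`. It also uses the optional `count` argument of `re.sub`, which is not covered above and is shown here only as an illustration.

```python
import re

text = 'here is a string for you: ABBBA. here is another: ABBBBBBBA'
pattern = r'AB*A'

# re.findall returns the matches; re.sub returns a new string with them replaced.
matches = re.findall(pattern, text)
replaced = re.sub(pattern, 'billy', text)

# `count=1` limits the replacement to the first match only.
first_only = re.sub(pattern, 'billy', text, count=1)

print(matches)
print(replaced)
print(first_only)
```

Note that `text` itself is unchanged afterwards: `re.sub` returns a new string rather than modifying its input.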

### Raw strings

When using regular expressions in Python, it's a good idea to use **raw strings**, denoted by an `r` before the quotes, e.g. `r'exp'`.

In [32]:
re.findall('\\new', "Folder path: C:\\new_folder")

[]

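- Python first converts the string literal `'\\new'` into the four characters `\new`, because `\\` in a regular string is an escaped backslash.
- The regex engine then reads `\n` in that pattern as a newline character, so the pattern means "a newline followed by `ew`".
- The text doesn't contain a newline; it contains the literal characters `\` and `n`, so nothing matches.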

In [31]:
re.findall(r'\\new', "Folder path: C:\\new_folder")

['\\new']
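
One way to see the difference is to compare the two pattern strings directly. The sketch below uses the same text as above; the variable names `plain` and `raw` are just for illustration.

```python
import re

# The text contains one backslash followed by "new_folder".
text = "Folder path: C:\\new_folder"

plain = '\\new'   # regular string: 4 characters (\, n, e, w)
raw = r'\\new'    # raw string: 5 characters (\, \, n, e, w)

print(len(plain), len(raw))

# As a regex, `plain` means "newline, then ew";
# `raw` means "a literal backslash, then new".
print(re.findall(plain, text))
print(re.findall(raw, text))

# Without a raw string, the same pattern needs four backslashes.
print('\\\\new' == raw)
```

With a raw string, what you type is what the regex engine sees, so only the regex-level escaping has to be worked out.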

### Capturing and non-capturing groups

- Surround part of a regex with `(` and `)` to define a **capture group** within a pattern. Capture groups are useful for extracting relevant parts of a string.

In [43]:
re.findall(r'\w+@(\w+)\.edu', 
           'my old email was kc@ucsd.edu, my new email is kc@dartmouth.edu')

['ucsd', 'dartmouth']

- Notice what happens if we remove the `(` and `)`!

In [44]:
re.findall(r'\w+@\w+\.edu', 
           'my old email was kc@ucsd.edu, my new email is kc@dartmouth.edu')

['kc@ucsd.edu', 'kc@dartmouth.edu']

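- Earlier, we also saw that parentheses can be used to group parts of a regex together. By default, every pair of parentheses is a capturing group, so when a pattern contains more than one, `re.findall` returns a tuple of all captured groups for each match.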

In [35]:
# A regex that matches strings with two of the same vowel followed by 3 digits.
# We only want to capture the digits, but...
re.findall(r'(aa|ee|ii|oo|uu)(\d{3})', 'eeoo124')

[('oo', '124')]

- To specify that we **don't** want to capture a particular group, use `?:` inside the parentheses at the start.<br><small>`?:` specifies a **non-capturing group**.</small>

In [None]:
re.findall(r'(?:aa|ee|ii|oo|uu)(\d{3})', 'eeoo124')
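
To summarize, the shape of what `re.findall` returns depends on how many capturing groups the pattern has. The sketch below runs three variants of the vowel pattern on the same string; the variable names are only for illustration.

```python
import re

text = 'eeoo124'

# No capturing groups: each match is the full matched string.
no_groups = re.findall(r'(?:aa|ee|ii|oo|uu)\d{3}', text)

# One capturing group: each match is just that group's text.
one_group = re.findall(r'(?:aa|ee|ii|oo|uu)(\d{3})', text)

# Two capturing groups: each match is a tuple with one entry per group.
two_groups = re.findall(r'(aa|ee|ii|oo|uu)(\d{3})', text)

print(no_groups)
print(one_group)
print(two_groups)
```

In all three cases the regex matches the same substring, `'oo124'`; only the reported portion changes.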

### Example: Extracting hashtags

- The dataset `'public_data/ira.csv'` contains tweets tagged by Twitter as likely being posted by the [Internet Research Agency](https://en.wikipedia.org/wiki/Internet_Research_Agency), the tweet factory alleged to have attempted to influence US political elections.<br><small>For more context, read [this Wikipedia article](https://en.wikipedia.org/wiki/Russian_interference_in_the_2016_United_States_elections).</small>

In [13]:
tweets = pd.read_csv('../../public_data/ira.csv', names=['id', 'user', 'time', 'text'])
tweets.head()

| | id | user | time | text |
|---|---|---|---|---|
| 0 | 3906258 | ea85ac8be1e8ab479064ca4c0fe3ac6587f76b1ef97452... | 2016-11-16 09:04 | The Best Exercise To Lose Belly Fat In 2 weeks... |
| 1 | 1051443 | 8e58ab0f46d273103d9e71aa92cdaffb6e330ec7d15ae5... | 2016-12-24 04:31 | RT @Philanthropy: Dozens of ‘hate groups’ have... |
| 2 | 2823399 | Room Of Rumor | 2016-08-18 20:26 | Artificial intelligence can find, map poverty,... |
| 3 | 272878 | San Francisco Daily | 2016-03-18 19:28 | Uber balks at rules proposed by world’s busies... |
| 4 | 7697802 | 41bb9ae5991f53996752a0ab8dd36b543821abca8d5aed... | 2016-07-30 15:44 | RT @dirtroaddiva1: #IHatePokemonGoBecause he ... |


In [14]:
tweets.shape

(90000, 4)

- **Question**: What are the most common hashtags among all 90,000 tweets?

### Extracting hashtags

- Most Series `.str` operations support regular expressions.<br>We can use `re.findall` to find all of the hashtags in a particular string.

In [36]:
example_tweet = tweets['text'].iloc[0]
example_tweet

'The Best Exercise To Lose Belly Fat In 2 weeks  https://t.co/oHFToG7rh6 #Exercise #LoseBellyFat #CatTV #TeenWolf… https://t.co/b4pr9gEx38'

In [37]:
re.findall(r'#(\w+)', example_tweet) 

['Exercise', 'LoseBellyFat', 'CatTV', 'TeenWolf']

In [38]:
re.findall(r'#(\w+)', 'hey there, no hashtags here') 

[]

- We can use the Series `str.findall` method, with the regular expression above, to extract hashtags out of each tweet in `tweets['text']`.

In [39]:
tags = tweets['text'].str.findall(r'#(\w+)') 
tags.head()

0    [Exercise, LoseBellyFat, CatTV, TeenWolf]
1                                           []
2                                       [tech]
3                                       [news]
4       [IHatePokemonGoBecause, PokesAreJokes]
Name: text, dtype: object
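
To check what `str.findall` does without loading the full dataset, here is a minimal sketch on a tiny Series. The tweets in `sample` are made up for illustration and are not rows from `ira.csv`.

```python
import re
import pandas as pd

# Hypothetical tweets, invented for this example.
sample = pd.Series([
    'Loving the weather #sunny #weekend',
    'no hashtags here',
    'Breaking: #news',
])

# str.findall applies the pattern to each string and returns a list per row.
tags_sample = sample.str.findall(r'#(\w+)')

# Applying re.findall to each element by hand should give the same lists.
tags_manual = sample.apply(lambda s: re.findall(r'#(\w+)', s))

print(tags_sample.tolist())
print(tags_manual.tolist())
```

A tweet with no hashtags produces an empty list rather than a missing value, which matters for the counting step that follows.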

- We can use the `explode` method on the above Series to separate each list into individual elements.

In [42]:
(
    tags
    .explode()
    .value_counts()
    .head(15)
    .sort_values()
    .plot(kind='barh', title='Most Common Hashtags in IRA Tweets')
)
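
The `explode` and `value_counts` steps can be traced on a small example. The sketch below assumes a hypothetical Series of hashtag lists, `tags_sample`, standing in for `tags`.

```python
import pandas as pd

# Hypothetical lists of hashtags, one list per tweet.
tags_sample = pd.Series([['news', 'tech'], [], ['news']])

# Each list element gets its own row; the original index label is repeated.
# An empty list becomes a single NaN row.
exploded = tags_sample.explode()

# value_counts ignores NaN by default, so tweets without hashtags drop out.
counts = exploded.value_counts()

print(exploded)
print(counts)
```

Because `value_counts` sorts from most to least common, taking `.head(15)` keeps the top 15 hashtags, and the later `.sort_values()` only reorders them for the horizontal bar chart.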