# Text Data Cleaning | BAIS 6100

**Instructor: Qihang Lin**

## Data Cleaning is Necessary

1. Remove non-informative content or noise. 
2. Extract useful information. (e.g. find all emails or phone numbers).
3. Prepare for counting word frequency.
    * Convert a phrase into a string (e.g. "White House" to "WhiteHouse").
    * Convert a string into a phrase (e.g. "we'll" to "we will").
    * Remove or replace punctuation or numbers.
    * Remove stop words.
    * Stemming or lemmatization.
    * ...
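
As a minimal sketch of the two conversions listed in step 3, plain `str.replace` can merge a phrase into one token and expand a contraction. The sample sentence and the replacement pairs below are illustrative assumptions, not taken from the course data.

```python
# Toy sentence (hypothetical, for illustration only).
text = "We'll visit the White House."

# Lowercase first so each replacement rule only needs one spelling.
cleaned = text.lower()

# Merge a phrase into a single token, then expand a contraction.
cleaned = cleaned.replace("white house", "whitehouse")
cleaned = cleaned.replace("we'll", "we will")

print(cleaned)
```

After these replacements, "whitehouse" is counted as one term instead of the two unrelated terms "white" and "house".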

**NLTK** is a suite of libraries and programs for natural language processing, and it has become one of the most important libraries for text analytics in practice.

We need the **nltk** library and **regular expressions** for advanced text data cleaning and term frequency counting.

If this is the first time you run the code in this notebook, please uncomment and run the following commands to install the necessary packages and NLTK modules.

In [1]:
#!pip3 install --upgrade nltk
#import nltk
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
#nltk.download('wordnet')
#nltk.download('brown')
#nltk.download('treebank')
#nltk.download('stopwords')


[nltk_data] Downloading package punkt to /home/abromeland/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/abromeland/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/abromeland/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package brown to /home/abromeland/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package treebank to
[nltk_data]     /home/abromeland/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/abromeland/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [2]:
import pandas as pd
from collections import Counter              # for word counting
import nltk                                  # for text cleaning
import itertools                             # to flatten a list of lists.

Load the data we need for this lecture:

In [3]:
df = pd.read_csv("classdata/clinton-street-social-club.csv",encoding="latin-1")

## Word Frequency Counts before Cleaning

Suppose we want to see the word frequency counts in the customer reviews but we directly process the raw data without any cleaning.

Take the first review as an example:

In [4]:
df["reviews"][0]

"With its jazzy vibes and chill atmosphere, Clinton Street Social Club itself may just be the speakeasy of Iowa City. It's entrance is through a small door next to Shorts burgers and a barber shop. If you don't know what you're looking for, you may never find it. \r\n\r\nIn all seriousness they mimic New Orleans culture and have the most delicious food and creative alcoholic beverages. Must try are their beignets, shrimp cocktail, house-batterer curds, sweet corn fritters! \r\n\r\nFor alcoholic drinks, I loved their Ramos Gin Fizz the most! I tend to not like anything where I can taste the alcohol and this one has just the right balance. It's has the right amount of fizz with added foam from the egg whites. Very unique compared to the usual bars located downtown. \r\n\r\nI had the opportunity to learn how to tend bar from the all knowing bartender Joy, that girl definitely knows her alcohol. She made us delicious drinks as told us all about its historical discovery."

Let's use the default tokenizer in NLTK to tokenize the first review and see what is returned.

In [5]:
tokens = nltk.word_tokenize(df["reviews"][0])   # Tokenize the first review.
print(tokens)   

['With', 'its', 'jazzy', 'vibes', 'and', 'chill', 'atmosphere', ',', 'Clinton', 'Street', 'Social', 'Club', 'itself', 'may', 'just', 'be', 'the', 'speakeasy', 'of', 'Iowa', 'City', '.', 'It', "'s", 'entrance', 'is', 'through', 'a', 'small', 'door', 'next', 'to', 'Shorts', 'burgers', 'and', 'a', 'barber', 'shop', '.', 'If', 'you', 'do', "n't", 'know', 'what', 'you', "'re", 'looking', 'for', ',', 'you', 'may', 'never', 'find', 'it', '.', 'In', 'all', 'seriousness', 'they', 'mimic', 'New', 'Orleans', 'culture', 'and', 'have', 'the', 'most', 'delicious', 'food', 'and', 'creative', 'alcoholic', 'beverages', '.', 'Must', 'try', 'are', 'their', 'beignets', ',', 'shrimp', 'cocktail', ',', 'house-batterer', 'curds', ',', 'sweet', 'corn', 'fritters', '!', 'For', 'alcoholic', 'drinks', ',', 'I', 'loved', 'their', 'Ramos', 'Gin', 'Fizz', 'the', 'most', '!', 'I', 'tend', 'to', 'not', 'like', 'anything', 'where', 'I', 'can', 'taste', 'the', 'alcohol', 'and', 'this', 'one', 'has', 'just', 'the', 'rig

A few interesting observations:
- `,`, `.` and `!` are treated as tokens, but `'` is not.
- "don't" is split into two tokens, "do" and "n't". Similar splits happen for "isn't", "aren't", and "can't".
- "It's" and "you're" are split into "It", "'s" and "you", "'re".

Use list comprehension to tokenize each review and return the list of tokens from each review as a "list of lists". 

In [6]:
words_all = [nltk.word_tokenize(s) for s in df["reviews"]]

**words_all** is a "list of lists" and, because of that, we cannot count word frequencies directly from words_all.

In [7]:
print(words_all[0:2])

[['With', 'its', 'jazzy', 'vibes', 'and', 'chill', 'atmosphere', ',', 'Clinton', 'Street', 'Social', 'Club', 'itself', 'may', 'just', 'be', 'the', 'speakeasy', 'of', 'Iowa', 'City', '.', 'It', "'s", 'entrance', 'is', 'through', 'a', 'small', 'door', 'next', 'to', 'Shorts', 'burgers', 'and', 'a', 'barber', 'shop', '.', 'If', 'you', 'do', "n't", 'know', 'what', 'you', "'re", 'looking', 'for', ',', 'you', 'may', 'never', 'find', 'it', '.', 'In', 'all', 'seriousness', 'they', 'mimic', 'New', 'Orleans', 'culture', 'and', 'have', 'the', 'most', 'delicious', 'food', 'and', 'creative', 'alcoholic', 'beverages', '.', 'Must', 'try', 'are', 'their', 'beignets', ',', 'shrimp', 'cocktail', ',', 'house-batterer', 'curds', ',', 'sweet', 'corn', 'fritters', '!', 'For', 'alcoholic', 'drinks', ',', 'I', 'loved', 'their', 'Ramos', 'Gin', 'Fizz', 'the', 'most', '!', 'I', 'tend', 'to', 'not', 'like', 'anything', 'where', 'I', 'can', 'taste', 'the', 'alcohol', 'and', 'this', 'one', 'has', 'just', 'the', 'ri

Use **itertools.chain.from_iterable** to "flatten" a list of lists.

In [8]:
# List of words across all reviews
words_all = list(itertools.chain.from_iterable(words_all))

In [9]:
print(words_all[0:200])

['With', 'its', 'jazzy', 'vibes', 'and', 'chill', 'atmosphere', ',', 'Clinton', 'Street', 'Social', 'Club', 'itself', 'may', 'just', 'be', 'the', 'speakeasy', 'of', 'Iowa', 'City', '.', 'It', "'s", 'entrance', 'is', 'through', 'a', 'small', 'door', 'next', 'to', 'Shorts', 'burgers', 'and', 'a', 'barber', 'shop', '.', 'If', 'you', 'do', "n't", 'know', 'what', 'you', "'re", 'looking', 'for', ',', 'you', 'may', 'never', 'find', 'it', '.', 'In', 'all', 'seriousness', 'they', 'mimic', 'New', 'Orleans', 'culture', 'and', 'have', 'the', 'most', 'delicious', 'food', 'and', 'creative', 'alcoholic', 'beverages', '.', 'Must', 'try', 'are', 'their', 'beignets', ',', 'shrimp', 'cocktail', ',', 'house-batterer', 'curds', ',', 'sweet', 'corn', 'fritters', '!', 'For', 'alcoholic', 'drinks', ',', 'I', 'loved', 'their', 'Ramos', 'Gin', 'Fizz', 'the', 'most', '!', 'I', 'tend', 'to', 'not', 'like', 'anything', 'where', 'I', 'can', 'taste', 'the', 'alcohol', 'and', 'this', 'one', 'has', 'just', 'the', 'rig
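
To see on a small scale why the flattening step is needed before counting, here is a sketch with two hypothetical tokenized reviews. The names `nested`, `flat` and `toy_counts` are made up for this illustration.

```python
from collections import Counter
import itertools

# Hypothetical tokenized reviews: one inner list per review.
nested = [["great", "food"], ["great", "bar", "food"]]

# Counter needs hashable items; lists are not hashable, so counting
# the nested structure directly raises a TypeError.
try:
    Counter(nested)
    failed = False
except TypeError:
    failed = True
print(failed)

# Flatten into one list of tokens, then count.
flat = list(itertools.chain.from_iterable(nested))
toy_counts = Counter(flat)
print(flat)
print(toy_counts["great"])
```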

In [10]:
# Create counter and show 20 most frequent tokens.
counts = Counter(words_all)
counts.most_common(20)

[('.', 1022),
 ('the', 741),
 (',', 623),
 ('and', 566),
 ('a', 491),
 ('I', 370),
 ('to', 334),
 ('of', 285),
 ('was', 233),
 ('is', 217),
 ('in', 197),
 ('The', 190),
 ('for', 186),
 ('!', 162),
 ('it', 160),
 ('that', 141),
 ('with', 138),
 ('but', 125),
 ('had', 121),
 ('have', 117)]

**counts.most_common()** returns a list of **tuple**s. A tuple, for example ('.', 1022), is written by enclosing its items in parentheses () instead of square brackets.
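
A small sketch of working with these tuples, using a made-up token list rather than the review data. Each tuple can be unpacked into its term and its frequency.

```python
from collections import Counter

# Hypothetical tokens, for illustration only.
toy_counts = Counter(["food", "great", "food", "bar", "food", "great"])

top = toy_counts.most_common(2)   # list of (term, frequency) tuples
print(top)

# Tuple unpacking: assign both items of the first tuple at once.
first_term, first_freq = top[0]

# The same unpacking works inside a loop.
for term, freq in top:
    print(term, freq)
```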

It will be more intuitive to convert tuples into a data frame. This can be done as follows:

In [11]:
dffreq = pd.DataFrame(counts.most_common(), columns=['Term', 'Frequency'])
dffreq.head(10)

| | Term | Frequency |
|--:|:--|--:|
| 0 | . | 1022 |
| 1 | the | 741 |
| 2 | , | 623 |
| 3 | and | 566 |
| 4 | a | 491 |
| 5 | I | 370 |
| 6 | to | 334 |
| 7 | of | 285 |
| 8 | was | 233 |
| 9 | is | 217 |


Any concerns about counting word frequencies this way? Should we count punctuation marks and stop words?

## Word Frequency Counts after Cleaning

The following cleaning steps make the frequency counts more reasonable. Depending on the use case, some steps can be modified or skipped. 

### 1. Turn letters to lower cases.

In [12]:
df["reviews_new"]=df["reviews"].str.lower()
df["reviews_new"][0]

"with its jazzy vibes and chill atmosphere, clinton street social club itself may just be the speakeasy of iowa city. it's entrance is through a small door next to shorts burgers and a barber shop. if you don't know what you're looking for, you may never find it. \r\n\r\nin all seriousness they mimic new orleans culture and have the most delicious food and creative alcoholic beverages. must try are their beignets, shrimp cocktail, house-batterer curds, sweet corn fritters! \r\n\r\nfor alcoholic drinks, i loved their ramos gin fizz the most! i tend to not like anything where i can taste the alcohol and this one has just the right balance. it's has the right amount of fizz with added foam from the egg whites. very unique compared to the usual bars located downtown. \r\n\r\ni had the opportunity to learn how to tend bar from the all knowing bartender joy, that girl definitely knows her alcohol. she made us delicious drinks as told us all about its historical discovery."

### 2. Replace Words if Needed

In [13]:
df["reviews_new"]=df["reviews_new"].str.replace("iowa city","iowacity")
df["reviews_new"]=df["reviews_new"].str.replace("can't ","can not ")
df["reviews_new"]=df["reviews_new"].str.replace("n't "," not ")
df["reviews_new"]=df["reviews_new"].str.replace("'re "," are ")
df["reviews_new"]=df["reviews_new"].str.replace("'ve "," have ")
df["reviews_new"]=df["reviews_new"].str.replace("'s "," is ")
df["reviews_new"]=df["reviews_new"].str.replace("'m "," am ")
df["reviews_new"]=df["reviews_new"].str.replace("'ll "," will ")
df["reviews_new"]=df["reviews_new"].str.replace("'d "," would ")
df["reviews_new"][0]

'with its jazzy vibes and chill atmosphere, clinton street social club itself may just be the speakeasy of iowacity. it is entrance is through a small door next to shorts burgers and a barber shop. if you do not know what you are looking for, you may never find it. \r\n\r\nin all seriousness they mimic new orleans culture and have the most delicious food and creative alcoholic beverages. must try are their beignets, shrimp cocktail, house-batterer curds, sweet corn fritters! \r\n\r\nfor alcoholic drinks, i loved their ramos gin fizz the most! i tend to not like anything where i can taste the alcohol and this one has just the right balance. it is has the right amount of fizz with added foam from the egg whites. very unique compared to the usual bars located downtown. \r\n\r\ni had the opportunity to learn how to tend bar from the all knowing bartender joy, that girl definitely knows her alcohol. she made us delicious drinks as told us all about its historical discovery.'

### 3. Remove all stop words.

Let's take a look at which words are considered as stop words by NLTK.

In [14]:
global_stopwords = nltk.corpus.stopwords.words("english") 
print(global_stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [15]:
words_all = [nltk.word_tokenize(s) for s in df["reviews_new"]]
words_all = list(itertools.chain.from_iterable(words_all))
words_all = [s for s in words_all if s not in global_stopwords]
counts = Counter(words_all)
dffreq = pd.DataFrame(counts.most_common(), columns=['Term', 'Frequency'])
dffreq.head(10)

| | Term | Frequency |
|--:|:--|--:|
| 0 | . | 1017 |
| 1 | , | 623 |
| 2 | ! | 162 |
| 3 | great | 129 |
| 4 | food | 106 |
| 5 | place | 102 |
| 6 | good | 86 |
| 7 | drinks | 65 |
| 8 | menu | 62 |
| 9 | one | 61 |


Here, we use the if clause "**if s not in global_stopwords**" in the list comprehension so that only the strings not in **global_stopwords** are added to the output list. 

The condition in the if statement can be modified to handle other requirements. See the example in the next step below.
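
As a sketch of how the condition can be adapted, the filter below combines two requirements in one `if` clause. The token list and the tiny stop word set are hypothetical, and `str.isalpha` is just one possible extra condition.

```python
# Hypothetical tokens and a tiny stop word set, for illustration only.
tokens = ["the", "food", "was", "great", "!", "2", "cocktails"]
stop_words = {"the", "was"}   # a set makes membership tests fast

# Keep a token only if it is not a stop word AND is purely alphabetic.
kept = [t for t in tokens if t not in stop_words and t.isalpha()]
print(kept)
```

Using a `set` instead of a list for the stop words is a design choice assumed here: the result is the same, but each `not in` check is faster on large token lists.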

### 4. Remove all short tokens

We will remove tokens shorter than three characters. This removes the punctuation tokens seen above, although a punctuation token of three or more characters (such as "...") would survive.

In [16]:
words_all = [s for s in words_all if len(s)>2]
counts = Counter(words_all)
dffreq = pd.DataFrame(counts.most_common(), columns=['Term', 'Frequency'])
dffreq.head(10)

| | Term | Frequency |
|--:|:--|--:|
| 0 | great | 129 |
| 1 | food | 106 |
| 2 | place | 102 |
| 3 | good | 86 |
| 4 | drinks | 65 |
| 5 | menu | 62 |
| 6 | one | 61 |
| 7 | cheese | 60 |
| 8 | like | 58 |
| 9 | bar | 58 |


### 5. Stemming 

Lemmatization requires POS tagging to achieve its best performance. In this lecture, we will focus on stemming. We need to first create a stemmer, e.g. SnowballStemmer, for English.

In [17]:
stemmer = nltk.stem.SnowballStemmer("english")
#stemmer = nltk.stem.PorterStemmer() #is less aggressive
#stemmer = nltk.stem.LancasterStemmer() #is more aggressive

Let's see a few examples.

In [18]:
stemmer.stem("nationality")

'nation'

In [19]:
stemmer.stem("businesses")

'busi'

In [20]:
stemmer.stem("busy")

'busi'

In [21]:
stemmer.stem("buses")

'buse'

In [22]:
stemmer.stem("city")

'citi'

In [23]:
stemmer.stem("drinks")

'drink'

Some cases cause problems: for example, "businesses" and "busy" are both reduced to "busi". NLP is not perfect. We can live with it but need to be aware of these potential issues.

Apply stemming to each token. 

In [24]:
words_all = [stemmer.stem(s) for s in words_all]
counts = Counter(words_all)
dffreq = pd.DataFrame(counts.most_common(), columns=['Term', 'Frequency'])
dffreq.head(10)

| | Term | Frequency |
|--:|:--|--:|
| 0 | great | 129 |
| 1 | place | 117 |
| 2 | drink | 116 |
| 3 | food | 107 |
| 4 | good | 87 |
| 5 | cocktail | 78 |
| 6 | time | 72 |
| 7 | bar | 70 |
| 8 | order | 65 |
| 9 | one | 64 |


## The Order Matters

Applying the cleaning steps in a different order may change the frequency counts. Adjust the order according to your situation.

* Stop word matching is case sensitive. "There" is not in the stop word list but "there" is.
* "'re" is not a stop word but "are" is.
* "going" is longer than two characters but "go" isn't.
* ...
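
The first bullet can be checked with a short sketch. The three tokens and the two-word stop list are hypothetical; the stop list is lowercase, like the NLTK list printed earlier.

```python
# Hypothetical mini stop list (lowercase) and tokens.
stop_words = {"there", "is"}
tokens = ["There", "is", "music"]

# Order A: remove stop words first, then lowercase.
order_a = [t.lower() for t in tokens if t not in stop_words]

# Order B: lowercase first, then remove stop words.
lowered = [t.lower() for t in tokens]
order_b = [t for t in lowered if t not in stop_words]

print(order_a)   # "There" slipped through the filter before lowercasing
print(order_b)
```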

## Other Steps to Consider
-  Customize the stop word list.
-  Remove all punctuation and all digits.
-  Create **group tokens**
      -  Remove/replace all dollar amounts by a group token like "dollaramount". 
      -  Remove/replace all urls by a group token like "urltoken".
      -  ...
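
A minimal sketch of the URL group token, using the **re** library covered later in this notebook. The sample text and the pattern are assumptions: the pattern only looks for "http://" or "https://" followed by non-space characters, which is a simplification of what real URLs can look like.

```python
import re

# Hypothetical text containing two URLs.
text = "Menu at https://example.com/menu and photos at http://example.org"

# "s?" means the letter "s" is optional; "\S+" is one or more non-space characters.
cleaned = re.sub(r"https?://\S+", "urltoken", text)
print(cleaned)
```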

## Customize the List of Stop Words 

Some words are not in the default stop word list but are still uninteresting for frequency counting, for example, 'also' and 'would'.

You may also want to keep some stop words in the frequency counts, for example, "not" and "can". 

In these situations, it is useful to know how to customize the stop word list.

Remove stop words from the default list and add new stop words:

In [25]:
global_stopwords = nltk.corpus.stopwords.words("english") 
global_stopwords.remove("not")
global_stopwords.remove("can")
global_stopwords = global_stopwords+["also","would"]
print(global_stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'only', 'own', 'same', 'so', 'than', 'too', '

Repeat every step once again using the new stop word list.

In [26]:
words_all = [nltk.word_tokenize(s) for s in df["reviews_new"]]
words_all = list(itertools.chain.from_iterable(words_all))
words_all = [s for s in words_all if s not in global_stopwords]
words_all = [s for s in words_all if len(s)>2]
words_all = [stemmer.stem(s) for s in words_all]
counts = Counter(words_all)
dffreq = pd.DataFrame(counts.most_common(), columns=['Term', 'Frequency'])
dffreq.head(10)

| | Term | Frequency |
|--:|:--|--:|
| 0 | not | 181 |
| 1 | great | 129 |
| 2 | place | 117 |
| 3 | drink | 116 |
| 4 | food | 107 |
| 5 | good | 87 |
| 6 | cocktail | 78 |
| 7 | time | 72 |
| 8 | bar | 70 |
| 9 | order | 65 |


## Process Strings by Regular Expression

We have learned how to use string processing functions (the basic ones and the .str methods) to modify strings. However, the string pattern we search using these functions must be literally matched. 

What if the pattern we want to search for is more general? For example, how can we remove all punctuation or all digits in a string?

We can use the **re** library and **regular expressions** to describe a string pattern in a more general and flexible way.

Regular expressions are available in almost all programming languages. They are a sophisticated technique, and becoming a proficient user takes a lot of practice. We will only cover some basic expressions. If you are interested in studying regular expressions (REs) in more depth, read https://www.tutorialspoint.com/python/python_reg_expressions.htm and http://webagility.com/posts/the-basics-of-regex-explained

We will first study REs representing character classes and then REs representing string patterns.

## Regular Expression for Classes of Characters

In [27]:
import re       

A few commonly used examples that represent a class of characters:

| Regular expression | Meaning |
|-:|:-|
| `\d` | Any digit |
| `\w` | Any alphanumeric character and the underscore "_" |
| `\s` | Whitespace, tab, or newline |
| **Their complements** | |
| `\D` | Any character not in `\d` |
| `\W` | Any character not in `\w` |
| `\S` | Any character not in `\s` |

For historical reasons, underscore "_" is used in **identifiers_like_this** in many programming languages, which makes it special.
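
The classes in the table can be explored with `re.findall`, which returns every match as a list. The sample string below is a made-up example; `\u00e9` is the letter "é", written as an escape so the example does not depend on file encoding.

```python
import re

# Hypothetical sample: letters, underscore, digits, a space, "é", a tab, "!".
sample = "Room_42 caf\u00e9\t!"

digits = re.findall(r"\d", sample)       # digits only
word_chars = re.findall(r"\w", sample)   # letters (including "é"), digits, "_"
spaces = re.findall(r"\s", sample)       # the space and the tab
non_word = re.findall(r"\W", sample)     # everything that "\w" does not match

print(digits)
print("".join(word_chars))
print(spaces)
print(non_word)
```

Note that `\w` matches the accented "é" as well as the underscore, while `\W` picks up the whitespace and the "!".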

We use the following string as an example.

In [28]:
mystr = 'John said: "He was born 1990. Here\'s his the résumé." ___😂'
print(mystr)

John said: "He was born 1990. Here's his the résumé." ___😂


In [30]:
re.sub('\d', '', mystr)   # Remove all digits (replace by an empty string).

'John said: "He was born . Here\'s his the résumé." ___😂'

## Square Brackets and Ranges

In RE, "[ ]" represents "**anything in the brackets**" and "-" represents "**from-to**" :

| Regular expression | Meaning |
|-:|:-|
| `[\w\s]` | Anything represented by "\w" or "\s" |
| `[a12]` | Match "a", "1" or "2". You can add as many as you want. |
| `[0-9]` | Any digit (just like "\d") |
| `[a-z]` | Any English letter in lower case |
| `[A-Z]` | Any English letter in upper case |
| `[a-zA-Z0-9]` | Any digit or any English letter |

In [31]:
re.sub('[a-zA-Z0-9]', ' ', mystr) # Replace every digit and every English letter with a space.

'         : "                .     \'           é   é." ___😂'

In [32]:
re.sub('[\w\s]', '', mystr) # Remove every whitespace character, digit, letter, and underscore.

':".\'."😂'

### Not [^]

In RE, "[^...]" represents "**anything not in the brackets**":

| Regular expression | Meaning |
|-:|:-|
| `[^a12]` | Match any character except "a", "1" and "2". |
| `[^a-zA-Z0-9]` | Anything non-digit and non-English-letter |
| `[^\w\s]` | Anything not in "\w" and "\s" |

In [33]:
re.sub('[^a-zA-Z0-9]', ' ', mystr) #Replace anything non-digit non-english-letter by a whitespace.

'John said   He was born 1990  Here s his the r sum        '

In [34]:
re.sub('[^\w\s]', '', mystr)  #Remove anything not in "\w" and "\s"

'John said He was born 1990 Heres his the résumé ___'

## Disjunction of Search Conditions

In RE, '|' represents "OR". We can use the expression **"condition1|condition2|...|condition10"** to match a string that satisfies at least one of the conditions 1 to 10. 

In [36]:
re.sub('[^\w\s]|_|\d', '', mystr)   #Remove anything that matches '[^\w\s]' or "_" or "\d".

'John said He was born  Heres his the résumé '

## Other RE Functions 

In [37]:
re.search('\d', mystr)       #Search for a digit and return a "match object". 

<re.Match object; span=(24, 25), match='1'>

In [38]:
# Use bool to convert the search result to True or False.
bool(re.search('\d', mystr)) 

True

In [39]:
re.findall('\d', mystr)

['1', '9', '9', '0']

See (https://docs.python.org/3/library/re.html) for a complete list of functions from **re**.

Just like string methods, **re** methods will not change the original string unless you overwrite it. 

In [38]:
mystr=re.sub('[^\w\s]|_|\d', '', mystr)  
mystr

'John said He was born  Heres his the résumé '

Don't confuse an escape sequence with a regular expression. Both use '\\' in their syntax. An escape sequence (like "\\n") stands for one special character, whereas a regular expression (like "\d") represents a class of characters.
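
The difference can be made visible with Python's raw strings (the `r"..."` prefix), which switch off escape-sequence processing. This is a sketch rather than part of the original lecture code; writing patterns as raw strings is the convention recommended in the Python **re** documentation, and newer Python versions warn about unrecognized escapes such as "\d" in ordinary strings.

```python
import re

newline = "\n"    # escape sequence: ONE character (a line break)
raw = r"\n"       # raw string: TWO characters, a backslash and the letter n
print(len(newline), len(raw))

# r"\d" reaches the re module unchanged, as a backslash followed by "d",
# and re interprets it as "any digit". The real line break is kept.
no_digits = re.sub(r"\d", "", "Room 42\n")
print(repr(no_digits))
```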

## Regular Expression Using Meta Characters

**re** allows using the **meta characters** `. ^ $ * + ? { } [ ] \ | ( )` to represent more flexible and general string patterns.

We use the following sentence as an example.

In [40]:
mystr = "Want $1,000,000.00? Contact John at john2021@gmail.com or 319-335-0988"

## Search Pattern at the Beginning or the End

Let **re** represent a regular expression such as "\d":

| Regular expression | Meaning |
|-:|:-|
| `^re` | Matches **re** at the beginning of the string. |
| `re$` | Matches **re** at the end of the string. |

In [41]:
bool(re.search('^[a-z]', mystr)) # Check if the string begins with a lowercase letter

False

In [42]:
bool(re.search('\d$', mystr)) #Check if the string ends with a digit

True

## Quantifiers

| Regular expression | Meaning |
|-:|:-|
| `re*` | Matches 0 or more continuous occurrences of **re**. |
| `re+` | Matches 1 or more continuous occurrences of **re**. |
| `re{2,6}` | Matches 2, 3, 4, 5 or 6 continuous occurrences of **re**. |

In [42]:
re.findall('\d+', mystr)    # Find all runs of consecutive digits

['1', '000', '000', '00', '2021', '319', '335', '0988']

In [43]:
re.findall('[A-Z][a-z]*', mystr)  #Find strings starting with A-Z followed by any number of a-z 

['Want', 'Contact', 'John']

In [44]:
re.findall('\$\d+[\d,\.]*', mystr)   #Find all dollar amounts.

['$1,000,000.00']

Be careful! Here, we have to write "\\$" and "\." to represent the dollar sign and the decimal point, because "$" and "." are meta characters and would otherwise not be matched literally.
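
A short sketch of what goes wrong without the backslash, on a made-up string. It also shows `re.escape`, a function from the **re** library that adds the backslashes for you when the whole search text should be taken literally.

```python
import re

# Hypothetical text, for illustration only.
text = "Prices: 1.5, 1x5 and $15"

# Unescaped "." matches ANY character, so "1x5" is found too.
loose = re.findall(r"1.5", text)

# Escaped "\." matches only a literal period.
strict = re.findall(r"1\.5", text)

# re.escape builds a pattern that matches the given text literally.
dollars = re.findall(re.escape("$15"), text)

print(loose)
print(strict)
print(dollars)
```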

In [43]:
re.findall('\d{3,3}-\d{3,3}-\d{4,4}', mystr)  # Find all phone numbers in the format xxx-xxx-xxxx

['319-335-0988']

In [44]:
re.findall('\S+@\S+\.\S+', mystr)  # Find all email addresses.

['john2021@gmail.com']
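
One caveat worth knowing: because "\S" matches any non-space character, the pattern above also swallows punctuation that directly follows an address. The sketch below shows this on a made-up sentence and compares it with a somewhat tighter pattern. The tighter pattern is an assumption for illustration, not a complete email validator, and it uses `(?:...)`, a grouping construct not covered in this lecture.

```python
import re

# Hypothetical sentence where a comma follows the address.
text = "Write to john2021@gmail.com, or call us."

# Pattern from the cell above: the trailing comma is captured too.
loose = re.findall(r"\S+@\S+\.\S+", text)

# Tighter sketch: restrict each part to word characters and a few symbols.
tight = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)

print(loose)
print(tight)
```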

## Create Group Tokens

In [45]:
mystr = re.sub('\$\d+[\d,\.]*', "moneytoken", mystr)
mystr = re.sub('\d{3,3}-\d{3,3}-\d{4,4}', "phonenumbertoken", mystr)
mystr = re.sub('\S+@\S+\.\S+', "emailtoken", mystr)
mystr

'Want moneytoken? Contact John at emailtoken or phonenumbertoken'

With these replacements, all dollar amounts (and likewise all phone numbers and all emails) are counted as the same token. This is helpful when analyzing text that contains many numbers, for example, economic news: you do not want to ignore the numbers altogether, but you also do not want every distinct number in an article to become its own unique token. 
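
The effect on frequency counts can be sketched with three made-up sentences. The dollar pattern is the one used above; splitting on whitespace with `str.split` is a simplification standing in for a real tokenizer.

```python
import re
from collections import Counter

# Hypothetical mini corpus with three different dollar amounts.
docs = ["Cocktails are $10 each", "Burgers cost $12.50", "Tip was $3"]
pattern = r"\$\d+[\d,\.]*"   # same dollar-amount pattern as above

# Without group tokens: every amount is its own term.
raw_counts = Counter(w for d in docs for w in d.split())

# With group tokens: all amounts collapse into "moneytoken".
grouped = [re.sub(pattern, "moneytoken", d) for d in docs]
grouped_counts = Counter(w for d in grouped for w in d.split())

print(raw_counts["$10"], raw_counts["$12.50"], raw_counts["$3"])
print(grouped_counts["moneytoken"])
```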

## Apply Regular Expression to Strings in a Column or List

We can use list comprehension or the ".str" methods to apply a regular expression to each string in a list or column. The ".str" methods can interpret regular expressions just like the "re" functions, but they only work when the strings are in a data frame column.
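
For strings stored in a plain Python list, list comprehension is the way to go. The sketch below uses a made-up list of three short reviews; `re.compile` is optional and is used here only so that the pattern is written once and reused.

```python
import re

# Hypothetical reviews in a plain list (no data frame involved).
reviews = ["Great food!!!", "Drinks: $10-$12.", "Loved it :)"]

# Compile the pattern once, then apply it to every string.
punct = re.compile(r"[^\w\s]")
cleaned = [punct.sub(" ", s) for s in reviews]

# One True/False per review: does it contain a digit?
has_digit = [bool(re.search(r"\d", s)) for s in reviews]

print(cleaned)
print(has_digit)
```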

Example 1: Apply a regular expression to find the reviews that mention a dollar amount.

In [46]:
#Using list comprehension
rowselected = [bool(re.search("\$\d+[\d,\.]*",s)) for s in df["reviews"]]
dfmoney = df[rowselected].copy()
dfmoney.reset_index(inplace=True, drop=True)       
dfmoney["reviews"][0]   #Just to check the first review we found

"There is no other place in town like CSSC. If you are looking for the speakeasy vibe and views of the ped mall this is the place to be. Grab a Silk Road and share the fondue for a perfect date night with ur SO or BFF.\r\n\r\nDrinks: Silk Road and Pisco sour if you love egg whites. I love egg white in my cocktails because the foam texture which reminds me of a latte. Annabelle lee is refreshing and they have a selection of whisky / bourbon as well if that's your drink of choice.  Cocktails are $10-$12. Also a handful of beers on tap. \r\n\r\nFood: LOVE the fondue. Melted with apples and caramelizad onions. The poutine is also very popular and I've enjoyed their burger. \r\n\r\nThey have a room for darts and pool great for larger groups."

In [47]:
#Using .str method
rowselected = df["reviews"].str.contains("\$\d+[\d,\.]*", regex=True)
dfmoney = df[rowselected].copy()
dfmoney.reset_index(inplace=True, drop=True)       
dfmoney["reviews"][0]   #Just to check the first review we found

"There is no other place in town like CSSC. If you are looking for the speakeasy vibe and views of the ped mall this is the place to be. Grab a Silk Road and share the fondue for a perfect date night with ur SO or BFF.\r\n\r\nDrinks: Silk Road and Pisco sour if you love egg whites. I love egg white in my cocktails because the foam texture which reminds me of a latte. Annabelle lee is refreshing and they have a selection of whisky / bourbon as well if that's your drink of choice.  Cocktails are $10-$12. Also a handful of beers on tap. \r\n\r\nFood: LOVE the fondue. Melted with apples and caramelizad onions. The poutine is also very popular and I've enjoyed their burger. \r\n\r\nThey have a room for darts and pool great for larger groups."

Here, in .str.contains, we set **regex=True** to indicate that we are searching with a regular expression rather than for a literal match. 

Example 2: Use a regular expression to replace all special characters except "_" with a space in all reviews.

In [48]:
#Using list comprehension
df["reviews_new"] = [re.sub("[^\w\s]", ' ' ,s) for s in df["reviews_new"]] 
df.head()

Unnamed: 0,reviews,ratings,reviews_new
0,"With its jazzy vibes and chill atmosphere, Cli...",5,with its jazzy vibes and chill atmosphere cli...
1,This was an exceptional surprise in Iowa city!...,5,this was an exceptional surprise in iowacity ...
2,There is no other place in town like CSSC. If ...,5,there is no other place in town like cssc if ...
3,Tucked away through a narrow staircase like a ...,5,tucked away through a narrow staircase like a ...
4,Love. Love. Love. If you're older than the col...,5,love love love if you are older than the co...


In [49]:
#Using .str method
df["reviews_new"] = df["reviews_new"].str.replace("[^\w\s]", ' ', regex=True)
df.head()

Unnamed: 0,reviews,ratings,reviews_new
0,"With its jazzy vibes and chill atmosphere, Cli...",5,with its jazzy vibes and chill atmosphere cli...
1,This was an exceptional surprise in Iowa city!...,5,this was an exceptional surprise in iowacity ...
2,There is no other place in town like CSSC. If ...,5,there is no other place in town like cssc if ...
3,Tucked away through a narrow staircase like a ...,5,tucked away through a narrow staircase like a ...
4,Love. Love. Love. If you're older than the col...,5,love love love if you are older than the co...


Let's compare the new column with the old one:

In [50]:
print(df["reviews_new"][0])

with its jazzy vibes and chill atmosphere  clinton street social club itself may just be the speakeasy of iowacity  it is entrance is through a small door next to shorts burgers and a barber shop  if you do not know what you are looking for  you may never find it  

in all seriousness they mimic new orleans culture and have the most delicious food and creative alcoholic beverages  must try are their beignets  shrimp cocktail  house batterer curds  sweet corn fritters  

for alcoholic drinks  i loved their ramos gin fizz the most  i tend to not like anything where i can taste the alcohol and this one has just the right balance  it is has the right amount of fizz with added foam from the egg whites  very unique compared to the usual bars located downtown  

i had the opportunity to learn how to tend bar from the all knowing bartender joy  that girl definitely knows her alcohol  she made us delicious drinks as told us all about its historical discovery 


In [51]:
print(df["reviews"][0])

With its jazzy vibes and chill atmosphere, Clinton Street Social Club itself may just be the speakeasy of Iowa City. It's entrance is through a small door next to Shorts burgers and a barber shop. If you don't know what you're looking for, you may never find it. 

In all seriousness they mimic New Orleans culture and have the most delicious food and creative alcoholic beverages. Must try are their beignets, shrimp cocktail, house-batterer curds, sweet corn fritters! 

For alcoholic drinks, I loved their Ramos Gin Fizz the most! I tend to not like anything where I can taste the alcohol and this one has just the right balance. It's has the right amount of fizz with added foam from the egg whites. Very unique compared to the usual bars located downtown. 

I had the opportunity to learn how to tend bar from the all knowing bartender Joy, that girl definitely knows her alcohol. She made us delicious drinks as told us all about its historical discovery.
