# Chapter 6 - String Manipulation and Regular Expressions

Main reference:<br>
- Chapters 2 and 7, *Python for Data Analysis*, by Wes McKinney

Last edited: 05/16/2021

## 1. Python String Literal: F, R Strings

A string literal is a sequence of zero or more characters enclosed in matching single (`'`) or double (`"`) quotation marks.

Python supports multiple ways to format text strings.

In [1]:
# %load_ext nb_black

a = "string???"

a = "string!!!"

### 1.1 Triple-quoted. 
This literal syntax form begins and ends with three quotes. Newlines are left as is. In triple-quoted syntax, we do not need to escape quotes.


A triple-quoted string is the convenient choice when pasting several paragraphs of text, because the line breaks can stay exactly as they are.

In [2]:
# Use a triple-quoted string.
MPintro = """
The term "monetary policy" refers to the actions undertaken by a central bank.

The Federal Reserve controls the three tools of monetary policy--open market operations, the discount rate, and reserve requirements. 
"""

print(MPintro)


The term "monetary policy" refers to the actions undertaken by a central bank.

The Federal Reserve controls the three tools of monetary policy--open market operations, the discount rate, and reserve requirements. 



In [3]:
MPintro

'\nThe term "monetary policy" refers to the actions undertaken by a central bank.\n\nThe Federal Reserve controls the three tools of monetary policy--open market operations, the discount rate, and reserve requirements. \n'

### `string` basics

In [4]:
# Count the newline characters with the string's count method
MPintro.count("\n")
# MPintro.count("monetary")

4

Python strings are **immutable**; you cannot modify a string:

In [5]:
a = "this is a string"

a[10]
# a[10] = 'f'

's'
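
As a minimal sketch of what the commented-out assignment `a[10] = 'f'` would do if it were run: item assignment on a `str` raises a `TypeError`, and the original string is left untouched. The names `error_message` and `patched` are illustrative only.

```python
a = "this is a string"

try:
    a[10] = "f"  # strings do not support item assignment
except TypeError as err:
    error_message = str(err)

# Build a new string from slices instead of changing the old one in place
patched = a[:10] + "f" + a[11:]

print(error_message)
print(patched)
print(a)  # the original string is unchanged
```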

But we can use `replace` to change the text and assign to a new variable, `b`.

In [6]:
b = a.replace("string", "longer string")

b
# print(a)

'this is a longer string'

Many Python objects can be converted to a string with the `str` function:

In [7]:
a = 5.6
b = str(a)

b

'5.6'

Strings are a sequence of Unicode characters and therefore can be treated like other sequences, such as lists and tuples:

In [8]:
sent = "Explicit is better than implicit."

sent[:10]

'Explicit i'
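
A few more sequence operations that work on strings the same way they work on lists and tuples, shown as a small sketch. The variable names are illustrative only.

```python
sent = "Explicit is better than implicit."

first_word = sent[:8]          # slicing from the start
last_word = sent[-9:]          # negative indices count from the end
n_chars = len(sent)            # length in characters
has_better = "better" in sent  # membership test on substrings

# Iterating over a string yields one character at a time
uppercase = [ch for ch in sent if ch.isupper()]

print(first_word, last_word, n_chars, has_better, uppercase)
```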

In [9]:
sent2 = list(sent)

sent2

['E',
 'x',
 'p',
 'l',
 'i',
 'c',
 'i',
 't',
 ' ',
 'i',
 's',
 ' ',
 'b',
 'e',
 't',
 't',
 'e',
 'r',
 ' ',
 't',
 'h',
 'a',
 'n',
 ' ',
 'i',
 'm',
 'p',
 'l',
 'i',
 'c',
 'i',
 't',
 '.']

The `.join()` method is a string method that concatenates the elements of a list, placing the string it is called on between them.

In [10]:
"".join(sent2)

'Explicit is better than implicit.'
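
The string that `.join()` is called on acts as the separator. The sketch below uses a hypothetical `words` list to show a few separators, and the conversion needed when the elements are not already strings.

```python
words = ["Explicit", "is", "better", "than", "implicit."]

spaced = " ".join(words)   # single space between elements
dashed = "-".join(words)   # any string can be the separator

# join only accepts strings, so convert other types first
numbers = [1, 2, 3]
csv_row = ",".join(str(n) for n in numbers)

print(spaced)
print(dashed)
print(csv_row)
```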

The backslash character `\` is an escape character, meaning that it is used to specify special characters like newline `\n` or Unicode characters. To write a string literal that contains backslashes, you need to escape each one with another backslash:

In [11]:
s = "12\\34"

print(s)

12\34


Adding two strings together concatenates them and produces a new string:

In [12]:
a = "this is the first half "
b = "and this is the second half"

a + b

'this is the first half and this is the second half'

### 1.2 Format literal. (f-string) 
We can prefix a string with the "f" character to specify a format literal. Each expression inside curly brackets is evaluated in the current scope and its value is inserted into the string; an optional format specification such as `:.3f` can follow the expression.


In [13]:
niter = 1000
consumption = 3.5

# Use format literal.
result = f"After {niter} loops, consumption value is {consumption:.3f}."
print(result)

After 1000 loops, consumption value is 3.500.
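
A short sketch of a few other format specifications that can follow the colon inside the curly brackets. The variable `share` and the `line*` names are made up for illustration.

```python
niter = 1000
consumption = 3.5
share = 0.25

line1 = f"{niter:,}"             # thousands separator
line2 = f"{consumption:>8.2f}"   # right-aligned in 8 characters, 2 decimals
line3 = f"{share:.1%}"           # percentage with 1 decimal
line4 = f"{niter * 2}"           # any expression can go inside the brackets

print(line1)
print(line2)
print(line3)
print(line4)
```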


### 1.3 Raw literals. (r-string)
By prefixing a string literal with an r, we specify a raw string. In a raw string, the backslash character does not specify an escape sequence—it is a regular character.

Raw string literals are ideal for regular expression patterns, because patterns written for the `re` module use the backslash heavily.

In [14]:
# In a raw string "\" characters do not escape.
raw = r"\directory\123"
val = "\directory\123"

print(raw)
print(val)  # In the normal literal, "\123" is read as an octal escape for the character "S"

\directory\123
\directoryS
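
One way to see the difference is to compare lengths. The sketch below (with illustrative variable names) shows that a raw literal keeps the backslash as its own character, and that a raw literal is equivalent to escaping each backslash by hand.

```python
raw = r"\n"     # two characters: a backslash and the letter n
normal = "\n"   # one character: a newline

print(len(raw), len(normal))

# A raw literal equals the same text with every backslash doubled
same = r"\s+" == "\\s+"
print(same)
```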


### Natural Language Toolkit [(NLTK)](https://www.nltk.org/) 

[Installing NLTK](https://www.nltk.org/install.html)
    
    pip install nltk

In [15]:
from nltk.tokenize import sent_tokenize, word_tokenize

sent_tokenize(MPintro)

['\nThe term "monetary policy" refers to the actions undertaken by a central bank.',
 'The Federal Reserve controls the three tools of monetary policy--open market operations, the discount rate, and reserve requirements.']

In [16]:
word_tokenize(MPintro)

['The',
 'term',
 '``',
 'monetary',
 'policy',
 "''",
 'refers',
 'to',
 'the',
 'actions',
 'undertaken',
 'by',
 'a',
 'central',
 'bank',
 '.',
 'The',
 'Federal',
 'Reserve',
 'controls',
 'the',
 'three',
 'tools',
 'of',
 'monetary',
 'policy',
 '--',
 'open',
 'market',
 'operations',
 ',',
 'the',
 'discount',
 'rate',
 ',',
 'and',
 'reserve',
 'requirements',
 '.']

In [17]:
from nltk.corpus import stopwords

print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [18]:
import nltk

tokenizer = nltk.RegexpTokenizer(r"\w+")
tokenizer.tokenize(MPintro)

['The',
 'term',
 'monetary',
 'policy',
 'refers',
 'to',
 'the',
 'actions',
 'undertaken',
 'by',
 'a',
 'central',
 'bank',
 'The',
 'Federal',
 'Reserve',
 'controls',
 'the',
 'three',
 'tools',
 'of',
 'monetary',
 'policy',
 'open',
 'market',
 'operations',
 'the',
 'discount',
 'rate',
 'and',
 'reserve',
 'requirements']

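`\w` matches any single word character (a letter, digit, or underscore), and the `+` that follows means "one or more of the preceding token". Together, `\w+` matches each run of word characters, so punctuation and whitespace are dropped.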
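
The same pattern can be tried without NLTK. This is a small sketch using the standard-library `re` module and a made-up `sample` sentence; it is not how `RegexpTokenizer` is implemented.

```python
import re

sample = 'The term "monetary policy" refers to open-market operations.'

# \w+ matches runs of word characters; quotes, hyphens and periods are skipped
words = re.findall(r"\w+", sample)

# Without the +, each match is a single character
single = re.findall(r"\w", "a-b c")

print(words)
print(single)
```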

Textual analysis in economics research:

- [Gentzkow, Matthew, Bryan Kelly, and Matt Taddy. 2019. Text as Data. Journal of Economic Literature, 57 (3): 535-74.](https://www.aeaweb.org/articles?id=10.1257/jel.20181020)
<br>

- [Baker, Scott R., Nicholas Bloom, and Steven J. Davis. "Measuring economic policy uncertainty." The quarterly journal of economics 131, no. 4 (2016): 1593-1636.](https://academic.oup.com/qje/article/131/4/1593/2468873?login=true)<br>
    [Measuring Economic Policy Uncertainty](https://www.policyuncertainty.com/)


## 2. Regular Expressions

Regular expressions provide a flexible way to search and match string patterns in text. You can think of them as a more general (advanced) version of CTRL + F. <br>

Python's built-in `re` module applies regular expressions to strings. Its functions fall into three categories: pattern matching, substitution, and splitting. Naturally, these are all related.


### 2.1 Pattern Split: `re.split`

The regex describing one or more **whitespace** characters is `\s+`.

In [19]:
# Split
import re

text = "foo bar\t baz \tqux"
re.split(r"\s+", text)  # raw string so that \s reaches the regex engine unchanged

['foo', 'bar', 'baz', 'qux']

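The same split can be done in two steps: first compile the pattern with `re.compile`, then call `split` on the compiled object. Compiling once is useful when the same pattern is applied to many strings.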

In [20]:
regex = re.compile(r"\s+")
regex.split(text)

['foo', 'bar', 'baz', 'qux']
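
Two details of splitting that are easy to miss, sketched with the same `\s+` pattern: `maxsplit` limits the number of splits, and leading or trailing whitespace produces empty strings, unlike the plain `str.split()` method. The `padded` string is a made-up example.

```python
import re

regex = re.compile(r"\s+")
text = "foo bar\t baz \tqux"

# Split only at the first run of whitespace
head_tail = regex.split(text, maxsplit=1)

# Leading and trailing whitespace yield empty strings with re.split
padded = "  foo bar "
with_empties = regex.split(padded)
without_empties = padded.split()  # str.split() with no argument drops them

print(head_tail)
print(with_empties)
print(without_empties)
```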

### 2.2 Pattern Matching: `re.findall`

`match` and `search` are closely related to `findall`. While `findall` returns all matches in a string, `search` returns only the first match. More strictly, `match` succeeds only if the pattern matches at the beginning of the string.

In [21]:
text = """Dave dave@google.com
Steve steve.t@gmail.com
Rob rob@gmail.com
Ryan ry-an@yahoo.com
"""

pattern = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

In [22]:
regex.findall(text)

['dave@google.com', 'steve.t@gmail.com', 'rob@gmail.com', 'ry-an@yahoo.com']

In [23]:
regex.search(text)[0]

'dave@google.com'
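
To make the difference between the three methods concrete, here is a small sketch on a shortened copy of the text. `search` returns a match object that carries the position of the match, while `match` returns `None` here because the string begins with a name rather than an email address.

```python
import re

text = "Dave dave@google.com\nRob rob@gmail.com\n"
pattern = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"
regex = re.compile(pattern, flags=re.IGNORECASE)

first = regex.search(text)      # first match anywhere in the string
at_start = regex.match(text)    # only tries at position 0
all_found = regex.findall(text)

print(first.group(), first.span())
print(at_start)
print(all_found)
```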

### 2.3 Pattern Substitution: `re.sub`

`sub` returns a new string in which every occurrence of the pattern is replaced by the given replacement string; the original string is not modified.


In [24]:
print(regex.sub("REDACTED", text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
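
`re.sub` also accepts a `count` argument to limit the number of replacements, and the replacement can be a function that receives each match object. A minimal sketch with made-up input strings:

```python
import re

messy = "foo  bar\t baz"

collapsed = re.sub(r"\s+", " ", messy)            # replace every whitespace run
first_only = re.sub(r"\s+", " ", messy, count=1)  # replace only the first run

# A function as the replacement: double every integer in the string
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "rate 3 and 10")

print(collapsed)
print(repr(first_only))
print(doubled)
```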



### 2.4 Capture groups: put parentheses around the segment

We want to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix.

In [25]:
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve.t', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ry-an', 'yahoo', 'com')]
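
Capture groups are also available on a single match through `groups()`, and they can be referenced in a `sub` replacement string as `\1`, `\2`, and so on. The sketch below uses a shortened copy of the text; the label wording in the replacement is illustrative.

```python
import re

text = "Dave dave@google.com\nRyan ry-an@yahoo.com\n"
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
regex = re.compile(pattern, flags=re.IGNORECASE)

# groups() returns the captured segments of one match as a tuple
m = regex.search(text)
parts = m.groups()

# \1, \2, \3 refer back to the three groups in the replacement
labeled = regex.sub(r"Username: \1, Domain: \2, Suffix: \3", text)

print(parts)
print(labeled)
```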

**To check whether a regular expression works as intended, you can experiment with an online tester first.**<br>

**For example, [regexr.com](https://regexr.com/) or [regex101](https://regex101.com/).**

#### Exercise 1

Write a regular expression that captures every word of the form `econom*` in `text`, where `*` is a wildcard standing for any run of word characters (for example, `economy` or `economic`).

In [26]:
text = """
The Federal Reserve is committed to using its full range of tools to support the U.S. economy in this challenging time, thereby promoting its maximum employment and price stability goals.

The COVID-19 pandemic is causing tremendous human and economic hardship across the United States and around the world. The pace of the recovery in economic activity and employment has moderated in recent months, with weakness concentrated in the sectors most adversely affected by the pandemic. Weaker demand and earlier declines in oil prices have been holding down consumer price inflation. Overall financial conditions remain accommodative, in part reflecting policy measures to support the economy and the flow of credit to U.S. households and businesses.

The path of the economy will depend significantly on the course of the virus, including progress on vaccinations. The ongoing public health crisis continues to weigh on economic activity, employment, and inflation, and poses considerable risks to the economic outlook.
"""

# -----------------------------------------------
# insert your code here
pattern = None
# -----------------------------------------------

# regex = re.compile(pattern, flags=re.IGNORECASE)
# regex.findall(text)

<details><summary>Click here for the solution</summary>

```python
pattern = r"econom(?:\w+)?"
```

</details>

#### Exercise 2

Write a regular expression that captures all of the web URLs ending with ".pdf" in `text`.

In [27]:
text = """
<strong>Statement:</strong><br>
<a href="/monetarypolicy/files/monetary20210127a1.pdf">PDF</a> | <a href="/newsevents/pressreleases/monetary20210127a.htm">HTML</a><br>

<a href="/newsevents/pressreleases/monetary20210127a1.htm">Implementation Note</a>

</div>
<div class="col-xs-12 col-md-4 col-lg-3">
<a href="/monetarypolicy/fomcpresconf20210127.htm">Press Conference</a><br>

<br>
<a href="/newsevents/pressreleases/monetary20210127b.htm">Statement on Longer-Run Goals and Monetary Policy Strategy </a>
</div>
<div class="col-xs-12 col-md-4 col-lg-4 fomc-meeting__minutes">
<strong>Minutes:</strong><br>
<a href="/monetarypolicy/files/fomcminutes20210127.pdf">PDF</a> | <a href="/monetarypolicy/fomcminutes20210127.htm">HTML</a>

<br> (Released February 17, 2021)

"""

# -----------------------------------------------
# insert your code here
pattern = None
# -----------------------------------------------

# regex = re.compile(pattern, flags=re.IGNORECASE)
# regex.findall(text)

<details><summary>Click here for the solution</summary>

```python
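# [^"]+ stops at the closing quote and \. matches a literal dot
pattern = r'href="(/[^"]+\.pdf)"'
# pattern = r'href="(\/.+pdf)"'  # also works here, but the greedy .+ can overrun if one line holds several PDF links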
```

</details>