Regular expressions are a powerful way of building patterns to match text. As powerful as regular expressions are, they can be difficult to learn at first — the syntax can look visually intimidating. As a result, a lot of students end up disliking regular expressions and try to avoid using them, instead opting to write more cumbersome code.

One thing to keep in mind before we start: don't expect to remember all of the regular expression syntax. The most important thing is to understand the core principles, what is possible, and where to look up the details. That way, we can quickly jog our memory whenever we need regular expressions.

Don't be put off if some things don't stick in memory. As long as we can write and understand regular expressions with the help of documentation and/or other reference guides, we have all the skills we need to excel.

We'll learn regular expressions while performing analysis on a dataset of submissions to popular technology site [Hacker News](https://news.ycombinator.com/).

The dataset we will work with is based on a CSV of Hacker News stories from September 2015 to September 2016. The columns in the dataset are explained below:

* id: The unique identifier from Hacker News for the story
* title: The title of the story
* url: The URL that the story links to, if the story has a URL
* num_points: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes
* num_comments: The number of comments that were made on the story
* author: The username of the person who submitted the story
* created_at: The date and time at which the story was submitted

Let's import our Hacker News dataset into R.

`library(readr)
hn  <- read_csv("hacker_news.csv")`

The good news is that we have already used regular expressions. This is because any string can be a regular expression if it is used with the right function.

When working with regular expressions, we use the term pattern to describe a regular expression that we've written. If the pattern is found within the string we're searching, we say that it has matched.

Many built-in R functions, as well as functions from the [`stringr` package](https://stringr.tidyverse.org/), support regular expressions. They are used to:

* Identify and filter strings that match a pattern.
* Identify the start position of matched patterns.
* Replace and split based on matched patterns.

In this file, we use `stringr` package functions. The built-in functions are used in a similar way, as are regular expressions in many other languages (although the syntax may change a bit).
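As a rough illustration of the three uses listed above, the sketch below runs each one with a `stringr` function and with a base R counterpart. The sample string is made up for this example, and the pairing of functions is only one reasonable choice.

```r
library(stringr)

text <- "salt and pepper"  # made-up sample string

# 1. Identify a match (stringr takes the string first, base R the pattern first)
detected_stringr <- str_detect(text, "and")
detected_base <- grepl("and", text)

# 2. Find the start position of the match
start_stringr <- as.integer(str_locate(text, "and")[1, "start"])
start_base <- as.integer(regexpr("and", text))

# 3. Replace and split based on the match
replaced_stringr <- str_replace(text, "and", "or")
replaced_base <- sub("and", "or", text)
pieces_stringr <- str_split(text, " and ")[[1]]
pieces_base <- strsplit(text, " and ")[[1]]

print(detected_stringr)
print(start_stringr)
print(replaced_stringr)
print(pieces_stringr)
```

Note the difference in argument order: `stringr` functions take the string first and the pattern second, while `grepl()`, `regexpr()`, and `sub()` take the pattern first.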

Letters and numbers represent themselves in regular expressions. If we wanted to find the string `"and"` within another string, the regex pattern for that is simply `and`:

![image.png](attachment:image.png)

In R, if we want to find the pattern `and` in the string `hand`, we can use the [`str_detect()` function](https://stringr.tidyverse.org/reference/str_detect.html) from the `stringr` package. This function checks whether a string contains a match for a pattern. It takes two required arguments:

* A string, or a vector of strings, that we want to search.
* The regex pattern.

`library(stringr)`

`m <- str_detect("hand", "and")
print(m)`

`[1] TRUE`

When the input is a single string, `str_detect()` returns a single logical value: `TRUE` if the pattern is found anywhere within the string, and `FALSE` otherwise.

The `str_detect()` function also accepts a vector, which lets us easily check whether our regex matches each string in that vector. In this case, the output of the function is a logical vector of the same length as the input vector.

`string_vector  <- c("Julie's favorite color is green.",
                  "Keli's favorite color is Blue.",
                  "Craig's favorite colors are blue and red.")`

`pattern  <- "Blue"`

`m <- str_detect(string_vector, pattern)
print(m)`

`[1] FALSE  TRUE FALSE`

**Task**

* Extract a vector, `titles`, containing all the titles from the `hn` dataset.
* Create a string — `pattern` — containing a regular expression pattern to match `Amazon`.
* Use the `str_detect()` function to check whether `pattern` matches each title in `titles`. Save the result in the `matches` variable.
* Use the `if_else()` function to create a vector, `hn_matches`, containing `Match` or `No Match` according to the values in `matches`.

**Answer**

`library(stringr)
titles <- hn$title`

`library(dplyr)
pattern <- "Amazon"
matches <- str_detect(titles, pattern)
hn_matches <- if_else(matches, "Match", "No Match")`

Regular expressions also use special characters to search for possible repetition, to represent a set of characters, or to anchor a pattern to a position in a string. The power of regular expressions comes when we use these special characters.

The first of these we'll learn is called a **set**. A set allows us to specify two or more characters that can match in a single character's position.

We define a set by placing the characters we want to match in square brackets `[ ]`:

![image.png](attachment:image.png)

We're going to use this technique to find out how many story titles in our Hacker News dataset **mention** `Amazon`. We'll use a set to check for both `Amazon` with a capital "A" and `amazon` with a lowercase "a".

`pattern <- "[Aa]mazon"
matches <- str_detect(titles, pattern)
amazon_mentions <- sum(matches)`
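To see exactly what the set does and does not cover, here is a small sketch on a few made-up titles (they are not taken from the dataset). The set only affects the single position where it appears, so a fully uppercase spelling is not matched.

```r
library(stringr)

# Hypothetical titles, invented for illustration
sample_titles <- c("Amazon launches new service",
                   "Why I left amazon",
                   "The Amazonian rainforest",
                   "AMAZON earnings")

sample_matches <- str_detect(sample_titles, "[Aa]mazon")
print(sample_matches)
# [1]  TRUE  TRUE  TRUE FALSE

sample_mentions <- sum(sample_matches)  # TRUE counts as 1, FALSE as 0
print(sample_mentions)
# [1] 3
```

The third title matches because `Amazonian` contains `Amazon`, which shows that the pattern is searched for anywhere inside the string.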

The second special character we'll learn is the **alternative** character. It allows us to specify two or more patterns that can match a string.

We define an alternative by using the character `|` between patterns.

For example, if we want to find all the titles in our dataset containing the year `2010`, `2011`, or `2012`, we can use either the pattern `2010|2011|2012` or the pattern `201[012]`.
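The two patterns can be compared directly. The following sketch uses a made-up vector of strings to check that the alternative and the set give the same result.

```r
library(stringr)

# Hypothetical strings, invented for illustration
year_strings <- c("Review of 2010", "Best of 2011", "Plans for 2012",
                  "Back in 2013", "The 2010s")

with_alternative <- str_detect(year_strings, "2010|2011|2012")
with_set <- str_detect(year_strings, "201[012]")

print(with_alternative)
# [1]  TRUE  TRUE  TRUE FALSE  TRUE
print(identical(with_alternative, with_set))
# [1] TRUE
```

The set is more compact here because the three alternatives differ in only one character. The alternative character is the more general tool when the options have little in common.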

**Task**

* Add a new column, `year_group`, to our dataset containing `Match` if the title contains `2000`, `2005`, or `2010`, and `No Match` otherwise.

**Answer**

`pattern <- "2000|2005|2010"
matches <- str_detect(titles, pattern)
hn_matches <- if_else(matches, "Match", "No Match")
hn_group <- hn %>%
    mutate(year_group = hn_matches)`

We used regular expressions to match and count how many titles contain `Amazon` or `amazon`. What if we wanted to view those titles?

In that case, we can use the logical vector returned by the `str_detect()` function to select just the matching elements from `titles`.

`titles  <-  hn$title`

`amazon_titles_logical  <- str_detect(titles, "[Aa]mazon" )`

Then, we can use that logical vector to select just the matching titles:

`amazon_titles  <-  titles[amazon_titles_logical]
print(head(amazon_titles))`

![image.png](attachment:image.png)

We can use the same logical vector (`amazon_titles_logical`) to select just the matching rows in `hn` using the [`filter()` function](https://dplyr.tidyverse.org/reference/filter.html) from the [`dplyr` package](https://dplyr.tidyverse.org/):


`hn_amazon <- hn %>% filter(amazon_titles_logical)`

`print(head(hn_amazon))`

![image.png](attachment:image.png)
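As a variation on the approach above, `str_detect()` can also be called directly inside `filter()`, which avoids the intermediate logical vector. The sketch below uses a tiny made-up data frame named `stories` in place of `hn`. Its rows are invented and only the column names follow the dataset description.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for hn, invented for illustration
stories <- data.frame(
  title = c("Amazon Echo review", "Learning R", "Selling on amazon"),
  num_points = c(10, 25, 3),
  stringsAsFactors = FALSE
)

# Keep only the rows whose title matches the pattern
amazon_stories <- stories %>%
  filter(str_detect(title, "[Aa]mazon"))

print(amazon_stories)
```

Both forms select the same rows. The separate logical vector is handy when we want to reuse it, for example to count matches with `sum()`.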

**Task**

1. Use `str_detect()` to create a logical vector indicating which elements of `titles` contain `Google` or `google`. Assign the result to `google_titles_logical`.
2. Use `google_titles_logical` to create a vector of the titles containing `Google` or `google`. Assign the result to `google_titles`.
3. Use `google_titles_logical` to create a dataframe of the `hn` rows whose titles contain `Google` or `google`. Assign the result to `hn_google`.

**Answer**

`google_titles_logical  <- str_detect(titles, "[Gg]oogle" )
google_titles  <-  titles[google_titles_logical]
hn_google <- hn %>% filter(google_titles_logical)`

Braces (`{}`) are used as a special character to specify that a character repeats in our regular expression. For instance, if we wanted to write a pattern that matches the numbers from 1000 to 2999 in text, we could write the regular expression below:

![image.png](attachment:image.png)

This type of regular expression syntax is called a **quantifier**. Quantifiers specify how many of the previous character our pattern requires, which can help us when we want to match substrings of specific lengths. As an example, we might want to match both `e-mail` and `email`. To do this, we would specify that the `-` character should match either zero or one times.

The specific type of quantifier we saw above is called a numeric quantifier. Here are the different types of numeric quantifiers we can use:

![image.png](attachment:image.png)

In addition to numeric quantifiers, there are single characters in regex that specify some common quantifiers that we're likely to use. A summary of them is below.

![image.png](attachment:image.png)
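The quantifiers described above can be tried out on a few made-up strings. The sketch below assumes that the 1000 to 2999 pattern is along the lines of `[1-2][0-9]{3}`. That exact pattern is an assumption, not something stated in the text.

```r
library(stringr)

# Numeric quantifier: one digit 1 or 2, then exactly three digits
numbers <- c("999", "1000", "2999", "3000")
in_range <- str_detect(numbers, "[1-2][0-9]{3}")
print(in_range)
# [1] FALSE  TRUE  TRUE FALSE

# ? : the hyphen may appear zero or one times
email_variants <- c("email", "e-mail", "e--mail")
optional_hyphen <- str_detect(email_variants, "e-?mail")
print(optional_hyphen)
# [1]  TRUE  TRUE FALSE

# * allows zero or more of the previous character, + requires at least one
zero_or_more <- str_detect("gal", "go*al")
one_or_more <- str_detect("gal", "go+al")
print(c(zero_or_more, one_or_more))
# [1]  TRUE FALSE

# {2,3} matches between two and three repetitions, as many as possible
repeated <- str_extract("goooooal", "o{2,3}")
print(repeated)
# [1] "ooo"
```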

**Task**

1. Use a regular expression and the `str_detect()` function to create a logical vector that matches items from `titles` containing `email` or `e-mail`. Assign the result to `email_logical`.
2. Use `email_logical` to count the number of titles that matched the regular expression. Assign the result to `email_count`.
3. Use `email_logical` to select only the items from `titles` that matched the regular expression. Assign the result to `email_titles`.

**Answer**

`email_logical  <-  str_detect(titles, "e-?mail")
 email_count  <-  sum(email_logical)
 email_titles  <-  titles[email_logical]`

Some stories submitted to Hacker News include a topic tag in brackets, like `[pdf]`.

Because square brackets normally define a set, we cannot match them directly. To match the substring `"[pdf]"`, we can use **backslashes to escape** both the opening and closing brackets: `\[pdf\]`.

![image.png](attachment:image.png)

The other critical part of our task of identifying how many titles have tags is knowing how to match the characters between the brackets (like `pdf` and `video`) without knowing ahead of time what the different topic tags will be.

To match unknown characters using regular expressions, we use **character classes**. Character classes allow us to match certain groups of characters.

We've actually seen two examples of character classes already:

1. The set notation using brackets to match any of a number of characters.
2. The range notation, which we used to match ranges of digits (like `[0-9]`).

Let's look at a summary of syntax for some of the regex character classes:

![image.png](attachment:image.png)

There are two new things we can observe from this table:

1. Ranges can be used for letters as well as numbers.
2. Sets and ranges can be combined.
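These two observations can be checked with a short sketch. The single-character strings below are made up so that each one falls into a different category.

```r
library(stringr)

chars <- c("a", "Q", "7", "_")  # lowercase, uppercase, digit, underscore

lowercase_only <- str_detect(chars, "[a-z]")       # a range of letters
any_letter <- str_detect(chars, "[A-Za-z]")        # two ranges combined
letter_or_digit <- str_detect(chars, "[A-Za-z0-9]") # three ranges combined

print(lowercase_only)
# [1]  TRUE FALSE FALSE FALSE
print(any_letter)
# [1]  TRUE  TRUE FALSE FALSE
print(letter_or_digit)
# [1]  TRUE  TRUE  TRUE FALSE
```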

Just like with quantifiers, there are some other common character classes which we'll use a lot.

![image.png](attachment:image.png)

The one that we'll use to match characters in tags is `\w`, which represents any word character: a digit, an uppercase or lowercase letter, or an underscore. Each character class represents a single character, so to match multiple characters (e.g. words like `video` and `pdf`), we'll need to combine them with quantifiers.

In order to match word characters between our brackets, we can combine the word character class (`\w`) with the "one or more" quantifier (`+`), giving us a combined pattern of `\w+`.

This will match sequences like `pdf`, `video`, `Python`, and `2018`, but won't match a sequence containing a space or punctuation character like `PHP-DEV` or `XKCD Flowchart`. If we wanted to match those tags as well, we could use `.+`; however, in this case, we're just interested in single-word tags without special characters.

In R, as in many other languages, a backslash followed by certain characters represents an [escape sequence](https://en.wikipedia.org/wiki/Escape_sequences_in_C#Table_of_escape_sequences); for example, the `\n` sequence represents a new line. These escape sequences can have unintended consequences for our regular expressions. For example, if we use the string `"\w"` as a pattern, we get the following error (or similar), because R tries to interpret it as an escape sequence:

`Error: '\w' is an unrecognized escape in character string`

To avoid string interpretation errors by R, we have to escape the backslash itself. To do so, we double the backslashes in regular expression character sequences; i.e., in R code we always write `\\w` or `\\[` instead of `\w` or `\[`. To match a literal backslash with a regular expression, four backslashes are necessary: `\\\\`.
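The difference between what we type in R and what the regex engine receives can be made visible with `writeLines()`, which prints a string after R has processed its escapes. This is a small sketch with made-up strings. The Windows-style path is only there to provide a literal backslash.

```r
library(stringr)

# What we type in R code: doubled backslashes
pattern <- "\\[pdf\\]"

# What the regex engine receives: single backslashes
writeLines(pattern)
# \[pdf\]

escaped_match <- str_detect("Annual report [pdf]", pattern)
print(escaped_match)
# [1] TRUE

# Matching a literal backslash takes four backslashes in R code
path_with_backslash <- "C:\\temp"   # the string itself holds one backslash
path_with_slash <- "C:/temp"
backslash_found <- str_detect(c(path_with_backslash, path_with_slash), "\\\\")
print(backslash_found)
# [1]  TRUE FALSE
```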

Let's quickly recap the concepts we learned:

* We can use a backslash to escape characters that have special meaning in regular expressions (e.g. `\[` will match an open bracket character).
* Character classes let us match certain groups of characters (e.g. `\w` will match any word character).
* Character classes can be combined with quantifiers when we want to match different numbers of characters.
* In R code, we **have to double the backslashes** in character classes and escaped characters to avoid incorrect string interpretation.
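The points in this recap can be sketched on a handful of made-up tags, comparing the single-word pattern `\w+` with the more permissive `.+` between escaped brackets.

```r
library(stringr)

# Hypothetical tags, invented for illustration
sample_tags <- c("[pdf]", "[video]", "[2018]", "[PHP-DEV]", "[XKCD Flowchart]")

# Word characters only between the brackets
single_word <- str_detect(sample_tags, "\\[\\w+\\]")
print(single_word)
# [1]  TRUE  TRUE  TRUE FALSE FALSE

# Any characters between the brackets
any_characters <- str_detect(sample_tags, "\\[.+\\]")
print(any_characters)
# [1] TRUE TRUE TRUE TRUE TRUE
```

The hyphen and the space are not word characters, so the last two tags only match the `.+` version.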

**Task**

* Use these concepts to count the number of titles of stories submitted to Hacker News that include a topic tag in brackets, like `[pdf]`, `[video]`, etc.

**Answer**

`pattern  <-  "\\[\\w+\\]"
tag_logical  <-  str_detect(titles, pattern)
tag_titles  <-  titles[tag_logical]
tag_count  <-  sum(tag_logical)`

Above, we were able to calculate that 444 of the 20,100 Hacker News stories in our dataset contain tags. What if we wanted to find out what the text of these tags was, and how many of each are in the dataset?

We'll learn how to access the matching text by looking at just the first six matching titles:

`tag_titles_head  <-  head(tag_titles)
 print(tag_titles_head)`
 
![image.png](attachment:image.png)

We use the [`str_extract() function`](https://stringr.tidyverse.org/reference/str_extract.html) to extract the text that matches the pattern:

`pattern  <-  "\\[\\w+\\]"
tags_matches <- str_extract(tag_titles_head, pattern)
print(tags_matches)`

![image.png](attachment:image.png)

What if we wanted to extract the tag text without the square brackets `[ ]`?

In order to do this, we'll need to use **capture groups**. **Capture groups** allow us to specify one or more groups within our match that we can access separately.

We specify capture groups using parentheses.

![image.png](attachment:image.png)

We use the [`str_match() function`](https://stringr.tidyverse.org/reference/str_match.html) to extract the match within our parentheses:

`pattern  <-  "\\[(\\w+)\\]"
tags_matches <- str_match(tag_titles_head, pattern)
print(tags_matches)`

![image.png](attachment:image.png)

The output of the `str_match()` function is a matrix where:

* the first column contains the whole match, which is the same as the output of the `str_extract()` function.
* the second column contains the text matched within the parentheses.
* each row corresponds to one input string.

Hence, we can get just the text by indexing the second column:

`tags_text_matches <- tags_matches[,2]
print(tags_text_matches)`

![image.png](attachment:image.png)

If we then use the built-in [function `table()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/table.html) we can quickly get a frequency table of the tags:

`tags_freq <- table(tags_text_matches)
print(tags_freq)`

![image.png](attachment:image.png)
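To make the steps above easier to follow end to end, here is a sketch on four made-up titles, one of which has no tag. It also shows what happens to strings that do not match: both functions return `NA` for them, and `table()` leaves `NA` values out by default.

```r
library(stringr)

# Hypothetical titles, invented for illustration
sample_titles <- c("Deep learning notes [pdf]",
                   "Conference keynote [video]",
                   "Another paper [pdf]",
                   "A title without a tag")

pattern <- "\\[(\\w+)\\]"

# Whole match, brackets included
whole_match <- str_extract(sample_titles, pattern)
print(whole_match)
# [1] "[pdf]"   "[video]" "[pdf]"   NA

# Matrix: column 1 is the whole match, column 2 is the capture group
match_matrix <- str_match(sample_titles, pattern)
tag_text <- match_matrix[, 2]
print(tag_text)
# [1] "pdf"   "video" "pdf"   NA

# Frequency table of the tags (NA is not counted)
tag_freq <- table(tag_text)
print(tag_freq)
```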

**Task** 

Use the technique above to extract all of the tags from the Hacker News titles and build a frequency table of those tags.

**Answer**

`# tag_titles  <-  titles[str_detect(titles, "\\[\\w+\\]")]`


`pattern  <-  "\\[(\\w+)\\]"
tags_text_matches <- str_match(tag_titles, pattern)[,2]
tags_freq <- table(tags_text_matches)`

Above, we wrote fairly simple regular expressions. In reality, regular expressions are often complex. When creating complex regular expressions, we often need to work iteratively so we can find "bad" instances that match our pattern and then exclude them.

In order to work faster as we build our regular expression, it can be helpful to create a function that returns the first few matching strings:

`first_10_matches <- function (data, pattern) {
    matches <- str_detect(data, pattern) # finding pattern matches
    matched_df <- data[matches] # subsetting data (keep only matches)
    head(matched_df, 10) # taking the first ten matched elements
}`

Another useful approach is to use an online tool like [RegExr](https://regexr.com/) that allows us to build regular expressions and includes syntax highlighting, instant matches, and regex syntax reference. 

Let's write a simple regular expression to match `Java` (a popular language), and use our function to look at the matches:

`first_10_matches(titles, "[Jj]ava")`

![image.png](attachment:image.png)

We can see that there are a number of matches that contain `Java` as part of the word `JavaScript`. We want to exclude these titles from matching so we get an accurate count.

One way to do this is by using **negative character classes**. Negative character classes match every character except those listed in the class.

![image.png](attachment:image.png)

**Task**

* Use the negative set `[^Ss]` to exclude instances like `JavaScript` and `Javascript`.

**Answer**

`pattern  <-  "[Jj]ava[^Ss]"
java_titles  <-  titles[str_detect(titles, pattern)]`

While the negative set was effective in removing any bad matches that mention "JavaScript", it also had the side-effect of removing any titles where `Java` occurs at the end of the string, like this title:

`Pippo  Web framework in Java`

This is because the negative set `[^Ss]` must match one character, so instances at the end of a string do not match.
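The side effect is easy to reproduce on a few made-up titles. In the sketch below, the last title ends in `Java`, so there is no character left for `[^Ss]` to match.

```r
library(stringr)

# Hypothetical titles, invented for illustration
sample_titles <- c("JavaScript rocks",
                   "Java is verbose",
                   "Web framework in Java")

negative_set <- str_detect(sample_titles, "[Jj]ava[^Ss]")
print(negative_set)
# [1] FALSE  TRUE FALSE
```

The second title matches because `Java` is followed by a space, which is a character other than `S` or `s`.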

A different approach to take in cases like these is to use the **word boundary anchor**, using the syntax `\b`. The `stringr` package also provides a `boundary()` helper for working with word boundaries. A word boundary matches the position between a word character and a non-word character, or a word character and the start/end of a string. The diagram below shows all the word boundaries in an example string:

![image.png](attachment:image.png)

Let's look at how using a word boundary changes the match from the string in the example above:

`string  <-  "Sometimes people confuse JavaScript with Java"`
`pattern_1  <-  "Java[^S]"`

`m_1  <-  str_detect(string, pattern_1)
print(m_1)`

![image.png](attachment:image.png)

The regular expression returns `FALSE`, because there is no substring that contains `Java` followed by a character that isn't `S`.

Let's instead use word boundaries (`\b`) in our regular expression:

`pattern_2  <-  "\\bJava\\b"

m_2  <-  str_detect(string, pattern_2)

print(m_2)`

![image.png](attachment:image.png)

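One can also use the `boundary()` function. It takes a boundary type, such as `boundary("word")`, rather than a literal string like `"Java"`, so one way to use it is to split the string into words with `str_extract_all()` and then check whether `Java` is one of them:

`words  <-  str_extract_all(string, boundary("word"))[[1]]

m_3  <-  "Java" %in% words

print(m_3)`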

![image.png](attachment:image.png)

With the word boundary, our pattern matches the `Java` at the end of the string.
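Since word characters are letters, digits, and underscores, the boundary behaves differently depending on what follows `Java`. The sketch below tries `\bJava\b` on a few made-up strings to show which neighbouring characters create a boundary.

```r
library(stringr)

# Hypothetical strings, invented for illustration
candidates <- c("Java",
                "JavaScript",
                "Java-based tools",
                "Java8 features",
                "I like Java.")

whole_word <- str_detect(candidates, "\\bJava\\b")
print(whole_word)
# [1]  TRUE FALSE  TRUE FALSE  TRUE
```

A hyphen or a full stop is a non-word character, so a boundary exists after `Java` in those strings. A digit is a word character, so `Java8` does not match.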

**Task**

* Use the word boundary anchor as part of our regular expression to select the titles that mention `Java`.

**Answer**

`pattern  <-  "\\b[Jj]ava\\b"
java_titles  <-  titles[str_detect(titles, pattern)]`

So far, we've used regular expressions to match substrings contained anywhere within text. There are often scenarios where we want to specifically match a pattern at the start or end of strings.

We learned that the **word boundary anchor** matches the position between a word character and a non-word character. In regular expressions, an **anchor** matches something that isn't a character, as opposed to character classes which match specific characters.

Other than the word boundary anchor, the other two most common anchors are the **beginning anchor** and the **end anchor**, which represent the start and the end of the string.

![image.png](attachment:image.png)

Note that the `^` character is used both as a beginning anchor and to indicate a negative set. When it immediately follows an opening `[`, it negates the set; otherwise, it acts as the beginning anchor.
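Both anchors, and the two meanings of `^`, can be seen side by side in a short sketch. The three titles below are made up so that the tag sits at the start, at the end, and in the middle.

```r
library(stringr)

# Hypothetical titles, invented for illustration
sample_titles <- c("[pdf] Annual report",
                   "Annual report [pdf]",
                   "Annual [pdf] report")

# Beginning anchor: the tag must be the first thing in the string
tag_at_start <- str_detect(sample_titles, "^\\[\\w+\\]")
print(tag_at_start)
# [1]  TRUE FALSE FALSE

# End anchor: the tag must be the last thing in the string
tag_at_end <- str_detect(sample_titles, "\\[\\w+\\]$")
print(tag_at_end)
# [1] FALSE  TRUE FALSE

# First ^ is the beginning anchor, second ^ negates the set:
# the string starts with a character that is not an opening bracket
no_leading_bracket <- str_detect(sample_titles, "^[^\\[]")
print(no_leading_bracket)
# [1] FALSE  TRUE  TRUE
```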

**Task**

* Use the beginning and end anchors to count how many titles have tags at the start versus the end of the story title in our Hacker News dataset.

**Answer**

`pattern_beginning  <-  "^\\[\\w+\\]"
beginning_count  <-  sum(str_detect(titles, pattern_beginning))`

`pattern_ending  <-  "\\[\\w+\\]$"
ending_count  <-  sum(str_detect(titles, pattern_ending))`

Up until now, we've used sets like `[Jj]` to match different capitalizations in our regular expressions. This strategy works well when there is only one character that has capitalization, but becomes cumbersome when we need to cater for multiple instances.

We can use **flags** to specify that our regular expression should ignore case. To do so, we place the ignore-case flag, `(?i)`, at the beginning of the pattern.

`email_tests  <-  c('email', 'Email', 'e Mail', 'e mail', 'E-mail',
                   'e-mail', 'eMail', 'E-Mail', 'EMAIL')`

`pattern <- "(?i)e[\\-\\s]?mail"
email_tests_matches <- str_detect(email_tests, pattern)
email_mentions <- sum(str_detect(titles, pattern))`
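To see the effect of the flag, the sketch below runs the same pattern on the test vector with and without `(?i)`. It also shows an alternative spelling of the same idea, the `regex()` helper from `stringr` with `ignore_case = TRUE`, which is offered here as an option rather than as the approach used above.

```r
library(stringr)

email_tests <- c('email', 'Email', 'e Mail', 'e mail', 'E-mail',
                 'e-mail', 'eMail', 'E-Mail', 'EMAIL')

# Without the flag, only the all-lowercase variants match
case_sensitive <- str_detect(email_tests, "e[\\-\\s]?mail")
print(sum(case_sensitive))
# [1] 3

# With the inline flag, every variant matches
with_flag <- str_detect(email_tests, "(?i)e[\\-\\s]?mail")
print(all(with_flag))
# [1] TRUE

# The same option set through stringr's regex() helper
with_helper <- str_detect(email_tests,
                          regex("e[\\-\\s]?mail", ignore_case = TRUE))
print(identical(with_flag, with_helper))
# [1] TRUE
```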