# Regular Expressions in Ruby

Regular expressions, though cryptic, are a powerful tool for working with text. Ruby has them built in and uses them for pattern matching and text processing.

Many people find regular expressions difficult to use, difficult to read, unmaintainable, and ultimately counterproductive. You may end up using only a modest number of regular expressions in your Ruby and Rails applications, and becoming a regular expression wizard isn't a prerequisite for Rails programming. _However, it's advisable to learn at least the basics of how regular expressions work_.

**A regular expression is simply a way of specifying a pattern of characters to be matched in a string**. In Ruby, you typically create a regular expression by writing a pattern between slash characters (`/pattern/`).

In Ruby, regular expressions are objects (of type `Regexp`) and can be manipulated as such. `//` is a regular expression and an instance of the `Regexp` class, as shown below.

```ruby
//.class
# => Regexp
```
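Because a regular expression is an ordinary object, it can be stored in a variable, compared, and inspected. The following is a minimal sketch; the variable names are illustrative, and `Regexp.new` and `Regexp#source` are standard methods not otherwise covered in this text.

```ruby
# Two ways to build the same pattern; both yield Regexp objects.
literal = /Ruby/
built   = Regexp.new("Ruby")

puts literal.class      # Regexp
puts literal == built   # true: same pattern text and options
puts literal.source     # Ruby (the pattern text without the slashes)
```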

You could write a pattern that matches a string containing the text Pune or the text Ruby using the following regular expression:

```ruby
/Pune|Ruby/
# => /Pune|Ruby/
```

The forward slashes delimit the pattern, which consists of the two things we are matching, separated by a pipe character (`|`). The pipe character means "either the thing on the right or the thing on the left", in this case _Pune or Ruby_.
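To see the alternation in action, the sketch below tests a few sample sentences against the pattern. The sentences and variable names are made up for illustration, and `match?` (available since Ruby 2.4) is assumed here as a convenient true/false check.

```ruby
city_or_language = /Pune|Ruby/
sentences = ["I live in Pune", "I write Ruby", "I like Python"]

# match? returns true when either alternative appears anywhere in the string.
results = sentences.map { |sentence| city_or_language.match?(sentence) }

sentences.zip(results).each do |sentence, matched|
  puts "#{sentence.inspect}: #{matched}"
end
```

Only the third sentence fails, because it contains neither "Pune" nor "Ruby".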

The simplest way to find out whether there's a match between a pattern and a string is with the `match` method. You can do this in either direction: regular expression objects and string objects both respond to `match`. If there's no match, you get back `nil`. If there's a match, you get back an instance of the class `MatchData`. We can also use the match operator `=~` to match a string against a regular expression. If the pattern is found in the string, `=~` returns the position where the match starts; otherwise it returns `nil`.

```ruby
m1 = /Ruby/.match("The future is Ruby")
puts m1.class

m2 = "The future is Ruby" =~ /Ruby/
puts m2
```

Output:

```
MatchData
14
```
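The "either direction" point and the `nil` result can be illustrated with a short sketch. The variable names are illustrative only.

```ruby
text = "The future is Ruby"

from_regexp = /Ruby/.match(text)   # Regexp#match
from_string = text.match(/Ruby/)   # String#match

# Both directions find the same matched text.
puts from_regexp[0] == from_string[0]   # true

# When the pattern is absent, both match and =~ return nil.
no_match    = /Python/.match(text)
no_position = text =~ /Python/
p no_match      # nil
p no_position   # nil
```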


The possible components of a regular expression include the following:

## Literal characters

Any literal character you put in a regular expression matches _itself_ in the string.

```ruby
/a/
```

This regular expression matches the string "a", as well as any string containing the letter "a".

Some characters have special meanings to the regexp parser. When you want to match one of these special characters as itself, you have to escape it with a backslash (`\`). For example, to match the character `?` (question mark), you have to write this:

```ruby
/\?/
```

The backslash means "don't treat the next character as special; treat it as itself".

The special characters include `^`, `$`, `?`, `.`, `/`, `\`, `|`, `[`, `]`, `{`, `}`, `(`, `)`, `+` and `*`.
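The sketch below shows an escaped question mark at work. It also uses `Regexp.escape`, a standard class method not otherwise covered in this text, which adds the backslashes for you; the sample strings are made up for illustration.

```ruby
question = /\?/
has_question = question.match?("Ready?")   # true: a literal ? is present
no_question  = question.match?("Ready")    # false: no ? in the string
puts has_question
puts no_question

# Regexp.escape returns a copy of the string with special characters escaped.
escaped = Regexp.escape("1+1=2?")
puts escaped                                # 1\+1=2\?

literal_match = Regexp.new(escaped).match?("Is 1+1=2? Yes.")
puts literal_match                          # true
```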

## The wildcard character `.` (dot)

Sometimes you'll want to match any character at some point in your pattern. You do this with the special wildcard character `.` (dot). A dot matches any character with the exception of a newline. This regular expression:

```ruby
/.ejected/
```

matches both "dejected" and "rejected". It also matches "%ejected" and "8ejected". The wildcard dot is handy, but sometimes it gives you more matches than you want. However, you can impose constraints on matches while still allowing for multiple possible strings, using _character classes_.
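A quick way to check the dot's behavior is to run the pattern against a handful of strings. This is a minimal sketch with illustrative sample words, assuming `match?` (Ruby 2.4+) for the true/false test.

```ruby
pattern = /.ejected/
words = %w[dejected rejected %ejected 8ejected]

# Every word has exactly one character before "ejected", so all four match.
results = words.map { |word| pattern.match?(word) }
words.zip(results).each { |word, matched| puts "#{word}: #{matched}" }

# The dot must consume one character, and that character cannot be a newline.
bare_word    = pattern.match?("ejected")     # false
after_newline = pattern.match?("\nejected")  # false
puts bare_word
puts after_newline
```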

## Character classes

A character class is an explicit list of characters, placed inside the regular expression in square brackets:

```ruby
/[dr]ejected/
```

This means "match either d or r, followed by ejected". This new pattern matches either "dejected" or "rejected" but not "&ejected". A character class is a kind of quasi-wildcard: It allows for multiple possible characters, but only a limited number of them.
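The narrowing effect of the character class can be sketched by comparing it with the wildcard version on the same input. The variable names are illustrative.

```ruby
wildcard   = /.ejected/
char_class = /[dr]ejected/

# The class accepts only d or r in the first position.
puts char_class.match?("dejected")   # true
puts char_class.match?("rejected")   # true

# The wildcard still accepts "&ejected"; the class does not.
wildcard_result = wildcard.match?("&ejected")     # true
class_result    = char_class.match?("&ejected")   # false
puts wildcard_result
puts class_result
```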

Inside a character class, you can also insert a range of characters. A common case is this, for lowercase letters:

```ruby
/[a-z]/
```

To match a hexadecimal digit, you might use several ranges inside a character class:

```ruby
/[A-Fa-f0-9]/
```
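As a rough check of a hexadecimal character class, the sketch below picks the hex digits out of a sample string. `String#scan`, a standard method not otherwise covered in this text, is assumed here to collect every match; the sample string is made up.

```ruby
hex_digit = /[A-Fa-f0-9]/

puts hex_digit.match?("c")   # true: within a-f
puts hex_digit.match?("G")   # false: outside every range

# scan returns each matching character in order; "x" and "G" are skipped.
found = "0x1FG".scan(hex_digit)
p found   # ["0", "1", "F"]
```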

Some character classes are so common that they have special abbreviations.

## Special escape sequences for common character classes

To match any digit, you can do this:

```ruby
/[0-9]/
```

But you can also accomplish the same thing more concisely with the special escape sequence `\d`:

```ruby
/\d/
```

Two other useful escape sequences for predefined character classes are these:

- `\w` matches any digit, alphabetical character, or underscore (`_`).
- `\s` matches any whitespace character (space, tab, newline).

Each of these predefined character classes also has a negated form. You can match _any character that is not a digit_ by doing this:

```ruby
/\D/
```

Similarly, `\W` matches _any character other than an alphanumeric character or underscore_, and `\S` matches _any non-whitespace character_.
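The six escape sequences can be compared side by side on one small string. This sketch uses `String#scan` (a standard method, assumed here only to list every match) and an arbitrary four-character sample.

```ruby
sample = "a1 _"   # a letter, a digit, a space, and an underscore

digits      = sample.scan(/\d/)   # ["1"]
word_chars  = sample.scan(/\w/)   # ["a", "1", "_"]
spaces      = sample.scan(/\s/)   # [" "]
non_digits  = sample.scan(/\D/)   # ["a", " ", "_"]
non_word    = sample.scan(/\W/)   # [" "]
non_spaces  = sample.scan(/\S/)   # ["a", "1", "_"]

p digits, word_chars, spaces, non_digits, non_word, non_spaces
```

Each negated form returns exactly the characters its lowercase counterpart leaves out.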

## A successful match returns a `MatchData` object

Every `match` operation either succeeds or fails. Let's start with the simpler case: failure. When you try to match a string to a pattern, and the string doesn't match, the result is always `nil`:

```ruby
/a/.match("b").nil?
# => true
```

This `nil` stands in for the false or no answer when you treat the match as a true/false test.

Unlike `nil`, the `MatchData` object returned by a successful `match` has a Boolean value of true, which makes it handy for simple match/no-match tests. Beyond this, however, it also stores information about the match, which you can pry out of it with the appropriate methods: where the match began (at what character in the string), how much of the string is covered, what was captured in the parenthetical groups, and so forth.
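A few of those methods can be sketched on a simple match. `begin`, `end`, `pre_match`, and `post_match` are standard `MatchData` methods not otherwise covered in this text; the sample string is made up for illustration.

```ruby
m = /Ruby/.match("The future is Ruby!")

if m   # a MatchData object is truthy, nil is not
  puts m.begin(0)     # 14: index where the whole match starts
  puts m.end(0)       # 18: index just past the last matched character
  p m.pre_match       # "The future is "
  p m.post_match      # "!"
end
```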

To use the `MatchData` object, you must first save it.

Consider an example where we want to pluck a phone number from a string and save the various parts of it (area code, exchange, number) in groupings:

```ruby
string = "My phone number is (123) 555-1234."
phone_re = /\((\d{3})\)\s+(\d{3})-(\d{4})/
m = phone_re.match(string)
unless m
  puts "There was no match..."
  exit
end
print "The whole string we started with: "
puts m.string
print "The entire part of the string matched: "
puts m[0]
puts "The three captures:"
3.times do |index|
  puts "Capture ##{index+1}: #{m.captures[index]}"
end
puts "Here's another way to get at the first capture:"
print "Capture #1: "
puts m[1]
```

Output:

```
The whole string we started with: My phone number is (123) 555-1234.
The entire part of the string matched: (123) 555-1234
The three captures:
Capture #1: 123
Capture #2: 555
Capture #3: 1234
Here's another way to get at the first capture:
Capture #1: 123
```


In this code, we use the `string` method of `MatchData` (`puts m.string`) to get the entire string on which the `match` operation was performed.

To get the part of the string that matched our pattern, we address the `MatchData` object with square brackets, with an index of 0 (`puts m[0]`).

We also use the `times` method (`3.times do |index|`) to iterate exactly three times through a code block and print out submatches (the parenthetical captures) in succession. Inside that code block, a method called `captures` fishes out the substrings that match the parenthesized parts of the pattern.

Finally, we take another look at the first capture, this time through a different technique: indexing the `MatchData` object directly with square brackets and positive integers, each integer corresponding to a capture.
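The relationship between the two techniques is worth spelling out: `m[n]` corresponds to `m.captures[n - 1]`, because index 0 of the `MatchData` object is reserved for the whole match. The sketch below reuses the phone number pattern; the variable names for the three parts and the use of `to_a` (a standard `MatchData` method) are additions for illustration.

```ruby
phone_re = /\((\d{3})\)\s+(\d{3})-(\d{4})/
m = phone_re.match("My phone number is (123) 555-1234.")

if m
  # captures is a plain array, so it can be unpacked in one assignment.
  area_code, exchange, number = m.captures
  puts area_code   # 123
  puts exchange    # 555
  puts number      # 1234

  # Square-bracket indexing is offset by one relative to captures.
  puts m[2] == m.captures[1]   # true

  # to_a lists the whole match first, followed by the captures.
  p m.to_a   # ["(123) 555-1234", "123", "555", "1234"]
end
```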