## Regular expressions (Chap 11)
- Regular expressions appear in many programming languages, with minor differences among the incarnations. Their purpose is to specify character patterns that subsequently are determined to match (or not match) strings. Pattern matching, in turn, serves as the basis for operations like parsing log files, testing keyboard input for validity, and isolating substrings—operations, in other words, of frequent and considerable use to anyone who has to process strings and text.

- Regular expressions have a weird reputation. Using them is a powerful, concentrated technique; they burn through a large subset of text-processing problems like acid through a padlock. They’re also, in the view of many people (including people who understand them well), difficult to use, difficult to read, opaque, unmaintainable, and ultimately counterproductive.

#### Writing

In [None]:
# seeing patterns (11.2.1)

In [9]:
# simple matches (11.2.2)

p //.class # literal constructor
p %r{}.class # %r{} is an alternate literal syntax (plain %{} builds a String)
p /abc/.match?("The alphabet starts with abc.")
p "The alphabet starts with abc.".match?(/abc/)
p /abc/.match("The alphabet starts with abc.")
p /abc/.match("def")

Regexp
Regexp
true
true
#<MatchData "abc">
nil


In [10]:
# =~ returns the index of the character where the match starts (nil if there is no match)
p "The alphabet starts with abc" =~ /abc/

25


25
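
The three calls used so far differ mainly in what they return. A minimal sketch that puts them side by side; `SENTENCE` and `match_summary` are illustrative names, not from the chapter:

```ruby
# Sketch: three ways to ask whether (and where) a pattern matches.
# SENTENCE and match_summary are hypothetical names used for illustration.
SENTENCE = "The alphabet starts with abc."

def match_summary(pattern, text)
  m = pattern.match(text)            # MatchData on success, nil on failure
  {
    boolean: pattern.match?(text),   # true or false
    index:   (text =~ pattern),      # offset where the match starts, or nil
    matched: (m && m[0])             # the matched substring, or nil
  }
end

p match_summary(/abc/, SENTENCE)
p match_summary(/xyz/, SENTENCE)
```

`match?` is the one to reach for when only a yes/no answer is needed, since it does not build a MatchData object.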

![regex patterns](px/Selection_164.png)

#### Building a pattern

In [11]:
# literal character patterns (11.3.1)
p "The alphabet starts with abc" =~ /a/ # 1st match of "a"

4


4

In [12]:
# wildcards (11.3.2)
<<READ
Sometimes you’ll want to match any character at some point in your pattern. You do
this with the special dot wildcard character (.). A dot matches any character with the
exception of a newline.
READ
    
p /.ejected/.match?("%ejected")

true


true
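
A short sketch of what the dot does and does not match, including the newline exception described above. `DOT_PATTERN` is an illustrative name; the `m` modifier in the last line is the standard way to let the dot match a newline as well:

```ruby
# Sketch: behavior of the dot wildcard. DOT_PATTERN is a hypothetical name.
DOT_PATTERN = /.ejected/

p DOT_PATTERN.match?("rejected")     # true: "r" fills the dot
p DOT_PATTERN.match?("dejected")     # true: "d" fills the dot
p DOT_PATTERN.match?("ejected")      # false: the dot needs a character of its own
p DOT_PATTERN.match?("\nejected")    # false: the dot does not match a newline
p(/.ejected/m.match?("\nejected"))   # true: the m modifier lets the dot match newlines
```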

In [20]:
# character classes (11.3.3)
<<READ 
A character class is an explicit list of characters placed inside the regexp in square
brackets. Example matchers:
lowercase letters: /[a-z]/
hex digits: /[A-Fa-f0-9]/
anything except a hex digit (negated class): /[^A-Fa-f0-9]/
digits: /[0-9]/
READ

string = "ABC3934 is a hex number."
p string =~ %r{[^A-Fa-f0-9]}
p string[7..-1]
p string[0...7]

7
" is a hex number."
"ABC3934"


"ABC3934"

In [21]:
# escape sequences for common character classes
# (single-quoted heredoc, so the backslashes are kept literally)
<<'READ'
  /\d/ = any digit
  /\w/ = any digit, letter, or underscore
  /\s/ = any whitespace character
  /\D/ = any non-digit
  /\W/ = any character other than a digit, letter, or underscore
  /\S/ = any non-whitespace character
READ

"  /\\d/ = any digit\n  /\\w/ = any digit, letter, or underscore\n  /\\s/ = any whitespace character\n  /\\D/ = any non-digit\n  /\\W/ = any character other than a digit, letter, or underscore\n  /\\S/ = any non-whitespace character\n"

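To see the six escape sequences in action, one option is to test every character of a small sample string against each of them. This is only a sketch; `ESCAPE_SAMPLE` and `chars_matching` are illustrative names:

```ruby
# Sketch: which characters of a sample string each escape sequence accepts.
# ESCAPE_SAMPLE and chars_matching are hypothetical names.
ESCAPE_SAMPLE = "a_1 !"

def chars_matching(pattern, text)
  # Keep only the single characters that match the pattern.
  text.chars.select { |ch| ch.match?(pattern) }
end

p chars_matching(/\d/, ESCAPE_SAMPLE) # ["1"]
p chars_matching(/\w/, ESCAPE_SAMPLE) # ["a", "_", "1"]
p chars_matching(/\s/, ESCAPE_SAMPLE) # [" "]
p chars_matching(/\D/, ESCAPE_SAMPLE) # ["a", "_", " ", "!"]
p chars_matching(/\W/, ESCAPE_SAMPLE) # [" ", "!"]
p chars_matching(/\S/, ESCAPE_SAMPLE) # ["a", "_", "1", "!"]
```

Each uppercase escape accepts exactly the characters its lowercase partner rejects.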
#### Matching, substrings, MatchData

In [27]:
# capturing submatches (11.4.1)

<<READ
A successful match does two things: 1) it returns a MatchData object 
that gives us access to the submatches (discussed in a moment), and 
2) Ruby automatically populates a series of variables that also give 
us access to those submatches. The variables are global, and their 
names are based on numbers: $1, $2, and so forth. 
READ

s = "Peel,Emma,Mrs.,talented amateur"
p /([A-Za-z]+),[A-Za-z]+,(Mrs?\.)/.match(s)
p "Dear #{$2}, #{$1},"

#<MatchData "Peel,Emma,Mrs." 1:"Peel" 2:"Mrs.">
"Dear Mrs., Peel,"


"Dear Mrs., Peel,"

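The same submatches are available by index on the MatchData object, which avoids depending on the global variables. A minimal sketch, assuming the record layout from the cell above; `RECORD_RE` and `salutation` are illustrative names:

```ruby
# Sketch: reading captures from the MatchData object instead of $1 and $2.
# RECORD_RE and salutation are hypothetical names.
RECORD_RE = /([A-Za-z]+),[A-Za-z]+,(Mrs?\.)/

def salutation(record)
  m = RECORD_RE.match(record)
  return nil unless m          # no match: nothing to build
  "Dear #{m[2]} #{m[1]},"      # m[1] is the last name, m[2] the title
end

p salutation("Peel,Emma,Mrs.,talented amateur") # "Dear Mrs. Peel,"
p salutation("Steed,John,Mr.,secret agent")     # "Dear Mr. Steed,"
p salutation("no commas here")                  # nil
```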
In [31]:
# match success & failure (11.4.2)

p %r{a}.match("b") # non-matches return nil

<<READ
A successful match returns a MatchData object, which is truthy; 
a failed match returns nil. To query the result afterward, save 
it in a variable first.
READ

string   = "My phone number is (123) 555-1234."
phone_re = %r{\((\d{3})\)\s+(\d{3})-(\d{4})}
m        = phone_re.match(string)
unless m
    puts "no match. sorry."; exit
end
puts "start: "+m.string
puts "entire matching part: "+m[0]
puts "three captures: "
3.times do |ndx|
    puts m.captures[ndx]
end
    

nil
start: My phone number is (123) 555-1234.
entire matching part: (123) 555-1234
three captures: 
123
555
1234


3
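
Because a failed match returns nil, the save-then-test step is often folded into a single conditional. A sketch of that idiom using the phone pattern from the cell above; `PHONE_RE` and `phone_parts` are illustrative names:

```ruby
# Sketch: assign and test the match in one step.
# PHONE_RE and phone_parts are hypothetical names.
PHONE_RE = /\((\d{3})\)\s+(\d{3})-(\d{4})/

def phone_parts(text)
  # The extra parentheses signal that the assignment is intentional.
  if (m = PHONE_RE.match(text))
    m.captures   # array of the three captured groups
  end            # falls through to nil when there is no match
end

p phone_parts("My phone number is (123) 555-1234.") # ["123", "555", "1234"]
p phone_parts("no number here")                     # nil
```

Returning nil instead of calling `exit` keeps the check reusable inside a notebook session.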

In [None]:
# two ways of getting captures (11.4.3)


In [33]:
# named captures (in this case, <first>,<middle>,<last>)
s = "David A. Black"
re = %r{(?<first>\w+)\s+((?<middle>\w\.)\s+)(?<last>\w+)}
puts m  = re.match(s)
puts m[:first]
puts m.named_captures

David A. Black
David
{"first"=>"David", "middle"=>"A.", "last"=>"Black"}
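
The pattern above requires a middle initial. A variant sketch that makes it optional by wrapping the middle part in a non-capturing group followed by `?`; `NAME_RE` and `name_parts` are illustrative names, and this is one possible approach rather than the chapter's:

```ruby
# Sketch: named captures with an optional middle initial.
# NAME_RE and name_parts are hypothetical names.
NAME_RE = /(?<first>\w+)\s+(?:(?<middle>\w\.)\s+)?(?<last>\w+)/

def name_parts(full_name)
  m = NAME_RE.match(full_name)
  m && m.named_captures   # hash with string keys, or nil when nothing matches
end

p name_parts("David A. Black")
p name_parts("David Black")   # "middle" maps to nil here
p name_parts("Cher")          # nil: the pattern needs at least two words
```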


In [34]:
# other MatchData info (11.4.4)
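puts m.pre_match  # empty: nothing precedes the match
puts m.post_match # empty: nothing follows the match
puts m.begin(2) # index where the 2nd capture (middle) begins
puts m.end(3) # index just past the end of the 3rd capture (last)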



6
14


#### Quantifiers, anchors & modifiers

In [35]:
# constraining matches with quantifiers (11.5.1)
<<READ
Regexp syntax gives you ways to specify not only what you want 
but also how many: exactly one of a particular character, 
5–10 repetitions of a subpattern, and so forth.
READ

(?-mix:Mrs?\.?)
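
The output above is the string form of a pattern in which `?` makes the preceding character optional. A sketch of that quantifier alongside a `{5,10}` range; `title?` and `five_to_ten_digits?` are illustrative names, and the `\A` and `\z` anchors are added here only so the whole string has to match:

```ruby
# Sketch: the ? and {m,n} quantifiers.
# title? and five_to_ten_digits? are hypothetical names.
def title?(text)
  # "M", "r", an optional "s", then an optional literal dot.
  /\AMrs?\.?\z/.match?(text)
end

def five_to_ten_digits?(text)
  # Between 5 and 10 digits, inclusive.
  /\A\d{5,10}\z/.match?(text)
end

p title?("Mr")     # true
p title?("Mrs.")   # true
p title?("Mrss")   # false: only one optional "s" is allowed
p five_to_ten_digits?("1234")        # false: too short
p five_to_ten_digits?("12345")       # true
p five_to_ten_digits?("12345678901") # false: eleven digits
```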


In [38]:
# greedy & non-greedy quantifiers (11.5.2)
<<READ
The * (zero-or-more) and + (one-or-more) quantifiers are greedy. 
They match as many characters as possible, consistent with 
allowing the rest of the pattern to match.
READ

string = "abc!def!ghi!"
puts /.+!/.match(string)[0]

<<READ
We’ve asked for one or more characters (using the wildcard dot) 
followed by an exclamation point. You might expect to get back 
the substring "abc!", which fits that description.
Instead, we get "abc!def!ghi!". The + quantifier greedily eats 
up as much of the string as it can and only stops at the last 
exclamation point, not the first.

We can make + as well as * into non-greedy quantifiers by 
putting a question mark after them. The version below says, “Give me 
one or more wildcard characters, but only as many as you see up 
to the first exclamation point, which should also be included.” 
Sure enough, this time we get "abc!".
READ
    
puts /.+?!/.match(string)[0]

abc!def!ghi!
abc!
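
The difference shows up even more clearly when the same two patterns are applied repeatedly with `scan`. A small sketch; `BANG_TEXT`, `GREEDY_RE`, and `LAZY_RE` are illustrative names:

```ruby
# Sketch: greedy versus non-greedy matching on the same string.
# BANG_TEXT, GREEDY_RE, and LAZY_RE are hypothetical names.
BANG_TEXT = "abc!def!ghi!"
GREEDY_RE = /.+!/
LAZY_RE   = /.+?!/

p BANG_TEXT[GREEDY_RE]       # "abc!def!ghi!"
p BANG_TEXT[LAZY_RE]         # "abc!"
p BANG_TEXT.scan(GREEDY_RE)  # ["abc!def!ghi!"]
p BANG_TEXT.scan(LAZY_RE)    # ["abc!", "def!", "ghi!"]
```

The greedy pattern consumes the whole string in one match, while the non-greedy one stops at each exclamation point and so finds three separate matches.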


In [None]:
# regex anchors & assertions (11.5.3)

In [None]:
# lookahead assertions

In [None]:
# lookbehind assertions

In [None]:
# conditional matches

In [None]:
# modifiers (11.5.4)

#### Converting strings to/from regexes

In [None]:
# string-to-regexp idioms (11.6.1)

In [None]:
# regex to string (11.6.2)

#### Common uses

In [41]:
# string#scan (11.7.1)
<<READ
scan goes from left to right, testing repeatedly for a match.
The results are returned in an array.
READ

# harvest all the digits in a string (p shows the array; puts would print one element per line):
p "testing 1 2 3 testing 4 5 6".scan(/\d/)

# parenthetical groups: each result is an array of that match's captures
str = "Leopold Auer was the teacher of Jascha Heifetz."
p violinists = str.scan(/([A-Z]\w+)\s+([A-Z]\w+)/)
    
# scan with a code block
str.scan(/([A-Z]\w+)\s+([A-Z]\w+)/) do |fname, lname|
    puts "#{lname}'s first name was #{fname}."
end

["1", "2", "3", "4", "5", "6"]
[["Leopold", "Auer"], ["Jascha", "Heifetz"]]
Auer's first name was Leopold.
Heifetz's first name was Jascha.


"Leopold Auer was the teacher of Jascha Heifetz."

In [None]:
# string#split (11.7.2)

In [None]:
# sub/sub!, gsub/gsub! (11.7.3)

In [42]:
# case equality and grep (11.7.4)