# Lecture 1 - Regular expressions

<img src="https://imgs.xkcd.com/comics/regular_expressions.png">

Regular expressions (regex) are a powerful tool for processing text data, especially for finding patterns and extracting items such as phone numbers.

The syntax can look like sheer magic, but if you are serious about data processing, you need to know your regex jujitsu:

<img src="https://github.com/kmisztal/PLN/raw/master/lab-01-regex/davechild_regex.png">

In [None]:
import re

text = "You only live once, but if you do it right, once is enough."

## Recipe 1

In [None]:
match = re.search(r'on\w+', text)

print(match.group())

only
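
Note that `re.search` returns `None` when nothing matches, so calling `.group()` on the result directly would raise an `AttributeError`. A minimal sketch of a guarded lookup follows; the helper name `first_match` is hypothetical and not part of the `re` module.

```python
import re

text = "You only live once, but if you do it right, once is enough."

def first_match(pattern, string):
    """Return the first matching substring, or None if there is no match."""
    match = re.search(pattern, string)
    return match.group() if match else None

print(first_match(r'on\w+', text))   # a match exists
print(first_match(r'tw\w+', text))   # no match, so None is returned
```

The second call returns `None` instead of failing, which is usually the safer behaviour when the pattern comes from user input or the text is not known in advance.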


## Recipe 2

In [None]:
matches = re.findall(r'on\w+', text)

print(matches)

['only', 'once', 'once']


## Recipe 3

In [None]:
for m in re.finditer(r'on\w+', text):
    print(m.group())
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

only
04-08: only
once
14-18: once
once
44-48: once


## Why r before string?

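Without the `r` prefix, Python processes backslash escapes in the string literal before the regex engine ever sees it. For example, `'\b'` becomes a single backspace character instead of the two-character word-boundary token `\b`, so the pattern silently matches nothing. Use a raw string (`r'...'`) or double every backslash.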

In [None]:
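str_1 = '\b\w+\b'      # not raw: '\b' is turned into a backspace character
str_2 = r'\b\w+\b'     # raw string: backslashes reach the regex engine intact
str_3 = '\\b\\w+\\b'   # doubled backslashes: same pattern as str_2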

print(re.findall(str_1, text))
print(re.findall(str_2, text)) 
print(re.findall(str_3, text)) 

[]
['You', 'only', 'live', 'once', 'but', 'if', 'you', 'do', 'it', 'right', 'once', 'is', 'enough']
['You', 'only', 'live', 'once', 'but', 'if', 'you', 'do', 'it', 'right', 'once', 'is', 'enough']
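
To see why the non-raw pattern returns an empty list, it can help to inspect the strings themselves rather than the regex results. The sketch below uses only `'\b'` (the variable names are illustrative) and compares the plain, raw, and double-backslash spellings.

```python
plain = '\b'       # one character: ASCII backspace (0x08)
raw = r'\b'        # two characters: a backslash followed by 'b'
escaped = '\\b'    # also two characters: the backslash is escaped

print(len(plain), len(raw), len(escaped))
print(plain == '\x08')
print(raw == escaped)
```

The raw and double-backslash forms are the same string, which is why they produce identical results; the plain form contains a control character that never occurs in the text.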


# Exercise 1 

Using regular expressions, write code that reads the text below and prints the name and phone number of each person in the format **name: telephone**.

In [None]:
txt_file = """Rose    416-333-4444    rose@someplace.com
Martha  905-888-1234    martha@hotmail.com
Donna   647-222-9876    donna@rogers.ca
Amy 905-777-2222    amy@gmail.com"""

In [None]:
print(txt_file.split("\n"))

['Rose    416-333-4444    rose@someplace.com', 'Martha  905-888-1234    martha@hotmail.com', 'Donna   647-222-9876    donna@rogers.ca', 'Amy 905-777-2222    amy@gmail.com']


In [None]:
for line in txt_file.split("\n"):
    # Capture the capitalized name and the digits-and-dashes phone number
    name, phone = re.search(r'\b([A-Z]\w+)\s+([0-9][-0-9]*)\b', line).groups()
    print('%s: %s' % (name, phone))


Rose: 416-333-4444
Martha: 905-888-1234
Donna: 647-222-9876
Amy: 905-777-2222
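
As an alternative sketch, the manual split can be skipped by anchoring the pattern to each line with `re.MULTILINE` and using named groups. The variable names `contacts` and `contact_re`, and the assumption that every phone number has the shape NNN-NNN-NNNN, are illustrative choices rather than part of the exercise.

```python
import re

# Same sample data as txt_file above, repeated so the block is self-contained.
contacts = """Rose    416-333-4444    rose@someplace.com
Martha  905-888-1234    martha@hotmail.com
Donna   647-222-9876    donna@rogers.ca
Amy 905-777-2222    amy@gmail.com"""

# With re.MULTILINE, '^' matches at the start of every line, not only the string.
contact_re = re.compile(
    r'^(?P<name>[A-Z]\w*)\s+(?P<phone>\d{3}-\d{3}-\d{4})',
    re.MULTILINE,
)

formatted = ['%s: %s' % (m.group('name'), m.group('phone'))
             for m in contact_re.finditer(contacts)]
print('\n'.join(formatted))
```

Named groups make the formatting step easier to read than positional indices, at the cost of a longer pattern.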


# Exercise 2

Give a pattern to match the following:

* Words that begin with an uppercase letter, followed by any number (including zero!) of lowercase letters.

* Two-word strings: "I" followed by a word that contains at least one of the letters "a", "i" or "e".



In [None]:
text_2 = """This tale grew in the telling, until it became a history of the Great War of
the Ring and included many glimpses of the yet more ancient history that
preceded it. It was begun soon after _The Hobbit_ was written and before its
publication in 1937; but I did not go on with this sequel, for I wished first
to complete and set in order the mythology and legends of the Elder Days,
which had then been taking shape for some years. I desired to do this for my
own satisfaction, and I had little hope that other people would be interested
in this work, especially since it was primarily linguistic in inspiration and
was begun in order to provide the necessary background of 'history' for Elvish
tongues. When those whose advice and opinion I sought corrected _little hope_ to
_no hope,_ I went back to the sequel, encouraged by requests from readers for
more information concerning hobbits and their adventures. But the story was
drawn irresistibly towards the older world, and became an account, as it were,
of its end and passing away before its beginning and middle had been told. The
process had begun in the writing of _The Hobbit,_ in which there were already
some references to the older matter: Elrond, Gondolin, the High-elves, and the
orcs, as well as glimpses that had arisen unbidden of things higher or deeper
or darker than its surface: Durin, Moria, Gandalf, the Necromancer, the Ring.
The discovery of the significance of these glimpses and of their relation to
the ancient histories revealed the Third Age and its culmination in the War of
the Ring.""".replace('\n', ' ')

In [None]:
pattern_1 = r'\b[A-Z][a-z]*\b'
pattern_2 = r'I [A-Za-z]*[aie]+[A-Za-z]*'

result_1 = re.findall(pattern_1, text_2)
result_2 = re.findall(pattern_2, text_2)

In [None]:
assert result_1 == ['This', 'Great', 'War', 'Ring', 'It', 'I', 'I', 'Elder',
                    'Days', 'I', 'I', 'Elvish', 'When', 'I', 'I', 'But', 'The',
                    'Hobbit', 'Elrond', 'Gondolin', 'High', 'Durin', 'Moria', 
                    'Gandalf', 'Necromancer', 'Ring', 'The', 'Third', 'Age',
                    'War', 'Ring']

assert result_2 == ['I did', 'I wished', 'I desired', 'I had', 'I went']
print("Well done!")

Well done!
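
One detail worth checking on small inputs: the expected result contains no match from `_The Hobbit_` and only `Hobbit` from `_The Hobbit,_`. The underscore counts as a word character (`\w`), so `\b` finds no boundary between `_` and a letter. The short sample strings below are made up for illustration.

```python
import re

pattern = r'\b[A-Z][a-z]*\b'

plain_title = re.findall(pattern, 'read The Hobbit today')
wrapped_title = re.findall(pattern, 'read _The Hobbit_ today')
comma_title = re.findall(pattern, 'in _The Hobbit,_ in which')

print(plain_title)     # both words are bounded by spaces
print(wrapped_title)   # underscores remove the word boundaries
print(comma_title)     # the comma restores a boundary after 'Hobbit'
```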


# Exercise 3

Write code that takes the given text and uses regex to:
- find all words starting with the letter "a" and return the longest one
- find all words ending with "ing" that are shorter than 5 letters
- return the number of 6-letter words in the document
- find all words starting with an uppercase letter, which is not "A" or "B"

In [None]:
text_3 = """King. He felt better at once,' said Gandalf. 'But there is only one Power in
this world that knows all about the Rings and their effects; and as far as I
know there is no Power in the world that knows all about hobbits. Among the
Wise I am the only one that goes in for hobbit-lore: an obscure branch of
knowledge, but full of surprises. Soft as butter they can be, and yet
sometimes as tough as old tree-roots. I think it likely that some would resist
the Rings far longer than most of the Wise would believe. I don't think you
need worry about Bilbo.
 'Of course, he possessed the ring for many years, and used it, so it
might take a long while for the influence to wear off – before it was safe for
him to see it again, for instance. Otherwise, he might live on for years,
quite happily: just stop as he was when he parted with it. For he gave it up
in the end of his own accord: an important point. No, I was not troubled about
dear Bilbo any more, once he had let the thing go. It is for _you_ that I feel
responsible.
 'Ever since Bilbo left I have been deeply concerned about you, and about
all these charming, absurd, helpless hobbits. It would be a grievous blow to
the world, if the Dark Power overcame the Shire; if all your kind, jolly,
stupid Bolgers, Hornblowers, Boffins, Bracegirdles, and the rest, not to
mention the ridiculous Bagginses, became enslaved.'
 Frodo shuddered. 'But why should we be?' he asked. 'And why should he
want such slaves?'
 'To tell you the truth,' replied Gandalf, 'I believe that hitherto –
_hitherto,_mark you – he has entirely overlooked the existence of hobbits. You
should be thankful. But your safety has passed. He does not need you – he has
many more useful servants – but he won't forget you again. And hobbits as
miserable slaves would please him far more than hobbits happy and free. There
is such a thing as malice and revenge.'
 'Revenge?' said Frodo. 'Revenge for what? I still don't understand what
all this has to do with Bilbo and myself, and our ring.'
 'It has everything to do with it,' said Gandalf. 'You do not know the
real peril yet; but you shall. I was not sure of it myself when I was last
here; but the time has come to speak. Give me the ring for a moment.
""".replace('\n', ' ')

In [None]:
pattern_1 = r'\ba\w+\b'
pattern_2 = r'\b[A-Za-z]{0,1}ing\b'
pattern_3 = r'\b[A-Za-z]{6}\b'
pattern_4 = r'\b[C-Z]\w*\b'

result_1 = re.findall(pattern_1, text_3)
result_2 = re.findall(pattern_2, text_3)
result_3 = re.findall(pattern_3, text_3)
result_4 = re.findall(pattern_4, text_3)

In [None]:
assert max(result_1, key=len) == 'accord'
assert result_2 == ['King', 'ring', 'ring', 'ring']
assert len(result_3) == 29
assert len(result_4) == 45
print("Well done!")

Well done!
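
The quantifier and case handling decide which words count as "starting with a". A hedged sketch on a made-up sample string: `\w+` requires at least one more character after the "a", `\w*` also accepts the one-letter word "a", and `re.IGNORECASE` additionally picks up capitalized words.

```python
import re

sample = "Among all a an And"

two_or_more = re.findall(r'\ba\w+\b', sample)
one_or_more = re.findall(r'\ba\w*\b', sample)
any_case = re.findall(r'\ba\w*\b', sample, flags=re.IGNORECASE)

print(two_or_more)
print(one_or_more)
print(any_case)
```

Which variant is right depends on how the exercise defines a word; the solution above uses the `\w+` form with a lowercase "a" only.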
