# Regular expressions

> Some people, when confronted with a problem, think “I know, I'll use regular expressions.”   
> Now they have two problems.

## A bit of theory

- Regular expressions make it possible to describe precisely the set of text strings that correspond to the expression or phenomenon being searched for.
- Put simply, it means inserting certain special characters with a special meaning into the words we want to find.
- They are used in many programming languages (especially scripting languages such as Python, PHP, Perl, and JavaScript).
- Conceptually, a computer evaluates a regular expression by constructing a finite automaton (a very simple computer that can be in one of several states and moves between them based on the symbols it reads from the input).
- The automaton works in steps: in each step it reads one character from the input.
- If the current state has a transition corresponding to the character read, the automaton moves to the new state and continues with the next character. Otherwise it reinitializes and reads the next character.
- It repeats this until it has read the whole input or ends in an accepting state.

(www\.[a-z]*\.(cz|sk))

<img src="img/automat.png" width=600 height=600 />
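
The step-by-step behaviour described above can be sketched with a small hand-written automaton. This is only an illustration under stated assumptions: the pattern ab+c, the state numbering, and the names TRANSITIONS, ACCEPT and automaton_finds are invented for this example, and Python's re module is not implemented this way internally. One small deviation from the description: after a failed transition the sketch retries the current character from the start state, so that an input such as aabc is not missed.

```python
# Hypothetical automaton for the pattern ab+c (illustration only).
# States: 0 = start, 1 = seen "a", 2 = seen "a" plus at least one "b",
# 3 = accepting state.
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 2,
    (2, "b"): 2,
    (2, "c"): 3,
}
ACCEPT = 3


def automaton_finds(text):
    """Return True if some substring of text matches ab+c."""
    state = 0
    for ch in text:
        next_state = TRANSITIONS.get((state, ch))
        if next_state is None:
            # No transition from the current state: reinitialize and
            # retry this character from the start state.
            next_state = TRANSITIONS.get((0, ch), 0)
        state = next_state
        if state == ACCEPT:
            return True
    return False


print(automaton_finds("xxabbbc"))  # True
print(automaton_finds("aabc"))     # True
print(automaton_finds("ac"))       # False
```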

## Uses

- parsing user input
- parsing text files
- finding text in emails
- reading configuration files
- search and replace strings in general
- etc ...

## Overview of the most important shortcuts

- see snippets.txt

## 1. Intro

*The first thing to recognize when using regular expressions is that everything is essentially a character, and we are writing patterns to match a specific sequence of characters (also known as a string). Most patterns use normal ASCII, which includes letters, digits, punctuation and other symbols on your keyboard like %#$@!, but unicode characters can also be used to match any type of international text.*

In [1]:
import re

In [2]:
text = '''
abcdefg
abcde
abc
'''

**re.match(pattern, string)** checks for a match only at the beginning of the string. If the pattern matches there, it returns a match object; if the pattern occurs only later in the string (for example on another line), it returns None. Here text begins with a newline, so the call below returns None and the cell shows no output.

In [3]:
match = re.match(r'abc' , text)
match

**re.search(pattern, string)** scans the whole string and returns a match object for the first location where the pattern matches. Unlike re.match(), it is not restricted to the beginning of the string. It returns None if the pattern is not found.

In [4]:
match = re.search(r'abc' , text)
match

<re.Match object; span=(1, 4), match='abc'>
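
A minimal sketch contrasting the two functions. The strings sample and shifted and the variable names are made up for this illustration; only re.match and re.search come from the text above.

```python
import re

sample = "abc at the start"       # hypothetical sample strings
shifted = "\nabc after a newline"

at_start = re.match(r"abc", sample)   # pattern sits at position 0
late = re.match(r"abc", shifted)      # string starts with "\n", so no match
found = re.search(r"abc", shifted)    # search scans the whole string

print(at_start)
print(late)
print(found)
```

Because both functions return None on failure, the result is usually tested with `if found:` before calling methods on it.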

In [9]:
match.group()

'abc'

In [12]:
match.span()

(1, 4)

**re.findall(pattern, string)** returns all non-overlapping matches of the pattern as a list of strings, whereas search() returns only the first match. It scans the whole string from left to right in a single call. If the pattern contains capture groups, the list holds the captured groups instead of the whole matches.

In [5]:
matches = re.findall(r'abc' , text)
print(matches)
for match in matches:
    print(match)

['abc', 'abc', 'abc']
abc
abc
abc


**re.finditer(pattern, string)** finds the same non-overlapping matches as re.findall(), but instead of a list of strings it returns an iterator that yields match objects. The string is scanned from left to right and the matches are yielded in the order found.

In [13]:
matches = re.finditer(r'abc' , text)
for match in matches:
    print(match)

<re.Match object; span=(1, 4), match='abc'>
<re.Match object; span=(9, 12), match='abc'>
<re.Match object; span=(15, 18), match='abc'>
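
Since re.finditer yields match objects, each match carries its position as well as its text. A small sketch, assuming a made-up sample string and variable names:

```python
import re

sample = "abcdefg\nabcde\nabc"  # hypothetical sample text

# Collect (start, end, matched text) for every match.
positions = [(m.start(), m.end(), m.group()) for m in re.finditer(r"abc", sample)]
print(positions)  # [(0, 3, 'abc'), (8, 11, 'abc'), (14, 17, 'abc')]
```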


**re.sub(pattern, repl, string)** returns the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string with the replacement repl. If the pattern isn't found, string is returned unchanged. The original string is never modified, as printing text afterwards shows.

In [14]:
match = re.sub(r'abc', r'xyz', text)
print(match)


xyzdefg
xyzde
xyz



In [15]:
print(text)


abcdefg
abcde
abc



## 2. Digits

*Over the various examples, you will be introduced to a number of special metacharacters used in regular expressions that can be used to match a specific type of character. In this case, the character \d can be used in place of any digit from 0 to 9. The preceding backslash distinguishes it from the simple d character and indicates that it is a metacharacter.*

In [16]:
text="""
abc123xyz
define "123"
var g = 123;
blablabla123bla
"""

In [19]:
matches = re.findall(r'\d' , text, re.MULTILINE)
print(matches)
for match in matches:
    print(match)

['1', '2', '3', '1', '2', '3', '1', '2', '3', '1', '2', '3']
1
2
3
1
2
3
1
2
3
1
2
3


In [18]:
matches = re.findall(r'\d+' , text, re.MULTILINE)
print(matches)
for match in matches:
    print(match)

['123', '123', '123', '123']
123
123
123
123


## 3. Dot wildcard

*There is also a concept of a wildcard, which is represented by the . (dot) metacharacter, and can match any single character (letter, digit, whitespace, everything except, by default, a newline). You may notice that this actually overrides the matching of the period character, so in order to specifically match a period, you need to escape the dot with a backslash, writing \. instead.*

cat. - MATCH  
896. - MATCH  
?=+. - MATCH  
abc1 - SKIP

In [None]:
text="""
cat.
896.
?=+.
abc1
"""

In [None]:
matches = re.findall(r'\.' , text, re.MULTILINE)
for match in matches:
    print(match)
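
The cell above matches only the literal periods. As a hedged sketch of the wildcard itself, the pattern below combines three dots (any three characters) with an escaped period; the variable names are made up, and the sample string simply rebuilds the four lines listed above.

```python
import re

sample = "cat.\n896.\n?=+.\nabc1"  # the four sample lines from above

# Three arbitrary characters followed by a literal period.
matches = re.findall(r"...\.", sample)
print(matches)  # ['cat.', '896.', '?=+.']
```

The last line is skipped because abc1 contains no period, and the dot never consumes the newline between lines.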

## 4. Matching specific characters

*There is a method for matching specific characters using regular expressions, by defining them inside square brackets. For example, the pattern [abc] will only match a single a, b, or c letter and nothing else.*

can - MATCH  
man - MATCH  
fan - MATCH  
dan - SKIP  
ran - SKIP  
pan - SKIP  

In [20]:
text="""
can
man
fan
dan
ran
pan
"""

In [22]:
matches = re.findall(r'[cmf]an' , text, re.MULTILINE)
for match in matches:
    print(match)

can
man
fan


## 5. Excluding characters

*We use a similar expression that excludes specific characters using the square brackets and the ^ (hat). For example, the pattern [^abc] will match any single character except for the letters a, b, or c.*

hog - MATCH  
dog - MATCH  
bog - SKIP

In [24]:
text="""
hog
dog
bog
"""

In [26]:
matches = re.findall(r'[^b]og' , text, re.MULTILINE)
for match in matches:
    print(match)

hog
dog


## 6. Matching ranges

*Luckily, when using the square bracket notation, there is a shorthand for matching a character in list of sequential characters by using the dash to indicate a character range. For example, the pattern [0-6] will only match any single digit character from zero to six, and nothing else. And likewise, [^n-p] will only match any single character except for letters n to p.*  

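*Multiple character ranges can also be used in the same set of brackets, along with individual characters. An example of this is the alphanumeric \w metacharacter, which for ASCII text is equivalent to the character range [A-Za-z0-9_] (note the underscore) and is often used to match characters in English text. In Python 3, \w in a str pattern also matches letters and digits from other scripts unless the re.ASCII flag is set.*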

Ana - MATCH  
Bob - MATCH  
Cpc - MATCH  
aax - SKIP  
bby - SKIP  
ccz - SKIP  

In [43]:
text="""
Ana
ana
Bna
Bob
Cpc
aax
bby
ccz
"""

In [44]:
matches = re.findall(r'Ana' , text, re.MULTILINE)
for match in matches:
    print(match)

Ana


In [45]:
matches = re.findall(r'Ana' , text, re.MULTILINE | re.IGNORECASE)
for match in matches:
    print(match)

Ana
ana


In [46]:
matches = re.findall(r'[A-C]na' , text, re.MULTILINE)
for match in matches:
    print(match)

Ana
Bna


In [47]:
matches = re.findall(r'[A-C]\w+' , text, re.MULTILINE)
for match in matches:
    print(match)

Ana
Bna
Bob
Cpc
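
A short sketch of how \w differs from a plain [A-Za-z0-9] range. The sample string and variable names are invented for the illustration.

```python
import re

sample = "snake_case var2 a-b"  # hypothetical sample string

words = re.findall(r"\w+", sample)                # \w includes the underscore
ascii_only = re.findall(r"[A-Za-z0-9]+", sample)  # the underscore splits the word

print(words)       # ['snake_case', 'var2', 'a', 'b']
print(ascii_only)  # ['snake', 'case', 'var2', 'a', 'b']
```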


## 7. Repetitions part 1

*We've so far learned how to specify the range of characters we want to match, but how about the number of repetitions of characters that we want to match? One way that we can do this is to explicitly spell out exactly how many characters we want, eg. \d\d\d which would match exactly three digits.*

*A more convenient way is to specify how many repetitions of each character we want using the curly braces notation. For example, a{3} will match the a character exactly three times. Certain regular expression engines will even allow you to specify a range for this repetition such that a{1,3} will match the a character no more than 3 times, but no less than once for example.*

*This quantifier can be used with any character, or special metacharacters, for example w{3} (three w's), [wxy]{5} (five characters, each of which can be a w, x, or y) and .{2,6} (between two and six of any character).*

wazzzzzup - MATCH  
wazzzup - MATCH  
wazup - SKIP  

In [48]:
text="""
wazzzzzup
wazzzup
wazup
"""

In [53]:
matches = re.findall(r'waz{3,5}up' , text, re.MULTILINE)
for match in matches:
    print(match)

wazzzzzup
wazzzup
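
The other quantifier forms mentioned above can be checked the same way. This is a small sketch with made-up inputs; re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

# a{1,3}: between one and three a's, nothing else.
results = {s: bool(re.fullmatch(r"a{1,3}", s)) for s in ["a", "aaa", "aaaa"]}
print(results)  # {'a': True, 'aaa': True, 'aaaa': False}

# [wxy]{5}: exactly five characters, each one of w, x or y.
five = re.fullmatch(r"[wxy]{5}", "wxyxw")
print(five is not None)  # True

# .{2,6}: greedy, so it takes up to six characters at a time.
chunks = re.findall(r".{2,6}", "abcdefgh")
print(chunks)  # ['abcdef', 'gh']
```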


## 8. Repetitions part 2

*Other useful tools are the Kleene star and the Kleene plus, which represent either 0 or more or 1 or more of the character that they follow (they always follow a character or group). For example, we can use the pattern \d\* to match any number of digits, but a tighter regular expression would be \d+ which ensures that the input string has at least one digit.*

*These quantifiers can be used with any character or special metacharacters, for example a+ (one or more a's), [abc]+ (one or more of any a, b, or c character) and .\* (zero or more of any character).*

aaaabcc - MATCH  
aabbbbc - MATCH  
aacc - MATCH  
a - SKIP  

In [62]:
text="""
aaaabcc
aabbbbc
aacc
a
"""

In [63]:
matches = re.findall(r'a+b*c+' , text, re.MULTILINE)
for match in matches:
    print(match)

aaaabcc
aabbbbc
aacc
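
The difference between the star and the plus shows up on an empty input. A minimal sketch with hypothetical variable names and sample text:

```python
import re

star = re.fullmatch(r"\d*", "")  # zero digits is acceptable for *
plus = re.fullmatch(r"\d+", "")  # + needs at least one digit
digits = re.findall(r"\d+", "room 101, floor 7")

print(star is not None)  # True
print(plus is not None)  # False
print(digits)            # ['101', '7']
```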


## 9. Optional character

*Another quantifier that is really common when matching and extracting text is the ? (question mark) metacharacter which denotes optionality. This metacharacter allows you to match either zero or one of the preceding character or group. For example, the pattern ab?c will match either the strings "abc" or "ac" because the b is considered optional.*

*Similar to the dot metacharacter, the question mark is a special character and you will have to escape it using a backslash \? to match a plain question mark character in a string.*

1 file found? - MATCH  
2 files found? - MATCH  
24 files found? - MATCH  
No files found. - SKIP

In [64]:
text=""" 
1 file found?
2 files found?
24 files found?
No files found.
"""

In [74]:
matches = re.findall(r'.+\?' , text, re.MULTILINE)
for match in matches:
    print(match)

1 file found?
2 files found?
24 files found?


In [76]:
matches = re.findall(r'\d.+' , text, re.MULTILINE)
for match in matches:
    print(match)

1 file found?
2 files found?
24 files found?
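
The two cells above select the lines without using the optional quantifier itself. One possible sketch that does use it, with the sample lines copied from above and a made-up variable name:

```python
import re

sample = """1 file found?
2 files found?
24 files found?
No files found."""

# s? makes the plural "s" optional; \? matches the literal question mark.
matches = re.findall(r"\d+ files? found\?", sample)
print(matches)  # ['1 file found?', '2 files found?', '24 files found?']
```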


## 10. Whitespaces

*The most common forms of whitespace you will use with regular expressions are the space (␣), the tab (\t), the new line (\n) and the carriage return (\r) (useful in Windows environments), and these special characters match each of their respective whitespaces. In addition, a whitespace special character \s will match any of the specific whitespaces above and is extremely useful when dealing with raw input text.*

1. abc - MATCH  
2.        abc - MATCH  
3.           abc - MATCH  
4\.abc - SKIP

In [77]:
text="""
1. abc 
2.    abc
3.           abc
4.abc
"""

In [81]:
matches = re.findall(r'\d\. .+' , text, re.MULTILINE)
for match in matches:
    print(match)

1. abc 
2.    abc
3.           abc


In [84]:
matches = re.findall(r'\d\.\s.+' , text, re.MULTILINE)
for match in matches:
    print(match)

1. abc 
2.    abc
3.           abc
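
Because \s covers tabs as well as spaces, a pattern written with it also accepts tab-separated input. A short sketch; the sample string, including the tab on the second line, is an assumption made for this example.

```python
import re

sample = "1. abc\n2.\tabc\n3.      abc\n4.abc"  # second line uses a tab

# \s+ requires at least one whitespace character after the period.
matches = re.findall(r"\d\.\s+abc", sample)
print(len(matches))  # 3
```

The fourth line is skipped because nothing separates the period from abc.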


## 11. Starting and ending

Mission: successful - MATCH  
Last Mission: unsuccessful - SKIP  
Next Mission: successful upon capture of target - SKIP  

In [85]:
text="""
Mission: successful
Last Mission: unsuccessful
Next Mission: successful upon capture of target
"""

In [88]:
matches = re.findall(r'^M.+' , text, re.MULTILINE)
for match in matches:
    print(match)

Mission: successful


In [91]:
matches = re.findall(r'.+\ssuccessful$' , text, re.MULTILINE)
for match in matches:
    print(match)

Mission: successful


In [93]:
matches = re.findall(r'.+ successful$' , text, re.MULTILINE)
for match in matches:
    print(match)

Mission: successful
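
The cells above rely on the anchors ^ (start) and $ (end). In Python they refer to the start and end of the whole string unless re.MULTILINE is passed, in which case they also match at the start and end of every line. A minimal sketch with a made-up three-line string:

```python
import re

sample = "one\ntwo\nthree"  # hypothetical sample text

start_plain = re.findall(r"^\w+", sample)               # only the first line
start_multi = re.findall(r"^\w+", sample, re.MULTILINE) # every line
end_plain = re.findall(r"\w+$", sample)                 # only the last line

print(start_plain)  # ['one']
print(start_multi)  # ['one', 'two', 'three']
print(end_plain)    # ['three']
```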


## 12. Match groups

*Regular expressions allow us to not just match text but also to extract information for further processing. This is done by defining groups of characters and capturing them using the special parentheses ( and ) metacharacters. Any subpattern inside a pair of parentheses will be captured as a group. In practice, this can be used to extract information like phone numbers or emails from all sorts of data.*

*Imagine for example that you had a command line tool to list all the image files you have in the cloud. You could then use a pattern such as ^(IMG\d+\.png)\$ to capture and extract the full filename, but if you only wanted to capture the filename without the extension, you could use the pattern ^(IMG\d+)\.png\$ which only captures the part before the period.*

file_record_transcript.pdf - CAPTURE: file_record_transcript  
file_07241999.pdf - CAPTURE: file_07241999  
testfile_fake.pdf.tmp - SKIP

In [94]:
text="""
file_record_transcript.pdf
file_07241999.pdf
testfile_fake.pdf.tmp
"""

In [97]:
matches = re.findall(r'.+\.pdf$' , text, re.MULTILINE)
for match in matches:
    print(match)

file_record_transcript.pdf
file_07241999.pdf


In [99]:
# with (), findall returns only the part inside the parentheses
matches = re.findall(r'(.+)\.pdf$' , text, re.MULTILINE)
for match in matches:
    print(match)

file_record_transcript
file_07241999
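
With re.search, the captured part is read from the match object rather than from a list. A sketch using the IMG pattern from the text; the file name IMG0042.png is a made-up example.

```python
import re

m = re.search(r"^(IMG\d+)\.png$", "IMG0042.png")  # hypothetical file name

whole = m.group(0)  # the entire match
name = m.group(1)   # only the first capture group

print(whole)  # IMG0042.png
print(name)   # IMG0042
```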


## 13. Nested groups

*When you are working with complex data, you can easily find yourself having to extract multiple layers of information, which can result in nested groups. Generally, the results of the captured groups are in the order in which they are defined (in order by open parenthesis).*

*Take the example from the previous lesson, of capturing the filenames of all the image files you have in a list. If each of these image files had a sequential picture number in the filename, you could extract both the filename and the picture number using the same pattern by writing an expression like ^(IMG(\d+))\.png\$ (using a nested parenthesis to capture the digits).*

Jan 1987 - CAPTURE: Jan 1987, 1987  
May 1969 - CAPTURE: May 1969, 1969  
Aug 2011 - CAPTURE: Aug 2011, 2011

In [124]:
text="""
Jan 1987
May 1969
Aug 2011
"""

In [127]:
matches = re.findall(r'(.+ (\d+$))' , text, re.MULTILINE)
for match in matches:
    print(match)

('Jan 1987', '1987')
('May 1969', '1969')
('Aug 2011', '2011')
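
The nested IMG pattern from the text can be tried directly. Groups are numbered by the position of their opening parenthesis, so the outer group comes first. The file name is again a made-up example.

```python
import re

m = re.search(r"^(IMG(\d+))\.png$", "IMG0042.png")  # hypothetical file name

groups = m.groups()  # all capture groups, outer group first
print(groups)  # ('IMG0042', '0042')
```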


## 14. OR

*Specifically when using groups, you can use the | (logical OR, aka. the pipe) to denote different possible sets of characters. For example, you can write the pattern "Buy more (milk|bread|juice)" to match only the strings Buy more milk, Buy more bread, or Buy more juice.*

*Like normal groups, you can use any sequence of characters or metacharacters in a condition, for example, ([cb]ats\*|[dh]ogs?) would match cats or bats, or dogs or hogs. Writing patterns with many conditions can be hard to read, so you should consider making them separate patterns if they get too complex.*

I love cats - MATCH  
I love dogs - MATCH  
I love logs - SKIP  
I love cogs - SKIP

In [129]:
text="""
I love cats
I love dogs
I love logs
I love cogs
"""

In [130]:
matches = re.findall(r'(I love (cats|dogs))' , text, re.MULTILINE)
for match in matches:
    print(match)

('I love cats', 'cats')
('I love dogs', 'dogs')


In [133]:
matches = re.findall(r'I love cats|I love dogs' , text, re.MULTILINE)
for match in matches:
    print(match)

I love cats
I love dogs


In [134]:
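# Pitfall: square brackets define a character set, not alternation, so this
# matches "I love " followed by ONE of the characters c, a, t, s, |, d, o, g
matches = re.findall(r'I love [cats|dogs]' , text, re.MULTILINE)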
for match in matches:
    print(match)

I love c
I love d
I love c
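
When only the whole match is wanted, a non-capturing group (?:...) keeps the alternation without changing what findall returns. A small sketch based on the "Buy more" pattern from the text; the sample string and variable names are assumptions.

```python
import re

sample = "Buy more milk\nBuy more bread\nBuy more soap"  # hypothetical input

captured = re.findall(r"Buy more (milk|bread|juice)", sample)  # returns the group
whole = re.findall(r"Buy more (?:milk|bread|juice)", sample)   # returns full matches

print(captured)  # ['milk', 'bread']
print(whole)     # ['Buy more milk', 'Buy more bread']
```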


# Problems

## 1. Problem

3.14529 - MATCH  
-255.34 - MATCH  
128 - MATCH  
1.9e10 - MATCH  
123,340.00 - MATCH  
720p - SKIP

In [135]:
text="""
3.14529
-255.34
128
1.9e10
123,340.00
720p
"""

In [144]:
matches = re.findall(r'.+\d$' , text, re.MULTILINE)
for match in matches:
    print(match)

3.14529
-255.34
128
1.9e10
123,340.00


## 2. Problem

415-555-1234 - CAPTURE: 415  
650-555-2345 - CAPTURE: 650  
(416)555-3456 - CAPTURE: 416  
202 555 4567 - CAPTURE: 202  
4035555678 - CAPTURE: 403  
1 416 555 9292 - CAPTURE: 416

In [145]:
text="""
415-555-1234
650-555-2345
(416)555-3456
202 555 4567
4035555678
1 416 555 9292
"""

In [162]:
matches = re.findall(r'^\d{3}|416', text, re.MULTILINE)
for match in matches:
    print(match)

415
650
416
202
403
416
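
The pattern in the cell above works on this data only because the two area codes that are not at the very start of a line both happen to be 416. One possible more general sketch, assuming the optional prefix is always "1" followed by whitespace and that the area code may be wrapped in parentheses:

```python
import re

numbers = """415-555-1234
650-555-2345
(416)555-3456
202 555 4567
4035555678
1 416 555 9292"""

# Optional "1 " country prefix, optional "(", then capture three digits.
area_codes = re.findall(r"^(?:1\s)?\(?(\d{3})", numbers, re.MULTILINE)
print(area_codes)  # ['415', '650', '416', '202', '403', '416']
```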


## 3. Problem

tom@hogwarts.com - CAPTURE: tom  
tom.riddle@hogwarts.com - CAPTURE: tom.riddle  
tom.riddle+regexone@hogwarts.com - CAPTURE: tom.riddle  
tom@hogwarts.eu.com - CAPTURE: tom  
potter@hogwarts.com - CAPTURE: potter  
harry@hogwarts.com - CAPTURE: harry  
hermione+regexone@hogwarts.com - CAPTURE: hermione

In [183]:
text="""
tom@hogwarts.com
tom.riddle@hogwarts.com
tom.riddle+regexone@hogwarts.com
tom@hogwarts.eu.com
potter@hogwarts.com
harry@hogwarts.com
hermione+regexone@hogwarts.com
"""

In [184]:
matches = re.findall(r'^([\w.]+)(?:\+\w+)?@' , text, re.MULTILINE)
for match in matches:
    print(match)

tom
tom.riddle
tom.riddle
tom
potter
harry
hermione


## 4. Problem

.bash_profile - SKIP  
workspace.doc - SKIP  
img0912.jpg - CAPTURE: img0912 jpg  
updated_img0912.png - CAPTURE: updated_img0912 png   
documentation.html - SKIP  
favicon.gif - CAPTURE: favicon gif  
img0912.jpg.tmp - SKIP  
access.lock - SKIP

In [179]:
text="""
.bash_profile
workspace.doc
img0912.jpg
updated_img0912.png
documentation.html
favicon.gif
img0912.jpg.tmp
access.lock
"""

In [180]:
matches = re.findall(r'^(\w+)\.(jpg|png|gif)$' , text, re.MULTILINE)
for match in matches:
    print(match)

('img0912', 'jpg')
('updated_img0912', 'png')
('favicon', 'gif')


## 5. Problem

       The quick brown fox... - CAPTURE:The quick brown fox...
 jumps over the lazy dog. - CAPTURE:jumps over the lazy dog.

In [189]:
text="""
       The quick brown fox...
 jumps over the lazy dog.
"""

In [190]:
matches = re.findall(r'^\s*(.+)$' , text, re.MULTILINE)
for match in matches:
    print(match)

The quick brown fox...
jumps over the lazy dog.


## 6. Problem

W/dalvikvm( 1553): threadid=1: uncaught exception - SKIP  
E/( 1553): FATAL EXCEPTION: main - SKIP  
E/( 1553): java.lang.StringIndexOutOfBoundsException - SKIP  
E/( 1553):   at widget.List.makeView(ListView.java:1727) - CAPTURE: makeView, ListView.java, 1727  
E/( 1553):   at widget.List.fillDown(ListView.java:652) - CAPTURE: fillDown, ListView.java, 652  
E/( 1553):   at widget.List.fillFrom(ListView.java:709) - CAPTURE: fillFrom, ListView.java, 709

In [231]:
text="""
W/dalvikvm( 1553): threadid=1: uncaught exception
E/( 1553): FATAL EXCEPTION: main2
E/( 1553): java.lang.StringIndexOutOfBoundsException
E/( 1553): at widget.List.makeView(ListView.java:1727)
E/( 1553): at widget.List.fillDown(ListView.java:652)
E/( 1553): at widget.List.fillFrom(ListView.java:709)
"""

In [232]:
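matches = re.findall(r'(\w+)\(([\w.]+):(\d+)\)' , text, re.MULTILINE)
for match in matches:
    print(match)

('makeView', 'ListView.java', '1727')
('fillDown', 'ListView.java', '652')
('fillFrom', 'ListView.java', '709')
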

## 7. Problem

ftp://file_server.com:21/top_secret/life_changing_plans.pdf CAPTURE: ftp, file_server.com, 21  
https://regexone.com/lesson/introduction#section CAPTURE: https, regexone.com  
file://localhost:4040/zip_file CAPTURE: file, localhost, 4040  
https://s3cur3-server.com:9999/ CAPTURE: https, s3cur3-server.com, 9999  
market://search/angry%20birds CAPTURE: market, search

In [243]:
text="""
ftp://file_server.com:21/top_secret/life_changing_plans.pdf
https://regexone.com/lesson/introduction#section
file://localhost:4040/zip_file
https://s3cur3-server.com:9999/
market://search/angry%20birds
"""

In [250]:
# groups: protocol, host, optional port (empty string when no port is given)
matches = re.findall(r'^(\w+)://([\w\-.]+)(?::(\d+))?' , text, re.MULTILINE)
for match in matches:
    print(match)

('ftp', 'file_server.com', '21')
('https', 'regexone.com', '')
('file', 'localhost', '4040')
('https', 's3cur3-server.com', '9999')
('market', 'search', '')


## Tools

- https://www.debuggex.com/
- https://regex101.com/
- https://regexone.com/