## 1. REGULAR EXPRESSIONS: the re module
This module provides pattern searching with regular expressions.
### 1.1 re.match()
re.match(w1, w2) returns a match object if the pattern "w1" matches at the beginning of w2, and None otherwise.

In [2]:
import random
import string
import re
import numpy as np

In [3]:
# this method returns a random string with a given length
def random_string(str_len=10, all_lower=False, all_upper=False, characters=None):
    assert not all_lower or not all_upper
    
    if characters is None:
        characters = string.ascii_letters
    
    ran_str =  "".join([str(random.choice(characters)) for _ in range(str_len)])
    
    if all_lower :
        return ran_str.lower()
    if all_upper:
        return ran_str.upper() 
    
    return ran_str


In [4]:
print(re.match('hedge', 'hedgehog') is None) # False: a match object is returned, so "hedgehog" starts with "hedge"

False
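
As a small illustration of that behaviour, the sketch below contrasts re.match with re.search on the same word. The variable names are made up for this example.

```python
import re

# re.match only tries the pattern at position 0 of the string
at_start = re.match("hedge", "hedgehog")
not_at_start = re.match("hog", "hedgehog")

# re.search scans the whole string for the first occurrence
found_later = re.search("hog", "hedgehog")

print(at_start is not None)    # True: "hedgehog" starts with "hedge"
print(not_at_start is None)    # True: "hedgehog" does not start with "hog"
print(found_later.span())      # (5, 8)
```

Because a failed match gives None and a successful one gives a truthy match object, the result can be used directly in an `if` statement.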


### 1.2 Special characters
#### 1.2.1 Dot character
The dot matches any single character except the newline character "\n". To make it match newlines as well, pass the re.DOTALL flag.
#### 1.2.2 The ? character
When written after a character, it makes that character optional: it may appear once or not at all.

In [5]:
regexp = "ayhem .?.?"

evaluation = [(re.match(regexp, "ayhem " + random_string(2)) is not None) for _ in range(50)]
print(np.array(evaluation).all()) # this evaluates to True

True
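
A minimal sketch of both special characters, using made-up sample strings. It checks how the dot treats a newline and how "?" makes a character optional.

```python
import re

# the dot matches any single character except the newline "\n"
dot_matches_letter = re.match("a.c", "abc") is not None
dot_matches_newline = re.match("a.c", "a\nc") is not None
# the re.DOTALL flag lifts that restriction
dot_matches_newline_dotall = re.match("a.c", "a\nc", flags=re.DOTALL) is not None

# "u?" makes the "u" optional, so both spellings match completely
optional_present = re.fullmatch("colou?r", "colour") is not None
optional_absent = re.fullmatch("colou?r", "color") is not None

print(dot_matches_letter, dot_matches_newline, dot_matches_newline_dotall)  # True False True
print(optional_present, optional_absent)                                    # True True
```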


[reference](https://hyperskill.org/learn/step/9468)

## 1.3 Escaping Characters
### 1.3.1 Backslashes
Sometimes we want a special character in the regular expression to keep its literal meaning. Prefixing it with a backslash "\" does this.

In [6]:
# raw strings (see 1.3.2) stop Python from warning about the invalid escape sequences "\." and "\?"
escaped_point = r"\."
print(re.match(escaped_point, ".") is not None) # True
print(re.match(escaped_point, "a") is not None) # False

escaped_question_mark = r"\?"
print(re.match(escaped_question_mark, "") is not None) # False
print(re.match(escaped_question_mark, "?") is not None) # True

True
False
False
True


Since patterns with many "\" quickly become confusing, Python offers two helpful features:
1. the r prefix
2. re.escape
### 1.3.2 r prefix
r'string' asks Python to treat each "\" literally, so backslashes reach the regex engine unchanged and escape only regex special characters:
* r'\t' is the two characters "\" and "t" (the same as '\\t'), not the tab character
* r'\.' denotes a literal dot and is equivalent to '\\.'
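
The two bullet points can be checked directly. This is a small sketch with variable names chosen only for illustration.

```python
import re

# a raw string keeps the backslash: two characters instead of one tab character
raw_len = len(r"\t")       # backslash + "t"
plain_len = len("\t")      # a single tab character

# the raw form and the doubled-backslash form are the same string
same_pattern = r"\." == "\\."

# r"\." matches a literal dot only, not any character
literal_dot = re.match(r"\.", ".") is not None
any_char = re.match(r"\.", "a") is not None

print(raw_len, plain_len)      # 2 1
print(same_pattern)            # True
print(literal_dot, any_char)   # True False
```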
### 1.3.3 re.escape
re.escape(string) returns a copy of string in which every regex special character is escaped, so the whole string is matched literally.

In [7]:
a = re.escape("sh.t")
print(re.match(a, "shit") is not None)

False
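
re.escape is mostly useful when the text to search for comes from somewhere else and may contain metacharacters. A minimal sketch follows, where `user_text` is a hypothetical input invented for this example.

```python
import re

user_text = "1+1=2?"              # hypothetical text full of regex metacharacters
escaped = re.escape(user_text)    # every special character gets a backslash

# used as a raw pattern, "+" and "?" act as operators, so the text does not match itself
unescaped_matches_itself = re.fullmatch(user_text, user_text) is not None
# ...but it does match "11=2" (one or more "1", then "1", "=", optional "2")
unescaped_matches_other = re.fullmatch(user_text, "11=2") is not None

# once escaped, the pattern matches exactly the original text
escaped_matches_itself = re.fullmatch(escaped, user_text) is not None

print(unescaped_matches_itself, unescaped_matches_other, escaped_matches_itself)  # False True True
```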


[reference](https://hyperskill.org/learn/step/9754)

## 1.4 Regexp Sets and Ranges 
A set is written between square brackets, \[symbols\], and matches any single character that belongs to it.

In [8]:
template = r'[123][456][789]'
print(re.match(template, "147" ) is not None)
print(re.match(template, "159" ) is not None)
print(re.match(template, "169" ) is not None)
print(re.match(template, "139" ) is not None)

True
True
True
False


### 1.4.1 Escaping in sets
Sets treat most special characters literally, with the exception of
* \: backslash
* ]
* ^: negates the set when it is the first character
* -: defines a range when placed between two characters
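
A short sketch of how metacharacters behave inside a set. The sample strings are made up, and the caret and hyphen cases are included because their meaning depends on their position in the set.

```python
import re

# inside a set, ".", "?" and "+" lose their special meaning
punctuation = re.findall(r"[.?+]", "a.b?c+d")

# a closing bracket or a backslash must be escaped to be a member of the set
brackets_and_backslash = re.findall(r"[\]\\]", r"x]y\z")

# "^" is literal when it is not first, and "-" is literal when it is last
caret_and_dash = re.findall(r"[a^-]", "a^-b")

print(punctuation)              # ['.', '?', '+']
print(brackets_and_backslash)   # [']', '\\']
print(caret_and_dash)           # ['a', '^', '-']
```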
### 1.4.2 Ranges
We can specify a range of acceptable values inside a set:

In [9]:
r1 = "[0-9]"
random_num = "".join(random.sample(string.digits, random.randint(1, 10)))
print(random_num)
print(re.match(r1, random_num) is not None)
r2 = "[0-9][a-z]"
print(re.match(r2, "0t") is not None)
print(re.match(r2, "000000z")is not None)

634105789
True
True
False


### 1.4.3 Exclusion of sets
A set can also exclude characters: add the symbol "^" right after the opening bracket. For example, \[^0-9\] matches any single character that is not a digit.
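
A minimal sketch of an excluding set, with made-up sample strings:

```python
import re

# "[^0-9]" matches any single character that is not a digit
non_digits = re.findall(r"[^0-9]", "a1b22c")

# the caret only negates when it comes first; elsewhere it is an ordinary member
digits_or_caret = re.findall(r"[0-9^]", "2^3")

print(non_digits)        # ['a', 'b', 'c']
print(digits_or_caret)   # ['2', '^', '3']
```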

In [10]:
text = "Ayhem is a great student. Ayhem is hungry for success and unexpectedly pussy. Ayhem is suffering from both poverty and drought " + \
    "so don't be like Ayhem unless you're really mentally tough enough!!"
re.search("[aA]yhem", text) 
re.findall("[a-t]{5}", text)
re.split("[a-t]{5}", text)
re.findall("[A-Z]{1}[a-z]+", text)

grades = ['A', 'B', 'C', 'D']
print(grades)
random_grades = random_string(str_len=random.randint(2, 15), characters=grades)
# try to detect a decreasing trend in the grades
print(random_grades)
re.findall("[A]*[B]*[C]*[D]*", random_grades)

['A', 'B', 'C', 'D']
CDBDCCCA


['CD', 'BD', 'CCC', 'A', '']

In [11]:
a1 = np.random.rand(4)
a2 = np.random.rand(4, 1)
a3 = np.array([[1, 2, 3, 4]])
a4 = np.arange(1, 4, 1)
a5 = np.linspace(1 ,4, 4)
a = [a1, a2, a3, a4, a5]
for ax in a:
    print(ax.shape)

(4,)
(4, 1)
(1, 4)
(3,)
(4,)


In [12]:
r = np.random.rand(6,6)
print(r)

[[0.64032124 0.06568595 0.82578824 0.55267544 0.2972072  0.97630772]
 [0.64861718 0.98809526 0.58577825 0.46557748 0.81798316 0.37382699]
 [0.51671447 0.29853595 0.71227551 0.1353399  0.36449887 0.45151959]
 [0.39396631 0.69043683 0.8475055  0.62281975 0.04692135 0.26477362]
 [0.52588784 0.35598716 0.29756383 0.43921521 0.96734099 0.08063448]
 [0.97052485 0.69861294 0.47983185 0.07046357 0.01599939 0.57980272]]


In [13]:
print(r[2:4, 2:4], "\n")
print(r[[2,3], [2,3]])

[[0.71227551 0.1353399 ]
 [0.8475055  0.62281975]] 

[0.71227551 0.62281975]


In [14]:
s = 'ACAABAACAAAB'
result = re.findall('A{1,2}', s)
len(result)

5

In [15]:
with open("utility_files/grades.txt") as file:
    grades = file.read()
    # print(grades)
    template = "([A-Za-z]+ [A-Za-z]+)(: B)"
    # print(re.findall(template, grades))
    for item in re.finditer(template, grades):
        print(item.groups()[0])

Bell Kassulke
Simon Loidl
Elias Jovanovic
Hakim Botros
Emilie Lorentsen
Jake Wood
Fatemeh Akhtar
Kim Weston
Yasmin Dar
Viswamitra Upandhye
Killian Kaufman
Elwood Page
Elodie Booker
Adnan Chen
Hank Spinka
Hannah Bayer


In [16]:
with open("utility_files/logdata.txt", "r") as file:
    content = file.read()

# raw strings keep the regex backslashes away from Python's own escape handling
template_host = r"(\d+\.){3}\d+"
print(re.match(template_host, "12.3.5.345") is not None)
template_space = r"( - )"
template_user = r"[\w-]+"
template_date = r" \[\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4}\] "
template_request = r'"[A-Z]+ .+"'
# wrap host, user, date and request in parentheses so that each becomes a capturing group
final_template = f"({template_host}){template_space}({template_user})({template_date})({template_request})"

line = '231.220.8.214 - - [21/Jun/2019:15:45:52 -0700] "HEAD /systems/sexy HTTP/1.1" 201 2578'
match = re.search(final_template, line)
# group(0) is the whole match; match.re.groups is the number of capturing groups in the pattern
for i in range(match.re.groups + 1):
    print("group(" + str(i) + ")" + str(match.group(i)))

# groups: 1 = host, 2 = last repetition inside the host, 3 = separator, 4 = user, 5 = date, 6 = request
result = [{"host": item.group(1), "user_name": item.group(4), "time": item.group(5)[2:-2], "request": item.group(6)[1:-1]} for item in re.finditer(final_template, content)]
print(result[0:20])

True
group(0)231.220.8.214 - - [21/Jun/2019:15:45:52 -0700] "HEAD /systems/sexy HTTP/1.1"
group(1)231.220.8.214
group(2)8.
group(3) - 
group(4)-
group(5) [21/Jun/2019:15:45:52 -0700] 
group(6)"HEAD /systems/sexy HTTP/1.1"
[{'host': '146.204.224.152', 'user_name': 'feest6811', 'time': '21/Jun/2019:15:45:24 -0700', 'request': 'POST /incentivize HTTP/1.1'}, {'host': '197.109.77.178', 'user_name': 'kertzmann3129', 'time': '21/Jun/2019:15:45:25 -0700', 'request': 'DELETE /virtual/solutions/target/web+services HTTP/2.0'}, {'host': '156.127.178.177', 'user_name': 'okuneva5222', 'time': '21/Jun/2019:15:45:27 -0700', 'request': 'DELETE /interactive/transparent/niches/revolutionize HTTP/1.1'}, {'host': '100.32.205.59', 'user_name': 'ortiz8891', 'time': '21/Jun/2019:15:45:28 -0700', 'request': 'PATCH /architectures HTTP/1.0'}, {'host': '168.95.156.240', 'user_name': 'stark2413', 'time': '21/Jun/2019:15:45:31 -0700', 'request': 'GET /engage HTTP/2.0'}, {'host': '71.172.239.195', 'user_name': 'dool

In [17]:
# the ?, + and * operators are greedy: when more than one match is possible,
# they take the longest one.
# regex also offers a non-greedy (or lazy) version of these operators:
# ??, +?, *?: these take the shortest possible match
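
The difference is easiest to see side by side. This is a minimal sketch, where `html` is a made-up sample string.

```python
import re

html = "<b>bold</b> and <i>italic</i>"   # hypothetical sample text

# greedy: ".+" grabs as much as it can, from the first "<" to the last ">"
greedy = re.findall(r"<.+>", html)
# lazy: ".+?" stops at the first ">" it can reach
lazy = re.findall(r"<.+?>", html)

# the same applies to "?": greedy takes the optional "b", lazy skips it
greedy_optional = re.match(r"ab?", "ab").group()
lazy_optional = re.match(r"ab??", "ab").group()

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(greedy_optional, lazy_optional)   # ab a
```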

# Grouping

In [18]:
# regex provides an additional construct called grouping: a regex written between
# parentheses is treated as a single unit

# keep in mind this sample example: (foo(bar)?)+(\d\d\d) # if you can understand how it works you're good
# for this part so far

# each (regex) represents a group: the match object returned by re.match (when it is not None, of course)
# provides 2 important methods:

# groups(): returns a tuple with the text captured by every group, in order

# group(<n>): returns the text captured by the n-th group (group(0) is the whole match)

# let's consider an example

text = "aaa:678:ooo"  # not named "string", which would shadow the imported string module

regex = r'(\w+):(\d+):([oui]+)'

match_obj = re.match(regex, text)

match_obj.groups()  # ('aaa', '678', 'ooo'); not displayed because it is not the last expression of the cell
print()
for i in range(4):
    print(match_obj.group(i))



aaa:678:ooo
aaa
678
ooo
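
The sample pattern mentioned in the comments above, (foo(bar)?)+(\d\d\d), can be tried out directly. This is a small sketch with a made-up input string. Note that a repeated group keeps only the text of its last repetition.

```python
import re

pattern = r"(foo(bar)?)+(\d\d\d)"

# group 1 is repeated twice here: first "foo", then "foobar"
m = re.match(pattern, "foofoobar123")

whole = m.group(0)       # the entire match
captured = m.groups()    # group 1 holds its last repetition only

# without a leading "foo" there is no match at all
no_match = re.match(pattern, "bar123")

print(whole)      # foofoobar123
print(captured)   # ('foobar', 'bar', '123')
print(no_match)   # None
```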


In [19]:
# here is something slightly better:
# backreferences
# \<n>: matches the same text that the n-th group captured
# the value of n is limited to 99, as \100 is read as the octal escape for '@'

# example as usual:

regex = r'(\w+), (\d+), \2' # don't forget to use the raw string: r in your regex boy!! (otherwise Python reads '\2' as chr(2))
# this will match both strings below

m1 = 'ayh, 22, 22'
m2 = 'a, 31, 31'

print(re.match(regex, m1), re.match(regex, m2))

<re.Match object; span=(0, 11), match='ayh, 22, 22'> <re.Match object; span=(0, 9), match='a, 31, 31'>


In [20]:
# back to backreferences (did u see what I did there xD!!)
# this time let's use the groups' names

# the reference's pattern: (?P=name)

re_refer_group_name = r'(?P<g1>\w{2,4})--(?P=g1)'

text = 'yyy--yyy'  # not named "string", which would shadow the imported string module

m_res = re.match(re_refer_group_name, text).group('g1')  # 'yyy'

In [21]:
# Don't get bored already boy!!
# there are still many of these grouping constructs to come... we're getting there

# instead of using numbers to reference groups, regex gives us the power to create named groups: Can you believe it ?????
# they are of the following pattern: (?P<name>regex)

named_groups = r'(?P<g1>\w{2,4})--(?P<g2>\w{3,4})'

# so u see, the first group is called g1, the second u guessed it g2

text = '222--t_ss'  # not named "string", which would shadow the imported string module

match_obj = re.match(named_groups, text)
print(match_obj.group('g1'), match_obj.group('g2'), sep="\t" * 2)

## too good to be true, baby ?? Yeah I know, I know. Well, there is no free lunch:
# any name in this construct should conform to the Python identifier rules
# and each name can be defined only once per regex


222		t_ss


The angle brackets (< and >) are required around 'name' when creating a named group, but not when referring to it later, either in a backreference or through .group().

In [22]:
# OKAY, READY TO HEAR SOME UNEXPECTED STUFF, WELL HOLD MY BEER
# we can make a non-capturing group: one that groups a regex without capturing what it matched
# how ??:
# (?:<regex>)

# it can't be referred to with a backreference, nor retrieved from the match object
# this keeps the group numbering clean and saves the (small) cost of capturing

# what is even more interesting is the conditional match

# pattern: (?(<n>)<yes-regex>|<no-regex>)
# (?(<name>)<yes-regex>|<no-regex>)
# <yes-regex> is used if group <n> (or <name>) matched something, <no-regex> otherwise

# let's consider both these examples:
r1 = r'^(###)?foo(?(1)bar|baz)'
# r1 represents the following possibilities: either foobaz or ###foobar

r2 = r'^(?P<ch>\W)?foo(?(ch)(?P=ch)|)$'
# r2 represents: either foo or <c>foo<c>, where <c> is the same non-word character (\W) on both sides

# no no boy Don't get overwhelmed just yet. We have a lot of stuff for you!! Hold tight!!
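
Here is a minimal sketch that runs the two conditional patterns against a few made-up inputs, plus one non-capturing group. The helper `matches` is a hypothetical name introduced only for this example.

```python
import re

def matches(pattern, text):
    """Return True if the pattern matches at the start of text."""
    return re.match(pattern, text) is not None

# a non-capturing group still groups for "+", but it gets no group number
only_digit_captured = re.match(r"(?:ab)+(\d)", "abab7").groups()

r1 = r"^(###)?foo(?(1)bar|baz)"
r1_results = [matches(r1, s) for s in ["###foobar", "foobaz", "###foobaz", "foobar"]]

r2 = r"^(?P<ch>\W)?foo(?(ch)(?P=ch)|)$"
r2_results = [matches(r2, s) for s in ["foo", "#foo#", "#foo", "#foo@"]]

print(only_digit_captured)   # ('7',)
print(r1_results)            # [True, True, False, False]
print(r2_results)            # [True, True, False, False]
```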


In [23]:
# (?=<lookahead_regex>): this is the pattern for (u guessed it) the lookahead regex
# the lookahead is only checked, not consumed: it is not part of the match

# let's better understand with an example

r_lookahead = 'foo(?=[a-z])' #: this means all 'foo' s that are followed by a lowercase letter

# well we can't leave a positive lookahead without its complement, the negative lookahead:
# it matches if the text afterwards does not match the lookahead regex:
# pattern (?!<lookahead_regex>)

r_lookahead_neg = 'foo(?![a-z])' #: this means all 'foo' s that are not followed by a lowercase letter
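
A short sketch of both lookaheads on a made-up string. The positions show which 'foo' each pattern picks, and the matched text confirms that the lookahead itself is not consumed.

```python
import re

text = "foobar foo1 foo"   # hypothetical sample: 'foo' at positions 0, 7 and 12

followed_by_letter = [m.start() for m in re.finditer(r"foo(?=[a-z])", text)]
not_followed_by_letter = [m.start() for m in re.finditer(r"foo(?![a-z])", text)]

# the character checked by the lookahead is not part of the match
matched_text = re.search(r"foo(?=[a-z])", text).group()

print(followed_by_letter)       # [0]
print(not_followed_by_letter)   # [7, 12]
print(matched_text)             # foo
```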


In [24]:
# well with a lookahead comes a lookbehind:
# (?<=<lookbehind_regex>):

r_lookbehind = '(?<=qux)bar' # matches only the 'bar' s preceded by the string 'qux'
# the negative version is also there: (?<!<lookbehind_regex>)
r_lookbehind_neg = '(?<!qux)bar' # matches only the 'bar' s that are not preceded by the string 'qux'

# One important additional constraint: in Python's re module, lookbehind_regex must match
# strings of a fixed length (so no *, + or ? inside it)
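
A minimal sketch of the two lookbehinds on a made-up string, followed by a check that Python's re module rejects a lookbehind whose length can vary.

```python
import re

text = "quxbar foobar"   # hypothetical sample: 'bar' at positions 3 and 10

after_qux = [m.start() for m in re.finditer(r"(?<=qux)bar", text)]
not_after_qux = [m.start() for m in re.finditer(r"(?<!qux)bar", text)]

# "a+" can match strings of different lengths, so it is not allowed in a lookbehind
try:
    re.compile(r"(?<=a+)b")
    variable_width_allowed = True
except re.error:
    variable_width_allowed = False

print(after_qux)                # [3]
print(not_after_qux)            # [10]
print(variable_width_allowed)   # False
```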

# Miscellaneous metacharacters

In [28]:
# Regex still has a couple of tricks up its sleeve
# (?# ...) represents a comment within the regex

re_with_comment = r'(?P<w1>[a-z]+)(?# chill boys !!) (?P=w1)'

print(re.match(re_with_comment, 'ayh ayh'))

# see how the "chill boys" comment was ignored by the matcher

<re.Match object; span=(0, 7), match='ayh ayh'>


In [None]:
# alternation: the or operator in regex
# r1|r2|r3| ... |r_n
# the alternatives are tried from left to right and the first one that matches wins,
# even if a later alternative would give a longer match
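
A small sketch of that left-to-right behaviour, using made-up patterns and strings:

```python
import re

# "foo" comes first and succeeds, so the longer "foobar" is never tried
first_wins = re.match(r"foo|foobar", "foobar").group()
# putting the longer alternative first changes the result
longer_first = re.match(r"foobar|foo", "foobar").group()

# a (non-capturing) group limits the alternation to part of the pattern
grouped = re.findall(r"gr(?:a|e)y", "gray grey groy")

print(first_wins)     # foo
print(longer_first)   # foobar
print(grouped)        # ['gray', 'grey']
```

When one alternative is a prefix of another, listing the longer one first is a simple way to get the longest match.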