### Python One Liners - Regex 

Useful Takeaways 

> Put a \ character in front of a special character to negate its effect, so that it is matched literally.
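
A minimal sketch of the escaping rule, assuming the standard-library `re` module and made-up sample strings:

```python
import re

price_text = "Total: $5.00 (approx.)"  # made-up sample text

# Unescaped, '.' matches any character, so the pattern also accepts '5x00'.
unescaped = re.findall(r"5.00", "5.00 5x00")
print(unescaped)  # ['5.00', '5x00']

# Escaped, '\$' and '\.' match only a literal dollar sign and a literal dot.
escaped = re.findall(r"\$5\.00", price_text)
print(escaped)  # ['$5.00']

# re.escape() adds the backslashes for a literal string automatically.
literal = re.findall(re.escape("(approx.)"), price_text)
print(literal)  # ['(approx.)']
```

Writing patterns as raw strings (`r"..."`) keeps Python from interpreting the backslashes before the regex engine sees them.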


> + matches one or more times. * matches zero or more times. {3} matches exactly three times. {2,4} matches between two and four times. ? matches zero times or once.
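
The quantifiers can be compared side by side with a small sketch. The `tokens` list and the helper `fully_matching` are hypothetical names introduced only for this illustration.

```python
import re

tokens = ["", "a", "aa", "aaa", "aaaa", "aaaaa"]  # made-up inputs

def fully_matching(pattern):
    """Return the tokens that the pattern matches from start to end."""
    return [t for t in tokens if re.fullmatch(pattern, t)]

print(fully_matching(r"a*"))      # all six tokens, including ''
print(fully_matching(r"a+"))      # ['a', 'aa', 'aaa', 'aaaa', 'aaaaa']
print(fully_matching(r"a?"))      # ['', 'a']
print(fully_matching(r"a{3}"))    # ['aaa']
print(fully_matching(r"a{2,4}"))  # ['aa', 'aaa', 'aaaa']
```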


> A regex group is written with parentheses (). A pattern can contain several groups. With two nested groups, as in (hello(world)), re.findall() returns a list with one tuple per match, and each tuple holds the text of every group: ('helloworld', 'world').
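
How the number of groups changes what `re.findall()` returns can be sketched as follows, using the standard-library `re` module and a made-up input string:

```python
import re

text = "helloworld helloworld"  # made-up input

# No group: a list of the full matches.
no_group = re.findall(r"hello", text)
print(no_group)  # ['hello', 'hello']

# One group: a list of that group's text only.
one_group = re.findall(r"hello(world)", text)
print(one_group)  # ['world', 'world']

# Two (nested) groups: a list of tuples, one entry per group.
nested = re.findall(r"(hello(world))", text)
print(nested)  # [('helloworld', 'world'), ('helloworld', 'world')]

# A non-capturing group (?:...) groups without adding a tuple entry.
non_capturing = re.findall(r"hello(?:world)", text)
print(non_capturing)  # ['helloworld', 'helloworld']
```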


> The dot regex matches any single character, including whitespace such as spaces and tabs, but by default not the newline character (pass re.DOTALL to include it). Use it when you don’t care which character matches, as long as exactly one matches.
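
A short sketch of the dot's behaviour around newlines, assuming the standard-library `re` module and a made-up three-character sample:

```python
import re

text = "a b\nc"  # made-up sample containing a space and a newline

# By default the dot matches the space but skips the newline.
default_dot = re.findall(r".", text)
print(default_dot)  # ['a', ' ', 'b', 'c']

# With re.DOTALL the dot matches the newline as well.
dotall_dot = re.findall(r".", text, flags=re.DOTALL)
print(dotall_dot)  # ['a', ' ', 'b', '\n', 'c']
```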


> Second, say you want to match text that begins and ends with the character 'y', with an arbitrary number of characters in between. You can do this with the asterisk regex, the * character. Unlike the dot regex, the asterisk regex can’t stand on its own; it modifies the meaning of the regex before it.

> With .*?, Python searches for a minimal number of arbitrary characters: it matches any character ( . ), any number of times ( * ), but as few times as possible for the overall regex to match ( ? ).
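
The difference between the greedy `.*` and the nongreedy `.*?` can be sketched on a made-up snippet of markup (standard-library `re` assumed):

```python
import re

snippet = "<b>bold</b> and <i>italic</i>"  # made-up sample

# Greedy: '.*' grabs as much as possible, so one match spans the whole string.
greedy = re.findall(r"<.*>", snippet)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Nongreedy: '.*?' stops at the first '>' that lets the pattern succeed.
nongreedy = re.findall(r"<.*?>", snippet)
print(nongreedy)  # ['<b>', '</b>', '<i>', '</i>']
```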

> Using re.compile: useful for storing a regex once and reusing it on other strings, instead of retyping the same expression.
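
A minimal sketch of reusing a compiled pattern; the name `time_pattern` and the `logs` list are made up for illustration:

```python
import re

# Compile once, then call the pattern's own methods on many strings.
time_pattern = re.compile(r"[0-9]{2}:[0-9]{2}")

logs = ["start 08:15", "no time here", "end 17:40, break 12:00"]  # made-up data
found = [time_pattern.findall(line) for line in logs]
print(found)  # [['08:15'], [], ['17:40', '12:00']]
```

The compiled object offers the same methods as the module (`findall`, `search`, `match`, `fullmatch`), just without the pattern argument.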

> Differences between re.match, re.search and re.findall: re.search returns a match object for the first occurrence, including its location (span); re.findall returns all occurrences, but only as a list of strings (no location information); re.match returns a match object only if the pattern matches at the beginning of the string.

> re.match - Ensures the string begins with the pattern

> re.fullmatch - Checks whether the regex matches the full string. 
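
The four functions can be compared on one made-up string. This sketch assumes the standard-library `re` module; `re.finditer` is added to show how to get every location when `re.findall` is not enough.

```python
import re

text = "one cat, two cats"  # made-up sample
pattern = r"cat"

# match(): only succeeds at the very start of the string.
at_start = re.match(pattern, text)
print(at_start)  # None

# search(): first occurrence anywhere, with its location.
first = re.search(pattern, text)
print(first.span())  # (4, 7)

# findall(): every occurrence, as plain strings.
every = re.findall(pattern, text)
print(every)  # ['cat', 'cat']

# fullmatch(): the pattern must cover the whole string.
whole = re.fullmatch(pattern, text)
print(whole)  # None

# finditer(): every occurrence as a match object, so spans are available.
spans = [m.span() for m in re.finditer(pattern, text)]
print(spans)  # [(4, 7), (13, 16)]
```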

• The dot regex . matches an arbitrary character.

• The asterisk regex <pattern>* matches an arbitrary number of the regex
<pattern>. Note that this includes zero matching instances.
    
• The at-least-one regex <pattern>+ can match an arbitrary number of
<pattern> but must match at least one instance.
    
• The zero-or-one regex <pattern>? matches either zero or one instances
of <pattern>.
    
• The nongreedy asterisk regex <pattern>*? matches as few copies of <pattern> as possible while still letting the overall regex match.

• The regex <pattern>{m} matches exactly m copies of <pattern>.
    
• The regex <pattern>{m,n} matches between m and n copies of <pattern>.
    
• The regex <pattern_1>|<pattern_2> matches either <pattern_1> or <pattern_2>.
    
• The regex <pattern_1><pattern_2> matches <pattern_1> and then <pattern_2>.
    
• The regex (<pattern>) matches <pattern>. The parentheses group regular expressions so you can control the order of execution (for example, (<pattern_1><pattern_2>)|<pattern_3> is different from <pattern_1>(<pattern_2>|<pattern_3>)). The parentheses regex also creates a matching group, as you’ll see later in the section.
    
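• The regex ^ matches the start of the string. Inside square brackets it negates the set: [^<characters>] matches any one character that is not listed, which is how you avoid certain characters.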
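
The two roles of `^` can be sketched as follows, assuming the standard-library `re` module and made-up sample text:

```python
import re

text = "apple pie\nangry ant"  # made-up two-line sample

# Outside brackets, ^ anchors the match to the start of the string.
start_only = re.findall(r"^a\w+", text)
print(start_only)  # ['apple']

# With re.MULTILINE, ^ also matches right after each newline.
each_line = re.findall(r"^a\w+", text, flags=re.MULTILINE)
print(each_line)  # ['apple', 'angry']

# Inside brackets, ^ negates the set: runs of characters that are
# neither vowels nor whitespace.
not_vowels = re.findall(r"[^aeiou\s]+", "apple pie")
print(not_vowels)  # ['ppl', 'p']
```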

In [1]:
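import regex as re  # third-party package (pip install regex); the standard-library re module also handles every pattern used here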

report = '''
If you invested $1 in the year 1801, you would have $18087791.41 today.
This is a 7.967% return on investment.
But if you invested only $0.25 in 1801, you would end up with $4521947.8525.
'''

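# Finding the dollar amounts (raw strings keep the backslashes intact)
re.findall(r'\$[0-9]*', report)

re.findall(r'(\$[0-9]+(\.[0-9]*)?)', report)

# Keep only the first group of each tuple, i.e. the full amount
dollars = [x[0] for x in re.findall(r'(\$[0-9]+(\.[0-9]*)?)', report)]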

In [16]:
re.findall(r'\$[0-9]*', report)

['$1', '$18087791', '$0', '$4521947']

In [17]:
re.findall(r'\$[0-9]?', report)

['$1', '$1', '$0', '$4']

In [18]:
re.findall(r'\$[0-9]+', report)

['$1', '$18087791', '$0', '$4521947']

In [19]:
re.findall(r'(\$[0-9]+(\.[0-9]*)?)', report)

[('$1', ''),
 ('$18087791.41', '.41'),
 ('$0.25', '.25'),
 ('$4521947.8525', '.8525')]

In [23]:
[x for x in re.findall(r'(\$[0-9]+(\.[0-9]*)?)', report)]

[('$1', ''),
 ('$18087791.41', '.41'),
 ('$0.25', '.25'),
 ('$4521947.8525', '.8525')]
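
The tuples above appear because both pairs of parentheses are capturing groups. One possible alternative, sketched here with the standard-library `re` module, is a non-capturing group `(?:...)`, which lets `re.findall()` return the full amounts directly without indexing into tuples.

```python
import re

report = '''
If you invested $1 in the year 1801, you would have $18087791.41 today.
This is a 7.967% return on investment.
But if you invested only $0.25 in 1801, you would end up with $4521947.8525.
'''

# (?:...) groups the optional decimal part without creating a capture group,
# so findall() returns plain strings instead of tuples.
dollars = re.findall(r'\$[0-9]+(?:\.[0-9]*)?', report)
print(dollars)  # ['$1', '$18087791.41', '$0.25', '$4521947.8525']
```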

In [24]:
re.findall('.', report)

['I', 'f', ' ', 'y', 'o', 'u', ' ', 'i', 'n', 'v', 'e', 's', 't', 'e', 'd', ' ', '$', '1', ...]

(output truncated: one list element per character of report, with the newline characters skipped)

In [26]:
# Using dot regex to match any arbitrary character 
re.findall('y.u', report)

['you', 'you', 'you', 'you']

In [27]:
# The asterisk regex can't stand on its own; it modifies the meaning
# of the regex before it (here, the dot).
re.findall('y.*y', report)

['you invested $1 in the year 1801, you would have $18087791.41 today',
 'you invested only $0.25 in 1801, y']

In [29]:
x = '''

The Northern Ireland Protocol was agreed under former PM Boris Johnson as part of the process of the UK leaving the European Union.

It means Northern Ireland has continued to follow some EU laws so that goods can flow freely over the border to the Republic of Ireland without checks.

Instead, goods arriving from England, Scotland and Wales are checked when they reach Northern Irish ports.

Critics, including Northern Ireland's Democratic Unionist Party (DUP), feel this undermines the nation's position within the rest of the UK, as well as impacting trade. 


'''

In [32]:
re.findall('.*?', x)

['', '', '', 'T', '', 'h', '', 'e', '', ' ', '', 'N', '', 'o', '', 'r', '', 't', '', 'h', ...]

(output truncated: the pattern continues with an empty string before every single character)


Our input is a string, and our goal is to use a nongreedy approach to find all patterns that start with the character 'p', end with the character 'r', and have at least one occurrence of the character 'e' (and, possibly, an arbitrary number of other characters) in between!

In [33]:
text = 'peter piper picked a peck of pickled peppers'

In [51]:
re.findall('p.*e.*r',text)

['peter piper picked a peck of pickled pepper']

In [52]:
re.findall('p.*?e.*?r',text)

['peter', 'piper', 'picked a peck of pickled pepper']

In [60]:
text = '''

The arrival of the outstanding Casemiro, the superb development of the combative Lisandro Martinez and Rashford's rejuvenation have helped to make the Old Trafford outfit a serious proposition again.

They were not at their best, but once they took control of this final they did not let Newcastle back in - and this was very much a case of mission accomplished.

At the heart of it all was Casemiro, a genuinely transformative acquisition. The Brazilian not only made the crucial contribution with the opening goal, but stamped his years of trophy-winning experience with Real Madrid all over this showpiece with his expert positioning and authority.

It will also increase the growing belief that Ten Hag is the manager who will move Manchester United forward and out of the wilderness that had engulfed them before his arrival at the start of the season.

Manchester United claimed their first trophy since 2017 with victory over Newcastle United in the Carabao Cup final at Wembley.

Newcastle's own wait for silverware, stretching back to 1969, goes on after two goals inside six minutes in the first half established Manchester United's superiority and set them on their way to a first success under manager Erik ten Hag.

Casemiro broke the deadlock after 33 minutes when he headed home Luke Shaw's free-kick.

His side doubled their advantage after Sven Botman deflected Marcus Rashford's shot out of the reach of Newcastle's debutant keeper Loris Karius, deputising for the suspended Nick Pope.

Newcastle attempted to rally in the second half, but the goals have dried up at the wrong time for Eddie Howe's men.

It meant Manchester United were back in the honours after last tasting success six years ago when lifting the Europa League under Jose Mourinho, and also winning this competition in the same campaign.

'''

In [69]:
re.findall('C.*o',text)

["Casemiro, the superb development of the combative Lisandro Martinez and Rashford's rejuvenation have helped to make the Old Trafford outfit a serious propositio",
 'Casemiro, a genuinely transformative acquisition. The Brazilian not only made the crucial contribution with the opening goal, but stamped his years of trophy-winning experience with Real Madrid all over this showpiece with his expert positioning and autho',
 'Carabao',
 'Casemiro broke the deadlock after 33 minutes when he headed ho']

In [62]:
re.findall('C.*?o',text)

['Casemiro', 'Casemiro', 'Carabao', 'Casemiro']

In [73]:
re.findall('M.*?a.*?r', text)

['Mar', 'Madr', 'Manchester', 'Manchester', 'Manchester', 'Mar', 'Manchester']

In [75]:
re.findall('M.*a.*r', text)

["Martinez and Rashford's rejuvenation have helped to make the Old Trafford outfit a serious pr",
 'Madrid all over this showpiece with his expert positioning and author',
 'Manchester United forward and out of the wilderness that had engulfed them before his arrival at the star',
 'Manchester United claimed their first trophy since 2017 with victory over Newcastle United in the Car',
 "Manchester United's superiority and set them on their way to a first success under manager Er",
 "Marcus Rashford's shot out of the reach of Newcastle's debutant keeper Loris Karius, deputising for",
 'Manchester United were back in the honours after last tasting success six years ago when lifting the Europa League under Jose Mour']

In [1]:
import regex as re

In [2]:
text_1 = "crypto-bot that is trading Bitcoin and other currencies"
text_2 = "cryptographic encryption methods that can be cracked easily with quantum computers"

In [3]:
text_1

'crypto-bot that is trading Bitcoin and other currencies'

In [4]:
text_2

'cryptographic encryption methods that can be cracked easily with quantum computers'

In [37]:
text_1_analysis = re.findall('(crypto.{0,30})(?=coin)',text_1)

In [38]:
text_1_analysis

['crypto-bot that is trading Bit']

In [39]:
text_2_analysis = re.findall('(crypto.{0,30})(?=coin)',text_2)

In [40]:
text_2_analysis

[]
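
The `(?=coin)` part of the pattern above is a positive lookahead: it requires that `coin` follows, but does not include it in the match. A minimal sketch of the difference, using the standard-library `re` module:

```python
import re

text_1 = "crypto-bot that is trading Bitcoin and other currencies"

# Lookahead: 'coin' must follow, but it is not part of the returned match.
with_lookahead = re.findall(r"crypto.{0,30}(?=coin)", text_1)
print(with_lookahead)  # ['crypto-bot that is trading Bit']

# Without the lookahead, 'coin' is consumed and appears in the result.
consuming = re.findall(r"crypto.{0,30}coin", text_1)
print(consuming)  # ['crypto-bot that is trading Bitcoin']
```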

In [43]:
pattern = re.compile("crypto(.{1,30})coin")

In [45]:
print(pattern.match(text_1))

<regex.Match object; span=(0, 34), match='crypto-bot that is trading Bitcoin'>


In [48]:
re.findall("(crypto.{1,30}coin)",text_1)

['crypto-bot that is trading Bitcoin']

In [50]:
re.compile("(crypto.{1,30}coin)")

regex.Regex('(crypto.{1,30}coin)', flags=regex.V0)

In [52]:
text = '''
"One can never have enough socks", said Dumbledore.
"Another Christmas has come and gone and I didn't
get a single pair. People will insist on giving me books."
Christmas Quote
'''
regex = 'Christ.*'

In [55]:
re.search(regex,text)

<regex.Match object; span=(62, 102), match="Christmas has come and gone and I didn't">

In [56]:
re.match(regex,text)

In [57]:
re.findall(regex,text)

["Christmas has come and gone and I didn't", 'Christmas Quote']

In [58]:
## Data
page = '''
<!DOCTYPE html>
<html>
<body>
<h1>My Programming Links</h1>
<a href="https://app.finxter.com/">test your Python skills</a>
<a href="https://blog.finxter.com/recursion/">Learn recursion</a>
<a href="https://nostarch.com/">Great books from NoStarchPress</a>
<a href="http://finxter.com/">Solve more Python puzzles</a>
</body>
</html>
'''

In [74]:
re.findall('(finxter.com.*puzzles.*)|(finxter.com.*test.*)',page)

[('', 'finxter.com/">test your Python skills</a>'),
 ('finxter.com/">Solve more Python puzzles</a>', '')]

In [82]:
re.findall("(<a.*?finxter.*?(test|puzzle).*?>)", page)

[('<a href="https://app.finxter.com/">test your Python skills</a>', 'test'),
 ('<a href="http://finxter.com/">Solve more Python puzzles</a>', 'puzzle')]

In [85]:
re.findall('(<a.*finxter.com.*puzzles.*)|(<a.*finxter.com.*test.*)',page)

[('', '<a href="https://app.finxter.com/">test your Python skills</a>'),
 ('<a href="http://finxter.com/">Solve more Python puzzles</a>', '')]
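
A different way to reach the same two links, offered only as a sketch: capture the URL and the link text as separate groups, then filter in plain Python. The shortened `page` string and the name `wanted` are assumptions made for this illustration, and regexes like this are only suitable for simple, regular markup.

```python
import re

# Shortened copy of the page used above.
page = '''
<a href="https://app.finxter.com/">test your Python skills</a>
<a href="https://nostarch.com/">Great books from NoStarchPress</a>
<a href="http://finxter.com/">Solve more Python puzzles</a>
'''

# Group 1 is the URL, group 2 is the visible link text.
links = re.findall(r'<a href="(.*?)">(.*?)</a>', page)

# Keep finxter links whose text mentions 'test' or 'puzzle'.
wanted = [(url, label) for url, label in links
          if 'finxter' in url and ('test' in label or 'puzzle' in label)]
print(wanted)
```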

In [1]:
article = '''
             The algorithm has important practical applications
             http://blog.finxter.com/applications/
             in many basic data structures such as sets, trees,
             dictionaries, bags, bag trees, bag dictionaries,
             hash sets, https://blog.finxter.com/sets-in-python/
             hash tables, maps, and arrays. http://blog.finxter.com/
             http://not-a-valid-url
             http:/bla.ba.com
             http://bo.bo.bo.bo.bo.bo/
             http://bo.bo.bo.bo.bo.bo/333483--33343-/
             '''

In [3]:
import regex as re
# Plain http:// links whose host contains at least one dot
stale_links = re.findall(r'http://[a-z0-9_\-.]+\.[a-z0-9_\-/]+', article)
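
No output is shown for `stale_links`. As a rough check of what the pattern accepts and rejects, here is a sketch on a shortened, single-line version of the article text (the shortening is an assumption made for this illustration):

```python
import re

# Shortened sample: one http link, one https link, and two malformed links.
article = ("see http://blog.finxter.com/applications/ and "
           "https://blog.finxter.com/sets-in-python/ but not "
           "http://not-a-valid-url or http:/bla.ba.com")

# 'http://' must appear literally, the host needs at least one dot,
# and only lowercase letters, digits, '_', '-', '.' and '/' are allowed.
stale_links = re.findall(r'http://[a-z0-9_\-.]+\.[a-z0-9_\-/]+', article)
print(stale_links)  # ['http://blog.finxter.com/applications/']
```

The https link is skipped because the pattern requires the literal prefix `http://`, and the last two are skipped because one has no dot in the host and the other has only a single slash.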

In [5]:
inputs = ['18:29', '23:55', '123', 'ab:de', '18:299', '99:99']
[re.findall('[0-9]{2}:[0-9]{2}',i) for i in inputs]

[['18:29'], ['23:55'], [], [], ['18:29'], ['99:99']]

In [8]:
[re.fullmatch('[0-9]{2}:[0-9]{2}',i) for i in inputs]

[<regex.Match object; span=(0, 5), match='18:29'>,
 <regex.Match object; span=(0, 5), match='23:55'>,
 None,
 None,
 None,
 <regex.Match object; span=(0, 5), match='99:99'>]

In [13]:
import re

string_ = "Geeks for geeks"
pattern = "Geeks"

# match() only checks the start of the string; fullmatch() must cover all of it
print(re.match(pattern, string_))
print(re.fullmatch(pattern, string_))

<re.Match object; span=(0, 5), match='Geeks'>
None


In [14]:
[re.fullmatch('[0-24]:[0-59]',i) for i in inputs]

[None, None, None, None, None, None]
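
Every result above is None because square brackets describe a set of single characters, not a numeric range: `[0-24]` means "0 to 2, or 4" and `[0-59]` means "0 to 5, or 9". A short sketch with the standard-library `re` module makes this visible:

```python
import re

digits = "0123456789"

# [0-24] is the range 0-2 plus the single character 4.
hours_class = re.findall(r"[0-24]", digits)
print(hours_class)  # ['0', '1', '2', '4']

# [0-59] is the range 0-5 plus the single character 9.
minutes_class = re.findall(r"[0-59]", digits)
print(minutes_class)  # ['0', '1', '2', '3', '4', '5', '9']

# So the whole pattern only fits one digit, a colon, and one digit.
tiny = re.fullmatch(r"[0-24]:[0-59]", "2:9")
print(tiny is not None)  # True
```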

In [15]:
time = '18:29'

In [18]:
re.fullmatch('[0-2]{1}[0-9]{1}:[0-9]{2}', time)

<re.Match object; span=(0, 5), match='18:29'>

In [20]:
[re.fullmatch('[0-2]{1}[0-9]{1}:[0-9]{2}', i) for i in inputs]

[<re.Match object; span=(0, 5), match='18:29'>,
 <re.Match object; span=(0, 5), match='23:55'>,
 None,
 None,
 None,
 None]

In [21]:
[re.fullmatch('[0-1][0-9]|2[0-3]:[0-5][0-9]',i) for i in inputs]

[None, <re.Match object; span=(0, 5), match='23:55'>, None, None, None, None]

In [30]:
[re.fullmatch('([0-1][0-9]|2[0-3]):[0-5][0-9]',i) for i in inputs]

[<re.Match object; span=(0, 5), match='18:29'>,
 <re.Match object; span=(0, 5), match='23:55'>,
 None,
 None,
 None,
 None]
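
The version without parentheses matched only '23:55' because `|` has the lowest precedence: the pattern reads as "`[0-1][0-9]`" or "`2[0-3]:[0-5][0-9]`". A sketch of the grouped version wrapped in a helper follows; the name `is_valid_time` is hypothetical, and a non-capturing group is used since the hour is not needed separately.

```python
import re

def is_valid_time(candidate):
    """Return True for HH:MM strings between 00:00 and 23:59."""
    return re.fullmatch(r"(?:[01][0-9]|2[0-3]):[0-5][0-9]", candidate) is not None

inputs = ['18:29', '23:55', '123', 'ab:de', '18:299', '99:99']
results = [is_valid_time(i) for i in inputs]
print(results)  # [True, True, False, False, False, False]

# Without the group, the first alternative is just two digits on their own.
ungrouped = re.fullmatch(r"[01][0-9]|2[0-3]:[0-5][0-9]", "18")
print(ungrouped is not None)  # True
```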

In [31]:
re.findall('[0-9]','1')

['1']