
# **REGEX: All You Need to Know to Get Started**



---
> by Ariana Bregasi



A Regular Expression (**RegEx**) is a sequence of characters that defines a search pattern. 

The abbreviation for regular expression is **regex** (*REs*, *regexes*, or *regex patterns* are also used). Regular expressions are a powerful language for matching text patterns. They are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. A pattern defined using **RegEx** can be matched against a string. The search pattern can be anything from a single character or a fixed string to a complex expression containing special characters that describe the pattern. For a given string, the pattern may match once, several times, or not at all.


# **Meta Characters**

---


> []^.$*+?{}()|


---

**These are characters interpreted in a special way by the REGEX engine, used to specify regular expressions**


> OR operator  [] or |



**[] Square Brackets** specify a set of characters you wish to match

[abc] matches if the string contains any of a, b, or c

[a-e] is the same as [abcde]

[1-4] is the same as [1234]

[0-39] is the same as [01239]
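
As a quick check of the character sets above, here is a minimal sketch using re.findall(). The sample strings and the variable names set_hits and range_hits are illustrative assumptions, not part of the original notebook.

```python
import re

# [abc] matches any single character that is a, b, or c
set_hits = re.findall(r'[abc]', 'cabbage')
print(set_hits)    # ['c', 'a', 'b', 'b', 'a']

# [0-39] is the range 0-3 plus the literal 9, so 4 through 8 are skipped
range_hits = re.findall(r'[0-39]', '0123456789')
print(range_hits)  # ['0', '1', '2', '3', '9']
```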

**| Alternation** ("or" operator)

a|b matches a or b (it matches once in art, once in bed, and three times in adcbeda)
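
The match counts quoted above can be reproduced with a short sketch. The dictionary name counts is an illustrative assumption.

```python
import re

# count how many times a|b matches in each sample word
counts = {word: len(re.findall(r'a|b', word)) for word in ['art', 'bed', 'adcbeda']}
print(counts)  # {'art': 1, 'bed': 1, 'adcbeda': 3}
```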


> Anchors ^ and $

**^ Caret** checks whether a string starts with a certain character

^a matches strings that start with a, e.g. a, abc, ariana

^ab matches strings that start with ab, e.g. aberrant, absolute, abc

**$ Dollar** checks whether a string ends with a certain character

a$ matches strings that end with a, e.g. a, data, Dora
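
A small sketch of both anchors, filtering a word list with re.search(). The word list and the names starts_with_a and ends_with_a are assumptions made for illustration.

```python
import re

words = ['ariana', 'abc', 'data', 'Dora', 'bed']

# ^a only matches when the string begins with a
starts_with_a = [w for w in words if re.search(r'^a', w)]
print(starts_with_a)  # ['ariana', 'abc']

# a$ only matches when the string ends with a
ends_with_a = [w for w in words if re.search(r'a$', w)]
print(ends_with_a)    # ['ariana', 'data', 'Dora']
```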



> Quantifiers * + ? and {}



**\* Star** matches zero or more occurrences of the pattern that precedes it

st*r matches sr, str, string, strong, strrrr; it does not match star

**+ Plus** matches one or more occurrences of the pattern that precedes it

ma+n matches man, maaan, woman; it does not match mn

**? Question mark** matches zero or one occurrence of the pattern that precedes it

ma?n matches mn, man, woman; it does not match main or maaan

**{} Braces** set a repetition range: {2,3} means at least two and at most three repetitions of the pattern that precedes it

a{2,3} matches a repeated two or three times, e.g. aa or aaa (in aaaa it matches the first three a's)

[0-9]{2,4} matches at least two and at most four digits, e.g. 123 or 1245
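
The four quantifiers can be compared side by side. This is only a sketch: the helper words_matching and the sample words are hypothetical, chosen so that each quantifier accepts a different subset. Each quantifier applies to the item immediately before it.

```python
import re

def words_matching(pattern, words):
    """Hypothetical helper: return the words in which pattern is found anywhere."""
    return [w for w in words if re.search(pattern, w)]

# * allows zero or more t's between s and r
star = words_matching(r'st*r', ['sr', 'str', 'sttr', 'star'])
print(star)      # ['sr', 'str', 'sttr']

# + requires at least one a between m and n
plus = words_matching(r'ma+n', ['mn', 'man', 'maaan'])
print(plus)      # ['man', 'maaan']

# ? allows zero or one a between m and n
optional = words_matching(r'ma?n', ['mn', 'man', 'maaan'])
print(optional)  # ['mn', 'man']

# {2,3} takes two or three a's; in aaaa it takes the first three
braces = re.findall(r'a{2,3}', 'a aa aaa aaaa')
print(braces)    # ['aa', 'aaa', 'aaa']
```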

**. Period** matches any single character except a newline ('\n')

**..** matches aa, abc, abcd (it does not match a: the pattern expects two characters and a has only one)

**\ Backslash** escapes special characters, including metacharacters

**\$a** matches if a string contains a dollar sign followed by a ($ is not interpreted as a special character)
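
A brief sketch of the period and the backslash escape. The sample strings and variable names are assumptions for illustration only.

```python
import re

# .. consumes two characters per match
two_chars = re.findall(r'..', 'abcd')
print(two_chars)    # ['ab', 'cd']

# a single character is not enough for ..
single = re.search(r'..', 'a')
print(single)       # None

# by default the period does not match a newline
dot_newline = re.search(r'a.b', 'a\nb')
print(dot_newline)  # None

# \$ matches a literal dollar sign instead of acting as the end anchor
price = re.search(r'\$a', 'cost: $a')
print(price.group())  # $a
```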



# **Special Sequences**

---




**\A** Matches if the specified characters are at the start of a string

**\AThe** Matches "The sun", does not match "In the sun"

**\b** Matches if the specified characters are at the beginning or end of a word: \bfoo (football), ick\b (sick). Like the caret, it is an anchor: it matches a position, not a character

**\B** Matches if the specified characters are not at the beginning or end of a word (opposite of \b): \Bic or ic\B will match sick

**\d** Matches any decimal digit [0-9] 

**\D** Matches any character that is not a decimal digit [^0-9]

**\$\d** Matches a string that has a $ before one digit

**\s** Matches where a string contains whitespace [ \t\n\r\f\v] 

**\S** Matches where a string contains non-whitespace characters [^ \t\n\r\f\v]

**\w** Matches any alphanumeric character (letters and numbers) [a-zA-Z0-9_] (the underscore also counts as an alphanumeric character)

**\W** Matches any non-alphanumeric character (e.g. %^&*$#@!? etc.) [^a-zA-Z0-9_]

**\Z** Matches if the specified characters are at the end of a string (e.g. thon\Z will match Python)
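
The special sequences can be tried out in a few lines. This is a minimal sketch: the sample strings and variable names are assumptions, and the anchors are written in the position where Python's re module expects them (\A before the text, \Z after it).

```python
import re

# \A anchors the pattern to the very start of the string
starts = bool(re.search(r'\AThe', 'The sun'))        # True
not_start = bool(re.search(r'\AThe', 'In The sun'))  # False

# \b is a word boundary: here foo must begin a word
word_starts = re.findall(r'\bfoo', 'football foo tofoo')  # ['foo', 'foo']

# \d, \D, \w and \s each match one character of their class
digits = re.findall(r'\d', 'a1b22')            # ['1', '2', '2']
non_digits = re.findall(r'\D', 'a1b22')        # ['a', 'b']
words = re.findall(r'\w+', 'snake_case, 42!')  # ['snake_case', '42']
spaces = re.findall(r'\s', 'a b\tc')           # [' ', '\t']

# \Z anchors the pattern to the very end of the string
ends = bool(re.search(r'thon\Z', 'Python'))    # True

print(starts, not_start, word_starts, digits, non_digits, words, spaces, ends)
```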



---


A Regex object is **immutable**, which means that it can be used only for the match pattern you define when you create it. However, it can be used any number of times without being recompiled. 

---
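
Assuming the Regex object described above is the one returned by re.compile() (the notebook does not name the function), the sketch below compiles a pattern once and reuses it on two strings. The variable names and the second sample string are illustrative.

```python
import re

# compile the pattern once; the resulting object is tied to this pattern
digit_runs = re.compile(r'\d+')

# reuse it on as many strings as needed without recompiling
first = digit_runs.findall('Hello 14 ! It is:1 beautiful day today.')
second = digit_runs.findall('Room 101, floor 3')

print(first)   # ['14', '1']
print(second)  # ['101', '3']
```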



# Getting Started 

---

### > This section covers a few of the most common functions of the re module



First, import the re module into Python

In [0]:
import re

Define the pattern and the string you are going to search

In [0]:
string = 'Hello 14 ! It is:1 beautiful day today.'
pattern = r'\d+'  # raw string, so the backslash reaches the regex engine unchanged

re.findall() returns a list containing all matches

In [0]:
string = 'Hello 14 ! It is:1 beautiful day today.'
pattern = r'\d+'
# re.findall() takes the pattern first, then the string to search
result = re.findall(pattern, string)
print(result)

# Output: ['14', '1']


re.split() splits the string at each match and returns a list of the pieces between the matches

In [0]:
# the pattern comes first here too; the matched digits are dropped from the result
result2 = re.split(pattern, string)
print(result2)

# Output: ['Hello ', ' ! It is:', ' beautiful day today.']

re.sub() returns a string where matched occurrences are replaced with the content of the replacement argument (repl below).

In [0]:
# replace each run of whitespace with ','
pattern = r'\s+'
repl = ','
string = 'I like going to the mountains'

result3 = re.sub(pattern, repl, string)
print(result3)

# Output: I,like,going,to,the,mountains


You can also pass the count parameter to limit the number of replacements

In [0]:
pattern = r'\s+'
repl = ','
string = 'I like going to the mountains'
# only the first two runs of whitespace are replaced
result4 = re.sub(pattern, repl, string, count=2)
print(result4)

# Output: I,like,going to the mountains


re.subn() works like re.sub(), but returns a tuple with the new string and the number of substitutions made

In [0]:
pattern = r'\s+'
repl = ','
string = 'I like going to the mountains'
result5 = re.subn(pattern, repl, string)
print(result5)

# Output: ('I,like,going,to,the,mountains', 5)


re.search() takes a pattern and a string and looks for the first location where the RegEx pattern produces a match with the string. If the search is successful, re.search() returns a match object; if not, it returns None.

In [0]:
pattern = r'\AI'
string = "I like going to the mountains"
result6 = re.search(pattern, string)
print(result6)

# Output: <re.Match object; span=(0, 1), match='I'>
# (Python 3.6 and earlier print <_sre.SRE_Match object; span=(0, 1), match='I'>)


**Match**
A match object exposes several attributes and methods
> you can list them using the dir() function

Here are some of the most used match attributes and methods:

**match.group()** returns the part of the string where there is a match


In [0]:
string = '12345 678'
pattern = r'(\d{3}) (\d{2})'

match = re.search(pattern, string) 
if match:
  print(match.group())
else:
  print("pattern not found")

# Output: 345 67
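
The parentheses in the pattern above define two groups, which can also be read individually. The sketch below reuses the same pattern and string; the variable names m, whole, first, second and both are illustrative assumptions.

```python
import re

m = re.search(r'(\d{3}) (\d{2})', '12345 678')
if m:
    whole = m.group(0)   # the entire match, same as m.group()
    first = m.group(1)   # text captured by the first pair of parentheses
    second = m.group(2)  # text captured by the second pair
    both = m.groups()    # all captured groups as a tuple
    print(whole, first, second, both)  # 345 67 345 67 ('345', '67')
```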


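match.start(), match.end() and match.span() return the start index of the match, the end index, and both as a tuple

In [0]:
match.start()
# Output: 2
match.end()
# Output: 8
match.span()
# Output: (2, 8)

match.string is an attribute (not a method) that holds the string passed to re.search()

In [0]:
match.string
# Output: '12345 678'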



---
**Note:** If there is no match, the value **None** will be returned, instead of the Match Object.


---
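
Because of the note above, it is safer to test the result before calling methods on it. A minimal sketch, with a hypothetical helper named first_number:

```python
import re

def first_number(text):
    """Hypothetical helper: return the first run of digits in text, or None."""
    match = re.search(r'\d+', text)
    # calling match.group() on None would raise AttributeError, so check first
    return match.group() if match else None

found = first_number('Hello 14 !')
missing = first_number('no digits here')

print(found)    # 14
print(missing)  # None
```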



As you can imagine, Regex has many applications. Some of the most common include **data validation**, 
**data scraping** (e.g. finding words in a specific order), 
**data wrangling** (transforming data into different formats), 
**string parsing** (for example, capturing text inside a set of parentheses), 
**string replacement** (for example, replacing “;” with “,” or making text lowercase), **file renaming**, and many other applications involving strings.
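
Three of these applications can be sketched in a few lines. The five-digit validation rule, the sample strings, and the variable names are assumptions chosen for illustration, not requirements from any real dataset.

```python
import re

# data validation: accept only strings made of exactly five digits (assumed rule)
is_valid = bool(re.fullmatch(r'\d{5}', '90210'))
is_invalid = bool(re.fullmatch(r'\d{5}', '9021a'))

# string parsing: capture the text inside one pair of parentheses
inside = re.search(r'\(([^)]*)\)', 'regex (short for regular expression)').group(1)

# string replacement: swap ";" for "," and then lowercase the result
cleaned = re.sub(r';', ',', 'A;B;C').lower()

print(is_valid, is_invalid)  # True False
print(inside)                # short for regular expression
print(cleaned)               # a,b,c
```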




---
Resources:

https://www.programiz.com/python-programming/regex

https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285

https://www.w3schools.com/python/python_regex.asp

https://docs.python.org/3/howto/regex.html

