#### **RegEx Module**
*Python has a built-in package called `re` for working with regular expressions. Import it with `import re`.*

In [2]:
# Import Regular Expression
import re

#### **RegEx Functions**

**`findall()`**- returns a list containing all matches

In [2]:
text = "returing all containing all matchings"
reg = re.findall("ing", text)
print(reg)

['ing', 'ing', 'ing']


In [3]:
text = "returing all containing all matchings"
reg = re.findall("ain", text)
print(reg)

['ain']


In [5]:
text = "returing all the containing of all matchings"
reg = re.findall("dev", text)
print(reg)

[]
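
`findall()` returns only the matched text. When the positions of the matches matter as well, `re.finditer()` yields one match object per match. A small sketch on the same sample string (the variable name `positions` is only illustrative):

```python
import re

text = "returing all containing all matchings"

# finditer() yields one match object per non-overlapping match
positions = [(m.start(), m.end()) for m in re.finditer("ing", text)]
print(positions)   # [(5, 8), (20, 23), (33, 36)]
```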


**`search()`**- returns a match object if there is a match, and `None` otherwise. If there is more than one match, only the first occurrence is returned.

In [15]:
text = "returing all the containing of all matchings"
reg = re.search("ing", text)
print(f"The first occurrence of 'ing' in position {reg.start()}-{reg.end()}")

The first occurrence of 'ing' in position 5-8


In [17]:
text = "returing all the containing of all matchings"
reg = re.search("aap", text)
print(reg)

None
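
Because `search()` returns `None` when nothing matches, calling `.start()` on the result directly would raise an `AttributeError`. A minimal sketch of guarding against that; the helper name `describe_first_match` is made up for illustration:

```python
import re

def describe_first_match(pattern, text):
    """Return a short description of the first match, or a fallback message."""
    match = re.search(pattern, text)
    if match is None:                      # no match: search() returned None
        return f"{pattern!r} not found"
    # group() is the matched text, span() is the (start, end) tuple
    return f"{match.group()!r} found at {match.span()}"

text = "returing all the containing of all matchings"
found = describe_first_match("ing", text)
missing = describe_first_match("aap", text)
print(found)     # 'ing' found at (5, 8)
print(missing)   # 'aap' not found
```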


**`split()`**- returns a list where the string has been split at each match

In [19]:
text = "returing all the containing of all matchings"
reg = re.split(r"\s", text)      # r"\s" matches a whitespace character, so the string is split at each one
print(reg)

['returing', 'all', 'the', 'containing', 'of', 'all', 'matchings']


In [22]:
# We can also limit the number of splits with the `maxsplit` parameter
text = "returing all the containing of all matchings"
reg = re.split(r"\s", text, maxsplit=1)      # Split only at the first whitespace character
print(reg)

['returing', 'all the containing of all matchings']
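
Splitting at every single whitespace character produces empty strings when whitespace repeats. A short sketch comparing `\s` with `\s+` on a made-up sample string:

```python
import re

sample = "returing  all\tthe containing"   # two spaces, then a tab

single = re.split(r"\s", sample)    # one split per whitespace character
runs = re.split(r"\s+", sample)     # one split per run of whitespace
print(single)   # ['returing', '', 'all', 'the', 'containing']
print(runs)     # ['returing', 'all', 'the', 'containing']
```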


**`sub()`**- replaces the matches with the text of your choice

In [24]:
text = "returing all the containing of all matchings"
reg = re.sub(r"\s", "-", text)       # Replace every whitespace character with "-" (dash)
print(reg)

returing-all-the-containing-of-all-matchings
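
`sub()` also accepts a `count` argument to limit the number of replacements, and `re.subn()` additionally reports how many replacements were made. A brief sketch with the same sample string:

```python
import re

text = "returing all the containing of all matchings"

# count=2 replaces only the first two whitespace characters
limited = re.sub(r"\s", "-", text, count=2)

# subn() returns a (new_string, number_of_replacements) tuple
replaced, n = re.subn(r"\s", "-", text)

print(limited)   # returing-all-the containing of all matchings
print(n)         # 6
```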


#### **RegEx Special Sequences**
*A special sequence is a `\` followed by one of a specific set of characters (mostly letters), and it has a special meaning in the pattern.*

In [6]:
# `\A` - Returns a match if the specified characters are at the beginning of the string. (r"\Axxx")
text = "returing all the containing of all matchings"
reg = re.sub(r"\Aret", "_", text)
print(reg)

_uring all the containing of all matchings


In [8]:
# `\b` - Returns a match where the specified characters are at the beginning or at the end of a word. (r"\bxxx" or r"xxx\b")
# The 'r' prefix makes sure that the string is treated as a raw string, so "\b" is not read as a backspace character.
text = "returing all the containing of all matchings"
reg = re.sub(r"\bret", "_", text)       # At the beginning of a word
print(reg)

_uring all the containing of all matchings
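
The effect of the raw-string prefix can be checked directly. In this sketch the same pattern is written with and without `r`; the variable names are only illustrative:

```python
import re

text = "returing all the containing of all matchings"

# In a normal string "\b" is the backspace character (one character);
# in a raw string it stays as backslash + "b", which re reads as a word boundary.
print(len("\b"), len(r"\b"))   # 1 2

without_raw = re.findall("\bret", text)   # looks for backspace + "ret"
with_raw = re.findall(r"\bret", text)     # looks for "ret" at a word boundary
print(without_raw)   # []
print(with_raw)      # ['ret']
```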


In [10]:
text = "returing all the containing of all matchings"
reg = re.sub(r"ngs\b", "_", text)       # At the end of a word
print(reg)

returing all the containing of all matchi_


In [12]:
# `\B` - Returns a match where the specified characters are present, but not at the beginning (r"\Bxxx") or not at the end (r"xxx\B") of a word.
text = "returing all the containing of all matchings"
reg = re.sub(r"\Bing", "_", text)       # "ing" that does not start a word
print(reg)

retur_ all the contain_ of all match_s


In [14]:
text = "returing all the containing of all matchings"
reg = re.sub(r"ing\B", "_", text)       # "ing" that does not end a word
print(reg)

returing all the containing of all match_s


In [15]:
# `\d` - Returns a match where the string contains digits (numbers from 0-9). (r"\d")
text = "Excel will change numbers like 0784367998 to 784367998."
reg = re.sub(r"\d", "_", text)       # Replace every digit
print(reg)

Excel will change numbers like __________ to _________.


In [16]:
# `\D` - Returns a match for every character that is not a digit. (r"\D")
text = "Excel will change numbers like 0784367998 to 784367998."
reg = re.sub(r"\D", "_", text)       # Replace every non-digit character
print(reg)

_______________________________0784367998____784367998_


In [17]:
# `\s` - Returns a match where the string contains a whitespace character. (r"\s")
text = "Excel will change numbers like 0784367998 to 784367998."
reg = re.sub(r"\s", "_", text)       # Replace every whitespace character
print(reg)

Excel_will_change_numbers_like_0784367998_to_784367998.


In [18]:
# `\S` - Returns a match for every character that is not a whitespace character. (r"\S")
text = "Excel will change numbers like 0784367998 to 784367998."
reg = re.sub(r"\S", "_", text)       # Replace every non-whitespace character
print(reg)

_____ ____ ______ _______ ____ __________ __ __________


In [19]:
# `\w` - Returns a match where the string contains any word character (letters, digits from 0-9, and the underscore _ character). (r"\w")
text = "Excel will change numbers like 0784367998 to 784367998."
reg = re.sub(r"\w", "_", text)       # Replace every word character
print(reg)

_____ ____ ______ _______ ____ __________ __ _________.


In [20]:
# `\W` - Returns a match for every character that is not a word character. (r"\W")
text = "Excel will change numbers like 0784367998 to 784367998."
reg = re.sub(r"\W", "_", text)       # Replace every non-word character
print(reg)

Excel_will_change_numbers_like_0784367998_to_784367998_


In [22]:
# `\Z` - Returns a match if the specified characters are at the end of the string. (r"xxx\Z")
text = "Excel will change numbers like 0784367998 to 784367998."
reg = re.sub(r"998\.\Z", "_", text)       # "\." matches the literal dot that ends the string
print(reg)

Excel will change numbers like 0784367998 to 784367_


#### **RegEx Metacharacters**
*Metacharacters are characters with a special meaning.*

**`"[]" A set of characters`**

In [37]:
# [] - A set of characters, e.g. "[r-z]"
text = "returing all the containing of all matchings"
reg = re.sub("[r-z]", "_", text)    # Matches any lowercase letter from r to z
print(reg)

_e___ing all _he con_aining of all ma_ching_


In [38]:
# [] - A set of characters, e.g. "[7-9]"
text = "Excel will change numbers like 078436 7998 to 784367998."
reg = re.sub("[7-9]", "_", text)    # Matches any digit from 7 to 9
print(reg)

Excel will change numbers like 0__436 ____ to __436____.


In [8]:
# [] - A set of punctuation characters; "[" and "]" must be escaped inside the set
text = "Excel will change a-lot-of numbers like 0784367998 to 784367998[], where meta_characters includes: '(+a=b-c_d)z\"e(){f*g&h}{}^i%j$k#l@m[!n~o`p,q.r]\\/s?t;u:v'w/x|y."
reg = re.sub("[+=,-.;:`~'\"(){}\\[\\]*&^%$#@!?/|_]", " ", text)    # ",-." is a range covering ",", "-" and "."
print(reg)

Excel will change a lot of numbers like 0784367998 to 784367998    where meta characters includes     a b c d z e   f g h    i j k l m  n o p q r \ s t u v w x y 


***`[NB]` The backslash (`\`) survives because it is not part of the set above. It can be matched inside a set by escaping it (`r"[\\]"`), or it can be replaced separately:***

In [16]:
reg = reg.replace("\\", " ")
print(reg)

Excel will change a lot of numbers like 0784367998 to 784367998    where meta characters includes     a b c d z e   f g h    i j k l m  n o p q r   s t u v w x y 
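
A backslash can also be handled inside a set, as long as it is escaped. A minimal sketch on a made-up sample string:

```python
import re

text = r"path\to/file|name"   # raw string: contains a real backslash

# Inside the set, "\\" (in a raw string) stands for one literal backslash
cleaned = re.sub(r"[\\/|]", " ", text)
print(cleaned)   # path to file name
```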


**`"\" Special sequence or Escape special characters`**

In [17]:
# \ - Signals a special sequence, and can also be used to escape special characters. (r"\d") d = digit
text = "Excel will change numbers like 078436 7998 to 784367998."
reg = re.sub(r"\d", "_", text)       # Replaces the digits
print(reg)

Excel will change numbers like ______ ____ to _________.


In [18]:
text = "Excel will change numbers like 078436 7998 to 784367998."
reg = re.sub(r"\.", "_", text)       # Replaces the literal dot; the backslash escapes the metacharacter "."
print(reg)

Excel will change numbers like 078436 7998 to 784367998_


**`"." Any character but newline`**

In [22]:
# . - Any character except newline ("xxx..x")
text = "Excel will change numbers like 078436 7998 to 784367998."
reg = re.sub("E...l", "_", text)    # Looks for a sequence that starts with `E`, followed by any three characters, and ends with `l`
print(reg)

_ will change numbers like 078436 7998 to 784367998.


**`"^" Starts with`**

In [29]:
# ^ - Starts with ("^xxx")
text = "Excel will change numbers like 078436 7998 to 784367998. And there is no other numbers."
reg = re.sub("^Excel", "_", text)
print(reg)

_ will change numbers like 078436 7998 to 784367998. And there is no other numbers.


In [30]:
text = "Excel will change numbers like 078436 7998 to 784367998. And there is no other numbers."
reg = re.sub("^Excel ", "_", text)
print(reg)

_will change numbers like 078436 7998 to 784367998. And there is no other numbers.


In [31]:
text = "Excel will change numbers like 078436 7998 to 784367998. And there is no other numbers."
reg = re.sub("^E", "_", text)
print(reg)

_xcel will change numbers like 078436 7998 to 784367998. And there is no other numbers.


**`"$" Ends with`**

In [36]:
# $ - Ends with ("xxxx$")
text = "Excel will change numbers like 078436 7998 to 784367998. And there is no other numbers."
reg = re.sub(r"numbers\.$", "_", text)    # "\." matches the literal dot before the end of the string
print(reg)

Excel will change numbers like 078436 7998 to 784367998. And there is no other _


In [38]:
text = "Excel will change numbers like 078436 7998 to 784367998. And there is no other numbers."
reg = re.sub(r"bers\.$", "_", text)    # Only a match at the very end of the string is replaced
print(reg)

Excel will change numbers like 078436 7998 to 784367998. And there is no other num_


In [40]:
text = "Excel will change numbers like 078436 7998 to 784367998. And there is no other numbers."
reg = re.sub(r"no.......numbers\.$", "_", text)    # The seven "." wildcards cover " other "
print(reg)

Excel will change numbers like 078436 7998 to 784367998. And there is _


**`"*" Zero or more occurrences`**

In [6]:
# * - Zero or more occurrences ("xx.*x")
text = "Excel will change numbers like 078436 7998 to 784367998, and there is no other numbers"
reg = re.sub("Ex*", "_", text)      # "E" followed by zero or more "x"
print(reg)

_cel will change numbers like 078436 7998 to 784367998, and there is no other numbers


In [35]:
text = "Excel will change numbers like 078436 7998 to 784367998, and there is no other numbers"
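reg = re.sub(".*", "_", text)       # In Python 3.7+, ".*" matches the whole string and then an empty match at the end, hence two replacements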
print(reg)

__


In [58]:
text = "Excel will change numbers like 078436 7998 to 784367998, and there is no other numbers"
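reg = re.sub("[.]*", "_", text)     # Inside a set "." is a literal dot; this text has none, so only the empty matches between characters are replaced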
print(reg)

_E_x_c_e_l_ _w_i_l_l_ _c_h_a_n_g_e_ _n_u_m_b_e_r_s_ _l_i_k_e_ _0_7_8_4_3_6_ _7_9_9_8_ _t_o_ _7_8_4_3_6_7_9_9_8_,_ _a_n_d_ _t_h_e_r_e_ _i_s_ _n_o_ _o_t_h_e_r_ _n_u_m_b_e_r_s_


In [43]:
text = "Excel will change numbers like 078436 7998 to 784367998. And there is no other numbers."
reg = re.sub("ll*", "_", text)      # "l" followed by zero or more "l"
print(reg)

Exce_ wi_ change numbers _ike 078436 7998 to 784367998. And there is no other numbers.


**`"+" One or more occurrences`**

In [49]:
# + - One or more occurrences ("xx.+x")
text = "Excel will change numbers like 078436 7998 to 784367998. And there is no other numbers."
reg = re.sub(".+", "_", text)
print(reg)

_


**`"?" Zero or one occurrence`**

In [66]:
# ? - Zero or one occurrence ("xx.?x")
text = "Excel will change numbers like 078436 7998 to 784367998. And there is no other numbers."
reg = re.sub("wi.?l", "_", text)
print(reg)

Excel _ change numbers like 078436 7998 to 784367998. And there is no other numbers.


**`"{}" Specified number of occurrences {2} or range of occurrences {2,4}`**

In [67]:
# {} - Exactly the specified number of occurrences ("xx.{2}x")
text = "Excel will change numbers like 078436 7998 to 784367998. And there is no other numbers."
reg = re.sub("wi.{2}", "_", text)
print(reg)

Excel _ change numbers like 078436 7998 to 784367998. And there is no other numbers.


In [70]:
text = "Excel will change numbers like 078436 7998 to 784367998. And there is no other numbers."
reg = re.sub("9{2}", "_", text)
print(reg)

Excel will change numbers like 078436 7_8 to 784367_8. And there is no other numbers.


In [12]:
text = "Excel will change numbers like 078436 7998 to 784369998. And there is no other numbers."
reg = re.sub("9{2,4}", "_", text)   # Two to four consecutive 9s, as many as possible
print(reg)

Excel will change numbers like 078436 7_8 to 78436_8. And there is no other numbers.
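
The four quantifiers are easier to compare side by side on one short string. A small sketch with a made-up sample; each quantifier applies only to the `b` directly before it:

```python
import re

sample = "a ab abb abbb"

star = re.findall(r"ab*", sample)        # "a" followed by zero or more "b"
plus = re.findall(r"ab+", sample)        # "a" followed by one or more "b"
optional = re.findall(r"ab?", sample)    # "a" followed by zero or one "b"
ranged = re.findall(r"ab{2,3}", sample)  # "a" followed by two to three "b"

print(star)      # ['a', 'ab', 'abb', 'abbb']
print(plus)      # ['ab', 'abb', 'abbb']
print(optional)  # ['a', 'ab', 'ab', 'ab']
print(ranged)    # ['abb', 'abbb']
```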


**`"|" Either or`**

In [72]:
# | - Either or (xxx|xxx)
text = "Excel will change numbers like 078436 7998 to 784367998. And there is no other numbers."
reg = re.sub("07|98", "_", text)
print(reg)

Excel will change numbers like _8436 79_ to 7843679_. And there is no other numbers.


In [73]:
text = "Excel will change numbers like 078436 7998 to 784367998. And there is no other numbers."
reg = re.sub("aa|ll", "_", text)
print(reg)

Excel wi_ change numbers like 078436 7998 to 784367998. And there is no other numbers.


**`"()" Capture and group`**

In [74]:
# () - Capture and group: the parentheses go inside the pattern string
text = "Excel will change numbers like 078436 7998 to 784367998. And there is no other numbers."
reg = re.sub("(no)", "_", text)
print(reg)

Excel will change numbers like 078436 7998 to 784367998. And there is _ other numbers.


In [76]:
text = "Excel will change numbers like 078436 7998 to 784367998. And there is no other numbers."
reg = re.sub("(7998)", "_", text)
print(reg)

Excel will change numbers like 078436 _ to 78436_. And there is no other numbers.
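
Groups become useful once the captured text is read back. A short sketch that captures two numbers, reads them with `group()`, and swaps them with backreferences (`\1`, `\2`) in the replacement string:

```python
import re

text = "Excel will change numbers like 078436 7998 to 784367998."

# Two capture groups separated by a single space
match = re.search(r"(\d+) (\d+)", text)
print(match.group(0))   # 078436 7998  (the whole match)
print(match.group(1))   # 078436
print(match.group(2))   # 7998

# \1 and \2 in the replacement refer to the captured groups
swapped = re.sub(r"(\d+) (\d+)", r"\2 \1", text)
print(swapped)   # Excel will change numbers like 7998 078436 to 784367998.
```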


#### **Sets**
*A set is a group of characters inside a pair of square brackets `[]`, with a special meaning.*

**`[abd]`** - *Returns a match where one of the specified characters (`a`, `b` or `d`) is present.*

In [2]:
text = "The content: function is not of much use in this case, as it will just report the (one of the dozen) filenames wherein this text was found (like index1.txt), not the actual filename (like George Orwell - 1984 (epub).rar)"
reg = re.sub("[abd]", "_", text)
print(reg)

The content: function is not of much use in this c_se, _s it will just report the (one of the _ozen) filen_mes wherein this text w_s foun_ (like in_ex1.txt), not the _ctu_l filen_me (like George Orwell - 1984 (epu_).r_r)


**`[a-n]`** - *Returns a match for any lower-case character alphabetically between `a` and `n`.*

In [3]:
text = "The content: function is not of much use in this case, as it will just report the (one of the dozen) filenames wherein this text was found (like index1.txt), not the actual filename (like George Orwell - 1984 (epub).rar)"
reg = re.sub("[a-n]", "_", text)
print(reg)

T__ _o_t__t: _u__t_o_ _s _ot o_ _u__ us_ __ t__s __s_, _s _t w___ _ust r_port t__ (o__ o_ t__ _oz__) ________s w__r___ t__s t_xt w_s _ou__ (____ ____x1.txt), _ot t__ __tu__ ________ (____ G_or__ Orw___ - 1984 (_pu_).r_r)


In [5]:
# Try out Upper Case characters
text = "The content: function is not of much use in this case, as it will just report the (one of the dozen) filenames wherein this text was found (like index1.txt), not the actual filename (like George Orwell - 1984 (epub).rar)"
reg = re.sub("[A-Z]", "_", text)
print(reg)

_he content: function is not of much use in this case, as it will just report the (one of the dozen) filenames wherein this text was found (like index1.txt), not the actual filename (like _eorge _rwell - 1984 (epub).rar)


**`[^abd]`** - *Returns a match for any characters EXCEPT `a`, `b`, and `d`*

In [6]:
text = "The content: function is not of much use in this case, as it will just report the (one of the dozen) filenames wherein this text was found (like index1.txt), not the actual filename (like George Orwell - 1984 (epub).rar)"
reg = re.sub("[^abd]", "_", text)
print(reg)

__________________________________________________a____a______________________________________d___________a_______________________a______d_________d__________________a___a_______a__________________________________b___a__


**`[0137]`** - *Returns a match where any of the specified digits (`0`, `1`, `3`, `7`) are present.*

In [7]:
text = "The content: function is not of much use in this case, as it will just report the (one of the dozen) filenames wherein this text was found (like index1.txt), not the actual filename (like George Orwell - 1984 (epub).rar)"
reg = re.sub("[0137]", "_", text)
print(reg)

The content: function is not of much use in this case, as it will just report the (one of the dozen) filenames wherein this text was found (like index_.txt), not the actual filename (like George Orwell - _984 (epub).rar)


**`[0-9]`** - *Returns a match for any digit between `0` and `9`.*

In [8]:
text = "The content: function is not of much use in this case, as it will just report the (one of the dozen) filenames wherein this text was found (like index1.txt), not the actual filename (like George Orwell - 1984 (epub).rar)"
reg = re.sub("[0-9]", "_", text)
print(reg)

The content: function is not of much use in this case, as it will just report the (one of the dozen) filenames wherein this text was found (like index_.txt), not the actual filename (like George Orwell - ____ (epub).rar)


**`[0-5][0-9]`** - *Returns a match for any two-digit numbers from `00` to `59`.*

In [9]:
text = "The content: function is not of much use in this case, as it will just report the (one of the dozen) filenames wherein this text was found (like index1.txt), not the actual filename (like George Orwell - 1984 (epub).rar)"
reg = re.sub("[0-5][0-9]", "_", text)
print(reg)

The content: function is not of much use in this case, as it will just report the (one of the dozen) filenames wherein this text was found (like index1.txt), not the actual filename (like George Orwell - _84 (epub).rar)


**`[a-zA-Z]`** - *Returns a match for any character alphabetically between `a` and `z`, lower-case or upper-case.*

In [10]:
text = "The content: function is not of much use in this case, as it will just report the (one of the dozen) filenames wherein this text was found (like index1.txt), not the actual filename (like George Orwell - 1984 (epub).rar)"
reg = re.sub("[a-zA-Z]", "_", text)
print(reg)

___ _______: ________ __ ___ __ ____ ___ __ ____ ____, __ __ ____ ____ ______ ___ (___ __ ___ _____) _________ _______ ____ ____ ___ _____ (____ _____1.___), ___ ___ ______ ________ (____ ______ ______ - 1984 (____).___)


**`[+]`** - *Returns a match for any `+` character in the string. Inside a set, `+`, `*`, `.`, `$`, `|`, `()` and `{}` have no special meaning.*

In [12]:
text = "The content: function is not of much use in this case, as it will just report the (one of the dozen) filenames wherein this text was found (like index1.txt), + not the actual filename (like George Orwell - 1984 (epub).rar)"
reg = re.sub("[+]", "_", text)
print(reg)

The content: function is not of much use in this case, as it will just report the (one of the dozen) filenames wherein this text was found (like index1.txt), _ not the actual filename (like George Orwell - 1984 (epub).rar)
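
A quick check that metacharacters are taken literally inside a set, using a made-up sample string. The second pattern shows one way to include a literal `-`: placing it first in the set so it is not read as a range.

```python
import re

sample = "1+1=2. a*b costs $5"

# ".", "+", "*" and "$" are ordinary characters inside []
literals = re.findall(r"[.+*$]", sample)
print(literals)   # ['+', '.', '*', '$']

# A "-" placed first in the set is a literal dash, not a range
dashes = re.findall(r"[-az]", "a-z b")
print(dashes)     # ['a', '-', 'z']
```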


#### **Combination of Metacharacters, Special Sequences and Sets**

In [2]:
text = '''The content: function is not of much use in this case, as it will just report the (one of the dozen) filenames wherein this text was found (like index1.txt), not the actual filename (like George Orwell - 1984 (epub).rar).


Do you have those files (like George Orwell - 1984 (epub).rar) on your disk? then you can skip searching in the dozen index files.

The question itself is rather vague.        What should be in the output file? What is this "code" you are referring to?
Please explain yourself further ...

Note that you probably will have some mismatches. That becomes apparent even in your own examples: H G Wells vs H.G. Wells (where HG Wells is yet another possibility('''

In [3]:
print(text)

The content: function is not of much use in this case, as it will just report the (one of the dozen) filenames wherein this text was found (like index1.txt), not the actual filename (like George Orwell - 1984 (epub).rar).


Do you have those files (like George Orwell - 1984 (epub).rar) on your disk? then you can skip searching in the dozen index files.

The question itself is rather vague.        What should be in the output file? What is this "code" you are referring to?
Please explain yourself further ...

Note that you probably will have some mismatches. That becomes apparent even in your own examples: H G Wells vs H.G. Wells (where HG Wells is yet another possibility(


In [11]:
# Applying everything covered so far
reg = re.sub("\n", " ", text)   # Replaces each newline with a space
reg = re.sub(" +", " ", reg)    # Collapses runs of spaces into a single space
reg = re.sub(r"[.]{2,}", "", reg)    # Removes runs of two or more dots (.)
reg = re.sub(r"[(][\w* *]", "", reg)   # Removes each "(" together with the one character after it, if that is a word character, "*" or a space
print(reg)
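
The whitespace and dot clean-up steps can also be wrapped in a small reusable function. This is only a sketch: the name `clean_text` and the short sample string are made up for illustration, and the patterns are precompiled with `re.compile()`.

```python
import re

DOT_RUNS = re.compile(r"[.]{2,}")   # two or more consecutive dots
WHITESPACE = re.compile(r"\s+")     # runs of spaces, tabs and newlines

def clean_text(text):
    """Drop dot runs, collapse whitespace, and trim both ends."""
    text = DOT_RUNS.sub("", text)
    text = WHITESPACE.sub(" ", text)
    return text.strip()

sample = "rather vague.        What next?\nPlease explain further ...\n"
cleaned = clean_text(sample)
print(cleaned)   # rather vague. What next? Please explain further
```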




#### **Patterns for Specific Task**

**`Email`**

In [15]:
text = '''abs.alchemy20@gmail.com, abs.sayem@gmail.com, abssayem.ieee@gmail.com, abssayem@cuet.ac.bd, abssayem121194@gmail.com'''
reg = re.findall(r"[a-zA-Z0-9._+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9]+\.*[a-zA-Z0-9]*", text)
print(reg)

['abs.alchemy20@gmail.com', 'abs.sayem@gmail.com', 'abssayem.ieee@gmail.com', 'abssayem@cuet.ac.bd', 'abssayem121194@gmail.com']


**`Phone Number`**

In [12]:
text = '''01313406618, +8801313406618, 01313-406618, 01313 406618'''
reg = re.findall(r"\d{11}", text)   # Exactly eleven consecutive digits
print(reg)

['01313406618', '88013134066']
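
The `\d{11}` pattern only accepts eleven consecutive digits, so numbers written with a country prefix, a dash, or a space are truncated or missed. A hedged sketch of a more tolerant pattern; it assumes the four formats in the sample text are the only ones to support, and `phone_pattern` is an illustrative name:

```python
import re

text = "01313406618, +8801313406618, 01313-406618, 01313 406618"

# (?:\+88)?  optional country prefix (an assumption based on the sample)
# 01\d{3}    first five digits, starting with "01"
# [- ]?      optional dash or space
# \d{6}      remaining six digits
phone_pattern = r"(?:\+88)?01\d{3}[- ]?\d{6}"

numbers = re.findall(phone_pattern, text)
print(numbers)
# ['01313406618', '+8801313406618', '01313-406618', '01313 406618']
```

The prefix group is non-capturing (`(?:...)`) so that `findall()` returns the full match rather than only the group.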