# Regular Expression

# Metacharacters
```
[] A set of characters
\ Signals a special sequence (can also be used to escape special characters)
. Any character (except newline character)
^ String starts with
$ String ends with
* Zero or more occurrences
? Zero or one occurrence (after another quantifier, makes it non-greedy)
+ One or more occurrences
{} Exactly the specified number of occurrences (or a range, e.g. {2,4})
| Either or
() Capture and group

Special Sequences

\A Returns a match if the specified characters are at the beginning of the string
\b Returns a match where the specified characters are at the beginning or at the end of a word, e.g. r"\bain" or r"ain\b"
\B Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word
\d Matches any digit character (0-9)
\D Matches any character that is NOT a digit
\s Matches any white space character
\S Matches any character that is NOT white space
\w Matches any word character (letters a-z and A-Z, digits 0-9, and the underscore _; for str patterns, other Unicode letters and digits too)
\W Matches any character that is NOT a word character
\Z Returns a match if the specified characters are at the end of the string

The re module offers a set of functions that allows us to search a string for a match:

findall()	- Returns a list containing all matches
search()	- Returns a Match object if there is a match anywhere in the string
split()		- Returns a list where the string has been split at each match
sub()	 	- Replaces one or many matches with a string
finditer()	- Returns an iterator that yields a Match object for each match

Sets

A set is a set of characters inside a pair of square brackets [] with a special meaning:

[arn] 		- Returns a match where one of the specified characters (a, r, or n) is present
[a-n] 		- Returns a match for any lower case character, alphabetically between a and n
[^arn] 		- Returns a match for any character EXCEPT a, r, and n
[0123]		- Returns a match where any of the specified digits (0, 1, 2, or 3) are present
[0-9]		- Returns a match for any digit between 0 and 9
[0-5][0-9]	- Returns a match for any two-digit number from 00 to 59
[a-zA-Z]	- Returns a match for any character alphabetically between a and z, lower case OR upper case
[+]		- In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character
```
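
The reference above can be tried out directly. The snippet below is a small sketch, using a made-up `sample` string (not from the notes), that exercises `\d`, a pair of sets, `\b`, and the `?` quantifier.

```python
import re

# Hypothetical sample text, chosen only to exercise the patterns below.
sample = "Order 66 shipped at 09:45, order 7 at 23:10."

# \d+ matches runs of one or more digits.
numbers = re.findall(r"\d+", sample)

# Two sets per field match HH:MM style times.
times = re.findall(r"[0-2][0-9]:[0-5][0-9]", sample)

# \b anchors the match to word boundaries; re.IGNORECASE ignores case.
orders = re.findall(r"\border\b", sample, flags=re.IGNORECASE)

# ? makes the preceding character optional, so both spellings match.
spellings = re.findall(r"colou?r", "color and colour")

print(numbers)    # ['66', '09', '45', '7', '23', '10']
print(times)      # ['09:45', '23:10']
print(orders)     # ['Order', 'order']
print(spellings)  # ['color', 'colour']
```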

In [8]:
import re
txt = "The rain in the Spain"
pattern = "^T.*Spain$"
x = re.search(pattern, txt)
print(x)

<re.Match object; span=(0, 21), match='The rain in the Spain'>


### findall()

```
* The findall() function returns a list containing all matches.
* The list contains the matches in the order they are found.
* If no matches are found, an empty list is returned.
```

In [11]:
txt = "The rain in the Spain"
pattern = "ai"
x = re.findall(pattern, txt)
print(x)

['ai', 'ai']


In [17]:
txt = "The rain in the Spain"
pattern = "Portugal"
x = re.findall(pattern, txt)
print(x)

[]


In [52]:
txt = "The Match object has properties and methods used to retrieve information about the search, and the result"
pattern = r"\bM\w+"
x = re.findall(pattern, txt)
print(x)

['Match']
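
The function list also mentions `finditer()`, which is useful when the position of each match matters and not just its text. A minimal sketch, reusing the sentence from the examples above:

```python
import re

txt = "The rain in the Spain"

# finditer() yields one Match object per occurrence, in the order found.
for m in re.finditer(r"ai", txt):
    print(m.group(), m.span())

# Collect only the start index of every match.
positions = [m.start() for m in re.finditer(r"ai", txt)]
print(positions)  # [5, 18]
```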


### search()
```
* Searches the string for a match, and returns a `Match object` if there is a match.
* If there is more than one match, only the first occurrence of the match will be returned.
* If no matches are found, the value None is returned.
```

In [10]:
x = re.search("S.*", txt)
print(x)

<re.Match object; span=(16, 21), match='Spain'>


In [23]:
txt = "The rain in the Spain"
pattern = r"\s"
x = re.search(pattern, txt)
print("The first white-space character is located in position:", x.start())

The first white-space character is located in position: 3


In [25]:
txt = "The rain in the Spain"
pattern = "Portugal"
x = re.search(pattern, txt)
print(x)

None
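
Because `search()` returns `None` when nothing matches, calling a method such as `.start()` on the result without checking it first raises an `AttributeError`. One way to guard against that is sketched below; `first_match` is a hypothetical helper name, not part of the `re` module.

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None when the pattern is not found."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return m.group()

print(first_match(r"S\w+", "The rain in the Spain"))      # Spain
print(first_match(r"Portugal", "The rain in the Spain"))  # None
```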


In [50]:
txt = "The Match object has properties and methods used to retrieve information about the search, and the result"
pattern = r"\bM\w+"
x = re.search(pattern, txt)
print(x)

<re.Match object; span=(4, 9), match='Match'>


In [58]:
txt = """
Salman Salim Abdul Rashid Khan; born 27 December 1965 is an Indian actor, film producer,
writer and television personality who works predominantly in Hindi films.
In a film career spanning over thirty five years, 
Khan has received numerous awards, including two National Film Awards as a film producer,
and two Filmfare Awards as an actor.
He is cited in the media as one of the most commercially successful actors of Indian cinema.
Forbes has included Khan in listings of the highest-paid celebrities in the world, in 2015 and 2018,
with him being the highest-ranked Indian in the latter year.

"""

pattern = r"\b\d.*\d\b"
x = re.search(pattern, txt)
print(x)


<re.Match object; span=(38, 54), match='27 December 1965'>
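
The match above stops at `1965` only because `.` does not match a newline by default, so `.*` cannot run on to the years on later lines. On a single line the same greedy pattern would grab everything up to the last digit. The sketch below, using a shortened one-line string made up for illustration, contrasts greedy `.*` with non-greedy `.*?`.

```python
import re

# Hypothetical one-line version of the biography text.
line = "born 27 December 1965, awards in 2015 and 2018"

# Greedy: .* extends as far as possible, up to the last digit on the line.
greedy = re.search(r"\b\d.*\d\b", line).group()

# Non-greedy: .*? stops at the first position where the rest can match.
lazy = re.search(r"\b\d.*?\d\b", line).group()

print(greedy)  # 27 December 1965, awards in 2015 and 2018
print(lazy)    # 27
```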


### split()
```
* Returns a list where the string has been split at each match.
* Splitting at each white-space character (\s) breaks a sentence into words.
* The maxsplit parameter limits the number of splits; maxsplit=1 splits the string only at the first occurrence.
```

In [29]:
txt = "The rain in Spain"
pattern = r"\s"
x = re.split(pattern, txt)
print(x)

['The', 'rain', 'in', 'Spain']


In [31]:
txt = "The rain in Spain"
pattern = r"\s"
x = re.split(pattern, txt, maxsplit=2)
print(x)

['The', 'rain', 'in Spain']


In [34]:
txt = "The rain in Spain"
pattern = r"\s"
x = re.split(pattern, txt, maxsplit=1)
print(x)

['The', 'rain in Spain']


In [53]:
txt = "The Match object has properties and methods used to retrieve information about the search, and the result"
pattern = r"\bM\w+"
x = re.split(pattern, txt)
print(x)

['The ', ' object has properties and methods used to retrieve information about the search, and the result']
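
`split()` is not limited to a single delimiter character. The following sketch, with made-up input strings, splits on either a comma or a semicolon and shows that a capturing group in the pattern keeps the delimiters in the result.

```python
import re

# Hypothetical input with inconsistent delimiters and spacing.
csv_like = "apples, oranges;bananas ,  grapes"

# Split on a comma or semicolon, swallowing any surrounding white space.
items = re.split(r"\s*[,;]\s*", csv_like)

# With a capturing group, the matched delimiters are returned as well.
with_delims = re.split(r"([,;])", "a,b;c")

print(items)        # ['apples', 'oranges', 'bananas', 'grapes']
print(with_delims)  # ['a', ',', 'b', ';', 'c']
```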


### sub()
```
* Replaces the matches with the text of your choice.
* You can control the number of replacements by specifying the count parameter.
```

In [40]:
txt = "The rain in Spain"
pattern = r"\s"
x = re.sub(pattern, "_", txt)
print(x)

The_rain_in_Spain


In [45]:
txt = "The rain in Spain"
pattern = r"\s"
x = re.sub(pattern, "9", txt, count=1)
print(x)

The9rain in Spain
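
The replacement passed to `sub()` does not have to be a fixed string. As a sketch, the example below reorders captured groups with backreferences and then uses a function as the replacement; the input strings are illustrative only.

```python
import re

txt = "The event is on 15-08-2023 and 23-09-2023"

# \1, \2, \3 refer to the captured day, month, and year groups.
iso = re.sub(r"(\d{2})-(\d{2})-(\d{4})", r"\3-\2-\1", txt)

# The replacement can also be a function that receives each Match object.
shout = re.sub(r"\b\w{5,}\b", lambda m: m.group().upper(), "The rain in Spain")

print(iso)    # The event is on 2023-08-15 and 2023-09-23
print(shout)  # The rain in SPAIN
```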


### Match Object

```
* A Match object contains information about the search and its result.
* .span() returns a tuple containing the start and end positions of the match.
* .string returns the string passed into the function.
* .group() returns the part of the string where there was a match.
```

In [60]:
txt = "The rain in Spain"
pattern = r"\bS\w+"
x = re.search(pattern, txt)
print(x.span())

(12, 17)


In [61]:
txt = "The rain in Spain"
pattern = r"\bS\w+"
x = re.search(pattern, txt)
print(x.string)

The rain in Spain
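
Besides `.span()` and `.string`, a Match object exposes the matched text and any captured groups. A short sketch, using a pattern with two groups chosen for illustration:

```python
import re

m = re.search(r"(\w+) in (\w+)", "The rain in Spain")

whole = m.group()    # the entire match
first = m.group(1)   # text captured by the first pair of parentheses
second = m.group(2)  # text captured by the second pair
both = m.groups()    # all captured groups as a tuple

print(whole)               # rain in Spain
print(first, second)       # rain Spain
print(both)                # ('rain', 'Spain')
print(m.start(), m.end())  # 4 17
```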


In [110]:
txt = "Contact us at info@example.com or support@domain.pk for assistance atif_salam@domain.pk."
pattern = r"(\w{1,50}@\w{1,50}\.pk)"
match = re.findall(pattern, txt)
print(match)

['support@domain.pk', 'atif_salam@domain.pk']


In [104]:
text = "Contact us at info@example.com or support@domain.pk for assistance atif_salam@domain.pk."
pattern = r'\b\w+@\w+\.pk'

domain_pk = re.findall(pattern, text)
print(domain_pk)

['support@domain.pk', 'atif_salam@domain.pk']


In [134]:
txt = "My CNIC is 12345-6789012-3 and another one is 34567-8901234-5 and 42201 9446591 3"
# The separator is a hyphen or a space; [- ] is stricter than "." (any character).
pattern = r"(\d{4,5}[- ]\d{6,7}[- ]\d{1,2})"
match = re.findall(pattern, txt)
print(match)

['12345-6789012-3', '34567-8901234-5', '42201 9446591 3']


In [133]:
text = "My CNIC is 12345-6789012-3 and another one is 34567-8901234-5, 34567 8901234 5."
pattern = r"\d+[- ]\d+[- ]\d+"
cnic = re.findall(pattern, text)
print(cnic)

['12345-6789012-3', '34567-8901234-5', '34567 8901234 5']
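
`findall()` extracts candidates from running text. To check that a whole string is a CNIC, `re.fullmatch()` is a better fit. The sketch below assumes the 5-7-1 digit layout with hyphens seen in the sample strings; `CNIC_RE` and `is_cnic` are hypothetical names.

```python
import re

# Assumption: a CNIC is 5 digits, 7 digits, 1 digit, separated by hyphens.
CNIC_RE = re.compile(r"\d{5}-\d{7}-\d")

def is_cnic(value):
    """Return True only if the entire string has the assumed CNIC layout."""
    return CNIC_RE.fullmatch(value) is not None

print(is_cnic("12345-6789012-3"))       # True
print(is_cnic("12345-6789012-34"))      # False
print(is_cnic("CNIC 12345-6789012-3"))  # False
```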


In [138]:
txt = "یہ sentence میں کچھ English words بھی ہیں۔"
pattern = r"\b[\u0600-\u06FF]+"
match = re.findall(pattern, txt)
print(match)

['یہ', 'میں', 'کچھ', 'بھی', 'ہیں۔']


In [158]:
txt = "The event will take place on 15-08-2023 and 23-09-2023 or 19092023"
pattern = r"\d{1,2}-?\d{1,2}-?\d{4}"
match = re.findall(pattern, txt)
print(match)

['15-08-2023', '23-09-2023', '19092023']
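
Once dates are found, named groups make it easy to pull out the individual fields. The following is a sketch that assumes the hyphenated DD-MM-YYYY form only; `date_re` is an illustrative name.

```python
import re
from datetime import date

txt = "The event will take place on 15-08-2023 and 23-09-2023"

# (?P<name>...) gives each group a name that can be used instead of a number.
date_re = re.compile(r"(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})")

dates = [
    date(int(m["year"]), int(m["month"]), int(m["day"]))
    for m in date_re.finditer(txt)
]

print(dates)  # [datetime.date(2023, 8, 15), datetime.date(2023, 9, 23)]
```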


In [329]:
txt = "Visit http://www.example.pk or https://website.com.pk for more information."
pattern = r"https?://[\w.]+\.\w{2,}"
match = re.findall(pattern, txt)
print(match)

['http://www.example.pk', 'https://website.com.pk']


In [363]:
txt = """my email id is atif.salam@gmail.com and my friend's email id is koki_1978@yahoo.com

and his friend's id is atifsalam@hotmail.com and his friend's id is 1978_atif@outlook.com

"""
pattern = r"[\w.]+@\w+\.\w+"
match = re.findall(pattern, txt)
print(match)

['atif.salam@gmail.com', 'koki_1978@yahoo.com', 'atifsalam@hotmail.com', '1978_atif@outlook.com']


In [379]:
txt = "The product costs PKR 1500, while the deluxe version is priced at Rs. 2500."
pattern = r"[\w\.]+\s[\d]+"
match = re.findall(pattern, txt)
print(match)


['PKR 1500', 'Rs. 2500']


In [398]:
txt = "کیا! آپ, یہاں؟"
pattern = r"[\u0600-\u06FF]+\b"
match = re.findall(pattern, txt)
print(match)

['کیا', 'آپ', 'یہاں']


In [485]:
txt = "Lahore, Karachi, Islamabad, and Peshawar are major cities of Pakistan."
pattern = r'\b[A-Z][a-z]+\b'
city = re.findall(pattern, txt)
for match in city:
    if match != 'Pakistan':
        print(match)

Lahore
Karachi
Islamabad
Peshawar


In [513]:
txt = """I saw a car with the number plate LEA-567 near the market 
and I have car number BJF 219 and
my wife's car number is BJK258"""

pattern = r"[A-Z\-]+\s?[\d]+"
match = re.findall(pattern, txt)
print(match)

['LEA-567', 'BJF 219', 'BJK258']


In [546]:
data = """
12:05:40 From Ali Asar To Everyone : PIAIC-187258
12:05:41 From Hasnain Munir To Everyone : PIAIC-187272
12:05:44 From Shaheer Baig To Everyone : PIAIC-169519
12:05:45 From Humera Naz To Everyone : PIAIC-173431
12:05:47 From Sadia Anwar To Everyone : PIAIC-180028
12:05:47 From Muhammad Mehroz To Everyone : PIAIC-131496
12:05:47 From Home To Everyone : PIAIC-172998
12:05:47 From Muhammad Sadullah = PIAIC-178950 To Everyone : PIAIC-178950
12:05:48 From Muhammad Uzair-177637 To Everyone : PIAIC-177637
12:05:48 From Taqwa Khaliq To Everyone : PIAIC-173701 TAQWA KHALIQ
12:05:48 From asghar ibraheem CNC-012105 To Everyone : CNC-012105
12:05:49 From Faisal Bashir-177802 To Everyone : PIAIC-177802
12:05:50 From Talha Munir Rana To Everyone : PIAIC-173761
12:05:52 From Abdul Qadar To Everyone : PIAIC-172941
12:05:57 From Dr. Bhagwan Das To Everyone : PIAIC-96879
12:05:57 From Uzair Ullah (PIAIC169405) To Everyone : PIAIC-169405
12:06:00 From Arshad To Everyone : PIAIC168092
12:06:01 From PIAIC188523 Subhan Ahmed To Everyone : PIAIC188523
12:06:04 From Taqwa Khaliq To Everyone : PIAIC 173701
12:06:10 From Ali Asar To Everyone : PIAIC-188720
12:06:23 From Mohammad Javed To Everyone : WhatsApp number?
12:06:28 From Gulshan Ali To Everyone : PIAIC-176719
12:06:29 From M Ali Asif Khan - PIAIC-57947 To Everyone : PIAIC-57947
12:06:33 From Usman Noor PIAIC-188401 To Everyone : PIAIC-188401
12:06:46 From Zia (piaic121514) To Everyone : thanks
12:06:49 From Sohaib Baseer Ahmad To Everyone : PIAIC67260
12:40:20 From Ali Asar To Everyone : PIAIC-187258
"""

pattern = r"(\d{2}:\d{2}:\d{2}) From (.+) To Everyone : (PIAIC ?-?\d{5,6})"
match = re.findall(pattern, data)
match

[('12:05:40', 'Ali Asar', 'PIAIC-187258'),
 ('12:05:41', 'Hasnain Munir', 'PIAIC-187272'),
 ('12:05:44', 'Shaheer Baig', 'PIAIC-169519'),
 ('12:05:45', 'Humera Naz', 'PIAIC-173431'),
 ('12:05:47', 'Sadia Anwar', 'PIAIC-180028'),
 ('12:05:47', 'Muhammad Mehroz', 'PIAIC-131496'),
 ('12:05:47', 'Home', 'PIAIC-172998'),
 ('12:05:47', 'Muhammad Sadullah = PIAIC-178950', 'PIAIC-178950'),
 ('12:05:48', 'Muhammad Uzair-177637', 'PIAIC-177637'),
 ('12:05:48', 'Taqwa Khaliq', 'PIAIC-173701'),
 ('12:05:49', 'Faisal Bashir-177802', 'PIAIC-177802'),
 ('12:05:50', 'Talha Munir Rana', 'PIAIC-173761'),
 ('12:05:52', 'Abdul Qadar', 'PIAIC-172941'),
 ('12:05:57', 'Dr. Bhagwan Das', 'PIAIC-96879'),
 ('12:05:57', 'Uzair Ullah (PIAIC169405)', 'PIAIC-169405'),
 ('12:06:00', 'Arshad', 'PIAIC168092'),
 ('12:06:01', 'PIAIC188523 Subhan Ahmed', 'PIAIC188523'),
 ('12:06:04', 'Taqwa Khaliq', 'PIAIC 173701'),
 ('12:06:10', 'Ali Asar', 'PIAIC-188720'),
 ('12:06:28', 'Gulshan Ali', 'PIAIC-176719'),
 ('12:06:29', 'M 

In [548]:
import pandas as pd

df = pd.DataFrame(match, columns=["Time", "Student Name", "Roll_No"])
df

| | Time | Student Name | Roll_No |
|---|---|---|---|
| 0 | 12:05:40 | Ali Asar | PIAIC-187258 |
| 1 | 12:05:41 | Hasnain Munir | PIAIC-187272 |
| 2 | 12:05:44 | Shaheer Baig | PIAIC-169519 |
| 3 | 12:05:45 | Humera Naz | PIAIC-173431 |
| 4 | 12:05:47 | Sadia Anwar | PIAIC-180028 |
| 5 | 12:05:47 | Muhammad Mehroz | PIAIC-131496 |
| 6 | 12:05:47 | Home | PIAIC-172998 |
| 7 | 12:05:47 | Muhammad Sadullah = PIAIC-178950 | PIAIC-178950 |
| 8 | 12:05:48 | Muhammad Uzair-177637 | PIAIC-177637 |
| 9 | 12:05:48 | Taqwa Khaliq | PIAIC-173701 |
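
The roll numbers in the chat log appear in three spellings (`PIAIC-187258`, `PIAIC168092`, `PIAIC 173701`). One possible way to normalise them while parsing is sketched below: it captures only the digits in a named group and rebuilds the roll number in a single form. The names `line_re`, `sample`, and `rows` are illustrative, and `sample` is a four-line excerpt of the log.

```python
import re

# Named groups for each field; [ -]? allows a space, a hyphen, or nothing.
line_re = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}) From (?P<name>.+) To Everyone : "
    r"PIAIC[ -]?(?P<number>\d{5,6})"
)

sample = """12:05:40 From Ali Asar To Everyone : PIAIC-187258
12:06:00 From Arshad To Everyone : PIAIC168092
12:06:04 From Taqwa Khaliq To Everyone : PIAIC 173701
12:06:23 From Mohammad Javed To Everyone : WhatsApp number?"""

# Rebuild every roll number as PIAIC-<digits>; lines without one are skipped.
rows = [
    (m["time"], m["name"], f"PIAIC-{m['number']}")
    for m in line_re.finditer(sample)
]

for row in rows:
    print(row)
```

The resulting list of tuples can be passed to `pd.DataFrame` in the same way as the `findall()` result above.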
