---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>
<h1 align="center">Course: Tools and Techniques for Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<h1 align="center">Lecture 4.3</h1>

## _Regular Expressions in NLP.ipynb_

## Learning agenda of this notebook

1. Overview of Regular Expressions (Recap)
2. Modifying Strings
    1. `re.split()` method in Regex
    2. Limit the number of splits
    3. Regex to Split string with multiple delimiters
    4. Split strings by delimiters and specific word
    5. Regex split a string and keep the separators
3. Replace Pattern in a string using re.sub() method
    1. `re.sub()` method in Regex
    2. Regex example to replace all whitespace with an underscore
    3. Regex to remove whitespaces from a string
    4. Regex to remove leading Spaces from a string
    5. Regex to remove both leading and trailing spaces

## Overview of Regular Expressions (Recap)
- Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. 


- Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain `English sentences`, or `e-mail addresses`, or `TeX commands`, or `anything you like`. 


- You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to `modify a string` or to `split` it apart in various ways.

## 1. Modifying Strings
- Up to this point, we’ve simply performed searches against a static string. Regular expressions are also commonly used to modify strings in various ways, using the following pattern methods:

        - split(): Split the string into a list, splitting it wherever the RE matches
        - sub(): Find all substrings where the RE matches, and replace them with a different string
        - subn(): Does the same thing as sub(), but returns the new string and the number of replacements
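
As a quick orientation, the three methods listed above can be compared side by side. This is a minimal sketch; the sample string and the variable names are illustrative assumptions, not part of the lecture's own examples.

```python
import re

# Hypothetical sample text used only for this comparison
text = "cat bat cat hat"

# re.split(): cut the string wherever the pattern matches
pieces = re.split(r"\s+", text)

# re.sub(): replace every match and return the new string
replaced = re.sub(r"cat", "dog", text)

# re.subn(): same as sub(), but returns a (new_string, count) tuple
replaced_n, n_subs = re.subn(r"cat", "dog", text)

print(pieces)              # ['cat', 'bat', 'cat', 'hat']
print(replaced)            # dog bat dog hat
print(replaced_n, n_subs)  # dog bat dog hat 2
```

`subn()` is convenient when you need to know whether anything was actually replaced, for example to log how many tokens a cleaning step touched.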

### `re.split()` method in Regex
- The `split()` method of a pattern splits a string apart wherever the RE matches, returning a list of the pieces. It’s similar to the split() method of strings but provides much more generality in the delimiters that you can split by; string split() only supports splitting by whitespace or by a fixed string.

#### <center> re.split(pattern, string, maxsplit=0) </center>

        - pattern: The regular expression pattern used to split the target string.
        - string: The target string (i.e., the string we want to split).
        - maxsplit: The maximum number of splits to perform. The default, 0, means no limit. If maxsplit is 2, 
            at most two splits occur, and the remainder of the string is returned as the final element of the list.
    
    
    
- It splits the target string wherever the regular expression pattern matches and returns the pieces between the matches as a list.


- If the pattern is not found in the target string, the string is not split, but `re.split()` still returns a list. That list contains a single element: the target string itself.

In [2]:
# importing required libraries
import re

# defining string
target_string = "My name is Arif Butt and my lucky numbers are 12 45 78"


# using re.split() method
# defining a pattern that splits the string on each run of one or more whitespace characters
word_list = re.split(r"\s+", target_string)

# print the list
print(word_list)

['My', 'name', 'is', 'Arif', 'Butt', 'and', 'my', 'lucky', 'numbers', 'are', '12', '45', '78']
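
The no-match case described earlier can be checked directly. The sketch below uses a hypothetical sample string that contains no digits, so a digit pattern never matches.

```python
import re

# Hypothetical sample string containing no digits
sample = "no digits here"

# \d+ never matches, so the whole string comes back as the only list element
result = re.split(r"\d+", sample)

print(result)       # ['no digits here']
print(len(result))  # 1
```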


### Limit the number of splits
The `maxsplit` parameter of `re.split()` sets the maximum number of splits to perform. For example, if `maxsplit` is 2, at most two splits are made, and the remainder of the string is returned as the final element of the list.

In [4]:
# importing required libraries
import re

# defining string
target_string = "12-45-78"


# let’s take a simple example to split a string on the occurrence of any non-digit. 
# Here we will use the \D special sequence that matches any non-digit character.
# Split only on the first occurrence (maxsplit is 1)
result = re.split(r"\D", target_string, maxsplit=1)
print(result)

# Split on the first two occurrences (maxsplit is 2)
result = re.split(r"\D", target_string, maxsplit=2)
print(result)


['12', '45-78']
['12', '45', '78']


### Regex to Split string with multiple delimiters
- The regex `split()` method is more flexible than the string `split()` method: the delimiter is a pattern, so a single call can split on several different delimiters, whereas the string method accepts only one fixed separator (or runs of whitespace).


- For example, using `re.split()` we can split a string on either a `comma` or a `hyphen`.

In [5]:
# importing required libraries
import re

# defining string
target_string = "12,45,78,85-17-89"

# splitting on two delimiters: - and ,
# use the OR (|) operator to combine the two patterns
result = re.split(r"-|,", target_string)

# print list
print(result)

['12', '45', '78', '85', '17', '89']
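
To make the contrast with the string `split()` method concrete, here is a small sketch on a hypothetical record that mixes three kinds of delimiters. The sample string and variable names are assumptions made for illustration.

```python
import re

# Hypothetical record mixing a comma, a semicolon, and runs of spaces
record = "apple, banana;cherry   date"

# str.split() accepts one fixed separator, so only ", " is recognised
by_str = record.split(", ")

# re.split() accepts a character class, so every delimiter is handled in one call
by_regex = re.split(r"[,;\s]+", record)

print(by_str)    # ['apple', 'banana;cherry   date']
print(by_regex)  # ['apple', 'banana', 'cherry', 'date']
```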


### Split strings by delimiters and specific word

In [6]:
# importing required libraries
import re

# defining string
text = "12, and45,78and85-17and89-97"

# split by the word 'and', whitespace, comma, and hyphen
# the pattern is: and | one or more characters from the set whitespace, comma, hyphen
result = re.split(r"and|[\s,-]+", text)

# print list
print(result)

['12', '', '45', '78', '85', '17', '89', '97']
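
The empty string in the output appears because `", "` and `"and"` match as two separate, adjacent separators, and nothing lies between them. The sketch below shows two possible ways to avoid it; the variable names are illustrative, and which option is preferable depends on whether empty fields carry meaning in your data.

```python
import re

text = "12, and45,78and85-17and89-97"

# Same pattern as above: ", " and "and" match back to back,
# leaving an empty string between the two matches
raw = re.split(r"and|[\s,-]+", text)

# Option 1: drop the empty strings after splitting
filtered = [piece for piece in raw if piece]

# Option 2: treat any run of delimiters and 'and' as a single separator
merged = re.split(r"(?:and|[\s,-])+", text)

print(raw)       # ['12', '', '45', '78', '85', '17', '89', '97']
print(filtered)  # ['12', '45', '78', '85', '17', '89', '97']
print(merged)    # ['12', '45', '78', '85', '17', '89', '97']
```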


### Regex split a string and keep the separators

In [7]:
# importing required libraries
import re

# defining string
target_string = "12-45-78"


# Split the string on runs of non-digit characters.
# \D matches any non-digit character, and + extends the match to one or more of them.
# Wrapping the pattern in parentheses (a capturing group) keeps the separators in the result.
result = re.split(r'(\D+)', target_string)

# print list
print(result)


['12', '-', '45', '-', '78']
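
The effect of the capturing group is easiest to see by running the same pattern with and without parentheses. A minimal sketch, with illustrative variable names:

```python
import re

target_string = "12-45-78"

# Without a capturing group the separators are discarded
without_group = re.split(r"\D+", target_string)

# With a capturing group the separators are kept as list elements
with_group = re.split(r"(\D+)", target_string)

print(without_group)  # ['12', '45', '78']
print(with_group)     # ['12', '-', '45', '-', '78']

# Because nothing is lost, joining the pieces rebuilds the original string
print("".join(with_group) == target_string)  # True
```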


## 2. Replace Pattern in a string using `re.sub()` method
- Python's `re` module offers the `sub()` and `subn()` methods to `search` and `replace` patterns in a string. Using these methods we can replace one or more occurrences of a regex pattern in the target string with a substitute string.

        - re.sub(pattern, replacement, string): Finds and replaces all occurrences of pattern with replacement
        
        - re.sub(pattern, replacement, string, count=1): Finds and replaces only the first occurrence of pattern 
          with replacement
          
        - re.sub(pattern, replacement, string, count=n): Finds and replaces the first n occurrences of pattern with 
          the replacement

### `re.sub()` method in Regex
#### <center> re.sub(pattern, replacement, string, count=0) </center>

- `pattern`: The regular expression pattern to find inside the target string.


- `replacement`: The replacement that we are going to insert for each occurrence of a pattern. The replacement can be a string or function.


- `string`: The variable pointing to the target string (In which we want to perform the replacement).


- `count`: Maximum number of pattern occurrences to be replaced. If specified, it must be a non-negative integer. The default is 0, which means all occurrences are replaced.


- It returns the string obtained by replacing the pattern occurrences in the string with the replacement string. If the pattern isn’t found, the string is returned unchanged.
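
The `count` parameter and the function form of `replacement` can be sketched as follows. The sample string and the helper `double_number` are hypothetical names introduced only for this illustration.

```python
import re

# Hypothetical sample text with three numbers
text = "a1 b22 c333"

# count=1 replaces only the first match
first_only = re.sub(r"\d+", "#", text, count=1)

# count=0 (the default) replaces every match
all_matches = re.sub(r"\d+", "#", text)

# A function replacement is called once per match with the Match object
# and must return the replacement string
def double_number(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double_number, text)

print(first_only)   # a# b22 c333
print(all_matches)  # a# b# c#
print(doubled)      # a2 b44 c666
```

A function replacement is useful when the new text depends on what was matched, which a fixed replacement string cannot express.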

### Regex example to replace all whitespace with an underscore

In [8]:
# importing required libraries
import re

# defining string
target_str = "Learning is fun with Arif Butt"

# \s matches a single whitespace character; each match is replaced with _ in the target string
res_str = re.sub(r"\s", "_", target_str)

# Print String after replacement
print(res_str)

Learning_is_fun_with_Arif_Butt
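
Note that `\s` replaces each whitespace character individually. When the text may contain consecutive whitespace, `\s+` gives one underscore per run instead. A small sketch on a hypothetical string:

```python
import re

# Hypothetical string with two spaces followed later by a tab
spaced = "Learning  is\tfun"

# \s: every whitespace character becomes its own underscore
one_each = re.sub(r"\s", "_", spaced)

# \s+: each run of whitespace becomes a single underscore
one_per_run = re.sub(r"\s+", "_", spaced)

print(one_each)     # Learning__is_fun
print(one_per_run)  # Learning_is_fun
```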


### Regex to remove whitespaces from a string

In [9]:
# importing required libraries
import re

# defining string
target_str = "Learning is fun with Arif Butt"

# using \s+ to remove all whitespace
# + indicates one or more occurrences of a whitespace character
res_str = re.sub(r"\s+", "", target_str)

# String after replacement
print(res_str)

LearningisfunwithArifButt


### Regex to remove leading Spaces from a string

In [10]:
# importing required libraries
import re

# defining string
target_str = "   Learning is fun with Arif Butt"

# ^\s+ removes only leading whitespace
# caret (^) matches only at the start of the string
res_str = re.sub(r"^\s+", "", target_str)

# String after replacement
print(res_str)

Learning is fun with Arif Butt


### Regex to remove both leading and trailing spaces

In [11]:
# importing required libraries
import re

# defining string
target_str = "   Learning is fun with Arif Butt  \t"

# ^\s+ removes leading whitespace
# \s+$ removes trailing whitespace
# the | operator combines both patterns
res_str = re.sub(r"^\s+|\s+$", "", target_str)

# String after replacement
print(res_str)

Learning is fun with Arif Butt
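
For this particular task the built-in `str.strip()` gives the same result, so the regex version is mainly useful as a building block for richer cleaning. The sketch below compares the two and then combines `re.sub()` with `strip()` to collapse internal whitespace as well; the `messy` string is a hypothetical example.

```python
import re

target_str = "   Learning is fun with Arif Butt  \t"

# Regex version from the cell above
via_regex = re.sub(r"^\s+|\s+$", "", target_str)

# Built-in equivalent for leading and trailing whitespace
via_strip = target_str.strip()

print(via_regex == via_strip)  # True

# Hypothetical messy input: collapse internal runs of whitespace, then trim the ends
messy = "  Learning   is \t fun  "
normalized = re.sub(r"\s+", " ", messy).strip()

print(normalized)  # Learning is fun
```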
