# Regular Expressions

Regular expressions, commonly known as regex, are a handy tool for finding patterns in text data. More specifically, a regular expression is a
sequence of characters that defines a search pattern, which can be used to match or replace specific parts of a string. We use regular expressions
in text preprocessing tasks such as cleaning, tokenization, and pattern recognition. Because they can process large volumes of text quickly and
consistently, they are an essential tool for NLP and data science applications.
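
As a minimal sketch of "match" and "replace", the snippet below uses Python's built-in `re` module on a made-up sentence. The sample string and the variable names are assumptions for illustration only.

```python
import re

# Made-up sample text for illustration
text = "Order 66 shipped on 2024-01-15."

# re.search returns the first match object, or None if nothing matches
match = re.search(r"\d{4}-\d{2}-\d{2}", text)
first_date = match.group() if match else None

# re.sub replaces every match of the pattern with the replacement string
masked = re.sub(r"\d", "#", text)

print(first_date)  # 2024-01-15
print(masked)      # Order ## shipped on ####-##-##.
```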

Applications of regex:

We can use regex to carry out various NLP tasks. For example, in tokenization, we use regular expressions to identify delimiters between words,
sentences, or paragraphs. In text cleaning, we use regex to remove unwanted characters, punctuation, or whitespace.
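
To illustrate the idea of delimiters between sentences, here is a small sketch that splits a made-up paragraph with `re.split()`. The sample text is an assumption, and the rule is deliberately naive: it would also split after abbreviations such as "Dr.".

```python
import re

# Made-up sample paragraph for illustration
paragraph = "The food was delicious. Would I go back? Absolutely! The staff were kind."

# Split on whitespace that follows ., ? or !
# The lookbehind keeps the punctuation attached to its sentence
sentences = re.split(r"(?<=[.?!])\s+", paragraph)

for sentence in sentences:
    print(sentence)
```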

Regular expressions are also instrumental in web scraping and data mining tasks, where we need to extract specific information from web pages or large 
datasets. Regex can identify patterns in web pages’ HTML or XML source code and extract relevant information such as email addresses, phone numbers, 
URLs, or other structured data. A great example is using regex to collect data on job openings from LinkedIn, Indeed, and other job sites.
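
The following sketch shows what such an extraction might look like on a small, made-up HTML fragment. The HTML, the addresses, and both patterns are assumptions for illustration. The patterns are simplified and will miss or over-match unusual emails and URLs, and for real pages an HTML parser is usually more robust than regex alone.

```python
import re

# Made-up HTML fragment standing in for a job listing page
html = '''
<ul>
  <li><a href="https://example.com/jobs/123">Data Analyst</a> contact: hr@example.com</li>
  <li><a href="https://example.org/careers?id=7">NLP Engineer</a> contact: jobs@example.org</li>
</ul>
'''

# Simplified email pattern: local part, @, domain, dot, top-level domain
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Capture the http(s) URL inside a double-quoted href attribute
URL_RE = re.compile(r'href="(https?://[^"]+)"')

emails = EMAIL_RE.findall(html)
urls = URL_RE.findall(html)

print(emails)
print(urls)
```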



## Regex in text processing

## 1. Tokenization

In the following example, we use the \w+ regular expression to match (or search for) one or more consecutive word characters
(letters, digits, and underscores), which we extract as tokens from the input text using the re.findall() function.


In [1]:
# import necessary libraries

import re
import pandas as pd

In [2]:
# Read the necessary dataset

df = pd.read_csv("C:/Users/ariji/OneDrive/Desktop/Data/movie_review.csv")
df.head()

Unnamed: 0,review_id,review_text,rating,author_name
0,1,I have to go to the store.,5,john smith
1,2,This product is amazing.,4,jane doe
2,3,This is the best movie I have ever seen.,5,alex johnson
3,4,The customer support was terrible.,2,emily thompson
4,5,The food was delicious.,4,michael brown


In [3]:
# define a tokenization function

"""
The \w+ regular expression matches one or more consecutive word characters. Word characters typically include letters (a–z, A–Z), digits (0–9), and 
underscores (_).
"""

def tokenize_text(text):
    return re.findall(r'\w+', text)


In [4]:
# Apply tokenize function

df['tokens'] = df['review_text'].apply(tokenize_text)

In [5]:
df.head()

Unnamed: 0,review_id,review_text,rating,author_name,tokens
0,1,I have to go to the store.,5,john smith,"[I, have, to, go, to, the, store]"
1,2,This product is amazing.,4,jane doe,"[This, product, is, amazing]"
2,3,This is the best movie I have ever seen.,5,alex johnson,"[This, is, the, best, movie, I, have, ever, seen]"
3,4,The customer support was terrible.,2,emily thompson,"[The, customer, support, was, terrible]"
4,5,The food was delicious.,4,michael brown,"[The, food, was, delicious]"
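
One limitation of `\w+` is that an apostrophe is not a word character, so contractions are split into two tokens. The sketch below compares `\w+` with one possible alternative pattern on a made-up sentence. The alternative is an assumption for illustration, not part of the notebook above, and it only handles the straight ASCII apostrophe.

```python
import re

# Made-up sentence containing contractions and an underscore
sample = "I don't think it's a 5_star movie."

# \w+ stops at the apostrophe, so "don't" becomes "don" and "t"
basic_tokens = re.findall(r"\w+", sample)

# Optionally allow one apostrophe followed by more word characters
contraction_tokens = re.findall(r"\w+(?:'\w+)?", sample)

print(basic_tokens)        # ['I', 'don', 't', 'think', 'it', 's', 'a', '5_star', 'movie']
print(contraction_tokens)  # ['I', "don't", 'think', "it's", 'a', '5_star', 'movie']
```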


## 2. Text cleaning and normalization

We can use regex to clean up a given text by removing extra spaces and converting all the text to lowercase, as shown in the following example.

In [6]:
# import necessary libraries

import re
import pandas as pd

In [7]:
# read the dataset and display the dataframe and the review_text column

df = pd.read_csv('C:/Users/ariji/OneDrive/Desktop/Data/movie_review.csv')
print(df)
print(df.review_text)

    review_id                                        review_text  rating  \
0           1                         I have to go to the store.       5   
1           2                           This product is amazing.       4   
2           3           This is the best movie I have ever seen.       5   
3           4                 The customer support was terrible.       2   
4           5                            The food was delicious.       4   
5           6        I had a wonderful experience at this hotel.       5   
6           7    The shopping experience was very disappointing.       1   
7           8             The spelling in this book is horrible.       3   
8           9       The performance of the actor was impressive.       4   
9          10             The product description is inaccurate.       2   
10         11         The weather today is absolutely beautiful.       5   
11         12         The new software update is causing issues.       2   
12         1

In [8]:
# Define a function to clean text

"""
We define the clean_text function, which converts the text to lowercase, replaces non-alphanumeric characters with spaces, and collapses repeated
spaces. The calls are nested, so they run from the inside out. First, text.lower() converts the text to lowercase. Next, the inner re.sub() call
replaces each run of non-alphanumeric characters with a single space using the [^0-9a-zA-Z]+ pattern. Finally, the outer re.sub() call replaces
multiple consecutive spaces with a single space using the ' +' pattern. The function returns the result as cleaned_text.
"""

def clean_text(text):
    cleaned_text = re.sub(' +', ' ', re.sub('[^0-9a-zA-Z]+', ' ', text.lower()))
    return cleaned_text


In [9]:
# apply clean_text function

df['cleaned_text'] = df['review_text'].apply(clean_text)
df.head()

Unnamed: 0,review_id,review_text,rating,author_name,cleaned_text
0,1,I have to go to the store.,5,john smith,i have to go to the store
1,2,This product is amazing.,4,jane doe,this product is amazing
2,3,This is the best movie I have ever seen.,5,alex johnson,this is the best movie i have ever seen
3,4,The customer support was terrible.,2,emily thompson,the customer support was terrible
4,5,The food was delicious.,4,michael brown,the food was delicious
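
One detail worth noting about clean_text: a review that ends in punctuation keeps a trailing space, because the final period is replaced by a space. The sketch below shows this on a made-up string and offers one possible variant. The name `clean_text_stripped` is hypothetical and not part of the notebook. Since the `[^0-9a-z]+` pattern already treats spaces as non-alphanumeric, the variant uses a single substitution.

```python
import re

def clean_text_stripped(text):
    # Lowercase first, then replace every run of non-alphanumeric
    # characters (spaces included) with one space
    cleaned = re.sub(r"[^0-9a-z]+", " ", text.lower())
    # Drop the space left where the text started or ended with punctuation
    return cleaned.strip()

# Made-up input with repeated spaces and a final period
raw = "The   food was delicious."

# Same nested calls as clean_text in the notebook, for comparison
original_style = re.sub(' +', ' ', re.sub('[^0-9a-zA-Z]+', ' ', raw.lower()))
stripped_style = clean_text_stripped(raw)

print(repr(original_style))  # 'the food was delicious '
print(repr(stripped_style))  # 'the food was delicious'
```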


## 3. Named Entity Recognition

We can also use regular expressions to extract known named entities (such as persons and locations) from the given text and store them in a
dictionary, as shown below. This approach only finds the names that we list explicitly in the patterns.

In [10]:
# Import necessary libraries

import re
import pandas as pd

In [12]:
# read the dataset and preview the dataframe

df = pd.read_csv('C:/Users/ariji/OneDrive/Desktop/Data/reviews_uk.csv')
df.head()

Unnamed: 0,review_id,text
0,txt1,"I recently visited London, and the British Mus..."
1,txt2,"While exploring Edinburgh, I had the chance to..."
2,txt3,"During my stay in Oxford, I attended lectures ..."
3,txt4,I watched a play at Shakespeare's Globe Theatr...
4,txt5,"My favorite British author is Charles Dickens,..."


In [13]:
df

Unnamed: 0,review_id,text
0,txt1,"I recently visited London, and the British Mus..."
1,txt2,"While exploring Edinburgh, I had the chance to..."
2,txt3,"During my stay in Oxford, I attended lectures ..."
3,txt4,I watched a play at Shakespeare's Globe Theatr...
4,txt5,"My favorite British author is Charles Dickens,..."
5,txt6,"I traveled to Manchester, and the ancient ston..."
6,txt7,"During my trip to Scotland, I explored the sce..."
7,txt8,I enjoyed a traditional afternoon tea in Winds...
8,txt9,I had a delightful fish and chips meal in a qu...
9,txt10,London's West End theaters offer some of the b...


In [27]:
"""
We create the patterns dictionary, which contains two key-value pairs. Each key represents a named entity type (PERSON and LOCATION), and each value 
is a regular expression pattern corresponding to the named entity type.
"""
patterns = {
    'PERSON': r'(Shakespeare|Charles Dickens|Jane Smith)',
    'LOCATION': r'(London|Edinburgh|Oxford|Manchester|Scotland)'
}

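In [29]:
"""
For each entity type, we apply re.findall() with the corresponding pattern to every row of the text column. We store the matches in a new dataframe
column named after the entity type and also collect them in the named_entities dictionary.
"""

named_entities = {}
for entity, pattern in patterns.items():
    # bind the current pattern as a default argument so the helper does not depend on the loop variable
    def find_entities(text, pattern=pattern):
        return re.findall(pattern, text)
    df[entity] = df['text'].apply(find_entities)
    named_entities[entity] = df[entity].tolist()
print(named_entities)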

{'PERSON': [[], [], [], ['Shakespeare'], ['Charles Dickens'], [], [], [], [], []], 'LOCATION': [['London'], ['Edinburgh', 'Edinburgh'], ['Oxford', 'Oxford'], ['London'], [], ['Manchester'], ['Scotland'], [], [], ['London']]}


In [30]:
named_entities

{'PERSON': [[],
  [],
  [],
  ['Shakespeare'],
  ['Charles Dickens'],
  [],
  [],
  [],
  [],
  []],
 'LOCATION': [['London'],
  ['Edinburgh', 'Edinburgh'],
  ['Oxford', 'Oxford'],
  ['London'],
  [],
  ['Manchester'],
  ['Scotland'],
  [],
  [],
  ['London']]}
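
A plain alternation such as `London` also matches inside longer words, and it does not report where a match occurs. The sketch below is one possible refinement, not the notebook's method. The entity lists, the helper names `build_pattern` and `tag_entities`, and the sample sentence are all assumptions for illustration. It adds `\b` word boundaries, escapes each name with `re.escape()`, and uses `finditer()` to record offsets.

```python
import re

# Hypothetical entity lists; in practice these would come from a curated list
entity_names = {
    "PERSON": ["Charles Dickens", "Shakespeare"],
    "LOCATION": ["London", "Edinburgh", "Oxford"],
}

def build_pattern(names):
    # Escape each name and put longer names first so they win over shorter prefixes
    alternatives = sorted((re.escape(n) for n in names), key=len, reverse=True)
    # \b requires a word boundary on both sides of the name
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b")

compiled = {label: build_pattern(names) for label, names in entity_names.items()}

def tag_entities(text):
    # Collect (label, matched text, start offset) triples, ordered by position
    found = []
    for label, pattern in compiled.items():
        for m in pattern.finditer(text):
            found.append((label, m.group(), m.start()))
    return sorted(found, key=lambda item: item[2])

# Made-up sentence: "Londonderry" should not be tagged as "London"
example = "Charles Dickens lived in London, not Londonderry."
tags = tag_entities(example)
print(tags)
```

Even with word boundaries, this remains a lookup of known names. The possessive in "Shakespeare's Globe Theatre", for example, would still be tagged as a PERSON.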