# 1. Character Class

A character class defines a set of characters and matches exactly one character from that set. We use character classes in regex when we want a
single position in the text to accept any one of several characters, without writing out every alternative separately. To define a character
class, we enclose the set of characters within square brackets []. A hyphen between two characters, as in a-z, denotes a range.

Here are some character classes and what they match:

[a-z]: Matches any lowercase letter

[A-Z]: Matches any uppercase letter

[0-9]: Matches any digit

[a-zA-Z]: Matches any letter (both lowercase and uppercase)

[a-zA-Z0-9]: Matches any letter or digit
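
Because a character class consumes exactly one character, it is usually paired with a quantifier such as + to capture a whole run. The following is a minimal sketch on a hard-coded ID string; the value is an assumption chosen for illustration, and the variable names are hypothetical.

```python
import re

# Hypothetical sample value, hard-coded for illustration.
sample_id = "txt145"

# Without a quantifier, each match is a single character from the set.
single_digits = re.findall(r"[0-9]", sample_id)

# With +, consecutive characters from the set are grouped into one match.
digit_runs = re.findall(r"[0-9]+", sample_id)

print(single_digits)  # ['1', '4', '5']
print(digit_runs)     # ['145']
```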

In [1]:
# import necessary libraries

import re
import pandas as pd

In [2]:
# Read the necessary dataset

df = pd.read_csv("C:/Users/ariji/OneDrive/Desktop/Data/reviews.csv")
df.head()

Unnamed: 0,review_id,text
0,txt145,The software had a steep learning curve at fir...
1,txt327,I'm really impressed with the user interface o...
2,txt209,The latest update to the software fixed severa...
3,txt825,I encountered a few glitches while using the s...
4,txt878,I was skeptical about trying the software init...


In [6]:
matches = re.findall(r"[0-9]+",df.review_id.iloc[0])
matches

['145']

In [7]:
matches = re.findall(r"[a-z]+",df.review_id.iloc[0])
matches

['txt']

In [8]:
matches = re.findall(r"[A-Z]+",df.review_id.iloc[0])
matches

[]

In [9]:
matches = re.findall(r"[a-zA-Z]+",df.review_id.iloc[0])
matches

['txt']

In [10]:
matches = re.findall(r"[a-zA-Z0-9]+",df.review_id.iloc[0])
matches

['txt145']

# 2. Metacharacters

Metacharacters are characters with special meanings, and we use them when we want to search, extract, or manipulate text data based on specific
patterns or rules. For example, let’s say we have a text document containing a list of email addresses, one per line, and we want to extract all the
addresses that end with @hello.com. We can use the pattern .*@hello\.com. Here, the dot (.) matches any single character except a newline, and the
asterisk (*) repeats it zero or more times, so .* matches any run of characters (including none) before the @ symbol. The \. sequence matches a
literal dot: the backslash is required because an unescaped dot is itself a metacharacter that matches any character. Finally, com matches those
three letters literally.
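
The email pattern described above can be tried on a small input. This is only a sketch: the addresses are invented, and the tighter `[\w.]+` alternative is an assumed, simplified definition of the part before the @ symbol, not a full email validator.

```python
import re

# Hypothetical input: the addresses are invented for illustration.
one_per_line = "alice@hello.com\nbob@example.org\ncarol.smith@hello.com"

# The dot does not match a newline by default, so each match stays on its own line.
line_matches = re.findall(r".*@hello\.com", one_per_line)
print(line_matches)  # ['alice@hello.com', 'carol.smith@hello.com']

# In running prose, the greedy .* also captures the words before the address.
sentence = "Please write to alice@hello.com today"
greedy_match = re.findall(r".*@hello\.com", sentence)
print(greedy_match)  # ['Please write to alice@hello.com']

# A character class for the local part (an assumed simplification) is tighter.
tight_match = re.findall(r"[\w.]+@hello\.com", sentence)
print(tight_match)  # ['alice@hello.com']
```

The second result shows why .* is best reserved for input where each address sits on its own line.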

Some common metacharacters:

Dot (.): This metacharacter matches exactly one character (any character except a newline, by default). For example, c.t would match any three-character sequence beginning with c and ending with t, such as cat, cut, or c9t.

Asterisk, aka Kleene star (*): This metacharacter matches zero or more occurrences of the preceding element. For example, a* would match zero or more 
occurrences of the letter a.

Plus symbol, aka Kleene plus (+): This metacharacter matches one or more occurrences of the preceding element. For example, a+ would match one or more
occurrences of the letter a.

Question mark (?): This metacharacter matches zero or one occurrence of the preceding element. For example, a? would match zero or one occurrence of
the letter a.

Caret (^): We use this metacharacter as the beginning-of-line anchor. It asserts that the following pattern must start at the beginning of the input
(or of each line, when the re.MULTILINE flag is set). For example, the ^hello pattern will only match if the input text starts with hello.

Dollar sign ($): We use this metacharacter as the end-of-line anchor. It asserts that the preceding pattern must end at the end of the input (or of
each line, when the re.MULTILINE flag is set). For example, the world$ pattern will only match if the input text ends with world.

Pipe (|): This metacharacter separates alternatives, allowing us to match any one of a set of possible patterns, effectively creating a logical OR
operation. For example, th(e|is|at) matches this and the in this is the day.
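
The anchors and the pipe can be checked on short strings. The sketch below uses made-up inputs and hypothetical variable names; it also shows a detail of re.findall that matters for the th(e|is|at) example: when the pattern contains a capturing group, re.findall returns the group's text rather than the whole match.

```python
import re

sample = "hello world"

starts_with_hello = re.findall(r"^hello", sample)
ends_with_world = re.findall(r"world$", sample)
starts_with_world = re.findall(r"^world", sample)  # world is not at the start

print(starts_with_hello)  # ['hello']
print(ends_with_world)    # ['world']
print(starts_with_world)  # []

phrase = "this is the day"

# With a capturing group, re.findall returns only the group's text.
captured = re.findall(r"th(e|is|at)", phrase)
print(captured)  # ['is', 'e']

# A non-capturing group (?:...) returns the whole match instead.
whole = re.findall(r"th(?:e|is|at)", phrase)
print(whole)  # ['this', 'the']
```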


In [16]:
print(df['text'])
text = " ".join(df['text'])
print(text)

0     The software had a steep learning curve at fir...
1     I'm really impressed with the user interface o...
2     The latest update to the software fixed severa...
3     I encountered a few glitches while using the s...
4     I was skeptical about trying the software init...
5     The analytics features have provided us with v...
6     I appreciate the regular updates that the soft...
7     I attended a training session for the software...
8     The software documentation could be more compr...
9     I've recommended the software to colleagues du...
10    The software integration with third-party plug...
11    I'm looking forward to the upcoming release of...
12    The user community is active and supportive, m...
13    I've been using the software for a while now, ...
14    The user interface could use some modernizatio...
15    I went for a run and the software did a good j...
Name: text, dtype: object
The software had a steep learning curve at first, but after a while, I started

In [17]:

result_dot = re.findall(r'b.t', text)
result_star = re.findall(r'competitors.*', text)
result_plus = re.findall(r'soft.+', text)
result_question = re.findall(r'm.?', text)
result_pipe = re.findall(r'software|customer', text)

In [19]:
print("Dot metacharacter: 'b.t'", result_dot) 
print("\nStar metacharacter: 'competitors.*'", result_star) 
print("\nPlus metacharacter: 'soft.+'", result_plus) 
print("\nQuestion mark metacharacter: 'm.?'", result_question)  
print("\nPipe metacharacter: 'software|customer'", result_pipe)

Dot metacharacter: 'b.t' ['but', 'but', 'but']

Star metacharacter: 'competitors.*' ['competitors. I went for a run and the software did a good job of mapping the route.']

Plus metacharacter: 'soft.+' ["software had a steep learning curve at first, but after a while, I started to appreciate its powerful features. I'm really impressed with the user interface of the software. It's intuitive and easy to navigate. The latest update to the software fixed several bugs and improved its overall performance. I encountered a few glitches while using the software, but the customer support was quick to help me resolve them. I was skeptical about trying the software initially, but it turned out to be a game-changer for our productivity. The analytics features have provided us with valuable insights that have guided our decision-making. I appreciate the regular updates that the software receives, as they often bring new and useful features. I attended a training session for the software, and it gre

# 3. Quantifiers

Quantifiers are metacharacters that specify how many times the preceding character, group, or character class must occur. We use quantifiers to
match a pattern that occurs a specific number of times or within a range of times, from zero to many.

Common quantifiers include:

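The dot metacharacter (.), which is not strictly a quantifier: it matches exactly one character and is listed here because it is often combined with the quantifiers below, as in .* or .+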

The asterisk metacharacter (*)

The plus metacharacter (+)

The question mark metacharacter (?)

{n}, which matches precisely n occurrences of the preceding element, e.g., a{3} would match exactly three occurrences of the letter a

{n,}, which matches at least n occurrences of the preceding element, e.g., a{3,} would match at least three occurrences of the letter a

{,m}, which matches at most m occurrences of the preceding element, e.g., a{,3} would match at most three occurrences of the letter a

{n,m}, which matches between n and m occurrences (inclusive) of the preceding element, e.g., a{2,5} would match between two and five occurrences of 
the letter a
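
The brace quantifiers are easiest to compare on a string of known run lengths. The following sketch uses a made-up test string and hypothetical variable names; the \b word boundaries are added so that each run of the letter a is matched as a whole.

```python
import re

# Hypothetical test string: runs of the letter a with lengths 1, 2, 3, 4, and 6.
runs = "a aa aaa aaaa aaaaaa"

exactly_three = re.findall(r"\ba{3}\b", runs)
at_least_three = re.findall(r"\ba{3,}\b", runs)
one_to_three = re.findall(r"\ba{1,3}\b", runs)
two_to_five = re.findall(r"\ba{2,5}\b", runs)

print(exactly_three)   # ['aaa']
print(at_least_three)  # ['aaa', 'aaaa', 'aaaaaa']
print(one_to_three)    # ['a', 'aa', 'aaa']
print(two_to_five)     # ['aa', 'aaa', 'aaaa']

# {,3} is shorthand for {0,3}, so it also accepts the empty string.
empty_ok = re.fullmatch(r"a{,3}", "") is not None
print(empty_ok)  # True
```

The lower bound of 1 in {1,3} is a deliberate choice here: with {,3}, re.findall would also report empty matches.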



In [28]:
"""
The t..t pattern searches for any occurrence of a t followed by two characters and then another t
The t[a-z]*t pattern searches for any occurrence of a t followed by zero or more lowercase alphabetic characters and then another t
The s[a-z]+ pattern searches for any occurrence of an lowercase s followed by one or more lowercase alphabetic characters.
The regular expression regexp\? searches for the literal string 'regexp?' in the text variable. The question mark (?) is a special character in regular
expressions, and it’s escaped with a backslash (\) to be treated as a literal character rather than a metacharacter
The [a-zA-Z]{5} pattern specifies that it should match exactly five letters. The \b before and after the pattern indicates a word boundary, ensuring 
that the regular expression matches whole words. The re.findall() function returns a list containing all such five-letter words.
The [a-zA-Z]{3,} pattern indicates that it should match three or more letters. The \b before and after the pattern ensures that the whole word matches.
The [a-zA-Z]{,4} pattern indicates that it should match zero to four letters. The \b before and after the pattern ensures that the whole word matches.
The pattern s[a-zA-Z]{1,5} specifies that the word should start with “s” and be followed by one to five letters. The \b before and after the pattern 
ensures that the whole word matches.
The re.findall() function returns a list containing all such words meeting the defined criteria

"""

result_dot = re.findall(r't..t', text)
result_star = re.findall(r't[a-z]*t', text)
result_plus = re.findall(r's[a-z]+', text)
result_question = re.findall(r'regexp\?', text) 
result_n = re.findall(r'\b[a-zA-Z]{5}\b', text)
result_n_min = re.findall(r'\b[a-zA-Z]{3,}\b', text)
result_m_max = re.findall(r'\b[a-zA-Z]{,4}\b', text)
result_n_m = re.findall(r'\bs[a-zA-Z]{1,5}\b', text)

In [29]:
print("Dot quantifier: 't..t'\n", result_dot) 
print("\nStar quantifier: 't[a-z]*t'\n", result_star) 
print("\nPlus quantifier: 's[a-z]+'\n", result_plus) 
print("\nQuestion mark quantifier: 'regexp?'\n", result_question)
print("\n{n} quantifier: '[a-zA-Z]{5}'\n", result_n)
print("\n{n,} quantifier: '[a-zA-Z]{3,}'\n", result_n_min)
print("\n{,m} quantifier: '[a-zA-Z]{,4}'\n", result_m_max)
print("\n{n,m} quantifier: 's[a-zA-Z]{1,5}'\n", result_n_m)

Dot quantifier: 't..t'
 ['tart', 'th t', 'tuit', 'test', 'te t', 't it', 'ts t', 'te t', 'that', 'th t', 'to t', 'to t', 'tent', 'tdat']

Star quantifier: 't[a-z]*t'
 ['tart', 'tuit', 'test', 'tivit', 'that', 'that', 'tt', 'tionalit', 'tat', 'tegrat', 'tionalit', 'tat', 'troubleshoot', 'tent', 'tabilit', 'tdat', 'tit']

Plus quantifier: 's[a-z]+'
 ['software', 'steep', 'st', 'started', 'ssed', 'ser', 'software', 'sy', 'st', 'software', 'several', 'sing', 'software', 'stomer', 'support', 'solve', 'skeptical', 'software', 'sights', 'sion', 'software', 'seful', 'session', 'software', 'standing', 'software', 'sive', 'some', 'software', 'software', 'satile', 'se', 'software', 'ses', 'ss', 'some', 'ser', 'supportive', 'sier', 'shoot', 'ssues', 'share', 'sights', 'sing', 'software', 'sistently', 'ssed', 'stability', 'ser', 'se', 'some', 'somewhat', 'software']

Question mark quantifier: 'regexp?'
 []

{n} quantifier: '[a-zA-Z]{5}'
 ['steep', 'curve', 'first', 'after', 'while', 'fixed', 'while

# 4. Shorthand character classes

Shorthand character classes are predefined character classes that stand for commonly used character sets, so we don't have to list every character
explicitly. For example, we can use the \d shorthand character class to represent any digit character instead of writing [0-9].

Other shorthand character classes include:

\D: Matches a non-digit

\W: Matches non-word characters (anything other than a letter, digit, or underscore)

\w: Matches word characters (letters, digits, and the underscore)

\S: Matches non-whitespace characters

\s: Matches whitespace characters
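
A short string with one character of each kind makes the shorthand classes easy to compare. This is a minimal sketch on made-up input with hypothetical variable names; note that the underscore counts as a word character.

```python
import re

# Hypothetical sample: two letters, an underscore, a digit, a space, two letters, a newline.
sample = "ID_7 ok\n"

digits = re.findall(r"\d", sample)
word_chars = re.findall(r"\w", sample)
non_word_chars = re.findall(r"\W", sample)
whitespace = re.findall(r"\s", sample)

print(digits)          # ['7']
print(word_chars)      # ['I', 'D', '_', '7', 'o', 'k']
print(non_word_chars)  # [' ', '\n']
print(whitespace)      # [' ', '\n']

# Every character is either a word character or a non-word character.
print(len(word_chars) + len(non_word_chars) == len(sample))  # True
```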

In [31]:
"""
re.findall("\D", text): This finds all non-digit characters in the text string (i.e., any character that’s not a digit). The result is stored in the 
non_digits list.
re.findall("\d", text): This finds all digit characters in the text string. The result is stored in the digits list.
re.findall("\W", text): This finds all non-word characters in the text string. A non-word character is anything that’s not a word character, meaning it 
includes symbols and punctuation but excludes digits and underscores. The result is stored in the non_words list.
re.findall("\w", text): This finds all word characters in the text string. A word character includes letters, digits, and underscores. The result is 
stored in the words list.
re.findall("\S", text): This finds all non-space characters in the text string. A non-space character is anything other than a whitespace character. 
The result is stored in the non_spaces list.
re.findall("\s", text): This finds all space characters in the text string. A space character includes spaces, tabs, and line breaks. The result is
stored in the spaces list.

"""

non_digits_count = len(re.findall("\D", text))
digits_count = len(re.findall("\d", text))
non_words_count = len(re.findall("\W", text))
words_count = len(re.findall("\w", text))
spaces_count = len(re.findall("\s", text)) 

In [32]:
print("Non-digits count:", non_digits_count)
print("Digits count:", digits_count)
print("Non-words count:", non_words_count)
print("Words count:", words_count)
print("Spaces count:", spaces_count) 

Non-digits count: 1678
Digits count: 0
Non-words count: 300
Words count: 1378
Spaces count: 263


# 5. Escape Sequence

An escape sequence is a backslash followed by one or more characters; together they represent either a literal character or a special character
such as a newline. We refer to them as escape sequences because the backslash lets the next character escape its usual interpretation. For example,
if we want to match a period (.) in a text, we use the escape sequence \. because an unescaped period is a metacharacter that matches any character.

Examples of other escape sequences include:

\n: Matches a newline

\t: Matches a tab

\r: Matches a carriage return

\f: Matches a form feed

\b: Matches a word boundary in a regex pattern; it matches a backspace only inside a character class, as in [\b]
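
The effect of escaping is easiest to see by running the same pattern with and without the backslash. The sketch below uses a made-up string and hypothetical variable names. It also shows the two meanings of \b in Python's re module: a word boundary in a pattern, and a backspace inside a character class.

```python
import re

# Hypothetical sample containing a literal dot, a tab, and a newline.
sample = "v1.0\tfinal\nv2x0"

any_char = re.findall(r"v\d.\d", sample)      # unescaped dot: any character
literal_dot = re.findall(r"v\d\.\d", sample)  # escaped dot: a literal period only
tabs = re.findall(r"\t", sample)
newlines = re.findall(r"\n", sample)

print(any_char)     # ['v1.0', 'v2x0']
print(literal_dot)  # ['v1.0']
print(tabs)         # ['\t']
print(newlines)     # ['\n']

# Outside a character class, \b is a word boundary; inside one, it is a backspace.
whole_word = re.findall(r"\bfinal\b", sample)
backspaces = re.findall(r"[\b]", "a\bb")
print(whole_word)  # ['final']
print(backspaces)  # ['\x08']
```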

In [34]:
re.findall("\n", text)

[]
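
The empty result is expected: the reviews were joined with a single space, so the combined string contains no newline characters (assuming none of the individual reviews contains one). A minimal sketch with two made-up reviews shows the difference the separator makes; the list and variable names are hypothetical.

```python
import re

# Hypothetical stand-ins for the review texts.
reviews = ["First review.", "Second review."]

joined_with_space = " ".join(reviews)
joined_with_newline = "\n".join(reviews)

space_hits = re.findall(r"\n", joined_with_space)
newline_hits = re.findall(r"\n", joined_with_newline)

print(space_hits)    # []
print(newline_hits)  # ['\n']
```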