# Regular expressions
## Practice:
1. https://www.regexone.com/
2. https://regexr.com/
3. https://regex101.com/

## What's the purpose of regular expressions?

- Regular expressions are extensively used for:
    1. Extracting insights from the text -> `findall()`
    2. Cleaning the text -> `sub()`

- A regular expression is a group of characters or symbols used to find a specific pattern in a text.

- A regular expression is a pattern that is matched against a subject string from left to right. Regular expressions are used to replace text within a string, validate forms, extract a substring from a string based on a pattern match, and so much more. The term "regular expression" is a mouthful, so you will usually find the term abbreviated to "regex" or "regexp".

---

## **Regular Expressions (Regex) Overview**
### **Purpose of Regular Expressions**
Regular expressions are extensively used for:
1. **Extracting insights from text** → `findall()`
2. **Cleaning the text** → `sub()`

A **regular expression** is a pattern used to search for specific text patterns in a string. It allows us to:
- **Search** for patterns
- **Replace** text
- **Validate** form inputs
- **Extract** information from text
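
A minimal sketch of the four uses listed above, using Python's standard `re` module. The sample string and the variable names are assumptions made only for illustration.

```python
import re

# Hypothetical sample text, used only for illustration
sample = "Order 66 shipped to bob@example.com on 2020-06-17"

# Search: find the first run of digits (re.search returns a match object or None)
first_number = re.search(r"\d+", sample).group()

# Replace: mask every digit with '#'
masked = re.sub(r"\d", "#", sample)

# Validate: fullmatch() succeeds only if the whole string fits the pattern
is_date = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2020-06-17") is not None

# Extract: collect every whitespace-free token that contains '@'
emails = re.findall(r"\S+@\S+", sample)

print(first_number, masked, is_date, emails, sep="\n")
```

Patterns are written as raw strings (`r"..."`) so that backslashes such as `\d` reach the regex engine unchanged.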

### **Basic Meta Characters in Regular Expressions**
| Meta Character | Description |
|---------------|-------------|
| `.`  | Matches any single character |
| `*`  | Matches zero or more occurrences of the previous character |
| `+`  | Matches one or more occurrences of the previous character |
| `^`  | Matches at the start of a string |
| `$`  | Matches at the end of a string |
| `\d` | Matches any digit (0-9) |
| `\D` | Matches any non-digit character |
| `\w` | Matches any word character (letters, digits, underscore) |
| `\W` | Matches any non-word character |
| `\s` | Matches any whitespace (spaces, tabs, newlines) |
| `\S` | Matches any non-whitespace character |
| `[]` | Defines a character class |
| `|`  | Acts as an OR operator |

---


## **2. Basic Meta Characters in Regular Expressions**


Basic meta characters used in regular expressions:

    . -> matches any single character (except a newline by default)
    * -> matches zero or more occurrences of the preceding character
    + -> matches one or more occurrences of the preceding character
    ^ -> matches at the start of a string; as the first character inside [], it negates the set
    $ -> matches at the end of a string
    [] -> matches one of the set of characters within []
    [a-z] -> matches one character in the range of lowercase letters a to z
    [^abc] -> matches a character that is not a, b, or c
    {n,m} -> matches at least "n" but not more than "m" repetitions of the preceding symbol

The same meta characters, with examples:

### **Basic Meta Characters in Regular Expressions**
| Meta Character | Description | Example |
|---------------|-------------|---------|
| `.`  | Matches any single character except newline | `c.t` → matches `cat`, `cot`, `cut` |
| `*`  | Matches **zero or more** occurrences of the previous character | `ca*t` → matches `ct`, `cat`, `caaat` |
| `+`  | Matches **one or more** occurrences of the previous character | `ca+t` → matches `cat`, `caaat` but not `ct` |
| `^`  | Matches at the **start** of a string; as the first character inside `[]`, it negates the set | `^Hello` → matches `"Hello world"`, but not `"world Hello"` |
| `$`  | Matches at the **end** of a string | `world$` → matches `"Hello world"`, but not `"world Hello"` |
| `[]` | Matches **one** of the set of characters within brackets | `[aeiou]` → matches any vowel in a word |
| `[a-z]` | Matches **one** lowercase letter from `a` to `z` | `[a-z]` → matches `a, b, c, ..., z` |
| `[^abc]` | Matches **any character except** `a`, `b`, or `c` | `[^abc]` → matches `d, e, f, ...` |
| `{n,m}` | Matches at **least `n` but not more than `m`** occurrences of the preceding character | `a{2,4}` → matches `aa`, `aaa`, `aaaa` |
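
The examples in the table can be checked with a short script. This is only a sketch: the `cases` list and the extra non-matching strings (`cot`, `a`, `aaaaa`) are assumptions added for illustration.

```python
import re

# Each entry: (pattern, strings that should match, strings that should not)
cases = [
    (r"c.t",    ["cat", "cot", "cut"], ["ct"]),
    (r"ca*t",   ["ct", "cat", "caaat"], ["cot"]),
    (r"ca+t",   ["cat", "caaat"], ["ct"]),
    (r"a{2,4}", ["aa", "aaa", "aaaa"], ["a", "aaaaa"]),
    (r"[^abc]", ["d", "e"], ["a", "b"]),
]

results = {}
for pattern, should_match, should_not in cases:
    # fullmatch() requires the entire string to fit the pattern
    ok = all(re.fullmatch(pattern, s) for s in should_match)
    ok = ok and not any(re.fullmatch(pattern, s) for s in should_not)
    results[pattern] = ok

# ^ and $ anchor a search to the start and the end of the string
starts = bool(re.search(r"^Hello", "Hello world"))  # True
ends = bool(re.search(r"world$", "world Hello"))    # False

print(results, starts, ends)
```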

---



## 3. Extracting Insights from Text using re.findall()

In [1]:
import re
import pandas as pd

In [2]:
text = 'Around 2500 patients are a taking_part in clinical trails #Coronavirus'


In [3]:
print(text)
# Use re.findall() to find all occurrences of lowercase letter sequences in the text
re.findall('[a-z]+', text) 

Around 2500 patients are a taking_part in clinical trails #Coronavirus


['round',
 'patients',
 'are',
 'a',
 'taking',
 'part',
 'in',
 'clinical',
 'trails',
 'oronavirus']

In [4]:
print(text)
# Use re.findall() to find all occurrences of letter sequences (both uppercase and lowercase)
re.findall('[A-Za-z]+', text) 


Around 2500 patients are a taking_part in clinical trails #Coronavirus


['Around',
 'patients',
 'are',
 'a',
 'taking',
 'part',
 'in',
 'clinical',
 'trails',
 'Coronavirus']

In [5]:
print(text)
# Use re.findall() to find all sequences of non-alphabetic characters
re.findall('[^a-zA-Z]+', text)

Around 2500 patients are a taking_part in clinical trails #Coronavirus


[' 2500 ', ' ', ' ', ' ', '_', ' ', ' ', ' ', ' #']

## **4. Using Extended Regular Expressions**


### Extended Regular Expressions:
 
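    \d -> Any digit, equivalent to [0-9] for ASCII text
    \D -> Any non-digit, equivalent to [^0-9] for ASCII text
    \w -> Any word character (alphanumeric or underscore), equivalent to [a-zA-Z0-9_] for ASCII text
    \W -> Any non-word character, equivalent to [^a-zA-Z0-9_] for ASCII text
    \s -> Any whitespace character
    \S -> Any non-whitespace character
    
    () -> Grouping (scoping) for extraction
    {} -> Frequency (repetition count) for extraction
    ? -> Makes the preceding quantifier non-greedy (on its own, makes the preceding item optional)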

### **Extended Meta Characters in Regular Expressions**
| Meta Character | Description | Example |
|---------------|-------------|---------|
| `\d`  | Matches **any digit** (0-9) | `\d+` → matches `123`, `456` |
| `\D`  | Matches **any non-digit** character | `\D+` → matches `Hello`, `abc` |
| `\w`  | Matches **any word character** (letters, digits, underscore) | `\w+` → matches `word`, `test_123` |
| `\W`  | Matches **any non-word** character | `\W+` → matches `@#$%` |
| `\s`  | Matches **any whitespace** character (spaces, tabs, newlines) | `\s+` → matches spaces in `"Hello  World"` |
| `\S`  | Matches **any non-whitespace** character | `\S+` → matches `"Hello"` in `"Hello  World"` |
| `()`  | **Grouping** for extraction | `(ab)+` → matches `ababab` |
| `{}`  | **Frequency specifier** for exact match count | `a{3}` → matches `aaa` but not `aa` |
| `?`  | After a quantifier, makes it **non-greedy** (matches the shortest possible string); on its own, makes the preceding item optional | `ab+?` matches only `ab` in `abbbb` |
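
A short sketch of greedy versus non-greedy matching and of grouping. The sample strings and variable names are assumptions chosen only to make the difference visible.

```python
import re

# Hypothetical sample string, used only for illustration
html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so one match spans the whole string
greedy = re.findall(r"<.+>", html)

# Non-greedy: .+? stops at the first '>' it can, giving one match per tag
lazy = re.findall(r"<.+?>", html)

# ? on its own makes the preceding item optional
optional = re.findall(r"colou?r", "color colour")

# With a group, findall() returns only the captured part
years = re.findall(r"year (\d+)", "year 98 and year 2000")

print(greedy, lazy, optional, years, sep="\n")
```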


In [6]:
print(text)
# Use re.findall() to find all words including numbers and underscores
re.findall(r'\w+', text) 

Around 2500 patients are a taking_part in clinical trails #Coronavirus


['Around',
 '2500',
 'patients',
 'are',
 'a',
 'taking_part',
 'in',
 'clinical',
 'trails',
 'Coronavirus']

## Example 2

In [49]:
text1 = "The film Titanic was released in year 98 and was a hit till the year 2000 \n5000 was the cost of the mobile\ni bargained it to "

In [50]:
print(text1)

The film Titanic was released in year 98 and was a hit till the year 2000 
5000 was the cost of the mobile
i bargained it to 


In [53]:
# Finds all numbers in the whole text, regardless of line breaks.
re.findall(r'\d+', text1)

['98', '2000', '5000']

In [51]:
# Splits the text into a list of lines.
text1.splitlines()

['The film Titanic was released in year 98 and was a hit till the year 2000 ',
 '5000 was the cost of the mobile',
 'i bargained it to ']

In [10]:
# Prints numbers for each line separately, including empty lists for lines without numbers.

for line in text1.splitlines():
    # Find all digit sequences in the current line
    patterns = re.findall(r"\d+", line)
    #if len(patterns)>0:
    print(patterns)

['98', '2000']
['5000']
[]


In [11]:
# Matches digits only at the start of each line; lines without a match are skipped
for line in text1.splitlines():
    patterns = re.findall(r"^\d+", line)
    if len(patterns) > 0:
        print(patterns)

['5000']


In [12]:
text1

'The film Titanic was released in year 98 and was a hit till the year 2000 \n5000 was the cost of the mobile\ni bargained it to '

In [64]:

print('text : ',text1)
print('----')
# Matches digits only at the end of each line
for line in text1.splitlines():
    # strip() removes leading and trailing whitespace, so a trailing space cannot block the `$` match
    line = line.strip()
    patterns = re.findall(r"\d+$", line)
    if patterns:  # A non-empty list is truthy, so `len(patterns) > 0` is not needed
        print(patterns)

text :  The film Titanic was released in year 98 and was a hit till the year 2000 
5000 was the cost of the mobile
i bargained it to 
----
['2000']
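
As an alternative to splitting the text into lines, the `re.MULTILINE` flag makes `^` and `$` match at every line boundary. The sketch below reuses `text1`; the variable names and the `[ \t]*` allowance for trailing spaces are assumptions, not part of the original cells.

```python
import re

text1 = ("The film Titanic was released in year 98 and was a hit till the year 2000 \n"
         "5000 was the cost of the mobile\ni bargained it to ")

# With re.MULTILINE, ^ matches at the start of every line, not just the string
line_starts = re.findall(r"^\d+", text1, flags=re.MULTILINE)

# $ matches at the end of every line; [ \t]* tolerates trailing spaces or tabs,
# and the group keeps only the digits
line_ends = re.findall(r"(\d+)[ \t]*$", text1, flags=re.MULTILINE)

print(line_starts)  # ['5000']
print(line_ends)    # ['2000']
```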


In [14]:
# Extracts runs of 2 to 4 digits (a longer number would be split into several matches)
for line in text1.splitlines():
    patterns = re.findall(r"\d{2,4}", line)
    if len(patterns) > 0:
        print(patterns)

['98', '2000']
['5000']
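
Note that `\d{2,4}` on its own also matches inside longer numbers. One way to keep only whole numbers of 2 to 4 digits is to add word boundaries (`\b`); the sample string below is a hypothetical input chosen to show the difference.

```python
import re

# Hypothetical sample containing a number longer than 4 digits
sample = "id 123456 and year 2020"

# Without boundaries, the long number is split into 2-to-4-digit chunks
chunks = re.findall(r"\d{2,4}", sample)

# \b asserts a word boundary, so only whole numbers of 2 to 4 digits match
whole = re.findall(r"\b\d{2,4}\b", sample)

print(chunks)  # ['1234', '56', '2020']
print(whole)   # ['2020']
```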


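## Extra Example 3

text2 = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'

# Find all whitespace-free tokens that contain '@' with at least one character on each side
re.findall(r"\S+@\S+", text2)

Explanation of `\S+@\S+`

1. `\S+` → Matches one or more non-whitespace characters before @.
2. `@` → Matches the @ symbol.
3. `\S+` → Matches one or more non-whitespace characters after @.

# Lowercase \s requires whitespace before '@', so this matches only ' @2PM'
re.findall(r"\s+@\S+", text2)

# The parentheses capture only the part after '@' (the domain)
re.findall(r"\S+@(\S+)", text2)

# The parentheses capture only the part before '@' (the user name)
re.findall(r"(\S+)@\S+", text2)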
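
The four patterns applied to `text2` can be run together as follows. This is a sketch for comparison; the variable names are assumptions.

```python
import re

text2 = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'

full_addresses = re.findall(r"\S+@\S+", text2)  # whole token around '@'
after_space = re.findall(r"\s+@\S+", text2)     # '@' preceded by whitespace
domains = re.findall(r"\S+@(\S+)", text2)       # group after '@'
users = re.findall(r"(\S+)@\S+", text2)         # group before '@'

print(full_addresses)  # ['csev@umich.edu', 'cwen@iupui.edu']
print(after_space)     # [' @2PM']
print(domains)         # ['umich.edu', 'iupui.edu']
print(users)           # ['csev', 'cwen']
```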

## Cleaning text using re.sub()

In [20]:
text3 = 'Around 2,500 patients are taking-part in clinical trails #Coronavirus'

In [21]:
print(text3)
# 1.1 Replaces every non-word character with an empty string ('').
# 1.2 Effectively removes punctuation and spaces.
re.sub(r'[^\w]', '', text3)

Around 2,500 patients are taking-part in clinical trails #Coronavirus


'Around2500patientsaretakingpartinclinicaltrailsCoronavirus'

In [22]:
print(text3)
# 1.1 Removes all punctuation (here: the comma, the hyphen and '#').
# 1.2 Preserves spaces and words.
re.sub(r'[^\w\s]', '', text3)

Around 2,500 patients are taking-part in clinical trails #Coronavirus


'Around 2500 patients are takingpart in clinical trails Coronavirus'

In [23]:
text4 = "film ABC  @ was ? produced %  in , year $ 1994  .  'by'   Mr_X"

In [24]:
# Replacing a fixed list of special characters with nothing
# 1.1 Removes only the specified characters.
# 1.2 Leaves everything else (letters, numbers, and spaces) unchanged.
result1 = re.sub("[,@'?.$%_]", "", text4)
result1

'film ABC   was  produced   in  year  1994    by   MrX'

In [25]:
# Removing special characters (anything that is not alphanumeric or a space)
# 1.1 Removes special characters (@, ?, %, ,, $, ., ', _).
# 1.2 Keeps letters, numbers, and spaces.
result1 = re.sub("[^a-zA-Z0-9 ]", "", text4)
result1

'film ABC   was  produced   in  year  1994    by   MrX'

In [26]:
# Removes punctuation (@, ?, %, ,, $, ., ').
# Keeps words, numbers, and spaces intact; the underscore stays because \w includes it.
result1 = re.sub(r"[^\w\s]", "", text4)
result1

'film ABC   was  produced   in  year  1994    by   Mr_X'

In [27]:
# Replaces each non-word character (which includes spaces) with a single space.
# Runs of several spaces appear as a result.
result1 = re.sub(r"[\W ]", " ", text4)
result1

'film ABC    was   produced    in   year   1994      by    Mr_X'

In [28]:
#result1 = re.sub("\W\s"," ",text4)
#result1

In [29]:
# Removing all whitespace
# Finds every run of whitespace and replaces it with an empty string,
# joining all words into one continuous string.
# (To collapse each run into a single space instead, use " " as the replacement.)
result = re.sub(r"\s+", "", result1)
result

'filmABCwasproducedinyear1994byMr_X'
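
If the goal is readable text rather than one continuous string, a common variant collapses each run of whitespace into a single space. A minimal sketch on `text4`, with the intermediate names `no_punct` and `cleaned` as assumptions:

```python
import re

text4 = "film ABC  @ was ? produced %  in , year $ 1994  .  'by'   Mr_X"

# Step 1: drop everything that is neither a word character nor whitespace
no_punct = re.sub(r"[^\w\s]", "", text4)

# Step 2: collapse each run of whitespace into one space, then trim the ends
cleaned = re.sub(r"\s+", " ", no_punct).strip()

print(cleaned)  # film ABC was produced in year 1994 by Mr_X
```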

## Cleaning salary from Naukri Job dataset

In [30]:
jobs = pd.read_csv('datascience_jobs.csv')
jobs.head()

Unnamed: 0,title,location,experience,skills,company,salary,description,posted_date
0,Data Science,Mumbai,2-4 yrs,"Algorithms, Machine Learning, Python, Java, Da...",Netcore Solutions Pvt Ltd,"2,00,000 - 7,00,000 P.A.",At least 2 year of experience in data engineer...,1 day ago
1,Analyst / Sr. Analyst (data Science),Gurgaon,5-8 yrs,"predictive modeling, predictive analytics, mac...",Cvent India Pvt. Ltd.,"5,00,000 - 10,00,000 P.A.",Strong experience on providing predictive mode...,Today
2,ETL Lead & Data Science,"Chennai, Bengaluru, Mumbai, Pune, Noida",7-10 yrs,"SQL, Data Analysis, Text Mining, SAS, R, Stati...",COMPUTER POWER GROUP PRIVATE LIMITED,"10,00,000 - 15,00,000 P.A.",Industry experience in building and operationa...,1 day ago
3,Specialist - Data Science,"Delhi NCR, Bengaluru, Gurgaon",7-12 yrs,"Specialist - Data Science, Data Science, data ...",Brainsearch Consulting Pvt Ltd.,Not disclosed,- Experience with one or more data science pro...,1 day ago
4,Group Manager - Data Science - Python/nlp,Bengaluru,6-11 yrs,"machine learning, text mining, r, nlp, data sc...",Staffio HR,Not disclosed,- This is a Team management role - Skill set ...,1 day ago


## Task : From the salary column extract the minimum and maximum salary, NA if unable to extract

## Methods used so far
### re methods
1. .findall()
2. .sub()

### string methods
1. .replace()
2. .split()
3. .splitlines()
4. .strip()

## Single value logic check

In [31]:
# Logic to check if it works for a single row value
salary = '5,00,000 - 10,00,000 P.A.'
salary = salary.replace(",","")
salary

'500000 - 1000000 P.A.'

The salary text follows the template `(min) - (max) P.A.`, so each value can be captured with a group:

In [32]:
min_sal = re.findall(r'([0-9]+) - [0-9]+ P\.A\.', salary)
min_sal

['500000']

In [33]:
max_sal = re.findall(r'[0-9]+ - ([0-9]+) P\.A\.', salary)
max_sal

['1000000']
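
Both values can also be captured in one pass by putting two groups in the same pattern. The helper below is a sketch, not the notebook's method: `SALARY_PATTERN` and `parse_salary_range` are hypothetical names, and it assumes the salary text follows the `min - max P.A.` layout shown above.

```python
import re

# Two groups capture both ends of the range; \. matches a literal dot
SALARY_PATTERN = re.compile(r"([0-9]+) - ([0-9]+) P\.A\.")

def parse_salary_range(raw):
    """Return (min, max) as integers, or (None, None) if the text does not fit."""
    match = SALARY_PATTERN.search(raw.replace(",", ""))
    if match is None:
        return (None, None)
    return (int(match.group(1)), int(match.group(2)))

print(parse_salary_range("5,00,000 - 10,00,000 P.A."))  # (500000, 1000000)
print(parse_salary_range("Not disclosed"))              # (None, None)
```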

## Entire column logic check

In [34]:
jobs['salary'].head(n=15)

0           2,00,000 - 7,00,000 P.A.  
1          5,00,000 - 10,00,000 P.A.  
2         10,00,000 - 15,00,000 P.A.  
3                       Not disclosed 
4                       Not disclosed 
5           2,00,000 - 4,25,000 P.A.  
6           4,00,000 - 8,00,000 P.A.  
7                       Not disclosed 
8                       Not disclosed 
9                       Not disclosed 
10                      Not disclosed 
11          1,00,000 - 3,00,000 P.A.  
12          1,25,000 - 3,00,000 P.A.  
13                      Not disclosed 
14                      Not disclosed 
Name: salary, dtype: object

## Let's run this in VS Code, with an extension to preview and debug the code
### List of dictionaries to store job data
```python
import pandas as pd
import re

job_listings_data = [
    {
        "Title": "Data Science",
        "Location": "Mumbai",
        "Experience": "2-4 yrs",
        "Skills": "Algorithms, Machine Learning, Python, Java, Data",
        "Company": "Netcore Solutions Pvt Ltd",
        "Salary": "2,00,000 - 7,00,000 P.A.",
        "Description": "At least 2 years of experience in data engineering...",
        "Posted Date": "1 day ago"
    },
    {
        "Title": "Analyst / Sr. Analyst (Data Science)",
        "Location": "Gurgaon",
        "Experience": "5-8 yrs",
        "Skills": "Predictive Modeling, Predictive Analytics, Machine Learning",
        "Company": "Cvent India Pvt. Ltd.",
        "Salary": "5,00,000 - 10,00,000 P.A.",
        "Description": "Strong experience in providing predictive models...",
        "Posted Date": "Today"
    },
    {
        "Title": "ETL Lead & Data Science",
        "Location": "Chennai, Bengaluru, Mumbai, Pune, Noida",
        "Experience": "7-10 yrs",
        "Skills": "SQL, Data Analysis, Text Mining, SAS, R, Statistics",
        "Company": "COMPUTER POWER GROUP PRIVATE LIMITED",
        "Salary": "10,00,000 - 15,00,000 P.A.",
        "Description": "Industry experience in building and operationalizing data science...",
        "Posted Date": "1 day ago"
    },
    {
        "Title": "Specialist - Data Science",
        "Location": "Delhi NCR, Bengaluru, Gurgaon",
        "Experience": "7-12 yrs",
        "Skills": "Data Science, Data Engineering",
        "Company": "Brainsearch Consulting Pvt Ltd",
        "Salary": "Not disclosed",
        "Description": "Experience with one or more data science projects...",
        "Posted Date": "1 day ago"
    },
    {
        "Title": "Group Manager - Data Science - Python/NLP",
        "Location": "Bengaluru",
        "Experience": "6-11 yrs",
        "Skills": "Machine Learning, Text Mining, R, NLP, Data Science",
        "Company": "Staffio HR",
        "Salary": "Not disclosed",
        "Description": "This is a team management role - Skill set includes NLP and Python...",
        "Posted Date": "1 day ago"
    }
]

# Convert list of dictionaries into a Pandas DataFrame
job_listings_df = pd.DataFrame(job_listings_data)

# Function logic
# Removes commas from the salary text.
# Uses regex to extract the minimum salary (the first number, before the " - ").
# Returns None if no match is found.
def get_salary_min(row):
    row = row.replace(',', '')
    # Escape the dots so they match a literal "." rather than any character
    pattern = r'([0-9]+) - [0-9]+ P\.A\.'
    salary = re.findall(pattern, row)
    if len(salary):
        return salary[0]
    else:
        return None

# Calls get_salary_min() for each row in the salary column.
# Ensures the extracted salary is stored as a numeric value.
# Converts invalid values to NaN (errors='coerce').
job_listings_df['Salary_min'] = job_listings_df['Salary'].apply(get_salary_min)
job_listings_df['Salary_min'] = pd.to_numeric(job_listings_df['Salary_min'], errors='coerce')

```


In [69]:
job_listings_df

Unnamed: 0,Title,Location,Experience,Skills,Company,Salary,Description,Posted Date
0,Data Science,Mumbai,2-4 yrs,"Algorithms, Machine Learning, Python, Java, Data",Netcore Solutions Pvt Ltd,"2,00,000 - 7,00,000 P.A.",At least 2 years of experience in data enginee...,1 day ago
1,Analyst / Sr. Analyst (Data Science),Gurgaon,5-8 yrs,"Predictive Modeling, Predictive Analytics, Mac...",Cvent India Pvt. Ltd.,"5,00,000 - 10,00,000 P.A.",Strong experience in providing predictive mode...,Today
2,ETL Lead & Data Science,"Chennai, Bengaluru, Mumbai, Pune, Noida",7-10 yrs,"SQL, Data Analysis, Text Mining, SAS, R, Stati...",COMPUTER POWER GROUP PRIVATE LIMITED,"10,00,000 - 15,00,000 P.A.",Industry experience in building and operationa...,1 day ago
3,Specialist - Data Science,"Delhi NCR, Bengaluru, Gurgaon",7-12 yrs,"Data Science, Data Engineering",Brainsearch Consulting Pvt Ltd,Not disclosed,Experience with one or more data science proje...,1 day ago
4,Group Manager - Data Science - Python/NLP,Bengaluru,6-11 yrs,"Machine Learning, Text Mining, R, NLP, Data Sc...",Staffio HR,Not disclosed,This is a team management role - Skill set inc...,1 day ago


In [35]:
# Function logic
# Removes commas from the salary text.
# Uses regex to extract the minimum salary (the first number, before the " - ").
# Returns None if no match is found.
def get_salary_min(row):
    row = row.replace(',', '')
    # Escape the dots so they match a literal "." rather than any character
    pattern = r'([0-9]+) - [0-9]+ P\.A\.'
    salary = re.findall(pattern, row)
    if len(salary):
        return salary[0]
    else:
        return None

In [70]:
# Calls get_salary_min() for each row in the salary column.
# Ensures the extracted salary is stored as a numeric value.
# Converts invalid values to NaN (errors='coerce').
jobs['salary_min'] = jobs['salary'].apply(get_salary_min)
jobs['salary_min'] = pd.to_numeric(jobs['salary_min'],errors='coerce')
jobs[['salary', 'salary_min']].head()

Unnamed: 0,salary,salary_min
0,"2,00,000 - 7,00,000 P.A.",200000.0
1,"5,00,000 - 10,00,000 P.A.",500000.0
2,"10,00,000 - 15,00,000 P.A.",1000000.0
3,Not disclosed,
4,Not disclosed,


In [37]:
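# Same logic as get_salary_min(), but the group captures the maximum salary (the second number).
def get_salary_max(row):
    row = row.replace(',', '')
    # Escape the dots so they match a literal "." rather than any character
    pattern = r'[0-9]+ - ([0-9]+) P\.A\.'
    salary = re.findall(pattern, row)
    if len(salary):
        return salary[0]
    else:
        return None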

In [38]:
jobs['salary_max'] = jobs['salary'].apply(get_salary_max)
jobs['salary_max'] = pd.to_numeric(jobs['salary_max'],
                                  errors='coerce')

In [39]:
jobs[['salary', 'salary_min','salary_max']].head()

Unnamed: 0,salary,salary_min,salary_max
0,"2,00,000 - 7,00,000 P.A.",200000.0,700000.0
1,"5,00,000 - 10,00,000 P.A.",500000.0,1000000.0
2,"10,00,000 - 15,00,000 P.A.",1000000.0,1500000.0
3,Not disclosed,,
4,Not disclosed,,


In [40]:
jobs['salary_avg'] = (jobs['salary_min']+jobs['salary_max'])/2
jobs['salary_avg']

0        450000.0
1        750000.0
2       1250000.0
3             NaN
4             NaN
          ...    
3995          NaN
3996          NaN
3997          NaN
3998          NaN
3999          NaN
Name: salary_avg, Length: 4000, dtype: float64

## Extracting hashtags from tweets

In [41]:
text5 = 'Around 2,500 patients are taking part in clinical trails #Coronavirus'
print(text5)
print(re.findall(r'#\w+', text5))

Around 2,500 patients are taking part in clinical trails #Coronavirus
['#Coronavirus']


In [42]:
tweets = pd.read_csv('tweets_donald_trump.csv')
tweets.head()

Unnamed: 0,created_at,language,likes,retweets,text
0,6/17/2020 3:27,en,123212,18568,96% Approval Rating in the Republican Party. T...
1,6/17/2020 2:45,und,0,7942,RT @TONYxTWO: @thejtlewis @JoeBiden https://t....
2,6/17/2020 2:38,en,0,23815,RT @thejtlewis: “Trump isn’t going to acce...
3,6/17/2020 2:37,en,0,6781,"RT @thejtlewis: With the utmost respect, I tha..."
4,6/17/2020 2:31,en,56840,14231,A GREAT woman. Her son is looking down from he...


In [43]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   created_at  400 non-null    object
 1   language    400 non-null    object
 2   likes       400 non-null    int64 
 3   retweets    400 non-null    int64 
 4   text        400 non-null    object
dtypes: int64(2), object(3)
memory usage: 15.8+ KB


Obtain the frequency of each hashtag:

- Step 1: Extract all the hashtags and store them in a list
- Step 2: Compute the frequency of each hashtag

In [44]:
all_hashtags = []
for row in tweets['text']:
    row_hashtags = re.findall(r'#\w+', row)
    all_hashtags.extend(row_hashtags)

In [45]:
len(all_hashtags)

23

In [46]:
all_hashtags[:10]

['#ArmyBdaâ',
 '#HappyBirthdayTrump',
 '#GoArmy',
 '#USMA2020',
 '#WithVisionWeLead',
 '#JUSTIâ',
 '#BarelyThereBiden',
 '#MAGA',
 '#NHSEN',
 '#2A']

In [47]:
freq_hashtags = pd.Series(all_hashtags).value_counts()
freq_hashtags.head(10)

#MAGA            4
#2A              2
#NH01            1
#FoxNews         1
#Democrats       1
#PPPworks        1
#JobsReport      1
#jobsreport      1
#AMERICAFirst    1
#WGDP            1
Name: count, dtype: int64
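
In the counts above, `#JobsReport` and `#jobsreport` are tallied separately because the match is case-sensitive. If case variants should be merged, one option is to lowercase each hashtag before counting. The sketch below uses the standard library's `collections.Counter`; the tweets in `sample_tweets` are made up for illustration and are not taken from the dataset.

```python
import re
from collections import Counter

# Hypothetical sample tweets, used only for illustration
sample_tweets = [
    "Great numbers today! #JobsReport #MAGA",
    "More good news #jobsreport",
    "Thank you! #MAGA",
]

all_tags = []
for tweet in sample_tweets:
    # Lowercase each hashtag so that case variants share one key
    all_tags.extend(tag.lower() for tag in re.findall(r"#\w+", tweet))

tag_counts = Counter(all_tags)
print(tag_counts.most_common())
```

With a DataFrame, the same normalisation could be applied to `all_hashtags` before calling `value_counts()`.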