# **Introduction**
## What are regular expressions?

- A regular expression is a sequence of characters and symbols that describes a pattern to find in text.

- A regular expression is a pattern that is matched against a subject string from left to right. Regular expressions are used to replace text within a string, validate forms, extract a substring from a string based on a pattern match, and so much more. The term "regular expression" is a mouthful, so you will usually find the term abbreviated to "regex" or "regexp".


## **Regular Expressions (Regex) Overview**

### **Purpose of Regular Expressions**
Regular expressions are widely used for:
1. **Extracting insights from text** → `findall()`
2. **Cleaning text** → `sub()`
3. **Validating input formats** → Ensuring correctness of data, such as emails, phone numbers, and IDs.
4. **Manipulating and transforming strings** → Reformatting text based on patterns.


A **regular expression** is a pattern used to search for specific text patterns in a string. It allows us to:
- **Search** for patterns
- **Replace** text
- **Validate** form inputs
- **Extract** information from text

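The four uses listed above map onto a handful of functions in Python's `re` module. The following is a minimal sketch; the `sample` string is made up for illustration and does not come from the datasets used later.

```python
import re

# Hypothetical sample text, used only for this illustration
sample = "Order 66 shipped to bob@example.com on 2024-01-15"

# Search: locate the first run of digits
first_number = re.search(r"\d+", sample).group()

# Replace: mask every digit with '#'
masked = re.sub(r"\d", "#", sample)

# Validate: the whole string must look like YYYY-MM-DD
is_date = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2024-01-15") is not None

# Extract: collect every run of digits
numbers = re.findall(r"\d+", sample)

print(first_number)  # 66
print(masked)        # Order ## shipped to bob@example.com on ####-##-##
print(is_date)       # True
print(numbers)       # ['66', '2024', '01', '15']
```

`re.fullmatch` is used for validation because `re.search` would also accept strings that merely contain a valid date somewhere inside them.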

### **Basic Meta Characters in Regular Expressions**
| Meta Character | Description | Example |
|---------------|-------------|---------|
| `.`  | Matches any single character except newline | `c.t` → matches `cat`, `cot`, `cut` |
| `*`  | Matches **zero or more** occurrences of the previous character | `ca*t` → matches `ct`, `cat`, `caaat` |
| `+`  | Matches **one or more** occurrences of the previous character | `ca+t` → matches `cat`, `caaat` but not `ct` |
| `^`  | Matches at the **start** of a string; inside brackets (`[^...]`) it **negates** the character set | `^Hello` → matches `"Hello world"`, but not `"world Hello"` |
| `$`  | Matches at the **end** of a string | `world$` → matches `"Hello world"`, but not `"world Hello"` |
| `[]` | Matches **one** of the set of characters within brackets | `[aeiou]` → matches any vowel in a word |
| `[a-z]` | Matches **one** lowercase letter from `a` to `z` | `[a-z]` → matches `a, b, c, ..., z` |
| `[^abc]` | Matches **any character except** `a`, `b`, or `c` | `[^abc]` → matches `d, e, f, ...` |
| `{n,m}` | Matches at **least `n` but not more than `m`** occurrences of the preceding character | `a{2,4}` → matches `aa`, `aaa`, `aaaa` |

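A few rows of the table can be checked directly with `re.search`, which returns a match object when the pattern is found anywhere in the string and `None` otherwise. This is only a sketch; the `checks` list is an illustrative selection of the table's examples.

```python
import re

# (pattern, string, should_match) triples based on the table's examples
checks = [
    (r"c.t", "cot", True),
    (r"ca*t", "ct", True),           # zero 'a' characters is allowed
    (r"ca+t", "ct", False),          # at least one 'a' is required
    (r"^Hello", "world Hello", False),
    (r"world$", "Hello world", True),
    (r"[^abc]", "a", False),         # 'a' is excluded by the negated set
    (r"a{2,4}", "aaa", True),
]

results = []
for pattern, string, should_match in checks:
    found = re.search(pattern, string) is not None
    results.append(found == should_match)
    print(f"{pattern!r:12} {string!r:15} -> {found}")
```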
---




# **Regular Expressions (Regex) Overview**

Regular expressions, often abbreviated as **regex** or **regexp**, are sequences of characters defining search patterns in text. They are used for text processing tasks such as searching, replacing, extracting, and validating text. Regular expressions provide a powerful and efficient way to manipulate strings, making them a valuable tool in various programming languages and applications.


## **Applications of Regular Expressions**
Regular expressions are widely used in:
- **Data Validation:** Ensuring proper formats for email addresses, phone numbers, and IDs.
- **Web Scraping:** Extracting specific data from web pages.
- **Log File Analysis:** Searching for patterns in logs for debugging and monitoring.
- **Text Processing:** Automated text replacement, formatting, and parsing in programming.
- **Form Input Validation:** Ensuring correctness of user inputs in web forms.

## Practice:
1. https://www.regexone.com/
2. https://regexr.com/
3. https://regex101.com/


In [1]:
import re

In [2]:
text = 'Around 2500 patients are a taking_part in clinical trails #Coronavirus'

In [3]:
# find all sequences of lowercase letters
re.findall('[a-z]+',text)

['round',
 'patients',
 'are',
 'a',
 'taking',
 'part',
 'in',
 'clinical',
 'trails',
 'oronavirus']

In [4]:
# find all sequences of letters, lowercase or uppercase
re.findall('[A-Za-z]+',text)

['Around',
 'patients',
 'are',
 'a',
 'taking',
 'part',
 'in',
 'clinical',
 'trails',
 'Coronavirus']

In [7]:
# find all sequences of non-alphabetic characters
re.findall('[^a-zA-Z]+',text)

[' 2500 ', ' ', ' ', ' ', '_', ' ', ' ', ' ', ' #']

### **Extended Meta Characters in Regular Expressions**
| Meta Character | Description | Example |
|---------------|-------------|---------|
| `\d`  | Matches **any digit** (0-9) | `\d+` → matches `123`, `456` |
| `\D`  | Matches **any non-digit** character | `\D+` → matches `Hello`, `abc` |
| `\w`  | Matches **any word character** (letters, digits, underscore) | `\w+` → matches `word`, `test_123` |
| `\W`  | Matches **any non-word** character | `\W+` → matches `@#$%` |
| `\s`  | Matches **any whitespace** character (spaces, tabs, newlines) | `\s+` → matches spaces in `"Hello  World"` |
| `\S`  | Matches **any non-whitespace** character | `\S+` → matches `"Hello"` in `"Hello  World"` |
| `()`  | **Grouping** for extraction | `(ab)+` → matches `ababab` |
| `{}`  | **Frequency specifier** for exact match count | `a{3}` → matches `aaa` but not `aa` |
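| `?`  | Matches **zero or one** occurrence of the preceding element; placed after another quantifier, makes it **non-greedy** (matches the shortest possible string) | `colou?r` → matches `color`, `colour`; `ab+?` matches only `ab` in `abbbb` |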

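The difference between greedy and non-greedy matching, and the effect of a capture group on `findall`, is easiest to see side by side. This is a small sketch; the `html` and `sentence` strings are invented for illustration.

```python
import re

# Hypothetical sample strings for illustration
html = "<b>bold</b> and <i>italic</i>"
sentence = "released in year 98, a hit till the year 2000"

# Greedy: .+ grabs as much as possible, so one match spans the whole string
greedy = re.findall(r"<.+>", html)

# Non-greedy: .+? stops at the first '>' it can, giving one match per tag
lazy = re.findall(r"<.+?>", html)

# Grouping: with one capture group, findall returns only the captured part
years = re.findall(r"year (\d+)", sentence)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
print(years)   # ['98', '2000']
```

Capture groups are what the salary extraction later in this notebook relies on: the full pattern decides where a match is, and the parentheses decide which part is returned.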

In [9]:
# find all words, including digits and underscores
re.findall(r'\w+', text)

['Around',
 '2500',
 'patients',
 'are',
 'a',
 'taking_part',
 'in',
 'clinical',
 'trails',
 'Coronavirus']

In [10]:
text1 = "The film Titanic was released in year 98 and was a hit till the year 2000 \n5000 was the cost of the mobile\ni bargained it to "
print(text1)

The film Titanic was released in year 98 and was a hit till the year 2000 
5000 was the cost of the mobile
i bargained it to 


In [11]:
# find all the numbers across multiple lines, ignoring line breaks
re.findall(r'\d+', text1)

['98', '2000', '5000']

In [None]:
# goal: get only the numbers on the first line -> ['98', '2000']


In [12]:
text1.splitlines()

['The film Titanic was released in year 98 and was a hit till the year 2000 ',
 '5000 was the cost of the mobile',
 'i bargained it to ']

In [13]:
for line in text1.splitlines():
    numbers = re.findall(r'\d+', line)  # all digit runs on this line
    print(numbers)

['98', '2000']
['5000']
[]


In [15]:
for line in text1.splitlines():
    numbers = re.findall(r'\d+', line)
    if numbers:  # skip lines that contain no numbers
        print(numbers)

['98', '2000']
['5000']

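The loops above print the numbers for every line. To get only the first line's numbers, as the earlier goal states, one possible approach is to index the result of `splitlines()` before searching. This is a sketch of one way to do it, not necessarily the intended solution.

```python
import re

text1 = ("The film Titanic was released in year 98 and was a hit till the year 2000 \n"
         "5000 was the cost of the mobile\ni bargained it to ")

# Take the first line only, then search it for digit runs
first_line = text1.splitlines()[0]
first_line_numbers = re.findall(r"\d+", first_line)

print(first_line_numbers)  # ['98', '2000']
```

Indexing with `[0]` assumes the text is non-empty; an empty string would make `splitlines()` return an empty list.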

In [16]:
text1

'The film Titanic was released in year 98 and was a hit till the year 2000 \n5000 was the cost of the mobile\ni bargained it to '

In [18]:
# extract only numbers that are between 2 and 4 digits long
for line in text1.splitlines():
    numbers = re.findall(r'\d{2,4}', line)
    if numbers:
        print(numbers)

['98', '2000']
['5000']

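One caveat: `\d{2,4}` also matches inside longer numbers, because nothing in the pattern says the number must end after four digits. Adding word boundaries (`\b`) is one way to require the whole number to be 2 to 4 digits long. The `sample` string below is hypothetical.

```python
import re

# Hypothetical sample containing a 1-digit and a 6-digit number
sample = "ids: 7, 42, 2000, 123456"

# Without boundaries, 123456 is split into '1234' and '56'
loose = re.findall(r"\d{2,4}", sample)

# With \b on both sides, only complete 2-4 digit numbers match
strict = re.findall(r"\b\d{2,4}\b", sample)

print(loose)   # ['42', '2000', '1234', '56']
print(strict)  # ['42', '2000']
```

On `text1` both patterns give the same result, since every number there already has between 2 and 4 digits.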

# Cleaning text with `re.sub()`

In [23]:
text3='Around 2,500 patients are a taking-part in clinical trails #Coronavirus.'

In [24]:
# replace every non-word character with an empty string
re.sub(r'[^\w]', '', text3)

'Around2500patientsareatakingpartinclinicaltrailsCoronavirus'

In [25]:
# remove all punctuation (e.g. @, !, ', ., $, %, ?) while keeping whitespace;
# the underscore counts as a word character, so it is kept
re.sub(r'[^\w\s]', '', text3)

'Around 2500 patients are a takingpart in clinical trails Coronavirus'

In [26]:

text4 = "film ABC  @ was ? produced %  in , year $ 1994  .  'by'   Mr_X"

In [27]:
# remove only the listed special characters
re.sub("[@,'?.$%_]", '', text4)

'film ABC   was  produced   in  year  1994    by   MrX'

In [29]:
# remove everything except letters, digits, and spaces
re.sub('[^a-zA-Z0-9 ]', '', text4)

'film ABC   was  produced   in  year  1994    by   MrX'

In [30]:
# replace each non-word character (spaces included) with a single space
re.sub(r'[\W ]', ' ', text4)
# re.sub(r'[\W\s]', ' ', text4)

'film ABC    was   produced    in   year   1994      by    Mr_X'

In [32]:
# collapse runs of whitespace into a single space
re.sub(r'\s+', ' ', text4)

"film ABC @ was ? produced % in , year $ 1994 . 'by' Mr_X"

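The individual substitutions above can be chained into one cleaning step. The helper below is a minimal sketch; `clean_text` is a hypothetical name, and the choice to keep underscores follows from `\w` treating `_` as a word character.

```python
import re

def clean_text(raw: str) -> str:
    """Drop punctuation, then collapse whitespace (illustrative helper)."""
    no_punct = re.sub(r"[^\w\s]", "", raw)         # keep word chars and whitespace
    return re.sub(r"\s+", " ", no_punct).strip()   # single spaces, no outer padding

text4 = "film ABC  @ was ? produced %  in , year $ 1994  .  'by'   Mr_X"
cleaned = clean_text(text4)
print(cleaned)  # film ABC was produced in year 1994 by Mr_X
```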
# Cleaning the Naukri.com (data science jobs) dataset

In [36]:
import pandas as pd
jobs = pd.read_csv('datascience_jobs.csv')
jobs.head()

Unnamed: 0,title,location,experience,skills,company,salary,description,posted_date
0,Data Science,Mumbai,2-4 yrs,"Algorithms, Machine Learning, Python, Java, Da...",Netcore Solutions Pvt Ltd,"2,00,000 - 7,00,000 P.A.",At least 2 year of experience in data engineer...,1 day ago
1,Analyst / Sr. Analyst (data Science),Gurgaon,5-8 yrs,"predictive modeling, predictive analytics, mac...",Cvent India Pvt. Ltd.,"5,00,000 - 10,00,000 P.A.",Strong experience on providing predictive mode...,Today
2,ETL Lead & Data Science,"Chennai, Bengaluru, Mumbai, Pune, Noida",7-10 yrs,"SQL, Data Analysis, Text Mining, SAS, R, Stati...",COMPUTER POWER GROUP PRIVATE LIMITED,"10,00,000 - 15,00,000 P.A.",Industry experience in building and operationa...,1 day ago
3,Specialist - Data Science,"Delhi NCR, Bengaluru, Gurgaon",7-12 yrs,"Specialist - Data Science, Data Science, data ...",Brainsearch Consulting Pvt Ltd.Â,Not disclosed,- Experience with one or more data science pro...,1 day ago
4,Group Manager - Data Science - Python/nlp,Bengaluru,6-11 yrs,"machine learning, text mining, r, nlp, data sc...",Staffio HR,Not disclosed,- This is a Team management role - Skill set ...,1 day ago


In [37]:
jobs.shape

(4000, 8)

In [None]:
# task: from the salary column, extract the min and max salary into separate columns;
# if the salary is "Not disclosed", make it an NA value

In [38]:
salary = "2,00,000 - 7,00,000 P.A."
salary = salary.replace(',','')
salary

'200000 - 700000 P.A.'

Target pattern, where each `()` is a capture group for one number: `() - () P.A.`

In [40]:
# the dots are escaped so that they match a literal '.' rather than any character
min_sal = re.findall(r'([0-9]+) - [0-9]+ P\.A\.', salary)
min_sal

['200000']

In [41]:
# the dots are escaped so that they match a literal '.' rather than any character
max_sal = re.findall(r'[0-9]+ - ([0-9]+) P\.A\.', salary)
max_sal

['700000']

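The two `findall` calls above each capture one side of the range. As an alternative sketch, a single pattern with two capture groups can return both values at once. `parse_salary_range` is a hypothetical helper name, and returning a pair of `None` values for undisclosed salaries is an assumption about how missing values should be represented.

```python
import re

def parse_salary_range(raw):
    """Return (min, max) as integers, or (None, None) if no range is found."""
    cleaned = raw.replace(",", "")                      # '2,00,000' -> '200000'
    match = re.search(r"(\d+) - (\d+) P\.A\.", cleaned)
    if match is None:
        return None, None
    return int(match.group(1)), int(match.group(2))

print(parse_salary_range("2,00,000 - 7,00,000 P.A."))  # (200000, 700000)
print(parse_salary_range("Not disclosed"))             # (None, None)
```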
In [42]:
jobs['salary'].head(10)

0          2,00,000 - 7,00,000 P.A.  
1         5,00,000 - 10,00,000 P.A.  
2        10,00,000 - 15,00,000 P.A.  
3                      Not disclosed 
4                      Not disclosed 
5          2,00,000 - 4,25,000 P.A.  
6          4,00,000 - 8,00,000 P.A.  
7                      Not disclosed 
8                      Not disclosed 
9                      Not disclosed 
Name: salary, dtype: object

In [45]:
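def get_sala_min(row):
    """Return the minimum salary as a digit string, or None if not disclosed."""
    row = row.replace(',', '')
    pattern = r'([0-9]+) - [0-9]+ P\.A\.'  # dots escaped to match a literal '.'
    salary = re.findall(pattern, row)
    if salary:
        return salary[0]
    return None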

In [47]:
jobs['min_salary'] = jobs['salary'].apply(get_sala_min)

In [51]:
jobs[['salary','min_salary']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   salary      4000 non-null   object
 1   min_salary  478 non-null    object
dtypes: object(2)
memory usage: 62.6+ KB


In [67]:
jobs['min_salary'] = pd.to_numeric(jobs['min_salary'], errors='coerce')
jobs['min_salary'].head(10)

0     200000.0
1     500000.0
2    1000000.0
3          NaN
4          NaN
5     200000.0
6     400000.0
7          NaN
8          NaN
9          NaN
Name: min_salary, dtype: float64

In [53]:
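def get_sala_max(row):
    """Return the maximum salary as a digit string, or None if not disclosed."""
    row = row.replace(',', '')
    pattern = r'[0-9]+ - ([0-9]+) P\.A\.'  # dots escaped to match a literal '.'
    salary = re.findall(pattern, row)
    if salary:
        return salary[0]
    return None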

In [54]:
jobs['max_salary'] = jobs['salary'].apply(get_sala_max)

In [55]:
jobs['max_salary'] = pd.to_numeric(jobs['max_salary'], errors='coerce')
jobs['max_salary'].head(10)

0     700000.0
1    1000000.0
2    1500000.0
3          NaN
4          NaN
5     425000.0
6     800000.0
7          NaN
8          NaN
9          NaN
Name: max_salary, dtype: float64

In [56]:
jobs.head()

Unnamed: 0,title,location,experience,skills,company,salary,description,posted_date,min_salary,max_salary
0,Data Science,Mumbai,2-4 yrs,"Algorithms, Machine Learning, Python, Java, Da...",Netcore Solutions Pvt Ltd,"2,00,000 - 7,00,000 P.A.",At least 2 year of experience in data engineer...,1 day ago,200000.0,700000.0
1,Analyst / Sr. Analyst (data Science),Gurgaon,5-8 yrs,"predictive modeling, predictive analytics, mac...",Cvent India Pvt. Ltd.,"5,00,000 - 10,00,000 P.A.",Strong experience on providing predictive mode...,Today,500000.0,1000000.0
2,ETL Lead & Data Science,"Chennai, Bengaluru, Mumbai, Pune, Noida",7-10 yrs,"SQL, Data Analysis, Text Mining, SAS, R, Stati...",COMPUTER POWER GROUP PRIVATE LIMITED,"10,00,000 - 15,00,000 P.A.",Industry experience in building and operationa...,1 day ago,1000000.0,1500000.0
3,Specialist - Data Science,"Delhi NCR, Bengaluru, Gurgaon",7-12 yrs,"Specialist - Data Science, Data Science, data ...",Brainsearch Consulting Pvt Ltd.Â,Not disclosed,- Experience with one or more data science pro...,1 day ago,,
4,Group Manager - Data Science - Python/nlp,Bengaluru,6-11 yrs,"machine learning, text mining, r, nlp, data sc...",Staffio HR,Not disclosed,- This is a Team management role - Skill set ...,1 day ago,,


In [58]:
jobs['Avg_salary'] = (jobs['min_salary']+jobs['max_salary'])/2
jobs['Avg_salary'].head()

0     450000.0
1     750000.0
2    1250000.0
3          NaN
4          NaN
Name: Avg_salary, dtype: float64

In [59]:
jobs.head()

Unnamed: 0,title,location,experience,skills,company,salary,description,posted_date,min_salary,max_salary,Avg_salary
0,Data Science,Mumbai,2-4 yrs,"Algorithms, Machine Learning, Python, Java, Da...",Netcore Solutions Pvt Ltd,"2,00,000 - 7,00,000 P.A.",At least 2 year of experience in data engineer...,1 day ago,200000.0,700000.0,450000.0
1,Analyst / Sr. Analyst (data Science),Gurgaon,5-8 yrs,"predictive modeling, predictive analytics, mac...",Cvent India Pvt. Ltd.,"5,00,000 - 10,00,000 P.A.",Strong experience on providing predictive mode...,Today,500000.0,1000000.0,750000.0
2,ETL Lead & Data Science,"Chennai, Bengaluru, Mumbai, Pune, Noida",7-10 yrs,"SQL, Data Analysis, Text Mining, SAS, R, Stati...",COMPUTER POWER GROUP PRIVATE LIMITED,"10,00,000 - 15,00,000 P.A.",Industry experience in building and operationa...,1 day ago,1000000.0,1500000.0,1250000.0
3,Specialist - Data Science,"Delhi NCR, Bengaluru, Gurgaon",7-12 yrs,"Specialist - Data Science, Data Science, data ...",Brainsearch Consulting Pvt Ltd.Â,Not disclosed,- Experience with one or more data science pro...,1 day ago,,,
4,Group Manager - Data Science - Python/nlp,Bengaluru,6-11 yrs,"machine learning, text mining, r, nlp, data sc...",Staffio HR,Not disclosed,- This is a Team management role - Skill set ...,1 day ago,,,


# Extract hashtags from tweets

In [60]:
tweets = pd.read_csv('tweets_donald_trump.csv')
tweets.head()

Unnamed: 0,created_at,language,likes,retweets,text
0,6/17/2020 3:27,en,123212,18568,96% Approval Rating in the Republican Party. T...
1,6/17/2020 2:45,und,0,7942,RT @TONYxTWO: @thejtlewis @JoeBiden https://t....
2,6/17/2020 2:38,en,0,23815,RT @thejtlewis: â€œTrump isnâ€™t going to acce...
3,6/17/2020 2:37,en,0,6781,"RT @thejtlewis: With the utmost respect, I tha..."
4,6/17/2020 2:31,en,56840,14231,A GREAT woman. Her son is looking down from he...


In [61]:
text5 = 'Around 2,500 patients are taking part in clinical trails #Coronavirus'
print(text5)
print(re.findall(r'#\w+', text5))

Around 2,500 patients are taking part in clinical trails #Coronavirus
['#Coronavirus']


In [62]:
all_hastags = []
for row in tweets['text']:
    row_hastags = re.findall(r'#\w+', row)  # every hashtag in this tweet
    all_hastags.extend(row_hastags)

`extend` vs `append`
- `append`: adds a single element to the end of the list
- `extend`: adds each element of an iterable to the end of the list

    - Adding multiple elements: `extend`
    - Adding a single element: `append`

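The difference matters here because `re.findall` returns a list. A short sketch with made-up list contents shows what each method does with that list:

```python
found = ['#2A', '#GoArmy']  # e.g. the result of one re.findall call

with_append = ['#MAGA']
with_append.append(found)   # the whole list becomes one nested element

with_extend = ['#MAGA']
with_extend.extend(found)   # each hashtag is added individually

print(with_append)  # ['#MAGA', ['#2A', '#GoArmy']]
print(with_extend)  # ['#MAGA', '#2A', '#GoArmy']
```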
In [63]:
all_hastags

['#ArmyBdaâ',
 '#HappyBirthdayTrump',
 '#GoArmy',
 '#USMA2020',
 '#WithVisionWeLead',
 '#JUSTIâ',
 '#BarelyThereBiden',
 '#MAGA',
 '#NHSEN',
 '#2A',
 '#NH01',
 '#2A',
 '#MAGA',
 '#WGDP',
 '#MAGA',
 '#AMERICAFirst',
 '#jobsreport',
 '#JobsReport',
 '#PPPworks',
 '#Democrats',
 '#FoxNews',
 '#MAGA',
 '#1']

In [64]:
len(all_hastags)

23

In [65]:
pd.Series(all_hastags).value_counts()

#MAGA                  4
#2A                    2
#NH01                  1
#FoxNews               1
#Democrats             1
#PPPworks              1
#JobsReport            1
#jobsreport            1
#AMERICAFirst          1
#WGDP                  1
#ArmyBdaâ              1
#HappyBirthdayTrump    1
#NHSEN                 1
#BarelyThereBiden      1
#JUSTIâ                1
#WithVisionWeLead      1
#USMA2020              1
#GoArmy                1
#1                     1
Name: count, dtype: int64
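
In the counts above, `#jobsreport` and `#JobsReport` are tallied separately because string comparison is case-sensitive. If they should count as the same hashtag, one option is to lowercase each tag before counting. The sketch below uses `collections.Counter` on a small hand-picked subset of the hashtags rather than the full `tweets` data.

```python
from collections import Counter

# A small subset of the extracted hashtags, chosen for illustration
sample_hashtags = ['#MAGA', '#jobsreport', '#JobsReport', '#2A', '#MAGA', '#2A']

# Normalise case so that differently capitalised tags are merged
counts = Counter(tag.lower() for tag in sample_hashtags)

for tag, count in counts.items():
    print(tag, count)
```

With pandas, the same normalisation could be applied by lowercasing the strings before calling `value_counts()`.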