# Built-in text handling functions in Python

Having understood how to work with numerical data, this module focuses on cleaning textual data. For starters, we shall learn how to clean text using the string methods available in Python. We have seen that the data in a DataFrame is stored in a 2-dimensional structure backed by NumPy arrays, and that values in rows and columns can be accessed with the loc/iloc indexers. In this topic we focus on text manipulation on array/list-like objects that hold string data. These concepts can be applied to any DataFrame column as well.

In [1]:
# starting with the random paragraph generator
!pip install essential-generators



Collecting essential-generators
  Downloading essential_generators-1.0-py3-none-any.whl (9.5 MB)

In [2]:
from essential_generators import DocumentGenerator
gen= DocumentGenerator()
para = gen.paragraph()

In [3]:
para

'Not even member citizens account for much of the transmission of. In tertiary late 1970s, the railroad industry. six of the very. Including collier defense, though he avoided formal proceedings, and a centre of civilization with trading. For hosting two species of game fish including seven species of life that. Sea. opened berlin in 1878. \n the egyptian squash. Characteristic features comprehensive reform package in 1996 by. Bridge formed indigenous populations. in later decades.. Falling from field need. Particularly strong culture (most notably the great lakes.'

In [4]:
len(para)

569

In [5]:
words = para.split(' ')

In [6]:
# Total words
len(words)

86

In [7]:
words

['Not',
 'even',
 'member',
 'citizens',
 'account',
 'for',
 'much',
 'of',
 'the',
 'transmission',
 'of.',
 'In',
 'tertiary',
 'late',
 '1970s,',
 'the',
 'railroad',
 'industry.',
 'six',
 'of',
 'the',
 'very.',
 'Including',
 'collier',
 'defense,',
 'though',
 'he',
 'avoided',
 'formal',
 'proceedings,',
 'and',
 'a',
 'centre',
 'of',
 'civilization',
 'with',
 'trading.',
 'For',
 'hosting',
 'two',
 'species',
 'of',
 'game',
 'fish',
 'including',
 'seven',
 'species',
 'of',
 'life',
 'that.',
 'Sea.',
 'opened',
 'berlin',
 'in',
 '1878.',
 '\n',
 'the',
 'egyptian',
 'squash.',
 'Characteristic',
 'features',
 'comprehensive',
 'reform',
 'package',
 'in',
 '1996',
 'by.',
 'Bridge',
 'formed',
 'indigenous',
 'populations.',
 'in',
 'later',
 'decades..',
 'Falling',
 'from',
 'field',
 'need.',
 'Particularly',
 'strong',
 'culture',
 '(most',
 'notably',
 'the',
 'great',
 'lakes.']

In [8]:
# Keep only the words longer than 3 characters using a list comprehension

[x for x in words if len(x)>3]

['even',
 'member',
 'citizens',
 'account',
 'much',
 'transmission',
 'tertiary',
 'late',
 '1970s,',
 'railroad',
 'industry.',
 'very.',
 'Including',
 'collier',
 'defense,',
 'though',
 'avoided',
 'formal',
 'proceedings,',
 'centre',
 'civilization',
 'with',
 'trading.',
 'hosting',
 'species',
 'game',
 'fish',
 'including',
 'seven',
 'species',
 'life',
 'that.',
 'Sea.',
 'opened',
 'berlin',
 '1878.',
 'egyptian',
 'squash.',
 'Characteristic',
 'features',
 'comprehensive',
 'reform',
 'package',
 '1996',
 'Bridge',
 'formed',
 'indigenous',
 'populations.',
 'later',
 'decades..',
 'Falling',
 'from',
 'field',
 'need.',
 'Particularly',
 'strong',
 'culture',
 '(most',
 'notably',
 'great',
 'lakes.']

In [9]:
str.istitle??

In [10]:
# str.istitle() returns True for title-cased strings, i.e. words that start with a capital letter followed by lowercase letters. Let's find such words in our paragraph

[ x for x in words if x.istitle()]

['Not',
 'In',
 'Including',
 'For',
 'Sea.',
 'Characteristic',
 'Bridge',
 'Falling',
 'Particularly']
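A short sketch of how istitle behaves on a few edge cases may help here. The sample words below are assumptions chosen for illustration; only 'Sea.', 'Not' and '1970s,' come from the paragraph above.

```python
# str.istitle() is True only when each word starts with an uppercase letter
# and every other cased letter in the word is lowercase.
samples = ['Sea.', 'Not', 'NOT', 'iPhone', 'New York', 'New york', '1970s,']

title_flags = {s: s.istitle() for s in samples}

for word, flag in title_flags.items():
    print(repr(word), flag)
```

Punctuation does not affect the result, which is why 'Sea.' is kept by the filter above, while all-uppercase words such as 'NOT' are rejected.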

In [11]:
# We can also search/filter for words that start with or end with something

[ x for x in words if x.startswith('s') or x.endswith('d')]

['railroad',
 'six',
 'avoided',
 'and',
 'species',
 'seven',
 'species',
 'opened',
 'squash.',
 'formed',
 'field',
 'strong']

In [12]:
# Let's learn how to use the join method

' '.join(words)

# Every word in the list is connected to the next one with a space
# Any string can be used as the connector
'-'.join(words)

'Not-even-member-citizens-account-for-much-of-the-transmission-of.-In-tertiary-late-1970s,-the-railroad-industry.-six-of-the-very.-Including-collier-defense,-though-he-avoided-formal-proceedings,-and-a-centre-of-civilization-with-trading.-For-hosting-two-species-of-game-fish-including-seven-species-of-life-that.-Sea.-opened-berlin-in-1878.-\n-the-egyptian-squash.-Characteristic-features-comprehensive-reform-package-in-1996-by.-Bridge-formed-indigenous-populations.-in-later-decades..-Falling-from-field-need.-Particularly-strong-culture-(most-notably-the-great-lakes.'
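Since split and join are often used together to tidy whitespace, here is a minimal sketch of that round trip. The sample text is an assumed fragment modelled on the paragraph above, with an extra space added on purpose.

```python
text = 'opened  berlin in 1878. \n the egyptian squash.'

# split(' ') cuts on every single space, so empty strings and the
# newline survive as separate items.
by_space = text.split(' ')

# split() with no argument cuts on any run of whitespace
# (spaces, tabs, newlines) and never returns empty strings.
by_whitespace = text.split()

# Joining the clean pieces with one space normalises the whitespace.
normalized = ' '.join(by_whitespace)

print(by_space)
print(by_whitespace)
print(normalized)
```

This explains why '\n' shows up as a "word" in the list earlier: the paragraph was split on ' ' only.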

In [13]:

sentences = DocumentGenerator().gen_sentence()

In [14]:
sentences = ' ' + sentences + ' '

In [15]:
sentences

' Arrondissements were is 1.5%. There is a term of Dilma '

In [16]:
sentences.strip(' ')

'Arrondissements were is 1.5%. There is a term of Dilma'

In [17]:
# To find the index of a character or word in a sentence we use the find method (for a word it returns the index of its first character, and -1 if nothing is found)
sentences.strip(' ').find('The')

30

In [18]:
sentences.strip(' ')[9:]

'ements were is 1.5%. There is a term of Dilma'

In [19]:
# Use the replace method to replace characters/words in a string; if the text to replace is not present, the string is returned unchanged

sentences.strip(' ').replace('fifth','fourth')

'Arrondissements were is 1.5%. There is a term of Dilma'

In [20]:
# I have deliberately not saved the output of strip back to the variable 'sentences', so as to show how multiple methods can be chained in a single operation
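The chaining idea can be sketched on a fixed string so that the result is reproducible. The sentence and the replacement word below are assumptions made for illustration.

```python
sentence = '  There is a term of Dilma  '

# Each method returns a new string, so calls can be chained left to right.
cleaned = sentence.strip().replace('term', 'period').lower()

# find returns the index of the first character of the match, or -1.
position = cleaned.find('period')
missing = cleaned.find('xyz')

print(repr(cleaned), position, missing)
print(repr(sentence))  # the original string is never modified
```

Because strings are immutable, the result has to be assigned to a name if it is needed later.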


In [21]:
import pandas as pd
import numpy as np
x =pd.DataFrame([ f'{np.random.randint(1,10,3)} - {np.random.randint(1,10,3)} -{np.random.randint(1,10,3)}' for i in range(100)],columns=['Phone_no'])

In [22]:
x

Unnamed: 0,Phone_no
0,[2 2 4] - [1 5 7] -[6 3 2]
1,[2 3 5] - [7 9 1] -[4 4 4]
2,[8 5 1] - [1 1 1] -[8 4 7]
3,[6 9 6] - [6 2 5] -[4 1 4]
4,[1 6 7] - [3 2 9] -[5 4 3]
...,...
95,[6 2 4] - [2 2 2] -[4 3 8]
96,[8 4 9] - [9 6 2] -[9 4 7]
97,[5 7 8] - [2 6 1] -[1 7 3]
98,[4 7 4] - [9 3 2] -[6 8 4]


In [23]:
# From what we learned above, let's convert the phone number column into a format with no '-', spaces or square brackets

x['Phone_no']=x.Phone_no.map(lambda phone : '+44 ' + phone.replace('[','').replace(']','').replace('-','').replace(' ',''))
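The same cleaning step can be tried on a single value without pandas. This is only a sketch: the sample value is copied from the table above, and str.translate is shown as an assumed alternative to the chained replace calls, not as the notebook's method.

```python
raw = '[2 2 4] - [1 5 7] -[6 3 2]'

# Chained replace calls, one per unwanted character
chained = '+44 ' + raw.replace('[', '').replace(']', '').replace('-', '').replace(' ', '')

# str.maketrans('', '', chars) builds a table that deletes every character in chars
drop_table = str.maketrans('', '', '[]- ')
translated = '+44 ' + raw.translate(drop_table)

print(chained)
print(translated)
```

Both approaches give the same string; translate tends to read better once more than two or three characters have to be removed.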

In [24]:
# A phone number such as 0044 521-234-212 cannot be typed as bare code; it has to be written as a string
'0044 521-234-212'

In [None]:
x['Phone_no'].map(lambda phone :'0044')

In [None]:
x

Now, let's dive deeper into cleaning by using REGEX. Regex, short for regular expression, is used in programming languages for matching patterns in strings, find and replace, input validation, and reformatting text. Learning how to use Regex properly can make working with text much easier.

In [None]:
# First import re
import re

In [None]:
sentences = gen.sentence()

In [None]:
sentences

In [None]:
re.compile??

As per the documentation:

re.compile: "Compile a regular expression pattern, returning a Pattern object."

REGEX patterns are rules that tell the different REGEX methods where and how the required patterns occur within a string.

We use the compile method of the re module to store such a pattern in a pattern object. Think of it as a die (a mould) that we try to fit as we iterate over the string. Whatever fits is returned by the selected method.

In [None]:
ptrn = re.compile(r'homonuclear')

In [None]:
ptrn.findall(sentences)

If we look at the input of the compile method, it is a string prefixed with r. This is called a raw string: backslashes in it are kept as they are instead of being treated as escape sequences. Let's understand with an example.

In [None]:
x = '\thi there'
y = r'\thi there'

print('x = ',x)
print('y = ',y)
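To make the difference measurable rather than visual, the sketch below compares lengths. The variable names are assumptions; the behaviour of '\t' and '\b' in ordinary strings is standard Python.

```python
plain = '\thi there'   # '\t' is read as one tab character
raw = r'\thi there'    # the backslash and 't' stay as two characters

print(len(plain), len(raw))
print(repr(plain), repr(raw))

# This matters for regex: in an ordinary string '\b' is the backspace
# character, while r'\b' is the two characters the re module expects.
backspace = '\b'
boundary = r'\b'
print(len(backspace), len(boundary))
```

This is the reason regex patterns are written as raw strings throughout the rest of the module.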

To cover REGEX, we shall divide the module into the following categories and dive deeper into each.



1.   Different Search methods
2.   Different methods of the match objects
3.   Special meaning metacharacters
4.   Forming sets using [ ] square brackets
5.   Quantifiers
6.   Logical Conditions
7.   Forming groups in pattern match
8.   Modification methods






# Search methods

Obj.match(): This method looks for the pattern only at the beginning of the string. Let's understand with examples.

In [26]:
# Let's suppose the string is
import re
x = 'abcd5678ABC4356abc'

# we can use the match method to search 'abc' in the string as follows

ptrn = re.compile(r'abc')

match_obj = ptrn.match(x)

In [27]:
match_obj

<re.Match object; span=(0, 3), match='abc'>

As we can see above, the printed match object shows 2 pieces of information, span and match. span gives the indices where the substring starts and ends. match gives the substring that was found.

This is a very simple use of the match method to understand the use case; we shall increase the complexity gradually.
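One detail worth seeing early is what match returns when the pattern is not at the start. The sketch below uses an assumed second string, '5678abc', to show that case.

```python
import re

ptrn = re.compile(r'abc')

at_start = ptrn.match('abcd5678ABC4356abc')   # pattern sits at index 0
not_at_start = ptrn.match('5678abc')          # pattern exists, but not at index 0

for label, m in [('at_start', at_start), ('not_at_start', not_at_start)]:
    if m is None:
        # match returns None instead of raising an error
        print(label, '-> no match')
    else:
        print(label, '->', m.span(), m.group())
```

Checking for None before calling span or group avoids an AttributeError on strings that do not match.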

In [29]:
# Let's check whether specific numbers are present at the start of the string

x = '768KapilDS'

ptrn = re.compile(r'768') 

match_obj = ptrn.match(x)

print(match_obj)

<re.Match object; span=(0, 3), match='768'>


In [31]:
y = '1234UzoIM' # create the string

ptrn = re.compile(r'123') # compile the pattern
match_obj = ptrn.match(y) # try to match at the start of the string
print(match_obj)

<re.Match object; span=(0, 3), match='123'>


## Methods associated with the match object

In [None]:
x = '768KapilDS'

ptrn = re.compile(r'768')

match_obj = ptrn.match(x)

print(match_obj.span())

In [None]:
x = '768JosephDS'

ptrn = re.compile(r'768')

match_obj = ptrn.match(x)

print(match_obj.group())

To get the matched string as output we use the group method of the match object.

If you want the start and end index separately rather than as the span tuple, use the start and end methods.

In [None]:
match_obj.start()

In [None]:
match_obj.end()

obj.search(): With this method, we can find the first match at any location in the string.

In [None]:
x = '768JosephDS'

ptrn = re.compile(r'Joseph')

match_obj = ptrn.search(x)

print(match_obj)

In [None]:
x = '768JosephDS'

ptrn = re.compile(r'768')

match_obj = ptrn.search(x)

print(match_obj)

In [None]:
match_obj.span()

In [None]:
match_obj.group()

In [None]:
match_obj.start()

In [None]:
match_obj.end()
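The difference between match and search can be summarised in a few lines. This is a minimal sketch that reuses the string from the cells above.

```python
import re

text = '768JosephDS'
ptrn = re.compile(r'Joseph')

from_start = ptrn.match(text)    # only looks at index 0
anywhere = ptrn.search(text)     # scans the string for the first match

print(from_start)
print(anywhere.span(), anywhere.group())

# start() and end() can be used to slice the original string
sliced = text[anywhere.start():anywhere.end()]
print(sliced)
```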

obj.finditer(): Finds all substrings where the pattern matches the string. The object returned is an iterator of match objects, and we can loop over it to operate on each match. An iterator can be consumed only once. Let's have a look at the examples given below.

In [None]:
x = 'Hi my name is Aladeen, and I am a friend of Jerry from Aladeen\'s kingdom'

ptrn = re.compile('Aladeen')
substrings = ptrn.finditer(x)

In [None]:
# An iterator is exhausted after one pass, so we call finditer again for each loop
for substring in ptrn.finditer(x):
  print(substring)

In [None]:
for substring in ptrn.finditer(x):
  print(substring.span(), substring.start(), substring.end())

In [None]:
for substring in ptrn.finditer(x):
  print(substring.group())

obj.findall(): Finds all substrings where the pattern matches the string. The result is a list of strings, and we can loop over it to operate on each substring. Let's have a look at the examples given below.

In [None]:
x = 'Hi my name is Aladeen, and I am a friend of Jerry from Aladeen\'s kingdom'

ptrn = re.compile('Aladeen')
substrings = ptrn.findall(x)

In [None]:
substrings

Since findall returns a list of plain strings rather than match objects, we don't have access to the span method.
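The practical difference between the two methods can be sketched side by side. The variable names first_pass and second_pass are assumptions introduced for this illustration.

```python
import re

x = "Hi my name is Aladeen, and I am a friend of Jerry from Aladeen's kingdom"
ptrn = re.compile(r'Aladeen')

# finditer gives match objects, so positions are available
matches = ptrn.finditer(x)
first_pass = [m.span() for m in matches]

# the same iterator is now exhausted, so a second loop yields nothing
second_pass = [m.span() for m in matches]

# findall gives plain strings: reusable, but without positions
found = ptrn.findall(x)

print(first_pass)
print(second_pass)
print(found)
```

If both the text and the positions are needed more than once, one option is to store list(ptrn.finditer(x)) instead of the bare iterator.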


In [None]:
ptrn.findall??

# Metacharacters

A metacharacter is a character that has a special meaning during pattern processing. You use metacharacters in regular expressions to define the search criteria and any text manipulations.

Please have a look at the cheat sheet to understand more



```

.       - Any Character Except New Line
\d      - Digit (0-9)
\D      - Not a Digit (0-9)
\w      - Word Character (a-z, A-Z, 0-9, _)
\W      - Not a Word Character
\s      - Whitespace (space, tab, newline)
\S      - Not Whitespace (space, tab, newline)

\b      - Word Boundary
\B      - Not a Word Boundary
^       - Beginning of a String
$       - End of a String

[]      - Matches Characters in brackets
[^ ]    - Matches Characters NOT in brackets
|       - Either Or
( )     - Group

Quantifiers:
*       - 0 or More
+       - 1 or More
?       - 0 or One
{3}     - Exact Number
{3,4}   - Range of Numbers (Minimum, Maximum)


#### Sample Regexs ####

[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+





```



In [32]:
import numpy as np

In [34]:
Subjects[randrange(0,len(Subjects)-1)]

[9]

In [33]:
# Let's generate some text and explore use cases of these metacharacters in regex
from random import randrange
Subjects = ['Kaps','Miracle','Flavio','Joseph','Criston','Tish','Said','John','Danielle','Dimple','Sneha','Tanisha']
Departments = ['IT','Ad','MKT','Sls','Op']

# randrange(n) returns an index from 0 to n-1, so every item in the list can be picked
text_example = f'Hi my name is {Subjects[randrange(len(Subjects))]} and my Employee SSN number is {np.random.randint(100,200)}{Departments[randrange(len(Departments))]}'
with open('Test.txt','a') as f:
  print('We shall append the paragraph below to the end of the file:')
  for i in range(10):
    store = f'Hi my name is {Subjects[randrange(len(Subjects))]} and my Employee SSN number is {np.random.randint(100,200)}{Departments[randrange(len(Departments))]}' + '\n'
    print(store)
    f.write(store)


We shall append the paragraph below to the end of the file:
Hi my name is Kaps and my Employee SSN number is 145MKT

Hi my name is Sneha and my Employee SSN number is 181IT

Hi my name is Danielle and my Employee SSN number is 171Ad

Hi my name is Sneha and my Employee SSN number is 127MKT

Hi my name is Kaps and my Employee SSN number is 188IT

Hi my name is Sneha and my Employee SSN number is 131IT

Hi my name is Sneha and my Employee SSN number is 182IT

Hi my name is John and my Employee SSN number is 191Ad

Hi my name is Sneha and my Employee SSN number is 156IT

Hi my name is Said and my Employee SSN number is 163Ad



We shall read the file back into a list of sentences for later use.

In [None]:
# sentences = []
with open('Test.txt','r') as f:
  sentences = f.readlines()

In [None]:
sentences

In [None]:
text_example


\d matches any single digit from 0 to 9

In [None]:
ptrn = re.compile(r'\d')
match_obj = ptrn.finditer(text_example)

In [None]:
for i in match_obj:
  print(i)

\s matches whitespace characters such as a space, a tab or a newline

In [None]:
ptrn = re.compile(r'\s')
match_obj = ptrn.finditer(text_example)
for i in match_obj:
  print(i)

\S (note the capital S) matches any character that is not whitespace

In [None]:
ptrn = re.compile(r'\S')
match_obj = ptrn.finditer(text_example)
for i in match_obj:
  print(i)

\D matches any character except a digit from 0 to 9

In [None]:
ptrn = re.compile(r'\D')
match_obj = ptrn.finditer(text_example)
for i in match_obj:
  print(i)

\b is an interesting one: it matches a word boundary, that is, a position between a word character (\w) and a non-word character (\W), or the start or end of the string. It does not consume any characters. Placed before a sequence, it returns a match only if the sequence starts at the beginning of a word.

In [None]:
exmple = 'hehihahahello hellohihihaha'

ptrn = re.compile(r'\bhello')
match_obj = ptrn.finditer(exmple)
for i in match_obj:
  print(i)

\B is the opposite of \b: a match is returned only where there is no word boundary, so \Bhello finds hello only when it sits inside a word.

In [None]:
ptrn = re.compile(r'\Bhello')
match_obj = ptrn.finditer(exmple)
for i in match_obj:
  print(i)

Comparing the spans of the two results makes the difference clear.
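The two spans can be collected and compared directly. This sketch reuses the example string from the cells above; the list names are assumptions.

```python
import re

exmple = 'hehihahahello hellohihihaha'

# \bhello: 'hello' must start at a word boundary
at_boundary = [m.span() for m in re.finditer(r'\bhello', exmple)]

# \Bhello: 'hello' must start inside a word
inside_word = [m.span() for m in re.finditer(r'\Bhello', exmple)]

print(at_boundary)   # the 'hello' that follows the space
print(inside_word)   # the 'hello' at the end of the first word
```

The first word contains 'hello' from index 8, preceded by a letter, so only \B finds it. The second 'hello' starts at index 14, right after the space, so only \b finds it.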

In [None]:
x = 'Hi my name is rick'
y = 'Rick here, Hi'

ptrn = re.compile(r'^Hi')

match_obj1 = ptrn.findall(x)
match_obj2 = ptrn.findall(y)




In [None]:
print('match1 = ',match_obj1,'match2 = ',match_obj2)

Even though findall searches for matches throughout the string, the '^' character anchors the pattern to the beginning of the string, so only x returns a match.

In [None]:
x = 'Hi my name is rick'
y = 'Rick here, Hi'

ptrn = re.compile(r'Hi$')

match_obj1 = ptrn.findall(x)
match_obj2 = ptrn.findall(y)




In [None]:
print('match1 = ',match_obj1,'match2 = ',match_obj2)

When the dollar sign is written at the end of the pattern, the pattern is matched only at the end of the string, so only y returns a match.

# Sets

A set is a group of characters enclosed in a pair of square brackets, [], and it matches any one of those characters. Several conditions can be written in a row, as in [aA-Z].
A caret (^) at the start of a set negates it.
A dash between two values in a set specifies a range; otherwise it is a literal dash.

Examples:

1. [ASN] returns a match for any one of the characters A, S or N
2. [a-n] forms a range as mentioned above and returns a match for any lowercase character from a to n
3. [^asn] returns a match for any character except a, s and n
4. [0123] works like [ASN] and returns a match if the character is 0, 1, 2 or 3
5. [a-zA-Z] returns a match for any letter, lowercase a to z or uppercase A to Z
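The set examples listed above can be checked on one short string. The sample string is an assumption modelled on the generated sentences.

```python
import re

sample = 'Sneha 181IT'

upper_asn = re.findall(r'[ASN]', sample)       # any of A, S, N
lower_a_to_n = re.findall(r'[a-n]', sample)    # lowercase a through n
not_asn = re.findall(r'[^asn]', sample)        # anything except a, s, n
digits_0_3 = re.findall(r'[0123]', sample)     # any of 0, 1, 2, 3
letters = re.findall(r'[a-zA-Z]', sample)      # any letter

for name, result in [('[ASN]', upper_asn), ('[a-n]', lower_a_to_n),
                     ('[^asn]', not_asn), ('[0123]', digits_0_3),
                     ('[a-zA-Z]', letters)]:
    print(name, result)
```

Note that sets are case sensitive: [ASN] finds the capital S but not the lowercase n.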


In [None]:
text_example

In [None]:
ptrn = re.compile(r'[a-z]')

for i in ptrn.finditer(text_example):
  print(i)

In [None]:
dates = '''
02.03.2020


2021-05-07
2020-05-30

2020.06.22

2020-06-23
2020-06-14
2020-03-15
2020 03 04
2020/09/30

2022_04_24
2021_02_24
'''



See any pattern in the dates above? Most of them follow some sort of format:
```
[YYYY mm dd, YYYY.mm.dd, YYYY-mm-dd, YYYY/mm/dd, YYYY_mm_dd] 

```

The dot ('.') metacharacter matches any character except a newline.
Let's see how we can use it in this example.

In [None]:
ptrn = re.compile(r'\d\d\d\d.\d\d.\d\d')
match_obj = ptrn.finditer(dates)

for i in match_obj:
  print(i)

Because the dot matches any character, every separator is accepted here, including the space in 2020 03 04. Only 02.03.2020 is skipped, since it does not start with four digits. To restrict the separator to specific characters we can use a set. Inside a set, . loses its special meaning and is treated as a literal dot, so [.\s] matches only a dot or a whitespace character.

In [None]:
ptrn = re.compile(r'\d\d\d\d[.\s]\d\d[.\s]\d\d')
match_obj = ptrn.finditer(dates)

for i in match_obj:
  print(i)

The same restriction can be written with grouping and the | operator, which we shall dive into in a bit. Outside a set, the dot has to be escaped as \. to match a literal dot.

In [None]:
ptrn = re.compile(r'\d\d\d\d(\.|\s)\d\d(\.|\s)\d\d')
match_obj = ptrn.finditer(dates)

for i in match_obj:
  print(i)

Only dates with the year, month and day separated by - or .

In [None]:
ptrn = re.compile(r'\d\d\d\d[-.]\d\d[-.]\d\d') 
match_obj = ptrn.finditer(dates)
for i in match_obj:
    print(i)

Dates in months 03 to 05

In [None]:
ptrn = re.compile(r'\d\d\d\d.0[3-5].\d\d') 
match_obj = ptrn.finditer(dates)
for i in match_obj:
    print(i)

Dates in months 03 to 05, separated only by - or .

In [None]:
ptrn = re.compile(r'\d\d\d\d[-.]0[3-5][.-]\d\d') 
match_obj = ptrn.finditer(dates)
for i in match_obj:
    print(i)

If you observe the above code closely, you can see that the order of the characters inside a set doesn't matter: [-.] and [.-] both match either . or -. The dash is treated as a literal here because it is the first or last character in the set.

# Quantifiers

Writing a regular expression with repeated metacharacters, or repeated characters in general, looks a bit rough. That is why we have quantifiers. Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found. The cheat sheet defines each quantifier in terms of its operation. Let's list them here.

```
*       - 0 or More

+       - 1 or More

?       - 0 or One

{3}     - Exact Number

{3,4}   - Range of Numbers (Minimum, Maximum)

```


In [25]:
text_example

'Hi my name is Criston and my Employee SSN number is 127Sls'

In [26]:
# Let's try to extract the numbers from this string

ptrn = re.compile(r'\d*')

for i in ptrn.finditer(text_example):
  print(i)

<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(1, 1), match=''>
<re.Match object; span=(2, 2), match=''>
<re.Match object; span=(3, 3), match=''>
<re.Match object; span=(4, 4), match=''>
<re.Match object; span=(5, 5), match=''>
<re.Match object; span=(6, 6), match=''>
<re.Match object; span=(7, 7), match=''>
<re.Match object; span=(8, 8), match=''>
<re.Match object; span=(9, 9), match=''>
<re.Match object; span=(10, 10), match=''>
<re.Match object; span=(11, 11), match=''>
<re.Match object; span=(12, 12), match=''>
<re.Match object; span=(13, 13), match=''>
<re.Match object; span=(14, 14), match=''>
<re.Match object; span=(15, 15), match=''>
<re.Match object; span=(16, 16), match=''>
<re.Match object; span=(17, 17), match=''>
<re.Match object; span=(18, 18), match=''>
<re.Match object; span=(19, 19), match=''>
<re.Match object; span=(20, 20), match=''>
<re.Match object; span=(21, 21), match=''>
<re.Match object; span=(22, 22), match=''>
<re.Match object; span=(23, 23)

As the output above shows, the * quantifier looks for 0 or more digits in the string. Alphabetical characters are not returned, but at every position where there is no digit an empty string is matched. To fix this, we shall use + instead of *.

In [27]:
ptrn = re.compile(r'\d+')

for i in ptrn.finditer(text_example):
  print(i)

<re.Match object; span=(52, 55), match='127'>


The + quantifier means one or more, so only runs of one or more digits are returned. To get the matched text as output, use the group method.
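The contrast between the two quantifiers can be reproduced on a shorter string. The sample text is an assumption; the filtering step is one possible way to discard empty matches, not the only one.

```python
import re

text = 'ID 127Sls'

star = re.findall(r'\d*', text)   # zero or more digits: empty matches included
plus = re.findall(r'\d+', text)   # one or more digits: no empty matches

# Dropping the empty strings from the * result leaves the same content
star_non_empty = [m for m in star if m]

print(star)
print(plus)
print(star_non_empty)
```

Using + directly is simpler than filtering afterwards, which is why it is the usual choice for extracting numbers.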

In [28]:
ptrn = re.compile(r'\d+')

for i in ptrn.finditer(text_example):
  print(i.group())

127


In [32]:
string = '1 22 333 4444 55555 666666'

# If we want to fetch only those numbers which have between 1 and 4 digits, we can use a range quantifier

ptrn = re.compile(r'\d{1,4}')

for i in ptrn.finditer(string):
  print(i.group())

1
22
333
4444
5555
5
6666
66


Here only 4 digits of 55555 are selected and the remaining digit is returned separately. We can try to restrict the match to numbers of at most 4 digits by adding \s+, so that the digits must be followed by a space or tab.

In [42]:
string = '1 22 333 4444 55555 666666'

# If we want to fetch only those numbers which have between 1 and 4 digits, we can use a range quantifier

ptrn = re.compile(r'\d{1,4}\s+')

for i in ptrn.finditer(string):
  print(i)

<re.Match object; span=(0, 2), match='1 '>
<re.Match object; span=(2, 5), match='22 '>
<re.Match object; span=(5, 9), match='333 '>
<re.Match object; span=(9, 14), match='4444 '>
<re.Match object; span=(15, 20), match='5555 '>


Even now our problem isn't solved: it returns the 4-digit 5555 from the 5-digit 55555 by skipping the first 5. To fix it, the match must start at a word boundary, so we begin the pattern with \b.

In [43]:
string = '1 22 333 4444 55555 666666'

# If we want to fetch only those numbers which have between 1 and 4 digits, we can use a range quantifier

ptrn = re.compile(r'\b\d{1,4}\s+')

for i in ptrn.finditer(string):
  print(i)

<re.Match object; span=(0, 2), match='1 '>
<re.Match object; span=(2, 5), match='22 '>
<re.Match object; span=(5, 9), match='333 '>
<re.Match object; span=(9, 14), match='4444 '>


Fixed it!
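A possible variation, offered as a sketch rather than as the notebook's approach, is to put \b on both sides of the digits. This avoids capturing the trailing space and also catches a short number at the very end of the string. The second string below is an assumption added to show that case.

```python
import re

string = '1 22 333 4444 55555 666666'
tail = '7 55555 12'   # hypothetical input ending with a short number

# Pattern from above: needs whitespace after the digits
with_space = re.findall(r'\b\d{1,4}\s+', string)
with_space_tail = re.findall(r'\b\d{1,4}\s+', tail)

# Boundary on both sides: no whitespace required or captured
both_sides = re.findall(r'\b\d{1,4}\b', string)
both_sides_tail = re.findall(r'\b\d{1,4}\b', tail)

print(with_space, with_space_tail)
print(both_sides, both_sides_tail)
```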

Before going into shortening the REGEX used for extracting dates from the string, we shall understand the ? quantifier.

It is placed after a character, metacharacter or group to make it optional (0 or 1 occurrence). Refer to the cells below for more explanation.

In [44]:
string = 'Steam login kapslock_12_3'

pattern = re.compile(r'\d')
match_obj = pattern.finditer(string)
for i in match_obj:
  print(i)

<re.Match object; span=(21, 22), match='1'>
<re.Match object; span=(22, 23), match='2'>
<re.Match object; span=(24, 25), match='3'>


If we want to fetch the part after kapslock, we can use the ? quantifier to keep _ if it exists in the string and skip it if it doesn't.

In [45]:

pattern = re.compile(r'(_?\d)+')
match_obj = pattern.finditer(string)
for i in match_obj:
  print(i)

<re.Match object; span=(20, 25), match='_12_3'>


Here _? includes the underscore if it is present and otherwise goes straight to the digit. The round brackets group _?\d into one unit, and + repeats that unit one or more times so that everything is returned as a single substring. Note that ? must stay outside square brackets: inside a set such as [_?\d] it is a literal question mark, not a quantifier.
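A classic illustration of the ? quantifier is an optional letter in a word. The strings below are assumptions chosen for this sketch; the second part shows that ? inside square brackets is only a literal character.

```python
import re

spellings = 'color colour colouur'

# u? means "zero or one u", so both spellings match, but not 'colouur'
optional_u = re.findall(r'colou?r', spellings)

# Inside a set, ? is an ordinary character rather than a quantifier
literal_marks = re.findall(r'[_?]', 'a_b?c')

print(optional_u)
print(literal_marks)
```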

Let's fix the dates now!

In [30]:
print(dates)


02.03.2020


2021-05-07
2020-05-30

2020.06.22

2020-06-23
2020-06-14
2020-03-15
2020 03 04
2020/09/30

2022_04_24
2021_02_24



In [46]:
# Let's work with the dates again and make the regex cleaner

ptrn = re.compile(r'\d{4}.\d{2}.\d{2}')
match_obj = ptrn.finditer(dates)
for i in match_obj:
  print(i)

<re.Match object; span=(14, 24), match='2021-05-07'>
<re.Match object; span=(25, 35), match='2020-05-30'>
<re.Match object; span=(37, 47), match='2020.06.22'>
<re.Match object; span=(49, 59), match='2020-06-23'>
<re.Match object; span=(60, 70), match='2020-06-14'>
<re.Match object; span=(71, 81), match='2020-03-15'>
<re.Match object; span=(82, 92), match='2020 03 04'>
<re.Match object; span=(93, 103), match='2020/09/30'>
<re.Match object; span=(105, 115), match='2022_04_24'>
<re.Match object; span=(116, 126), match='2021_02_24'>


We can do it another way with the + quantifier as follows

In [47]:
ptrn = re.compile(r'\d+.\d+.\d+')
match_obj = ptrn.finditer(dates)
for i in match_obj:
  print(i)

<re.Match object; span=(1, 11), match='02.03.2020'>
<re.Match object; span=(14, 24), match='2021-05-07'>
<re.Match object; span=(25, 35), match='2020-05-30'>
<re.Match object; span=(37, 47), match='2020.06.22'>
<re.Match object; span=(49, 59), match='2020-06-23'>
<re.Match object; span=(60, 70), match='2020-06-14'>
<re.Match object; span=(71, 81), match='2020-03-15'>
<re.Match object; span=(82, 92), match='2020 03 04'>
<re.Match object; span=(93, 103), match='2020/09/30'>
<re.Match object; span=(105, 115), match='2022_04_24'>
<re.Match object; span=(116, 126), match='2021_02_24'>


With this we have also included 02.03.2020, the date that starts with the day and month rather than the year.

# Logical conditions

In [48]:
string = '''
Mr said ali
Mrs tina mukherjee
Mr. bennet 

Mr. P



'''

In [49]:
ptrn = re.compile(r'Mr\.?\s\w+')
match_obj = ptrn.finditer(string)
for i in match_obj:
  print(i)

<re.Match object; span=(1, 8), match='Mr said'>
<re.Match object; span=(32, 42), match='Mr. bennet'>
<re.Match object; span=(45, 50), match='Mr. P'>


In the above string, we want a pattern that looks for either Mr or Mrs, followed with or without a ".". For this we shall use logical conditions.

To implement it we use the | sign.

In [51]:
ptrn = re.compile(r'(Mr|Mrs)\.?\s\w+')
match_obj = ptrn.finditer(string)
for i in match_obj:
  print(i)

<re.Match object; span=(1, 8), match='Mr said'>
<re.Match object; span=(13, 21), match='Mrs tina'>
<re.Match object; span=(32, 42), match='Mr. bennet'>
<re.Match object; span=(45, 50), match='Mr. P'>


If we look closely, there is a \ before the '.'. This means that we are referring not to the metacharacter . but to a literal dot.
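The effect of escaping the dot can be seen by running both versions on the same text. The sample titles, including the made-up 'Mrx', are assumptions for illustration.

```python
import re

titles = 'Mr. bennet, Mrs tina, Mrx jones'

# Unescaped dot: any character is accepted after 'Mr'
unescaped = re.findall(r'Mr.\s\w+', titles)

# Escaped dot: only a literal '.' is accepted after 'Mr'
escaped = re.findall(r'Mr\.\s\w+', titles)

print(unescaped)
print(escaped)
```

The unescaped pattern also accepts 'Mrs' and 'Mrx', because the dot stands for the third letter in each case.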

Exercise:

Explain the output below

In [52]:
ptrn = re.compile(r'(Mr|Mrs|)\.?\s\w+')
match_obj = ptrn.finditer(string)
for i in match_obj:
  print(i)

<re.Match object; span=(0, 3), match='\nMr'>
<re.Match object; span=(3, 8), match=' said'>
<re.Match object; span=(8, 12), match=' ali'>
<re.Match object; span=(12, 16), match='\nMrs'>
<re.Match object; span=(16, 21), match=' tina'>
<re.Match object; span=(21, 31), match=' mukherjee'>
<re.Match object; span=(31, 34), match='\nMr'>
<re.Match object; span=(34, 42), match='. bennet'>
<re.Match object; span=(44, 47), match='\nMr'>
<re.Match object; span=(47, 50), match='. P'>


In [56]:
courses = '''
MSc. Big Data Science
Masters in cs
BA. Arts
'''

In [65]:
ptrn = re.compile(r'(MSc|Masters|BA)\.?\s?\s?\w+')
match_obj = ptrn.finditer(courses)
for i in match_obj:
  print(i)

<re.Match object; span=(1, 9), match='MSc. Big'>
<re.Match object; span=(23, 33), match='Masters in'>
<re.Match object; span=(37, 45), match='BA. Arts'>


In the courses example, the pattern above stops at the connector 'in' in 'Masters in cs', so the course name itself is not captured.

If we put in inside round brackets followed by ?, that is (in)?, the connector becomes optional and we get the match we are looking for. Putting part of a pattern in round brackets is called grouping; more on this in the next section.

In [62]:
ptrn = re.compile(r'(MSc|Masters|BA)\.?\s(in)?\s?\w+')
match_obj = ptrn.finditer(courses)
for i in match_obj:
  print(i)

<re.Match object; span=(1, 9), match='MSc. Big'>
<re.Match object; span=(23, 36), match='Masters in cs'>
<re.Match object; span=(37, 45), match='BA. Arts'>


# Grouping

A group is a part of a regex pattern enclosed in the parentheses () metacharacters. We create a group by placing that part of the pattern inside a pair of parentheses.

Let's generate some random emails and understand grouping with them.

In [72]:
emails = []
for i in range(10):
  emails.append(gen.email())

In [73]:
string = '\n'.join(emails)

In [75]:
print(string)

Indest@plogri.com
of@Atly.us
hyps@Nor.com
Asicti.a@scallu.com
ces@the.edu
as@Fory.gov
Dembe@ition645.co.uk
slater@tortio.com
cuivik@Mohe.ru
postel@coment.us


We can split an email into 3 parts: the first part is the user_name, the second part is the mail_server, and the third part is the domain name, like .com, .us etc.

For now, let's concentrate on the user_name.

In [77]:
ptrn = re.compile(r'[a-zA-Z0-9]+@')  # 0-9 so that the digit 0 is not left out

match_obj = ptrn.finditer(string)

for i in match_obj:
  print(i)

<re.Match object; span=(0, 7), match='Indest@'>
<re.Match object; span=(18, 21), match='of@'>
<re.Match object; span=(29, 34), match='hyps@'>
<re.Match object; span=(49, 51), match='a@'>
<re.Match object; span=(62, 66), match='ces@'>
<re.Match object; span=(74, 77), match='as@'>
<re.Match object; span=(86, 92), match='Dembe@'>
<re.Match object; span=(107, 114), match='slater@'>
<re.Match object; span=(125, 132), match='cuivik@'>
<re.Match object; span=(140, 147), match='postel@'>


If we are only interested in the user_name, we have to exclude the @ from the result. We can achieve this by grouping.

Every part of the pattern inside a pair of parentheses forms a group. If there are multiple pairs of parentheses, we can use the group method to return the text captured by each one. Group numbering starts at 1, and group(0) returns the whole match. If there are 2 groups and we want the text from group 1, we call the group method with 1 as input.

Note that '.' is included in the set below so that user names containing a dot are captured in full; inside a set it is a literal dot.

In [81]:
ptrn = re.compile(r'([a-zA-Z0-9.]+)@')

match_obj = ptrn.finditer(string)

for i in match_obj:
  print(i.group(1))

Indest
of
hyps
Asicti.a
ces
as
Dembe
slater
cuivik
postel


Splitting the server name and the domain name is a 2-step process. In the first step we capture the server name and the domain name together as one group and store the results in a list.

In [83]:
ptrn = re.compile(r'([a-zA-Z0-9.]+)@([a-zA-Z0-9.]+)')

match_obj = ptrn.finditer(string)

for i in match_obj:
  print(i.group(2))

plogri.com
Atly.us
Nor.com
scallu.com
the.edu
Fory.gov
ition645.co.uk
tortio.com
Mohe.ru
coment.us


In [84]:
agg_list =[]

ptrn = re.compile(r'([a-zA-Z0-9.]+)@([a-zA-Z0-9.]+)')

match_obj = ptrn.finditer(string)

for i in match_obj:
  agg_list.append(i.group(2))

In [85]:

agg_list

['plogri.com',
 'Atly.us',
 'Nor.com',
 'scallu.com',
 'the.edu',
 'Fory.gov',
 'ition645.co.uk',
 'tortio.com',
 'Mohe.ru',
 'coment.us']

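If we observe carefully, we can see that some of the domain names contain 2 '.' characters (for example 'ition645.co.uk'). We can capture the server name up to the first dot; the substring after that dot is the domain name.

For this to work, we need to remove the dot character from the character set of group 2. Otherwise group 2, being greedy, also swallows the first of the 2 dots, and group 3 is left with only the part after the last dot.

Compare example 1 (In [93]) and example 2 (In [96]) below to see the difference.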

In [93]:
# Example 1: with dot in group 2

ptrn = re.compile(r'([a-zA-Z0-9.]+)@([a-zA-Z0-9.]+)\.([a-zA-Z0-9.]+)')

match_obj = ptrn.finditer(string)

for i in match_obj:
  print(i.group(3))  # only the part after the last dot

com
us
com
com
edu
gov
uk
com
ru
us


In [96]:
# Example 2: without dot in group 2

ptrn = re.compile(r'([a-zA-Z0-9.]+)@([a-zA-Z0-9]+)\.([a-zA-Z0-9.]+)')

match_obj = ptrn.finditer(string)

for i in match_obj:
  print(i.group(3))  # everything after the first dot

com
us
com
com
edu
gov
co.uk
com
ru
us
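
The difference between the two examples comes from greedy matching. The sketch below runs both patterns on the one address that has two dots after the @; the variable names `with_dot` and `without_dot` are introduced here only for illustration.

```python
import re

sample = 'Dembe@ition645.co.uk'   # the address above with two dots after the @

with_dot = re.compile(r'([a-zA-Z0-9.]+)@([a-zA-Z0-9.]+)\.([a-zA-Z0-9.]+)')
without_dot = re.compile(r'([a-zA-Z0-9.]+)@([a-zA-Z0-9]+)\.([a-zA-Z0-9.]+)')

# Group 2 may contain dots, so it greedily extends up to the last dot
greedy = with_dot.search(sample).groups()

# Group 2 cannot contain dots, so it has to stop at the first dot
strict = without_dot.search(sample).groups()

print(greedy)
print(strict)
```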


In [98]:
# without dot in group 2, collecting all three parts
import pandas as pd

user_name = []
server_name = []
domain_name = []

ptrn = re.compile(r'([a-zA-Z0-9.]+)@([a-zA-Z0-9]+)\.([a-zA-Z0-9.]+)')

match_obj = ptrn.finditer(string)

for i in match_obj:
  user_name.append(i.group(1))    # text before the @
  server_name.append(i.group(2))  # text between the @ and the first dot
  domain_name.append(i.group(3))  # everything after the first dot

pd.DataFrame({'User_name':user_name,'Server_name':server_name,'domain_name':domain_name})

   User_name  Server_name  domain_name
0     Indest       plogri          com
1         of         Atly           us
2       hyps          Nor          com
3   Asicti.a       scallu          com
4        ces          the          edu
5         as         Fory          gov
6      Dembe     ition645        co.uk
7     slater       tortio          com
8     cuivik         Mohe           ru
9     postel       coment           us
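
As an alternative to three separate lists, the groups can be given names so that each match yields a dictionary. This is a sketch, not the notebook's original approach: the group names and the two-address `sample` are assumptions chosen to mirror the column names above.

```python
import re

# Assumed sample: two addresses taken from the output above
sample = 'Indest@plogri.com Dembe@ition645.co.uk'

ptrn = re.compile(
    r'(?P<user_name>[a-zA-Z0-9.]+)@'
    r'(?P<server_name>[a-zA-Z0-9]+)\.'
    r'(?P<domain_name>[a-zA-Z0-9.]+)'
)

# groupdict() maps each group name to the text it captured
rows = [m.groupdict() for m in ptrn.finditer(sample)]
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`, which uses the keys as column names.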


# Modifying strings

In [99]:
import re
import pandas as pd
import numpy as np

1. split(): This is similar to the built-in str.split method in Python, but instead of a fixed delimiter like ',' or ' ' or '.', we split on a REGEX pattern.

In [100]:
string = 'one.two_three#four'

print(re.split('[._#]', string))
# ['one', 'two', 'three', 'four']

['one', 'two', 'three', 'four']
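
Two variations on the split above may be useful. This is a short sketch; the second input string is a made-up example.

```python
import re

text = 'one.two_three#four'

# Split on any one of the three delimiter characters
parts = re.split(r'[._#]', text)

# maxsplit limits the number of splits; the rest of the string is left intact
first_two = re.split(r'[._#]', text, maxsplit=1)

# Adding + treats a run of mixed delimiters as a single separator
messy = re.split(r'[._#]+', 'one._two##three')

print(parts, first_two, messy)
```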


In [102]:
url = []

for i in range(10):
  url.append(gen.url())

print(url)

['http://thing.com/seregi.html', 'http://a.jp/In/in.of-try-cons-Meclis', 'http://Johaph.edu/Nighe/nol.html', 'http://the.gov/Sch/overm/come/trin.larthe-of-thosts-clople-shey', 'http://Mexpro.org/widema/Ye/sych/to.asp', 'https://a.com/nuages.jpg', 'https://ing.co.uk/anteni/ps.lion45-Ang-fortio5859', 'https://lartia.com/Depope.asp', 'http://thers.ways.us/thes.php', 'http://ean.tv.by4781-isse-bution-Nation']


In [103]:
string = ' \n'.join(url)

In [105]:
print(string)

http://thing.com/seregi.html 
http://a.jp/In/in.of-try-cons-Meclis 
http://Johaph.edu/Nighe/nol.html 
http://the.gov/Sch/overm/come/trin.larthe-of-thosts-clople-shey 
http://Mexpro.org/widema/Ye/sych/to.asp 
https://a.com/nuages.jpg 
https://ing.co.uk/anteni/ps.lion45-Ang-fortio5859 
https://lartia.com/Depope.asp 
http://thers.ways.us/thes.php 
http://ean.tv.by4781-isse-bution-Nation


In [119]:
# As we can see from the links above, they are missing 'www.'. Let's look at a way to add it using the split method of the compiled pattern object

ptrn = re.compile(r'(http://|https://)')



In [144]:
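# split() with a capturing group keeps the schemes: ['', scheme, rest, scheme, rest, ...]; [1:][1::2] selects each rest
x = ['www.' + rest for rest in ptrn.split(string)[1:][1::2]]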

In [148]:
# [1:][::2] selects the schemes; pair each one with its 'www.'-prefixed remainder and rejoin
string = ''.join([a + b for a, b in zip(ptrn.split(string)[1:][::2], x)])

In [149]:
print(string)

http://www.thing.com/seregi.html 
http://www.a.jp/In/in.of-try-cons-Meclis 
http://www.Johaph.edu/Nighe/nol.html 
http://www.the.gov/Sch/overm/come/trin.larthe-of-thosts-clople-shey 
http://www.Mexpro.org/widema/Ye/sych/to.asp 
https://www.a.com/nuages.jpg 
https://www.ing.co.uk/anteni/ps.lion45-Ang-fortio5859 
https://www.lartia.com/Depope.asp 
http://www.thers.ways.us/thes.php 
http://www.ean.tv.by4781-isse-bution-Nation
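
The slicing in the cells above relies on how split() behaves when the pattern contains a capturing group: the matched delimiters are kept in the result. Here is a step-by-step sketch on an assumed two-URL sample in the same layout; the names `pieces`, `schemes` and `rests` are introduced only for illustration.

```python
import re

# Assumed two-line sample, joined with ' \n' like the string above
sample = 'http://thing.com/seregi.html \nhttps://a.com/nuages.jpg'

ptrn = re.compile(r'(http://|https://)')

pieces = ptrn.split(sample)
# pieces[0] is the empty text before the first scheme;
# after that, schemes and remainders alternate
schemes = pieces[1::2]
rests = pieces[2::2]

rebuilt = ''.join(s + 'www.' + r for s, r in zip(schemes, rests))
print(pieces)
print(rebuilt)
```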


Now we shall learn about one more essential method called sub(): it finds all substrings where the RE matches and replaces them with a different string.
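
Before applying it to URLs, here is a minimal sketch of sub() on a made-up sentence (the sample text is hypothetical and not part of the notebook's data).

```python
import re

text = 'order 66 shipped, order 1138 pending'   # hypothetical sample text

# Replace every run of digits with '#'
masked = re.sub(r'\d+', '#', text)

# count limits how many matches are replaced
masked_once = re.sub(r'\d+', '#', text, count=1)

# subn() also reports how many replacements were made
result, n = re.subn(r'\d+', '#', text)

print(masked)
print(masked_once)
print(result, n)
```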

In [156]:
url = []

for i in range(10):
  url.append(gen.url())

print(url)

string = ' \n'.join(url)

['http://bed.co.uk/us.up-is-a86-to-sheope', 'http://a.net/cal.Seas-procei-or-foreas-of-aillea-land', 'http://wity.edu/of/conts/rempli.A-havin-imaten-Bels-Humate-alarra-thersi', 'http://ome.co.uk/Cousin/Djorio/war/ths.asives-rentio2633-in', 'https://felink.fr/inguar/coper/ducies.the-an-post-vatabl', 'https://Fort.tv/of/the/intrat/fritab.aces,-med-porge', 'http://a.laste.co.uk/vicult/pronom.png', 'http://somour.us/led/Jactur.png', 'http://Rwas.edu/are.oplic-Terain-it437313', 'http://thon.60.org/ch/only/vot/Depenj.palt-lies8051-mesed-sounit']


In [158]:
# Before implementing the actual bit, we need to understand what a backreference is.
# When we group expressions with parentheses, we can reuse the text matched by these groups via backreferences (\1, \2, ...)
print(string)

http://bed.co.uk/us.up-is-a86-to-sheope 
http://a.net/cal.Seas-procei-or-foreas-of-aillea-land 
http://wity.edu/of/conts/rempli.A-havin-imaten-Bels-Humate-alarra-thersi 
http://ome.co.uk/Cousin/Djorio/war/ths.asives-rentio2633-in 
https://felink.fr/inguar/coper/ducies.the-an-post-vatabl 
https://Fort.tv/of/the/intrat/fritab.aces,-med-porge 
http://a.laste.co.uk/vicult/pronom.png 
http://somour.us/led/Jactur.png 
http://Rwas.edu/are.oplic-Terain-it437313 
http://thon.60.org/ch/only/vot/Depenj.palt-lies8051-mesed-sounit


In [161]:
ptrn = re.compile(r'(http://|https://)(.+)')

#                        group 1     group 2

match_obj = ptrn.finditer(string)
for i in match_obj:
  print(i.group(2))

bed.co.uk/us.up-is-a86-to-sheope 
a.net/cal.Seas-procei-or-foreas-of-aillea-land 
wity.edu/of/conts/rempli.A-havin-imaten-Bels-Humate-alarra-thersi 
ome.co.uk/Cousin/Djorio/war/ths.asives-rentio2633-in 
felink.fr/inguar/coper/ducies.the-an-post-vatabl 
Fort.tv/of/the/intrat/fritab.aces,-med-porge 
a.laste.co.uk/vicult/pronom.png 
somour.us/led/Jactur.png 
Rwas.edu/are.oplic-Terain-it437313 
thon.60.org/ch/only/vot/Depenj.palt-lies8051-mesed-sounit


In [163]:
# \1 and \2 refer back to group 1 (the scheme) and group 2 (the rest of the line)
print(ptrn.sub(r'\1www.\2',string))

http://www.bed.co.uk/us.up-is-a86-to-sheope 
http://www.a.net/cal.Seas-procei-or-foreas-of-aillea-land 
http://www.wity.edu/of/conts/rempli.A-havin-imaten-Bels-Humate-alarra-thersi 
http://www.ome.co.uk/Cousin/Djorio/war/ths.asives-rentio2633-in 
https://www.felink.fr/inguar/coper/ducies.the-an-post-vatabl 
https://www.Fort.tv/of/the/intrat/fritab.aces,-med-porge 
http://www.a.laste.co.uk/vicult/pronom.png 
http://www.somour.us/led/Jactur.png 
http://www.Rwas.edu/are.oplic-Terain-it437313 
http://www.thon.60.org/ch/only/vot/Depenj.palt-lies8051-mesed-sounit
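
The same replacement can also be written with named groups, which some readers find easier to follow than numbers. This is a sketch on an assumed two-URL sample shortened from the list above; `https?://` is used here as a shorter equivalent of `(http://|https://)`, and the group names are assumptions.

```python
import re

# Assumed sample, shortened from the URLs above
sample = 'http://bed.co.uk/us.up \nhttps://felink.fr/inguar'

# Numbered groups, as in the cell above
ptrn = re.compile(r'(https?://)(.+)')
with_www = ptrn.sub(r'\1www.\2', sample)

# Named groups are referenced with \g<name> in the replacement
named_ptrn = re.compile(r'(?P<scheme>https?://)(?P<rest>.+)')
named = named_ptrn.sub(r'\g<scheme>www.\g<rest>', sample)

print(with_www)
```

Because '.' does not match a newline by default, `.+` stops at the end of each line, which is why every URL is rewritten separately.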


# Using the sentences generated and stored in the txt file

In [164]:

with open('Test.txt','r') as f:
  sentences = f.readlines()

In [165]:
sentences

['Hi my name is Miracle and my Employee SSN number is 188Ad\n',
 'Hi my name is Flavio and my Employee SSN number is 173Ad\n',
 'Hi my name is Joseph and my Employee SSN number is 125MKT\n',
 'Hi my name is Joseph and my Employee SSN number is 179MKT\n',
 'Hi my name is Flavio and my Employee SSN number is 108Ad\n',
 'Hi my name is Tish and my Employee SSN number is 136Ad\n',
 'Hi my name is Flavio and my Employee SSN number is 144Sls\n',
 'Hi my name is Joseph and my Employee SSN number is 163IT\n',
 'Hi my name is Danielle and my Employee SSN number is 180Ad\n',
 'Hi my name is Tish and my Employee SSN number is 147Ad\n']

In [166]:
string = ''.join(sentences)

In [179]:
ptrn = re.compile(r'Hi my name is (\w+) and my Employee SSN number is (\d+)(\w+)')
match_obj = ptrn.finditer(string)
name = []
emp_id = []  # named emp_id to avoid shadowing the built-in id()
department = []
for i in match_obj:
  name.append(i.group(1))        # employee name
  emp_id.append(i.group(2))      # leading digits of the SSN number
  department.append(i.group(3))  # trailing department code

df = pd.DataFrame({'Name':name,'Identity_NO':emp_id,'Department':department})


df['Department'] = df['Department'].map({'Ad':'Admin', 'MKT':'Marketing','Sls':'Sales', 'IT' : 'Information Technology'})

In [180]:
df

       Name  Identity_NO              Department
0   Miracle          188                   Admin
1    Flavio          173                   Admin
2    Joseph          125               Marketing
3    Joseph          179               Marketing
4    Flavio          108                   Admin
5      Tish          136                   Admin
6    Flavio          144                   Sales
7    Joseph          163  Information Technology
8  Danielle          180                   Admin
9      Tish          147                   Admin
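
The same extraction can be sketched without pandas, which makes it easier to see what happens to lines that do not fit the template or to department codes missing from the mapping. The third sentence and its 'HR' code are hypothetical, added only to show the fallback.

```python
import re

sentences = [
    'Hi my name is Miracle and my Employee SSN number is 188Ad\n',
    'Hi my name is Joseph and my Employee SSN number is 163IT\n',
    'Hi my name is Avery and my Employee SSN number is 201HR\n',  # hypothetical line
]

ptrn = re.compile(r'Hi my name is (\w+) and my Employee SSN number is (\d+)([A-Za-z]+)')
departments = {'Ad': 'Admin', 'MKT': 'Marketing', 'Sls': 'Sales', 'IT': 'Information Technology'}

records = []
for line in sentences:
    m = ptrn.search(line)
    if m is None:
        continue  # skip lines that do not follow the template
    name, emp_id, code = m.groups()
    # fall back to the raw code when it is not in the mapping
    records.append((name, emp_id, departments.get(code, code)))

print(records)
```

With `Series.map` and a dictionary, as used in the cell above, a code that is missing from the dictionary becomes NaN instead, so it is worth checking the Department column for missing values after mapping.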
