## Splitting Strings on Any of Multiple Delimiters

In [1]:
import re

### Split a string into fields, but the delimiters (and spacing around them) aren’t consistent throughout the string.

In [2]:
line = 'asdf fjdk; afed, fjek,asdf, foo'
re.split(r'[;,\s]\s*', line)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

`[]` - Character class: matches exactly one of the characters listed inside the brackets
`;`, `,`, `\s` - Semicolon, comma, and whitespace delimiters
`*` - Matches zero or more repetitions of the preceding element

`[;,\s]` - Matches a single delimiter: a semicolon, a comma, or a whitespace character

`[;,\s]\s*` - Matches a delimiter together with any whitespace that follows it
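
To see why the trailing `\s*` matters, the sketch below splits the same string with and without it. The variable names `naive` and `fields` are only illustrative.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'

# Without the trailing \s*, the space after a ';' or ',' counts as a
# second delimiter, which leaves empty strings between the fields.
naive = re.split(r'[;,\s]', line)

# With \s*, a delimiter and the whitespace after it are consumed together.
fields = re.split(r'[;,\s]\s*', line)

print(naive)   # ['asdf', 'fjdk', '', 'afed', '', 'fjek', 'asdf', '', 'foo']
print(fields)  # ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
```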

### Split the string but capture the delimiters

In [3]:
re.split(r'(;|,|\s)\s*', line)

['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']

`()` - Parentheses create a capture group, so the matched delimiters are included in the result
`|` - The pipe operator acts as an `OR` between alternative expressions

To get the same output as the bracket version while still using parentheses for grouping, make the group non-capturing by adding `?:`

In [4]:
re.split(r'(?:,|;|\s)\s*', line)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
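
One possible use for the captured delimiters is to put the string back together after processing the fields. The following is a minimal sketch; `values`, `delimiters`, and `rebuilt` are illustrative names, and note that the extra whitespace swallowed by `\s*` is not restored.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'
parts = re.split(r'(;|,|\s)\s*', line)

values = parts[::2]       # fields sit at the even indexes
delimiters = parts[1::2]  # captured delimiters sit at the odd indexes

# Pair each field with the delimiter that followed it; the last field has none.
rebuilt = ''.join(v + d for v, d in zip(values, delimiters + ['']))

print(values)      # ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
print(delimiters)  # [' ', ';', ',', ',', ',']
print(rebuilt)     # asdf fjdk;afed,fjek,asdf,foo
```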

## Matching Text at Start or End of string

In [5]:
file = 'spam.txt'
print(file.endswith('txt'))
print(file.startswith('new'))

True
False


In [7]:
url = 'http://localhost:8888/notebooks/Python%20String%20and%20Text.ipynb'
url.startswith('http')

True

In [16]:
from urllib.request import urlopen
import pprint
def read_data(url):
    if url.startswith(('http', 'https', 'ftp')):
        return urlopen(url).read()
    else:
        with open(url) as u:
            return u.read()

url = 'http://www.pythontutor.com/visualize.html#mode=edit'
pprint.pprint(read_data(url)[:50])

b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Trans'


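To test against multiple choices, pass them as a `tuple`, as in `url.startswith(('http', 'https', 'ftp'))`. A `list` or `set` is not accepted and raises a `TypeError`. (`'https'` is redundant here, since any string starting with `'https'` also starts with `'http'`.)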
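
If the prefixes happen to be stored in a list, they need to be converted with `tuple()` first. A small sketch, where the `choices` list and the URL string are made up for illustration:

```python
choices = ['http:', 'https:', 'ftp:']   # hypothetical list of prefixes
url = 'http://www.python.org'           # only used as a string, never fetched

raised = False
try:
    url.startswith(choices)             # a list is rejected
except TypeError:
    raised = True

ok = url.startswith(tuple(choices))     # a tuple is accepted

print(raised)  # True
print(ok)      # True
```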

In [19]:
re.match('http:|https:|ftp:', url)

<re.Match object; span=(0, 5), match='http:'>

### Matching string using Shell wildcard pattern

In [21]:
from fnmatch import fnmatch, fnmatchcase
print(fnmatch('foo.txt', '*.txt'))
print(fnmatch('foo.txt', '*?o.txt'))
print(fnmatch('Dat23.csv', 'Dat[0-9]*'))

True
True
True


In [22]:
names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']
[n for n in names if fnmatch(n, 'Dat*.csv')]

['Dat1.csv', 'Dat2.csv']

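`fnmatch` follows the case-sensitivity rules of the underlying operating system, so on `Windows` the match is case-insensitive (the first result below was produced on Windows; on Linux or macOS it would be `False`). To force a case-sensitive match on every platform, use `fnmatchcase`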

In [23]:
print(fnmatch('foo.txt', '*.TXT'))
print(fnmatchcase('foo.txt', '*.TXT'))

True
False
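
Because the behaviour of `fnmatch` depends on the operating system, one way to get a case-insensitive match that behaves the same everywhere is to lowercase both arguments and use `fnmatchcase`. The helper name `fnmatch_nocase` is hypothetical, not part of the `fnmatch` module.

```python
from fnmatch import fnmatchcase

def fnmatch_nocase(name, pattern):
    """Hypothetical helper: case-insensitive wildcard match on any OS."""
    return fnmatchcase(name.lower(), pattern.lower())

print(fnmatch_nocase('foo.txt', '*.TXT'))  # True
print(fnmatchcase('foo.txt', '*.TXT'))     # False
```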


In [24]:
addresses = [
'5412 N CLARK ST',
'1060 W ADDISON ST',
'1039 W GRANVILLE AVE',
'2122 N CLARK ST',
'4802 N BROADWAY',
]

[addr for addr in addresses if fnmatchcase(addr, '* ST')]

['5412 N CLARK ST', '1060 W ADDISON ST', '2122 N CLARK ST']

In [25]:
[addr for addr in addresses if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')]

['5412 N CLARK ST']

### Matching and Searching for Text Patterns

In [26]:
text = 'yeah, but no, but yeah, but no, but yeah'

In [28]:
text == 'yeah'

False

In [29]:
text.startswith('yeah')

True

In [30]:
text.endswith('yeah')

True

In [31]:
'no' in text

True

In [32]:
text.find('no')

10

`startswith`, `endswith`, `in`, and `find` are sufficient for simple literal matching. For more complex patterns, use regular expressions via the `re` module

In [34]:
text1 = '11/27/2012'
text2 = 'Nov 27, 2012'

if re.match(r'\d+/\d+/\d+',text1):
    print('True')
else:
    print('False')
    
if re.match(r'\d+/\d+/\d+',text2):
    print('True')
else:
    print('False')

True
False


`\d` matches a single digit
`+` matches one or more occurrences of the preceding element

`\d+` matches one or more digits
`\d+/` matches one or more digits followed by `/`, e.g. `11/`

If you are going to perform many matches with the same pattern, it is better to precompile it with `re.compile`

`match` only checks for a match at the beginning of the string. To find every occurrence anywhere in the string, use `findall`

In [35]:
date_pattern = re.compile(r'\d+/\d+/\d+')

if date_pattern.match(text1):
    print('True')
else:
    print('False')

True


In [36]:
text = 'Today is 02/02/2020. New year will start on 01/01/2021.'
date_pattern.findall(text)

['02/02/2020', '01/01/2021']

#### Capturing patterns in Group

In [50]:
date_pattern = re.compile(r'(\d+)/(\d+)/(\d+)')
a = date_pattern.match(text1)
a

<re.Match object; span=(0, 10), match='11/27/2012'>

In [51]:
print(a.group(0))
print(a.group(1))
print(a.group(2))
print(a.group(3))
print(a.group())
print(a.groups())

11/27/2012
11
27
2012
11/27/2012
('11', '27', '2012')


In [61]:
a_all = date_pattern.findall(text)
a_all

[('02', '02', '2020'), ('01', '01', '2021')]

In [53]:
for month, day, year in a_all:
    print(f'{month}-{day}-{year}')

02-02-2020
01-01-2021


In [57]:
for m in date_pattern.finditer(text):
    print(m.groups())

('02', '02', '2020')
('01', '01', '2021')


`match()` only looks for a match at the start of a string.

Use `findall()` to find all occurrences anywhere in the string.

Precompile the pattern with `compile()` if the same pattern is used for many matches.

`group()` or `group(0)` returns the entire match, while `group(n)` returns the n-th captured group.

`groups()` returns all the captured groups as a tuple.

`finditer()` finds the occurrences iteratively, yielding one match object at a time.
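
The difference between anchoring at the start, matching the whole string, and searching anywhere is easy to miss, so here is a brief sketch contrasting `match()`, `fullmatch()`, and `search()`. The sample strings `trailing` and `embedded` are invented for illustration.

```python
import re

date_pattern = re.compile(r'(\d+)/(\d+)/(\d+)')

trailing = '11/27/2012abcdef'    # date followed by unrelated text
embedded = 'Due on 11/27/2012'   # date not at the start

m1 = date_pattern.match(trailing)      # matches: only the start is checked
m2 = date_pattern.fullmatch(trailing)  # None: the whole string must match
m3 = date_pattern.match(embedded)      # None: the date is not at position 0
m4 = date_pattern.search(embedded)     # matches: scans the whole string

print(m1.group())  # 11/27/2012
print(m2)          # None
print(m3)          # None
print(m4.group())  # 11/27/2012
```

If an exact match is needed, `fullmatch()` (or a pattern ending in `$`) avoids accepting strings with trailing text.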

## Search and replace text

In [62]:
text = 'yeah, but no, but yeah, but no, but yeah'

In [63]:
text.replace('yeah', 'yup')

'yup, but no, but yup, but no, but yup'

For more complex patterns, use `sub()` from the `re` module

In [64]:
text = 'Today is 02/02/2020. New year will start on 01/01/2021.'

In [69]:
re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)

'Today is 2020-02-02. New year will start on 2021-01-01.'

`\3`, `\1`, and `\2` in the replacement string `r'\3-\1-\2'` are backreferences to the capture groups in the pattern

As with `match()`, the pattern can be compiled first and `sub()` called on the compiled object

To get the number of substitutions made along with the new text, use the `subn()` method

In [70]:
date_pattern.subn(r'\3-\1-\2', text)

('Today is 2020-02-02. New year will start on 2021-01-01.', 2)

In [71]:
text_sub, n = date_pattern.subn(r'\3-\1-\2', text)
print(text_sub)
print(n)

Today is 2020-02-02. New year will start on 2021-01-01.
2
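
When a backreference string is not flexible enough, `sub()` also accepts a function that receives each match object and returns the replacement text. The sketch below assumes the dates are in month/day/year order; `MONTHS` and `change_date` are illustrative names.

```python
import re

date_pattern = re.compile(r'(\d+)/(\d+)/(\d+)')

# Assumption: the first group is the month (mm/dd/yyyy).
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

def change_date(m):
    """Build the replacement text for one match object."""
    month_name = MONTHS[int(m.group(1)) - 1]
    return f'{m.group(2)} {month_name} {m.group(3)}'

text = 'Today is 02/02/2020. New year will start on 01/01/2021.'
result = date_pattern.sub(change_date, text)

print(result)  # Today is 02 Feb 2020. New year will start on 01 Jan 2021.
```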


### Searching and Replacing Case-Insensitive pattern

In [72]:
text = 'UPPER PYTHON, lower python, Mixed Python'

In [73]:
re.findall('python', text, flags=re.IGNORECASE)

['PYTHON', 'python', 'Python']

In [74]:
re.sub('python', 'snake', text, flags=re.IGNORECASE)

'UPPER snake, lower snake, Mixed snake'

`flags=re.IGNORECASE` will perform the *case-insensitive* operation
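
Note that the replacement above is always lowercase, whatever the case of the matched text. If the replacement should follow the case of each match, one option is a replacement function. This is a sketch only; `matchcase` is a hypothetical helper, not part of `re`.

```python
import re

def matchcase(word):
    """Return a replacement function that copies the case of each match."""
    def replace(m):
        matched = m.group()
        if matched.isupper():
            return word.upper()
        if matched.islower():
            return word.lower()
        if matched[0].isupper():
            return word.capitalize()
        return word
    return replace

text = 'UPPER PYTHON, lower python, Mixed Python'
result = re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)

print(result)  # UPPER SNAKE, lower snake, Mixed Snake
```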

### Finding Shortest possible match

In [75]:
str_pat = re.compile(r'\"(.*)\"')
text2 = 'Computer says "no." Phone says "yes."'
str_pat.findall(text2)

['no." Phone says "yes.']

`r'\"(.*)\"'` attempts to match text enclosed in quotes. However, the `*` operator in a regular expression is `greedy`, so it finds the longest possible match.

To fix this, add the `?` modifier after the `*` operator in the pattern, which makes the match `non-greedy` so that it stops at the shortest possible match

In [79]:
str_pat = re.compile(r'\"(.*?)\"')
str_pat.findall(text2)

['no.', 'yes.']
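
An alternative to the non-greedy `*?` is a negated character class such as `[^"]*`, which cannot run past a closing quote. The two behave differently when the quoted text spans lines, because `.` does not match a newline unless `re.DOTALL` is set. A short sketch with an invented `multiline` sample:

```python
import re

text2 = 'Computer says "no." Phone says "yes."'
multiline = 'He said "line one\nline two" and left.'

lazy = re.findall(r'"(.*?)"', text2)
negated = re.findall(r'"([^"]*)"', text2)

# '.' stops at the newline, so the lazy pattern finds nothing here...
lazy_multi = re.findall(r'"(.*?)"', multiline)
# ...unless DOTALL lets '.' match newlines too.
lazy_dotall = re.findall(r'"(.*?)"', multiline, flags=re.DOTALL)
# The negated class matches any character except a quote, newlines included.
negated_multi = re.findall(r'"([^"]*)"', multiline)

print(lazy)           # ['no.', 'yes.']
print(negated)        # ['no.', 'yes.']
print(lazy_multi)     # []
print(lazy_dotall)    # ['line one\nline two']
print(negated_multi)  # ['line one\nline two']
```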