In [10]:
import re

In [6]:
text ='Awesome, I am doing the #100DaysOfCode challenge'

Does `text` start with 'Awesome'?

In [3]:
text.startswith('Awesome')

True

In [4]:
text.endswith('challenge')

True

In [7]:
'100daysofcode' in text.lower()

True

In [8]:
text.replace('100','200')

'Awesome, I am doing the #200DaysOfCode challenge'

### `search` vs `match`

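The main functions you want to know about are `search` and `match`: the former finds the pattern anywhere in the string, the latter only matches at the beginning of the string (it does not have to reach the end; use `fullmatch` if the whole string must match). I always wrap my regex in `r''` (a raw string) so I don't have to double the backslash in sequences like `\d` (digit), `\w` (word character), `\s` (whitespace), `\S` (non-whitespace), etc. (I think `\\d` and `\\s` clutter up the regex.)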

In [11]:
re.search(r'I am',text)

<re.Match object; span=(9, 13), match='I am'>

In [12]:
re.match(r'I am',text)

In [13]:
re.match(r'Awesome.*challenge',text)

<re.Match object; span=(0, 48), match='Awesome, I am doing the #100DaysOfCode challenge'>
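To make the difference concrete, here is a small sketch that puts `match`, `fullmatch` and `search` side by side on the same string. The variable names are only for illustration.

```python
import re

text = 'Awesome, I am doing the #100DaysOfCode challenge'

# match anchors at the start only; the pattern does not need to reach the end
starts = re.match(r'Awesome', text)

# fullmatch requires the pattern to cover the entire string
whole_fail = re.fullmatch(r'Awesome', text)
whole_ok = re.fullmatch(r'Awesome.*challenge', text)

# search scans the string for the first position where the pattern matches
anywhere = re.search(r'I am', text)

print(starts.span())    # (0, 7)
print(whole_fail)       # None
print(whole_ok.span())  # (0, 48)
print(anywhere.span())  # (9, 13)
```

All three return `None` when there is no match, so check the result before calling methods on it.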

### Capturing strings

A common task is to retrieve part of a match. You can use capturing parentheses `()` for that:

In [30]:
hundred='Awesome, I am doing the #100DaysOfCode challenge'
two_hundred='Awesome, I am doing the #200DaysOfCode challenge'

m1=re.match(r'.*(#\d+DaysOfCode).*',hundred)
m1.groups()[0]

'#100DaysOfCode'

In [26]:
m2 = re.search(r'(#\d+DaysOfCode)', two_hundred)
m2.groups()[0]

'#200DaysOfCode'
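Since both functions return `None` when nothing matches, calling `.groups()` directly can raise an `AttributeError`. A minimal sketch of a safer variant, using named groups so the captures are self-describing (`extract_hashtag` is a hypothetical helper, not part of `re`):

```python
import re

def extract_hashtag(tweet):
    """Return (hashtag, number of days) from tweet, or None if there is no match."""
    m = re.search(r'(?P<tag>#(?P<days>\d+)DaysOfCode)', tweet)
    if m is None:
        return None
    # named groups can be read by name; m.group(1) and m.group(2) work as well
    return m.group('tag'), int(m.group('days'))

print(extract_hashtag('Awesome, I am doing the #100DaysOfCode challenge'))
# ('#100DaysOfCode', 100)
print(extract_hashtag('No hashtag here'))
# None
```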

### findall is your friend

What if you want to match multiple instances of a pattern? `re` has the convenient `findall` function, which I use a lot. For example, in our 100 Days Of Code we used the `re` module on the following days. How would I extract the days from this string?

In [32]:
text = '''
$ python module_index.py |grep ^re
re                 | stdlib | 005, 007, 009, 015, 021, 022, 068, 080, 081, 086, 095
'''

re.findall(r'\d+',text)

['005', '007', '009', '015', '021', '022', '068', '080', '081', '086', '095']
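Note that `findall` returns strings. If you need numbers, convert them afterwards. The sketch below also tightens the pattern to exactly three digits between word boundaries, which is an assumption about the day format and only matters if the text could contain other digits; the shortened `line` is made up for the example.

```python
import re

line = 're                 | stdlib | 005, 007, 009, 015'

# \b\d{3}\b matches only standalone three-digit numbers
days = [int(d) for d in re.findall(r'\b\d{3}\b', line)]
print(days)  # [5, 7, 9, 15]
```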

In [47]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been 
the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and 
scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into 
electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of
Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus
PageMaker including versions of Lorem Ipsum"""

text.split()

re.findall(r'\w+',text)[:5]

['Lorem', 'Ipsum', 'is', 'simply', 'dummy']

In [41]:
from collections import Counter

re.findall(r'[A-Z][a-z]+',text)[:4]

['Lorem', 'Ipsum', 'Lorem', 'Ipsum']

In [44]:
cnt=Counter(re.findall(r'[A-Z][a-z]+',text))
cnt.most_common(5)

[('Lorem', 4), ('Ipsum', 4), ('It', 2), ('Letraset', 1), ('Aldus', 1)]
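`findall` only gives you the matched strings. If you also need to know where each match sits, `re.finditer` yields match objects instead. A minimal sketch on a shortened, made-up sentence:

```python
import re

sample = 'Lorem Ipsum is simply dummy text. Lorem Ipsum has been the standard.'

# each item is a match object, so start/end positions are available
spans = [(m.start(), m.end(), m.group()) for m in re.finditer(r'Lorem', sample)]

for start, end, word in spans:
    print(start, end, word)
```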

### Compiling regexes

If you want to run the same regex multiple times, say in a for loop, it is good practice to define the regex once using `re.compile`. Here is an example:

In [48]:
movies = '''1. Citizen Kane (1941)
2. The Godfather (1972)
3. Casablanca (1942)
4. Raging Bull (1980)
5. Singin' in the Rain (1952)
6. Gone with the Wind (1939)
7. Lawrence of Arabia (1962)
8. Schindler's List (1993)
9. Vertigo (1958)
10. The Wizard of Oz (1939)'''.split('\n')
movies

['1. Citizen Kane (1941)',
 '2. The Godfather (1972)',
 '3. Casablanca (1942)',
 '4. Raging Bull (1980)',
 "5. Singin' in the Rain (1952)",
 '6. Gone with the Wind (1939)',
 '7. Lawrence of Arabia (1962)',
 "8. Schindler's List (1993)",
 '9. Vertigo (1958)',
 '10. The Wizard of Oz (1939)']

In [57]:
for movie in movies:
    print(re.findall(r'[A-Z][a-z]', movie))

['Ci', 'Ka']
['Th', 'Go']
['Ca']
['Ra', 'Bu']
['Si', 'Ra']
['Go', 'Wi']
['La', 'Ar']
['Sc', 'Li']
['Ve']
['Th', 'Wi', 'Oz']


In [58]:

pat = re.compile(r'''
                  ^             # start of string
                  \d+           # one or more digits
                  \.            # a literal dot
                  \s+           # one or more spaces
                  (?:           # non-capturing group: its match is not stored in groups()
                  [A-Za-z']+\s  # character class (note inclusion of ' for "Schindler's"), followed by a space
                  )             # closing of non-capturing parenthesis
                  {2}           # exactly 2 of the previously grouped subpattern
                  \(            # literal opening parenthesis
                  \d{4}         # exactly 4 digits (year)
                  \)            # literal closing parenthesis
                  $             # end of string
                  ''', re.VERBOSE)

In [61]:
for movie in movies:
    #print(movie,':',pat.match(movie))
    print(f'{movie}: {pat.match(movie)}')

1. Citizen Kane (1941): <re.Match object; span=(0, 22), match='1. Citizen Kane (1941)'>
2. The Godfather (1972): <re.Match object; span=(0, 23), match='2. The Godfather (1972)'>
3. Casablanca (1942): None
4. Raging Bull (1980): <re.Match object; span=(0, 21), match='4. Raging Bull (1980)'>
5. Singin' in the Rain (1952): None
6. Gone with the Wind (1939): None
7. Lawrence of Arabia (1962): None
8. Schindler's List (1993): <re.Match object; span=(0, 26), match="8. Schindler's List (1993)">
9. Vertigo (1958): None
10. The Wizard of Oz (1939): None
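The `None` results above are expected: the pattern demands exactly two title words, so one-word and longer titles fail. If you wanted every line to match, one possible relaxation (a sketch, not the only way) is a lazy "anything" for the title plus named groups for the parts. The group names `rank`, `title` and `year` are my own choice.

```python
import re

movies = [
    '3. Casablanca (1942)',
    "5. Singin' in the Rain (1952)",
    '10. The Wizard of Oz (1939)',
]

pat = re.compile(r'''
    ^
    (?P<rank>\d+)\.\s+      # ranking number followed by a literal dot
    (?P<title>.+?)          # title: any characters, as few as possible
    \s+\((?P<year>\d{4})\)  # four-digit year in literal parentheses
    $
    ''', re.VERBOSE)

parsed = [pat.match(movie).groupdict() for movie in movies]
for entry in parsed:
    print(entry)
```

The pattern is compiled once and reused for every line, which keeps the loop body short.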


### Advanced string replacing

As shown before, `str.replace` probably covers a lot of your needs. For more advanced usage there is `re.sub`:

In [64]:
text = '''Awesome, I am doing #100DaysOfCode, #200DaysOfDjango and of course #365DaysOfPyBites'''

re.sub(r'\d+','100',text)

'Awesome, I am doing #100DaysOfCode, #100DaysOfDjango and of course #100DaysOfPyBites'

In [69]:
re.sub(r'(#\d+DaysOf)\w+',r'\1Python',text)

'Awesome, I am doing #100DaysOfPython, #200DaysOfPython and of course #365DaysOfPython'

In [71]:
re.findall(r'(#\d+DaysOf)\w+',text)

['#100DaysOf', '#200DaysOf', '#365DaysOf']
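Two more `re.sub` features that are easy to miss: the replacement can be a function that receives each match object, and `count` limits how many replacements are made. A small sketch (`double_days` is a hypothetical helper):

```python
import re

text = 'Awesome, I am doing #100DaysOfCode, #200DaysOfDjango and of course #365DaysOfPyBites'

def double_days(match):
    """Return the matched number, doubled, as a string."""
    return str(int(match.group()) * 2)

# the function is called once per match and its return value is substituted
doubled = re.sub(r'\d+', double_days, text)
print(doubled)

# count=1 replaces only the first occurrence
first_only = re.sub(r'\d+', '1', text, count=1)
print(first_only)
```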