# Text as Data Tutorial -- Intro to String Manipulation and Regular Expressions in Python

This tutorial parallels a similar tutorial in R. In most cases, the task achieved is identical or very similar in both.

Many of the string manipulation methods below are built into Python. When we get to regular expressions, you'll need the "re" module from the standard library.

The basic things we want to manipulate are *strings*. These can be specified using double quotes (") or single quotes ('):

In [8]:
a_string = 'Example STRING, with numbers (12, 15 and also 10.2)?!'
a_string

'Example STRING, with numbers (12, 15 and also 10.2)?!'

It’s really a matter of style or convenience, but you might use one if your string actually contains the other:

In [9]:
my_double_quoted_string = "He asked, 'Why would you use double quotes?'"
my_double_quoted_string

"He asked, 'Why would you use double quotes?'"

You can still use either one if you like, using \ (backslash) to tell Python to “escape” the next character. In the example below, the \" says that the " is part of the string, not the end of the string.

In [5]:
my_string_with_double_quotes = "She answered, \"Convenience, but you never really have to.\""
my_string_with_double_quotes

'She answered, "Convenience, but you never really have to."'

If you ever want to see how your string with escape characters displays when printed or (typically) in an editor, use print.

In [10]:
print(my_double_quoted_string)

He asked, 'Why would you use double quotes?'


In [11]:
print(my_string_with_double_quotes)

She answered, "Convenience, but you never really have to."


This can get a little bit confusing. For example, since the backslash character tells Python to escape, to indicate an actual backslash character you have to backslash your backslashes:

In [12]:
a_string_with_backslashes = "To indicate a backslash, \\, you have to type two: \\\\. Just there, to indicate two backslashes, I had to type four: \\\\\\\\."
a_string_with_backslashes

'To indicate a backslash, \\, you have to type two: \\\\. Just there, to indicate two backslashes, I had to type four: \\\\\\\\.'

In [13]:
print(a_string_with_backslashes)

To indicate a backslash, \, you have to type two: \\. Just there, to indicate two backslashes, I had to type four: \\\\.
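
One way to cut down on doubled backslashes is a raw string literal (an `r` prefix), in which Python leaves backslashes alone. A minimal sketch; the variable names `escaped` and `raw` are only for illustration:

```python
# A raw string literal (r"...") keeps each backslash as a literal character.
escaped = "one backslash: \\ and two: \\\\"
raw = r"one backslash: \ and two: \\"

print(escaped == raw)  # both literals spell the same string
print(raw)

# "\n" is a single newline character; r"\n" is a backslash followed by "n".
print(len("\n"), len(r"\n"))
```

Raw strings come back later in this tutorial as the usual way to write regular expression patterns.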


There are a number of special escape characters that are used to represent things like “control characters.” The most common are two that you’re already used to tapping a keyboard key for without expecting a character to appear on your screen: \t (tab) and \n (newline).

In [19]:
test_string = "abc ABC 123\t.!?\\(){}\n  \nthird line"
test_string

'abc ABC 123\t.!?\\(){}\n  \nthird line'

In [20]:
print(test_string)

abc ABC 123	.!?\(){}
  
third line
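
Because \t and \n are ordinary (if invisible) characters, they each count once toward the length of a string, and a multi-line string can be broken into its lines. A small sketch reusing the string above; `lines` is just an illustrative name:

```python
test_string = "abc ABC 123\t.!?\\(){}\n  \nthird line"

# Each escape sequence stands for one character.
print(len("\t"), len("\n"), len("\\"))

# splitlines() breaks the string at newline characters and drops them.
lines = test_string.splitlines()
print(lines)
```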


As with pretty much everything in Python, you can have a list of strings.

In [22]:
a_list_of_strings = ["abcde", "123", "chicken of the sea"]
a_list_of_strings

['abcde', '123', 'chicken of the sea']

In the R tutorial, we made use of a few collections of strings provided in base R or stringr. To make this comparable, we'll create or load these here.

The letters of the alphabet are available as constants in the "string" module.

In [70]:
import string  # provides ascii_lowercase and ascii_uppercase
letters_string = string.ascii_lowercase
letters_string

'abcdefghijklmnopqrstuvwxyz'

In [71]:
letters_list = list(letters_string)
letters_list

['a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z']

In [72]:
LETTERS_string = string.ascii_uppercase
LETTERS_string

'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [73]:
LETTERS_list = list(LETTERS_string)
LETTERS_list

['A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z']

We'll just make the month lists.

In [30]:
month_abb = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_abb

['Jan',
 'Feb',
 'Mar',
 'Apr',
 'May',
 'Jun',
 'Jul',
 'Aug',
 'Sep',
 'Oct',
 'Nov',
 'Dec']

In [32]:
month_name = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
month_name

['January',
 'February',
 'March',
 'April',
 'May',
 'June',
 'July',
 'August',
 'September',
 'October',
 'November',
 'December']

In [109]:
# REQUIRES file "fruit.txt" be in the same directory
with open("fruit.txt", "r") as fruitfile:  # the with block closes the file when done
    fruit = fruitfile.read().splitlines()
print(fruit)

['apple', 'apricot', 'avocado', 'banana', 'bell pepper', 'bilberry', 'blackberry', 'blackcurrant', 'blood orange', 'blueberry', 'boysenberry', 'breadfruit', 'canary melon', 'cantaloupe', 'cherimoya', 'cherry', 'chili pepper', 'clementine', 'cloudberry', 'coconut', 'cranberry', 'cucumber', 'currant', 'damson', 'date', 'dragonfruit', 'durian', 'eggplant', 'elderberry', 'feijoa', 'fig', 'goji berry', 'gooseberry', 'grape', 'grapefruit', 'guava', 'honeydew', 'huckleberry', 'jackfruit', 'jambul', 'jujube', 'kiwi fruit', 'kumquat', 'lemon', 'lime', 'loquat', 'lychee', 'mandarine', 'mango', 'mulberry', 'nectarine', 'nut', 'olive', 'orange', 'pamelo', 'papaya', 'passionfruit', 'peach', 'pear', 'persimmon', 'physalis', 'pineapple', 'plum', 'pomegranate', 'pomelo', 'purple mangosteen', 'quince', 'raisin', 'rambutan', 'raspberry', 'redcurrant', 'rock melon', 'salal berry', 'satsuma', 'star fruit', 'strawberry', 'tamarillo', 'tangerine', 'ugli fruit', 'watermelon']


In [39]:
# REQUIRES file "words.txt" be in the same directory
with open("words.txt", "r") as wordsfile:  # the with block closes the file when done
    words = wordsfile.read().splitlines()
len(words)

980

The word list is long, so let's just look at the top 5. 

(A couple of things for R users here. First, Python starts counting at 0, not 1. So, in R, `words[0]` returns an empty vector, `words[1]` is "a", and `words[5]` is "accept"; in Python, `words[0]` is "a", `words[1]` is "able", and `words[5]` is "account". Second, the "slicing" notation in Python is also weird if you're used to R. In R, to get the first five members, you ask for item 1 through item 5: `words[1:5]`. In Python, you might imagine the list members sitting on a number line that puts the first list member between 0 and 1, the second between 1 and 2, and so on. Then, to get the first 5 members of the list, we need to ask for the slice between "0", just to the "left" of the stuff we want, and "5", at the "right" of the stuff we want: `words[0:5]`.)

In [38]:
words[0:5]

['a', 'able', 'about', 'absolute', 'accept']

In [40]:
words[5]

'account'
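
The number-line picture can be checked on a short list. A minimal sketch; `first_words` is a made-up name holding only the first few entries shown above, so it runs without the word file:

```python
first_words = ["a", "able", "about", "absolute", "accept", "account"]

print(first_words[0])     # first element (R: words[1])
print(first_words[0:5])   # elements at positions 0, 1, 2, 3, 4
print(first_words[:5])    # the starting 0 can be omitted
print(first_words[2:4])   # the stop index is excluded, so this has 4 - 2 = 2 items
print(first_words[-1])    # negative indices count back from the end
```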

The sentences list is also long.

In [149]:
# REQUIRES file "sentences.txt" be in the same directory
with open("sentences.txt", "r") as sentencesfile:  # the with block closes the file when done
    sentences = sentencesfile.read().splitlines()
len(sentences)

720

In [150]:
sentences[0:5]

['The birch canoe slid on the smooth planks.',
 'Glue the sheet to the dark blue background.',
 "It's easy to tell the depth of a well.",
 'These days a chicken leg is a rare dish.',
 'Rice is often served in round bowls.']

## Manipulating strings

You can combine, or “concatenate”, strings very naturally using the "+" sign.

In [44]:
second_string = "Wow, two sentences."
combined_string = a_string + " " + second_string
combined_string

'Example STRING, with numbers (12, 15 and also 10.2)?! Wow, two sentences.'

You can also combine lists of strings by a separator using the "join" method. To again join the two strings above separated by a space, place the strings to be joined in a *list* by using square brackets, and the separator in a string and use the syntax *sep*`.join(`*list*`)`:

In [49]:
" ".join([a_string,second_string]) 

'Example STRING, with numbers (12, 15 and also 10.2)?! Wow, two sentences.'

Note that "join" takes a list of strings of *any* length and concatenates *all* the strings together with the separator.

In [75]:
" then ".join(month_name)

'January then February then March then April then May then June then July then August then September then October then November then December'
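
One thing to watch for: "join" only accepts strings, so numbers have to be converted first. A minimal sketch; the `numbers` list is made up for illustration:

```python
numbers = [12, 15, 10.2]

# ", ".join(numbers) would raise a TypeError because the items are not strings.
joined = ", ".join(str(n) for n in numbers)
print(joined)
```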

In the R notebook, we next created a new vector of strings concatenating the month vectors into strings like "Jan stands for January". Python doesn't have the element-by-element "vectorized" syntax R does, so we have to more explicitly *iterate* over the elements to do this here. There are several ways to do this.

The most straightforward to understand, but not very "Pythonic", way is to do this in a for loop:

In [57]:
month_explanations = []
for i in range(12):
    new_string = " stands for ".join([month_abb[i], month_name[i]])
    month_explanations.append(new_string)
month_explanations

['Jan stands for January',
 'Feb stands for February',
 'Mar stands for March',
 'Apr stands for April',
 'May stands for May',
 'Jun stands for June',
 'Jul stands for July',
 'Aug stands for August',
 'Sep stands for September',
 'Oct stands for October',
 'Nov stands for November',
 'Dec stands for December']

There are more compact, Pythonic ways to do this. One is to use the "zip" function which creates an "iterator" of "tuples" element by element and then iterate over that. Let's look inside the zip function first by making a list of the zipped elements:

In [58]:
list(zip(month_abb, month_name))

[('Jan', 'January'),
 ('Feb', 'February'),
 ('Mar', 'March'),
 ('Apr', 'April'),
 ('May', 'May'),
 ('Jun', 'June'),
 ('Jul', 'July'),
 ('Aug', 'August'),
 ('Sep', 'September'),
 ('Oct', 'October'),
 ('Nov', 'November'),
 ('Dec', 'December')]

Not quite what we want, but we can join those tuples as we iterate over the zip object using a "list comprehension":

In [59]:
[" stands for ".join([abbrev,name]) for abbrev,name in zip(month_abb, month_name)]

['Jan stands for January',
 'Feb stands for February',
 'Mar stands for March',
 'Apr stands for April',
 'May stands for May',
 'Jun stands for June',
 'Jul stands for July',
 'Aug stands for August',
 'Sep stands for September',
 'Oct stands for October',
 'Nov stands for November',
 'Dec stands for December']

The "list comprehension" is defined by those square brackets on the outside (making it a list) and the "for loop"-like instruction inside. 

There are many ways to do the same thing. For example, we can change the string manipulation operation that gets repeated from the join method to the format method:

In [60]:
["{} stands for {}".format(abbrev,name) for abbrev,name in zip(month_abb, month_name)]

['Jan stands for January',
 'Feb stands for February',
 'Mar stands for March',
 'Apr stands for April',
 'May stands for May',
 'Jun stands for June',
 'Jul stands for July',
 'Aug stands for August',
 'Sep stands for September',
 'Oct stands for October',
 'Nov stands for November',
 'Dec stands for December']
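
In Python 3.6 and later, an f-string is yet another way to write the same thing. A sketch with shortened month lists so that it runs on its own; `explanations` is an illustrative name:

```python
month_abb = ["Jan", "Feb", "Mar"]
month_name = ["January", "February", "March"]

# An f-string evaluates the expressions inside the braces in place.
explanations = [f"{abbrev} stands for {name}" for abbrev, name in zip(month_abb, month_name)]
print(explanations)
```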

The join/zip idiom works for the letters example in the other notebook as well:

In [89]:
letterpairs = ["".join([lower,upper]) for lower, upper in zip(letters_list, LETTERS_list)]
print(letterpairs)

['aA', 'bB', 'cC', 'dD', 'eE', 'fF', 'gG', 'hH', 'iI', 'jJ', 'kK', 'lL', 'mM', 'nN', 'oO', 'pP', 'qQ', 'rR', 'sS', 'tT', 'uU', 'vV', 'wW', 'xX', 'yY', 'zZ']


You can zip two lists together, concatenate those element by element, and then join them by a separator.

In [77]:
" then ".join(["{} ({})".format(name,abbrev) for name,abbrev in zip(month_name,month_abb)])

'January (Jan) then February (Feb) then March (Mar) then April (Apr) then May (May) then June (Jun) then July (Jul) then August (Aug) then September (Sep) then October (Oct) then November (Nov) then December (Dec)'

You can split up a string into pieces with the "split" method. The separator you give it is matched literally; it is not a regular expression.

In [78]:
combined_string.split("!")

['Example STRING, with numbers (12, 15 and also 10.2)?',
 ' Wow, two sentences.']
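
A few variations on "split" are worth knowing. This is a small sketch, not an exhaustive tour; the result names are made up:

```python
combined_string = "Example STRING, with numbers (12, 15 and also 10.2)?! Wow, two sentences."

# With no argument, split() breaks on any run of whitespace.
tokens = combined_string.split()
print(tokens)

# A separator can be longer than one character.
parts = combined_string.split(", ")
print(parts)

# A second argument (maxsplit) limits the number of splits, counting from the left.
first_two = combined_string.split(" ", 2)
print(first_two)
```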

## Substrings (Slices)

Substrings are just slices in Python. To get a list of the second through fourth characters in each fruit name:

In [88]:
substringfromfruit = [eachfruit[1:4] for eachfruit in fruit]
print(substringfromfruit)

['ppl', 'pri', 'voc', 'ana', 'ell', 'ilb', 'lac', 'lac', 'loo', 'lue', 'oys', 'rea', 'ana', 'ant', 'her', 'her', 'hil', 'lem', 'lou', 'oco', 'ran', 'ucu', 'urr', 'ams', 'ate', 'rag', 'uri', 'ggp', 'lde', 'eij', 'ig', 'oji', 'oos', 'rap', 'rap', 'uav', 'one', 'uck', 'ack', 'amb', 'uju', 'iwi', 'umq', 'emo', 'ime', 'oqu', 'ych', 'and', 'ang', 'ulb', 'ect', 'ut', 'liv', 'ran', 'ame', 'apa', 'ass', 'eac', 'ear', 'ers', 'hys', 'ine', 'lum', 'ome', 'ome', 'urp', 'uin', 'ais', 'amb', 'asp', 'edc', 'ock', 'ala', 'ats', 'tar', 'tra', 'ama', 'ang', 'gli', 'ate']


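Substrings from the end of the string can be accessed by slices using negative numbers. As with any slice, the stop position is excluded, so `[-3:-1]` picks out the third-from-last and second-from-last characters but not the last one.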

In [87]:
subfromend = [eachfruit[-3:-1] for eachfruit in fruit]
print(subfromend)

['pl', 'co', 'ad', 'an', 'pe', 'rr', 'rr', 'an', 'ng', 'rr', 'rr', 'ui', 'lo', 'up', 'oy', 'rr', 'pe', 'in', 'rr', 'nu', 'rr', 'be', 'an', 'so', 'at', 'ui', 'ia', 'an', 'rr', 'jo', 'fi', 'rr', 'rr', 'ap', 'ui', 'av', 'de', 'rr', 'ui', 'bu', 'ub', 'ui', 'ua', 'mo', 'im', 'ua', 'he', 'in', 'ng', 'rr', 'in', 'nu', 'iv', 'ng', 'el', 'ay', 'ui', 'ac', 'ea', 'mo', 'li', 'pl', 'lu', 'at', 'el', 'ee', 'nc', 'si', 'ta', 'rr', 'an', 'lo', 'rr', 'um', 'ui', 'rr', 'll', 'in', 'ui', 'lo']
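
To include the final character, leave the stop position out. A minimal sketch on a single fruit name (`fruit_name` is an illustrative variable):

```python
fruit_name = "strawberry"

print(fruit_name[-3:-1])  # third-from-last and second-from-last characters
print(fruit_name[-3:])    # omit the stop to run through the end of the string
print(fruit_name[-1])     # the last character alone
```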


You can use slicing to extract data from strings:

In [86]:
some_dates = ["1999/01/01","1998/12/15","2001/09/03"]
years = [date[0:4] for date in some_dates]
print(years)

['1999', '1998', '2001']


In [90]:
months = [date[5:7] for date in some_dates]
print(months)

['01', '12', '09']
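
Slicing by position works here because every date has the same layout. When the pieces are separated by a known character, splitting is an alternative that does not depend on fixed widths. A sketch, with `years_as_int` as a made-up name:

```python
some_dates = ["1999/01/01", "1998/12/15", "2001/09/03"]

# Split each date on "/" and unpack the three pieces.
for date in some_dates:
    year, month, day = date.split("/")
    print(year, month, day)

# The pieces are still strings; int() converts them to numbers.
years_as_int = [int(date.split("/")[0]) for date in some_dates]
print(years_as_int)
```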


Getting a copy of a string with specific positions replaced is also a matter of slicing:

In [94]:
apple = "apple"
zebra = "--!ZEBRA!--"
zebraapple = apple[0:1] + zebra + apple[3:]
zebraapple

'a--!ZEBRA!--le'

Replicating the R result over the whole list can be done by putting the same expression inside a list comprehension.

In [95]:
zebrafruit = [fr[0:1] + zebra + fr[3:] for fr in fruit]
print(zebrafruit)

['a--!ZEBRA!--le', 'a--!ZEBRA!--icot', 'a--!ZEBRA!--cado', 'b--!ZEBRA!--ana', 'b--!ZEBRA!--l pepper', 'b--!ZEBRA!--berry', 'b--!ZEBRA!--ckberry', 'b--!ZEBRA!--ckcurrant', 'b--!ZEBRA!--od orange', 'b--!ZEBRA!--eberry', 'b--!ZEBRA!--senberry', 'b--!ZEBRA!--adfruit', 'c--!ZEBRA!--ary melon', 'c--!ZEBRA!--taloupe', 'c--!ZEBRA!--rimoya', 'c--!ZEBRA!--rry', 'c--!ZEBRA!--li pepper', 'c--!ZEBRA!--mentine', 'c--!ZEBRA!--udberry', 'c--!ZEBRA!--onut', 'c--!ZEBRA!--nberry', 'c--!ZEBRA!--umber', 'c--!ZEBRA!--rant', 'd--!ZEBRA!--son', 'd--!ZEBRA!--e', 'd--!ZEBRA!--gonfruit', 'd--!ZEBRA!--ian', 'e--!ZEBRA!--plant', 'e--!ZEBRA!--erberry', 'f--!ZEBRA!--joa', 'f--!ZEBRA!--', 'g--!ZEBRA!--i berry', 'g--!ZEBRA!--seberry', 'g--!ZEBRA!--pe', 'g--!ZEBRA!--pefruit', 'g--!ZEBRA!--va', 'h--!ZEBRA!--eydew', 'h--!ZEBRA!--kleberry', 'j--!ZEBRA!--kfruit', 'j--!ZEBRA!--bul', 'j--!ZEBRA!--ube', 'k--!ZEBRA!--i fruit', 'k--!ZEBRA!--quat', 'l--!ZEBRA!--on', 'l--!ZEBRA!--e', 'l--!ZEBRA!--uat', 'l--!ZEBRA!--hee', 'm--!ZEB

Strings have simple methods for converting case:

In [96]:
combined_string.lower()

'example string, with numbers (12, 15 and also 10.2)?! wow, two sentences.'

In [97]:
combined_string.upper()

'EXAMPLE STRING, WITH NUMBERS (12, 15 AND ALSO 10.2)?! WOW, TWO SENTENCES.'

There are also several methods to trim excess white space *off the ends* of strings:

In [102]:
lotsofspace = '   Why   so much  space?   '
lotsofspace.strip()

'Why   so much  space?'

In [100]:
lotsofspace.lstrip()

'Why   so much  space?   '

In [101]:
lotsofspace.rstrip()

'   Why   so much  space?'
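
None of the strip methods touch the extra spaces *inside* the string. One common idiom for that, sketched here, is to split on whitespace and rejoin with single spaces (`collapsed` is an illustrative name):

```python
lotsofspace = '   Why   so much  space?   '

# split() with no argument drops leading/trailing whitespace and splits on
# runs of whitespace; joining with a single space rebuilds a tidy string.
collapsed = " ".join(lotsofspace.split())
print(repr(collapsed))
```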

## Matching substrings

If we're looking for specific substrings, there are string methods to do that.

In [113]:
"strawberry".find("berry")

5

That returns the position of the first match. If there is no match, find returns a value of -1.

In [116]:
"apple".find("berry")

-1

If there are multiple matches, find returns the position of the first match.

In [117]:
"berryberryboberrybananafanafoferrymemymomerry berry".find("berry")

0

We can use this in a list comprehension, with the addition of an "if" condition, to extract a list of all matching fruits.

In [119]:
[fr for fr in fruit if fr.find("berry")> -1]

['bilberry',
 'blackberry',
 'blueberry',
 'boysenberry',
 'cloudberry',
 'cranberry',
 'elderberry',
 'goji berry',
 'gooseberry',
 'huckleberry',
 'mulberry',
 'raspberry',
 'salal berry',
 'strawberry']
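
When the position itself isn't needed, a few other built-in tests read a little more directly than comparing "find" to -1. A sketch on a short made-up list (`some_fruit` is not the fruit file, and "berry tart" is included only to exercise "startswith"):

```python
some_fruit = ["apple", "blueberry", "cherry", "goji berry", "berry tart"]

# "in" tests for a substring anywhere in the string.
contains_berry = [fr for fr in some_fruit if "berry" in fr]
print(contains_berry)

# endswith / startswith test only the ends of the string.
ends_with_berry = [fr for fr in some_fruit if fr.endswith("berry")]
starts_with_berry = [fr for fr in some_fruit if fr.startswith("berry")]
print(ends_with_berry)
print(starts_with_berry)

# count gives the number of non-overlapping occurrences.
n_berries = "berryberryboberry".count("berry")
print(n_berries)
```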

We can get a copy of the string with the substring replaced with something else:

In [120]:
"strawberry".replace("berry","fish")

'strawfish'

In [121]:
fishfruit = [fr.replace("berry","fish") for fr in fruit]
print(fishfruit)

['apple', 'apricot', 'avocado', 'banana', 'bell pepper', 'bilfish', 'blackfish', 'blackcurrant', 'blood orange', 'bluefish', 'boysenfish', 'breadfruit', 'canary melon', 'cantaloupe', 'cherimoya', 'cherry', 'chili pepper', 'clementine', 'cloudfish', 'coconut', 'cranfish', 'cucumber', 'currant', 'damson', 'date', 'dragonfruit', 'durian', 'eggplant', 'elderfish', 'feijoa', 'fig', 'goji fish', 'goosefish', 'grape', 'grapefruit', 'guava', 'honeydew', 'hucklefish', 'jackfruit', 'jambul', 'jujube', 'kiwi fruit', 'kumquat', 'lemon', 'lime', 'loquat', 'lychee', 'mandarine', 'mango', 'mulfish', 'nectarine', 'nut', 'olive', 'orange', 'pamelo', 'papaya', 'passionfruit', 'peach', 'pear', 'persimmon', 'physalis', 'pineapple', 'plum', 'pomegranate', 'pomelo', 'purple mangosteen', 'quince', 'raisin', 'rambutan', 'raspfish', 'redcurrant', 'rock melon', 'salal fish', 'satsuma', 'star fruit', 'strawfish', 'tamarillo', 'tangerine', 'ugli fruit', 'watermelon']


## Searching for patterns with regular expressions

So far, I’ve only searched for patterns that are fixed strings of alphabetic characters like "berry". But we can make much more elaborate and flexible patterns using regular expressions. For this we need to import the "re" module.

I recommend you reference the cheat sheet and the online regex tool https://regex101.com in parallel.

Regular expressions work a little differently in Python than in R.

Just for comparison's sake, let's start with a search for the same pattern as above: "berry".

In [123]:
import re

pattern = r'berry' # Define the pattern you are looking for.
reo = re.compile(pattern) # compile the pattern into a regular expression object
mo = reo.search('strawberry') # search for the pattern and return a match object
mo

<re.Match object; span=(5, 10), match='berry'>

The start and end positions of the match are returned by the "span" method:

In [124]:
mo.span()

(5, 10)

The matched text itself is returned by the "group" method, which I'll explain below.

In [128]:
mo.group()

'berry'

If there is no match, "search" returns the null value "None" instead of a match object. You can use the result directly in conditional statements: "None" counts as "False" and any match object counts as "True".

In [130]:
reo = re.compile(r'berry')
mo_miss = reo.search('apple')
mo_miss

In [131]:
print(mo_miss)

None


In [135]:
if mo:
    print("Strawberry is a berry!")
else:
    print("Strawberry is not a berry.")

Strawberry is a berry!


In [136]:
if mo_miss:
    print("Apple is a berry")
else:
    print("Apple is not a berry.")

Apple is not a berry.
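
If you'd rather have an explicit True or False than rely on truthiness, you can compare against None directly. A minimal sketch; `is_berry` is a hypothetical helper, not something from the "re" module:

```python
import re

reo = re.compile(r'berry')

def is_berry(name):
    """Return True if the pattern is found anywhere in name (hypothetical helper)."""
    return reo.search(name) is not None

print(is_berry("strawberry"))
print(is_berry("apple"))

# re.search(pattern, string) does the compile-and-search in one call.
print(re.search(r'berry', "apple") is None)
```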


This, again, can be put in a list comprehension to get a list of all berries:

In [144]:
berries = [itsaberry for itsaberry in fruit if reo.search(itsaberry)]
print(berries)

['bilberry', 'blackberry', 'blueberry', 'boysenberry', 'cloudberry', 'cranberry', 'elderberry', 'goji berry', 'gooseberry', 'huckleberry', 'mulberry', 'raspberry', 'salal berry', 'strawberry']


The "search" method will return a single object describing only the first match in the string.

In [145]:
mo_many = reo.search("berryberryboberrybananafanafoferrymemymomerry berry")
mo_many

<re.Match object; span=(0, 5), match='berry'>

The findall method returns a list of all matching strings.

In [146]:
mo_many2 = reo.findall("berryberryboberrybananafanafoferrymemymomerry berry")
mo_many2

['berry', 'berry', 'berry', 'berry']

The "finditer" method returns an "iterator" (thing, like a list, over which you can, um, iterate) containing match objects for every match.

In [147]:
mo_iter = reo.finditer("berryberryboberrybananafanafoferrymemymomerry berry")
for moi in mo_iter:
    print(moi)

<re.Match object; span=(0, 5), match='berry'>
<re.Match object; span=(5, 10), match='berry'>
<re.Match object; span=(12, 17), match='berry'>
<re.Match object; span=(46, 51), match='berry'>
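
Each match object from "finditer" can be unpacked into its pieces. A short sketch collecting start, end, and matched text for every match (`positions` is an illustrative name):

```python
import re

reo = re.compile(r'berry')
text = "berryberryboberrybananafanafoferrymemymomerry berry"

# Each match object carries its own start, end, and matched text.
positions = [(mo.start(), mo.end(), mo.group()) for mo in reo.finditer(text)]
print(positions)

# The start and end can be used to slice the original string.
start, end, _ = positions[-1]
print(text[start:end])
```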


Now let's use regex to look for more complex patterns than just substrings.

#### Square brackets for “or” (disjunction) of characters.

Match “any one of” the characters in the square brackets.

In [156]:
reodemo = re.compile(r' [bhp]eat ')
matches = [itsamatch for itsamatch in sentences if reodemo.search(itsamatch)]
matches

['The heart beat strongly and with firm strokes.',
 'Burn peat after the logs give out.',
 'Feel the heat of the weak dying flame.',
 'A speedy man can beat this track mark.',
 'Even the worst will beat his low score.',
 'It takes heat to bring out the odor.']

#### Square brackets with ^ for negation.

Match “anything but one of” the characters in the square brackets.

(Be careful ... the caret ... ^ ... means something else in a different context.)

In [157]:
reodemo = re.compile(r' [^bhp]eat ')
matches = [itsamatch for itsamatch in sentences if reodemo.search(itsamatch)]
matches

['Pack the records in a neat thin case.', 'A clean neck means a neat collar.']

#### Square brackets for “or” over a range of characters

In [158]:
reodemo = re.compile(r' [b-p]eat ')
matches = [itsamatch for itsamatch in sentences if reodemo.search(itsamatch)]
matches

['The heart beat strongly and with firm strokes.',
 'Burn peat after the logs give out.',
 'Feel the heat of the weak dying flame.',
 'A speedy man can beat this track mark.',
 'Even the worst will beat his low score.',
 'Pack the records in a neat thin case.',
 'It takes heat to bring out the odor.',
 'A clean neck means a neat collar.']

#### Pipe operator for "or" over multi-character patterns

When we need an “or” over multi-character patterns, we can use the “pipe” operator, using parentheses as necessary to identify what’s with what.

In [159]:
reodemo = re.compile(r'(black|blue|red)(currant|berry)')
matches = [itsamatch for itsamatch in fruit if reodemo.search(itsamatch)]
matches

['blackberry', 'blackcurrant', 'blueberry', 'redcurrant']
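
Parentheses do a second job besides grouping: they "capture" what each group matched, and that is what the "group" method reports. A sketch of how this plays out, including how it changes what "findall" returns (`reo_nocapture` is an illustrative name):

```python
import re

reodemo = re.compile(r'(black|blue|red)(currant|berry)')
mo = reodemo.search('blackcurrant')

print(mo.group())    # the whole match (same as group(0))
print(mo.group(1))   # what the first parenthesized group matched
print(mo.group(2))   # what the second parenthesized group matched
print(mo.groups())   # all groups as a tuple

# With groups in the pattern, findall returns tuples of the groups ...
print(reodemo.findall('blueberry and redcurrant'))

# ... unless the groups are made non-capturing with (?:...).
reo_nocapture = re.compile(r'(?:black|blue|red)(?:currant|berry)')
print(reo_nocapture.findall('blueberry and redcurrant'))
```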

#### Special characters and the backslash

In addition to the backslash itself, there are several characters that have special meaning in Python regexes, and (may) have to be escaped in order to match the literal character. The main ones are ^ $ . * + | ? ( ) [ ] { }. A few others, such as ! < >, are special only inside extension groups like (?!...) and (?<=...).

For example, the period – “.” – means “any character but a newline.” It’s a wildcard. We get different results when we escape or don’t escape it.

In [161]:
reo_anychar = re.compile(r'.')
allchars = reo_anychar.findall(combined_string)
print(allchars)

['E', 'x', 'a', 'm', 'p', 'l', 'e', ' ', 'S', 'T', 'R', 'I', 'N', 'G', ',', ' ', 'w', 'i', 't', 'h', ' ', 'n', 'u', 'm', 'b', 'e', 'r', 's', ' ', '(', '1', '2', ',', ' ', '1', '5', ' ', 'a', 'n', 'd', ' ', 'a', 'l', 's', 'o', ' ', '1', '0', '.', '2', ')', '?', '!', ' ', 'W', 'o', 'w', ',', ' ', 't', 'w', 'o', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', 's', '.']


In [162]:
reo_justperiod = re.compile(r'\.')
allperiods = reo_justperiod.findall(combined_string)
print(allperiods)

['.', '.']


In [165]:
reodemo = re.compile(r'a.') # a followed by any character
matches = reodemo.findall(combined_string)
print(matches)

['am', 'an', 'al']


In [166]:
reodemo = re.compile(r'a\.') # a followed by a period
matches = reodemo.findall(combined_string)
print(matches)

[]


Some of these are only special characters in certain contexts and don’t have to be escaped to be recognized when not in those contexts. But they can be escaped in all circumstances and I recommend that rather than trying to figure out the exact rules.

The exclamation point is such a character.

In [167]:
reodemo = re.compile(r'\!') # literal !
matches = reodemo.findall(combined_string)
print(matches)

['!']


In [169]:
reodemo = re.compile(r'!') # special character ! isn't meaningful in this context so it assumes just !
matches = reodemo.findall(combined_string)
print(matches)

['!']
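
If you'd rather not escape by hand, the "re" module has an `escape` function that adds the backslashes for you. A minimal sketch; `literal` is a made-up piece of text to search for, and the exact set of characters that `re.escape` chooses to escape can differ between Python versions:

```python
import re

combined_string = "Example STRING, with numbers (12, 15 and also 10.2)?! Wow, two sentences."

# Used unescaped, this text would not compile as a pattern because of the
# unmatched ")".
literal = "10.2)?!"
escaped_pattern = re.escape(literal)
print(escaped_pattern)

matches = re.findall(escaped_pattern, combined_string)
print(matches)
```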


#### Class shorthands

Conversely, there are a number of characters that have special meaning only when escaped. The main ones for now are “\w” (any alphanumeric character, plus the underscore), “\s” (any whitespace character), and “\d” (any numeric digit). The capitalized versions of these are used to mean “anything but” that class.

In [170]:
reodemo = re.compile(r'\w') # any alphanumeric character
matches = reodemo.findall(combined_string)
print(matches)

['E', 'x', 'a', 'm', 'p', 'l', 'e', 'S', 'T', 'R', 'I', 'N', 'G', 'w', 'i', 't', 'h', 'n', 'u', 'm', 'b', 'e', 'r', 's', '1', '2', '1', '5', 'a', 'n', 'd', 'a', 'l', 's', 'o', '1', '0', '2', 'W', 'o', 'w', 't', 'w', 'o', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', 's']


In [171]:
reodemo = re.compile(r'\W') # any non-alphanumeric character
matches = reodemo.findall(combined_string)
print(matches)

[' ', ',', ' ', ' ', ' ', '(', ',', ' ', ' ', ' ', ' ', '.', ')', '?', '!', ' ', ',', ' ', ' ', '.']


In [172]:
reodemo = re.compile(r'\s') # any whitespace character
matches = reodemo.findall(combined_string)
print(matches)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']


In [174]:
reodemo = re.compile(r'\S') # any non-whitespace character
matches = reodemo.findall(combined_string)
print(matches)

['E', 'x', 'a', 'm', 'p', 'l', 'e', 'S', 'T', 'R', 'I', 'N', 'G', ',', 'w', 'i', 't', 'h', 'n', 'u', 'm', 'b', 'e', 'r', 's', '(', '1', '2', ',', '1', '5', 'a', 'n', 'd', 'a', 'l', 's', 'o', '1', '0', '.', '2', ')', '?', '!', 'W', 'o', 'w', ',', 't', 'w', 'o', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', 's', '.']


In [175]:
reodemo = re.compile(r'\d') # any numeric digit
matches = reodemo.findall(combined_string)
print(matches)

['1', '2', '1', '5', '1', '0', '2']


In [176]:
reodemo = re.compile(r'\D') # any non-digit character
matches = reodemo.findall(combined_string)
print(matches)

['E', 'x', 'a', 'm', 'p', 'l', 'e', ' ', 'S', 'T', 'R', 'I', 'N', 'G', ',', ' ', 'w', 'i', 't', 'h', ' ', 'n', 'u', 'm', 'b', 'e', 'r', 's', ' ', '(', ',', ' ', ' ', 'a', 'n', 'd', ' ', 'a', 'l', 's', 'o', ' ', '.', ')', '?', '!', ' ', 'W', 'o', 'w', ',', ' ', 't', 'w', 'o', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', 's', '.']


The Python re module does not directly support "POSIX" classes.
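
You can approximate some of them with explicit character classes. The sketch below is ASCII-only and only a rough stand-in for the POSIX classes named in the comments; `upper` and `punct` are illustrative names:

```python
import re
import string

combined_string = "Example STRING, with numbers (12, 15 and also 10.2)?! Wow, two sentences."

# Rough ASCII-only stand-ins for two POSIX classes.
upper = re.compile(r'[A-Z]')                                    # roughly [[:upper:]]
punct = re.compile('[' + re.escape(string.punctuation) + ']')   # roughly [[:punct:]]

print(upper.findall(combined_string))
print(punct.findall(combined_string))
```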

#### Quantifiers: * (zero or more of the previous)

This is also known as the “Kleene star” (pronounced clean-ee), after Stephen Kleene, who introduced the notation in formal logic.

In [178]:
reodemo = re.compile(r'\d*') # any string of zero or more digits
matches = reodemo.findall(combined_string)
print(matches)

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '12', '', '', '15', '', '', '', '', '', '', '', '', '', '', '10', '', '2', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']


Note that the "zero or more" led it to identify every position of the string as a match, many of them empty (containing no characters).

#### Quantifiers: + (one or more of the previous)

This is also known as the “Kleene plus.”

In [179]:
reodemo = re.compile(r'\d+') # any string of one or more digits
matches = reodemo.findall(combined_string)
print(matches)

['12', '15', '10', '2']


#### Quantifiers {n} {n,m} and {n,}

{n} = “exactly n” of the previous
{n,m} = “between n and m” of the previous
{n,} = “n or more” of the previous

In [180]:
reodemo = re.compile(r'x{3}') # 3 x's
matches = reodemo.findall('x xx xxx xxxx xxxxx')
print(matches)

['xxx', 'xxx', 'xxx']


In [181]:
reodemo = re.compile(r'x{3,4}') # 3 or 4 x's
matches = reodemo.findall('x xx xxx xxxx xxxxx')
print(matches)

['xxx', 'xxxx', 'xxxx']


In [182]:
reodemo = re.compile(r'x{3,}') # 3 or more x's
matches = reodemo.findall('x xx xxx xxxx xxxxx')
print(matches)

['xxx', 'xxxx', 'xxxxx']


Were any of those unexpected? (Probably ... how many strings of 3 x's are in that string?) Use your regex viewer to see what's going on.
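
If a regex viewer isn't handy, the match positions can also be printed directly. A small sketch that marks each match of `x{3}` under the text; the scan moves left to right and never reuses characters, so each longer run yields one match and the leftover x's are too few for another:

```python
import re

text = 'x xx xxx xxxx xxxxx'
reodemo = re.compile(r'x{3}')

# Collect the (start, end) span of every match.
spans = [mo.span() for mo in reodemo.finditer(text)]

# Mark each span with carets underneath the original text.
print(text)
for start, end in spans:
    print(" " * start + "^" * (end - start), (start, end))
```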

#### Quantifier ? (zero or one of the previous)

In [183]:
reodemo = re.compile(r'\d?') # any string of zero or one digits
matches = reodemo.findall(combined_string)
print(matches)

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '1', '2', '', '', '1', '5', '', '', '', '', '', '', '', '', '', '', '1', '0', '', '2', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']


In [184]:
reodemo = re.compile(r' [bp]?eat ')
matches = [itsamatch for itsamatch in sentences if reodemo.search(itsamatch)]
matches

['The heart beat strongly and with firm strokes.',
 'Burn peat after the logs give out.',
 'A speedy man can beat this track mark.',
 'Even the worst will beat his low score.',
 'Quench your thirst, then eat the crackers.']

#### Question Mark as Nongreedy Modifier to Quantifier (smallest match of previous possible)


In [186]:
reodemo = re.compile(r'\(.+\)') # greedy - roughly, longest match
matches = reodemo.findall('(First bracketed statement) Other text (Second bracketed statement)')
print(matches)

['(First bracketed statement) Other text (Second bracketed statement)']


In [187]:
reodemo = re.compile(r'\(.+?\)') # nongreedy - roughly, smallest matches
matches = reodemo.findall('(First bracketed statement) Other text (Second bracketed statement)')
print(matches)

['(First bracketed statement)', '(Second bracketed statement)']


In [188]:
reodemo = re.compile(r'x.+x') # greedy - matches whole string
matches = reodemo.findall('x xx xxx xxxx xxxxx')
print(matches)

['x xx xxx xxxx xxxxx']


In [189]:
reodemo = re.compile(r'x.+?x') # nongreedy - minimal match as placeholder moves across string
matches = reodemo.findall('x xx xxx xxxx xxxxx')
print(matches)

['x x', 'x x', 'xx x', 'xxx', 'xxx']
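
For the bracketed-statement example, a negated character class is a common alternative to the nongreedy modifier. A sketch of that swapped-in approach, which assumes the bracketed text never contains a nested ")":

```python
import re

text = '(First bracketed statement) Other text (Second bracketed statement)'

# [^)]+ means "one or more characters that are not a closing parenthesis",
# so even a greedy + cannot run past the first ")".
reodemo = re.compile(r'\([^)]+\)')
matches = reodemo.findall(text)
print(matches)
```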


#### Anchors at beginning and end of string

In [190]:
reodemo = re.compile(r'^\w+') # ^ is beginning of string
matches = reodemo.findall(combined_string)
print(matches)

['Example']


In [191]:
reodemo = re.compile(r'\w+$') # $ is end of string
matches = reodemo.findall(combined_string)
print(matches)

[]


In [192]:
reodemo = re.compile(r'\W+$') # $ is end of string
matches = reodemo.findall(combined_string)
print(matches)

['.']
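
In a string with several lines, the anchors by default still refer to the whole string. The `re.MULTILINE` flag makes them apply to each line instead. A short sketch using the multi-line test string from earlier (the result names are made up):

```python
import re

test_string = "abc ABC 123\t.!?\\(){}\n  \nthird line"

# By default ^ matches only at the very start of the string.
default_matches = re.findall(r'^\w+', test_string)
print(default_matches)

# With re.MULTILINE, ^ also matches right after each newline.
multiline_matches = re.findall(r'^\w+', test_string, flags=re.MULTILINE)
print(multiline_matches)
```

The second line of the string starts with spaces, so it contributes no match in either case.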


#### Anchors at word boundaries

Similarly, we can identify "word boundaries" with \b. This solves the greedy/nongreedy problem we had with the "x" sequences above. It still thinks the decimal point in 10.2 is a word boundary, though.

In [194]:
reodemo = re.compile(r'\bx.*?\b') # an x at a word boundary, then as few characters as needed to reach the next boundary
matches = reodemo.findall('x xx xxx xxxx xxxxx')
print(matches)

['x', 'xx', 'xxx', 'xxxx', 'xxxxx']


In [195]:
reodemo = re.compile(r'\b\w+?\b') # still a little dumb
matches = reodemo.findall(combined_string)
print(matches)

['Example', 'STRING', 'with', 'numbers', '12', '15', 'and', 'also', '10', '2', 'Wow', 'two', 'sentences']
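
One way to keep 10.2 together is to try a "number with optional decimal part" pattern before falling back to a run of word characters. This is a sketch of one possible tokenizing pattern, not a general solution; for instance, a token like "3rd" would come out as "3" and "rd":

```python
import re

combined_string = "Example STRING, with numbers (12, 15 and also 10.2)?! Wow, two sentences."

# Try a number with an optional decimal part first; otherwise take a run of
# word characters. Alternatives are tried left to right at each position.
reodemo = re.compile(r'\d+(?:\.\d+)?|\w+')
tokens = reodemo.findall(combined_string)
print(tokens)
```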
