# Character Classes

In the previous lessons, we used regular expressions to match fixed strings like `"Python"`. That isn't what regexes are usually used for, though, since matching fixed strings is easily done without them.

Typically, when we use regexes, we search a text for any character from a set of characters (a **character class**). This can be a predefined class or one we define ourselves.

In [1]:
import re

## Defining classes

To define a character class, we use the `[]` characters, two of Python `re`'s **metacharacters** (characters with special meanings).

Inside the `[]`, we place a set of characters, and the class matches exactly one character from that set. There are several ways to write the set. The simplest is to list every character in the class. For instance, to match any one of the characters `a`, `b`, and `c`, we can write `[abc]`.

In [2]:
pattern = re.compile("[abc]")
print(pattern.findall("Programming"))

print(re.findall("[h1@]", "Hello, @futureprogrammer360"))

['a']
['@']
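
The second result above contains no `H` because a class matches only the exact characters listed, so case matters. A minimal sketch of that behavior follows; the variable names are illustrative only.

```python
import re

text = "Hello, @futureprogrammer360"

# "[h]" lists only the lowercase letter, and the text has no lowercase "h".
lowercase_only = re.findall("[h]", text)

# Listing both cases makes the class match the capital "H" as well.
either_case = re.findall("[Hh]", text)

# Each match is a single character, even when matches are adjacent.
single_chars = re.findall("[lo]", "Hello")

print(lowercase_only)  # []
print(either_case)     # ['H']
print(single_chars)    # ['l', 'l', 'o']
```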


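Inside a class, a `-` between two characters defines a range: `[b-f]` matches any single character from `b` through `f`. Several ranges can be listed in one class.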

In [3]:
print(re.findall("[b-f]", "abcdefgh"))
print(re.findall("[a-zA-Z]", "AaBb123!@#"))
print(re.findall("[3-4][0-3][2-4]", "312423567"))

['b', 'c', 'd', 'e', 'f']
['A', 'a', 'B', 'b']
['312', '423']
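
As a further sketch of ranges, the example below combines three ranges into a hexadecimal-digit class and shows that a `-` placed last in the class is treated as a literal hyphen rather than a range operator. The sample strings and variable names are made up for illustration.

```python
import re

# Three ranges in one class: digits plus the letters a-f in either case.
hex_digit = re.compile("[0-9a-fA-F]")
hex_chars = hex_digit.findall("0x1F, 0xZG")

# A "-" at the end of the class has no range to form, so it is a literal hyphen.
phone_chars = re.findall("[0-9-]", "555-1234")

print(hex_chars)    # ['0', '1', 'F', '0']
print(phone_chars)  # ['5', '5', '5', '-', '1', '2', '3', '4']
```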


`^` complements a set, allowing us to match anything *not* in the given set. Note that `^` has this meaning only as the **first** character of the class definition; anywhere else in the class, it is a literal `^`.

In [4]:
print(re.findall("[^0-9]", "ABC123!@^"))
print(re.findall("[0-9^]", "ABC123!@^"))
print(re.findall("[^23]", "ABC123!@^"))

['A', 'B', 'C', '!', '@', '^']
['1', '2', '3', '^']
['A', 'B', 'C', '1', '!', '@', '^']
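
Negation also applies to ranges, which gives a compact way to pick out "everything else". The sketch below uses made-up sample strings to find the characters that are not ASCII letters or digits, and shows a `^` that is not in first position acting as an ordinary character.

```python
import re

# Negating several ranges at once: anything that is not an ASCII letter or digit.
separators = re.findall("[^a-zA-Z0-9]", "user_name@example.com")

# "^" is not first here, so the class means "an equals sign or a caret".
equals_or_caret = re.findall("[=^]", "2^10 = 1024")

print(separators)       # ['_', '@', '.']
print(equals_or_caret)  # ['^', '=']
```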


## Predefined sequences

While we can always define our own character classes, there are also some special sequences that we don't need to define ourselves.

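| Sequence | Equivalent Class (ASCII) | Description                                   |
|----------|--------------------------|-----------------------------------------------|
| `\w`     | `[a-zA-Z0-9_]`           | Word characters (letters, digits, underscore) |
| `\W`     | `[^a-zA-Z0-9_]`          | Non-word characters                           |
| `\d`     | `[0-9]`                  | Digits                                        |
| `\D`     | `[^0-9]`                 | Non-digit characters                          |
| `\s`     | `[ \t\n\r\f\v]`          | Whitespace characters                         |
| `\S`     | `[^ \t\n\r\f\v]`         | Non-whitespace characters                     |

For `str` patterns in Python 3, these sequences are Unicode-aware by default, so they also match non-ASCII letters, digits, and whitespace. The classes in the table are exact only when the `re.ASCII` flag is set.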

Because `\` has a special meaning in Python string literals (think `\n` and `\t`), we have to write two `\`s to get one literal `\` into the pattern.

In [5]:
print(re.findall("\\w", "Aa12 \t*&"))
print(re.findall("\\W", "Aa12 \t*&"))
print(re.findall("\\d", "Aa12 \t*&"))
print(re.findall("\\D", "Aa12 \t*&"))
print(re.findall("\\s", "Aa12 \t*&"))
print(re.findall("\\S", "Aa12 \t*&"))

['A', 'a', '1', '2']
[' ', '\t', '*', '&']
['1', '2']
['A', 'a', ' ', '\t', '*', '&']
[' ', '\t']
['A', 'a', '1', '2', '*', '&']
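
The input above is plain ASCII, so the results agree with the bracketed classes in the table. With non-ASCII text, the default behavior for `str` patterns is broader. The sketch below compares the default with the `re.ASCII` flag; the sample characters (an accented letter and an Arabic-Indic digit, written as escapes) are chosen only for illustration.

```python
import re

# "\u0663" is the Arabic-Indic digit three; "\u00e9" is the letter e with an acute accent.
digits_text = "7 \u0663"
word_text = "\u00e9_1"

# Default (Unicode-aware) matching for str patterns.
unicode_digits = re.findall("\\d", digits_text)
unicode_word = re.findall("\\w", word_text)

# With re.ASCII, \d and \w behave like [0-9] and [a-zA-Z0-9_].
ascii_digits = re.findall("\\d", digits_text, re.ASCII)
ascii_word = re.findall("\\w", word_text, re.ASCII)

print(len(unicode_digits), len(ascii_digits))  # 2 1
print(len(unicode_word), len(ascii_word))      # 3 2
```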


The more `\`s we write, the less readable our regexes become. To solve that problem, we can use r-strings (raw strings), written with an `r` prefix before the opening quote. In r-strings, `\`s are treated as literal `\` characters.

In [6]:
pattern = re.compile(r"\d")
print(pattern.findall("I have 200 dogs and 500 cats"))

['2', '0', '0', '5', '0', '0']
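
To make the equivalence concrete, the sketch below writes the same three-digit pattern both ways and checks that the two spellings are the same string. The variable names are illustrative.

```python
import re

escaped = "\\d\\d\\d"  # regular string: every backslash is doubled
raw = r"\d\d\d"        # raw string: backslashes are written once

# Both literals produce the same six-character pattern.
same_pattern = escaped == raw

three_digits = re.findall(raw, "I have 200 dogs and 500 cats")

print(same_pattern)  # True
print(len(raw))      # 6
print(three_digits)  # ['200', '500']
```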


Lastly, we can combine self-defined and predefined character classes by placing a predefined sequence inside `[]`.

In [7]:
pattern = re.compile(r"[a\dc]")
print(pattern.findall("abc123"))

['a', 'c', '1', '2', '3']
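
Several predefined sequences can share one class, and the whole class can still be negated with `^`. A short sketch, with a made-up sample string:

```python
import re

text = "a1 b2\tc3"

# A digit or a whitespace character.
digits_or_space = re.findall(r"[\d\s]", text)

# Negated: anything that is neither a digit nor whitespace.
neither = re.findall(r"[^\d\s]", text)

print(digits_or_space)  # ['1', ' ', '2', '\t', '3']
print(neither)          # ['a', 'b', 'c']
```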


## `.`

`.` is a metacharacter that matches any single character except a newline. Like a character class, it matches exactly one character, which is why `A.A` does not match `A11A` below.

In [8]:
pattern = re.compile("A.A")
print(pattern.findall("A1A"))
print(pattern.findall("A$A"))
print(pattern.findall("A\nA"))
print(pattern.findall("A11A"))

['A1A']
['A$A']
[]
[]
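
Two related cases are worth a quick sketch: making `.` match a newline, and matching a literal dot. The example assumes the standard `re.DOTALL` flag for the first case and uses either an escaped `\.` or the class `[.]` for the second; the sample strings are illustrative.

```python
import re

# With re.DOTALL, "." also matches the newline character.
dotall_pattern = re.compile("A.A", re.DOTALL)
with_newline = dotall_pattern.findall("A\nA")

# To match an actual dot, escape it or put it inside a class.
escaped_dot = re.findall(r"A\.A", "A.A A1A")
class_dot = re.findall("A[.]A", "A.A A1A")

print(with_newline)  # ['A\nA']
print(escaped_dot)   # ['A.A']
print(class_dot)     # ['A.A']
```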


## Summary

That is all for today's lesson on character classes in Python regular expressions. You learned how to define your own character sets using `[]`, `-`, and `^`. You also learned to use predefined sequences in your programs and to combine self-defined and predefined character classes. Lastly, you learned that `.` matches any character except a newline.