# Week 8: Regular Expressions

— _CS50’s Introduction to Programming with Python_

### Regular Expressions

Regular expressions are a compact, domain-specific language for describing patterns in text. They are not specific to Python: most programming languages and many applications support them.

Text editors and word processors use them to find and replace text, compilers and interpreters use them to recognize patterns in source code, and search engines use them to find patterns in web pages.

They are also used to validate user input such as email addresses, phone numbers, and credit card numbers, and to parse text and extract information from it.

##### validate.py

In [1]:
email = input("Enter your email: ").strip()

if "@" in email:
    print("Your email is valid")
else:
    print("Your email is invalid")

Your email is valid


In [2]:
email = input("Enter your email: ").strip()

if "@" in email and "." in email:
    print("Your email is valid")
else:
    print("Your email is invalid")

Your email is valid


In [None]:
email = input("Enter your email: ").strip()

username, domain = email.split("@")

if username and "." in domain:
    print("Your email is valid")
else:
    print("Your email is invalid")

In [None]:
email = input("Enter your email: ").strip()

username, domain = email.split("@")

if username and domain.endswith(".edu"):
    print("Your email is valid")
else:
    print("Your email is invalid")
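
The two cells above unpack `email.split("@")` into exactly two names, so an input with no `@`, or with more than one, raises a `ValueError` before either message is printed. Here is a minimal sketch of one way to guard against that. The helper name `is_valid_edu` is hypothetical and not part of the lecture code; it uses `str.partition`, which always returns three parts.

```python
def is_valid_edu(email: str) -> bool:
    """Return True if email looks like username@something.edu (hypothetical helper)."""
    # partition() always returns a 3-tuple, so unpacking never raises
    username, at, domain = email.strip().partition("@")
    # Reject a missing "@", an empty username, or a second "@" in the domain
    if not at or not username or "@" in domain:
        return False
    return domain.endswith(".edu")


print(is_valid_edu("malan@harvard.edu"))   # True
print(is_valid_edu("malan"))               # False
print(is_valid_edu("malan@@harvard.edu"))  # False
```

Every extra rule needs another hand-written condition, which is the motivation for switching to regular expressions.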

### Using the re module

In [1]:
import re

email = input("Enter your email: ").strip()

if re.search("@", email):
    print("Your email is valid")
else:
    print("Your email is invalid")

Your email is valid


The pattern you pass to `re.search()` can contain many special symbols. The most common ones are:

* `.`: matches any character except a newline
* `*`: matches zero or more repetitions of the preceding character or group
* `+`: matches one or more repetitions of the preceding character or group
* `?`: matches zero or one repetition of the preceding character or group
* `{m}`: matches exactly m repetitions of the preceding character or group
* `{m,}`: matches m or more repetitions of the preceding character or group
* `{,n}`: matches 0 to n repetitions of the preceding character or group
* `{m,n}`: matches m to n repetitions of the preceding character or group
* `[abc]`: matches any character between the brackets (such as a, b, or c)
* `[^abc]`: matches any character that isn’t between the brackets
* `\d`: matches any decimal digit
* `\D`: matches any character that is not a decimal digit
* `\w`: matches any alphanumeric character or the underscore
* `\W`: matches any character that `\w` does not match
* `\s`: matches any whitespace character
* `\S`: matches any non-whitespace character
* `\b`: matches the empty string at a word boundary, that is, between a `\w` and a `\W` character or at the start or end of a word
* `\B`: matches the empty string anywhere that is not a word boundary
* `A|B`: matches either A or B
* `()`: groups a sub-pattern and captures the text it matches
* `(?:)`: groups a sub-pattern without creating a capture group
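
A few of the symbols above can be tried in a short sketch. The sample strings and variable names are made up for illustration, and `re.findall` is used here only because it returns every match as a list.

```python
import re

text = "Order 66 shipped to cs50@harvard.edu on 2023-01-15"

# \d+ : one or more digits in a row
digits = re.findall(r"\d+", text)
print(digits)  # ['66', '50', '2023', '01', '15']

# \d{4} : exactly four digits
year = re.search(r"\d{4}", text).group()
print(year)  # 2023

# ? : the preceding character is optional
spellings = re.findall(r"colou?r", "color colour")
print(spellings)  # ['color', 'colour']

# A|B inside a non-capturing group, with \b marking word boundaries
pets = re.findall(r"\b(?:cat|dog)\b", "cat catalog dog")
print(pets)  # ['cat', 'dog']

# [^...] : any character NOT in the set
non_vowels = re.findall(r"[^aeiou]", "regex")
print(non_vowels)  # ['r', 'g', 'x']
```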

In [None]:
import re

email = input("Enter your email: ").strip()

# `.*` matches zero or more characters on each side of the `@`
if re.search(".*@.*", email):
    print("Your email is valid")
else:
    print("Your email is invalid")

With this regex, the program prints valid even when nothing follows the `@`, because `*` accepts zero or more repetitions of the preceding character. To require at least one character after the `@`, use `+` instead of `*`; it accepts one or more repetitions of the preceding character.
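
The difference between `*` and `+` can be checked directly. This is a small sketch with made-up inputs; the `results` dictionary is only there to collect the outcomes.

```python
import re

candidates = ["malan@harvard", "malan@", "@", ""]
results = {}
for candidate in candidates:
    star = bool(re.search(r".*@.*", candidate))  # zero or more on each side
    plus = bool(re.search(r".+@.+", candidate))  # one or more on each side
    results[candidate] = (star, plus)
    print(f"{candidate!r}: star={star}, plus={plus}")
```

Only `"malan@harvard"` satisfies both patterns. `"malan@"` and `"@"` pass the `*` version but fail the `+` version, and the empty string fails both because it contains no `@`.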

In [None]:
import re

email = input("Enter your email: ").strip()

# `.+` requires at least one character on each side of the `@`
if re.search(".+@.+", email):
    print("Your email is valid")
else:
    print("Your email is invalid")

In [None]:
import re

email = input("Enter your email: ").strip()

# The r prefix makes this a raw string: Python leaves the backslash alone,
# so `\.` reaches the regex engine intact and matches a literal dot
if re.search(r".+@.+\.edu", email):
    print("Your email is valid")
else:
    print("Your email is invalid")
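
To see why both the raw string and the escaped dot matter, here is a short sketch. The sample address with a `?` in place of the dot is invented for illustration.

```python
import re

# A raw string keeps the backslash as an ordinary character
print(len("\n"), len(r"\n"))  # 1 2

unescaped = r".+@.+.edu"  # the last "." matches ANY character
escaped = r".+@.+\.edu"   # "\." matches only a literal dot

sample = "malan@harvard?edu"
unescaped_ok = bool(re.search(unescaped, sample))
escaped_ok = bool(re.search(escaped, sample))
print(unescaped_ok)  # True
print(escaped_ok)    # False
```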

Within a regular expression, two anchors tie the pattern to a position in the string:

* `^`: matches the start of the string
* `$`: matches the end of the string or just before the newline at the end of the string

In [None]:
import re

email = input("Enter your email: ").strip()

if re.search(r"^.+@.+\.edu$", email):
    print("Your email is valid")
else:
    print("Your email is invalid")
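
The effect of the anchors shows up when the address is embedded in a longer string. This is a sketch with an invented sentence as input.

```python
import re

sentence = "My email address is malan@harvard.edu OK?"

# Without anchors, a match anywhere inside the string is enough
unanchored = bool(re.search(r".+@.+\.edu", sentence))
# With anchors, the whole string has to fit the pattern
anchored = bool(re.search(r"^.+@.+\.edu$", sentence))
print(unanchored)  # True
print(anchored)    # False

# `$` also matches just before a single trailing newline
trailing_newline = bool(re.search(r"^.+@.+\.edu$", "malan@harvard.edu\n"))
print(trailing_newline)  # True
```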

You can use

* `[]` to indicate a set of characters, so `[a-m]` will match any lowercase letter from `a` to `m`.
* `[^]` to indicate a set of characters you do not want to match, so `[^a-m]` will match any character that is not a lowercase letter from `a` to `m`.

In [None]:
import re

email = input("Enter your email: ").strip()

if re.search(r"^[^@]+@[^@]+\.edu$", email):
    print("Your email is valid")
else:
    print("Your email is invalid")

The regex `^[^@]+@[^@]+\.edu$` matches a string that starts with one or more characters other than `@`, followed by a single `@`, followed by one or more characters other than `@`, and ends with a literal `.edu`.
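
As a quick check of that description, this sketch compares the `[^@]` pattern with the earlier `.+` pattern on an input containing several `@` signs. The variable names are chosen for illustration.

```python
import re

loose = r"^.+@.+\.edu$"         # "." also matches "@"
strict = r"^[^@]+@[^@]+\.edu$"  # exactly one "@" allowed

candidate = "malan@@@harvard.edu"
loose_ok = bool(re.search(loose, candidate))
strict_ok = bool(re.search(strict, candidate))
print(loose_ok)   # True
print(strict_ok)  # False

# A well-formed address passes the strict pattern
print(bool(re.search(strict, "malan@harvard.edu")))  # True
```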

In [None]:
import re

email = input("Enter your email: ").strip()

if re.search(r"^[a-zA-Z0-9_]+@[a-zA-Z0-9_]+\.edu$", email):
    print("Your email is valid")
else:
    print("Your email is invalid")

In [None]:
import re

email = input("Enter your email: ").strip()

if re.search(r"^\w+@\w+\.edu$", email):
    print("Your email is valid")
else:
    print("Your email is invalid")

Passing a flag as the third argument to `re.search()` changes how the pattern is interpreted. The most common flags are:
* `re.IGNORECASE`: ignores case
* `re.MULTILINE`: makes `^` and `$` match at the start and end of each line, not only at the start and end of the whole string
* `re.DOTALL`: makes the `.` special character match any character, including a newline
* `re.VERBOSE`: ignores whitespace and comments inside the regular expression string
* `re.ASCII`: makes several escapes like `\w`, `\b`, `\s` match only on ASCII characters

>_Read more about the re module [here](https://docs.python.org/3/library/re.html)._
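
The flags listed above can be compared side by side. This is a sketch only: the sample strings and the name `email_pattern` are illustrative, and `re.compile` is used just so the verbose pattern can be spread over several lines.

```python
import re

# re.IGNORECASE: letters match regardless of case
case_off = bool(re.search(r"\.edu$", "MALAN@HARVARD.EDU"))
case_on = bool(re.search(r"\.edu$", "MALAN@HARVARD.EDU", re.IGNORECASE))
print(case_off, case_on)  # False True

# re.MULTILINE: ^ and $ also match at the start and end of each line
lines = "first\nsecond"
single = re.findall(r"^\w+$", lines)
multi = re.findall(r"^\w+$", lines, re.MULTILINE)
print(single, multi)  # [] ['first', 'second']

# re.DOTALL: "." also matches a newline
dot_off = bool(re.search(r"a.b", "a\nb"))
dot_on = bool(re.search(r"a.b", "a\nb", re.DOTALL))
print(dot_off, dot_on)  # False True

# re.ASCII: \w no longer matches non-ASCII letters such as "é" (\u00e9)
uni = bool(re.search(r"^\w+$", "caf\u00e9"))
ascii_only = bool(re.search(r"^\w+$", "caf\u00e9", re.ASCII))
print(uni, ascii_only)  # True False

# re.VERBOSE: whitespace and comments inside the pattern are ignored;
# flags are combined with "|"
email_pattern = re.compile(
    r"""
    ^\w+       # username
    @          # literal at sign
    \w+\.edu$  # domain ending in .edu
    """,
    re.VERBOSE | re.IGNORECASE,
)
print(bool(email_pattern.search("Malan@Harvard.EDU")))  # True
```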

In [None]:
import re

email = input("Enter your email: ").strip()

if re.search(r"^\w+@(\w+\.)?\w+\.edu$", email, re.IGNORECASE):
    print("Your email is valid")
else:
    print("Your email is invalid")
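
The group `(\w+\.)?` in the cell above makes a single subdomain optional. Here is a sketch that exercises it with a few invented addresses; the helper name `is_valid` is hypothetical.

```python
import re

PATTERN = r"^\w+@(\w+\.)?\w+\.edu$"


def is_valid(email: str) -> bool:
    """Hypothetical helper wrapping the pattern from the cell above."""
    return re.search(PATTERN, email.strip(), re.IGNORECASE) is not None


print(is_valid("malan@harvard.edu"))       # True: no subdomain
print(is_valid("MALAN@CS50.Harvard.EDU"))  # True: one subdomain, any case
print(is_valid("malan@a.b.harvard.edu"))   # False: two subdomains
print(is_valid("malan@.edu"))              # False: empty domain name
```

Because `?` allows at most one repetition of the group, a second subdomain is rejected. Replacing `?` with `*` would be one way to accept any number of them.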

##### format.py

In [3]:
name = input("What's your name? ").strip()

if "," in name:
    last, first = name.split(", ")
    name = f"{first} {last}"

print(f"Hello {name}")

Hello David Malan


In [None]:
import re

name = input("What's your name? ").strip()
matches = re.search(r"^(.+), (.+)$", name)

if matches:
    last, first = matches.groups()
    name = f"{first} {last}"

print(f"Hello, {name}")

In [None]:
import re

name = input("What's your name? ").strip()
matches = re.search(r"^(.+), (.+)$", name)

if matches:
    last = matches.group(1)
    first = matches.group(2)
    name = f"{first} {last}"

print(f"Hello, {name}")

In [None]:
import re

name = input("What's your name? ").strip()
matches = re.search(r"^(.+), *(.+)$", name)

if matches:
    name = matches.group(2) + " " + matches.group(1)

print(f"Hello, {name}")
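
The capture-group logic from the cells above can be wrapped in a function so it can be tried without typing input. This is a sketch; the name `format_name` is hypothetical.

```python
import re


def format_name(name: str) -> str:
    """Turn "Last, First" into "First Last"; leave other input unchanged."""
    matches = re.search(r"^(.+), *(.+)$", name.strip())
    if matches:
        # group(0) is the whole match; capture groups are numbered from 1
        return f"{matches.group(2)} {matches.group(1)}"
    return name.strip()


print(format_name("Malan, David"))    # David Malan
print(format_name("Malan,David"))     # David Malan
print(format_name("Malan,   David"))  # David Malan
print(format_name("David Malan"))     # David Malan

# groups() returns all capture groups as a tuple
print(re.search(r"^(.+), *(.+)$", "Malan, David").groups())  # ('Malan', 'David')
```

The ` *` after the comma absorbs any number of spaces, which is why all three comma-separated variants produce the same result.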

In [None]:
import re

name = input("What's your name? ").strip()

if matches := re.search(r"^(.+), *(.+)$", name):
    name = matches.group(2) + " " + matches.group(1)

print(f"Hello, {name}")

The walrus operator `:=`, introduced in Python 3.8, assigns a value to a variable as part of a larger expression.

It is useful when you want to assign a value and test it in the same `if` or `elif` condition.
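
As a side-by-side sketch (Python 3.8 or later), the two forms below do the same work. The variable names are chosen only for illustration.

```python
import re

name = "Malan, David"
pattern = r"^(.+), *(.+)$"

# Without the walrus operator: assign first, then test
matches = re.search(pattern, name)
if matches:
    without_walrus = matches.group(2) + " " + matches.group(1)

# With the walrus operator: assign and test in a single expression
if m := re.search(pattern, name):
    with_walrus = m.group(2) + " " + m.group(1)

print(without_walrus)                 # David Malan
print(without_walrus == with_walrus)  # True
```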

##### twitter.py

In [None]:
url = input("URL: ").strip()

username = url.replace("https://twitter.com/", "")
print(f"Username: {username}")

In [None]:
url = input("URL: ").strip()

username = url.removeprefix("https://twitter.com/")
print(f"Username: {username}")

In [None]:
import re

url = input("URL: ").strip()

username = re.sub(r"https://twitter.com/", "", url)

print(f"Username: {username}")

In [None]:
import re

url = input("URL: ").strip()

username = re.sub(r"^(https?://)?(www\.)?twitter\.com/", "", url)

print(f"Username: {username}")
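
One limitation of the `re.sub` approach is that it returns the input unchanged when the pattern does not match, so a non-Twitter URL would be printed as if it were a username. A small sketch with invented URLs shows this:

```python
import re

pattern = r"^(https?://)?(www\.)?twitter\.com/"

urls = [
    "https://twitter.com/davidjmalan",
    "http://www.twitter.com/davidjmalan",
    "twitter.com/davidjmalan",
    "https://example.com/davidjmalan",  # not a Twitter URL
]
cleaned = [re.sub(pattern, "", url) for url in urls]
for url, result in zip(urls, cleaned):
    print(f"{url} -> {result}")
```

The first three URLs reduce to `davidjmalan`, while the last comes back untouched. This is why matching with `re.search` and checking the result is the safer approach.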

In [None]:
import re

url = input("URL: ").strip()

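# The protocol group is non-capturing, so group 1 is the optional "www." and group 2 is the username
if matches := re.search(r"^(?:https?://)?(www\.)?twitter\.com/(.+)$", url, re.IGNORECASE):
    print("Username:", matches.group(2))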

In [None]:
import re

url = input("URL: ").strip()

# Both optional prefixes are non-capturing, so the username is the only numbered group
if matches := re.search(r"^(?:https?://)?(?:www\.)?twitter\.com/(.+)$", url, re.IGNORECASE):
    print("Username:", matches.group(1))

In [None]:
import re

url = input("URL: ").strip()

# `\w+` stops at the first character that cannot be part of a username, such as `/` or `?`
if matches := re.search(r"^(?:https?://)?(?:www\.)?twitter\.com/(\w+)", url, re.IGNORECASE):
    print("Username:", matches.group(1))
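
The final version can be sketched as a reusable function. The name `extract_username` is hypothetical. It assumes both optional prefixes are written as non-capturing groups, so the username is the only numbered group, and it returns `None` when the URL does not match.

```python
import re
from typing import Optional


def extract_username(url: str) -> Optional[str]:
    """Return the Twitter username in url, or None if url does not match."""
    pattern = r"^(?:https?://)?(?:www\.)?twitter\.com/(\w+)"
    if matches := re.search(pattern, url.strip(), re.IGNORECASE):
        return matches.group(1)
    return None


print(extract_username("https://twitter.com/davidjmalan"))        # davidjmalan
print(extract_username("https://www.twitter.com/davidjmalan/"))   # davidjmalan
print(extract_username("twitter.com/davidjmalan?s=21"))           # davidjmalan
print(extract_username("https://example.com/davidjmalan"))        # None
```

Dropping the trailing `$` and using `\w+` means that anything after the username, such as a slash or a query string, is ignored rather than captured.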