# Exercises from ChatGPT

Regex practice exercises in increasing order of difficulty.

### Beginner Level:

1.    String: "12345"
    Task: Find all digits in the string.

2.    String: "abc123xyz"
    Task: Extract the numbers.

3.    String: "hello world"
    Task: Check if the string starts with "hello".

4.    String: "a1b2c3d4"
    Task: Find all characters that are followed by a digit.

In [51]:
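import re
from typing import Optional

def apply_regex(string: str, pattern: str, group_nr: Optional[list] = None) -> None:
    """Print every match of pattern in string, or only the listed groups of each match."""
    matches = re.finditer(pattern, string)
    for match in matches:
        if group_nr is not None:
            # Print only the requested capture groups (0 is the whole match).
            for el in group_nr:
                print(match.group(el))
        else:
            # Print the Match object itself, which shows span and matched text.
            print(match)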

### I. Beginner

In [52]:
# I.1 Find all digits in the string
apply_regex("12345", r"\d")

<re.Match object; span=(0, 1), match='1'>
<re.Match object; span=(1, 2), match='2'>
<re.Match object; span=(2, 3), match='3'>
<re.Match object; span=(3, 4), match='4'>
<re.Match object; span=(4, 5), match='5'>


In [48]:
# I.2 Extract the numbers.
# \d yields one match per digit; r"\d+" would return the whole run '123' as a single match.
apply_regex("abc123xyz", r"\d")

<re.Match object; span=(3, 4), match='1'>
<re.Match object; span=(4, 5), match='2'>
<re.Match object; span=(5, 6), match='3'>


In [49]:
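# I.3 Check if the string starts with "hello".
# ^ anchors the match at the start of the string; \bhello would also match "hello" at a later word boundary.
apply_regex("hello world", r"^hello")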

<re.Match object; span=(0, 5), match='hello'>
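
A word boundary and a start-of-string anchor are different checks. The following minimal sketch compares them on a second, made-up string (`text_middle` is an assumption for illustration, not part of the exercise):

```python
import re

text_start = "hello world"
text_middle = "say hello world"  # hypothetical counter-example

# \bhello fires at any word boundary, so it also matches mid-string.
boundary_hit = re.search(r"\bhello", text_middle)
# ^hello only succeeds at position 0.
anchor_hit = re.search(r"^hello", text_middle)
# re.match always starts at position 0, so no anchor is needed.
match_hit = re.match(r"hello", text_start)

print(boundary_hit is not None)  # True
print(anchor_hit is not None)    # False
print(match_hit is not None)     # True
```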


In [50]:
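# I.4 Find all characters that are followed by a digit.
# Note: uses a modified string instead of "a1b2c3d4", so that 'b' is not followed by a digit.
apply_regex("a1bc3D4", r"[a-zA-Z](?=\d)")
# (?=...) is a positive lookahead: it checks that a digit follows but does not consume it (zero-width),
# so the digit is not part of the match.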

<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(5, 6), match='D'>
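
To see what "zero-width" means in practice, this small sketch runs the same search with and without the lookahead. The variable names are illustrative only:

```python
import re

sample = "a1bc3D4"

# With the lookahead, the digit is checked but not included in the match.
with_lookahead = re.findall(r"[a-zA-Z](?=\d)", sample)
# Without it, the digit is consumed and becomes part of the match.
without_lookahead = re.findall(r"[a-zA-Z]\d", sample)

print(with_lookahead)     # ['a', 'c', 'D']
print(without_lookahead)  # ['a1', 'c3', 'D4']
```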


### II. Intermediate Level:

1.    String: "dog, cat, owl, bird, apple, fish"
    Task: Extract all words that start with a consonant.

2.    String: "12-34-56"
    Task: Replace the hyphens with colons.

3.    String: "email@example.com"
    Task: Validate if the string is a proper email address.

4.    String: "2023-01-06"
    Task: Extract the year, month, and day as separate groups.

In [76]:
# II.1 Extract all words that start with a consonant.
string21 = "dog, cat, owl, bird, apple, fish"
pattern = r"\b[b-df-hj-np-tv-z]\w*"  # explicit ranges of lowercase consonants
pattern2 = r"\b[^aeiouAEIOU\W\s\d]\w*"  # more compact: negated set that excludes vowels
# \b: match only at the start of a word
# [b-df-hj-np-tv-z]: any lowercase consonant, i.e. every letter except the vowels a, e, i, o, u
# \w*: zero or more word characters (the rest of the word)
matches = re.findall(pattern2, string21)
for match in matches:
    print(match)

dog
cat
bird
fish


[^...]: Negated set that excludes the following characters:

    aeiouAEIOU: vowels (lowercase and uppercase).
    \s: whitespace (redundant here, because \W already covers whitespace).
    \W: non-word characters (comma, period, space, etc.).
    \d: digits, so that words starting with a digit are excluded.

\b: Ensures that the match begins at the start of a word.

\w*: Matches the rest of the word (zero or more word characters).

Why not finditer?

    findall returns a list of all matches (the words that start with a consonant) and is ideal here, because only the matched strings are needed.
    finditer yields Match objects, which means more work when only the strings are needed (you would have to call .group() on every match).

When to use finditer?

    When you need additional information, such as the start and end position of each match, finditer is the right choice.
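
The difference between the two functions can be sketched on the same input. This example uses the negated-set pattern without the redundant `\s`; the variable names are illustrative:

```python
import re

text = "dog, cat, owl, bird, apple, fish"
pattern = r"\b[^aeiouAEIOU\W\d]\w*"

# findall: plain list of matched strings.
words = re.findall(pattern, text)
print(words)  # ['dog', 'cat', 'bird', 'fish']

# finditer: Match objects, which also carry the position of each hit.
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]
print(positions)  # [('dog', 0, 3), ('cat', 5, 8), ('bird', 15, 19), ('fish', 28, 32)]
```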

In [53]:
# II.2 Replace the hyphens with colons.
print(re.sub("-", ":", "12-34-56"))

12:34:56


In [62]:
# II.3 Validate if the string is a proper email address.
# Simplified check; the full RFC 5322 address grammar allows considerably more.
# Example addresses this pattern accepts:
#   email@example.com
#   first.last@example.com
#   email_with-dash@example.com
#   email+tag@example.com
#   email%custom@example.com

string22 = "email@example.com"
pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
match = re.fullmatch(pattern, string22)
if match:
    print(match.group())
else:
    print("No match!")

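# {2,} means "at least 2 repetitions".
# $ anchors the pattern at the end of the string (redundant with re.fullmatch, which already requires the whole string to match).
# {0,2} means "between 0 and 2 repetitions"; in Python's re module the shorthand {,2} is equivalent.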

email@example.com
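
The quantifier forms can be checked quickly with `re.fullmatch`. This is a minimal sketch on made-up strings of the letter `a`; it relies on Python's `re` module treating an omitted lower bound as zero:

```python
import re

at_least_two = re.fullmatch(r"a{2,}", "aaaa")     # 2 or more: matches
up_to_two = re.fullmatch(r"a{,2}", "aa")          # 0 to 2: matches
too_many = re.fullmatch(r"a{,2}", "aaa")          # 3 exceeds the upper bound
empty_ok = re.fullmatch(r"a{,2}", "")             # 0 repetitions are allowed

print(at_least_two is not None)  # True
print(up_to_two is not None)     # True
print(too_many is not None)      # False
print(empty_ok is not None)      # True
```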


Why not finditer?

    The task only asks whether the complete string conforms to a given pattern. fullmatch is the right tool, because it succeeds only if the entire string matches the pattern.
    finditer would find the pattern inside the string, but it does not check that the string consists of this pattern and nothing else.

What would it look like with finditer?

    Not suitable for validation. You could search for the pattern, but a hit would not guarantee that the whole string is valid.
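
A small sketch of that difference, assuming an unanchored variant of the exercise pattern (without `^` and `$`) and a made-up invalid input:

```python
import re

# Same character classes as the exercise pattern, but without ^ and $.
unanchored = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
candidate = "not valid: email@example.com!!"  # hypothetical input

# Searching finds an address somewhere inside the string ...
found = re.search(unanchored, candidate)
# ... but fullmatch rejects the string, because it contains extra text.
validated = re.fullmatch(unanchored, candidate)

print(found.group())  # email@example.com
print(validated)      # None
```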

In [66]:
# II.4 Extract the year, month, and day as separate groups.
string24 = "2023-01-06"
pattern = r"(\d{4})-(\d{2})-(\d{2})"
match = re.match(pattern, string24)
if match:
    print(match.groups())  # groups() returns all captured groups as a tuple; group() would return the whole match
else:
    print("No match!")

('2023', '01', '06')


Why not finditer?

    Here we want to extract groups (year, month, day). re.match fits well, because it returns a single Match object (or None) whose groups can be read directly.
    finditer returns an iterator of Match objects, so you would have to loop and call .group() or .groups() on each one. That is unnecessarily complicated when only one match is relevant.

```python
matches = re.finditer(r"(\d{4})-(\d{2})-(\d{2})", string24)
results = [match.groups() for match in matches]
print(results)  # Output: [('2023', '01', '06')]
```

### III. Expert

In [71]:
# III.1 Extract all prices (e.g., $25.50).
string31 = "The price is $25.50 and the discount is $5.00."
pattern = r"\$\d+\.\d{2}"
matches = re.finditer(pattern, string31)
for match in matches:
    print(match)

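# Side note on ^ (not used in this pattern):
# outside a character class, ^ anchors the match at the start of the string (here it could only match at "The");
# inside a character class, as in [^\d], it negates the set: any character except a digit.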

<re.Match object; span=(13, 19), match='$25.50'>
<re.Match object; span=(40, 45), match='$5.00'>
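
The two meanings of `^` can be illustrated on a shortened version of the sentence. This is a sketch; `text` is a made-up sample:

```python
import re

text = "The price is $25.50"

# Outside a character class: anchor at the start of the string.
at_start = re.findall(r"^\w+", text)
# Inside a character class: negation, here "one or more non-digits".
non_digits = re.findall(r"[^\d]+", text)

print(at_start)    # ['The']
print(non_digits)  # ['The price is $', '.']
```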


In [77]:
# III.2 Find all uppercase letters that are followed by lowercase letters.
string32 = "HeLLo WOrLD"
pattern = r"[A-Z](?=[a-z])"
matches = re.finditer(pattern, string32)
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='H'>
<re.Match object; span=(3, 4), match='L'>
<re.Match object; span=(7, 8), match='O'>


In [84]:
# III.3 Find pairs of two uppercase letters followed by a digit and group them.
string33 = "AA1, B2, CE3, D4"
pattern = r"([A-Z]{2}\d)"
matches = re.finditer(pattern, string33)
for match in matches:
    print(match.group())

# Inside a character class [...], most characters are taken literally: [(A-Z{2}0-9)] does not work,
# because { 2 } are read as the literal characters "{", "2", "}" and not as the quantifier "exactly 2".

AA1
CE3
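
A quick way to see this is to put the quantifier once inside and once outside the brackets. This is a sketch on a made-up string that also contains literal braces:

```python
import re

text = "AA1 {2}"

# Inside the class, "{", "2" and "}" are just members of the set.
inside_class = re.findall(r"[A-Z{2}\d]", text)
# Outside the class, {2} is a quantifier for the preceding [A-Z].
outside_class = re.findall(r"[A-Z]{2}\d", text)

print(inside_class)   # ['A', 'A', '1', '{', '2', '}']
print(outside_class)  # ['AA1']
```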


In [95]:
# III.4 Validate whether the string is a phone number in the format xxx-xxx-xxxx. Allow whitespace at the start or end.
string34 = " 123-456-7890  "
pattern = r"\d{3}-\d{3}-\d{4}"
cleaned_string = string34.strip()  # drop only leading/trailing whitespace; removing all whitespace would also accept spaces inside the number
match = re.fullmatch(pattern, cleaned_string)
if match:
    print(match.group())
else:
    print("No match!")

123-456-7890
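
How the whitespace is handled changes which inputs pass. The sketch below compares stripping only the ends with deleting all whitespace; `broken` is a made-up input with a space inside the number:

```python
import re

pattern = r"\d{3}-\d{3}-\d{4}"
padded = " 123-456-7890  "
broken = "123 -456-7890"  # hypothetical input with inner whitespace

# Stripping only the ends accepts padding but keeps inner whitespace visible.
padded_ok = re.fullmatch(pattern, padded.strip()) is not None
broken_strip = re.fullmatch(pattern, broken.strip()) is not None
# Deleting all whitespace hides the inner space, so the broken input passes.
broken_sub = re.fullmatch(pattern, re.sub(r"\s+", "", broken)) is not None

print(padded_ok)     # True
print(broken_strip)  # False
print(broken_sub)    # True
```

An alternative with the same effect as stripping is to keep the input untouched and allow `\s*` at both ends of the pattern.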
