# Regular Expressions

---

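Regular expressions (regex) are patterns that describe sets of strings. In shell scripts they are used for text processing, validation, and data extraction. Bash's `[[ string =~ pattern ]]` operator uses POSIX extended regular expression (ERE) syntax, as does `grep -E`.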

This notebook covers:
- Basic regex patterns
- Character classes and quantifiers
- Anchors and groups
- Practical validation examples

## Basic Regular Expression Concepts

### Literal Characters

In [1]:
# Simple string matching
text="Hello World"

if [[ $text =~ Hello ]]; then
    echo "Found 'Hello' in the text"
fi

# Case-sensitive matching
if [[ $text =~ hello ]]; then
    echo "Found 'hello'"
else
    echo "Did not find 'hello' (case-sensitive)"
fi

Found 'Hello' in the text
Did not find 'hello' (case-sensitive)
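The cell above depends on the pattern to the right of `=~` being unquoted. In Bash 3.2 and later, any quoted part of the pattern is matched as a literal string rather than as a regex. The following is a minimal sketch of the difference; the variable names `version` and `re` are illustrative only.

```shell
version="1x2"

# Unquoted: '.' is a regex metacharacter and matches any single character
if [[ $version =~ 1.2 ]]; then
    echo "unquoted pattern matches"
fi

# Quoted: '.' is taken literally, so "1x2" does not match
if [[ $version =~ "1.2" ]]; then
    echo "quoted pattern matches"
else
    echo "quoted pattern does not match"
fi

# A pattern stored in a variable and expanded unquoted keeps regex semantics
re='^1.2$'
if [[ $version =~ $re ]]; then
    echo "pattern variable matches"
fi
```

Storing the pattern in a variable, as later cells do, also avoids quoting problems when the regex contains spaces or shell metacharacters.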


### Character Classes

In [2]:
text="abc123XYZ"

# Match any digit
if [[ $text =~ [0-9] ]]; then
    echo "Contains digits"
fi

# Match any letter
if [[ $text =~ [a-zA-Z] ]]; then
    echo "Contains letters"
fi

# Match specific characters
if [[ $text =~ [aeiou] ]]; then
    echo "Contains vowels"
fi

# Negated character class
if [[ $text =~ [^0-9] ]]; then
    echo "Contains non-digit characters"
fi

Contains digits
Contains letters
Contains vowels
Contains non-digit characters


### POSIX Character Classes

In [3]:
text="Hello123 World!"

# POSIX character classes
if [[ $text =~ [[:digit:]] ]]; then
    echo "Contains digits"
fi

if [[ $text =~ [[:alpha:]] ]]; then
    echo "Contains letters"
fi

if [[ $text =~ [[:space:]] ]]; then
    echo "Contains whitespace"
fi

if [[ $text =~ [[:punct:]] ]]; then
    echo "Contains punctuation"
fi

Contains digits
Contains letters
Contains whitespace
Contains punctuation
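POSIX classes are only valid inside a bracket expression, which is why the patterns above use double brackets. They can also be combined with other characters in the same expression. As an illustrative sketch (the `is_identifier` helper is hypothetical, not part of the notebook), this checks whether a string looks like a shell-style identifier:

```shell
is_identifier() {
    # First character: a letter or underscore; the rest: letters, digits, underscores
    local re='^[[:alpha:]_][[:alnum:]_]*$'
    [[ $1 =~ $re ]]
}

for name in "my_var" "_tmp1" "9lives" "has space"; do
    if is_identifier "$name"; then
        echo "'$name' is a valid identifier"
    else
        echo "'$name' is not a valid identifier"
    fi
done
```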


## Quantifiers

### Zero or More (*), One or More (+), Zero or One (?)

In [4]:
# Zero or more (*)
texts=("abc" "abbbbc" "ac")
pattern="ab*c"

for text in "${texts[@]}"; do
    if [[ $text =~ $pattern ]]; then
        echo "'$text' matches pattern '$pattern'"
    fi
done

'abc' matches pattern 'ab*c'
'abbbbc' matches pattern 'ab*c'
'ac' matches pattern 'ab*c'
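For comparison, here is the same test with `+` (one or more) in place of `*`. It reuses the inputs from the cell above; the only difference is that at least one `b` is now required, so `ac` should no longer match.

```shell
texts=("abc" "abbbbc" "ac")
pattern="ab+c"

for text in "${texts[@]}"; do
    if [[ $text =~ $pattern ]]; then
        echo "'$text' matches pattern '$pattern'"
    else
        echo "'$text' does not match pattern '$pattern'"
    fi
done
```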


In [5]:
# Optional character (?)
texts=("color" "colour" "colors" "colours")
pattern="colou?r"

for text in "${texts[@]}"; do
    if [[ $text =~ ^$pattern$ ]]; then
        echo "'$text' matches exactly"
    elif [[ $text =~ $pattern ]]; then
        echo "'$text' contains the pattern"
    fi
done

'color' matches exactly
'colour' matches exactly
'colors' contains the pattern
'colours' contains the pattern


### Specific Quantities {n}, {n,m}, {n,}

In [6]:
# Phone number validation
phone_numbers=("123-456-7890" "12-34-567" "1234567890" "123-45-67890")
phone_pattern="[0-9]{3}-[0-9]{3}-[0-9]{4}"

for phone in "${phone_numbers[@]}"; do
    if [[ $phone =~ ^$phone_pattern$ ]]; then
        echo "Valid phone: $phone"
    else
        echo "Invalid phone: $phone"
    fi
done

Valid phone: 123-456-7890
Invalid phone: 12-34-567
Invalid phone: 1234567890
Invalid phone: 123-45-67890
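The phone example only uses the exact-count form `{n}`. The sketch below covers the other two forms from the heading, `{n,m}` and `{n,}`. The patterns and sample inputs are made up for illustration.

```shell
range_pattern='^[a-z]{3,8}$'   # between 3 and 8 lowercase letters
min_pattern='^[0-9]{2,}$'      # at least 2 digits, no upper limit

for word in "ab" "abc" "abcdefgh" "abcdefghi"; do
    if [[ $word =~ $range_pattern ]]; then
        echo "'$word' has 3 to 8 letters"
    else
        echo "'$word' is outside the 3 to 8 range"
    fi
done

for number in "7" "42" "123456"; do
    if [[ $number =~ $min_pattern ]]; then
        echo "'$number' has at least 2 digits"
    else
        echo "'$number' has fewer than 2 digits"
    fi
done
```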


## Anchors and Boundaries

### Start (^) and End ($) Anchors

In [7]:
text="The quick brown fox"

# Start of string (^)
if [[ $text =~ ^The ]]; then
    echo "Starts with 'The'"
fi

# End of string ($)
if [[ $text =~ fox$ ]]; then
    echo "Ends with 'fox'"
fi

# Both anchors: the whole string must start with 'The' and end with 'fox'
if [[ $text =~ ^The.*fox$ ]]; then
    echo "Starts with 'The' and ends with 'fox'"
fi

Starts with 'The'
Ends with 'fox'
Starts with 'The' and ends with 'fox'
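POSIX ERE does not define a word-boundary escape such as `\b`. Some regex libraries accept one as an extension, so relying on it in `[[ =~ ]]` may not be portable. One way to approximate a word boundary with only standard syntax is sketched below. The `contains_word` helper is hypothetical, and it assumes the word itself contains no regex metacharacters.

```shell
contains_word() {
    local text=$1 word=$2
    # A "boundary" here is the start or end of the string, or any character
    # that is not a letter, digit, or underscore
    local re="(^|[^[:alnum:]_])${word}(\$|[^[:alnum:]_])"
    [[ $text =~ $re ]]
}

for sentence in "The quick brown fox" "The foxes ran" "fox, then hound"; do
    if contains_word "$sentence" "fox"; then
        echo "'$sentence' contains the word 'fox'"
    else
        echo "'$sentence' does not contain the word 'fox'"
    fi
done
```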


## Groups and Capturing

### Basic Grouping with BASH_REMATCH

In [8]:
# Extract date components
dates=("2023-12-25" "12/25/2023" "2023/12/25")

# ISO date format
iso_pattern="([0-9]{4})-([0-9]{2})-([0-9]{2})"

for date in "${dates[@]}"; do
    if [[ $date =~ $iso_pattern ]]; then
        echo "ISO date: $date"
        echo "  Year: ${BASH_REMATCH[1]}"
        echo "  Month: ${BASH_REMATCH[2]}"
        echo "  Day: ${BASH_REMATCH[3]}"
    fi
done

ISO date: 2023-12-25
  Year: 2023
  Month: 12
  Day: 25
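Two details of the cell above are worth seeing directly: `BASH_REMATCH[0]` holds the entire matched text, and an unanchored pattern can match in the middle of a longer string. The sketch below reuses `iso_pattern`; the sample `text` is made up.

```shell
iso_pattern="([0-9]{4})-([0-9]{2})-([0-9]{2})"
text="Released on 2023-12-25 at noon"

# Unanchored: the date is found inside the longer string
if [[ $text =~ $iso_pattern ]]; then
    echo "Whole match: ${BASH_REMATCH[0]}"
    echo "Number of groups: $(( ${#BASH_REMATCH[@]} - 1 ))"
fi

# Anchored: the entire string must be an ISO date
if [[ $text =~ ^${iso_pattern}$ ]]; then
    echo "Entire string is an ISO date"
else
    echo "Entire string is not an ISO date"
fi
```

Anchoring with `^` and `$` is usually the safer choice for validation, while the unanchored form suits extraction.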


### Extracting Data from Log Entries

In [9]:
# Parse log entries
log_entries=(
    "2023-10-26 14:30:22 ERROR Failed to connect to database"
    "2023-10-26 14:31:15 INFO User login successful"
    "2023-10-26 14:32:01 WARN Memory usage high"
)

log_pattern="([0-9-]+) ([0-9:]+) ([A-Z]+) (.*)"

for entry in "${log_entries[@]}"; do
    if [[ $entry =~ $log_pattern ]]; then
        echo "Log entry parsed:"
        echo "  Date: ${BASH_REMATCH[1]}"
        echo "  Time: ${BASH_REMATCH[2]}"
        echo "  Level: ${BASH_REMATCH[3]}"
        echo "  Message: ${BASH_REMATCH[4]}"
        echo
    fi
done

Log entry parsed:
  Date: 2023-10-26
  Time: 14:30:22
  Level: ERROR
  Message: Failed to connect to database

Log entry parsed:
  Date: 2023-10-26
  Time: 14:31:15
  Level: INFO
  Message: User login successful

Log entry parsed:
  Date: 2023-10-26
  Time: 14:32:01
  Level: WARN
  Message: Memory usage high
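A capture group can also hold alternatives separated by `|`, which makes it possible to filter entries while parsing them. The sketch below reuses the log entries from the cell above and keeps only `ERROR` and `WARN` lines. The name `level_pattern` is illustrative.

```shell
log_entries=(
    "2023-10-26 14:30:22 ERROR Failed to connect to database"
    "2023-10-26 14:31:15 INFO User login successful"
    "2023-10-26 14:32:01 WARN Memory usage high"
)

# Group 3 only matches one of the listed levels
level_pattern='^([0-9-]+) ([0-9:]+) (ERROR|WARN) (.*)$'

for entry in "${log_entries[@]}"; do
    if [[ $entry =~ $level_pattern ]]; then
        echo "${BASH_REMATCH[3]} at ${BASH_REMATCH[2]}: ${BASH_REMATCH[4]}"
    fi
done
```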



## Practical Examples

### Email Validation

In [10]:
validate_email() {
    local email=$1
    local pattern="^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    
    # The exit status of [[ ]] becomes the function's return value
    [[ $email =~ $pattern ]]
}

emails=("user@example.com" "invalid.email" "test@domain" "valid.user+tag@example.org")

for email in "${emails[@]}"; do
    if validate_email "$email"; then
        echo "✓ Valid: $email"
    else
        echo "✗ Invalid: $email"
    fi
done

✓ Valid: user@example.com
✗ Invalid: invalid.email
✗ Invalid: test@domain
✓ Valid: valid.user+tag@example.org


### URL Parsing

In [11]:
parse_url() {
    local url=$1
    local pattern="^(https?)://([^/]+)(/.*)?"
    
    if [[ $url =~ $pattern ]]; then
        echo "URL: $url"
        echo "  Protocol: ${BASH_REMATCH[1]}"
        echo "  Host: ${BASH_REMATCH[2]}"
        echo "  Path: ${BASH_REMATCH[3]:-/}"
    else
        echo "Invalid URL: $url"
    fi
}

urls=(
    "https://www.example.com/path/to/resource"
    "http://localhost:8080/"
)

for url in "${urls[@]}"; do
    parse_url "$url"
    echo
done

URL: https://www.example.com/path/to/resource
  Protocol: https
  Host: www.example.com
  Path: /path/to/resource

URL: http://localhost:8080/
  Protocol: http
  Host: localhost:8080
  Path: /
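In the output above, the "Host" field includes the port (`localhost:8080`) because `[^/]+` captures everything up to the first slash. If the port is wanted separately, an optional nested group is one way to get it. The `parse_host_port` function below is a hypothetical sketch. It does not handle user info or IPv6 literals.

```shell
parse_host_port() {
    local url=$1
    # Group 2: host, group 4: port digits (optional), group 5: path (optional)
    local re='^(https?)://([^/:]+)(:([0-9]+))?(/.*)?$'

    if [[ $url =~ $re ]]; then
        echo "host=${BASH_REMATCH[2]} port=${BASH_REMATCH[4]:-default} path=${BASH_REMATCH[5]:-/}"
    else
        echo "Invalid URL: $url"
        return 1
    fi
}

parse_host_port "http://localhost:8080/"
parse_host_port "https://www.example.com/path/to/resource"
```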



### Phone Number Formatting

In [12]:
format_phone() {
    local phone=$1
    # Remove all non-digits using parameter expansion (no external process needed)
    local digits_only=${phone//[!0-9]/}
    
    # Check if it's a valid US phone number
    if [[ $digits_only =~ ^1?([0-9]{3})([0-9]{3})([0-9]{4})$ ]]; then
        local area_code=${BASH_REMATCH[1]}
        local exchange=${BASH_REMATCH[2]}
        local number=${BASH_REMATCH[3]}
        
        echo "($area_code) $exchange-$number"
    else
        echo "Invalid phone number: $phone"
    fi
}

phones=(
    "1234567890"
    "123-456-7890"
    "(123) 456-7890"
    "1-123-456-7890"
    "12345"
)

for phone in "${phones[@]}"; do
    formatted=$(format_phone "$phone")
    echo "$phone -> $formatted"
done

1234567890 -> (123) 456-7890
123-456-7890 -> (123) 456-7890
(123) 456-7890 -> (123) 456-7890
1-123-456-7890 -> (123) 456-7890
12345 -> Invalid phone number: 12345


## Using grep with Regular Expressions

In [13]:
# Create sample data
cat << 'EOF' > /tmp/sample.txt
John Doe, john@example.com, 555-123-4567
Jane Smith, jane.smith@company.org, 555-987-6543
Bob Johnson, bob@invalid, 123-456
Alice Brown, alice.brown@university.edu, 555-111-2222
EOF

echo "All email addresses:"
grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' /tmp/sample.txt

echo
echo "Phone numbers:"
grep -oE '[0-9]{3}-[0-9]{3}-[0-9]{4}' /tmp/sample.txt

echo
echo "Lines with .com domains:"
grep '\.com' /tmp/sample.txt

# Clean up
rm /tmp/sample.txt

All email addresses:
john@example.com
jane.smith@company.org
alice.brown@university.edu

Phone numbers:
555-123-4567
555-987-6543
555-111-2222

Lines with .com domains:
John Doe, john@example.com, 555-123-4567
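The `-E` flag in the commands above selects extended regex syntax. Without it, `grep` uses basic regular expressions, where interval braces must be escaped. The sketch below shows both forms on one line of the sample data, plus `-q`, which suppresses output so that `grep` can drive an `if` directly. The `line` variable is illustrative.

```shell
line="Jane Smith, jane.smith@company.org, 555-987-6543"

# Basic regex (default): interval braces must be escaped
echo "$line" | grep -o '[0-9]\{3\}-[0-9]\{4\}'

# Extended regex (-E): braces work unescaped
echo "$line" | grep -oE '[0-9]{3}-[0-9]{4}'

# -q prints nothing; only the exit status is used
if echo "$line" | grep -qE '\.org'; then
    echo "Line mentions a .org domain"
fi
```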


## Exercise

Create validation functions for the following data formats:

1. **Email addresses** - Extract valid email addresses from text
2. **Phone numbers** - Format US phone numbers consistently  
3. **URLs** - Parse and validate URLs
4. **Credit card numbers** - Validate format (16 digits, groups of 4)
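
For item 4, a format check could start from a sketch like the following. The `validate_card_format` name is hypothetical, and the function checks only the layout (four hyphen-separated groups of four digits). It does not verify a checksum or whether the number belongs to a real card.

```shell
validate_card_format() {
    local card=$1
    # One group of four digits, followed by three more groups each preceded by a hyphen
    local re='^[0-9]{4}(-[0-9]{4}){3}$'
    [[ $card =~ $re ]]
}

for card in "4532-1234-5678-9012" "4532-1234-5678" "4532123456789012"; do
    if validate_card_format "$card"; then
        echo "Valid format: $card"
    else
        echo "Invalid format: $card"
    fi
done
```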

In [14]:
# Test data
cat > /tmp/test_data.txt << 'EOF'
Contact: john.doe@example.com, phone: (555) 123-4567
Website: https://www.example.com/path?query=value
Payment: 4532-1234-5678-9012
Invalid email: notanemail.com
Invalid phone: 12345
EOF

echo "Extracting emails:"
grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' /tmp/test_data.txt

echo
echo "Extracting credit card numbers:"
grep -oE '[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}' /tmp/test_data.txt

rm /tmp/test_data.txt

Extracting emails:
john.doe@example.com

Extracting credit card numbers:
4532-1234-5678-9012
