# Python basics Assignment 7

1. What is the name of the feature responsible for generating Regex objects?

Ans. The re.compile() function is used to compile a regular expression pattern into a Regex object, which can then be used for various operations such as searching, matching, and manipulation of strings.

2. Why do raw strings often appear in Regex objects?

Ans. Raw strings (prefixed with r) are commonly used for regular expression patterns because they stop Python's string literal syntax from treating backslashes (\) as escape characters.
The backslash then reaches the regex engine unchanged, which avoids conflicts between Python's string escape sequences and regex sequences such as \d or \b.
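
A small sketch of the difference, using \b (a backspace in a normal Python string, a word boundary in a regex). The variable names plain and raw are only illustrative:

```python
import re

# Without the r prefix, Python itself turns '\b' into a backspace character.
plain = '\bcat\b'
raw = r'\bcat\b'

plain_len = len(plain)  # two backspace characters plus 'cat'
raw_len = len(raw)      # backslashes are kept as ordinary characters
print(plain_len, raw_len)  # 5 7

# Only the raw string reaches the regex engine as a word-boundary pattern.
raw_matches = re.findall(raw, 'a cat sat')
plain_matches = re.findall(plain, 'a cat sat')
print(raw_matches)    # ['cat']
print(plain_matches)  # []
```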

3. What is the return value of the search() method?

Ans. The search() method scans the whole input string and returns a Match object for the first location where the pattern matches. If the pattern is not found anywhere in the string, it returns None.
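
A minimal sketch of both outcomes, with made-up sample strings. Because the result can be None, it is usually checked before any method is called on it:

```python
import re

digits = re.compile(r'\d+')

found = digits.search('room 42, floor 7')
missing = digits.search('no numbers here')

print(found.group())    # 42 (only the first occurrence is returned)
print(missing is None)  # True

# Calling missing.group() here would raise AttributeError, so test first.
if missing is None:
    print('no match')
```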

4. From a Match object, how do you get the actual strings that match the pattern?

Ans. The group() method of a match object returns the actual string that matched the pattern.

In [1]:
import re
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My number is 415-555-4242.')

In [3]:
mo.group()

'415-555-4242'

5. In the regex created from r'(\d\d\d)-(\d\d\d-\d\d\d\d)', what does group zero cover? Group 2? Group 1?

Ans. 
In the regular expression r'(\d\d\d)-(\d\d\d-\d\d\d\d)', the group zero (group(0)) covers the entire matched string, including both capturing groups and any characters that matched the pattern.

Group 1 (group(1)) corresponds to the first capturing group (\d\d\d), which captures three digits before the hyphen. It represents the first part of the matched string.

Group 2 (group(2)) corresponds to the second capturing group (\d\d\d-\d\d\d\d), which captures a group of three digits followed by a hyphen and four more digits. It represents the second part of the matched string.


In [4]:
import re

pattern = r'(\d\d\d)-(\d\d\d-\d\d\d\d)'
text = "Phone number: 123-456-7890"

regex = re.compile(pattern)
match = regex.search(text)

if match:
    print("Pattern found!")
    print("Entire match:", match.group(0))  # Group 0: Entire matched string
    print("Group 1:", match.group(1))        # Group 1: First capturing group
    print("Group 2:", match.group(2))        # Group 2: Second capturing group
else:
    print("Pattern not found!")

Pattern found!
Entire match: 123-456-7890
Group 1: 123
Group 2: 456-7890


6. In regular expression syntax, parentheses and periods have special meanings. How can you tell a regex that you want it to match literal parentheses and periods?

Ans. To match literal parentheses and periods in a regular expression pattern, you can use the backslash character to escape them. By preceding a special character with a backslash, you indicate that you want to match the character itself rather than its special meaning in regex syntax.
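
A short sketch of escaped versus unescaped characters. The sample text is invented for illustration:

```python
import re

text = 'Call (415) 555-4242 today'

# \( and \) match literal parentheses; the unescaped pair between them
# is an ordinary capturing group.
area_code = re.search(r'\((\d{3})\)', text)
print(area_code.group())   # (415)
print(area_code.group(1))  # 415

# \. matches a literal period; an unescaped . matches any character.
escaped = re.findall(r'example\.com', 'example.com exampleXcom')
unescaped = re.findall(r'example.com', 'example.com exampleXcom')
print(escaped)    # ['example.com']
print(unescaped)  # ['example.com', 'exampleXcom']
```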

7. The findall() method returns a string list or a list of string tuples. What causes it to return one of the two options?

Ans. 

The findall() method in Python's re module returns a list of strings when the pattern has no capturing groups or exactly one capturing group. It returns a list of string tuples when the pattern has two or more capturing groups.

If the pattern contains no capturing groups (expressions enclosed in parentheses ()), each string in the list is a complete match of the pattern within the searched string. If it contains exactly one group, each string is the text captured by that group.

In [6]:
import re

pattern = r'\d+'
text = 'I have 10 apples and 5 bananas'

matches = re.findall(pattern, text)

print(matches)

['10', '5']


On the other hand, when the pattern includes two or more capturing groups, findall() returns a list of string tuples. Each tuple in the list represents one match, and each element within the tuple is the text captured by the corresponding group.

In [8]:
import re

pattern = r'(\d+)-(\d+)'
text = 'The phone numbers are 123-456 and 789-012'

matches = re.findall(pattern, text)

print(matches)

[('123', '456'), ('789', '012')]


In this example, the pattern r'(\d+)-(\d+)' matches two runs of digits separated by a hyphen. The pattern includes two capturing groups, so findall() returns a list of string tuples, where each tuple holds the text of the two groups (the complete match itself is not included).

So, the number of capturing groups in the pattern determines whether findall() returns a list of strings or a list of string tuples.
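
One case worth checking separately is a pattern with exactly one capturing group: findall() then returns a list of strings holding just that group's text, not tuples. A short sketch that compares the three cases on the same text:

```python
import re

text = 'The phone numbers are 123-456 and 789-012'

no_groups = re.findall(r'\d+-\d+', text)
one_group = re.findall(r'(\d+)-\d+', text)
two_groups = re.findall(r'(\d+)-(\d+)', text)

print(no_groups)   # ['123-456', '789-012']
print(one_group)   # ['123', '789']
print(two_groups)  # [('123', '456'), ('789', '012')]
```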

8. In standard expressions, what does the | character mean?

Ans. 

In ordinary Python expressions (outside a regex pattern), the '|' operator is defined by default on integer types and set types.

If the two operands are integers, then it will perform a bitwise OR, which is a mathematical operation.

If the two operands are set types, the '|' operator will return the union of two sets.

9. In regular expressions, what does the character stand for?

Ans. In a regular expression, the | character denotes alternation (a logical OR), allowing you to specify multiple alternative patterns to match. At each position the engine tries the left-hand side (LHS) first and falls back to the right-hand side (RHS) only if the LHS does not match there.

In [9]:
import re

pattern = r'cat|dog'
text = 'I have a cat and a dog'

matches = re.findall(pattern, text)

print(matches)

['cat', 'dog']


10. In regular expressions, what is the difference between the + and * characters?

Ans. In regular expressions, the + and * characters are quantifiers that specify the number of occurrences of the preceding element in the pattern.

+ (Plus):
The "+" quantifier matches one or more occurrences of the preceding element.
It requires at least one occurrence of the preceding element for a match.

* (Asterisk):
The "*" quantifier matches zero or more occurrences of the preceding element.
It still matches when the preceding element is absent entirely.

In [11]:
import re

pattern1 = r'ab+'  # Matches 'a' followed by one or more 'b'
pattern2 = r'ab*'  # Matches 'a' followed by zero or more 'b'

text = 'a ab abb abb'

matches1 = re.findall(pattern1, text)
matches2 = re.findall(pattern2, text)

print(matches1)  # Output: ['ab', 'abb', 'abb']
print(matches2)  # Output: ['a', 'ab', 'abb', 'abb']

['ab', 'abb', 'abb']
['a', 'ab', 'abb', 'abb']


11. What is the difference between {4} and {4,5} in regular expression?

Ans. The {4} quantifier matches exactly four occurrences of the preceding element.
It specifies a fixed repetition count.

The {4,5} quantifier matches a range of four to five occurrences of the preceding element.
It specifies a minimum and maximum repetition count.
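
A small sketch of the two quantifiers on runs of the letter a. The \b word boundaries are an addition for this example so that longer runs are not matched in part:

```python
import re

text = 'aaa aaaa aaaaa aaaaaa'

exactly_four = re.findall(r'\ba{4}\b', text)
four_or_five = re.findall(r'\ba{4,5}\b', text)

print(exactly_four)  # ['aaaa']
print(four_or_five)  # ['aaaa', 'aaaaa']
```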

12. What do the \d, \w, and \s shorthand character classes signify in regular expressions?

Ans. 

The \d shorthand character class matches any digit character (0 to 9).

The \w shorthand character class matches any word character: a letter, a digit, or the underscore.

The \s shorthand character class matches any whitespace character, such as a space, tab, or newline.

In [12]:
import re

pattern1 = r'\d'  # Matches a single digit
pattern2 = r'\w'  # Matches a single word character
pattern3 = r'\s'  # Matches a single whitespace character

text = 'Hello 123 world!'

matches1 = re.findall(pattern1, text)
matches2 = re.findall(pattern2, text)
matches3 = re.findall(pattern3, text)

print(matches1)  
print(matches2)  
print(matches3)  

['1', '2', '3']
['H', 'e', 'l', 'l', 'o', '1', '2', '3', 'w', 'o', 'r', 'l', 'd']
[' ', ' ']


13. What do the \D, \W, and \S shorthand character classes signify in regular expressions?

Ans. In regular expressions, the shorthand character classes \D, \W, and \S are negated versions of the \d, \w, and \s shorthand character classes, respectively. They represent predefined character classes that match specific types of characters that are not within the original character classes.

The \D shorthand character class matches any character that is not a digit.

The \W shorthand character class matches any character that is not a word character.

The \S shorthand character class matches any character that is not a whitespace character.

In [13]:
import re

pattern1 = r'\D'  # Matches a single non-digit character
pattern2 = r'\W'  # Matches a single non-word character
pattern3 = r'\S'  # Matches a single non-whitespace character

text = 'Hello 123 world!'

matches1 = re.findall(pattern1, text)
matches2 = re.findall(pattern2, text)
matches3 = re.findall(pattern3, text)

print(matches1)  
print(matches2)  
print(matches3)

['H', 'e', 'l', 'l', 'o', ' ', ' ', 'w', 'o', 'r', 'l', 'd', '!']
[' ', ' ', '!']
['H', 'e', 'l', 'l', 'o', '1', '2', '3', 'w', 'o', 'r', 'l', 'd', '!']


14. What is the difference between .*? and .*?

Ans. 

The .*? construct is a non-greedy or lazy quantifier.
It matches as few characters as possible to satisfy the pattern.
It will stop matching as soon as the subsequent part of the pattern can be matched.
For example, with the pattern a.*?b, if the input contains several "a" and "b" characters, each match starts at the leftmost available "a" and ends at the nearest "b" that follows it.

The .* construct is a greedy quantifier.
It matches as many characters as possible to satisfy the pattern.
It will match as much of the input as it can and still allow the subsequent part of the pattern to match.
For example, with the pattern a.*b, if the input contains several "a" and "b" characters, the match starts at the leftmost "a" and runs to the last "b" on the same line.

In [14]:
import re

text = 'aabab'

pattern1 = r'a.*?b'  # Non-greedy match
pattern2 = r'a.*b'   # Greedy match

matches1 = re.findall(pattern1, text)
matches2 = re.findall(pattern2, text)

print(matches1)  
print(matches2)  

['aab', 'ab']
['aabab']


15. What is the syntax for matching both numbers and lowercase letters with a character class?

Ans. This character class [0-9a-z] will match any single character that is a digit from 0 to 9 or a lowercase letter from a to z. The hyphen - inside the character class denotes a range of characters.
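
A quick sketch of the class in use. The sample text is invented, and the + quantifier in the second pattern is added only to group neighboring matches into runs:

```python
import re

text = 'User42 logged IN at 9am'

single_chars = re.findall(r'[0-9a-z]', text)
runs = re.findall(r'[0-9a-z]+', text)

# Uppercase letters and spaces are outside the class, so they split the runs.
print(runs)  # ['ser42', 'logged', 'at', '9am']
```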

16. What is the procedure for making a regular expression case-insensitive?

Ans. To make a regular expression case-insensitive in Python, you can pass the re.IGNORECASE flag (or re.I) or place the inline (?i) flag at the start of the pattern.

In [15]:
import re

pattern = r'example'
text = 'This is an Example'

matches = re.findall(pattern, text, re.IGNORECASE)
print(matches)  

['Example']


In [16]:
import re

pattern = r'(?i)example'
text = 'This is an Example'

matches = re.findall(pattern, text)
print(matches) 

['Example']


17. What does the . character normally match? What does it match if re.DOTALL is passed as 2nd argument in re.compile()?

Ans. 
In regular expressions, the . (dot) character normally matches any character except a newline character (\n). It is a wildcard that represents any single character.

However, if the re.DOTALL flag (or re.S) is passed as the second argument to the re.compile() function or used inline in the regular expression pattern, the behavior of the dot character changes. In this case, the dot (.) matches any character, including newline characters.

In [19]:
import re

text = 'Hello\nWorld'

pattern1 = r'H.*o'   # Dot does not match newline by default
pattern2 = r'H.*o'   # Dot matches newline with re.DOTALL

regex1 = re.compile(pattern1)
regex2 = re.compile(pattern2, re.DOTALL)

match1 = regex1.search(text)
match2 = regex2.search(text)

print(match1)  
print(match2)

<re.Match object; span=(0, 5), match='Hello'>
<re.Match object; span=(0, 8), match='Hello\nWo'>


18. If numRegex = re.compile(r'\d+'), what will numRegex.sub('X', '11 drummers, 10 pipers, five rings, 4 hen') return?

Ans. If numRegex is defined as re.compile(r'\d+') and the sub() method is called with the replacement string 'X' and the input string '11 drummers, 10 pipers, five rings, 4 hen', sub() replaces each run of one or more consecutive digits with a single 'X'. The word 'five' contains no digits and is left unchanged. The returned value will be:

'X drummers, X pipers, five rings, X hen'
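
The result can be checked directly. The sketch below also shows, for comparison, what a pattern without the + quantifier does and how the optional count argument limits the number of replacements:

```python
import re

numRegex = re.compile(r'\d+')
text = '11 drummers, 10 pipers, five rings, 4 hen'

all_runs = numRegex.sub('X', text)
per_digit = re.sub(r'\d', 'X', text)           # one X per digit instead of per run
first_only = numRegex.sub('X', text, count=1)  # replace only the first match

print(all_runs)    # X drummers, X pipers, five rings, X hen
print(per_digit)   # XX drummers, XX pipers, five rings, X hen
print(first_only)  # X drummers, 10 pipers, five rings, 4 hen
```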

19. What does passing re.VERBOSE as the 2nd argument to re.compile() allow you to do?

Ans. Passing re.VERBOSE as the second argument to re.compile() allows you to write more readable and organized regular expressions by enabling the use of comments and whitespace within the pattern. It helps in improving the clarity and maintainability of complex regular expressions.
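
To see the flag's effect in isolation, the sketch below writes one phone-number pattern twice, once compactly and once with re.VERBOSE, and checks that both capture the same groups. The names compact and verbose are illustrative:

```python
import re

compact = re.compile(r'(\d{3})-(\d{3}-\d{4})')

# Same pattern: whitespace and # comments are ignored in verbose mode.
verbose = re.compile(r"""
    (\d{3})        # area code
    -              # separator
    (\d{3}-\d{4})  # main number
    """, re.VERBOSE)

text = 'My number is 415-555-4242.'
compact_groups = compact.search(text).groups()
verbose_groups = verbose.search(text).groups()

print(compact_groups)  # ('415', '555-4242')
print(verbose_groups)  # ('415', '555-4242')
```

Because whitespace and # are ignored in verbose mode, a literal space or # in such a pattern has to be escaped or placed inside a character class.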

In [22]:
# Using VERBOSE
import re

regex_email = re.compile(r"""
            ^([a-z0-9_\.-]+)              # Local part
            @                             # Single @ sign
            ([0-9a-z\.-]+)                # Domain name
            \.                            # Single dot
            ([a-z]{2,6})$                 # Top-level domain
             """, re.VERBOSE | re.IGNORECASE)

20. How would you write a regex that matches a number with a comma after every three digits? It must match the following:

'42'

'1,234'

'6,368,745'

but not the following:

'12,34,567' (which has only two digits between the commas)

'1234' (which lacks commas)
 

In [23]:
import re

pattern = r'^\d{1,3}(,\d{3})*$'

numbers = ['42', '1,234', '6,368,745', '12,34,567', '1234']

for number in numbers:
    match = re.match(pattern, number)
    if match:
        print(f"'{number}' is a valid number.")
    else:
        print(f"'{number}' is not a valid number.")

'42' is a valid number.
'1,234' is a valid number.
'6,368,745' is a valid number.
'12,34,567' is not a valid number.
'1234' is not a valid number.


Explanation of the pattern r'^\d{1,3}(,\d{3})*$':

^ asserts the start of the string.

\d{1,3} matches one to three digits.

(,\d{3})* matches zero or more occurrences of a comma followed by exactly three digits.

$ asserts the end of the string.

21. How would you write a regex that matches the full name of someone whose last name is Watanabe? You can assume that the first name that comes before it will always be one word that begins with a capital letter. The regex must match the following:

'Haruto Watanabe'

'Alice Watanabe'

'RoboCop Watanabe'

but not the following:

'haruto Watanabe' (where the first name is not capitalized)

'Mr. Watanabe' (where the preceding word has a nonletter character)

'Watanabe' (which has no first name)

'Haruto watanabe' (where Watanabe is not capitalized)



In [24]:
import re

pattern = r'^[A-Z][a-zA-Z]* Watanabe$'

names = ['Haruto Watanabe', 'Alice Watanabe', 'RoboCop Watanabe', 'haruto Watanabe', 'Mr. Watanabe', 'Watanabe', 'Haruto watanabe']

for name in names:
    match = re.match(pattern, name)
    if match:
        print(f"'{name}' is a valid full name.")
    else:
        print(f"'{name}' is not a valid full name.")

'Haruto Watanabe' is a valid full name.
'Alice Watanabe' is a valid full name.
'RoboCop Watanabe' is a valid full name.
'haruto Watanabe' is not a valid full name.
'Mr. Watanabe' is not a valid full name.
'Watanabe' is not a valid full name.
'Haruto watanabe' is not a valid full name.


Explanation of the pattern r'^[A-Z][a-zA-Z]* Watanabe$':

^ asserts the start of the string.

[A-Z] matches a single capital letter as the first character of the first name.

[a-zA-Z]* matches zero or more lowercase or uppercase letters for the remaining characters of the first name.

The single space in the pattern matches the space character that separates the first and last names.

Watanabe matches the literal string "Watanabe" for the last name.

$ asserts the end of the string.

22. How would you write a regex that matches a sentence where the first word is either Alice, Bob, or Carol; the second word is either eats, pets, or throws; the third word is apples, cats, or baseballs; and the sentence ends with a period? This regex should be case-insensitive. It must match the following:

'Alice eats apples.'

'Bob pets cats.'

'Carol throws baseballs.'

'Alice throws Apples.'

'BOB EATS CATS.'

but not the following:

'RoboCop eats apples.'

'ALICE THROWS FOOTBALLS.'

'Carol eats 7 cats.'


In [26]:
import re

pattern = r'^(Alice|Bob|Carol) (eats|pets|throws) (apples|cats|baseballs)\.$'

sentences = [
    'Alice eats apples.',
    'Bob pets cats.',
    'Carol throws baseballs.',
    'Alice throws Apples.',
    'BOB EATS CATS.',
    'RoboCop eats apples.',
    'ALICE THROWS FOOTBALLS.',
    'Carol eats 7 cats.'
]

for sentence in sentences:
    match = re.match(pattern, sentence, re.IGNORECASE)
    if match:
        print(f"'{sentence}' is a valid sentence.")
    else:
        print(f"'{sentence}' is not a valid sentence.")

'Alice eats apples.' is a valid sentence.
'Bob pets cats.' is a valid sentence.
'Carol throws baseballs.' is a valid sentence.
'Alice throws Apples.' is a valid sentence.
'BOB EATS CATS.' is a valid sentence.
'RoboCop eats apples.' is not a valid sentence.
'ALICE THROWS FOOTBALLS.' is not a valid sentence.
'Carol eats 7 cats.' is not a valid sentence.


Explanation of the pattern r'^(Alice|Bob|Carol) (eats|pets|throws) (apples|cats|baseballs)\.$':

^ asserts the start of the string.

(Alice|Bob|Carol) matches either "Alice", "Bob", or "Carol" as the first word of the sentence.

(eats|pets|throws) matches either "eats", "pets", or "throws" as the second word of the sentence.

(apples|cats|baseballs) matches either "apples", "cats", or "baseballs" as the third word of the sentence.

\. matches the period at the end of the sentence.

$ asserts the end of the string.

By using the re.IGNORECASE flag, the pattern becomes case-insensitive, allowing the matching of any case variation of the specified words.