# CA4 - Question 1

In [36]:
import re

class ValidatorMeta(type):
    def __new__(cls, name, bases, dct):
        
        validation_rules = {
            #TO DO
            'email': r'^[a-zA-Z0-9]+@[a-zA-Z0-9]+\.(com|org)$',
            'phone_number': r'^\+98[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$',
            'password': r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!%@*])[A-Za-z0-9!%@*]{8,12}$',
            'product_code': r'^[A-Z][A-Z][0-9]{2,4}[a-z]?(-v[1-9][0-9]?)?$',
            'stop_word': r'(?<!\w)[Ss]top(?!\w|[^\s])',
            'repeated_phrase': r'\b(some students|many employees)\b.*\b\1\b',
            'date': r'^([1-9][0-9]{3})/(0[1-6]/(0[1-9]|[12][0-9]|3[01])|0[7-9]/(0[1-9]|[12][0-9]|30)|1[0-2]/(0[1-9]|[12][0-9]|30))$',
            'quotation': r'^([\'"])[a-zA-Z0-9 ]*\1$',
            'parenthesis': r'^[^()]*((\([^()]*\)[^()]*)*)$',
        }

        custom_error_messages = {
            'email': 'Email must have a valid format and end with .com or .org.',
            'phone_number': 'Phone number must start with +98 and be followed by 8 digits.',
            'password': 'Password must be between 8 and 12 characters and include at least one uppercase letter, one lowercase letter, one number, and one special character (*, @, %, !).',
            'product_code': 'Product code must consist of 2 uppercase letters, 2 to 4 digits, an optional lowercase letter, and an optional version number (v1-v99).',
            'stop_word': 'The word "Stop" or "stop" must be separate and not part of a larger word or attached to punctuation.',
            'repeated_phrase': 'The phrase "some students" or "many employees" must repeat exactly, with no different terms in between.',
            'date': 'Date must be in YYYY/MM/DD format, with valid month and day values.',
            'quotation': 'Text must be enclosed in balanced single or double quotes and contain only letters, numbers, and spaces.',
            'parenthesis': 'Parentheses must be balanced with no unmatched opening or closing parentheses.',
        }

        # Method generator for validation functions
        def create_validator(field, rule, custom_message):
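            def validator(self, value):
                # re.search lets the unanchored rules (stop_word, repeated_phrase)
                # match anywhere in the text; the other rules anchor with ^ and $.
                if not re.search(rule, value):
                    raise ValueError(custom_message)
                return True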
            return validator

        # Add validation methods to the class dynamically
        for field, rule in validation_rules.items():
            dct[f'validate_{field}'] = create_validator(field, rule, custom_error_messages.get(field, f"Invalid {field}"))

        return super().__new__(cls, name, bases, dct)


class FormValidator(metaclass=ValidatorMeta):
    """Concrete validator class with all validation methods"""
    pass


### Email: 
It begins with ^, which means the pattern must start at the beginning of the string. The first part, [a-zA-Z0-9]+, matches the username portion of the email. It allows letters (both uppercase and lowercase) and numbers. The + means that at least one of these characters must be present, and they can repeat. Next comes the @ symbol, which is required to separate the username from the domain. After that, [a-zA-Z0-9]+ matches the domain name, allowing letters and numbers again, with one or more characters required. Then we have \. which is an escaped dot (a plain dot (.) means "any character" in regex). Finally, (com|org) ensures the email ends with either .com or .org, and the $ at the end makes sure nothing comes after that.
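As a quick check of the description above, the sketch below runs the same email pattern against a few inputs. The helper name `is_valid_email` and the sample addresses are made up for illustration; the point is that dots in the username and multi-part domains are rejected, because the character class only allows letters and digits.

```python
import re

EMAIL_RULE = r'^[a-zA-Z0-9]+@[a-zA-Z0-9]+\.(com|org)$'

def is_valid_email(value):
    """Return True when value matches the email rule from this notebook."""
    return re.match(EMAIL_RULE, value) is not None

# Illustrative inputs (not taken from the assignment).
samples = [
    "user1@example.com",      # letters and digits only: accepted
    "first.last@example.com", # '.' is not allowed in the username
    "user@mail.example.org",  # only one domain label is allowed before the suffix
    "user@example.comm",      # '$' forbids anything after com/org
]
results = {s: is_valid_email(s) for s in samples}
for s, ok in results.items():
    print(f"{s!r}: {ok}")
```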

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Phone number:
It begins with ^, which means the pattern must start at the beginning of the string. The next part is \+98, which matches the literal characters +98. The plus sign is escaped (\+) because + has a special meaning in regex. After that, we have [0-9] repeated eight times in a row. Each [0-9] matches a single digit from 0 to 9. Finally, the $ at the end of the expression makes sure the match stops there.
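The eight repeated `[0-9]` classes can also be written with a counted quantifier. The sketch below compares the two spellings on a few made-up inputs; `SHORT_FORM` is a suggested alternative, not part of the original rule set.

```python
import re

LONG_FORM = r'^\+98[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$'
SHORT_FORM = r'^\+98[0-9]{8}$'  # assumed equivalent spelling using {8}

samples = ["+9812345678", "+981234567", "+98123456789", "9812345678"]

# Both patterns should accept and reject exactly the same inputs.
agreement = [
    (re.match(LONG_FORM, s) is not None) == (re.match(SHORT_FORM, s) is not None)
    for s in samples
]
print(all(agreement))  # True
```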

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Password:
It begins with ^, which means the pattern must start at the beginning of the string. Then, we have a series of lookahead assertions.
- (?=.*[a-z]) ensures the password contains at least one lowercase letter. The .* part allows any number of characters before that lowercase letter, so it can be anywhere in the string.
- (?=.*[A-Z]) does the same for an uppercase letter.
- (?=.*\d) does the same for a numeric character.
- (?=.*[!%@*]) confirms that the password includes at least one special character from this specific set: !, %, @, or *.

After that, the final pattern is [A-Za-z0-9!%@*]{8,12}, which defines exactly what characters are allowed in the password. It must be made up only of letters, digits, and the specified special characters, and must be between 8 and 12 characters long. Finally, the $ at the end of the expression makes sure the match stops there.
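To see how each lookahead contributes, the rule can be split into separate checks that report which requirement fails. This is only a sketch: `PASSWORD_CHECKS` and `failed_checks` are hypothetical names, and the split is intended to behave roughly like the single combined pattern for ordinary inputs.

```python
import re

# One entry per requirement described above (names are illustrative).
PASSWORD_CHECKS = {
    "lowercase letter": r'[a-z]',
    "uppercase letter": r'[A-Z]',
    "digit": r'\d',
    "special character": r'[!%@*]',
    "length 8-12 from the allowed set": r'^[A-Za-z0-9!%@*]{8,12}$',
}

def failed_checks(password):
    """Return the names of the requirements that the password does not meet."""
    return [name for name, pattern in PASSWORD_CHECKS.items()
            if not re.search(pattern, password)]

print(failed_checks("Abcd123@"))  # []
print(failed_checks("bcd123@"))   # ['uppercase letter', 'length 8-12 from the allowed set']
print(failed_checks("Abcd123"))   # ['special character', 'length 8-12 from the allowed set']
```

Reporting the failed requirement by name can give a more specific error message than one combined pattern, at the cost of running several searches.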

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Product code:
It begins with ^, which means the pattern must start at the beginning of the string. The first part, [A-Z][A-Z], requires exactly two uppercase English letters. Next, [0-9]{2,4} matches 2 to 4 digits immediately after the letters. After that, [a-z]? allows for one optional lowercase letter.  Then comes an optional version section: (-v[1-9][0-9]?)?. -v must appear exactly like that if the version is included. [1-9] ensures the version number doesn't start with 0 (e.g., v0 is invalid). [0-9]? optionally adds one more digit, allowing versions from v1 to v99. The entire group is optional because of the outer ?. Finally, the $ at the end of the expression makes sure the match stops there.
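The parts described above can be made visible with named groups. The following sketch rewrites the same pattern with `(?P<name>...)` groups; the group names and the `parse_product_code` helper are assumptions added for illustration.

```python
import re

PRODUCT_CODE = re.compile(
    r'^(?P<letters>[A-Z]{2})'            # exactly two uppercase letters
    r'(?P<digits>[0-9]{2,4})'            # two to four digits
    r'(?P<suffix>[a-z]?)'                # optional lowercase letter
    r'(?:-v(?P<version>[1-9][0-9]?))?$'  # optional version, v1 to v99
)

def parse_product_code(code):
    """Return the matched parts as a dict, or None if the code is invalid."""
    match = PRODUCT_CODE.match(code)
    return match.groupdict() if match else None

print(parse_product_code("AB123v-v1"))
# {'letters': 'AB', 'digits': '123', 'suffix': 'v', 'version': '1'}
print(parse_product_code("AB123"))
# {'letters': 'AB', 'digits': '123', 'suffix': '', 'version': None}
print(parse_product_code("AB123v-v100"))
# None
```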

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Stop word:
(?<!\w) checks what comes right before the word “Stop”. 
- \w represents any word character (letters, digits, or underscore). 
- (?<!\w) means: make sure there’s NOT a word character right before. 

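[Ss]top matches "Stop" or "stop". (?!\w|[^\s]) checks what comes right after the word “Stop”. 
- \w: a word character. 
- [^\s]: any character that is not whitespace (the same as \S).
- \w|[^\s]: together, this means "any non-whitespace character". Since [^\s] already covers every word character, the \w alternative is redundant and the lookahead is equivalent to (?!\S).
- (?!...): makes sure none of those appear immediately after, so the word must be followed by whitespace or the end of the string.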
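The lookaround behaviour described above can be checked on a few made-up sentences. The sketch also compares the rule with a shorter spelling, `SIMPLIFIED`, which is an assumption of this example rather than part of the assignment, and shows the general difference between `re.match` (tries only position 0) and `re.search` (scans the whole string).

```python
import re

ORIGINAL = r'(?<!\w)[Ss]top(?!\w|[^\s])'
SIMPLIFIED = r'(?<!\w)[Ss]top(?!\S)'  # hypothetical shorter spelling of the same rule

samples = ["Stop", "stop now", "Stopped", "Stop.", "Please stop"]

found = {}
for text in samples:
    original_hit = re.search(ORIGINAL, text) is not None
    simplified_hit = re.search(SIMPLIFIED, text) is not None
    assert original_hit == simplified_hit  # both spellings agree on these samples
    found[text] = original_hit
    print(f"{text!r}: {original_hit}")

# re.match only tries position 0; re.search scans the whole string.
match_hit = re.match(ORIGINAL, "Please stop") is not None
search_hit = re.search(ORIGINAL, "Please stop") is not None
print(match_hit, search_hit)  # False True
```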



------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Repeated phrase:
It starts with \b, which stands for a word boundary. This ensures that what follows is not part of a longer phrase or word. Then we have (some students|many employees), which is a capturing group. It matches either "some students" or "many employees", but not both at once. Then there's another \b to close off the first phrase. After that, .* allows any number of characters between the first and second instance of the phrase. Then comes \b\1\b. The \1 is a backreference to the phrase that was captured earlier (either "some students" or "many employees"), and the \b before and after it ensures it's once again matched as a standalone phrase.
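The backreference can be observed directly by asking the match object which phrase was captured. This is a small sketch; the helper name `repeated_phrase` and the third sample sentence are made up for illustration.

```python
import re

REPEATED = re.compile(r'\b(some students|many employees)\b.*\b\1\b')

def repeated_phrase(text):
    """Return the phrase that occurs twice, or None if neither phrase repeats."""
    match = REPEATED.search(text)
    return match.group(1) if match else None

print(repeated_phrase("some students like some students"))         # some students
print(repeated_phrase("some students and many employees"))         # None
print(repeated_phrase("many employees met many employees today"))  # many employees
```

The second sentence contains both phrases once each, so `\1` never finds a second copy of whichever phrase was captured.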

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Date:
It begins with ^, which means the pattern must start at the beginning of the string. Next, we have ([1-9][0-9]{3}). This part matches the year. The first digit must be non-zero, ensuring the year doesn't start with 0. The rest must be three digits.

Then comes the first /, which is just a separator between the year and month. The rest of the regex handles the month and day, broken down into three major groups: 
- 0[1-6]/(0[1-9]|[12][0-9]|3[01]): This handles months 01 to 06, each of which has 31 days. After the /, it allows days from 01 to 31.
- 0[7-9]/(0[1-9]|[12][0-9]|30): This handles months 07 to 09, which each have 30 days. After the /, it only allows days from 01 to 30.
- 1[0-2]/(0[1-9]|[12][0-9]|30): This covers months 10 to 12. After the /, it only allows days from 01 to 30.

Finally, the $ at the end of the expression makes sure the match stops there.
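One way to gain confidence in the three month groups is to compare the regex against the plain rule it encodes (31 days for months 01 to 06, 30 days for months 07 to 12) over every month and day combination. The sketch below does that for one sample year; `max_day` is a hypothetical helper written for this check.

```python
import re

DATE_RULE = (r'^([1-9][0-9]{3})/(0[1-6]/(0[1-9]|[12][0-9]|3[01])'
             r'|0[7-9]/(0[1-9]|[12][0-9]|30)'
             r'|1[0-2]/(0[1-9]|[12][0-9]|30))$')

def max_day(month):
    """Day limit encoded by the regex: 31 for months 1-6, 30 for months 7-12."""
    return 31 if month <= 6 else 30

mismatches = []
for month in range(1, 13):
    for day in range(1, 33):  # 32 is included as an always-invalid day
        text = f"1403/{month:02d}/{day:02d}"
        by_regex = re.match(DATE_RULE, text) is not None
        by_rule = day <= max_day(month)
        if by_regex != by_rule:
            mismatches.append(text)

print(mismatches)  # []
```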

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Quotation:
It begins with ^, which means the pattern must start at the beginning of the string. Next comes ([\'"]). This part is a capturing group that matches either a single quote ' or a double quote ". Then we have [a-zA-Z0-9 ]*, which matches zero or more characters inside the quotes (letters (uppercase or lowercase), digits, or spaces). Next, \1 is a backreference: it matches the exact same quote character that was captured at the beginning (either ' or "), so the opening and closing quotes cannot differ. Finally, the $ at the end of the expression makes sure the match stops there.
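The effect of the backreference is easiest to see with mismatched quotes. The following sketch applies the same pattern to a few made-up strings, including an empty quotation and one with a disallowed character.

```python
import re

QUOTATION = r'^([\'"])[a-zA-Z0-9 ]*\1$'

samples = [
    '"Hello World"',   # matching double quotes
    "'Hello World'",   # matching single quotes
    '"Hello World\'',  # opens with " but closes with '
    '""',              # empty quotation: * allows zero characters
    '"Hello!"',        # '!' is outside the allowed character class
]
results = [re.match(QUOTATION, s) is not None for s in samples]
print(results)  # [True, True, False, True, False]
```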

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Parenthesis:
It begins with ^, which means the pattern must start at the beginning of the string. Next, we have [^()]*. This part matches any number of characters that are not parentheses. It allows the text around or between parentheses to be anything (letters, numbers, punctuation), as long as it's not a ( or ). Then we have this part: (\([^()]*\)[^()]*)*.
- \( and \) match a pair of parentheses (the opening and closing brackets).
- Between them, [^()]* says: allow any content that is not itself a parenthesis.
- After the closing parenthesis, it again allows more non-parenthesis characters with [^()]*.
- The outer (...)* means this whole balanced-pair pattern can appear zero or more times.

Finally, the $ at the end of the expression makes sure the match stops there.
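Because the content between `\(` and `\)` may not contain parentheses, this pattern only accepts flat, non-nested pairs. The sketch below illustrates that with a few made-up inputs, including a balanced but nested one that the pattern rejects.

```python
import re

PARENS = r'^[^()]*((\([^()]*\)[^()]*)*)$'

samples = [
    "This is (valid)",   # one flat pair
    "(a) and (b)",       # several flat pairs
    "This is ((valid)",  # unmatched opening parenthesis
    "f(g(x))",           # balanced, but nested
]
results = {s: re.match(PARENS, s) is not None for s in samples}
for s, ok in results.items():
    print(f"{s!r}: {ok}")
```

The last input is correctly balanced, yet it is rejected: the pattern describes one level of parentheses only.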




In [39]:
validator = FormValidator()

# Test Validation
# TO DO


In [None]:
validator.validate_email("example@example.com") 

True

In [14]:
validator.validate_email("example.com")

ValueError: Email must have a valid format and end with .com or .org.

In [29]:
validator.validate_email("example@example.ir") 

ValueError: Email must have a valid format and end with .com or .org.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [12]:
validator.validate_phone_number("+9812345678") 

True

In [13]:
validator.validate_phone_number("0912345678")

ValueError: Phone number must start with +98 and be followed by 8 digits.

In [28]:
validator.validate_phone_number("+981234567") 

ValueError: Phone number must start with +98 and be followed by 8 digits.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [15]:
validator.validate_password("Abcd123@")

True

In [17]:
validator.validate_password("bcd123@")

ValueError: Password must be between 8 and 12 characters and include at least one uppercase letter, one lowercase letter, one number, and one special character (*, @, %, !).

In [25]:
validator.validate_password("A123@")

ValueError: Password must be between 8 and 12 characters and include at least one uppercase letter, one lowercase letter, one number, and one special character (*, @, %, !).

In [26]:
validator.validate_password("Abcd@")

ValueError: Password must be between 8 and 12 characters and include at least one uppercase letter, one lowercase letter, one number, and one special character (*, @, %, !).

In [27]:
validator.validate_password("Abcd123")

ValueError: Password must be between 8 and 12 characters and include at least one uppercase letter, one lowercase letter, one number, and one special character (*, @, %, !).

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [None]:
validator.validate_product_code("AB123v-v1") 

True

In [19]:
validator.validate_product_code("AB123v") 

True

In [20]:
validator.validate_product_code("AB123") 

True

In [23]:
validator.validate_product_code("AB123v-v100") 

ValueError: Product code must consist of 2 uppercase letters, 2 to 4 digits, an optional lowercase letter, and an optional version number (v1-v99).

In [21]:
validator.validate_product_code("A123B") 

ValueError: Product code must consist of 2 uppercase letters, 2 to 4 digits, an optional lowercase letter, and an optional version number (v1-v99).

In [22]:
validator.validate_product_code("A123") 

ValueError: Product code must consist of 2 uppercase letters, 2 to 4 digits, an optional lowercase letter, and an optional version number (v1-v99).

In [24]:
validator.validate_product_code("AB123vv-v1") 

ValueError: Product code must consist of 2 uppercase letters, 2 to 4 digits, an optional lowercase letter, and an optional version number (v1-v99).

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [42]:
validator.validate_stop_word("Stop") 

True

In [43]:
validator.validate_stop_word("stop") 

True

In [44]:
validator.validate_stop_word("Stopped") 

ValueError: The word "Stop" or "stop" must be separate and not part of a larger word or attached to punctuation.

In [41]:
validator.validate_stop_word("Stop.") 

ValueError: The word "Stop" or "stop" must be separate and not part of a larger word or attached to punctuation.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [45]:
validator.validate_repeated_phrase("some students like some students")

True

In [46]:
validator.validate_repeated_phrase("some students and many employees")

ValueError: The phrase "some students" or "many employees" must repeat exactly, with no different terms in between.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [48]:
validator.validate_date("1403/06/31") 

True

In [49]:
validator.validate_date("1403/07/31") 

ValueError: Date must be in YYYY/MM/DD format, with valid month and day values.

In [50]:
validator.validate_date("1403/12/31") 

ValueError: Date must be in YYYY/MM/DD format, with valid month and day values.

In [51]:
validator.validate_date("403/06/31") 

ValueError: Date must be in YYYY/MM/DD format, with valid month and day values.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [52]:
validator.validate_quotation('"Hello World"')

True

In [54]:
validator.validate_quotation("Hello!")

ValueError: Text must be enclosed in balanced single or double quotes and contain only letters, numbers, and spaces.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [55]:
validator.validate_parenthesis("This is (valid)")

True

In [56]:
validator.validate_parenthesis("This is valid")

True

In [57]:
validator.validate_parenthesis("This is not valid)")

ValueError: Parentheses must be balanced with no unmatched opening or closing parentheses.

In [58]:
validator.validate_parenthesis("This is ((valid)")

ValueError: Parentheses must be balanced with no unmatched opening or closing parentheses.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Bonus:

In [97]:
import re
import urllib.request

url = "https://www.imdb.com/chart/top/"
headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(req)
html = response.read().decode('utf-8')

titles = re.findall(r'<h3 class="ipc-title__text[^"]*">(?:\d+\.\s)?(.*?)</h3>', html)
urls = re.findall(r'<a href="(/title/tt\d+/[^"]*)"', html)
years = re.findall(r'<span class="[^"]*cli-title-metadata-item[^"]*">(\d{4})</span>', html)
durations = re.findall(r'<span class="[^"]*cli-title-metadata-item[^"]*">(\d{1,2}h\s?\d{1,2}m?)</span>', html)
ratings = re.findall(r'aria-label="IMDb rating:\s*([0-9.]+)"', html)
rating_counts = re.findall(r'<span class="ipc-rating-star--voteCount">\s*\(<!-- -->\s*([\d.,MK]+)\s*<!-- -->\)</span>',html)

# titles[0] is skipped: judging by the output, the first matched <h3> is not a
# movie title (assumption). zip() stops at the shortest list, so no index can
# run past the end of any of them.
rows = zip(titles[1:], urls, years, durations, ratings, rating_counts)
for title, path, year, duration, rating, rating_count in rows:
    print(f"Title: {title}")
    print(f"URL: https://www.imdb.com{path}")
    print(f"Year: {year}")
    print(f"Duration: {duration}")
    print(f"Rating: {rating}")
    print(f"Rating Count: {rating_count}")
    print("-" * 40)


Title: The Shawshank Redemption
URL: https://www.imdb.com/title/tt0111161/?ref_=chttp_t_1
Year: 1994
Duration: 2h 22m
Rating: 9.3
Rating Count: 3.1M
----------------------------------------
Title: The Godfather
URL: https://www.imdb.com/title/tt0068646/?ref_=chttp_t_2
Year: 1972
Duration: 2h 55m
Rating: 9.2
Rating Count: 2.1M
----------------------------------------
Title: The Dark Knight
URL: https://www.imdb.com/title/tt0468569/?ref_=chttp_t_3
Year: 2008
Duration: 2h 32m
Rating: 9.0
Rating Count: 3M
----------------------------------------
Title: The Godfather Part II
URL: https://www.imdb.com/title/tt0071562/?ref_=chttp_t_4
Year: 1974
Duration: 3h 22m
Rating: 9.0
Rating Count: 1.4M
----------------------------------------
Title: 12 Angry Men
URL: https://www.imdb.com/title/tt0050083/?ref_=chttp_t_5
Year: 1957
Duration: 1h 36m
Rating: 9.0
Rating Count: 937K
----------------------------------------
Title: The Lord of the Rings: The Return of the King
URL: https://www.imdb.com/title/tt

Using regex for this web scraping project came with several challenges. First, HTML is nested and not consistently structured, and since regex is designed for flat text patterns, it struggles with the complexity of web pages. For example, some data like rating counts were surrounded by HTML comment markers or formatted in tricky ways, making them hard to match. Also, even small changes in the page’s structure (like class names or tag order) can easily break the regex. The expressions were also brittle: if just one part of a pattern was off, it matched nothing at all.

To deal with the challenges of using regex for web scraping, there are a few better and more reliable solutions. One option is to use specialized libraries like BeautifulSoup, which are designed to parse HTML properly and can handle nested tags much more easily than regex. Another solution is to use Selenium, which simulates a real browser and can load dynamic content that’s generated by JavaScript. If using these tools isn’t allowed, we should carefully inspect the page’s HTML and write more flexible regex patterns, while also making sure to check the length of all extracted lists to avoid errors.
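As a rough illustration of the parser-based approach, the sketch below uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup, so that it runs without extra installs. The HTML snippet is invented for the example and only borrows the `ipc-title__text` class name from the regex above; the `TitleParser` class is hypothetical.

```python
from html.parser import HTMLParser

# Made-up markup for illustration; not copied from the real page.
SAMPLE_HTML = """
<ul>
  <li><h3 class="ipc-title__text">1. First Movie</h3></li>
  <li><h3 class="ipc-title__text extra">2. Second Movie</h3></li>
  <li><h3 class="other">Not a title</h3></li>
</ul>
"""

class TitleParser(HTMLParser):
    """Collect the text of every <h3> whose class list includes ipc-title__text."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h3" and "ipc-title__text" in classes:
            self.in_title = True
            self.titles.append("")

    def handle_data(self, data):
        if self.in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_title = False

parser = TitleParser()
parser.feed(SAMPLE_HTML)
print(parser.titles)  # ['1. First Movie', '2. Second Movie']
```

Matching on the parsed class list rather than on the raw attribute string means that extra class names or a different attribute order do not break the extraction.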


------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Questions:

1. Regex is commonly used for form validation, data cleaning, searching and replacing patterns in text, syntax highlighting in code editors, and simple web scraping.

2. Besides the problems with HTML parsing, arbitrarily nested parentheses or quotations couldn’t be handled properly using regex, because matching them requires keeping track of the nesting depth.
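For the nesting limitation, a small counter is usually enough where a regex falls short. The following is a minimal sketch of a depth-counting check; `is_balanced` is a hypothetical helper, not part of the assignment code.

```python
def is_balanced(text):
    """Return True when every '(' has a matching ')' at any nesting depth."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0      # leftover '(' means something was never closed

print(is_balanced("f(g(x))"))           # True
print(is_balanced("This is ((valid)"))  # False
print(is_balanced(")("))                # False
```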