## Regular Expressions (RE)

- Patterns that describe which strings to match.
- Written as raw strings, `r"..."`, so that backslashes reach the `re` module unchanged instead of being read as Python string escapes.
- Square brackets list character choices, e.g., `[Aa]` matches `A` or `a`.
- `|` separates alternatives, e.g., `[Aa]pple|[Bb]anana`.
- Pattern for legal variable names: `r"[A-Za-z_][A-Za-z_0-9]*\Z"`.
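
A minimal sketch of the variable-name pattern from the list above, assuming ASCII-only names. The helper name `is_legal_name` is illustrative and not part of the notes.

```python
import re

# Pattern from the notes: a letter or underscore, then letters, digits, or underscores.
NAME_PATTERN = r"[A-Za-z_][A-Za-z_0-9]*\Z"

def is_legal_name(candidate):
    """Return True if the whole candidate string matches NAME_PATTERN."""
    return re.match(NAME_PATTERN, candidate) is not None

print(is_legal_name("total_2"))   # True
print(is_legal_name("2total"))    # False
print(is_legal_name("my-var"))    # False

# A raw string keeps the backslash: "\b" is one backspace character,
# while r"\b" is the two characters backslash and b.
print(len("\b"), len(r"\b"))      # 1 2
```
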
## Common RE Notations

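- `.` matches any character except a newline (any character at all with `re.DOTALL`).
- `[...]` matches any one of the characters in the brackets.
- `[^...]` matches any one character not in the brackets.
- `^` matches the start of the string (and of each line with `re.MULTILINE`).
- `$` matches the end of the string (and of each line with `re.MULTILINE`).
- `*` matches zero or more repetitions of the preceding expression.
- `+` matches one or more repetitions of the preceding expression.
- `{m,n}` matches m to n repetitions of the preceding expression.
- `?` matches zero or one occurrence of the preceding expression.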
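
The notations above can be tried out on small strings. This is only a sketch; the sample strings and variable names are made up for illustration.

```python
import re

# ^ and $ anchor the pattern to the start and end of the string.
whole = re.search(r"^\d+$", "2024")
partial = re.search(r"^\d+$", "year 2024")
print(whole is not None, partial is not None)          # True False

# ? makes the preceding item optional (zero or one occurrence).
optional_u = re.findall(r"colou?r", "color colour")
print(optional_u)                                      # ['color', 'colour']

# {m,n} bounds the repetition count: here two or three digits as a whole word.
bounded = re.findall(r"\b\d{2,3}\b", "7 42 123 4567")
print(bounded)                                         # ['42', '123']

# [^...] matches one character NOT listed; + requires at least one of them.
fields = re.findall(r"[^,]+", "a,b,,c")
print(fields)                                          # ['a', 'b', 'c']
```
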
## Grouping and Shorthand

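- Parentheses `(...)` group part of a pattern and capture the text it matches.
- `\number` is a backreference: it matches the same text that group `number` captured.
- Shorthand notations: `\d` (digit), `\D` (non-digit), `\s` (whitespace), `\S` (non-whitespace), `\w` (letter, digit, or underscore), `\W` (any other character).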
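
A short sketch of a backreference and a few shorthand classes; the sample strings are invented for illustration.

```python
import re

# (\w+) captures a word; \1 must then match exactly the same text again.
doubled = re.findall(r"\b(\w+) \1\b", "it is is what it it was")
print(doubled)                                   # ['is', 'it']

# \d is a digit and \S is anything except whitespace.
numbers = re.findall(r"\d+", "room 101, floor 3")
tokens = re.findall(r"\S+", "a1  b22\tc")
print(numbers, tokens)                           # ['101', '3'] ['a1', 'b22', 'c']
```
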
## Matching Functions

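- `re.match` matches only at the start of a string.
- `re.search` finds the first match anywhere in the string.
- `re.findall` returns a list of all non-overlapping matches (the group contents if the pattern has groups).
- `re.finditer` returns an iterator of match objects.
- `re.sub` replaces the matches of a pattern with a replacement string.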
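
To compare the five functions side by side, here is a small sketch that runs each of them on the same made-up string.

```python
import re

text = "cat bat rat"

at_start = re.match(r"bat", text)             # None: "bat" is not at the start
anywhere = re.search(r"bat", text)            # found at positions 4..7
all_words = re.findall(r"[cbr]at", text)
starts = [m.start() for m in re.finditer(r"[cbr]at", text)]
replaced = re.sub(r"[cbr]at", "hat", text)

print(at_start)           # None
print(anywhere.span())    # (4, 7)
print(all_words)          # ['cat', 'bat', 'rat']
print(starts)             # [0, 4, 8]
print(replaced)           # hat hat hat
```
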
## Pattern Compilation

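- Precompile a pattern with `re.compile` when it is used repeatedly; the compiled object offers the same methods (`match`, `search`, `findall`, and so on).
- Flags such as `re.IGNORECASE`, `re.MULTILINE`, and `re.DOTALL` modify matching behavior.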
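
A sketch of what the three flags change, using a made-up three-line string. Flags are combined with `|`.

```python
import re

text = "Hello\nhello world\nHELLO"

# IGNORECASE ignores letter case; MULTILINE lets ^ match at the start of every line.
line_start = re.compile(r"^hello", re.IGNORECASE | re.MULTILINE)
hits = line_start.findall(text)
print(hits)                                     # ['Hello', 'hello', 'HELLO']

# Without DOTALL the dot stops at a newline; with DOTALL it matches newlines too.
without_dotall = re.search(r"Hello.*world", text)
with_dotall = re.search(r"Hello.*world", text, re.DOTALL)
print(without_dotall)                           # None
print(with_dotall.span())                       # (0, 17)
```
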
## Match Objects

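- Returned by `re.match` and `re.search` (and yielded by `re.finditer`); both functions return `None` when nothing matches.
- Access the matched substring and the groups with methods such as `group()`, `groups()`, `start()`, `end()`, and `span()`.
- A match object is always truthy, so test it with `if mo:` or convert it with `bool(mo)`.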
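
A minimal sketch of the most common match object methods, using an invented sample string.

```python
import re

mo = re.search(r"(\d+)-(\d+)", "pages 12-34 of the report")
if mo:                      # a failed search would have returned None
    print(mo.group(0))      # 12-34   (the whole match)
    print(mo.group(1))      # 12      (the first group)
    print(mo.groups())      # ('12', '34')
    print(mo.span())        # (6, 11)

no_match = re.search(r"\d+", "no digits here")
print(bool(no_match))       # False
```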

In [1]:
# Match and search functions
import re
s = "Doing things, going home, staying awake, sleeping later"
re.findall(r'\w+ing\b', s)

['Doing', 'going', 'staying', 'sleeping']

In [2]:
re.findall(r'[+-]?\d+', "23 + -24 = -1")

['23', '-24', '-1']

In [4]:
s = ("if I'm not in a hurry, then I should stay. " + " On the other hand, if I leave, then I can sleep.")
# Greedy matching (.*) tries to match as many characters as possible
re.findall(r'[Ii]f (.*), then', s)

["I'm not in a hurry, then I should stay.  On the other hand, if I leave"]

In [5]:
"""
The repetition specifiers +, *, ?, and {m,n} have corresponding non-greedy versions: +?, *?, ??, and {m,n}?. 
These expressions use as few characters as possible to make the whole pattern match some substring. 
"""
s = ("if I'm not in a hurry, then I should stay. " + " On the other hand, if I leave, then I can sleep.")
# Non-greedy matching (.*?) tries to match as few characters as possible
re.findall(r'[Ii]f (.*?), then', s)

["I'm not in a hurry", 'I leave']

In [6]:
# Functions in the re module
import re
# use the name text, because str would shadow the built-in type
text = "She goes where she wants to, she's a sheriff."
newstr = re.sub(r'\b[Ss]he\b', 'he', text)
print(newstr)

he goes where he wants to, he's a sheriff.


In [11]:
import re
text = """He is a timelord.
He has a Tardis."""
# count=1 replaces only the first match; \1 inserts the text captured by group 1
newstr = re.sub(r'(\b[Hh]e\b)', r'\1 (The Doctor)', text, count=1)
print(newstr)

He (The Doctor) is a timelord.
He has a Tardis.


In [13]:
# Match Object

mo = re.search(r'\d+ (\d+) \d+ (\d+)', 'first 123 45 67 890 last')
if mo:
    print(mo)

<re.Match object; span=(6, 19), match='123 45 67 890'>


In [15]:
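# Precompile a pattern that is reused, and pass flags to change the matching behavior.
# Each flag also has an inline form that can be written at the start of the pattern:
# (?i) = re.IGNORECASE
# (?m) = re.MULTILINE
# (?s) = re.DOTALL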
import re
pattern = r'hello world'
re.compile(pattern, re.MULTILINE | re.DOTALL)

re.compile(r'hello world', re.MULTILINE|re.DOTALL|re.UNICODE)

In [19]:
"""
Write function integers_in_brackets that finds from a given string all integers that are enclosed in brackets.
Example run: 
integers_in_brackets(" afd [asd] [12 ] [a34] [ -43 ]tt [+12]xxx") 
returns [12, -43, 12]. 
So there can be whitespace between the number and the brackets, 
but no other character besides those that make up the integer.

Test your function from the main function.
"""
import re
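def integers_in_brackets(s):
    # [ optional whitespace, an optionally signed integer, optional whitespace ]
    pattern = r'\[\s*([-+]?\d+)\s*\]'
    # findall returns the captured strings; the exercise asks for integers
    return [int(number) for number in re.findall(pattern, s)]

def main():
    result = integers_in_brackets(" afd [asd] [12 ] [a34] [ -43 ]tt [+12]xxx")
    print(result)
main()


[12, -43, 12]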


## Basic File Processing:

- Open a file with open(filename, mode="r").
- Use the file object to read or write.
- Close the file with close() when done.
### File Opening Modes:

- r: Read-only, file must exist.
- w: Write-only, creates or overwrites.
- a: Write-only, appends to the end.
- r+: Read/write, file must exist.
- w+: Read/write, creates or overwrites.
- a+: Read/write, appends to the end.
- Add `t` (text mode, the default) or `b` (binary mode) to any of the modes above, e.g., `rb` or `wt`.
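
A small sketch of how `w`, `a`, and `r` behave. It writes to a throwaway file in a temporary directory; the file name `modes_demo.txt` is made up for the example.

```python
import os
import tempfile

# A throwaway path so the sketch does not touch any real files.
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

with open(path, "w") as f:      # creates the file (or empties an existing one)
    f.write("first\n")
with open(path, "a") as f:      # appends after the existing content
    f.write("second\n")
with open(path, "r") as f:      # read-only
    appended = f.read()
print(repr(appended))           # 'first\nsecond\n'

with open(path, "w") as f:      # reopening with "w" discards the old content
    f.write("replaced\n")
with open(path) as f:           # the mode defaults to "r" (text)
    overwritten = f.read()
print(repr(overwritten))        # 'replaced\n'
```
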
### Text Mode vs. Binary Mode:

- Text mode translates each `\n` to the platform's line ending on write (the two bytes `\r\n` on Windows) and translates it back to `\n` on read.
- Text mode also encodes characters to bytes on write and decodes them on read; in UTF-8, one character takes one to four bytes.
- Binary mode reads and writes bytes directly, with no translation.
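
The difference can be made visible by writing a file in text mode and reading it back in both modes. This is a sketch with a made-up file name; `newline="\n"` is passed on writing so that the byte count is the same on every platform.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "encoding_demo.txt")

# newline="\n" switches off line-ending translation for this write.
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write("ä\n")

with open(path, "rb") as f:                  # binary mode: raw bytes
    raw = f.read()
with open(path, "r", encoding="utf-8") as f:  # text mode: decoded characters
    text = f.read()

print(raw)                   # b'\xc3\xa4\n'
print(len(text), len(raw))   # 2 3
```
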
### Common File Object Methods:

- `read(size)`: Read a specific number of characters/bytes.
- `write(string)`: Write a string/bytes to a file.
- `readline()`: Read a line until the next newline character.
- `readlines()`: Return a list of all lines in a file.
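- `writelines(lines)`: Write a list of strings to a file; no newlines are added automatically.
- `flush()`: Push buffered writes out to the operating system now instead of waiting for the buffer to fill or the file to close.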
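
A sketch of how the read methods continue from the current position in the file. The file name and contents are invented for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "methods_demo.txt")

with open(path, "w") as f:
    # each string carries its own newline, because writelines adds none
    f.writelines(["alpha\n", "beta\n", "gamma\n"])

with open(path) as f:
    first_three = f.read(3)        # the first three characters
    rest_of_line = f.readline()    # the rest of the current line
    remaining = f.readlines()      # all lines that are left

print(repr(first_three))     # 'alp'
print(repr(rest_of_line))    # 'ha\n'
print(remaining)             # ['beta\n', 'gamma\n']
```
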
### Context Manager:
Use `with open(...) as f:` to automatically close the file.

### Iterating Through File Lines:
- The file object is iterable; you can use a for loop to iterate through lines.
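
A minimal sketch that combines the context manager with line iteration, using a temporary file with made-up contents.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lines_demo.txt")
with open(path, "w") as f:
    f.write("one\ntwo\nthree\n")

seen = []
with open(path) as f:
    for number, line in enumerate(f, start=1):   # one line per iteration
        seen.append((number, line.rstrip("\n")))

print(seen)       # [(1, 'one'), (2, 'two'), (3, 'three')]
print(f.closed)   # True: the with block closed the file
```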

### Standard File Objects:

- sys.stdin: Standard input.
- sys.stdout: Standard output.
- sys.stderr: Standard error.
### Reading/Writing from Standard File Objects:

- Read from user (keyboard) using sys.stdin.readline().
- Write to user (screen) using sys.stdout.write(line).
- Use sys.stderr for error messages.
### Changing File Object Destinations:
- You can redirect standard file objects to point elsewhere, like log files.
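
One way to do this is `contextlib.redirect_stdout` from the standard library. In this sketch an `io.StringIO` object stands in for a log file, which is an assumption made so the example runs without creating files.

```python
import contextlib
import io
import sys

log = io.StringIO()    # stands in for a log file opened for writing

with contextlib.redirect_stdout(log):
    print("written to the log, not the screen")
    sys.stdout.write("so is this\n")    # sys.stdout now refers to log

# Outside the with block, sys.stdout points at its previous destination again.
captured = log.getvalue()
print(captured.count("\n"))   # 2
```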

### sys Module:

- sys.path: List of directories that Python searches for imported modules.
- sys.argv: Command line parameters.
    - sys.argv[0] is the program name.
    - Additional parameters are in the list.
- sys.exit(): Exit a program with a return value (0 for success, non-zero for errors).
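
A sketch of a typical command line entry point. The program name `greet.py` and the function `main` are hypothetical; `main` is called with explicit lists here so that the example runs without real command line parameters.

```python
import sys

def main(argv):
    """argv[0] is the program name; the remaining items are the parameters."""
    if len(argv) < 2:
        sys.stderr.write(f"usage: {argv[0]} NAME\n")
        return 1                      # non-zero signals an error
    sys.stdout.write(f"Hello, {argv[1]}!\n")
    return 0                          # zero signals success

ok_status = main(["greet.py", "Ada"])   # writes the greeting to stdout
error_status = main(["greet.py"])       # writes the usage message to stderr
print(ok_status, error_status)          # 0 1

# A real script would typically end with:
#     sys.exit(main(sys.argv))
```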

In [20]:
# Basic file processing
# encoding with utf-8
"ä".encode("utf-8")

b'\xc3\xa4'

In [22]:
# The same two bytes shown as decimal integers in the range 0-255
list("ä".encode("utf-8"))

[195, 164]

In [23]:
# Some common file object methods
# Note: basics.ipynb is not available here, so open() raises FileNotFoundError
f = open("basics.ipynb", 'r') # Let's open this notebook file,
# which is essentially a text file.
# So you can open it in a text editor as well.
for i in range(5):  # And read the first five lines
    line = f.readline()
    print(f"Line {i}: {line}", end="")
f.close()

FileNotFoundError: [Errno 2] No such file or directory: 'basics.ipynb'

In [None]:
# Second example: open the file with a context manager
# and find the length of the longest of the first five lines
max_len = 0
with open("basics.ipynb", "r") as f:    # the file is closed automatically
    # when the with block exits
    for i in range(5):
        line = f.readline()
        if len(line) > max_len:
            max_len = len(line)
        print(f"Line {i}: {line}", end="")
print(f"The longest line in this file has length {max_len}")
# The output should end with something like this:
# The longest line in this file has length 1046

In [24]:
# Standard file objects
# These standard file objects are meant to be a basic input/output mechanism in textual form. 
import sys
import random
# Pick a random integer between -10 and 10 (both inclusive)
i = random.randint(-10, 10)
if i >= 0:
    sys.stdout.write("Got a non-negative integer.\n")
else:
    sys.stderr.write("Got a negative integer.\n")
# Example output: Got a negative integer.



Got a negative integer.


In [45]:
"""
Exercise 2.2 (file listing)

The file src/listing.txt contains a list of files with one line per file. 
Each line contains seven fields: access rights, number of references, owner's name, name of owning group, file size, date, filename. 
These fields are separated with one or more spaces. 
Note that there may be spaces also within these seven fields.

Write function file_listing that loads the file src/listing.txt. 
It should return a list of tuples (size, month, day, hour, minute, filename). 
Use regular expressions to do this (either match, search, findall, or finditer method).

An example: for line

-rw-r--r-- 1 jttoivon hyad-all   25399 Nov  2 21:25 exception_hierarchy.pdf
the function should create the tuple (25399, "Nov", 2, 21, 25, "exception_hierarchy.pdf").
"""

import re


def file_listing(filename="src/listing.txt"):
    # Example: the line
    # "-rw-r--r-- 1 jttoivon hyad-all   25399 Nov  2 21:25 exception_hierarchy.pdf"
    # should produce the tuple (25399, "Nov", 2, 21, 25, "exception_hierarchy.pdf").
    # re.match only matches at the start of the line, so its pattern would also have
    # to cover the leading fields; re.search can skip them.
    pattern = r'(\d+)\s+(\w{3})\s+(\d+)\s+(\d{2}):(\d{2})\s+(.+)'
    result = []
    with open(filename, "r") as file:
        for line in file:
            match_group = re.search(pattern, line)
            if match_group:
                # use a separate name so the filename parameter is not overwritten
                size, month, day, hour, minute, name = match_group.groups()
                result.append((int(size), month, int(day), int(hour), int(minute), name))
    return result

"""
# Different version
def file_listing(filename="src/listing.txt"):
    with open(filename) as f:
        lines = f.readlines()
    result=[]
    for line in lines:
        pattern = r".{10}\s+\d+\s+.+\s+.+\s+(\d+)\s+(...)\s+(\d+)\s+(\d\d):(\d\d)\s+(.+)"
        if True:      # Two alternative ways of doing the same thing
            m = re.match(pattern, line)
        else:
            compiled_pattern = re.compile(pattern)
            m = compiled_pattern.match(line)
        if m:
            t = m.groups()
            result.append((int(t[0]), t[1], int(t[2]), int(t[3]), int(t[4]), t[5]))
        else:
            print(line)
    return result
 

"""
  
def main():
    result = file_listing()
    print(result)
main()


25399 Nov 2 21 25 exception_hierarchy.pdf
None


In [54]:
"""
Exercise 2.3 (red green blue)
The file src/rgb.txt contains names of colors and their numerical representations in RGB format. 
The RGB format allows a color to be represented as a mixture of red, green, and blue components. 
Each component can have an integer value in the range [0,255]. 
Each line in the file contains four fields: red, green, blue, and colorname. 
Each field is separated by some amount of whitespace (tab or space in this case). 
The text file is formatted to make it print nicely, but that makes it harder to process by a computer. 
Note that some color names can also contain a space character.

Write function red_green_blue that reads the file rgb.txt from the folder src. 
Remove the irrelevant first line of the file. The function should return a list of strings. 
Clean-up the file so that the strings in the returned list have four fields separated by a single tab character (\t). 
Use regular expressions to do this.

The first string in the returned list should be:

'255\t250\t250\tsnow'

"""
import re

def red_green_blue(filename="src/rgb.txt"):
    # The file starts with a header line, followed by the color lines, e.g.:
    #   ! $Xorg: rgb.txt,v 1.3 2000/08/17 19:54:00 cpqbld Exp $
    #   255 250 250		snow
    result = []
    with open(filename, "r") as file:
        file.readline()     # skip the irrelevant first line
        for line in file:
            # three integer fields, then a color name that may contain spaces
            match = re.search(r'(\d+)\s+(\d+)\s+(\d+)\s+(\w+.*)', line)
            if match:       # skip lines that lack the expected four fields
                # join the fields with single tabs and drop trailing whitespace
                result.append("\t".join(match.groups()).strip())
    return result
"""
# second version

def red_green_blue(filename="src/rgb.txt"):
    with open(filename) as in_file:
        l = re.findall(r"(\d+)\s+(\d+)\s+(\d+)\s+(.*)\n", in_file.read())
        return [
            "{}\t{}\t{}\t{}".format(r, g, b, name)
            for r, g, b, name
            in l
        ]
"""

def main():
    red_green_blue()
main()


255\t250\t250\tsnow


In [74]:
"""
Exercise 2.4 (word frequencies)
Create function word_frequencies that gets a filename as a parameter and 
returns a dict with the word frequencies. 
In the dictionary the keys are the words and 
the corresponding values are the number of times that word occurred in the file specified by the function parameter. 
Read all the lines from the file and split the lines into words using the split() method. 
Further, remove punctuation from the ends of words using the strip('''!"#$%&'()*,-./:;?@[]_''') method call.

Test this function in the main function using the file alice.txt. In the output, there should be a word and 
its count per line separated by a tab:

The     64
Project 83
Gutenberg   26
EBook   3
of      303
"""
def word_frequencies(filename):
    # Sample input text:
    #   The Project Gutenberg EBook of Alice in Wonderland, by Lewis Carroll
    #   This eBook is for the use of anyone anywhere at no cost and with
    result = {}
    with open(filename, "r") as file:
        for line in file:
            for word in line.split():
                # remove punctuation from both ends of the word
                strip_word = word.strip("""!"#$%&'()*,-./:;?@[]_""")
                if strip_word not in result:
                    result[strip_word] = 0
                result[strip_word] += 1
    return result
'''
def word_frequencies(filename):
    result = {}
    with open(filename) as in_file:
        for w in in_file.read().split():
            ws = w.strip("""!"#$%&'()*,-./:;?@[]_""")
            if ws not in result:
                result[ws] = 0
            result[ws] += 1
    return result
'''
def main():
    words = word_frequencies("src/alice.txt")
    for word, frequency in words.items():
        print(f"{word}\t{frequency}")
    
main()

{'The': 1, 'Project': 2, 'Gutenberg': 2, 'EBook': 1, 'of': 3, 'Alice': 1, 'in': 1, 'Wonderland,': 1, 'by': 1, 'Lewis': 1, 'Carroll': 1, 'This': 1, 'eBook': 2, 'is': 1, 'for': 1, 'the': 3, 'use': 1, 'anyone': 1, 'anywhere': 1, 'at': 2, 'no': 2, 'cost': 1, 'and': 1, 'with': 2, 'almost': 1, 'restrictions': 1, 'whatsoever.': 1, 'You': 1, 'may': 1, 'copy': 1, 'it,': 1, 'give': 1, 'it': 2, 'away': 1, 'or': 2, 're-use': 1, 'under': 1, 'terms': 1, 'License': 1, 'included': 1, 'this': 1, 'online': 1, 'www.gutenberg.org': 1}
