# Methods I: Programming and Data Analysis

## Session 10: Finishing the Basics; Regular Expressions

### Gerhard Jäger

#### (based on Johannes Dellert's slides)

January 11, 2022

### Usage of `while` Loops



Question 1: What might this while loop be doing?

```python
while command != "quit":
    if command == "left":
        player.direction.turn(-5)
    elif command == "right":
        player.direction.turn(+5)
    elif command == "accelerate":
        player.speed += 3
    elif command == "stop":
        player.speed -= 5
        if player.speed < 0:
              player.speed = 0
    for obj in objects:
        obj.continue_movement(1)
    graphics.draw_scene()
    command = get_next_command()
```



### Usage of `while` Loops

Question 1: What might this while loop be doing?
<img src="_img/code_01.svg" width=800>



### Usage of `while` Loops

Question 2: What might this while loop be doing?

```python
agenda = get_new_tasks()
done = set()
while len(agenda) > 0:
    task = agenda.pop(0)
    if is_simple(task):
        task.carry_out()
        done.add(task)
    else:
        subtasks = break_into_subtasks(task)
        agenda.extend(subtasks)
    new_tasks = get_new_tasks()
    for task in new_tasks:
        if task not in done:
            agenda.append(task)
```


### Usage of `while` Loops

Question 2: What might this while loop be doing?

<img src="_img/code_02.svg" width=1000>

### More on Sorting: Reverse Order

The default sorting order can be reversed:

-   sorting is in ascending order by default (1 before 10, A before B)

-   what if we want to compute e.g. a ranking,\
    where the highest score is best, and should be listed first?

-   both `sort()` and `sorted()` support the named argument `reverse`
    which takes a boolean value:

In [None]:
test_list = [5,3,14,5,1,2,3,7,8,9,12,-3,-4]
print(sorted(test_list))


In [None]:
print(sorted(test_list, reverse=True))


In [None]:
test_list


In [None]:
test_list.sort(reverse=True)


In [None]:
test_list
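
As a small illustration of the ranking use case mentioned above, the following sketch sorts (score, name) pairs so that the highest score comes first. The names and scores are invented for this example; tuples are compared element by element, so the score decides the order.

```python
# Hypothetical (score, name) pairs, invented for this illustration.
scores = [(71, "Ana"), (93, "Ben"), (85, "Cem")]

# sorted() returns a new list; reverse=True puts the highest score first.
ranking = sorted(scores, reverse=True)

for rank, (score, name) in enumerate(ranking, start=1):
    print(rank, name, score)
```

Note that `sorted()` leaves `scores` itself unchanged, whereas `scores.sort(reverse=True)` would reorder the list in place.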


### Sorting a Dictionary by Value

Sorting entries in a dictionary by their values:

-   by default, sorting a dictionary sorts its keys (`sorted(d)` returns the sorted list of keys)

-   a function value for the named argument `key` can override this

-   for sorting a dict by value, one commonly used option is to use the
    `itemgetter()` function from the `operator` module:

In [None]:
import operator
f = operator.itemgetter(1)
grades = {1717345 : "D", 1456345 : "A", 1334521: "C"}
for (student_ID, grade) in sorted(grades.items(), key=f):
    print(str(student_ID) + ": " + grade)
    

- also works with the in-place `sort()` method, e.g. on a list of
tuples!
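
A minimal sketch of the in-place variant, assuming the same data is stored as a list of (student_ID, grade) tuples instead of a dictionary. The lambda expression at the end is simply an alternative way of writing the same key function.

```python
import operator

# Hypothetical (student_ID, grade) pairs, mirroring the dictionary above.
records = [(1717345, "D"), (1456345, "A"), (1334521, "C")]

# Sort the list in place by the second tuple element (the grade).
records.sort(key=operator.itemgetter(1))
print(records)

# A lambda expression is an equivalent way to write the key function.
by_grade = sorted(records, key=lambda record: record[1])
print(by_grade == records)
```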

### More on Optimization

Some basic facts about **optimization**:

-   for every computable problem or task, there are infinitely many
    programs which solve it; they will differ in memory and time usage!

-   differences in speed can be huge, even if you might not notice them on
    toy data (computing prefixes from a full frequency list takes minutes!)

-   some **principles of optimization**:

    -   avoid running costly computations (File I/O, search) multiple
        times

    -   avoid computing the same thing twice

    -   avoid creating unnecessary objects

-   get used to writing reasonably efficient code now,\
    it will be much harder later when you have formed habits!

-   if you want to be extremely efficient,\
    learn more programming languages (C, Java)
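
The principle of not computing the same thing twice can be sketched as follows. The word counts are invented for this illustration; both loops produce the same relative frequencies, but the first one recomputes the total in every iteration.

```python
# Hypothetical word counts, invented for this illustration.
counts = {"the": 120, "of": 75, "regex": 5}

# Wasteful: the total is recomputed once per word.
slow = {}
for word in counts:
    slow[word] = counts[word] / sum(counts.values())

# Better: compute the total once, then reuse it inside the loop.
total = sum(counts.values())
fast = {}
for word in counts:
    fast[word] = counts[word] / total

print(slow == fast)  # True
```

On three words the difference is invisible; on a dictionary with a million entries, the first loop performs on the order of a million times more additions.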

### Quiz: Efficient or Not? (1)

- Is there any obvious way to optimize this code?

```python
older_student_ids = set()
newer_student_ids = set()
for line in open("participants.tsv","r"):
    student_id = int(line.split("\t")[0])
    if student_id < 3000000:
        older_student_ids.add(student_id)
    else:
        newer_student_ids.add(student_id)
with open("older_student_ids.txt","w") as output:
    for student_id in sorted(older_student_ids):
        output.write(str(student_id) + "\n")
with open("newer_student_ids.txt","w") as output:
    for student_id in sorted(newer_student_ids):
        output.write(str(student_id) + "\n")
```


### Quiz: Efficient or Not? (1)

- Is there any obvious way to optimize this code?

<img src="_img/code_03.svg" width=1000>




### Quiz: Efficient or Not? (2)

- Is there any obvious way to optimize this code?

```python
students_by_id = dict()
for line in open("participants.tsv","r"):
    student_id = line.split("\t")[0]
    first_name = line.split("\t")[1]
    last_name = line.split("\t")[2]
    student_name = (first_name, last_name)
    students_by_id[student_id] = student_name
for student_id in students_by_id.keys():
    first_name, last_name = students_by_id[student_id]
    print(last_name + ", " + first_name + ": " + student_id)
```


### Quiz: Efficient or Not? (2)

- Is there any obvious way to optimize this code?


<img src="_img/code_04.svg" width=1000>

### Quiz: Efficient or Not? (3)

- Is there any obvious way to optimize this code?

```python
student_names = list()
for line in open("participants.tsv","r"):
    first_name, last_name = line.split("\t")[1:3]
    student_names.append((last_name, first_name))
for i in range(len(student_names)):
    print(sorted(student_names)[i])
```


### Hints on Maintainability

What is **maintainability** about?

-   being able to quickly change the behavior of a program\
    by editing the code in as few places as possible

-   being able to revisit a program years from now, and still understand
    it

-   being able to quickly add or change some functionality as
    requirements change

How can one ensure maintainability?

-   if you use identical literals everywhere, store them in a global
    variable!

-   do not copy-and-paste code! (That's what functions and loops are
    for!)

-   comment your code!

    -   informative variable names

    -   block comments describing function behavior
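
A minimal sketch of these hints, using an invented pass/fail check: the threshold is stored once in a named constant, the logic lives in one documented function, and a loop replaces copy-pasted code. All names and values are hypothetical.

```python
# Hypothetical constant: the threshold can be changed in exactly one place.
PASS_THRESHOLD = 50


def has_passed(points):
    """Return True if the given number of points is enough to pass."""
    return points >= PASS_THRESHOLD


# The loop replaces one copy-pasted block per student.
results = {"Ana": 71, "Ben": 42, "Cem": 50}  # invented data
for name, points in results.items():
    print(name, "passed" if has_passed(points) else "failed")
```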


### Maintainability: Example

How can the maintainability of this function be increased?

```python
def print_matrix(mtx):
    for row in mtx:
        if row == mtx[0]:
            continue
        entry1 = str(row[1])[:6]
        if len(entry1) < 6:
            entry1 += (6 - len(entry1)) * " "
        entry2 = str(row[2])[:6]
        if len(entry2) < 6:
            entry2 += (6 - len(entry2)) * " "
        entry3 = str(row[3])[:6]
        if len(entry3) < 6:
            entry3 += (6 - len(entry3)) * " "
        print(entry1 + " " + entry2 + " " + entry3)
```





### Maintainability: Example

How can the maintainability of this function be increased?
<img src="_img/code_05.svg" width=1000>






### Structured Programs: Modules

Python code (in `.py` files) can be used in two ways:

-   as a standalone program (e.g. using "Run" in PyCharm)

-   as a module that is imported and used from inside another program

For larger projects, this difference should be exploited:

-   package the main functionality in different Python files

-   import these files as modules from a main file where the execution
    code resides, adding the imported names to the namespace

-   this main code should only be executed when the code is run as a
    standalone program

### Structured Programs: The `main` block

Revisiting the `main` block:

- the line `if __name__ == "__main__"` checks whether the current
  Python file is being executed as the main program

- this is an important pattern for separating reusable functionality (the
  function definitions) from execution (the program code that actually runs)

```python
def calculation_step_1():
    pass  # some code here

def calculation_step_2():
    pass  # some code here

if __name__ == "__main__":
    calculation_step_1()
    calculation_step_2()
```

- you have already used this pattern during development!
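
To see the effect of the check without creating several files by hand, the following sketch writes a tiny throwaway module to a temporary folder and runs it twice with the standard-library `runpy` module: once under a module name (as happens on import), and once under the name `"__main__"` (as happens when the file is run directly). The module content and the names `demo_module` and `main_block_ran` are made up for this illustration.

```python
import os
import runpy
import tempfile

# A tiny throwaway module (hypothetical, written to a temporary file).
SOURCE = '''
def greet():
    return "hello"

main_block_ran = False
if __name__ == "__main__":
    main_block_ran = True
'''

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo_module.py")
    with open(path, "w") as out_file:
        out_file.write(SOURCE)
    # Simulates "import demo_module": __name__ is the module name.
    imported = runpy.run_path(path, run_name="demo_module")
    # Simulates running the file directly: __name__ is "__main__".
    executed = runpy.run_path(path, run_name="__main__")

print(imported["main_block_ran"])  # False
print(executed["main_block_ran"])  # True
```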

### Structured Programs: Example

- this program packages its functionality well:

```python
def sum_lists(list1, list2):
    return sum(list1) + sum(list2)

def read_floats_from_file(filename):
    with open(filename, "r") as in_file:
        lines = in_file.readlines()
    float_list = []
    for line in lines:
        float_list.append(float(line.strip()))
    return float_list

if __name__ == "__main__":
    l1 = read_floats_from_file("costs-2017-1.txt")
    l2 = read_floats_from_file("costs-2017-2.txt")
    print("Sum of both lists: " + str(sum_lists(l1,l2)))
```




### Structured Programs: Example

- this program packages its functionality well:

<img src="./_img/code_06.svg" width=1000>


### Packages: Motivation

-   some functionality is needed by many different programs, but is neither part of the core language (like `sorted()` or `str.lower()`) nor of the standard library (like `math` or `itertools`)

-   examples:
    -   visualization of objects like trees and graphs
    -   conversion between different text formats
    -   efficient algorithms for parsing

-   programmers like to share code for such common tasks via package management systems, where they can search for relevant packages, and install them using automated tools which ensure compatibility

-   as a novice programmer, you
    -   do not need to reinvent the wheel
    -   can focus on the functionality of your program that is actually new
    -   can rely on stable code used (and tested) by many others

-   later, you can contribute your own code to the community



### Installing Packages

How to install a new package:

- google for packages that might provide the functionality you want

- you will end up with some names of packages that look like they
  might work (`the-package` will be our dummy name)

- with the recommended configuration, try using **conda** first:

  ``` {style="console"}
    $ conda install the-package
  ```

- package not found $\rightarrow$ access Python Package Index
  (**PyPI**) via **pip**:

  ``` {style="console"}
    $ pip install the-package
  ```

- for the latter option, you might need to install pip first:

  ``` {style="console"}
    $ conda install pip
  ```


### Using Packages

How to use a package:

-   installing a package makes a set of modules available (often only
    one)

-   take a look at the examples on the package's website

-   try out the code from the first of the examples on an interactive
    console to see whether everything works as expected

-   slowly expand your usage of the new tool, trying out various
    functions until you have what you want

-   not sure about what one of the functions or variables in the
    module's namespace does? Many packages provide good documentation
    online!

Trying out the `transliterate` package to convert between alphabets:

- installing `transliterate`:

  ``` {style="console"}
  $ conda install pip
  $ pip install transliterate
  ```

- loading the `translit` function from the newly available package:

  ``` {style="console"}
  >>> from transliterate import translit
  ```

- using translit to transliterate from Greek:

  ```python
  >>> translit("Βικιπαίδεια", "el", reversed=True)
  'Vikipaideia'
  ```

- transliterating into Russian Cyrillic (will not work for
  everything!):

  ```python
  >>> translit("Dortmund + Dresden", 'ru')
  'Дортмунд + Дресден'
  ```

- more information: <https://pypi.org/project/transliterate/>

### Using Classes

Many packages define their own **classes** (datatypes):

- class names are conventionally written with an uppercase letter
  (dummy example: `Thing`)

- classes are similar to the built-in complex datatypes like lists and
  dictionaries: they define the behavior of objects of their type

- creating an **instance** of a class (an object of its type) works by
  calling its **constructor**, like you used `dict()` to create a new
  dictionary:

  ``` {style="console"}
  >>> my_thing = Thing()
  ```

- classes provide **methods**, i.e. functions which take an instance
  of the class as their first argument, and calls to which are written
  `instance.method(additional, arguments)`

  ``` {style="console"}
  >>> my_thing.do_something()
  >>> my_thing.process_this(some_other_object)
  ```
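
The same pattern can be tried out with a class from the standard library. The sketch below uses `collections.Counter` (chosen only for illustration, it is not part of the slides above) to show a constructor call, a method call, and the equivalent spelling in which the instance is passed explicitly as the first argument.

```python
from collections import Counter

# Calling the constructor creates a new instance of the class Counter.
letter_counts = Counter()

# Methods are normally called on the instance ...
letter_counts.update("banana")

# ... which is shorthand for passing the instance as the first argument.
Counter.update(letter_counts, "bandana")

print(letter_counts.most_common(1))  # [('a', 6)]
```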


### Example: Processing Video Data

Basic usage of MoviePy (after installation):

- load classes and functions from the `editor` subpackage:

  ``` {style="console"}
  >>> from moviepy.editor import VideoFileClip
  ```

- load a video clip from a file into a new instance of a class
  provided by MoviePy, passing the filename as argument:

  ``` {style="console"}
  >>> clip = VideoFileClip("yourvideo.mp4")
  ```




- shorten the video to a ten-second fragment:

  ```python
  >>> shortened_clip = clip.subclip(50,60)
  ```

- rotate the clip by 180 degrees:

  ```python
  >>> final_clip = shortened_clip.rotate(180)
  ```

- save the result to a new file:
  ```python
  >>> final_clip.write_videofile("processed.webm", fps=25)
  ```

### Pattern Detection in Strings

In many applications, we need to **find strings matching a pattern**:

-   find all documents containing a given name

-   find example sentences for the usage of some word in a corpus

-   find the places in your code where you used some variable

Also, we often need to **extract parts of a string** matching a pattern:

-   extract addresses from a text

-   extract everything that is formatted like a name\
    (e.g. a sequence of several tokens starting with uppercase letters)

-   extract the words which can occur as arguments to a specific verb
    from a corpus (e.g. to determine selectional restrictions)

### Regular Expressions: Basics

What are **regular expressions** (short: **regex**)?

-   a language of patterns which define sets of strings

-   **literal characters** (mostly letters of the alphabet) represent
    themselves in a pattern

-   **special characters** (mostly punctuation) do not represent
    themselves, but modify the meaning of surrounding patterns




-   first examples of special characters:

    -   plus `+` designates one or more instances of the previous
        character:\
        `"ba+"` represents `{"ba", "baa", "baaa", ...}`

    -   square brackets `[]` represent character sets:\
        `"ba[tr]"` represents `{"bat", "bar"}`

    -   both can be combined: `"ba[tr]+"` represents\
        `{"bat", "bar", "batt", "batr", "bart", "barr", "battt", "battr", "batrt", "batrr", "bartt", "bartr", "barrt", "barrr", "batttt", "batttr", "battrt", "battrr", ...}`
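
These first patterns can be checked with Python's `re` module, which is introduced later in this session. The sketch uses `re.fullmatch()`, which succeeds only if the whole string fits the pattern; the candidate strings are chosen for illustration.

```python
import re

pattern = "ba[tr]+"
candidates = ["bat", "barr", "batrt", "ba", "bad"]

# Map each candidate to True/False depending on whether it is in the set.
in_set = {}
for candidate in candidates:
    in_set[candidate] = re.fullmatch(pattern, candidate) is not None
    print(candidate, in_set[candidate])
```

The strings `"ba"` and `"bad"` are rejected: `+` demands at least one character from the set, and `d` is not in `[tr]`.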

### Regular Expressions: Quantifiers

**Quantifiers** range over the preceding item and decide how many times
it can or must be repeated to be matched:

-   `*` for zero or more repetitions

-   `+` for at least one repetition

-   `?` for optional items (zero repetitions or one repetition)

More general quantification can be achieved by `{min,max}` (written without
spaces), where `min` and `max` must be non-negative integers:

-   `"a{4,6}"` matches the strings `"aaaa"`, `"aaaaa"`, and `"aaaaaa"`

-   `"[01]{8}"` matches bitstrings of length 8 (byte representations)

-   `"0{2,}"` matches sequences of at least 2 zeroes
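
A short sketch for trying out the quantifiers, again with `re.fullmatch()` from the `re` module covered later in this session. The pattern `"colou?r"` is an extra example added here to illustrate `?`.

```python
import re

# (pattern, string) pairs to test; the whole string has to match.
checks = [
    ("a{4,6}", "aaaaa"),      # between 4 and 6 times "a"
    ("a{4,6}", "aaa"),        # too short
    ("[01]{8}", "01100001"),  # exactly 8 bits
    ("0{2,}", "0"),           # needs at least two zeroes
    ("colou?r", "color"),     # the "u" is optional
]

results = []
for pattern, string in checks:
    results.append(re.fullmatch(pattern, string) is not None)

print(results)  # [True, False, True, False, True]
```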

### Regular Expressions: The Wildcard

The **wildcard symbol** `.` (the dot) matches any character except the
new-line character, e.g.

-   `"h.t"` matches `hat`, `hot`, and `hit`, but not `heat`

-   `".a.a.a"` matches `banana` and `papaya`, but not `kaaba`

-   `"9.00"` matches `9a00`, `9100`, `9y00`, and `9c00`, not only
    `9.00`\
    (you need to **escape** the dot for that: `"9\.00"`)

-   `" .{3} "` matches any three characters surrounded by spaces, e.g. a three-letter word

Special symbols match the beginning and the end of the line:

-   `"^"` matches the beginning of the line

-   `"$"` matches the end of the line
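
The wildcard, escaping, and the two anchors can be explored as follows. The sketch uses the `re` module from later in this session; the raw string `r"9\.00"` is a Python convenience (an assumption of this example, not required by the regex language) that avoids having to double the backslash.

```python
import re

# The dot stands for exactly one arbitrary character.
dot_hot = re.fullmatch("h.t", "hot") is not None
dot_heat = re.fullmatch("h.t", "heat") is not None

# An unescaped dot also accepts letters; an escaped dot only a real dot.
unescaped = re.search("9.00", "9a00") is not None
escaped = re.search(r"9\.00", "9a00") is not None

# ^ and $ tie a match to the beginning or end of the line.
at_start = re.search("^moon", "moon landing") is not None
at_end = re.search("moon$", "to the moon") is not None
not_at_start = re.search("^moon", "to the moon") is not None

print(dot_hot, dot_heat)                 # True False
print(unescaped, escaped)                # True False
print(at_start, at_end, not_at_start)    # True True False
```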

### Regular Expressions: Character Sets and Ranges

Brackets `[ ]` define **character sets** matching a single character,
and can be **negated** using a caret (`^`) after the opening bracket:

-   `[aeiou]` matches one (lowercase Latin) vowel

-   `[^aeiou]` matches any single character except a lowercase Latin vowel

Some character sets can conveniently be defined using **character
ranges**:

-   `[A-Z]` is the same as `[ABCDEFGHIJKLMNOPQRSTUVWXYZ]`

-   `[0-9]` is the same as `[0123456789]`

Several escaped characters serve as convenient shorthands:

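-   `\d` for digits (= `[0-9]`)

-   `\w` for word characters (= `[a-zA-Z0-9_]`)

-   `\s` for whitespace (= `[ \t\n\r\f\v]`)

These equivalences hold for ASCII text; on ordinary Python strings, the
shorthands also match the corresponding Unicode characters (e.g. `\w` matches `ä`).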
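
A brief sketch of character sets, ranges and shorthands in action, using `re.findall()` from the `re` module covered later in this session. The sample text is invented for this illustration.

```python
import re

text = "Room B12, 3rd floor"  # invented sample text

vowels = re.findall("[aeiou]", text)        # every lowercase vowel
capitals = re.findall("[A-Z]", text)        # every uppercase letter
numbers = re.findall(r"\d+", text)          # runs of digits
words = re.findall(r"\w+", text)            # runs of word characters
punctuation = re.findall(r"[^\w\s]", text)  # neither word char nor space

print(vowels)       # ['o', 'o', 'o', 'o']
print(capitals)     # ['R', 'B']
print(numbers)      # ['12', '3']
print(words)        # ['Room', 'B12', '3rd', 'floor']
print(punctuation)  # [',']
```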

### Regular Expressions: Grouping

The **grouping metacharacters** `( )` serve to

-   apply repetition operators to a sequence of literal characters

-   make expressions easier to read

-   define groups for use in matching and replacing

Examples:

-   `"(abc)+"` matches e.g. `"abc"` and `"abcabcabc"`

-   `"(in)?dependent"` matches `"independent"` and `"dependent"`
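
The effect of grouping on quantifiers can be checked with a few calls to `re.fullmatch()` (from the `re` module covered later in this session). The ungrouped pattern `"abc+"` is added here only for contrast.

```python
import re

# With a group, + repeats the whole sequence "abc".
grouped = re.fullmatch("(abc)+", "abcabc") is not None

# Without a group, + only repeats the preceding character "c".
ungrouped_repeat = re.fullmatch("abc+", "abcabc") is not None
ungrouped_c = re.fullmatch("abc+", "abccc") is not None

# The group makes the whole prefix "in" optional.
with_prefix = re.fullmatch("(in)?dependent", "independent") is not None
without_prefix = re.fullmatch("(in)?dependent", "dependent") is not None

print(grouped, ungrouped_repeat, ungrouped_c)  # True False True
print(with_prefix, without_prefix)             # True True
```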

### Regular Expressions: Referencing Groups

-   a group can be *referenced* later in the same pattern
-   `\1` (written `"\\1"` in an ordinary Python string) matches *exactly the same string* that matched the first group
-   `\2` matches the same string as the second group, etc.
-   `"A (rose|tulip) is a \\1 is a \\1"`:
    - "A rose is a rose is a rose" ✔
    - "A tulip is a tulip is a tulip" ✔
    - "A rose is a rose is a tulip" ❌
    - "A tulip is a rose is a rose" ❌
-   `"(.).*\\1"` matches any string whose first and last characters are identical (and occupy two different positions):
    -   "aa" ✔
    -   "axyzdefa" ✔
    -   "axyzdefb" ❌
    -   "a" ❌
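
The examples above can be verified with the `re` module from later in this session. The sketch writes the patterns as raw strings (`r"..."`), an assumption of this example, so that the reference can be typed as `\1` with a single backslash.

```python
import re

flower = re.compile(r"A (rose|tulip) is a \1 is a \1")
same_ends = re.compile(r"(.).*\1")

# The reference must repeat exactly what the group matched.
roses = flower.fullmatch("A rose is a rose is a rose") is not None
mixed = flower.fullmatch("A rose is a rose is a tulip") is not None

# First and last character identical, at two different positions.
ends = {}
for candidate in ["aa", "axyzdefa", "axyzdefb", "a"]:
    ends[candidate] = same_ends.fullmatch(candidate) is not None

print(roses, mixed)  # True False
print(ends)
```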

### Regular Expressions: Alternation

The **alternation metacharacter** `|` matches either the previous or the
next expression:

-   `"apple|orange"` matches `"apple"` and `"orange"`

-   Q: what does `"apple(juice|sauce)"` match?

-   Q: what does `"w(ei|ie)rd"` match?

Multiple alternatives can be used as well:

-   `"apple|orange|banana"`

-   `"(AA|BB|CC){6}"` matches e.g. `"AABBAACCAABB"`
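
A small sketch of alternation, using `re.fullmatch()` from the `re` module covered later in this session. The `gr(a|e)y` patterns are extra examples added here to show that `|` applies to everything on either side of it unless parentheses limit its scope.

```python
import re

fruit = re.compile("apple|orange|banana")
one_fruit = fruit.fullmatch("orange") is not None
two_fruits = fruit.fullmatch("appleorange") is not None

# With parentheses, only "a" and "e" are alternatives.
scoped = re.fullmatch("gr(a|e)y", "grey") is not None
# Without them, the alternatives are "gra" and "ey".
unscoped = re.fullmatch("gra|ey", "grey") is not None

# Six repetitions, each freely chosen from the three alternatives.
six_pairs = re.fullmatch("(AA|BB|CC){6}", "AABBAACCAABB") is not None

print(one_fruit, two_fruits)  # True False
print(scoped, unscoped)       # True False
print(six_pairs)              # True
```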

### Regular Expressions: The `re` module

Basic usage of the built-in `re` module:

- import the module to make the namespace available:

In [None]:
import re

- compile your regular expression string into a **regular expression
object** which can be used to very efficiently match against the
regex

In [None]:
matcher = re.compile("(.)([aeiou]{2}n)")

- use the `match()` method to test whether the string starts with a match (`fullmatch()` tests the entire string):

In [None]:
matcher.match("moon")

In [None]:
matcher.match("I have been to the moon.")
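
The following sketch contrasts three methods of a compiled pattern: `match()` only looks at the beginning of the string, `fullmatch()` requires the entire string to fit, and `search()` accepts a match anywhere. The test string `"moonlight"` is added here for illustration.

```python
import re

matcher = re.compile("(.)([aeiou]{2}n)")
sentence = "I have been to the moon."

# match(): the pattern must fit at the start; trailing text is allowed.
starts_with = matcher.match("moonlight") is not None

# fullmatch(): the pattern must cover the whole string.
whole_light = matcher.fullmatch("moonlight") is not None
whole_moon = matcher.fullmatch("moon") is not None

# The sentence does not start with a match, but contains one.
sentence_start = matcher.match(sentence) is not None
first_hit = matcher.search(sentence).group()

print(starts_with, whole_light, whole_moon)  # True False True
print(sentence_start, first_hit)             # False been
```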

### Regular Expressions: The `re` module

-   `search()` looks for matching substrings instead:

In [None]:
matcher.search("I have been to the moon.")

- using a match result object:

In [None]:
match = matcher.search("I have been to the moon.")

In [None]:
match.start(), match.end(), match.groups()

In [None]:
matcher.search("I have been to the moon.", 8).groups()

`findall()` lists all groupings in matched substrings:

In [None]:
matcher.findall("I have been to the moon.")

Alternative third-party implementation: the `regex` package

-   all the functionality is available under the same names as in `re`

-   advantage: support for Unicode is much better

### Regular Expressions: The `re` module

-   `sub(repl, string)` replaces each matching substring in
    `string` with `repl`

-   `repl` can contain references to groups

In [None]:
matcher = re.compile("[0-9]")

In [None]:
matcher.sub("?", "UFKc17X")

In [None]:
matcher = re.compile("(.+)")
matcher.sub("A \\1 is a \\1 is a \\1", "rose")

In [None]:
matcher = re.compile("(.+)")
matcher.sub("A \\1 is a \\1 is a \\1", "tulip")
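
As a further sketch of group references in `repl`, the following example reorders names of the form "Last, First". The names are chosen for illustration, and the raw strings (`r"..."`) are used so that `\1` and `\2` can be written with a single backslash. The optional `count` argument limits the number of replacements.

```python
import re

name = re.compile(r"(\w+), (\w+)")
text = "Lovelace, Ada; Turing, Alan"

# Swap the two groups in every match.
swapped_all = name.sub(r"\2 \1", text)

# Replace only the first match.
swapped_first = name.sub(r"\2 \1", text, count=1)

print(swapped_all)    # Ada Lovelace; Alan Turing
print(swapped_first)  # Ada Lovelace; Turing, Alan
```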