# Methods I: Programming and Data Analysis

## Session 11: Regular Expressions; NLTK; Classes

### Gerhard Jäger

#### (based on Johannes Dellert's slides)

January 18, 2022

### Pattern Detection in Strings

In many applications, we need to **find strings matching a pattern**:

-   find all documents containing a given name

-   find example sentences for the usage of some word in a corpus

-   find the places in your code where you used some variable

Also, we often need to **extract parts of a string** matching a pattern:

-   extract addresses from a text

-   extract everything that is formatted like a name\
    (e.g. a sequence of several tokens starting with uppercase letters)

-   extract the words which can occur as arguments to a specific verb
    from a corpus (e.g. to determine selectional restrictions)

### Regular Expressions: Basics

What are **regular expressions** (short: **regex**)?

-   a language of patterns which define sets of strings

-   **literal characters** (mostly letters of the alphabet) represent
    themselves in a pattern

-   **special characters** (mostly punctuation) do not represent
    themselves, but modify the meaning of surrounding patterns:

-   first examples of special characters:

    -   plus `+` designates one or more instances of the previous
        character:\
        `"ba+"` represents `{"ba", "baa", "baaa", ...}`

    -   square brackets `[]` represent character sets:\
        `"ba[tr]"` represents `{"bat", "bar"}`

    -   both can be combined: `"ba[tr]+"` represents\
        `{"bat", "bar", "batt", "batr", "bart", "barr", "battt", "battr", "batrt", "batrr", "bartt", "bartr", "barrt", "barrr", "batttt", "batttr", "battrt", "battrr", ...}`
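A minimal sketch for trying out these first patterns, assuming Python's built-in `re` module (introduced later in this session). The helper name `matches` is only illustrative:

```python
import re

def matches(pattern, text):
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# "ba+" requires "b" followed by one or more "a"
print(matches("ba+", "baaa"))       # True
print(matches("ba+", "b"))          # False: at least one "a" is needed

# "ba[tr]+" requires "ba" followed by one or more of "t" or "r"
print(matches("ba[tr]+", "bartt"))  # True
print(matches("ba[tr]+", "bat!"))   # False: "!" is not in the set
```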


### Regular Expressions: Quantifiers

**Quantifiers** range over the preceding item and decide how many times
it can or must be repeated to be matched:

-   `*` for zero or more repetitions

-   `+` for at least one repetition

-   `?` for optional items (zero repetitions or one repetition)

More general quantification can be achieved by `{min,max}` (written
without a space), where `min` and `max` must be non-negative integers:

-   `"a{4,6}"` matches the strings `"aaaa"`, `"aaaaa"`, and `"aaaaaa"`

-   `"[01]{8}"` matches bitstrings of length 8 (byte representations)

-   `"0{2,}"` matches sequences of at least 2 zeroes
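The quantifiers above can be checked interactively. The following is a small sketch using `re.fullmatch` and `re.search` from the built-in `re` module; the test strings are made up for illustration:

```python
import re

# "?" makes the preceding item optional
print(bool(re.fullmatch("colou?r", "color")))    # True
print(bool(re.fullmatch("colou?r", "colour")))   # True

# {4,6}: between four and six repetitions of "a"
print([bool(re.fullmatch("a{4,6}", "a" * n)) for n in range(3, 8)])
# [False, True, True, True, False]

# {8}: exactly eight characters from the set [01]
print(bool(re.fullmatch("[01]{8}", "01100001")))  # True
print(bool(re.fullmatch("[01]{8}", "0110")))      # False

# {2,}: two or more zeroes; search() returns the first such run
print(re.search("0{2,}", "1001000").group())      # 00
```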

### Regular Expressions: The Wildcard

The **wildcard symbol** `.` (the dot) matches any character except the
new-line character, e.g.

-   `"h.t"` matches `hat`, `hot`, and `hit`, but not `heat`

-   `".a.a.a"` matches `banana` and `papaya`, but not `kaaba`

-   `"9.00"` matches `9a00`, `9100`, `9y00`, and `9c00`, not only
    `9.00`\
    (you need to **escape** the dot for that: `"9\.00"`)

-   `" .{3} "` matches any three characters between two spaces (e.g. a three-letter word)

Special symbols match the beginning and the end of the line:

-   `"^"` matches the beginning of the line

-   `"$"` matches the end of the line
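A short sketch of the wildcard, escaping, and the line anchors, again assuming the built-in `re` module. The `r` prefix creates a raw string literal, so the backslash reaches the regex engine unchanged:

```python
import re

# the unescaped dot matches any character, so "9.00" is too permissive
print(bool(re.fullmatch("9.00", "9a00")))     # True
# escaping the dot restricts the match to a literal "."
print(bool(re.fullmatch(r"9\.00", "9a00")))   # False
print(bool(re.fullmatch(r"9\.00", "9.00")))   # True

# "^" and "$" anchor the pattern to the start and end of the line
print(bool(re.search("^h.t$", "hot")))        # True
print(bool(re.search("^h.t$", "a hot day")))  # False
print(bool(re.search("h.t", "a hot day")))    # True
```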

### Regular Expressions: Character Sets and Ranges

Brackets `[ ]` define **character sets** matching a single character,
and can be **negated** using a caret (`^`) after the opening bracket:

-   `[aeiou]` matches one (Latin) vowel

-   `[^aeiou]` matches any single character that is not a lowercase (Latin) vowel

Some character sets can conveniently be defined using **character
ranges**:

-   `[A-Z]` is the same as `[ABCDEFGHIJKLMNOPQRSTUVWXYZ]`

-   `[0-9]` is the same as `[0123456789]`

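Several escaped characters serve as convenient shorthands (ASCII
equivalents shown; on Python 3 strings they also match the corresponding
non-ASCII characters unless the `re.ASCII` flag is set):

-   `\d` for digits (= `[0-9]`)

-   `\w` for word characters (= `[a-zA-Z0-9_]`)

-   `\s` for whitespace (= `[ \t\n\r\f\v]`)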
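As a sketch of how sets, ranges, and shorthands combine in practice, the following uses `re.findall` and `re.split` on a made-up string (the variable name `text` and its content are illustrative only):

```python
import re

text = "Room 42, Building B_7\topen 9-17"

# runs of digits
print(re.findall(r"\d+", text))            # ['42', '7', '9', '17']
# an uppercase letter followed by any number of word characters
print(re.findall(r"[A-Z]\w*", text))       # ['Room', 'Building', 'B_7']
# characters that are neither lowercase vowels nor whitespace
print(re.findall(r"[^aeiou\s]", "a tea"))  # ['t']
# split on runs of whitespace (spaces, tabs, newlines)
print(re.split(r"\s+", "a  b\tc\nd"))      # ['a', 'b', 'c', 'd']
```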

### Regular Expressions: Grouping

The **grouping metacharacters** `( )` serve to

-   apply repetition operators to a sequence of literal characters

-   make expressions easier to read

-   define groups for use in matching and replacing

Examples:

-   `"(abc)+"` matches e.g. `"abc"` and `"abcabcabc"`

-   `"(in)?dependent"` matches `"independent"` and `"dependent"`

### Regular Expressions: Referencing Groups

-   a group can be *referenced* later in the same string
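-   "`\\1`" (the regex `\1`, with the backslash doubled inside an ordinary Python string) matches *exactly the same string* that matched the first group
-   "`\\2`" matches the same string as the second group, etc. (groups are numbered from left to right)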
-   "`A (rose|tulip) is a \\1 is a \\1`":
    - "A rose is a rose is a rose" ✔
    - "A tulip is a tulip is a tulip" ✔
    - "A rose is a rose is a tulip" ❌
    - "A tulip is a rose is a rose" ❌
-   "`(.).*\\1`" matches any string where the first and the last character are identical and non-overlapping:
    -   "aa" ✔
    -   "axyzdefa" ✔
    -   "axyzdefb" ❌
    -   "a" ❌
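The examples above can be reproduced with the built-in `re` module. This sketch writes the patterns as raw strings, so a single backslash suffices. The `doubled` pattern for repeated words is an additional illustration, not part of the original examples:

```python
import re

# \1 must repeat exactly what group 1 matched, not merely the same pattern
flower = re.compile(r"A (rose|tulip) is a \1 is a \1")
print(bool(flower.fullmatch("A rose is a rose is a rose")))    # True
print(bool(flower.fullmatch("A rose is a rose is a tulip")))   # False

# first and last character identical
same_ends = re.compile(r"(.).*\1")
print(bool(same_ends.fullmatch("axyzdefa")))                   # True
print(bool(same_ends.fullmatch("a")))                          # False

# find an immediately doubled word such as "the the"
doubled = re.compile(r"\b(\w+) \1\b")
print(doubled.search("this is the the typo").group(1))         # the
```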

### Regular Expressions: Alternation

The **alternation metacharacter** `|` matches either the previous or the
next expression:

-   `"apple|orange"` matches `"apple"` and `"orange"`

-   Q: what does `"apple(juice|sauce)"` match?

-   Q: what does `"w(ei|ie)rd"` match?

Multiple alternatives can be used as well:

-   `"apple|orange|banana"`

-   `"(AA|BB|CC){6}"` matches e.g. `"AABBAACCAABB"`
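Alternation has low precedence, which is why grouping matters. The sketch below illustrates this with the made-up example words `gray` and `grey`, using the built-in `re` module:

```python
import re

# without parentheses, "|" splits the whole pattern into alternatives
print(bool(re.fullmatch("gray|grey", "grey")))    # True
# parentheses limit the alternation to part of the pattern
print(bool(re.fullmatch("gr(a|e)y", "grey")))     # True
print(bool(re.fullmatch("gr(a|e)y", "graey")))    # False
# a common pitfall: this means "gra" or "ey", not "gray" or "grey"
print(bool(re.fullmatch("gra|ey", "grey")))       # False
# a quantifier applies to the whole group of alternatives
print(bool(re.fullmatch("(AA|BB|CC){6}", "AABBAACCAABB")))  # True
```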

### Regular Expressions: The `re` module

Basic usage of the built-in `re` module:

- import the module to make the namespace available:

In [None]:
import re

- compile your regular expression string into a **regular expression
object**, which can be reused to match many strings against the regex
efficiently

In [None]:
matcher = re.compile("(.)([aeiou]{2}n)")

- use the `match()` method to test whether the regex matches at the beginning of the string (use `fullmatch()` to test the entire string):

In [None]:
matcher.match("moon")

In [None]:
matcher.match("I have been to the moon.")
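Note that `match()` only anchors the pattern at the start of the string and tolerates trailing text. A small sketch contrasting it with `fullmatch()`, using the same pattern and some made-up test strings:

```python
import re

matcher = re.compile("(.)([aeiou]{2}n)")

# match() anchors at the start but allows trailing text
print(matcher.match("moonlight") is not None)      # True
# fullmatch() requires the whole string to match
print(matcher.fullmatch("moonlight") is not None)  # False
print(matcher.fullmatch("moon") is not None)       # True
# match() fails when the pattern does not start at position 0
print(matcher.match("the moon") is not None)       # False
```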

### Regular Expressions: The `re` module

-   `search()` looks for matching substrings instead:

In [None]:
matcher.search("I have been to the moon.")

- using a match result object:

In [None]:
match = matcher.search("I have been to the moon.")

In [None]:
match.start(), match.end(), match.groups()

In [None]:
matcher.search("I have been to the moon.", 8).groups()

- `findall()` lists the groups of every non-overlapping match:

In [None]:
matcher.findall("I have been to the moon.")

### Regular Expressions: The `re` module

-   `sub(repl, string)` replaces each matching substring in
    `string` with `repl`

-   `repl` can contain references to groups

In [None]:
matcher = re.compile("[0-9]")

In [None]:
matcher.sub("?", "UFKc17X")

In [None]:
matcher = re.compile("(.+)")
matcher.sub("A \\1 is a \\1 is a \\1", "rose")

In [None]:
matcher = re.compile("(.+)")
matcher.sub("A \\1 is a \\1 is a \\1", "tulip")
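Group references in `repl` can also reorder material. As a small sketch (the patterns and the variable names `name` and `digits` are illustrative), the following swaps two groups and then uses a function as `repl`, which `sub()` also accepts:

```python
import re

# swap "Last, First" into "First Last" using two group references
name = re.compile(r"(\w+), (\w+)")
print(name.sub(r"\2 \1", "Dellert, Johannes"))   # Johannes Dellert

# repl may also be a function that receives each match object
digits = re.compile(r"\d+")
print(digits.sub(lambda m: str(int(m.group()) * 2), "3 apples, 12 pears"))
# 6 apples, 24 pears
```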

Natural Language Toolkit (NLTK)
===============================

### NLTK

The rest of the session covers the **Natural Language Toolkit (NLTK)**:

-   example of a good Python package

-   includes implementations of many common algorithms

-   sample of the entire Natural Language Processing (NLP) toolchain

-   solves many common tasks with satisfactory quality (for English)

-   very interesting for linguists interested in computing

-   good documentation (an entire book)

### NLTK: Installation

Installing NLTK and the relevant data:

-   in a terminal, execute the following command:

  ``` {style="console"}
     $ conda install nltk
  ```

- after installation, fire up a Python console (e.g. inside PyCharm)

- run the following:

In [None]:
import nltk
nltk.download()

-   in the window that appears, double-click on `book` and `popular` to
    download and install the relevant packages and data

-   wait until everything is installed, and close the window

### NLTK: First Steps

First steps in getting to know NLTK:

-   go to <https://www.nltk.org/>

-   run the examples listed under\
    "Some simple things you can do with NLTK":

    -   tokenizing and tagging

    -   named entity recognition

    -   exploring a treebank

- Tokenize and tag some text

In [None]:
sentence = """At eight o'clock on Thursday morning Arthur didn't feel very good."""

In [None]:
tokens = nltk.word_tokenize(sentence)
tokens

In [None]:
tagged = nltk.pos_tag(tokens)
tagged[0:6]

- Identify named entities:

In [None]:
import numpy
entities = nltk.chunk.ne_chunk(tagged)
print(entities)

- Display a parse tree:

In [None]:
from nltk.corpus import treebank
t = treebank.parsed_sents('wsj_0001.mrg')[0]
t

### NLTK: Documentation

How to find out more:

-   open the book at <http://www.nltk.org/book/>

-   start reading the chapter you want to learn about

-   make sure to try out the examples\
    (interactivity helps you to understand things better)

-   take a look at the exercises and try your hand at the ones which
    involve skills you might need

-   BTW: the book also doubles as an introduction to Python; consider
    working through the examples and exercises if you want to brush up
    on programming later on!


### NLTK: Accessing Text Corpora

-   take a look at Chapter 2 of the book
    (<http://www.nltk.org/book/ch02.html>)

-   browse through it until you find something interesting

-   run the examples using an interactive console

-   play around with the objects, fiddle with arguments,\
    explore the possibilities!

-   suggestions:

    -   Inaugural Address Corpus

    -   Corpora in Other Languages

    -   WordNet



In [None]:
nltk.corpus.gutenberg.fileids()

In [None]:
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
len(emma)

In [None]:
from nltk.corpus import gutenberg
gutenberg.fileids()

In [None]:
emma = gutenberg.words('austen-emma.txt')

In [None]:
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)

In [None]:
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
macbeth_sentences

In [None]:
macbeth_sentences[1116]

In [None]:
longest_len = max(len(s) for s in macbeth_sentences)
[s for s in macbeth_sentences if len(s) == longest_len]

In [None]:
from nltk.corpus import brown
news_text = brown.words(categories='news')
fdist = nltk.FreqDist(w.lower() for w in news_text)
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', fdist[m], end=' ')

In [None]:
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

In [None]:
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)

In [None]:
from nltk.corpus import inaugural
inaugural.fileids()

In [None]:
[fileid[:4] for fileid in inaugural.fileids()]

In [None]:
import matplotlib
matplotlib.rcParams['figure.figsize'] = [15, 10]

cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd.plot()

### NLTK: Part-of-speech Tagging

-   take a look at Chapter 5 of the book
    (<http://www.nltk.org/book/ch05.html>)

-   go through the introductory part (5.1)

-   find the most common verbs in news text (5.2.5)

-   compare the three taggers that come with NLTK (5.4)




### NLTK: Parsing

-   take a look at Chapter 8 of the book
    (<http://www.nltk.org/book/ch08.html>)

-   run the example of an ambiguous sentence (8.1.2)

-   load the toy CFG in Section 8.3.1

-   expand the toy CFG by some additional words and structures

-   start working through Section 8.5 on dependency grammar

# Classes

### Defining Classes

- a **class** is a user-defined datatype

- you can define which methods instances of that class have

- naming convention: start with a capital letter\
  (`City`, `Course`, `Grammar`, `Language`, `Movie`, `Student`,
  `Word`)

- template for defining a class `YourClass`:

  ``` {language="python"}
  class YourClass:
    variable1 = initial_val1
    variable2 = initial_val2
    
    def method1(self):
      statement1
    
    def method2(self):
      statement2
    
  ```




### Instance Variables

Data specific to each instance can be stored in **instance variables**:

- inside class definitions, you have access to the variable `self`

- `self` represents one object of the datatype you are defining

- variable declarations in the class body are **class variables**

- using `self.variable`, you model data assigned to a specific
  instance

- inside the methods of a class, `self.variable` works like any variable:

  ``` {language="python"}
    self.name = "Uga Blamp"
    self.first_name = self.name.split(" ")[0]
    print(self.first_name)
  ```
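The difference between class variables and instance variables can be sketched as follows. The class `Word` and its methods `set_form` and `describe` are hypothetical names chosen for illustration:

```python
class Word:
    # class variable: one value shared by all instances
    language = "English"

    def set_form(self, form):
        # instance variable: stored on this particular object
        self.form = form

    def describe(self):
        return self.form + " (" + self.language + ")"

rose = Word()
rose.set_form("rose")
tulip = Word()
tulip.set_form("tulip")

print(rose.describe())    # rose (English)
print(tulip.describe())   # tulip (English)

# reassigning the class variable is visible through every instance
Word.language = "en"
print(rose.describe())    # rose (en)
print(tulip.describe())   # tulip (en)
```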

### Constructors

The **constructor** is a special method of each class:

- the default constructor does not create any instance variables

- by defining a constructor `__init__(self)`, you can initialize
  variables for the instance you are creating:

  ``` {language="python"}
  class YourClass:
    def __init__(self):
      self.property1 = "initial_value"
      self.data_store = dict()
  ```

- constructors can have additional (also named) arguments:

  ``` {language="python"}
  class Language:
    def __init__(self,name,family="Unknown"):
      self.name = name
      self.family = family
  ```

In [None]:
class Language:
    def __init__(self, name, family="Unknown"):
        self.name = name
        self.family = family

### Instance Objects

Using a class definition to create instance objects:

-   instances are created by calling the class name like a function

-   instances you created can be assigned to variables like any other
    object, and calling `type()` on them will return your class object

-   the arguments of the constructor define in which ways you can create
    instances of your class (= objects of your type):

In [None]:
eus = Language("Basque")

In [None]:
eus.family

In [None]:
hun = Language("Hungarian",family="Uralic")

In [None]:
hun.family

In [None]:
kbd = Language("Kabardian","Northwest Caucasian")

In [None]:
kbd.family
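A brief sketch of what can be done with instances once they exist, repeating the `Language` class so that the example is self-contained. The reassigned value `"Isolate"` is only an illustration:

```python
class Language:
    def __init__(self, name, family="Unknown"):
        self.name = name
        self.family = family

eus = Language("Basque")
hun = Language("Hungarian", family="Uralic")

# type() returns the class; isinstance() tests membership
print(type(eus) is Language)        # True
print(isinstance(hun, Language))    # True
# each instance carries its own data
print(eus.family, hun.family)       # Unknown Uralic
# instance variables can be reassigned after creation
eus.family = "Isolate"
print(eus.family, hun.family)       # Isolate Uralic
```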

### Defining Methods

Methods are functions declared inside a class body:

-   they must have `self` as the first argument:

In [None]:
class Language:
    def __init__(self,name,family="Unknown"):
        self.name = name
        self.family = family
        
    def print_information(self):
        print(self.name + " belongs to " + self.family)

- in method calls, the object before the dot is assigned to `self`:

In [None]:
hun = Language("Hungarian",family="Uralic")
hun.print_information()

In [None]:
eus = Language("Basque")
eus.print_information()
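Methods can take further arguments after `self` and can return values. In the following sketch, `is_related_to` is a hypothetical method added for illustration. The last call shows explicitly how the object before the dot ends up as `self`:

```python
class Language:
    def __init__(self, name, family="Unknown"):
        self.name = name
        self.family = family

    def is_related_to(self, other):
        """Return True if both languages share the same known family."""
        return self.family != "Unknown" and self.family == other.family

hun = Language("Hungarian", family="Uralic")
fin = Language("Finnish", family="Uralic")
eus = Language("Basque")

print(hun.is_related_to(fin))             # True
print(hun.is_related_to(eus))             # False
# equivalent call with self passed explicitly
print(Language.is_related_to(hun, fin))   # True
```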

### A Glimpse at Inheritance

Some quick info on a more advanced subject:

-   `class A(B):` declares A as a **subclass** (subtype) of B

-   all functionality (class variables, methods) is **inherited**, but
    can be overridden by redefinition or reassignment in the subclass

-   these concepts mainly become relevant in larger software projects,
    and will not play a major role in this course

-   quite frequently, you see the idiom `class A(object)` in
    introductory materials, though (in Python 3, this is equivalent to
    `class A:`)
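A minimal sketch of inheritance and overriding. The subclass `Isolate` and the method `describe` are hypothetical names introduced only for this example:

```python
class Language:
    def __init__(self, name, family="Unknown"):
        self.name = name
        self.family = family

    def describe(self):
        return self.name + " belongs to " + self.family

class Isolate(Language):
    """Hypothetical subclass for languages without known relatives."""

    # the constructor is inherited; describe() is overridden
    def describe(self):
        return self.name + " is a language isolate"

hun = Language("Hungarian", "Uralic")
eus = Isolate("Basque")

print(hun.describe())              # Hungarian belongs to Uralic
print(eus.describe())              # Basque is a language isolate
print(isinstance(eus, Language))   # True
```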

Random Sampling
===============

### Random Sampling

Often, we want to use random data in our programs:

-   taking only a random subset of the input data for tractability

-   creating realistic-looking dummy data in order to test a program

-   simulating processes like language change, or a walk through the Web

-   re-sampling from a dataset to evaluate the stability of results on
    varying input (bootstrapping)

For all of this, we have the `random` library!

### The `random` library

Basic functionality of the `random` library:

-   `random.randint(a,b)` uniformly samples an integer
    $a \leq N \leq b$:

In [None]:
import random
[random.randint(1,6) for i in range(10)]

- `random.random()` samples from the uniform distribution over the
interval $[0.0,1.0)$:

In [None]:
[random.random() for i in range(3)]

- `random.gauss(mu,sigma)` samples from a Gaussian with mean `mu` and
standard deviation `sigma`:

In [None]:
[random.gauss(2,0.5) for i in range(3)]

### The `random` library

Sequence sampling using the `random` library:

-   `random.choice(seq)` samples from a uniform distribution over a
    sequence `seq` of possible values:

In [None]:
[random.choice(["a","b","c"]) for i in range(10)]

- `random.choices(seq,weights)` samples from a distribution over `seq`
given by the `weights` list; it returns a list of `k` samples (default `k=1`):

In [None]:
random.choices(["a","b"],[0.7,0.3],k=8)

- `random.shuffle()` re-orders the sequence in place in a random way:

In [None]:
letters = ["r","a","n","d","o","m"]  # avoid the name `list`, which would shadow the built-in type

In [None]:
for i in range(3):
    random.shuffle(letters)
    print("".join(letters))

There is much more to cover in a more advanced course: random seeds,
other distributions, ...
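As a brief taste of random seeds: re-seeding the generator with the same value reproduces the same sequence of "random" numbers, which helps with debugging. The seed value `42` below is arbitrary:

```python
import random

random.seed(42)
first = [random.randint(1, 6) for i in range(5)]

# re-seeding with the same value restarts the same sequence
random.seed(42)
second = [random.randint(1, 6) for i in range(5)]

print(first == second)   # True
```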

### Example: Generating random names

Assume that we have the following frequency statistics:

-   onsets: 25% k, 25% p, 15% l, 10% m, 10% n, 10% "", 5% w

-   vowels: 40% a, 20% i, 20% u, 10% e, 10% o

Here is a function which samples names from these distributions:

In [None]:
from random import choices

onset_options = ["k","p","l","m","n","","w"]
onset_weights = [25, 25, 15, 10, 10, 10, 5]
vowel_options = ["a","i","u","e","o"]
vowel_weights = [40, 20, 20, 10, 10]
 
def random_name(num_syllables):
    name = ""
    for i in range(num_syllables):
        name += choices(onset_options, onset_weights)[0]
        name += choices(vowel_options, vowel_weights)[0]
    return name.title()

In [None]:
random_name(3)
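One way to check that `choices` respects the weights is to draw a large sample and compare the observed percentages with the intended ones. This is a sketch: the sample size of 10000 is an arbitrary choice, and `Counter` from the standard `collections` module does the counting:

```python
from collections import Counter
from random import choices

onset_options = ["k", "p", "l", "m", "n", "", "w"]
onset_weights = [25, 25, 15, 10, 10, 10, 5]

# draw many onsets at once using the k argument
sample = choices(onset_options, onset_weights, k=10000)
counts = Counter(sample)

# print intended vs. observed percentage for each onset
for onset, weight in zip(onset_options, onset_weights):
    observed = round(100 * counts[onset] / len(sample), 1)
    print(repr(onset), weight, observed)
```

The observed values vary slightly from run to run, but should stay close to the intended percentages.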

### Recursion

A central concept of programming: **recursion**

-   informally: a definition or function which refers to itself

-   can be used whenever a problem (e.g. finding all sources of a river)
    can be reduced to subproblems of the same type\
    (e.g. finding the sources of a tributary)

-   all loops can be implemented using recursion!

-   every recursive function can be implemented using iteration!

-   BUT: there are many problems for which either iteration or recursion
    is more efficient, or easier to implement
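To make the equivalence concrete, here is a sketch of the same computation written once as a loop and once recursively. The function names `total_iterative` and `total_recursive` are illustrative:

```python
def total_iterative(numbers):
    """Sum a list of numbers with a loop."""
    result = 0
    for n in numbers:
        result += n
    return result

def total_recursive(numbers):
    """Sum a list of numbers by recursion on the rest of the list."""
    if not numbers:
        # an empty list sums to 0, so no further call is needed
        return 0
    # first element plus the sum of the remaining elements
    return numbers[0] + total_recursive(numbers[1:])

print(total_iterative([1, 2, 3]))   # 6
print(total_recursive([1, 2, 3]))   # 6
```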

### Base Case and Recursive Cases

In computing, recursive definitions need to consist of two parts:

-   the **base case** covers the simplest instances of a problem, and
    does not refer to the concept being defined (ensures termination)

-   the **recursive case** refers to the concept being defined when
    describing substructures (expansion to structures of arbitrary size)

There is a close relationship between recursion and mathematical
induction!

-   In induction, you prove that a theorem holds e.g. for $n = 1$ (the
    base case), and that it holds for $n = k + 1$ if it already holds
    for $n = k$ (recursive case).

-   If you want to prove that a recursive algorithm gives the correct
    result, you need induction as a proof technique!
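The factorial function is a standard illustration of the two parts. It is used here as a sketch, not as an example taken from the slides:

```python
def factorial(n):
    """Return n! for a non-negative integer n."""
    if n == 0:
        # base case: 0! is defined as 1, no further call is needed
        return 1
    # recursive case: reduce the problem for n to the problem for n - 1
    return n * factorial(n - 1)

print(factorial(5))                       # 120
print([factorial(n) for n in range(6)])   # [1, 1, 2, 6, 24, 120]
```

A correctness proof would follow the same shape: show the result for $n = 0$, then show that a correct result for $n = k$ yields a correct result for $n = k + 1$.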

### Recursive Definitions

Using recursion in definitions:

-   **recursive definition**: a definition which uses the term it
    defines

-   Example: defining (a subset of) boolean expressions in Python

    -   the expressions `True` and `False` are boolean expressions (base
        case)

    -   an expression of the form `a == b` is a boolean expression (base
        case)

    -   two boolean expressions conjoined by `and` form a boolean
        expression (recursive case)

    -   two boolean expressions conjoined by `or` form a boolean
        expression (recursive case)

    -   a boolean expression preceded by `not` is a boolean expression\
        (recursive case)

-   you have probably seen recursive definitions in logic!
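The recursive definition above translates almost directly into a recursive check. The sketch below assumes a made-up representation in which compound expressions are tuples such as `("and", e1, e2)`, `("or", e1, e2)`, `("not", e)` and `("==", a, b)`. The function name `is_boolean_expression` is likewise hypothetical:

```python
def is_boolean_expression(expr):
    # base case: the literals True and False
    if expr is True or expr is False:
        return True
    if not isinstance(expr, tuple) or not expr:
        return False
    operator = expr[0]
    # base case: a comparison a == b (a and b are not inspected further)
    if operator == "==":
        return len(expr) == 3
    # recursive cases: the parts must themselves be boolean expressions
    if operator in ("and", "or"):
        return len(expr) == 3 and all(is_boolean_expression(e) for e in expr[1:])
    if operator == "not":
        return len(expr) == 2 and is_boolean_expression(expr[1])
    return False

print(is_boolean_expression(("and", True, ("not", ("==", 1, 2)))))  # True
print(is_boolean_expression(("and", True, 5)))                       # False
```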

### Recursive Functions

Using recursion in function definitions:

-   **recursive function**: a function which calls itself

-   if the function calls itself on every input, we get **infinite
    recursion**

-   in all useful recursive functions, each nested call differs in its
    arguments (e.g. execution on subproblems)

-   for recursive functions to terminate, we need base cases!
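A sketch of what happens with and without a base case. In Python, recursion without a base case does not run forever, but is stopped by a `RecursionError` once the maximum call depth is exceeded. The function names and the flag `hit_limit` are illustrative:

```python
def countdown_broken(n):
    # no base case: the function calls itself on every input
    return countdown_broken(n - 1)

def countdown(n):
    if n == 0:
        # base case stops the recursion
        return ["liftoff"]
    # each nested call works on a smaller argument
    return [n] + countdown(n - 1)

hit_limit = False
try:
    countdown_broken(3)
except RecursionError:
    hit_limit = True

print(hit_limit)      # True
print(countdown(3))   # [3, 2, 1, 'liftoff']
```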

### Recursion vs. Iteration

-   in principle, recursion and iteration are equally powerful\
    (one can be used to emulate the other)

-   however, there are many definitions and algorithms which are much
    easier to write using recursion

-   this is especially the case for processing data structures which
    contain substructures of varying size (i.e. data that is not tabular
    in shape)

-   Examples:

    -   processing syntax trees for programming languages (in a
        compiler) and natural languages (in a parser)

    -   processing more general graph structures like networks

    -   sorting and searching in structures that are more complex than
        lists or dictionaries (e.g. 3-D models)

-   We are going to use navigation through trees as our main example!
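As a first impression of tree navigation, the following sketch represents a toy syntax tree as nested lists of the form `[label, child1, child2, ...]` with strings as leaves. This representation, the example tree, and the function names `leaves` and `depth` are assumptions made for illustration:

```python
# a toy syntax tree: [label, child1, child2, ...]; leaves are strings
tree = ["S",
        ["NP", "Arthur"],
        ["VP", ["V", "felt"],
               ["ADJP", ["ADV", "very"], ["ADJ", "good"]]]]

def leaves(node):
    """Collect the words at the leaves from left to right."""
    if isinstance(node, str):
        # base case: a leaf is a single word
        return [node]
    words = []
    for child in node[1:]:
        # recursive case: visit every subtree
        words += leaves(child)
    return words

def depth(node):
    """Number of labelled nodes on the longest path from here to a leaf."""
    if isinstance(node, str):
        return 0
    return 1 + max(depth(child) for child in node[1:])

print(leaves(tree))   # ['Arthur', 'felt', 'very', 'good']
print(depth(tree))    # 4
```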