# Regular expressions
(5 points)

In this exercise, you should define 4 regular expression patterns, one per task. The patterns will be checked by selecting substrings from example texts.

#### Example

The patterns should be stored in variables that are already defined for the individual tasks, so that the tests can access them easily. For example, imagine that we are asked to define a pattern in the variable `examplePattern` that matches exactly the string `abc`. We would define the variable as follows:
```java
String examplePattern = null;
examplePattern = "abc";
```
Note that the pattern is written as a Java string, i.e., it might be necessary to escape certain characters. For example, if our pattern should contain the regex escape `\n` for the newline character, the backslash has to be doubled in the Java source: `"\\n"`.
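
The escaping rule can be illustrated with a minimal sketch. The class name `EscapeSketch` and the decimal-number pattern are hypothetical and only serve as an illustration; they are not part of the exercise. The sketch shows that the Java string literal contains doubled backslashes, while the regex engine sees single ones.

```java
import java.util.regex.Pattern;

class EscapeSketch {
    // Hypothetical pattern: a decimal number such as "3.14".
    // Every regex backslash is written twice in the Java source.
    static final String DECIMAL = "\\d+\\.\\d+";

    static boolean isDecimal(String s) {
        // Pattern.matches requires the whole input to match the pattern.
        return Pattern.matches(DECIMAL, s);
    }

    public static void main(String[] args) {
        System.out.println(DECIMAL);           // prints \d+\.\d+
        System.out.println(isDecimal("3.14")); // prints true
        System.out.println(isDecimal("3x14")); // prints false, "\\." only matches a literal dot
    }
}
```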

#### Hints

- Unfortunately, not all regular expression engines offer the same character classes. Therefore, it is a good idea to read about the engine you are using and its capabilities. Our tests rely on the standard [Pattern class](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/regex/Pattern.html) of Java.
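
As a small illustration of engine-specific behavior, the following sketch compares the predefined class `\w` in Java with and without the `Pattern.UNICODE_CHARACTER_CLASS` flag. The class name `EngineSketch` is hypothetical. The test helper in this notebook compiles patterns without flags, so the default behavior is the relevant one for the tasks.

```java
import java.util.regex.Pattern;

class EngineSketch {
    // Default Java behavior: \w is limited to [a-zA-Z_0-9].
    static boolean matchesDefault(String s) {
        return Pattern.compile("\\w").matcher(s).matches();
    }

    // With UNICODE_CHARACTER_CLASS, \w also covers non-ASCII letters.
    static boolean matchesUnicode(String s) {
        return Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS).matcher(s).matches();
    }

    public static void main(String[] args) {
        String umlaut = "\u00e4"; // the letter "a" with diaeresis
        System.out.println(matchesDefault("a"));    // prints true
        System.out.println(matchesDefault(umlaut)); // prints false
        System.out.println(matchesUnicode(umlaut)); // prints true
    }
}
```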

#### Notes

- You are free to use a different IDE to develop your solution. However, you have to copy the solution into this notebook to submit it.
- Do not add additional external libraries.
- Interface
  - You can use _[TAB]_ for autocompletion and _[SHIFT]_+_[TAB]_ for code inspection.
  - Use _Menu_ -> _View_ -> _Toggle Line Numbers_ for debugging.
  - Check _Menu_ -> _Help_ -> _Keyboard Shortcuts_.
- Known issues
  - All global variables will be set to void after an import.
  - Missing spaces around `%` (modulo) can cause unexpected errors, so please make sure that every `%` character is surrounded by spaces.
- Finish
  - Save your solution by clicking on the _disk icon_.
  - Make sure that all necessary imports are listed at the beginning of your cell.
  - Run a final check of your solution by
    - clicking on _restart the kernel, then re-run the whole notebook_ (the fast forward arrow in the tool bar)
    - waiting for the kernel to restart and execute all cells (all executable cells should have numbers in front of them instead of a `[*]`)
    - checking all executed cells for errors. If an exception is thrown, please check your code. Note that although the error might look cryptic, so far we have never encountered an exception that was not caused by the submitted code. A good way to check the code is to copy the solution into a new class in your favorite IDE and check
      - errors reported by the IDE
      - imports the IDE adds to your code which might be missing in your submission.
  - Finally, choose _Menu_ -> _File_ -> _Close and Halt_.
  - Do not forget to _Submit_ your solution in the _Assignments_ view.
  
## Task 1
(1 point)

Define a pattern for selecting all abbreviations. You can assume that an abbreviation will comprise only capital letters.

The example sentence for the example test is:
> The IEE shouldn't be confused with the IEEE. The latter was found in 1871 and merged with the IERE, IIE and IET.

Please note that the hidden tests may contain other abbreviations than the example sentence.
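
The effect of word boundaries (`\b`) and quantifiers on such a selection can be explored with a short sketch. The sentence, the class name `BoundarySketch` and the helper `findAll` are hypothetical. The task does not state whether a single capital letter such as the article "A" counts as an abbreviation; the last pattern assumes it does not, which may or may not agree with the hidden tests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundarySketch {
    // Hypothetical example sentence, not taken from the tests.
    static final String TEXT = "A NASA probe passed Mars and the ISS.";

    // Collects all substrings of text selected by the given regex.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Without boundaries, the capital "M" of "Mars" is selected as well.
        System.out.println(findAll("[A-Z]+", TEXT));           // [A, NASA, M, ISS]
        // \b requires the run of capitals to be a whole word.
        System.out.println(findAll("\\b[A-Z]+\\b", TEXT));     // [A, NASA, ISS]
        // Assumption: at least two letters are needed for an abbreviation.
        System.out.println(findAll("\\b[A-Z]{2,}\\b", TEXT));  // [NASA, ISS]
    }
}
```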

In [1]:
String pattern1 = "\\b[A-Z]+\\b";
// YOUR CODE HERE

#### Evaluation Task 1

- Run the following cell to test your implementation.
- You can ignore the cells containing the line `Ignore this cell`.

In [2]:
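%maven org.junit.jupiter:junit-jupiter-api:5.3.1
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.junit.jupiter.api.Assertions;
import org.opentest4j.AssertionFailedError;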

/**
 * Method used for checking the given pattern by applying it to the given text and 
 * checking whether the selected substrings have the expected positions.
 */
public static void checkPattern(String text, String pattern, int[][] expectedPositions) {
    System.out.println("Checking your pattern with the text \"" + text + "\".");
    try {
        Assertions.assertNotNull(pattern, "The given pattern is null.");   
        int errorCount = 0;
        int[][] positions = getMatchPositions(text, pattern);
        int pos1 = 0, pos2 = 0;
        while ((pos1 < positions.length) && (pos2 < expectedPositions.length)) {
            // start matches
            if (positions[pos1][0] == expectedPositions[pos2][0]) {
                if (positions[pos1][1] != expectedPositions[pos2][1]) {
                    System.err.println("Your pattern selected an unexpected part of the text: \""
                            + text.substring(positions[pos1][0], positions[pos1][1]) + "\".");
                    ++pos1;
                    ++errorCount;
                } else {
                    // everything is ok
                    ++pos1;
                    ++pos2;
                }
            } else {
                if (positions[pos1][0] < expectedPositions[pos2][0]) {
                    System.err.println("Your pattern selected an unexpected part of the text: \""
                            + text.substring(positions[pos1][0], positions[pos1][1]) + "\".");
                    ++pos1;
                    ++errorCount;
                } else {
                    System.err.println("Your pattern did not select the expected sub string: \""
                            + text.substring(expectedPositions[pos2][0], expectedPositions[pos2][1]) + "\".");
                    ++pos2;
                    ++errorCount;
                }
            }
        }
        while (pos1 < positions.length) {
            System.err.println("Your pattern selected an unexpected part of the text: \""
                    + text.substring(positions[pos1][0], positions[pos1][1]) + "\".");
            ++pos1;
            ++errorCount;
        }
        while (pos2 < expectedPositions.length) {
            System.err.println("Your pattern did not select the expected sub string: \""
                    + text.substring(expectedPositions[pos2][0], expectedPositions[pos2][1]) + "\".");
            ++pos2;
            ++errorCount;
        }
        Assertions.assertEquals(0, errorCount, "There were errors in your pattern.");
        System.out.println("Test(s) successfully completed.");
    } catch (AssertionFailedError e) {
        System.err.println(e);
        throw e;
    } catch (Throwable e) {
        System.err.println("Your solution caused an unexpected error:");
        throw e;
    }
}

/**
 * Returns the positions (start and end positions) of the parts of the given text matching the given pattern. 
 */
public static int[][] getMatchPositions(String text, String pattern) {
    Pattern p = Pattern.compile(pattern);
    Matcher matcher = p.matcher(text);
    List<int[]> results = new ArrayList<>();
    while (matcher.find()) {
        results.add(new int[] { matcher.start(), matcher.end() });
    }
    return results.toArray(new int[results.size()][]);
}

/*
 * Example test case for Task 1
 */
String text1 = "The IEE shouldn't be confused with the IEEE. The latter was found in 1871 and merged with the IERE, IIE and IET.";
int[][] expectedPositions1 = new int[5][2];
expectedPositions1[0][0] = text1.indexOf("IEE");
expectedPositions1[0][1] = expectedPositions1[0][0] + 3;
expectedPositions1[1][0] = text1.indexOf("IEEE");
expectedPositions1[1][1] = expectedPositions1[1][0] + 4;
expectedPositions1[2][0] = text1.indexOf("IERE");
expectedPositions1[2][1] = expectedPositions1[2][0] + 4;
expectedPositions1[3][0] = text1.indexOf("IIE");
expectedPositions1[3][1] = expectedPositions1[3][0] + 3;
expectedPositions1[4][0] = text1.indexOf("IET");
expectedPositions1[4][1] = expectedPositions1[4][0] + 3;

checkPattern(text1, pattern1, expectedPositions1);

Checking your pattern with the text "The IEE shouldn't be confused with the IEEE. The latter was found in 1871 and merged with the IERE, IIE and IET.".
Test(s) successfully completed.
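
The checker above compares pairs of start and end positions. As a minimal sketch of what these positions mean, the following example (with the hypothetical class `PositionSketch` and a made-up text) shows that `Matcher.start()` is inclusive and `Matcher.end()` is exclusive, so both values can be passed directly to `String.substring`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PositionSketch {
    // Returns {start, end} of the first match, or null if nothing matches.
    static int[] firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    public static void main(String[] args) {
        String text = "ask the IEEE"; // hypothetical text
        int[] pos = firstMatch("IEEE", text);
        System.out.println(pos[0] + ".." + pos[1]);          // prints 8..12
        System.out.println(text.substring(pos[0], pos[1]));  // prints IEEE
    }
}
```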


In [3]:
// Ignore this cell

## Task 2
(2 points)

Define a pattern for selecting all words that have `"the"` as a _true subset_, i.e., words that contain it as a substring but have at least one more character.

The example sentence for the test is:
> There was a theologian named Aristides the Athenian.

Please make sure that only the words are selected, without leading or trailing whitespace or punctuation characters.
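
The difference between "contains the substring" and "contains it plus at least one more character" can be sketched as follows. The sentence, the class name `ProperSubstringSketch` and the helper `findAll` are hypothetical, and the sketch only handles the lowercase `the`; it is one possible way to express the idea, not necessarily the intended solution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProperSubstringSketch {
    // Hypothetical example sentence, not taken from the tests.
    static final String TEXT = "Bathe the other cat.";

    // Collects all substrings of text selected by the given regex.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // \w* on both sides also accepts the bare word "the".
        System.out.println(findAll("\\w*the\\w*", TEXT));            // [Bathe, the, other]
        // Requiring \w+ on at least one side excludes the bare word.
        System.out.println(findAll("\\w*the\\w+|\\w+the\\w*", TEXT)); // [Bathe, other]
    }
}
```

Because `\w` does not match whitespace or punctuation, the trailing period of the sentence is never part of a selected word.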

In [4]:
String pattern2 = "(\\w*[tT]he\\w+|\\w+[tT]he\\w*)";
// YOUR CODE HERE

#### Evaluation Task 2

- Run the following cell to test your implementation.
- Make sure that you executed the test for Task 1 _before_ executing the tests here.
- You can ignore the cells containing the line `Ignore this cell`.

In [5]:
/*
 * Example test case for Task 2
 */
String text2 = "There was a theologian named Aristides the Athenian.";
int[][] expectedPositions2 = new int[3][2];
expectedPositions2[0][0] = text2.indexOf("There");
expectedPositions2[0][1] = expectedPositions2[0][0] + 5;
expectedPositions2[1][0] = text2.indexOf("theologian");
expectedPositions2[1][1] = expectedPositions2[1][0] + 10;
expectedPositions2[2][0] = text2.indexOf("Athenian");
expectedPositions2[2][1] = expectedPositions2[2][0] + 8;

checkPattern(text2, pattern2, expectedPositions2);

Checking your pattern with the text "There was a theologian named Aristides the Athenian.".
Test(s) successfully completed.


In [6]:
// Ignore this cell

In [7]:
// Ignore this cell

## Task 3
(1 point)

Define a pattern for selecting all words that have the substring `"the"` but not `"theo"`. Note that in contrast to the previous task, `"the"` does not have to be a true subset, i.e., the word `"the"` should be selected as well.

The example sentence for the test is:
> There are theologians discussing theories about the creation of the world.

Note that the tests do _not_ contain the special case of words that have both substrings; e.g., `"theotherapy"`, which contains `"theo"` as well as a second `"the"`, is excluded from the tests.

Please make sure that only the words are selected, without leading or trailing whitespace or punctuation characters.
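
One way to express "`the` not directly followed by `o`" is a negative lookahead, `(?!o)`, which Java's `Pattern` supports. The following is a minimal sketch under that assumption; the sentence, the class name `LookaheadSketch` and the helper `findAll` are hypothetical, and other formulations are possible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadSketch {
    // Hypothetical example sentence, not taken from the tests.
    static final String TEXT = "Theo read the theory thereof.";

    // Collects all substrings of text selected by the given regex.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // (?!o) succeeds only if the next character is not "o";
        // it does not consume any input itself.
        String regex = "\\b\\w*[tT]he(?!o)\\w*";
        System.out.println(findAll(regex, TEXT)); // [the, thereof]
    }
}
```

Note that "thereof" is selected although it contains an `o` later in the word, because the lookahead only inspects the character directly after `the`.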

In [8]:
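// Whole word containing "the"/"The" that is not directly followed by "o".
// \w keeps whitespace and punctuation out of the match; (?!o) is a negative lookahead.
String pattern3 = "\\b\\w*[tT]he(?!o)\\w*";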
// YOUR CODE HERE

#### Evaluation Task 3

- Run the following cell to test your implementation.
- Make sure that you executed the test for Task 1 _before_ executing the tests here.
- You can ignore the cells containing the line `Ignore this cell`.

In [9]:
/*
 * Example test case for Task 3
 */
String text3 = "There are theologians discussing theories about the creation of the world.";
int[][] expectedPositions3 = new int[3][2];
expectedPositions3[0][0] = text3.indexOf("There");
expectedPositions3[0][1] = expectedPositions3[0][0] + 5;
// We are searching for "the creation" to make sure that we are selecting the correct word.
// The position we are finally using is only marking "the"
expectedPositions3[1][0] = text3.indexOf("the creation");
expectedPositions3[1][1] = expectedPositions3[1][0] + 3;
// We are searching for "the world" to make sure that we are selecting the correct word.
// The position we are finally using is only marking "the"
expectedPositions3[2][0] = text3.indexOf("the world");
expectedPositions3[2][1] = expectedPositions3[2][0] + 3;

checkPattern(text3, pattern3, expectedPositions3);

Checking your pattern with the text "There are theologians discussing theories about the creation of the world.".
Test(s) successfully completed.


In [10]:
// Ignore this cell

## Task 4
(1 point)

Define a pattern for selecting all words that contain the character `a` at least three times (occurrences of its uppercase variant `A` count as well).

The example sentence for the test is:
> Anastasia would like to have a banana split.
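
A counted repetition is one compact way to require a minimum number of occurrences. The sketch below uses a non-capturing group with the quantifier `{3}`; the sentence, the class name `CountSketch` and the helper `findAll` are hypothetical, and this is only one of several possible formulations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CountSketch {
    // Hypothetical example sentence, not taken from the tests.
    static final String TEXT = "Canada has a vast savanna.";

    // Collects all substrings of text selected by the given regex.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // (?:\w*[aA]){3} means: three times "any word characters, then a or A".
        // The trailing \w* extends the match to the end of the word.
        String regex = "(?:\\w*[aA]){3}\\w*";
        System.out.println(findAll(regex, TEXT)); // [Canada, savanna]
    }
}
```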

In [11]:
String pattern4 = "\\w*[aA]\\w*[aA]\\w*[aA]\\w*";
// YOUR CODE HERE

#### Evaluation Task 4

- Run the following cell to test your implementation.
- Make sure that you executed the test for Task 1 _before_ executing the tests here.
- You can ignore the cells containing the line `Ignore this cell`.

In [12]:
/*
 * Example test case for Task 4
 */
String text4 = "Anastasia would like to have a banana split.";
int[][] expectedPositions4 = new int[2][2];
expectedPositions4[0][0] = text4.indexOf("Anastasia");
expectedPositions4[0][1] = expectedPositions4[0][0] + 9;
expectedPositions4[1][0] = text4.indexOf("banana");
expectedPositions4[1][1] = expectedPositions4[1][0] + 6;

checkPattern(text4, pattern4, expectedPositions4);

Checking your pattern with the text "Anastasia would like to have a banana split.".
Test(s) successfully completed.


In [13]:
// Ignore this cell