# Exercise 2 - Bigram Matrix
(6 points)

Finish the implementation of the given `BigramMatrix` class. Its constructor takes a list of tokens (the same tokens that your `preprocess` method from the previous exercise should produce) and builds a count matrix of bi-grams.

Let $c(w_i,w_j)$ be the number of bigrams $(w_i, w_j)$, i.e., the number of times the word $w_i$ immediately precedes the word $w_j$. Your bigram matrix should contain a matrix of counts where `counts[i][j]` holds the value of $c(w_i,w_j)$.

For the input tokens 
```
[<s>, she, said, i, know, that, she, likes, english, food, </s>]
```
the matrix looks like the following table (rows are $w_i$ and columns are $w_j$)

| $w_i$ \ $w_j$ | `<s>` | `</s>` | `english` | `food` | `i` | `know` | `likes` | `said` | `she` | `that` |
|---|---|---|---|---|---|---|---|---|---|---|
| `<s>` | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| `</s>` | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| `english` | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| `food` | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| `i` | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| `know` | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| `likes` | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| `said` | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| `she` | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| `that` | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
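
The counting step behind this table can be sketched as follows. This is only an illustration: the class `BigramCountSketch` and its methods are hypothetical names that are not part of the exercise template, and the row and column order used here (first occurrence in the token list) is an arbitrary choice.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the exercise template.
final class BigramCountSketch {
    // Maps each distinct token to a row/column index in first-occurrence order.
    static Map<String, Integer> buildIndex(List<String> tokens) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String token : tokens) {
            index.putIfAbsent(token, index.size());
        }
        return index;
    }

    // counts[i][j] = number of times token i is immediately followed by token j.
    static int[][] countBigrams(List<String> tokens, Map<String, Integer> index) {
        int[][] counts = new int[index.size()][index.size()];
        for (int k = 0; k + 1 < tokens.size(); k++) {
            counts[index.get(tokens.get(k))][index.get(tokens.get(k + 1))]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("<s>", "she", "said", "i", "know", "that",
                "she", "likes", "english", "food", "</s>");
        Map<String, Integer> index = buildIndex(tokens);
        int[][] counts = countBigrams(tokens, index);
        System.out.println(counts[index.get("she")][index.get("said")]);   // 1
        System.out.println(counts[index.get("likes")][index.get("food")]); // 0
    }
}
```

Using a map for the index avoids a linear search per token, but any consistent token-to-index mapping would produce the same counts.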

* `create(List<String> tokens)` should initialize the matrix.
* `normalize()` should normalize the counts of the matrix (note that applying Laplace smoothing should still be possible, even after normalizing).
* `performLaplaceSmoothing()` should perform Laplace smoothing on the counts in the matrix (and on the normalized values of the matrix if it has been normalized before).
* `getCount(String word1, String word2)` should return the (possibly smoothed) count for the bi-gram `(word1, word2)`.
* `getNormalizedCount(String word1, String word2)` should return the (possibly smoothed) normalized count for the bi-gram `(word1, word2)` (i.e., the probability $P(word2|word1)$).
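
The exercise text does not spell out the formulas, so the following is a sketch under the usual definitions of row normalization and add-one smoothing. The symbol $V$ (the number of distinct tokens) is introduced here for illustration only.

$$P(w_j \mid w_i) = \frac{c(w_i,w_j)}{\sum_k c(w_i,w_k)}, \qquad P_{\text{Laplace}}(w_j \mid w_i) = \frac{c(w_i,w_j) + 1}{\sum_k c(w_i,w_k) + V}$$

For the `she` row of the table above, which sums to $2$ with $V = 10$:

$$P(\text{said} \mid \text{she}) = \frac{1}{2}, \qquad P_{\text{Laplace}}(\text{said} \mid \text{she}) = \frac{1 + 1}{2 + 10} = \frac{1}{6}$$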

Additionally, you should try to handle unknown words (i.e., words which do not occur in the given text) in a meaningful way, by implementing the following rules:
* If smoothing has not been applied and one of the words of a bi-gram is not known, return $0$.
* If the matrix has been smoothed, `word1` is known, and `word2` was not part of the given list of tokens, assume that `word2` is part of the matrix by returning $1$ for its count and $1/s$ for its normalized count (where $s$ is the sum of the row of `word1`, without adding the $1$ of `word2` to this row).
* If the matrix has been smoothed and `word1` was not part of the given list of tokens, assume that its row was empty before smoothing and contains only $1$s after smoothing.
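
One way to read the three rules above is sketched below for a single row. The class `UnknownWordSketch`, its method, and the convention of passing `null` or `-1` for unknown words are assumptions made for this illustration and are not part of the template.

```java
// Hypothetical helper illustrating the unknown-word rules; not part of the template.
final class UnknownWordSketch {
    /**
     * @param smoothed       whether Laplace smoothing has been applied
     * @param row            current row of word1, or null if word1 is unknown
     * @param column         column index of word2, or -1 if word2 is unknown
     * @param vocabularySize number of distinct known tokens
     */
    static double probability(boolean smoothed, int[] row, int column, int vocabularySize) {
        if (!smoothed && (row == null || column < 0)) {
            return 0.0; // Rule 1: unknown word without smoothing.
        }
        if (row == null) {
            // Rule 3: the assumed row holds only 1s, so it sums to vocabularySize.
            return 1.0 / vocabularySize;
        }
        int rowSum = 0;
        for (int count : row) {
            rowSum += count;
        }
        if (rowSum == 0) {
            return 0.0; // Empty, unsmoothed row: nothing to normalize.
        }
        // Rule 2: an unknown word2 counts as 1 without enlarging the row sum.
        int count = column < 0 ? 1 : row[column];
        return (double) count / rowSum;
    }
}
```

For example, with a smoothed row `{2, 1, 1}` (row sum $4$), a known `word2` in column `0` yields $2/4$, while an unknown `word2` yields $1/4$.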

Your implementation will be tested in five scenarios:
* The matrix is created and the pure counts are checked.
* The matrix is normalized and the normalized counts are checked.
* The matrix is smoothed and the smoothed counts are checked.
* The matrix is normalized and smoothed and the normalized counts are checked.
* The four checks above are repeated with words that do not occur in the text.

#### Notes

- You are free to use a different IDE to develop your solution. However, you have to copy the solution into this notebook to submit it.
- Do not add additional external libraries.
- Interface
  - You can use _[TAB]_ for autocompletion and _[SHIFT]_+_[TAB]_ for code inspection.
  - Use _Menu_ -> _View_ -> _Toggle Line Numbers_ for debugging.
  - Check _Menu_ -> _Help_ -> _Keyboard Shortcuts_.
- Known issues
  - All global variables will be set to void after an import.
  - Missing spaces around `%` (modulo) can cause unexpected errors, so please make sure that you have added spaces around every `%` character.
- Finish
  - Save your solution by clicking on the _disk icon_.
  - Make sure that all necessary imports are listed at the beginning of your cell.
  - Run a final check of your solution by
    - clicking on _restart the kernel, then re-run the whole notebook_ (the fast-forward arrow in the toolbar)
    - waiting for the kernel to restart and execute all cells (all executable cells should have numbers in front of them instead of a `[*]`)
    - checking all executed cells for errors. If an exception is thrown, please check your code. Note that although the error message might look cryptic, so far we have never encountered an exception that was not caused by the submitted code. A good way to check the code is to copy the solution into a new class in your favorite IDE and check
      - errors reported by the IDE
      - imports the IDE adds to your code which might be missing in your submission.
  - Finally, choose _Menu_ -> _File_ -> _Close and Halt_.
  - Do not forget to _Submit_ your solution in the _Assignments_ view.

In [1]:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BigramMatrix {
	int[][] matrix;
	double[][] normalisedMatrix;
	int[][] laplacedMatrix;
	boolean smooth;
	Map<String, Integer> unigramCount = new HashMap<>();
	List<String> uniqueTokens = new ArrayList<>();

	public BigramMatrix(List<String> tokens) {
		create(tokens);
	}

	// Do remember to add the missing word functionality

	public void create(List<String> token) {

		// This is the list of unique tokens in a list
		smooth = false;
		for (String findUnique : token) {
			if (!uniqueTokens.contains(findUnique)) {
				uniqueTokens.add(findUnique);
			}
		}
		createUnigram(token);
		matrix = new int[uniqueTokens.size()][uniqueTokens.size()];
		normalisedMatrix = new double[uniqueTokens.size()][uniqueTokens.size()];

		for (int i = 0; i < token.size() - 1; i++) {
			String firstToken = token.get(i);
			int index = uniqueTokens.indexOf(firstToken);
			String laterToken = token.get(i + 1);
			int indexSecond = uniqueTokens.indexOf(laterToken);
			matrix[index][indexSecond] += 1;
		}

	}

	public Map<String, Integer> createUnigram(List<String> token) {

		for (String tokenTemp : token) {
			if (unigramCount.containsKey(tokenTemp)) {
				int count = unigramCount.get(tokenTemp);
				count++;
				unigramCount.put(tokenTemp, count);
			} else {
				unigramCount.put(tokenTemp, 1);
			}
		}
		return unigramCount;
	}

	/**
	 * Returns the count of the bi-gram matrix for the bi-gram (word1, word2).
	 */
	public double getCount(String word1, String word2) {
		double count = 0;
        
		// YOUR CODE HERE
		if (smooth == false) {
			if (!uniqueTokens.contains(word1) || !uniqueTokens.contains(word2))

				count = 0;
			else {
				int firstToken = uniqueTokens.indexOf(word1);
				int secondToken = uniqueTokens.indexOf(word2);
				count = matrix[firstToken][secondToken];
			}
		}

		else {
			if (!uniqueTokens.contains(word2) || !uniqueTokens.contains(word1))
				count = 1;
			else {
				int firstToken = uniqueTokens.indexOf(word1);
				int secondToken = uniqueTokens.indexOf(word2);
				count = matrix[firstToken][secondToken];
			}

		}
		
		return count;
	}

	public void assignMatrix(String word1, String word2, int count) {
		int firstToken = uniqueTokens.indexOf(word1);
		int secondToken = uniqueTokens.indexOf(word2);
		matrix[firstToken][secondToken] = count;
	}

	/**
	 * Transforms the internal count matrix into a normalized counts matrix.
	 */
	public void normalize() {
		// YOUR CODE HERE
		// flag = true;
		for (int i = 0; i < matrix.length; i++) {
			int sum = 0;
			for (int j = 0; j < matrix.length; j++) {
				sum += matrix[i][j];
			}
			for (int j = 0; j < matrix.length; j++) {
				if (sum != 0)
					normalisedMatrix[i][j] = (double) matrix[i][j] / sum;
			}
		}
	}

	/**
	 * Returns the normalized count of the bi-gram matrix for the bi-gram
	 * (word1, word2) (i.e., P(word2 | word1)).
	 */
	public double getNormalizedCount(String word1, String word2) {
		double normalizedCount = 0;
        
		
		// YOUR CODE HERE
		int indexWord1 = 0;
		int indexWord2 = 0;
		if (smooth == false && (!uniqueTokens.contains(word1) || !uniqueTokens.contains(word2)))
			normalizedCount = 0;
		else if (smooth == false) {
			indexWord1 = uniqueTokens.indexOf(word1);
			indexWord2 = uniqueTokens.indexOf(word2);
			normalizedCount = normalisedMatrix[indexWord1][indexWord2];
		} else {
			// Smoothed case. V is the vocabulary size.
			int vocabularySize = uniqueTokens.size();
			if (!uniqueTokens.contains(word1)) {
				// Unknown word1: its row is assumed to contain only 1s after
				// smoothing, so every entry (known or unknown word2) is 1 / V.
				normalizedCount = 1.0 / vocabularySize;
			} else {
				// s: sum of word1's smoothed row (original row sum + V).
				indexWord1 = uniqueTokens.indexOf(word1);
				int sum = 0;
				for (int value : matrix[indexWord1]) {
					sum += value;
				}
				// An unknown word2 counts as 1 without enlarging the row sum;
				// a known word2 uses its smoothed count.
				int laplacedCount = uniqueTokens.contains(word2)
						? matrix[indexWord1][uniqueTokens.indexOf(word2)]
						: 1;
				normalizedCount = (double) laplacedCount / sum;
			}
		}
		return normalizedCount;

	}

	public void performLaplaceSmoothing() {
		// YOUR CODE HERE
		// Map<String, Integer> mapCount = createUnigram(list);
		for (int i = 0; i < uniqueTokens.size(); i++) {
			for (int j = 0; j < uniqueTokens.size(); j++) {

				String secondToken = uniqueTokens.get(j);
				String firstToken = uniqueTokens.get(i);

				// c(wi-1,wi)
				double secondCount = getCount(firstToken, secondToken);

				// Laplace: add 1 to every count
				assignMatrix(firstToken, secondToken, (int) secondCount + 1);

			}
		}

		// switch the smooth flag to denote that smoothing has been completed.
		smooth = true;

	}

}

// This line should make sure that compile errors are directly identified when executing this cell
// (the line itself does not produce any meaningful result)
(new BigramMatrix(Arrays.asList("a", "b"))).normalize();
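
The class above appears to keep the (possibly smoothed) counts in `matrix` and, once smoothing has been applied, to derive probabilities from those counts on request. That is one way to allow smoothing even after `normalize()` has been called. A minimal sketch of the idea for a single row, using a hypothetical `LazyRow` helper that is not part of the template:

```java
// Hypothetical single-row illustration; the name LazyRow is an assumption.
final class LazyRow {
    private final int[] counts;

    LazyRow(int[] counts) {
        this.counts = counts.clone();
    }

    // Laplace smoothing: add 1 to every count in the row.
    void smooth() {
        for (int j = 0; j < counts.length; j++) {
            counts[j]++;
        }
    }

    // The probability is computed from the current counts, so it reflects
    // smoothing regardless of when smooth() was called.
    double probability(int j) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum == 0 ? 0.0 : (double) counts[j] / sum;
    }
}
```

With counts `{1, 1, 0, 0}`, `probability(0)` is $1/2$ before `smooth()` and $2/6$ afterwards. The trade-off is that the row sum is recomputed on every call instead of being cached.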

# Evaluation

- Run the following cell to test your implementation.
- You can ignore the cells afterwards.

In [2]:
%maven org.junit.jupiter:junit-jupiter-api:5.3.1
import org.junit.jupiter.api.Assertions;
import org.opentest4j.AssertionFailedError;

public static final double DELTA = 0.000001;

public static void checkMatrix(BigramMatrix matrix, String[][] testCases, double[] expectedValues,
        boolean checkNormalizedCounts) throws Exception {
    try {
        double value, diff;
        for (int i = 0; i < testCases.length; i++) {
            value = checkNormalizedCounts ? matrix.getNormalizedCount(testCases[i][0], testCases[i][1])
                    : matrix.getCount(testCases[i][0], testCases[i][1]);
            diff = Math.abs(value - expectedValues[i]);
            Assertions.assertTrue(diff < DELTA, "Your solution returned "
                    + (checkNormalizedCounts ? ("P(\"" + testCases[i][1] + "\"|\"" + testCases[i][0] + "\")=")
                            : ("c(\"" + testCases[i][0] + "\",\"" + testCases[i][1] + "\")="))
                    + value + " while " + expectedValues[i] + " has been expected.");
        }
        System.out.println("Test(s) successfully completed.");
    } catch (AssertionFailedError e) {
        throw e;
    } catch (RuntimeException e) {
        System.err.println("Your solution caused an unexpected error:");
        throw e;
    }
}

System.out.println("----- 1st example -----");
List<String> tokens = Arrays.asList("<s>", "she", "said", "i", "know", "that", "she", "likes",
                                        "english", "food", "</s>");
BigramMatrix m = new BigramMatrix(tokens);

System.out.print("Check counts: ");
checkMatrix(m,
        new String[][] { { "she", "said" }, { "english", "food" }, { "likes", "food" } },
        new double[] { 1, 1, 0 }, false);

m.normalize();
System.out.print("Check normalized counts: ");
checkMatrix(m,
        new String[][] { { "she", "said" }, { "english", "food" }, { "likes", "food" } },
        new double[] { 0.5, 1, 0 }, true);

m.performLaplaceSmoothing();
System.out.print("Check smoothed counts: ");
checkMatrix(m,
        new String[][] { { "she", "said" }, { "english", "food" }, { "likes", "food" } },
        new double[] { 2, 2, 1 }, false);

System.out.print("Check normalized, smoothed counts: ");
checkMatrix(m,
        new String[][] { { "she", "said" }, { "english", "food" }, { "likes", "food" } },
        new double[] { 1.0 / 6.0, 2.0 / 11.0, 1.0 / 11.0 }, true);

System.out.println("----- 2nd example -----");
// Apply the solution to a longer example
tokens = Arrays.asList("<s>", "london", "is", "the", "capital", "and", "largest", "city", "of",
                      "england", "</s>", "<s>", "million", "people", "live", "in", "london",
                      "</s>", "<s>", "the", "river", "thames", "is", "in", "london", "</s>",
                      "<s>", "london", "is", "the", "largest", "city", "in", "western",
                      "europe", "</s>");
m = new BigramMatrix(tokens);

System.out.print("Check counts: ");
checkMatrix(m, new String[][] { { "london", "</s>" }, { "largest", "city" }, { "river", "thames" },
        { "city", "river" } }, new double[] { 2, 2, 1, 0 }, false);

m.normalize();
System.out.print("Check normalized counts: ");
checkMatrix(m, new String[][] { { "london", "</s>" }, { "largest", "city" }, { "river", "thames" },
        { "city", "river" } }, new double[] { 0.5, 1.0, 1.0, 0 }, true);

m.performLaplaceSmoothing();
System.out.print("Check smoothed counts: ");
checkMatrix(m, new String[][] { { "london", "</s>" }, { "largest", "city" }, { "river", "thames" },
        { "city", "river" } }, new double[] { 3, 3, 2, 1 }, false);

System.out.print("Check normalized, smoothed counts: ");
checkMatrix(m, new String[][] { { "london", "</s>" }, { "largest", "city" }, { "river", "thames" },
        { "city", "river" } }, new double[] { 3.0 / 23.0, 1.0 / 7.0 , 0.1, 1.0 / 21 }, true);

System.out.println("----- Test with unknown words -----");
m = new BigramMatrix(tokens); // set matrix back
// Check unknown words
System.out.print("Check counts: ");
checkMatrix(m, new String[][] { { "london", "underground" }, { "small", "city" }, { "sky", "scraper" } }, 
            new double[] { 0, 0, 0 }, false);

m.normalize();
System.out.print("Check normalized counts: ");
checkMatrix(m, new String[][] { { "london", "underground" }, { "small", "city" }, { "sky", "scraper" } }, 
            new double[] { 0, 0, 0 }, true);

m.performLaplaceSmoothing();
System.out.print("Check smoothed counts: ");
checkMatrix(m, new String[][] { { "london", "underground" }, { "small", "city" }, { "sky", "scraper" } }, 
            new double[] { 1, 1, 1 }, false);

System.out.print("Check normalized, smoothed counts: ");
checkMatrix(m, new String[][] { { "london", "underground" }, { "small", "city" }, { "sky", "scraper" } }, 
            new double[] { 1.0 / 23.0, 1.0 / 19.0, 1.0 / 19.0 }, true);

----- 1st example -----
Check counts: Test(s) successfully completed.
Check normalized counts: Test(s) successfully completed.
Check smoothed counts: Test(s) successfully completed.
Check normalized, smoothed counts: Test(s) successfully completed.
----- 2nd example -----
Check counts: Test(s) successfully completed.
Check normalized counts: Test(s) successfully completed.
Check smoothed counts: Test(s) successfully completed.
Check normalized, smoothed counts: Test(s) successfully completed.
----- Test with unknown words -----
Check counts: Test(s) successfully completed.
Check normalized counts: Test(s) successfully completed.
Check smoothed counts: Test(s) successfully completed.
Check normalized, smoothed counts: Test(s) successfully completed.
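
The checks in `checkMatrix` compare values with a tolerance (`DELTA`) rather than with `==`, because floating-point results can differ in the last few bits. A small sketch of why this matters; the class `ToleranceSketch` is a hypothetical name used only for this illustration:

```java
// Hypothetical helper mirroring the DELTA comparison used in checkMatrix.
final class ToleranceSketch {
    static final double DELTA = 0.000001;

    // True if the two values differ by less than DELTA.
    static boolean closeEnough(double actual, double expected) {
        return Math.abs(actual - expected) < DELTA;
    }

    public static void main(String[] args) {
        double sum = 0.1 + 0.2;
        System.out.println(sum == 0.3);            // false
        System.out.println(closeEnough(sum, 0.3)); // true
    }
}
```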


In [None]:
// Ignore this cell
