# Exercise 4 - Spell Correction
(8 points)

In this exercise, you should finish the implementation of the `TrigramBasedSpellCorrector` class by implementing a) the creation of a tri-gram matrix from a given text and b) a simple spell correction that is based on the trigrams in the matrix and the Levenshtein distance.

Note that code from the other three exercises might be reusable for this exercise. However, please think about which parts you want to reuse. Not all of them might be necessary or helpful.

In the test, we use texts by [William Shakespeare](https://en.wikipedia.org/wiki/William_Shakespeare) to obtain a larger number of tri-grams. The file `shakespeare.txt` can be downloaded from [http://norvig.com/ngrams/shakespeare.txt](http://norvig.com/ngrams/shakespeare.txt).

#### Matrix Creation

The creation of the matrix is based on a given text. This text should be preprocessed with the same preprocessing rules as in Exercise 1 of this exercise series. After that, the tri-grams need to be extracted and stored in a way that they can be reused for getting word candidates for the spell correction.

##### Hints

* While the bi-gram matrix might work as a dense matrix implementation, the tri-gram matrix should use an implementation that is optimized for sparse matrices. Otherwise, your solution might run out of memory.
* Make sure that the creation of your matrix does not take too much time. The hidden tests might have to initialize your matrix several times, and if a single initialization takes more than 1 minute, the complete evaluation of your solution might time out. (Reading the content of the `shakespeare.txt` file and generating the matrix typically takes less than 2 seconds.)
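
One possible way to keep the tri-gram matrix sparse is to store only the tri-grams that actually occur, grouped by their first two words. The following is a minimal sketch under that assumption, not the required design; the class name `SparseTrigramIndex` and its `candidates` method are hypothetical, and the token list is assumed to be already preprocessed.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a sparse tri-gram store keyed by the leading bi-gram. */
class SparseTrigramIndex {
    // "w1 w2" -> (w3 -> count); only observed tri-grams take up memory.
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Counts every tri-gram of an already preprocessed token list. */
    SparseTrigramIndex(List<String> tokens) {
        for (int i = 0; i + 2 < tokens.size(); i++) {
            String prefix = tokens.get(i) + " " + tokens.get(i + 1);
            counts.computeIfAbsent(prefix, k -> new HashMap<>())
                  .merge(tokens.get(i + 2), 1, Integer::sum);
        }
    }

    /** Returns the observed third words with their counts; empty if the prefix is unseen. */
    Map<String, Integer> candidates(String w1, String w2) {
        return counts.getOrDefault(w1 + " " + w2, Collections.emptyMap());
    }
}
```

With this layout, looking up the candidates for $(w_1,w_2)$ is a single hash lookup instead of a scan over all stored tri-grams, which may help when the corrector is queried many times.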

#### Spell Correction

The spell correction should be implemented in the method `getCorrection(String word1, String word2, String word3)`. It gets a tri-gram $(w_1,w_2,w_3)$ as input where the third word $w_3$ might be misspelled. Its aim is to provide a correct word which fits the first two words of the tri-gram. Its internal process should be based on the following two steps:
1. Get a list of candidate words $w_c$ from the tri-gram matrix which are known to occur as $(w_1,w_2,w_c)$.
2. From this list, choose the word that has the smallest Levenshtein distance to $w_3$.

The chosen word should be returned as the suggested correction for the third word.
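
The two steps above can be sketched as follows. This is only an illustration under assumptions: the class `ClosestCandidate` and its methods `levenshtein` and `pick` are hypothetical names, the candidate list is assumed to come from the tri-gram matrix, and the distance is the plain Levenshtein distance with unit costs for insertion, deletion and substitution.

```java
import java.util.Collection;

/** Hypothetical sketch: pick the candidate closest to a possibly misspelled word. */
class ClosestCandidate {

    /** Plain Levenshtein distance computed row by row. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int subst = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(subst, 1 + Math.min(prev[j], curr[j - 1]));
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    /** Returns the candidate with the smallest distance to word3, or null if there is none. */
    static String pick(Collection<String> candidates, String word3) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : candidates) {
            int distance = levenshtein(candidate, word3);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = candidate;
            }
        }
        return best;
    }
}
```

In this sketch, ties are resolved in favor of the candidate that is visited first, and an empty candidate collection yields `null`.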

##### Hints

* The given words of the tri-gram may have to be preprocessed to match the preprocessed words from the text. However, you can assume that every given word will be a single word even after applying the preprocessing, i.e., none of the words will contain characters that are not alphanumeric.
* If the list of candidates is empty (because $w_1$, $w_2$ or their combination are simply not available in your tri-gram matrix), your solution should return `null`.
* If more than one candidate has the lowest Levenshtein distance, any of these candidates will be accepted.

#### Example

The following text could be used as basis for the tri-gram matrix.
``` 
London is the capital and largest city of England. Million people live in London. 
The River Thames is in London. London is the largest city in Western Europe.
```
For the following tri-grams, your implementation of the `getCorrection` method should return the following results:

| Tri-gram | Result | Explanation |
|---|---|---|
| `("largest", "city", "im")` | `"in"` | In the given text, there are two tri-grams starting with `"largest", "city"` leading to the two candidates `"of"` and `"in"`. The latter has the smaller Levenshtein distance. |
| `("largest", "city", "on")` | `"in"` OR `"of"` | There are the same candidates as in the line above but both of them have the same Levenshtein distance to the given third word. So both results would be correct. |
| `("London", "is", "teh")` | `"the"` | In the given text, there are two tri-grams starting with `"London", "is"`. However, both have `"the"` as a third word. So it is the only candidate. |
| `("largest", "capital", "in")` | `null` | There are no tri-grams starting with `"largest", "capital"`. |
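
As a worked check of the first three rows, the distances between each candidate and the given third word are listed below. The notation $\operatorname{lev}(a, b)$ is introduced here only for illustration and denotes the plain Levenshtein distance with unit costs for insertion, deletion and substitution.

$$
\begin{aligned}
\operatorname{lev}(\text{of}, \text{im}) &= 2, & \operatorname{lev}(\text{in}, \text{im}) &= 1 \\
\operatorname{lev}(\text{of}, \text{on}) &= 1, & \operatorname{lev}(\text{in}, \text{on}) &= 1 \\
\operatorname{lev}(\text{the}, \text{teh}) &= 2
\end{aligned}
$$

In the first row, `"in"` wins with distance 1. In the second row, both candidates are tied at distance 1. In the third row, `"the"` is the only candidate, so it is returned although its distance is 2.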

#### Notes

- You are free to use a different IDE to develop your solution. However, you have to copy the solution into this notebook to submit it.
- Do not add additional external libraries.
- Interface
  - You can use _[TAB]_ for autocompletion and _[SHIFT]_+_[TAB]_ for code inspection.
  - Use _Menu_ -> _View_ -> _Toggle Line Numbers_ for debugging.
  - Check _Menu_ -> _Help_ -> _Keyboard Shortcuts_.
- Known issues
  - All global variables will be set to void after an import.
  - Missing spaces around `%` (modulo) can cause unexpected errors, so please make sure that you have added spaces around every `%` character.
- Finish
  - Save your solution by clicking on the _disk icon_.
  - Make sure that all necessary imports are listed at the beginning of your cell.
  - Run a final check of your solution by
    - click on _restart the kernel, then re-run the whole notebook_ (the fast forward arrow in the tool bar)
    - wait for the kernel to restart and execute all cells (all executable cells should have numbers in front of them instead of a `[*]`)
    - Check all executed cells for errors. If an exception is thrown, please check your code. Note that although the error might look cryptic, so far we have never encountered an exception that was not caused by the submitted code. A good way to check the code is to copy the solution into a new class in your favorite IDE and check
      - errors reported by the IDE
      - imports the IDE adds to your code which might be missing in your submission.
  - Finally, choose _Menu_ -> _File_ -> _Close and Halt_.
  - Do not forget to _Submit_ your solution in the _Assignments_ view.

In [1]:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * A simple spell correction approach based on tri-grams and the Levenshtein
 * distance.
 */
public class TrigramBasedSpellCorrector {
	// YOUR CODE HERE
	public static final String SENTENCE_START = "<s>";
	public static final String SENTENCE_END = "</s>";

	Map<String, Integer> map_trigram;

	/**
	 * Constructor.
	 */
	public TrigramBasedSpellCorrector(String text) {
		create(text);
	}


	public static List<String> preprocess(String text) {
		List<String> tokenList = new ArrayList<>();
		String text_array[] = text.replaceAll("[^ a-zA-Z0-9.?!]", " ").replaceAll("[.!?]", " </s> <s> ").toLowerCase()
				.split(" ");
		tokenList.add("<s>");
		for (String token : text_array) {
			if (!(token.equals("")))
				tokenList.add(token);
		}
		tokenList.remove(tokenList.size() - 1);
		return tokenList;
	}

	public int calcLevenshteinDistance(String string1, String string2) {
		// YOUR CODE HERE
		// changes[j] holds the edit distance between the current prefix of
		// string1 and the first j characters of string2 (single-row DP).
		int[] changes = new int[string2.length() + 1];
		for (int index2 = 0; index2 < changes.length; index2++) {
			changes[index2] = index2;
		}
		for (int index1 = 1; index1 <= string1.length(); index1++) {
			changes[0] = index1;
			// Value of the diagonal cell (index1 - 1, index2 - 1).
			int previousEdit = index1 - 1;
			for (int index2 = 1; index2 <= string2.length(); index2++) {
				// An insertion or deletion always costs 1; a substitution
				// costs 1 only if the two characters differ.
				int substitutionCost = string1.charAt(index1 - 1) == string2.charAt(index2 - 1) ? 0 : 1;
				int newEditDistance = Math.min(1 + Math.min(changes[index2 - 1], changes[index2]),
						previousEdit + substitutionCost);
				previousEdit = changes[index2];
				changes[index2] = newEditDistance;
			}
		}
		return changes[string2.length()];
	}

	/**
	 * Internal methods for determining the necessary statistics about the
	 * tri-grams of the given text.
	 */
	protected void create(String text) {
		// similar to the bigram matrix
		// YOUR CODE HERE
		List<String> tokens = preprocess(text);
		List<String> listTrigrams = new ArrayList<>();
		map_trigram = new HashMap<>();
		for (int i = 0; i < tokens.size() - 2; i++) {
			String string = tokens.get(i) + " " + tokens.get(i + 1) + " " + tokens.get(i + 2);
			listTrigrams.add(string); // add trigrams to list
		}
		// iterate trigram and check for existence of it into hashmap
		for (int indexTrigram = 0; indexTrigram < listTrigrams.size(); indexTrigram++) {
			if (map_trigram.containsKey(listTrigrams.get(indexTrigram))) {
				// add 1 to the counts of trigrams in hashmap
				int countTrigrams = map_trigram.get(listTrigrams.get(indexTrigram));
				map_trigram.put(listTrigrams.get(indexTrigram), countTrigrams + 1);
			} else {
				// add entry of trigram into hashmap as it does not exist
				map_trigram.put(listTrigrams.get(indexTrigram), 1);
			}
		}

	}

	/**
	 * Returns the correction of the third word based on the internal tri-grams
	 * that start with word1 and word2 as well as the Levenshtein distance of
	 * candidates from these tri-grams to the given word3.
	 * 
	 * @return a word for which a tri-gram with word1 and word2 at the beginning
	 *         exists and which has the smallest Levenshtein distance to the
	 *         given word3. Or null, if such a word does not exist.
	 */
	public String getCorrection(String word1, String word2, String word3) {
		// YOUR CODE HERE
		// Normalize the input so that it matches the lower-cased,
		// alphanumeric tokens produced by preprocess().
		word1 = word1.replaceAll("[^a-zA-Z0-9]", "").toLowerCase();
		word2 = word2.replaceAll("[^a-zA-Z0-9]", "").toLowerCase();
		word3 = word3.replaceAll("[^a-zA-Z0-9]", "").toLowerCase();

		// Compare string contents with isEmpty(); == would compare references.
		if (word1.isEmpty() || word2.isEmpty() || word3.isEmpty()) {
			return null;
		}

		// Every stored key has the form "w1 w2 w3", so all matching keys
		// start with "w1 w2 ".
		String prefix = word1 + " " + word2 + " ";
		String correctWord = null;
		int minDistance = Integer.MAX_VALUE;

		for (String trigram : map_trigram.keySet()) {
			if (!trigram.startsWith(prefix)) {
				continue;
			}
			String candidate = trigram.substring(prefix.length());
			int distance = calcLevenshteinDistance(candidate, word3);
			if (distance < minDistance) {
				// Keep the candidate with the smallest distance seen so far.
				minDistance = distance;
				correctWord = candidate;
			}
		}
		return correctWord;
	}
}

// This line should make sure that compile errors are directly identified when executing this cell
// (the line itself does not produce any meaningful result)
Arrays.sort(new Object[]{new TrigramBasedSpellCorrector("")});

# Evaluation

- Run the following cell to test your implementation.
- You can ignore the cells afterwards.

In [2]:
%maven org.junit.jupiter:junit-jupiter-api:5.3.1
%maven commons-io:commons-io:2.6
import org.apache.commons.io.FileUtils;
import org.junit.jupiter.api.Assertions;
import org.opentest4j.AssertionFailedError;

public void checkCorrection(TrigramBasedSpellCorrector corrector, String word1, String word2, String word3,
        String... expectedCorrections) {
    try {
        String result = corrector.getCorrection(word1, word2, word3);
        if(expectedCorrections.length > 0) {
            Set<String> expectedResults = new HashSet<String>(Arrays.asList(expectedCorrections));
            Assertions.assertTrue(expectedResults.contains(result),
                    "For the trigram (\"" + word1 + "\",\"" + word2 + "\",\"" + word3 + "\") your solution returned "
                            + result + " while one of the following words has been expected: "
                            + Arrays.toString(expectedCorrections));
        } else {
            Assertions.assertNull(result,
                    "For the trigram (\"" + word1 + "\",\"" + word2 + "\",\"" + word3 + "\") your solution returned "
                            + result + " while null has been expected.");
        }
        System.out.println("Test(s) successfully completed.");
    } catch (AssertionFailedError e) {
        throw e;
    } catch (Throwable e) {
        System.err.println("Your solution caused an unexpected error:");
        throw e;
    }
}

System.out.println("----- Testing on short example ----");
String text = "London is the capital and largest city of England. Million people " + 
    "live in London. The River Thames is in London. London is the largest city in " +
    "Western Europe.";

TrigramBasedSpellCorrector corrector = new TrigramBasedSpellCorrector(text);

checkCorrection(corrector, "largest", "city", "im", "in");
checkCorrection(corrector, "largest", "city", "on", "in", "of"); // we expect "in" OR "of"
checkCorrection(corrector, "London", "is", "teh", "the");
checkCorrection(corrector, "largest", "capital", "in"); // we expect null as result
checkCorrection(corrector, "natural", "language", "processing"); // we expect null as result

System.out.println("----- Testing on Shakespeare example ----");
// Read text of Shakespeare
File file = new File("/srv/distribution/shakespeare.txt");
text = FileUtils.readFileToString(file, "UTF-8");
long time = System.currentTimeMillis();
corrector = new TrigramBasedSpellCorrector(text);
time = System.currentTimeMillis() - time;
System.out.println("Loading the tri-grams from Shakespeare took " + time + "ms.");
if(time > 60000) {
    System.out.println("Loading the tri-grams took very long. You may want to check your implementation.");
}

checkCorrection(corrector, "The", "river", "stüx", "styx");
checkCorrection(corrector, "The", "River", "Stüx", "styx");
checkCorrection(corrector, "ambassadors", "from", "noway", "norway");
checkCorrection(corrector, "first", "noble", "Frodo", "friend");
checkCorrection(corrector, "the", "devil", "siaeks", "speaks", "rides"); // we expect "speaks" OR "rides"

----- Testing on short example ----
Test(s) successfully completed.
Test(s) successfully completed.
Test(s) successfully completed.
Test(s) successfully completed.
Test(s) successfully completed.
----- Testing on Shakespeare example ----
Loading the tri-grams from Shakespeare took 1387ms.
Test(s) successfully completed.
Test(s) successfully completed.
Test(s) successfully completed.
Test(s) successfully completed.
Test(s) successfully completed.


In [None]:
// Ignore this cell

In [None]:
// Ignore this cell

In [None]:
// Ignore this cell