# Exercise 1 - Shingling
(5 points)

Finalize the implementation of the `ShinglingProcessor` class. 
* Its `applyShingling` method implements the shingling from the lecture slides based on set semantics. It returns the ids of the shingles found in a given document.
* Its constructor takes the length of the shingles.
* The `jaccardSim` method should return the Jaccard similarity of the two given shingle sets.
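
For reference, the Jaccard similarity of two sets is commonly defined as the size of their intersection divided by the size of their union. The symbols $A$, $B$ and $J$ are notation chosen here for illustration and are not taken from the lecture slides:

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$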

#### Example

The document
```
google is good
```
has the following shingles with length 3
```
"goo", "oog", "ogl", "gle", "le ", "e i", " is", "is ", "s g", " go", "ood"
```
Since set semantics is used, the second occurrence of `"goo"` is not added again. If shingles are simply assigned ids in the order in which they are first seen, the document is represented by the following shingle ids (starting with 0):
```
    0,     1,     2,     3,     4,     5,     6,     7,     8,     9,   10
```
A second document
```
gooses google
```
would lead to the shingles
```
"goo", "oos", "ose", "ses", "es ", "s g", " go", "oog", "ogl", "gle"
```
and the ids
```
    0,    11,    12,    13,    14,     8,     9,     1,     2,     3
```

The intersection of their ids is $\{0,1,2,3,8,9\}$ while their union is
$\{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14\}$. Therefore, their Jaccard similarity
is $6/15 = 0.4$.
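
The numbers in this example can be reproduced with a few lines of Java. The following is only a minimal sketch: the class `ShingleExample` and its methods `shingles` and `jaccard` are hypothetical names chosen for illustration, and the sketch compares the shingle strings directly instead of their ids, which gives the same similarity.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class ShingleExample {

    // Collects the distinct character shingles of length k in order of first occurrence.
    public static Set<String> shingles(String document, int k) {
        Set<String> result = new LinkedHashSet<>();
        for (int i = 0; i + k <= document.length(); i++) {
            result.add(document.substring(i, i + k));
        }
        return result;
    }

    // Size of the intersection divided by the size of the union, computed on copies.
    public static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new LinkedHashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new LinkedHashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> first = shingles("google is good", 3);
        Set<String> second = shingles("gooses google", 3);
        System.out.println(first.size());           // 11 distinct shingles
        System.out.println(second.size());          // 10 distinct shingles
        System.out.println(jaccard(first, second)); // 6 shared / 15 total = 0.4
    }
}
```

A `LinkedHashSet` is used here only so that the shingles are kept in the order in which they were first seen, matching the listings above.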

#### Hints

- As can be seen in the example, the first three letters of a document form its first shingle and the last three letters form its last shingle.
- For the tests in **this notebook**, you can assume that the input documents have already been preprocessed and contain only the following three character classes:
  - lowercase alphabetic characters
  - digits
  - whitespace
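
The first hint determines the loop bounds: a document with $n$ characters has $n - k + 1$ windows of length $k$, and none at all if it is shorter than $k$. The sketch below illustrates this with a hypothetical helper `windows` (not part of the exercise interface) that lists every window, duplicates included.

```java
import java.util.ArrayList;
import java.util.List;

public class ShingleWindows {

    // Returns every window of length k in document order, duplicates included.
    public static List<String> windows(String document, int k) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start + k <= document.length(); start++) {
            result.add(document.substring(start, start + k));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> w = windows("google is good", 3);
        System.out.println(w.size());                // 14 - 3 + 1 = 12 windows
        System.out.println(w.get(0));                // goo
        System.out.println(w.get(w.size() - 1));     // ood
        System.out.println(windows("ab", 3).size()); // 0, the document is shorter than k
    }
}
```

Of the 12 windows, `"goo"` occurs twice, which is why the example document has only 11 distinct shingles.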

#### Notes

- You are free to use a different IDE to develop your solution. However, you have to copy the solution into this notebook to submit it.
- Do not add additional external libraries.
- Interface
  - You can use _[TAB]_ for autocompletion and _[SHIFT]_+_[TAB]_ for code inspection.
  - Use _Menu_ -> _View_ -> _Toggle Line Numbers_ for debugging.
  - Check _Menu_ -> _Help_ -> _Keyboard Shortcuts_.
- Known issues
  - All global variables will be set to void after an import.
  - Missing spaces around `%` (modulo) can cause unexpected errors, so please make sure that you have added spaces around every `%` character.
- Finish
  - Save your solution by clicking on the _disk icon_.
  - Make sure that all necessary imports are listed at the beginning of your cell.
  - Run a final check of your solution:
    - Click on _restart the kernel, then re-run the whole notebook_ (the fast-forward arrow in the toolbar).
    - Wait for the kernel to restart and execute all cells (all executable cells should have numbers in front of them instead of a `[*]`).
    - Check all executed cells for errors. If an exception is thrown, please check your code. Although the error might look cryptic, so far we have never encountered an exception that was not caused by the submitted code. A good way to check the code is to copy the solution into a new class in your favorite IDE and check:
      - errors reported by the IDE
      - imports the IDE adds to your code which might be missing in your submission.
  - Finally, choose _Menu_ -> _File_ -> _Close and Halt_.
  - Do not forget to _Submit_ your solution in the _Assignments_ view.

In [1]:
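import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ShinglingProcessor {

	// Maps every shingle seen so far to its id; ids are assigned in order of first occurrence.
	private final Map<String, Integer> shingleIds = new HashMap<>();
	// Length of the shingles in characters.
	private final int length;

	public ShinglingProcessor(int length) {
		this.length = length;
	}

	// Returns the ids of all shingles contained in the given document.
	public Set<Integer> applyShingling(String document) {
		Set<Integer> ids = new HashSet<>();
		for (int i = 0; i + length <= document.length(); i++) {
			String shingle = document.substring(i, i + length);
			// Constant-time map lookup instead of scanning a list with contains/indexOf.
			Integer id = shingleIds.get(shingle);
			if (id == null) {
				id = shingleIds.size();
				shingleIds.put(shingle, id);
			}
			ids.add(id);
		}
		return ids;
	}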

	public static double jaccardSim(Set<Integer> set1, Set<Integer> set2) {
		// YOUR CODE HERE

		// Build union and intersection on copies so that the caller's sets are not modified.
		Set<Integer> union = new HashSet<>(set1);
		union.addAll(set2);
		Set<Integer> intersection = new HashSet<>(set1);
		intersection.retainAll(set2);

		return (double) intersection.size() / union.size();
	}
	
	

}

// This line should make sure that compile errors are directly identified when executing this cell
// (the line itself does not produce any meaningful result)
new ShinglingProcessor(0);
System.out.println("compiled");

compiled


# Evaluation

- Run the following cell to test your implementation.
- You can ignore the cells afterwards.

In [2]:
%maven org.junit.jupiter:junit-jupiter-api:5.3.1
import org.junit.jupiter.api.Assertions;
import org.opentest4j.AssertionFailedError;

/**
 * Test class for the ShinglingProcessor.
 */
public class ShingleTest {
    
    public static final double DELTA = 0.000001;

    /**
     * Tests the given ShinglingProcessor by calculating the shingles of the given texts as well as
     * their jaccard similarity. The similarity is compared with the given, expected similarity.
     *
     * @param shingling The ShinglingProcessor instance that should be tested
     * @param text1 Text 1 that will be compared to Text 2
     * @param text2 Text 2 that will be compared to Text 1
     * @param expectedSim The expected value of the Jaccard similarity
     */
    public static void checkShingleSimilarity(ShinglingProcessor shingling, String text1, String text2,
            double expectedSim) throws Exception {
        checkShingleSimilarity(shingling, text1, text2, expectedSim, 0);
    }

    /**
     * Tests the given ShinglingProcessor by calculating the shingles of the given texts as well as
     * their jaccard similarity. The similarity is compared with the given, expected similarity.
     *
     * @param shingling The ShinglingProcessor instance that should be tested
     * @param text1 Text 1 that will be compared to Text 2
     * @param text2 Text 2 that will be compared to Text 1
     * @param expectedSim The expected value of the Jaccard similarity
     * @param warnRuntime Prints a warning if the runtime of the ShinglingProcessor exceeds the
     *                    given runtime. (0 turns this feature off)
     */
    public static void checkShingleSimilarity(ShinglingProcessor shingling, String text1, String text2,
            double expectedSim, long warnRuntime) throws Exception {
        try {
            long time = System.currentTimeMillis();
            double similarity = ShinglingProcessor.jaccardSim(shingling.applyShingling(text1),
                    shingling.applyShingling(text2));
            time = System.currentTimeMillis() - time;
            double diff = Math.abs(similarity - expectedSim);
            Assertions.assertTrue(diff < DELTA, "Your Jaccard similarity of the shingles of \"" + text1 + "\" and \""
                    + text2 + "\" is " + similarity + " but the expected similarity was " + expectedSim);
            System.out.print("Test successfully completed. Your implementation took ");
            System.out.print(time);
            // If we are checking for runtime and the solution took more time than expected
            if((warnRuntime > 0 ) && (time > warnRuntime)) {
                System.out.println("ms. This might be too slow for the later exercises!");
            } else {
                System.out.println("ms.");
            }
        } catch (AssertionFailedError e) {
            throw e;
        } catch (Throwable e) {
            System.err.println("Your solution caused an unexpected error:");
            throw e;
        }
    }
}

// example text
ShinglingProcessor processor = new ShinglingProcessor(3);
ShingleTest.checkShingleSimilarity(processor, "google is good", "gooses google", 0.4, 10);
ShingleTest.checkShingleSimilarity(processor, "abc", "cba", 0.0, 10);

processor = new ShinglingProcessor(1);
ShingleTest.checkShingleSimilarity(processor, "example", "elpmaxe", 1.0, 10);
processor = new ShinglingProcessor(2);
ShingleTest.checkShingleSimilarity(processor, "example", "elpmaxe", 0.0, 10);


Test successfully completed. Your implementation took 1ms.
Test successfully completed. Your implementation took 0ms.
Test successfully completed. Your implementation took 0ms.
Test successfully completed. Your implementation took 0ms.


In [3]:
// Ignore this cell

In [4]:
// Ignore this cell

In [5]:
ShinglingProcessor processor = null;


processor = new ShinglingProcessor(1);
ShingleTest.checkShingleSimilarity(processor, "abc", "cba", 1.0);
ShingleTest.checkShingleSimilarity(processor, "cba", "abc", 1.0); // check symmetry
ShingleTest.checkShingleSimilarity(processor, "abc", "def", 0.0);

processor = new ShinglingProcessor(5);
ShingleTest.checkShingleSimilarity(processor, "abcdef", "abcdeg", 1.0/3.0);
ShingleTest.checkShingleSimilarity(processor, "abcde", "abcdef", 0.5);
ShingleTest.checkShingleSimilarity(processor, "abcdef", "abcde", 0.5); // check symmetry

Test successfully completed. Your implementation took 0ms.
Test successfully completed. Your implementation took 0ms.
Test successfully completed. Your implementation took 0ms.
Test successfully completed. Your implementation took 0ms.
Test successfully completed. Your implementation took 0ms.
Test successfully completed. Your implementation took 0ms.


In [6]:
%maven commons-io:commons-io:2.6
import org.apache.commons.io.FileUtils;

ShinglingProcessor processor = new ShinglingProcessor(5);
String fileContent = FileUtils.readFileToString(new File("/srv/distribution/shakespeare-dedup-pre.txt"), "UTF-8");
int posToCut = fileContent.length() / 2;
ShingleTest.checkShingleSimilarity(processor,
                       fileContent.substring(0, posToCut),
                       fileContent.substring(posToCut),
                       0.6092518101367659);

Test successfully completed. Your implementation took 908300ms.
