# Exercise 2 - Min Hashing
(5 points)

Finalize the implementation of the MinHashingProcessor class. It should offer the following functionalities:
* The `MinHashingProcessor(int[][] permutations)` constructor takes a given set of permutations and performs its hashing based on them.
* The `MinHashingProcessor(int numberOfHashes, int numberOfShingles, long seed)` constructor takes the parameters needed to create the given number of random permutations. Using the seed is optional. However, if your implementation uses random number generators, we suggest initializing them with the seed so that tests are repeatable.
* The `minHash(Set<Integer> s)` method takes a set of shingles (as they are created in the first task of this exercise series) and returns the hashes based on the min hashing algorithm as it is described in the lecture slides.
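
The two building blocks described above can be sketched as follows. This is only an illustration, not the reference solution: the class name `MinHashSketch`, the helper names, and the choice of `-1` for a document without any matching shingle are assumptions made for this example. It uses a Fisher-Yates shuffle driven by a seeded `java.util.Random`, so the same seed always yields the same permutation, and it reports 0-based positions.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.Set;

public class MinHashSketch {

    /** Returns a random permutation of 0..n-1 (Fisher-Yates shuffle, seeded for repeatability). */
    public static int[] randomPermutation(int n, long seed) {
        Random random = new Random(seed);
        int[] permutation = new int[n];
        for (int i = 0; i < n; i++) {
            permutation[i] = i; // start with the identity permutation
        }
        for (int i = n - 1; i > 0; i--) {
            int j = random.nextInt(i + 1); // 0 <= j <= i
            int tmp = permutation[i];
            permutation[i] = permutation[j];
            permutation[j] = tmp;
        }
        return permutation;
    }

    /**
     * Returns the 0-based position of the first permutation entry that is
     * contained in the document, or -1 if there is none (assumption of this sketch).
     */
    public static int minHash(int[] permutation, Set<Integer> shingles) {
        for (int position = 0; position < permutation.length; position++) {
            if (shingles.contains(permutation[position])) {
                return position;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] permutation = {2, 0, 3, 1};
        // Shingle 3 sits at position 2 and is the first entry found in the document.
        System.out.println(minHash(permutation, Set.of(1, 3))); // prints 2
        // The same seed produces the same permutation.
        System.out.println(Arrays.equals(randomPermutation(10, 42L), randomPermutation(10, 42L))); // prints true
    }
}
```

A full processor would apply `minHash` once per permutation and collect the results in an `int[]`.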

#### Hints

- In the test cases, all ids (the shingle ids as well as the positions of the shingles inside the permutations) start with 0. This might differ from some examples in the lecture slides.
- Make sure that your solution gets through the second test cell, which is configured with about 10000 shingles and 50 hashes, within **4 minutes** (our test implementation needs ~30s). Remember that the evaluation of the file is aborted after 5 minutes.
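
One possible way to stay within the time limit is sketched below. It is an assumption about a workable design, not the reference implementation, and the names `PositionLookupSketch`, `positionsOf` and `minHash` are made up for this example. The idea is to precompute, once per permutation, the position of every shingle id. A document's min hash is then the smallest looked-up position, so the work per permutation grows with the size of the document rather than with the total number of shingles.

```java
import java.util.Set;

public class PositionLookupSketch {

    /** Builds a lookup table: positions[shingleId] = 0-based position of that id in the permutation. */
    public static int[] positionsOf(int[] permutation) {
        int[] positions = new int[permutation.length];
        for (int position = 0; position < permutation.length; position++) {
            positions[permutation[position]] = position;
        }
        return positions;
    }

    /** Returns the smallest position of any shingle of the document, or -1 for an empty document. */
    public static int minHash(int[] positions, Set<Integer> shingles) {
        int min = -1; // assumption of this sketch: -1 marks "no shingle"
        for (int shingle : shingles) {
            int position = positions[shingle];
            if (min < 0 || position < min) {
                min = position;
            }
        }
        return min;
    }

    public static void main(String[] args) {
        // For the permutation {2, 0, 3, 1} the lookup table is {1, 3, 0, 2}.
        int[] positions = positionsOf(new int[] {2, 0, 3, 1});
        // Shingle 1 is at position 3 and shingle 3 at position 2, so the min hash is 2.
        System.out.println(minHash(positions, Set.of(1, 3))); // prints 2
    }
}
```

The lookup table costs one extra `int[]` per permutation, which is the trade-off for avoiding a scan over the whole permutation on every call.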

#### Notes

- You are free to use a different IDE to develop your solution. However, you have to copy the solution into this notebook to submit it.
- Do not add additional external libraries.
- Interface
  - You can use _[TAB]_ for autocompletion and _[SHIFT]_+_[TAB]_ for code inspection.
  - Use _Menu_ -> _View_ -> _Toggle Line Numbers_ for debugging.
  - Check _Menu_ -> _Help_ -> _Keyboard Shortcuts_.
- Known issues
  - All global variables will be set to void after an import.
  - Missing spaces around `%` (modulo) can cause unexpected errors, so please make sure that there is a space on both sides of every `%` character.
- Finish
  - Save your solution by clicking on the _disk icon_.
  - Make sure that all necessary imports are listed at the beginning of your cell.
  - Run a final check of your solution by
    - clicking on _restart the kernel, then re-run the whole notebook_ (the fast forward arrow in the tool bar)
    - waiting for the kernel to restart and execute all cells (all executable cells should then show a number in front of them instead of `[*]`)
    - checking all executed cells for errors. If an exception is thrown, please check your code. Although the error might look cryptic, so far we have never encountered an exception that was not caused by the submitted code. A good way to check the code is to copy the solution into a new class in your favorite IDE and check
      - errors reported by the IDE
      - imports the IDE adds to your code which might be missing in your submission.
  - Finally, choose _Menu_ -> _File_ -> _Close and Halt_.
  - Do not forget to _Submit_ your solution in the _Assignments_ view.

In [2]:
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class MinHashingProcessor {
	// YOUR CODE HERE
	int columnLength = 0;
	int[][] permutations;

	/**
	 * Constructor for creating the class with an already given set of
	 * permutations.
	 */
	public MinHashingProcessor(int[][] permutations) {
		// YOUR CODE HERE
		columnLength = permutations[0].length;
		this.permutations = permutations;
	}

	/**
	 * Constructor for creating the class with a generated set of permutations.
	 * 
	 * @param numberOfHashes
	 *            number of hash functions (i.e., different permutations) that
	 *            should be generated
	 * @param numberOfShingles
	 *            number of different shingles the given documents can have
	 * @param seed
	 *            a seed, used if the generation is based on a random process
	 */
	public MinHashingProcessor(int numberOfHashes, int numberOfShingles, long seed) {
		// YOUR CODE HERE
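		permutations = new int[numberOfHashes][numberOfShingles];
		// Seeding the generator makes the generated permutations repeatable.
		Random random = new Random(seed);

		for (int i = 0; i < numberOfHashes; i++) {
			// Start with the identity permutation 0..numberOfShingles-1 ...
			for (int j = 0; j < numberOfShingles; j++) {
				permutations[i][j] = j;
			}
			// ... and shuffle it (Fisher-Yates), so that every shingle id
			// occurs exactly once in each permutation.
			for (int j = numberOfShingles - 1; j > 0; j--) {
				int k = random.nextInt(j + 1);
				int tmp = permutations[i][j];
				permutations[i][j] = permutations[i][k];
				permutations[i][k] = tmp;
			}
		}

		columnLength = numberOfShingles;
	}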
    
    
    public int[] minHash(Set<Integer> s) {
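        int[] hash = new int[permutations.length];
        // YOUR CODE HERE
        for (int i = 0; i < permutations.length; i++) {
            // Scan permutation i from the front: the position of the first shingle id
            // that occurs in s is the min hash value. If no id occurs in s (e.g., s is
            // empty), hash[i] keeps its default value 0.
            for (int j = 0; j < permutations[i].length; j++) {
                if (s.contains(permutations[i][j])) {
                    hash[i] = j;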
                    break;
                }
            }
        }
        return hash;
    }

}
// This line should make sure that compile errors are directly identified when executing this cell
// (the line itself does not produce any meaningful result)
new MinHashingProcessor(new int[][]{{0}});
System.out.println("compiled");

compiled


# Evaluation

- There are two different test scenarios. In the first scenario, the class is initialized with a given set of permutations. In the second scenario, the class has to create its own permutations.
- Run the following cell to test your implementation.
- You can ignore the cells afterwards.

In [3]:
%maven org.junit.jupiter:junit-jupiter-api:5.3.1
import org.junit.jupiter.api.Assertions;
import org.opentest4j.AssertionFailedError;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class MinHashTester {
    public static void checkMinHash(MinHashingProcessor processor, Set<Integer> documentShingles, int[] expectedHashes)
            throws Exception {
        try {
            int[] hashes = processor.minHash(documentShingles);
            Assertions.assertArrayEquals(expectedHashes, hashes,
                    "Your Min Hash solution of the shingles "
                            + Arrays.toString(documentShingles.toArray(new Integer[documentShingles.size()]))
                            + " created the hashes " + Arrays.toString(hashes) + " while "
                            + Arrays.toString(expectedHashes) + " was expected.");
            System.out.println("Test successfully completed.");
        } catch (AssertionFailedError e) {
            throw e;
        } catch (Throwable e) {
            System.err.println("Your solution caused an unexpected error:");
            throw e;
        }
    }

    public static void checkMinHash(MinHashingProcessor processor, int[] documentShingles, int[] expectedHashes)
            throws Exception {
        // create a set for the shingles
        checkMinHash(processor,
                    IntStream.of(documentShingles).boxed().collect(Collectors.toSet()),
                    expectedHashes);
    }
}

// The permutations used for these tests
int[][] permutations = new int[][] {
    { 5, 6, 7, 8, 9, 0, 1, 2, 3, 4 }, 
    { 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 },
    { 5, 3, 4, 9, 8, 1, 0, 7, 2, 6 } };
MinHashingProcessor processor = new MinHashingProcessor(permutations);

// The single test cases based on the given permutations
MinHashTester.checkMinHash(processor, new int[] {0, 2, 4, 6, 8}, new int[] { 1, 1, 2 });
MinHashTester.checkMinHash(processor, new int[] {6}, new int[] { 1, 3, 9 });
MinHashTester.checkMinHash(processor, new int[] {5, 9}, new int[] { 0, 0, 0 });
MinHashTester.checkMinHash(processor, new int[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, new int[] { 0, 0, 0 });

Test successfully completed.
Test successfully completed.
Test successfully completed.
Test successfully completed.


In [3]:
%maven org.junit.jupiter:junit-jupiter-api:5.3.1
import org.junit.jupiter.api.Assertions;
import org.opentest4j.AssertionFailedError;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/**
 * Checks the MinHashingProcessor and its ability to generate permutations by simulating some
 * scenarios based on the given number of hashes, number of shingles and seed values.
 */
public static void checkMinHash(int numberOfHashes, int numberOfShingles, long... seeds) throws Exception {
    try {
        long time = System.currentTimeMillis();
        Set<Integer> documentShingles = new HashSet<>();
        int[] expectedHashes = new int[numberOfHashes];
        for (int i = 0; i < seeds.length; ++i) {
            // Initialize the processor
            MinHashingProcessor processor = new MinHashingProcessor(numberOfHashes, numberOfShingles, seeds[i]);

            // Simple test with an (unrealistic) document that has all shingles (we can
            // expect it to have 0 for all permutations)
            for (int j = 0; j < numberOfShingles; j++) {
                documentShingles.add(j);
            }
            for (int j = 0; j < expectedHashes.length; j++) {
                expectedHashes[j] = 0;
            }
            MinHashTester.checkMinHash(processor, documentShingles, expectedHashes);

            // Simple test to make sure that the permutations are different
            // (With 100 shingles and 25 independent random permutations, a given
            // shingle lands on the same position in all permutations with
            // probability (1/100)^24 = 10^(-48), so a correct implementation
            // fails this test with a chance of at most 100*10^(-48) = 10^(-46).
            // If the test fails, check your implementation.)
            for (int j = 0; j < numberOfShingles; j++) {
                // For each shingle create a document that contains only this shingle
                documentShingles.clear();
                documentShingles.add(j);
                expectedHashes = processor.minHash(documentShingles);
                // Make sure that the permutations created different values (i.e., at least one
                // permutation should give this shingle a different id than the others.)
                IntSummaryStatistics statistics = IntStream.of(expectedHashes).summaryStatistics();
                Assertions.assertFalse(statistics.getMin() == statistics.getMax(),
                        "The shingle " + j + " has the same position in all permutations!");
            }
            System.out.println("Test successfully completed.");

            // Compare two completely different documents and make sure that they never get
            // the same hash value
            documentShingles.clear();
            Set<Integer> documentShingles2 = new HashSet<>();
            int[] hashes1, hashes2;
            for (int j = 0; j < numberOfShingles; j++) {
                if ((j & 1) > 0) {
                    documentShingles.add(j);
                } else {
                    documentShingles2.add(j);
                }
            }
            hashes1 = processor.minHash(documentShingles);
            hashes2 = processor.minHash(documentShingles2);
            for (int j = 0; j < hashes1.length; j++) {
                Assertions.assertFalse(hashes1[j] == hashes2[j],
                        "The permutation " + j + " gives the same number to two completely different documents!");
            }
            System.out.println("Test successfully completed.");
        }
        System.out.println("For all tests, " + (System.currentTimeMillis()-time) + "ms were needed.");
    } catch (AssertionFailedError e) {
        throw e;
    } catch (Throwable e) {
        System.err.println("Your solution caused an unexpected error:");
        throw e;
    }
}

// The tests are carried out two times with two different seeds
long[] seeds = new long[] { 123L, -999L };
int numberOfShingles = 100;
int numberOfHashes = 25;
checkMinHash(numberOfHashes, numberOfShingles, seeds);

Test successfully completed.
Test successfully completed.
Test successfully completed.
Test successfully completed.
Test successfully completed.
Test successfully completed.
For all tests, 21ms were needed.


In [None]:
// Ignore this cell

In [None]:
// Ignore this cell

In [4]:
// Ignore this cell
// BEGIN HIDDEN TESTS


// The permutations used for these tests
int[][] permutations = new int[][] {
    { 0, 9, 2, 7, 4, 5, 6, 3, 8, 1 }, 
    { 6, 8, 2, 5, 1, 7, 9, 4, 3, 0 },
    { 1, 8, 4, 6, 3, 9, 0, 5, 7, 2 },
    { 5, 0, 2, 7, 9, 4, 8, 1, 3, 6 } };
MinHashingProcessor processor = new MinHashingProcessor(permutations);

// The single test cases based on the given permutations
MinHashTester.checkMinHash(processor, new int[] {1, 2, 3, 4 ,5}, new int[] { 2, 2, 0, 0 });
MinHashTester.checkMinHash(processor, new int[] {0, 2, 4, 6, 8}, new int[] { 0, 0, 1, 1 });
MinHashTester.checkMinHash(processor, new int[] {1, 3, 5, 7, 9}, new int[] { 1, 3, 0, 0 });
MinHashTester.checkMinHash(processor, new int[] {0}, new int[] { 0, 9, 6, 1 });
MinHashTester.checkMinHash(processor, new int[] {6, 8}, new int[] { 6, 0, 1, 6 });
MinHashTester.checkMinHash(processor, new int[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, new int[] { 0, 0, 0, 0 });


Test successfully completed.
Test successfully completed.
Test successfully completed.
Test successfully completed.
Test successfully completed.
Test successfully completed.


In [7]:
// The tests are carried out two times with two different seeds
long[] seeds = new long[] { 9876L, -1L };
int numberOfShingles = 10000;
int numberOfHashes = 50;
checkMinHash(numberOfHashes, numberOfShingles, seeds);

Test successfully completed.
Test successfully completed.
Test successfully completed.
Test successfully completed.
Test successfully completed.
Test successfully completed.
For all tests, 44621ms were needed.


