## Exercise 3 - Deduplication
(10 points)

In this exercise, the implementation of the `Deduplicater` class should be finished. Note that you may want to copy your solutions from the previous exercises where appropriate. The class comprises the following methods:
* The constructor takes all parameters necessary for the deduplication process as it has been described in the lecture slides.
  * `threshold` – the similarity threshold $\theta$.
  * `shingleLength` – the length of the shingles.
  * `numberOfHashes` – the number of hash functions (i.e., permutations) that should be used.
  * `seed` – the seed that can be used for pseudo-random processes.
  * `b` – the number of bands for the LSH.
  * `r` – the number of rows of a single band for LSH.
* The `accept` method will be called for each document exactly once. It should
  * preprocess the document
    * transform all characters to their lower case form
    * remove all characters except lowercased characters, digits and whitespaces (i.e. `' '`)
  * assign an id to the document
    * the first document should get the id `0`, the next should get `1`, etc.
  * generate and store the document's set of shingles (not the document itself!). Please note that if we find solutions that store the original documents, we _may_ reduce the points they achieve.
* The `determineDuplicates` method should return the set of (near) duplicates.
  * A duplicate should have a Jaccard similarity greater than or equal to the given threshold $\theta=0.9$. (The Jaccard similarity of two documents is based on the sets of shingles of the documents.)
  * The Min Hashing and Locality-Sensitive Hashing (LSH) algorithms should be used to generate a set of candidate pairs. For these pairs, the similarity calculation should be carried out (using `jaccard.jaccardSim`) to make sure that the similarity is greater than or equal to $\theta$.
  * The result should have the type `Set<Duplicate>` where `Duplicate` is a given, simple class used to store a document pair.
  * Since there is a small chance that not all (near) duplicates are found, it is sufficient to find 99% of them.
  * The result is not allowed to contain document pairs with a lower similarity than the given threshold.
* The class should make use of the `jaccard` attribute. This is an instance of the `JaccardSimilarity` class that
  * should implement the Jaccard similarity as in exercise 1 of this assignment.
  * should be used in the `determineDuplicates` method
  * should count the number of times it has been called, to show that the whole deduplication approach needs far fewer comparisons than $n(n-1)/2$.
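
The preprocessing and shingling steps of the `accept` method can be sketched as follows. This is only a minimal sketch and not the reference solution: the class `ShingleSketch` and its methods `preprocess` and `shingle` are hypothetical names, shingles are taken to be character n-grams, and each shingle is represented by its `String.hashCode()`, which is an assumption that ignores possible hash collisions.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class; not part of the given notebook code.
class ShingleSketch {

    // Lowercases the text and removes everything except a-z, 0-9 and ' '.
    static String preprocess(String document) {
        return document.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9 ]", "");
    }

    // Returns the set of character shingles of the given length. Each shingle
    // is stored as its hash code (assumption: collisions can be ignored), so
    // the original text does not have to be kept.
    static Set<Integer> shingle(String text, int shingleLength) {
        Set<Integer> shingles = new HashSet<>();
        for (int i = 0; i + shingleLength <= text.length(); i++) {
            shingles.add(text.substring(i, i + shingleLength).hashCode());
        }
        return shingles;
    }

    public static void main(String[] args) {
        String text = preprocess("To be, or NOT to be!");
        System.out.println(text);
        System.out.println(shingle(text, 5).size() + " distinct shingles");
    }
}
```

Storing a `Set<Integer>` per document keeps the memory footprint small and makes the later intersection and union operations of the Jaccard similarity straightforward.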
  
#### Hints

- Please note that the hidden test will contain a document collection similar to the `shakespeare-dedup-small.txt` dataset. Although only the hidden test case will be executed during the evaluation, you should make sure that your implementation is able to finish its calculation in time. Thus, we added time measurements to the evaluation and you will see a warning if your implemented solution needs more than **4 minutes** for the Shakespeare example. If you see this warning, you may want to try to improve the runtime of your solution to make sure that your implementation can score points.
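
One common way to keep the runtime low is to avoid comparing every pair of signature columns within a band. Instead, each band of a signature is used as a bucket key, and only documents that share a bucket become candidate pairs. The following is a minimal sketch of that idea, assuming that the min-hash signatures are already available as one `int[]` of length `b * r` per document. The class `BandingSketch` and the method `candidatePairs` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class; not part of the given notebook code.
class BandingSketch {

    // Returns candidate pairs as "id1,id2" strings with id1 < id2.
    // signatures[d] is the min-hash signature of document d (length b * r).
    static Set<String> candidatePairs(int[][] signatures, int b, int r) {
        Set<String> candidates = new HashSet<>();
        for (int band = 0; band < b; band++) {
            // bucket key: the r signature values of this band
            Map<List<Integer>, List<Integer>> buckets = new HashMap<>();
            for (int doc = 0; doc < signatures.length; doc++) {
                List<Integer> key = new ArrayList<>();
                for (int row = band * r; row < (band + 1) * r; row++) {
                    key.add(signatures[doc][row]);
                }
                buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(doc);
            }
            // all documents in the same bucket are candidates of each other
            for (List<Integer> docs : buckets.values()) {
                for (int i = 0; i < docs.size(); i++) {
                    for (int j = i + 1; j < docs.size(); j++) {
                        candidates.add(docs.get(i) + "," + docs.get(j));
                    }
                }
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        int[][] signatures = { { 1, 2, 3, 4 }, { 1, 2, 9, 9 }, { 7, 7, 3, 5 } };
        // documents 0 and 1 agree in the first band (rows 0 and 1)
        System.out.println(candidatePairs(signatures, 2, 2));
    }
}
```

With this approach, the work per band grows with the number of documents plus the number of pairs inside each bucket, instead of with the number of all document pairs.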

#### Notes

- You are free to use a different IDE to develop your solution. However, you have to copy the solution into this notebook to submit it.
- Do not add additional external libraries.
- Interface
  - You can use _[TAB]_ for autocompletion and _[SHIFT]_+_[TAB]_ for code inspection.
  - Use _Menu_ -> _View_ -> _Toggle Line Numbers_ for debugging.
  - Check _Menu_ -> _Help_ -> _Keyboard Shortcuts_.
- Known issues
  - All global variables will be set to void after an import.
  - Missing spaces around `%` (modulo) can cause unexpected errors, so please make sure that you have added spaces around every `%` character.
- Finish
  - Save your solution by clicking on the _disk icon_.
  - Make sure that all necessary imports are listed at the beginning of your cell.
  - Run a final check of your solution by
    - clicking on _restart the kernel, then re-run the whole notebook_ (the fast forward arrow in the tool bar)
    - waiting for the kernel to restart and execute all cells (all executable cells should have numbers in front of them instead of a `[*]`)
    - checking all executed cells for errors. If an exception is thrown, please check your code. Although the error message might look cryptic, we have not yet encountered an exception that was raised without a valid reason inside the submitted code. A good way to check the code is to copy the solution into a new class in your favorite IDE and check
      - errors reported by the IDE
      - imports the IDE adds to your code which might be missing in your submission.
  - Finally, choose _Menu_ -> _File_ -> _Close and Halt_.
  - Do not forget to _Submit_ your solution in the _Assignments_ view.

In [3]:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/**
 * A simple class representing a pair of documents
 */
class Duplicate {
	public int id1, id2;

	public Duplicate(int id1, int id2) {
		if (id1 < id2) {
			this.id1 = id1;
			this.id2 = id2;
		} else {
			this.id1 = id2;
			this.id2 = id1;
		}
	}

	public void setId1(int id1) {
		this.id1 = id1;
	}

	public void setId2(int id2) {
		this.id2 = id2;
	}

	@Override
	public int hashCode() {
		return 31 * (31 + id1) + id2;
	}

	@Override
	public boolean equals(Object obj) {
		if (this == obj)
			return true;
		if (obj == null)
			return false;
		if (!(obj instanceof Duplicate))
			return false;
		Duplicate other = (Duplicate) obj;
		if (id1 != other.id1)
			return false;
		if (id2 != other.id2)
			return false;
		return true;
	}

	@Override
	public String toString() {
		return (new StringBuilder()).append('(').append(id1).append(',').append(id2).append(')').toString();
	}
}

/**
 * A simple class implementing the Jaccard similarity and counting the number of
 * times it is called.
 */
class JaccardSimilarity {
	private AtomicInteger calls = new AtomicInteger(0);

	public int getCalls() {
		return calls.get();
	}

	public double jaccardSim(List<Integer> set1, List<Integer> set2) {
		calls.incrementAndGet();
		double similarity = 0;
		// You may want to copy your solution from task 1
		// YOUR CODE HERE
		Set<Integer> set = new HashSet<>();
		for (Integer in : set1) {
			set.add(in);
		}

		Set<Integer> set_2 = new HashSet<>();
		for (Integer in : set2) {
			set_2.add(in);
		}
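		// set_2 becomes the intersection, set becomes the union
		set_2.retainAll(set);
		set.addAll(set2);

		// Jaccard similarity = |intersection| / |union|
		similarity = (double) set_2.size() / set.size();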

		return similarity;
	}
}

// YOUR CODE HERE

/**
 * Class for finding duplicates in a given corpus
 */
public class Deduplicater implements Consumer<String> {

	public final JaccardSimilarity jaccard = new JaccardSimilarity();
	// YOUR CODE HERE
	int shingleLength = 0;
	Map<Integer, List<Integer>> mapDocShingle = new HashMap<>();
	List<List<Integer>> listDocShingle = new ArrayList<>();
	int numberOfHashes = 0;
	long seed = 0;
	int b = 0;
	int r = 0;
	int numberOfShingles = 0;
	double threshold = 0;

	int[][] permutations;

	/**
	 * Constructor.
	 * 
	 * @param threshold
	 *            the similarity threshold theta.
	 * @param shingleLength
	 *            the length of the shingles
	 * @param numberOfHashes
	 *            the number of hash functions (i.e., permutations) that should
	 *            be used
	 * @param seed
	 *            the seed that can be used for pseudo random processes
	 * @param b
	 *            the number of bands for the LSH
	 * @param r
	 *            the number of rows of a single band for LSH
	 */
	public Deduplicater(double threshold, int shingleLength, int numberOfHashes, long seed, int b, int r) {
		// YOUR CODE HERE
		this.shingleLength = shingleLength;
		this.threshold = threshold;
		Random random = new Random(seed);
		this.numberOfHashes = numberOfHashes;
		this.seed = seed;
		this.b = b;
		this.r = r;
	}

	/**
	 * This method is called with a single document that should be added to the
	 * internal, shingled representation of documents.
	 *
	 * @param line
	 *            a single document that should be processed by the
	 *            Deduplicator.
	 */

	public void accept(String line) {
		// YOUR CODE HERE
		// remove (instead of replacing) all characters except a-z, 0-9 and ' '
		line = line.toLowerCase().replaceAll("[^ a-z0-9]", "");
		applyShingling(line);

	}

	/*
	 * We are storing each document set along with the id of the document in the
	 * hashmap.
	 */
	int docId = 0, tokenId = 0;

	// map has id and list of tokenids representing the document.
	Map<Integer, List<Integer>> docShingleMap = new HashMap<>();
	// we still need a bag of tokens mapped with its id
	Map<String, Integer> shingleBag = new HashMap<>();

	public void applyShingling(String shingle) {
		List<Integer> tokenIDList_Shingle = new ArrayList<>();
		String token = null;
		for (int i = 0; i < shingle.length() - shingleLength + 1; i++) {
			token = shingle.substring(i, i + shingleLength);
			int tokenTempCount = 0;
			if (!shingleBag.containsKey(token)) {
				shingleBag.put(token, ++tokenId);
				tokenTempCount = tokenId;
			} else {
				tokenTempCount = shingleBag.get(token);
			}

			if (!tokenIDList_Shingle.contains(tokenTempCount)) {
				tokenIDList_Shingle.add(tokenTempCount);
			}
		}

		// add this tokenIDList to the map along with doc ID.
		// This list represent the shingles of a document.
		// Every document gets an id, even if it yields few or no shingles;
		// skipping a document would shift the ids of all later documents.
		docShingleMap.put(++docId, tokenIDList_Shingle);
		numberOfShingles = tokenId;
		// System.out.println(docId);
		// apply id to docs and make an internal representation of shingled
		// documents
		// mapDocShingle.put((docId++ + 1), shingle1);
		// listDocShingle.add(shingle1);
		// System.out.println("Shingling is applied for " + shingle);

	}

	/*
	 * public void applyShingling(String shingle) { Set<Integer> uniqueShingle =
	 * new HashSet<Integer>(); List<Integer> shingle1 = new ArrayList<>();
	 * String token = null; for (int i = 0; i < shingle.length() - shingleLength
	 * + 1; i++) { token = shingle.substring(i, i + shingleLength);
	 * 
	 * // only if the token adds up to the shinglebad, // it means this shingle
	 * is unique if (shingleBag.add(token)) { shingleCount.put(token,
	 * ++shingleId); uniqueShingle.add(shingleId); } }
	 * docShingleMap.put(docId++, uniqueShingle); }
	 */
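	public int[] minHash(List<Integer> s) {
		int hash[] = new int[permutations.length];
		// a hash set gives constant-time lookups instead of scanning the list
		Set<Integer> shingles = (s == null) ? new HashSet<Integer>() : new HashSet<Integer>(s);
		for (int i = 0; i < permutations.length; i++) {
			// default value if no shingle of the document occurs in this row
			hash[i] = permutations[i].length;
			for (int j = 0; j < permutations[i].length; j++) {
				if (shingles.contains(permutations[i][j])) {
					// position of the first shingle of the document
					hash[i] = j;
					break;
				}
			}
		}
		return hash;
	}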

	public void createPermutation() {
		Random random = new Random(seed);
		permutations = new int[numberOfHashes][numberOfShingles];
		for (int i = 0; i < numberOfHashes; i++) {
			// start with the shingle ids 1..numberOfShingles in order
			for (int j = 0; j < numberOfShingles; j++) {
				permutations[i][j] = j + 1;
			}
			// Fisher-Yates shuffle, so that every id occurs exactly once per row
			for (int j = numberOfShingles - 1; j > 0; j--) {
				int k = random.nextInt(j + 1);
				int tmp = permutations[i][j];
				permutations[i][j] = permutations[i][k];
				permutations[i][k] = tmp;
			}
		}
	}

	// We need to hash each signature column
	public int hashSign(int[] arr) {
		int hash = 7;
		for (int a : arr) {
			hash += hash * 31 + a;
		}
		return hash;
	}

	Map<Integer, List<Integer>> bands = new HashMap<>();
	/*
	 * public void addBucket(int[] arr, int docId){
	 * 
	 * int hash = hashSign(Arrays.toString(arr).replaceAll("\\[|\\]| |\\,",
	 * ""));
	 * 
	 * if (!bands.containsKey(hash)) { List<Integer> docs = new
	 * ArrayList<Integer>(); docs.add(docId); bands.put(hash, docs); } else {
	 * List<Integer> docs = bands.get(hash);
	 * 
	 * if (! docs.contains(docId)) { docs.add(docId); bands.put(hash, docs); } }
	 * 
	 * }
	 */

	/**
	 * Verifies a candidate pair with the exact Jaccard similarity.
	 *
	 * @param candidates
	 *            two 0-based document ids (columns of the signature matrix)
	 * @return the pair as Duplicate if its similarity reaches the threshold,
	 *         otherwise null
	 */
	public Duplicate performJaccardCandidates(List<Integer> candidates) {
		int first = candidates.get(0);
		int second = candidates.get(1);
		// docShingleMap uses 1-based keys while the reported ids are 0-based
		List<Integer> doc1 = docShingleMap.get(first + 1);
		List<Integer> doc2 = docShingleMap.get(second + 1);
		double sim = jaccard.jaccardSim(doc1, doc2);
		// the task asks for a similarity greater than or equal to the threshold
		if (sim >= threshold)
			return new Duplicate(first, second);
		else
			return null;
	}

	// Pairs that have already been checked in an earlier band, so that the
	// Jaccard similarity is calculated at most once per pair.
	Set<Duplicate> checkedPairs = new HashSet<>();

	/**
	 * Compares all columns of a single band with each other. Two documents
	 * become a candidate pair if their columns agree in all r rows of the band.
	 */
	public List<Duplicate> compare(int[][] candidates) {
		List<Duplicate> result = new ArrayList<>();
		int i = 0;
		int count = 1;

		while (i < docId - 1) {
			while (count + i < docId) {
				boolean sameBand = true;
				for (int j = 0; j < r; j++) {
					if (candidates[j][i] != candidates[j][i + count]) {
						sameBand = false;
						break;
					}
				}
				if (sameBand && checkedPairs.add(new Duplicate(i, i + count))) {
					List<Integer> pairs = new ArrayList<>();
					pairs.add(i);
					pairs.add(i + count);
					Duplicate d = performJaccardCandidates(pairs);
					if (d != null) {
						result.add(d);
					}
				}
				// pick up the next column for comparison against the pilot column
				count++;
			}
			// change the pilot column.
			i++;
			count = 1;
		}

	/*
	 * for(int i =0; i<arr.length;i++){ for(int j = 0; j<arr[0].length;j++){
	 * 
	 * } }
	 * 
	 * 
	 * 
	 * 
	 * 
	 * for (int i = 0; i < arr[0].length; i++) { for (int j = 0; j < arr.length;
	 * j++) { if(arr[j]!=null){ if (arr[j][i - 1] == arr[j][i]) { if (j ==
	 * candidates.size()-1) { candidateSet =
	 * performJaccardCandidates(candidates); } } else break; } } }
	 */
	return result;

	}

	public Set<Duplicate> determineDuplicates() {
		Set<Duplicate> duplicates = new HashSet<>();
		// YOUR CODE HERE
		createPermutation();
		List<Duplicate> duplicateList = new ArrayList<>();
		// we have the permutation, now its time for hashing of documents. First
		// we want to have the representation of document matrix.
		int count = 0, bandIndex = 1, row_count = 1, band_count = 0;
		// int[][] arr = new int[r][];
		int[][] sign = new int[permutations.length][docId];
		List<Integer> candidateDocs = new ArrayList<>();
		int band = 1;

		// this loop prepares a signature matrix with one row per hash function
		// and one column per document
		for (int a = 1; a <= docId; a++) {
			// This will find the hash of the doc a.
			int[] signature = minHash(docShingleMap.get(a));
			for (int j = 0; j < permutations.length; j++) {
				sign[j][a - 1] = signature[j];
			}
		}
		int[][] si = new int[r][docId];
		row_count = 0;
		for (int i = 0; i < sign.length; i++) {
			// copy the next signature row into the current band
			si[row_count] = sign[i];
			row_count++;
			if (row_count == r || i == sign.length - 1) {
				// the band is complete: collect the verified duplicates
				duplicates.addAll(compare(si));
				row_count = 0;
				si = new int[r][docId];
				band++;
			}
		}
		return duplicates;
	}
}
// This line should make sure that compile errors are directly identified when
// executing this cell
// (the line itself does not produce any meaningful result)
new Deduplicater(0.9,10,50,123,10,5);System.out.println("compiled");

compiled
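
The `createPermutation` method above stores `numberOfHashes` complete rows of shingle ids, which becomes expensive for large vocabularies. A frequently used substitute for explicit permutations is a family of random hash functions of the form `(a * x + b) % p`. The following is a minimal sketch of that substitute and not the approach prescribed by the exercise; the class `MinHashSketch`, its fields and the choice of the prime `2^31 - 1` are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical helper class; not part of the given notebook code.
class MinHashSketch {
    // Assumed modulus: the Mersenne prime 2^31 - 1.
    static final long PRIME = 2147483647L;

    final long[] a;
    final long[] b;

    MinHashSketch(int numberOfHashes, long seed) {
        Random random = new Random(seed);
        a = new long[numberOfHashes];
        b = new long[numberOfHashes];
        for (int i = 0; i < numberOfHashes; i++) {
            a[i] = 1 + random.nextInt(Integer.MAX_VALUE - 1); // 1 .. PRIME - 1
            b[i] = random.nextInt(Integer.MAX_VALUE); // 0 .. PRIME - 1
        }
    }

    // For every hash function, keeps the smallest hash value over all shingles.
    long[] signature(Set<Integer> shingles) {
        long[] signature = new long[a.length];
        Arrays.fill(signature, Long.MAX_VALUE);
        for (int shingle : shingles) {
            long x = Math.floorMod((long) shingle, PRIME);
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME;
                if (h < signature[i]) {
                    signature[i] = h;
                }
            }
        }
        return signature;
    }

    public static void main(String[] args) {
        MinHashSketch sketch = new MinHashSketch(5, 123L);
        Set<Integer> shingles = new HashSet<>(Arrays.asList(1, 2, 3));
        System.out.println(Arrays.toString(sketch.signature(shingles)));
    }
}
```

The memory needed by this variant depends only on the number of hash functions, not on the number of distinct shingles.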


# Evaluation

- Run the following cells to test your implementation.
- This time, you have two different test cells. The first test uses a smaller file while the other test uses a longer file. We separated the two cells since the second one may take much more time.
- The files used for testing are:
  - [example-corpus.txt](https://hobbitdata.informatik.uni-leipzig.de/teaching/SNLP/deduplication/example-corpus.txt)
  - [example-corpus-expected.txt](https://hobbitdata.informatik.uni-leipzig.de/teaching/SNLP/deduplication/example-corpus-expected.txt)
  - [shakespeare-dedup-small.txt](https://hobbitdata.informatik.uni-leipzig.de/teaching/SNLP/deduplication/shakespeare-dedup-small.txt)
  - [shakespeare-dedup-small-expected.txt](https://hobbitdata.informatik.uni-leipzig.de/teaching/SNLP/deduplication/shakespeare-dedup-small-expected.txt)
  - The files with the expected results contain similarities which do not have to be part of the output of your solution. They are just provided as additional information.
- You can ignore the cells afterwards.
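
The two test cells use different banding parameters (`b = 10`, `r = 5` for the example corpus and `b = 5`, `r = 10` for the Shakespeare file). Under the usual min-hashing assumption that two signatures agree in a single row with a probability equal to the Jaccard similarity $s$ of the two documents, a pair becomes a candidate in at least one band with probability $1 - (1 - s^r)^b$. The following sketch evaluates this formula; the class `CandidateProbability` and its method are hypothetical names and not part of the test code.

```java
// Hypothetical helper class; not part of the given notebook code.
class CandidateProbability {

    // Probability that two documents with Jaccard similarity s agree in all
    // r rows of at least one of the b bands.
    static double candidateProbability(double s, int b, int r) {
        return 1.0 - Math.pow(1.0 - Math.pow(s, r), b);
    }

    public static void main(String[] args) {
        System.out.println(candidateProbability(0.9, 10, 5));
        System.out.println(candidateProbability(0.9, 5, 10));
    }
}
```

For a pair exactly at the threshold $s = 0.9$, the first configuration yields a probability of about 0.9999 and the second one about 0.88. Pairs with a higher similarity are found with a higher probability.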

In [None]:
%maven org.junit.jupiter:junit-jupiter-api:5.3.1
%maven commons-io:commons-io:2.6
import org.junit.jupiter.api.Assertions;
import org.opentest4j.AssertionFailedError;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;
import java.util.stream.Collectors;
import java.io.File;

public class DeduplicationTest {

    public static final double MIN_AMOUNT_OF_TP = 0.99;

    /**
     * Simple method for reading the expected document pairs from a file.
     */
    public static Set<Duplicate> readDuplicatesFromFile(String filename) throws IOException {
        return FileUtils.readLines(new File(filename), "utf-8").parallelStream().map(s -> s.split(","))
                .filter(s -> s.length > 1).map(s -> new Duplicate(Integer.parseInt(s[0]), Integer.parseInt(s[1])))
                .collect(Collectors.toSet());
    }
    /**
     * Simple method that prints a given set of duplicates (max. the first 10).
     */
    public static void printDuplicates(Iterator<Duplicate> iter, StringBuilder builder) {
        int count = 0;
        while (iter.hasNext() && (count < 10)) {
            builder.append(count == 0 ? "[" : ", ");
            builder.append(iter.next().toString());
            ++count;
        }
        if (iter.hasNext()) {
            builder.append(", ...");
        }
        builder.append("]\n");
    }

    public static void checkDuplicator(Iterator<String> lineIterator, Set<Duplicate> expectedDuplicates,
            double threshold, int shingleLength, int numberOfHashes, long seed, int b, int r, long maxRuntime) {
        try {
            // Determine the expected number of TPs to pass the test
            int minTP = (int) Math.floor(expectedDuplicates.size() * MIN_AMOUNT_OF_TP);

            // Process the given documents
            long time1 = System.currentTimeMillis();
            Deduplicater deduplicater = new Deduplicater(threshold, shingleLength, numberOfHashes, seed, b, r);
            lineIterator.forEachRemaining(deduplicater);
            time1 = System.currentTimeMillis() - time1;
            System.out.println("processing all documents took: " + time1 + "ms");

            // Search for duplicates
            long time2 = System.currentTimeMillis();
            Set<Duplicate> duplicates = deduplicater.determineDuplicates();
            time2 = System.currentTimeMillis() - time2;
            System.out.println("determineDuplicates took: " + time2 + "ms");

            // Check runtime
            if ((maxRuntime > 0) && ((time1 + time2) > maxRuntime)) {
                System.out.println("Warning! Your solution may take too much time (" + (time1 + time2)
                        + "ms while not more than " + maxRuntime + " is suggested)");
            }

            // Print Jaccard similarities
            System.out.println(deduplicater.jaccard.getCalls() + " Jaccard similarities were calculated.");
            if(deduplicater.jaccard.getCalls() == 0) {
                System.out.println("It looks like you are not using the jaccard attribute of the class. Please fix that.");
            }

            // Check duplicates
            // Get overlap (= true positives)
            Set<Duplicate> s1, s2, overlap;
            overlap = new HashSet<>();
            if (expectedDuplicates.size() > duplicates.size()) {
                s1 = duplicates;
                s2 = expectedDuplicates;
            } else {
                s1 = expectedDuplicates;
                s2 = duplicates;
            }
            overlap = s1.parallelStream().filter(d -> s2.contains(d)).collect(Collectors.toSet());
            // Get false negatives
            expectedDuplicates.removeAll(overlap);
            // Get false positives
            duplicates.removeAll(overlap);
            System.out.print("TP=");
            System.out.print(overlap.size());
            System.out.print("\tFP=");
            System.out.print(duplicates.size());
            System.out.print("\tFN=");
            System.out.println(expectedDuplicates.size());

            // make sure that enough TPs have been found
            if (overlap.size() < minTP) {
                // The student's solution has an issue... generate a detailed message
                StringBuilder builder = new StringBuilder();
                builder.append("Your solution found only ");
                builder.append(overlap.size());
                builder.append(" duplicates while at least ");
                builder.append(minTP);
                builder.append(" duplicates should have been found.\nYour solution missed:");
                printDuplicates(expectedDuplicates.iterator(), builder);
                Assertions.fail(builder.toString());
            }
            // make sure that no FPs are found
            if (duplicates.size() > 0) {
                // The student's solution has an issue... generate a detailed message
                StringBuilder builder = new StringBuilder();
                builder.append("Your solution generated ");
                builder.append(duplicates.size());
                builder.append(
                        " FPs while none were expected.\nYour solution returned the following, wrong duplicates:");
                printDuplicates(duplicates.iterator(), builder);
                Assertions.fail(builder.toString());
            }
            System.out.println("Test successfully completed.");
        } catch (AssertionFailedError e) {
            throw e;
        } catch (Throwable e) {
            System.err.println("Your solution caused an unexpected error:");
            throw e;
        }
    }
}

public boolean executingHiddenTests = false;
LineIterator iterator = null;

if(executingHiddenTests) {
    System.out.println("Skipping this test in favor of the hidden test.");
} else {
    System.out.println("--- example corpus ---");
    iterator = FileUtils.lineIterator(new File("/srv/distribution/example-corpus.txt"), "UTF-8");
    DeduplicationTest.checkDuplicator(iterator, 
                    DeduplicationTest.readDuplicatesFromFile("/srv/distribution/example-corpus-expected.txt"), 
                    0.9, 10, 50, 123, 10, 5, 0L);
    iterator.close();
}

In [2]:
/*
 * This test uses a longer file as input.
 */
System.out.println("--- Shakespeare small ---");
if(executingHiddenTests) {
    System.out.println("Skipping this test in favor of the hidden test.");
} else {
    iterator = FileUtils.lineIterator(new File("/srv/distribution/shakespeare-dedup-small.txt"), "UTF-8");
    DeduplicationTest.checkDuplicator(iterator, 
                    DeduplicationTest.readDuplicatesFromFile("/srv/distribution/shakespeare-dedup-small-expected.txt"), 
                    0.9, 5, 50, -99, 5, 10, 240000L);
    iterator.close();
}

--- Shakespeare small ---


CompilationException: 

In [None]:
// Ignore this cell