# Homework 1: Finding Similar Items: Textually Similar Documents

You are to implement the stages of finding textually similar documents based on Jaccard similarity using the shingling, minhashing, and locality-sensitive hashing (LSH) techniques and corresponding algorithms. You may use any big data processing framework, such as Apache Spark or Apache Flink, or no framework at all (e.g., plain Java or Python). To test and evaluate your implementation, write a program that uses it to find similar documents in a corpus of 5-10 or more documents, such as web pages or emails.

The stages should be implemented as a collection of classes, modules, functions, or procedures, depending on the framework and language of your choice. Below, we describe sample classes implementing the different stages of finding textually similar documents. You do not have to develop exactly the classes and data types described below. Feel free to use the data structures that suit you best.


### Load data into Google Colab via Google Drive


In [2]:
# Dataset: http://mlg.ucd.ie/datasets/bbc.html
import os
import math
import random
import binascii

import numpy as np
import pandas as pd
import sympy

# Importing data
# Option 1: Google Drive mounted in Colab
from google.colab import drive
drive.mount('/content/drive')
paths =['drive/My Drive/Colab Notebooks/data/bbc-fulltext/bbc/']
# Option 2 (outside Colab): skip the mount and set `paths` to the local dataset folder(s)

def flatmap(array):
  # Flatten a list of lists; the outer loop has to come first in the comprehension.
  return [element for innerArray in array for element in innerArray]

#Simple helper function
def extract_to_txt(path): 
  with open(path) as f:
    contents = f.readlines()
  return contents 


df = []

#Create DF with categories
for folder in paths: 
  for topic in os.listdir(folder): 
    topic_pth = f"{folder}{topic}/"
    for txt in os.listdir(topic_pth):
        text = extract_to_txt(topic_pth+txt)
        header = "".join(text[0])
        df.append([header,"".join(text[1::]), topic])



Mounted at /content/drive


In [3]:
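df = pd.DataFrame(df, columns=["headline", "text","label"])

print(df.columns)

# Normalize: drop newlines and double quotes, lowercase everything.
# Column-wise string methods avoid chained assignment (df.iloc[id].text = ...),
# which may write to a temporary copy instead of the DataFrame itself.
df["text"] = df["text"].str.replace("\n", "", regex=False).str.replace('"', "", regex=False).str.lower()
df["headline"] = df["headline"].str.replace("\n", "", regex=False).str.lower()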


print(df.head(5))

Index(['headline', 'text', 'label'], dtype='object')
                                 headline  \
0           council tax rise 'reasonable'   
1         baron kinnock makes lords debut   
2  election deal faltered over heath role   
3          brown shrugs off economy fears   
4        assembly ballot papers 'missing'   

                                                text     label  
0  welsh councils should set their taxes at reaso...  politics  
1  former labour leader neil kinnock has official...  politics  
2  the tories failed to hold onto power in 1974 a...  politics  
3  gordon brown is to freeze petrol duty increase...  politics  
4  hundreds of ballot papers for the regional ass...  politics  



### **Task 1. Shingling function**

A class Shingling that constructs k–shingles of a given length k (e.g., 10) from a given document, computes a hash value for each unique shingle and represents the document in the form of an ordered set of its hashed k-shingles.
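
As a minimal sketch of the description above (shingle, hash, order), the helper below returns the sorted unique CRC32 hashes of a document's k-shingles. The name `hashed_shingles` and the sample string are illustrative assumptions, not part of the assignment; CRC32 is used only because the notebook already imports `binascii`.

```python
import binascii

def hashed_shingles(text, k=10):
    """Return the sorted unique CRC32 hashes of all k-shingles of `text`."""
    # There are len(text) - k + 1 start positions, so the last shingle is included.
    shingles = {text[i:i + k] for i in range(len(text) - k + 1)}
    # A set removes duplicate hashes; sorting yields the "ordered set".
    return sorted({binascii.crc32(s.encode("utf-8")) for s in shingles})

# Hypothetical sample document
hashes = hashed_shingles("the quick brown fox", k=5)
print(len(hashes), hashes[:3])
```

A text shorter than `k` yields an empty list, which callers may want to handle before computing similarities.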

In [5]:
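def shingling(k=5,txt=""): 
  # Collect every substring of length k (a k-shingle); the set removes duplicates.
  # NOTE: the range stops at len(txt)-k, so the final shingle txt[-k:] is not produced.
  # range(len(txt)-k+1) would include it; the outputs shown in this notebook use the bound below.
  arr = []
  for i in range(0,len(txt)-k):
    arr.append(txt[i:i+k])
  return set(arr)
  
def toHash(shing_list): 
  # Map each shingle to a 32-bit integer via CRC32.
  return [binascii.crc32(x.encode('utf-8')) for x in shing_list]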

shinglesDoc1 = shingling(8,df.iloc[0].text)
shinglesDoc2 = shingling(8,df.iloc[1].text)
print(shinglesDoc1)

{' taxes c', 'h wales ', 'omes wer', ' upwards', 'nd due t', 'hey have', 'n bob we', 'aid: wal', 't. the £', 'ded by c', 'gordon b', 'inance s', 'nister s', 'isappoin', 'rman sai', 'h local ', 'ue to re', 'h. mike ', ' aware o', ' for loc', 'ged coun', 'hree hom', 'expendit', 'ugely di', 'effect o', 'she said', 'r german', 'd reband', 'crat cou', 'nt hugel', 'y face i', 'se the f', 'nities a', 'lsh coun', ' propert', 'on. he a', 'an for f', 'pending ', 't depriv', 'in wales', ' to keep', 'on servi', ' have mo', 'les and ', 'd a thir', 'ears and', '7.4m fro', 'rived co', 'dded: ar', 'erman sa', 'o the vi', 'wn. but ', "'s local", 'wellingt', 'vels, ev', ' able to', 'nment. t', ' often c', 'ven an a', 'ouncil t', 'is anger', 'ant cuts', ' cuts in', 'ean stee', 'sh liber', 'sembly g', 'enables ', ' househo', 'ition pa', 'utmost t', 'ties to ', "cymru's ", 'create a', 'at for t', ' funds. ', 'ax, part', 'ties. sh', 'unds. ms', 'ities ha', ' been sh', 'r local ', 'es due t', 'ses were', 'ir

### **Task 2. CompareSets function**
A class CompareSets computes the Jaccard similarity of two sets of integers – two sets of hashed shingles.

#### 2.1 For two Sets 
This is an efficient implementation for comparing two sets of shingles.

In [6]:
def CompareSets(set1, set2):
    # Jaccard similarity: |intersection| / |union|
    set1, set2 = set(set1), set(set2)
    intersection = len(set1 & set2)
    union = (len(set1) + len(set2)) - intersection
    similarity = float(intersection) / union
    return similarity

# Example
similarity = CompareSets(shinglesDoc1, shinglesDoc2)

In [None]:
%%time
print(CompareSets(shinglesDoc1, shinglesDoc2))

0.014418329053920601
CPU times: user 3.93 ms, sys: 0 ns, total: 3.93 ms
Wall time: 11.1 ms


#### 2.2 With Characteristic Matrix

Note: the task does not require these steps, but they help in understanding the concepts discussed in the lecture. As the number of documents grows, the matrix becomes sparser, which degrades performance.

1. Create the characteristic matrix.
2. Apply Jaccard similarity to two columns (two documents).
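
The second step can be sketched in vectorized form: the similarity of two 0/1 columns is the number of rows where both are 1 divided by the number of rows where at least one is 1. The function name `jaccard_columns` and the toy matrix are assumptions made for illustration only.

```python
import numpy as np

def jaccard_columns(col1, col2):
    """Jaccard similarity of two 0/1 columns of a characteristic matrix."""
    c1 = np.asarray(col1, dtype=bool)
    c2 = np.asarray(col2, dtype=bool)
    both = np.count_nonzero(c1 & c2)     # rows with 1 in both columns
    either = np.count_nonzero(c1 | c2)   # rows with 1 in at least one column
    return both / either if either else 0.0

# Toy characteristic matrix: 5 shingles (rows) x 2 documents (columns)
M = np.array([[1, 1],
              [1, 0],
              [0, 1],
              [0, 0],
              [1, 1]])
print(jaccard_columns(M[:, 0], M[:, 1]))  # 2 shared rows / 4 non-empty rows = 0.5
```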

In [7]:

"""
Visualize a collection of sets as a characteristic matrix
Rows correspond to elements (shingles) of the universal set.
Columns correspond to sets (documents)
p. 24 
"""

def createCharacteristicMatrix(docs):
  # One row per hashed shingle, one 0/1 column ("s0", "s1", ...) per document.
  sMatrix = pd.DataFrame([],columns=["shingle"])
  for idx, i in enumerate(docs):
    sh = shingling(txt=i)  # uses the default k=5
    hsh = pd.DataFrame(toHash(sh),columns=["shingle"])
    bol_vals = pd.DataFrame([1] * len(hsh), columns=["s"+str(idx)])
    sRow = pd.concat((hsh, bol_vals), axis=1)
    # The outer merge keeps the shingles of all documents; missing entries become 0.
    sMatrix = pd.merge(sMatrix, sRow, on='shingle', how='outer').fillna(0)
  return sMatrix

"""
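Row types for two columns C1 and C2:
C1 C2 type
1  1  a   1 in both columns
1  0  b   columns are different
0  1  c   columns are different
0  0  d   0 in both columns
Denote
A = the number of rows of type a,
B = the number of rows of type b, etc.
Sim(C1, C2) = A / (A + B + C)
p. 26
"""

# Column 0 holds the shingle hash; document columns start at position 1.
def calculateSimJac(charMatrix,docCol1=1, docCol2=2, p=True): 
  # a counts rows of type a, b counts rows of type b or c (exactly one column is 1),
  # c counts rows of type d (0 in both columns).
  a=0; b=0; c=0; 
  for i in charMatrix.iterrows():
    # i[0] --> row label; i[1] --> row values: position 0 is the shingle hash, position 1 the first document
    dc0 = i[1].iloc[docCol1]; dc1 = i[1].iloc[docCol2]
    if dc0 == 1 and dc1 == 1: a+=1
    elif dc0 == 1 or dc1 == 1: b+=1
    else: c += 1 
  if p : print(f"A:B:C -> {a,b,c}, Sum of A+B+C -> {a+b+c}, CheckLen of Matrix -> {len(charMatrix)}")
  # Rows that are 0 in both columns (counted in c) do not enter the similarity.
  jacSim = a / (a+b)
  return jacSim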


print(df.text[0:10])
cM = createCharacteristicMatrix(df.text[0:10])
print(cM)
jacSim = calculateSimJac(cM,1,3)
print("jacSimiliarity:", jacSim)



0    welsh councils should set their taxes at reaso...
1    former labour leader neil kinnock has official...
2    the tories failed to hold onto power in 1974 a...
3    gordon brown is to freeze petrol duty increase...
4    hundreds of ballot papers for the regional ass...
5    billions of pounds spent on conflict in iraq a...
6    the leader of the british national party has b...
7    it's become commonplace to argue that blair an...
8    ken livingstone should stick to his guns and n...
9    tony blair's feud with gordon brown is damagin...
Name: text, dtype: object
          shingle   s0   s1   s2   s3   s4   s5   s6   s7   s8   s9
0      1153712461  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
1      1925331429  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2      3779788561  1.0  0.0  1.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0
3      1664649880  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
4      3389403929  1.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0
...           ..

In [8]:
text1 =  "this is a test to understand better what we are doing! yippie!"
text2 =  "this is a test to understand better what we are doing! yippie!"
text3 =  "this is a test to not forget  what we are doing! yippie! Same but different"


test_mat =  pd.DataFrame([text1,text2,text3], columns=["text"])
#print(test_mat.text)
cM_test_1 = createCharacteristicMatrix(test_mat.text)
print(cM_test_1)
jacSim_test_1 = calculateSimJac(cM_test_1,1,2)
jacSim_test_2 = calculateSimJac(cM_test_1,1,3)
print("jacSimiliarity:", jacSim_test_1, "and", jacSim_test_2)


       shingle   s0   s1   s2
0    840883218  1.0  1.0  1.0
1   3769731328  1.0  1.0  1.0
2   2666131820  1.0  1.0  0.0
3   2793112504  1.0  1.0  1.0
4    123783608  1.0  1.0  1.0
..         ...  ...  ...  ...
86  4021045262  0.0  0.0  1.0
87  2059753855  0.0  0.0  1.0
88  3541609251  0.0  0.0  1.0
89  2698009159  0.0  0.0  1.0
90  3834743206  0.0  0.0  1.0

[91 rows x 4 columns]
A:B:C -> (57, 0, 34), Sum of A+B+C -> 91, CheckLen of Matrix -> 91
A:B:C -> (36, 55, 0), Sum of A+B+C -> 91, CheckLen of Matrix -> 91
jacSimiliarity: 1.0 and 0.3956043956043956


### **Task 3. Minhashing**
A class MinHashing that builds a minHash signature (in the form of a vector or a set) of a given length n from a given set of integers (a set of hashed shingles).

#### 3.1 Hash Signature with Permutations

In [9]:
%%time
def createPermSignVect(index,vec): 
  # Return the index label of the first row (in the given order) whose entry is 1.
  for e in zip(index,vec): 
    if e[1] == 1: return e[0] 


def signMatrixPermutations(charMatrix, permutations, createVec): 
  assert permutations > 0
  signMatrix = pd.DataFrame(columns=charMatrix.columns[1:]) 
  for _ in range(permutations): 
    # Randomly permute all rows; the original index labels travel with the rows.
    shuffled = charMatrix.sample(frac=1)
    ll = []
    for e in shuffled.columns[1:]:
      ll.append(createVec(shuffled.index,shuffled[e])) 
    # One permutation produces one row of the signature matrix.
    signMatrix.loc[len(signMatrix)] = ll
  return signMatrix
    

#createPermSignVect(cM.index,cM["s0"])
signMatPer = signMatrixPermutations(cM,100,createPermSignVect) 
print(signMatPer)

      s0    s1    s2    s3    s4    s5    s6     s7     s8     s9
0   1214  2705  4560  5942  7935  8584  9560  10935  12380  13235
1   1170  2486  4928  3708  7870  4657  7870  10660  12129  12646
2   1764  1079  4792  2219  8075  4445  4445   1764   2219  13418
3   1204  2443  4178  6303  8101  8465  2443  10477   3236  13307
4   2069   730  5047  5553  4886  5047   730   5047  11914  11914
..   ...   ...   ...   ...   ...   ...   ...    ...    ...    ...
95  1471  3669  4266  6515  4600  8589  9591  11554  11898  13214
96  2148  2577  5457  5585  7977  9140  8166  10246   2148  12876
97   948  2598  4979  6648  7685  1954  9393   9892   5238  12828
98  1897  2965  3853  1897  7431  8283  9191   9868  11857   3183
99   197  3594  5522  7057  7600  3594  9386   3594  11657   3594

[100 rows x 10 columns]
CPU times: user 570 ms, sys: 0 ns, total: 570 ms
Wall time: 708 ms
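
The permutation approach above can also be sketched directly on a NumPy 0/1 matrix. This is an illustrative variant, not the notebook's implementation: `permutation_signatures`, its `seed` parameter, and the toy matrix are assumptions, and every column is assumed to contain at least one 1.

```python
import numpy as np

def permutation_signatures(char_matrix, n_perm, seed=0):
    """MinHash signatures from explicit row permutations of a 0/1 matrix."""
    M = np.asarray(char_matrix, dtype=bool)
    rng = np.random.default_rng(seed)
    sig = np.empty((n_perm, M.shape[1]), dtype=np.int64)
    for p in range(n_perm):
        order = rng.permutation(M.shape[0])   # one random row order
        first = np.argmax(M[order], axis=0)   # first True per column in that order
        sig[p] = order[first]                 # translate back to original row ids
    return sig

# Toy matrix: columns 0 and 1 are identical, column 2 is disjoint from both
toy = np.array([[1, 1, 0],
                [0, 0, 1],
                [1, 1, 0],
                [0, 0, 1]])
sig = permutation_signatures(toy, n_perm=50)
print((sig[:, 0] == sig[:, 1]).mean())  # identical columns always agree: 1.0
print((sig[:, 0] == sig[:, 2]).mean())  # disjoint columns never agree: 0.0
```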


#### MinHashing with Hash Functions

From https://mccormickml.com/2015/06/12/minhash-tutorial-with-python-code/: <br/>
So here’s how you compute the MinHash signature for a given document. Generate, say, 10 random hash functions. Take the first hash function, and apply it to all of the shingle values in a document. Find the minimum hash value produced (hey, “minimum hash”, that’s the name of the algorithm!) and use it as the first component of the MinHash signature. Now take the second hash function, and again find the minimum resulting hash value, and use this as the second component. And so on.
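
The procedure quoted above can be sketched in a few lines. The sketch assumes hash functions of the form h(x) = (a*x + b) % c with c = 4294967311, the prime just above 2**32; the helper names, the seeds, and the random test sets are illustrative assumptions.

```python
import random

PRIME = 4294967311  # prime just above 2**32, so every 32-bit shingle hash is below it

def make_hash_params(n, seed=0):
    """Draw n distinct (a, b) pairs for h(x) = (a*x + b) % PRIME."""
    rng = random.Random(seed)
    params = set()
    while len(params) < n:
        params.add((rng.randint(1, PRIME - 1), rng.randint(0, PRIME - 1)))
    return sorted(params)

def minhash_signature(hashed_shingles, params):
    """One component per hash function: the minimum hash over the set."""
    return [min((a * x + b) % PRIME for x in hashed_shingles) for a, b in params]

# Two random sets of 60 values each that share 30 values (exact Jaccard = 30/90)
rng = random.Random(1)
values = rng.sample(range(2**32), 90)
set_a, set_b = set(values[:60]), set(values[30:])

params = make_hash_params(200)
sig_a = minhash_signature(set_a, params)
sig_b = minhash_signature(set_b, params)
estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(params)
print("exact:", 30 / 90, "estimate:", estimate)
```

Unlike the cells below, this sketch stores the minimum hash value itself rather than the row position where the minimum occurs.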


In [10]:

''' 
Take k (e.g., k = 100) independent hash functions, e.g., 
h(x) = (ax + b) % c
'''
# Create the coefficients (a, b, c) of the hash functions
def createMinHashCoeff(amount): 

  # own values: a = random.randint(1,100000); b = random.randint(1,100000); c = sympy.randprime(2,97)
  # according to: https://mccormickml.com/2015/06/12/minhash-tutorial-with-python-code/
  maxShingleID = 2**32-1
  # Prime list: http://compoasso.free.fr/primelistweb/page/prime/liste_online_en.php
  # 4294967311 is the next prime above 2**32-1.
  # 13570 is the number of shingles we have in nine documents.
  # NOTE: 13570 is not prime, despite the variable name.
  nextPrime = 13570

  assert amount > 0 
  arr = []
  for _ in range(amount): 
    a = random.randint(1, maxShingleID); b = random.randint(1,maxShingleID); c = nextPrime
    arr.append((a,b,c))
  cf = set(arr)   
  # Draw again in the unlikely case of duplicate coefficient triples.
  return cf if len(cf) == amount else createMinHashCoeff(amount)



In [11]:
''' 
Take k (e.g., k = 100) independent hash functions, e.g., 
h(x) = (ax + b) % c
'''

### Signature of a single document (one column of the characteristic matrix)
"""
What you do is hash each element of A and each element of B, and look at the minimum hash value of each set. 
As it turns out, the probability of these two hashes being equal is exactly the Jaccard similarity of the two sets. Wow -- that is elegant!
http://web.eecs.utk.edu/~jplank/plank/classes/cs494/494/notes/Min-Hash/index.html
"""
def createMinHashRow(hashedShingle,col,index,coeffList):
  hcol = []
  for coeffs in coeffList: 
    # Track the minimum hash value and the row position where it occurs.
    # Other implementations store min_val instead of min_pos; the test scores remained the same.
    min_val = math.inf; min_pos = math.inf
    for i in zip(hashedShingle,col,index):
      if i[1] == 1: 
        # h(x) = (a*x + b) % c
        h = (int(coeffs[0])*int(i[0]) + int(coeffs[1])) % int(coeffs[2])  
        if min_val > h:
          min_val = h 
          min_pos = i[2] 
    # Store the row position of the minimum rather than the minimum itself.
    hcol.append(min_pos)
  return hcol


# Usage for a single document column (kept for reference; `coeffs` is created in the next cell):
# mHsCol = createMinHashRow(cM["shingle"].values,cM["s0"].values,cM.index, coeffs)
# mHsCol1 = createMinHashRow(cM["shingle"].values,cM["s1"].values,cM.index, coeffs)
# with own randomly set coefficients:
# [0,0,0,0]
# [0,0,0,0]
# print(mHsCol)

In [12]:
%%time 
"""
Thus, we can form from matrix M a signature matrix, in which
the ith column of M is replaced by the minhash signature for (the set of) the
ith column.
Note that the signature matrix has the same number of columns as M but
only n rows. Even if M is not represented explicitly, but in some compressed
form suitable for a sparse matrix (e.g., by the locations of its 1’s), it is normal
for the signature matrix to be much smaller than M
"""

# CREATE COEFFICIENTS FOR HASHFUNCTIONS
coeffs = createMinHashCoeff(100)

def signMatrix(minHashRow,charMatrix,coeffList):
  # Build one signature column per document column of the characteristic matrix.
  # Optionally, rows could be labelled per hash function, e.g.
  # pd.DataFrame([f"h({i})" for i in range(1,len(coeffList)+1)], columns=["hashfunction"])
  signMat = pd.DataFrame([])
  for e in charMatrix.columns[1:]:
    signMat[e] = minHashRow(charMatrix["shingle"].values,charMatrix[e],charMatrix.index, coeffList)
  return signMat

signMat = signMatrix(createMinHashRow, cM,coeffs)
print(signMat)

      s0    s1    s2    s3    s4    s5    s6     s7     s8     s9
0    150  3564  3472  5761  7772  9093  9323   5613    729  12955
1    859  3800   859   859  7969   859  4010   6047   1000    859
2   1674  3499  4218  6307  7649  8379  9514   6307   1674  13110
3    923  3644  4608  6838  7653  1684  9597  10661  12255   9597
4   1381  2707  1790  2707  3765  8376  9300   1790   4931   3311
..   ...   ...   ...   ...   ...   ...   ...    ...    ...    ...
95   912   982  4358  4616  7976  4358  7793  10728   4616   4616
96   213  3030  3014  6938  7447  8681  5184   6938   3014  13425
97   293  3606  4982  4085  7938  8374  9550  11347   6079  12738
98  1622  2921   689  1622  1225   689  9437  10118   2921  13205
99  1482  1348  3855  6132  1348  1348  9609   1348   1348   1348

[100 rows x 10 columns]
CPU times: user 6.89 s, sys: 0 ns, total: 6.89 s
Wall time: 7.6 s


### **Task 4. Compare Signatures**
A class CompareSignatures estimates the similarity of two integer vectors – minhash signatures – as a fraction of components in which they agree.

In [18]:
def getSigSim(a,b): 
  assert len(a) == len(b)
  count = 0
  for i in zip(a,b):
     if i[0] == i[1]: count +=1 
  return count / len(a)


sh1 = signMat["s0"].values; sh2 = signMat["s1"].values
sp1 = signMatPer["s0"].values; sp2 = signMatPer["s1"].values

sH = getSigSim(sh1,sh2)
sP = getSigSim(sp1,sp2)
# The two estimates are often similar.
# sH and sP change between runs because of the randomness.

print(sH, sP, "for jacSim:",calculateSimJac(cM,1,2, p=False))
# Recorded runs (sH, sP):
# For 300 hash functions / permutations 0.12056737588652482 0.10909090909090909
# For 600 hash functions / permutations 0.14145383104125736 0.08548707753479125
# For 100 hash functions / permutations 0.10204081632653061 0.10309278350515463


#With test matrix
coeffs_test = createMinHashCoeff(1000)
signMatTest = signMatrix(createMinHashRow, cM_test_1,coeffs_test)
signMatPerTest = signMatrixPermutations(cM_test_1,1000,createPermSignVect) 

sht1 = signMatTest["s0"].values; sht2 = signMatTest["s1"].values;  sht3 = signMatTest["s2"].values
spt1 = signMatPerTest["s0"].values; spt2 = signMatPerTest["s1"].values;  spt3 = signMatPerTest["s2"].values

sHt = getSigSim(sht1,sht2)
sPt = getSigSim(spt1,spt2)

# Check that identical documents yield a similarity of 1.0.
print(sHt, sPt, "for jacSim:",calculateSimJac(cM_test_1,1,2, p=False))
# For partly similar strings:
sHt1 = getSigSim(sht1,sht3)
sPt1 = getSigSim(spt1,spt3)
print("NOTE, we set a very high hash / permutation number for test --> very accurate results")
print(sHt1, sPt1, "for jacSim:",calculateSimJac(cM_test_1,1,3, p=False))

0.01 0.05 for jacSim: 0.08534031413612565
1.0 1.0 for jacSim: 1.0
NOTE, we set a very high hash / permutation number for test --> very accurate results
0.414 0.411 for jacSim: 0.3956043956043956
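
To put the gap between the estimates and the exact Jaccard values above in context, the sketch below computes the standard error of a signature-based estimate. It rests on an idealizing assumption that is not stated in the notebook: each of the n signature components agrees independently with probability equal to the true Jaccard similarity. The function name `minhash_std_error` is hypothetical.

```python
import math

def minhash_std_error(jaccard, n):
    """Standard error of the agreeing fraction over n independent components."""
    return math.sqrt(jaccard * (1.0 - jaccard) / n)

# Exact similarities taken from the outputs above, rounded
for jaccard, n in ((0.0853, 100), (0.3956, 1000)):
    print(f"J={jaccard}, n={n}: +/- {minhash_std_error(jaccard, n):.4f}")
```

Under this assumption the error shrinks with the square root of n, which is consistent with the 1000-component test signatures tracking the exact value more closely than the 100-component ones.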


### Locality-Sensitive Hashing (LSH)

(Optional task for extra 2 bonus points) A class LSH that implements the LSH technique: given a collection of minhash signatures (integer vectors) and a similarity threshold t, the LSH class (using banding and hashing) finds candidate pairs of signatures agreeing on at least a fraction t of their components.
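
One way to read this description is sketched below, assuming the standard banding scheme: each signature is cut into b bands of r rows, each band is hashed to a bucket (here simply by using the band's tuple as a dictionary key), and two documents become a candidate pair if they share a bucket in at least one band. Candidates are then checked against the threshold t. All names and the toy signatures are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    """signatures: dict doc_id -> sequence of ints of length b * r."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            if len(sig) != b * r:
                raise ValueError("signature length must equal b * r")
            # The band's tuple acts as the bucket key for this band.
            key = tuple(sig[band * r:(band + 1) * r])
            buckets[key].append(doc_id)
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates

def similar_pairs(signatures, b, r, t):
    """Keep the candidate pairs whose signatures agree on at least a fraction t."""
    result = []
    for d1, d2 in sorted(lsh_candidate_pairs(signatures, b, r)):
        s1, s2 = signatures[d1], signatures[d2]
        sim = sum(x == y for x, y in zip(s1, s2)) / len(s1)
        if sim >= t:
            result.append((d1, d2, sim))
    return result

# Toy signatures of length 6, split into b=2 bands of r=3 rows
sigs = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 3, 9, 9, 9],
    "c": [7, 7, 7, 7, 7, 7],
}
print(lsh_candidate_pairs(sigs, b=2, r=3))   # {('a', 'b')}
print(similar_pairs(sigs, b=2, r=3, t=0.5))  # [('a', 'b', 0.5)]
```

Only documents that share a bucket are compared, so the number of full signature comparisons can stay well below the number of all pairs.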

In [22]:
# See https://www.youtube.com/watch?v=e_SBq3s20M8&ab_channel=JamesBriggs (around 21 min)
def splitVec(vec,b,r): 
  # Split a signature into b bands of r rows each.
  assert len(vec) == b * r
  subvecs = []
  for i in range(0,len(vec),r):
    subvecs.append(vec[i:i+r])
  return subvecs
    
# BBC data: signatures with 100 rows -> 20 bands of 5 rows
b1 = splitVec(signMat["s1"].values,20,5)
b2 = splitVec(signMat["s2"].values,20,5)

# Test data: signatures with 1000 rows -> 200 bands of 5 rows
t1 = splitVec(signMatTest["s1"].values,200,5)
t2 = splitVec(signMatTest["s2"].values,200,5)
print(t1)
print(t1)

[array([21, 34, 40, 35, 45]), array([ 8, 37, 33, 31, 20]), array([23, 21, 54, 18, 12]), array([25, 14, 40, 33,  2]), array([50, 10, 32,  3, 33]), array([38,  1, 48, 15, 53]), array([20,  2, 22, 16, 26]), array([55, 40,  9, 23, 14]), array([18, 14, 25, 46, 54]), array([11, 56, 31, 33, 41]), array([42, 22, 15, 55,  3]), array([52, 23, 44, 12, 41]), array([23,  5,  2,  3, 29]), array([51, 49, 51, 33, 17]), array([24,  4, 33, 43, 19]), array([ 5, 55, 22, 55,  6]), array([41,  0, 19, 45, 12]), array([17, 48, 42, 14, 29]), array([37, 47, 52, 35,  8]), array([25, 56, 22, 30, 32]), array([43, 11, 16, 21, 23]), array([18, 24, 42, 25,  9]), array([56, 46, 35, 32, 44]), array([ 3, 25, 23, 17, 14]), array([32, 56, 31, 26, 35]), array([ 6, 32, 48, 54, 53]), array([31, 11,  1, 47, 25]), array([16, 40,  8, 19, 29]), array([ 0, 41, 43, 25, 19]), array([33, 21, 54, 14, 40]), array([12, 27, 44, 52, 23]), array([ 5, 14, 11, 36, 39]), array([ 4, 47, 26, 43, 54]), array([56, 34, 34,  3, 15]), array([36, 16

In [21]:
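def getCandidatePairs(v1,v2, t):
  # v1, v2: lists of bands (see splitVec). The pair counts as a candidate as soon as
  # one band agrees in more than a fraction t of its rows.
  # NOTE: standard LSH banding requires a band to agree in all of its rows.
  assert len(v1) == len(v2)
  for e in zip(v1,v2): 
    # getSigSim, used above for whole signatures, is reused per band
    if getSigSim(e[0],e[1]) > t: 
      return True
  return False

# Is the pair a candidate?
# BBC data
print("t = 0.5",getCandidatePairs(b1,b2,0.5), "| t = 0.1",getCandidatePairs(b1,b2,0.1))
# Test data
print("t = 0.9",getCandidatePairs(t1,t2,0.9), "| t = 0.1",getCandidatePairs(t1,t2,0.1))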

t = 0.5 False | t = 0.1 True
t = 0.9 True | t = 0.1 True
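
For choosing b and r, the sketch below evaluates the usual banding formulas. It assumes the standard rule that a pair becomes a candidate only if at least one band agrees in all r rows, which differs from the fraction-based check in `getCandidatePairs`. Under that rule, a pair with signature similarity s becomes a candidate with probability 1 - (1 - s^r)^b, and the threshold lies roughly at (1/b)^(1/r). The function names are hypothetical.

```python
def candidate_probability(s, b, r):
    """Probability that at least one of b bands of r rows agrees completely."""
    return 1.0 - (1.0 - s ** r) ** b

def approx_threshold(b, r):
    """Similarity at which the candidate probability rises most steeply (approximation)."""
    return (1.0 / b) ** (1.0 / r)

b, r = 20, 5  # the banding used for the BBC signatures above
print("approximate threshold:", round(approx_threshold(b, r), 3))
for s in (0.1, 0.4, 0.8):
    print(s, round(candidate_probability(s, b, r), 4))
```

More bands (larger b) lower the threshold and admit more candidates; more rows per band (larger r) raise it.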


#### Test Program

In [81]:
# Earlier variant based on banding (kept for reference, not used below):
# def findSimilarDocuments(signMatrix,b,r,t): 
#   similarItems = []
#   vecs  = [splitVec(signMatrix[e].values,b,r) for e in signMatrix.columns]
#   for i in range(len(vecs)):
#     for j in range(i+1,len(vecs)): 
#       if getCandidatePairs(vecs[i],vecs[j],t):
#         similarItems.append({i:j})
#   return similarItems
#
# sI = findSimilarDocuments(signMat,20,5,0.75)
# print(sI)

def findSimilarDocuments(signMatrix, t): 
  similarItems = []

  for i in range(len(signMatrix.columns)):
    for j in range(i+1,len(signMatrix.columns)): 
      if getSigSim(signMatrix["s"+str(i)].values, signMatrix["s"+str(j)].values) > t: 
         similarItems.append({i:j})
  return similarItems

print(len(signMat.columns))
sI010 = findSimilarDocuments(signMat,0.10)
sI015 = findSimilarDocuments(signMat,0.15)
sI030 = findSimilarDocuments(signMat,0.30)
print("Similar documents with 0.10 similarity:", sI010)
print("Similar documents with 0.15 similarity:", sI015)
print("Similar documents with 0.30 similarity:", sI030)

10
Similar documents with 0.10 similarity: [{0: 2}, {1: 2}, {1: 8}, {1: 9}, {2: 3}, {2: 5}, {2: 8}, {2: 9}, {3: 5}, {3: 9}, {5: 8}, {5: 9}, {7: 9}, {8: 9}]
Similar documents with 0.15 similarity: [{2: 5}, {2: 9}, {7: 9}]
Similar documents with 0.30 similarity: []
