# CopyCat: Example Usage on the Web Track Runs

- Use cases of CopyCat:
  - Deduplicate complete corpora on Spark/Hadoop
  - Deduplicate run/qrel files on your laptop

In [1]:
# Show Usage of copy cat.
!../bash/copy-cat.sh -h

Error: Could not find or load main class de.webis.cikm20_duplicates.app.DeduplicateTrecRunFile
Caused by: java.lang.ClassNotFoundException: de.webis.cikm20_duplicates.app.DeduplicateTrecRunFile


## SERP Deduplication

- Situation:
  - Near-Duplicates are removed from the Index
    - With SimHash (3+5-grams) + Canonical-URL

- Is it still necessary to remove near-duplicates from the SERP?
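
To make the index-side filter concrete, the following is a minimal SimHash sketch over word 3-grams and 5-grams. It is not CopyCat's implementation: the tokenization, the MD5-based feature hash, the 64-bit fingerprint size, and all function names are assumptions chosen for illustration. If similarity is read as the fraction of agreeing bits, a cut-off of 0.94 on 64 bits would allow at most 3 differing bits.

```python
import hashlib

def word_ngrams(text, n):
    """Return the word n-grams (joined by spaces) of a lower-cased text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def simhash(features, bits=64):
    """Compute a SimHash fingerprint from an iterable of string features."""
    counts = [0] * bits
    for feature in features:
        # Assumption: MD5 is used only as a convenient, stable feature hash.
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set when the majority of features set it.
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def simhash_similarity(a, b, bits=64):
    """Fraction of fingerprint bits on which two fingerprints agree."""
    return 1.0 - bin(a ^ b).count("1") / bits

def fingerprint_3_5_grams(text):
    """Fingerprint over the union of word 3-grams and 5-grams (illustrative)."""
    return simhash(word_ngrams(text, 3) + word_ngrams(text, 5))

doc_a = "copy cat removes near duplicates from search engine result pages"
doc_b = "copy cat removes near duplicates from search engine result lists"
print(simhash_similarity(fingerprint_3_5_grams(doc_a), fingerprint_3_5_grams(doc_b)))
```

Short texts yield few n-grams, so the score on the two toy sentences is noisy; real documents provide many more features per fingerprint.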


## WIP: SERP Deduplication for a single run

In [None]:
# Deduplicate a single run-file (please remove the output file to run copy-cat from scratch)

!../bash/copy-cat.sh \
    --output run-UAMSA10d2a8-deduplication.jsonl \
    --input input.UAMSA10d2a8 \
    --similarities "url s3 cosine(3+5-grams) cosine(8-grams) cosine(1-grams) simhash(1-grams) simhash(3+5-grams)" \
    --s3Threshold 0.7 \
    --threads 5 \
    --documents ChatNoirMapfiles

In [2]:
import json
import pandas as pd
import seaborn as sns
from tqdm import tqdm
from trectools import TrecQrel

qrels = TrecQrel('qrels.web.101-150.txt')

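def eval_with_threshold(threshold, run_file_name):
    # Builds one row per document pair in a CopyCat JSONL file. The S3 score
    # at `threshold` serves as the ground-truth near-duplicate label; the
    # remaining columns are the decisions of the approaches under evaluation.
    rows = []
    print("Process " + run_file_name)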
    with open(run_file_name) as jsonl_file:
        for jsonl in jsonl_file:
            dedup_data = json.loads(jsonl)
            topic = dedup_data['topic']
            judged_docs = qrels.get_document_names_for_topic(int(topic))
            
            for sim in tqdm(dedup_data['similarities']):
                is_judged = sim['firstId'] in judged_docs or sim['secondId'] in judged_docs
                is_relevant = False
                is_irrelevant = False
                
                if is_judged:
                    judgment_a = qrels.get_judgement(sim['firstId'], int(topic))
                    judgment_b = qrels.get_judgement(sim['secondId'], int(topic))
                    is_irrelevant = judgment_a == 0 or judgment_b == 0
                    is_relevant = not is_irrelevant
                
                rows += [{
                    'topic': topic,
                    'judged': is_judged,
                    'relevant': is_relevant,
                    'irrelevant': is_irrelevant,
                    'near-duplicate': sim['similarities']['s3'] >=  threshold,
                    'simhash(3+5-grams)': sim['similarities']['simhash(3+5-grams)'] > 0.94,
                    'simhash(1-grams)': sim['similarities']['simhash(1-grams)'] > 0.94,
                    'url': sim['similarities']['url'] > 0.5,
                    'copy-cat': ((sim['similarities']['simhash(3+5-grams)'] > 0.94) or ((sim['similarities']['url'] > 0.5) and (sim['similarities']['simhash(1-grams)'] > 0.94)))
                }]

    return rows

def eval_runs_with_threshold(threshold, run_files):
    rows = []
    for r in run_files:
        rows += eval_with_threshold(threshold, r)
    
    return pd.DataFrame(rows)
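
The record layout that `eval_with_threshold` reads can be inferred from the keys it accesses. The sketch below builds one such JSONL line and applies the same decision rule as the `copy-cat` column. The document IDs and scores are made up for illustration, and `copy_cat_decision` is a hypothetical helper name, not part of CopyCat.

```python
import json

# Hypothetical record: IDs and scores are invented for illustration only.
example_line = json.dumps({
    "topic": "101",
    "similarities": [{
        "firstId": "doc-a",
        "secondId": "doc-b",
        "similarities": {
            "s3": 0.9,
            "url": 1.0,
            "simhash(1-grams)": 0.97,
            "simhash(3+5-grams)": 0.91,
        },
    }],
})

def copy_cat_decision(scores, simhash_threshold=0.94, url_threshold=0.5):
    """Same rule as the 'copy-cat' column: SimHash(3+5-grams) alone,
    or a URL match combined with SimHash(1-grams)."""
    return (scores["simhash(3+5-grams)"] > simhash_threshold
            or (scores["url"] > url_threshold
                and scores["simhash(1-grams)"] > simhash_threshold))

record = json.loads(example_line)
pair = record["similarities"][0]
decision = copy_cat_decision(pair["similarities"])
print(pair["firstId"], pair["secondId"], decision)
```

In this example the 3+5-gram fingerprint alone falls below the cut-off, so the pair is flagged only because the URL matches and the 1-gram fingerprint agrees.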

In [None]:
%time
DEDUP_TARGET_DIR='../../../top-100-deduplication/'
ALL_DIRS=!ls $DEDUP_TARGET_DIR
ALL_DIRS = [DEDUP_TARGET_DIR + i for i in ALL_DIRS]


df = eval_runs_with_threshold(0.82, ALL_DIRS)

CPU times: user 6 µs, sys: 6 µs, total: 12 µs
Wall time: 21.2 µs



Process ../../../top-100-deduplication/input.2011SiftR1.jsonl



Process ../../../top-100-deduplication/input.2011SiftR2.jsonl



Process ../../../top-100-deduplication/input.2011SiftR3.jsonl



Process ../../../top-100-deduplication/input.DFalah11.jsonl



In [62]:
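from sklearn.metrics import precision_score as sk_precision_score
from sklearn.metrics import recall_score as sk_recall_score

def precision_score(df, approach):
    # Precision of `approach`, with the S3-based 'near-duplicate' column as ground truth.
    return "{:0.3f}".format(sk_precision_score(y_true=df['near-duplicate'], y_pred=df[approach]))

def recall_score(df, approach):
    # Recall of `approach`, with the S3-based 'near-duplicate' column as ground truth.
    return "{:0.3f}".format(sk_recall_score(y_true=df['near-duplicate'], y_pred=df[approach]))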


print('# top 100')
print('CopyCat (Precision):', precision_score(df, 'copy-cat'))
print('CopyCat (Recall):', recall_score(df, 'copy-cat'))


print('SimHash(3+5-grams) (Precision):', precision_score(df, 'simhash(3+5-grams)'))
print('SimHash(3+5-grams) (Recall):', recall_score(df, 'simhash(3+5-grams)'))

print('SimHash(1-grams) (Precision):', precision_score(df, 'simhash(1-grams)'))
print('SimHash(1-grams) (Recall):', recall_score(df, 'simhash(1-grams)'))


print('\n\n# Judged@top 100')
print('Copy-Cat (Precision):', precision_score(df[df['judged']], 'copy-cat'))
print('Copy-Cat (Recall):', recall_score(df[df['judged']], 'copy-cat'))

print('SimHash(3+5-grams) (Precision):', precision_score(df[df['judged']], 'simhash(3+5-grams)'))
print('SimHash(3+5-grams) (Recall):', recall_score(df[df['judged']], 'simhash(3+5-grams)'))

print('SimHash(1-grams) (Precision):', precision_score(df[df['judged']], 'simhash(1-grams)'))
print('SimHash(1-grams) (Recall):', recall_score(df[df['judged']], 'simhash(1-grams)'))


print('\n\n# Relevant@top 100')
print('Copy-Cat (Precision):', precision_score(df[(df['judged']) & (df['relevant'])], 'copy-cat'))
print('Copy-Cat (Recall):', recall_score(df[(df['judged']) & (df['relevant'])], 'copy-cat'))

print('SimHash(3+5-grams) (Precision):', precision_score(df[(df['judged']) & (df['relevant'])], 'simhash(3+5-grams)'))
print('SimHash(3+5-grams) (Recall):', recall_score(df[(df['judged']) & (df['relevant'])], 'simhash(3+5-grams)'))

print('SimHash(1-grams) (Precision):', precision_score(df[(df['judged']) & (df['relevant'])], 'simhash(1-grams)'))
print('SimHash(1-grams) (Recall):', recall_score(df[(df['judged']) & (df['relevant'])], 'simhash(1-grams)'))


# top 100
CopyCat (Precision): 0.999
CopyCat (Recall): 0.733
SimHash(3+5-grams) (Precision): 0.999
SimHash(3+5-grams) (Recall): 0.634
SimHash(1-grams) (Precision): 0.813
SimHash(1-grams) (Recall): 0.918


# Judged@top 100
Copy-Cat (Precision): 0.999
Copy-Cat (Recall): 0.765
SimHash(3+5-grams) (Precision): 0.998
SimHash(3+5-grams) (Recall): 0.650
SimHash(1-grams) (Precision): 0.958
SimHash(1-grams) (Recall): 0.951


# Relevant@top 100
Copy-Cat (Precision): 1.000
Copy-Cat (Recall): 0.742
SimHash(3+5-grams) (Precision): 1.000
SimHash(3+5-grams) (Recall): 0.531
SimHash(1-grams) (Precision): 0.946
SimHash(1-grams) (Recall): 0.988
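
Since the approaches trade precision against recall, a single combined figure can help when comparing them. The sketch below derives F1 from the precision/recall pairs printed under `# top 100`; the F1 values are not part of the original evaluation, and `f1` is a helper introduced here for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision/recall pairs copied from the '# top 100' output.
top_100 = {
    "CopyCat": (0.999, 0.733),
    "SimHash(3+5-grams)": (0.999, 0.634),
    "SimHash(1-grams)": (0.813, 0.918),
}

f1_scores = {name: f1(p, r) for name, (p, r) in top_100.items()}
for name, score in f1_scores.items():
    print("{} (F1): {:0.3f}".format(name, score))
```

F1 weights precision and recall equally; for deduplication, where a false positive removes a unique document, a precision-weighted measure may be the more appropriate choice.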


In [59]:
df[df['judged']]['judged'].unique()

array([ True])

In [36]:
%time
DEDUP_TARGET_DIR='../../../top-100-deduplication/'
ALL_DIRS=!ls $DEDUP_TARGET_DIR
ALL_DIRS = [DEDUP_TARGET_DIR + i for i in ALL_DIRS]


eval_runs_with_threshold(0.82, ALL_DIRS[0:1]).mean()

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.96 µs
Process ../../../top-100-deduplication/input.2011SiftR1.jsonl
Judged:  False ; relevant:  False ; irrelevant:  False
Judged:  False ; relevant:  False ; irrelevant:  False
Judged:  False ; relevant:  False ; irrelevant:  False
Judged:  True ; relevant:  False ; irrelevant:  True
...

Recall (SimHash 3+5-grams)       0.664184
Recall (SimHash 1-grams)         0.921189
Recall (url)                     0.324333
Recall (CopyCat)                 0.740088
Precision (SimHash 3+5-grams)    0.997952
Precision (SimHash 1-grams)      0.918107
Precision (Url)                  1.000000
Precision (CopyCat)              0.998264
dtype: float64

In [114]:
eval_with_threshold(0.82).mean()

Recall (SimHash 3+5-grams)       0.688893
Recall (SimHash 1-grams)         0.849880
Precision (SimHash 3+5-grams)    0.985294
Precision (SimHash 1-grams)      0.909336
dtype: float64

In [115]:
eval_with_threshold(0.90).mean()

Recall (SimHash 3+5-grams)       0.807388
Recall (SimHash 1-grams)         0.929377
Precision (SimHash 3+5-grams)    0.941176
Precision (SimHash 1-grams)      0.798057
dtype: float64

## Next Steps:

- Improve efficiency of "local" copy-cat
- Evaluations:
  - Recall:
    - @100, @1000, @Judged, @Relevant, @Irrelevant
  - At which positions do duplicates occur?
    - Duplicates within the top k
- Evaluation Notebooks (all available runs):
  - ClueWeb
  - Robust04
  - Argsme
- Relevance Transfer:
  - Deduplicate on the top-10000 results for each topic with CopyCat
- Finish Anserini-Integration
  - Use jigsaw modules to separate ChatNoir from Anserini
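
One possible starting point for the "duplicates within the top k" evaluation is sketched below. It assumes a ranking is available as a list of document IDs and near-duplicate pairs as ID tuples; the function name, the input shapes, and the toy data are all hypothetical and not part of CopyCat.

```python
def duplicates_in_top_k(ranking, duplicate_pairs, k):
    """Count documents in the top k that are near-duplicates of a
    higher-ranked document in the same ranking."""
    # Unordered pairs, so (a, b) and (b, a) are treated alike.
    dup = {frozenset(pair) for pair in duplicate_pairs}
    top = ranking[:k]
    count = 0
    for i, doc in enumerate(top):
        if any(frozenset((doc, earlier)) in dup for earlier in top[:i]):
            count += 1
    return count

# Toy data, invented for illustration.
ranking = ["d1", "d2", "d3", "d4"]
pairs = [("d1", "d3"), ("d4", "d2")]

for k in (2, 3, 4):
    print(k, duplicates_in_top_k(ranking, pairs, k))
```

Counting only the lower-ranked member of each pair matches the user's view of a SERP: the first occurrence is new content, every later copy is redundant.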