# CopyCat: Example Usage on the Web Track Runs

- Use cases of CopyCat:
  - Precision-oriented deduplication of complete corpora with Spark/Hadoop
  - Lossless near-duplicate detection on small document sets (run/qrel files) on your laptop, improving recall while maintaining high precision

In [37]:
EXPERIMENT_DIR='/mnt/ceph/storage/data-in-progress/data-research/web-search/'

def run_copy_cat(input_file, output_file, deduplication_depth):
    print('Process \'' + input_file + '\' to \'' + output_file + '\'.')
    !../bash/copy-cat.sh \
        --input  $input_file \
        --output $output_file \
        --similarities "url s3 cosine(3+5-grams) cosine(8-grams) cosine(1-grams) simhash(1-grams) simhash(3+5-grams) md5 text-profile" \
        --s3Threshold 0.6 \
        --threads 5 \
        --ranks $deduplication_depth \
        --documents ChatNoirMapfiles

def list_pretty_run_files(run_file_dir):
    ret = !ls $run_file_dir
    return [i.replace('.gz', '').replace('input.', '') for i in ret]

# TREC 20

In [38]:
# Deduplicate the run files submitted to the Web tracks with CopyCat

# The Web tracks took place between TREC 18 and TREC 23.
# Note: range(18, 23) excludes its end, so this loop covers TREC 18 through TREC 22.
web_tracks = range(18, 23)

for web_track in web_tracks:
    for deduplication_depth in [100]:
        run_file_dir=EXPERIMENT_DIR + 'web-search-trec/trec-system-runs/trec' + str(web_track) + '/web.adhoc/'
        output_dir=EXPERIMENT_DIR + 'SIGIR-21/sigir21-deduplicate-trec-run-files/trec' + str(web_track) + '-web.adhoc-top' + str(deduplication_depth)
        
        for run_file in list_pretty_run_files(run_file_dir):
            run_copy_cat(
                input_file=run_file_dir + '/input.' + run_file + '.gz',
                output_file=output_dir + '/' + run_file + '.jsonl',
                deduplication_depth=deduplication_depth
            )

Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec18/web.adhoc//input.ICTNETADRun3.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top100/ICTNETADRun3.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top100/ICTNETADRun3.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec18/web.adhoc//input.ICTNETADRun4.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top100/ICTNETADRun4.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top100/ICTNETADRun4.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-p

The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top100/Sab9wtBf2.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec18/web.adhoc//input.THUIR09An.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top100/THUIR09An.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top100/THUIR09An.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec18/web.adhoc//input.THUIR09LuTA.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top100/THUIR09LuTA.jsonl'.
The specified output '/mnt/ceph/storage/data-in-pr

Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec18/web.adhoc//input.UamsAw7an3.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top100/UamsAw7an3.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top100/UamsAw7an3.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec18/web.adhoc//input.UamsAwebQE10.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top100/UamsAwebQE10.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top100/UamsAwebQE10.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progres

The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top100/scutrun3.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec18/web.adhoc//input.twCSrs9N.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top100/twCSrs9N.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top100/twCSrs9N.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec18/web.adhoc//input.twCSrsR.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top100/twCSrsR.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-

Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec18/web.adhoc//input.yhooumd09BGM.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top100/yhooumd09BGM.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top100/yhooumd09BGM.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec19/web.adhoc//input.DFalah2010.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/DFalah2010.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/DFalah2010.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progres

The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/THUIR10QaHt.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec19/web.adhoc//input.THUIR10Str.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/THUIR10Str.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/THUIR10Str.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec19/web.adhoc//input.UAMSA10d2a8.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/UAMSA10d2a8.jsonl'.
The specified output '/mnt/ceph/storage/data-

Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec19/web.adhoc//input.irra10rob.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/irra10rob.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/irra10rob.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec19/web.adhoc//input.msrsv1.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/msrsv1.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/msrsv1.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-s

The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/utwente4.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec19/web.adhoc//input.utwente4SF.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/utwente4SF.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/utwente4SF.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec19/web.adhoc//input.york10wA1.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/york10wA1.jsonl'.
The specified output '/mnt/ceph/storage/data-in-prog

Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec20/web.adhoc//input.UWatMDSql.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec20-web.adhoc-top100/UWatMDSql.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec20-web.adhoc-top100/UWatMDSql.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec20/web.adhoc//input.UWatMDSqlt.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec20-web.adhoc-top100/UWatMDSqlt.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec20-web.adhoc-top100/UWatMDSqlt.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-re

The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec20-web.adhoc-top100/umassIFuse.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec20/web.adhoc//input.uogTrA45Nm.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec20-web.adhoc-top100/uogTrA45Nm.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec20-web.adhoc-top100/uogTrA45Nm.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec20/web.adhoc//input.uogTrA45Vm.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec20-web.adhoc-top100/uogTrA45Vm.jsonl'.
The specified output '/mnt/ceph/storage/data-in-

Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec21/web.adhoc//input.QUTparaTQEg1.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec21-web.adhoc-top100/QUTparaTQEg1.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec21-web.adhoc-top100/QUTparaTQEg1.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec21/web.adhoc//input.autoSTB.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec21-web.adhoc-top100/autoSTB.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec21-web.adhoc-top100/autoSTB.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-re

The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec21-web.adhoc-top100/utw2012c2.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec21/web.adhoc//input.utw2012fc1.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec21-web.adhoc-top100/utw2012fc1.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec21-web.adhoc-top100/utw2012fc1.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec21/web.adhoc//input.utw2012lda.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec21-web.adhoc-top100/utw2012lda.jsonl'.
The specified output '/mnt/ceph/storage/data-in-p

Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec22/web.adhoc//input.msr_alpha0.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec22-web.adhoc-top100/msr_alpha0.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec22-web.adhoc-top100/msr_alpha0.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec22/web.adhoc//input.msr_alpha0_95_4.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec22-web.adhoc-top100/msr_alpha0_95_4.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec22-web.adhoc-top100/msr_alpha0_95_4.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-i

The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec22-web.adhoc-top100/wistud.runA.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec22/web.adhoc//input.wistud.runB.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec22-web.adhoc-top100/wistud.runB.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec22-web.adhoc-top100/wistud.runB.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec22/web.adhoc//input.wistud.runC.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec22-web.adhoc-top100/wistud.runC.jsonl'.
The specified output '/mnt/ceph/storage/da

## SERP Deduplication

- Situation:
  - Near-duplicates are removed from the index
    - With SimHash (3+5-grams) + canonical URL

- Is it still necessary to remove near-duplicates from the SERP?
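
To make the index-side setup above more concrete, here is a minimal sketch of SimHash over word 3-grams and 5-grams. It is not CopyCat's implementation: the tokenization, the MD5-based feature hash, the 64-bit fingerprint size, and all function names are assumptions chosen for illustration. The similarity is taken to be the fraction of matching fingerprint bits; under that assumption, the `> 0.94` threshold used later in this notebook would tolerate at most 3 differing bits out of 64.

```python
import hashlib

def word_ngrams(text, n):
    """Return the word n-grams of a lower-cased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simhash(features, bits=64):
    """Compute a SimHash fingerprint over a list of string features."""
    weights = [0] * bits
    for feature in features:
        # Assumption: MD5 serves as the per-feature hash; any uniform hash would do.
        digest = hashlib.md5(feature.encode('utf-8')).digest()
        value = int.from_bytes(digest[:bits // 8], 'big')
        for bit in range(bits):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    # A fingerprint bit is set if the majority of features voted for it.
    return sum(1 << bit for bit in range(bits) if weights[bit] > 0)

def simhash_similarity(a, b, bits=64):
    """Fraction of identical bits between two fingerprints (1.0 means identical)."""
    return 1.0 - bin(a ^ b).count('1') / bits

def fingerprint_3_5_grams(text):
    """Fingerprint built from the union of word 3-grams and word 5-grams."""
    return simhash(word_ngrams(text, 3) + word_ngrams(text, 5))

doc_a = 'the quick brown fox jumps over the lazy dog near the river bank'
doc_b = 'the quick brown fox jumps over the lazy dog near the river shore'
similarity = simhash_similarity(fingerprint_3_5_grams(doc_a), fingerprint_3_5_grams(doc_b))
print('SimHash(3+5-grams) similarity: {:0.3f}'.format(similarity))
```

Longer n-grams make the fingerprint sensitive to word order, which is consistent with SimHash (3+5-grams) being the more precise and SimHash (1-grams) the more recall-oriented variant in the evaluation below.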


## WIP: SERP Deduplication for a single run

In [2]:
# Deduplicate a single run-file (please remove the output file to run copy-cat from scratch)

!../bash/copy-cat.sh \
    --output run-UAMSA10d2a8-deduplication.jsonl \
    --input input.UAMSA10d2a8 \
    --similarities "url s3 cosine(3+5-grams) cosine(8-grams) cosine(1-grams) simhash(1-grams) simhash(3+5-grams) md5 text-profile" \
    --s3Threshold 0.7 \
    --threads 5 \
    --documents ChatNoirMapfiles

The specified output 'run-UAMSA10d2a8-deduplication.jsonl' exists.
Skip...


In [3]:
import json
import pandas as pd
import seaborn as sns
from trectools import TrecQrel

qrels = TrecQrel('qrels.web.101-150.txt')

def eval_with_threshold(threshold, run_file_name):
    rows = []
    with open(run_file_name) as jsonl_file:
        for jsonl in jsonl_file:
            dedup_data = json.loads(jsonl)
            topic = dedup_data['topic']
            judged_docs = qrels.get_document_names_for_topic(int(topic))
            
            for sim in dedup_data['similarities']:
                is_judged = sim['firstId'] in judged_docs or sim['secondId'] in judged_docs
                is_relevant = False
                is_irrelevant = False
                
                if is_judged:
                    judgment_a = qrels.get_judgement(sim['firstId'], int(topic))
                    judgment_b = qrels.get_judgement(sim['secondId'], int(topic))
                    is_irrelevant = judgment_a == 0 or judgment_b == 0
                    is_relevant = not is_irrelevant
                
                rows += [{
                    'topic': topic,
                    'judged': is_judged,
                    'relevant': is_relevant,
                    'irrelevant': is_irrelevant,
                    'near-duplicate': sim['similarities']['s3'] >= threshold,
                    'simhash(3+5-grams)': sim['similarities']['simhash(3+5-grams)'] > 0.94,
                    'simhash(1-grams)': sim['similarities']['simhash(1-grams)'] > 0.94,
                    'url': sim['similarities']['url'] > 0.5,
                    'text-profile': sim['similarities']['text-profile'] > 0.5,
                    'md5': sim['similarities']['md5'] > 0.5,
                    'copy-cat': ((sim['similarities']['simhash(3+5-grams)'] > 0.94) or ((sim['similarities']['url'] > 0.5) and (sim['similarities']['simhash(1-grams)'] > 0.94))),
                    'copy-cat-tp': ((sim['similarities']['simhash(3+5-grams)'] > 0.94) or (sim['similarities']['text-profile'] > 0.5) or ((sim['similarities']['url'] > 0.5) and (sim['similarities']['simhash(1-grams)'] > 0.94)))
                }]

    return rows

def eval_runs_with_threshold(threshold, run_files):
    rows = []
    for r in run_files:
        rows += eval_with_threshold(threshold, r)
    
    return pd.DataFrame(rows)
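
The `copy-cat` and `copy-cat-tp` columns above encode a small decision rule over the per-pair similarity scores. The sketch below isolates that rule in one function so it can be tried on a single record. The function name, the constant names, and the example record are hypothetical; the thresholds (0.94 and 0.5) are the ones used in `eval_with_threshold`.

```python
SIMHASH_THRESHOLD = 0.94  # threshold applied to both SimHash similarities above
BINARY_THRESHOLD = 0.5    # threshold applied to url / text-profile / md5 above

def copy_cat_decision(similarities, use_text_profile=False):
    """Mirror the 'copy-cat' rule (or 'copy-cat-tp' if use_text_profile is set)."""
    simhash_long = similarities['simhash(3+5-grams)'] > SIMHASH_THRESHOLD
    url_and_simhash_short = (
        similarities['url'] > BINARY_THRESHOLD
        and similarities['simhash(1-grams)'] > SIMHASH_THRESHOLD
    )
    text_profile = (
        use_text_profile
        and similarities['text-profile'] > BINARY_THRESHOLD
    )
    return simhash_long or url_and_simhash_short or text_profile

# Hypothetical similarity record, shaped like sim['similarities'] in the JSONL output.
example = {
    'url': 1.0,
    's3': 0.85,
    'simhash(1-grams)': 0.97,
    'simhash(3+5-grams)': 0.91,
    'text-profile': 0.0,
    'md5': 0.0,
}

print(copy_cat_decision(example))                         # same URL and SimHash(1-grams) match
print(copy_cat_decision({**example, 'url': 0.0}))         # no signal fires
print(copy_cat_decision({**example, 'url': 0.0, 'text-profile': 1.0},
                        use_text_profile=True))           # text-profile match
```

The SimHash (1-grams) signal is only trusted in combination with a matching URL, presumably because it produces more false positives on its own than the other signals.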

In [4]:
DEDUP_TARGET_DIR='../../../top-100-deduplication/'
ALL_DIRS=!ls $DEDUP_TARGET_DIR
ALL_DIRS = [DEDUP_TARGET_DIR + i for i in ALL_DIRS]


df = eval_runs_with_threshold(0.82, ALL_DIRS)

In [11]:
def precision_score(df, approach):
    from sklearn.metrics import precision_score
    return "{:0.3f}".format(precision_score(y_true=df['near-duplicate'], y_pred=df[approach]))

def recall_score(df, approach):
    from sklearn.metrics import recall_score
    return "{:0.3f}".format(recall_score(y_true=df['near-duplicate'], y_pred=df[approach]))

def table_row(df, approach, approach_display_name):
    df_relevant_100 = df[(df['judged']) & (df['relevant'])]
    df_irrelevant_100 = df[(df['judged']) & (~df['relevant'])]

    return {
        'Approach': approach_display_name,
        'Precision (Top100)': precision_score(df, approach),
        'Recall (Top100)': recall_score(df, approach),
        'Precision (Relevant@Top100)': precision_score(df_relevant_100, approach),
        'Recall (Relevant@Top100)': recall_score(df_relevant_100, approach),
        'Precision (Irrelevant@Top100)': precision_score(df_irrelevant_100, approach),
        'Recall (Irrelevant@Top100)': recall_score(df_irrelevant_100, approach),
    }

def report_table(df):
    rows = []
    for approach, approach_display_name in [('copy-cat-tp', 'CopyCat'), ('simhash(1-grams)', 'SimHash(1-grams)'), ('simhash(3+5-grams)', 'SimHash(3+5-grams)'), ('text-profile', 'TextProfile') , ('md5', 'MD5')]:
        rows += [table_row(df, approach, approach_display_name)]
    ret = pd.DataFrame(rows)
    ret.set_index('Approach', inplace=True)
    ret.columns = pd.MultiIndex.from_tuples([
        ('Top100', 'Precision'), ('Top100', 'Recall'),
        ('Relevant@Top100', 'Precision'), ('Relevant@Top100', 'Recall'),
        ('Irrelevant@Top100', 'Precision'), ('Irrelevant@Top100', 'Recall'),
    ])

    return ret.reset_index()

print('Precision/Recall with S3 score as ground-truth (small cw09 sample):')
report_table(df)

Precision/Recall with S3 score as ground-truth (small cw09 sample):


| Approach | Precision (Top100) | Recall (Top100) | Precision (Relevant@Top100) | Recall (Relevant@Top100) | Precision (Irrelevant@Top100) | Recall (Irrelevant@Top100) |
|---|---|---|---|---|---|---|
| CopyCat | 0.997 | 0.701 | 0.998 | 0.836 | 0.996 | 0.634 |
| SimHash(1-grams) | 0.932 | 0.792 | 0.952 | 0.961 | 0.958 | 0.759 |
| SimHash(3+5-grams) | 0.998 | 0.589 | 1.0 | 0.689 | 0.997 | 0.537 |
| TextProfile | 0.998 | 0.305 | 0.996 | 0.383 | 1.0 | 0.25 |
| MD5 | 1.0 | 0.135 | 1.0 | 0.239 | 1.0 | 0.147 |
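
The numbers above treat the thresholded S3 score as ground truth: a pair counts as a true near-duplicate if its S3 score is at least the threshold (0.82 in this notebook), and each approach is scored against that label. The following sketch reproduces the computation in plain Python on made-up scores. The helper name and the example values are hypothetical, and returning 0.0 when a denominator is zero is an assumption of this sketch.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for two equally long lists of booleans."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    # Assumption: report 0.0 instead of failing when nothing is predicted or true.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

S3_THRESHOLD = 0.82

# Made-up S3 scores for five document pairs and the decisions of some approach.
s3_scores = [0.95, 0.90, 0.85, 0.60, 0.40]
predicted = [True, True, False, False, False]

ground_truth = [score >= S3_THRESHOLD for score in s3_scores]
precision, recall = precision_recall(ground_truth, predicted)
print('Precision: {:0.3f}, Recall: {:0.3f}'.format(precision, recall))
```

In this toy example the approach never flags a pair below the threshold (precision 1.0) but misses one of the three true near-duplicates, which is the same high-precision, lower-recall pattern that MD5 and TextProfile show in the table.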


## Next Steps:

- Improve efficiency of "local" copy-cat
- Evaluations:
  - Add top 1000 to the table above
  - At which positions do duplicates occur?
    - Duplicates within the top k
- Evaluation Notebooks (all available runs):
  - ClueWeb
  - Robust04
  - args.me
- Relevance Transfer:
  - Deduplicate on the top-10000 results for each topic with CopyCat
- Finish Anserini-Integration
  - Use jigsaw modules to separate ChatNoir from Anserini