# CopyCat: Showcases for the CopyCat Library on the TREC Web Tracks

This notebook assumes a Jupyter environment with CopyCat installed.
To start Jupyter with CopyCat installed and your local directory mounted, run:
```
docker run --rm -ti -v ${PWD}:/home/jovyan -p 8888:8888 webis/chatnoir-copycat:1.0-jupyter
```

Then, we use the CopyCat CLI to deduplicate the qrel files and the run files submitted to the TREC Web tracks on ClueWeb09 and ClueWeb12.

**Attention:** This notebook assumes that you have access to the ChatNoir index for random document access.

## Step 1: Verify the preprocessing of a sample document

We have double-checked the preprocessing for the ClueWeb and Common Crawl corpora with many unit and integration tests.
Here, we use the ChatNoir index for random document access. If you do not have access to ChatNoir's index, you can create a small Anserini index that contains only the relevant ClueWeb documents, as [shown for GOV2 here](copycat-on-gov2.ipynb).

To verify that the random document access works properly, we check it for a few documents manually.

In [1]:
# Use CopyCat to check the preprocessing for a single document

!copy-cat \
    --retrieveDocId clueweb09-enwp00-29-15207 \
    --documents ChatNoirMapfiles \
    --keepStopwords False \
    --output a --input a

dog breed wikipedia free encyclopedia dog breed from wikipedia free encyclopedia redirect from breed dog jump navig search list dog breed see list dog breed chihuahua mix purebr great dane dog breed group close relat visibl similar domest dog which all subspeci cani lupu familiari have characterist trait select maintain human bred from known foundat stock 1 term dog breed mai also us refer natur breed landrac which aros through time respons particular environ which includ human littl select breed human 2 breed undocu identifi appear often style work ancient dog breed some modern document descend natur breed dog breed scientif defin biolog classif rather group defin club hobbyist call breed club dog breed repres suffici number individu stabli transfer it specif characterist over gener dog same breed have similar characterist appear behavior primarili becaus come from select set ancestor who had same characterist 3 dog specif breed breed true produc young close similar parent individu do

The preprocessed ClueWeb09 document clueweb09-enwp00-29-15207 about dog breeds looks good (stemming as expected, no problems with encoding, etc.).
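
Part of this manual check can be automated. The following is a minimal sketch and not part of CopyCat: `find_encoding_problems` is a hypothetical helper that only flags tokens containing the Unicode replacement character or control/unassigned characters, which are typical symptoms of decoding errors. It does not verify stemming.

```python
import unicodedata

def find_encoding_problems(text):
    """Return the tokens that contain suspicious characters.

    Hypothetical helper: a token is suspicious if it contains U+FFFD
    (the replacement character) or any character from a Unicode 'C'
    category (control, format, unassigned, private use, surrogate).
    """
    suspicious = []
    for token in text.split():
        for ch in token:
            if ch == '\ufffd' or unicodedata.category(ch).startswith('C'):
                suspicious.append(token)
                break
    return suspicious

# Excerpt of the preprocessed document shown above
sample = 'dog breed wikipedia free encyclopedia dog breed from wikipedia'
print(find_encoding_problems(sample))
```

An empty list means that no token looked broken under these two heuristics; it is not a proof that the preprocessing is correct.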

## Step 2: Deduplication of Qrel files


In [2]:
# Download all qrel files for the web tracks from Anserini
QREL_FILES = ['prels.web.1-50.txt', 'qrels.web.51-100.txt', 'qrels.web.101-150.txt', 'qrels.web.151-200.txt', 'qrels.web.201-250.txt', 'qrels.web.251-300.txt']
for track in QREL_FILES:
    !wget https://raw.githubusercontent.com/castorini/anserini/master/src/main/resources/topics-and-qrels/$track > /dev/null 2>&1

        
# Define the function to deduplicate qrel files
def run_copy_cat_on_qrel_file(input_file, output_file):
    print('Process \'' + input_file + '\' to \'' + output_file + '\'.')
    !copy-cat \
        --input  $input_file \
        --output $output_file \
        --runFile False \
        --similarities "s3" \
        --s3Threshold 0.82 \
        --threads 5 \
        --documents ChatNoirMapfiles
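
The function above selects the `s3` similarity with `--s3Threshold 0.82`. To give an intuition for what such a threshold means, here is a rough sketch of an S3-style score. It assumes that S3 is the number of shared word 8-grams divided by the mean number of word 8-grams of both documents; this formulation and the names `word_ngrams` and `s3_score` are assumptions for illustration and are not taken from CopyCat's code, which may differ in detail.

```python
def word_ngrams(tokens, n=8):
    """Return the set of word n-grams (as tuples) of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def s3_score(text_a, text_b, n=8):
    """S3-style overlap (assumed formulation, see lead-in).

    Shared n-grams divided by the mean number of n-grams of both texts.
    Returns 0.0 if either text is shorter than n words.
    """
    chunks_a = word_ngrams(text_a.split(), n)
    chunks_b = word_ngrams(text_b.split(), n)
    if not chunks_a or not chunks_b:
        return 0.0
    shared = len(chunks_a & chunks_b)
    return shared / ((len(chunks_a) + len(chunks_b)) / 2)

doc_a = 'a b c d e f g h i j'
doc_b = 'a b c d e f g h i x'
# Under this sketch, a pair counts as near-duplicate if the score is >= 0.82
print(s3_score(doc_a, doc_b) >= 0.82)
```

In this toy example the two texts share two of three 8-grams each, so the score stays below the threshold even though only one word differs; on longer documents a single changed word has far less influence.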

In [5]:
# Deduplicate all qrel files:
for qrel_file in QREL_FILES:
    run_copy_cat_on_qrel_file(qrel_file, 'deduplication/' + qrel_file + '.jsonl')

# The resulting jsonl files are used to produce Table 4 from the paper.

Process 'prels.web.1-50.txt' to 'deduplication/prels.web.1-50.txt.jsonl'.
The specified output 'deduplication/prels.web.1-50.txt.jsonl' exists.
Skip...
Process 'qrels.web.51-100.txt' to 'deduplication/qrels.web.51-100.txt.jsonl'.
The specified output 'deduplication/qrels.web.51-100.txt.jsonl' exists.
Skip...
Process 'qrels.web.101-150.txt' to 'deduplication/qrels.web.101-150.txt.jsonl'.
The specified output 'deduplication/qrels.web.101-150.txt.jsonl' exists.
Skip...
Process 'qrels.web.151-200.txt' to 'deduplication/qrels.web.151-200.txt.jsonl'.
The specified output 'deduplication/qrels.web.151-200.txt.jsonl' exists.
Skip...
Process 'qrels.web.201-250.txt' to 'deduplication/qrels.web.201-250.txt.jsonl'.
The specified output 'deduplication/qrels.web.201-250.txt.jsonl' exists.
Skip...
Process 'qrels.web.251-300.txt' to 'deduplication/qrels.web.251-300.txt.jsonl'.
The specified output 'deduplication/qrels.web.251-300.txt.jsonl' exists.
Skip...


## Step 3: Deduplication of Run files

Please download all run files [from TREC](https://trec.nist.gov/results/) first.

In [6]:
# Define the function to deduplicate run files
def run_copy_cat_on_run_file(input_file, output_file):
    print('Process \'' + input_file + '\' to \'' + output_file + '\'.')
    !copy-cat \
        --input  $input_file \
        --output $output_file \
        --similarities "url s3 cosine(3+5-grams) cosine(8-grams) cosine(1-grams) simhash(1-grams) simhash(3+5-grams) md5 text-profile" \
        --s3Threshold 0.4 \
        --threads 5 \
        --documents ChatNoirMapfiles

def list_pretty_run_files(run_file_dir):
    ret = !ls $run_file_dir
    return [i.replace('.gz', '').replace('input.', '') for i in ret]
    
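# Note: this redefinition replaces the function above; it prints the skip messages without invoking copy-cat.
def run_copy_cat_on_run_file(input_file, output_file, deduplication_depth):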
    print('Process \'' + input_file + '\' to \'' + output_file + '\'.')
    print('The specified output \'' + output_file + '\' exists.')
    print('Skip...')

    
# Run Deduplication of run files submitted to the web tracks with CopyCat
web_tracks = range(18, 23) # covers TREC 18 to TREC 22 (the end is exclusive); the web tracks ran until TREC 23, so use range(18, 24) to include it
EXPERIMENT_DIR='/mnt/ceph/storage/data-in-progress/data-research/web-search/'

for web_track in web_tracks:
    for deduplication_depth in [10, 100, 1000]:
        run_file_dir=EXPERIMENT_DIR + 'web-search-trec/trec-system-runs/trec' + str(web_track) + '/web.adhoc/'
        output_dir=EXPERIMENT_DIR + 'SIGIR-21/sigir21-deduplicate-trec-run-files/trec' + str(web_track) + '-web.adhoc-top' + str(deduplication_depth)
        
        for run_file in list_pretty_run_files(run_file_dir):
            run_copy_cat_on_run_file(
                input_file=run_file_dir + '/input.' + run_file + '.gz',
                output_file=output_dir + '/' + run_file + '.jsonl',
                deduplication_depth=deduplication_depth
            )

Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec18/web.adhoc//input.arsc09web.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top10/arsc09web.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top10/arsc09web.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec18/web.adhoc//input.ICTNETADRun3.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top10/ICTNETADRun3.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top10/ICTNETADRun3.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-

The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top10/UMHOOsd.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec18/web.adhoc//input.UMHOOsdp.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top10/UMHOOsdp.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top10/UMHOOsdp.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec18/web.adhoc//input.uogTrdphA.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top10/uogTrdphA.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-r

Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec18/web.adhoc//input.arsc09web.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top1000/arsc09web.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top1000/arsc09web.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec18/web.adhoc//input.ICTNETADRun3.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top1000/ICTNETADRun3.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top1000/ICTNETADRun3.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progre

The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top1000/NeuLMWeb600.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec18/web.adhoc//input.NeuLMWebBase.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top1000/NeuLMWebBase.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top1000/NeuLMWebBase.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec18/web.adhoc//input.pkuLink.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top1000/pkuLink.jsonl'.
The specified output '/mnt/ceph/storage/dat

The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top1000/UMHOObm25IF.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec18/web.adhoc//input.UMHOOqlB.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top1000/UMHOOqlB.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top1000/UMHOOqlB.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec18/web.adhoc//input.UMHOOqlGS.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec18-web.adhoc-top1000/UMHOOqlGS.jsonl'.
The specified output '/mnt/ceph/storage/data-in-pro

Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec19/web.adhoc//input.blv79y00prob.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top10/blv79y00prob.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top10/blv79y00prob.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec19/web.adhoc//input.blv79y00shnk.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top10/blv79y00shnk.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top10/blv79y00shnk.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progr

Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec19/web.adhoc//input.blv79y00prob.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/blv79y00prob.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/blv79y00prob.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec19/web.adhoc//input.blv79y00shnk.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/blv79y00shnk.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/blv79y00shnk.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-p

The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/UMa10IASF.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec19/web.adhoc//input.umassSDM.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/umassSDM.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/umassSDM.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec19/web.adhoc//input.umassSDMW.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec19-web.adhoc-top100/umassSDMW.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/

Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec20/web.adhoc//input.2011SiftR1.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec20-web.adhoc-top10/2011SiftR1.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec20-web.adhoc-top10/2011SiftR1.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec20/web.adhoc//input.2011SiftR2.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec20-web.adhoc-top10/2011SiftR2.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec20-web.adhoc-top10/2011SiftR2.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-res

Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec20/web.adhoc//input.2011SiftR1.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec20-web.adhoc-top100/2011SiftR1.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec20-web.adhoc-top100/2011SiftR1.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec20/web.adhoc//input.2011SiftR2.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec20-web.adhoc-top100/2011SiftR2.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec20-web.adhoc-top100/2011SiftR2.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data

Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec20/web.adhoc//input.2011SiftR1.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec20-web.adhoc-top1000/2011SiftR1.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec20-web.adhoc-top1000/2011SiftR1.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec20/web.adhoc//input.2011SiftR2.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec20-web.adhoc-top1000/2011SiftR2.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec20-web.adhoc-top1000/2011SiftR2.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/

Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec21/web.adhoc//input.2012bpacad4.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec21-web.adhoc-top10/2012bpacad4.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec21-web.adhoc-top10/2012bpacad4.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec21/web.adhoc//input.2012bpacad4h.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec21-web.adhoc-top10/2012bpacad4h.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec21-web.adhoc-top10/2012bpacad4h.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress

Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec21/web.adhoc//input.srchvrs12c09.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec21-web.adhoc-top10/srchvrs12c09.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec21-web.adhoc-top10/srchvrs12c09.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec21/web.adhoc//input.uogTrA44s9.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec21-web.adhoc-top10/uogTrA44s9.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec21-web.adhoc-top10/uogTrA44s9.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/da

Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec21/web.adhoc//input.2012bpacad4.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec21-web.adhoc-top1000/2012bpacad4.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec21-web.adhoc-top1000/2012bpacad4.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec21/web.adhoc//input.2012bpacad4h.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec21-web.adhoc-top1000/2012bpacad4h.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec21-web.adhoc-top1000/2012bpacad4h.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-

Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec22/web.adhoc//input.clustmrfaf.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec22-web.adhoc-top10/clustmrfaf.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec22-web.adhoc-top10/clustmrfaf.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec22/web.adhoc//input.clustmrfbf.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec22-web.adhoc-top10/clustmrfbf.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec22-web.adhoc-top10/clustmrfbf.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-res

Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec22/web.adhoc//input.clustmrfaf.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec22-web.adhoc-top100/clustmrfaf.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec22-web.adhoc-top100/clustmrfaf.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec22/web.adhoc//input.clustmrfbf.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec22-web.adhoc-top100/clustmrfbf.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec22-web.adhoc-top100/clustmrfbf.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data

Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec22/web.adhoc//input.clustmrfaf.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec22-web.adhoc-top1000/clustmrfaf.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec22-web.adhoc-top1000/clustmrfaf.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/data-research/web-search/web-search-trec/trec-system-runs/trec22/web.adhoc//input.clustmrfbf.gz' to '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec22-web.adhoc-top1000/clustmrfbf.jsonl'.
The specified output '/mnt/ceph/storage/data-in-progress/data-research/web-search/SIGIR-21/sigir21-deduplicate-trec-run-files/trec22-web.adhoc-top1000/clustmrfbf.jsonl' exists.
Skip...
Process '/mnt/ceph/storage/data-in-progress/
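
The paths in the output above contain a doubled slash (`web.adhoc//input.`) because `run_file_dir` already ends with `/` and the loop appends `'/input.'`. This is harmless on POSIX systems, but the path construction can be sketched more robustly with `pathlib`. The function name `run_file_paths` is hypothetical; the directory layout is copied from the loop above.

```python
from pathlib import PurePosixPath

def run_file_paths(experiment_dir, web_track, depth, run_name):
    """Build the (input, output) paths for one run file.

    Mirrors the directory layout used in the loop above, but joins
    path segments with pathlib so that no doubled slashes appear.
    """
    base = PurePosixPath(experiment_dir)
    run_dir = (base / 'web-search-trec' / 'trec-system-runs'
               / f'trec{web_track}' / 'web.adhoc')
    out_dir = (base / 'SIGIR-21' / 'sigir21-deduplicate-trec-run-files'
               / f'trec{web_track}-web.adhoc-top{depth}')
    return (str(run_dir / f'input.{run_name}.gz'),
            str(out_dir / f'{run_name}.jsonl'))

EXPERIMENT_DIR = '/mnt/ceph/storage/data-in-progress/data-research/web-search/'
input_file, output_file = run_file_paths(EXPERIMENT_DIR, 18, 10, 'arsc09web')
print(input_file)
print(output_file)
```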

In [7]:
import pandas as pd

# The script below calculates precision/recall in parallel; its output is then parsed into df_18, ..., df_23
!python3 util/precision-recall-in-top-1000.py

df_18 = pd.read_json('../../../tmp/web-track-18-precision-recall.jsonl', lines=True)
df_19 = pd.read_json('../../../tmp/web-track-19-precision-recall.jsonl', lines=True)
df_20 = pd.read_json('../../../tmp/web-track-20-precision-recall.jsonl', lines=True)
df_21 = pd.read_json('../../../tmp/web-track-21-precision-recall.jsonl', lines=True)
df_22 = pd.read_json('../../../tmp/web-track-22-precision-recall.jsonl', lines=True)
df_23 = pd.read_json('../../../tmp/web-track-23-precision-recall.jsonl', lines=True)
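
`pd.read_json(..., lines=True)` treats every line of the file as one JSON record. To inspect such a file without pandas, a minimal sketch with the standard library could look as follows. The helpers `read_jsonl` and `count_true` are hypothetical, and the example record only reuses column names that appear later in this notebook (`near-duplicate`, `md5`); its values are made up.

```python
import json

def read_jsonl(path):
    """Parse a JSON Lines file into a list of dicts (one per non-empty line)."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

def count_true(records, column):
    """Count the records in which the given boolean column is true."""
    return sum(1 for record in records if record.get(column))

# Made-up record that only illustrates the one-object-per-line layout
example_line = json.dumps({'near-duplicate': True, 'md5': False})
print(json.loads(example_line)['near-duplicate'])
```

For the actual analysis, the pandas data frames loaded above remain the more convenient representation.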

## Table 5

Precision and recall of near-duplicates in the CopyCat dataset (S3 ≥ 0.82) for runs submitted to the TREC Web tracks at depth 1000.
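
As a reminder of what these two numbers measure, here is a minimal sketch in plain Python. It treats the S3-based `near-duplicate` label as ground truth and an approach's decision as prediction. The function `precision_recall` and the toy labels are illustrative assumptions; the notebook itself uses scikit-learn for the real computation.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for boolean labels.

    Returns 0.0 for a metric whose denominator is zero.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels (made up): five document pairs
near_duplicate = [True, True, True, False, False]
simhash = [True, True, False, True, False]
url = [True, False, False, False, True]

# Combining two detectors with a logical AND, as done for 'url-simhash'
url_simhash = [s and u for s, u in zip(simhash, url)]

print(precision_recall(near_duplicate, simhash))
print(precision_recall(near_duplicate, url_simhash))
```

In this toy example the AND combination removes the false positive of the single detector but also loses one true positive, so precision rises while recall drops.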

In [10]:
def precision_score(df, approach):
    from sklearn.metrics import precision_score
    return "{:0.3f}".format(precision_score(y_true=df['near-duplicate'], y_pred=df[approach]))

def recall_score(df, approach):
    from sklearn.metrics import recall_score
    return "{:0.3f}".format(recall_score(y_true=df['near-duplicate'], y_pred=df[approach]))

def table_row(df, approach, approach_display_name):
    ret = {'Approach': approach_display_name}

    for doc_count in [1000]:
        df_current_count = df[df['docs'] == doc_count]
        doc_count = str(doc_count)
        df_relevant = df_current_count[(df_current_count['judged']) & (df_current_count['relevant'])]
        df_irrelevant = df_current_count[(df_current_count['judged']) & (~df_current_count['relevant'])]

        ret['Precision (Top' + doc_count + ')'] = precision_score(df_current_count, approach)
        ret['Recall (Top' + doc_count + ')'] = recall_score(df_current_count, approach)
        ret['Precision (Relevant@Top' + doc_count + ')'] = precision_score(df_relevant, approach)
        ret['Recall (Relevant@Top' + doc_count + ')'] = recall_score(df_relevant, approach)
        ret['Precision (Irrelevant@Top' + doc_count + ')'] = precision_score(df_irrelevant, approach)
        ret['Recall (Irrelevant@Top' + doc_count + ')'] = recall_score(df_irrelevant, approach)
    
    return ret

def report_table(df):
    rows = []
    for approach, approach_display_name in [('copy-cat-tp', 'CopyCat'), ('url-simhash', 'Url Classes'), ('simhash(1-grams)', 'SimHash(1-grams)'), ('simhash(3+5-grams)', 'SimHash(3+5-grams)'), ('text-profile', 'TextProfile') , ('md5', 'MD5')]:
        rows += [table_row(df, approach, approach_display_name)]
    ret = pd.DataFrame(rows)
    ret.set_index('Approach', inplace=True)
    ret.columns = pd.MultiIndex.from_tuples([
        
        ('Top1000', 'Precision'), ('Top1000', 'Recall'),
        ('Relevant@Top1000', 'Precision'), ('Relevant@Top1000', 'Recall'),
        ('Irrelevant@Top1000', 'Precision'), ('Irrelevant@Top1000', 'Recall'),
    ])

    return ret.reset_index()

print('Precision/Recall with S3 score as ground-truth (small cw09 sample):')
df = pd.concat([df_18, df_19, df_20, df_21, df_22, df_23])
df['url-simhash'] = df['simhash(1-grams)'] & df['url']
df['docs'] = 1000
df = report_table(df)
df

Precision/Recall with S3 score as ground-truth (small cw09 sample):


| | Approach | Top1000 Precision | Top1000 Recall | Relevant@Top1000 Precision | Relevant@Top1000 Recall | Irrelevant@Top1000 Precision | Irrelevant@Top1000 Recall |
|---|---|---|---|---|---|---|---|
| 0 | CopyCat | 0.926 | 0.361 | 0.994 | 0.540 | 0.870 | 0.339 |
| 1 | Url Classes | 0.902 | 0.079 | 0.986 | 0.166 | 0.794 | 0.107 |
| 2 | SimHash(1-grams) | 0.749 | 0.803 | 0.799 | 0.890 | 0.758 | 0.902 |
| 3 | SimHash(3+5-grams) | 0.950 | 0.327 | 0.998 | 0.489 | 0.927 | 0.285 |
| 4 | TextProfile | 0.977 | 0.145 | 1.000 | 0.352 | 0.995 | 0.084 |
| 5 | MD5 | 1.000 | 0.092 | 1.000 | 0.307 | 1.000 | 0.040 |


In [12]:
def f(v):
    return v

def row(name):
    r = df[df['Approach'] == name].iloc[0]
    return '& ' + f(r[('Top1000', 'Precision')]) + ' & ' + f(r[('Top1000', 'Recall')]) + ' & ' + \
           f(r[('Relevant@Top1000', 'Precision')]) + ' & ' + f(r[('Relevant@Top1000', 'Recall')]) +' & ' + \
           f(r[('Irrelevant@Top1000', 'Precision')]) + ' & ' + f(r[('Irrelevant@Top1000', 'Recall')]) + ' \\\\'

def table():
    return """
\\begin{table}
\\centering
\\small
\\setlength{\\tabcolsep}{3pt}%
\\caption{TBD. {\\color{red}Make table consume full width.}}
\\label{table-precision-recall-in-runs}
\\begin{tabular}{@{}lcccccc@{}}
\\toprule
{\\bfseries Method} & \\multicolumn{2}{c@{}}{\\bfseries Top~1000} & \\multicolumn{2}{c@{}}{\\bfseries Relevant@Top~1000} & \\multicolumn{2}{c@{}}{\\bfseries Irrelevant@Top~1000} \\\\

\\cmidrule(l){2-3}
\\cmidrule(l){4-5}
\\cmidrule(l){6-7}

& Prec. & Rec.  & Prec. & Rec. & Prec. & Rec. \\\\

\\midrule

Crawl """ +  row('SimHash(3+5-grams)') + """

Classes """ +  row('Url Classes') + """
\\midrule

\\resource """ +  row('CopyCat') + """
\\bottomrule
\\end{tabular}

\\end{table}
"""

print(table())


\begin{table}
\centering
\small
\setlength{\tabcolsep}{3pt}%
\caption{TBD. {\color{red}Make table consume full width.}}
\label{table-precision-recall-in-runs}
\begin{tabular}{@{}lcccccc@{}}
\toprule
{\bfseries Method} & \multicolumn{2}{c@{}}{\bfseries Top~1000} & \multicolumn{2}{c@{}}{\bfseries Relevant@Top~1000} & \multicolumn{2}{c@{}}{\bfseries Irrelevant@Top~1000} \\

\cmidrule(l){2-3}
\cmidrule(l){4-5}
\cmidrule(l){6-7}

& Prec. & Rec.  & Prec. & Rec. & Prec. & Rec. \\

\midrule

Crawl & 0.950 & 0.327 & 0.998 & 0.489 & 0.927 & 0.285 \\

Classes & 0.902 & 0.079 & 0.986 & 0.166 & 0.794 & 0.107 \\
\midrule

\resource & 0.926 & 0.361 & 0.994 & 0.540 & 0.870 & 0.339 \\
\bottomrule
\end{tabular}

\end{table}
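
The `row` helper above concatenates the six cells by hand. A more compact variant can be sketched with `str.join`; the function name `latex_row` is hypothetical, and it assumes that the cells are already formatted strings, as they are in the data frame produced by `report_table`.

```python
def latex_row(label, cells):
    """Format one LaTeX table row: label, cells, and the row terminator."""
    return label + ' & ' + ' & '.join(cells) + ' \\\\'

# Values copied from the 'Crawl' row of the table printed above
print(latex_row('Crawl', ['0.950', '0.327', '0.998', '0.489', '0.927', '0.285']))
```

Unlike `row`, this sketch takes the row label as an argument instead of relying on the text that precedes the call in the template string.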

