# Experiment on NTCIR-17 Transfer Task Eval Dataset

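This notebook shows how to run the JANCE dense retriever (`models.jance.jance`) on the eval dataset of the NTCIR-17 Transfer Task using [PyTerrier](https://pyterrier.readthedocs.io/en/latest/) (v0.9.2). The BM25 baseline calls remain as commented-out lines for reference, and the run name `MyRun-BM25` is a leftover from that baseline.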

## Previous Step

- `preprocess-transfer1-eval.ipynb`

## Requirement

- Java v11

## Path

In [1]:
import os
os.environ['INDEX'] = '../indexes/ntcir17-transfer/jance'
os.environ['RUN'] = '../runs/ntcir17-transfer/jance'

## Datasets

In [2]:
import sys
!{sys.executable} -m pip install -q ir_datasets



In [3]:
sys.path.append(os.path.join(os.path.dirname(os.path.abspath('__file__')), '../datasets'))
sys.path.append(os.path.join(os.path.dirname(os.path.abspath('__file__')), '..'))

In [4]:
import ir_datasets
import ntcir_transfer
dataset = ir_datasets.load('ntcir-transfer/1/eval')

## Tokenization

- In this example, we use [SudachiPy](https://github.com/WorksApplications/SudachiPy) (v0.5.4) + sudachidict_core dictionary + SplitMode.A
- Other tokenizers can also be used
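
As a rough illustration of the second point, the sketch below wraps an arbitrary tokenizer so that it returns one space-separated string, which is the shape `tokenize_text` produces in this notebook. The names `char_bigrams` and `make_tokenize_text` are hypothetical and not part of the original code, and the character-bigram tokenizer is only a stand-in that needs no dictionary.

```python
from typing import Callable, List


def char_bigrams(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: overlapping character bigrams."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # Zero or one character: return it as a single token (or nothing)
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def make_tokenize_text(tokenize: Callable[[str], List[str]]) -> Callable[[str], str]:
    """Wrap any tokenizer so that it returns one space-separated string."""
    def tokenize_text(text: str) -> str:
        return " ".join(tokenize(text))
    return tokenize_text


bigram_tokenize_text = make_tokenize_text(char_bigrams)
print(bigram_tokenize_text("情報検索"))  # 情報 報検 検索
```

Whichever tokenizer is chosen, the same function should be applied to both documents and queries so that their tokens match.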

In [5]:
import sys
!{sys.executable} -m pip install -q sudachipy sudachidict_core



In [6]:
import re
import json
from sudachipy import tokenizer
from sudachipy import dictionary
tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.A

In [7]:
def tokenize_text(text):
    atok = ' '.join([m.surface() for m in tokenizer_obj.tokenize(text, mode)])
    return atok

In [8]:
tokenize_text('すもももももももものうち')

'すもも も もも も もも の うち'

## Experiment

### PyTerrier

In [9]:
# Change JAVA_HOME to fit your environment
JAVA_HOME = '/usr/lib/jvm/default'
os.environ['JAVA_HOME'] = JAVA_HOME
os.getenv('JAVA_HOME')

'/usr/lib/jvm/default'

In [10]:
import sys
# !{sys.executable} -m pip install -q python-terrier

In [11]:
import pandas as pd
import pyterrier as pt
if not pt.started():
  pt.init(tqdm='notebook')

PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


In [12]:
dataset_pt = pt.get_dataset('irds:ntcir-transfer/1/eval')

### Indexing

In [13]:
# !rm -rf $INDEX
!mkdir -p $INDEX

In [14]:
# BM25 baseline (Terrier) indexer, kept for reference:
# indexer = pt.IterDictIndexer(os.getenv('INDEX'))
# indexer.setProperty("tokeniser", "UTFTokeniser")
# indexer.setProperty("termpipelines", "")
from pathlib import Path
from importlib import reload
import models.jance.jance
# Reload so that edits to models/jance/jance.py are picked up without restarting the kernel
reload(models.jance.jance)
from models.jance.jance import PyTDenseIndexer, PyTDenseRetrieval

indexer = PyTDenseIndexer(Path(os.getenv('INDEX')), verbose=False)



Using mean: False


In [15]:
def train_doc_generate():
    for doc in dataset.docs_iter():
        yield { 'docno': doc.doc_id, 'text': tokenize_text(doc.text) }

In [16]:
%%time
index_path = indexer.index(train_doc_generate())

Segment 0


  3%|▎         | 479/15625 [02:54<1:32:12,  2.74it/s]


KeyboardInterrupt: 

In [None]:
!ls $INDEX

### Topics

In [None]:
def tokenize_topics():
    import re
    # Punctuation (ASCII and full-width) to strip from the tokenized queries
    code = re.compile('[!"#$%&\'\\\\()*+,-./:;<=>?@[\\]^_`{|}~「」〔〕“”〈〉『』【】＆＊・（）＄＃＠。、？！｀＋￥％]')
    queries = dataset_pt.get_topics(tokenise_query=False)
    # Assign by column name so the result does not depend on the index or column order
    queries['query'] = queries['query'].map(lambda q: code.sub('', tokenize_text(q)))
    return queries

In [None]:
tokenize_topics()

| | qid | query |
|---|---|---|
| 0 | 101 | Ｂ 型 肝炎 |
| 1 | 102 | 異種 膵島 移植 |
| 2 | 103 | 高 血圧 |
| 3 | 104 | 肺 小 細胞 癌 |
| 4 | 105 | 新規 キノロン 剤 |
| 5 | 106 | β ３ アドレナリン 受容 体 遺伝 子 変異 |
| 6 | 107 | 塞栓 療法 |
| 7 | 108 | XML |
| 8 | 109 | TCP の 高速 化 |
| 9 | 110 | 情報 検索 の 可視 化 |
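
Because punctuation is removed after tokenization, a query that contains a punctuation token can end up with consecutive or trailing spaces. Whether that matters depends on the retriever, but a minimal sketch of normalizing the whitespace could look like the following. The `PUNCT` pattern is an abbreviated, illustrative set and not the notebook's full pattern, and `clean_query` is a hypothetical helper.

```python
import re

# Abbreviated, illustrative punctuation set (assumption: not the notebook's full pattern)
PUNCT = re.compile(r"[!?/・、。（）()？]")
SPACES = re.compile(r"\s+")


def clean_query(tokenized: str) -> str:
    """Strip punctuation tokens, then collapse the whitespace they leave behind."""
    stripped = PUNCT.sub("", tokenized)
    return SPACES.sub(" ", stripped).strip()


print(clean_query("TCP / IP の 高速 化 ？"))  # TCP IP の 高速 化
```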


### Retrieval

- The reported performance value (e.g., nDCG) is expected to be 0.0, because the experiment below is evaluated against dummy qrels in which every label is 0.
- The generated run files can be used for submission.
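
To see why an all-zero set of judgments yields 0.0, here is a minimal nDCG sketch in plain Python. It assumes linear gain and the common convention of reporting 0 when the ideal DCG is 0; the `ndcg` function is a simplified illustration, not PyTerrier's implementation.

```python
import math
from typing import Dict, List


def ndcg(ranked_docnos: List[str], labels: Dict[str, int]) -> float:
    """Simplified nDCG with linear gain (an illustrative assumption)."""
    gains = [labels.get(docno, 0) for docno in ranked_docnos]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(labels.values(), reverse=True)
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    # With no positive labels the ideal DCG is 0, so the score is reported as 0.0
    return dcg / idcg if idcg > 0 else 0.0


# Dummy qrels: a single placeholder document with label 0
print(ndcg(["kaken-j-0911436000"], {"docno": 0}))  # 0.0
```

With real judgments the same function gives non-zero values, for example `ndcg(["d2", "d1"], {"d1": 1, "d2": 0})` is `1 / log2(3)`.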

In [None]:
# Load existing index files
# indexref = pt.IndexFactory.of(os.getenv('INDEX'))

In [None]:
!mkdir -p $RUN

In [None]:
# bm25 = pt.BatchRetrieve(indexref, wmodel="BM25")
jance = PyTDenseRetrieval(index_path)

NameError: name 'index_path' is not defined

In [None]:
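# Dummy qrels: pt.Experiment requires a qrels frame, so give each topic
# one placeholder row (docno 'docno') with label 0
import pandas as pd
dummy_qrels = pd.DataFrame(dataset_pt.get_topics(), columns=['qid'])
dummy_qrels['docno'] = 'docno'
dummy_qrels['label'] = 0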

In [None]:
%%time
from pyterrier.measures import nDCG
pt.Experiment(
    [jance],
    tokenize_topics(),
    dummy_qrels,
    eval_metrics=[nDCG],
    # The run name is a leftover from the BM25 baseline; the system evaluated here is the JANCE retriever
    names=["MyRun-BM25"],
    save_dir=os.getenv('RUN'),
    save_mode="overwrite"
)

CPU times: user 4.06 s, sys: 2.18 s, total: 6.24 s
Wall time: 2.13 s


| | name | nDCG |
|---|---|---|
| 0 | MyRun-BM25 | 0.0 |


In [None]:
!gunzip -c $RUN/MyRun-BM25.res.gz | head

0101 Q0 kaken-j-0911436000 0 21.86485284250732 pyterrier
0101 Q0 kaken-j-0921440800 1 21.733548660790195 pyterrier
0101 Q0 kaken-j-0960142800 2 21.6993557888258 pyterrier
0101 Q0 kaken-j-0975101400 3 21.659867826004042 pyterrier
0101 Q0 kaken-j-0934033100 4 21.651742953172338 pyterrier
0101 Q0 kaken-j-0912100600 5 21.594389793069617 pyterrier
0101 Q0 kaken-j-0882391600 6 21.511995261510908 pyterrier
0101 Q0 kaken-j-0883102100 7 21.46097823369766 pyterrier
0101 Q0 kaken-j-0937129200 8 21.450990698888955 pyterrier
0101 Q0 kaken-j-0941469900 9 21.421462357753665 pyterrier
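
Each line of the run file above follows the six-column TREC run layout: query ID, the literal `Q0`, document ID, rank, score, and run tag. As a sanity check before submission, a minimal sketch for parsing such lines could look like this. `RunEntry`, `parse_run_line`, and `read_run` are hypothetical helpers, not part of PyTerrier.

```python
import gzip
from typing import List, NamedTuple


class RunEntry(NamedTuple):
    qid: str
    docno: str
    rank: int
    score: float
    tag: str


def parse_run_line(line: str) -> RunEntry:
    """Parse one whitespace-separated line: qid Q0 docno rank score tag."""
    qid, _q0, docno, rank, score, tag = line.split()
    return RunEntry(qid, docno, int(rank), float(score), tag)


def read_run(path: str) -> List[RunEntry]:
    """Read a gzip-compressed run file such as MyRun-BM25.res.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [parse_run_line(line) for line in f if line.strip()]


sample = [
    "0101 Q0 kaken-j-0911436000 0 21.86485284250732 pyterrier",
    "0101 Q0 kaken-j-0921440800 1 21.733548660790195 pyterrier",
]
entries = [parse_run_line(line) for line in sample]
# Within one query, scores should not increase as the rank grows
scores_sorted = all(a.score >= b.score for a, b in zip(entries, entries[1:]))
print(entries[0].docno, scores_sorted)  # kaken-j-0911436000 True
```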
