# Experiment on NTCIR-17 Transfer Task Train Dataset

This notebook shows how to run BM25 retrieval on the train dataset of the NTCIR-17 Transfer Task using [PyTerrier](https://pyterrier.readthedocs.io/en/latest/) (v0.9.2), and compares it against a dense retriever (JANCE) using nDCG.

## Previous Step

- `preprocess-transfer1-train.ipynb`

## Requirements

- Java v11

## Path

In [109]:
import os
os.environ['INDEX'] = '../indexes/ntcir17-transfer/train'
os.environ['RUN'] = '../runs/ntcir17-transfer/train'

## Datasets

In [110]:
import sys
!{sys.executable} -m pip install -q ir_datasets


In [111]:
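# '__file__' is quoted on purpose: notebooks do not define __file__,
# so this path resolves relative to the current working directory
sys.path.append(os.path.join(os.path.dirname(os.path.abspath('__file__')), '../datasets'))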

In [112]:
import ir_datasets
import ntcir_transfer
dataset = ir_datasets.load('ntcir-transfer/1/train')

## Tokenization

- This example tokenizes text with [SudachiPy](https://github.com/WorksApplications/SudachiPy) (v0.5.4), the `sudachidict_core` dictionary, and `SplitMode.A` (the shortest split units)
- Other tokenizers can be used instead

In [113]:
import sys
!{sys.executable} -m pip install -q sudachipy sudachidict_core


In [114]:
import re
import json
from sudachipy import tokenizer
from sudachipy import dictionary
tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.A

In [115]:
def tokenize_text(text):
    # Join the surface forms of the Sudachi morphemes with single spaces
    return ' '.join(m.surface() for m in tokenizer_obj.tokenize(text, mode))

In [116]:
tokenize_text('すもももももももものうち')

'すもも も もも も もも の うち'
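
Because `tokenize_text` only has to return whitespace-separated tokens, another tokenizer could be swapped in under the same contract. The sketch below uses overlapping character bigrams as a dictionary-free alternative. The name `char_bigram_tokenize` is hypothetical and is not part of this notebook or of SudachiPy, and its effect on retrieval quality is not evaluated here.

```python
def char_bigram_tokenize(text):
    """Hypothetical alternative tokenizer: overlapping character bigrams.

    Returns a space-separated string, the same shape tokenize_text returns.
    """
    # Drop whitespace so bigrams never span a space
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ''.join(chars)
    # Pair each character with its successor
    return ' '.join(a + b for a, b in zip(chars, chars[1:]))


print(char_bigram_tokenize('情報検索'))
```

To try it, the function would replace `tokenize_text` in both the document generator and the topic tokenization, so that documents and queries are split the same way.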

## Experiment

### PyTerrier

In [117]:
# Change JAVA_HOME to fit your environment
JAVA_HOME = '/usr/lib/jvm/default'
os.environ['JAVA_HOME'] = JAVA_HOME
os.getenv('JAVA_HOME')

'/usr/lib/jvm/default'

In [118]:
import sys
!{sys.executable} -m pip install -q python-terrier


In [119]:
import pandas as pd
import pyterrier as pt
if not pt.started():
  pt.init(tqdm='notebook')

In [120]:
dataset_pt = pt.get_dataset('irds:ntcir-transfer/1/train')

### Indexing

In [121]:
# !rm -rf $INDEX
!mkdir -p $INDEX

In [122]:
indexer = pt.IterDictIndexer(os.getenv('INDEX'))
indexer.setProperty("tokeniser", "UTFTokeniser")
indexer.setProperty("termpipelines", "")

In [123]:
def train_doc_generate():
    for doc in dataset.docs_iter():
        yield { 'docno': doc.doc_id, 'text': tokenize_text(doc.text) }

In [124]:
%%time
# Uncomment to build the Terrier index (skipped here because the index files already exist)
# indexref = indexer.index(train_doc_generate())

CPU times: user 1e+03 ns, sys: 0 ns, total: 1e+03 ns
Wall time: 3.1 µs


In [125]:
sys.path.append("../")
from pathlib import Path
from models.jance.jance import PyTDenseIndexer, PyTDenseRetrieval

index_path = Path("../indexes/ntcir17-transfer/train/jance")
if not index_path.joinpath('shards.pkl').exists():
    jance_indexer = PyTDenseIndexer(index_path, verbose=False)
    index_path = jance_indexer.index(train_doc_generate())

Using mean: False
Segment 0


100%|██████████| 10404/10404 [53:42<00:00,  3.23it/s] 


In [126]:
!ls $INDEX

data.direct.bf		   data.lexicon.fsomaphash  data.meta.zdata
data.document.fsarrayfile  data.lexicon.fsomapid    data.properties
data.inverted.bf	   data.meta-0.fsomapfile   jance
data.lexicon.fsomapfile    data.meta.idx


### Topics

In [127]:
def tokenize_topics():
    # ASCII and Japanese punctuation to strip from queries
    code = re.compile('[!"#$%&\'\\\\()*+,-./:;<=>?@[\\]^_`{|}~「」〔〕“”〈〉『』【】＆＊・（）＄＃＠。、？！｀＋￥％]')
    queries = dataset_pt.get_topics(tokenise_query=False)
    # Tokenize each query with Sudachi, then remove punctuation
    queries['query'] = queries['query'].apply(lambda q: code.sub('', tokenize_text(q)))
    return queries

In [128]:
tokenize_topics()

|  | qid | query |
|---|---|---|
| 0 | 0001 | ロボット |
| 1 | 0002 | 複合 名詞 の 構造 解析 |
| 2 | 0003 | サンプル 複雑 性 |
| 3 | 0004 | 文書 画像 理解 |
| 4 | 0005 | 特徴 次元 リダクション |
| ... | ... | ... |
| 78 | 0079 | β アミロイド タンパク |
| 79 | 0080 | 神経 再生 |
| 80 | 0081 | 脳 の 性差 |
| 81 | 0082 | 抗 マラリア 薬剤 |
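
Removing punctuation from an already tokenized query can leave consecutive or trailing spaces where a punctuation token used to be. The sketch below shows one way to collapse them. The pattern `PUNCT` is a shortened, hypothetical stand-in for the notebook's full character class, and `clean_query` is an illustrative name; the example query is made up.

```python
import re

# Hypothetical, shortened punctuation set; the notebook's pattern covers more characters
PUNCT = re.compile(r'[!"#$%&\'()*+,\-./:;<=>?@\[\\\]^_`{|}~「」『』【】（）。、？！・]')


def clean_query(tokenized_query):
    """Strip punctuation, then collapse the whitespace it leaves behind."""
    stripped = PUNCT.sub('', tokenized_query)
    # split() with no argument drops empty fields, so runs of spaces collapse
    return ' '.join(stripped.split())


print(clean_query('抗 マラリア 薬剤 （ 新規 ）'))
```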


### Retrieval

In [129]:
# Load existing index files
indexref = pt.IndexFactory.of(os.getenv('INDEX'))

In [130]:
!mkdir -p $RUN

In [131]:
bm25 = pt.BatchRetrieve(indexref, wmodel="BM25")
anceretr = PyTDenseRetrieval(index_path)

Using mean: False
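
For intuition about what a BM25 weighting model computes, the sketch below scores whitespace-tokenized documents with a textbook Okapi BM25 formula. It is an illustration only: the parameters (`k1=1.2`, `b=0.75`), the function name `bm25_scores`, and the toy corpus are assumptions, and Terrier's own BM25 implementation differs in details and remains the reference for the results in this notebook.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each whitespace-tokenized document against a whitespace-tokenized query."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents containing each term
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up), tokenized the same way as the indexed documents
toy_docs = ['神経 再生 の 研究', '脳 の 性差', '神経 細胞 の 再生 と 神経 回路']
print(bm25_scores('神経 再生', toy_docs))
```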


In [132]:
%%time
from pyterrier.measures import *
pt.Experiment(
    [bm25, anceretr],
    tokenize_topics(),
    dataset_pt.get_qrels(),
    eval_metrics=[nDCG],
    names = ["MyRun-BM25", "JANCE"],
    save_dir = os.getenv('RUN'),
    save_mode = "overwrite"
)

***** inference of 83 queries *****
***** faiss search for 83 queries on 1 shards *****


Loading shards: 100%|██████████| 1/1 [12:04<00:00, 724.53s/shard]
Calc Scores: 1it [12:04, 724.53s/it]


CPU times: user 12min 12s, sys: 20.1 s, total: 12min 32s
Wall time: 12min 24s


|  | name | nDCG |
|---|---|---|
| 0 | MyRun-BM25 | 0.526288 |
| 1 | JANCE | 0.299774 |
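
As a reminder of what the nDCG column measures, the sketch below computes nDCG for a single query using one common formulation (linear gain with a log2 rank discount). The function name `ndcg` and the relevance judgments are hypothetical, and this is not the code path used by `pyterrier.measures`; the reported score is an average over all queries.

```python
import math


def ndcg(ranked_docnos, qrels, k=None):
    """nDCG for one query with linear gain and a log2 rank discount.

    ranked_docnos: docnos in ranked order, best first.
    qrels: dict mapping docno -> graded relevance (unjudged docs count as 0).
    """
    if k is not None:
        ranked_docnos = ranked_docnos[:k]
    # Rank i (zero-based) is discounted by log2(i + 2)
    dcg = sum(qrels.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked_docnos))

    # Ideal ranking: all relevant documents sorted by decreasing relevance
    ideal = sorted((r for r in qrels.values() if r > 0), reverse=True)
    if k is not None:
        ideal = ideal[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Made-up judgments and ranking for illustration
example_qrels = {'d1': 2, 'd2': 1}
print(ndcg(['d3', 'd1', 'd2'], example_qrels))
```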


In [133]:
!gunzip -c $RUN/MyRun-BM25.res.gz | head

0001 Q0 gakkai-0000064659 0 13.583940506962453 pyterrier
0001 Q0 gakkai-0000225773 1 13.527180904344576 pyterrier
0001 Q0 gakkai-0000328806 2 13.432803868573687 pyterrier
0001 Q0 gakkai-0000198139 3 13.419095037920066 pyterrier
0001 Q0 gakkai-0000124728 4 13.402377752084877 pyterrier
0001 Q0 gakkai-0000168454 5 13.397874358074015 pyterrier
0001 Q0 gakkai-0000297977 6 13.395025935634221 pyterrier
0001 Q0 gakkai-0000245010 7 13.39289585771344 pyterrier
0001 Q0 gakkai-0000045041 8 13.392088725112261 pyterrier
0001 Q0 gakkai-0000094695 9 13.391086411155124 pyterrier
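
Each line of the run file follows the TREC run format: query ID, the literal `Q0`, document number, rank (zero-based in this output), score, and run tag. The sketch below parses such lines with the standard library only. The names `RunEntry`, `parse_run_line`, and `read_run` are hypothetical helpers, not part of PyTerrier.

```python
import gzip
from collections import namedtuple

# One row of a TREC-format run file
RunEntry = namedtuple('RunEntry', 'qid iteration docno rank score tag')


def parse_run_line(line):
    """Parse one whitespace-separated run line into a RunEntry."""
    qid, iteration, docno, rank, score, tag = line.split()
    return RunEntry(qid, iteration, docno, int(rank), float(score), tag)


def read_run(path):
    """Read a gzip-compressed run file and return a list of RunEntry rows."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        return [parse_run_line(line) for line in f if line.strip()]


entry = parse_run_line('0001 Q0 gakkai-0000064659 0 13.583940506962453 pyterrier')
print(entry.qid, entry.docno, entry.rank, entry.score)
```

In this notebook, `read_run` would be pointed at the `MyRun-BM25.res.gz` file under the `RUN` directory.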
