# For Linux users, please configure as follows:

To reproduce this experiment, we advise using the following configuration:

* Python 3.8
* python-terrier==0.8.1

apt-get update

apt-get upgrade

apt install default-jre

apt install default-jdk

apt-get install python3.8

python3.8 --version  # check that Python 3.8 was installed successfully

update-alternatives --install /usr/bin/python python /usr/bin/python3.6 1

update-alternatives --install /usr/bin/python python /usr/bin/python3.8 2

update-alternatives --set python /usr/bin/python3.8

update-alternatives --config python

python -m pip install --upgrade pip

pip3 install jupyter

pip install git+https://github.com/allenai/ir_datasets.git

pip install wheel

pip install python-terrier==0.8.1
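
After installation, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `report_versions` is hypothetical, and the package names are simply the ones installed above.

```python
import sys
from importlib import metadata


def report_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The interpreter should report (3, 8) for the configuration advised above.
print(sys.version_info[:2])
print(report_versions(["python-terrier", "ir_datasets", "jupyter"]))
```

A value of `None` for a package suggests that the corresponding `pip install` step did not target the interpreter that is currently running.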


# Initializing PyTerrier

In [None]:
import pyterrier as pt
if not pt.started():
    pt.init()

# Loading the WikiCLIR Portuguese collection

Display the titles of the first 11 documents in the collection (the loop breaks only after printing index 10)

In [3]:
import ir_datasets
import pandas as pd
dataset = ir_datasets.load("wikiclir/pt")
for i, doc in enumerate(dataset.docs_iter()):
  print(doc.title)
  if i == 10:
    break

[INFO] If you have a local copy of https://www.cs.jhu.edu/~kevinduh/a/wikiclir2018/wiki-clir.tar.gz, you can symlink it here to avoid downloading it again: /root/.ir_datasets/downloads/705abb611eb8cbab9ced2b8767a3bdb6
[INFO] [starting] https://www.cs.jhu.edu/~kevinduh/a/wikiclir2018/wiki-clir.tar.gz
[INFO] [finished] https://www.cs.jhu.edu/~kevinduh/a/wikiclir2018/wiki-clir.tar.gz: [01:17] [7.04GB] [90.9MB/s]
[INFO] [starting] extracting from tar file                                                   


Astronomia
América Latina
Albino Forjaz de Sampaio
Anno Domini
Aquiles
Anarcocapitalismo
Anarquismo
Albert Einstein
Aquecimento global
Lista de padrões de arquivo gráfico
Abel (desambiguação)


[INFO] [finished] extracting from tar file [02:51]
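
The loop above prints 11 titles because it breaks only after index 10 has been printed. A minimal sketch that makes the count explicit with `itertools.islice` is shown below; the helper name `first_titles` and the stand-in `Doc` tuples are hypothetical and are used only so the snippet runs without downloading the collection.

```python
from collections import namedtuple
from itertools import islice


def first_titles(docs, n=10):
    """Return the titles of at most the first n documents from an iterator."""
    return [doc.title for doc in islice(docs, n)]


# Stand-in documents (hypothetical data, not taken from the collection).
Doc = namedtuple("Doc", ["doc_id", "title", "text"])
sample = (Doc(str(i), f"Title {i}", "") for i in range(100))

titles = first_titles(sample, n=10)
print(len(titles), titles[:3])
```

With the real collection, the same helper could be called as `first_titles(dataset.docs_iter(), 10)`.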


In [6]:
dataset = pt.get_dataset('irds:wikiclir/pt') 
for i, doc in enumerate(dataset.get_corpus_iter()):
  print(doc)
  if i == 1:
    break

wikiclir/pt documents:   0%|          | 1/973057 [00:00<06:36, 2454.24it/s]

{'title': 'Astronomia', 'text': " astronomia é uma ciência natural que estuda corpos celestes ( como estrelas , planetas , cometas , nebulosas , aglomerados de estrelas , galáxias ) e fenômenos que se originam fora da atmosfera da terra ( como a radiação cósmica de fundo em micro-ondas ) . preocupada com a evolução , a física , a química e o movimento de objetos celestes , bem como a formação e o desenvolvimento do universo . a astronomia é uma das mais antigas ciências . culturas pré-históricas deixaram registrados vários artefatos astronômicos , como stonehenge , os montes de newgrange e os menires . as primeiras civilizações , como os babilônios , gregos , chineses , indianos , iranianos e maias realizaram observações metódicas do céu noturno . no entanto , a invenção do telescópio permitiu o desenvolvimento da astronomia moderna . historicamente , a astronomia incluiu disciplinas tão diversas como astrometria , navegação astronômica , astronomia observacional e a elaboração de cale




# Indexing

Index the collection using two different pre-processing configurations:
 1. Stemming with Terrier's [Portuguese Snowball Stemmer](http://terrier.org/docs/v5.2/javadoc/org/terrier/terms/PortugueseSnowballStemmer.html) 
 2. No stemming

In [8]:
import os 
if not os.path.exists('./indices/wikiclir-stem'):
  indexer = pt.IterDictIndexer('./indices/wikiclir-stem')
  indexer.setProperty("tokeniser", "UTFTokeniser")  # Replaces the default EnglishTokeniser, which makes assumptions specific to English
  indexer.setProperty("termpipelines", "PortugueseSnowballStemmer") # Applies Terrier's PortugueseSnowballStemmer (replacing PorterStemmer)
  index_stem = indexer.index(dataset.get_corpus_iter())
else:
  index_stem = pt.IndexRef.of('./indices/wikiclir-stem/data.properties')


wikiclir/pt documents: 100%|██████████| 973057/973057 [05:09<00:00, 3139.38it/s]

09:41:02.674 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 9 empty documents





In [10]:
import os 
if not os.path.exists('./indices/wikiclir-no-stem'):
  indexer = pt.IterDictIndexer('./indices/wikiclir-no-stem')

  indexer.setProperty("tokeniser", "UTFTokeniser")  # Replaces the default EnglishTokeniser, which makes assumptions specific to English
  indexer.setProperty("termpipelines", "") # Removes the default PorterStemmer (English)
  index_nostem = indexer.index(dataset.get_corpus_iter())

else:
  index_nostem = pt.IndexRef.of('./indices/wikiclir-no-stem/data.properties')

wikiclir/pt documents: 100%|██████████| 973057/973057 [03:45<00:00, 4320.16it/s] 

09:54:54.283 [ForkJoinPool-2-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 9 empty documents





# Collection Statistics

In [11]:
index = pt.IndexFactory.of(index_stem)
print(index.getCollectionStatistics().toString())

Number of documents: 973057
Number of terms: 967661
Number of postings: 67205267
Number of fields: 1
Number of tokens: 103922756
Field names: [text]
Positions:   false
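
The raw counts above (for the stemmed index) can be turned into per-document averages. The sketch below copies the three numbers from the output and derives the average document length and the average number of postings per document; with a single field, each posting corresponds to one distinct term in one document.

```python
# Values copied from the collection statistics printed above.
num_docs = 973057
num_postings = 67205267
num_tokens = 103922756

# Average number of tokens per document.
avg_doc_length = num_tokens / num_docs
# Average number of distinct (stemmed) terms per document.
avg_distinct_terms = num_postings / num_docs

print(f"Average document length: {avg_doc_length:.2f} tokens")
print(f"Average distinct terms per document: {avg_distinct_terms:.2f}")
```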



# Translating English Queries

First, install deep-translator

In [None]:
!pip install deep-translator

In [None]:
import os
from deep_translator import GoogleTranslator, YandexTranslator, MyMemoryTranslator

def google_translator(en):
  # source='auto' lets the service detect the input language
  translated = GoogleTranslator(source='auto', target='pt').translate(en)
  return translated

def yandex_translator(en):
  # Read the API key from the environment instead of hard-coding it in the notebook.
  # YANDEX_API_KEY is an assumed variable name; set it to your own Yandex Translate key.
  translated = YandexTranslator(os.environ['YANDEX_API_KEY']).translate(source="en", target="pt", text=en)
  return translated

def myMemory_translator(en):
  translated = MyMemoryTranslator(source='en', target='pt').translate(en)
  return translated
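
Translating thousands of queries through an online service may occasionally fail on individual requests. The following is a minimal sketch of a retry wrapper that could be placed around any of the translator functions above; `translate_with_retry` and the `flaky_upper` stand-in are hypothetical names, and the retry policy (a fixed number of attempts with a fixed wait) is an assumption rather than something the original experiment used.

```python
import time


def translate_with_retry(translate, text, attempts=3, wait_seconds=1.0):
    """Call translate(text), retrying on any exception up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return translate(text)
        except Exception as error:  # broad on purpose: error types depend on the service
            last_error = error
            if attempt < attempts - 1:
                time.sleep(wait_seconds)
    raise last_error


# Stand-in "translator" that fails twice before succeeding (hypothetical).
state = {"calls": 0}

def flaky_upper(text):
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("temporary failure")
    return text.upper()

print(translate_with_retry(flaky_upper, "anarchism", wait_seconds=0))
```

In the notebook, the wrapper would be used as `translate_with_retry(google_translator, query.title)`.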

# Preprocess Queries

Removing punctuation and unnecessary symbols

In [None]:
import re 

def process(text):
  if text:
    text = text.replace("'","")
    text = text.replace('"',"")
    text = text.replace("/","")
    text = text.replace("\\","")
    text = text.replace("^","")

    text = re.sub(r"""
                  [:,.;@#?!&$*/(){}']+  # Accept one or more copies of punctuation
                  \ *           # plus zero or more copies of a space,
                  """,
                  " ",          # and replace it with a single space
                  text, flags=re.VERBOSE)
  return text
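
To see what the cleaning step does, the sketch below restates `process` in a compact form (repeated here only so the snippet runs on its own) and applies it to a made-up query. The regular expression is the same character class as above, written without `re.VERBOSE`.

```python
import re

# One or more punctuation characters followed by optional spaces.
PUNCTUATION = re.compile(r"[:,.;@#?!&$*/(){}']+ *")


def process(text):
    """Drop quotes, slashes and carets, then collapse punctuation runs into one space."""
    if text:
        for symbol in ("'", '"', "/", "\\", "^"):
            text = text.replace(symbol, "")
        text = PUNCTUATION.sub(" ", text)
    return text


print(repr(process("What's up, Doc?")))  # 'Whats up Doc '
```

Note that a trailing punctuation mark leaves a trailing space, and that empty or `None` inputs are returned unchanged.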

# Generating CSV files with 10K translated queries

In [None]:
dataset = ir_datasets.load("wikiclir/pt")

def generate_file_title():
    for i, query in enumerate(dataset.queries_iter()):
        merge = process(google_translator(query.title))

        with open('queries-title.csv','a+', encoding='utf8') as f:
          merge = merge if merge else ''
          f.write(query.query_id+', '+merge+'\n')
        
        # stop after query index 10000 (10,001 queries written; later cells use only the first 10,000)
        if i == 10000:
            break


def generate_file_first_sent():
    for i, query in enumerate(dataset.queries_iter()):
        merge = process(google_translator(query.first_sent))

        with open('queries-first-sent.csv','a+', encoding='utf8') as f:
          merge = merge if merge else ''
          f.write(query.query_id+', '+merge+'\n')
        
        # stop after query index 10000 (10,001 queries written; later cells use only the first 10,000)
        if i == 10000:
            break


def generate_file_merged():
    for i, query in enumerate(dataset.queries_iter()):
        # join title and first sentence with a space so the two do not run together
        merge = process(google_translator(query.title + ' ' + query.first_sent))

        with open('queries-merged.csv','a+', encoding='utf8') as f:
          merge = merge if merge else ''
          f.write(query.query_id+', '+merge+'\n')
        
        # stop after query index 10000 (10,001 queries written; later cells use only the first 10,000)
        if i == 10000:
            break
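
The three functions above differ only in which query field they translate. The following is a minimal sketch of how they could be folded into one helper; `write_queries`, its parameters, and the stand-in `Query` tuples are hypothetical. It writes the same `qid, text` line format, but unlike the functions above it stops after exactly `limit` queries.

```python
import io
from collections import namedtuple


def write_queries(queries, text_of, translate, clean, out, limit=10000):
    """Write '<query_id>, <cleaned translation>' lines for the first `limit` queries."""
    written = 0
    for query in queries:
        if written >= limit:
            break
        cleaned = clean(translate(text_of(query))) or ""
        out.write(f"{query.query_id}, {cleaned}\n")
        written += 1
    return written


# Stand-in queries and "translator" so the snippet runs offline (hypothetical data).
Query = namedtuple("Query", ["query_id", "title", "first_sent"])
sample = [
    Query("12", "Anarchism", "is a political philosophy"),
    Query("25", "Autism", "is a neurodevelopmental disorder"),
]
buffer = io.StringIO()
count = write_queries(sample, lambda q: q.title, str.upper, str.strip, buffer, limit=10)
print(count)
print(buffer.getvalue())
```

In the notebook, the title file could then be produced by opening `queries-title.csv` and calling `write_queries(dataset.queries_iter(), lambda q: q.title, google_translator, process, f)`.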

In [14]:
import pandas as pd
df = pd.read_csv ('queries-title.csv', names=['qid', 'query'], header=None)
# df = pd.read_csv ('queries-first-sent.csv', names=['qid', 'query'], header=None)
# df = pd.read_csv ('queries-merged.csv', names=['qid', 'query'], header=None)

dataset = pt.get_dataset('irds:wikiclir/pt') 
queries = dataset.get_topics()
queries = queries.head(10000)
queries['query'] =  df['query']

queries

Unnamed: 0,qid,title,first_sent,query
0,12,Anarchism,is a political philosophy that advocates self-...,Anarquismo
1,25,Autism,is a neurodevelopmental disorder characterized...,Autismo
2,39,Albedo,() is a measure for reflectance or optical bri...,Albedo
3,290,A,"(named , plural ""as"", ""a's"", ""a""s, ""a's"" or ""a...",UMA
4,303,Alabama,() is a state in the southeastern region of th...,Alabama
...,...,...,...,...
9995,28612,Second Epistle of John,"the , often referred to as and often written 2...",Segunda Epístola de João
9996,28615,Sequencing,"in genetics and biochemistry, means to determi...",Sequenciamento
9997,28616,Shotgun sequencing,"in genetics, is a method used for long dna str...",Sequenciamento de espingarda
9998,28617,Statue of Liberty,the (liberty enlightening the world; ) is a co...,Estátua da Liberdade


In [15]:
queries.drop(queries[queries['query']  == ' '].index, inplace = True)

del queries['title']
del queries['first_sent']

queries


Unnamed: 0,qid,query
0,12,Anarquismo
1,25,Autismo
2,39,Albedo
3,290,UMA
4,303,Alabama
...,...,...
9995,28612,Segunda Epístola de João
9996,28615,Sequenciamento
9997,28616,Sequenciamento de espingarda
9998,28617,Estátua da Liberdade


In [16]:
tf_idf_stem = pt.BatchRetrieve(index_stem, wmodel='TF_IDF')
bm25_stem = pt.BatchRetrieve(index_stem, wmodel="BM25")

In [17]:
from pyterrier.measures import *
import time
title_qrels = dataset.get_qrels().copy()
title_qrels.loc[title_qrels.label < 2, 'label'] = 0

start_time = time.time()
results = pt.Experiment(
    [tf_idf_stem, bm25_stem],
    queries,
    title_qrels,
    names=['TF-IDF', 'bm25'],
    eval_metrics= [nDCG@5, nDCG@10]
)

results
print("--- %s seconds ---" % (time.time() - start_time))

--- 735.451601266861 seconds ---


In [18]:
results

Unnamed: 0,name,nDCG@5,nDCG@10
0,TF-IDF,0.32217,0.355608
1,bm25,0.310462,0.344121
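
For readers unfamiliar with the metric, the sketch below shows one common formulation of nDCG@k (linear gain, log2 rank discount), together with the label thresholding applied to the qrels above, where every label below 2 is set to 0. It is an illustration under those assumptions and is not necessarily identical to what `pt.Experiment` computes internally; the function names and the toy data are hypothetical.

```python
import math


def zero_out_below(qrels, min_label=2):
    """Keep labels >= min_label and set every lower label to 0."""
    return {doc: (label if label >= min_label else 0) for doc, label in qrels.items()}


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount, ranks starting at 1."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))


def ndcg_at_k(ranking, qrels, k):
    """nDCG@k of a ranked list of document ids against a {doc_id: label} mapping."""
    gains = [qrels.get(doc, 0) for doc in ranking[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example: one highly relevant document (label 2) retrieved at rank 2.
qrels = zero_out_below({"d1": 2, "d2": 1, "d3": 1})
ranking = ["d2", "d1", "d4"]
print(ndcg_at_k(ranking, qrels, k=5))
```

Because the labels below 2 are zeroed out, only `d1` contributes gain here, and placing it at rank 2 instead of rank 1 lowers the score from 1 to 1/log2(3).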


In [19]:
tf_idf_nostem = pt.BatchRetrieve(index_nostem, wmodel='TF_IDF')
bm25_nostem = pt.BatchRetrieve(index_nostem, wmodel="BM25")

In [20]:
from pyterrier.measures import *
import time
title_qrels = dataset.get_qrels().copy()
title_qrels.loc[title_qrels.label < 2, 'label'] = 0

start_time = time.time()
result = pt.Experiment(
    [tf_idf_nostem, bm25_nostem],
    queries,
    title_qrels,
    names=['TF-IDF', 'bm25'],
    eval_metrics= [nDCG@5, nDCG@10, NumQ]
)

end_time = time.time()
print("--- %s seconds ---" % (end_time - start_time))
result

--- 712.4234652519226 seconds ---


Unnamed: 0,name,nDCG@5,nDCG@10,NumQ
0,TF-IDF,0.319855,0.352782,9912.0
1,bm25,0.311578,0.343747,9912.0


In [21]:
import pandas as pd
df = pd.read_csv ('queries-first-sent.csv', names=['qid', 'query'], header=None)


dataset = pt.get_dataset('irds:wikiclir/pt') 
queries = dataset.get_topics()
queries = queries.head(10000)
queries['query'] =  df['query']

queries

Unnamed: 0,qid,title,first_sent,query
0,12,Anarchism,is a political philosophy that advocates self-...,é uma filosofia política que defende sociedad...
1,25,Autism,is a neurodevelopmental disorder characterized...,é um transtorno do neurodesenvolvimento carac...
2,39,Albedo,() is a measure for reflectance or optical bri...,é uma medida de refletância ou brilho óptico...
3,290,A,"(named , plural ""as"", ""a's"", ""a""s, ""a's"" or ""a...",chamado plural as as as as ou aes é a prime...
4,303,Alabama,() is a state in the southeastern region of th...,é um estado na região sudeste dos estados un...
...,...,...,...,...
9995,28612,Second Epistle of John,"the , often referred to as and often written 2...",o muitas vezes referido como e muitas vezes ...
9996,28615,Sequencing,"in genetics and biochemistry, means to determi...",em genética e bioquímica significa determinar...
9997,28616,Shotgun sequencing,"in genetics, is a method used for long dna str...",em genética é um método usado para longas fit...
9998,28617,Statue of Liberty,the (liberty enlightening the world; ) is a co...,the liberty iluminando o mundo é uma coloss...


In [22]:
queries.drop(queries[queries['query']  == ' '].index, inplace = True)

del queries['title']
del queries['first_sent']

queries

Unnamed: 0,qid,query
0,12,é uma filosofia política que defende sociedad...
1,25,é um transtorno do neurodesenvolvimento carac...
2,39,é uma medida de refletância ou brilho óptico...
3,290,chamado plural as as as as ou aes é a prime...
4,303,é um estado na região sudeste dos estados un...
...,...,...
9995,28612,o muitas vezes referido como e muitas vezes ...
9996,28615,em genética e bioquímica significa determinar...
9997,28616,em genética é um método usado para longas fit...
9998,28617,the liberty iluminando o mundo é uma coloss...


In [24]:
from pyterrier.measures import *
import time
title_qrels = dataset.get_qrels().copy()
title_qrels.loc[title_qrels.label < 2, 'label'] = 0

start_time = time.time()
result = pt.Experiment(
    [tf_idf_nostem, bm25_nostem],
    queries,
    title_qrels,
    names=['TF-IDF', 'bm25'],
    eval_metrics= [nDCG@5, nDCG@10, NumQ]
)

end_time = time.time()
print("--- %s seconds ---" % (end_time - start_time))
result

--- 8940.976504564285 seconds ---


Unnamed: 0,name,nDCG@5,nDCG@10,NumQ
0,TF-IDF,0.363053,0.381412,9977.0
1,bm25,0.28617,0.303373,9977.0


In [25]:
from pyterrier.measures import *
import time
title_qrels = dataset.get_qrels().copy()
title_qrels.loc[title_qrels.label < 2, 'label'] = 0

start_time = time.time()
results = pt.Experiment(
    [tf_idf_stem, bm25_stem],
    queries,
    title_qrels,
    names=['TF-IDF', 'bm25'],
    eval_metrics= [nDCG@5, nDCG@10]
)

results
print("--- %s seconds ---" % (time.time() - start_time))

--- 9429.518537044525 seconds ---


In [26]:
results

Unnamed: 0,name,nDCG@5,nDCG@10
0,TF-IDF,0.372976,0.392676
1,bm25,0.290681,0.308565
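
To compare the four runs side by side, the sketch below gathers the scores reported in the result tables of this notebook and picks the configuration with the highest nDCG@10. The numbers are copied from the outputs above; the variable names are hypothetical.

```python
# (query type, index, model, nDCG@5, nDCG@10), copied from the tables above.
reported = [
    ("title", "stem", "TF-IDF", 0.32217, 0.355608),
    ("title", "stem", "bm25", 0.310462, 0.344121),
    ("title", "no-stem", "TF-IDF", 0.319855, 0.352782),
    ("title", "no-stem", "bm25", 0.311578, 0.343747),
    ("first_sent", "no-stem", "TF-IDF", 0.363053, 0.381412),
    ("first_sent", "no-stem", "bm25", 0.28617, 0.303373),
    ("first_sent", "stem", "TF-IDF", 0.372976, 0.392676),
    ("first_sent", "stem", "bm25", 0.290681, 0.308565),
]

# Sort by nDCG@10, best first, and print a compact comparison.
for query_type, index, model, ndcg5, ndcg10 in sorted(reported, key=lambda row: row[4], reverse=True):
    print(f"{query_type:<11} {index:<8} {model:<7} nDCG@5={ndcg5:.4f} nDCG@10={ndcg10:.4f}")

best = max(reported, key=lambda row: row[4])
print("Best nDCG@10:", best[:3])
```

In these runs, TF-IDF over the stemmed index with translated first-sentence queries gives the highest nDCG@10, while BM25 scores lower on first-sentence queries than on title queries.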
