# [STF-IDF](https://arxiv.org/abs/2209.14281): Multilingual Search with Subword TF-IDF

Multilingual search can be achieved with subword tokenization. The accuracy of traditional TF-IDF approaches depends on manually curated tokenization, stop words, and stemming rules, whereas subword TF-IDF (STF-IDF) can offer higher accuracy without such heuristics. Moreover, multilingual support can be built in as part of training the subword tokenization model. An XQuAD evaluation demonstrates the advantages of STF-IDF: superior information retrieval accuracy of 85.4% for English and over 80% for 10 other languages, without any heuristics-based preprocessing. The software to reproduce these results is open-sourced as part of [Text2Text](https://github.com/artitw/text2text).
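
To make the idea concrete, here is a minimal, dependency-free sketch of TF-IDF retrieval over subword units. It is not the Text2Text implementation: overlapping character trigrams (`char_ngrams`) are a hypothetical stand-in for a trained subword tokenization model, and the three documents are invented for illustration. The IDF is the common smoothed form `ln((1 + N) / (1 + df)) + 1`, and vectors are L2-normalized so the dot product is a cosine similarity.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
  """Hypothetical stand-in for a learned subword tokenizer: character n-grams."""
  text = text.lower()
  return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]

def tfidf_vectors(docs, tokenize):
  """Learn smoothed IDF weights from docs; return (vectorize, doc_vectors)."""
  tokenized = [tokenize(d) for d in docs]
  df = Counter(tok for toks in tokenized for tok in set(toks))
  n_docs = len(docs)
  idf = {tok: math.log((1 + n_docs) / (1 + count)) + 1 for tok, count in df.items()}

  def vectorize(text):
    # Tokens never seen in the corpus are ignored.
    tf = Counter(tok for tok in tokenize(text) if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}

  return vectorize, [vectorize(d) for d in docs]

def search(query, vectorize, doc_vectors):
  """Return the index of the document with the highest cosine similarity."""
  q = vectorize(query)
  scores = [sum(w * d.get(tok, 0.0) for tok, w in q.items()) for d in doc_vectors]
  return max(range(len(scores)), key=scores.__getitem__)

# Invented mixed-language corpus; no stop words or stemming rules are supplied.
docs = [
  "The cat sat on the mat.",
  "El gato se sentó en la alfombra.",
  "Share prices dropped sharply yesterday.",
]
vectorize, doc_vectors = tfidf_vectors(docs, char_ngrams)
best = search("cats sitting on mats", vectorize, doc_vectors)
spanish = search("gatos en alfombras", vectorize, doc_vectors)
print(best, spanish)
```

Inflected forms such as "cats" and "gatos" still match their documents because they share subword units with "cat" and "gato", which is the effect that stemming rules try to achieve by hand for each language.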




```
@article{stfidf,
  doi = {10.48550/ARXIV.2209.14281},
  url = {https://arxiv.org/abs/2209.14281},
  author = {Wangperawong, Artit},
  title = {Multilingual Search with Subword TF-IDF},
  publisher = {arXiv},
  year = {2022},
}
```




In [1]:
%%bash
pip install -q -U text2text
sudo apt-get -qq install libopenblas-dev
sudo apt-get -qq install libomp-dev



In [2]:
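import requests

def get_data(lang_code="en"):
  """Download XQuAD for one language and flatten it into paragraphs and questions."""
  url = f"https://raw.githubusercontent.com/deepmind/xquad/master/xquad.{lang_code}.json"
  r = requests.get(url)
  r.raise_for_status()  # fail early on a bad language code or network error
  d = r.json()
  corpus = []   # (paragraph id, context) pairs
  queries = []  # (paragraph id, question) pairs; the id is the expected answer
  para_id = 0   # named para_id to avoid shadowing the built-in id()
  for a in d["data"]:
    for p in a["paragraphs"]:
      c = p["context"]
      corpus.append((para_id, c))
      for qa in p["qas"]:
        q = qa["question"]
        queries.append((para_id, q))
      para_id += 1
  # Unzip the pairs into parallel tuples of ids and texts.
  cids, c = zip(*corpus)
  qids, q = zip(*queries)
  return cids, c, qids, q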
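
The flattening step is easier to follow on a miniature input. The sketch below applies the same traversal to a hand-written dictionary that mimics the SQuAD-style layout of the XQuAD files (`data` → `paragraphs` → `context` / `qas`). The sample contents and the `flatten` helper are invented for illustration; no network access is needed.

```python
# Hypothetical miniature of the XQuAD JSON layout (contents invented).
sample = {
  "data": [
    {"paragraphs": [
      {"context": "Paris is the capital of France.",
       "qas": [{"question": "What is the capital of France?"},
               {"question": "Which country has Paris as its capital?"}]},
      {"context": "The Nile flows through Egypt.",
       "qas": [{"question": "Which river flows through Egypt?"}]},
    ]},
  ]
}

def flatten(d):
  """Number paragraphs consecutively and tag each question with its paragraph id."""
  corpus, queries = [], []
  para_id = 0
  for article in d["data"]:
    for paragraph in article["paragraphs"]:
      corpus.append((para_id, paragraph["context"]))
      for qa in paragraph["qas"]:
        queries.append((para_id, qa["question"]))
      para_id += 1
  return corpus, queries

corpus_pairs, query_pairs = flatten(sample)
query_ids = [pid for pid, _ in query_pairs]
print(query_ids)  # [0, 0, 1]
```

Each question inherits the id of the paragraph it was written for, so the retrieval task is to recover that id from the question text alone.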

In [3]:
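import text2text as t2t
import numpy as np

def evaluate_stfidf(corpus_ids, corpus, ans_ids, queries):
  # Index the paragraphs with Text2Text, keyed by paragraph id.
  index = t2t.Handler(corpus).index(ids=corpus_ids)
  # Retrieve only the single best-matching paragraph for each query.
  dist, pred_ids = index.search(queries, k=1)
  # Top-1 accuracy: share of queries whose best match is the expected paragraph.
  accuracy = np.sum(pred_ids.reshape(-1)==np.array(ans_ids))/len(ans_ids)
  return accuracy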

Better speed can be achieved with apex installed.
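
The metric computed by `evaluate_stfidf` is top-1 retrieval accuracy. As a plain-Python illustration of that calculation, independent of Text2Text and NumPy, the sketch below uses a hypothetical helper `top1_accuracy` and invented id lists.

```python
def top1_accuracy(pred_ids, ans_ids):
  """Fraction of queries whose top-ranked paragraph id equals the expected id."""
  if len(pred_ids) != len(ans_ids):
    raise ValueError("pred_ids and ans_ids must have the same length")
  hits = sum(int(p == a) for p, a in zip(pred_ids, ans_ids))
  return hits / len(ans_ids)

pred_ids = [0, 0, 1, 2, 2]  # hypothetical top-1 results for five queries
ans_ids = [0, 0, 1, 1, 2]   # hypothetical expected paragraph ids
acc = top1_accuracy(pred_ids, ans_ids)
print(acc)  # 0.8
```

A query counts as correct only if the very first result is its source paragraph; lower-ranked hits earn no credit.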


In [4]:
lang_codes = ["en","es","de","el","ru","tr","ar","vi","th","zh","hi","ro"]

for lang in lang_codes:
  corpus_ids, corpus, ans_ids, queries = get_data(lang_code=lang)
  acc = evaluate_stfidf(corpus_ids, corpus, ans_ids, queries)
  print(lang, acc)

en 0.853781512605042
es 0.8579831932773109
de 0.8487394957983193
el 0.8134453781512605
ru 0.8294117647058824
tr 0.8008403361344538
ar 0.7705882352941177
vi 0.8445378151260504
th 0.8352941176470589
zh 0.8243697478991596
hi 0.8092436974789916
ro 0.8504201680672269
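
As a quick check of the summary claim, the accuracies printed above can be compared against the 80% threshold. The numbers are copied from that output; the threshold check itself is only an illustrative sketch.

```python
# Top-1 accuracies copied from the evaluation output.
results = {
  "en": 0.853781512605042,
  "es": 0.8579831932773109,
  "de": 0.8487394957983193,
  "el": 0.8134453781512605,
  "ru": 0.8294117647058824,
  "tr": 0.8008403361344538,
  "ar": 0.7705882352941177,
  "vi": 0.8445378151260504,
  "th": 0.8352941176470589,
  "zh": 0.8243697478991596,
  "hi": 0.8092436974789916,
  "ro": 0.8504201680672269,
}

others = {lang: acc for lang, acc in results.items() if lang != "en"}
above = [lang for lang, acc in others.items() if acc > 0.80]
below = [lang for lang, acc in others.items() if acc <= 0.80]
print(len(above), below)  # 10 ['ar']
```

Ten of the eleven non-English languages exceed 80%; Arabic is the exception at about 77%.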


In [5]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from nltk.stem.porter import PorterStemmer
import nltk
nltk.download('punkt')
stemmer = PorterStemmer()

def _tokenizer(strategy, s):
  """Word-level tokenization with optional stop-word removal and stemming."""
  s = nltk.word_tokenize(s)
  if strategy == "word":
    return s

  if strategy == "word>stem":
    return [stemmer.stem(item) for item in s]

  # Matching is case-sensitive. TfidfVectorizer lowercases text before calling
  # the tokenizer, but the direct calls below pass raw text, so capitalized
  # stop words such as "The" are kept on that path.
  s = [word for word in s if word not in ENGLISH_STOP_WORDS]
  if strategy == "word>stop":
    return s

  # Remaining strategy: "word>stop>stem".
  return [stemmer.stem(item) for item in s]

corpus_ids, corpus, ans_ids, queries = get_data(lang_code="en")

for strat in ["word", "word>stop", "word>stem", "word>stop>stem"]:
  # Baseline: word-level TF-IDF, scoring each paragraph against each query.
  vectorizer = TfidfVectorizer(tokenizer=lambda x: _tokenizer(strat, x))
  C = vectorizer.fit_transform(corpus)
  Q = vectorizer.transform(queries)
  scores = np.matmul(C.toarray(), Q.transpose().toarray())
  pred_ids = np.argmax(scores, axis=0)  # row index equals paragraph id
  accuracy = np.sum(pred_ids==np.array(ans_ids))/len(ans_ids)
  print(strat, accuracy)
  # Same preprocessing, re-joined into text and then evaluated with STF-IDF.
  c_subword = [" ".join(_tokenizer(strat, d)) for d in corpus]
  q_subword = [" ".join(_tokenizer(strat, q)) for q in queries]
  acc = evaluate_stfidf(corpus_ids, c_subword, ans_ids, q_subword)
  print(strat+">subword", acc)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


word 0.8420168067226891
word>subword 0.8495798319327731
word>stop 0.8394957983193277
word>stop>subword 0.8420168067226891
word>stem 0.8487394957983193
word>stem>subword 0.853781512605042
word>stop>stem 0.8521008403361344
word>stop>stem>subword 0.8445378151260504
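
To read the comparison at a glance, the change from adding the subword step can be computed per preprocessing strategy. The values are copied from the output above; the tabulation is a small illustrative sketch rather than part of the original evaluation.

```python
# Accuracies copied from the English comparison output.
baseline = {
  "word": 0.8420168067226891,
  "word>stop": 0.8394957983193277,
  "word>stem": 0.8487394957983193,
  "word>stop>stem": 0.8521008403361344,
}
with_subword = {
  "word": 0.8495798319327731,
  "word>stop": 0.8420168067226891,
  "word>stem": 0.853781512605042,
  "word>stop>stem": 0.8445378151260504,
}

deltas = {s: with_subword[s] - baseline[s] for s in baseline}
for strat, delta in deltas.items():
  print(f"{strat:<15} {delta:+.4f}")

improved = [s for s, delta in deltas.items() if delta > 0]
```

Adding the subword step raises accuracy for three of the four strategies and lowers it for `word>stop>stem`. None of the subword variants exceeds the 0.8538 obtained for English with STF-IDF on raw text.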
