# [STF-IDF](https://arxiv.org/abs/2209.14281): Multilingual Search with Subword TF-IDF

Multilingual search can be achieved with subword tokenization. The accuracy of traditional TF-IDF approaches depends on manually curated tokenization, stop words and stemming rules, whereas subword TF-IDF (STF-IDF) can offer higher accuracy without such heuristics. Moreover, multilingual support can be incorporated inherently as part of the subword tokenization model training. XQuAD evaluation demonstrates the advantages of STF-IDF: superior information retrieval accuracy of 85.4% for English and over 80% for 10 other languages without any heuristics-based preprocessing. The software to reproduce these results is open-sourced as a part of [Text2Text](https://github.com/artitw/text2text).




```
@article{stfidf,
  doi = {10.48550/ARXIV.2209.14281},
  url = {https://arxiv.org/abs/2209.14281},
  author = {Wangperawong, Artit},
  title = {Multilingual Search with Subword TF-IDF},
  publisher = {arXiv},
  year = {2022},
}
```
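To make the idea in the abstract concrete, here is a minimal, dependency-free sketch of TF-IDF retrieval over subword units. It is not the Text2Text implementation: overlapping character trigrams stand in for a learned subword tokenizer, and the helper names (`char_ngrams`, `fit_idf`, `vectorize`, `search`), the smoothed IDF formula, and the toy documents are all assumptions made for illustration. The point it shows is that no stop-word list or stemmer is needed for a query to match a document in the same language.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Stand-in 'subword' tokenizer: overlapping character n-grams.
    This is an assumption for illustration, not a learned subword model."""
    s = text.lower()
    return [s[i:i + n] for i in range(max(len(s) - n + 1, 1))]

def fit_idf(docs, n=3):
    """Smoothed inverse document frequency for every n-gram seen in docs."""
    df = Counter()
    for d in docs:
        df.update(set(char_ngrams(d, n)))
    total = len(docs)
    return {t: math.log((1 + total) / (1 + c)) + 1.0 for t, c in df.items()}

def vectorize(text, idf, n=3):
    """L2-normalised sparse TF-IDF vector; n-grams unseen at fit time are dropped."""
    tf = Counter(t for t in char_ngrams(text, n) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def search(query, docs, idf, n=3):
    """Return the index of the document with the highest cosine similarity."""
    q = vectorize(query, idf, n)
    doc_vecs = [vectorize(d, idf, n) for d in docs]
    scores = [sum(w * dv.get(t, 0.0) for t, w in q.items()) for dv in doc_vecs]
    return max(range(len(docs)), key=scores.__getitem__)

# Toy multilingual corpus (hypothetical data, one sentence per language).
docs = [
    "The cat sat on the mat.",
    "Der Hund läuft im Park.",
    "El gato duerme en la casa.",
]
idf = fit_idf(docs)
best_es = search("¿Dónde duerme el gato?", docs, idf)
best_de = search("Wo läuft der Hund?", docs, idf)
print(best_es, best_de)  # 2 1
```

The same index serves the Spanish and the German query because the units are language-agnostic; a trained subword vocabulary plays that role in STF-IDF.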




In [None]:
%%bash
pip install -qq -U text2text


In [None]:
### Restart runtime to use the newly installed packages

import sys

if "google.colab" in sys.modules:

    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

In [None]:
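import requests

def get_data(lang_code="en"):
  # Download one language split of XQuAD (SQuAD-style JSON).
  url = f"https://raw.githubusercontent.com/deepmind/xquad/master/xquad.{lang_code}.json"
  r = requests.get(url, timeout=30)
  r.raise_for_status()  # fail early on a bad language code or network error
  d = r.json()
  corpus = []
  queries = []
  pid = 0  # running paragraph id, shared by a context and its questions
  for a in d["data"]:
    for p in a["paragraphs"]:
      corpus.append((pid, p["context"]))
      # Every question is labelled with the id of the paragraph it was written for.
      for qa in p["qas"]:
        queries.append((pid, qa["question"]))
      pid += 1
  cids, c = zip(*corpus)
  qids, q = zip(*queries)
  return cids, c, qids, q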
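As an offline illustration of what `get_data` returns, the sketch below applies the same flattening to a tiny hand-written dictionary in the SQuAD-style layout. The `toy` data and the `flatten` helper are hypothetical; only the nesting (`data` → `paragraphs` → `context` / `qas` → `question`) is taken from the cell above.

```python
# Hypothetical miniature of the SQuAD-style layout that get_data() walks over.
toy = {
    "data": [
        {"paragraphs": [
            {"context": "Paris is the capital of France.",
             "qas": [{"question": "What is the capital of France?"},
                     {"question": "Which country has Paris as its capital?"}]},
            {"context": "The Nile flows through Egypt.",
             "qas": [{"question": "Which river flows through Egypt?"}]},
        ]},
    ]
}

def flatten(d):
    """Mirror get_data() without the network call: one id per paragraph,
    and each question tagged with the id of its source paragraph."""
    corpus, queries = [], []
    pid = 0
    for article in d["data"]:
        for paragraph in article["paragraphs"]:
            corpus.append((pid, paragraph["context"]))
            for qa in paragraph["qas"]:
                queries.append((pid, qa["question"]))
            pid += 1
    cids, contexts = zip(*corpus)
    qids, questions = zip(*queries)
    return cids, contexts, qids, questions

cids, contexts, qids, questions = flatten(toy)
print(cids)  # (0, 1)
print(qids)  # (0, 0, 1)
```

Because each question carries its paragraph id, retrieval can be scored by checking whether the top-ranked paragraph id equals the question's id.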

In [None]:
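import text2text as t2t
import numpy as np

def evaluate_stfidf(corpus_ids, corpus, ans_ids, queries):
  # Build the STF-IDF index over the corpus, keyed by paragraph id.
  index = t2t.Indexer().transform(corpus, ids=corpus_ids)
  # Retrieve the single best-matching paragraph for every query.
  dist, pred_ids = index.search(queries, k=1)
  # Top-1 accuracy: fraction of queries whose retrieved id is the source paragraph.
  accuracy = np.sum(pred_ids.reshape(-1)==np.array(ans_ids))/len(ans_ids)
  return accuracy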

Better speed can be achieved with apex installed.
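The metric computed in `evaluate_stfidf` is top-1 retrieval accuracy: a query counts as correct only when the highest-ranked paragraph is the one the question was written for. A plain-Python sketch of that calculation follows; `top1_accuracy` and the sample ids are hypothetical, and the final check assumes XQuAD's 1,190 questions per language.

```python
def top1_accuracy(pred_ids, ans_ids):
    """Fraction of queries whose top-ranked document id equals the gold id."""
    if len(pred_ids) != len(ans_ids):
        raise ValueError("pred_ids and ans_ids must have the same length")
    hits = sum(int(p == a) for p, a in zip(pred_ids, ans_ids))
    return hits / len(ans_ids)

# Four queries, three of them retrieved the right paragraph.
preds = [0, 0, 2, 1]
answers = [0, 0, 1, 1]
acc = top1_accuracy(preds, answers)
print(acc)  # 0.75

# Under the 1,190-question assumption, 1,016 hits give an accuracy of about 0.8538.
acc_1016 = top1_accuracy([1] * 1016 + [0] * 174, [1] * 1190)
```

With `k=1` there is no partial credit, so this is a stricter measure than recall at larger `k`.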


In [None]:
lang_codes = ["en","es","de","el","ru","tr","ar","vi","th","zh","hi","ro"]

for lang in lang_codes:
  corpus_ids, corpus, ans_ids, queries = get_data(lang_code=lang)
  acc = evaluate_stfidf(corpus_ids, corpus, ans_ids, queries)
  print(lang, acc)

Creating index with 128104 dimensions.
en 0.853781512605042
Creating index with 128104 dimensions.
es 0.8579831932773109
Creating index with 128104 dimensions.
de 0.8487394957983193
Creating index with 128104 dimensions.
el 0.8134453781512605
Creating index with 128104 dimensions.
ru 0.8294117647058824
Creating index with 128104 dimensions.
tr 0.8008403361344538
Creating index with 128104 dimensions.
ar 0.7705882352941177
Creating index with 128104 dimensions.
vi 0.8445378151260504
Creating index with 128104 dimensions.
th 0.8352941176470589
Creating index with 128104 dimensions.
zh 0.8243697478991596
Creating index with 128104 dimensions.
hi 0.8092436974789916
Creating index with 128104 dimensions.
ro 0.8504201680672269


In [None]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from nltk.stem.porter import PorterStemmer
import nltk
nltk.download('punkt')
stemmer = PorterStemmer()

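def _tokenizer(strategy, s):
  # Word-level tokenization is the common first step of every strategy.
  s = nltk.word_tokenize(s)
  if strategy == "word":
    return s

  # Stemming only, no stop-word removal.
  if strategy == "word>stem":
    return [stemmer.stem(item) for item in s]

  # Stop-word removal (case-sensitive match against scikit-learn's English list).
  s = [word for word in s if word not in ENGLISH_STOP_WORDS]
  if strategy == "word>stop":
    return s

  # Remaining case, "word>stop>stem": stem what survives stop-word removal.
  return [stemmer.stem(item) for item in s]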

corpus_ids, corpus, ans_ids, queries = get_data(lang_code="en")

for strat in ["word", "word>stop", "word>stem", "word>stop>stem"]:
  # Baseline: word-level TF-IDF with the chosen preprocessing strategy.
  vectorizer = TfidfVectorizer(tokenizer=lambda x: _tokenizer(strat, x))
  C = vectorizer.fit_transform(corpus)
  Q = vectorizer.transform(queries)
  # scores[i, j] is the similarity of paragraph i to query j.
  scores = np.matmul(C.toarray(), Q.transpose().toarray())
  pred_ids = np.argmax(scores, axis=0)
  accuracy = np.sum(pred_ids==np.array(ans_ids))/len(ans_ids)
  print(strat, accuracy)
  # Same preprocessing, then STF-IDF on the re-joined text.
  c_subword = [" ".join(_tokenizer(strat, d)) for d in corpus]
  q_subword = [" ".join(_tokenizer(strat, q)) for q in queries]
  acc = evaluate_stfidf(corpus_ids, c_subword, ans_ids, q_subword)
  print(strat+">subword", acc)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


word 0.8420168067226891
Creating index with 128104 dimensions.
word>subword 0.8495798319327731
word>stop 0.8394957983193277
Creating index with 128104 dimensions.
word>stop>subword 0.8420168067226891
word>stem 0.8487394957983193
Creating index with 128104 dimensions.
word>stem>subword 0.853781512605042
word>stop>stem 0.8521008403361344
Creating index with 128104 dimensions.
word>stop>stem>subword 0.8445378151260504
