simple-retriever (basic retrieval methods)

A simple wrapper around common retrieval tools

This repository is a simple wrapper around common retrieval tools, especially sklearn tf-idf and some Hugging Face models.
We support fast, typo-tolerant tf-idf and multilingual sentence embedding models for retrieval.
This repo is mainly for educational purposes, so the code is very basic and easy to read (no fancy abstractions).

editable install

Clone the repository from GitHub.
cd into the repository folder.
Install with pip in editable mode:
pip install -e .

usage:

You can find the most relevant documents in just 3 lines of code!

from retriever import retriever_factory
docs = [
    'hello world',
    'how are you woooorld',
    'i am fine ',
    'this is a junk sentence!',
    'this is a siiiiimilar word with a typo',
    "it's time to find most similar documents",
]
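# Build a tf-idf retriever from a predefined config
retriever = retriever_factory(method='tf_idf_cfg_1')
# Index the documents
retriever.add_doc_batch(docs)
# Retrieve the 4 documents most similar to the query
results = retriever.find_similars('similar term!', top_k=4)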

Dense models share the same interface:

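samples = [
    "میوه تازه",  # fresh fruit
    "شیر",  # milk
    "ماست",  # yogurt
    "شیرینی",  # sweets
    "دوغ و نوشابه",  # doogh (yogurt drink) and soda
    "دروغ گفتن",  # telling lies
    "بادام",  # almond
    "کره",  # butter
    "هلو های تازه",  # fresh peaches
    "نارنج",  # bitter orange
    "شکلات صبحانه",  # breakfast chocolate spread
    "فندق",  # hazelnut
    "آزادی بیان",  # freedom of speech
    "مبارزه با تروریسم",  # fight against terrorism
    "مبارزه با فساد",  # fight against corruption
    "مبارزه ی مدنی",  # civil struggle
    "یادگیری ماشین",  # machine learning
    "الگوریتم های دسته بندی",  # classification algorithms
    "سربار مالیاتی",  # tax overhead
    "نت ضعیفه",  # the internet connection is weak
    "ورزش صبحگاه",  # morning exercise
]
retriever = retriever_factory(method='dense_LaBSE')
retriever.add_doc_batch(samples)
results1 = retriever.find_similars("فعالیت بدنی")  # "physical activity" -> "ورزش صبحگاه" (morning exercise)
results2 = retriever.find_similars("نوشیدنی")  # "beverage" -> "دوغ و نوشابه" (doogh and soda)
results3 = retriever.find_similars("هزینه های پنهان")  # "hidden costs" -> "سربار مالیاتی" (tax overhead)
results4 = retriever.find_similars("کلاه برداری کردن")  # "committing fraud" -> "دروغ گفتن" (telling lies)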
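
Dense methods typically embed the query and every document as vectors, then rank documents by vector similarity. The sketch below shows only that ranking step, using hand-written toy vectors in place of real model output. It does not load LaBSE, and `top_k_by_cosine` and the toy vectors are hypothetical names for illustration, not part of this repo.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_by_cosine(query_vec, doc_vecs, top_k=1):
    """Return (cosine score, document index) pairs, best first."""
    q = normalize(query_vec)
    scores = []
    for idx, vec in enumerate(doc_vecs):
        d = normalize(vec)
        scores.append((sum(a * b for a, b in zip(q, d)), idx))
    scores.sort(reverse=True)
    return scores[:top_k]


# Toy 3-d "embeddings" (assumption: a real model returns much larger vectors).
toy_doc_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 0.95]]
toy_query_vec = [0.8, 0.2, 0.1]

best = top_k_by_cosine(toy_query_vec, toy_doc_vecs, top_k=1)
best_index = best[0][1]  # the query points in nearly the same direction as document 0
```

Because similarity is computed in embedding space, a query can match a document that shares no words with it, which is what the Persian examples above rely on.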

supported retrieval methods

tf_idf_cfg_1

tf-idf config-1 -> fast and typo-tolerant tf-idf (insensitive to word order)
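
One common way to get typo tolerance from tf-idf is to index character n-grams inside each word instead of whole words, so a misspelled word still shares most of its n-grams with the correct one. The dependency-free sketch below illustrates that idea. It is an assumption about the general technique, not the actual config in factory.py, and all function names here are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-grams taken inside each space-padded word."""
    grams = []
    for word in text.lower().split():
        padded = f" {word} "
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


def build_tfidf(docs, n=3):
    """Return one sparse tf-idf vector (dict) per document and the idf table."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    df = Counter(g for c in counts for g in c)
    # Smoothed idf: ln((1 + N) / (1 + df)) + 1
    idf = {g: math.log((1 + len(docs)) / (1 + f)) + 1 for g, f in df.items()}
    vectors = [{g: tf * idf[g] for g, tf in c.items()} for c in counts]
    return vectors, idf


def cosine(a, b):
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def search(query, docs, vectors, idf, n=3, top_k=1):
    """Rank documents by cosine similarity to the query, best first."""
    q_counts = Counter(char_ngrams(query, n))
    # n-grams never seen in the indexed documents are ignored.
    q_vec = {g: tf * idf[g] for g, tf in q_counts.items() if g in idf}
    scored = sorted(((cosine(q_vec, v), d) for v, d in zip(vectors, docs)), reverse=True)
    return scored[:top_k]


docs = ["hello world", "i am fine", "this is a junk sentence!"]
vectors, idf = build_tfidf(docs)
top = search("helo wrld", docs, vectors, idf)  # misspelled query
```

Since the n-grams are collected per word and then counted as a bag, swapping the words of the query does not change its vector, which matches the "insensitive to word order" note.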

tf_idf_cfg_2

tf-idf config-2 -> less typo-tolerant tf-idf with a little order sensitivity
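
A usual way to make a bag-of-words model slightly order sensitive is to add word n-grams (for example bigrams) as features. The sketch below is only an illustration of that effect, under the assumption that this config works along similar lines; see factory.py for the real settings. `word_ngrams` and `jaccard` are hypothetical helpers.

```python
def word_ngrams(text, max_n=2):
    """Return all word n-grams of length 1..max_n."""
    words = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return grams


def jaccard(a, b):
    """Set overlap between two feature lists, in [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


first = "dog bites man"
second = "man bites dog"

# Unigrams only: both sentences have the same words, so the overlap is 1.0.
unigram_overlap = jaccard(word_ngrams(first, 1), word_ngrams(second, 1))
# With bigrams added, the differing word order lowers the overlap below 1.0.
bigram_overlap = jaccard(word_ngrams(first, 2), word_ngrams(second, 2))
```

Whole-word features like these only match on exact spelling, which is consistent with this config being less typo tolerant.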

dense_LaBSE

A good, large model.
You can try this model on Hugging Face.

dense_multilingual_e5

A little smaller than LaBSE, but still a good model.
You can try this model on Hugging Face.

dense_MiniLM

A small model.
You can try this model on Hugging Face.

ensemble_cfg_1

An ensemble of 4 different models (LaBSE, MiniLM, e5, tf-idf). It may need 4 GB of free RAM for initialization.
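
Scores from different retrievers live on different scales, so an ensemble usually has to normalize them before combining. The sketch below shows one simple option: min-max normalization followed by averaging. This is an assumption for illustration only; how this repo actually combines its models is defined in the code, and `min_max` and `fuse` are hypothetical names.

```python
def min_max(scores):
    """Rescale a list of scores to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(score_lists):
    """Average the normalized scores of several models, per document."""
    normalized = [min_max(scores) for scores in score_lists]
    n_docs = len(normalized[0])
    return [sum(model[i] for model in normalized) / len(normalized) for i in range(n_docs)]


# Made-up scores for 3 documents from two models with different scales.
tfidf_scores = [0.1, 0.9, 0.5]
dense_scores = [10.0, 30.0, 20.0]

fused = fuse([tfidf_scores, dense_scores])
best_index = max(range(len(fused)), key=fused.__getitem__)
```

Averaging gives every model the same weight; a weighted sum is an equally reasonable variation.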

custom configs

To understand custom configs, please refer to the factory.py file.

alireza-hariri/simple-retriever
