Minimalistic text search engine that uses sklearn and pandas.
This is a simple search library implemented using sklearn and pandas.
It allows you to index documents with text and keyword fields and perform search queries with support for filtering and boosting.
Install the library with pip:
pip install minsearch
To run it locally, make sure you have the required dependencies installed:
pip install pandas scikit-learn
Alternatively, use pipenv:
pipenv install --dev
Here's how you can use the library:
Prepare your documents as a list of dictionaries. Each dictionary should have the text and keyword fields you want to index.
docs = [
    {
        "question": "How do I join the course after it has started?",
        "text": "You can join the course at any time. We have recordings available.",
        "section": "General Information",
        "course": "data-engineering-zoomcamp"
    },
    {
        "question": "What are the prerequisites for the course?",
        "text": "You need to have basic knowledge of programming.",
        "section": "Course Requirements",
        "course": "data-engineering-zoomcamp"
    }
]
Create an instance of the Index class, specifying the text and keyword fields.
from minsearch import Index
index = Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)
Fit the index with your documents.
index.fit(docs)
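The construction and fitting steps can also be combined. In the reference implementation fit returns the index itself, so the calls can be chained, but treat this as an assumption and verify it against your installed version:

from minsearch import Index

# assumes fit() returns the index instance
# (true for the reference implementation; verify in your installed version)
index = Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
).fit(docs)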
Search the index with a query string, an optional filter dictionary (restricting results to documents whose keyword fields match the given values), and an optional boost dictionary (increasing the weight of specific text fields).
query = "Can I join the course if it has already started?"
filter_dict = {"course": "data-engineering-zoomcamp"}
boost_dict = {"question": 3, "text": 1, "section": 1}
results = index.search(query, filter_dict, boost_dict)
for result in results:
    print(result)
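You can also limit how many results are returned. In the version of the library used in the course this was a num_results parameter of search; the exact name may differ, so treat it as an assumption and check the signature in your installed version:

# num_results is an assumption based on the course version of the library;
# check the search() signature in your installed version
results = index.search(query, filter_dict, boost_dict, num_results=5)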
Run it in a notebook to test it yourself:
pipenv run jupyter notebook
There are both a minsearch folder and a minsearch.py file in the repository root. The minsearch.py file is kept there because it was used in the LLM Zoomcamp course, where we'd download it with wget. To avoid breaking changes, we keep the file.
Use twine for publishing and build for building:
pipenv install --dev twine build
Generate a wheel:
pipenv run python -m build
Check the packages:
pipenv run twine check dist/*
Upload the library to test PyPI to verify everything is working:
pipenv run twine upload --repository-url https://test.pypi.org/legacy/ dist/*
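After the upload, the package can be installed from Test PyPI to double-check it. This uses standard pip flags; the extra index is needed because dependencies such as pandas and scikit-learn live on the regular PyPI:

pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ minsearch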
Upload to PyPI:
pipenv run twine upload dist/*
Clean:
rm -r build/ dist/ minsearch.egg-info/
Done!