docsearcx is a simple search engine that retrieves relevant documents from PDFs using term frequency-inverse document frequency (TF-IDF) and cosine similarity.
As a proof of concept (POC), this application relies on in-memory storage.
If pipenv is already installed, skip this step.
pip install pipenv
pipenv install
Then activate the virtual environment shell by running:
pipenv shell
python app.py
cd client/
npm install
npm run serve
TF-IDF is a numerical statistic that reflects how important a word is to a document in a corpus. The TF-IDF value increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.
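
To make the weighting and ranking concrete, here is a minimal pure-Python sketch of TF-IDF scoring followed by cosine-similarity ranking. It is an illustration only, not necessarily how docsearcx implements it. The function names (tokenize, build_idf, tfidf_vector, cosine_similarity, search) are hypothetical, the tokenizer is deliberately naive, PDF text extraction is left out, and idf(t) = log(N / df(t)) is just one common variant of the formula.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_idf(tokenized_docs):
    """Compute idf(t) = log(N / df(t)) for every term in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Weight each term by (count / length) * idf; terms unknown to the corpus are skipped."""
    if not tokens:
        return {}
    counts = Counter(tokens)
    total = len(tokens)
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs):
    """Return (document index, score) pairs, best match first."""
    tokenized = [tokenize(d) for d in docs]
    idf = build_idf(tokenized)
    doc_vectors = [tfidf_vector(tokens, idf) for tokens in tokenized]
    query_vector = tfidf_vector(tokenize(query), idf)
    scores = [(i, cosine_similarity(query_vector, v)) for i, v in enumerate(doc_vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical in-memory corpus standing in for text extracted from PDFs.
    corpus = [
        "the cat sat on the mat",
        "the dog chased the ball",
        "the cat chased the mouse",
    ]
    for index, score in search("cat mouse", corpus):
        print(index, round(score, 3))
```

In this sketch, a word such as "the" that appears in every document gets an idf of zero and so contributes nothing to the ranking, which is the "offset by the frequency of the word in the corpus" effect described above.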