Team THM Submission

This demo showcases the vector similarity search feature of Redis Enterprise.

RediSearch lets developers add documents and their embedding indexes to the database, turning Redis into a vector database for modern data-driven web applications.

See Architecture for how it works, and User Workflow for how it can be used.
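
As a minimal sketch of what this looks like with redis-py (the index name, key prefix, field names, and embedding dimension below are illustrative assumptions, not the project's actual schema):

import numpy as np
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

r = redis.Redis(host="localhost", port=6379)

# Create an index over hashes with a text field and an HNSW vector field.
r.ft("papers").create_index(
    [
        TextField("title"),
        VectorField("embedding", "HNSW", {"TYPE": "FLOAT32", "DIM": 768, "DISTANCE_METRIC": "COSINE"}),
    ],
    definition=IndexDefinition(prefix=["paper:"], index_type=IndexType.HASH),
)

# Store one document: the embedding is serialized as raw float32 bytes.
vec = np.random.rand(768).astype(np.float32)
r.hset("paper:0001", mapping={"title": "An example paper", "embedding": vec.tobytes()})

# KNN query: return the 5 documents nearest to a query vector.
q = (
    Query("*=>[KNN 5 @embedding $vec AS score]")
    .sort_by("score")
    .return_fields("title", "score")
    .dialect(2)
)
results = r.ft("papers").search(q, query_params={"vec": vec.tobytes()})
for doc in results.docs:
    print(doc.title, doc.score)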




Documentation

History of Changes

  • 1/11 - Added a multi-category classifier, a Question Answering engine and a CLI HTTP client to the backend
  • 31/10 - Draft blog posts and CLI ETL tool
  • 30/10 - Refactored RedisVentures/redis-arXiv-search project
  • 27/10 - Set up Redis Enterprise Cloud and Saturn Cloud accounts and organized the work within the team
  • 15/10 - Added a blog based on Pelican
  • 15/10 - Added CI/CD script
  • 15/10 - Forked from RedisVentures/redis-arXiv-search

Machine Setup

brew install yarn redis
pip install -r backend/requirements.txt
pip install -r scripts/requirements.txt

Architecture

The user performs searches against the Redis database through a REST API HTTP server.

We wrote a small interactive CLI client tool that calls the HTTP server and returns papers matching the user's queries.

                        writes pickle and loads index
+-------------------+      +----------------+
|                   |      |                |
|  Redis            +<-----+  ETL CLI       |
|                   |      |                |
+--------+----------+      +----------------+
         ^
         |  reads search index
+--------+----------+
|                   |
|  FastAPI          |
|                   |
+--------+----------+
         ^
         |  calls backend
+--------+----------+      +---------------------+
|                   |      |                     |
|  THM CLI          +----->+  arxiv.org          |
|                   |      |  wolfram.alpha.com  |
+-------------------+      +---------------------+
    researcher uses the THM CLI while writing research
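
As a rough illustration of the "calls backend" arrow above, here is a minimal sketch of a search call from a client against the FastAPI server. The endpoint path and payload keys are illustrative assumptions; the real schema is documented at http://0.0.0.0:8080/api/docs once the backend is running.

import requests

API_URL = "http://0.0.0.0:8080/api"  # local FastAPI server started by backend/start.sh

def search_papers(query: str, limit: int = 5) -> list:
    # "/search" and the payload keys below are placeholders, not the actual route names.
    response = requests.post(f"{API_URL}/search", json={"text": query, "limit": limit})
    response.raise_for_status()
    return response.json()

for paper in search_papers("vector similarity search with Redis"):
    print(paper)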

User Workflow

This CLI tool is a quick assistant for a researcher's daily activities and helps improve their efficiency.

It can be used alongside a text editor and browser, and helps with:

  • building a bibliography in Markdown or BibTeX format (a sketch follows the diagram below),
  • checking the PDF papers on the arXiv website,
  • checking scientific facts on the Wolfram Alpha website.
  graph TD;
      welcome_message-->choose_activity;
      welcome_message-->configure_parameters;
      choose_activity-->search_keywords;
      search_keywords-->Search_API;
      choose_activity-->search_similar_to;
      search_similar_to-->Search_API;
      choose_activity-->fetch_paper_details;
      fetch_paper_details-->Search_API;
      choose_activity-->ask_open_question;
      ask_open_question-->HuggingFacePipeline;
      choose_activity-->find_formula;
      find_formula-->Wolfram_Alpha_API;
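
For the bibliography step, here is a rough sketch of how paper metadata returned by the search API could be turned into a BibTeX entry. The helper and the field names are illustrative, not the CLI's actual code.

def to_bibtex(paper: dict) -> str:
    """Format a paper record (arXiv id, title, authors, year) as a BibTeX @article entry."""
    key = paper["id"].replace("/", "_")
    return "\n".join([
        f"@article{{{key},",
        f"  title   = {{{paper['title']}}},",
        f"  author  = {{{' and '.join(paper['authors'])}}},",
        f"  year    = {{{paper['year']}}},",
        f"  journal = {{arXiv preprint arXiv:{paper['id']}}},",
        "}",
    ])

print(to_bibtex({
    "id": "1706.03762",
    "title": "Attention Is All You Need",
    "authors": ["Ashish Vaswani", "Noam Shazeer"],
    "year": 2017,
}))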

Running The Application

Backend Application

Set up your Redis Enterprise Cloud database, then:

cd backend/
./start.sh

open http://0.0.0.0:8080/api/docs

Deploy on Saturn Cloud

Train model

cd scripts/
pip install -r requirements.txt
bash retrain_model.sh

THM CLI

cd scripts/
pip install -r requirements.txt
./thm-cli.py

ETL CLI

cd scripts/
pip install -r requirements.txt
./pipeline.sh

Blog

# To preview files locally
pelican blog/content && pelican --listen

# To publish on GitHub pages
make publish_blog

Machine Learning Models

The project uses the UKPLab/sentence-transformers library to compute dense vector representations for sentences found in Cornell's arXiv corpus.

We found the following interesting NLP models on the community-built leaderboard; a short encoding sketch follows the list.

  • sentence-transformers/all-mpnet-base-v2 has embeddings of size 768 and relatively good performance
  • sentence-transformers/all-MiniLM-L6-v2
  • sentence-transformers/all-MiniLM-L12-v2 has embeddings of size 384 and is interesting for development, as inference is faster
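
A minimal sketch of the encoding step with sentence-transformers (the model choice and abstracts below are placeholders):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

abstracts = [
    "We study vector similarity search over scientific abstracts.",
    "A transformer-based approach to multi-label classification.",
]
# normalize_embeddings=True makes cosine similarity equivalent to a dot product.
embeddings = model.encode(abstracts, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384) for the MiniLM models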

We also used transformers.AutoModelForSequenceClassification for the problem of multi-category classification.

For the problem of Question Answering we used distilbert-base-cased-distilled-squad.
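
A minimal sketch of the Question Answering step with the transformers pipeline (the question and context below are placeholders):

from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

answer = qa(
    question="What does RediSearch add to Redis?",
    context="RediSearch adds secondary indexing, full-text search and vector similarity search to Redis.",
)
print(answer["answer"], answer["score"])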

  graph TD;
      sentence-transformers/all-MiniLM-L12-v2-->THM_API;
      transformers.AutoModelForSequenceClassification-->THM_API;
      THM_API-->THM_CLI;
      distilbert-base-cased-distilled-squad-->THM_CLI;

Benchmarks

See our blog for the benchmarks we ran to evaluate the full solution.

Contributions

Changes and improvements are welcome! Feel free to fork and open a pull request into main.