Home
This page presents the scope of the 'VectorWisdom' organization. 'VectorWisdom' is a non-profit research organization focused on the way up the pyramid of data, information, knowledge and wisdom. This is pursued with machine learning, and more specifically with Large Language Models (LLMs), their building blocks the embeddings and the transformers that generate them, the vector databases that store the embeddings, and semantic search, which can reference its sources.
A Large Language Model is a neural network with up to millions or billions of parameters (neuron weights) that targets natural language processing applications. It can follow different architectures such as GPT (Generative Pre-trained Transformer) or BERT (Bidirectional Encoder Representations from Transformers).
Hugging Face is a community-driven website that collects state-of-the-art datasets and models, and also provides a machine learning framework for building, training and deploying the hosted datasets and models.
- Homepage : https://huggingface.co/
- List of LLMs : https://huggingface.co/LLMs
- LLMs Leaderboard : https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
LangChain is a framework that abstracts other libraries to speed up the development of LLM applications.
- OpenAI
- Product : https://openai.com/product
- Facebook
- LLaMA : https://ai.facebook.com/blog/large-language-model-llama-meta-ai/
- helper to run LLaMA https://github.com/cocktailpeanut/dalai
- Alpaca - Stanford
- github (23k) https://github.com/tatsu-lab/stanford_alpaca
- GPT4All (40k) https://github.com/nomic-ai/gpt4all
- Baize 2.5 k https://github.com/project-baize/baize-chatbot
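The generative principle behind the GPT-style models listed above can be illustrated with a toy bigram model: it predicts each next token from the current one using a count table, where a real LLM uses billions of learned parameters. The corpus and table below are made up for illustration.

```python
import random

# Toy bigram "language model": learn next-token candidates by counting
# adjacent word pairs in a tiny corpus.
corpus = "the cat sat on the mat the cat ate the fish".split()

counts = {}
for prev, nxt in zip(corpus, corpus[1:]):
    counts.setdefault(prev, []).append(nxt)

def generate(start, length, seed=0):
    """Sample a token sequence by repeatedly picking a plausible next token."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        candidates = counts.get(out[-1])
        if not candidates:  # dead end: no observed continuation
            break
        out.append(rng.choice(candidates))
    return " ".join(out)

print(generate("the", 4))
```

The same loop — predict, sample, append, repeat — is how autoregressive LLMs produce text, only with a transformer computing the next-token distribution.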
Tokens are words or parts of words. A token can be transformed into a large list of numbers representing features or meanings; these numbers form a vector called an embedding.
- Word Embeddings - Wikipedia : https://en.wikipedia.org/wiki/Word_embedding
- Massive Text Embedding Benchmark - Leaderboard : https://huggingface.co/spaces/mteb/leaderboard
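The idea that embeddings capture meaning can be sketched with cosine similarity: words with related meanings get vectors pointing in similar directions. The 3-dimensional vectors below are made-up toy values; real models use hundreds of dimensions.

```python
import math

# Toy embeddings (values invented for illustration).
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.3],
}

def cosine(a, b):
    """Cosine similarity: close to 1.0 means similar direction, i.e. similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

print(cosine(embeddings["cat"], embeddings["dog"]))  # high: related meanings
print(cosine(embeddings["cat"], embeddings["car"]))  # low: unrelated
```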
A transformer is a trained model that can generate embeddings.
- Sentence Transformers
- Homepage : https://www.sbert.net/
- SBERT (11k) https://github.com/UKPLab/sentence-transformers
- OpenAI
- Microsoft
- Text Embeddings by Weakly-Supervised Contrastive Pre-training : https://arxiv.org/pdf/2212.03533.pdf
- unified pre-training for language understanding and generation (13k) https://github.com/microsoft/unilm
- Google BERT
- Bidirectional Encoder Representations from Transformers https://arxiv.org/pdf/1810.04805.pdf
- Universal Sentence Encoder
- Other techniques : Word2Vec, GloVe, FastText
- CNN models : VGG, ResNet, Inception
- OpenCV : SIFT, SURF, ORB
- FastText
- ImageNet
- Hugging face transformers (pre-trained models) : ViT, DeiT
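One of the simplest ways to turn a sentence into a single embedding is to average its word vectors. The sentence transformers listed above (e.g. SBERT) do this far better with contextual attention layers; the word vectors below are invented toy values.

```python
# Toy word vectors (made up for illustration).
word_vectors = {
    "the": [0.1, 0.1],
    "cat": [0.9, 0.2],
    "dog": [0.8, 0.3],
    "sleeps": [0.2, 0.9],
}

def sentence_embedding(sentence):
    """Average the vectors of known words to get one sentence vector."""
    vectors = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

print(sentence_embedding("the cat sleeps"))
```

Averaging loses word order ("dog bites man" and "man bites dog" get the same vector), which is exactly the limitation transformer-based sentence encoders address.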
A vector database is a storage system for vectors, optimized for embeddings (dense vectors). It can scale to a large number of embeddings.
- Milvus
- homepage https://milvus.io/
- Milvus (21k) https://github.com/milvus-io/milvus
- Weaviate
- homepage https://weaviate.io/developers/weaviate
- github (6k) https://github.com/weaviate/weaviate
- Pinecone https://docs.pinecone.io/docs/overview
- faiss - facebook (23k) https://github.com/facebookresearch/faiss
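The core service these engines provide can be sketched as a minimal in-memory vector store: it holds (id, vector) pairs and answers nearest-neighbour queries by brute force. Milvus, Weaviate, Pinecone and faiss do the same job with approximate indexes that scale to billions of vectors; the store and vectors below are toy illustrations.

```python
import math

class VectorStore:
    """Minimal in-memory vector store with brute-force cosine search."""

    def __init__(self):
        self.items = []  # list of (doc_id, vector)

    def add(self, doc_id, vector):
        self.items.append((doc_id, vector))

    def search(self, query, k=1):
        """Return the k stored ids whose vectors are closest to the query."""
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(y * y for y in b))
            return dot / (na * nb)

        ranked = sorted(self.items, key=lambda it: cos(query, it[1]), reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]

store = VectorStore()
store.add("doc-cats", [0.9, 0.1])
store.add("doc-cars", [0.1, 0.9])
print(store.search([0.8, 0.2]))  # nearest to the "cats" vector
```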
Semantic search is a search that tries to 'understand' the user query and match it with meaningful results.
It can be seen as an extension of text search that includes similarities and meanings in the match, as opposed to full-text search,
which only finds exact character matches. Semantic search can convert word tokens to embeddings for fuzzy and synonym search, which makes it rely on the same principles as LLMs.
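The contrast between exact character matching and meaning-based matching can be shown with a toy example. Here a hand-written synonym table stands in for embedding similarity; a real semantic engine compares vectors instead. All names and documents below are invented.

```python
# Two toy documents and a made-up synonym table standing in for embeddings.
documents = ["The car broke down", "A recipe for pancakes"]
synonyms = {"automobile": "car", "auto": "car"}

def full_text_search(query, docs):
    """Exact substring match only."""
    return [d for d in docs if query.lower() in d.lower()]

def semantic_search(query, docs):
    """Normalise the query through the synonym table before matching,
    approximating what embedding similarity achieves."""
    normalised = synonyms.get(query.lower(), query.lower())
    return [d for d in docs if normalised in d.lower()]

print(full_text_search("automobile", documents))  # no exact match
print(semantic_search("automobile", documents))   # finds the car document
```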
- Meilisearch
- Typesense
- docs https://typesense.org/docs/
- github (13k) https://github.com/typesense/typesense
- Solr
- features : https://solr.apache.org/features.html
- github (770) https://github.com/apache/solr
- Algolia https://www.algolia.com/doc/
- Elasticsearch https://www.elastic.co/subscriptions
- Transformers can be used both for instant search and for building context input for LLMs; they offer easy integrations with ML providers such as OpenAI and Hugging Face.
- LLMs can support generative search, which pipes search results through an LLM.
- An LLM holds all its data within its internal parameters and does not know where that data came from; it is therefore not well suited for search, but rather for providing answers without references (falling from the sky).
- A semantic search engine's task is to connect a user query with a known reference that the user can open for further consultation.
Choosing semantic search over an LLM prompt is a matter of result presentation and use case.
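The generative search pipeline mentioned above — retrieve a source first, then hand it to an LLM as context — can be sketched as follows. `ask_llm` is a hypothetical placeholder for a real model call (OpenAI, a local LLaMA, ...), and the documents and retrieval heuristic are toy illustrations.

```python
# Toy document store.
documents = {
    "doc-1": "Milvus is an open-source vector database.",
    "doc-2": "Pancakes are made from flour, milk and eggs.",
}

def retrieve(query):
    """Naive retrieval: pick the document sharing the most words with the query."""
    words = set(query.lower().split())
    return max(documents, key=lambda d: len(words & set(documents[d].lower().split())))

def ask_llm(prompt):
    # Placeholder: a real implementation would call a model API here.
    return f"[LLM answer based on: {prompt[:40]}...]"

def generative_search(query):
    source = retrieve(query)
    answer = ask_llm(f"Context: {documents[source]}\nQuestion: {query}")
    return answer, source  # the source id stays attached to the answer

answer, source = generative_search("what is a vector database")
print(source)
```

Unlike a bare LLM prompt, the answer here arrives with the reference that produced it, which is the property the semantic search engine contributes to the pipeline.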