Home
This page presents the scope of the 'VectorWisdom' organization. 'VectorWisdom' is a non-profit research organization focused on the way up the pyramid of data, information, knowledge and wisdom. This is pursued with machine learning, and more specifically with Large Language Models (LLMs), their building blocks the embeddings and the transformers that generate them, the vector databases that store the embeddings, and semantic search, which can reference its sources.
A Large Language Model is a neural network with up to millions or billions of parameters (neuron weights) that targets natural language processing applications. It can follow different architectures such as GPT (Generative Pre-trained Transformer) or BERT (Bidirectional Encoder Representations from Transformers).
Hugging Face is a community-driven website that collects state-of-the-art datasets and models, and also provides a machine learning framework for building, training and deploying the hosted datasets and models.
- Homepage : https://huggingface.co/
- List of LLMs : https://huggingface.co/LLMs
- LLMs Leaderboard : https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
LangChain is a framework that abstracts other libraries to speed up the development of LLM applications.
- OpenAI
- Product : https://openai.com/product
- Facebook
- LLaMA : https://ai.facebook.com/blog/large-language-model-llama-meta-ai/
- helper to run LLaMA https://github.com/cocktailpeanut/dalai
- Alpaca - Stanford
- github (23k) https://github.com/tatsu-lab/stanford_alpaca
- GPT4All (40k) https://github.com/nomic-ai/gpt4all
- Baize 2.5 k https://github.com/project-baize/baize-chatbot
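The generative principle behind the GPT-style models listed above can be illustrated with a toy bigram model: it predicts each next token from the current one using a count table, where a real LLM uses billions of learned parameters. The corpus and table below are made up for illustration.

```python
import random

# Toy bigram "language model": learn next-token candidates by counting
# adjacent word pairs in a tiny corpus.
corpus = "the cat sat on the mat the cat ate the fish".split()

counts = {}
for prev, nxt in zip(corpus, corpus[1:]):
    counts.setdefault(prev, []).append(nxt)

def generate(start, length, seed=0):
    """Sample a token sequence by repeatedly picking a plausible next token."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        candidates = counts.get(out[-1])
        if not candidates:  # dead end: no observed continuation
            break
        out.append(rng.choice(candidates))
    return " ".join(out)

print(generate("the", 4))
```

The same loop — predict, sample, append, repeat — is how autoregressive LLMs produce text, only with a transformer computing the next-token distribution.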
Tokens are words or parts of words. A token can be transformed into a large list of numbers representing features or meanings; these numbers form a vector called an embedding.
- Word Embeddings - Wikipedia : https://en.wikipedia.org/wiki/Word_embedding
- Massive Text Embedding Benchmark - Leaderboard : https://huggingface.co/spaces/mteb/leaderboard
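The idea that embeddings capture meaning can be sketched with cosine similarity: words with related meanings get vectors pointing in similar directions. The 3-dimensional vectors below are made-up toy values; real models use hundreds of dimensions.

```python
import math

# Toy embeddings (values invented for illustration).
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.3],
}

def cosine(a, b):
    """Cosine similarity: close to 1.0 means similar direction, i.e. similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

print(cosine(embeddings["cat"], embeddings["dog"]))  # high: related meanings
print(cosine(embeddings["cat"], embeddings["car"]))  # low: unrelated
```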
A transformer is a trained model that can generate embeddings.
- Sentence Transformers
- Homepage : https://www.sbert.net/
- SBERT (11k) https://github.com/UKPLab/sentence-transformers
- OpenAI
- Microsoft
- Text Embeddings by Weakly-Supervised Contrastive Pre-training : https://arxiv.org/pdf/2212.03533.pdf
- unified pre-training for language understanding and generation (13k) https://github.com/microsoft/unilm
- Google BERT
- Bidirectional Encoder Representations from Transformers https://arxiv.org/pdf/1810.04805.pdf
- Universal Sentence Encoder
- Other techniques : Word2Vec, GloVe, FastText
- CNN models : VGG, ResNet, Inception
- OpenCV : SIFT, SURF, ORB
- FastText
- ImageNet
- Hugging face transformers (pre-trained models) : ViT, DeiT
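One of the simplest ways to turn a sentence into a single embedding is to average its word vectors. The sentence transformers listed above (e.g. SBERT) do this far better with contextual attention layers; the word vectors below are invented toy values.

```python
# Toy word vectors (made up for illustration).
word_vectors = {
    "the": [0.1, 0.1],
    "cat": [0.9, 0.2],
    "dog": [0.8, 0.3],
    "sleeps": [0.2, 0.9],
}

def sentence_embedding(sentence):
    """Average the vectors of known words to get one sentence vector."""
    vectors = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

print(sentence_embedding("the cat sleeps"))
```

Averaging loses word order ("dog bites man" and "man bites dog" get the same vector), which is exactly the limitation transformer-based sentence encoders address.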
A vector database is a storage system for vectors, optimized for embeddings (dense vectors). It can scale to a large number of embeddings.
- Milvus
- homepage https://milvus.io/
- Milvus (21k) https://github.com/milvus-io/milvus
- Weaviate
- homepage https://weaviate.io/developers/weaviate
- github (6k) https://github.com/weaviate/weaviate
- Pinecone https://docs.pinecone.io/docs/overview
- faiss - facebook (23k) https://github.com/facebookresearch/faiss
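The core service these engines provide can be sketched as a minimal in-memory vector store: it holds (id, vector) pairs and answers nearest-neighbour queries by brute force. Milvus, Weaviate, Pinecone and faiss do the same job with approximate indexes that scale to billions of vectors; the store and vectors below are toy illustrations.

```python
import math

class VectorStore:
    """Minimal in-memory vector store with brute-force cosine search."""

    def __init__(self):
        self.items = []  # list of (doc_id, vector)

    def add(self, doc_id, vector):
        self.items.append((doc_id, vector))

    def search(self, query, k=1):
        """Return the k stored ids whose vectors are closest to the query."""
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(y * y for y in b))
            return dot / (na * nb)

        ranked = sorted(self.items, key=lambda it: cos(query, it[1]), reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]

store = VectorStore()
store.add("doc-cats", [0.9, 0.1])
store.add("doc-cars", [0.1, 0.9])
print(store.search([0.8, 0.2]))  # nearest to the "cats" vector
```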
Semantic search is a search that tries to 'understand' the user query and match it with meaningful results.
It can be seen as an extension of text search that includes similarities and meanings in the match, as opposed to full-text search,
which only finds exact character matches. Semantic search can convert word tokens to embeddings for fuzzy and synonym search, which makes it rely on the same principles as LLMs.
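The contrast between exact character matching and meaning-based matching can be shown with a toy example. Here a hand-written synonym table stands in for embedding similarity; a real semantic engine compares vectors instead. All names and documents below are invented.

```python
# Two toy documents and a made-up synonym table standing in for embeddings.
documents = ["The car broke down", "A recipe for pancakes"]
synonyms = {"automobile": "car", "auto": "car"}

def full_text_search(query, docs):
    """Exact substring match only."""
    return [d for d in docs if query.lower() in d.lower()]

def semantic_search(query, docs):
    """Normalise the query through the synonym table before matching,
    approximating what embedding similarity achieves."""
    normalised = synonyms.get(query.lower(), query.lower())
    return [d for d in docs if normalised in d.lower()]

print(full_text_search("automobile", documents))  # no exact match
print(semantic_search("automobile", documents))   # finds the car document
```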
- Meilisearch
- Typesense
- docs https://typesense.org/docs/
- github (13k) https://github.com/typesense/typesense
- Solr
- features : https://solr.apache.org/features.html
- github (770) https://github.com/apache/solr
- Algolia https://www.algolia.com/doc/
- Elasticsearch https://www.elastic.co/subscriptions
- Transformers can be used both for instant search and for building context input for LLMs; they offer easy integrations with ML providers such as OpenAI and Hugging Face.
- LLMs can support generative search, which pipes search results through an LLM.
- An LLM holds all its data within its internal parameters and does not know where that data came from; it is therefore not well suited for search, but rather for providing answers without references (falling from the sky).
- A semantic search engine's task is to connect a user query with a known reference that the user can open for further consultation.
Choosing semantic search over an LLM prompt is a matter of result presentation and use case.
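The generative search pipeline mentioned above — retrieve a source first, then hand it to an LLM as context — can be sketched as follows. `ask_llm` is a hypothetical placeholder for a real model call (OpenAI, a local LLaMA, ...), and the documents and retrieval heuristic are toy illustrations.

```python
# Toy document store.
documents = {
    "doc-1": "Milvus is an open-source vector database.",
    "doc-2": "Pancakes are made from flour, milk and eggs.",
}

def retrieve(query):
    """Naive retrieval: pick the document sharing the most words with the query."""
    words = set(query.lower().split())
    return max(documents, key=lambda d: len(words & set(documents[d].lower().split())))

def ask_llm(prompt):
    # Placeholder: a real implementation would call a model API here.
    return f"[LLM answer based on: {prompt[:40]}...]"

def generative_search(query):
    source = retrieve(query)
    answer = ask_llm(f"Context: {documents[source]}\nQuestion: {query}")
    return answer, source  # the source id stays attached to the answer

answer, source = generative_search("what is a vector database")
print(source)
```

Unlike a bare LLM prompt, the answer here arrives with the reference that produced it, which is the property the semantic search engine contributes to the pipeline.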