<a href="https://colab.research.google.com/github/different-ai/embedbase/blob/main/notebooks/Embedbase_Getting_started.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Welcome to Embedbase!

As a reminder, Embedbase is the end-to-end platform for managing ML embeddings.

Embeddings allow you to:
- connect your data to ChatGPT or any other LLM
- create recommendation engines
- classify data
- detect anomalies
- etc.
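
All of these use cases rest on the same basic operation: comparing embedding vectors. Here is a minimal sketch, assuming cosine similarity as the metric and using made-up 3-dimensional vectors (real embedding models produce hundreds of dimensions). The vector values and variable names are hypothetical.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy, hand-written "embeddings" (hypothetical values, for illustration only)
lentil_soup = [0.9, 0.1, 0.0]
lentil_curry = [0.8, 0.2, 0.1]
linux_terminal = [0.0, 0.1, 0.9]

print(cosine_similarity(lentil_soup, lentil_curry))    # close to 1: similar
print(cosine_similarity(lentil_soup, linux_terminal))  # close to 0: unrelated
```

Texts with related meaning should end up with vectors pointing in similar directions, which is what makes search, recommendation, and anomaly detection possible.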

Today we will run a local-first Embedbase using a `sentence-transformers` model as `Embedder` and a `MemoryDatabase` to store embeddings.

We need to install a few dependencies, including:
- [Hugging Face's `datasets` library](https://huggingface.co/docs/datasets/index) to get some real data to play with

In [1]:
!pip install -q embedbase sentence-transformers datasets git+https://github.com/different-ai/embedbase.git@main#subdirectory=sdk/embedbase-py


In [2]:
from typing import List, Union
from embedbase import get_app
from embedbase.database.memory_db import MemoryDatabase
from embedbase.embedding.base import Embedder
from sentence_transformers import SentenceTransformer

class LocalEmbedder(Embedder):
    EMBEDDING_MODEL = "all-MiniLM-L6-v2"
 
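    def __init__(
        self, model: str = EMBEDDING_MODEL, **kwargs
    ):
        super().__init__(**kwargs)
        self.model = SentenceTransformer(model)
        # Cache the embedding size so the `dimensions` property has a value to return
        self._dimensions = self.model.get_sentence_embedding_dimension()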
 
    @property
    def dimensions(self) -> int:
        """
        Return the dimensions of the embeddings
        :return: dimensions of the embeddings
        """
        return self._dimensions
 
    def is_too_big(self, text: str) -> bool:
        """
        Check if text is too big to be embedded,
        delegating the splitting UX to the caller
        :param text: text to check
        :return: True if text is too big, False otherwise
        """
        # Rough heuristic: compares the character count to the model's
        # maximum sequence length, which is measured in tokens
        return len(text) > self.model.get_max_seq_length()
 
    async def embed(self, data: Union[List[str], str]) -> List[List[float]]:
        """
        Embed a single text or a list of texts
        :param data: a text or a list of texts
        :return: list of embeddings, one per input text
        """
        embeddings = self.model.encode(data)
        # encode() returns a 1-D array for a single string, so wrap it in a list
        return embeddings.tolist() if isinstance(data, list) else [embeddings.tolist()]


def run_app():
    print(
        """
                 _           _   _         _               _          _         
        /\ \        /\_\/\_\ _    / /\            /\ \       /\ \       
       /  \ \      / / / / //\_\ / /  \          /  \ \     /  \ \____  
      / /\ \ \    /\ \/ \ \/ / // / /\ \        / /\ \ \   / /\ \_____\ 
     / / /\ \_\  /  \____\__/ // / /\ \ \      / / /\ \_\ / / /\/___  / 
    / /_/_ \/_/ / /\/________// / /\ \_\ \    / /_/_ \/_// / /   / / /  
   / /____/\   / / /\/_// / // / /\ \ \___\  / /____/\  / / /   / / /   
  / /\____\/  / / /    / / // / /  \ \ \__/ / /\____\/ / / /   / / /    
 / / /______ / / /    / / // / /____\_\ \  / / /______ \ \ \__/ / /     
/ / /_______\\/_/    / / // / /__________\/ / /_______\ \ \___\/ /      
\/__________/ _      \/_/ \/_____________/\/__________/  \/_____/       
             / /\            / /\               / /\         /\ \       
            / /  \          / /  \             / /  \       /  \ \      
           / / /\ \        / / /\ \           / / /\ \__   / /\ \ \     
          / / /\ \ \      / / /\ \ \         / / /\ \___\ / / /\ \_\    
         / / /\ \_\ \    / / /  \ \ \        \ \ \ \/___// /_/_ \/_/    
        / / /\ \ \___\  / / /___/ /\ \        \ \ \     / /____/\       
       / / /  \ \ \__/ / / /_____/ /\ \   _    \ \ \   / /\____\/       
      / / /____\_\ \  / /_________/\ \ \ /_/\__/ / /  / / /______       
     / / /__________\/ / /_       __\ \_\\ \/___/ /  / / /_______\      
     \/_____________/\_\___\     /____/_/ \_____\/   \/__________/      
                                                                                                
                 [-0.005, 0.012, -0.008, ..., -0.010]
        """
    )
    return get_app().use_db(MemoryDatabase()).use_embedder(LocalEmbedder()).run()

app = run_app()

2023-04-23 11:46:08,501 - embedbase - INFO - Enabling Database <embedbase.database.memory_db.MemoryDatabase object at 0x7f7edacca4c0>
2023-04-23 11:46:10,730 - embedbase - INFO - Enabling Embedder <__main__.LocalEmbedder object at 0x7f7fbd2a2fa0>


Let's load a [dataset of ChatGPT prompts](https://huggingface.co/datasets/fka/awesome-chatgpt-prompts) from the Hugging Face Hub:

In [3]:
from datasets import load_dataset
dataset_id = "fka/awesome-chatgpt-prompts"
dataset = load_dataset(dataset_id, 'en', split='train', streaming=True)
print(next(iter(dataset)))


{'act': 'Linux Terminal', 'prompt': 'I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd'}


In [6]:
# Required workaround for using EmbedbaseClient inside a Jupyter notebook (e.g. Colab),
# where an event loop is already running:
# https://stackoverflow.com/questions/46827007/runtimeerror-this-event-loop-is-already-running-in-python
import nest_asyncio
nest_asyncio.apply()
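
To see why this patch is needed: a notebook kernel already runs an asyncio event loop, and by default a running loop refuses to be re-entered. The sketch below reproduces that error using only the standard library. `fetch` and `caller` are hypothetical names, no Embedbase code is involved, and the claim that a synchronous client ends up making a similar call is an assumption. Run it as a standalone script, before any `nest_asyncio.apply()` call.

```python
import asyncio


async def fetch() -> int:
    return 42


async def caller() -> str:
    # We are now inside a running loop, just as a notebook cell would be
    loop = asyncio.get_running_loop()
    coro = fetch()
    try:
        # Blocking on async work from inside the running loop is not allowed
        loop.run_until_complete(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(caller())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that this kind of nested call is permitted.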

In [8]:
from embedbase_client.client import EmbedbaseClient
from embedbase_client.split import split_text
from pprint import pprint

embedbase = EmbedbaseClient("http://localhost:8000", fastapi_app=app)

embedbase_dataset_id = dataset_id.split("/")[-1]
i = 0
documents = []
for row in dataset:
  # ⚠️ We split each prompt into small chunks (max_tokens=30) because
  # the model used here has a relatively limited input size.
  # With other models, such as OpenAI's embeddings model, you can use
  # larger values, for example max_tokens=500 and chunk_overlap=200
  # (Embedbase Cloud uses an OpenAI model at the moment) ⚠️
  for c in split_text(row["prompt"], max_tokens=30, chunk_overlap=20):
    documents.append({
        "data": c.chunk,
    })

  i+=1
  if i > 100_000:
    break
res = embedbase.dataset(embedbase_dataset_id).batch_add(documents)
pprint(res)

2023-04-23 11:57:29,368 - embedbase - INFO - Refreshing 1476 embeddings
2023-04-23 11:57:29,376 - embedbase - INFO - Checking embeddings computing necessity for 1476 documents
2023-04-23 11:57:29,451 - embedbase - INFO - We will compute embeddings for 0/1476 documents
2023-04-23 11:57:29,526 - embedbase - INFO - Uploaded 0 documents
2023-04-23 11:57:29,531 - embedbase - INFO - Uploaded in 0.16291451454162598 seconds


[{'id': '2714c785b83665cb225a561bbf88198384bdc9d5d845236650395cccd2649a06',
  'status': 'success'},
 {'id': 'acc6a94f82f5876390392fb336b3078f7c73fdb5238f762a4e5301e559b52878',
  'status': 'success'},
 {'id': '9d0ba86c2644b9bbf2b18a9e0baea1a1eef685c14ba13f11b84546c1bca9e259',
  'status': 'success'},
 {'id': '1394ed85d47ad5e26342e0144aaac87f3bda6cdf3b7a274c59108746d6ea2eea',
  'status': 'success'},
 {'id': '2100bb59bc6f2250c2b330f8fdd6f686d207d14b91f83800baf669bb03827531',
  'status': 'success'},
 {'id': '5749fdaa62afc3b7d38f7a08d271618e35e0c45b25b890b7419dc20218a809ff',
  'status': 'success'},
 {'id': '813a41d5d5e54333df93cfec17cbf1f9cd64d92a0185d04872ccda548021dd77',
  'status': 'success'},
 {'id': '642f76096938867d3885a45bdb82c30401972396ed506369e3611f44ff142b0f',
  'status': 'success'},
 {'id': '854d2a3f5eb794a7b4223ecb26d8a41c22008ffbdae7fc408b6940f2318a455b',
  'status': 'success'},
 {'id': 'ee14c3176014a0d22b455ab25200e14eda06c2a64426f0bdd01c9fe6b328a575',
  'status': 'success'},
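
The `max_tokens` and `chunk_overlap` parameters used above can be read as a sliding window over the text. The sketch below illustrates the idea with a hypothetical `split_words` helper that counts whitespace-separated words. Judging by its parameter names, the real `split_text` works on model tokens rather than words, so its chunk boundaries will differ.

```python
from typing import List


def split_words(text: str, max_words: int, overlap: int) -> List[str]:
    """Split text into chunks of at most max_words words, each sharing
    `overlap` words with the previous chunk."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the window already reached the end of the text
    return chunks


chunks = split_words("a b c d e f g h i j", max_words=4, overlap=2)
print(chunks)
```

Under this sliding-window reading, `max_tokens=30` with `chunk_overlap=20` advances the window by only about 10 tokens per chunk, so each prompt yields many overlapping chunks.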


In [9]:
res = embedbase.dataset(embedbase_dataset_id).search("any idea of recipes with lentils with basil and ginger", 15)
for r in res:
    pprint(r.data)

2023-04-23 11:58:23,161 - embedbase - INFO - Query any idea of recipes with lentils with basil and ginger created embedding, querying index


(', and you will suggest recipes for me to try. You should only reply with the '
 'recipes you recommend, and nothing else. Do not write explanations.')
(' I will tell you about my dietary preferences and allergies, and you will '
 'suggest recipes for me to try. You should only reply with the recipes you '
 'recommend')
('. You should only reply with the recipes you recommend, and nothing else. Do '
 'not write explanations. My first request is "I am a vegetarian and')
(' a vegetarian recipe for 2 people that has approximate 500 calories per '
 'serving and has a low glycemic index. Can you please provide a suggestion?')
('I want you to act as my personal chef. I will tell you about my dietary '
 'preferences and allergies, and you will suggest recipes for me to try')
' I am looking for healthy dinner ideas."'
('I require someone who can suggest delicious recipes that includes foods '
 'which are nutritionally beneficial but also easy & not time consuming enough '
 'therefore suitable
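
Conceptually, the search above embeds the query and returns the stored chunks whose vectors are closest to it. Below is a brute-force sketch of that idea, assuming cosine similarity as the metric and using hand-written 2-dimensional vectors in place of real embeddings. The `store` contents and the `normalize` and `search` helpers are hypothetical and are not Embedbase's implementation.

```python
import math
from typing import Dict, List, Tuple


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def search(
    query: List[float], store: Dict[str, List[float]], top_k: int
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the top_k best."""
    q = normalize(query)
    scored = [
        (text, sum(a * b for a, b in zip(q, normalize(vec))))
        for text, vec in store.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical stored chunks with made-up embeddings
store = {
    "suggest recipes for me to try": [0.9, 0.1],
    "act as my personal chef": [0.7, 0.3],
    "act as a linux terminal": [0.1, 0.9],
}

results = search([1.0, 0.0], store, top_k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

Scanning every vector is fine for an in-memory demo of this size; larger collections usually rely on an approximate nearest-neighbor index instead.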

Congrats, you have seen the main features of Embedbase. From here, you can:
- Build a [recommendation engine](https://betterprogramming.pub/using-openai-to-increase-time-spent-on-your-blog-3f138d5ae6aa)
- Connect your data sources to ChatGPT or other LLMs, for example to build [ChatGPT-powered documentation](https://betterprogramming.pub/building-a-chatgpt-powered-markdown-documentation-in-no-time-50e308f9038e)
- Detect anomalies
- Classify your data
- Visualize your data distribution in 2D or 3D