<a href="https://colab.research.google.com/github/different-ai/embedbase/blob/main/notebooks/Embedbase_Getting_started.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Welcome to Embedbase!

![embedbase logo](https://docs.embedbase.xyz/embedbase-long.svg)

As a reminder, embedbase is the end-to-end platform to manage ML embeddings.

Embeddings allow you to, among other things:
- connect your data to ChatGPT or any other LLM
- create recommendation engines
- classify data
- detect anomalies

Today we will run a local-first Embedbase using a `sentence-transformers` model as `Embedder` and a `MemoryDatabase` to store embeddings.

We need to install a few dependencies, such as
- [Huggingface's "datasets" library](https://huggingface.co/docs/datasets/index) to get some real data to play with

In [1]:
!pip install -q embedbase sentence-transformers datasets git+https://github.com/different-ai/embedbase.git@main#subdirectory=sdk/embedbase-py


Let's set up an `Embedder`, a key component of embedbase that transforms data into vectors so that pieces of data can be compared by similarity.

Another key component of embedbase is the `database` where you store your data and these vectors. In this example we will store everything in memory. 

ℹ️ In production you should use a database that supports vectors. Embedbase Cloud uses Supabase for example.
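
To make "comparing similarities" concrete, here is a minimal sketch of cosine similarity, a common way to compare two embedding vectors. The three toy vectors are made up for illustration (the `all-MiniLM-L6-v2` model used below produces 384-dimensional vectors), and this is not necessarily the exact metric embedbase applies internally.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# toy 3-dimensional "embeddings", invented for this example
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # close to 1: similar meaning
print(cosine_similarity(cat, invoice))  # close to 0: unrelated
```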

In [3]:
from typing import List, Union
from embedbase import get_app
from embedbase.database.memory_db import MemoryDatabase
from embedbase.embedding.base import Embedder
from sentence_transformers import SentenceTransformer

class LocalEmbedder(Embedder):
    EMBEDDING_MODEL = "all-MiniLM-L6-v2"
 
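    def __init__(
        self, model: str = EMBEDDING_MODEL, **kwargs
    ):
        super().__init__(**kwargs)
        self.model = SentenceTransformer(model)
        # the `dimensions` property below reads this attribute, so set it explicitly
        self._dimensions = self.model.get_sentence_embedding_dimension()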
 
    @property
    def dimensions(self) -> int:
        """
        Return the dimensions of the embeddings
        :return: dimensions of the embeddings
        """
        return self._dimensions
 
    def is_too_big(self, text: str) -> bool:
        """
        Check if text is too big to be embedded,
        delegating the splitting UX to the caller
        :param text: text to check
        :return: True if text is too big, False otherwise
        """
        return len(text) > self.model.get_max_seq_length()
 
    async def embed(self, data: Union[List[str], str]) -> List[List[float]]:
        """
        Embed a single text or a list of texts
        :param data: a text or a list of texts
        :return: list of embeddings, one per input text
        """
        embeddings = self.model.encode(data)
        return embeddings.tolist() if isinstance(data, list) else [embeddings.tolist()]


def run_app():
    print(
        """
                 _           _   _         _               _          _         
        /\ \        /\_\/\_\ _    / /\            /\ \       /\ \       
       /  \ \      / / / / //\_\ / /  \          /  \ \     /  \ \____  
      / /\ \ \    /\ \/ \ \/ / // / /\ \        / /\ \ \   / /\ \_____\ 
     / / /\ \_\  /  \____\__/ // / /\ \ \      / / /\ \_\ / / /\/___  / 
    / /_/_ \/_/ / /\/________// / /\ \_\ \    / /_/_ \/_// / /   / / /  
   / /____/\   / / /\/_// / // / /\ \ \___\  / /____/\  / / /   / / /   
  / /\____\/  / / /    / / // / /  \ \ \__/ / /\____\/ / / /   / / /    
 / / /______ / / /    / / // / /____\_\ \  / / /______ \ \ \__/ / /     
/ / /_______\\/_/    / / // / /__________\/ / /_______\ \ \___\/ /      
\/__________/ _      \/_/ \/_____________/\/__________/  \/_____/       
             / /\            / /\               / /\         /\ \       
            / /  \          / /  \             / /  \       /  \ \      
           / / /\ \        / / /\ \           / / /\ \__   / /\ \ \     
          / / /\ \ \      / / /\ \ \         / / /\ \___\ / / /\ \_\    
         / / /\ \_\ \    / / /  \ \ \        \ \ \ \/___// /_/_ \/_/    
        / / /\ \ \___\  / / /___/ /\ \        \ \ \     / /____/\       
       / / /  \ \ \__/ / / /_____/ /\ \   _    \ \ \   / /\____\/       
      / / /____\_\ \  / /_________/\ \ \ /_/\__/ / /  / / /______       
     / / /__________\/ / /_       __\ \_\\ \/___/ /  / / /_______\      
     \/_____________/\_\___\     /____/_/ \_____\/   \/__________/      
                                                                                                
                 [-0.005, 0.012, -0.008, ..., -0.010]
        """
    )
    return get_app().use_db(MemoryDatabase()).use_embedder(LocalEmbedder()).run()

app = run_app()

2023-04-23 12:24:46,622 - embedbase - INFO - Enabling Database <embedbase.database.memory_db.MemoryDatabase object at 0x7f36ec232be0>

2023-04-23 12:24:46,832 - embedbase - INFO - Enabling Embedder <__main__.LocalEmbedder object at 0x7f37d7d0c5b0>


Let's load a [dataset of ChatGPT prompts](https://huggingface.co/datasets/fka/awesome-chatgpt-prompts).

In [7]:
from datasets import load_dataset
dataset_id = "fka/awesome-chatgpt-prompts"
dataset = load_dataset(dataset_id, 'en', split='train', streaming=True)
print(next(iter(dataset)))

{'act': 'Linux Terminal', 'prompt': 'I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd'}


In [8]:
# This workaround is needed to use EmbedbaseClient inside a Jupyter notebook
# (such as Colab), where an event loop is already running:
# https://stackoverflow.com/questions/46827007/runtimeerror-this-event-loop-is-already-running-in-python
import nest_asyncio
nest_asyncio.apply()

In [9]:
from embedbase_client.client import EmbedbaseClient
from embedbase_client.split import split_text
from pprint import pprint

embedbase = EmbedbaseClient("http://localhost:8000", fastapi_app=app)

embedbase_dataset_id = dataset_id.split("/")[-1]
documents = []
for row in dataset:
  # ⚠️ Note: we split into small chunks (max_tokens=30) because the model
  # used here has a relatively limited input size. With other models, such
  # as OpenAI's embeddings model, you can use for example max_tokens=500
  # and chunk_overlap=200 (Embedbase Cloud uses an OpenAI model at the moment) ⚠️
  for c in split_text(row["prompt"], max_tokens=30, chunk_overlap=20):
    documents.append({
        "data": c.chunk,
    })

res = embedbase.dataset(embedbase_dataset_id).batch_add(documents)
pprint(res[0:5])

2023-04-23 12:34:01,658 - embedbase - INFO - Refreshing 1476 embeddings
2023-04-23 12:34:01,666 - embedbase - INFO - Checking embeddings computing necessity for 1476 documents
2023-04-23 12:34:01,707 - embedbase - INFO - We will compute embeddings for 1476/1476 documents
2023-04-23 12:34:17,606 - embedbase - INFO - Uploaded 1476 documents
2023-04-23 12:34:17,610 - embedbase - INFO - Uploaded in 15.952126264572144 seconds

[{'id': '27e1404fd1cd74b065ebd5984e0578dab9d005f10f7a9ac514122debe894b7e7',
  'status': 'success'},
 {'id': '4412fafe70a5d744e502e859ef4326a635c70cddda7ea005f217c7d2a6b9cd46',
  'status': 'success'},
 {'id': '8c5149b60a4f16266abd52393ee9029f1d367f3a79eb9ef43b2ce8686fe3ceb1',
  'status': 'success'},
 {'id': '728dbf6ec7ee429feb3ed787c351f72d4680cb9d166e2cb60d7c612948d88e24',
  'status': 'success'},
 {'id': 'dbd9e5692642a4bd7d69485097764ad95a5e19e446a7181f384e576bb263bc41',
  'status': 'success'}]
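
The `max_tokens` and `chunk_overlap` arguments used above control a sliding window over the text. As a rough illustration of the idea, the sketch below uses a hypothetical helper, `chunk_words`, that counts whitespace-separated words instead of model tokens. It is not how `split_text` is implemented, only a way to see what overlap does.

```python
from typing import List


def chunk_words(text: str, max_words: int, overlap: int) -> List[str]:
    """Split text into windows of at most max_words words,
    each sharing `overlap` words with the previous window."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


chunks = chunk_words("one two three four five six seven", max_words=4, overlap=2)
for c in chunks:
    print(c)
```

Each chunk repeats the last two words of the previous one, which is why consecutive search results in this notebook share text.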


In [12]:
res = embedbase.dataset(embedbase_dataset_id).search("historian persona, expert in cultural, economic, political, and social events", 15)
for r in res:
    pprint(r.data)

2023-04-23 12:35:36,621 - embedbase - INFO - Query historian persona, expert in cultural, economic, political, and social events created embedding, querying index


('I want you to act as a historian. You will research and analyze cultural, '
 'economic, political, and social events in the past, collect data from')
(' and social events in the past, collect data from primary sources and use it '
 'to develop theories about what happened during various periods of history. '
 'My first suggestion')
(' will research and analyze cultural, economic, political, and social events '
 'in the past, collect data from primary sources and use it to develop '
 'theories about what')
(' primary sources and use it to develop theories about what happened during '
 'various periods of history. My first suggestion request is "I need help '
 'uncovering facts about')
('I want you to act as my time travel guide. I will provide you with the '
 'historical period or future time I want to visit and you will suggest')
' suggest some interesting events, sights, or people for me to experience?"'
('. I will provide you with the historical period or future time I want to '
 '
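
Conceptually, a search like the one above embeds the query and ranks the stored vectors by similarity to it. The sketch below shows a brute-force version of that idea over a plain dictionary. The names `top_k` and `index` and the toy vectors are assumptions made for illustration, not the actual `MemoryDatabase` implementation.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query: List[float], index: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(doc_id, cosine(query, vector)) for doc_id, vector in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# toy index: document id -> made-up 3-dimensional embedding
index = {
    "historian": [0.9, 0.1, 0.1],
    "time travel guide": [0.7, 0.3, 0.2],
    "linux terminal": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.1]  # stands in for the embedded search query

hits = top_k(query, index, k=2)
for doc_id, score in hits:
    print(doc_id, round(score, 3))
```

A linear scan like this is fine for a few thousand vectors held in memory. Larger collections usually call for a database with vector support, as noted earlier.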

Congrats, you have now seen the main features of embedbase. From here, you can:
- build a [recommendation engine](https://betterprogramming.pub/using-openai-to-increase-time-spent-on-your-blog-3f138d5ae6aa)
- connect your data sources to ChatGPT or other LLMs, for example to build [ChatGPT-powered documentation](https://betterprogramming.pub/building-a-chatgpt-powered-markdown-documentation-in-no-time-50e308f9038e)
- detect anomalies
- classify your data
- visualize your data distribution in 2D or 3D

# Connecting the dots with LLMs like ChatGPT

ChatGPT is a game-changer, but it doesn't have any knowledge about your company, your product, or your data. Embedbase makes it easy for you to connect any data sources to ChatGPT.

Here we will plug our dataset of prompts into ChatGPT so that it can answer questions about it.

We will need an OpenAI API key that you can get at
https://platform.openai.com/account/api-keys.

Then we can install the OpenAI SDK:

In [10]:
!pip install -q openai

⚠️ Large language models such as ChatGPT, GPT-4 and others are limited in input size, so we need to give them the best information to solve the user's problem.

A good question makes for a good answer, and that's where embedbase shines!

We will use the `merge` function from `embedbase_client` to merge the search results into a single string that fits into the model's input.
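
To illustrate the input-size constraint, here is a minimal sketch of packing ranked results into a fixed budget. The helper `fit_context` is hypothetical and uses a character count as a stand-in for the model's token limit. `merge` from `embedbase_client` may behave differently.

```python
from typing import List


def fit_context(chunks: List[str], max_chars: int, separator: str = "\n") -> str:
    """Concatenate chunks in order until the next one would exceed max_chars."""
    kept = []
    used = 0
    for chunk in chunks:
        # the separator only costs something once there is a previous chunk
        cost = len(chunk) + (len(separator) if kept else 0)
        if used + cost > max_chars:
            break
        kept.append(chunk)
        used += cost
    return separator.join(kept)


# results are assumed to be ordered from most to least relevant
ranked = ["most relevant chunk", "second best chunk", "least relevant chunk"]
context = fit_context(ranked, max_chars=40)
print(context)
```

Because search results come back ordered by similarity, stopping at the budget drops the least relevant text first.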

In [15]:
import os

import openai
from embedbase_client.split import merge

# Get a key at https://platform.openai.com/account/api-keys and expose it as the
# OPENAI_API_KEY environment variable; never hard-code a real key in a notebook.
openai.api_key = os.environ["OPENAI_API_KEY"]

def build_prompt(question: str, context: str):
  return f"Based on the following context:\n{context}\nAnswer the user's question: {question}"

question = "Provide me a persona that is 'historian persona, expert in cultural, economic, political, and social events' but also contrarian"
results = embedbase.dataset(embedbase_dataset_id).search(question, limit=5)
merged_results = merge([result.data for result in results])
persona = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "system", "content": "You are a helpful assistant that answers questions about a list of chatgpt prompts"},
        {"role": "user", "content": build_prompt(question, merged_results)}
    ]
)["choices"][0]["message"]["content"]
persona

2023-04-23 12:42:29,807 - embedbase - INFO - Query Provide me a persona that is 'historian persona, expert in cultural, economic, political, and social events' but also contrarian created embedding, querying index


"Sure! Meet Dr. Julia Martinez. Julia is a highly respected historian, specializing in cultural, economic, political, and social events throughout history. She's published numerous papers, lectured at prestigious universities, and consulted for various historical organizations. However, Julia is also known as a contrarian - she enjoys questioning traditional theories and perspectives, often challenging the status quo. Her unique approach to analyzing historical events provides a fresh perspective that has made her a sought-after expert in the field."

Okay, let's use this persona to ask a question about Jean-Francois de La Perouse.

In [18]:
pprint(openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "system", "content": f"You are a powerful AI assistant with this personality: '{persona}'"},
        {"role": "user", "content": "List me the main discoveries of Jean-Francois de La Perouse and tell me a bit about his personality"}
    ]
)["choices"][0]["message"]["content"])

('Jean-Francois de La Perouse was a French explorer and naval officer who led '
 'the voyage of the frigates Astrolabe and Boussole in the late 18th century. '
 'Here are some of his main discoveries:\n'
 '\n'
 '1. Mapping the coast of Alaska: La Perouse explored the Alaskan coast and '
 'produced the first accurate map of the area.\n'
 '\n'
 '2. Discovering the island of Maui: During his travels in the Pacific, La '
 'Perouse discovered the Hawaiian island of Maui and named it "Isle de la '
 'Caimane."\n'
 '\n'
 '3. Studying the indigenous peoples of Alaska and the Pacific: La Perouse was '
 'known for his detailed research on the native peoples of the regions he '
 'explored, including their languages, customs, and religions.\n'
 '\n'
 'As for his personality, La Perouse was known as a methodical and meticulous '
 'explorer who paid great attention to detail. He was also respected for his '
 'humanitarianism, as he often went out of his way to help the native peoples '
 'he encounter