# Embedding Search
* **Created by:** Eric Martinez
* **For:** CSCI 4341
* **At:** University of Texas Rio-Grande Valley

## Step 0: Set up your `.env` file locally

Set `OPENAI_API_BASE` and `OPENAI_API_KEY` in a file named `.env` in this same folder.

```sh
# example .env contents (copy paste this into a .env file)
OPENAI_API_BASE=yourapibase
OPENAI_API_KEY=yourapikey
```
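
Once the variables have been loaded (for example by `load_dotenv()`, as in Step 2), it can help to confirm that both are actually visible to Python. A minimal sketch using only the standard library; `missing_env_vars` is a hypothetical helper name, not part of any library:

```python
import os

REQUIRED_VARS = ("OPENAI_API_BASE", "OPENAI_API_KEY")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing in environment:", ", ".join(missing))
else:
    print("All required variables are set.")
```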

Install the required dependencies.

In [None]:
%pip install -q -r requirements.txt

## Step 1: Create the Dataset

Create a dataset of restaurant data.

The example below is only a starting point: it lists just a name and a street address for each restaurant.

The finished dataset should include:
    
- The name of the restaurant
- The address of the restaurant (with city, state, zip)
- The rating of the restaurant
- Description of the restaurant including the type of food / drinks they have.
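
One possible shape for a complete record, with a small check that every entry has the fields listed above. The field names `rating` and `description`, the placeholder values, and the `missing_fields` helper are all illustrative assumptions; use whatever names suit your dataset:

```python
REQUIRED_FIELDS = ("name", "address", "rating", "description")

# Hypothetical record: every value here is a placeholder, not real data
example = {
    "name": "Example Cafe",
    "address": "123 Example St., Example City, TX 00000",
    "rating": 4.5,
    "description": "Placeholder description of the food and drinks.",
}

def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in required if record.get(field) in (None, "")]

print(missing_fields(example))                   # []
print(missing_fields({"name": "Jason's Deli"}))  # ['address', 'rating', 'description']
```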

In [1]:
%%writefile data.json
[
    {
        "name": "Jason's Deli",
        "address": "1604 W University Dr."
    },
    {
        "name": "Taco Palenque",
        "address": "1414 W University Dr."
    },
    {
        "name": "University Drafthouse",
        "address": "2405 W University Dr. F"
    }
]

Overwriting data.json


Define a function to load the data.

In [2]:
import json

def load_data():
    with open("data.json") as f:
        data = json.load(f)
    return data

Now load the data.

In [3]:
data = load_data()
print(data)

[{'name': "Jason's Deli", 'address': '1604 W University Dr.'}, {'name': 'Taco Palenque', 'address': '1414 W University Dr.'}, {'name': 'University Drafthouse', 'address': '2405 W University Dr. F'}]


## Step 2: Create Chroma Collection

In [5]:
import os

import chromadb
from chromadb.utils import embedding_functions
from dotenv import load_dotenv

load_dotenv()  # load OPENAI_API_BASE and OPENAI_API_KEY from .env


def get_chroma_collection(collection_name):
    ## Use this one to save to memory
    # chroma_client = chromadb.Client() 

    ## Use this one to save to disk
    chroma_client = chromadb.PersistentClient(path=".")

    openai_ef = embedding_functions.OpenAIEmbeddingFunction(
                    api_key=os.getenv("OPENAI_API_KEY"),
                    api_base=os.getenv("OPENAI_API_BASE"),
                    model_name="text-embedding-ada-002"
                )

    collection = chroma_client.get_or_create_collection(name=collection_name, embedding_function=openai_ef)
    return collection

In [6]:
collection = get_chroma_collection("food")

## Step 3: Add Data to Chroma Collection

In [7]:
def add_data_to_collection(data, collection):
    documents = []
    metadatas = []
    ids = []

    for i, restaurant in enumerate(data):
        name = restaurant['name']
        address = restaurant['address']
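        # TODO: also read the description, rating, etc. once the dataset has them

        # The text that gets embedded for each restaurant. Only the name for now;
        # extend this so searches can match on more than the name.
        embeddable_string = f"{name}"
        documents.append(embeddable_string)

        # Store the whole record as metadata so queries can return it
        metadatas.append(restaurant)

        # Use the list index as the ID (Chroma IDs must be strings)
        ids.append(str(i))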

    collection.add(
        documents=documents,
        metadatas=metadatas,
        ids=ids
    )
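
The `embeddable_string` above contains only the name, so a query can only match on whatever the name happens to suggest. One way to build a richer string once the dataset has more fields is sketched below. The `description` and `rating` keys and the `build_embeddable_string` helper are assumptions for illustration, not part of the starter code:

```python
def build_embeddable_string(restaurant):
    """Join the fields worth searching on into one text string."""
    parts = [restaurant["name"]]
    # Optional fields: include them only when the record has them
    if restaurant.get("description"):
        parts.append(restaurant["description"])
    if restaurant.get("rating") is not None:
        parts.append(f"Rating: {restaurant['rating']}")
    return ". ".join(parts)

print(build_embeddable_string({"name": "Taco Palenque"}))
# Taco Palenque
```

Whether the rating belongs in the embedded text is a judgment call; it may be more useful as metadata only.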

In [9]:
add_data_to_collection(data, collection)

Add of existing embedding ID: 0
Add of existing embedding ID: 1
Add of existing embedding ID: 2
Insert of existing embedding ID: 0
Insert of existing embedding ID: 1
Insert of existing embedding ID: 2
Add of existing embedding ID: 0
Add of existing embedding ID: 1
Add of existing embedding ID: 2
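
Messages like the ones above appear when `collection.add` receives IDs that are already stored, which can happen here because the collection is persisted to disk and the cell was run more than once. Index-based IDs also change meaning whenever `data.json` is reordered, so one option is to derive each ID from the record itself. A minimal sketch, assuming name plus address uniquely identifies a restaurant; `stable_id` is a hypothetical helper:

```python
import hashlib

def stable_id(restaurant):
    """Derive a deterministic ID from the name and address."""
    key = f"{restaurant['name']}|{restaurant['address']}".lower()
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

a = {"name": "Taco Palenque", "address": "1414 W University Dr."}
b = {"name": "Jason's Deli", "address": "1604 W University Dr."}

print(stable_id(a) == stable_id(dict(a)))  # True: same record, same ID
print(stable_id(a) == stable_id(b))        # False: different records
```

Chroma collections also provide an `upsert` method, which may be a better fit than `add` when a cell is expected to be re-run with updated records.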


## Step 4: Query the Collection

In [10]:
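def get_results(query, n_results=2):
    # Cast to int in case the caller passes a float (e.g. from a UI slider)
    results = collection.query(query_texts=[query], n_results=int(n_results))

    # results["metadatas"] holds one list per query text, and we sent one query.
    # Iterate over what came back rather than assuming exactly n_results matches.
    return list(results["metadatas"][0])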
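
`collection.query` returns its fields as nested lists, one inner list per query text, which is why the function indexes `[0]` before reading the matches. The sketch below uses a hand-written dictionary in that shape (not real Chroma output) to show the unpacking; `first_query_metadatas` is a hypothetical helper:

```python
def first_query_metadatas(results):
    """Return the metadata dicts for the first (and here only) query text."""
    per_query = results.get("metadatas") or [[]]
    return list(per_query[0])

# Hand-written stand-in for a query result with a single query text
fake_results = {
    "ids": [["1", "0"]],
    "metadatas": [[
        {"name": "Taco Palenque", "address": "1414 W University Dr."},
        {"name": "Jason's Deli", "address": "1604 W University Dr."},
    ]],
}

for metadata in first_query_metadatas(fake_results):
    print(metadata["name"])
# Taco Palenque
# Jason's Deli
```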

In [11]:
results = get_results("fajita", n_results=2)

for result in results:
    print(result)

{'address': '1414 W University Dr.', 'name': 'Taco Palenque'}
{'address': '1604 W University Dr', 'name': "Jason's Deli"}


## Step 5: Build the Gradio UI

In [12]:
import gradio as gr
import pandas as pd

def search(query, n_results):
    try:
        # Keep the query inside the try block so search errors reach the UI too
        results = get_results(query, n_results=n_results)
        return pd.DataFrame(results, columns=['name', 'address'])
    except Exception as e:
        # Python 3 exceptions have no .message attribute, so use str(e)
        raise gr.Error(str(e))

with gr.Blocks() as demo:
    with gr.Tab("Food Finder"):
        with gr.Row():
            with gr.Column():
                query = gr.Textbox(label="What are you looking for?", lines=5)
                # Minimum is 1: asking for zero results is not a meaningful query
                n_results = gr.Slider(label="Results to Display", minimum=1, maximum=10, value=2, step=1)
                btn = gr.Button(value="Submit")
                table = gr.Dataframe(label="Results", headers=['name', 'address'])
            btn.click(search, inputs=[query, n_results], outputs=[table])

demo.launch(share=True)

Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://a324148dd8aeb0c087.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces


## Your Job

#### 1. Expand the dataset fields

The dataset should include:
    
- The name of the restaurant
- The address of the restaurant (with city, state, zip)
- The rating of the restaurant
- A description of the restaurant, including the type of food (for a restaurant) or drinks (for a bar or cocktail place) they serve.
    * You don't have to go overboard here; we just need to know what is special or unique about each place.
    
#### 2. Increase the dataset size to 30

- For the search to be useful, the dataset needs more restaurants
- Focus on higher-quality restaurants rather than listing every McDonald's or fast-food place
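
A quick way to check progress toward this is sketched below. It assumes the loaded dataset is a list of dictionaries with a `name` key, as in Step 1; `dataset_problems` is a hypothetical helper, and a repeated name is only flagged for review since two locations of the same restaurant may legitimately share one:

```python
from collections import Counter

def dataset_problems(data, minimum=30):
    """Return human-readable notes about size and repeated names."""
    problems = []
    if len(data) < minimum:
        problems.append(f"only {len(data)} of {minimum} restaurants")
    # Compare names case-insensitively and ignore surrounding whitespace
    counts = Counter(r["name"].strip().lower() for r in data)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        problems.append("duplicate names: " + ", ".join(duplicates))
    return problems

starter = [
    {"name": "Jason's Deli"},
    {"name": "Taco Palenque"},
    {"name": "University Drafthouse"},
]
print(dataset_problems(starter))
# ['only 3 of 30 restaurants']
```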

#### 3. Modify the UI 

- Add the restaurant's description to the table
- Add the restaurant's rating to the table
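
One way to approach this is to keep the column names in a single list and build each table row from it, so the table headers and the returned data cannot drift apart. A sketch, assuming the new fields are stored under `description` and `rating` keys; `COLUMNS` and `to_rows` are hypothetical names:

```python
COLUMNS = ["name", "address", "rating", "description"]

def to_rows(results, columns=COLUMNS):
    """Turn metadata dicts into rows, filling missing fields with ''."""
    return [[result.get(column, "") for column in columns] for result in results]

rows = to_rows([{"name": "Taco Palenque", "address": "1414 W University Dr."}])
print(rows)
# [['Taco Palenque', '1414 W University Dr.', '', '']]
```

The same list could then be passed as `headers=COLUMNS` to `gr.Dataframe` and as `columns=COLUMNS` to `pd.DataFrame`.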

#### Submit

In [None]:
!git add assignment-part1.ipynb; git add assignment-part2.ipynb

In [None]:
!git add data.json

In [None]:
!git commit -m "finished part 1"

In [None]:
!git push

That's it! 🎉 Move on to Part 2.