# 01 Hello Vecto!

This notebook helps you get started with Vecto. You will learn how to set up a Vector Space, then build and experiment with a basic Vecto application. The guide walks you through the process step by step.

## Set Up Vecto Application

In [None]:
!pip install ftfy tqdm requests pillow ipywidgets

In [None]:
import requests
from ipywidgets import interact_manual, IntSlider, FileUpload
import pathlib
from IPython.display import Image, display
import math
from tqdm.notebook import tqdm
import io
import json

## Make a Vector Space

1. Go to the [Vecto Login](https://app.vecto.ai/login.xhtml) page, enter your *Username* and *Password*, and click Sign In.

2. From the admin page sidebar, select the **Dashboard** tab and click *Create new vector space*. Enter a Vector Space name; here we call it `Hello_world`. Next, choose a `vectorization model`. Because we will work with both images and text, choose the [CLIP](https://github.com/openai/CLIP) model. Finally, click the `Create Vector Space` button. You can view your Vector Space details by clicking its name in the Vector Spaces list. Take note of your Vector Space ID; you will need it in a later step.

3. To access the Vector Space, we need a Vector Space authentication token. Click the **Tokens** tab in the sidebar, set the token name to `Hello_world_token`, select the `Hello_world` Vector Space that we created earlier, and click `Create User Token`. Click your token name in the list to view it. This token authenticates your requests to the Vecto servers. Copy the token to use in the next step. Here we go! Now you have your first Vector Space.

<img src="img/login_vecto.gif">

## Add and Ingest Data into Vector Space

To start, let's set the Vecto API endpoint and provide our `Hello_world` Vector Space ID and authentication token. Copy the cell below into the second cell of your notebook, insert the values for `token` and `vector_space_id`, then run the cell:

In [None]:
vecto_base_url = "https://api.vecto.ai/api/v0"
token = ""            # paste the token copied from the Tokens tab
vector_space_id = ""  # paste the ID of the Hello_world Vector Space

Please note that the Vector Space ID and token are unique to each Vector Space. Refer to the previous step if you cannot find your `Hello_world` Vector Space ID or token.
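
Leaving either value empty is an easy mistake that only surfaces later as a failed request. As a minimal sketch, the check below fails early and builds the same `Authorization: Bearer ...` header that the requests in this guide send. The helper name `build_auth_headers` and the placeholder values are illustrative assumptions, not part of Vecto.

```python
def build_auth_headers(token, vector_space_id):
    """Return the Authorization header, failing early on empty credentials."""
    if not str(token).strip():
        raise ValueError("token is empty; paste the token from the Tokens tab")
    if not str(vector_space_id).strip():
        raise ValueError("vector_space_id is empty; copy it from the Vector Space details")
    return {"Authorization": "Bearer %s" % token}


# Placeholder values for illustration only; use your own token and ID.
headers = build_auth_headers("example-token", "12345")
print(headers)
```

In the notebook you would call it with your own `token` and `vector_space_id` right after the configuration cell.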

### Dataset 

In this *Hello World!* guide, we use the [LFW - People (Face Recognition) dataset from Kaggle](https://www.kaggle.com/datasets/atulanandjha/lfwpeople); it contains around 13,000 `.jpg` images of different people's faces.

To proceed, you will need to download and extract the dataset into the working directory `simple_vecto_app`.

Now your working directory should look like this:
```
|__simple_vecto_app
    |__requirements.txt
    |__vecto_application.ipynb
    |__lfw_funneled
        |__name1
            |__name1_0001
            |__name1_0002
            ...
        |__name2
            |__name2_0001
            |__name2_0002
            ...
```
Now, let's set the path to our images in the notebook. First, we find the base directory and join it with the dataset folder name; then we use `list(dataset_path.glob('**/*.jpg'))` to collect all the image paths in the dataset into a Python list.

In [None]:
base_dir = pathlib.Path().absolute()
# The folder name must match the extracted dataset folder shown in the tree above.
dataset_path = base_dir.joinpath('lfw_funneled')
dataset_images = list(dataset_path.glob('**/*.jpg'))
print(dataset_images[:5])  # print the first 5 image paths
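
If the printed list is empty, the dataset folder name or location is most likely wrong. A quick sanity check, sketched below, counts how many images each person folder contributes. The helper name `count_images_per_person` and the example paths are hypothetical; they only mirror the folder layout shown above.

```python
import pathlib
from collections import Counter


def count_images_per_person(image_paths):
    """Count images by the name of the folder that contains them."""
    return Counter(pathlib.Path(p).parent.name for p in image_paths)


# Made-up paths that follow the lfw_funneled/<person>/<image>.jpg layout.
example_paths = [
    pathlib.Path("lfw_funneled/name1/name1_0001.jpg"),
    pathlib.Path("lfw_funneled/name1/name1_0002.jpg"),
    pathlib.Path("lfw_funneled/name2/name2_0001.jpg"),
]
counts = count_images_per_person(example_paths)
print(len(example_paths), "images across", len(counts), "people")
```

In the notebook, pass `dataset_images` instead of `example_paths`.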

### Ingest the Dataset

To ingest the images in the `dataset_images` list into our `Hello_world` Vector Space, we need a few helper functions that split the image ingestion into batches. Let's add the two functions `ingest_image_batch` and `ingest_all_images` to our `vecto_application` notebook:

In [None]:
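def ingest_image_batch(batch_path_list):
    # Form fields sent with the request; 'data' carries one JSON-encoded label per image.
    data = {'vector_space_id': vector_space_id, 'data': [], 'modality': 'IMAGE'}
    files = []
    try:
        for path in batch_path_list:
            # Store "<person folder>/<file name>" so results can be mapped back to a file.
            relative = "%s/%s" % (path.parent.name, path.name)
            data['data'].append(json.dumps(relative))
            files.append(open(path, 'rb'))

        result = requests.post("%s/index" % vecto_base_url,
                               data=data,
                               files=[('input', ('_', f, '_')) for f in files],
                               headers={"Authorization": "Bearer %s" % token})

        if not result.ok:
            # Report the failed batch and carry on with the remaining batches.
            print(result.status_code)
    finally:
        # Close every opened file, even if the request raised an exception.
        for f in files:
            f.close()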

In [None]:
def ingest_all_images(path_list, batch_size=64):
    batch_count = math.ceil(len(path_list) / batch_size)
    batches = [path_list[i * batch_size: (i + 1) * batch_size] for i in range(batch_count)]
    for batch in tqdm(batches):
        ingest_image_batch(batch)

The batch size determines the number of images ingested in each batch. Here, we set the batch size to `64` to speed up the initial ingest process. The batch size could be set to any other integer value as low as `1`, as it depends on the dataset type and size.  
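
To see how the batch size shapes the ingest, the sketch below applies the same slicing as `ingest_all_images` to a small list of integers instead of image paths. The function name `split_into_batches` is introduced here for illustration only.

```python
import math


def split_into_batches(items, batch_size):
    """Split items into consecutive batches of at most batch_size elements."""
    batch_count = math.ceil(len(items) / batch_size)
    return [items[i * batch_size:(i + 1) * batch_size] for i in range(batch_count)]


# Ten items with a batch size of 4 give two full batches and one partial batch.
batches = split_into_batches(list(range(10)), batch_size=4)
print([len(b) for b in batches])  # [4, 4, 2]
```

The last batch is simply shorter when the item count is not a multiple of the batch size, so no image is dropped.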

In [None]:
ingest_all_images(path_list=dataset_images, batch_size=64)

You will need to wait for the vectorization process to finish before moving on.

## Vector Search in Vector Space

After the dataset ingest finishes, we can run search queries against the `Hello_world` Vector Space. The queries can be images from within the dataset or from outside it. We can also search with text queries, even though our Vector Space contains images only.

To search within the Vector Space, the query is first converted into a vector, that query vector is compared against the whole Vector Space, and the images with the highest similarity are displayed. We will use a few helper functions to handle these steps. Let's add the four functions `display_results`, `lookup`, `text_query`, and `image_query` to our `vecto_application` notebook:

In [None]:
def display_results(results):
    output = []
    for result in results:
        output.append("Similarity: %s" % result['similarity'])
        output.append(Image(dataset_path.joinpath(result['data'])))
    display(*output)

In [None]:
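def lookup(f, modality, top_k):
    # Send the query (a text or image file object) and request the top_k most similar entries.
    result = requests.post("%s/lookup" % vecto_base_url,
                           data={'vector_space_id': vector_space_id, 'modality': modality, 'top_k': top_k},
                           files={'query': f},
                           headers={"Authorization": "Bearer %s" % token})

    # Raise a clear HTTP error instead of failing later on a missing 'results' key.
    result.raise_for_status()
    results = result.json()['results']
    display_results(results)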

In [None]:
def text_query(query, top_k=10):
    f = io.StringIO(query)
    lookup(f, 'TEXT', top_k)

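def image_query(query, top_k=10):
    # query is the FileUpload value; indexing with [0] assumes the ipywidgets 8 format (a tuple of uploads).
    f = io.BytesIO(query[0]['content'])
    lookup(f, 'IMAGE', top_k)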
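
The ranking itself happens on the Vecto servers, and this guide does not show how it is computed. As a conceptual sketch only, the code below ranks a few toy vectors by cosine similarity to a query vector. Cosine similarity, the 2-D vectors, and the names `cosine_similarity`, `top_k_similar` and `toy_index` are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_similar(query_vector, indexed_vectors, top_k):
    """Return the top_k (similarity, name) pairs, most similar first."""
    scored = [(cosine_similarity(query_vector, vec), name)
              for name, vec in indexed_vectors.items()]
    scored.sort(reverse=True)
    return scored[:top_k]


# Toy 2-D "embeddings"; real embeddings have many more dimensions.
toy_index = {"img_a": [1.0, 0.0], "img_b": [0.6, 0.8], "img_c": [0.0, 1.0]}
for similarity, name in top_k_similar([1.0, 0.0], toy_index, top_k=2):
    print(name, round(similarity, 3))
```

Under this measure a vector compared with itself scores 1.0, which is why `img_a` ranks first for a query identical to it.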

### Search using in-dataset query

Let's pick an image from the dataset as our search query. Our goal is to find similar items (people, in this case) for that image within our `Hello_world` Vector Space. Here, we pick the image `Aaron_Eckhart_0001.jpg`.

<img src="img/Aaron_Eckhart_0001.jpg" height="200" >

In [None]:
interact_manual(image_query, query=FileUpload(multiple=False), top_k=IntSlider(min=1, max=50))

After you add the above line of code to the `vecto_application` notebook and run it, a widget appears. Upload the query image available at `lfw_funneled/Aaron_Eckhart/Aaron_Eckhart_0001.jpg`, adjust the `top_k` slider to limit the number of similar items returned, and click the `Run Interact` button to start the vector search.

The results show that the image with the highest similarity is the query image itself, with **Similarity=1.0**, because the query image is already present in the Vector Space.

### Search using out-of-dataset query

To check whether our Vector Space is robust to external data, we will upload an image from outside the dataset as our query image. Follow the same steps as in [search using in-dataset query](#Search-using-in-dataset-query). Here, we use this [out-of-dataset image](/img/docs/user-guide/Hello_world/out_dataset.jpg), which we downloaded from the [Pexels](https://www.pexels.com/photo/closeup-photo-of-woman-with-brown-coat-and-gray-top-733872/) website.

<img src="img/out_dataset.jpg"  width="200" height="200" >

After you add the line of code below to the `vecto_application` notebook and run it, the widget appears. Upload the out-of-dataset image as the search query, choose a `top_k` value, and click the `Run Interact` button to start the vector search.

In [None]:
interact_manual(image_query, query=FileUpload(multiple=False), top_k=IntSlider(min=1, max=50))

The returned images of the query vector search are for different women with relatively similar features to our out-of-dataset image.

### Search using text query


You can also find similar data in the `Hello_world` Vector Space with a text query. Simply pass the text to the widget; Vecto then vectorizes the text and runs the vector search.

For a text query, after you add the line of code below to the `vecto_application` notebook and run it, a widget with a text box appears. Type *Women* as the text query (the text box is pre-filled with this value), adjust the `top_k` slider to limit the number of similar items returned, and click the `Run Interact` button to start the vector search.

In [None]:
interact_manual(text_query, query="Women", top_k=IntSlider(min=1, max=50))

The returned images of the text query vector search are for different women. Now try other text queries and analyze the vector search output.

## Create and Apply Analogy

Analogy completion via vector arithmetic has become a common means of demonstrating the compositionality of embeddings. Take `Man is to King as Woman is to Queen` as an analogy. We can use the vector difference *King vector - Man vector* as an analogy vector that shifts the search output for the *Woman* query: instead of returning images of women, as in the straightforward query search shown [above](#Search-using-text-query), the search returns images of queens. The arithmetic that governs such an analogy can be written as follows:

<p align="center">
Let <strong>King - Man = Queen - Woman</strong> <br>
Therefore, <strong>Woman + (King - Man) = Queen</strong>
</p>

To construct an analogy you need 3 components:
1. The *start* of the analogy; in this example it is **Man**
2. The *end* of the analogy; in this example it is **King**
3. The *query* to apply the analogy to; in this example it is **Woman**


To apply the analogy to a vector search query, we take the difference between the analogy *end* and *start* vectors, then add it to the *query* vector before searching for similar items within the Vector Space. For that, let's add these three functions `analogy`, `text_analogy` and `image_analogy` to our `vecto_application` notebook:

In [None]:
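def analogy(query, start, end, modality, top_k):
    # 'from' and 'to' are the analogy start and end; 'query' is the item the analogy is applied to.
    result = requests.post("%s/analogy" % vecto_base_url,
                           data={'vector_space_id': vector_space_id, 'modality': modality, 'top_k': top_k},
                           files={'query': query, 'from': start, 'to': end},
                           headers={"Authorization": "Bearer %s" % token})

    # Raise a clear HTTP error instead of failing later on a missing 'results' key.
    result.raise_for_status()
    results = result.json()['results']
    display_results(results)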

In [None]:
def text_analogy(query, start, end, top_k=10):
    analogy(io.StringIO(query), io.StringIO(start), io.StringIO(end), 'TEXT', top_k)

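def _first_upload(upload):
    # ipywidgets 7 passes a dict keyed by file name; ipywidgets 8 passes a tuple of dicts.
    first = list(upload.values())[0] if isinstance(upload, dict) else upload[0]
    return io.BytesIO(first['content'])

def image_analogy(query, start, end, top_k=10):
    analogy(
        _first_upload(query),
        _first_upload(start),
        _first_upload(end),
        'IMAGE',
        top_k
    )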
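
The vector arithmetic described above can be tried locally with toy numbers. The sketch below is not how the `/analogy` endpoint is implemented, since that runs on the Vecto servers and is not shown in this guide. It only illustrates adding the difference between the *end* and *start* vectors to the *query* vector. The 2-D vectors and the name `apply_analogy` are made up for illustration.

```python
def apply_analogy(query, start, end):
    """Shift the query vector by the difference vector (end - start)."""
    return [q + (e - s) for q, s, e in zip(query, start, end)]


# Toy 2-D embeddings with made-up axes: [royalty, femininity].
man, king = [0.0, 0.0], [1.0, 0.0]
woman, queen = [0.0, 1.0], [1.0, 1.0]

# Woman + (King - Man) lands on the toy Queen vector.
shifted = apply_analogy(query=woman, start=man, end=king)
print(shifted == queen)
```

With real embeddings the shifted vector rarely matches a stored vector exactly, so the search returns the nearest items instead.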

### Dynamic Analogy 

Dynamic analogy lets you experiment with different analogy *start* and *end* values to create a difference vector that generalizes to multiple queries. All that needs to be done is to set the text for the analogy's *start*, *end* and *query*.

For a dynamic analogy, after you add the line of code below to the `vecto_application` notebook and run it, three text boxes and a slider appear. Type **Man** as the analogy *start*, **King** as the analogy *end* and **Woman** as the *query* text, adjust the `top_k` slider to limit the number of similar items returned, and click the `Run Interact` button to start the vector search with the analogy.

In [None]:
interact_manual(text_analogy, query="Woman", start="Man", end="King", top_k=IntSlider(min=1, max=50))

Here, the images returned for the *Woman* query after applying the analogy differ from those returned earlier in [*Search using text query*](#Search-using-text-query). An analogy can play a significant role in steering the vector search output toward the desired results.