# Indexing Demo - Create your own 'Vector Store' with ClassifAI Package

## Overview

ClassifAI uses vector search (cosine similarity, distance measures, etc.) to find strings similar to a user's input text. Part of the functionality of the ClassifAI package, therefore, is to help users create vector stores from a collection of texts, which the search can then be performed against.

This notebook provides a simple guide to using the package's Vectoriser and Indexing modules to convert a collection of texts into a corresponding collection of vectors, which can then be used to set up a fully functioning REST API search service.
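
As a rough illustration of the similarity measure mentioned above, the sketch below computes cosine similarity between small toy vectors in plain Python. The function name and the vectors are made up for illustration and are not part of the ClassifAI package; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (illustrative values only)
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # points the same way as the query
orthogonal = [0.0, 1.0, 0.0]       # shares no direction with the query

print(cosine_similarity(query, same_direction))  # very close to 1
print(cosine_similarity(query, orthogonal))      # 0.0
```

Scores near 1 indicate vectors pointing in almost the same direction, which is how texts with similar meaning tend to be ranked highest in a search.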

<u><b>You will learn:</b></u>

- How to generate a vector embeddings parquet file from a CSV file containing the columns ['id', 'text'] using the Vectoriser and Indexer modules of this package


<u><b>This Notebook covers:</b></u>

- How to create <b>Vectoriser Objects</b> with Gcloud or Huggingface that let you transform strings from text into vectors
- How to use these Vectoriser objects with a provided <b>Indexer Function</b> to iterate over a CSV file of texts, creating a vector store.

- Minimal setup should be required: installing the ClassifAI package and logging into GCloud Vertex AI (or you can try providing a HuggingFace model instead)

## Installation and Setup

### Installing the Package

Run the following cell to install the ClassifAI package once your Python environment is set up for this notebook.

In [24]:
## PIP INSTALLATION INSTRUCTIONS GO HERE i.e. !pip install package_name from github address

### GCloud Project Configuration

If you are using GCloud for your project, and have Vertex AI enabled for the project, then you will be able to access the GCloud Vertex AI embedding models for use in this package. We provide a way of obtaining embeddings/vectors from these Language Models hosted by Google. More on that later...

For now you just need to log into your Gcloud project by running the command: `gcloud auth application-default login`. This will authenticate you for the project you are working on w.r.t GCloud.

If you are not using GCloud, we provide an alternative way to work with embedding models using HuggingFace Transformers, which does not require user authentication.

In [None]:
#Only needs to be run by users using GCloud Vertex AI
!gcloud auth application-default login

## Creating Vectors with the Vectoriser Class

### With Gcloud and Vertex AI

We provide wrapper classes that make converting text into vectors easy with GCloud Vertex AI embedding models (or skip to the next section to use HuggingFace Transformer models instead of GCloud).

Simply enter the project ID of your Google project that has access to embedding models on Vertex AI, as in the example below. The class takes `project_id` plus two additional arguments that have defaults:

* `project_id` : The gcloud project id that vertex is enabled for, and where the models will be called
* `location` : Standard cloud argument for where the resource should be called (defaults to `europe-west2`)
* `model_name` : the specific embedding model to call on vertex (default is `text-embedding-004`)

In [None]:
#Import the ClassifAI Package vectoriser model
from classifAI_API.vectorisers import GcpVectoriser

#Just pass your GCloud Project ID as an argument, for your project that has VERTEX AI setup
my_demo_vectoriser = GcpVectoriser(project_id="<YOUR_GCLOUD_PROJECT_ID_GOES_HERE>")


The code cell above creates one of our <b>'vectoriser' objects</b>, which you can then pass text to and get vectors.

Call the `.transform()` method and pass a string to it, like in the cell below:

In [10]:
#Use the vectoriser object created above to embed text (convert it to a vector)

#just call the .transform method on the object
vector = my_demo_vectoriser.transform("The quick brown fox jumped over the log")

#display the returned vector, which is returned in numpy format
vector

array([[-6.41123652e-02,  2.78484495e-03, -5.75615047e-03,
         2.09132154e-02, -3.05869561e-02,  5.20960018e-02,
         3.16308402e-02,  1.53242995e-03,  7.16841221e-02,
        -1.60857365e-02, -1.60744824e-02,  3.65860835e-02,
         6.23130938e-03,  6.03814870e-02, -1.30988015e-02,
        -2.81967185e-02,  9.13312752e-03,  9.14438665e-02,
        -5.34801371e-02,  2.68589426e-02,  6.42369613e-02,
         5.79211954e-03, -1.42357508e-02, -4.49805595e-02,
         2.24941354e-02, -5.84460376e-03, -1.49912462e-02,
        -8.60473793e-03,  1.61052383e-02, -3.51867490e-02,
         2.12130994e-02,  5.82003221e-02,  2.20011361e-02,
         2.00838950e-02,  5.85184544e-02,  4.83704656e-02,
        -3.20709720e-02,  2.29926500e-02,  2.14196686e-02,
         1.38152363e-02, -5.71106561e-02,  3.32018435e-02,
        -5.57417572e-02,  5.78195229e-02, -2.03742506e-03,
        -3.40057164e-02,  8.81149899e-04,  5.97874559e-02,
        -3.84803489e-02,  4.59107347e-02,  4.83456850e-0

In [11]:
#you can also pass a list of strings to the transform method and it will encode them as a batch
vector2 = my_demo_vectoriser.transform(['My favourite colour is green', 'my favourite ice-cream is mint-choc-chip'])

vector2.shape

(2, 768)

### With Huggingface Transformers

<b>If you don't have GCloud with Vertex AI set up...</b> You can try the HuggingFace vectoriser class, to which you pass the name of a publicly available model hosted on HuggingFace.

Instead of passing a project ID, users pass the name of the embedding language model they want to use. So far we have tested support for the following models:

- sentence-transformers/all-MiniLM-L6-v2
- google-bert/bert-base-uncased

<b>WARNING</b>

* If you use the HuggingFace vectoriser class, this will download the model to your machine - and attempt to perform operations locally using your CPU. 

* If you are using the GCP vectoriser class, this will send the text to the Vertex API - therefore no model processing happens on your local machine

In [None]:
#this time importing HuggingFaceVectoriser from the same module, and passing it a model name rather than the GCloud Project ID.
from classifAI_API.vectorisers import HuggingFaceVectoriser

second_demo_vectoriser = HuggingFaceVectoriser(model_name="sentence-transformers/all-MiniLM-L6-v2")


In [20]:
#and creating the vectors with this new vectoriser model, instantiated using HuggingFace models, instead of Vertex AI models

more_vectors = second_demo_vectoriser.transform(['my least favourite colour is red', "I don't like strawberry ice cream"])

more_vectors.shape

(2, 768)

## Indexing - Creating a Vector Store

<u><b>The previous section discussed</b></u> how to set up language models to create embeddings from strings.


<u><b>This current section shows</b></u> how to create a full vector store from a CSV file of strings.

### Make sure we have a vectoriser

The function to create a vector store uses the vectoriser created in the previous section!

<b>Make sure you have one initialised from the previous section or run one of the following...</b>

In [None]:
#instantiating a vectoriser object again, if needed
from classifAI_API.vectorisers import GcpVectoriser
demo_vectoriser = GcpVectoriser(project_id="<YOUR_PROJECT_ID>")

<b>or if not using vertex, instead use the Huggingface class:</b>

In [27]:
#from classifAI_API.vectorisers import HuggingFaceVectoriser
#demo_vectoriser = HuggingFaceVectoriser(model_name="sentence-transformers/all-MiniLM-L6-v2")

### The `create_vector_index_from_string_file()` function

We provide a simple function that accepts as arguments:
  - a path to a CSV file of text data
  - the `dataType` of the file (`'csv'` is currently the only available file type)
  - the vectoriser you created
  - a batch size to indicate how many rows from the file to process at once

Currently, this function takes the path to a CSV file, which should have two columns ['id', 'text']. See an example in this package's GitHub repository: DEMO/demo_data.csv

It creates an embedding for each row in the file and saves the embeddings, together with the original strings and IDs, in a parquet file.
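
To make the batching idea concrete, here is a minimal sketch of reading a CSV with 'id' and 'text' columns in fixed-size batches and embedding each batch. It is not the package's implementation: `toy_embed` and `batched_rows` are hypothetical stand-ins, the toy "embedding" is just two numbers per text, and the step of writing the result to a parquet file is left out.

```python
import csv
import io
from itertools import islice


def toy_embed(texts):
    """Hypothetical stand-in for a vectoriser's transform(): one tiny vector per text."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


def batched_rows(reader, batch_size):
    """Yield lists of up to batch_size rows from a csv.DictReader."""
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# In-memory CSV with the expected 'id' and 'text' columns
demo_csv = io.StringIO(
    "id,text\n"
    "1,The quick brown fox jumped over the log\n"
    "2,You can't spell Spain without pain\n"
    "3,the weather is nice today in Edinburgh\n"
)

records = []
for batch in batched_rows(csv.DictReader(demo_csv), batch_size=2):
    # Embed the whole batch in one call, then pair each vector with its row
    vectors = toy_embed([row["text"] for row in batch])
    for row, vector in zip(batch, vectors):
        records.append({"id": row["id"], "text": row["text"], "embeddings": vector})

print(len(records), "rows embedded")
```

Processing rows in batches keeps memory use bounded on large files and lets the embedding model handle several texts per call.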

In [23]:
from classifAI_API.indexers import create_vector_index_from_string_file


PATH_TO_DEMO_CSV = "demo_data.csv"


vector_df = create_vector_index_from_string_file(
    fileName=PATH_TO_DEMO_CSV, 
    dataType='csv', 
    embedder=my_demo_vectoriser, # the variable name of your vectoriser goes here!
    batch_size=8)



INFO - Processing file: demo_data.csv in batches of size 8...

Processing batches: 2it [00:00,  4.88it/s]
INFO - Finished creating vectors, attempting to save to parquet file...
INFO - DataFrame created with 13 rows and 3 columns.
INFO - Saved DataFrame to Parquet file: demo_data.csv.parquet


---------


The function above saves the created embeddings, original texts and IDs to a parquet file, and also returns a pandas DataFrame of the data, which you can view:

In [18]:
vector_df.head()

| | id | text | embeddings |
|---|---|---|---|
| 0 | 1 | The quick brown fox jumped over the log | [-0.06411236524581909, 0.0027848449535667896, ... |
| 1 | 2 | You can't spell Spain without pain | [0.00010710849892348051, -0.03479038178920746,... |
| 2 | 3 | the weather is nice today in Edinburgh | [-0.05706198886036873, 0.001962454291060567, -... |
| 3 | 4 | The quick brown fox jumped over the log | [-0.06411236524581909, 0.0027848449535667896, ... |
| 4 | 5 | You can't spell Spain without pain | [0.00010710849892348051, -0.03479038178920746,... |


## That's it!

The DataFrame has already been saved to disk as a parquet file by the function.
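
For a sense of how a stored set of vectors might later be searched, the sketch below ranks the rows of a small matrix by cosine similarity to a query vector using NumPy. The helper `top_k_cosine` and the toy data are assumptions for illustration only; this is not the search service provided by the package.

```python
import numpy as np


def top_k_cosine(query_vec, store_matrix, k=3):
    """Return (indices, scores) of the k rows most similar to query_vec."""
    # Normalise rows and query so the dot product equals cosine similarity
    store_norm = store_matrix / np.linalg.norm(store_matrix, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = store_norm @ query_norm
    order = np.argsort(scores)[::-1][:k]  # highest score first
    return order, scores[order]


# Toy store with one row per text. With real data the matrix could be built
# from the 'embeddings' column, e.g. np.vstack(vector_df["embeddings"].to_list())
# (an assumption about the column layout).
store = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.1])

indices, scores = top_k_cosine(query, store, k=2)
print(indices, scores)
```

The returned indices can be used to look up the matching `id` and `text` values in the DataFrame.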

Thanks for following through with this tutorial. Check out the readme on the repo for the quick start guide on indexing. 😃