# Add your own pre- and post-processing logic with <i>Hooks</i> 🪝

#### The ClassifAI package uses a defined data flow process to ensure that all the modules work together:

* The Vectoriser `.transform()` methods:
    - expect a string or a list of strings as input,
    - and return NumPy ndarrays.
* The VectorStore `.search()` method:
    - expects a string or a list of strings as input,
    - and returns a pandas dataframe of results with specific typed columns: `['query_id', 'query_text', 'doc_id', 'doc_text', 'rank', 'score']`.


This expected typing ensures that the modules of the Package can work together, from vectorising, to searching, to deploying with FastAPI where the strict type modelling is used to generate the correct JSON schema format.
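
To make this contract concrete, here is a minimal sketch of a check that a results dataframe carries the columns listed above. The helper name `has_expected_columns` and the toy data are illustrative assumptions and are not part of the ClassifAI API.

```python
import pandas as pd

# Column names taken from the list above.
EXPECTED_COLUMNS = ["query_id", "query_text", "doc_id", "doc_text", "rank", "score"]


def has_expected_columns(results: pd.DataFrame) -> bool:
    """Hypothetical helper: True if every expected column is present."""
    return all(column in results.columns for column in EXPECTED_COLUMNS)


# Toy stand-in for a search result (values are made up for illustration).
toy_results = pd.DataFrame(
    {
        "query_id": [0],
        "query_text": ["a fruit farmer"],
        "doc_id": ["101"],
        "doc_text": ["Fruit farmer"],
        "rank": [0],
        "score": [0.9],
    }
)

print(has_expected_columns(toy_results))
print(has_expected_columns(toy_results.drop(columns=["score"])))
```

A check like this can be handy at the end of a post-processing hook, to confirm that the hook has not dropped a column that later stages rely on.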



<images of the type flow process go here?>

#### But sometimes you might want to perform additional operations on this data

For example you might want your classification system to:

* Convert all the text input to CAPITAL letters before it is embedded by the vectoriser,
* Remove punctuation or do other sanitisation on your input queries to a VectorStore.search() method,
* Remove any duplicate rows from a dataframe of results returned by a VectorStore.search() call for a user query, based on a specific column in the data,
* Perform spell checking before a user's input query is passed to VectorStore.search(),
* Inject some additional data into your classification results.
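
As a rough illustration of the first two ideas, the sketch below shows two small text transformations written as plain functions on a list of strings. The function names are hypothetical, and how such functions are wired into a hook depends on the input object the package passes to the hook, which is covered later in this notebook.

```python
def to_upper(texts: list[str]) -> list[str]:
    """Upper-case every query string."""
    return [text.upper() for text in texts]


def collapse_whitespace(texts: list[str]) -> list[str]:
    """Strip leading/trailing whitespace and collapse internal runs to one space."""
    return [" ".join(text.split()) for text in texts]


queries = ["  a fruit   farmer ", "Digital marketing"]
cleaned = to_upper(collapse_whitespace(queries))
print(cleaned)
```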

## This Demo:

This notebook demo will show you how to make:
- a pre-processing function that removes punctuation from input user queries,
- a post-processing function that removes result rows whose IDs duplicate those of other rows in the results.

We will then make a final post-processing function that injects additional SOC definition data into the VectorStore results dataframe, and show how this can be chained together with the deduplication code to make a multi-step post-processing function!

### Pre-requisite

If you are new to the package, it's recommended to work through the ```general_workflow.ipynb``` notebook tutorial first. That interactive demo showcases the core features of the ClassifAI package. The current notebook provides examples of how to modify the flow of data that is initially described in the ```general_workflow.ipynb``` notebook.

Check out the ClassifAI repository DEMO folder for all our notebook walkthrough tutorials including those mentioned above:

https://github.com/datasciencecampus/classifai/tree/main/DEMO 


### Installation

In [None]:
## if using pip
# % pip install -e ".[huggingface]"

## if using uv
# ! uv sync --extra huggingface

### Normal vectorstore setup

We can start by loading a normal VectorStore with no additional preprocessing. We use one of our fake example datasets, which is known to have several rows of data with the same ID value.

In [None]:
from classifai.indexers import VectorStore
from classifai.vectorisers import HuggingFaceVectoriser

vectoriser = HuggingFaceVectoriser(model_name="sentence-transformers/all-MiniLM-L6-v2")


my_vector_store = VectorStore(
    file_name="data/fake_soc_dataset.csv",
    data_type="csv",
    vectoriser=vectoriser,
    overwrite=True,
)

#### Notice how when we run the search method with a user query:
 * the exclamation marks in the query (which in some cases we may want to sanitise) are shown in the results,
 * the results for the query below also show several rows with the same ```'doc_id'``` value (because our example data file has multiple entries with the same id label).

In [None]:
my_vector_store.search("a fruit and vegetable farmer!!!", n_results=10)

### Making pre- and post- processing hooks 

So let's write some functions that will remove punctuation from the user's input query before the VectorStore.search() method begins, and remove rows with duplicate IDs from the results dataframe just before the results are returned from the VectorStore.search() method.

In [None]:
import string


def remove_punctuation(input_data):
    # we want to modify the 'query' field of the input_data Pydantic model, which holds the list of query texts
    # this line removes punctuation from each string with a list comprehension
    sanitized_texts = [x.translate(str.maketrans("", "", string.punctuation)) for x in input_data.query]

    input_data.query = sanitized_texts

    # Return the modified input as a dictionary using model_dump from Pydantic
    return input_data.model_dump()


def drop_duplicates(expected_results_dataframe):
    # we want to deduplicate the results, which arrive here as a pandas dataframe
    # specifically we want to keep only the first occurrence of each unique 'doc_id' value within each query's results
    expected_results_dataframe = expected_results_dataframe.drop_duplicates(subset=["query_id", "doc_id"], keep="first")

    return expected_results_dataframe
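
To try the punctuation hook in isolation, without building a VectorStore, one option is to call it on a small stand-in object. The sketch below assumes only what the hook itself uses: an object with a `query` list and a `model_dump()` method. `FakeSearchInput` is a hypothetical stand-in written for this example and is not the real ClassifAI input model.

```python
import string
from dataclasses import asdict, dataclass


@dataclass
class FakeSearchInput:
    """Hypothetical stand-in for the object passed to the hook (not the real ClassifAI model)."""

    query: list[str]

    def model_dump(self) -> dict:
        # Mimics the Pydantic method of the same name, for this sketch only.
        return asdict(self)


def remove_punctuation(input_data):
    # Same logic as the hook defined above, with the translation table built once.
    table = str.maketrans("", "", string.punctuation)
    input_data.query = [text.translate(table) for text in input_data.query]
    return input_data.model_dump()


result = remove_punctuation(FakeSearchInput(query=["a fruit and vegetable farmer!!!", "Digital marketing@"]))
print(result)
```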

### Adding our Hooks to the VectorStore

Now when we initialise the VectorStore we can declare our custom functions in the hooks dictionary.

The VectorStore codebase looks for specifically named entries in the hooks dictionary to decide which pre- and post-processing hooks to run. There are hooks for each major method of the Vectoriser and VectorStore classes.

Each dictionary key is the method name of the class with '_preprocess' or '_postprocess' appended. Currently the implemented method hooks are:

- for the vectoriser classes:
    * transform_preprocess
    * transform_postprocess

- for the VectorStore class:
    * search_preprocess
    * search_postprocess
    * reverse_search_preprocess
    * reverse_search_postprocess


For this exercise, we are implementing the search_preprocess and search_postprocess hooks in the VectorStore.


However, if we wanted to add a preprocessing or postprocessing hook to a Vectoriser, we would pass it to the Vectoriser object on instantiation.
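
To illustrate the naming convention, here is a minimal sketch of how a method could look up and apply hooks by key. This is an assumption made for illustration only: `run_with_hooks` is a hypothetical function and is not how the ClassifAI codebase is necessarily implemented.

```python
from typing import Any, Callable, Optional


def run_with_hooks(
    method_name: str,
    core: Callable[[Any], Any],
    value: Any,
    hooks: Optional[dict[str, Callable[[Any], Any]]] = None,
) -> Any:
    """Hypothetical dispatcher: apply '<method>_preprocess', then core, then '<method>_postprocess'."""
    hooks = hooks or {}
    pre = hooks.get(f"{method_name}_preprocess")
    post = hooks.get(f"{method_name}_postprocess")
    if pre is not None:
        value = pre(value)
    result = core(value)
    if post is not None:
        result = post(result)
    return result


# Toy usage: the "search" here just splits a string into words.
output = run_with_hooks(
    "search",
    core=str.split,
    value="  Fruit Farmer ",
    hooks={"search_preprocess": str.lower, "search_postprocess": sorted},
)
print(output)
```

The point of the sketch is only that a missing key means "no hook", so you can supply a pre-processing hook, a post-processing hook, both, or neither.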


In [None]:
my_vector_store_with_hooks = VectorStore(
    file_name="data/fake_soc_dataset.csv",
    data_type="csv",
    vectoriser=vectoriser,
    overwrite=True,
    hooks={
        "search_preprocess": remove_punctuation,
        "search_postprocess": drop_duplicates,
    },
)

### Our hooks will run with the VectorStore search method


Now that we've passed our additional functions to the VectorStore on initialisation, those hooks should run accordingly - let's see:

In [None]:
my_vector_store_with_hooks.search("a fruit and vegetable farmer!!!", n_results=10)

#### Oops!

Notice how in the above dataframe, the rank column now leaps over some values in each ranking. 

We didn't reset the ranking values, per query, when we removed duplicate rows...

Let's redo that now in a new function and hook it up to our postprocessing hook.

In [None]:
def drop_duplicates_and_reset_rank(expected_results_dataframe):
    # Remove duplicates based on 'query_id' and 'doc_id'
    expected_results_dataframe = expected_results_dataframe.drop_duplicates(subset=["query_id", "doc_id"], keep="first")

    # Reset the rank column per query_id using .loc to avoid SettingWithCopyWarning
    expected_results_dataframe.loc[:, "rank"] = expected_results_dataframe.groupby("query_id").cumcount()

    return expected_results_dataframe
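
The effect is easiest to see on a tiny, made-up dataframe. The sketch below uses toy data (not output from the VectorStore) to show the gap in the ranks after deduplication and how `groupby(...).cumcount()` renumbers each query's rows from 0.

```python
import pandas as pd

# Toy results: query 0 returns doc "101" twice, query 1 returns doc "103" twice.
toy = pd.DataFrame(
    {
        "query_id": [0, 0, 0, 1, 1],
        "doc_id": ["101", "101", "102", "103", "103"],
        "rank": [0, 1, 2, 0, 1],
    }
)

deduped = toy.drop_duplicates(subset=["query_id", "doc_id"], keep="first")
ranks_with_gap = deduped["rank"].tolist()  # rank 1 is missing for query 0

# cumcount() numbers the rows within each query_id group, starting at 0.
reranked = deduped.assign(rank=deduped.groupby("query_id").cumcount())
ranks_fixed = reranked["rank"].tolist()

print(ranks_with_gap)
print(ranks_fixed)
```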

In [None]:
# and let's access the hooks directly from the vector store instance to modify them
my_vector_store_with_hooks.hooks["search_postprocess"] = drop_duplicates_and_reset_rank

#### Done - now let's run that query again

In [None]:
my_vector_store_with_hooks.search("a fruit and vegetable farmer!!!", n_results=10)

#### This of course still works well when you pass multiple queries:

In [None]:
my_vector_store_with_hooks.search(["a fruit and vegetable farmer!!!", "Digital marketing@"], n_results=10)

### Injecting Data into our classification results with a hook

What if we had some additional context information that we wanted to add to our pipeline? It could be some official taxonomy definitions for our doc_id labels, such as SIC or SOC code definitions.

We may want to inject this extra information, which is not directly stored as metadata in the knowledgebase, so that a downstream component (such as a RAG agent) can use it.

#### But we also want to keep our existing hook logic that removes duplicates and resets the ranks...

In [None]:
official_id_definitions = {
    "101": "Fruit farmer: Grows and harvests fruits such as apples, oranges, and berries.",
    "102": "Dairy farmer: Manages cows for milk production and processes dairy products.",
    "103": "Construction laborer: Performs physical tasks on construction sites, such as digging and carrying materials.",
    "104": "Carpenter: Constructs, installs, and repairs wooden frameworks and structures.",
    "105": "Electrician: Installs, maintains, and repairs electrical systems in buildings and equipment.",
    "106": "Plumber: Installs and repairs water, gas, and drainage systems in homes and businesses.",
    "107": "Software developer: Designs, writes, and tests computer programs and applications.",
    "108": "Data analyst: Analyzes data to provide insights and support decision-making.",
    "109": "Accountant: Prepares and examines financial records, ensuring accuracy and compliance with regulations.",
    "110": "Teacher: Educates students in schools, colleges, or universities.",
    "111": "Nurse: Provides medical care and support to patients in hospitals, clinics, or homes.",
    "112": "Chef: Prepares and cooks meals in restaurants, hotels, or other food establishments.",
    "113": "Graphic designer: Creates visual concepts for advertisements, websites, and branding.",
    "114": "Mechanic: Repairs and maintains vehicles and machinery.",
    "115": "Photographer: Captures images for events, advertising, or artistic purposes.",
}

In [None]:
def add_id_definitions(expected_results_dataframe):
    # Map the 'doc_id' column to the corresponding definitions from the dictionary
    expected_results_dataframe.loc[:, "id_definition"] = expected_results_dataframe["doc_id"].map(
        official_id_definitions
    )

    return expected_results_dataframe

#### We can now combine this with our deduplicating hook in a new function that runs both

In [None]:
def process_results(expected_results_dataframe):
    # First, remove duplicates and reset rank
    processed_dataframe = drop_duplicates_and_reset_rank(expected_results_dataframe)

    # Then, add ID definitions
    processed_dataframe = add_id_definitions(processed_dataframe)

    # Return the final processed dataframe
    return processed_dataframe
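
If you find yourself writing many wrapper functions like this, a small generic composer is one alternative. The sketch below is an illustrative assumption: `chain_hooks` is a hypothetical helper and is not provided by ClassifAI. It is demonstrated on plain lists so that it runs on its own.

```python
from functools import reduce
from typing import Any, Callable


def chain_hooks(*steps: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Hypothetical helper: build one hook that applies `steps` left to right."""

    def chained(value: Any) -> Any:
        return reduce(lambda current, step: step(current), steps, value)

    return chained


# Toy steps on a list of numbers, standing in for dataframe transformations.
def drop_repeats(values):
    return list(dict.fromkeys(values))  # keeps first occurrence, preserves order


def add_one(values):
    return [v + 1 for v in values]


pipeline = chain_hooks(drop_repeats, add_one)
print(pipeline([3, 3, 1, 2, 1]))
```

With the functions from this notebook, the equivalent would be `chain_hooks(drop_duplicates_and_reset_rank, add_id_definitions)`, which behaves like the hand-written `process_results` function.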

#### Let's once again update the postprocessing hook on our VectorStore

In [None]:
my_vector_store_with_hooks.hooks["search_postprocess"] = process_results

#### And let's try the search again!

In [None]:
my_vector_store_with_hooks.search(["a fruit and vegetable farmer!!!", "Digital marketing@"], n_results=10)

#### We can see a few NaN values because we did not provide definitions for all doc_ids in the dictionary for this example
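
If NaN values are unwelcome downstream, one option is to substitute a placeholder for missing definitions. The sketch below uses toy data and a made-up placeholder string to show the behaviour of `.map()` followed by `.fillna()`.

```python
import pandas as pd

# Toy definitions and results; "999" deliberately has no definition.
definitions = {"101": "Fruit farmer"}
results = pd.DataFrame({"doc_id": ["101", "999"]})

# .map() leaves NaN where a key is missing; .fillna() replaces it with a placeholder.
results["id_definition"] = results["doc_id"].map(definitions).fillna("No definition available")

print(results["id_definition"].tolist())
```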

### Takeaways:

- We wrote and combined several hooks on the VectorStore class to:
    - remove punctuation from queries before the ```VectorStore.search()``` method is executed,
    - remove duplicates from each query's results and fix the ranking,
    - inject data into our dataflow that was not part of the VectorStore's construction,
    - chain several VectorStore.search() postprocessing steps together into one function that calls other functions.

- In this scenario we showed how to deduplicate the rows of the results dataframe and add an extra context column in the form of the id_definitions. Hopefully it is clear that you can add many pre- or post-processing steps this way, or by writing all steps in one big function - Hooks give you the flexibility and choice here.

- Hooks let you intervene in the normal flow of data between the Vectoriser, the VectorStore and the REST API system. In this case we only added a small amount of dictionary data, but Hooks allow for more complex scenarios:
    - using a 3rd party API to do automated corrective spell checking before passing your queries to the search method,
    - making an SQL query call to a database to get the extra information you want to inject in each row,
    - handling errors when the API or database fails, by returning the original Pydantic object or raising an error as needed.
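
The error-handling idea in the last point can be sketched as a wrapper that falls back to the unmodified input when a hook fails. This is a minimal sketch under stated assumptions: `fail_safe` and `flaky_lookup` are hypothetical names, and the failing function merely stands in for an external API or database call.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def fail_safe(hook: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Hypothetical wrapper: if `hook` raises, log the error and return the input unchanged."""

    def wrapped(value: Any) -> Any:
        try:
            return hook(value)
        except Exception:
            logger.exception("Hook %s failed; returning input unchanged", getattr(hook, "__name__", hook))
            return value

    return wrapped


def flaky_lookup(value):
    # Stands in for a call to an external API or database that is down.
    raise ConnectionError("service unavailable")


safe_lookup = fail_safe(flaky_lookup)
print(safe_lookup(["a fruit farmer"]))
```

Note that this only returns the input truly unchanged if the wrapped hook does not modify its argument in place before failing. Whether to swallow the error or re-raise it is a design choice that depends on how critical the hook is to your pipeline.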


### Next Steps and Challenges

#### We focused solely on showcasing pre- and post-processing hooks for the VectorStore in this notebook:

- See if you can implement some pre- and post-processing hooks for the Vectoriser class of your choice:
    - try doing punctuation removal in the Vectoriser ```transform_preprocess``` hook and see how it is subtly different from performing this in the ```search_preprocess``` hook of the VectorStore
    - you could also try normalising the NumPy arrays returned by the vectoriser, so that all vectors are normalised during VectorStore construction and search and the search scores fall within a bounded range (for a dot-product score on unit-length vectors, between -1 and 1).
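
As a starting point for the normalisation challenge, here is a minimal sketch of L2 normalisation for a 2D array of embeddings. It assumes that a `transform_postprocess` hook receives and returns such an array, which you should confirm against the package before relying on it. The function name `l2_normalise` is hypothetical.

```python
import numpy as np


def l2_normalise(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length, leaving all-zero rows unchanged."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Avoid division by zero: rows with zero norm are divided by 1 instead.
    safe_norms = np.where(norms == 0, 1.0, norms)
    return vectors / safe_norms


embeddings = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
unit = l2_normalise(embeddings)
print(unit)
```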