### Task example

Given a set of lease documents and a set of available classes (for example, 'residential', 'ground', 'equipment'), we need to find the class most relevant to each document.
The goal is an approach that classifies with more than 95% accuracy.

### How to solve

The most reliable way to measure classification quality is to compare predictions against a "golden" dataset of documents that are already labeled (classified). Ideally, it is a balanced dataset where each class has the same number of documents - this reduces bias toward any one class and keeps the accuracy figure from being dominated by the largest class.

There are several techniques for solving classification tasks with a dataset of labeled documents. They fall into two major categories: with a training dataset and without one (i.e. the "golden" dataset is used as the test dataset only).

With training dataset:
- LLM fine-tuning (GPT, Llama, etc.)
- k-nearest neighbors over document embeddings
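
A minimal sketch of the k-nearest approach from the list above. The `embed` function here is a toy bag-of-words stand-in, not a real embedding model, and the training texts are made up for illustration; in practice you would replace `embed` with calls to whatever embedding model you use.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(text, labeled_docs, k=3):
    """Return the majority class among the k most similar labeled documents."""
    query = embed(text)
    ranked = sorted(labeled_docs, key=lambda d: cosine(query, embed(d[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training part of the "golden" dataset: (text, class) pairs
train_docs = [
    ("lease of apartment unit to tenant for residential use", "residential"),
    ("tenant shall occupy the apartment as a private dwelling", "residential"),
    ("lease of forklift and warehouse equipment to lessee", "equipment"),
    ("lessee shall maintain the leased equipment and machinery", "equipment"),
    ("ground lease of land parcel for ninety nine years", "ground"),
    ("lessee may construct buildings on the leased land parcel", "ground"),
]

predicted = knn_classify("lease of an excavator and other equipment", train_docs, k=3)
print(predicted)
```

No model is trained here: the labeled documents themselves act as the "training" data, and a new document takes the class of its nearest neighbors.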

Without training dataset:
- LLM prompting (prompt contains text description of all classes)
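
A sketch of how such a prompt could be assembled. The class descriptions and the helper names `build_prompt` and `parse_answer` are assumptions made for illustration; the actual call to an LLM is left out because it depends on the provider you use.

```python
# Hypothetical descriptions; write your own to match your documents
CLASS_DESCRIPTIONS = {
    "residential": "Lease of a house or apartment used as a dwelling.",
    "ground": "Long-term lease of land on which the tenant may build.",
    "equipment": "Lease of machinery, vehicles, or other movable assets.",
}

def build_prompt(document_text, class_descriptions):
    """Build a prompt that lists every class with its description."""
    lines = ["Classify the lease document into exactly one of these classes:"]
    for name, description in class_descriptions.items():
        lines.append(f"- {name}: {description}")
    lines.append("Answer with the class name only.")
    lines.append("")
    lines.append("Document:")
    lines.append(document_text)
    return "\n".join(lines)

def parse_answer(raw_answer, class_descriptions):
    """Map the raw model reply to a known class, or None if it matches none."""
    cleaned = raw_answer.strip().strip(".'\"").lower()
    return cleaned if cleaned in class_descriptions else None

prompt = build_prompt("Lessor leases the excavator to Lessee ...", CLASS_DESCRIPTIONS)
print(prompt)
```

Returning `None` for an unrecognized reply makes it easy to count such cases as errors when you compute accuracy against the "golden" dataset.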

[Important] If you use the "golden" dataset for training, remember to split it into two parts: use 80% of the data for training and 20% for testing. This way you can validate how your approach works on new data that did not appear in the training dataset.
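
The 80/20 split described above can be sketched in plain Python as follows. The function name `stratified_split` and the sample data are made up for illustration. Splitting per class (stratifying) is an addition on top of the text: it keeps the class proportions the same in both parts. scikit-learn's `train_test_split` with its `stratify` argument does the same job.

```python
import random
from collections import defaultdict

def stratified_split(docs, test_fraction=0.2, seed=42):
    """Split (doc_name, label) pairs so each class keeps the same share in both parts."""
    by_class = defaultdict(list)
    for doc in docs:
        by_class[doc[1]].append(doc)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for label in sorted(by_class):
        items = by_class[label][:]
        rng.shuffle(items)
        # At least one test item per class; a class with a single
        # document therefore ends up in the test part only.
        n_test = max(1, round(len(items) * test_fraction))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

# Hypothetical "golden" dataset: 10 residential and 10 equipment documents
golden = [(f"doc{i}", "residential" if i % 2 else "equipment") for i in range(20)]
train, test = stratified_split(golden)
print(len(train), len(test))
```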


```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Sample datasets: predicted classes and the "golden" (true) classes
classification_df = pd.DataFrame([
    {"doc_name": "doc1", "class": "A"},
    {"doc_name": "doc2", "class": "B"},
    {"doc_name": "doc3", "class": "A"},
])

golden_df = pd.DataFrame([
    {"doc_name": "doc1", "class": "A"},
    {"doc_name": "doc2", "class": "A"},
    {"doc_name": "doc3", "class": "A"},
])

# Assuming 'doc_name' is unique and can be used as an index
classification_df = classification_df.set_index('doc_name')
golden_df = golden_df.set_index('doc_name')

# Joining the DataFrames on 'doc_name' to align the classes
combined_df = classification_df.join(golden_df, lsuffix='_pred', rsuffix='_true')

# Calculating accuracy (accuracy_score expects true labels first, then predictions)
accuracy = accuracy_score(combined_df['class_true'], combined_df['class_pred']) * 100

print(f"Accuracy: {accuracy:.1f}%")
```

Output:

```
Accuracy: 66.7%
```


### How to balance your "golden" dataset

There are two common ways:
1. Reduction (undersampling) - remove items from the larger classes until every class has the same number of items
2. Synthetic data - create new items for under-represented classes using GenAI. New data can be generated from rules or from few-shot examples taken from the "golden" dataset
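
The reduction option can be sketched as random undersampling. The function name `undersample` and the class counts below are made up for illustration.

```python
import random
from collections import Counter, defaultdict

def undersample(docs, seed=42):
    """Randomly drop items so every class has as many items as the smallest class."""
    by_class = defaultdict(list)
    for doc in docs:
        by_class[doc[1]].append(doc)

    target = min(len(items) for items in by_class.values())
    rng = random.Random(seed)  # fixed seed makes the selection reproducible
    balanced = []
    for label in sorted(by_class):
        balanced.extend(rng.sample(by_class[label], target))
    return balanced

# Hypothetical imbalanced "golden" dataset of (doc_name, class) pairs
golden = (
    [(f"r{i}", "residential") for i in range(50)]
    + [(f"g{i}", "ground") for i in range(12)]
    + [(f"e{i}", "equipment") for i in range(30)]
)
balanced = undersample(golden)
print(Counter(label for _, label in balanced))
```

The price of this approach is lost data: here 92 documents shrink to 36, which is why synthetic data can be the better choice when the smallest class is very small.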

For synthetic data, check out DeepEval's [deepeval.dataset](https://docs.confident-ai.com/docs/getting-started#generate-synthetic-datasets) package.

As for ready-to-use tools, Amazon SageMaker Data Wrangler might help with balancing. Unfortunately, I don't have experience with this tool, so I can't say more about it.

### Summary

Keep in mind the following rules:
- Always try to get a balanced dataset (both training and test)
- Use only 80% of your full "golden" dataset for training; keep the remaining 20% for testing

### Unsupervised classification

With unsupervised classification there are not many options for checking quality: about all we can do is have another LLM double-check whether each assigned class is correct. This can be a different model from the one used for the classification itself.

The further steps of accuracy evaluation are similar to those in supervised classification, as in the example above.
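
The second-model check described in this section can be sketched as follows. The `judge` argument is a hypothetical callable that wraps a call to the second LLM and returns `True` when it confirms the class; `toy_judge` is only a keyword-matching placeholder so that the example runs without any API.

```python
def agreement_rate(predictions, judge):
    """Share of documents where the second model confirms the predicted class."""
    confirmed = sum(1 for doc_text, label in predictions if judge(doc_text, label))
    return confirmed / len(predictions)

def toy_judge(doc_text, label):
    # Placeholder for a yes/no call to a second LLM (assumption, not a real API)
    return label in doc_text.lower()

# Hypothetical (document text, predicted class) pairs from the first model
predictions = [
    ("Residential lease of flat 4B", "residential"),
    ("Equipment lease: forklift", "equipment"),
    ("Ground lease of plot 17", "equipment"),
]

rate = agreement_rate(predictions, toy_judge)
print(f"Agreement: {rate * 100:.1f}%")
```

Note that this measures agreement between two models, not accuracy against known labels, so both models being wrong in the same way goes unnoticed.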