#| hide
from classifier import schema

# classifier

> Classifying customer service emails


## Install

1. Clone the repo
2. In the terminal, navigate to the project directory
3. Create a virtualenv with Python 3.11 or later
4. Install via `pip install '.'`

## Motivation

### Why the existing model is likely failing

My hypothesis is that the existing ML solution gained much of its predictive power from specific keywords in the training corpus. Whether those keywords have become less predictive or are simply used less often now, the existing model is suffering because it essentially biased itself toward the training data.

### Why prompting doesn't work well

Prompting on a raw blob of email text likely struggles for a few reasons. One is that email text can be full of extraneous, inconsequential boilerplate. Filtering the text down to what is essential to the conversation might help.

Another reason is that the LLM doesn't really understand these labels. It wasn't trained on Cardinal's businesses, so it is essentially a layperson when it comes to CAH business processes. Without labeled examples, it can't know which email text belongs to which category.

### My approach

1. Load emails
2. Prompt on each email individually to remove boilerplate text ("Summarize the conversation of this email as a series of steps"), using map-reduce so that emails larger than the context window can be handled.
3. Build a Chroma vector database of embedded training instances from the summaries produced in step 2 and the existing labels.
4. Query the Chroma database for similar, labeled instances and pass them, together with the summarized email, to the LLM for a final prediction.
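
As a rough illustration of steps 2 and 4, here is a minimal sketch in plain Python. It is not the code in the notebooks: `LLM`, `chunk_text`, `summarize_email`, and `build_prompt` are hypothetical names, the LLM is assumed to be any callable that takes a prompt string and returns text, and the character-based chunking with a `max_chars` limit stands in for whatever token-based splitting the real pipeline uses. The Chroma lookup in step 3 is left out; `neighbors` simply represents the labeled examples it would return.

```python
from typing import Callable, Sequence

# Hypothetical type: any function that sends a prompt to an LLM and returns its reply.
LLM = Callable[[str], str]

# Prompt wording taken from step 2 of the approach.
SUMMARY_PROMPT = "Summarize the conversation of this email as a series of steps:\n\n{text}"


def chunk_text(text: str, max_chars: int) -> list[str]:
    """Split text into consecutive pieces of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def summarize_email(text: str, llm: LLM, max_chars: int = 4000) -> str:
    """Map-reduce summary: summarize each chunk, then summarize the joined summaries."""
    chunks = chunk_text(text, max_chars)
    if len(chunks) <= 1:
        # Short email: a single prompt is enough.
        return llm(SUMMARY_PROMPT.format(text=text))
    # Map: summarize every chunk independently.
    partials = [llm(SUMMARY_PROMPT.format(text=chunk)) for chunk in chunks]
    # Reduce: one final pass over the partial summaries.
    # (Assumes the joined summaries fit in context; a real pipeline may need to repeat this.)
    return llm(SUMMARY_PROMPT.format(text="\n".join(partials)))


def build_prompt(summary: str, neighbors: Sequence[tuple[str, str]]) -> str:
    """Format retrieved (summary, label) pairs as few-shot examples ahead of the new email."""
    examples = "\n\n".join(f"Email: {s}\nLabel: {label}" for s, label in neighbors)
    return f"{examples}\n\nEmail: {summary}\nLabel:"
```

Passing the LLM in as a callable keeps the sketch independent of any particular client library and makes the map and reduce passes easy to exercise with a stub.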

## Notebooks

- `00_schema.ipynb` - Pydantic objects
- `01_load.ipynb` - Load our documents from GCS
- `02_process.ipynb` - Process emails according to step 2 above