In [None]:
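import sys

import pandas as pd

# Make the project package importable from the notebook directory
sys.path.insert(0, "..")

from queuerious_detector.leak_prevent import deduplicate_and_split
from queuerious_detector.preprocessing import clean_text, preprocess_tickets, redact_pii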

# Load the raw ticket data
raw_data = pd.read_csv(
    "../data/raw/aa_dataset-tickets-multi-lang-5-2-50-version.csv")

# Combine classes based on previous analysis
class_map = {
    'Technical Support': 'Technical & IT Support',
    'IT Support': 'Technical & IT Support',
    'Customer Service': 'Customer Service, Returns & Exchanges',
    'Returns and Exchanges': 'Customer Service, Returns & Exchanges',
}

# Preprocess the data: merge the text fields and apply the class mapping
preprocessed_df = preprocess_tickets(
    df=raw_data,
    text_fields=["subject", "body"],
    target_col="queue",
    new_target_col="queue_grouped",
    class_map=class_map,
    output_columns=["combined_text", "queue_grouped"]
)

# Redact PII from the combined text
preprocessed_df["redacted_text"] = preprocessed_df["combined_text"].apply(redact_pii)

# Clean the redacted text
preprocessed_df["redacted_text_clean"] = preprocessed_df["redacted_text"].apply(clean_text)

# Deduplicate, split into train/validation/test sets, and save
train_df, val_df, test_df = deduplicate_and_split(
    preprocessed_df,
    text_col="redacted_text_clean",
    target_col="queue_grouped"
)

train_df.to_csv("../data/processed/train_tickets.csv", index=False)
val_df.to_csv("../data/processed/val_tickets.csv", index=False)
test_df.to_csv("../data/processed/test_tickets.csv", index=False)
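
Since the goal of the deduplicated split is to keep the same ticket text out of more than one split, a quick overlap check can confirm it. The following is only a sketch: `find_split_overlap` is a hypothetical helper, not part of `queuerious_detector`, and it checks exact string matches only.

```python
def find_split_overlap(splits):
    """Return texts that occur in more than one split.

    `splits` maps a split name to an iterable of text values.
    The result maps each shared text to the sorted split names containing it.
    """
    seen_in = {}
    for name, texts in splits.items():
        # set() collapses duplicates inside a single split
        for text in set(texts):
            seen_in.setdefault(text, set()).add(name)
    return {
        text: sorted(names)
        for text, names in seen_in.items()
        if len(names) > 1
    }


# Toy data (made up for illustration) with one deliberate leak
example = {
    "train": ["reset my password", "refund request"],
    "val": ["invoice missing"],
    "test": ["refund request", "printer offline"],
}
overlap = find_split_overlap(example)
print(overlap)
```

With the frames from this notebook, the same helper could be called with `{"train": train_df["redacted_text_clean"], "val": val_df["redacted_text_clean"], "test": test_df["redacted_text_clean"]}`; an empty result would mean no exact duplicate text crosses the splits.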

### Lessons Learned:
1. The "Answer" column shows that the data was previously redacted, but the "body" and "subject" columns do not appear to have been redacted.

2. The redaction function will be added to the pipeline to help reduce PII in new (production) data.

3. I experimented with the spaCy small models, but they produced too many misclassifications; the large models seem to perform better.

4. One thing that increases misclassifications is a lack of context. For example, the "subject" column alone may be too short for NER to be effective, so I decided to combine the "subject" and "body" columns.

5. spaCy NER sometimes misclassifies a product name as a person entity.

6. One enhancement I would make is translation. The translation function takes hours to run because we have 12k German records.
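
One possible way to soften the product-as-person problem from lesson 5 is to skip redaction for known product names. The sketch below is not the notebook's `redact_pii`; `redact_spans`, `span_of`, and `keep_terms` are hypothetical names. It assumes entity spans arrive as non-overlapping `(start, end, label)` character offsets, which with spaCy could be built from `ent.start_char`, `ent.end_char`, and `ent.label_` for each entity in `doc.ents`.

```python
def redact_spans(text, spans, keep_terms=()):
    """Replace labeled character spans with [LABEL] placeholders.

    spans: iterable of non-overlapping (start, end, label) offsets.
    keep_terms: strings that must never be redacted, e.g. product names
                (compared case-insensitively).
    """
    keep = {term.lower() for term in keep_terms}
    redacted = text
    # Work right to left so earlier offsets stay valid after each replacement
    for start, end, label in sorted(spans, reverse=True):
        if text[start:end].lower() in keep:
            continue  # allowlisted term: leave it untouched
        redacted = redacted[:start] + f"[{label}]" + redacted[end:]
    return redacted


def span_of(text, phrase, label):
    """Build a (start, end, label) span for the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


# Made-up example: the product name is wrongly tagged as a person
sample = "Hi, this is Anna Schmidt. My Surface Pro will not boot."
spans = [
    span_of(sample, "Anna Schmidt", "PERSON"),
    span_of(sample, "Surface Pro", "PERSON"),
]

without_allowlist = redact_spans(sample, spans)
with_allowlist = redact_spans(sample, spans, keep_terms=["Surface Pro"])
print(without_allowlist)
print(with_allowlist)
```

The allowlist trades recall for precision: any real person whose name matches an allowlisted term would be left unredacted, so the list should stay short and product-specific.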