In [1]:
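import sys

import pandas as pd

# Add the parent directory to the import path so the local package can be imported
sys.path.insert(0, "..")
from queuerious_detector.preprocessing import preprocess_tickets, redact_pii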

In [2]:
# Load the raw ticket data
raw_data = pd.read_csv(
    "../data/raw/aa_dataset-tickets-multi-lang-5-2-50-version.csv")

# Combine classes based on previous analysis
class_map = {
    "Technical Support": "Technical & IT Support",
    "IT Support": "Technical & IT Support",
    "Customer Service": "Customer Service, Returns & Exchanges",
    "Returns and Exchanges": "Customer Service, Returns & Exchanges",
}

# Preprocess the data: combine the text fields and map the target to the grouped classes
preprocessed_df = preprocess_tickets(
    df=raw_data,
    text_fields=["subject", "body"],
    target_col="queue",
    new_target_col="queue_grouped",
    class_map=class_map,
    output_columns=["combined_text", "queue_grouped"]
)

# Redact PII from the combined text, keeping the original column for comparison
preprocessed_df["redacted_text"] = preprocessed_df["combined_text"].apply(redact_pii)

In [3]:
preprocessed_df.head()

Unnamed: 0,combined_text,queue_grouped,redacted_text
1,"Account Disruption Dear Customer Support Team,...",Technical & IT Support,"Account Disruption Dear Customer Support Team,..."
2,Query About Smart Home System Integration Feat...,"Customer Service, Returns & Exchanges",Query About Smart Home System Integration Feat...
3,Inquiry Regarding Invoice Details Dear Custome...,Billing and Payments,Inquiry Regarding Invoice Details Dear Custome...
4,Question About Marketing Agency Software Compa...,Sales and Pre-Sales,Question About Marketing Agency Software Compa...
5,"Feature Query Dear Customer Support,\n\nI hope...",Technical & IT Support,"Feature Query Dear Customer Support,\n\nI hope..."


In [4]:
# Re-run this cell a few times to spot-check the redaction on randomly sampled tickets
preprocessed_df[["combined_text","redacted_text","queue_grouped"]].sample(n=1)

Unnamed: 0,combined_text,redacted_text,queue_grouped
13814,Secure Handling of Medical Data in Docker Envi...,Secure Handling of Medical Data in Docker Envi...,Technical & IT Support
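
Because the truncated output above hides most of each ticket, a small diff helper can make the spot-check easier by listing only the fragments that redaction changed. This is a minimal sketch using the standard library; `changed_spans` is a hypothetical helper, not part of `queuerious_detector`, and the `[PERSON]` / `[EMAIL]` placeholders in the example are assumed, since the actual output format of `redact_pii` is not shown here.

```python
from difflib import SequenceMatcher


def changed_spans(original: str, redacted: str) -> list[tuple[str, str]]:
    """Return (original_fragment, redacted_fragment) pairs where the texts differ.

    The comparison is done on whitespace-separated tokens, so the original
    line breaks and spacing are not preserved in the returned fragments.
    """
    orig_tokens = original.split()
    red_tokens = redacted.split()
    matcher = SequenceMatcher(a=orig_tokens, b=red_tokens, autojunk=False)

    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            pairs.append(
                (" ".join(orig_tokens[i1:i2]), " ".join(red_tokens[j1:j2]))
            )
    return pairs


# Illustrative strings only; the placeholder format is an assumption.
example = changed_spans(
    "Contact John Smith at john@example.com today",
    "Contact [PERSON] at [EMAIL] today",
)
print(example)
```

On a sampled row, the helper could be called as `changed_spans(row["combined_text"], row["redacted_text"])`; an empty list means the redaction step left that ticket untouched.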


### Lessons Learned:
1. The "Answer" column shows that the data was previously redacted, but "body" and "subject" do not appear to have been redacted.

2. The redaction function will be added to the pipeline to help reduce PII in new (production) data.

3. I experimented with the spaCy small models but saw too many misclassifications; the large models seem to perform better.

4. One thing that increases misclassifications is a lack of context. For example, the "subject" column alone may be too short for NER to be effective, so I decided to combine the "subject" and "body" columns.

5. spaCy NER sometimes misclassifies a product name as a PERSON entity.

6. One enhancement I would make is translation. The translation function takes hours to run, since we have 12k German records.
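
One possible mitigation for the product-as-PERSON problem in lesson 5 is an allowlist of known product names that are skipped during redaction. The sketch below is an assumption-laden illustration, not the `redact_pii` implementation: `Entity`, `make_entity`, and `redact_with_allowlist` are hypothetical names, `Entity` is a plain stand-in for an NER span (text, label, character offsets), and the `[LABEL]` placeholder format and the product name are made up for the example.

```python
from typing import Iterable, NamedTuple


class Entity(NamedTuple):
    """Hypothetical stand-in for an NER span: text, label, character offsets."""
    text: str
    label: str
    start: int
    end: int


def make_entity(text: str, phrase: str, label: str) -> Entity:
    """Build an Entity for the first occurrence of `phrase` (example helper)."""
    start = text.index(phrase)
    return Entity(phrase, label, start, start + len(phrase))


def redact_with_allowlist(
    text: str,
    entities: Iterable[Entity],
    allowlist: Iterable[str],
    labels: tuple[str, ...] = ("PERSON",),
) -> str:
    """Replace entities with [LABEL], skipping allowlisted terms."""
    allowed = {term.lower() for term in allowlist}
    result = text
    # Work from the end of the text so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        if ent.label not in labels:
            continue  # not a label we redact
        if ent.text.lower() in allowed:
            continue  # known product name wrongly tagged as a person
        result = result[:ent.start] + f"[{ent.label}]" + result[ent.end:]
    return result


# Made-up ticket text; "Nova Speaker" plays a product mis-tagged as PERSON.
ticket = "Anna Schmidt cannot connect her Nova Speaker to the hub."
found = [
    make_entity(ticket, "Anna Schmidt", "PERSON"),
    make_entity(ticket, "Nova Speaker", "PERSON"),
]
print(redact_with_allowlist(ticket, found, allowlist={"Nova Speaker"}))
```

The trade-off is that the allowlist has to be maintained by hand, and a product that shares its name with a real person would be left unredacted, so this suits a small, well-known product catalogue.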