In [4]:
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [5]:
# Load the first dataset for consideration
raw_inc_df = pd.read_csv("../data/raw/incident_event_log.csv")

In [6]:
raw_inc_df.sample(n=5)

Unnamed: 0,number,incident_state,active,reassignment_count,reopen_count,sys_mod_count,made_sla,caller_id,opened_by,opened_at,sys_created_by,sys_created_at,sys_updated_by,sys_updated_at,contact_type,location,category,subcategory,u_symptom,cmdb_ci,impact,urgency,priority,assignment_group,assigned_to,knowledge,u_priority_confirmation,notify,problem_id,rfc,vendor,caused_by,closed_code,resolved_by,resolved_at,closed_at
63077,INC0014731,New,True,0,0,0,True,Caller 291,Opened by 17,4/4/2016 09:56,?,?,Updated by 908,4/4/2016 09:56,Phone,Location 135,Category 42,Subcategory 223,Symptom 534,?,2 - Medium,2 - Medium,3 - Moderate,Group 70,?,False,False,Do Not Notify,?,?,?,?,code 6,Resolved by 11,4/4/2016 10:19,9/4/2016 10:59
43560,INC0009795,New,True,4,0,8,True,Caller 2931,Opened by 17,22/3/2016 08:48,Created by 10,22/3/2016 09:00,Updated by 308,29/3/2016 11:40,Phone,Location 204,Category 37,Subcategory 135,Symptom 118,?,2 - Medium,2 - Medium,3 - Moderate,Group 70,Resolver 13,False,False,Do Not Notify,?,?,?,?,code 6,Resolved by 11,1/4/2016 12:04,6/4/2016 12:59
78371,INC0018607,Closed,False,0,0,2,True,Caller 5285,Opened by 17,13/4/2016 11:21,?,?,Updated by 908,18/4/2016 12:07,Phone,Location 161,Category 32,Subcategory 9,Symptom 105,?,2 - Medium,2 - Medium,3 - Moderate,Group 70,Resolver 13,False,True,Do Not Notify,?,?,?,?,code 9,Resolved by 11,13/4/2016 11:24,18/4/2016 12:07
41696,INC0009375,New,True,1,0,2,True,Caller 4739,Opened by 390,21/3/2016 10:56,Created by 169,21/3/2016 10:58,Updated by 131,22/3/2016 22:14,Phone,Location 204,Category 26,Subcategory 164,Symptom 420,?,2 - Medium,2 - Medium,3 - Moderate,?,Resolver 57,False,False,Do Not Notify,?,?,?,?,code 6,Resolved by 53,28/3/2016 16:26,2/4/2016 16:59
83828,INC0020088,Active,True,7,0,12,True,Caller 5287,Opened by 17,17/4/2016 09:40,?,?,Updated by 510,29/4/2016 16:20,Phone,Location 204,Category 23,Subcategory 174,Symptom 491,?,2 - Medium,2 - Medium,3 - Moderate,Group 72,Resolver 110,False,False,Do Not Notify,?,?,?,?,code 9,Resolved by 127,2/5/2016 11:33,7/5/2016 12:07
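The sample above uses `?` where a value is absent, so a plain `read_csv` would count those cells as ordinary strings rather than as missing data. A minimal sketch of one way to handle this, assuming `?` is always a missing-value marker in this file; the inline three-column CSV is a made-up stand-in for the real log:

```python
import io

import pandas as pd

# Tiny inline stand-in for incident_event_log.csv (illustrative rows only).
sample_csv = io.StringIO(
    "number,sys_created_by,assigned_to\n"
    "INC0014731,?,?\n"
    "INC0009795,Created by 10,Resolver 13\n"
    "INC0018607,?,Resolver 13\n"
)

# na_values="?" asks read_csv to parse the placeholder as NaN,
# in addition to pandas' default missing-value markers.
inc_df = pd.read_csv(sample_csv, na_values="?")

# Percentage of missing values per column.
missing_pct = inc_df.isna().mean() * 100
print(missing_pct.round(1))
```

With the real file, the same `na_values="?"` argument could be passed to the `pd.read_csv` call in cell 5.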


The incident dataset has no free-text field, so it does not fit our learning
objective of getting hands-on NLP experience. Let's try another dataset.
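Rather than eyeballing a sample, the check for a free-text field can be roughed out in code. The helper below is a hypothetical heuristic, not part of the original notebook: it flags non-numeric columns whose values average at least `min_avg_words` words, and the threshold of 5 is an arbitrary assumption.

```python
import pandas as pd


def find_text_columns(df, min_avg_words=5.0):
    """Return non-numeric columns whose values average >= min_avg_words words."""
    text_cols = []
    for col in df.columns:
        # Numeric (and boolean) columns cannot hold free text.
        if pd.api.types.is_numeric_dtype(df[col]):
            continue
        values = df[col].dropna().astype(str)
        if values.empty:
            continue
        avg_words = values.str.split().str.len().mean()
        if avg_words >= min_avg_words:
            text_cols.append(col)
    return text_cols


# Toy frame: short categorical labels versus sentence-like text.
demo = pd.DataFrame({
    "category": ["Category 42", "Category 37"],
    "body": [
        "The printer stops responding after every firmware update",
        "Cannot log in to the billing portal since this morning",
    ],
    "priority": [3, 2],
})
print(find_text_columns(demo))
```

On a frame like the incident log, where every string column holds short labels such as `Category 42`, this heuristic would be expected to return an empty list.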

In [7]:
# Load the second dataset for consideration
raw_cust_df = pd.read_csv("../data/raw/customer_support_tickets.csv")

In [8]:
raw_cust_df.sample(n=5)

Unnamed: 0,Ticket ID,Customer Name,Customer Email,Customer Age,Customer Gender,Product Purchased,Date of Purchase,Ticket Type,Ticket Subject,Ticket Description,Ticket Status,Resolution,Ticket Priority,Ticket Channel,First Response Time,Time to Resolution,Customer Satisfaction Rating
3920,3921,Eugene Brown,csmith@example.org,68,Other,Dyson Vacuum Cleaner,2021-07-18,Cancellation request,Display issue,I'm having an issue with the {product_purchase...,Open,,Medium,Social media,,,
7900,7901,Regina Espinoza,swansonrebecca@example.com,53,Other,Amazon Echo,2020-04-26,Refund request,Battery life,I'm having an issue with the {product_purchase...,Open,,Critical,Social media,,,
6291,6292,Joseph Smith,daniellejohnson@example.net,27,Male,MacBook Pro,2020-09-05,Technical issue,Product setup,I'm having an issue with the {product_purchase...,Open,,Medium,Email,,,
580,581,Rebekah Price,ocarter@example.org,27,Other,Samsung Galaxy,2020-09-07,Cancellation request,Software bug,I'm having an issue with the {product_purchase...,Closed,Lay common nature dog stuff staff late.,Critical,Social media,2023-06-01 03:37:01,2023-06-01 06:13:01,2.0
7651,7652,Ryan Castillo,joelhall@example.org,20,Female,Canon EOS,2021-09-24,Product inquiry,Data loss,I'm having an issue with the {product_purchase...,Open,,Critical,Email,,,


In [9]:
raw_cust_df.isnull().sum() * 100/len(raw_cust_df)

Ticket ID                        0.000000
Customer Name                    0.000000
Customer Email                   0.000000
Customer Age                     0.000000
Customer Gender                  0.000000
Product Purchased                0.000000
Date of Purchase                 0.000000
Ticket Type                      0.000000
Ticket Subject                   0.000000
Ticket Description               0.000000
Ticket Status                    0.000000
Resolution                      67.304286
Ticket Priority                  0.000000
Ticket Channel                   0.000000
First Response Time             33.286102
Time to Resolution              67.304286
Customer Satisfaction Rating    67.304286
dtype: float64

### High-Level Data Audit:
- Structured fields: demographics (age, gender), product, priority, channel, type, etc.
- Semi-structured / unstructured text fields: subject, description, resolution.
- Target-like fields: time to resolution, satisfaction rating.
- Many missing values: Resolution, Time to Resolution, and Customer Satisfaction Rating are each 67.3% null, and First Response Time is 33.3% null.
- The dataset suits multi-modal modeling that combines structured features with text embeddings.
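In the five sampled rows, the open tickets have no Resolution or satisfaction rating while the closed one does, which hints that the missing values may follow Ticket Status rather than being random. A small sketch of how that could be checked; the toy frame and its values are invented for illustration, and only the column names come from the dataset:

```python
import numpy as np
import pandas as pd

# Toy stand-in for raw_cust_df (values are invented).
demo = pd.DataFrame({
    "Ticket Status": ["Open", "Open", "Closed", "Closed"],
    "Resolution": [np.nan, np.nan, "Replaced unit", "Reset password"],
    "Customer Satisfaction Rating": [np.nan, np.nan, 2.0, 4.0],
})

# Percentage of missing values per column, split by ticket status.
missing_by_status = (
    demo.drop(columns="Ticket Status")
    .isna()
    .groupby(demo["Ticket Status"])
    .mean()
    * 100
)
print(missing_by_status)
```

If the real data shows the same pattern, the 67.3% null rate would reflect unresolved tickets rather than a data-quality problem.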

### Project Idea 1: Ticket Topic Modeling (BERTopic)
- Objective: Use BERTopic (or similar) to create meaningful topics from Ticket Description or Ticket Subject.
- Useful for automated tagging.
- Could help reduce manual ticket triage.
- Can visualize trends: which topics are most common, which have higher resolution delays or satisfaction problems.

### Project Idea 2: Agent Assignment / Routing Model
- Objective: Predict which agent (or team) should handle the ticket based on early ticket info.
- The dataset has no agent field, but one could be simulated with synthetic labels:
     - Build clusters of tickets using BERTopic.
     - Assign each cluster to a hypothetical specialized team.
     - Build classification models to route new tickets.
- **Business value: More efficient routing → faster resolutions → higher satisfaction.**
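The cluster, label, and classify steps of this idea can be sketched end to end. This version uses KMeans over TF-IDF vectors as a stand-in for BERTopic clustering; the ticket texts, the team names, and the cluster count are all hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented ticket texts (stand-ins for Ticket Description).
tickets = [
    "battery drains quickly after the update",
    "battery will not charge overnight",
    "battery life is very short",
    "refund request for my cancelled order",
    "please refund the duplicate payment",
    "refund has not arrived yet",
]

# Step 1: cluster the tickets.
features = TfidfVectorizer(stop_words="english").fit_transform(tickets)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Step 2: map each cluster to a hypothetical team to create synthetic labels.
team_names = {0: "Team A", 1: "Team B"}
synthetic_labels = [team_names[c] for c in clusters]

# Step 3: train a classifier on the synthetic labels to route new tickets.
router = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
router.fit(tickets, synthetic_labels)

predicted_team = router.predict(["my battery died after one hour"])[0]
print(predicted_team)
```

Because the labels come from the clustering itself, the classifier's accuracy only measures how well it reproduces the clusters, not how well real agents would be matched.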

### Project Idea 3: Sentiment Analysis Augmentation
- Objective: Perform sentiment analysis on the Ticket Description field.
- Sentiment may correlate with satisfaction rating, resolution time, or escalation likelihood.
- Could serve as feature enrichment for the models in the earlier project ideas.
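As a rough illustration of turning sentiment into a numeric feature, the sketch below scores text against a tiny hand-made word list. The lexicon is hypothetical and far too small for real use; an established sentiment model or lexicon would replace it in practice.

```python
import re

import pandas as pd

# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"great", "thanks", "love", "resolved", "helpful"}
NEGATIVE = {"issue", "broken", "frustrated", "terrible", "useless"}


def lexicon_sentiment(text):
    """Return (positive hits - negative hits) / token count, or 0.0 for empty text."""
    tokens = re.findall(r"[a-z']+", str(text).lower())
    if not tokens:
        return 0.0
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    return (pos - neg) / len(tokens)


descriptions = pd.Series([
    "I'm frustrated, the screen is broken",
    "Thanks, the fix was great",
    "",
])
sentiment = descriptions.apply(lexicon_sentiment)
print(sentiment)
```

The resulting score could be added as one more column alongside the structured fields before training any of the models above.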

In [10]:
# Load the third dataset for consideration
raw_multi_df = pd.read_csv("../data/raw/aa_dataset-tickets-multi-lang-5-2-50-version.csv")

In [11]:
#check shape
raw_multi_df.shape

(28587, 16)

In [12]:
raw_multi_df.sample(n=5)

Unnamed: 0,subject,body,answer,type,queue,priority,language,version,tag_1,tag_2,tag_3,tag_4,tag_5,tag_6,tag_7,tag_8
25237,Urgent: Improve Hospital Data Security Now,We need to implement strong encryption methods...,We appreciate the email highlighting the impor...,Change,Technical Support,high,en,400,Security,IT,Tech Support,Compliance,Training,Alert,,
28585,Update Request for SaaS Platform Integration F...,Requesting an update on the integration featur...,Received your request for updates on the integ...,Change,IT Support,high,en,400,Feature,IT,Tech Support,,,,,
15832,Unterstützung benötigt für Datenanalyse in Mic...,Ich benötige Hilfe bei der Integration und Dat...,Ich kann Ihnen Rat zur Integration und Datenan...,Request,Billing and Payments,low,de,400,IT,Tech Support,Documentation,Feedback,,,,
13014,Assistance with Financial Products,Is it possible to get detailed information on ...,Thank you for reaching out regarding support w...,Request,Product Support,medium,en,400,Feedback,Sales,IT,Tech Support,,,,
4974,Inquiry Regarding Healthcare Data Security Sol...,Customer Support is reaching out to inquire ab...,To provide information about healthcare data s...,Request,Customer Service,medium,en,52,Security,IT,Feedback,,,,,


In [13]:
raw_multi_df.isnull().sum() * 100/len(raw_multi_df)

subject     13.425683
body         0.000000
answer       0.024487
type         0.000000
queue        0.000000
priority     0.000000
language     0.000000
version      0.000000
tag_1        0.000000
tag_2        0.045475
tag_3        0.475741
tag_4       10.697170
tag_5       49.120229
tag_6       79.452199
tag_7       92.863889
tag_8       98.023577
dtype: float64

In [15]:
raw_multi_df["language"].value_counts()

language
en    16338
de    12249
Name: count, dtype: int64

In [16]:
raw_multi_df["queue"].value_counts()

queue
Technical Support                  8362
Product Support                    5252
Customer Service                   4268
IT Support                         3433
Billing and Payments               2788
Returns and Exchanges              1437
Service Outages and Maintenance    1148
Sales and Pre-Sales                 918
Human Resources                     576
General Inquiry                     405
Name: count, dtype: int64

### High-Level Audit of Dataset
- Customer ticket (subject, body)
- Agent response (answer)
- Structured metadata (queue, priority, type, tags, language)
- Multilingual support (en, de)
- Tag labels (tag_1 through tag_8; tag_6 through tag_8 are more than 79% null)

### Project Idea 1: Ticket Routing / Department Classification (Text Classification)
- Objective: Build a model that classifies the ticket into the correct queue based on subject + body.
- Use Case:
     - Automate ticket triage.
     - Reduce agent workload.
     - Speed up first response times.
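A baseline for the routing idea might look like the sketch below: TF-IDF features over subject plus body, fed to logistic regression. The rows are invented (only the queue names come from the dataset), and `combine_text` is a hypothetical helper that fills the roughly 13% of missing subjects with an empty string.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def combine_text(df):
    """Join subject and body into one string, treating missing values as empty."""
    return (df["subject"].fillna("") + " " + df["body"].fillna("")).str.strip()


# Invented rows; the second has no subject, as happens in the real data.
demo = pd.DataFrame({
    "subject": ["VPN connection fails", None, "Invoice total is wrong", "Charged twice this month"],
    "body": [
        "The VPN client times out when I connect from home.",
        "My laptop cannot connect to the office network.",
        "The invoice shows a payment amount I did not expect.",
        "My card was billed two times for one invoice payment.",
    ],
    "queue": ["Technical Support", "Technical Support",
              "Billing and Payments", "Billing and Payments"],
})

text = combine_text(demo)

# class_weight="balanced" is one option for the skewed queue counts seen above.
queue_model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
queue_model.fit(text, demo["queue"])

new_ticket = pd.DataFrame({"subject": [None], "body": ["I was billed for an invoice twice."]})
predicted_queue = queue_model.predict(combine_text(new_ticket))[0]
print(predicted_queue)
```

Since the data mixes English and German tickets, a real model would likely need language-aware tokenization or a multilingual embedding model instead of a single TF-IDF vocabulary.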

### Project Idea 2: Priority Prediction
- Objective: Predict priority based on ticket content and metadata.
- Use Case:
     - Auto-triage incoming tickets.
     - Proactively identify urgent issues.
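Combining ticket content with metadata could be sketched with a `ColumnTransformer`, as below. The rows are invented, and treating `queue`, `type`, and `language` as available at prediction time is an assumption; if the queue is itself predicted, it should be dropped from the features.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented rows using column names and category values from the dataset.
demo = pd.DataFrame({
    "text": [
        "Urgent outage affecting all hospital systems",
        "Critical security breach needs immediate action",
        "Urgent: production server is down",
        "Question about invoice format",
        "Requesting documentation for the export feature",
        "General inquiry about opening hours",
    ],
    "queue": ["Technical Support", "IT Support", "Technical Support",
              "Billing and Payments", "Product Support", "Customer Service"],
    "type": ["Change", "Change", "Change", "Request", "Request", "Request"],
    "language": ["en"] * 6,
    "priority": ["high", "high", "high", "low", "low", "low"],
})

preprocess = ColumnTransformer([
    # A bare column name passes a 1-D series, which TfidfVectorizer expects.
    ("text", TfidfVectorizer(), "text"),
    # handle_unknown="ignore" zero-encodes categories unseen during training.
    ("meta", OneHotEncoder(handle_unknown="ignore"), ["queue", "type", "language"]),
])

priority_model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
priority_model.fit(demo.drop(columns="priority"), demo["priority"])

new_ticket = pd.DataFrame({
    "text": ["Urgent: the server outage is blocking production"],
    "queue": ["Sales and Pre-Sales"],  # not present in the toy training rows
    "type": ["Change"],
    "language": ["en"],
})
predicted_priority = priority_model.predict(new_ticket)[0]
print(predicted_priority)
```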

### Project Idea 3: Multi-label Tag Prediction
- Objective: Predict tag_1 through tag_8 fields based on subject and body.
- Use Case:
     - Suggest tags to agents as they work on tickets.
     - Analyze most common issues.
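For the multi-label idea, the eight tag columns first need collapsing into one list of tags per ticket. The sketch below shows one possible setup with `MultiLabelBinarizer` and a one-vs-rest classifier; the rows are invented, only three tag columns are included to keep the toy frame small, and the tag names are taken from the sample above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Invented rows; the real data has tag_1 through tag_8.
demo = pd.DataFrame({
    "body": [
        "Please encrypt the patient records on the server",
        "The firewall rule blocks access to the portal",
        "Feature request for dashboard export",
        "Please add a dark mode feature",
    ],
    "tag_1": ["Security", "Security", "Feature", "Feature"],
    "tag_2": ["IT", "Tech Support", "IT", "Feedback"],
    "tag_3": [np.nan, np.nan, np.nan, np.nan],
})

# Collapse the tag columns into one list per ticket, skipping empty cells.
tag_cols = [col for col in demo.columns if col.startswith("tag_")]
tag_lists = [
    [tag for tag in row if pd.notna(tag)]
    for row in demo[tag_cols].itertuples(index=False)
]

# One binary indicator column per distinct tag.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(tag_lists)

# One independent binary classifier per tag.
tag_model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
tag_model.fit(demo["body"], Y)

predicted = binarizer.inverse_transform(
    tag_model.predict(["Encrypt the records behind the firewall"])
)
print(predicted)
```

Treating the tags as an unordered set discards any meaning carried by tag position, which may or may not matter for this dataset.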