In [1]:
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [2]:
# Load the first candidate dataset
raw_inc_df = pd.read_csv("../data/raw/incident_event_log.csv")

In [3]:
raw_inc_df.sample(n=5)

Unnamed: 0,number,incident_state,active,reassignment_count,reopen_count,sys_mod_count,made_sla,caller_id,opened_by,opened_at,sys_created_by,sys_created_at,sys_updated_by,sys_updated_at,contact_type,location,category,subcategory,u_symptom,cmdb_ci,impact,urgency,priority,assignment_group,assigned_to,knowledge,u_priority_confirmation,notify,problem_id,rfc,vendor,caused_by,closed_code,resolved_by,resolved_at,closed_at
15273,INC0003419,Resolved,True,2,1,22,True,Caller 448,Opened by 468,7/3/2016 11:58,Created by 188,7/3/2016 12:02,Updated by 96,31/3/2016 08:11,Phone,Location 143,Category 46,Subcategory 251,Symptom 218,?,2 - Medium,2 - Medium,3 - Moderate,Group 22,Resolver 133,True,True,Do Not Notify,?,?,?,?,code 6,Resolved by 119,5/4/2016 16:11,10/4/2016 16:59
65917,INC0015424,Active,True,0,0,4,True,Caller 5368,Opened by 17,5/4/2016 11:14,Created by 10,5/4/2016 11:32,Updated by 44,5/4/2016 12:12,Phone,Location 143,Category 46,Subcategory 174,Symptom 491,?,2 - Medium,2 - Medium,3 - Moderate,Group 39,Resolver 194,False,False,Do Not Notify,?,?,?,?,code 6,Resolved by 177,5/4/2016 17:13,10/4/2016 17:59
113661,INC0027551,Awaiting User Info,True,7,0,12,True,Caller 3279,Opened by 17,9/5/2016 14:59,?,?,Updated by 770,20/5/2016 16:47,Phone,Location 161,Category 53,Subcategory 174,Symptom 491,?,2 - Medium,2 - Medium,3 - Moderate,Group 70,?,False,False,Do Not Notify,?,?,?,?,code 6,Resolved by 16,19/12/2016 15:32,24/12/2016 16:00
10494,INC0002412,Resolved,True,1,0,4,True,Caller 4640,Opened by 301,4/3/2016 11:05,Created by 129,4/3/2016 11:21,Updated by 974,4/3/2016 11:44,Phone,Location 161,Category 20,Subcategory 125,Symptom 387,?,2 - Medium,2 - Medium,3 - Moderate,Group 24,Resolver 249,True,False,Do Not Notify,?,?,?,?,code 6,Resolved by 227,4/3/2016 11:44,9/3/2016 12:00
23609,INC0005234,Active,True,2,0,3,True,Caller 247,Opened by 59,10/3/2016 09:28,?,?,Updated by 135,14/3/2016 08:40,Phone,Location 242,Category 61,Subcategory 16,Symptom 102,?,2 - Medium,2 - Medium,3 - Moderate,?,?,True,False,Do Not Notify,?,?,?,?,code 6,Resolved by 180,23/3/2016 08:55,28/3/2016 08:59


The incident dataset has no free-text field, so it doesn't support our
learning objective of getting hands-on NLP experience. Let's try another dataset.
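
To make this check repeatable across candidate datasets, one option is to rank non-numeric columns by average string length, since free-text fields tend to be much longer than categorical codes. The sketch below is illustrative only: the helper name `text_column_candidates`, the 30-character threshold, and the toy frame are all assumptions, not part of the original notebook.

```python
import pandas as pd


def text_column_candidates(df: pd.DataFrame, min_mean_len: float = 30.0) -> pd.Series:
    """Return the mean string length of non-numeric columns at or above a threshold.

    Hypothetical helper; the default threshold is an arbitrary assumption.
    """
    non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    mean_len = pd.Series(
        {c: df[c].dropna().astype(str).str.len().mean() for c in non_numeric},
        dtype=float,
    )
    return mean_len[mean_len >= min_mean_len].sort_values(ascending=False)


# Toy frame standing in for a real dataset (values are made up).
toy_df = pd.DataFrame({
    "category": ["Category 46", "Category 53"],
    "description": [
        "My device crashed after the latest update and I lost my settings.",
        "I cannot log in to my account even after resetting the password.",
    ],
    "count": [2, 7],
})

candidates = text_column_candidates(toy_df)
print(candidates)
```

An empty result for a dataset would suggest it has no obvious free-text column, which is the situation described above for the incident log.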

In [4]:
# Load the second candidate dataset
raw_cust_df = pd.read_csv("../data/raw/customer_support_tickets.csv")

In [5]:
raw_cust_df.sample(n=5)

Unnamed: 0,Ticket ID,Customer Name,Customer Email,Customer Age,Customer Gender,Product Purchased,Date of Purchase,Ticket Type,Ticket Subject,Ticket Description,Ticket Status,Resolution,Ticket Priority,Ticket Channel,First Response Time,Time to Resolution,Customer Satisfaction Rating
7637,7638,Rebecca Garcia,djohnson@example.com,48,Male,Nest Thermostat,2020-01-10,Billing inquiry,Payment issue,I'm having an issue with the {product_purchase...,Pending Customer Response,,High,Phone,2023-06-01 02:19:02,,
7277,7278,Linda Evans,jessicagibson@example.net,70,Other,Xbox,2020-06-30,Cancellation request,Installation support,I'm having an issue with the {product_purchase...,Open,,Low,Phone,,,
7222,7223,Wendy Singleton,baileyjose@example.net,21,Male,Apple AirPods,2021-08-19,Billing inquiry,Software bug,I'm having an issue with the {product_purchase...,Closed,Push weight commercial whom.,Critical,Chat,2023-06-01 09:28:19,2023-06-01 15:27:19,2.0
8064,8065,Roger Smith,farrellandrea@example.com,44,Other,Google Nest,2021-05-16,Billing inquiry,Software bug,"My {product_purchased} crashed, and I lost all...",Pending Customer Response,,High,Chat,2023-06-01 21:28:07,,
784,785,Jeffrey Green,sellerstara@example.com,43,Male,Garmin Forerunner,2021-09-26,Technical issue,Product compatibility,I'm unable to access my {product_purchased} ac...,Pending Customer Response,,Low,Social media,2023-06-01 14:46:43,,


In [6]:
# Percentage of missing values per column
raw_cust_df.isna().mean() * 100

Ticket ID                        0.000000
Customer Name                    0.000000
Customer Email                   0.000000
Customer Age                     0.000000
Customer Gender                  0.000000
Product Purchased                0.000000
Date of Purchase                 0.000000
Ticket Type                      0.000000
Ticket Subject                   0.000000
Ticket Description               0.000000
Ticket Status                    0.000000
Resolution                      67.304286
Ticket Priority                  0.000000
Ticket Channel                   0.000000
First Response Time             33.286102
Time to Resolution              67.304286
Customer Satisfaction Rating    67.304286
dtype: float64

### High-Level Data Audit
- Structured fields: demographics (age, gender), product, priority, channel, type, etc.
- Semi-structured / unstructured text fields: subject, description, resolution.
- Target-like fields: time to resolution, satisfaction rating.
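- Missing values are concentrated in four fields: Resolution, Time to Resolution, and Customer Satisfaction Rating are each about 67% missing, and First Response Time is about 33% missing.
- The dataset suits multi-modal modeling that combines structured features with text embeddings.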
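
As a rough illustration of combining structured features with text, the sketch below builds one feature matrix from a text column, two categorical columns, and a numeric column. It uses TF-IDF as a simple stand-in for text embeddings, and the toy rows are made up; only the column names come from the dataset above. The choice of scikit-learn's `ColumnTransformer` is an assumption about tooling, not something the notebook has committed to.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy rows (made up) using column names from the customer support dataset.
toy_tickets = pd.DataFrame({
    "Ticket Description": [
        "I was charged twice for my thermostat",
        "The console will not install the latest update",
        "My earbuds stopped pairing after a software bug",
        "I cannot access my account on the watch",
    ],
    "Ticket Channel": ["Phone", "Chat", "Chat", "Social media"],
    "Ticket Priority": ["High", "Low", "Critical", "Low"],
    "Customer Age": [48, 70, 21, 43],
})

features = ColumnTransformer([
    # A single column name (not a list) passes a 1-D series, as TfidfVectorizer expects.
    ("text", TfidfVectorizer(), "Ticket Description"),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Ticket Channel", "Ticket Priority"]),
    ("num", "passthrough", ["Customer Age"]),
])

X = features.fit_transform(toy_tickets)
print(X.shape)  # one row per ticket; columns = vocabulary + one-hot levels + age
```

Swapping the TF-IDF step for a sentence-embedding model would keep the same overall structure.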

### Project Idea 1: Ticket Topic Modeling (BERTopic)
- Objective: Use BERTopic (or similar) to create meaningful topics from Ticket Description or Ticket Subject.
- Useful for automated tagging.
- Could help reduce manual ticket triage.
- Can visualize trends: which topics are most common, which have higher resolution delays or satisfaction problems.

### Project Idea 2: Agent Assignment / Routing Model
- Objective: Predict which agent (or team) should handle the ticket based on early ticket info.
- The dataset has no agent field, but routing labels could be simulated synthetically:
     - Build clusters of tickets using BERTopic.
     - Assign each cluster to a hypothetical specialized team.
     - Train classification models to route new tickets.
- **Business value: More efficient routing → faster resolutions → higher satisfaction.**
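
The synthetic-label workflow for this idea can be sketched end to end on toy data. This version substitutes TF-IDF plus k-means for BERTopic to keep the example small; the ticket texts, the team names, and the choice of two clusters are all hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up tickets standing in for the Ticket Description column.
tickets = [
    "I was charged twice for my subscription payment",
    "Refund for the duplicate payment has not arrived",
    "My billing statement shows an incorrect payment amount",
    "The app crashes every time I open the settings screen",
    "Software update failed and the device will not start",
    "The device screen freezes after the latest software update",
]

# Step 1: cluster the tickets (k-means here, as a stand-in for BERTopic).
X = TfidfVectorizer(stop_words="english").fit_transform(tickets)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: map each cluster to a hypothetical team to get synthetic labels.
team_names = {0: "Team A", 1: "Team B"}
synthetic_labels = [team_names[int(c)] for c in clusters]

# Step 3: train a classifier that routes new tickets to a team.
router = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
router.fit(tickets, synthetic_labels)

print(router.predict(["I was charged twice on my invoice"]))
```

One caveat with this design: the classifier only learns to reproduce the clustering, so its accuracy says nothing about how well real agents or teams would be matched.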

### Project Idea 3: Sentiment Analysis Augmentation
- Objective: Perform sentiment analysis on the Ticket Description field.
- Sentiment may correlate with satisfaction rating, resolution time, or escalation likelihood.
- Could serve as feature enrichment for any of the previous models.
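
To show what a sentiment feature could look like as a new column, here is a deliberately tiny lexicon-based scorer. The word lists and the function name `lexicon_sentiment` are made up for illustration; a real project would more likely use a pretrained sentiment model, which this sketch does not attempt to reproduce.

```python
import re

import pandas as pd

# Toy lexicons (assumptions for illustration only, not a validated word list).
POSITIVE = {"thanks", "great", "resolved", "happy"}
NEGATIVE = {"crashed", "lost", "unable", "issue", "frustrated"}


def lexicon_sentiment(text: str) -> float:
    """Return (positive - negative word count) / token count, or 0.0 for empty text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)


toy = pd.DataFrame({
    "Ticket Description": [
        "My device crashed and I lost all my data.",
        "Thanks, the update works great.",
    ]
})
toy["sentiment_score"] = toy["Ticket Description"].map(lexicon_sentiment)
print(toy)
```

The resulting `sentiment_score` column could be added alongside the structured features in any of the models above.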

In [7]:
# Load the third candidate dataset
raw_multi_df = pd.read_csv("../data/raw/aa_dataset-tickets-multi-lang-5-2-50-version.csv")

In [8]:
# Check the shape (rows, columns)
raw_multi_df.shape

(28587, 16)

In [9]:
raw_multi_df.sample(n=5)

Unnamed: 0,subject,body,answer,type,queue,priority,language,version,tag_1,tag_2,tag_3,tag_4,tag_5,tag_6,tag_7,tag_8
21767,Technical Support for System Glitches,"A significant problem has arisen, leading to i...",Please investigate the integration failures an...,Incident,Product Support,high,en,400,Bug,Performance,IT,Tech Support,,,,
26928,Probleme im Digitalem Marketing,Leistungsstufen sind unkonstant.,Das Problem der unkonstanten Leistungen im dig...,Incident,Technical Support,medium,de,400,Performance,Feedback,Bug,IT,,,,
22618,,I am requesting enhancements for data integrat...,We are responding to your request for enhancem...,Change,Product Support,high,en,400,Feature,Feedback,Performance,,,,,
27821,Enhancing Security Measures for Medical Data i...,"Customer Support, please provide detailed prac...",We recommend implementing practices for securi...,Request,Technical Support,medium,en,400,Security,IT,Tech Support,Feedback,,,,
24668,,The data analytics tool has been experiencing ...,We are reviewing the performance issue with th...,Incident,Technical Support,low,en,400,Performance,Bug,IT,Tech Support,,,,


In [10]:
# Percentage of missing values per column
raw_multi_df.isna().mean() * 100

subject     13.425683
body         0.000000
answer       0.024487
type         0.000000
queue        0.000000
priority     0.000000
language     0.000000
version      0.000000
tag_1        0.000000
tag_2        0.045475
tag_3        0.475741
tag_4       10.697170
tag_5       49.120229
tag_6       79.452199
tag_7       92.863889
tag_8       98.023577
dtype: float64

In [11]:
raw_multi_df["queue"].value_counts()

queue
Technical Support                  8362
Product Support                    5252
Customer Service                   4268
IT Support                         3433
Billing and Payments               2788
Returns and Exchanges              1437
Service Outages and Maintenance    1148
Sales and Pre-Sales                 918
Human Resources                     576
General Inquiry                     405
Name: count, dtype: int64

### High-Level Audit of Dataset
- Customer ticket (subject, body)
- Agent response (answer)
- Structured metadata (queue, priority, type, language, version)
- Multilingual content (the sample shows en and de)
- Tag labels (tag_1 through tag_8), increasingly sparse: tag_5 is about 49% missing and tag_8 about 98% missing

### Project Idea 1: Ticket Routing / Department Classification (Text Classification)
- Objective: Build a model that classifies the ticket into the correct queue based on subject + body.
- Use Case:
     - Automate ticket triage.
     - Reduce agent workload.
     - Speed up first response times.
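
A minimal baseline for queue classification might look like the sketch below. It fills missing subjects with an empty string (about 13% of subjects are missing in this dataset) before joining subject and body, and uses `class_weight="balanced"` because the queue counts above are uneven. The toy rows are made up, and TF-IDF with logistic regression is only one possible starting point.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up rows using the subject, body, and queue column names from the dataset.
toy_multi = pd.DataFrame({
    "subject": [
        "Invoice shows wrong amount",
        None,
        "Refund not received",
        "System glitch after update",
        None,
        "Integration failure",
    ],
    "body": [
        "My invoice lists a payment I did not make",
        "I was billed twice this month",
        "The refund for my order has not arrived",
        "The application crashes when loading data",
        "The analytics tool is very slow since the update",
        "The data integration stopped working yesterday",
    ],
    "queue": [
        "Billing and Payments", "Billing and Payments", "Billing and Payments",
        "Technical Support", "Technical Support", "Technical Support",
    ],
})

# Join subject and body, treating a missing subject as empty text.
text = toy_multi["subject"].fillna("") + " " + toy_multi["body"]

clf = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
clf.fit(text, toy_multi["queue"])

print(clf.predict(["I was charged the wrong amount on my invoice"]))
```

On the real data, a held-out split stratified by queue would be needed before reading anything into the scores.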

### Project Idea 2: Priority Prediction
- Objective: Predict priority based on ticket content and metadata.
- Use Case:
     - Auto-triage incoming tickets.
     - Proactively identify urgent issues.
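
Because priority is ordered, it may help to encode it as an ordered categorical and to record a majority-class baseline before training anything. The sketch below assumes the column holds only `low`, `medium`, and `high` (the values visible in the sample above, not verified against the full column) and uses made-up values.

```python
import pandas as pd

# Made-up priority values; the three levels are an assumption based on the sample.
priority = pd.Series(["high", "medium", "high", "medium", "low", "medium"])

# Encode as an ordered categorical so that low < medium < high.
priority_order = ["low", "medium", "high"]
priority_cat = pd.Categorical(priority, categories=priority_order, ordered=True)
priority_code = pd.Series(priority_cat.codes, index=priority.index)

# Majority-class baseline: the accuracy from always predicting the most common level.
majority_class = priority.mode().iloc[0]
baseline_accuracy = (priority == majority_class).mean()

print(priority_code.tolist())
print(majority_class, baseline_accuracy)
```

Any priority model would need to beat `baseline_accuracy` on held-out data to be worth using.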

### Project Idea 3: Multi-label Tag Prediction
- Objective: Predict tag_1 through tag_8 fields based on subject and body.
- Use Case:
     - Suggest tags to agents as they handle tickets.
     - Analyze most common issues.
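
For multi-label prediction, the eight tag columns first need to be collapsed into one label set per ticket and then turned into a binary indicator matrix. The sketch below does this with scikit-learn's `MultiLabelBinarizer` on made-up rows with three tag columns. It assumes the position of a tag (tag_1 versus tag_3) carries no meaning, which has not been checked against the real data.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up rows; the real dataset has tag_1 through tag_8.
toy_tags = pd.DataFrame({
    "tag_1": ["Bug", "Performance", "Feature"],
    "tag_2": ["Performance", "Feedback", "Feedback"],
    "tag_3": ["IT", "Bug", None],
})

tag_cols = [c for c in toy_tags.columns if c.startswith("tag_")]

# Collapse the tag columns into one list of non-missing tags per ticket.
tag_lists = [
    [t for t in row if pd.notna(t)]
    for row in toy_tags[tag_cols].itertuples(index=False)
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tag_lists)  # shape: (n_tickets, n_distinct_tags)

print(list(mlb.classes_))
print(Y)
```

The matrix `Y` can then serve as the target for a one-vs-rest classifier over the subject and body text.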