# 01 — Data Profiling & Operational Readiness  
**Project:** SupportOps  
**Phase:** 1 (Data Understanding & Preparation)

## 1. Purpose of This Notebook

This notebook profiles the raw support ticket data and assesses whether it can support operational decision-making.

Before building KPIs or dashboards, we answer:
- What does this data represent?
- Which fields are reliable for operational analysis?
- What questions can (and cannot) be answered honestly?

This phase ensures that downstream metrics are defensible and trustworthy.

## 2. Dataset Overview

**Dataset:** Multilingual Customer Support Tickets  
**Source:** Kaggle  
**File:** `dataset-tickets-multi-lang-4-20k.csv`

**Granularity:** One row represents one support ticket.

This dataset represents operational support telemetry rather than customer behavior or sentiment.

## 3. Load the Raw Data

The raw dataset is loaded without modification to confirm shape, schema, and successful ingestion.

In [1]:
import pandas as pd

df = pd.read_csv("../data/raw/dataset-tickets-multi-lang-4-20k.csv")

df.shape

(20000, 15)

## 4. Dataset Shape & Granularity

**Observed Results**
- Rows: 20,000  
- Columns: 15  

**Interpretation**

The dataset is at the ticket level, which is appropriate for:
- Ticket volume analysis
- Priority and queue distribution
- Operational workload visibility

The dataset does not represent event-level status changes over time.

In [2]:
df.columns

Index(['subject', 'body', 'answer', 'type', 'queue', 'priority', 'language',
       'tag_1', 'tag_2', 'tag_3', 'tag_4', 'tag_5', 'tag_6', 'tag_7', 'tag_8'],
      dtype='object')

## 5. Column Inventory

Each column is classified based on its role in operational analysis.

| Column Name | Classification | Operational Use |
|------------|----------------|------------------|
| subject | Text | Context only |
| body | Text | Context only |
| answer | Text | Not used for ops KPIs |
| type | Categorical | Ticket classification |
| queue | Categorical | Routing / workload |
| priority | Categorical (ordinal) | Prioritization |
| language | Categorical | Segmentation |
| tag_1 – tag_8 | Categorical | Issue patterns |

Text fields are not aggregated.  
Categorical fields are candidates for grouping and KPI construction.
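
One way to sanity-check the split between text and categorical columns is to compare how many distinct values each column holds. The sketch below is illustrative only: the toy rows, the `cardinality_profile` helper, and the `MAX_RATIO` cut-off are assumptions, not part of the notebook. On the full 20,000-row dataset a much lower ratio would be appropriate.

```python
import pandas as pd

# Toy rows for illustration only; these are not taken from the dataset.
toy = pd.DataFrame({
    "subject": ["Login fails", "Invoice question", "VPN down", "Refund status"],
    "queue": ["IT Support", "Billing and Payments", "IT Support", "Billing and Payments"],
    "priority": ["high", "low", "high", "medium"],
})


def cardinality_profile(frame: pd.DataFrame) -> pd.DataFrame:
    """Return distinct-value counts and their ratio to the row count, per column."""
    n_unique = frame.nunique()
    return pd.DataFrame({
        "n_unique": n_unique,
        "unique_ratio": n_unique / len(frame),
    })


# Hypothetical cut-off: columns whose values repeat often enough to group on.
MAX_RATIO = 0.75

profile = cardinality_profile(toy)
grouping_candidates = profile.index[profile["unique_ratio"] <= MAX_RATIO].tolist()

print(profile)
print(grouping_candidates)
```

Free-text columns such as `subject` tend to have a ratio near 1, while routing fields repeat heavily, which is what makes them usable as grouping keys.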

In [3]:
df.isnull().sum()

subject      1461
body            2
answer          4
type            0
queue           0
priority        0
language        0
tag_1           0
tag_2          46
tag_3          95
tag_4        1539
tag_5        6909
tag_6       12649
tag_7       16072
tag_8       18093
dtype: int64

## 6. Missing Data Assessment

Missing values are assessed to identify fields that may limit operational metrics.

**Observed Patterns**
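- Missing values occur only in:
  - Text fields: `subject` (1,461 rows, about 7%), `body` (2 rows), `answer` (4 rows)
  - Optional tag fields `tag_2` through `tag_8`, rising from 46 rows for `tag_2` to 18,093 rows (about 90%) for `tag_8`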
- Core operational fields have no missing values:
  - `type`
  - `queue`
  - `priority`
  - `language`

**Interpretation**

Because `type`, `queue`, `priority`, and `language` are complete, missingness does not affect core operational analysis.  
Tag sparsity is expected and will be handled explicitly in later phases.
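
The raw null counts are easier to compare as percentages of the 20,000 rows. The sketch below rebuilds the counts reported by `df.isnull().sum()` so it runs on its own; the 50% threshold (`SPARSE_THRESHOLD_PCT`) is a hypothetical cut-off chosen for illustration, not a rule from this project.

```python
import pandas as pd

# Null counts copied from the df.isnull().sum() output in this notebook.
null_counts = pd.Series({
    "subject": 1461, "body": 2, "answer": 4,
    "type": 0, "queue": 0, "priority": 0, "language": 0,
    "tag_1": 0, "tag_2": 46, "tag_3": 95, "tag_4": 1539,
    "tag_5": 6909, "tag_6": 12649, "tag_7": 16072, "tag_8": 18093,
})
n_rows = 20000  # from df.shape

# Percentage of rows missing per column.
missing_pct = (null_counts / n_rows * 100).round(2)

# Hypothetical threshold: flag columns that are more than half empty.
SPARSE_THRESHOLD_PCT = 50.0
sparse_columns = missing_pct[missing_pct > SPARSE_THRESHOLD_PCT].index.tolist()

print(missing_pct.sort_values(ascending=False))
print("Sparse columns:", sparse_columns)
```

With `df` in scope, `df.isnull().mean() * 100` should give the same percentages directly.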

In [4]:
print("Priority:")
print(df["priority"].value_counts(dropna=False))

print("\nQueue:")
print(df["queue"].value_counts(dropna=False))

print("\nType:")
print(df["type"].value_counts(dropna=False))

print("\nLanguage:")
print(df["language"].value_counts(dropna=False))

Priority:
priority
medium    8144
high      7801
low       4055
Name: count, dtype: int64

Queue:
queue
Technical Support                  5824
Product Support                    3708
Customer Service                   3152
IT Support                         2292
Billing and Payments               2086
Returns and Exchanges              1001
Service Outages and Maintenance     764
Sales and Pre-Sales                 572
Human Resources                     338
General Inquiry                     263
Name: count, dtype: int64

Type:
type
Incident    7978
Request     5763
Problem     4184
Change      2075
Name: count, dtype: int64

Language:
language
en    11923
de     8077
Name: count, dtype: int64


## 7. Categorical Field Inspection

Key categorical fields were inspected to confirm suitability for grouping, prioritization logic, and KPI construction.

### Priority Distribution

Ticket priorities fall into three categories with a clear ordinal structure (low < medium < high): medium accounts for 8,144 tickets (40.7%), high for 7,801 (39.0%), and low for 4,055 (20.3%).  
This confirms that `priority` can be safely used for workload prioritization and segmentation.
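
Because `priority` is stored as plain strings, its ordinal structure has to be declared before sorting or threshold comparisons behave as expected. The following is a minimal sketch, assuming the order low < medium < high; `PRIORITY_ORDER` and the four-value `sample` column are illustrative stand-ins for `df["priority"]`.

```python
import pandas as pd

# Counts copied from the value_counts() output in this notebook.
priority_counts = pd.Series({"medium": 8144, "high": 7801, "low": 4055})

# Assumed ordering; the CSV itself carries no ordering information.
PRIORITY_ORDER = ["low", "medium", "high"]
priority_dtype = pd.CategoricalDtype(categories=PRIORITY_ORDER, ordered=True)

# Toy column standing in for df["priority"].
sample = pd.Series(["high", "low", "medium", "high"]).astype(priority_dtype)

# Sorting now follows the ordinal order instead of alphabetical order.
sorted_sample = sample.sort_values().tolist()

# Ordered categoricals support threshold comparisons.
at_least_medium = int((sample >= "medium").sum())

# Share of tickets per priority, listed in ordinal order.
priority_share = (priority_counts / priority_counts.sum()).reindex(PRIORITY_ORDER).round(3)

print(sorted_sample)
print(at_least_medium)
print(priority_share)
```

In the notebook itself the same dtype could be applied with `df["priority"].astype(priority_dtype)`.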

### Queue Distribution

Ticket volume is concentrated in a few queues: Technical Support, Product Support, and Customer Service together account for 12,684 of 20,000 tickets (63.4%), followed by a long tail of lower-volume queues.

This confirms:
- Queue-level workload analysis is meaningful
- Operational focus can be directed toward high-volume queues
- Low-volume queues can be grouped or analyzed separately if needed
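
The concentration and the grouping option described above can be sketched as follows. The counts are copied from the queue output in this notebook; the 5% cut-off (`MIN_SHARE`) and the `"Other"` label are hypothetical choices for illustration only.

```python
import pandas as pd

# Counts copied from the queue value_counts() output (already sorted descending).
queue_counts = pd.Series({
    "Technical Support": 5824,
    "Product Support": 3708,
    "Customer Service": 3152,
    "IT Support": 2292,
    "Billing and Payments": 2086,
    "Returns and Exchanges": 1001,
    "Service Outages and Maintenance": 764,
    "Sales and Pre-Sales": 572,
    "Human Resources": 338,
    "General Inquiry": 263,
})

share = queue_counts / queue_counts.sum()
cumulative_share = share.cumsum()  # running share from the largest queue down

# Hypothetical cut-off: queues below 5% of volume are folded into "Other".
MIN_SHARE = 0.05
low_volume = share[share < MIN_SHARE].index

grouped = queue_counts.drop(low_volume)
grouped["Other"] = queue_counts[low_volume].sum()

print(cumulative_share.round(3))
print(grouped)
```

With `df` in scope, `queue_counts` would simply be `df["queue"].value_counts()`. Note that Returns and Exchanges sits just above the 5% line, so the grouping is sensitive to the chosen cut-off.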

### Ticket Type Distribution

Every ticket falls into one of four types: Incident (7,978; 39.9%), Request (5,763; 28.8%), Problem (4,184; 20.9%), and Change (2,075; 10.4%). No values are missing or unexpected, which indicates consistent classification.

This field is suitable for:
- High-level operational reporting
- Issue-type mix analysis

### Language Distribution

The dataset contains two languages:
- English (`en`): 11,923 tickets (59.6%)
- German (`de`): 8,077 tickets (40.4%)

Language will be treated as a segmentation dimension only.  
Text translation is not required for operational KPIs.
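
Using language as a segmentation dimension amounts to comparing the mix of another field within each language. A minimal sketch with `pd.crosstab` follows; the `toy` rows and the `segment_mix` helper are illustrative assumptions and do not reflect the real distribution.

```python
import pandas as pd

# Toy rows for illustration only; these are not taken from the dataset.
toy = pd.DataFrame({
    "language": ["en", "en", "en", "de", "de"],
    "priority": ["high", "low", "high", "medium", "high"],
})


def segment_mix(frame: pd.DataFrame, segment: str, field: str) -> pd.DataFrame:
    """Return the share of each `field` value within each `segment` value."""
    return pd.crosstab(frame[segment], frame[field], normalize="index")


mix = segment_mix(toy, "language", "priority")
print(mix)
```

Each row sums to 1, so segments of different sizes can be compared directly. In the notebook the call would be `segment_mix(df, "language", "priority")`.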

## 8. Tag Fields & Issue Classification

Tags are distributed across eight columns (`tag_1` through `tag_8`) with increasing sparsity: `tag_1` is always populated, while `tag_8` is empty in 18,093 of 20,000 rows (about 90%).

**Interpretation**
- Tags represent issue classification, not operational state
- Not all tickets require multiple tags
- Tags will be reshaped or aggregated in later phases

Tags support recurring issue detection and deflection analysis but are not required for all KPIs.
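
One possible way to reshape the tag columns is to melt them into a long table with one row per ticket and tag, so that tag frequency can be counted regardless of which slot a tag occupies. This is a sketch under assumptions: the `toy` rows and tag values are invented for illustration, and `tags_to_long` is a hypothetical helper, not the method later phases necessarily use.

```python
import pandas as pd

# Toy rows for illustration only; tag values are invented.
toy = pd.DataFrame({
    "queue": ["Technical Support", "Billing and Payments", "Technical Support"],
    "tag_1": ["Bug", "Billing", "Bug"],
    "tag_2": ["Outage", None, "Performance"],
    "tag_3": [None, None, "Bug"],
})


def tags_to_long(frame: pd.DataFrame, id_cols: list) -> pd.DataFrame:
    """Melt tag_* columns into one row per (ticket, tag), dropping empty slots."""
    tag_cols = [c for c in frame.columns if c.startswith("tag_")]
    long = (
        frame.rename_axis("ticket_idx")  # row position stands in for a ticket ID
        .reset_index()
        .melt(
            id_vars=["ticket_idx", *id_cols],
            value_vars=tag_cols,
            var_name="tag_slot",
            value_name="tag",
        )
        .dropna(subset=["tag"])
    )
    # Count each tag at most once per ticket, even if it repeats across slots.
    return long.drop_duplicates(subset=["ticket_idx", "tag"])


long_tags = tags_to_long(toy, id_cols=["queue"])
tag_counts = long_tags["tag"].value_counts()

print(long_tags)
print(tag_counts)
```

Dropping empty slots during the reshape handles tag sparsity explicitly instead of leaving `NaN` categories in downstream counts.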

## 9. Operational Readiness Assessment

### Supported Use Cases

Based on Phase 1 inspection, the dataset supports:
- Ticket volume analysis
- Priority-based workload segmentation
- Queue and routing distribution
- Language-based segmentation
- Issue pattern identification via tags

These directly support the SupportOps deliverables:
1. Prioritization
2. Backlog control (volume-based)
3. Operational visibility

### Unsupported or Limited Use Cases

This dataset does not support:
- True SLA compliance tracking (no timestamps)
- Backlog aging by time (no created/resolved dates)
- Agent-level productivity analysis (no agent identifiers)

These metrics will be explicitly excluded to avoid misleading conclusions.
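
The exclusions above follow directly from the schema, and that reasoning can be made explicit as a small feasibility check. In this sketch the available columns come from the `df.columns` output, while the metric names and required column names (`created_at`, `first_response_at`, `resolved_at`, `agent_id`) are hypothetical placeholders for whatever a suitable dataset would provide.

```python
# Columns present in the dataset, from the df.columns output in this notebook.
available = {
    "subject", "body", "answer", "type", "queue", "priority", "language",
    "tag_1", "tag_2", "tag_3", "tag_4", "tag_5", "tag_6", "tag_7", "tag_8",
}

# Hypothetical column requirements per metric; names are illustrative only.
REQUIREMENTS = {
    "Ticket volume by queue": {"queue"},
    "Priority segmentation": {"priority"},
    "SLA compliance": {"created_at", "first_response_at"},
    "Backlog aging": {"created_at", "resolved_at"},
    "Agent productivity": {"agent_id"},
}


def missing_fields(columns: set, requirements: dict) -> dict:
    """Map each metric to the sorted list of required columns that are absent."""
    return {
        metric: sorted(needed - columns)
        for metric, needed in requirements.items()
    }


gaps = missing_fields(available, REQUIREMENTS)
supported = [metric for metric, missing in gaps.items() if not missing]
unsupported = [metric for metric, missing in gaps.items() if missing]

print("Supported:", supported)
print("Unsupported:", unsupported)
```

Keeping the requirements in one mapping makes it straightforward to re-run the check if a richer export with timestamps or agent identifiers becomes available.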

## 10. Phase 1 Conclusion

The dataset is operationally viable for SupportOps.

Core routing and prioritization fields are complete and suitable for KPI reporting, SQL analytics, and Power BI dashboards.

Time-based SLA and aging metrics will not be claimed due to data limitations.

**Next Phase:**  
Phase 2 will focus on light standardization and exploratory analysis to surface operational patterns that inform KPI design.