### Task 1: Build a raw log dataset

Write code that generates a list of dictionaries representing support tickets. Each dictionary should include the fields described in the setup. Include at least 200 entries so that summaries are meaningful. Introduce realistic variation, such as a few categories that appear more frequently and occasional missing or malformed `resolution_minutes` values to simulate dirty data.

You are expected to write the generator logic yourself. Keep it readable and explain the logic in short markdown notes where necessary. After generating the list, print the first five entries and the total count to validate the structure.

In [2]:
categories = [
    "billing",
    "technical",
    "account",
    "delivery",
    "password_reset",
    "login_issue",
    "payment_failure",
    "refund_request",
    "subscription_cancel",
    "subscription_upgrade",
    "shipping_delay",
    "damaged_product",
    "product_inquiry",
    "feature_request",
    "bug_report",
    "account_suspension",
    "security_issue",
    "verification_problem",
    "installation_help",
    "connectivity_issue",
    "performance_issue",
    "data_loss",
    "complaint",
    "general_question",
    ""
]

In [3]:
len(categories)

25

In [4]:
import random

In [5]:
def generate_tickets(n):
    for i in range(1, n+1):
        resolution = random.choice([random.randint(5, 120), float('nan')])
        escalated_status = random.choice([random.randint(0, 6), float('nan')])
        yield {
            "ticked_id": i,
            "customer_id": i + 200,
            "category": random.choice(categories),
            "resolution_minutes": resolution,
            "escalated": escalated_status
        } 

In [6]:
tickets = generate_tickets(200) 
tickets = list(tickets)
print(tickets) 

[{'ticked_id': 1, 'customer_id': 201, 'category': 'account', 'resolution_minutes': nan, 'escalated': nan}, {'ticked_id': 2, 'customer_id': 202, 'category': 'connectivity_issue', 'resolution_minutes': 6, 'escalated': 4}, {'ticked_id': 3, 'customer_id': 203, 'category': 'delivery', 'resolution_minutes': nan, 'escalated': 5}, {'ticked_id': 4, 'customer_id': 204, 'category': 'complaint', 'resolution_minutes': nan, 'escalated': 2}, {'ticked_id': 5, 'customer_id': 205, 'category': 'data_loss', 'resolution_minutes': nan, 'escalated': 3}, {'ticked_id': 6, 'customer_id': 206, 'category': 'data_loss', 'resolution_minutes': nan, 'escalated': 3}, {'ticked_id': 7, 'customer_id': 207, 'category': 'shipping_delay', 'resolution_minutes': nan, 'escalated': 1}, {'ticked_id': 8, 'customer_id': 208, 'category': 'login_issue', 'resolution_minutes': nan, 'escalated': nan}, {'ticked_id': 9, 'customer_id': 209, 'category': 'complaint', 'resolution_minutes': 9, 'escalated': nan}, {'ticked_id': 10, 'customer_id

#### there i created a generated randomly chooses values within ranges and list. i also add some nan as required in the statement. 

In [7]:
tickets[0:5]

[{'ticked_id': 1,
  'customer_id': 201,
  'category': 'account',
  'resolution_minutes': nan,
  'escalated': nan},
 {'ticked_id': 2,
  'customer_id': 202,
  'category': 'connectivity_issue',
  'resolution_minutes': 6,
  'escalated': 4},
 {'ticked_id': 3,
  'customer_id': 203,
  'category': 'delivery',
  'resolution_minutes': nan,
  'escalated': 5},
 {'ticked_id': 4,
  'customer_id': 204,
  'category': 'complaint',
  'resolution_minutes': nan,
  'escalated': 2},
 {'ticked_id': 5,
  'customer_id': 205,
  'category': 'data_loss',
  'resolution_minutes': nan,
  'escalated': 3}]

### Task 2: Design validation helpers 

Create small functions that validate the dataset. For example, write one function that checks whether all required keys are present in each record, and another function that identifies records with missing or invalid `resolution_minutes`. These functions should return clear results such as a list of bad records or counts of issues.

Keep function signatures simple and explicit. For instance, a validation function should take the list of records as input and return a list of indices or a filtered list. Avoid printing inside these functions; return values instead so you can reuse them in other contexts.

In [8]:
key_list = ['ticked_id','customer_id','category','resolution_minutes','escalated']

In [11]:
def check_keys(tickets):
    n=0
    for dictionary in tickets:
        if set(dictionary.keys())!=set(key_list):
            n=n+1
            return n
    return "All good" 

# from m1_02_summary_functions import check_keys

result = check_keys(tickets)
print(result)

All good


In [15]:
import math

In [16]:
def invalid_records(tickets):
    m=0
    for i in tickets:
        values = i.values()
        if i["resolution_minutes"] is None or (isinstance(i["resolution_minutes"], float) and math.isnan(i["resolution_minutes"])):
            m=m+1
    return m 

# from m1_02_summary_functions import invalid_records

result = invalid_records(tickets)
print(result)

96


### Task 3: Clean and normalize records

Write a function that takes the raw records and returns a cleaned version. At minimum, it should handle missing `resolution_minutes` values in a defined way and normalize `category` strings (such as trimming whitespace and standardizing case). If you introduced malformed values, decide whether to drop those records or repair them, and document the decision in a short markdown cell.

Use list comprehensions or loops to build the cleaned list. Avoid mutating the original list in place. At the end, show the number of records before and after cleaning, and display a few cleaned records.

In [17]:
sum_=0
n=0
for i in tickets:
    resolution_minutes_ = i["resolution_minutes"]
    if str(resolution_minutes_).isdigit():
        sum_=sum_ + resolution_minutes_
        n=n+1

print(sum_, n) 

5941 104


In [18]:
import math

def clean_tickets(tickets):
    cleaned = []
    for record in tickets:
        if not record.get('ticked_id') or not record.get('customer_id'):
            continue
        category = record['category'].strip().lower()
        resolution = record['resolution_minutes']
        if isinstance(resolution, float) and math.isnan(resolution):
            # replace with mean or flag as needed
            continue
        cleaned.append({
            **record, 'category': category
        })
    return cleaned

#### # i created a function that cleans records which cstomer or ticket id is empty, replace invalid values in resolution minutes with average of it for avoiding data losing

### Task 4: Build summary functions

Create functions that compute useful summaries from the cleaned data. At a minimum, include:

1. Average resolution time per category
2. Count of tickets per customer
3. Escalation rate overall and by category

Use dictionaries to store summary results, with clear keys and values. For example, the average resolution time per category should be a dictionary mapping category name to average minutes. Your functions should return these dictionaries rather than printing them directly.

After computing each summary, write a small validation check. For example, confirm that the sum of category counts matches the total number of cleaned records. These checks are essential for catching logic errors early.

In [32]:
cat_count = {} 
cat_res_min_sum = {}
cat_avg = {}

for i in tickets:
    category = i["category"]
    if category in cat_count:
        cat_count[category] = cat_count[category] + 1 
    else:
        cat_count[category] = 1 

for i in tickets:
    category = i["category"] 
    if category in cat_res_min_sum:
        cat_res_min_sum[category] = cat_res_min_sum[category] + i["resolution_minutes"] 
    else:
        cat_res_min_sum[category] = i["resolution_minutes"] 

import math

def avg_resolution_per_category(tickets):
    agg = {}
    counts = {}

    for t in tickets:
        cat = t['category']
        val = t['resolution_minutes']

        # skip NaN
        if val is None or (isinstance(val, float) and math.isnan(val)):
            continue

        agg[cat] = agg.get(cat, 0) + val
        counts[cat] = counts.get(cat, 0) + 1

    return {k: agg[k]/counts[k] for k in agg} 

print(avg_resolution_per_category(tickets)) 

{'connectivity_issue': 56.333333333333336, 'complaint': 19.5, 'login_issue': 66.2, 'account': 37.5, 'technical': 58.57142857142857, 'refund_request': 89.0, 'subscription_cancel': 53.4, 'general_question': 68.0, 'bug_report': 60.2, 'delivery': 25.333333333333332, 'performance_issue': 54.833333333333336, 'billing': 25.5, 'damaged_product': 53.5, 'data_loss': 69.5, 'feature_request': 33.333333333333336, 'installation_help': 33.166666666666664, 'password_reset': 55.666666666666664, 'payment_failure': 41.75, 'shipping_delay': 78.5, 'security_issue': 49.5, 'verification_problem': 82.14285714285714, 'product_inquiry': 65.0, 'subscription_upgrade': 43.75, '': 85.0, 'account_suspension': 72.75}


In [21]:
cust_count = {}

def func_cust(tickets):

    for i in tickets:
        customer_id = i["customer_id"]
        if customer_id in cust_count:
            cust_count[customer_id] = cust_count[customer_id] + 1 
        else:
            cust_count[customer_id] = 1 
    return cust_count
print(func_cust(tickets)) 

{201: 1, 202: 1, 203: 1, 204: 1, 205: 1, 206: 1, 207: 1, 208: 1, 209: 1, 210: 1, 211: 1, 212: 1, 213: 1, 214: 1, 215: 1, 216: 1, 217: 1, 218: 1, 219: 1, 220: 1, 221: 1, 222: 1, 223: 1, 224: 1, 225: 1, 226: 1, 227: 1, 228: 1, 229: 1, 230: 1, 231: 1, 232: 1, 233: 1, 234: 1, 235: 1, 236: 1, 237: 1, 238: 1, 239: 1, 240: 1, 241: 1, 242: 1, 243: 1, 244: 1, 245: 1, 246: 1, 247: 1, 248: 1, 249: 1, 250: 1, 251: 1, 252: 1, 253: 1, 254: 1, 255: 1, 256: 1, 257: 1, 258: 1, 259: 1, 260: 1, 261: 1, 262: 1, 263: 1, 264: 1, 265: 1, 266: 1, 267: 1, 268: 1, 269: 1, 270: 1, 271: 1, 272: 1, 273: 1, 274: 1, 275: 1, 276: 1, 277: 1, 278: 1, 279: 1, 280: 1, 281: 1, 282: 1, 283: 1, 284: 1, 285: 1, 286: 1, 287: 1, 288: 1, 289: 1, 290: 1, 291: 1, 292: 1, 293: 1, 294: 1, 295: 1, 296: 1, 297: 1, 298: 1, 299: 1, 300: 1, 301: 1, 302: 1, 303: 1, 304: 1, 305: 1, 306: 1, 307: 1, 308: 1, 309: 1, 310: 1, 311: 1, 312: 1, 313: 1, 314: 1, 315: 1, 316: 1, 317: 1, 318: 1, 319: 1, 320: 1, 321: 1, 322: 1, 323: 1, 324: 1, 325: 1,

In [28]:
def func_esc(tickets):
    esc_rate_cat = {}
    total_cat = {}

    for i in tickets:
        category = i["category"]

        if category in total_cat:
            total_cat[category] += 1
        else:
            total_cat[category] = 1

        if i["escalated"]:
            if category in esc_rate_cat:
                esc_rate_cat[category] += 1
            else:
                esc_rate_cat[category] = 1

    for category in total_cat:
        esc_rate_cat[category] = esc_rate_cat.get(category, 0) / total_cat[category]

    return esc_rate_cat

In [29]:
print(func_esc(tickets))

{'account': 1.0, 'connectivity_issue': 1.0, 'delivery': 0.875, 'complaint': 1.0, 'data_loss': 0.875, 'shipping_delay': 1.0, 'login_issue': 0.8888888888888888, 'general_question': 1.0, 'feature_request': 1.0, 'technical': 1.0, 'subscription_cancel': 1.0, 'bug_report': 1.0, 'billing': 1.0, 'performance_issue': 1.0, 'payment_failure': 0.875, 'password_reset': 1.0, '': 1.0, 'account_suspension': 1.0, 'installation_help': 1.0, 'damaged_product': 0.8, 'verification_problem': 1.0, 'security_issue': 0.75, 'product_inquiry': 0.8, 'subscription_upgrade': 0.8571428571428571, 'refund_request': 0.875}


### Task 5: Package a final report

Write a function that combines the outputs of your summaries into a single report structure. This might be a dictionary that contains other dictionaries. The goal is to provide a single object that could be serialized or used by another part of a pipeline.

In a final notebook cell, print a compact report and add a short text explanation of one insight you observed. Keep the report readable and avoid overly verbose output.

In [33]:
# def report_dict(tickets):
#     rep_dict = {}
#     rep_dict["first"] = cat_count
#     rep_dict["second"] = cat_res_min_sum
#     rep_dict["third"] = cat_avg
#     rep_dict["forth"] = esc_rate_cat
#     rep_dict["fifth"] = cust_count
#     return rep_dict
# print(report_dict(tickets)) 

def final_report(summary1, summary2, summary3):
    return {
        'avg_resolution_per_category': avg_resolution_per_category(tickets),
        'ticket_count_per_customer': func_cust(tickets),
        'escalation_rate': func_esc(tickets),
    } 

print(final_report(avg_resolution_per_category(tickets), func_cust(tickets), func_esc(tickets)))

# from m1_02_summary_functions import report_dict

# result = report_dict(tickets)
# print(result) 

{'avg_resolution_per_category': {'connectivity_issue': 56.333333333333336, 'complaint': 19.5, 'login_issue': 66.2, 'account': 37.5, 'technical': 58.57142857142857, 'refund_request': 89.0, 'subscription_cancel': 53.4, 'general_question': 68.0, 'bug_report': 60.2, 'delivery': 25.333333333333332, 'performance_issue': 54.833333333333336, 'billing': 25.5, 'damaged_product': 53.5, 'data_loss': 69.5, 'feature_request': 33.333333333333336, 'installation_help': 33.166666666666664, 'password_reset': 55.666666666666664, 'payment_failure': 41.75, 'shipping_delay': 78.5, 'security_issue': 49.5, 'verification_problem': 82.14285714285714, 'product_inquiry': 65.0, 'subscription_upgrade': 43.75, '': 85.0, 'account_suspension': 72.75}, 'ticket_count_per_customer': {201: 5, 202: 5, 203: 5, 204: 5, 205: 5, 206: 5, 207: 5, 208: 5, 209: 5, 210: 5, 211: 5, 212: 5, 213: 5, 214: 5, 215: 5, 216: 5, 217: 5, 218: 5, 219: 5, 220: 5, 221: 5, 222: 5, 223: 5, 224: 5, 225: 5, 226: 5, 227: 5, 228: 5, 229: 5, 230: 5, 23

#### highest res min by category is 89.0 and the largest escalation rate is 1