# Module 2 — Lab 2: Customer Data Quality Engine

In Lab 2: __Customer Data Quality Engine__, you will build a small Python program that checks the quality of customer address records and generates a simple text-based report. The lab is designed to strengthen your __algorithm__, __data structure__, and __class/function__ design skills, using a realistic “data engineering” style problem: validating and maintaining customer data.

**Instructions:**

Read the instructions for each step carefully so that you understand the task before you start coding.


**You will learn:**
1. Core data structures in practice: list, dict, set, deque, heap

2. Core algorithms in practice: scanning, grouping, sorting, top-K selection, event processing

3. Class + function design: building clean reusable methods instead of messy scripts

4. Data engineering mindset: validating data, building indexes, generating quality reports, and handling updates like a pipeline


By the end of the lab, you will have a working “quality engine” that can detect missing information, prioritize the worst records, and update customer data efficiently—using fundamental CS skills that appear in real data systems.

**Prerequisites:**

1. Python 3.10+

2. Jupyter Notebook / JupyterLab (or VS Code notebooks)

## Table of Contents
- [1 - Packages](#1)
- [2 - Overview of the Problem set](#2)
    - [Exercise 1](#ex-1)
    - [Exercise 2](#ex-2)
    - [Exercise 3](#ex-3)
    - [Exercise 4](#ex-4)
    - [Exercise 5](#ex-5)
    - [Exercise 6](#ex-6)

<a name='1'></a>
## 1 - Packages ##

First, let's run the cell below to import all the packages that you will need during this assignment. 
- [numpy](https://numpy.org/doc/1.20/) is the fundamental package for scientific computing with Python.
- [pandas](https://pandas.pydata.org/) is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

You may import additional libraries if you need them.

In [30]:
import numpy as np
import pandas as pd
from dataclasses import dataclass, field

<a name='2'></a>
## 2 - Overview of the Problem set ##

**Problem Statement**: This dataset is a customer information table (a “customer master” file). Each row describes a customer’s basic identity and contact details, such as customer ID and name, address fields (street, city, state, postcode), and sometimes tax information. It also includes simple business-related fields like how many units the customer purchased and a loyalty segment label.

You will build two classes for data storage and dataset manipulation.

Let's go through the tasks one by one.

<a name='ex-1'></a>
### Exercise 1 
Create a __Customer__ class (or `@dataclass`) whose attributes match the dataset columns:
[state, city, postcode, street, number, unit, region, district]

Add three attributes:
1. __address_update_needed__: whether the customer needs to update their address.
2. __missing_fields__: the fields that are missing for this customer.
3. __completeness_score__: a score indicating how complete the record is.

Each of these attributes should have an initial value.
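The point about initial values can be illustrated with a minimal sketch. It uses a hypothetical toy class named `Contact` (not the lab's __Customer__), and the field names `contact_id`, `email`, and `phone` are assumptions made only for this example. It shows one common way to give a list attribute a safe default in a dataclass.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Contact:  # hypothetical toy class, not the lab's Customer
    contact_id: str
    email: Optional[str] = None
    phone: Optional[str] = None
    # Derived attributes start with explicit initial values.
    update_needed: bool = False
    # default_factory builds a fresh list for every instance;
    # a bare `= []` default is rejected by dataclasses.
    missing_fields: list = field(default_factory=list)
    completeness_score: float = 0.0


a = Contact("c1", email="a@example.com")
b = Contact("c2")
a.missing_fields.append("phone")
print(a.missing_fields)  # ['phone']
print(b.missing_fields)  # [] because each instance owns its own list
```

The same pattern applies to any mutable default (list, dict, set) you decide to store on a record.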

In [36]:
# Your code starts here

<a name='ex-2'></a>
### Exercise 2
Build a class __CustomerQualityEngine__ that includes:
1. A list containing all customers.
2. A dictionary that maps each __customer_id__ to its __Customer__ object, plus any helper function needed to maintain it.
3. A function __validate_all__ that updates the __completeness_score__ of every customer.
4. A function __load_from_csv__ that imports all data from a CSV file into the engine.
5. A function __get_missing_fields__ that updates the __missing_fields__ attribute of a __Customer__ object.
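The ideas in this exercise (loading rows, indexing them by id, finding missing fields, and scoring completeness) can be sketched on plain dictionaries. This is not the required class design: the helper names `find_missing` and `score`, the `REQUIRED` field subset, the inline CSV text, and the definition of the score as "fraction of required fields filled" are all assumptions made for illustration.

```python
import csv
import io

REQUIRED = ["street", "city", "state", "postcode"]  # assumed subset of fields


def find_missing(record):
    """Return the required fields that are absent or blank in one record."""
    return [f for f in REQUIRED if not (record.get(f) or "").strip()]


def score(record):
    """Assumed definition: fraction of required fields that are filled."""
    return 1 - len(find_missing(record)) / len(REQUIRED)


# Inline CSV text stands in for a real file so the sketch is self-contained.
raw = io.StringIO(
    "customer_id,street,city,state,postcode\n"
    "C1,1 Main St,Austin,TX,73301\n"
    "C2,,Boston,MA,\n"
)
records = list(csv.DictReader(raw))

# The index gives O(1) lookup by id instead of scanning the list.
id_index = {r["customer_id"]: r for r in records}

print(find_missing(id_index["C2"]))  # ['street', 'postcode']
print(score(id_index["C2"]))         # 0.5
```

If you load the file with pandas instead, remember that blank cells arrive as `NaN` rather than empty strings, so the "is this missing?" check needs to handle both.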

In [None]:
# Your code starts here

<a name='ex-3'></a>
### Exercise 3
1. Build a function __count_missing_fields__ that counts how often each field is missing across all customers.
2. Build a function __missing_by_city__ that counts how many customers in each city need to update their address.
3. Build a function __group_by_state__ that collects, for each state, the __customer_id__ of every customer who needs to update their address.
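All three tasks are single-pass counting or grouping problems. One possible approach, sketched here on hypothetical toy tuples rather than real __Customer__ objects, uses `collections.Counter` for tallies and `collections.defaultdict(list)` for groups. The rule "any missing field means an update is needed" is an assumption of this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows: (customer_id, city, state, missing_fields)
rows = [
    ("C1", "Austin", "TX", []),
    ("C2", "Boston", "MA", ["street", "postcode"]),
    ("C3", "Austin", "TX", ["postcode"]),
    ("C4", "Dallas", "TX", ["unit"]),
]

field_counts = Counter()         # field name -> times it is missing
city_counts = Counter()          # city -> customers needing an update
ids_by_state = defaultdict(list) # state -> ids needing an update

for customer_id, city, state, missing in rows:
    field_counts.update(missing)  # adds 1 for each listed field
    if missing:                   # assumed rule: any gap means "update needed"
        city_counts[city] += 1
        ids_by_state[state].append(customer_id)

print(field_counts["postcode"])  # 2
print(dict(ids_by_state))        # {'MA': ['C2'], 'TX': ['C3', 'C4']}
```

A single loop fills all three structures, so the data is scanned once regardless of how many summaries you build.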

In [None]:
# Your code starts here

<a name='ex-4'></a>
### Exercise 4
Build a function that returns a list of (city, count) pairs for the top-k cities with the most customers needing an address update.

Tip: you can use `heapq.nlargest`.
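As a minimal sketch of the tip above, `heapq.nlargest` can select the k largest entries from a dictionary of counts without sorting everything. The function name `top_k` and the sample counts are hypothetical.

```python
import heapq

# Hypothetical counts: city -> customers needing an address update
city_counts = {"Austin": 4, "Boston": 9, "Dallas": 2, "Denver": 7}


def top_k(counts, k):
    """Return the k (city, count) pairs with the highest counts, largest first."""
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])


print(top_k(city_counts, 2))  # [('Boston', 9), ('Denver', 7)]
```

For n cities this takes roughly O(n log k) time, which beats a full sort when k is much smaller than n.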

In [None]:
# Your code starts here

<a name='ex-5'></a>
### Exercise 5
In this part, you will build two functions that update the dataset when a customer changes their address information.
1. Build a function __enqueue_update__ that records an update event. Every update needs the __customer_id__, the field name, and the new value.
2. Build a function __process_updates__ that processes the queued update events, applies each one to the customer found via `id_index`, and re-validates only that customer.
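The queue-then-process pattern described here can be sketched with `collections.deque`, which supports fast appends and pops from either end. In this example the records are plain dictionaries, and `pending`, `validate_one`, `REQUIRED`, and the `validated` log are hypothetical names; skipping unknown ids silently is also an assumption, and you may prefer to log or raise instead.

```python
from collections import deque

REQUIRED = ["street", "city", "postcode"]  # assumed fields to validate

# Hypothetical stand-ins for the engine's id_index and update queue.
id_index = {"C1": {"street": "1 Main St", "city": "Austin", "postcode": ""}}
pending = deque()
validated = []  # records which ids were re-validated, for illustration only


def validate_one(customer_id):
    """Recompute missing fields for a single record."""
    record = id_index[customer_id]
    record["missing_fields"] = [f for f in REQUIRED if not record.get(f)]
    validated.append(customer_id)


def enqueue_update(customer_id, field_name, new_value):
    """Store the update event; nothing is applied yet."""
    pending.append((customer_id, field_name, new_value))


def process_updates():
    """Apply events in arrival order and re-validate only touched records."""
    while pending:
        customer_id, field_name, new_value = pending.popleft()  # FIFO order
        record = id_index.get(customer_id)
        if record is None:
            continue  # assumption: unknown ids are skipped
        record[field_name] = new_value
        validate_one(customer_id)


enqueue_update("C1", "postcode", "73301")
enqueue_update("C9", "city", "Nowhere")  # id not in the index
process_updates()
print(id_index["C1"]["missing_fields"])  # []
print(validated)                         # ['C1']
```

Because each event is looked up through the index and only that record is re-validated, the cost of processing grows with the number of updates rather than the size of the dataset.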

In [None]:
# Your code starts here

<a name='ex-6'></a>
### Exercise 6 (Optional)
Now create a report that presents your findings. Generate it as plain text using code. Feel free to design it however you like. Enjoy coding!
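If you are unsure where to start, one possible layout is sketched below. The function name `build_report`, its parameters, and the sample numbers are hypothetical. The sketch only demonstrates assembling lines with f-string alignment and joining them into a single string.

```python
def build_report(total, field_counts, top_cities):
    """Assemble a plain-text report from precomputed summaries."""
    lines = [
        "CUSTOMER DATA QUALITY REPORT",
        "=" * 28,
        f"Customers checked: {total}",
        "",
        "Missing fields:",
    ]
    # Most frequently missing fields first.
    for name, count in sorted(field_counts.items(), key=lambda kv: -kv[1]):
        lines.append(f"  {name:<10}{count:>5}")  # left-align name, right-align count
    lines.append("")
    lines.append("Top cities needing updates:")
    for city, count in top_cities:
        lines.append(f"  {city:<10}{count:>5}")
    return "\n".join(lines)


# Hypothetical sample inputs
report = build_report(
    total=4,
    field_counts={"street": 1, "postcode": 2},
    top_cities=[("Boston", 9), ("Denver", 7)],
)
print(report)
```

Returning the string instead of printing inside the function makes it easy to write the same report to a file later.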

In [None]:
# Your code starts here