# Module 1 — Lab 1 (Python Only): Build a Customer Data Cleaning Pipeline

Welcome to your first programming lab! You will build a notebook that reads customers.csv, performs data profiling and quality checks, cleans key fields, builds a current customer snapshot (one row per customer), then exports:
1. reports/quality_report.md
2. data/processed/customers_current.csv

**Instructions:**

Please read the instructions for each step carefully before you start the task.


**You will learn:**
1. Data ingestion: load a real CSV dataset correctly and inspect its schema (columns, types, size).

2. Data profiling: quickly understand the dataset using summaries (missing values, duplicates, distributions).

3. Data quality thinking: define simple validation rules (e.g., required fields, valid ranges, allowed categories) and measure how many records violate them.

4. Data cleaning & standardization: fix messy real-world fields (timestamps, postcode/ZIP formats, tax IDs, whitespace) into consistent formats.

5. Record consolidation: handle duplicate keys by building a current customer snapshot (one row per customer using “latest/current” logic).

6. Data publishing: export cleaned outputs and produce a short quality report—a mini version of a production pipeline deliverable.

**Prerequisites:**

1. Python 3.10+

2. Jupyter Notebook / JupyterLab (or VS Code notebooks)

## Table of Contents
- [1 - Packages](#1)
- [2 - Overview of the Problem set](#2)
    - [Exercise 1](#ex-1)
    - [Exercise 2](#ex-2)
- [3 - Missing and Duplicate samples](#3)
    - [Exercise 3](#ex-3)
- [4 - Data Quality](#4)
    - [Exercise 4](#ex-4)
    - [Exercise 5](#ex-5)
- [5 - Customer Object Builder (Optional)](#5)
    - [Exercise 6](#ex-6)
    - [Exercise 7](#ex-7)

<a name='1'></a>
## 1 - Packages ##

First, let's run the cell below to import all the packages that you will need during this assignment. 
- [numpy](https://numpy.org/doc/1.20/) is the fundamental package for scientific computing with Python.
- [pandas](https://pandas.pydata.org/) is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

In [None]:
import numpy as np
import pandas as pd

<a name='2'></a>
## 2 - Overview of the Problem set ##

**Problem Statement**: This dataset is a customer information table (a “customer master” file). Each row describes a customer’s basic identity and contact details, such as customer ID and name, address fields (street, city, state, postcode), and sometimes tax information. It also includes simple business-related fields like how many units the customer purchased and a loyalty segment label.

You will load and explore this dataset, practicing the core “data engineering basics” workflow in a Python notebook.

Let's get more familiar with the dataset.

<a name='ex-1'></a>
### Exercise 1 
Load data and check dataset size:

You will first load the dataset "customers.csv", then check its shape and columns.
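
As a starting point, here is a minimal sketch of loading a CSV and inspecting its size. The inline sample rows are hypothetical and stand in for the real file so that the snippet runs on its own; in the lab you would pass the path to `customers.csv` instead.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for customers.csv.
sample_csv = io.StringIO(
    "customer_id,customer_name,city,units_purchased\n"
    "1001,Alice Smith,Boston,12\n"
    "1002,Bob Jones,,5\n"
    "1002,Bob Jones,Denver,7\n"
)

# In the lab, read the real file instead: pd.read_csv("customers.csv")
df = pd.read_csv(sample_csv)

n_rows, n_cols = df.shape          # shape is a (rows, columns) tuple
print(f"{n_rows} rows x {n_cols} columns")
print(df.columns.tolist())         # column names in file order
```

If identifier-like columns such as postcodes lose leading zeros on load, passing `dtype=str` for those columns to `pd.read_csv` is one way to keep them as text.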

In [None]:
# Your Code Starts Here

<a name='ex-2'></a>
### Exercise 2
Profiling and basic quality checks:
1. Check the first 10 samples of the dataset.
2. Check the information of the dataset, including columns and their datatypes, as well as non-null values count.
3. Generate descriptive statistics that summarize the central tendency, dispersion, and shape of each column's distribution.
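
The three profiling steps map onto three standard pandas calls. The sketch below uses a small hypothetical DataFrame rather than the lab dataset, so the values are for illustration only.

```python
import pandas as pd

# Hypothetical sample; in the lab, df is the DataFrame loaded from customers.csv.
df = pd.DataFrame({
    "customer_id": [1001, 1002, 1003],
    "customer_name": ["Alice Smith", "Bob Jones", None],
    "units_purchased": [12.0, 5.0, None],
})

first_rows = df.head(10)   # first 10 rows (fewer if the table is shorter)
print(first_rows)

df.info()                  # prints column names, dtypes, and non-null counts

stats = df.describe()      # summary statistics, numeric columns only by default
print(stats)
```

Passing `include="all"` to `describe()` also summarizes non-numeric columns, which can be useful for name and address fields.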

In [None]:
# Your Code Starts Here

<a name='3'></a>
## 3 - Missing and Duplicate samples
Missing values are absent data points in a dataset. They arise from errors, non-responses, or system issues, can skew analysis, and are often represented as blanks, NaN, or None.

Duplicate values are identical entries that appear more than once in a dataset. In some special cases, duplicates carry extra information, such as repeated __customer_id__ rows that record how a customer's details changed over time.

<a name='ex-3'></a>
### Exercise 3
Your tasks:
1. Calculate the number of missing values in each column.
2. Calculate the total number of unique customers and the number of duplicate samples.
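
One possible way to compute these counts is sketched below on hypothetical rows. Because "duplicate samples" can mean either fully identical rows or repeated customer IDs, the sketch shows both readings; which one the lab intends is an assumption you should check against your data.

```python
import pandas as pd

# Hypothetical sample with one missing city and repeated customer IDs.
df = pd.DataFrame({
    "customer_id": [1001, 1002, 1002, 1003, 1003],
    "city": ["Boston", None, "Denver", "Austin", "Austin"],
})

missing_per_column = df.isna().sum()                   # missing values per column
unique_customers = df["customer_id"].nunique()         # distinct customer IDs

# Reading 1: rows that are exact copies of an earlier row.
duplicate_rows = int(df.duplicated().sum())
# Reading 2: rows whose customer_id already appeared earlier.
duplicate_keys = int(df.duplicated(subset="customer_id").sum())

print(missing_per_column)
print(unique_customers, duplicate_rows, duplicate_keys)
```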

In [None]:
# Your Code Starts Here

<a name='4'></a>
## 4 - Data quality ## 
Data quality refers to how fit, reliable, and usable data is for its intended purpose, measured by dimensions like accuracy (correctness), completeness (no missing data), consistency (uniformity), timeliness (up-to-date), validity (correct format), and uniqueness (no duplicates). High-quality data enables sound decision-making, while poor data quality leads to errors, inefficiencies, and risks, making ongoing management crucial for business success. 

<a name='ex-4'></a>
### Exercise 4
Use these rules to clean your dataset:

1. __customer_id__ must exist

2. __customer_name__ should not be empty

3. __valid_from__ must be valid epoch seconds → datetime

4. __valid_to__ epoch seconds → datetime (null = current)

5. __postcode__ normalize to 5 digits when possible, else null

6. __tax_id__ normalize to 9 digits when possible, else null

7. __loyalty_segment__ must be in {0,1,2,3}

8. __units_purchased__ must be >= 0
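
The rules above can be sketched roughly as follows. This is one interpretation, not a reference solution: the sample rows are hypothetical, the helper names (`epoch_to_datetime`, `normalize_postcode`, `normalize_tax_id`) are made up for illustration, and the exact meaning of "when possible" for postcodes and tax IDs is an assumption stated in the comments.

```python
import re
import pandas as pd

# Hypothetical sample rows; the column names come from the rules above.
raw = pd.DataFrame({
    "customer_id": [1001, 1002, None, 1004],
    "customer_name": ["Alice Smith", "  ", "Carol Wu", "Dan Lee"],
    "valid_from": [1609459200, 1612137600, 1614556800, "not-a-date"],
    "valid_to": [None, 1640995200, None, None],
    "postcode": ["02139", " 94105 ", "ABCDE", "94105-1234"],
    "tax_id": ["123-45-6789", "12345", None, "987654321"],
    "loyalty_segment": [0, 3, 2, 7],
    "units_purchased": [12, -1, 5, 0],
})


def epoch_to_datetime(series):
    """Convert epoch seconds to datetimes; unparseable values become NaT."""
    seconds = pd.to_numeric(series, errors="coerce")
    return pd.to_datetime(seconds, unit="s", errors="coerce")


def normalize_postcode(value):
    """Assumed rule: accept 'ddddd' or 'ddddd-dddd' and keep the first 5 digits."""
    if pd.isna(value):
        return None
    match = re.fullmatch(r"(\d{5})(?:-\d{4})?", str(value).strip())
    return match.group(1) if match else None


def normalize_tax_id(value):
    """Assumed rule: drop non-digits and keep the result only if 9 digits remain."""
    if pd.isna(value):
        return None
    digits = re.sub(r"\D", "", str(value))
    return digits if len(digits) == 9 else None


clean = raw.copy()
clean["customer_name"] = clean["customer_name"].str.strip()
clean["valid_from"] = epoch_to_datetime(clean["valid_from"])
clean["valid_to"] = epoch_to_datetime(clean["valid_to"])   # NaT means "current"
clean["postcode"] = clean["postcode"].map(normalize_postcode)
clean["tax_id"] = clean["tax_id"].map(normalize_tax_id)

# One boolean column per rule: True marks a row that violates the rule.
violations = pd.DataFrame({
    "customer_id missing": clean["customer_id"].isna(),
    "customer_name empty": clean["customer_name"].fillna("").eq(""),
    "valid_from invalid": clean["valid_from"].isna(),
    "loyalty_segment not in 0-3": ~clean["loyalty_segment"].isin([0, 1, 2, 3]),
    "units_purchased negative": ~(clean["units_purchased"] >= 0),
})

report = violations.sum()                       # violation count per rule
valid_rows = clean[~violations.any(axis=1)]     # rows that pass every rule
print(report)
```

Keeping the violation flags in their own table makes it easy to reuse the counts later in the quality report, and to decide per rule whether a failing row should be dropped or only flagged.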

In [None]:
# Your Code Starts Here

<a name='ex-5'></a>
### Exercise 5
Now let's focus on each data sample.

Rule: For each __customer_id__, create an __address_update__ column that shows whether the customer needs to update their address information, that is, whether any of __state__, __city__, __postcode__, __street__, __number__, __unit__, __region__, __district__ is missing.
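
A minimal sketch of this rule, applied row by row, is shown below. The sample rows are hypothetical, and the sketch assumes that "missing" means a null value; blank strings would need to be converted to nulls first.

```python
import pandas as pd

# Address-related columns named in the rule.
ADDRESS_COLUMNS = [
    "state", "city", "postcode", "street",
    "number", "unit", "region", "district",
]

# Hypothetical sample: the second customer has no city.
df = pd.DataFrame({
    "customer_id": [1001, 1002],
    "state": ["MA", "CO"],
    "city": ["Boston", None],
    "postcode": ["02139", "80202"],
    "street": ["Main St", "Pine Ave"],
    "number": ["12", "7"],
    "unit": ["4B", "1"],
    "region": ["Northeast", "West"],
    "district": ["Middlesex", "Denver"],
})

# True when at least one address column is null in that row.
df["address_update"] = df[ADDRESS_COLUMNS].isna().any(axis=1)
print(df[["customer_id", "address_update"]])
```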

In [None]:
# Your Code Starts Here

<a name='5'></a>
## 5 - Customer Object Builder (Optional)
In this part, you will practice basic Python OOP by turning each row in customers.csv into a Customer object. Every object stores all columns as its attributes, just like a real “customer profile” in a system. You will then add a business rule: if any address-related field is missing, the customer should be flagged as needing an address update. Finally, you will convert the objects back into a table (DataFrame) and export the results.

<a name='ex-6'></a>
### Exercise 6 (Optional)
Good job! Now build a class __Customer__ and create an object for every customer. Each object should store all of that customer's information as attributes.
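
One way to approach this is sketched below, assuming each row is turned into keyword arguments. The method names `needs_address_update` and `to_dict` are hypothetical choices for illustration, and the sample rows are made up.

```python
import pandas as pd


class Customer:
    """One customer record; every column becomes an attribute."""

    ADDRESS_FIELDS = (
        "state", "city", "postcode", "street",
        "number", "unit", "region", "district",
    )

    def __init__(self, **fields):
        # Store every column of the row as an attribute of the same name.
        for name, value in fields.items():
            setattr(self, name, value)

    def needs_address_update(self):
        """True if any address field is absent or null."""
        return any(pd.isna(getattr(self, f, None)) for f in self.ADDRESS_FIELDS)

    def to_dict(self):
        """Return the stored attributes as a plain dict (one table row)."""
        return dict(vars(self))


# Hypothetical sample: the second customer has no city.
df = pd.DataFrame({
    "customer_id": [1001, 1002],
    "customer_name": ["Alice Smith", "Bob Jones"],
    "state": ["MA", "CO"],
    "city": ["Boston", None],
    "postcode": ["02139", "80202"],
    "street": ["Main St", "Pine Ave"],
    "number": ["12", "7"],
    "unit": ["4B", "1"],
    "region": ["Northeast", "West"],
    "district": ["Middlesex", "Denver"],
})

# One object per row, then back to a table with the flag added.
customers = [Customer(**row) for row in df.to_dict(orient="records")]
flags = [c.needs_address_update() for c in customers]
result = pd.DataFrame([c.to_dict() for c in customers])
result["address_update"] = flags
print(result[["customer_id", "address_update"]])
```

The resulting table can then be written out with `result.to_csv(...)` using whichever output path the lab requires.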

In [None]:
# Your Code Starts Here

<a name='ex-7'></a>
### Exercise 7 (Optional)
Please add methods to the __Customer__ class for getting and modifying attribute values. Any modification should also update the DataFrame.
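
A possible design, shown here only as a sketch, is to let each object keep a reference to the DataFrame and its own row label, so that a change can be written to both places. The constructor signature and the method names `get` and `set` are assumptions for illustration, not a required interface.

```python
import pandas as pd


class Customer:
    """Customer backed by one row of a DataFrame (hypothetical design)."""

    def __init__(self, frame, index):
        self._frame = frame      # the shared DataFrame, not a copy
        self._index = index      # row label of this customer
        for name, value in frame.loc[index].items():
            setattr(self, name, value)

    def get(self, field):
        """Return the current value of an attribute."""
        return getattr(self, field)

    def set(self, field, value):
        """Update the attribute and write the same value to the DataFrame."""
        if field not in self._frame.columns:
            raise KeyError(f"unknown field: {field}")
        setattr(self, field, value)
        self._frame.at[self._index, field] = value


# Hypothetical sample: the second customer has no city yet.
df = pd.DataFrame({
    "customer_id": [1001, 1002],
    "city": ["Boston", None],
})

customer = Customer(df, 1)
customer.set("city", "Denver")
print(customer.get("city"))
print(df)
```

Rejecting unknown field names in `set` keeps the object and the table in step, since a typo would otherwise create an attribute that has no matching column.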

In [None]:
# Your Code Starts Here