# 🚀 Exercise: Day 1 - Initial Data Ingestion & Cleaning for Fraud Detection 🚀

---

**Objective:** This exercise focuses on ingesting and cleaning the foundational data for our fraud detection system using Python and Pandas. You will handle both batch customer data and simulated real-time transaction data, addressing common data quality issues and edge cases.

**Scenario:** You are starting to build the "Real-time Transaction Fraud Detection System." Your first task is to reliably ingest customer master data and a batch of simulated transaction events, performing necessary cleaning and preprocessing.

## 📝 Tasks 📝

You'll create two files: `day1_exercise_solution.py` (your Python code) and `day1_exercise_reasoning.md` (your explanations).

---

### Part 1: `day1_exercise_solution.py` (Python Script)

1.  **Data Ingestion:**
    * Load the **Customer Master Data CSV string** into a Pandas DataFrame named `df_customers`.
    * Parse the **Simulated Transaction Events JSON list** into a Pandas DataFrame named `df_transactions`.
    * Print the **first 5 rows** and the `info()` for both DataFrames immediately after ingestion.

2.  **Data Cleaning & Preprocessing (`df_customers`):**
    * **Handle Missing Values:** Verify that the critical columns `customer_id`, `customer_name`, and `registration_date` contain no missing values. In this dataset all three are fully populated, so the check should find nothing to fix.
    * **Correct Data Types:** Convert `registration_date` and `last_login_date` to datetime objects.
    * **Handle Duplicates:** Identify and remove any duplicate `customer_id` entries, keeping the first occurrence.

3.  **Data Cleaning & Preprocessing (`df_transactions`):**
    * **Handle Missing Values:** For the `amount` column, replace the string `"NULL"` with the median of the valid (numeric) transaction amounts. The median can only be computed once the remaining values are numeric.
    * **Correct Data Types:**
        * Ensure `amount` is numeric (float). Handle cases where `amount` might be a string (e.g., "50.25").
        * Convert `timestamp` to datetime objects.
    * **Handle Duplicates:** Identify and remove any duplicate `transaction_id` entries, keeping the first occurrence.
    * **Create Derived Column:** Calculate `transaction_hour` (hour of the day, as an integer) from the `timestamp`.

4.  **Output:**
    * Print the **cleaned `df_customers` info and first 5 rows**.
    * Print the **cleaned `df_transactions` info and first 5 rows**.
    * Print the **number of unique customers and transactions after cleaning**.
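
As a starting point for task 3, here is a minimal sketch of the transaction cleaning steps on a four-row excerpt of the data. It is one possible ordering, not the required one: it assumes duplicates are dropped before the median is computed so a repeated row does not count twice, and it relies on `pd.to_numeric(..., errors="coerce")` to turn `"NULL"` into `NaN`. The variable name `df` is a placeholder.

```python
import pandas as pd

# Excerpt of the transaction events, including the repeated TX001 row.
records = [
    {"transaction_id": "TX001", "amount": 150.75, "timestamp": "2024-07-16 10:00:00"},
    {"transaction_id": "TX003", "amount": "50.25", "timestamp": "2024-07-16 10:02:00"},
    {"transaction_id": "TX008", "amount": "NULL", "timestamp": "2024-07-16 10:07:00"},
    {"transaction_id": "TX001", "amount": 150.75, "timestamp": "2024-07-16 10:00:00"},
]
df = pd.DataFrame(records)

# Keep the first occurrence of each transaction_id.
df = df.drop_duplicates(subset="transaction_id", keep="first").reset_index(drop=True)

# Strings such as "50.25" become floats; "NULL" becomes NaN.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Fill the gaps with the median of the valid amounts.
df["amount"] = df["amount"].fillna(df["amount"].median())

# Parse timestamps and derive the hour of day as an integer.
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["transaction_hour"] = df["timestamp"].dt.hour.astype(int)

print(df)
```

On this excerpt the median of the two valid amounts (150.75 and 50.25) is 100.5, which is the value TX008 receives. On the full dataset the median will differ.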

---

### Part 2: `day1_exercise_reasoning.md` (Markdown File)

Explain your reasoning for each step, focusing on the following:

1.  **Data Cleaning & Preprocessing Explanations:**
    * **Missing Values:** How did you handle missing values in both DataFrames and why did you choose that strategy (e.g., replacement with median, removal)?
    * **Data Type Conversion:** Describe your approach to data type conversion for datetime and numeric columns.
    * **Duplicate Handling:** Explain your logic for identifying and removing duplicate entries in `customer_id` and `transaction_id`.

2.  **Edge Cases & Robustness:**
    * **Edge Case 1: Large Data Simulation:**
        Imagine the Simulated Transaction Events list contains millions of entries and cannot be loaded into memory all at once. Describe conceptually how you would handle this scenario (e.g., using Dask, processing in chunks, or a distributed framework like Spark). You don't need to implement this, but explain your approach.
    * **Edge Case 2: Inconsistent Data Formats:**
        What if the `amount` occasionally comes with currency symbols (e.g., "$150.75") or uses a comma as a decimal separator (e.g., "150,75")? Describe how you would make your parsing more robust.
        Implement a basic version of this for the currency symbol in your Python script (e.g., remove "$").
    * **Edge Case 3: Schema Drift/Unexpected Columns:**
        What if a new column like `payment_processor` suddenly appears in some transaction events, or an expected column like `currency` is sometimes missing? Describe how you would handle such schema variations in a production ETL pipeline (e.g., flexible schema, error logging, schema evolution tools).

3.  **Assumptions:**

    List any assumptions you made while performing the data ingestion and cleaning.
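
For Edge Cases 2 and 3, the sketch below shows one way the ideas could look in code. Everything named here (`parse_amount`, `conform_schema`, `EXPECTED_COLUMNS`, the logger name) is hypothetical and not part of the exercise. The parser makes a loud assumption: a comma with no dot in the same value is a decimal separator, so `"1,200"` would be read as 1.2. A production pipeline would need locale information to resolve that ambiguity.

```python
import logging
import math
import re

import pandas as pd

logger = logging.getLogger("day1_ingest")  # hypothetical logger name

# Assumed contract for a transaction event, taken from the sample data.
EXPECTED_COLUMNS = [
    "transaction_id", "customer_id", "amount",
    "timestamp", "currency", "ip_address",
]


def parse_amount(value) -> float:
    """Return the amount as a float, or NaN when it cannot be interpreted."""
    if isinstance(value, (int, float)):
        return float(value)
    # Keep only digits, separators, and a minus sign (drops "$", spaces, letters).
    text = re.sub(r"[^\d,.\-]", "", str(value))
    if "," in text and "." not in text:
        # Assumption: a comma without a dot is a decimal separator ("150,75").
        text = text.replace(",", ".")
    else:
        # Otherwise treat commas as thousands separators ("1,200.50").
        text = text.replace(",", "")
    try:
        return float(text)
    except ValueError:
        return math.nan


def conform_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Log schema differences, then return only the expected columns."""
    unexpected = sorted(set(df.columns) - set(EXPECTED_COLUMNS))
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if unexpected:
        logger.warning("Dropping unexpected columns: %s", unexpected)
    if missing:
        logger.warning("Adding missing columns as nulls: %s", missing)
    # reindex drops extra columns and adds missing ones filled with NaN.
    return df.reindex(columns=EXPECTED_COLUMNS)


# Hypothetical event with an extra column and no "currency" field.
events = pd.DataFrame([{
    "transaction_id": "TX900", "customer_id": "C101", "amount": "$150.75",
    "timestamp": "2024-07-16 11:00:00", "ip_address": "192.168.1.10",
    "payment_processor": "ExampleProcessor",
}])
clean = conform_schema(events)
clean["amount"] = clean["amount"].map(parse_amount)
print(clean)
```

Dropping unexpected columns is only one policy. Keeping them in a separate "extras" structure, or rejecting the batch outright, are equally valid choices to discuss in your reasoning file.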

---

## Data to Use:

### Customer Master Data (CSV String):

```csv
customer_id,customer_name,customer_email,registration_date,customer_tier,last_login_date
C101,Alice Smith,alice@example.com,2023-01-15,Gold,2024-07-15
C102,Bob Johnson,bob@example.com,2024-07-09,Silver,2024-07-16
C103,Charlie Brown,charlie@example.com,2023-11-20,Bronze,2024-07-14
C104,Diana Prince,diana@example.com,2024-07-10,Silver,2024-07-16
C105,Eve Adams,eve@example.com,2023-05-01,Gold,2024-07-15
C106,Frank White,frank@example.com,2024-07-11,Bronze,2024-07-15
C107,Grace Hopper,grace@example.com,2023-03-22,Silver,2024-07-14
C108,Heidi Klum,heidi@example.com,2024-07-16,Bronze,2024-07-16
C109,Ivan Drago,ivan@example.com,2024-07-16,Silver,2024-07-16
```
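
One way to load a CSV held in a Python string is to wrap it in `io.StringIO` and pass it to `pd.read_csv`. The sketch below uses only the header and two rows from the data above, and repeats the C101 row on purpose so the duplicate step has something to remove (the exercise data itself contains no duplicate customers). The names `csv_text`, `critical`, and `missing_counts` are placeholders.

```python
import io

import pandas as pd

# Header and two rows from the customer data; C101 is repeated deliberately.
csv_text = """customer_id,customer_name,customer_email,registration_date,customer_tier,last_login_date
C101,Alice Smith,alice@example.com,2023-01-15,Gold,2024-07-15
C102,Bob Johnson,bob@example.com,2024-07-09,Silver,2024-07-16
C101,Alice Smith,alice@example.com,2023-01-15,Gold,2024-07-15
"""

# read_csv accepts any file-like object, so StringIO stands in for a file.
df_customers = pd.read_csv(io.StringIO(csv_text))

# Convert both date columns using the ISO format seen in the data.
for col in ("registration_date", "last_login_date"):
    df_customers[col] = pd.to_datetime(df_customers[col], format="%Y-%m-%d")

# Count missing values in the critical columns (expected to be zero here).
critical = ["customer_id", "customer_name", "registration_date"]
missing_counts = df_customers[critical].isna().sum()
print(missing_counts)

# Keep the first row for each customer_id.
df_customers = df_customers.drop_duplicates(subset="customer_id", keep="first")
df_customers.info()
```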
---

### Simulated Transaction Events (JSON Array of Objects):

```json
[
    {"transaction_id": "TX001", "customer_id": "C101", "amount": 150.75, "timestamp": "2024-07-16 10:00:00", "currency": "USD", "ip_address": "192.168.1.10"},
    {"transaction_id": "TX002", "customer_id": "C102", "amount": 25.00, "timestamp": "2024-07-16 10:01:30", "currency": "USD", "ip_address": "192.168.1.11"},
    {"transaction_id": "TX003", "customer_id": "C101", "amount": "50.25", "timestamp": "2024-07-16 10:02:00", "currency": "USD", "ip_address": "192.168.1.10"},
    {"transaction_id": "TX004", "customer_id": "C103", "amount": 1200.00, "timestamp": "2024-07-16 10:03:15", "currency": "EUR", "ip_address": "192.168.1.12"},
    {"transaction_id": "TX005", "customer_id": "C102", "amount": "75.50", "timestamp": "2024-07-16 10:04:00", "currency": "USD", "ip_address": "192.168.1.11"},
    {"transaction_id": "TX006", "customer_id": "C104", "amount": 300.00, "timestamp": "2024-07-16 10:05:00", "currency": "USD", "ip_address": "192.168.1.13"},
    {"transaction_id": "TX007", "customer_id": "C101", "amount": 10.00, "timestamp": "2024-07-16 10:06:00", "currency": "USD", "ip_address": "192.168.1.14"},
    {"transaction_id": "TX008", "customer_id": "C105", "amount": "NULL", "timestamp": "2024-07-16 10:07:00", "currency": "USD", "ip_address": "192.168.1.15"},
    {"transaction_id": "TX009", "customer_id": "C102", "amount": 200.00, "timestamp": "2024-07-16 10:08:00", "currency": "USD", "ip_address": "192.168.1.11"},
    {"transaction_id": "TX010", "customer_id": "C106", "amount": 45.00, "timestamp": "2024-07-16 10:09:00", "currency": "USD", "ip_address": "192.168.1.16"},
    {"transaction_id": "TX011", "customer_id": "C101", "amount": 500.00, "timestamp": "2024-07-16 10:10:00", "currency": "USD", "ip_address": "192.168.1.10"},
    {"transaction_id": "TX012", "customer_id": "C107", "amount": 20.00, "timestamp": "2024-07-16 10:11:00", "currency": "USD", "ip_address": "192.168.1.17"},
    {"transaction_id": "TX013", "customer_id": "C108", "amount": 10.00, "timestamp": "2024-07-16 10:12:00", "currency": "USD", "ip_address": "192.168.1.18"},
    {"transaction_id": "TX014", "customer_id": "C109", "amount": 15.00, "timestamp": "2024-07-16 10:13:00", "currency": "USD", "ip_address": "192.168.1.19"},
    {"transaction_id": "TX015", "customer_id": "C999", "amount": 100.00, "timestamp": "2024-07-16 10:14:00", "currency": "USD", "ip_address": "192.168.1.20"},
    {"transaction_id": "TX001", "customer_id": "C101", "amount": 150.75, "timestamp": "2024-07-16 10:00:00", "currency": "USD", "ip_address": "192.168.1.10"}
]
```
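
A short sketch of the ingestion step, assuming the events arrive as a single JSON string: `json.loads` yields a list of dictionaries, which `pd.DataFrame` accepts directly. Only three of the events above are included. The variable names `raw` and `events` are placeholders.

```python
import json

import pandas as pd

# Three events copied from the data above, held as one JSON string.
raw = """[
    {"transaction_id": "TX001", "customer_id": "C101", "amount": 150.75, "timestamp": "2024-07-16 10:00:00", "currency": "USD", "ip_address": "192.168.1.10"},
    {"transaction_id": "TX003", "customer_id": "C101", "amount": "50.25", "timestamp": "2024-07-16 10:02:00", "currency": "USD", "ip_address": "192.168.1.10"},
    {"transaction_id": "TX008", "customer_id": "C105", "amount": "NULL", "timestamp": "2024-07-16 10:07:00", "currency": "USD", "ip_address": "192.168.1.15"}
]"""

events = json.loads(raw)            # list of dicts
df_transactions = pd.DataFrame(events)

# The amount column mixes floats and strings, so pandas stores it as object.
print(df_transactions["amount"].map(type).value_counts())
df_transactions.info()
```

Inspecting the Python types in `amount` right after ingestion makes the mixed float and string values visible before any cleaning is attempted.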