## Step 1: Hello, Data!


Load the raw CSV data and display the first three rows. Each row is stored as a dictionary inside a list, which satisfies the requirement to select an appropriate Python data structure.


In [None]:
import csv
import pprint

data_path = '../data/sales_data.csv'

# Load every row into a list of dictionaries keyed by the CSV header.
# newline='' lets the csv module handle line endings inside quoted fields.
with open(data_path, mode='r', encoding='utf-8', newline='') as f:
    raw_data = list(csv.DictReader(f))

print(f"Total rows loaded: {len(raw_data)}")
print("First 3 rows:")
pprint.pprint(raw_data[:3])


## Step 2: Pick the Right Container


**Justification:**
I chose a **list of dictionaries** for the initial load because:
1.  **Flexibility**: `csv.DictReader` maps each CSV header to its value, so no schema has to be defined up front. Every value arrives as a string until it is cleaned.
2.  **Readability**: Accessing fields by name (e.g., `row['Region']`) is clearer than index-based access (e.g., `row[0]`).
3.  **Mutability**: Rows can be appended one at a time during loading and edited in place afterwards.

For stricter schema enforcement later, a **NamedTuple** or a **class** is a better fit: both fix the set of field names, and a NamedTuple (or a class that defines `__slots__`) also uses less memory per row than a dictionary. We explore the class approach next. **Sets** will be used to collect unique values (such as countries or item types), since membership checks on a set take O(1) time on average.
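
The trade-offs above can be sketched on toy rows. This is a minimal illustration, not the notebook's actual data: `SaleRecord` and the sample values are hypothetical, and only the column names `Region`, `Country`, and `Item Type` come from the CSV.

```python
from typing import NamedTuple

# Hypothetical sample rows shaped like csv.DictReader output (all values are strings)
rows = [
    {"Region": "Europe", "Country": "France", "Item Type": "Cereal"},
    {"Region": "Asia", "Country": "Japan", "Item Type": "Snacks"},
    {"Region": "Europe", "Country": "France", "Item Type": "Snacks"},
]


class SaleRecord(NamedTuple):
    """Illustrative fixed-schema record; not part of the assignment code."""
    region: str
    country: str
    item_type: str


# A NamedTuple fixes the set of fields and is immutable once created
records = [SaleRecord(r["Region"], r["Country"], r["Item Type"]) for r in rows]

# A set collapses duplicates and gives average O(1) membership checks
unique_countries = {rec.country for rec in records}
has_france = "France" in unique_countries
has_brazil = "Brazil" in unique_countries

print(records[0].country, len(unique_countries), has_france, has_brazil)
```

The list of dictionaries stays the loading format; the NamedTuple and the set are views built from it once the field names are known.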



## Step 3: Implement Functions and Data Structure


We will design a small Python class `Transaction` to encapsulate the logic for a single row, and implement a `clean()` method within it.


In [None]:
class Transaction:
    def __init__(self, data_dict):
        self.region = data_dict.get('Region')
        self.country = data_dict.get('Country')
        self.item_type = data_dict.get('Item Type')
        self.sales_channel = data_dict.get('Sales Channel')
        # Storing raw strings initially, will transform later
        self.order_date = data_dict.get('Order Date')
        self.order_id = data_dict.get('Order ID')
        self.units_sold = data_dict.get('Units Sold')
        self.unit_price = data_dict.get('Unit Price')
        self.unit_cost = data_dict.get('Unit Cost')
        self.total_revenue = data_dict.get('Total Revenue')
        self.total_cost = data_dict.get('Total Cost')
        self.total_profit = data_dict.get('Total Profit')

    def clean(self):
        # Convert Units Sold to int; missing (None) or malformed values fall back to 0
        if not isinstance(self.units_sold, int):
            try:
                self.units_sold = int(self.units_sold)
            except (ValueError, TypeError):
                self.units_sold = 0

        # Convert monetary fields to float; missing or malformed values fall back to 0.0
        for field in ['unit_price', 'unit_cost', 'total_revenue', 'total_cost', 'total_profit']:
            val = getattr(self, field)
            if not isinstance(val, float):
                try:
                    setattr(self, field, float(val))
                except (ValueError, TypeError):
                    setattr(self, field, 0.0)

    def total(self):
        # Return the transaction's total revenue (numeric once clean() has run)
        return self.total_revenue
    
    def __repr__(self):
        return f"<Transaction ID={self.order_id} Item={self.item_type} Rev={self.total_revenue}>"

print("Class defined successfully.")


## Step 4: Bulk Loaded


Map the initial raw dictionaries into our `Transaction` objects.


In [None]:
# Wrap each raw row in a Transaction object
transactions = [Transaction(row) for row in raw_data]

print(f"Loaded {len(transactions)} Transaction objects.")
print("Sample object:", transactions[0])


## Step 5: Quick Profiling


Calculate the min, mean, and max unit price and count the unique countries. Note: prices are cast to float on the fly because the fields are still raw strings at this point; values that cannot be parsed are skipped.


In [None]:
prices = []
for t in transactions:
    try:
        prices.append(float(t.unit_price))
    except (ValueError, TypeError):
        pass

if prices:
    min_price = min(prices)
    max_price = max(prices)
    mean_price = sum(prices) / len(prices)
else:
    min_price = max_price = mean_price = 0.0

# Using a Set for unique countries
unique_countries = {t.country for t in transactions}

print(f"Price Stats -> Min: {min_price:.2f}, Max: {max_price:.2f}, Mean: {mean_price:.2f}")
print(f"Unique Countries Count: {len(unique_countries)}")
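
Silently skipping unparseable prices hides how dirty the column is. As a hedged variation on the cell above, the sketch below also counts the skipped values. The sample strings and the helper `to_float_or_none` are hypothetical, not taken from the dataset.

```python
import statistics

# Hypothetical raw price strings, including two values that cannot be parsed
raw_prices = ["9.33", "651.21", "", "47.45", "abc"]


def to_float_or_none(text):
    """Return text as a float, or None if it cannot be parsed."""
    try:
        return float(text)
    except (ValueError, TypeError):
        return None


parsed = [to_float_or_none(p) for p in raw_prices]
clean_prices = [p for p in parsed if p is not None]

summary = {
    "min": min(clean_prices),
    "max": max(clean_prices),
    "mean": statistics.mean(clean_prices),
    "skipped": len(parsed) - len(clean_prices),  # how many values were dropped
}
print(summary)
```

Reporting the skipped count next to the statistics makes it clear how much of the column the numbers actually describe.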


## Step 6: Spot the Grime


We identify potential dirty-data cases:
1.  **Data Types**: All numeric fields were loaded as strings (e.g., '437.20') and must be converted before any arithmetic.
2.  **Date Format**: Dates are 'MM/DD/YYYY' strings and need parsing into `datetime` objects.
3.  **Missing/Anomalous Values**: Values such as a negative profit or zero units sold may be present and should be checked for; none have been confirmed at this point.
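
The three checks listed above can be turned into a small counting pass. This is a sketch under stated assumptions: the helper names (`is_number`, `is_valid_date`, `profile_grime`) are hypothetical, and the sample rows are invented to trigger each check rather than drawn from the real file.

```python
from datetime import datetime

# Hypothetical rows; the values are invented to trigger each check
sample_rows = [
    {"Order Date": "5/28/2010", "Units Sold": "9925", "Total Profit": "951410.50"},
    {"Order Date": "2010-05-28", "Units Sold": "0", "Total Profit": "-12.00"},
    {"Order Date": "8/22/2012", "Units Sold": "n/a", "Total Profit": ""},
]


def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except (ValueError, TypeError):
        return False


def is_valid_date(text, fmt="%m/%d/%Y"):
    """Return True if text matches the expected date format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except (ValueError, TypeError):
        return False


def profile_grime(rows):
    """Count how often each kind of dirty value appears."""
    issues = {"bad_number": 0, "bad_date": 0, "negative_profit": 0, "zero_units": 0}
    for row in rows:
        for field in ("Units Sold", "Total Profit"):
            if not is_number(row.get(field)):
                issues["bad_number"] += 1
        if not is_valid_date(row.get("Order Date")):
            issues["bad_date"] += 1
        if is_number(row.get("Total Profit")) and float(row["Total Profit"]) < 0:
            issues["negative_profit"] += 1
        if is_number(row.get("Units Sold")) and float(row["Units Sold"]) == 0:
            issues["zero_units"] += 1
    return issues


grime_report = profile_grime(sample_rows)
print(grime_report)
```

Running the same function over `raw_data` would show which of the suspected problems actually occur before any cleaning rule is written.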



## Step 7: Cleaning Rules


Execute the `clean()` method on all objects and show before/after types.


In [None]:
print(f"Before cleaning sample type (Units Sold): {type(transactions[0].units_sold)}")

cleaned_count = 0
for t in transactions:
    t.clean()
    cleaned_count += 1

print(f"Cleaned {cleaned_count} records.")
print(f"After cleaning sample type (Units Sold): {type(transactions[0].units_sold)}")
print("Sample cleaned object properties:", vars(transactions[0]))
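
The before/after comparison can be extended from one field to every field. A minimal sketch, assuming a stand-in object rather than the notebook's `Transaction` class; `field_types` and the sample values are hypothetical.

```python
from types import SimpleNamespace


def field_types(obj):
    """Map each attribute name to the name of its current type."""
    return {name: type(value).__name__ for name, value in vars(obj).items()}


# Hypothetical stand-in for one Transaction, with values still stored as strings
record = SimpleNamespace(region="Europe", units_sold="9925", unit_price="255.28")
before = field_types(record)

# Stand-in for clean(): convert the numeric fields in place
record.units_sold = int(record.units_sold)
record.unit_price = float(record.unit_price)
after = field_types(record)

# Keep only the fields whose type changed
changed = {
    name: (before[name], after[name])
    for name in before
    if before[name] != after[name]
}
print(changed)
```

Taking the snapshot with `field_types(transactions[0])` before and after the cleaning loop would give the same report for the real objects.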


## Step 8: Transformations


Transform the `Order Date` strings into `datetime` objects; values that do not match 'MM/DD/YYYY' become `None`.


In [None]:
from datetime import datetime

# Transformation rule: Parse Order Date
for t in transactions:
    if isinstance(t.order_date, str):
        try:
            t.order_date = datetime.strptime(t.order_date, '%m/%d/%Y')
        except ValueError:
            t.order_date = None

print(f"Transformed Date sample: {transactions[0].order_date} (Type: {type(transactions[0].order_date)})")


## Step 9: Feature Engineering


Add a calculated feature: `margin_percentage` = (Total Profit / Total Revenue) * 100.


In [None]:
for t in transactions:
    if t.total_revenue > 0:
        t.margin_percentage = (t.total_profit / t.total_revenue) * 100
    else:
        t.margin_percentage = 0.0

print(f"Sample Feature (Margin %): {transactions[0].margin_percentage:.2f}%")
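
As a quick sanity check on the formula, the sketch below wraps it in a standalone function and applies it to made-up figures. The function `margin_percentage` is a hypothetical helper, not part of the `Transaction` class.

```python
def margin_percentage(total_profit, total_revenue):
    """Profit as a percentage of revenue; 0.0 when revenue is not positive."""
    if total_revenue > 0:
        return (total_profit / total_revenue) * 100
    return 0.0


# Hypothetical figures: 250.00 profit on 1,000.00 revenue
example_margin = margin_percentage(250.0, 1000.0)

# Zero revenue would divide by zero, so the guard returns 0.0 instead
zero_revenue_margin = margin_percentage(50.0, 0.0)

print(f"{example_margin:.2f}% {zero_revenue_margin:.2f}%")
```

Returning 0.0 for non-positive revenue mirrors the guard in the loop above; returning `None` instead would keep such rows distinguishable from a true 0% margin.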


## Step 10: Mini-Aggregation


Calculate Total Revenue per Region using a dictionary.


In [None]:
revenue_per_region = {}

for t in transactions:
    if t.region not in revenue_per_region:
        revenue_per_region[t.region] = 0.0
    revenue_per_region[t.region] += t.total_revenue

print("Total Revenue per Region:")
for region, rev in revenue_per_region.items():
    print(f"{region}: ${rev:,.2f}")
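
The same aggregation can be written with `collections.defaultdict`, which removes the explicit key check. A minimal sketch on hypothetical `(region, revenue)` pairs rather than the notebook's `Transaction` objects:

```python
from collections import defaultdict

# Hypothetical (region, revenue) pairs standing in for Transaction objects
sales = [("Europe", 100.0), ("Asia", 250.5), ("Europe", 49.5)]

revenue_by_region = defaultdict(float)  # missing keys start at 0.0
for region, revenue in sales:
    revenue_by_region[region] += revenue

# Sort regions by revenue, largest first
ranked = sorted(revenue_by_region.items(), key=lambda item: item[1], reverse=True)

for region, revenue in ranked:
    print(f"{region}: ${revenue:,.2f}")
```

Sorting the result is optional, but it makes the largest regions easy to spot in the printed summary.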


## Step 11: Serialization Checkpoint


Save the cleaned and enriched data to **JSON** and **CSV** to satisfy the learning objective of serializing in at least two formats.


In [None]:
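import json
import csv
from datetime import datetime

# 1. Serialize to JSON
def transaction_serializer(obj):
    # json.dump calls this for any object it cannot serialize natively
    if isinstance(obj, datetime):
        return obj.strftime('%Y-%m-%d')
    # Transaction objects are written as a dict of their attributes
    return obj.__dict__

json_path = '../data/processed_transactions.json'
with open(json_path, 'w', encoding='utf-8') as f:
    json.dump(transactions, f, default=transaction_serializer, indent=4)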

print(f"Data serialized to {json_path}")

# 2. Serialize to CSV
csv_path = '../data/processed_transactions.csv'
if transactions:
    # Extract headers from the first transaction object
    headers = list(transactions[0].__dict__.keys())

    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=headers)
        writer.writeheader()
        for t in transactions:
            # Need to handle the datetime objects for CSV as well
            row = t.__dict__.copy()
            if isinstance(row['order_date'], datetime):
                row['order_date'] = row['order_date'].strftime('%Y-%m-%d')
            writer.writerow(row)

print(f"Data serialized to {csv_path}")
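
A checkpoint is only useful if it can be read back. The sketch below round-trips a record through JSON in a temporary directory. It assumes the 'YYYY-MM-DD' date format written above; the record itself and the file name are hypothetical.

```python
import json
import os
import tempfile
from datetime import datetime

# Hypothetical record mirroring the serialized shape (dates as 'YYYY-MM-DD' strings)
records = [{"order_id": "1001", "order_date": "2010-05-28", "total_revenue": 2533.65}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=4)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# JSON has no date type, so dates come back as strings and must be re-parsed
for rec in loaded:
    rec["order_date"] = datetime.strptime(rec["order_date"], "%Y-%m-%d")

print(loaded[0]["order_date"].year, type(loaded[0]["total_revenue"]).__name__)
```

Numbers survive the JSON round trip as numbers, whereas a CSV reload returns every field as a string and needs the cleaning step again.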


## Step 12: Soft Interview Reflection


Using functions and classes organized the code better than a collection of scattered scripts would have.
Encapsulating logic in the `Transaction` class made the **cleaning** process repeatable and isolated; if I need to change how 'Units Sold' is parsed, I only have to touch one method.
The `clean()` method acts as a gatekeeper: once an object has passed through it, its numeric fields have reliable types.
This object-oriented approach scales better than managing parallel lists or raw dictionaries, especially as complexity grows.


## Data Dictionary


Merged definitions from the primary CSV and secondary metadata source.

| Field | Type | Description | Source |
|-------|------|-------------|--------|
| Region | String | Geographical region of the transaction | Primary CSV |
| Country | String | Country of origin | Primary CSV |
| Item Type | String | Category of product sold | Primary CSV |
| Sales Channel | String | Online or Offline sales | Primary CSV |
| Order Priority | String | Priority code (C, H, M, L) | Primary CSV |
| Order Date | Date | Date order was placed | Primary CSV |
| Order ID | Int | Unique identifier for the order | Primary CSV |
| Ship Date | Date | Date order was shipped | Primary CSV |
| Units Sold | Int | Quantity of items sold | Primary CSV |
| Unit Price | Float | Price per unit | Primary CSV |
| Unit Cost | Float | Cost to produce per unit | Primary CSV |
| Total Revenue | Float | Revenue (Unit Price * Units Sold) | Primary CSV |
| Total Cost | Float | Cost (Unit Cost * Units Sold) | Primary CSV |
| Total Profit | Float | Profit (Revenue - Cost) | Primary CSV |
| Category Description | String | Detailed description of the item category | Secondary Metadata |
| Shelf Life (Days) | Int | Estimated shelf life of the product | Secondary Metadata |
| Margin Percentage | Float | Calculated profit margin (%) | Feature Engineering |

**Note on Secondary Metadata:**
We enriched the data dictionary with 'Category Description' and 'Shelf Life (Days)', both of which come from `data/product_metadata.csv`.
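
One way the two metadata fields could be attached to the sales rows is a dictionary lookup keyed on item type. This is a sketch under loud assumptions: the layout of the metadata file, the use of `Item Type` as the join key, and every sample value are hypothetical, and the real file may differ.

```python
import csv
import io

# Assumed layout of the metadata file; the real column names may differ
metadata_csv = io.StringIO(
    "Item Type,Category Description,Shelf Life (Days)\n"
    "Cereal,Grain-based breakfast foods,365\n"
    "Fruits,Fresh produce,14\n"
)
metadata_by_item = {row["Item Type"]: row for row in csv.DictReader(metadata_csv)}

# Hypothetical sales rows; 'Cosmetics' has no metadata entry
sales_rows = [{"Item Type": "Cereal"}, {"Item Type": "Cosmetics"}]

enriched = []
for row in sales_rows:
    meta = metadata_by_item.get(row["Item Type"], {})
    shelf_life = meta.get("Shelf Life (Days)")
    enriched.append({
        **row,
        "Category Description": meta.get("Category Description"),
        # Shelf life is read as a string, so convert it when present
        "Shelf Life (Days)": int(shelf_life) if shelf_life else None,
    })

print(enriched)
```

Item types without a metadata entry keep `None` in both fields, so unmatched rows stay visible instead of being dropped.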

