
# AI Reconciliation Bonus Report

## Objective

The task was to reconcile a schema-mismatched customer dataset (reconciliation_challenge_data.csv) with an existing cleaned schema (orders_clean.csv / v1_ecommerce.db::customers). The goal was to use AI tools to assist in schema alignment and demonstrate analytical thinking and creative problem-solving.

---

## Step 1: Initial Dataset Understanding

Two datasets were loaded:

- reconciliation_challenge_data.csv: A noisy, mismatched schema mixing transactional and customer-level attributes
- orders_clean.csv / orders table: Our reference schema with standardized order fields

Initial inspection showed that many column names in the new dataset did not directly align with the expected schema, and some columns had ambiguous meanings (e.g. payment_status, delivery_status).

---

## Step 2: Fuzzy Mapping Using Jaccard Similarity

To begin the reconciliation, I implemented a string-based fuzzy mapping approach using a Jaccard similarity function defined as:

```python
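def jaccard_similarity(a, b):
    """Jaccard similarity between the underscore-separated tokens of two column names."""
    # Lowercase and split on "_" so "Full_Name" and "full_name" compare equal.
    a_set = set(a.lower().split("_"))
    b_set = set(b.lower().split("_"))
    # |intersection| / |union|; str.split always yields a token, so the union is never empty.
    return len(a_set & b_set) / len(a_set | b_set)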
```

This function was applied to every (new_column, reference_column) pair to build a similarity matrix. It helped shortlist candidates (payment_status → payment_method; customer_segment → possibly inferred from order behavior), but many mappings were weak or misleading because shared tokens in column names capture little of their meaning.
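A minimal sketch of how the pairwise scores could be turned into a shortlist is shown here. The helper `best_matches`, the `threshold` value, and the short column lists are assumptions made for illustration; only the column names themselves come from this report.

```python
def jaccard_similarity(a, b):
    a_set = set(a.lower().split("_"))
    b_set = set(b.lower().split("_"))
    return len(a_set & b_set) / len(a_set | b_set)


def best_matches(new_columns, reference_columns, threshold=0.3):
    """Return {new_column: (best_reference_column, score)} for scores >= threshold.

    Hypothetical helper: the threshold is an illustrative choice, not a tuned value.
    """
    matches = {}
    for new_col in new_columns:
        # Score this column against every reference column and keep the top one.
        scored = [(jaccard_similarity(new_col, ref), ref) for ref in reference_columns]
        score, ref = max(scored)
        if score >= threshold:
            matches[new_col] = (ref, score)
    return matches


# Illustrative subsets of the two schemas.
new_cols = ["full_customer_name", "payment_status", "shipping_fee"]
ref_cols = ["full_name", "payment_method", "segment"]
shortlist = best_matches(new_cols, ref_cols)
```

With these inputs, full_customer_name pairs with full_name on two shared tokens, while payment_status pairs with payment_method on the single token payment. The second case illustrates how name overlap can suggest a pairing without confirming that the two fields mean the same thing.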

Conclusion: Jaccard similarity alone was not sufficient for semantic matching.

---

## Step 3: AI-Assisted Reconciliation with Gemini

To improve the mapping quality, I integrated Gemini AI (model: gemini-1.5-flash-latest) using the google.generativeai Python SDK.

For each new column:
- I sampled a few representative values from the dataset
- Constructed a prompt: "Given these values and the reference schema, which field in the reference schema is semantically equivalent to this column?"
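The per-column loop above can be sketched as follows. This is not the exact code used in the project: `build_prompt`, `suggest_mapping`, the sample size, and the "answer with one field name or NONE" instruction are assumptions. The model call is passed in as a plain callable so the sketch runs without network access.

```python
import random


def build_prompt(column, values, reference_columns, n_samples=5, seed=0):
    """Build a mapping prompt from a few sampled values (wording is illustrative)."""
    rng = random.Random(seed)
    # Drop empty values and de-duplicate before sampling.
    distinct = sorted({str(v) for v in values if v not in (None, "")})
    sample = rng.sample(distinct, min(n_samples, len(distinct)))
    return (
        f"Column name: {column}\n"
        f"Sample values: {', '.join(sample)}\n"
        f"Reference schema: {', '.join(reference_columns)}\n"
        "Given these values and the reference schema, which field in the "
        "reference schema is semantically equivalent to this column? "
        "Answer with exactly one field name, or NONE."
    )


def suggest_mapping(column, values, reference_columns, ask_model):
    """Ask the model and accept the answer only if it names a real reference field.

    ask_model is any callable mapping a prompt string to response text. With the
    google.generativeai SDK it could wrap, for example:
        genai.GenerativeModel("gemini-1.5-flash-latest").generate_content(prompt).text
    """
    answer = ask_model(build_prompt(column, values, reference_columns)).strip()
    return answer if answer in reference_columns else None
```

Checking the answer against the reference schema guards against the model returning a field name that does not exist; anything else is treated as "no mapping" and left for manual review.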

This approach significantly improved match precision over the name-based Jaccard shortlist, especially for semantically aligned fields such as:
- full_customer_name → full_name

---

## Step 4: Final Column Mapping Strategy

The final schema reconciliation was constructed as a union of:

- AI-suggested mappings (via Gemini)
- Human judgment applied to the fuzzy matches
- Manual logic for ambiguous columns

Columns that did not map to the reference schema (e.g. item_reference, shipping_fee) were retained for completeness but not integrated.
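One way to combine the three sources while keeping each decision traceable is sketched below. The precedence order (manual over AI over fuzzy), the use of `None` to mark a column as deliberately unmapped, and the example mappings and row are assumptions for illustration.

```python
def combine_mappings(fuzzy, ai, manual):
    """Merge mapping dicts; later sources override earlier ones.

    Each entry records which source decided it. A target of None means the
    column is intentionally left unmapped (an assumed convention).
    """
    merged = {}
    for source_name, source in (("fuzzy", fuzzy), ("ai", ai), ("manual", manual)):
        for col, target in source.items():
            merged[col] = (target, source_name)
    return merged


def rename_row(row, merged):
    """Rename mapped columns; unmapped columns are retained under their own name."""
    out = {}
    for col, value in row.items():
        target, _ = merged.get(col, (None, None))
        out[target or col] = value
    return out


# Hypothetical inputs: only the column names come from this report.
fuzzy = {"payment_status": "payment_method"}
ai = {"full_customer_name": "full_name"}
manual = {"payment_status": None}  # hypothetical manual override of a fuzzy match
merged = combine_mappings(fuzzy, ai, manual)

row = {"full_customer_name": "Ann Lee", "payment_status": "paid", "shipping_fee": 4.5}
renamed = rename_row(row, merged)
```

Storing the deciding source next to each target makes it straightforward to audit later which mappings came from the model and which were corrected by hand.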

Final joining was done with my corrected mappings.

---

## Step 5: Data Quality Checks & Visualization

Post-reconciliation, I performed a series of sanity checks:

- No missing critical fields (full_name, segment, status)
- Valid date formats and timestamps
- Outlier detection in numeric fields (total_spent, amount_paid)
- Plotted distributions of status and segment values, which indicated a balanced dataset
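The first three checks could look roughly like the following standard-library sketch. The function names, the timestamp format, the 1.5 × IQR outlier rule, and the example rows are assumptions; the report does not state which outlier method or date format was actually used.

```python
from datetime import datetime
from statistics import quantiles


def missing_critical(rows, fields):
    """Count rows where each critical field is absent or empty."""
    return {f: sum(1 for r in rows if r.get(f) in (None, "")) for f in fields}


def invalid_dates(rows, field, fmt="%Y-%m-%d %H:%M:%S"):
    """Return indices of rows whose field does not parse with fmt (assumed format)."""
    bad = []
    for i, r in enumerate(rows):
        try:
            datetime.strptime(str(r.get(field)), fmt)
        except ValueError:
            bad.append(i)
    return bad


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


# Made-up rows for illustration only.
rows = [
    {"full_name": "Ann Lee", "segment": "retail", "status": "delivered",
     "order_datetime": "2024-01-05 10:30:00"},
    {"full_name": "", "segment": "retail", "status": "pending",
     "order_datetime": "05/01/2024"},
]
missing = missing_critical(rows, ["full_name", "segment", "status"])
bad_dates = invalid_dates(rows, "order_datetime")
outliers = iqr_outliers([10, 12, 11, 13, 12, 11, 500])
```

The same helpers would apply to numeric fields such as total_spent and amount_paid once their values are collected into a list.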

---

## Step 6: Row-Level Integration (Planned)

Schema alignment was successfully achieved. The next step is row-level integration: matching reconciled records to the existing orders table (using fields like cust_id, order_total, order_datetime) and enriching/updating order records.
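Since this step is only planned, the following is a sketch of one possible approach rather than an implemented design. It assumes that cust_id, order_datetime, and order_total together identify an order, that exact matching after light normalization is acceptable, and that orders are held as a list of dicts; `match_key` and `integrate` are hypothetical names.

```python
def match_key(row):
    """Composite key built from the three fields named above (assumed to be unique)."""
    return (
        str(row["cust_id"]),
        str(row["order_datetime"]),
        round(float(row["order_total"]), 2),
    )


def integrate(orders, reconciled, enrich_fields):
    """Copy enrich_fields from reconciled records onto matching orders, in place.

    Returns the number of matched records and the list of unmatched ones.
    """
    index = {match_key(o): o for o in orders}
    matched, unmatched = 0, []
    for rec in reconciled:
        target = index.get(match_key(rec))
        if target is None:
            unmatched.append(rec)
            continue
        for field in enrich_fields:
            if field in rec:
                target[field] = rec[field]
        matched += 1
    return matched, unmatched


# Made-up records for illustration only.
orders = [{"cust_id": 101, "order_datetime": "2024-01-05 10:30:00", "order_total": "19.99"}]
reconciled = [
    {"cust_id": "101", "order_datetime": "2024-01-05 10:30:00", "order_total": 19.99,
     "segment": "retail"},
    {"cust_id": "999", "order_datetime": "2024-02-01 08:00:00", "order_total": 5.0,
     "segment": "wholesale"},
]
matched, unmatched = integrate(orders, reconciled, ["segment"])
```

Returning the unmatched records separately keeps them available for review, for example to decide whether a looser match on timestamps or amounts is needed.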

---

## Conclusion

This reconciliation task involved:
- Combining rule-based fuzzy logic with LLM-driven semantic alignment
- Parsing and verifying structured AI responses
- Applying domain reasoning to finalize mappings

The outcome was a schema-aligned dataset with explainable, traceable decisions across the pipeline; row-level integration remains the next step.
