# Data Clean-up Assumptions

This notebook documents the assumptions we made in developing cleaned versions of the Littlepay data tables for analysis. The cleaned tables that we've developed so far are for:
* `device_transactions`
* `micropayment_adjustments`
* `micropayment_device_transactions`
* `micropayments`
* `product_data`

For `micropayment_adjustments` and `micropayment_device_transactions`

In [2]:
from calitp import magics
import os
os.environ['AIRFLOW_ENV'] = 'development'

## Device Transactions

### Summary
* **Unique key:** `littlepay_transaction_id`
* **De-dupe criteria:** Keep record in newest export file; if multiple in newest export, keep record with latest `transaction_date_time_utc`

We're assuming that the `littlepay_transaction_id` field should be a unique key. When a `littlepay_transaction_id` shows up in multiple data exports (which happens fairly often: roughly 9% of IDs, per the counts below), we keep only the newest record.

In [8]:
%%sql

with

duplicated_ids as (
    select littlepay_transaction_id
    from payments.device_transactions
    group by littlepay_transaction_id
    having count(*) > 1
)

select 'Total Unique IDs' as label, count(distinct littlepay_transaction_id) as n
from payments.device_transactions

union all

select 'Duplicated IDs' as label, count(distinct littlepay_transaction_id) as n
from payments.device_transactions
join duplicated_ids using (littlepay_transaction_id)

| label | n |
| --- | --- |
| Total Unique IDs | 37530 |
| Duplicated IDs | 3418 |
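
The de-dupe rule from the summary can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual implementation: `export_ts` is a hypothetical field standing in for whatever identifies the export file a record came from, and timestamps are assumed to be ISO-8601 strings in a uniform format, so that string comparison matches chronological order.

```python
def dedupe_latest(records, key, export_field="export_ts", tiebreak_field=None):
    """Keep one record per key: newest export first, then latest tiebreak value.

    `export_field` is a hypothetical column naming the export a record came from.
    """
    kept = {}
    for rec in records:
        # Rank by export first; the tiebreak only matters within the same export.
        rank = (rec[export_field], rec[tiebreak_field] if tiebreak_field else "")
        current = kept.get(rec[key])
        if current is None or rank > current[0]:
            kept[rec[key]] = (rank, rec)
    return [rec for _, rec in kept.values()]


# Made-up rows: ID "a" appears in two exports, twice in the newest one.
rows = [
    {"littlepay_transaction_id": "a", "export_ts": "2021-06-01",
     "transaction_date_time_utc": "2021-05-31T10:00:00Z"},
    {"littlepay_transaction_id": "a", "export_ts": "2021-06-02",
     "transaction_date_time_utc": "2021-05-31T10:00:00Z"},
    {"littlepay_transaction_id": "a", "export_ts": "2021-06-02",
     "transaction_date_time_utc": "2021-05-31T10:05:00Z"},
    {"littlepay_transaction_id": "b", "export_ts": "2021-06-01",
     "transaction_date_time_utc": "2021-05-31T11:00:00Z"},
]

deduped = dedupe_latest(
    rows,
    key="littlepay_transaction_id",
    tiebreak_field="transaction_date_time_utc",
)
```

The same helper would apply to the other tables by swapping the key and tiebreak field, or by omitting the tiebreak where the criteria only mention the newest export file.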


## Micropayments

### Summary
* **Unique key:** `micropayment_id`
* **De-dupe criteria:** Keep record in newest export file; if multiple in newest export, keep record with latest `transaction_time`


## Micropayment Device Transactions

### Summary
* **Unique key:** MD5 hash of `micropayment_id` and `littlepay_transaction_id`
* **Invariants:**
  * `littlepay_transaction_id` should be unique (i.e. many:1 device transactions to micropayments)
  * The timestamp on a micropayment record should be at least as late as the timestamps of its associated transactions
* **De-dupe criteria:** Keep record in newest export file
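
As a rough sketch, the hashed key and the two invariants could be checked like this. How the two IDs are combined before hashing is an assumption here (joined with a `|` separator); the actual tables may concatenate them differently. The function names and sample data are hypothetical, and timestamps are assumed to be uniformly formatted ISO-8601 strings.

```python
import hashlib
from collections import Counter


def composite_key(micropayment_id, littlepay_transaction_id, sep="|"):
    """MD5 hex digest of the two IDs. The separator is an assumption."""
    raw = f"{micropayment_id}{sep}{littlepay_transaction_id}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def repeated_transaction_ids(links):
    """Transaction IDs linked more than once (violates the uniqueness invariant)."""
    counts = Counter(link["littlepay_transaction_id"] for link in links)
    return sorted(tid for tid, n in counts.items() if n > 1)


def timestamp_violations(links, micropayment_times, transaction_times):
    """Links whose micropayment timestamp is earlier than the transaction's."""
    return [
        link for link in links
        if micropayment_times[link["micropayment_id"]]
        < transaction_times[link["littlepay_transaction_id"]]
    ]


# Made-up data: "t2" is linked to two micropayments, and "m2" predates "t2".
links = [
    {"micropayment_id": "m1", "littlepay_transaction_id": "t1"},
    {"micropayment_id": "m1", "littlepay_transaction_id": "t2"},
    {"micropayment_id": "m2", "littlepay_transaction_id": "t2"},
]
micropayment_times = {"m1": "2021-06-01T09:00:00Z", "m2": "2021-06-01T07:00:00Z"}
transaction_times = {"t1": "2021-06-01T08:00:00Z", "t2": "2021-06-01T08:30:00Z"}

keys = [composite_key(l["micropayment_id"], l["littlepay_transaction_id"]) for l in links]
repeated = repeated_transaction_ids(links)
late = timestamp_violations(links, micropayment_times, transaction_times)
```

Using an explicit separator avoids collisions between ID pairs such as (`ab`, `c`) and (`a`, `bc`), which plain concatenation would hash identically.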


## Transaction Types

We associate a new field with each `device_transactions` record through a `device_transaction_types` table. The Littlepay data has no notion of a transaction type, but this field is useful for determining whether a transaction represents a complete trip in itself or just one side of a tap on/off trip. The values we assign to the `transaction_type` field are `'on'`, `'off'`, or `'single'`:

* a transaction linked to a micropayment that has `charge_type = 'flat_fare'` will have `transaction_type = 'single'`
* a transaction linked to a micropayment that has `charge_type = 'pending_charge_fare'` will have `transaction_type = 'on'`
* a transaction linked to a micropayment that has `charge_type = 'complete_variable_fare'` and is the _earlier_ of the two transactions linked to the micropayment according to the `transaction_date_time_utc` will have `transaction_type = 'on'`
* a transaction linked to a micropayment that has `charge_type = 'complete_variable_fare'` and is the _later_ of the two transactions linked to the micropayment according to the `transaction_date_time_utc` will have `transaction_type = 'off'`
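
The rules above can be sketched as a small function that takes one micropayment's `charge_type` and its linked transactions. This is an illustration only: the function name and input shape are hypothetical, timestamps are assumed to be uniformly formatted ISO-8601 strings, and raising an error for unexpected inputs is our choice for the sketch rather than documented behavior.

```python
def assign_transaction_types(charge_type, transactions):
    """Map each linked transaction ID to 'on', 'off', or 'single'.

    `transactions` is a list of (littlepay_transaction_id, transaction_date_time_utc)
    pairs linked to a single micropayment.
    """
    ordered = sorted(transactions, key=lambda t: t[1])  # earliest first
    if charge_type == "flat_fare":
        return {tid: "single" for tid, _ in ordered}
    if charge_type == "pending_charge_fare":
        return {tid: "on" for tid, _ in ordered}
    if charge_type == "complete_variable_fare":
        if len(ordered) != 2:
            # Assumption: a completed variable fare links exactly two transactions.
            raise ValueError(f"expected 2 transactions, got {len(ordered)}")
        (first_id, _), (second_id, _) = ordered
        return {first_id: "on", second_id: "off"}
    raise ValueError(f"unrecognized charge_type: {charge_type!r}")


# Made-up example: the later tap is listed first to show that order is by timestamp.
variable = assign_transaction_types(
    "complete_variable_fare",
    [("t2", "2021-06-01T08:30:00Z"), ("t1", "2021-06-01T08:00:00Z")],
)
flat = assign_transaction_types("flat_fare", [("t3", "2021-06-01T09:00:00Z")])
pending = assign_transaction_types("pending_charge_fare", [("t4", "2021-06-01T10:00:00Z")])
```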


## Micropayment Adjustments

### Summary
* **Unique key:** MD5 hash of `micropayment_id` and `adjustment_id`
* **De-dupe criteria:** Keep record in newest export file


## Product Data

### Summary
* **Unique key:** `product_id`
* **De-dupe criteria:** Keep record in newest export file


## Customers

We split the `customer_funding_source` table from Littlepay into tables for `customers` and `customer_funding_source_vaults`. The IDs available in the original table are very unpredictable, so instead we cherry-pick the information that is useful to us and create new quasi-dimensions.

For the `customers` table, we're primarily concerned with associating a `customer_id` with a `principal_customer_id`.

### Summary
* **Unique key:** `customer_id`
* **Invariants:**
  * No null `principal_customer_id` values
  * Each `principal_customer_id` should only ever have itself as a `principal_customer_id`.
* **De-dupe criteria:** Keep record in newest export file; select distinct `customer_id` and `principal_customer_id` from the latest export. There are a number of pairs that appear multiple times, but each `customer_id` should be associated with a single `principal_customer_id` value.
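
A minimal sketch of how these invariants might be checked against the distinct pairs from the latest export. The function name, the result labels, and the sample pairs are all hypothetical; the third check reflects the unique-key requirement on `customer_id`.

```python
from collections import defaultdict


def check_customer_invariants(pairs):
    """Return violating customer IDs, grouped by the invariant they break.

    `pairs` is an iterable of (customer_id, principal_customer_id) tuples.
    """
    distinct = set(pairs)  # mirrors the "select distinct" step
    principals_by_customer = defaultdict(set)
    for customer_id, principal_id in distinct:
        principals_by_customer[customer_id].add(principal_id)
    all_principals = {p for _, p in distinct if p is not None}
    return {
        # No null principal_customer_id values.
        "null_principal": sorted({c for c, p in distinct if p is None}),
        # A customer that acts as a principal must be its own principal.
        "principal_not_self": sorted(
            {c for c, p in distinct if c in all_principals and p != c}
        ),
        # customer_id is the unique key, so it must map to one principal.
        "multiple_principals": sorted(
            c for c, ps in principals_by_customer.items() if len(ps) > 1
        ),
    }


# Made-up pairs: "c3" has no principal; "c4" and "c5" point at each other.
pairs = [
    ("c1", "c1"),
    ("c2", "c1"),
    ("c2", "c1"),  # repeated pair, collapsed by the distinct step
    ("c3", None),
    ("c4", "c5"),
    ("c5", "c4"),
]
violations = check_customer_invariants(pairs)
```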


## Customer Funding Source Vaults

We split the `customer_funding_source` table from Littlepay into tables for `customers` and `customer_funding_source_vaults`. The IDs available in the original table are very unpredictable, so instead we cherry-pick the information that is useful to us and create new quasi-dimensions.

For the `customer_funding_source_vaults` table, we construct a slowly changing dimension.

### Summary
* **Key:** `funding_source_vault_id`, valid from the timestamp of one `customer_funding_sources_export` to the next.
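
One way to picture the validity windows is the sketch below. It rests on assumptions not confirmed above: `export_ts` is a hypothetical field holding the export timestamp, `valid_from` and `valid_to` are illustrative column names, a window is taken to close at the next export in which the same `funding_source_vault_id` appears, and the most recent row is left open-ended with `valid_to = None`.

```python
from collections import defaultdict


def build_validity_windows(rows):
    """Attach valid_from / valid_to to each row, per funding_source_vault_id."""
    by_vault = defaultdict(list)
    for row in rows:
        by_vault[row["funding_source_vault_id"]].append(row)

    windows = []
    for vault_rows in by_vault.values():
        vault_rows.sort(key=lambda r: r["export_ts"])  # oldest export first
        # Pair each row with its successor; the last row has no successor.
        for current, following in zip(vault_rows, vault_rows[1:] + [None]):
            windows.append({
                **current,
                "valid_from": current["export_ts"],
                "valid_to": following["export_ts"] if following else None,
            })
    return windows


# Made-up rows: vault "v1" appears in two exports, "v2" in one.
rows = [
    {"funding_source_vault_id": "v1", "export_ts": "2021-06-02T00:00:00Z", "some_attribute": "new"},
    {"funding_source_vault_id": "v1", "export_ts": "2021-06-01T00:00:00Z", "some_attribute": "old"},
    {"funding_source_vault_id": "v2", "export_ts": "2021-06-01T00:00:00Z", "some_attribute": "only"},
]
windows = build_validity_windows(rows)
```

With windows in this shape, looking up the attributes in effect at a given time amounts to filtering on `valid_from <= t` and (`valid_to` is null or `t < valid_to`).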
