# Cleaning `customer_funding_source` data

We already have ways to clean `micropayments` and `device_transactions`. Based on information we recently received from Littlepay, we may be able to clean `customer_funding_sources` as well.

A few things to note:
* Each `customer_funding_source` file is a full extract of the `customer_funding_source` table, not an increment, so records are duplicated across files.
* `customer_id`s may become inactive and be replaced. In these cases, both `customer_id` values will be present in the data export, but the active one will be set in the `principal_customer_id` field.

Maybe the best approach is to use the most recent file only and match with other tables on either the `principal_customer_id` or `funding_source_vault_id`.
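
A minimal sketch of the matching idea, in plain Python with invented values. It assumes that every row carries a `principal_customer_id` (falling back to `customer_id` if it is empty, which is a guess). The `micropayment` record and its `customer_id` field are hypothetical stand-ins for "another table".

```python
# Toy rows; the field names follow the export, the values are invented.
rows = [
    {"customer_id": "c-old", "principal_customer_id": "c-new", "funding_source_vault_id": "v-1"},
    {"customer_id": "c-new", "principal_customer_id": "c-new", "funding_source_vault_id": "v-1"},
    {"customer_id": "c-solo", "principal_customer_id": "c-solo", "funding_source_vault_id": "v-2"},
]


def principal_lookup(rows):
    """Map every customer_id (active or inactive) to its principal_customer_id.

    Assumption: an empty principal_customer_id means the customer_id is
    itself the active one.
    """
    return {
        r["customer_id"]: r["principal_customer_id"] or r["customer_id"]
        for r in rows
    }


lookup = principal_lookup(rows)

# Hypothetical record from another table that still carries the inactive ID.
micropayment = {"micropayment_id": "m-1", "customer_id": "c-old"}
resolved = lookup.get(micropayment["customer_id"])
print(resolved)
```

Resolving through the lookup first means a join on `principal_customer_id` would still catch records that were written under a since-replaced `customer_id`.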

In [1]:
from calitp import get_engine, query_sql, magics
# import pandas as pd
# import pyarrow

In [2]:
import os
os.environ['AIRFLOW_ENV'] = 'development'

Based on the information from Littlepay, to get the latest state of `customer_funding_sources` information, we should just need to keep the latest file.

In [3]:
%%sql

select max(_FILE_NAME) as fn
from payments.customer_funding_source

Unnamed: 0,fn
0,gs://gtfs-data-test/payments-processed/custome...


Except it's not just the one latest file we need, but the latest file from each synced batch of data. For example, `mst` and `sbmtd` each have their own batches of data, while `sacrt` and `clean-air-express` are grouped together under the `cal-itp` batch. Normally we would use the `participant_id` for this, but `customer_funding_sources` does not have a `participant_id`. But fret not, we can extract this information from the synced files' names. We've included views in our pipeline that add the source file name, as well as the account and exported datetime as parsed from the filename, to each record.

In [4]:
%%sql

select distinct 
    calitp_export_account,
    calitp_export_datetime,
    calitp_file_name
from payments.stg_enriched_customer_funding_source

Unnamed: 0,calitp_export_account,calitp_export_datetime,calitp_file_name
0,cal-itp,2021-09-20 08:23:00,gs://gtfs-data-test/payments-processed/custome...
1,cal-itp,2021-09-20 07:56:00,gs://gtfs-data-test/payments-processed/custome...
2,cal-itp,2021-10-15 08:05:00,gs://gtfs-data-test/payments-processed/custome...
3,cal-itp,2021-09-09 04:50:00,gs://gtfs-data-test/payments-processed/custome...
4,cal-itp,2021-10-18 23:15:00,gs://gtfs-data-test/payments-processed/custome...
...,...,...,...
263,mst,2021-08-25 05:45:00,gs://gtfs-data-test/payments-processed/custome...
264,mst,2021-08-13 05:36:00,gs://gtfs-data-test/payments-processed/custome...
265,mst,2021-08-14 05:31:00,gs://gtfs-data-test/payments-processed/custome...
266,mst,2021-08-16 05:38:00,gs://gtfs-data-test/payments-processed/custome...


Using this, we can get the latest export metadata for each account.

In [5]:
%%sql

create or replace table sandbox.latest_export_metadata
options(
  expiration_timestamp=timestamp_add(current_timestamp(), interval 3 hour)
) as (
    with
    
    latest_export_times as (
        select calitp_export_account,
               max(calitp_export_datetime) as calitp_export_datetime
        from payments.stg_enriched_customer_funding_source
        group by 1
    )
    
    select distinct
        calitp_export_account,
        calitp_export_datetime,
        calitp_file_name
    from payments.stg_enriched_customer_funding_source
    join latest_export_times using (calitp_export_account, calitp_export_datetime)
)

Unnamed: 0,calitp_export_account,calitp_export_datetime,calitp_file_name
0,mst,2021-11-06 05:00:00,gs://gtfs-data-test/payments-processed/custome...
1,cal-itp,2021-11-05 05:01:00,gs://gtfs-data-test/payments-processed/custome...
2,sbmtd,2021-11-06 05:07:00,gs://gtfs-data-test/payments-processed/custome...


This gives us the latest export date for each AWS account that we're pulling Littlepay data from. As a sanity check, I'd like to see whether all IDs are represented in the extracts with these dates. If we truly have full exports each time, then every ID should be represented in each file. Now, there are three ID fields in the table and I'm not sure which is the unique identifier; I'm wondering whether any of them is more likely to be unique. Maybe if one has significantly more unique values than the others...

In [6]:
%%sql

select
    count(distinct funding_source_id) as cnt_funding_source_id,
    count(distinct funding_source_vault_id) as cnt_funding_source_vault_id,
    count(distinct customer_id) as cnt_customer_id
from payments.stg_enriched_customer_funding_source

Unnamed: 0,cnt_funding_source_id,cnt_funding_source_vault_id,cnt_customer_id
0,4491,4169,4169


So, there's no clear unique-looking field here. In fact, I wonder whether the vault ID and the customer ID are 1-to-1. Let's check the number of unique (`funding_source_vault_id`, `customer_id`) pairs...

In [7]:
%%sql

select count(*) from (
    select distinct funding_source_vault_id, customer_id
    from payments.stg_enriched_customer_funding_source
) q

Unnamed: 0,f0_
0,4400


We end up with a different count: 4400 distinct pairs, versus 4169 distinct values for each ID on its own. So, even though the two IDs have the same number of unique values, they're not quite 1:1.
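
To see how two columns can have equal distinct counts and still not be 1:1, here is a toy example in plain Python. The IDs are invented; only the counting logic mirrors the queries above.

```python
from collections import defaultdict

# (funding_source_vault_id, customer_id) pairs; values are made up.
pairs = [
    ("v1", "c1"),
    ("v1", "c2"),  # vault v1 is linked to two customers
    ("v2", "c2"),  # customer c2 is linked to two vaults
]

n_vaults = len({vault for vault, _ in pairs})
n_customers = len({customer for _, customer in pairs})
n_pairs = len(set(pairs))

# Equal distinct counts per column, but more distinct pairs than either.
print(n_vaults, n_customers, n_pairs)

# Find the vault IDs that map to more than one customer.
customers_by_vault = defaultdict(set)
for vault, customer in pairs:
    customers_by_vault[vault].add(customer)

shared_vaults = sorted(v for v, cs in customers_by_vault.items() if len(cs) > 1)
print(shared_vaults)
```

A mapping is 1:1 only when the number of distinct pairs equals the number of distinct values on both sides. Any excess pairs point to IDs that are shared, which is worth inspecting before choosing a join key.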

Next, let's see how many `funding_source_id` values there are across the data set that don't exist in the latest export files.

In [8]:
%%sql

with

latest_funding_sources as (
    select distinct funding_source_id
    from payments.stg_enriched_customer_funding_source
    join sandbox.latest_export_metadata using (calitp_export_account, calitp_export_datetime)
)

select
    calitp_export_account,
    calitp_export_datetime,
    count(*)
from payments.stg_enriched_customer_funding_source
where funding_source_id not in (select funding_source_id from latest_funding_sources)
group by 1, 2
order by 2 desc, 1

Unnamed: 0,calitp_export_account,calitp_export_datetime,f0_
0,sbmtd,2021-11-05 05:05:00,6
1,mst,2021-11-05 04:59:00,6
2,sbmtd,2021-11-04 05:07:00,7
3,cal-itp,2021-11-04 05:02:00,13
4,mst,2021-11-04 05:01:00,18
...,...,...,...
260,mst,2021-06-21 05:16:00,9
261,mst,2021-06-20 05:29:00,19
262,mst,2021-06-19 06:30:00,23
263,mst,2021-06-18 05:29:00,20


What this tells me is that there are a bunch of funding source IDs that were in past files but are not in the latest file. _Quel dommage_. My first thought is that maybe we should keep the latest available record for each ID, i.e. essentially keep the latest version of each.
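
One caveat about the `not in` filter in the query above: in standard SQL, if the subquery returns even one null, `x not in (...)` never evaluates to true and the check silently returns no rows. Whether any of these ID columns contain nulls is not established here, so this is a precaution rather than a known problem. The sketch below reproduces the behavior with Python's built-in `sqlite3` and made-up tables (`all_ids`, `latest_ids`), and shows `not exists` as a null-safe alternative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table all_ids (id text);
    create table latest_ids (id text);
    insert into all_ids values ('a'), ('b');
    insert into latest_ids values ('a'), (null);  -- one null in the latest set
    """
)

# 'b' is missing from latest_ids, but the null makes NOT IN yield no rows.
not_in_rows = conn.execute(
    "select id from all_ids where id not in (select id from latest_ids)"
).fetchall()

# NOT EXISTS compares row by row, so the null does not mask 'b'.
not_exists_rows = conn.execute(
    """
    select id from all_ids a
    where not exists (select 1 from latest_ids l where l.id = a.id)
    """
).fetchall()

print(not_in_rows, not_exists_rows)
conn.close()
```

If nulls do turn up, either filter them out of the subquery (`where funding_source_id is not null`) or switch to `not exists`.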

Below I do the same check for `customer_id` and `funding_source_vault_id`.

In [9]:
%%sql

with

latest_funding_source_vaults as (
    select distinct funding_source_vault_id
    from payments.stg_enriched_customer_funding_source
    join sandbox.latest_export_metadata using (calitp_export_account, calitp_export_datetime)
)

select
    calitp_export_account,
    calitp_export_datetime,
    count(*)
from payments.stg_enriched_customer_funding_source
where funding_source_vault_id not in (select * from latest_funding_source_vaults)
group by 1, 2
order by 2 desc, 1

Unnamed: 0,calitp_export_account,calitp_export_datetime,f0_
0,sbmtd,2021-11-05 05:05:00,6
1,mst,2021-11-05 04:59:00,6
2,sbmtd,2021-11-04 05:07:00,7
3,cal-itp,2021-11-04 05:02:00,13
4,mst,2021-11-04 05:01:00,18
...,...,...,...
260,mst,2021-06-21 05:16:00,9
261,mst,2021-06-20 05:29:00,19
262,mst,2021-06-19 06:30:00,23
263,mst,2021-06-18 05:29:00,20


In [10]:
%%sql

with

latest_customers as (
    select distinct customer_id
    from payments.stg_enriched_customer_funding_source
    join sandbox.latest_export_metadata using (calitp_export_account, calitp_export_datetime)
)

select
    calitp_export_account,
    calitp_export_datetime,
    count(*)
from payments.stg_enriched_customer_funding_source
where customer_id not in (select * from latest_customers)
group by 1, 2
order by 2 desc, 1

Unnamed: 0,calitp_export_account,calitp_export_datetime,f0_
0,sbmtd,2021-11-05 05:05:00,6
1,mst,2021-11-05 04:59:00,6
2,sbmtd,2021-11-04 05:07:00,7
3,cal-itp,2021-11-04 05:02:00,13
4,mst,2021-11-04 05:01:00,18
...,...,...,...
260,mst,2021-06-21 05:16:00,9
261,mst,2021-06-20 05:29:00,19
262,mst,2021-06-19 06:30:00,23
263,mst,2021-06-18 05:29:00,20


Just spot checking, it looks like all three IDs have about the same number of values missing from the latest files. Now I want to know whether there are any gaps for any IDs. In other words, is an ID in an exported file one day, gone the next, and back on the third day? Ideally, if IDs can disappear, they'll stay gone.

To check this, I'll match each sighting of an ID in one file with the next file the ID was seen in. Since the exports aren't always on consecutive days, I'll create a separate field for the export rank of the export files.
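
As a toy illustration of this rank-and-compare idea, here is a plain Python sketch with invented export dates and IDs for a single account. Exports are ranked newest-first (rank 1 is the latest), and an ID has a gap when two of its consecutive sightings are more than one rank apart.

```python
from collections import defaultdict

# Hypothetical exports for one account: export date -> IDs seen in that file.
# Note there is no export on 2021-06-20, so calendar days alone would mislead.
exports = {
    "2021-06-18": {"a", "b"},
    "2021-06-19": {"a"},
    "2021-06-21": {"a", "b"},  # "b" skipped the 06-19 export, then came back
}

# Rank exports newest-first, so rank 1 is the latest export.
rank = {dt: i + 1 for i, dt in enumerate(sorted(exports, reverse=True))}

# Collect each ID's sightings in chronological order.
sightings = defaultdict(list)
for dt in sorted(exports):
    for the_id in sorted(exports[dt]):
        sightings[the_id].append(dt)

# Compare each sighting with the previous one (what lag() does in SQL).
gaps = []
for the_id, dts in sightings.items():
    for prev_dt, dt in zip(dts, dts[1:]):
        if rank[prev_dt] - rank[dt] > 1:
            gaps.append((the_id, prev_dt, dt))

print(gaps)
```

ID `a` shows no gap even though the dates jump from 06-19 to 06-21, because those two exports are adjacent in rank. That is the reason for ranking the exports rather than comparing dates directly.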

In [11]:
%%sql

create or replace table sandbox.ranked_customer_funding_source_exports
options(
  expiration_timestamp=timestamp_add(current_timestamp(), interval 3 hour)
) as (
    with

    customer_funding_source_exports as (
        select distinct
            calitp_export_account,
            calitp_export_datetime
        from payments.stg_enriched_customer_funding_source
    )

    select
        calitp_export_account,
        calitp_export_datetime,
        rank() over (partition by calitp_export_account order by calitp_export_datetime desc) as calitp_export_rank
    from customer_funding_source_exports
    order by 3, 2, 1
)

Unnamed: 0,calitp_export_account,calitp_export_datetime,calitp_export_rank
0,cal-itp,2021-11-05 05:01:00,1
1,mst,2021-11-06 05:00:00,1
2,sbmtd,2021-11-06 05:07:00,1
3,cal-itp,2021-11-04 05:02:00,2
4,mst,2021-11-05 04:59:00,2
...,...,...,...
263,mst,2021-06-21 05:16:00,143
264,mst,2021-06-20 05:29:00,144
265,mst,2021-06-19 06:30:00,145
266,mst,2021-06-18 05:29:00,146


Starting with the `funding_source_vault_id` field, we can now check for gaps...

In [12]:
%%sql

with

unique_ids as (
    select distinct
        calitp_export_account,
        calitp_export_datetime,
        funding_source_vault_id as the_id
    from payments.stg_enriched_customer_funding_source
),

gap_finder as (
    select
        calitp_export_account,
        the_id,
        calitp_export_datetime dt1,
        calitp_export_rank as rank1,
        lag(calitp_export_datetime) over (partition by calitp_export_account, the_id order by calitp_export_datetime) as dt2,
        lag(calitp_export_rank) over (partition by calitp_export_account, the_id order by calitp_export_datetime) as rank2
    from unique_ids
    join sandbox.ranked_customer_funding_source_exports using (calitp_export_account, calitp_export_datetime)
)

select * from gap_finder
where rank2 is not null
and rank2 - rank1 > 1

Unnamed: 0,calitp_export_account,the_id,dt1,rank1,dt2,rank2
0,cal-itp,16921013-59f4-4a91-930e-355a35adb1d8,2021-09-18 05:00:00,52,2021-09-05 04:47:00,63
1,cal-itp,b0c6c834-94c5-441a-8a2c-0bf457399eef,2021-10-15 08:05:00,22,2021-09-28 04:36:00,40
2,cal-itp,ce2cb4dc-99a3-468e-85d4-669903012b2f,2021-10-15 08:05:00,22,2021-10-05 05:18:00,33
3,cal-itp,d32932ac-9de2-4127-befe-692886147158,2021-10-15 08:05:00,22,2021-10-08 04:45:00,30
4,mst,06706e65-08ef-40bc-b50c-8f174112574c,2021-08-05 05:33:00,98,2021-06-29 05:31:00,135
...,...,...,...,...,...,...
12545,sbmtd,2f49b0ea-7470-47f6-bd8e-c60446f245d6,2021-10-15 08:06:00,24,2021-09-21 04:59:00,49
12546,sbmtd,3cd46b28-6f27-4b98-9672-bc38b40d7226,2021-10-15 08:06:00,24,2021-09-21 04:59:00,49
12547,sbmtd,6cd856d8-a957-417c-97f9-0418f6688af5,2021-10-15 08:06:00,24,2021-09-21 04:59:00,49
12548,sbmtd,f178071c-0300-4c7e-872d-df9abc90c921,2021-10-15 08:06:00,24,2021-09-21 04:59:00,49


Same as above, except with `funding_source_id`...

In [13]:
%%sql

with

unique_ids as (
    select distinct
        calitp_export_account,
        calitp_export_datetime,
        funding_source_id as the_id
    from payments.stg_enriched_customer_funding_source
),

gap_finder as (
    select
        calitp_export_account,
        the_id,
        calitp_export_datetime dt1,
        calitp_export_rank as rank1,
        lag(calitp_export_datetime) over (partition by calitp_export_account, the_id order by calitp_export_datetime) as dt2,
        lag(calitp_export_rank) over (partition by calitp_export_account, the_id order by calitp_export_datetime) as rank2
    from unique_ids
    join sandbox.ranked_customer_funding_source_exports using (calitp_export_account, calitp_export_datetime)
)

select * from gap_finder
where rank2 is not null
and rank2 - rank1 > 1

Unnamed: 0,calitp_export_account,the_id,dt1,rank1,dt2,rank2
0,sbmtd,3a679162-8348-4780-a83f-0d800a0a88d5,2021-10-02 05:06:00,38,2021-09-20 08:24:00,50
1,sbmtd,8773d214-6209-4126-b49a-1e41120f23c9,2021-10-02 05:06:00,38,2021-09-20 08:24:00,50
2,sbmtd,dbc4a2b3-b2b3-480a-957d-bce8a78533a4,2021-10-02 05:06:00,38,2021-09-20 08:24:00,50
3,cal-itp,10f96234-2613-4c55-bbcc-e79d1cf462a8,2021-11-03 04:59:00,3,2021-10-19 04:37:00,17
4,cal-itp,12e33e51-9a75-4054-9b7d-c4b48c44e40b,2021-10-02 05:03:00,36,2021-09-20 08:23:00,48
...,...,...,...,...,...,...
13084,mst,101be6d3-e69a-4c43-be4a-8c1545f96795,2021-07-31 05:33:00,103,2021-07-04 05:18:00,130
13085,mst,0f9242da-d033-4c16-bd95-a79151b22b4f,2021-07-31 05:33:00,103,2021-07-18 05:29:00,116
13086,mst,04633f8b-6472-4bba-ae86-86fc2b955d43,2021-07-31 05:33:00,103,2021-06-30 05:29:00,134
13087,mst,1563864f-9e6a-49d7-a85f-ca8de2b78c8c,2021-07-31 05:33:00,103,2021-06-18 02:24:00,147


And, finally, `customer_id`...

In [14]:
%%sql

with

unique_ids as (
    select distinct
        calitp_export_account,
        calitp_export_datetime,
        customer_id as the_id
    from payments.stg_enriched_customer_funding_source
),

gap_finder as (
    select
        calitp_export_account,
        the_id,
        calitp_export_datetime dt1,
        calitp_export_rank as rank1,
        lag(calitp_export_datetime) over (partition by calitp_export_account, the_id order by calitp_export_datetime) as dt2,
        lag(calitp_export_rank) over (partition by calitp_export_account, the_id order by calitp_export_datetime) as rank2
    from unique_ids
    join sandbox.ranked_customer_funding_source_exports using (calitp_export_account, calitp_export_datetime)
)

select * from gap_finder
where rank2 is not null
and rank2 - rank1 > 1

Unnamed: 0,calitp_export_account,the_id,dt1,rank1,dt2,rank2
0,cal-itp,2136327d-a5de-4ab2-a99a-e519794717cf,2021-09-18 05:00:00,52,2021-09-09 04:50:00,60
1,cal-itp,413f4ca4-40ff-4570-b329-d7755a80ad95,2021-10-15 08:05:00,22,2021-09-23 04:59:00,45
2,cal-itp,4516f204-29f2-4d40-95a3-62253501f00c,2021-09-09 04:50:00,60,2021-09-06 04:45:00,62
3,cal-itp,4516f204-29f2-4d40-95a3-62253501f00c,2021-09-18 05:00:00,52,2021-09-09 04:50:00,60
4,cal-itp,d759a318-d607-412f-8181-cc05dcad04af,2021-09-17 04:56:00,53,2021-09-03 04:46:00,65
...,...,...,...,...,...,...
12642,sbmtd,9f8178cf-a0d6-48a9-9176-5551cef8570a,2021-10-15 08:06:00,24,2021-09-21 04:59:00,49
12643,sbmtd,ee86b3ec-f630-4ac4-b1ff-a56067942639,2021-10-15 08:06:00,24,2021-09-21 04:59:00,49
12644,sbmtd,4f6d16ef-1ab2-4264-bac8-8bf67af1ddee,2021-10-15 08:06:00,24,2021-09-21 04:59:00,49
12645,sbmtd,13a06897-fa3b-4193-835e-07cdae8bb3b7,2021-10-15 08:06:00,24,2021-09-21 04:59:00,49


So, there are a lot of gaps.

**In that case, we could maybe just use the latest export on a per-record basis. There's still a question of which ID field(s) we should use as the unique identifier.**
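
A minimal sketch of what "latest export on a per-record basis" could look like, in plain Python with invented rows. The choice of `funding_source_id` as the key is only a placeholder, since the identifier question is still open, and the helper name `latest_per_id` is hypothetical.

```python
# Toy records; field names follow the staging view, values are invented.
records = [
    {"funding_source_id": "f1", "calitp_export_datetime": "2021-11-04 05:02:00", "customer_id": "c1"},
    {"funding_source_id": "f1", "calitp_export_datetime": "2021-11-05 05:01:00", "customer_id": "c1b"},
    # f2 vanished from later exports; its last known record is still kept.
    {"funding_source_id": "f2", "calitp_export_datetime": "2021-10-15 08:05:00", "customer_id": "c2"},
]


def latest_per_id(records, id_field):
    """Keep, for each value of id_field, the record from its most recent export.

    Timestamps are compared as strings, which is safe only because they all
    share the same zero-padded 'YYYY-MM-DD HH:MM:SS' layout.
    """
    latest = {}
    for rec in records:
        key = rec[id_field]
        current = latest.get(key)
        if current is None or rec["calitp_export_datetime"] > current["calitp_export_datetime"]:
            latest[key] = rec
    return latest


latest = latest_per_id(records, "funding_source_id")
print(sorted(latest))
print(latest["f1"]["customer_id"])
```

In SQL the same effect would typically come from `row_number() over (partition by <id> order by calitp_export_datetime desc)` and keeping the rows numbered 1. If IDs can collide across accounts, the key would also need to include `calitp_export_account`.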