Notebook purpose:

- Perform sample selection checks.

In [2]:
import sys

sys.path.append("/Users/fgu/dev/projects/mdb_eval")

import pandas as pd

import src.data.aggregators as agg
import src.data.selectors as sl
import src.helpers.data as hd

pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)
pd.set_option("max_colwidth", None)
%load_ext autoreload
%autoreload 2

# Selection on income

- In our selection procedure, we first drop users who signed up before April 2017, users who don't have enough pre- and post-signup data, and users who don't have both a current and a savings account. After that, we drop users with less than £5,000 of annual income, which removes about 55 percent of the remaining users from the sample.

- Here, we check whether requiring that level of income has a similar effect on sample size when we apply the criterion to the full sample of users. This ensures that low income is not a feature specific to users who passed the previous selection criteria.


Load a 1 percent sample of the raw data. (Sample with name `X11` is the sample with all users whose ids end in `11`.)

In [3]:
dft = hd.read_txn_data("X11")
hd.inspect(dft)

Time for read_txn_data                 : 3.75 minutes
shape: (6,653,551, 35), users: 2734


Unnamed: 0,date,user_id,amount,desc,merchant,tag_group,tag_spend,user_registration_date,account_created,account_id,account_last_refreshed,account_provider,account_type,birth_year,data_warehouse_date_created,data_warehouse_date_last_updated,id,is_debit,is_female,is_sa_flow,is_salary_pmt,is_urban,latest_balance,lsoa,merchant_business_line,msoa,postcode,region_name,salary_range,tag,tag_auto,tag_manual,tag_up,updated_flag,ym
0,2012-01-03,11,69.75,david lloyd <mdbremoved>,david lloyd,spend,sports,2010-06-30,1900-01-01,303733,2014-07-24 11:05:00,lloyds,current,1954.0,2014-07-18,2017-08-15,80656,True,0.0,False,False,1.0,150.029999,e01015428,david lloyd,e02003207,bh15 4,south west,10k to 20k,hobbies,gym membership,no tag,gym membership,u,2012-01
1,2012-01-03,11,96.400002,sky digital xxxxxxxxxx9317,sky,spend,"entertainment, tv, media",2010-06-30,1900-01-01,303733,2014-07-24 11:05:00,lloyds,current,1954.0,2014-07-18,2017-08-15,80654,True,0.0,False,False,1.0,150.029999,e01015428,sky,e02003207,bh15 4,south west,10k to 20k,services,"entertainment, tv, media",no tag,media bundle,u,2012-01
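
As an illustration of why ids ending in `11` give roughly a 1 percent sample, the selection rule can be sketched on a toy frame. This is not the code behind `hd.read_txn_data`; the helper name `id_suffix_sample` and the toy ids are assumptions made for this example only.

```python
import pandas as pd


def id_suffix_sample(df: pd.DataFrame, suffix: str = "11") -> pd.DataFrame:
    """Hypothetical helper: keep rows whose user_id ends in `suffix`."""
    # Compare on the string representation so that short ids such as 11
    # are handled in the same way as long ones such as 589711.
    ends_with_suffix = df["user_id"].astype(str).str.endswith(suffix)
    return df[ends_with_suffix]


# Toy ids only, not real users.
toy = pd.DataFrame({"user_id": [11, 111, 589711, 12, 110, 1100]})
sample_ids = id_suffix_sample(toy)["user_id"].tolist()
print(sample_ids)
```

If the last two digits of user ids are roughly uniformly distributed, one in a hundred users ends up in each such sample.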


Calculate month income. (Month income is calculated as the mean monthly income in a given calendar year. See `src/data/aggregators.py`.)

In [4]:
user_month_inc = agg.income(dft).reset_index()
user_month_inc.tail(10)

Unnamed: 0,user_id,ym,month_income
79404,589711,2019-11,1408.872559
79405,589711,2019-12,1408.872559
79406,589711,2020-01,198.547501
79407,589711,2020-02,198.547501
79408,589711,2020-03,198.547501
79409,589711,2020-04,198.547501
79410,589711,2020-05,198.547501
79411,589711,2020-06,198.547501
79412,589711,2020-07,198.547501
79413,589711,2020-08,198.547501
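
The yearly averaging described above can be sketched as follows. This is a minimal illustration, not the implementation in `src/data/aggregators.py`: the function name `monthly_income_sketch` is hypothetical, and it assumes the input contains only income transactions with positive amounts and a datetime `date` column.

```python
import pandas as pd


def monthly_income_sketch(income_txns: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: mean monthly income per user and calendar year."""
    df = income_txns.copy()
    df["year"] = df["date"].dt.year
    df["ym"] = df["date"].dt.strftime("%Y-%m")

    # Total income per user and month.
    month_totals = df.groupby(["user_id", "year", "ym"], as_index=False)[
        "amount"
    ].sum()

    # Average the monthly totals within each user-year and assign that
    # average to every month of the year.
    month_totals["month_income"] = month_totals.groupby(["user_id", "year"])[
        "amount"
    ].transform("mean")
    return month_totals[["user_id", "ym", "month_income"]]


# Toy data only, not taken from the sample.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 1],
        "date": pd.to_datetime(
            ["2020-01-05", "2020-01-20", "2020-02-10", "2021-01-03"]
        ),
        "amount": [60.0, 40.0, 300.0, 50.0],
    }
)
result = monthly_income_sketch(toy)
print(result)
```

In this sketch, months without any income transaction do not enter the average, which may differ from how the project code treats them. The pattern matches the output above in one respect: `month_income` is constant within a user-year and changes at the year boundary.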


Drop users with annual incomes of less than £5,000. (The code below calls the unwrapped `year_income` function via `__wrapped__` to bypass the wrappers used in `src/data/selectors.py`.)

In [5]:
selected = sl.year_income.__wrapped__(user_month_inc)
selected.head(3)

Unnamed: 0,user_id,ym,month_income
104,111,2012-01,1661.16748
105,111,2012-02,1661.16748
106,111,2012-03,1661.16748
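
For readers unfamiliar with `__wrapped__`: `functools.wraps` stores the original function under that attribute, so a decorated function can be called without its decorator. The sketch below shows the mechanism with a hypothetical `counter` decorator and a toy `keep_positive` function; the actual wrappers in `src/data/selectors.py` are not shown in this notebook and may do something different.

```python
import functools


def counter(func):
    """Hypothetical decorator that counts calls made through the wrapper."""

    @functools.wraps(func)  # sets wrapper.__wrapped__ = func
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@counter
def keep_positive(values):
    """Toy selector: keep strictly positive values."""
    return [v for v in values if v > 0]


# Goes through the wrapper, so the call is counted.
wrapped_result = keep_positive([-1, 2, 3])

# Bypasses the wrapper: same result, but no call is counted.
raw_result = keep_positive.__wrapped__([-1, 2, 3])

print(wrapped_result, raw_result, keep_positive.calls)
```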


Calculate drop in number of unique users.

In [6]:
users_sel = selected.user_id.nunique()
users_raw = dft.user_id.nunique()
print(f"{(users_raw - users_sel) / users_raw:.1%}")

64.8%
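
The filter-and-compare step can be reproduced on toy data to make the arithmetic explicit. This is a sketch under stated assumptions, not the logic of `sl.year_income`: it assumes annual income is 12 times `month_income` and that a user is kept only if every observed month meets the threshold. The names `year_income_sketch` and `share_dropped` are hypothetical.

```python
import pandas as pd

MIN_YEAR_INCOME = 5_000  # threshold from the text, in pounds


def year_income_sketch(df, min_income=MIN_YEAR_INCOME):
    """Hypothetical filter: keep users whose annual income always meets the threshold."""
    # Assumption: annualised income is 12 times the mean monthly income.
    year_income = df["month_income"] * 12
    # Assumption: a single low-income year is enough to drop the user.
    user_min = year_income.groupby(df["user_id"]).transform("min")
    return df[user_min >= min_income]


def share_dropped(raw, selected):
    """Share of unique users in `raw` that are missing from `selected`."""
    n_raw = raw["user_id"].nunique()
    n_sel = selected["user_id"].nunique()
    return (n_raw - n_sel) / n_raw


# Toy data only: users 2 and 4 fall below £5,000 per year.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 3, 3, 4, 4],
        "ym": ["2019-01", "2019-02"] * 4,
        "month_income": [1500.0, 1500.0, 200.0, 200.0, 900.0, 900.0, 300.0, 300.0],
    }
)
toy_selected = year_income_sketch(toy)
toy_share = share_dropped(toy, toy_selected)
print(f"{toy_share:.1%}")
```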


Conclusion:

- Applying the income criterion to the raw data drops 64.8 percent of users, which is more than the roughly 55 percent dropped when it is applied after the earlier selection steps. The earlier selection criteria therefore don't select for low-income users.