Notebook purpose:

- Explore the feasibility of identifying the number of children per user from child benefit payments

Info:

- [Historical child benefit rates](https://revenuebenefits.org.uk/child-benefit/guidance/how-much-can-your-client-get/rates-and-tables/)

- [UK tax year runs from 6 April to 5 April](https://www.gov.uk/self-assessment-tax-returns/deadlines)
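
Since the rates in the first link are listed per tax year, each payment date has to be mapped to the tax year it falls in. A minimal sketch of that mapping, assuming the 6 April boundary from the second link; the function name `tax_year_start` is hypothetical and not part of the `entropy` package:

```python
from datetime import date


def tax_year_start(d: date) -> int:
    """Return the calendar year in which the UK tax year containing `d` starts.

    Assumes tax years run from 6 April to 5 April of the following year.
    """
    return d.year if d >= date(d.year, 4, 6) else d.year - 1


# 9 April 2019 falls in tax year 2019/20; 5 April 2019 is still in 2018/19.
print(tax_year_start(date(2019, 4, 9)))  # 2019
print(tax_year_start(date(2019, 4, 5)))  # 2018
```

The implementation further down approximates this with April-to-March periods, which differs only for dates between 1 and 5 April.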



In [6]:
import sys

import numpy as np
import pandas as pd
import s3fs
import scipy
import seaborn as sns

sys.path.append("/Users/fgu/dev/projects/entropy")
import entropy.data.aggregators as ag
import entropy.data.cleaners as cl
import entropy.data.make_data as md
import entropy.data.selectors as sl
import entropy.data.validators as vl
import entropy.helpers.aws as ha
import entropy.helpers.data as hd
import entropy.helpers.helpers as hh

pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)
pd.set_option("max_colwidth", None)
%load_ext autoreload
%autoreload 2

fs = s3fs.S3FileSystem(profile="3di")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [7]:
df = hd.read_txn_data("777")

Time for read_txn_data                 : 2.41 seconds


## Identifying child benefit payments in the data

In [9]:
def child_benefits(df):
    return (
        df.loc[df.tag_auto.eq("family benefits")]
        .set_index("date")
        .loc["Apr 2019":"March 2020"]
        .sort_values(["user_id", "date"])
    )


cb = child_benefits(df)
cb.head(3)

date,user_id,amount,desc,merchant,tag_group,tag,account_id,account_last_refreshed,account_provider,account_type,debit,female,id,is_urban,latest_balance,logins,postcode,region_name,tag_auto,yob
2019-04-09,777,-137.600006,bank credit <mdbremoved> -chb xxxxxx xxxx1802,,income,benefits,1419376,2019-09-08 07:52:00,nationwide,current,False,0.0,606532064,1.0,1729.390015,0.0,wa1 4,north west,family benefits,1969.0
2019-05-07,777,-137.600006,bank credit <mdbremoved> -chb xxxxxx xxxx1802,,income,benefits,1419376,2019-09-08 07:52:00,nationwide,current,False,0.0,606532073,1.0,1729.390015,0.0,wa1 4,north west,family benefits,1969.0
2019-06-04,777,-137.600006,bank credit <mdbremoved> -chb xxxxxx xxxx1802,,income,benefits,1419376,2019-09-08 07:52:00,nationwide,current,False,0.0,606532085,1.0,1729.390015,0.0,wa1 4,north west,family benefits,1969.0


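In tax year 2019/2020, the weekly allowance was £20.70 for the first child and £13.70 for each additional child. As allowances are paid in 4-week intervals, we'd expect to find the following amounts for families with one child, two children, and so on: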

In [12]:
[(20.7 + (children - 1) * 13.7) * 4 for children in range(1, 7)]

[82.8, 137.6, 192.39999999999998, 247.2, 302.0, 356.8]
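
The third value above shows a floating-point artefact. A small sketch that avoids it by doing the arithmetic in whole pence; `expected_payments` is a hypothetical helper, and the default of six children simply mirrors the six values in the output above:

```python
def expected_payments(rate_first, rate_additional, max_children=6, weeks=4):
    """Return payments in pounds for 1..max_children children.

    Rates are weekly amounts in pounds. Sums are computed in integer pence
    so that results such as 192.4 come out without rounding noise.
    """
    first = round(rate_first * 100)
    additional = round(rate_additional * 100)
    return [
        (first + (n - 1) * additional) * weeks / 100
        for n in range(1, max_children + 1)
    ]


print(expected_payments(20.70, 13.70))
# [82.8, 137.6, 192.4, 247.2, 302.0, 356.8]
```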

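The two most frequent amounts match the expected payments for one and two children (up to floating-point noise), while the other frequent amounts do not correspond to any of the expected values: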

In [13]:
cb.amount.value_counts().head(5)

-82.800003     86
-137.600006    65
-49.820000     39
-299.390015    15
-144.630005    13
Name: amount, dtype: int64
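
The stored amounts carry floating-point noise (for example -82.800003), so an exact comparison against the expected values would fail. One way to match them is sketched here in plain Python, with a hypothetical `match_children` helper and an assumed tolerance of one penny:

```python
def match_children(amount, rate_first=20.70, rate_additional=13.70,
                   max_children=6, weeks=4, tol=0.01):
    """Return the number of children implied by a credit, or None.

    `amount` is a transaction amount where credits are negative, as in the
    data above. The default rates are the 2019/20 weekly allowances.
    """
    received = -amount
    for n in range(1, max_children + 1):
        expected = (rate_first + (n - 1) * rate_additional) * weeks
        if abs(received - expected) <= tol:
            return n
    return None


# The five most frequent amounts from the output above.
top_amounts = [-82.800003, -137.600006, -49.82, -299.390015, -144.630005]
print([match_children(a) for a in top_amounts])
# [1, 2, None, None, None]
```

The implementation below takes a different route and truncates both the expected and the observed amounts to whole pounds before looking them up.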

## Implementation

In [17]:
def get_num_children(df):
    """Returns number of children per user-month for child benefit recipients.

    Forward-updates values with the maximum number of children identified up
    to that point to deal with cases where benefits drop temporarily or stop
    for the rest of the observation period.
    """
    # 2021 refers to tax year Apr 2021 to Mar 2022
    # Values are weekly allowances for first and
    # subsequent children, respectively.
    # Source: https://revenuebenefits.org.uk/child-benefit/
    # guidance/how-much-can-your-client-get/rates-and-tables/
    tax_year_rates = {
        2021: [21.15, 14.00],
        2020: [21.05, 13.95],
        2019: [20.70, 13.70],
        2018: [20.70, 13.70],
        2017: [20.70, 13.70],
        2016: [20.70, 13.70],
        2015: [20.70, 13.70],
        2014: [20.50, 13.55],
        2013: [20.30, 13.40],
        2012: [20.30, 13.40],
        2011: [20.30, 13.40],
    }

    # Example:
    # num_children[(2020, 140)] = 2, as (21.05 + 13.95) * 4 = 140,
    # and allowances are paid in 4-week intervals
    num_children = {}
    for year, (rate_first, rate_additional) in tax_year_rates.items():
        for children in range(1, 6):
            allowance = int((rate_first + (children - 1) * rate_additional) * 4)
            num_children[(year, allowance)] = children

    is_chb = df.tag_auto.eq("family benefits") & ~df.is_debit
    amount = -df.amount.where(is_chb, 0).astype(int)
    tax_year = (df.date.dt.to_period("A-Mar") - 1).dt.year

    num_children = (
        pd.Series(zip(tax_year, amount))
        .map(num_children)
        .groupby([df.user_id, df.date.dt.to_period("m")])
        .transform("max")
        .fillna(0)
        .groupby(df.user_id)
        .cummax()
        .rename("num_children")
    )

    has_new_child = (
        num_children.groupby(df.user_id)
        .diff()
        .groupby([df.user_id, df.date.dt.to_period("m")])
        .transform("max")
        .eq(1)
        .astype(int)
        .rename("has_new_child")
    )

    return pd.concat([num_children, has_new_child], axis=1)


df[["num_children", "has_new_child"]] = get_num_children(df)
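
To illustrate the forward-updating step, here is a plain-Python sketch for a single user. The monthly counts are made up, and `itertools.accumulate` stands in for the grouped `cummax` used above:

```python
from itertools import accumulate

# Hypothetical monthly maxima for one user; 0 means no matching payment
# was observed in that month.
monthly_counts = [0, 1, 1, 0, 2, 2, 0]

# Running maximum: once a child is observed, the count never drops.
num_children = list(accumulate(monthly_counts, max))

# Month-on-month change; the first month has no predecessor.
changes = [None] + [b - a for a, b in zip(num_children, num_children[1:])]

# Flag months in which the count rises by exactly one.
has_new_child = [int(c == 1) for c in changes]

print(num_children)   # [0, 1, 1, 1, 2, 2, 2]
print(has_new_child)  # [0, 1, 0, 0, 1, 0, 0]
```

As in the function above, the first month in which a payment for one child appears is also flagged, because the count moves from 0 to 1.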

## Sense checks

We observe children for about 16 percent of users

In [18]:
df.groupby("user_id").num_children.max().gt(0).mean().round(2)

0.16

... and for about 20 percent of transactions.

In [19]:
df.num_children.gt(0).mean().round(2)

0.2
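
The user-level and transaction-level shares need not coincide, since the transaction-level figure weights each user by their number of transactions. A toy illustration with made-up numbers, which for simplicity treats all of a user's transactions as having children observed:

```python
# Hypothetical data: user -> (has children, number of transactions).
users = {
    "a": (True, 300),
    "b": (False, 100),
    "c": (False, 100),
    "d": (False, 100),
}

# Each user counts once.
share_of_users = sum(has for has, _ in users.values()) / len(users)

# Each user counts in proportion to their transactions.
txns_total = sum(n for _, n in users.values())
txns_with_children = sum(n for has, n in users.values() if has)
share_of_txns = txns_with_children / txns_total

print(share_of_users)  # 0.25
print(share_of_txns)   # 0.5
```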

Data for users for whom we observe at least one child:

In [20]:
def children_data(df):
    cond = df.groupby("user_id").num_children.max().gt(0)
    users = cond[cond].index
    cols = [
        "date",
        "user_id",
        "num_children",
        "amount",
        "desc",
        "merchant",
        "tag_group",
        "tag",
        "is_female",
    ]
    return df.loc[df.user_id.isin(users), cols]


dfc = children_data(df)
dfc.head(3)

Unnamed: 0,date,user_id,num_children,amount,desc,merchant,tag_group,tag,female
0,2012-02-01,777,0.0,400.0,<mdbremoved> - s/o,,transfers,other_transfers,0.0
1,2012-02-01,777,0.0,3.03,aviva pa - d/d,aviva,spend,finance,0.0
2,2012-02-03,777,0.0,8.75,chart ins log tran - d/d,,,,0.0


60 percent of recipients are male

In [21]:
dfc.groupby("user_id").is_female.first().mean().round(2)

0.4

We observe a fair number of cases in which the number of children increases by more than 1. Given that the probability of having twins is about 1 in 250 (according to the [NHS](https://www.nhs.uk/pregnancy/finding-out/pregnant-with-twins/)), it's much more likely that in these cases we simply don't observe all benefit payments. So we'll only treat increases of 1 as the birth of a new child.

In [22]:
monthly = dfc.groupby(["user_id", dfc.date.dt.to_period("m")]).num_children.first()
cond = monthly.groupby("user_id").diff().gt(0)
monthly[cond].value_counts()

1.0    8
2.0    5
4.0    1
3.0    1
Name: num_children, dtype: int64
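
Note that `monthly[cond]` holds the number of children after each increase. To tabulate the size of each increase directly, one could count the differences instead. A plain-Python sketch with made-up monthly series (the user ids and values are illustrative only):

```python
from collections import Counter

# Hypothetical monthly num_children per user (already forward-updated).
monthly = {
    "u1": [0, 1, 1, 2],  # two increases of 1
    "u2": [0, 0, 2, 2],  # one increase of 2
    "u3": [0, 3, 3, 4],  # an increase of 3, then of 1
}

jump_sizes = Counter()
levels_after_jump = Counter()
for counts in monthly.values():
    for prev, curr in zip(counts, counts[1:]):
        if curr > prev:
            jump_sizes[curr - prev] += 1   # size of the increase
            levels_after_jump[curr] += 1   # level reached after it

print(dict(jump_sizes))         # {1: 3, 2: 1, 3: 1}
print(dict(levels_after_jump))  # {1: 1, 2: 2, 3: 1, 4: 1}
```

The two tabulations differ whenever a user's count rises in several steps, so only the first answers the question of how often the count jumps by more than 1.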