Notebook purpose:

- Explore the feasibility of identifying a user's number of children from child benefit payments

Info:

- [Historical child benefit rates](https://revenuebenefits.org.uk/child-benefit/guidance/how-much-can-your-client-get/rates-and-tables/)

- [UK tax year runs from 6 April to 5 April](https://www.gov.uk/self-assessment-tax-returns/deadlines)



In [2]:
import sys

import numpy as np
import pandas as pd
import s3fs
import scipy
import seaborn as sns

sys.path.append("/Users/fgu/dev/projects/entropy")
import entropy.data.aggregators as ag
import entropy.data.cleaners as cl
import entropy.data.make_data as md
import entropy.data.selectors as sl
import entropy.data.validators as vl
import entropy.helpers.aws as ha
import entropy.helpers.data as hd
import entropy.helpers.helpers as hh

pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)
pd.set_option("max_colwidth", None)
%load_ext autoreload
%autoreload 2

fs = s3fs.S3FileSystem(profile="3di")

In [4]:
df = hd.read_txn_data("777")

Time for read_txn_data                 : 2.98 seconds


## Identifying child benefit payments in the data

In [20]:
def child_benefits(df):
    """Return child benefit payments dated Apr 2019 to Mar 2020 (approx. tax year 2019/20)."""
    return (
        df.loc[df.tag.eq("benefits") & df.desc.str.contains("chb")]
        .set_index("date")
        .loc["Apr 2019":"March 2020"]
        .sort_values(["user_id", "date"])
    )


cb = child_benefits(df)
cb.head()

date,user_id,amount,desc,merchant,tag_group,tag,account_id,account_last_refreshed,account_provider,account_type,debit,female,id,is_urban,latest_balance,logins,postcode,region_name,tag_auto,yob
2019-04-09,777,-137.600006,bank credit <mdbremoved> -chb xxxxxx xxxx1802,,income,benefits,1419376,2019-09-08 07:52:00,nationwide,current,False,0.0,606532064,1.0,1729.390015,0.0,wa1 4,north west,family benefits,1969.0
2019-05-07,777,-137.600006,bank credit <mdbremoved> -chb xxxxxx xxxx1802,,income,benefits,1419376,2019-09-08 07:52:00,nationwide,current,False,0.0,606532073,1.0,1729.390015,0.0,wa1 4,north west,family benefits,1969.0
2019-06-04,777,-137.600006,bank credit <mdbremoved> -chb xxxxxx xxxx1802,,income,benefits,1419376,2019-09-08 07:52:00,nationwide,current,False,0.0,606532085,1.0,1729.390015,0.0,wa1 4,north west,family benefits,1969.0
2019-07-02,777,-137.600006,bank credit <mdbremoved> -chb xxxxxx xxxx1802,,income,benefits,1419376,2019-09-08 07:52:00,nationwide,current,False,0.0,606532103,1.0,1729.390015,0.0,wa1 4,north west,family benefits,1969.0
2019-07-30,777,-137.600006,bank credit <mdbremoved> -chb xxxxxx xxxx1802,,income,benefits,1419376,2019-09-08 07:52:00,nationwide,current,False,0.0,606532110,1.0,1729.390015,0.0,wa1 4,north west,family benefits,1969.0


In tax year 2019/20, the weekly allowance was £20.70 for the first child and £13.70 for each additional child. Child benefit is usually paid every four weeks, so we'd expect to find the following amounts:

In [34]:
[(children, (20.7 + (children - 1) * 13.7) * 4) for children in range(1, 5)]

[(1, 82.8), (2, 137.6), (3, 192.39999999999998), (4, 247.2)]

This is what we find, apart from a single payment of £20.70, which equals one week's allowance for a first child.

In [22]:
cb.amount.value_counts()

-137.600006    64
-82.800003     59
-192.399994    12
-20.700001      1
Name: amount, dtype: int64
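
The observed amounts carry float32 rounding noise (for example -137.600006), so an exact comparison with the expected values would fail. A minimal sketch of one way to match them, working in integer pence. The names `WEEKLY_PENCE_2019`, `expected_pence` and `implied_children` are illustrative and not part of the `entropy` package:

```python
import pandas as pd

# 2019/20 weekly rates in pence: first child, each additional child.
WEEKLY_PENCE_2019 = (2070, 1370)


def expected_pence(children, weeks=4):
    """Expected payment in pence for a given number of children."""
    first, additional = WEEKLY_PENCE_2019
    return (first + (children - 1) * additional) * weeks


# Lookup from four-weekly payment in pence to number of children.
lookup = {expected_pence(n): n for n in range(1, 6)}


def implied_children(amounts):
    """Map credit amounts (stored as negative floats) to number of children."""
    pence = (-amounts * 100).round().astype("int64")
    return pence.map(lookup)


# Amounts as they appear in the value counts above.
observed = pd.Series(
    [-137.600006, -82.800003, -192.399994, -20.700001], dtype="float32"
)
print(implied_children(observed).tolist())
```

The one-week payment of £20.70 is not in the four-weekly lookup, so it maps to NaN rather than to a number of children.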

## Implementation

In [212]:
def get_num_children(df):
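    """Returns number of children implied by each child benefit payment.

    Transactions that are not recognised child benefit payments are NaN.
    """
    # Keys are the year in which the tax year starts, so 2021 refers
    # to tax year Apr 2021 to Mar 2022. Values are weekly allowances
    # for the first and each subsequent child, respectively.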
    tax_year_rates = {
        2021: [21.15, 14.00],
        2020: [21.05, 13.95],
        2019: [20.70, 13.70],
        2018: [20.70, 13.70],
        2017: [20.70, 13.70],
        2016: [20.70, 13.70],
        2015: [20.70, 13.70],
        2014: [20.50, 13.55],
        2013: [20.30, 13.40],
        2012: [20.30, 13.40],
        2011: [20.30, 13.40],
    }

    # Maps (tax year, four-weekly allowance truncated to whole pounds)
    # to number of children. Example:
    # num_children[(2020, 140)] = 2,
    # since (21.05 + 13.95) * 4 = 140
    num_children = {}
    for year, (rate_first, rate_additional) in tax_year_rates.items():
        for children in range(1, 6):
            allowance = int((rate_first + (children - 1) * rate_additional) * 4)
            num_children[(year, allowance)] = children

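    is_chb = df.tag.eq("benefits") & df.desc.str.contains("chb") & ~df.debit
    # Credits are stored as negative amounts: truncate, then flip the sign.
    amount = -df.amount.where(is_chb, 0).astype(int)
    # Approximates the tax year as 1 Apr to 31 Mar, so payments dated
    # 1-5 April are assigned to the following tax year.
    tax_year = (df.date.dt.to_period("A-Mar") - 1).dt.year
    # Keep the index of df so that the result aligns with the input rows.
    keys = pd.Series(list(zip(tax_year, amount)), index=df.index)
    return keys.map(num_children)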


get_num_children(df)

0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
          ..
651784   NaN
651785   NaN
651786   NaN
651787   NaN
651788   NaN
Length: 651789, dtype: float64
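
`to_period("A-Mar")` treats the tax year as running from 1 April to 31 March, whereas it actually starts on 6 April, so payments dated 1 to 5 April are assigned to the following tax year. A hedged sketch of an exact assignment, where `tax_year_start` is a hypothetical helper and not part of the code above:

```python
import pandas as pd


def tax_year_start(dates):
    """Return the calendar year in which each date's UK tax year begins."""
    dates = pd.to_datetime(dates)
    # Dates before 6 April belong to the tax year that began the previous year.
    before_start = (dates.dt.month < 4) | (
        (dates.dt.month == 4) & (dates.dt.day < 6)
    )
    return dates.dt.year - before_start.astype(int)


dates = pd.Series(["2019-04-05", "2019-04-06", "2019-04-09", "2020-03-31"])
print(tax_year_start(dates).tolist())  # [2018, 2019, 2019, 2019]
```

Whether the difference matters depends on the data: it only changes the result for payments dated 1 to 5 April in years where the rates changed.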

## Sense checks