# Adding activity chains to synthetic populations 

The purpose of this script is to match each individual in the synthetic population to a respondent from the [National Travel Survey (NTS)](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=5340).

### Methods

We will try two methods:

1. categorical matching: joining on relevant socio-demographic variables
2. statistical matching, as described in [An unconstrained statistical matching algorithm for combining individual and household level geo-specific census and survey data](https://doi.org/10.1016/j.compenvurbsys.2016.11.003). 

In [1]:
import numpy as np
import pandas as pd
from acbm.preprocessing import nts_filter_by_year, nts_filter_by_region

## Step 1: Load in the datasets  

### SPC 

In [2]:
# useful variables
region = "west-yorkshire"

In [None]:
# Read in the spc data (parquet format)
spc = pd.read_parquet('../data/spc_output/' + region + '_people_hh.parquet')
spc.head()

In [4]:
# temporary reduction of the dataset for quick analysis
spc = spc.head(5000)

In [None]:
spc.columns

### NTS

The NTS is split up into multiple tables. We will load in the following tables:
- individuals
- households
- trips
- psu (passed to the year and region filters below)

In [2]:
# path where datasets are stored
path_psu = "../data/nts/UKDA-5340-tab/tab/psu_eul_2002-2022.tab"
psu = pd.read_csv(path_psu, sep="\t")

path_individuals = "../data/nts/UKDA-5340-tab/tab/individual_eul_2002-2022.tab"
nts_individuals = pd.read_csv(path_individuals, sep="\t")

path_households = "../data/nts/UKDA-5340-tab/tab/household_eul_2002-2022.tab"
nts_households = pd.read_csv(path_households, sep="\t")

path_trips = "../data/nts/UKDA-5340-tab/tab/trip_eul_2002-2022.tab"
nts_trips = pd.read_csv(path_trips, sep="\t")


#### Filter by year

We will filter the NTS data to only include specific years. We can choose a single year, or several years to increase the sample size and the likelihood of a match with the SPC.

In [31]:
years = [2019, 2021, 2022]

nts_individuals_filtered = nts_filter_by_year(nts_individuals, psu, years)
nts_households_filtered = nts_filter_by_year(nts_households, psu, years)
nts_trips_filtered = nts_filter_by_year(nts_trips, psu, years)
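
The internals of `nts_filter_by_year` are not shown in this notebook. As a rough sketch of what such a filter could look like, the toy example below assumes that the PSU table has a `PSUID` key and a `SurveyYear` column, and that the other tables link to it through `PSUID`. Those column names, the function name `filter_by_year_sketch` and the toy data are all assumptions made for illustration.

```python
import pandas as pd


def filter_by_year_sketch(data, psu, years):
    """Keep rows of `data` whose PSU was surveyed in one of `years`.

    Assumes (not confirmed by the notebook) that both tables share a
    `PSUID` column and that `psu` has a `SurveyYear` column.
    """
    # IDs of the sampling units surveyed in the requested years
    psu_ids = psu.loc[psu["SurveyYear"].isin(years), "PSUID"]
    # Keep only the rows that belong to those sampling units
    return data[data["PSUID"].isin(psu_ids)]


# Hypothetical toy tables
toy_psu = pd.DataFrame({"PSUID": [1, 2, 3], "SurveyYear": [2018, 2019, 2022]})
toy_households = pd.DataFrame(
    {"HouseholdID": [10, 11, 12, 13], "PSUID": [1, 2, 2, 3]}
)

toy_filtered = filter_by_year_sketch(toy_households, toy_psu, [2019, 2021, 2022])
```

In this toy data the household attached to the 2018 sampling unit is dropped and the other three are kept.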



#### Filter by geography 

We skip this step for categorical matching: it reduces the NTS sample significantly and leaves more SPC households unmatched.

In [32]:
# regions = ['Yorkshire and the Humber', 'North West']

# nts_individuals_filtered = nts_filter_by_region(nts_individuals_filtered, psu, regions)
# nts_households_filtered = nts_filter_by_region(nts_households_filtered, psu, regions)
# nts_trips_filtered = nts_filter_by_region(nts_trips_filtered, psu, regions)


Create dictionaries of key value pairs

In [None]:
'''
guide to the dictionaries:

_nts_hh: from NTS households table
_nts_ind: from NTS individuals table
_spc: from SPC

'''


# ---------- NTS

# Create a dictionary for the HHIncome2002_B02ID column
income_dict_nts_hh = {
     '1': '0-25k',
     '2': '25k-50k',
     '3': '50k+',
    '-8': 'NA',
    # The key below is -1; it should probably be -10 (as in the other columns),
    # which could be a typo in household_eul_2002-2022_ukda_data_dictionary
    '-1': 'DEAD'
}

# Create a dictionary for the HHoldEmploy_B01ID column
# (PT: Part time, FT: Full time)
employment_dict_nts_hh = {
    '1': 'None',
    '2': '0 FT, 1 PT',
    '3': '1 FT, 0 PT',
    '4': '0 FT, 2 PT',
    '5': '1 FT, 1 PT',
    '6': '2 FT, 0 PT',
    '7': '1 FT, 2+ PT',
    '8': '2 FT, 1+ PT',
    '9': '0 FT, 3+ PT',
    '10': '3+ FT, 0 PT',
    '11': '3+ FT, 1+ PT',
    '-8': 'NA',
    '-10': 'DEAD'
}

# Create a dictionary for the Ten1_B02ID column
tenure_dict_nts_hh = {
    '1': 'Owns / buying',
    '2': 'Rents',
    '3': 'Other (including rent free)',
    '-8': 'NA',
    '-9': 'DNA',
    '-10': 'DEAD'
}


# ---------- SPC


# create a dictionary for the pwkstat column
employment_dict_spc = {
    '0': 'Not applicable (age < 16)',
    '1': 'Employee FT',
    '2': 'Employee PT',
    '3': 'Employee unspecified',
    '4': 'Self-employed',
    '5': 'Unemployed',
    '6': 'Retired',
    '7': 'Homemaker/Maternal leave',
    '8': 'Student',
    '9': 'Long term sickness/disability',
    '10': 'Other'
}


# Create a dictionary for the tenure column
tenure_dict_spc = {
    '1': 'Owned: Owned outright',
    '2': 'Owned: Owned with a mortgage or loan or shared ownership',
    '3': 'Rented or living rent free: Total',
    '4': 'Rented: Social rented',
    '5': 'Rented: Private rented or living rent free',
    '-8': 'NA',
    '-9': 'DNA',
    '-10': 'DEAD'
}


# Combine the dictionaries into a dictionary of dictionaries

dict_nts = {
    'HHIncome2002_B02ID': income_dict_nts_hh,
    'HHoldEmploy_B01ID': employment_dict_nts_hh,
    'Ten1_B02ID': tenure_dict_nts_hh
}

dict_spc = {
    'pwkstat': employment_dict_spc,
    'tenure': tenure_dict_spc
}
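
The lookup dictionaries above use string keys. As a minimal sketch of how they could be used to decode a column, the toy example below assumes the coded values are stored as integers (an assumption about the raw data type) and casts them to strings before mapping. The names `toy_income_dict`, `toy_codes` and `toy_labels` are hypothetical.

```python
import pandas as pd

# Toy lookup with string keys, shaped like income_dict_nts_hh above
toy_income_dict = {"1": "0-25k", "2": "25k-50k", "3": "50k+", "-8": "NA"}

# Toy coded column stored as integers (assumed data type)
toy_codes = pd.Series([1, 3, -8, 2])

# Cast to str so the values line up with the string keys, then map to labels
toy_labels = toy_codes.astype(str).map(toy_income_dict)

# Mapping the integers directly finds no matching keys and returns only NaN
toy_unmapped = toy_codes.map(toy_income_dict)
```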



## Step 2: Decide on matching variables  

We need to identify the socio-demographic characteristics that we will match on. The schema for the synthetic population can be found [here](https://github.com/alan-turing-institute/uatk-spc/blob/main/synthpop.proto). 

Matching between the SPC and the NTS will happen in two steps: 

1. Match at the household level
2. Match individuals within the household

### Household level matching 

| Variable           | Name (NTS)           | Name (SPC)      | Transformation (NTS) | Transformation (SPC) |
| ------------------ | -------------------- | --------------- | -------------------- | -------------------- |
| Household income   | `HHIncome2002_B02ID` | `salary_yearly` | NA                   | Group by household ID and sum |
| Number of adults   | `HHoldNumAdults`        | `age_years`     | NA                   | Group by household ID and count |
| Number of children | `HHoldNumChildren`      | `age_years`     | NA                   | Group by household ID and count |
| Employment status  | `HHoldEmploy_B01ID`  | `pwkstat`       | NA                   | a) match to NTS categories. b) group by household ID |
| Car ownership      | `NumCar`             | `num_cars`      | SPC is capped at 2. We change all entries > 2 to 2 | NA  |

Other columns to match in the future:

| Variable           | Name (NTS)           | Name (SPC)      | Transformation (NTS) | Transformation (SPC) |
| ------------------ | -------------------- | --------------- | -------------------- | -------------------- |
| Type of tenancy    | `Ten1_B02ID`         | `tenure`        | ?? | ?? |
|  Urban-Rural classification of residence | `Settlement2011EW_B04ID`         | NA     | NA            | Spatial join between [layer](https://www.gov.uk/government/collections/rural-urban-classification) and SPC  |



### 2.1 Edit SPC columns 

#### Household Income

In [None]:
# Household Income

# --- Get sum of spc.salary_yearly per household
spc['salary_yearly_hh'] = (spc
                           .groupby('household')['salary_yearly']
                           .transform('sum'))

# --- Recode column so that it matches the reported NTS values (Use income_dict_nts_hh dictionary for reference)

# Define the bins
bins = [0, 24999, 49999, np.inf]
# Define the labels for the bins
labels = [1, 2, 3]

spc['salary_yearly_hh_cat'] = (pd.cut(spc['salary_yearly_hh'], bins=bins, labels=labels)
                                 .astype('str')
                                 .astype('float'))

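# Values that fall outside the bins are NaN at this point. Because the bins are
# right-closed, this includes a household total of exactly 0.
# Replace these NaN values with -8 (to be consistent with NTS)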
spc['salary_yearly_hh_cat'] = spc['salary_yearly_hh_cat'].fillna(-8)

# Convert the column to int
spc['salary_yearly_hh_cat'] = spc['salary_yearly_hh_cat'].astype('int')
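
`pd.cut` uses right-closed intervals by default, so the bin edges decide which band a boundary value lands in. The sketch below compares the default behaviour with left-closed bins on hypothetical toy incomes. Whether a household total of 0 belongs in band 1 or in the NA category is a modelling choice that the notebook does not state, so the second variant is only an alternative to consider.

```python
import numpy as np
import pandas as pd

toy_income = pd.Series([0, 24999, 25000, 60000])

# Default right-closed bins, as in the cell above:
# (0, 24999], (24999, 49999], (49999, inf]
default_cut = pd.cut(toy_income, bins=[0, 24999, 49999, np.inf], labels=[1, 2, 3])

# Alternative with left-closed bins:
# [0, 25000), [25000, 50000), [50000, inf)
left_closed_cut = pd.cut(
    toy_income, bins=[0, 25000, 50000, np.inf], labels=[1, 2, 3], right=False
)
```

With the default bins the value 0 is left unbinned (NaN), while the left-closed variant assigns it to band 1.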

#### Household Composition (No. of Adults / Children)

In [None]:
# Number of adults and children in the household

spc = spc.assign(
    is_adult = (spc['age_years'] >= 16).astype(int),
    num_adults = lambda df: df.groupby('household')['is_adult'].transform('sum'),
    is_child = (spc['age_years'] < 16).astype(int),
    num_children = lambda df: df.groupby('household')['is_child'].transform('sum')
)




#### Employment Status

In [None]:
# Employment status

# check the column values from our dictionary
dict_spc['pwkstat']

In [None]:
# We will only use '1' and '2' for the employment status


# Function to count the number of occurrences of specific values in a column,
# and return a new column per value specified
def count_values(group, column, values, value_names):
    """
    Count the number of occurrences of specific values in a column, 
    and return a new column per value specified.

    Parameters:
    group (DataFrame): The group of data to count values in.
    column (str): The name of the column to count values in.
    values (list): The values to count.
    value_names (list): The names to use for the new columns in the output.

    Returns:
    Series: A pandas Series where the index is the value_names and 
            the values are the counts.
    """
    counts = group[column].value_counts()
    return pd.Series([counts.get(val, 0) for val in values], index=value_names)

# Apply the function to each group
counts_df = (spc.groupby('household')
                .apply(count_values,
                       column='pwkstat',
                       values=[1, 2],
                       value_names=['pwkstat_FT_hh','pwkstat_PT_hh']))

# Check results
# counts_df.head(10)
counts_df.iloc[460:480, :]
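
The same per-household counts can also be obtained without `groupby().apply()`, which tends to be slow on large populations. The sketch below flags full-time and part-time workers and sums the flags per household. It runs on hypothetical toy data, reuses the column names from the cell above, and is an alternative rather than the notebook's own method.

```python
import pandas as pd

# Toy population: three households with different employment mixes
toy_spc = pd.DataFrame({
    "household": [0, 0, 0, 1, 1, 2],
    "pwkstat":   [1, 2, 0, 1, 1, 6],
})

# Flag full-time (1) and part-time (2) workers, then sum the flags per household
toy_counts = (
    toy_spc.assign(
        pwkstat_FT_hh=toy_spc["pwkstat"].eq(1).astype(int),
        pwkstat_PT_hh=toy_spc["pwkstat"].eq(2).astype(int),
    )
    .groupby("household")[["pwkstat_FT_hh", "pwkstat_PT_hh"]]
    .sum()
)
```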

In [None]:
# We want to match the SPC values to the NTS
dict_nts['HHoldEmploy_B01ID']
'''
{
    '1': 'None',
    '2': '0 FT, 1 PT',
    '3': '1 FT, 0 PT',
    '4': '0 FT, 2 PT',
    '5': '1 FT, 1 PT',
    '6': '2 FT, 0 PT',
    '7': '1 FT, 2+ PT',
    '8': '2 FT, 1+ PT',
    '9': '0 FT, 3+ PT',
    '10': '3+ FT, 0 PT',
    '11': '3+ FT, 1+ PT',
    '-8': 'NA',
    '-10': 'DEAD'}
 '''

# 1) Match each row to the NTS

# Define the conditions and outputs.
# We are using the keys in dict_nts['HHoldEmploy_B01ID'] as reference
conditions = [
    (counts_df['pwkstat_FT_hh'] == 0) & (counts_df['pwkstat_PT_hh'] == 0),
    (counts_df['pwkstat_FT_hh'] == 0) & (counts_df['pwkstat_PT_hh'] == 1),
    (counts_df['pwkstat_FT_hh'] == 1) & (counts_df['pwkstat_PT_hh'] == 0),
    (counts_df['pwkstat_FT_hh'] == 0) & (counts_df['pwkstat_PT_hh'] == 2),
    (counts_df['pwkstat_FT_hh'] == 1) & (counts_df['pwkstat_PT_hh'] == 1),
    (counts_df['pwkstat_FT_hh'] == 2) & (counts_df['pwkstat_PT_hh'] == 0),
    (counts_df['pwkstat_FT_hh'] == 1) & (counts_df['pwkstat_PT_hh'] >= 2),
    (counts_df['pwkstat_FT_hh'] == 2) & (counts_df['pwkstat_PT_hh'] >= 1),
    (counts_df['pwkstat_FT_hh'] == 0) & (counts_df['pwkstat_PT_hh'] >= 3),
    (counts_df['pwkstat_FT_hh'] >= 3) & (counts_df['pwkstat_PT_hh'] == 0),
    (counts_df['pwkstat_FT_hh'] >= 3) & (counts_df['pwkstat_PT_hh'] >= 1)
]

# Define the corresponding outputs based on dict_nts['HHoldEmploy_B01ID']
outputs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# Create a new column using np.select
counts_df['pwkstat_NTS_match'] = np.select(conditions,
                                           outputs,
                                           default= -8)



# 2) merge back onto the spc
spc = spc.merge(counts_df, left_on='household', right_index=True)
spc.head(10)
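
`np.select` returns the output of the first condition that is true, so it helps to confirm that the conditions do not overlap and that they cover every combination. The sketch below rebuilds the same eleven conditions on a hypothetical grid of full-time and part-time counts and checks both properties. The `toy_` names are illustrative only.

```python
import itertools

import numpy as np
import pandas as pd

# Every (FT, PT) combination from 0 to 4 workers of each type
toy_grid = pd.DataFrame(
    list(itertools.product(range(5), repeat=2)), columns=["ft", "pt"]
)
ft, pt = toy_grid["ft"], toy_grid["pt"]

# Same conditions as above, in the order of the NTS codes 1 to 11
toy_conditions = [
    (ft == 0) & (pt == 0), (ft == 0) & (pt == 1), (ft == 1) & (pt == 0),
    (ft == 0) & (pt == 2), (ft == 1) & (pt == 1), (ft == 2) & (pt == 0),
    (ft == 1) & (pt >= 2), (ft == 2) & (pt >= 1), (ft == 0) & (pt >= 3),
    (ft >= 3) & (pt == 0), (ft >= 3) & (pt >= 1),
]

# Number of conditions that are true for each row of the grid
toy_hits = np.array([c.to_numpy() for c in toy_conditions]).sum(axis=0)

toy_grid["code"] = np.select(toy_conditions, list(range(1, 12)), default=-8)
```

On this grid each combination satisfies exactly one condition, so the default of -8 is never used.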


### 2.2 Edit NTS columns

#### Number of cars

- `SPC.num_cars` only has values [0, 1, 2]. 2 is for all households with 2 or more cars
- `NTS.NumCar` is more detailed. It has the actual value of the number of cars. We will cap this at 2.

In [None]:
# Define a function to match the values
def match_values(x):
    if x > 2:
        return 2
    else:
        return x

# Create a new column in NTS
nts_households_filtered.loc[:, 'NumCar_SPC_match'] = nts_households_filtered['NumCar'].apply(match_values)
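
For reference, the same cap can be written with `Series.clip`, which avoids a Python-level function call per row. The toy values below are hypothetical. Negative codes such as -8 pass through unchanged, as they do with `match_values`.

```python
import pandas as pd

toy_num_cars = pd.Series([0, 1, 2, 3, 5, -8])

# Cap every value above 2 at 2 and leave the rest untouched
toy_capped = toy_num_cars.clip(upper=2)
```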

#### Type of tenancy

Breakdown between NTS and SPC is different. 

In [None]:
dict_nts['Ten1_B02ID'], dict_spc['tenure']

In [None]:
# Dictionary showing what we want the final columns to look like
tenure_dict_nts_spc = {
    1: 'Owned',
    2: 'Rented or rent free',
    -8: 'NA',
    -9: 'DNA',
    -10: 'DEAD'
}

# Matching NTS to tenure_dict_nts_spc

# Create a new dictionary for matching
matching_dict_nts_tenure = {
    1: 1,
    2: 2,
    3: 2
}

matching_dict_spc_tenure = {
    1: 1, #'Owned: Owned outright' : 'Owned'
    2: 1, #'Owned: Owned with a mortgage or loan or shared ownership', : 'Owned'
    3: 2, #'Rented or living rent free: Total', : 'Rented or rent free'
    4: 2, #'Rented: Social rented', : 'Rented or rent free'
    5: 2, #'Rented: Private rented or living rent free', : 'Rented or rent free'
}

# Create a new column in nts_households_filtered
nts_households_filtered['tenure_nts_for_matching'] = (nts_households_filtered['Ten1_B02ID']
                                                    .map(matching_dict_nts_tenure) # map the values to the new dictionary
                                                    .fillna(nts_households_filtered['Ten1_B02ID'])) # fill the NaNs with the original values

# Create a new column in spc
spc['tenure_spc_for_matching'] = (spc['tenure']
                                    .map(matching_dict_spc_tenure) # map the values to the new dictionary
                                    .fillna(spc['tenure'])) # fill the NaNs with the original values
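
The `map` plus `fillna` pattern recodes the values listed in the dictionary and keeps any other code (for example -8 or -9) as it was. The sketch below shows this on hypothetical toy data, assuming the tenure column holds integers.

```python
import pandas as pd

# Toy recoding, shaped like matching_dict_spc_tenure above
toy_matching = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2}
toy_tenure = pd.Series([1, 2, 4, 5, -8, -9])

# Codes missing from the dictionary become NaN after map(),
# and fillna() restores their original values
toy_harmonised = toy_tenure.map(toy_matching).fillna(toy_tenure)
```

The intermediate NaN values turn the result into a float column, so a final `.astype(int)` may be needed if the join keys on the other table are integers.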


## Step 3: Matching at Household Level

Now that we've prepared all the columns, we can start matching.

### 3.1 Categorical matching

We will match on the following columns:

| Matching variable | NTS column | SPC column |
| ------------------| ---------- | ---------- |
| Household income  | `HHIncome2002_B02ID` | `salary_yearly_hh_cat` |
| Number of adults  | `HHoldNumAdults` | `num_adults` |
| Number of children | `HHoldNumChildren` | `num_children` |
| Employment status | `HHoldEmploy_B01ID` | `pwkstat_NTS_match` |
| Car ownership | `NumCar_SPC_match` | `num_cars` |
| Type of tenancy | `tenure_nts_for_matching` | `tenure_spc_for_matching` |

In [None]:
# Select multiple columns
spc_matching = spc[[
    'hid',
    'salary_yearly_hh_cat', 'num_adults',
    'num_children', 'pwkstat_NTS_match',
    'num_cars', 'tenure_spc_for_matching']]

# edit the df so that we have one row per hid
spc_matching = spc_matching.drop_duplicates(subset='hid')

spc_matching.head(10)

In [None]:
nts_matching = nts_households_filtered[[
    'HouseholdID','HHIncome2002_B02ID',
    'HHoldNumAdults', 'HHoldNumChildren',
    'HHoldEmploy_B01ID', 'NumCar_SPC_match',
    'tenure_nts_for_matching']]

nts_matching.head(10)

In [None]:
# Join the NTS onto the SPC (each SPC household can be matched to multiple NTS households)
spc_nts = spc_matching.merge(nts_matching,
                             left_on= ['salary_yearly_hh_cat',
                                       'num_adults',
                                       'num_children',
                                       'pwkstat_NTS_match',
                                       'num_cars',
                                       'tenure_spc_for_matching'],
                             right_on= ['HHIncome2002_B02ID',
                                        'HHoldNumAdults',
                                        'HHoldNumChildren',
                                        'HHoldEmploy_B01ID',
                                        'NumCar_SPC_match',
                                        'tenure_nts_for_matching'],
                             how = 'left')
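
A left join keeps every SPC household, including those without an NTS match. As a sketch of how the unmatched ones could be listed, the toy example below passes `indicator=True` to `merge`, which adds a `_merge` column marking rows found only in the left table. The toy tables use only two matching variables and are hypothetical.

```python
import pandas as pd

toy_spc = pd.DataFrame(
    {"hid": ["a", "b", "c"], "num_adults": [1, 2, 3], "num_cars": [0, 1, 2]}
)
toy_nts = pd.DataFrame({
    "HouseholdID": [100, 101, 102],
    "HHoldNumAdults": [1, 1, 2],
    "NumCar_SPC_match": [0, 0, 1],
})

toy_joined = toy_spc.merge(
    toy_nts,
    left_on=["num_adults", "num_cars"],
    right_on=["HHoldNumAdults", "NumCar_SPC_match"],
    how="left",
    indicator=True,  # adds a `_merge` column: 'both' or 'left_only'
)

# SPC households with no NTS counterpart
toy_unmatched = toy_joined.loc[toy_joined["_merge"] == "left_only", "hid"].tolist()
```

Here household `a` matches two NTS households, `b` matches one, and `c` matches none.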

Check how many NTS households matched onto each SPC household:

In [None]:
# Count how many rows from nts_matching are matched onto each hid in spc_matching
spc_nts['count'] = spc_nts.groupby('hid')['HouseholdID'].transform('count')

# Keep one row per hid, then plot a histogram of the counts
# with a vertical line at the mean value
spc_nts_hist = spc_nts.drop_duplicates(subset='hid')

ax = spc_nts_hist['count'].plot(kind='hist', bins=50)
ax.axvline(spc_nts_hist['count'].mean(), color='red', linestyle='--')

In [None]:
# How many NTS households were matched onto each SPC household?
# (a count of 0 means the SPC household was not matched)
spc_nts_hist['count'].value_counts()
# Same breakdown as a share of all SPC households, rounded to 2 decimal places
spc_nts_hist['count'].value_counts(normalize=True).round(2)



Store the results in a dictionary:

- Key: SPC hid
- Value: List of NTS Household IDs



In [None]:
# Each hid in spc_matching is joined onto multiple HouseholdID in nts_matching.
# Create a dictionary to store the hid to HouseholdID matches

# Initialize an empty dictionary
hid_to_HouseholdID = {}

# Loop through the DataFrame
for index, row in spc_nts.iterrows():
    # Get the hid and HouseholdID from the row
    hid = row['hid']
    HouseholdID = row['HouseholdID']

    # If the hid is already a key in the dictionary, append the HouseholdID to its list
    if hid in hid_to_HouseholdID:
        hid_to_HouseholdID[hid].append(HouseholdID)
    # If the hid is not a key in the dictionary, add it with a new list that contains the HouseholdID
    else:
        hid_to_HouseholdID[hid] = [HouseholdID]
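
The loop above can also be expressed as a single `groupby`, which is usually faster than `iterrows` on large tables. The sketch below runs on hypothetical toy data and differs from the loop in one respect: unmatched households get an empty list instead of `[nan]`, because the missing IDs are dropped.

```python
import numpy as np
import pandas as pd

# Toy join result: 'a' has two matches, 'b' has one, 'c' has none
toy_spc_nts = pd.DataFrame({
    "hid": ["a", "a", "b", "c"],
    "HouseholdID": [100.0, 101.0, 102.0, np.nan],
})

# Collect the matched HouseholdIDs of each hid into a list, dropping NaN
toy_hid_to_HouseholdID = (
    toy_spc_nts.groupby("hid")["HouseholdID"]
    .apply(lambda ids: ids.dropna().tolist())
    .to_dict()
)
```

Keys with an empty list would need to be skipped before sampling, since `np.random.choice` raises an error on an empty sequence.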


In [None]:
# Check the first 10 entries in the dictionary
#list(hid_to_HouseholdID.items())[:10]

# access all the values for a specific key in the dictionary
hid_to_HouseholdID['E02002183_0010']



In [None]:
# for each key in the dictionary, sample 1 of the values associated with it and store it in a new dictionary

'''
- iterate over each key-value pair in the hid_to_HouseholdID dictionary.
- For each key-value pair, use np.random.choice(value) to randomly select 
one item from the list of values associated with the current key.
- create a new dictionary hid_to_HouseholdID_sample where each key from the 
original dictionary is associated with one randomly selected value from the 
original list of values.

'''
hid_to_HouseholdID_sample = {key: np.random.choice(value) for key, value in hid_to_HouseholdID.items()}


In [None]:
# same logic as cell above, but repeat it multiple times and store each result as a separate dictionary in a list
hid_to_HouseholdID_sample_list = [
    {key: np.random.choice(value) for key, value in hid_to_HouseholdID.items()}
    for _ in range(100)
]
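
The draws above change on every run because no seed is set. If reproducible samples are wanted, one option is NumPy's `default_rng` with an explicit seed. The sketch below is an assumption about how that could be done, not part of the notebook. The helper `sample_one_per_key` and the toy dictionary are hypothetical.

```python
import numpy as np

# Toy mapping from SPC hid to candidate NTS HouseholdIDs
toy_matches = {"a": [100, 101, 102], "b": [200], "c": [300, 301]}


def sample_one_per_key(matches, seed):
    """Pick one value per key, using a seeded generator for reproducibility."""
    rng = np.random.default_rng(seed)
    return {key: rng.choice(values) for key, values in matches.items()}


# Two draws with the same seed give identical results
toy_draw_1 = sample_one_per_key(toy_matches, seed=0)
toy_draw_2 = sample_one_per_key(toy_matches, seed=0)
```

For repeated sampling, a different seed per repetition (for example the loop index) would keep each draw reproducible while still varying between repetitions.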

In [None]:
# identify the datatype of hid_to_HouseholdID_sample_list
type(hid_to_HouseholdID_sample_list), type(hid_to_HouseholdID_sample)