# Converting the National Travel Survey into a Simple MATSim Format Population

This notebook demonstrates an example workflow for converting tabular diary data (household attributes, person attributes and trip data) into MATSim-formatted XML population data for London households.

This includes:
- pre-processing of tabular inputs
- loading data into pam
- household sampling
- facility sampling
- preliminary investigation
- writing to xml

This example is highly simplified. Of particular note, the diary data used is spatially very aggregated (trip locations are aggregated to inner/outer London), which creates significant variance in the sampled trip lengths. Generally we would expect more precise spatial data to be used. Alternatively, the facility sampling step can be made more sophisticated to better account for known trip features such as mode and duration.

The diary data used is available from the UK Data Service (https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=5340) and is described here: http://doc.ukdataservice.ac.uk/doc/5340/mrdoc/pdf/5340_nts_user_guidance_1995-2016.pdf

In [None]:
use_dummy_data = True

In [None]:
import pandas as pd
import numpy as np
import geopandas as gp
import os
from matplotlib import pyplot as plt

In [None]:
out_dir = '../outputs'  # outputs are written here

# required inputs from the National Travel Survey
if use_dummy_data:
    households_csv = './data/dummyNTS/householdeul2017.tab'
    individuals_csv = './data/dummyNTS/individualeul2017.tab'
    trips_csv ='./data/dummyNTS/tripeul2017.tab'

else:
    households_csv = '~/Data/UKDA-5340-tab/tab/householdeul2017.tab'
    individuals_csv = '~/Data/UKDA-5340-tab/tab/individualeul2017.tab'
    trips_csv ='~/Data/UKDA-5340-tab/tab/tripeul2017.tab'


## Load households data

1. Load household data into pandas DataFrame.
2. Create some mappings of participation and weighting by household for use later. These are described in http://doc.ukdataservice.ac.uk/doc/5340/mrdoc/pdf/5340_nts_user_guidance_1995-2016.pdf

In [None]:
hh_in = pd.read_csv(
    households_csv,
    sep='\t',
    usecols=['HouseholdID', 'SurveyYear', 'PSUID', 'W2', 'OutCom_B02ID',
       'HHIncome2002_B02ID', 'AddressType_B01ID', 'Ten1_B02ID',
       'Landlord_B01ID', 'ResLength_B01ID', 'HHoldCountry_B01ID',
       'HHoldGOR_B02ID', 'HHoldNumAdults', 'HHoldNumChildren',
       'HHoldNumPeople', 'HHoldStruct_B02ID', 'NumLicHolders',
       'HHoldEmploy_B01ID', 'NumVehicles', 'NumBike', 'NumCar', 'NumMCycle',
       'NumVanLorry', 'NumCarVan', 'WalkBus_B01ID', 'Getbus_B01ID',
       'WalkRail_B01ID', 'WalkRailAlt_B01ID',
       'HRPWorkStat_B02ID', 'HRPSEGWorkStat_B01ID', 'HHoldOAClass2011_B03ID',
       'Settlement2011EW_B03ID', 'Settlement2011EW_B04ID'],
)

hh_in.HHIncome2002_B02ID = pd.to_numeric(hh_in.HHIncome2002_B02ID, errors='coerce')
hh_in.NumLicHolders = pd.to_numeric(hh_in.NumLicHolders, errors='coerce')
hh_in.NumVehicles = pd.to_numeric(hh_in.NumVehicles, errors='coerce')
hh_in.NumCar = pd.to_numeric(hh_in.NumCar, errors='coerce')
hh_in.NumMCycle = pd.to_numeric(hh_in.NumMCycle, errors='coerce')
hh_in.NumVanLorry = pd.to_numeric(hh_in.NumVanLorry, errors='coerce')
hh_in.NumCarVan = pd.to_numeric(hh_in.NumCarVan, errors='coerce')
hh_in.Settlement2011EW_B04ID = pd.to_numeric(hh_in.Settlement2011EW_B04ID, errors='coerce')

hh_in.head()

In [None]:
participation_mapping = dict(zip(hh_in.HouseholdID, hh_in.OutCom_B02ID))
weight_mapping = dict(zip(hh_in.HouseholdID, hh_in.W2))

## Load person data

Load person attributes data into pandas DataFrame.

In [None]:
persons_in = pd.read_csv(
    individuals_csv,
    sep='\t',
    usecols=['SurveyYear', 'IndividualID', 'HouseholdID', 'PSUID', 'VehicleID',
       'PersNo', 'Age_B01ID', 'OfPenAge_B01ID', 'Sex_B01ID', 'EdAttn1_B01ID',
       'EdAttn2_B01ID', 'EdAttn3_B01ID', 'DrivLic_B02ID', 'CarAccess_B01ID',
       'DrivDisable_B01ID', 'WkPlace_B01ID', 'ES2000_B01ID', 'NSSec_B03ID',
       'SC_B01ID', 'Stat_B01ID', 'SVise_B01ID', 'EcoStat_B02ID',
       'PossHom_B01ID']
)
persons_in.head()

## Load trip data

1. Load trip data into a pandas DataFrame.
2. Apply some preliminary formatting.
3. Replace headers so that we can use the PAM read method:


- pid - person ID
- hid - household ID
- seq - trip sequence number
- hzone - household zone
- ozone - trip origin zone
- dzone - trip destination zone
- purp - trip purpose
- mode - trip mode
- tst - trip start time (minutes)
- tet - trip end time (minutes)
- freq - weighting for representative population
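
As a rough illustration of this target layout, the sketch below builds a two-trip table with the headers listed above. All values are invented, the zone and activity labels are placeholders rather than real NTS codes, and times are assumed to be minutes after midnight.

```python
import pandas as pd

# Hypothetical two-trip diary for one person, using the headers listed above.
# Every value here is invented for illustration only.
example_trips = pd.DataFrame({
    "pid": ["p1", "p1"],
    "hid": ["h1", "h1"],
    "seq": [1, 2],
    "hzone": ["a", "a"],
    "ozone": ["a", "b"],
    "dzone": ["b", "a"],
    "purp": ["work", "home"],
    "mode": ["car", "car"],
    "tst": [8 * 60, 17 * 60],            # assumed: minutes after midnight
    "tet": [8 * 60 + 30, 17 * 60 + 45],
    "freq": [1.0, 1.0],
})

# Trip durations in minutes follow directly from the start and end times.
durations = example_trips.tet - example_trips.tst
print(durations.tolist())
```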

In [None]:
travel_diaries_in = pd.read_csv(
    trips_csv,
    sep='\t',
    usecols=['TripID', 'SurveyYear', 'DayID', 'IndividualID', 'HouseholdID', 'PSUID',
       'PersNo', 'TravDay', 'JourSeq', 'ShortWalkTrip_B01ID', 'NumStages',
       'MainMode_B04ID', 'TripPurpFrom_B01ID',
        'TripPurpTo_B01ID', 'TripPurpose_B04ID',
       'TripStart', 'TripEnd', 'TripOrigUA2009_B01ID', 'TripDestUA2009_B01ID'],
#     dtype={"W5": np.float64,}
)

travel_diaries_in.TripStart = pd.to_numeric(travel_diaries_in.TripStart, errors='coerce')
travel_diaries_in.TripEnd = pd.to_numeric(travel_diaries_in.TripEnd, errors='coerce')

travel_diaries_in.head()

In [None]:
travel_diaries_in['participation'] = travel_diaries_in.HouseholdID.map(participation_mapping)
travel_diaries_in['hh_weight'] = travel_diaries_in.HouseholdID.map(weight_mapping)

In [None]:
# take a copy so that later edits do not act on a slice of travel_diaries_in
travel_diaries = travel_diaries_in.loc[travel_diaries_in.participation.isin([1,2])].copy()

In [None]:
travel_diaries.head()

In [None]:
travel_diaries = travel_diaries.rename(
    columns={  # rename data
        'JourSeq': 'seq',
        'TripOrigUA2009_B01ID': 'ozone',
        'TripDestUA2009_B01ID': 'dzone',
        'TripPurpFrom_B01ID': 'oact',
        'TripPurpTo_B01ID': 'dact',
        'MainMode_B04ID': 'mode',
        'TripStart': 'tst',
        'TripEnd': 'tet',
    }
)

travel_diaries.head()

In [None]:
travel_diaries.dtypes

In [None]:
def check_uniques(df):
    for c in df.columns:
        print(c)
        n = df[c].nunique()
        if n < 1000:
            print(df[c].unique())

In [None]:
check_uniques(travel_diaries)

## Area Mapping

The NTS documentation refers to 'modified' 2009 Unitary Authorities (UAs). The unmodified 2017 UAs are included below for reference. The 2017 UA names are similar, but not identical, to the NTS mappings.

Found here: https://data.gov.uk/dataset/4e1d5b2c-bb91-42ad-b420-f7fcab638389/counties-and-unitary-authorities-december-2017-full-extent-boundaries-in-uk-wgs84.

We have built our own geometry:

In [None]:
area_path = "./data/dummyNTS/NTSareas.geojson"

In [None]:
areas = gp.read_file(area_path)
areas.head()

In [None]:
areas.plot(figsize=(6,6))

## Clean out incomplete plans

In [None]:
def remove_broken_plans(plan):
    if plan.isnull().values.any():
        return None
    for col in ['ozone', 'dzone']:
        if -8 in list(plan[col]):
            return None
    return plan
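
The same cleaning rule can also be expressed with pandas' `groupby(...).filter`, which keeps or drops whole groups based on a boolean. The sketch below is an alternative formulation on invented data, not the method used in this notebook; `plan_is_complete` is a hypothetical helper name.

```python
import pandas as pd

def plan_is_complete(plan):
    """True if the plan has no missing values and no -8 (NA) zone codes."""
    if plan.isnull().values.any():
        return False
    return not plan[["ozone", "dzone"]].isin([-8]).values.any()

# Invented diary rows: person 1 is complete, person 2 has a -8 zone code,
# person 3 has a missing destination zone.
example = pd.DataFrame({
    "IndividualID": [1, 1, 2, 2, 3],
    "TravDay": [1, 1, 1, 1, 1],
    "ozone": [5, 6, 5, -8, 7],
    "dzone": [6, 5, -8, 5, None],
})

# Keep only the (person, day) groups whose whole plan is complete.
complete = example.groupby(["IndividualID", "TravDay"]).filter(plan_is_complete)
print(complete.IndividualID.unique())
```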

In [None]:
clean_travel_diaries = travel_diaries.groupby(
    ['IndividualID', 'TravDay']
).apply(
    remove_broken_plans
).reset_index(drop=True)

In [None]:
clean_travel_diaries.head()

In [None]:
print(len(travel_diaries))
print(len(clean_travel_diaries))

## Build Mappings and apply to common fields

We simplify key trip variables such as mode and activity.

In [None]:
def string_to_dict(string):
    """used to build dicts from NTS rtf format dictionaries (cut and paste from the NTS documentation)"""
    mapping = {}
    for line in string.split("\n"):
        _, v, l = line.split("\t")
        v = v.split(" = ")[1]
        l = l.split(" = ")[1]
        mapping[float(v)] = str(l)
    return mapping
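
For reference, a minimal usage sketch of `string_to_dict` follows. The input layout (three tab-separated fields per line, with the second and third written as `name = value`) is an assumption inferred from the parsing code, and the example string is invented rather than copied from the NTS documentation.

```python
def string_to_dict(string):
    """Same parsing logic as above, repeated so this example runs on its own."""
    mapping = {}
    for line in string.split("\n"):
        _, v, l = line.split("\t")
        v = v.split(" = ")[1]
        l = l.split(" = ")[1]
        mapping[float(v)] = str(l)
    return mapping

# Assumed layout per line: "<ignored>\tvalue = <code>\tlabel = <text>"
example = "\tvalue = 1\tlabel = Walk\n\tvalue = 2\tlabel = Bicycle"
example_mapping = string_to_dict(example)
print(example_mapping)
```

Note that the keys are floats; a lookup with the integer `1` still works because `1 == 1.0` in Python.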

In [None]:
mode_mapping = {
    1: 'walk',
    2: 'bike',
    3: 'car',  # 'Car/van driver'
    4: 'car',  # 'Car/van passenger'
    5: 'car',  # 'Motorcycle'
    6: 'car',  # 'Other private transport'
    7: 'pt',  # 'Bus in London'
    8: 'pt',  # 'Other local bus'
    9: 'pt',  # 'Non-local bus'
    10: 'pt',  # 'London Underground'
    11: 'pt',  # 'Surface Rail'
    12: 'car',  # 'Taxi/minicab'
    13: 'pt',  # 'Other public transport'
    -10: 'DEAD',
    -8: 'NA'
}

purp_mapping = {
    1: 'work',
     2: 'work',  #'In course of work',
     3: 'education',
     4: 'shop',  #'Food shopping',
     5: 'shop',  #'Non food shopping',
     6: 'medical', #'Personal business medical',
     7: 'other',  #'Personal business eat/drink',
     8: 'other',  #'Personal business other',
     9: 'other',  #'Eat/drink with friends',
     10: 'visit',  #'Visit friends',
     11: 'other',  #'Other social',
     12: 'other',  #'Entertain/ public activity',
     13: 'other',  #'Sport: participate',
     14: 'home',  #'Holiday: base',
     15: 'other',  #'Day trip/just walk',
     16: 'other',  #'Other non-escort',
     17: 'escort',  #'Escort home',
     18: 'escort',  #'Escort work',
     19: 'escort',  #'Escort in course of work',
     20: 'escort',  #'Escort education',
     21: 'escort',  #'Escort shopping/personal business',
     22: 'escort',  #'Other escort',
     23: 'home',  #'Home',
     -10: 'DEAD',
     -8: 'NA'
}

clean_travel_diaries['mode'] = clean_travel_diaries['mode'].map(mode_mapping)
clean_travel_diaries['oact'] = clean_travel_diaries['oact'].map(purp_mapping)
clean_travel_diaries['dact'] = clean_travel_diaries['dact'].map(purp_mapping)
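
Because `Series.map` with a dict returns `NaN` for any code that is missing from the mapping, it can be worth checking for unmapped codes after a step like this. The sketch below shows one way to do so on invented codes; `example_mode_mapping` is a cut-down, hypothetical mapping.

```python
import pandas as pd

codes = pd.Series([1, 3, 99])  # 99 is a made-up code missing from the mapping
example_mode_mapping = {1: "walk", 3: "car"}

mapped = codes.map(example_mode_mapping)

# Codes that had no entry in the mapping come back as NaN.
unmapped_codes = sorted(codes[mapped.isnull()].unique())
print(unmapped_codes)
```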

## Reweight and Split Days

To get the most from our small sample, we treat each individual diary day as a new person. To maintain the original household weighting, we divide each weight by the person's number of diary days.
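
A small worked example of this reweighting, on invented data: a person with two diary days and a household weight of 2.0 becomes two agents with a weight of 1.0 each. The sketch uses `transform("nunique")` rather than the `apply` approach used in this notebook.

```python
import pandas as pd

# Made-up diary rows: person "a" has two diary days, person "b" has one.
example = pd.DataFrame({
    "IndividualID": ["a", "a", "a", "b"],
    "DayID": [1, 1, 2, 3],
    "hh_weight": [2.0, 2.0, 2.0, 1.5],
})

# Divide each household weight by the person's number of distinct diary days.
n_days = example.groupby("IndividualID").DayID.transform("nunique")
example["freq"] = example.hh_weight / n_days
print(example.freq.tolist())
```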

In [None]:
# reweight and split ids for unique days

def reweight(group):
    """
    Reweight based on multiple diary days, ie if an agent has two diary days, we will treat these as
    two unique agents, so we halve the original weighting
    """
    group['freq'] = group.hh_weight / group.DayID.nunique()
    return group

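# group_keys=False keeps the original index, so IndividualID remains a plain column
trips = clean_travel_diaries.groupby('IndividualID', group_keys=False).apply(reweight)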
trips['pid'] = [f"{p}_{d}" for p, d in zip(trips.IndividualID, trips.TravDay)]
trips['hid'] = [f"{h}_{d}" for h, d in zip(trips.HouseholdID, trips.TravDay)]

In [None]:
trips.head()

In [None]:
def expand_days(
    trips,
    target,
    trips_on='Diary_number',
    target_on='Diary_number',
    new_id='pid',
    trim=True
):
    """
    Expand target df based on mapping between trips trips_on and new_id.
    Todo: this is slow and needs fixing.
    Set index to new_id.
    """
    print("Building mapping.")
    mapping = {}
    for i, person in trips.groupby(trips_on):
        mapping[i] = list(set(person[new_id]))
    n = len(mapping)
    
    if trim:
        print("Trimming target.")
        selection = set(trips[trips_on])
        target = target.loc[target[target_on].isin(selection)]
    
    splits = []
    for p, (i, ids) in enumerate(mapping.items()):
        if not p % 10:
            print(f"Building expanded data {p}/{n}", end='\r', flush=True)
        for idx in ids:
            # copy so that assigning the new id does not modify a slice of target
            split = target.loc[target[target_on] == i].copy()
            split[new_id] = idx
            splits.append(split)
    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    expanded = pd.concat(splits)
    expanded.set_index(new_id, inplace=True)
    print("Done")
    return expanded
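
The loop-based expansion above can be slow on large inputs. One possible alternative is a single merge between the unique (original id, new id) pairs and the target table. The sketch below is an assumption about how that could look, on invented data; `expand_days_merge` is a hypothetical name and the column order of its output differs from `expand_days`.

```python
import pandas as pd

def expand_days_merge(trips, target, trips_on, target_on, new_id):
    """Hypothetical merge-based variant: one target row per unique new_id."""
    # unique pairs of original id and new (per-day) id
    pairs = trips[[trips_on, new_id]].drop_duplicates()
    pairs = pairs.rename(columns={trips_on: target_on})
    # the inner join drops target rows that have no trips
    expanded = target.merge(pairs, on=target_on, how="inner")
    return expanded.set_index(new_id)

# Invented inputs: person 1 has two diary days, person 2 has one, person 3 none.
example_trips = pd.DataFrame({
    "IndividualID": [1, 1, 1, 2],
    "pid": ["1_1", "1_1", "1_2", "2_1"],
})
example_persons = pd.DataFrame({
    "IndividualID": [1, 2, 3],
    "Age_B01ID": [10, 12, 15],
})

example_people = expand_days_merge(
    example_trips, example_persons, "IndividualID", "IndividualID", "pid"
)
print(example_people)
```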

In [None]:
hhs = expand_days(
    trips,
    hh_in,
    trips_on='HouseholdID',
    target_on='HouseholdID',
    new_id='hid'
)

In [None]:
hhs.head()

In [None]:
people = expand_days(
    trips,
    persons_in,
    trips_on='IndividualID',
    target_on='IndividualID',
    new_id='pid'
)

In [None]:
people.head()

## Load into PAM

We load the pandas-formatted data into PAM using the `pam.read.load_travel_diary` read method, then carry out some very preliminary validation of the resulting plans.

In [None]:
from pam import write
from pam import read
from pam.plot.stats import plot_activity_times, plot_leg_times

In [None]:
trips.tst = trips.tst.astype(int)
trips.tet = trips.tet.astype(int)

In [None]:
population = read.load_travel_diary(
    trips=trips,
    persons_attributes=people,
    hhs_attributes=hhs,
    trip_freq_as_person_freq=True
)

In [None]:
population.fix_plans()

In [None]:
# this should be replaced with a more direct method
for hh in population.households.values():
    for p in hh.people.values():
        p.validate()

In [None]:
population.size  # this also accounts for the weighting

In [None]:
population.stats

In [None]:
population.activity_classes

In [None]:
population.mode_classes

In [None]:
plot_activity_times(population)

In [None]:
plot_leg_times(population)

In [None]:
# night shift @ 2016008863_6

In [None]:
hh = population.random_household()
hh.print()
hh.plot()

In [None]:
population.activity_classes

## Sample the Population

We sample a very small population based on the given NTS household weightings.
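
Sampling from weights means turning a fractional expected count into a whole number of household copies. The sketch below shows one common way to do this, probabilistic rounding. It is an assumption about how such a sampler might behave, not a copy of `pam.samplers.basic.freq_sample`, and `sample_count` is a hypothetical name.

```python
import random

def sample_count(freq, sample_percent, rng=random):
    """Hypothetical helper: scale freq by a percentage and round probabilistically."""
    expected = freq * sample_percent / 100
    whole = int(expected)
    # round up with probability equal to the fractional remainder
    return whole + int(rng.random() < expected - whole)

rng = random.Random(0)
# A weight of 25 at a 10% sample has an expected count of 2.5,
# so each draw is either 2 or 3.
counts = [sample_count(25, 10, rng) for _ in range(1000)]
print(min(counts), max(counts), sum(counts) / len(counts))
```

Rounding this way keeps the expected size of the sample equal to the scaled weight, which plain rounding or truncation would not.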

In [None]:
from pam.core import Population
from pam.samplers.basic import freq_sample
from copy import deepcopy

population_sample = Population()
    
for hid, household in population.households.items():
    av_hh_weight = household.freq  # this is currently the av of person freq in the hh
    freq = freq_sample(av_hh_weight, 10)

    for idx in range(freq):
        hh = deepcopy(household)
        hh.hid = f"{hh.hid}_{idx}"
        hh.people = {}
        for pid, person in household.people.items():
            p = deepcopy(person)
            p.pid = f"{pid}_{idx}"
            hh.add(p)
        population_sample.add(hh)


In [None]:
population_sample.size

## Facility Sampling

The facilities input is prepared using a separate project called OSM-Facility Sampler (OSMFS). This project would be better named the OSM Facility *Extractor*. We use it to extract viable activity locations for each activity type in each zone. This project is not currently open source, but is described below:

OSMFS joins OSM data with the geographies of an area to create a mapping between zones, activities and facility locations (points). This is output as GeoJSON:

{"type": "FeatureCollection", "features": [{"id": "0", "type": "Feature", "properties": {"activity": "other"}, "geometry": {"type": "Point", "coordinates": [-4.5235751, 54.1698685]}},

todo: the current methodology does not support shared facilities, ie facilities with more than one activity (schools are places of education and work for example).

todo: the above json has to be rejoined with the geography to create a spatial sampler. This is a duplicated operation which could be included in the Bench output, eg:

zone_id: activity: (id, point)
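
The nested structure suggested above could be built roughly as follows. This is only a sketch on invented features: the `zone` property is an assumption (it would have to come from a spatial join, since the OSMFS output shown above carries no zone id).

```python
from collections import defaultdict

# Made-up features; the "zone" property is assumed to have been added upstream.
features = [
    {"id": "0", "properties": {"activity": "other", "zone": "z1"},
     "geometry": {"type": "Point", "coordinates": [-0.10, 51.50]}},
    {"id": "1", "properties": {"activity": "work", "zone": "z1"},
     "geometry": {"type": "Point", "coordinates": [-0.12, 51.52]}},
    {"id": "2", "properties": {"activity": "other", "zone": "z2"},
     "geometry": {"type": "Point", "coordinates": [-0.20, 51.48]}},
]

# zone_id -> activity -> list of (id, point)
lookup = defaultdict(lambda: defaultdict(list))
for feature in features:
    props = feature["properties"]
    point = tuple(feature["geometry"]["coordinates"])
    lookup[props["zone"]][props["activity"]].append((feature["id"], point))

print(lookup["z1"]["other"])
```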

In [None]:
from pam.samplers import facility

In [None]:
def load_facilities(path, from_crs="EPSG:4326", to_crs="EPSG:27700"):
    # read the facilities from the given path and reproject them to the target CRS
    facilities = gp.read_file(path)
    facilities.crs = from_crs
    facilities.to_crs(to_crs, inplace=True)
    return facilities

def load_zones(zones_path, from_crs="EPSG:27700", to_crs="EPSG:27700"):
    
    zones = gp.read_file(zones_path)
    zones.set_index('id', inplace=True)
    if not from_crs == to_crs:
        zones.crs = from_crs
        zones.to_crs(to_crs, inplace=True)
    return zones

In [None]:
# facilities_path = '~/Data/facilities/NTS_london_facilities.json'

# facilities = load_facilities(facilities_path)
# zones = load_zones(area_path, from_crs="EPSG:4326")

# facility_sampler = facility.FacilitySampler(
# facilities=facilities,
# zones=zones,
# build_xml=True,
# fail=False,
# random_default=True
# )

# facility_sampler.clear()
# population_sample.sample_locs(facility_sampler)

## Random Sampler

If a facility sampler is not available, we can use random sampling instead.
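
To illustrate the general idea of sampling a random point within a zone, the sketch below uses rejection sampling: draw points uniformly in the zone's bounding box until one falls inside the polygon. It is written in plain Python on an invented triangular zone and is not the implementation of PAM's `RandomPointSampler`; `point_in_polygon` and `random_point` are hypothetical helpers.

```python
import random

def point_in_polygon(x, y, polygon):
    """Ray casting test for a simple polygon given as a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # does a horizontal ray starting at (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def random_point(polygon, rng=random):
    """Draw uniform points in the bounding box until one falls inside the polygon."""
    xs, ys = zip(*polygon)
    while True:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, polygon):
            return x, y

# Made-up triangular "zone" in projected coordinates (metres).
triangle = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]
rng = random.Random(1)
points = [random_point(triangle, rng) for _ in range(100)]
print(points[0])
```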

In [None]:
from pam.samplers.spatial import RandomPointSampler

In [None]:
zones = load_zones(area_path, from_crs="EPSG:4326")
sampler = RandomPointSampler(geoms=zones)

In [None]:
sampler.sample(530, None)

In [None]:
population_sample.sample_locs(sampler)

In [None]:
person = population_sample.random_person()
person.plot()
person.print()

## Write to Disk

1. write MATSim formats to disk (plans and attributes)
2. write csv and geojson summaries to disk
3. write MATSim formatted facilities to disk

In [None]:
import pam.write as write

In [None]:
comment = 'NTS london prelim 24nov2020 epsg27700'

write.write_matsim(
    population_sample,
    plans_path=os.path.join(out_dir, 'plans.xml'),
    attributes_path=os.path.join(out_dir, 'attributes.xml'),
    comment=comment
)
population_sample.to_csv(out_dir, crs="EPSG:27700", to_crs="EPSG:4326")
# facility_sampler.write_facilities_xml(os.path.join(out_dir, 'facilities.xml'), comment=comment)