# Converting the National Travel Survey into a Simple MATSim Format Population

This notebook demonstrates an example workflow for converting tabular diary data (household attributes, person attributes and trip data) into MATSim-formatted XML population data for Cambridge households.

This includes:
- pre-processing of tabular inputs
- loading data into pam
- household sampling
- facility sampling
- preliminary investigation
- writing to xml

This example is highly simplified. Of particular note, the diary data used is spatially very aggregated (trip locations are aggregated to inner/outer London), which creates significant variance in the sampled trip lengths. Generally we would expect more precise spatial data to be used. Alternatively, the facility sampling step can be made more sophisticated to better account for known trip features such as mode and duration.

The diary data used is available from the UK Data Service (https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=5340) and is described here: http://doc.ukdataservice.ac.uk/doc/5340/mrdoc/pdf/5340_nts_user_guidance_1995-2016.pdf

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import geopandas as gp
import os
import warnings
from matplotlib import pyplot as plt
from copy import deepcopy
from tqdm import tqdm

# source mapbox token (the context manager closes the file after reading)
with open("../.mapbox_token", "r") as f:
    mapbox_token = f.read().split('\n')[0]

## Load households data

Load household data into pandas DataFrame.

In [None]:
# Number of samples in csv
N = 5  # 3000
# Seed used
seed = 1234

cambridge_hh_sample = pd.read_csv(f'../data/outputs/cambridge_nts_samples/cambridge_nts_household_sample_N{N}_SEED{seed}.csv',
                                  index_col=0)
cambridge_hh_sample.head(5)

## Load Cambridge person data

Load person attributes data into pandas DataFrame.

In [None]:
cambridge_persons_sample = pd.read_csv(
    f'../data/outputs/cambridge_nts_samples/cambridge_nts_people_sample_N{N}_SEED{seed}.csv',index_col=0
)
cambridge_persons_sample.head(5)

## Load Cambridge trip data

1. Load trip data into a pandas DataFrame.
2. Apply some preliminary formatting.
3. Replace headers so that we can use the PAM read method. The expected columns are:


- pid - person ID
- hid - household ID
- seq - trip sequence number
- hzone - household zone
- ozone - trip origin zone
- dzone - trip destination zone
- purp - trip purpose
- mode - trip mode
- tst - trip start time (minutes)
- tet - trip end time (minutes)
- freq - weighting for representative population
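
The header replacement in step 3 can be sketched with `DataFrame.rename`. The raw column names on the left of `COLUMN_MAP` are hypothetical placeholders, not confirmed NTS field names; substitute those of your own extract.

```python
import pandas as pd

# Hypothetical raw diary columns -> column names expected by PAM.
# The left-hand names are assumptions made for illustration only.
COLUMN_MAP = {
    "IndividualID": "pid",
    "HouseholdID": "hid",
    "TripSeq": "seq",
    "HomeZone": "hzone",
    "OriginZone": "ozone",
    "DestZone": "dzone",
    "Purpose": "purp",
    "Mode": "mode",
    "StartMinutes": "tst",
    "EndMinutes": "tet",
    "Weight": "freq",
}

# Toy data standing in for the raw trip table.
raw_trips = pd.DataFrame(
    {
        "IndividualID": [1, 1],
        "HouseholdID": [10, 10],
        "TripSeq": [1, 2],
        "HomeZone": ["A", "A"],
        "OriginZone": ["A", "B"],
        "DestZone": ["B", "A"],
        "Purpose": ["work", "home"],
        "Mode": ["car", "car"],
        "StartMinutes": [480, 1020],
        "EndMinutes": [510, 1050],
        "Weight": [2.5, 2.5],
    }
)

renamed_trips = raw_trips.rename(columns=COLUMN_MAP)

# Fail early if any required column is missing after renaming.
missing = set(COLUMN_MAP.values()) - set(renamed_trips.columns)
assert not missing, f"missing columns: {missing}"
```

Checking for missing columns before calling the read method tends to give a clearer error than a failure deep inside the loader.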

In [None]:
cambridge_trips_sample = pd.read_csv(
    f'../data/outputs/cambridge_nts_samples/cambridge_nts_trip_with_zones_sample_N{N}_SEED{seed}.csv',index_col=0
)
cambridge_trips_sample.head(5)

## Load facilities and zones

In [None]:
from pam.samplers import facility

In [None]:
def load_facilities(path, from_crs="EPSG:4326", to_crs="EPSG:27700"):
    """Read facilities from `path` and return them in `to_crs`."""
    facilities = gp.read_file(path)
    facilities = facilities.rename(columns={"activities": "activity"})
    # Declare the source CRS explicitly, overriding whatever the file reports
    facilities = facilities.set_crs(from_crs, allow_override=True)
    # Reproject if necessary
    if from_crs != to_crs:
        print('Reprojecting facilities')
        facilities = facilities.to_crs(to_crs)
    return facilities

def load_zones(zones_path, from_crs="EPSG:27700", to_crs="EPSG:27700"):
    """Read zones from `zones_path`, indexed by ward code, in `to_crs`."""
    zones = gp.read_file(zones_path)
    zones = zones.set_index('ward_code')
    # Only override the CRS and reproject when the two differ
    if from_crs != to_crs:
        zones = zones.set_crs(from_crs, allow_override=True)
        zones = zones.to_crs(to_crs)
    return zones

In [None]:
# Import wards and facilities in Cambridge

# Facilities file path
cambridge_facilities_path = '../data/inputs/cambridge/cambridge_facilities_spread_filtered.geojson'
# './data/cambridge/cambridge_facilities_filtered.geojson'

# Boundary file path
area_path = "../data/inputs/cambridge/cambridge_wards.geojson"
# "./data/Counties_Unitary_Authorities_dec2017_full_extent/NTS_boundaries.geojson"
# "./data/cambridge/cambridge_bufferred_convex_hull_output_areas.geojson" 

cambridge_facilities = load_facilities(cambridge_facilities_path, from_crs="EPSG:27700")
wards = load_zones(area_path, from_crs="EPSG:27700")
# wards = wards.reset_index()

### Visualise wards and facilities

In [None]:
def plot_facilities_within_boundaries(fac,
                                      bounds,
                                      plot_flag:bool=False):
    warnings.filterwarnings("ignore")
    
    fig, ax = plt.subplots(figsize = (15,15))
    fac.plot(ax=ax, markersize=2, column='activity',legend=True,legend_kwds={'loc': 'upper right'})
    bounds.geometry.boundary.plot(ax = ax,color=None,edgecolor='k')
    # for zone, centroid in zip(bounds.index, bounds.centroid):
    #     ax.annotate(zone, xy = (centroid.x, centroid.y), size = 15)
    ax.axis('off')
    
    if plot_flag: plt.savefig('../data/outputs/figures/facilities_by_ward.png',dpi=1000)

plot_facilities_within_boundaries(cambridge_facilities,wards)

In [None]:
cambridge_facilities.head(3)

In [None]:
cambridge_facilities.activity.unique()

## Map zones to integers

In [None]:
ward_to_area_mapping = dict(zip(wards.index,wards.objectid))

In [None]:
# cambridge_hh_sample.loc[:,'hzone'] = cambridge_hh_sample.hzone.replace(ward_to_area_mapping).astype(int)
# cambridge_persons_sample.loc[:,'hzone'] = cambridge_persons_sample.hzone.replace(ward_to_area_mapping).astype(int)
# cambridge_trips_sample.loc[:,'hzone'] = cambridge_trips_sample.hzone.replace(ward_to_area_mapping).astype(int)
# cambridge_trips_sample.loc[:,'ozone'] = cambridge_trips_sample.ozone.replace(ward_to_area_mapping).astype(int)
# cambridge_trips_sample.loc[:,'dzone'] = cambridge_trips_sample.dzone.replace(ward_to_area_mapping).astype(int)
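
The commented-out cell above uses `replace` followed by `astype(int)`, which raises an opaque error if any ward code is absent from the mapping. A minimal alternative sketch uses `Series.map`, which returns `NaN` for unmapped codes so they can be reported explicitly. The zone codes and ids below are made up for illustration.

```python
import pandas as pd

# Hypothetical zone codes and integer ids, for illustration only.
toy_mapping = {"ward_a": 1, "ward_b": 2}

toy_trips = pd.DataFrame(
    {"ozone": ["ward_a", "ward_b"], "dzone": ["ward_b", "ward_a"]}
)

for col in ("ozone", "dzone"):
    mapped = toy_trips[col].map(toy_mapping)
    # map() yields NaN for unknown codes, so check before casting to int.
    if mapped.isna().any():
        raise ValueError(f"unmapped zone codes in column {col!r}")
    toy_trips[col] = mapped.astype(int)
```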

## Load into PAM

We load the pandas-formatted data into PAM using the `pam.read.load_travel_diary` read method, then carry out some very preliminary validation of the resulting plans.
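
Some problems are easier to catch in the trip table before loading. The sketch below is not part of PAM; `find_time_problems` is a hypothetical helper that flags trips ending before they start, or starting before the same person's previous trip has ended, using the `pid`, `seq`, `tst` and `tet` columns described earlier.

```python
import pandas as pd

def find_time_problems(trips: pd.DataFrame) -> pd.DataFrame:
    """Return trips with inconsistent start/end times (hypothetical helper)."""
    ordered = trips.sort_values(["pid", "seq"])
    # End time of the previous trip by the same person (NaN for a first trip).
    prev_end = ordered.groupby("pid")["tet"].shift()
    ends_before_start = ordered["tet"] < ordered["tst"]
    # Comparisons against NaN evaluate to False, so first trips are not flagged here.
    overlaps_previous = ordered["tst"] < prev_end
    return ordered[ends_before_start | overlaps_previous]

# Toy data: person 1's second trip overlaps the first; person 2's trip ends early.
toy_diary = pd.DataFrame(
    {
        "pid": [1, 1, 2],
        "seq": [1, 2, 1],
        "tst": [480, 500, 600],
        "tet": [510, 530, 590],
    }
)
problems = find_time_problems(toy_diary)
```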

In [None]:
from pam import write
from pam import read
from pam.plot.stats import plot_activity_times, plot_leg_times

In [None]:
cambridge_trips_sample.tst = cambridge_trips_sample.tst.astype(int)
cambridge_trips_sample.tet = cambridge_trips_sample.tet.astype(int)

In [None]:
# Deepcopy objects
cambridge_hh_sam = deepcopy(cambridge_hh_sample).set_index('hid')
cambridge_persons_sam = deepcopy(cambridge_persons_sample).set_index('pid')
cambridge_trips_sam = deepcopy(cambridge_trips_sample)

In [None]:
population = read.load_travel_diary(
    trips = cambridge_trips_sam,
    persons_attributes=cambridge_persons_sam,
    hhs_attributes=cambridge_hh_sam,
    trip_freq_as_person_freq=True,
    trip_freq_as_hh_freq=False,
    sample_perc = None,
#     tour_based = False
)

In [None]:
population.fix_plans()

In [None]:
# this should be replaced with a more direct method
for hh in tqdm(population.households.values()):
    for p in hh.people.values():
        p.validate()
#         p.plot()
#         break

In [None]:
print(population.size)
# this also accounts for the weighting

In [None]:
population.stats

In [None]:
population.activity_classes

In [None]:
population.mode_classes

In [None]:
# plot_activity_times(population)

In [None]:
# plot_leg_times(population)

In [None]:
# Select household id of interest
hhid = np.random.choice(cambridge_hh_sample.hid.values)

hh = population.households[hhid]
# hh.print()
print('Location',hh.location,'Number of people',hh.attributes.get('NumPeople'))
assert len(hh) == hh.attributes.get('NumPeople')
for pid in hh.people:
    p = hh.people[pid]
#     p.print()
    print('Person',pid,'Hzone',p.attributes.get('hzone'))
    for a in p.activities:
        print(a)
    p.plot()

In [None]:
# hh = population.random_household()
# hh.print()
# hh.plot()

## Sample the Population

We sample a very small population based on the given NTS household weightings.

In [None]:
# from pam.core import Population
# from pam.samplers.basic import freq_sample
# # from copy import deepcopy

# population_sample = Population()
    
# for hid, household in tqdm(population.households.items()):
#     av_hh_weight = household.freq  # this is currently the av of person freq in the hh
#     freq = freq_sample(av_hh_weight, 10)

#     for idx in range(freq):
#         hh = deepcopy(household)
#         hh.hid = f"{hh.hid}_{idx}"
#         hh.people = {}
#         for pid, person in household.people.items():
#             p = deepcopy(person)
#             p.pid = f"{pid}_{idx}"
#             hh.add(p)
#         population_sample.add(hh)

In [None]:
# population_sample.size
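
The commented-out cell above relies on `freq_sample` from `pam.samplers.basic`. The standalone sketch below illustrates one common way of turning a weight into an integer number of copies; it is an assumption about the general approach rather than a reproduction of PAM's implementation, and `stochastic_copies` and the weights are hypothetical.

```python
import random

def stochastic_copies(freq: float, sample_perc: float, rng: random.Random) -> int:
    """Number of copies to create for a household with weight `freq`.

    The integer part of freq * sample_perc / 100 is always kept; the
    fractional part becomes one extra copy with matching probability,
    so the expected count equals the scaled weight.
    """
    scaled = freq * sample_perc / 100
    whole = int(scaled)
    extra = 1 if rng.random() < (scaled - whole) else 0
    return whole + extra

rng = random.Random(1234)
toy_weights = {"hh_a": 250.0, "hh_b": 14.0}  # hypothetical household weights

sampled_ids = []
for hid, weight in toy_weights.items():
    for idx in range(stochastic_copies(weight, 10, rng)):
        # Suffix the copy index so duplicated households keep unique ids.
        sampled_ids.append(f"{hid}_{idx}")
```

Rounding stochastically rather than truncating avoids systematically dropping households whose scaled weight is below one.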

## Facility Sampling

The facilities input is prepared using a separate project called OSM-Facility Sampler (OSMFS). This project would be better named the OSM Facility *Extractor*. We use it to extract viable activity locations for each activity type for each zone. This project is not currently open source, but is described below:

OSMFS joins OSM data with the geographies of an area to create a mapping between zones, activities and facility locations (points). This is output as GeoJSON:

{"type": "FeatureCollection", "features": [{"id": "0", "type": "Feature", "properties": {"activity": "other"}, "geometry": {"type": "Point", "coordinates": [-4.5235751, 54.1698685]}},

todo: the current methodology does not support shared facilities, i.e. facilities with more than one activity (schools are places of both education and work, for example).

todo: the above JSON has to be rejoined with the geography to create a spatial sampler. This duplicates an operation that could instead be included in the Bench output, e.g.:

zone_id: activity: (id, point)
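
As a rough illustration of the `zone_id: activity: (id, point)` structure, the sketch below builds such a nested mapping in pure Python. The zones, facilities and the ray-casting helper are all hypothetical stand-ins; in practice a geopandas spatial join would normally do the point-in-polygon step.

```python
from collections import defaultdict

def point_in_polygon(x, y, polygon):
    """Ray-casting test; `polygon` is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray cast to the right of (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical square zones and facilities, for illustration only.
toy_zones = {
    "zone_a": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "zone_b": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
toy_facilities = [
    {"id": "0", "activity": "other", "point": (2.0, 3.0)},
    {"id": "1", "activity": "work", "point": (15.0, 5.0)},
    {"id": "2", "activity": "work", "point": (4.0, 8.0)},
]

# zone_id -> activity -> list of (facility id, point)
sampler_index = defaultdict(lambda: defaultdict(list))
for fac in toy_facilities:
    x, y = fac["point"]
    for zone_id, polygon in toy_zones.items():
        if point_in_polygon(x, y, polygon):
            sampler_index[zone_id][fac["activity"]].append((fac["id"], fac["point"]))
            break  # assign each facility to at most one zone
```

Storing a list per activity would also leave room for a shared facility to appear under several activities, which relates to the first todo above.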

### Launch facility location sampler

In [None]:
def test_sampler_distance(population, sampler, n_iterations = 200, complex_sampler = False,
                         title=None):
    distance_commute = []
    
    for i in range(n_iterations):
        if complex_sampler:
            population.sample_locs_complex(sampler)
        else:
            population.sample_locs(sampler)
        distance_commute.append(write.write_benchmarks(population)['euclidean_distance'][0])

    pd.Series(distance_commute).hist(bins = 20)
    if title is not None:
        plt.title(title)
    plt.xlabel('distance')
    plt.ylabel('frequency')
    plt.show()

In [None]:
facility_sampler = facility.FacilitySampler(
    facilities=cambridge_facilities.assign(weight1=1),
    zones=wards,
    build_xml=True,
    fail=False,
    random_default=False,
    weight_on='weight1'
)

facility_sampler.clear()

In [None]:
# Sample facility locations for each activity
population.sample_locs(facility_sampler)

In [None]:
cambridge_hh_sample[['hid','NumPeople']]

In [None]:
population.plot_travel_plotly(
  epsg='epsg:27700',
  colour_by='hid',
  mapbox_access_token=mapbox_token
)

In [None]:
hhold = population.households[hhid]
person = hhold.people[np.random.choice(list(hhold.people))]
print('pid',person.pid,'hid',person.attributes.get('hid'))
print('Home area',person.attributes.get('hzone'),'Home',person.home)
print('\n')

# Check that sampled point is in designated home zone
ward_boundary = wards[wards.index == person.attributes.get('hzone')].geometry.values[0]
assert ward_boundary.contains(person.home.loc)

person.plot()
for act in person.activities:
    print(act)
person.plot_travel_plotly(epsg='epsg:27700',
                          mapbox_access_token=mapbox_token)

In [None]:
test_sampler_distance(population, 
                      facility_sampler, 
                      complex_sampler=False,
                      title = 'Commute distance, simple sampling')
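
The introduction notes that facility sampling could account for known trip features such as mode and duration. One possible approach, sketched below under stated assumptions, weights candidate facilities by how closely their straight-line distance from the origin matches the distance implied by the trip. The speeds, the `spread` parameter and `sample_destination` itself are hypothetical and are not part of PAM.

```python
import math
import random

# Assumed average speeds in metres per minute; illustrative values only.
ASSUMED_SPEED = {"walk": 80.0, "bike": 250.0, "car": 500.0}

def sample_destination(origin, candidates, mode, duration_min, rng, spread=0.5):
    """Pick a candidate point, favouring those near the distance implied by the trip.

    origin: (x, y); candidates: list of (x, y) in a projected CRS (metres).
    """
    # Guard against a zero target distance for zero-duration trips.
    target = max(ASSUMED_SPEED[mode] * duration_min, 1.0)
    weights = []
    for x, y in candidates:
        dist = math.hypot(x - origin[0], y - origin[1])
        # Gaussian-shaped weight on the relative distance error.
        rel_err = (dist - target) / (spread * target)
        weights.append(math.exp(-0.5 * rel_err ** 2))
    if sum(weights) == 0:
        # Every candidate is far from the target: fall back to a uniform choice.
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(1234)
toy_candidates = [(800.0, 0.0), (8000.0, 0.0)]
picks = [
    sample_destination((0.0, 0.0), toy_candidates, "walk", 10, rng)
    for _ in range(200)
]
```

For a 10 minute walk the nearer candidate is chosen almost every time, whereas uniform sampling within a zone would pick each about equally often.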

## Ward not containing any facilities

In [None]:
# # Select random facility index
# random_index = np.random.randint(0,cambridge_facilities.shape[0],1)
# # Get facility for that index
# fac = gp.GeoDataFrame(cambridge_facilities.iloc[random_index,:].reset_index(drop=True))
# 
# 
# fig, ax = plt.subplots(1,1, figsize = (15,15))
# wards[wards.index == 'E05002816'].boundary.plot(ax = ax)
# cambridge_facilities.plot(ax=ax, markersize=100, color='red')
# for zone, centroid in zip(wards.index, wards.centroid):
#     ax.annotate(zone, xy = (centroid.x, centroid.y), size = 10)
# ax.axis('off')
# plt.show()
# print('Ward code identified',fac.ward_code)

## Write to Disk

1. write MATSim formats to disk (plans and attributes)
2. write csv and geojson summaries to disk
3. write MATSim formatted facilities to disk
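
To give a feel for what a plans file contains, the sketch below builds a single-person plan with the standard library. It is a simplified illustration and not PAM's writer; the element and attribute names follow the general shape of a MATSim population file and should be checked against the MATSim DTD for the version in use. The coordinates and times are made up.

```python
import xml.etree.ElementTree as ET

def minutes_to_hms(minutes: int) -> str:
    """Convert minutes after midnight to an HH:MM:SS string."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}:00"

# Hypothetical diary: (activity type, x, y, end time in minutes or None).
toy_activities = [
    ("home", 545000.0, 258000.0, 480),
    ("work", 546200.0, 257100.0, 1020),
    ("home", 545000.0, 258000.0, None),
]
toy_leg_modes = ["car", "car"]  # one leg between each pair of activities

root = ET.Element("population")
person = ET.SubElement(root, "person", id="person_0")
plan = ET.SubElement(person, "plan", selected="yes")

for i, (act_type, x, y, end) in enumerate(toy_activities):
    attrs = {"type": act_type, "x": str(x), "y": str(y)}
    if end is not None:
        # The final activity has no end time.
        attrs["end_time"] = minutes_to_hms(end)
    ET.SubElement(plan, "activity", attrs)
    if i < len(toy_leg_modes):
        ET.SubElement(plan, "leg", mode=toy_leg_modes[i])

xml_text = ET.tostring(root, encoding="unicode")
```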

In [None]:
import pam.write as write

In [None]:
comment = 'NTS cambridge prelim 03122020 epsg27700' 
#'NTS london prelim 24nov2020 epsg27700'
out_dir = '../data/outputs/matsim/'
matsim_dir = '/Users/Yannis/code/matsim-13.0/examples/cambridge-abm/v0.0.1/'

write.write_matsim(
        population,
        plans_path=os.path.join(out_dir, 'plans.xml'),
        attributes_path=os.path.join(out_dir, 'attributes.xml'),
        comment=comment
    )

population.to_csv(out_dir, crs="EPSG:27700", to_crs="EPSG:4326")
population.to_csv(matsim_dir, crs="EPSG:27700", to_crs="EPSG:27700")

facility_sampler.write_facilities_xml(os.path.join(out_dir, 'cambridge_facilities.xml'), comment=comment)
facility_sampler.write_facilities_xml(os.path.join(matsim_dir, 'cambridge_facilities.xml'), comment=comment)