Home areas are inferred for users based on where they spend the most time during nighttime hours.
These inferences are made after filtering the dataset to users with enough datapoints, and to datapoints within a specified geographic region *and* within a specified time period.
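
As a rough illustration of the inference described above, here is a minimal sketch. It is not the pipeline that produced the files used below: the column names `timestamp` and `tract`, the night window (20:00 to 06:00), the `min_points` threshold, and the use of point counts as a proxy for time spent are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical helper: column names, night window, and threshold are assumptions.
def infer_home_tracts(points_df, night_start=20, night_end=6, min_points=3):
    """Return one row per device with its most frequent nighttime tract."""
    hours = pd.to_datetime(points_df["timestamp"]).dt.hour
    # The night window wraps past midnight, so the two ends are combined with OR.
    night_df = points_df[(hours >= night_start) | (hours < night_end)]
    # Keep only devices with enough nighttime datapoints.
    counts = night_df.groupby("device ID")["tract"].transform("count")
    night_df = night_df[counts >= min_points]
    # Most frequent tract per device (point count stands in for time spent).
    homes = night_df.groupby("device ID")["tract"].agg(lambda s: s.mode().iloc[0])
    return homes.rename("home TRACT").reset_index()
```

Because the time-period filter decides which datapoints reach this step, a device can receive a different home tract, or drop out entirely, when the period changes.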

Question: How much do the inferred home areas change for users across different datasets that were filtered for different time periods?

The code below is meant to answer that question.
The answer is relevant for evaluating how well our trajectory synthesis models do when given home areas for which they must produce location data.


In [43]:
import numpy as np
import pandas as pd


# Column names defined as variables
DEVICE_ID = 'device ID'
HOME_TRACT = 'home TRACT'

# Compare the (user device ID -> inferred home TRACT) mapping between the following files
# File 1: For 1st full work week in May 2018
users_and_homes_file1 = './data/mount/201805/filtered/20180507_20180511/county_middlesex_norfolk_suffolk_3days_3nights_homes.csv'
# File 2: For 2nd full work week in May 2018
users_and_homes_file2 = './data/mount/201805/filtered/20180514_20180518/county_middlesex_norfolk_suffolk_3days_3nights_homes.csv'

In [44]:
# Read in the data
user_homes1_df = pd.read_csv(users_and_homes_file1)
user_homes2_df = pd.read_csv(users_and_homes_file2)

Join the datasets, keeping only the user device IDs that occur in both datasets.

In [54]:
# Merge the data
joined_user_homes_df = pd.merge(user_homes1_df, user_homes2_df, how="inner", on=DEVICE_ID, suffixes=(' 1', ' 2'))

u1 = user_homes1_df.shape[0]
u2 = user_homes2_df.shape[0]
u_joined = joined_user_homes_df.shape[0]
print('number of users in file 1: %s' % u1)
print('number of users in file 2: %s' % u2)

print('number of users in both file 1 and file 2: %s' % u_joined)
print('portion of users in file 1 that are also in file 2: %s/%s = %s' % (u_joined, u1, (u_joined/u1)))

number of users in file 1: 22673
number of users in file 2: 22522
number of users in both file 1 and file 2: 14076
portion of users in file 1 that are also in file 2: 14076/22673 = 0.6208265337626252
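
The inner join silently drops users that appear in only one file, and its row count equals the user count only if each device ID appears once per file. The sketch below shows one way to check both points with an outer merge. The helper name `overlap_breakdown` is hypothetical, and the one-row-per-device assumption is exactly what `validate` tests.

```python
import pandas as pd

# Hypothetical helper; assumes each file should hold one row per device ID.
def overlap_breakdown(homes1_df, homes2_df, key="device ID"):
    """Count devices found only in file 1, only in file 2, or in both."""
    merged = pd.merge(
        homes1_df[[key]], homes2_df[[key]],
        how="outer", on=key,
        indicator=True,         # adds a '_merge' column: left_only / right_only / both
        validate="one_to_one",  # raises MergeError if a device ID is duplicated
    )
    return merged["_merge"].value_counts().to_dict()
```

Calling `overlap_breakdown(user_homes1_df, user_homes2_df)` would return a dict keyed by `left_only`, `right_only`, and `both`; the `both` count should equal the inner-join row count when the device IDs are unique.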


For what portion of the users present in both datasets do the inferred homes match across the datasets?

In [55]:
total_users_count = joined_user_homes_df.shape[0]
matching_homes_df = joined_user_homes_df[joined_user_homes_df[HOME_TRACT + ' 1']==joined_user_homes_df[HOME_TRACT + ' 2']]
matching_homes_users_count = matching_homes_df.shape[0]
print('total users: %s' % total_users_count)
print('users with inferred homes matching across datasets: %s' % matching_homes_users_count)
print('portion (users with matching homes)/(total users) = (%s)/(%s) = %s' %
      (matching_homes_users_count, total_users_count, (matching_homes_users_count/total_users_count)))


total users: 14076
users with inferred homes matching across datasets: 12860
portion (users with matching homes)/(total users) = (12860)/(14076) = 0.9136118215402103
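
The match rate above is conditional on a user appearing in both files. The sketch below puts the two steps side by side so the retention rate and the match rate can be read together. The function name `summarize_home_stability` and its dictionary keys are hypothetical, and it assumes one row per device ID in each file.

```python
import pandas as pd

# Hypothetical helper; names and return keys are illustrative only.
def summarize_home_stability(homes1_df, homes2_df,
                             key="device ID", home="home TRACT"):
    """Report retention and home-match rates relative to file 1."""
    joined = pd.merge(homes1_df, homes2_df, how="inner", on=key,
                      suffixes=(" 1", " 2"))
    same = joined[home + " 1"] == joined[home + " 2"]
    n1 = len(homes1_df)
    return {
        "retained": len(joined) / n1,               # in both files
        "matching_given_retained": same.mean(),     # same home, among retained
        "retained_and_matching": same.sum() / n1,   # same home, among all of file 1
    }
```

With the counts printed above, the last figure would be 12860/22673, or roughly 57% of the users in file 1.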
