In [1]:
! pip install scikit-mobility
# mount google drive to pull files in
from google.colab import drive
drive.mount('/content/drive')

Collecting scikit-mobility
  Downloading scikit_mobility-1.3.1-py3-none-any.whl (167 kB)
Collecting folium==0.12.1.post1 (from scikit-mobility)
  Downloading folium-0.12.1.post1-py2.py3-none-any.whl (95 kB)
Collecting geojson<3.0.0,>=2.5.0 (from scikit-mobility)
  Downloading geojson-2.5.0-py2.py3-none-any.whl (14 kB)
Collecting geopandas<0.11.0,>=0.10.2 (from scikit-mobility)
  Downloading geopandas-0.10.2-py2.py3-none-any.whl (1.0 MB)
Collecting h3<4.0.0,>=3.7.3 (from scikit-mobility)
  Downloading h3-3.7.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)

In [2]:
# imports
import pandas as pd
import skmob
from skmob.preprocessing import filtering, compression, detection, clustering
import json
from datetime import datetime
from datetime import date
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (10, 10)
import warnings
# silence library warnings in the notebook output
warnings.filterwarnings('ignore')

In [6]:
# Open file - pull dates - create trajectory data frame
filename = "/content/drive/MyDrive/GOOGLoc/Records.json"
with open(filename) as json_file:
    data = json.load(json_file)
df = pd.DataFrame(data['locations'])

# clean timestamps (thanks Alben! - these are a total pain to deal with)
df["timestamp"] = df['timestamp'].str.replace("T", " ")
df["timestamp"] = df['timestamp'].str.replace("Z", "")
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Selecting desired dates - The few months over the summer where I was interning and commuting
rel = (df[(df['timestamp'] > '2023-05-20') & (df['timestamp'] < '2023-08-10')]).copy()
rel.reset_index(inplace=True)

# timestamp in seconds
rel['timeSec'] = rel['timestamp'].astype(np.int64)/10**9
# reset time such that first movement in parsed DataFrame occurs at time = 0s
rel['timeSec'] = rel['timeSec'] - min(rel['timeSec'])

# Convert lat & lon to actual values
rel['lat'] = rel['latitudeE7']/10000000
rel['lon'] = rel['longitudeE7']/10000000
rel['date'] = rel['timestamp']

# Cleaned df
clean = (rel[['lon', 'lat','date']]).copy()

# Create a TrajDataFrame from a DataFrame
tdf = skmob.TrajDataFrame(clean,
                          latitude='lat',
                          datetime='date',
                          longitude='lon')
tdf['leaving_datetime'] = tdf.datetime #this field is added to add stops later

# I did fly home and spend some time in Montana over the summer, so I bumped up the max speed a bit
ftdf = filtering.filter(tdf, max_speed_kmh=1000.0, include_loops=True, ratio_max=1)

In [8]:
from skmob.measures.individual import radius_of_gyration

rg_df = radius_of_gyration(ftdf)
rg_df.head()

Unnamed: 0,radius_of_gyration
0,314.817259


### Radius of Gyration

The metric calculated above is the radius of gyration: the characteristic distance, in kilometers, that I traveled during the observation window, which ends on 08/10/2023 [1]. It drops significantly once I remove the period that includes my trip to Montana.

To summarize the formula: the radius of gyration is the square root of the mean (across all positions in the individual's trajectory dataframe) of the squared distance between each position and the trajectory's center of mass.
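
Written out, following the definition in [1] (the symbols $n$, $\mathbf{r}_i$ and $\mathbf{r}_{cm}$ are my own notation here, with $n$ the number of recorded positions and $\operatorname{dist}$ the distance between two points):

$$
r_g = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \operatorname{dist}\left(\mathbf{r}_i, \mathbf{r}_{cm}\right)^2},
\qquad
\mathbf{r}_{cm} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{r}_i
$$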

Looking further into the utils files behind skmob, the function itself (which works on an array of trajectories for many individuals or on the trajectory of a single individual) first creates an array storing the lat/long pairs for all the points making up an individual's trajectory. The mean of this array is then calculated along each column, giving a mean latitude and a mean longitude, and stored as the individual's center of mass. Then skmob calls a utils function that calculates the distance from each location in the trajectory frame to the center of mass via the Haversine distance formula. These distances are squared and averaged (summed, then divided by the number of locations), and finally the square root of that average is taken.
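
To make those steps concrete, here is a minimal plain-Python sketch of the same calculation. It is not skmob's code: the function names, the Earth radius constant, and the sample coordinates are all my own assumptions, chosen only to show how a single far-away point inflates the result.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) points in decimal degrees."""
    lat1, lng1, lat2, lng2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def radius_of_gyration_km(points):
    """Root-mean-square Haversine distance of the points from their center of mass."""
    n = len(points)
    # center of mass: mean latitude and mean longitude (one mean per column)
    center = (sum(lat for lat, _ in points) / n,
              sum(lng for _, lng in points) / n)
    mean_squared = sum(haversine_km(p, center) ** 2 for p in points) / n
    return math.sqrt(mean_squared)


# made-up coordinates: three points close together, then one far-away "trip" point
near_home = [(37.80, -122.27), (37.82, -122.26), (37.81, -122.25)]
with_trip = near_home + [(46.89, -113.96)]

rg_home = radius_of_gyration_km(near_home)
rg_trip = radius_of_gyration_km(with_trip)
print(rg_home, rg_trip)
```

With these made-up points, the first value is on the order of a kilometer and the second is in the hundreds of kilometers, which is the same effect the Montana trip has on my real data.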

It makes sense, then, that my radius of gyration would be large if I took a big trip during the period under investigation. The locations recorded during that 10 day period were very far from the rest of the points that make up the trajectory, and therefore from my center of mass. We can see this below: when the period including my trip is trimmed out, the radius drops to about 8 km.

In [12]:
subframe = (ftdf[(ftdf['datetime'] > '2023-07-04') & (ftdf['datetime'] < '2023-08-04')]).copy()
rg_df_1 = radius_of_gyration(subframe)
rg_df_1.head()

Unnamed: 0,radius_of_gyration
0,8.304369


In [27]:
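lats_lngs = ftdf[['lat', 'lng']].values
# NOTE: without axis=0, np.mean averages every latitude and longitude together
# into one scalar; np.mean(lats_lngs, axis=0) would give a (lat, lng) center of mass
center_of_mass = np.mean(lats_lngs)
center_of_mass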

-41.635395237015125

### Frequency Rank

Frequency rank is a simple, sorted frequency distribution based on how often an individual visits a location. The plumbing behind it is very simple: it groups the dataframe by lat/long and counts the number of occurrences of each location. This frequency distribution is then sorted in descending order, and each location's position in the sorted list is its frequency rank.
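
A minimal plain-Python sketch of that group, count, sort, and rank sequence is below. It is an illustration, not skmob's implementation: `frequency_rank_sketch`, its optional `decimals` argument, and the coordinates are hypothetical names and made-up data of mine.

```python
from collections import Counter


def frequency_rank_sketch(points, decimals=None):
    """Return (lat, lng, rank) tuples, rank 1 being the most frequently seen point."""
    if decimals is not None:
        # optionally snap coordinates to a coarser grid before counting
        points = [(round(lat, decimals), round(lng, decimals)) for lat, lng in points]
    counts = Counter(points)        # group identical (lat, lng) pairs and count them
    ordered = counts.most_common()  # sort by count, descending
    return [(lat, lng, rank)
            for rank, ((lat, lng), _count) in enumerate(ordered, start=1)]


# made-up data: three slightly different readings of one place, two identical of another
points = [
    (37.816606, -122.262396),
    (37.816611, -122.262401),
    (37.816608, -122.262399),
    (46.892265, -113.955658),
    (46.892265, -113.955658),
]

raw_ranks = frequency_rank_sketch(points)
rounded_ranks = frequency_rank_sketch(points, decimals=5)
print(raw_ranks[0], rounded_ranks[0])
```

In this toy data the exact-match count puts the repeated point first, while rounding merges the three nearby readings into one location that then takes rank 1. Tiny coordinate differences are enough to reshuffle the ranking.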

In [13]:
from skmob.measures.individual import frequency_rank

fr_df = frequency_rank(ftdf)
fr_df.head()

Unnamed: 0,lat,lng,frequency_rank
0,46.892265,-113.955658,1
1,46.892266,-113.955662,2
2,46.892289,-113.955654,3
3,46.892279,-113.95566,4
4,46.892265,-113.955664,5


I really want to get a better feel for this without the noise of the high-precision lat/long coordinates, which more than likely are not accurate to that many digits. I am going to round to 5 decimal places (roughly one meter) and see where that puts the results for frequency rank.
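
As a rough check on what rounding to 5 decimal places means on the ground, the arithmetic can be sketched like this. The helper name, the spherical-Earth radius, and the example latitude of 37.8 degrees are my own assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in meters


def meters_per_step(decimals, latitude_deg):
    """Approximate ground size of one rounding step (10**-decimals degrees)."""
    step_rad = math.radians(10 ** -decimals)
    north_south = EARTH_RADIUS_M * step_rad
    # lines of longitude converge toward the poles, so east-west steps shrink
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


ns, ew = meters_per_step(5, 37.8)
print(round(ns, 2), round(ew, 2))
```

So each rounded cell is about a meter on a side, which is still far finer than the footprint of a house or an apartment building.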

In [19]:
df = ftdf.copy()
df['lat'] = df['lat'].round(5)
df['lng'] = df['lng'].round(5)

fr_df_1 = frequency_rank(df)
fr_df_1.head()

Unnamed: 0,lat,lng,frequency_rank
0,46.89227,-113.95566,1
1,37.81661,-122.2624,2
2,37.81663,-122.26235,3
3,37.81662,-122.2624,4
4,37.81661,-122.26241,5


FASCINATING!!! This *completely* changes the results. I did a bit of googling to see where the heck each of these locations fell. The first dataframe is entirely dominated by the week or so that I spent back in Montana: the top 5 locations are all basically my childhood home. When I round the lat and long columns to 5 decimal places, ranks 2-5 are replaced by my apartment in Oakland, but rank 1 is still Montana, which is really bizarre because I was only there for a week.

This essentially should create a list of my most visited locations. The garage behind my apartment makes sense. My house in Montana does not. It makes me wonder about my location permissions, when I had my device on, etc.

In [23]:
user_map0 = ftdf.plot_stops(zoom=15)
# plot the trajectory of the user
ftdf.plot_trajectory(map_f=user_map0, max_points=1000)

The map above won't generate; unfortunately it is too big. So I move on to my final measurement.

### Home Location

In [24]:
from skmob.measures.individual import home_location

hl_df = home_location(ftdf)
hl_df.head()

Unnamed: 0,lat,lng
0,37.816606,-122.262396


Hooray!!! It nailed it! This is indeed my home location, and I even used the default values: 22:00 for 'coming home for the evening' and 07:00 for 'leaving in the morning'. In essence, the formula returns the location where a conditional probability is greatest (an argmax): the probability that individual u (me in this case) is at location r (specified in lat/long coordinates) given the time of day. The algorithm only considers records that fall between 22:00 (10pm) and 07:00 (7am), on the assumption that the location visited most often during those hours is likely the person's home. The start_night and end_night parameters, which bound that window, are modifiable. That would be necessary if I worked a night shift or something like that.
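
The idea can be sketched in a few lines of plain Python: keep only the nighttime records, then take the most frequent location among them. This is an illustration under my own assumptions, not skmob's code. The function names, the `(datetime, lat, lng)` record layout, the `None` return when there are no nighttime records, and the sample data are all made up here.

```python
from collections import Counter
from datetime import datetime, time


def is_night(t, start_night, end_night):
    """True if time-of-day t falls inside the window, which may wrap past midnight."""
    if start_night <= end_night:
        return start_night <= t <= end_night
    return t >= start_night or t <= end_night


def home_location_sketch(records, start_night=time(22, 0), end_night=time(7, 0)):
    """Most frequent (lat, lng) among records whose timestamp is in the night window."""
    night_points = [(lat, lng) for ts, lat, lng in records
                    if is_night(ts.time(), start_night, end_night)]
    if not night_points:
        return None  # assumption of this sketch: no guess without nighttime data
    return Counter(night_points).most_common(1)[0][0]


# made-up records: nights at one place, a long working day at another
records = [
    (datetime(2023, 7, 10, 23, 30), 37.81661, -122.2624),
    (datetime(2023, 7, 11, 2, 0), 37.81661, -122.2624),
    (datetime(2023, 7, 11, 6, 45), 37.81661, -122.2624),
    (datetime(2023, 7, 11, 12, 0), 37.79, -122.40),
    (datetime(2023, 7, 11, 13, 0), 37.79, -122.40),
    (datetime(2023, 7, 11, 14, 0), 37.79, -122.40),
    (datetime(2023, 7, 11, 15, 0), 37.79, -122.40),
    (datetime(2023, 7, 11, 22, 30), 37.80, -122.27),
]

home = home_location_sketch(records)
night_shift_home = home_location_sketch(records, start_night=time(9, 0), end_night=time(17, 0))
print(home, night_shift_home)
```

Even though the daytime location has more records overall, the default window picks the place where the nights were spent. Shifting the window to 09:00 through 17:00, as a night-shift worker might, flips the answer to the daytime location.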

References

[1] González, M., Hidalgo, C. & Barabási, A.-L. Understanding individual human mobility patterns. Nature 453, 779–782 (2008). https://doi.org/10.1038/nature06958