## Load The Pickle File of Google Location Data

As the zipped JSON file is no longer used for this assignment, I loaded the data directly from the provided pickle file and displayed basic information about it.

In [None]:
import pandas as pd
import plotly.express as px
import datetime
import numpy as np
import gzip
import pickle

data_pickle_path = '/content/LocationData-20241015.pkl'
df = pd.read_pickle(data_pickle_path)

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2800002 entries, 0 to 2800001
Data columns (total 22 columns):
 #   Column                       Dtype  
---  ------                       -----  
 0   latitudeE7                   float64
 1   longitudeE7                  float64
 2   accuracy                     float64
 3   source                       object 
 4   deviceTag                    int64  
 5   timestamp                    object 
 6   activity                     object 
 7   deviceDesignation            object 
 8   activeWifiScan.accessPoints  object 
 9   altitude                     float64
 10  verticalAccuracy             float64
 11  platformType                 object 
 12  osLevel                      float64
 13  serverTimestamp              object 
 14  deviceTimestamp              object 
 15  batteryCharging              object 
 16  formFactor                   object 
 17  locationMetadata             object 
 18  inferredLocation             object 
 19  

Unnamed: 0,latitudeE7,longitudeE7,accuracy,source,deviceTag,timestamp,activity,deviceDesignation,activeWifiScan.accessPoints,altitude,...,osLevel,serverTimestamp,deviceTimestamp,batteryCharging,formFactor,locationMetadata,inferredLocation,placeId,velocity,heading
0,460714447.0,-1183339000.0,1820.0,CELL,1294458137,2013-05-16T21:32:26.733Z,,,,,...,,,,,,,,,,
1,460709585.0,-1183337000.0,34.0,CELL,1294458137,2013-05-16T21:33:27.042Z,"[{'activity': [{'type': 'STILL', 'confidence':...",,,,...,,,,,,,,,,
2,460709585.0,-1183337000.0,34.0,CELL,1294458137,2013-05-16T21:34:27.089Z,,,,,...,,,,,,,,,,
3,460709585.0,-1183337000.0,34.0,CELL,1294458137,2013-05-16T21:35:27.463Z,,,,,...,,,,,,,,,,
4,460709585.0,-1183337000.0,34.0,CELL,1294458137,2013-05-16T21:36:27.498Z,"[{'activity': [{'type': 'STILL', 'confidence':...",,,,...,,,,,,,,,,


## Create Corresponding DataFrame

First, I transformed the dataframe, using the .map() function to convert the latitudeE7 and longitudeE7 fields to decimal degrees by dividing by 1e7. Then I converted the timestamp column to datetime using pd.to_datetime() with the option format='mixed', as instructed in the announcement, set it as the index, and kept only the required columns: latitude and longitude.



In [None]:
df['latitude'] = df['latitudeE7'].map(lambda x: x / 1e7)
df['longitude'] = df['longitudeE7'].map(lambda x: x / 1e7)

df['timestamp'] = pd.to_datetime(df['timestamp'], format='mixed')
df.set_index('timestamp', inplace=True)
df_location = df[['latitude', 'longitude']]

df_location.info()
df_location.head()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2800002 entries, 2013-05-16 21:32:26.733000+00:00 to 2024-10-14 16:46:27.847000+00:00
Data columns (total 2 columns):
 #   Column     Dtype  
---  ------     -----  
 0   latitude   float64
 1   longitude  float64
dtypes: float64(2)
memory usage: 64.1 MB


timestamp,latitude,longitude
2013-05-16 21:32:26.733000+00:00,46.071445,-118.3339
2013-05-16 21:33:27.042000+00:00,46.070959,-118.333662
2013-05-16 21:34:27.089000+00:00,46.070959,-118.333662
2013-05-16 21:35:27.463000+00:00,46.070959,-118.333662
2013-05-16 21:36:27.498000+00:00,46.070959,-118.333662
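
The E7 scaling and timestamp parsing described above can be sketched on a tiny frame. This is only an illustration under assumptions: the name `toy` and its two sample rows are hypothetical, and `format="mixed"` needs pandas 2.0 or later. It also shows that dividing the whole column gives the same values as `.map()` without a Python-level loop.

```python
import pandas as pd

# Hypothetical two-row sample in the same shape as the raw export
toy = pd.DataFrame({
    "latitudeE7": [460714447.0, 460709585.0],
    "longitudeE7": [-1183339000.0, -1183337000.0],
    # One timestamp with fractional seconds, one without
    "timestamp": ["2013-05-16T21:32:26.733Z", "2013-05-16T21:33:27Z"],
})

# Element-wise conversion, as in the notebook
lat_map = toy["latitudeE7"].map(lambda x: x / 1e7)

# Vectorized alternative: divide the whole column at once
toy["latitude"] = toy["latitudeE7"] / 1e7
toy["longitude"] = toy["longitudeE7"] / 1e7

# format="mixed" infers the format of each string individually (pandas >= 2.0)
toy["timestamp"] = pd.to_datetime(toy["timestamp"], format="mixed")
toy = toy.set_index("timestamp")[["latitude", "longitude"]]

print(toy)
```

On a frame with millions of rows, the vectorized division is usually the faster of the two forms, since it avoids calling a Python lambda once per value.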


## Pickle File

I saved the resulting dataframe as a pickle file for future use to reduce loading time and improve efficiency.

In [None]:
pickle_output_path = '/content/TransformedLocationData.pkl'
df_location.to_pickle(pickle_output_path)
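
A quick round-trip check can confirm that nothing is lost when the frame is pickled. The sketch below is a minimal example, assuming a small stand-in frame named `frame` and a temporary file called `roundtrip.pkl`; both names are hypothetical and not part of the notebook.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for df_location: two rows with a timezone-aware index
frame = pd.DataFrame(
    {"latitude": [46.071445, 46.070959], "longitude": [-118.3339, -118.333662]},
    index=pd.to_datetime(["2013-05-16 21:32:26.733", "2013-05-16 21:33:27.042"], utc=True),
)
frame.index.name = "timestamp"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip.pkl")  # hypothetical file name
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

# Compare the restored frame with the original and show the index dtype
print(restored.equals(frame))
print(restored.index.dtype)
```

Unlike CSV, a pickle stores the dtypes and the timezone-aware index directly, so the timestamp column does not need to be parsed again after loading.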

## Read Pickle and Plot

I loaded the saved pickle file and filtered the data for July 4, 2015. Then I passed the Mapbox token and plotted the data using Plotly with Mapbox, keeping the token secure by storing it in Colab's built-in secret manager (the "Secrets" tab, key icon). I used the OpenStreetMap style and set all margins to zero so that the map fills the entire plotting area without unnecessary empty space.

The map shows the locations tracked on July 4, 2015, around Walla Walla, where the blue dots trace the movement path of a vehicle that day. The plotted map shows the route taken, including areas where the movement was more concentrated or formed a loop. I also observed that the distance between consecutive blue dots varies. Where the dots are close together, there was probably little movement between those moments: the vehicle might have been moving slowly, standing still, or moving around a small area. In contrast, where the dots are far apart, a lot of distance was covered between fixes, which could mean that the vehicle was moving quickly, provided the fixes were recorded at similar time intervals.


In [None]:
df_location_loaded = pd.read_pickle(pickle_output_path)
july_4_2015 = df_location_loaded.loc['2015-07-04']
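# Read the token from Colab's secret manager rather than hardcoding it in the notebook.
# The secret name 'MAPBOX_ACCESS_TOKEN' is an assumption; use the name stored in your own notebook.
from google.colab import userdata
MAPBOX_ACCESS_TOKEN = userdata.get('MAPBOX_ACCESS_TOKEN')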

fig = px.scatter_mapbox(july_4_2015,
                        lat='latitude',
                        lon='longitude',
                        title="Location Data on July 4, 2015",
                        zoom=10,
                        height=600)

fig.update_layout(mapbox_style="open-street-map",
                  mapbox_accesstoken=MAPBOX_ACCESS_TOKEN)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
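
The date filter used in this cell relies on partial string indexing of a DatetimeIndex. Here is a minimal sketch of that behavior, assuming a made-up four-row frame named `points`; the coordinates and timestamps are hypothetical.

```python
import pandas as pd

# Made-up fixes around midnight on either side of the target day
idx = pd.to_datetime(
    ["2015-07-03 23:59:00", "2015-07-04 00:00:30", "2015-07-04 18:15:00", "2015-07-05 00:00:00"],
    utc=True,
)
points = pd.DataFrame(
    {"latitude": [46.06, 46.07, 46.08, 46.09], "longitude": [-118.33, -118.34, -118.35, -118.36]},
    index=idx,
)

# A date string selects every row whose timestamp falls on that calendar day,
# evaluated in the index's own timezone (UTC here)
one_day = points.loc["2015-07-04"].copy()  # .copy() makes the subset safe to modify

# Adding a column to the copy leaves the original frame untouched
one_day["row_number"] = range(len(one_day))
print(one_day)
```

Taking a `.copy()` of the selection is a small design choice that matters later: columns can then be added to the one-day subset without pandas warning about writing to a slice of the larger frame.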

# Homework 05: Where's Schueller?

## Use K-means Clustering to Identify Where Schueller Works and Lives

### Hypotheses
1. Schueller works and spends most of his daytime in Olin Hall
2. Schueller's house is located in College Place, in the Walla Walla area

In [None]:
from sklearn.cluster import KMeans

df_sample = df_location.sample(10000, random_state=42)  # Sample data for efficiency

kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(df_sample[['latitude', 'longitude']])
df_sample['cluster'] = kmeans.labels_

fig_clusters = px.scatter_mapbox(df_sample,
                                 lat='latitude',
                                 lon='longitude',
                                 color='cluster',
                                 title="Identified Clusters (Work & Sleep)",
                                 zoom=10,
                                 height=600)

fig_clusters.update_layout(mapbox_style="open-street-map",
                           mapbox_accesstoken=MAPBOX_ACCESS_TOKEN)
fig_clusters.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig_clusters.show()
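
To compare the clusters against the two hypotheses, it can help to look at the centroid coordinates in addition to the map. The following is a minimal sketch on synthetic data, not the notebook's result: the two blob centers are made-up numbers rather than real places, and `toy`, `model`, and `centers` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two synthetic blobs of points around made-up centers (not real places)
center_a = np.array([46.05, -118.39])
center_b = np.array([46.07, -118.33])
pts = np.vstack([
    center_a + rng.normal(scale=0.001, size=(200, 2)),
    center_b + rng.normal(scale=0.001, size=(200, 2)),
])
toy = pd.DataFrame(pts, columns=["latitude", "longitude"])

# Fit two clusters; n_init is set explicitly so the run does not depend on version defaults
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(toy[["latitude", "longitude"]])
toy["cluster"] = model.labels_

# cluster_centers_ holds one (latitude, longitude) centroid per cluster label
centers = pd.DataFrame(model.cluster_centers_, columns=["latitude", "longitude"])
centers["n_points"] = toy["cluster"].value_counts().sort_index().to_numpy()
print(centers)
```

With the real sample, each centroid could be looked up on the map and checked against the hypothesized workplace and home. Note that K-means treats latitude and longitude as plain Euclidean coordinates, which is a reasonable approximation only over a small area.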

In [None]:
work_data = df_sample.between_time('09:00', '17:00')
work_cluster = work_data['cluster'].value_counts().idxmax()
print(f"The identified work cluster is: {work_cluster}")

fig_work = px.scatter_mapbox(work_data[work_data['cluster'] == work_cluster],
                             lat='latitude',
                             lon='longitude',
                             title="Frequent Location During Work Hours (Identified Work Cluster)",
                             zoom=10,
                             height=600)
fig_work.update_layout(mapbox_style="open-street-map",
                       mapbox_accesstoken=MAPBOX_ACCESS_TOKEN)
fig_work.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig_work.show()

The identified work cluster is: 1
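
One caveat about the time filter: the index is stored in UTC (the `+00:00` suffix in the output earlier), and `between_time` compares against the clock time of the index's own timezone. The sketch below illustrates the difference on two made-up timestamps. It assumes the analysis is meant in local Pacific time, which is the timezone Walla Walla uses; the frame `fixes` is hypothetical.

```python
import pandas as pd

# Made-up UTC timestamps on a summer weekday
idx = pd.to_datetime(["2015-07-06 10:00", "2015-07-06 20:00"], utc=True)
fixes = pd.DataFrame({"latitude": [46.07, 46.08], "longitude": [-118.33, -118.34]}, index=idx)

# On a UTC index, '09:00'-'17:00' means 09:00-17:00 UTC
utc_hours = fixes.between_time("09:00", "17:00")

# Converting the index first makes the same window mean local office hours
local = fixes.tz_convert("America/Los_Angeles")
local_hours = local.between_time("09:00", "17:00")

print(utc_hours.index.tolist())
print(local_hours.index.tolist())
```

In this toy case the two filters keep different rows: the 10:00 UTC fix is 03:00 local time, while the 20:00 UTC fix is 13:00 local time. Applying `tz_convert` before `between_time` would be one way to make the work-hours window line up with the local workday.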


## Above and Beyond

### Movement pattern between his home and his workplace

In [None]:
import random
walla_walla_data = df_location_loaded[(df_location_loaded['latitude'] > 46.0) & (df_location_loaded['latitude'] < 46.2) &
                                      (df_location_loaded['longitude'] > -118.5) & (df_location_loaded['longitude'] < -118.2)]

morning_commute = walla_walla_data.between_time('07:00', '09:00')
evening_commute = walla_walla_data.between_time('17:00', '19:00')
commute_data = pd.concat([morning_commute, evening_commute])

weekday_commute_data = commute_data[commute_data.index.weekday < 5]
random_days = random.sample(list(weekday_commute_data.index.normalize().unique()), 5)
selected_commute_data = weekday_commute_data[weekday_commute_data.index.normalize().isin(random_days)]
selected_commute_data = selected_commute_data.sort_index()

fig_commute_path = px.line_mapbox(selected_commute_data,
                                  lat='latitude',
                                  lon='longitude',
                                  title="Identified Commuting Paths Between Home and Work in Walla Walla (5 Random Weekdays)",
                                  zoom=12,
                                  height=600,
                                  color_discrete_sequence=['blue'])
fig_commute_path.update_layout(mapbox_style="open-street-map",
                               mapbox_accesstoken=MAPBOX_ACCESS_TOKEN)
fig_commute_path.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig_commute_path.show()
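
Because the selected rows from all five days are drawn as one continuous line, the end of one commute is joined to the start of the next. A possible refinement, sketched here under assumptions, is to label each row with its local day and commute leg, so that a plotting call such as `px.line_mapbox(..., line_group="trip")` can draw each trip separately. The frame `fixes` and the column names `day`, `leg`, and `trip` are hypothetical, and only the data preparation is shown.

```python
import numpy as np
import pandas as pd

# Made-up UTC fixes covering two weekdays
idx = pd.to_datetime(
    [
        "2015-07-06 14:30",  # 07:30 local on July 6
        "2015-07-06 14:45",  # 07:45 local on July 6
        "2015-07-07 00:30",  # 17:30 local on July 6
        "2015-07-07 14:40",  # 07:40 local on July 7
    ],
    utc=True,
)
fixes = pd.DataFrame(
    {"latitude": [46.05, 46.06, 46.07, 46.05], "longitude": [-118.39, -118.36, -118.33, -118.39]},
    index=idx,
)

# Work in local time so the day and leg labels match the commuter's clock
local = fixes.tz_convert("America/Los_Angeles").sort_index()
local["day"] = local.index.strftime("%Y-%m-%d")
local["leg"] = np.where(local.index.hour < 12, "morning", "evening")

# One label per individual trip, suitable for a line_group argument
local["trip"] = local["day"] + " " + local["leg"]
print(local[["day", "leg", "trip"]])
```

Grouping by trip keeps each morning or evening path as its own line, which makes it easier to see whether the same route is taken on different days.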

## Above & Beyond

For Above and Beyond, I calculated the total distance traveled on July 4, 2015 using the Haversine formula, adapting the code from the Stack Overflow answer cited below. I used the shift() function to create two columns, shifted_lat and shifted_lon, which hold the latitude and longitude of the previous point for each row. With these columns, I computed the distance between each point and the previous one by using the .apply() function to run the haversine function defined in the code over every row, and then summed all the distances.

Note that the result of 78.10 km is a sum of great-circle distances between consecutive points. Each segment is the shortest possible path between two points on the surface of a sphere approximating Earth, so the actual distance along roads can differ. Even so, the total suggests that the person either drove for a good portion of the day or engaged in a combination of different activities, considering the varying distance between consecutive blue dots, possibly visiting multiple destinations or going for a scenic drive.

Acknowledgement: https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas
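
The shift-and-sum idea described above can also be written without `.apply()`, because the NumPy functions in the haversine formula accept whole columns. This is a sketch under assumptions rather than the notebook's own cell: `haversine_km` is a hypothetical variant of the function, and the three-point `track` is made up so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km; accepts scalars or whole columns."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))


# Three made-up points, each one degree of latitude apart on the same meridian
track = pd.DataFrame({"latitude": [0.0, 1.0, 2.0], "longitude": [0.0, 0.0, 0.0]})

# shift() lines each row up with the previous point; the first row becomes NaN
step_km = haversine_km(
    track["latitude"].shift(), track["longitude"].shift(),
    track["latitude"], track["longitude"],
)

total_km = step_km.sum()  # sum() skips the leading NaN
print(f"{total_km:.2f} km")
```

One degree of latitude is about 111.19 km on a sphere of radius 6371 km, so the two steps in this toy track should add up to roughly 222.39 km. On a full day of fixes, the column-wise form avoids one Python function call per row.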

In [None]:
def haversine(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371  # mean Earth radius in km
    return c * r

# Work on an independent copy so that adding columns does not trigger a SettingWithCopyWarning
july_4_2015 = july_4_2015.copy()
# Coordinates of the previous point for each row
july_4_2015['shifted_lat'] = july_4_2015['latitude'].shift()
july_4_2015['shifted_lon'] = july_4_2015['longitude'].shift()
# The first row has no previous point, so drop it
july_4_2015 = july_4_2015.dropna()

july_4_2015['distance_km'] = july_4_2015.apply(lambda row: haversine(row['latitude'], row['longitude'],
                                                                    row['shifted_lat'], row['shifted_lon']), axis=1)

total_distance = july_4_2015['distance_km'].sum()
print(f"Total distance traveled on July 4, 2015: {total_distance:.2f} km")

Total distance traveled on July 4, 2015: 78.10 km