<h1> Notebook showing the clustering of data for id "I000" on 2020-01-01 </h1>

<h3> Background on HDBSCAN (vignette from the R dbscan package; this notebook uses the Python hdbscan library): </h3>
<a> https://cran.r-project.org/web/packages/dbscan/vignettes/hdbscan.html </a>

In [170]:
# import libraries needed
import pandas as pd
import hdbscan
# import matplotlib.pyplot as plt
# import seaborn as sns
from sklearn.preprocessing import StandardScaler  # used by create_clusters below


In [154]:
# load data
user0_df = pd.read_csv("./CleanData/movements_0.csv")

print(user0_df)

          id   longitude   latitude        date      time
0       I000 -122.335167  47.608013  2020-01-01  00:00:00
1       I000 -122.335168  47.608002  2020-01-01  00:01:07
2       I000 -122.335169  47.608598  2020-01-01  00:02:30
3       I000 -122.335168  47.608501  2020-01-01  00:03:40
4       I000 -122.335169  47.607648  2020-01-01  00:05:07
...      ...         ...        ...         ...       ...
526483  I000 -122.426741  47.572358  2020-12-30  23:56:41
526484  I000 -122.426741  47.572356  2020-12-30  23:57:14
526485  I000 -122.426741  47.572213  2020-12-30  23:58:03
526486  I000 -122.316388  47.618509  2020-12-30  23:59:13
526487  I000 -122.316388  47.617609  2020-12-31  00:00:26

[526488 rows x 5 columns]


In [155]:
# let's start off with just the first day of data
first_day = user0_df[user0_df["date"]=="2020-01-01"]
print(first_day)

        id   longitude   latitude        date      time
0     I000 -122.335167  47.608013  2020-01-01  00:00:00
1     I000 -122.335168  47.608002  2020-01-01  00:01:07
2     I000 -122.335169  47.608598  2020-01-01  00:02:30
3     I000 -122.335168  47.608501  2020-01-01  00:03:40
4     I000 -122.335169  47.607648  2020-01-01  00:05:07
...    ...         ...        ...         ...       ...
1442  I000 -122.381754  47.543701  2020-01-01  23:55:41
1443  I000 -122.381754  47.543135  2020-01-01  23:56:38
1444  I000 -122.381754  47.543680  2020-01-01  23:57:26
1445  I000 -122.381754  47.543708  2020-01-01  23:58:39
1446  I000 -122.381755  47.543255  2020-01-01  23:59:29

[1447 rows x 5 columns]


<h2> How precise do latitude and longitude need to be? Can we round? </h2>
<p> According to the table in the article linked below, a lat/long rounded to one decimal place can identify a country or region, and one rounded to two decimal places can identify a large city or district. Five decimal places can home in on an individual tree, and six can identify a person. </p>
<a> https://blis.com/precision-matters-critical-importance-decimal-places-five-lowest-go/ </a>
<p> Therefore, we can round to 4 decimal places if needed: that precision "can identify an individual street or parcel of land" and corresponds to a distance of about 11.1 meters. </p>
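<p> The 11.1 meter figure can be checked with a little arithmetic. The sketch below assumes a spherical Earth with about 111.32 km per degree of latitude; the function name cell_size_m and the constant are illustrative, not part of the original notebook. At the latitude of this data (roughly 47.6 degrees), a step of 0.0001 degrees comes out at about 11 m north-south but only roughly 7.5 m east-west, because a degree of longitude shrinks with the cosine of the latitude. </p>

```python
import math

# Approximate length of one degree of latitude in meters (spherical Earth assumption).
METERS_PER_DEGREE = 111_320.0


def cell_size_m(decimals, latitude_deg):
    """Return the (north-south, east-west) size in meters of a 10**-decimals degree step."""
    step = 10.0 ** -decimals
    north_south = step * METERS_PER_DEGREE
    # A degree of longitude shrinks with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


# 47.6 is roughly the latitude of the sample points in this notebook.
for decimals in range(1, 7):
    ns, ew = cell_size_m(decimals, 47.6)
    print(f"{decimals} decimals: {ns:12.2f} m north-south, {ew:12.2f} m east-west")
```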

In [156]:
def filter_travel_points(df, freq):
    # round to 4 decimal places
    rounded_df = df[["longitude", "latitude"]].round(4)
    rounded_df[["id","date","time"]] = df[["id","date","time"]]

    # group by (latitude,longitude) points to see most visited streets/parcels of land of about 11m in distance
    grouped_df = rounded_df.groupby(["longitude", "latitude"]).count().reset_index().sort_values("id")
    
    # keep only the points with occurrence > freq
    filtered_df = grouped_df[grouped_df["id"]>freq]

    # merge filtered with rounded
    merged1 = pd.merge(rounded_df, filtered_df, on=["latitude","longitude"])
    # remove unnecessary columns
    merged1 = merged1.drop(["id_x", "id_y","date_y", "time_y", "longitude", "latitude"], axis=1)
    
    # merge original df with filtered to keep only the points with occurrence > freq
    merged2 = pd.merge(df, merged1, left_on=["date","time"], right_on=["date_x","time_x"])
    # remove unnecessary columns (repeats)
    merged2 = merged2.drop(["date_x", "time_x"], axis=1)

    return merged2
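
<p> The two merges above can also be expressed with a single groupby. The following is a minimal sketch, not the notebook's method: the name filter_travel_points_transform and the decimals parameter are assumptions added for illustration. It should select the same rows as filter_travel_points as long as each (date, time) pair is unique in the input. </p>

```python
import pandas as pd


def filter_travel_points_transform(df, freq, decimals=4):
    """Keep rows whose rounded (longitude, latitude) cell occurs more than freq times."""
    # Round only for grouping; the returned rows keep their original coordinates.
    cells = df[["longitude", "latitude"]].round(decimals)
    # transform("size") gives every row the size of the group it belongs to.
    visits = cells.groupby(["longitude", "latitude"])["longitude"].transform("size")
    return df[visits > freq].reset_index(drop=True)


# Tiny hypothetical sample: three points share one rounded cell, two share another.
sample = pd.DataFrame({
    "id": ["I000"] * 5,
    "longitude": [1.00001, 1.00002, 1.00003, 2.0, 2.0],
    "latitude": [5.0, 5.0, 5.0, 5.0, 5.0],
    "date": ["2020-01-01"] * 5,
    "time": ["00:00:00", "00:01:00", "00:02:00", "00:03:00", "00:04:00"],
})
print(filter_travel_points_transform(sample, freq=2))
```

<p> Because the group sizes are aligned with the original index, no suffix columns such as id_x or date_y are created and nothing has to be dropped afterwards. </p>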



In [157]:

# filter df and check results
filtered_df = filter_travel_points(first_day, 6)
print(filtered_df)

       id   longitude   latitude        date      time
0    I000 -122.374718  47.593698  2020-01-01  00:38:02
1    I000 -122.374718  47.593727  2020-01-01  00:38:42
2    I000 -122.374718  47.594143  2020-01-01  00:43:52
3    I000 -122.374718  47.593698  2020-01-01  02:13:39
4    I000 -122.333503  47.670401  2020-01-01  05:37:20
..    ...         ...        ...         ...       ...
150  I000 -122.333502  47.670388  2020-01-01  23:13:09
151  I000 -122.333503  47.670426  2020-01-01  23:13:45
152  I000 -122.333503  47.670419  2020-01-01  23:16:22
153  I000 -122.333503  47.670392  2020-01-01  23:17:02
154  I000 -122.333503  47.670391  2020-01-01  23:17:45

[155 rows x 5 columns]


In [158]:
# save filtered data to csv
filtered_df.to_csv("./FilteredData/round_filtered_0.csv", index=False)

In [159]:
# create clusterer
clusterer = hdbscan.HDBSCAN(min_cluster_size=15)

In [160]:
def create_clusters(df):
    scaler = StandardScaler()
    points = df[["longitude", "latitude"]]
    scaled_df = scaler.fit_transform(points)

    # labels assigns each point to a cluster
    labels = clusterer.fit_predict(scaled_df)

    # add the clusters to a copy so the caller's df is not modified in place
    result = df.copy()
    result["cluster_label"] = labels

    return result


In [161]:
# create clusters from filtered data
cluster_df = create_clusters(filtered_df)
print(cluster_df)

       id   longitude   latitude        date      time  cluster_label
0    I000 -122.374718  47.593698  2020-01-01  00:38:02              0
1    I000 -122.374718  47.593727  2020-01-01  00:38:42              0
2    I000 -122.374718  47.594143  2020-01-01  00:43:52              0
3    I000 -122.374718  47.593698  2020-01-01  02:13:39              0
4    I000 -122.333503  47.670401  2020-01-01  05:37:20             -1
..    ...         ...        ...         ...       ...            ...
150  I000 -122.333502  47.670388  2020-01-01  23:13:09             -1
151  I000 -122.333503  47.670426  2020-01-01  23:13:45             -1
152  I000 -122.333503  47.670419  2020-01-01  23:16:22             -1
153  I000 -122.333503  47.670392  2020-01-01  23:17:02             -1
154  I000 -122.333503  47.670391  2020-01-01  23:17:45             -1

[155 rows x 6 columns]


<h3> List each cluster with the number of points found in it </h3>

In [162]:
# see how many points are per cluster
print(cluster_df.groupby("cluster_label").count().reset_index().sort_values("id"))

   cluster_label  id  longitude  latitude  date  time
0             -1   8          8         8     8     8
2              1  24         24        24    24    24
1              0  25         25        25    25    25
4              3  30         30        30    30    30
3              2  68         68        68    68    68
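
<p> HDBSCAN assigns the label -1 to points it treats as noise, so the first row of the table above is not a real cluster. A minimal sketch of a summary that skips the noise points and adds a mean coordinate per cluster is shown here; the function name summarize_clusters and the output column names are assumptions for illustration. </p>

```python
import pandas as pd


def summarize_clusters(df):
    """Return one row per cluster with its point count and mean coordinates."""
    # HDBSCAN labels noise points as -1, so drop them before summarizing.
    clustered = df[df["cluster_label"] != -1]
    summary = (
        clustered.groupby("cluster_label")
        .agg(
            n_points=("longitude", "size"),
            center_longitude=("longitude", "mean"),
            center_latitude=("latitude", "mean"),
        )
        .reset_index()
        .sort_values("n_points", ascending=False)
    )
    return summary


# Hypothetical labelled points standing in for cluster_df.
demo = pd.DataFrame({
    "longitude": [-122.0, -122.2, -122.4, -121.0],
    "latitude": [47.0, 47.2, 47.4, 46.0],
    "cluster_label": [-1, 0, 0, 1],
})
print(summarize_clusters(demo))
```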


<h3> <b>Next Step:</b> Produce a mappable output supported by Mapbox (our visualization tool) </h3>


In [163]:
# create function to turn a df to a geojson dictionary

def df_to_geojson(df):
    geojson = {
        "type": "FeatureCollection",
        "features": []
    }

    for index, row in df.iterrows():
        feature = {
            "type": "Feature",
            "properties": {"date": row["date"], "time":row["time"]},
            "geometry": {
                "type": "Point",
                "coordinates": [row["longitude"], row["latitude"]]
            }
        }
        geojson["features"].append(feature)

    return geojson
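
<p> The function above stores only the date and time of each point, so the cluster assignment is not available to the map. If the clusters should be styled in the visualization, the label can be carried along as a feature property. The sketch below is one possible variant under that assumption: rows_to_geojson is a hypothetical helper that takes plain dictionaries (for example from cluster_df.to_dict("records")), and the explicit int and float casts are defensive, since json cannot serialize NumPy integer types. </p>

```python
import json


def rows_to_geojson(rows):
    """Build a GeoJSON FeatureCollection of points that keeps the cluster label."""
    features = []
    for row in rows:
        features.append({
            "type": "Feature",
            "properties": {
                "date": str(row["date"]),
                "time": str(row["time"]),
                # Cast so NumPy integers do not break json.dump.
                "cluster_label": int(row["cluster_label"]),
            },
            "geometry": {
                "type": "Point",
                # GeoJSON positions are ordered [longitude, latitude].
                "coordinates": [float(row["longitude"]), float(row["latitude"])],
            },
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical record shaped like one row of cluster_df.
example_rows = [{
    "id": "I000", "longitude": -122.374718, "latitude": 47.593698,
    "date": "2020-01-01", "time": "00:38:02", "cluster_label": 0,
}]
print(json.dumps(rows_to_geojson(example_rows), indent=2))
```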

In [164]:
import json

# create function to write geojson dict to a file
def write_to_geojson(filepath, geojson):
    with open(filepath, 'w') as outfile:
        json.dump(geojson, outfile, indent=2)
    print("Finished writing geojson dict to ", filepath)

In [165]:
# create geojson dict for cluster for rounding method used in this file
round_geojson = df_to_geojson(cluster_df)

print(round_geojson)

{'type': 'FeatureCollection', 'features': [{'type': 'Feature', 'properties': {'date': '2020-01-01', 'time': '00:38:02'}, 'geometry': {'type': 'Point', 'coordinates': [-122.3747182371207, 47.59369838078001]}}, {'type': 'Feature', 'properties': {'date': '2020-01-01', 'time': '00:38:42'}, 'geometry': {'type': 'Point', 'coordinates': [-122.37471821899534, 47.593726805143966]}}, {'type': 'Feature', 'properties': {'date': '2020-01-01', 'time': '00:43:52'}, 'geometry': {'type': 'Point', 'coordinates': [-122.3747175079831, 47.59414323842463]}}, {'type': 'Feature', 'properties': {'date': '2020-01-01', 'time': '02:13:39'}, 'geometry': {'type': 'Point', 'coordinates': [-122.3747182371207, 47.59369838078001]}}, {'type': 'Feature', 'properties': {'date': '2020-01-01', 'time': '05:37:20'}, 'geometry': {'type': 'Point', 'coordinates': [-122.33350285844485, 47.67040083251598]}}, {'type': 'Feature', 'properties': {'date': '2020-01-01', 'time': '07:39:23'}, 'geometry': {'type': 'Point', 'coordinates': [

In [166]:
write_to_geojson("./ClusterResults/round_clusters_0.json", round_geojson)

Finished writing geojson dict to  ./ClusterResults/round_clusters_0.json


In [167]:
# create geojson dict for cluster for distance method used in R file
# First, import the file as a df
distance_df = pd.read_csv("./FilteredData/distance_filtered_0.csv")
# Narrow down data to the same day as the one we're doing exploratory analysis on
distance_df = distance_df[distance_df["date"] == "2020-01-01"]
# finally, build the geojson dict
distance_geojson = df_to_geojson(distance_df)

print(distance_geojson)

{'type': 'FeatureCollection', 'features': [{'type': 'Feature', 'properties': {'date': '2020-01-01', 'time': '00:06:55'}, 'geometry': {'type': 'Point', 'coordinates': [-122.335168841746, 47.6072304075468]}}, {'type': 'Feature', 'properties': {'date': '2020-01-01', 'time': '00:11:07'}, 'geometry': {'type': 'Point', 'coordinates': [-122.335169381394, 47.6070505381334]}}, {'type': 'Feature', 'properties': {'date': '2020-01-01', 'time': '00:19:34'}, 'geometry': {'type': 'Point', 'coordinates': [-122.335170169724, 47.6066967825694]}}, {'type': 'Feature', 'properties': {'date': '2020-01-01', 'time': '00:57:13'}, 'geometry': {'type': 'Point', 'coordinates': [-122.374123483911, 47.6097394780268]}}, {'type': 'Feature', 'properties': {'date': '2020-01-01', 'time': '01:43:42'}, 'geometry': {'type': 'Point', 'coordinates': [-122.317920980697, 47.5918025161772]}}, {'type': 'Feature', 'properties': {'date': '2020-01-01', 'time': '01:53:09'}, 'geometry': {'type': 'Point', 'coordinates': [-122.31792067

In [168]:
# write to file
write_to_geojson("./ClusterResults/distance_clusters_0.json", distance_geojson)

Finished writing geojson dict to  ./ClusterResults/distance_clusters_0.json


In [169]:
!jupyter nbconvert --to html clusterer.ipynb

[NbConvertApp] Converting notebook clusterer.ipynb to html
[NbConvertApp] Writing 663459 bytes to clusterer.html
