**Course**: Geospatial Analysis, Department of Geosciences and Environment, Universidad Nacional de Colombia, Medellín campus <br/>
**Professor**: Edier Aristizábal (evaristizabalg@unal.edu.co) <br />
**Credits**: The content of this notebook is taken from [Becoming a Spatial Data Scientist Materials](https://github.com/CartoDB/data-science-book). Every effort has been made to trace copyright holders of the materials used in this book. The author apologizes for any unintentional omissions and would be pleased to add an acknowledgment in future editions.

# DBSCAN with Visit Data

For this exercise, we will be working with a sample of [Safegraph's Patterns dataset](https://blog.safegraph.com/introducing-places-patterns-17ac5b96fb33).

The data is a set of home locations from which people travel to visit Panama City Beach, Florida during the month of July 2019. This example is a basic reproduction of some of the findings in the [CARTO - Safegraph partnership blog post](https://carto.com/blog/visit-pattern-footfall-data-safegraph/). The data comes from the `visitor_home_cbgs` home attribute for all Points of Interest (POIs) in Panama City Beach, Florida. See the [Patterns documentation](https://docs.safegraph.com/docs/places-schema#section-patterns) for more information.

Since we know the locations that people are coming from, it might be natural to ask if there are general regions that we can identify as drivers of the visits. For example, are there areas with a higher density of source visits that could be used to understand visit demographics?

Let's get started by downloading the data and taking a look at it.

In [1]:
import geopandas as gpd
import numpy as np

from cartoframes.viz import Map, Layer

import datasets
import warnings

warnings.filterwarnings("ignore")


### Retrieve the data

In [None]:
sg_pcb = datasets.get_safegraph_visits()
sg_pcb.head()

|   | cartodb_id | longitude | latitude | num_visits | geometry |
|---|---|---|---|---|---|
| 0 | 1 | -84.690182 | 33.990924 | 5 | POINT (-84.69018 33.99092) |
| 1 | 2 | -85.877212 | 30.216679 | 13 | POINT (-85.87721 30.21668) |
| 2 | 3 | -85.173263 | 31.904274 | 6 | POINT (-85.17326 31.90427) |
| 3 | 4 | -86.006852 | 34.632641 | 5 | POINT (-86.00685 34.63264) |
| 4 | 5 | -85.038783 | 32.523741 | 11 | POINT (-85.03878 32.52374) |


This is a point dataset: each row is a home location with the number of visits (`num_visits`) that originated there.

### Visualize points on map

In [None]:
Layer(sg_pcb)

### Calculate Clusters

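To calculate clusters, we will use DBSCAN, a density-based algorithm that groups points lying close together and labels isolated points as noise. It also accepts the haversine metric, so distances can be measured along the Earth's surface rather than in raw degrees.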

In [None]:
from sklearn.cluster import dbscan

# use lat/lng in radians as coordinates
coords = np.radians(sg_pcb[["latitude", "longitude"]].values)

# choose appropriate epsilon value
# here we use ~35 kilometers
kms_per_radian = 6371
epsilon = 35 / kms_per_radian

# calculate clusters
# use the haversine metric for approximate great-circle ("as the crow flies") distances on the Earth's surface
_, cluster_labels = dbscan(
    coords, eps=epsilon, min_samples=4, algorithm="ball_tree", metric="haversine",
)

# note: this counts distinct labels, so the noise label (-1) is included when present
print("Number of clusters: {}".format(len(set(cluster_labels))))

Number of clusters: 9
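
The haversine metric expects `[latitude, longitude]` pairs in radians and returns distances as angles in radians, which is why the cell above divides 35 km by the Earth's radius. The following is a minimal sketch of that conversion using only the standard library; the helper name `haversine_radians` is hypothetical and is not part of scikit-learn.

```python
import math

EARTH_RADIUS_KM = 6371  # mean Earth radius, same value as kms_per_radian above


def haversine_radians(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees, returned in radians."""
    phi1, lam1, phi2, lam2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (
        math.sin((phi2 - phi1) / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2
    )
    return 2 * math.asin(math.sqrt(a))


# 35 km expressed as an angle in radians
epsilon = 35 / EARTH_RADIUS_KM

# the first two rows of the table shown earlier (latitude, longitude)
d_rad = haversine_radians(33.990924, -84.690182, 30.216679, -85.877212)
d_km = d_rad * EARTH_RADIUS_KM

print("distance in km:", round(d_km))
print("within epsilon of each other:", d_rad <= epsilon)
```

Multiplying a haversine result by the Earth's radius converts it back to kilometers, so the same factor can be used to sanity-check any other choice of epsilon.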


### Add cluster labels to data

Now that we have uncovered some natural clusters, let's give them some appropriate labels.

Looking at the map below, we can see a few clusters that are easy to identify (e.g., local Panama City Beach and the large area in northern Alabama and Georgia), while the other clusters are smaller and less significant. A label of `-1` marks 'noise': a point that does not fall into any cluster.
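
Because `-1` is itself a label, `len(set(cluster_labels))` counts noise as if it were a cluster. As a small sketch of how to separate the two, the example below uses a hypothetical label array (`example_labels`, not the notebook's actual output) in the same format DBSCAN returns.

```python
import numpy as np

# hypothetical label array in the format DBSCAN returns (-1 = noise)
example_labels = np.array([0, 0, 1, -1, 1, 0, -1, 2, 2, 2])

unique_labels = set(example_labels.tolist())
n_clusters = len(unique_labels - {-1})        # drop the noise label before counting
n_noise = int(np.sum(example_labels == -1))   # number of points labeled as noise

print("distinct labels:", len(unique_labels))
print("clusters:", n_clusters)
print("noise points:", n_noise)
```

Applying the same two lines to `cluster_labels` would give the cluster count without noise and the number of unclustered points.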

In [None]:
from cartoframes.viz import color_category_style

# convert labels to text for creating a category map
sg_pcb["dbscan_labels"] = [str(s) for s in cluster_labels]

# show distribution of labels
Layer(sg_pcb, color_category_style('dbscan_labels'))

### Apply readable labels to clusters

In [None]:
# restore the integer labels (the previous cell stored them as strings for mapping)
sg_pcb["dbscan_labels"] = cluster_labels

# identify points as within a cluster or not
def in_cluster(cluster_num):
    if cluster_num == -1:
        return "Out of cluster"
    return "In cluster"


sg_pcb["in_cluster"] = sg_pcb["dbscan_labels"].apply(in_cluster)

### Calculate Convex Hulls to show approximate cluster region

To get approximate polygons to represent the regions, we can group the points by label and draw a convex hull around each group. We also add a small buffer (0.05 degrees, since the coordinates are in longitude/latitude) to improve the cartography.
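
To make the hull-and-buffer step concrete, here is a minimal sketch with Shapely on a hypothetical set of five points (not taken from the dataset). It assumes Shapely is installed, which is already the case wherever GeoPandas is available.

```python
from shapely.geometry import MultiPoint

# hypothetical cluster: four corner points plus one interior point (lon, lat in degrees)
points = MultiPoint(
    [(-85.0, 30.0), (-84.0, 30.0), (-84.0, 31.0), (-85.0, 31.0), (-84.5, 30.5)]
)

hull = points.convex_hull    # smallest convex polygon containing every point
padded = hull.buffer(0.05)   # grow the polygon outward by 0.05 coordinate units (degrees)

print("hull vertices:", list(hull.exterior.coords))
print("buffer enlarges the hull:", padded.area > hull.area)
print("buffer contains the hull:", padded.contains(hull))
```

The interior point does not appear among the hull vertices, and the buffer distance is in the units of the coordinates, so with longitude/latitude data it is an angular padding rather than a fixed distance on the ground.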

In [None]:
# group clusters (excluding noise)
# union points within cluster
# create a convex hull and small buffer
cluster_hulls = (
    sg_pcb[sg_pcb["dbscan_labels"] != -1]
    .groupby("dbscan_labels")
    .geometry.apply(lambda x: x.unary_union.convex_hull.buffer(0.05))
    .reset_index()
)

cluster_hulls = gpd.GeoDataFrame(cluster_hulls)

# Give cluster labels more readable titles
cluster_title_mapping = {
    -1: "Outlier",
    0: "Northern Alabama and Georgia",
    1: "Panama City Beach (Locals)",
}
cluster_title_mapping.update(
    {k: "Other smaller region" for k in range(2, max(cluster_labels) + 1)}
)

cluster_hulls["dbscan_labels_readable"] = cluster_hulls["dbscan_labels"].apply(
    lambda x: cluster_title_mapping.get(x)
)

In [None]:
cluster_hulls

|   | dbscan_labels | geometry | dbscan_labels_readable |
|---|---|---|---|
| 0 | 0 | POLYGON ((-84.94719 32.21661, -84.95174 32.215... | Northern Alabama and Georgia |
| 1 | 1 | POLYGON ((-85.69464 30.06520, -85.69932 30.063... | Panama City Beach (Locals) |
| 2 | 2 | POLYGON ((-85.12766 31.76488, -85.12806 31.759... | Other smaller region |
| 3 | 3 | POLYGON ((-83.43705 31.46919, -83.43920 31.464... | Other smaller region |
| 4 | 4 | POLYGON ((-87.06329 36.04642, -87.06598 36.042... | Other smaller region |
| 5 | 5 | POLYGON ((-85.91513 31.18973, -85.91990 31.189... | Other smaller region |
| 6 | 6 | POLYGON ((-88.37893 34.01090, -88.38195 34.006... | Other smaller region |
| 7 | 7 | POLYGON ((-86.39062 35.74135, -86.39316 35.737... | Other smaller region |


### Visualize outputs

In [None]:
from cartoframes.viz import color_category_legend, category_widget


Map(
    [Layer(cluster_hulls,
        style = color_category_style("dbscan_labels_readable",
                            opacity=0.7,
                            palette=["#66C5CC", "#DCB0F2", "#F89C74"],
                            stroke_color="transparent"),
        legends=color_category_legend(title="Visit Regions"),
        widgets=[category_widget('dbscan_labels_readable',
                          title='Cluster labels',
                          description='Select a category to filter')]
    ),
    Layer(sg_pcb,
          style = color_category_style("in_cluster",
                              palette=["#666", "deeppink"],
                              opacity=0.5),
          legends=color_category_legend(title="In Cluster")
    )]
)