<a href="https://www.bigdatauniversity.com"><img src = "https://ibm.box.com/shared/static/wbqvbi6o6ip0vz55ua5gp17g4f1k7ve9.png" width = 300, align = "center"></a>

<h1 align=center><font size = 5> DBSCAN Clustering in R</font></h1>

## Introduction

The density-based approach to clustering focuses on identifying clusters of high density separated by areas of low density. It is particularly well suited to population-related geographical data, where points tend to concentrate in dense regions of irregular shape. DBSCAN is one of the most widely used algorithms of this type, so it will be the focus of this lesson.

Some real-world applications of DBSCAN clustering:
- Weather data clustering
- Determining the best locations for service centres
- Crime data clustering

-----------------

## Table of contents 

<div class="alert alert-block alert-info" style="margin-top: 20px">
<font size = 3><strong>Density-based Clustering:</strong></font><br>
- <p><a href="#ref1">The problem we want to solve</a></p>
- <p><a href="#ref2">Framing the real problem to the application of DBSCAN</a></p>
- <p><a href="#ref3">The dataset</a></p>
- <p><a href="#ref4">A quick implementation</a></p>
<p></p>
</div>
<br>


----------------

<a id="ref1"></a>
# The problem to solve

The year is 2036. You live in Canada and it is the end of spring. The proportion of seniors aged 65 and over in the <a href = http://www.statcan.gc.ca/pub/91-520-x/2010001/aftertoc-aprestdm1-eng.htm>population is 25% and increasing</a>. While you are having breakfast, an interview on the news channel draws your attention. The government of Canada is concerned about the upcoming summer heat waves, given the particularly blistering temperatures of recent years. The Minister of Environment and Climate Change mentions during the interview that she is <font color = "green"> willing to create new weather stations. These stations will be packed with brand new equipment and will be responsible for collecting information from groups of existing weather stations nearby, to better monitor temperatures and prevent heat injuries in the increasingly older population. </font> As a person who cares about your future and your family's, you decide to help. You call a friend who works at the Institute of Weather Injuries Prevention, and he says that the government called two days ago asking him to **come up with the locations of the new weather stations. How many should be built, and where should each one be located?** To answer that question, your friend has already assigned a research team to gather information on the existing weather stations. You will be responsible for processing the data that this research will generate. **Since you want to prepare the model before the data comes in, you will use a weather dataset from 2015 to prepare all the code.**


------------

<a id="ref2"></a>
# Framing the real problem

First, let's give names to each type of station:
- The *old station* will be called **"normal station" or "old station"**.
- The *new station* will be called **"main station" or "new station"**.

As mentioned before, DBSCAN focuses on identifying clusters of high density separated by areas of low density. Since different types of distributions and densities are possible, each DBSCAN application **must be correctly tuned.**  
  
The tuning is done through the **eps** and **minPts** parameters. The "eps" parameter corresponds to the **size of the neighborhood** around a point, and "minPts" to the **minimum number of points that must exist in this neighborhood to define it as a dense area**.  
  
To explain the meaning of each of these parameters, we can use our example. Our new weather stations collect information from <font color = "green">collections of old weather stations</font>. To collect this information, communication must be established between the weather stations. But there are two main restrictions:
 - There is a <font color = "green">maximum distance between two weather stations</font>, determined by the maximum length of a special cable (we have to use *special cables* to transmit data faster). However, stations that are farther than this maximum distance from a main station can still be connected by linking them to weather stations that are within reach of the main station, with an unlimited number of connections.
 - To justify the connection of an old weather station to the main station (through direct or indirect connections), it has to have a <font color = "green">minimum number of "reachable" stations using the *special cable*</font> (perhaps because the price of setting up the streaming device is high).
   
In the analogy above, the <font color = "green">maximum distance between two weather stations</font> corresponds to the **eps** and the <font color = "green">minimum number of "reachable" stations using the *special cable*</font> to the **minPts**.

From the problem statement, we have to determine <font color = "green">the number of main weather stations </font> and <font color = "green">their locations</font>, which correspond to **the number of clusters** and **the cluster centroids**.
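
To make the cable analogy concrete, here is a minimal sketch of the DBSCAN idea written in base R. It is not the code of the `dbscan` package used later in this notebook; the function name `simple_dbscan` and the toy coordinates are invented for illustration, and the sketch assumes that a station counts itself among its neighbours and that a distance exactly equal to `eps` is still within reach.

```r
# Illustrative DBSCAN sketch in base R (hypothetical helper, not the 'dbscan' package).
simple_dbscan <- function(points, eps, minPts) {
  d <- as.matrix(dist(points))              # pairwise Euclidean distances
  n <- nrow(d)
  # stations reachable with one cable from each station (the station itself included)
  neighbours <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))
  is_core <- lengths(neighbours) >= minPts  # dense enough to justify a connection
  cluster <- integer(n)                     # 0 means noise / not assigned
  current <- 0L
  for (i in seq_len(n)) {
    if (!is_core[i] || cluster[i] != 0L) next
    current <- current + 1L                 # start a new cluster from this core point
    cluster[i] <- current
    queue <- neighbours[[i]]
    while (length(queue) > 0) {
      j <- queue[1]
      queue <- queue[-1]
      if (cluster[j] != 0L) next            # already assigned
      cluster[j] <- current
      # only core points extend the chain of indirect connections
      if (is_core[j]) queue <- c(queue, neighbours[[j]])
    }
  }
  cluster
}

# Toy stations: two chains along a line and one isolated station
toy <- rbind(
  cbind(x = c(0, 1, 2, 3), y = 0),
  cbind(x = c(10, 11, 12), y = 0),
  cbind(x = 50, y = 50)
)
toy_clusters <- simple_dbscan(toy, eps = 1.5, minPts = 3)
print(toy_clusters)
```

In this toy run the two chains become two clusters and the isolated station is left as noise (cluster 0), which mirrors the "direct or indirect connection" rule described above.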

 
------------------------

<a id="ref3"></a>
<a id="ref20"></a>

# The dataset
As we mentioned before, the dataset used is from 2015. It contains information about a set of weather stations, with their respective names and geographical coordinates.

Now let's start playing with the data. We will be working according to the following workflow:
1. Load
2. Overview
3. Clean

<div class="alert alert-success alertsuccess" style="margin-top: 20px">
<font size = 3><strong>Expert tip:</strong></font>
<br>
<br>
We will go through these steps very quickly. If you want a better explanation of the process of managing datasets, you can check the <u>Introduction to R</u> course.<p></p>

</div>

### 1. Loading the data
Since the dataset was uploaded to a Box folder, we can use the ***download.file*** command to download the dataset, and load the data using the ***read.csv*** command.

In [361]:
#Downloading the file in the Data Scientist Workbench
download.file("https://ibm.box.com/shared/static/th5dsj5txtw052dt2k9i5hekkjhydda2.csv","/resources/WeatherStations.csv")

In [104]:
#Reading the csv file
WeatherStations <- read.csv("/resources/WeatherStations.csv", sep =',')
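
The `str` output in the next section shows `Stn_Name` and `Prov` as factors, which is how `read.csv` treated text columns by default before R 4.0.0. The sketch below uses a tiny inline CSV (two rows copied from the dataset preview, with only four of the columns) to show how the `stringsAsFactors` argument controls this. The variable names `csv_text`, `as_factor` and `as_char` are only for illustration.

```r
# Tiny inline CSV with a few columns from the dataset preview
csv_text <- paste(
  "Stn_Name,Lat,Long,Prov",
  "CHEMAINUS,48.935,-123.742,BC",
  "LAKE COWICHAN,48.829,-124.052,BC",
  sep = "\n"
)

# Text columns become factors (the behaviour seen in this notebook's output)
as_factor <- read.csv(text = csv_text, sep = ",", stringsAsFactors = TRUE)

# Text columns stay as plain character vectors (the default since R 4.0.0)
as_char <- read.csv(text = csv_text, sep = ",", stringsAsFactors = FALSE)

str(as_factor)
str(as_char)
```

If you run this notebook on a recent R version and want output that matches the one shown here, passing `stringsAsFactors = TRUE` explicitly is one option.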

### 2. Overview of the data
In this section, we will take a quick look at the data before processing it.

In [105]:
#What does the data look like?
head(WeatherStations)

                Stn_Name    Lat     Long Prov  Tm DwTm   D   Tx DwTx   Tn DwTn
1              CHEMAINUS 48.935 -123.742   BC 8.2    0  NA 13.5    0  1.0    0
2 COWICHAN LAKE FORESTRY 48.824 -124.133   BC 7.0    0 3.0 15.0    0 -3.0    0
3          LAKE COWICHAN 48.829 -124.052   BC 6.8   13 2.8 16.0    9 -2.5    9
4       DISCOVERY ISLAND 48.425 -123.226   BC  NA   NA  NA 12.5    0   NA   NA
5    DUNCAN KELVIN CREEK 48.735 -123.728   BC 7.7    2 3.4 14.5    2 -1.0    2
6      ESQUIMALT HARBOUR 48.432 -123.439   BC 8.8    0  NA 13.1    0  1.9    0
   S DwS S.N     P DwP P.N S_G Pd BS DwBS BS.   HDD CDD  Stn_No
1  0   0  NA 178.8   0  NA   0 12 NA   NA  NA 273.3   0 1011500
2  0   0   0 258.6   0 104   0 12 NA   NA  NA 307.0   0 1012040
3  0   9  NA 264.6   9  NA  NA 11 NA   NA  NA 168.1   0 1012055
4 NA  NA  NA    NA  NA  NA  NA NA NA   NA  NA    NA  NA 1012475
5  0   2  NA 168.4   2  NA  NA 11 NA   NA  NA 267.7   0 1012573
6 NA  NA  NA  81.0   8  NA  NA 12 NA   NA  NA 258.6   0 1012710

In [106]:
#Let's check general information  about the data!
str(WeatherStations)

'data.frame':	1341 obs. of  25 variables:
 $ Stn_Name: Factor w/ 1318 levels "100 MILE HOUSE 6NE",..: 203 255 608 298 307 347 413 680 717 796 ...
 $ Lat     : num  48.9 48.8 48.8 48.4 48.7 ...
 $ Long    : num  -124 -124 -124 -123 -124 ...
 $ Prov    : Factor w/ 13 levels "AB","BC","MB",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Tm      : num  8.2 7 6.8 NA 7.7 8.8 8.9 7.2 NA NA ...
 $ DwTm    : int  0 0 13 NA 2 0 7 1 NA NA ...
 $ D       : num  NA 3 2.8 NA 3.4 NA NA NA NA NA ...
 $ Tx      : num  13.5 15 16 12.5 14.5 13.1 13.5 12.7 NA NA ...
 $ DwTx    : int  0 0 9 0 2 0 7 1 NA NA ...
 $ Tn      : num  1 -3 -2.5 NA -1 1.9 2 2.2 NA NA ...
 $ DwTn    : int  0 0 9 NA 2 0 7 0 NA NA ...
 $ S       : num  0 0 0 NA 0 NA 0 NA 0 0 ...
 $ DwS     : int  0 0 9 NA 2 NA 7 NA 0 0 ...
 $ S.N     : int  NA 0 NA NA NA NA NA NA 0 0 ...
 $ P       : num  179 259 265 NA 168 ...
 $ DwP     : int  0 0 9 NA 2 8 7 10 0 0 ...
 $ P.N     : int  NA 104 NA NA NA NA NA NA 95 114 ...
 $ S_G     : int  0 0 NA NA NA NA 0 NA 0 0 

If you have never worked with this dataset before, it might look very confusing! All the names are abbreviated and there is no way to know what each column means from the data alone. To solve this problem, **we can simply look at the information given about the dataset.**

In this notebook, we will be using only the columns with names listed in <font color = "green">green</font>.

<h3 align = "center">
About the dataset
</h3>
<h4 align = "center">
Environment Canada    
Monthly Values for July - 2015	
</h4>
<html>
<head>
<style>
table {
    font-family: arial, sans-serif;
    border-collapse: collapse;
    width: 100%;
}

td, th {
    border: 1px solid #dddddd;
    text-align: left;
    padding: 8px;
}

tr:nth-child(even) {
    background-color: #dddddd;
}
</style>
</head>
<body>

<table>
  <tr>
    <th>Name in the table</th>
    <th>Meaning</th>
  </tr>
  <tr>
    <td><font color = "green"><strong>Stn_Name</strong></font></td>
    <td><font color = "green"><strong>Station Name</strong></font></td>
  </tr>
  <tr>
    <td><font color = "green"><strong>Lat</strong></font></td>
    <td><font color = "green"><strong>Latitude (North+, degrees)</strong></font></td>
  </tr>
  <tr>
    <td><font color = "green"><strong>Long</strong></font></td>
    <td><font color = "green"><strong>Longitude (West-, degrees)</strong></font></td>
  </tr>
  <tr>
    <td>Prov</td>
    <td>Province</td>
  </tr>
  <tr>
    <td>Tm</td>
    <td>Mean Temperature (°C)</td>
  </tr>
  <tr>
    <td>DwTm</td>
    <td>Days without Valid Mean Temperature</td>
  </tr>
  <tr>
    <td>D</td>
    <td>Mean Temperature difference from Normal (1981-2010) (°C)</td>
  </tr>
  <tr>
    <td><font color = "green"><strong>Tx</strong></font></td>
    <td><font color = "green"><strong>Highest Monthly Maximum Temperature (°C)</strong></font></td>
  </tr>
  <tr>
    <td>DwTx</td>
    <td>Days without Valid Maximum Temperature</td>
  </tr>
  <tr>
    <td><font color = "green"><strong>Tn</strong></font></td>
    <td><font color = "green"><strong>Lowest Monthly Minimum Temperature (°C)</strong></font></td>
  </tr>
  <tr>
    <td>DwTn</td>
    <td>Days without Valid Minimum Temperature</td>
  </tr>
  <tr>
    <td>S</td>
    <td>Snowfall (cm)</td>
  </tr>
  <tr>
    <td>DwS</td>
    <td>Days without Valid Snowfall</td>
  </tr>
  <tr>
    <td>S%N</td>
    <td>Percent of Normal (1981-2010) Snowfall</td>
  </tr>
  <tr>
    <td>P</td>
    <td>Total Precipitation (mm)</td>
  </tr>
  <tr>
    <td>DwP</td>
    <td>Days without Valid Precipitation</td>
  </tr>
  <tr>
    <td>P%N</td>
    <td>Percent of Normal (1981-2010) Precipitation</td>
  </tr>
  <tr>
    <td>S_G</td>
    <td>Snow on the ground at the end of the month (cm)</td>
  </tr>
  <tr>
    <td>Pd</td>
    <td>Number of days with Precipitation 1.0 mm or more</td>
  </tr>
  <tr>
    <td>BS</td>
    <td>Bright Sunshine (hours)</td>
  </tr>
  <tr>
    <td>DwBS</td>
    <td>Days without Valid Bright Sunshine</td>
  </tr>
  <tr>
    <td>BS%</td>
    <td>Percent of Normal (1981-2010) Bright Sunshine</td>
  </tr>
  <tr>
    <td>HDD</td>
    <td>Degree Days below 18 °C</td>
  </tr>
  <tr>
    <td>CDD</td>
    <td>Degree Days above 18 °C</td>
  </tr>
  <tr>
    <td>Stn_No</td>
    <td>Climate station identifier (first 3 digits indicate drainage basin, last 4 characters are for sorting alphabetically).</td>
  </tr>
  <tr>
    <td>NA</td>
    <td>Not Available</td>
  </tr>


</table>

</body>
</html>



### 3. Cleaning the data
Since we want to simplify the work and come up with a mock-up of our final model, we will work only with the location and temperature of the stations.

In [107]:
#Creating the main subset: It contains the Station Name, latitude, longitude, highest monthly and lowest monthly temperature
WeatherStations.submain <- subset(WeatherStations, select = c(Stn_Name,Lat,Long,Tx,Tn))
head(WeatherStations.submain)

                Stn_Name    Lat     Long   Tx   Tn
1              CHEMAINUS 48.935 -123.742 13.5  1.0
2 COWICHAN LAKE FORESTRY 48.824 -124.133 15.0 -3.0
3          LAKE COWICHAN 48.829 -124.052 16.0 -2.5
4       DISCOVERY ISLAND 48.425 -123.226 12.5   NA
5    DUNCAN KELVIN CREEK 48.735 -123.728 14.5 -1.0
6      ESQUIMALT HARBOUR 48.432 -123.439 13.1  1.9

In [108]:
#Changing column names to cleaner names
colnames(WeatherStations.submain) <- c("Stn_Name","Lat", "Long", "Tmax", "Tmin")
head(WeatherStations.submain)

                Stn_Name    Lat     Long Tmax Tmin
1              CHEMAINUS 48.935 -123.742 13.5  1.0
2 COWICHAN LAKE FORESTRY 48.824 -124.133 15.0 -3.0
3          LAKE COWICHAN 48.829 -124.052 16.0 -2.5
4       DISCOVERY ISLAND 48.425 -123.226 12.5   NA
5    DUNCAN KELVIN CREEK 48.735 -123.728 14.5 -1.0
6      ESQUIMALT HARBOUR 48.432 -123.439 13.1  1.9

In [109]:
#Removing all the rows that are incomplete (have at least one NA)
WeatherStations.submain <- WeatherStations.submain[complete.cases(WeatherStations.submain),]
head(WeatherStations.submain)

                Stn_Name    Lat     Long Tmax Tmin
1              CHEMAINUS 48.935 -123.742 13.5  1.0
2 COWICHAN LAKE FORESTRY 48.824 -124.133 15.0 -3.0
3          LAKE COWICHAN 48.829 -124.052 16.0 -2.5
5    DUNCAN KELVIN CREEK 48.735 -123.728 14.5 -1.0
6      ESQUIMALT HARBOUR 48.432 -123.439 13.1  1.9
7          GALIANO NORTH 48.985 -123.573 13.5  2.0
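
Note that row 4 (DISCOVERY ISLAND) disappeared from the output above because its `Tmin` was `NA`, while the remaining rows kept their original row names. The sketch below reproduces this behaviour of `complete.cases` on a small made-up data frame; the station names and values are placeholders, not rows from the dataset.

```r
# Made-up data frame with some missing temperatures
stations <- data.frame(
  Stn_Name = c("A", "B", "C", "D"),
  Tmax = c(13.5, 12.5, NA, 14.5),
  Tmin = c(1.0, NA, NA, -1.0)
)

# TRUE for rows with no NA in any column
keep <- complete.cases(stations)

cleaned <- stations[keep, ]   # keeps rows "1" and "4", row names are preserved
dropped <- sum(!keep)         # number of incomplete rows removed

print(cleaned)
print(dropped)
```

Counting the dropped rows this way is a quick sanity check on how much data the cleaning step removes.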

-----------------

# A general visualization of the data
Dealing with large amounts of data can be really overwhelming. A first visualization will give us a better idea of what kind of data we will be clustering later. For this, <a href="https://rstudio.github.io/leaflet/">Leaflet for R</a> will be used to plot the stations on top of a world map:

1. Installing the library
2. Setting up the visualization
3. Plotting

<div class="alert alert-success alertsuccess" style="margin-top: 20px">
<font size = 3><strong>Expert tip:</strong></font>
<br>
<br>
We will go through these steps very quickly. If you want a better explanation of the process of map visualization, you can check the <u>Map Visualization in R</u> course.<p></p>

</div>

### 1. Installing the library

In [110]:
#Installing leaflet maps
install.packages("leaflet")
library(leaflet)
#Packages used to display the maps in this notebook
library(htmlwidgets)
library(IRdisplay)

Installing package into ‘/home/notebook/spark-1.6.0-bin-hadoop2.6/R/lib’
(as ‘lib’ is unspecified)


### 2. Setting up the visualization
We will set up the visualization by applying the ***setView*** function to the map. We just have to choose a *center point* for our visualization (lat, long) and a *zoom* level.

In [111]:
#Establishing the limits of our default visualization
lower_lon = -140
upper_lon = -50
lower_lat = 40
upper_lat = 65
#Establishing the center of our default visualization
center_lon = (lower_lon + upper_lon)/2
center_lat = (lower_lat + upper_lat)/2
#Setting the default zoom of our default visualization
zoom = 4

### 3. Plotting

In [112]:
subset <- WeatherStations.submain #'subset' stores the subset we will be using for the leaflet map
weather_map <- leaflet(subset) %>% #creating a leaflet map
                setView(center_lon,center_lat, zoom)%>% #setting the default view for our map
                addProviderTiles("OpenStreetMap.BlackAndWhite")%>% #setting the map that we want to use as background
                addCircleMarkers(lng = subset$Long, #the longitude is the longitude of our subset!
                                 lat = subset$Lat, #the latitude is the latitude of our subset!
                                 popup = subset$Stn_Name, #pop-ups will show the name of station if you click in a data point
                                 fillColor = "Black", #colors of the markers will be black
                                 fillOpacity = 1, #the shapes will have maximum opacity
                                 radius = 4, #radius determine the size of each shape
                                 stroke = F) #no stroke will be drawn in each data point

saveWidget(weather_map, file="weather_map.html", selfcontained = F) #saving the leaflet map in html
display_html(paste("<iframe src=' ", 'weather_map.html', " ' width='100%' height='400'","/>")) #displaying the map !

----------------------

<a id="ref4"></a>
# A quick implementation
Instead of going through all the particularities and math of this algorithm, why don't we jump directly to implementing as fast as we can? Let's start with the process we will be using for it:

1. Installing the library
2. Clustering by location
    1. Preparing the data
    2. Running the algorithm
    3. Attaching the cluster information to the data
    4. Calculating the centroids
    5. Visualizing the results
3. Clustering by location and temperature
    1. Preparing the data
    2. Running the algorithm
    3. Attaching the cluster information to the data
    4. Calculating the centroids
    5. Visualizing the results

### 1. Installing the library

In [113]:
#installing the library 'dbscan'
install.packages("dbscan", dependencies = TRUE)
library('dbscan')

Installing package into ‘/home/notebook/spark-1.6.0-bin-hadoop2.6/R/lib’
(as ‘lib’ is unspecified)


### 2. Clustering by location
At first, it was decided that only the location would matter, since our only constraints are the length of the cable and the way we can chain connections to reach greater distances.

#####  Preparing the data (Subsetting and Normalizing)

In [114]:
#creating a subset containing only the location information
WeatherStations.sub1 <- subset(WeatherStations.submain, select = c(Lat,Long))

#preparing the dataset for dbscan: center and scale.
scaled_WS.sub1 <- scale(WeatherStations.sub1, center = TRUE, scale = TRUE)
head(scaled_WS.sub1)

         Lat      Long
1 -0.3834153 -1.142244
2 -0.4011865 -1.158987
3 -0.4003860 -1.155518
5 -0.4154355 -1.141644
6 -0.4639460 -1.129269
7 -0.3754103 -1.135007
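
The `scale` call above subtracts the column mean from each value and divides by the column standard deviation. The sketch below checks this by hand on the first five latitudes of the cleaned data shown earlier. Because the mean and standard deviation here come from only five values, the numbers will differ from the output above, which uses every station in the cleaned data.

```r
# First five latitudes from the cleaned data shown earlier
lat <- c(48.935, 48.824, 48.829, 48.735, 48.432)

# Manual standardization: center on the mean, divide by the standard deviation
manual <- (lat - mean(lat)) / sd(lat)

# Same operation with scale(); it returns a one-column matrix
auto <- scale(lat, center = TRUE, scale = TRUE)

print(all.equal(manual, as.numeric(auto)))
```

After scaling, `eps` is expressed in standard deviations rather than in degrees, so a value tuned on scaled data cannot be read directly as a distance on the map.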

#####  Running the algorithm

<div class="alert alert-block alert-info" style="margin-top: 20px">
<font size = 3><strong>"Tuning" the algorithm</strong></font><br>
<p>As mentioned above, DBSCAN must be tuned for each particular dataset using the <strong>eps</strong> and <strong>minPts</strong> parameters. The "eps" parameter corresponds to the <strong>size of the neighborhood</strong> and "minPts" to the <strong>minimum number of points that must exist in this neighborhood to define it as a dense area</strong>.</p>

<p>In this implementation, <strong>the parameters were already tuned for each one of the applications.</strong> However, if you plan to apply this algorithm to your own problem, you will have to tune it yourself, according to the dataset you are using.</p>
</div>
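
One common heuristic for choosing `eps`, which this notebook does not necessarily follow, is to look at the sorted distance from each point to its k-th nearest neighbour and pick a value just below the point where the curve jumps. The sketch below implements that idea in base R on made-up points. The helper name `kth_nn_distance` is hypothetical, and taking `k = minPts - 1` assumes that `minPts` counts the point itself. The `dbscan` package also provides `kNNdist` and `kNNdistplot` helpers for the same purpose.

```r
# Hypothetical helper: distance from each point to its k-th nearest neighbour
kth_nn_distance <- function(points, k) {
  d <- as.matrix(dist(points))
  # after sorting a row, position 1 is the point itself (distance 0),
  # so the k-th neighbour sits at position k + 1
  apply(d, 1, function(row) sort(row)[k + 1])
}

# Made-up points: four evenly spaced stations and one far away
toy <- cbind(x = c(0, 1, 2, 3, 10), y = 0)

knn_sorted <- sort(unname(kth_nn_distance(toy, k = 1)))
print(knn_sorted)

# Plotting the sorted distances helps to spot the jump visually:
# plot(knn_sorted, type = "b", ylab = "distance to k-th neighbour")
```

Here the sorted distances stay flat for the dense stations and jump for the isolated one, so any `eps` between the flat part and the jump would separate them.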


In [310]:
#assigning a cluster to each station
clusters_assignments1 <- dbscan(scaled_WS.sub1, eps = 0.138, minPts = 12)
clusters_assignments1

DBSCAN clustering for 1258 objects.
Parameters: eps = 0.138, minPts = 12
The clustering contains 5 cluster(s) and 277 noise points.

  0   1   2   3   4   5 
277 516  40  24 360  41 

Available fields: cluster, eps, minPts
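
The summary above can be turned into a couple of quick diagnostics: how many real clusters were found and what share of the stations was labeled as noise. The sketch below hard-codes the counts printed above so it runs on its own. With the real object, the same table could be obtained with `table(clusters_assignments1$cluster)`.

```r
# Cluster sizes copied from the printed summary above ("0" is the noise label)
cluster_sizes <- c("0" = 277, "1" = 516, "2" = 40, "3" = 24, "4" = 360, "5" = 41)

total_stations <- sum(cluster_sizes)
n_clusters <- sum(names(cluster_sizes) != "0")           # noise is not a cluster
noise_share <- cluster_sizes[["0"]] / total_stations     # fraction of unassigned stations

cat("stations:", total_stations, "\n")
cat("clusters:", n_clusters, "\n")
cat("noise share:", round(100 * noise_share, 1), "%\n")
```

A large noise share is not necessarily a problem, but it is worth watching while tuning `eps` and `minPts`, since every noise station is one that no main station would serve.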

##### Attaching cluster assignment to each one of the datapoints

In [311]:
#clusters must be converted to factor before plotting in different colors
clusters_assignments1$cluster <- as.factor(clusters_assignments1$cluster)

#linking the assigned cluster to each station
WeatherStations.sub1$cluster_no <- clusters_assignments1$cluster
head(WeatherStations.sub1)

     Lat     Long cluster_no
1 48.935 -123.742          1
2 48.824 -124.133          1
3 48.829 -124.052          1
5 48.735 -123.728          1
6 48.432 -123.439          1
7 48.985 -123.573          1

##### Calculating the centroids for each cluster

In [322]:
#function that calculates the centroid of a given cluster_no and dataframe
cluster_centroid <- function(cluster_no, df){
    cluster_df <- df[ which(df$cluster_no==cluster_no), ]
    lat_mean <- mean(cluster_df[["Lat"]])
    long_mean <-  mean(cluster_df[["Long"]])
    return(c(lat_mean,long_mean))
}

In [314]:
#function that calculates all the centroids latitudes of a given dataframe
all_lats_centroids <- function(df){
    all_lats <- numeric ()
    for (cluster in unique (df$cluster_no)){
        all_lats[cluster] <- cluster_centroid (cluster,df) [1]
    }
    return (all_lats)
}

In [323]:
#function that calculates all the centroids longitudes of a given dataframe
all_longs_centroids <- function(df){
    all_longs <- numeric ()
    for (cluster in unique (df$cluster_no)){
        all_longs[cluster] <- cluster_centroid (cluster,df) [2]
    }
    return (all_longs)
}
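
As an alternative sketch, the three helper functions above can be replaced by a single call to base R's `aggregate`, which computes the mean latitude and longitude per cluster in one step. The data frame below is made up for illustration. Dropping cluster 0 before placing markers is a choice made in this sketch, not something the notebook's own functions do.

```r
# Made-up stations with a cluster label (0 = noise)
stations <- data.frame(
  Lat  = c(48.9, 48.8, 45.6, 45.8, 60.0),
  Long = c(-123.7, -124.1, -72.9, -72.7, -97.0),
  cluster_no = factor(c(1, 1, 2, 2, 0))
)

# Mean Lat and Long for every cluster, one row per cluster
centroids <- aggregate(cbind(Lat, Long) ~ cluster_no, data = stations, FUN = mean)

# Noise points do not form a cluster, so their "centroid" is dropped here
centroids <- centroids[centroids$cluster_no != "0", ]

print(centroids)
```

The result is a data frame with one row per cluster, which keeps each latitude, longitude and cluster label together instead of in separate named vectors.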

In [329]:
#storing the latitude of all cluster centroids
lats_centroids1 <- all_lats_centroids (WeatherStations.sub1)
lats_centroids1

       1        0        2        3        4        5 
50.95104 59.00052 54.10623 55.80708 45.67793 48.57029 

In [330]:
#storing the longitude of all cluster centroids
longs_centroids1 <- all_longs_centroids (WeatherStations.sub1)
longs_centroids1

         1          0          2          3          4          5 
-113.72187  -97.13291 -128.44907 -119.01042  -72.88263  -55.91310 

##### Visualizing the results
Note that cluster number **0 represents the noise points.**

In [331]:
#plotting the graph
subset <- WeatherStations.sub1
color <- colorFactor("Set1", as.factor(subset$cluster_no))
weather_map1 <- leaflet(subset) %>%
                setView(center_lon,center_lat, zoom)%>% 
                addProviderTiles("OpenStreetMap.BlackAndWhite")%>% 
                addCircleMarkers(lng = subset$Long,
                                lat = subset$Lat, 
                                popup = as.character(WeatherStations.submain$Stn_Name), #sub1 has no Stn_Name column; rows match the main subset
                                fillColor = ~color(cluster_no),
                                fillOpacity = 1,
                                radius = 4,
                                stroke = F) %>% 
                addMarkers(lng = longs_centroids1,
                           lat = lats_centroids1,
                           popup = unique(subset$cluster_no))%>% 
                addLegend("bottomleft",
                        pal = color,
                        values = ~cluster_no,
                        opacity = 1,
                        title = "Cluster")

saveWidget(weather_map1, file="weather_map1.html", selfcontained = F)

#Plotting our raw data points
display_html(paste("<iframe src=' ", 'weather_map.html', " ' width='100%' height='400'","/>")) 

#Plotting our data points clustered by location
display_html(paste("<iframe src=' ", 'weather_map1.html', " ' width='100%' height='400'","/>")) 

The markers represent the **main stations**, and each cluster color represents a group of **normal stations** assigned to the same main station.   
<font color = "red"> **Note that cluster 0 is composed of unassigned points. Therefore, its main station shouldn't be built.** </font>

-----------------------

### 3. Clustering by location and temperature
After discussing the problem further, it was decided that it would be interesting to group weather stations by temperature as well. This way, if a main station detects a dangerous change in temperature, it can quickly warn all of its assigned weather stations, without losing time identifying and analyzing different temperature patterns. For that, we will keep the location in consideration, but also take the temperature into account.

#####  Preparing the data (Subsetting and Normalizing)

In [118]:
#creating a subset containing the location and the temperature information
WeatherStations.sub2 <- subset(WeatherStations.submain, select = c(Lat,Long,Tmax,Tmin))

#preparing the dataset for dbscan: center and scale.
scaled_WS.sub2 <- scale(WeatherStations.sub2, center = TRUE, scale = TRUE)
head(scaled_WS.sub2)

         Lat      Long     Tmax     Tmin
1 -0.3834153 -1.142244 1.229689 2.171379
2 -0.4011865 -1.158987 1.399106 1.853476
3 -0.4003860 -1.155518 1.512051 1.893214
5 -0.4154355 -1.141644 1.342633 2.012428
6 -0.4639460 -1.129269 1.184511 2.242907
7 -0.3754103 -1.135007 1.229689 2.250855

#####  Running the algorithm

In [350]:
#assigning a cluster to each station
clusters_assignments2 <- dbscan(scaled_WS.sub2, eps = 0.27, minPts = 12)
clusters_assignments2

DBSCAN clustering for 1258 objects.
Parameters: eps = 0.27, minPts = 12
The clustering contains 8 cluster(s) and 443 noise points.

  0   1   2   3   4   5   6   7   8 
443 154  20  14 246  54 297  18  12 

Available fields: cluster, eps, minPts

##### Attaching cluster assignment to each one of the datapoints

In [351]:
#clusters must be converted to factor before plotting in different colors
clusters_assignments2$cluster <- as.factor(clusters_assignments2$cluster)
#linking the assigned cluster to each station
WeatherStations.sub2$cluster_no <- clusters_assignments2$cluster
head(WeatherStations.sub2)

     Lat     Long Tmax Tmin cluster_no
1 48.935 -123.742 13.5  1.0          1
2 48.824 -124.133 15.0 -3.0          1
3 48.829 -124.052 16.0 -2.5          1
5 48.735 -123.728 14.5 -1.0          1
6 48.432 -123.439 13.1  1.9          1
7 48.985 -123.573 13.5  2.0          1

##### Calculating the centroids for each cluster

In [352]:
#storing the latitude of all cluster centroids
lats_centroids2 <- all_lats_centroids (WeatherStations.sub2)
lats_centroids2

       1        0        2        3        4        5        6        7 
49.59656 55.49524 53.64030 51.88571 52.28107 49.98793 45.82021 45.17428 
       8 
47.43650 

In [353]:
#storing the longitude of all cluster centroids
longs_centroids2 <- all_longs_centroids (WeatherStations.sub2)
longs_centroids2

         1          0          2          3          4          5          6 
-122.46975  -95.72415 -131.13415 -120.97850 -108.68546 -112.76052  -73.76266 
         7          8 
 -63.60828  -54.16892 

##### Visualizing the results
Note that cluster number **0 represents the noise points.**

In [354]:
#plotting the graph
subset <- WeatherStations.sub2
color <- colorFactor("Set1", as.factor(subset$cluster_no))
weather_map2 <- leaflet(subset) %>%
                setView(center_lon,center_lat, zoom)%>% 
                addProviderTiles("OpenStreetMap.BlackAndWhite")%>% 
                addCircleMarkers(lng = subset$Long,
                                lat = subset$Lat, 
                                popup = as.character(WeatherStations.submain$Stn_Name), #sub2 has no Stn_Name column; rows match the main subset
                                fillColor = ~color(cluster_no),
                                fillOpacity = 1,
                                radius = 4,
                                stroke = F) %>%
                addMarkers(lng = longs_centroids2,
                           lat = lats_centroids2,
                           popup = unique(subset$cluster_no)) %>%
                addLegend("bottomleft",
                        pal = color,
                        values = ~cluster_no,
                        opacity = 1,
                        title = "Cluster")

saveWidget(weather_map2, file="weather_map2.html", selfcontained = F)

#Plotting our raw data points
display_html(paste("<iframe src=' ", 'weather_map.html', " ' width='100%' height='400'","/>"))

#Plotting our data points clustered by temperature and location
display_html(paste("<iframe src=' ", 'weather_map2.html', " ' width='100%' height='400'","/>"))

The markers represent the **main stations**, and each cluster color represents a group of **normal stations** assigned to the same main station.   
<font color = "red"> **Note that cluster 0 is composed of unassigned points. Therefore, its main station shouldn't be built.** </font>

#### Our model is ready! Once we have more data about other weather conditions and parameters, we can refine it further and explore many other ways to cluster the weather stations.

------

<div class="alert alert-success alertsuccess" style="margin-top: 20px">
<font size = 3><strong>Expert tip:</strong></font>
<br>
<br>
Now that you know how we applied DBSCAN to a specific problem, it's time to think about what kind of problems you can solve with this new tool. You can start by searching for DBSCAN applications on the internet, and use this notebook as a quick way to apply DBSCAN to these different problems.<p></p>

</div>

### Thanks for completing this lesson!

Notebook created by: <a href = "https://ca.linkedin.com/in/erich-natsubori-sato">Erich Natsubori Sato</a>

### References:

https://en.wikipedia.org/wiki/DBSCAN <br>
https://cran.r-project.org/web/packages/dbscan/dbscan.pdf <br>
https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf <br>
http://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf <br>

<hr>
Copyright &copy; 2016 [Big Data University](https://bigdatauniversity.com/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).