# Weather Station Clustering using DBSCAN

DBSCAN is particularly well suited to tasks such as class identification in a spatial context. Its key strength is that it can find clusters of arbitrary shape without being thrown off by noise. The example below clusters the locations of weather stations in Canada. DBSCAN can be used here, for instance, to find groups of stations that show similar weather conditions. It not only finds clusters of arbitrary shape, it also picks out the denser regions of the data while ignoring sparse areas and noise.
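To make the "arbitrary shape, robust to noise" claim concrete before turning to the weather data, here is a minimal sketch on synthetic data. The crescent dataset, the three hand-placed outliers, and the `eps` / `min_samples` values are all illustrative assumptions, not part of the weather station analysis.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved crescents: clusters that are clearly not round blobs.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Three isolated points appended by hand to play the role of noise.
outliers = np.array([[3.0, 3.0], [-2.5, -2.5], [3.5, -2.5]])
X = np.vstack([X, outliers])

# eps and min_samples are illustrative choices for this toy data.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Label -1 marks noise, so it is not counted as a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("clusters:", n_clusters, "noise points:", n_noise)
```

Each crescent should come out as its own cluster, and the hand-placed points receive the label `-1` because they have too few neighbours within `eps`.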

## About the dataset

<h4 align = "center">
Environment Canada    
Monthly Values for July - 2015	
</h4>
<html>
<head>
<style>
table {
    font-family: arial, sans-serif;
    border-collapse: collapse;
    width: 100%;
}

td, th {
    border: 1px solid #dddddd;
    text-align: left;
    padding: 8px;
}

tr:nth-child(even) {
    background-color: #dddddd;
}
</style>
</head>
<body>

<table>
  <tr>
    <th>Name in the table</th>
    <th>Meaning</th>
  </tr>
  <tr>
    <td><font color = "green"><strong>Stn_Name</strong></font></td>
    <td><font color = "green"><strong>Station Name</strong></font></td>
  </tr>
  <tr>
    <td><font color = "green"><strong>Lat</strong></font></td>
    <td><font color = "green"><strong>Latitude (North+, degrees)</strong></font></td>
  </tr>
  <tr>
    <td><font color = "green"><strong>Long</strong></font></td>
    <td><font color = "green"><strong>Longitude (West - , degrees)</strong></font></td>
  </tr>
  <tr>
    <td>Prov</td>
    <td>Province</td>
  </tr>
  <tr>
    <td>Tm</td>
    <td>Mean Temperature (°C)</td>
  </tr>
  <tr>
    <td>DwTm</td>
    <td>Days without Valid Mean Temperature</td>
  </tr>
  <tr>
    <td>D</td>
    <td>Mean Temperature difference from Normal (1981-2010) (°C)</td>
  </tr>
  <tr>
    <td><font color = "black">Tx</font></td>
    <td><font color = "black">Highest Monthly Maximum Temperature (°C)</font></td>
  </tr>
  <tr>
    <td>DwTx</td>
    <td>Days without Valid Maximum Temperature</td>
  </tr>
  <tr>
    <td><font color = "black">Tn</font></td>
    <td><font color = "black">Lowest Monthly Minimum Temperature (°C)</font></td>
  </tr>
  <tr>
    <td>DwTn</td>
    <td>Days without Valid Minimum Temperature</td>
  </tr>
  <tr>
    <td>S</td>
    <td>Snowfall (cm)</td>
  </tr>
  <tr>
    <td>DwS</td>
    <td>Days without Valid Snowfall</td>
  </tr>
  <tr>
    <td>S%N</td>
    <td>Percent of Normal (1981-2010) Snowfall</td>
  </tr>
  <tr>
    <td><font color = "green"><strong>P</strong></font></td>
    <td><font color = "green"><strong>Total Precipitation (mm)</strong></font></td>
  </tr>
  <tr>
    <td>DwP</td>
    <td>Days without Valid Precipitation</td>
  </tr>
  <tr>
    <td>P%N</td>
    <td>Percent of Normal (1981-2010) Precipitation</td>
  </tr>
  <tr>
    <td>S_G</td>
    <td>Snow on the ground at the end of the month (cm)</td>
  </tr>
  <tr>
    <td>Pd</td>
    <td>Number of days with Precipitation 1.0 mm or more</td>
  </tr>
  <tr>
    <td>BS</td>
    <td>Bright Sunshine (hours)</td>
  </tr>
  <tr>
    <td>DwBS</td>
    <td>Days without Valid Bright Sunshine</td>
  </tr>
  <tr>
    <td>BS%</td>
    <td>Percent of Normal (1981-2010) Bright Sunshine</td>
  </tr>
  <tr>
    <td>HDD</td>
    <td>Degree Days below 18 °C</td>
  </tr>
  <tr>
    <td>CDD</td>
    <td>Degree Days above 18 °C</td>
  </tr>
  <tr>
    <td>Stn_No</td>
    <td>Climate station identifier (first 3 digits indicate drainage basin, last 4 characters are for sorting alphabetically)</td>
  </tr>
  <tr>
    <td>NA</td>
    <td>Not Available</td>
  </tr>


</table>

</body>
</html>

 

## Load the dataset

In [1]:
import pandas as pd
import numpy as np

filename='weather-stations20140101-20141231.csv'

#Read csv
df = pd.read_csv(filename)
df.head()

Unnamed: 0,Stn_Name,Lat,Long,Prov,Tm,DwTm,D,Tx,DwTx,Tn,...,DwP,P%N,S_G,Pd,BS,DwBS,BS%,HDD,CDD,Stn_No
0,CHEMAINUS,48.935,-123.742,BC,8.2,0.0,,13.5,0.0,1.0,...,0.0,,0.0,12.0,,,,273.3,0.0,1011500
1,COWICHAN LAKE FORESTRY,48.824,-124.133,BC,7.0,0.0,3.0,15.0,0.0,-3.0,...,0.0,104.0,0.0,12.0,,,,307.0,0.0,1012040
2,LAKE COWICHAN,48.829,-124.052,BC,6.8,13.0,2.8,16.0,9.0,-2.5,...,9.0,,,11.0,,,,168.1,0.0,1012055
3,DISCOVERY ISLAND,48.425,-123.226,BC,,,,12.5,0.0,,...,,,,,,,,,,1012475
4,DUNCAN KELVIN CREEK,48.735,-123.728,BC,7.7,2.0,3.4,14.5,2.0,-1.0,...,2.0,,,11.0,,,,267.7,0.0,1012573


In [2]:
df.shape

(1341, 25)

## Cleaning

Let's remove the rows that have no value in the `Tm` (mean temperature) field.

In [3]:
df = df.dropna(subset=['Tm'],axis=0)
df.shape

(1256, 25)

In [4]:
df.head()

Unnamed: 0,Stn_Name,Lat,Long,Prov,Tm,DwTm,D,Tx,DwTx,Tn,...,DwP,P%N,S_G,Pd,BS,DwBS,BS%,HDD,CDD,Stn_No
0,CHEMAINUS,48.935,-123.742,BC,8.2,0.0,,13.5,0.0,1.0,...,0.0,,0.0,12.0,,,,273.3,0.0,1011500
1,COWICHAN LAKE FORESTRY,48.824,-124.133,BC,7.0,0.0,3.0,15.0,0.0,-3.0,...,0.0,104.0,0.0,12.0,,,,307.0,0.0,1012040
2,LAKE COWICHAN,48.829,-124.052,BC,6.8,13.0,2.8,16.0,9.0,-2.5,...,9.0,,,11.0,,,,168.1,0.0,1012055
4,DUNCAN KELVIN CREEK,48.735,-123.728,BC,7.7,2.0,3.4,14.5,2.0,-1.0,...,2.0,,,11.0,,,,267.7,0.0,1012573
5,ESQUIMALT HARBOUR,48.432,-123.439,BC,8.8,0.0,,13.1,0.0,1.9,...,8.0,,,12.0,,,,258.6,0.0,1012710
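The shape goes from 1341 to 1256 rows, so 85 stations without a mean temperature were dropped. Note also that the index now skips 3: `dropna` keeps the original row labels. A minimal sketch of the same step on a three-row toy frame (the values are copied from the preview above; the names `toy`, `n_missing`, and `cleaned` are just for illustration):

```python
import numpy as np
import pandas as pd

# Three stations from the preview; DISCOVERY ISLAND has no Tm value.
toy = pd.DataFrame({
    "Stn_Name": ["CHEMAINUS", "DISCOVERY ISLAND", "DUNCAN KELVIN CREEK"],
    "Tm": [8.2, np.nan, 7.7],
    "Tx": [13.5, 12.5, 14.5],
})

# Count the missing values first, then drop the affected rows.
n_missing = int(toy["Tm"].isna().sum())
cleaned = toy.dropna(subset=["Tm"])

print(n_missing, len(toy), len(cleaned))
print(list(cleaned.index))  # original row labels are preserved
```

If a contiguous index is preferred, `cleaned.reset_index(drop=True)` would renumber the rows; the notebook does not do this, which is why the outputs keep the original labels.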


## Visualization

Plot the stations on a map using **folium**. Only the first 1000 stations are drawn.

In [5]:
import folium

In [6]:
map_can = folium.Map(
    location=[56,-106],
    zoom_start = 4
)

for lat,lng in zip(df['Lat'][0:1000],df['Long'][0:1000]):
    folium.CircleMarker(
        [lat,lng],
        radius=1,
        fill=True,
        color='red'
    ).add_to(map_can)
    
map_can

## a. Clustering of stations based on their location, i.e. latitude and longitude

In [7]:
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

In [8]:
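# Use only the station coordinates as features
Cluster = df[['Lat','Long']]
# Standardize each column to zero mean and unit variance
Cluster = StandardScaler().fit_transform(Cluster)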
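Because the coordinates are standardized before clustering, the `eps` passed to DBSCAN is measured in standard deviations of each column, not in degrees. A small sketch of what `StandardScaler` does, using the five stations from the preview above (the variable names `coords` and `scaled` are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Lat / Long of the five stations shown in the preview.
coords = np.array([
    [48.935, -123.742],
    [48.824, -124.133],
    [48.829, -124.052],
    [48.735, -123.728],
    [48.432, -123.439],
])

scaled = StandardScaler().fit_transform(coords)

# Each column now has mean 0 and (population) standard deviation 1.
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

On the full dataset the spread of latitudes and longitudes is much larger than in this sample, so the same `eps` covers a correspondingly larger distance on the ground.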

In [9]:
# Compute DBSCAN
db = DBSCAN(eps=0.15, min_samples=10)
db.fit(Cluster)

DBSCAN(algorithm='auto', eps=0.15, leaf_size=30, metric='euclidean',
    metric_params=None, min_samples=10, n_jobs=None, p=None)

In [10]:
# Boolean mask: True for stations that DBSCAN identified as core samples
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True

In [11]:
labels = db.labels_
df['Labels'] = labels

In [12]:
df[["Stn_Name","Tx","Tm","Labels"]].head()

Unnamed: 0,Stn_Name,Tx,Tm,Labels
0,CHEMAINUS,13.5,8.2,0
1,COWICHAN LAKE FORESTRY,15.0,7.0,0
2,LAKE COWICHAN,16.0,6.8,0
4,DUNCAN KELVIN CREEK,14.5,7.7,0
5,ESQUIMALT HARBOUR,13.1,8.8,0


In [13]:
set(labels)

{-1, 0, 1, 2, 3}

As you can see, DBSCAN found four clusters (labels 0 to 3). Stations that belong to no cluster are treated as outliers and get the label -1.
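The labels, the core sample mask, and the noise label fit together as follows. This is a sketch on a hand-made nine-point dataset, not on the station data; `X`, `db_toy`, and the chosen `eps` / `min_samples` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of four points each, plus one isolated point.
X = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],
    [10.0, 0.0],
])

db_toy = DBSCAN(eps=0.5, min_samples=3).fit(X)
labels_toy = db_toy.labels_

# Same mask construction as in the notebook cell above.
core_mask = np.zeros_like(labels_toy, dtype=bool)
core_mask[db_toy.core_sample_indices_] = True

# -1 is the noise label, so exclude it when counting clusters.
n_clusters = len(set(labels_toy)) - (1 if -1 in labels_toy else 0)
n_noise = int((labels_toy == -1).sum())

# Number of points per label, noise included.
values, counts = np.unique(labels_toy, return_counts=True)
sizes = dict(zip(values.tolist(), counts.tolist()))

print(labels_toy.tolist())
print(n_clusters, n_noise, sizes)
```

The same three lines for `n_clusters`, `n_noise`, and `sizes` can be run against `db.labels_` to see how many stations ended up in each cluster and how many were left as noise.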

## Visualization of clusters based on location

In [14]:
map_can = folium.Map(
    location=[56,-106],
    zoom_start = 4
)

color_map = {0:'red',1:'blue',2:'green',3:'yellow',-1:'gray'}

for lat,lng,lab in zip(df['Lat'],df['Long'],df['Labels']):
    label=folium.Popup('Cluster : '+str(lab),parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=1,
        fill=True,
        color=color_map[lab],
        popup=label
    ).add_to(map_can)
    
map_can
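A hard-coded `color_map` dictionary raises a `KeyError` as soon as DBSCAN returns a label that is not listed, which can happen whenever `eps` or `min_samples` change. One possible way around this is a small helper that cycles through a palette; the names `PALETTE`, `NOISE_COLOR`, and `label_to_color` are hypothetical and not part of folium or scikit-learn.

```python
# Hypothetical helper: map any DBSCAN label to a color name.
PALETTE = ["red", "blue", "green", "orange", "purple", "pink", "brown"]
NOISE_COLOR = "gray"


def label_to_color(label, palette=PALETTE, noise_color=NOISE_COLOR):
    """Return a color for a cluster label; -1 (noise) maps to noise_color."""
    if label == -1:
        return noise_color
    # Wrap around when there are more clusters than palette entries.
    return palette[label % len(palette)]


for lab in [-1, 0, 1, 7]:
    print(lab, label_to_color(lab))
```

In the plotting loop, `color=label_to_color(lab)` could then stand in for `color=color_map[lab]`. With more clusters than palette entries, colors repeat, so the popup text remains the reliable way to tell clusters apart.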

## b. Clustering of stations based on their location, mean, max, and min Temperature

In [15]:
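# Feature matrix: location plus max, mean, and min temperature
Cluster = df[['Lat','Long','Tx','Tm','Tn']]
# Note: nan_to_num replaces missing Tx/Tn values with 0.0
Cluster = np.nan_to_num(Cluster)
# Standardize each column to zero mean and unit variance
Cluster = StandardScaler().fit_transform(Cluster)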
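One thing to keep in mind about this cell: `np.nan_to_num` turns missing values into 0.0, which for `Tx` and `Tn` reads as a temperature of 0 °C. The sketch below contrasts that with two common alternatives, mean imputation and dropping incomplete rows. The toy values are made up, and the alternatives are not what this notebook does.

```python
import numpy as np
import pandas as pd

# Made-up temperatures with one missing value per column.
toy = pd.DataFrame({
    "Tx": [13.5, 12.5, np.nan],
    "Tn": [1.0, np.nan, -2.5],
})

# What the notebook does: every NaN becomes 0.0.
zero_filled = np.nan_to_num(toy.to_numpy())

# Alternative 1: replace each NaN with its column mean.
mean_filled = toy.fillna(toy.mean())

# Alternative 2: keep only rows where every feature is present.
complete_rows = toy.dropna()

print(zero_filled)
print(mean_filled)
print(len(complete_rows))
```

Which option is appropriate depends on how many values are missing; zero filling is the simplest, but it can pull stations with missing readings toward each other in feature space.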

In [16]:
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10)
db.fit(Cluster)

DBSCAN(algorithm='auto', eps=0.3, leaf_size=30, metric='euclidean',
    metric_params=None, min_samples=10, n_jobs=None, p=None)

In [17]:
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True

In [18]:
df['Labels'] = db.labels_

In [19]:
df[["Stn_Name","Tx","Tm","Tn","Labels"]].head()

Unnamed: 0,Stn_Name,Tx,Tm,Tn,Labels
0,CHEMAINUS,13.5,8.2,1.0,0
1,COWICHAN LAKE FORESTRY,15.0,7.0,-3.0,0
2,LAKE COWICHAN,16.0,6.8,-2.5,0
4,DUNCAN KELVIN CREEK,14.5,7.7,-1.0,0
5,ESQUIMALT HARBOUR,13.1,8.8,1.9,0


In [20]:
df['Labels'].unique()

array([ 0, -1,  2,  1,  3,  4,  5,  6,  7])
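With temperature included, DBSCAN now returns eight clusters plus noise. To check whether stations in a cluster really share similar conditions, one option is to summarize the temperature per label. A minimal sketch on made-up values (the frame `toy` and its numbers are illustrative only, not taken from the dataset):

```python
import pandas as pd

# Made-up mean temperatures with cluster labels; -1 marks noise.
toy = pd.DataFrame({
    "Tm": [8.2, 7.0, -20.0, -22.0, 3.0],
    "Labels": [0, 0, 1, 1, -1],
})

# Leave noise out, then count stations and average Tm per cluster.
clustered = toy[toy["Labels"] != -1]
summary = clustered.groupby("Labels")["Tm"].agg(["count", "mean"])

print(summary)
```

The same `groupby` applied to `df` with the columns `Tx`, `Tm`, and `Tn` would give a per-cluster temperature profile for the real stations.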

## Visualization of clusters based on location and Temperature

In [21]:
map_can = folium.Map(
    location=[56,-106],
    zoom_start = 4
)

color_map = {0:'red',1:'blue',2:'green',3:'yellow',4:'orange',-1:'gray',
             5:'pink',6:'purple',7:'brown'
            }

for lat,lng,lbl in zip(df['Lat'],df['Long'],df['Labels']):
    label = folium.Popup('Cluster : '+str(lbl),parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        color=color_map[lbl],
        fill=True,
        radius=1,
        popup=label
    ).add_to(map_can)

map_can